The coefficients in this post (and most of the estimates I make in other posts) come from linear regression, specifically ordinary least squares (OLS). Basically, I start off with the idea that, maybe, overall, time has some effect on OBP, since pitching and batting ability don’t improve at the same rate, and that beyond that maybe the DH rule has an effect. Then, I fit a line to the data, and maybe the betas (the coefficients you multiply each variable by to get the predicted value) are significant and maybe they aren’t.

OLS is sort of an unholy abomination of calculus and linear algebra – basically, it takes the multidimensional data (here, OBP over time, plus a yes-or-no “Is there a DH?” variable with yes = 1 and no = 0) and uses calculus to find the function that minimizes the sum of squared errors (the total squared distance between the actual data points and the values the model would predict).
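To make that concrete, here is a minimal sketch in R with made-up numbers standing in for the real OBP data: the normal-equations formula b = (X'X)⁻¹X'y is exactly the calculus solution that minimizes the sum of squared errors.

```r
# Toy illustration (fabricated data, not real OBP figures): compute the
# OLS betas directly from the normal equations, which is the closed-form
# minimizer of the sum of squared errors.
yr  <- 1:10                                # "year" index
DH  <- rep(c(0, 1), each = 5)              # 0/1 designated-hitter dummy
OBP <- 0.320 + 0.002 * yr + 0.010 * DH     # fake response, no noise added

X <- cbind(1, yr, DH)                      # design matrix, intercept first
b <- solve(t(X) %*% X, t(X) %*% OBP)       # the betas: 0.320, 0.002, 0.010

sse <- sum((OBP - X %*% b)^2)              # essentially zero for this toy fit
# Nudging any coefficient away from b can only increase the SSE:
sum((OBP - X %*% (b + c(0.01, 0, 0)))^2) > sse   # TRUE
```

With noisy real data the betas won’t reproduce the generating values exactly, but the same formula still gives the unique least-squares fit.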

As you’d imagine, this is a pretty tedious thing to do by hand, which is why R (and other computer programs) can be so useful – once the data is loaded, it’s as simple as calling a function like “name-I-give-the-model <- lm(OBP ~ t + tsq + DH)”. R generates a table that gives me the estimates for the betas as well as the standard errors (which are functions of, among other things, the amount of data that we have) and the t-values (which are handy measures of how statistically significant each estimate should be).
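For anyone following along, here is a slightly fuller sketch of that workflow; the variable names (t, tsq, DH) match the post, but the data frame is fabricated for illustration.

```r
# Sketch of the lm() workflow described above, on made-up data.
set.seed(1)
dat <- data.frame(t = 1:50)
dat$tsq <- dat$t^2                          # quadratic time term
dat$DH  <- as.numeric(dat$t > 25)           # pretend the DH arrived at t = 26
dat$OBP <- 0.315 + 0.0015 * dat$t - 0.00001 * dat$tsq +
           0.008 * dat$DH + rnorm(50, sd = 0.005)

fit <- lm(OBP ~ t + tsq + DH, data = dat)
summary(fit)        # table of betas, standard errors, and t-values
coef(fit)           # just the point estimates
confint(fit)        # rough 95% intervals, if you want them
```

The `summary(fit)` table is the one the post refers to: one row per beta, with its standard error and t-value alongside.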

There are some drawbacks to OLS, including omitted-variable bias (leaving a relevant variable out of the model can distort the estimates of the coefficients that remain) and the fact that it can be difficult to find good quantitative representations of all the relevant variables. For example, it's probably impossible to quantify strategy preferences like the preference for small ball that you cite.
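A quick simulation (fabricated data, hypothetical effect sizes) shows the omitted-variable problem: when the DH indicator is correlated with time, dropping it from the model shifts part of its effect into the time coefficient.

```r
# Omitted-variable sketch: DH turns on in the later half of the sample,
# so it is correlated with the year index. Dropping DH from the model
# pushes its effect into the time slope.
set.seed(2)
yr  <- 1:100
DH  <- as.numeric(yr > 50)
OBP <- 0.320 + 0.0002 * yr + 0.010 * DH + rnorm(100, sd = 0.004)

full    <- coef(lm(OBP ~ yr + DH))["yr"]   # close to the true 0.0002
omitted <- coef(lm(OBP ~ yr))["yr"]        # absorbs part of the DH effect
omitted > full                             # TRUE: the slope is biased upward
```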

Secondly, I’ve been wanting to learn R, but haven’t yet had a serious project to push me into diving in too deep. Reading “Understanding Sabermetrics” by Costa, Huber and Saccoman, I’ve been thinking that the time may be near to try to figure out how some of the constants were derived, and R may be the tool. But I’d be applying NPB data instead of MLB data.

My main complaint about the book has been that the derivations of most of the constants used in the formulas all happen “off stage” using some sort of magic. Furthermore, concepts like “regressions of β3 = 0” are 20 years behind me, like most of my “maths.” I can figure out most any programming language through example, though, so I would be most interested to see the source code for how you created the above tables.

Thirdly, while your MLB data may cover a shorter time period, NPB has played a shorter season than MLB, especially considering that the first few years of its existence were mainly made up of a few tournaments. So I would guess that the total number of games PER TEAM for NPB from 1937 and MLB from 1955 may be fairly close. But the number of teams is much, much lower, so you are actually getting many fewer plate appearance opportunities from the NPB data.

Also, the two-league system started in 1950. This seems to be when a lot of official record books start taking records seriously, like the Modern Era stats in MLB compared to the Dead Ball Era and such. I would suggest starting from there.

Furthermore, the sacrifice bunt has been a big part of the game in Japan since the Yomiuri Giants visited the Dodgers’ training camp in the 1960s. The emphasis on “small ball” techniques may also play a role in the differences, especially in OBP.

Finally, I have HBP and SF for recent NPB seasons, and with some time, can enter them for seasons going back to 1936 (for HBP) and/or 1939 (when SF started). Seeing your original MLB R code would help in deciding where to concentrate my data-entry efforts. (Please note, I don’t have much free time to do this in a timely manner – but would be interested in getting this done slowly.)
