More on Home Runs Per Game July 9, 2010
Posted by tomflesher in Baseball, Economics. Tags: baseball-reference.com, Japan, R, Rays, regression, replication, Baseball, home runs, Japanese baseball, Chow test
In the previous post, I looked at the trend in home runs per game in the Major Leagues and suggested that the recent deviation from the increasing trend might have been due to the development of strong farm systems like the Tampa Bay Rays’. That means that if the same data analysis process is used on data in an otherwise identical league, we should see similar trends but no dropoff around 1995. As usual, for replication purposes I’m going to use Japan’s Pro Baseball leagues, the Pacific and Central Leagues. They’re ideal because, just like the American Major Leagues, one league uses the designated hitter and one does not. There are some differences – the talent pool is a bit smaller because of the lower population base that the leagues draw from, and there are only 6 teams in each league as opposed to MLB’s 14 and 16.
As a reminder, the MLB regression gave us a regression equation of

$$\widehat{HR} = 0.957 - 0.0188\,t + 0.0004\,t^2 + 0.0911\,DH$$

where $\widehat{HR}$ is the predicted number of home runs per game, t is a time variable starting at t=1 in 1955, and DH is a binary variable that takes value 1 if the league uses the designated hitter in the season in question.
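As a rough illustration (a Python sketch, not the R code behind the post), the fitted MLB equation can be evaluated directly. The coefficients are the rounded values from the MLB results table in the previous post; the function names are made up for this example.

```python
# Rounded coefficients as reported in the MLB results table of the previous post.
MLB_COEF = {"b0": 0.957, "t": -0.0188, "tsq": 0.0004, "dh": 0.0911}


def predicted_hr_per_game(t, dh, coef=MLB_COEF):
    """Evaluate the fitted quadratic-trend equation for time index t and DH dummy dh (0 or 1)."""
    return coef["b0"] + coef["t"] * t + coef["tsq"] * t ** 2 + coef["dh"] * dh


def trend_turning_point(coef=MLB_COEF):
    """Time index at the vertex of the fitted parabola, where the trend stops falling."""
    return -coef["t"] / (2 * coef["tsq"])


print(round(predicted_hr_per_game(20, 0), 3))  # non-DH league at t=20
print(round(trend_turning_point(), 1))         # vertex of the quadratic trend
```

With these rounded coefficients the trend bottoms out near t=23.5, which is in line with the minimum "around t=20" seen in the raw MLB data.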
Just examining the data on home runs per game from the Japanese leagues, the trend looks significantly different. Instead of the rough U-shape that the MLB data showed, the Japanese data looks almost M-shaped with a maximum around 1984. (Why, I’m not sure – I’m not knowledgeable enough about Japanese baseball to know what might have caused that spike.) It reaches a minimum again and then keeps rising.
After running the same regression with t=1 in 1950, I got these results:
| | Estimate | Std. Error | t-value | p-value | Signif |
| B0 | 0.2462 | 0.0992 | 2.481 | 0.0148 | 0.9852 |
| t | 0.0478 | 0.0062 | 7.64 | 1.63E-11 | 1 |
| tsq | -0.0006 | 0.00009 | -7.463 | 3.82E-11 | 1 |
| DH | 0.0052 | 0.0359 | 0.144 | 0.8855 | 0.1145 |
This equation shows two things, one that surprises me and one that doesn’t. The unsurprising factor is the switching of signs for the t variables – we expected that based on the shape of the data. The surprising factor is that the designated hitter rule is insignificant: with a p-value of 0.8855, we can’t reject the hypothesis that the DH has no effect on home runs per game. In addition, this model explains less of the variation than the MLB version – while that explained about 56% of the variation, the Japanese model has an $R^2$ value of .4045, meaning it explains about 40% of the variation in home runs per game.
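To make the significance reading concrete, here is a small Python sketch (an illustration, not the original R output) that recomputes each t-value as estimate divided by standard error and checks the reported p-values against a 5% threshold. The threshold is an assumption for the example. Because the table values are rounded, the recomputed t-values for t and tsq will not match the table exactly.

```python
# Rows copied from the Japanese results table: (estimate, standard error, reported p-value).
JAPAN_ROWS = {
    "B0": (0.2462, 0.0992, 0.0148),
    "t": (0.0478, 0.0062, 1.63e-11),
    "tsq": (-0.0006, 0.00009, 3.82e-11),
    "DH": (0.0052, 0.0359, 0.8855),
}


def t_value(estimate, std_error):
    """t-statistic for the null hypothesis that the coefficient is zero."""
    return estimate / std_error


def significant(p_value, alpha=0.05):
    """True when the reported p-value falls below the chosen threshold (alpha is an assumption)."""
    return p_value < alpha


for name, (est, se, p) in JAPAN_ROWS.items():
    print(name, round(t_value(est, se), 3), significant(p))
```

Only the DH row fails the check, which is the surprising result discussed above.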
There’s a slightly interesting pattern to the residual home runs per game ($HR - \widehat{HR}$). Although it isn’t as pronounced, this data also shows a spike – but the spike is at t=55, so instead of showing up in 1995, the Japan leagues spiked around the early 2000s. Clearly the same effect is not in play, but why might the Japanese leagues see the same effect later than the MLB teams? It can’t be an expansion effect, since the Japanese leagues have stayed constant at 6 teams since their inception.
Incidentally, the Japanese league data is heteroskedastic (Breusch-Pagan test p-value .0796), so it might be better modeled using a generalized least squares formula, but doing so would have skewed the results of the replication.
In order to show that the parameters really are different, the appropriate test is Chow’s test for structural change. To clean it up, I’m using only the data from 1960 on. (It’s quick and dirty, but it’ll do the job.) Chow’s test takes

$$F = \frac{\left(S_C - (S_1 + S_2)\right)/k}{(S_1 + S_2)/(N_1 + N_2 - 2k)}$$

where $S_C$ is the combined sum of squared residuals, $S_1$ and $S_2$ are the individual (i.e. MLB and Japan) sums of squared residuals, $k$ is the number of parameters, and $N_1$ and $N_2$ are the number of observations in each group.
The critical value at the 10% significance level with 4 and 192 degrees of freedom would be 1.974 according to Texas A&M’s F calculator. The test statistic falls short of that, which means we don’t have enough evidence that the parameters are different to treat them differently. This is probably an artifact of the small amount of data we have.
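The Chow statistic is simple enough to compute by hand once the three regressions have been run. The Python sketch below is illustrative only: the sums of squared residuals are made-up numbers, and the group sizes of 100 each are an assumption (50 seasons times 2 leagues for each of MLB and Japan) chosen because they reproduce the 4 and 192 degrees of freedom quoted above.

```python
def chow_f(ssr_combined, ssr_1, ssr_2, k, n_1, n_2):
    """Return the Chow F statistic and its (numerator, denominator) degrees of freedom."""
    ssr_separate = ssr_1 + ssr_2
    df_num = k
    df_den = n_1 + n_2 - 2 * k
    f_stat = ((ssr_combined - ssr_separate) / df_num) / (ssr_separate / df_den)
    return f_stat, (df_num, df_den)


# Hypothetical sums of squared residuals, for illustration only (not the post's values).
f_stat, dof = chow_f(ssr_combined=5.0, ssr_1=2.4, ssr_2=2.5, k=4, n_1=100, n_2=100)

CRITICAL_10_PERCENT = 1.974  # critical value quoted in the text for (4, 192) degrees of freedom
print(dof)
print(f_stat > CRITICAL_10_PERCENT)  # True would mean rejecting equal parameters
```

With these invented inputs the statistic stays below the critical value, so the pooled model would not be rejected, which mirrors the shape of the conclusion in the post.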
Back when it was hard to hit 55… July 8, 2010
Posted by tomflesher in Baseball, Economics. Tags: baseball-reference.com, R, regression, sabermetrics, Stuff Keith Hernandez Says, Baseball, home runs, Year of the Pitcher, Willie Mays, talent pool dilution
Last night was one of those classic Keith Hernandez moments where he started talking and then stopped abruptly, which I always like to assume is because the guys in the truck are telling him to shut the hell up. He was talking about Willie Mays for some reason, and said that Mays hit 55 home runs “back when it was hard to hit 55.” Keith coyly said that, while it was easy for a while, it was “getting hard again,” at which point he abruptly stopped talking.
Keith’s unusual candor about drug use and Mays’ career best of 52 home runs aside, this pinged my “Stuff Keith Hernandez Says” meter. After accounting for any time trend and other factors that might explain home run hitting, is there an upward trend? If so, is there a pattern to the remaining home runs?
The first step is to examine the data to see if there appears to be any trend. Just looking at it, there appears to be a messy U shape with a minimum around t=20, which indicates a quadratic trend. That means I want to include a term for time and a term for time squared.
Using the per-game averages for home runs from 1955 to 2009, I detrended the data using t=1 in 1955. I also had to correct for the effect of the designated hitter. That gives us an equation of the form

$$\widehat{HR} = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 DH$$
The results:
| | Estimate | Std. Error | t-value | p-value | Signif |
| B0 | 0.957 | 0.0328 | 29.189 | 0.0001 | 0.9999 |
| t | -0.0188 | 0.0028 | -6.738 | 0.0001 | 0.9999 |
| tsq | 0.0004 | 0.00005 | 8.599 | 0.0001 | 0.9999 |
| DH | 0.0911 | 0.0246 | 3.706 | 0.0003 | 0.9997 |
We can see that there’s an upward quadratic trend in predicted home runs that, together with the DH rule, accounts for about 56% of the variation in the number of home runs per game in a season ($R^2 \approx 0.56$). The Breusch-Pagan test has a p-value of .1610, indicating a possibility of mild heteroskedasticity but nothing we should get concerned about.
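For readers who want to see the mechanics of the fit, here is a self-contained Python sketch of the same kind of regression: an intercept, t, t squared, and a DH dummy, estimated by ordinary least squares through the normal equations. It is not the R code used for the post. The data are synthetic and noise-free, generated from made-up coefficients, so the only claim is that the procedure recovers whatever coefficients produced the data. The DH dummy assumes the American League adopted the rule in 1973.

```python
def design_row(year, league, first_year=1955, dh_start=1973):
    """Regressors (1, t, t^2, DH) for one league-season, with t = 1 in first_year."""
    t = year - first_year + 1
    dh = 1 if (league == "AL" and year >= dh_start) else 0
    return [1.0, float(t), float(t * t), float(dh)]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic, noise-free data from made-up coefficients (NOT the real home run series).
true_beta = [1.0, -0.02, 0.0005, 0.1]
rows = [design_row(yr, lg) for yr in range(1955, 2010) for lg in ("AL", "NL")]
y = [sum(b * x for b, x in zip(true_beta, r)) for r in rows]

beta_hat = ols(rows, y)
print([round(b, 4) for b in beta_hat])
```

Solving the normal equations directly is fine for four regressors; for anything larger or noisier, a library routine based on a QR decomposition would be the safer choice.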
Then, I needed to look at the difference between the predicted number of home runs per game and the actual number of home runs per game, which is accessible by subtracting the predicted value from the actual one:

$$HR - \widehat{HR}$$
This represents the “abnormal” number of home runs per year. The question then becomes, “Is there a pattern to the number of abnormal home runs?” There are two ways to answer this. The first way is to look at the abnormal home runs. Up until about t=40 (the mid-1990s), the abnormal home runs are pretty much scattershot above and below 0. However, at t=40, the residual jumps up for both leagues and then begins a downward trend. It’s not clear what the cause of this is, but the knee-jerk reaction is that there might be a drug use effect. On the other hand, there are a couple of other explanations.
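The "abnormal" figure is just the regression residual. A minimal Python sketch, using the rounded coefficients from the results table and a made-up observation (the 1.10 home runs per game below is hypothetical, not a real season):

```python
def predicted(t, dh, b0=0.957, b_t=-0.0188, b_tsq=0.0004, b_dh=0.0911):
    """Fitted home runs per game from the rounded coefficients in the results table."""
    return b0 + b_t * t + b_tsq * t * t + b_dh * dh


def abnormal_hr(actual, t, dh):
    """Residual: actual home runs per game minus the fitted value."""
    return actual - predicted(t, dh)


# Hypothetical observation: 1.10 HR per game at t=40 in a league using the DH.
print(round(abnormal_hr(1.10, 40, 1), 4))
```

A positive residual means more home runs than the trend and the DH rule predict; a run of positive residuals starting at one point in time is the kind of jump described above.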
The most obvious is a boring old expansion effect. In 1993, the National League added two teams (the Marlins and the Rockies), and in 1998 each league added a team (the AL’s Rays and the NL’s Diamondbacks). Talent pool dilution has shown up in our discussion of hit batsmen, and I believe that it can be a real effect. It would be mitigated over time, however, by the establishment and development of farm systems, in particular strong systems like the one that’s producing good, cheap talent for the Rays.
Modeling Run Production June 19, 2010
Posted by tomflesher in Baseball, Economics. Tags: Baseball, economics, regression, run production, sports economics
A baseball team can be thought of as a factory which uses a single crew to operate two machines. The first machine produces runs while the team bats, and the second machine produces outs while the team is in the field. This is a somewhat abstract way to look at the process of winning games, because ordinarily machines have a fixed input and a fixed output. In a box factory, the input comprises man-hours and corrugated board, and the output is a finished box. Here, the input isn’t as well-defined.
Runs are a function of total bases, certainly, but total bases are functions of things like hits, home runs, and walks. Basically, runs are a function of getting on base and of advancing people who are already on base. Obviously, the best measure of getting on base is On-Base Percentage, and Slugging Average (expected number of bases per at-bat) is a good measure of advancement.
OBP wraps up a lot of things – walks, hits, and hit-by-pitch appearances – and SLG corrects for the greater effects of doubles, triples, and home runs. That doesn’t account for a few other things, though, like stolen bases, sacrifice flies, and sacrifice hits. It also doesn’t reflect batter ability directly, but that’s okay – the stats we have should represent batter ability since the defensive side is trying to prevent run production. The model might look something like this, then:
This is the simplest model we can start with – each factor contributes a discrete number of runs. If we need to (and we probably will), we can add terms to capture concavity of the marginal effect of different stats, or (more likely) an interaction term for SLG and, say, SB, so that a stolen base is worth more on a team where you’re more likely to be brought home by a batter because he’s more likely to give you extra bases. As it is, however, we can test this model with linear regression. The details of it are behind the cut. (more…)
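The difference between the simple model and one with an interaction term can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the exact regressor list is not shown in this excerpt, so the sketch uses the stats named in the text (OBP, SLG, SB, SF, SH), and the coefficients are invented rather than estimated.

```python
def run_features(obp, slg, sb, sf, sh, interact=False):
    """Regressor row for a linear run model; optionally appends the SLG x SB interaction."""
    row = [1.0, obp, slg, sb, sf, sh]
    if interact:
        row.append(slg * sb)
    return row


def predict_runs(row, coefs):
    """Linear prediction: sum of coefficient times regressor."""
    return sum(c * x for c, x in zip(coefs, row))


def marginal_sb(slg, coefs):
    """Extra predicted runs from one more stolen base at a given team SLG."""
    base = run_features(0.330, slg, 100, 40, 40, interact=True)
    plus = run_features(0.330, slg, 101, 40, 40, interact=True)
    return predict_runs(plus, coefs) - predict_runs(base, coefs)


# Invented coefficients: intercept, OBP, SLG, SB, SF, SH, SLG*SB.
HYPOTHETICAL_COEFS = [0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.5]

print(marginal_sb(0.400, HYPOTHETICAL_COEFS))
print(marginal_sb(0.500, HYPOTHETICAL_COEFS))
```

Without the interaction term a stolen base is worth the same number of runs on every team; with it, the value of a stolen base rises with the team's slugging, which is the effect described above.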
Trends in DH use June 11, 2010
Posted by tomflesher in Baseball, Economics. Tags: Baseball, baseball-reference.com, designated hitter, economics, Interleague play, Mets, regression, sports economics, Stuff Keith Hernandez Says
Last night, Keith Hernandez was talking about how the Mets are scheduled to play in American League parks starting, well, today. He pointed out that the Mets will be in a bit of a pickle because they aren’t built, as AL teams are, to carry one big hitter to be the full-time DH. Instead, an NL team will be forced to spread the wealth among lighter hitters who are carried for their defensive acumen as well as their offensive prowess. Keith then corrected himself and said that AL managers are using the DH differently – to rest individual players instead of having an everyday DH.
That pinged my “Stuff Keith Hernandez says” meter, and so I decided to crunch some numbers and see if that’s true. I interpreted Keith’s statement as implying that the number of designated hitters should be increasing, since managers are moving away from an everyday DH and toward spreading the DH assignments around a bit more. The crunching also needs to account for interleague play, which should obviously increase the number of DHes. So, after controlling for interleague play, does DH use show an increasing trend with time?
The DH Redux: Japan June 7, 2010
Posted by tomflesher in Baseball. Tags: Baseball, baseballguru.com, designated hitter, Japan, NPB, OBP, regression, replication
2 comments
In an earlier post, I analyzed team-level data from Major League Baseball to determine the size of the effect that the Designated Hitter rule has on on-base percentage. The conclusion I came to was that, if the model is properly specified, the effect of the designated hitter rule is about .008 in on-base percentage. If the reasoning was correct, then when there are no other confounding variables, the effect should be similar in size for any other professional league.
Of course, the other major professional league is Nippon Professional Baseball, the major leagues of Japan. Since it produces players at a level similar to MLB, and the other factors are similar – the DH rule was adopted in 1975 by one, but not both, of the two major leagues – NPB is an ideal place to try to test the model I specified in this post.
Does the DH Rule Cause Batters to be Hit? June 2, 2010
Posted by tomflesher in Baseball, Economics. Tags: Baseball, baseball-reference.com, designated hitter, economics, hit by pitch, Kevin Youkilis, regression, sports economics
In an earlier post, I crunched some numbers on the Designated Hitter rule and came to the conclusion that the DH adds about .3 extra trips to first base per game after accounting for trend. I’m going to play around with another stat that a lot of people seem to think should be affected indirectly by the DH rule.
The Conventional Wisdom™ is that the DH should increase hit batsmen. The argument is that pitchers don’t bear the costs of hitting a batter with a pitch because they don’t bat, so they’ll be less careful to avoid hitting a batter or more likely to plunk a batter out of malice. Do the numbers bear that out?
What is the effect of the Designated Hitter? May 30, 2010
Posted by tomflesher in Baseball. Tags: baseball-reference.com, designated hitter, R, regression
2 comments
Intuitively, the designated hitter rule seems like it should increase scoring. By getting on base more often than the pitcher would have, the designated hitter helps produce runs by hitting, by being on base so that other players can drive him in, and by not accumulating outs by bunting or striking out as often as the pitcher does. However, there should be a corresponding effect from having pitchers left in the game longer: a better pitcher who remains in the game might get more outs than a reliever who came in simply because the manager pinch-hit for the starting pitcher because he needed offense.
Behind the cut, I’ll explain the testing I did to determine whether the effect of a DH is positive (hint: it is) and look at how big an effect is actually there.
Quickie: MLB Playoffs by Pitching Statistics February 23, 2010
Posted by tomflesher in Baseball. Tags: Baseball, OLS, playoffs, probit, regression
It’s cold out today. Last night, Buffalo was covered in a thin layer of freezing rain. I’m trying to stay warm by turning up my hot stove the way only an economist can – crunching the numbers on playoffs.
I’m re-using the dataset from my Cy Young Predictor a few entries ago in the interest of parsimony. It contains dummy variables teamdivwin and teamwildcard which take value 1 if the pitcher’s team won the division or the wildcard respectively. I then created a variable playoffs which took the value of the sum of teamdivwin and teamwildcard – just a playoff dummy variable.
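Summing the two dummies works as a playoff indicator because a team cannot be both a division winner and a wildcard in the same season, so the sum is always 0 or 1. A small Python sketch of that construction (the function name and the validity check are additions for this example; only the variable names teamdivwin and teamwildcard come from the dataset described above):

```python
def playoffs_dummy(teamdivwin, teamwildcard):
    """1 if the pitcher's team reached the playoffs by either route, else 0."""
    total = teamdivwin + teamwildcard
    if total not in (0, 1):
        # Both flags set (or a non-dummy value) would signal a data error.
        raise ValueError("teamdivwin and teamwildcard must be 0/1 and mutually exclusive")
    return total


print(playoffs_dummy(1, 0))  # division winner
print(playoffs_dummy(0, 1))  # wildcard team
print(playoffs_dummy(0, 0))  # missed the playoffs
```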
Using a Probit model and a standard OLS regression model, I estimated the effects of individual pitching stats on playoffs. Neither model has very strong predictive value (linear has R-squared of about .05), which is unsurprising since it doesn’t take the team’s batting into account at all. None of the coefficient values are shocking – in the American League (designated as lg = 1), teams have a higher probability of making the playoffs because there are fewer teams, and although complete games appear to have a negative effect, the positive shutout effect more than makes up for that in both models. I’m interested in whether complete game wins and complete game losses have differential effects – that will probably be my next snowy-day project.
Results are behind the cut.