More on Home Runs Per Game July 9, 2010
Posted by tomflesher in Baseball, Economics. Tags: baseball-reference.com, Japan, R, Rays, regression, replication, Baseball, home runs, Japanese baseball, Chow test
In the previous post, I looked at the trend in home runs per game in the Major Leagues and suggested that the recent deviation from the increasing trend might have been due to the development of strong farm systems like the Tampa Bay Rays’. That means that if the same data analysis process is used on data in an otherwise identical league, we should see similar trends but no dropoff around 1995. As usual, for replication purposes I’m going to use Japan’s Pro Baseball leagues, the Pacific and Central Leagues. They’re ideal because, just like the American Major Leagues, one league uses the designated hitter and one does not. There are some differences – the talent pool is a bit smaller because of the lower population base that the leagues draw from, and there are only 6 teams in each league as opposed to MLB’s 14 and 16.
As a reminder, the MLB regression gave us a regression equation of the form
$$\widehat{HR} = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 DH$$
where $\widehat{HR}$ is the predicted number of home runs per game, t is a time variable starting at t=1 in 1954, and DH is a binary variable that takes value 1 if the league uses the designated hitter in the season in question.
Just examining the data on home runs per game from the Japanese leagues, the trend looks significantly different. Instead of the rough U-shape that the MLB data showed, the Japanese data looks almost M-shaped with a maximum around 1984. (Why, I’m not sure – I’m not knowledgeable enough about Japanese baseball to know what might have caused that spike.) It reaches a minimum again and then keeps rising.
After running the same regression with t=1 in 1950, I got these results:
| | Estimate | Std. Error | t-value | p-value | Signif (1 − p) |
| B0 | 0.2462 | 0.0992 | 2.481 | 0.0148 | 0.9852 |
| t | 0.0478 | 0.0062 | 7.64 | 1.63E-11 | 1 |
| tsq | -0.0006 | 0.00009 | -7.463 | 3.82E-11 | 1 |
| DH | 0.0052 | 0.0359 | 0.144 | 0.8855 | 0.1145 |
This equation shows two things, one that surprises me and one that doesn’t. The unsurprising factor is the switching of signs for the t variables – we expected that based on the shape of the data. The surprising factor is that the designated hitter rule is insignificant: with a p-value of 0.8855, we can’t reject the hypothesis that the DH has no effect on home runs per game. In addition, this model explains less of the variation than the MLB version – while that explained about 56% of the variation, the Japanese model has an $R^2$ value of .4045, meaning it explains about 40% of the variation in home runs per game.
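To make the fitted curve concrete, here is a minimal Python sketch that plugs the rounded coefficients from the table into the quadratic model. The function and constant names are illustrative assumptions, not the code used for the original analysis (which isn't shown). Because the tsq coefficient is rounded to a single significant digit, the implied peak year is only approximate.

```python
# Rounded coefficients copied from the regression table (Japanese leagues).
B0, B_T, B_TSQ, B_DH = 0.2462, 0.0478, -0.0006, 0.0052
BASE_YEAR = 1950  # t = 1 corresponds to 1950


def predict_hr_per_game(t, dh):
    """Predicted home runs per game for time index t and DH indicator (0 or 1)."""
    return B0 + B_T * t + B_TSQ * t ** 2 + B_DH * dh


def peak_time_index():
    """Vertex of the fitted parabola: the t at which the quadratic trend tops out."""
    return -B_T / (2 * B_TSQ)


if __name__ == "__main__":
    t_peak = peak_time_index()
    print(f"fitted trend peaks near t = {t_peak:.1f} "
          f"(about {BASE_YEAR + t_peak - 1:.0f})")
    for year in (1950, 1984, 2009):
        t = year - BASE_YEAR + 1
        print(year,
              round(predict_hr_per_game(t, dh=0), 3),
              round(predict_hr_per_game(t, dh=1), 3))
```

Since the coefficient on tsq is negative, the fitted curve is an inverted parabola, and the DH term shifts it by only 0.0052 home runs per game, which is tiny next to its standard error of 0.0359.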
There’s a slightly interesting pattern to the residual home runs per game (actual minus predicted, $HR - \widehat{HR}$). Although it isn’t as pronounced, this data also shows a spike – but the spike is at t=55, so instead of showing up in 1995, the Japan leagues spiked in the early 2000s (t=55 corresponds to 2004). The timing doesn’t match, so it isn’t clear that the same cause is in play, but why might the Japanese leagues see a similar effect later than the MLB teams? It can’t be an expansion effect, since the Japanese leagues have stayed constant at 6 teams since their inception.
Incidentally, the Japanese league data shows some evidence of heteroskedasticity (Breusch-Pagan test p-value .0796, significant at the 10% level but not at 5%), so it might be better modeled using generalized least squares, but doing so would have skewed the results of the replication.
In order to show that the parameters really are different, the appropriate test is Chow’s test for structural change. To clean it up, I’m using only the data from 1960 on. (It’s quick and dirty, but it’ll do the job.) Chow’s test statistic takes the form
$$F = \frac{\left(S_C - (S_1 + S_2)\right)/k}{(S_1 + S_2)/(N_1 + N_2 - 2k)}$$
where $S_C$ is the combined sum of squared residuals, $S_1$ and $S_2$ are the individual (i.e. MLB and Japan) sums of squared residuals, $k$ is the number of parameters, and $N_1$ and $N_2$ are the number of observations in each group.
The critical value at the 10% significance level (90% confidence) with 4 and 192 degrees of freedom would be 1.974 according to Texas A&M’s F calculator. The test statistic doesn’t exceed that, which means we don’t have enough evidence that the parameters are different to treat them differently. This is probably an artifact of the small amount of data we have.
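As a rough illustration of the arithmetic behind Chow's test, the sketch below computes the F statistic and its degrees of freedom from the three sums of squared residuals. The function name is hypothetical and the residual sums in the usage example are made-up placeholders, not the values from this analysis. The group sizes of 100 each are an assumption, chosen because two leagues over the 50 seasons from 1960 to 2009 match the 192 denominator degrees of freedom quoted in the text.

```python
def chow_statistic(ssr_combined, ssr_1, ssr_2, k, n_1, n_2):
    """Chow test F statistic for a structural break between two groups.

    ssr_combined: sum of squared residuals from the pooled regression
    ssr_1, ssr_2: sums of squared residuals from the two separate regressions
    k:            number of estimated parameters (including the intercept)
    n_1, n_2:     number of observations in each group
    Returns (F, numerator df, denominator df).
    """
    ssr_separate = ssr_1 + ssr_2
    df_num = k
    df_den = n_1 + n_2 - 2 * k
    f_stat = ((ssr_combined - ssr_separate) / df_num) / (ssr_separate / df_den)
    return f_stat, df_num, df_den


if __name__ == "__main__":
    CRITICAL_10_PCT = 1.974  # critical value quoted in the post for (4, 192) df
    # Placeholder residual sums, for illustration only.
    f_stat, df_num, df_den = chow_statistic(
        ssr_combined=10.0, ssr_1=4.0, ssr_2=5.0, k=4, n_1=100, n_2=100)
    print(f"F({df_num}, {df_den}) = {f_stat:.3f}")
    print("reject equal parameters" if f_stat > CRITICAL_10_PCT
          else "cannot reject equal parameters")
```

Passing the residual sums in directly keeps the sketch independent of whatever package fit the regressions.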
Edwin Jackson, Fourth No-Hitter of 2010 June 25, 2010
Posted by tomflesher in Baseball, Economics. Tags: baseball-reference.com, BayesBall, Dallas Braden, Diamondbacks, Edwin Jackson, no-hitters, poisson distribution, Rays, Roy Halladay, Ubaldo Jimenez
Tonight, Edwin Jackson of the Arizona Diamondbacks pitched a no-hitter against the Tampa Bay Rays. That’s the fourth no-hitter of this year, following Ubaldo Jimenez and the perfect games by Dallas Braden and Roy Halladay.
Two questions come to mind immediately:
- How likely is a season with 4 no-hitters?
- Does this mean we’re on pace for a lot more?
The second question is pretty easy to dispense with. Taking a look at the list of all no-hitters (which interestingly enough includes several losses), it’s hard to predict a pattern. No-hitters aren’t uniformly distributed over time, so saying that we’ve had 4 no-hitters in x games doesn’t tell us anything meaningful about a pace.
The first is a bit more interesting. I’m interested in the frequency of no-hitters, so I’m going to take a look at the list of frequencies here and take a page from Martin over at BayesBall in using the Poisson distribution to figure out whether this is something we can expect.
The Poisson distribution takes the form
$$p(n) = \frac{\lambda^n e^{-\lambda}}{n!}$$
where $\lambda$ is the expected number of occurrences and we want to know how likely it would be to have $n$ occurrences based on that.
Using Martin’s numbers – 201506 opportunities for no-hitters and an average of 4112 games per season from 1961 to 2009 – I looked at the number of no-hitters since 1961 (120) and determined that an average season should return about 2.44876 no-hitters. That means
$$\lambda = \frac{120}{201506} \times 4112 \approx 2.44876$$
and
$$p(n) = \frac{2.44876^n e^{-2.44876}}{n!}$$
Above is the distribution. p is the probability of exactly n no-hitters being thrown in a single season of 4112 games; cdf is the cumulative probability, or the probability of n or fewer no-hitters; p49 is the predicted number of seasons out of 49 (1961-2009) that we would expect to have n no-hitters; obs is the observed number of seasons with n no-hitters; cp49 is the predicted number of seasons with n or fewer no-hitters; and cobs is the observed number of seasons with n or fewer no-hitters.
It’s clear that 4 or even 5 no-hitters is a perfectly reasonable number to expect.
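The columns described above (p, cdf, p49 and cp49) can be regenerated with a short Python sketch like the one below. It uses only the inputs quoted in the post. The observed columns (obs, cobs) are not reproduced because the underlying season counts aren't given here, and the constant and function names are illustrative assumptions.

```python
import math

NO_HITTERS = 120          # no-hitters since 1961, from the post
OPPORTUNITIES = 201506    # opportunities for no-hitters, from the post
GAMES_PER_SEASON = 4112   # average games per season, 1961-2009
SEASONS = 49              # 1961 through 2009

# Expected no-hitters in an average season.
LAMBDA = NO_HITTERS / OPPORTUNITIES * GAMES_PER_SEASON


def poisson_pmf(n, lam):
    """Probability of exactly n occurrences when lam are expected."""
    return lam ** n * math.exp(-lam) / math.factorial(n)


def poisson_cdf(n, lam):
    """Probability of n or fewer occurrences."""
    return sum(poisson_pmf(i, lam) for i in range(n + 1))


if __name__ == "__main__":
    print(f"lambda = {LAMBDA:.5f}")
    print(" n       p     cdf    p49   cp49")
    for n in range(8):
        p = poisson_pmf(n, LAMBDA)
        cdf = poisson_cdf(n, LAMBDA)
        print(f"{n:2d} {p:7.4f} {cdf:7.4f} {SEASONS * p:6.2f} {SEASONS * cdf:6.2f}")
```

With these inputs, a season with exactly 4 no-hitters comes out at roughly a 13% probability, which is why it isn't a surprising count.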
June 15 Wins Above Expectation June 16, 2010
Posted by tomflesher in Baseball. Tags: Angels, Baseball, Rays, Tigers, wins above expectation
Wins Above Expectation is a statistic determined using team wins and the Pythagorean expectation, which is in turn determined using runs scored by and against each team. The Pythagorean expectation is the ratio of runs scored squared to the sum of runs scored squared and runs against squared. It’s interpreted as an expected winning percentage.
Wins Above Expectation (WAE) is then the difference between Wins and Expected Wins, which are simply the Pythagorean Expectation multiplied by Games played. It’s a useful measure because it can be interpreted as wins that are due to efficiency (in economic terms) or, more simply, play that’s some combination of smart, clutch, and non-wasteful. It rewards winning close games and penalizes teams that win lots of laughers but lose close games, since the Pythagorean expectation counts the runs piled up in a blowout as if they could have won several games, when in fact they were all spent winning only one.
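The two definitions above translate directly into code. This is a minimal Python sketch with hypothetical function names, and the team line in the usage example is invented for illustration rather than taken from the June 15 standings.

```python
def pythagorean_expectation(runs_scored, runs_allowed):
    """Expected winning percentage: RS^2 / (RS^2 + RA^2)."""
    rs_sq = runs_scored ** 2
    ra_sq = runs_allowed ** 2
    return rs_sq / (rs_sq + ra_sq)


def wins_above_expectation(wins, games, runs_scored, runs_allowed):
    """Actual wins minus expected wins (Pythagorean expectation times games)."""
    expected_wins = pythagorean_expectation(runs_scored, runs_allowed) * games
    return wins - expected_wins


if __name__ == "__main__":
    # Invented example: 36 wins in 64 games, 300 runs scored, 280 allowed.
    wae = wins_above_expectation(wins=36, games=64,
                                 runs_scored=300, runs_allowed=280)
    print(f"WAE = {wae:+.2f}")
```

A positive WAE means the team has won more games than its run totals alone would predict, and a negative one means it has won fewer.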
Using Baseball-Reference.com, I crunched the numbers for AL teams up to June 15. As usual, the Los Angeles Angels of Anaheim lead the league in WAE with 3.68, with Detroit’s 2.39 a close second, but the Tampa Bay Rays are a surprising last with -1.96 WAE. Obviously, this early in the season it’s too soon to conclude anything based on this, but the complete data is behind the cut.
So why doesn't Nick Swisher pitch every night? April 15, 2009
Posted by tomflesher in Baseball. Tags: Cardinals, Cody Ransom, comparative advantage, Economics haiku, emergency relievers, Gabe Kapler, Joe Girardi, market for pitchers, Moneyball alumni, Nick Swisher, position players pitching, Rays, Scott Spiezio, Wade Boggs, Yankees
Nick Swisher pitched for the first time in the major leagues on Monday night during the Yankees’ 15-5 loss to the Tampa Bay Rays. As you can see from the box score, Swish pitched pretty well. In fact, in 22 pitches, he gave up only one hit and one walk, threw 12 strikes, and struck out a major-league batter (left-fielder Gabe Kapler). So, will Yankees manager Joe Girardi tap him in relief again soon?
No, of course not. Find out why behind the cut.
Statistical evidence that the Rays are outclassed. October 27, 2008
Posted by tomflesher in Baseball. Tags: Baseball, Phillies, Rays
The series thus far.
Q.E.D.
Poor Kazmir. October 17, 2008
Posted by tomflesher in Baseball. Tags: ALCS, Baseball, Cy Young, John Smoltz, Mike Mussina, Rays, Red Sox, Scott Kazmir, weird lines
Last night, Scott Kazmir pitched 6 scoreless innings in ALCS game 5, giving up 2 hits and 3 walks but striking out 7 batters. That adds up to a game score of 72. His bullpen then proceeded to give up 8 runs, allowing the Red Sox to come back and win the game (thus extending the series to game 6).
Has Scotty suffered the greatest postseason indignity ever? Nope. Not even close. That honor belongs to Mike Mussina of the 1997 Orioles.
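For readers who want to check the game score quoted for Kazmir, here is a small Python sketch. It assumes the standard Bill James game score formula, which the post itself doesn't spell out: start at 50, add 1 per out, 2 per inning completed after the fourth, and 1 per strikeout, then subtract 2 per hit, 4 per earned run, 2 per unearned run, and 1 per walk. The function name is hypothetical.

```python
def game_score(outs, strikeouts, hits, walks, earned_runs=0, unearned_runs=0):
    """Bill James game score for a starting pitcher's line."""
    innings_completed = outs // 3
    late_inning_bonus = 2 * max(0, innings_completed - 4)
    return (50 + outs + late_inning_bonus + strikeouts
            - 2 * hits - 4 * earned_runs - 2 * unearned_runs - walks)


if __name__ == "__main__":
    # Kazmir's line from the post: 6 scoreless innings, 2 hits, 3 walks, 7 strikeouts.
    print(game_score(outs=18, strikeouts=7, hits=2, walks=3))  # prints 72
```

Under that formula, his line works out to 50 + 18 + 4 + 7 - 4 - 3 = 72, matching the figure above.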
Wins Above Expectation (with a side of run differential) September 1, 2008
Posted by tomflesher in Baseball. Tags: Angels, Baseball, Blue Jays, Rays, Research, sabermetrics
In continuing my thoughts about the Pythagorean Expectation from about a week ago, I took a look at the MLB standings for the period ending August 31, 2008. I played with the stats a little bit, since I haven’t really thought through the basis for most of them.
Today’s project: find Pythagorean expectations for each team, then find the difference between the actual and expected win percentages (“pythagorean difference”). Multiply the pythagorean difference by the total number of games played to get a team’s Wins Above Expectation.
Practical application: none.
Discussion and numbers behind the cut.