Friday, November 6, 2009

Closers in non-save situations

Closers in baseball today are specialists who get psyched up to come to the mound in the 9th inning and protect a small lead. The top performers are quite effective at this, but I've often heard the pundits and fellow fantasy baseball enthusiasts complain that managers bring the closers into non-save situations and screw up their ratios (ERA and WHIP). The theory is that when the team is already up by a few runs, or is losing, the closer can't have the same intensity. I've always been a bit skeptical about this; it sounds a lot like the clutch hitting myth, so today I'm going to investigate.

I've seen a couple of articles (here and here) that have approached this problem in a heuristic way, and only for one closer. I picked the first 6 closers off the top of my head - Trevor Hoffman, Bobby Jenks, Joe Nathan, Jonathan Papelbon, Mariano Rivera, and Billy Wagner - and I'll do some inference to see whether any of them pitch significantly better in save situations. I've limited the analyses to seasons when they were full-time closers.

The incredible Baseball-Reference has split stats for every pitcher for every year, and this includes all major performance indicators in save situations and in non-save situations, so getting the data into the right format was easy. The following graphs show the FIP and ERA for the 6 closers with the red indicating save situations and the blue indicating non-save situations. I threw out a partial season for both Hoffman and Wagner (mostly to make the graphs look nicer).
I thought if there was a difference in performance between the two situations that there could be a time trend - perhaps the veteran closer has learned how to perform in the non-save situations better than the raw, young closer. But these graphs show no obvious interaction between time and situation for any closer, so to perform my statistical analyses I aggregated the stats across all years.

For each player/situation combination I got estimates for the probability of a home run (HR), walk or hit by pitch (BB), strikeout (K), and of another type of out (out). Dividing the numerator and denominator of the usual FIP formula by PA, we see that FIP = 3(13P(HR) + 3P(BB) - 2P(K))/(P(K)+P(out)) + 3.2. And since it's written as a function of the probabilities and we know the covariance of multinomial probabilities, the delta method can be applied to get the variance of each FIP estimate.
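As a sketch of that calculation (using a made-up stat line, not any of these closers' actual numbers), the FIP point estimate and its delta-method variance can be computed like this:

```python
def fip_from_probs(p_hr, p_bb, p_k, p_out):
    # FIP rewritten per plate appearance: multiply the numerator and
    # denominator of (13HR + 3BB - 2K)/IP by 3/PA, since 3*IP = K + out.
    return 3 * (13 * p_hr + 3 * p_bb - 2 * p_k) / (p_k + p_out) + 3.2

def fip_and_var(counts):
    """FIP estimate and delta-method variance from multinomial counts.

    counts needs keys 'hr', 'bb', 'k', 'out', 'hit' (hits in play keep
    the per-PA probabilities summing to 1 but don't enter FIP).
    """
    n = sum(counts.values())
    keys = ['hr', 'bb', 'k', 'out']
    p = [counts[key] / n for key in keys]
    fip = fip_from_probs(*p)

    # Numerical gradient of FIP with respect to the four probabilities.
    eps = 1e-6
    grad = []
    for i in range(4):
        hi, lo = p.copy(), p.copy()
        hi[i] += eps
        lo[i] -= eps
        grad.append((fip_from_probs(*hi) - fip_from_probs(*lo)) / (2 * eps))

    # Multinomial covariance: Cov(p_i, p_j) = (p_i*[i==j] - p_i*p_j)/n.
    var = sum(grad[i] * ((p[i] * (i == j) - p[i] * p[j]) / n) * grad[j]
              for i in range(4) for j in range(4))
    return fip, var

# Hypothetical closer season over 300 PA (invented numbers):
fip, var = fip_and_var({'hr': 5, 'bb': 20, 'k': 80, 'out': 150, 'hit': 45})
```

The gradient here is numerical for brevity; the closed-form derivatives work just as well since FIP is a smooth function of the probabilities.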

The following table shows the FIP estimates for save and non-save situations, along with the p-value - assuming approximate normality of the estimates - for the one-sided test of whether FIP(SV) is less than FIP(NSV).

Pitcher   FIP(SV)  FIP(NSV)  p-value
Hoffman   2.83     3.30      0.08
Jenks     3.31     3.57      0.33
Nathan    2.67     2.14      0.90
Papelbon  2.51     2.34      0.63
Rivera    2.71     3.06      0.12
Wagner    2.75     2.63      0.64

We can see that three of the pitchers actually pitch better in non-save situations, and even without controlling for multiple testing, no pitcher is significantly better in save situations. That doesn't mean this will hold for every closer (although I suspect any significant differences would be false positives), but for these 6 the myth has been broken!
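For completeness, here is a sketch of the one-sided normal test behind each row of the table. The variances are invented for illustration, not the actual fitted ones:

```python
import math
from statistics import NormalDist

def p_value_sv_better(fip_sv, var_sv, fip_nsv, var_nsv):
    # One-sided test of FIP(SV) < FIP(NSV), treating the two delta-method
    # estimates as independent and approximately normal.
    z = (fip_sv - fip_nsv) / math.sqrt(var_sv + var_nsv)
    return NormalDist().cdf(z)

# A Hoffman-sized gap (2.83 vs 3.30) with invented variances of 0.06 each:
p = p_value_sv_better(2.83, 0.06, 3.30, 0.06)
```

A small p-value would indicate the closer really is better in save situations; none of the six clears that bar.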

The data used here were obtained from Baseball-Reference.

Sunday, October 18, 2009

Score of the game and pitcher performance

Continuing my previous post, I will now briefly analyze the association between the score of the game and some pitching stats. I certainly expected to see some statistical significance here, as I used more data - a pitcher with a big lead will probably throw more strikes to limit the probability of a big inning, giving up more home runs and fewer walks. I was interested in seeing just how big this effect is. Perhaps the FIP for pitchers with a big lead is actually lower than for those with a smaller lead, and pitchers should consider throwing more strikes all the time? Anyway, I'm sure a more detailed study of this, and of the win probabilities associated with different pitching strategies, has already been done.

I fit binomial logistic models (one each for HR, BB, and K), accounting for the team at bat, the pitcher, and the number of pitches in the previous inning. The remaining predictor was either lead^2 or just lead for the team pitching. With lead^2, the coefficients and p-values were:

     coefficient   p-value
HR    0.0032       8.53e-05 ***
BB   -0.0027       1.14e-05 ***
K    -0.0011       0.017213 *

Lead^2 is significant for HR, BB, and K, but each coefficient translates to less than half a percent multiplicative change in the odds of the associated counting statistic for each one-unit increase in lead^2. There are indeed more HR and fewer BB (and fewer K) when the game is not close, but for every 200 HR you see with lead^2 = x, you'll see less than one extra HR at lead^2 = x+1. This works out to roughly 8% more home runs in 5-run games than in tie games.
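The arithmetic behind those percentages can be checked directly from the fitted HR coefficient (and since a home run is a rare event in any single PA, the odds ratio is close to the ratio of HR rates):

```python
import math

b_hr = 0.0032  # lead^2 coefficient from the HR model above

# Odds multiplier per one-unit increase in lead^2:
per_unit = math.exp(b_hr)  # just over 1.003

# 5-run game vs. tie game: lead^2 goes from 0 to 25.
five_run_vs_tie = math.exp(b_hr * 25)  # roughly 1.08
```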

I fit the same model with lead instead of lead^2, and the effects were in the same direction but not as large, and the lead effect in the K model was not significant at all. This implies that pitchers on both sides throw more strikes when the game isn't close.

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "".

Wednesday, September 2, 2009

Does one long half inning lead to another?

I've sometimes heard announcers say that while a pitcher loves watching his team rally, he gets tight sitting on the bench for so long, and his performance in the next half inning may suffer. This mildly relates to my previous post about a long 7th inning stretch not affecting the pitcher. But a long rally could mean 15, 20, even 25 minutes on the bench for a pitcher, so maybe that's enough time for his performance in the next half inning to suffer?

My idea to answer this question was to fit a multinomial logistic regression with the levels of the response being the necessary counts to estimate FIP: HR, BB+HBP, K, other outs; and the predictors being team at bat, pitcher, stadium, and pitches in the previous inning. The data from Retrosheet doesn't have the time of each half inning, but I figure pitches thrown is going to be highly correlated with actual time, and will suffice. Once the parameters in the multinomial logistic regression are estimated, one could easily estimate FIP=(13HR+3(BB+HBP)-2K)/IP for different values of pitches in the previous inning (for the average pitcher and average team at bat) and use the multivariate delta method to find the variance of each FIP estimate.

I didn't want to use linear regression with half-inning FIP as the response because it would essentially be categorical. Moreover, sometimes IP=0, so FIP isn't even defined for every half inning and I couldn't model it directly anyway - and modeling some transformation of it would limit the interpretability of my results for those who want to see FIP itself.

I think the multinomial regression is necessary rather than a few binomial regressions because to estimate the variance of the FIP estimates, we need to take into account the covariance in the parameter estimates (any glm fit in R will spit out a correlation matrix of parameter estimates upon request). If using only binomial regressions, we can't estimate the correlation between the coefficient for pitches in the previous inning in the HR/PA model and the corresponding coefficient in the BB/PA model, for example.

Unfortunately, the multinomial regression functions available in R don't seem up to the task. (Let me know if I'm missing something.) vglm in the VGAM package wants the data in binary form, which makes the already large data set (with one row per inning) even larger, requiring a separate row for each PA. R then runs out of memory during the fit - at least when I'm using all the PA from 2007-2008. The other function I found, multinom in the nnet package, doesn't seem to compute the covariance matrix.

But to answer today's question, it turns out one doesn't really need to look at the FIP. After some painful character string manipulation to parse the data, I fit three separate binomial logistic regressions - for HR, BB, and K - and in none of them is the coefficient for previous pitches even close to significant, nor is any estimate practically different from zero. The coefficients and corresponding p-values:

     coefficient   p-value
HR   -0.0026       0.14
BB    0.0012       0.26
K    -0.0003       0.70
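To put those coefficients in practical terms, here is a quick sketch of the implied odds shift for a long versus a short previous half inning, using the (non-significant) HR point estimate:

```python
import math

b_hr = -0.0026  # previous-pitches coefficient from the HR model above

# A 40-pitch rally vs. a quick 10-pitch half inning:
odds_shift = math.exp(b_hr * (40 - 10))  # about 0.92
```

Even taking the point estimate at face value, a marathon rally moves the HR odds by well under 10%, and the confidence interval comfortably covers no effect at all.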

So it looks like length of the previous inning does not have any effect on the pitcher, and he should be wholeheartedly cheering for his team to score runs - even if he's mostly interested in his own sabermetric stats.

Next time, I'll use similar code to analyze the relationship between the score of the game and HR, BB, K allowed. It looks like there are some significant differences there.

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "".

Thursday, August 13, 2009

Fielding Independent Pitching

I was working on a longer post about how a pitcher's performance might be affected by the score of the game, or how long the previous half inning took, and I realized that the FIP formula developed by Tom Tango was a bit counter-intuitive to me. I know there are more complex formulas that can perform better, but as a quick predictor of future ERA, FIP seems to be the popular choice. What puzzled me was the use of IP rather than plate appearances (PA) - I thought using IP would bring luck back into the equation, and hence predict lower ERAs for pitchers who happened to have a low BABIP in the preceding season.

So I did some quick analyses using data from 1995-2008. I fit a weighted (based on IP) multiple linear regression model with season ERA as the response and the previous season's HR/IP, (BB+HBP)/IP, and K/IP as predictors. I limited the analysis to pitchers who had pitched at least 130 innings in both the predictor year and the response year. I'm sure the results wouldn't change much if I didn't use the innings quota, but I wanted to focus on starting pitchers because that's the group my next post will be about. The multiple regression assumptions of homoscedasticity and linearity appear to be reasonable:
The estimates (all highly significant) are:

Rounding off to the nearest integer, my formula agrees with FIP. Then I fit the same model, but with HR/PA, (BB+HBP)/PA, and K/PA (using PA as the weight), and got:

The higher R^2 for the innings pitched model tells us that the IP-based model does a better job of predicting the subsequent season's ERA. Removing balls in play from the regression equation (by switching the denominator to PA) reduced the quality of the fit, so this is evidence that pitchers do have some control over BABIP.
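To show the mechanics of the weighted fit, here is a sketch on synthetic data (invented seasons, not the real FanGraphs sample; numpy assumed). The data are generated from a FIP-like relationship, and weighted least squares recovers coefficients near (3.2, 13, 3, -2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic starter seasons (invented, not the 1995-2008 FanGraphs data):
n = 400
ip = rng.uniform(130, 240, n)          # innings pitched -> regression weights
hr_ip = rng.normal(1.0, 0.25, n) / 9   # HR/IP
bb_ip = rng.normal(3.0, 0.8, n) / 9    # (BB+HBP)/IP
k_ip = rng.normal(6.5, 1.5, n) / 9     # K/IP

# Next-season ERA generated from a FIP-like relationship plus noise:
era = 13 * hr_ip + 3 * bb_ip - 2 * k_ip + 3.2 + rng.normal(0, 0.6, n)

# Weighted least squares: beta = (X'WX)^{-1} X'Wy with W = diag(ip).
X = np.column_stack([np.ones(n), hr_ip, bb_ip, k_ip])
lhs = X.T @ (ip[:, None] * X)
rhs = X.T @ (ip * era)
beta = np.linalg.solve(lhs, rhs)  # roughly (3.2, 13, 3, -2)
```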

The data used here were obtained from FanGraphs.

Sunday, August 2, 2009

Yankees icing opposing pitchers?

Ever since September 11, 2001, "God Bless America" has been played at the 7th inning stretch of all games at Yankee Stadium, slightly delaying the start of the bottom of the 7th. Other stadiums reserve playing the song for special occasions such as Opening Day, Memorial Day, the 4th of July, and 9/11.

I've often read that this extended break before the bottom of the 7th "ices" the opposing pitcher, much like a timeout before a field goal attempt supposedly distresses a placekicker. This break gets even longer during the postseason when they trot out Dr. Ronan Tynan to sing an extended live version, making the visiting pitcher - and everyone else - stand still (they used to really frown on movement during the song!) while Dr. Tynan sings. I would think this delay might affect a new pitcher coming in from the bullpen more than it would affect the old one coming from the dugout - because the guy in the dugout already sat still for the top of the 7th inning. But either way, the claim is that the Yankees seem to do quite well in the bottom of the 7th at home, especially in the playoffs, since 9/11/2001.

I've seen a couple analyses confirming this effect (I won't name names) that don't really seem to get at the issue. I think the way to simply test for a "God Bless America" effect is to take into account the dynamic annual strength of the Yankee offense, and to limit the analysis to Yankee Stadium games in order to control for park effect. A more complicated analysis could of course take into account opposing pitching strength, injuries, etc.

Rather than looking directly at runs, it would be preferable to look at a sabermetric statistic like base runs, but I'm going to be lazy and look at runs directly. The number of runs the Yankees score in the bottom of the 7th inning of a given game is a count, so it's natural to model it with the Poisson distribution, incorporating an offset for the strength of the Yankee offense. I'll estimate the strength of the offense as the number of runs they score during the whole regular season at home; this actually has a bigger range than I thought, going from 520 runs in 2007 to just 412 runs in 2008 - no doubt partly due to A-Rod's injury. I'll start by considering just the regular season games played between 1995 and 2008. The required quasi-Poisson regression model is:

log(E(runs)) = a + log(strength) + b*GBA,

where E(runs) denotes the expected value (or mean) of 7th inning runs, a is an intercept, and GBA is 1 if the game took place after 9/11/2001, and 0 if before. The results (from R) for b are:
Estimate p-value
-0.002675 0.98

(Dispersion parameter for quasipoisson family taken to be 2.29)
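Incidentally, for a Poisson log-link model with an offset, an intercept, and a single 0/1 covariate, the point estimate of b has a closed form: it's the log ratio of the two groups' observed-to-offset rates, since the MLE fits each group's rate exactly. A sketch with invented game counts:

```python
import math

# Invented (7th-inning runs, season home-run-scoring strength) pairs:
pre = [(1, 480), (0, 480), (2, 455), (0, 455)]    # before 9/11/2001
post = [(0, 500), (1, 500), (0, 412), (0, 412)]   # after 9/11/2001

def rate(games):
    # Total 7th-inning runs divided by total offset (season runs at home).
    return sum(r for r, _ in games) / sum(s for _, s in games)

# With one binary covariate and an offset, the Poisson MLE fits each
# group's rate exactly, so:
b_hat = math.log(rate(post) / rate(pre))
```

(The quasi-Poisson dispersion parameter doesn't change the point estimate, only the standard error and hence the p-value.)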

The Yanks are actually scoring at a slightly higher raw rate in the bottom of the 7th since 9/11/2001 (not shown), but the results above show that after controlling for their increasing offensive strength, they are scoring at a lower rate since the addition of the song! The change is nowhere near significant though, with a p-value of 0.98. So the mumbling about an unfair advantage appears to be unfounded. What about a playoff advantage though? That's what the opposition really gripes about... and it turns out no complicated modeling is necessary to answer that question.

Between 1995 and 2000, the Yanks played 31 home playoff games, and scored 29 7th inning runs, for an average of 0.94. They scored at least once in the 7th 14 times. Between 2001 and 2008, the Yanks played 32 home playoff games, and scored 15 7th inning runs, for an average of 0.47. In those years they scored at least once in the 7th only 9 times!

So it turns out that Dr. Tynan's long performance of "God Bless America" doesn't really throw the opposing pitchers off their game any more than being late in the game in the playoffs at Yankee Stadium has already done. If anything, it seems like it might help the opposition... of course the sample size is pretty small.

Or... maybe the live song has helped the Yanks, but A-Rod has single-handedly mitigated the benefits! :)

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "".

Friday, July 31, 2009


This is the first of possibly many blog entries about statistics in baseball. A bit about me: I'm a Ph.D. student in Statistics at Cornell University, and I've been a baseball fan since I was about five years old, I suppose. I started playing fantasy baseball in the early '90s, and I've been at it ever since.

I'm not sure how often I'll have time to write, or how accessible all of my writings will be to the casual statistician, but I hope some people will read what I write here, and maybe I'll actually be filling a niche: baseball analysis with more rigorous statistics than most blogs. It'll be good practice for my own data analysis skills too - something I don't get much of.

I've recently downloaded some data from Retrosheet, and I'm required to cite them with the following statement whenever I use the data: The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "". This incredibly useful site has free downloads of the play by play information and summary information for the vast majority of MLB games ever played, and complete data for everything recent.

I'll use the statistical package R for my data analysis. The first question I'll tackle is whether the Yankees' "icing" of opposing pitchers with their long 7th inning stretch is actually effective. Stay tuned.