# Statistics in Baseball

## Tuesday, October 19, 2010

### Big changes!

I've got a great new job, starting in January, for a sports betting company in London. This means that the blog entries will probably stop for good because my baseball modelling efforts will go into the job. I may still post once in a while about other sports that my employers don't bet on - like maybe curling!

## Monday, October 11, 2010

### The Home Run Derby effect?

Do players who participate in the All-Star Game Home Run Derby screw up their swing, go into a slump, and have a poor second half? I've heard this one from the talking heads before, and it sounds completely false to me. Mark McGwire used to put on a show in batting practice in 1998, and it didn't stop him from hitting 70 home runs.

I found the Home Run Derby data on the MLB website, but they weren't in CSV or tab-separated format, so I had to do some manipulation. I would have liked to get data on just the few games following the HR Derby to check for slumps, but I settled for the first- and second-half splits (as determined by the All-Star break) from Baseball-Reference. I used the years 2003-2009.

OPS is a good measure of how effective a hitter is, so I thought it would be best to compare pre- and post-break numbers in terms of OPS (I tried some other things and they led to similar conclusions). Baseball-Reference has a statistic called sOPS+ which measures a player's OPS relative to the league. This controls for season, but since I was looking at first and second half differences, this wasn't too important. I tried it anyway, and it gave almost the exact same results as OPS, so I stuck with OPS.

Although some players appeared in more than one HR Derby between 2003 and 2009, I assumed that the 56 differences between pre-break OPS and post-break OPS were independent. The differences looked Gaussian, and the average pre-break OPS was .958 and the average post-break OPS was .924, leading to a one-sided p-value of 0.02 in the paired t-test - so it's true, the participants do worse after the break! This idea is furthered by the fact that the mean career pre- and post-break OPS for the players are not significantly different - the decrease seems to happen specifically in the year the players compete in the HR Derby.
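As a rough illustration of the paired comparison described above, here is a minimal Python sketch. The function name and the five OPS pairs are made up for illustration; they are not the real 56 observations. The p-value uses a normal approximation to the t distribution, which is reasonable for a sample of 56 but is only there to show the mechanics in this tiny example.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_one_sided_test(pre, post):
    """Test H1: mean(pre - post) > 0 for paired samples.

    Returns (mean difference, t statistic, approximate one-sided p-value).
    The p-value comes from a normal approximation to the t distribution,
    so it is only trustworthy for moderately large samples.
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(pre, post)]
    d_bar = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    t_stat = d_bar / std_err
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return d_bar, t_stat, p_value


# Made-up OPS values for five hypothetical hitters (not the real data).
pre_break = [0.980, 0.945, 1.010, 0.930, 0.925]
post_break = [0.940, 0.950, 0.930, 0.915, 0.885]

d_bar, t_stat, p_value = paired_one_sided_test(pre_break, post_break)
print(f"mean drop = {d_bar:.3f}, t = {t_stat:.2f}, approx p = {p_value:.3f}")
```

With the real data one would pass the 56 pre-break and post-break OPS values instead, and use an exact t distribution if a statistics package is available.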

But... how are hitters selected for the HR Derby? By having a very good first half. The hitters participating are ones who have often done unusually well in the first half, and were heading for a drop-off in the second half whether they took part in the derby or not. I can think of two ways to get around this and answer the question of whether the HR Derby causes the poorer second half. One is to compare the second half OPS in HR Derby years to the second half OPS in non-HR Derby years, and the other is to see if players who take part in more rounds of the derby have a bigger second half drop-off than players who are eliminated early. (I could also look at the second half drop-off for players in the All-Star Game who weren't in the HR Derby, but it was enough of a pain getting the data for just these players, so I'll try to avoid this approach.)

The mean career second half OPS of the 56 HR Derby hitters is .894, and in HR Derby years it is .924, so they were actually better than their career norm after the derby. This is still a bit unsatisfactory because the HR Derby year is presumably in the prime of their career, so let's try the second way. Consider the number of swings taken in the competition by each player; this is equal to ten times the number of rounds they were in, plus their HR total. Fitting a linear regression of decrease in OPS on number of HR Derby swings, it is apparent that the more swings the player takes, the smaller the drop from pre-break to post-break OPS is, i.e. the opposite of the proposed effect. So I'm pretty comfortable writing the poorer second halves off to selection bias.
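The swing count and the regression can be sketched in a few lines of Python. The helper names and the five players below are hypothetical, chosen only to show the calculation; a negative slope means that more swings go with a smaller OPS drop.

```python
def derby_swings(rounds, home_runs):
    """Swings in the derby: ten outs per round entered, plus every home run."""
    return 10 * rounds + home_runs


def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical players: (rounds entered, derby HR, pre-break OPS minus post-break OPS).
players = [(1, 4, 0.060), (1, 7, 0.045), (2, 12, 0.040), (3, 20, 0.010), (3, 28, 0.005)]

swings = [derby_swings(r, hr) for r, hr, _ in players]
ops_drop = [drop for _, _, drop in players]
slope, intercept = ols_slope_intercept(swings, ops_drop)
print(f"slope = {slope:.5f} OPS points per swing")
```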

Thanks to Bret Hanlon for the idea for this post.

## Sunday, October 10, 2010

### Starting pitchers in their last inning of the game

In my previous post I compared the FIP for starting pitchers in their final inning of the game to the FIP of their team's bullpen, and found that they're being left in the game too long. Today I'm just going to present three tables that I created from the R code used to make that post. I limited the data to the AL in 2009 in the previous post because the bullpen data had to be collated manually, but no such barrier exists this time, so I used both the AL and NL from 2002-2009, limiting the data to pitchers who started at least 50 games over that time.

This first table looks at pitchers who had the most significant drop-off (all of them had p-values less than 10^(-6)) from their non-final inning numbers to their final inning numbers. 233 of the 241 pitchers in the sample had a significant drop-off (p-values less than 0.05), but these were the most severe. I don't know how much this difference really means because the important number for managerial decisions is the late FIP, but many of these guys are good pitchers, and are probably pitching in close games pretty often (unlike somebody who has a really high FIP throughout the game), and their team would benefit if the manager got them out one inning early. Of course knowing when the pitcher is going to start getting knocked around in a given game is impossible - sometimes they might be getting pulled in the 8th, other times in the 7th, etc. However, it seems managers should be especially alert with these pitchers, and get them out of the game as soon as they're showing even a slight decrease in velocity.

pitcher | early FIP | late FIP |
---|---|---|
Burnett, A.J. | 3.351 | 5.907 |
Byrd, Paul | 4.182 | 7.196 |
Fogg, Josh | 4.529 | 8.090 |
Garland, John | 4.210 | 7.301 |
Hernandez, Livan | 4.039 | 6.670 |
Lackey, John | 3.538 | 5.755 |
Lohse, Kyle | 4.332 | 7.152 |
Meche, Gil | 3.940 | 6.950 |
Ortiz, Russ | 4.183 | 7.639 |
Pavano, Carl | 3.536 | 7.049 |
Perez, Oliver | 4.176 | 7.506 |
Robertson, Nate | 4.160 | 7.473 |
Santana, Johan | 2.915 | 5.337 |
Silva, Carlos | 4.143 | 7.384 |
Trachsel, Steve | 4.444 | 7.597 |
Wakefield, Tim | 4.149 | 7.193 |

This table is less interesting, but these were the only eight pitchers who didn't have a significant last inning FIP increase.

pitcher | early FIP | late FIP | p-value |
---|---|---|---|
Williams, David | 5.131 | 6.063 | 0.187 |
Smoltz, John | 3.114 | 3.827 | 0.089 |
Litsch, Jesse | 4.510 | 5.820 | 0.085 |
Ryan, Brendan | 4.148 | 5.208 | 0.073 |
Kuroda, Hiroki | 3.400 | 4.561 | 0.071 |
Santos, Victor | 4.556 | 5.713 | 0.066 |
Galarraga, Armando | 4.760 | 6.435 | 0.056 |
Hammel, Jason | 3.976 | 5.410 | 0.050 |

The final table shows the nine pitchers who had a last inning FIP over 9.00. I don't think any of them are still starting games. Rick Reed had a good career and my data set just caught the tail end of it. Most have been tried as relief pitchers, and Darren Oliver has actually become a pretty good one.

player | early FIP | late FIP |
---|---|---|
James, Chuck | 4.240 | 10.827 |
McClung, Seth | 4.669 | 10.280 |
Oliver, Darren | 5.053 | 9.661 |
Kinney, Matt | 4.037 | 9.640 |
Waechter, Doug | 4.830 | 9.609 |
Reed, Rick | 3.531 | 9.534 |
Helling, Rick | 4.162 | 9.433 |
Owings, Micah | 4.872 | 9.243 |
Mays, Joe | 4.595 | 9.032 |

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".

## Wednesday, September 1, 2010

### When to remove the starting pitcher

I've always thought managers are slow to go to the bullpen. It seems like they usually wait for the starting pitchers to get into trouble instead of trying to get them out before trouble starts, so I decided to look at the data. My strategy was to compare the last inning of starting pitching to the bullpen's average numbers, and look for significant differences. Getting bullpen numbers for each team isn't easy, and I had to copy/paste each team individually from different pages at Baseball-Reference - not wanting to do this for a bunch of different seasons, I just used the 2009 AL numbers. I would speculate that the NL managers are a bit better at pulling the starter because sometimes they are nudged to do so when he comes up to bat. As it turns out, using only 2009 gives plenty of data to see that managers do not pull the starters in a timely manner. In fact, for all 14 teams there is a significant difference between the bullpen and the last inning of their starting pitching.

Rather than using ERA to compare SPs and RPs, I used FIP. I added HBP to BB in this formula. The additive term seems to have changed from 3.20 to 3.10 since I wrote my entry on closers. The advantage of using FIP is twofold: it takes a lot of the luck out of the equation, and actually predicts future ERA better than past ERA does; and inherited runners are not important because we're just considering HR, BB, K, and IP. It's easy to estimate the standard deviation of FIP because it can be written as a function of multinomial probabilities. This is important here because I want to be able to tell if the bullpen's FIP is significantly better than the starters' FIP.
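For readers who want to reproduce the numbers, here is a small Python sketch of the calculation. It assumes the commonly published form of FIP, with weights of 13, 3 and 2 on HR, BB and K, with HBP added to BB and the 3.10 additive term mentioned above. The function name and the sample stat line are illustrative only.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching with HBP counted alongside walks.

    `ip` must be true innings (6 1/3 innings is 6.333..., not the
    box-score shorthand 6.1). The 13/3/2 weights are the commonly
    published ones; the constant defaults to the 3.10 used in this post.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical season line: 20 HR, 50 BB, 5 HBP, 180 K in 200 innings.
print(round(fip(hr=20, bb=50, hbp=5, k=180, ip=200), 3))
```

Because only HR, BB, HBP, K and IP enter the formula, the same function works for a single inning, a starter's final innings pooled over a season, or a whole bullpen.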

I limited the data to pitchers who started at least 15 games (64 pitchers qualify), figuring by that point the manager should have a good idea of when the pitcher is tiring. Of course I have the advantage of looking at the whole season's data to see where the differences lie - at the beginning of the season, the manager may not know how good his bullpen will be, or how his new pitchers behave in the late innings, etc. But as we'll see from the huge differences - SPs should be removed sooner rather than later!

The following table shows the team's average SP FIP for their last inning in the second column, the bullpen's FIP in the third column, and the p-value testing whether RP FIP is less than SP FIP in the final column.

team | SP FIP | RP FIP | p-value |
---|---|---|---|
ANA | 7.278 | 4.274 | 0.000 |
BAL | 7.522 | 4.557 | 0.001 |
BOS | 6.334 | 4.154 | 0.001 |
CHA | 7.209 | 3.927 | 0.000 |
CLE | 7.851 | 4.686 | 0.000 |
DET | 6.173 | 4.666 | 0.024 |
KCA | 6.737 | 4.586 | 0.003 |
MIN | 9.148 | 4.322 | 0.000 |
NYA | 5.438 | 4.329 | 0.044 |
OAK | 6.265 | 3.349 | 0.000 |
SEA | 5.673 | 4.352 | 0.046 |
TBA | 7.011 | 4.487 | 0.002 |
TEX | 6.402 | 4.057 | 0.000 |
TOR | 8.500 | 4.211 | 0.000 |

Joe Girardi of the Yankees and Don Wakamatsu of the Mariners were the best at removing their starters before they got knocked around. But they still seem to leave them in too long, and they might only be the best because both teams had four qualifying SPs, all of whom were pretty good - notice they have the best SP numbers of any team.

Most teams actually have slightly better RP numbers than SP numbers (even when removing the last inning pitched for all the SP); I guess this is due to being able to throw harder when you throw fewer pitches (I've also heard pitchers are worse the second time through the order - something to investigate in the future). The differences are quite small, but still, if the managers are aware of this, maybe their bullpens are already operating at their innings limit, and can't come in one inning earlier. Or at least all their good RP are operating at their innings limit. I bet some of the difference that shows up in the table is unavoidable, but it sure seems like it would help to have your best AAA pitcher on the roster, instead of another backup hitter, to eat up some of the bad SP innings.

Next time I'll look at some of the pitchers who have the biggest dropoff in FIP from the first several innings to their final inning and at the ones who are the worst in their final inning. Without needing bullpen average FIPs here, I'll be able to consider several seasons at once.

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".

## Sunday, August 1, 2010

### Do knuckleballers induce slumps?

The knuckleball is a very unusual pitch. It's thrown significantly slower than every other pitch, and it's been said to mess with a hitter's timing. Hitters seem to do fine against the knuckleball (KB), but I've heard people say that they'll go into a slump after facing the knuckler. Today I thought I'd investigate. Retrosheet's game logs provide the starting pitcher for every game as well as the basic offensive statistics needed to compute on-base percentage (OBP) and slugging percentage (SLG), so getting data was easy.

I picked out six knuckleballers, either famous or contemporary: Tom Candiotti, Charlie Hough, Steve Sparks, Tim Wakefield, Phil Niekro, and Hoyt Wilhelm. There are some other famous ones, like Joe Niekro, but they didn't throw it for their whole career. I searched through the data to find any games that one of these pitchers started, and then looked at their opponent's OBP and SLG in their kth game after the knuckler game (k=1,...,3 is all that was needed), and paired that with the team's average OBP and SLG over the whole season, with the intention of studying the paired differences to look for a post-KB game effect. I assumed that the differences were independent - which is at least approximately true.

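To calculate the team's overall strength I had to use the unweighted average OBP and SLG, i.e. I averaged the single-game OBP and SLG from all 162 games rather than adding all the count data together to find the weighted average. This was necessary because the weighted average tends to be higher than the value for a single game: games with more plate appearances carry more weight, and those are presumably the games in which the team hit well. Comparing a single game against the weighted average would therefore bias the differences.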
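The gap between the two averages is easy to see in a toy example. The Python sketch below uses five invented games, each given as (times on base, OBP denominator); the function name and the numbers are hypothetical.

```python
def obp_averages(games):
    """Return (unweighted, weighted) OBP for a list of games.

    Each game is a tuple (times_on_base, chances), where `chances` is the
    OBP denominator for that game (AB + BB + HBP + SF).
    """
    unweighted = sum(on_base / chances for on_base, chances in games) / len(games)
    weighted = sum(on_base for on_base, _ in games) / sum(chances for _, chances in games)
    return unweighted, weighted


# Five invented games; the 18-for-48 game is a high-offense day with many chances.
games = [(8, 34), (12, 40), (10, 36), (18, 48), (9, 35)]
unweighted, weighted = obp_averages(games)
print(f"unweighted = {unweighted:.4f}, weighted = {weighted:.4f}")
```

In this example the weighted figure comes out higher because the best offensive game also has the most chances, which is the pattern described above.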

The differences between the season average and the kth game average have a beautiful bell-shaped curve (for both OBP and SLG), so I used two paired t-tests to see if the differences were significantly greater than zero. Since each difference is actually based on the difference of two averages, both of which might have a slightly different sample size each time (at-bat totals won't be exactly the same every game), I could probably be more efficient by assigning some weights to the differences, but I highly doubt this would make a non-negligible difference in the p-values.

The second column in the following two tables shows the season average for the teams in the sample, and the third column shows the average for the kth game after the knuckleball game, k=1,...,3. The p-values test OBP > post-KB OBP and SLG > post-KB SLG respectively, and they are based on paired t-tests.

game | OBP | post-KB OBP | p-value |
---|---|---|---|
1 | 0.3206 | 0.3195 | 0.2566 |
2 | 0.3206 | 0.3226 | 0.8676 |
3 | 0.3206 | 0.3198 | 0.3233 |

game | SLG | post-KB SLG | p-value |
---|---|---|---|
1 | 0.3934 | 0.3869 | 0.0174 |
2 | 0.3934 | 0.3939 | 0.5696 |
3 | 0.3934 | 0.3940 | 0.5872 |

The knuckleball does seem to sap the team's power the day after they face it: the game 1 SLG drop has a p-value of 0.0174, while none of the OBP differences are significant. Most of the next day woes are due to Candiotti's and Wakefield's effects, in particular Wakefield's. His tables are below:

game | OBP | post-KB OBP | p-value |
---|---|---|---|
1 | 0.3261 | 0.3163 | 0.0093 |
2 | 0.3261 | 0.3256 | 0.4509 |
3 | 0.3261 | 0.3224 | 0.1968 |

game | SLG | post-KB SLG | p-value |
---|---|---|---|
1 | 0.4164 | 0.3958 | 0.0021 |
2 | 0.4164 | 0.4168 | 0.5205 |
3 | 0.4164 | 0.4081 | 0.1328 |

The overall effects are higher in Wakefield's tables because he's pitched in an offensive era. Hitters really do seem significantly worse than normal the day after they face him - even with the Bonferroni correction, the SLG decrease is significant, and the OBP decrease is marginally significant. The unweighted averages of OBP and SLG are .316 and .398 in games Wakefield pitches - actually quite close to the averages of the opposing team on the following day, and both lower than the overall average. Hough, Niekro, and Wilhelm are the superior pitchers, but maybe they didn't throw the knuckler as often as Wakefield does, and hence don't screw up the hitters' next day timing as much? Or maybe it has something to do with contemporary hitters not seeing the knuckler as often as guys in the past may have? Steve Sparks doesn't cause this next day drop-off though.
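The Bonferroni adjustment mentioned above can be sketched as follows. The post does not say how many comparisons were corrected for, so this Python example assumes six (three games times two statistics in Wakefield's tables); that count, and the function name, are my assumptions. The p-values are the ones from the tables.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Wakefield's p-values from the tables, keyed by (statistic, game after the KB game).
wakefield = {
    ("OBP", 1): 0.0093, ("OBP", 2): 0.4509, ("OBP", 3): 0.1968,
    ("SLG", 1): 0.0021, ("SLG", 2): 0.5205, ("SLG", 3): 0.1328,
}

adjusted = dict(zip(wakefield, bonferroni(list(wakefield.values()))))
for (stat, game), p_adj in adjusted.items():
    print(f"{stat} game {game}: adjusted p = {p_adj:.4f}")
```

Under that assumption the game 1 SLG drop stays below 0.05 and the game 1 OBP drop lands just above it, which matches the "significant" and "marginally significant" description.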

Thanks to Ben Shaby for suggesting this idea. The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".

## Saturday, July 17, 2010

### Late-inning defensive substitutions

Sometimes a manager will remove a poor fielder from the game when his team is winning, believing the defensive improvement is enough to offset the loss of this player's bat. This usually happens in the 9th inning. Here's a short discussion about it: Tango analysis. My plan was to consider the following:

To lead into my (short) analysis, I want to continue the selection bias discussion from my previous post. In that situation, requiring all four pairwise combinations of win/loss with pinch runner/no pinch runner to exist for each player created a selection bias and led to an overestimate of the PR main effect. I don't think it affected the analysis very much because my interest was in estimating the sum of this main effect and the average player:PR interaction for 25 good hitters, and this sum should not be affected by a biased main effect - the sum for an individual player estimates his personal PR effect and this is not dependent on any bias in the estimate of the average.

In today's analysis I am looking at defensive replacement by teams in the lead, so they go on to win the game - with or without the replacement - a vast majority of the time. Hence, if I set up the model in the same way as I set up the pinch runner model, with win as the response in a logistic regression, this selection bias will be quite severe. The pairwise combination most often lacking is the loss/replacement, and so many players whose replacement only ever led to victory get deleted from the sample. Proceeding as if everything was normal leads to a hugely negative defensive replacement main effect. Again my interest would be in the sum of this effect and the interaction effects of good hitters, so this is not an insurmountable obstacle.

The obstacle seems to be the lack of repetitions. I was picturing managers replacing their good hitters in close games all the time, and then when extra innings roll around, being left without their good hitters. But as an exploratory analysis I looked at 25 years of data, limited to players hitting in the top five in the order (lower than that the replacement might well be as good as the guy he's replacing) who've been substituted for in the 9th inning with a one or two run lead at least one time (not imposing any condition on having lost at least one game, and so unable to estimate a player effect), and found that out of the 36844 cases remaining, only 629 were defensive replacements - this amounts to less than 2 substitutions per season per team. And of those 36844, only 1490 times did their spot come to bat again - 1428 times with them in it, and 62 times with their replacement in it. At least you'd think that the team wins more often when the player hasn't been replaced, right? 657 wins/1428 games (46%) batting for himself, and 30 wins/62 games (48%) with the replacement - a statistically insignificant difference. Unsurprisingly, when I fit the full logistic regression model described above (again I limited to the top five in the order and fit without player:I.def interaction), the I.def effect was not anywhere close to significant.

This brute force approach to try to get around the noisy data is not going to work here. It certainly looks like the defensive replacement effect, if any, is quite minimal anyway. I'm satisfied for now, but to answer the question properly, I'd have to be able to measure exactly how much a defensive replacement helps the defense, and I don't have the data to do that right now.

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".

- win: a binary response for logistic regression indicating if the team held on to win the game,
- I.def: an indicator of whether the player was substituted for,
- dist: the number of spots in the order until this player is due to bat again. Presumably as this increases, the defensive substitution starts to look better and better,
- lead: either one or two runs,
- inn: inning - limited to top of the 9th and bottom of the 9th. I didn't want to include the 8th and get into dealing with players who weren't replaced in the 8th but then were in the 9th. The replacement usually happens in the 9th inning anyway,

- player: identity of the original player, much like the pinch running analysis, I'm going to lump all the replacements together,
- opp: the opposing team,

To lead into my (short) analysis, I want to continue the selection bias discussion from my previous post. In that situation, requiring all four pairwise combinations of win/loss with pinch runner/no pinch runner to exist for each player created a selection bias and led to an overestimate of the PR main effect. I don't think it affected the analysis very much because my interest was in estimating the sum of this main effect and the average player:PR interaction for 25 good hitters, and this sum should not be affected by a biased main effect - the sum for an individual player estimates his personal PR effect and this is not dependent on any bias in the estimate of the average.

In today's analysis I am looking at defensive replacements by teams in the lead, and those teams go on to win the game - with or without the replacement - the vast majority of the time. Hence, if I set up the model in the same way as I set up the pinch runner model, with win as the response in a logistic regression, this selection bias will be quite severe. The pairwise combination most often lacking is loss/replacement, so many players whose replacement only ever led to victory get deleted from the sample. Proceeding as if everything were normal leads to a hugely negative defensive replacement main effect. Again my interest would be in the sum of this effect and the interaction effects of good hitters, so this is not an insurmountable obstacle.

The obstacle seems to be the lack of repetitions. I was picturing managers replacing their good hitters in close games all the time, and then when extra innings roll around, being left without their good hitters. But as an exploratory analysis I looked at 25 years of data, limited to players hitting in the top five in the order (lower than that, the replacement might well be as good as the guy he's replacing) who've been substituted for in the 9th inning with a one or two run lead at least once (not imposing any condition on having lost at least one game, and so unable to estimate a player effect). Out of the 36844 cases remaining, only 629 were defensive replacements - this amounts to less than 2 substitutions per season per team. And of those 36844, only 1490 times did their spot come to bat again - 1428 times with them in it, and 62 times with their replacement in it.

At least you'd think that the team wins more often when the player hasn't been replaced, right? 657 wins/1428 games (46%) batting for himself, and 30 wins/62 games (48%) with the replacement - a statistically insignificant difference. Unsurprisingly, when I fit the full logistic regression model described above (again I limited to the top five in the order and fit without the player:I.def interaction), the I.def effect was not anywhere close to significant.
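
The post doesn't say which test was behind "statistically insignificant", so here is a minimal Python sketch of one reasonable choice, a pooled two-proportion z-test, applied to the counts quoted above. The function name and the choice of test are my own assumptions, not the original R code.

```python
import math

def two_proportion_z_test(wins_a, games_a, wins_b, games_b):
    """Pooled two-sided z-test for a difference between two win rates."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    # Under the null hypothesis both groups share one win rate
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    z = (p_b - p_a) / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value

# Counts from the post: original hitter batting vs. replacement batting
p_self, p_repl, z, p_value = two_proportion_z_test(657, 1428, 30, 62)
print(f"batting for himself: {p_self:.3f}")
print(f"replacement batting: {p_repl:.3f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

With only 62 games in the replacement group, the standard error is large compared with the two-point gap, so the p-value lands far above any usual significance threshold.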

This brute-force attempt to get around the noisy data is not going to work here. It certainly looks like the defensive replacement effect, if any, is quite minimal anyway. I'm satisfied for now, but to answer the question properly, I'd have to be able to measure exactly how much a defensive replacement helps the defense, and I don't have the data to do that right now.

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".

## Sunday, July 4, 2010

### Are pinch runners effective? Part 3 of 3

In this entry I'm going to see whether pinch running is a good tradeoff: does it increase the probability of scoring by enough to offset the loss of offense in extra innings? My previous post showed how much pinch running helped the chances of scoring - it was more than I thought. Now we'll complete the second half of the problem.

My original plan was to look only at future innings where the PR (or original hitter if he wasn't pinch run for) came to bat. This would decrease the variance inevitably created by considering opponents' runs or runs in innings when this spot didn't come to bat. Then by finding the probability of this spot returning to the plate in each different situation, I could isolate the negative side of the PR effect, which could then be combined with the estimate of the positive side that I already have. I was hoping this would let me separate the effects of pinch running and the defensive replacement that surely occurs along with it, but these probabilities of returning to bat are going to depend on the opponent's run scoring, and this depends a bit on the defense. The main problem, though, would be to connect these separate analyses of the good and the bad of pinch running with a sensible variance estimate that allows for the dependence between the two models.

So I scrapped this idea and decided to start by just fitting a multiple logistic regression with win/loss as the binary response, hoping that the noisiness of the data would be offset by the fact that I had the data for games all the way back to 1952. The predictors I considered were:

- I.PR: an indicator of whether there was a pinch runner,
- lead: the score difference between the two teams (either -1, 0, or 1),
- inn: either top 8th, bottom 8th, top 9th, or bottom 9th. I lumped extra innings in with the 9th inning,
- outs: the number of outs when the runner first got on base (this is categorical because there's no reason outs would be related linearly to the log odds of winning),
- I.2nd: an indicator of whether the runner was on 2nd base (the alternative is the reference level, 1st base),
- player: identity of the original runner (for parsimony I'm lumping all pinch runners together as being fast guys),
- opp: the opposing team. This doesn't make much difference because unless you're going to control for which pitcher you're facing, the variance in opposition even within the same team is pretty big due to different pitchers across different generations.

The main effect of player measures mostly how good his team is, particularly the players hitting immediately after him. A player effect of zero means his team is about average. The main effect of I.PR measures how much pinch running for the average player increases the log odds of winning. Note that this "average" is not the same as the other one: this one refers more to the speed of the player himself. It also incorporates how good a hitter he is - because he might return to the plate in extra innings - but as we will see, it's mostly about his speed, just as the player main effect is mostly about his team rather than himself.

But the interaction of I.PR and player provides information about pinch running for this player relative to the average; if it's positive, a PR should be employed for this player at least as often as the average player; if it's negative, the PR should be employed less often than the average. In fact if the sum of the interaction coefficient and the I.PR coefficient is negative, the log odds of winning the game are decreased when the player is pinch run for.
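
In symbols, one way to write the model described above is the following. The notation is mine rather than anything from the fitted R code, and it shows only the terms named in this post; any other terms the fit may have included are left out.

$$
\operatorname{logit} P(\text{win}) = \beta_0 + \beta_{\text{PR}}\, I_{\text{PR}} + \alpha_{\text{player}} + \gamma_{\text{player}}\, I_{\text{PR}} + (\text{terms for lead, inn, outs, I.2nd, opp})
$$

Switching $I_{\text{PR}}$ from 0 to 1 for a given player changes the log odds of winning by

$$
\beta_{\text{PR}} + \gamma_{\text{player}},
$$

so pinch running for that player hurts his team's chances exactly when $\beta_{\text{PR}} + \gamma_{\text{player}} < 0$.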

(Aside: the minor selection bias I mentioned in part 2 is exacerbated here. To estimate the player:I.PR interaction, we need all four pairwise combinations of win/loss with PR/no PR for each player, so any player who is lacking one of these combinations is deleted. By far the most likely to be missing are the two involving PR. About 55% of these deletions were loss/PR - a higher percentage of losses than the true PR population contains - so the PR main effect in my analysis has a positive bias. But my goal is to look at player-specific PR effects, and those are based on the sum of the PR main effect with the interaction term, a sum which should be invariant to any bias in the main effect - if the main effect is too high, the interaction estimate will just be lower to balance it out. I wouldn't expect the parameter estimates for the effects not involving PR to be biased.)

I don't want to consider just one player at a time because the variance of the interaction coefficient estimates is too large to make an informed conclusion. But what I can do is average the interaction effects of many players together. Without looking at the data, I picked a list of 25 players who I thought were good hitters, but in general pretty slow runners:

- Berkman, Lance
- Bonds, Barry
- Cabrera, Miguel
- Dunn, Adam
- Giambi, Jason
- Guerrero, Vladimir
- Gwynn, Tony
- Helton, Todd
- Holliday, Matt
- Howard, Ryan
- Jones, Chipper
- Kent, Jeff
- Lee, Carlos
- McGriff, Fred
- McGwire, Mark
- Ordonez, Magglio
- Ortiz, David
- Palmeiro, Rafael
- Piazza, Mike
- Ramirez, Manny
- Rodriguez, Ivan
- Sheffield, Gary
- Sosa, Sammy
- Thomas, Frank
- Youkilis, Kevin

The table below is based on the average of the 25 aforementioned players. It gives the probabilities of winning in each of the 66 different situations: 3 leads x 4 half-innings x 2 bases x 3 out counts = 72, minus the 6 bottom-of-the-9th situations with a one-run lead, which can't occur because the game would already be over (don't worry, I used a loop in R to make the HTML code so I didn't have to type it all). The situation column is as follows: lead, inning, base, outs. The p-values are for the 2-sided test between the two probability estimates.

situation | P(win) no PR | P(win) PR | p-value |
---|---|---|---|
-1,t8,1,0 | 0.325 | 0.379 | 0.217 |
-1,t8,1,1 | 0.232 | 0.277 | 0.226 |
-1,t8,1,2 | 0.145 | 0.177 | 0.236 |
-1,t8,2,0 | 0.399 | 0.457 | 0.210 |
-1,t8,2,1 | 0.290 | 0.341 | 0.221 |
-1,t8,2,2 | 0.156 | 0.190 | 0.235 |
-1,b8,1,0 | 0.442 | 0.501 | 0.206 |
-1,b8,1,1 | 0.332 | 0.386 | 0.217 |
-1,b8,1,2 | 0.218 | 0.261 | 0.228 |
-1,b8,2,0 | 0.521 | 0.580 | 0.200 |
-1,b8,2,1 | 0.401 | 0.459 | 0.210 |
-1,b8,2,2 | 0.233 | 0.278 | 0.227 |
-1,t9,1,0 | 0.216 | 0.259 | 0.227 |
-1,t9,1,1 | 0.148 | 0.180 | 0.234 |
-1,t9,1,2 | 0.089 | 0.110 | 0.240 |
-1,t9,2,0 | 0.275 | 0.325 | 0.222 |
-1,t9,2,1 | 0.190 | 0.229 | 0.230 |
-1,t9,2,2 | 0.096 | 0.118 | 0.240 |
-1,b9,1,0 | 0.296 | 0.348 | 0.219 |
-1,b9,1,1 | 0.209 | 0.251 | 0.228 |
-1,b9,1,2 | 0.129 | 0.158 | 0.236 |
-1,b9,2,0 | 0.367 | 0.423 | 0.213 |
-1,b9,2,1 | 0.263 | 0.311 | 0.223 |
-1,b9,2,2 | 0.139 | 0.170 | 0.236 |
0,t8,1,0 | 0.551 | 0.662 | 0.009 |
0,t8,1,1 | 0.468 | 0.583 | 0.011 |
0,t8,1,2 | 0.401 | 0.516 | 0.013 |
0,t8,2,0 | 0.628 | 0.729 | 0.007 |
0,t8,2,1 | 0.543 | 0.654 | 0.009 |
0,t8,2,2 | 0.421 | 0.537 | 0.013 |
0,b8,1,0 | 0.723 | 0.806 | 0.006 |
0,b8,1,1 | 0.652 | 0.749 | 0.007 |
0,b8,1,2 | 0.588 | 0.694 | 0.008 |
0,b8,2,0 | 0.782 | 0.851 | 0.005 |
0,b8,2,1 | 0.716 | 0.801 | 0.006 |
0,b8,2,2 | 0.608 | 0.712 | 0.008 |
0,t9,1,0 | 0.565 | 0.674 | 0.009 |
0,t9,1,1 | 0.482 | 0.597 | 0.011 |
0,t9,1,2 | 0.415 | 0.530 | 0.013 |
0,t9,2,0 | 0.641 | 0.740 | 0.007 |
0,t9,2,1 | 0.557 | 0.667 | 0.009 |
0,t9,2,2 | 0.435 | 0.551 | 0.012 |
0,b9,1,0 | 0.739 | 0.819 | 0.006 |
0,b9,1,1 | 0.670 | 0.764 | 0.007 |
0,b9,1,2 | 0.607 | 0.711 | 0.008 |
0,b9,2,0 | 0.796 | 0.861 | 0.005 |
0,b9,2,1 | 0.733 | 0.814 | 0.006 |
0,b9,2,2 | 0.627 | 0.728 | 0.007 |
1,t8,1,0 | 0.796 | 0.879 | 0.001 |
1,t8,1,1 | 0.742 | 0.843 | 0.001 |
1,t8,1,2 | 0.731 | 0.835 | 0.001 |
1,t8,2,0 | 0.843 | 0.909 | 0.001 |
1,t8,2,1 | 0.795 | 0.879 | 0.001 |
1,t8,2,2 | 0.747 | 0.846 | 0.001 |
1,b8,1,0 | 0.917 | 0.954 | 0.001 |
1,b8,1,1 | 0.891 | 0.938 | 0.001 |
1,b8,1,2 | 0.885 | 0.935 | 0.001 |
1,b8,2,0 | 0.939 | 0.966 | 0.001 |
1,b8,2,1 | 0.917 | 0.954 | 0.001 |
1,b8,2,2 | 0.894 | 0.940 | 0.001 |
1,t9,1,0 | 0.877 | 0.930 | 0.001 |
1,t9,1,1 | 0.840 | 0.907 | 0.001 |
1,t9,1,2 | 0.832 | 0.902 | 0.001 |
1,t9,2,0 | 0.907 | 0.948 | 0.001 |
1,t9,2,1 | 0.876 | 0.929 | 0.001 |
1,t9,2,2 | 0.843 | 0.909 | 0.001 |
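
Since the model works on the log-odds scale, it can help to convert a few rows back to that scale to see how big the estimated PR effect is for these 25 players. This is a minimal Python sketch using values copied from the table; the `logit` helper and the choice of rows are mine, not part of the original R analysis.

```python
import math

def logit(p):
    """Log odds corresponding to a probability p."""
    return math.log(p / (1 - p))

# situation -> (P(win) no PR, P(win) PR), copied from the table
rows = {
    "-1,t8,1,0": (0.325, 0.379),
    "0,t8,1,0": (0.551, 0.662),
    "0,b9,2,0": (0.796, 0.861),
    "1,t8,1,0": (0.796, 0.879),
}

# Implied change in the log odds of winning when a pinch runner is used
shifts = {s: logit(pr) - logit(no_pr) for s, (no_pr, pr) in rows.items()}
for situation, shift in shifts.items():
    print(f"{situation}: log-odds shift {shift:+.2f}, odds ratio {math.exp(shift):.2f}")
```

In these rows the shift is nearly identical for the two tied situations and differs between trailing, tied, and leading. That would be consistent with a PR effect that was allowed to vary by lead, though the post does not say so.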

A couple interesting things I notice:

- the win probability estimates when trailing by one or when tied are higher with a runner on 1st and no out than with a runner on 2nd and one out. Newsflash: bunting is dumb in general, even when you only need one run.
- some of the PR effects are shockingly large compared to what I had estimated in part 2 for the change in probability of the run scoring. I can only say that I double checked these estimates, and that the estimates from part 2 hadn't allowed for the slowness of the hitter. The other difference I can think of is that (for convenience) my code counted the runner as having scored even if he'd been erased by a fielder's choice and a subsequent runner scored. If this was more common in non-PR situations, it could have led to an understatement of the true PR effect in part 2.
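
The first observation can be checked directly against the table. This small Python sketch compares the no-PR win probability with a runner on 1st and no outs against a runner on 2nd and one out, which is the state after a successful sacrifice, for every trailing and tied situation. The values are copied from the table; the dictionary layout is just my own convenience.

```python
# (lead, inning) -> (P(win) runner on 1st with 0 out, P(win) runner on 2nd with 1 out),
# both taken from the "no PR" column of the table
no_pr = {
    "-1,t8": (0.325, 0.290),
    "-1,b8": (0.442, 0.401),
    "-1,t9": (0.216, 0.190),
    "-1,b9": (0.296, 0.263),
    "0,t8": (0.551, 0.543),
    "0,b8": (0.723, 0.716),
    "0,t9": (0.565, 0.557),
    "0,b9": (0.739, 0.733),
}

# Positive margin means the team was better off before the bunt
margins = {k: first - second for k, (first, second) in no_pr.items()}
for situation, margin in margins.items():
    print(f"{situation}: 1st/0 out beats 2nd/1 out by {margin:+.3f}")
```

Every margin is positive, and the comparison is generous to the bunt because it assumes the sacrifice always works.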

I think my next entry will be about late-inning defensive replacements.

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".
