Image credit: © Kirby Lee-USA TODAY Sports
Virtually all aspects of baseball are analyzed through increasingly complex models, including at Baseball Prospectus. One aspect has largely eluded this treatment: what we might call “BIP baserunning” or, if you prefer, “ordinary baserunning.” BIP baserunning—as distinguished from basestealing or advancing on a ball in the dirt—describes the ability of a baserunner to advance on balls in play. BIP baserunning generates measurements of both the baserunners themselves and the arms of fielders (typically outfielders) by the extent to which they deter or throw out baserunners trying to take that extra base. When a baserunner is thrown out, it becomes an outfield assist.
Typical examples of positive baserunning plays include:
- Taking the extra base on a single.
- Scoring from first on a double.
- Scoring from third on a sacrifice of some kind.
- Taking an extra base on a throw to another base (including the batter).
Traditionally, BIP baserunning has been addressed as a counting statistic where a runner or fielder’s results are treated largely as gospel, with the results tabulated in run expectancy change. That premise has not been reexamined much, probably because BIP baserunning is generally not that valuable: most runners are unlikely to offer more than a few runs over a season. BIP baserunning outcomes also are often predetermined by the nature of the ball in play itself.
Nonetheless, especially as a counting statistic, BIP baserunning can still be biased by the quality of the defenders being faced, the frequency with which the runner gets on base, and the frequency and nature of the BIP generated by the runner’s teammates. If we are interested in isolating a baserunner’s or fielder’s most likely contribution, which is what we believe a valid baseball statistic ought to be trying to describe, we need to do something else.
Because our switch from FRAA to RDA requires a change to our BIP baserunning / OF assists framework anyway, to harmonize run scales, we decided to try and do the thing properly. We created a new modeling system for baserunning that scores whether a runner was thrown out, stayed put, or took 1 to 3 bases. We incorporate Statcast batted ball inputs so we can model with more precision which baserunning feats are truly impressive, and which are not, and thereby neutralize the quality of a runner’s teammates. We use average run expectancy values by base and out that are independent of other baserunners. We focus on lead runners for the time being; trailing runners create a fraction of the impact that even this fraction represents, and require more study. This system has been implemented for all MLB seasons from 2015 to 2023, and will remain in place going forward.
To date, we have found that, once we adjust for context, the value of BIP baserunning is indeed fairly minimal. But perhaps we are overlooking something, and there is always room for improvement. We would like your input. So, we are going to describe in detail exactly what we are doing, and give our readers the opportunity not only to comment on the model, but to run it themselves and provide us with feedback.
The Challenges of a BIP Baserunning Model
Baserunning has several interesting aspects that need to be accommodated by any rigorous model.
First, the outcomes are discrete states, rather than continuous measurements. Generally speaking, once you are already on base, the potential outcomes of a ball in play are (1) being thrown out, (2) staying where you are, (3) taking one base, (4) taking two bases, and (5) taking three bases. Modeling discrete states is much more complex than just modeling a change in measurement. Ordinarily a model like this would be fit using a categorical model, which is what we do for our DRA / DRC metrics. Unlike a simple success / fail (Bernoulli) model, such as stolen base success, a categorical model can cover as many categories as you want, albeit at increasing computational cost and decreasing efficiency as the number of outcomes grows.
Second, and somewhat offsetting the first issue, is that baserunning thankfully has a natural order to it: you can take 0 bases, or 1 base, or 2 bases, or 3 bases. This is convenient because we know that whatever you had to do to take 1 base, you had to do that plus more to take 2 bases, and so on. In statistics, we call these outputs ordinal or cumulative, because you can use the statistical power of one category to better predict the next, instead of just treating all outcomes as unrelated. Importantly, you don’t have to assume the same distance between results, and it is perfectly acceptable for a greater-base outcome to be less likely than a lesser-base outcome, which of course it is, due to starting base positions and diminishing likelihood of achievement.
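To make the ordinal idea concrete, here is a minimal sketch of a cumulative-logit calculation in base R. The thresholds and the linear predictor `eta` are made-up illustrative values, not estimates from our model; the point is only that one shared predictor, cut at ordered thresholds, yields a probability for each number of bases taken, with unequal spacing between categories.

```r
# Cumulative-logit sketch. All numbers here are hypothetical.
cumulative_probs <- function(eta, thresholds) {
  # P(Y <= k) at each threshold, with the final category closing out at 1
  cum <- c(plogis(thresholds - eta), 1)
  # Differencing the cumulative curve gives P(Y = k) for each category
  probs <- diff(c(0, cum))
  names(probs) <- paste0(seq_along(probs) - 1, "_bases")
  probs
}

# Hypothetical cut points between 0|1, 1|2, and 2|3 bases (unequal spacing is fine)
thresholds <- c(0.5, 2.5, 5.0)

slow_play <- cumulative_probs(eta = 0, thresholds)    # less favorable situation
fast_play <- cumulative_probs(eta = 1.5, thresholds)  # more favorable situation

print(round(rbind(slow_play, fast_play), 3))
```

A larger `eta` shifts probability away from 0 bases and toward the higher categories, while each set of probabilities still sums to 1.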
However, there is an important caveat: being thrown out on the bases is a huge deal, and it doesn’t fit into the ascending tendency of the other states. A runner can be thrown out almost anywhere, trying to take 1 base or 3 bases or just trying to get back to their original base. Where do you position these baserunners in our hierarchy? Should a runner who is thrown out at home be treated differently than a runner who was thrown out at second? We’ll discuss our solution below.
Third, the model needs to be intelligent enough to know what is possible and what is not. For example, a runner on second cannot take more than 2 bases under any scenario. A speedy runner on a single could take more than 2 bases if they are on first, but overall it should be highly unusual. If the model is making predictions that do not fit this pattern, something is wrong, and we have more work to do.
Fourth, you have to decide if you want to include double-play avoidance (batter being safe on the relay throw) as part of baserunning. I could see an argument for both sides. We found the differences in values to be sufficiently small that it didn’t seem important to incorporate for the moment, and thus treat the batter as a trailing runner. But we welcome your feedback here also.
Fifth, you need a well-specified, robust system to keep track of all these rules and allow you to actually know what is going on inside this model. A run-of-the-mill machine learning model cannot achieve this, nor can your off-the-shelf linear regression. The search for the right system took up a lot of this process. However, we think we may have found it.
A Hurdle-Cumulative Model for BIP Baserunning
The Target Variables
To begin, we need to describe our target variable(s) and have them operate in some meaningful way. We already noted the ordinal or cumulative nature of most outcomes: taking somewhere between 0 through 3 bases. But the sticking point remains how we deal with being thrown out. Do we have to account for this at all? If so, does it matter if the runner is thrown out running back to first or trying to take third? Can we just treat it as -1 bases taken?
Another way to frame this problem is that before we can award a runner credit for running, we need to pass the “hurdle” of deciding whether the runner is actually going to be safe somewhere. If they are out, we are done, and negative run value will follow. But if they are safe, we can award them 0 to 3 additional bases. Arguably, a runner who gets thrown out while further along can open up bases behind them, although perhaps that credit instead should be awarded to the trailing runner who makes a heads-up play. But at the end of the play, you are either safe or you are out; how you accomplished the latter is probably less important than the result, which can be an inning-killer regardless of where it happens. So we will value each runner out as the elimination of the base where the runner started, not where he (almost) ended up.
Putting these concepts together, you end up with a “hurdle-cumulative” model. The model simultaneously calculates your probability of being out versus not out on the basepaths, as well as how many bases are likely to be taken if you are not thrown out. By calculating them simultaneously, the models are allowed to be aware of each other, and reduce the chance of overfitting. Specifically, we code being thrown out on the basepaths as outcome 1, and then the “bases taken” outcomes of 0, 1, 2, and 3 bases as codes 2, 3, 4, and 5 respectively.
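As a rough illustration of that coding and of how the two parts combine, consider the base R sketch below. The data frame, its column names, and every probability are hypothetical; it is meant to show the arithmetic of a hurdle mixture, not the internals of any particular package.

```r
# Hypothetical raw play records; column names are illustrative only.
plays <- data.frame(
  runner_out  = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  bases_taken = c(NA, 0, 1, 2, 3)
)

# Code 1 = thrown out; codes 2..5 = safe with 0..3 bases taken.
plays$bases_taken_code <- ifelse(
  plays$runner_out, 1L, as.integer(plays$bases_taken) + 2L
)

# Combine a hurdle (out) probability with ordinal probabilities that
# apply only when the runner is safe.
hurdle_mix <- function(p_out, p_bases_if_safe) {
  stopifnot(abs(sum(p_bases_if_safe) - 1) < 1e-8)
  probs <- c(p_out, (1 - p_out) * p_bases_if_safe)
  names(probs) <- c("out", "0_bases", "1_base", "2_bases", "3_bases")
  probs
}

# Made-up inputs: 3% chance of an out, then a distribution over bases if safe.
play_probs <- hurdle_mix(p_out = 0.03,
                         p_bases_if_safe = c(0.40, 0.45, 0.14, 0.01))
print(round(play_probs, 4))
```

The five resulting probabilities sum to 1, which is what lets them be weighted by run values later on.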
Where are we going to find a good implementation of a cumulative model? With the experimental psychologists, that’s who. They live in a world of items being rated on scales of 1 to virtually anything, and have given plenty of thought to how to implement a cumulative model. Fortunately, Paul Bürkner, the author of brms, the leading R front-end for Stan, is an experimental psychologist who has ensured that his open-source R package can fit cumulative models (among many others). Paul also recently implemented a hurdle-cumulative family, so we are now officially in business.
The Predictors
That gives us our target outputs, but how do we predict those outputs? These are the factors that we settled upon, after extensive testing:
Predictor | Hurdle outcome | Bases Taken Outcome |
BIP Launch speed | x | x |
BIP Launch angle | x | x |
BIP Estimated Bearing | | x |
Credited Position | x | x |
Fielder ID | x | x |
Runner ID | | x |
Runner speed | | x |
Potential tag up | | x |
Starting Base | x | x |
Outs Before PA | x | x |
Throwing Error | | x |
There are some interesting findings in this table.
Predictors of the hurdle (getting thrown out) outcome are not the same as those that determine how many bases a runner takes, if any. There is plenty of overlap, but clear differences also.
Notable among these is that while the identity of the fielder helps determine if a runner is out on the bases, neither the identity of the runner nor the runner’s speed is a needle-mover. This was a surprise at first, and I suspect it would surprise many of you too: aren’t slow people more likely to be thrown out and fast people more likely to beat out a throw? Apparently not. But, from the coaching standpoint, I have been told this checks out, because outs on the basepaths are unusual: runners know whether they are fast or slow, and have reasonable heuristics about which types of balls in play make it worth it for them, personally, to try and take an extra base. As a result, outs on the basepaths tend to be the results of some unique factor, such as an unusually hard-hit ball, a terrific play by the outfielder, a random miscalculation by the runner, or some combination of the above. In theory, those are covered by our other predictors.
The other predictors will surprise you less. Batted ball characteristics matter, although BIP bearing (spray, which we estimate from stringer coordinates) matters to the number of bases taken but not being thrown out. For base-taking, foot speed matters, as does the runner’s identity. I like the fact that the model identified them as being separately relevant because baserunning seems to have an intelligence factor in addition to raw speed, and this model estimates how much of each the runner seems to have. Likewise, a tag-up play makes things more interesting because the runner has to give up whatever lead they might otherwise have, making advancement harder. Finally, a throwing error virtually guarantees an advancement of some sort. For the runner we want to control for a throwing error, but for a fielder we want to punish them for it.
The model would be more precise if we had access to runner and fielder coordinates at relevant times during the play, but MLB does not yet provide those to the public. Please add those measurements to your prayer circles, if you could.
The Run Values
This is another interesting aspect. It’s one thing to have your nicely-defined output categories, but what do you do with them? You can’t just subtract one outcome from another, because the outcome codes are arbitrary labels and don’t have a natural numeric meaning. Hence, -1 is really not an option for being thrown out. This problem is compounded when we try to separate individual performance from typical performance, because we have to subtract one prediction from the other and get the average difference over the entire season.
Our approach is to calculate run expectancy values for each potential outcome for a lead runner, grouped by starting base and out. Our model already calculates the probability of each of the five states for each lead runner on a play, and the probabilities of the five states of course sum to 1 by rule. So if we multiply the run value of each potential outcome by the probability of the outcome with the player(s) in question, and aggregate the run value, and then do the same for a typical player in the same situation, the difference in run value tells us how much the runner or fielder contributed (or gave up) on the play. The average difference over the course of a season tells us how a player rated on a rate basis, and summing the differences gives us the total number of baserunning runs for the player.
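The arithmetic can be sketched in a few lines of base R. Every number below is a placeholder: the run values are not our actual run expectancy table, and the probabilities are not model output. The sketch only shows the probability-weighted run value for the player, the same quantity for a typical player, and the difference between them.

```r
# Hypothetical run values for one starting base / out combination.
run_values <- c(out = -0.50, stay = 0.00, plus1 = 0.10, plus2 = 0.30, plus3 = 0.60)

# Hypothetical outcome probabilities on a single play.
p_player  <- c(out = 0.02, stay = 0.38, plus1 = 0.45, plus2 = 0.14, plus3 = 0.01)
p_typical <- c(out = 0.03, stay = 0.42, plus1 = 0.42, plus2 = 0.12, plus3 = 0.01)

# Probability-weighted run value of a play.
expected_runs <- function(probs, values) {
  stopifnot(abs(sum(probs) - 1) < 1e-8,
            identical(names(probs), names(values)))
  sum(probs * values)
}

# Contribution on this play: player minus typical player, same situation.
play_value <- expected_runs(p_player, run_values) -
  expected_runs(p_typical, run_values)

# Over a (tiny, hypothetical) season of plays:
play_values <- c(play_value, -0.004, 0.009)
rate_stat  <- mean(play_values)  # rate basis
total_runs <- sum(play_values)   # counting basis
```

Because both sets of probabilities are evaluated in the same situation with the same run values, the difference isolates what changes when the player in question is swapped for a typical one.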
You might ask why we use separate run values by out and starting base, when you could argue a runner does not control either, at least in his capacity as runner. In other words, why not just use one base state for all out situations, allowing us to get away with only three of them? The answer, for us anyway, is that we are already controlling for the base-out state of the situation in the model, and there is no need to do so again. More importantly, even if they did not create the situation, runners are still responsible for knowing the situation they are in, and we think it fair to hold them responsible for making the right move under the circumstances. Baseball is often randomized, and we’re used to isolating a player’s performance from uncontrollable external forces. But it’s best to consider baserunning akin to reliever usage: the setting matters, and the actors in both cases make decisions accordingly.
Checking the Model
How does one check the accuracy of a model like this? There are many ways, but I will discuss two of them.
On the front end, we used approximate leave-one-out cross-validation to assess the predictive power out of sample for each predictor, leaving those in that improved our results and taking those out that did not. This is standard Bayesian practice for model building, and we saw no reason to deviate from it here.
On the back end, we find it helpful to confirm that the model does not provide obviously wrong answers to certain situations. For example, a runner on third cannot take 2 bases, much less 3. A runner on second can take 2 bases, but not 3, and so on. I’m pleased to say that our model consistently gets these right, so it at least has that going for it.
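One way to automate this kind of check is sketched below in base R. The prediction table, its column names, and the tolerance are all hypothetical; the fourth row is deliberately wrong so the check has something to catch.

```r
# Hypothetical predicted probabilities, one row per play.
pred <- data.frame(
  start_base = c(1, 2, 3, 3),
  p_out = c(0.02, 0.03, 0.04, 0.04),
  p_0   = c(0.40, 0.35, 0.30, 0.30),
  p_1   = c(0.43, 0.40, 0.66, 0.60),
  p_2   = c(0.14, 0.22, 0.00, 0.06),  # last row: runner on third "taking" 2 bases
  p_3   = c(0.01, 0.00, 0.00, 0.00)
)

# Probability mass placed on outcomes that are physically impossible.
impossible_mass <- function(pred) {
  bases_cols <- c("p_0", "p_1", "p_2", "p_3")
  max_bases <- 4 - pred$start_base  # first: 3, second: 2, third: 1
  sapply(seq_len(nrow(pred)), function(i) {
    too_far <- (0:3) > max_bases[i]
    sum(unlist(pred[i, bases_cols])[too_far])
  })
}

mass <- impossible_mass(pred)
flagged <- which(mass > 1e-3)  # hypothetical tolerance
print(data.frame(row = flagged, impossible_mass = mass[flagged]))
```

Any flagged rows would be a sign that the model has not learned the constraints imposed by the starting base.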
The Results
We propose a few output metrics to reflect our new model. We provide a rate statistic, which for the moment we will call DRBa Rate, a/k/a the rate of Deserved Baserunning After Contact. The column DRBa is the counting statistic of DRBa Rate times opportunities, and is what figures into baserunning for WARP purposes. Better BIP baserunners have positive values, and poor baserunners have negative values.
We will show the top and bottom baserunners and fielders for both the 2015 and 2023 seasons:
Baserunner Results
Analogous statistics exist for Throwing. THR Rate is the rate statistic for THR, or Throwing Runs. Likewise, THR Opps refers to throwing run opportunities.
Now let’s show the top and bottom fielders from 2015 and 2023 in deterring or killing baserunners:
The results appear to be directionally correct. But the counting stats also are more compressed than what we are used to seeing. To some extent this is not surprising, given that we are no longer crediting baserunners or fielders for the fortuity of the positions in which they find themselves. But it is also possible we are being too stingy in our run values, or are shrinking factors that ought to be left alone. We welcome reader feedback on this issue.
Finally, we note that the range has compressed a bit from 2015 to 2023. On balance, we see this as a multi-year trend toward reduced value, albeit a somewhat noisy one. The reason for the trend is not entirely clear, to the extent it is a trend at all. One possibility is that teams have more intelligence than before about runner speed and which bases are worth trying for and which are not. Or perhaps runners are taking fewer risks, period. Or perhaps the league-wide tendency toward playing outfielders deeper has made it more difficult for individual fielders to stand out when it comes to baserunner deterrence. We welcome your feedback on this issue as well.
The Model Itself
And now, we move from the content to the “full nerd” portion of the program. Feel free to skip it if it is not your jam.
Below, we are providing you with the full model specification. We are also providing you with a sample season baserunning dataset and list of proposed run values. We hope that as many of you as possible will run the model for yourselves in R, or even just take a look at raw summaries, and give us your feedback. What do you think the model does well or less well? Are you able to “break” the model in some situations? (We get excited when people break things). Does the model seem to deal with some situations better than others? Do you have optimizations to suggest? We welcome all of your ideas.
The model is complex, and those who are not familiar with the brms front-end to Stan may not know quite what to make of it. But we’d love to teach those of you who are interested, or who just want to know more about modeling in Stan, so we will provide you with the model and engine specification, and then share a few pointers for those interested.
    library(brms)

    brr_ofa_hurdle_lead.mod <- brm(
      bf(
        # bases-taken component (applies when the runner is safe)
        bases_taken_code ~ 1 + s(ls_blend, la_blend, eb_blend) +
          (potential_tag_up || start_base : credited_pos_num) +
          (1|fielder_id_at_pos_num) +
          (credited_pos_num || outs_start) +
          runner_speed + (1|runner_id) + throwing_error,
        # hurdle component (probability the runner is out)
        hu ~ 1 + (1|fielder_id_at_pos_num) + s(ls_blend, la_blend) +
          (start_base || credited_pos_num) +
          (credited_pos_num || outs_start)
      ),
      data = other_br_plays,
      family = hurdle_cumulative(), # mixture distribution, logit link for hurdle
      prior = c(
        set_prior("normal(0, 5)", class = "b"),             # population effects prior
        set_prior("normal(0, 5)", class = "b", dpar = "hu") # same but for hurdle
      ),
      chains = 1, cores = 1,
      seed = 1234,
      warmup = 1000, iter = 2000,
      normalize = FALSE,
      control = list(max_treedepth = 12, adapt_delta = .95),
      backend = 'cmdstanr', # necessary for threading
      # grainsize is computed from all_bip_df, as in the original call;
      # that data frame must exist alongside other_br_plays
      threads = threading(8, static = TRUE,
                          grainsize = round(nrow(all_bip_df) / 128)),
      refresh = 100
    )
The predictors were described above. You will note, however, that this is a hierarchical model that contains both ordinary predictors and modeled predictors. The latter are the terms written in parentheses with a vertical bar, and we describe them as “modeled” because their values are themselves shrunk toward zero, which keeps the estimates conservative when the raw values would otherwise make no sense. Modeled predictors are also commonly known as random effects.
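For intuition about what that shrinkage does, here is a deliberately simplified sketch in base R. It uses the textbook normal-normal formula with fixed, made-up standard deviations (`sigma` for play-level noise, `tau` for the spread of true player effects). The actual model estimates these quantities from the data within Stan, so treat this as an analogy rather than a description of the fit.

```r
# Simplified partial pooling: shrink a raw per-play average toward zero.
# sigma and tau are hypothetical, fixed values chosen for illustration.
shrink_toward_zero <- function(raw_mean, n, sigma = 1, tau = 0.1) {
  # Weight on the raw value grows with the number of opportunities n
  weight <- n / (n + (sigma / tau)^2)
  weight * raw_mean
}

# Same raw average, very different sample sizes.
shrunk_small <- shrink_toward_zero(raw_mean = 0.05, n = 10)
shrunk_large <- shrink_toward_zero(raw_mean = 0.05, n = 1000)

print(c(small_sample = shrunk_small, large_sample = shrunk_large))
```

A runner or fielder with few opportunities is pulled most of the way back to zero, while one with many opportunities keeps most of the raw value.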
Some predictors also are better considered together. So, you will see examples where predictors are combined using what are known as random slopes. In plain English, it is not enough to simply find the average effect of the number of outs and the average effect of each starting base. You really need to combine them to get the full signal, AKA the “base-out state.” In traditional regression this would be called an “interaction”; random slopes are a more sophisticated way to achieve this effect while guarding against absurd values that can otherwise arise in small samples among the various possible combinations.
The brms front end allows us to fit multiple models at once, which is why you see two separate formulas, one for bases_taken_code, which is the number of bases (not) taken if the runner is safe, and one for hu, the hurdle component that dictates the probability of the runner being out. Remember from above that these two event types do not result from the same causes. We could fit the two models separately and probably get broadly similar results, but whenever you can fit related outcomes simultaneously, you should.
Beyond the substance, there are some pragmatic optimizations here also. In lieu of using multiple chains, which is ordinarily preferred, we use reduce-sum threading to run one Markov chain split into shards over all available CPUs. This is a much speedier way of fitting a model in Stan versus simply using multiple chains, particularly if you have eight CPUs or fewer. Ideally you would fit, say, eight threads each over four chains, but most of us don’t have 32 CPUs sitting around. If you do, godspeed.
We also set prior distributions on our traditional coefficients that are intended to keep the values within reason without unduly influencing them. This practice is sometimes called using “weakly informative priors.” We do not set prior distributions on the splines for batted ball quality or the various random effects: brms by default sets a Student-t distribution with three degrees of freedom scaled off the target variable for variance components, and frankly it is tough to outperform that default prior in most applications. So we leave it alone.
A few other things:
- We set max_treedepth deeper than the default value of 10, because smoothing splines usually require a tree depth of 12;
- The model is complicated and I would rather not increase the iterations, so we raise the adapt_delta from its default 0.8. If you leave the adapt_delta at the default value, you can just set the model to save more iterations, but you also have a higher risk of divergences, which can compromise the model output.
- For the threading with shards, we set static = TRUE for reproducibility and specify the grainsize to optimize the size of the shards, which can make a huge performance difference. If you want to know more about this strategy, there is a vignette that walks you through one way to evaluate it.
Replicate our Work!
We are putting together a sample dataset, script, and runs table to allow you to replicate our values for the 2023 season. We would be delighted to have readers run the model and comment on the outputs, including the final run values. We will advise when this is ready for you to test.
Conclusion
There are almost certainly questions you have that we did not cover, so do not hesitate to ask them. Furthermore, you don’t have to be a statistician to have gut reactions and good feedback. Either way, we hope you will reach out to us either in the comments below or on social media with your assessments and suggestions. As usual, our goal is to get this as right as possible, and our readers are an important part of us being able to do that.