Geometry and Stan
It is currently AFL finals (playoffs) season, and I felt the urge to revisit the set-shot analysis I did a while back. Unfortunately, while the data for this exist, and I imagine have vastly improved since my last look, one has to pay for them. So here I am on a Saturday afternoon, wanting to do some structural modelling of sports data as one does, and I’ve got my hands on some NFL field goal attempt data. This is going to look a lot like my attempt at this a few years ago, but now with newer data and a hierarchical model.
By the way, if you’re interested in this kind of stuff, check out Andrew Gelman’s post on golf putting, which is what got me into this in the first place.
I got the data from NFLSavant, which keeps play-by-play data from 2013-2024. I used the 2024 dataset (the most recent). I filtered the Description variable for FIELD GOAL, and classified most of the attempts into the following categories: GOOD, WIDE LEFT, WIDE RIGHT, SHORT, and BLOCKED (the full cleaning code is at the end of the post).
There were very few SHORT and BLOCKED instances. Because of this, I decided there wasn’t enough information in the data to reasonably learn anything about blocking or about making the distance, so I dropped those observations. What follows is therefore just a model of getting the kicking angle right. I acknowledge that not modeling these misses probably makes the model less interesting and useful, but these are the data I have.
Because of the very small number of misses due to not making the distance, what I am left with is modeling getting the angle right, or right enough. Fortunately, there is a fairly straightforward relationship between the distance of the field goal attempt and the angle a kicker needs to kick through: the angle gets smaller for longer attempts. Here is what the problem looks like for a 30-yard attempt:
The distance between the uprights is 18.5 feet, i.e. \(\frac{18.5}{3}\approx 6.17\) yards, and so for an attempt \(x\) yards from the goal, the angle \(\theta\) satisfies:
\[ \begin{aligned} \tan\left(\frac{\theta}{2}\right)&=\frac{18.5}{3\times 2\times x}\\ \frac{\theta}{2}&=\tan^{-1}\left(\frac{18.5}{3\times 2\times x}\right) \end{aligned} \]
So the angle of tolerance that a kicker faces looks like this:
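To make the geometry concrete, here is a minimal R sketch (mine, not part of the analysis code below) that just evaluates the formula above; for a 30-yard attempt the tolerance angle comes out to about 11.7 degrees:

# tolerance angle (in degrees) as a function of attempt distance (yards)
goalpost_width <- 18.5/3   # uprights are 18.5 feet apart, ~6.17 yards
theta <- function(x) 2*atan(goalpost_width/(2*x))*180/pi

theta(30)  # about 11.7 degrees for a 30-yard attempt

curve(theta(x), from = 10, to = 70,
      xlab = "Distance to goal (yards)",
      ylab = "Angle of tolerance (degrees)")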
Now we need to turn this geometry model into a statistical model. Suppose that the kick angle is distributed according to a normal distribution with standard deviation \(\sigma\), and centered on a straight kick. Then the probability of a successful field goal attempt will be:
\[ \Pr(\mathrm{good}\mid \mathrm{distance}=x)=\Phi\left(\frac{\theta(x)}{2\sigma}\right)-\Phi\left(-\frac{\theta(x)}{2\sigma}\right)=2\Phi\left(\frac{\theta(x)}{2\sigma}\right)-1 \]
where \(\Phi(\cdot)\) is the standard normal cdf. This is enough to write down a likelihood, but I wanted something a bit more flexible, so I went with a Student’s \(t\)-distribution instead.[1] Therefore, letting \(F_t(x\mid \nu,\sigma)\) be the cdf of the \(t\) distribution centered on zero with \(\nu\) degrees of freedom and spread parameter \(\sigma\), I will use: \[ \Pr(\mathrm{good}\mid \mathrm{distance}=x)=2F_t\left(\frac{\theta(x)}{2}\mid \nu,\sigma\right)-1 \]
So this will define my likelihood.
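As a sanity check outside of Stan, this likelihood is easy to evaluate in R: base R’s pt() gives the cdf of the standardized \(t\), so the spread parameter enters by rescaling the angle. A minimal sketch, with made-up parameter values:

# Pr(good | distance = x) = 2*F_t(theta(x)/2 | nu, sigma) - 1
pr_good <- function(x, sigma, nu) {
  half_angle <- atan((18.5/3)/(2*x))   # theta(x)/2, in radians
  2*pt(half_angle/sigma, df = nu) - 1  # F_t(a | nu, sigma) = pt(a/sigma, nu)
}
pr_good(40, sigma = 0.05, nu = 25)  # roughly 0.86 for these made-up values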
As I will be using Bayesian estimation for this, I also need to choose priors for \(\sigma\) and \(\nu\). For the degrees of freedom parameter \(\nu\) I went with:
\[ \log\nu\sim N(\log(15),1^2) \] I chose this because the Student-\(t\) distribution starts to look a lot like a normal distribution at about \(\nu>30\), so pinning down the median of the prior distribution to half this seemed to make sense.
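A quick look at what this prior implies, using R’s lognormal functions:

# quantiles of nu under log(nu) ~ N(log(15), 1)
qlnorm(c(0.025, 0.5, 0.975), meanlog = log(15), sdlog = 1)  # about 2.1, 15, 106
# prior mass on the "effectively normal" region nu > 30
1 - plnorm(30, meanlog = log(15), sdlog = 1)  # about 0.24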
For \(\sigma\), I explored a few prior choices (holding the prior for \(\nu\) constant) and their implications for the model’s predictions using the following Stan program, which I also use for estimation:
data {
  int<lower=0> N;
  vector[N] dist;
  int good[N];
  vector[2] prior_sigma;
  vector[2] prior_nu;
  int<lower=0,upper=1> UseData;
}
transformed data {
  real goalpost_width = 18.5/3.0; // in yards
  vector[N] angle = atan(goalpost_width/(2.0*dist));
}
parameters {
  real<lower=0> sigma;
  real<lower=0> nu;
}
model {
  if (UseData==1) {
    vector[N] pr_angle;
    for (ii in 1:N) {
      pr_angle[ii] = 2.0*student_t_cdf(angle[ii], nu, 0.0, sigma)-1.0;
    }
    target += bernoulli_lpmf(good | pr_angle);
  }
  target += lognormal_lpdf(sigma | prior_sigma[1],prior_sigma[2]);
  target += lognormal_lpdf(nu | prior_nu[1],prior_nu[2]);
}
generated quantities {
  // predicted probability
  vector[100] pr;
  for (ii in 1:100) {
    pr[ii] = 2.0*student_t_cdf(atan(goalpost_width/(2.0*ii)), nu, 0.0, sigma)-1.0;
  }
}
Here, I can turn off the likelihood contribution by setting UseData=0. Then the predictions pr in the generated quantities block are draws from the prior. Eventually, I settled on:
\[ \log\sigma\sim N(-2,1^2) \] This puts the prior median of \(\sigma\) at \(e^{-2}\approx 0.135\) radians (about 7.7 degrees), and it generates predictions like this:
prior_predictive <-read.csv("individual_prior_predictive.csv")
(
ggplot(prior_predictive,aes(x=distance))
+geom_line(aes(y=X50.,linetype="median"),linewidth=1)
+geom_line(aes(y=mean,linetype="mean"),linewidth=1)
+geom_ribbon(aes(ymin = X25.,ymax=X75.),alpha=0.5)
+geom_ribbon(aes(ymin = X2.5.,ymax=X97.5.),alpha=0.5)
+theme_bw()
+xlab("Distance to goal (yards)")
+ylab("Probability of success")
)
Here the shaded regions show 50% and 95% Bayesian prior credible regions. This looked about right to me (without looking too much at the data), so I settled on these priors.
First, I estimate a pooled model with a single \(\nu\) and \(\sigma\) for all of the data. Here are the estimates:
Fit<-"individual.csv" |>
read.csv() |>
mutate(
distance = par |> parse_number()
)
Fit |>
filter(is.na(distance)) |>
select(-distance,-X) |>
kbl(digits=3) |>
kable_classic(full_width=FALSE) |>
add_header_above(c("","","","","percentiles"=5,"",""))
par | mean | se_mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | n_eff | Rhat
---|---|---|---|---|---|---|---|---|---|---
sigma | 0.048 | 0.000 | 0.002 | 0.044 | 0.047 | 0.048 | 0.050 | 0.052 | 1022.299 | 1.003
nu | 35.523 | 0.946 | 34.647 | 7.539 | 15.194 | 24.698 | 43.019 | 124.870 | 1342.192 | 1.002
lp__ | -389.363 | 0.027 | 0.940 | -391.875 | -389.731 | -389.096 | -388.692 | -388.425 | 1253.353 | 1.002
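For a rough sense of scale, \(\sigma\) is measured in radians, so the posterior mean corresponds to a directional standard deviation of roughly:

sigma_hat <- 0.048   # posterior mean of sigma (radians)
sigma_hat*180/pi     # about 2.8 degrees of directional spread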
But the parameters themselves are quite hard to interpret, so let’s translate them into predictions:
d<-"cleaned_data.csv" |>
read.csv()
d_averages<-d |>
group_by(Distance) |>
summarize(
good = mean(Outcome=="GOOD"),
n=n()
)
(
ggplot(Fit |> filter(!is.na(distance)),aes(x=distance))
+geom_point(data=d_averages,aes(x=Distance,y=good,size=n),alpha=0.5)
+geom_line(aes(y=mean),color="red")
+geom_ribbon(aes(ymin = X25.,ymax=X75.),alpha=0.2,fill="red")
+geom_ribbon(aes(ymin = X2.5.,ymax=X97.5.),alpha=0.2,fill="red")
+geom_smooth(data=d,aes(x=Distance,y=1*(Outcome=="GOOD")),
method = "glm",
method.args = list(family = "binomial"))
+theme_bw()
+xlab("Distance to goal (yards)")
+ylab("Probability of success")
+ylim(c(0,1))
)
Here the red stuff is from the model, showing the mean (line) and 50% and 95% credible regions. The data are shown as black dots, and I put the logit line of best fit (blue) through there as well. While I don’t really believe that there is such a high probability of success for an attempt from (say) 75 yards, there are no data out there to inform us about it. If there were, I suspect there would be a lot more misses due to not making the distance, and I would have been able to include them explicitly in the model.
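As a back-of-envelope check, plugging the posterior means from the table above into the likelihood formula should roughly reproduce the red curve; for example, at 40 yards (a plug-in sketch that ignores posterior uncertainty):

sigma_hat <- 0.048
nu_hat    <- 35.5
half_angle <- atan((18.5/3)/(2*40))          # theta(40)/2
2*pt(half_angle/sigma_hat, df = nu_hat) - 1  # about 0.88, in line with the red curve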
Of course, treating all observations as equal is not necessarily a good idea, especially in this case where the data are generated by different kickers. I therefore also implemented a hierarchical model, in which every kicker’s \(\sigma\) and \(\nu\) are draws from a (transformed) multivariate normal distribution. In the generated quantities block, I calculate each kicker’s predicted probability of making a 40-yard field goal (as there were a lot of data around that distance). Here is the Stan program I wrote to estimate this model:
data {
  int<lower=0> N;
  int nkickers;
  int id[N];
  vector[N] dist;
  int good[N];
  vector[2] prior_MU[2];
  vector[2] prior_TAU;
  real prior_Omega;
}
transformed data {
  real goalpost_width = 18.5/3.0; // in yards
  vector[N] angle = atan(goalpost_width/(2.0*dist));
}
parameters {
  vector[2] MU;
  vector<lower=0.0>[2] TAU;
  cholesky_factor_corr[2] L_Omega;
  matrix[2,nkickers] z;
}
transformed parameters {
  vector[nkickers] sigma;
  vector[nkickers] nu;
  {
    matrix[2,nkickers] theta = rep_matrix(MU,nkickers)
      +diag_pre_multiply(TAU,L_Omega)*z;
    sigma = exp(theta[1,]');
    nu = exp(theta[2,]');
  }
}
model {
  vector[N] pr_angle;
  for (ii in 1:N) {
    pr_angle[ii] = 2.0*student_t_cdf(angle[ii], nu[id[ii]], 0.0, sigma[id[ii]])-1.0;
  }
  target += bernoulli_lpmf(good | pr_angle);
  for (pp in 1:2) {
    target += normal_lpdf(MU[pp] | prior_MU[pp][1],prior_MU[pp][2]);
    target += cauchy_lpdf(TAU[pp] | 0.0, prior_TAU[pp]);
  }
  target += lkj_corr_cholesky_lpdf(L_Omega | prior_Omega);
  target += std_normal_lpdf(to_vector(z));
}
generated quantities {
  // predicted probability
  vector[nkickers] pr40;
  for (ii in 1:nkickers) {
    pr40[ii] = 2.0*student_t_cdf(atan(goalpost_width/(2.0*40.0)), nu[ii], 0.0, sigma[ii])-1.0;
  }
}
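In equations, the transformed parameters block implements the usual non-centered parameterization: for kicker \(k\),

\[ \begin{pmatrix}\log\sigma_k\\ \log\nu_k\end{pmatrix} = \mathrm{MU}+\mathrm{diag}(\mathrm{TAU})\,L_\Omega z_k,\qquad z_k\sim N(0,I_2) \]

so that \((\log\sigma_k,\log\nu_k)'\) is multivariate normal with mean \(\mathrm{MU}\) and covariance \(\mathrm{diag}(\mathrm{TAU})\,\Omega\,\mathrm{diag}(\mathrm{TAU})\), where \(\Omega=L_\Omega L_\Omega'\). Sampling \(z\) on the standard normal scale and transforming tends to be much easier on the sampler than sampling the kicker-level parameters directly.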
Here are the shrinkage estimates of the probability of making a 40-yard field goal for all 51 kickers in the dataset (partial pooling pulls kickers with few attempts toward the league-wide average):
pr40<-"hierarchical.csv" |>
read.csv() |>
filter(grepl("pr40",par)) |>
mutate(
id = par |> str_replace("pr40","") |> parse_number()
) |>
full_join(
"idlookup.csv" |> read.csv(),
by="id"
) |>
mutate(
cumul = rank(mean)/n()-0.5/n()
)
(
ggplot(pr40,aes(x=mean))
+stat_ecdf()
+geom_errorbar(aes(y=cumul,xmin=X25.,xmax=X75.),alpha=0.5)
+theme_bw()
+xlab("Probability of a successful 40-yard attempt")
+ylab("Cumulative density of posterior mean")
)
And here is a “top 10” list of field goal kickers based on the hierarchical model, ranked by their predicted probability of making a 40-yard field goal. Here raw.mean lists each player’s fraction of successful attempts (from any distance). Note that this ranking is not simply a ranking of the raw means: for example, C.Santos made all 21 of his attempts, but ranks just below C.Boswell, whose 43 attempts carry more information.
pr40 |>
arrange(-mean) |>
select(OffenseTeam,Kicker,raw.mean,mean,sd,n) |>
rename(attempts = n) |>
head(n=10) |>
kbl(digits=3) |>
kable_classic(full_width=FALSE) |>
add_header_above(c("","","","estimates"=2,""))
OffenseTeam | Kicker | raw.mean | posterior mean | posterior sd | attempts
---|---|---|---|---|---
PIT | 9-C.BOSWELL | 0.953 | 0.929 | 0.034 | 43
CHI | 8-C.SANTOS | 1.000 | 0.927 | 0.041 | 21
SEA | 5-J.MYERS | 0.963 | 0.924 | 0.038 | 27
WAS | 3-A.SEIBERT | 1.000 | 0.922 | 0.042 | 27
DEN | 3-W.LUTZ | 0.939 | 0.914 | 0.038 | 33
LAC | 11-C.DICKER | 0.932 | 0.913 | 0.035 | 44
TEN | 6-N.FOLK | 0.955 | 0.911 | 0.041 | 22
DAL | 17-B.AUBREY | 0.909 | 0.910 | 0.035 | 44
TB | 4-C.MCLAUGHLIN | 0.941 | 0.910 | 0.039 | 34
DET | 39-J.BATES | 0.931 | 0.904 | 0.041 | 29
In case you’re interested, here is the bottom ten list as well. (C.York appears twice because he attempted kicks for two teams; the model identifies kickers by name, so both rows share the same pooled estimate.)
pr40 |>
arrange(mean) |>
select(OffenseTeam,Kicker,raw.mean,mean,sd,n) |>
rename(attempts = n) |>
head(n=10) |>
kbl(digits=3) |>
kable_classic(full_width=FALSE) |>
add_header_above(c("","","","estimates"=2,""))
OffenseTeam | Kicker | raw.mean | posterior mean | posterior sd | attempts
---|---|---|---|---|---
NYJ | 9-G.ZUERLEIN | 0.643 | 0.794 | 0.070 | 14
CLE | 7-D.HOPKINS | 0.692 | 0.798 | 0.060 | 26
GB | 44-B.NARVESON | 0.706 | 0.813 | 0.061 | 17
CIN | 3-C.YORK | 0.818 | 0.816 | 0.068 | 11
WAS | 3-C.YORK | 0.000 | 0.816 | 0.068 | 2
SF | 4-J.MOODY | 0.750 | 0.821 | 0.054 | 32
CIN | 2-E.MCPHERSON | 0.727 | 0.831 | 0.056 | 22
ATL | 6-Y.KOO | 0.781 | 0.833 | 0.050 | 32
BAL | 9-J.TUCKER | 0.750 | 0.837 | 0.051 | 32
NE | 13-J.SLYE | 0.839 | 0.841 | 0.050 | 31
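For completeness, here is the full code to download the data, clean it, and fit both models: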
library(tidyverse)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
dFG<-"https://nflsavant.com/pbp_data.php?year=2024" |>
read.csv() |>
filter(grepl("FIELD GOAL",Description)) |>
mutate(
Kicker = str_split_i(Description," ",2),
Outcome = ifelse(grepl("IS GOOD",Description),"GOOD",ifelse(grepl("IS BLOCKED",Description),"BLOCKED",str_split_i(Description,", ",2))),
Distance = str_split_i(Description," ",3) |> parse_number()
) |>
select(
GameId,GameDate,Description, OffenseTeam,Kicker,Outcome,YardLine,Distance
) |>
filter(!is.na(Distance)) |>
mutate(
Outcome = ifelse(Outcome=="HIT LEFT UPRIGHT","WIDE LEFT",Outcome),
Outcome = ifelse(Outcome=="HIT RIGHT UPRIGHT","WIDE RIGHT",Outcome),
Outcome = ifelse(Outcome=="HIT CROSSBAR","SHORT",Outcome)
) |>
filter(Outcome=="GOOD" | Outcome=="SHORT" | Outcome=="WIDE LEFT" | Outcome=="WIDE RIGHT" | Outcome=="BLOCKED")
d<-dFG |>
filter(Outcome!="SHORT" & Outcome!="BLOCKED") |>
mutate(
id = Kicker |> as.factor() |> as.integer()
)
d |>
write.csv("_posts/2025-09-20-nfl-field-goal-attempts/cleaned_data.csv")
idlookup<-d |>
group_by(OffenseTeam,Kicker,id) |>
summarize(
n = n(),
`raw mean` = mean(Outcome=="GOOD"),
`av dist` = mean(Distance)
)
idlookup |>
write.csv("_posts/2025-09-20-nfl-field-goal-attempts/idlookup.csv")
# prior predictive check
individual<-"_posts/2025-09-20-nfl-field-goal-attempts/individual.stan" |>
stan_model()
dStan<-list(
N = dim(d)[1],
dist = d$Distance,
good = d$Outcome=="GOOD",
prior_sigma = c(-2,1),
prior_nu = c(log(15),1),
UseData=0
)
Fit<-individual |>
sampling(
data=dStan,
seed=42
)
prior_predictive<-summary(Fit)$summary |>
data.frame() |>
rownames_to_column(var = "par") |>
mutate(
distance = par |> parse_number()
) |>
filter(!is.na(distance))
prior_predictive |>
write.csv("_posts/2025-09-20-nfl-field-goal-attempts/individual_prior_predictive.csv")
(
ggplot(prior_predictive,aes(x=distance))
+geom_line(aes(y=X50.,linetype="median"),linewidth=1)
+geom_line(aes(y=mean,linetype="mean"),linewidth=1)
+geom_ribbon(aes(ymin = X25.,ymax=X75.),alpha=0.5)
+geom_ribbon(aes(ymin = X2.5.,ymax=X97.5.),alpha=0.5)
+theme_bw()
+xlab("Distance to goal")
+ylab("Probability of success")
)
#-------------------------------------------------------------------------------
# Pooled model
#-------------------------------------------------------------------------------
dStan<-list(
N = dim(d)[1],
dist = d$Distance,
good = d$Outcome=="GOOD",
prior_sigma = c(-2,1),
prior_nu = c(log(15),1),
UseData=1
)
Fit<-individual |>
sampling(
data=dStan,
seed=42
)
summary(Fit)$summary |>
data.frame() |>
rownames_to_column(var="par") |>
write.csv("_posts/2025-09-20-nfl-field-goal-attempts/individual.csv")
fit<-summary(Fit)$summary |>
data.frame() |>
rownames_to_column(var = "par") |>
mutate(
distance = par |> parse_number()
) |>
filter(!is.na(distance))
d_averages<-d |>
group_by(Distance) |>
summarize(
good = mean(Outcome=="GOOD"),
n=n()
)
(
ggplot(fit,aes(x=distance))
+geom_point(data=d_averages,aes(x=Distance,y=good,size=n),alpha=0.5)
+geom_line(aes(y=mean),linewidth=1,color="red")
+geom_ribbon(aes(ymin = X25.,ymax=X75.),alpha=0.2,fill="red")
+geom_ribbon(aes(ymin = X2.5.,ymax=X97.5.),alpha=0.2,fill="red")
+geom_smooth(data=d,aes(x=Distance,y=1*(Outcome=="GOOD")),
method = "glm",
method.args = list(family = "binomial"))
+theme_bw()
+xlab("Distance to goal (yards)")
+ylab("Probability of success")
+ylim(c(0,1))
)
file<-"_posts/2025-09-20-nfl-field-goal-attempts/hierarchical.csv"
if (!file.exists(file)) {
hierarchical<-"_posts/2025-09-20-nfl-field-goal-attempts/hierarchical.stan" |>
stan_model()
dStan<-list(
N = dim(d)[1],
nkickers = d$id |> unique() |> length(),
id = d$id,
dist = d$Distance,
good = d$Outcome=="GOOD",
prior_MU = list(
c(-2,1),
c(log(15),1)
),
prior_TAU = c(1,1),
prior_Omega = 4
)
Fit<-hierarchical |>
sampling(
data=dStan,
control = list(adapt_delta = 0.99),
seed=42,
par = "z",include=FALSE
)
summary(Fit)$summary |>
data.frame() |>
rownames_to_column(var = "par") |>
write.csv(file)
}
[1] This turned out to be a really important form of flexibility when I was modeling set shots in Australian football, where the geometry is more complicated and the data are richer, but it didn’t really produce anything interesting here.