Geometry and Stan
It is currently AFL finals (playoffs) season, and I felt the urge to revisit the set-shot analysis I did a while back. Unfortunately, while the data for this exist, and I imagine have vastly improved since my last look, one has to pay for them. So here I am on a Saturday afternoon, wanting to do some structural modelling of sports data as one does, and I’ve got my hands on some NFL field goal attempt data. This is going to look a lot like my attempt at this a few years ago, but now with newer data and a hierarchical model.
By the way, if you’re interested in this kind of stuff, check out Andrew Gelman’s post on golf putting, which is what got me into this in the first place.
I got the data from NFLSavant, which keeps play-by-play data from 2013-2024. I used the 2024 dataset (the most recent). I filtered the Description variable for FIELD GOAL, and classified most of the attempts into the following categories: GOOD, WIDE LEFT, WIDE RIGHT, SHORT, and BLOCKED (the full cleaning code is at the end of the post).
There were very few SHORT and BLOCKED instances. Because of this, I decided there wasn’t enough information in the data to reasonably learn anything about blocking or about making the distance, so I dropped those observations. What follows is therefore just a model of getting the kicking angle right. I acknowledge that not modeling these misses probably makes the model less interesting and useful, but these are the data I have.
Because of the very small number of misses due to not making the distance, what I am left with is modeling getting the angle right, or right enough. Fortunately, there is a fairly straightforward relationship between the distance of the field goal attempt and the angle a kicker needs to kick through: the angle gets smaller for longer attempts. Here is what the problem looks like for a 30-yard attempt:
The distance between the uprights is 18.5 feet, i.e. \(\frac{18.5}{3}\approx 6.17\) yards, and so for an attempt \(x\) yards from the goal, the angle \(\theta\) satisfies:
\[ \begin{aligned} \tan\left(\frac{\theta}{2}\right)&=\frac{18.5}{3\times 2\times x}\\ \frac{\theta}{2}&=\tan^{-1}\left(\frac{18.5}{3\times 2\times x}\right) \end{aligned} \]
So the angle of tolerance that a kicker faces looks like this:
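To make the geometry concrete, here is a minimal R sketch (mine, not part of the analysis code below) that just evaluates the formula above; for a 30-yard attempt the tolerance angle comes out to about 11.7 degrees:

# tolerance angle (in degrees) as a function of attempt distance (yards)
goalpost_width <- 18.5/3   # uprights are 18.5 feet apart, ~6.17 yards
theta <- function(x) 2*atan(goalpost_width/(2*x))*180/pi

theta(30)  # about 11.7 degrees for a 30-yard attempt

curve(theta(x), from = 10, to = 70,
      xlab = "Distance to goal (yards)",
      ylab = "Angle of tolerance (degrees)")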
Now we need to turn this geometry model into a statistical model. Suppose that the kick angle is distributed according to a normal distribution with standard deviation \(\sigma\), and centered on a straight kick. Then the probability of a successful field goal attempt will be:
\[ \Pr(\mathrm{good}\mid \mathrm{distance}=x)=\Phi\left(\frac{\theta(x)}{2\sigma}\right)-\Phi\left(-\frac{\theta(x)}{2\sigma}\right)=2\Phi\left(\frac{\theta(x)}{2\sigma}\right)-1 \]
where \(\Phi(\cdot)\) is the standard normal cdf. This is enough to write down a likelihood, but I wanted something a bit more flexible, so I went with a Student’s \(t\)-distribution instead.[1] Therefore, letting \(F_t(x\mid \nu,\sigma)\) be the cdf of the \(t\) distribution centered on zero with \(\nu\) degrees of freedom and spread parameter \(\sigma\), I will use: \[ \Pr(\mathrm{good}\mid \mathrm{distance}=x)=2F_t\left(\frac{\theta(x)}{2}\mid \nu,\sigma\right)-1 \]
So this will define my likelihood.
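As a sanity check outside of Stan, this likelihood is easy to evaluate in R: base R’s pt() gives the cdf of the standardized \(t\), so the spread parameter enters by rescaling the angle. A minimal sketch, with made-up parameter values:

# Pr(good | distance = x) = 2*F_t(theta(x)/2 | nu, sigma) - 1
pr_good <- function(x, sigma, nu) {
  half_angle <- atan((18.5/3)/(2*x))   # theta(x)/2, in radians
  2*pt(half_angle/sigma, df = nu) - 1  # F_t(a | nu, sigma) = pt(a/sigma, nu)
}
pr_good(40, sigma = 0.05, nu = 25)  # roughly 0.86 for these made-up values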
As I will be using Bayesian estimation for this, I also need to choose priors for \(\sigma\) and \(\nu\). For the degrees of freedom parameter \(\nu\) I went with:
\[ \log\nu\sim N(\log(15),1^2) \] I chose this because the Student-\(t\) distribution starts to look a lot like a normal distribution at about \(\nu>30\), so pinning down the median of the prior distribution to half this seemed to make sense.
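A quick look at what this prior implies, using R’s lognormal functions:

# quantiles of nu under log(nu) ~ N(log(15), 1)
qlnorm(c(0.025, 0.5, 0.975), meanlog = log(15), sdlog = 1)  # about 2.1, 15, 106
# prior mass on the "effectively normal" region nu > 30
1 - plnorm(30, meanlog = log(15), sdlog = 1)  # about 0.24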
For \(\sigma\), I explored a few prior choices (holding the prior for \(\nu\) constant) and their implications for the model’s predictions using the following Stan program, which I also use for estimation:
data {
  int<lower=0> N;
  vector[N] dist;
  int good[N];
  vector[2] prior_sigma;
  vector[2] prior_nu;
  int<lower=0,upper=1> UseData;
}
transformed data {
  real goalpost_width = 18.5/3.0; // in yards
  vector[N] angle = atan(goalpost_width/(2.0*dist));
}
parameters {
  real<lower=0> sigma;
  real<lower=0> nu;
}
model {
  if (UseData==1) {
    vector[N] pr_angle;
    for (ii in 1:N) {
      pr_angle[ii] = 2.0*student_t_cdf(angle[ii], nu, 0.0, sigma)-1.0;
    }
    target += bernoulli_lpmf(good | pr_angle);
  }
  target += lognormal_lpdf(sigma | prior_sigma[1],prior_sigma[2]);
  target += lognormal_lpdf(nu | prior_nu[1],prior_nu[2]);
}
generated quantities {
  // predicted probability
  vector[100] pr;
  for (ii in 1:100) {
    pr[ii] = 2.0*student_t_cdf(atan(goalpost_width/(2.0*ii)), nu, 0.0, sigma)-1.0;
  }
}
Here, I can turn off the likelihood contribution by setting UseData=0. Then the predictions pr in the generated quantities block are draws from the prior. Eventually, I settled on:
\[ \log\sigma\sim N(-2,1^2) \] This puts the prior median of \(\sigma\) at \(e^{-2}\approx 0.135\) radians (about 7.7 degrees), and it generates predictions like this:
prior_predictive <-read.csv("individual_prior_predictive.csv")
(
ggplot(prior_predictive,aes(x=distance))
+geom_line(aes(y=X50.,linetype="median"),linewidth=1)
+geom_line(aes(y=mean,linetype="mean"),linewidth=1)
+geom_ribbon(aes(ymin = X25.,ymax=X75.),alpha=0.5)
+geom_ribbon(aes(ymin = X2.5.,ymax=X97.5.),alpha=0.5)
+theme_bw()
+xlab("Distance to goal (yards)")
+ylab("Probability of success")
)
Here the shaded regions show 50% and 95% Bayesian prior credible regions. This looked about right to me (without looking too much at the data), so I settled on these priors.
First, I estimate a pooled model with a single \(\nu\) and \(\sigma\) for all of the data. Here are the estimates:
Fit<-"individual.csv" |>
read.csv() |>
mutate(
distance = par |> parse_number()
)
Fit |>
filter(is.na(distance)) |>
select(-distance,-X) |>
kbl(digits=3) |>
kable_classic(full_width=FALSE) |>
add_header_above(c("","","","","percentiles"=5,"",""))
par | mean | se_mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | n_eff | Rhat
---|---|---|---|---|---|---|---|---|---|---
sigma | 0.048 | 0.000 | 0.002 | 0.044 | 0.047 | 0.048 | 0.050 | 0.052 | 1022.299 | 1.003
nu | 35.523 | 0.946 | 34.647 | 7.539 | 15.194 | 24.698 | 43.019 | 124.870 | 1342.192 | 1.002
lp__ | -389.363 | 0.027 | 0.940 | -391.875 | -389.731 | -389.096 | -388.692 | -388.425 | 1253.353 | 1.002
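For a rough sense of scale, \(\sigma\) is measured in radians, so the posterior mean corresponds to a directional standard deviation of roughly:

sigma_hat <- 0.048   # posterior mean of sigma (radians)
sigma_hat*180/pi     # about 2.8 degrees of directional spread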
But the parameters themselves are quite hard to interpret, so let’s translate them into predictions:
d<-"cleaned_data.csv" |>
read.csv()
d_averages<-d |>
group_by(Distance) |>
summarize(
good = mean(Outcome=="GOOD"),
n=n()
)
(
ggplot(Fit |> filter(!is.na(distance)),aes(x=distance))
+geom_point(data=d_averages,aes(x=Distance,y=good,size=n),alpha=0.5)
+geom_line(aes(y=mean),color="red")
+geom_ribbon(aes(ymin = X25.,ymax=X75.),alpha=0.2,fill="red")
+geom_ribbon(aes(ymin = X2.5.,ymax=X97.5.),alpha=0.2,fill="red")
+geom_smooth(data=d,aes(x=Distance,y=1*(Outcome=="GOOD")),
method = "glm",
method.args = list(family = "binomial"))
+theme_bw()
+xlab("Distance to goal (yards)")
+ylab("Probability of success")
+ylim(c(0,1))
)
Here the red stuff is from the model, showing the mean (line) and 50% and 95% credible regions. The data are shown as black dots, and I put the logit line of best fit (blue) through there as well. While I don’t really believe that there is such a high probability of success for an attempt from (say) 75 yards, there are no data out there to inform us about it. If there were, I suspect there would be a lot more misses due to not making the distance, and I would have been able to include them explicitly in the model.
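As a back-of-envelope check, plugging the posterior means from the table above into the likelihood formula should roughly reproduce the red curve; for example, at 40 yards (a plug-in sketch that ignores posterior uncertainty):

sigma_hat <- 0.048
nu_hat    <- 35.5
half_angle <- atan((18.5/3)/(2*40))          # theta(40)/2
2*pt(half_angle/sigma_hat, df = nu_hat) - 1  # about 0.88, in line with the red curve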
Of course, treating all observations as equal is not necessarily a good idea, especially in this case where the data are generated by different kickers. I therefore also implemented a hierarchical model, in which every kicker’s \(\sigma\) and \(\nu\) are draws from a (transformed) multivariate normal distribution. In the generated quantities block, I calculate each kicker’s predicted probability of making a 40-yard field goal (as there were a lot of data around that distance). Here is the Stan program I wrote to estimate this model:
data {
  int<lower=0> N;
  int nkickers;
  int id[N];
  vector[N] dist;
  int good[N];
  vector[2] prior_MU[2];
  vector[2] prior_TAU;
  real prior_Omega;
}
transformed data {
  real goalpost_width = 18.5/3.0; // in yards
  vector[N] angle = atan(goalpost_width/(2.0*dist));
}
parameters {
  vector[2] MU;
  vector<lower=0.0>[2] TAU;
  cholesky_factor_corr[2] L_Omega;
  matrix[2,nkickers] z;
}
transformed parameters {
  vector[nkickers] sigma;
  vector[nkickers] nu;
  {
    matrix[2,nkickers] theta = rep_matrix(MU,nkickers)
      +diag_pre_multiply(TAU,L_Omega)*z;
    sigma = exp(theta[1,]');
    nu = exp(theta[2,]');
  }
}
model {
  vector[N] pr_angle;
  for (ii in 1:N) {
    pr_angle[ii] = 2.0*student_t_cdf(angle[ii], nu[id[ii]], 0.0, sigma[id[ii]])-1.0;
  }
  target += bernoulli_lpmf(good | pr_angle);
  for (pp in 1:2) {
    target += normal_lpdf(MU[pp] | prior_MU[pp][1],prior_MU[pp][2]);
    target += cauchy_lpdf(TAU[pp] | 0.0, prior_TAU[pp]);
  }
  target += lkj_corr_cholesky_lpdf(L_Omega | prior_Omega);
  target += std_normal_lpdf(to_vector(z));
}
generated quantities {
  // predicted probability
  vector[nkickers] pr40;
  for (ii in 1:nkickers) {
    pr40[ii] = 2.0*student_t_cdf(atan(goalpost_width/(2.0*40.0)), nu[ii], 0.0, sigma[ii])-1.0;
  }
}
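In equations, the transformed parameters block implements the usual non-centered parameterization: for kicker \(k\),

\[ \begin{pmatrix}\log\sigma_k\\ \log\nu_k\end{pmatrix} = \mathrm{MU}+\mathrm{diag}(\mathrm{TAU})\,L_\Omega z_k,\qquad z_k\sim N(0,I_2) \]

so that \((\log\sigma_k,\log\nu_k)'\) is multivariate normal with mean \(\mathrm{MU}\) and covariance \(\mathrm{diag}(\mathrm{TAU})\,\Omega\,\mathrm{diag}(\mathrm{TAU})\), where \(\Omega=L_\Omega L_\Omega'\). Sampling \(z\) on the standard normal scale and transforming tends to be much easier on the sampler than sampling the kicker-level parameters directly.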
Here are the shrinkage estimates of the probability of making a 40-yard field goal for all 51 kickers in the dataset (partial pooling pulls kickers with few attempts toward the league-wide average):
pr40<-"hierarchical.csv" |>
read.csv() |>
filter(grepl("pr40",par)) |>
mutate(
id = par |> str_replace("pr40","") |> parse_number()
) |>
full_join(
"idlookup.csv" |> read.csv(),
by="id"
) |>
mutate(
cumul = rank(mean)/n()-0.5/n()
)
(
ggplot(pr40,aes(x=mean))
+stat_ecdf()
+geom_errorbar(aes(y=cumul,xmin=X25.,xmax=X75.),alpha=0.5)
+theme_bw()
+xlab("Probability of a successful 40-yard attempt")
+ylab("Cumulative density of posterior mean")
)
And here is a “top 10” list of field goal kickers based on the hierarchical model, ranked by their predicted probability of making a 40-yard field goal. Here raw.mean lists each player’s fraction of successful attempts (from any distance). Note that this ranking is not simply a ranking of the raw means: for example, C.Santos made all 21 of his attempts, but ranks just below C.Boswell, whose 43 attempts carry more information.
pr40 |>
arrange(-mean) |>
select(OffenseTeam,Kicker,raw.mean,mean,sd,n) |>
rename(attempts = n) |>
head(n=10) |>
kbl(digits=3) |>
kable_classic(full_width=FALSE) |>
add_header_above(c("","","","estimates"=2,""))
OffenseTeam | Kicker | raw.mean | posterior mean | posterior sd | attempts
---|---|---|---|---|---
PIT | 9-C.BOSWELL | 0.953 | 0.929 | 0.034 | 43
CHI | 8-C.SANTOS | 1.000 | 0.927 | 0.041 | 21
SEA | 5-J.MYERS | 0.963 | 0.924 | 0.038 | 27
WAS | 3-A.SEIBERT | 1.000 | 0.922 | 0.042 | 27
DEN | 3-W.LUTZ | 0.939 | 0.914 | 0.038 | 33
LAC | 11-C.DICKER | 0.932 | 0.913 | 0.035 | 44
TEN | 6-N.FOLK | 0.955 | 0.911 | 0.041 | 22
DAL | 17-B.AUBREY | 0.909 | 0.910 | 0.035 | 44
TB | 4-C.MCLAUGHLIN | 0.941 | 0.910 | 0.039 | 34
DET | 39-J.BATES | 0.931 | 0.904 | 0.041 | 29
In case you’re interested, here is the bottom ten list as well. (C.York appears twice because he attempted kicks for two teams; the model identifies kickers by name, so both rows share the same pooled estimate.)
pr40 |>
arrange(mean) |>
select(OffenseTeam,Kicker,raw.mean,mean,sd,n) |>
rename(attempts = n) |>
head(n=10) |>
kbl(digits=3) |>
kable_classic(full_width=FALSE) |>
add_header_above(c("","","","estimates"=2,""))
OffenseTeam | Kicker | raw.mean | posterior mean | posterior sd | attempts
---|---|---|---|---|---
NYJ | 9-G.ZUERLEIN | 0.643 | 0.794 | 0.070 | 14
CLE | 7-D.HOPKINS | 0.692 | 0.798 | 0.060 | 26
GB | 44-B.NARVESON | 0.706 | 0.813 | 0.061 | 17
CIN | 3-C.YORK | 0.818 | 0.816 | 0.068 | 11
WAS | 3-C.YORK | 0.000 | 0.816 | 0.068 | 2
SF | 4-J.MOODY | 0.750 | 0.821 | 0.054 | 32
CIN | 2-E.MCPHERSON | 0.727 | 0.831 | 0.056 | 22
ATL | 6-Y.KOO | 0.781 | 0.833 | 0.050 | 32
BAL | 9-J.TUCKER | 0.750 | 0.837 | 0.051 | 32
NE | 13-J.SLYE | 0.839 | 0.841 | 0.050 | 31
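For completeness, here is the full code to download the data, clean it, and fit both models: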
library(tidyverse)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
dFG<-"https://nflsavant.com/pbp_data.php?year=2024" |>
read.csv() |>
filter(grepl("FIELD GOAL",Description)) |>
mutate(
Kicker = str_split_i(Description," ",2),
Outcome = ifelse(grepl("IS GOOD",Description),"GOOD",ifelse(grepl("IS BLOCKED",Description),"BLOCKED",str_split_i(Description,", ",2))),
Distance = str_split_i(Description," ",3) |> parse_number()
) |>
select(
GameId,GameDate,Description, OffenseTeam,Kicker,Outcome,YardLine,Distance
) |>
filter(!is.na(Distance)) |>
mutate(
Outcome = ifelse(Outcome=="HIT LEFT UPRIGHT","WIDE LEFT",Outcome),
Outcome = ifelse(Outcome=="HIT RIGHT UPRIGHT","WIDE RIGHT",Outcome),
Outcome = ifelse(Outcome=="HIT CROSSBAR","SHORT",Outcome)
) |>
filter(Outcome=="GOOD" | Outcome=="SHORT" | Outcome=="WIDE LEFT" | Outcome=="WIDE RIGHT" | Outcome=="BLOCKED")
d<-dFG |>
filter(Outcome!="SHORT" & Outcome!="BLOCKED") |>
mutate(
id = Kicker |> as.factor() |> as.integer()
)
d |>
write.csv("_posts/2025-09-20-nfl-field-goal-attempts/cleaned_data.csv")
idlookup<-d |>
group_by(OffenseTeam,Kicker,id) |>
summarize(
n = n(),
`raw mean` = mean(Outcome=="GOOD"),
`av dist` = mean(Distance)
)
idlookup |>
write.csv("_posts/2025-09-20-nfl-field-goal-attempts/idlookup.csv")
# prior predictive check
individual<-"_posts/2025-09-20-nfl-field-goal-attempts/individual.stan" |>
stan_model()
dStan<-list(
N = dim(d)[1],
dist = d$Distance,
good = d$Outcome=="GOOD",
prior_sigma = c(-2,1),
prior_nu = c(log(15),1),
UseData=0
)
Fit<-individual |>
sampling(
data=dStan,
seed=42
)
prior_predictive<-summary(Fit)$summary |>
data.frame() |>
rownames_to_column(var = "par") |>
mutate(
distance = par |> parse_number()
) |>
filter(!is.na(distance))
prior_predictive |>
write.csv("_posts/2025-09-20-nfl-field-goal-attempts/individual_prior_predictive.csv")
(
ggplot(prior_predictive,aes(x=distance))
+geom_line(aes(y=X50.,linetype="median"),linewidth=1)
+geom_line(aes(y=mean,linetype="mean"),linewidth=1)
+geom_ribbon(aes(ymin = X25.,ymax=X75.),alpha=0.5)
+geom_ribbon(aes(ymin = X2.5.,ymax=X97.5.),alpha=0.5)
+theme_bw()
+xlab("Distance to goal")
+ylab("Probability of success")
)
#-------------------------------------------------------------------------------
# Pooled model
#-------------------------------------------------------------------------------
dStan<-list(
N = dim(d)[1],
dist = d$Distance,
good = d$Outcome=="GOOD",
prior_sigma = c(-2,1),
prior_nu = c(log(15),1),
UseData=1
)
Fit<-individual |>
sampling(
data=dStan,
seed=42
)
summary(Fit)$summary |>
data.frame() |>
rownames_to_column(var="par") |>
write.csv("_posts/2025-09-20-nfl-field-goal-attempts/individual.csv")
fit<-summary(Fit)$summary |>
data.frame() |>
rownames_to_column(var = "par") |>
mutate(
distance = par |> parse_number()
) |>
filter(!is.na(distance))
d_averages<-d |>
group_by(Distance) |>
summarize(
good = mean(Outcome=="GOOD"),
n=n()
)
(
ggplot(fit,aes(x=distance))
+geom_point(data=d_averages,aes(x=Distance,y=good,size=n),alpha=0.5)
+geom_line(aes(y=mean),linewidth=1,color="red")
+geom_ribbon(aes(ymin = X25.,ymax=X75.),alpha=0.2,fill="red")
+geom_ribbon(aes(ymin = X2.5.,ymax=X97.5.),alpha=0.2,fill="red")
+geom_smooth(data=d,aes(x=Distance,y=1*(Outcome=="GOOD")),
method = "glm",
method.args = list(family = "binomial"))
+theme_bw()
+xlab("Distance to goal (yards)")
+ylab("Probability of success")
+ylim(c(0,1))
)
file<-"_posts/2025-09-20-nfl-field-goal-attempts/hierarchical.csv"
if (!file.exists(file)) {
hierarchical<-"_posts/2025-09-20-nfl-field-goal-attempts/hierarchical.stan" |>
stan_model()
dStan<-list(
N = dim(d)[1],
nkickers = d$id |> unique() |> length(),
id = d$id,
dist = d$Distance,
good = d$Outcome=="GOOD",
prior_MU = list(
c(-2,1),
c(log(15),1)
),
prior_TAU = c(1,1),
prior_Omega = 4
)
Fit<-hierarchical |>
sampling(
data=dStan,
control = list(adapt_delta = 0.99),
seed=42,
par = "z",include=FALSE
)
summary(Fit)$summary |>
data.frame() |>
rownames_to_column(var = "par") |>
write.csv(file)
}
[1] This turned out to be a really important form of flexibility when I was modeling set shots in Australian football, where the geometry is more complicated and the data are richer, but it didn’t really produce anything interesting here.