8 Model evaluation

It is often the case that we want to take more than one model to the data. While it is sometimes feasible to include all of your models in one big estimation (e.g. in a mixture model, or a model that nests all models under consideration), sometimes this is not computationally feasible. In these cases, one approach is to compare your models by assigning each of them some kind of measure of goodness-of-fit. Such measures include, but are certainly not limited to, posterior model probabilities and cross-validation criteria, which I will tell you about in this chapter.

An important question you will need to answer for yourself when deciding between the tools described below (or something else entirely) is which technique best speaks to your research question. While it might seem that, philosophically, you should always use model posterior probabilities, because this is how Bayes’ rule says we should do things, if you are trying to make out-of-sample predictions then cross-validation might be more appropriate. It really depends on what you want to get out of your analysis.

8.1 Example dataset and models

Here I will take three models of other-regarding preferences to the data from just the first participant to appear in the Bruhin, Fehr, and Schunk (2019) dataset. As a reminder from previous chapters, this experiment consists of 78 pair-wise choices of allocations between a decision-maker (the “self”) and an inactive participant (the “other”). I will consider three candidate models that could be generating these data.

The first, and most general of the three, is the Fehr and Schmidt (1999) model of inequality-aversion. This is the same model as I use in the “representative agent” chapter. For this, I took the Stan file from the representative agent models chapter, and made a few modifications that allow me to do the model evaluations shown in this chapter. First, I added a data element called UseData, which is a vector of ones and zeros the same length as the data (in this case 78 choices). If (say) the 36th element of this vector is equal to one, then the 36th decision in the data is counted toward the likelihood. On the other hand, if this element is zero, then the model is estimated ignoring this observation. We will need this functionality to perform cross-validation. Hence, setting UseData to a vector of ones will exactly replicate the estimations in the representative agent models chapter, and setting UseData to a vector of zeros will give you draws from the prior (because it will ignore all the data). Second, I explicitly computed the log-likelihood associated with each decision in a vector called log_lik in the transformed parameters block. Again, this is needed for cross-validation. Finally, I replaced the sampling statements that use a tilde (~) with explicit target += ... statements. That is, whereas in the representative agent models chapter I would have had:

lambda ~ lognormal(prior_lambda[1],prior_lambda[2]);

I now have:

target += lognormal_lpdf(lambda | prior_lambda[1],prior_lambda[2]);

The difference between these two statements is that the former drops the log-normal density’s constant of proportionality, while the latter does not. We will need these constants when we use bridge sampling to evaluate model posterior probabilities.

Here is the Stan program that estimates the Fehr and Schmidt (1999) model:

// FS.stan
data {
  int<lower=0> n; // number of observations
  vector[n] self_x; // payoff to self with allocation x
  vector[n] other_x; // payoff to other with allocation x
  vector[n] self_y; // payoff to self with allocation y
  vector[n] other_y; // payoff to other with allocation y
  int choice_x[n]; // =1 iff allocation x is chosen
  
  real prior_lambda[2];
  real prior_alpha[2];
  real prior_beta[2];
  
  vector[n] UseData;
}

transformed data {
  vector[n] dX; // disadvantageous inequality at allocation x
  vector[n] dY; // disadvantageous inequality at allocation y
  vector[n] aX; // advantageous inequality at allocation x
  vector[n] aY; // advantageous inequality at allocation y
  
  dX = fmax(0,other_x-self_x);
  dY = fmax(0,other_y-self_y);
  aX = fmax(0,self_x-other_x);
  aY = fmax(0,self_y-other_y);
}
parameters {
  real alpha;
  real beta;
  real<lower=0> lambda;
}
transformed parameters {
  
  vector[n] log_lik;
  
  for (ii in 1:n) {
    real Ux = self_x[ii]-alpha*dX[ii]-beta*aX[ii];
    real Uy = self_y[ii]-alpha*dY[ii]-beta*aY[ii];
    log_lik[ii] = bernoulli_logit_lpmf(choice_x[ii] | lambda*(Ux-Uy));
  }
  
}

model {
  

  // likelihood
  target += log_lik.*UseData;
  
  // priors
  target += normal_lpdf(alpha | prior_alpha[1],prior_alpha[2]);
  target += normal_lpdf(beta | prior_beta[1],prior_beta[2]);
  target += lognormal_lpdf(lambda | prior_lambda[1],prior_lambda[2]);
  
}

For this chapter, I consider two simplifications of the Fehr and Schmidt (1999) model. First, a model of pure selfishness that sets the aversion to inequality parameters equal to zero:

// Selfish.stan
data {
  int<lower=0> n; // number of observations
  vector[n] self_x; // payoff to self with allocation x
  vector[n] self_y; // payoff to self with allocation y
  int choice_x[n]; // =1 iff allocation x is chosen
  
  real prior_lambda[2];
  
  vector[n] UseData;
}


parameters {
  real<lower=0> lambda;
}
transformed parameters {
  
  vector[n] log_lik;
  
  for (ii in 1:n) {
    real Ux = self_x[ii];
    real Uy = self_y[ii];
    log_lik[ii] = bernoulli_logit_lpmf(choice_x[ii] | lambda*(Ux-Uy));
  }
  
}

model {
  

  // likelihood
  target += log_lik.*UseData;
  
  // priors
  target += lognormal_lpdf(lambda | prior_lambda[1],prior_lambda[2]);
  
}

And second, as an in-between model, I estimated a one-parameter utility model that essentially imposes the restriction \(\alpha=-\beta\). This can capture efficiency-loving or efficiency-averse preferences, but not inequality-averse preferences:

// Linear.stan
data {
  int<lower=0> n; // number of observations
  vector[n] self_x; // payoff to self with allocation x
  vector[n] other_x; // payoff to other with allocation x
  vector[n] self_y; // payoff to self with allocation y
  vector[n] other_y; // payoff to other with allocation y
  int choice_x[n]; // =1 iff allocation x is chosen
  
  real prior_lambda[2];
  real prior_gamma[2];
  
  vector[n] UseData;
}

parameters {
  real gamma;
  real<lower=0> lambda;
}
transformed parameters {
  
  vector[n] log_lik;
  
  for (ii in 1:n) {
    real Ux = self_x[ii]+gamma*other_x[ii];
    real Uy = self_y[ii]+gamma*other_y[ii];
    log_lik[ii] = bernoulli_logit_lpmf(choice_x[ii] | lambda*(Ux-Uy));
  }
  
}

model {
  

  // likelihood
  target += log_lik.*UseData;
  
  // priors
  target += normal_lpdf(gamma | prior_gamma[1],prior_gamma[2]);
  target += lognormal_lpdf(lambda | prior_lambda[1],prior_lambda[2]);
  
}

For all of these models, I use the priors calibrated in the representative agent models chapter, which are:

\[ \begin{aligned} \alpha,\beta,\gamma & \sim N(0,0.67^2)\\ \log \lambda &\sim N(-5.76,2.11^2) \end{aligned} \]

8.2 Model posterior probabilities

Bayes’ rule actually provides us with a clear way of assigning posterior probabilities to models. To see this, take Bayes’ rule in its general form for any two events \(A\) and \(B\):

\[ p(A\mid B)=\frac{p(B\mid A)p(A)}{p(B)} \]

and then substitute \(A=M\) (i.e. the model) and \(B=y\) (the data):

\[ p(M\mid y)=\frac{p(y\mid M)p(M)}{p(y)} \]

That is, if we have a prior for each model, \(p(M)\), and a likelihood of observing the data conditional on each model, \(p(y\mid M)\), we can calculate the posterior probability of a model. We do not need to compute the denominator directly, because the posterior probabilities must sum to one over the set of models under consideration:

\[ \begin{aligned} p(M\mid y)&\propto p(y\mid M)p(M)\\ \implies p(M\mid y)&=\frac{p(y\mid M)p(M)}{\sum_{m\in\mathcal M}p(y\mid m)p(m)} \end{aligned} \]

So we can assign a posterior probability to a model \(M\) in the set of models under consideration \(\mathcal M\) if we can compute \(p(y\mid M)\) and have formed a prior \(p(M)\) for each model.
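To make this concrete, here is a small sketch in R of how this normalization can be computed. I work on the log scale for numerical stability, and the log marginal likelihoods below are made-up numbers purely for illustration; later in this chapter we will compute the real thing with the bridgesampling library.

# Hypothetical log marginal likelihoods log p(y | M) for three models
# (made-up numbers for illustration only)
log_ml <- c(FS = -46.4, Selfish = -44.1, Linear = -45.7)

# Prior model probabilities p(M)
prior_M <- c(FS = 1/3, Selfish = 1/3, Linear = 1/3)

# Posterior model probabilities p(M | y): add the log prior, exponentiate
# (after subtracting the maximum to avoid underflow), and normalize
log_unnorm <- log_ml + log(prior_M)
post_M <- exp(log_unnorm - max(log_unnorm))
post_M <- post_M / sum(post_M)
round(post_M, 4)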

While a prior for a model is something that you need to come up with on your own, the likelihood component \(p(y\mid M)\) is just the marginal likelihood of each individual model. In principle, if you can estimate each model, then computing its marginal likelihood is feasible. To see this, note that we can write Bayes’ rule for a specific model as follows:

\[ p(\theta\mid y,M)=\frac{p(y\mid \theta,M)p(\theta\mid M)}{p(y\mid M)} \]

So there is the marginal likelihood we are after right there in the denominator. A bit of re-arranging yields:

\[ p(y\mid M)=\frac{p(y\mid \theta,M)p(\theta\mid M)}{p(\theta\mid y,M)} \]

Therefore in principle at least, if we can draw from the posterior \(\theta\mid y,M\), we can compute the marginal likelihood \(p(y\mid M)\).

Unfortunately, in practice doing this is somewhat difficult. To see why, note that the most intuitive (for me at least) way of evaluating \(p(y\mid M)\) is to integrate \(\theta\) out of the joint distribution \(p(y,\theta\mid M)\), that is:

\[ \begin{aligned} p(y\mid M)&=\int_{\Theta}p(y,\theta\mid M)\mathrm d\theta\\ &=\int_\Theta p(y\mid \theta,M)p(\theta\mid M)\mathrm d\theta\\ &=E_{\theta\sim p(\theta\mid M)}\left[p(y\mid \theta,M)\right] \end{aligned} \]

where the last line takes the expectation of the likelihood with respect to the prior for \(\theta\). Therefore, we could approximate this marginal likelihood through simulation as:

\[ \begin{aligned} p(y\mid M)&\approx \frac1S\sum_{s=1}^S p(y\mid \theta_s,M)\\ \{\theta_s\}_{s=1}^S&\sim p(\theta) \end{aligned} \]

where the \(\theta_s\)s are draws from the prior. However, if the posterior is very different from the prior (and we hope that it is, because we hope to learn from our data), then this simulation will be very inefficient: most draws of \(\theta_s\) will return very small likelihoods, and a handful of draws will dominate the average.
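To see what this naive estimator looks like in practice, here is a sketch for the Selfish model, worked on the log scale. It assumes a data list dStan in the format constructed later in this chapter; the number of prior draws S is arbitrary.

# Naive Monte Carlo estimate of log p(y | M) for the Selfish model: draw lambda
# from its prior, compute the likelihood of all 78 choices, and average.
# Assumes a data list `dStan` in the format constructed later in this chapter.
set.seed(42)
S <- 10000
log_lik_s <- rep(NA, S)
for (s in 1:S) {
  # draw lambda from its log-normal prior
  lambda_s <- exp(rnorm(1, dStan$prior_lambda[1], dStan$prior_lambda[2]))
  # choice probabilities for allocation x under the Selfish model
  p_x <- plogis(lambda_s * (dStan$self_x - dStan$self_y))
  log_lik_s[s] <- sum(dbinom(dStan$choice_x, size = 1, prob = p_x, log = TRUE))
}
# log of the average likelihood, computed with the log-sum-exp trick
m <- max(log_lik_s)
log_pyM_naive <- m + log(mean(exp(log_lik_s - m)))
# the spread of log_lik_s shows how unevenly the draws contribute to the average
summary(log_lik_s)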

8.2.1 Implementation using bridge sampling and the bridgesampling library

A more efficient way to evaluate \(p(y\mid M)\) is to use bridge sampling (Meng and Wong 1996). While we fortunately have a package in R that will just do this for us, it may be useful to see how it works. For my own understanding, I found Gronau et al. (2017) immensely useful in learning how this works, and what follows closely mimics their explanation.

We start, as is the case with many algebraic problems, by constructing a fancy but trivial identity:

\[ \begin{aligned} 1&=\frac{\int_\Theta p(y\mid \theta,M)p(\theta\mid M)g(\theta)h(\theta)\mathrm d\theta}{\int_\Theta p(y\mid \theta,M)p(\theta\mid M)g(\theta)h(\theta)\mathrm d\theta} \end{aligned} \]

where \(g(\theta)\) is a user-specified proposal distribution, and \(h(\theta)\) is a user-specified “bridge function”.

Next, we multiply both sides by \(p(y\mid M)\):

\[ \begin{aligned} p(y\mid M)&=p(y\mid M)\frac{\int_\Theta p(y\mid \theta,M)p(\theta\mid M)g(\theta)h(\theta)\mathrm d\theta}{\int_\Theta p(y\mid \theta,M)p(\theta\mid M)g(\theta)h(\theta)\mathrm d\theta}\\ &=\frac{\int_\Theta p(y\mid \theta,M)p(\theta\mid M)g(\theta)h(\theta)\mathrm d\theta}{\frac{1}{p(y\mid M)}\int_\Theta p(y\mid \theta,M)p(\theta\mid M)g(\theta)h(\theta)\mathrm d\theta}\\ &=\frac{\int_\Theta p(y\mid \theta,M)p(\theta\mid M)g(\theta)h(\theta)\mathrm d\theta}{\int_\Theta \frac{p(y\mid \theta,M)p(\theta\mid M)}{p(y\mid M)}g(\theta)h(\theta)\mathrm d\theta}\\ &=\frac{\int_\Theta p(y\mid \theta,M)p(\theta\mid M)g(\theta)h(\theta)\mathrm d\theta}{\int_\Theta \frac{p(y,\theta\mid M)}{p(y\mid M)} g(\theta)h(\theta)\mathrm d\theta}\\ &=\frac{\int_\Theta p(y\mid \theta,M)p(\theta\mid M)g(\theta)h(\theta)\mathrm d\theta}{\int_\Theta p(\theta\mid y,M) g(\theta)h(\theta)\mathrm d\theta} \end{aligned} \]

Note now that both the numerator and denominator can be written as expectations. For the numerator we will view this as an expectation relative to the proposal density \(g(\theta)\), and for the denominator we will view this as an expectation relative to the posterior \(p(\theta\mid y,M)\), so we can write:

\[ \begin{aligned} p(y\mid M)&=\frac{E_{\theta \sim g(\theta)}\left[p(y\mid \theta,M)p(\theta\mid M)h(\theta)\right]}{E_{\theta\sim p(\theta\mid y,M)}\left[g(\theta)h(\theta)\right]} \end{aligned} \]

Therefore we can approximate \(p(y\mid M)\) as follows:

\[ \begin{aligned} p(y\mid M)&\approx\frac{\frac{1}{S}\sum_{s=1}^Sp(y\mid \hat\theta_s,M)p(\hat\theta_s\mid M)h(\hat\theta_s)}{\frac1T\sum_{t=1}^Tg(\tilde\theta_t)h(\tilde\theta_t)}\\ \hat\theta_s&\sim g(\theta)\\ \tilde\theta_t&\sim p(\theta\mid y,M) \end{aligned} \]

That is, the \(\hat\theta\)s are draws from the proposal distribution, and the \(\tilde\theta\)s are draws from the posterior.
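Before handing this over to the bridgesampling library, it may help to see the estimator at work on a toy problem where the answer can be checked by brute force. Here is a sketch for a conjugate normal-normal model, using a deliberately over-dispersed normal proposal \(g(\theta)\) and the simple “geometric” bridge function \(h(\theta)=1/\sqrt{g(\theta)q(\theta)}\), where \(q(\theta)=p(y\mid \theta,M)p(\theta\mid M)\) is the un-normalized posterior. This is not the (optimal, iteratively-computed) bridge function that the bridgesampling library uses, but it is a valid choice that keeps the example short.

# Toy model: y_i ~ N(theta, 1) with prior theta ~ N(0, 1), so the posterior is
# known in closed form and the marginal likelihood can be checked by
# one-dimensional numerical integration.
set.seed(42)
y <- rnorm(20, mean = 0.5, sd = 1)
n <- length(y)

# log of the un-normalized posterior q(theta) = p(y | theta) p(theta)
log_q <- function(theta) {
  sapply(theta, function(th) sum(dnorm(y, th, 1, log = TRUE)) + dnorm(th, 0, 1, log = TRUE))
}

# exact posterior moments (conjugate normal-normal)
post_var  <- 1 / (n + 1)
post_mean <- post_var * sum(y)

# proposal g(theta): a normal centred at the posterior mean, but over-dispersed
g_mean <- post_mean
g_sd   <- 2 * sqrt(post_var)
log_g  <- function(theta) dnorm(theta, g_mean, g_sd, log = TRUE)

S <- 10000
theta_hat   <- rnorm(S, g_mean, g_sd)              # draws from the proposal
theta_tilde <- rnorm(S, post_mean, sqrt(post_var)) # draws from the posterior

# with h = 1/sqrt(g*q): the numerator terms q*h equal sqrt(q/g), and the
# denominator terms g*h equal sqrt(g/q)
num <- mean(exp(0.5 * (log_q(theta_hat)   - log_g(theta_hat))))
den <- mean(exp(0.5 * (log_g(theta_tilde) - log_q(theta_tilde))))
log_pyM_bridge <- log(num) - log(den)

# check against brute-force numerical integration of q(theta), re-scaled to
# avoid underflow (the integrand is negligible outside (-3, 3))
c_scale <- log_q(post_mean)
log_pyM_exact <- c_scale + log(integrate(function(t) exp(log_q(t) - c_scale), -3, 3)$value)
c(bridge = log_pyM_bridge, exact = log_pyM_exact)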

Fortunately for us, the bridgesampling library (Gronau, Singmann, and Wagenmakers 2020) does all of this for us. The only thing we need to be careful of when writing our Stan files is not to drop constants of proportionality, which is why I use explicit target += statements in the Stan programs shown above.34 Here is how we can assign posterior probabilities to our three models using the bridgesampling library:

library(tidyverse)
library(rstan)        # for stan_model() and sampling()
library(bridgesampling)
library(kableExtra)   # for kbl() and kable_classic()

d1<-(read.csv("Data/BFS2019_choices_exp1.csv")
     |> mutate(experiment=1)
)
d2<-(read.csv("Data/BFS2019_choices_exp2.csv")
     |> mutate(experiment=2)    
)
D<-(rbind(d1,d2)
    # Just use the dictator game data
    |> filter(dg==1)  
    |> mutate(
      self_alloc = ifelse(choice_x==1,self_x,self_y),
      other_alloc = ifelse(choice_x==1,other_x,other_y)
    )
) |>
  filter(
    sid==102010050706
  )



model_FS<-"Code/Validation/FS.stan" |>
  stan_model()
model_Linear<-"Code/Validation/Linear.stan" |>
  stan_model()
model_Selfish<-"Code/Validation/Selfish.stan" |>
  stan_model()

dStan<-list(
  n = dim(D)[1],
  self_x = D$self_x,
  other_x = D$other_x,
  self_y = D$self_y,
  other_y = D$other_y,
  
  choice_x = D$choice_x,
  
  prior_alpha = c(0,0.67),
  prior_beta = c(0,0.67),
  prior_gamma = c(0,0.67),
  prior_lambda = c(-5.76,2.11),
  
  UseData = rep(1,dim(D)[1])
)

# Estimate the models

Fit_FS<-model_FS |>
  sampling(data=dStan,seed=42)

Fit_Selfish<-model_Selfish |>
  sampling(data=dStan,seed=42)

Fit_Linear<-model_Linear |>
  sampling(data=dStan,seed=42)

# Apply the bridge sampler to the estimated models

FS<-Fit_FS |> 
  bridge_sampler()

Selfish<-Fit_Selfish |> 
  bridge_sampler()

Linear<-Fit_Linear |> 
  bridge_sampler()

# Calculate a posterior probability for each model assuming an equal prior

post_prob(FS,Selfish,Linear) |>
  saveRDS("Code/Validation/PostProbs.rds")
"Code/Validation/PostProbs.rds" |> 
  readRDS() |>
  kbl(digits = 4,caption = "Posterior probabilities of the three models assuming equal prior probabilities.") |>
  kable_classic(full_width=FALSE)
Table 8.1: Posterior probabilities of the three models assuming equal prior probabilities.
model     posterior probability
FS        0.0742
Selfish   0.7706
Linear    0.1552

Table 8.1 shows the three models’ posterior probabilities assuming equal prior probabilities. Here we can see that most of the probability mass is assigned to the selfish model. That is, if we wanted to pick one model out of these three, we would choose the selfish model. Another, more Bayesian, interpretation of the results in this table is as follows: if these three models are the only models that could be generating the data, then after observing the data our belief that the selfish model is the true model is 77% (assuming equal prior probabilities).
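If you did not want to assume equal prior model probabilities, you can supply your own. If I am remembering the bridgesampling interface correctly, post_prob takes a prior_prob argument for this purpose; the prior weights below are purely illustrative.

# Posterior model probabilities under unequal prior model probabilities,
# here a prior that puts half of its mass on the Selfish model
# (prior weights are illustrative only)
post_prob(FS, Selfish, Linear, prior_prob = c(0.25, 0.5, 0.25))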

8.3 Cross-validation

Cross-validation techniques, while not strictly Bayesian,35 are useful when you are interested in making out-of-sample predictions. Cross-validation mimics an out-of-sample prediction by splitting the dataset into a “training” dataset and a “testing” dataset. The training dataset is used to estimate the model, and then the testing dataset is used to evaluate the model. That is, we estimate the model ignoring the testing data, and then see how well the estimated model performs in predicting it.

8.3.1 Expected log predictive density (ELPD) and other measures of goodness-of-fit

An important component of cross-validation is the choice of goodness-of-fit measure: we need a measure of how well, or how badly, our models fit the data. If we were in the realm of linear regression, a good starting point could be the out-of-sample residual sum of squares:

\[ RSS_M=\sum_{i,t\in\mathrm{testing}}(y_{i,t}-X_{i,t}\hat\beta_M)^2 \]

That is, we would aim to select the model \(M\in\mathcal M\) that minimized \(RSS_M\).

Things become a little more complicated when we are dealing with choice data \(y_{i,t}\) that are not continuous. For instance, many of the examples I use in this book, and many experiments in general, involve binary choices, and it is not clear that minimizing the sum of squared residuals for binary data is what we want to be doing. Furthermore, even if we had continuous data, \(RSS_M\) only evaluates the model’s point predictions (i.e. \(\hat y_{i,t,M}=X_{i,t}\hat\beta_M\)), and completely ignores the fact that our model makes probabilistic predictions. From the linear regression model’s perspective, we would be ignoring that, when coupled with (say) a distributional assumption of normal errors, the model also says something about the spread of the data.

In the absence of a specific predictive goal,36 a good all-encompassing measure of goodness-of-fit is the expected log predictive density (ELPD), which is defined as:

\[ \mathrm{ELPD}_M=E_{\theta\mid y,M}\left[\log p(y^*\mid \theta,M)\right] \]

where \(\theta\mid y,M\) is the posterior distribution of parameters \(\theta\) from model \(M\), and \(p(y^*\mid \theta,M)\) is the probability (mass or density) of observing (out-of-sample) observation \(y^*\) given model \(M\) and parameters \(\theta\). My (somewhat) lay-person’s translation of \(\mathrm{ELPD}_M\) is: “If we estimate and use model \(M\), how un-surprised will we be by new data?” That is, if \(\mathrm{ELPD}_M\) is large, then our model on average assigns a lot of probability (mass or density) to the new data that we will get, and so we are often not surprised by our new data.

Since \(y^*\) is unknown at the time of estimation, in practice we approximate \(\mathrm{ELPD}_M\) using cross-validation, with the testing dataset standing in for \(y^*\):

\[ \begin{aligned} \mathrm{ELPD}_M&\approx \frac{1}{n_\mathrm{testing}}\sum_{i,t\in\mathrm{testing}}E_{\theta\mid y_\mathrm{training},M}\left[\log p(y_{i,t}\mid \theta,M)\right]\\ &\approx \frac{1}{n_\mathrm{testing}}\sum_{i,t\in\mathrm{testing}}\left[\frac{1}{S}\sum_{s=1}^S\log p(y_{i,t}\mid \theta_s,M)\right]\\ \theta_s&\sim p(\theta\mid y_\mathrm{training},M) \end{aligned} \]

That is, we are evaluating the out-of-sample (i.e. testing) log-likelihood of our model with respect to the posterior (conditional on the training data only). Hence you will see in the Stan programs above that I have explicitly created a log_lik vector in the transformed parameters block. This will make it easy to extract the relevant elements when we do cross-validation.
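For concreteness, here is a small helper implementing this approximation, given the matrix of pointwise log-likelihood draws that extract() returns and a vector of indices for the held-out observations. (Note that the loo library reports the sum over observations rather than this per-observation mean, so its numbers are larger in magnitude.)

# ELPD approximation from a matrix of pointwise log-likelihood draws:
# log_lik is an S x n matrix (rows are posterior draws, columns are
# observations, as in extract(fit)$log_lik), and `testing` holds the indices
# of the held-out observations.
elpd_testing <- function(log_lik, testing) {
  # average over draws for each held-out observation, then over observations
  mean(colMeans(log_lik[, testing, drop = FALSE]))
}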

I will proceed with the remainder of this chapter using ELPD as the measure of goodness-of-fit, but realize that what follows applies equally to any other valid measure.

8.3.2 1-round cross-validation

The simplest and fastest form of cross-validation is 1-round cross-validation. Here, we do the cross-validation process exactly once: we partition the sample into a “training” dataset, which we use to estimate the model, and a “testing” dataset, which we use to evaluate the model. The other cross-validation techniques described later in this chapter involve repeating this process many times.

Here is my code to evaluate ELPD for the three models using 1-round cross-validation. Note that I achieve this by setting ten random elements of the UseData vector to zero. This means that these observations are not used in incrementing the target (see the target += log_lik.*UseData; line in the Stan programs), and so do not contribute to the estimation process. I then extract the log-likelihoods corresponding to these testing data to compute ELPD.

library(tidyverse)
library(rstan)
library(kableExtra)   # for kbl() and kable_classic()
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)


d1<-(read.csv("Data/BFS2019_choices_exp1.csv")
     |> mutate(experiment=1)
)
d2<-(read.csv("Data/BFS2019_choices_exp2.csv")
     |> mutate(experiment=2)    
)
D<-(rbind(d1,d2)
    # Just use the dictator game data
    |> filter(dg==1)  
    |> mutate(
      self_alloc = ifelse(choice_x==1,self_x,self_y),
      other_alloc = ifelse(choice_x==1,other_x,other_y)
    )
) |>
  filter(
    sid==102010050706
  )

model_FS<-"Code/Validation/FS.stan" |>
  stan_model()
model_Linear<-"Code/Validation/Linear.stan" |>
  stan_model()
model_Selfish<-"Code/Validation/Selfish.stan" |>
  stan_model()

dStan<-list(
  n = dim(D)[1],
  self_x = D$self_x,
  other_x = D$other_x,
  self_y = D$self_y,
  other_y = D$other_y,
  
  choice_x = D$choice_x,
  
  prior_alpha = c(0,0.67),
  prior_beta = c(0,0.67),
  prior_gamma = c(0,0.67),
  prior_lambda = c(-5.76,2.11),
  
  UseData = rep(1,dim(D)[1])
)

# split the data into training and testing subsamples
# Here we will randomly select 10 observations to hold out as the testing data
set.seed(42)
testing<-(1:78)[order(runif(78))][1:10]
dStan$UseData[testing]<-0


# Estimate the models

Fit_FS<-model_FS |>
  sampling(data=dStan,seed=42)

Fit_Selfish<-model_Selfish |>
  sampling(data=dStan,seed=42)

Fit_Linear<-model_Linear |>
  sampling(data=dStan,seed=42)

# extract the relevant parts of the log-likelihood matrix for the testing data
# (posterior mean of the summed log-likelihood over the 10 held-out observations)

ELPD_FS<-mean(extract(Fit_FS)$log_lik[,testing]%*% rep(1,10))
ELPD_Selfish<-mean(extract(Fit_Selfish)$log_lik[,testing]%*% rep(1,10))
ELPD_Linear<-mean(extract(Fit_Linear)$log_lik[,testing]%*% rep(1,10))

d<-tibble(
  model = c("FS","Selfish","Linear"),
  ELPD  = c(ELPD_FS,ELPD_Selfish,ELPD_Linear)
)

d |>
  write.csv("Code/Validation/OneRound.csv")
"Code/Validation/OneRound.csv" |>
  read.csv() |>
  dplyr::select(-X) |>
  kbl(digits=2,caption = "ELPD evaluated using 1-round cross validation.") |>
  kable_classic(full_width=FALSE)
Table 8.2: ELPD evaluated using 1-round cross validation.
model     ELPD
FS        -6.30
Selfish   -6.32
Linear    -6.25

Table 8.2 shows the ELPD evaluated using 1-round cross-validation. More positive numbers indicate better out-of-sample predictions. Here we can see that we would (marginally) select the Linear model over the others.

8.3.3 Leave-one-out cross-validation (LOO)

Leave-one-out (LOO) cross-validation is much more computationally intensive than 1-round cross-validation, but it is preferable because it better mimics the estimation process and does not depend on an arbitrary split into training and testing datasets. Here, we estimate the model \(n\) times, once for each observation in the dataset. Each time, we put just one observation into the testing dataset and evaluate its goodness-of-fit. We then take the average goodness-of-fit over these \(n\) estimation runs. In the code below, you can see I loop over the observations in the data (for (ii in 1:dim(D)[1]) { ...). Each time through the loop, element ii of UseData is set to zero. We then extract the log-likelihood for this left-out observation and store it in a matrix (e.g. extract(Fit_FS)$log_lik[,ii]). I then use the loo library to summarize the results.

library(tidyverse)
library(rstan)
library(loo)
library(kableExtra)   # for kbl() and kable_classic()
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

d1<-(read.csv("Data/BFS2019_choices_exp1.csv")
     |> mutate(experiment=1)
)
d2<-(read.csv("Data/BFS2019_choices_exp2.csv")
     |> mutate(experiment=2)    
)
D<-(rbind(d1,d2)
    # Just use the dictator game data
    |> filter(dg==1)  
    |> mutate(
      self_alloc = ifelse(choice_x==1,self_x,self_y),
      other_alloc = ifelse(choice_x==1,other_x,other_y)
    )
) |>
  filter(
    sid==102010050706
  )



model_FS<-"Code/Validation/FS.stan" |>
  stan_model()
model_Linear<-"Code/Validation/Linear.stan" |>
  stan_model()
model_Selfish<-"Code/Validation/Selfish.stan" |>
  stan_model()

dStan<-list(
  n = dim(D)[1],
  self_x = D$self_x,
  other_x = D$other_x,
  self_y = D$self_y,
  other_y = D$other_y,
  
  choice_x = D$choice_x,
  
  prior_alpha = c(0,0.67),
  prior_beta = c(0,0.67),
  prior_gamma = c(0,0.67),
  prior_lambda = c(-5.76,2.11),
  
  UseData = rep(1,dim(D)[1])
)

ll_FS<-c()
ll_Selfish<-c()
ll_Linear<-c()

for (ii in 1:dim(D)[1]) {
  
  print(paste(ii,"of",dim(D)[1]))
  
  dStan<-list(
    n = dim(D)[1],
    self_x = D$self_x,
    other_x = D$other_x,
    self_y = D$self_y,
    other_y = D$other_y,
    
    choice_x = D$choice_x,
    
    prior_alpha = c(0,0.67),
    prior_beta = c(0,0.67),
    prior_gamma = c(0,0.67),
    prior_lambda = c(-5.76,2.11),
    
    UseData = rep(1,dim(D)[1])
  )
  
  dStan$UseData[ii]<-0
  
  Fit_FS<-model_FS |>
    sampling(data=dStan,seed=42)
  ll_FS<-cbind(ll_FS,extract(Fit_FS)$log_lik[,ii])
  
  Fit_Linear<-model_Linear |>
    sampling(data=dStan,seed=42)
  ll_Linear<-cbind(ll_Linear,extract(Fit_Linear)$log_lik[,ii])
  
  Fit_Selfish<-model_Selfish |>
    sampling(data=dStan,seed=42)
  ll_Selfish<-cbind(ll_Selfish,extract(Fit_Selfish)$log_lik[,ii])

}

ll_FS |>
  write.csv("Code/Validation/LOO_ll_FS.csv")
ll_Linear |>
  write.csv("Code/Validation/LOO_ll_Linear.csv")
ll_Selfish |>
  write.csv("Code/Validation/LOO_ll_Selfish.csv")
models<-c(model1="FS",model2="Selfish",model3="Linear")

LOO_ll_FS<-"Code/Validation/LOO_ll_FS.csv" |> read.csv() |> dplyr::select(-X) |> as.matrix()
LOO_ll_Linear<-"Code/Validation/LOO_ll_Linear.csv" |> read.csv() |> dplyr::select(-X) |> as.matrix()
LOO_ll_Selfish<-"Code/Validation/LOO_ll_Selfish.csv" |> read.csv() |> dplyr::select(-X) |> as.matrix()

loo_compare(loo(LOO_ll_FS,cores=8),loo(LOO_ll_Selfish,cores=8),loo(LOO_ll_Linear,cores=8)) |>
  data.frame()|>
  rownames_to_column(var = "model") |>
  mutate(
    model = models[model]
  ) |>
  kbl( caption = "Model comparison using exact LOO") |>
  kable_classic(full_width=FALSE)
Table 8.3: Model comparison using exact LOO
model     elpd_diff   se_diff     elpd_loo    se_elpd_loo   p_loo      se_p_loo    looic      se_looic
Selfish    0.00000    0.0000000   -41.71713   6.029357      1.794600   0.8102173   83.43425   12.05871
Linear    -2.14477    0.7093951   -43.86190   5.823461      2.979981   0.9124914   87.72379   11.64692
FS        -4.65318    1.4090278   -46.37031   6.316032      4.604015   1.3196564   92.74061   12.63206

Table 8.3 shows a summary of the loo output. Recall that larger (i.e. more positive) values of ELPD indicate a better fit. Here we select the selfish model.

8.3.4 Approximate LOO

Vehtari, Gelman, and Gabry (2017) describe methods for approximating LOO using Pareto-smoothed importance sampling. The advantage of this approach is that you only need to estimate each model once, rather than \(n\) times. This can be done in R using the loo library.

library(tidyverse)
library(loo)
library(rstan)
library(kableExtra)   # for kbl() and kable_classic()
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

d1<-(read.csv("Data/BFS2019_choices_exp1.csv")
     |> mutate(experiment=1)
)
d2<-(read.csv("Data/BFS2019_choices_exp2.csv")
     |> mutate(experiment=2)    
)
D<-(rbind(d1,d2)
    # Just use the dictator game data
    |> filter(dg==1)  
    |> mutate(
      self_alloc = ifelse(choice_x==1,self_x,self_y),
      other_alloc = ifelse(choice_x==1,other_x,other_y)
    )
) |>
  filter(
    sid==102010050706
  )



model_FS<-"Code/Validation/FS.stan" |>
  stan_model()
model_Linear<-"Code/Validation/Linear.stan" |>
  stan_model()
model_Selfish<-"Code/Validation/Selfish.stan" |>
  stan_model()

dStan<-list(
  n = dim(D)[1],
  self_x = D$self_x,
  other_x = D$other_x,
  self_y = D$self_y,
  other_y = D$other_y,
  
  choice_x = D$choice_x,
  
  prior_alpha = c(0,0.67),
  prior_beta = c(0,0.67),
  prior_gamma = c(0,0.67),
  prior_lambda = c(-5.76,2.11),
  
  UseData = rep(1,dim(D)[1])
)

# Estimate the models

Fit_FS<-model_FS |>
  sampling(data=dStan,seed=42)

Fit_Selfish<-model_Selfish |>
  sampling(data=dStan,seed=42)

Fit_Linear<-model_Linear |>
  sampling(data=dStan,seed=42)

# apply approximate LOO cross-validation

FS<-extract(Fit_FS)$log_lik |> 
  loo(cores=8)
  

Selfish<-extract(Fit_Selfish)$log_lik |> 
  loo(cores=8)

Linear<-extract(Fit_Linear)$log_lik |> 
  loo(cores=8)

# Compare the models

loo_compare(FS,Selfish,Linear) |>
  saveRDS("Code/Validation/ApproxLOO.rds")

The downside of this approach is that the importance sampling is not guaranteed to be reliable. Fortunately there is a diagnostic for this, which loo checks for us. For example, for the “Selfish” model, I got the following warning:

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     72    92.3%   2270    
   (0.7, 1]   (bad)       0     0.0%   <NA>    
   (1, Inf)   (very bad)  6     7.7%   <NA>    
See help('pareto-k-diagnostic') for details.

So there are six observations for which this approximation did not work well. In these cases one can instead do exact LOO, as described in the previous section, for the problematic observations.
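If you want to know which observations these are, the loo library stores the Pareto-\(k\) values in the fitted loo objects. Here is a sketch, assuming the Selfish object created above and that I am remembering the accessor function correctly:

# Pareto-k values, one per observation; observations with k above (roughly) 0.7
# are the ones for which the importance-sampling approximation is suspect, and
# are candidates for exact re-estimation as in the previous section
k_selfish <- pareto_k_values(Selfish)
which(k_selfish > 0.7)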

"Code/Validation/ApproxLOO.rds" |> 
  readRDS() |>
  data.frame()|>
  rownames_to_column(var = "model") |>
  mutate(
    model = models[model]
  ) |>
  kbl( caption = "Model validation using approximate LOO") |>
  kable_classic(full_width=FALSE)
Table 8.4: Model validation using approximate LOO
model     elpd_diff    se_diff     elpd_loo    se_elpd_loo   p_loo      se_p_loo    looic      se_looic
Selfish    0.0000000   0.0000000   -39.93907   5.310491      1.467190   0.6213825   79.87814   10.62098
Linear    -0.9970267   0.7318856   -40.93610   5.009543      2.454233   0.6737671   81.87220   10.01909
FS        -1.7310278   1.0247346   -41.67010   5.095779      3.476072   0.8952323   83.34020   10.19156

Table 8.4 shows the loo library’s summary of the model evaluation. Here we select the “Selfish” model.

8.3.5 \(k\)-fold cross-validation

Perhaps a happy medium between exact LOO and approximate LOO is \(k\)-fold cross-validation. Here, we partition the data into \(k\) “folds”. For each fold, we estimate the model using the data not in that fold, then evaluate how well the estimated model predicts the data in that fold. This requires \(k\) estimation runs rather than \(n\); when \(k=n\) we are back to LOO.

Here is my R code for implementing \(k\)-fold cross-validation. In this script, I partition the data into \(k=10\) folds.

library(tidyverse)
library(rstan)
library(loo)
library(kableExtra)   # for kbl() and kable_classic()
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

d1<-(read.csv("Data/BFS2019_choices_exp1.csv")
     |> mutate(experiment=1)
)
d2<-(read.csv("Data/BFS2019_choices_exp2.csv")
     |> mutate(experiment=2)    
)
D<-(rbind(d1,d2)
    # Just use the dictator game data
    |> filter(dg==1)  
    |> mutate(
      self_alloc = ifelse(choice_x==1,self_x,self_y),
      other_alloc = ifelse(choice_x==1,other_x,other_y)
    )
) |>
  filter(
    sid==102010050706
  )



model_FS<-"Code/Validation/FS.stan" |>
  stan_model()
model_Linear<-"Code/Validation/Linear.stan" |>
  stan_model()
model_Selfish<-"Code/Validation/Selfish.stan" |>
  stan_model()

dStan<-list(
  n = dim(D)[1],
  self_x = D$self_x,
  other_x = D$other_x,
  self_y = D$self_y,
  other_y = D$other_y,
  
  choice_x = D$choice_x,
  
  prior_alpha = c(0,0.67),
  prior_beta = c(0,0.67),
  prior_gamma = c(0,0.67),
  prior_lambda = c(-5.76,2.11),
  
  UseData = rep(1,dim(D)[1])
)

ll_FS<-matrix(NA,4000,78)
ll_Selfish<-matrix(NA,4000,78)
ll_Linear<-matrix(NA,4000,78)

set.seed(42)


# set up an index for the folds. 
folds<-rep(1:10,8)
folds<-folds[order(runif(80))][1:78]



for (ff in unique(folds)) {
  
  print(ff)
  
  dStan<-list(
    n = dim(D)[1],
    self_x = D$self_x,
    other_x = D$other_x,
    self_y = D$self_y,
    other_y = D$other_y,
    
    choice_x = D$choice_x,
    
    prior_alpha = c(0,0.67),
    prior_beta = c(0,0.67),
    prior_gamma = c(0,0.67),
    prior_lambda = c(-5.76,2.11),
    
    UseData = rep(1,dim(D)[1])
  )
  
  dStan$UseData[folds==ff]<-0
  
  Fit_FS<-model_FS |>
    sampling(data=dStan,seed=42)
  ll_FS[,ff==folds]<-extract(Fit_FS)$log_lik[,ff==folds]
  
  Fit_Linear<-model_Linear |>
    sampling(data=dStan,seed=42)
  ll_Linear[,ff==folds]<-extract(Fit_Linear)$log_lik[,ff==folds]
  
  Fit_Selfish<-model_Selfish |>
    sampling(data=dStan,seed=42)
  ll_Selfish[,ff==folds]<-extract(Fit_Selfish)$log_lik[,ff==folds]

}

ll_FS |>
  write.csv("Code/Validation/kfold_ll_FS.csv")
ll_Linear |>
  write.csv("Code/Validation/kfold_ll_Linear.csv")
ll_Selfish |>
  write.csv("Code/Validation/kfold_ll_Selfish.csv")
kfold_ll_FS<-"Code/Validation/kfold_ll_FS.csv" |> read.csv() |> dplyr::select(-X) |> as.matrix()
kfold_ll_Linear<-"Code/Validation/kfold_ll_Linear.csv" |> read.csv() |> dplyr::select(-X) |> as.matrix()
kfold_ll_Selfish<-"Code/Validation/kfold_ll_Selfish.csv" |> read.csv() |> dplyr::select(-X) |> as.matrix()


loo_compare(loo(kfold_ll_FS,cores=8),loo(kfold_ll_Selfish,cores=8),loo(kfold_ll_Linear,cores=8)) |>
  data.frame()|>
  rownames_to_column(var = "model") |>
  mutate(
    model = models[model]
  ) |>
  kbl( caption = "Model comparison using $k$-fold cross-validation") |>
  kable_classic(full_width=FALSE)
Table 8.5: Model comparison using \(k\)-fold cross-validation
model     elpd_diff   se_diff     elpd_loo    se_elpd_loo   p_loo      se_p_loo    looic      se_looic
Selfish    0.000000   0.0000000   -42.31369   6.220406      2.098414   0.9582934   84.62739   12.44081
Linear    -1.810640   0.8332116   -44.12433   5.871708      3.168026   0.9462041   88.24867   11.74342
FS        -4.303829   1.6452414   -46.61752   6.330098      4.929662   1.3753267   93.23504   12.66020

Table 8.5 shows the loo library’s summary of the model evaluation using \(k\)-fold cross-validation. Again, we select the Selfish model.

References

Bruhin, Adrian, Ernst Fehr, and Daniel Schunk. 2019. “The Many Faces of Human Sociality: Uncovering the Distribution and Stability of Social Preferences.” Journal of the European Economic Association 17 (4): 1025–69.
Fehr, Ernst, and Klaus M Schmidt. 1999. “A Theory of Fairness, Competition, and Cooperation.” The Quarterly Journal of Economics 114 (3): 817–68.
Gronau, Quentin F., Alexandra Sarafoglou, Dora Matzke, Alexander Ly, Udo Boehm, Maarten Marsman, David S. Leslie, Jonathan J. Forster, Eric-Jan Wagenmakers, and Helen Steingroever. 2017. “A Tutorial on Bridge Sampling.” Journal of Mathematical Psychology 81: 80–97. https://doi.org/10.1016/j.jmp.2017.09.005.
Gronau, Quentin F., Henrik Singmann, and Eric-Jan Wagenmakers. 2020. “bridgesampling: An R Package for Estimating Normalizing Constants.” Journal of Statistical Software 92: 1–29.
Meng, Xiao-Li, and Wing Hung Wong. 1996. “Simulating Ratios of Normalizing Constants via a Simple Identity: A Theoretical Exploration.” Statistica Sinica, 831–60.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.

  34. When using a sampling statement with a tilde (~), Stan drops constants of proportionality because they are not needed to simulate the posterior. However they are needed for computing the marginal likelihood of a model. ↩︎

  35. I mean two things here. Firstly, unlike posterior probabilities, we are not using Bayes’ rule to quantify uncertainty in our models. Secondly, cross-validation can equally be applied using Frequentist techniques. ↩︎

  36. For example, if you want to make predictions that minimize the expected squared error, then \(RSS_M\) may be a great choice for you.↩︎