October 2023

The big picture

So you want to estimate a structural model \(\ldots\)

What are you going to do with it?

  • Interpret the parameters
  • Interpret a subset of the parameters
  • Comment on a function of the parameters

Each one of these is a different research question, and therefore might be best answered with a different experiment design

This paper: How can I design an experiment to estimate a structural model with a specific research question in mind?

Scope of this project

  • Maximum likelihood estimation
    • Currently thinking about a Bayesian equivalent of this
  • We are going to do something with the point estimates \(\hat\theta\) only
  • Estimating one model for each participant
    • Conceptually simple to extend to experiment-level estimation
    • Computationally challenging
  • I can formalize a prior \(p(\theta)\) about my model’s parameters
  • Static experiment design (i.e. not DOSE)
    • Another computational constraint

A utility-maximization problem for the experimenter

\[ \max_{d\in \mathbb D} V(d) \]

  • \(d\) is an experiment design
  • \(\mathbb D\) is the feasible set of experiment designs
  • \(V(d)\) is our expected utility of experiment design \(d\)
    • “How well do we expect design \(d\) to answer our research question?”

Some problems

\[ \max_{d\in \mathbb D} V(d) \]

  1. We need to choose \(V(d)\)
    • Some out-of-the-box suggestions based on statistical theory
    • An approximate utility function
  2. \(\mathbb D\) is large, so \(V(d)\) will be hard to maximize.
    • An exchange algorithm
  3. \(V(d)\) is an expectation over both experiment outcomes and participant heterogeneity
    • Use a prior over participant parameter heterogeneity
    • Approximate the sampling distribution of the estimator

Choosing a utility function \(V(d)\)

\(^*\)-optimal designs

Suggestions for making the estimates (in some sense) “as precise as possible”, based on the Fisher information matrix, which measures our estimates’ expected precision

  • \(\mathcal D\)-optimal: maximize the determinant of the information matrix
  • \(\mathcal A\)-optimal: minimize the trace of its inverse

Focuses on precision of estimates \(\hat\theta\), not predictions \(g(\hat\theta)\)
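As a concrete illustration, both criteria can be computed from the Fisher information matrix of a candidate design. The sketch below uses a simple binary logit model as a stand-in (not the RDU model used later); `logit_fisher_info`, `d_criterion`, and `a_criterion` are hypothetical names.

```python
import numpy as np

def logit_fisher_info(X, theta):
    # Fisher information for a binary logit with linear index X @ theta;
    # illustrative stand-in, not the paper's structural model
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    w = p * (1.0 - p)                      # per-observation logit variance
    return (X * w[:, None]).T @ X

def d_criterion(info):
    # D-optimality: maximize the (log-)determinant of the information matrix
    return np.linalg.slogdet(info)[1]

def a_criterion(info):
    # A-optimality: minimize tr(I^{-1}); negated so that bigger is better
    return -np.trace(np.linalg.inv(info))
```

Comparing two candidate designs then amounts to evaluating either criterion on each design's information matrix and keeping the larger value.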

Choosing a utility function \(V(d)\)

A more customizable approach

Specify a utility function \(v(\tilde\theta\mid\theta)\) that measures our utility of using point estimate \(\tilde\theta\) when the true parameters are actually \(\theta\).

Then, integrate out the uncertainty:

\[ \begin{aligned} V(d)&=\int_\Theta\left[\int_\Theta v(\tilde\theta\mid\theta)\,p(\tilde\theta\mid\theta,d)\,\mathrm d\tilde\theta\right]p(\theta)\,\mathrm d\theta=E\left[E\left[v(\tilde\theta\mid\theta)\big| \theta,d\right]\big| d\right] \end{aligned} \]

  • \(p(\tilde\theta\mid\theta,d)\) is the sampling distribution of our estimator given true parameters \(\theta\) and design \(d\).
  • \(p(\theta)\) is our prior belief about parameter \(\theta\)

Now we can focus on \(v(\tilde\theta\mid\theta)\)

\[ \begin{aligned} V(d)&=E\left[E\left[v(\tilde\theta\mid\theta)\big| \theta,d\right]\big| d\right] \end{aligned} \]

Squared prediction error of parameters:

\[ v(\tilde\theta\mid\theta)=-(\tilde\theta-\theta)^\top(\tilde\theta-\theta) \]

Squared prediction error of a function of parameters:

\[ v(\tilde\theta\mid\theta)=-(g(\tilde\theta)-g(\theta))^\top(g(\tilde\theta)-g(\theta)) \]

Expected log predictive density (for binary choice data)

\[ v(\tilde\theta\mid\theta)=\frac{1}{T}\sum_{t=1}^T\left[\Lambda(\Delta_{t})\log\Lambda(\tilde\Delta_t)+(1-\Lambda(\Delta_t))\log(1-\Lambda(\tilde\Delta_t))\right] \]

Whatever you like!
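The binary-logit ELPD utility above transcribes almost directly into code. A sketch, with hypothetical argument names: `delta_true` and `delta_est` are the latent utility differences \(\Delta_t\) and \(\tilde\Delta_t\).

```python
import numpy as np

def expit(x):
    # Logistic function Lambda(.)
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def v_elpd(delta_true, delta_est):
    # ELPD utility for binary logit choices: average over tasks of the
    # true choice probabilities times the log predicted probabilities
    p, q = expit(delta_true), expit(delta_est)
    return float(np.mean(p * np.log(q) + (1.0 - p) * np.log(1.0 - q)))
```

By Gibbs' inequality this is maximized, task by task, when the predicted probabilities equal the true ones.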

But how can we calculate the expectation?

\[ \begin{aligned} V(d)&=E\left[E\left[v(\tilde\theta\mid\theta)\big| \theta,d\right]\big| d\right] \end{aligned} \]

The outer expectation is over true parameters \(\theta\). We can use Monte Carlo integration to approximate this (use draws from the prior \(p(\theta)\)).

The inner expectation is over the sampling distribution of \(\tilde\theta\mid d,\theta\). I use an approximation for this.
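The outer Monte Carlo step is a plain average over prior draws. A minimal sketch, where `prior_draw` and `inner_expectation` are hypothetical stand-ins for the prior sampler and for whatever approximation of the inner expectation is used:

```python
import numpy as np

def V_hat(d, inner_expectation, prior_draw, S=1000, seed=0):
    # Monte Carlo approximation of the outer expectation over theta:
    # average the (approximated) inner expectation over S prior draws
    rng = np.random.default_rng(seed)
    return np.mean([inner_expectation(d, prior_draw(rng)) for _ in range(S)])
```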

Approximating \(v(\tilde\theta\mid\theta)\)

\[ \begin{aligned} v(\tilde\theta\mid\theta)&\approx v(\theta\mid\theta)+(\tilde\theta-\theta)^\top\frac{\partial v(\theta\mid\theta)}{\partial \tilde\theta}+\frac{1}{2}(\tilde\theta-\theta)^\top\frac{\partial^2 v(\theta\mid\theta)}{\partial \tilde\theta\partial \tilde\theta^\top}(\tilde\theta-\theta) \end{aligned} \]

  • First term: not a function of \(\tilde\theta\)
  • Second term: zero – FOC of a utility function
  • Omitted terms: zero if we can write \(v(\tilde\theta\mid\theta)=(\tilde\theta-\theta)^\top A(\theta)(\tilde\theta-\theta)\)

  • Third term: take the expectation!

\[ \begin{aligned} E\left[v(\tilde\theta\mid\theta)\mid d,\theta\right] &\approx v(\theta\mid\theta)+\frac{1}{2}E\left[(\tilde\theta-\theta)^\top\frac{\partial^2 v(\theta\mid\theta)}{\partial \tilde\theta\partial \tilde\theta^\top}(\tilde\theta-\theta)\mid d,\theta\right] \end{aligned} \]

Theorem: Let \(X\) be an \(n\times 1\) random vector with mean \(\mu\) and covariance \(\Sigma\), and let \(A\) be a symmetric \(n\times n\) matrix. Then, the expectation of the quadratic form \(X^\top A X\) is \(\mu^\top A\mu + \mathrm{tr}(A\Sigma)\)
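The theorem is easy to verify by simulation. The values below are purely illustrative:

```python
import numpy as np

# Numerical check of E[X'AX] = mu'A mu + tr(A Sigma)
rng = np.random.default_rng(42)
mu = np.array([1.0, -0.5, 2.0])
L = np.array([[1.0, 0.0, 0.0], [0.3, 1.0, 0.0], [-0.2, 0.5, 1.0]])
Sigma = L @ L.T                       # a valid (positive definite) covariance
A = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, -0.3], [0.0, -0.3, 1.5]])  # symmetric

X = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.einsum("ij,jk,ik->i", X, A, X).mean()   # Monte Carlo E[X'AX]
exact = mu @ A @ mu + np.trace(A @ Sigma)
```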

\[ \begin{aligned} E[v(\tilde\theta\mid\theta)\mid d,\theta]&\approx v(\theta\mid\theta)+\frac12E(\tilde\theta-\theta)^\top \frac{\partial^2v(\theta\mid\theta)}{\partial \tilde\theta\partial\tilde\theta^\top}E(\tilde\theta-\theta)\\ &\quad+\frac{1}{2}\mathrm{tr}\left(\frac{\partial^2v(\theta\mid\theta)}{\partial \tilde\theta\partial\tilde\theta^\top}\mathcal I^{-1}(\theta)\right) \end{aligned} \]

Assume the bias term \(E(\tilde\theta-\theta)\) is negligible.

  • Probably the most heroic assumption in this process
  • Working on incorporating bias (watch this space)
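Under the negligible-bias assumption, evaluating the inner expectation reduces to one line given the utility Hessian and the Fisher information. A sketch with hypothetical argument names:

```python
import numpy as np

def approx_inner_expectation(v_hess, fisher_info, v_at_truth=0.0):
    # E[v(theta_tilde | theta) | d, theta] ≈ v(theta | theta)
    #   + (1/2) tr( [d^2 v / d theta_tilde d theta_tilde'] @ I(theta)^{-1} ),
    # assuming the bias E(theta_tilde - theta) is negligible
    return v_at_truth + 0.5 * np.trace(v_hess @ np.linalg.inv(fisher_info))
```

With \(v(\tilde\theta\mid\theta)=-(\tilde\theta-\theta)^\top(\tilde\theta-\theta)\) the Hessian is \(-2I_n\), so this returns \(-\mathrm{tr}(\mathcal I^{-1}(\theta))\): exactly the \(\mathcal A\)-optimality criterion discussed below.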

Interpretation in a single-parameter model

\[ \begin{aligned} E(v(\tilde\theta\mid\theta)\mid d,\theta)&\approx v(\theta\mid \theta)+\frac{1}{2}\underbrace{\frac{\partial^2v(\theta\mid\theta)}{\partial \tilde\theta^2}}_{<0}V(\tilde\theta) \end{aligned} \]

Minimize the (asymptotic) variance of the estimator

For more than one parameter, we have a “weighted variance”

Relationship to the \(^*\)-Optimal designs

\[ \begin{aligned} E[v(\tilde\theta\mid\theta)\mid d,\theta]&\approx v(\theta\mid\theta)+\frac{1}{2}\mathrm{tr}\left(\frac{\partial^2v(\theta\mid\theta)}{\partial \tilde\theta\partial\tilde\theta^\top}\mathcal I^{-1}(\theta)\right) \end{aligned} \]

If \(v(\tilde\theta\mid\theta)=-(\tilde\theta-\theta)^\top(\tilde\theta-\theta)\), then maximizing \(V(d)\) recovers the \(\mathcal A\)-optimal design (minimize the trace of the inverse information matrix)

An application: Designing a battery of pairwise lottery choices in the Marschak-Machina triangle

Experimental setup: The design constraints \(\mathbb D\)

  • 80 pairwise choices between a “Left” and a “Right” lottery
  • Three possible prizes
    • \(x_1=1.0\), \(x_2=0.5\), \(x_3=0.0\)
  • The line segment connecting each lottery pair must form an angle of between \(5^\circ\) and \(85^\circ\) with the horizontal
  • Setup mimics Harrison and Ng (2016) – will use as a comparison

A structural model

Rank-dependent utility (RDU) (Quiggin 1982) with CRRA utility, Prelec (1998) probability weighting, and logit choice

\[ \begin{aligned} U_i(L)&=\sum_{k=1}^K \pi_{i,k}^L(x^L_k)^{r_i}\\ \pi_{i,k}^L&=\omega_i\left(\sum_{j=1}^kp_{j}^L\right)-\omega_i\left(\sum_{j=1}^{k-1}p_{j}^L\right)\\ \omega_i(p)&=\exp\left(-(-\log\rho_i)^{1-\psi_i}(-\log p)^{\psi_i}\right)\\ \Pr\left(y_{i,t}=L\mid L_t,R_t,\theta\right)&=\frac{\exp(\lambda_i U_i(L_t))}{\exp(\lambda_i U_i(L_t))+\exp(\lambda_i U_i(R_t))} \end{aligned} \]
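The model transcribes into code almost line by line. A sketch, assuming prizes within each lottery are ordered consistently with the cumulative sums in the weighting formula; function names are mine:

```python
import numpy as np

def prelec_w(p, rho, psi):
    # Two-parameter Prelec (1998) weighting as parameterized above;
    # note w(rho) = rho, and psi = 1 gives w(p) = p (expected utility)
    p = np.clip(p, 1e-12, 1.0)
    return np.exp(-((-np.log(rho)) ** (1.0 - psi)) * (-np.log(p)) ** psi)

def rdu_utility(x, p, r, rho, psi):
    # Decision weights are first differences of the weighted cumulative
    # probabilities; CRRA utility of prizes
    w = prelec_w(np.cumsum(p), rho, psi)
    pi = np.diff(np.concatenate(([0.0], w)))
    return float(np.sum(pi * np.asarray(x) ** r))

def pr_choose_left(xL, pL, xR, pR, r, rho, psi, lam):
    # Logit choice probability of the Left lottery
    dU = lam * (rdu_utility(xL, pL, r, rho, psi) - rdu_utility(xR, pR, r, rho, psi))
    return 1.0 / (1.0 + np.exp(-dU))
```

Setting \(\psi_i=1\) collapses the weighting function to the identity, which is the expected-utility restriction tested below.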

Design goals

Estimate a certainty equivalent as precisely as possible:

  • 50% chance of largest prize, 50% chance of smallest prize

\[ v^\text{CE}(\tilde\theta\mid\theta) = -(C(\tilde\theta)-C(\theta))^2 \]

Test the expected utility parameter restriction (\(\psi=1\))

\[ v^\psi(\tilde\theta\mid\theta)=-(\tilde\psi-\psi)^2 \]

Predict decisions in another experiment (Harrison and Ng 2016)

  • Expected log pointwise prediction density (ELPD)

\[ v^\text{ELPD}(\tilde\theta\mid\theta)=\frac{1}{T_b}\sum_{t=1}^{T_b}\left[\Lambda(\Delta_t)\log\Lambda(\tilde\Delta_t)+(1-\Lambda(\Delta_t))\log(1-\Lambda(\tilde\Delta_t))\right] \]

Calibrating a prior \(p(\theta)\)

Bayesian hierarchical model using data from Harrison and Swarthout (2023) (undergrad participants only)

\[ \begin{pmatrix} \log r_i& \log\psi_i & \Phi^{-1}(\rho_i) & \log \lambda_i \end{pmatrix}^\top\overset{\text{iid}}{\sim} N\left(\mu , \mathrm{diag}(\tau)\,\Omega\,\mathrm{diag}(\tau)\right) \]

Posterior means with posterior standard deviations in parentheses.

                  \(\log(r)\)      \(\log(\psi)\)   \(\Phi^{-1}(\rho)\)  \(\log(\lambda)\)
\(\mu\)           -0.68 (0.051)    -0.251 (0.033)   -0.408 (0.118)       2.512 (0.035)
\(\tau\)          0.697 (0.05)     0.433 (0.029)    1.232 (0.107)        0.439 (0.032)
corr: \(r\)                        0.004 (0.107)    0.112 (0.133)        -0.212 (0.091)
corr: \(\psi\)                                      -0.21 (0.085)        -0.015 (0.09)
corr: \(\rho\)                                                           0.158 (0.109)
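Plugging the posterior means into the hierarchical distribution gives a simple prior sampler. A sketch that ignores posterior uncertainty in \(\mu\), \(\tau\), and \(\Omega\) (a full implementation would propagate it):

```python
import numpy as np
from math import erf, sqrt

# Posterior means from the table above; order: (log r, log psi, Phi^{-1}(rho), log lambda)
mu = np.array([-0.68, -0.251, -0.408, 2.512])
tau = np.array([0.697, 0.433, 1.232, 0.439])
Omega = np.array([
    [ 1.0,    0.004,  0.112, -0.212],
    [ 0.004,  1.0,   -0.21,  -0.015],
    [ 0.112, -0.21,   1.0,    0.158],
    [-0.212, -0.015,  0.158,  1.0  ],
])
Sigma = np.diag(tau) @ Omega @ np.diag(tau)

def draw_theta(rng, size):
    # Sample on the transformed scale, then map back to (r, psi, rho, lambda)
    z = rng.multivariate_normal(mu, Sigma, size=size)
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z[:, 2] / sqrt(2.0)))  # standard normal CDF
    return np.column_stack([np.exp(z[:, 0]), np.exp(z[:, 1]), Phi, np.exp(z[:, 3])])
```

The transforms guarantee \(r\), \(\psi\), \(\lambda > 0\) and \(\rho \in (0,1)\) for every draw.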

The designs – in the MM triangle

The designs – slope and length

Monte Carlo results – overview

  1. Draw parameter \(\theta\) from prior
  2. Simulate data
  3. Estimate parameter \(\tilde\theta\)
    • Calculate certainty equivalent
    • Predict choices in Harrison and Ng (2016)
    • Test EU restriction using likelihood ratio and Wald test

Monte Carlo results – Parameter recovery

Median prediction error

experiment   CE      lambda   psi     r       rho
CE           0.047   4.832    0.201   0.167   0.190
ELPD         0.054   4.449    0.151   0.140   0.185
HN2016       0.062   6.861    0.182   0.238   0.209
psi          0.066   4.596    0.141   0.148   0.200

Monte Carlo results – Predicting Harrison and Ng (2016)

Larger numbers are better

experiment   -log10(-ELPD)   log10(sd)
CE           -16.99          14.993427
ELPD          -9.74           7.642208
HN2016       -12.84          10.834724
psi          -22.48          20.476469

Monte Carlo results – Testing EU (\(\psi=1\))

Test power marginalized over prior

experiment   LR      Wald
CE           0.264   0.065
ELPD         0.374   0.120
HN2016       0.394   0.121
psi          0.410   0.136

Discussion

Computational feasibility? Yes!

On my laptop (bought in 2017)

Estimating the hierarchical model: A couple of days

  • Can skip this if you can formalize your priors in other ways

Optimizing the designs: Overnight, using an exchange algorithm. Read the paper for more details on this!

  • ELPD experiments were the longest to design

Monte Carlo simulation: Overnight

Different research questions imply different experiment designs

Maybe obvious, but important

Importance of formalizing the research question and the structural model(s) before running the experiment

Experiments designed to answer a different research question can still be used to estimate structural models, but they will not answer your research question as well.

Corollary: “implications of theory” tests are not necessarily good designs to estimate structural models

Example: Expected Utility Theory \(\implies\) common ratio property

  • Test the implication: need many parallel line segments in MM triangle (Loomes and Sugden 1998)
  • This may come at the expense of estimating our risk-aversion parameter precisely
  • The structural (parametric) test of Expected Utility Theory is \(\psi=1\)

These tests have different alternative hypotheses:

  • Implication: Not expected utility theory
  • Structural: Rank dependent utility

Appendix

Some things I learned along the way – I

For large \(x\):

\[ \begin{aligned} \log(1+\exp(x))&\approx x\\ \log(1+\exp(x))&=\begin{cases} \log(1+\exp(x))&\text{ according to math}\\ \mathrm{Inf} &\text{ according to my computer} \end{cases} \end{aligned} \]
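The standard fix is the identity \(\log(1+\exp(x))=\max(x,0)+\log(1+\exp(-|x|))\), which never exponentiates a large positive number. A sketch:

```python
import numpy as np

def log1pexp(x):
    # Numerically stable log(1 + exp(x)): exp(x) overflows for large x,
    # but log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)) never does
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))
```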

Some things I learned along the way – II

A randomly-generated experiment will most likely produce very biased estimators!

  • The bias term dominates for randomly-generated starting designs, but is of the same order as the variance for well-designed experiments
  • Pick some good starting values!
  • Start targeted experiments at designs obtained by minimizing the mean squared error

An experiment optimized for one goal will probably be terrible for another goal

  • Experiment that minimized the MSE of \(\psi\) had terrible sampling properties for the other parameters.

Laptop specs

  • Intel(R) Core(TM) i7-7660U @2.50GHz, 2496 Mhz, 2 Core(s), 4 Logical Processor(s)
  • Windows 10 Pro 10.0.18363
  • 16.0 GB Installed Physical Memory (didn’t need this much)
  • Stan version 2.26.1 (development version) used for optimization
    • No within-chain parallelization used

Exchange algorithm

Start with an initial experiment design \(d\), and partition it into \(T\) elements so it can be expressed as \(d=\{d_t\}_{t=1}^T\)

  1. For each element \(d_t\) of the partition, evaluate the value of the experiment with that element removed. That is, compute \(V(d-d_t)\) for all \(t\)
  2. For the \(d_t\) that yields the largest \(V(d-d_t)\), choose \(d'\) to maximize \(V(d-d_t+d')\)
  3. The new experiment design is \(d-d_t+d'\); go back to step 1
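One pass of this loop can be sketched as follows. The stopping rule is my addition (the steps above loop indefinitely): stop when the swap no longer improves \(V\).

```python
def exchange(design, candidates, V, max_iter=100):
    # Greedy exchange over a design expressed as a list of elements.
    # Sketch only: V scores a design (bigger is better); candidates is
    # the pool of elements d' that may be swapped in.
    d = list(design)
    best = V(d)
    for _ in range(max_iter):
        # Step 1: element whose removal leaves the most valuable design
        t = max(range(len(d)), key=lambda i: V(d[:i] + d[i + 1:]))
        reduced = d[:t] + d[t + 1:]
        # Step 2: best replacement for the removed element
        new = reduced + [max(candidates, key=lambda c: V(reduced + [c]))]
        if V(new) <= best:          # stopping rule (my addition)
            return d
        d, best = new, V(new)
    return d
```

On a toy problem where \(V\) penalizes the squared distance of the design's sum from a target, the loop converges in a couple of exchanges.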

References

References

Harrison, Glenn W, and Jia Min Ng. 2016. “Evaluating the Expected Welfare Gain from Insurance.” Journal of Risk and Insurance 83 (1): 91–120.

Harrison, Glenn W, and Todd Swarthout. 2023. “Cumulative Prospect Theory in the Laboratory: A Reconsideration.” In Research in Experimental Economics: Models of Risk Preferences: Descriptive and Normative Challenges, edited by G. W. Harrison and D. Ross. Bingley, UK: Emerald.

Loomes, Graham, and Robert Sugden. 1998. “Testing Different Stochastic Specifications of Risky Choice.” Economica 65 (260): 581–98.

Prelec, Drazen. 1998. “The Probability Weighting Function.” Econometrica, 497–527.

Quiggin, John. 1982. “A Theory of Anticipated Utility.” Journal of Economic Behavior & Organization 3 (4): 323–43.