Our objective is to explain variation in the binary satisfaction measure, “Very Satisfied” versus “Not Very Satisfied”. Let \(Y_i \in \lbrace VS, NVS \rbrace\) denote respondent \(i\)’s outcome. To account for the dichotomous outcome measure we use a probit model of the form \[ {\rm Pr}(Y_i={\rm VS} |\theta_{c[i]},\beta,X_i) = F(\mu + \theta_{c[i]} + \beta'X_i), \] where \(F\) is the standard normal cdf, \(c[i]\) is \(i\)’s county of residence, and \(X_i\) is a vector of predictor variables. \(\theta_{c[i]}\) is a county effect for \(i\)’s county, while \(\beta\) captures the effects of the predictor variables.

A Multilevel Model for \(\theta,\beta\)

All of the predictor variables are coded as categorical using the categories shown in the data section. To illustrate the basic structure of the model, assume - for simplicity - that \(X\) consists only of INCOME and AGE. In this case the dimension of \(X_i\) equals the sum of the number of categories of INCOME and AGE, and \(X_i\) has two 1s in the appropriate locations. We can write this model as \[ {\rm Pr}(Y_i={\rm VS} |\theta_{c[i]},\beta,X_i) = F(\mu + \theta_{c[i]} + \beta^{inc}_{j[i]} + \beta^{age}_{k[i]}), \] where \(j[i]\) is the income category to which \(i\) belongs, while \(k[i]\) is the age category to which respondent \(i\) belongs. Expressing the model this way (i.e., in “effects” notation) is equivalent to the “X” notation above.
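The equivalence of the two notations can be checked with a small numerical sketch; the category counts and coefficient values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical illustration: 4 INCOME categories and 3 AGE categories.
K_inc, K_age = 4, 3
j, k = 1, 2                 # respondent i's income and age categories (0-indexed)

# "X" notation: a dummy vector with exactly two 1s
x = np.zeros(K_inc + K_age)
x[j] = 1.0                  # income dummy
x[K_inc + k] = 1.0          # age dummy

beta = np.array([0.1, 0.3, 0.5, 0.7,    # beta^inc
                 -0.2, 0.0, 0.2])       # beta^age

# "effects" notation: pick out the two relevant coefficients directly
effects = beta[j] + beta[K_inc + k]

assert np.isclose(x @ beta, effects)    # the two notations agree
```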

We impose a multilevel structure on the effects of the model. This serves two purposes. First, while the full sample size of the data is very large, we wish to explore the effect of possibly deep interactions of the variables. For example, what is the difference in the probability of reporting being “Very Satisfied” for a Hispanic Male aged 20-24 and a Black Female aged 30-34? The sample cell sizes quickly become rather small as we increase the number of interactions in the model. A multilevel structure on \(\beta\) and \(\theta\) allows us to use efficient shrinkage estimation by pooling information within exchangeable groups of parameters (this book is a good introduction to multilevel modeling). Second, by relying on a multilevel hierarchy of the parameters, we can employ fast Bayesian computational algorithms for parameter estimation. This second point is important since both the sample size and the parameter dimensionality are huge in this application.

To illustrate the multilevel specification, consider the basic model above with county, income and age effects. For the county effects we assume that \[ \theta_c \sim {\rm N}(\mu_\theta + \gamma'Z_c,\tau_\theta^{-1}), \] where \(Z_c\) is a vector of county-level predictors. Note that we can simply roll the \(Z_c\) vector into the \(X_i\) vector and, without loss of generality, assume that \[ \theta_c \sim {\rm N}(\mu_\theta,\tau^{-1}_\theta), \qquad c=1,\dots,K_c, \] where \(K_c\) is the number of counties. Similarly, for INCOME and AGE we assume \[ \beta^{inc}_j \sim {\rm N}(\mu_\mbox{inc},\tau^{-1}_\mbox{inc}), \qquad j=1,\dots,K_\mbox{inc}, \] \[ \beta^{age}_k \sim {\rm N}(\mu_\mbox{age},\tau^{-1}_\mbox{age}), \qquad k=1,\dots,K_\mbox{age}. \]
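The generative structure of these exchangeable priors can be sketched as follows; all hyperparameter values and group sizes here are made up for illustration, and note that the model is parameterized in precisions, so the normal standard deviation is \(\tau^{-1/2}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up hyperparameter values and group sizes, for illustration only.
mu_theta, tau_theta = 0.0, 4.0      # county-effect mean and precision
K_c = 10                            # number of counties

# County effects are exchangeable normal draws; sd = tau**(-1/2).
theta = rng.normal(mu_theta, tau_theta ** -0.5, size=K_c)

# INCOME and AGE effects follow the same pattern, each with its own (mu, tau).
mu_inc, tau_inc = 0.1, 2.0
K_inc = 5
beta_inc = rng.normal(mu_inc, tau_inc ** -0.5, size=K_inc)
```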

If the model also has an interaction, e.g., \[ {\rm Pr}(Y_i={\rm VS} |\theta_{c[i]},\beta,X_i) = F(\mu + \theta_{c[i]} + \beta^\mbox{female} X^\mbox{female}_i + \beta^\mbox{inc}_{j[i]} + \beta^\mbox{age}_{k[i]} + \beta^\mbox{female,inc}_{j[i]} X^\mbox{female}_i ), \] then the interaction is modelled in a similar fashion: \[ \beta^\mbox{female,inc}_j \sim {\rm N}(\mu_\mbox{female,inc},\tau^{-1}_\mbox{female,inc}), \qquad j=1,\dots,K_\mbox{inc}, \] where the gender effect is given a vague prior, e.g., \(\beta^\mbox{female} \sim N(0,10)\). We note that the model as specified above is over-parameterized: for example, we can add a constant to the intercept \(\mu\) and subtract the same constant from \((\mu_\theta,\mu_\mbox{inc},\mu_\mbox{age})\) without changing the likelihood function. We estimate the model in the over-parameterized form and then post-process the estimates to obtain identified quantities. This is similar to the procedure used by Shirley and Gelman (2014). The section on estimation below provides the details.
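A small numerical sketch of the over-parameterization (all values are arbitrary, and for brevity the constant is subtracted from the county effects directly rather than from their group-level mean):

```python
import numpy as np

# Arbitrary made-up values
mu = 0.5
theta = np.array([0.2, -0.1])       # county effects
c = 0                               # respondent's county
eta = mu + theta[c]                 # linear predictor (other terms omitted)

# Shift a constant from the county effects into the intercept:
delta = 1.7
mu_shift = mu + delta
theta_shift = theta - delta
eta_shift = mu_shift + theta_shift[c]

assert np.isclose(eta, eta_shift)   # linear predictor, hence likelihood, unchanged
```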

We estimate two different specifications. In the first model we include Gender, Age and Income with all interactions (for Income and Age we include only the interactions between low Income and Age and between high Income and Age). In the second model we add employment and health outcomes to the model.


Table 3. Model Specifications.
Model 1                     Model 2                     Multilevel?
County                      County                      Yes
Female                      Female                      No
Race                        Race                        No
Income                      Income                      Yes
Age                         Age                         Yes
Female x Race               Female x Race               No
Female x Income             Female x Income             Yes
Female x Age                Female x Age                Yes
Race x Income               Race x Income               Yes
Race x Age                  Race x Age                  Yes
Income_(0,10) x Age         Income_(0,10) x Age         Yes
Income_75+ x Age            Income_75+ x Age            Yes
Female x Race x Income      Female x Race x Income      Yes
Female x Race x Age         Female x Race x Age         Yes
                            Employment                  Yes
                            Race x Employment           Yes
                            Health                      Yes
                            Income_(0,10) x Health      Yes
                            Income_75+ x Health         Yes
                            Marital                     Yes
                            Female x Marital            Yes
                            County.Gini                 Yes
                            County.Income               Yes
                            County.Unemp                Yes
                            County.Religion             Yes
                            County.Race                 Yes
                            Race x County.Race          Yes


The total number of parameters contained in \((\theta,\beta)\) is 2122 for model 1 and 2240 for model 2.

Estimation

Estimating a probit model with more than 2,000 parameters and close to 1.4 million observations is a non-trivial task that conventional statistical software suites cannot handle. We use a customized Markov chain Monte Carlo (MCMC) algorithm for estimation. This section provides the technical details; readers mainly interested in the results can skip it and move on to the results section.

MCMC Algorithm

We use an algorithm of the Gibbs sampler variety. We start with the well-known observation that the probit model can be represented as \[ Y_i = \begin{cases} VS & \mbox{if } Y^*_i \geq 0, \\ NVS & \mbox{if } Y^*_i < 0 \end{cases}, \qquad {\rm where}~~ Y^*_i \sim {\rm N}(\mu + \theta_{c[i]} + \beta'X_i,1). \]
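The equivalence between the latent-variable representation and the probit probability can be checked by simulation; the value of the linear predictor below is made up:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

eta = 0.3            # made-up value of the linear predictor mu + theta + beta'x
n = 200_000

# Latent-variable representation: Y* ~ N(eta, 1), Y = VS iff Y* >= 0
y_star = rng.normal(eta, 1.0, size=n)
p_sim = np.mean(y_star >= 0)

# Direct probit probability: Pr(Y = VS) = F(eta), F the standard normal cdf
p_exact = 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

assert abs(p_sim - p_exact) < 0.01   # the two representations agree
```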

We take advantage of this representation by augmenting the parameter vector by \(Y^* = \lbrace Y^*_i \rbrace_{i=1}^n\). To finish the specification of the full model we need to specify a list of prior distributions for the remaining model parameters.

For the mean and precision parameters for the exchangeable groups (e.g., \(\mu_\mbox{inc}\) and \(\tau_\mbox{inc}\) above), we assume \[ \mu_\mbox{var} \sim {\rm N}(0,\tau^{-1}_0),~~\tau_\mbox{var} \sim {\rm G}(a_0,b_0), \] where \(var\) is a predictor variable or interaction of variables (e.g., \(\theta\), income, age, female x income etc.), and \(G(a_0,b_0)\) is the Gamma distribution with shape \(a_0\) and rate \(b_0\). We set \(\tau_0=0.1\) and \(a_0=3,b_0=0.02\). This is a mildly informative prior for \(\tau\) which encourages shrinkage without ruling out the possibility of very low values of \(\tau\). Finally, for the \(\beta\) elements without a multilevel structure (gender and race), we assume \[ \beta^\mbox{female} \sim {\rm N}(0,100), ~\beta^\mbox{black} \sim {\rm N}(0,100),~ \beta^\mbox{female,black} \sim {\rm N}(0,100) , ~~\mbox{etc.} \]
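A quick way to see what the \(G(3, 0.02)\) prior implies for the scale of the group effects is to simulate \(\tau\) and inspect the implied standard deviation \(\tau^{-1/2}\); note that NumPy's gamma sampler takes a scale parameter, the inverse of the rate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hyperparameters from the text: tau_var ~ Gamma(shape a0 = 3, rate b0 = 0.02)
a0, b0 = 3.0, 0.02
tau = rng.gamma(shape=a0, scale=1.0 / b0, size=100_000)  # NumPy scale = 1/rate
sd = tau ** -0.5        # implied prior standard deviation of the group effects

assert abs(tau.mean() - a0 / b0) < 5.0   # E[tau] = a0 / b0 = 150
print("implied effect sd, 5%/50%/95%:",
      np.round(np.quantile(sd, [0.05, 0.5, 0.95]), 3))
```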

Next we describe the details of the Gibbs sampler. This involves characterizing each of the conditional distributions of the joint data augmented posterior distribution.

  1. Sampling \(Y\). The conditional for \(Y^*_i\) is a truncated normal distribution: \[ Y^*_i \sim \begin{cases} {\rm TN}_{(0,\infty)}(\mu + \theta_{c[i]} + \beta'X_i,1) & \mbox{if } Y_i=\mbox{VS}, \\ {\rm TN}_{(-\infty,0)}(\mu + \theta_{c[i]} + \beta'X_i,1) & \mbox{if } Y_i=\mbox{NVS.} \\ \end{cases} \]

  2. Sampling \((\mu,\beta)\). The conditional for \(\tilde \beta \equiv (\mu,\beta)\) is a multivariate normal distribution: \[ \begin{eqnarray} \tilde \beta & \sim & {\rm N} \left( V \times ( X'Z(Y^*,\theta) + \Omega_{\tilde \beta} \pi_{\tilde \beta}), V \right), \\ V & = & \left( X'X + \Omega_{\tilde \beta} \right)^{-1}. \end{eqnarray} \] where \(X\) is the full \(N \times K\) matrix of predictor variables (including a column of ones for the intercept \(\mu\)), \(Z(Y^*,\theta)\) is an \(N\)-vector with element \(i\) equal to \(Z_i \equiv Y^*_i - \theta_{c[i]}\), and \(\Omega_{\tilde \beta}\) is a diagonal matrix of prior precisions arranged to be conformable with \(\tilde \beta\). For example, in the part of \(\tilde \beta\) where \(\beta^\mbox{inc}_j, j=1,\dots,K_\mbox{inc}\) are located, \(\Omega_{\tilde \beta}\) will have a corresponding \(\tau_\mbox{inc}\) (and \(\pi_{\tilde \beta}\) will have \(\mu_\mbox{inc}\) in the same locations).

  3. Sampling \(\theta\). To sample \(\theta_c\), note that conditional on \(Y^*\) and \(\tilde \beta\), the quantities \(W_i \equiv Y^*_i - \mu - \beta'X_i\) for respondents with \(c[i]=c\) are independent \({\rm N}(\theta_c,1)\) draws. Combining this with the prior \(\theta_c \sim {\rm N}(\mu_\theta,\tau^{-1}_\theta)\) gives the conditional \[ \theta_c \sim {\rm N}\left( \frac{\sum_{i:c[i]=c} W_i + \tau_\theta \mu_\theta}{n_c + \tau_\theta},\ (n_c + \tau_\theta)^{-1} \right), \] where \(n_c\) is the number of respondents in county \(c\).

  4. Sampling \(\tau,\mu\). To sample the group-level hyperparameters, note that each pair \((\mu_\mbox{var},\tau_\mbox{var})\) has a standard conjugate update. With \(K_\mbox{var}\) effects \(\beta^\mbox{var}_1,\dots,\beta^\mbox{var}_{K_\mbox{var}}\) in the group, \[ \mu_\mbox{var} \sim {\rm N}\left( \frac{\tau_\mbox{var} \sum_j \beta^\mbox{var}_j}{K_\mbox{var}\tau_\mbox{var} + \tau_0},\ (K_\mbox{var}\tau_\mbox{var} + \tau_0)^{-1} \right), \qquad \tau_\mbox{var} \sim {\rm G}\left( a_0 + \tfrac{K_\mbox{var}}{2},\ b_0 + \tfrac{1}{2}\sum_j (\beta^\mbox{var}_j - \mu_\mbox{var})^2 \right). \]
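The steps above can be sketched in a toy version of the sampler. The sketch below is not the authors' Fortran implementation: it simulates data from a stripped-down model with only an intercept and county effects (no \(\beta\)s) and runs the corresponding Gibbs updates; all sizes, seeds and true values are made up, and the truncated normal draws use simple rejection sampling rather than a production method:

```python
import numpy as np

rng = np.random.default_rng(3)

# --- simulate data from a stripped-down model (all values made up) ---------
K_c, n = 8, 4000
mu_true = 0.3
theta_true = rng.normal(0.0, 0.5, K_c)
county = rng.integers(0, K_c, n)
y = rng.normal(mu_true + theta_true[county], 1.0) >= 0   # True = "VS"

# --- prior hyperparameters from the text -----------------------------------
tau0, a0, b0 = 0.1, 3.0, 0.02

def draw_latent(mean, y, rng):
    """Step 1: Y*_i ~ N(mean_i, 1) truncated to the sign implied by Y_i."""
    out = np.empty_like(mean)
    todo = np.ones(mean.shape[0], dtype=bool)
    while todo.any():                        # simple rejection sampling
        d = rng.normal(mean[todo], 1.0)
        ok = (d >= 0) == y[todo]
        idx = np.flatnonzero(todo)[ok]
        out[idx] = d[ok]
        todo[idx] = False
    return out

mu, theta = 0.0, np.zeros(K_c)               # initial values
mu_theta, tau_theta = 0.0, 1.0
n_c = np.bincount(county, minlength=K_c)
keep = []

for it in range(400):
    # 1. latent utilities
    y_star = draw_latent(mu + theta[county], y, rng)
    # 2. intercept mu, with a vague N(0, 100) prior
    v = 1.0 / (n + 0.01)
    mu = rng.normal(v * np.sum(y_star - theta[county]), np.sqrt(v))
    # 3. county effects: precision n_c + tau_theta, data term = residual sums
    r = np.bincount(county, weights=y_star - mu, minlength=K_c)
    v_c = 1.0 / (n_c + tau_theta)
    theta = rng.normal(v_c * (r + tau_theta * mu_theta), np.sqrt(v_c))
    # 4. hyperparameters: conjugate normal and gamma updates
    v_m = 1.0 / (K_c * tau_theta + tau0)
    mu_theta = rng.normal(v_m * tau_theta * theta.sum(), np.sqrt(v_m))
    rate = b0 + 0.5 * np.sum((theta - mu_theta) ** 2)
    tau_theta = rng.gamma(a0 + K_c / 2.0, 1.0 / rate)   # NumPy scale = 1/rate
    if it >= 200:
        keep.append(mu + theta.mean())       # an identified quantity

print("posterior mean of mu + mean(theta):", round(float(np.mean(keep)), 3))
```

Keeping the identified combination \(\mu + \bar\theta\) rather than \(\mu\) alone mirrors the post-processing step: the two components drift under the over-parameterization, but their sum is pinned down by the data.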

Due to the large sample size we implemented the algorithm in Fortran using the Intel Visual Fortran XE Compiler 12.1, taking advantage of its parallel processing capabilities. The source code is available here.

Post-processing