I just attended AAPOR in LA to present our recent MrP local approximate weights paper (currently available here, see also Andrew Gelman’s post here). AAPOR was full of really wonderful people, and I made a few professional connections that I’m very glad to have made. AAPOR is full of researchers and professionals with deep practical knowledge and sharp intuition about the practice and analysis of surveys. But, as a somewhat solitary statistician at AAPOR, and moreover one there to present on MrP, I found the conference’s single-minded emphasis on weights to be surprising, and I thought it would be worth writing a post explaining why. (Though people at AAPOR of course do stats methodology, to my surprise I didn’t come across a single other person appointed to a statistics department.) Modeling sampling bias and modeling response seem to me two sides of a coin.

As always when sticking my neck out in a field I don’t know, I should emphasize that the point I’m about to make is surely not new, but my experience at AAPOR makes me think it’s still worth saying.

Setup

Here’s a simple classical survey setup. Suppose we have regressors \(x\) and responses \(y\), and two joint distributions, \(\mathcal{P}_T(x, y)\) and \(\mathcal{P}_S(x, y)\), standing for “target” and “survey,” respectively. Let the regressor marginals be \(\mathcal{P}_T(x)\) and \(\mathcal{P}_S(x)\), with ratio \(w(x) = \mathcal{P}_T(x) / \mathcal{P}_S(x)\), which we assume is finite and nonzero. Assume that \(\mathcal{P}_S(y\vert x) = \mathcal{P}_T(y\vert x) = \mathcal{P}(y\vert x)\), so we can unambiguously write \(\mathbb{E}\left[y \vert x \right]\). Suppose we have enough observations that we can safely apply a law of large numbers whenever we want.

Then the survey problem is this: we observe IID draws of \(x_i, y_i \sim \mathcal{P}_S(x, y)\) from the survey, but only \(x_j \sim \mathcal{P}_T(x)\) from the target. We want to know \(\mathbb{E}_{\mathcal{P}_T(y)}\left[y\right]\), but don’t directly observe draws \(y\sim \mathcal{P}_T(y)\). We also don’t directly observe either the response function \(\mathbb{E}\left[y \vert \cdot \right]\) or the sampling ratio \(w(\cdot)\).

The fundamental equation of survey sampling

A few lines of algebra show that what we want to know can be expressed a couple different ways:

\[
\begin{align*}
&\overbrace{\frac{1}{N_T} \sum_{j=1}^{N_T}\mathbb{E}\left[y \vert x_j \right]}^{\textrm{Eq R}}
\overset{LLN}{\approx}
\mathbb{E}_{\mathcal{P}_T(x)}\left[\mathbb{E}\left[y \vert x \right]\right]
\overset{Tower}{=}
\overbrace{\mathbb{E}_{\mathcal{P}_T(y)}\left[y\right]}^{\textrm{Want to know}}
=
\int \mathbb{E}\left[y \vert x \right] \mathcal{P}_T(x) dx = \\
&\int \frac{\mathcal{P}_T(x)}{\mathcal{P}_S(x)} \mathbb{E}\left[y \vert x \right] \mathcal{P}_S(x) dx
\overset{def}{=}
\underbrace{\int w(x) \mathbb{E}\left[y \vert x \right] \mathcal{P}_S(x) dx}_{\textrm{Eq S}}
\overset{Tower}{=}
\int w(x) y \mathcal{P}_S(x, y) dx \, dy
\overset{LLN}{\approx}
\underbrace{\frac{1}{N_S} \sum_{i=1}^{N_S}w(x_i) y_i}_{\textrm{Eq W}}.
\end{align*}
\]

I find this line of reasoning pretty clarifying.

Eq W (for “weighting”) is how you form weighting estimates.
Eq R (for “response”) is how you form MrP estimates.
To use Eq W, you observe \(y_i\), but need to estimate the unknown \(w(\cdot)\).
To use Eq R, you observe \(x_j\), but need to estimate the unknown \(\mathbb{E}\left[y \vert \cdot \right]\).
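
As a sanity check on the chain of equalities, here is a minimal simulation sketch. Everything in it is made up for illustration: \(x\) takes four values, the cell probabilities `p_s` and `p_t` and the response function `m` (standing in for \(\mathbb{E}\left[y \vert x \right]\)) are arbitrary, and both estimators are handed the true \(w(\cdot)\) and \(\mathbb{E}\left[y \vert \cdot \right]\), which you would never have in a real survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete regressor with four cells (illustrative values only).
p_s = np.array([0.4, 0.3, 0.2, 0.1])   # survey marginal P_S(x)
p_t = np.array([0.1, 0.2, 0.3, 0.4])   # target marginal P_T(x)
m = np.array([0.2, 0.5, 0.6, 0.9])     # E[y | x], shared by survey and target
w = p_t / p_s                          # true sampling ratio w(x)

n_s, n_t = 200_000, 200_000

# Survey: we see (x_i, y_i) drawn from P_S(x, y), with binary y.
x_s = rng.choice(4, size=n_s, p=p_s)
y_s = rng.binomial(1, m[x_s])

# Target: we see only x_j drawn from P_T(x).
x_t = rng.choice(4, size=n_t, p=p_t)

truth = p_t @ m                  # E_{P_T}[y], about 0.66 for these values
eq_r = m[x_t].mean()             # Eq R: average E[y | x_j] over target draws
eq_w = (w[x_s] * y_s).mean()     # Eq W: weighted average of survey responses
naive = y_s.mean()               # unweighted survey mean, biased for the target

print(f"truth={truth:.3f}  EqR={eq_r:.3f}  EqW={eq_w:.3f}  naive={naive:.3f}")
```

With these oracle inputs, Eq R and Eq W both land close to the target mean, while the unweighted survey mean does not. In practice neither \(w(\cdot)\) nor \(\mathbb{E}\left[y \vert \cdot \right]\) is known, which is where estimation comes in.
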

In fact, if \(y\) is binary, then both are amenable to the same class of regression tools:

A reasonable way to estimate \(w(\cdot)\) is via logistic regression on the indicator \(1(x\textrm{ is a survey sample})\), fit to the pooled survey and target draws of \(x\).
A reasonable way to estimate \(\mathbb{E}\left[y \vert \cdot \right]\) is via logistic regression with the response itself.
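
To make the parallel concrete, here is a rough sketch in which the same logistic fit is used twice, once for the weights and once for the response. The setup is an assumption chosen so that both logistic models are correctly specified: Gaussian \(x\) with a mean shift between survey and target, and a logistic \(\mathbb{E}\left[y \vert x \right]\). The helper names (`fit_logistic`, `predict`, `true_response`) are hypothetical, and the hand-rolled Newton fit just stands in for whatever regression software you prefer. The weights come from the fitted odds: if \(p(x)\) is the probability that a pooled draw is from the survey, then \(w(x) \approx (N_S / N_T) (1 - p(x)) / p(x)\).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, labels, n_iter=25):
    """Fit P(label = 1 | x) = sigmoid(b0 + b1 * x) by Newton's method."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (labels - p)                    # score of the log-likelihood
        hess = (X * (p * (1 - p))[:, None]).T @ X    # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def predict(beta, x):
    return sigmoid(beta[0] + beta[1] * x)

def true_response(x):
    # Assumed E[y | x]; unknown to the analyst in a real problem.
    return sigmoid(-0.3 + 0.8 * x)

# Assumed populations: the survey under-represents large x.
n_s, n_t = 20_000, 20_000
x_s = rng.normal(0.0, 1.0, size=n_s)   # survey regressors
x_t = rng.normal(0.5, 1.0, size=n_t)   # target regressors
y_s = rng.binomial(1, true_response(x_s))

# Weighting: regress the "is a survey sample" indicator on pooled x.
x_pool = np.concatenate([x_s, x_t])
is_survey = np.concatenate([np.ones(n_s), np.zeros(n_t)])
beta_w = fit_logistic(x_pool, is_survey)
p_survey = predict(beta_w, x_s)
w_hat = (n_s / n_t) * (1 - p_survey) / p_survey   # estimated w(x_i)
eq_w = np.mean(w_hat * y_s)                       # Eq W with estimated weights

# Response modeling: regress y on x in the survey, predict on the target.
beta_r = fit_logistic(x_s, y_s)
eq_r = np.mean(predict(beta_r, x_t))              # Eq R with estimated response

# Reference value of E_{P_T}[y] from a Riemann sum over a fine grid.
grid = np.linspace(-8.0, 9.0, 17_001)
dens_t = np.exp(-0.5 * (grid - 0.5) ** 2) / np.sqrt(2 * np.pi)
truth = np.sum(true_response(grid) * dens_t) * (grid[1] - grid[0])

print(f"truth={truth:.3f}  EqW={eq_w:.3f}  EqR={eq_r:.3f}")
```

Both routes use the same fitting routine and differ only in what plays the role of the label and where the fitted function is evaluated.
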

Finally, I highlighted Eq S, which I want to call “the fundamental equation of survey sampling.” (This is a blog, I get to be hyperbolic.) “S” is for “survey.” Eq S makes the symmetry between \(w(\cdot)\) and \(\mathbb{E}\left[y \vert \cdot \right]\) clear — our target of interest is a bilinear form with two components, both unknown: the weights and the response function.

More precisely, Eq S says how well you actually need to estimate \(w(\cdot)\) and \(\mathbb{E}\left[y \vert \cdot \right]\). Specifically, Eq S is an inner product in \(\mathcal{L}_2(\mathcal{P}_S)\), with respect to which ideas of “orthogonality” are well-defined. Eq S says:

To use Eq W, you don’t need to estimate \(w(\cdot)\) in directions orthogonal to \(\mathbb{E}\left[y \vert \cdot \right]\).
To use Eq R, you don’t need to estimate \(\mathbb{E}\left[y \vert \cdot \right]\) in directions orthogonal to \(w(\cdot)\).
When weighting, you need to balance exactly one covariate, \(\mathbb{E}\left[y \vert x \right]\).
When modeling response, you need to know the regression coefficient of exactly one covariate, \(w(x)\).
To be robust against adversarial \(\mathbb{E}\left[y \vert \cdot \right]\) when using Eq W, you have to estimate \(w(\cdot)\) really well.
To be robust against adversarial \(w(\cdot)\) when using Eq R, you have to estimate \(\mathbb{E}\left[y \vert \cdot \right]\) really well.
If \(w(\cdot)\) varies little, you don’t need to estimate \(\mathbb{E}\left[y \vert \cdot \right]\) really well.
If \(\mathbb{E}\left[y \vert \cdot \right]\) varies little, you don’t need to estimate \(w(\cdot)\) really well.
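
The orthogonality statements can be checked directly in a toy discrete case, where the \(\mathcal{L}_2(\mathcal{P}_S)\) inner product is just a weighted sum over cells. The sketch below uses made-up cell probabilities and a made-up response function `m`; the error vector `raw` is arbitrary. It splits an error in \(w(\cdot)\) into a piece orthogonal to \(\mathbb{E}\left[y \vert \cdot \right]\) and a piece parallel to it, and looks at how each piece moves Eq S.

```python
import numpy as np

# Hypothetical four-cell example (illustrative values only).
p_s = np.array([0.4, 0.3, 0.2, 0.1])   # survey marginal P_S(x)
p_t = np.array([0.1, 0.2, 0.3, 0.4])   # target marginal P_T(x)
m = np.array([0.2, 0.5, 0.6, 0.9])     # E[y | x]
w = p_t / p_s                          # true w(x)

def inner(f, g):
    """Inner product in L2(P_S): the integral of f * g against P_S."""
    return np.sum(p_s * f * g)

target = inner(w, m)                   # Eq S, equal to sum_x P_T(x) E[y | x]

# An arbitrary error in the weights, split relative to m.
raw = np.array([1.0, -2.0, 0.5, 3.0])
delta_par = inner(raw, m) / inner(m, m) * m   # component along E[y | .]
delta_orth = raw - delta_par                  # component orthogonal to E[y | .]

err_orth = inner(w + delta_orth, m) - target  # error from the orthogonal piece
err_par = inner(w + delta_par, m) - target    # error from the parallel piece

print(f"orthogonal error in w moves Eq S by {err_orth:.2e}")
print(f"parallel error in w moves Eq S by   {err_par:.2e}")
```

The orthogonal piece leaves Eq S unchanged up to floating point, however large it is, while the parallel piece shifts it by exactly its inner product with \(\mathbb{E}\left[y \vert \cdot \right]\). Swapping the roles of `w` and `m` gives the corresponding statement for Eq R.
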

Of course there are often practical considerations that justify modeling \(w(\cdot)\) or \(\mathbb{E}\left[y \vert \cdot \right]\), and these can be the subject of endless and meaningful debate. However, without articulating those reasons, focusing a discipline a priori exclusively on estimation of \(w(\cdot)\) without further elaboration seems strange to me in light of Eq S. In practice, it seems to me that one must make at least some assumptions about \(\mathbb{E}\left[y \vert \cdot \right]\) when modeling \(w(\cdot)\), or vice versa, implicitly or explicitly, since the adversarial case seems likely to be too hard in practice.