In a reading group (on probabilistic forecasting!) we recently read Bissiri, Holmes, and Walker (2016). This paper is one of very few whose initial impact on me was a lifelong-memorable thrill. I just loved it. It’s at least an initial attack on what I think of as the fundamental problem of Bayesian inference: that even if you only care about a few aspects of the data generating process, you have to model the whole gosh darn thing. Unfortunately, I still think Bissiri, Holmes, and Walker (2016) is not the end of this road, even if it’s an exciting path. And I think there’s a particularly simple example that demonstrates its shortcomings: regression in the presence of heteroskedasticity.
Frequentists do heteroskedastic errors right
To me, the best illustration of the strength of frequentist over Bayesian methods is a simple and familiar one: linear regression with heteroskedastic errors. Frequentism has a super simple, elegant solution (the sandwich covariance), and Bayesian methods seem to require a whole lot of brittle, arbitrary, and computationally expensive model-making and estimation.
Here’s the simplest version of the problem. Consider an IID data generating process for pairs \((x_n, y_n)\) of the form \[ y_n = \theta^* x_n + \varepsilon_n, \] where all we know about \(\varepsilon_n\) is that \(\mathbb{E}\left[\varepsilon_n \vert x_n\right] = 0\) and \(\textrm{Var}\left(\varepsilon_n \vert x_n\right) = \sigma(x_n)^2 < \infty\), plus a handful of other minor technical conditions (convergence of \(\frac{1}{N} \sum_{n=1}^N\sigma(x_n)^2\), etc.). Suppose we only really want to know \(\theta^*\), and don’t care to model the heteroskedasticity \(\sigma(x_n)\), which we maybe think is extremely complicated.
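As a concrete running example, here’s a minimal simulation of such a DGP. The particular variance function \(\sigma(x) = 0.5 + 1.5|x|\) is my own hypothetical choice, standing in for whatever complicated heteroskedasticity you don’t want to model:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
theta_star = 2.0  # true slope

# Hypothetical DGP: the conditional standard deviation sigma(x) grows with
# |x|, a form we deliberately will *not* model when estimating theta.
x = rng.normal(size=N)
sigma = 0.5 + 1.5 * np.abs(x)          # sigma(x_n), unknown to the analyst
y = theta_star * x + sigma * rng.normal(size=N)
```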
If you’re a frequentist, for your point estimate you can just ignore the heteroskedasticity and compute \[ \hat{\theta}= \frac{\frac{1}{N} \sum_{n=1}^Ny_n x_n}{\frac{1}{N} \sum_{n=1}^Nx_n^2}. \] Heteroskedasticity enters the confidence intervals. To get confidence intervals, you set \(\hat{\varepsilon}_n = y_n - x_n \hat{\theta}\), compute \(M_x = \frac{1}{N} \sum_{n=1}^Nx_n^2\) and \(S = \frac{1}{N} \sum_{n=1}^Nx_n^2 \hat{\varepsilon}_n^2\), and use the central limit theorem to approximate the law \[ P(\sqrt{N}(\hat{\theta}- \theta^*)) \approx \mathcal{N}\left(0, M_x^{-1} S M_x^{-1}\right). \] The variance on the right hand side is the sandwich covariance, and it takes care of the heteroskedasticity for you, no modeling required. Your estimator \(\hat{\theta}\) may be inefficient relative to the minimum variance estimate that you could have computed if you knew (or could estimate) \(\sigma^2(x_n)\). But in exchange for this simplicity, you didn’t have to model \(\sigma^2(x_n)\) at all, and you get a very simple and easy-to-understand estimation procedure.
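The whole frequentist recipe fits in a few lines. A sketch (the data-generating choices are again my hypothetical example, not part of the method):

```python
import numpy as np

def ols_with_sandwich(x, y):
    """Slope estimate plus a heteroskedasticity-robust (sandwich) standard error."""
    N = len(x)
    theta_hat = np.sum(y * x) / np.sum(x ** 2)   # OLS through the origin
    eps_hat = y - theta_hat * x                  # residuals
    M_x = np.mean(x ** 2)                        # the "bread"
    S = np.mean(x ** 2 * eps_hat ** 2)           # the "meat"
    var_hat = S / (M_x ** 2 * N)                 # sandwich variance of theta_hat
    return theta_hat, np.sqrt(var_hat)

# Hypothetical heteroskedastic data, as in the DGP above.
rng = np.random.default_rng(0)
x = rng.normal(size=50_000)
y = 2.0 * x + (0.5 + 1.5 * np.abs(x)) * rng.normal(size=len(x))
theta_hat, se = ols_with_sandwich(x, y)
```

At no point did we write down a model for \(\sigma(x)\); the residuals do all the work.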
Generalized Bayes doesn’t really help
Suppose we have a prior \(p(\theta) \propto 1\) that’s noninformative. Standard Bayesian procedures might try to model \(\log \sigma(x_n) = \beta^\intercal\phi(x_n)\) for some featurization \(\phi\) and unknown vector \(\beta\). But of course this itself is surely misspecified, and now you have to estimate \(\beta\), too.
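For concreteness, here is a minimal sketch of that route, using joint maximum likelihood as a cheap stand-in for the full Bayesian treatment. The featurization \(\phi(x) = (1, |x|)\) is hypothetical, and note that you are now optimizing over \(\beta\) as well, for a variance family that is almost surely misspecified:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, x, y, phi):
    """Gaussian negative log-likelihood with log sigma(x) = beta @ phi(x)."""
    theta, beta = params[0], params[1:]
    log_sigma = phi(x) @ beta
    return np.sum(log_sigma + 0.5 * (y - theta * x) ** 2 / np.exp(2 * log_sigma))

# Hypothetical featurization; the true sigma(x) need not lie in this family.
phi = lambda x: np.column_stack([np.ones_like(x), np.abs(x)])

rng = np.random.default_rng(0)
x = rng.normal(size=5_000)
y = 2.0 * x + (0.5 + 1.5 * np.abs(x)) * rng.normal(size=len(x))

fit = minimize(neg_log_lik, x0=np.zeros(3), args=(x, y, phi))
theta_mle = fit.x[0]
```

Even in this toy version you pay for the nuisance model: extra parameters, an optimization (or MCMC) over them, and an answer whose quality depends on a featurization you never actually cared about.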
You might hope that Bissiri, Holmes, and Walker (2016) fixes this. After all, a key motivation is not having to model aspects of the generative process that you don’t care about. Let’s see what happens in this case if you just use square loss.
\[ \hat{\theta}= \mathrm{argmin}_{\theta} L(\theta; \mathrm{Data}) \quad\textrm{where}\quad L(\theta; \mathrm{Data}) := \frac{1}{2} \frac{1}{N} \sum_{n=1}^N(y_n - \theta x_n)^2. \] Then the generalized Bayes solution gives the generalized posterior (for some constant \(C\)): \[ \begin{aligned} p(\theta \vert \mathrm{Data}) \propto{}& \exp\left( - C \cdot L(\theta; \mathrm{Data}) \right) \\\propto{}& \exp\left( - C \left(\frac{1}{2}\frac{1}{N} \sum_{n=1}^Nx_n^2 \theta^2 - \frac{1}{N} \sum_{n=1}^Ny_n x_n \theta \right) \right) \\={}& \mathcal{N}\left(\hat{\theta}, \left(\frac{C}{N} \sum_{n=1}^Nx_n^2\right)^{-1}\right). \end{aligned} \] So we basically get something like the homoskedastic variance back, but up to an unknown constant! The arbitrariness of \(C\) means this is even less useful than assuming homoskedasticity. If you turned to generalized Bayes because you think there’s heteroskedasticity that’s not accounted for, then I don’t think any number of axioms of consistency are going to convince you that \(p(\theta \vert \mathrm{Data})\) is a good description of your subjective belief about \(\theta\).
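One way to see the problem numerically: pick the \(C\) that reproduces the homoskedastic answer (itself an arbitrary choice) and compare the resulting posterior standard deviation with the sandwich standard error on heteroskedastic data. The DGP below is my same hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000
x = rng.normal(size=N)
y = 2.0 * x + (0.5 + 1.5 * np.abs(x)) * rng.normal(size=N)  # heteroskedastic

theta_hat = np.sum(y * x) / np.sum(x ** 2)
eps_hat = y - theta_hat * x
M_x = np.mean(x ** 2)

# Generalized-Bayes posterior sd under square loss, with the (arbitrary)
# choice C = N / mean(eps_hat^2) that recovers the homoskedastic OLS answer.
C = N / np.mean(eps_hat ** 2)
gb_sd = np.sqrt(1.0 / (C * M_x))

# Sandwich sd, which adapts to the heteroskedasticity.
S = np.mean(x ** 2 * eps_hat ** 2)
sandwich_sd = np.sqrt(S / (M_x ** 2 * N))
```

On data like this the generalized-Bayes interval comes out substantially too narrow: no scalar \(C\) can recover the \(x\)-dependent reweighting that the sandwich covariance performs automatically.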
Conclusion
I would love to see a good solution to this problem that’s as easy as frequentist standard errors but still seems genuinely Bayesian. To me, that means solving something that really looks like an inverse problem, possibly with some error, over a wide class of potential forms of heteroskedasticity that I never have to explicitly specify. And just so I’m not only criticising other people, I’ll add that computing frequentist variability, as we do in Giordano and Broderick (2023), is still just a frequentist method — for example, it flexibly accounts for unmodeled heteroskedasticity but will not take into account other forms of latent uncertainty. There’s a lot of new work out there, so I might be missing something big. But I have yet to see a Bayesian solution that is comparably elegant to the sandwich covariance.