\[ \def\F{F} \def\Fhat{\hat{\F}} \def\L{\mathcal{L}} \def\Lhat{\hat{\L}} \def\xvec{\vec{x}} \def\meann{\frac{1}{N} \sum_{n=1}^N} \]
The plug–in principle and Bayes
In frequentist statistics, we are used to the “plug–in principle”: the idea that the empirical distribution \(\Fhat\) is a reasonable plug–in estimator of the population distribution \(\F\). (See, for example, the introduction to Hall (2013).) The sample mean is the classic example. If we define the functional \(T(\F) = \int x \F(dx)\), which computes the mean, then the plug–in estimate \(T(\Fhat) = \int x \Fhat(dx) = \frac{1}{N} \sum_{n=1}^N x_n\) is the sample mean, an eminently reasonable estimator of \(T(\F)\).
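As a quick sketch of the plug–in mean (the population distribution and sample size below are my own illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def T(samples):
    """Plug-in mean functional: T(F_hat) = (1/N) sum_n x_n."""
    return float(np.mean(samples))

# Illustrative population: N(2, 1), so the true mean is T(F) = 2.0.
x = rng.normal(loc=2.0, scale=1.0, size=10_000)
print(T(x))  # close to the population mean 2.0
```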
It appears that Bayesian estimators don’t always respect the plug–in principle. I discuss a subtle example of this in Giordano and Broderick (2023), but model selection is an even starker example. Consider data \(x\), for which we have two models, \(M_1\) and \(M_2\), with respective log likelihoods \(\ell_1(x) = \log p(x| M_1)\) and \(\ell_2(x) = \log p(x| M_2)\). Suppose that a priori \(p(M_1) = p(M_2) = 1/2\), and that we want to do Bayesian model selection. Let \(\xvec = (x_1, \ldots, x_N)\), which are modeled as IID according to either \(M_1\) or \(M_2\), but which are in fact IID distributed according to \(\F\). By Bayes’ rule and a priori equiprobability,
\[ \begin{aligned} p(M_1 | \xvec) ={}& \frac{p(\xvec | M_1)}{p(\xvec | M_1) + p(\xvec | M_2)}. \end{aligned} \]
Now, if we let \(\Lhat_1 = \frac{1}{N} \sum_{n=1}^N\ell_1(x_n)\) and \(\Lhat_2 = \frac{1}{N} \sum_{n=1}^N\ell_2(x_n)\), we can rewrite \(p(M_1 | \xvec)\) as
\[ \begin{aligned} p(M_1 | \xvec) ={}& \frac{\exp(N \Lhat_1)}{\exp(N \Lhat_1) + \exp(N \Lhat_2)} \\={}& \frac{1}{1 + \exp(N (\Lhat_2 - \Lhat_1))}. \end{aligned} \]
This, too, can be written as a statistical functional, though one that depends on \(N\). Specifically, let \(K(\F) = \int \left(\ell_2(x) - \ell_1(x) \right) \F(dx)\). Then we can define
\[ p(M_1 | N, \F) = \frac{1}{1 + \exp(N K(\F))}, \]
which satisfies \(p(M_1 | \xvec) = p(M_1 | N, \Fhat)\), and we have written our Bayes model selection probability as a plug–in estimator. (We do this trick of representing Bayesian posteriors as statistical functionals in a general way in Giordano and Broderick (2023); this is a special case.) Indeed, the functional we care about is the difference in KL divergences \(K(\F) = \mathrm{KL}(\F || M_1) - \mathrm{KL}(\F || M_2)\), a perfectly sensible statistic for model selection. Note that when \(K(\F)\) is very negative — that is, when \(\F\) is closer to \(M_1\) than \(M_2\) — then \(p(M_1 | N, \F) \approx 1\).
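To make the identity concrete, here is a small sketch verifying that the direct Bayes computation and the plug–in form agree numerically. The two Gaussian models and the data-generating distribution are my own illustrative choices, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: data are IID N(0.1, 1); the candidate models are
# M1 = N(0, 1) and M2 = N(1, 1), with equal prior probability.
x = rng.normal(loc=0.1, scale=1.0, size=500)
N = len(x)

def log_lik(x, mu):
    """Log density of N(mu, 1) at each observation."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (x - mu) ** 2

ell1 = log_lik(x, 0.0)  # per-observation log likelihood under M1
ell2 = log_lik(x, 1.0)  # per-observation log likelihood under M2

# Direct Bayes' rule: p(M1 | x) = 1 / (1 + exp(sum(ell2) - sum(ell1))).
p_direct = 1.0 / (1.0 + np.exp(np.sum(ell2) - np.sum(ell1)))

# Plug-in form: K(F_hat) = mean(ell2 - ell1), then p(M1 | N, F_hat).
K_hat = np.mean(ell2 - ell1)
p_plugin = 1.0 / (1.0 + np.exp(N * K_hat))

print(p_plugin)  # near 1, since F is much closer (in KL) to M1 than M2
```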
The question is, does the plug–in principle work for the Bayes estimator?
The Bayesian plug–in principle failure and Berk’s paradox
It does not, at least not in general. A nice counterexample is given by Berk’s paradox. (Caveat: I learned of this paradox from an early draft of Huggins and Miller (2024) and have not looked up the original reference, so there may be a game of telephone going on here, but the paradox is real enough.)
Suppose that in fact \(KL(F || M_1) = KL(F || M_2)\), but that the models are distinct. For example, suppose that \(\F\) is a fair coin flip, \(M_1\) sets the heads probability to \(0.45\), and \(M_2\) sets the heads probability to \(0.55\). (Note that such a setup requires misspecification, but the plug–in principle doesn’t really require correct specification; the quantity you compute is supposed to measure what you want to know about the distribution \(\F\).) In this case, by symmetry, we get \(K(\F) = 0\), and
\[ p(M_1 | N, \F) = \frac{1}{1 + \exp(N \times 0)} = \frac{1}{2}, \]
which is what you’d expect by symmetry.
However, the plug–in estimate behaves very differently. By the CLT, under standard regularity conditions,
\[ Z := \frac{\sqrt{N} (\Lhat_2 - \Lhat_1)}{\sigma} \approx \mathrm{N}(0, 1) \quad\textrm{for some }\sigma > 0. \]
Then we can write
\[ p(M_1 | \xvec) = p(M_1 | N, \Fhat) = \frac{1}{1 + \exp(\sqrt{N} \sigma Z)}. \]
As \(N\) grows, \(p(M_1 | \xvec)\) is approximately either \(1\) or \(0\) according to whether \(Z < 0\) or \(Z > 0\), which occur asymptotically with equal probability. So the posterior is certain that the data was generated by \(M_1\) or by \(M_2\) — but which model it is absolutely certain about is effectively a coin flip, chosen anew with each dataset! The plug–in \(p(M_1 | N, \Fhat)\) thus behaves very differently from \(p(M_1 | N, \F)\).
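A quick simulation of the coin example makes this behavior visible (the seed, sample size, and number of replications here are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_m1(x):
    """Posterior p(M1 | x) for Bernoulli models with heads probabilities
    0.45 (M1) and 0.55 (M2), under equal prior model probabilities."""
    ell1 = x * np.log(0.45) + (1 - x) * np.log(0.55)
    ell2 = x * np.log(0.55) + (1 - x) * np.log(0.45)
    log_odds = np.clip(np.sum(ell2 - ell1), -700.0, 700.0)  # avoid overflow
    return 1.0 / (1.0 + np.exp(log_odds))

# Fair-coin data: the truth is equidistant (in KL) from both models.
N = 100_000
posteriors = np.array(
    [p_m1(rng.integers(0, 2, size=N)) for _ in range(20)]
)
print(np.round(posteriors, 3))
# Most replications are nearly certain of one model or the other,
# and which model wins varies from dataset to dataset.
```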
What happened?
The problem is clearly the factor \(N\) in front of the well-behaved statistical functional \(K(\Fhat)\). Arguably, this factor of \(N\) causes overconfidence in the presence of model misspecification. It’s worth noting that, without some such behavior, you would not be able to have model selection consistency for arbitrarily small \(K(\F)\), so eliminating this bad behavior seems irreconcilable with the infinite model selection power of the Bayesian posterior.
I don’t think this setup is entirely contrived despite relying on misspecification. For example, variable selection problems in linear regression (e.g. spike and slab models) will definitely suffer from this problem, in that small differences in \(O(1)\) sufficient statistics will be inflated in their posterior probabilities by \(O(\sqrt{N})\) terms, leading to a posteriori overconfidence — at least, relative to frequentist sampling variability, and arguably relative to reasonable subjective belief if we accept that our sparse linear models are misspecified. I may have to leave the details for another post.
I’m not quite sure how to resolve this paradox, or even whether it needs resolving at all. But I do think it highlights an interesting disconnect between Bayesian and frequentist statistical practice.