Over the course of two posts, I’d like to provide an intuitive walk-through of a proof of the Bayesian central limit theorem (BCLT, aka the Bernstein-von Mises theorem). I think most statisticians can probably sketch the arguments leading to the asymptotic normality of the maximum likelihood estimator off the top of their heads, but I suspect that few could do the same for the BCLT. I, at least, found a lot of the BCLT literature a bit intimidating and short on intuition. But there’s intuition to be had, and I’d like to take a stab at providing some of it.

I will focus on attributes of the posterior that can be expressed using posterior expectations, which includes both total variation distance (the most commonly used form of convergence) and series expansions of posterior means. My approach will be based principally on the proof of Theorem 1.4.2 in the book Bayesian Nonparametrics by Ghosh and Ramamoorthi, though the elements are to some degree common to all the proofs of the finite-dimensional BCLT that I am aware of. In particular, the division into the three regions I will describe in the next post is, as far as I am aware, the only game in town for the BCLT, though the precise details of how the regions are dealt with differ according to the needs of the authors.

Again, my goal is not to reproduce the proof in full detail, but rather to sketch the intuition underlying it. Consequently, I’ll be a little fast and loose about necessary conditions for standard asymptotic results, and will not always add qualifying filler like the phrase “under necessary regularity conditions.”

Throughout, I’ll assume we have independent and identically distributed (IID) data \(x=(x_1,\ldots,x_N)\), a parametric model \(p(x_n | \theta)\), and a prior density \(p(\theta)\) on the scalar parameter \(\theta\), defined with respect to the Lebesgue measure. Identical arguments work with multivariate parameters, where the Taylor series residuals are controlled by the integral form of the remainder rather than the mean value theorem. To avoid tedious notation, we’ll just focus on the scalar case.

In this first post, I’ll discuss the two key tools that go into the BCLT: the Taylor series, and uniform laws of large numbers. I’ll start by reviewing the use of these two tools in the classical proof of the asymptotic normality of the MLE, concluding with a discussion of why the arguments that work for the MLE are not enough for Bayesian posteriors. Hopefully, the reader will then find themselves motivated to resolve the cliffhanger with the next post!

Notation and Maximum Likelihood Estimators

Let us begin by reviewing the proof of the asymptotic normality of the maximum likelihood estimator (MLE). Versions of this proof are standard, and can be found, for example, in Chapter 6 of Theory of Point Estimation by Lehmann.

I’ll write log probabilities like so: \(\log p(x_n | \theta) = \ell(x_n | \theta)\). The MLE is defined as follows:

\[\begin{align*} \hat\theta :=& \mathrm{argmax}_\theta \frac{1}{N}\sum_{n=1}^N \ell(x_n | \theta)\\ \theta_0 :=& \mathrm{argmax}_\theta \mathbb{E}_{x_1}[\ell(x_1 | \theta)]. \end{align*}\]

Here, \(\mathbb{E}_{x_1}\) denotes expectation over a single datapoint for a fixed \(\theta\), so \(\mathbb{E}_{x_1}[\ell(x_1 | \theta)]\) is a function of \(\theta\) alone. One typically has to worry about whether \(\hat\theta\) and \(\theta_0\) are uniquely defined, and establish that \(\hat\theta \rightarrow \theta_0\) as \(N\) grows large. Let’s assume that we have resolved such concerns.

I’ll be using a Taylor series, and for that, it will be convenient to denote partial derivatives with subscripts, i.e.,

\[\begin{align*} \ell_{(k)}(x_1 \vert \theta) := \left.\frac{\partial^k \ell(x_1 \vert \theta)}{\partial \theta^k} \right|_\theta. \end{align*}\]

In this notation, we can form the following Taylor series expansion of the first derivative of the log likelihood around the true parameter:

\[\begin{align*} \frac{1}{N}\sum_{n=1}^N \ell_{(1)}(x_n \vert \hat\theta) = \frac{1}{N}\sum_{n=1}^N \ell_{(1)}(x_n \vert \theta_0) + \frac{1}{N}\sum_{n=1}^N \ell_{(2)}(x_n \vert \tilde{\theta}) (\hat\theta - \theta_0), \end{align*}\]

where we have used the mean value theorem to evaluate the Taylor series residual at some \(\tilde{\theta}\) where \(|\tilde{\theta} - \theta_0 | < |\hat\theta - \theta_0 |\).
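To make the role of \(\tilde{\theta}\) concrete, here is a minimal numerical sketch. The model is an assumption chosen purely for illustration (an exponential distribution with mean \(\theta\), which does not appear in the argument above), and the function names are hypothetical. It locates a \(\tilde{\theta}\) between \(\theta_0\) and \(\hat\theta\) by bisection and checks that the expansion holds exactly at that point.

```python
import numpy as np

# Hypothetical example model (not from the post): exponential with mean theta,
# so ell(x | theta) = -log(theta) - x / theta.
def score_mean(theta, x):
    """(1/N) sum_n ell_(1)(x_n | theta)."""
    return -1.0 / theta + x.mean() / theta**2

def hessian_mean(theta, x):
    """(1/N) sum_n ell_(2)(x_n | theta)."""
    return 1.0 / theta**2 - 2.0 * x.mean() / theta**3

rng = np.random.default_rng(0)
theta_0 = 2.0                       # true mean, which maximizes E[ell(x_1 | theta)]
x = rng.exponential(scale=theta_0, size=500)
theta_hat = x.mean()                # closed-form MLE for this model

# The mean value theorem asks for a theta_tilde whose averaged second
# derivative equals the slope of the averaged score between theta_0 and theta_hat.
slope = (score_mean(theta_hat, x) - score_mean(theta_0, x)) / (theta_hat - theta_0)

# For this model hessian_mean is increasing between theta_0 and theta_hat
# (it is increasing whenever theta < 3 * x.mean()), so bisection finds the point.
lo, hi = sorted([theta_0, theta_hat])
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if hessian_mean(mid, x) < slope:
        lo = mid
    else:
        hi = mid
theta_tilde = 0.5 * (lo + hi)

lhs = score_mean(theta_hat, x)
rhs = score_mean(theta_0, x) + hessian_mean(theta_tilde, x) * (theta_hat - theta_0)
print(theta_0, theta_tilde, theta_hat, lhs, rhs)
```

Note that \(\tilde{\theta}\) is computed from \(\hat\theta\) and the data, so it changes with every new sample.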

By definition of \(\hat\theta\), \(\frac{1}{N}\sum_{n=1}^N \ell_{(1)}(x_n \vert \hat\theta) = 0\), so we can rearrange the Taylor series expansion to get

\[\sqrt{N} (\hat\theta - \theta_0) = \frac{\frac{1}{\sqrt{N}}\sum_{n=1}^N \ell_{(1)}(x_n \vert \theta_0)} {- \frac{1}{N}\sum_{n=1}^N \ell_{(2)}(x_n \vert \tilde{\theta})}.\]

The left-hand side of the preceding display is the quantity we’re interested in: the scaled error of the MLE, whose limiting behavior we want to characterize. What is the limiting behavior of the right-hand side?

Uniform Laws of Large Numbers (ULLNs)

First, the easy term. Since \(\mathbb{E}_{x_1}[\ell_{(1)}(x_1 \vert \theta_0)] = 0\) by the first-order condition defining \(\theta_0\) (assuming we may exchange differentiation and expectation), the ordinary central limit theorem gives

\[\begin{align*} \Sigma :=& \mathrm{Cov}_{x_1}(\ell_{(1)}(x_1 \vert \theta_0)) \\ \frac{1}{\sqrt{N}}\sum_{n=1}^N \ell_{(1)}(x_n \vert \theta_0) \rightsquigarrow& \mathcal{N}(0, \Sigma). \end{align*}\]

For the denominator, we might wish we could use the fact that

\[\frac{1}{N}\sum_{n=1}^N \ell_{(2)}(x_n \vert \theta_0) \rightarrow \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \theta_0)] =: -\mathcal{I},\]

where the convergence to the negative Fisher information follows from the ordinary law of large numbers (LLN), since \(\theta_0\) is fixed, if unknown. (In correctly specified models, \(\Sigma = \mathcal{I}\), but I will keep them separate, in part because I am interested in misspecified models.)

However, our expression has \(\tilde{\theta}\), not \(\theta_0\). Recall that \(\tilde{\theta}\) is chosen based on \(\hat\theta\), and so, in general, depends on the data! Due to this data dependence, we cannot simply apply the LLN, and must turn to a stronger result, a uniform law of large numbers, or ULLN.

A ULLN says that, within some interval, the random function \(\theta \mapsto \frac{1}{N}\sum_{n=1}^N \ell_{(2)}(x_n \vert \theta)\) converges uniformly to the deterministic function \(\theta \mapsto \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \theta)]\). Specifically, for some ball \(B_\theta\) containing \(\theta_0\),

\[\sup_{\theta \in B_\theta} \left| \frac{1}{N}\sum_{n=1}^N \ell_{(2)}(x_n \vert \theta) - \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \theta)] \right| \rightarrow 0.\]

As written, this is actually a tad stronger than we need, since we know that \(\tilde{\theta} \rightarrow \theta_0\), which means that any ball \(B_\theta\) of fixed size will be “bigger than necessary.” Nevertheless, it is common to assume a ULLN, with the understanding that you are free to make \(B_\theta\) as small as necessary.
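As a rough illustration of what the ULLN asserts, the sketch below evaluates the supremum over a grid on a fixed ball for increasing \(N\). The exponential-with-mean-\(\theta\) model, the ball \([1.5, 2.5]\), and the function name `ell_2` are all assumptions made for the example, not part of the argument above.

```python
import numpy as np

# Hypothetical example model (not from the post): exponential with mean theta,
# for which ell_(2)(x | theta) = 1 / theta**2 - 2 * x / theta**3.
def ell_2(x, theta):
    return 1.0 / theta**2 - 2.0 * x / theta**3

rng = np.random.default_rng(1)
theta_0 = 2.0
ball = np.linspace(1.5, 2.5, 201)   # grid over B_theta = [1.5, 2.5], containing theta_0

# Under the true distribution E[x_1] = theta_0, so the limit function is known exactly.
expected = 1.0 / ball**2 - 2.0 * theta_0 / ball**3

sup_errors = {}
for n_obs in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=theta_0, size=n_obs)
    # Sample average of ell_(2) at every theta on the grid. ell_2 is linear in x,
    # so averaging x first gives the same value without an N-by-grid array.
    sample_avg = ell_2(x.mean(), ball)
    sup_errors[n_obs] = np.max(np.abs(sample_avg - expected))
    print(n_obs, sup_errors[n_obs])
```

In this particular model the supremum has the closed form \(2 |\bar{x} - \theta_0| / 1.5^3\), attained at the left edge of the ball, which makes it easy to see why the ball must stay away from \(\theta = 0\).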

The study of what conditions imply a ULLN is a rich topic. But an easily applicable criterion is that the function of interest be bounded by an integrable envelope. Specifically: a function \(f(\theta, x_n)\) obeys a ULLN in \(B_\theta\) if there exists a non-negative function \(M(x)\), with \(\mathbb{E}_{x_1}[M(x_1)] < \infty\), such that

\[\left| f(\theta, x_1) \right| \le M(x_1)\quad \textrm{for all possible }x_1\textrm{ and all } \theta \in B_\theta.\]

(See van der Vaart’s Asymptotic Statistics, Example 19.7, which actually states a somewhat weaker requirement.)

Classical proofs often assume an integrable envelope condition, even if they do not use it directly in a ULLN. Lehmann’s Theory of Point Estimation, for example, proves asymptotic normality of the MLE under the condition that \(\ell_{(3)}(x_n \vert \theta)\) has an integrable envelope, though he does not use this directly in a ULLN (see Eq. 15 in Theorem 2.3 of section 6.2). I find proofs to be simpler if you just lean on a ULLN, though this might be a matter of taste.

A ULLN basically allows you to ignore the dependence of \(\tilde{\theta}\) on the data, since \(\tilde{\theta}\) is eventually in any \(B_\theta\), and

\[\begin{align*} &\left| \frac{1}{N}\sum_{n=1}^N \ell_{(2)}(x_n \vert \tilde{\theta}) - \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \theta_0)] \right| \le \\ &\quad \left| \frac{1}{N}\sum_{n=1}^N \ell_{(2)}(x_n \vert \tilde{\theta}) - \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \tilde{\theta})] \right| + \left| \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \tilde{\theta})] - \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \theta_0)] \right| \le\\ &\quad \sup_{\theta \in B_\theta } \left| \frac{1}{N}\sum_{n=1}^N \ell_{(2)}(x_n \vert \theta) - \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \theta)] \right| + \left| \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \tilde{\theta})] - \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \theta_0)] \right| \rightarrow\\ &\quad 0 + 0, \end{align*}\]

where the first term follows from the ULLN and the second from assumed continuity of \(\theta \mapsto \mathbb{E}_{x_1}[\ell_{(2)}(x_1 | \theta)]\) and the consistency of \(\hat\theta\).

For the remainder of this stroll, we’ll take the above argument for granted, and simply treat quantities like \(\frac{1}{N}\sum_{n=1}^N \ell_{(2)}(x_n \vert \tilde{\theta})\) as if they were evaluated at a fixed \(\theta_0\) whenever we know \(\tilde{\theta}\) will eventually land in an arbitrarily small ball around \(\theta_0\).

One final observation about ULLNs is important: it is not reasonable to expect a ULLN to hold, in general, on non-compact sets. A simple counterexample is maybe the most useful way to illustrate this. Consider the function \(f(\theta, x_n) = (x_n - \theta)^2\). Should a ULLN apply to this function for \(\theta \in \mathbb{R}\)? No, because

\[\begin{align*} &\sup_{\theta \in \mathbb{R}}\left| \frac{1}{N} \sum_{n=1}^N (x_n - \theta)^2 - \mathbb{E}_{x_1}[(x_1 - \theta)^2] \right| =\\ &\quad \sup_{\theta \in \mathbb{R}}\left| \frac{1}{N} \sum_{n=1}^N x_n^2 - 2 \theta \frac{1}{N} \sum_{n=1}^N x_n + \theta^2 - \mathbb{E}_{x_1}[x_1^2] + 2 \theta \mathbb{E}_{x_1}[x_1] - \theta^2 \right| \ge\\ &\quad 2 \sup_{\theta \in \mathbb{R}}|\theta | \left| \frac{1}{N} \sum_{n=1}^N x_n - \mathbb{E}_{x_1}[x_1] \right| - \left| \frac{1}{N} \sum_{n=1}^N x_n^2 - \mathbb{E}_{x_1}[x_1^2] \right| = \infty. \end{align*}\]

No matter how close the observed data moments are to their population versions, the fact that \(\theta\) can take arbitrarily large values means the small errors in the \(x_n\) moments get magnified to arbitrarily large errors in the \(f(\theta, x_n)\) moments. This observation will be important for the BCLT.
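The counterexample can be seen numerically with a short sketch. The data distribution (normal with mean 1 and variance 1) and the helper name `sup_error` are assumptions made for illustration. Even with \(N = 100{,}000\) observations, the supremum of the error over \([-R, R]\) grows linearly in \(R\).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=1.0, size=100_000)   # assumed data: E[x] = 1, E[x^2] = 2

err_m1 = x.mean() - 1.0          # error in the first sample moment
err_m2 = np.mean(x**2) - 2.0     # error in the second sample moment

def sup_error(radius, n_grid=2001):
    """Sup over a grid on [-radius, radius] of the error in the average of (x - theta)^2.

    The theta^2 terms cancel, leaving err_m2 - 2 * theta * err_m1.
    """
    theta = np.linspace(-radius, radius, n_grid)
    return np.max(np.abs(err_m2 - 2.0 * theta * err_m1))

for radius in [1.0, 1e2, 1e4, 1e6]:
    print(radius, sup_error(radius))
```

Because the error is linear in \(\theta\), its supremum over \([-R, R]\) is exactly \(|\texttt{err\_m2}| + 2R\,|\texttt{err\_m1}|\), which is unbounded in \(R\) whenever the first-moment error is nonzero.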

Why is the BCLT harder than asymptotic normality of the MLE?

With the ULLN in hand, we can finally apply Slutsky’s theorem to get

\[\sqrt{N} (\hat\theta - \theta_0) \rightsquigarrow \mathcal{N}\left(0, \mathcal{I}^{-2} \Sigma \right).\]

Again, in correctly specified models, \(\Sigma = \mathcal{I}\) and the limiting variance is the classical \(\mathcal{I}^{-1}\).
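A small simulation can illustrate the difference between \(\mathcal{I}^{-2}\Sigma\) and \(\mathcal{I}^{-1}\) under misspecification. Everything specific here is an assumption chosen for the example: the data are Gamma distributed with mean \(\theta_0\), while the fitted model is exponential with mean \(\theta\), so that the MLE is the sample mean and \(\theta_0\) is the population mean.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_0, shape = 2.0, 4.0          # assumed true mean and Gamma shape
n_obs, n_reps = 500, 2000

# Misspecified setting (an assumption for illustration): Gamma data with mean
# theta_0, fitted with an exponential model with mean theta. The MLE is x-bar.
x = rng.gamma(shape, scale=theta_0 / shape, size=(n_reps, n_obs))
theta_hat = x.mean(axis=1)
scaled_error = np.sqrt(n_obs) * (theta_hat - theta_0)

# Population quantities for the exponential model, evaluated at theta_0:
#   ell_(1) = -1/theta + x/theta^2,   ell_(2) = 1/theta^2 - 2x/theta^3.
var_x = theta_0**2 / shape          # variance of the Gamma data
sigma = var_x / theta_0**4          # Sigma: variance of the score
info = 1.0 / theta_0**2             # I: minus the expected second derivative

sandwich = sigma / info**2          # I^{-2} Sigma
naive = 1.0 / info                  # I^{-1}, valid only under correct specification
print(scaled_error.var(), sandwich, naive)
```

With these assumed values, \(\mathcal{I}^{-2}\Sigma\) works out to the variance of the data, as it should for a sample mean, whereas \(\mathcal{I}^{-1}\) is four times larger.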

To prove the asymptotic normality of the MLE, we had to repeatedly use the closeness of \(\hat\theta\) to \(\theta_0\). In particular, we used the closeness of \(\hat\theta\) to \(\theta_0\) to:

  • Be able to apply a ULLN (since \(\hat\theta\) was in a compact set)
  • Exploit the continuity of the expected second derivative
  • Have the residual in the Taylor series go to zero.

Basically, for the MLE, we never had to consider the behavior of \(\ell(x_n | \theta)\) for \(\theta\) any further from \(\theta_0\) than an arbitrarily small ball, \(B_\theta\).

What about the BCLT? The Bayesian posterior expectation of some function \(\phi(\theta)\) is given by

\[\mathbb{E}[\phi(\theta) | x] = \frac{\int \phi(\theta) \exp(\sum_{n=1}^N \ell(x_n | \theta))p(\theta) d\theta} {\int \exp(\sum_{n=1}^N \ell(x_n | \theta))p(\theta) d\theta}.\]

Posterior expectations thus involve integrals of the form \(\int \phi(\theta) \exp(\sum_{n=1}^N \ell(x_n | \theta))p(\theta) d\theta\). In general, \(p(\theta)\) and \(\phi(\theta)\) are nonzero over all of \(\mathbb{R}\). The MLE argument will help us deal with the part of the Bayesian integral that is over an asymptotically shrinking neighborhood of \(\hat\theta\); indeed, in such a shrinking neighborhood, the arguments will be essentially the same.

It is the rest of the domain (typically a set with prior probability approaching one!) that requires special care for the BCLT. Indeed, the core of the BCLT consists in showing that, asymptotically, one can neglect all but a shrinking neighborhood of \(\hat\theta\) in posterior computations. It is these arguments that we will take up in the next post.
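To see the phenomenon in a case where everything can be computed, here is a minimal sketch that evaluates the posterior expectation by brute-force quadrature on a grid. The setup is an assumption for illustration only: a \(\mathcal{N}(\theta, 1)\) model with a \(\mathcal{N}(0, 10^2)\) prior, a grid on \([-20, 20]\) standing in for the real line, and a neighborhood of half-width \(5 / \sqrt{N}\) around \(\hat\theta\). It says nothing about how the general proof proceeds.

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs, prior_sd = 200, 10.0
x = rng.normal(loc=1.0, scale=1.0, size=n_obs)   # assumed data, true theta_0 = 1
theta_hat = x.mean()                              # MLE for the N(theta, 1) model

theta = np.linspace(-20.0, 20.0, 400_001)         # uniform grid standing in for the real line
log_lik = -0.5 * (np.sum(x**2) - 2.0 * theta * np.sum(x) + n_obs * theta**2)
log_prior = -0.5 * (theta / prior_sd) ** 2        # N(0, prior_sd^2) prior, up to a constant
log_post = log_lik + log_prior
weights = np.exp(log_post - log_post.max())       # subtract the max to avoid underflow

def posterior_expectation(phi_values):
    """Ratio of Riemann sums; the grid spacing and normalizing constants cancel."""
    return np.sum(phi_values * weights) / np.sum(weights)

post_mean = posterior_expectation(theta)

# Posterior and prior mass relative to a neighborhood of the MLE.
near = np.abs(theta - theta_hat) <= 5.0 / np.sqrt(n_obs)
post_mass_outside = posterior_expectation(~near)
prior_weights = np.exp(log_prior)
prior_mass_outside = np.sum(prior_weights[~near]) / np.sum(prior_weights)  # relative to the grid

print(post_mean, post_mass_outside, prior_mass_outside)
```

In this conjugate example the posterior mean is available in closed form, \(\sum_n x_n / (N + 1/10^2)\), which gives a check on the quadrature. The region outside the neighborhood carries most of the prior mass but almost none of the posterior mass.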