Can you use data to choose your prior? Some might say you can, pointing, for example, to empirical Bayes procedures, which formally do just that. But I would argue that the best answer to this question is no: such procedures are better thought of as estimating more complex models rather than selecting priors, and answering this question in the affirmative elides the subjectivity that is inherent in Bayesian modeling.
Does empirical Bayes eliminate the need to choose a prior?
In Bayesian statistics, we have to choose a prior, and sometimes we really don’t know what the prior should be. So a natural question is whether the dataset at hand can tell us what prior we should choose. And there are certainly well-known procedures, like empirical Bayes, that might give the impression that the answer is yes: given some set of candidate priors, we can choose the one that is most consistent with the data. Doing so is a natural continuation of the idea that our prior should, at a minimum, place appreciable mass on parameter values consistent with the data. Given this perspective, some folks might readily answer yes: you can use your data to pick your prior.
The difficulty is that the success of empirical Bayes (or, indeed, of the Bayesian model selection problem that empirical Bayes approximates) depends heavily on the limited expressiveness of the candidate priors you’re choosing from. If allowed to choose any prior at all, both empirical Bayes and Bayesian model selection will choose a degenerate prior that is a point mass at the maximum likelihood estimate (MLE), which of course leads to an absurd point mass posterior.
A simple normal counterexample
Take the following simple example. Suppose we have a single scalar observation, \(P(X | \theta) = N(X | \theta, 1)\). (You might think of \(X\) as a suitably rescaled sample mean from IID normal observations with known variance.) Let us use the conjugate prior \(P(\theta | \mu, \sigma) = N(\theta | \mu, \sigma^2)\), which is parameterized by \(\mu\) and \(\sigma\).
What happens if we use empirical Bayes to select \(\mu\) and \(\sigma\)? One common way to do so is to take
\[ \hat{\mu}, \hat{\sigma} = \underset{\mu, \sigma}{\textrm{argmax}} \, P(X | \mu, \sigma) = \underset{\mu, \sigma}{\textrm{argmax}} \int P(X | \theta) P(\theta | \mu, \sigma) \, d\theta. \]
In this case, standard properties of the normal distribution give \(P(X | \mu, \sigma) = N(X | \mu, 1 + \sigma^2)\). Irrespective of \(\sigma\), the optimal mean is \(\hat{\mu} = X\) (the MLE of \(\theta\)). Given \(\hat{\mu} = X\), the marginal probability \(P(X | X, \sigma)\) is maximized by shrinking the variance all the way to \(\sigma = 0\). The “empirical Bayes” prior is then \(N(\theta | X, 0)\): a degenerate point mass at the MLE.
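To see the collapse numerically, here is a minimal sketch. The observation value, the log-scale parameterization of \(\sigma\), and the use of scipy.optimize are illustrative choices of mine, not part of any canonical procedure.

```python
# A minimal numerical sketch of the collapse just described.
import numpy as np
from scipy import optimize, stats

x = 1.7  # a single scalar observation

def neg_log_marginal(params):
    """Negative log marginal likelihood, -log N(x | mu, 1 + sigma^2)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # optimize log(sigma) so sigma stays positive
    return -stats.norm.logpdf(x, loc=mu, scale=np.sqrt(1.0 + sigma**2))

result = optimize.minimize(neg_log_marginal, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"mu_hat    = {mu_hat:.4f}  (the MLE, i.e. X itself)")
print(f"sigma_hat = {sigma_hat:.6f}  (driven toward zero)")
```

The optimizer halts once \(\sigma\) is small enough that the gradient is negligible, but nothing in the objective itself stops \(\sigma\) from shrinking all the way to zero.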
This phenomenon is readily seen to be general. What measure \(P(\theta)\) maximizes the integral \(\int P(X | \theta) P(\theta) \, d\theta\)? It is the measure that puts all its mass on the MLE, \(\textrm{argmax}_\theta \, P(X | \theta)\). Bayesian model selection, which assigns posterior probability to each candidate prior in proportion to its marginal likelihood, exhibits the same pathology unless expressive priors are downweighted or the class of priors is restricted.
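To make the general claim precise, note that for any prior \(P(\theta)\) the marginal likelihood is bounded above by the maximized likelihood:

\[ \int P(X | \theta) P(\theta) \, d\theta \le \left( \sup_\theta P(X | \theta) \right) \int P(\theta) \, d\theta = P(X | \hat{\theta}_{\textrm{MLE}}), \]

and the bound is attained by a point mass at \(\hat{\theta}_{\textrm{MLE}}\).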
The prior class needs to be restricted a priori somehow
So empirical Bayes, if it does anything useful, does so because the class of priors it is permitted to choose from is sufficiently restricted to avoid this pathological behavior. For example, in our toy normal model above, we might fix \(\sigma\) at some large value and use empirical Bayes to choose only \(\mu\). The choice of how expressive the class of candidate priors should be is itself a decision that must be made a priori: it is tantamount to choosing a prior, and it unavoidably involves prior judgement.
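Continuing the toy example, here is a minimal sketch of that restricted version, with \(\sigma\) fixed a priori (the specific value \(\sigma = 10\) is an arbitrary illustrative choice) and only \(\mu\) chosen by empirical Bayes. The resulting posterior is a proper normal rather than a point mass.

```python
# A minimal sketch of the restricted procedure: sigma is fixed a priori and
# only mu is chosen by empirical Bayes.  The value sigma = 10 is arbitrary;
# fixing it is itself a prior judgement.
x = 1.7        # the single observation
sigma = 10.0   # fixed a priori, not estimated from the data

# With sigma fixed, the marginal likelihood N(x | mu, 1 + sigma^2) is
# maximized over mu at mu_hat = x.
mu_hat = x

# Standard conjugate normal update for theta, given x and the prior N(mu_hat, sigma^2).
post_var = 1.0 / (1.0 + 1.0 / sigma**2)         # posterior variance: less than 1, but nonzero
post_mean = post_var * (x + mu_hat / sigma**2)  # equals x here, since mu_hat = x

print(f"posterior: N(theta | {post_mean:.3f}, {post_var:.3f})  (proper, not a point mass)")
```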
Arguably, allowing the data to adjudicate between multiple “priors” is best thought of as fitting a more expressive model that uses a mixture of priors: a mixture model whose expressivity must still be limited a priori, either by limiting the class of candidate priors, by assigning expressive ones low (hyper-)prior probability, or both. This is, of course, a matter of semantics, but I think it is a valuable distinction, in that it reserves the word “prior” for things that are truly prior to the data, rather than for simply another level in the modeling hierarchy. And it emphasizes the fact that, like it or not, every Bayesian model requires real, subjective, a priori decisions.