If you’re concerned about the “reproducibility crisis” in statistics, you’re probably worried about p-hacking. You might also have the impression that Bayesian analysis is somehow “immune” to p-hacking. I am writing this post in an attempt to dispel that myth — or at least to shed enough light on it that it mostly disappears. Specifically, the kinds of “invariance to stopping rules” described in the classical Bayesian literature don’t look much like the kind of thing people actually worry about when they say p-hacking. Unfortunately, but also naturally, this nuance can be easily lost in the telephone game of statistical exposition.
There are, of course, interesting and substantial differences between hypothesis testing and Bayesian model selection. I’ll talk more about them in a subsequent post. But these differences are subtle enough that, if you don’t know the details, a working statistician’s shorthand summary should not be “Bayesians don’t care about stopping rules” but rather “of course Bayesian inference is susceptible to p-hacking.” Because of course it is.
The simplest form of p-hacking
Here’s a characteristically pithy prototypical example of p-hacking: the xkcd “Significant” comic, in which scientists find no overall link between jelly beans and acne, then test twenty individual jelly bean colors and triumphantly report that green jelly beans are linked to acne at \(p < 0.05\).

In this telling, to p-hack, you test a bunch of null effects in the presence of noise, or chance variation, and select from your measurements the largest measured effect. But this measured effect is large due to the large number of noisy measurements you searched through, not due to the largeness of a real effect. By misreporting your procedure and / or failing to take into account how your procedure would behave on new data, you mislead yourself or your audience.
As described, p-hacking has a distinctly frequentist flavor. After all, the term “p-value” is in the name! Indeed, a casual reader might have the impression that p-hacking is a distinctly frequentist problem, since, in most high-level discussion of the difference between Bayesian and frequentist methods, it’s asserted that Bayesian inference is somehow “invariant to stopping rules.” To take one of several examples from our Foundations Reading Group this semester:
“Once the priors are fixed, all that matters is the likelihood of the actual outcome on the various hypotheses under consideration; likelihoods of non-actual outcomes on the various hypotheses under consideration are irrelevant.” (Kotzen (2022), page 4)
What does this mean for p-hacking? One might say that p-hacking occurs because of double use of the data — once to form a test statistic (by searching over all the jelly bean colors), and again to assess its p-value. Alternatively, one might say that p-hacking occurs because one did not correctly account for the fact that the procedure used to identify green jelly beans would, on alternative datasets, routinely find some “significant” relationship. Since the Bayesian analysis conditions on the observed data, and depends only on the observed data (and not on alternative hypothetical datasets), it might superficially appear that p-hacking cannot occur for the Bayesian.
Under this reasoning you might think that the p-hacking problem would be solved if only we had (somehow) done a Bayesian analysis. This is narrowly true, in the sense that a “correctly done” Bayesian analysis would not be susceptible to p-hacking. But a “correctly done” frequentist analysis, which takes into account the multiplicity of testing, is also not susceptible to p-hacking. What matters is not whether you did a frequentist or Bayesian analysis, but rather whether you did what you did correctly.
So let’s look at what an incorrect Bayesian analysis of p-hacking looks like, and try to understand how it goes wrong in Bayesian terms.
Some Bayesian versions of the simplest form of p-hacking
Let me propose a simple example of p-hacking. Suppose we can draw data \(x\) according to \(p(x\vert \theta) = \mathcal{N}\left(x| \theta, 0.1\right)\), where \(\mathcal{N}\left(\cdot | \mu, \sigma^2\right)\) denotes a normal distribution with mean \(\mu\) and variance \(\sigma^2\). For the xkcd example, you can imagine \(x\) is data from an experiment on jelly beans of a particular color, and \(\theta\) is the measured association of acne with that jelly bean color.
Let’s suppose for simplicity that our prior, \(p(\theta)\), is very dispersed, so that the details of \(p(\theta)\) have a negligible effect on our posterior. Then the posterior given a single draw is given by
\[ p(\theta \vert x) = \mathcal{N}\left(\theta | x, 0.1\right). \]
No problem — the posterior credible interval is the same as the corresponding hypothesis testing interval.
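To make this concrete, here is a minimal numpy/scipy sketch of the single-experiment posterior. This is my own illustration, and the particular numbers are arbitrary choices, not anything specified above.

```python
# A flat-prior posterior for one draw x ~ N(theta, 0.1) is N(theta | x, 0.1).
import numpy as np
from scipy import stats

sigma2 = 0.1                                   # known observation variance
rng = np.random.default_rng(0)

theta_true = 0.3                               # a hypothetical true effect
x = rng.normal(theta_true, np.sqrt(sigma2))    # one observation

posterior = stats.norm(loc=x, scale=np.sqrt(sigma2))

# The 95% credible interval coincides with the usual x +/- 1.96 * sd interval.
print(posterior.interval(0.95))
```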
Let’s p-hack this setup. Suppose that, instead of observing one \(x\) value, we re-do our experiment repeatedly, drawing a sequence of \(x_n\) independently of one another. Suppose for simplicity that each measures a different \(\theta_n\), and that the \(\theta_n\) are a priori independent, so that the posteriors are independent of one another. By cherry-picking among these repeated samples, we can p-hack the posterior in two ways:
- Case 1: Reporting some extreme value over all \(p(\theta_n \vert x_n)\), say the maximum, as if it were the posterior of a typical value.
- Case 2: Sampling until we get at least one \(x_n\) in a region of interest, and then reporting \(p(\theta_n \vert x_n)\) as if it were the only posterior we computed.
Each of these produces a misleading posterior conclusion, despite being “Bayesian.” I will argue that they are, in different ways, abuses of the Bayesian paradigm, just as ignoring multiple testing is an abuse of the frequentist paradigm. The important thing to note is that the Bayesian paradigm is just as susceptible to abuse as the frequentist one.
Case 1: The posterior statistic is misleading
Suppose we draw a large number of IID \(x_n\), compute the posteriors for each, and take \(n^*\) to be the index of the largest posterior expectation, \(n^* := \mathrm{argmax}_{n} \mathbb{E}\left[\theta_n \vert x_n\right]\). Suppose we then simply report \(\mathbb{E}\left[\theta_{{n^*}} \vert x_{{n^*}}\right]\) as if it were the only experiment we had done. In contrast, let’s suppose that \(n\) is some index fixed in advance of the experiment.
If you report a posterior quantity \(\mathbb{E}\left[f(\theta) \vert x_n\right]\) with no further elaboration, the typical understanding is that the quantity \(f(\theta)\) is fixed, and that had you observed different data, \(x_n'\), you would have reported \(\mathbb{E}\left[f(\theta) \vert x_n'\right]\). This is of course rarely made explicit, but it is the source of what is actively misleading about reporting \(\mathbb{E}\left[\theta_{{n^*}} \vert x_{{n^*}}\right]\) without describing the dependence of \({n^*}\) on other aspects of the posterior. This is implicit in the typical justification for the posterior mean as the solution to \[ \mathbb{E}\left[f(\theta) \vert x\right] = \mathrm{argmin}_{c}\,\int (f(\theta) - c)^2 p(\theta \vert x) d\theta. \] Of course the function \(f(\cdot)\) is allowed to depend on the data, but this dependence needs to be made explicit. The function \(f(\theta) = \theta_{{n^*}}\) does depend on the data (through \({n^*}\)), and the function \(f(\theta) = \theta_n\) does not. Hiding the dependence in the former case is deceptive, and malpractice in purely Bayesian terms.
Note that, under our modeling assumptions, the quantity \(\mathbb{E}\left[\theta_{{n^*}} \vert x_{{n^*}}\right]\) is in fact a perfectly well-defined posterior quantity, it’s just not the same as the expectation of a typical posterior quantity such as \(\mathbb{E}\left[\theta_n \vert x_n\right]\). If we take \(f(\theta) = \theta_{{n^*}}\), it’s a different posterior quantity than taking \(f(\theta) = \theta_n\), where \(n\) does not depend on \(x\). For example, if you truly believe that each \(\theta_n\) is different, that the \(\theta_n\) are distributed according to \(p(\theta)\), and the model is correct, you are justified in looking at the posterior of the inferred largest parameter. This is exactly what’s done in empirical Bayesian versions of false discovery rate control, for example (Efron (2007)). Nobody would think that conditioning on the data allows you to freely exchange the posterior of the most highly expressed genes for a typical gene chosen in advance, and neither should it be permissible to “p-hack” by reporting \(\mathbb{E}\left[\theta_{{n^*}} \vert x_{{n^*}}\right]\) as if it were \(\mathbb{E}\left[\theta_n \vert x_n\right]\).
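Here is a small simulation of Case 1. It is my own sketch, and it assumes, beyond anything stated above, that every true effect is exactly zero, in the spirit of the jelly bean story. With a flat prior the posterior mean is just \(x_n\), so the argmax report is systematically far from zero even though a fixed-index report is not.

```python
# Case 1: report the largest of many flat-prior posterior means as if its
# index had been fixed in advance.  All true effects are zero by assumption.
import numpy as np

sigma2 = 0.1
n_experiments = 20                 # e.g. twenty jelly bean colors
n_sims = 10_000
rng = np.random.default_rng(1)

# With a flat prior, the posterior mean E[theta_n | x_n] is just x_n.
x = rng.normal(0.0, np.sqrt(sigma2), size=(n_sims, n_experiments))

fixed_report = x[:, 0]             # index n chosen before seeing the data
hacked_report = x.max(axis=1)      # index n* chosen as the argmax

print("fixed-index report, on average: ", fixed_report.mean())    # close to 0
print("argmax-index report, on average:", hacked_report.mean())   # roughly 0.6
```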
Case 2: p-hacking is the same as censoring
Arguably the more interesting case, and one which connects more closely to classical arguments, is supposing that we sample until we get \(x_n\) in some region \(C\). Specifically, let’s take for simplicity the region \(C = (20-\epsilon, 20 + \epsilon)\) for some \(\epsilon\) small relative to the randomness in \(x\), \(\sqrt{0.1}\). Call this observation \(x_{{n^*}}\in C\). (Now \({n^*}\) and \(x_{{n^*}}\) are defined in this way, with no relation to Case 1.) Of course, we may have had to draw many \(x_n\) to do this, so our full dataset is \(\vec{x}= (x_1, x_2, \ldots, x_{{n^*}})\), where \(x_{{n^*}}\) is the first observation to land in \(C\) and we stop as soon as it does. I’ll refer to “sampling until \(x_n \in C\)” as a “stopping rule,” though I’m only really using that term loosely, as will be clear below.
If we compute and report only \(p(\theta \vert x_{{n^*}})\), we would conclude that \[ p(\theta \vert x_{{n^*}}) \approx \mathcal{N}\left(\theta | 20, 0.1\right). \] That is, we came to the foregone Bayesian conclusion that \(\theta\) is about equal to \(20\). This is obviously wrong. Why, in Bayesian terms, is it wrong?
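To emphasize how foregone the conclusion is, here is a small sketch of the Case 2 collection scheme (my own, with arbitrary numbers): whatever the true \(\theta\), the reported draw always lands near \(20\).

```python
# Case 2: keep drawing x ~ N(theta, 0.1) until x lands in C = (20 - eps, 20 + eps),
# then report only that draw.  The naive posterior N(theta | x, 0.1) is then
# centered near 20 no matter what theta actually is.
import numpy as np

sigma2 = 0.1
eps = 0.01
rng = np.random.default_rng(2)

def sample_until_in_C(theta, max_draws=1_000_000):
    for _ in range(max_draws):
        x = rng.normal(theta, np.sqrt(sigma2))
        if 20 - eps < x < 20 + eps:
            return x
    raise RuntimeError("stopping criterion never met")

for theta_true in [19.5, 20.0, 20.5]:
    x_star = sample_until_in_C(theta_true)
    print(f"true theta = {theta_true}: reported x = {x_star:.3f}")
```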
One objection may be that we didn’t report “all the data,” we just reported \(p(\theta \vert x_{{n^*}})\). From this point of view, you might make an argument a little like Case 1, though in a more complicated setting. To go a little beyond the facile “don’t throw away data” objection, we can ask what a correct Bayesian analysis should look like. Suppose you actually don’t observe any \(x_n\) except \(x_{{n^*}}\), because a dishonest research assistant p-hacked the data before giving it to you. You no longer have any data other than \(x_{{n^*}}\) to model. How can you use \(x_{{n^*}}\) to form a valid posterior?
The key is to recognize that the stopping rule changes the likelihood. Under the p-hacking data collection scheme, you only observe \(x\) when \(x\in C\), so the correct likelihood of \(x_{{n^*}}\) is
\[ p(x| \theta, \textrm{p-hacking}) = \frac{p(x\vert \theta) 1(x\in C)}{\int p(x' \vert \theta) 1(x' \in C) d x'} = \frac{p(x\vert \theta) 1(x\in C)}{p(x' \in C \vert \theta)}. \]
Note that the normalizing constant of this p-hacked likelihood now depends strongly on \(\theta\). The event that the stopping rule occurred is extremely informative.
If \(\epsilon \ll \sqrt{0.1}\), then \(x \mapsto p(x\vert \theta)\) varies little over \(x \in C\), so \(p(x' \in C \vert \theta) \approx 2 p(x' = 20 \vert \theta) \epsilon\). Evaluated at \(x_{{n^*}}\), the likelihood becomes
\[ p(x_{{n^*}}| \theta, \textrm{p-hacking}) = \frac{p(x_{{n^*}}\vert \theta) 1(x_{{n^*}}\in C)}{p(x' \in C \vert \theta)} \approx \frac{p(x' = 20 \vert \theta)}{2 p(x' = 20 \vert \theta) \epsilon} = \frac{1}{2 \epsilon}. \] The likelihood no longer depends on \(\theta\), so the data \(x_{{n^*}}\) can tell us nothing about \(\theta\). This is a formal statement of our clear intuition — under p-hacking, the data is no longer informative about \(\theta\), because no matter what \(\theta\) is, we always observe the same value, \(x_{{n^*}}\approx 20\).
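Here is a quick numerical check of this flatness (my own sketch, using the same numbers as above): the truncated likelihood evaluated at \(x_{{n^*}} = 20\) is essentially \(1/(2\epsilon)\) no matter what \(\theta\) is.

```python
# p(x_star | theta, p-hacking) = N(x_star | theta, 0.1) / P(x in C | theta)
# for x_star in C; with a small eps this is about 1 / (2 * eps) for any theta.
import numpy as np
from scipy import stats

sigma2 = 0.1
eps = 0.001
x_star = 20.0

for theta in [19.0, 19.5, 20.0, 20.5, 21.0]:
    dist = stats.norm(theta, np.sqrt(sigma2))
    numerator = dist.pdf(x_star)                            # x_star is in C
    denominator = dist.cdf(20 + eps) - dist.cdf(20 - eps)   # P(x in C | theta)
    print(theta, numerator / denominator)                   # about 500 = 1 / (2 * eps)
```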
The posterior using the correct, p-hacked likelihood of course becomes
\[ \begin{aligned} p(\theta \vert x_{{n^*}}, \textrm{p-hacking}) ={}& \frac{p(x_{{n^*}}| \theta, \textrm{p-hacking})p(\theta)}{p(x_{{n^*}}| \textrm{p-hacking})} \\={}& \frac{1}{2 \epsilon} \frac{2 \epsilon}{1} p(\theta) = p(\theta). \end{aligned} \] In other words, our posterior is the prior, precisely because the p-hacked observation \(x_{{n^*}}\) was not informative about \(\theta\).
In this view, p-hacking is just like censoring. We’re all comfortable with the idea that censoring changes the posterior, so no great leap is needed to conclude that p-hacking changes the posterior — specifically, to diminish the informativeness of the p-hacked observation.
Why isn’t Case 2 taken very seriously?
I think Case 1 is hardly controversial. Case 2 is a little more interesting. Why is Case 2 not taken seriously as a possibility, despite being a clear example of how you’d illegitimately p-hack a Bayesian posterior? A telling quote can be found in Edwards, Lindman, and Savage (1963), whose authors were among the most vocal proponents of the idea that Bayesians are immune to stopping rules.
“In general, suppose that you collect data of any kind whatsoever—not necessarily Bernoullian, nor identically distributed, nor independent of each other given the parameter \(\lambda\)—stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of \(n\) data actually observed will be exactly the same as it would be had you planned to take exactly \(n\) observations in the first place.” (Edwards, Lindman, and Savage (1963) pp. 238–239, emphasis mine)
In other words, Edwards, Lindman, and Savage (1963) considers only stopping rules for which, in my notation, \(p(x\in C \vert \theta) = 1\). That is, stopping rules which are uninformative about the parameter. Of course, that means that the likelihood is invariant after multiplying by \(1(x\in C)\) and renormalizing, for any \(x_{{n^*}}\in C\):
\[ p(x_{{n^*}}\vert \theta, \textrm{Stopping rule}) = \frac{p(x_{{n^*}}\vert \theta) 1(x_{{n^*}}\in C)}{p(x' \in C \vert \theta)} = \frac{p(x_{{n^*}}\vert \theta) 1}{1} = p(x_{{n^*}}\vert \theta) \] For such a stopping rule, the likelihood’s dependence on \(\theta\) is unchanged, and so posterior inference is naturally unchanged.
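As a concrete instance of such a benign stopping rule, here is a coin-tossing sketch of my own (it is not an example taken from the papers quoted here): tossing until the \(r\)-th head is a rule that is satisfied with probability one for any bias, and the negative binomial likelihood it induces is proportional in \(\theta\) to the fixed-sample-size binomial likelihood, so the posterior is unchanged.

```python
# An uninformative stopping rule: "toss until the r-th head."  The likelihood
# it induces is proportional (in theta) to the fixed-n binomial likelihood,
# so a flat-prior posterior over a grid of theta values is identical.
import numpy as np
from scipy import stats

theta_grid = np.linspace(0.01, 0.99, 99)
r, n = 3, 12                        # 3 heads seen, 12 tosses in total

binom_lik = stats.binom.pmf(r, n, theta_grid)           # n fixed in advance
negbinom_lik = stats.nbinom.pmf(n - r, r, theta_grid)   # stopped at the r-th head

print(np.allclose(binom_lik / binom_lik.sum(),
                  negbinom_lik / negbinom_lik.sum()))    # True
```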
If you read careful accounts of Bayesian invariance to stopping rules closely, you usually find the proviso that the stopping rule occur with probability one (or, less typically, the more general condition that the probability of the stopping rule be independent of the parameter). This is reasonable, since otherwise you are stuck in the position of having to discard data with nonzero probability, and everyone agrees that’s invalid — even though discarding data is the essence of p-hacking.
What follows is more interesting. In an effort to exhibit such an informative stopping rule, Edwards, Lindman, and Savage (1963) gives a silly example about a naturalist who fails to count how many lions drink at a watering hole because he is chased away by lions before he can observe any of them drink. Of course, this observation of “zero lions drinking” should not be taken as evidence that no lions drink at the hole under these circumstances. They then go on to say
“We would not give a facetious example had we been able to think of a serious one.” (ibid.)
It seems likely that, when Edwards, Lindman, and Savage (1963) was writing, the possibility of p-hacking was not as much in the air as it is now. Or maybe the idea of censoring data doesn’t look enough like a “stopping rule” to have warranted consideration. But, one way or another, it seems to me that, following Edwards, Lindman, and Savage (1963), the possibility of informative stopping rules is still typically minimized or ignored. For example, in the Kotzen (2022) article quoted above, we read
“Since the likelihoods of the actual outcome on the various hypotheses under consideration is not (typically) impacted by the choice of stopping rule, differences in stopping rules are not the source of an inferential difference on the Bayesian approach. … Of course, a difference in stopping rules could matter even for a Bayesian, if the choice of stopping rule impacts what outcome is actually observed. But the important point is that the stopping rule does not have an inferential impact per se; once we fix the priors, the outcome, and the likelihood of that outcome on the various hypotheses, the choice of stopping rule has no further inferential significance.” (Kotzen (2022) page 5 and footnote 12)
The key proviso that the “choice of stopping rule [not impact] what outcome is actually observed” is limited to a footnote, and even there it is not totally clear (of course the stopping rule affects what outcome is observed). In contrast, a somewhat artificial example about coin tossing is given several pages of discussion.
Bad and good statistics are possible in both frameworks
I’m sure that this post would feel obvious to any of the authors I’ve cited. I don’t want to give the impression that I understand something they don’t. I’m simply arguing for a change in emphasis in how we discuss Bayes and p-hacking. The idea that Bayesian statistics is somehow immune to p-hacking can be taken as a purported strength (if you want a magic wand to avoid p-hacking errors) or a weakness (if you take it as obvious that p-hacking is problematic, and so any framework that pretends it’s not a problem must be flawed). I think neither situation is really the case, at least as far as the simplest versions of p-hacking are concerned. P-hacked posteriors are either misleading posterior summaries or misspecified likelihoods.
In a recent post, I argue that statistical frameworks dictate what must be described in an analysis, not what can be described. Analogously, a framework alone cannot avoid error; it can only make error more or less automatically apparent. It remains the case that limiting data collection to events that have probability \(1\) for all values of \(\theta\) can produce interesting divergences between Bayesian posteriors and frequentist procedures. These divergences are useful for developing a nuanced understanding of the different questions that frequentist and Bayesian statistics are asking. (As I mentioned above, I intend to write about this more in a future post.) However, I believe these relatively subtle situations should not be the first thing most people think of when they think about p-hacking and Bayes. I propose the above simple situations as more useful paradigms, in order to make it common knowledge that of course Bayesian inference is susceptible to p-hacking.