Why is it so hard to think “correctly” about confidence intervals?

I came across the following section in the (wonderful) textbook ModernDive:

Let’s return our attention to 95% confidence intervals. … A common but incorrect interpretation is: “There is a 95% probability that the confidence interval contains p.” Looking at Figure 8.27, each of the confidence intervals either does or doesn’t contain p. In other words, the probability is either a 1 or a 0.

(Although I’m going to pick on this quote a little bit, I want to stress that I love this textbook. This view of CIs is extremely common and I might well have taken a similar quote from any number of other sources. This book just happened to be in front of me today.)

I understand what the authors are saying. Given the data we observed and CI we computed, there is no remaining randomness — either the parameter is in the interval, or it isn’t. The parameter is not random; the data is. But I think there is room to admit that this point, while technically clear, is a little uncomfortable, even for those of us who are very familiar with these concepts. After all, there is a 95% chance that a randomly chosen interval contains the parameter. I chose an interval. Why can I no longer say that there is a 95% chance that the parameter is in that interval? To a beginning student of statistics who is encountering this idea for the first time, this commonplace qualification must seem pedantic at best and confusing at worst.
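To make the tension concrete, here is a minimal simulation in the spirit of the quote (my own sketch, not an example from ModernDive; the population proportion `p_true`, the sample size, and the simple Wald interval are all arbitrary choices). Across repeated samples, roughly 95% of the intervals contain \(p\); for any single computed interval, the coverage indicator is simply 0 or 1.

```python
import numpy as np

# Repeatedly sample from a Bernoulli(p) population, build a 95% Wald interval
# for p from each sample, and record whether each realized interval contains p.
rng = np.random.default_rng(0)
p_true = 0.4        # the (normally unknown) population proportion
n = 200             # sample size per replication
n_reps = 10_000     # number of repeated samples
z = 1.96            # approximate 97.5% quantile of the standard normal

covers = np.empty(n_reps, dtype=bool)
for i in range(n_reps):
    x = rng.binomial(1, p_true, size=n)
    p_hat = x.mean()
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    covers[i] = (p_hat - half_width) <= p_true <= (p_hat + half_width)

# Across replications, about 95% of the intervals contain p (the aleatoric claim)...
print(f"empirical coverage: {covers.mean():.3f}")
# ...but each individual interval either contains p or it does not.
print(f"first five coverage indicators: {covers[:5].astype(int)}")
```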

Fiducial inference

Chapter 9 of Ian Hacking’s Logic of Statistical Inference contains a beautiful account of precisely why we are so inclined towards the “incorrect” interpretation, as well as the shortcomings of our intuition. The logic is precisely that of Fisher’s famous (infamous?) fiducial inference. Understanding this connection not only helps us to better understand CIs (and their modes of failure), but also to be more sympathetic to the inherent reasonableness of students who are disinclined to let go of the “incorrect” interpretation.

As presented by Hacking, there are two relatively uncontroversial building blocks of fiducial inference, and one problematic one. Recall the idea that aleatoric probabilities (the stuff of gambling devices) and epistemic probabilities (degrees of epistemic belief) are fundamentally different quantities. (Hacking treats this question better than I could, but I also have a short post on this topic here). Following Hacking, I will denote the aleatoric probabilities by \(P\) and the degrees of belief by \(p\).

Assumption one: The “frequency principle.”

The first necessary assumption of fiducial inference is this:

If you know nothing about an aleatoric event \(E\) other than its probability, then \(p(E) = P(E)\).

This amounts to saying that, for pure gambling devices, absent other information, your subjective belief about whether an outcome occurs should be the same as the frequency with which that outcome occurs under randomization. If you know a coin comes up heads 50% of the time (\(P(heads) = 0.5\)), then your degree of certainty that it will come up heads on the next flip should be the same (\(p(heads) = 0.5\)). Hacking calls this assumption the “frequency principle.”

Assumption two: The “logic of support.”

The second fundamental assumption is that the logic of epistemic probabilities should be the same as the logic of aleatoric probabilities. Specifically:

Degrees of belief should obey Kolmogorov’s axioms.

For example, if events \(H\) and \(I\) are logically mutually exclusive, then \(p(H \textrm{ or } I) = p(H) + p(I)\). Conditional probabilities such as \(p(H \vert E)\) are a measure of how much the event \(E\) supports the subjective belief that \(H\) occurs.

Neither the frequency principle nor the logic of support is particularly controversial, even for avowed frequentists. Note that assumption one says only how you arrive at subjective beliefs about systems you know to be aleatoric, and assumption two describes only how subjective beliefs combine coherently. So there is nothing really Bayesian here.

Confidence intervals and fiducial inference

Applying the frequency principle and the logic of support to confidence intervals, together with an additional (more controversial) logical step, will in fact lead us directly to the “incorrect” interpretation of a confidence interval. Let’s see how the logic works.

Suppose we have some data \(X\), and we want to know the value of some parameter \(\theta\). Suppose we have constructed a valid confidence set \(S(X)\) such that \(P(\theta \in S(X)) = 0.95\). Following Hacking, let \(D\) denote the event that our setup is correct; specifically, that we are correct about the randomness of \(X\) and that \(S(X)\) is a valid CI with the desired coverage. That is, given \(D\), the data \(X\) really is random, we know the form of its randomness, and \(P\) is a true aleatoric probability, not a subjective belief.

Of course, the construction of a confidence interval guarantees only the aleatoric probability — thus we have used \(P\), not \(p\). However, by the frequency principle, we are justified in writing

\(p(\theta \in S(X) \vert D) = P(\theta \in S(X)\vert D) = 0.95\),

so long as we know nothing other than the accuracy of our setup \(D\). (Note that the indicator of the event \(\theta \in S(X)\) is a pivot: its distribution does not depend on \(\theta\). In general, pivots play a central role in fiducial inference.)
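As a small numerical check of this pivot property (my own example, under an assumed model of a normal mean with known variance; nothing here comes from Hacking), the distribution of the coverage indicator does not depend on \(\theta\): the simulated coverage is about 0.95 for every value of \(\theta\) tried.

```python
import numpy as np

# Assumed model for illustration: X_1, ..., X_n iid Normal(theta, 1), with S(X)
# the usual 95% interval x_bar +/- 1.96 / sqrt(n). The coverage indicator
# 1{theta in S(X)} is Bernoulli(0.95) no matter what theta is.
rng = np.random.default_rng(1)
n, n_reps, z = 25, 20_000, 1.96
half_width = z / np.sqrt(n)

for theta in [-10.0, 0.0, 3.7, 100.0]:
    x = rng.normal(theta, 1.0, size=(n_reps, n))
    x_bar = x.mean(axis=1)
    covers = np.abs(x_bar - theta) <= half_width
    print(f"theta = {theta:7.1f}   simulated coverage = {covers.mean():.3f}")
```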

Note that \(p(\theta \in S(X) \vert D)\) is very near to our “incorrect” interpretation of confidence intervals! However, in reality we know more than \(S(X)\): we actually observe \(X\) itself. Now, \(P(\theta \in S(X) \vert D, X)\) is either \(0\) or \(1\). Conditional on \(X\), there is no remaining aleatoric uncertainty to which we can apply the frequency principle. And most authors, including those of the quote that opened this post, stop here.

There is an additional assumption, however, that allows us to formally compute \(p(\theta \in S(X) \vert X, D)\), and it is this (controversial) assumption that lies at the core of fiducial inference:

Assumption three: Irrelevance

The full data \(X\) tells us nothing more about \(\theta\) (in an epistemic sense) than the confidence interval \(S(X)\) does.

In the case of confidence intervals, the assumption of irrelevance requires at least two things. First, it requires that our subjective belief that \(\theta \in S(X)\) does not depend on the particular interval we compute from the data; in other words, our degree of belief that the CI contains the parameter is the same no matter where its endpoints lie. Second, it requires that there is nothing to be learned about the parameter from the data beyond the information contained in the CI.

These are strong assumptions! However, when they hold, they justify the “incorrect” interpretation of confidence intervals — namely that there is a 95% subjective probability that \(\theta \in S(X)\), given the data we observed. For, under the assumption of irrelevance, by the logic of support (and then the frequency principle as above) we can write

\(p(\theta \in S(X) \vert X, D) = p(\theta \in S(X) \vert D) = P(\theta \in S(X) \vert D) = 0.95\).

How does this go wrong, and what does it mean for teaching?

Assumption three is often hard to justify, or outright fallacious. But one of its strengths is that it points to how the logic of fiducial inference fails, when it does fail. In particular, it is not hard to construct valid confidence intervals that contain only impossible values of \(\theta\) for some values of \(X\). (As long as a confidence interval takes on crazy values sufficiently rarely, there is nothing in the definition preventing it from doing so.) In fact, as Hacking points out, confidence intervals are pre-data tools: they are designed so that you do not make mistakes too often on average, and they can suggest strange conclusions once you have seen a particular dataset.
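Here is a minimal sketch of one such construction (my own example, not one from Hacking; the nonnegativity constraint and the normal model are assumptions made purely for illustration). Suppose we know a priori that \(\theta \ge 0\), and we use the usual normal-theory interval, which has exact 95% coverage for every \(\theta\). For unlucky samples, the realized interval lies entirely below zero and therefore contains only impossible values of \(\theta\).

```python
import numpy as np

# theta is known to be nonnegative (say, a physical quantity that cannot be negative),
# and we observe X_1, ..., X_n iid Normal(theta, 1). The interval x_bar +/- 1.96/sqrt(n)
# has exact 95% coverage for every theta >= 0, yet it sometimes lies entirely below zero.
rng = np.random.default_rng(2)
theta, n, n_reps, z = 0.05, 4, 100_000, 1.96

x = rng.normal(theta, 1.0, size=(n_reps, n))
x_bar = x.mean(axis=1)
lower = x_bar - z / np.sqrt(n)
upper = x_bar + z / np.sqrt(n)

covers = (lower <= theta) & (theta <= upper)
absurd = upper < 0.0   # realized intervals that contain only impossible (negative) values

print(f"coverage over replications: {covers.mean():.3f}")    # the pre-data guarantee holds
print(f"fraction of absurd intervals: {absurd.mean():.3f}")  # but some realized intervals are nonsense
```

Once we have seen such an interval, the assumption of irrelevance plainly fails: where the realized endpoints fall very much affects how strongly we should believe that \(\theta \in S(X)\).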

However, it’s not crazy for someone, especially a beginning student, to subscribe to assumption three, even if they are not aware of it. After all, we typically present a confidence interval as the way to summarize what your data tells you about your parameter. And if that’s the case, then the “incorrect” interpretation of CIs follows from the extremely plausible frequency principle and logic of support. At the least I think we should acknowledge the reasonableness of this logical chain, and teach when it goes wrong rather than simply reject it by fiat.