There are lots of reasons to dislike p-values. Despite their inherent flaws, over-interpretation, and risks, it is extremely tempting to argue that, absent other information, the smaller the p-value, the less plausible the null hypothesis. For example, the venerable Prof. Philip Stark (whom I admire and who was surely choosing his words very carefully) writes in “The Value of p-Values”:

“Small p-values are stronger evidence that the explanation [the null hypothesis] is wrong: the data cast doubt on the explanation.”

For p-values based on reasonable hypothesis tests with no other information, I think that Prof. Stark is (usually, mostly) correct to say this. But there is nothing in the definition of a hypothesis test that requires it to be reasonable without a consideration of *power*, and power does not enter the definition of a p-value.

So, to motivate the importance of explicit power considerations in the use of hypothesis tests and p-values, let me describe three ridiculous but valid hypothesis tests. None of this is new, but perhaps it will be fun to have these examples all in the same place.

# Hypothesis tests and p-values

I will begin by reviewing the definition of p-values in the context of hypothesis testing. Let our random data \(X\) take values in a measurable space \(\Omega_x\). The distribution of \(X\) is posited to lie in some class of distributions \(\mathcal{P}_\theta\) parameterized by a parameter \(\theta \in \Omega_\theta\). A simple null hypothesis \(H_0\) specifies a value \(\theta_0 \in \Omega_\theta\) that fully specifies the distribution of the data, which we will write as \(P(X | \theta_0)\).

#### Hypothesis tests

A valid test of \(H_0\) with level \(\alpha\) consists of two parts:

- A measurable test statistic, \(T: \Omega_x \mapsto \Omega_T\), possibly incorporating additional randomness, and
- A rejection region \(A(\alpha) \subseteq \Omega_T\) such that \(P(T(X) \in A(\alpha) | H_0) \le \alpha\). The test rejects \(H_0\) when \(T(X) \in A(\alpha)\).

#### P-values

Often (as in the simple example below), the regions are nested in the sense that \(\alpha_1 < \alpha_2 \Rightarrow A(\alpha_1) \subset A(\alpha_2)\). Stricter tests result in smaller rejection regions. In such a case, we can define the p-value of a particular observation \(x\) as the smallest \(\alpha\) for which the test rejects \(H_0\) for that \(x\).

#### A simple example

A simple example, which suffices for the whole present post, is data \(X = (X_1, \ldots, X_N)\), drawn independently from a \(\mathcal{N}(\theta, 1)\) distribution. The classical two-sided test, which is eminently reasonable for many situations, uses

- \(T(X) = \sqrt{N} \vert \bar{X} - \theta_0 \vert\) and
- \(A(\alpha) = \left\{x: T(x) \ge \Phi^{-1}(1 - \alpha / 2) \right\}\),

where \(\bar{X}\) is the sample average and \(\Phi\) the cumulative distribution function of the standard normal. Today I will not quibble with this test, but rather propose absurd alternatives.

In the case of our simple example, the p-value is simply \(p(x) = 2(1 - \Phi(T(x)))\). As \(T(x)\) increases, the p-value \(p(x)\) decreases.
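To make the formula concrete, here is a minimal Python sketch using only the standard library. The function names `two_sided_p_value` and `rejects`, and the toy data, are my own inventions for illustration. The last line checks the "smallest \(\alpha\) which rejects" definition numerically: the test should reject at a level just above the p-value and fail to reject at a level just below it.

```python
from math import sqrt
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def two_sided_p_value(x, theta0):
    """p(x) = 2 * (1 - Phi(T(x))) with T(x) = sqrt(N) * |mean(x) - theta0|."""
    t = sqrt(len(x)) * abs(fmean(x) - theta0)
    return 2.0 * (1.0 - STD_NORMAL.cdf(t))


def rejects(x, theta0, alpha):
    """Classical test: reject when T(x) >= Phi^{-1}(1 - alpha / 2)."""
    t = sqrt(len(x)) * abs(fmean(x) - theta0)
    return t >= STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


# Hypothetical toy data, made up for this example.
x = [0.3, 1.1, -0.2, 0.9, 0.6]
p = two_sided_p_value(x, theta0=0.0)
print(f"p-value: {p:.4f}")

# Reject just above the p-value, fail to reject just below it.
print(rejects(x, 0.0, p + 1e-6), rejects(x, 0.0, p - 1e-6))
```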

#### The reasoning

If \(\alpha\) is small (say, the much-loathed \(0.05\)), and we observe \(x\) such that \(T(x) \in A(\alpha)\), the argument goes that \(x\) constitutes evidence against \(H_0\), since such an outcome would be improbable if \(H_0\) were true. Correspondingly, smaller p-values are taken to be associated with stronger evidence against the null.

However, one can easily construct valid tests that satisfy the above definition that range from obviously incomplete to absurd. In the counterexamples below I will be happy to assume that \(H_0\) is reasonable, even correct. (So we are not in the case of Prof. Stark’s “straw man”, ibid.) Missing in each of the increasingly egregious counterexamples that I will describe is a consideration of *power*, which is an explicit consideration of the ability of the test to reject when \(\theta \ne \theta_0\).

# Three ridiculous hypothesis tests

#### Example 1: Throw away most of the data

Suppose we use the simple example above, but throw away all but the first datapoint. So our hypothesis test is

- \(T(x) = \vert x_1 - \theta_0 \vert\), with \(A(\alpha)\) as above.

This test is valid, as are its p-values. In this case, it is true that smaller p-values cast further doubt on the hypothesis (and Prof. Stark’s quote is true). But the increment is small, since a single datapoint is much less informative than the full dataset.

Missing from this silly test is the fact that, by using all the data, one can construct a strictly larger rejection region — and so a test with more power — with the same level.
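To see how much power is lost, here is a rough sketch that evaluates the power of the two-sided test in closed form, once with all the data and once with a single datapoint. The function name `two_sided_power` and the particular values of \(N\), \(\alpha\), and \(\theta\) are assumptions of mine, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_power(theta, theta0, n, alpha):
    """P(reject | theta) for the classical two-sided test on n datapoints.

    Under theta, sqrt(n) * (mean - theta0) ~ N(delta, 1) with
    delta = sqrt(n) * (theta - theta0), so the rejection probability is
    P(|Z + delta| >= c) with c = Phi^{-1}(1 - alpha / 2).
    """
    c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    delta = sqrt(n) * (theta - theta0)
    return (1.0 - STD_NORMAL.cdf(c - delta)) + STD_NORMAL.cdf(-c - delta)


# Hypothetical settings for illustration.
alpha, theta0, n_full = 0.05, 0.0, 100
for theta in (0.0, 0.2, 0.5, 1.0):
    full = two_sided_power(theta, theta0, n_full, alpha)
    single = two_sided_power(theta, theta0, 1, alpha)
    print(f"theta={theta:.1f}  all {n_full} points: {full:.3f}  "
          f"first point only: {single:.3f}")
```

Both tests reject with probability \(\alpha\) when \(\theta = \theta_0\), so both are valid. They differ only in how quickly the rejection probability grows as \(\theta\) moves away from \(\theta_0\).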

#### Example 2: Throw away all the data

Since we can use randomness in our test statistic, let us define

- \(T(x) \sim \textrm{Uniform}[0,1]\).
- \(A(\alpha) = \left\{x: T(x) \le \alpha \right\}\).

This test has the correct level and valid p-values, but has nothing at all to do with the data or \(H_0\). It also generates valid confidence intervals, which are either the whole space \(\Omega_\theta\) or \(\emptyset\), with probabilities \(1 - \alpha\) and \(\alpha\) respectively.

The book “Testing Statistical Hypotheses” defines p-values for randomized tests as the smallest \(\alpha\) which rejects with probability one. Using this definition, p-values for this case would always be \(1\). So, by this technicality, p-values slip through the cracks of this counterexample. However, I would argue that one could just as well augment the data space with \([0,1]\) and consider the uniform draw to be “data” rather than part of the test, in which case the p-value is simply \(\textrm{Uniform}[0,1]\), independent of the data.

The problem with this test is that it has no greater power to reject under the alternative than under the null. Again, it is a consideration of power, rather than the definition of a valid test, that reveals the nature of the flaw.
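A quick simulation makes the point. The sketch below, with a hypothetical helper `uniform_test_rejection_rate` and made-up settings, draws data from \(\mathcal{N}(\theta, 1)\), ignores it, and rejects when an independent uniform draw is at most \(\alpha\). The rejection rate should sit near \(\alpha\) no matter how far \(\theta\) is from \(\theta_0\).

```python
import random


def uniform_test_rejection_rate(theta, alpha, n, n_sims, rng):
    """Monte Carlo rejection rate of the 'throw away all the data' test."""
    rejections = 0
    for _ in range(n_sims):
        # Draw the data from N(theta, 1), then ignore it entirely.
        _data = [rng.gauss(theta, 1.0) for _ in range(n)]
        t = rng.random()  # test statistic: a uniform draw, independent of the data
        if t <= alpha:
            rejections += 1
    return rejections / n_sims


# Hypothetical settings for illustration; the null is theta0 = 0.
rng = random.Random(0)
rates = {}
for theta in (0.0, 1.0, 10.0):
    rates[theta] = uniform_test_rejection_rate(
        theta, alpha=0.05, n=10, n_sims=20000, rng=rng
    )
    print(f"theta={theta:4.1f}  rejection rate: {rates[theta]:.4f}")
```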

#### Example 3: Construct crazy regions

Let us use \(T(x)\) as in the simple example, but use the region \(A(\alpha) = \left\{x: T(x) \le \Phi^{-1}((\alpha + 1) / 2) \right\}\). These regions have the correct levels, but they reject when \(\bar{x}\) is *close* to \(\theta_0\) (that is, when \(T(x)\) is close to zero) rather than when it is far away. These tests have their highest power against alternatives which are very close to \(\theta_0\), but essentially no power against large deviations. Values \(T(x)\) which are very large will have large p-values, whereas the smallest p-values occur when \(T(x) \approx 0\).

There are at least two ways to think about what is wrong with this test. One is that it produces rejection regions with smaller-than-optimal Lebesgue measure — the total length of the rejection region is much smaller than the classical test’s. Another is that it has highest power against alternatives that we (typically) care the least about, which are values of \(\theta\) that produce nearly the same distribution as the null. As above, power considerations are the key.
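The same kind of closed-form calculation shows how this test behaves. In the sketch below, the function names and numerical settings are my own assumptions, and the p-value formula \(p(x) = 2 \Phi(T(x)) - 1\) is my own derivation, obtained by solving \(T(x) = \Phi^{-1}((\alpha + 1) / 2)\) for \(\alpha\).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverted_power(theta, theta0, n, alpha):
    """P(T(X) <= c | theta) with c = Phi^{-1}((alpha + 1) / 2).

    Under theta, sqrt(n) * (mean - theta0) ~ N(delta, 1) with
    delta = sqrt(n) * (theta - theta0), so this is P(|Z + delta| <= c).
    """
    c = STD_NORMAL.inv_cdf((alpha + 1.0) / 2.0)
    delta = sqrt(n) * (theta - theta0)
    return STD_NORMAL.cdf(c - delta) - STD_NORMAL.cdf(-c - delta)


def inverted_p_value(t):
    """Smallest alpha with t <= Phi^{-1}((alpha + 1) / 2), i.e. 2 * Phi(t) - 1."""
    return 2.0 * STD_NORMAL.cdf(t) - 1.0


# Hypothetical settings for illustration.
alpha, theta0, n = 0.05, 0.0, 25
for theta in (0.0, 0.01, 0.1, 1.0):
    print(f"theta={theta:.2f}  power: {inverted_power(theta, theta0, n, alpha):.6f}")

for t in (0.0, 0.5, 3.0):
    print(f"T(x)={t:.1f}  p-value: {inverted_p_value(t):.4f}")
```

The rejection probability equals \(\alpha\) at \(\theta = \theta_0\) and only falls as \(\theta\) moves away, so the test is valid but is least likely to reject exactly when the null is most wrong.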

# Let’s think about power

Even the best available argument for p-values, hypothesis tests, and confidence intervals depends on having chosen tests in the first place that take power into account. The best statisticians (e.g. Prof. Stark and Ronald Fisher) will be very good at avoiding under-powered tests, and avoid such mistakes easily in most situations. However, for the general public, it seems to me that there is a lot of value in making power a fundamental part of teaching and talking about hypothesis tests, p-values, and confidence intervals.