Let’s talk about Hacking. Not Ian Hacking this time — p-hacking! I’d like to elaborate on a nice post by Michael Wiebe, where he investigates whether my work with Rachael and Tamara on robustness to the removal of small, adversarially selected subsets can be a defense against p-hacking. In short: yes, it can be! Michael establishes this with some experiments, and I’d like to supplement his observation with a little theory from our paper. Let me state before diving in that I was unaware of this nice feature of our work before Michael’s post, so I’m super grateful to him both for noticing it and for producing such a nice write-up.

P-hacking for regression

Here’s what Michael means by p-hacking in a nutshell. (I’ll eliminate some details of his analysis that are unnecessary for my point here.) Suppose that you have \(N\) data points, consisting of a mean-zero normally-distributed response, \(y_n\), for \(n=1,...,N\), and a large number of regressors, \(x_{nk}\), \(k=1,...,K\). Suppose that all of the \(x_{nk}\) have variance \(\sigma_x^2\) and are drawn independently of \(y_n\). Despite the fact that the regressors have no real explanatory power, if you run all \(K\) regressions \(y_n \sim \alpha_k + \beta_k x_{nk}\), for large enough \(K\), you expect to find at least one \(\hat\beta_k\) that is statistically significant at any fixed level.

For example, if you construct a test with valid 95% confidence intervals and \(K=20\), you expect, on average, one false rejection amongst the \(20\) regressions. If you run all 20 but report only the rejection, it may appear that you have found a significant effect when, in fact, you have only p-hacked. P-hacking is at best sloppy and at worst dishonest, and we want defenses against it.
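To make the setup concrete, here is a minimal simulation sketch of my own (not Michael’s actual code). For simplicity it sets \(\sigma_x = \sigma_\varepsilon = 1\) and drops the intercepts, which is harmless since both \(y_n\) and \(x_{nk}\) are mean zero.

```python
# A minimal sketch of the p-hacking setup: K independent, useless regressors,
# and we check how often at least one of the K univariate regressions looks
# "significant" at the 5% level.
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 20
n_sims, n_any_significant = 500, 0
for _ in range(n_sims):
    y = rng.normal(size=N)                      # pure-noise response
    x = rng.normal(size=(N, K))                 # K regressors, independent of y
    betahat = x.T @ y / np.sum(x**2, axis=0)    # K univariate OLS slopes
    resid = y[:, None] - betahat[None, :] * x   # residuals from each regression
    se = np.sqrt(np.sum(x**2 * resid**2, axis=0)) / np.sum(x**2, axis=0)  # robust SEs
    n_any_significant += int(np.any(np.abs(betahat) > 1.96 * se))

print(f"Proportion of simulations with at least one 'significant' regressor: "
      f"{n_any_significant / n_sims:.2f}")   # roughly 1 - 0.95^20 ≈ 0.64
```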

Some notation

It will be helpful to write out our estimates explicitly. First, let’s write \(y_n = \varepsilon_n\) with a residual \(\varepsilon_n \sim \mathcal{N}(0, \sigma_\varepsilon^2)\), to emphasize that no regressors have real effects. Our estimators and their (robust) standard errors are

\[ \begin{align*} \hat\beta_k = \frac{\sum_{n=1}^N x_{nk} \varepsilon_n}{\sum_{n=1}^N x_{nk}^2} \quad\quad\textrm{and}\quad\quad \hat\sigma_k^2 = \frac{\frac{1}{N}\sum_{n=1}^N x_{nk}^2 \hat\varepsilon_{kn}^2} {\left(\frac{1}{N}\sum_{n=1}^N x_{nk}^2\right)^2} \quad\quad\textrm{where}\quad\quad \hat\varepsilon_{kn} = y_n - \hat\beta_k x_{nk}. \end{align*} \]

(Note that \(\hat\sigma_k\) is scaled so that \(\hat\sigma_k / \sqrt{N}\), not \(\hat\sigma_k\) itself, is the usual standard error of \(\hat\beta_k\).)

A standard 95% univariate test rejects \(H_0: \beta_k = 0\) if

\[ \begin{align*} \textrm{Statistical significance:}\quad\quad \sqrt{N} |\hat\beta_k| > 1.96 \hat\sigma_k. \end{align*} \]
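In code, the estimator, the scaled robust standard error, and the test might look like the following sketch; `sigmahat` follows the convention above, so `sigmahat / sqrt(N)` is the usual standard error.

```python
# A sketch of the estimator, scaled robust standard error, and test above,
# for a single regressor x and response y (both length-N arrays).
import numpy as np

def univariate_fit(x, y):
    """OLS slope of y on x (no intercept, as above) and scaled robust SE."""
    betahat = np.sum(x * y) / np.sum(x**2)
    resid = y - betahat * x
    sigmahat = np.sqrt(np.mean(x**2 * resid**2)) / np.mean(x**2)
    return betahat, sigmahat

def is_significant(betahat, sigmahat, N, z=1.96):
    """Two-sided test: reject H0: beta_k = 0 when sqrt(N) |betahat| > z * sigmahat."""
    return np.sqrt(N) * np.abs(betahat) > z * sigmahat

# For example, on pure-noise data as in the setup above:
rng = np.random.default_rng(0)
N = 1000
y, x = rng.normal(size=N), rng.normal(size=N)
print(is_significant(*univariate_fit(x, y), N))
```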

Now, we actually expect that, asymptotically and marginally, \(\sqrt{N} \hat\beta_k \sim \mathcal{N}(0, \sigma_\varepsilon^2 / \sigma_x^2)\) by the central limit theorem. For any fixed \(k\), we expect \(\hat\sigma_k \rightarrow \sigma_\varepsilon / \sigma_x\), and even for a \(k\) chosen via p-hacking, we might expect \(\hat\sigma_k\) not to be orders of magnitude off. So, among many draws \(k=1,...,K\) from the distribution of \(\sqrt{N}\hat\beta_k\), we expect a few to be “statistically significant”. So far, I’ve just re-stated formally how p-hacking works.
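Here is a quick Monte Carlo check of those two limits (just a sanity-check sketch, with \(\sigma_\varepsilon = 2\) and \(\sigma_x = 0.5\) chosen arbitrarily so that \(\sigma_\varepsilon / \sigma_x = 4\)).

```python
# Check that sqrt(N) * betahat has standard deviation near sigma_eps / sigma_x
# and that sigmahat settles near the same ratio, using the formulas above.
import numpy as np

rng = np.random.default_rng(1)
N, sigma_eps, sigma_x, n_sims = 2000, 2.0, 0.5, 2000
scaled_betahats, sigmahats = [], []
for _ in range(n_sims):
    y = rng.normal(0.0, sigma_eps, size=N)      # pure-noise response
    x = rng.normal(0.0, sigma_x, size=N)
    betahat = np.sum(x * y) / np.sum(x**2)
    resid = y - betahat * x
    sigmahat = np.sqrt(np.mean(x**2 * resid**2)) / np.mean(x**2)
    scaled_betahats.append(np.sqrt(N) * betahat)
    sigmahats.append(sigmahat)

print(np.std(scaled_betahats))   # should be near sigma_eps / sigma_x = 4
print(np.mean(sigmahats))        # should also be near 4
```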

Adversarial removal

Now, let us ask how much data we would need to drop to make a significant result \(\hat\beta_k\) statistically insignificant. Suppose without loss of generality that \(\hat\beta_k > 0\). If we can drop datapoints that decrease the estimate \(\hat\beta_k\) by the difference \(\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}\), then the estimator will become insignificant. Let’s assume for simplicity that the standard deviation \(\hat\sigma_k\) is less sensitive to datapoint removal than is \(\hat\beta_k\), which is typically the case in practice. Then, in the notation of Section 3 of our paper,

  • The “signal” is \(\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}\),
  • The “noise” is \(\hat\sigma_k\) (since we only consider the sensitivity of \(\hat\beta_k\), whose limiting distribution, after scaling by \(\sqrt{N}\), has standard deviation estimated by \(\hat\sigma_k\)),
  • The “shape” \(\Gamma_\alpha\) is determined in a complicated way by \(\alpha\) and the distributions of the residuals and regressors, but it converges to a non-zero constant and deterministically satisfies \(\Gamma_\alpha \le \sqrt{\alpha(1-\alpha)}\),

and we expect to be unable to make the estimate insignificant when

\[ \begin{align*} \textrm{Robust to adversarial subset dropping:}\quad\quad \frac{\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}}{\hat\sigma_k} \ge \Gamma_\alpha. \end{align*} \]

Multiplying both sides by \(\sqrt{N}\) and making some cancellations and rearrangements, we see that

\[ \begin{align*} \textrm{Robust to adversarial subset dropping:}\quad\quad \frac{\sqrt{N} \hat\beta_k}{\hat\sigma_k} \ge \sqrt{N} \Gamma_\alpha + 1.96. \end{align*} \]

Now, as \(N \rightarrow \infty\), the left-hand side of the preceding display converges in distribution, and the right-hand side blows up. In other words, estimates formed from p-hacking are always non-robust, at any \(\alpha\), for sufficiently large \(N\)!
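To see the blow-up concretely, here is a rough numerical sketch of the rearranged condition. It is not the procedure from our paper (which approximates \(\Gamma_\alpha\) for the estimator at hand); the value \(\Gamma_\alpha = 0.05\) below is just a hypothetical stand-in for whatever non-zero constant the shape converges to.

```python
# The left-hand side (the t-statistic) has an N-independent limiting
# distribution, while the threshold sqrt(N) * Gamma_alpha + 1.96 grows without
# bound.  Gamma_alpha = 0.05 is a hypothetical stand-in, for illustration only.
import numpy as np

def appears_robust(tstat, N, Gamma_alpha, z=1.96):
    """True when sqrt(N) * betahat / sigmahat = tstat >= sqrt(N) * Gamma_alpha + z."""
    return tstat >= np.sqrt(N) * Gamma_alpha + z

# A p-hacked result is "just significant," so its t-statistic sits a little
# above 1.96, and its distribution does not change with N.  The threshold,
# however, grows like sqrt(N):
Gamma_alpha, phacked_tstat = 0.05, 2.5
for N in [100, 1_000, 10_000, 100_000]:
    threshold = np.sqrt(N) * Gamma_alpha + 1.96
    print(f"N = {N:>7}: threshold = {threshold:6.2f}, "
          f"robust-looking: {appears_robust(phacked_tstat, N, Gamma_alpha)}")
```

With these (made-up) numbers, the t-statistic of 2.5 clears the threshold at \(N = 100\) but falls further and further below it as \(N\) grows, which is exactly the “for sufficiently large \(N\)” statement above.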

Regression is not special

Although it’s convenient to work with regression, absolutely nothing about the preceding analysis is special to regression. The key is that p-hacking relies on variability that is of the same order as the shrinking confidence intervals, whereas adversarial subset removal produces changes that do not vanish asymptotically. P-hacking is thus dealt with for the same reason that, as we say in the paper, statistical insignificance is always non-robust.
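For instance, here is a back-of-the-envelope version of that tension in the simplest case I can think of, the sample mean of \(N\) i.i.d. \(\mathcal{N}(0, \sigma^2)\) observations (an illustration only, not a statement of the paper’s results). For large \(N\), dropping the \(\alpha N\) smallest observations shifts the sample mean by roughly

\[ \begin{align*} \sigma \, \frac{\varphi\left(\Phi^{-1}(\alpha)\right)}{1 - \alpha}, \end{align*} \]

where \(\varphi\) and \(\Phi\) are the standard normal density and CDF. This shift does not shrink with \(N\), while the half-width of a 95% confidence interval for the mean, \(1.96 \sigma / \sqrt{N}\), vanishes.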

Postscript

Michael makes a somewhat different (but still interesting) point in his plot “False Positives are Not Robust.” I believe that plot can be explained by observing that, in his setup, the effect of increasing \(\gamma\) is to increase \(\sigma_\varepsilon\) in my notation. The flat line in his plot can then be attributed to the fact that both the signal and the noise are proportional to \(\sigma_\varepsilon\), so the residual scale cancels in the signal-to-noise ratio.
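Concretely (as a sanity check on my reading of his setup), rescaling the residuals \(\varepsilon_n \mapsto c\,\varepsilon_n\) rescales both \(\hat\beta_k\) and \(\hat\sigma_k\) by exactly \(c\), so the signal-to-noise ratio is unchanged:

\[ \begin{align*} \frac{c\hat\beta_k - 1.96\, c\hat\sigma_k / \sqrt{N}}{c\hat\sigma_k} = \frac{\hat\beta_k - 1.96\, \hat\sigma_k / \sqrt{N}}{\hat\sigma_k}. \end{align*} \]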