Let’s talk about Hacking. Not Ian Hacking this time — p-hacking! I’d like to elaborate on a nice post by Michael Wiebe, where he investigates whether my work with Rachael and Tamara on robustness to the removal of small, adversarially selected subsets can be a defense against p-hacking. In short: yes, it can be! Michael establishes this with some experiments, and I’d like to supplement his observation with a little theory from our paper. Let me state before diving in that I was unaware of this nice feature of our work before Michael’s post, so I’m super grateful to him both for noticing it and producing such a nice write-up.

# P-hacking for regression

Here’s what Michael means by p-hacking in a nutshell. (I’ll eliminate some details of his analysis that are unnecessary for my point here.) Suppose that you have \(N\) data points, consisting of a mean-zero normally-distributed response, \(y_n\), for \(n=1,...,N\), and a large number of regressors, \(x_{nk}\), \(k=1,...,K\). Suppose that all of the \(x_{nk}\) have variance \(\sigma_x^2\) and are drawn independently of \(y_n\). Despite the fact that the regressors have no real explanatory power, if you run all \(K\) regressions \(y_n \sim \alpha_k + \beta_k x_{nk}\), for large enough \(K\), you expect to find at least one \(\hat\beta_k\) that is statistically significant at any fixed level.

For example, if you construct a test with valid 95% confidence intervals and \(K=20\), you expect one false rejection amongst the \(20\) regressions. If you run all 20, but only report the rejection, it may appear that you have found a significant effect, but you’ve actually only p-hacked. P-hacking is at best sloppy and at worst dishonest, and we want defenses against it.
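To put rough numbers on this, here is a minimal sketch of the arithmetic. It assumes the \(K\) tests are independent and each has exactly a 5% false rejection rate, which is an idealization of the setup above; the function name `false_rejection_summary` is just something I made up for illustration.

```python
def false_rejection_summary(K, level=0.05):
    """Expected number of false rejections, and the chance of at least one,
    among K tests, ASSUMING the tests are independent and exactly level-valid."""
    expected = K * level                 # mean of a Binomial(K, level) count
    p_any = 1.0 - (1.0 - level) ** K     # complement of "no test rejects"
    return expected, p_any


expected, p_any = false_rejection_summary(K=20)
print(f"expected false rejections: {expected:.2f}")
print(f"P(at least one false rejection): {p_any:.2f}")
```

Under those assumptions, with \(K=20\) you expect one false rejection on average, and the chance of getting at least one is roughly 0.64.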

# Some notation

It will be helpful to write out our estimates explicitly. First, let’s write \(y_n = \varepsilon_n\) with a residual \(\varepsilon_n \sim \mathcal{N}(0, \sigma_\varepsilon^2)\), to emphasize that no regressors have real effects. Our estimators and their (robust) standard deviation estimates, scaled to match \(\sqrt{N}\hat\beta_k\), are

\[ \hat\beta_k = \frac{\sum_{n=1}^N x_{nk} \varepsilon_n}{\sum_{n=1}^N x_{nk}^2} \quad\quad\textrm{and}\quad\quad \hat\sigma_k^2 = \frac{\frac{1}{N}\sum_{n=1}^N x_{nk}^2 \hat\varepsilon_{kn}^2} {\left(\frac{1}{N}\sum_{n=1}^N x_{nk}^2\right)^2} \quad\quad\textrm{where}\quad\quad \hat\varepsilon_{kn} = y_n - \hat\beta_k x_{nk}. \]

A standard 95% univariate test rejects \(H_0: \beta_k = 0\) if

\[ \textrm{Statistical significance:}\quad\quad \sqrt{N} |\hat\beta_k| > 1.96 \hat\sigma_k. \]

Now, we actually expect that, asymptotically and marginally, \(\sqrt{N} \hat\beta_k \sim \mathcal{N}(0, \sigma_\varepsilon^2 / \sigma_x^2)\) by the central limit theorem. For any fixed \(k\), we expect \(\hat\sigma_k \rightarrow \sigma_\varepsilon / \sigma_x\), and even for \(k\) chosen via p-hacking, we might expect \(\hat\sigma_k\) not to be orders of magnitude off. So, among many draws \(k=1,...,K\) from the distribution of \(\sqrt{N}\hat\beta_k\), we expect a few to be “statistically significant”. So far, I’ve just re-stated formally how p-hacking works.
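The setup so far is easy to simulate. The sketch below is my own toy version, not Michael’s code: it assumes mean-zero normal regressors, drops the intercept to match the formulas in this section, and scales \(\hat\sigma_k\) so that it estimates the standard deviation of \(\sqrt{N}\hat\beta_k\), as the significance rule requires. The function name and default values are made up for illustration.

```python
import numpy as np


def simulate_null_regressions(N=500, K=200, sigma_x=1.0, sigma_eps=1.0, seed=0):
    """Run K no-intercept regressions of pure noise y on independent regressors."""
    rng = np.random.default_rng(seed)
    y = rng.normal(0.0, sigma_eps, size=N)        # y_n = eps_n: no real effects
    x = rng.normal(0.0, sigma_x, size=(N, K))     # regressors independent of y

    sxx = np.sum(x ** 2, axis=0)                          # sum_n x_nk^2, shape (K,)
    beta_hat = np.sum(x * y[:, None], axis=0) / sxx       # OLS slopes
    resid = y[:, None] - x * beta_hat[None, :]            # residuals per regression
    # Robust sd estimate for sqrt(N) * beta_hat (sandwich form, scaled by N).
    sigma_hat = np.sqrt(N * np.sum((x * resid) ** 2, axis=0)) / sxx
    t_stats = np.sqrt(N) * beta_hat / sigma_hat
    return beta_hat, sigma_hat, t_stats


beta_hat, sigma_hat, t_stats = simulate_null_regressions()
n_sig = int(np.sum(np.abs(t_stats) > 1.96))
print(f"{n_sig} of {t_stats.size} null regressions look 'significant'")
print(f"median sigma_hat: {np.median(sigma_hat):.3f} (target sigma_eps / sigma_x = 1)")
```

Reporting only the regressions counted in `n_sig` is exactly the p-hack described above.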

# Adversarial removal

Now, let us ask how much data we would need to drop to make a significant result \(\hat\beta_k\) statistically insignificant. Suppose without loss of generality that \(\hat\beta_k > 0\). If we can drop datapoints that decrease the estimate \(\hat\beta_k\) by the difference \(\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}\), then the estimator will become insignificant. Let’s assume for simplicity that the standard deviation \(\hat\sigma_k\) is less sensitive to datapoint removal than is \(\hat\beta_k\), which is typically the case in practice. Then, in the notation of Section 3 of our paper,

- The “signal” is \(\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}\),
- The “noise” is \(\hat\sigma_k\) (since we only consider the sensitivity of \(\hat\beta_k\), whose limiting distribution after \(\sqrt{N}\) scaling has standard deviation estimated by \(\hat\sigma_k\)),
- The “shape” \(\Gamma_\alpha\) is determined in a complicated way from \(\alpha\) and the distributions of the residuals and regressors, but it converges to a non-zero constant and deterministically satisfies \(\Gamma_\alpha \le \sqrt{\alpha(1-\alpha)}\),

and we expect to be unable to make the estimate insignificant when

\[ \textrm{Robust to adversarial subset dropping:}\quad\quad \frac{\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}}{\hat\sigma_k} \ge \Gamma_\alpha. \]

Multiplying both sides by \(\sqrt{N}\) and making some cancellations and rearrangements, we see that

\[ \textrm{Robust to adversarial subset dropping:}\quad\quad \frac{\sqrt{N} \hat\beta_k}{\hat\sigma_k} \ge \sqrt{N} \Gamma_\alpha + 1.96. \]

Now, as \(N \rightarrow \infty\), the left-hand side of the preceding display converges in distribution, and the right-hand side blows up. In other words, estimates formed from p-hacking are always non-robust, at any \(\alpha\), for sufficiently large \(N\)!
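Here is a small experiment in that spirit. To be clear, this is a crude first-order influence heuristic of my own for this toy problem, not the procedure or software from our paper: it picks the most significant of \(K\) null regressions, drops the \(\lfloor \alpha N \rfloor\) points whose approximate influence pushes \(\hat\beta_k\) furthest from zero, and refits. The helper names `fit` and `drop_most_influential`, the no-intercept model, and the values of \(N\), \(K\), and \(\alpha\) are all assumptions made for illustration.

```python
import numpy as np


def fit(x, y):
    """No-intercept OLS slope, robust sd of sqrt(n)*slope, and t-statistic."""
    n = x.shape[0]
    sxx = np.sum(x ** 2)
    beta = np.sum(x * y) / sxx
    resid = y - beta * x
    sigma = np.sqrt(n * np.sum((x * resid) ** 2)) / sxx
    return beta, sigma, np.sqrt(n) * beta / sigma


def drop_most_influential(x, y, alpha):
    """Drop the floor(alpha * n) points that most support the sign of the slope."""
    beta, _, _ = fit(x, y)
    # First-order effect of point n on the slope: x_n * resid_n / sum(x^2).
    influence = x * (y - beta * x) / np.sum(x ** 2)
    n_drop = int(np.floor(alpha * x.shape[0]))
    # Sort so that points pulling the slope away from zero come first.
    order = np.argsort(-np.sign(beta) * influence)
    keep = np.ones(x.shape[0], dtype=bool)
    keep[order[:n_drop]] = False
    return x[keep], y[keep]


rng = np.random.default_rng(1)
N, K, alpha = 1000, 200, 0.02
y = rng.normal(size=N)                      # pure noise response
X = rng.normal(size=(N, K))                 # regressors with no real effect

# The p-hack: keep only the regressor with the largest |t|.
t_all = np.array([fit(X[:, k], y)[2] for k in range(K)])
k_star = int(np.argmax(np.abs(t_all)))

beta_before, _, t_before = fit(X[:, k_star], y)
x_kept, y_kept = drop_most_influential(X[:, k_star], y, alpha)
beta_after, _, t_after = fit(x_kept, y_kept)

print(f"p-hacked t-statistic with all {N} points: {t_before:.2f}")
print(f"after dropping {N - x_kept.shape[0]} points:   {t_after:.2f}")
```

Rerunning with larger `N` at the same `alpha` is a quick way to see the \(\sqrt{N}\Gamma_\alpha\) term at work: the p-hacked t-statistic stays at about the same size, while the amount that removal can move it keeps growing.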

# Regression is not special

Although it’s convenient to work with regression, absolutely nothing about the preceding analysis is special to regression. The key is that p-hacking relies on variability of the same order as the shrinking confidence intervals, but adversarial subset removal produces changes that do not vanish asymptotically. P-hacking is thus done away with for the same reason that, as we say in the paper, statistical insignificance is always non-robust.

# Postscript

Michael makes a somewhat different (but still interesting) point in his plot “False Positives are Not Robust.” I believe that plot can be explained by observing that, in his setup, the effect of increasing \(\gamma\) is to increase \(\sigma_\varepsilon\) in my notation. The flat line in his plot can then be attributed to the fact that both the signal and the noise are proportional to \(\sigma_\varepsilon\), and so the residual scale cancels in the signal-to-noise ratio.
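The cancellation is easy to check numerically. In the sketch below I stand in for Michael’s \(\gamma\) by simply rescaling the residuals by a constant \(c\), which is my reading of his setup rather than his actual code; the function name `signal_to_noise` and the no-intercept model are likewise assumptions for illustration.

```python
import numpy as np


def signal_to_noise(x, y):
    """(beta_hat - 1.96 * sigma_hat / sqrt(n)) / sigma_hat for a no-intercept fit."""
    n = x.shape[0]
    sxx = np.sum(x ** 2)
    beta = np.sum(x * y) / sxx
    resid = y - beta * x
    # Robust sd estimate for sqrt(n) * beta (sandwich form, scaled by n).
    sigma = np.sqrt(n * np.sum((x * resid) ** 2)) / sxx
    return (beta - 1.96 * sigma / np.sqrt(n)) / sigma


rng = np.random.default_rng(2)
x = rng.normal(size=300)
eps = rng.normal(size=300)

# Rescaling the residuals multiplies beta_hat and sigma_hat by the same factor,
# so the ratio should not move (up to floating-point error).
snr_by_scale = {c: signal_to_noise(x, c * eps) for c in (0.5, 1.0, 10.0)}
for c, snr in snr_by_scale.items():
    print(f"residual scale {c:>4}: signal-to-noise = {snr:.6f}")
```

Since \(\hat\beta_k\) and \(\hat\sigma_k\) are both multiplied by \(c\), the three printed ratios agree.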