Let’s talk about Hacking. Not Ian Hacking this time — p-hacking! I’d like to elaborate on a nice post by Michael Wiebe, where he investigates whether my work with Rachael and Tamara on robustness to the removal of small, adversarially selected subsets can be a defense against p-hacking. In short: yes, it can be! Michael establishes this with some experiments, and I’d like to supplement his observation with a little theory from our paper. Let me state before diving in that I was unaware of this nice feature of our work before Michael’s post, so I’m super grateful to him both for noticing it and producing such a nice write-up.

# P-hacking for regression

Here’s what Michael means by p-hacking in a nutshell. (I’ll eliminate some details of his analysis that are unnecessary for my point here.) Suppose that you have $$N$$ data points, consisting of a mean-zero, normally distributed response, $$y_n$$, for $$n=1,...,N$$, and a large number of regressors, $$x_{nk}$$, $$k=1,...,K$$. Suppose that all of the $$x_{nk}$$ have variance $$\sigma_x^2$$ and are drawn independently of $$y_n$$. Despite the fact that the regressors have no real explanatory power, if you run all $$K$$ regressions $$y_n \sim \alpha_k + \beta_k x_{nk}$$, then, for large enough $$K$$, you expect to find at least one $$\hat\beta_k$$ that is statistically significant at any fixed level.

For example, if you construct a test with valid 95% confidence intervals and $$K=20$$, you expect one false rejection amongst the $$20$$ regressions. If you run all 20, but only report the rejection, it may appear that you have found a significant effect, but you’ve actually only p-hacked. P-hacking is at best sloppy and at worst dishonest, and we want defenses against it.
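This is easy to see in simulation. Here is a minimal sketch (the sample size, number of regressors, and seed are my own choices), which regresses pure noise on each of $$K$$ independent regressors and counts how many slopes look "significant":

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 20  # sample size and number of candidate regressors (my choices)

y = rng.normal(size=N)       # mean-zero response with no real effects
X = rng.normal(size=(N, K))  # regressors drawn independently of y

# Run the K univariate regressions y ~ alpha_k + beta_k * x_k, collecting
# the t-statistic for each slope with a robust (sandwich) standard error.
t_stats = np.empty(K)
for k in range(K):
    x = X[:, k] - X[:, k].mean()  # center to absorb the intercept
    beta_hat = (x @ y) / (x @ x)
    resid = y - y.mean() - beta_hat * x
    se = np.sqrt(x**2 @ resid**2) / (x @ x)  # unscaled SE of beta_hat
    t_stats[k] = beta_hat / se

n_sig = int(np.sum(np.abs(t_stats) > 1.96))
print(f"{n_sig} of {K} regressions are 'significant' at the 5% level")
```

Reporting only the winning regression from a run like this is exactly the p-hack in question.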

# Some notation

It will be helpful to write out our estimates explicitly. First, let’s write $$y_n = \varepsilon_n$$ with a residual $$\varepsilon_n \sim \mathcal{N}(0, \sigma_\varepsilon^2)$$, to emphasize that no regressors have real effects. Our estimators and their (robust) standard errors are

% \begin{align*} % \hat\beta_k = \frac{\sum_{n=1}^N x_{nk} \varepsilon_n}{\sum_{n=1}^N x_{nk}^2} \quad\quad\textrm{and}\quad\quad \hat\sigma_k^2 = \frac{N \sum_{n=1}^N x_{nk}^2 \hat\varepsilon_{kn}^2} {(\sum_{n=1}^N x_{nk}^2)^2} \quad\quad\textrm{where}\quad\quad \hat\varepsilon_{kn} = y_n - \hat\beta_k x_{nk}. % \end{align*} %

A standard 95% univariate test rejects $$H_0: \beta_k = 0$$ if

% \begin{align*} % \textrm{Statistical significance:}\quad\quad \sqrt{N} |\hat\beta_k| > 1.96 \hat\sigma_k. % \end{align*} %

Now, we actually expect that, asymptotically and marginally, $$\sqrt{N} \hat\beta_k \sim \mathcal{N}(0, \sigma_\varepsilon^2 / \sigma_x^2)$$ by the central limit theorem. For any fixed $$k$$, we expect $$\hat\sigma_k \rightarrow \sigma_\varepsilon / \sigma_x$$, and even for $$k$$ chosen via p-hacking, we might expect $$\hat\sigma_k$$ not to be orders of magnitude off. So, among many draws $$k=1,\ldots,K$$ from the distribution of $$\sqrt{N}\hat\beta_k$$, we expect a few to be “statistically significant”. So far, I’ve just re-stated formally how p-hacking works.
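To put a number on “a few”: if the $$K$$ t-statistics were exactly independent and the tests exactly 5%-level (an idealization; the values of $$K$$ below are my own), the chance of at least one false rejection is $$1 - 0.95^K$$:

```python
# Under the normal approximation above, each test falsely rejects with
# probability 0.05, so with K independent regressors the chance that at
# least one looks "significant" is 1 - 0.95**K.
probs = {K: 1 - 0.95**K for K in [1, 5, 20, 100]}
for K, p in probs.items():
    print(f"K = {K:>3}: P(at least one false rejection) = {p:.3f}")
```

Already at $$K = 20$$ the p-hacker succeeds more often than not.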

Now, let us ask how much data we would need to drop to make a significant result $$\hat\beta_k$$ statistically insignificant. Suppose without loss of generality that $$\hat\beta_k > 0$$. If we can drop data points that decrease the estimate $$\hat\beta_k$$ by the difference $$\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}$$, then the estimator will become insignificant. Let’s assume for simplicity that the standard deviation $$\hat\sigma_k$$ is less sensitive to data point removal than is $$\hat\beta_k$$, which is typically the case in practice. Then, in the notation of Section 3 of our paper,

- The “signal” is $$\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}$$,
- The “noise” is $$\hat\sigma_k$$ (since we only consider the sensitivity of $$\hat\beta_k$$, the standard deviation of whose limiting distribution, after scaling by $$\sqrt{N}$$, is approximately $$\hat\sigma_k$$),
- The “shape” $$\Gamma_\alpha$$ is determined in a complicated way by $$\alpha$$ and the distributions of the residuals and regressors, but it converges to a non-zero constant and deterministically satisfies $$\Gamma_\alpha \le \sqrt{\alpha(1-\alpha)}$$,

and we expect to be unable to make the estimate insignificant when

% \begin{align*} % \textrm{Robust to adversarial subset dropping:}\quad\quad \frac{\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}}{\hat\sigma_k} \ge \Gamma_\alpha. % \end{align*} %

Multiplying both sides by $$\sqrt{N}$$ and making some cancellations and rearrangements, we see that % \begin{align*} % \textrm{Robust to adversarial subset dropping:}\quad\quad \frac{\sqrt{N} \hat\beta_k}{\hat\sigma_k} \ge \sqrt{N} \Gamma_\alpha + 1.96. % \end{align*} %

Now, as $$N \rightarrow \infty$$, the left-hand side of the preceding display converges in distribution, and the right-hand side blows up. In other words, estimates formed from p-hacking are always non-robust, at any $$\alpha$$, for sufficiently large $$N$$!
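To see how quickly the right-hand side blows up, here is a quick numeric illustration (the values of $$\alpha$$ and $$N$$ are my own choices). It uses the deterministic bound $$\Gamma_\alpha \le \sqrt{\alpha(1-\alpha)}$$, so that clearing the printed threshold is sufficient for the robustness condition above to hold:

```python
import math

# A t-statistic clearing sqrt(N * alpha * (1 - alpha)) + 1.96 also clears
# sqrt(N) * Gamma_alpha + 1.96, since Gamma_alpha <= sqrt(alpha * (1 - alpha)).
alpha = 0.01  # fraction of data the adversary may drop (my choice)
thresholds = {N: math.sqrt(N * alpha * (1 - alpha)) + 1.96
              for N in [100, 1000, 10000]}
for N, t in thresholds.items():
    print(f"N = {N:>6}: robust if sqrt(N) * beta_hat / sigma_hat >= {t:.2f}")
```

The threshold grows like $$\sqrt{N}$$, while a p-hacked t-statistic stays around 2 or 3, which is the asymptotic non-robustness in concrete numbers.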

# Regression is not special

Although it’s convenient to work with regression, absolutely nothing about the preceding analysis is special to regression. The key is that p-hacking relies on variability of the same order as the shrinking confidence intervals, whereas adversarial subset removal produces changes that do not vanish asymptotically. P-hacking is thus defeated for the same reason that, as we say in the paper, statistical insignificance is always non-robust.
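As a sanity check on this claim, here is a small simulation sketch, entirely my own construction rather than the paper’s influence-function machinery: it p-hacks the best of $$K$$ noise regressors, then greedily drops the data point that most inflates the estimate, refitting each time, until significance is lost. If the theory above is right, only a small fraction of the points should be needed:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 1000, 50
y = rng.normal(size=N)       # pure noise response
X = rng.normal(size=(N, K))  # independent candidate regressors

def fit(x, y):
    """Univariate OLS with an intercept; returns (beta_hat, unscaled robust SE)."""
    xc, yc = x - x.mean(), y - y.mean()
    beta = (xc @ yc) / (xc @ xc)
    resid = yc - beta * xc
    se = np.sqrt(xc**2 @ resid**2) / (xc @ xc)
    return beta, se

# P-hack: keep only the regressor with the largest |t|-statistic.
fits = [fit(X[:, k], y) for k in range(K)]
k_star = int(np.argmax([abs(b / s) for b, s in fits]))
x = X[:, k_star]
b, s = fits[k_star]
t0 = abs(b) / s

# Greedily drop the point whose removal most shrinks |beta_hat| (a crude
# stand-in for the adversarial choice), refit, and repeat until |t| <= 1.96.
# If the p-hack happened to find nothing significant, nothing is dropped.
keep = np.ones(N, dtype=bool)
n_dropped = 0
while abs(b) > 1.96 * s and n_dropped < N // 2:
    xc = x[keep] - x[keep].mean()
    resid = y[keep] - y[keep].mean() - b * xc
    influence = np.sign(b) * xc * resid  # leave-one-out effect on beta_hat, up to scale
    drop = np.flatnonzero(keep)[np.argmax(influence)]
    keep[drop] = False
    n_dropped += 1
    b, s = fit(x[keep], y[keep])

print(f"initial |t| = {t0:.2f}; dropped {n_dropped} of {N} points "
      f"({100 * n_dropped / N:.1f}%) to lose significance")
```

Note that the significance check $$|\hat\beta_k| > 1.96 \, \widehat{\textrm{se}}$$ here is the same test as in the main text, just written with the unscaled standard error.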

# Postscript

Michael makes a somewhat different (but still interesting) point in his plot “False Positives are Not Robust.” I believe that plot can be explained by observing that, in his setup, the effect of increasing $$\gamma$$ is to increase $$\sigma_\varepsilon$$ in my notation. The flat line in his plot can then be attributed to the fact that both the signal and the noise are proportional to $$\sigma_\varepsilon$$, so the residual scale cancels in the signal-to-noise ratio.