When we present our work on AMIP, the relationship with the bootstrap often comes up. I think there’s a lot to say, but there’s one particularly useful perspective: the (empirical, nonparametric) bootstrap can be thought of as randomly querying the influence function. From this perspective, it seems clear both (a) why the bootstrap works as an estimator of variance and (b) why it won’t work for finding the approximately most influential set, i.e., the set of points with the most extreme values of the influence function (the AMIS in our paper).

Let’s suppose that you have a vector \(\psi \in \mathbb{R}^N\), with \(\sum_{n=1}^N \psi_n = 0\), where \(N\) is very large. We would like to know about \(\psi\), but suppose we can’t access it directly. Rather, we can only query it via inner products \(v^T \psi\). Moreover, suppose we can only compute \(B\) such inner products, where \(B \ll N\). For the purpose of this post, \(\psi\) will be the influence scores, \(v\) will be rescaled bootstrap weight vectors, \(N\) will be the number of data points, and \(B\) the number of bootstrap samples. But the discussion can start out more generally.

Suppose we don’t know anything a priori about \(\psi\), so we query it randomly, drawing IID entries for \(v\) from a distribution with mean zero and unit variance. Let the \(b\)-th random vector be denoted \(v^{b}\). We can ask: What can the collection of inner products \(V_B := \{v^{1} \cdot \psi, \ldots, v^{B} \cdot \psi \}\) tell us about \(\psi\)?
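To make this concrete, here’s a minimal numpy sketch of the setup (all names, and the particular heavy-tailed choice of \(\psi\), are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 100_000, 200  # B << N

# A stand-in for the influence scores: heavy-tailed, centered so sum(psi) = 0.
psi = rng.standard_t(df=3, size=N)
psi -= psi.mean()

# Random query vectors with IID mean-zero, unit-variance entries.
v = rng.standard_normal((B, N))

# All we get to observe: the B inner products v^b . psi.
V_B = v @ psi  # shape (B,)
```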

At first glance, the answer seems to be “not much other than the scale.” The set \(V_B\) tells us only about the projection of \(\psi\) onto a \(B\)-dimensional subspace, out of \(N \gg B\) total dimensions. Furthermore, since \(\mathbb{E}[v \cdot \psi] = \sum_{n=1}^N \mathbb{E}[v_n] \psi_n = 0\), the vectors \(v^b\) are, on average, orthogonal to \(\psi\). So we do not expect the projection of \(\psi\) onto the span of \(v^{1}, \ldots, v^{B}\) to account for an appreciable proportion of \(|| \psi ||_2\). The set \(V_B\) can estimate the scale \(|| \psi ||_2\), however, since \(\mathrm{Var}(v \cdot \psi) = \sum_{n=1}^N \mathbb{E}[v_n^2] \psi_n^2 = \sum_{n=1}^N \psi_n^2 = ||\psi||_2^2\), and \(\mathrm{Var}(v \cdot \psi)\) can be estimated using the sample variance of the \(v^b \cdot \psi\).
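Continuing the sketch above, we can check both claims numerically: the sample variance of the queries recovers \(||\psi||_2^2\) up to Monte Carlo error, while the projection of \(\psi\) onto the span of the query vectors captures only a fraction of roughly \(B/N\) of \(||\psi||_2^2\):

```python
# Claim (1): the sample variance of the queries estimates ||psi||_2^2.
print(V_B.var(ddof=1), np.sum(psi**2))  # close, up to Monte Carlo error

# Claim (2): the projection of psi onto span{v^1, ..., v^B} recovers
# very little of psi.
coef, *_ = np.linalg.lstsq(v.T, psi, rcond=None)  # regress psi on the queries
proj = v.T @ coef
print(np.sum(proj**2) / np.sum(psi**2))  # roughly B / N, i.e., tiny
```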

Note that the bootstrap is very similar to drawing \(v_n + 1 \sim \mathrm{Poisson}(1)\) independently across \(n\); the proper bootstrap actually has some correlation between different entries due to the constraint \(\sum_{n=1}^N (v_n + 1) = N\), but this correlation is of order \(1/N\) and can be neglected for simplicity in the present argument. The argument of the previous paragraph implies that the bootstrap effectively randomly projects \(\psi\) onto a very low-dimensional subspace, presumably losing most of its detail in doing so. It also makes sense that the bootstrap can tell us about \(||\psi||_2\): recall that \(||\psi||_2^2\) consistently estimates the variance of the limiting distribution of our statistic, a quantity that we know the bootstrap is also able to estimate.
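In code, the bootstrap version of the experiment just swaps the Gaussian queries for centered Poisson weights (a sketch of the Poisson approximation above, ignoring the multinomial constraint):

```python
# Approximate bootstrap weights: counts ~ Poisson(1), so v_n = count - 1
# has mean zero and unit variance, matching the generic query scheme above.
v_boot = rng.poisson(lam=1.0, size=(B, N)) - 1.0

# Each query is (up to centering) a bootstrap replicate of the statistic's
# linear approximation; its sample variance again estimates ||psi||_2^2.
V_boot = v_boot @ psi
print(V_boot.var(ddof=1), np.sum(psi**2))
```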

Recall that the AMIP from our paper is \(-\sum_{n=1}^{\lfloor \alpha N \rfloor} \psi_{(n)}\), where \(\psi_{(n)}\) is the \(n\)-th entry of \(\psi\) sorted in increasing order, so the sum runs over the most extreme (most negative) influence scores. From the argument sketch above, I conjecture that the bootstrap distribution actually doesn’t convey much information about the AMIP other than the limiting variance. In particular, in the terms of our paper, I conjecture that the bootstrap can tell us about the “noise” of the AMIP but not the “shape.”
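For reference, with direct access to \(\psi\) the AMIP is just a sort and a partial sum; here’s a tiny sketch under the sign convention above, with `alpha` denoting the removal fraction:

```python
def amip(psi, alpha):
    """Negative sum of the floor(alpha * N) smallest influence scores."""
    k = int(np.floor(alpha * len(psi)))
    return -np.sum(np.sort(psi)[:k])

print(amip(psi, alpha=0.01))
```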

Incidentally, the above perspective is also relevant for situations where we cannot form and/or invert the full Hessian matrix \(H\), and so cannot compute \(\psi\) directly. If we imagine sketching \(H^{-1}\), e.g., by applying the conjugate gradient method to random vectors, we would run into a problem conceptually quite similar to the one posed by the bootstrap. It’s interesting to think about how one could improve on random sampling in such a case.
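As a hypothetical illustration of that setting, here is a matrix-free conjugate gradient solve against a stand-in Hessian-vector product; in practice `hvp` would come from automatic differentiation and \(H\) would never be formed explicitly:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
D = 50  # parameter dimension

# Positive-definite stand-in for the Hessian (for illustration only).
A = rng.standard_normal((D, D))
H = A @ A.T + np.eye(D)

def hvp(x):
    # In practice: a Hessian-vector product from autodiff, no explicit H.
    return H @ x

H_op = LinearOperator((D, D), matvec=hvp)

# One matrix-free "query" of H^{-1}: solve H x = u for a random direction u.
u = rng.standard_normal(D)
x, info = cg(H_op, u)
assert info == 0  # CG converged
```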