To think about the influence function, think about sums.

I think the key to thinking intuitively about the influence function in our work on AMIP is this: linearization approximates a complicated estimator with a simple sum. If you can establish that the linearization provides a good approximation, then you can reason about your complicated estimator by reasoning about sums. And sums are easy to reason about.

Specifically, suppose you have data weights \(w = (w_1, \ldots, w_N)\) and an estimator \(\phi(w) \in \mathbb{R}\) which depends on the weights in some complicated way. Let \(\phi^{lin}\) denote the first-order Taylor series expansion around the unit weight vector \(\vec{1} := (1, \ldots, 1)\)

\[\phi^{lin}(w) := \phi(\vec{1}) + \sum_{n=1}^N \psi_n (w_n - 1) = \phi(\vec{1}) + \sum_{n=1}^N \psi_n w_n \quad\textrm{where}\quad \psi_n := \frac{\partial \phi(w)}{\partial w_n}\Bigg|_{\vec{1}},\]

and we have used the fact that \(\sum_{n=1}^N \psi_n = 0\) for Z-estimators. (For situations where \(\sum_{n=1}^N \psi_n \ne 0\), just keep that sum around, and everything I say in this post still applies.) Thinking now of \(\psi\) as data, we can (in some abuse of notation) write \(\phi^{lin}(\psi) = \phi(\vec{1}) + \sum_{n=1}^N \psi_n\). If \(\phi^{lin}(w)\) is a good approximation to \(\phi(w)\), then the effect of leaving a datapoint out of \(\phi(w)\) is well-approximated by the effect of leaving the corresponding entry out of \(\psi\) in \(\phi^{lin}(\psi)\). We have, in effect, replaced a complicated data dependence with a simple sum of terms. This is what linearization does for us. (NB: if our original estimator had been a sum of the data, the linearization would be exact!)
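To make this concrete, here is a minimal sketch in Python with NumPy, using a weighted least-squares slope as a stand-in for \(\phi(w)\). The simulated data, the function names, and the choice of OLS are assumptions for illustration, not something taken from the AMIP paper. The closed form used for \(\psi_n\) is the standard derivative of the weighted least-squares solution with respect to the weights.

```python
import numpy as np

def slope(x, y, w):
    """phi(w): the slope of a weighted least-squares fit of y on (1, x)."""
    X = np.column_stack([np.ones_like(x), x])
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)[1]

def slope_influence(x, y):
    """psi_n = d phi / d w_n evaluated at w = (1, ..., 1)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    # d beta / d w_n = (X^T X)^{-1} x_n * resid_n; column n holds point n.
    dbeta_dw = np.linalg.solve(X.T @ X, X.T * resid)  # shape (2, N)
    return dbeta_dw[1]  # keep the slope row

# Simulated data (an assumption, purely for illustration).
rng = np.random.default_rng(0)
N = 500
x = rng.normal(size=N)
y = 0.5 * x + rng.normal(size=N)

ones = np.ones(N)
psi = slope_influence(x, y)

# Leave out the first 10 points by setting their weights to zero.
w = ones.copy()
w[:10] = 0.0

exact = slope(x, y, w)                        # refit without the points
linear = slope(x, y, ones) + psi @ (w - 1.0)  # phi^lin(w): just a sum
```

Here `psi.sum()` is zero up to floating-point error, as expected for a Z-estimator, and `linear` should track `exact` closely when only a small fraction of the points is dropped.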

Typically \(\psi_n = O_p(N^{-1})\), so it’s a little helpful to define \(\gamma_n := N \psi_n\). We then can write:

\[\phi^{lin}(\gamma) := \phi(\vec{1}) + \frac{1}{N}\sum_{n=1}^N \gamma_n.\]

We can now ask what kinds of changes we can produce in \(\phi^{lin}(\gamma)\) by dropping entries from \(\gamma\) (while keeping \(N\) the same), and some of the core conclusions of our paper become obvious. Definitionally, \(\sum_{n=1}^N \gamma_n = 0\), so dropping a set of entries changes \(\frac{1}{N}\sum_{n=1}^N \gamma_n\) by exactly minus \(\frac{1}{N}\) times the sum of the dropped \(\gamma_n\). For example, if we drop \(\alpha N\) points, for some fixed \(0 < \alpha < 1\), then the amount we can change the sum \(\frac{1}{N}\sum_{n=1}^N \gamma_n\) does not vanish, no matter how large \(N\) is, and no matter how small \(\alpha\) is. The amount you can change the sum \(\frac{1}{N}\sum_{n=1}^N \gamma_n\) also obviously depends on the tail shape of the distribution of the \(\gamma_n\), as well as their absolute scale. Increasing the scale (i.e., increasing the noise) obviously increases the amount you can change the sum. And, for a given scale (i.e., a given \(\frac{1}{N} \sum_{n=1}^N \gamma_n^2\)), you will be able to change the sum by the most when the left-out \(\gamma_n\) all take the same value.
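These claims about sums are easy to check numerically. The following is a rough sketch, assuming simulated Gaussian \(\gamma_n\) and a hand-built two-point alternative. Both choices, and the helper names `max_decrease` and `standardize`, are hypothetical and only for illustration. Each \(\gamma\) is centered and scaled to have \(\frac{1}{N}\sum_{n=1}^N \gamma_n^2 = 1\), so only the shape of the distribution differs.

```python
import numpy as np

def max_decrease(gamma, alpha):
    """Largest decrease in (1/N) * sum(gamma) from dropping alpha * N entries.

    N is held fixed, so the decrease is the sum of the dropped (largest)
    entries divided by N.
    """
    N = gamma.size
    k = int(round(alpha * N))
    return np.sort(gamma)[N - k:].sum() / N

def standardize(gamma):
    """Center to sum zero and rescale so that mean(gamma ** 2) == 1."""
    gamma = gamma - gamma.mean()
    return gamma / np.sqrt(np.mean(gamma ** 2))

rng = np.random.default_rng(1)
alpha = 0.01

# Gaussian gamma: the achievable change is about the same at both sizes.
gaussian_change = {
    N: max_decrease(standardize(rng.normal(size=N)), alpha)
    for N in (1_000, 100_000)
}

# Two-point gamma with the same scale: the alpha * N dropped entries all
# share one value, which gives a change of sqrt(alpha * (1 - alpha)).
N = 100_000
k = int(round(alpha * N))
two_point = np.full(N, -np.sqrt(alpha / (1 - alpha)))
two_point[:k] = np.sqrt((1 - alpha) / alpha)
two_point_change = max_decrease(two_point, alpha)
```

With these settings, the two entries of `gaussian_change` should be close to each other rather than shrinking with \(N\), and `two_point_change` should be several times larger than either, even though all three vectors have the same scale. Multiplying any `gamma` by a constant multiplies its `max_decrease` by the same constant.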

So one way to think about AMIP is this: we provide a good approximation to your original statistic that takes the form of a simple sum over your data. Dropping datapoints corresponds to dropping entries from this sum. You can then think about whether dropping a set selected in a certain way is reasonable in terms of dropping entries from a sum, about which it’s easy to have good intuition!