amip
Published

December 1, 2021

I think the key to thinking intuitively about the influence function in our work on AMIP is this: linearization approximates a complicated estimator with a simple sum. If you can establish that the linearization provides a good approximation, then you can reason about your complicated estimator by reasoning about sums. And sums are easy to reason about.

Specifically, suppose you have data weights $$w = (w_1, \ldots, w_N)$$ and an estimator $$\phi(w) \in \mathbb{R}$$ which depends on the weights in some complicated way. Let $$\phi^{lin}$$ denote the first-order Taylor series expansion around the unit weight vector $$\vec{1} := (1, \ldots, 1)$$

$\phi^{lin}(w) := \phi(\vec{1}) + \sum_{n=1}^N \psi_n (w_n - 1) = \phi(\vec{1}) + \sum_{n=1}^N \psi_n w_n \quad\textrm{where}\quad \psi_n := \frac{\partial \phi(w)}{\partial w_n}\Bigg|_{\vec{1}},$

and we have used the fact that $$\sum_{n=1}^N \psi_n = 0$$ for Z-estimators. (For situations where $$\sum_{n=1}^N \psi_n \ne 0$$, just keep that sum around, and everything I say in this post still applies.) Thinking now of $$\psi$$ as data, we can (in some abuse of notation) write $$\phi^{lin}(\psi) = \phi(\vec{1}) + \sum_{n=1}^N \psi_n$$. If $$\phi^{lin}(w)$$ is a good approximation to $$\phi(w)$$, then the effect of leaving a datapoint out of $$\phi(w)$$ is well-approximated by the effect of leaving the corresponding entry out of $$\psi$$ in $$\phi^{lin}(\psi)$$. We have, in effect, replaced a complicated data dependence with a simple sum of terms. This is what linearization does for us. (NB: if our original estimator had been a sum of the data, the linearization would be exact!)
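To make this concrete, here is a minimal sketch using a weighted mean as a stand-in for $$\phi(w)$$. The data, the choice of estimator, and the function names (`phi`, `influence_scores`, `phi_lin`) are all assumptions made for illustration and are not taken from the paper. The weighted mean is a ratio of sums rather than a sum, so the linearization is close but not exact.

```python
# Hypothetical example: phi(w) is a weighted mean, a very simple Z-estimator
# (it solves sum_n w_n * (x_n - theta) = 0 for theta).
def phi(w, x):
    return sum(wn * xn for wn, xn in zip(w, x)) / sum(w)


def influence_scores(x):
    """psi_n = d phi / d w_n at w = (1, ..., 1), derived by hand for the weighted mean."""
    n_obs = len(x)
    x_bar = sum(x) / n_obs
    return [(xn - x_bar) / n_obs for xn in x]


def phi_lin(w, x):
    """First-order Taylor expansion of phi around the unit weight vector."""
    ones = [1.0] * len(x)
    psi = influence_scores(x)
    return phi(ones, x) + sum(p * (wn - 1.0) for p, wn in zip(psi, w))


x = [0.3, -1.2, 2.5, 0.8, -0.4, 1.1, -2.0, 0.6, 1.9, -0.7]  # made-up data
ones = [1.0] * len(x)
psi = influence_scores(x)

# Leave out the datapoint with the largest score by setting its weight to zero.
drop = max(range(len(x)), key=lambda n: psi[n])
w = list(ones)
w[drop] = 0.0

exact_change = phi(w, x) - phi(ones, x)       # refit without the datapoint
linear_change = phi_lin(w, x) - phi(ones, x)  # just removes psi[drop] from the sum

print(f"sum of psi:    {sum(psi):.2e}")
print(f"exact change:  {exact_change:.4f}")
print(f"linear change: {linear_change:.4f}")
```

For this particular estimator the exact leave-one-out change is $$-(x_n - \bar{x}) / (N - 1)$$ while the linear approximation gives $$-(x_n - \bar{x}) / N$$, so the gap between the two shrinks as $$N$$ grows.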

Typically $$\psi_n = O_p(N^{-1})$$, so it’s a little helpful to define $$\gamma_n := N \psi_n$$. We then can write:

$\phi^{lin}(\gamma) := \phi(\vec{1}) + \frac{1}{N}\sum_{n=1}^N \gamma_n.$

We can now ask what kinds of changes we can produce in $$\phi^{lin}(\gamma)$$ by dropping entries from $$\gamma$$ (while keeping $$N$$ the same), and some of the core conclusions of our paper become obvious. Recall that $$\sum_{n=1}^N \gamma_n = 0$$ by construction, so dropping any set of entries with a nonzero sum moves the average away from zero. For example, if we drop $$\alpha N$$ points, for some fixed $$0 < \alpha < 1$$, then the amount by which we can change the sum $$\frac{1}{N}\sum_{n=1}^N \gamma_n$$ does not vanish, no matter how large $$N$$ is, and no matter how small $$\alpha$$ is. The amount by which you can change the sum also depends on the tail shape of the distribution of the $$\gamma_n$$, as well as their absolute scale. Increasing the scale (i.e., increasing the noise) increases the amount by which you can change the sum. And, for a given scale (i.e., a given $$\frac{1}{N} \sum_{n=1}^N \gamma_n^2$$), you will be able to change the sum by the most when the left-out $$\gamma_n$$ all take the same value.
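These claims are easy to check numerically. The sketch below is an illustration under assumptions of my own: the $$\gamma_n$$ are taken to be centered standard normal quantiles (a deterministic stand-in for Gaussian-shaped scores), and the helper names are hypothetical. It drops the largest $$\alpha N$$ entries for several values of $$N$$, then compares against a two-valued $$\gamma$$ with the same scale whose dropped entries all share one value. For that case the change has magnitude $$\sqrt{\alpha (1 - \alpha)}$$ times the root-mean-square scale, which is the largest possible by a Cauchy-Schwarz argument using $$\sum_{n=1}^N \gamma_n = 0$$ (a standard bound, not a result quoted from the paper).

```python
from math import sqrt
from statistics import NormalDist


def max_change_from_dropping(gamma, alpha):
    """Most negative change in (1/N) * sum(gamma) from dropping about alpha * N entries.

    N is held fixed, so dropping an entry simply removes its term from the sum.
    """
    n_obs = len(gamma)
    n_drop = round(alpha * n_obs)
    largest = sorted(gamma, reverse=True)[:n_drop]
    return -sum(largest) / n_obs


def normal_scores(n_obs):
    """Assumed Gaussian-shaped scores: centered standard normal quantiles."""
    std_normal = NormalDist()
    raw = [std_normal.inv_cdf((n + 0.5) / n_obs) for n in range(n_obs)]
    mean = sum(raw) / n_obs
    return [g - mean for g in raw]


def two_point_scores(n_obs, alpha):
    """Unit-scale scores whose top alpha * N entries all share one value."""
    n_drop = round(alpha * n_obs)
    high = sqrt((1 - alpha) / alpha)
    low = -n_drop * high / (n_obs - n_drop)  # chosen so the entries sum to zero
    return [high] * n_drop + [low] * (n_obs - n_drop)


alpha = 0.01

# The achievable change settles to a nonzero value rather than shrinking with N.
changes = {
    n_obs: max_change_from_dropping(normal_scores(n_obs), alpha)
    for n_obs in (1_000, 10_000, 100_000)
}

# Doubling the scale of the scores doubles the achievable change.
gamma = normal_scores(1_000)
doubled_change = max_change_from_dropping([2 * g for g in gamma], alpha)

# Same scale, different tail shape: the dropped entries all take the same value.
two_point_change = max_change_from_dropping(two_point_scores(1_000, alpha), alpha)

for n_obs, change in changes.items():
    print(f"N = {n_obs:>7}: change = {change:.4f}")
print(f"doubled scale: {doubled_change:.4f}")
print(f"two-point:     {two_point_change:.4f}")
```

With $$\alpha = 0.01$$, the two-valued scores allow a change several times larger than the Gaussian-shaped scores at the same scale, which is the tail-shape dependence described above.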

So one way to think about AMIP is this: we provide a good approximation to your original statistic that takes the form of a simple sum over your data. Dropping datapoints corresponds to dropping terms from this sum. You can then judge whether dropping a set that is selected in a certain way is reasonable by thinking about dropping entries from a sum, about which it’s easy to have good intuition!