In this post, I’ll try to connect a few different ways of viewing jackknife and infinitesimal jackknife bias correction. It may help provide some intuition, as well as an introduction to using the infinitesimal jackknife and the von Mises expansion to think about bias correction.

Throughout, for concreteness, I’ll use the simple example of the statistic

\[\begin{aligned} T & =\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}\right)^{2},\end{aligned}\]

where the \(x_{n}\) are IID and \(\mathbb{E}\left[x_{n}\right]=0\). Of course, the bias of \(T\) is known in closed form: since \(T\) estimates \(\left(\mathbb{E}\left[x_{n}\right]\right)^{2}=0\), the bias is simply \(\mathbb{E}\left[T\right]\), and

\[\begin{aligned} \mathbb{E}\left[T\right] & =\frac{1}{N^{2}}\mathbb{E}\left[\sum_{n=1}^{N}x_{n}^{2}+\sum_{n_{1}\ne n_{2}}x_{n_{1}}x_{n_{2}}\right] =\frac{\mathrm{Var}\left(x_{1}\right)}{N}. \end{aligned}\]

We will check that we recover consistent estimates of this bias from each of the different perspectives. Of course, these concepts are most useful when we do not readily have such a simple expression for the bias, as in, for example, Bayesian expectations.
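As a quick numerical sanity check, here is a short simulation of this bias. It is only a sketch: NumPy, the normal data, and the particular values of \(N\) and \(\mathrm{Var}\left(x_{1}\right)\) are arbitrary choices of mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_sims = 50, 100_000
sigma2 = 2.0  # Var(x_n); the x_n are mean zero, so the true bias is sigma2 / N

# Simulate T = (sample mean)^2 many times and compare E[T] to Var(x_1) / N.
x = rng.normal(0.0, np.sqrt(sigma2), size=(n_sims, N))
T = x.mean(axis=1) ** 2

print("Monte Carlo estimate of E[T]:", T.mean())   # approximately 0.04
print("Var(x_1) / N:                ", sigma2 / N)  # exactly 0.04
```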

At different points I will use different arguments for \(T\), hopefully without any real ambiguity. For convenience, write

\[\begin{aligned} \hat{\mu}:=\frac{1}{N}\sum_{n=1}^{N}x_{n} \quad\textrm{and}\quad \hat{\sigma}^{2}:=\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2},\end{aligned}\]

so that our example can be expressed as

\[\begin{aligned} T & =\hat{\mu}^{2}.\end{aligned}\]

An asymptotic series in \(N\).

Perhaps the most common way to understand the jackknife bias estimator and correction is as an asymptotic series in \(N\). Suppose that we have some reason to believe that the expectation of \(T\) admits an asymptotic expansion in powers of \(N^{-1}\), where \(N\) is the size of the observed dataset:

\[\begin{aligned} \mathbb{E}\left[T_{N}\right] & =a_{0}+a_{1}N^{-1}+o\left(N^{-1}\right),\end{aligned}\]

where \(a_{0}\) is the quantity being estimated (the limit of \(\mathbb{E}\left[T_{N}\right]\) as \(N\rightarrow\infty\)), so that the bias of \(T_{N}\) is \(a_{1}N^{-1}+o\left(N^{-1}\right)\).

The jackknife bias estimator works as follows. Let \(T_{-i}\) denote \(T\) calculated with datapoint \(i\) left out. Then

\[\begin{aligned} \mathbb{E}\left[T_{N}-T_{-i}\right] & =\left(a_{0}+a_{1}N^{-1}\right)-\left(a_{0}+a_{1}\left(N-1\right)^{-1}\right)+o\left(N^{-1}\right)\\ & =a_{1}\frac{N-1-N}{N\left(N-1\right)}+o\left(N^{-1}\right)\\ & =-\frac{a_{1}N^{-1}}{N-1}+o\left(N^{-1}\right)\\ & =-\frac{\mathbb{E}\left[T_{N}\right]-a_{0}}{N-1}+o\left(N^{-1}\right).\end{aligned}\]

Consequently,

\[\begin{aligned} \hat{B} & :=-\left(N-1\right)\left(T_{N}-\frac{1}{N}\sum_{n=1}^{N}T_{-n}\right)\\ \mathbb{E}\left[\hat{B}\right] & =\mathbb{E}\left[T_{N}\right]-a_{0}+o\left(N^{-1}\right)\end{aligned}\]

is an estimate of the leading-order term \(a_{1}N^{-1}\) in the bias of \(T_{N}\), and the bias-corrected estimate \(T_{N}-\hat{B}\) has bias of smaller order \(o\left(N^{-1}\right)\),

\[\begin{aligned} \mathbb{E}\left[T_{N}-\hat{B}\right] & =a_{0}+o\left(N^{-1}\right).\end{aligned}\]

(Strictly speaking, carrying the \(o\left(N^{-1}\right)\) remainders through the multiplication by \(N-1\) requires the expansion of \(\mathbb{E}\left[T_{N}\right]\) to hold to one order higher, but I will not belabor this point.)

In our example,

\[\begin{aligned} T_{-i} & =\left(\frac{1}{N-1}\sum_{n\ne i}^{N}x_{n}\right)^{2}\\ & =\left(\frac{N}{N-1}\hat{\mu}-\frac{1}{N-1}x_{i}\right)^{2}\\ & =\left(N-1\right)^{-2}\left(N^{2}\hat{\mu}^{2}-2N\hat{\mu}x_{i}+x_{i}^{2}\right)\\ \frac{1}{N}\sum_{n=1}^{N}T_{-n} & =\left(N-1\right)^{-2}\left(\left(N^{2}-2N\right)\hat{\mu}^{2}+\hat{\sigma}^{2}\right)\\ \hat{B} & =-\left(N-1\right)^{-1}\left(\left(N-1\right)^{2}\hat{\mu}^{2}-\left(N^{2}-2N\right)\hat{\mu}^{2}-\hat{\sigma}^{2}\right)\\ & =\frac{1}{N-1}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned}\]

which is a perfectly good estimate of the bias.
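Here is the same calculation done numerically: a sketch (again with NumPy and simulated normal data, both my own choices) that computes the leave-one-out values \(T_{-i}\), forms the jackknife bias estimate \(\hat{B}\), and compares it to the closed form \(\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right)/\left(N-1\right)\) and to the true bias \(\mathrm{Var}\left(x_{1}\right)/N\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(0.0, 1.0, size=N)  # mean-zero data with Var(x_1) = 1

def T(sample):
    """The statistic: the squared sample mean."""
    return sample.mean() ** 2

# Leave-one-out values T_{-i}.
T_loo = np.array([T(np.delete(x, i)) for i in range(N)])

# Jackknife bias estimate: B_hat = -(N - 1) * (T_N - mean of the T_{-n}).
B_hat = -(N - 1) * (T(x) - T_loo.mean())

# Closed form derived above: (sigma_hat^2 - mu_hat^2) / (N - 1).
mu_hat, sigma2_hat = x.mean(), (x ** 2).mean()
B_closed_form = (sigma2_hat - mu_hat ** 2) / (N - 1)

print("jackknife B_hat:", B_hat)
print("closed form:    ", B_closed_form)  # agrees with B_hat
print("true bias Var/N:", 1.0 / N)
```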

A Taylor series in \(1/N\).

An equivalent way of looking at the previous calculation is to imagine \(T\) as a function of \(\tau=1/N\), numerically estimate the derivative \(dT/d\tau\), and extrapolate to \(\tau=0\). Using the notation of the previous section, define the gradient estimate

\[\begin{aligned} \hat{g}_{i} & =\frac{T_{N}-T_{-i}}{\frac{1}{N}-\frac{1}{N-1}}.\end{aligned}\]

Here, we are viewing \(T_{-i}\) as an instance of the estimator evaluated at \(\tau=1/\left(N-1\right)\). By rearranging, we find that

\[\begin{aligned} \hat{g}_{i} & =-N\left(N-1\right)\left(T_{N}-T_{-i}\right),\\ \hat{g} & =\frac{1}{N}\sum_{n=1}^{N}\hat{g}_{n}\\ & =-N\left(N-1\right)\left(T_{N}-\frac{1}{N}\sum_{n=1}^{N}T_{-n}\right)\\ & =N\hat{B}.\end{aligned}\]

Extrapolating to \(\tau=0\) gives

\[\begin{aligned} T_{\infty} & \approx T_{N}+\hat{g}\left(0-\frac{1}{N}\right)\\ & =T_{N}-\frac{\hat{g}}{N}\\ & =T_{N}-\hat{B},\end{aligned}\]

recovering the bias-corrected estimate from the previous section.
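The extrapolation view is easy to check numerically as well. The following sketch (same caveats as before: NumPy and simulated data are my own illustration choices) computes the finite-difference slope \(\hat{g}\) in \(\tau\) and extrapolates to \(\tau=0\), recovering \(T_{N}-\hat{B}\) exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(0.0, 1.0, size=N)

def T(sample):
    return sample.mean() ** 2

T_N = T(x)
T_loo = np.array([T(np.delete(x, i)) for i in range(N)])

# Finite-difference slope in tau = 1/N, averaged over the leave-one-out points.
tau_N, tau_loo = 1.0 / N, 1.0 / (N - 1)
g_hat = np.mean((T_N - T_loo) / (tau_N - tau_loo))

# Extrapolate from tau = 1/N down to tau = 0.
T_extrapolated = T_N + g_hat * (0.0 - tau_N)

# Compare with the jackknife bias-corrected estimate T_N - B_hat.
B_hat = -(N - 1) * (T_N - T_loo.mean())
print("extrapolated to tau = 0:", T_extrapolated)
print("jackknife-corrected:    ", T_N - B_hat)  # identical
```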

A von Mises expansion.

Let us write the statistic as a functional of the data distribution as follows:

\[\begin{aligned} T\left(F\right) & =\left(\int xdF\left(x\right)\right)^{2}.\end{aligned}\]

Define the empirical distribution to be \(F_{N}\) and the true distribution \(F_{\infty}\). Suppose we can Taylor expand the statistic in the space of distribution functions as

\[\begin{aligned} T\left(G\right) & \approx T\left(F_{0}\right)+T_{1}\left(F_{0}\right)\left(G-F_{0}\right)+\frac{1}{2}T_{2}\left(F_{0}\right)\left(G-F_{0}\right)\left(G-F_{0}\right)+ O\left(\left|G-F_{0}\right|^{3}\right), & \textrm{(1)} \end{aligned}\]

where \(T_{1}\left(F_{0}\right)\) is a linear operator on the space of (signed) distribution functions and \(T_{2}\left(F_{0}\right)\) is a similarly defined bilinear operator. The expansion in Eq. 1 is known as a von Mises expansion.

Often these operators can be represented with “influence functions”, i.e., there exist functions \(x\mapsto\psi_{1}\left(F_{0}\right)\left(x\right)\) and \(\left(x_{1},x_{2}\right)\mapsto\psi_{2}\left(F_{0}\right)\left(x_{1},x_{2}\right)\) such that

\[\begin{aligned} T_{1}\left(F_{0}\right)\left(G-F_{0}\right) & =\int\psi_{1}\left(F_{0}\right)\left(x\right)d\left(G-F_{0}\right)\left(x\right)\\ T_{2}\left(F_{0}\right)\left(G-F_{0}\right)\left(G-F_{0}\right) & =\int\int\psi_{2}\left(F_{0}\right)\left(x_{1},x_{2}\right)d\left(G-F_{0}\right)\left(x_{1}\right)d\left(G-F_{0}\right)\left(x_{2}\right).\end{aligned}\]

For instance, the directional derivative of our example is given by

\[\begin{aligned} \left.\frac{dT\left(F+tG\right)}{dt}\right|_{t=0} & =\left.\frac{d}{dt}\right|_{t=0}\left(\int xd\left(F+tG\right)\left(x\right)\right)^{2}\\ & =2\left(\int\tilde{x}dF\left(\tilde{x}\right)\right)\int xdG\left(x\right),\end{aligned}\]

so that

\[\begin{aligned} \psi_{1}\left(F\right)\left(x\right) & =2\left(\int\tilde{x}dF\left(\tilde{x}\right)\right)x.\end{aligned}\]

Similarly,

\[\begin{aligned} \left.\frac{d^{2}T\left(F+tG\right)}{dt^{2}}\right|_{t=0} & =2\int xdG\left(x\right)\int xdG\left(x\right),\end{aligned}\]

so

\[\begin{aligned} \psi_{2}\left(F\right)\left(x_{1},x_{2}\right) & =2x_{1}x_{2}.\end{aligned}\]
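These derivatives are easy to verify with finite differences. The sketch below is my own illustration (the discrete distributions \(F\) and \(G\) are arbitrary choices): it compares numerical first and second derivatives of \(t\mapsto T\left(F+tG\right)\) at \(t=0\) with the values \(2\left(\int x dF\right)\left(\int x dG\right)\) and \(2\left(\int x dG\right)^{2}\) implied by \(\psi_{1}\) and \(\psi_{2}\).

```python
import numpy as np

def first_moment(atoms, weights):
    return np.dot(weights, atoms)

def T_of_measure(atoms, weights):
    """T(M) = (integral of x dM)^2 for a discrete (possibly signed) measure M."""
    return first_moment(atoms, weights) ** 2

# Two arbitrary discrete distributions F and G supported on the same atoms
# (a choice made purely for this illustration).
atoms = np.array([-1.0, 0.5, 2.0])
w_F = np.array([0.2, 0.5, 0.3])
w_G = np.array([0.6, 0.1, 0.3])
mu_F, mu_G = first_moment(atoms, w_F), first_moment(atoms, w_G)

# Central finite differences of t -> T(F + t G) at t = 0.
t = 1e-3
T0 = T_of_measure(atoms, w_F)
T_plus = T_of_measure(atoms, w_F + t * w_G)
T_minus = T_of_measure(atoms, w_F - t * w_G)
d1_numeric = (T_plus - T_minus) / (2 * t)
d2_numeric = (T_plus - 2 * T0 + T_minus) / t ** 2

print("first derivative: ", d1_numeric, "vs 2 mu_F mu_G =", 2 * mu_F * mu_G)
print("second derivative:", d2_numeric, "vs 2 mu_G^2    =", 2 * mu_G ** 2)
```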

Define

\[\begin{aligned} \Delta_{N} & :=F_{N}-F_{\infty}.\end{aligned}\]

Then the Taylor expansion gives an expression for the error \(T\left(F_{N}\right)-T\left(F_{\infty}\right)\), whose expectation is the bias, in terms of the influence functions:

\[\begin{aligned} T\left(F_{N}\right)-T\left(F_{\infty}\right) & =\int\psi_{1}\left(F_{\infty}\right)\left(x\right)d\Delta_{N}\left(x\right)+\frac{1}{2}\int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)+\\ & \quad\quad O\left(\left|\Delta_{N}\right|^{3}\right).\end{aligned}\]

Note that, in general, integrals against \(\Delta_{N}\) take the form

\[\begin{aligned} \int\phi\left(x\right)d\Delta_{N}\left(x\right) & =\int\phi\left(x\right)dF_{N}\left(x\right)-\int\phi\left(x\right)dF_{\infty}\left(x\right)\\ & =\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)-\mathbb{E}\left[\phi\left(x\right)\right].\end{aligned}\]

Consequently, the first term has mean zero, and so contributes nothing to the bias, since

\[\begin{aligned} \mathbb{E}\left[\int\psi_{1}\left(F_{\infty}\right)\left(x\right)d\Delta_{N}\left(x\right)\right] & =\frac{1}{N}\mathbb{E}\left[\sum_{n=1}^{N}\psi_{1}\left(F_{\infty}\right)\left(x_{n}\right)\right]-\mathbb{E}\left[\psi_{1}\left(F_{\infty}\right)\left(x\right)\right]\\ & =0.\end{aligned}\]

The second term is given by

\[\begin{aligned} & \int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)\\ & \quad=\int\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{n}\right)-\mathbb{E}_{x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right)d\Delta_{N}\left(x_{1}\right)\\ & \quad=\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}\psi_{2}\left(F_{\infty}\right)\left(x_{n_{1}},x_{n_{2}}\right)-\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{n},x_{2}\right)\right]\\ & \quad\quad-\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{x_{1}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{n}\right)\right]+\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right].\end{aligned}\]

Here, \(x_{1}\) and \(x_{2}\) denote independent draws from \(F_{\infty}\), independent of the data. Taking expectations over the data, each of the two cross terms has mean \(\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\), as do the terms of the double sum with \(n_{1}\ne n_{2}\); only the \(N\) diagonal terms with \(n_{1}=n_{2}\) behave differently. Collecting terms gives

\[\begin{aligned} & \mathbb{E}\left[\int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)\right]=\nonumber \\ & \quad\frac{1}{N}\left(\mathbb{E}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{1}\right)\right]-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right). & \textrm{(2)} \end{aligned}\]

In general, this is not zero, and so the leading-order bias term of \(\mathbb{E}\left[T\left(F_{N}\right)-T\left(F_{\infty}\right)\right]\) is given by the expectation of the quadratic term.

Note that integrals against \(\Delta_{N}\) are, by the CLT, of order \(1/\sqrt{N}\), so the \(k\)-th term in the von Mises expansion is of order \(N^{-k/2}\). By this argument, the bias of \(T\) is of order \(N^{-1}\) and admits a series expansion in \(N^{-1}\). Indeed, a von Mises expansion is one way you could justify the first perspective: the expectation of the second-order term is precisely the leading-order bias \(a_{1}N^{-1}\).

For our example, we can see that the bias is given by

\[\begin{aligned} & \frac{1}{2}\frac{1}{N}\left(\mathbb{E}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{1}\right)\right]-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right)\\ & \quad=\frac{2}{2N}\left(\mathbb{E}\left[x_{1}^{2}\right]-\mathbb{E}\left[x_{1}\right]^{2}\right)\\ & \quad=\frac{1}{N}\mathrm{Var}\left(x_{1}\right),\end{aligned}\]

exactly as expected. In this case, the second order term is the exact bias because our very simple \(T\) is actually quadratic in the distribution function.

In general, one can estimate the bias by computing a sample version of the second-order term. In our simple example, \(\psi_{2}\left(F\right)\) does not actually depend on \(F\), but in general one would have to replace \(\psi_{2}\left(F_{\infty}\right)\) with \(\psi_{2}\left(F_{N}\right)\) and the population expectations with sample expectations. For our example, letting \(\hat{\mathbb{E}}\) denote sample expectations, this plug-in approach gives

\[\begin{aligned} & \frac{1}{2}\frac{1}{N}\left(\hat{\mathbb{E}}\left[\psi_{2}\left(F_{N}\right)\left(x_{1},x_{1}\right)\right]-\hat{\mathbb{E}}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{N}\right)\left(x_{1},x_{2}\right)\right]\right)\\ & \quad=\frac{1}{N}\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}-\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}x_{n_{1}}x_{n_{2}}\right)\\ & \quad=\frac{1}{N}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned}\]

which is simply the true bias \(\mathrm{Var}\left(x_{1}\right)/N\) with sample moments plugged in for the population moments.
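Numerically, the plug-in estimate is just a couple of lines. The sketch below (NumPy and simulated data are again my own illustration choices) computes \(\frac{1}{2}\frac{1}{N}\left(\hat{\mathbb{E}}\left[\psi_{2}\left(x,x\right)\right]-\hat{\mathbb{E}}\left[\psi_{2}\left(x_{1},x_{2}\right)\right]\right)\) directly and confirms that it equals \(\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right)/N\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(0.0, 1.0, size=N)

# Second influence function of T(F) = (int x dF)^2; it happens not to depend on F.
def psi2(x1, x2):
    return 2.0 * x1 * x2

# Plug-in bias estimate: (1/2) (1/N) (E_hat[psi2(x, x)] - E_hat[psi2(x1, x2)]).
E_diag = np.mean(psi2(x, x))                   # sample average of psi2 on the diagonal
E_off = np.mean(psi2(x[:, None], x[None, :]))  # sample average of psi2 over all pairs
B_plugin = 0.5 * (E_diag - E_off) / N

print("plug-in bias estimate:      ", B_plugin)
print("(sigma2_hat - mu_hat^2) / N:", ((x ** 2).mean() - x.mean() ** 2) / N)
print("true bias Var / N:          ", 1.0 / N)
```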

Note that you might initially expect that, to use the reasoning in Eq. 1, you would need to express your estimator as an explicit function of \(N\), or at least take the \(N\) dependence into account when developing a Taylor series expansion such as Eq. 1. However, the present example shows that this is not so: the empirical distribution depends only implicitly on \(N\). In fact, the asymptotic series in \(1/N\) follows from the stochastic behavior of \(\Delta_{N}\) rather than from any explicit \(N\) dependence in the statistic.

The infinitesimal jackknife.

Rather than use Eq. 1 to estimate the bias directly with the plug-in principle, we might imagine using it to try to approximate the jackknife estimate of the bias. In this section, I show that (a) a second-order infinitesimal jackknife expansion is necessary and (b) that it then gives the same answer as estimating the bias directly from the second-order term of Eq. 1.

Let \(F_{-i}\) denote the empirical distribution with datapoint \(i\) left out, and let \(\Delta_{-i}\) denote \(F_{-i}-F_{N}\). The infinitesimal jackknife estimate of \(T_{-i}\) is given by using Eq. 1 to extrapolate from \(F_{N}\) to \(F_{-i}\):

\[\begin{aligned} T_{IJ}^{\left(1\right)}\left(F_{-i}\right) & :=T\left(F_{N}\right)+T_{1}\left(F_{N}\right)\Delta_{-i}.\end{aligned}\]

This is the classical infinitesimal jackknife, which expands only to first order. The second order IJ is of course

\[\begin{aligned} T_{IJ}^{\left(2\right)}\left(F_{-i}\right) & :=T\left(F_{N}\right)+T_{1}\left(F_{N}\right)\Delta_{-i}+\frac{1}{2}T_{2}\left(F_{N}\right)\Delta_{-i}\Delta_{-i}.\end{aligned}\]

The difference from the previous section is that the base point of the Taylor series is \(F_{N}\) rather than \(F_{\infty}\), and we are extrapolating to approximate the jackknife rather than to estimate the bias directly. A benefit is that all the quantities in the Taylor series can be computed from the data, so no plug-in approximation is necessary. For instance, in our example,

\[\begin{aligned} \psi_{1}\left(F_{\infty}\right)\left(x\right) & =2\left(\int\tilde{x}dF_{\infty}\left(\tilde{x}\right)\right)x,\end{aligned}\]

which contains the unknown true mean \(\mathbb{E}\left[x_{1}\right]\). In contrast,

\[\begin{aligned} \psi_{1}\left(F_{N}\right)\left(x\right) & =2\hat{\mu}x,\end{aligned}\]

which depends only on the observed sample mean.

As before, it is useful to first write out the integral of a generic function of \(x\) against \(\Delta_{-i}\):

\[\begin{aligned} \int\phi\left(x\right)d\Delta_{-i}\left(x\right) & =\int\phi\left(x\right)dF_{-i}\left(x\right)-\int\phi\left(x\right)dF_{N}\left(x\right)\\ & =\frac{1}{N-1}\sum_{n\ne i}\phi\left(x_{n}\right)-\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)\\ & =\left(\frac{1}{N-1}-\frac{1}{N}\right)\sum_{n=1}^{N}\phi\left(x_{n}\right)-\frac{\phi\left(x_{i}\right)}{N-1}\\ & =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)-\phi\left(x_{i}\right)\right).\end{aligned}\]

From this we see that the first-order term is

\[\begin{aligned} T_{1}\left(F_{N}\right)\Delta_{-i} & =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)-\psi_{1}\left(F_{N}\right)\left(x_{i}\right)\right).\end{aligned}\]
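As a small check of this identity (and of the leave-one-out algebra above), here is a sketch that computes \(\int\phi d\Delta_{-i}\) both directly and via the closed form; the test function \(\phi=\sin\) and the data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = rng.normal(size=N)
phi = np.sin  # any test function will do
i = 3         # the left-out datapoint

# Direct computation: integral of phi against F_{-i} minus integral against F_N.
lhs = np.delete(phi(x), i).mean() - phi(x).mean()

# Closed form derived above: (N - 1)^{-1} ((1/N) sum_n phi(x_n) - phi(x_i)).
rhs = (phi(x).mean() - phi(x[i])) / (N - 1)

print(lhs, rhs)  # equal up to floating point
```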

Suppose we tried to use \(T_{IJ}^{\left(1\right)}\left(F_{-i}\right)\) to approximate \(T_{-i}\) in the expression for \(\hat{B}\). We would get

\[\begin{aligned} \hat{B} & =-\left(N-1\right)\left(T\left(F_{N}\right)-\frac{1}{N}\sum_{n=1}^{N}T_{IJ}^{\left(1\right)}\left(F_{-n}\right)\right)\\ & =\left(N-1\right)\left(\frac{1}{N}\sum_{n=1}^{N}T_{1}\left(F_{N}\right)\Delta_{-n}\right)\\ & =\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)-\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)\\ & =0.\end{aligned}\]

In other words, the first-order approximation estimates zero bias. (This is in fact for the same reason that the first-order term of the expansion around \(F_{\infty}\) has mean zero under \(F_{\infty}\).)
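This, too, is easy to see numerically. The sketch below (same illustrative setup as before) builds the first-order infinitesimal jackknife approximations \(T_{IJ}^{\left(1\right)}\left(F_{-i}\right)\) using \(\psi_{1}\left(F_{N}\right)\left(x\right)=2\hat{\mu}x\) and confirms that the resulting bias estimate is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(0.0, 1.0, size=N)
mu_hat = x.mean()

def T(sample):
    return sample.mean() ** 2

# psi_1(F_N)(x) = 2 mu_hat x, evaluated at each datapoint.
psi1 = 2.0 * mu_hat * x

# T_1(F_N) Delta_{-i} = (N - 1)^{-1} (mean_n psi1(x_n) - psi1(x_i)) for each i.
T1_delta = (psi1.mean() - psi1) / (N - 1)

# First-order infinitesimal jackknife approximations to the leave-one-out values.
T_ij1 = T(x) + T1_delta

# The jackknife bias formula applied to these approximations is exactly zero.
B_hat_ij1 = -(N - 1) * (T(x) - T_ij1.mean())
print("first-order IJ bias estimate:", B_hat_ij1)  # zero up to floating point
```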

The second order term is given by

\[\begin{aligned} T_{2}\left(F_{N}\right)\Delta_{-i}\Delta_{-i} & =\left(N-1\right)^{-1}\int\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(\tilde{x},x_{n}\right)-\psi_{2}\left(F_{N}\right)\left(\tilde{x},x_{i}\right)\right)d\Delta_{-i}\left(\tilde{x}\right)\\ & =\left(N-1\right)^{-2}\Bigg(\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n_{1}},x_{n_{2}}\right)-\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{i},x_{n}\right)\\ & \quad-\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n},x_{i}\right)+\psi_{2}\left(F_{N}\right)\left(x_{i},x_{i}\right)\Bigg).\end{aligned}\]

As before, using \(T_{IJ}^{\left(2\right)}\left(F_{-i}\right)\) to approximate \(T_{-i}\) then gives

\[\begin{aligned} \hat{B} & =-\left(N-1\right)\left(T\left(F_{N}\right)-\frac{1}{N}\sum_{n=1}^{N}T_{IJ}^{\left(2\right)}\left(F_{-n}\right)\right)\\ & =\left(N-1\right)\left(\frac{1}{2N}\sum_{n=1}^{N}T_{2}\left(F_{N}\right)\Delta_{-n}\Delta_{-n}\right),\end{aligned}\]

where we have used the previous result that the first-order terms average to zero. Plugging in, we see that

\[\begin{aligned} \hat{B} & =\frac{1}{2}\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n},x_{n}\right)-\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n_{1}},x_{n_{2}}\right)\right),\end{aligned}\]

which is precisely a sample analogue of the population bias, i.e., one half of the expression in Eq. 2, except with \(N-1\) in place of \(N\). Of course, in our specific example, this gives

\[\begin{aligned} \hat{B} & =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}-\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}x_{n_{1}}x_{n_{2}}\right)\\ & =\frac{1}{N-1}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned}\]

which matches the exact jackknife’s factor of \(\left(N-1\right)^{-1}\), in contrast to our direct sample estimate of the bias term, which had a factor of \(N^{-1}\).
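To close the loop, here is a sketch (same illustrative setup) that builds the second-order infinitesimal jackknife approximations using \(\psi_{1}\left(F_{N}\right)\left(x\right)=2\hat{\mu}x\) and \(\psi_{2}\left(F_{N}\right)\left(x_{1},x_{2}\right)=2x_{1}x_{2}\); plugging these into the expressions above, \(T_{2}\left(F_{N}\right)\Delta_{-i}\Delta_{-i}\) reduces to \(2\left(x_{i}-\hat{\mu}\right)^{2}/\left(N-1\right)^{2}\). Because this particular \(T\) is exactly quadratic in the distribution function, the second-order expansion reproduces each \(T_{-i}\) exactly, so the infinitesimal jackknife and the exact jackknife agree to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(0.0, 1.0, size=N)
mu_hat = x.mean()

def T(sample):
    return sample.mean() ** 2

T_N = T(x)

# Exact leave-one-out values and exact jackknife bias estimate.
T_loo = np.array([T(np.delete(x, i)) for i in range(N)])
B_exact = -(N - 1) * (T_N - T_loo.mean())

# Second-order infinitesimal jackknife, specialized to this example.
psi1 = 2.0 * mu_hat * x                             # psi_1(F_N)(x_i) = 2 mu_hat x_i
T1_delta = (psi1.mean() - psi1) / (N - 1)           # T_1(F_N) Delta_{-i}
T2_delta2 = 2.0 * (x - mu_hat) ** 2 / (N - 1) ** 2  # T_2(F_N) Delta_{-i} Delta_{-i}
T_ij2 = T_N + T1_delta + 0.5 * T2_delta2

B_ij2 = -(N - 1) * (T_N - T_ij2.mean())
sigma2_hat = (x ** 2).mean()

print("max |T_ij2 - exact T_loo|:", np.abs(T_ij2 - T_loo).max())  # ~ 0
print("exact jackknife B_hat:    ", B_exact)
print("second-order IJ B_hat:    ", B_ij2)
print("(sigma2_hat - mu_hat^2) / (N - 1):", (sigma2_hat - mu_hat ** 2) / (N - 1))
```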