Why does the jackknife method work?
In the Advanced Mathematical Statistics course in Spring AY21/22, we studied two popular model-free computational methods for estimation: the jackknife and the bootstrap. Given $n$ observations $X_1,\cdots, X_n$ and a statistic $T_n$ used to estimate an unknown parameter $\theta$ (e.g. a mean or a variance), the jackknife bias estimator is defined as $$ b_{JACK} = (n - 1)(\bar{T}_{n-1} - T_n) $$ where $\bar{T}_{n-1} = n^{-1}\sum_{i=1}^n T_{n-1,i} = n^{-1}\sum_{i=1}^n T_{n-1}(X_1,\cdots,X_{i-1},X_{i+1},\cdots,X_n)$ is the average of the leave-one-out estimators, each computed from $n-1$ of the $n$ observations. This result is beautiful: we need no additional information, even when the true bias of the estimator involves the unknown parameter $\theta$. But one question remains: is this simple bias estimator accurate?
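To fix ideas, here is a minimal sketch of this computation in Python (the helper `jackknife_bias` and the example statistic are my own illustration, not code from any particular library):

```python
import numpy as np

def jackknife_bias(x, statistic):
    """Jackknife bias estimate b_JACK = (n - 1) * (mean of leave-one-out values - T_n)."""
    x = np.asarray(x)
    n = len(x)
    t_full = statistic(x)                                  # T_n on the full sample
    # T_{n-1,i}: the same statistic recomputed with the i-th observation deleted
    t_loo = np.array([statistic(np.delete(x, i)) for i in range(n)])
    return (n - 1) * (t_loo.mean() - t_full)

# Example: the plug-in variance estimator (denominator n) has true bias -sigma^2/n
rng = np.random.default_rng(0)
x = rng.normal(size=50)
print(jackknife_bias(x, lambda s: s.var()))                # should be close to -1/50
```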
This formula was first proposed by Quenouille in 1949, and the motivation is the following: suppose the bias $b$ admits the expansion[1] (why should such an expansion hold?) $$ b = \frac{a_1}{n} + \frac{a_2}{n^2} + \cdots $$ Then one can verify that subtracting $b_{JACK}$ from the original $T_n$ reduces the bias from $O(n^{-1})$ to $O(n^{-2})$. In fact, consider the case where $T_n=g(\bar{X}_n)$ is used to estimate $\theta=g(\mu)$ with $g$ sufficiently smooth[2] (e.g. a method-of-moments estimator). Taylor expansion gives $$ T_n - \theta = g^{\prime}(\mu) (\bar{X}_n -\mu) + \frac{1}{2}(\bar{X}_n -\mu)^Tg^{\prime\prime}(\mu)(\bar{X}_n -\mu) + R_n $$ where $R_n=O_{p}(n^{-2})$[3] and $$ E[g^{\prime}(\mu) (\bar{X}_n -\mu)] = 0 $$ $$ E[(\bar{X}_n -\mu)^Tg^{\prime\prime}(\mu)(\bar{X}_n -\mu)] = O(n^{-1}) $$ Thus the expansion above is indeed valid in this case, with leading coefficient $a_1 = \frac{1}{2}\mathrm{tr}\!\left(g^{\prime\prime}(\mu)\,\mathrm{Var}(X_1)\right)$.
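As a quick check of the order-reduction claim (assuming each leave-one-out estimator $T_{n-1,i}$ obeys the same expansion with $n$ replaced by $n-1$), take expectations in the definition of $b_{JACK}$: $$ E[b_{JACK}] = (n-1)\left[\frac{a_1}{n-1} + \frac{a_2}{(n-1)^2} - \frac{a_1}{n} - \frac{a_2}{n^2}\right] + \cdots = \frac{a_1}{n} + \frac{a_2(2n-1)}{n^2(n-1)} + \cdots $$ so that $$ E[T_n - b_{JACK}] - \theta = \frac{a_2}{n^2} - \frac{a_2(2n-1)}{n^2(n-1)} + \cdots = -\frac{a_2}{n(n-1)} + \cdots = O(n^{-2}) $$ i.e. the leading $a_1/n$ term is removed exactly and only $O(n^{-2})$ terms remain.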
Let us now compute $b_{JACK}$ in this case. Again by Taylor expansion (with $\xi_i$ lying between $\bar{X}_{n-1,i}$ and $\bar{X}_n$), we have $$ T_{n-1, i} - T_n = g^{\prime}(\bar{X}_n)(\bar{X}_{n-1,i} - \bar{X}_n) + \frac{1}{2}(\bar{X}_{n-1,i} - \bar{X}_n)^Tg^{\prime\prime}(\xi_i)(\bar{X}_{n-1,i} - \bar{X}_n) $$
Then, since the first-order terms cancel ($\sum_{i=1}^n (\bar{X}_{n-1,i} - \bar{X}_n) = 0$) and $\bar{X}_{n-1,i} - \bar{X}_n = (\bar{X}_n - X_i)/(n-1)$, $$ b_{JACK} = (n - 1)(\bar{T}_{n-1} - T_n) = \frac{n-1}{2n}\sum_{i=1}^n (\bar{X}_{n-1,i} - \bar{X}_n)^Tg^{\prime\prime}(\xi_i)(\bar{X}_{n-1,i} - \bar{X}_n) $$ $$ = \frac{1}{2n(n-1)}\sum_{i=1}^n (X_i - \bar{X}_n)^Tg^{\prime\prime}(\xi_i)(X_i - \bar{X}_n) $$
where $\bar{X}_{n-1,i}$ is the mean of the $n-1$ observations with the $i$-th one removed. Since $\xi_i \rightarrow \mu$ as $n \rightarrow \infty$, one can show that $b_{JACK}$ is asymptotically equivalent to $\frac{1}{2}E[(\bar{X}_n -\mu)^Tg^{\prime\prime}(\mu)(\bar{X}_n -\mu)]$, which is exactly the leading $O(n^{-1})$ term of the true bias (recall that $E[g^{\prime}(\mu) (\bar{X}_n -\mu)] = 0$, so the bias starts at this quadratic term). In other words, the jackknife bias estimator is a consistent estimator of the first-order term of the true bias, and subtracting it leaves a bias of order $O(n^{-2})$.
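A small simulation makes this concrete; the setup below (normal data with $g=\exp$, so that the leading bias is $\frac{1}{2}g^{\prime\prime}(\mu)\sigma^2/n = e^{\mu}\sigma^2/(2n)$) is my own illustrative choice, and the three printed numbers should agree up to simulation noise:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, n_rep = 0.5, 1.0, 100, 20000
theta = np.exp(mu)                                   # target g(mu) with g = exp

x = rng.normal(mu, sigma, size=(n_rep, n))
xbar = x.mean(axis=1, keepdims=True)                 # one sample mean per replicate
loo_means = (n * xbar - x) / (n - 1)                 # leave-one-out means, row-wise
t_full = np.exp(xbar[:, 0])                          # T_n = g(X_bar)
b_jack = (n - 1) * (np.exp(loo_means).mean(axis=1) - t_full)

print("average b_JACK    :", b_jack.mean())
print("Monte Carlo bias  :", (t_full - theta).mean())
print("leading bias term :", theta * sigma**2 / (2 * n))
```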
From the discussion above, jackknife bias estimation works well, as $n\rightarrow \infty$, for a large class of estimators, namely those of the form $g(\bar{X}_n)$. In fact, for a more general class of statistics called functional statistics[4] (e.g. quantiles), a similar consistency result holds for the jackknife bias estimator.
The jackknife is equally popular for variance estimation, which was first proposed by Tukey in 1958. In fact, the name “jackknife” is due to Tukey: like a physical jack-knife (a compact folding knife), it is a rough-and-ready tool that can improvise a solution to a variety of problems, even though specific problems may be solved more efficiently with a purpose-designed tool. Performing jackknife bias reduction, i.e. subtracting the bias estimate from the original estimator $T_n$, gives $$ T_{JACK} = T_n - (n-1)(\bar{T}_{n-1} - T_n) = \frac{1}{n}\sum_{i=1}^n [nT_n - (n-1)T_{n-1,i}] $$ This form is informative: if we set $\tilde{T}_{n,i} = nT_n - (n-1)T_{n-1,i}$, called the jackknife pseudovalues, then $T_{JACK}$ is simply the mean of the new statistics $\tilde{T}_{n,i}$. Tukey conjectured that the $\tilde{T}_{n,i}$ can be treated as though they were i.i.d. This conjecture was verified by Thorburn in 1977[5], who proved that if the pseudovalues converge in mean square to some random variables, then these random variables are independent. Note that if we view $T_{JACK}\approx T_n$, then $var(T_n)$, if it exists, should be approximately equal to $var(\tilde{T}_{n,i})/n$. This was Tukey's second conjecture, and with it Tukey defined the jackknife variance estimator $$ v_{JACK} = \frac{1}{n(n-1)}\sum_{i=1}^n \Big(\tilde{T}_{n,i} - \frac{1}{n}\sum_{j=1}^n \tilde{T}_{n,j}\Big)^2 $$
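Here is a minimal sketch of this construction in Python (the helper name `jackknife_variance` is mine; the sample standard deviation is just a convenient example statistic whose variance is awkward to derive by hand):

```python
import numpy as np

def jackknife_variance(x, statistic):
    """Tukey's jackknife variance estimate, computed via pseudovalues."""
    x = np.asarray(x)
    n = len(x)
    t_full = statistic(x)
    t_loo = np.array([statistic(np.delete(x, i)) for i in range(n)])
    pseudo = n * t_full - (n - 1) * t_loo          # pseudovalues T~_{n,i}
    return pseudo.var(ddof=1) / n                  # sample variance of pseudovalues / n

# Example: estimated variance of the sample standard deviation
rng = np.random.default_rng(2)
x = rng.normal(size=200)
print(jackknife_variance(x, np.std))               # roughly 1/(2*200) = 0.0025 for normal data
```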
Now let us consider some simple examples. If $T_n = \bar{X}_n$, then $T_n$ is an unbiased estimator of $\mu$, and $$ \bar{T}_{n-1} = \frac{1}{n(n-1)}\sum_{i=1}^n (nT_n - X_i) = T_n $$ Thus $b_{JACK} = 0$. Furthermore, $v_{JACK}$ is the usual bias-corrected variance estimator for $\bar{X}_n$. In this case, the jackknife and the traditional approach give exactly the same estimators.
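The variance claim is easy to check directly: for $T_n = \bar{X}_n$ the pseudovalues reduce to the observations themselves, $$ \tilde{T}_{n,i} = n\bar{X}_n - (n-1)\bar{X}_{n-1,i} = n\bar{X}_n - (n\bar{X}_n - X_i) = X_i, $$ so that $$ v_{JACK} = \frac{1}{n(n-1)}\sum_{i=1}^n (X_i - \bar{X}_n)^2 = \frac{\hat{\sigma}^2}{n}, $$ where $\hat{\sigma}^2$ is the unbiased sample variance.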
Now consider a more complicated example[2], where $T_n = \bar{X}_n^2$ is used to estimate $\mu^2$. It can be shown that $E[T_n] = \mu^2 + \sigma^2/n$, so bias$(T_n) = \sigma^2/n$. We expect jackknife bias reduction to be powerful here, since bias$(T_n)$ has exactly the form of the expansion above (with $a_1 = \sigma^2$ and all higher-order terms zero) and $T_n$ is a method-of-moments estimator. In fact, the jackknife bias estimate coincides with the plug-in estimate, i.e. $\sigma^2/n$ with $\sigma^2$ replaced by its sample version. A straightforward calculation shows that $$ var(T_n) = \frac{4\mu^2\sigma^2}{n} + \frac{4\mu\alpha_3 + 2\sigma^4}{n^2} + \frac{\alpha_4 - 3\sigma^4}{n^3} $$ where $\alpha_k$ denotes the $k$-th central moment of $X_1$. The jackknife variance estimator is $$ v_{JACK} = \frac{4\bar{X}_n^2\hat{\sigma}^2}{n} - \frac{4\bar{X}_n\hat{\alpha}_3}{n(n-1)} + \frac{\hat{\alpha}_4}{n(n-1)^2} - \frac{\hat{\sigma}^4}{n^2(n-1)} $$ where $\hat{\alpha}_k = (n-1)^{-1}\sum_{i=1}^n (X_i - \bar{X}_n)^k$ is the $k$-th sample central moment with the bias-corrected denominator $n-1$. The two expressions share the same leading term $4\mu^2\sigma^2/n$, while the exact variance is complicated to derive. This case shows the power and value of the jackknife: we do not need to work out a complicated explicit variance formula; we just need to compute $v_{JACK}$ from the pseudovalues. For statistics like sample quantiles, the $\alpha$-trimmed mean, and more complicated ones like U-statistics and V-statistics in robust statistics, we can simply use the jackknife to estimate their bias and variance.
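As a sanity check of this example, a short simulation (my own, using exponential data so that $\alpha_3 \neq 0$) compares the average of $v_{JACK}$ with the Monte Carlo variance of $T_n = \bar{X}_n^2$ and with the leading term $4\mu^2\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_rep = 50, 20000
mu = sigma2 = 1.0                                   # Exp(1): mean 1, variance 1

x = rng.exponential(scale=1.0, size=(n_rep, n))
xbar = x.mean(axis=1, keepdims=True)
t_n = xbar[:, 0] ** 2                               # T_n = X_bar^2 for each replicate
loo_means = (n * xbar - x) / (n - 1)                # leave-one-out means
pseudo = n * t_n[:, None] - (n - 1) * loo_means**2  # pseudovalues
v_jack = pseudo.var(axis=1, ddof=1) / n

print("average v_JACK          :", v_jack.mean())
print("Monte Carlo var(T_n)    :", t_n.var(ddof=1))
print("leading term 4mu^2s^2/n :", 4 * mu**2 * sigma2 / n)
```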
Now, as with the bias estimator, we need to ask whether the jackknife variance estimator is correct. One can show that (A) for certain classes of statistics, the jackknife variance estimator is consistent; and (B) the jackknife variance estimator is almost unbiased or positively biased. Consider property (A) first. Let $T_n = g(\bar{X}_n)$ where $g$ is sufficiently smooth (in fact, continuous differentiability suffices). If $\nabla g (\mu) \neq 0$, then $$ \frac{v_{JACK}}{\sigma_n^2} \rightarrow 1\ \ \ a.s. $$ where $\sigma_n^2$ is the asymptotic variance of $T_n$ (its explicit form is given by the delta method). This indicates that, after some simple calculations, $v_{JACK}$ can be viewed as a variance estimator produced by the delta method. In fact, the proof of this result[2] mainly approximates $T_n$ by a linear statistic (which is why we assume $g$ is sufficiently smooth with $\nabla g (\mu) \neq 0$), the same idea that underlies the delta method.
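Numerically, the consistency statement can be eyeballed by comparing $v_{JACK}$ with the plug-in delta-method estimate $[g^{\prime}(\bar{X}_n)]^2\hat{\sigma}^2/n$; the sketch below (with $g=\exp$ and normal data, my own choice of setup) shows the ratio approaching 1:

```python
import numpy as np

rng = np.random.default_rng(4)
for n in (20, 100, 500, 2000, 10000):
    x = rng.normal(loc=0.5, scale=1.0, size=n)
    xbar = x.mean()
    loo_means = (n * xbar - x) / (n - 1)                      # leave-one-out means
    pseudo = n * np.exp(xbar) - (n - 1) * np.exp(loo_means)   # pseudovalues for g = exp
    v_jack = pseudo.var(ddof=1) / n
    v_delta = np.exp(xbar) ** 2 * x.var(ddof=1) / n           # [g'(xbar)]^2 * sigma_hat^2 / n
    print(n, v_jack / v_delta)                                # ratio tends to 1 as n grows
```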
Now let us consider property (B). The relevant result was first obtained by Efron and Stein in 1981[6]. The main tool in the proof is the ANOVA decomposition of an estimator. Assume that $T_n$, an estimator of $\theta$, satisfies $ET_n^2 < \infty$. Then we have
$$
T_n = \alpha^{(0)} + \frac{1}{n}\sum_{i} \alpha_i^{(1)} + \frac{1}{n^2} \sum_{i_1 < i_2} \alpha_{i_1i_2}^{(2)} + \cdots + \frac{1}{n^n} \alpha_{12\cdots n}^{(n)}
$$
where
$$\alpha^{(0)} = ET_n $$
$$\alpha_{i}^{(1)} = n[E(T_n|X_i) - ET_n] $$
$$\alpha_{i_1i_2}^{(2)} = n^2[E(T_n|X_{i_1}, X_{i_2}) - E(T_n|X_{i_1}) - E(T_n|X_{i_2}) + ET_n] $$
$$\cdots $$
$$\alpha_{i_1i_2\cdots i_k}^{(k)} = n^k\Big[E(T_n|X_{i_1}, X_{i_2}, \cdots, X_{i_k}) - \sum_{s=1}^k E\big(T_n|(X_{i_j})_{j\neq s}\big) + \sum_{1\le s<t\le k} E\big(T_n|(X_{i_j})_{j\neq s,t}\big) - \cdots + (-1)^k ET_n\Big] $$
and all of these terms have zero expectation and are mutually uncorrelated. This decomposition is powerful because it does not assume independence of the original observations. Intuitively, it decomposes an estimator into the main effects, pairwise interaction effects, and higher-order interaction effects of the observations. One just needs to plug the definitions of the $\alpha^{(k)}$'s into the decomposition to verify it. Notice that if we now assume the observations are i.i.d., we have
$$
var(T_{n-1}) = \frac{1}{n-1}\sigma^2_1 + {n-1 \choose 2}\frac{1}{(n-1)^4}\sigma^2_2 + \cdots + {n-1 \choose k}\frac{1}{(n-1)^{2k}}\sigma^2_k + \cdots + \frac{1}{(n-1)^{2(n-1)}}\sigma^2_{n-1}
$$
where $\sigma_k^2 = var\big(\alpha^{(k)}_{i_1\cdots i_k}\big)$, the $\alpha$'s now coming from the analogous decomposition of $T_{n-1}$ as a function of $n-1$ observations (the variance is the same for every index set when $T_{n-1}$ is symmetric in its arguments). Computing the expectation of $\sum_{i=1}^n (T_{n-1,i} - \bar{T}_{n-1})^2$ via the ANOVA decomposition and comparing it with the expression for $var(T_{n-1})$ above, one can show that
$$
E\Big[\frac{n}{n-1}\, v_{JACK}\Big] = E\Big[\sum_{i=1}^n \big(T_{n-1,i} - \bar{T}_{n-1}\big)^2\Big] \geq var(T_{n-1})
$$
This shows that, up to the factor $n/(n-1)\approx 1$, the “bias” of the jackknife variance estimator is nonnegative (note that the right-hand side of the inequality involves $T_{n-1}$, not $T_{n}$). For statistics like U-statistics, von Mises series, etc., one can replace $var(T_{n-1})$ with $var(T_{n})$.
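A quick Monte Carlo check of this inequality (my own sketch; I deliberately take $T_n=\bar{X}_n^2$ with $\mu=0$, i.e. the degenerate case excluded by the consistency result above, so that the upward bias is large enough to be visible through the simulation noise):

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_rep = 20, 50000

# Left-hand side: E[ sum_i (T_{n-1,i} - mean_i T_{n-1,i})^2 ], estimated from n observations
x = rng.normal(size=(n_rep, n))                                  # mu = 0: T_n = X_bar^2 is degenerate
loo = ((n * x.mean(axis=1, keepdims=True) - x) / (n - 1)) ** 2   # T_{n-1,i}
lhs = ((loo - loo.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# Right-hand side: var(T_{n-1}), with T_{n-1} computed on independent samples of size n-1
t_m = rng.normal(size=(n_rep, n - 1)).mean(axis=1) ** 2

print("E[(n/(n-1)) v_JACK] ~", lhs.mean())                       # noticeably larger here
print("var(T_{n-1})        ~", t_m.var(ddof=1))
```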
So far, we have discussed the consistency of the jackknife method and the bias of the jackknife variance estimator. These properties make the jackknife a reliable general-purpose method. In fact, the discussion can be extended to functional statistics and to the delete-$d$ jackknife (which, like the bootstrap, can be used to estimate the sampling distribution of $T_n$, but with less computational cost). One can refer to *The Jackknife and Bootstrap* by Shao and Tu[2] for details.
[1] Efron, B., 1982. The Jackknife, the Bootstrap and Other Resampling Plans.
[2] Shao, J. and Tu, D., 1995. The Jackknife and Bootstrap.
[3] $O_p(n^{-k})$ denotes a quantity that, after multiplication by $n^{k}$, is bounded in probability.
[4] If the true parameter has the form $T(F)$, where $F$ is the population distribution, an estimator can be obtained by the plug-in method, i.e. by replacing $F$ with the empirical distribution $F_n$. Such statistics are called functional statistics.
[5] Thorburn, D. E., 1977. On the asymptotic normality of the jackknife. Scand. J. Statist., 4, 113-118.
[6] Efron, B. and Stein, C., 1981. The Jackknife Estimate of Variance. The Annals of Statistics, 9(3).