Jekyll2022-05-31T17:57:08+00:00/feed.xmlRyan Giordano, statistician.This is the professional webpage and open research journal of Ryan Giordano.R torch for statistics (not just machine learning).2022-04-01T16:00:00+00:002022-04-01T16:00:00+00:00/code/2022/04/01/rtorch_example<p>The <code class="language-plaintext highlighter-rouge">torch</code> package for R (<a href="https://torch.mlverse.org/">found here</a>) is CRAN-installable and provides automatic differentiation in R, as long as you’re willing to rewrite your code using Torch functions.</p> <p>The current docs for the <code class="language-plaintext highlighter-rouge">torch</code> package are great, but assume you’re interested in machine learning. But gradients are useful for ordinary statistics, too! In the notebook below I fit a simple Poisson regression model using <code class="language-plaintext highlighter-rouge">optim</code> by implementing the log likelihood and derivatives in torch. Though not really competitive with the (highly optimized) <code class="language-plaintext highlighter-rouge">stats::glm</code> on this toy example, my point is more about how easily you can roll your own MLE in R using <code class="language-plaintext highlighter-rouge">torch</code>.</p> <p>The notebook itself <a href="2022-04-01_poisson_regression_torch_for_r.ipynb">can be downloaded here</a>, and a markdown version follows.</p> <hr /> <h1 id="example-of-torch-for-classical-stats-poisson-regression">Example of <code class="language-plaintext highlighter-rouge">torch</code> for classical stats (Poisson regression)</h1> <p>In this notebook, I’ll show how easy it is to use <code class="language-plaintext highlighter-rouge">torch</code> for R to optimize loss functions and compute standard error estimates.</p> <p>The <a href="https://torch.mlverse.org/">torch for R website</a> is mostly focused on machine learning applications. 
The purpose of this notebook is just to show how easy it is to use <code class="language-plaintext highlighter-rouge">torch</code> to get gradients and Hessians for your own purposes, including vanilla classical statistics.</p> <p>I’ll use <code class="language-plaintext highlighter-rouge">torch</code> to implement and optimize a Poisson regression loss function and compute standard errors using Fisher information. This is just a toy problem, but by simply dropping the loss into an out-of-the-box optimizer, we get essentially the same answer as the (highly optimized) <code class="language-plaintext highlighter-rouge">stats::glm</code> routine in a similar amount of time.</p> <h1 id="installation">Installation</h1> <p>One of the big benefits of <code class="language-plaintext highlighter-rouge">torch</code> is that it can be installed via CRAN, and so can be easily packaged with your own R packages without the user having to do a bunch of extra Python nonsense. Installation instructions can be found <a href="https://torch.mlverse.org/docs/">here</a>.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">lme4</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">torch</span><span class="p">)</span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">44</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Let’s generate some data. 
The model will be a simple Poisson regression:</p> $p(y_n | x_n) = \mathrm{Poisson}(\exp(x_n^T \beta))$ <p>The goal will be to estimate $$\beta$$, and standard errors, using maximum likelihood and the inverse Fisher information.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_obs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1000</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n_obs</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">n_obs</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="n">beta_true</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1.8</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="n">lambda_true</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta_true</span><span class="p">)</span><span class="w"> 
</span><span class="n">summary</span><span class="p">(</span><span class="n">lambda_true</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rpois</span><span class="p">(</span><span class="n">n_obs</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_true</span><span class="p">)</span><span class="w"> </span><span class="n">summary</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> V1 Min. : 1.021 1st Qu.: 2.570 Median : 3.998 Mean : 4.742 3rd Qu.: 6.349 Max. :15.447 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.000 2.000 4.000 4.664 6.000 25.000 </code></pre></div></div> <p>Let’s define the log likelihood in <code class="language-plaintext highlighter-rouge">torch</code>. Then we can use <code class="language-plaintext highlighter-rouge">torch</code> to evaluate gradients of the log likelihood for optimization, and the Hessian for standard errors.</p> <p>There are two important things to know:</p> <ul> <li>Torch does not operate on R numeric types. It operates on torch tensors, which can be created with <code class="language-plaintext highlighter-rouge">torch_tensor()</code>.</li> <li>Torch uses only its own functions — not base R! You can typically find the things you need by browsing through the <a href="https://torch.mlverse.org/docs/reference/index.html">reference material</a>.</li> </ul> <p>I’ll keep torch versions of the data around in a list <code class="language-plaintext highlighter-rouge">tvars</code> for easy re-use. 
And I’ll write a function <code class="language-plaintext highlighter-rouge">EvalLogLikTorch</code>, which takes a torch tensor <code class="language-plaintext highlighter-rouge">beta</code> and the data, and returns the log likelihood, again as a torch tensor.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tvars</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w"> </span><span class="n">tvars</span><span class="o">$</span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">tvars</span><span class="o">$</span><span class="n">y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">is</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="s2">"torch_tensor"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">stop</span><span class="p">(</span><span class="s2">"beta must be a torch tensor"</span><span class="p">)</span><span class="w"> </span><span 
class="p">}</span><span class="w"> </span><span class="n">log_lambda</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">torch_matmul</span><span class="p">(</span><span class="n">tvars</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="n">lp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">torch_sum</span><span class="p">(</span><span class="n">tvars</span><span class="o">$</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">log_lambda</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">torch_exp</span><span class="p">(</span><span class="n">log_lambda</span><span class="p">))</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="n">lp</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1"># Sanity check that it works</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta_true</span><span class="p">),</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torch_tensor 1.72212e+06 [ CPUFloatType{} ] </code></pre></div></div> <p>We want to pass the (negative) log likelihood to an R routine as a function to be optimized. 
So we need to write a wrapper that takes an <code class="language-plaintext highlighter-rouge">R</code> numeric type, converts it to a torch tensor, calls <code class="language-plaintext highlighter-rouge">EvalLogLikTorch</code>, and converts the result back to an R numeric type.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># From now on we'll take tvars to be a global variable to save</span><span class="w"> </span><span class="c1"># writing everything as lambda functions.</span><span class="w"> </span><span class="n">EvalLogLik</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">log_lik</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">verbose</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">cat</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">collapse</span><span class="o">=</span><span class="s2">", "</span><span class="p">),</span><span 
class="w"> </span><span class="s2">": "</span><span class="p">,</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="n">log_lik</span><span class="p">,</span><span class="w"> </span><span class="n">digits</span><span class="o">=</span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="n">log_lik</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1"># Just check that this runs</span><span class="w"> </span><span class="n">EvalLogLik</span><span class="p">(</span><span class="n">beta_true</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1722115.375 </code></pre></div></div> <p>Now for some magic — a function that returns the gradient of <code class="language-plaintext highlighter-rouge">EvalLogLik</code> with respect to beta. This is what we’re using <code class="language-plaintext highlighter-rouge">torch</code> for. As before, we want something that we can pass to our optimizer, so that it takes an R numeric value for $$\beta$$ as input and returns $$\partial \log p(y \vert x, \beta) / \partial \beta$$ as R numeric output.</p> <p>Unlike before, we call <code class="language-plaintext highlighter-rouge">torch_tensor</code> with the extra argument <code class="language-plaintext highlighter-rouge">requires_grad=TRUE</code>. 
That tells <code class="language-plaintext highlighter-rouge">torch</code> that we will later want to compute a gradient with respect to this parameter.</p> <p>We compute the objective <code class="language-plaintext highlighter-rouge">loss</code> (here, just the log likelihood, which we will maximize) as we would normally.</p> <p>We then call <code class="language-plaintext highlighter-rouge">autograd_grad</code>, which returns the gradient of the first argument’s tensor with respect to the second argument’s tensor using all the computations that have been performed since the tensors were defined. The <code class="language-plaintext highlighter-rouge">autograd_grad</code> function returns a list (you can take gradients with respect to multiple inputs), so we just pull out the first element of the list.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">EvalLogLikGrad</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">beta_ad</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">requires_grad</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="n">loss</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w"> </span><span class="n">grad</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span 
class="n">autograd_grad</span><span class="p">(</span><span class="n">loss</span><span class="p">,</span><span class="w"> </span><span class="n">beta_ad</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">grad</span><span class="p">))</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1"># Just check that this runs and has the correct dimensions</span><span class="w"> </span><span class="n">EvalLogLikGrad</span><span class="p">(</span><span class="n">beta_true</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> -440151.25 -691394 </code></pre></div></div> <p>Now we can pass the loss and gradient to a nonlinear optimizer. Sure enough, we get a reasonable estimate, with a somewhat small loss gradient.</p> <p>(This gradient would ideally be smaller, but the out-of-the-box BFGS isn’t a very good optimization algorithm.)</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optim_time</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w"> </span><span class="n">opt_result</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">optim</span><span class="p">(</span><span class="w"> </span><span class="n">fn</span><span class="o">=</span><span class="n">EvalLogLik</span><span class="p">,</span><span class="w"> </span><span class="n">gr</span><span class="o">=</span><span class="n">EvalLogLikGrad</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="o">=</span><span 
class="s2">"BFGS"</span><span class="p">,</span><span class="w"> </span><span class="n">par</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">control</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">fnscale</span><span class="o">=</span><span class="m">-1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">n_obs</span><span class="p">))</span><span class="w"> </span><span class="n">optim_time</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">optim_time</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="s2">"Estimate"</span><span class="o">=</span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="s2">"Truth"</span><span class="o">=</span><span class="n">beta_true</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">print</span><span class="p">()</span><span class="w"> </span><span class="n">cat</span><span class="p">(</span><span class="s2">"\nGradient at BFGS optimum:\n"</span><span class="p">)</span><span class="w"> </span><span class="n">print</span><span class="p">(</span><span class="n">EvalLogLikGrad</span><span class="p">(</span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="p">))</span><span class="w"> </span><span class="n">beta_hat</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">opt_result</span><span 
class="o">$</span><span class="n">par</span><span class="w"> </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Estimate Truth 1 0.9706614 1.0 2 1.7922191 1.8 Gradient at BFGS optimum:  -0.0355956 -0.0324285 </code></pre></div></div> <p>We can compare the results to what we’d get from the same regression using <code class="language-plaintext highlighter-rouge">stats::glm</code>. Sure enough they match, and run in a comparable amount of time. (I’ve found there’s a fair amount of noise in the timing, and of course this is only reporting a single run, so all that really matters here is that the two are of the same order.)</p> <p>The glm algorithm (IRLS) tends to do a much better job of optimizing, in the sense that the gradient is smaller at the glm optimum, and the algorithm runs more quickly. Still, BFGS is just a quick-and-dirty choice, and doesn’t require any special structure of the problem.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">glm_time</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w"> </span><span class="n">glm_fit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">data.frame</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span 
class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x1</span><span class="o">=</span><span class="n">x</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">x2</span><span class="o">=</span><span class="n">x</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="n">start</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">family</span><span class="o">=</span><span class="s2">"poisson"</span><span class="p">)</span><span class="w"> </span><span class="n">glm_time</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">glm_time</span><span class="w"> </span><span class="n">cat</span><span class="p">(</span><span class="s2">"Difference in coefficients estimated by optim and glm:\t"</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">coefficients</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">beta_hat</span><span class="p">)),</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w"> </span><span class="n">cat</span><span class="p">(</span><span class="s2">"\nEstimation time (s):\n"</span><span class="p">)</span><span class="w"> </span><span class="n">cat</span><span class="p">(</span><span class="s2">"optimization and torch: 
\t"</span><span class="p">,</span><span class="w"> </span><span class="n">optim_time</span><span class="p">,</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w"> </span><span class="n">cat</span><span class="p">(</span><span class="s2">"glm: \t\t\t\t"</span><span class="p">,</span><span class="w"> </span><span class="n">glm_time</span><span class="p">,</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w"> </span><span class="n">cat</span><span class="p">(</span><span class="s2">"\nGradient at glm optimum:\n"</span><span class="p">)</span><span class="w"> </span><span class="n">print</span><span class="p">(</span><span class="n">EvalLogLikGrad</span><span class="p">(</span><span class="n">coefficients</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">)))</span><span class="w"> </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Difference in coefficients estimated by optim and glm: 1.882297e-05 Estimation time (s): optimization and torch: 0.1431296 glm: 0.01248574 Gradient at glm optimum:  -3.278255e-05 -7.224083e-05 </code></pre></div></div> <p>To compute standard errors, we need to compute the negative Hessian matrix of the log likelihood:</p> $\hat{\mathcal{I}} := - \left. \frac{\partial^2 \log p(y | x, \beta)} {\partial \beta \partial \beta^T} \right|_{\hat\beta}$ <p>The quantity $$\hat{\mathcal{I}}$$ is the empirical Fisher information, and $$\hat{\mathcal{I}}^{-1}$$ is a standard estimator of the covariance of the MLE $$\hat\beta$$ under correct specification.</p> <p>The Python version of <code class="language-plaintext highlighter-rouge">torch</code> has a native function, like <code class="language-plaintext highlighter-rouge">autograd_grad</code>, that computes the Hessian directly. Unfortunately, that function has not yet been ported to R. 
(See <a href="https://github.com/mlverse/torch/issues/738">this issue</a> on github.) However, we can compute a Hessian by computing the gradients of each row of the gradient, as follows.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">EvalLogLikHessian</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">beta_ad</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">requires_grad</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="n">log_lik</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w"> </span><span class="c1"># The argument create_graph allows grad to be itself differentiated, and</span><span class="w"> </span><span class="c1"># the argument retain_graph saves gradient computations to make repeated differentiation</span><span class="w"> </span><span class="c1"># of the same quantity more efficient.</span><span class="w"> </span><span class="n">grad</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">autograd_grad</span><span class="p">(</span><span class="n">log_lik</span><span class="p">,</span><span class="w"> </span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">retain_graph</span><span 
class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">create_graph</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="c1"># Now we compute the gradient of each element of the gradient, each of which is</span><span class="w"> </span><span class="c1"># one row of the Hessian matrix.</span><span class="w"> </span><span class="n">hess</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">beta</span><span class="p">))</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">d</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">grad</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">hess</span><span class="p">[</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">autograd_grad</span><span class="p">(</span><span class="n">grad</span><span class="p">[</span><span class="n">d</span><span class="p">],</span><span class="w"> </span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">retain_graph</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)[[</span><span 
class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="n">hess</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1"># Just check that this runs and has the correct dimensions</span><span class="w"> </span><span class="n">EvalLogLikHessian</span><span class="p">(</span><span class="n">beta_true</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> -1907663 -1720186 -1720186 -2263947 </code></pre></div></div> <p>In the code below, <code class="language-plaintext highlighter-rouge">fisher_info</code> is precisely $$\hat{\mathcal{I}}$$. We can see that the standard errors match one another.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fisher_info</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">-1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">EvalLogLikHessian</span><span class="p">(</span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="p">)</span><span class="w"> </span><span class="n">torch_se</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">fisher_info</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">diag</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span 
class="nf">sqrt</span><span class="p">()</span><span class="w"> </span><span class="n">glmer_se</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">summary</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[,</span><span class="w"> </span><span class="s2">"Std. Error"</span><span class="p">]</span><span class="w"> </span><span class="n">cat</span><span class="p">(</span><span class="s2">"Difference in estimated standard errors:\t"</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">torch_se</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">glmer_se</span><span class="p">)),</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Difference in estimated standard errors: 3.857842e-07 </code></pre></div></div>The torch package for R (found here) is CRAN-installable and provides automatic differentiation in R, as long as you’re willing to rewrite your code using Torch functions.A Few Equivalent Perspectives on Jackknife Bias Correction2022-03-17T10:00:00+00:002022-03-17T10:00:00+00:00/jackknife/2022/03/17/jackknife_bias<p>In this post, I’ll try to connect a few different ways of viewing jackknife and infinitesimal jackknife bias correction. 
This post may help provide some intuition, as well as an introduction to how to use the infinitesimal jackknife and von Mises expansion to think about bias correction.</p> <p>Throughout, for concreteness, I’ll use the simple example of the statistic</p> \begin{aligned} T &amp; =\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}\right)^{2},\end{aligned} <p>where the $$x_{n}$$ are IID and $$\mathbb{E}\left[x_{n}\right]=0$$. Of course, the bias of $$T$$ is known, since</p> \begin{aligned} \mathbb{E}\left[T\right] &amp; =\frac{1}{N^{2}}\mathbb{E}\left[\sum_{n=1}^{N}x_{n}^{2}+\sum_{n_{1}\ne n_{2}}x_{n_{1}}x_{n_{2}}\right] =\frac{\mathrm{Var}\left(x_{1}\right)}{N}. \end{aligned} <p>We will ensure that we recover consistent estimates of this bias using each of the different perspectives. Of course, these concepts are most useful when we do not readily have such a simple expression for the bias, as in, for example, Bayesian expectations.</p> <p>At different points I will use different arguments for $$T$$, hopefully without any real ambiguity. For convenience, write</p> \begin{aligned} \hat{\mu} &amp; :=\frac{1}{N}\sum_{n=1}^{N}x_{n} \quad\textrm{and}\quad \hat{\sigma}^{2} &amp; :=\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}\end{aligned} <p>so that our example can be expressed as</p> \begin{aligned} T &amp; =\hat{\mu}^{2}.\end{aligned} <h1 id="an-asymptotic-series-in-n">An asymptotic series in $$N$$.</h1> <p>Perhaps the most common way to understand the jackknife bias estimator and correction is as an asymptotic series in $$N$$. Suppose that we have some reason to believe that $$\mathbb{E}\left[T_{N}\right]$$ admits an asymptotic expansion in $$N$$, the size of the observed dataset:</p> \begin{aligned} \mathbb{E}\left[T_{N}\right] &amp; =a_{0}+a_{1}N^{-1}+o\left(N^{-1}\right).\end{aligned} <p>Here $$a_{0}$$ is the limiting value of the statistic (zero in our example, since $$\mathbb{E}\left[x_{n}\right]=0$$), so the leading-order bias is $$a_{1}N^{-1}$$. The jackknife bias estimator works as follows. Let $$T_{-i}$$ denote $$T$$ calculated with datapoint $$i$$ left out. 
Then</p> \begin{aligned} \mathbb{E}\left[T_{N}-T_{-i}\right] &amp; =a_{0}+a_{1}N^{-1}+o\left(N^{-1}\right)-\\ &amp; \quad a_{0}-a_{1}\left(N-1\right)^{-1}+o\left(N^{-1}\right)\\ &amp; =a_{1}\frac{N-1-N}{N\left(N-1\right)}+o\left(N^{-1}\right)\\ &amp; =-\frac{a_{1}N^{-1}}{N-1}+o\left(N^{-1}\right).\end{aligned} <p>Consequently,</p> \begin{aligned} \hat{B} &amp; =-\left(N-1\right)\left(T_{N}-\frac{1}{N}\sum_{n=1}^{N}T_{-n}\right)\\ \mathbb{E}\left[\hat{B}\right] &amp; =a_{1}N^{-1}+o\left(N^{-1}\right).\end{aligned} <p>is an unbiased estimate of the leading order term in the bias of $$T_{N}$$, and the bias-corrected estimate $$T_{N}-\hat{B}$$ has bias of smaller order $$o\left(N^{-1}\right)$$,</p> \begin{aligned} \mathbb{E}\left[T_{N}-\hat{B}\right] &amp; =a_{0}+o\left(N^{-1}\right).\end{aligned} <p>In our example,</p> \begin{aligned} T_{-i} &amp; =\left(\frac{1}{N-1}\sum_{n\ne i}^{N}x_{n}\right)^{2}\\ &amp; =\left(\frac{N}{N-1}\hat{\mu}-\frac{1}{N-1}x_{i}\right)^{2}\\ &amp; =\left(N-1\right)^{-2}\left(N^{2}\hat{\mu}^{2}-2N\hat{\mu}x_{i}+x_{i}^{2}\right)\\ \frac{1}{N}\sum_{n=1}^{N}T_{-n} &amp; =\left(N-1\right)^{-2}\left(\left(N^{2}-2N\right)\hat{\mu}^{2}+\hat{\sigma}^{2}\right)\\ \hat{B} &amp; =-\left(N-1\right)^{-1}\left(\left(N-1\right)^{2}\hat{\mu}^{2}-\left(N^{2}-2N\right)\hat{\mu}^{2}-\hat{\sigma}^{2}\right)\\ &amp; =\frac{1}{N-1}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned} <p>which is a perfectly good estimate of the bias.</p> <h1 id="a-taylor-series-in-1n">A Taylor series in $$1/N$$.</h1> <p>An equivalent way of looking at the previous example is to imagine $$T$$ as a function of $$\tau=1/N$$, to numerically estimate the derivative $$dT/d\tau$$, and extrapolate to $$\tau=0$$. 
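This closed form is easy to check numerically. The sketch below (in Python rather than the R used elsewhere on this site, with simulated data and variable names of my own choosing) computes the brute-force leave-one-out jackknife estimate $$\hat{B}$$ and compares it to $$\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right)/\left(N-1\right)$$:

```python
import random
from statistics import mean

random.seed(0)

# Simulated data and the statistic T = (sample mean)^2.
x = [random.gauss(0.0, 1.0) for _ in range(50)]
N = len(x)

def T(xs):
    return mean(xs) ** 2

# Brute-force jackknife bias estimate: B = -(N - 1) * (T_N - average leave-one-out T).
T_N = T(x)
T_loo = [T(x[:i] + x[i + 1:]) for i in range(N)]
B_hat = -(N - 1) * (T_N - mean(T_loo))

# Closed form derived above: (sigma_hat^2 - mu_hat^2) / (N - 1).
mu_hat = mean(x)
sigma2_hat = mean(xi ** 2 for xi in x)
B_closed = (sigma2_hat - mu_hat ** 2) / (N - 1)

print(B_hat, B_closed)  # the two agree up to floating point error
```

Because $$T$$ is exactly quadratic in the sample mean, the agreement here is exact (up to rounding), not merely asymptotic.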
Using the notation of the previous section, define the gradient estimate</p> \begin{aligned} \hat{g_{i}} &amp; =\frac{T_{N}-T_{-i}}{\frac{1}{N}-\frac{1}{N-1}}.\end{aligned} <p>Here, we are viewing $$T_{-i}$$ as an instance of the estimator evaluated at $$\tau=1/\left(N-1\right)$$. By rearranging, we find that</p> \begin{aligned} \hat{g_{i}} &amp; =-N\left(N-1\right)\left(T_{N}-T_{-i}\right),\\ \hat{g} &amp; =\frac{1}{N}\sum_{n=1}^{N}\hat{g_{n}}\\ &amp; =-N\left(N-1\right)\left(T_{N}-\frac{1}{N}\sum_{n=1}^{N}T_{-n}\right)\\ &amp; =N\hat{B}.\end{aligned} <p>Extrapolating to $$\tau=0$$ gives</p> \begin{aligned} T_{\infty} &amp; \approx T_{N}+\hat{g}\left(\frac{1}{\infty}-\frac{1}{N}\right)\\ &amp; =T_{N}-\hat{B},\end{aligned} <p>as in the previous example.</p> <h1 id="a-von-mises-expansion">A von Mises expansion.</h1> <p>Let us write the statistic as a functional of the data distribution as follows:</p> \begin{aligned} T\left(F\right) &amp; =\left(\int xdF\left(x\right)\right)^{2}.\end{aligned} <p>Define the empirical distribution to be $$F_{N}$$ and the true distribution $$F_{\infty}$$. Suppose we can Taylor expand the statistic in the space of distribution functions as</p> \begin{aligned} T\left(G\right) &amp; \approx T\left(F_{0}\right)+T_{1}\left(F_{0}\right)\left(G-F_{0}\right)+\frac{1}{2}T_{2}\left(F_{0}\right)\left(G-F_{0}\right)\left(G-F_{0}\right)+ O\left(\left|G-F_{0}\right|^{3}\right), &amp; \textrm{(1)} \end{aligned} <p>where $$T_{1}\left(F_{0}\right)$$ is a linear operator on the space of (signed) distribution functions and $$T_{2}\left(F_{0}\right)$$ is a similarly defined bilinear operator. The expansion in Eq. 
1 is known as a von Mises expansion.</p> <p>Often these operators can be represented with “influence functions”, i.e., there exists a function $$x\mapsto\psi_{1}\left(F_{0}\right)\left(x\right)$$ and $$x_{1},x_{2}\mapsto\psi_{2}\left(F_{0}\right)\left(x_{1},x_{2}\right)$$ such that</p> \begin{aligned} T_{1}\left(F_{0}\right)\left(G-F_{0}\right) &amp; =\int\psi_{1}\left(F_{0}\right)\left(x\right)d\left(G-F_{0}\right)\left(x\right)\\ T_{2}\left(F_{0}\right)\left(G-F_{0}\right)\left(G-F_{0}\right) &amp; =\int\int\psi_{2}\left(F_{0}\right)\left(x_{1},x_{2}\right)d\left(G-F_{0}\right)\left(x_{1}\right)d\left(G-F_{0}\right)\left(x_{2}\right).\end{aligned} <p>For instance, the directional derivative of our example is given by</p> \begin{aligned} \left.\frac{dT\left(F+tG\right)}{dt}\right|_{t=0} &amp; =\left.\frac{d}{dt}\right|_{t=0}\left(\int xd\left(F+tG\right)\left(x\right)\right)^{2}\\ &amp; =2\left(\int\tilde{x}dF\left(\tilde{x}\right)\right)\int xdG\left(x\right),\end{aligned} <p>so that</p> \begin{aligned} \psi_{1}\left(F\right)\left(x\right) &amp; =2\left(\int\tilde{x}dF\left(\tilde{x}\right)\right)x.\end{aligned} <p>Similarly,</p> \begin{aligned} \left.\frac{d^{2}T\left(F+tG\right)}{dt^{2}}\right|_{t=0} &amp; =2\int xdG\left(x\right)\int xdG\left(x\right),\end{aligned} <p>so</p> \begin{aligned} \psi_{2}\left(F\right)\left(x_{1},x_{2}\right) &amp; =2x_{1}x_{2}.\end{aligned} <p>Define</p> \begin{aligned} \Delta_{N} &amp; :=F_{N}-F_{\infty}.\end{aligned} <p>Then the Taylor expansion gives an expression for the bias in terms of the influence functions:</p> \begin{aligned} T\left(F_{N}\right)-T\left(F_{\infty}\right) &amp; =\int\psi_{1}\left(F_{\infty}\right)\Delta_{N}+\frac{1}{2}\int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)+\\ &amp; \quad\quad O\left(\left|\Delta_{N}\right|^{3}\right).\end{aligned} <p>Note that, in general, integrals against $$\Delta_{N}$$ take the form</p> \begin{aligned} 
\int\phi\left(x\right)d\Delta_{N}\left(x\right) &amp; =\int\phi\left(x\right)dF_{N}\left(x\right)-\int\phi\left(x\right)dF_{\infty}\left(x\right)\\ &amp; =\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)-\mathbb{E}\left[\phi\left(x\right)\right].\end{aligned} <p>Consequently, the first term has zero bias, since</p> \begin{aligned} \mathbb{E}\left[\int\psi_{1}\left(F_{\infty}\right)\Delta_{N}\right] &amp; =\frac{1}{N}\mathbb{E}\left[\sum_{n=1}^{N}\psi_{1}\left(F_{\infty}\right)\left(x_{n}\right)\right]-\mathbb{E}\left[\psi_{1}\left(F_{\infty}\right)\left(x\right)\right]\\ &amp; =0.\end{aligned} <p>The second term is given by</p> \begin{aligned} &amp; \int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)\\ &amp; \quad=\int\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{n}\right)-\mathbb{E}_{x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right)d\Delta_{N}\left(x_{1}\right)\\ &amp; \quad=\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}\psi_{2}\left(F_{\infty}\right)\left(x_{n_{1}},x_{n_{2}}\right)-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right].\end{aligned} <p>Note that in the expectation, $$x_{1}$$ and $$x_{2}$$ are independent. So, when $$n_{1}\ne n_{2}$$, the term in the sum has mean zero, and</p> \begin{aligned} &amp; \mathbb{E}\left[\int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)\right]=\nonumber \\ &amp; \quad\frac{1}{N}\left(\mathbb{E}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{1}\right)\right]-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right). 
&amp; \textrm{(2)} \end{aligned} <p>In general, this is not zero, and so the leading-order bias term of $$\mathbb{E}\left[T\left(F_{N}\right)-T\left(F_{\infty}\right)\right]$$ is given by the expectation of the quadratic term.</p> <p>Note that integrals over $$\Delta_{N}$$ are, by the CLT, of order $$1/\sqrt{N}$$, so the order of the $$k$$-th term in the von Mises expansion is order $$N^{-k/2}$$. By this argument, the bias of $$T$$ is of order $$1/N$$ and admits a series expansion in $$1/N$$. Indeed, a von Mises expansion is one way you could justify the first perspective. The expectation of the second-order term is precisely the value $$a_{1}N^{-1}$$.</p> <p>For our example, we can see that the bias is given by</p> \begin{aligned} &amp; \frac{1}{2}\frac{1}{N}\left(\mathbb{E}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{1}\right)\right]-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right)\\ &amp; \quad=\frac{2}{2N}\left(\mathbb{E}\left[x_{1}^{2}\right]-\mathbb{E}\left[x_{1}\right]^{2}\right)\\ &amp; \quad=\frac{1}{N}\mathrm{Var}\left(x_{1}\right),\end{aligned} <p>exactly as expected. In this case, the second order term is the exact bias because our very simple $$T$$ is actually quadratic in the distribution function.</p> <p>In general, one can estimate the bias by computing a sample version of the second-order term. In our simple example, $$\psi_{2}\left(F\right)$$ does not actually depend on $$F$$, but in general one would have to replace $$\psi_{2}\left(F_{\infty}\right)$$ with $$\psi_{2}\left(F_{N}\right)$$ and the population expectations with sample expectations. 
For our example, letting $$\hat{\mathbb{E}}$$ denote sample expectations, this plug-in approach gives</p> \begin{aligned} &amp; \frac{1}{2}\frac{1}{N}\left(\hat{\mathbb{E}}\left[\psi_{2}\left(F_{N}\right)\left(x_{1},x_{1}\right)\right]-\hat{\mathbb{E}}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{N}\right)\left(x_{1},x_{2}\right)\right]\right)\\ &amp; \quad=\frac{2}{2N}\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}-\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}x_{n_{1}}x_{n_{2}}\right)\\ &amp; \quad=\frac{1}{N}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned} <p>which is simply $$1/N$$ times a sample estimate of the variance.</p> <p>Note that you might initially expect that, to use the reasoning in Eq. 1, you would need to express your estimator as an explicit function of $$N$$, or at least take into account the $$N$$ dependence in developing a Taylor series expansion such as that in Eq. 1. However, the example in the present case shows that this is not so, as the empirical distribution depends only implicitly on $$N$$. In fact, the asymptotic series in $$N$$ follows from the stochastic behavior of $$\Delta_{N}$$ rather than from any explicit $$N$$ dependence in the statistic.</p> <h1 id="the-infinitesimal-jackknife">The infinitesimal jackknife.</h1> <p>Rather than use Eq. 1 to estimate the bias directly with the plug-in principle, we might imagine using it to try to approximate the jackknife estimate of bias. In this section, I show that (a) a second order infinitesimal jackknife expansion is necessary and that (b) you then get the same answer as by estimating the bias from the second term of Eq. 1 directly.</p> <p>Let $$F_{-i}$$ denote the empirical distribution with datapoint $$i$$ left out, and let $$\Delta_{-i}$$ denote $$F_{-i}-F_{N}$$. The infinitesimal jackknife estimate of $$T_{-i}$$ is given by using Eq. 
1 to extrapolate from $$F_{N}$$ to $$F_{-i}$$:</p> \begin{aligned} T_{IJ}^{\left(1\right)}\left(F_{-i}\right) &amp; :=T\left(F_{N}\right)+T_{1}\left(F_{N}\right)\Delta_{-i}.\end{aligned} <p>This is the classical infinitesimal jackknife, which expands only to first order. The second order IJ is of course</p> \begin{aligned} T_{IJ}^{\left(2\right)}\left(F_{-i}\right) &amp; :=T\left(F_{N}\right)+T_{1}\left(F_{N}\right)\Delta_{-i}+\frac{1}{2}T_{2}\left(F_{N}\right)\Delta_{-i}\Delta_{-i}.\end{aligned} <p>The difference from Eq. 1 is that the base of the Taylor series is $$F_{N}$$ rather than $$F_{\infty}$$, and we are extrapolating to estimate the jackknife rather than to estimate the actual bias. A benefit is that all the quantities in the Taylor series can be evaluated, and no plug-in approximation is necessary. For instance, in our example,</p> \begin{aligned} \psi_{1}\left(F_{\infty}\right)\left(x\right) &amp; =2\left(\int\tilde{x}dF_{\infty}\left(\tilde{x}\right)\right)x,\end{aligned} <p>which contains the unknown true mean $$\mathbb{E}\left[x_{1}\right]$$. 
In contrast,</p> \begin{aligned} \psi_{1}\left(F_{N}\right)\left(x\right) &amp; =2\hat{\mu}x,\end{aligned} <p>which depends only on the observed sample mean.</p> <p>As before, it is useful to first write out the operation of $$\Delta_{-i}$$ on a generic function of $$x$$:</p> \begin{aligned} \int\phi\left(x\right)d\Delta_{-i}\left(x\right) &amp; =\int\phi\left(x\right)dF_{-i}\left(x\right)-\int\phi\left(x\right)dF_{N}\left(x\right)\\ &amp; =\frac{1}{N-1}\sum_{n\ne i}\phi\left(x_{n}\right)-\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)\\ &amp; =\left(\frac{1}{N-1}-\frac{1}{N}\right)\sum_{n=1}^{N}\phi\left(x_{n}\right)-\frac{\phi\left(x_{i}\right)}{N-1}\\ &amp; =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)-\phi\left(x_{i}\right)\right).\end{aligned} <p>From this we see that the first-order term is</p> \begin{aligned} T_{1}\left(F_{N}\right)\Delta_{-i} &amp; =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)-\psi_{1}\left(F_{N}\right)\left(x_{i}\right)\right).\end{aligned} <p>Suppose we tried to use $$T_{IJ}^{\left(1\right)}\left(F_{-i}\right)$$ to approximate $$T_{-i}$$ in the expression for $$\hat{B}$$. We would get</p> \begin{aligned} \hat{B} &amp; =-\left(N-1\right)\left(T\left(F_{N}\right)-\frac{1}{N}\sum_{n=1}^{N}T_{IJ}^{\left(1\right)}\left(F_{-n}\right)\right)\\ &amp; =\left(N-1\right)\left(\frac{1}{N}\sum_{n=1}^{N}T_{1}\left(F_{N}\right)\Delta_{-n}\right)\\ &amp; =\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)-\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)\\ &amp; =0.\end{aligned} <p>In other words, the first-order approximation estimates no bias. 
(This is in fact for the same reason that the expectation with respect to $$F_{\infty}$$ of the first-order term evaluated at $$F_{\infty}$$ is zero.)</p> <p>The second order term is given by</p> \begin{aligned} T_{2}\left(F_{N}\right)\Delta_{-i}\Delta_{-i} &amp; =\left(N-1\right)^{-1}\int\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(\tilde{x},x_{n}\right)-\psi_{2}\left(F_{N}\right)\left(\tilde{x},x_{i}\right)\right)d\Delta_{-i}\left(\tilde{x}\right)\\ &amp; =\left(N-1\right)^{-2}\times(\\ &amp; \quad\frac{1}{N^{2}}\sum_{n_{1},n_{2}}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n_{1}},x_{n_{2}}\right)-\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{i},x_{n}\right)-\\ &amp; \quad\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n},x_{i}\right)+\psi_{2}\left(F_{N}\right)\left(x_{i},x_{i}\right)\\ &amp; \quad).\end{aligned} <p>As before, then using $$T_{IJ}^{\left(2\right)}\left(F_{-i}\right)$$ to approximate $$T_{-i}$$ gives</p> \begin{aligned} \hat{B} &amp; =-\left(N-1\right)\left(T\left(F_{N}\right)-\frac{1}{N}\sum_{n=1}^{N}T_{IJ}^{\left(2\right)}\left(F_{-n}\right)\right)\\ &amp; =\left(N-1\right)\left(\frac{1}{N}\sum_{n=1}^{N}\frac{1}{2}T_{2}\left(F_{N}\right)\Delta_{-n}\Delta_{-n}\right),\end{aligned} <p>where we have used the previous result that the first term has empirical expectation $$0$$. Plugging in, we see that</p> \begin{aligned} \hat{B} &amp; =\frac{1}{2}\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n},x_{n}\right)-\frac{1}{N^{2}}\sum_{n_{1},n_{2}}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n_{1}},x_{n_{2}}\right)\right),\end{aligned} <p>which is precisely a sample analogue of the population bias implied by Eq. 2 of the previous section. 
Of course, in our specific example, this gives</p> \begin{aligned} \hat{B} &amp; =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}-\frac{1}{N^{2}}\sum_{n_{1},n_{2}}^{N}x_{n_{1}}x_{n_{2}}\right)\\ &amp; =\frac{1}{N-1}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned} <p>which matches the exact jackknife’s factor of $$\left(N-1\right)^{-1}$$, in contrast to our direct sample estimate of the bias term, which had a factor of $$N^{-1}$$.</p>In this post, I’ll try to connect a few different ways of viewing jackknife and infinitesimal jackknife bias correction. This post may help provide some intuition, as well as an introduction to how to use the infinitesimal jackknife and von Mises expansion to think about bias correction.St. Augustine’s question: A counterexample to Ian Hacking’s ‘law of likelihood’2022-02-17T10:00:00+00:002022-02-17T10:00:00+00:00/philosophy/2022/02/17/st_augustines_paradox<p>In this post, I’d like to discuss a simple sense in which statistical reasoning refutes itself. My reasoning is almost trivial and certainly familiar to statisticians. But I think that the way I frame it constitutes an argument against a certain kind of philosophical overreach: against an attempt to view statistical reasoning as a branch of logic, rather than an activity that looks more like rhetoric.</p> <p>To make my argument I’d like to mash up two books which I’ve talked about before on this blog. The first is Ian Hacking’s Logic of Statistical Inference (I wrote <a href="/philosophy/2021/12/09/fidual_cis.html">here</a> about its wonderful chapter on fiducial inference). The other is an interesting section in St. Augustine’s confessions, which I <a href="/philosophy/2021/10/27/st_augustine.html">discussed here</a>. Ian Hacking’s ambition is, as the title of the book suggests, to describe the basis of a logic of statistical inference. 
His primary tool is the comparison of the likelihoods of what he calls “chance outcomes” (implicitly he seems to mean aleatoric gambling devices, but he is uncharacteristically imprecise, implying, I think, that we simply know a chance setup when we see it).</p> <p>St. Augustine, as I discuss in my earlier post, has a worldview stripped of what modern thinkers would call randomness. In St. Augustine’s vision of the world, an unknowable and all-powerful God guides to His own ends the outcome even of aleatoric devices, such as the drawing of lots and, presumably, the flipping of coins. Many people in the modern age do not think like St. Augustine. So it is reasonable to ask what I will call “St. Augustine’s question”: is St. Augustine’s deterministic worldview correct?</p> <h1 id="hacking-on-st-augustines-question">Hacking on St. Augustine’s question</h1> <p>I would like to attempt, using Hacking’s methods, to bring the outcome of a coin flip to bear on St. Augustine’s question. One might reasonably doubt that this is fair to Hacking. However, the first sentence of Hacking’s book articulates the scope of his ambition:</p> <blockquote> <p>“The problem of the foundation of statistics is to state a set of principles which entail the validity of all correct statistical inference, and which do not imply that any fallacious inference is valid.”</p> </blockquote> <p>Hacking’s goal is ambitious (my argument here is essentially that it is over-ambitious). However, to his credit, it is clear: if we can formulate the St. Augustine question as a statistical one about a chance outcome, then we should expect Hacking’s logic to come to the correct epistemic conclusion. Furthermore, Hacking himself states (when arguing against Neyman-Pearson tests in Chapter 7) that “the best way to refute a principle [is] not general metaphysics but concrete example.”</p> <p>Finally, lest it seem too esoteric to argue with St. 
Augustine, or that this example is too contrived to be meaningful, at the end of this post, I will draw connections between my argument and some shortcomings of likelihood-based model comparison that are well known to statisticians but largely ignored by Hacking’s book.</p> <h1 id="hackings-law-of-likelihood">Hacking’s law of likelihood</h1> <p>Hacking’s principle of inference is embodied in his “law of likelihood,” which is introduced in Chapter 5. The goal is to justifiably connect aleatoric statements to degrees of logical belief (without going through subjective probability). Stripping away some of Hacking’s notation, his law of likelihood states in brief that</p> <blockquote> <p>“If two joint propositions are consistent with the statistical data, the better supported is that with the greater likelihood.”</p> </blockquote> <p>Here I should clarify some of Hacking’s terminology. By “statistical data” he means everything you know before conducting a chance experiment, including the nature of how you get the data. A “joint proposition” is some statement about the world, possibly including things you don’t know, e.g., future unobserved data, or some unknown aspect of the real world. Hacking spends a lot of time defining and discussing his terms.</p> <p>For the present purpose, it suffices to describe some of Hacking’s own examples from Chapter 5 of how the law of likelihood is to be used. Suppose that a biased coin has P(H) = 0.9 and P(T) = 0.1. Then, by the law of likelihood, the proposition $$\pi_H$$ that a yet-unseen flip will be H is better supported than the proposition $$\pi_T$$ that it will be T, since P(H) &gt; P(T). Similarly, if we observe K heads out of N flips, by the law of likelihood, the proposition $$\pi_{K/N}$$ that P(H) = K / N is better supported than the proposition $$\pi_{(K-1)/N}$$ that P(H) = (K - 1) / N.</p> <p>Are these assertions trivial? 
Hacking spends the first part of the book arguing that they are not, and the latter part of the book demonstrating important differences, both conceptual and practical, with decision theory and subjective probability. Suffice to say they are beyond the scope of the present post.</p> <h1 id="asking-st-augustines-question-with-the-law-of-likelihood">Asking St. Augustine’s question with the law of likelihood</h1> <p>Let us suppose that we have made a single coin flip, which came up H. The coin was designed and flipped symmetrically to the best of our abilities. St. Augustine’s question can be expressed in terms of these two simple propositions:</p> <ul> <li>$$\pi_{R}$$ (Randomness): P(H) = 0.5, and we observed H</li> <li>$$\pi_{A}$$ (Augustine): P(H) = 1.0 (God wills it), and we observed H</li> </ul> <p>Obviously, the law of likelihood supports $$\pi_{A}$$, answering St. Augustine’s question in the affirmative, i.e., that St. Augustine’s worldview is better supported than randomness.</p> <p>Let me be the first to admit that this is pretty trivial. Perhaps you are disappointed, and sorry you bothered to read this far! Let me try to bring you back in.</p> <p>First, observe that the same reasoning applies to any number of coin flips. You might ask whether the sequence HTHTTH was pre-ordained or random, and the law of likelihood always supports that it was pre-ordained. The same reasoning can be applied to whether some small number of flips in a particular sequence were pre-ordained — e.g., when asking whether every flip in the sequence HTHTTH was random, or whether at least one of them was pre-ordained, the law of likelihood supports that at least one of them was pre-ordained. 
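These likelihood comparisons can be spelled out numerically. A toy Python sketch (the sequence and probabilities are the ones from the text; the variable names are mine):

```python
# Likelihood of the observed sequence HTHTTH under three propositions:
# fair flips, a fully pre-ordained sequence, and flips with probability
# 0.6 of whatever outcome actually occurred.
seq = "HTHTTH"

lik_fair = 0.5 ** len(seq)         # pi_R-style: every flip fair
lik_biased = 0.6 ** len(seq)       # each observed outcome had probability 0.6
lik_preordained = 1.0 ** len(seq)  # pi_A-style: every outcome willed

# The law of likelihood ranks the more deterministic proposition higher.
print(lik_fair, lik_biased, lik_preordained)
```

However the probabilities are chosen, the proposition that puts more mass on what actually happened always wins this comparison.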
The same reasoning applies to degrees of probability, as well — e.g., when asking whether every flip in the sequence HTHTTH was fair, versus whether it was P(H) = 0.6 when H came up and P(T) = 0.6 when T came up, the law of likelihood supports that the sequence was not fair.</p> <p>In short, the law of likelihood always supports the most deterministic proposition. In this sense, the law of likelihood does not support its own applicability. Without randomness, there is no need or use for a logic of statistical inference. When given the opportunity to ask whether or not there is randomness in a particular setting, the law of likelihood always militates against randomness, and eats its own tail.</p> <h1 id="statisticians-know-this-and-so-does-hacking">Statisticians know this, and so does Hacking</h1> <p>This phenomenon is no surprise to statisticians, of course. Model selection based on likelihood — whether Bayesian or frequentist in design and use — favors the more complex models unless some corrective factor is used, such as regularization or priors. The answer given by the law of likelihood to St. Augustine’s question is just an extreme end of this phenomenon.</p> <p>Is Hacking aware of this problem? Of course; Hacking is aware of most things. For example, in Chapter 7, he discusses very briefly the importance of weighting likelihoods in some cases (“One author has suggested that a number be assigned to each hypothesis, which will represent the ‘seriousness’ of rejecting it … In the theory of likelihood testing, one would use weighted tests.”) Unfortunately, Hacking’s discussion of Bayesianism in Chapters 12 and 13 does not take up this point, focusing instead on arguing against uniform priors and dogmatic subjectivism. 
Probably most damningly, Hacking does not shrink away from using the law of likelihood to reason between a large number of expressive propositions and a single less expressive one, as, for example, in his comparison of unbiased tests in Chapter 7 (page 89 in the Cambridge Philosophy Classics 2016 edition). In summary, Hacking does not appear to take very seriously the fundamental role extra-statistical evidence must play in applications of the law of likelihood, in order to avoid its own self-refutation.</p> <h1 id="we-must-deliberately-choose-the-statistical-analogy">We must deliberately choose the statistical analogy</h1> <p>The point is that describing the world with randomness is a choice we make, and we make it because it is sometimes useful to us. In the course of doing something like statistical inference, we <em>must</em> posit <em>a priori</em> the existence of randomness as well as explanatory mechanisms of limited complexity. At the core of statistical reasoning is the <em>discard</em> of information — of viewing a set of voters, each entirely unique, as equivalent to balls drawn from an urn, or viewing the day’s weather, which is fixed from yesterday’s by deterministic laws of physics, as something exchangeable with some hypothetical population of other days, conceptually detached from contingency and their own pasts. Failure to remember this can lead to silly arguments about whether phenomena are “really random.” In other words, we must choose to make the <a href="/philosophy/2021/08/22/what_is_statistics.html">statistical analogy</a>, and accept that its applicability may not be indisputable.</p> <p>From this perspective, Hacking’s ambition — a logic of statistical inference — seems hopeless, not because of some inevitably subjective nature of probability itself, but because of the subjective nature of analogy. How can you form a logic which will give correct conclusions in every application of an analogy? 
The affairs of statistics are inevitably human and not purely computational, and the field is more exciting and fruitful for it.</p>In this post, I’d like to discuss a simple sense in which statistical reasoning refutes itself. My reasoning is almost trivial and certainly familiar to statisticians. But I think that the way I frame it constitutes an argument against a certain kind of philosophical overreach: against an attempt to view statistical reasoning as a branch of logic, rather than an activity that looks more like rhetoric.Some of the gambling devices that build statistics.2022-01-27T10:00:00+00:002022-01-27T10:00:00+00:00/philosophy/2022/01/27/basic_gambling_device<p>In <a href="/philosophy/2021/08/22/what_is_statistics.html">an earlier post</a>, I discuss how statistics uses gambling devices (aleatoric uncertainty) as a metaphor for the unknown in general (epistemic uncertainty). I called this the “statistical analogy.” Of course, this perspective is not at all new — see section 1.5 of , for example.</p> <p>When folks employ the statistical analogy, explicitly or implicitly, a few gambling devices come up again and again. I find that having their taxonomy in the back of the mind can help see what metaphor(s) is (are) being employed in a particular analysis. These gambling devices are obviously not fully distinct — you can typically simulate one with another, and the final “device” obviously encompasses all the others. But I will separate them here because they tend to play different metaphorical roles — and, I would argue, increasingly tenuously in the order I have written them.</p> <h1 id="the-urn-exchangeability">The urn (exchangeability)</h1> <p>The gambling device most commonly used in statistics is probably the urn: a container holding objects, such as balls of different colors, which is shaken, and from which some objects are removed. 
The aleatoric randomness is provided by shaking as well as drawing blindly from the urn, creating a symmetry between all objects in the urn. Equivalent gambling devices include drawing cards from a shuffled deck or drawing random respondents for a poll. Once one is in the habit of thinking about urns with a finite number of objects, it is a small step to consider urns with an infinite number of objects, such as super-populations in causal inference (, section 1.12).</p> <p>The ubiquitous assumption of exchangeability is equivalent to sequential drawing from a shaken urn (, section 3). Consequently, the urn model is at the core of most frequentist inferential methods, including the bootstrap and normal approximations for exchangeable data.</p> <h1 id="bets-using-biased-coins-subjective-probability">Bets using biased coins (subjective probability)</h1> <p>The biased coin, which chooses between two outcomes with given probabilities, plays a large role in subjective probability (associated with Bayesian statistics) as the basis for hypothetical betting. The key idea behind subjective probability is that, before gathering data, we have beliefs about the state of the world. If these beliefs satisfy some reasonable assumptions (i.e., are “coherent”) then there are some bets that we would consider fair, and some that we would not. Equivalent aleatoric versions of these bets can then be used as metaphors for your subjective beliefs.</p> <p>For example, suppose that some unknown quantity can be either A or B, and we would accept as fair a bet in which we get \$1 if A occurs but pay \$2 if B occurs. Since these are precisely the odds that would be acceptable for a biased coin which comes up A 2/3 of the time and B 1/3 of the time, one might say that your subjective belief about A and B is equivalent to your subjective belief about a biased coin with probabilities 2/3 and 1/3. The bet on a biased coin is a metaphor for your subjective belief about A and B. 
(The full formal connection between betting and subjective probability is richer and more complicated than my cartoon. See , sections 3.1-3.4.)</p> <p>With a coin, the aleatoric randomness is produced by a symmetric coin shape together with flipping or spinning, which creates a symmetry between the two sides. The biased coin can be extended to multiple outcomes with uneven dice, such as sheep knuckle bones, again with symmetry created between outcomes via spinning. Of course, you can draw from an urn using biased coins, or produce bets with urns. That is not my point! The point is that the way these gambling devices are used metaphorically is distinct.</p> <h1 id="the-spinner-continuous-uniform-random-variables">The spinner (continuous uniform random variables)</h1> <p>The urn and the biased coin are fundamentally discrete, though much of statistics deals with continuous-valued random variables. The spinner is the most natural way to produce a continuous random variable — namely, a uniform distribution on the circumference of a circle. A spinner creates aleatoric randomness by symmetry of the disk together with a vigorous spin. The needle goes around many times, but the random number is produced by the fractional part of the number of cycles. Pseudo-random number generators like the Mersenne twister seem to me to be in the same class, as they are based on the fractional part of a large number.</p> <p>The spinner creates a sort of bridge to the rest of probability theory, since any continuous random variable can be produced by applying a function (the inverse CDF) to a uniform random variable on the unit interval. Given a spinner, one can begin to imagine complex aleatoric processes based on spinners and computation alone. Of course, we can form approximations to the continuum with a sufficiently large number of coin flips, for example, or a sufficiently large urn.
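This inverse-CDF recipe is easy to demonstrate in code. Here is a minimal R sketch (with simulated “spinner” draws) that turns uniform random variables into exponential ones:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(42)
u = runif(1e5)            # "spinner" draws: uniform on the unit interval
rate = 2
x = -log(1 - u) / rate    # apply the inverse CDF of the exponential distribution
mean(x)                   # should be close to 1 / rate = 0.5
</code></pre></div></div> <p>The same recipe applies to any distribution whose inverse CDF we can compute.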
However, I think the spinner provides much cleaner intuition for why we consider continuous random variables to be reasonable aleatoric processes in the first place.</p> <h1 id="probabilistic-models">Probabilistic models</h1> <p>Once we have the probability calculus (via the spinner and computation), we can begin to form quite complex aleatoric models to represent our uncertainty. Arguably, this is the realm in which a lot of modern statistical work takes place. For example, suppose you are analyzing a binary outcome (hospitalized for COVID or not) as a function of some regressors (age and vaccine status). For an individual with a given age and vaccine status, we do not know for certain whether they will be hospitalized. A logistic regression is precisely a posited aleatoric system to describe this subjective uncertainty. Software like Stan, which allows generalists to perform inference on their own generative processes, makes this kind of complex modeling relatively easy.</p> <p>Of course, at this level of abstraction, the metaphor can lose clarity and force. Why is logistic regression reasonable? Why not some other link function? Why not other regressors (e.g., interactions)? Taking for granted that such abstract models provide good metaphors for epistemic uncertainty is at the root of many misapplications of statistics. In fact, many early statisticians, particularly those in the frequentist camps, were expressly unwilling to extend the statistical analogy much further than exchangeability. One might see a key difference between Neyman-Rubin causal inference (), which (mostly) requires only the urn, and Pearlian causal inference (), which requires probabilistic graphical models, as a difference in willingness to stretch the statistical analogy.</p> <p>As with all analogies, the quality of a particular statistical analogy is subject to an ineradicable subjectivity.
But being aware of what analogy is being made in a particular situation can help clarify disagreements and avoid missteps.</p> <h1 id="references">References</h1> <p> Gelman, Andrew, et al. Bayesian data analysis. CRC Press, 2013.</p> <p> Imbens, Guido W., and Donald B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.</p> <p> Shafer, Glenn, and Vladimir Vovk. “A Tutorial on Conformal Prediction.” Journal of Machine Learning Research 9.3 (2008).</p> <p> Ghosh, Jayanta K., Mohan Delampady, and Tapas Samanta. An introduction to Bayesian analysis: theory and methods. Vol. 725. New York: Springer, 2006.</p> <p> Pearl, Judea. Causality. Cambridge University Press, 2009.</p>In an earlier post, I discuss how statistics uses gambling devices (aleatoric uncertainty) as a metaphor for the unknown more generally (epistemic uncertainty). I called this the “statistical analogy.” Of course, this perspective is not at all new — see section 1.5 of , for example.How does AMIP work for regression when the weight vector induces collinearity in the regressors?2021-12-17T10:00:00+00:002021-12-17T10:00:00+00:00/amip/2021/12/17/reweighted_colinear_note<p>How does AMIP work for regression when the weight vector induces collinearity in the regressors? This problem came up in our paper, as well as for a couple of users of <code class="language-plaintext highlighter-rouge">zaminfluence</code>. Superficially, the higher-order infinitesimal jackknife has nothing to say about such a point, since a requirement for the accuracy of the approximation is that the Hessian matrix be uniformly non-singular.
However, we shall see that, in the special case of regression, we can re-express the problem so that the singularity disappears.</p> <h3 id="notation">Notation</h3> <p>Suppose we have a weighted regression problem with regressor matrix $$X$$ (an $$N \times D$$ matrix), response vector $$\vec{y}$$ (an $$N$$-vector), and weight vector $$\vec{w}$$ (an $$N$$-vector): \begin{aligned} % \hat{\theta}(\vec{w}) :={}&amp; \theta\textrm{ such that }\sum_{n=1}^N w_n (y_n - \theta^T x_n) x_n = 0 \Rightarrow\\ \hat{\theta}(\vec{w}) ={}&amp; \left(\frac{1}{N}\sum_{n=1}^N w_n x_n x_n^T \right)^{-1} \frac{1}{N}\sum_{n=1}^N w_n y_n x_n \\={}&amp; \left((\vec{w}\odot X)^T X\right)^{-1} (\vec{w}\odot X)^T \vec{y}. % \end{aligned}</p> <p>Here, we use $$\vec{w}\odot X$$ for the Hadamard product with broadcasting. Formally, we really mean $$(\vec{w}1_D^T) \odot X$$, where $$1_D^T$$ is a $$D$$-length row vector containing all ones. Throughout, we will use $$1$$ and $$0$$ subscripted by a dimension to represent vectors filled with ones and zeros respectively.</p> <h3 id="how-can-weights-induce-rank-deficiency">How can weights induce rank deficiency?</h3> <p>We are interested in what happens to the linear approximation at a weight vector $$\vec{w}$$ for which the Hessian $$\frac{1}{N}\sum_{n=1}^N w_n x_n x_n^T$$ is singular. Assume that $$X$$ has rank $$D$$, and that $$(\vec{w}\odot X)$$ has rank $$D-1$$. Specifically, there exists some nonzero vector $$a_1 \in \mathbb{R}^D$$ such that $$(\vec{w} \odot X) a_1 = 0_N$$, where $$0_N$$ is the $$N$$-length vector of zeros. For each $$n$$, the preceding expression implies that $$w_n x_n^T a_1 = 0$$, so either $$w_n = 0$$ or $$x_n^T a_1 = 0$$.
Without loss of generality, we can order the observations so that the first $$N_{d}$$ rows are the dropped ones:</p> \begin{aligned} % \vec{w}= \left(\begin{array}{c} 0_{N_{d}}\\ 1_{N_{k}} \end{array}\right) \quad\textrm{and}\quad X= \left(\begin{array}{c} X_d\\ X_k \end{array}\right) % \end{aligned} <p>Here, $$X_d$$ is an $$N_{d}\times D$$ matrix of dropped rows and $$X_k$$ is an $$N_{k}\times D$$ matrix of kept rows, where $$N_{k}+ N_{d}= N$$. We thus have</p> \begin{aligned} % Xa_1 = \left(\begin{array}{c} X_d a_1\\ 0_{N_{k}} \end{array}\right). % \end{aligned} <p>Here, $$X_d a_1 \ne 0$$ (for otherwise $$X$$ could not have rank $$D$$). In other words, the rows $$X_k$$ are rank deficient, the rows $$X_d$$ are not, but $$\vec{w}\odot X$$ is rank deficient precisely because $$\vec{w}$$ drops the full-rank portion $$X_d$$.</p> <h3 id="reparameterize-to-isolate-the-vanishing-subspace">Reparameterize to isolate the vanishing subspace</h3> <p>To understand how $$\hat{\theta}(\vec{w})$$ behaves, let’s isolate the coefficient that corresponds to the subspace that vanishes. To that end, let $$A$$ denote an invertible $$D \times D$$ matrix whose first column is $$a_1$$:</p> \begin{aligned} % A := \left(\begin{array}{cccc} a_1 &amp; a_2 &amp; \ldots &amp; a_D \end{array}\right). % \end{aligned} <p>Define $$Z:= XA$$ and $$\beta := A^{-1} \theta$$ so that $$X\theta = Z\beta$$. Then we can equivalently investigate the behavior of</p> \begin{aligned} % \hat{\beta}(\vec{w}) ={}&amp; \left((\vec{w}\odot Z)^T Z\right)^{-1} (\vec{w}\odot Z)^T \vec{y}.
% \end{aligned} <p>If we write $$Z_1$$ for the first column of $$Z$$ and $$Z_{2:D}$$ for the $$N \times (D - 1)$$ remaining columns, we have</p> \begin{aligned} % Z= \left(\begin{array}{cc} Z_1 &amp; Z_{2:D} \end{array}\right) = \left(\begin{array}{cc} Xa_1 &amp; XA_{2:D} \end{array}\right) = \left(\begin{array}{cc} X_d a_1 &amp; X_d A_{2:D} \\ 0_{N_{k}} &amp; X_k A_{2:D} \\ \end{array}\right), % \end{aligned} <p>where we have used the definition of $$a_1$$ and the partition of $$X$$ from above.</p> <h3 id="consider-a-straight-path-from-1_n-to-vecw">Consider a straight path from $$1_N$$ to $$\vec{w}$$</h3> <p>Define $$\vec{w}(t) = (\vec{w}- 1_N) t+ 1_N$$ for $$t\in [0, 1]$$, so that $$\vec{w}(0) = 1_N$$ and $$\vec{w}(1) = \vec{w}$$. We can now write an explicit formula for $$\hat{\beta}(\vec{w}(t))$$ as a function of $$t$$ and consider what happens as $$t\rightarrow 1$$.</p> <p>Because $$\vec{w}$$ has zeros in its first $$N_{d}$$ entries,</p> \begin{aligned} % \vec{w}(t) \odot Z= \left(\begin{array}{cc} (1-t) X_d a_1 &amp; (1-t) X_d A_{2:D} \\ 0_{N_{k}} &amp; X_k A_{2:D} \\ \end{array} \right) % \end{aligned} <p>and</p> \begin{aligned} % \left((\vec{w}(t)\odot Z)^T Z\right) ={}&amp; \left(\begin{array}{cc} (1-t) a_1^T X_d^T X_d a_1 &amp; (1-t) a_1^T X_d^T X_d A_{2:D}\\ (1-t) A_{2:D}^T X_d^T X_d a_1 &amp; A_{2:D}^T ( X_k^T X_k + (1-t) X_d^T X_d )A_{2:D} \\ \end{array}\right). % \end{aligned} <p>Since the upper-left entry $$(1-t) a_1^T X_d^T X_d a_1 \rightarrow 0$$ as $$t\rightarrow 1$$, we can see again that the regression is singular when evaluated at $$\vec{w}$$.</p> <p>However, by partitioning $$\vec{y}$$ into dropped and kept components, $$\vec{y}_d$$ and $$\vec{y}_k$$ respectively, we also have</p> \begin{aligned} % (\vec{w}(t) \odot Z)^T \vec{y}={}&amp; \left(\begin{array}{c} (1-t) a_1^T X_d^T \vec{y}_d\\ A_{2:D}^T \left(X_k^T \vec{y}_k + (1-t) X_d^T \vec{y}_d \right) \end{array}\right).
% \end{aligned} <p>One can perhaps see at this point that the $$(1-t)$$ will cancel in the numerator and denominator of the regression. This can be seen directly using the Schur complement: letting $$\hat{\beta}_1$$ denote the first element of $$\hat{\beta}$$, and keeping only the leading terms in each block, we can write</p> \begin{aligned} % \hat{\beta}_1(\vec{w}(t)) = \frac{(1-t) a_1^T X_d^T \vec{y}_d}{ (1-t) a_1^T X_d^T X_d a_1 - O((1-t)^2) } = \frac{a_1^T X_d^T \vec{y}_d}{ a_1^T X_d^T X_d a_1 - O((1-t)) } % \end{aligned} <p>where $$O(\cdot)$$ denotes a term of the specified order as $$t\rightarrow 1$$. We can see that, as $$t\rightarrow 1$$, $$\hat{\beta}_1(\vec{w}(t))$$ in fact varies smoothly, and we can expect linear approximations to work as well as they might even without the singularity. Formally, the singularity is a “removable singularity,” analogous to when a factor cancels in a ratio of polynomials.</p> <p>An analogous argument holds for the rest of the $$\hat{\beta}$$ vector, again using the Schur complement. Since $$\hat{\theta}$$ is simply a linear transform of $$\hat{\beta}$$, the same reasoning applies to $$\hat{\theta}$$ as well.</p> <h3 id="conslusions-and-consequences">Conclusions and consequences</h3> <p>Though the regression problem is singular precisely at $$\vec{w}$$, it is in fact well-behaved in a neighborhood of $$\vec{w}$$. This is because re-weighting downweights the dropped rows in both the numerator and the denominator of the regression, and the vanishing factors cancel. Singularity occurs only when entries of $$\vec{w}$$ are precisely zero. For theoretical and practical purposes, you can completely avoid the problem by simply considering weight vectors that are not precisely zero at the left-out points, taking instead some arbitrarily small values.</p> <p>The most common way this seems to occur in practice is when a weight vector drops all the levels of some indicator.
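To make the workaround concrete, here is a hypothetical R sketch (with simulated data, not from the paper): instead of setting the dropped weights exactly to zero, give them a tiny positive value, and the weighted Gram matrix remains invertible:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
n = 100
group = rep(c(0, 1), each = n / 2)      # an indicator regressor
x = cbind(1, group, rnorm(n))
y = x %*% c(1, 2, 0.5) + rnorm(n)
w = rep(1, n)
w[group == 1] = 1e-8                    # nearly (but not exactly) drop every group == 1 row
xtwx = t(w * x) %*% x                   # weighted Gram matrix; singular if the weights were exactly zero
beta_hat = solve(xtwx, t(w * x) %*% y)  # still solvable
</code></pre></div></div> <p>With exactly zero weights, the indicator column of the weighted regressors would vanish and <code class="language-plaintext highlighter-rouge">solve</code> would fail.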
It is definitely on my TODO list to find some way to allow <code class="language-plaintext highlighter-rouge">zaminfluence</code> to deal with this gracefully.</p> <p>Note that this analysis leaned heavily on the structure of linear regression. In general, when the Hessian matrix of the objective function is nearly singular, it will be associated with non-linear behavior of $$\hat{\theta}(\vec{w}(t))$$ along a path from $$1_N$$ to $$\vec{w}$$. Linear regression is rather a special case.</p>How does AMIP work for regression when the weight vector induces collinearity in the regressors? This problem came up in our paper, as well as for a couple of users of zaminfluence. Superficially, the higher-order infinitesimal jackknife has nothing to say about such a point, since a requirement for the accuracy of the approximation is that the Hessian matrix be uniformly non-singular. However, we shall see that, in the special case of regression, we can re-express the problem so that the singularity disappears.Fiducial inference and the interpretation of confidence intervals.2021-12-09T10:00:00+00:002021-12-09T10:00:00+00:00/philosophy/2021/12/09/fidual_cis<h2 id="why-is-it-so-hard-to-think-correctly-about-confidence-intervals">Why is it so hard to think “correctly” about confidence intervals?</h2> <p>I came across the following section in the (wonderful) textbook <a href="https://moderndive.com/8-confidence-intervals.html">ModernDive</a>:</p> <blockquote> <p>Let’s return our attention to 95% confidence intervals. … A common but incorrect interpretation is: “There is a 95% probability that the confidence interval contains p.” Looking at Figure 8.27, each of the confidence intervals either does or doesn’t contain p. In other words, the probability is either a 1 or a 0.</p> </blockquote> <p>(Although I’m going to pick on this quote a little bit, I want to stress that I love this textbook.
This view of CIs is extremely common and I might well have taken a similar quote from any number of other sources. This book just happened to be in front of me today.)</p> <p>I understand what the authors are saying. Given the data we observed and CI we computed, there is no remaining randomness — either the parameter is in the interval, or it isn’t. The parameter is not random, the data is. But I think there is room to admit that this point, while technically clear, is a little uncomfortable, even for those of us who are very familiar with these concepts. After all, there is a 95% chance that a randomly chosen interval contains the parameter. I chose an interval. Why can I no longer say that there is a 95% chance that the parameter is in that interval? To a beginning student of statistics who is encountering this idea for the first time, this commonplace qualification must seem pedantic at best and confusing at worst.</p> <h2 id="fiducial-inference">Fiducial inference</h2> <p>Chapter 9 of Ian Hacking’s <em>Logic of Statistical Inference</em> contains a beautiful account of precisely <em>why</em> we are so inclined towards the “incorrect” interpretation, as well as the shortcomings of our intuition. The logic is precisely that of Fisher’s famous (infamous?) fiducial inference. Understanding this connection not only helps us to better understand CIs (and their modes of failure), but also to be more sympathetic to the inherent reasonableness of students who are disinclined to let go of the “incorrect” interpretation.</p> <p>As presented by Hacking, there are two relatively uncontroversial building blocks of fiducial inference, and one problematic one. Recall the idea that aleatoric probabilities (the stuff of gambling devices) and epistemic probabilities (degrees of epistemic belief) are fundamentally different quantities. 
(Hacking treats this question better than I could, but I also have <a href="/philosophy/2021/08/22/what_is_statistics.html">a short post on this topic here</a>). Following Hacking, I will denote the aleatoric probabilities by $$P$$ and the degrees of belief by $$p$$.</p> <h3 id="assumption-one-the-frequency-principle">Assumption one: The “frequency principle.”</h3> <p>The first necessary assumption of fiducial inference is this:</p> <blockquote> <p>If you know nothing about an aleatoric event $$E$$ other than its probability, then $$p(E) = P(E)$$.</p> </blockquote> <p>This amounts to saying that, for pure gambling devices, absent other information, your subjective belief about whether an outcome occurs should be the same as the frequency with which that outcome occurs under randomization. If you know a coin comes up heads 50% of the time ($$P(heads) = 0.5$$), then your degree of certainty that it will come up heads on the next flip should be the same ($$p(heads) = 0.5$$). Hacking calls this assumption the “frequency principle.”</p> <h3 id="assumption-two-the-logic-of-support">Assumption two: The “logic of support.”</h3> <p>The second fundamental assumption is that the logic of epistemic probabilities should be the same as the logic of aleatoric probabilities. Specifically:</p> <blockquote> <p>Degrees of belief should obey Kolmogorov’s axioms.</p> </blockquote> <p>For example, if events $$H$$ and $$I$$ are logically mutually exclusive, then $$p(H \textrm{ or } I) = p(H) + p(I)$$. Conditional probabilities such as $$p(H \vert E)$$ are a measure of how much the event $$E$$ supports a subjective belief that $$H$$ occurs.</p> <p>Neither the frequency principle nor the logic of support is particularly controversial, even for avowed frequentists. Note that assumption one describes only how you arrive at subjective beliefs about systems you know to be aleatoric, and assumption two describes only how subjective beliefs combine coherently.
So there is nothing really Bayesian here.</p> <h2 id="hypothesis-testing-and-fiducial-inference">Confidence intervals and fiducial inference</h2> <p>Applying the frequency principle and the logic of support to confidence intervals, together with an additional (more controversial) logical step, will in fact lead us directly to the “incorrect” interpretation of a confidence interval. Let’s see how the logic works.</p> <p>Suppose we have some data $$X$$, and we want to know the value of some parameter $$\theta$$. Suppose we have constructed a valid confidence set $$S(X)$$ such that $$P(\theta \in S(X)) = 0.95$$. Following Hacking, let $$D$$ denote the event that our setup is correct — specifically, that we are correct about the randomness of $$X$$ and that $$S(X)$$ is a valid CI with the desired coverage. That is, given $$D$$, we assume that $$X$$ is really random, and we know the randomness, so $$P$$ is a true aleatoric probability — no subjective belief here.</p> <p>Of course, the construction of a confidence interval guarantees only the aleatoric probability — thus we have used $$P$$, not $$p$$. However, by the frequency principle, we are justified in writing</p> <p>$$p(\theta \in S(X) \vert D) = P(\theta \in S(X)\vert D) = 0.95$$,</p> <p>so long as we know nothing other than the accuracy of our setup $$D$$. (Note that the indicator that $$\theta \in S(X)$$ is a pivot. In general, pivots play a central role in fiducial inference.)</p> <p>Note that $$p(\theta \in S(X) \vert D)$$ is very near to our “incorrect” interpretation of confidence intervals! However, in reality, we know more than $$S(X)$$: we actually observe $$X$$ itself. Now, $$P(\theta \in S(X) \vert D, X)$$ is either $$0$$ or $$1$$. Conditional on $$X$$, there is no remaining aleatoric uncertainty to which we can apply the frequency principle.
And most authors — including those of the quote that opened this post — stop here.</p> <p>There is an additional assumption, however, that allows us to formally compute $$p(\theta \in S(X) \vert X, D)$$, and it is this (controversial) assumption that is at the core of fiducial inference:</p> <h3 id="assumption-three-irrelevance">Assumption three: Irrelevance</h3> <blockquote> <p>The full data $$X$$ tells us nothing more about $$\theta$$ (in an epistemic sense) than the confidence interval $$S(X)$$.</p> </blockquote> <p>In the case of confidence intervals, the assumption of irrelevance requires at least two things. First, it requires that our subjective belief that $$\theta \in S(X)$$ does not depend on the particular interval that we compute from the data. In other words, we are as likely to believe that our CI contains the parameter no matter where its endpoints lie. Second, it requires that there is nothing more to be learned about the parameter from the data <em>other</em> than the information contained in the CI.</p> <p>These are strong assumptions! However, when they hold, they justify the “incorrect” interpretation of confidence intervals — namely, that there is a 95% subjective probability that $$\theta \in S(X)$$, given the data we observed. For, under the assumption of irrelevance, by the logic of support (and then the frequency principle as above) we can write</p> <p>$$p(\theta \in S(X) \vert X, D) = p(\theta \in S(X) \vert D) = P(\theta \in S(X) \vert D) = 0.95$$.</p> <h2 id="how-does-this-go-wrong-and-what-does-it-mean-for-teaching">How does this go wrong, and what does it mean for teaching?</h2> <p>Assumption three is often hard to justify, or outright fallacious. But one of its strengths is that it points to <em>how</em> the logic of fiducial inference fails, when it does fail. In particular, it is not hard to construct valid confidence intervals that contain only impossible values of $$\theta$$ for some values of $$X$$.
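To make this concrete, here is a classic toy example (not from Hacking) written in R: a procedure that ignores the data entirely, returning the whole real line 95% of the time and an empty interval otherwise, is nevertheless a valid 95% confidence procedure:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(0)
# A valid 95% confidence procedure for any theta that never looks at x.
silly_ci = function(x) {
  if (runif(1) &lt; 0.95) c(-Inf, Inf) else c(1, -1)  # c(1, -1) is the empty interval
}
theta = 3
covered = replicate(1e4, {
  ci = silly_ci(rnorm(10, mean = theta))
  ci[1] &lt;= theta &amp;&amp; theta &lt;= ci[2]
})
mean(covered)  # close to 0.95 by construction
</code></pre></div></div> <p>The assumption of irrelevance plainly fails here: having seen the realized interval, nobody would maintain a 95% subjective probability that it covers $$\theta$$.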
(As long as a confidence interval takes on crazy values sufficiently rarely, there is nothing in the definition preventing it from doing so.) In fact, as Hacking points out, confidence intervals are tools for <em>before</em> you see the data, designed so that you do not make mistakes too often on average, and they can suggest strange conclusions once you have seen a particular dataset.</p> <p>However, it’s not crazy for someone, especially a beginning student, to subscribe to assumption three, even if they are not aware of it. After all, we typically present a confidence interval as <em>the</em> way to summarize what your data tells you about your parameter. And if that’s the case, then the “incorrect” interpretation of CIs follows from the extremely plausible frequency principle and logic of support. At the least, I think we should acknowledge the reasonableness of this logical chain, and teach when it goes wrong rather than simply reject it by fiat.</p>Why is it so hard to think “correctly” about confidence intervals?To think about the influence function, think about sums.2021-12-01T10:00:00+00:002021-12-01T10:00:00+00:00/amip/2021/12/01/influence_is_sum<p>I think the key to thinking intuitively about the influence function in our <a href="https://arxiv.org/abs/2011.14999">work on AMIP</a> is this: Linearization approximates a complicated estimator with a simple sum. If you can establish that the linearization provides a good approximation, then you can reason about your complicated estimator by reasoning about sums. And sums are easy to reason about.</p> <p>Specifically, suppose you have data weights $$w = (w_1, \ldots, w_N)$$ and an estimator $$\phi(w) \in \mathbb{R}$$ which depends on the weights in some complicated way.
Let $$\phi^{lin}$$ denote the first-order Taylor series expansion around the unit weight vector $$\vec{1} := (1, \ldots, 1)$$</p> $\phi^{lin}(w) := \phi(\vec{1}) + \sum_{n=1}^N \psi_n (w_n - 1) = \phi(\vec{1}) + \sum_{n=1}^N \psi_n w_n \quad\textrm{where}\quad \psi_n := \frac{\partial \phi(w)}{\partial w_n}\Bigg|_{\vec{1}},$ <p>and we have used the fact that $$\sum_{n=1}^N \psi_n = 0$$ for Z-estimators. (For situations where $$\sum_{n=1}^N \psi_n \ne 0$$, just keep that sum around, and everything I say in this post still applies.) Thinking now of $$\psi$$ as data, we can (in some abuse of notation) write $$\phi^{lin}(\psi) = \phi(\vec{1}) + \sum_{n=1}^N \psi_n$$. If $$\phi^{lin}(w)$$ is a good approximation to $$\phi(w)$$, then the effect of leaving a datapoint out of $$\phi(w)$$ is well-approximated by the effect of leaving the corresponding entry out of $$\psi$$ in $$\phi^{lin}(\psi)$$. We have, in effect, replaced a complicated data dependence with a simple sum of terms. This is what linearization does for us. (NB: if our original estimator had been a sum of the data, the linearization would be exact!)</p> <p>Typically $$\psi_n = O_p(N^{-1})$$, so it’s a little helpful to define $$\gamma_n := N \psi_n$$. We then can write:</p> $\phi^{lin}(\gamma) := \phi(\vec{1}) + \frac{1}{N}\sum_{n=1}^N \gamma_n.$ <p>We can now ask what kinds of changes we can produce in $$\phi^{lin}(\gamma)$$ by dropping entries from $$\gamma$$ (while keeping $$N$$ the same), and some of the core conclusions of our paper become obvious. Definitionally, $$\sum_{n=1}^N \gamma_n = 0$$. For example, if we drop $$\alpha N$$ points, for some fixed $$0 &lt; \alpha &lt; 1$$, then the amount we can change the sum $$\frac{1}{N}\sum_{n=1}^N \gamma_n$$ does not vanish, no matter how large $$N$$ is, and no matter how small $$\alpha$$ is. 
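Here is a quick R illustration using the simplest possible estimator, the sample mean, for which a standard calculation gives $$\gamma_n = x_n - \bar{x}$$:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(2)
n = 1e5
x = rnorm(n)
gamma = x - mean(x)               # influence scores (times N) for the sample mean
alpha = 0.01
k = floor(alpha * n)
drop_idx = order(gamma)[1:k]      # the k most negative influence scores
delta = -sum(gamma[drop_idx]) / n # approximate change from dropping them
delta                             # an upward shift that does not vanish as n grows
</code></pre></div></div> <p>The size of this shift depends on $$\alpha$$ and on the distribution of the scores, but not on $$N$$.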
The amount you can change the sum $$\frac{1}{N}\sum_{n=1}^N \gamma_n$$ also obviously depends on the tail shape of the distribution of the $$\gamma_n$$, as well as their absolute scale. Increasing the scale (i.e., increasing the noise) obviously increases the amount you can change the sum. And, for a given scale (i.e., a given $$\frac{1}{N} \sum_{n=1}^N \gamma_n^2$$), you will be able to change the sum the most when the left-out $$\gamma_n$$ all take the same value.</p> <p>So one way to think about AMIP is this: we provide a good approximation to your original statistic that takes the form of a simple sum over your data. Dropping datapoints corresponds to dropping data from this sum. You can then think about whether dropping sets that are selected in a certain way is reasonable or not in terms of dropping entries from a sum, about which it’s easy to have good intuition!</p>I think the key to thinking intuitively about the influence function in our work on AMIP is this: Linearization approximates a complicated estimator with a simple sum. If you can establish that the linearization provides a good approximation, then you can reason about your complicated estimator by reasoning about sums. And sums are easy to reason about.The bootstrap randomly queries the influence function.2021-11-08T10:00:00+00:002021-11-08T10:00:00+00:00/amip/2021/11/08/bootstrap_influence<p>When we present our <a href="https://arxiv.org/abs/2011.14999">work on AMIP</a>, the relationship with the bootstrap often comes up. I think there’s a lot to say, but there’s one particularly useful perspective: the (empirical, nonparametric) bootstrap can be thought of as <em>randomly querying the influence function</em>.
From this perspective, it seems both clear (a) why the bootstrap works as an estimator of variance and (b) why it won’t work to find the approximately most influential set, i.e., the set of points which have the most extreme values of the influence function (AMIS in our paper).</p> <p>Let’s suppose that you have a vector $$\psi \in \mathbb{R}^N$$, with $$\sum_{n=1}^N \psi_n = 0$$, where $$N$$ is very large. We would like to know about $$\psi$$, but suppose we can’t access it directly. Rather, we can only query it via inner products $$v^T \psi$$. Moreover, suppose we can only compute $$B$$ such inner products, where $$B \ll N$$. For the purpose of this post, $$\psi$$ will be the influence scores, $$v$$ will be rescaled bootstrap weight vectors, $$N$$ will be the number of data points, and $$B$$ the number of bootstrap samples. But the discussion can start out more generally.</p> <p>Suppose we don’t know anything a priori about $$\psi$$, so we query it randomly, drawing IID entries for $$v$$ from a distribution with mean zero and unit variance. Let the $$b$$-th random vector be denoted $$v^{b}$$. We can ask: What can the collection of inner products $$V_B := \{v^{1} \cdot \psi, \ldots, v^{B} \cdot \psi \}$$ tell us about $$\psi$$?</p> <p>At first glance, the answer seems to be “not much other than the scale.” The set $$V_B$$ tells us about the projection of $$\psi$$ onto a $$B$$-dimensional subspace, out of $$N \gg B$$ total dimensions. Furthermore, since $$\mathbb{E}[v \cdot \psi] = \sum_{n=1}^N \mathbb{E}[v_n] \psi_n = 0$$, the vectors $$v^b$$ are, on average, orthogonal to $$\psi$$. So we do not expect the projection of $$\psi$$ onto the space spanned by the $$v^{b}$$ to account for an appreciable proportion of $$|| \psi ||_2$$.
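Here is a quick R simulation of this setup with a hypothetical $$\psi$$; the sample variance of the $$B$$ random queries recovers the squared scale $$||\psi||_2^2$$:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(3)
n = 1e5
psi = rnorm(n)
psi = psi - mean(psi)   # hypothetical influence scores, summing to zero
b = 200                 # number of random queries, with b much smaller than n
queries = replicate(b, sum(rnorm(n) * psi))
var(queries)            # approximately sum(psi^2), i.e., the squared norm
sum(psi^2)
</code></pre></div></div> <p>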
The set $$V_B$$ <em>can</em> estimate the scale $$|| \psi ||_2$$, however, since $$\mathrm{Var}(v \cdot \psi) = \sum_{n=1}^N \mathbb{E}[v_n^2] \psi_n^2 = \sum_{n=1}^N \psi_n^2 = ||\psi||_2^2$$, and $$\mathrm{Var}(v \cdot \psi)$$ can be estimated using the sample variance of $$v^b \cdot \psi$$.</p> <p>Note that the bootstrap is very similar to drawing $$v_n + 1 \sim \mathrm{Poisson}(1)$$; the proper bootstrap actually has some correlation between different entries due to the constraint $$\sum_{n=1}^N v_n = 0$$, but this correlation is of order $$1/N$$ and can be neglected for simplicity in the present argument. The argument of the previous paragraph implies that the bootstrap effectively randomly projects $$\psi$$ onto a very low-dimensional subspace, presumably losing most of its detail in doing so. It also makes sense that the bootstrap can tell us about $$||\psi||_2$$ — recall that $$||\psi||_2^2$$ consistently estimates the variance of the limiting distribution of our statistic, a quantity that we know the bootstrap is also able to estimate.</p> <p>Recall that the AMIP from our paper is $$-\sum_{n=1}^{\lfloor \alpha N \rfloor} \psi_{(n)}$$, where $$\psi_{(n)}$$ is the $$n$$-th sorted entry of the $$\psi$$ vector. From the argument sketch above, I conjecture that the bootstrap distribution actually doesn’t convey much information about the AMIP other than the limiting variance. In particular, in the terms of our paper, I conjecture that the bootstrap can tell us about the “noise” of AMIP but not the “shape.”</p> <p>Incidentally, the above perspective is also relevant for situations where we cannot form and / or invert the full Hessian matrix $$H$$, and so cannot compute $$\psi$$ directly. If we imagine sketching $$H^{-1}$$, e.g. by using the conjugate gradient method applied to random vectors, we would run into a problem conceptually quite similar to the bootstrap.
It’s interesting to think about how one could improve on random sampling in such a case.</p>When we present our work on AMIP, the relationship with the bootstrap often comes up. I think there’s a lot to say, but there’s one particularly useful perspective: the (empirical, nonparametric) bootstrap can be thought of as randomly querying the influence function. From this perspective, it seems both clear (a) why the bootstrap works as an estimator of variance and (b) why it won’t work to find the approximately most influential set, i.e., the set of points which have the most extreme values of the influence function (AMIS in our paper).Saint Augustine and chance.2021-10-27T10:00:00+00:002021-10-27T10:00:00+00:00/philosophy/2021/10/27/st_augustine<p>I came across an interesting passage in the Confessions of Saint Augustine at the end of section (5) of Vindicianus on Astronomy. Augustine is describing a period in his early life when he was, to his later shame, interested in fortune-telling. In this particular passage, a friend is trying to helpfully convince his younger self that fortune-telling is nonsense. Augustine writes of the exchange:</p> <blockquote> <p>“I asked him why it was that many of their forecasts turned out to be correct. He replied that the best answer he could give was the power apparent in lots, a power everywhere diffused in the nature of things. So when someone happens to consult the pages of a poet whose verses and intention are concerned with a quite different subject, in a wonderful way a verse often emerges appropriate to the decision under discussion.
He used to say that it was no wonder if, from the human soul, by some higher instinct that does not know what goes on within itself, some utterance emerges not by art but by ‘chance’ which is in sympathy with the affairs or actions of the inquirer.”</p> </blockquote> <p>In his Confessions, Saint Augustine sees divine will in every aspect of life and, moreover, he is writing at the end of the fourth century. So of course his conception of chance will differ from our modern one. Still, it is striking that, as he is trying to assert precisely that fortune tellers are correct only by accident, his concept of “accident” does not admit anything like modern randomness.</p> <p>Suppose a fortune teller flips a coin to predict an outcome that itself occurs half the time and is subsequently correct half the time. We account for this probabilistically, assert that the randomness of the coin flip disconnects the prediction from the outcome, and say that the co-occurrence of prediction and outcome is the overlap of unrelated events. Augustine seems to want to say something similar, but cannot commit himself to the disconnect — he attributes correct predictions to “some higher instinct” beyond the control of the fortune teller which, nevertheless, kicks in only some of the time.</p>Three ridiculous hypothesis tests.2021-09-30T10:00:00+00:002021-09-30T10:00:00+00:00/frequentist/statistics/2021/09/30/four_crazy_hypothesis_tests<p>There are lots of reasons to dislike p-values. 
Despite their inherent flaws, over-interpretation, and risks, it is extremely tempting to argue that, absent other information, the smaller the p-value, the less plausible the null hypothesis. For example, the venerable Prof. Philip Stark (who I admire and who was surely choosing his words very carefully), writes in <a href="https://figshare.com/articles/dataset/The_ASA_s_statement_on_p_values_context_process_and_purpose/3085162/4?file=5368499">“The Value of p-Values”</a>:</p> <blockquote> <p>“Small p-values are stronger evidence that the explanation [the null hypothesis] is wrong: the data cast doubt on the explanation.”</p> </blockquote> <p>For p-values based on reasonable hypothesis tests with no other information, I think that Prof. Stark is (usually, mostly) correct to say this. But there is nothing in the definition of a hypothesis test that requires it to be reasonable without a consideration of <em>power</em>, and power does not enter the definition of a p-value.</p> <p>So, to motivate the importance of explicit power considerations in the use of hypothesis tests and p-values, let me describe three ridiculous but valid hypothesis tests. None of this is new, but perhaps it will be fun to have these examples all in the same place.</p> <h1 id="hypothesis-tests-and-p-values">Hypothesis tests and p-values</h1> <p>I will begin by reviewing the definition of p-values in the context of hypothesis testing. Let our random data $$X$$ take values in a measurable space $$\Omega_x$$. The distribution of $$X$$ is posited to lie in some class of distributions $$\mathcal{P}_\theta$$ indexed by a parameter $$\theta \in \Omega_\theta$$. 
A simple null hypothesis $$H_0$$ specifies a value $$\theta_0 \in \Omega_\theta$$ that fully specifies the distribution of the data, which we will write as $$P(X | \theta_0)$$.</p> <h4 id="hypothesis-tests">Hypothesis tests</h4> <p>A valid test of $$H_0$$ with level $$\alpha$$ consists of two parts:</p> <ul> <li>A measurable test statistic, $$T: \Omega_x \mapsto \Omega_T$$, possibly incorporating additional randomness, and</li> <li>A region $$A(\alpha) \subseteq \Omega_T$$ such that $$P(T(X) \in A(\alpha) | H_0) \le \alpha$$.</li> </ul> <h4 id="p-values">P-values</h4> <p>Often (as in our simple example), the regions are nested in the sense that $$\alpha_1 &lt; \alpha_2 \Rightarrow A(\alpha_1) \subset A(\alpha_2)$$. Stricter tests result in smaller rejection regions. In such a case, we can define the p-value of a particular observation $$x$$ as the smallest $$\alpha$$ which rejects $$H_0$$ for that $$x$$.</p> <h4 id="a-simple-example">A simple example</h4> <p>A simple example, which suffices for the whole of this post, is data $$X = (X_1, \ldots, X_N)$$, drawn i.i.d. from a $$\mathcal{N}(\theta, 1)$$ distribution. The classical two-sided test, which is eminently reasonable for many situations, uses</p> <ul> <li>$$T(X) = \sqrt{N} \vert \bar{X} - \theta_0 \vert$$ and</li> <li>$$A(\alpha) = \left\{x: T(x) \ge \Phi^{-1}(1 - \alpha / 2) \right\}$$,</li> </ul> <p>where $$\bar{X}$$ is the sample average and $$\Phi$$ the cumulative distribution function of the standard normal. Today I will not quibble with this test, but rather propose absurd alternatives.</p> <p>In the case of our simple example, the p-value is simply $$p(x) = 2(1 - \Phi(T(x)))$$. As $$T(x)$$ increases, the p-value $$p(x)$$ decreases.</p> <h4 id="the-reasoning">The reasoning</h4> <p>If $$\alpha$$ is small (say, the much-loathed $$0.05$$), and we observe $$x$$ such that $$T(x) \in A(\alpha)$$, the argument goes that $$x$$ constitutes evidence against $$H_0$$, since such an outcome was improbable if $$H_0$$ were true. 
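To make the simple example concrete, here is a short sketch of my own (assuming <code>numpy</code> and <code>scipy</code> are available; not part of the original post) that simulates data under $$H_0$$ and computes $$T(x)$$ and the p-value $$p(x) = 2(1 - \Phi(T(x)))$$.

```python
# Sketch of the simple example: N i.i.d. N(theta_0, 1) draws, the classical
# two-sided statistic T(x), and the p-value p(x) = 2 * (1 - Phi(T(x))).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, theta_0, alpha = 100, 0.0, 0.05

x = rng.normal(loc=theta_0, scale=1.0, size=N)  # data generated under H_0
T = np.sqrt(N) * abs(x.mean() - theta_0)        # test statistic
p_value = 2.0 * (1.0 - norm.cdf(T))             # two-sided p-value

# Rejecting when T(x) >= Phi^{-1}(1 - alpha/2) is the same as p_value <= alpha.
reject = bool(T >= norm.ppf(1.0 - alpha / 2.0))
```

The last line makes the nesting of the regions explicit: the smallest $$\alpha$$ that rejects for this $$x$$ is exactly <code>p_value</code>.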
Correspondingly, smaller p-values are taken to be associated with stronger evidence against the null.</p> <p>However, one can easily construct valid tests satisfying the above definition that range from obviously incomplete to absurd. In the counterexamples below I will be happy to assume that $$H_0$$ is reasonable, even correct. (So we are not in the case of Prof. Stark’s “straw man”, ibid.) Missing in each of the increasingly egregious counterexamples that I will describe is a consideration of <em>power</em>: an explicit consideration of the ability of the test to reject when $$\theta \ne \theta_0$$.</p> <h1 id="three-ridiculous-hypothesis-tests">Three ridiculous hypothesis tests</h1> <h4 id="example-1-throw-away-most-of-the-data">Example 1: Throw away most of the data</h4> <p>Suppose we use the simple example above, but throw away all but the first datapoint. So our hypothesis test is</p> <ul> <li>$$T(x) = \vert x_1 - \theta_0 \vert$$, $$A(\alpha)$$ as above.</li> </ul> <p>This test is valid, as are its p-values. In this case, it is true that smaller p-values cast further doubt on the hypothesis (and Prof. Stark’s quote is true). But the increment in evidence is small, since a single datapoint is much less informative than the full dataset.</p> <p>Missing from this silly test is the fact that, by using all the data, one can construct a strictly larger rejection region — and so a test with more power — with the same level.</p> <h4 id="example-2-throw-away-all-the-data">Example 2: Throw away all the data</h4> <p>Since we can use randomness in our test statistic, let us define</p> <ul> <li>$$T(x) \sim \textrm{Uniform}[0,1]$$.</li> <li>$$A(\alpha) = \left\{x: T(x) \le \alpha \right\}$$.</li> </ul> <p>This test has the correct level and valid p-values, but has nothing at all to do with the data or $$H_0$$. 
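A quick simulation (my own sketch, assuming <code>numpy</code> and <code>scipy</code>; not from the original post) makes the power gap explicit: under an alternative such as $$\theta = 0.3$$, the full-data classical test rejects most of the time, while the data-free uniform test of Example 2 still rejects with probability $$\alpha$$.

```python
# Sketch: rejection rates under theta = 0.3 for the classical full-data test
# versus Example 2's data-free uniform "test" (whose power is alpha everywhere).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N, theta_0, theta_true, alpha, trials = 100, 0.0, 0.3, 0.05, 2000
crit = norm.ppf(1.0 - alpha / 2.0)

# Classical test: simulate `trials` datasets under the alternative.
x = rng.normal(theta_true, 1.0, size=(trials, N))
T = np.sqrt(N) * np.abs(x.mean(axis=1) - theta_0)
power_classical = float(np.mean(T >= crit))   # much larger than alpha

# Example 2: T(x) ~ Uniform[0,1] regardless of the data.
power_uniform = float(np.mean(rng.uniform(size=trials) <= alpha))  # about alpha
```

Both tests have level $$\alpha$$, so nothing in the definition of a valid test or a valid p-value separates them; only the power comparison does.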
It also generates valid confidence intervals, which are either the whole space $$\Omega_\theta$$ or $$\emptyset$$, with probabilities $$1 - \alpha$$ and $$\alpha$$ respectively.</p> <p>The book “Testing Statistical Hypotheses” (Lehmann and Romano) defines p-values for randomized tests as the smallest $$\alpha$$ which rejects with probability one. Using this definition, p-values for this case would always be $$1$$. So, by this technicality, p-values slip through the cracks of this counterexample. However, I would argue that one could just as well augment the data space with $$[0,1]$$ and consider the uniform draw to be “data” rather than part of the hypothesis, in which case the p-value is simply $$\textrm{Uniform}[0,1]$$-distributed, independent of the data.</p> <p>The problem with this test is that it has no greater power to reject under the alternative than under the null. Again, it is a consideration of power, rather than the definition of a valid test, that reveals the nature of the flaw.</p> <h4 id="example-3-construct-crazy-regions">Example 3: Construct crazy regions</h4> <p>Let us use $$T(x)$$ as in the simple example, but use the region $$A(\alpha) = \left\{x: T(x) \le \Phi^{-1}((\alpha + 1) / 2) \right\}$$. These regions have the correct levels, but they reject when $$T(x)$$ is <em>small</em> — that is, when $$\bar{x}$$ is close to $$\theta_0$$ — rather than when it is far away. These tests will have high power against alternatives which are very close to $$\theta_0$$, but no power against large deviations. Values of $$T(x)$$ which are very large will have large p-values, whereas the smallest p-values occur when $$T(x) \approx 0$$, i.e., when $$\bar{x} \approx \theta_0$$.</p> <p>There are at least two ways to think about what is wrong with this test. One is that it produces rejection regions with smaller-than-optimal Lebesgue measure — the total length of the rejection region is much smaller than the classical test’s. 
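Example 3 really does have level $$\alpha$$: under $$H_0$$, $$T(X)$$ is the absolute value of a standard normal, so $$P(T(X) \le \Phi^{-1}((\alpha + 1)/2)) = 2 \cdot (\alpha + 1)/2 - 1 = \alpha$$. The sketch below (my own, assuming <code>numpy</code> and <code>scipy</code>) checks this numerically and shows the test’s powerlessness against a distant alternative.

```python
# Sketch: the "crazy region" of Example 3 has level alpha under H_0,
# but essentially zero power against the distant alternative theta = 1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, theta_0, alpha, trials = 100, 0.0, 0.05, 4000
crit = norm.ppf((alpha + 1.0) / 2.0)   # reject when T(x) is *small*

def reject_rate(theta):
    """Monte Carlo rejection rate of the crazy-region test when data ~ N(theta, 1)."""
    x = rng.normal(theta, 1.0, size=(trials, N))
    T = np.sqrt(N) * np.abs(x.mean(axis=1) - theta_0)
    return float(np.mean(T <= crit))

level = reject_rate(theta_0)   # close to alpha = 0.05
power_far = reject_rate(1.0)   # essentially zero
```

So the test is perfectly valid by the definitions above, and only the power calculation against $$\theta$$ far from $$\theta_0$$ exposes the absurdity.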
Another is that it has highest power against alternatives that we (typically) care the least about, which are values of $$\theta$$ that produce nearly the same distribution as the null. As above, power considerations are the key.</p> <h1 id="lets-think-about-power">Let’s think about power</h1> <p>Even the best available argument for p-values, hypothesis tests, and confidence intervals depends on having chosen tests in the first place that take power into account. The best statisticians (e.g. Prof. Stark and Ronald Fisher) are very good at avoiding under-powered tests, and sidestep such mistakes easily in most situations. However, for the general public, it seems to me that there is a lot of value in making power a fundamental part of teaching and talking about hypothesis tests, p-values, and confidence intervals.</p>