Jekyll2023-03-29T21:50:23+00:00/feed.xmlRyan Giordano, statistician.This is the professional webpage and open research journal of Ryan Giordano.Free will and randomness2023-03-29T16:00:00+00:002023-03-29T16:00:00+00:00/philosophy/2023/03/29/causation<p>Free will and randomness feel opposed to one another: free will is what makes us
human; randomness is the epitome of meaninglessness. But the two share a deep
affinity: they both are built on breaking with the past and with contingency.</p>
<p>All things are contingent. No event has a single cause. But it seems to us
that we have free will: the ability to make decisions, and so this seeming
capacity to intervene in the world gives rise to causal questions. If I choose
to eat this mushroom, will I get sick? Implicit in the question is the
possibility that I might choose, or not, to eat the mushroom. Of course, our
decision to intervene is itself contingent, since our minds are also part of the
world. The contingency of our own minds makes it hard to infer causation:
amongst patients who choose to go to the hospital, outcomes are worse, but this
does not mean going to the hospital makes you sicker. Or maybe I would have
slept badly anyway on the nights I choose to stay on my phone until late. Our
choices don’t <em>feel</em> very contingent; if they did, there would be no reason to
ask about causation. But we know, abstractly, that our choices may produce
selection bias in ways we don’t fully understand.</p>
<p>How, then, do we establish causation? We require a technology that breaks the
chain of contingency more effectively than our own decision-making processes.
That technology is randomness: aleatoric machines, designed originally for
gambling, which are specially constructed using symmetry and
information-discarding physical processes (spinning a wheel, flipping a coin, or rolling a die;
mixing an urn of balls; or their mathematical analogues in random number
generators). The function of these aleatoric machines is to break the bonds
between the past and the future more thoroughly than our own minds can. If an
outcome follows from the output of an aleatoric device, it cannot have followed
from anything else: if you are selected for the treatment arm of an experiment
by a coin flip, the effect of the past on your outcome is broken at the moment
of the coin flip, at least relative to what would have happened had the coin
flip turned out otherwise.</p>
<p>So free will gives meaning to causation, and causation is detected by
randomness. But there is an irony to this arrangement. Causation matters
because we make choices, the choices we make are the place where who we are
connects to the world, and the source of meaning. But the very notion of
causation requires a break from the past, a break which is most perfectly
created by randomness, the epitome of meaninglessness, as a process which is, by
design, maximally disconnected from the rest of the world. At a deep level, the
phenomena of free will and randomness are siblings: each relies on the
possibility of non-contingency. But behind free will there lies a soul, and
behind randomness, there is nothing (or worse, a tawdry betting game). The
difference between the two is one of value, though, not of kind.</p>
<p>This privileged epistemic role of aleatoric devices in establishing causation is
an extremely recent invention. The idea is often attributed to Neyman [1],
though it is certainly already present in Hume (Section VIII, Part I of [2]), who writes:</p>
<blockquote>
<p>“And if the definition above mentioned be admitted; liberty, when opposed to
necessity, not to constraint, is the same thing with chance; which is
universally allowed to have no existence.”</p>
</blockquote>
<p>(I might argue that Section VIII, and not Section VI, is the right place to look
in Hume for the origins of modern probabilistic causal inference, in contrast
to some other authors [3]).</p>
<p>But even if the origins were to be traced back to the 17th century, it is
well-established that the pre-modern world did not give aleatoric reasoning such
a privileged status [4]. To me, this opens up a range of interesting
questions. Is it only the modern epistemic prominence of randomness that allows
us to even conceive of causation this way? Does the close proximity between
meaninglessness and free will affect our view of ourselves? When we insist, in
practice, that only randomness can establish causation, do we leave behind ways
of interacting with the world that do not fit into this framework? In what other
ways can we imagine leaving contingency behind? How have other cultures and
times done so?</p>
<p>[1] Neyman, Jerzy, and Karolina Iwaszkiewicz. “Statistical problems in
agricultural experimentation.” Supplement to the Journal of the Royal
Statistical Society 2.2 (1935): 107-180.</p>
<p>[2] Hume, D. (1748). An Enquiry Concerning Human Understanding. Renascence
Editions.</p>
<p>[3] Holland, Paul W. “Statistics and causal inference.”
Journal of the American Statistical Association 81.396 (1986): 945-960.</p>
<p>[4] Hacking, Ian. The emergence of probability: A philosophical study of early
ideas about probability, induction and statistical inference. Cambridge
University Press, 2006.</p>The Popper-Miller theorem is the Bayesian transitivity paradox.2022-10-19T16:00:00+00:002022-10-19T16:00:00+00:00/philosophy/2022/10/19/popper_miller<p>Popper and Miller [1,2] proposed a tidy little paradox about inductive reasoning.
Many 20th century Bayesians (e.g. [3]) claim that Bayesian reasoning is valid
inductive reasoning. Popper, ever the enemy of induction, produced (with
Miller) the Popper-Miller (PM) theorem, which “proves” that Bayesian
“induction” is nothing but watered-down deduction.</p>
<p>The PM theorem was widely discussed at the time, and I feel there are enough
counterarguments. Here, I would like to point out something different: to argue
that the PM theorem is just the Bayesian transitivity paradox (BTP) in fancy
dress. That there is a connection between the PM theorem and the BTP was
pointed out in a brief comment by Redhead in [5], but I think the connection is
deeper and simpler than has been noticed before.</p>
<p>I’ll first say what the PM theorem and BTP are, and then show that the two are
opposite sides of the same coin. The setup, throughout, is as follows.
Suppose that we are interested in how much some observed evidence, \(e\),
supports a hypothesis \(h\), when \(h \Rightarrow e\), but \(e \not\Rightarrow
h\). For example, \(h\) might be “my coin has heads on both sides” and
\(e\) might be “I observed a heads after a single flip.”</p>
<p>I’ll use logic notation (\(\lor\) for disjunction, \(\land\) for
conjunction, \(\lnot\) for negation), but one could equally have represented
\(e\) and \(h\) as sets rather than as logical propositions (with \(\cup, \cap,
(\cdot)^c\) instead of the respective logic symbols). Note that logical
implication of propositions (\(A \Rightarrow B\)) is the same as set containment
(\(A \subseteq B\)).</p>
<p>Both the PM theorem and the BTP are stated in terms of Bayesian logic. So I’ll
begin by assuming that I have a measure \(p(\cdot)\) on propositions, where
\(p(\cdot | \cdot)\) denotes conditional probability, though my final conclusion
will be in much greater generality. In particular, \(p(h | e)\) is the
posterior credibility of \(h\) given that we observed \(e\). Popper and Miller
analyze the “support” of \(e\) for \(h\), which is defined as the difference
between the posterior and prior probabilities of \(h\), i.e., \(s(h | e) := p(h |
e) - p(h)\). When support is positive, we say \(e\) supports \(h\), and when it
is negative, we say \(e\) counter-supports \(h\). We’ll assume that \(s(h | e) >
0\) here. I assume throughout that \(p(e)\), \(p(h)\), and \(p(h | e)\) are all
strictly between \(0\) and \(1\), which is not essential, but simplifies things
a bit.</p>
<h1 id="the-popper-miller-pm-theorem">The Popper-Miller (PM) Theorem</h1>
<p>The PM theorem is based on a decomposition of \(h\) into “deductive”
and “inductive” parts, denoted \(h_D\) and \(h_I\) respectively,
with \(h = h_D \land h_I\). The
deductive part has the property that \(e \Rightarrow h_D\), and
the inductive part is supposed to capture “all of \(h\) that goes beyond
\(e\).” Their particular decomposition doesn’t matter for my purposes
(it happens to be \(h_D = h \lor e\) and \(h_I = h \lor \lnot e\)), but
it has these properties:</p>
\[\begin{align}
(A) && s(h | e) ={}& s(h_D | e) + s(h_I | e) \\
(B) && s(h_I | e) <{}& 0.
\end{align}\]
<p>Property (A) supports the notion that \(h_D\) and \(h_I\) are a “decomposition”
of the support of \(e\) for \(h\).</p>
<p>Property (B) is the PM theorem. It says that the inductive component is always
counter-supported by the evidence. One might interpret (B) as follows: that
Bayesian reasoning appears to do induction is only an illusion. The support is
merely deductive support diluted by inductive counter-support. Or so, at least,
Popper and Miller claim.</p>
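<p>To make this concrete, here is a small numeric check of (A) and (B) for the two-headed-coin example, written in Python. The prior \(p(h) = 1/2\) (with a fair coin when \(h\) is false) is an illustrative assumption of mine, not part of the Popper-Miller argument:</p>

```python
# Two-headed coin example: h = "the coin has heads on both sides",
# e = "I observed a heads after a single flip".
# Assumed joint distribution over (h, e): prior p(h) = 1/2, and the
# coin is fair when h is false.  (This prior is illustrative only.)
joint = {
    (True, True): 0.50,   # two-headed coin always shows heads
    (True, False): 0.00,
    (False, True): 0.25,  # fair coin shows heads half the time
    (False, False): 0.25,
}

def prob(event):
    """Probability of the set of (h, e) outcomes satisfying `event`."""
    return sum(p for w, p in joint.items() if event(*w))

def support(q):
    """Popper-Miller support s(q | e) = p(q | e) - p(q)."""
    p_e = prob(lambda h, e: e)
    return prob(lambda h, e: q(h, e) and e) / p_e - prob(q)

s_h = support(lambda h, e: h)            # s(h | e)   =  1/6
s_hD = support(lambda h, e: h or e)      # s(h_D | e) =  1/4
s_hI = support(lambda h, e: h or not e)  # s(h_I | e) = -1/12

assert abs(s_h - (s_hD + s_hI)) < 1e-12  # property (A)
assert s_hI < 0 < s_h                    # property (B)
```

<p>Under this assumed prior, the evidence supports \(h\) overall, the deductive part receives all of the positive support, and the inductive part is counter-supported, exactly as the PM theorem requires.</p>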
<p>The most common objection was whether reasonable alternative decompositions
exist. Of course they do, and most of the discussion in the literature was
about precisely which candidate decompositions are valid and which are not. Some
decompositions violate (A) and some violate (B). Popper and Miller argue in
several ways that their decomposition is uniquely appropriate [2], though I
think that Elby and Redhead argue convincingly that other decompositions are
reasonable [4,5]. For my purpose, all that matters is that a family of
alternative decompositions exist, some of which may violate either (A) or
(B).[*]</p>
<h1 id="the-bayesian-transitivity-paradox">The Bayesian Transitivity Paradox</h1>
<p>A consequence of (A) and (B) which is not much remarked on in the PM theorem
debate is the following:</p>
\[\begin{align*}
(C) && s(h_D | e) - s(h | e) >{}& 0.
\end{align*}\]
<p>That is, the deductive component receives greater support than the hypothesis.
This makes intuitive sense as a desideratum for a decomposition: \(h \Rightarrow
e \Rightarrow h_D\), and, intuitively, any notion of “support” should give
no more support to a hypothesis than to its logical consequences.</p>
<p>One might wonder whether it is always the case that \(s(r | e) \ge s(q | e)\) when
\(q \Rightarrow r\). It turns out that this is not necessarily the case, a
phenomenon that is known as the “Bayesian transitivity paradox” (BTP). In fact,
the \(h_I\) component of the PM theorem is an example: \(s(h | e) > 0 > s(h_I |
e)\), although \(h \Rightarrow h_I\). So the PM theorem unavoidably involves
the BTP, a point noted by Redhead [5].</p>
<p>It’s worth noting that the posterior itself does not suffer from anything like
the BTP. If \(q \Rightarrow r\), then \(p(q | e) \ge p(r | e)\), since
logical implication is the same as set containment. The BTP occurs for
\(s(\cdot \vert \cdot)\) because of the role played by the prior.</p>
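<p>The contrast can be checked numerically with the same two-headed-coin example (again under my illustrative assumption of a prior \(p(h) = 1/2\) with a fair coin otherwise): the posterior respects the implication \(h \Rightarrow h_I\), but the support does not, because \(h_I\) has the larger prior:</p>

```python
# h = "the coin has heads on both sides", e = "one flip showed heads".
# Assumed joint distribution: prior p(h) = 1/2, fair coin when h is false.
joint = {(True, True): 0.50, (True, False): 0.00,
         (False, True): 0.25, (False, False): 0.25}

def prob(event):
    """Probability of the set of (h, e) outcomes satisfying `event`."""
    return sum(p for w, p in joint.items() if event(*w))

def posterior(q):  # p(q | e)
    return prob(lambda h, e: q(h, e) and e) / prob(lambda h, e: e)

def support(q):    # s(q | e) = p(q | e) - p(q)
    return posterior(q) - prob(q)

h_prop = lambda h, e: h        # the hypothesis h
h_I = lambda h, e: h or not e  # PM inductive part; note h implies h_I

# The posterior is monotone under implication ...
assert posterior(h_prop) <= posterior(h_I)
# ... but the support is not: this is the BTP.
assert support(h_prop) > 0 > support(h_I)
# The reversal is driven entirely by the prior: p(h_I) > p(h).
assert prob(h_I) > prob(h_prop)
```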
<p>Of course, (C) follows from (A) and (B), and (B) follows from (A) and (C). It
follows that, given a decomposition of the form (A), <em>the inductive support is
negative if and only if the deductive part has greater support than the original
hypothesis</em>.</p>
<h1 id="the-pm-theorem-is-a-special-case-of-the-btp">The PM theorem is a special case of the BTP</h1>
<p>Let us step back from the specific notion of support and decomposition used in
the PM theorem, and ask what we might want <em>in general</em> from a decomposition of
a generic notion of support, which we denote \(\sigma(\cdot | \cdot)\), into a
deductive part and an inductive part, which we call \(x_D\) and \(x_I\). We no
longer require \(h = x_D \land x_I\), but we do require that \(e \Rightarrow
x_D\) in some sense. To investigate a generalized form of the PM theorem, one
might ask whether we can have:</p>
\[\begin{align*}
(A')&& \sigma(h | e) ={}& \sigma(x_D | e) + \sigma(x_I | e)\\
(B')&& \sigma(x_I | e) \ge{}& 0\\
(C')&& \sigma(x_D | e) \ge{}& \sigma(h | e).
\end{align*}\]
<p>We want (A’) because that’s what a “decomposition” would mean, we want (C’)
because we don’t want anything like the BTP, and we want to know whether (B’) is
possible because that’s what it would mean to do induction. But by basic
algebra, (A’), (B’), and (C’) cannot hold simultaneously (except in the
degenerate case \(\sigma(x_I | e) = 0\), where the inductive part receives no
support at all), for <em>any</em> notion of support, probabilistic or otherwise.
The PM theorem is simply a particular case of this simple and general observation.</p>
<p>In light of this, the PM theorem begins to look a little trivial. When Popper
and Miller insist, e.g. in response to [7], that authors who contest their
decomposition produce alternative decompositions, they are in fact begging the
question.</p>
<p>That the BTP occurs is certainly a meaningful critique of probabilistic support
\(s(\cdot | \cdot)\). It seems to me, however, that the PM theorem simply
re-arranges the BTP in a way that sacrifices clarity rather than illuminating
what is really at issue.</p>
<h1 id="bibliography">Bibliography</h1>
<p>[1] Popper, K. and D. Miller (1983). “A proof of the impossibility of inductive probability”. In: Nature 302, pp. 687–688.</p>
<p>[2] Popper, K. and D. Miller (1987). “Why probabilistic support is not inductive”. In: Philosophical Transactions of
the Royal Society of London. Series A, Mathematical and Physical Sciences 321.1562,
pp. 569–591.</p>
<p>[3] Carnap, R. (1966). “The aim of inductive logic”. In: Studies in Logic and the Foundations of
Mathematics. Vol. 44. Elsevier, pp. 303–318.</p>
<p>[4] Elby, A. (1994). “Contentious contents: For inductive probability”. In: The British journal
for the philosophy of science 45.1, pp. 193–200.</p>
<p>[5] Redhead, M. (1985). “On the impossibility of inductive probability”. In: The British Journal
for the Philosophy of Science 36.2, pp. 185–191.</p>
<p>[6] Levi, I. (1984). “The impossibility of inductive probability”. In: Nature 310.5976, p. 433.</p>
<p>[7] Jeffrey, R. (1984). “The impossibility of inductive probability”. In: Nature 310.5976, p. 433.</p>
<h1 id="notes">Notes</h1>
<p>[*] A family of decompositions satisfying \(h = x_D \land x_I\), \(e
\Rightarrow x_D\), and condition (A) can be found by taking \(\{h, a, b\}\) to be
any partition of the tautology with \(b \land e\) unsatisfiable (so that \(p(h
\lor a \lor b) = p(h) + p(a) + p(b) = 1\) and \(e \Rightarrow h \lor a\)), and
taking \(x_D = h \lor a\) and \(x_I = h \lor b\). Levi pointed
out one such decomposition in [6], though he argued that Bayesian inference is
still not deduction, since \(s(x_I | e)\) varies over possible decompositions
despite the fact that \(x_I \land e = h\) for all such decompositions, and
propositions that are logically equivalent given \(e\) should receive equal
support from \(e\). Other authors, e.g. Jeffrey in [7], argue for
decompositions that violate (A). It is easy to show that if \(h \lor a \lor b\)
is not the tautology, then (A) is violated.</p>R torch for statistics (not just machine learning).2022-04-01T16:00:00+00:002022-04-01T16:00:00+00:00/code/2022/04/01/rtorch_example<p>The <code class="language-plaintext highlighter-rouge">torch</code> package for R (<a href="https://torch.mlverse.org/">found here</a>) is
CRAN-installable and provides automatic differentiation in R, as long as you’re
willing to rewrite your code using Torch functions.</p>
<p>The current docs for the <code class="language-plaintext highlighter-rouge">torch</code> package are great, but they assume you’re interested
in machine learning. Gradients are useful for ordinary statistics, too!
In the notebook below I fit a simple Poisson regression model using <code class="language-plaintext highlighter-rouge">optim</code>
by implementing the log likelihood and its derivatives in torch. Though not really
competitive with (the highly optimized) <code class="language-plaintext highlighter-rouge">lme4::glm</code> on this toy example,
my point is more how easily you can roll your own MLE in R using <code class="language-plaintext highlighter-rouge">torch</code>.</p>
<p>The notebook itself <a href="2022-04-01_poisson_regression_torch_for_r.ipynb">can be downloaded here</a>,
and a markdown version follows.</p>
<hr />
<h1 id="example-of-torch-for-classical-stats-poisson-regression">Example of <code class="language-plaintext highlighter-rouge">torch</code> for classical stats (Poisson regression)</h1>
<p>In this notebook, I’ll show how easy it is to use <code class="language-plaintext highlighter-rouge">torch</code> for R to optimize loss functions and compute standard error estimates.</p>
<p>The <a href="https://torch.mlverse.org/">torch for R website</a> is mostly focused on machine learning applications. The purpose of this notebook is just to show how easy it is to use <code class="language-plaintext highlighter-rouge">torch</code> to get gradients and Hessians for your own purposes, including vanilla classical statistics.</p>
<p>I’ll use <code class="language-plaintext highlighter-rouge">torch</code> to implement and optimize a Poisson regression loss function and compute standard errors using Fisher information. This is just a toy problem, but by simply dropping the loss into an out-of-the-box optimizer, we get essentially the same answer as the (highly optimized) <code class="language-plaintext highlighter-rouge">lme4</code> package in a similar amount of time.</p>
<h1 id="installation">Installation</h1>
<p>One of the big benefits of <code class="language-plaintext highlighter-rouge">torch</code> is that it can be installed via CRAN, and so can be easily packaged in with your own R packages without the user having to do a bunch of extra Python nonsense. Installation instructions can be found <a href="https://torch.mlverse.org/docs/">here</a>.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">lme4</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">torch</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">44</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Let’s generate some data. The model will be a simple Poisson regression:</p>
\[p(y_n | x_n) = \mathrm{Poisson}(\exp(x_n^T \beta))\]
<p>The goal will be to estimate \(\beta\), and standard errors, using maximum likelihood and the inverse Fisher information.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_obs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n_obs</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">n_obs</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">beta_true</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1.8</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">lambda_true</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta_true</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">lambda_true</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rpois</span><span class="p">(</span><span class="n">n_obs</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_true</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> V1
Min. : 1.021
1st Qu.: 2.570
Median : 3.998
Mean : 4.742
3rd Qu.: 6.349
Max. :15.447
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 4.000 4.664 6.000 25.000
</code></pre></div></div>
<p>Let’s define the log likelihood in <code class="language-plaintext highlighter-rouge">torch</code>. Then we can use <code class="language-plaintext highlighter-rouge">torch</code> to evaluate gradients of the log likelihood for optimization, and the Hessian for standard errors.</p>
<p>There are two important things to know:</p>
<ul>
<li>Torch does not operate on R numeric types. It operates on torch tensors, which can be created with <code class="language-plaintext highlighter-rouge">torch_tensor()</code>.</li>
<li>Torch uses only its own functions — not base R! You can typically find the things you need by browsing through the <a href="https://torch.mlverse.org/docs/reference/index.html">reference material</a>.</li>
</ul>
<p>I’ll keep torch versions of the data around in a list <code class="language-plaintext highlighter-rouge">tvars</code> for easy re-use. And I’ll write a function <code class="language-plaintext highlighter-rouge">EvalLogLik</code>, which takes in a torch tensor <code class="language-plaintext highlighter-rouge">beta</code>, the data, and returns the log likelihood, again as a torch tensor.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tvars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">tvars</span><span class="o">$</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">tvars</span><span class="o">$</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">EvalLogLikTorch</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">is</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="s2">"torch_tensor"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"beta must be a torch tensor"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">log_lambda</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_matmul</span><span class="p">(</span><span class="n">tvars</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">)</span><span class="w">
</span><span class="n">lp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_sum</span><span class="p">(</span><span class="n">tvars</span><span class="o">$</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">log_lambda</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">torch_exp</span><span class="p">(</span><span class="n">log_lambda</span><span class="p">))</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">lp</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Sanity check that it works</span><span class="w">
</span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta_true</span><span class="p">),</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torch_tensor
1.72212e+06
[ CPUFloatType{} ]
</code></pre></div></div>
<p>We want to pass the (negative) log likelihood to an R routine as a function to be optimized. So we need to write a wrapper that takes an <code class="language-plaintext highlighter-rouge">R</code> numeric type, converts it to a torch tensor, calls <code class="language-plaintext highlighter-rouge">EvalLogLikTorch</code>, and converts the result back to an R numeric type.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># From now on we'll take `tvars` to be a global variable to save</span><span class="w">
</span><span class="c1"># writing everything as lambda functions.</span><span class="w">
</span><span class="n">EvalLogLik</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">log_lik</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">verbose</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">collapse</span><span class="o">=</span><span class="s2">", "</span><span class="p">),</span><span class="w"> </span><span class="s2">": "</span><span class="p">,</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="n">log_lik</span><span class="p">,</span><span class="w"> </span><span class="n">digits</span><span class="o">=</span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">log_lik</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Just check that this runs</span><span class="w">
</span><span class="n">EvalLogLik</span><span class="p">(</span><span class="n">beta_true</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 1722115.375
</code></pre></div></div>
<p>Now for some magic — a function that returns the gradient of <code class="language-plaintext highlighter-rouge">EvalLogLik</code> with respect to beta. This is what we’re using <code class="language-plaintext highlighter-rouge">torch</code> for. As before, we want something that we can pass to our optimizer, so that it takes an R numeric value for \(\beta\) as input and returns \(\partial \log p(y \vert x, \beta) / \partial \beta\) as R numeric output.</p>
<p>Unlike before, we call <code class="language-plaintext highlighter-rouge">torch_tensor</code> with the extra argument <code class="language-plaintext highlighter-rouge">requires_grad=TRUE</code>. That tells <code class="language-plaintext highlighter-rouge">torch</code> that we will later want to compute a gradient with respect to this parameter.</p>
<p>We compute the <code class="language-plaintext highlighter-rouge">loss</code> (the negative log likelihood) as we would normally.</p>
<p>We then call <code class="language-plaintext highlighter-rouge">autograd_grad</code>, which returns the gradient of the first argument’s tensor with respect to the second argument’s tensor using all the computations that have been performed since the tensors were defined. The <code class="language-plaintext highlighter-rouge">autograd_grad</code> function returns a list (you can take gradients with respect to multiple inputs), so we just pull out the first element of the list.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">EvalLogLikGrad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">beta_ad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">requires_grad</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">loss</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w">
</span><span class="n">grad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">autograd_grad</span><span class="p">(</span><span class="n">loss</span><span class="p">,</span><span class="w"> </span><span class="n">beta_ad</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">grad</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Just check that this runs and has the correct dimensions</span><span class="w">
</span><span class="n">EvalLogLikGrad</span><span class="p">(</span><span class="n">beta_true</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] -440151.25 -691394
</code></pre></div></div>
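<p>As an aside, any gradient implementation (autodiff included) is easy to sanity-check against finite differences. Below is a self-contained sketch of that check for the Poisson log likelihood in base R; it simulates its own data, so none of the variable names are the ones defined above.</p>

```r
# Finite-difference check of a Poisson log-likelihood gradient.
# Self-contained sketch: simulates its own data; names are not the post's.
set.seed(1)
n_obs <- 1000
x <- cbind(rnorm(n_obs), rnorm(n_obs))
y <- rpois(n_obs, exp(x %*% c(1.0, 1.8)))

# Log likelihood (dropping the constant y! term) and its analytic gradient
LogLik <- function(beta) sum(y * (x %*% beta) - exp(x %*% beta))
LogLikGrad <- function(beta) as.numeric(t(x) %*% (y - exp(x %*% beta)))

beta0 <- c(0.5, 0.5)
eps <- 1e-6
fd_grad <- sapply(1:2, function(d) {
  beta_eps <- beta0
  beta_eps[d] <- beta_eps[d] + eps
  (LogLik(beta_eps) - LogLik(beta0)) / eps
})
grad <- LogLikGrad(beta0)
stopifnot(max(abs(fd_grad - grad) / pmax(abs(grad), 1)) < 1e-4)
```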
<p>Now we can feed the loss and gradient into a nonlinear optimizer. Sure enough, we get a reasonable estimate, with a somewhat small loss gradient.</p>
<p>(This gradient would ideally be smaller, but the out-of-the-box BFGS isn’t a very good optimization algorithm.)</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optim_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
</span><span class="n">opt_result</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">optim</span><span class="p">(</span><span class="w">
</span><span class="n">fn</span><span class="o">=</span><span class="n">EvalLogLik</span><span class="p">,</span><span class="w">
</span><span class="n">gr</span><span class="o">=</span><span class="n">EvalLogLikGrad</span><span class="p">,</span><span class="w">
</span><span class="n">method</span><span class="o">=</span><span class="s2">"BFGS"</span><span class="p">,</span><span class="w">
</span><span class="n">par</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">control</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">fnscale</span><span class="o">=</span><span class="m">-1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">n_obs</span><span class="p">))</span><span class="w">
</span><span class="n">optim_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">optim_time</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="s2">"Estimate"</span><span class="o">=</span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="s2">"Truth"</span><span class="o">=</span><span class="n">beta_true</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">print</span><span class="p">()</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"\nGradient at BFGS optimum:\n"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">EvalLogLikGrad</span><span class="p">(</span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="p">))</span><span class="w">
</span><span class="n">beta_hat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Estimate Truth
1 0.9706614 1.0
2 1.7922191 1.8
Gradient at BFGS optimum:
[1] -0.0355956 -0.0324285
</code></pre></div></div>
<p>We can compare the results to what we’d get from the same regression using <code class="language-plaintext highlighter-rouge">stats::glm</code>. Sure enough they match, and run in a comparable amount of time. (I’ve found there’s a fair amount of noise in the timing, and of course this is only reporting a single run, so all that really matters here is that the two are of the same order.)</p>
<p>The glm algorithm (IRLS) tends to do a much better job of optimizing, in the sense that the gradient is smaller at the glm optimum, and the algorithm runs more quickly. Still, BFGS is just a quick-and-dirty choice, and doesn’t require any special structure to the problem.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">glm_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
</span><span class="n">glm_fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="o">=</span><span class="n">data.frame</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x1</span><span class="o">=</span><span class="n">x</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">x2</span><span class="o">=</span><span class="n">x</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w">
</span><span class="n">start</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">family</span><span class="o">=</span><span class="s2">"poisson"</span><span class="p">)</span><span class="w">
</span><span class="n">glm_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">glm_time</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"Difference in coefficients estimated by optim and glm:\t"</span><span class="p">,</span><span class="w">
</span><span class="nf">max</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">coefficients</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">beta_hat</span><span class="p">)),</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"\nEstimation time (s):\n"</span><span class="p">)</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"optimization and torch: \t"</span><span class="p">,</span><span class="w"> </span><span class="n">optim_time</span><span class="p">,</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"glm: \t\t\t\t"</span><span class="p">,</span><span class="w"> </span><span class="n">glm_time</span><span class="p">,</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"\nGradient at glm optimum:\n"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">EvalLogLikGrad</span><span class="p">(</span><span class="n">coefficients</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">)))</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Difference in coefficients estimated by optim and glm: 1.882297e-05
Estimation time (s):
optimization and torch: 0.1431296
glm: 0.01248574
Gradient at glm optimum:
[1] -3.278255e-05 -7.224083e-05
</code></pre></div></div>
<p>To compute standard errors, we need to compute the negative Hessian matrix of the log likelihood:</p>
\[\hat{\mathcal{I}} := -
\left. \frac{\partial^2 \log p(y | x, \beta)}
{\partial \beta \partial \beta^T} \right|_{\hat\beta}\]
<p>The quantity \(\hat{\mathcal{I}}\) is the empirical Fisher information, and \(\hat{\mathcal{I}}^{-1}\) is a standard estimator of the covariance of the MLE \(\hat\beta\) under correct specification.</p>
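<p>For this particular model the Hessian is also available in closed form (the negative Hessian is \(\sum_n \exp(x_n^T\beta)x_n x_n^T\)), which gives an independent check on the standard errors. Here is a self-contained sketch of that check; it simulates its own data rather than reusing the variables above.</p>

```r
# Standard errors from the closed-form Poisson Hessian, checked against glm().
# Self-contained sketch: simulates its own data; names are not the post's.
set.seed(1)
n_obs <- 5000
x <- cbind(rnorm(n_obs), rnorm(n_obs))
y <- rpois(n_obs, exp(x %*% c(1.0, 1.8)))

glm_fit <- glm(y ~ x - 1, family = "poisson")
beta_hat <- coefficients(glm_fit)

# Negative Hessian of the log likelihood: sum_n exp(x_n' beta) x_n x_n'
lambda_hat <- as.numeric(exp(x %*% beta_hat))
fisher_info <- t(x * lambda_hat) %*% x
se <- sqrt(diag(solve(fisher_info)))

glm_se <- summary(glm_fit)$coefficients[, "Std. Error"]
stopifnot(max(abs(se - glm_se) / glm_se) < 1e-3)
```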
<p>The Python version of <code class="language-plaintext highlighter-rouge">torch</code> has a native function, like <code class="language-plaintext highlighter-rouge">autograd_grad</code>, that computes the Hessian directly. Unfortunately, that function has not yet been ported to R. (See <a href="https://github.com/mlverse/torch/issues/738">this issue</a> on github.) However, we can compute a Hessian by computing the gradients of each row of the gradient, as follows.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">EvalLogLikHessian</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">beta_ad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">requires_grad</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">log_lik</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w">
</span><span class="c1"># The argument `create_graph` allows `grad` to be itself differentiated, and</span><span class="w">
</span><span class="c1"># the argument `retain_graph` saves gradient computations to make repeated differentiation</span><span class="w">
</span><span class="c1"># of the same quantity more efficient.</span><span class="w">
</span><span class="n">grad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">autograd_grad</span><span class="p">(</span><span class="n">log_lik</span><span class="p">,</span><span class="w"> </span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">retain_graph</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">create_graph</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="c1"># Now we compute the gradient of each element of the gradient, each of which is</span><span class="w">
</span><span class="c1"># one row of the Hessian matrix.</span><span class="w">
</span><span class="n">hess</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">beta</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">d</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">grad</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">hess</span><span class="p">[</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">autograd_grad</span><span class="p">(</span><span class="n">grad</span><span class="p">[</span><span class="n">d</span><span class="p">],</span><span class="w"> </span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">retain_graph</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">hess</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Just check that this runs and has the correct dimensions</span><span class="w">
</span><span class="n">EvalLogLikHessian</span><span class="p">(</span><span class="n">beta_true</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         [,1]     [,2]
[1,] -1907663 -1720186
[2,] -1720186 -2263947
</code></pre></div></div>
<p>In the code below, <code class="language-plaintext highlighter-rouge">fisher_info</code> is precisely \(\hat{\mathcal{I}}\). We can see that the standard errors match one another.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fisher_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">-1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">EvalLogLikHessian</span><span class="p">(</span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="p">)</span><span class="w">
</span><span class="n">torch_se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">fisher_info</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">diag</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">()</span><span class="w">
</span><span class="n">glmer_se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">summary</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[,</span><span class="w"> </span><span class="s2">"Std. Error"</span><span class="p">]</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"Difference in estimated standard errors:\t"</span><span class="p">,</span><span class="w">
</span><span class="nf">max</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">torch_se</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">glmer_se</span><span class="p">)),</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Difference in estimated standard errors: 3.857842e-07
</code></pre></div></div>
<p>The torch package for R (found here) is CRAN-installable and provides automatic differentiation in R, as long as you’re willing to rewrite your code using Torch functions.</p>
A Few Equivalent Perspectives on Jackknife Bias Correction2022-03-17T10:00:00+00:002022-03-17T10:00:00+00:00/jackknife/2022/03/17/jackknife_bias
<p>In this post, I’ll try to connect a few different ways of viewing jackknife and
infinitesimal jackknife bias correction. This post may help provide some
intuition, as well as an introduction to how to use the infinitesimal jackknife
and von Mises expansion to think about bias correction.</p>
<p>Throughout, for concreteness, I’ll use the simple example of the statistic</p>
\[\begin{aligned}
T & =\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}\right)^{2},\end{aligned}\]
<p>where the \(x_{n}\) are IID and \(\mathbb{E}\left[x_{n}\right]=0\).
Of
course, the bias of \(T\) is known, since</p>
\[\begin{aligned}
\mathbb{E}\left[T\right] & =\frac{1}{N^{2}}\mathbb{E}\left[\sum_{n=1}^{N}x_{n}^{2}+\sum_{n_{1}\ne n_{2}}x_{n_{1}}x_{n_{2}}\right]
=\frac{\mathrm{Var}\left(x_{1}\right)}{N}.
\end{aligned}\]
<p>We will
ensure that we recover consistent estimates of this bias using each of
the different perspectives. Of course, these concepts are most useful when
we do not readily have such a simple expression for the bias, as is the
case with, for example, Bayesian expectations.</p>
<p>At different points I will use different arguments for \(T\), hopefully
without any real ambiguity. For convenience, write</p>
\[\begin{aligned}
\hat{\mu} := \frac{1}{N}\sum_{n=1}^{N}x_{n} \quad\textrm{and}\quad
\hat{\sigma}^{2} := \frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}\end{aligned}\]
<p>so that our example can be expressed as</p>
\[\begin{aligned}
T & =\hat{\mu}^{2}.\end{aligned}\]
<h1 id="an-asymptotic-series-in-n">An asymptotic series in \(N\).</h1>
<p>Perhaps the most common way to understand the jackknife bias estimator
and correction is as an asymptotic series in \(N\). Suppose that we have
some reason to believe that the expectation of \(T\) admits an asymptotic
expansion in \(N\), the size of the observed dataset:</p>
\[\begin{aligned}
\mathbb{E}\left[T_{N}\right] & =a_{0}+a_{1}N^{-1}+o\left(N^{-1}\right),\end{aligned}\]
<p>where \(a_{0}\) is the limiting value of the statistic (zero in our example,
since \(\mathbb{E}\left[x_{n}\right]=0\)) and \(a_{1}N^{-1}\) is the
leading-order bias.</p>
<p>The jackknife bias estimator works as follows. Let \(T_{-i}\) denote \(T\)
calculated with datapoint \(i\) left out. Then</p>
\[\begin{aligned}
\mathbb{E}\left[T_{N}-T_{-i}\right] & =a_{0}+a_{1}N^{-1}+o\left(N^{-1}\right)-\\
& \quad a_{0}-a_{1}\left(N-1\right)^{-1}+o\left(N^{-1}\right)\\
& =a_{1}\frac{N-1-N}{N\left(N-1\right)}+o\left(N^{-1}\right)\\
& =-\frac{a_{1}N^{-1}}{N-1}+o\left(N^{-1}\right)\\
& =-\frac{\mathbb{E}\left[T_{N}\right]-a_{0}}{N-1}+o\left(N^{-1}\right).\end{aligned}\]
<p>Consequently,</p>
\[\begin{aligned}
\hat{B} & :=-\left(N-1\right)\left(T_{N}-\frac{1}{N}\sum_{n=1}^{N}T_{-n}\right)\\
\mathbb{E}\left[\hat{B}\right] & =a_{1}N^{-1}+o\left(N^{-1}\right)\end{aligned}\]
<p>is an unbiased estimate of the leading-order term in the bias of
\(T_{N}\), and the bias-corrected estimate \(T_{N}-\hat{B}\) has bias of
smaller order,</p>
\[\begin{aligned}
\mathbb{E}\left[T_{N}-\hat{B}\right] & =a_{0}+o\left(N^{-1}\right).\end{aligned}\]
<p>In our example,</p>
\[\begin{aligned}
T_{-i} & =\left(\frac{1}{N-1}\sum_{n\ne i}^{N}x_{n}\right)^{2}\\
& =\left(\frac{N}{N-1}\hat{\mu}-\frac{1}{N-1}x_{i}\right)^{2}\\
& =\left(N-1\right)^{-2}\left(N^{2}\hat{\mu}^{2}-2N\hat{\mu}x_{i}+x_{i}^{2}\right)\\
\frac{1}{N}\sum_{n=1}^{N}T_{-n} & =\left(N-1\right)^{-2}\left(\left(N^{2}-2N\right)\hat{\mu}^{2}+\hat{\sigma}^{2}\right)\\
\hat{B} & =-\left(N-1\right)^{-1}\left(\left(N-1\right)^{2}\hat{\mu}^{2}-\left(N^{2}-2N\right)\hat{\mu}^{2}-\hat{\sigma}^{2}\right)\\
& =\frac{1}{N-1}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned}\]
<p>which is a perfectly good estimate of the bias.</p>
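<p>This identity is easy to verify numerically. A small self-contained R sketch (the names below are mine, not ones defined elsewhere in the post):</p>

```r
# Numeric check: for T = (sample mean)^2, the jackknife bias estimate equals
# (sigma_hat^2 - mu_hat^2) / (N - 1) exactly. A self-contained sketch.
set.seed(42)
n_obs <- 1000
x <- rnorm(n_obs)  # E[x] = 0, so the true bias of T is Var(x) / N

TStat <- function(x) mean(x)^2

t_n <- TStat(x)
t_loo <- sapply(1:n_obs, function(i) TStat(x[-i]))  # leave-one-out statistics
b_hat <- -(n_obs - 1) * (t_n - mean(t_loo))

# Closed form from the derivation above
mu_hat <- mean(x)
sigma2_hat <- mean(x^2)
b_closed <- (sigma2_hat - mu_hat^2) / (n_obs - 1)

stopifnot(abs(b_hat - b_closed) < 1e-10)
```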
<h1 id="a-taylor-series-in-1n">A Taylor series in \(1/N\).</h1>
<p>An equivalent way of looking at the previous example is to imagine \(T\)
as a function of \(\tau=1/N\), to numerically estimate the derivative
\(dT/d\tau\), and extrapolate to \(\tau=0\). Using the notation of the
previous section, define the gradient estimate</p>
\[\begin{aligned}
\hat{g_{i}} & =\frac{T_{N}-T_{-i}}{\frac{1}{N}-\frac{1}{N-1}}.\end{aligned}\]
<p>Here, we are viewing \(T_{-i}\) as an instance of the estimator evaluated
at \(\tau=1/\left(N-1\right)\). By rearranging, we find that</p>
\[\begin{aligned}
\hat{g_{i}} & =-N\left(N-1\right)\left(T_{N}-T_{-i}\right),\\
\hat{g} & =\frac{1}{N}\sum_{n=1}^{N}\hat{g_{n}}\\
& =-N\left(N-1\right)\left(T_{N}-\frac{1}{N}\sum_{n=1}^{N}T_{-n}\right)\\
& =N\hat{B}.\end{aligned}\]
<p>Extrapolating to \(\tau=0\) gives</p>
\[\begin{aligned}
T_{\infty} & \approx T_{N}+\hat{g}\left(0-\frac{1}{N}\right)\\
& =T_{N}-\hat{B},\end{aligned}\]
<p>as in the previous example.</p>
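<p>A quick numeric check (a self-contained sketch, not code from the post) confirms that extrapolating the slope estimate to \(\tau=0\) recovers the jackknife correction exactly:</p>

```r
# View T as a function of tau = 1/N, estimate dT/dtau from the leave-one-out
# statistics, and extrapolate to tau = 0; the result is exactly T_N - B_hat.
set.seed(42)
n_obs <- 200
x <- rnorm(n_obs)
TStat <- function(x) mean(x)^2

t_n <- TStat(x)
t_loo <- sapply(1:n_obs, function(i) TStat(x[-i]))

# Slope estimates from the two evaluation points tau = 1/N and tau = 1/(N-1)
g_hat <- mean((t_n - t_loo) / (1 / n_obs - 1 / (n_obs - 1)))

# First-order extrapolation from tau = 1/N to tau = 0
t_extrap <- t_n + g_hat * (0 - 1 / n_obs)

b_hat <- -(n_obs - 1) * (t_n - mean(t_loo))  # jackknife bias estimate
stopifnot(abs(t_extrap - (t_n - b_hat)) < 1e-10)
```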
<h1 id="a-von-mises-expansion">A von Mises expansion.</h1>
<p>Let us write the statistic as a functional of the data distribution as
follows:</p>
\[\begin{aligned}
T\left(F\right) & =\left(\int xdF\left(x\right)\right)^{2}.\end{aligned}\]
<p>Define the empirical distribution to be \(F_{N}\) and the true
distribution \(F_{\infty}\). Suppose that, for distributions \(G\) near a
base distribution \(F_{0}\), we can Taylor expand the statistic in
the space of distribution functions as</p>
\[\begin{aligned}
T\left(G\right) & \approx T\left(F_{0}\right)+T_{1}\left(F_{0}\right)\left(G-F_{0}\right)+\frac{1}{2}T_{2}\left(F_{0}\right)\left(G-F_{0}\right)\left(G-F_{0}\right)+
O\left(\left|G-F_{0}\right|^{3}\right), & \textrm{(1)} \end{aligned}\]
<p>where \(T_{1}\left(F_{0}\right)\) is a linear operator on the space of
(signed) distribution functions and \(T_{2}\left(F_{0}\right)\) is a
similarly defined bilinear operator. The expansion in
Eq. 1
is known as a von Mises expansion.</p>
<p>Often these operators can be represented with “influence functions”,
i.e., there exists a function
\(x\mapsto\psi_{1}\left(F_{0}\right)\left(x\right)\) and
\(x_{1},x_{2}\mapsto\psi_{2}\left(F_{0}\right)\left(x_{1},x_{2}\right)\)
such that</p>
\[\begin{aligned}
T_{1}\left(F_{0}\right)\left(G-F_{0}\right) & =\int\psi_{1}\left(F_{0}\right)\left(x\right)d\left(G-F_{0}\right)\left(x\right)\\
T_{2}\left(F_{0}\right)\left(G-F_{0}\right)\left(G-F_{0}\right) & =\int\int\psi_{2}\left(F_{0}\right)\left(x_{1},x_{2}\right)d\left(G-F_{0}\right)\left(x_{1}\right)d\left(G-F_{0}\right)\left(x_{2}\right).\end{aligned}\]
<p>For instance, the directional derivative of our example is given by</p>
\[\begin{aligned}
\left.\frac{dT\left(F+tG\right)}{dt}\right|_{t=0} & =\left.\frac{d}{dt}\right|_{t=0}\left(\int xd\left(F+tG\right)\left(x\right)\right)^{2}\\
& =2\left(\int\tilde{x}dF\left(\tilde{x}\right)\right)\int xdG\left(x\right),\end{aligned}\]
<p>so that</p>
\[\begin{aligned}
\psi_{1}\left(F\right)\left(x\right) & =2\left(\int\tilde{x}dF\left(\tilde{x}\right)\right)x.\end{aligned}\]
<p>Similarly,</p>
\[\begin{aligned}
\left.\frac{d^{2}T\left(F+tG\right)}{dt^{2}}\right|_{t=0} & =2\int xdG\left(x\right)\int xdG\left(x\right),\end{aligned}\]
<p>so</p>
\[\begin{aligned}
\psi_{2}\left(F\right)\left(x_{1},x_{2}\right) & =2x_{1}x_{2}.\end{aligned}\]
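<p>These influence function calculations are easy to spot-check with finite differences. A self-contained R sketch, perturbing an empirical distribution toward a point mass at an arbitrary point \(x_0\):</p>

```r
# Finite-difference check of the first influence function of T(F) = (int x dF)^2.
# Along the path (1 - t) F_N + t delta_{x0}, the statistic is
# T(t) = ((1 - t) mu_hat + t x0)^2, so dT/dt at t = 0 should equal
# psi_1(F_N)(x0) - int psi_1(F_N) dF_N = 2 mu_hat (x0 - mu_hat).
set.seed(7)
x <- rnorm(50)
mu_hat <- mean(x)
x0 <- 1.3  # arbitrary evaluation point

TMix <- function(t) ((1 - t) * mu_hat + t * x0)^2
t_eps <- 1e-6
fd_deriv <- (TMix(t_eps) - TMix(0)) / t_eps

analytic <- 2 * mu_hat * (x0 - mu_hat)
stopifnot(abs(fd_deriv - analytic) < 1e-4)
```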
<p>Define</p>
\[\begin{aligned}
\Delta_{N} & :=F_{N}-F_{\infty}.\end{aligned}\]
<p>Then the Taylor
expansion gives an expression for the bias in terms of the influence
functions:</p>
\[\begin{aligned}
T\left(F_{N}\right)-T\left(F_{\infty}\right) & =\int\psi_{1}\left(F_{\infty}\right)\left(x\right)d\Delta_{N}\left(x\right)+\frac{1}{2}\int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)+\\
& \quad\quad O\left(\left|\Delta_{N}\right|^{3}\right).\end{aligned}\]
<p>Note that, in general, integrals against \(\Delta_{N}\) take the form</p>
\[\begin{aligned}
\int\phi\left(x\right)d\Delta_{N}\left(x\right) & =\int\phi\left(x\right)dF_{N}\left(x\right)-\int\phi\left(x\right)dF_{\infty}\left(x\right)\\
& =\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)-\mathbb{E}\left[\phi\left(x\right)\right].\end{aligned}\]
<p>Consequently, the first term has expectation zero, since</p>
\[\begin{aligned}
\mathbb{E}\left[\int\psi_{1}\left(F_{\infty}\right)\left(x\right)d\Delta_{N}\left(x\right)\right] & =\frac{1}{N}\mathbb{E}\left[\sum_{n=1}^{N}\psi_{1}\left(F_{\infty}\right)\left(x_{n}\right)\right]-\mathbb{E}\left[\psi_{1}\left(F_{\infty}\right)\left(x\right)\right]\\
& =0.\end{aligned}\]
<p>The second term is given by</p>
\[\begin{aligned}
& \int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)\\
& \quad=\int\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{n}\right)-\mathbb{E}_{x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right)d\Delta_{N}\left(x_{1}\right)\\
& \quad=\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}\psi_{2}\left(F_{\infty}\right)\left(x_{n_{1}},x_{n_{2}}\right)-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right].\end{aligned}\]
<p>Note that in the expectation, \(x_{1}\) and \(x_{2}\) are independent. The
\(N\left(N-1\right)\) terms in the sum with \(n_{1}\ne n_{2}\) therefore each have mean
\(\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\),
which nearly cancels the subtracted expectation, leaving</p>
\[\begin{aligned}
& \mathbb{E}\left[\int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)\right]=\nonumber \\
& \quad\frac{1}{N}\left(\mathbb{E}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{1}\right)\right]-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right).
& \textrm{(2)}
\end{aligned}\]
<p>In general, this is not zero, and so the leading-order bias term of
\(\mathbb{E}\left[T\left(F_{N}\right)-T\left(F_{\infty}\right)\right]\) is
given by the expectation of the quadratic term.</p>
<p>Note that integrals over \(\Delta_{N}\) are, by the CLT, of order
\(1/\sqrt{N}\), so the \(k\)-th term in the von Mises expansion is of order
\(N^{-k/2}\). By this argument, the bias of \(T\) is of order \(N^{-1}\)
and admits a series expansion in \(N^{-1}\). Indeed, a von Mises expansion is
one way you could justify the first perspective. The expectation of the
second-order term is precisely the leading-order bias \(a_{1}N^{-1}\).</p>
<p>For our example, we can see that the bias is given by</p>
\[\begin{aligned}
& \frac{1}{2}\frac{1}{N}\left(\mathbb{E}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{1}\right)\right]-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right)\\
& \quad=\frac{2}{2N}\left(\mathbb{E}\left[x_{1}^{2}\right]-\mathbb{E}\left[x_{1}\right]^{2}\right)\\
& \quad=\frac{1}{N}\mathrm{Var}\left(x_{1}\right),\end{aligned}\]
<p>exactly as expected. In this case, the second order term is the exact
bias because our very simple \(T\) is actually quadratic in the
distribution function.</p>
<p>In general, one can estimate the bias by computing a sample version of
the second-order term. In our simple example, \(\psi_{2}\left(F\right)\)
does not actually depend on \(F\), but in general one would have to
replace \(\psi_{2}\left(F_{\infty}\right)\) with
\(\psi_{2}\left(F_{N}\right)\) and the population expectations with sample
expectations. For our example, letting \(\hat{\mathbb{E}}\) denote sample
expectations, this plug-in approach gives</p>
\[\begin{aligned}
& \frac{1}{2}\frac{1}{N}\left(\hat{\mathbb{E}}\left[\psi_{2}\left(F_{N}\right)\left(x_{1},x_{1}\right)\right]-\hat{\mathbb{E}}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{N}\right)\left(x_{1},x_{2}\right)\right]\right)\\
& \quad=\frac{1}{N}\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}-\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}x_{n_{1}}x_{n_{2}}\right)\\
& \quad=\frac{1}{N}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned}\]
<p>which is simply a sample estimate of the variance.</p>
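<p>A small Monte Carlo check (a self-contained sketch, not part of the original derivation) confirms that this plug-in estimate tracks the true bias \(\mathrm{Var}\left(x_{1}\right)/N\):</p>

```r
# Monte Carlo check: the plug-in von Mises bias estimate
# (sigma_hat^2 - mu_hat^2) / N tracks the true bias Var(x) / N of
# T = (sample mean)^2 when E[x] = 0.
set.seed(123)
n_obs <- 50
n_sims <- 20000
t_vals <- numeric(n_sims)
b_vals <- numeric(n_sims)
for (s in 1:n_sims) {
  x <- rnorm(n_obs)  # Var(x) = 1, so the true bias is 1 / n_obs
  t_vals[s] <- mean(x)^2
  b_vals[s] <- (mean(x^2) - mean(x)^2) / n_obs
}
true_bias <- 1 / n_obs
stopifnot(abs(mean(t_vals) - true_bias) < 1e-3)  # E[T] is approximately Var(x)/N
stopifnot(abs(mean(b_vals) - true_bias) < 1e-3)  # so is the plug-in estimate
```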
<p>Note that you might initially expect that, to use the reasoning in Eq. 1, you
would need to express your estimator as an explicit function of \(N\), or at
least take into account the \(N\) dependence in developing a Taylor series
expansion such as that in Eq. 1. However, the example in the present case shows
that this is not so, as the empirical distribution depends only implicitly on
\(N\). In fact, the asymptotic series in \(N\) follows from the stochastic
behavior of \(\Delta_{N}\) rather than from any explicit \(N\) dependence in the
statistic.</p>
<h1 id="the-infinitesimal-jackknife">The infinitesimal jackknife.</h1>
<p>Rather than use Eq. 1 to estimate the bias directly with the plug-in principle,
we might imagine using it to try to approximate the jackknife estimate of bias.
In this section, I show that (a) a second order infinitesimal jackknife
expansion is necessary and that (b) you then get the same answer as by
estimating the bias from the second term of Eq. 1 directly.</p>
<p>Let \(F_{-i}\) denote the empirical distribution with datapoint \(i\) left
out, and let \(\Delta_{-i}\) denote \(F_{-i}-F_{N}\). The infinitesimal
jackknife estimate of \(T_{-i}\) is given by using
Eq. 1
to extrapolate from \(F_{N}\) to \(F_{-i}\):</p>
\[\begin{aligned}
T_{IJ}^{\left(1\right)}\left(F_{-i}\right) & :=T\left(F_{N}\right)+T_{1}\left(F_{N}\right)\Delta_{-i}.\end{aligned}\]
<p>This is the classical infinitesimal jackknife, which expands only to
first order. The second order IJ is of course</p>
\[\begin{aligned}
T_{IJ}^{\left(2\right)}\left(F_{-i}\right) & :=T\left(F_{N}\right)+T_{1}\left(F_{N}\right)\Delta_{-i}+\frac{1}{2}T_{2}\left(F_{N}\right)\Delta_{-i}\Delta_{-i}.\end{aligned}\]
<p>The difference now is that the base of the Taylor series is \(F_{N}\)
rather than \(F_{\infty}\), and we are extrapolating to estimate the
jackknife rather than to estimate the actual bias. A benefit is that all
the quantities in the Taylor series can be evaluated, and no plug-in
approximation is necessary. For instance, in our example,</p>
\[\begin{aligned}
\psi_{1}\left(F_{\infty}\right)\left(x\right) & =2\left(\int\tilde{x}dF_{\infty}\left(\tilde{x}\right)\right)x,\end{aligned}\]
<p>which contains the unknown true mean \(\mathbb{E}\left[x_{1}\right]\). In
contrast,</p>
\[\begin{aligned}
\psi_{1}\left(F_{N}\right)\left(x\right) & =2\hat{\mu}x,\end{aligned}\]
<p>which depends only on the observed sample mean.</p>
<p>As before, it is useful to first write out the operation of
\(\Delta_{-i}\) on a generic function of \(x\):</p>
\[\begin{aligned}
\int\phi\left(x\right)d\Delta_{-i}\left(x\right) & =\int\phi\left(x\right)dF_{-i}\left(x\right)-\int\phi\left(x\right)dF_{N}\left(x\right)\\
& =\frac{1}{N-1}\sum_{n\ne i}\phi\left(x_{n}\right)-\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)\\
& =\left(\frac{1}{N-1}-\frac{1}{N}\right)\sum_{n=1}^{N}\phi\left(x_{n}\right)-\frac{\phi\left(x_{i}\right)}{N-1}\\
& =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)-\phi\left(x_{i}\right)\right).\end{aligned}\]
<p>From this we see that the first-order term is</p>
\[\begin{aligned}
T_{1}\left(F_{N}\right)\Delta_{-i} & =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)-\psi_{1}\left(F_{N}\right)\left(x_{i}\right)\right).\end{aligned}\]
<p>Suppose we tried to use \(T_{IJ}^{\left(1\right)}\left(F_{-i}\right)\) to
approximate \(T_{-i}\) in the expression for \(\hat{B}\). We would get</p>
\[\begin{aligned}
\hat{B} & =-\left(N-1\right)\left(T\left(F_{N}\right)-\frac{1}{N}\sum_{n=1}^{N}T_{IJ}^{\left(1\right)}\left(F_{-n}\right)\right)\\
& =\left(N-1\right)\left(\frac{1}{N}\sum_{n=1}^{N}T_{1}\left(F_{N}\right)\Delta_{-n}\right)\\
& =\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)-\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)\\
& =0.\end{aligned}\]
<p>In other words, the first-order approximation
estimates no bias. (This is in fact for the same reason that the
expectation with respect to \(F_{\infty}\) of the first-order term
evaluated at \(F_{\infty}\) is zero.)</p>
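This cancellation is easy to check numerically. The following sketch (in Python with NumPy, my choice here rather than the post's) uses an arbitrary stand-in for \(\psi_{1}\left(F_{N}\right)\), since the cancellation holds for any fixed function:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=25)
N = len(x)

# Any fixed function can stand in for psi_1(F_N); the cancellation is generic.
def psi1(v):
    return v**3 - 2.0 * v

# First-order term T_1(F_N) Delta_{-i} for each left-out point i.
t1 = (psi1(x).mean() - psi1(x)) / (N - 1)

# B-hat = -(N-1) * (T(F_N) - mean_i T_IJ^(1)(F_{-i})) = (N-1) * mean_i t1_i
B_hat = (N - 1) * t1.mean()
print(B_hat)  # zero up to floating-point error
```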
<p>The second order term is given by</p>
\[\begin{aligned}
T_{2}\left(F_{N}\right)\Delta_{-i}\Delta_{-i} & =\left(N-1\right)^{-1}\int\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(\tilde{x},x_{n}\right)-\psi_{2}\left(F_{N}\right)\left(\tilde{x},x_{i}\right)\right)d\Delta_{-i}\left(\tilde{x}\right)\\
& =\left(N-1\right)^{-2}\times(\\
& \quad\frac{1}{N^{2}}\sum_{n_{1},n_{2}}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n_{1}},x_{n_{2}}\right)-\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{i},x_{n}\right)-\\
& \quad\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n},x_{i}\right)+\psi_{2}\left(F_{N}\right)\left(x_{i},x_{i}\right)\\
& \quad).\end{aligned}\]
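The expansion above can be checked numerically by evaluating \(\Delta_{-i}\) directly as a signed measure. Here is a sketch with an arbitrary stand-in kernel for \(\psi_{2}\left(F_{N}\right)\), since the identity is generic:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10)
N = len(x)

# An arbitrary stand-in kernel for psi_2(F_N); the identity holds for any kernel.
P = x[:, None] * x[None, :] + np.sin(x[:, None] - x[None, :])

ok = True
for i in range(N):
    # Delta_{-i} puts mass 1/(N-1) - 1/N on each n != i and -1/N on point i.
    w = np.full(N, 1.0 / (N - 1) - 1.0 / N)
    w[i] = -1.0 / N
    direct = w @ P @ w  # the double integral of psi_2 against Delta_{-i} twice
    formula = (P.mean() - P[i, :].mean() - P[:, i].mean() + P[i, i]) / (N - 1) ** 2
    ok = ok and np.isclose(direct, formula)
print(ok)
```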
<p>As before, using
\(T_{IJ}^{\left(2\right)}\left(F_{-i}\right)\) then to approximate
\(T_{-i}\) gives</p>
\[\begin{aligned}
\hat{B} & =-\left(N-1\right)\left(T\left(F_{N}\right)-\frac{1}{N}\sum_{n=1}^{N}T_{IJ}^{\left(2\right)}\left(F_{-i}\right)\right)\\
& =\left(N-1\right)\left(\frac{1}{N}\sum_{n=1}^{N}T_{2}\left(F_{N}\right)\Delta_{-n}\Delta_{-n}\right),\end{aligned}\]
<p>where we have used the previous result that the first term has empirical
expectation \(0\). Plugging in, we see that</p>
\[\begin{aligned}
\hat{B} & =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n},x_{n}\right)-\frac{1}{N^{2}}\sum_{n_{1},n_{2}}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n_{1}},x_{n_{2}}\right)\right),\end{aligned}\]
<p>which is precisely a sample analogue of the population bias in Eq. 2 of the
previous section. Of course, in our specific example, this gives</p>
\[\begin{aligned}
\hat{B} & =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}-\frac{1}{N^{2}}\sum_{n_{1},n_{2}}^{N}x_{n_{1}}x_{n_{2}}\right)\\
& =\frac{1}{N-1}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned}\]
<p>which matches the exact jackknife’s factor of \(\left(N-1\right)^{-1}\),
in contrast to our direct sample estimate of the bias term, which had a
factor of \(N^{-1}\).</p>In this post, I’ll try to connect a few different ways of viewing jackknife and infinitesimal jackknife bias correction. This post may help provide some intuition, as well as an introduction to how to use the infinitesimal jackknife and von Mises expansion to think about bias correction.St. Augustine’s question: A counterexample to Ian Hacking’s ‘law of likelihood’2022-02-17T10:00:00+00:002022-02-17T10:00:00+00:00/philosophy/2022/02/17/st_augustines_paradox<p>In this post, I’d like to discuss a simple sense in which statistical reasoning
refutes itself. My reasoning is almost trivial and certainly familiar to
statisticians. But I think that the way I frame it constitutes an argument
against a certain kind of philosophical overreach: against an attempt to view
statistical reasoning as a branch of logic, rather than an activity that looks
more like rhetoric.</p>
<p>To make my argument I’d like to mash up two books which I’ve talked about before
on this blog. The first is Ian Hacking’s Logic of Statistical Inference (I
wrote <a href="/philosophy/2021/12/09/fidual_cis.html">here</a> about
its wonderful chapter on fiducial inference). The other is an interesting
section in St. Augustine’s confessions, which I <a href="/philosophy/2021/10/27/st_augustine.html">discussed here</a>. Ian Hacking’s ambition is, as the
title of the book suggests, to describe the basis of a logic of statistical
inference. His primary tool is the comparison of the likelihoods of what he
calls “chance outcomes” (implicitly he seems to mean aleatoric gambling devices,
but he is uncharacteristically imprecise, implying, I think, that we simply know
a chance setup when we see it).</p>
<p>St. Augustine, as I discuss in my earlier post, has a worldview stripped of what
modern thinkers would call randomness. In St. Augustine’s vision of the world, an
unknowable and all-powerful God guides to His own ends the outcome even of
aleatoric devices, such as the drawing of lots and, presumably, the flipping of
coins. Many people in the modern age do not think like St. Augustine. So it
is reasonable to ask what I will call “St. Augustine’s question:” is St.
Augustine’s deterministic worldview correct?</p>
<h1 id="hacking-on-st-augustines-question">Hacking on St. Augustine’s question</h1>
<p>I would like to attempt, using Hacking’s methods, to bring the outcome of a coin
flip to bear on St. Augustine’s question. One might reasonably doubt that this
is fair to Hacking. However, the first sentence of Hacking’s book articulates
the scope of his ambition:</p>
<blockquote>
<p>“The problem of the foundation of statistics is to state a set of principles
which entail the validity of all correct statistical inference, and which
do not imply that any fallacious inference is valid.”</p>
</blockquote>
<p>Hacking’s goal is ambitious (my argument here is essentially that it is
over-ambitious). However, to his credit, it is clear: if we can formulate the
St. Augustine question as a statistical one about a chance outcome, then we
should expect Hacking’s logic to come to the correct epistemic conclusion.
Furthermore, Hacking states himself (when arguing against Neyman-Pearson
testing in Chapter 7) that “the best way to refute a principle [is] not
general metaphysics but concrete example.”</p>
<p>Finally, lest it seem too esoteric to argue with St. Augustine, or that this
example is too contrived to be meaningful, at the end of this post, I will draw
connections between my argument and some shortcomings of likelihood-based model
comparison that are well known to statisticians but largely ignored by Hacking’s
book.</p>
<h1 id="hackings-law-of-likelihood">Hacking’s law of likelihood</h1>
<p>Hacking’s principle of inference is embodied in his “law of likelihood,” which
is introduced in Chapter 5. The goal is to justifiably connect aleatoric
statements to degrees of logical belief (without going through subjective
probability). Stripping away some of Hacking’s notation, his law of likelihood
states in brief that</p>
<blockquote>
<p>“If two joint propositions are consistent with the statistical data,
the better supported is that with the greater likelihood.”</p>
</blockquote>
<p>Here I should clarify some of Hacking’s terminology. By “statistical data” he
means everything you know before conducting a chance experiment, including the
nature of how you get the data. A “joint proposition” is some
statement about the world, possibly including things you don’t know, e.g.,
future unobserved data, or some unknown aspect of the real world. Hacking spends
a lot of time defining and discussing his terms.</p>
<p>For the present purpose, it suffices to describe some of Hacking’s own examples
from Chapter 5 of how the law of likelihood is to be used. Suppose that a
biased coin has P(H) = 0.9 and P(T) = 0.1. Then, by the law of likelihood, the
proposition \(\pi_H\) that a yet-unseen flip will be H is better supported than
the proposition \(\pi_T\) that it will be T, since P(H) > P(T). Similarly, if
we observe K heads out of N flips, by the law of likelihood, the proposition
\(\pi_{K/N}\) that P(H) = K / N is better supported than the proposition
\(\pi_{(K-1)/N}\) that P(H) = (K - 1) / N.</p>
<p>Are these assertions trivial? Hacking spends the first part of the book arguing
that they are not, and the latter part of the book demonstrating important
differences, both conceptual and practical, with decision theory and subjective
probability. Suffice to say they are beyond the scope of the present post.</p>
<h1 id="asking-st-augustines-question-with-the-law-of-likelihood">Asking St. Augustine’s question with the law of likelihood</h1>
<p>Let us suppose that we have made single coin flip which came up H. The coin was
designed and flipped symmetrically to the best of our abilities. St. Augustine’s
question can be expressed in terms of these two simple propositions:</p>
<ul>
<li>\(\pi_{R}\) (Randomness): P(H) = 0.5, and we observed H</li>
<li>\(\pi_{A}\) (Augustine): P(H) = 1.0 (God wills it), and we observed H</li>
</ul>
<p>Obviously, the law of likelihood supports \(\pi_{A}\), answering St. Augustine’s
question in the affirmative, i.e., that St. Augustine’s worldview is better
supported than randomness.</p>
<p>Let me be the first to admit that this is pretty trivial. Perhaps you are
disappointed, and sorry you bothered to read this far! Let me try to bring you
back in.</p>
<p>First, observe that the same reasoning applies to any number of coin flips. You
might ask whether the sequence HTHTTH was pre-ordained or random, and the law of
likelihood always supports that it was pre-ordained. The same reasoning can be
applied to whether some small number of flips in a particular sequence were
pre-ordained — e.g., when asking whether every flip in the sequence HTHTTH was
random, or whether at least one of them was pre-ordained, the law of likelihood
supports that at least one of them was pre-ordained. The same reasoning applies to
degrees of probability, as well — e.g., when asking whether every flip
in the sequence HTHTTH was fair, versus was it P(H) = 0.6 when H came up
and P(T) = 0.6 when T came up, the law of likelihood supports that the
sequence was not fair.</p>
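The last comparison can be written out as a toy calculation (a sketch; the numbers are just the ones from the example above):

```python
# Likelihood of the observed sequence HTHTTH under "every flip is fair":
fair = 0.5 ** 6

# Likelihood under "P = 0.6 for whichever face actually came up":
rigged = 0.6 ** 6

# The law of likelihood favors the more deterministic proposition.
print(rigged > fair)  # True
```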
<p>In short, the law of likelihood always supports the most deterministic
proposition. In this sense, the law of likelihood does not support its own
applicability. Without randomness, there is no need or use for a logic of
statistical inference. When given the opportunity to ask whether or not there
is randomness in a particular setting, the law of likelihood always militates
against randomness, and eats its own tail.</p>
<h1 id="statisticians-know-this-and-so-does-hacking">Statisticians know this, and so does Hacking</h1>
<p>This phenomenon is no surprise to statisticians, of course. Model selection
based on likelihood — whether Bayesian or frequentist in design and use —
favors the more complex models unless some corrective factor is used, such as
regularization or priors. The answer given by the law of likelihood to St.
Augustine’s question is just an extreme end of this phenomenon.</p>
<p>Is Hacking aware of this problem? Of course; Hacking is aware of most things.
For example, in Chapter 7, he discusses very briefly the importance of weighting
likelihoods in some cases (“One author has suggested that a number be assigned
to each hypothesis, which will represent the ‘seriousness’ of rejecting it …
In the theory of likelihood testing, one would use weighted tests.”)
Unfortunately, Hacking’s discussion of Bayesianism in Chapters 12 and 13 does
not take up this point, focusing instead on arguing against uniform priors and
dogmatic subjectivism. Probably most damningly, Hacking does not shrink away
from using the law of likelihood to reason between a large number of expressive
propositions and a single less expressive one, as, for example, in his
comparison of unbiased tests in Chapter 7 (page 89 in the Cambridge Philosophy
Classics 2016 edition). In summary, Hacking does not appear to take very
seriously the fundamental role extra-statistical evidence must play in
applications of the law of likelihood, in order to avoid its own
self-refutation.</p>
<h1 id="we-must-deliberately-choose-the-statistical-analogy">We must deliberately choose the statistical analogy</h1>
<p>The point is that describing the world with randomness is a choice we make, and
we make it because it is sometimes useful to us. In the course of doing
something like statistical inference, we <em>must</em> posit <em>a priori</em> the existence
of randomness as well as explanatory mechanisms of limited complexity. At the
core of statistical reasoning is the <em>discard</em> of information — of viewing a
set of voters, each entirely unique, as equivalent to balls drawn from an urn,
or viewing the day’s weather, which is fixed from yesterday’s by deterministic
laws of physics, as something exchangeable with some hypothetical population of
other days, conceptually detached from contingency and their own pasts. Failure
to remember this can lead to silly arguments about whether phenomena are “really
random.” In other words, we must choose to make the <a href="/philosophy/2021/08/22/what_is_statistics.html">statistical analogy</a>, and accept
that its applicability may not be indisputable.</p>
<p>From this perspective, Hacking’s ambition — a logic of statistical inference —
seems hopeless, not because of some inevitably subjective nature of probability
itself, but because of the subjective nature of analogy. How can you form a
logic which will give correct conclusions in every application of an analogy?
The affairs of statistics are inevitably human and not purely computational, and
the field is more exciting and fruitful for it.</p>In this post, I’d like to discuss a simple sense in which statistical reasoning refutes itself. My reasoning is almost trivial and certainly familiar to statisticians. But I think that the way I frame it constitutes an argument against a certain kind of philosophical overreach: against an attempt to view statistical reasoning as a branch of logic, rather than an activity that looks more like rhetoric.Some of the gambling devices that build statistics.2022-01-27T10:00:00+00:002022-01-27T10:00:00+00:00/philosophy/2022/01/27/basic_gambling_device<p>In <a href="/philosophy/2021/08/22/what_is_statistics.html">an earlier post</a>, I discuss how statistics uses
gambling devices (aleatoric uncertainty) as a metaphor for the unknown in
general (epistemic uncertainty). I called this the “statistical analogy.” Of
course, this perspective is not at all new — see section 1.5 of [0], for
example.</p>
<p>When folks employ the statistical analogy, explicitly or implicitly, a few
gambling devices come up again and again. I find that having their taxonomy in
the back of the mind can help one see what metaphor(s) is (are) being employed in a
particular analysis. These gambling devices are obviously not fully distinct —
you can typically simulate one with another, and the final “device” obviously
encompasses all the others. But I will separate them here because they tend to
play different metaphorical roles — and, I would argue, increasingly tenuously
in the order I have written them.</p>
<h1 id="the-urn-exchangeability">The urn (exchangeability)</h1>
<p>The gambling device most commonly used in statistics is probably the urn: a
container holding some objects, such as balls of different colors, which is
shaken, and from which some objects are removed. The aleatoric randomness is
provided by shaking as well as drawing blindly from the urn, creating a symmetry
between all objects in the urn. Equivalent gambling devices include drawing
cards from a shuffled deck or random respondents for a poll. Once one is in the
habit of thinking about urns with a finite number of objects, it is a small step
to consider urns with an infinite number of objects, such as super-populations
in causal inference ([1], section 1.12).</p>
<p>The ubiquitous assumption of exchangeability is equivalent to sequential drawing
from a shaken urn ([2], section 3). Consequently, the urn model is at the core
of most frequentist inferential methods, including the bootstrap and normal
approximations for exchangeable data.</p>
<h1 id="bets-using-biased-coins-subjective-probability">Bets using biased coins (subjective probability)</h1>
<p>The biased coin, which chooses between two outcomes with given probabilities,
plays a large role in subjective probability (associated with Bayesian
statistics) as the basis for hypothetical betting. The key idea behind
subjective probability is that, before gathering data, we have beliefs about the
state of the world. If these beliefs satisfy some reasonable assumptions (i.e.,
are “coherent”) then there are some bets that we would consider fair, and
some that we would not. Equivalent aleatoric versions of these bets can then be
used as metaphors for your subjective beliefs.</p>
<p>For example, suppose that some unknown quantity can be either A or B, and we
would accept as fair a bet in which we get $1 if A occurs but pay $2 if B
occurs. Since these are precisely the odds that would be acceptable for a
biased coin which comes up A 2/3 of the time and B 1/3 of the time, one might
say that your subjective belief about A and B is equivalent to your subjective
belief about a biased coin with probabilities 2/3 and 1/3. The bet on a biased
coin is a metaphor for your subjective belief about A and B. (The full formal
connection between betting and subjective probability is richer and more
complicated than my cartoon. See [3], sections 3.1-3.4.)</p>
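The correspondence between fair odds and probabilities in this cartoon can be computed directly (a sketch; the dollar amounts are the ones from the example above):

```python
# A bet that pays $1 if A occurs and costs $2 if B occurs is fair exactly
# when p(A) * 1 = (1 - p(A)) * 2, i.e. when the expected net payoff is zero.
gain_if_A = 1.0
loss_if_B = 2.0
p_A = loss_if_B / (gain_if_A + loss_if_B)

expected_payoff = p_A * gain_if_A - (1.0 - p_A) * loss_if_B
print(p_A)  # 2/3, matching the biased coin in the example
```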
<p>With a coin, the aleatoric randomness is produced by a symmetric coin shape
together with flipping or spinning, which creates a symmetry between the two
sides. The biased coin can be extended to multiple outcomes with uneven dice,
such as sheep knuckle bones, again with symmetry created between outcomes via
spinning. Of course, you can draw from an urn using biased coins, or produce
bets with urns. That is not my point! The point is that the way these gambling
devices are used metaphorically is distinct.</p>
<h1 id="the-spinner-continuous-uniform-random-variables">The spinner (continuous uniform random variables)</h1>
<p>The urn and the biased coin are fundamentally discrete, though much of
statistics deals with continuous-valued random variables. The spinner is the
most natural way to produce a continuous random variable — namely, a uniform
distribution on the circumference of a circle. A spinner creates aleatoric
randomness by symmetry of the disk together with a vigorous spin. The needle
goes around many times, but the random number is produced by the fractional part
of the number of cycles. Pseudo-random number generators like the Mersenne
twister seem to me to be in the same class, as they are based on the fractional
part of a large number.</p>
<p>The spinner creates sort of a bridge to the rest of probability theory, since
any continuous random variable can be produced by applying a function (the inverse
CDF) to a uniform random variable on the unit interval. Given a spinner, one
can begin to imagine complex aleatoric processes based on spinners and
computation alone. Of course we can form approximations to the continuum with a
sufficiently large number of coin flips, for example, or a sufficiently large
urn. However, I think the spinner provides much cleaner intuition for why we
consider continuous random variables to be reasonable aleatoric processes in the
first place.</p>
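The inverse-CDF construction can be sketched in a few lines, with NumPy's uniform generator playing the role of the spinner (the exponential distribution is just an illustrative choice of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)  # the "spinner": uniform draws on [0, 1)

# Inverse CDF of the Exponential(1) distribution: F^{-1}(u) = -log(1 - u).
x = -np.log1p(-u)

print(x.mean())  # close to 1, the mean of Exponential(1)
```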
<h1 id="probabilistic-models">Probabilistic models</h1>
<p>Once we have the probability calculus (via the spinner and computation), we can
begin to form quite complex aleatoric models to represent our uncertainty.
Arguably, this is the realm in which a lot of modern statistical work takes
place. For example, suppose you are analyzing a binary outcome (hospitalized
for COVID or not) as a function of some regressors (age and vaccine status).
For an individual with a given age and vaccine status, we do not know for
certain whether they will be hospitalized. A logistic regression is precisely a
posited aleatoric system to describe this subjective uncertainty. Software like
Stan, which allows generalists to perform inference on their own generative
processes, makes this kind of complex modeling relatively easy.</p>
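As a sketch of what "a posited aleatoric system" means here, the following simulates the logistic-regression story for the hospitalization example; the regressor ranges and coefficients are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical regressors; the coefficients below are made up for illustration.
age = rng.uniform(20, 90, size=n)
vaccinated = rng.integers(0, 2, size=n)

# The posited aleatoric system: a logistic link plus an independent "spin" per person.
logit = -4.0 + 0.05 * age - 1.5 * vaccinated
p = 1.0 / (1.0 + np.exp(-logit))
hospitalized = rng.uniform(size=n) < p

print(hospitalized.mean())  # one aleatoric draw of the observed outcome rate
```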
<p>Of course, at this level of abstraction, the metaphor can lose clarity and
force. Why is logistic regression reasonable? Why not some other link
function? Why not other regressors (e.g. interactions)? Taking for granted
that such abstract models provide good metaphors for epistemic uncertainty is at
the root of many misapplications of statistics. In fact, many early
statisticians, particularly those in the frequentist camps, were expressly
unwilling to extend the statistical analogy much further than exchangeability.
One might see a key difference between Neyman-Rubin causal inference ([1]),
which (mostly) requires only the urn, and Pearlian causal inference ([4]),
which requires probabilistic graphical models, as a difference in willingness
to stretch the statistical analogy.</p>
<p>As with all analogies, the quality of a particular statistical analogy is
subject to an ineradicable subjectivity. But being aware of what analogy
is being made in a particular situation can help clarify disagreements
and avoid missteps.</p>
<h1 id="references">References</h1>
<p>[0] Gelman, Andrew, et al. Bayesian data analysis. CRC press, 2013.</p>
<p>[1] Imbens, Guido W., and Donald B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.</p>
<p>[2] Shafer, Glenn, and Vladimir Vovk. “A Tutorial on Conformal Prediction.” Journal of Machine Learning Research 9.3 (2008).</p>
<p>[3] Ghosh, Jayanta K., Mohan Delampady, and Tapas Samanta. An introduction to Bayesian analysis: theory and methods. Vol. 725. New York: Springer, 2006.</p>
<p>[4] Pearl, Judea. Causality. Cambridge university press, 2009.</p>In an earlier post, I discuss how statistics uses gambling devices (aleatoric uncertainty) as a metaphor for more the unknown in general (epistemic uncertainty). I called this the “statistical analogy.” Of course, this perspective is not at all new — see section 1.5 of [0], for example.How does AMIP work for regression when the weight vector induces colinearity in the regressors?2021-12-17T10:00:00+00:002021-12-17T10:00:00+00:00/amip/2021/12/17/reweighted_colinear_note<p>How does AMIP work for regression when the weight vector induces
colinearity in the regressors? This problem came up in our paper, as
well as for a couple of users of <code class="language-plaintext highlighter-rouge">zaminfluence</code>. Superficially, the
higher-order infinitesimal jackknife has nothing to say about such a
point, since a requirement for the accuracy of the approximation is that
the Hessian matrix be uniformly non-singular. However, we shall see
that, in the special case of regression, we can re-express the problem
so that the singularity disappears.</p>
<h3 id="notation">Notation</h3>
<p>Suppose we have a weighted regression problem with regressor matrix \(X\)
(an \(N \times D\) matrix), response vector \(\vec{y}\) (an \(N\)-vector), and
weight vector \(\vec{w}\) (an \(N\)-vector):
\(\begin{aligned}
%
\hat{\theta}(\vec{w}) :={}& \theta\textrm{ such that }\sum_{n=1}^N
w_n (y_n - \theta^T x_n) x_n = 0
\Rightarrow\\
\hat{\theta}(\vec{w}) ={}&
\left(\frac{1}{N}\sum_{n=1}^N
w_n x_n x_n^T \right)^{-1} \frac{1}{N}\sum_{n=1}^Nw_n y_n x_n
\\={}& \left((\vec{w}\odot X)^T X\right)^{-1} (\vec{w}\odot X)^T \vec{y}.
%
\end{aligned}\)</p>
<p>Here, we have used \(\vec{w}\odot X\) to denote the Hadamard
product with broadcasting. Formally, we really mean
\((\vec{w}1_D^T) \odot X\), where \(1_D^T\) is a \(D\)-length row vector
containing all ones. Throughout, we will use \(1\) and \(0\) subscripted by
a dimension to represent vectors filled with ones and zeros
respectively.</p>
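The closed form above can be checked against a square-root-weighted least-squares solve (a sketch in NumPy; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.uniform(0.5, 2.0, size=N)

# theta_hat(w) = ((w . X)^T X)^{-1} (w . X)^T y, with broadcasting for w . X.
WX = w[:, None] * X
theta = np.linalg.solve(WX.T @ X, WX.T @ y)

# The same estimate via least squares on sqrt(w)-scaled data.
theta_lstsq, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * X, np.sqrt(w) * y, rcond=None)
print(np.allclose(theta, theta_lstsq))  # True
```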
<h3 id="how-can-weights-induce-rank-deficiency">How can weights induce rank deficiency?</h3>
<p>We are interested in what happens to the linear approximation at a
weight vector \(\vec{w}\) for which the Hessian
\(\frac{1}{N}\sum_{n=1}^Nw_n x_n x_n^T\) is singular. Assume that \(X\) has
rank \(D\), and that \((\vec{w}\odot X)\) is rank \(D-1\). Specifically, there
exists some nonzero vector \(a_1 \in \mathbb{R}^D\) such that \((\vec{w}
\odot X) a_1 = 0_N\), where \(0_N\) is the \(N\)-length vector of zeros. For
each \(n\), the preceding expression implies that \(w_n x_n^T a_1 = 0\), so
either \(w_n = 0\) or \(x_n^T a_1 = 0\). Without loss of generality, we can
thus order the observations so that we drop the first \(N_{d}\) rows:</p>
\[\begin{aligned}
%
\vec{w}=
\left(\begin{array}{c}
0_{N_{d}}\\
1_{N_{k}}
\end{array}\right)
\quad\textrm{and}\quad
X= \left(\begin{array}{c}
X_d\\
X_k
\end{array}\right)
%
\end{aligned}\]
<p>Here, \(X_d\) is an \(N_{d}\times D\) matrix of dropped
rows and \(X_k\) is an \(N_{k}\times D\) matrix of kept rows, where
\(N_{k}+ N_{d}= N\). We thus have</p>
\[\begin{aligned}
%
Xa_1 =
\left(\begin{array}{c}
X_d a_1\\
0_{N_{k}}
\end{array}\right).
%
\end{aligned}\]
<p>Here, \(X_d a_1 \ne 0\) (for otherwise \(X\) could not be
rank \(D\)). In other words, the rows \(X_k\) are rank deficient, the rows
\(X_d\) are not, but \(\vec{w}\odot X\) is rank deficient precisely because
\(\vec{w}\) drops the full-rank portion \(X_d\).</p>
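This situation is easy to construct explicitly. In the sketch below, \(a_1\) is the last coordinate vector, so the kept rows simply have a zero final regressor (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N_d, N_k, D = 3, 20, 3

# Dropped rows are generic; kept rows satisfy x_n^T a_1 = 0 with a_1 = e_D.
X_d = rng.normal(size=(N_d, D))
X_k = np.hstack([rng.normal(size=(N_k, D - 1)), np.zeros((N_k, 1))])
X = np.vstack([X_d, X_k])
w = np.concatenate([np.zeros(N_d), np.ones(N_k)])

# X has full rank, but the reweighted design loses a rank.
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(w[:, None] * X))  # 3 2
```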
<h3 id="reparameterize-to-isolate-the-vanishing-subspace">Reparameterize to isolate the vanishing subspace</h3>
<p>To understand how \(\hat{\theta}(\vec{w})\) behaves, let’s isolate the
coefficient that corresponds to the subspace that vanishes. To that end,
let \(A\) denote an invertible \(D \times D\) matrix whose first column is
\(a_1\).</p>
\[\begin{aligned}
%
A := \left(\begin{array}{c}
a_1 & a_2 & \ldots & a_D \\
\end{array}\right).
%
\end{aligned}\]
<p>Define \(Z:= XA\) and \(\beta := A^{-1} \theta\) so that
\(X\theta = Z\beta\).
Then we can equivalently investigate the behavior of</p>
\[\begin{aligned}
%
\hat{\beta}(\vec{w}) ={}&
\left((\vec{w}\odot Z)^T Z\right)^{-1}
(\vec{w}\odot Z)^T \vec{y}.
%
\end{aligned}\]
<p>If we write \(Z_1\) for the first column of \(Z\) and
\(Z_{2:D}\) for the \(N \times (D - 1)\) remaining columns, we have</p>
\[\begin{aligned}
%
Z=
\left(\begin{array}{cc}
Z_1 & Z_{2:D} \\
\end{array}\right)
=
\left(\begin{array}{cc}
Xa_1 & XA_{2:D} \\
\end{array}\right)
=
\left(\begin{array}{cc}
X_d a_1 & X_d A_{2:D} \\
0_{N_{k}} & X_k A_{2:D} \\
\end{array}\right),
%
\end{aligned}\]
<p>where we have used the definition of \(a_1\) and the
partition of \(X\) from above.</p>
<h3 id="consider-a-straight-path-from-1_n-to-vecw">Consider a straight path from \(1_N\) to \(\vec{w}\)</h3>
<p>Define \(\vec{w}(t) = (\vec{w}- 1_N) t+ 1_N\) for \(t\in [0, 1]\), so that
\(\vec{w}(0) = 1_N\) and \(\vec{w}(1) = \vec{w}\). We can now write an
explicit formula for \(\hat{\beta}(\vec{w}(t))\) as a function of \(t\) and
consider what happens as \(t\rightarrow 1\).</p>
<p>Because \(\vec{w}\) has zeros in its first \(N_{d}\) entries,</p>
\[\begin{aligned}
%
\vec{w}(t) \odot Z=
\left(\begin{array}{cc}
(1-t) X_d a_1 & (1-t) X_d A_{2:D} \\
0_{N_{k}} & X_k A_{2:D} \\
\end{array}
\right)
%
\end{aligned}\]
<p>and</p>
\[\begin{aligned}
%
\left((\vec{w}(t)\odot Z)^T Z\right) ={}&
\left(\begin{array}{cc}
(1-t) a_1^T X_d^T X_d a_1 &
(1-t) a_1^T X_d^T X_d A_{2:D}\\
(1-t) A_{2:D}^T X_d^T X_d a_1 &
A_{2:D}^T ( X_k^T X_k + (1-t) X_d^T X_d )A_{2:D} \\
\end{array}\right).
%
\end{aligned}\]
<p>Since the upper left hand entry \((1-t) a_1^T X_d^T X_d a_1 \rightarrow
0\) as \(t\rightarrow 1\), we can see again that the regression is singular when
evaluated at \(\vec{w}\).</p>
<p>However, by partitioning \(\vec{y}\) into dropped and kept components,
\(\vec{y}_d\) and \(\vec{y}_k\) respectively, we also have</p>
\[\begin{aligned}
%
(\vec{w}(t) \odot Z)^T \vec{y}={}&
\left(\begin{array}{c}
(1-t) a_1^T X_d^T \vec{y}_d\\
A_{2:D}^T \left(X_k^T \vec{y}_k + (1-t)X_d^T \vec{y}_d\right)
\end{array}\right).
%
\end{aligned}\]
<p>One can perhaps see at this point that the \((1-t)\) will
cancel in the numerator and denominator of the regression. This can be
seen directly by the Schur complement. Letting \(\hat{\beta}_1\) denote
the first element of \(\hat{\beta}\), we can use the Schur complement to
write</p>
\[\begin{aligned}
%
\hat{\beta}_1(\vec{w}(t)) =
\frac{(1-t) a_1^T X_d^T \vec{y}_d}{
(1-t) a_1^T X_d^T X_d a_1 -
O((1-t)^2)
}
=
\frac{a_1^T X_d^T \vec{y}_d}{
a_1^T X_d^T X_d a_1 - O((1-t))
}
%
\end{aligned}\]
<p>where \(O(\cdot)\) denotes a term of the specified order
as \(t\rightarrow 1\). We can see that, as \(t\rightarrow 1\),
\(\hat{\beta}_1(\vec{w}(t))\) in fact varies smoothly, and we can expect
linear approximations to work as well as they might even without the
singularity. Formally, the singularity is a “removable singularity,”
analogous to when a factor cancels in a ratio of polynomials.</p>
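One can check numerically that \(\hat{\theta}(\vec{w}(t))\) stays finite and settles down as \(t\rightarrow 1\). This sketch builds a rank-deficient example with \(a_1 = e_D\) (kept rows have a zero final regressor; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N_d, N_k, D = 3, 20, 3

# Kept rows have a zero final regressor, so the weight vector that drops the
# first N_d rows makes the weighted design matrix rank deficient (a_1 = e_D).
X = np.vstack([rng.normal(size=(N_d, D)),
               np.hstack([rng.normal(size=(N_k, D - 1)), np.zeros((N_k, 1))])])
y = rng.normal(size=N_d + N_k)
w_final = np.concatenate([np.zeros(N_d), np.ones(N_k)])

def theta_hat(w):
    WX = w[:, None] * X
    return np.linalg.solve(WX.T @ X, WX.T @ y)

# Along the straight path w(t), the estimate varies smoothly as t -> 1,
# despite the singularity exactly at t = 1.
w_t = lambda t: (w_final - 1.0) * t + 1.0
a = theta_hat(w_t(1.0 - 1e-6))
b = theta_hat(w_t(1.0 - 1e-8))
print(np.max(np.abs(a - b)))  # small: the singularity is removable
```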
<p>An analogous argument holds for the rest of the \(\hat{\beta}\) vector,
again using the Schur complement. Since \(\hat{\theta}\) is simply a
linear transform of \(\hat{\beta}\), the same reasoning applies to
\(\hat{\theta}\) as well.</p>
<h3 id="conslusions-and-consequences">Conclusions and consequences</h3>
<p>Though the regression problem is singular precisely at \(\vec{w}\), it is
in fact well-behaved in a neighborhood of \(\vec{w}\). This is because
re-weighting downweights the contribution of the full-rank rows to both the
numerator and the denominator of the regression estimator, so the vanishing
factors cancel. Singularity occurs only when entries of
\(\vec{w}\) are precisely zero. For theoretical and practical purposes,
you can completely avoid the problem by simply considering weight
vectors that are not precisely zero at the left-out points, taking
instead some arbitrarily small values.</p>
<p>The most common way this seems to occur in practice is when a weight
vector drops all the levels of some indicator. It is definitely on my
TODO list to find some way to allow <code class="language-plaintext highlighter-rouge">zaminfluence</code> to deal with this
gracefully.</p>
<p>Note that this analysis leaned heavily on the structure of linear
regression. In general, when the Hessian matrix of the objective
function is nearly singular, it will be associated with non-linear
behavior of \(\hat{\theta}(\vec{w}(t))\) along a path from \(1_N\) to \(w\).
Linear regression is rather a special case.</p>How does AMIP work for regression when the weight vector induces colinearity in the regressors? This problem came up in our paper, as well as in a couple users of zaminfluence. Superficially, the higher-order infinitesimal jackknife has nothing to say about such a point, since a requirement for the accuracy of the approximation is that the Hessian matrix be uniformly non-singular. However, we shall see that, in the special case of regression, we can re-express the problem so that the singularity disappears.Fiducial inference and the interpretation of confidence intervals.2021-12-09T10:00:00+00:002021-12-09T10:00:00+00:00/philosophy/2021/12/09/fidual_cis<h2 id="why-is-it-so-hard-to-think-correctly-about-confidence-intervals">Why is it so hard to think “correctly” about confidence intervals?</h2>
<p>I came across the following section in the
(wonderful) textbook
<a href="https://moderndive.com/8-confidence-intervals.html">ModernDive</a>:</p>
<blockquote>
<p>Let’s return our attention to 95% confidence intervals.
… A common but incorrect interpretation is: “There is a 95% probability that the
confidence interval contains p.” Looking at Figure 8.27, each of the confidence
intervals either does or doesn’t contain p. In other words, the probability is
either a 1 or a 0.</p>
</blockquote>
<p>(Although I’m going to pick on this quote a little bit, I want to stress that I
love this textbook. This view of CIs is extremely common and I might well have
taken a similar quote from any number of other sources. This book just
happened to be in front of me today.)</p>
<p>I understand what the authors are saying. Given the data we observed and CI we
computed, there is no remaining randomness — either the parameter is in the
interval, or it isn’t. The parameter is not random, the data is. But I think
there is room to admit that this point, while technically clear, is a little
uncomfortable, even for those of us who are very familiar with these concepts.
After all, there is a 95% chance that a randomly chosen interval contains the
parameter. I chose an interval. Why can I no longer say that there is a 95%
chance that the parameter is in that interval? To a beginning student of
statistics who is encountering this idea for the first time, this commonplace
qualification must seem pedantic at best and confusing at worst.</p>
<h2 id="fiducial-inference">Fiducial inference</h2>
<p>Chapter 9 of Ian Hacking’s <em>Logic of Statistical Inference</em> contains a beautiful
account of precisely <em>why</em> we are so inclined towards the “incorrect”
interpretation, as well as the shortcomings of our intuition. The logic is
precisely that of Fisher’s famous (infamous?) fiducial inference.
Understanding this connection not only helps us to better understand CIs (and
their modes of failure), but also to be more sympathetic to the inherent
reasonableness of students who are disinclined to let go of the “incorrect”
interpretation.</p>
<p>As presented by Hacking, there are two relatively uncontroversial building
blocks of fiducial inference, and one problematic one. Recall the idea that
aleatoric probabilities (the stuff of gambling devices) and epistemic
probabilities (degrees of epistemic belief) are fundamentally different
quantities. (Hacking treats this question better than I could, but I also have
<a href="/philosophy/2021/08/22/what_is_statistics.html">a short post on this topic here</a>). Following Hacking, I will denote
the aleatoric probabilities by \(P\) and the degrees of belief by \(p\).</p>
<h3 id="assumption-one-the-frequency-principle">Assumption one: The “frequency principle.”</h3>
<p>The first necessary assumption of fiducial inference is this:</p>
<blockquote>
<p>If you know nothing about an aleatoric event \(E\) other than its probability,
then \(p(E) = P(E)\).</p>
</blockquote>
<p>This amounts to saying that, for pure gambling devices, absent other
information, your subjective belief about whether an outcome occurs should be
the same as the frequency with which that outcome occurs under randomization. If
you know a coin comes up heads 50% of the time (\(P(heads) = 0.5\)), then your
degree of certainty that it will come up heads on the next flip should be the
same (\(p(heads) = 0.5\)). Hacking calls this assumption the “frequency
principle.”</p>
<h3 id="assumption-two-the-logic-of-support">Assumption two: The “logic of support.”</h3>
<p>The second fundamental assumption is that the logic of epistemic probabilities
should be the same as the logic of aleatoric probabilities. Specifically:</p>
<blockquote>
<p>Degrees of belief should obey Kolmogorov’s axioms.</p>
</blockquote>
<p>For example, if events \(H\) and \(I\) are logically mutually exclusive, then
\(p(H \textrm{ or } I) = p(H) + p(I)\). Conditional probabilities such as \(p(H \vert
E)\) measure how much the event \(E\) supports the subjective belief that
\(H\) occurs.</p>
<p>Neither the frequency principle nor the logic of support are particularly
controversial, even for avowed frequentists. Note that assumption one states
only how you arrive at subjective beliefs about systems you know to be
aleatoric, and assumption two describes only how subjective beliefs
combine coherently. So there is nothing really Bayesian here.</p>
<h2 id="hypothesis-testing-and-fiducial-inference">Hypothesis testing and fiducial inference</h2>
<p>Applying the frequency principle and the logic of support to confidence
intervals, together with an additional (more controversial) logical step, will
in fact lead us directly to the “incorrect” interpretation of a confidence
interval. Let’s see how the logic works.</p>
<p>Suppose we have some data \(X\), and we want to know the value of some parameter
\(\theta\). Suppose we have constructed a valid confidence set \(S(X)\) such
that \(P(\theta \in S(X)) = 0.95\). Following Hacking, let \(D\) denote the
event that our setup is correct — specifically, that we are correct about the
randomness of \(X\) and that \(S(X)\) is a valid CI with the desired coverage. That is,
given \(D\), we assume that \(X\) is really random, and we know the randomness,
so \(P\) is a true aleatoric probability — no subjective belief here.</p>
<p>Of course, the construction of a confidence interval guarantees only the
aleatoric probability — thus we have used \(P\), not \(p\). However, by the
frequency principle, we are justified in writing</p>
<p>\(p(\theta \in S(X) \vert D) = P(\theta \in S(X)\vert D) = 0.95\),</p>
<p>so long as we know nothing other than the accuracy of our setup \(D\). (Note
that \(\theta \in S(X)\) is a pivot. In general, pivots play a central role in
fiducial inference.)</p>
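<p>A small simulation makes the aleatoric statement \(P(\theta \in S(X) \vert D) = 0.95\) concrete. (This is an illustrative sketch with arbitrary numbers of my own choosing, not part of the fiducial argument itself.) We draw many datasets from a normal distribution with known variance, form the textbook 95% interval for the mean each time, and count how often the random interval covers the fixed parameter.</p>

```python
import numpy as np

# Illustrative sketch: the aleatoric coverage of a standard 95% CI for a
# normal mean with known sigma. All numbers here are arbitrary choices.
rng = np.random.default_rng(0)
theta, sigma, n, n_sims = 3.0, 1.0, 50, 20000
half_width = 1.96 * sigma / np.sqrt(n)

covered = 0
for _ in range(n_sims):
    x = rng.normal(theta, sigma, size=n)
    xbar = x.mean()
    covered += (xbar - half_width <= theta <= xbar + half_width)

print(covered / n_sims)  # close to 0.95
```

<p>Before any particular dataset is seen, the frequency principle lets us adopt this 0.95 as a degree of belief; the controversy begins only after conditioning on \(X\).</p>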
<p>Note that \(p(\theta \in S(X) \vert D)\) is very near to our “incorrect”
interpretation of confidence intervals! However, in reality, we know more than
\(S(X)\): we actually observe \(X\) itself. Now, \(P(\theta \in S(X) \vert D, X)\)
is either \(0\) or \(1\). Conditional on \(X\), there is no remaining
aleatoric uncertainty to which we can apply the frequency principle. And
most authors — including those of the quote that opened this post — stop here.</p>
<p>There is an additional assumption, however, that allows us to formally compute
\(p(\theta \in S(X) \vert X, D)\), and it is this (controversial) assumption that is
at the core of fiducial inference:</p>
<h3 id="assumption-three-irrelevance">Assumption three: Irrelevance</h3>
<blockquote>
<p>The full data \(X\) tells us nothing more about \(\theta\) (in an epistemic sense),
than the confidence interval \(S(X)\).</p>
</blockquote>
<p>In the case of confidence intervals, the assumption of irrelevance requires at
least two things. First, it requires that our subjective belief that \(\theta
\in S(X)\) does not depend on the particular interval that we compute
from the data. In other words, our belief that the CI
contains the parameter is the same no matter where its endpoints lie. Second, it
requires that there is nothing more to be learned about the parameter
from the data <em>other</em> than the information contained in the CI.</p>
<p>These are strong assumptions! However, when they hold, they justify the
“incorrect” interpretation of confidence intervals — namely that there is a
95% subjective probability that \(\theta \in S(X)\), given the data we observed.
For, under the assumption of irrelevance, by the logic of support (and then the
frequency principle as above) we can write</p>
<p>\(p(\theta \in S(X) \vert X, D) =
p(\theta \in S(X) \vert D) =
P(\theta \in S(X) \vert D) = 0.95\).</p>
<h2 id="how-does-this-go-wrong-and-what-does-it-mean-for-teaching">How does this go wrong, and what does it mean for teaching?</h2>
<p>Assumption three is often hard to justify, or outright fallacious. But one of
its strengths is that it points to <em>how</em> the logic of fiducial inference fails,
when it does fail. In particular, it is not hard to construct valid confidence
intervals that contain only impossible values of \(\theta\) for some values of
\(X\). (As long as a confidence interval takes on crazy values sufficiently
rarely, there is nothing in the definition preventing it from doing so.) In
fact, as Hacking points out, confidence intervals are tools for <em>before</em> you see
the data, designed so that you do not make mistakes too often on average; they
can suggest strange conclusions once you have seen a particular dataset.</p>
<p>However, it’s not crazy for someone, especially a beginning student, to
subscribe to assumption three, even if they are not aware of it. After all, we
typically present a confidence interval as <em>the</em> way to summarize what your data
tells you about your parameter. And if that’s the case, then the “incorrect”
interpretation of CIs follows from the extremely plausible frequency principle
and logic of support. At the least I think we should acknowledge the
reasonableness of this logical chain, and teach when it goes wrong rather than
simply reject it by fiat.</p>Why is it so hard to think “correctly” about confidence intervals?To think about the influence function, think about sums.2021-12-01T10:00:00+00:002021-12-01T10:00:00+00:00/amip/2021/12/01/influence_is_sum<p>I think the key to thinking intuitively about the influence function in our
<a href="https://arxiv.org/abs/2011.14999">work on AMIP</a> is this: Linearization
approximates a complicated estimator with a simple sum. If you can establish
that the linearization provides a good approximation, then you can reason about
your complicated estimator by reasoning about sums. And sums are easy to reason
about.</p>
<p>Specifically, suppose you have data weights \(w = (w_1, \ldots, w_N)\) and
an estimator \(\phi(w) \in \mathbb{R}\) which depends on the weights
in some complicated way. Let \(\phi^{lin}\) denote the first-order
Taylor series expansion around the unit weight vector \(\vec{1} := (1, \ldots, 1)\)</p>
\[\phi^{lin}(w) :=
\phi(\vec{1}) +
\sum_{n=1}^N \psi_n (w_n - 1) =
\phi(\vec{1}) + \sum_{n=1}^N \psi_n w_n
\quad\textrm{where}\quad
\psi_n := \frac{\partial \phi(w)}{\partial w_n}\Bigg|_{\vec{1}},\]
<p>and we have used the fact that \(\sum_{n=1}^N \psi_n = 0\) for Z-estimators.
(For situations where \(\sum_{n=1}^N \psi_n \ne 0\), just keep that sum around,
and everything I say in this post still applies.) Thinking now of \(\psi\) as
data, we can (in some abuse of notation) write \(\phi^{lin}(\psi) =
\phi(\vec{1}) + \sum_{n=1}^N \psi_n\). If \(\phi^{lin}(w)\) is a good
approximation to \(\phi(w)\), then the effect of leaving a datapoint out of
\(\phi(w)\) is well-approximated by the effect of leaving the corresponding
entry out of \(\psi\) in \(\phi^{lin}(\psi)\). We have, in effect, replaced a
complicated data dependence with a simple sum of terms. This is what
linearization does for us. (NB: if our original estimator had been a sum of the
data, the linearization would be exact!)</p>
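<p>A concrete sketch, using a ratio of sums as a simple nonlinear \(\phi(w)\) (my own toy example, not one from the paper): the influence scores come from differentiating at \(w = \vec{1}\), and dropping a point is well approximated by subtracting its score from the full estimate.</p>

```python
import numpy as np

# Illustrative sketch: phi(w) = sum(w * y) / sum(w * x), a simple nonlinear
# function of the weights. Differentiating at w = 1 gives the influence
# scores psi_n = (y_n - phi * x_n) / sum(x), and dropping point k is
# approximated by phi(1) - psi_k.
rng = np.random.default_rng(2)
N = 500
x = rng.uniform(1.0, 2.0, N)
y = 0.7 * x + rng.normal(0.0, 0.1, N)

phi_full = y.sum() / x.sum()
psi = (y - phi_full * x) / x.sum()          # note: sum(psi) == 0

k = 17                                      # drop one arbitrary point
phi_drop = (y.sum() - y[k]) / (x.sum() - x[k])   # exact leave-one-out
phi_lin = phi_full - psi[k]                      # linear approximation
print(phi_drop, phi_lin)                    # very close for large N
```

<p>For this estimator the leave-one-out error of the linearization is second order in \(1/N\), which is what makes "reasoning about sums" a faithful substitute for refitting.</p>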
<p>Typically \(\psi_n = O_p(N^{-1})\), so it’s a little helpful to define
\(\gamma_n := N \psi_n\). We then can write:</p>
\[\phi^{lin}(\gamma) := \phi(\vec{1}) + \frac{1}{N}\sum_{n=1}^N \gamma_n.\]
<p>We can now ask what kinds of changes we can produce in \(\phi^{lin}(\gamma)\) by
dropping entries from \(\gamma\) (while keeping \(N\) the same), and some of the
core conclusions of our paper become obvious. Definitionally, \(\sum_{n=1}^N
\gamma_n = 0\). For example, if we drop \(\alpha N\) points, for some fixed \(0 <
\alpha < 1\), then the amount we can change the sum \(\frac{1}{N}\sum_{n=1}^N
\gamma_n\) does not vanish, no matter how large \(N\) is, and no matter how
small \(\alpha\) is. The amount you can change the sum
\(\frac{1}{N}\sum_{n=1}^N \gamma_n\) also obviously depends on the tail shape of
the distribution of the \(\gamma_n\), as well as their absolute scale.
Increasing the scale (i.e., increasing the noise) obviously increases the amount
you can change the sum. And, for a given scale (i.e., a given \(\frac{1}{N}
\sum_{n=1}^N \gamma_n^2\)), you will be able to change the sum by the most when
the left-out \(\gamma_n\) all take the same value.</p>
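<p>This worst-case calculation is easy to carry out numerically (an illustrative sketch with arbitrary Gaussian \(\gamma_n\), not data from the paper): to decrease the mean of the \(\gamma_n\) as much as possible by dropping \(\lfloor \alpha N \rfloor\) entries, drop the largest ones, keeping the divisor \(N\) fixed.</p>

```python
import numpy as np

# Illustrative sketch: the largest change in (1/N) * sum(gamma) achievable
# by dropping floor(alpha * N) entries, for Gaussian gamma_n. The change
# does not vanish as N grows, for fixed alpha.
rng = np.random.default_rng(3)
N, alpha = 10_000, 0.01
gamma = rng.normal(0.0, 1.0, N)
gamma -= gamma.mean()                       # enforce sum(gamma) == 0

m = int(np.floor(alpha * N))
largest = np.sort(gamma)[-m:]               # the m largest entries
change = -largest.sum() / N                 # shift in (1/N) * sum(gamma)
print(change)  # negative, and stays away from zero as N grows
```

<p>For standard Gaussian \(\gamma_n\), this change converges to \(-\alpha\) times the mean of the upper \(\alpha\)-tail, a fixed negative number, which is the "does not vanish" conclusion above.</p>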
<p>So one way to think about AMIP is this: we provide a good approximation to your
original statistic that takes the form of a simple sum over your data. Dropping
datapoints corresponds to dropping data from this sum. You can then think about
whether dropping sets that are selected in a certain way is reasonable or not
in terms of dropping entries from a sum, about which it’s easy to have good
intuition!</p>I think the key to thinking intuitively about the influence function in our work on AMIP is this: Linearization approximates a complicated estimator with a simple sum. If you can establish that the linearization provides a good approximation, then you can reason about your complicated estimator by reasoning about sums. And sums are easy to reason about.The bootstrap randomly queries the influence function.2021-11-08T10:00:00+00:002021-11-08T10:00:00+00:00/amip/2021/11/08/bootstrap_influence<p>When we present our <a href="https://arxiv.org/abs/2011.14999">work on AMIP</a> the
relationship with the bootstrap often comes up. I think there’s a lot to say,
but there’s one particularly useful perspective: the (empirical, nonparametric)
bootstrap can be thought of as <em>randomly querying the influence function</em>.
From this perspective, it seems clear both (a) why the bootstrap works as
an estimator of variance and (b) why it won’t work to find the
approximately most influential set, i.e., the set of points which
have the most extreme values of the influence function (AMIS in our paper).</p>
<p>Let’s suppose that you have a vector \(\psi \in \mathbb{R}^N\), with
\(\sum_{n=1}^N \psi_n = 0\), where \(N\) is very large. We would like to know
about \(\psi\), but suppose we can’t access it directly. Rather, we can only
query it via inner products \(v^T \psi\). Moreover, suppose we can only compute
\(B\) such inner products, where \(B \ll N\). For the purpose of this post,
\(\psi\) will be the influence scores, \(v\) will be rescaled bootstrap weight
vectors, \(N\) will be the number of data points, and \(B\) the number of
bootstrap samples. But the discussion can start out more generally.</p>
<p>Suppose we don’t know anything a priori about \(\psi\), so we query it randomly,
drawing IID entries for \(v\) from a distribution with mean zero and unit
variance. Let the \(b\)-th random vector be denoted \(v^{b}\). We can ask:
What can the collection of inner products \(V_B := \{v^{1} \cdot \psi, \ldots,
v^{B} \cdot \psi \}\) tell us about \(\psi\)?</p>
<p>At first glance, the answer seems to be “not much other than the scale.” The
set \(V_B\) tells us about the projection of \(\psi\) onto a \(B\)-dimensional
subspace, out of \(N \gg B\) total dimensions. Furthermore, since
\(\mathbb{E}[v \cdot \psi] = \sum_{n=1}^N \mathbb{E}[v_n] \psi_n = 0\), the
vectors \(v^b\) are, on average, orthogonal to \(\psi\). So we do not expect
the projection of \(\psi\) onto the space spanned by \(V_B\) to account for an
appreciable proportion of \(|| \psi ||_2\). The set \(V_B\) <em>can</em> estimate the
scale \(|| \psi ||_2\), however, since \(\mathrm{Var}(v \cdot \psi) =
\sum_{n=1}^N \mathbb{E}[v_n^2] \psi_n^2 = \sum_{n=1}^N \psi_n^2 =
||\psi||_2^2\), and \(\mathrm{Var}(v \cdot \psi)\) can be estimated using
the sample variance of \(v^b \cdot \psi\).</p>
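<p>A quick simulation illustrates this (a sketch of my own, with Gaussian probe vectors standing in for the approximately-Poisson bootstrap weights): even though \(B \ll N\), the sample variance of the inner products recovers \(||\psi||_2^2\) well.</p>

```python
import numpy as np

# Illustrative sketch: B << N random probes v with IID mean-zero,
# unit-variance entries. The sample variance of v . psi estimates
# ||psi||_2^2, even though the probes span almost none of R^N.
rng = np.random.default_rng(4)
N, B = 100_000, 500
psi = rng.normal(0.0, 1.0, N)
psi -= psi.mean()                           # enforce sum(psi) == 0

products = np.array([rng.normal(0.0, 1.0, N) @ psi for _ in range(B)])
print(products.var(), psi @ psi)            # the two are comparable
```

<p>The relative error of this variance estimate scales like \(\sqrt{2/B}\), independent of \(N\), which is exactly why a few hundred bootstrap replicates suffice for the scale while telling us almost nothing about individual entries of \(\psi\).</p>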
<p>Note that the bootstrap is very similar to drawing \(v_n + 1 \sim
\mathrm{Poisson}(1)\); the proper bootstrap actually has some correlation
between different entries due to the constraint \(\sum_{n=1}^N v_n = 0\), but
this correlation is of order \(1/N\) and can be neglected for simplicity in the
present argument. The argument of the previous paragraph implies that the
bootstrap effectively randomly projects \(\psi\) onto a very low-dimensional
subspace, presumably losing most of its detail in doing so. It also makes sense
that the bootstrap can tell us about \(||\psi||_2\) — recall that
\(||\psi||_2^2\) consistently estimates the variance of the limiting distribution
of our statistic, a quantity that we know the bootstrap is also able to
estimate.</p>
<p>Recall that the AMIP from our paper is \(-\sum_{n=1}^{\lfloor \alpha N \rfloor}
\psi_{(n)}\), where \(\psi_{(n)}\) is the \(n\)-th sorted entry of the \(\psi\)
vector. From the argument sketch above I conjecture that the bootstrap
distribution actually doesn’t convey much information about the AMIP other than
the limiting variance. In particular, in the terms of our paper, I conjecture
that the bootstrap can tell us about the “noise” of AMIP but not the “shape.”</p>
<p>Incidentally, the above perspective is also relevant for situations where we
cannot form and / or invert the full Hessian matrix \(H\), and so cannot compute
\(\psi\) directly. If we imagine sketching \(H^{-1}\), e.g. by using the
conjugate gradient method applied to random vectors, we would run into a
problem conceptually quite similar to the bootstrap. It’s interesting
to think about how one could improve on random sampling in such a case.</p>When we present our work on AMIP the relationship with the bootstrap often comes up. I think there’s a lot to say, but there’s one particularly useful perspective: the (empirical, nonparametric) bootstrap can be thought of as randomly querying the influence function. From this perspective, it seems clear both (a) why the bootstrap works as an estimator of variance and (b) why it won’t work to find the approximately most influential set, i.e., the set of points which have the most extreme values of the influence function (AMIS in our paper).