Ryan Giordano, statistician. This is the professional webpage and open research journal of Ryan Giordano.

Can you use data to choose your prior? (2023-09-20)

<p>Can you use data to choose your prior? Some might say you can, pointing, for
example, to empirical Bayes procedures, which are formally doing just that. But
I would argue that the best answer to this question is no — that such
procedures are better thought of as estimating more complex models rather than
prior selection, and that answering this question in the affirmative elides the
subjectivity that is inherent in Bayesian modeling.</p>
<h1 id="does-empirical-bayes-eliminate-the-need-to-choose-a-prior">Does empirical Bayes eliminate the need to choose a prior?</h1>
<p>In Bayesian statistics, we have to choose a prior, and sometimes we really don’t
know what the prior should be. So a natural question is whether the dataset at
hand can tell us what prior we should choose. And there are certainly
well-known procedures, like empirical Bayes, that might give the impression that
the answer to this question is yes: given some set of candidate priors, we can
choose the one that is most consistent with the data. Doing so is the natural
continuation of the idea that our prior should obviously place nonzero mass on
the data. Given this perspective, some folks might readily answer yes:
you can use your data to pick your prior.</p>
<p>The difficulty is that the success of empirical Bayes — or indeed the Bayesian
model selection problem that empirical Bayes approximates — depends
heavily on the <em>non-expressiveness</em> of the candidate priors you’re choosing
from. If allowed to choose any prior at all, both empirical Bayes and Bayesian
model selection will choose a degenerate prior that is a point mass at the
maximum likelihood estimate (MLE), which of course leads to an absurd point mass
posterior.</p>
<h1 id="a-simple-normal-counterexample">A simple normal counterexample</h1>
<p>Take the following simple example. Suppose we have a single scalar observation,
\(P(X | \theta) = N(X | \theta, 1)\). (You might think of \(X\) as a suitably
rescaled sample mean from IID normal observations with known variance.) Let us
use the conjugate prior \(P(\theta | \mu, \sigma) = N(\theta | \mu, \sigma^2)\),
which is parameterized by \(\mu\) and \(\sigma\).</p>
<p>What happens if we use empirical Bayes to select \(\mu\) and \(\sigma\)?
One common way to do so is to take</p>
\[\hat{\mu}, \hat{\sigma} =
\textrm{argmax } P(X | \mu, \sigma) =
\textrm{argmax } \int P(X | \theta) P(\theta | \mu, \sigma) d\theta.\]
<p>In this case, standard properties of the normal distribution give \(P(X | \mu,
\sigma) = N(X | \mu, 1 + \sigma^2)\). Irrespective of \(\sigma\), the optimal mean
is \(\hat{\mu} = X\) (the MLE of \(\theta\)). Given \(\hat{\mu} = X\), the
marginal probability \(P(X | X, \sigma)\) is increased by reducing the variance,
i.e. taking \(\sigma = 0\). The “empirical Bayes” prior is then \(N(\theta | X,
0)\) — a degenerate point mass at the MLE.</p>
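<p>To see this degeneracy concretely, here is a quick numerical sketch (in Python, with a made-up observation \(X = 1.7\)): for any fixed \(\sigma\), the marginal likelihood \(N(X | \mu, 1 + \sigma^2)\) peaks at \(\mu = X\), and at \(\mu = X\) it only increases as \(\sigma\) shrinks to zero.</p>

```python
import numpy as np

x = 1.7  # a single made-up observed datapoint

def log_marginal(x, mu, sigma):
    # log P(X | mu, sigma) = log N(X | mu, 1 + sigma^2), by normal conjugacy
    var = 1.0 + sigma**2
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mu)**2 / var

# For fixed sigma, the marginal likelihood is maximized at mu = X (the MLE):
mus = np.linspace(-5.0, 5.0, 1001)
best_mu = mus[np.argmax([log_marginal(x, mu, 1.0) for mu in mus])]

# Given mu = X, shrinking sigma only increases the marginal likelihood:
sigmas = np.array([2.0, 1.0, 0.5, 0.1, 0.0])
lps = np.array([log_marginal(x, x, s) for s in sigmas])

print(best_mu)                          # equal to x, up to grid resolution
print(bool(np.all(np.diff(lps) > 0)))  # True
```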
<p>This phenomenon is readily seen to be general. What measure \(P(\theta)\)
maximizes the integral \(\int P(X | \theta) P(\theta) d\theta\)? It is the
measure that puts all its mass on the MLE, \(\textrm{armax } P(X | \theta)\).
Bayesian model selection, which assigns posterior probability proportional
to the marginal likelihood, exhibits the same pathology, unless expressive
priors are downweighted or the class of priors is restricted.</p>
<h1 id="the-prior-class-needs-to-be-restricted-a-priori-somehow">The prior class needs to be restricted <em>a priori</em> somehow</h1>
<p>So empirical Bayes, if it does anything useful, does so because the class of
priors it is permitted to choose from is sufficiently restricted to avoid this
pathological behavior. For example, in our toy normal model above, we might fix
\(\sigma\) at some large value and choose only \(\mu\) with empirical Bayes.
The selection of the expressivity of the class of candidate priors is itself a
decision that must be made <em>a priori</em> — it is tantamount to choosing a prior,
and unavoidably involves prior judgement.</p>
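<p>As a sketch of what such a restriction buys us (made-up numbers; \(\sigma\) is fixed in advance and only \(\mu\) is chosen by empirical Bayes), the conjugate-normal posterior stays non-degenerate:</p>

```python
import numpy as np

x = 1.7        # made-up observed datapoint
sigma = 10.0   # prior sd fixed a priori; empirical Bayes chooses only mu

mu_hat = x  # argmax_mu N(x | mu, 1 + sigma^2) is mu = x for any fixed sigma

# Conjugate normal update: prior N(mu_hat, sigma^2), likelihood N(theta, 1).
post_var = 1.0 / (1.0 + 1.0 / sigma**2)        # = sigma^2 / (1 + sigma^2) > 0
post_mean = post_var * (x + mu_hat / sigma**2)

print(post_mean)  # equals x here, since mu_hat = x
print(post_var)   # strictly positive: no point-mass pathology
```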
<p>Arguably, allowing the data to adjudicate between multiple “priors” is best
thought of as a more expressive model that uses a mixture of priors — a
mixture model whose expressivity must be limited <em>a priori</em>, either by limiting
the class of candidate priors, assigning expressive ones low (hyper-)prior
probability, or both. This is, of course, a matter of semantics, but I think it
is a valuable one, in that it reserves the word “prior” for things that are
truly prior to the data, rather than simply another level in the modeling
hierarchy. And it emphasizes the fact that, like it or not, every Bayesian
model needs real, subjective, <em>a priori</em> decisions.</p>

Three versions of risk-controlling prediction sets (2023-06-30)

<p>I had the honor and good fortune to present the “Distribution-Free,
Risk-Controlling Prediction Sets” paper (RCPS, [1]) at the Jordan symposium this
month. The paper is actually a bit more general than it first appears. Their
key contribution — the use of more complex and general loss functions — can
actually play well with more traditional conformal inference methods.</p>
<p>I got to talk to Stephen and Anastasios, and they (and presumably the other
authors) are all well aware of the variants I’m about to describe, and other
sophisticated readers of the conformal literature will have seen these variants,
too. But it took me some thought, so I thought it was worth writing up here.</p>
<h1 id="setup">Setup</h1>
<p>As with classical conformal, RCPS data takes the form of IID pairs \(Z_m = (X_m,
Y_m)\), and we want to form a set, \(S(X_n)\), such that \(Y_n \in S(X_n)\) for
a new datapoint \(Z_n\). We want this to happen with high probability according
to some notion thereof — we’ll explore a few different versions below.</p>
<p>I will assume (as in RCPS) that we have a family of sets that are parameterized
by a scalar parameter \(\lambda\), writing \(S_\lambda(\cdot)\). The size of
the sets must be non-decreasing in \(\lambda\), so the task is to choose a
sufficiently large \(\hat{\lambda}\) with the help of a “calibration” data set
\(\mathcal{Z} := Z_1, \ldots, Z_N\) so that the sets are large enough (but not
too large). I’ll write \(S_{\hat{\lambda}}(\cdot)\) to emphasize with the hat
that the value of \(\hat{\lambda}\) depends on a randomly selected calibration
set.</p>
<p>Classical conformal inference (CI) produces intervals such that</p>
\[\underset{\mathcal{Z},Z_n}{\mathcal{P}}\left(
Y_n \in S_{\hat{\lambda}}(X_n)
\right) \ge 1 - \varepsilon, \quad\quad\textrm{Eq. 1 (traditional CI)}\]
<p>for some target error \(\varepsilon\). That is, there is high probability that a
new datapoint’s response lies within the given set, where the probability is
taken jointly over the new datapoint and the calibration set.</p>
<p>RCPS does something formally different. They take a loss function
\(L(Y_n, S)\) (which is non-increasing in the size of \(S\)), and
control</p>
\[\underset{\mathcal{Z}}{\mathcal{P}}\left(
\underset{Z_n}{\mathbb{E}}\left[
L(Y_n, S_{\hat{\lambda}}(X_n))
\right] \le \alpha
\right) \ge 1 - \delta, \quad\quad\textrm{Eq. 2 (RCPS)}\]
<p>for some target risk level \(\alpha\) and error level \(\delta\).</p>
<h1 id="differences-from-traditional-ci">Differences from traditional CI</h1>
<p>Superficially, there are three differences between RCPS and CI:</p>
<ol>
<li>The use of a generic loss function</li>
<li>Separately controlling the randomness from the calibration
set and new data point</li>
<li>Controlling a “point estimate” (the expected loss) rather
than producing an interval (e.g., guaranteeing that the
loss is less than some amount some fraction of the time)</li>
</ol>
<p>A naive reader (e.g. me, on the first read) might wonder whether all three
differences are tied together somehow. But these features can all be achieved
separately — in particular, we can provide interval-like guarantees with
generic losses, as well as separately control the randomness in the calibration
set and new datapoint.</p>
<h1 id="traditional-ci-with-a-generic-loss">Traditional CI with a generic loss</h1>
<p>First, let’s do something like traditional CI but with a generic
loss function. That might mean choosing \(\hat{\lambda}\) so that</p>
\[\underset{\mathcal{Z},Z_n}{\mathcal{P}}\left(
L(Y_n, S_{\hat{\lambda}}(X_n)) \le \beta
\right) \ge 1 - \varepsilon, \quad\quad\textrm{Eq. 3 (loss)}\]
<p>for some \(\beta\) and some \(\varepsilon\). Here we have retained difference
(1), but not (2) and (3) — we provide an instance-wise interval for the loss
rather than a point estimate (3), and have not separately controlled the
randomness in the calibration and test point (2). To achieve Eq. 3, we can
invert the map \(\lambda \mapsto L(Y_m, S_{\lambda}(X_m))\). Define</p>
\[\lambda(Z_m) := \inf \, \{ \lambda: L(Y_m, S_{\lambda}(X_m)) \le \beta \}.\]
<p>The values \(\lambda(Z_m)\) on the calibration set are exchangeable with the
\(\lambda(Z_n)\) on a new datapoint. Taking \(\lambda(Z_m)\) as our “conformity
scores” and applying traditional CI thus gives Eq. 3.</p>
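<p>As a sanity check, here is a small simulation of Eq. 3 (all ingredients made up: a fixed predictor \(f\), interval sets \(S_\lambda(x) = [f(x) - \lambda, f(x) + \lambda]\), and the distance from \(y\) to the set as the generic loss):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up predictor; sets S_lambda(x) = [f(x) - lam, f(x) + lam]; loss is the
# distance from y to the set, which is non-increasing in lambda.
f = lambda x: 2.0 * x
def loss(y, x, lam):
    return np.maximum(0.0, np.abs(y - f(x)) - lam)

beta, eps, N = 0.1, 0.1, 2000
x_cal = rng.uniform(0.0, 1.0, N)
y_cal = f(x_cal) + rng.normal(0.0, 0.3, N)

# Invert the map lambda -> loss at each calibration point:
# lambda(Z_m) = inf{lam : loss(Y_m, S_lam(X_m)) <= beta}.
lam_scores = np.maximum(0.0, np.abs(y_cal - f(x_cal)) - beta)

# Standard split-conformal quantile with the finite-sample correction:
k = int(np.ceil((N + 1) * (1 - eps)))
lam_hat = np.sort(lam_scores)[k - 1]

# On fresh data, the loss should fall below beta about 1 - eps of the time:
x_new = rng.uniform(0.0, 1.0, 5000)
y_new = f(x_new) + rng.normal(0.0, 0.3, 5000)
cover = np.mean(loss(y_new, x_new, lam_hat) <= beta)
print(cover)
```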
<h1 id="traditional-ci-with-separate-control-on-the-randomness">Traditional CI with separate control on the randomness</h1>
<p>Similarly, we can achieve (2) without (1) and (3), doing traditional CI
intervals but with separate control over the randomness in the new datapoint and
calibration dataset. Specifically, we’d like to find a \(\hat{\lambda}\) so
that</p>
\[\underset{\mathcal{Z}}{\mathcal{P}}\left(
\underset{Z_n}{\mathcal{P}}\left(
Y_n \in S_{\hat{\lambda}}(X_n)
\right) \ge 1 - \gamma
\right) \ge 1 - \delta. \quad\quad\textrm{Eq. 4 (separate randomness control)}\]
<p>Eq. 4 is exactly Eq. 1, but we have separated out the sources of randomness. Eq.
4 can be achieved by doing standard RCPS with the indicator loss function:</p>
\[L(Y_n, S) = \mathbb{I}\left( Y_n \notin S \right).\]
<p>As pointed out by [1] (see Proposition 4 and discussion), the loss here is
binomial and so the Bentkus bound produces nearly tight intervals.</p>
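<p>Here is a sketch of that construction, with a Hoeffding upper confidence bound standing in for the tighter Bentkus bound, and a made-up predictor and data:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# RCPS with the indicator loss (Hoeffding UCB in place of the tighter Bentkus
# bound). The predictor and data-generating process are made up.
f = lambda x: 2.0 * x
N, gamma, delta = 2000, 0.1, 0.05
x_cal = rng.uniform(0.0, 1.0, N)
y_cal = f(x_cal) + rng.normal(0.0, 0.3, N)

# Empirical indicator loss: fraction of calibration points falling outside
# S_lambda(x) = [f(x) - lam, f(x) + lam].
def miscoverage(lam):
    return np.mean(np.abs(y_cal - f(x_cal)) > lam)

# Scan lambda downward; keep the smallest lambda whose Hoeffding upper
# confidence bound on the expected indicator loss stays below gamma.
slack = np.sqrt(np.log(1.0 / delta) / (2.0 * N))
lam_hat = 2.0
for lam in np.linspace(2.0, 0.0, 401):
    if miscoverage(lam) + slack <= gamma:
        lam_hat = lam
    else:
        break

# With probability >= 1 - delta over the calibration set, the conditional
# coverage P(Y_n in S(X_n)) should be at least 1 - gamma (Eq. 4):
x_new = rng.uniform(0.0, 1.0, 5000)
y_new = f(x_new) + rng.normal(0.0, 0.3, 5000)
cover = np.mean(np.abs(y_new - f(x_new)) <= lam_hat)
print(cover)
```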
<h1 id="generic-loss-functions-and-high-probability-interval-bounds">Generic loss functions and high-probability interval bounds</h1>
<p>Finally, we can achieve (1) and (2) but not (3). By combining
the previous two ideas, we can find sets satisfying</p>
\[\underset{\mathcal{Z}}{\mathcal{P}}\left(
\underset{Z_n}{\mathcal{P}}\left(
L(Y_n, S_{\hat{\lambda}}(X_n)) \le \beta
\right) \ge 1 - \gamma
\right) \ge 1 - \varepsilon. \quad\quad\textrm{Eq. 5 (loss, intervals, separate control)}\]
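<p>A sketch of the combined construction: run the same RCPS-style calibration, but on the indicator of the event \(\{L(Y, S_\lambda(X)) > \beta\}\) (again with a Hoeffding bound in place of tighter ones, and the same made-up predictor and distance-to-set loss as before):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up predictor; generic loss = distance from y to the interval set.
f = lambda x: 2.0 * x
def loss(y, x, lam):
    return np.maximum(0.0, np.abs(y - f(x)) - lam)

N, beta, gamma, eps = 2000, 0.1, 0.1, 0.05
x_cal = rng.uniform(0.0, 1.0, N)
y_cal = f(x_cal) + rng.normal(0.0, 0.3, N)

# Choose the smallest lambda whose Hoeffding UCB on P(loss > beta) stays
# below gamma, scanning lambda downward.
slack = np.sqrt(np.log(1.0 / eps) / (2.0 * N))
lam_hat = 2.0
for lam in np.linspace(2.0, 0.0, 401):
    if np.mean(loss(y_cal, x_cal, lam) > beta) + slack <= gamma:
        lam_hat = lam
    else:
        break

# With probability >= 1 - eps over the calibration set, the inner probability
# P(loss <= beta) should be at least 1 - gamma (Eq. 5):
x_new = rng.uniform(0.0, 1.0, 5000)
y_new = f(x_new) + rng.normal(0.0, 0.3, 5000)
cover = np.mean(loss(y_new, x_new, lam_hat) <= beta)
print(cover)
```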
<h1 id="comparison">Comparison</h1>
<p>To me, Eq. 5 looks like the best sort of guarantee — interval control over a
generic loss, with separate control of the randomness. In fact, I rather wish
the original RCPS paper had looked at bounds of the form Eq. 5 rather than Eq.
2.</p>
<p>Of course, computing the \(\lambda(Z_m)\) for each calibration point requires
inverting \(N\) loss functions rather than a single empirical loss as in the
RCPS paper. Furthermore, Anastasios pointed out to me that the concentration
bounds used by RCPS to separately control the randomness (difference (2))
converge at rate \(1/\sqrt{N}\), while traditional CI is based on quantile
estimates which are accurate at rate \(1/N\). So there is both a computational
and theoretical price to be paid.</p>
<p>But the main point is that the machinery developed in [1] allows you to pick and
choose what you need for your particular problem. In this
sense, [1] represents an even richer set of techniques than might
appear at first glance.</p>
<h1 id="references">References</h1>
<p>[1] Bates, S., Angelopoulos, A., Lei, L., Malik, J. and Jordan, M., 2021.
Distribution-free, risk-controlling prediction sets. Journal of the ACM (JACM), 68(6), pp. 1-34.</p>

Meaning and randomness (2023-04-10)

<blockquote>
<p>“If the various formations had had some <em>meaning</em>, if, for example,
there had been concealed signs and messages for us
which it was important we decode correctly, unceasing attention to
what was happening would have been inescapable and understandable.
But this was not the case of course, the various
cloud shapes and hues meant <em>nothing</em>, what they looked like at
any given juncture was based on chance, so if there was anything the
clouds suggested it was meaninglessness in its purest form.” <br />
— My Struggle Volume One, Karl Ove Knausgaard, p. 388.</p>
</blockquote>
<blockquote>
<p>“Moreover a swollen sea often gives warning of winds to come, when
suddenly and from its depths it begins to swell, and rocks, white and foamy with
snowy brine, strive to reply to Neptune with gloom-inducing voices or when
a shrill whistle arising from a lofty mountain peak grows stronger, repulsed
by the barrier of crags.”<br />
— Cicero, On Divination, Book One.</p>
</blockquote>
<p>In modern times, chance or randomness is invoked for two purposes: one, to
permit calculation of uncertainty, or, two, to evoke meaninglessness. These two
purposes have nothing essentially to do with one another. Statistical reasoning
does not depend on the meaning of chance events, only their long-run regularity
or etiological independence from other events. And meaninglessness has never
required a mathematical basis. But the two purposes share a common root, and
perhaps are mutually supportive, via the statistical analogy that places
aleatoric gambling devices at the center of all inductive reasoning.</p>
<p>What, in the precise shape of a cloud or shape of an ocean’s wave, is random?
Despite the usefulness of studying them with aleatoric models [1], the analogy
between clouds or waves and aleatoric devices is tenuous. Perhaps one can
imagine that a particular cloud is a single sample taken out of the long run
from a chaotic and ergodic system, or imagine a population of clouds on days and
locations considered “equivalent” by definition. But doing so functions less to
tell us about what we see than to form a <em>post hoc</em> justification of
statistical methods that are our only formal recourse in the absence of
deterministic mechanisms or deductive reasoning.</p>
<p>And, because of the inherent vacuousness of dice rolls and coin flips, the
statistical analogy comes at a cost. When the inexplicable is taken by default
to be random, and randomness is understood by analogy with aleatoric devices,
meaninglessness can steal into otherwise marvelous phenomena where it has no
inherent place. When we look at the precise shape and details of a particular
cloud or a wave, we see the inexplicable, but we should be free to choose the
form taken by this inexplicability. Aleatoric devices need not be our only
metaphor for the unknowable. Neither should it be necessary to imagine a divine
power sending potentially legible messages. Instead, perhaps we can keep
divination and statistics in their cages, and strive to see the infinite variety
of the world as simultaneously inscrutable and full of meaning. To help with
this, we might need new metaphors.</p>
<p>[1] Berlinghieri, Renato, Brian L. Trippe, David R. Burt, Ryan Giordano, Kaushik
Srinivasan, Tamay Özgökmen, Junfei Xia, and Tamara Broderick. “Gaussian
processes at the Helm (holtz): A more fluid model for ocean currents.” arXiv
preprint arXiv:2302.10364 (2023).</p>

Free will and randomness (2023-03-29)

<p>Free will and randomness feel opposed to one another: free will is what makes us
human; randomness is the epitome of meaninglessness. But the two share a deep
affinity: they both are built on breaking with the past and with contingency.</p>
<p>All things are contingent. No event has a single cause. But it seems to us
that we have free will: the ability to make decisions, and so this seeming
capacity to intervene in the world gives rise to causal questions. If I choose
to eat this mushroom, will I get sick? Implicit in the question is the
possibility that I might choose, or not, to eat the mushroom. Of course, our
decision to intervene is itself contingent, since our minds are also part of the
world. The contingency of our own minds makes it hard to infer causation:
amongst patients who choose to go to the hospital, outcomes are worse, but this
does not mean going to the hospital makes you sicker. Or maybe I would have
slept badly anyway on the nights I choose to stay on my phone until late. Our
choices don’t <em>feel</em> very contingent; if they did, there would be no reason to
ask about causation. But we know, abstractly, that our choices may produce
selection bias in ways we don’t fully understand.</p>
<p>How, then, do we establish causation? We require a technology that breaks the
chain of contingency more effectively than our own decision-making processes.
That technology is randomness: aleatoric machines, designed originally for
gambling, which are specially constructed using symmetry and
information-discarding physical processes (spinning a wheel, a coin, or a die;
mixing an urn of balls; or their mathematical analogues in random number
generators). The function of these aleatoric machines is to break the bonds
between the past and the future more thoroughly than our own minds can. If an
outcome follows from the output of an aleatoric device, it cannot have followed
from anything else: if you are selected for the treatment arm of an experiment
by a coin flip, the effect of the past on your outcome is broken at the moment
of the coin flip, at least relative to what would have happened had the coin
flip turned out otherwise.</p>
<p>So free will gives meaning to causation, and causation is detected by
randomness. But there is an irony to this arrangement. Causation matters
because we make choices, the choices we make are the place where who we are
connects to the world, and the source of meaning. But the very notion of
causation requires a break from the past, a break which is most perfectly
created by randomness, the epitome of meaninglessness, as a process which is, by
design, maximally disconnected from the rest of the world. At a deep level, the
phenomena of free will and randomness are siblings: each relies on the
possibility of non-contingency. But behind free will there lies a soul, and
behind randomness, there is nothing (or worse, a tawdry betting game). The
difference between the two is one of value, though, not of kind.</p>
<p>This privileged epistemic role of aleatoric devices in establishing causation is
an extremely recent invention. Often the idea is attributed to Neyman [1],
though this idea is certainly in Hume (Section VIII, Part I of [2]), who writes:</p>
<blockquote>
<p>“And if the definition above mentioned be admitted; liberty, when opposed to
necessity, not to constraint, is the same thing with chance; which is
universally allowed to have no existence.”</p>
</blockquote>
<p>(I might argue that Section VIII, and not Section VI, is the right place to look
in Hume for the origins of modern probabilistic causal inference, in contrast
to some other authors [3]).</p>
<p>But even if the origins were to be traced back to the 17th century, it is
well-established that the pre-modern world did not give aleatoric reasoning such
a privileged status [4]. To me, this opens up a range of interesting
questions. Is it only the modern epistemic prominence of randomness that allows
us to even conceive of causation this way? Does the close proximity between
meaninglessness and free will affect our view of ourselves? When we insist, in
practice, that only randomness can establish causation, do we leave behind ways
of interacting with the world that do not fit into this framework? In what other
ways can we imagine leaving contingency behind? How have other cultures and
times done so?</p>
<p>[1] Neyman, Jerzy, and Karolina Iwaszkiewicz. “Statistical problems in
agricultural experimentation.” Supplement to the Journal of the Royal
Statistical Society 2.2 (1935): 107-180.</p>
<p>[2] Hume, D. (1748). An Enquiry Concerning Human Understanding. Renascence
Editions.</p>
<p>[3] Holland, Paul W. “Statistics and causal inference.”
Journal of the American Statistical Association 81.396 (1986): 945-960.</p>
<p>[4] Hacking, Ian. The emergence of probability: A philosophical study of early
ideas about probability, induction and statistical inference. Cambridge
University Press, 2006.</p>

The Popper-Miller theorem is the Bayesian transitivity paradox (2022-10-19)

<p>Popper and Miller [1,2] proposed a tidy little paradox about inductive reasoning.
Many 20th century Bayesians (e.g. [3]) claim that Bayesian reasoning is valid
inductive reasoning. Popper, ever the enemy of induction, produced (with
Miller) the Popper-Miller (PM) theorem, which “proves” that Bayesian
“induction” is nothing but watered-down deduction.</p>
<p>The PM theorem was widely discussed at the time, and I feel the literature
already contains enough counterarguments. Here, I would like to point out something different: to argue
that the PM theorem is just the Bayesian transitivity paradox (BTP) in fancy
dress. That there is a connection between the PM theorem and the BTP was
pointed out in a brief comment by Redhead in [5], but I think the connection is
deeper and simpler than has been noticed before.</p>
<p>I’ll first say what the PM theorem and BTP are, and then show that the two are
opposite sides of the same coin. The setup, throughout, is as follows.
Suppose that we are interested in how much some observed evidence, \(e\),
supports a hypothesis \(h\), when \(h \Rightarrow e\), but \(e \not\Rightarrow
h\). For example, \(h\) might be “my coin has heads on both sides” and
\(e\) might be “I observed a heads after a single flip.”</p>
<p>I’ll use logic notation (\(\lor\) for disjunction, \(\land\) for
conjunction, \(\lnot\) for negation), but one could equally have represented
\(e\) and \(h\) as sets rather than as logical propositions (with \(\bigcup, \bigcap,
(\cdot)^c\) instead of the respective logic symbols). Note that logical
implication of propositions (\(A \Rightarrow B\)) is the same as set containment
(\(A \subseteq B\)).</p>
<p>Both the PM theorem and the BTP are stated in terms of Bayesian logic. So I’ll
begin by assuming that I have a measure \(p(\cdot)\) on propositions, where
\(p(\cdot | \cdot)\) denotes conditional probability, though my final conclusion
will be in much greater generality. In particular, \(p(h | e)\) is the
posterior credibility of \(h\) given that we observed \(e\). Popper and Miller
analyze the “support” of \(e\) for \(h\), which is defined as the difference
between the posterior and prior probabilities of \(h\), i.e., \(s(h | e) := p(h |
e) - p(h)\). When support is positive, we say \(e\) supports \(h\), and when it
is negative, we say \(e\) counter-supports \(h\). We’ll assume that \(s(h | e) >
0\) here. I assume throughout that \(p(e)\), \(p(h)\), and \(p(h | e)\) are all
strictly between \(0\) and \(1\), which is not essential, but simplifies things
a bit.</p>
<h1 id="the-popper-miller-pm-theorem">The Popper-Miller (PM) Theorem</h1>
<p>The PM theorem is based on a decomposition of \(h\) into “deductive”
and “inductive” parts, denoted \(h_D\) and \(h_I\) respectively,
with \(h = h_D \land h_I\). The
deductive part has the property that \(e \Rightarrow h_D\), and
the inductive part is supposed to capture “all of \(h\) that goes beyond
\(e\).” Their particular decomposition doesn’t matter for my purposes
(it happens to be \(h_D = h \lor e\) and \(h_I = h \lor \lnot e\)), but
it has these properties:</p>
\[\begin{align}
(A) && s(h | e) ={}& s(h_D | e) + s(h_I | e) \\
(B) && s(h_I | e) <{}& 0.
\end{align}\]
<p>Property (A) supports the notion that \(h_D\) and \(h_I\) are a “decomposition”
of the support of \(e\) for \(h\).</p>
<p>Property (B) is the PM theorem. It says that the inductive component is always
counter-supported by the evidence. One might interpret (B) as follows: that
Bayesian reasoning appears to do induction is only an illusion. The support is
merely deductive support diluted by inductive counter-support. Or so, at least,
Popper and Miller claim.</p>
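<p>Properties (A) and (B) are easy to verify numerically. Here is a small check (in Python, over random probability assignments with \(h \Rightarrow e\)):</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Check (A) and (B) for h_D = h v e, h_I = h v ~e, under h => e.
# The space has three atoms: h^e, ~h^e, ~h^~e (h^~e has mass zero since h => e).
viol_A, s_hI_vals = [], []
for _ in range(1000):
    p_he, p_nhe, p_nhne = rng.dirichlet([1.0, 1.0, 1.0])
    p_e = p_he + p_nhe

    def s(p_and_e, p_prior):
        # support s(q | e) = p(q | e) - p(q), given p(q ^ e) and p(q)
        return p_and_e / p_e - p_prior

    s_h = s(p_he, p_he)               # h ^ e = h, since h => e
    s_hD = s(p_e, p_e)                # h_D = h v e = e, since h => e
    s_hI = s(p_he, p_he + p_nhne)     # h_I ^ e = h ^ e; p(h_I) = p(h) + p(~e)

    viol_A.append(abs(s_h - (s_hD + s_hI)))   # (A): additivity of support
    s_hI_vals.append(s_hI)                    # (B): should be negative

print(max(viol_A) < 1e-10, max(s_hI_vals) < 0.0)  # True True
```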
<p>The most common objection was whether reasonable alternative decompositions
exist. Of course they do, and most of the discussion in the literature was
about precisely which candidate decompositions are valid and which are not. Some
decompositions violate (A) and some violate (B). Popper and Miller argue in
several ways that their decomposition is uniquely appropriate [2], though I
think that Elby and Redhead argue convincingly that other decompositions are
reasonable [4,5]. For my purpose, all that matters is that a family of
alternative decompositions exist, some of which may violate either (A) or
(B).[*]</p>
<h1 id="the-bayesian-transitivity-paradox">The Bayesian Transitivity Paradox</h1>
<p>A consequence of (A) and (B) which is not much remarked on in the PM theorem
debate is the following:</p>
\[\begin{align*}
(C) && s(h_D | e) - s(h | e) >{}& 0.
\end{align*}\]
<p>That is, the deductive component receives greater support than the hypothesis.
This makes intuitive sense as a desideratum for a decomposition: \(h \Rightarrow
e \Rightarrow h_D\), and, intuitively, any notion of “support” should give
no more support to a hypothesis than to its logical consequences.</p>
<p>One might wonder whether it is always the case that \(s(r | e) > s(q | e)\) when
\(q \Rightarrow r\). It turns out that this is not necessarily the case, a
phenomenon that is known as the “Bayesian transitivity paradox” (BTP). In fact,
the \(h_I\) component of the PM theorem is an example: \(s(h | e) > 0 > s(h_I |
e)\), although \(h \Rightarrow h_I\). So the PM theorem unavoidably involves
the BTP, a point noted by Redhead [5].</p>
<p>It’s worth noting that the posterior itself does not suffer from anything like
the BTP. If \(q \Rightarrow r\), then \(p(q | e) \ge p(r | e)\), since
logical implication is the same as set containment. The BTP occurs for
\(s(\cdot \vert \cdot)\) because of the role played by the prior.</p>
<p>Of course, (C) follows from (A) and (B), and (B) follows from (A) and (C). It
follows that, given a decomposition of the form (A), <em>the inductive support is
negative if and only if the deductive part has greater support than the original
hypothesis</em>.</p>
<h1 id="the-pm-theorem-is-a-special-case-of-the-btp">The PM theorem is a special case of the BTP</h1>
<p>Let us step back from the specific notion of support and decomposition used in
the PM theorem, and ask what we might want <em>in general</em> from a decomposition of
a generic notion of support, which we denote \(\sigma(\cdot | \cdot)\) into a
deductive and inductive part, which we call \(x_D\) and \(x_I\). We no longer
require \(h = x_D \land x_I\), but we do require that \(e \Rightarrow x_D\) in
some sense. To investigate a generalized form of the PM theorem, one might ask
whether we can have:</p>
\[\begin{align*}
(A')&& \sigma(h | e) ={}& \sigma(x_D | e) + \sigma(x_I | e)\\
(B')&& \sigma(x_I | e) >{}& 0\\
(C')&& \sigma(x_D | e) \ge{}& \sigma(h | e).
\end{align*}\]
<p>We want (A’) because that’s what a “decomposition” would mean, we want (C’)
because we don’t want anything like the BTP, and we want to know whether (B’) is
possible because that’s what it would mean to do induction. But obviously, by
basic algebra, (A’), (B’) and (C’) cannot be simultaneously true, for <em>any</em>
possible notion of support, probabilistic or otherwise. The PM theorem is
simply a particular case of this simple and general observation.</p>
<p>In light of this, the PM theorem begins to look a little trivial. When Popper
and Miller insist, e.g. in response to [7], that authors who contest their
decomposition produce alternative decompositions, they are in fact begging the
question.</p>
<p>That the BTP occurs is certainly a meaningful critique of probabilistic support
\(s(\cdot | \cdot)\). It seems to me, however, that the PM theorem simply
re-arranges the BTP in a way that sacrifices clarity rather than illuminates
what is really at issue.</p>
<h1 id="bibliography">Bibliography</h1>
<p>[1] Popper, K. and D. Miller (1983). “A proof of the impossibility of inductive probability”. In: Nature 302, pp. 687–688.</p>
<p>[2] Popper, K. and D. Miller (1987). “Why probabilistic support is not inductive”. In: Philosophical Transactions of
the Royal Society of London. Series A, Mathematical and Physical Sciences 321.1562,
pp. 569–591.</p>
<p>[3] Carnap, R. (1966). “The aim of inductive logic”. In: Studies in Logic and the Foundations of
Mathematics. Vol. 44. Elsevier, pp. 303–318.</p>
<p>[4] Elby, A. (1994). “Contentious contents: For inductive probability”. In: The British journal
for the philosophy of science 45.1, pp. 193–200.</p>
<p>[5] Redhead, M. (1985). “On the impossibility of inductive probability”. In: The British Journal
for the Philosophy of Science 36.2, pp. 185–191.</p>
<p>[6] Levi, I. (1984). “The impossibility of inductive probability”. In: Nature 310.5976, pp. 433–433.</p>
<p>[7] Jeffrey, R. (1984). “The impossibility of inductive probability”. In: Nature 310.5976, pp. 433–
433.</p>
<h1 id="notes">Notes</h1>
<p>[*] A family of decompositions satisfying \(h = x_D \land x_I\), \(e
\Rightarrow x_D\), and condition (A) can be found by taking \(\{e, a, b\}\) to be
any partition of the tautology (so that \(p(e \lor a \lor b) = p(e) + p(a) +
p(b) = 1\)), and taking \(x_D = e \lor a\) and \(x_I = e \lor b\). Levi pointed
out one such decomposition in [6], though argued that Bayesian inference is
still not deduction since \(s(h_I | e)\) varies over possible decompositions
despite the fact that \(h_I \land e = h\) for all such decompositions, and
propositions that are logically equivalent given \(e\) should receive equal
support from \(e\). Other authors, e.g. Jeffrey in [7], argue for
decompositions that violate (A). It is easy to show that if \(e \lor a \lor b\)
is not the tautology, then (A) is violated.</p>

R torch for statistics (not just machine learning) (2022-04-01)

<p>The <code class="language-plaintext highlighter-rouge">torch</code> package for R (<a href="https://torch.mlverse.org/">found here</a>) is
CRAN-installable and provides automatic differentiation in R, as long as you’re
willing to rewrite your code using Torch functions.</p>
<p>The current docs for the <code class="language-plaintext highlighter-rouge">torch</code> package are great, but assume you’re interested
in machine learning. But gradients are useful for ordinary statistics, too!
In the notebook below I fit a simple Poisson regression model using <code class="language-plaintext highlighter-rouge">optim</code>
by implementing the log likelihood and derivatives in torch. Though not really
competitive with (the highly optimized) <code class="language-plaintext highlighter-rouge">lme4::glm</code> on this toy example,
my point is more how easily you can roll your own MLE in R using <code class="language-plaintext highlighter-rouge">torch</code>.</p>
<p>The notebook itself <a href="2022-04-01_poisson_regression_torch_for_r.ipynb">can be downloaded here</a>,
and a markdown version follows.</p>
<hr />
<h1 id="example-of-torch-for-classical-stats-poisson-regression">Example of <code class="language-plaintext highlighter-rouge">torch</code> for classical stats (Poisson regression)</h1>
<p>In this notebook, I’ll show how easy it is to use <code class="language-plaintext highlighter-rouge">torch</code> for R to optimize loss functions and compute standard error estimates.</p>
<p>The <a href="https://torch.mlverse.org/">torch for R website</a> is mostly focused on machine learning applications. The purpose of this notebook is just to show how easy it is to use <code class="language-plaintext highlighter-rouge">torch</code> to get gradients and Hessians for your own purposes, including vanilla classical statistics.</p>
<p>I’ll use <code class="language-plaintext highlighter-rouge">torch</code> to implement and optimize a Poisson regression loss function and compute standard errors using Fisher information. This is just a toy problem, but by simply dropping the loss into an out-of-the-box optimizer, we get essentially the same answer as the (highly optimized) <code class="language-plaintext highlighter-rouge">lme4</code> package in a similar amount of time.</p>
<h1 id="installation">Installation</h1>
<p>One of the big benefits of <code class="language-plaintext highlighter-rouge">torch</code> is that it can be installed via CRAN, and so can be easily packaged in with your own R packages without the user having to do a bunch of extra Python nonsense. Installation instructions can be found <a href="https://torch.mlverse.org/docs/">here</a>.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">lme4</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">torch</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">44</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Let’s generate some data. The model will be a simple Poisson regression:</p>
\[p(y_n | x_n) = \mathrm{Poisson}(\exp(x_n^T \beta))\]
<p>The goal will be to estimate \(\beta\), and standard errors, using maximum likelihood and the inverse Fisher information.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_obs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n_obs</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">n_obs</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">beta_true</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1.8</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">lambda_true</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta_true</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">lambda_true</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rpois</span><span class="p">(</span><span class="n">n_obs</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_true</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> V1
Min. : 1.021
1st Qu.: 2.570
Median : 3.998
Mean : 4.742
3rd Qu.: 6.349
Max. :15.447
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 4.000 4.664 6.000 25.000
</code></pre></div></div>
<p>Let’s define the log likelihood in <code class="language-plaintext highlighter-rouge">torch</code>. Then we can use <code class="language-plaintext highlighter-rouge">torch</code> to evaluate gradients of the log likelihood for optimization, and the Hessian for standard errors.</p>
<p>There are two important things to know:</p>
<ul>
<li>Torch does not operate on R numeric types. It operates on torch tensors, which can be created with <code class="language-plaintext highlighter-rouge">torch_tensor()</code>.</li>
<li>Torch uses only its own functions — not base R! You can typically find the things you need by browsing through the <a href="https://torch.mlverse.org/docs/reference/index.html">reference material</a>.</li>
</ul>
<p>I’ll keep torch versions of the data around in a list <code class="language-plaintext highlighter-rouge">tvars</code> for easy re-use. And I’ll write a function <code class="language-plaintext highlighter-rouge">EvalLogLikTorch</code>, which takes a torch tensor <code class="language-plaintext highlighter-rouge">beta</code> and the data, and returns the log likelihood, again as a torch tensor.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tvars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">tvars</span><span class="o">$</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">tvars</span><span class="o">$</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">EvalLogLikTorch</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">is</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="s2">"torch_tensor"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"beta must be a torch tensor"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">log_lambda</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_matmul</span><span class="p">(</span><span class="n">tvars</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">)</span><span class="w">
</span><span class="n">lp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_sum</span><span class="p">(</span><span class="n">tvars</span><span class="o">$</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">log_lambda</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">torch_exp</span><span class="p">(</span><span class="n">log_lambda</span><span class="p">))</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">lp</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Sanity check that it works</span><span class="w">
</span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta_true</span><span class="p">),</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torch_tensor
1.72212e+06
[ CPUFloatType{} ]
</code></pre></div></div>
<p>We want to pass the (negative) log likelihood to an R routine as a function to be optimized. So we need to write a wrapper that takes an R numeric type, converts it to a torch tensor, calls <code class="language-plaintext highlighter-rouge">EvalLogLikTorch</code>, and converts the result back to an R numeric type.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># From now on we'll take `tvars` to be a global variable to save</span><span class="w">
</span><span class="c1"># writing everything as lambda functions.</span><span class="w">
</span><span class="n">EvalLogLik</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">log_lik</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">verbose</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">collapse</span><span class="o">=</span><span class="s2">", "</span><span class="p">),</span><span class="w"> </span><span class="s2">": "</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="p">(</span><span class="n">log_lik</span><span class="p">,</span><span class="w"> </span><span class="n">digits</span><span class="o">=</span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">log_lik</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Just check that this runs</span><span class="w">
</span><span class="n">EvalLogLik</span><span class="p">(</span><span class="n">beta_true</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 1722115.375
</code></pre></div></div>
<p>Now for some magic — a function that returns the gradient of <code class="language-plaintext highlighter-rouge">EvalLogLik</code> with respect to beta. This is what we’re using <code class="language-plaintext highlighter-rouge">torch</code> for. As before, we want something that we can pass to our optimizer, so that it takes an R numeric value for \(\beta\) as input and returns \(\partial \log p(y \vert x, \beta) / \partial \beta\) as R numeric output.</p>
<p>Unlike before, we call <code class="language-plaintext highlighter-rouge">torch_tensor</code> with the extra argument <code class="language-plaintext highlighter-rouge">requires_grad=TRUE</code>. That tells <code class="language-plaintext highlighter-rouge">torch</code> that we will later want to compute a gradient with respect to this parameter.</p>
<p>We compute the <code class="language-plaintext highlighter-rouge">loss</code> (the negative log likelihood) as we would normally.</p>
<p>We then call <code class="language-plaintext highlighter-rouge">autograd_grad</code>, which returns the gradient of the first argument’s tensor with respect to the second argument’s tensor using all the computations that have been performed since the tensors were defined. The <code class="language-plaintext highlighter-rouge">autograd_grad</code> function returns a list (you can take gradients with respect to multiple inputs), so we just pull out the first element of the list.</p>
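<p>The gradient here also has a simple closed form, \(\sum_n (y_n - \exp(x_n^T \beta)) x_n\), which makes it easy to sanity-check what autograd returns against finite differences. Below is a standalone sketch of that check, written in Python with made-up data purely for illustration (it is not part of the original notebook):</p>

```python
import math

# Made-up toy data: two covariates, a handful of observations.
x = [[0.2, 0.7], [0.5, 0.1], [0.9, 0.4], [0.3, 0.8]]
y = [1, 2, 3, 1]

def log_lik(b):
    # Poisson log likelihood, dropping the beta-free log(y!) term:
    # sum_n y_n * x_n^T b - exp(x_n^T b)
    total = 0.0
    for xn, yn in zip(x, y):
        eta = xn[0] * b[0] + xn[1] * b[1]
        total += yn * eta - math.exp(eta)
    return total

def grad_analytic(b):
    # Closed form: sum_n (y_n - exp(x_n^T b)) * x_n
    g = [0.0, 0.0]
    for xn, yn in zip(x, y):
        resid = yn - math.exp(xn[0] * b[0] + xn[1] * b[1])
        g[0] += resid * xn[0]
        g[1] += resid * xn[1]
    return g

def grad_fd(b, h=1e-6):
    # Central finite differences, one coordinate at a time.
    g = []
    for d in range(2):
        bp, bm = list(b), list(b)
        bp[d] += h
        bm[d] -= h
        g.append((log_lik(bp) - log_lik(bm)) / (2 * h))
    return g

# The two agree to finite-difference accuracy.
assert all(abs(a - f) < 1e-4
           for a, f in zip(grad_analytic([1.0, 1.8]), grad_fd([1.0, 1.8])))
```

<p>In the notebook itself the same kind of check could be done by comparing <code class="language-plaintext highlighter-rouge">EvalLogLikGrad</code> against a numerical differentiation package such as <code class="language-plaintext highlighter-rouge">numDeriv</code>.</p>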
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">EvalLogLikGrad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">beta_ad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">requires_grad</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">loss</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w">
</span><span class="n">grad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">autograd_grad</span><span class="p">(</span><span class="n">loss</span><span class="p">,</span><span class="w"> </span><span class="n">beta_ad</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">grad</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Just check that this runs and has the correct dimensions</span><span class="w">
</span><span class="n">EvalLogLikGrad</span><span class="p">(</span><span class="n">beta_true</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] -440151.25 -691394
</code></pre></div></div>
<p>Now we can pass the loss and gradient to a nonlinear optimizer. Sure enough, we get a reasonable estimate, with a somewhat small loss gradient.</p>
<p>(This gradient would ideally be smaller, but the out-of-the-box BFGS isn’t a very good optimization algorithm.)</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optim_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
</span><span class="n">opt_result</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">optim</span><span class="p">(</span><span class="w">
</span><span class="n">fn</span><span class="o">=</span><span class="n">EvalLogLik</span><span class="p">,</span><span class="w">
</span><span class="n">gr</span><span class="o">=</span><span class="n">EvalLogLikGrad</span><span class="p">,</span><span class="w">
</span><span class="n">method</span><span class="o">=</span><span class="s2">"BFGS"</span><span class="p">,</span><span class="w">
</span><span class="n">par</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">control</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">fnscale</span><span class="o">=</span><span class="m">-1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">n_obs</span><span class="p">))</span><span class="w">
</span><span class="n">optim_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">optim_time</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="s2">"Estimate"</span><span class="o">=</span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="s2">"Truth"</span><span class="o">=</span><span class="n">beta_true</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">print</span><span class="p">()</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"\nGradient at BFGS optimum:\n"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">EvalLogLikGrad</span><span class="p">(</span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="p">))</span><span class="w">
</span><span class="n">beta_hat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Estimate Truth
1 0.9706614 1.0
2 1.7922191 1.8
Gradient at BFGS optimum:
[1] -0.0355956 -0.0324285
</code></pre></div></div>
<p>We can compare the results to what we’d get from the same regression using <code class="language-plaintext highlighter-rouge">lme4::glm</code>. Sure enough they match, and run in a comparable amount of time. (I’ve found there’s a fair amount of noise in the timing, and of course this is only reporting a single run, so all that really matters here is that the two are of the same order.)</p>
<p>The glm algorithm (IRLS) tends to do a much better job of optimizing, in the sense that the gradient is smaller at the glm optimum, and the algorithm runs more quickly. Still, BFGS is just a quick-and-dirty choice, and doesn’t require any special structure to the problem.</p>
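<p>For reference, the Fisher-scoring update behind IRLS is simple to write down for this model: \(\beta \leftarrow \beta + (X^T \Lambda X)^{-1} X^T (y - \lambda)\), where \(\lambda = \exp(X\beta)\) and \(\Lambda = \mathrm{diag}(\lambda)\) (for the canonical log link, Fisher scoring and Newton’s method coincide). A standalone sketch, in Python with made-up data — this is an illustration of the update, not the <code class="language-plaintext highlighter-rouge">glm</code> implementation:</p>

```python
import math

# Hypothetical toy data for a two-coefficient Poisson regression.
x = [[0.2, 0.7], [0.5, 0.1], [0.9, 0.4], [0.3, 0.8], [0.6, 0.6]]
y = [2, 1, 4, 3, 3]

def irls_poisson(x, y, iters=25):
    beta = [0.0, 0.0]
    for _ in range(iters):
        # Gradient X^T (y - lambda) and Fisher information X^T diag(lambda) X.
        g = [0.0, 0.0]
        info = [[0.0, 0.0], [0.0, 0.0]]
        for xn, yn in zip(x, y):
            lam = math.exp(xn[0] * beta[0] + xn[1] * beta[1])
            for i in range(2):
                g[i] += (yn - lam) * xn[i]
                for j in range(2):
                    info[i][j] += lam * xn[i] * xn[j]
        # Solve the 2x2 system info * step = g by Cramer's rule.
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        step = [(info[1][1] * g[0] - info[0][1] * g[1]) / det,
                (info[0][0] * g[1] - info[1][0] * g[0]) / det]
        beta = [beta[0] + step[0], beta[1] + step[1]]
    return beta

beta_hat = irls_poisson(x, y)
```

<p>Because the Poisson log likelihood is concave, these full Newton steps drive the gradient to (essentially) zero, which is why <code class="language-plaintext highlighter-rouge">glm</code> typically optimizes more tightly than generic BFGS.</p>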
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">glm_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
</span><span class="n">glm_fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="o">=</span><span class="n">data.frame</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x1</span><span class="o">=</span><span class="n">x</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">x2</span><span class="o">=</span><span class="n">x</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w">
</span><span class="n">start</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">family</span><span class="o">=</span><span class="s2">"poisson"</span><span class="p">)</span><span class="w">
</span><span class="n">glm_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">glm_time</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"Difference in coefficients estimated by optim and glm:\t"</span><span class="p">,</span><span class="w">
</span><span class="nf">max</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">coefficients</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">beta_hat</span><span class="p">)),</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"\nEstimation time (s):\n"</span><span class="p">)</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"optimization and torch: \t"</span><span class="p">,</span><span class="w"> </span><span class="n">optim_time</span><span class="p">,</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"glm: \t\t\t\t"</span><span class="p">,</span><span class="w"> </span><span class="n">glm_time</span><span class="p">,</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"\nGradient at glm optimum:\n"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">EvalLogLikGrad</span><span class="p">(</span><span class="n">coefficients</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">)))</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Difference in coefficients estimated by optim and glm: 1.882297e-05
Estimation time (s):
optimization and torch: 0.1431296
glm: 0.01248574
Gradient at glm optimum:
[1] -3.278255e-05 -7.224083e-05
</code></pre></div></div>
<p>To compute standard errors, we need to compute the negative Hessian matrix of the log likelihood:</p>
\[\hat{\mathcal{I}} := -
\left. \frac{\partial^2 \log p(y | x, \beta)}
{\partial \beta \partial \beta^T} \right|_{\hat\beta}\]
<p>The quantity \(\hat{\mathcal{I}}\) is the empirical Fisher information, and \(\hat{\mathcal{I}}^{-1}\) is a standard estimator of the covariance of the MLE \(\hat\beta\) under correct specification.</p>
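<p>For this particular model the Hessian is also available in closed form: the information matrix is \(\sum_n \exp(x_n^T \hat\beta) x_n x_n^T\), so the autodiff result can be checked directly. Here is a standalone sketch of computing standard errors from the closed form, in Python with made-up data, purely for illustration (not the notebook’s code):</p>

```python
import math

# Made-up toy data and a candidate beta-hat.
x = [[0.2, 0.7], [0.5, 0.1], [0.9, 0.4], [0.3, 0.8]]
beta_hat = [1.0, 1.8]

def fisher_info(b):
    # -Hessian of the Poisson log likelihood: sum_n exp(x_n^T b) x_n x_n^T.
    info = [[0.0, 0.0], [0.0, 0.0]]
    for xn in x:
        lam = math.exp(xn[0] * b[0] + xn[1] * b[1])
        for i in range(2):
            for j in range(2):
                info[i][j] += lam * xn[i] * xn[j]
    return info

def standard_errors(info):
    # Invert the 2x2 information matrix and take the sqrt of its diagonal.
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    return [math.sqrt(info[1][1] / det), math.sqrt(info[0][0] / det)]

se = standard_errors(fisher_info(beta_hat))
```

<p>The point of using autodiff in the notebook is, of course, that you do not need to derive this closed form by hand, and the same code works when the model changes.</p>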
<p>The Python version of <code class="language-plaintext highlighter-rouge">torch</code> has a native function, like <code class="language-plaintext highlighter-rouge">autograd_grad</code>, that computes the Hessian directly. Unfortunately, that function has not yet been ported to R. (See <a href="https://github.com/mlverse/torch/issues/738">this issue</a> on github.) However, we can compute a Hessian by computing the gradients of each row of the gradient, as follows.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">EvalLogLikHessian</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">beta_ad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">torch_tensor</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">requires_grad</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">log_lik</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">EvalLogLikTorch</span><span class="p">(</span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">tvars</span><span class="p">)</span><span class="w">
</span><span class="c1"># The argument `create_graph` allows `grad` to be itself differentiated, and</span><span class="w">
</span><span class="c1"># the argument `retain_graph` saves gradient computations to make repeated differentiation</span><span class="w">
</span><span class="c1"># of the same quantity more efficient.</span><span class="w">
</span><span class="n">grad</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">autograd_grad</span><span class="p">(</span><span class="n">log_lik</span><span class="p">,</span><span class="w"> </span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">retain_graph</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">create_graph</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="c1"># Now we compute the gradient of each element of the gradient, each of which is</span><span class="w">
</span><span class="c1"># one row of the Hessian matrix.</span><span class="w">
</span><span class="n">hess</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">beta</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">d</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">grad</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">hess</span><span class="p">[</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">autograd_grad</span><span class="p">(</span><span class="n">grad</span><span class="p">[</span><span class="n">d</span><span class="p">],</span><span class="w"> </span><span class="n">beta_ad</span><span class="p">,</span><span class="w"> </span><span class="n">retain_graph</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">hess</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Just check that this runs and has the correct dimensions</span><span class="w">
</span><span class="n">EvalLogLikHessian</span><span class="p">(</span><span class="n">beta_true</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         [,1]     [,2]
[1,] -1907663 -1720186
[2,] -1720186 -2263947
</code></pre></div></div>
<p>In the code below, <code class="language-plaintext highlighter-rouge">fisher_info</code> is precisely \(\hat{\mathcal{I}}\). We can see that the standard errors match one another.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fisher_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">-1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">EvalLogLikHessian</span><span class="p">(</span><span class="n">opt_result</span><span class="o">$</span><span class="n">par</span><span class="p">)</span><span class="w">
</span><span class="n">torch_se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">fisher_info</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">diag</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">()</span><span class="w">
</span><span class="n">glmer_se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">summary</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[,</span><span class="w"> </span><span class="s2">"Std. Error"</span><span class="p">]</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"Difference in estimated standard errors:\t"</span><span class="p">,</span><span class="w">
</span><span class="nf">max</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">torch_se</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">glmer_se</span><span class="p">)),</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Difference in estimated standard errors: 3.857842e-07
</code></pre></div></div>The torch package for R (found here) is CRAN-installable and provides automatic differentiation in R, as long as you’re willing to rewrite your code using Torch functions.A Few Equivalent Perspectives on Jackknife Bias Correction2022-03-17T10:00:00+00:002022-03-17T10:00:00+00:00/jackknife/2022/03/17/jackknife_bias<p>In this post, I’ll try to connect a few different ways of viewing jackknife and
infinitesimal jackknife bias correction. This post may help provide some
intuition, as well as an introduction to how to use the infinitesimal jackknife
and von Mises expansion to think about bias correction.</p>
<p>Throughout, for concreteness, I’ll use the simple example of the statistic</p>
\[\begin{aligned}
T & =\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}\right)^{2},\end{aligned}\]
<p>where the \(x_{n}\) are IID and \(\mathbb{E}\left[x_{n}\right]=0\).
Of
course, the bias of \(T\) is known, since</p>
\[\begin{aligned}
\mathbb{E}\left[T\right] & =\frac{1}{N^{2}}\mathbb{E}\left[\sum_{n=1}^{N}x_{n}^{2}+\sum_{n_{1}\ne n_{2}}x_{n_{1}}x_{n_{2}}\right]
=\frac{\mathrm{Var}\left(x_{1}\right)}{N}.
\end{aligned}\]
<p>We will
ensure that we recover consistent estimates of this bias using each of
the different perspectives. Of course, the utility of these concepts is
when we do not readily have such a simple expression for bias as in, for
example, Bayesian expectations.</p>
<p>At different points I will use different arguments for \(T\), hopefully
without any real ambiguity. For convenience, write</p>
\[\begin{aligned}
\hat{\mu} & :=\frac{1}{N}\sum_{n=1}^{N}x_{n} \quad\textrm{and}\quad
\hat{\sigma}^{2}:=\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}\end{aligned}\]
<p>so that our example can be expressed as</p>
\[\begin{aligned}
T & =\hat{\mu}^{2}.\end{aligned}\]
<h1 id="an-asymptotic-series-in-n">An asymptotic series in \(N\).</h1>
<p>Perhaps the most common way to understand the jackknife bias estimator
and correction is as an asymptotic series in \(N\). Suppose that we have
some reason to believe that the expectation of \(T\) admits an asymptotic
expansion in \(1/N\), where \(N\) is the size of the observed dataset:</p>
\[\begin{aligned}
\mathbb{E}\left[T_{N}\right] & =a_{0}+a_{1}N^{-1}+o\left(N^{-1}\right),\end{aligned}\]
<p>where \(a_{0}\) is the limiting value of the statistic (zero in our example,
since \(\mathbb{E}\left[x_{n}\right]=0\)), so that the leading-order bias is
\(a_{1}N^{-1}\). The jackknife bias estimator works as follows. Let \(T_{-i}\)
denote \(T\) calculated with datapoint \(i\) left out. Then</p>
\[\begin{aligned}
\mathbb{E}\left[T_{N}-T_{-i}\right] & =a_{0}+a_{1}N^{-1}+o\left(N^{-1}\right)-\\
& \quad a_{0}-a_{1}\left(N-1\right)^{-1}+o\left(N^{-1}\right)\\
& =a_{1}\frac{N-1-N}{N\left(N-1\right)}+o\left(N^{-1}\right)\\
& =-\frac{a_{1}N^{-1}}{N-1}+o\left(N^{-1}\right).\end{aligned}\]
<p>Consequently,</p>
\[\begin{aligned}
\hat{B} & :=-\left(N-1\right)\left(T_{N}-\frac{1}{N}\sum_{n=1}^{N}T_{-n}\right)\\
\mathbb{E}\left[\hat{B}\right] & =a_{1}N^{-1}+o\left(N^{-1}\right),\end{aligned}\]
<p>so \(\hat{B}\) is an unbiased estimate of the leading-order term in the bias of
\(T_{N}\), and the bias-corrected estimate \(T_{N}-\hat{B}\) has bias of
smaller order \(o\left(N^{-1}\right)\):</p>
\[\begin{aligned}
\mathbb{E}\left[T_{N}-\hat{B}\right] & =a_{0}+o\left(N^{-1}\right).\end{aligned}\]
<p>In our example,</p>
\[\begin{aligned}
T_{-i} & =\left(\frac{1}{N-1}\sum_{n\ne i}^{N}x_{n}\right)^{2}\\
& =\left(\frac{N}{N-1}\hat{\mu}-\frac{1}{N-1}x_{i}\right)^{2}\\
& =\left(N-1\right)^{-2}\left(N^{2}\hat{\mu}^{2}-2N\hat{\mu}x_{i}+x_{i}^{2}\right)\\
\frac{1}{N}\sum_{n=1}^{N}T_{-n} & =\left(N-1\right)^{-2}\left(\left(N^{2}-2N\right)\hat{\mu}^{2}+\hat{\sigma}^{2}\right)\\
\hat{B} & =-\left(N-1\right)^{-1}\left(\left(N-1\right)^{2}\hat{\mu}^{2}-\left(N^{2}-2N\right)\hat{\mu}^{2}-\hat{\sigma}^{2}\right)\\
& =\frac{1}{N-1}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned}\]
<p>which is a perfectly good estimate of the bias.</p>
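<p>This identity is easy to check numerically. Below is a minimal sketch in plain
Python (not from the original post; all names are my own) that computes the exact
jackknife bias estimate for \(T=\hat{\mu}^{2}\) and compares it to the closed
form \(\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right)/\left(N-1\right)\):</p>

```python
import random

def T(xs):
    """The statistic: the square of the sample mean."""
    return (sum(xs) / len(xs)) ** 2

def jackknife_bias(xs):
    """B-hat = -(N - 1) * (T_N - average of the leave-one-out statistics)."""
    n = len(xs)
    loo_mean = sum(T(xs[:i] + xs[i + 1:]) for i in range(n)) / n
    return -(n - 1) * (T(xs) - loo_mean)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(50)]

n = len(xs)
mu_hat = sum(xs) / n
sigma2_hat = sum(x * x for x in xs) / n
closed_form = (sigma2_hat - mu_hat ** 2) / (n - 1)

# The two agree up to floating point, since the identity above is exact.
print(jackknife_bias(xs), closed_form)
```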
<h1 id="a-taylor-series-in-1n">A Taylor series in \(1/N\).</h1>
<p>An equivalent way of looking at the previous example is to imagine \(T\)
as a function of \(\tau=1/N\), to numerically estimate the derivative
\(dT/d\tau\), and extrapolate to \(\tau=0\). Using the notation of the
previous section, define the gradient estimate</p>
\[\begin{aligned}
\hat{g_{i}} & =\frac{T_{N}-T_{-i}}{\frac{1}{N}-\frac{1}{N-1}}.\end{aligned}\]
<p>Here, we are viewing \(T_{-i}\) as an instance of the estimator evaluated
at \(\tau=1/\left(N-1\right)\). By rearranging, we find that</p>
\[\begin{aligned}
\hat{g_{i}} & =-N\left(N-1\right)\left(T_{N}-T_{-i}\right),\\
\hat{g} & =\frac{1}{N}\sum_{n=1}^{N}\hat{g_{n}}\\
& =-N\left(N-1\right)\left(T_{N}-\frac{1}{N}\sum_{n=1}^{N}T_{-n}\right)\\
& =N\hat{B}.\end{aligned}\]
<p>Extrapolating to \(\tau=0\) gives</p>
\[\begin{aligned}
T_{\infty} & \approx T_{N}+\hat{g}\left(0-\frac{1}{N}\right)\\
& =T_{N}-\frac{\hat{g}}{N}=T_{N}-\hat{B},\end{aligned}\]
<p>as in the previous example.</p>
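<p>The same simulated check works for the extrapolation view. This sketch (plain
Python, hypothetical names) averages the finite-difference derivative estimates
and extrapolates to \(\tau=0\); the result coincides with the bias-corrected
estimate \(T_{N}-\hat{B}\):</p>

```python
import random

def T(xs):
    # The statistic T = (sample mean)^2.
    return (sum(xs) / len(xs)) ** 2

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(50)]
n = len(xs)

t_n = T(xs)
t_loo = [T(xs[:i] + xs[i + 1:]) for i in range(n)]

# Finite-difference estimate of dT/dtau at tau = 1/N, averaged over
# the leave-one-out evaluations at tau = 1/(N-1).
g_hat = sum((t_n - t_i) / (1.0 / n - 1.0 / (n - 1)) for t_i in t_loo) / n

# Extrapolate from tau = 1/N to tau = 0.
t_extrapolated = t_n + g_hat * (0.0 - 1.0 / n)

# Jackknife bias-corrected estimate, for comparison.
b_hat = -(n - 1) * (t_n - sum(t_loo) / n)
print(t_extrapolated, t_n - b_hat)
```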
<h1 id="a-von-mises-expansion">A von Mises expansion.</h1>
<p>Let us write the statistic as a functional of the data distribution as
follows:</p>
\[\begin{aligned}
T\left(F\right) & =\left(\int xdF\left(x\right)\right)^{2}.\end{aligned}\]
<p>Define the empirical distribution to be \(F_{N}\) and the true
distribution \(F_{\infty}\). Suppose we can Taylor expand the statistic in
the space of distribution functions as</p>
\[\begin{aligned}
T\left(G\right) & \approx T\left(F_{0}\right)+T_{1}\left(F_{0}\right)\left(G-F_{0}\right)+\frac{1}{2}T_{2}\left(F_{0}\right)\left(G-F_{0}\right)\left(G-F_{0}\right)+
O\left(\left|G-F_{0}\right|^{3}\right), & \textrm{(1)} \end{aligned}\]
<p>where \(T_{1}\left(F_{0}\right)\) is a linear operator on the space of
(signed) distribution functions and \(T_{2}\left(F_{0}\right)\) is a
similarly defined bilinear operator. The expansion in
Eq. 1
is known as a von Mises expansion.</p>
<p>Often these operators can be represented with “influence functions”,
i.e., there exists a function
\(x\mapsto\psi_{1}\left(F_{0}\right)\left(x\right)\) and
\(x_{1},x_{2}\mapsto\psi_{2}\left(F_{0}\right)\left(x_{1},x_{2}\right)\)
such that</p>
\[\begin{aligned}
T_{1}\left(F_{0}\right)\left(G-F_{0}\right) & =\int\psi_{1}\left(F_{0}\right)\left(x\right)d\left(G-F_{0}\right)\left(x\right)\\
T_{2}\left(F_{0}\right)\left(G-F_{0}\right)\left(G-F_{0}\right) & =\int\int\psi_{2}\left(F_{0}\right)\left(x_{1},x_{2}\right)d\left(G-F_{0}\right)\left(x_{1}\right)d\left(G-F_{0}\right)\left(x_{2}\right).\end{aligned}\]
<p>For instance, the directional derivative of our example is given by</p>
\[\begin{aligned}
\left.\frac{dT\left(F+tG\right)}{dt}\right|_{t=0} & =\left.\frac{d}{dt}\right|_{t=0}\left(\int xd\left(F+tG\right)\left(x\right)\right)^{2}\\
& =2\left(\int\tilde{x}dF\left(\tilde{x}\right)\right)\int xdG\left(x\right),\end{aligned}\]
<p>so that</p>
\[\begin{aligned}
\psi_{1}\left(F\right)\left(x\right) & =2\left(\int\tilde{x}dF\left(\tilde{x}\right)\right)x.\end{aligned}\]
<p>Similarly,</p>
\[\begin{aligned}
\left.\frac{d^{2}T\left(F+tG\right)}{dt^{2}}\right|_{t=0} & =2\int xdG\left(x\right)\int xdG\left(x\right),\end{aligned}\]
<p>so</p>
\[\begin{aligned}
\psi_{2}\left(F\right)\left(x_{1},x_{2}\right) & =2x_{1}x_{2}.\end{aligned}\]
<p>Define</p>
\[\begin{aligned}
\Delta_{N} & :=F_{N}-F_{\infty}.\end{aligned}\]
<p>Then the Taylor
expansion gives an expression for the bias in terms of the influence
functions:</p>
\[\begin{aligned}
T\left(F_{N}\right)-T\left(F_{\infty}\right) & =\int\psi_{1}\left(F_{\infty}\right)\Delta_{N}+\frac{1}{2}\int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)+\\
& \quad\quad O\left(\left|\Delta_{N}\right|^{3}\right).\end{aligned}\]
<p>Note that, in general, integrals against \(\Delta_{N}\) take the form</p>
\[\begin{aligned}
\int\phi\left(x\right)d\Delta_{N}\left(x\right) & =\int\phi\left(x\right)dF_{N}\left(x\right)-\int\phi\left(x\right)dF_{\infty}\left(x\right)\\
& =\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)-\mathbb{E}\left[\phi\left(x\right)\right].\end{aligned}\]
<p>Consequently, the first term has zero bias, since</p>
\[\begin{aligned}
\mathbb{E}\left[\int\psi_{1}\left(F_{\infty}\right)\Delta_{N}\right] & =\frac{1}{N}\mathbb{E}\left[\sum_{n=1}^{N}\psi_{1}\left(F_{\infty}\right)\left(x_{n}\right)\right]-\mathbb{E}\left[\psi_{1}\left(F_{\infty}\right)\left(x\right)\right]\\
& =0.\end{aligned}\]
<p>The second term is given by</p>
\[\begin{aligned}
& \int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)\\
& \quad=\int\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{n}\right)-\mathbb{E}_{x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right)d\Delta_{N}\left(x_{1}\right)\\
& \quad=\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}\psi_{2}\left(F_{\infty}\right)\left(x_{n_{1}},x_{n_{2}}\right)-\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{n},x_{2}\right)\right]-\\
& \quad\quad\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{x_{1}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{n}\right)\right]+\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right].\end{aligned}\]
<p>Note that in these expectations, \(x_{1}\) and \(x_{2}\) are independent. Taking
the expectation of the whole expression, the cross terms and the
\(n_{1}\ne n_{2}\) terms of the double sum nearly cancel, leaving only terms of
order \(1/N\):</p>
\[\begin{aligned}
& \mathbb{E}\left[\int\int\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)d\Delta_{N}\left(x_{1}\right)d\Delta_{N}\left(x_{2}\right)\right]=\nonumber \\
& \quad\frac{1}{N}\left(\mathbb{E}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{1}\right)\right]-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right).
& \textrm{(2)}
\end{aligned}\]
<p>In general, this is not zero, and so the leading-order bias term of
\(\mathbb{E}\left[T\left(F_{N}\right)-T\left(F_{\infty}\right)\right]\) is
given by the expectation of the quadratic term.</p>
<p>Note that integrals over \(\Delta_{N}\) are, by the CLT, of order
\(1/\sqrt{N}\), so the \(k\)-th term in the von Mises expansion
is of order \(N^{-k/2}\). By this argument, the bias of \(T\) is of order
\(N^{-1}\) and admits a series expansion in \(N^{-1}\). Indeed, a von Mises
expansion is one way you could justify the first perspective. The expectation
of the second-order term is precisely \(a_{1}N^{-1}\).</p>
<p>For our example, we can see that the bias is given by</p>
\[\begin{aligned}
& \frac{1}{2}\frac{1}{N}\left(\mathbb{E}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{1}\right)\right]-\mathbb{E}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{\infty}\right)\left(x_{1},x_{2}\right)\right]\right)\\
& \quad=\frac{2}{2N}\left(\mathbb{E}\left[x_{1}^{2}\right]-\mathbb{E}\left[x_{1}\right]^{2}\right)\\
& \quad=\frac{1}{N}\mathrm{Var}\left(x_{1}\right),\end{aligned}\]
<p>exactly as expected. In this case, the second order term is the exact
bias because our very simple \(T\) is actually quadratic in the
distribution function.</p>
<p>In general, one can estimate the bias by computing a sample version of
the second-order term. In our simple example, \(\psi_{2}\left(F\right)\)
does not actually depend on \(F\), but in general one would have to
replace \(\psi_{2}\left(F_{\infty}\right)\) with
\(\psi_{2}\left(F_{N}\right)\) and the population expectations with sample
expectations. For our example, letting \(\hat{\mathbb{E}}\) denote sample
expectations, this plug-in approach gives</p>
\[\begin{aligned}
& \frac{1}{2}\frac{1}{N}\left(\hat{\mathbb{E}}\left[\psi_{2}\left(F_{N}\right)\left(x_{1},x_{1}\right)\right]-\hat{\mathbb{E}}_{x_{1}x_{2}}\left[\psi_{2}\left(F_{N}\right)\left(x_{1},x_{2}\right)\right]\right)\\
& \quad=\frac{1}{N}\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}-\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}x_{n_{1}}x_{n_{2}}\right)\\
& \quad=\frac{1}{N}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned}\]
<p>which is simply a sample estimate of the variance.</p>
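<p>This plug-in computation is easy to carry out directly. A small Python sketch
(names are my own) with \(\psi_{2}\left(x_{1},x_{2}\right)=2x_{1}x_{2}\),
including the factor of \(1/2\) from the Taylor expansion:</p>

```python
import random

def psi2(x1, x2):
    """Second-order influence function for T = (sample mean)^2."""
    return 2.0 * x1 * x2

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(50)]
n = len(xs)

# Plug-in bias estimate: half the sample version of the quadratic term.
diag_term = sum(psi2(x, x) for x in xs) / n
cross_term = sum(psi2(x1, x2) for x1 in xs for x2 in xs) / n ** 2
bias_plugin = 0.5 * (diag_term - cross_term) / n

# Closed form: a sample estimate of Var(x) / N.
mu_hat = sum(xs) / n
sigma2_hat = sum(x * x for x in xs) / n
print(bias_plugin, (sigma2_hat - mu_hat ** 2) / n)
```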
<p>Note that you might initially expect that, to use the reasoning in Eq. 1, you
would need to express your estimator as an explicit function of \(N\), or at
least take into account the \(N\) dependence in developing a Taylor series
expansion such as that in Eq. 1. However, the example in the present case shows
that this is not so, as the empirical distribution depends only implicitly on
\(N\). In fact, the asymptotic series in \(N\) follows from the stochastic
behavior of \(\Delta_{N}\) rather than from any explicit \(N\) dependence in the
statistic.</p>
<h1 id="the-infinitesimal-jackknife">The infinitesimal jackknife.</h1>
<p>Rather than use Eq. 1 to estimate the bias directly with the plug-in principle,
we might imagine using it to try to approximate the jackknife estimate of bias.
In this section, I show that (a) a second order infinitesimal jackknife
expansion is necessary and that (b) you then get the same answer as by
estimating the bias from the second term of Eq. 1 directly.</p>
<p>Let \(F_{-i}\) denote the empirical distribution with datapoint \(i\) left
out, and let \(\Delta_{-i}\) denote \(F_{-i}-F_{N}\). The infinitesimal
jackknife estimate of \(T_{-i}\) is given by using
Eq. 1
to extrapolate from \(F_{N}\) to \(F_{-i}\):</p>
\[\begin{aligned}
T_{IJ}^{\left(1\right)}\left(F_{-i}\right) & :=T\left(F_{N}\right)+T_{1}\left(F_{N}\right)\Delta_{-i}.\end{aligned}\]
<p>This is the classical infinitesimal jackknife, which expands only to
first order. The second order IJ is of course</p>
\[\begin{aligned}
T_{IJ}^{\left(2\right)}\left(F_{-i}\right) & :=T\left(F_{N}\right)+T_{1}\left(F_{N}\right)\Delta_{-i}+\frac{1}{2}T_{2}\left(F_{N}\right)\Delta_{-i}\Delta_{-i}.\end{aligned}\]
<p>The difference from the previous section is that the base of the Taylor
series is \(F_{N}\) rather than \(F_{\infty}\), and we are extrapolating to estimate the
jackknife rather than to estimate the actual bias. A benefit is that all
the quantities in the Taylor series can be evaluated, and no plug-in
approximation is necessary. For instance, in our example,</p>
\[\begin{aligned}
\psi_{1}\left(F_{\infty}\right)\left(x\right) & =2\left(\int\tilde{x}dF_{\infty}\left(\tilde{x}\right)\right)x,\end{aligned}\]
<p>which contains the unknown true mean \(\mathbb{E}\left[x_{1}\right]\). In
contrast,</p>
\[\begin{aligned}
\psi_{1}\left(F_{N}\right)\left(x\right) & =2\hat{\mu}x,\end{aligned}\]
<p>which depends only on the observed sample mean.</p>
<p>As before, it is useful to first write out the action of
\(\Delta_{-i}\) on a generic function of \(x\):</p>
\[\begin{aligned}
\int\phi\left(x\right)d\Delta_{-i}\left(x\right) & =\int\phi\left(x\right)dF_{-i}\left(x\right)-\int\phi\left(x\right)dF_{N}\left(x\right)\\
& =\frac{1}{N-1}\sum_{n\ne i}\phi\left(x_{n}\right)-\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)\\
& =\left(\frac{1}{N-1}-\frac{1}{N}\right)\sum_{n=1}^{N}\phi\left(x_{n}\right)-\frac{\phi\left(x_{i}\right)}{N-1}\\
& =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\phi\left(x_{n}\right)-\phi\left(x_{i}\right)\right).\end{aligned}\]
<p>From this we see that the first-order term is</p>
\[\begin{aligned}
T_{1}\left(F_{N}\right)\Delta_{-i} & =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)-\psi_{1}\left(F_{N}\right)\left(x_{i}\right)\right).\end{aligned}\]
<p>Suppose we tried to use \(T_{IJ}^{\left(1\right)}\left(F_{-i}\right)\) to
approximate \(T_{-i}\) in the expression for \(\hat{B}\). We would get</p>
\[\begin{aligned}
\hat{B} & =-\left(N-1\right)\left(T\left(F_{N}\right)-\frac{1}{N}\sum_{n=1}^{N}T_{IJ}^{\left(1\right)}\left(F_{-n}\right)\right)\\
& =\left(N-1\right)\left(\frac{1}{N}\sum_{n=1}^{N}T_{1}\left(F_{N}\right)\Delta_{-n}\right)\\
& =\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)-\frac{1}{N}\sum_{n=1}^{N}\psi_{1}\left(F_{N}\right)\left(x_{n}\right)\\
& =0.\end{aligned}\]
<p>In other words, the first-order approximation
estimates no bias. (This is in fact for the same reason that the
expectation with respect to \(F_{\infty}\) of the first-order term
evaluated at \(F_{\infty}\) is zero.)</p>
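<p>This zero can be verified numerically. The sketch below (plain Python, names my
own) uses \(\psi_{1}\left(F_{N}\right)\left(x\right)=2\hat{\mu}x\) for this
statistic, although the conclusion \(\hat{B}=0\) holds for any \(\psi_{1}\),
since only its empirical mean enters:</p>

```python
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(50)]
n = len(xs)
mu_hat = sum(xs) / n
t_n = mu_hat ** 2

def psi1(x):
    """First-order influence function of T = (sample mean)^2 at F_N."""
    return 2.0 * mu_hat * x

psi1_bar = sum(psi1(x) for x in xs) / n

# First-order infinitesimal jackknife approximation to each T_{-i}.
t_ij1 = [t_n + (psi1_bar - psi1(x)) / (n - 1) for x in xs]

# The implied jackknife bias estimate is identically zero.
b_hat_ij1 = -(n - 1) * (t_n - sum(t_ij1) / n)
print(b_hat_ij1)
```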
<p>The second order term is given by</p>
\[\begin{aligned}
T_{2}\left(F_{N}\right)\Delta_{-i}\Delta_{-i} & =\left(N-1\right)^{-1}\int\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(\tilde{x},x_{n}\right)-\psi_{2}\left(F_{N}\right)\left(\tilde{x},x_{i}\right)\right)d\Delta_{-i}\left(\tilde{x}\right)\\
& =\left(N-1\right)^{-2}\times(\\
& \quad\frac{1}{N^{2}}\sum_{n_{1},n_{2}}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n_{1}},x_{n_{2}}\right)-\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{i},x_{n}\right)-\\
& \quad\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n},x_{i}\right)+\psi_{2}\left(F_{N}\right)\left(x_{i},x_{i}\right)\\
& \quad).\end{aligned}\]
<p>As before, using
\(T_{IJ}^{\left(2\right)}\left(F_{-i}\right)\) then to approximate
\(T_{-i}\) gives</p>
\[\begin{aligned}
\hat{B} & =-\left(N-1\right)\left(T\left(F_{N}\right)-\frac{1}{N}\sum_{n=1}^{N}T_{IJ}^{\left(2\right)}\left(F_{-n}\right)\right)\\
& =\frac{N-1}{2}\left(\frac{1}{N}\sum_{n=1}^{N}T_{2}\left(F_{N}\right)\Delta_{-n}\Delta_{-n}\right),\end{aligned}\]
<p>where we have used the previous result that the first-order term has empirical
expectation \(0\). Plugging in, we see that</p>
\[\begin{aligned}
\hat{B} & =\frac{1}{2\left(N-1\right)}\left(\frac{1}{N}\sum_{n=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n},x_{n}\right)-\frac{1}{N^{2}}\sum_{n_{1},n_{2}=1}^{N}\psi_{2}\left(F_{N}\right)\left(x_{n_{1}},x_{n_{2}}\right)\right),\end{aligned}\]
<p>which is precisely a sample analogue of the population bias (one half the
quantity in Eq. 2 of the previous section), with \(N-1\) in place of \(N\). Of
course, in our specific example, this gives</p>
\[\begin{aligned}
\hat{B} & =\left(N-1\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}x_{n}^{2}-\frac{1}{N^{2}}\sum_{n_{1},n_{2}}^{N}x_{n_{1}}x_{n_{2}}\right)\\
& =\frac{1}{N-1}\left(\hat{\sigma}^{2}-\hat{\mu}^{2}\right),\end{aligned}\]
<p>which matches the exact jackknife’s factor of \(\left(N-1\right)^{-1}\),
in contrast to our direct sample estimate of the bias term, which had a
factor of \(N^{-1}\).</p>In this post, I’ll try to connect a few different ways of viewing jackknife and infinitesimal jackknife bias correction. This post may help provide some intuition, as well as an introduction to how to use the infinitesimal jackknife and von Mises expansion to think about bias correction.St. Augustine’s question: A counterexample to Ian Hacking’s ‘law of likelihood’2022-02-17T10:00:00+00:002022-02-17T10:00:00+00:00/philosophy/2022/02/17/st_augustines_paradox<p>In this post, I’d like to discuss a simple sense in which statistical reasoning
refutes itself. My reasoning is almost trivial and certainly familiar to
statisticians. But I think that the way I frame it constitutes an argument
against a certain kind of philosophical overreach: against an attempt to view
statistical reasoning as a branch of logic, rather than an activity that looks
more like rhetoric.</p>
<p>To make my argument I’d like to mash up two books which I’ve talked about before
on this blog. The first is Ian Hacking’s Logic of Statistical Inference (I
wrote <a href="/philosophy/2021/12/09/fidual_cis.html">here</a> about
its wonderful chapter on fiducial inference). The other is an interesting
section in St. Augustine’s confessions, which I <a href="/philosophy/2021/10/27/st_augustine.html">discussed here</a>. Ian Hacking’s ambition is, as the
title of the book suggests, to describe the basis of a logic of statistical
inference. His primary tool is the comparison of the likelihoods of what he
calls “chance outcomes” (implicitly he seems to mean aleatoric gambling devices,
but he is uncharacteristically imprecise, implying, I think, that we simply know
a chance setup when we see it).</p>
<p>St. Augustine, as I discuss in my earlier post, has a worldview stripped of what
modern thinkers would call randomness. In St. Augustine’s vision of the world, an
unknowable and all-powerful God guides to His own ends the outcome even of
aleatoric devices, such as the drawing of lots and, presumably, the flipping of
coins. Many people in the modern age do not think like St. Augustine. So it
is reasonable to ask what I will call “St. Augustine’s question:” is St.
Augustine’s deterministic worldview correct?</p>
<h1 id="hacking-on-st-augustines-question">Hacking on St. Augustine’s question</h1>
<p>I would like to attempt, using Hacking’s methods, to bring the outcome of a coin
flip to bear on St. Augustine’s question. One might reasonably doubt that this
is fair to Hacking. However, the first sentence of Hacking’s book articulates
the scope of his ambition:</p>
<blockquote>
<p>“The problem of the foundation of statistics is to state a set of principles
which entail the validity of all correct statistical inference, and which
do not imply that any fallacious inference is valid.”</p>
</blockquote>
<p>Hacking’s goal is ambitious (my argument here is essentially that it is
over-ambitious). However, to his credit, it is clear: if we can formulate the
St. Augustine question as a statistical one about a chance outcome, then we
should expect Hacking’s logic to come to the correct epistemic conclusion.
Furthermore, Hacking states himself (when arguing against Neyman-Pearson
testing in Chapter 7) that “the best way to refute a principle [is] not
general metaphysics but concrete example.”</p>
<p>Finally, lest it seem too esoteric to argue with St. Augustine, or that this
example is too contrived to be meaningful, at the end of this post, I will draw
connections between my argument and some shortcomings of likelihood-based model
comparison that are well known to statisticians but largely ignored by Hacking’s
book.</p>
<h1 id="hackings-law-of-likelihood">Hacking’s law of likelihood</h1>
<p>Hacking’s principle of inference is embodied in his “law of likelihood,” which
is introduced in Chapter 5. The goal is to justifiably connect aleatoric
statements to degrees of logical belief (without going through subjective
probability). Stripping away some of Hacking’s notation, his law of likelihood
states in brief that</p>
<blockquote>
<p>“If two joint propositions are consistent with the statistical data,
the better supported is that with the greater likelihood.”</p>
</blockquote>
<p>Here I should clarify some of Hacking’s terminology. By “statistical data” he
means everything you know before conducting a chance experiment, including the
nature of how you get the data. A “joint proposition” is some
statement about the world, possibly including things you don’t know, e.g.,
future unobserved data, or some unknown aspect of the real world. Hacking spends
a lot of time defining and discussing his terms.</p>
<p>For the present purpose, it suffices to describe some of Hacking’s own examples
from Chapter 5 of how the law of likelihood is to be used. Suppose that a
biased coin has P(H) = 0.9 and P(T) = 0.1. Then, by the law of likelihood, the
proposition \(\pi_H\) that a yet-unseen flip will be H is better supported than
the proposition \(\pi_T\) that it will be T, since P(H) > P(T). Similarly, if
we observe K heads out of N flips, by the law of likelihood, the proposition
\(\pi_{K/N}\) that P(H) = K / N is better supported than the proposition
\(\pi_{(K-1)/N}\) that P(H) = (K - 1) / N.</p>
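<p>These comparisons are just likelihood evaluations. A short sketch using only
Python’s standard library (the particular values of K and N are my own
illustration, not Hacking’s):</p>

```python
from math import comb

def binom_lik(p, k, n):
    """Likelihood of k heads in n independent flips when P(H) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Observing K = 7 heads in N = 10 flips: P(H) = K/N has greater likelihood
# than P(H) = (K - 1)/N, so by the law of likelihood it is better supported.
K, N = 7, 10
print(binom_lik(K / N, K, N), binom_lik((K - 1) / N, K, N))  # the first is larger
```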
<p>Are these assertions trivial? Hacking spends the first part of the book arguing
that they are not, and the latter part of the book demonstrating important
differences, both conceptual and practical, with decision theory and subjective
probability. Suffice to say they are beyond the scope of the present post.</p>
<h1 id="asking-st-augustines-question-with-the-law-of-likelihood">Asking St. Augustine’s question with the law of likelihood</h1>
<p>Let us suppose that we have made single coin flip which came up H. The coin was
designed and flipped symmetrically to the best of our abilities. St. Augustine’s
question can be expressed in terms of these two simple propositions:</p>
<ul>
<li>\(\pi_{R}\) (Randomness): P(H) = 0.5, and we observed H</li>
<li>\(\pi_{A}\) (Augustine): P(H) = 1.0 (God wills it), and we observed H</li>
</ul>
<p>Obviously, the law of likelihood supports \(\pi_{A}\), answering St. Augustine’s
question in the affirmative, i.e., that St. Augustine’s worldview is better
supported than randomness.</p>
<p>Let me be the first to admit that this is pretty trivial. Perhaps you are
disappointed, and sorry you bothered to read this far! Let me try to bring you
back in.</p>
<p>First, observe that the same reasoning applies to any number of coin flips. You
might ask whether the sequence HTHTTH was pre-ordained or random, and the law of
likelihood always supports that it was pre-ordained. The same reasoning can be
applied to whether some small number of flips in a particular sequence were
pre-ordained — e.g., when asking whether every flip in the sequence HTHTTH was
random, or whether at least one of them was pre-ordained, the law of likelihood
supports that at least one of them was pre-ordained. The same reasoning applies
to degrees of probability as well — e.g., when asking whether every flip
in the sequence HTHTTH was fair, versus whether P(H) = 0.6 whenever H came up
and P(T) = 0.6 whenever T came up, the law of likelihood supports that the
sequence was not fair.</p>
<p>In short, the law of likelihood always supports the most deterministic
proposition. In this sense, the law of likelihood does not support its own
applicability. Without randomness, there is no need or use for a logic of
statistical inference. When given the opportunity to ask whether or not there
is randomness in a particular setting, the law of likelihood always militates
against randomness, and eats its own tail.</p>
<h1 id="statisticians-know-this-and-so-does-hacking">Statisticians know this, and so does Hacking</h1>
<p>This phenomenon is no surprise to statisticians, of course. Model selection
based on likelihood — whether Bayesian or frequentist in design and use —
favors the more complex models unless some corrective factor is used, such as
regularization or priors. The answer given by the law of likelihood to St.
Augustine’s question is just an extreme end of this phenomenon.</p>
<p>Is Hacking aware of this problem? Of course; Hacking is aware of most things.
For example, in Chapter 7, he discusses very briefly the importance of weighting
likelihoods in some cases (“One author has suggested that a number be assigned
to each hypothesis, which will represent the ‘seriousness’ of rejecting it …
In the theory of likelihood testing, one would use weighted tests.”)
Unfortunately, Hacking’s discussion of Bayesianism in Chapters 12 and 13 does
not take up this point, focusing instead on arguing against uniform priors and
dogmatic subjectivism. Probably most damningly, Hacking does not shrink away
from using the law of likelihood to reason between a large number of expressive
propositions and a single less expressive one, as, for example, in his
comparison of unbiased tests in Chapter 7 (page 89 in the Cambridge Philosophy
Classics 2016 edition). In summary, Hacking does not appear to take very
seriously the fundamental role extra-statistical evidence must play in
applications of the law of likelihood, in order to avoid its own
self-refutation.</p>
<h1 id="we-must-deliberately-choose-the-statistical-analogy">We must deliberately choose the statistical analogy</h1>
<p>The point is that describing the world with randomness is a choice we make, and
we make it because it is sometimes useful to us. In the course of doing
something like statistical inference, we <em>must</em> posit <em>a priori</em> the existence
of randomness as well as explanatory mechanisms of limited complexity. At the
core of statistical reasoning is the <em>discard</em> of information — of viewing a
set of voters, each entirely unique, as equivalent to balls drawn from an urn,
or viewing the day’s weather, which is fixed from yesterday’s by deterministic
laws of physics, as something exchangeable with some hypothetical population of
other days, conceptually detached from contingency and their own pasts. Failure
to remember this can lead to silly arguments about whether phenomena are “really
random.” In other words, we must choose to make the <a href="/philosophy/2021/08/22/what_is_statistics.html">statistical analogy</a>, and accept
that its applicability may not be indisputable.</p>
<p>From this perspective, Hacking’s ambition — a logic of statistical inference —
seems hopeless, not because of some inevitably subjective nature of probability
itself, but because of the subjective nature of analogy. How can you form a
logic which will give correct conclusions in every application of an analogy?
The affairs of statistics are inevitably human and not purely computational, and
the field is more exciting and fruitful for it.</p>In this post, I’d like to discuss a simple sense in which statistical reasoning refutes itself. My reasoning is almost trivial and certainly familiar to statisticians. But I think that the way I frame it constitutes an argument against a certain kind of philosophical overreach: against an attempt to view statistical reasoning as a branch of logic, rather than an activity that looks more like rhetoric.Some of the gambling devices that build statistics.2022-01-27T10:00:00+00:002022-01-27T10:00:00+00:00/philosophy/2022/01/27/basic_gambling_device<p>In <a href="/philosophy/2021/08/22/what_is_statistics.html">an earlier post</a>, I discuss how statistics uses
gambling devices (aleatoric uncertainty) as a metaphor for the unknown in
general (epistemic uncertainty). I called this the “statistical analogy.” Of
course, this perspective is not at all new — see section 1.5 of [0], for
example.</p>
<p>When folks employ the statistical analogy, explicitly or implicitly, a few
gambling devices come up again and again. I find that having their taxonomy in
the back of the mind can help see what metaphor(s) is (are) being employed in a
particular analysis. These gambling devices are obviously not fully distinct —
you can typically simulate one with another, and the final “device” obviously
encompasses all the others. But I will separate them here because they tend to
play different metaphorical roles — and, I would argue, increasingly tenuously
in the order I have written them.</p>
<h1 id="the-urn-exchangeability">The urn (exchangeability)</h1>
<p>The gambling device most commonly used in statistics is probably the urn: some
container holding some objects, such as balls of different colors, which is
shaken, and from which some objects are removed. The aleatoric randomness is
provided by shaking as well as drawing blindly from the urn, creating a symmetry
between all objects in the urn. Equivalent gambling devices include drawing
cards from a shuffled deck or random respondents for a poll. Once one is in the
habit of thinking about urns with a finite number of objects, it is a small step
to consider urns with an infinite number of objects, such as super-populations
in causal inference ([1], section 1.12).</p>
<p>The ubiquitous assumption of exchangeability is equivalent to sequential drawing
from a shaken urn ([2], section 3). Consequently, the urn model is at the core
of most frequentist inferential methods, including the bootstrap and normal
approximations for exchangeable data.</p>
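<p>The urn metaphor for the bootstrap can be made concrete. The sketch below is my own illustration (not from any package mentioned here): the observed data points go into the urn, and each bootstrap resample is a sequence of blind draws from it with replacement.</p>

```python
import random

# Sketch: the bootstrap as an urn.  The observed data points go into the
# urn, and each resample is n blind draws from it with replacement.
def bootstrap_means(data, n_boot, rng):
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_boot)]

rng = random.Random(0)
data = [2.1, 3.4, 1.9, 4.0, 2.8, 3.1]
means = bootstrap_means(data, n_boot=1000, rng=rng)
# The spread of the resampled means estimates the sampling variability of
# the mean, under the shaken-urn (exchangeability) metaphor.
print(min(means), max(means))
```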
<h1 id="bets-using-biased-coins-subjective-probability">Bets using biased coins (subjective probability)</h1>
<p>The biased coin, which chooses between two outcomes with given probabilities,
plays a large role in subjective probability (associated with Bayesian
statistics) as the basis for hypothetical betting. The key idea behind
subjective probability is that, before gathering data, we have beliefs about the
state of the world. If these beliefs satisfy some reasonable assumptions (i.e.,
are “coherent”), then there are some bets that we would consider fair, and
some that we would not. Equivalent aleatoric versions of these bets can then be
used as metaphors for your subjective beliefs.</p>
<p>For example, suppose that some unknown quantity can be either A or B, and we
would accept as fair a bet in which we get $1 if A occurs but pay $2 if B
occurs. Since these are precisely the odds that would be acceptable for a
biased coin which comes up A 2/3 of the time and B 1/3 of the time, one might
say that your subjective belief about A and B is equivalent to your subjective
belief about a biased coin with probabilities 2/3 and 1/3. The bet on a biased
coin is a metaphor for your subjective belief about A and B. (The full formal
connection between betting and subjective probability is richer and more
complicated than my cartoon. See [3], sections 3.1-3.4.)</p>
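<p>The arithmetic behind this cartoon example can be checked directly. The sketch below (the function and payoff names are my own) solves for the coin bias at which the bet has zero expected value:</p>

```python
# Solve for the coin bias that makes the bet fair (zero expected value).
# Payoffs match the example in the text: win $1 on A, lose $2 on B.
def fair_prob_of_A(win_if_A=1.0, lose_if_B=2.0):
    # Fairness: p * win_if_A - (1 - p) * lose_if_B = 0
    # => p = lose_if_B / (win_if_A + lose_if_B)
    return lose_if_B / (win_if_A + lose_if_B)

print(fair_prob_of_A())  # 2/3, matching the text
```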
<p>With a coin, the aleatoric randomness is produced by the symmetric shape of
the coin together with flipping or spinning, which creates a symmetry between the two
sides. The biased coin can be extended to multiple outcomes with uneven dice,
such as sheep knuckle bones, again with symmetry created between outcomes via
spinning. Of course, you can draw from an urn using biased coins, or produce
bets with urns. That is not my point! The point is that the way these gambling
devices are used metaphorically is distinct.</p>
<h1 id="the-spinner-continuous-uniform-random-variables">The spinner (continuous uniform random variables)</h1>
<p>The urn and the biased coin are fundamentally discrete, though much of
statistics deals with continuous-valued random variables. The spinner is the
most natural way to produce a continuous random variable — namely, a uniform
distribution on the circumference of a circle. A spinner creates aleatoric
randomness by symmetry of the disk together with a vigorous spin. The needle
goes around many times, but the random number is produced by the fractional part
of the number of cycles. Pseudo-random number generators like the Mersenne
twister seem to me to be in the same class, as they are based on the fractional
part of a large number.</p>
<p>The spinner creates something of a bridge to the rest of probability theory,
since any continuous random variable can be produced by applying a function
(the inverse CDF) to a uniform random variable on the unit interval. Given a
spinner, one can begin to imagine complex aleatoric processes based on spinners and
computation alone. Of course we can form approximations to the continuum with a
sufficiently large number of coin flips, for example, or a sufficiently large
urn. However, I think the spinner provides much cleaner intuition for why we
consider continuous random variables to be reasonable aleatoric processes in the
first place.</p>
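<p>The inverse-CDF construction can be sketched in a few lines. This is my own illustration: a "spinner" is modeled as a uniform draw on [0, 1), and applying an inverse CDF turns it into any continuous random variable, here an exponential.</p>

```python
import math
import random

# Sketch: a "spinner" is a uniform draw on [0, 1); applying an inverse CDF
# turns it into any continuous random variable.  Here, exponential(1) draws.
def spin(rng):
    return rng.random()  # stand-in for a physical spinner

def exponential_draw(rng):
    u = spin(rng)
    return -math.log(1.0 - u)  # inverse CDF of the exponential(1)

rng = random.Random(0)
draws = [exponential_draw(rng) for _ in range(100_000)]
print(sum(draws) / len(draws))  # close to 1, the exponential(1) mean
```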
<h1 id="probabilistic-models">Probabilistic models</h1>
<p>Once we have the probability calculus (via the spinner and computation), we can
begin to form quite complex aleatoric models to represent our uncertainty.
Arguably, this is the realm in which a lot of modern statistical work takes
place. For example, suppose you are analyzing a binary outcome (hospitalized
for COVID or not) as a function of some regressors (age and vaccine status).
For an individual with a given age and vaccine status, we do not know for
certain whether they will be hospitalized. A logistic regression is precisely a
posited aleatoric system to describe this subjective uncertainty. Software like
Stan, which allows generalists to perform inference on their own generative
processes, makes this kind of complex modeling relatively easy.</p>
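<p>The logistic-regression-as-aleatoric-system view can be sketched directly. In the illustration below, the coefficients are invented for the sake of the example, not estimates from any real data: hospitalization becomes a biased-coin flip whose bias is set by the linear predictor.</p>

```python
import math
import random

# Sketch of logistic regression as a posited aleatoric system: given age
# and vaccine status, hospitalization is a biased-coin flip whose bias
# comes from the linear predictor.  The coefficients are invented for
# illustration, not estimated from any real data.
def hospitalization_prob(age, vaccinated, b0=-4.0, b_age=0.05, b_vax=-1.5):
    logit = b0 + b_age * age + b_vax * vaccinated
    return 1.0 / (1.0 + math.exp(-logit))  # logistic link

rng = random.Random(0)
p = hospitalization_prob(age=70, vaccinated=1)
outcomes = [rng.random() < p for _ in range(10_000)]
print(p, sum(outcomes) / len(outcomes))  # simulated frequency is close to p
```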
<p>Of course, at this level of abstraction, the metaphor can lose clarity and
force. Why is logistic regression reasonable? Why not some other link
function? Why not other regressors (e.g. interactions)? Taking for granted
that such abstract models provide good metaphors for epistemic uncertainty is at
the root of many misapplications of statistics. In fact, many early
statisticians, particularly those in the frequentist camps, were expressly
unwilling to extend the statistical analogy much further than exchangeability.
One might see a key difference between Neyman-Rubin causal inference ([1]),
which (mostly) requires only the urn, and Pearlian causal inference ([4]),
which requires probabilistic graphical models, as a difference in willingness
to stretch the statistical analogy.</p>
<p>As with all analogies, the quality of a particular statistical analogy is
subject to an ineradicable subjectivity. But being aware of what analogy
is being made in a particular situation can help clarify disagreements
and avoid missteps.</p>
<h1 id="references">References</h1>
<p>[0] Gelman, Andrew, et al. Bayesian data analysis. CRC Press, 2013.</p>
<p>[1] Imbens, Guido W., and Donald B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.</p>
<p>[2] Shafer, Glenn, and Vladimir Vovk. “A Tutorial on Conformal Prediction.” Journal of Machine Learning Research 9.3 (2008).</p>
<p>[3] Ghosh, Jayanta K., Mohan Delampady, and Tapas Samanta. An introduction to Bayesian analysis: theory and methods. Vol. 725. New York: Springer, 2006.</p>
<p>[4] Pearl, Judea. Causality. Cambridge University Press, 2009.</p>In an earlier post, I discuss how statistics uses gambling devices (aleatoric uncertainty) as a metaphor for the unknown in general (epistemic uncertainty). I called this the “statistical analogy.” Of course, this perspective is not at all new — see section 1.5 of [0], for example.How does AMIP work for regression when the weight vector induces colinearity in the regressors?2021-12-17T10:00:00+00:002021-12-17T10:00:00+00:00/amip/2021/12/17/reweighted_colinear_note<p>How does AMIP work for regression when the weight vector induces
colinearity in the regressors? This problem came up in our paper, as
well as for a couple of users of <code class="language-plaintext highlighter-rouge">zaminfluence</code>. Superficially, the
higher-order infinitesimal jackknife has nothing to say about such a
point, since a requirement for the accuracy of the approximation is that
the Hessian matrix be uniformly non-singular. However, we shall see
that, in the special case of regression, we can re-express the problem
so that the singularity disappears.</p>
<h3 id="notation">Notation</h3>
<p>Suppose we have a weighted regression problem with regressor matrix \(X\)
(an \(N \times D\) matrix), response vector \(\vec{y}\) (an \(N\)-vector), and
weight vector \(\vec{w}\) (an \(N\)-vector):
\(\begin{aligned}
%
\hat{\theta}(\vec{w}) :={}& \theta\textrm{ such that }\sum_{n=1}^N
w_n (y_n - \theta^T x_n) x_n = 0
\Rightarrow\\
\hat{\theta}(\vec{w}) ={}&
\left(\frac{1}{N}\sum_{n=1}^N
w_n x_n x_n^T \right)^{-1} \frac{1}{N}\sum_{n=1}^Nw_n y_n x_n
\\={}& \left((\vec{w}\odot X)^T X\right)^{-1} (\vec{w}\odot X)^T \vec{y}.
%
\end{aligned}\)</p>
<p>Here, we have used \(\vec{w}\odot X\) to denote the Hadamard
product with broadcasting. Formally, we really mean
\((\vec{w}1_D^T) \odot X\), where \(1_D^T\) is a \(D\)-length row vector
containing all ones. Throughout, we will use \(1\) and \(0\) subscripted by
a dimension to represent vectors filled with ones and zeros
respectively.</p>
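<p>A numerical sketch of the estimator above (my own illustration, not part of <code class="language-plaintext highlighter-rouge">zaminfluence</code>), with \(\vec{w}\odot X\) implemented by broadcasting \(\vec{w}\) against the columns of \(X\):</p>

```python
import numpy as np

# Sketch of the weighted least-squares estimator in the matrix form above.
def theta_hat(X, y, w):
    WX = w[:, None] * X                    # (w ⊙ X), an N x D matrix
    return np.linalg.solve(WX.T @ X, WX.T @ y)

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# With unit weights, this reduces to ordinary least squares.
ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(theta_hat(X, y, np.ones(N)), ols))  # True
```

With a binary weight vector, the same formula reproduces the regression on the kept rows only, which is the dropping operation studied in this post.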
<h3 id="how-can-weights-induce-rank-deficiency">How can weights induce rank deficiency?</h3>
<p>We are interested in what happens to the linear approximation at a
weight vector \(\vec{w}\) for which the Hessian
\(\frac{1}{N}\sum_{n=1}^Nw_n x_n x_n^T\) is singular. Assume that \(X\) has
rank \(D\), and that \((\vec{w}\odot X)\) is rank \(D-1\). Specifically, there
exists some nonzero vector \(a_1 \in \mathbb{R}^D\) such that \((\vec{w}
\odot X) a_1 = 0_N\), where \(0_N\) is the \(N\)-length vector of zeros. For
each \(n\), the preceding expression implies that \(w_n x_n^T a_1 = 0\), so
either \(w_n = 0\) or \(x_n^T a_1 = 0\). Without loss of generality, we can
thus order the observations so that the first \(N_{d}\) rows are the ones we drop:</p>
\[\begin{aligned}
%
\vec{w}=
\left(\begin{array}{c}
0_{N_{d}}\\
1_{N_{k}}
\end{array}\right)
\quad\textrm{and}\quad
X= \left(\begin{array}{c}
X_d\\
X_k
\end{array}\right)
%
\end{aligned}\]
<p>Here, \(X_d\) is an \(N_{d}\times D\) matrix of dropped
rows and \(X_k\) is an \(N_{k}\times D\) matrix of kept rows, where
\(N_{k}+ N_{d}= N\). We thus have</p>
\[\begin{aligned}
%
Xa_1 =
\left(\begin{array}{c}
X_d a_1\\
0_{N_{k}}
\end{array}\right).
%
\end{aligned}\]
<p>Here, \(X_d a_1 \ne 0\) (for otherwise \(X\) could not be
rank \(D\)). In other words, the rows \(X_k\) are rank deficient, the rows
\(X_d\) are not, but \(\vec{w}\odot X\) is rank deficient precisely because
\(\vec{w}\) drops the full-rank portion \(X_d\).</p>
<h3 id="reparameterize-to-isolate-the-vanishing-subspace">Reparameterize to isolate the vanishing subspace</h3>
<p>To understand how \(\hat{\theta}(\vec{w})\) behaves, let’s isolate the
coefficient that corresponds to the subspace that vanishes. To that end,
let \(A\) denote an invertible \(D \times D\) matrix whose first column is
\(a_1\).</p>
\[\begin{aligned}
%
A := \left(\begin{array}{cccc}
a_1 & a_2 & \ldots & a_D \\
\end{array}\right).
%
\end{aligned}\]
<p>Define \(Z:= XA\) and \(\beta := A^{-1} \theta\) so that
\(X\theta = Z\beta\).
Then we can equivalently investigate the behavior of</p>
\[\begin{aligned}
%
\hat{\beta}(\vec{w}) ={}&
\left((\vec{w}\odot Z)^T Z\right)^{-1}
(\vec{w}\odot Z)^T \vec{y}.
%
\end{aligned}\]
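<p>As a quick numerical sanity check (my own sketch, not part of the derivation), the reparameterized fit satisfies \(\hat{\beta}(\vec{w}) = A^{-1}\hat{\theta}(\vec{w})\) for any invertible \(A\):</p>

```python
import numpy as np

# Sketch: regressing on Z = X A is the same fit in new coordinates, so
# beta_hat = A^{-1} theta_hat for any invertible A.  All quantities here
# are randomly generated for illustration.
def fit(M, y, w):
    WM = w[:, None] * M
    return np.linalg.solve(WM.T @ M, WM.T @ y)

rng = np.random.default_rng(1)
N, D = 40, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.uniform(0.5, 1.5, size=N)
A = rng.normal(size=(D, D))            # invertible with probability one

theta = fit(X, y, w)
beta = fit(X @ A, y, w)                # regress on Z = X A
print(np.allclose(beta, np.linalg.solve(A, theta)))  # True
```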
<p>If we write \(Z_1\) for the first column of \(Z\) and
\(Z_{2:D}\) for the \(N \times (D - 1)\) remaining columns, we have</p>
\[\begin{aligned}
%
Z=
\left(\begin{array}{cc}
Z_1 & Z_{2:D} \\
\end{array}\right)
=
\left(\begin{array}{cc}
Xa_1 & XA_{2:D} \\
\end{array}\right)
=
\left(\begin{array}{cc}
X_d a_1 & X_d A_{2:D} \\
0_{N_{k}} & X_k A_{2:D} \\
\end{array}\right),
%
\end{aligned}\]
<p>where we have used the defining property of \(a_1\) and the
partition of \(X\) from above.</p>
<h3 id="consider-a-straight-path-from-1_n-to-vecw">Consider a straight path from \(1_N\) to \(\vec{w}\)</h3>
<p>Define \(\vec{w}(t) = (\vec{w}- 1_N) t+ 1_N\) for \(t\in [0, 1]\), so that
\(\vec{w}(0) = 1_N\) and \(\vec{w}(1) = \vec{w}\). We can now write an
explicit formula for \(\hat{\beta}(\vec{w}(t))\) as a function of \(t\) and
consider what happens as \(t\rightarrow 1\).</p>
<p>Because \(\vec{w}\) has zeros in its first \(N_{d}\) entries,</p>
\[\begin{aligned}
%
\vec{w}(t) \odot Z=
\left(\begin{array}{cc}
(1-t) X_d a_1 & (1-t) X_d A_{2:D} \\
0_{N_{k}} & X_k A_{2:D} \\
\end{array}
\right)
%
\end{aligned}\]
<p>and</p>
\[\begin{aligned}
%
\left((\vec{w}(t)\odot Z)^T Z\right) ={}&
\left(\begin{array}{cc}
(1-t) a_1^T X_d^T X_d a_1 &
(1-t) a_1^T X_d^T X_d A_{2:D}\\
(1-t) A_{2:D}^T X_d^T X_d a_1 &
A_{2:D}^T ( X_k^T X_k + (1-t) X_d^T X_d )A_{2:D} \\
\end{array}\right).
%
\end{aligned}\]
<p>Since the upper left hand entry \((1-t) a_1^T X_d^T X_d a_1 \rightarrow
0\) as \(t\rightarrow 1\), we can see again that the regression is singular when
evaluated at \(\vec{w}\).</p>
<p>However, by partitioning \(\vec{y}\) into dropped and kept components,
\(\vec{y}_d\) and \(\vec{y}_k\) respectively, we also have</p>
\[\begin{aligned}
%
(\vec{w}(t) \odot Z)^T \vec{y}={}&
\left(\begin{array}{c}
(1-t) a_1^T X_d^T \vec{y}_d\\
A_{2:D}^T \left(X_k^T \vec{y}_k + (1-t)X_d^T \vec{y}_d \right)
\end{array}\right).
%
\end{aligned}\]
<p>One can perhaps see at this point that the factor \((1-t)\) will
cancel in the numerator and denominator of the regression. This can be
made precise with the Schur complement. Write the blocks of
\((\vec{w}(t)\odot Z)^T Z\) as \(B_{11}(t)\), \(B_{12}(t)\), \(B_{21}(t)\),
and \(B_{22}(t)\), and the blocks of \((\vec{w}(t)\odot Z)^T \vec{y}\) as
\(b_1(t)\) and \(b_2(t)\), partitioned as above. Letting \(\hat{\beta}_1\)
denote the first element of \(\hat{\beta}\), the block-inverse formula gives</p>
\[\begin{aligned}
%
\hat{\beta}_1(\vec{w}(t)) =
\frac{b_1(t) - B_{12}(t) B_{22}(t)^{-1} b_2(t)}{
B_{11}(t) - B_{12}(t) B_{22}(t)^{-1} B_{21}(t)}.
%
\end{aligned}\]
<p>Each of \(b_1(t)\), \(B_{11}(t)\), \(B_{12}(t)\), and \(B_{21}(t)\)
carries a factor of \(1-t\), while \(B_{22}(t)\) and \(b_2(t)\) converge to
an invertible matrix and a finite vector, respectively, as
\(t\rightarrow 1\). Cancelling the common factor of \(1-t\) gives</p>
\[\begin{aligned}
%
\hat{\beta}_1(\vec{w}(t)) =
\frac{a_1^T X_d^T \vec{y}_d -
a_1^T X_d^T X_d A_{2:D} B_{22}(t)^{-1} b_2(t)}{
a_1^T X_d^T X_d a_1 - O(1-t)}
%
\end{aligned}\]
<p>where \(O(\cdot)\) denotes a term of the specified order
as \(t\rightarrow 1\). We can see that, as \(t\rightarrow 1\),
\(\hat{\beta}_1(\vec{w}(t))\) in fact varies smoothly, and we can expect
linear approximations to work as well as they might even without the
singularity. Formally, the singularity is a “removable singularity,”
analogous to when a factor cancels in a ratio of polynomials.</p>
<p>An analogous argument holds for the rest of the \(\hat{\beta}\) vector,
again using the Schur complement. Since \(\hat{\theta}\) is simply a
linear transform of \(\hat{\beta}\), the same reasoning applies to
\(\hat{\theta}\) as well.</p>
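<p>The removable singularity is easy to verify numerically. In the sketch below (my own construction), the kept rows are made rank deficient by zeroing the third regressor outside the dropped rows; \(\hat{\theta}(\vec{w}(t))\) nonetheless approaches a finite limit as \(t \rightarrow 1\), even though the weighted Hessian is singular exactly at \(t = 1\).</p>

```python
import numpy as np

# Sketch: theta_hat(w(t)) stays smooth as t -> 1 even though the weighted
# Hessian is singular at t = 1.  The kept rows are rank deficient by
# construction: the third regressor is nonzero only on the dropped rows.
rng = np.random.default_rng(2)
N, Nd = 30, 5
X = rng.normal(size=(N, 3))
X[Nd:, 2] = 0.0                        # kept rows span only two directions
y = rng.normal(size=N)
w = np.concatenate([np.zeros(Nd), np.ones(N - Nd)])

def theta_hat(t):
    wt = (w - 1.0) * t + 1.0           # straight path from all-ones to w
    WX = wt[:, None] * X
    return np.linalg.solve(WX.T @ X, WX.T @ y)

# The estimate barely moves as t approaches 1:
print(theta_hat(0.99))
print(theta_hat(0.9999))
```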
<h3 id="conslusions-and-consequences">Conclusions and consequences</h3>
<p>Though the regression problem is singular precisely at \(\vec{w}\), it is
in fact well-behaved in a neighborhood of \(\vec{w}\). This is because
re-weighting downweights the dropped rows in both the Hessian and the
gradient term, so the vanishing factor cancels. Singularity occurs only
when entries of \(\vec{w}\) are precisely zero. For theoretical and
practical purposes, you can completely avoid the problem by simply
considering weight vectors that are not precisely zero at the left-out
points, taking instead some arbitrarily small values.</p>
<p>The most common way this seems to occur in practice is when a weight
vector drops all the observations in some level of an indicator variable.
It is definitely on my TODO list to find some way to allow
<code class="language-plaintext highlighter-rouge">zaminfluence</code> to deal with this
gracefully.</p>
<p>Note that this analysis leaned heavily on the structure of linear
regression. In general, when the Hessian matrix of the objective
function is nearly singular, it will be associated with non-linear
behavior of \(\hat{\theta}(\vec{w}(t))\) along a path from \(1_N\) to \(w\).
Linear regression is rather a special case.</p>How does AMIP work for regression when the weight vector induces colinearity in the regressors? This problem came up in our paper, as well as for a couple of users of zaminfluence. Superficially, the higher-order infinitesimal jackknife has nothing to say about such a point, since a requirement for the accuracy of the approximation is that the Hessian matrix be uniformly non-singular. However, we shall see that, in the special case of regression, we can re-express the problem so that the singularity disappears.