Ryan Giordano, statistician. This is the professional webpage and open research journal of Ryan Giordano.

<h1>How does AMIP work for regression when the weight vector induces collinearity in the regressors? (2021-12-17)</h1>
<p>How does AMIP work for regression when the weight vector induces
collinearity in the regressors? This problem came up in our paper, as
well as for a couple of users of <code class="language-plaintext highlighter-rouge">zaminfluence</code>. Superficially, the
higher-order infinitesimal jackknife has nothing to say about such a
point, since a requirement for the accuracy of the approximation is that
the Hessian matrix be uniformly non-singular. However, we shall see
that, in the special case of regression, we can re-express the problem
so that the singularity disappears.</p>
<h3 id="notation">Notation</h3>
<p>Suppose we have a weighted regression problem with regressor matrix \(X\)
(an \(N \times D\) matrix), response vector \(\vec{y}\) (an \(N\)-vector), and
weight vector \(\vec{w}\) (an \(N\)-vector):
\(\begin{aligned}
%
\hat{\theta}(\vec{w}) :={}& \theta\textrm{ such that }\sum_{n=1}^N
w_n (y_n - \theta^T x_n) x_n = 0
\Rightarrow\\
\hat{\theta}(\vec{w}) ={}&
\left(\frac{1}{N}\sum_{n=1}^N
w_n x_n x_n^T \right)^{-1} \frac{1}{N}\sum_{n=1}^Nw_n y_n x_n
\\={}& \left((\vec{w}\odot X)^T X\right)^{-1} (\vec{w}\odot X)^T \vec{y}.
%
\end{aligned}\)</p>
<p>Here, we have used \(\vec{w}\odot X\) to denote the Hadamard
product with broadcasting. Formally, we really mean
\((\vec{w}1_D^T) \odot X\), where \(1_D^T\) is a \(D\)-length row vector
containing all ones. Throughout, we will use \(1\) and \(0\) subscripted by
a dimension to represent vectors filled with ones and zeros
respectively.</p>
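As a quick numerical sanity check, here is a minimal sketch (Python with NumPy; my own, not from the post) verifying that the closed form above agrees with an ordinary weighted least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
w = rng.uniform(0.5, 2.0, size=N)

# Closed form from above: ((w ⊙ X)^T X)^{-1} (w ⊙ X)^T y,
# where w ⊙ X broadcasts w across the columns of X.
WX = w[:, None] * X
theta_hat = np.linalg.solve(WX.T @ X, WX.T @ y)

# The same estimate via square-root-weighted least squares.
sqrt_w = np.sqrt(w)
theta_lstsq, *_ = np.linalg.lstsq(sqrt_w[:, None] * X, sqrt_w * y, rcond=None)
print(np.allclose(theta_hat, theta_lstsq))  # True
```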
<h3 id="how-can-weights-induce-rank-deficiency">How can weights induce rank deficiency?</h3>
<p>We are interested in what happens to the linear approximation at a
weight vector \(\vec{w}\) for which the Hessian
\(\frac{1}{N}\sum_{n=1}^Nw_n x_n x_n^T\) is singular. Assume that \(X\) has
rank \(D\), and that \((\vec{w}\odot X)\) has rank \(D-1\). Specifically, there
exists some nonzero vector \(a_1 \in \mathbb{R}^D\) such that \((\vec{w}
\odot X) a_1 = 0_N\), where \(0_N\) is the \(N\)-length vector of zeros. For
each \(n\), the preceding expression implies that \(w_n x_n^T a_1 = 0\), so
either \(w_n = 0\) or \(x_n^T a_1 = 0\). Without loss of generality, we can
thus order the observations so that the first \(N_{d}\) rows are the dropped ones:</p>
\[\begin{aligned}
%
\vec{w}=
\left(\begin{array}{c}
0_{N_{d}}\\
1_{N_{k}}
\end{array}\right)
\quad\textrm{and}\quad
X= \left(\begin{array}{c}
X_d\\
X_k
\end{array}\right)
%
\end{aligned}\]
<p>Here, \(X_d\) is an \(N_{d}\times D\) matrix of dropped
rows and \(X_k\) is an \(N_{k}\times D\) matrix of kept rows, where
\(N_{k}+ N_{d}= N\). We thus have</p>
\[\begin{aligned}
%
Xa_1 =
\left(\begin{array}{c}
X_d a_1\\
0_{N_{k}}
\end{array}\right).
%
\end{aligned}\]
<p>Here, \(X_d a_1 \ne 0\) (for otherwise \(X\) could not be
rank \(D\)). In other words, the kept rows \(X_k\) are rank deficient, the
dropped rows \(X_d\) are not, and \(\vec{w}\odot X\) is rank deficient precisely
because \(\vec{w}\) drops the rank-restoring portion \(X_d\).</p>
<h3 id="reparameterize-to-isolate-the-vanishing-subspace">Reparameterize to isolate the vanishing subspace</h3>
<p>To understand how \(\hat{\theta}(\vec{w})\) behaves, let’s isolate the
coefficient that corresponds to the subspace that vanishes. To that end,
let \(A\) denote an invertible \(D \times D\) matrix whose first column is
\(a_1\).</p>
\[\begin{aligned}
%
A := \left(\begin{array}{cccc}
a_1 & a_2 & \ldots & a_D \\
\end{array}\right).
%
\end{aligned}\]
<p>Define \(Z:= XA\) and \(\beta := A^{-1} \theta\) so that
\(X\theta = Z\beta\).
Then we can equivalently investigate the behavior of</p>
\[\begin{aligned}
%
\hat{\beta}(\vec{w}) ={}&
\left((\vec{w}\odot Z)^T Z\right)^{-1}
(\vec{w}\odot Z)^T \vec{y}.
%
\end{aligned}\]
<p>If we write \(Z_1\) for the first column of \(Z\) and
\(Z_{2:D}\) for the \(N \times (D - 1)\) remaining columns, we have</p>
\[\begin{aligned}
%
Z=
\left(\begin{array}{cc}
Z_1 & Z_{2:D} \\
\end{array}\right)
=
\left(\begin{array}{cc}
Xa_1 & XA_{2:D} \\
\end{array}\right)
=
\left(\begin{array}{cc}
X_d a_1 & X_d A_{2:D} \\
0_{N_{k}} & X_k A_{2:D} \\
\end{array}\right),
%
\end{aligned}\]
<p>where we have used the definition of \(a_1\) and the
partition of \(X\) from above.</p>
<h3 id="consider-a-straight-path-from-1_n-to-vecw">Consider a straight path from \(1_N\) to \(\vec{w}\)</h3>
<p>Define \(\vec{w}(t) = (\vec{w}- 1_N) t+ 1_N\) for \(t\in [0, 1]\), so that
\(\vec{w}(0) = 1_N\) and \(\vec{w}(1) = \vec{w}\). We can now write an
explicit formula for \(\hat{\beta}(\vec{w}(t))\) as a function of \(t\) and
consider what happens as \(t\rightarrow 1\).</p>
<p>Because \(\vec{w}\) has zeros in its first \(N_{d}\) entries,</p>
\[\begin{aligned}
%
\vec{w}(t) \odot Z=
\left(\begin{array}{cc}
(1-t) X_d a_1 & (1-t) X_d A_{2:D} \\
0_{N_{k}} & X_k A_{2:D} \\
\end{array}
\right)
%
\end{aligned}\]
<p>and</p>
\[\begin{aligned}
%
\left((\vec{w}(t)\odot Z)^T Z\right) ={}&
\left(\begin{array}{cc}
(1-t) a_1^T X_d^T X_d a_1 &
(1-t) a_1^T X_d^T X_d A_{2:D}\\
(1-t) A_{2:D}^T X_d^T X_d a_1 &
A_{2:D}^T ( X_k^T X_k + (1-t) X_d^T X_d )A_{2:D} \\
\end{array}\right).
%
\end{aligned}\]
<p>Since the upper left hand entry \((1-t) a_1^T X_d^T X_d a_1 \rightarrow
0\) as \(t\rightarrow 1\), we can see again that the regression is singular when
evaluated at \(\vec{w}\).</p>
<p>However, by partitioning \(\vec{y}\) into dropped and kept components,
\(\vec{y}_d\) and \(\vec{y}_k\) respectively, we also have</p>
\[\begin{aligned}
%
(\vec{w}(t) \odot Z)^T \vec{y}={}&
\left(\begin{array}{c}
(1-t) a_1^T X_d^T \vec{y}_d\\
A_{2:D}^T \left((1-t) X_d^T \vec{y}_d + X_k^T \vec{y}_k \right)
\end{array}\right).
%
\end{aligned}\]
<p>One can perhaps see at this point that the \((1-t)\) will
cancel in the numerator and denominator of the regression. This can be
seen directly via the Schur complement. Letting \(\hat{\beta}_1\) denote
the first element of \(\hat{\beta}\), the Schur complement gives</p>
\[\begin{aligned}
%
\hat{\beta}_1(\vec{w}(t)) =
\frac{(1-t) \left( a_1^T X_d^T \vec{y}_d - R_1(t) \right)}{
(1-t) \left( a_1^T X_d^T X_d a_1 - (1-t) R_2(t) \right)
}
=
\frac{a_1^T X_d^T \vec{y}_d - R_1(t)}{
a_1^T X_d^T X_d a_1 - (1-t) R_2(t)
},
%
\end{aligned}\]
<p>where \(R_1(t)\) and \(R_2(t)\) are the Schur-complement cross terms, which
remain bounded and smooth as \(t\rightarrow 1\). We can see that, as
\(t\rightarrow 1\), \(\hat{\beta}_1(\vec{w}(t))\) in fact varies smoothly, and we
can expect linear approximations to work as well as they might even without the
singularity. Formally, the singularity is a “removable singularity,”
analogous to when a factor cancels in a ratio of polynomials.</p>
<p>An analogous argument holds for the rest of the \(\hat{\beta}\) vector,
again using the Schur complement. Since \(\hat{\theta}\) is simply a
linear transform of \(\hat{\beta}\), the same reasoning applies to
\(\hat{\theta}\) as well.</p>
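To see the removable singularity numerically, here is a small sketch (Python with NumPy; the data-generating setup is hypothetical and my own) in which the kept rows are collinear, yet \(\hat{\theta}(\vec{w}(t))\) remains stable as \(t \rightarrow 1\):

```python
import numpy as np

rng = np.random.default_rng(0)
N_d, N_k, D = 5, 20, 2

# Kept rows X_k are collinear: both columns equal, so X_k a1 = 0 for a1 = (1, -1).
z = rng.standard_normal(N_k)
X_k = np.column_stack([z, z])
X_d = rng.standard_normal((N_d, D))  # dropped rows restore rank D
X = np.vstack([X_d, X_k])
y = rng.standard_normal(N_d + N_k)
w = np.concatenate([np.zeros(N_d), np.ones(N_k)])  # this weight vector drops X_d

def theta_hat(w_t):
    # The weighted regression estimate ((w ⊙ X)^T X)^{-1} (w ⊙ X)^T y.
    WX = w_t[:, None] * X
    return np.linalg.solve(WX.T @ X, WX.T @ y)

# Along the straight path w(t) = (w - 1) t + 1, the estimate varies smoothly
# even though the problem is singular exactly at t = 1.
for t in [0.9, 0.99, 0.999, 0.9999]:
    print(t, theta_hat((w - 1.0) * t + 1.0))
```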
<h3 id="conslusions-and-consequences">Conclusions and consequences</h3>
<p>Though the regression problem is singular precisely at \(\vec{w}\), it is
in fact well-behaved in a neighborhood of \(\vec{w}\). This is because
re-weighting downweights the rank-restoring rows in both the Gram matrix and
the corresponding response terms, so the vanishing factor cancels. Singularity
occurs only when entries of \(\vec{w}\) are precisely zero. For theoretical
and practical purposes, you can completely avoid the problem by simply
considering weight vectors that are not precisely zero at the left-out
points, taking instead some arbitrarily small values.</p>
<p>The most common way this seems to occur in practice is when a weight
vector drops every observation in some level of an indicator variable. It is
definitely on my TODO list to find some way to allow <code class="language-plaintext highlighter-rouge">zaminfluence</code> to deal with this
gracefully.</p>
<p>Note that this analysis leaned heavily on the structure of linear
regression. In general, when the Hessian matrix of the objective
function is nearly singular, it will be associated with non-linear
behavior of \(\hat{\theta}(\vec{w}(t))\) along a path from \(1_N\) to \(\vec{w}\).
Linear regression is rather a special case.</p>

<h1>Fiducial inference and the interpretation of confidence intervals (2021-12-09)</h1>
<h2 id="why-is-it-so-hard-to-think-correctly-about-confidence-intervals">Why is it so hard to think “correctly” about confidence intervals?</h2>
<p>I came across the following section in the
(wonderful) textbook
<a href="https://moderndive.com/8-confidence-intervals.html">ModernDive</a>:</p>
<blockquote>
<p>Let’s return our attention to 95% confidence intervals.
… A common but incorrect interpretation is: “There is a 95% probability that the
confidence interval contains p.” Looking at Figure 8.27, each of the confidence
intervals either does or doesn’t contain p. In other words, the probability is
either a 1 or a 0.</p>
</blockquote>
<p>(Although I’m going to pick on this quote a little bit, I want to stress that I
love this textbook. This view of CIs is extremely common and I might well have
taken a similar quote from any number of other sources. This book just
happened to be in front of me today.)</p>
<p>I understand what the authors are saying. Given the data we observed and CI we
computed, there is no remaining randomness — either the parameter is in the
interval, or it isn’t. The parameter is not random, the data is. But I think
there is room to admit that this point, while technically clear, is a little
uncomfortable, even for those of us who are very familiar with these concepts.
After all, there is a 95% chance that a randomly chosen interval contains the
parameter. I chose an interval. Why can I no longer say that there is a 95%
chance that the parameter is in that interval? To a beginning student of
statistics who is encountering this idea for the first time, this commonplace
qualification must seem pedantic at best and confusing at worst.</p>
<h2 id="fiducial-inference">Fiducial inference</h2>
<p>Chapter 9 of Ian Hacking’s <em>Logic of Statistical Inference</em> contains a beautiful
account of precisely <em>why</em> we are so inclined towards the “incorrect”
interpretation, as well as the shortcomings of our intuition. The logic is
precisely that of Fisher’s famous (infamous?) fiducial inference.
Understanding this connection not only helps us to better understand CIs (and
their modes of failure), but also to be more sympathetic to the inherent
reasonableness of students who are disinclined to let go of the “incorrect”
interpretation.</p>
<p>As presented by Hacking, there are two relatively uncontroversial building
blocks of fiducial inference, and one problematic one. Recall the idea that
aleatoric probabilities (the stuff of gambling devices) and epistemic
probabilities (degrees of epistemic belief) are fundamentally different
quantities. (Hacking treats this question better than I could, but I also have
<a href="/philosophy/2021/08/22/what_is_statistics.html">a short post on this topic here</a>). Following Hacking, I will denote
the aleatoric probabilities by \(P\) and the degrees of belief by \(p\).</p>
<h3 id="assumption-one-the-frequency-principle">Assumption one: The “frequency principle.”</h3>
<p>The first necessary assumption of fiducial inference is this:</p>
<blockquote>
<p>If you know nothing about an aleatoric event \(E\) other than its probability,
then \(p(E) = P(E)\).</p>
</blockquote>
<p>This amounts to saying that, for pure gambling devices, absent other
information, your subjective belief about whether an outcome occurs should be
the same as the frequency with which that outcome occurs under randomization. If
you know a coin comes up heads 50% of the time (\(P(heads) = 0.5\)), then your
degree of certainty that it will come up heads on the next flip should be the
same (\(p(heads) = 0.5\)). Hacking calls this assumption the “frequency
principle.”</p>
<h3 id="assumption-two-the-logic-of-support">Assumption two: The “logic of support.”</h3>
<p>The second fundamental assumption is that the logic of epistemic probabilities
should be the same as the logic of aleatoric probabilities. Specifically:</p>
<blockquote>
<p>Degrees of belief should obey Kolmogorov’s axioms.</p>
</blockquote>
<p>For example, if events \(H\) and \(I\) are logically mutually exclusive, then
\(p(H \textrm{ or } I) = p(H) + p(I)\). Conditional probabilities such as \(p(H \vert
E)\) are a measure of how much the event \(E\) supports a subjective belief that
\(H\) occurs.</p>
<p>Neither the frequency principle nor the logic of support is particularly
controversial, even for avowed frequentists. Note that assumption one states
only how you come to subjective beliefs about systems you know to be
aleatoric, and assumption two describes only how subjective beliefs
combine coherently. So there is nothing really Bayesian here.</p>
<h2 id="hypothesis-testing-and-fiducial-inference">Hypothesis testing and fiducial inference</h2>
<p>Applying the frequency principle and the logic of support to confidence
intervals, together with an additional (more controversial) logical step, will
in fact lead us directly to the “incorrect” interpretation of a confidence
interval. Let’s see how the logic works.</p>
<p>Suppose we have some data \(X\), and we want to know the value of some parameter
\(\theta\). Suppose we have constructed a valid confidence set \(S(X)\) such
that \(P(\theta \in S(X)) = 0.95\). Following Hacking, let \(D\) denote the
event that our setup is correct — specifically, that we are correct about the
randomness of \(X\), and that \(S(X)\) is a valid CI with the desired coverage. That is,
given \(D\), we assume that \(X\) is really random, and we know the randomness,
so \(P\) is a true aleatoric probability — no subjective belief here.</p>
<p>Of course, the construction of a confidence interval guarantees only the
aleatoric probability — thus we have used \(P\), not \(p\). However, by the
frequency principle, we are justified in writing</p>
<p>\(p(\theta \in S(X) \vert D) = P(\theta \in S(X)\vert D) = 0.95\),</p>
<p>so long as we know nothing other than the accuracy of our setup \(D\). (Note
that \(\theta \in S(X)\) is a pivot. In general, pivots play a central role in
fiducial inference.)</p>
<p>Note that \(p(\theta \in S(X) \vert D)\) is very near to our “incorrect”
interpretation of confidence intervals! However, in reality, we know more than
\(S(X)\): we actually observe \(X\) itself. Now, \(P(\theta \in S(X) \vert D, X)\)
is either \(0\) or \(1\). Conditional on \(X\), there is no remaining
aleatoric uncertainty to which we can apply the frequency principle. And
most authors — including those of the quote that opened this post — stop here.</p>
<p>There is an additional assumption, however, that allows us to formally compute
\(p(\theta \in S(X) \vert X, D)\), and it is this (controversial) assumption that
lies at the core of fiducial inference:</p>
<h3 id="assumption-three-irrelevance">Assumption three: Irrelevance</h3>
<blockquote>
<p>The full data \(X\) tells us nothing more about \(\theta\) (in an epistemic sense),
than the confidence interval \(S(X)\).</p>
</blockquote>
<p>In the case of confidence intervals, the assumption of irrelevance requires at
least two things. First, it requires that our subjective belief that \(\theta
\in S(X)\) does not depend on the particular interval that we compute
from the data. In other words, we are as likely to believe that our CI
contains the parameter no matter where its endpoints lie. Second, it
requires that there is nothing more to be learned about the parameter
from the data <em>other</em> than the information contained in the CI.</p>
<p>These are strong assumptions! However, when they hold, they justify the
“incorrect” interpretation of confidence intervals — namely that there is a
95% subjective probability that \(\theta \in S(X)\), given the data we observed.
For, under the assumption of irrelevance, by the logic of support (and then the
frequency principle as above) we can write</p>
<p>\(p(\theta \in S(X) \vert X, D) =
p(\theta \in S(X) \vert D) =
P(\theta \in S(X) \vert D) = 0.95\).</p>
<h2 id="how-does-this-go-wrong-and-what-does-it-mean-for-teaching">How does this go wrong, and what does it mean for teaching?</h2>
<p>Assumption three is often hard to justify, or outright fallacious. But one of
its strengths is that it points to <em>how</em> the logic of fiducial inference fails,
when it does fail. In particular, it is not hard to construct valid confidence
intervals that contain only impossible values of \(\theta\) for some values of
\(X\). (As long as a confidence interval takes on crazy values sufficiently
rarely, there is nothing in the definition preventing it from doing so.) In
fact, as Hacking points out, confidence intervals are tools for <em>before</em> you see
the data, designed so that you do not make mistakes too often on average, and
can suggest strange conclusions once you have seen a particular dataset.</p>
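As a toy illustration (my own, not Hacking’s), the following sketch builds a perfectly valid 95% confidence procedure for a probability \(p \in [0, 1]\) that, on any particular realization, is either trivially correct or contains only impossible values:

```python
import numpy as np

rng = np.random.default_rng(0)

def silly_ci():
    """A valid 95% confidence procedure for a probability p in [0, 1]:
    95% of the time report [0, 1] (which always contains p), and
    5% of the time report [2, 3] (which contains no possible p)."""
    if rng.uniform() < 0.95:
        return (0.0, 1.0)
    return (2.0, 3.0)

# Coverage is exactly 95% no matter what the true p is...
p_true = 0.5
draws = [silly_ci() for _ in range(100_000)]
coverage = np.mean([lo <= p_true <= hi for lo, hi in draws])
print(coverage)  # close to 0.95

# ...yet once we see a particular realized interval, assumption three plainly
# fails: the endpoints alone tell us whether p is covered.
```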
<p>However, it’s not crazy for someone, especially a beginning student, to
subscribe to assumption three, even if they are not aware of it. After all, we
typically present a confidence interval as <em>the</em> way to summarize what your data
tells you about your parameter. And if that’s the case, then the “incorrect”
interpretation of CIs follows from the extremely plausible frequency principle
and logic of support. At the least I think we should acknowledge the
reasonableness of this logical chain, and teach when it goes wrong rather than
simply reject it by fiat.</p>

<h1>To think about the influence function, think about sums (2021-12-01)</h1>
<p>I think the key to thinking intuitively about the influence function in our
<a href="https://arxiv.org/abs/2011.14999">work on AMIP</a> is this: linearization
approximates a complicated estimator with a simple sum. If you can establish
that the linearization provides a good approximation, then you can reason about
your complicated estimator by reasoning about sums. And sums are easy to reason
about.</p>
<p>Specifically, suppose you have data weights \(w = (w_1, \ldots, w_N)\) and
an estimator \(\phi(w) \in \mathbb{R}\) which depends on the weights
in some complicated way. Let \(\phi^{lin}\) denote the first-order
Taylor series expansion around the unit weight vector \(\vec{1} := (1, \ldots, 1)\)</p>
\[\phi^{lin}(w) :=
\phi(\vec{1}) +
\sum_{n=1}^N \psi_n (w_n - 1) =
\phi(\vec{1}) + \sum_{n=1}^N \psi_n w_n
\quad\textrm{where}\quad
\psi_n := \frac{\partial \phi(w)}{\partial w_n}\Bigg|_{\vec{1}},\]
<p>and we have used the fact that \(\sum_{n=1}^N \psi_n = 0\) for Z-estimators.
(For situations where \(\sum_{n=1}^N \psi_n \ne 0\), just keep that sum around,
and everything I say in this post still applies.) Thinking now of \(\psi\) as
data, we can (in some abuse of notation) write \(\phi^{lin}(\psi) =
\phi(\vec{1}) + \sum_{n=1}^N \psi_n\). If \(\phi^{lin}(w)\) is a good
approximation to \(\phi(w)\), then the effect of leaving a datapoint out of
\(\phi(w)\) is well-approximated by the effect of leaving the corresponding
entry out of \(\psi\) in \(\phi^{lin}(\psi)\). We have, in effect, replaced a
complicated data dependence with a simple sum of terms. This is what
linearization does for us. (NB: if our original estimator had been a sum of the
data, the linearization would be exact!)</p>
<p>Typically \(\psi_n = O_p(N^{-1})\), so it is helpful to define
\(\gamma_n := N \psi_n\). We can then write:</p>
\[\phi^{lin}(\gamma) := \phi(\vec{1}) + \frac{1}{N}\sum_{n=1}^N \gamma_n.\]
<p>We can now ask what kinds of changes we can produce in \(\phi^{lin}(\gamma)\) by
dropping entries from \(\gamma\) (while keeping \(N\) the same); some of the
core conclusions of our paper then become obvious. (Definitionally, \(\sum_{n=1}^N
\gamma_n = 0\).) For example, if we drop \(\alpha N\) points, for some fixed \(0 <
\alpha < 1\), then the amount we can change the sum \(\frac{1}{N}\sum_{n=1}^N
\gamma_n\) does not vanish, no matter how large \(N\) is, and no matter how
small \(\alpha\) is. The amount you can change the sum
\(\frac{1}{N}\sum_{n=1}^N \gamma_n\) also obviously depends on the tail shape of
the distribution of the \(\gamma_n\), as well as their absolute scale.
Increasing the scale (i.e., increasing the noise) obviously increases the amount
you can change the sum. And, for a given scale (i.e., a given \(\frac{1}{N}
\sum_{n=1}^N \gamma_n^2\)), you will be able to change the sum by the most when
the left-out \(\gamma_n\) all take the same value.</p>
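A small numerical sketch (Python/NumPy, with hypothetical influence scores of my own) of the first point above: dropping the \(\lfloor \alpha N \rfloor\) most extreme \(\gamma_n\) changes the sum by an amount that does not vanish as \(N\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.01  # drop 1% of the data
changes = []

for N in [1_000, 10_000, 100_000]:
    # Hypothetical influence scores gamma_n = N * psi_n, centered to sum to zero.
    gamma = rng.standard_normal(N)
    gamma -= gamma.mean()

    # Largest increase in phi^lin achievable by dropping floor(alpha * N)
    # points: remove the entries with the most negative scores.
    n_drop = int(alpha * N)
    change = -np.sort(gamma)[:n_drop].sum() / N
    changes.append(change)
    print(N, change)
```

The printed change settles around a positive constant rather than shrinking with \(N\), which is exactly the non-vanishing effect described above.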
<p>So one way to think about AMIP is this: we provide a good approximation to your
original statistic that takes the form of a simple sum over your data. Dropping
datapoints corresponds to dropping data from this sum. You can then think about
whether dropping sets that are selected in a certain way is reasonable or not
in terms of dropping entries from a sum, about which it’s easy to have good
intuition!</p>

<h1>The bootstrap randomly queries the influence function (2021-11-08)</h1>
<p>When we present our <a href="https://arxiv.org/abs/2011.14999">work on AMIP</a> the
relationship with the bootstrap often comes up. I think there’s a lot to say,
but there’s one particularly useful perspective: the (empirical, nonparametric)
bootstrap can be thought of as <em>randomly querying the influence function</em>.
From this perspective, it seems both clear (a) why the bootstrap works as
an estimator of variance and (b) why it won’t work to find the
approximately most influential set, i.e., the set of points which
have the most extreme values of the influence function (AMIS in our paper).</p>
<p>Let’s suppose that you have a vector \(\psi \in \mathbb{R}^N\), with
\(\sum_{n=1}^N \psi_n = 0\), where \(N\) is very large. We would like to know
about \(\psi\), but suppose we can’t access it directly. Rather, we can only
query it via inner products \(v^T \psi\). Moreover, suppose we can only compute
\(B\) such inner products, where \(B \ll N\). For the purpose of this post,
\(\psi\) will be the influence scores, \(v\) will be rescaled bootstrap weight
vectors, \(N\) will be the number of data points, and \(B\) the number of
bootstrap samples. But the discussion can start out more generally.</p>
<p>Suppose we don’t know anything a priori about \(\psi\), so we query it randomly,
drawing IID entries for \(v\) from a distribution with mean zero and unit
variance. Let the \(b\)-th random vector be denoted \(v^{b}\). We can ask:
What can the collection of inner products \(V_B := \{v^{1} \cdot \psi, \ldots,
v^{B} \cdot \psi \}\) tell us about \(\psi\)?</p>
<p>At first glance, the answer seems to be “not much other than the scale.” The
set \(V_B\) tells us about the projection of \(\psi\) onto a \(B\)-dimensional
subspace, out of \(N \gg B\) total dimensions. Furthermore, since
\(\mathbb{E}[v \cdot \psi] = \sum_{n=1}^N \mathbb{E}[v_n] \psi_n = 0\), the
vectors \(v^b\) are, on average, orthogonal to \(\psi\). So we do not expect
the projection of \(\psi\) onto the space spanned by \(V_B\) to account for an
appreciable proportion of \(|| \psi ||_2\). The set \(V_B\) <em>can</em> estimate the
scale \(|| \psi ||_2\), however, since \(\mathrm{Var}(v \cdot \psi) =
\sum_{n=1}^N \mathbb{E}[v_n^2] \psi_n^2 = \sum_{n=1}^N \psi_n^2 =
||\psi||_2^2\), and \(\mathrm{Var}(v \cdot \psi)\) can be estimated using
the sample variance of \(v^b \cdot \psi\).</p>
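A quick simulation (Python/NumPy, with a hypothetical \(\psi\) of my own) of the claim above: \(B \ll N\) random queries recover the scale \(\|\psi\|_2\) well, even though they span a negligible fraction of the \(N\) dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 10_000, 500

# Hypothetical influence scores, centered so that sum(psi) = 0.
psi = rng.standard_normal(N)
psi -= psi.mean()

# B random queries v^b . psi, with IID mean-zero, unit-variance entries.
V = rng.choice([-1.0, 1.0], size=(B, N))  # Rademacher query vectors
queries = V @ psi

# The sample variance of the queries estimates ||psi||_2^2.
est = queries.var(ddof=1)
truth = np.sum(psi ** 2)
print(est / truth)  # close to 1
```

Rademacher entries are used here purely for convenience; any mean-zero, unit-variance query distribution gives the same variance identity.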
<p>Note that the bootstrap is very similar to drawing \(v_n + 1 \sim
\mathrm{Poisson}(1)\); the proper bootstrap actually has some correlation
between different entries due to the constraint \(\sum_{n=1}^N (v_n + 1) = N\), but
this correlation is of order \(1/N\) and can be neglected for simplicity in the
present argument. The argument of the previous paragraph implies that the
bootstrap effectively randomly projects \(\psi\) onto a very low-dimensional
subspace, presumably losing most of its detail in doing so. It also makes sense
that the bootstrap can tell us about \(||\psi||_2\) — recall that
\(||\psi||_2\) consistently estimates the variance of the limiting distribution
of our statistic, a quantity that we know the bootstrap is also able to
estimate.</p>
<p>Recall that the AMIP from our paper is \(-\sum_{n=1}^{\lfloor \alpha N \rfloor}
\psi_{(n)}\), where \(\psi_{(n)}\) is the \(n\)-th sorted entry of the \(\psi\)
vector. From the argument sketch above I conjecture that the bootstrap
distribution actually doesn’t convey much information about the AMIP other than
the limiting variance. In particular, in the terms of our paper, I conjecture
that the bootstrap can tell us about the “noise” of AMIP but not the “shape.”</p>
<p>Incidentally, the above perspective is also relevant for situations where we
cannot form and / or invert the full Hessian matrix \(H\), and so cannot compute
\(\psi\) directly. If we imagine sketching \(H^{-1}\), e.g. by using the
conjugate gradient method applied to random vectors, we would run into a
problem conceptually quite similar to the bootstrap. It’s interesting
to think about how one could improve on random sampling in such a case.</p>

<h1>Saint Augustine and chance (2021-10-27)</h1>
<p>I came across an interesting passage in the Confessions of Saint Augustine at
the end of section (5) of Vindicianus on Astronomy. Augustine is describing a
period in his early life when he was, to his later shame, interested in
fortune-telling. In this particular passage, a friend is trying to helpfully
convince his younger self that fortune-telling is nonsense. Augustine writes of
the exchange:</p>
<blockquote>
<p>“I asked him why it was that many of their forecasts turned out to be correct.
He replied that the best answer he could give was the power apparent in lots, a
power everywhere diffused in the nature of things. So when someone happens to
consult the pages of a poet whose verses and intention are concerned with a
quite different subject, in a wonderful way a verse often emerges appropriate to
the decision under discussion. He used to say that it was no wonder if, from
the human soul, by some higher instinct that does not know what goes on within
itself, some utterance emerges not by art but by ‘chance’ which is in sympathy
with the affairs or actions of the inquirer.”</p>
</blockquote>
<p>In his Confessions, Saint Augustine sees divine will in every aspect of life
and, moreover, he is writing at the end of the fourth century. So of course his
conception of chance will differ from our modern one. Still, it is striking
that, as he is trying to assert precisely that fortune tellers are correct only
by accident, his concept of “accident” does not admit anything like modern
randomness.</p>
<p>Suppose a fortune teller flips a coin to predict an outcome that itself occurs
half the time and is subsequently correct half the time. We account for this
probabilistically, assert that the randomness of the coin flip disconnects the
prediction from the outcome, and say that the co-occurrence of prediction and
outcome is the overlap of unrelated events. Augustine seems to want to say
something similar, but cannot commit himself to the disconnect — he attributes
correct predictions to “some higher instinct” beyond the control of the fortune
teller which, nevertheless, kicks in only some of the time.</p>

<h1>Three ridiculous hypothesis tests (2021-09-30)</h1>
<p>There are lots of reasons to dislike p-values. Despite their inherent flaws,
over-interpretation, and risks, it is extremely tempting to argue that, absent
other information, the smaller the p-value, the less plausible the null
hypothesis. For example, the venerable Prof. Philip Stark (who I admire and
who was surely choosing his words very carefully), writes in <a href="https://figshare.com/articles/dataset/The_ASA_s_statement_on_p_values_context_process_and_purpose/3085162/4?file=5368499">“The Value of
p-Values”</a>:</p>
<blockquote>
<p>“Small p-values are stronger evidence that the explanation [the null hypothesis]
is wrong: the data cast doubt on the explanation.”</p>
</blockquote>
<p>For p-values based on reasonable hypothesis tests with no other information, I
think that Prof. Stark is (usually, mostly) correct to say this. But there is
nothing in the definition of a hypothesis test that requires it to be reasonable
without a consideration of <em>power</em>, and power does not enter the definition of a
p-value.</p>
<p>So, to motivate the importance of explicit power considerations in the use of
hypothesis tests and p-values, let me describe three ridiculous but valid
hypothesis tests. None of this is new, but perhaps it will be fun to have these
examples all in the same place.</p>
<h1 id="hypothesis-tests-and-p-values">Hypothesis tests and p-values</h1>
<p>I will begin by reviewing the definition
of p-values in the context of hypothesis testing. Let our random
data \(X\) take values in a measurable space \(\Omega_x\). The distribution of
\(X\) is posited to lie in some class of distributions \(\mathcal{P}_\theta\)
parameterized by a parameter \(\theta \in \Omega_\theta\). A simple null
hypothesis \(H_0\) specifies a value \(\theta_0 \in \Omega_\theta\) that fully
specifies the distribution of the data, which we will write as \(P(X |
\theta_0)\).</p>
<h4 id="hypothesis-tests">Hypothesis tests</h4>
<p>A valid test of \(H_0\) with level \(\alpha\) consists of two parts:</p>
<ul>
<li>A measurable test statistic, \(T: \Omega_x \mapsto \Omega_T\), possibly
incorporating additional randomness, and</li>
<li>A region \(A(\alpha) \subseteq \Omega_T\) such that
\(P(T(X) \in A(\alpha) | H_0) \le \alpha\).</li>
</ul>
<h4 id="p-values">P-values</h4>
<p>Often (as in our simple example), the regions are nested in the sense that
\(\alpha_1 < \alpha_2 \Rightarrow A(\alpha_1) \subset A(\alpha_2)\). Stricter
tests result in smaller rejection regions. In such a case, we can define
the p-value of a particular observation \(x\) as the smallest \(\alpha\)
which rejects \(H_0\) for that \(x\).</p>
<h4 id="a-simple-example">A simple example</h4>
<p>A simple example, which suffices for the whole present post, is data \(X = (X_1,
\ldots, X_N)\), drawn IID from a \(\mathcal{N}(\theta, 1)\) distribution. The classical
two-sided test, which is eminently reasonable for many situations, uses</p>
<ul>
<li>\(T(X) = \sqrt{N} \vert \bar{X} - \theta_0 \vert\) and</li>
<li>\(A(\alpha) = \left\{x: T(x) \ge \Phi^{-1}(1 - \alpha / 2) \right\}\),</li>
</ul>
<p>where \(\bar{X}\) is the sample average and \(\Phi\) the cumulative distribution
function of the standard normal. Today I will not quibble with this test,
but rather propose absurd alternatives.</p>
<p>In the case of our simple example, the p-value is simply \(p(x) = 2(1 -
\Phi(T(x)))\). As \(T(x)\) increases, the p-value \(p(x)\) decreases.</p>
<h4 id="the-reasoning">The reasoning</h4>
<p>If \(\alpha\) is small (say, the much-loathed \(0.05\)), and we observe \(x\)
such that \(T(x) \in A(\alpha)\), the argument goes that \(x\) constitutes
evidence against \(H_0\), since such an outcome would be improbable if \(H_0\) were
true. Correspondingly, smaller p-values are taken to be associated with
stronger evidence against the null.</p>
<p>However, one can easily construct valid tests that satisfy the above definition
that range from obviously incomplete to absurd. In the counterexamples below I
will be happy to assume that \(H_0\) is reasonable, even correct. (So we are
not in the case of Prof. Stark’s “straw man”, ibid.) Missing in each of the
increasingly egregious counterexamples that I will describe is a consideration of
<em>power</em>, which is an explicit consideration of the ability of the test to reject
when \(\theta \ne \theta_0\).</p>
<h1 id="three-ridiculous-hypothesis-tests">Three ridiculous hypothesis tests</h1>
<h4 id="example-1-throw-away-most-of-the-data">Example 1: Throw away most of the data</h4>
<p>Suppose we use the simple example above, but throw away all but the
first datapoint. So our hypothesis test is</p>
<ul>
<li>\(T(x) = \vert x_1 - \theta_0 \vert\), \(A(\alpha)\) as above.</li>
</ul>
<p>This test is valid, as are its p-values. In this case, it is true that smaller
p-values cast further doubt on the hypothesis (and Prof. Stark’s quote is
true). But the increment is small, since a single datapoint
is much less informative than the full dataset.</p>
<p>Missing from this silly test is the fact that, by using all the data, one can
construct a strictly larger rejection region — and so a test with more power —
with the same level.</p>
<h4 id="example-2-throw-away-all-the-data">Example 2: Throw away all the data</h4>
<p>Since we can use randomness in our test statistic, let us define</p>
<ul>
<li>\(T(x) \sim \textrm{Uniform}[0,1]\).</li>
<li>\(A(\alpha) = \left\{x: T(x) \le \alpha \right\}\).</li>
</ul>
<p>This test has the correct level and valid p-values, but has nothing at
all to do with the data or \(H_0\). It also generates valid confidence
intervals, which are either the whole space \(\Omega_\theta\) or
\(\emptyset\), with probabilities \(1 - \alpha\) and \(\alpha\) respectively.</p>
<p>The book “Testing Statistical Hypotheses” defines p-values for randomized
tests as the smallest \(\alpha\) which rejects with probability one.
Using this definition, p-values for this case would always be \(1\).
So, by this technicality, p-values slip through the cracks of this counterexample.
However, I would argue that one could just as well augment the data space
with \([0,1]\) and consider the uniform draw to be “data” rather than part of
the hypothesis, in which case the p-value is simply \(\textrm{Uniform}[0,1]\),
independent of the data.</p>
<p>The problem with this test is that it has no greater power to
reject under the alternative than under the null. Again, it is a consideration
of power, rather than the definition of a valid test, that reveals the
nature of the flaw.</p>
<h4 id="example-3-construct-crazy-regions">Example 3: Construct crazy regions</h4>
<p>Let us use \(T(x)\) as in the simple example, but use the region \(A(\alpha) =
\left\{x: T(x) \le \Phi^{-1}((\alpha + 1) / 2) \right\}\). These regions have
the correct levels, but they reject when \(\bar{x}\) is <em>close</em> to \(\theta_0\)
rather than when it is far away. These tests will have high power against
alternatives which are very close to \(\theta_0\), but no power against large
deviations. Values of \(T(x)\) which are very large will have large p-values,
whereas the smallest p-values occur when \(T(x) \approx 0\).</p>
<p>There are at least two ways to think about what is wrong with this test. One is
that it produces rejection regions with smaller-than-optimal Lebesgue measure —
the total length of the rejection region is much smaller than the classical
test’s. Another is that it has highest power against alternatives that we
(typically) care the least about, which are values of \(\theta\) that produce
nearly the same distribution as the null. As above, power considerations
are the key.</p>
<h1 id="lets-think-about-power">Let’s think about power</h1>
<p>Even the best available argument for p-values, hypothesis tests, and confidence
intervals depends on having chosen tests in the first place that take power into
account. The best statisticians (e.g. Prof. Stark and Ronald Fisher) are
very good at avoiding under-powered tests, and sidestep such mistakes easily in
most situations. However, for the general public, it seems to me that there is
a lot of value in making power a fundamental part of teaching and talking about
hypothesis tests, p-values, and confidence intervals.</p>There are lots of reasons to dislike p-values. Despite their inherent flaws, over-interpretation, and risks, it is extremely tempting to argue that, absent other information, the smaller the p-value, the less plausible the null hypothesis. For example, the venerable Prof. Philip Stark (who I admire and who was surely choosing his words very carefully), writes in “The Value of p-Values”:Probability and the statistical analogy: Gambling devices, long-run probability, and symmetry.2021-09-24T10:00:00+00:002021-09-24T10:00:00+00:00/philosophy/2021/09/24/what_is_probability<h1 id="long-run-frequency">Long-run frequency</h1>
<p>In a lot of classical work, probability is defined in terms of long-run
frequency. A coin flip, according to this way of thinking, has a probability
one half of coming up heads precisely because, if we were to somehow “flip it an
infinite number of times,” the proportion of heads in the infinite sequence of
trials would be one half. This definition leads to all sorts of conundrums,
from obvious to subtle, such as</p>
<ul>
<li>
<p>How can a physical property be defined in terms of a physically impossible experiment (flipping an infinite number of times)?</p>
</li>
<li>
<p>What about situations where such an infinite sequence has no physical meaning?</p>
</li>
<li>
<p>What about probabilities that change over time?</p>
</li>
<li>
<p>How can we rationally base one-off decisions on an (infinitely long) string of hypothetical future events?</p>
</li>
</ul>
<p>I don’t intend to re-hash these old questions here. (As usual, I like Ian
Hacking’s discussion of the problems; see his book, The Logic of Statistical
Inference.) But I do believe that the long-run frequency definition of
probability obscures the relationship between aleatoric and epistemic
probability I discussed in the last post, and for that reason I would like to
propose an alternative perspective. I will argue two things: (1) That we know
the “probabilities” of special gambling devices because we specially design them
to produce symmetries, and that (2) When we speak of the probabilities of events
that are not gambling devices, we do so by analogy with a (possibly very
complicated) gambling device or class of gambling devices.</p>
<h1 id="symmetry-and-the-discard-of-information-in-gambling-devices">Symmetry and the discard of information in gambling devices</h1>
<p>Let’s return to that old workhorse, the coin flip. It seems to me that, upon
observing a single flip, one could confidently assert that the long-run
frequency of heads would be 0.5, without performing a single additional
experiment. Why is that? Because a coin flip is specially designed to produce
<em>symmetric</em> outcomes between heads and tails. The flip, done correctly,
destroys the asymmetry produced by which face was up at the beginning of the
flip. Here, “done correctly” is key — the probability of a “coin
flip” is the probability of a process, not of a physical object. The process is
presumed to be executed in good faith, and its purpose is to
eliminate the influence of initial conditions and produce a symmetry in the
outcome. It seems to me that it is the symmetry in this process that gives rise
to a confident assertion about the outcome of the (practically impossible)
infinitely long sequence of flips.</p>
<p>All the gambling devices I can think of — dice, roulette wheels, card shuffling,
lottery urns, pseudo-random number generators — are similar: each consists of a
process designed to discard initial conditions and produce a symmetry in the
outcome. We discard initial conditions because, without doing so, the process
does not look “random,” it looks contingent and deterministic. We require
symmetry in the outcome for, without it, we would not know what the
“probabilities” are. For gambling devices, it seems to me that there can be no
harm in defining probabilities in terms of these symmetries rather than an
infinite sequence; indeed, we feel confident about the infinite sequence
precisely because of these symmetries.</p>
<p>It may seem as if gambling devices so described can only produce uniform
distributions over discrete sets. Of course this is not true, since functions
of uniform distributions may not be uniform. Consider a spinner wheel, for
which some fraction of the circumference is colored red and the rest blue. The
probability of red can be made to be any number in the interval from zero to
one. Billingsley’s Probability and Measure opens with a fun discussion of how
all random numbers could be derived as functions of a single uniform random
number on the unit interval. So when I say “gambling devices,” I happily
include all abstract probabilistic models. Of course, many such models do not
correspond to any device that could actually be built, but I argue that they
derive their motivation from the possibility of purely aleatoric probability,
which was historically manifested in only a few special machines. Ask yourself,
for example — how motivated would we be to study probability if it were
practically impossible to write a reasonably good pseudo random number
generator?</p>
<h1 id="the-probabilty-of-events-that-are-not-gambling-devices">The probability of events that are not gambling devices</h1>
<p>Let’s turn now to the “probability” of physical events that do not look like
gambling devices — say, for example, the probability of rain tomorrow. From
today to tomorrow, the initial conditions are obviously not (wholly) discarded,
and no symmetry can be readily seen, as the system is far too complicated.
Perhaps one would be tempted to fall back to long-run probability for these
reasons. However, I would argue that any statement about the probability of
rain tomorrow has, as a necessary constituent part, an <em>analogy</em> between the
real event (rain tomorrow) and some idealized gambling device. The analogy may
be implicit, but the notion of probability depends on it. In the case of rain,
the analogy would typically be with a lottery urn: all days in some set of days
that are, as much as possible, just like today are placed in an urn. Some of
these days have rain tomorrow, some not. The urn is shaken and our “actual
tomorrow” is drawn from this urn, and the “probability” of rain is, by shaking
and by symmetry, the proportion of days in the urn that have rain tomorrow.
Indeed, the “infinite sequence” definition of probability is precisely of this
form, only with an infinitely large urn! Obviously, the key decision to be made
is which days go into the urn. In what way must days be “like today”? What
time period is eligible? Are theoretical days (as simulated, say, by a
computer) admitted or only actual days? And so on. The many reasonable ways of
making the choice about the urn’s contents may appear to lead to difficulty in
the long-run frequency definition, but are simply part and parcel of forming an
analogy with a gambling device.</p>
<p>Let us then say this: most physical systems do not have well-defined probability
distributions over their outcomes. The exceptions are gambling devices, which
are specially constructed to discard information and produce symmetries. We
speak of probabilities for systems that are not gambling devices by forming an
analogy with an (idealized) gambling device, possibly a very complicated one in
the form of a probabilistic model or set of such models. Often many analogies
are plausible; correspondingly many probabilities are plausible. To speak of
probabilities, the analogy must be made, even if only implicitly. The analogy
must be made for long-run frequency to make sense, and, once it is made,
long-run frequency does not matter.</p>
<p>I argued in a <a href="/philosophy/2021/08/22/what_is_statistics.html">previous post</a> that
statistics always consists in the formation of such analogies, which I call the
“statistical analogy”. Thus there are two intertwined problems in statistics:
the formation of useful analogies with gambling devices, and, given an analogy,
the production of epistemic statements from aleatoric properties. One might
characterize the frequentist philosophical perspective as being unwilling to
stretch the analogy very far, but most interesting practical statistics
stretches it at least a little bit.</p>
<h1 id="the-statistical-analogy-is-not-falsifiable">The statistical analogy is not falsifiable</h1>
<p>At this point, it may seem like I have simply come around to the obvious point
that statisticians model the world with classes of probability models. But I
would like to try to differentiate the statistical analogy from a “statistical
model,” by which I mean some fixed candidate set of probability distributions.
A violation of a probability model may prompt you to say, “I have chosen a bad
(insufficiently expressive, misspecified, &c) set of probability distributions,”
whereas a violation of the statistical analogy would prompt you to say “that is
not what I meant by probability.” Suppose I simply rotated a coin in the air,
never letting it go, and set it down heads up. You would say: “That is not what
I mean by a coin flip.” Or suppose you told me that it will very likely not
rain tomorrow, and I then flew to another city where it is raining and say you
were mistaken; you would say “That is not what I meant by ‘rain tomorrow.’”
Violations of the statistical analogy are not falsifiable, though they may be
more or less <em>useful</em>. (Though I think that, in practice, the distinction is
not so neat — it may be that one discovers that a statistical analogy is
not useful only after investigation with a set of statistical models that
reveals, say, correlations or outliers that were not known previously.)</p>
<h1 id="the-problem-of-the-mapping-between-aleatoric-and-epsitemic-probability-remains">The problem of the mapping between aleatoric and epistemic probability remains</h1>
<p>I would like to close this post by emphasizing that I have been discussing here
only the notion of probability. There remains a more subtle issue, which is the
relationship between aleatoric and epistemic uncertainty. Are they the same?
It may seem that, with a pure gambling device such as a roulette wheel, the two
can be safely equated. But if we’re going to stretch this analogy far beyond
actual coin flips and roulette wheels into public policy and climate modeling
(for example), then we should think about the question more carefully — in a
later post.</p>Long-run frequencyApproximate Maximum Influence Perturbation and P-hacking2021-09-17T10:00:00+00:002021-09-17T10:00:00+00:00/robustness/2021/09/17/amip_p_hacking<p>Let’s talk about Hacking. Not Ian Hacking this time — p-hacking! I’d like
to elaborate on a <a href="https://michaelwiebe.com/blog/2021/01/amip">nice post</a>
by Michael Wiebe, where he investigates whether
<a href="https://arxiv.org/abs/2011.14999">my work with Rachael and Tamara</a>
on robustness to the removal of small, adversarially selected subsets
can be a defense against p-hacking. In short: yes, it can be! Michael
establishes this with some experiments, and I’d like to supplement his
observation with a little theory from our paper. Let me state before
diving in that I was unaware of this nice feature of our work before
Michael’s post, so I’m super grateful to him both for noticing it and
producing such a nice write-up.</p>
<h1 id="p-hacking-for-regression">P-hacking for regression</h1>
<p>Here’s what Michael means by p-hacking in a nutshell. (I’ll eliminate some
details of his analysis that are unnecessary for my point here.) Suppose that
you have \(N\) data points, consisting of a mean-zero normally-distributed
response, \(y_n\), for \(n=1,...,N\), and a large number of regressors,
\(x_{nk}\), \(k=1,...,K\). Suppose that all of the \(x_{nk}\) have variance
\(\sigma_x^2\) and are drawn independently of \(y_n\). Despite the fact that
the regressors have no real explanatory power, if you run all \(K\) regressions
\(y_n \sim \alpha_k + \beta_k x_{nk}\), for large enough \(K\), you expect to
find at least one \(\hat\beta_k\) that is statistically significant at any fixed
level.</p>
<p>For example, if you construct a test with valid 95% confidence intervals
and \(K=20\), you expect one false rejection amongst the \(20\) regressions. If
you run all 20, but only report the rejection, it may appear that you have found
a significant effect, but you’ve actually only p-hacked. P-hacking is at best
sloppy and at worst dishonest, and we want defenses against it.</p>
<h1 id="some-notation">Some notation</h1>
<p>It will be helpful to write out our estimates explicitly.
First, let’s write \(y_n = \varepsilon_n\) with a residual \(\varepsilon_n
\sim \mathcal{N}(0, \sigma_\varepsilon^2)\), to emphasize that no regressors
have real effects. Our estimators and their (robust) standard errors are</p>
\[%
\begin{align*}
%
\hat\beta_k = \frac{\sum_{n=1}^N x_{nk} \varepsilon_n}{\sum_{n=1}^N x_{nk}^2}
\quad\quad\textrm{and}\quad\quad
\hat\sigma_k^2 =
\frac{\frac{1}{N}\sum_{n=1}^N x_{nk}^2 \hat\varepsilon_{kn}^2}
{\left(\frac{1}{N}\sum_{n=1}^N x_{nk}^2\right)^2}
\quad\quad\textrm{where}\quad\quad
\hat\varepsilon_{kn} = y_n - \hat\beta_k x_{nk}.
%
\end{align*}
%\]
<p>A standard 95\% univariate test rejects \(H_0: \beta_k = 0\) if</p>
\[%
\begin{align*}
%
\textrm{Statistical significance:}\quad\quad
\sqrt{N} \vert \hat\beta_k \vert > 1.96 \hat\sigma_k.
%
\end{align*}
%\]
<p>Now, we actually expect that, asymptotically and marginally, \(\sqrt{N}
\hat\beta_k \sim \mathcal{N}(0, \sigma_\varepsilon^2 / \sigma_x^2)\) by the
central limit theorem. For any fixed \(k\), we expect \(\hat\sigma_k
\rightarrow \sigma_\varepsilon / \sigma_x\), and even for \(k\) chosen via
p-hacking, we might expect \(\hat\sigma_k\) not to be orders of magnitude off.
So, among the many draws \(k=1,\ldots,K\) from the distribution of \(\sqrt{N}\hat\beta_k\),
we expect a few to be “statistically significant”. So far, I’ve just
re-stated formally how p-hacking works.</p>
<h1 id="adversarial-removal">Adversarial removal</h1>
<p>Now, let us ask how much data we would need to drop to make a significant result
\(\hat\beta_k\) statistically insignificant. Suppose without loss of
generality that \(\hat\beta_k > 0\). If we can drop datapoints that decrease
the estimate \(\hat\beta_k\) by the difference \(\hat\beta_k - 1.96 \hat\sigma_k /
\sqrt{N}\), then the estimator will become insignificant. Let’s assume for
simplicity that the standard deviation \(\hat\sigma_k\) is less sensitive to
datapoint removal than is \(\hat\beta_k\), which is typically the case in
practice. Then, in the notation of Section 3 of our <a href="https://arxiv.org/abs/2011.14999">our
paper</a>,</p>
<ul>
<li>The “signal” is \(\hat\beta_k - 1.96 \hat\sigma_k / \sqrt{N}\),</li>
<li>The “noise” is \(\hat\sigma_k\) (since we only consider
the sensitivity of \(\hat\beta_k\), the standard deviation of whose limiting distribution
after scaling is \(\hat\sigma_k\))</li>
<li>The “shape” \(\Gamma_\alpha\) is determined in a complicated way from
\(\alpha\) and the distributions of the residuals and regressors, but which
converges to a non-zero constant and deterministically satisfies
\(\Gamma_\alpha \le \sqrt{\alpha(1-\alpha)}\),</li>
</ul>
<p>and we expect to be unable to make the estimate insignificant when</p>
\[%
\begin{align*}
%
\textrm{Robust to adversarial subset dropping:}\quad\quad
\frac{\hat\beta_k - 1.96
\hat\sigma_k / \sqrt{N}}{\hat\sigma_k} \ge \Gamma_\alpha.
%
\end{align*}
%\]
<p>Multiplying both sides by \(\sqrt{N}\) and making some cancellations
and rearrangements, we see that
\(%
\begin{align*}
%
\textrm{Robust to adversarial subset dropping:}\quad\quad
\frac{\sqrt{N} \hat\beta_k}{\hat\sigma_k}
\ge \sqrt{N} \Gamma_\alpha + 1.96.
%
\end{align*}
%\)</p>
<p>Now, as \(N \rightarrow \infty\), the left-hand side of the preceding display
converges in distribution, and the right-hand side blows up. In other
words, estimates formed from p-hacking are always non-robust, at any
\(\alpha\), for sufficiently large \(N\)!</p>
<h1 id="regression-is-not-special">Regression is not special</h1>
<p>Although it’s convenient to work with regression, absolutely nothing
about the preceding analysis is special to regression. The key is that
p-hacking relies on variability of the same order as the shrinking
confidence intervals, but adversarial subset removal produces
changes that do not vanish asymptotically. P-hacking is thus done
away with for the same reason that, as we say in the paper,
statistical insignificance is always non-robust.</p>
<h1 id="postscript">Postscript</h1>
<p>Michael makes a somewhat different (but still interesting) point in his plot
“False Positives are Not Robust.” I believe that plot can be explained by
observing that, in his setup, the effect of increasing \(\gamma\) is to increase
\(\sigma_\varepsilon\) in my notation. The flat line in his plot can then be
attributed to the fact that both the signal and the noise are proportional to
\(\sigma_\varepsilon\), and so the residual scale cancels in the signal-to-noise
ratio.</p>Let’s talk about Hacking. Not Ian Hacking this time — p-hacking! I’d like to elaborate on a nice post by Michael Wiebe, where he investigates whether my work with Rachael and Tamara on robustness to the removal of small, adversarially selected subsets can be a defense against p-hacking. In short: yes, it can be! Michael establishes this with some experiments, and I’d like to supplement his observation with a little theory from our paper. Let me state before diving in that I was unaware of this nice feature of our work before Michael’s post, so I’m super grateful to him both for noticing it and producing such a nice write-up.What is statistics? (The statistical analogy)2021-08-22T10:00:00+00:002021-08-22T10:00:00+00:00/philosophy/2021/08/22/what_is_statistics<p>By this I mean: What differentiates statistics from other modes of thinking that
are not fundamentally statistical?</p>
<h1 id="some-non-answers">Some non-answers.</h1>
<p>Here are some non-answers. Statistics
cannot be captured by the kinds of computations people do. For example, sample
means can be computed and used with no statistics in sight. The answer cannot
simply be the presence of randomness as a concept; mathematical probability, for
example, is a fundamental tool of statistics, but is not itself statistics.
Neither will a mode of analysis suffice; at the extremes, you will find
econometricians, machine learners, and applied Bayesians using extremely
disparate assumptions and conceptual tools to solve even superficially similar
problems. And though statistics almost always involves data, not all data
engineering is statistical.</p>
<h1 id="a-dichotomy-aleatoric-and-epistemic-uncertainty">A dichotomy: aleatoric and epistemic uncertainty.</h1>
<p>I answer this question for myself using a dichotomy found at the root of Ian
Hacking’s wonderful book, “The Emergence of Probability”: between <em>aleatoric</em>
and <em>epistemic</em> uncertainty. I see the aleatoric vs epistemic dichotomy used
every now and then by academic statisticians, but almost never in the way I mean
them, so let me try to define them precisely here. (I cannot be sure that
Hacking would agree with my definitions, so I will not presume to attribute
them to him, though certainly my thinking was highly influenced by his.)</p>
<ul>
<li>Epistemic uncertainty is incomplete knowledge in general.</li>
<li>Aleatoric uncertainty is incomplete knowledge of well-defined states of
a carefully constructed gambling device (an “aleatoric device”, e.g.
a roulette wheel, a coin flip, or an urn of colored balls).</li>
</ul>
<p>Obviously, aleatoric uncertainty involves epistemic uncertainty. If I ask,
“Will this fair coin come up heads on the next flip?” I do not know the answer,
so there is epistemic uncertainty. But because the coin is carefully
constructed to be symmetric, and because I will flip it skillfully to discard
information about its original orientation, I do know something more about the
coin. In particular, I know a symmetry between heads and tails, suggesting that
there is no reason to believe one outcome is more likely than the other. Again,
this symmetry is present because I have carefully constructed the situation to
assure it.</p>
<p>Aleatoric uncertainty involves epistemic uncertainty, but the reverse is not
true. For example, there is epistemic uncertainty in the question: “Does the
Christian God Exist?” But in this question there is no obvious aleatoric
uncertainty. At least, there was none until Pascal’s wager put it there, as we
will shortly see.</p>
<h1 id="the-statistical-analogy-is-an-analogy-between-epistemic-and-aleatoric-uncertainty">The “statistical analogy” is an analogy between epistemic and aleatoric uncertainty.</h1>
<p>According to Hacking, the statistical revolution began when two phenomena
occurred (in sequence):</p>
<ol>
<li>
<p>Mathematicians realized, starting roughly in the 17th century that aleatoric
uncertainty was mathematically tractable, and</p>
</li>
<li>
<p>Scientists, mathematicians, and philosophers began to use the same
computations to treat ordinary epistemic uncertainty with no obvious
aleatoric component.</p>
</li>
</ol>
<p>I argue that the second act — the attempt to quantify epistemic uncertainty
using calculations designed for aleatoric devices — is the core of statistics.
No effort is statistical without it. No statistical analysis excludes it. It
has become so commonplace an identity that it is almost entirely tacit, but
it is a mode of thinking that had to be invented, and whose potential and
risks are still being worked out.</p>
<p>Pascal’s wager, according to Hacking, was a watershed moment. A lottery with a
well-defined cost and payoff is an aleatoric device, about which we can reason
mathematically, and nearly irrefutably. Analogizing the choice of whether to
believe in a Christian God with a lottery makes the latter amenable to the same
mathematical reasoning. Of course, it does not confer the same certainty, the
weakness being in the analogy. But suddenly there is a path, albeit a
treacherous one, to dealing with general epistemic uncertainty using
mathematical tools; expressing it in degrees, combining it with computations.
Most people probably think (rightly) that Pascal’s wager was not a triumph,
though the audacity of arguments like it paved the way for the many
fruitful applications to follow.</p>
<p>I argue that the formation of the analogy between epistemic uncertainty and some
aleatoric system is the key step in any statistical analysis. For this reason,
I will refer to it in posts to follow this one as “the Statistical Analogy.”</p>
<h1 id="being-explicit-about-the-statistical-analogy-is-good-practice-and-good-pedagogy">Being explicit about the statistical analogy is good practice and good pedagogy.</h1>
<p>Sometimes
you get lucky and are analyzing an aleatoric device directly, as in a lot of
textbook problems, and certain kinds of physical experiments. Much of the time,
however, there are meaningful choices to be made. Being aware that you are
forming the analogy — often implicitly — is a good habit to avoid blunders
in applied statistics.</p>
<p>To teach this, I like to ask students to consider these three questions:</p>
<ol>
<li>Will the next flip of a coin come up heads?</li>
<li>Will it rain in Berkeley tomorrow?</li>
<li>Is there life after death?</li>
</ol>
<p>These three questions exist on a spectrum of decreasing aleatoric uncertainty.
The second question is the interesting one. We are in the habit of thinking
about it statistically, in that we assume that there is a “correct” answer in
the form of a percentage. But implicit in such an answer is an aleatoric
device, and it is good to think of what it is. Typically the answer is of the
form of an urn of days, some with rainy weather, and some without. The
questions an applied statistician needs to ask then become immediate: What balls
go into the urn? How well is the urn “mixed” from one draw to the other? Is the
urn fixed over time? And so on.</p>
<p>In coming posts I will elaborate on this fundamental idea, which helps clarify,
for me at least, many conceptual aspects of statistical practice.</p>By this I mean: What differentiates statistics from other modes of thinking that are not fundamentally statistical?Convergence in probability of order statistics.2021-08-15T10:00:00+00:002021-08-15T10:00:00+00:00/probability/2021/08/15/convergence_of_quantiles<p>Order statistics converge in probability to the corresponding population quantiles, basically no
matter what. That is a fact that I was surprised to find missing (as far as I
could see) from the texts on my bookshelf. The books I have seem to analyze
order statistics under stricter regularity conditions in order to get central
limit theorems.</p>
<p>Obviously this is not new and the proof is nothing special, but some things are
easier to prove yourself than find in a book, I guess. It’s a result that’s nice
to have around, especially if you’re thinking about <a href="https://arxiv.org/abs/2011.14999">our AMIP
paper</a>. I had to sweat a little to avoid
permitting myself to separately analyze point masses and continuity points,
which would be an unnecessary complication.</p>
<h1 id="statement">Statement:</h1>
<p>Let \(x_{(\lfloor \alpha N \rfloor)}\) denote the \(\lfloor \alpha N
\rfloor\)-th order statistic of a dataset \(x_1, \ldots, x_N\), for \(0 <
\alpha < 1\), and where \(x_{(0)}\) is undefined. Let the data \(x_n\) be
IID with distribution function \(F(x) = p(X \le x)\) and \(F_{-}(x) = p(X <
x)\). Let \(q(\alpha) := \inf \{x: F(x) \ge \alpha \}\). Then \(x_{(\lfloor
\alpha N \rfloor)} \rightarrow q(\alpha)\) in probability.</p>
<h1 id="proof">Proof:</h1>
<p>Let \(x_{(k)}\) denote the \(k\)-th order statistic. By definition,</p>
\[\begin{align*}
%
x_{(k)} \le{} x \Leftrightarrow \sum_{n=1}^N \mathbb{I}\left( {x_n \le x}\right)
\ge k
\quad\textrm{and}\quad
x_{(k)} \ge{} x \Leftrightarrow \sum_{n=1}^N \mathbb{I}\left( {x_n \ge x}\right)
\ge N - k +1.
%
\end{align*}\]
<p>For sufficiently large \(N\), \(\lfloor \alpha N \rfloor > 0\), so \(x_{(\lfloor
N\alpha \rfloor)}\) is well-defined. Applying the first equivalence with \(k =
\lfloor \alpha N \rfloor\) and any \(\epsilon > 0\) gives</p>
\[\begin{align*}
%
p\left(x_{(\lfloor N\alpha \rfloor)} \le q(\alpha) - \epsilon \right)
={}&
% p\left(\sum_{n=1}^N \mathbb{I}\left( {x_n \le q(\alpha) - \epsilon} \right)
% \ge \lfloor N \alpha \rfloor \right)
% \\={}&
p\left(\frac{1}{N}\sum_{n=1}^N \mathbb{I}\left( {x_n \le q(\alpha) - \epsilon}\right)
\ge
\frac{\lfloor N \alpha \rfloor}{N} \right)
\\\rightarrow{}&
\mathbb{I}\left( {F(q(\alpha) - \epsilon) \ge \alpha}\right) = 0,
%
\end{align*}
%\]
<p>by the strong law of large numbers and the definition of \(q(\alpha)\).
Similarly,</p>
\[\begin{align*}
%
p(x_{(\lfloor N\alpha \rfloor)} \ge q(\alpha) + \epsilon)
={}&
p\left( \frac{1}{N}\sum_{n=1}^N \mathbb{I}\left({x_n < q(\alpha) + \epsilon}\right)
\le \frac{\lfloor N \alpha \rfloor}{N} - \frac{1}{N} \right)
\\\rightarrow{}&
\mathbb{I}\left( {F_{-}(q(\alpha) + \epsilon) < \alpha}\right) = 0,
%
\end{align*}\]
<p>again by the strong law of large numbers and the fact that \(F\) is increasing
with \(F(q(\alpha)) \ge \alpha\). It follows that \(x_{(\lfloor N\alpha
\rfloor)} \in (q(\alpha)- \epsilon, q(\alpha) + \epsilon)\), with probability
approaching one, for any \(\epsilon > 0\). QED.</p>Order statistics converge in probability to the corresponding population quantiles, basically no matter what. That is a fact that I was surprised to find missing (as far as I could see) from the texts on my bookshelf. The books I have seem to analyze order statistics under stricter regularity conditions in order to get central limit theorems.