By this I mean: What differentiates statistics from other modes of thinking that are not fundamentally statistical?

Some non-answers.

Statistics cannot be captured by the kinds of computations people do. For example, sample means can be computed and used with no statistics in sight. The answer cannot simply be the presence of randomness as a concept; mathematical probability, for example, is a fundamental tool of statistics, but is not itself statistics. Neither will a particular mode of analysis suffice; at the extremes, you will find econometricians, machine learners, and applied Bayesians using extremely disparate assumptions and conceptual tools to solve even superficially similar problems. And though statistics almost always involves data, not all data engineering is statistical.

A dichotomy: aleatoric and epistemic uncertainty.

I answer this question for myself using a dichotomy found at the root of Ian Hacking’s wonderful book, “The Emergence of Probability”: the distinction between aleatoric and epistemic uncertainty. I see the aleatoric-versus-epistemic dichotomy used every now and then by academic statisticians, but almost never in the way I mean it, so let me try to define the terms precisely here. (I cannot be sure that Hacking would agree with my definitions, so I will not presume to attribute them to him, though certainly my thinking was highly influenced by his.)

  • Epistemic uncertainty is incomplete knowledge in general.
  • Aleatoric uncertainty is incomplete knowledge of well-defined states of a carefully constructed gambling device (an “aleatoric device”, e.g. a roulette wheel, a coin flip, or an urn of colored balls).

Obviously, aleatoric uncertainty involves epistemic uncertainty. If I ask, “Will this fair coin come up heads on the next flip?” I do not know the answer, so there is epistemic uncertainty. But because the coin is carefully constructed to be symmetric, and because I will flip it skillfully to discard information about its original orientation, I do know something more about the coin. In particular, I know there is a symmetry between heads and tails, which suggests there is no reason to believe one outcome is more likely than the other. Again, this symmetry is present because I have carefully constructed the situation to ensure it.
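As a minimal sketch of what I mean by an aleatoric device, the snippet below simulates such a coin in Python; the fifty-fifty symmetry is an assumption built into the construction, so the long-run frequency settling near one half reflects that construction rather than a discovery about the world.

```python
import random

# A minimal sketch of a coin flip as an aleatoric device. The symmetry is
# built in by construction (a 50/50 draw), so the long-run frequency of
# heads settling near 1/2 reflects the construction, not a discovery.
random.seed(0)
flips = [random.random() < 0.5 for _ in range(100_000)]
print(sum(flips) / len(flips))  # close to 0.5
```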

Aleatoric uncertainty involves epistemic uncertainty, but the reverse is not true. For example, there is epistemic uncertainty in the question: “Does the Christian God exist?” But in this question there is no obvious aleatoric uncertainty. At least, there was none until Pascal’s wager put it there, as we will shortly see.

The “statistical analogy” is an analogy between epistemic and aleatoric uncertainty.

According to Hacking, the statistical revolution began when two phenomena occurred (in sequence):

  1. Mathematicians realized, starting roughly in the 17th century, that aleatoric uncertainty was mathematically tractable, and

  2. Scientists, mathematicians, and philosophers began to use the same computations to treat ordinary epistemic uncertainty with no obvious aleatoric component.

I argue that the second act — the attempt to quantify epistemic uncertainty using calculations designed for aleatoric devices — is the core of statistics. No effort is statistical without it. No statistical analysis excludes it. It has become so commonplace an identification that it is almost entirely tacit, but it is a mode of thinking that had to be invented, and whose potential and risks are still being worked out.

Pascal’s wager, according to Hacking, was a watershed moment. A lottery with a well-defined cost and payoff is an aleatoric device, about which we can reason mathematically, and nearly irrefutably. Analogizing the choice of whether to believe in a Christian God with a lottery makes that choice amenable to the same mathematical reasoning. Of course, it does not confer the same certainty; the weakness lies in the analogy. But suddenly there is a path, albeit a treacherous one, to dealing with general epistemic uncertainty using mathematical tools: expressing it in degrees, combining it with computations. Most people probably think (rightly) that Pascal’s wager was not a triumph, though the audacity of arguments like it paved the way for the many fruitful applications to follow.
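To show the lottery arithmetic the analogy makes available, here is a minimal sketch in Python. The probability and payoffs below are illustrative assumptions of mine, not figures from Pascal; only the form of the computation matters.

```python
# A minimal sketch of the lottery arithmetic behind Pascal's wager.
# The probability and payoffs are illustrative assumptions, not Pascal's
# own figures; only the form of the expected-value computation matters.

def expected_value(p, payoff_if_true, payoff_if_false):
    """Weight each payoff by the probability of the state that produces it."""
    return p * payoff_if_true + (1 - p) * payoff_if_false

p_exists = 0.001  # any nonzero value will do for the argument

# Believing: an enormous (stand-in for infinite) reward if correct, a modest
# cost if not. Not believing: nothing either way.
ev_believe = expected_value(p_exists, payoff_if_true=1e9, payoff_if_false=-1.0)
ev_not_believe = expected_value(p_exists, payoff_if_true=0.0, payoff_if_false=0.0)

print(ev_believe, ev_not_believe)  # belief dominates for any p_exists > 0
```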

I argue that the formation of the analogy between epistemic uncertainty and some aleatoric system is the key step in any statistical analysis. For this reason, I will refer to it in posts to follow this one as “the Statistical Analogy.”

Being explicit about the statistical analogy is good practice and good pedagogy.

Sometimes you get lucky and are analyzing an aleatoric device directly, as in many textbook problems and certain kinds of physical experiments. Much of the time, however, there are meaningful choices to be made. Being aware that you are forming the analogy — often implicitly — is a good habit that helps you avoid blunders in applied statistics.

To teach this, I like to ask students to consider these three questions:

  1. Will the next flip of a coin come up heads?
  2. Will it rain in Berkeley tomorrow?
  3. Is there life after death?

These three questions lie on a spectrum of decreasing aleatoric character, from a purely aleatoric device to a question with no aleatoric component at all. The second question is the interesting one. We are in the habit of thinking about it statistically, in that we assume there is a “correct” answer in the form of a percentage. But implicit in such an answer is an aleatoric device, and it is good to ask what that device is. Typically it takes the form of an urn of days, some rainy and some not. The questions an applied statistician needs to ask then become immediate: Which balls go into the urn? How well is the urn “mixed” from one draw to the next? Is the urn fixed over time? And so on.
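As a minimal sketch of this implicit urn, the snippet below (using simulated stand-in data, not real Berkeley weather) treats past days as balls and tomorrow as one more draw. The arithmetic is trivial; all the statistical content lives in the decision about what counts as a draw from the same urn.

```python
import numpy as np

# A minimal sketch of the urn implicit in "Will it rain in Berkeley tomorrow?"
# The data are simulated stand-ins for a record of past days, not real weather.
rng = np.random.default_rng(0)
past_days = rng.binomial(1, 0.2, size=365)  # 1 = rained, 0 = did not

# The statistical analogy: treat tomorrow as one more draw from the same urn.
# The judgment calls live here, not in the arithmetic: which days belong in
# the urn, how well the urn is mixed, and whether it is fixed over time.
p_rain = past_days.mean()
print(f"Estimated P(rain tomorrow) = {p_rain:.2f}")
```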

In coming posts I will elaborate on this fundamental idea, which helps clarify, for me at least, many conceptual aspects of statistical practice.