Beginning with a talk by Ben Nachman in the statistics department, I’ve gotten more and more excited about the field of simulation-based inference (SBI). As the name implies, this field addresses the problem of performing inference on a given dataset using only simulations from a model (as opposed to an evaluable density function or an otherwise mathematically tractable class of candidate generative processes). Since SBI must be used when the likelihood is intractable, it has a lot of overlap with “likelihood-free inference” (LFI), though I think the latter term is too narrow, since you can certainly use SBI even when you have a likelihood.
Simulation-based inference reading list
\[ \def\expect#1#2{\mathbb{E}_{#1}\left[#2\right]} \def\x{x} \]
In order to get familiar with the basics of the field, I organized a reading group this summer, for which the reading list is below. This was of course selected from a larger list, but I won’t reproduce that larger list here, since I later became aware of much longer lists out there, including this one and this one.
- Week 1 (ABC): Marin et al. (2012)
- Week 2 (Regression): Izbicki and Lee (2017)
- Week 3 (CDE): Cranmer, Pavez, and Louppe (2015)
- Week 4 (NPE basics): Lueckmann et al. (2017)
- Week 5 (next level NPE): Radev et al. (2023)
- Week 6 (emulators): Price et al. (2018)
- Week 7 (calibration): Zhao et al. (2021)
- Week 8 (SBI on SBI): Müller et al. (2021)
Some reading changes I would make if I were to do it again
From these readings, one can see that an SBI problem always takes the same form: you are given draws from a joint distribution \(\theta, x \sim p(\theta, x)\), and the task is to find a \(q(\theta \vert x)\) that you hope is close to the conditional density \(p(\theta \vert x_{obs})\) for one particular \(x_{obs}\), which was never observed in the training set.
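To make that concrete, here is a minimal sketch under a toy Gaussian simulator; the simulator, the linear-Gaussian form of \(q\), and the value of \(x_{obs}\) are all invented for illustration and not taken from any of the readings.

```python
# A minimal sketch of the SBI setup, assuming a toy Gaussian simulator.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_sims):
    """Draw (theta, x) pairs from the joint p(theta, x)."""
    theta = rng.normal(0.0, 1.0, size=n_sims)  # prior p(theta)
    x = rng.normal(theta, 0.5)                 # simulator p(x | theta)
    return theta, x

theta, x = simulate(100_000)

# Fit a crude amortized q(theta | x): a conditional Gaussian whose mean is
# linear in x and whose spread is the residual standard deviation.
slope, intercept = np.polyfit(x, theta, deg=1)
resid_sd = np.std(theta - (slope * x + intercept))

# "Observed" data that never appeared in the training set.
x_obs = 1.3
q_mean, q_sd = slope * x_obs + intercept, resid_sd
print(f"q(theta | x_obs) ~ Normal({q_mean:.3f}, {q_sd:.3f}^2)")

# In this conjugate toy model the exact posterior is available as a check:
# p(theta | x_obs) = Normal(x_obs / (1 + 0.5**2), 0.5**2 / (1 + 0.5**2)).
```

Real SBI methods differ precisely in how flexibly they represent \(q\), how they amortize over \(x\), and which loss they minimize, which is the breakdown I try to make below.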
It’s interesting to notice (e.g. reading Izbicki and Lee (2017)) that the SBI problem is formally very similar to the problems of “probabilistic forecasting” and “conditional density estimation,” with the differences that (a) in SBI you get pairs \(\theta, x\) from simulation rather than from nature and (b) you plan to use your estimate at a single \(x_{obs}\) rather than at multiple future \(x_{new}\). Similarly, it’s interesting to notice (e.g. reading Zhao et al. (2021)) that the problem of assessing and/or improving an SBI estimator is the same as that of evaluating or improving calibration in probabilistic forecasting. It would be exciting to do this SBI reading group again in collaboration with someone who is an expert in probabilistic forecasting and calibration.
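To illustrate that last parallel, here is a hedged sketch of one common style of calibration check (a PIT / rank-uniformity check in the spirit of the Zhao et al. (2021) reading), applied to the toy linear-Gaussian \(q\) from the sketch above; again, the simulator and model are invented stand-ins.

```python
# If q(theta | x) matched p(theta | x) exactly, the PIT values
# F_q(theta_true | x) computed on fresh simulations would be Uniform(0, 1).
# Toy simulator and linear-Gaussian q as in the previous sketch.
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, size=50_000)
x = rng.normal(theta, 0.5)
slope, intercept = np.polyfit(x, theta, deg=1)
resid_sd = np.std(theta - (slope * x + intercept))

# Fresh (theta, x) pairs play the role of "truth"; compute PIT under q.
theta_new = rng.normal(0.0, 1.0, size=10_000)
x_new = rng.normal(theta_new, 0.5)
pit = norm.cdf(theta_new, loc=slope * x_new + intercept, scale=resid_sd)
print(kstest(pit, "uniform"))  # large p-value: no miscalibration detected
```

This is exactly a probability-integral-transform check from the forecasting literature, with simulated \(\theta\) playing the role of the realized outcome.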
In retrospect, I wish we had replaced the emulator reading and one of the NPE readings with ones on quantile regression and triangular transport (Ramgraber et al. (2025)), since I think emulators are a formally distinct (and ultimately more classical) approach, arguably belonging more properly to LFI than to SBI.
The components of an SBI problem
It seems to me that there are three key (related) aspects of any SBI approach:
- How do you represent the candidate conditional densities \(q(\theta \vert x)\)?
- How do you represent the mapping \(x \mapsto q(\theta \vert x)\)?
- How do you choose a loss function measuring the error of a candidate \(q(\theta \vert x)\) for the target \(p(\theta \vert x)\)?
For example,
- In Izbicki and Lee (2017):
- Densities are arbitrary functions (maybe negative, maybe unnormalized)
- The mapping is given in the form of a generalized additive model, and
- The loss is a particular expected L2 loss over both \(x\) and \(\theta\);
- In the Bayesian version of CDEs from Cranmer, Pavez, and Louppe (2015) (e.g. Rozet and Louppe (2021)):
- Densities are positive functions (maybe unnormalized)
- The mapping is an exponentiated regression function, and
- The loss is binary cross-entropy between the joint and the product of marginals
- In the NPE literature (e.g. Radev et al. (2023)):
- Densities are actual densities (positive, normalized)
- The mapping is a neural network, and
- The loss is the expected (over \(x\)) KL divergence between the conditionals, spelled out below
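To spell out the NPE loss in that last bullet (and why simulated pairs are all you need): the entropy of the true conditional does not depend on \(q\), so

\[ \expect{p(x)}{\mathrm{KL}\!\left(p(\theta \vert x) \,\Vert\, q(\theta \vert x)\right)} = \expect{p(\theta, x)}{\log p(\theta \vert x) - \log q(\theta \vert x)} = \text{const} - \expect{p(\theta, x)}{\log q(\theta \vert x)}, \]

and minimizing the expected KL is therefore equivalent to maximizing the average log density that \(q\) assigns to simulated \((\theta, x)\) pairs, with no evaluation of \(p\) required.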
Note that Müller et al. (2021) is basically NPE with a (hacky-looking but apparently successful) “Riemann” distribution as the answer to the first question above (chosen to make the problem look more like an LLM), plus a surprisingly effective prior over possible data-generating processes.
I should say that my summary of the CDE approach is not the way it’s traditionally expressed, but it’s true: if you parameterize the classifier as a function of the unnormalized density rather than of the log odds ratio, you will see that what you get is a KL divergence term that matches NPE, together with a term that “penalizes” non-normalized densities at the optimum.
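For concreteness, here is a minimal sketch of the standard log-odds version of that classifier trick (the parameterization my rephrasing contrasts against): a classifier that separates joint \((\theta, x)\) pairs from shuffled pairs estimates \(\log p(\theta \vert x) - \log p(\theta)\), so the prior times the exponentiated logit is an unnormalized posterior. The logistic regression, the hand-picked features, and the toy simulator are all stand-ins for illustration.

```python
# Classifier / density-ratio trick: separate joint (theta, x) pairs from
# pairs with theta shuffled; the logit estimates log p(theta|x) - log p(theta).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, size=50_000)
x = rng.normal(theta, 0.5)

# Class 1: dependent (joint) pairs; class 0: theta shuffled to break dependence.
theta_shuf = rng.permutation(theta)

def feats(t, xx):
    """Hand-picked features that can represent the true quadratic log-odds here."""
    return np.column_stack([t, xx, t * xx, t**2, xx**2])

X_train = np.vstack([feats(theta, x), feats(theta_shuf, x)])
y_train = np.concatenate([np.ones_like(theta), np.zeros_like(theta)])
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Unnormalized q(theta | x_obs) on a grid: prior density times exp(logit).
x_obs = 1.3
grid = np.linspace(-3.0, 3.0, 201)
logit = clf.decision_function(feats(grid, np.full_like(grid, x_obs)))
prior = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
q_unnorm = prior * np.exp(logit)
print("approximate posterior mode:", grid[np.argmax(q_unnorm)])  # exact mode: 0.8 * x_obs
```

In this toy Gaussian model the true log-odds happen to be quadratic in \((\theta, x)\), so these features suffice; in general you would use a neural network classifier, as in the papers.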
From this point of view, it appears that Izbicki and Lee (2017) is really sui generis: everyone else is essentially doing KL divergence. But that also opens up the possibility of doing things other than KL divergence, since all you really need is some proper scoring rule \(\ell\) and to minimize \(\expect{p(x)}{\expect{p(\theta \vert x)}{\ell(q(\theta \vert x), \theta)}}\). It’s interesting to ask what kind of mileage you can get out of being more flexible with \(\ell\).
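As one hedged example of what a different \(\ell\) buys you (and a nod to the quantile-regression reading I wished for above), here is a sketch that fits conditional quantiles of \(\theta\) given \(x\) by minimizing the pinball loss, which elicits quantiles, on simulated pairs; the toy simulator and the gradient-boosting model are arbitrary stand-ins.

```python
# Instead of the log score, minimize the pinball (quantile) loss on simulated
# (theta, x) pairs to get conditional quantiles of theta given x.
# Toy simulator as before; the model choice is arbitrary.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, size=20_000)  # prior draws
x = rng.normal(theta, 0.5)                 # simulator draws
X = x.reshape(-1, 1)

# One model per quantile level; together they summarize q(theta | x).
quantile_models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, theta)
    for q in (0.05, 0.5, 0.95)
}

x_obs = np.array([[1.3]])
for q, model in quantile_models.items():
    print(f"{q:.2f} quantile of q(theta | x_obs): {model.predict(x_obs)[0]:.3f}")
```

A handful of such quantile curves is already a usable summary of \(q(\theta \vert x_{obs})\), and in principle the same recipe applies to any scoring rule you can minimize over simulated pairs.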