Over the last two years, Trevor Campbell, Jonathan Huggins, and I have made a sequence of bets as to whether MixFlows (Xu, Chen, and Campbell (2023), Xu and Campbell (2023), Xu and Campbell (2025)) can provide high-dimensional posterior density evaluation. In this post I track the bet, and describe how I understand its outcome — which includes, for me, an important meta-lesson in how to do good research.
\[ \def\F{F} \def\Fhat{\hat{\F}} \def\L{\mathcal{L}} \def\Lhat{\hat{\L}} \def\xvec{\vec{x}} \def\meann{\frac{1}{N} \sum_{n=1}^N} \newcommand{\expectp}[2]{\mathbb{E}_{#1}\left[#2\right]} \]
A BayesComp Bet
Two years ago at the 2023 BayesComp in Finland, Trevor Campbell presented Xu, Chen, and Campbell (2023), the first in a sequence of three super creative, provocative papers written by / with first author and current UBC PhD student, Zuheng (David) Xu. (By the way, I hear David is looking for a postdoc — if you’re someone who can hire him, you probably should!) Over the next two years, David and Trevor refined the idea in Xu and Campbell (2023) and, finally, in Xu and Campbell (2025), which Trevor presented in the session Alex Strang and I organized for BayesComp 2025 in Singapore.
Long story short (I provide a little more background below), David and Trevor proposed a way to view many MCMC algorithms as random normalizing flows. Conditional on this randomness, you can treat the MCMC algorithm as a standard normalizing flow — meaning you can, in principle, compute an ELBO. Of course, to do so, you need to be able to keep track of the Jacobians and compute the density of the flow. Correspondingly, you should, in principle, be able to run the flow backwards and so use the proposal density, together with the transformation Jacobians, to compute the posterior density.
When I heard this at BayesComp 2023, I didn’t believe MixFlows could work for density evaluation. My reasoning, in a nutshell, was that the flows MCMC is built on are designed to forget where they came from, so you could not possibly use the proposal density to estimate the target density. A subsidiary, related objection of mine was that the flow maps are designed to be chaotic, and so the induced densities would be pathological, even if the resulting distribution converged in weaker senses like total variation distance. Trevor (and Jonathan Huggins) were more optimistic. I understood their arguments to be based on good initial experimental results for pointwise density evaluation (albeit in low dimensions), the idea that normalizing flows are designed to produce density estimates (otherwise the objective function doesn’t make sense), and the possibility that some magic was happening in the averaging. (I should acknowledge that I may not be doing their arguments full justice here — after all, I disagreed with them!)
Trevor, Jonathan and I spent a lot of time at BayesComp 2023 debating this question. I think it’s fair to say that none of us had sufficient clarity at the time to definitively resolve it to anyone’s satisfaction. So we made a bet as to whether, in a year, MixFlows would be able to produce reliable pointwise posterior density estimates in a high-dimensional (>1000) non-trivial problem. A nice bottle of scotch (or comparably expensive non-alcoholic beverage) was on the line. Initially, the bet was between Jonathan and me, and despite good progress (Xu and Campbell (2023)), we agreed that in 2024 high-dimensional density evaluation had not yet been achieved. Trevor and I then renewed the bet for another year, and this second bet has just come due.
At the time of writing, David is still running some final experiments, so the second bet has not been officially called. But, as I explain below, I think that equation 9 of their most recent work, Xu and Campbell (2025), resolves the question pretty clearly in my favor.
But despite this, I think there’s a very real sense in which David and Trevor “won” the bet anyway, in a way that has me rethinking my whole approach to academic work.
MixFlows: Uniting MCMC and Normalizing Flows
Let me explain a little more about the achievement of Xu and Campbell (2025), and why I think it shows that high-dimensional posterior density evaluation has not been solved.
The core idea of MixFlows is as follows. An MCMC chain always begins with a proposal distribution \(q_0(\theta)\), and then applies a sequence of random transformations, \(\phi_t(\cdot)\) for \(t=1,\ldots,T\), each of which marginally leaves a target distribution \(\pi(\theta)\) invariant. After drawing \(\theta \sim q_0(\theta)\) and applying a large number of these transformations, MCMC theory tells you that the result \(\theta_T := \phi_T \circ \ldots \circ \phi_1(\theta)\) is approximately distributed according to the target distribution, \(\theta_T \sim \pi(\cdot)\). Now, suppose you can represent each step of an MCMC procedure as a random diffeomorphism. If you then condition on the randomness, the sequence of steps \(\phi_T \circ \ldots \circ \phi_1\) has a well-defined Jacobian, and you still have \(\theta_T \sim \pi(\cdot)\), approximately, with high probability. That is, \(\phi_T \circ \ldots \circ \phi_1\) is a valid normalizing flow — and you never had to optimize a single parameter! The devil is in the details, and Xu and Campbell (2025) is virtuosic in this regard — particularly section 4, where they manage to turn an accept-reject MH step into a random diffeomorphism (!!!).
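To make the idea of an MCMC step as a \(\pi\)-invariant diffeomorphism concrete, here is a minimal toy sketch of my own; it is not the construction in Xu and Campbell (2025), just a textbook measure-preserving map for a one-dimensional target (the CDF shift trick), composed a few times while tracking the log-Jacobian.

```python
# Toy illustration (mine, not the paper's construction): a pi-invariant
# diffeomorphism for a one-dimensional target, built from the CDF "shift" trick.
import numpy as np
from scipy import stats

target = stats.norm(loc=0.0, scale=1.0)   # pi: a standard normal toy target

def phi(theta, shift):
    """A deterministic map that leaves pi invariant: send theta to uniform space
    via the CDF, shift modulo 1, and map back through the inverse CDF."""
    return target.ppf((target.cdf(theta) + shift) % 1.0)

def log_jac_phi(theta, shift):
    """log |phi'(theta)| = log pi(theta) - log pi(phi(theta)), which follows from
    the change of variables formula for a pi-preserving map."""
    return target.logpdf(theta) - target.logpdf(phi(theta, shift))

# Compose T such maps (the fixed "random" shifts play the role of the MCMC
# randomness we condition on) and accumulate the total log-Jacobian.
rng = np.random.default_rng(0)
shifts = rng.uniform(size=20)

theta = 1.3
total_log_jac = 0.0
for u in shifts:
    total_log_jac += log_jac_phi(theta, u)
    theta = phi(theta, u)
print("final point:", theta)
print("accumulated log |Jacobian|:", total_log_jac)

# Sanity check: draws from pi pushed through the composed map still look like pi.
draws = target.rvs(size=50_000, random_state=rng)
for u in shifts:
    draws = phi(draws, u)
print("pushforward mean / std (should be ~0 / ~1):", draws.mean(), draws.std())
```

The only point of the sketch is that, once the randomness (here, the shifts) is fixed, each step is a deterministic, invertible map with a computable Jacobian; handling real MCMC kernels like MH requires much more care, which is exactly what the papers supply.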
The key to connecting this method to posterior density evaluation is in equation 9 (and the earlier, corresponding unnumbered equation for forward flows). They show that the density \(q_T(\theta)\) after \(T\) steps of the flow is given by
\[ q_T(\theta) = \pi(\theta) \, \frac{1}{T}\sum_{t=1}^T \frac{q_0(\theta_t)}{\pi(\theta_t)} \quad\textrm{where}\quad \theta_t = \phi_t \circ \ldots \circ \phi_1(\theta). \]
This formula is beautiful — the key to its elegance is realizing that the Jacobian of a transformation that leaves \(\pi(\cdot)\) invariant has a simple expression in terms of density ratios involving \(\pi\) itself.
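To spell out why (this is my reconstruction of the standard change-of-variables argument, not a quote from the paper): if a diffeomorphism \(\phi\) leaves \(\pi\) invariant, the change of variables formula forces its Jacobian to be a ratio of \(\pi\) values, and composing invariant maps makes those ratios telescope:
\[ \pi(\phi(\theta)) \left|\det J_{\phi}(\theta)\right| = \pi(\theta) \quad\Longrightarrow\quad \left|\det J_{\phi}(\theta)\right| = \frac{\pi(\theta)}{\pi(\phi(\theta))}, \]
\[ \left|\det J_{\phi_t \circ \ldots \circ \phi_1}(\theta)\right| = \prod_{s=1}^{t} \frac{\pi(\theta_{s-1})}{\pi(\theta_{s})} = \frac{\pi(\theta)}{\pi(\theta_t)}, \quad\textrm{where}\quad \theta_0 := \theta. \]
So every Jacobian the flow accumulates can be written using nothing but \(\pi\) evaluated along the path, which is what makes a density formula like the one above tractable without any flow training.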
To understand the formula’s implications, you can think of \(\theta_t\) as approximately drawn from \(\pi(\cdot)\) for sufficiently large \(t\) — after all, they are effectively the intermediate states of an MCMC chain targeting \(\pi\). Then for large \(T\), if we can apply an ergodic law of large numbers, the average is approximately
\[ \frac{1}{T}\sum_{t=1}^T \frac{q_0(\theta_t)}{\pi(\theta_t)} \approx \expectp{\pi(\theta')}{\frac{q_0(\theta')}{\pi(\theta')}} = \int \frac{q_0(\theta')}{\pi(\theta')} \pi(\theta') d\theta' = \int q_0(\theta') d\theta' = 1. \]
If this approximate equality holds, then \(q_T(\theta) \approx \pi(\theta)\) as desired.
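As a quick numerical sanity check of that intuition (again a toy of my own, reusing the CDF-shift map from the sketch above rather than the paper's estimator): on a benign one-dimensional problem where \(q_0\) overlaps \(\pi\) well, the path average of \(q_0/\pi\) settles near 1, so \(\pi(\theta)\) times the average tracks \(\pi(\theta)\).

```python
# Toy numerical check (my illustration, not the paper's estimator) of the
# ergodic-average intuition: the path average of q0/pi should be ~1, so
# pi(theta) times the average should track pi(theta).
import numpy as np
from scipy import stats

target = stats.norm(0.0, 1.0)      # pi  (toy target)
proposal = stats.norm(0.5, 0.7)    # q0  (reference that overlaps pi well)

def phi(theta, shift):
    # pi-invariant diffeomorphism via the CDF-shift trick (as in the earlier sketch)
    return target.ppf((target.cdf(theta) + shift) % 1.0)

def mixflow_style_density(theta, T, rng):
    """pi(theta) * (1/T) * sum_t q0(theta_t)/pi(theta_t), where the theta_t are
    the iterates of the (random) flow started at theta."""
    ratios = []
    x = theta
    for _ in range(T):
        x = phi(x, rng.uniform())
        ratios.append(proposal.pdf(x) / target.pdf(x))
    return target.pdf(theta) * np.mean(ratios)

rng = np.random.default_rng(1)
for theta in [-1.0, 0.0, 2.0]:
    est = mixflow_style_density(theta, T=5000, rng=rng)
    print(f"theta={theta:+.1f}   estimate={est:.4f}   pi(theta)={target.pdf(theta):.4f}")
```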
How I (probably) won the bet
The intuition about why \(q_T(\theta)\) provides a good estimate of \(\pi(\theta)\) also shows how it can fail to. In particular, the average defining \(q_T(\theta)\) is exactly importance sampling from \(\pi\) to estimate the integral \(\expectp{q_0(\theta')}{1} = 1\) — and so MixFlows inherits all the problems of importance sampling in high dimensions. If the dimension of \(\theta\) is high and \(q_0(\theta)\) is even a little bit off from \(\pi(\theta)\), then the variance of \(q_0(\theta')/ \pi(\theta')\) under \(\pi(\theta')\) will be enormous, and \(q_T(\theta)\) will be of little practical value as an estimate of \(\pi(\theta)\).
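To put a number on "enormous", take the simplest mismatch I can think of (my example, not one from the papers): \(\pi = \mathcal{N}(0, I_d)\) and \(q_0 = \mathcal{N}(0, \sigma^2 I_d)\) with \(\sigma = 1.1\), so each marginal of \(q_0\) is only 10% too wide. The relative variance of the weight \(q_0/\pi\) under \(\pi\) has a closed form and grows geometrically with dimension:

```python
# Closed-form illustration (my example): the relative variance of the importance
# weight w = q0/pi under pi, for pi = N(0, I_d) and q0 = N(0, sigma^2 I_d),
# grows geometrically with the dimension d.

def second_moment_of_weight(d, sigma):
    """E_pi[(q0/pi)^2] for isotropic Gaussians; finite only when sigma^2 < 2."""
    per_dim = (1.0 / sigma**2) * (2.0 / sigma**2 - 1.0) ** (-0.5)
    return per_dim ** d

sigma = 1.1   # each marginal of q0 is only 10% too wide
for d in [1, 10, 100, 1000]:
    m2 = second_moment_of_weight(d, sigma)
    # Since E_pi[w] = 1, the relative variance is E_pi[w^2] - 1.
    print(f"d = {d:5d}   E[w^2] = {m2:.3e}   relative variance = {m2 - 1:.3e}")
```

At \(d = 1000\), roughly the scale the bet was about, the relative variance is already in the billions, so you would need billions of effectively independent points along the flow before the average became reliable.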
In particular, the problem of an MCMC chain “forgetting” its proposal distribution is made clear. If you start at a typical point and run the chain backwards, then the marginal distribution of \(\theta_t\) remains \(\pi(\cdot)\), since the flows leave \(\pi\) invariant both forwards and backwards. If the support of the proposal \(q_0(\cdot)\) has low probability under \(\pi(\cdot)\), then most of the ratios \(q_0(\theta_t) / \pi(\theta_t)\) are zero, because the \(\theta_t\) don’t land in the support of the proposal. But once in a while a draw from \(\pi(\cdot)\) does end up in the support of \(q_0(\cdot)\), and for that draw \(q_0(\theta_t) / \pi(\theta_t)\) is enormous. Thus, for a fixed \(T\), the induced density \(q_T(\theta)\) is very “wrinkly” as a function of \(\theta\) — it has extremely tall spikes wherever at least one \(\theta_t\) in the backwards path lands in the support of \(q_0(\cdot)\), and is zero otherwise. This is not at odds with the draws having good TV or other moment-based properties — it’s symptomatic of the fact that density estimation is just harder than moment estimation.
It’s worth noting that a benefit of \(q_T(\theta)\) is that it reverses the usual direction of importance sampling — instead of using \(q_0(\cdot)\) to approximate \(\pi(\cdot)\), as you might naively do, \(q_T(\theta)\) uses \(\pi(\cdot)\) to approximate \(q_0(\cdot)\). This is desirable because it’s easier to form inner approximations (e.g. using mean field or MAP) than outer approximations. But it’s still importance sampling, and importance sampling is inherently hard in high dimensions, no matter which way you do it.
How David and Trevor (definitely) won the bet
In response to the above argument at BayesComp 2025, Trevor said something that really shook me: had he seen this argument in 2023, he might never have encouraged David to pursue this project. Now, I don’t want to overstate the influence of this silly little bet, positively or negatively, on David and Trevor’s decision to pursue MixFlows — I’m pretty sure that neither of them feels the need to look to anyone else’s opinion to decide what kind of work is good, and rightly so. But I find myself thinking a lot about Trevor’s response, especially as regards my attitude towards my own work.
Because the absolute worst outcome of this whole affair would have been for David and Trevor to have given up on this idea back in 2023. Density evaluation or no, this line of MixFlows work is amazing, provocative, creative, and exciting, and we are all much better for having it. This sequence of papers has certainly refined and expanded my own understanding of both MCMC and normalizing flows — the simplicity of the argument in the previous section is testimony to this. The ultimate application may not be density evaluation, but there are a zillion other things you can think of doing with this tool now that we have it — and even in the unlikely event that it doesn’t turn into a practical tool, the theoretical understanding it provides and the connections it draws are well worth all the effort.
Given how this worked out, I can see that I probably over-prune my own research tree. Had I been in Trevor’s shoes in 2023, I don’t think I would have pursued MixFlows, and I can now see that would have been a mistake. Correspondingly, I can safely assume that I am making comparably conservative mistakes in my own research projects right now, and that this is probably a disservice to myself (and now to my advisees). Research is about taking risks, and even when a novel idea can’t shoot the moon, giving it room to breathe, and eventually articulating clearly why it can’t, can still provide a lot of unexpected value and insight.
So I feel like I was forcibly confronted with an important meta-lesson about successful research. Time will tell, but I think that lesson will be even more valuable to me than diffeomorphic ergodic flows. Pending David’s experiments, however, I don’t think I’ll be turning down that second bottle of scotch…