Alright, imagine you have a big box of LEGO bricks, and you're trying to build a really cool spaceship. There are two main ways people usually build things like this:
Step-by-step (Autoregressive Models) – Imagine you put one LEGO brick down at a time, making sure each piece fits perfectly before adding the next. It works, but it takes a long time.
Fix and refine (Diffusion Models) – Imagine you start by dumping all the LEGO bricks in a messy pile. Then, you slowly move pieces around, fixing mistakes until you get a spaceship. This is faster than the first method, but it still takes a lot of tiny adjustments.
What's the Problem? People have been using these two ways for a long time, and they’ve gotten really good at them. But no matter how big or smart your LEGO-building robot gets, these methods don’t get that much better. They’re kind of stuck.
The New Way: Inductive Moment Matching (IMM) IMM is like a magical LEGO helper that doesn’t just follow the usual slow steps. Instead, it looks at what the final spaceship should look like ahead of time and figures out how to jump closer to the final result in fewer steps.
Instead of moving one LEGO brick at a time or slowly fixing a messy pile, it’s like the helper knows where each piece should go ahead of time and moves big sections all at once. That makes it way faster and still super accurate!
Why is This Cool?
Faster – It builds things much more quickly than the old methods.
More efficient – It doesn't waste as much time adjusting tiny details.
Works with all kinds of problems – This method can be used for pictures, videos, and maybe even other things like 3D models.

Real-World Example: Imagine drawing a picture of a dog.
Old way: You draw one tiny detail at a time, or you start with a blurry dog and keep fixing it.
New way (IMM): You already kind of know what the dog should look like, so you make big strokes to get there quickly!
So basically, IMM is a super smart way to skip unnecessary steps and get amazing results much faster.
Does this mean that diffusion models for text could scale inference compute to improve quality for a fixed-length output?
If you look at how a single step of the DDIM sampler interacts with the target timestep, it is actually just a linear function. That is obviously quite inflexible if we want a flexible function that can map to any target timestep. So we add the target timestep as an extra input to the neural network and then train it with a moment matching objective.
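To make the linearity concrete, here is a minimal sketch of the standard deterministic DDIM update in the common (alpha, sigma) noise-schedule parameterization; the variable names are mine, not the paper's:

```python
def ddim_step(x_t, eps_pred, alpha_t, sigma_t, alpha_s, sigma_s):
    """One deterministic DDIM step from time t to target time s.

    The network output eps_pred depends only on (x_t, t): for a fixed
    x_t, the result is an affine function of the target coefficients
    (alpha_s, sigma_s). That is the inflexibility described above.
    """
    x0_pred = (x_t - sigma_t * eps_pred) / alpha_t  # implied clean sample
    return alpha_s * x0_pred + sigma_s * eps_pred   # linear in (alpha_s, sigma_s)
```

Since the step is affine in the target-timestep coefficients no matter how expressive the network is, the fix is to let the network itself see the target timestep.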
In general, I feel that analyzing a method's inference-time properties before training can be helpful not only for diffusion models but also for LLMs, including various recent diffusion LLMs, which prompted me to write a position paper in the hopes that others develop cool new ideas (https://arxiv.org/abs/2503.07154).
Please don’t let people ever discourage you from writing proper papers. Ever since Meta etc. started asking for "2 papers in relevant fields," we’ve seen a flood of papers that should have been tweets.
In particular, we examine the one-step iterative process of DDIM [39, 19, 21] and show that it has limited capacity with respect to the target timestep under the current denoising network design. This can be addressed by adding the target timestep to the inputs of the denoising network [15].
Interestingly, this one fix, plus a proper moment matching objective [5] leads to a stable, single-stage algorithm that surpasses diffusion models in sample quality while being over an order of magnitude more efficient at inference [50]. Notably, these ideas do not rely on denoising score matching [46] or the score-based stochastic differential equations [41] on which the foundations of diffusion models are built.
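For intuition about what a moment matching objective looks like, below is a generic squared-MMD estimator with an RBF kernel. This is a simplified stand-in, not the paper's exact inductive objective; the function names and bandwidth are illustrative assumptions:

```python
import torch

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel between rows of a (n, d) and b (m, d).
    sq_dists = torch.cdist(a, b).pow(2)
    return torch.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_squared(samples_p, samples_q, bandwidth=1.0):
    # Biased (V-statistic) estimate of squared maximum mean discrepancy.
    # Driving this toward zero matches the kernel-induced moments of
    # the two sample sets, e.g. model outputs vs. diffusion targets.
    k_pp = rbf_kernel(samples_p, samples_p, bandwidth).mean()
    k_qq = rbf_kernel(samples_q, samples_q, bandwidth).mean()
    k_pq = rbf_kernel(samples_p, samples_q, bandwidth).mean()
    return k_pp + k_qq - 2.0 * k_pq
```

The appeal of a distribution-level objective like this is that it never needs per-sample denoising targets, which is why it can sidestep denoising score matching entirely.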
Here they train a model to say "I'm gonna ask you to take a step from time B to A - might be a small step, might be a big step - but whatever size it is, make the image that much better." You might ask the model to improve the image from t=1.0 to t=0.25 and be almost done. It gets a side variable telling it how much improvement to make in each step.
I'm not sure this is right, but that's what I got out of it by skimming the blog & paper.
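If that reading is right, few-step sampling would look roughly like the loop below; `model(x, t, s)` is a hypothetical signature for a denoiser that takes the target timestep s as an extra input:

```python
def sample(model, x_T, times=(1.0, 0.25, 0.0)):
    # Walk the noisy sample down a short, user-chosen timestep schedule.
    # Each call tells the network both where it is (t) and where to
    # land (s), so a single call can cover a big jump or a small one.
    x = x_T
    for t, s in zip(times[:-1], times[1:]):
        x = model(x, t, s)  # e.g. jump straight from t=1.0 to s=0.25
    return x
```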
Fast, real-time video generation? (One second of compute per one second of output.)
Does this mean more efficient and more generalizable training and fine tuning?