Alright, imagine you have a big box of LEGO bricks, and you're trying to build a really cool spaceship. There are two main ways people usually build things like this:
Step-by-step (Autoregressive Models) – Imagine you put one LEGO brick down at a time, making sure each piece fits perfectly before adding the next. It works, but it takes a long time.
Fix and refine (Diffusion Models) – Imagine you start by dumping all the LEGO bricks in a messy pile. Then, you slowly move pieces around, fixing mistakes until you get a spaceship. This is faster than the first method, but it still takes a lot of tiny adjustments.
What's the Problem? People have been using these two ways for a long time, and they’ve gotten really good at them. But no matter how big or smart your LEGO-building robot gets, these methods don’t get that much better. They’re kind of stuck.
The New Way: Inductive Moment Matching (IMM) IMM is like a magical LEGO helper that doesn’t just follow the usual slow steps. Instead, it looks at what the final spaceship should look like ahead of time and figures out how to jump closer to the final result in fewer steps.
Instead of moving one LEGO brick at a time or slowly fixing a messy pile, it’s like the helper knows where each piece should go ahead of time and moves big sections all at once. That makes it way faster and still super accurate!
Why is This Cool?
Faster – It builds things much more quickly than the old methods.
More efficient – It doesn't waste as much time adjusting tiny details.
Works with all kinds of problems – This method can be used for pictures, videos, and maybe even other things like 3D models.

Real-World Example: Imagine drawing a picture of a dog.
Old way: You draw one tiny detail at a time, or you start with a blurry dog and keep fixing it.
New way (IMM): You already kind of know what the dog should look like, so you make big strokes to get there quickly!
So basically, IMM is a super smart way to skip unnecessary steps and get amazing results much faster.
Does this mean that diffusion models for text could scale inference compute to improve quality for a fixed-length output?
If you look at how a single step of the DDIM sampler interacts with the target timestep, it is actually just a linear function. That is obviously quite inflexible if we want a flexible function that can map to any target timestep. So we add the target timestep as an extra input to the neural network and then train it with a moment matching objective.
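To make the linearity concrete, here is a minimal sketch of the standard deterministic DDIM update in the common (alpha, sigma) noise-schedule parameterization; the variable names are mine, not the paper's:

```python
def ddim_step(x_t, eps_pred, alpha_t, sigma_t, alpha_s, sigma_s):
    """One deterministic DDIM step from time t to target time s.

    The network output eps_pred depends only on (x_t, t): for a fixed
    x_t, the result is an affine function of the target coefficients
    (alpha_s, sigma_s). That is the inflexibility described above.
    """
    x0_pred = (x_t - sigma_t * eps_pred) / alpha_t  # implied clean sample
    return alpha_s * x0_pred + sigma_s * eps_pred   # linear in (alpha_s, sigma_s)
```

Since the step is affine in the target-timestep coefficients no matter how expressive the network is, the fix is to let the network itself see the target timestep.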
In general, I feel that analyzing a method's inference-time properties before training can be helpful not only for diffusion models but also for LLMs, including various recent diffusion LLMs, which prompted me to write a position paper in the hopes that others develop cool new ideas (https://arxiv.org/abs/2503.07154).
Please don’t let people ever discourage you from writing proper papers. Ever since Meta etc. started asking for "2 papers in relevant fields," we’ve seen a flood of papers that should have been tweets.
In particular, we examine the one-step iterative process of DDIM [39, 19, 21] and show that it has limited capacity with respect to the target timestep under the current denoising network design. This can be addressed by adding the target timestep to the inputs of the denoising network [15].
Interestingly, this one fix, plus a proper moment matching objective [5] leads to a stable, single-stage algorithm that surpasses diffusion models in sample quality while being over an order of magnitude more efficient at inference [50]. Notably, these ideas do not rely on denoising score matching [46] or the score-based stochastic differential equations [41] on which the foundations of diffusion models are built.
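For intuition about what a moment matching objective looks like, below is a generic squared-MMD estimator with an RBF kernel. This is a simplified stand-in, not the paper's exact inductive objective; the function names and bandwidth are illustrative assumptions:

```python
import torch

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel between rows of a (n, d) and b (m, d).
    sq_dists = torch.cdist(a, b).pow(2)
    return torch.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_squared(samples_p, samples_q, bandwidth=1.0):
    # Biased (V-statistic) estimate of squared maximum mean discrepancy.
    # Driving this toward zero matches the kernel-induced moments of
    # the two sample sets, e.g. model outputs vs. diffusion targets.
    k_pp = rbf_kernel(samples_p, samples_p, bandwidth).mean()
    k_qq = rbf_kernel(samples_q, samples_q, bandwidth).mean()
    k_pq = rbf_kernel(samples_p, samples_q, bandwidth).mean()
    return k_pp + k_qq - 2.0 * k_pq
```

The appeal of a distribution-level objective like this is that it never needs per-sample denoising targets, which is why it can sidestep denoising score matching entirely.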
Here they train a model to say "I'm gonna ask you to take a step from time B to A - might be a small step, might be a big step - but whatever size it is, make the image that much better." You might ask the model to improve the image from t=1.0 to t=0.25 and be almost done. It gets a side variable telling it how much improvement to make in each step.
I'm not sure this is right, but that's what I got out of it by skimming the blog & paper.
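If that reading is right, few-step sampling would look roughly like the loop below; `model(x, t, s)` is a hypothetical signature for a denoiser that takes the target timestep s as an extra input:

```python
def sample(model, x_T, times=(1.0, 0.25, 0.0)):
    # Walk the noisy sample down a short, user-chosen timestep schedule.
    # Each call tells the network both where it is (t) and where to
    # land (s), so a single call can cover a big jump or a small one.
    x = x_T
    for t, s in zip(times[:-1], times[1:]):
        x = model(x, t, s)  # e.g. jump straight from t=1.0 to s=0.25
    return x
```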
Fast, real-time video generation? (One second of compute per one second of output.)
Does this mean more efficient and more generalizable training and fine tuning?