149 points by timbilt 30 days ago | 9 comments
janalsncm 29 days ago
One of the benefits of using thinking tokens compared to "thinking in latent space" is that you can directly observe the quality of the CoT. In R1 they saw it was mixing languages and fixed it with cold-start data.

It would be hard to SFT this because you can only SFT the final result, not the latent space.
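
To make that concrete, a toy sketch (shapes and names made up, not from the paper): with visible thinking tokens, SFT is just cross-entropy against a reference chain of thought, but there is no analogous target to supervise latent states against.

    import torch
    import torch.nn.functional as F

    # Visible CoT: a reference chain of thought gives you a target.
    logits = torch.randn(1, 5, 100)                 # model outputs (batch, seq, vocab)
    reference_cot = torch.randint(0, 100, (1, 5))   # tokenized reference CoT
    sft_loss = F.cross_entropy(logits.view(-1, 100), reference_cot.view(-1))

    # Latent "thoughts": there is no ground-truth latent to compare against,
    # so no analogous supervised loss can be written down.
    latent_states = torch.randn(1, 5, 768)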

I also notice the authors only had compute for a single full training run. It’s impressive they saw such good results from that, but I wonder if they could get better results by incorporating recent efficiency improvements.

I would personally not use this architecture because 1) it adds a lot of hyperparameters which don’t have a strong theoretical grounding and 2) it’s not clearly better than simpler methods.

edouard-harris 29 days ago
> In R1 they saw it was mixing languages and fixed it with cold start data.

They did (partly) fix R1's tendency to mix languages, thereby making its CoT more interpretable. But that fix came at the cost of degrading the quality of the final answer.[0] Since we can't reliably do interpretability on latents anyway, presumably the only metric that matters in that case is answer quality - and so observing thinking tokens gets you no marginal capability benefit. (It does however give you a potential safety benefit - as Anthropic vividly illustrated in their "alignment faking" paper. [1])

The bitter lesson strikes yet again: if you ask for X to get to Y, your results are worse than if you'd just asked for Y directly in the first place.

[0] From the R1 paper: "To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model's performance, this reward aligns with human preferences, making it more readable." [emphasis added]

[1] https://arxiv.org/pdf/2412.14093
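
For concreteness, the reward described in [0] is simple enough to sketch. The R1 paper doesn't say how target-language words are identified, so the all-ASCII check below is purely illustrative (crudely standing in for "English"):

    def language_consistency_reward(cot_text: str) -> float:
        """Fraction of CoT words that look like the target language.

        Illustrative only: R1 doesn't specify how target-language words
        are detected; here 'English' is crudely approximated as all-ASCII.
        """
        words = cot_text.split()
        if not words:
            return 0.0
        target = sum(1 for w in words if w.isascii())
        return target / len(words)

    # e.g. language_consistency_reward("Let x 等于 5, then x+1 = 6") -> 0.875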

janalsncm 29 days ago
Interpretability also matters when you’re training. If the model works, yes, technically only the final result matters. But in practice it probably won’t work right away and so it’s great to have methods to figure out what is going wrong as you’re training.

For example, should we stop this training or keep going and wait for it to improve? In theory that’s irrelevant because we don’t make mistakes. In practice, theory is just theory.

As an analogy, you technically don’t need code comments. The compiler removes them. But in practice you do need them.

So that’s another reason I mentioned the hyperparameter hell. You’ve removed a simple interpretability method and left us with numbers that worked for a single training run.

pilooch 29 days ago
It could be argued that "thinking" / CoT in latent space abstracts away the language issue, and that in fact language in reasoning steps doesn't matter. Latent tokens could actually be decoded afterwards to any target language. Much more powerful IMO.

On a side note, there's decent research on how bilingual humans actually think in both languages, and on how they're often better at decisive thinking outside of their mother tongue.

janalsncm 28 days ago
I think another argument is that the CoT is simply unrolling the recurrent loop that this method uses, doing an unembedding -> embedding round trip at each step of the decoding process.

So at best, using a recurrent loop only saves you from doing that embedding -> unembedding at each token, which is relatively small compared with the height of the decoder blocks.
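
Roughly, the contrast looks like this (a hand-wavy sketch; `core_block`, `embed`, and `unembed` are stand-in names, not the paper's):

    import torch

    def cot_step(hidden, unembed, embed):
        # CoT: the hidden state must round-trip through a discrete token.
        logits = unembed(hidden)               # unembedding
        token = torch.argmax(logits, dim=-1)   # collapse to a single token
        return embed(token)                    # re-embed for the next step

    def latent_step(hidden, core_block):
        # Recurrent depth: the hidden state is fed straight back in.
        return core_block(hidden)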

nielsole 29 days ago
With a bit of fiddling you should be able to get the LLM to translate/summarize the thinking process. Not a 1:1 thing, but still.
WithinReason 29 days ago
how would you do it?
nielsole 29 days ago
my naive approach would be to try seq2seq with the hidden state as input. Not sure what to use in place of the supervised samples, though.
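
Something like this rough sketch, perhaps (all names here are hypothetical, and the missing target text is exactly the open question below):

    import torch
    import torch.nn as nn

    class LatentToText(nn.Module):
        """Toy probe: map a latent state to a short text summary."""
        def __init__(self, d_latent=768, d_model=256, vocab=32000):
            super().__init__()
            self.proj = nn.Linear(d_latent, d_model)   # latent -> decoder prefix
            self.decoder = nn.GRU(d_model, d_model, batch_first=True)
            self.lm_head = nn.Linear(d_model, vocab)

        def forward(self, latent, prev_token_embs):
            prefix = self.proj(latent).unsqueeze(1)          # (B, 1, d_model)
            x = torch.cat([prefix, prev_token_embs], dim=1)  # condition on latent
            out, _ = self.decoder(x)
            return self.lm_head(out)   # logits; but what is the target text?
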
WithinReason 29 days ago
OK but what would you use as ground truth?
WhitneyLand 29 days ago
This is one of the hoped-for benefits of the approach, described later in the paper. It's not fully fleshed out what this will mean, but the prospect is tantalizing.

"On a more philosophical note, we hope that latent reasoning captures facets of human reasoning that defy verbalization, such as spatial thinking, physical intuition or (motor) planning. Over many iterations of the recurrent process, reasoning in a high-dimensional vector space would enable the deep exploration of multiple directions simultaneously, instead of linear thinking, leading to a system capable of exhibiting novel and complex reasoning behavior."

ckrapu 29 days ago
My opinion is that opaque reasoning is a prerequisite for many of the worst possible AI outcomes.

We should make reasoning fully visible in the output space.

optimalsolver 29 days ago
Is there any actual evidence that the reasoning tokens output by current models actually represent the computation happening in the hidden layers?

In both cases, the model is doing a ton of processing that you can't actually inspect, except that here you at least get some efficiency gains.

Even more importantly, you're also less likely to convince yourself that you know what the model is thinking.

ckrapu 27 days ago
In the autoregressive decoding framework, the hidden layers' state for computation of token `t` is conditionally independent of all hidden states for `t-1`, `t-2` and so on given the observed tokens.

Put differently, the observed tokens are a bottleneck on the information that can be communicated across tokens. Any scheming performed by an LLM which requires more than one token to formulate must therefore pass through the visible tokens. With opaque vectors transferred across decoding steps, this is not the case.

The computation in the hidden layers, as far as we can tell, is not sufficient for scheming in a single decoding step. It looks like it requires O(10^2) or O(10^3) steps instead, judging from anecdotal evidence like the reports of scheming from o1 (https://cdn.openai.com/o1-system-card-20241205.pdf)

As far as your last point goes, I'd rather have a more transparent system, all other factors held constant.

anothermathbozo 29 days ago
No, and we've observed evidence to the contrary.
mola 29 days ago
Do you have some reading material on this? How did they determine the difference between the stated CoT and the "actual processing"?
miven 29 days ago
Chain of thought isn't exactly transparent either; you shouldn't fall into the trap of believing that the final sequence of tokens thinking about the task is the only processing the model actually performs during CoT.

There might be a lot of other hidden computation happening within the model's latents which may not immediately influence the predicted tokens but may still be relevant to the model's internal processing. And even disregarding that, the model is under no formal obligation to stick to the chain of thought it produced when making its final decisions.

nsikorr 29 days ago
The paper suggests that that is still possible with the proposed architecture if needed.
DennisP 29 days ago
That actually sounds like it'd be really helpful.
Imanari 29 days ago
maybe let it reason in latent space but have a method to transform and output it to text for inspection.
nialv7 29 days ago
Slightly off topic, but I rarely see papers talk about their failed training runs and why those runs failed. This paper is definitely a breath of fresh air. Their analyses of the failures, the changes they made to fix them, and the rationale behind those changes are all very insightful.
tkellogg 29 days ago
The R1 paper did it as well. Agreed, it's always very interesting.
HarHarVeryFunny 29 days ago
Latent / embedding-space reasoning seems a step in the right direction, but building recurrence into the model while still relying on gradient descent (i.e. BPTT) to train it seems to create more of a problem (training inefficiency) than it solves, especially since they still end up externally specifying the number of recurrent iterations (r=4, 8, etc) for a given inference. Ideally having recurrence internal to the model would allow the model itself to decide how long to iterate for before outputting anything.
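
To make the training-cost point concrete, a toy sketch (the `core` block and sizes are made up): with the recurrence unrolled r times, the backward pass (BPTT) has to walk back through all r applications of the block, so memory and compute grow with r.

    import torch
    import torch.nn as nn

    core = nn.Sequential(nn.Linear(512, 512), nn.GELU())  # stand-in core block

    def recur(state, r):
        # r is chosen externally; gradients must flow through all r steps.
        for _ in range(r):
            state = core(state)
        return state

    out = recur(torch.randn(2, 512), r=8)
    out.sum().backward()   # BPTT through 8 copies of the core's graph
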
Manabu-eo 28 days ago
While not the main focus, see Section 6.1 and Figure 10 for a simple adaptive exit strategy for inference.

I imagine they chose a fixed number of recurrent iterations during training for parallelization purposes. Not depending on the previous step to train the next is the main revolution of transformers vs. LSTMs (plus the higher internal bandwidth). But I agree that it might not be the most efficient model to train, due to all the redundant work at large r.
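
For illustration, an exit rule in that spirit might look like the sketch below. The paper's actual criterion in Section 6.1 (based on successive iterations converging) may differ; the relative-change metric and `tol` here are made up.

    import torch

    def recur_adaptive(state, core, tol=1e-3, max_r=32):
        # Iterate the core block until the latent state stops changing
        # much, or a hard cap is hit (illustrative threshold/metric).
        for step in range(max_r):
            new_state = core(state)
            if torch.norm(new_state - state) / torch.norm(state) < tol:
                return new_state, step + 1
            state = new_state
        return state, max_r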

thomasahle 29 days ago
> Latent / embedding-space reasoning seems a step in the right direction

Might be good for reasoning, but it's terrible for interpretability / AI safety.

Tostino 29 days ago
Why is doing 4 recurrent passes any different from having a model that is 4x deeper?
lonk11 28 days ago
Running one layer 4 times only needs to fetch that layer's weights once. Running 4 distinct layers makes you fetch 4x the parameters.

The recurrent approach is more efficient when memory bandwidth is the bottleneck. They talk about it in the paper.
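
A toy illustration of the parameter/bandwidth point (layer sizes made up): the stacked model stores and fetches roughly 4x the weights, while per-token compute is about the same.

    import torch.nn as nn

    def block():
        return nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True)

    deep   = nn.Sequential(*[block() for _ in range(4)])  # 4 distinct layers
    shared = block()                                      # 1 layer, applied 4 times

    n_params = lambda m: sum(p.numel() for p in m.parameters())
    print(n_params(deep), n_params(shared))   # ~4x more weights to move from memory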

Tostino 28 days ago
Yeah, understood. I'm excited for the reduction in parameter count that will come when this is taken up in major models.

I meant it rhetorically in reference to interpretability. I don't see a real difference between training a model that is 100b parameters vs a (fixed) 4x recurrent 25b parameter model as far as understanding what the model is `thinking` for the next token prediction task.

You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token, no matter whether the model is fixed-depth and quite deep, or recurrent.

thomasahle 29 days ago
I guess the most interpretable setup is to have as shallow a model as possible, but with a longer CoT. It would be quite interesting to see the trade-off between the two. Though, unfortunately, deeper is probably better.
janalsncm 29 days ago
> seems a step in the right direction

I can’t see why. I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens. And the downsides are obvious.

> externally specifying the number of recurrent iterations

Yeah this seems wrong to me. At least with RL training you saw that the length of the CoT decreased dramatically before climbing again, as the model became more proficient.

HarHarVeryFunny 29 days ago
> I can’t see why

It just provides a bigger representation space, and it seems more like what we do, given that many people don't have an inner dialogue and some think pictorially.

It seems it could allow reasoning over superpositions of concepts, if such things exist inside the model (but presumably not at the edges, where they need to be decodable into specific tokens).

viraptor 29 days ago
> I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens.

Efficiency. Written language is extremely inefficient. By running through whole concepts at a time instead of parts of a word, the reasoning will be much more concise.

jonathanrmumm 29 days ago
If we're talking about conscious thought, it's millions of simultaneously firing neurons forming words. If we're talking about unconscious intelligence, it's closer to latent space. A lot of intelligence can't be articulated.
viraptor 29 days ago
(citation needed) It sounds fun and all, but we barely have any connection between the human brain and LLMs as they exist today.
porridgeraisin 29 days ago
We need to reboot Bryan Cantrill's "Don't anthropomorphize the lawn mower" talk with a new edition titled "Don't anthropomorphize the internet document simulator"
bcantrill 27 days ago
porridgeraisin 18 days ago
Nice, right from the horse's mouth. Let me watch that.
ckrapu 29 days ago
Identifying scheming in the latent streams would be harder as you would have an extra layer of obfuscation between you and the model’s reasoning.
timbilt 30 days ago
Twitter thread about this by the author: https://x.com/jonasgeiping/status/1888985929727037514
danielbln 29 days ago
If you don't have a twitter account and want to read the full thread: https://xcancel.com/jonasgeiping/status/1888985929727037514
radarsat1 29 days ago
If you keep digging in that thread the author posts a gist containing information on how the recurrence works:

https://gist.github.com/JonasGeiping/65959599ca637d72d50c96c...

tmnvdb 29 days ago
Interesting stuff. As the authors note, using latent reasoning seems to be a way to sink more compute into the model and get better performance without increasing the model size. Good news for those on a steady diet of 'scale pills'.
EternalFury 29 days ago
Isn't this equivalent to maximizing latent-space activation without corrective user input? How does it implement self-correction or backtracking?
anentropic 29 days ago
is what they call "test-time" here the same as what is often called "inference time" elsewhere?
alach11 29 days ago
Yes, those are the same.