It would be hard to SFT this because you can only SFT the final result, not the latent space.
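To make that concrete, here's a minimal sketch (toy PyTorch, stand-in modules, nothing from the paper's actual code): the cross-entropy loss is only ever defined on decoded token logits, so the intermediate latent states have no target you could supervise against.

    import torch
    import torch.nn.functional as F

    vocab, d_model, r, seq = 100, 64, 4, 8      # toy sizes, purely illustrative
    embed = torch.nn.Embedding(vocab, d_model)
    core = torch.nn.Linear(d_model, d_model)    # stand-in for the recurrent block
    unembed = torch.nn.Linear(d_model, vocab)

    tokens = torch.randint(0, vocab, (seq,))
    targets = torch.randint(0, vocab, (seq,))   # SFT labels: output tokens only

    h = embed(tokens)
    for _ in range(r):
        h = torch.tanh(core(h))                 # latent iterations: no labels exist for these states

    loss = F.cross_entropy(unembed(h), targets) # supervision enters only at the token level
    loss.backward()                             # gradients reach the latents only indirectly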
I also notice the authors only had compute for a single full training run. It’s impressive they saw such good results from that, but I wonder if they could get better results by incorporating recent efficiency improvements.
I would personally not use this architecture because 1) it adds a lot of hyperparameters which don’t have a strong theoretical grounding and 2) it’s not clearly better than simpler methods.
They did (partly) fix R1's tendency to mix languages, thereby making its CoT more interpretable. But that fix came at the cost of degrading the quality of the final answer.[0] Since we can't reliably do interpretability on latents anyway, presumably the only metric that matters in that case is answer quality - and so observing thinking tokens gets you no marginal capability benefit. (It does however give you a potential safety benefit - as Anthropic vividly illustrated in their "alignment faking" paper. [1])
The bitter lesson strikes yet again: if you ask for X to get to Y, your results are worse than if you'd just asked for Y directly in the first place.
[0] From the R1 paper: "To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable." [emphasis added]
For example, should we stop this training or keep going and wait for it to improve? In theory that’s irrelevant because we don’t make mistakes. In practice, theory is just theory.
As an analogy, you technically don’t need code comments. The compiler removes them. But in practice you do need them.
So that’s another reason I mentioned the hyperparameter hell. You’ve removed a simple interpretability method and left us with numbers that worked for a single training run.
On a side note, there's decent research showing that bilingual humans do actually think in both languages, and are actually better at decisive thinking outside of their mother tongue.
So at best, using a recurrent loop only saves you the embedding -> unembedding round-trip at each token, which is relatively small compared with the depth of the decoder stack.
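A rough sketch of that comparison (toy PyTorch, hypothetical embed/block/unembed modules, not the paper's code). The only thing the latent loop skips is the unembed -> sample -> re-embed round-trip; the decoder blocks themselves run either way:

    import torch

    vocab, d_model, r = 100, 64, 4
    embed = torch.nn.Embedding(vocab, d_model)
    block = torch.nn.Linear(d_model, d_model)   # stand-in for the full decoder stack
    unembed = torch.nn.Linear(d_model, vocab)
    x = torch.randint(0, vocab, (8,))

    # Latent recurrence: stay in hidden space for r iterations.
    h = embed(x)
    for _ in range(r):
        h = torch.tanh(block(h))
    latent_logits = unembed(h)

    # Token-space reasoning: decode to a token and re-embed every iteration.
    h = embed(x)
    for _ in range(r):
        step_logits = unembed(torch.tanh(block(h)))
        h = embed(step_logits.argmax(-1))       # the round-trip the latent loop avoids

(This ignores attention/KV handling, which also differs between the two loops, but the shape of the saving is that round-trip.)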
"On a more philosophical note, we hope that latent reasoning captures facets of human reasoning that defy verbalization, such as spatial thinking, physical intuition or (motor) planning. Over many iterations of the recurrent process, reasoning in a high-dimensional vector space would enable the deep exploration of multiple directions simultaneously, instead of linear thinking, leading to a system capable of exhibiting novel and complex reasoning behavior."
We should make reasoning fully visible in the output space.
In both cases, the model is doing a ton of processing that you can't actually inspect, except here, you at least get some efficiency gains.
Even more importantly, you're also less likely to convince yourself that you know what the model is thinking.
Put differently, the observed tokens are a bottleneck on the information that can be communicated across tokens. Any scheming performed by an LLM which requires more than one token to formulate must therefore pass through the visible tokens. With opaque vectors transferred across decoding steps, this is not the case.
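A back-of-envelope version of that bottleneck argument (illustrative numbers, e.g. a 128k vocabulary and a 4096-dim fp16 hidden state, not the paper's exact figures):

    import math

    vocab_size = 128_000          # illustrative
    d_model = 4096                # illustrative
    bits_per_float = 16           # fp16

    bits_per_token = math.log2(vocab_size)       # ~17 bits can cross a decoding step as a token
    bits_per_latent = d_model * bits_per_float   # ~65,536 bits in one hidden state

    print(f"token channel:  ~{bits_per_token:.0f} bits/step")
    print(f"latent channel: ~{bits_per_latent} bits/step")

The usable capacity of the latent channel is obviously far below that raw number, but the asymmetry is the point: whatever crosses a step as a token is compressed down to a single vocabulary entry that you can read.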
The computation in the hidden layers, as far as we can tell, is not sufficient for scheming in a single decoding step. It looks like it requires O(10^2) or O(10^3) steps instead, judging from anecdotal evidence like the reports of scheming from o1 (https://cdn.openai.com/o1-system-card-20241205.pdf)
As far as your last point goes, I'd rather have a more transparent system, all other factors held constant.
There might be a lot of other hidden computation happening within the model's latents that may not immediately influence the predicted tokens but may still be relevant to the model's internal processing. And even disregarding that, the model is under no formal obligation to stick to the chain of thought it produced when making its final decisions.
I imagine they chose a fixed number of recurrent iterations during training for parallelization purposes. Not depending on the previous step to train the next was the main revolution of transformers over LSTMs (plus the higher internal bandwidth). But I agree that it might not be the most efficient model to train, given all the redundant work at large r.
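For what it's worth, the recurrence here is over depth rather than over sequence positions, so teacher-forced training still handles all positions of a sequence in parallel; only the r iterations are sequential. A toy sketch of the shapes involved (assumed, not the paper's code):

    import torch

    batch, seq_len, d_model, r = 2, 16, 64, 4
    core = torch.nn.Linear(d_model, d_model)   # stand-in for the recurrent block

    h = torch.randn(batch, seq_len, d_model)   # every position, already embedded

    # Depth recurrence: each iteration updates ALL positions at once (parallel over seq_len),
    # unlike an LSTM, where step t along the sequence must wait for step t-1.
    for _ in range(r):
        h = torch.tanh(core(h))

The sequential dependence along r is still there, which is where the redundant work at large r comes in.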
Might be good for reasoning, but it's terrible for interpretability / AI safety.
The recurrent approach is more efficient when memory bandwidth is the bottleneck. They talk about it in the paper.
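If I'm reading that argument right, it's roughly about compute per unique parameter: the same block is reused r times, so you get the per-token compute of a much deeper model while only storing, sharding, and communicating a fraction of the weights, and without appending thousands of thinking tokens to the context. A back-of-envelope sketch with made-up numbers:

    p_core = 1.5e9        # parameters in the recurrent block (assumed)
    r = 32                # recurrent iterations at inference (assumed)
    bytes_per_param = 2   # fp16

    flops_per_token = 2 * p_core * r                  # rough: ~2 FLOPs per parameter per pass, r passes
    weights_resident = p_core * bytes_per_param       # what has to fit / be sharded
    weights_unrolled = p_core * r * bytes_per_param   # a non-recurrent model of equal per-token compute

    print(f"compute per token:      ~{flops_per_token:.1e} FLOPs")
    print(f"weights to store/shard: ~{weights_resident / 1e9:.0f} GB")
    print(f"unrolled equivalent:    ~{weights_unrolled / 1e9:.0f} GB")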
I meant it rhetorically in reference to interpretability. I don't see a real difference between training a model that is 100b parameters vs a (fixed) 4x recurrent 25b parameter model as far as understanding what the model is `thinking` for the next token prediction task.
You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token, no matter whether the model is just a fixed size and quite deep, or recurrent.
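Right, e.g. a logit-lens-style probe applies per recurrence iteration just as it does per layer: project each intermediate latent through the unembedding and see what it would decode to. A hedged sketch with toy stand-in modules, not an actual tool:

    import torch

    vocab, d_model, r = 100, 64, 4
    embed = torch.nn.Embedding(vocab, d_model)
    core = torch.nn.Linear(d_model, d_model)
    unembed = torch.nn.Linear(d_model, vocab)

    h = embed(torch.randint(0, vocab, (8,)))
    for i in range(r):
        h = torch.tanh(core(h))
        probe = unembed(h).softmax(-1)                 # "logit lens" on iteration i
        print(i, probe.topk(3, dim=-1).indices[0])     # top candidate tokens at this depth

Whether the intermediate iterations decode to anything human-readable is an empirical question, but the tooling itself ports over.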
I can’t see why. I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens. And the downsides are obvious.
> externally specifying the number of recurrent iterations
Yeah this seems wrong to me. At least with RL training you saw that the length of the CoT decreased dramatically before climbing again, as the model became more proficient.
It just provides a bigger representation space, and seems more like what we do given that many people don't have an inner dialog, and some think pictorially.
It seems it could allow reasoning over superpositions of concepts, if such things exist inside the model (but presumably not at the edges, where they need to be decodable into specific tokens).
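For what it's worth, one toy way to picture that (purely illustrative, and it assumes concepts map to directions in embedding space, which is itself a simplification): a latent can be a weighted blend of several concept vectors and only has to collapse to a single token at the boundary.

    import torch

    vocab, d_model = 100, 64
    embed = torch.nn.Embedding(vocab, d_model)
    unembed = torch.nn.Linear(d_model, vocab)

    cat, dog = embed(torch.tensor(3)), embed(torch.tensor(7))   # two stand-in "concepts"
    latent = 0.6 * cat + 0.4 * dog      # a superposition the model could keep reasoning over

    token = unembed(latent).argmax(-1)  # at the boundary it must collapse to one vocab entry
    print(token)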
Efficiency. Written language is extremely inefficient. By running through whole concepts at a time instead of fragments of words, the reasoning will be much more concise.
[0] https://www.youtube.com/watch?v=bQfJi7rjuEk (slides: https://speakerdeck.com/bcantrill/intelligence-is-not-enough...)
https://gist.github.com/JonasGeiping/65959599ca637d72d50c96c...