I don’t think I’ve encountered a case where I’ve just let the LLM churn for more than a few minutes and gotten a good result. If it doesn’t solve an issue on the first or second pass, it seems to rapidly start making things up, make totally unrelated changes claiming they’ll fix the issue, or trying the same thing over and over.
They really need to figure out a way to delete or "forget" prior context, so the user or even the model can go back and prune poisonous tokens.
Right now I work around it by regularly making summaries of instances, and then spinning up a new instance with fresh context and feed in the summary of the previous instance.
imagine instead of predicting just the next token, the LLM predicts a mask over the previous tokens, that is then thresholded and only “relevant” tokens are kept in the next inference
one key distinction between humans and LLMs is that humans are excellent at forgetting irrelevant data. we forget tens of times a second and only keep what's necessary
People have also been reporting that ChatGPT's new "memory" feature is poisoning their context. But context is also useful. I think AI companies will have to put a lot of engineering effort into keeping those LLMs on the happy path even with larger and larger contexts.
Pure speculation on my part but it feels like this may be a major component of the recent stories of people being driven mad by ChatGPT - they have extremely long conversations with the chatbot where the outputs start seeming more like the "spicy autocomplete" fever dream creative writing of pre-RLHF models, which feeds and reinforces the user's delusions.
Many journalists have complained that they can't seem to replicate this kind of behavior in their own attempts, but maybe they just need a sufficiently long context window?
In Claude Code you can use /clear to clear context, or /compact <optional message> to compact it down, with the message guiding what stays and what goes. It's helpful.
Claude has some amazing features like this that aren’t very well documented. Yesterday I just learned it writes sessions to disk and you can resume them where you left off with -continue or - resume if you accidentally close or something.
Also loving the shift + tab (twice) to enter plan mode. Just adding here in case it helps anyone else.
Yeah, it seems like they stealth ship a lot. Which is cool, but can sometimes lead to a future that's unevenly distributed, if you catch my drift.
I know there is work being done on LLM “memory” for lack of a better term but I have yet to see models get more responsive over time with this kind of feedback. I know I can flag it but right now it doesn’t help my “running” context that would be unique to me.
I have a similar thought about LLM “membranes”, which combines the learning from multiple users to become more useful, I am keeping a keen eye on that as I think that will make them more useful on a organizational level
A silly example is any of the riddles where you just simplify it to an obvious degree and the LLM can't get it (mostly gone with recent big models), like: "A man, a sheep, and a boat need to get across a river. How can they do this safely without the sheep being eaten".
A more practically infuriating example is when you want to do something slightly different than a very common problem. The LLM might eventually get it right, after too much guidance, but then it'll slowly revert back to the "common" case. For example, replacing whole chunks of code with whatever common thing when you tell it add comments. This happens frequently to me with super basic vector math.
But, OpenAI and friends should let me purge my questions and, more importantly, the LLM response from the chat. More often than not, it’s poisoning itself with bad ideas, flip-flopping, etc. I hate having to pick up and move to a new chat but if I don’t the conversation will only go downhill.
This is possible in tools like LM Studio when running LLMs locally. It's a choice by the implementer to grant this ability to end users. You pass the entire context to the model in each turn of the conversation, so there's no technical reason stopping this feature existing, besides maybe some cost benefits to the inference vendor from cache.
It's already the case on tools like block.github.io/goose:
```
Summarize Conversation This will summarize your conversation history to save context space.
Previous messages will remain visible but only the summary will be included in the active context for Goose. This is useful for long conversations that are approaching the context limit.
```
This is already pretty much figured out: https://www.promptingguide.ai/techniques/react
We use it at work and we never encounter this kind of issues.
It mostly happens when you pass it similar but updated code, for some reason it then doesn't really see the newest version and reasons over obsolete content.
I've had one chat recover from this, though.
I try to keep hygiene with prompts; if I get anything bad in the result, I try to edit my prompts to get it better rather than correcting in conversation.
After hitting my head against a wall with a problem I need to stop.
I need to stop and clear my context. Go a walk. Talk with friends. Switch to another task.
When I came back all the tests were passing!
But as I ran it live a lot of cases were still failing.
Turns out the LLM hardcoded the test values as “if (‘test value’) return ‘correct value’;”!
It then deleted the entire implementation and made the function raise a “not implemented” exception, updated the tests to expect that, and told me this was a solid base for the next developer to start working on.
In programming, I already have a very good tool to follow specific steps: _the programming language_. It is designed to run algorithms. If I need to be specific, that's the tool to use. It does exactly what I ask it to do. When it fails, it's my fault.
Some humans require algorithmic-like instructions too. Like cooking a recipe. However, those instructions can be very vague and a lot of humans can still follow it.
LLMs stand on this weird place where we don't have a clue in which occasions we can be vague or not. Sometimes you can be vague, sometimes you can't. Sometimes high level steps are enough, sometimes you need fine-grained instructions. It's basically trial and error.
Can you really blame someone for not being specific enough in a system that only provides you with a text box that offers anthropomorphic conversation? I'd say no, you can't.
If you want to talk about how specific you need to prompt an LLM, there must be a well-defined treshold. The other option is "whatever you can expect from a human".
Most discussions seem to juggle between those two. LLMs are praised when they accept vague instructions, but the user is blamed when they fail. Very convenient.
Can you blame them for that?
For other products, do you think people contact customer support with an abundance of information?
Now, consider what these LLM products promise to deliver. Text box, answer. Is there any indication that different challenges might yield difference in the quality of outcome? Nope. Magic genie interface, it either works or it doesn't.
(Context: Working in applied AI R&D for 10 years, daily user of Claude for boilerplate coding stuff and as an HTML coding assistant)
Lots of "with some tweaks i got it to work" or "we're using an agent at my company", rarely details about what's working or why, or what these production-grade agents are doing.
Sure, it takes some creative prompting, and a lot of turns to get it to settle on the proper coordinate system for the whole thing, but it goes ahead and does it.
This took me two days so far. Unfortunate, the scope of the thing is now so large that the quality rapidly starts to degrade.
Basically like Java Spring Boot or NestJS type projects.
I’ve had a similar experience, where instead of trying to fix the error, it added a try/catch around it with a log message, just so execution could continue
For example, the other day I was converting models but was running out of disk space. The agent decided to change the quantization to save space when I'd prefer it ask "hey, I need some more disk space". I just paused it, cleared some space, then asked the agent to try the original command again.
Is this with something like Aider or CLine?
I've been using Claude-Code (with a Max plan, so I don't have to worry about it wasting tokens), and I've had it successfully handle tasks that take over an hour. But getting there isn't super easy, that's true. The instructions/CLAUDE.md file need to be perfect.
What kind of tasks take over an hour?
I generally treat all my sessions with it as a pairing session, and like in any pairing session, sometimes we have to stop going down whatever failing path we're on, step all the way back to the beginning, and start again.
At least that’s easy to catch. It’s often more insidious like “if len(custom_objects) > 10:” or “if object_name == ‘abc’” buried deep in the function, for the sole purpose of making one stubborn test pass.
“Claude, this is a web server!”
“My apologies… etc.”
To get the best results, I make sure to give detailed specs of both the current situation (background context, what I've tried so far, etc.) and also what criteria the solution needs to satisfy. So long as I do that, there's a high chance that the answer is at least satisfying if not a perfect solution. If I don't, the AI takes a lot of liberties (such as switching to completely different approaches, or rewriting entire modules, etc.) to try to reach what it thinks is the solution.
It's not often that I have to do this. As I mentioned in my post above, if I start the interaction with thorough instructions/specs, then the conversation concludes before the drift starts to happen.
I absolutely have, for what it's worth. Particularly when the LLM has some sort of test to validate against, such as a test suite or simply fixing compilation errors until a project builds successfully. It will just keep chugging away until it gets it, often with good overall results in the end.
I'll add that until the AI succeeds, its errors can be excessively dumb, to the point where it can be frustrating to watch.
1) switch to a more expensive llm and ask it to debug: add debugging statements, reason about what's going on, try small tasks, etc 2) find issue 3) ask it to summarize what was wrong and what to do differently next time 4) copy and paste that recommendation to a small text document 5) revert to the original state and ask the llm to make the change with the recommendation as context
I've had the same experience as parent where LLMs are great for simple tasks but still fall down surprisingly quickly on anything complex and sometimes make simple problems complex. Just a few days ago I asked Claude how to do something with a library and rather than give me the simple answer it suggested I rewrite a large chunk of that library instead, in a way that I highly doubt was bug-free. Fortunately I figured there would be a much simpler answer but mistakes like that could easily slip through.
You might not even need to switch
A lot of times, just asking the model to debug an issue, instead of fixing it, helps to get the model unstuck (and also helps providing better context)
Sounds like a lot of employees I know.
Changing out the entire library is quite amusing, though.
Just imagine: I couldn't fix this build error, so I migrated our entire database from Postgres to MongoDB...
Humans as well don’t remember the entire context either. For your case the summary already says tried library A and B and it didn’t work, it’s unlikely the LLM will repeat library A given that the summary explicitly said it was attempted.
I think what happens is that if the context gets to large the LLM sort of starts rambling or imitating rambling styles it finds online. The training does not focus on not rambling and regurgitation so the LLM is not watching too hard for that once the context gets past a certain length. People ramble too and we repeat shit a lot.
When humans get stuck solving problems they often go out to acquire new information so they can better address the barrier they encountered. This is hard to replicate in a training environment, I bet its hard to let an agent search google without contaminating your training sample.
does this mean that even AI gets stuck in dependency hell?
It affects people too. Something I learned halfway through a theoretical physics PhD in the 1990s was that a 50-page paper with a complex calculation almost certainly had a serious mistake in it that you'd find if you went over it line-by-line.
I thought I could counter that by building a set of unit tests and integration tests around the calculation and on one level that worked, but in the end my calculation never got published outside my thesis because our formulation of the problem turned a topological circle into a helix and we had no idea how to compute the associated topological factor.
Sexual reproduction is context-clearing and starting over from ROM.
Sure, you could be cloned, but that wouldn't be you. The process of accumulating memories is also the process of aging with death being an inevitability.
Software is sort of like this too, hence rewrites.
That's not quite clear philosophically. I like the thought experiment of them migrating each of your neurons one at a time from the biological into a computerized emulation (each emulated neuron having a physical mechanism to send proper electrical/chemical signals to those that are still biological, while doing software message passing with the digital ones) - do you at any point stop being you? Once the migration is complete and you're being fully emulated in silicon, is it still you? If they then restart the computer, is it still you after it's resumed? If they duplicate the code/weights onto two computers, are both you? And if they then reassemble your biological body with the neural connectivity based on that in the emulation - is it still you? And what if they clone the two digital "yous" into two separate bodies?
I personally don't have a clear answer to any of these, not more than I would in the plain Ship of Theseus thought experiment.
Interesting, and I used to think that math and sciences were invented by humans to model the world in a manner to avoid errors due to chains of fuzzy thinking. Also, formal languages allowed large buildings to be constructued on strong foundations.
From your anecdote it appears that the calculations in the paper were numerical ? but I suppose a similar argument applies to symbolic calculations.
https://inspirehep.net/files/20b84db59eace6a7f90fc38516f530e...
using integration over phase space instead of position or momentum space. Most people think you need an orthogonal basis set to do quantum mechanical calculation but it turns that "resolution of unity is all you need", that is, if you integrate |x><x| over all x you get 1. If you believe resolution of unity applies in quantum gravity, then Hawking was wrong about black hole information. In my case we were hoping we could apply the trace formula and make similar derivations to systems with unusual coordinates, such as spin systems.
There are quite a few calculations in physics that involve perturbation theory, for instance, people used to try to calculate the motion of the moon by expanding out thousands of terms that look like (112345/552) sin(32 θ-75 ϕ) and still not getting terribly good results. It turns out classic perturbation theory is pathological around popular cases such as the harmonic oscillator (frequency doesn't vary with amplitude) and celestial mechanics (the frequency to go around the sun, to get closer or further from sun, or to go above or below the plane of the plane of the ecliptic are all the same.) In quantum mechanic these are not pathological, notably perturbation theory works great for an electron going around an atom which is basically the same problem as the Earth going around the Sun.
I have a lot of skepticism about things like
https://en.wikipedia.org/wiki/Anomalous_magnetic_dipole_mome...
in high energy physics because frequently they're comparing a difficult experiment to an expansion of thousands of Feynman diagrams and between computational errors and the fact that perturbation theory often doesn't converge very well I don't get excited when they don't agree.
----
Note that I used numerical calculations for "unit and integration testing", so if I derived an identity I could test that the identity was true for different inputs. As for formal systems, they only go so far. See
https://en.wikipedia.org/wiki/Principia_Mathematica#Consiste...
What could any human do with a context window of 10 minutes and no other memory? You could write yourself notes… but you might not see them because soon you won’t know they are there. So maybe tattoo them on your body…
You could likely do a lot of things. Just follow a recipe and cook. Drive to work. But could you drive to the hardware store and get some stuff you need to build that ikea furniture? Might be too much context.
I think solving memory is solving agi.
But we already have AGI
If you can’t zero shot your way to success the LLM simply doesn’t have enough training for your problem and you need a human touch or slightly different trigger words. There have been times where I’ve gotten a solution with such a minimal prompt it practically feels like the LLM read my mind, that’s the vibe.
I really think the insufferable obsequiousness of every LLM is one of the core flaws that make them terrible peer programmers.
So, you're right in a way. There's no sense in arguing with them, but only because they refuse to argue.
But I think the bigger reason arguing doesn't work is they are still fundamentally next-token-predictors. The wrong answer was already something it thought was probable before it polluted its context with it. You arguing can be seen as an attempt to make the wrong answer less probable. But it already strengthened that probability by having already answered incorrectly.
Anyway, what this means I think is you will find AI agents continuing to colonize spaces with meaningful local and global reward functions. But most importantly, it likely means that complex problem spaces will see marginal improvements (where are all these new math theorems we were promised many months ago?).
It’s also very tempting to say ”ah but we can just make or even generate reward functions for those problems and train the AI”. I suspect this won’t happen, because if there was simple functions, we’d have discovered them already. Software engineering is one such mystery, and the reason I love it. Every year, we come up with new ideas and patterns. Many think they will solve all our problems, or at least consistently guide as in the right direction. But yet, here we are, debating language features, design patterns, tooling, UX etc etc. The vast majority of easy truths are already found. The rest are either complex or hard to find. Even when we think we found one, it often takes man-decades to conclude that it wasn’t even a good idea. And they’re certainly not inferrable from existing training data.
- Removing problematic tests altogether
- Making up libs
- Providing a stub and asking you to fill in the code
This is an attack vector. Probe the models for commonly hallucinated libraries (on npm or github or wherever) and then go and create those libraries with malicious code.
This is a perennial issue in chatbot-style apps, but I've never had it happen in Claude Code.
Is this just a context limitation, or are they missing some kind of self-correction loop? Curious if anyone has seen agents that can catch their own mistakes and adjust during a task. Would love to hear how far that has come.
I saw some results showing that LLMs struggle to complete tasks which would take longer than a day. I wonder if the average developer, individually, would be much better if they had to write the software on their own.
The average dev today is very specialized and their code is optimized for job security, not for correctness and not for producing succinct code which maps directly to functionality.
So if you project outwards a while, you hit around 10000 hours about 6 years from now.
Is that a reasonable timeline for ASI?
It's got more of a rationale behind it than other methods perhaps?
Compilers already do better than me million hours for every program I write because I am not crafting assembly code.
Computers save billions of hours compared to doing it by hand with an abacus or pen and paper.
Productivity of humans is always dependent on tools they have access to, even with agents become that much productive so will humans who use tools
——
Projecting doubling rate over so many generations is no different than saying earth is flat just because it feels flat in your local space. There is no reason to believe exponential doubling holds for 10 generations from now.
There is just one example in all of history that did that over 10 generations , Moore law . Doubling for so many generations at near constant rate is close to impossible , from the times of the ancient tale of grain and chessboard, people constants struggle with the power of the exponential.
——
I would say current LLM approaches are fairly close to end than beginning of the cycle, perhaps 1-2 generations left at best .
The outlay on funding is already at $50-$100 B/year per major player. No organization can spend $500B/year on anything. I don’t see any large scale collaboration in the private sector or in public like space station / fusion reactor for allocating resources to get to say a 3rd gen from now.
Comparing it with semiconductor tech the only other exponential example we have, the budget for foundries and research grew similarly and those are today are say $10-20B for leading edge foundry, but doesn’t keep growing at the same pace.
Constraints on capital availability and risk when deploying it is why we have so few fab players left and remaining players can afford to stagnate and not seriously invest .
I could be quite wrong of course but it is not a certain bet that we will get fundamental breakthroughs from them.
There are specific areas which are always going to have major improvements .
In the semi conductor industry, Low power processors or multi core dies etc produced some results when core innovations slowed down during 2008-2018, i.e. till before the current EUV breakthrough driven generations of chip advances.
The history of EUV lithography and ASML’s success is an unlikely tale and it happened after both public and industry consortium funding of work for 2 decades that was abandoned multiple times .
Breakthroughs will happen eventually, but each wave ( we are on the fourth one for AI?) stagnates after initial rapid progress .