AI Brain

Inside an AI Brain

Anthropic has been researching for a while now what happens inside the “mind” of its LLM, Claude. The company recently released two papers that peel back the curtain a bit on the LLM “thought process”.

The folks at Anthropic came up with a technique called “circuit tracing” to see how Claude actually thinks. To do this, they build a simpler, more interpretable version of the model and use it to map out the important steps the original takes. They look at how different parts, called “features,” talk to each other to do things like answer questions or even write poems. This helps them figure out why the model works the way it does, like how it reasons or why it sometimes hallucinates (i.e., makes things up).
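If it helps to picture what “tracing” might look like, here is a toy sketch in Python. This is not Anthropic’s actual method; the tiny network, its weights, and the feature names are all made up. It just records which intermediate features fire on an input and how much each one pushes the output up or down:

```python
# Toy illustration (not Anthropic's actual method): trace which intermediate
# "features" contribute to an output in a tiny two-layer network.
# All weights and feature names here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# A tiny "model": input -> 6 hidden features -> single output score.
W_in = rng.normal(size=(4, 6))    # input (4 dims) to 6 hidden features
w_out = rng.normal(size=6)        # 6 hidden features to 1 output

feature_names = [f"feature_{i}" for i in range(6)]

def trace(x):
    """Run the toy model and attribute the output to each hidden feature."""
    h = np.maximum(x @ W_in, 0.0)     # ReLU feature activations
    contributions = h * w_out         # each feature's share of the output
    return contributions.sum(), contributions

x = rng.normal(size=4)                # a made-up input
output, contributions = trace(x)

print(f"output score: {output:.3f}")
for name, c in sorted(zip(feature_names, contributions), key=lambda p: -abs(p[1])):
    print(f"{name}: contribution {c:+.3f}")
```

In the real research the features live inside a full language model and the resulting graphs are far larger, but the underlying question is the same: which internal parts drove this particular output?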

What I find most interesting are their findings regarding hallucinations. Large language models can be thought of as really good guessers. They’re trained to predict the next word that makes sense based on what they’ve seen. Sometimes, when they’re asked about something they don’t fully know, instead of saying “I don’t know,” they make up an answer that sounds plausible, and this is known as a hallucination.
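To make the “really good guesser” idea concrete, here is a minimal sketch using the open-source Hugging Face transformers library with GPT-2 as a small stand-in model (Claude itself isn’t available this way). It simply ranks the most likely next words for a prompt:

```python
# Minimal next-word prediction sketch: GPT-2 via the Hugging Face
# transformers library, used here only as a small stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # scores for every vocabulary token

next_token_logits = logits[0, -1]        # scores for the *next* token only
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>10s}  p={prob.item():.3f}")
```

The model always produces some “most likely next word,” whether or not it actually knows the answer, which is exactly where hallucinations come from.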

The researchers found that models have internal “circuits” that can help them avoid this. There are “can’t answer” features that should push the model to say it doesn’t know. But when the model recognizes something, like a famous name, “known answer” features can kick in and bypass the “can’t answer” ones.

Hallucinations can happen when these “known answer” features misfire. For example, if you ask about a specific paper by an author the model recognizes, it might activate the “known answer” features just because it knows the author, even if it doesn’t actually know their papers, leading it to invent a title.
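Here is a deliberately simple toy version of that circuit in Python. Every feature name and number below is made up for illustration; it only shows the shape of the interaction described above: a refusal signal that is on by default gets suppressed whenever a “known entity” signal fires, and if there is no real knowledge behind it, the result is a confident made-up answer:

```python
# Toy illustration (all feature names and numbers are made up) of the circuit
# described above: a default "can't answer" feature is suppressed whenever a
# "known answer" feature fires, even if no real knowledge backs it up.

def toy_answer(recognizes_entity: bool, actually_knows_fact: bool) -> str:
    cant_answer = 1.0                                  # refusal is on by default
    known_answer = 1.0 if recognizes_entity else 0.0

    # The "known answer" feature inhibits the refusal feature.
    cant_answer -= known_answer

    if cant_answer > 0.5:
        return "I don't know."
    if actually_knows_fact:
        return "Here is the correct answer."
    # Refusal was suppressed but there is no real knowledge: hallucination.
    return "Here is a plausible-sounding (made-up) answer."

# Unknown name: refusal stays active.
print(toy_answer(recognizes_entity=False, actually_knows_fact=False))
# Famous author, known fact: correct answer.
print(toy_answer(recognizes_entity=True, actually_knows_fact=True))
# Famous author, but the specific paper is unknown: the misfire case.
print(toy_answer(recognizes_entity=True, actually_knows_fact=False))
```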

So, preventing hallucinations might mean improving the model’s ability to know when it doesn’t know something and making sure those “known answer” signals are more accurate.

There are a lot of technical details in the two papers, but they may be worth skimming just to get a sense of the complexity of these models.

You can find the Anthropic research here: https://lnkd.in/eysbdw8q
The two papers were released in March 2025 and are titled:
On the Biology of a Large Language Model
Circuit Tracing: Revealing Computational Graphs in Language Models