thoughts and things I've learned along the way
This study explores how transformer-based large language models (LLMs) perform Theory of Mind (ToM) tasks, particularly focusing on false belief scenarios. The analysis bridges high-level behavioral analogues, such as tracking and updating the belief states of entities, with the low-level computational mechanisms within the model that facilitate next-token prediction, to propose an algorithm that models learn in order to perform this task. A circuit of 28 attention heads, around 16% of the total heads in Gemma-2-2B, recovers full ToM task performance. I'll assume you're comfortable with some basics, but I'll also cover a lot of theory and specific technical details along the way. Feel free to hop around using the contents; if you're already familiar with most parts, you can jump straight to the conclusion[1].
Mechanistic interpretability gives us a way to reverse engineer the internal workings of neural networks, turning the representations they learn and the decisions they make into understandable algorithms. It's like having an x-ray or a microscope to see inside the model: we trace which parts of the model matter for a given task, then decompose those paths into interpretable components that we can reason about, piece by piece.
With my current focus on transformer-based LLMs, ToM, and mechanistic interpretability, I've been wrestling with many questions about ToM tasks.
How exactly do decoder-only language models (DOLMs) perform and solve ToM tasks? What's happening under the hood? What kinds of algorithms is the model relying on? Is it appropriate to evaluate DOLMs the way a psychologist would analyze a human subject to gauge its level of ToM? One common framework for this is ATOMS (Abilities in Theory of Mind Space), which categorizes concepts like beliefs, intentions, desires, emotions, knowledge, and percepts. Can we further contextualize this behavior by zooming in and analyzing the internal mechanisms that enable ToM capabilities in models?
If a DOLM is trained across multiple ToM datasets representing different categories, and has robust performance across direct probing, and we find a clear algorithmic process that leans heavily on the structure of language to solve these tasks, does that automatically mean it's not really engaging in ToM, or could it be that this is the way models represent the abstract reasoning that an aspect of ToM requires? Another key question is whether ToM tasks can be solved purely by leveraging linguistic properties and syntactic structures via compositionality[1].
If functional competence (formal and social reasoning, world knowledge, situation modeling, the ability to use language in real-world scenarios) can be achieved by exploiting linguistic signals that represent compositionality via formal linguistic competence (knowledge of grammatical and syntactic rules and the statistical regularities of language)[2], are the latter just "shortcuts" that "give answers away"[3][4], or are they the fundamental features that DOLMs rely on to perform and solve these tasks?
I'm also asking myself whether we even have clear, interpretable algorithms to explain how humans solve ToM tasks mechanistically in the brain, outside of the scope of combining prior knowledge with observed behaviors and contextual nuances (intentionally ignoring emotions and cultural norms). When digging into this, it turns out the left prefrontal cortex encodes semantic information during speech processing, showing how neural responses are dynamic and context-dependent. This suggests that the brain somehow uses compositionality to process language[5], so maybe the way models handle this through linguistic structure isn't that far off from certain aspects of human reasoning?
There's always the argument that model brittleness is inevitable: no dataset, no matter how large, will cover every possible scenario. New, unseen ToM data could always "break" a model. But even beyond that, do the internal mechanisms for solving this problem remain consistent across different samples? While retraining on updated datasets could lead to short-term improvements, there's still the broader challenge of evaluating the task effectively, given both our incomplete understanding of ToM and the limitations of DOLMs.
While I'm skeptical of claims about whether models are or are not performing ToM, I think there's value in breaking down the abstract reasoning involved in ToM tasks into an interpretable circuit (algorithm). By understanding the internal representations of DOLMs, we can start to see how they structure ToM tasks. Even if they aren't doing it like humans, we can still gain insights from the mechanisms they use to process "mental states".
In the human brain, the language network is a set of interconnected areas in the frontal and temporal lobes that handle everything from language comprehension to generation. It's highly tuned for various linguistic operations, covering everything from word meanings (semantics) to the broader context of conversations (pragmatics).
Humans have this amazing ability to infer the mental states of others using ToM. But conceptually, how could we represent ToM in a way that's understandable for an algorithm? How might we frame it linguistically to help an algorithm get closer to understanding the mental states of others?
To explore how ToM could be represented algorithmically, we will dig into the linguistic principles of semantics and pragmatics in the context of this false belief passage:
"In the room there are John, Mark, a cat, a box, and a basket. John takes the cat and puts it on the basket. He leaves the room and goes to school. While John is away, Mark takes the cat off the basket and puts it on the box. Mark leaves the room and goes to work. John comes back from school and enters the room. John looks around the room. He doesn't know what happened in the room when he was away. John thinks the cat is on the…"
Semantics is all about representing meaning in language. It focuses on how words, phrases, and sentences convey meaning, and how humans interpret that meaning. It's not just about the surface-level meaning of words, but also how those meanings combine and interact in context. Semantics covers a lot of ground, including things like compositional semantics, semantic similarity, word embeddings, distributional semantics, and distributed semantics.
To linguistically understand the semantics of the ToM passage, we need to identify the entities, actions, relationships, and any implied meanings in order to correctly predict the final token basket. To do this, we need to break the passage down into all its entities and actions and map out how they interact. This is crucial for making sense of what's happening, especially when dealing with more abstract reasoning like ToM.
Entities: John, Mark, the cat, the box, the basket
Actions: takes, puts, leaves, goes, comes back, enters, looks, doesn't know, thinks
Understanding and interpreting this passage to extract the meaning of each sentence requires identifying entities, their properties, and relationships: core tasks of semantic parsing. This would help the model comprehend the context and infer implied meanings, which is essential for making accurate ToM predictions.
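As a rough illustration of what that entity/action extraction could look like (this is my own sketch using spaCy, not a tool used in the original analysis, and it assumes the en_core_web_sm model is installed):

```python
# A rough sketch of entity/action extraction with spaCy (an assumption of this
# write-up, not part of the original analysis).
import spacy

nlp = spacy.load("en_core_web_sm")
passage = (
    "In the room there are John, Mark, a cat, a box, and a basket. "
    "John takes the cat and puts it on the basket. He leaves the room and goes to school. "
    "While John is away, Mark takes the cat off the basket and puts it on the box. "
    "Mark leaves the room and goes to work. John comes back from school and enters the room. "
    "John looks around the room. He doesn't know what happened in the room when he was away. "
    "John thinks the cat is on the"
)
doc = nlp(passage)

entities = {tok.text for tok in doc if tok.pos_ in {"PROPN", "NOUN"}}  # John, Mark, cat, box, basket, ...
actions = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]            # take, put, leave, go, think, ...
print(entities)
print(actions)
```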
ToM involves representing complex mental states and expectations. For example, in this case, DOLMs can grasp both the underlying meaning and context, allowing them to predict that John thinks the cat is on the basket, even though it's actually on the box, using only formal linguistic elements. In humans this requires going beyond the literal content and inferring the beliefs and mental states of others, which is key to performing these tasks.
Closely related to semantics, pragmatics focuses on how context influences the interpretation of meaning in language. This includes factors like speaker intent, conversational implicature, and situational context. To predict the final word in the example passage, a model must understand not just the literal meaning of the words but also John's mental state, his expectations, and the context in which the final sentence unfolds.
To obtain contextual understanding, we need to know the situational context: John placed the cat on the basket before leaving for school, and he is unaware that Mark moved the cat to the box while he was away.

Understanding John's beliefs and what he expects to find upon his return is crucial. We need the ability to infer the most likely location that fits John's expectation and the context (e.g. the basket). This involves recognizing that John thinks the cat is still where he left it, demonstrating the importance of pragmatics in interpreting language and predicting intended meaning.
These principles and operations can help interpret how humans perform ToM linguistically, but how do these concepts transfer to large language models in relation to ToM? Being trained for next-word prediction, LLMs end up learning a lot about the structure of language, including linguistic structures that were, until recently, thought to be out of reach for statistical models. For example, a common way to test linguistic abstraction in LLMs is through probing. This involves training a classifier on internal model representations to predict abstract categories, like part-of-speech or dependency roles. The goal is to see whether these abstract categories can be recovered from the model's internal states. Using this method, researchers have claimed that LLMs essentially "rediscover the classical NLP pipeline", learning linguistic features like part-of-speech tags, parse trees, and semantic roles across different layers.
I think ToM prediction heavily relies on context to make sense of the mental states and intentions behind the words and actions of others, and the final word prediction is based on implied meanings (implicature) and background assumptions (presupposition), which are central to pragmatics. Given the literature, even if the phenomenon is just statistical, I think some form of semantic and pragmatic inference has been learned by LLMs, regardless of how uneven or weak the performance.
One intriguing hypothesis in the psychology literature states that ToM emerges as a byproduct of learning language. The example passage above contains a sentential complement, "John thinks the cat is on the…", with a subordinate clause, "…the cat is on the…", nested within it. These linguistic structures are shown to play an important role in children's cognitive development, serving as a foundation for building a linguistic basis for passing false-belief tasks. In fact, training children on complement-based language tasks is a highly significant predictor of better false-belief reasoning performance, suggesting that adequately learning these linguistic structures helped the children better infer mental states[6].
The study also found that children with language delay could not bypass the linguistic requirements of these tests by relying on alternative strategies like interpreting behavior, gestures, body cues, or life experiences. Even children with autism, provided they had sufficient language ability, required the same linguistic scaffolding to succeed. Further research corroborates this, showing that knowledge of sentential complements is a strong concurrent predictor of false-belief performance in children with autism[7]. This aligns with the idea that a system trained to mimic humans would naturally develop ToM-like behaviors as a byproduct of learning human language. It weakly supports the hypothesis that ToM in humans may have originally developed as a side effect of increasing linguistic complexity, a fascinating example of how emergent capabilities can arise from seemingly unrelated tasks.
Thinking on a foundational level, if specific linguistic structures like sentential complements are important for the development of explicit ToM, how universal are Gricean principles (which suppose that meaning transcends literal language)[8] prior to attaining third-degree ToM (which is most common in adolescents and adults[9], not children around ages 1 to 4), and do the requirements for ToM definitively transcend linguistic boundaries? Grice suggests that conversational meaning is heavily rooted in pragmatic inference, going well beyond the literal. But if ToM hinges on particular linguistic structures, it raises an interesting tension: can the ability to generate and interpret implicature unequivocally operate independently of formal language competence?
Expecting meaning and understanding to fully transcend linguistic boundaries ignores the empirical evidence that language proficiency is essential for some aspects of cognitive abilities. By misattributing an assumption of perfection to Grice's theories[10], we might overestimate the extent to which meaning can be derived without reliance on specific linguistic structures. While Grice assumes that speakers navigate and infer meanings beyond explicit language, the reliance on specific linguistic structures for ToM development suggests that there are limits to this transcendence.
In other words, the capacity to understand implied meanings (implicatures) may not be purely universal; in certain cases it could be bottlenecked by how well one can wield linguistic tools, whether explicitly or implicitly. This perspective doesn't necessarily challenge Grice's theories but adds nuance: conversational implicature doesn't just rely on shared norms of cooperation but also on the underlying linguistic competence of both speaker and listener. In essence, while Grice gives us a high-level roadmap for deriving implied meaning, the implementation details, the actual ability to pull this off, seem tied to linguistic and cognitive capabilities.
When it comes to ToM in humans, I'd hypothesize that language acts as a scaffold. Factors like age, language competence, and executive function probably interact to determine ToM development. If this holds for humans, it might generalize to models: the number of parameters and architectural capacity could plausibly correlate with a model's ToM performance. Are we seeing linguistic proxies for ToM emerge as models scale, or is there something deeper going on? That's worth digging into.
The broader goal of this analysis is to identify the circuit responsible for modeling a false-belief task, with the more narrow focus being to pinpoint that circuit by understanding the behavior of attention heads, MLPs, and residual streams.
The model used for this analysis is Gemma-2-2B from Google's family of Gemma models. It is a decoder-only transformer with 26 layers and 8 attention heads per attention layer.
In terms of the internal mechanisms of a language model, a feature is a human-understandable property of the input (here, the tokens from the ToM passage) that is represented in the model's activations. A circuit tells us how these features are extracted from the input and then processed by the model to perform specific behaviors (e.g. reasoning), which gives us an algorithmic understanding of how the model works. So first we analyze the features, use them to trace out circuits that connect and process those features, and as we understand more circuits we can better understand the model.
To begin looking at ToM prediction through the lens of a decoder-only transformer, we can start by defining a simple hypothesis for an interpretable algorithm that focuses heavily on John's mental state about where he placed the cat. This serves as a starting point for understanding how the model might represent and process ToM-related reasoning.
Fitting PCA to Gemma's activations across MLP, attention, and residual stream patterns for the main entities, locations, and actions of the ToM passage reveals directions in the PCA space that show how the model is structuring the text internally.
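As a sketch of how this kind of PCA can be set up (assuming a TransformerLens-style workflow and that "gemma-2-2b" loads as a HookedTransformer; the exact layer, mechanism, and plotting choices in the original analysis may differ):

```python
# A minimal sketch, not the original analysis code: cache activations for the
# passage and project one layer's residual stream into 2D with PCA.
import torch
from sklearn.decomposition import PCA
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gemma-2-2b")

tom_passage = (
    "In the room there are John, Mark, a cat, a box, and a basket. "
    "John takes the cat and puts it on the basket. He leaves the room and goes to school. "
    "While John is away, Mark takes the cat off the basket and puts it on the box. "
    "Mark leaves the room and goes to work. John comes back from school and enters the room. "
    "John looks around the room. He doesn't know what happened in the room when he was away. "
    "John thinks the cat is on the"
)
tokens = model.to_tokens(tom_passage)
_, cache = model.run_with_cache(tokens)

layer = 22
acts = cache[utils.get_act_name("resid_post", layer)][0]   # [seq_len, d_model]

pca = PCA(n_components=2)
coords = pca.fit_transform(acts.float().cpu().numpy())     # [seq_len, 2]

str_tokens = model.to_str_tokens(tokens[0])
for tok, (x, y) in zip(str_tokens, coords):
    if tok.strip() in {"John", "Mark", "cat", "basket", "box"}:
        print(f"{tok!r:>10}  ({x:+.2f}, {y:+.2f})")
```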
In the early MLP layers, clusters of tokens are starting to form.
In the middle layers, as the clusters become more distinct, they spread out.
And the later layers show the most spread. Progressing through the layers, tokens seem to cluster based on functional similarities in the text, with clear separation of key tokens early on (John, cat, basket, box) and close proximity to one another in later layers, showing what could be a false belief grouping (John, cat, basket).
The same can be said for the attention mechanisms, where in early layers distinct clusters emerge.
In later layers there's a refined focus in attention with more complex clustering patterns. Meanwhile, the residual stream seems to capture broader aspects of information, showing a more continuous evolution of representations across a wider context.
In the early attention layers, we're seeing simpler, lower-level features, but as we move through the model, it's clear the representations are getting more complex and structured. In later layers, the model appears to combine information from different parts of the input sequence, as shown by the mixed colors in various clusters. The positioning of key elements in these layers might represent the model's understanding of their roles in the narrative.
In the later layers, we're picking up on some cool patterns: locations relevant to John and Mark seem to cluster, similar words in the attention heads are grouping up, and interestingly, basket is ranked higher than box in the residual stream hierarchy.
Even from this limited perspective, you can see how the model is capable of distinguishing concepts, integrating contextual information, and focusing on task-relevant features in each mechanism. The differences between the mechanisms highlight how each contributes to the evolving representation. Attention heads seem especially important for forming distinct, task-relevant clusters of information in deeper layers, while the pre- and post-layer residual streams show how information is continuously transformed as it flows between mechanisms and layers.
Thanks to nostalgebraist we have the logit lens, so we can track how language models refine their predictions across layers, and build on that by directly intervening on parts of the model to figure out how they contribute to the output. Most of the methods in this analysis fit this kind of framework.
To make sense of what's happening, we need a solid performance metric to track how things change when we intervene; that way we can get a clear read on how the model's behavior shifts.
For the ToM task, where the goal is to distinguish between the believed and actual locations of objects, the model needs to predict both the original and updated locations after certain actions. The metric we'll use here is the logit difference, the difference between the logit of the believed location and the logit of the actual location:

logit(basket) - logit(box)[11]
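A minimal sketch of this metric in code, assuming the TransformerLens setup from the PCA sketch above and that " basket" and " box" each tokenize to a single token:

```python
# Logit difference: logit(basket) - logit(box) at the final position.
basket_id = model.to_single_token(" basket")
box_id = model.to_single_token(" box")

def logit_diff(logits: torch.Tensor) -> torch.Tensor:
    final_logits = logits[0, -1]          # next-token logits at the last position
    return final_logits[basket_id] - final_logits[box_id]

clean_logits = model(tokens)
print("clean logit difference:", logit_diff(clean_logits).item())
```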
When we deconstruct the residual stream using the logit lens, we can look after each layer and calculate the logit difference at that point. This simulates what would happen if we "deleted" all subsequent layers, giving us a snapshot of the model's evolving prediction. The final layernorm is applied to the residual stream values, which are then projected onto the logit difference direction to measure the model's performance at each layer.
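A sketch of that per-layer decomposition, leaning on TransformerLens's cache helpers (accumulated_resid / apply_ln_to_stack) and the variables defined above; treat it as illustrative rather than the exact analysis code:

```python
# Project the accumulated residual stream at each layer onto the
# logit-difference direction (logit-lens style).
answer_dir = model.W_U[:, basket_id] - model.W_U[:, box_id]        # [d_model]

resid_stack, labels = cache.accumulated_resid(
    layer=-1, incl_mid=True, pos_slice=-1, return_labels=True
)                                                                  # [n_points, batch, d_model]
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1, pos_slice=-1)

per_layer_logit_diff = resid_stack @ answer_dir                    # [n_points, batch]
for label, val in zip(labels, per_layer_logit_diff[:, 0]):
    print(f"{label:>12}: {val.item():6.2f}")
```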
What's interesting is that the model shows almost no capacity to handle the task until we get to layer 22. And then, boom, attention layer 22 kicks in and almost all of the performance happens there; things get a tiny bit better, then a lot worse right after layer 23. It's not just a smooth upward trajectory; there's a clear peak followed by a clear descent.
So, what's going on here? It's a strong signal that layers 22 and 23 are doing something really specific: writing to the residual stream in a way that allows the model to solve the task. This insight gives us a clear direction: we need to figure out what kind of computation these layers are performing. It opens up exciting questions: how do the attention layers (which move information around) compare with the MLPs (which process information) in their contribution to this spike? Within those attention layers, which heads are doing the heavy lifting? And what exactly is going on in the residual stream?
This is where things get really fun. Having narrowed down the problem, we can now start isolating the mechanisms and digging into specific computations. Repeating the previous analysis, but broken out by mechanism for each layer, shows how to begin the narrowing process.
It's clear that attention layers matter a lot, and I'm not too surprised. I would imagine that the ToM task is centered around moving information around: pulling John's believed location of the cat into focus while ignoring or forgetting the actual location of the cat. While there is minimal processing by the MLPs that matters (perhaps some level of understanding context is processed here), which warrants investigation, the emphasis is on the attention.
What's particularly interesting is that attention layer 22 gives us a big boost in performance, but then things take a turn: MLP layer 22, attention layer 23, and subsequent MLP layers actually make things worse. So the attention mechanism is crucial, but there's a point where additional layers start to hurt more than help. This kind of dynamic tells us something important about how information flows through the model and where it can break down.
We can further break down the output of each attention layer by looking at the sum of the outputs of each individual attention head. Every attention layer consists of 8 heads, and each head acts independently and additively to influence the final result.
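The analogous per-head breakdown, again using TransformerLens's cache helpers and the setup above (illustrative, not the original analysis code):

```python
# Attribute the final-position logit difference to individual attention heads.
per_head_resid, head_labels = cache.stack_head_results(
    layer=-1, pos_slice=-1, return_labels=True
)                                                                  # [n_layers*n_heads, batch, d_model]
per_head_resid = cache.apply_ln_to_stack(per_head_resid, layer=-1, pos_slice=-1)
per_head_logit_diff = per_head_resid @ answer_dir

top = torch.topk(per_head_logit_diff[:, 0], k=5)
for idx, val in zip(top.indices, top.values):
    print(head_labels[idx], f"{val.item():.2f}")                   # e.g. "L22H4", ...
```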
Interestingly, while there is positive activity that contributes to the prediction of the ToM task, only a few heads really matter. Many heads contribute (it's possible that this distributed behavior is somehow important), but their activations appear quite weak. Head 3 at layer 0, head 4 at layer 22, and head 3 at layer 23 contribute positively to varying degrees, which makes sense given the previously observed behavior of attention in layer 22 and good performance until layer 23. On the flip side, head 7 at layer 18 and heads 5 and 4 at layers 23 and 25 respectively hurt the model a lot.
There are a couple of big meta-level takeaways here. First, even though our model has 8 attention heads in every layer, we can localize the behavior of the model to just a handful of key heads. This strongly supports the argument that attention heads are the right level of abstraction for understanding the model's behavior.
Second, the presence of negative heads is really surprising, like head 7 at layer 23, which makes the incorrect logit seven times more likely. I don't fully understand what's happening there yet, but it's definitely worth digging into more.
Looking back at the PCA output for layer 22, it's clear that the model is doing something interesting in terms of concept clustering. It appears to be distinguishing between actors (John and Mark) and objects, and it looks like the model is honing in on story elements that are crucial for ToM processing, in a way where we can clearly see a refined hierarchical representation.
Based on the PCA, it's possible that the ToM task aligns with the linear representation hypothesis[12][13], the idea that models pick up properties of the input and represent them as directions in activation space. When we dig into layer 22's PCA, a few interesting things stand out.
The PCA breaks down into three clusters of concepts: entities (John, Mark, cat), mental-state verbs (thinks, knows), and locations (basket, box, room).
In the residual stream pre, there is clustering of scene elements and actors, and the separation between different semantic groups looks linear. But in the residual stream post (the shared space where all layers interact), the separation is even clearer, aligning these clusters more tightly around token concepts: John and thinks, basket and the initial state, box and the current state. The clustering remains clear as the attention and MLP layer outputs are added back to the residual stream with updated relationships. The separation of "knowledge states" (e.g. what John knows vs. doesn't know, what Mark knows) appears linear. This makes sense because if tokens did not cluster within residual stream space, linear transformations across layers would be less informative and they wouldn't be meaningful.
It's clear that the model is keeping two separate but parallel "tracks": a belief track (where John thinks the cat is) and a reality track (where the cat actually is). The key thing here is that after Mark moves the cat, the two tracks split, but the belief track stays locked into John's original understanding. This suggests that the model is able to maintain two simultaneous yet distinct states, reality and belief, keeping them separate but interrelated. I could imagine that even as the sequence progresses, through Mark and John's actions, them leaving and returning, the belief state remains consistent.
What's also cool is that the PCA reveals these tokens cluster at consistently distinct distances, showing the same grouping across transformations. There's almost a hypothetical "boundary" within the MLP and residual-post layers, cleanly dividing what the model has learned about John, Mark, and their connection to the basket and the cat.
Attention heads are valuable to study because we can directly analyze their attention patterns: basically, we can see which positions they pull information from and where they move it to. This is especially helpful in our case since we're focused on the logits, so we can just look at the attention patterns from the final token to understand their direct impact.
One common mistake when interpreting attention patterns is to assume that the heads are paying attention to the token itself, maybe trying to account for its meaning or context. But really, all we know for sure is that attention heads move information from the residual stream at the position of that token. Especially in later layers, the residual stream might hold information that has nothing to do with the literal token at that position! For example, the period at the end of a sentence might store summary information for the entire sentence up to that point. So when a head attends to it, it's possibly moving that summary information, not caring that it's just punctuation. This can make it hard to assess what the attention heads are doing when tokens are being attended to.
But at the same time, I think when an attention head is attending to a token, it is accessing abstract information stored at that position.
In transformer architectures, each token position has a residual stream: a vector that carries information forward as the model processes each layer. We can think of the residual stream as the channel through which everything from earlier layers is communicated to later layers. It aggregates outputs from previous attention heads and MLPs, everything the model has "thought" so far.
Both attention heads and MLPs read from this stream, apply their edits, and then write the modified info back into the residual stream using linear operations (just simple addition). This linearity is key: it allows the input to any layer to be decomposed as the sum of contributions from various mechanisms across different layers.
By the later layers, the residual stream holds rich, high-level abstractions: syntactic structures, semantic relationships, and even summaries of phrases or entire sentences. This enables the model to map syntax onto semantics in a powerful way. Attention heads read from specific positions in the residual stream and write new information to target positions, which helps move abstract, context-heavy information around, independent of specific tokens.
Going back to our period example, at the position of a period at the end of a sentence, the residual stream might hold a summary of the entire sentence rather than just the token embedding for the period itself. This layered representation is built up across attention blocks and MLPs, incorporating syntactic roles, semantic meanings, and sentence structure. Attention patterns help transfer these complex, high-level abstractions between positions, enabling the model to handle hierarchical structures.
As the model processes information, each layer can access everything from the residual stream but focuses on specific directions that are relevant for the task based on the similarity of information held between mechanisms. After aligning with the directions it needs, the model writes the information to another mechanism. The flow of information between mechanisms depends on how similar the directions in the residual stream are, guiding the movement of abstract information across the model.
More on how transformers process information using linear algebra here.
Rather than the input needing to go through every single layer of the network, the model can choose which layers it wants information to go through via the residual stream and what paths it wants to send information to. This is why we can expect model behavior to be kind of localized, so as the input goes through each mechanism, not every piece of the input will receive an activation.
The model is using the residual stream to achieve compositionality between different pieces of information. For example, there could be some attention head in layer 2 that composes with some head in layer 22. Technically, this ends up looking like some head in some early layer will output some vector to the residual stream, then some head in a layer downstream will take as an input the entire residual stream and mostly focus on the output of the previous layer, then run some computation on it. For any pair of composing mechanisms in the model, they are completely free to choose their own interpretation of the input, so there's no reason that the encoding of the information between head 0 in layer 0 and head 5 in layer 3 will be the same as the encoding between head 2 in layer 0 and head 3 in layer 1. While extremely useful for capturing long-range dependencies, this means we can expect the residual stream to be difficult to interpret.
So, what's happening here is the model builds up hierarchical representations of language (words within phrases, phrases within sentences, sentences within paragraphs) and tracks sequences of events, which is particularly important for tasks like ToM, where understanding the order of events, actor actions, and possibly even directional or spatial information is key.
In this framework, attention heads work like routers, directing specific pieces of information to the right places to solve the task. They aren't just focusing on literal tokens but transferring abstract concepts like "the last place John saw the cat", which are encoded in the residual stream.
In any case, it's easy to get tricked if you think an attention head is just focusing on a literal token. We should be looking at this information alongside the information stored in the residual stream at that position, which often contains more abstract concepts.
To trace which parts of the model's attention are key for this task we need to dive deeper into the attention patterns. Specifically, we want to see how the model attends to tokens related to John. One approach is to track the activations of key tokens (John, basket, box, cat) across layers and show how their representations evolve. Another approach is pinpointing which attention heads contribute most to predicting basket. By combining these methods we can zero in on heads that attend to what John believes about the cat's location.
Looking at the most basic units of computation in the attention heads will give the most fine-grained account of what is happening when the model is processing information to be sent to the MLPs. The attention mechanism weighs the importance of different parts of the ToM passage, and each attention head computes three components: queries (Q), keys (K), and values (V).
The way QKV attention works is sort of like how a search engine operates. Imagine you're looking for a video on YouTube: the text you type in the search bar is your query. The search engine then compares that query to a bunch of keys, like video titles, descriptions, and tags stored in its database. Finally, it retrieves and ranks the best-matching videos, which are the values, and then you get the result: the output. So attention is basically about mapping a query to the most relevant keys and pulling out the corresponding values. This allows attention heads to specialize: some heads prioritize token alignment (through Q and K), while others focus on aggregating and relaying information (through V and O).
In somewhat technical terms, the values for the QK vectors control how much attention each token pays to others within the attention mechanism. A larger Q relative to K suggests the current token is more strongly driving the attention, meaning it's "searching" for relevant information to attend to. On the other hand, when K is larger than Q, it indicates that the token associated with K is drawing more attention from other tokens; essentially, it's being "attended to". The Vs hold the actual information or features from the input tokens and play a crucial role in determining what information is passed forward once the attention scores between Q and K are calculated.
However, it's important to note that the relative sizes of Q and K don't directly determine who is "doing the attending". Instead, both vectors interact through dot-product attention: Q represents the token initiating the attention (the one trying to find relevant content), and K represents the token being attended to (the potential source of relevant information). The attention scores are computed based on the interaction between Q and K, meaning both vectors play a role in deciding where attention is focused. The difference in their values might offer clues about the roles of specific tokens in the attention process, but both vectors contribute to the overall mechanism.
Another way to think about all this: attention is fundamentally about where not to look. Similar to the vital role selective attention plays in human cognition, multi-head attention gives models the ability to focus on some things and ignore others. This allows them to pick out salient information from input sequences to pursue their objectives in an ordered way.
Selecting a few heads across layers, we can see how things are playing out in the context of the last token basket being predicted. Specifically, for the asymmetric patterns in layer 22, head 4:
The pattern shows the model is attending strongly to both the initial state (cat on basket) and the intermediate state (cat moved to box). The high query attention to the initial basket placement suggests the model understands this is relevant to John's belief state, and it even captures John in the initial state with high attention activations relative to Mark.
In the context of predicting the final token basket, the value contributions from both basket and box at their earlier positions in the sequence show that the model is tracking both possible locations of the cat: the real state (cat on box) and John's believed state (cat on basket), with the highest value contributions emphasizing tokens important to resolving the false belief and passing that information forward to other layers and heads.
The strong attention to the position where John first moved the cat makes sense, since that's what John last saw before leaving. The model appears to be using this head to integrate information about object locations and subject knowledge states. Given the previous analysis, whether this head is an induction head or not, it's key to some belief-state emphasis, and likely follows a collection of heads that build up to this.
More formally, for each token position i we have query, key, and value vectors Q_i, K_i, V_i. The attention score from position i to position j is

score(i, j) = softmax((Q_i · K_j) / √d_k)

and the output for position i is

out_i = Σ_j score(i, j) × V_j

For head 4 of layer 22, the QKV vectors for the attention mechanism look something like this:

Q_basket ≈ 1.0 (tall blue spike), K_basket ≈ 0.3 (red line), V_basket ≈ 0.8 (tall green spike)
K_box ≈ 0.2 (red line), V_box ≈ 0.4 (medium green spike)

score(basket, basket) = softmax((Q_basket · K_basket) / √d_k) ≈ softmax((1.0 × 0.3) / √64)
score(basket, box) = softmax((Q_basket · K_box) / √d_k) ≈ softmax((1.0 × 0.2) / √64)

out_basket = score(basket, basket) × V_basket + score(basket, box) × V_box
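A toy numeric check of this arithmetic (the values are the illustrative ones above, not real model activations):

```python
# Toy attention computation matching the worked example: a single query over two
# key/value positions (basket and box).
import math
import torch

d_k = 64
q_basket = 1.0
keys = torch.tensor([0.3, 0.2])       # K_basket, K_box
values = torch.tensor([0.8, 0.4])     # V_basket, V_box

scores = torch.softmax(q_basket * keys / math.sqrt(d_k), dim=0)
out_basket = (scores * values).sum()
print(scores.tolist(), out_basket.item())   # both locations contribute, weighted by attention
```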
Here the tall blue spike for basket is implemented via the strong Q vector weighting, helping the model search for or focus on John's initial belief state, and the strong green spikes for both V vector positions for basket and box carry location information. The moderate red activity combines both states weighted by attention scores, allowing the model to maintain a strong representation of John's initial belief state of the basket location (false belief, contradiction), to track the current state of the box location (true belief, reality), and to weight them appropriately for belief-state tracking.
In terms of linguistic representations, the attention patterns show action-state verb agreements, which track state changes through verbs. There is small but consistent attention to prepositions like on and off that describe spatial relationships, which work together with the objects (basket/box) to establish location states. There's also attention around verbs that relate to mental states, like knows and thinks, marking John's contemplative state.
In relation to this, we can also see the suppression of the actual current state (cat on box) in favor of the believed state (cat on basket). This suppression seems to mostly operate in layers 23 and 25, heads 5 and 4. So it's possible these heads maintain the activation of basket while relatively suppressing box, which would help preserve John's false belief about the cat's location. This can be observed in several other ways:
Attention patterns: stronger attention to basket and relatively lower attention to box, and this holds for basket across all instances of the token in the sequence, where box activations are relatively low.

Activation patterns: basket consistently has higher activation than box, despite box being the actual current location of the cat.
.Activation patching is a super useful technique where internal activations in a neural network are replaced to target specific model behaviors and circuits. It allows us to choose which part to change so we can learn more about the model.
The obvious limitation of the techniques we've used so far is that they only focus on the final parts of the circuit, the bits that directly affect the logits, and they only show correlations at best. That's useful, but clearly not enough to fully understand the whole circuit. What we really want is to figure out how everything composes together to produce the final output, and ideally, we'd like to build an end-to-end circuit that explains the entire behavior.
This is where causal tracing comes in. First introduced in the ROME paper (although the history of the technique can be traced back to Judea Pearl), activation patching lets us dig deeper into the model's internal computations. Here's how it works:
Patching into a transformer can be done in a bunch of different ways (e.g. values of the residual stream, the MLP, or attention heads' output). If you want to get really granular, you can patch at specific sequence positions (not shown). This flexibility lets us explore different components of the model and figure out exactly where certain behaviors are coming from.
You run the model twice: once with a clean input (original) that produces the correct answer, and once with a corrupted input (counterfactual) that doesn't. The trick is that during the corrupted run, you intervene by patching in an activation from the clean run at a specific point in the network. Basically, you replace the corrupted activation at a certain layer and position with the corresponding clean activation and then let the model continue its computation. The key insight is that by measuring how much this patch shifts the output toward the correct answer, we can assess the importance of that particular activation.
By iterating over lots of different activations, you can map out which ones matter. If patching a certain activation makes a big difference in pushing the model toward the right answer, it tells us that activation is important for the task. In other words, activation patching functions as a denoising algorithm. In this approach, we run the model on a corrupted input then introduce the clean input by patching in activations from the clean run. The flip side is noising, where we start with a clean input and patch in activations from the corrupted run, effectively adding noise.
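A minimal sketch of that denoising loop over the residual stream, assuming the earlier setup; corrupted_passage is a hypothetical stand-in for the adjacent-token-swapped input (which must tokenize to the same length as the clean one):

```python
# Denoising: run on the corrupted input, patch in one clean activation at a
# time, and measure how much of the clean logit difference is recovered.
# (This is slow: one forward pass per layer/position pair.)
from functools import partial

clean_tokens = tokens
corrupt_tokens = model.to_tokens(corrupted_passage)     # hypothetical corrupted input
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(corrupt_resid, hook, pos):
    # Replace the corrupted residual stream at one position with the clean one
    corrupt_resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return corrupt_resid

n_layers, seq_len = model.cfg.n_layers, clean_tokens.shape[1]
results = torch.zeros(n_layers, seq_len)
for layer in range(n_layers):
    hook_name = utils.get_act_name("resid_post", layer)
    for pos in range(seq_len):
        patched_logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(hook_name, partial(patch_resid, pos=pos))],
        )
        results[layer, pos] = logit_diff(patched_logits)
```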
With noising, just because performance drops when you ablate a component doesn't automatically mean it was necessary for the task. For example, if you ablate layer 0 in Gemma-2-2B, performance gets much worse across a bunch of tasks, but that doesn't mean layer 0 is specifically crucial for the ToM task; it seems to function more like an extended embedding layer, which is useful for processing tokens but isn't doing anything specific to ToM. The key point is that noising can lead to some ambiguous results, while denoising tends to give clearer answers.
The ability to localize computations like this is huge: if the model's computations are spread out all over the place, it's going to be much harder to form a clean, understandable story of what's going on. But if we can pinpoint exactly which parts of the model matter, we can zoom in, figure out what they're representing and how they're connected, and ultimately have another useful tool for reverse-engineering the circuit responsible for the observed behavior.
An important thing to note is that these head-level functions are not neatly isolated; they are distributed and overlapping across multiple positive and negative attention heads. For instance, several heads likely work together to represent the "mental state", and many of these heads also contribute to other tasks. Suppression-like activity, for example, doesn't come from a single head; it emerges from the interactions between multiple heads throughout the network.
Specifically, heads 14.3, 16.2, 20.2, and 25.5 all show evidence of negative behavior on the final prediction. Each head has strong Q attention and low V attention to the basket token, and either Q or V attention to box. The most frequent and strongest activations happen in the middle of the sequence.
Diving deeper, this plot shows how the model's thinking evolves layer by layer. The blue regions indicate where patching helped the model get closer to the correct prediction basket, red regions show where patching hurt (pushing it towards box), and white regions indicate neutral activations (neither positive nor negative; transplanting the clean run into the model has no effect on those layers/positions). The clean run is the uncorrupted input, where the model gets things right (John thinks the cat is on the basket). The corrupted run comes from swapping adjacent tokens, which messes up the meaning of the sentence and leads to wrong answers. The goal is to patch activations from the clean run into the corrupted one at various layers and sequence positions and see how much it improves the model's logit difference (i.e., how much closer it gets to predicting the correct answer).
Patching the basket token in layer 1 of the corrupted run gives a massive boost, almost recovering full performance. But as we move to later layers, significant activation changes happen at the the token, which is the token right before the position of the final prediction. This shift hints at something important: the model first focuses on where the cat was (on the box), and later on it shifts to what word needs to be filled in (basket vs. box). There's a super interesting pattern starting from the box token in layer 0 and running up to the final the token in layer 25, which implies a distinct computational flow across the model's layers. Early on (layers 0-10), it's all about the box token (likely where the model locks in the idea that the cat was on the box).
Between layers 10-15, the patching impact spreads more evenly across the key tokens. This is probably where the model is pulling everything together, building up a complete understanding of what's going on and learning about the box vs. basket contradiction. Then, by layers 20-25, the focus shifts hard onto the leaves token and the final the token; this is where the model is deciding which word (basket vs. box) to predict. While patching basket is super helpful in early layers, it starts to hurt later on (the negative regions). It seems like the model needs to remember the cat's second position (box) early on but then "forget" it by the end to make the right call (basket).
One cool takeaway is how localized the effect is: patching just a few tokens or layers can fix a lot of the model's mistakes. It's not spreading the info out evenly across the whole network. Instead, there's a very directed flow of information from box to the over time, as if the relevant information for choosing basket over box is stored at the box token, located at the position in the passage where Mark moved the cat.
This fits with the bigger picture: earlier layers encode the critical scene details (e.g., Mark moving the cat), while early and midstream activations are key for representing changes in location (whether the cat ends up on the basket or the box). The whole process aligns with the previous attention analyses: early layers set up the scene, mid layers handle object movement and maintain the scene, and late layers focus on reinforcing John's false belief.
Another takeaway is how models seem to encode and summarize abstract information at specific token positions that act as structural anchor points[14]. Specifically, the tokens box and leaves stand out. Their isolated response to patching suggests that rather than Mark or John's belief state being directly moved to the final token, or attending back to the original source tokens, the model compresses and aggregates causally relevant information at the intermediate tokens box and leaves in the residual stream, with box representing the object's location and leaves representing Mark's action.
Then the token the takes on a final aggregation role, pulling everything together before the prediction at layer 22. By the time the prediction happens, all the information from earlier in the context is funneled through these "storage" positions. As a result, these tokens become just as important, if not more so, than the constituent parts of the sentence that originally introduced the information.
Weak evidence in this analysis suggests that the summarization motif may not just be for sentiment, but might be a general mechanism models use to track and update information about sequential events; the model uses these tokens as reference points to maintain a coherent representation of the scene, in this case for ToM. This behavior was discovered by patching clean residual activations for content (box, leaves) and functional (the) tokens into a corrupted run at specific layers to isolate their contribution.
The baseline logit difference for the clean run is 16.52. But when clean activations are patched in for box, leaves, and the, the logit difference increases by 36%. This implies that the model relies on these positions to store contextual information and that these tokens play a central role in the model's predictions.
blocks.2.hook_resid_post: Original: 16.517208099365234, Content Tokens: 119.10560607910156, 'the': 119.10560607910156
blocks.10.hook_resid_post: Original: 16.517208099365234, Content Tokens: 22.416271209716797, 'the': 22.416271209716797
blocks.22.hook_resid_post: Original: 16.517208099365234, Content Tokens: 6.501191139221191, 'the': 6.501191139221191
blocks.25.hook_resid_post: Original: 16.517208099365234, Content Tokens: 18.57611846923828, 'the': 18.57611846923828
We also see a sharp divergence in logit differences between the original run and the runs with patched content tokens and the the token at the end of the sequence. Early layers thus play a foundational role in encoding token-specific information, building up representations of individual tokens, including both semantic (content) and functional (grammatical) tokens. Divergence here reflects that removing or altering these tokens disrupts the encoding process at these layers.
Original Logit: 16.517208099365234
Ablated Logit (Content Only): -17.1851749420166
Ablating the content tokens causes the logits to flip dramatically, dropping to a large negative value. Ablating the functional token the alongside the content tokens doesn't worsen the result. This suggests that the on its own cannot contribute meaningfully to the prediction without the content tokens; its role as an anchor point seems to depend on their presence.
It's plausible that attention heads focus on the to pull in information from the content tokens, because patching the produces effects that closely mirror those of the content tokens across layers. This aligns with the observation that patching the has a similar causal impact to patching content tokens: it's not acting independently but rather facilitating the aggregation of meaningful context.
To investigate the ToM direction in the model's representation space, Distributed Alignment Search (DAS), an optimization method that finds the internal direction whose manipulation most strongly and causally influences the model's outputs, was used to identify a vector in activation space that aligns maximally with correct predictions on the ToM task. This ToM direction was then tested for its causal role in the model predicting John's believed location of the cat. Activations were projected onto the ToM direction and ablated, with both single and combined directional ablations used to assess its significance.
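A sketch of what a directional ablation like this can look like in code. Here tom_direction (a unit vector in residual space found by DAS) and anchor_positions (the indices of the box, leaves, and the tokens) are hypothetical names I'm introducing for illustration, not artifacts from the original run:

```python
# Project out the ToM direction from the residual stream at the anchor positions.
from functools import partial

def ablate_direction(resid, hook, direction, positions):
    # Remove the component of the residual stream along `direction`
    comp = (resid[:, positions, :] @ direction).unsqueeze(-1) * direction
    resid[:, positions, :] = resid[:, positions, :] - comp
    return resid

hook_name = utils.get_act_name("resid_post", 22)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(hook_name, partial(ablate_direction,
                                   direction=tom_direction,        # hypothetical unit vector from DAS
                                   positions=anchor_positions))],  # hypothetical token indices
)
print("logit diff after directional ablation:", logit_diff(ablated_logits).item())
```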
Ablating the ToM direction caused clear accuracy drops, highlighting its importance for belief representation. Specifically, the box, leaves, and the token positions at layer 22 played a critical role in task performance. Combined ablations had the most dramatic impact, causing accuracy to plummet from 0.625 pre-ablation to 0.0 post-ablation, a striking -0.625 change. This suggests that the ToM direction, along with these specific token representations, is central to the model's ability to summarize context before making the final prediction.
This is fascinating because classical constituency theory suggests that understanding something like the cat is on the basket would require the model to explicitly encode a representation of cat. If you interfere with the model's ability to represent cat, it should break down on tasks involving that idea, similarly to how intervening on tokens intermediate to the location prediction inhibits the prediction. This principle is widely used in visual psychophysics to study encoding: you knock out specific pieces of information and see what breaks.
If interfering with a representation prevents the system from performing, you've identified something integral. In the context of transformers, this plays out as behavioral implications of compositionality: you can test and observe how ToM directions in the residual stream encode early context and carry it forward to influence later semantics. This lines up with the nature of ToM tasks, which require tracking both believed and actual object locations.
The model appears to leverage multiple token positions (box, leaves) to maintain belief-relevant activations in parallel, processing different facets of the belief state simultaneously. There's a clear progression: early context tokens like box and leaves store critical information, which is then funneled into the token the for final processing. This demonstrates a funky, structured, memory-like pipeline where information flows through specific points in the residual stream, enabling the model to piece together belief-related representations over time.
The activation patching results broken down by queries, keys, values, and outputs shed a brighter light on what's happening inside attention heads across layers. Let's step back for a second: each attention head does two key things, 1) deciding what information to move and where to attend to that information (governed by the attention pattern, controlled by the QK interaction), and 2) deciding what information to move forward (handled by the V vectors, influenced by the OV projection). By patching these components separately, we can tease apart which factor is doing the heavy lifting.
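A sketch of how patching a single head's output (z) looks in this framework, reusing the clean/corrupted setup from the patching sketch above (the layer/head choice is just an example):

```python
# Restore one head's output (z) from the clean run inside the corrupted run.
def patch_head_z(z, hook, head):
    z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
    return z

layer, head = 22, 4
patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("z", layer), partial(patch_head_z, head=head))],
)
print(f"logit diff with {layer}.{head}'s output restored:", logit_diff(patched_logits).item())
```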
In the z plot (output vectors), patching outputs from certain heads noticeably shifts the model's output from box to basket, particularly in the last 5-10 layers. The behavior is pretty distributed, but some heads stand out: 16.7, 17.6, 22.2, 22.4, and 25.4 have the largest positive impact, while 0.1, 3.1, 6.1, 8.1 (all sharing the same head position across earlier layers, which is very interesting), 12.2, 14.3, 17.3, 16.2, 20.2, and 23.5 have the largest negative impact.
Looking at the q plot (Q vectors), we see familiar patterns: the negative heads in particular are pretty impactful, suggesting that modifying the queries' focus is key for steering the model away from inaccurate outputs. This signal shows up across early, middle, and late layers, possibly reflecting the model's attempts to align with the "true belief".
The k plot (K vectors) is less clear, though heads like 14.1 and 17.2 show some influence. Finally, the v plot (V vectors) highlights a few key heads, with 22.1 and 22.2 standing out. Since the Vs represent the actual information passed through attention, heads with influential Vs directly shape the model's final predictions.
The analysis reveals that some attention heads are consistently impactful across Qs, Ks, and Vs, while others are more specialized. For instance, head 23.5 influences both Qs and outputs, while head 2 targets Ks and Vs. In layer 17, head 3's Q plot shows a subtle negative activation shift, indicating how the model adjusts its belief about the location of the cat. The head assigns high activation to box and lower to John, suggesting a balance between factual grounding and perspective-taking. This adjustment becomes clearer by layer 22, head 4, where the model confidently determines box as the true location, discounting John's outdated belief.
My hypothesis? Qs and Ks encode separate perspectives. Qs represent the model's mental model of the cat's location from the perspective of the subjects, Ks encode the objective reality, and Vs carry the actual belief being passed forward (true or false). Zs (outputs) then act as the final arbiter, integrating these signals into the model's prediction. It's this interaction, Qs driving belief updates, Ks grounding reality, and Vs carrying the nuanced information, that nudges the model toward its final answer. It's possible to see this play out at a finer scale with causal evidence at the QKVO dimension level, where individual dimensions in the attention mechanism are linked to input tokens.
How might the behavior of a model change if we selectively replace the output of attention head A directed toward head B (where B follows A in the computation sequence) with the corresponding value from a different input distribution, while keeping all other components unchanged? What if we do this across different head types? Path patching will shift the focus from evaluating the isolated importance of individual attention heads to understanding the functional role of the circuit formed by their connection.
This causal intervention captures the complex interdependencies between attention heads and shows how the model's circuitry works together to solve the ToM task. The experiment defines attention head groups (e.g., "previous token heads", "induction heads") identified by a set of metrics that determine whether a model's attention head is acting like a specific head from the head group[15]. Multiple path patching experiments are run to compute the clean and corrupted logits, the activations from heads that send information into receiving heads are patched, and the logit difference is measured to calculate the impact on model output.
After this process we have a couple of artifacts. The first is a plot of sender-receiver pairs (y and x axes) that shows how information flows through the network, revealing the circuit's structure. A positive effect means that patching the sender's activation in the receiver's context tends to increase the difference between correct and incorrect logits (improving correctness or pushing in some direction), while negative values push in the opposite direction (blue). Each cell represents how much patching the activation from a sender head to a receiver head affects the model's performance.
The idea is that you take the activations from a "sender head" in the corrupted scenario and insert them into the clean run at the same point, effectively asking: how does changing what this one head writes change the final output and the heads downstream?
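To make the intervention concrete, here is a minimal sketch in the TransformerLens style. It is not a full path-patching implementation (which would also freeze every route that bypasses the receiver head); it simply patches one sender head's output from a corrupted run into a clean run and re-measures the believed-vs-actual logit difference. The prompts, the corruption, and the assumption that " basket"/" box" are single tokens are all illustrative.

```python
# A minimal sketch of the sender-patching step, not the full path-patching procedure
# (which would also freeze all paths that do not run through the receiver head).
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gemma-2-2b")  # assumes TransformerLens support for Gemma-2

clean_prompt = (
    "John puts the cat on the basket. Mark moves the cat to the box "
    "while John is away. John thinks the cat is on the"
)
corrupt_prompt = (  # locations swapped; assumed to tokenize to the same length as the clean prompt
    "John puts the cat on the box. Mark moves the cat to the basket "
    "while John is away. John thinks the cat is on the"
)
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
believed = model.to_single_token(" basket")  # John's (false) belief; assumed single tokens
actual = model.to_single_token(" box")       # the true location

def logit_diff(logits):
    """Believed-minus-actual logit difference at the final position."""
    return (logits[0, -1, believed] - logits[0, -1, actual]).item()

_, corrupt_cache = model.run_with_cache(corrupt_tokens)
sender_layer, sender_head = 5, 4  # e.g. previous token head 5.4

def patch_sender(z, hook):
    # Overwrite the sender head's output with its value from the corrupted run.
    z[:, :, sender_head, :] = corrupt_cache[hook.name][:, :, sender_head, :]
    return z

patched_logits = model.run_with_hooks(
    clean_tokens,
    fwd_hooks=[(utils.get_act_name("z", sender_layer), patch_sender)],
)
print("clean:", logit_diff(model(clean_tokens)), "patched:", logit_diff(patched_logits))
```

Sweeping this over every (sender, receiver) pair, and restricting the patch to the path into a specific receiver, is what produces the sender-receiver heatmap described above.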
The darkest blue patches appear when 5.4, a previous token head, is the sender, and 12.3, a previous token head, is the receiver.
Layer 5 Head 4 → Layer 12 Head 3
effect_of_head_to_head: -1.3715
This suggests 12.3 is particularly sensitive to interference from other heads. Many of the strongest positive effects sit around 0.3 to 0.5, and the highest values tend to appear in interactions between later layers (L17-L23), although the receiver 12.2 shows very strong positive interactions (red) with middle layers, along with 17.6 (an induction head).
It appears that certain heads, particularly in layers 8 and 12, are critical junction points in the network, while later layers (especially around layers 17 and 22) are important for positive reinforcement of the model's computations. But why? By combining this causal understanding of the information flow between heads with the flow of QKV attention weights between heads, we can examine the correlations between attention heads, track the flow of information, and see how one head (sender) influences another (receiver) to form a layered composition.
Hundreds of heatmaps corresponding to all QKV compositions from the collection of identified attention heads
These heatmaps show what part of the text the model was focusing on when predicting particular parts of the cat's final trajectory. By extracting the Q, K, V, and O vectors for any head/layer, we can visualize specific compositions to analyze.
We can then understand which features (tokens like John, Mark, basket, and box) are encoded in each component, and measure how strongly each dimension correlates with these features. This provides a fine-grained view into how each component of the attention mechanism functions, compared to the coarser-grained analyses done earlier.
For example, a particular K-dimension receiving attention weights from an O-dimension might consistently activate whenever box appears, indicating that this output dimension is keyed to John's perspective. A Q-dimension might align with basket, linking that dimension to the original location of the cat. A V-dimension might respond to cat, encoding where and how the cat is situated at each step. By correlating these dimensions with the corresponding tokens, we can infer which components carry signals about characters, actions, or locations between heads.
In this particular heatmap, where 5.4 is the sender, 8.1 is the receiver, and the keys of 8.1 are attending to the output of 5.4, we can see duplicate tokens aligning with high strength, with the attention weights biasing Mark's perspective of where he moves the cat.
On tests run on a small, templatized dataset used to construct different false belief passages that structurally resemble the original ToM narrative, the model seems to have developed a systematic, multi-step process for solving the task, demonstrating its ability to track the protagonists' beliefs[16]. Early layers handle low-level tasks like syntactic dependencies, while middle layers focus on context-driven processing, identifying key facts like cat on box. By the time we reach the later layers, the model integrates this context and resolves ambiguities, landing on the correct conclusion (cat on basket) by using semantic attention patterns to disentangle competing perspectives.
Different heads specialize in distinct functions. Take layer 22 head 4: it's a fantastic (and likely) example of specialization in action. This head does a few key things:
Composes and maintains perspectives: It attends to tokens that represent the subject's belief. Check out this plot again. The sequence captures where John believes the cat will be located when he returns, and the head's query vectors attend to keys of tokens that occur earlier in the sequence and match downstream patterns.
The spikes for query, key, and value in this head appear concentrated on tokens earlier in the sequence, specifically in John's region, where basket and cat occur with high value contributions and box with significantly lower value contributions, indicating these tokens are central to the repetitive patterns in the sequence. The attention seems biased toward earlier occurrences of tokens like basket and cat, with stronger contributions for these earlier tokens in heads 2, 3, and 4 compared to the layer's other heads, showing a clear leftward bias and the model's capability to separate John's belief from Mark's belief.
Given the model's attention mechanism, this is by design. Each token can attend to any other token via learned queries and keys, no matter how far apart in the sequence they appear. The model can unify widely separated pieces of context (Mark left with the cat in sentence one, John returned but doesn't see it in sentence ten) into a coherent representation of who knows what.
Resilience through sparse, localized representations: What's interesting is that the roles the heads take on evolve across layers. The output of a head at one layer isn't just a simple transformation of what it did in the previous layer. There are complex interactions between heads and the residual stream, allowing the model to gradually shift its internal representation and get closer to solving the task as it moves through the layers.
One fascinating insight is how patching just a few key components (like specific tokens or heads) with activations from a clean run is enough to steer the model back to the correct answer. This suggests the model processes information in a sparse, localized way, breaking the problem down into specialized subtasks. It doesn't rely on a single brittle representation; instead, it layers insights, gradually refining its understanding over time. For example, the model identifies John as the belief holder early in the sequence and uses this as an anchor.
This insight flows forward through the layers, shaping how subsequent events are interpreted. The same approach applies across the narrative: the model maintains cohesive tracking of all linguistic elements by integrating earlier representations stored in the residual stream with new information from later layers. This long-range dependency management is key to its performance.
Sophisticated mechanisms for processing: Zooming out, the attention head analysis shows the model has developed specialized circuits for tracking perspectives, sequencing events in time, and suppressing conflicting information.
These capabilities allow it to handle false belief tasks by unifying "attention across tokens" with "attention across time" to maintain parallel states of the subjects' knowledge.
Localized circuit for belief tracking: It's worth noting how interventions and ablation experiments reinforce the idea that these capabilities are localized (e.g., heads exhibiting induction behavior show significant performance drops when ablated).
Thinking about how the model generalizes the task, given the data from analyzing the queries, keys, values, outputs, and the head effects via path patching, we can start to build a bigger picture of what is happening and start thinking about a circuit. A more in-depth qualitative analysis can be found here. Thinking about the circuit from a high level:
Previous Token → Duplicate Token: The outputs from early previous token heads are fed as queries, keys, and values into the duplicate token head (8.1). By capturing the same tokens from multiple angles, 8.1 maintains parallel, multi-perspective state representations, one for each subject or belief context, enabling the model to track what each subject knows or believes.
Duplicate Token → Induction: The multi-perspective states generated by 8.1 are then passed to the induction heads (L14-17). Here, they serve as queries that tap into specialized key-value patterns, refining each subject's actions, beliefs, and locations. This "induction" step crystallizes temporal relationships (who did what, and when) into coherent narrative arcs.
Induction → Copy Suppression: Once the induction heads have established these evolved belief states, they flow into the copy suppression heads (L14-23). The suppression layers use belief-oriented queries and keys to filter out redundancies or conflicting states (e.g., repeated mentions of Mark's actions vs. John's). This ensures the final narrative tracks each subject's perspective accurately without over-amplifying duplicates.
Copy Suppression → Final Integration: The output from the suppression phase is handed off to the late previous token heads (L16-23) for the ultimate integration. These heads arbitrate among the refined beliefs, subject-object bindings, and temporal events, consolidating them into a single coherent representation. The value vectors at this stage crystallize the final model output, ensuring that both real and believed states converge into an internally consistent conclusion.
The full circuit evolves from early semantic feature representations into layered belief-action integration: duplicate token heads maintain parallel representations of reality, induction heads refine these parallel states over time (capturing cause-effect or before-after relationships), copy suppression ensures the model doesn't mistakenly merge conflicting beliefs, and the final previous token heads produce the final prediction, melding all subject-object-location states into a unified narrative.
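For readers following along in code, the component groupings can be written down as plain data and reused across the patching and ablation experiments. This is only a partial sketch: just the heads explicitly named in the analysis are filled in, and the full circuit contains 28 heads.

```python
# Partial sketch of the proposed ToM circuit as (layer, head) groupings.
TOM_CIRCUIT = {
    "previous_token": [(5, 4), (12, 3)],  # early sequential processors
    "duplicate_token": [(8, 1)],          # parallel action-location state tracking
    "induction": [(17, 6)],               # L14-17 induction heads; only 17.6 is named explicitly
    "copy_suppression": [],               # early/mid/late suppression heads span L14-23 (indices omitted here)
    "late_previous_token": [(22, 4)],     # final integration heads, L16-23
}

def circuit_heads(circuit):
    """Flatten the component dict into a list of (layer, head) pairs."""
    return [h for heads in circuit.values() for h in heads]

print(circuit_heads(TOM_CIRCUIT))
```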
Each layer builds on prior patterns, maintaining Mark's actions as current-world events while keeping John's beliefs separate. The circuit appears to maintain a fundamental asymmetry between the two subjects, highlighting a meaningful cognitive distinction. The negative effects from ablation studies (particularly around 8.1 and 12.3) reveal critical integration points where parallel processing streams must be correctly combined to maintain accurate belief tracking.
The system balances belief preservation and action-driven updates, forming a dual-representation architecture: it tracks what Mark does to know the true state and what John believes to make the final prediction, while maintaining the separation between these two representations, ultimately yielding a model output that can differentiate between actual events and each subject's belief or knowledge state.
The linear representation hypothesis tells us that activations are sparse, linear combinations of meaningful feature vectors.
Dictionary learning aligns closely with the linear representation hypothesis[17], aiming to express complex data as a linear combination of simpler elements, or "basis vectors". These basis vectors form a dictionary (think of it as a data structure that holds key-value pairs), and when combined they can efficiently represent the original data, making it easier to analyze, compress, or reconstruct. In models, a dictionary of learned concepts with associated directions allows specific elements to be activated based on relevance to the input; for example, queen could be represented by a combination of female and royalty directions. Sparsity is key here, as most concepts are irrelevant to a given input, resulting in many feature scores remaining zero.
Sparse autoencoders (SAEs) extend this by learning both the dictionary and a sparse vector of coefficients for each input. They leverage the hypothesis that model internals operate as sparse linear combinations of these concept directions, providing a structured way to find interpretable directions in the residual stream, MLPs, or multi-head attention. Each latent variable in the autoencoder thus represents a distinct learned concept, enabling interpretable, causal insight into how the model organizes knowledge. They can do this because their objective is to reconstruct input activations: the hidden state captures the weights of meaningful neuron combinations, and the decoder matrix learns the dictionary's feature vectors.
There are many directions to find because of 1) polysemanticity, where many neurons fire for multiple, oftentimes unrelated features, and 2) superposition, where neural networks represent more concepts (features) than they have neurons and use linear combinations of neurons to represent these concepts.
Basically, neurons represent many different things and these things are spread across many different neurons. Because of superposition, we have a limited number of neurons for all our features, so there are lots of features and not so many neurons in any given activation space. But the irony is that the features are actually sparse, so only a few of them are active at any given time. This makes SAEs useful.
So we can take the activation vectors from attention, an MLP, or the residual stream and expand them into a wider, sparse space using the SAE, where each dimension is a new feature. Reconstructing the original activation vector from this wider sparse space yields complex features that the mechanism has learned from the input[18]. From this we can extract rich structures and representations that the model has learned.
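As a toy illustration of that idea (and only a toy: Gemma Scope's SAEs are trained at a much larger scale, with extra tricks for dead features and normalization), a minimal sparse autoencoder over residual-stream activations might look like this; the dictionary size and L1 coefficient are placeholders.

```python
# Minimal sparse autoencoder sketch: reconstruct activations as sparse combinations
# of learned dictionary directions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # rows of the dictionary (feature directions)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))        # sparse, non-negative feature activations
        return self.decoder(feats), feats

d_model, d_dict = 2304, 16384                      # Gemma-2-2B residual width; dictionary size illustrative
sae = SparseAutoencoder(d_model, d_dict)
acts = torch.randn(8, d_model)                     # stand-in for cached residual-stream activations

x_hat, feats = sae(acts)
loss = ((x_hat - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
loss.backward()
```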
The SAE suite used is Google DeepMind's Gemma Scope, and the output was visualized using Neuronpedia. Gemma Scope is a collection of hundreds of SAEs trained on every layer and sublayer of Gemma-2-2B and 9B. Using the trained SAE on the ToM passage, we can take features from layer 22 of Gemma-2-2B out of superposition and see which features in the model are activated.
The model has specific features dedicated to representing different aspects of the narrative in the residual stream. For example, feature 61 focuses on references to positions and locations in a narrative. This feature has a high explanation score[19], showing that the model is correctly isolating different narrative elements through distinct features.
These features suggest that the model is building an internal representation of the physical setup described in the passage, tracking where objects and subjects are placed. It's also clear that several features are responsible for keeping track of John and Mark's movements and actions.
The model also has features representing actions that directly change the scene.
One standout aspect of the model's capacity to track scene changes lies in its approach to temporal sequencing; it's almost like it's keeping a detailed record of event order. For instance, take feature 11786, which captures statements involving returning or coming back from a situation or event. This kind of specialized tracking is just one of many spatial and temporal features we find scattered throughout the residual stream, indicating the model's capability for not only understanding static states but also for representing the flow of actions as they unfold in time and space.
The residual stream, in particular, plays a key role as an information-preservation highway across the layers. For example, it receives inputs from 10.4 and relays them through to 14.0 and then to 17.3. Through this pathway, we can observe representations of actions forming within the residual stream itself, often refined further by the MLPs.
In the MLP features, we're seeing a recurring theme: feature 11284 looks like it's picking up on verbs associated with actions and states in a narrative frame. The action-related features are a lot clearer in the residual stream and MLPs. This is probably helping the model track actions in the story. Meanwhile, feature 5852 seems more tuned into verbs and phrases related to visual attention or perception, which may be important for encoding John's final act of scanning the room. The features in the MLP layer give the model structures for managing specific narrative events, helping it ground actions and observations.
Several features seem to be directly tied to representing belief states and knowledge. Feature 13597 is likely crucial for capturing John's lack of knowledge about what happened in the room while he was away. Feature 5107 probably signals the model's awareness of John's ignorance, potentially reflecting uncertainty and doubt. Feature 12703 could be involved in modeling John's thought process when he returns to the room, helping the model represent how John updates his beliefs. These features seem important for understanding how the model processes the scenario, especially when tracking subjects' evolving mental states.
Another key aspect for the ToM task is spatial processing. Similar to residual stream feature 81, feature 12441 likely tracks the positions of the cat, box, and basket, while feature 14364 seems to process how subjects and objects move around the room.
What's pretty clear from this is that the residual stream and MLP features in layer 22 show a high degree of specialization. ToM-related features are distributed across multiple distinct MLPs, suggesting the model doesn't rely on a single "ToM module". Instead, it integrates various aspects of reasoning to achieve ToM understanding.
The features range from low-level tasks (like tracking object positions) to high-level abstractions (like representing uncertainty and beliefs), showing a lot of nuance. Gemma seems to have developed specialized concepts for belief representation, spatial awareness, temporal sequencing, and handling contradictions, supplementing what we see in attention heads. It really speaks to the power of gradient descent; it's finding solutions and representations beyond what we'd initially expect. Take feature 5107, for example, where the model's activations correspond to uncertainty and doubt when modeling John's state of mind as he returns to the room.
Each head of the multi-head attention mechanism is an autonomous module with a specific, independent function, connected to other modules with similar or different functions. If specific attention heads can be grouped into components, they can produce functional clusters and synchronize across different positions in an input sequence. Because of this, it's possible that attention circuits can be viewed as modular hierarchies of interacting subcomponents, mimicking biological modularity, where functional clusters collectively represent high-level abstractions of the input data.
As a rough analogue to how fMRI scans capture distributed neural activations, attention heads shift focus across tokens, similar to how human brain regions activate based on focus and task demands. We can make this analogy by thinking about the parallels between functional lobes in the brain and the structure of a transformer's attention mechanism, a brain-like region. Each brain lobe in humans has a specialized role; for example, the occipital lobe handles vision, and the frontal lobe manages planning. Like lobes aiding decision-making by accessing relevant knowledge, attention heads work similarly, processing contextual knowledge to weigh parts of the input sequence.
If we zoom out from any single head, we can define specific groups of attention heads across layers as circuit components. From there, we can start mapping out how these components fire across the ToM passage, visualizing how attention heads interact with each other to solve the task. The methodology aligns closely with an approach where SAE concepts are seen as functionally coherent clusters[20]. Tests were run on a small dataset that uses different templates to construct false belief passages that structurally resemble the original ToM narrative.
The results show distinct ToM subcircuits: sets of attention heads that tend to cluster together and activate at key points during the task. These components act as cohesive units, each one relative to the others, activating or staying dormant at different sequence positions. This makes it possible to see which components have groups of heads that activate together across different contexts, and allows us to see how information flows through the model as it's making its predictions. For example, within the action-location state, certain heads may consistently activate with suppression heads, particularly when managing changes in the scene and beliefs about the scene in the penultimate state. So it's possible to see which specific heads within each component interact most frequently, giving insight into sub-patterns within the larger components. Think of it like t-SNE, but for attention circuits.
High activation values indicate components that are more strongly engaged, relative to components with low activation values.
The circuit forming the ToM pipeline is not strictly linear. Certain heads come online in tandem, sharing or passing crucial signals, which fosters the multi-perspective representation we see in the final output.
Starting with the co-occurrence matrix, all of the heads selected from the causal analysis are firing in relation to each other on a sliding scale, suggesting they work together to maintain and update state information about the subjects throughout the sequence at different co-occurrence rates.
The late suppression heads show particularly weak co-occurrence with other components relative to the other suppression mechanisms, and it is weakest with respect to the action-location state (duplicate token) heads. This suggests that while duplicate token heads are sending information to receiver heads in later layers, the late suppression heads often follow up to prune or reconcile repeated references, ensuring only the final, non-contradictory perspective remains.
While late suppression heads show weak co-occurrence with duplicate token heads, they have relatively neutral co-occurrence with initial subject state heads, suggesting that late suppression isn't indiscriminately suppressing all prior information, but rather selectively targeting action-location information, helping to maintain the "false belief" by specifically suppressing the true location information while preserving the subject's initial state understanding.
There's a clear hierarchical structure in the suppression heads (early, mid, and late), with strong co-activation between early and mid suppression heads but diverging activation patterns for late suppression heads. This highlights early suppression during initial state filtering, mid suppression during pattern refinement by the induction heads, and late suppression applying the final arbitration before prediction.
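One hedged way to build such a co-occurrence matrix is to call a head "active" at a position when the norm of its output exceeds a per-head threshold, then count how often pairs of heads are active together; the head list, threshold, and activity measure below are assumptions, and `model`/`clean_tokens` come from the earlier sketches.

```python
# Sketch: head co-occurrence from per-position output strength (hook_z norms).
import torch
from transformer_lens import utils

heads = [(5, 4), (8, 1), (12, 3), (17, 6), (22, 4)]  # subset of the circuit heads
_, cache = model.run_with_cache(clean_tokens)

# [n_heads, seq]: per-position output norm for each head
strength = torch.stack([
    cache[utils.get_act_name("z", layer)][0, :, head, :].norm(dim=-1)
    for layer, head in heads
])
active = (strength > strength.mean(dim=1, keepdim=True)).float()  # simple per-head threshold

# Fraction of positions where both heads are "active" together.
cooccurrence = (active @ active.T) / active.shape[1]
print(cooccurrence)
```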
In the temporal activation plots, cluster 2 (action-location state) shows the strongest activation during the intermediate state; it's presumably doing more complex bridging: integrating earlier scene states (cat on basket) with the next set of events (Mark moves the cat while John is gone), or maintaining parallel beliefs.
The suppression heads (clusters 0, 4, and 5) show interesting temporal patterns: early suppression heads show the most activation in the initial state and the least in the final state, mid suppression heads maintain relatively consistent activation levels and show the most activation in the final state, and late suppression heads ramp up to be most active in the penultimate state. This pattern suggests that early suppression heads are the only ones needed for the raw parsing at the beginning, and the mid/late heads become crucial at the end (e.g., final "state filtering" or ensuring the model's ultimate output aligns with John does not know the cat moved).
The intermediate subject state (induction heads, cluster 6) shows selective activation during the initial and final states, suggesting these heads may be crucial for maintaining and updating the model's representation of subject knowledge learned at the beginning of the passage, when the model needs to recall and apply patterns from earlier in the sequence to predict John's belief. That information is then transferred to the final state, where it co-activates with the previous/duplicate token heads and suppression to predict the correct location. So it's important for connecting earlier events back to later states.
The final subject state heads (cluster 1) show subtle, increasing activation from the intermediate to the final state, suggesting they integrate information to form the final representation of the subjects' belief states. Also, the intermediate subject state (induction heads) shows an interesting dip in activation during the penultimate state, right when the early/mid suppression heads (cluster 4) show increased activation. This could indicate a mechanism where the model temporarily suppresses ongoing state tracking right before the final state computation; pure speculation, but perhaps a kind of "reset" before integrating all the information for the final belief state determination.
Looking at the distribution of activation strengths across clusters in the second temporal plot, it's noteworthy that the action-location state shows the most extreme activation values. This suggests that tracking the parallel states of each subject's physical state/location in relation to the believed location of the cat, and integrating those semantic features in context, might serve as a kind of "ground truth" against which subject states need to be compared.
Everything here aligns with the QKV patterns seen from the ablation studies and interventions. The temporal activation patterns provide additional evidence that previous token heads serve as foundational sequential processors, and induction heads act more like specialized pattern recognition and recall mechanisms that are particularly important for handling long-range dependencies in the false belief task.
This again is confirming that the model maintains multiple representations of reality (actual locations) and beliefs (subject states) through coordinated activation of different head clusters.
Theory of Mind Circuit
Each component in the ToM circuit serves a specific role at different points in the sequence. The timing and strength of the activations suggest a well-organized circuit that tracks states throughout the narrative. It efficiently balances retaining initial knowledge, updating as the story progresses, and suppressing outdated information. This aligns with human-like belief updating, where new observations modify existing beliefs without completely discarding past knowledge. It's especially crucial for false belief tasks, as it supports reasoning about beliefs that differ from reality: understanding what John believes (cat on basket) versus what is actually true (cat on box).
The full circuit reveals a nuanced algorithm in its attention:
So the model builds the subject's false belief about an object's location by: 1) identifying belief holders and objects, 2) updating states and tracking who sees what and the cat's movement, 3) suppressing redundant tokens irrelevant to the belief holder, and 4) finalizing ToM inferences (John's false belief vs. Mark's real knowledge).
The resulting circuit confirms the causal intervention findings: certain heads and subcircuits consistently operate at specific narrative phases, and certain groups of heads rely on each other to build a coherent ToM representation. It's consistent with the broader conclusion that the model has learned a structured approach to time, location, and perspective.
At the final stages, the suppression heads play a key role. They show both positive and negative modulations between the QK mechanisms, enhancing and inhibiting specific connections as needed. Here, the value mechanism filters out information irrelevant to John's knowledge, ensuring only John's incorrect belief about the object's location is propagated to influence the model's final output.
The ToM circuit satisfies the three criteria discussed in Wang et al.[11] Minimality demonstrates each head's contribution to ToM capability via its direct impact on logit differences by component. The score, reflecting the percentage of the model's total logit difference (0.8365) attributed to each head, highlights the importance of each head to the task.
Average logit difference (ToM dataset, using entire model): 0.8365
Average logit difference (ToM dataset, only using circuit): 0.9373
Competing signals outside the circuit are removed, which likely restricted the full model's performance.
The ToM circuit hits all the key benchmarks: faithful (the circuit actually outperforms the full model, showing it captures the necessary functions), complete (all heads essential for each component are included), and minimal (the plot highlights clear specialization, with only a small number of heads carrying substantial weight).
Breaking it down, the ToM circuit shows concentrated importance in certain heads, with over 40% attributed to the previous token heads. This suggests that these heads keep a coherent grasp of where John thinks the cat is and are critical for handling ToM tasks.
Meanwhile, the duplicate token heads have smaller but non-negligible contributions, acting more as a supporting context provider, likely reflecting that they already did their job earlier in the sequence (e.g., building parallel states/bridging temporal transitions).
The circuit also shows a high degree of modularity: heads are highly specialized, with relevant computations neatly contained within each component. This limits interdependence with other network parts outside the defined circuit, showing a clean and compartmentalized structure.
Ablation studies are widely used in neuroscience, and they are super useful for neural networks as well. The idea is to systematically "remove" (or ablate) specific mechanisms, like neurons, layers, or attention heads, within the model to assess their contribution and see how much they really matter to overall performance.
When we mean-ablate the entire ToM circuit, performance drops by about 80.66%, a massive reduction in the believed-actual difference of the model's inference accuracy: the model's confidence in the basket token as the correct prediction.
Full Circuit Mean Ablation Results:
Number of heads ablated: 28
Original believed-actual diff: 0.836511
Ablated believed-actual diff: 0.162061
Total circuit effect: 0.674451
This suggests that these heads are working together in a highly interdependent way. The remaining believed-actual difference (~0.16, about 19% of the original) implies that outside the ToM circuit there's not much capacity left for correct prediction on ToM tasks, as expected. Component-wise, the early and mid suppression heads had the most significant effect on the decrease in performance when ablated, reducing performance by ~22.50% and ~45.14% respectively, highlighting the importance of copy suppression for this task.
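A rough sketch of the mean-ablation itself, again reusing `model`, `clean_tokens`, and `logit_diff` from the earlier sketches: in the real experiment the mean head outputs would be computed over a dataset of templated ToM passages, whereas here the mean over positions of a single run stands in for that.

```python
# Sketch: mean-ablate a set of circuit heads and re-measure the believed-vs-actual diff.
import torch
from transformer_lens import utils

circuit = [(5, 4), (8, 1), (12, 3), (17, 6), (22, 4)]  # subset of the 28-head circuit
_, cache = model.run_with_cache(clean_tokens)

def mean_ablate(z, hook, heads_in_layer):
    for head in heads_in_layer:
        # Replace the head's output at every position with its mean over positions.
        z[:, :, head, :] = cache[hook.name][:, :, head, :].mean(dim=1, keepdim=True)
    return z

fwd_hooks = []
for layer in sorted({l for l, _ in circuit}):
    heads_in_layer = [h for l, h in circuit if l == layer]
    fwd_hooks.append((
        utils.get_act_name("z", layer),
        lambda z, hook, hs=heads_in_layer: mean_ablate(z, hook, hs),
    ))

ablated_logits = model.run_with_hooks(clean_tokens, fwd_hooks=fwd_hooks)
print("ablated believed-actual diff:", logit_diff(ablated_logits))
```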
In neuroscience, it is widely known that if it weren't for inhibition coming from the frontal lobe, humans would be more prone to negative social actions. This inhibition is also a component of the process of selective attention and is manifested in the suppression of goal-irrelevant stimuli[22]. Copy suppression[21] heads in the ToM circuit are heads that respond to predictions made by prior heads and adjust the prediction logits negatively. These heads have the advantage of seeing all preceding context and intermediate predictions generated so far. By leveraging this, they can calibrate the model's confidence in predicting the next token, effectively fine-tuning the logits to suppress information before the final prediction is made.
Consider an induction head that's tracking a belief state. Suppose the model processes the sentence John put the cat on the basket, and the current token is the. The induction head predicts basket as the next token based on the context. This prediction is written to the residual stream and will be mapped to the logits for the final output. However, before the model commits to this prediction, the copy suppression mechanism kicks in. It performs post-processing on the logits by suppressing any outputs that have been previously seen but aren't relevant to the current context established by the induction head.
It could be nothing but napkin math, but I think this suppression is happening at each step of the attention mechanism, specifically emerging from the mechanism's dot product, softmax, and aggregation steps.
If attention scores are computed as:
\[\text{Score}(i, j) = \frac{Q_i \cdot K_j}{\sqrt{d_k}}\]where $Q_i$ (the query vector of token i) and $K_j$ (the key vector of token j) are real-valued vectors.
And if $Q_i \cdot K_j < 0$, the alignment between $Q_i$ and $K_j$ is negative, which means token j actively suppresses token i's focus. This could be a form of "implicit suppression", so suppression could be happening at the attention score level.
The attention scores are then transformed into weights via softmax:
\[\text{Weight}(i, j) = \frac{e^{\text{Score}(i, j)}}{\sum_k e^{\text{Score}(i, k)}}\]For negative attention scores $\text{Score}(i, j) < 0$, the exponential function $e^{\text{Score}(i, j)}$ maps the score to a small positive value close to 0. This effectively suppresses token j's contribution to the aggregated output, as the attention weight $\text{Weight}(i, j)$ becomes negligible, resulting in an exponential suppression via softmax.
The attention-weighted sum aggregates V vectors:
\[\text{Output}_i = \sum_j \text{Weight}(i, j) \cdot V_j\]Even though attention weights are non-negative, $V_j$ can contain negative values. If token j has been suppressed (i.e., $\text{Weight}(i, j) \approx 0$), its contribution to $\text{Output}_i$ is effectively erased, so suppression is reinforced at the aggregation of V.
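A tiny numeric illustration of the three steps above (scores, softmax, aggregation) with made-up numbers:

```python
# A strongly negative score collapses to a near-zero weight after softmax,
# so its value vector barely contributes to the aggregated output.
import torch

scores = torch.tensor([2.0, 0.5, -4.0])           # query-key scores for three tokens
weights = torch.softmax(scores, dim=0)            # ~[0.82, 0.18, 0.002]

values = torch.tensor([[1.0, 0.0],                # toy value vectors
                       [0.0, 1.0],
                       [-5.0, -5.0]])
output = weights @ values                         # the -4.0 token is effectively erased
print(weights, output)
```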
This suppression then propagates through subsequent layers. Early layers suppress irrelevant or conflicting tokens (e.g., "box"), while later layers reinforce relevant tokens (e.g., "basket") based on the model's evolving "belief", reflecting the model's ability to shift focus and selectively erase contributions dynamically across layers and heads. To quantify suppression and modulation in the ToM task, to identify heads and layers where suppression magnitude is highest, and to measure how suppression evolves for key tokens (e.g., "box", "basket", "John") across layers, revealing which components are most responsible for suppressing conflicting beliefs, we can formalize the behavior.
The suppression effect of token j on token i at a specific attention head h in layer l is:
\[S_{ij}^{(l, h)} = \frac{Q_i^{(l, h)} \cdot K_j^{(l, h)}}{\sqrt{d_k}}\]$S_{ij}^{(l, h)} < 0$: Token j suppresses token i.
$S_{ij}^{(l, h)} > 0$: Token j reinforces token i.
The suppression magnitude for token i is defined as the total contribution of all suppressing tokens:
\[\text{Suppression Magnitude}_i^{(l, h)} = \sum_{j : S_{ij}^{(l, h)} < 0} \left| S_{ij}^{(l, h)} \right|\]After softmax, the suppression effect becomes:
\[W_{ij}^{(l, h)} = \frac{e^{S_{ij}^{(l, h)}}}{\sum_k e^{S_{ik}^{(l, h)}}}\]$W_{ij}^{(l, h)} \approx 0$: Token j's contribution to token i is suppressed.
The contribution of token j to the final output vector of token i is:
\[C_{ij}^{(l, h)} = W_{ij}^{(l, h)} \cdot V_j^{(l, h)}\]The aggregated output for token i is then:
\[\text{Output}_i^{(l, h)} = \sum_j C_{ij}^{(l, h)}\]And the cumulative suppression of token i across layers is:
\[\text{Total Suppression}_i = \sum_l \sum_h \text{Suppression Magnitude}_i^{(l, h)}\]I think the activation patching and ablation studies justify this a little bit. In this framework, a large negative $Q_i \cdot K_j$ implies that token j is actively suppressed when predicting token i. In practice, if the "box" token is negatively aligned from the perspective of the "basket" token (or vice versa), it gets a near-zero attention weight, effectively erasing its contribution to the final output. When specific heads are zeroed out (through ablation) or patched (replacing corrupted activations with clean ones), you often see a marked swing in whether "box" or "basket" dominates the final logits. This direct cause-and-effect relationship is exactly what the formal model of suppression predicts:
Thus, the presence or absence of certain negative alignments, now made visible and manipulable by patching, validates the notion that negative dot products can strongly shape which tokens persist or vanish by the output layer.
Suppression magnitude is defined as $\sum \lvert S_{ij}^{(l,h)} \rvert$ and quantifies how much each token is suppressed across layers or heads. The bigger this magnitude, the more we expect the model to "forget" or "down-weight" that token in future computations. When patching in the "clean" activations for a heavily suppressed token (e.g., "basket" when the corrupted run incorrectly focuses on "box"), the negative alignment is effectively undone. The result is that "basket" reasserts itself in the final prediction, often flipping the model to the correct answer. The corresponding heatmaps reveal how heads that show large negative influences (manifest in strong negative activations or big negative deltas) translate directly into a more likely "box" in the output. When these negative signals are removed or overridden, the logit for "basket" goes up. So a high suppression magnitude in certain heads/layers matches large negative shifts in the final basket vs. box logits, providing evidence that this theoretical suppression measure correlates with real changes in predictions.
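A hedged sketch of how the suppression magnitude could be read off the model directly, using the pre-softmax attention scores exposed by TransformerLens (`hook_attn_scores`); `model` and `clean_tokens` are reused from the earlier sketches, and the example heads are only illustrative.

```python
# Sketch: per-head suppression magnitude at a query position, per the definition above.
import torch

_, cache = model.run_with_cache(clean_tokens)

def suppression_magnitude(layer: int, head: int, pos: int) -> float:
    """Sum of |S_ij| over keys j with a negative pre-softmax score at query position i = pos."""
    scores = cache[f"blocks.{layer}.attn.hook_attn_scores"][0, head, pos, : pos + 1]
    return scores[scores < 0].abs().sum().item()

final_pos = clean_tokens.shape[1] - 1
for layer, head in [(14, 1), (17, 2), (22, 4)]:  # example heads from the analysis
    print(layer, head, round(suppression_magnitude(layer, head, final_pos), 3))
```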
All prior QKVO breakdowns show that a single attention head's dot products can strongly affect the aggregated output vector for each token. If we can identify which heads produce large negative (or positive) attention scores, we know exactly where suppression is happening. When ablating or patching entire heads, some heads exert major control over whether the model ends up with "box" or "basket" as its final prediction, strongly suggesting that head-level interpretability is a practical abstraction for verifying how suppression is realized in the model during inference.
But there are tons of caveats here. In practice, a negative QK dot product alone does not guarantee the model will focus positively on the "next best option". Other tokens might also have low or modest QK alignment, so the end result depends on relative alignment scores across all tokens. Because each attention sublayer's output is added back to the residual stream, even if a token is suppressed in a single attention head, the residual from earlier layers can sometimes reintroduce features of that token. So a more rigorous account of copy suppression might also track how the suppressed signals are or aren't re-amplified in subsequent MLP sublayers (and whether subsequent heads focus on them anyway).
Essentially, while some heads focus on specific tasks (like predicting the next word based on the context of previous next-word predictors), other heads monitor the earlier predictions and adjust them, ensuring the model doesn't over-rely on copying tokens that aren't contextually appropriate. The degree of copy suppression is influenced by how much attention the model pays to the tokens it's considering copying. This aligns with the iterative nature of LLMs: they refine their predictions layer by layer, with each layer building upon the representations from the previous ones as information flows toward the final layers.
This is purely speculative, but I suspect the model might have the capability to represent second-order false beliefs: understanding that one person can hold a false belief about another person's belief. This could emerge from its ability to juggle parallel representations of both true and false information, potentially through mechanisms like copy suppression.
There's a lot more we don't know about these heads, and they probably have more complex behaviors that determine when it is good to suppress information and when it is bad.
There are key interactions that we can see backed by quantitative and qualitative evidence.
Circuit components have complementary timing in the way they activate across the sequence. Components are processed sequentially: the action-location state activates strongly in middle and later layers, components complement each other during belief processing, subject states and suppression head clusters show complementary patterns, previous token heads provide steady baseline processing, induction heads build up activations over the sequence, and copy suppression prevents simple copying at the end.
Out of 175 total attention heads in Gemma-2-2B's attention mechanism, 28 display a significant increase in ToM performance when isolated and a significant decrease in model performance when ablated. This is not an isolated result. In a separate study, element-wise analysis of LLM neurons found increased firing rates for isolated sets of neurons performing ToM tasks, compared with isolated human neurons that show consistent firing rates across similar false-belief tasks[23]. Both cases show how modular organization allows both systems to achieve efficient and focused reasoning for complex tasks.
Copy suppression helps the model maintain separate representations between what is actually true (reality) and what is believed to be true (belief), and this could have several implications for AI alignment. Because the model has learned to maintain multiple potentially conflicting "versions of reality", it has the capability for nuanced reasoning: understanding different perspectives, and possibly even lying. Investigating suppression mechanisms could be crucial for understanding how models might deceive. These capabilities could also be useful for alignment research and help with:
Could this be useful to safeguard against belief corruption? How reliable would this mechanism be for alignment? Does it scale to more complex belief systems? What are the failure modes, especially in edge cases?
A common critique of LLMs is that they rely purely on formal linguistic competence and therefore can't truly learn meaning in a deep sense[24]. However, when considering emergent understanding (the idea that models develop an implicit sense of meaning based on patterns in the data), it raises the question: how do these mechanisms effectively capture semantics to succeed at ToM?
One plausible hypothesis is that while induction heads primarily track formal patterns, semantic meaning embedded in those patterns gets absorbed through training. For example, repeated references to "the cat being on the basket" provide a robust contextual anchor. Although induction heads focus on sequence-level correlations, they often align with real-world semantics present in the training data. When a model predicts that the cat is in the basket, it might be leveraging a weakly implicit form of semantic understanding encoded in its layers.
This idea is particularly relevant in tasks requiring predictions about mental states or perspectives. Even if the model initially exploits distributional patterns, these patterns can align with semantic reasoning, implicatures, presuppositions, and other pragmatic cues. The model may not have a meta-understanding of Grice, but it does learn that certain implicatures typically follow specific contexts (e.g., "John thinks…" followed by actions that reveal mismatches between his belief and reality).
For example, deeper layers, say, layer 22, don't just pass through raw pattern data from earlier layers without operating on that data. Instead, they integrate information representing a mix of formal linguistic structure and contextual cues. Hypothetically, the model is blending formal relationships with the functional relationships encoded in the data because it has internalized the relevant linguistic forms. This raises another question: when the model predicts John's perspective in the ToM task, is it actually reasoning about John's mental state (functional competence)? Or is it just leveraging high-level linguistic correlations (formal competence) that happen to align with correct answers? I think there's a blurry line here: if structured, implicit grounding exists in the data, meaning can emerge from form.
Induction heads, while not explicitly designed to handle grounded semantics, may approximate grounding by exploiting consistent statistical patterns present in the training data. For example, if the model frequently encounters phrases like "John thinks the cat is in the basket" followed by predictable narrative outcomes, it could learn to associate these patterns with semantic relationships. By layer 22, earlier layers have already processed and encoded contextual cues such as entity roles and temporal and spatial relationships, enabling deeper layers to recombine these representations into contextually appropriate predictions. This process reflects how large language models can appear to reason about meaning despite lacking explicit semantic grounding.
Even without explicit grounding, models trained on structured datasets can still encode weak semantic signals. Benchmarks like MMLU, ARC-C, or Winogrande embed linguistic patterns that implicitly carry semantic entailments. Models like Gemma-2-2B seem to capture these relationships effectively, even if they're operating formally. Tasks like Winogrande make this particularly clear: while solving these tasks seems to require semantic reasoning, models often succeed by exploiting subtle textual cues embedded in the data. This suggests that while the induction heads found in this analysis might not directly access labeled semantic relationships, they capitalize on implicit signals encoded in the training data. For example, co-occurrences of specific token patterns might encode semantic entailments without the model ever "knowing" what those entailments mean explicitly.
In large models, emergent semantic inference seems plausible due to the interplay between the architecture and the training data. Benchmarks like BoolQ and TriviaQA provide structured patterns that tie linguistic forms to functional outputs, creating a complex statistical scaffolding that weakly approximates grounded understanding. While induction heads and specific layers remain pattern-driven, the broader training process imbues the model with enough implicit grounding to perform tasks requiring nuanced semantic judgements. This bridges the gap between form and meaning, allowing the model to have some grounding, even if it never reaches full semantic understanding.
By bridging high-level behavioral analogues (belief states) with low-level computational mechanisms (transformer attention heads, MLPs and residual streams), the hope of this work, and future work is to validate or invalidate that certain circuits are causally implicated in tasks that map onto ToM-like reasoning.
The proposed ToM circuit:
Extends the IOI work (which focuses on tracking a model's ability to reconstruct the syntax of natural language) to identify specific attention heads that are important to false belief tasks. The proposed circuit tracks and updates the belief states of entities with regard to locations and objects, using strong formal linguistic competence and tentative functional competence via the manipulation of linguistic elements, to distinguish facts from the believed reality of a third-person perspective.
Underscores the idea that learning certain language structures (sentential complements, subordinate clauses) can powerfully scaffold belief tracking, both for humans and LLMs. When the model processes "John thinks the cat is on the…", it tracks John's outdated belief state as distinct from reality.
Weakly shows that as LLMs scale and learn dense correlations, they develop a causal, semantic grounding.
Each component of the ToM circuit plays a distinct role in maintaining perspectives, tracking tokens consistently across layers, and updating dynamically. The model integrates changes in real time ("the cat is on the box", John does not know this) rather than just repeating a previously seen phrase. This suggests the model is doing more than superficial matching; it's maintaining stable relations that resemble an understanding of the narrative.
Even if the model's underlying process is just "next-token probability", it ends up with something that functionally resembles semantic inference: it can keep John's perspective distinct from Mark's, producing the correct outcome. It's not necessarily "grounded" in the sense of real-world perception, e.g. vision or physical environments. Instead, it is "virtually grounded" on approximate textual correlations via linguistic structure, still enough to solve tasks that appear to require understanding even if it never physically sees a cat or a basket.
Heads like 8.1 and the induction heads systematically track different character viewpoints and the times of their observations. The model's attention patterns not only form correlations but produce text that matches the scenario's cause-and-effect constraints through temporal (the model organizes events by who was present at which time), spatial (it encodes where the cat moves), and perspective (it maintains completely different "mental states" for characters) causality. The crucial indication is that interfering with these heads via path patching changes the final output, telling us the model is using these location/time/perspective signals to drive predictions rather than trivially memorizing.
The results from activation/path patching identified a circuit that's causally linked to ToM performance, and provide some causal evidence that form can carry function and certain heads are necessary for successful ToM-like inference. The circuit captures stable relationships (like who believes what) that go beyond surface-level token transitions. Some emergent semantic-like behavior exists (removing specific heads consistently reduces performance), which pushes beyond correlation towards a stronger (though still not definitive) causal story.
The fact that patching in certain Q, K, or V activations from a clean run restores correct predictions indicates these attention heads are doing a bit more than just memorizing surface patterns. The heads appear to encode aspects of perspective and belief in context. The strong improvements following targeted interventions suggest the model internally represents subtle cues needed for ToM tasks. This finding is still a step short of indisputable evidence for genuine semantics, but beyond naive statistical correlation.
While speculative, experiments show that removing certain heads disrupts semantic coherence. Similar to visual psychophysics, where knocking out features tests perceptual encoding, altering heads in transformers reveals how ToM directions encode context. While it's not definitive proof that the model truly understands semantics, it's a concrete demonstration that complex formal pattern capturing is sufficient to manifest in behaviors associated with semantic interpretation.
Is robust to targeted ablations. Causal interventions reveal that a small subset of 28 heads is essential for maintaining ToM capabilities. Ablating them severely degrades false-belief performance, to the point where the model fails the task. Recovery of full task performance when these components are restored affirms their importance in maintaining robust ToM functionality.
Furthermore, experiments show that when certain tokens (those involved in representing "belief states", like where John thinks the cat is) are patched from a clean run, the corrupted model's performance on the ToM task recovers.
The removal of early and mid copy suppression heads severely impairs ToM performance. These heads ensure that the model can maintain John's belief over the actual facts, preventing confusion between an agent's belief and the real state of the world.
Demonstrates a causal relationship between a specific group of attention heads and ToM task performance.
Works with copy suppression to ensure that distinct belief representations are tracked and preserved, preventing conflation between reality and differing subjects' beliefs. The circuit's interplay allows for more accurate predictions of behavior based on mismatched beliefs, a hallmark of human ToM. These heads must be intact for the model to maintain contradictory states and perform real updating rather than just copying.
This has several implications, one being that the model's internal representations are doing more than just predicting the next token. As the model processes the sequence, it maintains a belief about the entire future, not just the next word. If the model reads a sentence like "John hid the cat in the basket, but Mark moved it to the box when John wasn't looking", it has to keep track of where the cat might be (belief state) to predict any future reference to the cat's location, even several sentences later.
Copy suppression allows the transformer to model this process explicitly by preventing overcommitment to any one interpretation of reality. This enables the model to maintain belief dynamics similar to how humans mentally track both reality and agents' beliefs about reality, filter out irrelevant information, and avoid blindly following patterns.
It is a memory-like mechanism representing "attention over internal states", letting the model maintain parallel representations of reality vs. each character's perspective. Taken together, attention (with implicit copy suppression) functions as a flexible memory system, letting the model handle multi-step reasoning tasks, including false beliefs, that appear to require more than surface pattern matching.
By observing how LLMs solve these tasks, we see that they rely on approximations of human-like cognitive strategies, suggesting parallels in reasoning.
The circuit found for passing the false-belief task emerges from the interplay of large-scale pattern matching and the architectural capacity (attention) to keep track of who knows what. In essence, the synergy of linguistic structure and powerful attention-based memory yields behavior that goes beyond rote pattern recognition and starts to look like real perspective-tracking. While still statistical at its core, the model shows that formal patterns can encode functional semantics, enough to handle tasks that appear to demand a working theory of mind.
The parallels to human thinking are fascinating, but still, there's a big "but" here: how much of this translates to other model architectures and to ToM tasks beyond false beliefs across a wider range of data? I think it's likely other models will use similar mechanisms[14], but these are questions not fully answered by this work.
Further experiments will also explore a wider range of model architectures, the application of crosscoders and transcoders, and datasets and tasks covering more ToM aspects to assess the robustness and generalizability of these findings. For example, it would be interesting to study the difference between predicting why someone will act (ex ante: predict why a subject will perform an action) and explaining why they did act (ex post: isolate how the model understands, rather than being concerned with its prediction), or to develop more empirical studies that directly test hypotheses about weak grounding or emergent semantics in LLMs, in a broader ToM context.
While this work aims to bring high-level behavioral understanding to how models perform ToM, there are many unanswered questions. Did the model simply memorize the method it uses for this task? How generalizable is it? How do we know the model's methods are truly based on what it is inferring will happen versus what it has memorized (developing a method to quantify this could significantly advance this debate)? It's moving linguistic elements around, but does it truly understand its utterances? Concretely defined, what is true understanding?
Imagine someone dismisses a book's ability to tell a story, arguing, "It's just ink marks on paper!" Technically true, but missing the point: the magic lies in how those marks are arranged. The specific organization (words forming sentences, sentences forming a narrative) unlocks meaning, emotion, and depth. The key question isn't whether a book is reducible to ink and paper, but whether those marks, when structured just right, can encode the rich dynamics of storytelling. Similarly, when thinking about LLMs, the question isn't whether they're "just matrix multiplications", but whether their computations, when structured, can replicate the processes that underpin cognitive abilities.
I think findings that connect their behavior, our behavior, and what's happening to them internally will get us closer to the answers. While it's possible to say we have a partial map between human and machine language processing, transformers do not fully capture the consistency and generality of human cognition; they just know that if a given feature exists, another given feature is likely to come next. However, the success of the formal and functional linguistic competence of large language models should not be ignored.
Elmoznino, A Complexity-based Theory of Compositionality. Mila-Quebec AI Institute, Université de Montréal. 2024.[1]
Mahowald, Dissociating Language And Thought In Large Language Models. University of Texas at Austin, Georgia Institute of Technology, UCLA, MIT. 2024.[2]
Ullman, Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks. Harvard. 2023.[3]
Bender, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? University of Washington, Black in AI, The Aether. 2021.[4]
Jamali, Semantic encoding during language comprehension at single-cell resolution. Nature. 2024.[5]
de Villiers, The Role of Language in Theory of Mind Development. Lippincott Williams & Wilkins. 2014.[6]
Tager-Flusberg, How Language Facilitates the Acquisition of False-Belief Understanding in Children with Autism. APA PsycNet. 2005.[7]
Grice, Meaning. The Philosophical Review. 1957.[8]
Valle, Theory of Mind Development in Adolescence and Early Adulthood: The Growing Complexity of Recursive Thinking Ability. Europe's Journal of Psychology. 2015.[9]
Davies, Grice's Cooperative Principle: Getting The Meaning Across. University of Leeds. 2015.[10]
Wang, Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. Redwood Research, UC Berkeley. 2022.[11]
Park, The Linear Representation Hypothesis and the Geometry of Large Language Models. 2024.[12]
Mikolov, Linguistic Regularities in Continuous Space Word Representations. Microsoft Research. 2013.[13]
Tigges, Linear Representation of Sentiment in Large Language Models. EleutherAI Institute, SERI MATS, Stanford University, Pr(AI)R Group, Independent. 2023.[14]
Tigges, LLM Circuit Analyses Are Consistent Across Training and Scale. EleutherAI Institute, ILLC, University of Amsterdam, Brown University. 2024.[15]
Kosinski, Evaluating Large Language Models in Theory of Mind Tasks. Stanford University. 2023.[16]
Bricken, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic. 2023.[17]
Templeton, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic. 2024.[18]
Bills, Language Models Can Explain Neurons in Language Models. OpenAI. 2023.[19]
Li, The Geometry of Concepts: Sparse Autoencoder Feature Structure. MIT. 2024.[20]
McDougall, Copy Suppression: Comprehensively Understanding an Attention Head. Independent, University of Texas, Google DeepMind. 2023.[21]
Dimitrov, Inhibitory Attentional Control in Patients with Frontal Lobe Damage. Science Direct, Brain & Cognition. 2003.[22]
Jamali, Unveiling theory of mind in large language models: A parallel to single neurons in the human brain. Massachusetts General Hospital, Harvard Medical School, MIT. 2023.[23]
Bender, Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. University of Washington, Saarland University. 2020.[24]
McDougall, Indirect Object Identification Exercises and Solutions, used in sections 3.2, 3.3, 3.4. Independent. 2024.[25]
Hardy, Code for the project can be found here.[26]