Learn mechanistic interpretability

Step 3 of 6 in the mech interp curriculum. Assumes you've done steps 0-2.

We replicate the most foundational discovery in transformer interpretability: induction heads (Olsson et al., Anthropic, 2022). These are a specific two-attention-head circuit that lets transformers do in-context pattern-matching - copying patterns from earlier in their input. They're most of how language models learn from things you say earlier in your prompt.

The ONE new idea this step teaches: in real transformers, work is done by circuits - multiple attention heads in different layers cooperating. To understand what a transformer does, you have to find these circuits.

By the end you'll have:

Trained your first multi-layer transformer (step 2 was just one layer).
Solved a task that requires in-context pattern matching - predicting the next token in a repeated random sequence.
Inspected the model's attention patterns and seen the two-head induction circuit cleanly.

What's new in this step

Step 2 introduced the single-layer transformer. Almost everything in that project - attention, embeddings, residual stream, weight decay, training loop - is unchanged here. If any of those words feel rusty, re-skim step 2 first.

The only thing actually new in this project:

Multiple layers. We go from 1 layer to 2. With more than one layer, attention heads in later layers can read what earlier heads wrote into the residual stream. This is the door that lets circuits exist.
Composition. When a head in Layer 1 builds its query (or its key, or its value) using information written by a head in Layer 0, we say the Layer 1 head "composes" with the Layer 0 head. Composition is the only way multi-head behaviour happens.
Attention pattern reading. Once trained, we look at the attention patterns of each head - which positions attended to which - and read off, by eye, what algorithm each head implements. This is the most basic mech-interp move on a transformer.

Everything else (nn.Module, attention maths, training loops, etc.) - you already know.

What is an induction head?

Easiest way to define it is by what it does on a specific input.

Suppose the model has seen the sequence so far:

... the cat sat on the mat ... the cat sat on the ___

When the model has to predict the blank, it would be helpful to notice that "the cat sat on the" has appeared earlier, and that the very next token last time was mat. An induction head does exactly that:

At the current position, look for an earlier occurrence of the same token (the).
Find what came right after that earlier the (which was mat).
Predict mat.

Induction-head walkthrough on the sentence "the cat sat on the mat … the cat sat on the ?". Three labelled steps: (1) match - curved arrow from the current "the" back to the earlier "the"; (2) shift - short arrow forward by one token to "mat"; (3) copy - arrow up to a thought bubble "predict: mat" above the current position. Caption: Induction = match-and-shift.

Mechanically, the induction head's attention pattern at the current position is concentrated on the position immediately after the earlier matching token. It copies the value at that position to the residual stream, which the unembedding then reads out as a prediction.

This is the entire trick behind a huge fraction of in-context learning. Tell GPT-4 "translate French to German" and give it examples; under the hood, induction heads are picking up on patterns like "the last time we saw a French word followed by a German word, the German word came after a :, so this time predict a German word."

Olsson et al. 2022 argues induction heads are the single most important discovery in transformer interpretability so far. We're going to find them ourselves.

The experiment in plain English

The task. We design a synthetic task that can only be solved by induction. Each input is a length-50 sequence of integers in {0, ..., 63}:

The first 25 tokens are random.
The last 25 tokens are the exact same sequence repeated.

So a sample input might be:

[37, 4, 19, 51, 2, ..., 14, 8,   37, 4, 19, 51, 2, ..., 14, 8]
       (random first half)         (same sequence repeated)

To predict the next token at position 30 (for example), the model needs to look at position 30-25=5 and read off what comes after. There is no other algorithm that can solve this task - the tokens are random and uncorrelated, so memorising distributions doesn't help. Generalising bigrams/trigrams doesn't help. The model has to learn induction.

The model. A 2-layer attention-only transformer (no MLPs). 4 attention heads per layer. d_model = 64. Total: ~50k parameters. Tiny.

Training. A few thousand steps of AdamW on freshly-sampled batches of random repeated sequences. Loss is cross-entropy on next-token prediction across all positions. We separately track loss on the first half (where it's unsolvable, ~log(64)) vs the second half (where it should drop to near zero).

What we expect. After training:

Loss on the first half stays at ~log(64) ≈ 4.16 - the model can't predict random tokens it has never seen.
Loss on the second half drops to near zero - the model has learnt induction.
When we plot the attention patterns of all 8 heads (2 layers × 4 heads each), exactly two of them will have distinctive structure:
- One head in Layer 0 will have a previous-token pattern (sub-diagonal).
- One head in Layer 1 will have an induction pattern (diagonal shifted by one half).

The other 6 heads will look more uniform or noisy. That's normal - not every head specialises, and some are doing supporting work.

The circuit explained

Here's why this two-head combination works. You don't need to follow the maths in detail on a first read - but the picture matters.

Imagine the model is at position T in the second half of the sequence, holding token X. It wants to predict what comes next. The right answer is "whatever came after X last time X appeared, which was at position T - 25. So I want to predict the token at position T - 24."

The induction head's job is to attend from T directly to T - 24 and copy that token's identity.

The problem: at position T - 24, the residual stream only knows about itself - about the token at position T - 24. There's no way for the induction head to know "this position is one-after-an-X" just by looking at position T - 24's residual stream.

The fix (the previous-token head, Layer 0): this head, at every position p, attends to position p - 1 and copies that earlier token's information into position p's residual stream. So after Layer 0, the residual stream at position T - 24 carries information about both the token at T - 24 and the token at T - 25 (which is X).

The induction head (Layer 1) closing the loop: this head, at position T:

Builds a query that says "I am looking for a position whose previous token was X."
Builds a key for each candidate position using the output of the previous-token head. So position T - 24's key encodes "my previous token was X." (It also encodes other things, but X is in there.)
The query-key match fires when the candidate's previous token equals the destination's current token - exactly the induction condition.
The value copied back is just the candidate position's actual token identity, which then gets read out as the prediction.

This is K-composition: the induction head builds its keys using the output of the previous-token head. Layer 1's keys read from what Layer 0 wrote into the residual stream.

When you visualise the Layer 1 head's attention pattern in Section 7 of the notebook, you'll see a clean diagonal shifted by 25 positions - that's this whole machine working.

crack open as needed

Glossary - terms added in this step

Just the new ones. For terms from earlier steps see those step pages.

Attention pattern: the matrix of attention weights for one head, after softmax. Shape (query_position, key_position). Entry [i, j] is "how much does the token at position i look at the token at position j?" Visualising this matrix as a heatmap is the basic interpretability move on a transformer.
Previous-token head: an attention head whose attention pattern is a clean sub-diagonal - every position attends to the position immediately before it. The job is "tell me what was just said." Found in Layer 0 of our trained model.
Induction head: an attention head that, given the current token X, attends back to the position right after the last time X appeared, and copies that next-token's identity forward. The job is "I've seen this before - last time, what came next?" Found in Layer 1.
Circuit: a small, identifiable subgraph of a neural network that implements an interpretable algorithm. The previous-token head + induction head together form the induction circuit.
Composition (K-composition, Q-composition, V-composition): a head in a later layer reading from the residual stream what an earlier head wrote there. In our case the induction head K-composes with the previous-token head: it builds its keys from information the previous-token head injected.
In-context learning (ICL): a model's ability to learn a pattern from earlier in its prompt and apply it to later predictions, without any weight updates. Most of ICL in real LLMs is induction heads doing their job.
Attention-only transformer: a transformer with all the MLP layers removed. We use this here because the induction circuit forms purely from attention, and removing the MLPs makes it visible without clutter.

Run it

The notebook is induction_heads.ipynb.

01Setup

Imports + device + seeds. Standard.

code · python

02Hyperparameters

All knobs in one place: vocab size 64, sequence length 50, model dims, training steps.

code · python

03Data generation - random repeated sequences

A function make_batch(batch_size) that returns a (batch_size, 50) tensor: random first half + repeated second half. We don't use a fixed dataset; we generate fresh batches on every step.

code · python

04The 2-layer attention-only transformer

Almost identical to step 2's model, but:

We wrap the attention block in a loop over n_layers = 2.
No MLP.
We use the same Attention class from step 2 (you've seen it before).

code · python

class Attention(nn.Module):
    def __init__(self, d_model, n_heads, d_head):
        super().__init__()
        self.W_Q = nn.Parameter(torch.randn(n_heads, d_model, d_head) * (1.0 / d_model**0.5))
        self.W_K = nn.Parameter(torch.randn(n_heads, d_model, d_head) * (1.0 / d_model**0.5))
        self.W_V = nn.Parameter(torch.randn(n_heads, d_model, d_head) * (1.0 / d_model**0.5))
        self.W_O = nn.Parameter(torch.randn(n_heads, d_head, d_model) * (1.0 / d_model**0.5))
        self.d_head = d_head

def forward(self, x, return_pattern=False):
        # x: (batch, pos, d_model)
        q = torch.einsum('bpd,hde->bphe', x, self.W_Q)
        k = torch.einsum('bpd,hde->bphe', x, self.W_K)
        v = torch.einsum('bpd,hde->bphe', x, self.W_V)

scores = torch.einsum('bphe,bqhe->bhpq', q, k) / (self.d_head ** 0.5)

# causal mask: position i can only attend to positions ≤ i
        T = x.shape[1]
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(causal_mask, float('-inf'))

attn = F.softmax(scores, dim=-1)   # (batch, head, query_pos, key_pos)
        z    = torch.einsum('bhpq,bqhe->bphe', attn, v)
        out  = torch.einsum('bphe,hed->bpd',   z, self.W_O)
        if return_pattern:
            return out, attn
        return out

class Transformer(nn.Module):
    def __init__(self, d_vocab, d_model, n_heads, d_head, n_layers, seq_len):
        super().__init__()
        self.W_E   = nn.Parameter(torch.randn(d_vocab, d_model) * (1.0 / d_model**0.5))
        self.W_pos = nn.Parameter(torch.randn(seq_len, d_model) * (1.0 / d_model**0.5))
        self.attns = nn.ModuleList([Attention(d_model, n_heads, d_head) for _ in range(n_layers)])
        self.W_U   = nn.Parameter(torch.randn(d_model, d_vocab) * (1.0 / d_model**0.5))

def forward(self, tokens):
        x = self.W_E[tokens] + self.W_pos
        for attn in self.attns:
            x = x + attn(x)
        return x @ self.W_U

def forward_with_patterns(self, tokens):
        """Same as forward but also returns each layer's attention pattern."""
        x = self.W_E[tokens] + self.W_pos
        patterns = []
        for attn in self.attns:
            out, pat = attn(x, return_pattern=True)
            x = x + out
            patterns.append(pat)
        return x @ self.W_U, patterns

code · python

05Training

~3,000 steps of AdamW. We log loss on the first half and on the second half separately so you can see the divergence.

code · python

optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

first_losses, second_losses, log_steps = [], [], []

for step in range(n_steps + 1):
    tokens = make_batch(batch_size, half_len, d_vocab, device)   # (B, 2*half_len)
    logits = model(tokens)                                       # (B, 2*half_len, d_vocab)

# predict token[t+1] from position t — shift by 1
    pred   = logits[:, :-1, :]    # (B, 2*half_len - 1, d_vocab)
    target = tokens[:, 1:]        # (B, 2*half_len - 1)

loss = F.cross_entropy(pred.reshape(-1, d_vocab), target.reshape(-1))

optimizer.zero_grad()
    loss.backward()
    optimizer.step()

if step % log_every == 0:
        with torch.no_grad():
            # split loss into first-half and second-half positions
            per_pos = F.cross_entropy(
                pred.reshape(-1, d_vocab),
                target.reshape(-1),
                reduction='none',
            ).reshape(batch_size, -1)                     # (B, 2*half_len - 1)
            first  = per_pos[:, :half_len - 1].mean().item()
            second = per_pos[:,  half_len - 1:].mean().item()
        first_losses.append(first)
        second_losses.append(second)
        log_steps.append(step)
        if step % 500 == 0:
            print(f'step {step:5d}  | first-half loss {first:.3f}  | second-half loss {second:.3f}')

print('\nTraining complete.')

06Loss curves - the wow moment

Two curves on the same plot. First-half loss stays near log(64) ≈ 4.16. Second-half loss drops to near zero. The gap is the model learning induction.

code · python

07Find the previous-token head and the induction head

We compute two scores per head:

Previous-token score: average attention from position i to position i-1. High for a clean previous-token head.
Induction score: on a fresh repeated sequence, average attention from position 25 + k to position k + 1. High for a clean induction head.

We plot a bar chart of these scores across all 8 heads. Exactly one Layer 0 head should top the previous-token score; exactly one Layer 1 head should top the induction score.

code · python

# grab attention patterns on a fresh batch
with torch.no_grad():
    eval_tokens = make_batch(256, half_len, d_vocab, device)
    _, patterns = model.forward_with_patterns(eval_tokens)   # list of (B, H, P, P)
    patterns = [p.mean(dim=0).cpu().numpy() for p in patterns]  # average over batch → (H, P, P) each

def previous_token_score(pattern):
    """How much does each position attend to the position immediately before it?
    pattern shape: (P, P).  We average over positions 1..P-1.
    """
    P = pattern.shape[0]
    return np.mean([pattern[i, i - 1] for i in range(1, P)])

def induction_score(pattern, half_len):
    """On the second half (positions half_len..P-1), how much does position
    half_len+k attend to position k+1 (the token after the previous occurrence)?
    """
    P = pattern.shape[0]
    scores = []
    for k in range(half_len - 1):
        # destination position in the second half
        dst = half_len + k
        # source position: the one right after the matching first-half occurrence
        src = k + 1
        if dst < P:
            scores.append(pattern[dst, src])
    return float(np.mean(scores))

# compute scores for every head
all_heads = []
for layer_idx, layer_pat in enumerate(patterns):
    for h in range(n_heads):
        pat = layer_pat[h]
        all_heads.append({
            'layer': layer_idx,
            'head':  h,
            'prev_score':       previous_token_score(pat),
            'induction_score':  induction_score(pat, half_len),
            'pattern':          pat,
        })

# print a table
print(f'{"layer":>5}  {"head":>4}  {"prev-token score":>18}  {"induction score":>17}')
for h in all_heads:
    print(f'{h["layer"]:>5}  {h["head"]:>4}  {h["prev_score"]:>18.3f}  {h["induction_score"]:>17.3f}')

code · python

# bar chart of both scores
labels = [f'L{h["layer"]}H{h["head"]}' for h in all_heads]
prev   = [h['prev_score']      for h in all_heads]
ind    = [h['induction_score'] for h in all_heads]

x = np.arange(len(all_heads))
width = 0.4

fig, ax = plt.subplots(figsize=(9, 4))
ax.bar(x - width / 2, prev, width, label='previous-token score')
ax.bar(x + width / 2, ind,  width, label='induction score')
ax.set_xticks(x); ax.set_xticklabels(labels)
ax.set_ylabel('score (0 = no signal, 1 = perfect)')
ax.set_title('Which heads do what?')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

# identify the winners
prev_winner = max([h for h in all_heads if h['layer'] == 0], key=lambda h: h['prev_score'])
ind_winner  = max([h for h in all_heads if h['layer'] == 1], key=lambda h: h['induction_score'])
print(f'\nLayer-0 previous-token head: L{prev_winner["layer"]}H{prev_winner["head"]}  (score {prev_winner["prev_score"]:.3f})')
print(f'Layer-1 induction head:      L{ind_winner["layer"]}H{ind_winner["head"]}  (score {ind_winner["induction_score"]:.3f})')

08Visualise the attention patterns of all 8 heads

For each head, plot its attention pattern as a heatmap. Look at the two heads identified in Section 7 first. You should see:

Previous-token head: a clean sub-diagonal stripe (attention from i to i-1).
Induction head: an off-diagonal stripe shifted by 25 in the bottom-right quadrant.

The other 6 heads are mostly noisy or uniform.

code · python

Induction heads

What's new in this step

What is an induction head?

The experiment in plain English

The circuit explained

Run it

01Setup

02Hyperparameters

03Data generation - random repeated sequences

04The 2-layer attention-only transformer

05Training

06Loss curves - the wow moment

07Find the previous-token head and the induction head

08Visualise the attention patterns of all 8 heads