learn-mech-interp

a hands-on curriculum

Learn mechanistic interpretability

Six self-contained projects, each replicating a famous experiment and teaching exactly one new big idea. Start by inspecting the weights of a digit classifier and end by training your own sparse autoencoder. Inspired by Neel Nanda.

what is mechanistic interpretability?

A neural network is a few million (or billion) numbers: its weights. When you push data through the network, those numbers produce useful behaviour: a classification, a sentence, a prediction. Nobody designed them by hand; gradient descent did. So the model works, but on first inspection the weights look like noise.

Mechanistic interpretability is the project of looking inside trained models and reverse-engineering what those weights are actually computing. Not vague stories like “this neuron likes cats”, but algorithm-level descriptions clean enough that a human could implement them by hand. The bet is that if we can read what a model is doing, we can catch it doing something we don't want before it ships.

This curriculum walks you from the simplest possible case (literally reading digit templates off a weight matrix) to the frontier of the field today (sparse autoencoders pulling monosemantic features out of real language models). Six steps. No prerequisites beyond Python.
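To make "reading digit templates off a weight matrix" concrete, here is a minimal sketch. It assumes a synthetic stand-in for MNIST (three random 8x8 "digit" templates plus noise) so it runs without downloading anything. The sizes, learning rate, and variable names are illustrative choices, not the curriculum's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for MNIST: 3 classes of 8x8 "images",
# each a fixed random template plus Gaussian noise.
n_classes, side, n_per_class = 3, 8, 200
templates = rng.normal(size=(n_classes, side * side))
X = np.concatenate(
    [t + 0.5 * rng.normal(size=(n_per_class, side * side)) for t in templates]
)
y = np.repeat(np.arange(n_classes), n_per_class)
onehot = np.eye(n_classes)[y]

# Softmax regression (a single linear layer) trained by plain gradient descent.
W = np.zeros((n_classes, side * side))
for _ in range(200):
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - onehot).T @ X / len(X)       # cross-entropy gradient
    W -= 0.1 * grad

# "Just look": each row of W reshapes into an image the size of the input.
weight_images = W.reshape(n_classes, side, side)

def unit(rows):
    """Scale each row to unit length so dot products are cosine similarities."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

# Compare every learned weight row with every true template.
similarity = unit(W) @ unit(templates).T
best_match = similarity.argmax(axis=1)
accuracy = ((X @ W.T).argmax(axis=1) == y).mean()

print("closest template for each weight row:", best_match)
print("training accuracy:", accuracy)
```

With real MNIST you would plot each `weight_images[c]` with an image viewer instead of comparing against known templates, since the true templates are exactly what you are trying to discover.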

the six steps

  1. 00

    MNIST template weights

    Weights are interpretable: you can just look.

    replicates Olah-era weight visualisation

  2. 01

    Toy Models of Superposition

    Features ≠ neurons. Superposition is why.

    replicates Elhage et al., Anthropic 2022

  3. 02

    Grokking modular addition

    Models learn algorithms, not just templates.

    replicates Nanda et al. 2023

  4. 03

    Induction heads

    Real transformers do work via attention-head circuits.

    replicates Olsson et al., Anthropic 2022

  5. 04

    IOI circuit in GPT-2 small

    Activation patching: verifying circuits causally.

    replicates Wang et al. 2022

  6. 05

    Sparse autoencoders

    Dictionary learning: automatically finding features.

    replicates Cunningham et al. 2023
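As a preview of where the last step ends up, the sketch below shows the forward pass and loss of a sparse autoencoder in one common formulation: a ReLU encoder into an overcomplete dictionary, a linear decoder, and an L1 penalty on the feature activations. It is an illustration under stated assumptions, not necessarily the exact setup used in this curriculum or in Cunningham et al. The dimensions, the `l1_coeff` value, and the random "activations" are all hypothetical, and there is no training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 16-dim activations, 64 dictionary features.
d_model, d_dict, batch = 16, 64, 32
l1_coeff = 1e-3  # hypothetical weight on the sparsity penalty

# Random stand-in for activations recorded from a real model.
x = rng.normal(size=(batch, d_model))

# Untrained parameters; training would adjust these by gradient descent.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative features, then reconstruct."""
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU encoder
    reconstruction = features @ W_dec + b_dec                # linear decoder
    return features, reconstruction

def sae_loss(x):
    """Reconstruction error plus an L1 penalty that pushes features to zero."""
    features, reconstruction = sae_forward(x)
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * sparsity

features, reconstruction = sae_forward(x)
loss = sae_loss(x)
print("features:", features.shape, "loss:", float(loss))
```

The dictionary is deliberately wider than the activation space (64 features for 16 dimensions here). That is what lets the autoencoder assign separate features to directions that the model stores in superposition, which is the idea step 01 introduces.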