a hands-on curriculum
Learn mechanistic interpretability
Six self-contained projects, each replicating a famous experiment and teaching exactly one new big idea. Start by looking at the weights of a digit classifier and end by training your own sparse autoencoder. Inspired by Neel Nanda.
what is mechanistic interpretability?
A neural network is a few million (or billion) numbers - its weights. When you push data through them, those numbers produce useful behaviour: a classification, a sentence, a prediction. Nobody designed them by hand; gradient descent did. So the model works, but on first inspection the weights look like noise.
Mechanistic interpretability is the project of looking inside trained models and reverse-engineering what those weights are actually computing. Not vague stories like “this neuron likes cats”, but algorithm-level descriptions clean enough that a human could implement them by hand. The bet is that if we can read what a model is doing, we can catch it doing something we don't want before it ships.
This curriculum walks you from the simplest possible case (literally reading digit templates off a weight matrix) to the frontier of the field today (sparse autoencoders pulling monosemantic features out of real language models). Six steps. No prerequisites beyond Python.
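To make "reading templates off a weight matrix" concrete, here is a minimal sketch in plain Python. It is not the curriculum's actual project code: the two 3x3 patterns, the function names and the hyperparameters are all made up for illustration, and the model is a bias-free logistic regression rather than a full digit classifier.

```python
import math

# Two hypothetical 3x3 "images", flattened row by row (not real MNIST data).
VERTICAL = [0, 1, 0,
            0, 1, 0,
            0, 1, 0]
HORIZONTAL = [0, 0, 0,
              1, 1, 1,
              0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(steps=200, lr=0.5):
    """Fit a bias-free logistic regression: label 1 = vertical, 0 = horizontal."""
    data = [(VERTICAL, 1.0), (HORIZONTAL, 0.0)]
    w = [0.0] * 9
    for _ in range(steps):
        grad = [0.0] * 9
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(9):
                grad[i] += (y - p) * x[i]  # gradient of the log-likelihood
        w = [wi + lr * gi for wi, gi in zip(w, grad)]
    return w


def sign_char(v, eps=1e-6):
    """Map a weight to '+', '-' or '.' (approximately zero)."""
    if v > eps:
        return "+"
    if v < -eps:
        return "-"
    return "."


def as_grid(w):
    """Render the 9 weights as a 3x3 grid of signs so the template is visible."""
    rows = [w[r * 3:r * 3 + 3] for r in range(3)]
    return "\n".join(" ".join(sign_char(v) for v in row) for row in rows)


weights = train()
print(as_grid(weights))
```

Printed as a grid, the weights are positive at the top and bottom of the middle column (evidence for "vertical"), negative at the left and right of the middle row (evidence for "horizontal"), and roughly zero at the centre pixel that both patterns share. That is the sense in which a linear model's weights can simply be looked at.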
the six steps
- 00
MNIST template weights
Weights are interpretable - you can just look.
replicates Olah-era weight visualisation
- 01
Toy Models of Superposition
Features ≠ neurons. Superposition is why.
replicates Elhage et al., Anthropic 2022
- 02
Grokking modular addition
Models learn algorithms, not just templates.
replicates Nanda et al. 2023
- 03
Induction heads
Real transformers do work via attention-head circuits.
replicates Olsson et al., Anthropic 2022
- 04
IOI circuit in GPT-2 small
Activation patching - verifying circuits causally.
replicates Wang et al. 2022
- 05
Sparse autoencoders
Dictionary learning - automatically finding features.
replicates Cunningham et al. 2023
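As a taste of step 01, the claim "features ≠ neurons" can be sketched with three features squeezed into two hidden dimensions. The sketch assumes a hand-picked geometry (three directions 120 degrees apart) and a tied-weight ReLU readout with zero bias. Nothing here is trained, and the names `W`, `encode`, `decode` and `reconstruct` are illustrative only.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


# Three feature directions in a 2-D hidden space, 120 degrees apart.
# This geometry is an assumption chosen for illustration, not learned weights.
W = [[math.cos(2 * math.pi * k / 3) for k in range(3)],
     [math.sin(2 * math.pi * k / 3) for k in range(3)]]  # shape: 2 x 3


def encode(x):
    """Project 3 features down to 2 hidden dimensions: h = W x."""
    return [sum(W[d][k] * x[k] for k in range(3)) for d in range(2)]


def decode(h):
    """Read features back out: x_hat = ReLU(W^T h), with zero bias."""
    return relu([sum(W[d][k] * h[d] for d in range(2)) for k in range(3)])


def reconstruct(x):
    return decode(encode(x))


sparse = reconstruct([1.0, 0.0, 0.0])  # only one feature active
dense = reconstruct([1.0, 1.0, 1.0])   # all three features active at once
print([round(v, 3) for v in sparse])
print([round(v, 3) for v in dense])
```

With a single active feature, the interference on the other two directions is negative and the ReLU clips it away, so the input comes back intact even though there are more features than hidden dimensions. With all three active, the directions cancel and the reconstruction collapses to roughly zero. Superposition of this kind only pays off when features are sparse, and that is the motivation for the dictionary-learning approach in step 05.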