a hands-on curriculum
Learn mechanistic interpretability
Six self-contained projects, each replicating a famous experiment and teaching exactly one new big idea. Start by looking at the weights of a digit classifier and end by training your own sparse autoencoder. Inspired by Neel Nanda.
the six steps
- 00
MNIST template weights
Weights are interpretable — you can just look.
replicates Olah-era weight visualisation
- 01
Toy Models of Superposition
Features ≠ neurons. Superposition is why.
replicates Elhage et al., Anthropic 2022
- 02
Grokking modular addition
Models learn algorithms, not just templates.
replicates Nanda et al. 2023
- 03
Induction heads
Real transformers do work via attention-head circuits.
replicates Olsson et al., Anthropic 2022
- 04
IOI circuit in GPT-2 small
Activation patching — verifying circuits causally.
replicates Wang et al. 2022
- 05
Sparse autoencoders
Dictionary learning — automatically finding features.
replicates Cunningham et al. 2023
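To give a taste of where the last step ends up, here is a minimal sketch of one sparse autoencoder forward pass in NumPy. It assumes one common formulation (ReLU encoder, linear decoder, squared-error reconstruction plus an L1 penalty on the feature activations), which may differ from the exact setup used in the project or in Cunningham et al. The function name `sae_forward`, the sizes, and `l1_coeff` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """One forward pass of a toy sparse autoencoder (illustrative only).

    x has shape (batch, d_model). Returns (features, reconstruction, loss).
    """
    # Encode: ReLU keeps feature activations non-negative.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Decode: rebuild the input as a weighted sum of dictionary directions.
    x_hat = f @ W_dec + b_dec
    # Loss: reconstruction error plus an L1 penalty that encourages sparsity.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return f, x_hat, recon + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_dict, batch = 8, 32, 4  # hypothetical sizes; the dictionary is overcomplete
x = rng.normal(size=(batch, d_model))  # stand-in for real model activations
W_enc = rng.normal(size=(d_model, d_dict)) * 0.1
W_dec = rng.normal(size=(d_dict, d_model)) * 0.1
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_model)

features, x_hat, loss = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(features.shape, x_hat.shape, float(loss))
```

Training would minimise `loss` over the four parameter arrays by gradient descent. The L1 term is what pushes most entries of `features` to zero, so that each input is explained by only a few dictionary directions.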