a hands-on curriculum

Learn mechanistic interpretability

Six self-contained projects, each replicating a famous experiment and teaching exactly one new big idea. Start by reading the weights of a digit classifier and end by training your own sparse autoencoder. Inspired by Neel Nanda.

what is mechanistic interpretability?

A neural network is a few million (or billion) numbers: its weights. When you push data through them, those numbers produce useful behaviour: a classification, a sentence, a prediction. Nobody designed them by hand; gradient descent did. So the model works, but on first inspection the weights look like noise.
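To make "a network is just numbers" concrete, here is a minimal NumPy sketch. The layer sizes, the random initialisation, and the name `forward` are illustrative assumptions rather than anything the curriculum prescribes; a trained model would simply hold learned values in the same arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical two-layer network. Everything it "knows" lives in these arrays.
W1 = rng.normal(size=(784, 32)) * 0.05  # input pixels -> hidden units
b1 = np.zeros(32)
W2 = rng.normal(size=(32, 10)) * 0.05   # hidden units -> class scores
b2 = np.zeros(10)

def forward(x):
    """Push one flattened 28x28 image through the network."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU non-linearity
    return hidden @ W2 + b2                # one score per class

n_params = sum(a.size for a in (W1, b1, W2, b2))
x = rng.random(784)                        # stand-in for a real image
logits = forward(x)

print("parameters:", n_params)
print("predicted class:", int(np.argmax(logits)))
```

With random weights the prediction is meaningless. Training changes the numbers, not the code, which is why the interesting structure has to be found in the weights themselves.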

Mechanistic interpretability is the project of looking inside trained models and reverse-engineering what those weights are actually computing. Not vague stories like “this neuron likes cats”, but algorithm-level descriptions clean enough that a human could implement them by hand. The bet is that if we can read what a model is doing, we can catch it doing something we don't want before it ships.

This curriculum walks you from the simplest possible case (literally reading digit templates off a weight matrix) to the frontier of the field today (sparse autoencoders pulling monosemantic features out of real language models). Six steps. No prerequisites beyond Python.
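As a taste of the simplest case, the sketch below trains a linear classifier and reads its templates straight off the weight matrix. To stay self-contained it uses made-up 3x3 bar images as a stand-in for MNIST digits; the data, noise level, learning rate, and step count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "digits" (assumed stand-ins for MNIST): a vertical and a horizontal bar.
vertical = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
horizontal = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=float)
patterns = np.stack([vertical, horizontal]).reshape(2, 9)

# Noisy copies of the two patterns, with integer class labels.
labels = rng.integers(0, 2, size=200)
X = patterns[labels] + 0.3 * rng.normal(size=(200, 9))
targets = np.eye(2)[labels]  # one-hot labels

# Linear softmax classifier trained with plain gradient descent.
W = np.zeros((2, 9))  # one row of weights per class
for _ in range(300):
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - targets).T @ X / len(X)      # cross-entropy gradient
    W -= 0.5 * grad

accuracy = (np.argmax(X @ W.T, axis=1) == labels).mean()
template0 = W[0].reshape(3, 3)  # weights for the "vertical bar" class

print("train accuracy:", accuracy)
print(np.round(template0, 1))
```

Each row of `W`, reshaped to the image grid, should look like the pattern its class responds to: positive where that class has ink and negative where the other class does.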

the six steps

MNIST template weights

Toy Models of Superposition

Grokking modular addition

Induction heads

IOI circuit in GPT-2 small

Sparse autoencoders