learn-mech-interp

a hands-on curriculum

Learn mechanistic interpretability

Six self-contained projects, each replicating a famous experiment and teaching exactly one new big idea. Start by looking at the weights of a digit classifier and end by training your own sparse autoencoder. Inspired by Neel Nanda.

the six steps

  1. 00

    MNIST template weights

    Weights are interpretable — you can just look.

    replicates Olah-era weight visualisation

  2. 01

    Toy Models of Superposition

    Features ≠ neurons. Superposition is why.

    replicates Elhage et al., Anthropic 2022

  3. 02

    Grokking modular addition

    Models learn algorithms, not just templates.

    replicates Nanda et al. 2023

  4. 03

    Induction heads

    Real transformers do work via attention-head circuits.

    replicates Olsson et al., Anthropic 2022

  5. 04

    IOI circuit in GPT-2 small

    Activation patching — verifying circuits causally.

    replicates Wang et al. 2022

  6. 05

    Sparse autoencoders

    Dictionary learning — automatically finding features.

    replicates Cunningham et al. 2023
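
To give a flavour of where the last step ends up, here is a minimal sketch of a sparse autoencoder's forward pass and loss in NumPy. It is not the curriculum's own code: the function names (`sae_forward`, `sae_loss`), the dimensions, the L1 coefficient, and the random untrained weights are all assumptions chosen for illustration. It assumes the common formulation of a ReLU encoder, a linear decoder, and a reconstruction loss plus an L1 sparsity penalty.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative feature codes, then reconstruct them."""
    # ReLU keeps every feature coefficient >= 0 (hypothetical layout: rows are examples).
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Each row of W_dec acts as one dictionary direction in activation space.
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that pushes codes towards sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

# Illustrative sizes only: an overcomplete dictionary (d_dict > d_model).
rng = np.random.default_rng(0)
d_model, d_dict, batch = 8, 32, 16

x = rng.normal(size=(batch, d_model))        # stand-in for model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
print(f.shape, x_hat.shape)
```

In a real run the weights would be trained by gradient descent on this loss over activations collected from a language model, and the learned decoder rows would be the candidate features to inspect.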