Mechanistic interpretability pioneer
Chris Olah
Profile
Chris Olah is the closest thing the field has to a patron saint of looking inside neural networks. He co-founded Anthropic in 2021 alongside Dario Amodei, Daniela Amodei, Jared Kaplan, Jack Clark, and others, and he now leads its interpretability team — the group whose entire job is to figure out what the weights of a frontier model are actually doing. Before that he led interpretability at OpenAI and worked at Google Brain. He’s also the co-founder of Distill, the short-lived but hugely influential journal that raised the bar for how ML research should be communicated visually.
The story that matters for developers: for most of deep learning’s recent history, “what is this model doing?” was treated as basically unanswerable. Olah refused to accept that. Starting with his early blog posts on colah.github.io (the famous Understanding LSTM Networks explainer is still the clearest thing written on the topic), he built up a program of reverse-engineering neural nets neuron by neuron, weight by weight. That program became the Circuits thread on Distill, which argued you could study a neural network the way a biologist studies an organism: zoom in on individual neurons and circuits, find the structure, name the parts.
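To make “zoom in” concrete: one workhorse technique from that era is feature visualization, optimizing an input image until a chosen neuron or channel fires strongly, then looking at what the image shows. The sketch below is a minimal, illustrative PyTorch version; the model, layer, channel, and step count are arbitrary choices for the example, not anything taken from Olah’s papers.

```python
import torch
import torchvision.models as models

# Load a pretrained vision model; GoogLeNet is just a convenient arbitrary choice.
model = models.googlenet(weights="DEFAULT").eval()
for p in model.parameters():
    p.requires_grad_(False)

# Record the activations of one mid-level layer via a forward hook.
acts = {}
model.inception4a.register_forward_hook(lambda m, i, o: acts.update(out=o))

# Start from noise and optimize the *input*, not the weights.
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

channel = 11  # arbitrary channel to "zoom in" on
for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -acts["out"][0, channel].mean()  # gradient ascent on the channel's activation
    loss.backward()
    opt.step()

# `img` now roughly shows the visual pattern this channel responds to.
```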
At Anthropic, that program has scaled from vision models to production LLMs. Toy Models of Superposition (2022) explained why individual neurons are polysemantic (models pack more features than they have dimensions, storing them in superposition) and hinted at how to unpack them. Towards Monosemanticity (2023) supplied the unpacking tool, sparse autoencoders (a form of dictionary learning) that decompose activations into interpretable features, and Scaling Monosemanticity (2024) then actually did it on Claude 3 Sonnet, extracting millions of interpretable features (“Golden Gate Bridge”, “code with bugs”, “sycophantic praise”) and showing you could steer the model by clamping them. This is no longer speculation; it’s a real tool.
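Those two ideas, superposition and steering by clamping, are easy to see in a toy sketch. The snippet below is plain NumPy and purely illustrative (the dimensions, feature indices, and the clamp value are all made up; it is not Anthropic’s setup): it packs 32 sparse features into an 8-dimensional activation space as overlapping directions, reads them back with interference, then clamps one feature high to rebuild a steered activation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 8, 32  # many more features than dimensions
# Each feature gets a (random) unit direction in activation space.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse feature activations: only a few features fire at once,
# which is what makes superposition workable in practice.
f = np.zeros(n_features)
f[[3, 17, 29]] = [1.0, 0.5, 2.0]

# The "activation vector" is a sum of the active features' directions.
x = f @ W  # shape (d_model,)

# Reading features back by projecting onto their directions is noisy:
# non-orthogonal directions interfere, which is why single neurons look polysemantic.
f_hat = x @ W.T
print("strongest recovered features:", np.argsort(-np.abs(f_hat))[:3])

# Steering by clamping: force one feature to a large value and rebuild x.
f_clamped = f.copy()
f_clamped[17] = 10.0  # e.g. clamp a hypothetical "Golden Gate Bridge" feature high
x_steered = f_clamped @ W
```

In the actual papers the feature directions are not known in advance; a sparse autoencoder learns them from model activations. The sketch only shows why overlapping directions plus sparsity make that decomposition possible at all.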
Why he matters if you’re learning AI: interpretability is the hinge that alignment and safety swing on. If we can’t see what a model is computing, we’re stuck evaluating it like a black box and hoping. Olah’s bet — and it’s increasingly paying off — is that models are not inscrutable, just complicated, and that with enough patience you can read them. For a developer, his writing is also the single best tutorial on how neural networks actually work internally. Start with his blog, then the Transformer Circuits thread.
Key Articles & Papers
Understanding LSTM Networks
Feature Visualization
The Building Blocks of Interpretability
Zoom In: An Introduction to Circuits
Multimodal Neurons in Artificial Neural Networks
A Mathematical Framework for Transformer Circuits
Toy Models of Superposition
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
On the Measure of Intelligence in Neural Networks (colah.github.io)