Mechanistic interpretability pioneer
Chris Olah
Profile
Chris Olah is the closest thing the field has to a patron saint of looking inside neural networks. He co-founded Anthropic in 2021 alongside Dario Amodei, Daniela Amodei, Jared Kaplan, Jack Clark, and others, and he now leads its interpretability team — the group whose entire job is to figure out what the weights of a frontier model are actually doing. Before that he led interpretability at OpenAI and worked at Google Brain. He’s also the co-founder of Distill, the short-lived but hugely influential journal that raised the bar for how ML research should be communicated visually.
The story that matters for developers: for most of deep learning’s recent history, “what is this model doing?” was treated as basically unanswerable. Olah refused to accept that. Starting with his early blog posts on colah.github.io (the famous Understanding LSTM Networks explainer is still the clearest thing written on the topic), he built up a program of reverse-engineering neural nets neuron by neuron, weight by weight. That program became the Circuits thread on Distill, which argued you could study a neural network the way a biologist studies an organism: zoom in on individual neurons and circuits, find the structure, name the parts.
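To make “zoom in” concrete: one workhorse technique from that era is feature visualization, optimizing an input image until a chosen neuron or channel fires strongly, then looking at what the image shows. The sketch below is a minimal, illustrative PyTorch version; the model, layer, channel, and step count are arbitrary choices for the example, not anything taken from Olah’s papers.

```python
import torch
import torchvision.models as models

# Load a pretrained vision model; GoogLeNet is just a convenient arbitrary choice.
model = models.googlenet(weights="DEFAULT").eval()
for p in model.parameters():
    p.requires_grad_(False)

# Record the activations of one mid-level layer via a forward hook.
acts = {}
model.inception4a.register_forward_hook(lambda m, i, o: acts.update(out=o))

# Start from noise and optimize the *input*, not the weights.
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

channel = 11  # arbitrary channel to "zoom in" on
for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -acts["out"][0, channel].mean()  # gradient ascent on the channel's activation
    loss.backward()
    opt.step()

# `img` now roughly shows the visual pattern this channel responds to.
```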
At Anthropic, that program has scaled from vision models to production LLMs. Toy Models of Superposition (2022) explained why individual neurons are polysemantic (models pack more features than they have dimensions, storing them in superposition) and hinted at how to unpack them. Towards Monosemanticity (2023) supplied the unpacking tool, sparse autoencoders (a form of dictionary learning) that decompose activations into interpretable features, and Scaling Monosemanticity (2024) then actually did it on Claude 3 Sonnet, extracting millions of interpretable features (“Golden Gate Bridge”, “code with bugs”, “sycophantic praise”) and showing you could steer the model by clamping them. This is no longer speculation; it’s a real tool.
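Those two ideas, superposition and steering by clamping, are easy to see in a toy sketch. The snippet below is plain NumPy and purely illustrative (the dimensions, feature indices, and the clamp value are all made up; it is not Anthropic’s setup): it packs 32 sparse features into an 8-dimensional activation space as overlapping directions, reads them back with interference, then clamps one feature high to rebuild a steered activation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 8, 32  # many more features than dimensions
# Each feature gets a (random) unit direction in activation space.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse feature activations: only a few features fire at once,
# which is what makes superposition workable in practice.
f = np.zeros(n_features)
f[[3, 17, 29]] = [1.0, 0.5, 2.0]

# The "activation vector" is a sum of the active features' directions.
x = f @ W  # shape (d_model,)

# Reading features back by projecting onto their directions is noisy:
# non-orthogonal directions interfere, which is why single neurons look polysemantic.
f_hat = x @ W.T
print("strongest recovered features:", np.argsort(-np.abs(f_hat))[:3])

# Steering by clamping: force one feature to a large value and rebuild x.
f_clamped = f.copy()
f_clamped[17] = 10.0  # e.g. clamp a hypothetical "Golden Gate Bridge" feature high
x_steered = f_clamped @ W
```

In the actual papers the feature directions are not known in advance; a sparse autoencoder learns them from model activations. The sketch only shows why overlapping directions plus sparsity make that decomposition possible at all.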
Why he matters if you’re learning AI: interpretability is the hinge that alignment and safety swing on. If we can’t see what a model is computing, we’re stuck evaluating it like a black box and hoping. Olah’s bet — and it’s increasingly paying off — is that models are not inscrutable, just complicated, and that with enough patience you can read them. For a developer, his writing is also the single best tutorial on how neural networks actually work internally. Start with his blog, then the Transformer Circuits thread.
Key Articles & Papers
Understanding LSTM Networks
Feature Visualization
The Building Blocks of Interpretability
Zoom In: An Introduction to Circuits
Multimodal Neurons in Artificial Neural Networks
A Mathematical Framework for Transformer Circuits
Toy Models of Superposition
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
On the Measure of Intelligence in Neural Networks (colah.github.io)