interpretability · anthropic · visualization · safety


Mechanistic interpretability pioneer

Chris Olah

Research Lead, Interpretability — Anthropic

Profile

Chris Olah is the closest thing the field has to a patron saint of looking inside neural networks. He co-founded Anthropic in 2021 alongside Dario Amodei, Daniela Amodei, Jared Kaplan, Jack Clark, and others, and he now leads its interpretability team — the group whose entire job is to figure out what the weights of a frontier model are actually doing. Before that he led interpretability at OpenAI and worked at Google Brain. He’s also the co-founder of Distill, the short-lived but hugely influential journal that raised the bar for how ML research should be communicated visually.

The story that matters for developers: for most of deep learning’s recent history, “what is this model doing?” was treated as basically unanswerable. Olah refused to accept that. Starting with his early blog posts on colah.github.io — the famous Understanding LSTM Networks explainer is still the clearest thing written on the topic — he built up a program of reverse-engineering neural nets neuron by neuron, weight by weight. That program became the Circuits thread on Distill, which argued you could treat individual neurons and circuits like biology: zoom in, find the structure, name the parts.
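To make that program concrete: the core move in the Feature Visualization line of work is activation maximization, gradient ascent on an input image until a chosen neuron or channel fires strongly. Here is a minimal sketch of the idea, with a tiny random CNN standing in for a real vision model; the architecture, channel index, and hyperparameters are all illustrative, and the published work adds regularizers (transformation robustness, frequency penalties) to get clean images:

```python
import torch
import torch.nn as nn

# Tiny random CNN standing in for a real vision model (InceptionV1 in the
# published work). Architecture and sizes here are illustrative only.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
)

def visualize_channel(model, channel, steps=256, lr=0.05):
    """Gradient-ascend an input image until `channel` fires strongly."""
    img = torch.randn(1, 3, 64, 64, requires_grad=True)  # start from noise
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        acts = model(img)                 # (1, C, H', W') feature maps
        loss = -acts[0, channel].mean()   # maximize activation = minimize its negative
        loss.backward()
        opt.step()
    return img.detach()

dream = visualize_channel(model, channel=3)
print(dream.shape)  # torch.Size([1, 3, 64, 64])
```

The resulting image is a caricature of whatever that channel detects, which is exactly the "zoom in and name the parts" move the circuits agenda is built on.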

At Anthropic, that program has scaled from vision models to production LLMs. Toy Models of Superposition (2022) explained why neurons are polysemantic — models pack more features than they have dimensions, in superposition — and hinted at how to unpack them. Scaling Monosemanticity (2024) then actually did it on Claude 3 Sonnet, pulling out millions of interpretable features (“Golden Gate Bridge”, “code with bugs”, “sycophantic praise”) and showing you could steer the model by clamping them. This is no longer speculation — it’s a real tool.
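For flavor, here is roughly what "dictionary learning with a sparse autoencoder" boils down to, as a hedged sketch: train an overcomplete autoencoder on captured activations with an L1 penalty so each activation decomposes into a few features, then steer by clamping one feature and decoding back. All shapes, the feature index, and the penalty weight below are illustrative, not Anthropic's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Dictionary learning on model activations. All sizes illustrative."""
    def __init__(self, d_model=512, d_dict=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = F.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f     # reconstruction, features

sae = SparseAutoencoder()
acts = torch.randn(1024, 512)     # stand-in for activations captured from an LLM
x_hat, f = sae(acts)

# Training objective: reconstruct well, but keep features sparse (L1 penalty).
loss = F.mse_loss(x_hat, acts) + 1e-3 * f.abs().mean()

# "Steering by clamping": pin one learned feature to a large value and decode
# back into activation space; the result would be patched into the model.
with torch.no_grad():
    f_clamped = f.clone()
    f_clamped[:, 123] = 10.0      # hypothetical feature index
    steered = sae.dec(f_clamped)
```

Clamping the "Golden Gate Bridge" feature this way is what produced the famous bridge-obsessed Claude demo.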

Why he matters if you’re learning AI: interpretability is the hinge that alignment and safety swing on. If we can’t see what a model is computing, we’re stuck evaluating it like a black box and hoping. Olah’s bet — and it’s increasingly paying off — is that models are not inscrutable, just complicated, and that with enough patience you can read them. For a developer, his writing is also the single best tutorial on how neural networks actually work internally. Start with his blog, then the Transformer Circuits thread.

Key Articles & Papers

Understanding LSTM Networks (2015) — The explainer that taught a generation of ML engineers how recurrent nets actually work. Still unmatched.
Feature Visualization (2017) — With Alexander Mordvintsev and Ludwig Schubert — the definitive guide to generating images that show what vision neurons 'want to see'.
The Building Blocks of Interpretability (2018) — Shows how feature visualization, attribution, and dimensionality reduction compose into an interactive interface for understanding a model.
Zoom In: An Introduction to Circuits (2020) — The manifesto for mechanistic interpretability: treat individual neurons like biology and reverse-engineer the network.
Multimodal Neurons in Artificial Neural Networks (2021) — The CLIP investigation that found single neurons firing for 'Spider-Man' across photos, drawings, and the word itself — grandmother cells, basically.
A Mathematical Framework for Transformer Circuits (2021) — Anthropic's opening move on transformers — reformulates attention heads as composable circuits you can read.
Toy Models of Superposition (2022) — Explains why neurons are polysemantic: models cram more features than dimensions into superposition. Load-bearing for everything that followed. (See the sketch after this list.)
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023) — Sparse autoencoders pull clean, interpretable features out of a small transformer — proof of concept for reading LLMs.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (2024) — The same technique, now working on a production frontier model. Millions of features, and you can steer the model by clamping them.
colah.github.io (ongoing) — His personal blog is still the best long-form introduction to thinking visually about deep learning.
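As promised above, a sketch of the Toy Models of Superposition setup: n_feat sparse features get squeezed through a bottleneck of n_hid < n_feat dimensions and read back out as ReLU(WᵀWx + b), as in the paper. Dimensions, sparsity, and training length are illustrative; the point is that under sparsity the learned columns of W pack many nearly-orthogonal feature directions into few dimensions:

```python
import torch
import torch.nn.functional as F

# Toy Models of Superposition, minimally: reconstruct sparse features x
# from a bottleneck h = W x via x_hat = ReLU(W^T h + b).
torch.manual_seed(0)
n_feat, n_hid, p_active = 20, 5, 0.05

W = torch.randn(n_hid, n_feat, requires_grad=True)
b = torch.zeros(n_feat, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for _ in range(5000):
    mask = (torch.rand(256, n_feat) < p_active).float()  # each feature rarely active
    x = mask * torch.rand(256, n_feat)
    h = x @ W.T                     # squeeze 20 features into 5 dimensions
    x_hat = F.relu(h @ W + b)       # linear readout + ReLU, as in the paper
    loss = F.mse_loss(x_hat, x)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Many more than n_hid columns of W end up near unit norm and nearly
# pairwise orthogonal: the geometric signature of superposition.
cols = F.normalize(W.T, dim=1)
interference = (cols @ cols.T).abs().triu(diagonal=1).max()
print(f"loss={loss.item():.4f}, max |cos| between features={interference.item():.3f}")
```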

Spotify Podcasts

#107 – Chris Olah on what the hell is going on inside neural networks
#108 – Chris Olah on working at top AI labs without an undergrad degree
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity
Serendipity, weird bets, & cold emails that actually work: Career advice from 16 former guests

Related People

Dario Amodei (pioneer)
© 2026 PrometheusRoot