Head of AI Safety at US AI Safety Institute, NIST
Paul Christiano
Profile
Paul Christiano is the researcher whose fingerprints are on nearly every chatbot you’ve ever talked to. In 2017, while at OpenAI, he co-authored Deep Reinforcement Learning from Human Preferences with Jan Leike, Dario Amodei, and others. That paper is widely credited with introducing reinforcement learning from human feedback (RLHF): training a model against a reward signal learned from human comparisons of its outputs. Applied to language models over the next few years, first in InstructGPT and then in ChatGPT, the technique is the reason a raw next-token predictor became something you’d actually want to talk to. If you’ve ever wondered what turned GPT-3 from a weird autocomplete toy into a useful assistant, this is it.
He left OpenAI in 2021 to found the Alignment Research Center (ARC), a small nonprofit focused on the theoretical side of making AI systems honest and controllable. ARC’s best-known output is the Eliciting Latent Knowledge (ELK) report with Mark Xu — a research program that asks a deceptively simple question: how do you train a model to tell you what it actually “knows,” especially when the truth is inconvenient? ARC also did the early dangerous-capability evaluations on GPT-4 before release.
In April 2024 Christiano was appointed Head of AI Safety at the U.S. AI Safety Institute within NIST, where he now runs evaluations of frontier models for national-security-relevant capabilities. The appointment reportedly drew objections from some NIST staff over his effective altruism ties, but he took the job anyway. As of 2026 he leads AI safety at the renamed Center for AI Standards and Innovation.
For developers learning AI, Christiano is worth paying attention to for two reasons. First, RLHF, the technique behind the assistants you interact with every day, traces back to his work. Second, he’s one of the few safety researchers who is technically deep, publicly honest about uncertainty (he puts his own odds of an AI catastrophe somewhere around 20%), and not doom-shouting. When he says something is a problem, it’s worth taking seriously.
Key Articles & Papers
Deep Reinforcement Learning from Human Preferences
Eliciting Latent Knowledge (ELK)
My views on 'doom'
Current work in AI alignment
Thoughts on the impact of RLHF research
ARC evaluations of GPT-4
Videos
YouTube
Spotify Podcasts