
Former OpenAI alignment lead, now at Anthropic

Jan Leike

Alignment Lead — Anthropic

Profile

Jan Leike is one of the most recognizable names in AI alignment, and the person whose 2024 resignation from OpenAI forced the entire industry to have an uncomfortable conversation about whether frontier labs are actually investing in safety. A German-born researcher trained under Marcus Hutter at the Australian National University, Leike spent years at DeepMind before joining OpenAI, where he eventually co-led the Superalignment team alongside Ilya Sutskever. That team was announced in July 2023 with a headline commitment: 20% of OpenAI’s compute dedicated to solving alignment of superintelligent systems within four years.

It didn’t last. In May 2024, Leike resigned publicly, posting a thread on X saying he had “been disagreeing with OpenAI leadership about the company’s core priorities for quite some time” and that “over the past years, safety culture and processes have taken a backseat to shiny products.” Within days OpenAI dissolved the Superalignment team entirely. The episode became a defining moment for AI safety discourse — a senior researcher at the world’s most hyped AI lab publicly saying the resources simply weren’t there. He joined Anthropic two weeks later, announcing his new mission: “scalable oversight, weak-to-strong generalization, and automated alignment research.”

Before the drama, Leike built the technical foundations that much of modern alignment rests on. His 2017 paper with Paul Christiano on deep reinforcement learning from human preferences is the direct ancestor of RLHF — the technique that made ChatGPT feel usable. His 2018 “Scalable Agent Alignment via Reward Modeling” sketched recursive reward modeling as a route to scalable oversight, a close cousin of Christiano's iterated amplification and part of the lineage that Anthropic's Constitutional AI later drew on. He also led the work on summarizing books with human feedback and the 2023 weak-to-strong generalization paper, which asked whether weaker supervisors (including humans) can still align models stronger than themselves.
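The core mechanism of that 2017 paper is simple enough to sketch. Below is a minimal, illustrative reward-modeling loop in PyTorch: a small network scores two candidate segments, and a Bradley-Terry-style cross-entropy loss pushes the score gap to match the human's pairwise choice. Everything here (the architecture, the feature shapes, the synthetic "human" preference) is an assumption chosen for brevity, not the paper's actual code.

```python
# Sketch: learning a reward model from pairwise human preferences,
# in the spirit of Christiano, Leike et al. (2017). Illustrative only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a fixed-size feature vector; stands in for a trajectory or response encoder."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per item

def preference_loss(r_a, r_b, prefer_a):
    # Bradley-Terry model: P(A preferred over B) = sigmoid(r(A) - r(B)).
    # Cross-entropy between that probability and the human label.
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, prefer_a)

dim = 16
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    a = torch.randn(32, dim)  # batch of "segment A" features
    b = torch.randn(32, dim)  # batch of "segment B" features
    # Synthetic stand-in for a human: prefers the segment with the larger first feature.
    prefer_a = (a[:, 0] > b[:, 0]).float()
    loss = preference_loss(model(a), model(b), prefer_a)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the original paper the scored items were short clips of agent behavior and the learned reward fed a reinforcement learning loop; in the RLHF pipeline behind ChatGPT, the same objective is applied to model responses.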

For developers, Leike matters because he represents a specific bet: that alignment is an engineering problem you can make progress on if you actually staff it. His blog at aligned.substack.com is one of the clearest windows into how a senior alignment researcher actually thinks about the problem — not doom prophecy, not dismissal, but work.

Key Articles & Papers

Deep Reinforcement Learning from Human Preferences (2017) — The foundational RLHF paper: training agents from pairwise human comparisons instead of hand-crafted rewards. The direct technical ancestor of ChatGPT.
Scalable Agent Alignment via Reward Modeling: A Research Direction (2018) — Leike's blueprint for aligning AI systems more capable than their supervisors, with recursive reward modeling as a path to scalable oversight.
Recursively Summarizing Books with Human Feedback (2021) — Using recursive task decomposition and human feedback to summarize entire novels; an early demo of scalable oversight in practice.
Introducing Superalignment (2023) — The announcement that defined the era: OpenAI pledged 20% of its compute to solving superintelligence alignment within four years, with Leike and Sutskever co-leading.
Weak-to-Strong Generalization (2023) — Can weaker models (or humans) successfully supervise stronger models? Leike's team ran the first empirical study of the question (a toy sketch of this setup follows the list).
Why I'm Leaving OpenAI (2024) — The resignation thread. Calm, specific, and damning; the moment AI safety concerns stopped being abstract for the industry.
Aligned — Jan Leike's Substack (2024) — His personal blog on alignment research, scalable oversight, and how to actually make progress on the problem.
Personal Research Site (2024) — Publication list and current projects; the canonical source for what Leike is actually working on.
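To make the weak-to-strong question concrete, here is a toy version of the experimental setup, using scikit-learn stand-ins chosen purely for brevity (the actual paper, from Leike's OpenAI team, used GPT-2-class supervisors and GPT-4-class students on NLP tasks, so every model and dataset below is an assumption): train a deliberately handicapped "weak" model on ground truth, let it label fresh data, train a more capable "strong" model only on those noisy labels, and check whether the student beats its supervisor on held-out ground truth.

```python
# Toy illustration of the weak-to-strong generalization setup.
# Models and data are illustrative stand-ins, not the paper's code.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# shuffle=False keeps the informative features first, so the weak model
# below can plausibly learn something from its restricted view.
X, y = make_classification(n_samples=6000, n_features=40, n_informative=10,
                           shuffle=False, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=1000, random_state=0)
X_student, X_test, y_student, y_test = train_test_split(X_rest, y_rest,
                                                        test_size=2000, random_state=0)

# 1. "Weak supervisor": an underpowered model trained on ground truth,
#    restricted to the first 5 features.
weak = LogisticRegression(max_iter=200).fit(X_sup[:, :5], y_sup)

# 2. The weak model labels fresh data; these labels are noisy.
weak_labels = weak.predict(X_student[:, :5])

# 3. "Strong student": a more capable model trained ONLY on the weak labels.
strong = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
strong.fit(X_student, weak_labels)

# The weak-to-strong question: does the student outperform its supervisor
# on held-out ground truth, despite never seeing a true label?
print("weak supervisor accuracy:", weak.score(X_test[:, :5], y_test))
print("strong student accuracy: ", strong.score(X_test, y_test))
```

The paper's headline finding was that strong students often recover a meaningful fraction of the performance gap above their weak supervisors, which is evidence (though far from proof) that human oversight of superhuman systems might not be hopeless.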

Spotify Podcasts

24 - Superalignment with Jan Leike
#159 – Jan Leike on OpenAI's massive push to make superintelligence safe in 4 years or less
#23 - How to actually become an AI alignment researcher, according to Dr Jan Leike
EA - OpenAI's massive push to make superintelligence safe in 4 years or less (Jan Leike on the 80,000 Hours Podcast) by 80000 Hours
Sam Altman WRECKS OpenAI - Jan Leike joins Anthropic - Brain Drain from OpenAI | Artificial Intelligence Masterclass
LW - Ilya Sutskever and Jan Leike resign from OpenAI by Zach Stein-Perlman
Ilya Sutskever and Jan Leike leave OpenAI
AI, Robot
KI-Update kompakt: Jan Leike, OpenAI, Jameda AI Assistant, Rabbit R1

Related People

Ilya Sutskever (pioneer) · Dario Amodei (pioneer)