MacArthur Fellow, AI common sense researcher
Yejin Choi
Profile
Yejin Choi is the researcher who keeps asking the question the rest of the field would rather skip: do these models actually understand anything? She’s spent the better part of two decades trying to teach machines common sense — the unglamorous, unwritten knowledge that lets a human know ice cream melts, knives cut, and you can’t push a rope. In 2022 the MacArthur Foundation gave her a “genius” grant for the work. In 2023 TIME named her one of the 100 most influential people in AI.
For most of her career she was the Wissner-Slivka Professor at the University of Washington's Paul G. Allen School of Computer Science & Engineering and a senior research manager at the Allen Institute for AI (AI2), where she led the Mosaic project on commonsense reasoning. In January 2025 she joined Stanford HAI as the Dieter Schwarz Foundation HAI Professor while also serving as Senior Director of Language and Cognition Research at NVIDIA. She's now openly skeptical of pure scaling and is pushing the field toward smaller, more grounded models trained on human norms rather than scraped web text.
Her technical legacy is substantial. COMET and ATOMIC turned commonsense reasoning into a benchmark and a knowledge graph the field could actually work on. Delphi asked whether a neural network could make moral judgments — and produced a controversy and a pile of follow-up research when it sometimes got things absurdly wrong. And the nucleus-sampling paper she co-authored (“The Curious Case of Neural Text Degeneration”) gave us top-p sampling, which is now in basically every text generation pipeline. If you’ve tuned top_p in an API call, you’ve used her work.
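Nucleus sampling is simple enough to sketch in a few lines. Below is a minimal, illustrative Python/NumPy version; the function name and toy vocabulary are ours rather than from the paper or any library, but the cutoff logic follows the paper's idea: sample only from the smallest set of tokens whose cumulative probability exceeds top_p, and throw away the unreliable long tail.

```python
# Minimal sketch of nucleus (top-p) sampling, after Holtzman et al. (2020).
# nucleus_sample and the toy logits below are illustrative, not a library API.
import numpy as np

def nucleus_sample(logits: np.ndarray, top_p: float = 0.9, rng=None) -> int:
    """Sample a token id from the smallest set of tokens whose
    cumulative probability exceeds top_p."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over the vocabulary
    order = np.argsort(probs)[::-1]          # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose mass exceeds top_p (always keeps >= 1 token).
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize
    return int(rng.choice(nucleus, p=nucleus_probs))

# Toy example: a 5-token vocabulary.
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])
print(nucleus_sample(logits, top_p=0.9))
```

Lowering top_p shrinks the nucleus toward greedy decoding; pushing it toward 1.0 recovers full-distribution sampling, degenerate tail and all.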
For developers learning AI, Choi is the corrective voice worth keeping in your head. Models that ace the bar exam still fail at “if I put a candle in a microwave, what happens?” Her benchmarks — HellaSwag, WinoGrande, the various follow-ups — exist specifically to find these gaps. She’s not a doomer and not a hype merchant. She’s the one reminding everyone that pattern-matching at scale is not the same as understanding, and that the gap matters when you’re building anything that has to deal with the real world.
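To see what one of those gaps looks like concretely, the benchmark items are easy to inspect. A minimal sketch, assuming HellaSwag is available on the Hugging Face Hub under the id "hellaswag" (some mirrors use "Rowan/hellaswag") with the ctx/endings/label fields of that copy:

```python
# Peek at one HellaSwag item: a context plus four candidate endings,
# adversarially filtered so wrong ones fool models but not humans.
# Dataset id and field names are assumptions about the public Hub copy.
from datasets import load_dataset

ds = load_dataset("hellaswag", split="validation")
item = ds[0]
print(item["ctx"])                  # the context to be continued
for i, ending in enumerate(item["endings"]):
    print(f"  ({i}) {ending}")      # four candidate continuations
print("gold:", item["label"])       # index of the human-written ending
```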
Key Articles & Papers
The Curious Case of Neural Text Degeneration
COMET: Commonsense Transformers for Automatic Knowledge Graph Construction
ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning
HellaSwag: Can a Machine Really Finish Your Sentence?
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Can Machines Learn Morality? The Delphi Experiment
The Curious Case of Commonsense Intelligence
(Comet-) Atomic 2020: On Symbolic and Neural Commonsense Knowledge Graphs
Controversies
Delphi (2021) drew sharp criticism when users posted screenshots of the system producing obviously bad moral judgments — including racially insensitive outputs and absurd context-free verdicts. Critics including Margaret Mitchell argued that framing a neural net as a moral oracle was itself the problem, regardless of accuracy. Choi and her team responded that Delphi was always an experiment intended to expose the gap between AI and human ethics, not to deploy moral judgment, and added clearer disclaimers and a paper documenting the limitations. The episode became a useful case study in how research demos get read in public.