Percy Liang

Associate Professor of Computer Science and CRFM Director, Stanford University
Senior Fellow, Stanford Institute for Human-Centered AI (HAI)

Profile

Percy Liang is the person you have to reckon with the moment you stop trusting a model’s marketing and start asking how it was actually measured. An associate professor of computer science at Stanford and director of the Center for Research on Foundation Models (CRFM), Liang has spent the last decade building the infrastructure that keeps the AI field honest: the datasets everyone trains on, the benchmarks everyone cites, and increasingly the vocabulary everyone uses. His group’s 2021 report coined the term “foundation models,” and if you’ve ever argued about what a language model can and can’t do, you were probably standing on ground he helped survey.

Before the LLM era, Liang was one of the most influential figures in natural language processing. His lab produced SQuAD, the Stanford Question Answering Dataset that became the default reading-comprehension benchmark for years, along with foundational work in semantic parsing (turning natural language into executable logical forms). Just as importantly, his lab was among the first to puncture the hype: the 2017 “Adversarial Examples for Evaluating Reading Comprehension Systems” paper showed that models scoring near-human on SQuAD collapsed when you appended a distracting sentence — an early, uncomfortable demonstration that benchmark scores and genuine understanding are not the same thing. That skepticism is the throughline of his entire career.

Today Liang is best known for HELM (Holistic Evaluation of Language Models), the open-source framework that treats evaluation as a first-class scientific problem rather than a leaderboard. HELM’s insight is that “which model is best?” is the wrong question; you have to measure accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency together, across dozens of scenarios, with every prediction transparent and reproducible. HELM Safety extended this to risk categories like fraud, deception, and discrimination. For developers, HELM is the closest thing the field has to an independent, transparent referee — the place you check when a new model claims the crown. (HELM entered maintenance mode in mid-2026 as Liang’s focus shifted, but its methodology remains the reference standard.)

Liang is also a co-founder of Together AI and, more radically, the driving force behind Marin — an “open lab” that practices what he calls open development. Marin goes past open-weights and open-source: every experiment, including the failures, is preregistered and live on GitHub for anyone to inspect, critique, or run themselves. It’s an argument that frontier AI research shouldn’t happen behind closed doors, and it puts Liang firmly on the transparency side of the industry’s defining fault line. For someone learning AI today, he’s worth studying precisely because he insists on the unglamorous parts — rigorous measurement and open process — that separate real capability from a good demo.

Key Articles & Papers

On the Opportunities and Risks of Foundation Models (2021). The 200-page CRFM report that coined the term 'foundation models' and set the research agenda for the entire field. Essential context for why we talk about AI the way we do now.

Holistic Evaluation of Language Models (HELM) (2022). The paper behind HELM. It argues that evaluating LLMs means measuring many metrics across many scenarios transparently, not just chasing one accuracy number.

SQuAD: 100,000+ Questions for Machine Comprehension of Text (2016). The reading-comprehension benchmark that shaped a generation of NLP research and taught the field how to build large crowdsourced datasets.

Adversarial Examples for Evaluating Reading Comprehension Systems (2017). Showed that models 'acing' SQuAD fell apart under simple adversarial edits, an early, influential warning that benchmark scores can mask a lack of real understanding.

Semantic Parsing on Freebase from Question-Answer Pairs (2013). A landmark in semantic parsing: learning to map natural-language questions to executable database queries from answers alone.

Considerations for Governing Open Foundation Models (2024). A Science policy piece arguing that regulation of open models should be evidence-based, since there's little proof they raise marginal risk over existing tech.

Introducing Marin: An Open Lab for Building Foundation Models (2025). Lays out 'open development': building frontier models fully in the open, with preregistered experiments anyone can inspect, review, or run.

Controversies

Liang isn’t a scandal figure, but he’s a deliberate voice in the field’s most contested debate: how open foundation models should be. He has publicly pushed back on arguments that open-weight models pose uniquely dangerous risks, contending in his Stanford policy work that there’s little compelling evidence they increase marginal harm relative to existing technology — and that regulation should therefore be grounded in evidence rather than speculation. Critics on the AI-safety side argue this underweights the difficulty of controlling capabilities once weights are released. It’s a genuine, unresolved disagreement about the trade-off between openness and safety, and Liang is one of the most credible advocates for the open, transparent-by-default position.