FlashAttention creator, efficiency breakthrough
Tri Dao
Profile
If you’ve ever wondered why modern language models can ingest a 200k-token document without your GPU catching fire, thank Tri Dao. His 2022 paper FlashAttention, written while he was a Stanford PhD student under Christopher Ré, didn’t invent a new kind of attention. It just computed the existing one efficiently. By being ruthlessly honest about the GPU memory hierarchy (HBM is slow, SRAM is fast, stop shuttling data between them unnecessarily), he made attention 2-4x faster and cut its memory use 10-20x without changing the math. Every serious transformer training run now uses some descendant of his kernel.
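Here is that idea as a minimal sketch in plain PyTorch, assuming nothing beyond a recent torch install. Attention is computed one tile of keys and values at a time with an online softmax, so the full N×N score matrix never exists; the function name and block size are invented for illustration.

```python
import torch

def tiled_attention(q, k, v, block=128):
    """Single-head exact attention computed one key/value tile at a time.

    Illustrative only: the real FlashAttention kernel fuses this loop into
    one CUDA kernel and keeps each tile in on-chip SRAM instead of
    round-tripping through HBM.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))  # running softmax max per query
    row_sum = torch.zeros(n, 1)                  # running softmax denominator
    for start in range(0, n, block):
        kb = k[start:start + block]              # one tile of keys
        vb = v[start:start + block]              # matching tile of values
        scores = (q @ kb.T) * scale              # (n, block) partial scores
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)  # rescale earlier tiles
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

The correction factor is the whole trick: a new tile can raise the running row maximum, so previously accumulated sums get rescaled rather than recomputed, and the result matches naive attention exactly.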
FlashAttention wasn’t a one-off. He shipped FlashAttention-2 in 2023 and FlashAttention-3 in 2024, each squeezing more out of successive Nvidia architectures (Ampere, then Hopper’s async tensor cores). Then he co-authored Mamba with Albert Gu — a state-space model that challenges the “attention is all you need” orthodoxy by scaling linearly with sequence length instead of quadratically. Mamba-2 unified SSMs and attention under one theoretical framework. Mamba-3 landed at ICLR 2026.
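To see where the linear scaling comes from, here is a toy, non-selective state-space recurrence in PyTorch. It is a sketch under simplifying assumptions, not Mamba’s implementation: Mamba makes A, B, and C input-dependent (the “selective” part) and replaces the Python loop with a hardware-aware parallel scan, and every name below is illustrative.

```python
import torch

def ssm_scan(x, A, B, C):
    """Toy diagonal state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = <C, h_t>.

    A sketch, not Mamba's code: Mamba makes A, B, C input-dependent
    ("selective") and computes the recurrence with a hardware-aware scan.
    """
    L, d = x.shape
    h = torch.zeros(d, A.shape[-1])           # fixed-size state, regardless of L
    ys = []
    for t in range(L):                        # one pass: O(L), not O(L^2)
        h = A * h + B * x[t].unsqueeze(-1)    # fold token t into the state
        ys.append((h * C).sum(-1))            # read out one output token
    return torch.stack(ys)

L, d, n = 1024, 16, 8
x = torch.randn(L, d)
A = torch.rand(d, n) * 0.9                    # decay in (0, 1) keeps the scan stable
B, C = torch.randn(d, n), torch.randn(d, n)
print(ssm_scan(x, A, B, C).shape)             # torch.Size([1024, 16])
```

The state h stays the same size no matter how long the sequence grows, which is exactly the trade against attention’s all-pairs score matrix.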
Since 2023 he’s worn two hats: Assistant Professor at Princeton and Founding Chief Scientist at Together AI, the open-source inference and training company. He’s not the CEO — he’s the systems researcher making the stack go fast. His pitch, consistent across talks: the next order of magnitude in AI comes from co-designing algorithms with hardware, not from scaling blindly.
For developers learning AI, Dao is the clearest example of why systems knowledge still matters in the transformer era. You can use FlashAttention without understanding it. But reading his papers is the fastest way to internalize why modern ML is hardware-constrained — and why the person who understands both the math and the memory hierarchy has an unreasonable advantage.
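Concretely, “using it without understanding it” is often one line of PyTorch: scaled_dot_product_attention is a real torch 2.x API that, on supported CUDA GPUs with half-precision inputs, can dispatch to a FlashAttention-style fused kernel. The shapes below are arbitrary.

```python
import torch
import torch.nn.functional as F

# PyTorch's fused attention entry point. On supported CUDA GPUs with
# fp16/bf16 inputs it can dispatch to a FlashAttention-based kernel;
# on CPU it falls back to a slower reference path.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q, k, v = (torch.randn(1, 8, 4096, 64, device=device, dtype=dtype)
           for _ in range(3))                 # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                              # torch.Size([1, 8, 4096, 64])
```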
Key Articles & Papers
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
State Space Duality (Mamba-2) Part I - The Model
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Controversies
None of note. Dao is a researcher’s researcher — heads-down, ship-the-paper, let-the-benchmarks-argue. The closest thing to controversy around his work is the ongoing attention-vs-SSM debate, which is a healthy technical disagreement rather than anything personal.