FlashAttention creator, efficiency breakthrough
Tri Dao
Profile
If you’ve ever wondered why modern language models can ingest a 200k-token document without your GPU catching fire, thank Tri Dao. His 2022 paper FlashAttention, written while he was a Stanford PhD student under Christopher Ré, didn’t invent a new kind of attention. It just computed the existing one efficiently. By being ruthlessly honest about the GPU memory hierarchy (HBM is slow, SRAM is fast, stop shuttling data between them unnecessarily), he made attention 2-4x faster and cut its memory use 10-20x without changing the math. Every serious transformer training run now uses some descendant of his kernel.
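Here is that idea as a minimal sketch in plain PyTorch, assuming nothing beyond a recent torch install. Attention is computed one tile of keys and values at a time with an online softmax, so the full N×N score matrix never exists; the function name and block size are invented for illustration.

```python
import torch

def tiled_attention(q, k, v, block=128):
    """Single-head exact attention computed one key/value tile at a time.

    Illustrative only: the real FlashAttention kernel fuses this loop into
    one CUDA kernel and keeps each tile in on-chip SRAM instead of
    round-tripping through HBM.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))  # running softmax max per query
    row_sum = torch.zeros(n, 1)                  # running softmax denominator
    for start in range(0, n, block):
        kb = k[start:start + block]              # one tile of keys
        vb = v[start:start + block]              # matching tile of values
        scores = (q @ kb.T) * scale              # (n, block) partial scores
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)  # rescale earlier tiles
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

The correction factor is the whole trick: a new tile can raise the running row maximum, so previously accumulated sums get rescaled rather than recomputed, and the result matches naive attention exactly.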
FlashAttention wasn’t a one-off. He shipped FlashAttention-2 in 2023 and FlashAttention-3 in 2024, each squeezing more out of successive Nvidia architectures (Ampere, then Hopper’s async tensor cores). Then he co-authored Mamba with Albert Gu — a state-space model that challenges the “attention is all you need” orthodoxy by scaling linearly with sequence length instead of quadratically. Mamba-2 unified SSMs and attention under one theoretical framework. Mamba-3 landed at ICLR 2026.
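To see where the linear scaling comes from, here is a toy, non-selective state-space recurrence in PyTorch. It is a sketch under simplifying assumptions, not Mamba’s implementation: Mamba makes A, B, and C input-dependent (the “selective” part) and replaces the Python loop with a hardware-aware parallel scan, and every name below is illustrative.

```python
import torch

def ssm_scan(x, A, B, C):
    """Toy diagonal state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = <C, h_t>.

    A sketch, not Mamba's code: Mamba makes A, B, C input-dependent
    ("selective") and computes the recurrence with a hardware-aware scan.
    """
    L, d = x.shape
    h = torch.zeros(d, A.shape[-1])           # fixed-size state, regardless of L
    ys = []
    for t in range(L):                        # one pass: O(L), not O(L^2)
        h = A * h + B * x[t].unsqueeze(-1)    # fold token t into the state
        ys.append((h * C).sum(-1))            # read out one output token
    return torch.stack(ys)

L, d, n = 1024, 16, 8
x = torch.randn(L, d)
A = torch.rand(d, n) * 0.9                    # decay in (0, 1) keeps the scan stable
B, C = torch.randn(d, n), torch.randn(d, n)
print(ssm_scan(x, A, B, C).shape)             # torch.Size([1024, 16])
```

The state h stays the same size no matter how long the sequence grows, which is exactly the trade against attention’s all-pairs score matrix.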
Since 2023 he’s worn two hats: Assistant Professor at Princeton and Founding Chief Scientist at Together AI, the open-source inference and training company. He’s not the CEO — he’s the systems researcher making the stack go fast. His pitch, consistent across talks: the next order of magnitude in AI comes from co-designing algorithms with hardware, not from scaling blindly.
For developers learning AI, Dao is the clearest example of why systems knowledge still matters in the transformer era. You can use FlashAttention without understanding it. But reading his papers is the fastest way to internalize why modern ML is hardware-constrained — and why the person who understands both the math and the memory hierarchy has an unreasonable advantage.
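Concretely, “using it without understanding it” is often one line of PyTorch: scaled_dot_product_attention is a real torch 2.x API that, on supported CUDA GPUs with half-precision inputs, can dispatch to a FlashAttention-style fused kernel. The shapes below are arbitrary.

```python
import torch
import torch.nn.functional as F

# PyTorch's fused attention entry point. On supported CUDA GPUs with
# fp16/bf16 inputs it can dispatch to a FlashAttention-based kernel;
# on CPU it falls back to a slower reference path.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q, k, v = (torch.randn(1, 8, 4096, 64, device=device, dtype=dtype)
           for _ in range(3))                 # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                              # torch.Size([1, 8, 4096, 64])
```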
Key Articles & Papers
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
State Space Duality (Mamba-2) Part I - The Model
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Controversies
None of note. Dao is a researcher’s researcher — heads-down, ship-the-paper, let-the-benchmarks-argue. The closest thing to controversy around his work is the ongoing attention-vs-SSM debate, which is a healthy technical disagreement rather than anything personal.