Flash Attention creator, efficiency breakthrough

Tri Dao

Co-Founder & Chief Scientist — Together AI

Profile

If you’ve ever wondered why modern language models can ingest a 200k-token document without your GPU catching fire, thank Tri Dao. His 2022 paper FlashAttention — written while he was a Stanford PhD student under Christopher Ré — didn’t invent a new kind of attention. It just computed the existing one the way the hardware wanted. By being ruthlessly honest about the GPU memory hierarchy (HBM is slow, SRAM is fast, stop shuttling data between them unnecessarily), he made attention 2-4x faster while using 10-20x less memory, without changing the math. Every serious transformer training run now uses some descendant of his kernel.
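
To make the trick concrete, here is a minimal sketch of the idea in plain NumPy. It is an illustration, not Dao’s CUDA kernel: the `tile` size is an arbitrary stand-in for what fits in SRAM, and real implementations fuse all of this into a single GPU kernel. The point is that exact softmax attention can be computed one key/value tile at a time, carrying only a running max and running sum per query row, so the full N x N score matrix never materializes.

```python
# Minimal sketch of the FlashAttention idea (not Dao's kernel): exact
# softmax(Q K^T / sqrt(d)) V, streamed over K/V tiles with an online softmax.
import numpy as np

def flash_attention(Q, K, V, tile=64):
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, V.shape[1]))            # output accumulator (unnormalized)
    m = np.full(N, -np.inf)                  # running max of scores per query row
    l = np.zeros(N)                          # running sum of exp(scores - m)
    for start in range(0, K.shape[0], tile): # stream one "SRAM-sized" tile at a time
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        S = (Q @ Kt.T) * scale               # scores against this tile only
        m_new = np.maximum(m, S.max(axis=1))
        corr = np.exp(m - m_new)             # rescale old accumulators to the new max
        P = np.exp(S - m_new[:, None])
        l = l * corr + P.sum(axis=1)
        O = O * corr[:, None] + P @ Vt
        m = m_new
    return O / l[:, None]                    # normalize once at the end

# Sanity check against the naive version that materializes the N x N matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
assert np.allclose(flash_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)
```

The exp(m - m_new) correction is the online-softmax step: it rescales everything accumulated so far to the new running max, which is what keeps the streamed result exactly equal to the naive one.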

FlashAttention wasn’t a one-off. He shipped FlashAttention-2 in 2023 and FlashAttention-3 in 2024, each squeezing more out of successive Nvidia architectures (Ampere, then Hopper’s async tensor cores). Then he co-authored Mamba with Albert Gu — a state-space model that challenges the “attention is all you need” orthodoxy by scaling linearly with sequence length instead of quadratically. Mamba-2 unified SSMs and attention under one theoretical framework. Mamba-3 landed at ICLR 2026.
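
To see where the linear-versus-quadratic claim comes from, here is a toy recurrence in the same spirit, under loud assumptions: `toy_selective_ssm` and the weights Wa/Wb/Wc are hypothetical names of mine, the sigmoid gating is only a stand-in for Mamba’s input-dependent parameterization, and the real model uses a hardware-aware parallel scan rather than a Python loop. What it does show is the shape of the computation: a fixed-size state updated once per token.

```python
# Toy selective state-space recurrence (illustrative only, not Mamba's
# implementation): a fixed-size state h is updated once per token, so cost
# grows linearly with sequence length L instead of quadratically.
import numpy as np

def toy_selective_ssm(x, Wa, Wb, Wc):
    L, d = x.shape
    n = Wa.shape[1]                              # state size: fixed, independent of L
    h = np.zeros(n)
    y = np.empty((L, d))
    for t in range(L):
        a = 1.0 / (1.0 + np.exp(-(x[t] @ Wa)))   # input-dependent decay gate in (0, 1)
        b = x[t] @ Wb                            # input-dependent write to the state
        h = a * h + b                            # O(n) update: the whole "memory" is h
        y[t] = h @ Wc                            # readout
    return y                                     # total work O(L * n * d): linear in L

rng = np.random.default_rng(0)
L, d, n = 1024, 16, 64
x = rng.standard_normal((L, d))
Wa, Wb, Wc = (rng.standard_normal(s) for s in [(d, n), (d, n), (n, d)])
print(toy_selective_ssm(x, Wa, Wb, Wc).shape)    # (1024, 16)
```

Attention must compare each new token against every previous one, hence the quadratic cost; the SSM compresses history into h, so every step costs the same no matter how long the sequence already is.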

Since 2023 he’s worn two hats: Assistant Professor at Princeton and Founding Chief Scientist at Together AI, the open-source inference and training company. He’s not the CEO — he’s the systems researcher making the stack go fast. His pitch, consistent across talks: the next order of magnitude in AI comes from co-designing algorithms with hardware, not from scaling blindly.

For developers learning AI, Dao is the clearest example of why systems knowledge still matters in the transformer era. You can use FlashAttention without understanding it. But reading his papers is the fastest way to internalize why modern ML is hardware-constrained — and why the person who understands both the math and the memory hierarchy has an unreasonable advantage.

Key Articles & Papers

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022) — The original paper. Same attention, radically better execution. Won ICML 2022 Outstanding Paper runner-up and changed how the field thinks about kernel-level efficiency.

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023) — A 2x speedup over the original by fixing work partitioning across thread blocks and warps. The version most training stacks adopted first.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (2024) — Retargets the kernel for Hopper H100 GPUs — async tensor cores, FP8 — for another 1.5-2x bump. A masterclass in exploiting what hardware actually exposes.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces (2023) — Co-authored with Albert Gu. A serious alternative to attention, with 5x higher inference throughput and linear scaling. The paper that made SSMs credible for language modeling.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (2024) — Shows attention and SSMs are two views of the same underlying matrix structure. Yields Mamba-2, which is 2-8x faster than Mamba-1. Elegant theory that actually ships.

State Space Duality (Mamba-2) Part I - The Model (2024) — Tri’s own blog series explaining the SSD framework without the arXiv formalism. The easiest way in if the paper is dense.

Hungry Hungry Hippos: Towards Language Modeling with State Space Models (2022) — The H3 paper — the predecessor work that narrowed the gap between SSMs and transformers and set up Mamba. Useful if you want the full arc.
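
The “Transformers are SSMs” duality above is easiest to see in its simplest special case: causal linear attention with a scalar per-step decay. The quadratic form materializes a T x T masked score matrix; the recurrent form updates one fixed-size state per token; both compute the same output. A minimal sketch (toy shapes, no softmax, and not the paper’s blocked SSD algorithm):

```python
# One masked matrix, two algorithms: the core of the SSD duality in its
# simplest special case (scalar decay, no softmax). Toy shapes throughout.
import numpy as np

rng = np.random.default_rng(1)
T, dk, dv = 128, 8, 8
Q, K = rng.standard_normal((T, dk)), rng.standard_normal((T, dk))
V = rng.standard_normal((T, dv))
a = rng.uniform(0.8, 1.0, size=T)            # per-token decay (selectivity stand-in)

# Quadratic "attention-style" form: O = (L * (Q K^T)) V with a decay-weighted
# causal mask L, materialized as a full T x T matrix.
Lmask = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        Lmask[t, s] = np.prod(a[s + 1:t + 1]) # decay accumulated from step s to t
O_quad = (Lmask * (Q @ K.T)) @ V

# Linear "SSM-style" form: one fixed dk x dv state, one O(dk*dv) update per token.
S = np.zeros((dk, dv))
O_rec = np.empty((T, dv))
for t in range(T):
    S = a[t] * S + np.outer(K[t], V[t])      # state update
    O_rec[t] = Q[t] @ S                      # readout

assert np.allclose(O_quad, O_rec)            # same matrix, two algorithms
```

That one structured matrix admitting both a quadratic attention-style algorithm and a linear recurrent algorithm is the point of the SSD framework; Mamba-2 gets its speed by mixing the two granularities, quadratic within blocks and recurrent across them.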

Controversies

None of note. Dao is a researcher’s researcher — heads-down, ship-the-paper, let-the-benchmarks-argue. The closest thing to controversy around his work is the ongoing attention-vs-SSM debate, which is a healthy technical disagreement rather than anything personal.

Spotify Podcasts

Ep 74: Chief Scientist of Together.AI Tri Dao On The End of Nvidia's Dominance, Why Inference Costs Fell & The Next 10X in Speed