The Hundred-Page Language Models Book
hands-on with PyTorch
Burkov's bet is that you can teach someone how language models actually work — mathematically, not just metaphorically — in the time it takes to read a long magazine article. The bet mostly pays off.
The book follows a deliberate arc: you don't get transformers on page one. Burkov starts with the machine learning foundations most LLM tutorials skip or handwave — gradient descent, loss functions, the geometry of optimization — then walks through count-based models and RNNs before earning the right to explain attention. This is the right pedagogy. Readers who've watched ten YouTube explainers about transformers but still feel vaguely confused about why they replaced RNNs will find the sequencing clarifying in a way those videos are not. The transformer chapter, which covers self-attention, query-key-value mechanics, positional encoding, and the decoder-only architecture that powers GPT-style models, is the book's strongest section. Diagrams and PyTorch implementations sit side by side with the math, and both earn their page count.
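For readers who want to see what the query-key-value mechanics look like in practice, here is a minimal sketch of single-head causal self-attention in PyTorch. It is not taken from the book. The function name `causal_self_attention`, the random weight matrices, and the tiny dimensions are all illustrative assumptions, and positional encoding and multi-head splitting are left out for brevity.

```python
import math
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over one sequence (illustrative sketch).

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns (output, weights) with shapes (seq_len, d_head) and (seq_len, seq_len).
    """
    # Project each token into query, key, and value vectors.
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    # Scaled dot-product similarity between every query and every key.
    scores = q @ k.T / math.sqrt(k.shape[-1])

    # Causal mask: position i may only attend to positions <= i,
    # as in a decoder-only model.
    seq_len = x.shape[0]
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))

    # Each row becomes a probability distribution over visible positions.
    weights = torch.softmax(scores, dim=-1)

    # Output for each token is a weighted average of the value vectors.
    return weights @ v, weights

# Hypothetical toy sizes and random weights, chosen only for demonstration.
torch.manual_seed(0)
seq_len, d_model, d_head = 4, 8, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)      # one d_head-sized vector per token
print(weights)        # lower-triangular attention pattern
```

Unlike an RNN, nothing here is computed step by step: every token's output comes from the same few matrix multiplications over the whole sequence, which is a large part of why the architecture parallelizes so well.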
The LLM chapter, on scale, fine-tuning, and prompt engineering, is where the compression starts to cost something. Burkov covers instruction tuning and RLHF at a level that orients the reader without equipping them. You'll understand what these techniques are and why they matter; you won't understand the implementation decisions that make them work or fail at scale. The treatment of training data (what goes in, how it's cleaned, why it matters) is thin to the point of being misleading. A reader who finishes this chapter believing they understand how a frontier model was trained understands only part of the picture. That's a reasonable tradeoff for a hundred-page book, but the gap is worth naming explicitly so that no one mistakes the coverage for complete.
The "read-first, buy-later" distribution model is worth mentioning because it shapes what kind of book this is. Burkov published all chapters freely online, then asked readers to pay if they found value. That incentive, to write something good enough that people choose to support it, tends to sharpen prose. This book is tighter than most ML textbooks ten times its length, and no chapter is there as padding. The PyTorch code runs on Google Colab, which removes the setup friction that stops many readers of technical books before chapter two.
The critics who call it shallow aren't wrong, but they're applying the wrong standard. This isn't a book for people who want to implement a production training pipeline. It's a book for technical people (developers, data scientists, engineering managers) who want to stop nodding along when someone mentions self-attention and understand the mechanism instead. For that audience it's close to optimal: dense enough to be worth your time, short enough to actually finish.
If you're already deep in ML research, skip it. If you've been building on top of LLMs without understanding how they work, this is the fastest path currently available to a real mental model.