The Hundred-Page Language Models Book

hands-on with PyTorch

Burkov's bet is that you can teach someone how language models actually work — mathematically, not just metaphorically — in the time it takes to read a long magazine article. The bet mostly pays off.

The book follows a deliberate arc: you don't get transformers on page one. Burkov starts with the machine learning foundations most LLM tutorials skip or handwave — gradient descent, loss functions, the geometry of optimization — then walks through count-based models and RNNs before earning the right to explain attention. This is the right pedagogy. Readers who've watched ten YouTube explainers about transformers but still feel vaguely confused about why they replaced RNNs will find the sequencing clarifying in a way those videos are not. The transformer chapter, which covers self-attention, query-key-value mechanics, positional encoding, and the decoder-only architecture that powers GPT-style models, is the book's strongest section. Diagrams and PyTorch implementations sit side by side with the math, and both earn their page count.
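For readers who want to see the mechanism the transformer chapter describes, here is a minimal sketch of single-head scaled dot-product attention with a causal mask, written in plain Python rather than the book's PyTorch so that it runs with no dependencies. It is not the book's code. The function names and the toy vectors are assumptions made for illustration, and the token vectors are used directly as queries, keys, and values; a real model would first pass them through learned linear projections.

```python
import math


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: one vector (list of floats) per token.
    Returns (outputs, weights); weights[i] covers positions 0..i only.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Causal mask: token i may only look at itself and earlier tokens,
        # which is what makes a decoder-only model autoregressive.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        out = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: three tokens with two-dimensional vectors (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for i, row in enumerate(weights):
    print(f"token {i} attends over {len(row)} position(s): {row}")
```

The loop over tokens is only for readability. Each token's scores depend on nothing computed for the other tokens, so a tensor library can evaluate all of them in one matrix multiplication, which is the contrast with an RNN's step-by-step recurrence.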

The LLM chapter — on scale, fine-tuning, and prompt engineering — is where the compression starts to cost something. Burkov covers instruction tuning and RLHF at a level that orients the reader without equipping them. You'll understand what these techniques are and why they matter; you won't understand the implementation decisions that make them work or fail at scale. The treatment of training data — what goes in, how it's cleaned, why it matters — is thin to the point of being misleading. A reader who finishes this chapter believing they understand how a frontier model was trained understands maybe a third of it. That's a reasonable tradeoff for a hundred-page book, but it's worth naming explicitly rather than implying the coverage is complete.

The "read-first, buy-later" distribution model is worth mentioning because it shapes what kind of book this is. Burkov published all chapters freely online, then asked readers to pay if they found value. That constraint — write something so good people choose to support it — tends to sharpen prose. This book is tighter than most ML textbooks ten times its length. There are no chapters included for tenure-padding reasons. The PyTorch code runs on Google Colab, which removes the setup friction that kills many technical books before chapter two.

The critics who call it shallow aren't wrong, but they're applying the wrong standard. This isn't a book for people who want to implement a production training pipeline. It's a book for technical people — developers, data scientists, engineering managers — who want to stop nodding along when someone mentions self-attention and actually understand the mechanism. For that audience it's close to optimal: dense enough to be worth your time, short enough to actually finish.

If you're already deep in ML research, skip it. If you've been building on top of LLMs without understanding how they work, this is the fastest path to a real mental model that currently exists.

Key takeaways

Understanding transformers requires climbing the rungs below them: count-based models and RNNs reveal exactly what problem attention was invented to solve.
The decoder-only transformer, not the full encoder-decoder design, is the architecture behind most modern autoregressive language models.
Self-attention lets every token attend to every other token simultaneously via queries, keys, and values — eliminating the sequential bottleneck that capped RNN performance.
Scale is not a trick: adding parameters, data, and context length is what produces the emergent reasoning capabilities that define large language models.
Instruction finetuning is what turns a next-token predictor into a model that follows directions — without it, an LLM just continues your text.
Hallucinations are not bugs; they are a predictable output of a model trained to generate plausible next tokens without being anchored to verified facts.
Prompt engineering works because LLMs are sensitive to framing — systematic techniques like few-shot examples and chain-of-thought produce reliably better outputs.
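
To make the first takeaway concrete, here is a small sketch of a count-based bigram model, the kind of lower rung the book climbs before attention. It is an illustration under assumptions, not the book's implementation: the function names and the toy corpus are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    # Count how often each token follows each preceding token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    # Turn raw follow-up counts into a probability distribution.
    # An unseen context yields an empty dict: the model has nothing to say.
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}


corpus = "the cat sat on the mat the cat ran".split()
counts = train_bigram(corpus)
probs = next_token_probs(counts, "the")
print(probs)
```

The model conditions on exactly one previous token and returns nothing for a context it has never seen. Those two limits, a fixed short context and no generalization across contexts, are the problems that RNNs and then attention were introduced to address.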