by Lewis Tunstall, Leandro von Werra, and Thomas Wolf

Published: 2022
Publisher: O'Reilly Media, Inc.
Pages: 408
ISBN-13: 9781098136789

Natural Language Processing with Transformers, Revised Edition

The transformer architecture rewrote the rules of NLP in 2017, and within five years it became nearly impossible to work in machine learning without touching Hugging Face's library — which makes this book, written by three of the people who built it, something worth paying attention to.

*Natural Language Processing with Transformers* is a practitioner's manual, not a research survey. Tunstall, von Werra, and Wolf walk through the full range of NLP tasks — text classification, named entity recognition, question answering, summarization, generation — and show exactly how to get transformer models working on each one. The code runs. The datasets are real. The problems are the problems you actually hit in production: too little labeled data, models too large to deploy, latency that makes the whole thing unusable. Each chapter is structured around a specific task and a specific challenge, which keeps the book from becoming encyclopedic mush.

Thanks to language, thoughts have become airborne and highly contagious brain germs—and no vaccine is coming.
— Géron, *Natural Language Processing with Transformers, Revised Edition*, Foreword

Where the book earns its place is in the production and efficiency chapters. The chapter on making transformers efficient — covering knowledge distillation, quantization, ONNX export, and weight pruning — is one of the clearest treatments of the subject available anywhere. The authors show concretely what each technique buys you and what it costs, with actual benchmark numbers rather than vague promises. The chapter on few-shot and zero-shot learning covers real decision-making terrain: when to use zero-shot classification, when to try embedding lookup, when to fine-tune, and how to evaluate the tradeoffs without fooling yourself. This is the kind of material that typically lives scattered across blog posts and papers; having it collected and ordered is genuinely useful.

Like many great scientific breakthroughs, it was the synthesis of several ideas, like attention, transfer learning, and scaling up neural networks, that were percolating in the research community at the time.
— Tunstall, von Werra, Wolf, *Natural Language Processing with Transformers, Revised Edition*, Preface

The book is weaker when it gets further from its authors' home turf. The question answering chapter, which introduces Haystack and the retriever-reader architecture, covers interesting ground but leans on the Haystack API heavily enough that readers will need to track library changes on their own — a real problem for a book published in a fast-moving field. The future directions chapter on sparse attention and multimodal transformers is necessarily thin; by the time you're reading this, much of it has already been superseded. These aren't fatal flaws, but they're the parts to read critically.

This reduces the time it takes a practitioner to train and test a handful of models from a week to a single afternoon!
— Tunstall, von Werra, Wolf, *Natural Language Processing with Transformers, Revised Edition*, ch. 1

The underlying architecture explanation in chapter three, where the authors build a transformer encoder from scratch in PyTorch, is excellent — not because it's novel but because it's done without unnecessary mystification. The scaled dot-product attention derivation, the positional embedding discussion, the taxonomy of encoder-only versus decoder-only versus encoder-decoder models: all of it is clear and grounded. If you've read other explanations of self-attention and still felt like you were missing something, this chapter probably fixes that.

Who will find it most useful: data scientists and ML engineers who already know Python and have some PyTorch or TensorFlow experience, and who want a single resource that covers the practical span from fine-tuning to deployment. Researchers who want deep theoretical treatment should look elsewhere. People completely new to deep learning will struggle. But for the working practitioner trying to apply transformers to real problems in 2022 and shortly after, there was nothing more useful.

Key takeaways

Self-attention replaced recurrence because it lets the model process all tokens in parallel and attend to any position in the sequence simultaneously, eliminating the information bottleneck that capped RNN performance on long inputs.
Transfer learning — pretrain on massive unlabeled text, then fine-tune on a small labeled dataset — is the core reason transformers work so well across tasks; without it, each task would require millions of labeled examples.
The Hugging Face ecosystem (Hub, Transformers, Datasets, Tokenizers) collapsed the gap between reading a paper and shipping a model from weeks of engineering to an afternoon of code.
Knowledge distillation combined with INT8 quantization and the ONNX Runtime can cut model size by 4× and latency by 3×–6× with negligible accuracy loss, making production deployment practical on CPU.
A single multilingual model like XLM-RoBERTa, fine-tuned on one language, transfers surprisingly well to related languages without any labeled data in those languages — zero-shot cross-lingual transfer is real and usable.
When labeled data is scarce, zero-shot NLI-based classification frequently outperforms a supervised baseline trained on fewer than fifty examples, because the NLI model already encodes semantic relationships the small dataset cannot teach.
Extractive QA at scale requires a retriever–reader pipeline: the retriever sets a hard ceiling on performance, so optimizing recall there matters more than squeezing another point out of the reader.

A 2022 cookbook from inside the kitchen

This is a 2022 book about Hugging Face Transformers, written by three of the people who built Hugging Face Transformers. That’s the strength and the blind spot. Lewis Tunstall, Leandro von Werra, and Thomas Wolf know exactly how every surface in the library works because they wrote those surfaces. Reading them explain a tokenizer or a Trainer is reading the source code lightly narrated. The downside is that the book often presents the Hugging Face way of doing things as if it were the only way, which was already a stretch in 2022 and is plainly wrong by 2026.

The argument across the eleven chapters is straightforward: the transformer architecture, plus pretrained weights, plus task-specific fine-tuning, plus the Hugging Face ecosystem, lets a working data scientist solve most of the NLP tasks worth solving. Each chapter pairs one task with one dataset and ships a working Jupyter notebook. The pedagogy is relentlessly hands-on. Aurélien Géron’s foreword captures the vibe — he writes that “thoughts have become airborne and highly contagious brain germs — and no vaccine is coming.” Cute. The operative claim underneath the metaphor is that the ideas in this book should infect a production system within a single afternoon.

That afternoon claim mostly held in 2022. It holds less well now. We’ll get to that. First, the parts of the book that have aged best.

The chapter that earns the price of the book

Chapter 3, “Transformer Anatomy,” is the part we’d hand a junior engineer who needs to understand what is actually happening inside one of these models. It builds the encoder side of the architecture from scratch in PyTorch — token embeddings, scaled dot-product attention, multi-head attention as a list of single-head modules, the position-wise feed-forward block, layer normalization, positional embeddings — and the implementation is short enough to read in one sitting and type out in one more.

The supermarket analogy for self-attention is the best one we’ve seen in print. You’re shopping with a recipe in hand: each ingredient on the list is a query, each shelf label is a key, what you grab off the shelf is a value. Real self-attention is the smooth version: every label matches every ingredient to some degree, and you walk out with a weighted basket. From there the math is short: three linear projections of the embeddings into queries, keys, and values; a dot product as the similarity function; scaling by the square root of the key dimension so the softmax does not saturate; a softmax to turn the scores into weights; a weighted sum of the values. Done.
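
To make the weighted basket concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is not the book’s PyTorch code: it assumes the queries, keys, and values have already been produced by the three linear projections, and the function names are ours.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """queries and keys are lists of d_k-dim vectors; values is a list of d_v-dim vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs (made up): one query, two identical keys, two different values.
outputs, weights = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [4.0, 2.0]],
)
```

Because both keys match the query equally, the weights split evenly and the output is the plain average of the two value vectors.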

What this chapter does that most tutorials get wrong: it makes you build the multi-head attention block as a Python list of single-head blocks before showing the more efficient batched version. It uses BertViz to visualize how attention weights distribute across heads on real sentences, including the classic example of the word “flies” attending to “arrow” in one sentence and to “fruit” in another. Read this chapter, then re-read the original Vaswani paper. The paper is dense; the chapter is the unpacking. We’d argue you don’t really understand transformers until you’ve done the implementation in this chapter once by hand.

The decoder treatment is shorter and weaker. Masked self-attention gets a paragraph and a torch.tril trick. The encoder-decoder cross-attention layer is left as homework. If you want to follow a generative model end-to-end, supplement this chapter with Andrej Karpathy’s nanoGPT or minGPT — the book even points you there.
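
For readers who want the masking idea without PyTorch, the following is a rough dependency-free sketch. It builds the same lower-triangular pattern that `torch.tril` applied to a matrix of ones would produce, then zeroes out attention to future positions. The helper names are ours, not the book’s.

```python
import math

def causal_mask(n):
    # Row i may attend to positions 0..i only (lower-triangular ones).
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    # Masked scores become -inf, so exp() gives them exactly zero weight.
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because a position can always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# The second token (row 1) sees tokens 0 and 1, never token 2.
row_weights = masked_softmax([0.0, 0.0, 0.0], mask[1])
```

With equal raw scores, the second token spreads its attention evenly over itself and the first token, and gives the third token nothing.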

The chapter closes with a tour of the model zoo: BERT, RoBERTa, DistilBERT, ALBERT, ELECTRA, DeBERTa, the GPT line, T5, BART, M2M-100, BigBird. By 2026 most of these are legacy weights that nobody trains from. But the architectural taxonomy the chapter draws — encoder-only for understanding, decoder-only for generation, encoder-decoder for translation and summarization — is still the conceptual map of the field. The names of the models change every six months. The taxonomy doesn’t.

A task tour with diminishing returns

Chapters 2 and 4 through 7 each take one task, one dataset, one model, and walk through the fine-tuning. Chapter 2 fine-tunes DistilBERT on emotion classification of tweets. Chapter 4 fine-tunes XLM-RoBERTa on multilingual named entity recognition and shows cross-lingual zero-shot transfer working surprisingly well from German into French and Italian — and worse for English, which the chapter handles honestly rather than handwaving past. Chapter 5 covers decoding strategies for text generation. Chapter 6 fine-tunes PEGASUS on the SAMSum dialogue summarization dataset. Chapter 7 builds an extractive question-answering system with the Haystack library on Amazon product reviews.

Chapter 5 is the most generally useful of these. The decoding strategies it covers — greedy search, beam search with n-gram penalties, temperature sampling, top-k, nucleus — are still what every text generation API exposes today, including the ones built on much bigger models. The intuition for when to use which carries cleanly forward. Greedy or beam for arithmetic and factual lookup, where you want the most probable answer. Sampling with moderate temperature for creative text. Nucleus sampling when you want diversity without the long tail of off-topic tokens. None of this is invalidated by GPT-4 or Claude or Llama. It’s the same machinery underneath.
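
The sampling knobs are easier to reason about once you see how little code they are. This is a sketch over a single next-token distribution, with made-up logits and our own function names. It omits beam search and n-gram penalties, which need state across steps.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    # Always pick the single most probable token.
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_filter(probs, k):
    # Keep the k most probable tokens and renormalize.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_filter(probs, top_p):
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(dist, rng=random):
    # dist maps token id -> probability.
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

logits = [2.0, 1.0, 0.0, -1.0]          # toy next-token logits
probs = softmax_with_temperature(logits)
next_token = sample(nucleus_filter(probs, top_p=0.9))
```

Top-k cuts the tail at a fixed number of tokens. Nucleus sampling cuts it at a fixed probability mass, so the number of surviving tokens adapts to how confident the model is.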

Chapter 7’s framing of question answering as retrieve-then-read is still load-bearing in 2026. The book uses BM25 (a sparse retriever) and dense passage retrieval for the retrieval step, and a span-extraction model on top. What the rest of the world now calls retrieval-augmented generation is structurally the same architecture; the difference is that the reader is now a generative model rather than a span extractor. The chapter actually closes with a brief section on RAG using a 2020-era model from Patrick Lewis and colleagues, which reads as prophetic in retrospect. If you read this chapter and squint, you can see the entire 2024-2026 RAG industry being foreshadowed.
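
As a reminder of what the sparse half of that retrieval step computes, here is a small sketch of Okapi BM25 scoring in plain Python. It is not Haystack’s implementation: the whitespace tokenizer, the `k1` and `b` defaults, the smoothed IDF variant, and the toy review snippets are all our assumptions. The reader stage is left out.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1; b normalizes for document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve(query, docs, top_k=2):
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:top_k]]

reviews = [
    "the battery life is great",
    "the screen is too dim",
    "battery drains fast and battery gets hot",
]
candidates = retrieve("battery life", reviews, top_k=2)
```

Whatever sits on top, a span extractor in 2022 or a generative model now, can only answer from the passages this step returns.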

The summarization chapter has aged worse. Fine-tuning a 568M-parameter PEGASUS encoder-decoder on dialogue data made sense in 2022. In 2026, a generic instruction-tuned LLM produces better summaries with a one-line prompt and zero training. The chapter’s discussion of ROUGE and BLEU as evaluation metrics is still useful — those metrics are flawed in well-understood ways, the chapter is honest about it, and we still use them anyway because nothing better has emerged at scale. But the workflow of grabbing a 2020-era pretrained summarization model and fine-tuning it on your own corpus is not where most teams should start anymore.
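
The well-understood flaw is easiest to see in code. Below is a simplified sketch of ROUGE-N as clipped n-gram overlap. It assumes whitespace tokenization and no stemming, so its numbers will not match an official ROUGE package exactly.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection clips each n-gram to its count in the reference.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# A candidate that reverses the meaning still gets perfect unigram recall.
flipped = rouge_n("the service was good", "the service was not good")
```

The metric counts shared words, not shared meaning, which is why a negated summary can score almost as well as a faithful one.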

The classification chapter is the one we’d actually skip on a re-read. The fine-tune-DistilBERT-on-your-data pattern was correct in 2022 and is correct-but-rarely-optimal in 2026. For most production classification problems today, the right comparison is a strong embedding-plus-logistic-regression baseline against an instruction-tuned LLM with a handful of in-context examples. Fine-tuning is the third option, not the first.

The cookbook problem

Across the task chapters, the book has a consistent tic: it presents the Hugging Face way as the way. The Trainer class, the AutoModel hierarchy, the pipeline function, the Datasets library, the Hub — these all appear without alternatives. In 2022 this was reasonable because there were few real alternatives for fine-tuning workflows. In 2026 it’s a problem because the practitioner’s stack has expanded substantially, and many serious teams have moved off the Trainer for anything non-trivial.

The book also blurs the line between “this is how transformers work” and “this is how the Hugging Face library works.” The Trainer class gets pages of explanation. The optimizer schedule, gradient accumulation, mixed-precision training, and learning-rate warmup logic that the Trainer wraps get a paragraph or a footnote. If you’re learning from this book, you should periodically force yourself to write a training loop in raw PyTorch to make sure you understand what happens when you call trainer.train(). The library is a tool. Don’t let the tool become the mental model.
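
As a warm-up for that exercise, here is a toy sketch of two of the pieces a trainer hides: a linear warmup-then-decay learning-rate schedule and gradient accumulation. It fits a single weight with hand-written gradients in plain Python. It is not the Trainer’s implementation, and every name and hyperparameter in it is illustrative.

```python
import math

def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    # Ramp linearly from 0 to peak_lr, then decay linearly back to 0.
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

def train(data, micro_batch_size=2, accumulation_steps=2,
          epochs=50, peak_lr=0.05, warmup_steps=5):
    w = 0.0  # single parameter of the model y = w * x
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    total_steps = epochs * math.ceil(len(micro_batches) / accumulation_steps)
    step = 0
    for _ in range(epochs):
        grad_sum = 0.0
        for i, batch in enumerate(micro_batches, start=1):
            # Gradient of the mean squared error on this micro-batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            # Accumulate instead of updating, scaled so the sum is an average.
            grad_sum += grad / accumulation_steps
            if i % accumulation_steps == 0 or i == len(micro_batches):
                lr = linear_warmup_decay(step, total_steps, warmup_steps, peak_lr)
                w -= lr * grad_sum  # one optimizer step per accumulated group
                grad_sum = 0.0
                step += 1
    return w

# Toy data generated from y = 3x; training should recover w close to 3.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0, 0.5, 1.0, 1.5, 2.0)]
learned_w = train(data)
```

The weight only changes once per accumulated group, so two micro-batches of two examples behave like one batch of four. That is the whole trick behind fitting large effective batch sizes into limited memory.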

The one chapter that breaks this pattern is, again, Chapter 3. When the book builds attention from scratch, it builds attention from scratch. No magic. That’s why we keep coming back to it.

The production chapter pulls its weight

Chapter 8, “Making Transformers Efficient in Production,” is the second chapter we’d hand someone, and the one with the longest shelf life. It walks through four compression techniques on an intent classification task and benchmarks each against a BERT baseline: knowledge distillation, dynamic quantization to INT8, weight pruning, and graph optimization with ONNX Runtime.

The opening case study is borrowed from Roblox: BERT scaled to over a billion daily requests on CPU-only infrastructure by stacking distillation and quantization. That stack — distill a teacher model into a smaller student, then quantize the student to 8-bit integers — is still the bread and butter of shipping transformer models cheaply. The chapter walks through it cleanly. The custom DistillationTrainer that subclasses the standard Trainer to add a Kullback-Leibler divergence term to the cross-entropy loss is one of the cleanest examples of subclassing in the book, and the hyperparameter search with Optuna at the end is a nice touch that we don’t see in many tutorials.
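
The loss being described is small enough to write out by hand. This sketch computes it for a single example in plain Python, assuming the common formulation: a weighted sum of the hard-label cross-entropy and a temperature-softened KL term scaled by the squared temperature. The names `alpha` and `temperature` and the toy logits are ours, not a copy of the book’s DistillationTrainer.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    # Ordinary cross-entropy against the ground-truth label (temperature 1).
    ce = -math.log(softmax(student_logits)[label])
    # Soften both distributions so the teacher's small probabilities carry signal.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student): how far the student is from the teacher.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # The temperature**2 factor keeps the KL gradient scale comparable to the CE term.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl

loss = distillation_loss(
    student_logits=[2.0, 0.5, -1.0],
    teacher_logits=[3.0, 1.0, -2.0],
    label=0,
)
```

Setting `alpha` to 1 recovers plain fine-tuning. Lowering it makes the student lean harder on the teacher’s soft predictions.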

The quantization explanation is good. The book derives the linear affine map from a floating-point range to an 8-bit integer range, walks through the scale factor and zero point, and benchmarks INT8 matrix multiplication at roughly 100x speedup over FP32 on isolated tensors. The result on the actual model is more modest — about a 2x latency improvement on top of an already-distilled student — and the chapter is honest about the gap between microbenchmarks and end-to-end gains.
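
The affine map is worth seeing as code. This is a minimal sketch for a flat list of weights in plain Python, assuming a signed 8-bit target range of -128 to 127 and per-tensor min/max calibration. Real toolkits differ in rounding and calibration details, and the function names are ours.

```python
def quantize_params(f_min, f_max, q_min=-128, q_max=127):
    # Scale: how much floating-point range one integer step covers.
    scale = (f_max - f_min) / (q_max - q_min)
    # Zero point: the integer that (approximately) represents 0.0.
    zero_point = round(q_min - f_min / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, q_min=-128, q_max=127):
    # Map to integers and clamp anything that rounds outside the range.
    return [min(q_max, max(q_min, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    return [scale * (q - zero_point) for q in q_values]

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]   # toy FP32 weights
scale, zero_point = quantize_params(min(weights), max(weights))
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)

# Storage shrinks from 4 bytes to 1 byte per weight.
size_ratio = (4 * len(weights)) / (1 * len(q_weights))
```

The round trip loses at most about one quantization step per weight, and the storage arithmetic is where the familiar 4x size reduction comes from.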

The pruning section is briefer and more theoretical. Magnitude pruning and movement pruning both get coverage. But the chapter admits that current commodity hardware does not accelerate sparse matrix operations enough for pruning to pay off in real latency. That caveat still mostly holds in 2026, though specialized inference engines have started to chip away at it.
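
Magnitude pruning, the simpler of the two, fits in a few lines. This sketch zeroes the smallest-magnitude entries of a toy weight matrix until a target sparsity is reached. It is an illustration under our own naming, and it does not cover movement pruning, which needs training-time scores.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of entries with the smallest |w|."""
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights], [[1] * len(row) for row in weights]
    # Everything at or below this magnitude is removed (ties may prune extra).
    threshold = flat[n_prune - 1]
    mask = [[0 if abs(w) <= threshold else 1 for w in row] for row in weights]
    pruned = [[w * m for w, m in zip(row, mask_row)]
              for row, mask_row in zip(weights, mask)]
    return pruned, mask

toy_weights = [[0.1, -0.8], [0.05, 0.4]]
pruned, mask = magnitude_prune(toy_weights, sparsity=0.5)
```

The pruned matrix has the same shape as the original, which is the hardware problem in miniature: a dense matrix multiply still multiplies by every one of those zeros.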

Two things are missing from the production chapter that we’d want in 2026. There’s no discussion of FlashAttention or any of the attention-kernel optimizations that are now the dominant lever for inference speed on modern GPUs. And there’s no discussion of mixed-precision inference in BF16 or FP16, which is now the default. Both omissions are forgivable for a 2022 book; both are worth flagging if you’re using this chapter as a present-tense reference.

Few-label techniques, mixed bag

Chapter 9, “Dealing with Few to No Labels,” walks through what to do when the labeled-data well is dry. The decision tree at the start of the chapter is genuinely useful. Do you have any labels? How many? Do you have unlabeled data on top of those? The answers route you to different techniques: zero-shot classification, embedding-based nearest-neighbor lookup, data augmentation via synonym replacement, fine-tuning a domain-adapted language model, semi-supervised methods like UDA and UST.

The zero-shot trick using natural language inference is clever and worth knowing even now. Frame each candidate label as a hypothesis sentence (“This text is about banking”) and use a model trained on entailment to score that hypothesis against the input text. The chapter benchmarks this against a Naive Bayes baseline on a GitHub issues tagger and shows zero-shot beating the baseline up to roughly 50 labeled examples per class.

What’s dated is that much of the rest of the chapter — UDA, UST, embedding-lookup classifiers using GPT-2 — has been substantially supplanted by the option of “just give your problem to a frontier LLM with a few-shot prompt.” The book pre-dates that option being affordable or fast. If you’re reading this chapter in 2026, the right way to use it is as a backup plan: when LLM prompting is too expensive at inference time or too slow at scale, these techniques still apply, and the embedding-plus-nearest-neighbor classifier is still a reasonable default for low-stakes high-volume work.
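
For reference, the embedding-plus-nearest-neighbor idea reduces to the sketch below. It assumes the embeddings already exist and uses made-up two-dimensional vectors and labels. It also uses a simple majority vote over the k nearest examples, which is a simplification of the chapter’s setup.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_classify(query_vec, labeled, k=3):
    # labeled is a list of (embedding, label) pairs from the few examples we have.
    ranked = sorted(labeled, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    # Majority vote among the k most similar labeled examples.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical embeddings; in practice these come from an embedding model.
labeled = [
    ([1.0, 0.1], "billing"),
    ([0.9, 0.2], "billing"),
    ([0.1, 1.0], "login"),
    ([0.2, 0.9], "login"),
]
prediction = knn_classify([0.8, 0.3], labeled, k=3)
```

There is no training step at all: adding a new labeled example is just appending another pair, which is much of why this remains attractive for high-volume, low-stakes work.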

One small thing: the book uses GPT-2 for the embedding-lookup classifier, which is a strange choice because GPT-2 was never trained for sentence embeddings. Modern sentence-transformer models do this much better and have for years. The technique is right; the model choice is dated.

Who should read this in 2026

The book is from 2022. It does not cover instruction tuning, RLHF, in-context learning beyond the simplest few-shot prompts, chain-of-thought reasoning, mixture-of-experts at scale, RAG as the dominant deployment pattern, or hands-on work with any model larger than about 11 billion parameters. The largest widely deployed model when this book was written was GPT-3, which was reachable only through OpenAI’s API; the authors mention GPT-Neo and GPT-J as community alternatives. Reading those passages in 2026 is reading a fossil record.

So who is this book actually for now? Three audiences.

First, anyone who wants to understand what’s happening inside a transformer at the level of working code rather than analogy. Chapter 3 alone justifies the book’s price. Read it once, type out the implementation, and you will never have to ask what self-attention “really” is again. We’d assign this chapter to anyone joining an ML team, regardless of whether they’re going to work on transformers directly.

Second, anyone shipping smaller fine-tuned models in production where cost and latency matter more than absolute capability. The compression chapter is still the cleanest treatment of distillation and quantization we know in book form, and the patterns transfer cleanly to modern small models.

Third, anyone who wants the conceptual furniture of the field — the vocabulary, the taxonomy, the named patterns. Encoder-only versus decoder-only versus encoder-decoder. Retriever-reader. Greedy versus beam versus sampling. Knowledge distillation. These terms still mean what they meant in 2022, and the book teaches them well.

We would not hand this book to someone whose primary goal is to build with frontier LLMs in 2026. For that, the practical skills are prompt engineering, RAG architecture design, eval suite construction, and inference cost optimization at scale, and this book covers approximately none of those at the depth a 2026 practitioner needs. It would be a strange first book and a fine second one.

What this book gets right — the architecture, the tooling fundamentals, the production tricks for small models, the decoding strategies — is durable. What it gets wrong, mostly by virtue of being a snapshot of a fast-moving field, is everything that happened after early 2022. As snapshots go, it’s a careful and clear one. Just don’t mistake it for the present.