Deep Learning: Foundations and Concepts

Deep learning didn't arrive from nowhere, and this book argues it makes more sense if you trace it back to its mathematical roots. Bishop and Bishop build from probability theory and Bayes through classical regression, classification, and multilayer perceptrons, all the way to transformers, graph neural networks, and diffusion models — and the argument, implicit throughout, is that the same conceptual thread runs from least-squares regression to a 100-billion-parameter language model. That's an ambitious claim, and it's roughly true.

The book's strongest chapters are in the middle and the back. The generative model coverage — VAEs, normalizing flows, GANs, and especially diffusion models — is as good as anything in the textbook literature right now, and the transformer chapter is more coherent than what you'd piece together from blog posts and papers. Bishop has spent decades learning how to explain hard ideas simply, and when the material plays to those strengths — things that are fundamentally probabilistic and graphical — it shows. The figures alone are worth something: 600 of them, and the good ones make ideas click in ways that three paragraphs of prose wouldn't.

The probabilistic grounding is also where the book occasionally trips. There's a self-contained introduction to probability that works well enough, but some of the mathematical machinery it introduces (design matrices, Moore-Penrose pseudo-inverses) surfaces once and disappears. You can tell the book grew through accretion, with material from the 2006 *Pattern Recognition and Machine Learning* transplanted into new chapters rather than fully rebuilt. A few readers have noticed terminological imprecision: "feed-forward" used loosely, "Markov blanket" and "Markov boundary" treated as synonyms when they're not. The use of "error function" throughout, instead of the standard "loss function," creates its own confusion, since "error function" is also the name of the Gaussian error function (erf), which the book covers too. None of this is fatal, but it suggests a first edition that could use another pass. Recurrent networks get less than they deserve for a pedagogical text, tucked as they are into a subsection of the transformer chapter, which is odd for a book that positions itself as historically grounded.

The book's target is upper-undergraduate or early-graduate students who have some mathematical background and want to understand the field rather than just use it. For that reader, it delivers. Compare it to the Goodfellow-Bengio-Courville *Deep Learning* — the old bible — and this one covers more of the modern landscape while keeping a tighter probabilistic thread. It doesn't replace Murphy's *Probabilistic Machine Learning* for depth, but it's more approachable and more current. If you want to understand why the field works the way it does, not just reproduce results, this is the book to work through.

Key takeaways

Probability theory is not background material to skim — the reason deep learning works is statistical, and every architecture choice makes sense only when you understand the probabilistic model underneath it.
Diffusion models and large language models did not appear from nowhere; they trace directly to variational autoencoders and the theory of latent variable models, and understanding that lineage is the fastest way to understand the current state of generative AI.
The transformer's decisive innovation is the attention mechanism, which lets every element in a sequence directly influence every other, replacing the recurrent bottleneck with global context — and that substitution turns out to generalize far beyond language.
Deep networks learn hierarchical representations: early layers detect low-level structure, deeper layers detect abstractions, and this emergence from raw data, rather than from hand-crafted features, is the core reason deep learning displaced feature-engineered approaches in so many domains.
Convolutional networks win on images because weight sharing exploits translation invariance — the same filter works anywhere in the image — and grasping this principle tells you when convolutional architectures will and will not transfer to other problem domains.
Training deep networks requires active management of gradients: vanishing gradients kill learning in deep layers, and the fixes — normalization, residual connections, careful initialization — are not engineering heuristics but principled responses to well-understood failure modes.
GANs, VAEs, and diffusion models all learn generative distributions, but through fundamentally different mechanisms, and the differences in what they optimize explain their different failure modes, sample quality characteristics, and practical strengths.

A textbook for the post-Transformer era

When Pattern Recognition and Machine Learning came out in 2006, neural networks were a niche corner of machine learning. By 2023, they were the field. Bishop’s PRML was the canonical introduction for a generation of students; the obvious question was whether he’d write the canonical introduction for the next one. Deep Learning: Foundations and Concepts, co-authored with his son Hugh, is that attempt.

The book has serious credentials. Chris Bishop runs AI4Science at Microsoft Research and is a Fellow of the Royal Society; Hugh works on production deep learning at the autonomous-driving company Wayve. The endorsements stack three Turing Award winners (Hinton, LeCun, Bengio) onto the back cover. Springer’s marketing leans on the phrase “the New Bishop.” That’s a heavy lift, and the book mostly carries it, with caveats that matter if you’re deciding which deep-learning textbook to commit a semester to.

What’s actually new

A reasonable first question is how much of this is just PRML in a new jacket. The honest answer is more than the publisher would like to admit. The probability theory in Chapters 2–3, the regression and classification material in 4–5, much of the EM, latent-variable, and sampling content — these are visibly inherited from the 2006 book, sometimes with the same examples and figures. Several Goodreads reviewers who came in expecting a fresh treatment noticed the patch-up immediately, and at least one called it lazy.

The genuinely new chapters are where the value sits. Chapters 12 (Transformers), 17 (GANs), 18 (Normalizing Flows), 19 (Autoencoders), and especially 20 (Diffusion Models) didn’t exist in any meaningful form when PRML was written. The Transformer chapter in particular is one of the cleaner pedagogical introductions to attention you’ll find in a textbook. It walks the architecture from scaled dot-product attention through multi-head, masked, and cross-attention without the breathless tone that makes most popular write-ups exhausting. The Diffusion chapter does similar work for what is, frankly, a hard topic to teach without fluency in stochastic processes.
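For readers who want the progression in concrete form, here is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask, reused for cross-attention. It is our own illustration, not code from the book: the function names, shapes, and toy inputs are assumptions, and the learned query/key/value projections and the multi-head split are left out to keep it short.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q has shape (n_queries, d_k); K and V have shape (n_keys, d_k).
    With causal=True, query i may only attend to keys 0..i.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # True above the diagonal marks "future" positions to hide.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # hypothetical sequence: 4 tokens, width 8
memory = rng.normal(size=(6, 8))   # hypothetical second sequence: 6 tokens

# Masked self-attention: queries, keys, and values all come from `tokens`.
self_out, self_w = attention(tokens, tokens, tokens, causal=True)

# Cross-attention: queries from `tokens`, keys and values from `memory`.
cross_out, cross_w = attention(tokens, memory, memory)
```

The only difference between the self-attention and cross-attention calls is where the keys and values come from, which is roughly the point the chapter builds toward.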

So the right way to think about the book is this: a slimmed, modernised version of PRML’s first half, plus a substantial new second half on contemporary architectures. If you already know PRML, the first nine or ten chapters will feel like review. If you don’t, they’ll feel like prerequisite-friendly throat-clearing.

The probability backbone

What sets this book apart from Deep Learning by Goodfellow, Bengio, and Courville (2016), the book it’s most directly competing with, is the probability backbone. The Bishops refuse to present neural networks as a bag of engineering tricks. Training by backprop is gradient descent on a negative log-likelihood. Regularization is a Bayesian prior. A VAE is an explicit probabilistic model with a learned approximate posterior. A diffusion model is a Markov noising chain whose reversal we learn, in effect by estimating a score.
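The "regularization is a prior" claim can be checked numerically in the simplest case. The sketch below is ours, with made-up data: it assumes linear regression with Gaussian noise of standard deviation sigma and a zero-mean Gaussian prior of standard deviation tau on the weights, and compares the ridge solution at lambda = sigma^2 / tau^2 with the posterior mode.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])      # hypothetical ground-truth weights
sigma, tau = 0.3, 2.0                    # assumed noise and prior std devs
y = X @ w_true + sigma * rng.normal(size=n)

# Ridge regression: minimise ||y - Xw||^2 + lam * ||w||^2.
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Bayesian view: the posterior over w is Gaussian with precision
# I / tau^2 + X^T X / sigma^2, and its mean (also its mode) solves
# precision @ w = X^T y / sigma^2.
precision = np.eye(d) / tau**2 + X.T @ X / sigma**2
w_map = np.linalg.solve(precision, X.T @ y / sigma**2)

print(np.allclose(w_ridge, w_map))
```

The two estimates coincide because multiplying the posterior equation through by sigma^2 recovers the ridge normal equations exactly.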

This commitment has real didactic payoff. Once you’ve absorbed the framing, generative models stop being a zoo of unrelated architectures and start looking like variations on a single theme: what family of distributions are you approximating, and how are you estimating its parameters? Chris and Hugh draw that thread explicitly through the VAE / GAN / Normalizing Flow / Diffusion sequence, and it’s the single best feature of the book. The chapter on continuous latent variables sets up the variational machinery, and by the time you reach diffusion in Chapter 20, the math has already been earned.

The cost is some bloat at the front. Chapters 2–3 cover probability and standard distributions in genuinely PhD-textbook depth. If you’ve taken a stats course, you’ll skim it; if you haven’t, you’ll need it. There are also moments where the math gets introduced and then abandoned. One reviewer noted that the Moore-Penrose pseudoinverse appears on page 116 and never reappears in the rest of the book; we checked, and they’re right. The design-matrix machinery that justifies it gets one paragraph and disappears. This kind of thing makes the book feel like it’s hedging — presenting more math than it needs in order to gesture at rigour, rather than deploying math when it earns its keep.
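For readers who have not met the machinery in question: the Moore-Penrose pseudoinverse of a design matrix gives the least-squares weights directly. A minimal sketch with a cubic polynomial basis and toy data of our own choosing (not the book's example):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x)                  # toy targets, chosen for illustration

# Design matrix: one row per data point, one column per basis function
# (here 1, x, x^2, x^3).
Phi = np.vander(x, N=4, increasing=True)

# Least-squares weights three ways; all should agree when Phi has full
# column rank.
w_pinv = np.linalg.pinv(Phi) @ t                     # pseudoinverse
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # normal equations
w_lstsq = np.linalg.lstsq(Phi, t, rcond=None)[0]     # library solver
```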

Where the book genuinely shines

The generative-model chapters (15–20) are where Deep Learning earns its place over Goodfellow et al. Goodfellow’s book, written before diffusion models rose to prominence and when GANs felt like the future, spends an outsized chunk on adversarial training. The Bishops rebalance this. GANs get one chapter of moderate depth, accurately reflecting their place in the current landscape. Diffusion models get a full chapter that walks through the forward and reverse processes, the connection to score matching, and the variational lower bound. For anyone trying to bridge from “I know what a CNN is” to “I want to understand how Stable Diffusion works,” this is one of the cleanest paths available in print.
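To make the forward process concrete, here is a small sketch of the closed-form noising step used in DDPM-style diffusion models. The linear beta schedule, the number of steps, and the names `alpha_bar` and `q_sample` are assumptions for illustration and may not match the book's notation.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal retention per step

def q_sample(x0, t, rng):
    """Draw x_t given x_0 in one shot:
    x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * noise.
    """
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
x0 = np.full(10_000, 3.0)               # toy "data": every sample equals 3

x_early, _ = q_sample(x0, 10, rng)      # still close to the data
x_late, _ = q_sample(x0, T - 1, rng)    # close to standard Gaussian noise
```

Early in the chain the samples stay near the data; by the last step almost all signal is gone. The reverse process is what the network has to learn, typically by predicting the noise that `q_sample` returns.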

The Transformer chapter is similarly well-pitched. It builds attention from first principles rather than dropping the formula on the reader. The connection between attention, position encoding, and the inductive biases of CNNs and RNNs is drawn out clearly. The chapter even includes a brief but honest discussion of why Transformers won — it wasn’t just more compute, it was the compatibility between the architecture and parallel hardware. This is the kind of context that matters and that papers rarely supply.

The 600 illustrations (200 black-and-white, 400 colour) are genuinely useful. Many books pad their figure count with redundant block diagrams; Deep Learning leans on figures that earn their space, such as geometric pictures of high-dimensional behaviour, before-and-after visualizations of normalization, and latent-space traversals. If you learn visually, this book will feel like it was made for you.

Where it fumbles

The most surprising weakness is how little space RNNs get. The book treats them almost entirely as a precursor to attention, with the main discussion folded into the Transformer chapter as section 12.2.5. This is a mistake — not because RNNs are still state of the art (they aren’t) but because backpropagation through time, the vanishing gradient problem, and the LSTM gating innovations are pedagogically valuable on their own terms. They’re how the field discovered that long-range dependencies are the central challenge of sequence modelling. Skipping past them is like teaching the calculus of variations without mentioning the brachistochrone.
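The vanishing gradient problem is easy to see in a few lines. The sketch below is a deliberately simplified illustration of ours: a tanh recurrence with no inputs or biases, and a recurrent weight matrix scaled so that its spectral norm is 0.5. It tracks the Jacobian of the hidden state with respect to the initial state, which is the product that backpropagation through time multiplies out.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps = 16, 50

# Recurrent weights: an orthogonal matrix scaled to spectral norm 0.5
# (an assumption chosen to make the decay easy to bound).
Q, _ = np.linalg.qr(rng.normal(size=(hidden, hidden)))
W = 0.5 * Q

h = rng.normal(size=hidden)
jacobian = np.eye(hidden)          # d h_t / d h_0, starting at t = 0
norms = []
for _ in range(steps):
    h = np.tanh(W @ h)
    # One-step Jacobian of h_new = tanh(W h_old): diag(1 - h_new^2) @ W.
    step_jacobian = np.diag(1.0 - h**2) @ W
    jacobian = step_jacobian @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))
```

Each step's Jacobian has spectral norm at most 0.5 here, so after k steps the gradient signal from the first time step has shrunk by at least a factor of 0.5^k. Gated architectures such as the LSTM were designed to give gradients a path that avoids this repeated shrinkage.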

Chapter ordering is occasionally puzzling. Chapter 11 is Structured Distributions, covering directed and undirected graphical models, Markov properties, the works. Chapter 13 is Graph Neural Networks. The graphical-models material is exactly the conceptual setup for understanding GNNs, but it sits two chapters away with the unrelated Transformers chapter wedged between. The reviewers in Quantitative Finance flagged this too, and they’re right. It reads like a structural artefact of the book being assembled in pieces rather than designed end-to-end.

There are also small accuracy issues that suggest the book wasn’t pressure-tested as hard as it could have been. Chapter 1 reproduces a GPT-4 transcript proving the infinitude of primes “in the style of a Shakespeare play.” It’s a charming opener, but the proof as transcribed is wrong: the constructed integer is shown to be coprime to the listed primes, but the argument needs it to be prime itself, which isn’t established. In a book teaching mathematical thinking, leaving this uncorrected is a strange choice. Similarly, the Bishops use “error function” throughout where the rest of the field has settled on “loss function.” It’s not wrong, but it’s gratuitously non-standard, and the same symbol E ends up doing double duty for both losses and energy functions, with Erf making cameo appearances. These are small things. They add up.
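The distinction between "coprime to the listed primes" and "prime" is easy to check by hand or by machine. A short sketch of ours, using the first six primes as the list:

```python
from math import prod

def prime_factors(n):
    """Factor n by trial division (fine for small n)."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors

primes = [2, 3, 5, 7, 11, 13]
candidate = prod(primes) + 1                   # 30031
remainders = [candidate % p for p in primes]   # every remainder is 1
factors = prime_factors(candidate)             # [59, 509]
```

The candidate leaves remainder 1 on division by every listed prime, yet it is composite: 30031 = 59 × 509. What Euclid's argument actually uses is that the candidate has some prime factor outside the list, not that the candidate is prime.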

The book also gives the field’s politics a wide berth. There is no honest discussion of what’s still genuinely contested in deep learning — the question of whether the scaling-laws picture is the right paradigm, the open debate over emergent capabilities, the architectural alternatives to attention that have been gaining traction. Bishop’s instinct is to teach what is settled, and that’s defensible for a textbook, but the result is a book that can feel like a snapshot of mid-2023 consensus rather than a guide to the live frontier.

Voice, pedagogy, and the competition

The Bishops’ prose is competent and clear, never showy. They don’t try to be Knuth and they don’t try to be Norvig. The tone is “experienced lecturer,” which is fine, and given Chris Bishop’s actual experience as a lecturer, it’s earned. But the book has none of the aesthetic ambition of Murphy’s Probabilistic Machine Learning series or the warmth of Prince’s Understanding Deep Learning. Several reviewers we read said they preferred Prince. We understand the impulse. Prince writes with more obvious care about the reader, his book is freely available online, and his exercises are integrated with the text rather than tacked onto chapter ends.

What the Bishops have over Prince is breadth. Understanding Deep Learning is an excellent one-semester treatment; Deep Learning: Foundations and Concepts is a two-semester reference. If you’re teaching a course, you can cover Bishop linearly and have something useful in every week. If you’re self-studying, you can use it as a problem-set generator and an honest-to-god index for hard topics. Murphy’s books are deeper but more punishing; Goodfellow is older and increasingly stale. Bishop sits in a real gap.

Who should read it

If you’re a graduate student starting an ML PhD, yes. This is the best single textbook covering the modern deep-learning toolkit at the depth a researcher needs. Pair it with Murphy’s Probabilistic Machine Learning: Advanced Topics (2023) for theoretical depth and Goodfellow et al. for the historical context Bishop skips.

If you’re a working software engineer who wants to understand what’s under the hood of the LLM you’re calling, this is probably overkill, but the diffusion and transformer chapters specifically are worth the price. Skim the probability chapters; you don’t need them for the engineering work.

If you’re an undergraduate just starting out, this is too dense for a first ML book. Read Hands-On Machine Learning by Géron first, or one of the freely available sets of Stanford CS course notes. Come back to Bishop when you’re hungry for a reference.

If you’ve already read PRML, you’ll still learn something new, but expect Chapters 2–9 to be familiar. Skip to the second half. The diffusion chapter alone justifies pulling the book off the shelf.

A last note on the genre. Textbooks in fast-moving fields age. Bishop is candid about this in the preface, where he says he’s chosen to focus on “ideas that are likely to endure the test of time” rather than the latest implementations. That’s the right call. The specific architectures of 2025 won’t all matter in 2030, but the probability theory will, the gradient-based optimization will, and the variational frameworks for generative modelling probably will. Deep Learning: Foundations and Concepts is an honest attempt to teach the durable parts of a non-durable field. It doesn’t always hit. But it hits often enough that we’ll be using it for years.