Machine Learning: A Probabilistic Perspective

The premise is almost stubbornly principled: every machine learning algorithm worth knowing is a special case of probabilistic inference, and you'd understand it better if you treated it that way. Kevin P. Murphy's *Machine Learning: A Probabilistic Perspective* is a 1,100-page argument for that position, and for the most part, it wins.

What Murphy does that most ML textbooks don't is impose a consistent grammar across wildly different techniques. Ridge regression, lasso, naive Bayes, SVMs, hidden Markov models — each one gets derived from the same underlying machinery of likelihoods, priors, and posteriors. That consistency is genuinely useful. The ridge penalty stops looking like a regularization trick and starts looking like a Gaussian prior on the weights. MAP estimation stops looking like a compromise and starts looking like what it is — a point estimate from a posterior distribution, with all the limitations that implies. If you've been floating between different ML communities, each with its own notation and its own mythology about what its methods "really are," this book forces a reconciliation.
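The ridge-as-Gaussian-prior point can be checked numerically. The following is a minimal sketch, not code from the book: it assumes a linear model with Gaussian noise of standard deviation `sigma` and an isotropic Gaussian prior on the weights with standard deviation `tau` (both names and values are illustrative). Under those assumptions the ridge penalty `lam` equals `sigma**2 / tau**2`, and the ridge solution coincides with the posterior mode.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma, tau = 0.5, 2.0  # noise std and prior std (illustrative values)

# Synthetic data drawn from the assumed model
X = rng.normal(size=(n, d))
w_true = rng.normal(scale=tau, size=d)
y = X @ w_true + rng.normal(scale=sigma, size=n)

# Ridge view: minimize ||y - Xw||^2 + lam * ||w||^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Bayesian view: the posterior over w is Gaussian with
# precision X^T X / sigma^2 + I / tau^2; its mean is also its mode (the MAP)
post_precision = X.T @ X / sigma**2 + np.eye(d) / tau**2
w_map = np.linalg.solve(post_precision, X.T @ y / sigma**2)

# The posterior covariance is the part a point estimate throws away
post_cov = np.linalg.inv(post_precision)

print(np.allclose(w_ridge, w_map))  # True: the two estimates coincide
print(np.sqrt(np.diag(post_cov)))   # per-weight posterior standard deviations
```

The last line is the practical difference between the two views: ridge returns only `w_ridge`, while the posterior also says how uncertain each weight is.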

Rather than describing a cookbook of different heuristic methods, this book stresses a principled model-based approach to machine learning.
— Murphy, *Machine Learning: A Probabilistic Perspective*, Preface

The strongest chapters sit in the middle: the treatment of graphical models, latent variable models, and the EM algorithm is methodical and clear. The inference section — variational methods, MCMC, particle filtering — is genuinely difficult material presented without panic. Murphy doesn't hide the computational hardness of exact inference; he takes you through why it's intractable and then teaches you the approximate methods the field actually uses. There's intellectual honesty here that patchwork tutorials can't match. The chapter on frequentist statistics is also unexpectedly good — the critique of confidence intervals and p-values is crisp, and Murphy names what's wrong without preaching about it.
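To make the EM discussion concrete, here is a small sketch of EM for a two-component, one-dimensional Gaussian mixture. It is an illustration under stated assumptions rather than Murphy's own code: the data, the starting values, and the helper `log_gauss` are all hypothetical. The E-step computes each point's responsibilities under the current parameters, and the M-step re-estimates the parameters by weighted maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data from two Gaussians (illustrative values, not from the book)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

def log_gauss(x, mu, var):
    """Log density of N(mu, var) evaluated at x (broadcasts over components)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Initial guesses for mixing weights, means, and variances
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

log_liks = []
for _ in range(50):
    # E-step: responsibilities r[n, k] = p(z_n = k | x_n, current parameters)
    log_joint = np.log(pi) + log_gauss(x[:, None], mu, var)    # shape (N, 2)
    log_norm = np.logaddexp(log_joint[:, 0], log_joint[:, 1])  # log p(x_n)
    r = np.exp(log_joint - log_norm[:, None])
    log_liks.append(log_norm.sum())  # log-likelihood before this update

    # M-step: responsibility-weighted maximum-likelihood updates
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("means:", mu, "weights:", pi)
```

The recorded `log_liks` sequence never decreases, which is the guarantee that makes EM attractive. It only finds a local optimum, so the starting values matter on harder problems.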

For any given model, a variety of algorithms can often be applied. Conversely, any given algorithm can often be applied to a variety of models.
— Murphy, *Machine Learning: A Probabilistic Perspective*, Preface

The weaker parts are predictable. Chapter 28, the deep learning chapter, was already thin at publication in 2012 and reads today like a postcard from just before the flood. Restricted Boltzmann machines and stacked autoencoders were the frontier then; the book ends right where the field was about to explode. This isn't a flaw Murphy could have avoided, but it matters if you're coming to the book now expecting modern coverage. You'll get the mathematical foundations that make transformers comprehensible, but not the architectures themselves. The early printings also had a typo problem severe enough that reviewers flagged equations you couldn't trust — later printings fixed most of it, but it's worth knowing.

This kind of modularity, where we distinguish model from algorithm, is good pedagogy and good engineering.
— Murphy, *Machine Learning: A Probabilistic Perspective*, Preface

The honest comparison is to Bishop's *Pattern Recognition and Machine Learning*: the two books cover similar ground and both call themselves Bayesian. Bishop is tighter and more principled; Murphy is wider and more pragmatic. Murphy's book includes more algorithms and more connections to the working practitioner's toolkit; Bishop's derivations are cleaner. Which one you want depends on whether you'd rather have depth or breadth. For most working practitioners today, Murphy's breadth wins on first pass — and the book rewards returning to specific chapters as you need them more than it rewards linear reading.

This is the book to own if you want a single reference that covers classical machine learning from first principles, in a language that scales from introductory derivations to serious inference problems. Come back to it when you hit a method you don't understand. The first hundred pages won't be where you start, but the index will earn its keep.

Key takeaways

- Casting ML algorithms as probabilistic models puts ridge regression, lasso, and naive Bayes in one framework: ridge and lasso are MAP estimates under Gaussian and Laplace priors, and naive Bayes is a generative classifier with conditionally independent features. SVMs fit the pattern only loosely, since hinge loss is not the negative log of a normalized likelihood.
- MAP estimation gives you a point estimate but discards your uncertainty; the full Bayesian posterior is harder to compute but tells you how much you actually know.
- Graphical models are a language for writing down independence assumptions explicitly, which makes the model interpretable and makes efficient inference algorithms possible.
- The EM algorithm converts a hard maximum-likelihood problem with latent variables into alternating tractable steps (infer the latent variables given the parameters, then update the parameters given those inferences), making it the standard tool for fitting mixture models and HMMs.
- Exact inference is NP-hard in general graphical models and intractable in practice for many interesting ones; variational methods and MCMC are not stopgaps, they are the practical path.
- L1 and L2 regularization correspond to Laplace and Gaussian priors on weights: the same bias toward simplicity expressed two different ways, with different sparsity consequences.
- The kernel trick lets you work in very high-dimensional, even infinite-dimensional, feature spaces without computing the features explicitly, unifying SVMs, Gaussian processes, and kernel regression under one idea.
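
The kernel-trick takeaway can be verified by hand in a small finite case. This is a hedged sketch with hypothetical function names (`phi`, `poly_kernel`, `rbf_kernel`): for the degree-2 polynomial kernel on 2-D inputs, the explicit feature map is short enough to write out, so the inner product in feature space can be compared directly with the kernel value.

```python
import numpy as np

def phi(x):
    """Explicit feature map whose inner product equals (x . z)^2 in 2-D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel, computed in input space."""
    return (x @ z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: its feature space is infinite-dimensional, so there is
    no finite phi to write down; only the kernel value is computable."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)     # inner product after mapping to feature space
implicit = poly_kernel(x, z)   # same quantity without ever building phi

print(np.isclose(explicit, implicit))  # True
print(rbf_kernel(x, z))
```

The polynomial case shows the equivalence directly. The RBF case is where the trick becomes necessary, because the kernel value is the only handle on the feature space.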