Machine Learning Q and AI

The gap between "I know ML basics" and "I can actually do ML work" is not one of effort; it's one of vocabulary and conceptual coverage. Raschka's *Machine Learning Q and AI* exists to close that gap, one question at a time, across 30 short chapters.

Each chapter is a standalone brief: a question, an answer, some diagrams, exercises. The format is honest about what it is — not a coherent narrative, but a knowledge audit. The chapter on encoder vs. decoder transformers doesn't just recite architecture diagrams you've already seen; it clarifies why GPT-style models produce text left-to-right while BERT-style models see the whole sequence at once, and what that means practically when you're choosing which to fine-tune. The chapters on multi-GPU training break down tensor parallelism, pipeline parallelism, and data parallelism in a way that intro courses never bother with because they assume you'll deal with it when you get there. Here, you get there early.
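
The left-to-right versus whole-sequence distinction comes down to which positions each token is allowed to attend to. A minimal sketch in plain Python, assuming a toy sequence of four tokens (the helper names `attention_mask` and `show` are made up for illustration and are not taken from the book):

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len visibility matrix.

    mask[i][j] is True when position i is allowed to attend to position j.
    causal=True  -> decoder-style: only positions j <= i are visible.
    causal=False -> encoder-style: every position is visible.
    """
    return [[(j <= i) or not causal for j in range(seq_len)]
            for i in range(seq_len)]


def show(mask):
    """Render the mask with 'x' for visible and '.' for hidden positions."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


# Toy sequence length, chosen only for readability.
decoder_mask = attention_mask(4, causal=True)
encoder_mask = attention_mask(4, causal=False)

print("decoder-style (causal):")
print(show(decoder_mask))
print("encoder-style (bidirectional):")
print(show(encoder_mask))
```

The lower-triangular pattern is what lets a decoder generate one token at a time without seeing the future; the full pattern is what gives an encoder context from both sides of every token.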

Machine learning and AI are moving at a rapid pace. Researchers and practitioners are constantly struggling to keep up with the breadth of concepts and techniques.
— Raschka, *Machine Learning Q and AI*, Introduction

Where the book is strongest is where ML education is weakest: evaluation. The chapters on confidence intervals, conformal predictions, k-fold cross-validation, and proper metrics are some of the most practically useful pages in any ML book published in the last decade. Most practitioners treat model evaluation as "run it on the test set, report accuracy." Raschka takes evaluation seriously, covering why bootstrapping your test-set predictions can give better confidence estimates than a naive normal approximation, and why conformal predictions offer finite-sample coverage guarantees that asymptotic confidence intervals don't. If you've shipped a model and had no real sense of how uncertain your accuracy estimate was, these chapters will retroactively embarrass you and then fix it. The chapter on proper metrics, which checks familiar losses such as MSE and cross-entropy against the three requirements of a distance metric (non-negativity, symmetry, and the triangle inequality), is worth sending to anyone who has ever invented a custom loss function and wondered why training was unstable.
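
To make the two interval estimates concrete, here is a rough sketch in plain Python. It assumes a hypothetical test set of 100 predictions, 85 of them correct; the data, the function names, and the choice of 2,000 bootstrap rounds are illustrative assumptions, not code or numbers from the book.

```python
import math
import random


def normal_approx_ci(correct, z=1.96):
    """Normal-approximation interval for accuracy (z=1.96 for roughly 95%)."""
    n = len(correct)
    acc = sum(correct) / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width


def bootstrap_percentile_ci(correct, n_rounds=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample the per-example outcomes
    with replacement and read the interval off the sorted accuracies."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_rounds))
    lower = accs[int((alpha / 2) * n_rounds)]
    upper = accs[int((1 - alpha / 2) * n_rounds) - 1]
    return lower, upper


# Hypothetical per-example results: 1 = correct prediction, 0 = wrong.
correct = [1] * 85 + [0] * 15

print("normal approximation:", normal_approx_ci(correct))
print("bootstrap percentile:", bootstrap_percentile_ci(correct))
```

On a sample this size the two intervals would be expected to land close together. The differences tend to matter more with small test sets or accuracies near 0 or 1, where the normal approximation can extend past the valid range.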

Even experienced machine learning researchers and practitioners will encounter something new that they can add to their arsenal of techniques.
— Raschka, *Machine Learning Q and AI*, Introduction

The weaknesses are predictable: 264 pages covering 30 topics means depth is always rationed. Some chapters — Poisson regression, the lottery ticket hypothesis — feel like overviews that gesture at depth without achieving it. The exercises are useful but not difficult. Raschka doesn't pretend otherwise; the book is designed as an entry point into each topic, with references for going deeper, and it delivers on that scope. It's also free to read online now, which removes any remaining excuse for skipping it.

For someone who has worked through an intro ML course and finds research papers occasionally opaque, this is the most efficient remediation available. For someone already working in the field who keeps encountering concepts they half-know — conformal predictions, LoRA, distribution shifts, vision transformer inductive biases — it's a fast audit that surfaces the specific gaps worth fixing. Raschka doesn't waste your time, and neither does the format.

Key takeaways

Vision transformers need far larger training sets than CNNs because they lack the inductive biases — local connectivity, spatial invariance, weight sharing — that CNNs hardwire into their architecture.
Encoder-style transformers (BERT) and decoder-style transformers (GPT) are not interchangeable: bidirectional encoders excel at classification and understanding tasks, while autoregressive decoders are the right tool for generation.
Parameter-efficient fine-tuning methods like LoRA can approach or match full fine-tuning performance by training a small set of added low-rank weights while the original weights stay frozen, making large-model adaptation practical without the compute bill.
Conformal predictions provide finite-sample coverage guarantees that classical confidence intervals cannot — bootstrapping and normal approximations are asymptotic, which means they can mislead you in small-data regimes.
Data-centric AI, meaning systematically improving label quality and data coverage, can yield larger gains than model-centric approaches like architecture search or hyperparameter tuning on real-world problems.
The lottery ticket hypothesis proposes that a large, randomly initialized neural network contains a sparse 'winning ticket' subnetwork that, trained from its original initialization, can match the full network's accuracy, which would mean much of the model's capacity is redundant.
Self-supervised pretraining on unlabeled data followed by fine-tuning on a small labeled set frequently beats training from scratch, making it the default first move when labeled data is scarce.
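
To put a number on the LoRA takeaway above, a back-of-the-envelope sketch: it assumes a single hypothetical 4096 x 4096 weight matrix adapted with rank 8. Both values are illustrative choices rather than figures from the book, and the function names are made up for this example.

```python
def full_params(d_in, d_out):
    """Weights updated when fine-tuning the full d_out x d_in matrix."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Weights in the two low-rank factors LoRA trains instead:
    A with shape (rank x d_in) and B with shape (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in = d_out = 4096
rank = 8

full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
fraction = lora / full

print(f"full fine-tuning: {full:,} trainable weights")
print(f"LoRA (rank {rank}):   {lora:,} trainable weights")
print(f"fraction trained: {fraction:.4%}")
```

Under these assumptions the low-rank factors come to about 0.4% of the full matrix, which is where the compute and memory savings come from.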