Zero to AI Engineer
A reference, not a bootcamp schedule

Zero to AI engineer, organized by topic.

Most "learn AI" material is either a marketing-flavored overview or a single deep paper with no path to it. This is neither: it's a topic-based map from linear algebra to distributed multi-GPU training, built from a verified 56-day engineering curriculum and (for the AI-engineering practice layer specifically) Chip Huyen's AI Engineering. Every topic states what it covers, why the order matters, what to build to prove you understand it, and exactly where the primary source material lives — no invented benchmarks, no numbers we can't point to a source for.

7 topic tracks
56 source curriculum days
7 AI Engineering chapters cited
36 primary sources linked
Start Here

How this reference is organized

Each of the seven tracks below groups related days from the source curriculum into one topic, states the concrete thing you should be able to build by the end, and lists the primary learning resources, verbatim from the source material rather than paraphrased. This is deliberately not a day-by-day schedule. The source curriculum's own daily loop (watch → build in Colab → write a 5-line summary → self-check, with three recall questions on yesterday's material before starting anything new) is the right way to pace yourself once you pick a track. But pacing isn't the point of this page: the point is the map of what to learn, in what order, and why.

The rule that survives contact with reality: code every day you study. A day with video but no code is a day that doesn't stick — this holds whether you're on day 3 of linear algebra or day 50 wiring up FSDP. Print .shape constantly once you reach deep learning; shape bugs are the majority of real debugging time there. And once you reach the GPU track: measure before you optimize, and measure again after — a performance claim without a number is a feeling, not an engineering result.

Why the order matters — it's a dependency chain, not a menu

Every track below is load-bearing for the one after it, all the way through GPU systems: the matrix multiply you hand-code in Track 1 is the CUDA kernel you tile in Track 6 and the tensor-core workload you feed in the same track. This is the argument for not skimming the math, however tempting that is for an experienced developer — broadcasting and the chain rule resurface as silent bugs inside neural networks if you don't have them cold.

vectors & gradients
classical ML
backprop from scratch
attention & transformers
applied LLM engineering
GPU performance
distributed training

This sequence and its framing come directly from the source curriculum's own "why the order matters" section — it is not a claim this site invented independently.

Track 1 · Days 1–7 in the source plan

Math & Python Foundations

Goal: fluency with NumPy/pandas, geometric intuition for vectors and gradients, and one working linear regression trained entirely by hand. Proof you're done: you can explain gradient descent on a whiteboard and have the code to prove it.

1.1

Vectorization, NumPy & pandas

Day 1

Why first: every later track assumes array-native thinking. Loop-based code is the single most common beginner habit that has to be unlearned before anything else sticks.

Covers: why array operations beat loops, array shapes, broadcasting rules, DataFrame slicing with .loc, boolean masks, groupby.

Build to prove it: Time a 1M-element loop against its vectorized equivalent; rewrite three loops as vectorized ops; one small pandas analysis with a derived column. Target: 50×+ speedup, zero for-loops in the data code.
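The timing half of this exercise might look something like the sketch below. The workload (a sum of squares) and the function names are illustrative choices, not taken from the source curriculum, and the speedup you see will depend on your machine.

```python
import time
import numpy as np

# Hypothetical workload: sum of squares over one million floats.
rng = np.random.default_rng(0)
x = rng.random(1_000_000)

def sum_squares_loop(values):
    """Pure-Python loop: one interpreter round trip per element."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_squares_vectorized(values):
    """Single NumPy call: the loop runs inside compiled code."""
    return float(np.dot(values, values))

t0 = time.perf_counter()
loop_result = sum_squares_loop(x)
t1 = time.perf_counter()
vec_result = sum_squares_vectorized(x)
t2 = time.perf_counter()

loop_seconds = t1 - t0
vec_seconds = t2 - t1
print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vec_seconds:.6f} s")
print(f"speedup:    {loop_seconds / vec_seconds:.0f}x")
```

A single timing like this is noisy; repeating each measurement a few times and keeping the best run gives a steadier number.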
1.2

Vectors & matrices

Day 2

Why it matters later: the matrix multiply you hand-code here is the same operation you'll tile into a CUDA kernel in the GPU track and feed to tensor cores after that. The core multiply-accumulate loop stays the same each time, at increasing levels of "how fast can this go."

Covers: vectors as arrows and as data, span, linear transformations, matrix multiplication as composed transformations.

Build to prove it: Implement matrix multiply with explicit loops, verify against np.matmul on random matrices, and visualize a 2×2 transformation acting on a grid of points.
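A minimal sketch of the first two steps, assuming small random matrices. The name matmul_loops is hypothetical, and the grid visualization is left out.

```python
import numpy as np

def matmul_loops(a, b):
    """Multiply a (n x k) by b (k x m) with three explicit loops."""
    n, k = a.shape
    k2, m = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    out = np.zeros((n, m))
    for i in range(n):            # row of the output
        for j in range(m):        # column of the output
            acc = 0.0
            for p in range(k):    # dot product of row i and column j
                acc += a[i, p] * b[p, j]
            out[i, j] = acc
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3))
b = rng.standard_normal((3, 5))
c = matmul_loops(a, b)

print(c.shape)
print(np.allclose(c, np.matmul(a, b)))
```

The innermost loop is the multiply-accumulate that the GPU track later distributes across threads.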
1.3

Derivatives, gradients & partials

Day 3

Why it matters later: the chain rule is backpropagation — not an analogy for it, the literal mechanism. Skimming this is the single most common reason "my net trains but I don't know why" happens later.

Covers: derivative as sensitivity to a nudge; the chain rule; partial derivatives — the gradient is a vector of partials, one per input (the source curriculum flags this explicitly as the jump from single-variable to multivariable calculus that neural nets require).

Build to prove it: Numerically estimate d/dx of f(x) = x² + 3x at five points, compare to the analytic 2x+3, plot error vs. step size on a log scale. Then work one 2-variable gradient by hand: ∇f(w,b) for f(w,b) = (wx+b)² at a fixed x.
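One way to set up the numerical half of this exercise is sketched below. It uses a forward difference (an assumption; the source does not say which finite-difference scheme to use) and prints the error table instead of plotting it. The five sample points and the step sizes are arbitrary.

```python
import numpy as np

def f(x):
    return x**2 + 3 * x

def f_prime(x):
    """Analytic derivative of f."""
    return 2 * x + 3

def forward_diff(func, x, h):
    """Forward-difference estimate of the derivative with step h."""
    return (func(x + h) - func(x)) / h

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
steps = [1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11]

max_errors = []
for h in steps:
    err = np.max(np.abs(forward_diff(f, xs, h) - f_prime(xs)))
    max_errors.append(float(err))
    print(f"h={h:.0e}  max abs error={err:.3e}")
```

For this particular quadratic the forward-difference truncation error works out to exactly h, so the error first shrinks with the step size and then grows again once floating-point rounding dominates. That turnaround is the shape to look for on the log-scale plot.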
1.4

Probability essentials

Day 4

Covers: distributions, mean and variance, sampling, the normal distribution, and why ML treats data as draws from a distribution.

Build to prove it: Simulate 10,000 coin flips and plot the running average converging to 0.5 (law of large numbers); sample from a normal distribution and verify that the sample mean and standard deviation land close to the parameters you sampled with.
1.5

What machine learning actually is

Day 5

Covers: training vs. testing, the bias/variance tradeoff, overfitting.

Build to prove it: Hand-derive the gradient of mean squared error for y = wx + b on paper, then verify it numerically using the Day 3 technique.
1.6

Linear regression from scratch

Days 6–7

Gradient descent end to end: loss surface, learning rate, convergence — no new material, this is purely a build-and-consolidate stretch.

Build to prove it: Pure NumPy trainer: generate noisy linear data, define MSE, update w and b for 1,000 steps, plot the loss curve. Try three learning rates and watch one diverge; the divergent run is as instructive as the working one. Milestone: load a real dataset (housing or weather), fit your scratch regression, and compare coefficients against scikit-learn's LinearRegression.
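A rough sketch of such a trainer follows. The ground-truth slope and intercept, the noise level, and the two learning rates are all made-up values for illustration, and plotting is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w, true_b = 2.0, -1.0                 # hypothetical ground truth
x = rng.uniform(-1.0, 1.0, size=200)
y = true_w * x + true_b + 0.1 * rng.standard_normal(200)

def train(lr, steps=1000):
    """Gradient descent on MSE for y = w*x + b; returns w, b, loss history."""
    w, b = 0.0, 0.0
    losses = []
    for _ in range(steps):
        err = w * x + b - y                # residuals under current w, b
        losses.append(float(np.mean(err ** 2)))
        grad_w = 2.0 * np.mean(err * x)    # d(MSE)/dw
        grad_b = 2.0 * np.mean(err)        # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

w, b, losses = train(lr=0.1)
print(f"lr=0.1  w={w:.3f}  b={b:.3f}  final MSE={losses[-1]:.4f}")

# A step size that is too large for this loss surface: the loss grows.
_, _, diverged = train(lr=1.1, steps=50)
print(f"lr=1.1  first loss={diverged[0]:.3f}  last loss={diverged[-1]:.3e}")
```

With inputs drawn from this range, the curvature of the loss along b is about 2, which is why a learning rate just above 1.0 overshoots on every step rather than settling.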
Track 2 · Days 8–14 in the source plan

Classical Machine Learning

Goal: train, evaluate, and improve real models on tabular data; ship a first Kaggle submission. Proof you're done: a leaderboard entry you can defend, preprocessing choice by preprocessing choice.

2.1

Classification: softmax & logistic regression

Day 8

Why it matters later: softmax reappears inside attention, inside LLM sampling, and inside every classifier after this — the source curriculum is explicit that this is the one place to learn it properly, once.

Covers: decision boundaries, the sigmoid, log loss, odds and log-odds, softmax as sigmoid's multi-class generalization, and why cross-entropy is softmax's natural loss.

Build to prove it: Train logistic regression on a real dataset, plot the confusion matrix, inspect predicted probabilities. Then implement softmax + cross-entropy in ~10 lines of NumPy and verify probabilities sum to 1 and loss falls as logits sharpen toward the correct class.
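The softmax and cross-entropy half could be sketched as below. The toy logits and labels are invented for illustration; subtracting the row maximum before exponentiating is a standard numerical-stability step, not something the source specifies.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-probability assigned to the correct class."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

labels = np.array([0, 2])
base = np.array([[1.0, 0.0, 0.0],      # favors class 0 (correct)
                 [0.0, 0.0, 1.0]])     # favors class 2 (correct)

for scale in (1.0, 2.0, 5.0):          # larger scale = sharper logits
    logits = scale * base
    row_sums = softmax(logits).sum(axis=1)
    print(scale, row_sums, cross_entropy(logits, labels))
```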
2.2

Trees, forests, boosting

Day 9

Covers: decision trees (splits, impurity), why forests reduce variance, the boosting idea.

Build to prove it: Compare a single tree vs. a 100-tree random forest vs. gradient boosting on the same dataset, across five random seeds. Quantify the variance story, don't just eyeball it.
2.3

Model evaluation

Day 10

Covers: cross-validation, the confusion-matrix family, ROC/AUC, precision vs. recall, and data leakage — described in the source material as "the silent killer" of otherwise-plausible model results.

Build to prove it: Hand-roll a 5-fold CV loop with NumPy index splits; verify it matches library cross-validation to 3 decimal places.
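A sketch of the hand-rolled half, assuming a least-squares model on synthetic data so that the example stays dependency-free. The helper kfold_indices and the dataset are hypothetical, and the comparison against a library's cross-validation is left to you (it only matches if both use the same fold assignment).

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

# Hypothetical regression data: three features, known coefficients, small noise.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

fold_mse = []
for train_idx, val_idx in kfold_indices(len(X)):
    # Fit on the training folds only (append a column of ones for the intercept).
    A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    # Score on the held-out fold.
    A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
    fold_mse.append(float(np.mean((A_val @ coef - y[val_idx]) ** 2)))

print("per-fold MSE:", np.round(fold_mse, 4))
print("mean MSE:    ", round(float(np.mean(fold_mse)), 4))
```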
2.4

Feature engineering

Day 11

Covers: encoding categoricals, scaling, imputation, leakage-safe pipelines.

Build to prove it: Turn a raw, messy dataset into a documented pipeline object (imputer, encoder, scaler, model) that goes raw-CSV-in, score-out.
2.5

Unsupervised learning: clustering & PCA

Day 12

Covers: k-means step by step, hierarchical clustering, principal component analysis.

Build to prove it: Project a labeled dataset down to 2D with PCA and color by true class; run k-means on the same data; compare cluster assignments to labels with a crosstab.
2.6

End-to-end project & first submission

Days 13–14

No new material — full-cycle consolidation: EDA, cleaning, feature engineering, model comparison, cross-validated scoring, then a real leaderboard submission.

Build to prove it: Submit to a Kaggle Getting Started competition (Titanic or House Prices are the canonical entries), then iterate exactly once (one feature or one model swap) and record the score delta. The leaderboard discipline of measuring a change, not just making one, is the actual point.
Track 3 · Days 15–21 in the source plan

Deep Learning From Scratch

Goal: understand backpropagation well enough to build it from nothing, then use PyTorch the way it was designed to be used — while quietly building GPU instincts (parameter counts, memory, mixed precision) for later. Proof you're done: a CNN trained on real images, and the ability to trace a gradient through your network by hand.

3.1

Neural net intuition

Day 15

Covers: what a network is, layers as learned transformations, gradient descent on a loss landscape.

Build to prove it: Prove, in code and not just prose, that stacking linear layers without nonlinearities collapses to a single linear layer. Two matrices in NumPy settle it.
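The collapse can be checked numerically along these lines. Layer sizes and random weights are arbitrary, and the ReLU comparison at the end is an extra illustration, not part of the source exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)   # layer 1: 3 -> 4
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)   # layer 2: 4 -> 2
x = rng.standard_normal((5, 3))                                # batch of 5 inputs

# Two stacked linear layers with no nonlinearity in between.
two_layer = (x @ W1.T + b1) @ W2.T + b2

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = x @ W.T + b

print(np.allclose(two_layer, one_layer))

# With a ReLU between the layers, the single linear layer no longer matches.
with_relu = np.maximum(x @ W1.T + b1, 0.0) @ W2.T + b2
print(np.allclose(with_relu, one_layer))
```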
3.2

Backprop from scratch (micrograd)

Days 16–17

Why it's the anchor of this whole track: the source curriculum calls this "the single most important session of the program" — building the autograd engine yourself is what makes every later framework (PyTorch's autograd, and eventually your own custom CUDA ops) legible rather than magical.

Build a minimal autograd Value class with backward passes for addition, multiplication, and tanh; then a tiny neural-net library (Neuron, Layer, MLP) with a training loop on toy data.

Build to prove it: Add one new operation (exp or ReLU) with its own backward pass, verify it with a numerical gradient check (from Track 1's derivative-estimation technique) to 1e-6, and train the MLP to convergence on toy data.
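A stripped-down sketch of the idea, written in the spirit of micrograd rather than copied from it: a scalar Value class with add, multiply, tanh, and an added exp, followed by a central-difference gradient check. Attribute names such as _parents and the test expression are illustrative choices. The neural-net library and training loop are omitted.

```python
import math

class Value:
    """Scalar that records how it was computed so gradients can flow back."""
    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        # The "new operation": d/dx exp(x) = exp(x).
        e = math.exp(self.data)
        out = Value(e, (self,))
        def _backward():
            self.grad += e * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order guarantees a node's grad is complete before use.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

a, b = Value(0.5), Value(-1.5)
out = (a * b + a.exp()).tanh()
out.backward()

def plain(a0, b0):
    """Same expression on plain floats, for the numerical check."""
    return math.tanh(a0 * b0 + math.exp(a0))

h = 1e-6
num_da = (plain(0.5 + h, -1.5) - plain(0.5 - h, -1.5)) / (2 * h)
num_db = (plain(0.5, -1.5 + h) - plain(0.5, -1.5 - h)) / (2 * h)
print(a.grad, num_da)
print(b.grad, num_db)
```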
3.3

PyTorch fundamentals

Day 18

Covers: tensors, autograd, nn.Module, Dataset/DataLoader, the canonical training loop.

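Build to prove it: Rebuild the Track 1 linear regression in PyTorch in under 40 lines. Then deliberately break it two ways and watch what happens: a wrong tensor shape usually raises an error worth reading closely (or silently broadcasts), while a forgotten zero_grad() raises nothing and quietly accumulates gradients across steps. Recognizing these two failure modes on sight saves hours later.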
3.4

Training an MLP properly + regularization

Day 19

Covers: loss functions, SGD vs. Adam, learning-rate effects, recognizing overfitting in training curves, and the standard anti-overfitting toolkit — dropout, weight decay (L2), early stopping — and what each actually does to the optimization, not just that it "helps."

Build to prove it: Train an MLP on MNIST past 97% test accuracy on a free-tier GPU, logging train/val loss every epoch. Add dropout + weight decay and compare curves. Habit worth starting now: print the model's parameter count and peak GPU memory allocated after training, the first of the GPU instincts this track is quietly building.
3.5

Convolutional neural networks

Day 20

Covers: convolutions as learned filters, pooling, parameter sharing, why CNNs fit image data specifically.

Build to prove it: Swap the MNIST MLP for a small CNN; compare test accuracy vs. parameter count; visualize the first-layer filters.
3.6

Review + image project (CIFAR-10)

Day 21

No new material — consolidation and a real image-classification project.

Build to prove it: Train a baseline CNN on CIFAR-10, then improve it exactly once (augmentation or a deeper architecture). Time one training epoch, then wrap the training step in mixed-precision autocast and time it again, as a small, early preview of the GPU-performance track ahead. Milestone: trace one gradient through your network by hand, on paper, without notes.
Track 4 · Days 22–28 in the source plan

Transformers & Large Language Models

Goal: build a GPT from the inside out — attention, healthy activations, the tokenizer, the transformer block — then understand how a raw next-token predictor becomes a helpful assistant. Proof you're done: you can explain how a GPT works end to end, because you built each piece of it.

4.1

Sequences and attention

Day 22

Why it's not as new as it looks: the source curriculum points out explicitly that the core of attention is matrix multiplication (Track 1) plus softmax (Track 2) — nothing you haven't already built by hand.

Covers: why fixed-window and recurrent approaches struggle at long range; attention as a learned, weighted lookup across a sequence.

Build to prove it: Compute single-head attention by hand for a 3-token toy example in NumPy (queries, keys, values, scaled dot products, softmax, weighted sum), printing and annotating every intermediate matrix.

Attention visualizer — the exact 3-token exercise above, running live

[Interactive widget: editable Q, K, and V inputs (3 tokens × 2 dimensions, so every number stays hand-checkable). It recomputes scaled dot-product attention, the same formula you're asked to derive by hand on Day 22, and shows every intermediate matrix, matching the self-check above.]

What to notice: each output row is a weighted blend of every V row — token 1's output is mostly built from whichever tokens its query vector aligns with most strongly under the dot product. That weighted-lookup mechanic, run in parallel across all tokens and stacked into multiple heads, is the entire mechanism the Track 4 GPT is built from.
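The 3-token computation can be reproduced in a few lines of NumPy. The Q, K, and V values below are arbitrary hand-checkable numbers chosen for this sketch, not values from the source page.

```python
import numpy as np

# 3 tokens x 2 dimensions; values picked only to be easy to check by hand.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
V = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

d_k = Q.shape[-1]

# Step 1: scaled dot products between every query and every key.
scores = Q @ K.T / np.sqrt(d_k)

# Step 2: row-wise softmax turns scores into attention weights.
shifted = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)

# Step 3: each output row is a weighted blend of the rows of V.
output = weights @ V

for name, mat in [("scores", scores), ("weights", weights), ("output", output)]:
    print(name)
    print(np.round(mat, 3))
```

Each row of weights sums to 1, so each output row stays inside the range spanned by the rows of V.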
4.2

How language models work (makemore)

Day 23

Covers: next-token prediction as the entire training objective; sampling; a bigram and then an MLP-based character-level language model.

Build to prove it: Train a bigram character model on a names dataset, sample new names from it, and articulate precisely why the output is bad and what additional context would fix it.
4.3

Healthy networks: activations, gradients, BatchNorm

Day 24

Why it's here: this is the gap between "my net trains" and "I know why my net trains" — and LayerNorm, BatchNorm's cousin, sits inside every transformer block you build next.

Covers: activation statistics, dead neurons, vanishing/exploding gradients, why weight initialization matters, and BatchNorm.

Build to prove it: Plot activation and gradient histograms layer by layer; break the initialization on purpose and watch the histograms saturate; fix it first with proper scaling, then with BatchNorm.
4.4

Tokenization: build a GPT tokenizer

Day 25

Covers: Byte Pair Encoding from scratch, encode/decode, and why several of an LLM's famous weaknesses — spelling, arithmetic, non-English text — are really tokenizer artifacts rather than reasoning failures. Tokenizers are a separately-trained component with their own training set, a point most introductory material glosses over.

Build to prove it: Implement minimal BPE: learn merges from a small corpus, encode and decode a sentence round-trip, and compare your token counts against tiktoken on the same text.
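A byte-level sketch of the learn, encode, and decode steps is shown below. The corpus, the number of merges, and all function names are illustrative assumptions; the tiktoken comparison is not included.

```python
from collections import Counter

def pair_counts(ids):
    """Count adjacent token pairs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)   # most frequent adjacent pair
        new_id = 256 + step                  # ids 0..255 are the raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """Replay the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():      # dicts keep insertion order
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Expand every token back to its bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

corpus = "low lower lowest slow slowly low low"
merges = train_bpe(corpus, num_merges=10)
sentence = "slowest lower"
ids = encode(sentence, merges)
print(len(sentence.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges))
```

Starting from bytes rather than characters means any input string can be encoded, at the cost of non-English text consuming more tokens before merges are learned for it.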
4.5

Build a tiny GPT

Day 26

The transformer block, multi-head self-attention, residual connections, LayerNorm, and the full training loop — assembled from every primitive built in Tracks 1–4 so far.

Build to prove it: Type the whole thing out yourself and train it on a GPU on a Shakespeare corpus. Generate text at three checkpoints during training and keep the samples; watching gibberish become Shakespeare-shaped is the payoff for everything before this.
4.6

From base model to assistant: SFT + RLHF

Days 27–28

Covers: pretraining vs. finetuning, how a raw next-token predictor becomes a helpful assistant (supervised fine-tuning, RLHF), scaling laws, tool use, and LLM security — the pipeline stages most curricula mention but skip explaining.

Build to prove it: Take your Day 26 model's samples and write one page mapping each pipeline stage (base model → SFT → RLHF) onto what your model has and lacks; sketch what an SFT dataset for it would need to look like. Milestone: re-derive single-head attention on paper without notes.
Track 5 · Days 29–35 in the source plan · deepened with Chip Huyen, AI Engineering

AI Engineering & Evaluation

Goal: climb the applied stack on top of the model internals you already understand — APIs, evaluation, RAG, fine-tuning, and hardened agents. Proof you're done: a shipped AI capstone with measurable evals you can defend.

A note on sourcing for this track specifically: the base structure (days 29–35) comes from the source curriculum, same as every other track. Where noted below, individual topics are deepened using Chip Huyen's AI Engineering: Building Applications with Foundation Models (O'Reilly, 2025) — cited by chapter, not reproduced wholesale. Treat the Huyen citations as "read the chapter for the full argument," not as this page's own claim.
5.1

LLM APIs, prompting & structured outputs

Day 29

Covers: the hosted-model ecosystem and Transformers-library basics; system prompts, few-shot examples, JSON-constrained/structured output, retries and timeouts.

Build to prove it: A script that classifies support tickets via a hosted LLM API: system prompt, three few-shot examples, JSON-schema-constrained output, basic retry handling.

Huyen deepening: Chapter 5, "Prompt Engineering" — covers prompt structure, in-context learning mechanics, and failure modes in more depth than an API quickstart.

5.2

Evals: measuring LLM systems

Day 30

Why the source material calls this out specifically: "you cannot improve what you don't measure" — the source curriculum describes evals as the single most under-taught practitioner skill, on the grounds that a prompt change without an eval is just vibes.

Covers: golden sets, exact-match vs. LLM-as-judge scoring, regression-testing prompts.

Build to prove it: Build a 30-example labeled golden set for the Day 29 classifier; write an eval harness that runs the classifier over the set and reports accuracy and per-class errors; change the prompt twice and let the eval, not intuition, pick the winner.
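The shape of such a harness might look like the sketch below. To keep it runnable offline, two keyword-based functions stand in for "the classifier under two different prompts"; in the real exercise each would wrap an LLM API call. The six tickets, the labels, and every name here are hypothetical.

```python
from collections import Counter

# Hypothetical golden set: (ticket text, expected label).
GOLDEN_SET = [
    ("I was charged twice this month", "billing"),
    ("Please refund my last invoice", "billing"),
    ("The app crashes when I upload a file", "bug"),
    ("Login page shows an error after the update", "bug"),
    ("How do I export my data to CSV?", "how_to"),
    ("Where can I change my password?", "how_to"),
]

def classifier_v1(text):
    """Stand-in for the classifier under prompt version 1."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "how_to"

def classifier_v2(text):
    """Stand-in for prompt version 2: also treats 'error' as a bug signal."""
    if "error" in text.lower():
        return "bug"
    return classifier_v1(text)

def evaluate(classify, golden_set):
    """Return accuracy and a count of (expected, predicted) mistakes."""
    correct = 0
    errors = Counter()
    for text, expected in golden_set:
        predicted = classify(text)
        if predicted == expected:
            correct += 1
        else:
            errors[(expected, predicted)] += 1
    return correct / len(golden_set), errors

acc_v1, errors_v1 = evaluate(classifier_v1, GOLDEN_SET)
acc_v2, errors_v2 = evaluate(classifier_v2, GOLDEN_SET)
print(f"v1 accuracy={acc_v1:.2f} errors={dict(errors_v1)}")
print(f"v2 accuracy={acc_v2:.2f} errors={dict(errors_v2)}")
```

Keeping the (expected, predicted) pairs, rather than only the accuracy, is what makes the per-class error report possible.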

Huyen deepening: Chapter 3 ("Evaluation Methodology") and Chapter 4 ("Evaluate AI Systems") — the fuller treatment of exact-match vs. AI-judge tradeoffs, golden-set construction, and safety/toxicity evaluation that a single practice day can only introduce.

5.3

Retrieval-augmented generation

Day 31

Covers: embeddings, cosine similarity, chunking strategy, retrieve-then-generate, and why RAG beats simply stuffing more text into the context window.

Build to prove it: Index ~20 of your own documents with an embedding model, retrieve the top-3 chunks per query, feed them to an LLM for grounded, cited answers. Score 10 queries with a Day 30-style mini-eval and measure retrieval hit-rate directly, not just answer quality.
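The retrieval and hit-rate steps reduce to a small amount of linear algebra, sketched below. Random vectors stand in for real embeddings (an assumption made so the example runs without an embedding model), and each query is built as a slightly perturbed copy of the chunk it should retrieve.

```python
import numpy as np

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices and cosine similarities of the k most similar chunks."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity to every chunk
    best = np.argsort(-sims)[:k]       # highest similarity first
    return best, sims[best]

def hit_rate(query_vecs, relevant_ids, chunk_vecs, k=3):
    """Fraction of queries whose known-relevant chunk appears in the top k."""
    hits = 0
    for query_vec, relevant in zip(query_vecs, relevant_ids):
        best, _ = top_k(query_vec, chunk_vecs, k)
        hits += int(relevant in best)
    return hits / len(relevant_ids)

# Stand-in data: 20 "chunk embeddings" of dimension 8.
rng = np.random.default_rng(0)
chunk_vecs = rng.standard_normal((20, 8))
relevant_ids = [3, 7, 12]
query_vecs = chunk_vecs[relevant_ids] + 0.05 * rng.standard_normal((3, 8))

best, sims = top_k(query_vecs[0], chunk_vecs)
print("top-3 chunk ids:", best, "similarities:", np.round(sims, 3))
print("hit rate @3:", hit_rate(query_vecs, relevant_ids, chunk_vecs))
```

With real documents, relevant_ids comes from your own labeling of which chunk should answer each query, which is the part of the mini-eval that takes actual work.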

Huyen deepening: Chapter 6, "RAG and Agents" — the fuller treatment of retrieval strategy design and where RAG's context-construction choices actually fail in production.

Choosing a vector store — by architecture, not by name

The Day 31 exercise doesn't require a specific product — any of these will index 20 documents fine. The choice starts to matter once you're past the toy exercise. Categorize by what you actually need, and confirm current benchmarks/pricing yourself before committing, since this market moves fast and any specific numbers here would be stale within a year.

| If your priority is… | Reach for | Why |
| --- | --- | --- |
| You already run Postgres | A Postgres vector extension (e.g. pgvector) | No new infrastructure to operate; good enough recall for most RAG workloads at moderate scale; keeps vectors next to your relational data for joined queries. |
| Purpose-built, self-hosted, need metadata filtering | A dedicated open-source vector database | Built specifically for approximate nearest-neighbor search at scale, with first-class support for filtering results by metadata alongside similarity; this matters once retrieval needs "similar AND from this date range." |
| Fully managed, don't want to operate infrastructure | A managed vector-database service | Trades operational control for a hosted, scaled service: the right call when your team's bottleneck is engineering time, not infrastructure cost. |
| Billion-scale vectors, need horizontal scale-out | A distributed vector database designed for scale-out | Most single-node solutions degrade well before a billion vectors; scale-out-native systems trade setup complexity for headroom you don't need until you actually hit it. |
| Multimodal (text + image) or embedded/edge deployment | A multimodal-native or embeddable vector library | Not every vector store handles mixed embedding types or runs inside a mobile/edge process. Check this explicitly if your RAG system isn't text-only or needs to run offline. |

Deliberately not naming specific vendors with specific numbers here: the vector-database market is moving fast enough that any "X handles Y million QPS" claim would likely be stale by the time you read this. Evaluate current options against your own workload — recall@k on your actual documents, p99 latency under your actual query volume, and total cost including egress — rather than trusting any single benchmark, including this site's.
5.4

Fine-tuning with LoRA / PEFT

Day 32

Covers: when to fine-tune vs. prompt vs. RAG; full fine-tuning vs. parameter-efficient methods; how LoRA's low-rank update matrices work, and why the resulting checkpoint is megabytes rather than gigabytes.

Build to prove it: Fine-tune a small open model (sub-1B parameters) with LoRA on the ticket-classification task from Day 29, and compare it against the prompted version using the Day 30 eval harness: a real prompted-vs-fine-tuned table, not an assumption about which wins.
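The "megabytes rather than gigabytes" point follows from simple parameter counting, sketched here. LoRA replaces a full d_out × d_in update with two low-rank factors of shapes d_out × r and r × d_in. The model shape below (hidden size 1024, 24 layers, two adapted matrices per layer, rank 8, fp16 storage) is a made-up example, not a specific model.

```python
def full_params(d_in, d_out):
    """Parameters in a full weight-update matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors: (d_out x r) and (r x d_in)."""
    return rank * (d_in + d_out)

# Hypothetical model shape, chosen only for illustration.
D_MODEL = 1024
N_LAYERS = 24
ADAPTED_PER_LAYER = 2      # e.g. two projection matrices per layer
RANK = 8
BYTES_PER_PARAM = 2        # fp16 storage

per_matrix_full = full_params(D_MODEL, D_MODEL)
per_matrix_lora = lora_params(D_MODEL, D_MODEL, RANK)
n_matrices = N_LAYERS * ADAPTED_PER_LAYER

adapter_params = n_matrices * per_matrix_lora
adapter_bytes = adapter_params * BYTES_PER_PARAM
full_bytes = n_matrices * per_matrix_full * BYTES_PER_PARAM

print(f"per matrix: full={per_matrix_full:,}  lora={per_matrix_lora:,}")
print(f"adapter checkpoint: {adapter_bytes / 2**20:.2f} MiB")
print(f"same matrices, full update: {full_bytes / 2**20:.2f} MiB")
```

Under these assumptions the adapter is 64 times smaller than a full update of the same matrices, and the gap widens as the hidden size grows because the full count scales with d² while the LoRA count scales with d.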

Huyen deepening: Chapter 7, "Finetuning" — the fuller argument on the model-memory arithmetic behind the fine-tune/prompt/RAG decision, plus Chapter 8, "Dataset Engineering," on why a small, carefully-curated instruction set can outperform a much larger noisy one for fine-tuning specifically.

Decision tool: prompt, RAG, or fine-tune?

A structured version of the framework that topics 5.1–5.4 walk through in sequence. This encodes standard practitioner guidance (try cheapest first; add retrieval for knowledge gaps; fine-tune only for behavior prompting can't fix). It's a decision aid, not a substitute for building the golden-set eval from topic 5.2 to actually measure which option wins for your case.

5.5

Agents + tool use

Day 33

Covers: function calling, the agent loop (reason, act, observe, repeat), stopping conditions — deliberately without an agent framework, so the raw loop is visible.

Build to prove it: A minimal agent with two tools (e.g. a calculator and a file-lookup tool) that decides, per query, whether and which tool to call, with a trace log proving correct tool selection across five test queries.

Huyen deepening: Chapter 6, "RAG and Agents" (the agentic half) — covers evaluating agentic systems specifically, beyond single-turn tool selection.

Past a single agent: multi-agent orchestration

The Day 33 exercise builds one agent with two tools — the right place to start, because you can see the whole reason/act/observe loop directly. Real systems often split responsibility across several agents once a single agent's context and tool set get too broad to reason about reliably. This is a genuine extension of what you just built, not a different technology:

| Pattern | Shape | When it's worth the added complexity |
| --- | --- | --- |
| Single agent, many tools | What you built on Day 33, scaled up: one reasoning loop choosing among a larger tool set. | Default choice. Stay here until you can point at a concrete failure the single-agent version causes; added agents mean added coordination bugs and cost, not automatic quality. |
| Specialized agents with a router | A dispatcher agent classifies the request and hands off to a narrower specialist agent (e.g. one for lookups, one for calculations, one for writing). | When one broad system prompt starts producing worse tool selection than several narrow ones would; narrower context per agent is often more reliable than one agent doing everything. |
| Sequential pipeline | Agent A's output becomes Agent B's input, and Agent B's output becomes Agent C's input: a fixed chain, not a dynamic loop. | When the task has genuinely sequential stages (e.g. research → draft → fact-check) and you want each stage's output inspectable and independently evaluable, rather than buried inside one agent's internal reasoning. |
| Verifier/critic pattern | A second agent checks the first agent's output before it's used, and can reject or request revision. | When correctness matters more than latency/cost. This is the same adversarial-verification idea used elsewhere in AI engineering (and in this very site's own review process), applied at inference time instead of at build time. |

The failure mode multi-agent systems add, that single agents don't have: coordination bugs — Agent A's output not matching what Agent B expects, silent failures propagating through a chain instead of surfacing immediately, and multiplied cost/latency since each agent hop is its own model call. Extend the Day 34 hardening practices (guardrails, trace logging, failure cataloging) across every agent-to-agent handoff, not just at the outer boundary — an inter-agent failure is just as real as a user-facing one and much easier to miss.
5.6

Hardening agents: guardrails & failure modes

Day 34

The framing worth keeping: an agent that works on the happy path is a demo; one that fails safely is a product.

Covers: prompt injection, tool-permission boundaries, max-iteration guards, structured tool schemas, trace logging, graceful degradation.

Build to prove it: Attack your Day 33 agent deliberately: a document with injected instructions, a tool that errors, a query that loops. Add a guard for each, and log every step as a structured trace. Deliverable is a short "failure catalog": three attacks, three defenses, evidence from the traces.
5.7

Review + Capstone 1: AI Practitioner

Day 35

No new material — ship. Strong project options in the source curriculum: a RAG assistant over technical documentation (with evals), an agent that triages a real class of failures, or the tiny GPT from Track 4 extended with one experiment of your own design.

Build to prove it: One project, end to end: deployed script or notebook, README, eval results, and a written postmortem.

Huyen deepening: Chapter 10, "AI Engineering Architecture and User Feedback" — how the individual techniques from this whole track (prompting, evals, RAG, fine-tuning, agents) assemble into one production-grade system, and how user feedback closes the loop after ship.

Track 6 · Days 36–49 in the source plan

GPU & CUDA Performance Engineering

Goal: start from "what even is a streaming multiprocessor" and end writing, profiling, and integrating real kernels — CUDA and Triton — then apply the full performance toolkit to your own models. Proof you're done: your Track 4 GPT trains measurably faster, and you can explain every gained token/second from first principles.

6.1

Why GPUs: throughput vs. latency

Day 36

Covers: CPU vs. GPU design philosophy — few clever cores vs. thousands of simple ones; streaming multiprocessors, warps, latency hiding; why memory bandwidth, not FLOPs, bounds most real workloads.

Build to prove it: Run nvidia-smi and decode every field. Benchmark a large matmul on CPU vs. GPU in PyTorch. Compute achieved GB/s for a simple elementwise op and compare it to the card's spec sheet: your first real bandwidth calculation.
6.2

The CUDA programming model: first kernels

Day 37

Covers: kernels, threads, blocks, grids; threadIdx/blockIdx index math; host vs. device memory; cudaMalloc/cudaMemcpy.

Build to prove it: Write and verify a vector-add kernel against NumPy. Then write an index-printing kernel and predict its output before running it. If you can predict thread indices correctly, you actually understand the model, not just the syntax.
6.3

GPU memory hierarchy & coalescing

Day 38

Covers: registers, shared memory, L2 cache, HBM — their relative sizes, latencies, and bandwidths; memory coalescing — why adjacent threads should read adjacent addresses.

Build to prove it: Write a naive matmul kernel (one thread per output element), then a deliberately uncoalesced version (transposed access pattern), and measure the slowdown directly. Numbers, not vibes.
6.4

Tiling + shared-memory matmul

Day 39

Why it echoes later: the same idea — keep hot data in fast memory — is what FlashAttention does at production scale further into this track.

Covers: loading tiles of two matrices into shared memory, synchronizing, and accumulating; why this reduces HBM traffic substantially.

Build to prove it: Implement tiled matmul with shared memory. Benchmark three ways: naive vs. tiled vs. torch.matmul (cuBLAS). Expect cuBLAS to win; the gap between your tiled version and cuBLAS is itself the lesson in how much engineering lives inside production kernels.
6.5

Profiling + the roofline model

Day 40

Covers: compute-bound vs. memory-bound vs. overhead-bound workloads; the roofline model; PyTorch profiler basics.

Build to prove it: Profile a real training loop (e.g. the Track 3 CIFAR model) with the PyTorch profiler: find the top-5 ops by time, classify each as compute- or memory-bound, and compute arithmetic intensity for at least one of them.

Roofline calculator — the Day 40 exercise, computed live

Arithmetic intensity = total FLOPs ÷ total bytes moved. Compare it against your GPU's ridge point (peak compute ÷ peak memory bandwidth) to classify an operation as compute-bound or memory-bound — exactly the classification the self-check above asks you to do by hand. Defaults below are matmul-shaped; edit for your own op.

What this doesn't model: real ops rarely move the theoretical minimum bytes (cache reuse, tiling, and fusion all change effective bytes moved) — this calculates the roofline classification from the FLOPs/bytes numbers you provide, it doesn't derive those numbers for you. Measuring actual bytes moved for a real kernel needs a profiler (Nsight, the PyTorch profiler), not this calculator.
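The same classification can be done in a few lines of plain Python. The GPU figures below (10 TFLOP/s peak compute, 500 GB/s peak bandwidth) are placeholder numbers, not any real card's spec sheet, and the byte counts assume the ideal case where each operand is read once and each result is written once.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved

def ridge_point(peak_flops_per_s, peak_bytes_per_s):
    """Intensity at which the compute and bandwidth ceilings meet."""
    return peak_flops_per_s / peak_bytes_per_s

def classify(intensity, ridge):
    return "compute-bound" if intensity >= ridge else "memory-bound"

def matmul_cost(m, n, k, bytes_per_element=4):
    """Ideal-case cost of (m x k) @ (k x n): read A and B once, write C once."""
    flops = 2 * m * n * k                      # one multiply + one add per term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops, bytes_moved

def elementwise_add_cost(n, bytes_per_element=4):
    """Ideal-case cost of c = a + b: two reads and one write per element."""
    return n, 3 * n * bytes_per_element

# Placeholder hardware numbers; substitute your own card's spec sheet.
PEAK_FLOPS = 10e12        # 10 TFLOP/s
PEAK_BANDWIDTH = 500e9    # 500 GB/s
ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)

matmul_ai = arithmetic_intensity(*matmul_cost(1024, 1024, 1024))
add_ai = arithmetic_intensity(*elementwise_add_cost(1_000_000))

print(f"ridge point: {ridge:.1f} FLOP/byte")
print(f"matmul 1024^3: {matmul_ai:.1f} FLOP/byte -> {classify(matmul_ai, ridge)}")
print(f"elementwise add: {add_ai:.3f} FLOP/byte -> {classify(add_ai, ridge)}")
```

Swapping in measured bytes from a profiler, in place of the ideal-case counts, is what turns this from a paper estimate into a statement about a real kernel.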
6.6

Triton: Python-speed kernel writing

Day 41

Covers: block-level GPU programming in Triton that compiles to fast device code without raw CUDA; the official vector-add and fused-softmax tutorials.

Build to prove it: Work through both official Triton tutorials by typing them yourself; benchmark your Triton softmax against torch.softmax; then modify vector-add into a fused multiply-add kernel of your own.
6.7

Review + GPU project: the matmul ladder

Day 42

No new material — consolidate the week into a single comparison.

Build to prove it: One benchmark notebook: naive CUDA → coalesced → tiled → Triton → cuBLAS, all on the same matrix sizes, with a GFLOP/s bar chart and your own explanation of what each optimization step actually bought and why.
6.8

Tensor cores + mixed precision

Day 43

Covers: fp32 vs. fp16/bf16, tensor cores, loss scaling, mixed-precision training via autocast — why half precision roughly doubles effective bandwidth and unlocks tensor-core throughput at the same time.

Build to prove it: Train the Track 3 CIFAR CNN with mixed precision: compare epoch time, peak memory, and final accuracy against fp32. Check whether your Day 42 matmul ladder's cuBLAS entry also speeds up in fp16.
6.9

torch.compile + kernel fusion

Day 44

Covers: graph capture and codegen under the hood (why fusion eliminates memory-bound overhead), and the one-time compile cold-start cost.

Build to prove it: Apply torch.compile to both the CIFAR model and the Track 4 GPT. Measure steady-state speedup (ignore the first compiled step). Stack it with mixed precision and record the combined gain.
6.10

FlashAttention: a case study in IO-awareness

Day 45

Covers: why naive attention is memory-bound (it materializes the full N×N attention matrix in HBM) and how FlashAttention tiles the computation through fast on-chip memory instead — the Day 39 tiling trick, applied at production scale.

Build to prove it: Benchmark your Day 22-style hand-rolled attention from Track 4 against PyTorch's built-in scaled-dot-product attention across a range of sequence lengths; plot time and memory against length and find the point where naive attention falls over.
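You can predict roughly where naive attention falls over before running the benchmark, because the N×N score matrix grows quadratically. A minimal sketch of that arithmetic follows. The head count and batch size are hypothetical, and the figure covers only the materialized score matrix, not Q, K, V, or the output.

```python
def attention_scores_bytes(seq_len, n_heads, batch, bytes_per_element=2):
    """Bytes needed to materialize the full (seq_len x seq_len) score matrix for every head."""
    return batch * n_heads * seq_len * seq_len * bytes_per_element


# Hypothetical small-GPT shape: 12 heads, batch 1, fp16 scores.
for n in (1024, 4096, 16384):
    gib = attention_scores_bytes(n, n_heads=12, batch=1) / 2**30
    print(f"N={n:>6}: {gib:.2f} GiB of attention scores per layer")
```

Quadrupling the sequence length multiplies this term by 16, so compare your measured memory curve against this one. FlashAttention avoids the term entirely by never writing the full matrix to HBM.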
6.11

Custom ops: plugging kernels into PyTorch

Day 46

Covers: custom autograd Functions (you write forward and backward yourself), and integrating CUDA/Triton kernels into PyTorch directly.

Build to prove it: Write a custom autograd Function for the Day 41 fused-multiply-add Triton kernel, with a hand-written backward pass, verified against autograd with a gradient check — the full circle from Track 3's micrograd, now running on a GPU.
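The gradient-check logic is independent of PyTorch, so it can be sketched on scalars first. The example below hand-writes the backward pass for out = a·b + c and compares it against central finite differences. It is a plain-Python illustration with hypothetical function names. In the real exercise the forward and backward would live in a `torch.autograd.Function` subclass, and `torch.autograd.gradcheck` plays the role of the numerical comparison.

```python
def fma_forward(a, b, c):
    """Forward pass of the fused multiply-add: out = a * b + c."""
    return a * b + c


def fma_backward(a, b, grad_out):
    """Hand-written backward: d(out)/da = b, d(out)/db = a, d(out)/dc = 1."""
    return grad_out * b, grad_out * a, grad_out


def numerical_grads(a, b, c, eps=1e-6):
    """Central finite differences, perturbing one input at a time."""
    da = (fma_forward(a + eps, b, c) - fma_forward(a - eps, b, c)) / (2 * eps)
    db = (fma_forward(a, b + eps, c) - fma_forward(a, b - eps, c)) / (2 * eps)
    dc = (fma_forward(a, b, c + eps) - fma_forward(a, b, c - eps)) / (2 * eps)
    return da, db, dc


analytic = fma_backward(3.0, -2.0, grad_out=1.0)
numeric = numerical_grads(3.0, -2.0, 0.5)
print("analytic:", analytic)
print("numeric: ", numeric)
```

A mismatch between the two rows almost always means the backward pass is wrong, not the finite differences. That is the whole value of the check.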
6.12

Reading production-grade CUDA (llm.c)

Day 47

Karpathy's llm.c: GPT-2 training in raw C/CUDA. The fp32 CPU reference implementation is the bridge from Python-level understanding to systems code.

Build to prove it: Map five core functions in the reference implementation (attention, layernorm, matmul forward/backward, the Adam step) back to the equivalent lines in your own Track 4 GPT, annotating as you go.
6.13

Quantization + inference optimization

Day 48

Covers: int8/4-bit quantization (weight-only vs. dynamic), the KV cache, batching — and why inference is fundamentally a memory-bandwidth game, not a compute one.

Build to prove it: Load a small model in fp16 and in 4-bit; compare VRAM usage and tokens/second. Then compute, by hand, how much memory a KV cache needs for a 7B-parameter model at 8K context — use the calculator below to check your arithmetic.
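Weight-only quantization is less mysterious once you have done it to a handful of numbers. The sketch below shows one simple scheme, symmetric absmax int8 quantization with a single scale per tensor. It is an illustration under that assumption, not what any particular library does internally, and the weights are made up.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map [-max|w|, +max|w|] onto the integers [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]   # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
for w, code, r in zip(weights, q, restored):
    print(f"{w:>7.3f} -> {code:>4d} -> {r:>7.3f}")
```

Each weight now takes 1 byte instead of 2 (fp16) or 4 (fp32), at the cost of a rounding error of at most half a quantization step. Small weights such as 0.003 can round to zero, which is one reason practical schemes use per-channel or per-group scales.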
6.14

Review + GPU project: make nanoGPT fast

Day 49

No new material — apply the entire toolkit at once.

Build to prove it: Take your Track 4 GPT and apply mixed precision + torch.compile + fused attention together. Measure tokens/second before and after each individual addition, then profile once more to find the new bottleneck. Milestone: you can make a real model measurably faster and explain every gain.

KV-cache memory calculator

2 × layers × heads × head_dim × context_length × batch × bytes_per_element — the arithmetic behind Day 48

The KV cache stores one key and one value vector per attention head, per layer, per token in the context so far (prompt plus generated tokens) — that's where the leading factor of 2 comes from. This is exactly the by-hand calculation the source curriculum assigns for Day 48; use it to check your own arithmetic, not to skip doing it once yourself.

What this doesn't model: grouped-query or multi-query attention (GQA/MQA) reduce the effective KV-head count below the full attention-head count — enter the actual number of KV heads your model uses, not the query-head count, if they differ. This also ignores model weights and activation memory entirely; it is the KV cache's contribution only.
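The formula translates directly into a few lines of code. In this minimal sketch the parameter names are my own, and the model shape is an illustrative 7B-class configuration (32 layers, 32 KV heads, head dimension 128, fp16 cache). Treat it as an assumption and read the real values from your model's config.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_length, batch=1, bytes_per_element=2):
    """KV-cache size: one key and one value vector per KV head, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * context_length * batch * bytes_per_element


# Illustrative 7B-class shape (assumption), 8K context, batch 1, fp16 cache.
full_mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, context_length=8192)
print(full_mha / 2**30)  # → 4.0 (GiB)

# Same shape with grouped-query attention using 8 KV heads (hypothetical variant).
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_length=8192)
print(gqa / 2**30)       # → 1.0 (GiB)
```

The cache grows linearly in both context length and batch size, so serving 16 concurrent 8K requests with the first shape needs 64 GiB for the cache alone.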
Track 7 · Days 50–56 in the source plan

Distributed Training & HPC

Goal: speak the language of clusters (Slurm, MPI, NCCL); scale training from one GPU to many (DDP, ZeRO/FSDP, tensor and pipeline parallelism); finish with a distributed, optimized, crash-tested capstone. Proof you're done: you can design and defend a parallelism plan for a 70B-parameter model, and you've run real multi-process training yourself.

7.1

Anatomy of a cluster + Slurm

Day 50

Covers: nodes, GPUs per node, NVLink within a node, InfiniBand/Ethernet between nodes; the job scheduler as the front door to every HPC system.

Build to prove it: Write a complete sbatch script for a hypothetical 2-node, 8-GPU training job: resource requests, module loads, the launch line, output handling.
7.2

MPI: the language of HPC

Day 51

Why it matters one day later: the same all-reduce collective operation you implement by hand here is exactly what synchronizes gradients in DDP.

Covers: ranks, communicators, point-to-point send/receive, collectives (broadcast, reduce, all-reduce).

Build to prove it: With mpi4py: a hello-world with ranks, a ring-pass token exercise, then a manual all-reduce implemented with send/receive and verified against the library's built-in all-reduce.
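It can help to get the ring logic right in a single process before adding real message passing. The sketch below simulates the ranks as list positions: at each step every rank "receives" the value its left neighbour sent, adds it to a running total, and forwards it on the next step. This is a simulation with hypothetical names, not mpi4py code. In the real exercise the list indexing becomes send/receive calls between ranks.

```python
def ring_all_reduce(values):
    """Simulate a sum all-reduce over a ring; values[r] is rank r's local contribution."""
    p = len(values)
    totals = list(values)      # each rank's running sum, seeded with its own value
    in_flight = list(values)   # what each rank will send to its right neighbour next
    for _ in range(p - 1):     # after p-1 hops every value has visited every rank
        # Rank r receives whatever rank (r - 1) mod p sent this step.
        received = [in_flight[(r - 1) % p] for r in range(p)]
        totals = [t + x for t, x in zip(totals, received)]
        in_flight = received   # forward the received value on the next step
    return totals


contributions = [1, 2, 3, 4]
result = ring_all_reduce(contributions)
print(result)  # → [10, 10, 10, 10]
```

This naive ring sends whole values p−1 times. Bandwidth-efficient implementations split the buffer into chunks (reduce-scatter, then all-gather), but the correctness check is the same: every rank ends with the identical sum.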
7.3

Data parallelism: DDP

Day 52

Covers: one process per GPU, a replicated model, gradient all-reduce overlapped with the backward pass.

Build to prove it: Convert the Track 3 CIFAR trainer to DDP: process-group init, a distributed sampler, model wrapping, rank-0-only logging. Run two processes on the CPU backend first to prove the mechanics, then real multi-GPU if available, measuring scaling efficiency.
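Two pieces of arithmetic sit underneath this exercise: how a dataset is split across ranks, and how scaling efficiency is scored. The sketch below illustrates both in plain Python. The sharding function mirrors the general idea of a distributed sampler (strided split, padded so every rank gets the same count) without shuffling. Both function names are hypothetical.

```python
def shard_indices(n_samples, rank, world_size):
    """Strided shard with wrap-around padding so every rank gets the same number of samples.

    Assumes n_samples >= world_size.
    """
    indices = list(range(n_samples))
    total = -(-n_samples // world_size) * world_size   # round up to a multiple of world_size
    indices += indices[: total - n_samples]            # pad by repeating the first samples
    return indices[rank:total:world_size]


def scaling_efficiency(throughput_n, throughput_1, n_workers):
    """Measured throughput on n workers relative to perfect linear scaling."""
    return throughput_n / (n_workers * throughput_1)


for rank in range(4):
    print(rank, shard_indices(10, rank, world_size=4))

# Hypothetical measurements: 1000 samples/s on one process, 1800 samples/s on two.
print(scaling_efficiency(1800.0, 1000.0, n_workers=2))  # → 0.9
```

Equal shard sizes matter because every rank must take part in the same number of gradient all-reduces. A rank that runs out of batches early leaves the others waiting.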
7.4

Memory math + ZeRO/FSDP

Day 53

Covers: why DDP hits a memory wall (weights + gradients + optimizer states add up to roughly 16 bytes per parameter under mixed precision); ZeRO's idea — shard those states across GPUs instead of replicating them; FSDP as PyTorch's native implementation of that idea.

Build to prove it: Do the memory arithmetic by hand for a 7B-parameter model: per-GPU memory under plain DDP vs. ZeRO stages 1/2/3 — the calculation behind most "why is CUDA out of memory" incidents at scale. The calculator below runs the same arithmetic.
7.5

Tensor + pipeline parallelism, 5D parallelism

Day 54

Covers: when a model doesn't fit even after sharding — splitting individual layers across GPUs (tensor parallel) or splitting the layer stack into stages with micro-batching (pipeline parallel); how production systems compose data, tensor, and pipeline parallelism together.

Build to prove it: Paper exercise, no multi-GPU hardware required: for a 70B-parameter model on 64 GPUs, propose a data/tensor/pipeline parallelism layout, justify where each communication step happens, and estimate per-GPU memory using the Day 53 arithmetic.
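A small helper makes it easy to compare candidate layouts while you work the paper exercise. This sketch rests on simplifying assumptions: parameters split evenly across tensor and pipeline ranks, no ZeRO sharding inside the data-parallel group, and roughly 16 bytes of training state per parameter. The bubble estimate uses the common (p − 1) / (m + p − 1) formula for a simple GPipe-style schedule, which is an addition here and not something the source states. All names are hypothetical.

```python
def parallel_layout(n_params, world_size, tp, pp, bytes_per_param=16):
    """Derive the data-parallel degree and per-GPU state memory for a tp x pp layout."""
    if world_size % (tp * pp) != 0:
        raise ValueError("tp * pp must divide world_size")
    dp = world_size // (tp * pp)
    params_per_gpu = n_params / (tp * pp)   # each model replica spans tp * pp GPUs
    return {
        "dp": dp,
        "tp": tp,
        "pp": pp,
        "state_gb_per_gpu": params_per_gpu * bytes_per_param / 1e9,
    }


def pipeline_bubble_fraction(pp, n_microbatches):
    """Idle fraction of a simple GPipe-style pipeline schedule."""
    return (pp - 1) / (n_microbatches + pp - 1)


layout = parallel_layout(70e9, world_size=64, tp=8, pp=4)
print(layout)
print(pipeline_bubble_fraction(pp=4, n_microbatches=12))
```

Under these assumptions, tp=8 and pp=4 leave dp=2 and about 35 GB of states per GPU before any activation memory. That gives you a concrete starting point to defend or revise.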
7.6

Running at scale: checkpointing, failures, MFU

Day 55

Covers: checkpoint/resume (a large enough job will lose nodes eventually), straggler effects, NCCL as the GPU collective-communication layer, and model FLOPs utilization (MFU) — described in the source material as the one honest metric of training efficiency.

Build to prove it: Add checkpoint/resume to your DDP trainer and kill it mid-epoch to prove resume actually works. Compute the MFU of your fastest Track 6 GPT run: achieved FLOP/s over the hardware's peak.
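The MFU calculation is a one-liner once you have a measured tokens/second figure. This minimal sketch assumes the common approximation of about 6 FLOPs per parameter per trained token, which ignores the attention-score FLOPs that grow with sequence length. The model size, throughput, and peak figure are hypothetical placeholders. Substitute your own measurements and your GPU's spec-sheet peak for the precision you trained in.

```python
def model_flops_utilization(n_params, tokens_per_second, peak_flops_per_s):
    """MFU: achieved training FLOP/s (approx. 6 * params * tokens/s) over hardware peak."""
    achieved_flops_per_s = 6 * n_params * tokens_per_second
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical run: 124M-parameter GPT, 50k tokens/s, 100 TFLOP/s peak.
run_mfu = model_flops_utilization(124e6, 50_000, 100e12)
print(f"MFU = {run_mfu:.1%}")
```

Use the peak for the number of GPUs you actually ran on. For a multi-GPU run, multiply the per-GPU peak by the GPU count before dividing.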

Training cost/time estimator

total FLOPs ≈ 6 × parameters × training tokens — the standard forward+backward compute approximation, then divided by your cluster's achievable throughput (peak × MFU)

The 6× multiplier comes from splitting a transformer's compute into a forward pass (≈2 FLOPs per parameter per token) plus a backward pass (≈2× the forward cost) — 2 + 4 = 6. This is the same approximation used to estimate published models' training compute, and it plugs directly into the MFU concept this section just covered: real wall-clock time depends on what fraction of peak FLOPs you actually sustain, not the spec-sheet number.

This is a first-order estimate, not a quote: real runs lose additional time to data loading, checkpointing, communication overhead between GPUs (worse at higher node counts — see Track 7.5), and restarts after failures (Track 7.6). MFU itself typically drops as you scale to more GPUs due to communication overhead, so don't assume your single-node MFU holds at 10x the GPU count. Treat this as a planning floor, not a promise.
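As a worked instance of the estimate: total FLOPs ≈ 6 × parameters × tokens, divided by sustained throughput (GPU count × per-GPU peak × MFU). Every input in the sketch below is a hypothetical planning number, not a measurement, and the function name is my own.

```python
def training_days(n_params, n_tokens, n_gpus, peak_flops_per_gpu, mfu):
    """First-order wall-clock estimate for a training run, in days."""
    total_flops = 6 * n_params * n_tokens                 # forward + backward compute
    sustained_flops_per_s = n_gpus * peak_flops_per_gpu * mfu
    return total_flops / sustained_flops_per_s / 86_400   # 86,400 seconds per day


# Hypothetical plan: 7B parameters, 1T tokens, 64 GPUs at 300 TFLOP/s peak, 40% MFU.
days = training_days(7e9, 1e12, n_gpus=64, peak_flops_per_gpu=300e12, mfu=0.40)
print(f"{days:.1f} days")
```

Rerunning with a lower MFU shows how sensitive the answer is. Dropping from 40% to 30% stretches the same job by a third, before any of the overheads listed above.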
7.7

Final capstone: distributed, optimized, documented

Day 56

Ship the program. Two strong options from the source curriculum: train your GPT with DDP on multiple GPUs with mixed precision + compile + fused attention + checkpointing, reporting tokens/sec, scaling efficiency, and MFU — or an end-to-end ML service on real infrastructure data with GPU-accelerated training, an eval suite, and a deployment script.

Build to prove it: A repo with README, a metrics table, a written postmortem, and a "what I'd do with 64 GPUs" section.

ZeRO / FSDP per-GPU memory calculator

~16 bytes/parameter (mixed precision: fp16 weights + fp16 grads + fp32 optimizer states) sharded across N GPUs — the Day 53 arithmetic

Under mixed-precision training with Adam, each parameter typically costs roughly: 2 bytes (fp16 weight) + 2 bytes (fp16 gradient) + 4 bytes (fp32 master weight copy) + 4 bytes (fp32 Adam momentum) + 4 bytes (fp32 Adam variance) ≈ 16 bytes total, before ZeRO sharding. ZeRO-1 shards optimizer states only; ZeRO-2 adds gradient sharding; ZeRO-3 (and FSDP) also shards the parameters themselves. The exact byte-per-parameter breakdown varies by optimizer and mixed-precision recipe — 16 bytes/param is the commonly cited planning figure for fp16-mixed-precision Adam specifically, not a universal constant; confirm against your actual optimizer and precision settings before sizing real hardware.

What this ignores, on purpose: activation memory (which scales with batch size and sequence length, not just parameter count) and communication buffers are not modeled here — this is states-only memory, exactly matching the Day 53 exercise's scope. Real GPU sizing needs both numbers together.
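The stage-by-stage arithmetic can be sketched in a few lines. This follows the 2 + 2 + 12 byte breakdown above and assumes the fp32 master weights count as optimizer state (so they are sharded from ZeRO-1 onward), which is the accounting the ZeRO paper uses. The function name is hypothetical, and activations and buffers are excluded as stated.

```python
def zero_state_bytes_per_gpu(n_params, n_gpus, stage):
    """Per-GPU training-state memory under plain DDP (stage 0) or ZeRO stages 1-3."""
    weights = 2 * n_params    # fp16 weights
    grads = 2 * n_params      # fp16 gradients
    optim = 12 * n_params     # fp32 master weights + Adam momentum + Adam variance
    if stage >= 1:
        optim /= n_gpus       # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus       # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= n_gpus     # ZeRO-3 / FSDP: also shard the parameters
    return weights + grads + optim


# 7B parameters on 8 GPUs, the Day 53 exercise.
for stage in range(4):
    gb = zero_state_bytes_per_gpu(7e9, n_gpus=8, stage=stage) / 1e9
    print(f"stage {stage}: {gb:.2f} GB per GPU")
```

Stage 0 comes out at 112 GB per GPU regardless of GPU count, which is the DDP memory wall in a single number. Only stage 3 divides the full 16 bytes/parameter by N.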
Milestones

The two capstones — what "done" looks like

The source curriculum treats these as the two real checkpoints in the whole path — not review days, shipped artifacts.

Capstone | Scope | What it proves
1 · AI Practitioner | End of Track 5 (source Day 35) | A shipped AI application — RAG assistant, hardened agent, or extended GPT — with real evals and a written postmortem. You can build the applied layer, not just describe it.
2 · Final | End of Track 7 (source Day 56) | A distributed, performance-optimized, crash-tested training run (or an end-to-end ML service) with a metrics table and a scaling narrative you can defend. You're a practitioner on both sides: models and the metal they run on.
What comes after: the source curriculum's own closing advice is to keep exactly one live project running at all times — a competition, an open-source contribution, or a real service — because skills decay without a forcing function. It also recommends the PMPP book (Programming Massively Parallel Processors) as the natural next deep read once you have hands-on GPU context to absorb it, and finishing the Ultra-Scale Playbook end to end including the parts a first pass skims.
Resource Stack

The core resources, all free

Every link below appears in the topic sections above at least once; this is the same set gathered in one place. All are free tiers or fully free resources — nothing in this path requires paid hardware or a paid course.

freeCodeCamp — Python for Data Science (youtube.com)
NumPy from 9:36:10, pandas from 11:04:12 — skip the first 9 hours of plain Python basics if you already know the language.

3Blue1Brown — Essence of Linear Algebra (youtube.com)
Vectors, matrices, transformations — the math backbone of everything after it.

3Blue1Brown — Essence of Calculus (youtube.com)
Derivatives and the chain rule — required for understanding backprop, not optional.

StatQuest — Machine Learning video index (statquest.org)
Every classic ML algorithm, clearly explained. Use as a lookup index, not a marathon.

3Blue1Brown — Neural Networks series (3blue1brown.com)
Neural nets, gradient descent, backprop, attention and transformers, all visually.

Karpathy — Neural Networks: Zero to Hero (karpathy.ai)
Backprop, MLPs, BatchNorm, the tokenizer, and GPT — all built from scratch in code. The spine of Tracks 3–4.

Karpathy — Intro to Large Language Models (1hr) (youtube.com)
Pretraining, SFT, RLHF, scaling, tool use, security — the whole pipeline in one talk.

Hugging Face — LLM Course (huggingface.co)
Transformers library, tokenizers, fine-tuning, NLP tasks.

Hugging Face — PEFT docs (huggingface.co)
LoRA and the other parameter-efficient fine-tuning methods.

Claude API docs (docs.claude.com)
Used throughout the AI Engineering track for prompting, evals, and agent-loop examples.

Kaggle Learn (kaggle.com)
Hands-on micro-courses: pandas, intro to ML, feature engineering — plus the leaderboard for your first submission.

PyTorch — Learn the Basics (pytorch.org)
Official tensors-to-training-loop tutorial path.

freeCodeCamp — CUDA Programming Course (youtube.com)
GPU architecture, first kernels, the CUDA API, faster matmul, Triton, PyTorch extensions. The spine of Track 6.

GPU MODE lectures (github.com/gpu-mode)
The premier GPU-programming reading group: profiling, memory, tiling, quantization, FlashAttention, NCCL. Uses the PMPP book as its text.

NVIDIA CUDA Programming Guide (docs.nvidia.com)
The official reference — current as of CUDA 13; the old "CUDA C++ Programming Guide" URL is now legacy.

Triton — official tutorials (triton-lang.org)
Vector add, fused softmax, matmul — kernel writing from Python.

Horace He — Making Deep Learning Go Brrrr (horace.io)
Compute-bound vs. memory-bound vs. overhead-bound — the mental model for all GPU performance work.

karpathy/llm.c (github.com/karpathy)
GPT-2 training in raw C/CUDA — the bridge from Python understanding to systems code.

FlashAttention — paper (arxiv.org)
Dao et al., 2022. The IO-awareness argument behind Track 6's Day 45.

PyTorch — Getting Started with DDP (docs.pytorch.org)
The official recipe used to build Track 7's data-parallel trainer.

PyTorch — FSDP tutorial (FSDP2) (docs.pytorch.org)
Fully-sharded data parallel training — PyTorch's native ZeRO-style implementation.

The Ultra-Scale Playbook (Hugging Face) (huggingface.co)
The free book on training LLMs on GPU clusters: ZeRO, tensor/pipeline parallelism, 5D parallelism, kernels, MFU. The Track 7 text.

Slurm Quick Start (slurm.schedmd.com)
The scheduler's own quick-start guide: sbatch, srun, squeue, sinfo, salloc.

LLNL MPI tutorial (hpc-tutorials.llnl.gov)
Ranks, communicators, collectives — the canonical free MPI curriculum.

m-khalifa.com (m-khalifa.com)
The curator's portfolio, and the companion storage-engineering reference site.
About & Method

Where this content comes from, and what it doesn't claim

This page is a reorganization, not an original curriculum. The seven-track structure, the day-by-day "learn / build / deliverable" content, the dependency-chain framing, and every resource link above are drawn directly from a 56-day study plan whose own appendix states every link was verified live (fetched directly, or confirmed via current search results) on June 10–11, 2026. This page reorganizes that material by topic instead of by day, because a topic map is the more useful reference once you're past the first read-through — but the underlying claims, sequencing, and resources are the source document's, not independently re-derived here.

The one exception: the AI Engineering & Evaluation track (Track 5) is explicitly deepened with citations to Chip Huyen's AI Engineering: Building Applications with Foundation Models (O'Reilly, 2025) — by chapter, and clearly marked inline. Those citations point you to the fuller argument in the book; they are not this page's own claims restated as fact.

What this page does not do: invent benchmark numbers, model comparisons, or "best practices" beyond what the source material states. Where a figure is a commonly cited planning heuristic rather than a fixed constant — the ≈16-bytes-per-parameter mixed-precision memory estimate, for instance — it's labeled as a heuristic you should confirm against your actual setup, the same discipline the companion storage-engineering site applies to its own calculators.

Corrections and link rot: the source material itself flags that link landscapes shift — PyTorch's docs moved to a new domain since some of these libraries' early days, Kaggle's free multi-GPU accelerator offerings change over time, NVIDIA has rewritten its CUDA guide once already. If a link on this page breaks, the source document's own recovery method holds: search the exact resource title given next to it. Corrections are welcome via LinkedIn.