Zero to AI engineer, organized by topic.
Most "learn AI" material is either a marketing-flavored overview or a single deep paper with no path to it. This is neither: it's a topic-based map from linear algebra to distributed multi-GPU training, built from a verified 56-day engineering curriculum and (for the AI-engineering practice layer specifically) Chip Huyen's AI Engineering. Every topic states what it covers, why the order matters, what to build to prove you understand it, and exactly where the primary source material lives — no invented benchmarks, no numbers we can't point to a source for.
How this reference is organized
Each of the seven tracks below groups related days from the source curriculum into one topic, states the concrete thing you should be able to build by the end, and lists the primary learning resources — verbatim from the source material, not paraphrased. This is deliberately not a day-by-day schedule: the source curriculum's own daily loop (watch → build in Colab → write a 5-line summary → self-check, with three recall questions on yesterday's material before starting anything new) is the right way to pace yourself once you pick a track, but the pacing itself isn't the point of this page — the map of what to learn, in what order, and why, is.
.shape constantly once you reach deep learning; shape bugs are the majority of real debugging time there. And once you reach the GPU track: measure before you optimize, and measure again after — a performance claim without a number is a feeling, not an engineering result.Why the order matters — it's a dependency chain, not a menu
Every track below is load-bearing for the one after it, all the way through GPU systems: the matrix multiply you hand-code in Track 1 is the CUDA kernel you tile in Track 6 and the tensor-core workload you feed in the same track. This is the argument for not skimming the math, however tempting that is for an experienced developer — broadcasting and the chain rule resurface as silent bugs inside neural networks if you don't have them cold.
This sequence and its framing come directly from the source curriculum's own "why the order matters" section — it is not a claim this site invented independently.
Math & Python Foundations
Goal: fluency with NumPy/pandas, geometric intuition for vectors and gradients, and one working linear regression trained entirely by hand. Proof you're done: you can explain gradient descent on a whiteboard and have the code to prove it.
Vectorization, NumPy & pandas
Day 1Why first: every later track assumes array-native thinking. Loop-based code is the single most common beginner habit that has to be unlearned before anything else sticks.
Covers: why array operations beat loops, array shapes, broadcasting rules, DataFrame slicing with .loc, boolean masks, groupby.
Vectors & matrices
Day 2Why it matters later: the matrix multiply you hand-code here is the same operation you'll tile into a CUDA kernel in the GPU track and feed to tensor cores after that — it is genuinely the same 20 lines, at increasing levels of "how fast can this go."
Covers: vectors as arrows and as data, span, linear transformations, matrix multiplication as composed transformations.
np.matmul on random matrices, and visualize a 2×2 transformation acting on a grid of points.Derivatives, gradients & partials
Day 3Why it matters later: the chain rule is backpropagation — not an analogy for it, the literal mechanism. Skimming this is the single most common reason "my net trains but I don't know why" happens later.
Covers: derivative as sensitivity to a nudge; the chain rule; partial derivatives — the gradient is a vector of partials, one per input (the source curriculum flags this explicitly as the jump from single-variable to multivariable calculus that neural nets require).
Probability essentials
Day 4Covers: distributions, mean and variance, sampling, the normal distribution, and why ML treats data as draws from a distribution.
What machine learning actually is
Day 5Covers: training vs. testing, the bias/variance tradeoff, overfitting.
Linear regression from scratch
Days 6–7Gradient descent end to end: loss surface, learning rate, convergence — no new material, this is purely a build-and-consolidate stretch.
LinearRegression.Classical Machine Learning
Goal: train, evaluate, and improve real models on tabular data; ship a first Kaggle submission. Proof you're done: a leaderboard entry you can defend, preprocessing choice by preprocessing choice.
Classification: softmax & logistic regression
Day 8Why it matters later: softmax reappears inside attention, inside LLM sampling, and inside every classifier after this — the source curriculum is explicit that this is the one place to learn it properly, once.
Covers: decision boundaries, the sigmoid, log loss, odds and log-odds, softmax as sigmoid's multi-class generalization, and why cross-entropy is softmax's natural loss.
Trees, forests, boosting
Day 9Covers: decision trees (splits, impurity), why forests reduce variance, the boosting idea.
Model evaluation
Day 10Covers: cross-validation, the confusion-matrix family, ROC/AUC, precision vs. recall, and data leakage — described in the source material as "the silent killer" of otherwise-plausible model results.
Feature engineering
Day 11Covers: encoding categoricals, scaling, imputation, leakage-safe pipelines.
Unsupervised learning: clustering & PCA
Day 12Covers: k-means step by step, hierarchical clustering, principal component analysis.
End-to-end project & first submission
Days 13–14No new material — full-cycle consolidation: EDA, cleaning, feature engineering, model comparison, cross-validated scoring, then a real leaderboard submission.
Deep Learning From Scratch
Goal: understand backpropagation well enough to build it from nothing, then use PyTorch the way it was designed to be used — while quietly building GPU instincts (parameter counts, memory, mixed precision) for later. Proof you're done: a CNN trained on real images, and the ability to trace a gradient through your network by hand.
Neural net intuition
Day 15Covers: what a network is, layers as learned transformations, gradient descent on a loss landscape.
Backprop from scratch (micrograd)
Days 16–17Why it's the anchor of this whole track: the source curriculum calls this "the single most important session of the program" — building the autograd engine yourself is what makes every later framework (PyTorch's autograd, and eventually your own custom CUDA ops) legible rather than magical.
Build a minimal autograd Value class with backward passes for addition, multiplication, and tanh; then a tiny neural-net library (Neuron, Layer, MLP) with a training loop on toy data.
PyTorch fundamentals
Day 18Covers: tensors, autograd, nn.Module, Dataset/DataLoader, the canonical training loop.
zero_grad() — and read what the errors actually say. Recognizing these two failure modes on sight saves hours later.Training an MLP properly + regularization
Day 19Covers: loss functions, SGD vs. Adam, learning-rate effects, recognizing overfitting in training curves, and the standard anti-overfitting toolkit — dropout, weight decay (L2), early stopping — and what each actually does to the optimization, not just that it "helps."
Convolutional neural networks
Day 20Covers: convolutions as learned filters, pooling, parameter sharing, why CNNs fit image data specifically.
Review + image project (CIFAR-10)
Day 21No new material — consolidation and a real image-classification project.
Transformers & Large Language Models
Goal: build a GPT from the inside out — attention, healthy activations, the tokenizer, the transformer block — then understand how a raw next-token predictor becomes a helpful assistant. Proof you're done: you can explain how a GPT works end to end, because you built each piece of it.
Sequences and attention
Day 22Why it's not as new as it looks: the source curriculum points out explicitly that the core of attention is matrix multiplication (Track 1) plus softmax (Track 2) — nothing you haven't already built by hand.
Covers: why fixed-window and recurrent approaches struggle at long range; attention as a learned, weighted lookup across a sequence.
Attention visualizer — the exact 3-token exercise above, running live
Edit Q, K, V below (3 tokens × 2 dimensions, so every number stays hand-checkable) and recompute. This is scaled dot-product attention — the same formula you're asked to derive by hand on Day 22 — with every intermediate matrix shown, matching the self-check above.
How language models work (makemore)
Day 23Covers: next-token prediction as the entire training objective; sampling; a bigram and then an MLP-based character-level language model.
Healthy networks: activations, gradients, BatchNorm
Day 24Why it's here: this is the gap between "my net trains" and "I know why my net trains" — and LayerNorm, BatchNorm's cousin, sits inside every transformer block you build next.
Covers: activation statistics, dead neurons, vanishing/exploding gradients, why weight initialization matters, and BatchNorm.
Tokenization: build a GPT tokenizer
Day 25Covers: Byte Pair Encoding from scratch, encode/decode, and why several of an LLM's famous weaknesses — spelling, arithmetic, non-English text — are really tokenizer artifacts rather than reasoning failures. Tokenizers are a separately-trained component with their own training set, a point most introductory material glosses over.
tiktoken on the same text.Build a tiny GPT
Day 26The transformer block, multi-head self-attention, residual connections, LayerNorm, and the full training loop — assembled from every primitive built in Tracks 1–4 so far.
From base model to assistant: SFT + RLHF
Days 27–28Covers: pretraining vs. finetuning, how a raw next-token predictor becomes a helpful assistant (supervised fine-tuning, RLHF), scaling laws, tool use, and LLM security — the pipeline stages most curricula mention but skip explaining.
AI Engineering & Evaluation
Goal: climb the applied stack on top of the model internals you already understand — APIs, evaluation, RAG, fine-tuning, and hardened agents. Proof you're done: a shipped AI capstone with measurable evals you can defend.
LLM APIs, prompting & structured outputs
Day 29Covers: the hosted-model ecosystem and Transformers-library basics; system prompts, few-shot examples, JSON-constrained/structured output, retries and timeouts.
Huyen deepening: Chapter 5, "Prompt Engineering" — covers prompt structure, in-context learning mechanics, and failure modes in more depth than an API quickstart.
Evals: measuring LLM systems
Day 30Why the source material calls this out specifically: "you cannot improve what you don't measure" — the source curriculum describes evals as the single most under-taught practitioner skill, on the grounds that a prompt change without an eval is just vibes.
Covers: golden sets, exact-match vs. LLM-as-judge scoring, regression-testing prompts.
Huyen deepening: Chapter 3 ("Evaluation Methodology") and Chapter 4 ("Evaluate AI Systems") — the fuller treatment of exact-match vs. AI-judge tradeoffs, golden-set construction, and safety/toxicity evaluation that a single practice day can only introduce.
Retrieval-augmented generation
Day 31Covers: embeddings, cosine similarity, chunking strategy, retrieve-then-generate, and why RAG beats simply stuffing more text into the context window.
Huyen deepening: Chapter 6, "RAG and Agents" — the fuller treatment of retrieval strategy design and where RAG's context-construction choices actually fail in production.
Choosing a vector store — by architecture, not by name
The Day 31 exercise doesn't require a specific product — any of these will index 20 documents fine. The choice starts to matter once you're past the toy exercise. Categorize by what you actually need, and confirm current benchmarks/pricing yourself before committing, since this market moves fast and any specific numbers here would be stale within a year.
| If your priority is… | Reach for | Why |
|---|---|---|
| You already run Postgres | A Postgres vector extension (e.g. pgvector) | No new infrastructure to operate; good enough recall for most RAG workloads at moderate scale; keeps vectors next to your relational data for joined queries. |
| Purpose-built, self-hosted, need metadata filtering | A dedicated open-source vector database | Built specifically for approximate nearest-neighbor search at scale, with first-class support for filtering results by metadata alongside similarity — matters once retrieval needs "similar AND from this date range." |
| Fully managed, don't want to operate infrastructure | A managed vector-database service | Trades operational control for a hosted, scaled service — the right call when your team's bottleneck is engineering time, not infrastructure cost. |
| Billion-scale vectors, need horizontal scale-out | A distributed vector database designed for scale-out | Most single-node solutions degrade well before a billion vectors; scale-out-native systems trade setup complexity for headroom you don't need until you actually hit it. |
| Multimodal (text + image) or embedded/edge deployment | A multimodal-native or embeddable vector library | Not every vector store handles mixed embedding types or runs inside a mobile/edge process — check this explicitly if your RAG system isn't text-only or needs to run offline. |
Fine-tuning with LoRA / PEFT
Day 32Covers: when to fine-tune vs. prompt vs. RAG; full fine-tuning vs. parameter-efficient methods; how LoRA's low-rank update matrices work, and why the resulting checkpoint is megabytes rather than gigabytes.
Huyen deepening: Chapter 7, "Finetuning" — the fuller argument on the model-memory arithmetic behind the fine-tune/prompt/RAG decision, plus Chapter 8, "Dataset Engineering," on why a small, carefully-curated instruction set can outperform a much larger noisy one for fine-tuning specifically.
Decision tool: prompt, RAG, or fine-tune?
A structured version of the framework Tracks 5.1–5.4 walk through in sequence. This encodes standard practitioner guidance (try cheapest first; add retrieval for knowledge gaps; fine-tune only for behavior prompting can't fix) — it's a decision aid, not a substitute for building the golden-set eval from Track 5.2 to actually measure which option wins for your case.
Agents + tool use
Day 33Covers: function calling, the agent loop (reason, act, observe, repeat), stopping conditions — deliberately without an agent framework, so the raw loop is visible.
Huyen deepening: Chapter 6, "RAG and Agents" (the agentic half) — covers evaluating agentic systems specifically, beyond single-turn tool selection.
Past a single agent: multi-agent orchestration
The Day 33 exercise builds one agent with two tools — the right place to start, because you can see the whole reason/act/observe loop directly. Real systems often split responsibility across several agents once a single agent's context and tool set get too broad to reason about reliably. This is a genuine extension of what you just built, not a different technology:
| Pattern | Shape | When it's worth the added complexity |
|---|---|---|
| Single agent, many tools | What you built on Day 33, scaled up — one reasoning loop choosing among a larger tool set. | Default choice. Stay here until you can point at a concrete failure the single-agent version causes — added agents mean added coordination bugs and cost, not automatic quality. |
| Specialized agents with a router | A dispatcher agent classifies the request and hands off to a narrower specialist agent (e.g. one for lookups, one for calculations, one for writing). | When one broad system prompt starts producing worse tool selection than several narrow ones would — narrower context per agent is often more reliable than one agent doing everything. |
| Sequential pipeline | Agent A's output becomes Agent B's input becomes Agent C's input — a fixed chain, not a dynamic loop. | When the task has genuinely sequential stages (e.g. research → draft → fact-check) and you want each stage's output inspectable and independently evaluable, rather than buried inside one agent's internal reasoning. |
| Verifier/critic pattern | A second agent checks the first agent's output before it's used, and can reject or request revision. | When correctness matters more than latency/cost — the same adversarial-verification idea used elsewhere in AI engineering (and in this very site's own review process), applied at inference time instead of at build time. |
Hardening agents: guardrails & failure modes
Day 34The framing worth keeping: an agent that works on the happy path is a demo; one that fails safely is a product.
Covers: prompt injection, tool-permission boundaries, max-iteration guards, structured tool schemas, trace logging, graceful degradation.
Review + Capstone 1: AI Practitioner
Day 35No new material — ship. Strong project options in the source curriculum: a RAG assistant over technical documentation (with evals), an agent that triages a real class of failures, or the tiny GPT from Track 4 extended with one experiment of your own design.
Huyen deepening: Chapter 10, "AI Engineering Architecture and User Feedback" — how the individual techniques from this whole track (prompting, evals, RAG, fine-tuning, agents) assemble into one production-grade system, and how user feedback closes the loop after ship.
GPU & CUDA Performance Engineering
Goal: start from "what even is a streaming multiprocessor" and end writing, profiling, and integrating real kernels — CUDA and Triton — then apply the full performance toolkit to your own models. Proof you're done: your Track 4 GPT trains measurably faster, and you can explain every gained token/second from first principles.
Why GPUs: throughput vs. latency
Day 36Covers: CPU vs. GPU design philosophy — few clever cores vs. thousands of simple ones; streaming multiprocessors, warps, latency hiding; why memory bandwidth, not FLOPs, bounds most real workloads.
nvidia-smi and decode every field. Benchmark a large matmul on CPU vs. GPU in PyTorch. Compute achieved GB/s for a simple elementwise op and compare it to the card's spec sheet — your first real bandwidth calculation.The CUDA programming model: first kernels
Day 37Covers: kernels, threads, blocks, grids; threadIdx/blockIdx index math; host vs. device memory; cudaMalloc/cudaMemcpy.
GPU memory hierarchy & coalescing
Day 38Covers: registers, shared memory, L2 cache, HBM — their relative sizes, latencies, and bandwidths; memory coalescing — why adjacent threads should read adjacent addresses.
Tiling + shared-memory matmul
Day 39Why it echoes later: the same idea — keep hot data in fast memory — is what FlashAttention does at production scale further into this track.
Covers: loading tiles of two matrices into shared memory, synchronizing, and accumulating; why this reduces HBM traffic substantially.
torch.matmul (cuBLAS). Expect cuBLAS to win — the gap between your tiled version and cuBLAS is itself the lesson in how much engineering lives inside production kernels.Profiling + the roofline model
Day 40Covers: compute-bound vs. memory-bound vs. overhead-bound workloads; the roofline model; PyTorch profiler basics.
Roofline calculator — the Day 40 exercise, computed live
Arithmetic intensity = total FLOPs ÷ total bytes moved. Compare it against your GPU's ridge point (peak compute ÷ peak memory bandwidth) to classify an operation as compute-bound or memory-bound — exactly the classification the self-check above asks you to do by hand. Defaults below are matmul-shaped; edit for your own op.
Triton: Python-speed kernel writing
Day 41Covers: block-level GPU programming in Triton that compiles to fast device code without raw CUDA; the official vector-add and fused-softmax tutorials.
torch.softmax; then modify vector-add into a fused multiply-add kernel of your own.Review + GPU project: the matmul ladder
Day 42No new material — consolidate the week into a single comparison.
Tensor cores + mixed precision
Day 43Covers: fp32 vs. fp16/bf16, tensor cores, loss scaling, mixed-precision training via autocast — why half precision roughly doubles effective bandwidth and unlocks tensor-core throughput at the same time.
torch.compile + kernel fusion
Day 44Covers: graph capture and codegen under the hood (why fusion eliminates memory-bound overhead), and the one-time compile cold-start cost.
torch.compile to both the CIFAR model and the Track 4 GPT. Measure steady-state speedup (ignore the first compiled step). Stack it with mixed precision and record the combined gain.FlashAttention: a case study in IO-awareness
Day 45Covers: why naive attention is memory-bound (it materializes the full N×N attention matrix in HBM) and how FlashAttention tiles the computation through fast on-chip memory instead — the Day 39 tiling trick, applied at production scale.
Custom ops: plugging kernels into PyTorch
Day 46Covers: custom autograd Functions (you write forward and backward yourself), and integrating CUDA/Triton kernels into PyTorch directly.
Function for the Day 41 fused-multiply-add Triton kernel, with a hand-written backward pass, verified against autograd with a gradient check — the full circle from Track 3's micrograd, now running on a GPU.Reading production-grade CUDA (llm.c)
Day 47Karpathy's llm.c: GPT-2 training in raw C/CUDA. The fp32 CPU reference implementation is the bridge from Python-level understanding to systems code.
Quantization + inference optimization
Day 48Covers: int8/4-bit quantization (weight-only vs. dynamic), the KV cache, batching — and why inference is fundamentally a memory-bandwidth game, not a compute one.
Review + GPU project: make nanoGPT fast
Day 49No new material — apply the entire toolkit at once.
torch.compile + fused attention together. Measure tokens/second before and after each individual addition, then profile once more to find the new bottleneck. Milestone: you can make a real model measurably faster and explain every gain.KV-cache memory calculator
The KV cache stores one key and one value vector per attention head, per layer, per token generated so far — that's where the leading factor of 2 comes from. This is exactly the by-hand calculation the source curriculum assigns for Day 48; use it to check your own arithmetic, not to skip doing it once yourself.
Distributed Training & HPC
Goal: speak the language of clusters (Slurm, MPI, NCCL); scale training from one GPU to many (DDP, ZeRO/FSDP, tensor and pipeline parallelism); finish with a distributed, optimized, crash-tested capstone. Proof you're done: you can design and defend a parallelism plan for a 70B-parameter model, and you've run real multi-process training yourself.
Anatomy of a cluster + Slurm
Day 50Covers: nodes, GPUs per node, NVLink within a node, InfiniBand/Ethernet between nodes; the job scheduler as the front door to every HPC system.
sbatch script for a hypothetical 2-node, 8-GPU training job: resource requests, module loads, the launch line, output handling.MPI: the language of HPC
Day 51Why it matters one day later: the same all-reduce collective operation you implement by hand here is exactly what synchronizes gradients in DDP.
Covers: ranks, communicators, point-to-point send/receive, collectives (broadcast, reduce, all-reduce).
mpi4py: a hello-world with ranks, a ring-pass token exercise, then a manual all-reduce implemented with send/receive and verified against the library's built-in all-reduce.Data parallelism: DDP
Day 52Covers: one process per GPU, a replicated model, gradient all-reduce overlapped with the backward pass.
Memory math + ZeRO/FSDP
Day 53Covers: why DDP hits a memory wall (weights + gradients + optimizer states add up to roughly 16 bytes per parameter under mixed precision); ZeRO's idea — shard those states across GPUs instead of replicating them; FSDP as PyTorch's native implementation of that idea.
Tensor + pipeline parallelism, 5D parallelism
Day 54Covers: when a model doesn't fit even after sharding — splitting individual layers across GPUs (tensor parallel) or splitting the layer stack into stages with micro-batching (pipeline parallel); how production systems compose data, tensor, and pipeline parallelism together.
Running at scale: checkpointing, failures, MFU
Day 55Covers: checkpoint/resume (a large enough job will lose nodes eventually), straggler effects, NCCL as the GPU collective-communication layer, and model FLOPs utilization (MFU) — described in the source material as the one honest metric of training efficiency.
Training cost/time estimator
The 6× multiplier comes from splitting a transformer's compute into a forward pass (≈2 FLOPs per parameter per token) plus a backward pass (≈2× the forward cost) — 2 + 4 = 6. This is the same approximation used to estimate published models' training compute, and it plugs directly into the MFU concept this section just covered: real wall-clock time depends on what fraction of peak FLOPs you actually sustain, not the spec-sheet number.
Final capstone: distributed, optimized, documented
Day 56Ship the program. Two strong options from the source curriculum: train your GPT with DDP on multiple GPUs with mixed precision + compile + fused attention + checkpointing, reporting tokens/sec, scaling efficiency, and MFU — or an end-to-end ML service on real infrastructure data with GPU-accelerated training, an eval suite, and a deployment script.
ZeRO / FSDP per-GPU memory calculator
Under mixed-precision training with Adam, each parameter typically costs roughly: 2 bytes (fp16 weight) + 2 bytes (fp16 gradient) + 4 bytes (fp32 master weight copy) + 4 bytes (fp32 Adam momentum) + 4 bytes (fp32 Adam variance) ≈ 16 bytes total, before ZeRO sharding. ZeRO-1 shards optimizer states only; ZeRO-2 adds gradient sharding; ZeRO-3 (and FSDP) also shards the parameters themselves. The exact byte-per-parameter breakdown varies by optimizer and mixed-precision recipe — 16 bytes/param is the commonly-cited planning figure for fp16-mixed-precision Adam specifically, not a universal constant; confirm against your actual optimizer and precision settings before sizing real hardware.
The two capstones — what "done" looks like
The source curriculum treats these as the two real checkpoints in the whole path — not review days, shipped artifacts.
| Capstone | Scope | What it proves |
|---|---|---|
| 1 · AI Practitioner | End of Track 5 (source Day 35) | A shipped AI application — RAG assistant, hardened agent, or extended GPT — with real evals and a written postmortem. You can build the applied layer, not just describe it. |
| 2 · Final | End of Track 7 (source Day 56) | A distributed, performance-optimized, crash-tested training run (or an end-to-end ML service) with a metrics table and a scaling narrative you can defend. You're a practitioner on both sides: models and the metal they run on. |
The core resources, all free
Every link below appears in the topic sections above at least once; this is the same set gathered in one place. All are free tiers or fully free resources — nothing in this path requires paid hardware or a paid course.
Where this content comes from, and what it doesn't claim
This page is a reorganization, not an original curriculum. The seven-track structure, the day-by-day "learn / build / deliverable" content, the dependency-chain framing, and every resource link above are drawn directly from a 56-day study plan whose own appendix states every link was verified live (fetched directly, or confirmed via current search results) on June 10–11, 2026. This page reorganizes that material by topic instead of by day, because a topic map is the more useful reference once you're past the first read-through — but the underlying claims, sequencing, and resources are the source document's, not independently re-derived here.
The one exception: the AI Engineering & Evaluation track (Track 5) is explicitly deepened with citations to Chip Huyen's AI Engineering: Building Applications with Foundation Models (O'Reilly, 2025) — by chapter, and clearly marked inline. Those citations point you to the fuller argument in the book; they are not this page's own claims restated as fact.
What this page does not do: invent benchmark numbers, model comparisons, or "best practices" beyond what the source material states. Where a figure is a commonly-cited planning heuristic rather than a fixed constant — the ≈16-bytes-per-parameter mixed-precision memory estimate, for instance — it's labeled as a heuristic you should confirm against your actual setup, the same discipline the companion storage-engineering site applies to its own calculators.
Corrections and link rot: the source material itself flags that link landscapes shift — PyTorch's docs moved to a new domain since some of these libraries' early days, Kaggle's free multi-GPU accelerator offerings change over time, NVIDIA has rewritten its CUDA guide once already. If a link on this page breaks, the source document's own recovery method holds: search the exact resource title given next to it. Corrections are welcome via LinkedIn.