Anatomy of LLMs – From Math to Production
Overview
Large Language Models are no longer research curiosities; they are software systems with hard mathematical cores and very real engineering constraints. This course is designed for mid-level and senior developers who want a principled, end-to-end understanding of LLMs — from the algebra that drives attention, to mixed-precision numerics on modern accelerators, to distributed training, inference serving, retrieval-augmented generation, interpretability, and production reliability. We assume you are comfortable reading technical papers, reasoning about asymptotics and hardware, and writing nontrivial code. We do not assume prior ML specialization: we build the stack from first principles, but we speak to you as an engineer and mathematician, not as a beginner.
By the end, you will be able to: reason about representation learning and optimization in concrete linear-algebraic terms; design and benchmark GPU/TPU kernels and identify true bottlenecks (often bandwidth, not FLOPs); implement and profile a minimal transformer; train and fine-tune with robust schedules and regularization; debug instabilities (NaNs, exploding gradients, loss spikes) with the right tools; scale training with tensor/pipeline parallelism, ZeRO/FSDP, and memory-efficient attention; build retrieval pipelines with vector databases and hybrid search; probe and edit models safely; and deliver production-grade inference with observability, A/B testing, and cost control.
Block I equips you with the quantitative substrate: matrix/tensor operations as the lingua franca of neural computation, probability and cross-entropy as coding-length objectives, floating-point formats (FP32/FP16/BF16) and their failure modes, and the physics of memory bandwidth versus compute throughput. You will learn why elementwise ops are often bandwidth-bound, how to profile and fuse kernels, and how to read GPU specs (FLOPs, VRAM, interconnects) in a way that predicts wall-clock performance. We cover PCA/SVD for compression, loss landscapes and saddle points, and implement optimizers (SGD/Adam) to demystify the training loop.
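To give a flavor of the Block I practice work, here is a minimal sketch of a single Adam update written in NumPy. The hyperparameter defaults and the toy objective are illustrative assumptions, not the course's reference solution.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moment estimates plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy usage: minimize f(x) = x^2 starting from x = 5 (assumed lr for quick convergence).
x, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
print(x)  # converges toward 0
```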
Block II builds the architectural intuition: from RNN/LSTM to attention and the transformer blueprint (multi-head self-attention, positional encodings, residuals, LayerNorm). We go deep on inference mechanics: KV-cache, speculative decoding, Mixture-of-Experts, and emerging state space models (S4, Mamba). You will understand scaling laws, tokenization trade-offs, and the embedding geometry that underlies few-shot behavior. You will also implement a compact transformer and experiment with KV-cache and speculative paths.
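To make the KV-cache discussion concrete, here is a minimal single-head decoding sketch in PyTorch: each new token's key/value projections are appended to a cache so earlier positions are never recomputed. Shapes and names (`k_cache`, `v_cache`) are assumptions for illustration, not the course's reference implementation.

```python
import math
import torch

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """One autoregressive step: project the new token, append its K/V to the
    cache, and attend over all cached positions."""
    q = x_t @ W_q                                       # (1, d)
    k_cache = torch.cat([k_cache, x_t @ W_k], dim=0)    # (t, d)
    v_cache = torch.cat([v_cache, x_t @ W_v], dim=0)    # (t, d)
    scores = q @ k_cache.T / math.sqrt(q.shape[-1])     # (1, t)
    out = torch.softmax(scores, dim=-1) @ v_cache       # (1, d)
    return out, k_cache, v_cache

d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(5):                                      # decode 5 tokens
    x_t = torch.randn(1, d)
    out, k_cache, v_cache = decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache)
```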
Block III focuses on optimization at scale: loss functions and label smoothing, batching and shuffling strategies, learning-rate warmup with cosine decay, gradient clipping, and the subtle effects of data ordering. We compare full fine-tuning versus parameter-efficient methods like LoRA/adapters, cover self-supervised pretraining, RLHF with PPO and reward models, and discuss reward hacking and catastrophic forgetting with concrete mitigation strategies.
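As a small preview, the warmup-plus-cosine schedule discussed in Block III fits in a few lines; the step counts and peak learning rate below are placeholder assumptions.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2_000, total_steps=100_000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```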
Block IV is your debugging playbook: diagnosing loss spikes, profiling GPU memory, analyzing gradient norms and dead heads, and recovering from NaNs with AMP scaling, safe casts, and numerical hygiene. You will practice building visualizations that reveal optimization pathologies and profile training with Nsight/torch.profiler.
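A minimal taste of the Block IV tooling: a per-step guard that clips and logs the global gradient norm and skips the optimizer step when it is non-finite. The threshold and logging style are assumptions for illustration.

```python
import torch

def safe_step(model, optimizer, loss, max_norm=1.0):
    """Backprop, clip/log the global grad norm, and skip the update on NaN/Inf."""
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(total_norm):
        print(f"non-finite grad norm ({total_norm}); skipping optimizer step")
        return False
    optimizer.step()
    return True
```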
Block V addresses distributed systems and systems optimization: model parallelism (tensor/pipeline), communication primitives (AllReduce/AllGather) and their costs, FlashAttention and memory-efficient variants, activation/gradient checkpointing, gradient accumulation, and mixed precision specifics. We cover FSDP/ZeRO/DeepSpeed, dataset streaming, and inference servers (vLLM, TGI). You will run LLaMA on vLLM and explore quantization (int8/int4) trade-offs with calibrated evaluation.
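As a small illustration of the mixed-precision and gradient-accumulation mechanics covered here, a BF16 autocast training loop with accumulation; the accumulation count, model, and dataloader are placeholders, not course-mandated settings.

```python
import torch

ACCUM_STEPS = 8  # effective batch = ACCUM_STEPS * micro-batch size (assumed value)

def train_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(loader):
        # BF16 autocast: unlike FP16, no GradScaler is required
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(x.to(device))
            loss = torch.nn.functional.cross_entropy(logits, y.to(device))
        (loss / ACCUM_STEPS).backward()   # scale so accumulated grads average correctly
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```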
Block VI delivers RAG fundamentals: vector databases (FAISS, Milvus), embedding models, robust chunking, hybrid search (dense + sparse), and citation/source tracking. You will build a simple RAG and a version with source attribution, paying attention to latency and recall-quality trade-offs.
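Here is a minimal sketch of the dense-retrieval half of that pipeline, assuming faiss-cpu and sentence-transformers are installed; the embedding model name and toy documents are stand-ins for whatever you use in practice.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "The KV-cache stores per-layer keys and values during decoding.",
    "LoRA fine-tunes a model by adding low-rank adapter matrices.",
    "FlashAttention reduces memory traffic in the attention kernel.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
emb = model.encode(docs, normalize_embeddings=True)    # unit vectors -> inner product = cosine
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype="float32"))

query = model.encode(["How does LoRA work?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 2)
print([docs[i] for i in ids[0]])                        # top-2 retrieved chunks
```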
Block VII explores internals and interpretability: activation patching, probing, Grad-CAM-like techniques for transformers, circuit analysis, knowledge editing (ROME/MEMIT-style), jailbreaks and alignment attacks, and contamination detection. You will learn how to ask causal questions about model behavior and to guard your evaluations.
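The spirit of activation patching in a few lines: record a hidden state from a "clean" run and substitute it into a "corrupted" run via a PyTorch forward hook, then inspect the downstream effect. The two-layer toy model is purely illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))  # toy stand-in
layer = model[0]
x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

# 1) Record the layer's activation on the clean input.
cache = {}
handle = layer.register_forward_hook(lambda m, inp, out: cache.update(act=out.detach()))
model(x_clean)
handle.remove()

# 2) Re-run on the corrupted input, patching in the clean activation
#    (a forward hook that returns a tensor replaces the module's output).
handle = layer.register_forward_hook(lambda m, inp, out: cache["act"])
patched_logits = model(x_corrupt)
handle.remove()
print(patched_logits)   # causal effect of the single patched activation
```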
Block VIII lands the plane: profiling real-world bottlenecks, A/B testing and guardrails, GPU-hour economics and capacity planning, and production monitoring for degradation (latency, calibration, safety, drift). The course emphasizes measurement, reproducibility, and engineering rigor. Every practice block culminates in concrete artifacts: profilers, minimal pipelines, tuned schedulers, and service blueprints you can adapt to your stack.
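To preview the GPU-hour economics, a back-of-the-envelope cost-per-million-tokens calculation; the hourly price, throughput, and utilization figures are placeholder assumptions to replace with your own measurements.

```python
# Back-of-the-envelope serving cost (all inputs are assumptions).
gpu_hour_usd = 2.50          # assumed on-demand price for one accelerator
tokens_per_second = 2500     # assumed sustained decode throughput of the deployment
utilization = 0.6            # fraction of each hour actually serving traffic

tokens_per_hour = tokens_per_second * 3600 * utilization
cost_per_million_tokens = gpu_hour_usd / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.3f} per 1M generated tokens")
```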
Curriculum
- 9 Sections
- 76 Lessons
- Lifetime access
- 1. Block I. Math, Physics and Hardware for LLMs (16 lessons)
  - 1.1 Introduction: What is an LLM, history, myths vs. reality
  - 1.2 Linear algebra: matrices, tensors, neural network operations
  - 1.3 Probability, entropy, cross-entropy
  - 1.4 Floating point arithmetic: IEEE 754, FP16, BF16, numerical errors
  - 1.5 Memory bandwidth bottlenecks: why bandwidth matters more than FLOPS
  - 1.6 Compiler optimizations: CUDA kernels, XLA, TorchInductor
  - 1.7 GPU/TPU performance: FLOPS, VRAM, interconnects
  - 1.8 PCA, SVD: compression and representations
  - 1.9 Numerical instabilities: exploding/vanishing gradients
  - 1.10 Loss landscapes and saddle points
  - 1.11 Practice: implement your own SGD/Adam optimizer
  - 1.12 Practice: profiling matrix operations
  - 1.13 Practice: FP16 vs BF16 training simulation
  - 1.14 Practice: measure GPU memory bandwidth
  - 1.15 Practice: minimal PyTorch pipeline and bottleneck profiling
  - Quiz (3 questions)
- 2. Block II. Architectures and Mechanics of LLMs (15 lessons)
  - 2.1 RNN, LSTM, GRU: pre-transformer era
  - 2.2 Attention and seq2seq
  - 2.3 Transformer architecture: multi-head self-attention
  - 2.4 Positional encoding
  - 2.5 KV-cache: mechanics and optimization
  - 2.6 Speculative decoding: speeding up inference with smaller models
  - 2.7 Mixture of Experts (MoE): scaling strategy
  - 2.8 State Space Models (S4, Mamba) as alternatives to transformers
  - 2.9 Scaling laws: size vs. quality
  - 2.10 Dropout, LayerNorm, residuals
  - 2.11 Tokenization: BPE, SentencePiece, byte-level
  - 2.12 Embeddings and latent representations
  - 2.13 Practice: build a mini-transformer in PyTorch
  - 2.14 Practice: inference with KV-cache and speculative decoding
  - Quiz (3 questions)
- 3. Block III. Training and Optimization (13 lessons)
  - 3.1 Loss functions and label smoothing
  - 3.2 Batches, epochs, data shuffling
  - 3.3 Learning rate warmup and cosine schedules
  - 3.4 Gradient clipping: strategies and effects
  - 3.5 Data ordering effects: how batch order impacts convergence
  - 3.6 Fine-tuning: full, LoRA, adapters
  - 3.7 Self-supervised learning
  - 3.8 RLHF: reward models and PPO
  - 3.9 Reward hacking: how models trick reward functions
  - 3.10 Catastrophic forgetting
  - 3.11 Practice: fine-tune LLaMA/Mistral
  - 3.12 Practice: LoRA vs full fine-tune comparison
  - Quiz (3 questions)
- 4. Block IV. Debugging & Profiling9
- 4.1HW1F 4.1 Diagnosing loss spikes
- 4.2HW1F 4.2 Memory profiling: tools and GPU memory leaks
- 4.3HW1F 4.3 Gradient analysis: norms, dead neurons
- 4.4HW1F 4.4 Training instabilities: NaN detection and recovery
- 4.5HW1F 4.5 Practice: gradient visualization
- 4.6HW1F 4.6 Practice: GPU profiling during training
- 4.7HW1F 4.7 Practice: NaN debugging in a training run
- 4.8HW1F 4.8 Practice: compare optimizers for stability (Adam vs LAMB vs Lion)
- 4.101HW1F 4. Quiz3 Questions
- 5. Block V. Infrastructure and Distributed Systems (11 lessons)
  - 5.1 Model parallelism: tensor parallelism, pipeline parallelism
  - 5.2 Communication overhead: AllReduce, AllGather
  - 5.3 FlashAttention and memory-efficient variants
  - 5.4 Checkpointing and gradient accumulation
  - 5.5 Mixed precision training: FP16/BF16 specifics and pitfalls
  - 5.6 DeepSpeed, FSDP, ZeRO
  - 5.7 Dataset streaming pipelines
  - 5.8 Inference servers: vLLM, TGI
  - 5.9 Practice: run LLaMA on vLLM
  - 5.10 Practice: quantization (int8, int4) and quality comparison
  - Quiz (3 questions)
- 6. Block VI. RAG and Retrieval (8 lessons)
  - 6.1 Vector databases: FAISS, Milvus
  - 6.2 Embedding models
  - 6.3 Chunking strategies for documents
  - 6.4 Hybrid search: combining dense and sparse
  - 6.5 Citation and source tracking
  - 6.6 Practice: simple RAG with docs
  - 6.7 Practice: RAG with source attribution
  - Quiz (3 questions)
- 7. Block VII. Model Internals & Interpretability7
- 7.1HW1F 7.1 Activation patching: editing hidden states
- 7.2HW1F 7.2 Interpretability techniques: probing, Grad-CAM for transformers
- 7.3HW1F 7.3 Circuit analysis: algorithms inside the model
- 7.4HW1F 7.4 Knowledge editing: locally changing facts
- 7.5HW1F 7.5 Jailbreaking techniques: alignment attacks
- 7.6HW1F 7.6 Data contamination detection: leaks in evaluation datasets
- 7.101HW1F 7. Quiz3 Questions
- 8. Block VIII. Final Projects & Production5
- HW1F FinalQuiz1