- 9 Sections
- 76 Lessons
- Lifetime access
- 1. Block I. Math, Physics and Hardware for LLMs (16 lessons)
- 1.1 Introduction: What is an LLM, history, myths vs. reality
- 1.2 Linear algebra: matrices, tensors, neural network operations
- 1.3 Probability, entropy, cross-entropy
- 1.4 Floating-point arithmetic: IEEE 754, FP16, BF16, numerical errors
- 1.5 Memory bandwidth bottlenecks: why bandwidth matters more than FLOPS
- 1.6 Compiler optimizations: CUDA kernels, XLA, TorchInductor
- 1.7 GPU/TPU performance: FLOPS, VRAM, interconnects
- 1.8 PCA, SVD: compression and representations
- 1.9 Numerical instabilities: exploding/vanishing gradients
- 1.10 Loss landscapes and saddle points
- 1.11 Practice: implement your own SGD/Adam optimizer (see the sketch after this block)
- 1.12 Practice: profiling matrix operations
- 1.13 Practice: FP16 vs. BF16 training simulation
- 1.14 Practice: measure GPU memory bandwidth
- 1.15 Practice: minimal PyTorch pipeline and bottleneck profiling
- 1. Quiz (3 questions)
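As a companion to lesson 1.11, here is a minimal sketch of what "implement your own SGD" could look like in PyTorch. The class name `MinimalSGD` and its defaults are illustrative assumptions, not course material; `torch.optim.SGD` is the reference to compare against.

```python
import torch

class MinimalSGD:
    """Bare-bones SGD with momentum, mimicking the torch.optim interface."""

    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        # One velocity buffer per parameter, initialized to zeros.
        self.velocities = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, v in zip(self.params, self.velocities):
            if p.grad is None:
                continue
            v.mul_(self.momentum).add_(p.grad)  # v <- momentum*v + grad
            p.sub_(self.lr * v)                 # p <- p - lr*v

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
```

A quick sanity check is to train a small `nn.Linear` model twice, once with this class and once with `torch.optim.SGD(..., momentum=0.9)`, and confirm the loss curves match.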
- 2. Block II. Architectures and Mechanics of LLMs (15 lessons)
- 2.1 RNN, LSTM, GRU: the pre-transformer era
- 2.2 Attention and seq2seq
- 2.3 Transformer architecture: multi-head self-attention
- 2.4 Positional encoding
- 2.5 KV-cache: mechanics and optimization
- 2.6 Speculative decoding: speeding up inference with smaller models
- 2.7 Mixture of Experts (MoE): a scaling strategy
- 2.8 State Space Models (S4, Mamba) as alternatives to transformers
- 2.9 Scaling laws: size vs. quality
- 2.10 Dropout, LayerNorm, residuals
- 2.11 Tokenization: BPE, SentencePiece, byte-level
- 2.12 Embeddings and latent representations
- 2.13 Practice: build a mini-transformer in PyTorch (see the sketch after this block)
- 2.14 Practice: inference with KV-cache and speculative decoding
- 2. Quiz (3 questions)
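Lessons 2.3 and 2.13 revolve around scaled dot-product attention, so a minimal single-head sketch may help. The module below is an illustrative stand-in (no masking, dropout, or multi-head splitting), and its names are hypothetical.

```python
import math
import torch
import torch.nn as nn

class SelfAttentionHead(nn.Module):
    """Single-head scaled dot-product self-attention (no mask, no dropout)."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head, bias=False)
        self.k_proj = nn.Linear(d_model, d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x):  # x: (batch, seq, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # softmax(Q K^T / sqrt(d_head)) gives the attention weights.
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return scores.softmax(dim=-1) @ v  # (batch, seq, d_head)

# Smoke test on random data: expect shape (2, 10, 16).
print(SelfAttentionHead(64, 16)(torch.randn(2, 10, 64)).shape)
```

A full mini-transformer stacks several such heads, concatenates their outputs, and wraps the result in the residual and LayerNorm structure from lesson 2.10.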
- 3. Block III. Training and Optimization (13 lessons)
- 3.1 Loss functions and label smoothing
- 3.2 Batches, epochs, data shuffling
- 3.3 Learning-rate warmup and cosine schedules
- 3.4 Gradient clipping: strategies and effects
- 3.5 Data-ordering effects: how batch order impacts convergence
- 3.6 Fine-tuning: full, LoRA, adapters
- 3.7 Self-supervised learning
- 3.8 RLHF: reward models and PPO
- 3.9 Reward hacking: how models trick reward functions
- 3.10 Catastrophic forgetting
- 3.11 Practice: fine-tune LLaMA/Mistral
- 3.12 Practice: LoRA vs. full fine-tune comparison (see the sketch after this block)
- 3. Quiz (3 questions)
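For lessons 3.6 and 3.12, this is one way the LoRA idea is commonly sketched: freeze a pretrained linear layer and learn a low-rank update B·A scaled by alpha/r. `LoRALinear` and its hyperparameter defaults are assumptions for illustration, not the course's reference code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)   # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r        # B starts at zero, so the update
                                      # is initially a no-op

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Since only `A` and `B` receive gradients, the trainable parameter count drops from `in_features * out_features` to `r * (in_features + out_features)`, which is the kind of saving the LoRA vs. full fine-tune comparison is about.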
- 4. Block IV. Debugging & Profiling (9 lessons)
- 4.1 Diagnosing loss spikes
- 4.2 Memory profiling: tools and GPU memory leaks
- 4.3 Gradient analysis: norms, dead neurons
- 4.4 Training instabilities: NaN detection and recovery
- 4.5 Practice: gradient visualization
- 4.6 Practice: GPU profiling during training
- 4.7 Practice: NaN debugging in a training run (see the sketch after this block)
- 4.8 Practice: compare optimizers for stability (Adam vs. LAMB vs. Lion)
- 4. Quiz (3 questions)
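Before the NaN-debugging practice in 4.7, it can help to see the basic mechanism: tensor gradient hooks that stop training at the first non-finite gradient. The helper name `install_nan_watch` is invented for this sketch.

```python
import torch
import torch.nn as nn

def install_nan_watch(model: nn.Module):
    """Attach hooks that raise on the first parameter whose gradient
    contains NaN or Inf, naming the offending parameter."""
    def make_hook(name):
        def hook(grad):
            if not torch.isfinite(grad).all():
                raise RuntimeError(f"non-finite gradient in {name}")
            return grad
        return hook
    for name, p in model.named_parameters():
        if p.requires_grad:
            p.register_hook(make_hook(name))

# Usage sketch: after install_nan_watch(model), loss.backward() fails
# loudly at the first bad tensor instead of silently corrupting weights.
model = nn.Linear(4, 2)
install_nan_watch(model)
loss = model(torch.randn(3, 4)).sum()
loss.backward()  # passes here; would raise if a grad went NaN/Inf
```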
- 5. Block V. Infrastructure and Distributed Systems (11 lessons)
- 5.1 Model parallelism: tensor parallelism, pipeline parallelism
- 5.2 Communication overhead: AllReduce, AllGather
- 5.3 Flash Attention and memory-efficient variants
- 5.4 Checkpointing and gradient accumulation (see the sketch after this block)
- 5.5 Mixed-precision training: FP16/BF16 specifics and pitfalls
- 5.6 DeepSpeed, FSDP, ZeRO
- 5.7 Dataset streaming pipelines
- 5.8 Inference servers: vLLM, TGI
- 5.9 Practice: run LLaMA on vLLM
- 5.10 Practice: quantization (int8, int4) and quality comparison
- 5. Quiz (3 questions)
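Gradient accumulation (lesson 5.4) fits in a few lines, so a runnable toy sketch follows; the model, data, and `accum_steps = 4` are placeholders standing in for a real training setup.

```python
import torch
import torch.nn as nn

# Toy stand-ins; swap in a real model and DataLoader.
model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accum_steps = 4                       # effective batch = 4 micro-batches
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()   # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one update per accum_steps batches
        optimizer.zero_grad()
```

The division by `accum_steps` keeps the accumulated gradient equal to the gradient of one large batch, which is the whole point of the technique.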
- 6. Block VI. RAG and Retrieval (8 lessons)
- 6.1 Vector databases: FAISS, Milvus
- 6.2 Embedding models
- 6.3 Chunking strategies for documents
- 6.4 Hybrid search: combining dense and sparse retrieval
- 6.5 Citation and source tracking
- 6.6 Practice: simple RAG with docs (see the sketch after this block)
- 6.7 Practice: RAG with source attribution
- 6. Quiz (3 questions)
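The retrieval core of lesson 6.6 can be shown with FAISS in a few lines. The random vectors below are a stand-in for a real embedding model (lesson 6.2), so the "nearest" chunks here are meaningless; only the index mechanics are real.

```python
import faiss
import numpy as np

docs = ["KV-cache stores per-layer keys and values.",
        "LoRA adds low-rank adapters to frozen weights.",
        "Flash Attention tiles the attention computation."]

dim = 128
rng = np.random.default_rng(0)
# Placeholder embeddings; a real pipeline would encode `docs` with a model.
doc_vecs = rng.standard_normal((len(docs), dim)).astype("float32")

index = faiss.IndexFlatL2(dim)        # exact L2 nearest-neighbor search
index.add(doc_vecs)

query_vec = rng.standard_normal((1, dim)).astype("float32")
distances, ids = index.search(query_vec, 2)
for dist, i in zip(distances[0], ids[0]):
    print(f"dist={dist:.2f}  chunk: {docs[i]}")
```

A RAG system then places the retrieved chunks into the LLM prompt; the source-attribution variant (lesson 6.7) additionally carries each chunk's document ID through to the final answer.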
- 7. Block VII. Model Internals & Interpretability (7 lessons)
- 7.1 Activation patching: editing hidden states (see the sketch after this block)
- 7.2 Interpretability techniques: probing, Grad-CAM for transformers
- 7.3 Circuit analysis: algorithms inside the model
- 7.4 Knowledge editing: locally changing facts
- 7.5 Jailbreaking techniques: alignment attacks
- 7.6 Data-contamination detection: leaks in evaluation datasets
- 7. Quiz (3 questions)
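Activation patching (lesson 7.1) reduces to "cache a hidden state from one run, splice it into another". A toy sketch with forward hooks follows; the two-layer model is invented for illustration, and real interpretability work patches specific transformer layers instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
layer = model[0]                      # the hidden state we will patch

# 1) Cache this layer's activation from a "source" input.
cache = {}
handle = layer.register_forward_hook(
    lambda mod, inp, out: cache.update(act=out.detach()))
model(torch.randn(1, 4))
handle.remove()

# 2) Run a different input, but replace the layer's output with the cache;
#    a forward hook that returns a tensor overrides the module's output.
handle = layer.register_forward_hook(lambda mod, inp, out: cache["act"])
patched_logits = model(torch.randn(1, 4))
handle.remove()
print(patched_logits)                 # downstream of the patched activation
```

Comparing `patched_logits` against an unpatched run shows how much that one hidden state determines the output, the basic causal-tracing move behind circuit analysis (lesson 7.3).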
- 8. Block VIII. Final Projects & Production (5 lessons)
- 8. Final Quiz