Anatomy of LLMs – From Math to Production
Overview
Large Language Models are no longer research curiosities; they are software systems with hard mathematical cores and very real engineering constraints. This course is designed for mid-level and senior developers who want a principled, end-to-end understanding of LLMs — from the algebra that drives attention, to mixed-precision numerics on modern accelerators, to distributed training, inference serving, retrieval-augmented generation, interpretability, and production reliability. We assume you are comfortable reading technical papers, reasoning about asymptotics and hardware, and writing nontrivial code. We do not assume prior ML specialization: we build the stack from first principles, but we speak to you as an engineer and mathematician, not as a beginner.
By the end, you will be able to: reason about representation learning and optimization in concrete linear-algebraic terms; design and benchmark GPU/TPU kernels and identify true bottlenecks (often bandwidth, not FLOPs); implement and profile a minimal transformer; train and fine-tune with robust schedules and regularization; debug instabilities (NaNs, exploding gradients, loss spikes) with the right tools; scale training with tensor/pipeline parallelism, ZeRO/FSDP, and memory-efficient attention; build retrieval pipelines with vector databases and hybrid search; probe and edit models safely; and deliver production-grade inference with observability, A/B testing, and cost control.
Block I equips you with the quantitative substrate: matrix/tensor operations as the lingua franca of neural computation, probability and cross-entropy as coding-length objectives, floating-point formats (FP32/FP16/BF16) and their failure modes, and the physics of memory bandwidth versus compute throughput. You will learn why elementwise ops are often bandwidth-bound, how to profile and fuse kernels, and how to read GPU specs (FLOPs, VRAM, interconnects) in a way that predicts wall-clock performance. We cover PCA/SVD for compression, loss landscapes and saddle points, and implement optimizers (SGD/Adam) to demystify the training loop.
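To give a flavor of the Block I practice work, here is a minimal sketch of a single Adam update written in NumPy. The hyperparameter defaults and the toy objective are illustrative assumptions, not the course's reference solution.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moment estimates plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy usage: minimize f(x) = x^2 starting from x = 5 (assumed lr for quick convergence).
x, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
print(x)  # converges toward 0
```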
Block II builds the architectural intuition: from RNN/LSTM to attention and the transformer blueprint (multi-head self-attention, positional encodings, residuals, LayerNorm). We go deep on inference mechanics: KV-cache, speculative decoding, Mixture-of-Experts, and emerging state space models (S4, Mamba). You will understand scaling laws, tokenization trade-offs, and the embedding geometry that underlies few-shot behavior. You will also implement a compact transformer and experiment with KV-cache and speculative paths.
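To make the KV-cache discussion concrete, here is a minimal single-head decoding sketch in PyTorch: each new token's key/value projections are appended to a cache so earlier positions are never recomputed. Shapes and names (`k_cache`, `v_cache`) are assumptions for illustration, not the course's reference implementation.

```python
import math
import torch

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """One autoregressive step: project the new token, append its K/V to the
    cache, and attend over all cached positions."""
    q = x_t @ W_q                                       # (1, d)
    k_cache = torch.cat([k_cache, x_t @ W_k], dim=0)    # (t, d)
    v_cache = torch.cat([v_cache, x_t @ W_v], dim=0)    # (t, d)
    scores = q @ k_cache.T / math.sqrt(q.shape[-1])     # (1, t)
    out = torch.softmax(scores, dim=-1) @ v_cache       # (1, d)
    return out, k_cache, v_cache

d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(5):                                      # decode 5 tokens
    x_t = torch.randn(1, d)
    out, k_cache, v_cache = decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache)
```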
Block III focuses on optimization at scale: loss functions and label smoothing, batching and shuffling strategies, learning-rate warmup with cosine decay, gradient clipping, and the subtle effects of data ordering. We compare full fine-tuning versus parameter-efficient methods like LoRA/adapters, cover self-supervised pretraining, RLHF with PPO and reward models, and discuss reward hacking and catastrophic forgetting with concrete mitigation strategies.
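As a small preview, the warmup-plus-cosine schedule discussed in Block III fits in a few lines; the step counts and peak learning rate below are placeholder assumptions.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2_000, total_steps=100_000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```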
Block IV is your debugging playbook: diagnosing loss spikes, profiling GPU memory, analyzing gradient norms and dead heads, and recovering from NaNs with AMP scaling, safe casts, and numerical hygiene. You will practice building visualizations that reveal optimization pathologies and profile training with Nsight/torch.profiler.
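A minimal taste of the Block IV tooling: a per-step guard that clips and logs the global gradient norm and skips the optimizer step when it is non-finite. The threshold and logging style are assumptions for illustration.

```python
import torch

def safe_step(model, optimizer, loss, max_norm=1.0):
    """Backprop, clip/log the global grad norm, and skip the update on NaN/Inf."""
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(total_norm):
        print(f"non-finite grad norm ({total_norm}); skipping optimizer step")
        return False
    optimizer.step()
    return True
```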
Block V addresses distributed systems and systems optimization: model parallelism (tensor/pipeline), communication primitives (AllReduce/AllGather) and their costs, FlashAttention and memory-efficient variants, activation/gradient checkpointing, gradient accumulation, and mixed precision specifics. We cover FSDP/ZeRO/DeepSpeed, dataset streaming, and inference servers (vLLM, TGI). You will run LLaMA on vLLM and explore quantization (int8/int4) trade-offs with calibrated evaluation.
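As a small illustration of the mixed-precision and gradient-accumulation mechanics covered here, a BF16 autocast training loop with accumulation; the accumulation count, model, and dataloader are placeholders, not course-mandated settings.

```python
import torch

ACCUM_STEPS = 8  # effective batch = ACCUM_STEPS * micro-batch size (assumed value)

def train_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(loader):
        # BF16 autocast: unlike FP16, no GradScaler is required
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(x.to(device))
            loss = torch.nn.functional.cross_entropy(logits, y.to(device))
        (loss / ACCUM_STEPS).backward()   # scale so accumulated grads average correctly
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```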
Block VI delivers RAG fundamentals: vector databases (FAISS, Milvus), embedding models, robust chunking, hybrid search (dense + sparse), and citation/source tracking. You will build a simple RAG and a version with source attribution, paying attention to latency and recall-quality trade-offs.
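Here is a minimal sketch of the dense-retrieval half of that pipeline, assuming faiss-cpu and sentence-transformers are installed; the embedding model name and toy documents are stand-ins for whatever you use in practice.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "The KV-cache stores per-layer keys and values during decoding.",
    "LoRA fine-tunes a model by adding low-rank adapter matrices.",
    "FlashAttention reduces memory traffic in the attention kernel.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
emb = model.encode(docs, normalize_embeddings=True)    # unit vectors -> inner product = cosine
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype="float32"))

query = model.encode(["How does LoRA work?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 2)
print([docs[i] for i in ids[0]])                        # top-2 retrieved chunks
```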
Block VII explores internals and interpretability: activation patching, probing, Grad-CAM-like techniques for transformers, circuit analysis, knowledge editing (ROME/MEMIT-style), jailbreaks and alignment attacks, and contamination detection. You will learn how to ask causal questions about model behavior and to guard your evaluations.
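The spirit of activation patching in a few lines: record a hidden state from a "clean" run and substitute it into a "corrupted" run via a PyTorch forward hook, then inspect the downstream effect. The two-layer toy model is purely illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))  # toy stand-in
layer = model[0]
x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

# 1) Record the layer's activation on the clean input.
cache = {}
handle = layer.register_forward_hook(lambda m, inp, out: cache.update(act=out.detach()))
model(x_clean)
handle.remove()

# 2) Re-run on the corrupted input, patching in the clean activation
#    (a forward hook that returns a tensor replaces the module's output).
handle = layer.register_forward_hook(lambda m, inp, out: cache["act"])
patched_logits = model(x_corrupt)
handle.remove()
print(patched_logits)   # causal effect of the single patched activation
```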
Block VIII lands the plane: profiling real-world bottlenecks, A/B testing and guardrails, GPU-hour economics and capacity planning, and production monitoring for degradation (latency, calibration, safety, drift). The course emphasizes measurement, reproducibility, and engineering rigor. Every practice block culminates in concrete artifacts: profilers, minimal pipelines, tuned schedulers, and service blueprints you can adapt to your stack.
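To preview the GPU-hour economics, a back-of-the-envelope cost-per-million-tokens calculation; the hourly price, throughput, and utilization figures are placeholder assumptions to replace with your own measurements.

```python
# Back-of-the-envelope serving cost (all inputs are assumptions).
gpu_hour_usd = 2.50          # assumed on-demand price for one accelerator
tokens_per_second = 2500     # assumed sustained decode throughput of the deployment
utilization = 0.6            # fraction of each hour actually serving traffic

tokens_per_hour = tokens_per_second * 3600 * utilization
cost_per_million_tokens = gpu_hour_usd / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.3f} per 1M generated tokens")
```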
Curriculum
- 9 Sections
- 76 Lessons
- Lifetime access
- 1. Block I. Math, Physics and Hardware for LLMs (16 lessons)
  - 1.1 Introduction: What is an LLM, history, myths vs. reality
  - 1.2 Linear algebra: matrices, tensors, neural network operations
  - 1.3 Probability, entropy, cross-entropy
  - 1.4 Floating point arithmetic: IEEE 754, FP16, BF16, numerical errors
  - 1.5 Memory bandwidth bottlenecks: why bandwidth matters more than FLOPS
  - 1.6 Compiler optimizations: CUDA kernels, XLA, TorchInductor
  - 1.7 GPU/TPU performance: FLOPS, VRAM, interconnects
  - 1.8 PCA, SVD: compression and representations
  - 1.9 Numerical instabilities: exploding/vanishing gradients
  - 1.10 Loss landscapes and saddle points
  - 1.11 Practice: implement your own SGD/Adam optimizer
  - 1.12 Practice: profiling matrix operations
  - 1.13 Practice: FP16 vs BF16 training simulation
  - 1.14 Practice: measure GPU memory bandwidth
  - 1.15 Practice: minimal PyTorch pipeline and bottleneck profiling
  - Quiz (3 questions)
- 2. Block II. Architectures and Mechanics of LLMs (15 lessons)
  - 2.1 RNN, LSTM, GRU: pre-transformer era
  - 2.2 Attention and seq2seq
  - 2.3 Transformer architecture: multi-head self-attention
  - 2.4 Positional encoding
  - 2.5 KV-cache: mechanics and optimization
  - 2.6 Speculative decoding: speeding up inference with smaller models
  - 2.7 Mixture of Experts (MoE): scaling strategy
  - 2.8 State Space Models (S4, Mamba) as alternatives to transformers
  - 2.9 Scaling laws: size vs. quality
  - 2.10 Dropout, LayerNorm, residuals
  - 2.11 Tokenization: BPE, SentencePiece, byte-level
  - 2.12 Embeddings and latent representations
  - 2.13 Practice: build a mini-transformer in PyTorch
  - 2.14 Practice: inference with KV-cache and speculative decoding
  - Quiz (3 questions)
- 3. Block III. Training and Optimization (13 lessons)
  - 3.1 Loss functions and label smoothing
  - 3.2 Batches, epochs, data shuffling
  - 3.3 Learning rate warmup and cosine schedules
  - 3.4 Gradient clipping: strategies and effects
  - 3.5 Data ordering effects: how batch order impacts convergence
  - 3.6 Fine-tuning: full, LoRA, adapters
  - 3.7 Self-supervised learning
  - 3.8 RLHF: reward models and PPO
  - 3.9 Reward hacking: how models trick reward functions
  - 3.10 Catastrophic forgetting
  - 3.11 Practice: fine-tune LLaMA/Mistral
  - 3.12 Practice: LoRA vs full fine-tune comparison
  - Quiz (3 questions)
- 4. Block IV. Debugging & Profiling9
- 4.1HW1F 4.1 Diagnosing loss spikes
- 4.2HW1F 4.2 Memory profiling: tools and GPU memory leaks
- 4.3HW1F 4.3 Gradient analysis: norms, dead neurons
- 4.4HW1F 4.4 Training instabilities: NaN detection and recovery
- 4.5HW1F 4.5 Practice: gradient visualization
- 4.6HW1F 4.6 Practice: GPU profiling during training
- 4.7HW1F 4.7 Practice: NaN debugging in a training run
- 4.8HW1F 4.8 Practice: compare optimizers for stability (Adam vs LAMB vs Lion)
- 4.101HW1F 4. Quiz3 Questions
- 5. Block V. Infrastructure and Distributed Systems (11 lessons)
  - 5.1 Model parallelism: tensor parallelism, pipeline parallelism
  - 5.2 Communication overhead: AllReduce, AllGather
  - 5.3 FlashAttention and memory-efficient variants
  - 5.4 Checkpointing and gradient accumulation
  - 5.5 Mixed precision training: FP16/BF16 specifics and pitfalls
  - 5.6 DeepSpeed, FSDP, ZeRO
  - 5.7 Dataset streaming pipelines
  - 5.8 Inference servers: vLLM, TGI
  - 5.9 Practice: run LLaMA on vLLM
  - 5.10 Practice: quantization (int8, int4) and quality comparison
  - Quiz (3 questions)
- 6. Block VI. RAG and Retrieval (8 lessons)
  - 6.1 Vector databases: FAISS, Milvus
  - 6.2 Embedding models
  - 6.3 Chunking strategies for documents
  - 6.4 Hybrid search: combining dense and sparse
  - 6.5 Citation and source tracking
  - 6.6 Practice: simple RAG with docs
  - 6.7 Practice: RAG with source attribution
  - Quiz (3 questions)
- 7. Block VII. Model Internals & Interpretability7
- 7.1HW1F 7.1 Activation patching: editing hidden states
- 7.2HW1F 7.2 Interpretability techniques: probing, Grad-CAM for transformers
- 7.3HW1F 7.3 Circuit analysis: algorithms inside the model
- 7.4HW1F 7.4 Knowledge editing: locally changing facts
- 7.5HW1F 7.5 Jailbreaking techniques: alignment attacks
- 7.6HW1F 7.6 Data contamination detection: leaks in evaluation datasets
- 7.101HW1F 7. Quiz3 Questions
- 8. Block VIII. Final Projects & Production5
- HW1F FinalQuiz1