Local LLM & Model Ops
Run, serve, evaluate, and promote local models to production on consumer GPUs — pick the right quant, squeeze throughput, build local RAG, and prove a model is safe before it ships.
// The loop
serve a candidate locally → pick the quant for the task + VRAM budget → measure throughput/latency/VRAM → wire it into a real task → bench vs the incumbent (fabrication-resistance first) → gate → promote → monitor → report
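The quant-selection step of the loop comes down to simple arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. Below is a minimal Python sketch of that check. The candidate list, the `headroom_gib` reserve for KV cache and runtime overhead, and the function names are all illustrative assumptions, not part of the course material. Real quantized formats usually carry some per-block metadata, so actual files run somewhat larger than this nominal figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quant:
    name: str
    bits_per_weight: float  # nominal value, assumed for illustration


# Hypothetical candidates, ordered from highest to lowest precision.
CANDIDATES = [
    Quant("16-bit", 16.0),
    Quant("8-bit", 8.0),
    Quant("4-bit", 4.0),
]


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def pick_quant(
    params_billion: float,
    vram_gib: float,
    headroom_gib: float = 2.0,  # assumed reserve for KV cache and overhead
) -> Optional[Quant]:
    """Return the highest-precision candidate whose weights fit, else None."""
    budget = vram_gib - headroom_gib
    for quant in CANDIDATES:
        if weight_gib(params_billion, quant.bits_per_weight) <= budget:
            return quant
    return None
```

Under these assumptions, `pick_quant(7, 12)` selects the 8-bit candidate, because 16-bit weights for a 7B model need about 13 GiB. The estimate is only a first filter; the measure step that follows in the loop is what confirms actual VRAM use.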
// The 6-phase roadmap
- 01 Local serving fundamentals
- 02 Quantization & model selection
- 03 Serving, throughput & performance
- 04 Embeddings & local RAG
- 05 Benchmarking & fabrication-resistance
- 06 Fine-tuning & promotion-to-prod
The local-infrastructure complement to cloud AI engineering: this course owns the metal. You learn to stand up a local model server, pick the right quantized model for a task and VRAM budget, squeeze throughput out of consumer GPUs, build a fully-local RAG pipeline, and — most importantly — prove a candidate is safe before promoting it.
The gating discipline is simple: bench before you promote. A model is not “better” because its aggregate score went up; the gate is fabrication-resistance. A model that scores high while emitting fake facts is a regression, so every candidate is benched against the incumbent on production-shaped fixtures before anything ships.
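A fabrication-first gate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's actual harness. The `BenchResult` fields, the `max_fabrication_increase` tolerance, and the secondary aggregate-score check are all hypothetical. The one property taken from the text above is the ordering: fabrication is checked first, and a higher aggregate score cannot override a fabrication regression.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BenchResult:
    aggregate_score: float   # higher is better
    fabrication_rate: float  # fraction of fixtures with a fabricated fact


def gate(
    candidate: BenchResult,
    incumbent: BenchResult,
    max_fabrication_increase: float = 0.0,  # assumed tolerance
) -> Tuple[bool, str]:
    """Decide whether the candidate may replace the incumbent."""
    # Fabrication-resistance is checked first and is never traded off
    # against the aggregate score.
    allowed = incumbent.fabrication_rate + max_fabrication_increase
    if candidate.fabrication_rate > allowed:
        return False, "fabrication regression"
    # Assumed secondary check: do not promote a lower-scoring model.
    if candidate.aggregate_score < incumbent.aggregate_score:
        return False, "aggregate score regression"
    return True, "passed"
```

For example, a candidate with a higher aggregate score (0.90 vs 0.80) but a higher fabrication rate (0.05 vs 0.02) is rejected with `"fabrication regression"`, which is the case the gate exists to catch.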