// The loop

serve a candidate locally → pick the quant for the task + VRAM budget → measure throughput/latency/VRAM → wire it into a real task → bench vs the incumbent (fabrication-resistance first) → gate → promote → monitor → report

// The 6-phase roadmap

01 Local serving fundamentals
02 Quantization & model selection
03 Serving, throughput & performance
04 Embeddings & local RAG
05 Benchmarking & fabrication-resistance
06 Fine-tuning & promotion-to-prod

This course is the local-infrastructure complement to cloud AI engineering: it owns the metal. You learn to stand up a local model server, pick the right quantized model for a task and VRAM budget, squeeze throughput out of consumer GPUs, build a fully local RAG pipeline, and, most importantly, prove a candidate is safe before promoting it.
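
As a rough illustration of picking a quant for a VRAM budget, the sketch below estimates weight memory as parameters × bits per weight ÷ 8 and adds a flat overhead for KV cache and runtime buffers. The function names, the quant labels, and the 1.5 GiB overhead are assumptions made for this example, not figures from the course; real usage depends on context length, batch size, and the serving runtime, so measure before trusting the estimate.

```python
from typing import Optional

GIB = 2**30  # bytes per GiB


def estimate_vram_gib(params_billions: float, bits_per_weight: float,
                      overhead_gib: float = 1.5) -> float:
    """Rough VRAM estimate: quantized weights plus a flat overhead.

    overhead_gib is an assumed placeholder for KV cache, activations,
    and runtime buffers; it is not a measured value.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes / GIB + overhead_gib


def pick_quant(params_billions: float, budget_gib: float,
               quants: dict[str, float]) -> Optional[str]:
    """Return the highest-precision quant whose estimate fits the budget.

    quants maps a hypothetical label to its bits per weight.
    Returns None when nothing fits.
    """
    # Try the highest bit width first: more bits usually means less quality loss.
    for label, bits in sorted(quants.items(), key=lambda kv: kv[1], reverse=True):
        if estimate_vram_gib(params_billions, bits) <= budget_gib:
            return label
    return None


if __name__ == "__main__":
    options = {"16-bit": 16, "8-bit": 8, "4-bit": 4}  # illustrative labels
    print(pick_quant(7, 8.0, options))
    print(pick_quant(7, 12.0, options))
```

The estimate only narrows the shortlist; the measured VRAM from the serving step is what decides whether a quant actually fits.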

The gating discipline is bench before you promote. A model is not “better” because its aggregate score went up; the gate is fabrication-resistance. A model that scores high while emitting fake facts is a regression. Every candidate is benched against the incumbent on production-shaped fixtures before anything ships.
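
The gate described above can be sketched as a small comparison function. This is a minimal illustration, assuming both models were run on the same production-shaped fixtures and that each output has already been labeled as fabricated or not. `BenchResult`, `gate`, and the `max_fab_regression` tolerance are hypothetical names for this example, not part of any particular benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchResult:
    """Results for one model on a shared fixture set (hypothetical shape)."""
    name: str
    aggregate_score: float  # higher is better
    fabricated: int         # fixtures whose output contained an unsupported fact
    total: int              # fixtures run

    @property
    def fabrication_rate(self) -> float:
        return self.fabricated / self.total


def gate(candidate: BenchResult, incumbent: BenchResult,
         max_fab_regression: float = 0.0) -> tuple[bool, str]:
    """Decide promotion. Fabrication is checked before the aggregate score."""
    if candidate.total == 0 or candidate.total != incumbent.total:
        return False, "fixture sets are empty or differ"
    # Fabrication-resistance first: a higher score cannot buy back fake facts.
    if candidate.fabrication_rate > incumbent.fabrication_rate + max_fab_regression:
        return False, "fabrication regression"
    if candidate.aggregate_score <= incumbent.aggregate_score:
        return False, "no aggregate improvement"
    return True, "passed"


if __name__ == "__main__":
    incumbent = BenchResult("incumbent", aggregate_score=0.80, fabricated=2, total=100)
    flashy = BenchResult("flashy", aggregate_score=0.90, fabricated=5, total=100)
    careful = BenchResult("careful", aggregate_score=0.85, fabricated=1, total=100)
    print(gate(flashy, incumbent))
    print(gate(careful, incumbent))
```

Ordering the checks this way means the higher-scoring candidate is rejected when it fabricates more than the incumbent, which is the point of the discipline.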

Local LLM & Model Ops
