AI Scaffolded

Local LLM & Model Ops

Run, serve, evaluate, and promote local models to production on consumer GPUs — pick the right quant, squeeze throughput, build local RAG, and prove a model is safe before it ships.

// The loop

serve a candidate locally → pick the quant for the task + VRAM budget → measure throughput/latency/VRAM → wire it into a real task → bench vs the incumbent (fabrication-resistance first) → gate → promote → monitor → report

// The 6-phase roadmap

  1. Local serving fundamentals
  2. Quantization & model selection
  3. Serving, throughput & performance
  4. Embeddings & local RAG
  5. Benchmarking & fabrication-resistance
  6. Fine-tuning & promotion-to-prod

The local-infrastructure complement to cloud AI engineering: this course owns the metal. You learn to stand up a local model server, pick the right quantized model for a task and VRAM budget, squeeze throughput out of consumer GPUs, build a fully-local RAG pipeline, and — most importantly — prove a candidate is safe before promoting it.
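Picking a quant for a VRAM budget starts with simple arithmetic: the weights alone take roughly params × bits-per-weight / 8 bytes. The sketch below is a minimal, hypothetical illustration of that check, not the course's tooling. It assumes a flat `overhead_gib` allowance for KV cache, activations and runtime (in practice the KV cache grows with context length and batch size), and it uses nominal bit widths (real quant formats usually land slightly above nominal because of stored scales). The names `weight_vram_gib`, `pick_quant` and `QUANTS` are made up for this example.

```python
from typing import Optional, Sequence, Tuple


def weight_vram_gib(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone: params * bits / 8 bytes, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def pick_quant(params_billion: float,
               vram_gib: float,
               candidates: Sequence[Tuple[str, float]],
               overhead_gib: float = 1.5) -> Optional[str]:
    """Return the first (highest-precision) quant whose weights plus overhead fit.

    overhead_gib is a hypothetical flat allowance for KV cache and runtime;
    real overhead depends on context length, batch size and the server.
    """
    for name, bits in candidates:
        if weight_vram_gib(params_billion, bits) + overhead_gib <= vram_gib:
            return name
    return None  # nothing fits: consider a smaller model instead


# Hypothetical candidate list, ordered from highest to lowest precision.
QUANTS = [("fp16", 16.0), ("8-bit", 8.0), ("4-bit", 4.0)]

print(round(weight_vram_gib(7, 16), 2))  # 13.04
print(pick_quant(7, 12, QUANTS))          # 8-bit
```

A 7B model in fp16 needs about 13 GiB for weights alone, so on a 12 GiB card this sketch falls through to the 8-bit entry. Whether that quant is good enough for the task is a separate question that only benchmarking answers.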

The gating discipline is simple: bench before you promote. A model is not "better" just because its aggregate score went up; the gate is fabrication-resistance. A model that scores higher while emitting fake facts is a regression. Every candidate is benched against the incumbent on production-shaped fixtures before anything ships.
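The gate can be expressed as a small decision function. This is a minimal sketch under stated assumptions, not the course's actual promotion logic: it assumes each bench run has already been reduced to two hypothetical numbers, an `aggregate_score` and a `fabrication_rate` (fraction of fixture answers containing fabricated facts). Fabrication is checked first, so a higher aggregate score can never buy back a fabrication regression. The `BenchResult` and `should_promote` names and the optional `max_fabrication_rate` ceiling are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchResult:
    """Hypothetical summary of one model's run over production-shaped fixtures."""
    aggregate_score: float   # overall task score, higher is better
    fabrication_rate: float  # fraction of answers with fabricated facts, lower is better


def should_promote(candidate: BenchResult,
                   incumbent: BenchResult,
                   max_fabrication_rate: Optional[float] = None) -> bool:
    """Gate a candidate against the incumbent, fabrication-resistance first."""
    # 1. Fabrication-resistance is the gate: fabricating more than the
    #    incumbent is a regression, whatever the aggregate score says.
    if candidate.fabrication_rate > incumbent.fabrication_rate:
        return False
    # Optional absolute ceiling (a hypothetical policy knob).
    if max_fabrication_rate is not None and candidate.fabrication_rate > max_fabrication_rate:
        return False
    # 2. Only a candidate that passes the fabrication gate is compared on score.
    return candidate.aggregate_score >= incumbent.aggregate_score


incumbent = BenchResult(aggregate_score=0.80, fabrication_rate=0.02)
print(should_promote(BenchResult(0.90, 0.05), incumbent))  # False: higher score, more fabrication
print(should_promote(BenchResult(0.85, 0.01), incumbent))  # True
```

Ordering the checks this way encodes the rule that a high score earned while fabricating is a regression; the score comparison only runs once the fabrication gate has passed.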


More in AI

Track overview