The Failures That Don’t Crash: MLOps for AI Agents
Session Abstract
This talk takes four reliability patterns from distributed systems and shows what they look like inside an agent architecture. How to shadow-test an agent. Why your circuit breakers need confidence thresholds. What an eval harness looks like when your system is non-deterministic. And why human oversight degrades faster than anyone admits.
Session Description
AI agents are shipping to production without the reliability patterns we spent decades building for distributed systems. Only 37% of teams run online evaluations on their agents (LangChain State of Agent Engineering 2026). The rest have no systematic way to detect when an agent produces a confident, plausible, wrong answer.
This talk bridges that gap. Drawing on 15 years of building systems at scale (50 billion requests/month at Start.io, shadow deployment pipelines at Riskmethods, and the core MLOps platform at Qwak), I’ll present four reliability patterns adapted for agent architectures:
1. Shadow testing agents against a baseline before promoting them to production
2. Circuit breakers with confidence thresholds instead of simple error rates
3. Evaluation harnesses designed for non-deterministic outputs
4. Structured human oversight that accounts for automation bias decay
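To give a flavor of the first pattern, here is a minimal, framework-agnostic sketch of a shadow runner. It is illustrative, not the talk's reference implementation: the `ShadowRunner` name, the text-similarity comparison, and the 0.8 threshold are all assumptions; real deployments would compare structured outputs or use a task-specific scorer.

```python
import difflib
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ShadowRunner:
    """Run a candidate agent alongside the production agent.

    The caller always receives the production answer; the candidate's
    answer is only compared and logged, never served.
    """
    prod_agent: Callable[[str], str]
    shadow_agent: Callable[[str], str]
    divergences: list = field(default_factory=list)

    def handle(self, prompt: str) -> str:
        prod_answer = self.prod_agent(prompt)
        try:
            shadow_answer = self.shadow_agent(prompt)
        except Exception as exc:  # a crashing shadow must never affect users
            self.divergences.append((prompt, prod_answer, repr(exc)))
            return prod_answer
        # Crude textual similarity as a stand-in for a real output scorer.
        similarity = difflib.SequenceMatcher(
            None, prod_answer, shadow_answer).ratio()
        if similarity < 0.8:  # assumed threshold; tune per workload
            self.divergences.append((prompt, prod_answer, shadow_answer))
        return prod_answer
```

The divergence log, not the similarity metric, is the point: it gives you a concrete queue of cases to review before promoting the candidate.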
Each pattern comes with implementation details: what to measure, where to hook into the agent lifecycle, and what failure modes to watch for. The examples are framework-agnostic and based on real production systems, not toy demos.
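As one example of those implementation details, the second pattern can be sketched as a breaker that trips on sustained low model confidence rather than hard error rates. The class name, window size, and threshold below are illustrative assumptions, not a prescribed configuration:

```python
from collections import deque

class ConfidenceBreaker:
    """Trip on sustained low confidence, not just on thrown errors.

    The last `window` confidence scores are averaged; if the mean falls
    below `threshold`, the breaker opens and calls should be routed to a
    fallback or a human until it is reset. All values are illustrative.
    """
    def __init__(self, threshold: float = 0.7, window: int = 20):
        self.threshold = threshold
        self.scores = deque(maxlen=window)
        self.open = False

    def record(self, confidence: float) -> None:
        self.scores.append(confidence)
        # Only evaluate once the window is full, to avoid tripping on noise.
        if (len(self.scores) == self.scores.maxlen
                and sum(self.scores) / len(self.scores) < self.threshold):
            self.open = True

    def allow(self) -> bool:
        return not self.open

    def reset(self) -> None:
        self.scores.clear()
        self.open = False
```

The key difference from a classic breaker is the input signal: an agent can return 200s all day while its answers quietly degrade, which is exactly the failure mode error-rate breakers miss.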
The audience will walk away with concrete patterns they can apply to their own agent deployments, whether they’re building with LangChain, LlamaIndex, custom frameworks, or bare API calls.
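To illustrate the third pattern, an eval harness for non-deterministic outputs can sample each case several times and score runs with a property check instead of an exact-match golden answer. The function name, run count, and pass-rate threshold below are hypothetical defaults:

```python
from typing import Callable, Tuple

def eval_case(agent: Callable[[str], str], prompt: str,
              check: Callable[[str], bool],
              runs: int = 5, min_pass_rate: float = 0.8) -> Tuple[bool, float]:
    """Evaluate a non-deterministic agent by repeated sampling.

    Each run is scored by a `check` predicate (e.g. "the answer contains
    the required fields"); the case passes only if enough runs do.
    Illustrative sketch: thresholds must be tuned per task.
    """
    passes = sum(1 for _ in range(runs) if check(agent(prompt)))
    rate = passes / runs
    return rate >= min_pass_rate, rate
```

Reporting the pass rate alongside the verdict matters: a case that passes 4 of 5 runs and one that passes 5 of 5 are different signals, even if both clear the bar.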