The ironies of AI automation, Part 2: more ops, more glue, more humans-in-the-loop
Under the hood, today’s “automated” AI stacks are anything but hands-off. Agentic tool use, structured output modes, and function calling have matured into production primitives, but they arrive with state machines, retrieval layers, prompt registries, and evaluation harnesses that need constant care. What’s notable here isn’t just capability growth; it’s the operational surface area. Teams now ship canaries for prompts, deterministic fallbacks for non-deterministic models, and observability that looks suspiciously like SRE for language models.
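To make “deterministic fallbacks for non-deterministic models” concrete, here is a minimal sketch; the names (`call_model`, `call_with_fallback`, the intent/confidence contract) are illustrative assumptions, not tied to any particular framework. The wrapper retries a flaky model call a bounded number of times, validates the output against a small contract, and returns a canned, deterministic default if validation never passes.

```python
import json
from typing import Callable

# The output contract we enforce before anything downstream sees the result.
REQUIRED_KEYS = {"intent", "confidence"}

def parse_and_validate(raw: str) -> dict | None:
    """Return the parsed payload if it satisfies the contract, else None."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or not REQUIRED_KEYS.issubset(payload):
        return None
    return payload

def call_with_fallback(call_model: Callable[[str], str],
                       prompt: str,
                       max_retries: int = 2) -> dict:
    """Retry a non-deterministic model call, then fall back deterministically."""
    for _ in range(max_retries + 1):
        payload = parse_and_validate(call_model(prompt))
        if payload is not None:
            return payload
    # Deterministic fallback: a safe default that downstream code always accepts.
    return {"intent": "unknown", "confidence": 0.0}

# Usage with a stubbed client standing in for a real LLM call:
if __name__ == "__main__":
    replies = iter(['not json', '{"intent": "refund", "confidence": 0.93}'])
    result = call_with_fallback(lambda _prompt: next(replies),
                                "Classify: 'I want my money back'")
    print(result)  # {'intent': 'refund', 'confidence': 0.93}
```

The specific checks matter less than the pattern: every such wrapper is glue code that someone now owns, monitors, and debugs.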
The bigger picture: automation is moving costs from manual labor to GPUs, data curation, and verification. Inference is cheaper thanks to quantization, speculative decoding, and distillation; verification is not. That gap explains the boom in LLM eval tooling, safety guardrails, and red-teaming-as-a-service. Worth noting: open-weight models have closed enough of the quality gap to be viable in many pipelines, but they still lean on the same scaffolding: RAG, tool orchestration, and strict schema enforcement to keep outputs contract-safe. The net effect isn’t fewer humans; it’s new roles at different choke points: prompt QA, incident triage, and policy tuning. Automation is real, but the irony stands: getting reliable autonomy requires more glue code, more monitoring, and a clearer definition of “done” than ever.
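One way teams pin down that “definition of done” is an eval gate that blocks a rollout when the pass rate on hand-written cases drops. Below is a minimal sketch under stated assumptions: `run_pipeline`, `eval_gate`, and the toy cases are hypothetical, and real harnesses layer datasets, versioning, and dashboards on the same skeleton.

```python
from typing import Callable

# Hand-written eval cases: a prompt plus a predicate that defines "done" for it.
EVAL_CASES = [
    ("Extract the destination city from: 'Ship to Berlin by Friday'",
     lambda out: "berlin" in out.lower()),
    ("Answer yes or no: is 7 prime?",
     lambda out: out.strip().lower().startswith("yes")),
]

def eval_gate(run_pipeline: Callable[[str], str], threshold: float = 0.9) -> bool:
    """Run every case through the pipeline and gate a rollout on the pass rate."""
    passed = sum(1 for prompt, check in EVAL_CASES if check(run_pipeline(prompt)))
    pass_rate = passed / len(EVAL_CASES)
    print(f"eval pass rate: {pass_rate:.0%} ({passed}/{len(EVAL_CASES)})")
    return pass_rate >= threshold

# Usage with a stub pipeline; in CI this return value would block the deploy.
if __name__ == "__main__":
    stub = lambda prompt: "Yes." if "prime" in prompt else "Shipping to Berlin."
    print("deploy allowed:", eval_gate(stub))
```

The gate itself is trivial; the ongoing human work is keeping the cases, predicates, and threshold honest as prompts and models change underneath them.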