The "Self-Improving" AI Myth (And What 60 Production Deployments Actually Do)
Sixty production deployments converged on a three-layer architecture where the eval surface, not the base model, is the moat.
The AI Runtime library is our own writing: weekly issues on Vertical AI Agents, Model Reliability Engineering, and the lessons from the trenches of shipping AI. Free, every week.
His new essay asks regulators to build an FAA for AI models. If you ship AI inside a bank, you already work in that regime, and there is a five-minute test that tells you whether your agent is as far
The harness engineering discourse names what to build. The Model Reliability Engineering arc names how long the build lasts, what kills it, and what to do at week six.
Most AI projects die in the gap between "works in the demo" and "works in production."
Quantization changes the numbers. Lossless compression removes the wasted bits and keeps every output identical, for about 30% less memory.
The leading AI legal research tools still hallucinate on up to a third of queries, so the production answer in law is not a better model but a harness built to assume the model is wrong.
Sixteen published deep-dives, four modules, one operating thesis. The harness around the model is the product. Free.
Six hundred lines of code, no abstractions, and the argument that every wrapper around the LLM is on borrowed time.
Context Engineering for code agents is the discipline of deciding what the model knows about a codebase, its conventions, and the organization at inference time.
Thirty-plus harnesses, four topologies, two billion-dollar valuations, one collapsing abstraction layer. The canonical landscape of how autonomous agents drive the web - and the trade-offs that decide
Learnings from the first hundred days of MPP and the year-plus of x402: how Parallel, Browserbase, fal.ai, and AWS are actually running it, where the production failure modes are, and the archite
Seven layers wrap every LLM that has shipped in healthcare, banking, and insurance. The model itself is the smallest of them — here’s what the other six are doing.
Two talks, one diagnosis — the infrastructure layer between AI capability and enterprise production is the bottleneck, and it isn’t being built by the model labs.
Tool descriptions are now executable instructions, the dependency graph for agents runs through hundreds of unvetted servers, and the registry your enterprise needs to govern them does not yet exist.
Per-token pricing was the right unit when API calls were single-shot. Is it still the right unit when your agent runs adaptive thinking, fans out tool calls, spawns sub-agents, and retries on partial failure?
HockeyStack just raised $50M to scale a vertical agent platform whose reasoning engine is a custom ML pipeline — not a frontier model. Why that matters for anyone building agents.
MIT’s open-source agent swarm replaces the orchestrator with an artifact reactor. The architecture is worth studying even if you’ll never build a science swarm.
OpenClaw isn’t a product to adopt. It’s a reference architecture to decompose. Five primitives, three production-grade use cases that earn real revenue, and a harness audit checklist for anyone build
Enterprise software only creates value when it’s actually deployed, and deployment is overwhelmingly a labor problem, not a software problem.