Right-Size Every Call: An AI Cost and Latency Router
Build middleware that routes each request between small and large models by predicted complexity, measuring the tradeoff across cost, latency, and quality.
Why this matters
Sending every request to the largest model is a common source of avoidable AI spend, but routing naively to a smaller model drops quality on the requests that needed the large one. The value is in a router that protects a quality floor while cutting cost and latency, and proves the tradeoff with numbers.
Persona
Platform engineer running an LLM feature at scale
Current manual workflow
Every request goes to a single large model regardless of difficulty. Cost and latency are reviewed monthly, after the bill arrives, and no per-request complexity signal exists.
The AI workflow to build
The middleware scores each request for predicted complexity, routes simple requests to a small model and hard ones to a large model, and falls back to the large model when a small-model output fails a quality check. It logs cost, latency, and a quality score per request and reports the aggregate tradeoff against an all-large baseline.
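The routing and fallback flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `Model`, `RouteLog`, `score_complexity`, the cue-word heuristic, the flat `cost_per_call`, and the 0.5 threshold are all hypothetical, and the model clients are stand-in callables rather than any real provider API. The quality check here is a plain pass/fail callable; a real build would also log a numeric quality score.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Model:
    name: str
    cost_per_call: float          # hypothetical flat cost in arbitrary units
    call: Callable[[str], str]    # stand-in for a real model client


@dataclass
class RouteLog:
    complexity: float
    model_used: str
    escalated: bool
    cost: float
    latency_s: float
    output: str


def score_complexity(request: str) -> float:
    """Toy heuristic (an assumption): length plus reasoning cue words, clamped to 0..1."""
    cues = ("why", "compare", "step", "plan", "derive")
    cue_hits = sum(cue in request.lower() for cue in cues)
    length_part = min(len(request.split()) / 50, 1.0)
    return min(1.0, 0.5 * length_part + 0.25 * cue_hits)


def route(
    request: str,
    small: Model,
    large: Model,
    passes_quality: Callable[[str, str], bool],
    threshold: float = 0.5,
) -> RouteLog:
    start = time.perf_counter()
    complexity = score_complexity(request)
    escalated = False
    if complexity >= threshold:
        # Predicted hard: go straight to the large model.
        model, cost = large, large.cost_per_call
        output = large.call(request)
    else:
        # Predicted simple: try the small model first.
        model, cost = small, small.cost_per_call
        output = small.call(request)
        if not passes_quality(request, output):
            # Failed the quality check: escalate instead of shipping it.
            # The request pays for both calls.
            escalated = True
            model = large
            cost += large.cost_per_call
            output = large.call(request)
    latency_s = time.perf_counter() - start
    return RouteLog(complexity, model.name, escalated, cost, latency_s, output)
```

Note that an escalated request costs more and takes longer than sending it to the large model directly, which is why the escalation rate belongs in the tradeoff report.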
Inputs
- a request stream with mixed difficulty
- two or more candidate models
- a quality check or rubric
- a quality floor configuration
Outputs
- a per-request routing decision
- cost, latency, and quality logs
- an aggregate tradeoff report against the all-large baseline
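One possible shape for the per-request log and the aggregate report is sketched below. The `Record` fields, the nearest-rank p95 definition, and the choice to express the escalation rate as a share of small-model attempts are assumptions made for illustration; the brief does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class Record:
    cost: float
    latency_s: float
    quality: float
    tried_small: bool = False   # request was first sent to the small model
    escalated: bool = False     # small output failed the check, large was called


def p95(values):
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def tradeoff_report(routed, baseline, quality_floor):
    """Compare routed records with all-large baseline records for the same stream."""
    def mean(xs):
        return sum(xs) / len(xs)

    routed_cost = sum(r.cost for r in routed)
    base_cost = sum(r.cost for r in baseline)
    routed_quality = mean([r.quality for r in routed])
    base_quality = mean([r.quality for r in baseline])
    small_attempts = sum(r.tried_small for r in routed)
    escalations = sum(r.escalated for r in routed)
    return {
        "cost_reduction_pct": 100 * (1 - routed_cost / base_cost),
        "p95_latency_reduction_pct": 100
        * (1 - p95([r.latency_s for r in routed]) / p95([r.latency_s for r in baseline])),
        "quality_delta": routed_quality - base_quality,
        "quality_floor_held": routed_quality >= quality_floor,
        "escalation_rate_pct": 100 * escalations / small_attempts if small_attempts else 0.0,
    }
```

A negative latency reduction is a real possibility on small batches, since escalated requests pay for two calls and tend to land in the tail.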
Definition of done
On a synthetic request stream with labeled easy and hard items, the router cuts cost and latency against the all-large baseline while holding aggregate quality at or above the configured floor, and escalates failed small-model outputs to the large model rather than shipping them.
Example batch: 1000 requests, 70 percent simple lookups and 30 percent multi-step reasoning, with the all-large baseline cost, latency, and quality recorded first.
Example report: cost down 58 percent, p95 latency down 40 percent, aggregate quality within 1 point of baseline, and 4 percent of small-model outputs escalated to the large model after a failed quality check.
Data plan
synthetic data
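A labeled synthetic stream could be generated along these lines. The templates, topics, and the `SyntheticRequest` type are invented for illustration; the 1000-request size and 70/30 split mirror the example batch in this brief, and the fixed seed keeps runs reproducible.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticRequest:
    request_id: int
    text: str
    label: str  # "easy" or "hard"


# Hypothetical templates; a real build would want far more variety.
EASY_TEMPLATES = [
    "What is the capital of {topic}?",
    "Define the term {topic}.",
]
HARD_TEMPLATES = [
    "Compare two approaches to {topic} and derive which is cheaper, step by step.",
    "Plan a migration for {topic} and explain why each step is ordered as it is.",
]
TOPICS = ["France", "caching", "billing", "indexing", "Japan"]


def make_stream(n=1000, hard_fraction=0.3, seed=0):
    """Return n labeled requests with an exact hard/easy split, shuffled."""
    rng = random.Random(seed)
    n_hard = round(n * hard_fraction)
    labels = ["hard"] * n_hard + ["easy"] * (n - n_hard)
    rng.shuffle(labels)
    stream = []
    for i, label in enumerate(labels):
        templates = HARD_TEMPLATES if label == "hard" else EASY_TEMPLATES
        text = rng.choice(templates).format(topic=rng.choice(TOPICS))
        stream.append(SyntheticRequest(i, text, label))
    return stream
```

Keeping the label on each item is what makes routing accuracy measurable later, since the router itself never sees it.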
Boundaries and non-goals
- training a router model from scratch
- real production traffic
- provider billing integration
Evaluation ideas
- cost and latency reduction against baseline
- quality floor adherence
- escalation precision
- routing accuracy on labeled difficulty
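Two of these metrics can be pinned down with a short sketch. The definitions are assumptions, since the brief names the metrics without defining them: routing accuracy is taken as the share of requests whose first route matches the label (easy to small, hard to large), and escalation precision as the share of escalations where the small-model output was genuinely below the bar according to a ground-truth judgment. `EvalRow` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalRow:
    label: str              # "easy" or "hard", from the synthetic stream
    first_route: str        # "small" or "large", the router's first choice
    escalated: bool         # the runtime quality check triggered a fallback
    small_output_bad: bool  # ground truth: the small output really was inadequate


def routing_accuracy(rows):
    """Share of requests whose first route matches the labeled difficulty."""
    expected = {"easy": "small", "hard": "large"}
    correct = sum(r.first_route == expected[r.label] for r in rows)
    return correct / len(rows)


def escalation_precision(rows):
    """Share of escalations that were warranted; None when nothing escalated."""
    escalated = [r for r in rows if r.escalated]
    if not escalated:
        return None
    return sum(r.small_output_bad for r in escalated) / len(escalated)
```

Low escalation precision suggests the quality check is too strict and is paying for large-model calls that were not needed.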
Run Level target
R3 Reliable. Plain translation: handles real cases.
Scope envelope
Buildable by one solo builder in 20 to 30 focused hours, on public, synthetic, or sanitized data, with a demo path that requires no production access.
Suggested tools
Suggested options, never requirements; briefs are tool-agnostic.