Most teams start with one model call inside one agent loop. Then the loop grows. The Planner calls a reasoning model. The Critic calls another. The evaluator suite calls a cheap classifier. A background job uses a long-context model. A fallback path calls a different provider because the first one timed out.
Six months later, nobody can answer which model made a risky recommendation, why that model was eligible, whether the fallback crossed a residency boundary, or which route change caused the cost spike.
That is the problem the AI Gateway and LLM Router solve. The AI Gateway is the model-side boundary. The Tool Gateway controls external effects; the AI Gateway controls model calls.
The mistake: treating model calls as library calls
A model SDK is a useful transport. It is not a runtime boundary.
If every subsystem imports a provider SDK directly, model behavior becomes distributed configuration. The Planner has one timeout policy. The Critic has another. Evaluators log different attributes. The refund agent silently upgrades to a larger model while the onboarding agent quietly downgrades to meet latency. Each choice may be reasonable in isolation; together they make the system hard to operate.
ContextOS treats model invocation like every other governed boundary:
| Question | Gateway answer |
|---|---|
| What operation is being performed? | judge.plan, judge.verify, judge.score, eval.safety, summarize.trace. |
| What constraints apply? | risk class, data residency, data class, max tokens, latency SLO, max cost. |
| Which models are eligible? | model profiles filtered by capability, policy, health, quota, and context size. |
| Why this route? | a RoutingDecision with rule ids, profile id, score, fallback index, and explanation. |
| Can we replay it? | route policy, model profile, compiled context ref, request envelope, usage, and trace id are pinned. |
The model call becomes an auditable operation, not a line of application code.
Router input should be explicit
The router should not infer risk from prompt text. The caller should send a provider-neutral envelope with enough structure to make selection deterministic:
{
  "operation": "judge.plan",
  "intent_id": "support.refund",
  "risk_class": "destructive",
  "compiled_context_ref": "compiled_ctx_41",
  "requirements": {
    "structured_output": true,
    "tool_calling": false,
    "max_input_tokens": 24000,
    "max_output_tokens": 2000,
    "latency_slo_ms": 2500,
    "max_cost_usd": 0.08,
    "data_residency": "us"
  },
  "routing_hints": {
    "quality_tier": "standard",
    "fallback_allowed": true,
    "canary_allowed": false
  }
}

This shape matters because it separates policy from preference. quality_tier is a hint. data_residency is a constraint. structured_output is a requirement. max_cost_usd is a budget. A router that cannot tell those apart will eventually “optimize” away a control.
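One way to keep that separation honest in code is to give requirements and hints different types, so the router cannot read a hint where a constraint is expected. The following is a minimal sketch, not the ContextOS implementation: the class names (`Requirements`, `RoutingHints`, `RouteRequest`) and the `parse_route_request` helper are hypothetical, and the set of risk classes is taken from the profile example later in this article.

```python
from dataclasses import dataclass, field

# Risk classes as they appear in this article's model profile example.
RISK_CLASSES = frozenset(
    {"read_only", "local_write", "network", "delegated", "destructive"}
)


@dataclass(frozen=True)
class Requirements:
    """Requirements, constraints, and budgets. Scoring must never relax these."""
    structured_output: bool
    tool_calling: bool
    max_input_tokens: int
    max_output_tokens: int
    latency_slo_ms: int
    max_cost_usd: float
    data_residency: str


@dataclass(frozen=True)
class RoutingHints:
    """Preferences. The router may trade these off during scoring."""
    quality_tier: str = "standard"
    fallback_allowed: bool = True
    canary_allowed: bool = False


@dataclass(frozen=True)
class RouteRequest:
    operation: str
    intent_id: str
    risk_class: str
    compiled_context_ref: str
    requirements: Requirements
    routing_hints: RoutingHints = field(default_factory=RoutingHints)

    def __post_init__(self) -> None:
        # Risk is declared by the caller, never inferred from prompt text.
        if self.risk_class not in RISK_CLASSES:
            raise ValueError(f"unknown risk_class: {self.risk_class!r}")


def parse_route_request(raw: dict) -> RouteRequest:
    """Build a typed request. Missing keys raise KeyError or TypeError,
    and unknown keys inside the nested objects raise TypeError."""
    return RouteRequest(
        operation=raw["operation"],
        intent_id=raw["intent_id"],
        risk_class=raw["risk_class"],
        compiled_context_ref=raw["compiled_context_ref"],
        requirements=Requirements(**raw["requirements"]),
        routing_hints=RoutingHints(**raw.get("routing_hints", {})),
    )
```

Because the nested objects are built with keyword expansion, a caller that sends an unrecognized field inside requirements gets an error instead of a silently ignored control.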
The five-stage routing algorithm
The router is not magic. It is a governed selection function with five stages: filter, score, select, invoke, and record.
The filter stage removes candidates that cannot legally or technically satisfy the call: wrong region, wrong data class, missing structured output, context window too small, unhealthy provider, exhausted quota.
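The filter stage can be sketched as a pure function that returns both the eligible candidates and a reason for every rejection, so the record stage has something to log. This is an illustrative sketch only: the `Candidate` fields and the reason strings are assumptions made for the example, not names from the ContextOS spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    # Hypothetical flattened view of a model profile plus live health data.
    profile_id: str
    regions: frozenset
    data_classes: frozenset
    structured_output: bool
    context_window: int
    healthy: bool
    quota_remaining: int


def rejection_reason(
    c: Candidate,
    *,
    residency: str,
    data_class: str,
    needs_structured_output: bool,
    input_tokens: int,
) -> Optional[str]:
    """Return the first hard constraint the candidate violates, or None."""
    if residency not in c.regions:
        return "wrong_region"
    if data_class not in c.data_classes:
        return "wrong_data_class"
    if needs_structured_output and not c.structured_output:
        return "missing_structured_output"
    if c.context_window < input_tokens:
        return "context_window_too_small"
    if not c.healthy:
        return "unhealthy_provider"
    if c.quota_remaining <= 0:
        return "quota_exhausted"
    return None


def filter_candidates(candidates, **constraints):
    """Split candidates into (eligible list, {profile_id: rejection reason})."""
    eligible, rejected = [], {}
    for c in candidates:
        reason = rejection_reason(c, **constraints)
        if reason is None:
            eligible.append(c)
        else:
            rejected[c.profile_id] = reason
    return eligible, rejected
```

Nothing in this function looks at quality, latency, or cost. Those belong to scoring, which only ever sees candidates that have already passed every hard constraint.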
The score stage ranks only eligible candidates. A high-risk structured-output operation might weight quality and reliability. A read-only summary might weight latency and cost. The important part is that the scoring policy is versioned, signed, and pinned per environment.
The select stage uses deterministic tie-breaks: policy priority, quality band, lower latency, lower estimated cost, stable provider priority. Do not let timestamp ordering or random selection decide high-risk routes unless you are explicitly running a canary stage.
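Scoring and selection can be sketched together as a weighted score followed by a tuple sort key that encodes the tie-break order listed above. The weights, the `Eligible` fields, and the headroom formulas are assumptions chosen for illustration; a real policy would load versioned weights rather than hard-code them.

```python
from dataclasses import dataclass

# Hypothetical per-risk-class weights. In practice these would come from a
# versioned, signed scoring policy pinned per environment.
WEIGHTS = {
    "destructive": {"quality": 0.8, "latency": 0.1, "cost": 0.1},
    "read_only": {"quality": 0.2, "latency": 0.4, "cost": 0.4},
}


@dataclass(frozen=True)
class Eligible:
    profile_id: str
    policy_priority: int    # lower value wins
    quality_band: int       # higher value wins
    quality: float          # 0.0 to 1.0
    latency_p95_ms: int
    est_cost_usd: float
    provider_priority: int  # lower value wins


def score(c: Eligible, risk_class: str, latency_slo_ms: int, max_cost_usd: float) -> float:
    """Weighted sum of quality, latency headroom, and cost headroom."""
    w = WEIGHTS[risk_class]
    latency_headroom = max(0.0, 1.0 - c.latency_p95_ms / latency_slo_ms)
    cost_headroom = max(0.0, 1.0 - c.est_cost_usd / max_cost_usd)
    total = (
        w["quality"] * c.quality
        + w["latency"] * latency_headroom
        + w["cost"] * cost_headroom
    )
    # Rounding makes near-equal scores compare equal so tie-breaks apply.
    return round(total, 3)


def select(candidates, risk_class: str, latency_slo_ms: int, max_cost_usd: float) -> Eligible:
    """Pick the best candidate; never depends on input order or randomness."""
    if not candidates:
        raise LookupError("no eligible candidates")

    def key(c: Eligible):
        return (
            -score(c, risk_class, latency_slo_ms, max_cost_usd),
            c.policy_priority,
            -c.quality_band,
            c.latency_p95_ms,
            c.est_cost_usd,
            c.provider_priority,
            c.profile_id,  # final stable tie-break
        )

    return min(candidates, key=key)
```

With these example weights, a destructive operation prefers a slower, higher-quality profile, while a read-only operation prefers a faster, cheaper one. Two candidates that are identical except for provider priority resolve the same way regardless of the order in which they are passed in.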
The invoke stage applies timeout, retry, streaming, and repair rules from the gateway policy, not from random call sites.
The record stage emits the evidence the operator will need later: selected profile, rejected candidates, fallback index, token usage, provider response model, applied rule ids, and trace span.
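The record stage might produce something like the following. The field list mirrors the evidence named above, but the `RoutingDecision` layout and the `to_record` helper are assumptions for illustration, not the spec's schema. Serializing with sorted keys keeps the record byte-stable, which helps when records are diffed or hashed for replay.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RoutingDecision:
    # Hypothetical field layout; the actual schema belongs to the ContextOS spec.
    trace_id: str
    selected_profile_id: str
    score: float
    fallback_index: int
    applied_rule_ids: tuple
    rejected: dict        # profile_id -> rejection reason
    response_model: str   # the model the provider reports it actually used
    input_tokens: int
    output_tokens: int
    explanation: str


def to_record(decision: RoutingDecision) -> str:
    """Serialize deterministically: same decision, same bytes."""
    return json.dumps(asdict(decision), sort_keys=True, separators=(",", ":"))


decision = RoutingDecision(
    trace_id="trace_example_1",
    selected_profile_id="profile_reasoning_standard_v7",
    score=0.805,
    fallback_index=0,
    applied_rule_ids=("rule_residency_us", "rule_risk_destructive"),
    rejected={"profile_eu_only": "wrong_region"},
    response_model="example-model-name",
    input_tokens=18250,
    output_tokens=640,
    explanation="highest score among eligible profiles; no fallback used",
)
record = to_record(decision)
```

The identifiers in the example instance are placeholders. The point is that rejected candidates and their reasons travel with the decision, so an operator can later see why a profile was not chosen as well as why one was.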
Model profiles, not hard-coded model names
Hard-coded model names make migrations expensive. A profile is the stable ContextOS artifact:
{
  "model_profile_id": "profile_reasoning_standard_v7",
  "provider_adapter": "provider_a",
  "deployment_ref": "secret://model-deployments/provider-a/reasoning-standard",
  "capabilities": {
    "structured_output": true,
    "tool_calling": true,
    "long_context": true,
    "streaming": true
  },
  "policy": {
    "regions": ["us", "eu"],
    "data_classes_allowed": ["PUBLIC", "INTERNAL"],
    "eligible_risk_classes": ["read_only", "local_write", "network", "delegated", "destructive"]
  },
  "score_hints": {
    "quality": 0.86,
    "latency_p95_ms": 1900,
    "cost_per_1k_input_usd": 0.002
  },
  "status": "healthy"
}

The profile lets product teams ask for a capability class instead of a vendor SKU. Platform teams can change the underlying deployment, run canaries, retire unsafe profiles, and preserve a stable contract for the runtime.
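A small registry illustrates the indirection. Callers resolve a capability class, and the platform team decides which profile version currently backs it. This is a sketch under assumptions: `ProfileRegistry`, its method names, and the `reasoning.standard` capability class label are hypothetical and not part of any published ContextOS API.

```python
class ProfileRegistry:
    """Maps a stable capability class to the profile currently serving it."""

    def __init__(self) -> None:
        self._active = {}      # capability class -> profile dict
        self._retired = set()  # profile ids that must never be served

    def publish(self, capability_class: str, profile: dict) -> None:
        """Point a capability class at a (possibly new) profile version."""
        self._active[capability_class] = profile

    def retire(self, model_profile_id: str) -> None:
        """Mark a profile as unsafe to serve, whatever class points at it."""
        self._retired.add(model_profile_id)

    def resolve(self, capability_class: str) -> dict:
        """Return the active profile, or fail closed if there is none."""
        profile = self._active.get(capability_class)
        if profile is None:
            raise LookupError(f"no profile published for {capability_class!r}")
        if profile["model_profile_id"] in self._retired:
            raise LookupError(f"profile for {capability_class!r} is retired")
        return profile


registry = ProfileRegistry()
registry.publish(
    "reasoning.standard",
    {"model_profile_id": "profile_reasoning_standard_v7", "provider_adapter": "provider_a"},
)
current = registry.resolve("reasoning.standard")
```

Product code in this sketch only ever holds the string `reasoning.standard`. Publishing a v8 profile or retiring v7 changes what resolve returns without touching any call site.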
Observability is part of the contract
The gateway should emit ordinary distributed traces plus GenAI-specific attributes. W3C Trace Context gives the portable traceparent / tracestate propagation layer. OpenTelemetry GenAI semantic conventions define useful model-call attributes such as operation name, provider, request model, response model, token usage, streaming state, and errors.
ContextOS adds the domain attributes those specs do not own:
| Attribute | Why it matters |
|---|---|
| contextos.intent_id | Aggregates quality and cost by business job. |
| contextos.risk_class | Shows whether higher-risk calls used approved profiles. |
| contextos.compiled_context_ref | Connects model output to the exact context the model saw. |
| contextos.routing_policy_id | Explains the active route policy. |
| contextos.model_profile_id | Separates stable profile from provider deployment. |
| contextos.fallback_index | Makes fallback behavior visible. |
Avoid logging full prompts and outputs by default. The OpenTelemetry GenAI conventions treat full prompt and completion content as sensitive and often expensive to capture. Store content through the same evidence and replay controls you use for other run artifacts.
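The attribute set for one model-call span could be assembled like this. The sketch builds a plain dictionary so that it runs without the OpenTelemetry packages installed. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as they stood at the time of writing; those conventions are still evolving, so check the names against the version you pin. The function name and signature are hypothetical.

```python
def model_call_attributes(
    *,
    operation_name: str,
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    intent_id: str,
    risk_class: str,
    compiled_context_ref: str,
    routing_policy_id: str,
    model_profile_id: str,
    fallback_index: int,
) -> dict:
    """Span attributes for one gateway call. Takes no prompt or output text,
    so content cannot end up in telemetry by accident."""
    return {
        # OpenTelemetry GenAI semantic convention attributes.
        "gen_ai.operation.name": operation_name,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # ContextOS domain attributes from the table above.
        "contextos.intent_id": intent_id,
        "contextos.risk_class": risk_class,
        "contextos.compiled_context_ref": compiled_context_ref,
        "contextos.routing_policy_id": routing_policy_id,
        "contextos.model_profile_id": model_profile_id,
        "contextos.fallback_index": fallback_index,
    }


attrs = model_call_attributes(
    operation_name="chat",
    request_model="example-model-name",
    response_model="example-model-name",
    input_tokens=18250,
    output_tokens=640,
    intent_id="support.refund",
    risk_class="destructive",
    compiled_context_ref="compiled_ctx_41",
    routing_policy_id="route_policy_example",
    model_profile_id="profile_reasoning_standard_v7",
    fallback_index=0,
)
```

The span carries compiled_context_ref rather than the context itself. Anyone who needs the actual content follows the reference through the evidence store and its access controls.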
Failure modes the gateway should block
| Failure | Correct response |
|---|---|
| Provider timeout | retry only if policy allows, then use the next eligible fallback. |
| Rate limit | move within the same policy band and record fallback. |
| Budget breach | downgrade only if the quality floor still holds; otherwise return BUDGET_EXCEEDED. |
| Structured output invalid | repair only when allowed; otherwise return SCHEMA_INVALID. |
| Residency mismatch | refuse; never fall back across residency. |
| Profile not eligible for risk class | refuse; do not let the caller override with a model name. |
The gateway should fail closed. A model call that cannot satisfy the route contract is a runtime error, not a creative opportunity.
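Two rows of the table, fallback after a provider failure and budget enforcement, can be sketched as functions that either return an allowed next step or raise. `BUDGET_EXCEEDED` comes from the table above; the `RouteError` class, the other error codes, and the profile dictionary shape are assumptions made for this example.

```python
class RouteError(Exception):
    """Raised when a call cannot satisfy its route contract."""

    def __init__(self, code: str, detail: str = "") -> None:
        super().__init__(f"{code}: {detail}" if detail else code)
        self.code = code


def next_fallback(chain, failed_index, *, residency, risk_class, fallback_allowed):
    """Return (index, profile) of the next usable fallback, or raise.

    Profiles outside the required residency or not approved for the risk
    class are skipped, never used as a last resort.
    """
    if not fallback_allowed:
        raise RouteError("FALLBACK_NOT_ALLOWED")  # hypothetical code
    for index in range(failed_index + 1, len(chain)):
        profile = chain[index]
        if residency not in profile["regions"]:
            continue
        if risk_class not in profile["eligible_risk_classes"]:
            continue
        return index, profile
    raise RouteError("NO_ELIGIBLE_FALLBACK")  # hypothetical code


def check_budget(estimated_cost_usd, max_cost_usd, *, downgrade_quality=None, quality_floor=0.0):
    """Return 'proceed' or 'downgrade'; raise BUDGET_EXCEEDED otherwise."""
    if estimated_cost_usd <= max_cost_usd:
        return "proceed"
    if downgrade_quality is not None and downgrade_quality >= quality_floor:
        return "downgrade"
    raise RouteError("BUDGET_EXCEEDED")
```

Neither function has a branch that quietly picks something anyway. When the chain is exhausted or the quality floor cannot be met, the caller gets a typed error it has to handle.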
What changes operationally
First, cost becomes attributable. Instead of “the agent got expensive,” the team can see support.refund using profile_reasoning_premium_v3 after a policy rollout.
Second, model migration becomes safer. You can run profile candidates in shadow, compare scorecards, and roll forward by route policy rather than editing Planner code.
Third, incident response gets shorter. When a run fails, the DecisionRecord can point to the exact model profile, provider adapter, route rule, token budget, and fallback decision involved.
Fourth, platform governance moves out of prompt text. The model can be brilliant or mediocre; either way, model selection remains a deterministic runtime function.
Research base
- ContextOS spec: AI Gateway and LLM Router, Reference Architecture, and Evaluation and Observability.
- OpenTelemetry GenAI semantic conventions for model-call spans and token usage attributes.
- W3C Trace Context for portable distributed trace propagation.
- OWASP Top 10 for LLM Applications for the security framing around prompt injection, excessive agency, data exposure, and unbounded consumption.