Building the runtime
May 14, 2026

AI Gateway and LLM Router: Model Choice Is a Runtime Decision


Most teams start with one model call inside one agent loop. Then the loop grows. The Planner calls a reasoning model. The Critic calls another. The evaluator suite calls a cheap classifier. A background job uses a long-context model. A fallback path calls a different provider because the first one timed out.

Six months later, nobody can answer which model made a risky recommendation, why that model was eligible, whether the fallback crossed a residency boundary, or which route change caused the cost spike.

That is the problem the AI Gateway and LLM Router solve. The AI Gateway is the model-side boundary. The Tool Gateway controls external effects; the AI Gateway controls model calls.

The mistake: treating model calls as library calls

A model SDK is a useful transport. It is not a runtime boundary.

If every subsystem imports a provider SDK directly, model behavior becomes distributed configuration. The Planner has one timeout policy. The Critic has another. Evaluators log different attributes. The refund agent silently upgrades to a larger model while the onboarding agent quietly downgrades to meet latency. Each choice may be reasonable in isolation; together they make the system hard to operate.

ContextOS treats model invocation like every other governed boundary:

| Question | Gateway answer |
| --- | --- |
| What operation is being performed? | judge.plan, judge.verify, judge.score, eval.safety, summarize.trace. |
| What constraints apply? | risk class, data residency, data class, max tokens, latency SLO, max cost. |
| Which models are eligible? | model profiles filtered by capability, policy, health, quota, and context size. |
| Why this route? | a RoutingDecision with rule ids, profile id, score, fallback index, and explanation. |
| Can we replay it? | route policy, model profile, compiled context ref, request envelope, usage, and trace id are pinned. |

The model call becomes an auditable operation, not a line of application code.

Router input should be explicit

The router should not infer risk from prompt text. The caller should send a provider-neutral envelope with enough structure to make selection deterministic:

{
  "operation": "judge.plan",
  "intent_id": "support.refund",
  "risk_class": "destructive",
  "compiled_context_ref": "compiled_ctx_41",
  "requirements": {
    "structured_output": true,
    "tool_calling": false,
    "max_input_tokens": 24000,
    "max_output_tokens": 2000,
    "latency_slo_ms": 2500,
    "max_cost_usd": 0.08,
    "data_residency": "us"
  },
  "routing_hints": {
    "quality_tier": "standard",
    "fallback_allowed": true,
    "canary_allowed": false
  }
}

This shape matters because it separates policy from preference. quality_tier is a hint. data_residency is a constraint. structured_output is a requirement. max_cost_usd is a budget. A router that cannot tell those apart will eventually “optimize” away a control.
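One way to keep that distinction from eroding is to encode it in types. The sketch below is illustrative only: the class names and the `parse_envelope` helper are assumptions, not ContextOS APIs. Requirements have no defaults, so a missing constraint is an error, while hints default to the most conservative value.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Requirements:
    """Hard requirements, constraints, and budgets: every field is mandatory."""
    structured_output: bool
    tool_calling: bool
    max_input_tokens: int
    max_output_tokens: int
    latency_slo_ms: int
    max_cost_usd: float
    data_residency: str


@dataclass(frozen=True)
class RoutingHints:
    """Preferences only: each default is the most conservative choice."""
    quality_tier: str = "standard"
    fallback_allowed: bool = False
    canary_allowed: bool = False


@dataclass(frozen=True)
class RouteRequest:
    operation: str
    intent_id: str
    risk_class: str
    compiled_context_ref: str
    requirements: Requirements
    routing_hints: RoutingHints


def parse_envelope(raw: Mapping[str, Any]) -> RouteRequest:
    # A missing or unknown requirement key raises TypeError instead of defaulting.
    requirements = Requirements(**raw["requirements"])
    # Hints may be omitted entirely; the defaults above then apply.
    hints = RoutingHints(**raw.get("routing_hints", {}))
    return RouteRequest(
        operation=raw["operation"],
        intent_id=raw["intent_id"],
        risk_class=raw["risk_class"],
        compiled_context_ref=raw["compiled_context_ref"],
        requirements=requirements,
        routing_hints=hints,
    )
```

With this shape, a caller that forgets data_residency gets an error at the boundary rather than a silently unconstrained route.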

The five-stage routing algorithm

The router is not magic. It is a governed selection function.

The filter stage removes candidates that cannot legally or technically satisfy the call: wrong region, wrong data class, missing structured output, context window too small, unhealthy provider, exhausted quota.
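A minimal sketch of the filter stage follows. The `Profile` and `Call` shapes, the reason strings, and the `context_window` and `quota_remaining` fields are assumptions made for illustration. The point is that each rejection carries a reason, so nothing is dropped silently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Profile:
    profile_id: str
    regions: Tuple[str, ...]
    data_classes_allowed: Tuple[str, ...]
    structured_output: bool
    context_window: int    # hypothetical: max input tokens the model accepts
    healthy: bool
    quota_remaining: int   # hypothetical: calls left in the current quota window


@dataclass(frozen=True)
class Call:
    data_residency: str
    data_class: str
    needs_structured_output: bool
    max_input_tokens: int


def rejection_reason(p: Profile, c: Call) -> Optional[str]:
    """Return the first hard constraint the profile fails, or None if eligible."""
    if c.data_residency not in p.regions:
        return "wrong_region"
    if c.data_class not in p.data_classes_allowed:
        return "wrong_data_class"
    if c.needs_structured_output and not p.structured_output:
        return "missing_structured_output"
    if p.context_window < c.max_input_tokens:
        return "context_window_too_small"
    if not p.healthy:
        return "unhealthy_provider"
    if p.quota_remaining <= 0:
        return "quota_exhausted"
    return None


def filter_candidates(
    profiles: List[Profile], c: Call
) -> Tuple[List[Profile], Dict[str, str]]:
    eligible: List[Profile] = []
    rejected: Dict[str, str] = {}
    for p in profiles:
        reason = rejection_reason(p, c)
        if reason is None:
            eligible.append(p)
        else:
            rejected[p.profile_id] = reason  # kept so it can be recorded later
    return eligible, rejected
```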

The score stage ranks only eligible candidates. A high-risk structured-output operation might weight quality and reliability. A read-only summary might weight latency and cost. The important part is that the scoring policy is versioned, signed, and pinned per environment.
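As a rough illustration, scoring can be a weighted sum whose weights come from a versioned policy object. The weighting scheme, the headroom normalization, and all names below are assumptions, not the ContextOS scoring function.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoringPolicy:
    policy_id: str     # versioned identifier, pinned per environment
    w_quality: float
    w_latency: float
    w_cost: float


@dataclass(frozen=True)
class Candidate:
    profile_id: str
    quality: float             # 0..1, as in a profile's score hints
    latency_p95_ms: float
    estimated_cost_usd: float


def score(c: Candidate, policy: ScoringPolicy,
          latency_slo_ms: float, max_cost_usd: float) -> float:
    # Headroom against the SLO and the budget, clamped at 0 (higher is better).
    latency_headroom = max(0.0, 1.0 - c.latency_p95_ms / latency_slo_ms)
    cost_headroom = max(0.0, 1.0 - c.estimated_cost_usd / max_cost_usd)
    return (policy.w_quality * c.quality
            + policy.w_latency * latency_headroom
            + policy.w_cost * cost_headroom)


def rank(candidates: List[Candidate], policy: ScoringPolicy,
         latency_slo_ms: float, max_cost_usd: float) -> List[Candidate]:
    # Only already-eligible candidates should reach this function.
    return sorted(
        candidates,
        key=lambda c: score(c, policy, latency_slo_ms, max_cost_usd),
        reverse=True,
    )
```

The same two candidates can rank differently under a quality-heavy policy and a latency-and-cost-heavy policy, which is why the policy id belongs in the routing record.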

The select stage uses deterministic tie-breaks: policy priority, quality band, lower latency, lower estimated cost, stable provider priority. Do not let timestamp ordering or random selection decide high-risk routes unless you are explicitly running a canary.
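The tie-break order above maps naturally onto a tuple sort key. In this sketch the field names are hypothetical, and the input is assumed to be the set of candidates tied at the top score. The profile id is appended as a final key so the result never depends on input order.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Scored:
    profile_id: str
    policy_priority: int        # higher wins
    quality_band: int           # higher wins
    latency_p95_ms: int         # lower wins
    estimated_cost_usd: float   # lower wins
    provider_priority: int      # lower wins; a fixed, configured ordering


def tie_break_key(c: Scored) -> Tuple:
    # Negate the "higher wins" fields so that min() prefers them.
    return (
        -c.policy_priority,
        -c.quality_band,
        c.latency_p95_ms,
        c.estimated_cost_usd,
        c.provider_priority,
        c.profile_id,  # last resort: a stable, order-independent key
    )


def select(candidates: Iterable[Scored]) -> Scored:
    # Raises ValueError on an empty input: no eligible candidate means no route.
    return min(candidates, key=tie_break_key)
```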

The invoke stage applies timeout, retry, streaming, and repair rules from the gateway policy, not from random call sites.
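A simplified sketch of policy-driven invocation is shown below. It covers timeout and retry only; streaming and repair are left out. `InvokePolicy` and the `send` callable are assumptions: `send` stands in for a provider adapter that accepts a timeout in seconds and raises `TimeoutError` when it is exceeded.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class InvokePolicy:
    timeout_s: float
    max_retries: int
    backoff_s: float = 0.0   # linear backoff step between attempts


def invoke(send: Callable[[float], str], policy: InvokePolicy) -> str:
    last_error: Optional[BaseException] = None
    for attempt in range(policy.max_retries + 1):
        try:
            # The timeout comes from gateway policy, never from the caller.
            return send(policy.timeout_s)
        except TimeoutError as exc:
            last_error = exc
            if attempt < policy.max_retries:
                time.sleep(policy.backoff_s * (attempt + 1))
    raise TimeoutError(
        f"gave up after {policy.max_retries + 1} attempts"
    ) from last_error
```

Call sites pass a request and receive a result or an error; they never choose their own retry count.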

The record stage emits the evidence the operator will need later: selected profile, rejected candidates, fallback index, token usage, provider response model, applied rule ids, and trace span.
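The evidence listed above could be captured in a single immutable record. The field names here are a guess at a reasonable shape, not the actual RoutingDecision schema. Serializing with sorted keys keeps the output byte-stable for diffing and replay.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RoutingDecision:
    trace_id: str
    routing_policy_id: str
    model_profile_id: str
    provider_response_model: str         # the model the provider reports it used
    fallback_index: int                  # 0 means the first choice was used
    applied_rule_ids: List[str]
    rejected_candidates: Dict[str, str]  # profile id -> rejection reason
    input_tokens: int
    output_tokens: int


def to_record(decision: RoutingDecision) -> str:
    # sort_keys gives a deterministic serialization of the same decision.
    return json.dumps(asdict(decision), sort_keys=True)
```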

Model profiles, not hard-coded model names

Hard-coded model names make migrations expensive. A profile is the stable ContextOS artifact:

{
  "model_profile_id": "profile_reasoning_standard_v7",
  "provider_adapter": "provider_a",
  "deployment_ref": "secret://model-deployments/provider-a/reasoning-standard",
  "capabilities": {
    "structured_output": true,
    "tool_calling": true,
    "long_context": true,
    "streaming": true
  },
  "policy": {
    "regions": ["us", "eu"],
    "data_classes_allowed": ["PUBLIC", "INTERNAL"],
    "eligible_risk_classes": ["read_only", "local_write", "network", "delegated", "destructive"]
  },
  "score_hints": {
    "quality": 0.86,
    "latency_p95_ms": 1900,
    "cost_per_1k_input_usd": 0.002
  },
  "status": "healthy"
}

The profile lets product teams ask for a capability class instead of a vendor SKU. Platform teams can change the underlying deployment, run canaries, retire unsafe profiles, and preserve a stable contract for the runtime.
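To illustrate that indirection, here is a toy registry in which callers resolve a capability class and the platform team decides which profile is active. `ProfileRegistry`, its methods, and the notion of a capability-class key are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModelProfile:
    model_profile_id: str
    provider_adapter: str
    deployment_ref: str


class ProfileRegistry:
    def __init__(self) -> None:
        self._profiles: Dict[str, ModelProfile] = {}
        self._active: Dict[str, str] = {}  # capability class -> profile id

    def register(self, profile: ModelProfile) -> None:
        self._profiles[profile.model_profile_id] = profile

    def activate(self, capability_class: str, model_profile_id: str) -> None:
        # Only registered profiles can be activated.
        if model_profile_id not in self._profiles:
            raise KeyError(model_profile_id)
        self._active[capability_class] = model_profile_id

    def resolve(self, capability_class: str) -> ModelProfile:
        # Callers name a capability class, never a vendor SKU.
        return self._profiles[self._active[capability_class]]
```

Swapping the deployment behind a capability class is then a registry change; the code that calls `resolve` stays the same.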

Observability is part of the contract

The gateway should emit ordinary distributed traces plus GenAI-specific attributes. W3C Trace Context gives the portable traceparent / tracestate propagation layer. OpenTelemetry GenAI semantic conventions define useful model-call attributes such as operation name, provider, request model, response model, token usage, streaming state, and errors.

ContextOS adds the domain attributes those specs do not own:

| Attribute | Why it matters |
| --- | --- |
| contextos.intent_id | Aggregates quality and cost by business job. |
| contextos.risk_class | Shows whether higher-risk calls used approved profiles. |
| contextos.compiled_context_ref | Connects model output to the exact context the model saw. |
| contextos.routing_policy_id | Explains the active route policy. |
| contextos.model_profile_id | Separates stable profile from provider deployment. |
| contextos.fallback_index | Makes fallback behavior visible. |

Avoid logging full prompts and outputs by default. OpenTelemetry explicitly treats full content as sensitive and often expensive. Store content through the same evidence and replay controls you use for other run artifacts.
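A small sketch of the attribute set for one model-call span follows. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as I understand them; those conventions are still evolving, so check the names against the version you pin. The `span_attributes` helper is hypothetical, and it deliberately has no parameter for prompt or completion text.

```python
from typing import Dict, Union

AttrValue = Union[str, int]


def span_attributes(
    *,
    operation_name: str,
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    intent_id: str,
    risk_class: str,
    compiled_context_ref: str,
    routing_policy_id: str,
    model_profile_id: str,
    fallback_index: int,
) -> Dict[str, AttrValue]:
    return {
        # GenAI semantic-convention attributes (verify against your pinned version).
        "gen_ai.operation.name": operation_name,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Domain attributes from the table above.
        "contextos.intent_id": intent_id,
        "contextos.risk_class": risk_class,
        "contextos.compiled_context_ref": compiled_context_ref,
        "contextos.routing_policy_id": routing_policy_id,
        "contextos.model_profile_id": model_profile_id,
        "contextos.fallback_index": fallback_index,
    }
```

The returned dictionary holds only strings and integers, so it can be attached to a span with whatever tracing SDK is in use.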

Failure modes the gateway should block

| Failure | Correct response |
| --- | --- |
| Provider timeout | Retry only if policy allows, then use the next eligible fallback. |
| Rate limit | Move within the same policy band and record the fallback. |
| Budget breach | Downgrade only if the quality floor still holds; otherwise return BUDGET_EXCEEDED. |
| Structured output invalid | Repair only when allowed; otherwise return SCHEMA_INVALID. |
| Residency mismatch | Refuse; never fall back across residency. |
| Profile not eligible for risk class | Refuse; do not let the caller override with a model name. |

The gateway should fail closed. A model call that cannot satisfy the route contract is a runtime error, not a creative opportunity.
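The failure table can be read as a small decision function. In the sketch below, BUDGET_EXCEEDED and SCHEMA_INVALID come from the table; the other action names, the `Failure` enum, and the `RouteState` fields are assumptions. Every default is the restrictive one, so an unconfigured route refuses.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    PROVIDER_TIMEOUT = "provider_timeout"
    RATE_LIMIT = "rate_limit"
    BUDGET_BREACH = "budget_breach"
    SCHEMA_INVALID = "schema_invalid"
    RESIDENCY_MISMATCH = "residency_mismatch"
    RISK_INELIGIBLE = "risk_ineligible"


@dataclass(frozen=True)
class RouteState:
    retries_left: int = 0
    fallback_available: bool = False   # next eligible profile in the same policy band
    quality_floor_holds: bool = False  # a cheaper profile still meets the quality floor
    repair_allowed: bool = False


def respond(failure: Failure, state: RouteState) -> str:
    if failure is Failure.PROVIDER_TIMEOUT:
        if state.retries_left > 0:
            return "RETRY"
        return "FALLBACK" if state.fallback_available else "REFUSE"
    if failure is Failure.RATE_LIMIT:
        return "FALLBACK" if state.fallback_available else "REFUSE"
    if failure is Failure.BUDGET_BREACH:
        return "DOWNGRADE" if state.quality_floor_holds else "BUDGET_EXCEEDED"
    if failure is Failure.SCHEMA_INVALID:
        return "REPAIR" if state.repair_allowed else "SCHEMA_INVALID"
    # Residency and risk-class failures, and anything unrecognized, fail closed.
    return "REFUSE"
```

Note that no combination of state flags turns a residency mismatch or a risk-class failure into anything other than a refusal.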

What changes operationally

First, cost becomes attributable. Instead of “the agent got expensive,” the team can see support.refund using profile_reasoning_premium_v3 after a policy rollout.

Second, model migration becomes safer. You can run profile candidates in shadow, compare scorecards, and roll forward by route policy rather than editing Planner code.

Third, incident response gets shorter. When a run fails, the DecisionRecord can point to the exact model profile, provider adapter, route rule, token budget, and fallback decision involved.

Fourth, platform governance moves out of prompt text. The model can be brilliant or mediocre; either way, model selection remains a deterministic runtime function.
