AI Gateway and LLM Router
Reference design for the model-side gateway that normalizes provider APIs, enforces model-call policy, and routes requests under ContextOS budgets.
The AI Gateway is the model-side boundary for ContextOS. It sits between the Cognitive Core and model providers. It normalizes model invocation, applies policy, enforces budget, records telemetry, and delegates model selection to the LLM Router.
It is not the Tool Gateway. The Tool Gateway controls side effects. The AI Gateway controls model calls.
Design stance
The gateway exists because production systems should not scatter provider-specific model calls throughout planners, evaluators, adapters, and batch jobs. Provider APIs evolve. Model catalogs change. Pricing, context windows, tool support, data residency, and safety features vary. The rest of the runtime should depend on one ContextOS contract, not on every provider contract directly.
The router is not magic. It is a governed selection function over explicit inputs: task shape, risk, budget, latency, region, required capabilities, provider health, and evaluator feedback.
Responsibilities
| Layer | Owns | Does not own |
|---|---|---|
| AI Gateway | auth, request validation, redaction, budget checks, provider adapters, streaming normalization, telemetry, audit | business planning, tool execution, memory promotion |
| LLM Router | candidate filtering, route scoring, fallback order, route explanation, canary selection | policy authoring, provider billing terms, final business decision |
| Provider adapter | provider-specific request and response mapping | ContextOS policy, decision semantics, tool authorization |
Placement
The important boundary is authority: a model provider can produce text, structured output, or a model-native tool-call proposal. It cannot execute ContextOS tools directly. Any external effect still returns to the Decision plane and goes through the Tool Gateway.
Northbound contract
The Cognitive Core calls the gateway with a provider-neutral envelope.
{
  "request_id": "req_01j9",
  "run_id": "run_88",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant_id": "tenant_acme_prod",
  "intent_id": "support.refund",
  "risk_class": "destructive",
  "operation": "judge.plan",
  "compiled_context_ref": "compiled_ctx_41",
  "input": {
    "instructions": "Produce a plan that can be verified by the Critic.",
    "messages": [
      {
        "role": "user",
        "content": "Refund order ord_881"
      }
    ]
  },
  "requirements": {
    "structured_output": true,
    "tool_calling": false,
    "vision": false,
    "max_input_tokens": 24000,
    "max_output_tokens": 2000,
    "latency_slo_ms": 2500,
    "max_cost_usd": 0.08,
    "data_residency": "us"
  },
  "routing_hints": {
    "quality_tier": "standard",
    "fallback_allowed": true,
    "canary_allowed": false
  }
}
The response carries both model output and the route decision that produced it.
{
  "request_id": "req_01j9",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "status": "ok",
  "output": {
    "type": "json",
    "value": {
      "plan_id": "plan_refund_01",
      "steps": []
    }
  },
  "usage": {
    "input_tokens": 18340,
    "output_tokens": 612,
    "estimated_cost_usd": 0.041
  },
  "route": {
    "routing_decision_id": "route_01j9",
    "model_profile_id": "profile_reasoning_standard_v7",
    "provider_adapter": "provider_a",
    "routing_rule_ids": ["ROUTE_STANDARD_REASONING_US"],
    "fallback_index": 0,
    "explanation": "Selected profile supports structured output within residency, latency, and budget constraints."
  }
}
Model profile registry
The router selects from model profiles, not hard-coded model names. A profile can map to one or more provider deployments behind the adapter layer.
{
  "model_profile_id": "profile_reasoning_standard_v7",
  "provider_adapter": "provider_a",
  "deployment_ref": "secret://model-deployments/provider-a/reasoning-standard",
  "capabilities": {
    "structured_output": true,
    "tool_calling": true,
    "vision": false,
    "long_context": true,
    "streaming": true
  },
  "limits": {
    "max_input_tokens": 128000,
    "max_output_tokens": 8192,
    "rpm": 300,
    "tpm": 300000
  },
  "policy": {
    "regions": ["us", "eu"],
    "data_classes_allowed": ["PUBLIC", "INTERNAL"],
    "stores_provider_state": false,
    "eligible_risk_classes": ["read_only", "local_write", "network", "delegated", "destructive"]
  },
  "score_hints": {
    "quality": 0.86,
    "latency_p95_ms": 1900,
    "cost_per_1k_input_usd": 0.002,
    "cost_per_1k_output_usd": 0.008
  },
  "status": "healthy"
}
Routing algorithm
The router runs in five deterministic stages.
1. Filter candidates by data residency, data classification, required capabilities, active policy, health, quota, and context size.
2. Score remaining candidates against quality, latency, cost, reliability, and evaluator feedback for the active intent.
3. Select the highest-ranked profile with deterministic tie-breaks: policy priority, quality band, lower p95 latency, lower estimated cost, stable provider priority.
4. Invoke through the selected provider adapter with timeout, retry, and streaming settings from the Run Context.
5. Record route decision, usage, provider response model, fallback index, applied rules, and trace span.
This makes routing explainable. If the router chooses a cheaper model, the decision record must show that the cheaper profile remained inside the quality band for the current task.
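The filter, score, and select stages can be sketched as a small pure function. This is a minimal illustration, not the ContextOS router: the `Profile` and `Requirements` types, the score normalization, and the simplified tie-break order (score, then p95 latency, then worst-case cost, then profile ID) are assumptions, and reliability, evaluator feedback, and policy priority are omitted. The values for `profile_general_fast_v9` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    profile_id: str
    regions: frozenset
    capabilities: frozenset
    max_input_tokens: int
    quality: float                # 0..1, higher is better
    latency_p95_ms: int
    cost_per_1k_input_usd: float
    cost_per_1k_output_usd: float
    healthy: bool = True


@dataclass(frozen=True)
class Requirements:
    data_residency: str
    capabilities: frozenset
    max_input_tokens: int
    max_output_tokens: int
    latency_slo_ms: int
    max_cost_usd: float


def worst_case_cost(p, req):
    """Upper bound on cost if the call uses every allowed token."""
    return (req.max_input_tokens / 1000 * p.cost_per_1k_input_usd
            + req.max_output_tokens / 1000 * p.cost_per_1k_output_usd)


def reject_reason(p, req):
    """Stage 1: return why a profile is ineligible, or None if it passes."""
    if not p.healthy:
        return "unhealthy"
    if req.data_residency not in p.regions:
        return "residency_mismatch"
    missing = sorted(req.capabilities - p.capabilities)
    if missing:
        return "missing_" + missing[0]
    if req.max_input_tokens > p.max_input_tokens:
        return "context_too_small"
    if worst_case_cost(p, req) > req.max_cost_usd:
        return "over_budget"
    return None


def score(p, req, weights):
    """Stage 2: weighted sum; latency and cost are scaled against the request caps."""
    latency = max(0.0, 1.0 - p.latency_p95_ms / req.latency_slo_ms)
    cost = max(0.0, 1.0 - worst_case_cost(p, req) / req.max_cost_usd)
    return (weights.get("quality", 0.0) * p.quality
            + weights.get("latency", 0.0) * latency
            + weights.get("cost", 0.0) * cost)


def select(profiles, req, weights):
    """Stage 3: pick the top profile and keep the rejected ones for the audit record."""
    eligible, rejected = [], []
    for p in profiles:
        reason = reject_reason(p, req)
        if reason is None:
            eligible.append(p)
        else:
            rejected.append({"model_profile_id": p.profile_id, "reason": reason})
    ranked = sorted(
        eligible,
        key=lambda p: (-round(score(p, req, weights), 6), p.latency_p95_ms,
                       worst_case_cost(p, req), p.profile_id),
    )
    return (ranked[0] if ranked else None), rejected


req = Requirements("us", frozenset({"structured_output"}), 24000, 2000, 2500, 0.08)
standard = Profile("profile_reasoning_standard_v7", frozenset({"us", "eu"}),
                   frozenset({"structured_output", "tool_calling"}),
                   128000, 0.86, 1900, 0.002, 0.008)
fast = Profile("profile_general_fast_v9", frozenset({"us"}),
               frozenset({"tool_calling"}), 32000, 0.70, 600, 0.0005, 0.0015)
selected, rejected = select([fast, standard], req,
                            {"quality": 0.55, "latency": 0.15, "cost": 0.10})
```

With these inputs the fast profile is rejected for lacking structured output, and the standard profile is selected with a worst-case cost of about 0.064 USD against the 0.08 USD cap. Because the function is pure and the sort key is total, replaying the same inputs yields the same route, which is what makes the decision record reproducible.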
Routing policy
Routing policy is a Trust-plane config artifact. It should be versioned, signed, evaluated in CI, and pinned per environment.
policy_id: route.support.standard.v4
owner_role: platform_runtime
default_profile: profile_general_standard_v5
rules:
  - rule_id: ROUTE_HIGH_RISK_STRUCTURED
    priority: 100
    when:
      risk_class: destructive
      structured_output: true
    require:
      data_residency: request
      capabilities:
        structured_output: true
      max_fallbacks: 1
    score:
      quality: 0.55
      reliability: 0.20
      latency: 0.15
      cost: 0.10
    candidates:
      - profile_reasoning_standard_v7
      - profile_reasoning_premium_v3
  - rule_id: ROUTE_LOW_RISK_FAST
    priority: 40
    when:
      risk_class: read_only
    score:
      latency: 0.45
      cost: 0.35
      quality: 0.20
    candidates:
      - profile_general_fast_v9
      - profile_general_standard_v5
Fallback rules
Fallback is allowed only when it preserves the contract requested by the caller.
| Failure | Automatic response | Must record |
|---|---|---|
| Provider timeout | Retry once if idempotent, then use next eligible fallback | timeout, retry count, fallback index |
| Provider rate limit | Move to next profile in same policy band | provider error class, selected fallback |
| Budget would exceed cap | Downgrade only if quality floor still holds; otherwise return BUDGET_EXCEEDED | estimated cost and rejected candidates |
| Structured output validation fails | Retry with repair prompt if policy allows; otherwise return SCHEMA_INVALID | schema ref, validation error |
| Residency mismatch | Do not fall back across a residency boundary | denied region and required region |
Fallback must never broaden data exposure, remove required structured output, ignore risk class, or bypass policy.
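One way to enforce this is to re-check every fallback candidate against the caller's original contract before using it. The sketch below assumes hypothetical `Contract` and `FallbackProfile` types whose fields mirror the envelope and profile examples; the violation strings are illustrative, not ContextOS error codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    data_residency: str
    data_class: str
    risk_class: str
    required_capabilities: frozenset


@dataclass(frozen=True)
class FallbackProfile:
    profile_id: str
    regions: frozenset
    data_classes_allowed: frozenset
    eligible_risk_classes: frozenset
    capabilities: frozenset


def fallback_violations(contract, candidate):
    """List every way the candidate would weaken the caller's contract."""
    violations = []
    if contract.data_residency not in candidate.regions:
        violations.append("residency_mismatch")
    if contract.data_class not in candidate.data_classes_allowed:
        violations.append("data_class_not_allowed")
    if contract.risk_class not in candidate.eligible_risk_classes:
        violations.append("risk_class_not_eligible")
    for capability in sorted(contract.required_capabilities - candidate.capabilities):
        violations.append("missing_" + capability)
    return violations


def next_fallback(contract, ordered_candidates):
    """Return (profile, fallback_index, denied); index 0 is reserved for the primary route."""
    denied = []
    for index, candidate in enumerate(ordered_candidates, start=1):
        violations = fallback_violations(contract, candidate)
        if not violations:
            return candidate, index, denied
        denied.append((candidate.profile_id, violations))
    return None, None, denied
```

Returning the denied candidates alongside the choice gives the gateway what it needs for the "Must record" column, for example the denied region when a candidate fails the residency check.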
Streaming contract
The gateway may expose streaming, but the stream is still a typed envelope.
| Event | Meaning |
|---|---|
| route.selected | Router selected a profile and adapter. |
| output.delta | Provider emitted a text or structured-output chunk. |
| usage.delta | Token or cost estimate changed. |
| provider.warning | Adapter observed a recoverable provider warning. |
| output.final | Final normalized output is ready. |
| error.final | Terminal error with ContextOS error code. |
Streaming consumers should not treat raw provider chunks as final business state. The Decision plane consumes the normalized final output.
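A consumer that follows this rule might look like the sketch below. The event names come from the table above; the envelope shape (`event` and `data` keys) and the payload fields `text`, `output`, and `code` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

KNOWN_EVENTS = {
    "route.selected", "output.delta", "usage.delta",
    "provider.warning", "output.final", "error.final",
}


@dataclass
class StreamResult:
    route: Optional[dict] = None
    preview: str = ""                       # concatenated deltas, for display only
    usage: dict = field(default_factory=dict)
    warnings: list = field(default_factory=list)
    final_output: Any = None                # set only by output.final
    error_code: Optional[str] = None        # set only by error.final


def consume(events):
    """Fold a typed event stream into one result; only terminal events finish it."""
    result = StreamResult()
    for event in events:
        kind = event["event"]
        data = event.get("data", {})
        if kind not in KNOWN_EVENTS:
            raise ValueError(f"unknown stream event: {kind}")
        if kind == "route.selected":
            result.route = data
        elif kind == "output.delta":
            result.preview += data["text"]
        elif kind == "usage.delta":
            result.usage.update(data)
        elif kind == "provider.warning":
            result.warnings.append(data)
        elif kind == "output.final":
            result.final_output = data["output"]
            return result
        else:  # error.final
            result.error_code = data["code"]
            return result
    # A stream that ends without a terminal event never yields business state.
    raise RuntimeError("stream ended without output.final or error.final")
```

The `preview` text is kept separate from `final_output` so that a caller cannot mistake partial provider chunks for the normalized result.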
Observability
Every gateway invocation emits OpenTelemetry spans and metrics tied to the parent trace_id.
Recommended span structure:
contextos.ai_gateway.invoke
  contextos.llm_router.select
  contextos.provider_adapter.invoke
Required attributes:
| Attribute | Why it matters |
|---|---|
| contextos.run_id | Correlates route decision to the agent run. |
| contextos.intent_id | Enables per-intent routing evaluation. |
| contextos.routing_decision_id | Links telemetry to audit record. |
| contextos.model_profile_id | Stable route target independent of provider model names. |
| contextos.provider_adapter | Provider boundary used for the call. |
| contextos.fallback_index | Shows whether the primary route held. |
| contextos.estimated_cost_usd | Makes budget enforcement auditable. |
Use OpenTelemetry GenAI semantic conventions where they fit, but pin the emitted convention version because those conventions are still evolving.
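As an illustration, the required attributes can be derived from the request and response envelopes before they are attached to a span. The helper below is a hypothetical sketch that uses plain dictionaries and no OpenTelemetry SDK; the function name and the choice to fail on missing values are assumptions.

```python
REQUIRED_ATTRIBUTES = (
    "contextos.run_id",
    "contextos.intent_id",
    "contextos.routing_decision_id",
    "contextos.model_profile_id",
    "contextos.provider_adapter",
    "contextos.fallback_index",
    "contextos.estimated_cost_usd",
)


def span_attributes(request, response):
    """Build the required span attributes from the two gateway envelopes."""
    route = response.get("route", {})
    usage = response.get("usage", {})
    attributes = {
        "contextos.run_id": request.get("run_id"),
        "contextos.intent_id": request.get("intent_id"),
        "contextos.routing_decision_id": route.get("routing_decision_id"),
        "contextos.model_profile_id": route.get("model_profile_id"),
        "contextos.provider_adapter": route.get("provider_adapter"),
        "contextos.fallback_index": route.get("fallback_index"),
        "contextos.estimated_cost_usd": usage.get("estimated_cost_usd"),
    }
    # Compare against None so a legitimate fallback_index of 0 is not treated as missing.
    missing = [name for name in REQUIRED_ATTRIBUTES if attributes[name] is None]
    if missing:
        raise ValueError("missing required span attributes: " + ", ".join(missing))
    return attributes
```

Failing fast on a missing attribute keeps incomplete telemetry from silently breaking the join between spans and the audit record.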
Audit record
The gateway persists a RoutingDecision record. It is not a substitute for the final DecisionRecord, but the final decision should reference it in lineage.
{
  "routing_decision_id": "route_01j9",
  "request_id": "req_01j9",
  "run_id": "run_88",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "policy_version": "route.support.standard.v4",
  "selected_profile": "profile_reasoning_standard_v7",
  "candidate_profiles": [
    "profile_reasoning_standard_v7",
    "profile_reasoning_premium_v3"
  ],
  "rejected_profiles": [
    {
      "model_profile_id": "profile_general_fast_v9",
      "reason": "missing_structured_output"
    }
  ],
  "usage": {
    "input_tokens": 18340,
    "output_tokens": 612,
    "estimated_cost_usd": 0.041
  },
  "fallback_index": 0,
  "created_at": "2026-05-09T10:30:00Z"
}
Security controls
The AI Gateway should enforce these controls before any provider call:
- Redact secrets and disallowed data classes from provider-bound payloads.
- Deny requests whose data residency cannot be satisfied by an eligible profile.
- Keep provider credentials inside the gateway boundary.
- Disable provider-side state unless the active profile and tenant policy explicitly permit it.
- Store prompts, completions, and chunks according to classification and retention policy.
- Refuse prompt-response caching for sensitive, tenant-private, or user-private payloads unless policy explicitly permits it.
- Treat provider-native tool calls as proposals, not executed actions.
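The first control, redaction plus a data-class check before any payload leaves the gateway, can be sketched as follows. The two secret patterns and the `[REDACTED]` marker are hypothetical placeholders; a real deployment would take its patterns and classifications from policy rather than hard-coding them.

```python
import re

# Hypothetical patterns for illustration only; real rules come from policy.
SECRET_PATTERNS = [
    re.compile(r"bearer\s+[a-z0-9._\-]+", re.IGNORECASE),
    re.compile(r"\bsk_[A-Za-z0-9]{8,}\b"),
]


def redact(text):
    """Replace anything matching a secret pattern with a fixed marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def prepare_provider_payload(messages, data_class, allowed_classes):
    """Deny disallowed data classes, then return redacted copies of the messages."""
    if data_class not in allowed_classes:
        raise PermissionError(f"data class {data_class} is not allowed for this profile")
    return [{**message, "content": redact(message["content"])} for message in messages]
```

The function returns new message dictionaries rather than mutating its input, so the unredacted originals can still be stored or discarded according to retention policy.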
Deployment model
Evaluation loop
Routing quality should be evaluated like any other ContextOS runtime behavior.
| Signal | Used for |
|---|---|
| Golden-set score by intent | Validate that routing changes do not reduce decision quality. |
| Fallback rate | Detect provider instability or bad primary routes. |
| Cost per accepted decision | Catch cost regressions that do not improve quality. |
| Structured-output validation rate | Detect profile or prompt incompatibility. |
| Latency by route | Keep quality improvements inside user-facing SLOs. |
| Human override rate | Identify routes that look cheap but create review burden. |
Routing policy changes should pass replay before promotion. Canary routing is acceptable only when the Trust plane can compare outcomes against the pinned baseline.
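A replay gate over these signals could be sketched as below. The `RunOutcome` record, the choice of which signals gate promotion, and the 10% cost tolerance are assumptions for illustration; golden-set scoring and latency checks are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunOutcome:
    fallback_index: int
    cost_usd: float
    schema_valid: bool
    accepted: bool
    human_override: bool


def summarize(outcomes):
    """Aggregate per-run outcomes into the routing signals from the table above."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    accepted = sum(o.accepted for o in outcomes)
    total_cost = sum(o.cost_usd for o in outcomes)
    return {
        "fallback_rate": sum(o.fallback_index > 0 for o in outcomes) / total,
        "schema_valid_rate": sum(o.schema_valid for o in outcomes) / total,
        "human_override_rate": sum(o.human_override for o in outcomes) / total,
        "cost_per_accepted_decision": (
            total_cost / accepted if accepted else float("inf")
        ),
    }


def passes_replay(baseline, candidate, max_cost_increase=0.10):
    """Promote only if the candidate is no worse on quality proxies and bounded on cost."""
    b, c = summarize(baseline), summarize(candidate)
    return (
        c["schema_valid_rate"] >= b["schema_valid_rate"]
        and c["human_override_rate"] <= b["human_override_rate"]
        and c["cost_per_accepted_decision"]
        <= b["cost_per_accepted_decision"] * (1 + max_cost_increase)
    )
```

Dividing cost by accepted decisions rather than by calls is what lets the gate catch a route that looks cheap per call but produces more rejected or overridden output.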
Acceptance checklist
Before shipping an AI Gateway or router into production:
- A single provider-neutral invocation envelope exists.
- Model profiles are versioned and separate from provider names in business logic.
- Routing policy is signed, pinned, and environment-scoped.
- Route decisions are recorded and joinable to Decision Records.
- Provider-side state is explicit and policy-controlled.
- Structured output validation failures are typed.
- Fallback cannot cross residency, data classification, or capability boundaries.
- OTEL spans and token/cost metrics are emitted for every call.
- Golden replay covers route changes before promotion.
Research basis
This design follows patterns from primary docs and keeps provider-specific details behind adapters:
- OpenAI Responses API migration guide: typed response items, function calling, structured outputs, state, and tool-capable model interactions.
- Amazon Bedrock intelligent prompt routing: routing for quality and cost, fallback models, route criteria, and traceable responses.
- Microsoft Foundry model router: deployable model-router pattern, routing modes, model subsets, data-zone boundaries, and versioned underlying model sets.
- OpenTelemetry GenAI semantic conventions: span, metric, and event naming guidance for generative AI systems.