AI Gateway and LLM Router

Reference design for the model-side gateway that normalizes provider APIs, enforces model-call policy, and routes requests under ContextOS budgets.

Reference Design
At a glance

The AI Gateway is the model-side boundary for ContextOS. It sits between the Cognitive Core and model providers. It normalizes model invocation, applies policy, enforces budget, records telemetry, and delegates model selection to the LLM Router.

It is not the Tool Gateway. The Tool Gateway controls side effects. The AI Gateway controls model calls.

Design stance

The gateway exists because production systems should not scatter provider-specific model calls throughout planners, evaluators, adapters, and batch jobs. Provider APIs evolve. Model catalogs change. Pricing, context windows, tool support, data residency, and safety features vary. The rest of the runtime should depend on one ContextOS contract, not on every provider contract directly.

The router is not magic. It is a governed selection function over explicit inputs: task shape, risk, budget, latency, region, required capabilities, provider health, and evaluator feedback.

Responsibilities

| Layer | Owns | Does not own |
| --- | --- | --- |
| AI Gateway | auth, request validation, redaction, budget checks, provider adapters, streaming normalization, telemetry, audit | business planning, tool execution, memory promotion |
| LLM Router | candidate filtering, route scoring, fallback order, route explanation, canary selection | policy authoring, provider billing terms, final business decision |
| Provider adapter | provider-specific request and response mapping | ContextOS policy, decision semantics, tool authorization |

Placement

[Figure: placement diagram. Nodes: Agent request, Conversation Manager, Context Pack Compiler, Planner / Critic / Executor, Judgment layer, AI Gateway, LLM Router, Provider adapters, Model providers, Tool Gateway, External systems, OTEL spans + metrics, RoutingDecision record. Edge label: invoke.]
The AI Gateway is model-side. Tool calls still route through the Tool Gateway.

The important boundary is authority: a model provider can produce text, structured output, or a model-native tool-call proposal. It cannot execute ContextOS tools directly. Any external effect still returns to the Decision plane and goes through the Tool Gateway.
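
The authority boundary can be illustrated with a minimal sketch. It assumes a hypothetical adapter payload shape (`tool_call`, `text`) and a hypothetical tool name; neither comes from a real provider API. The point is only that the gateway returns a proposal object and never invokes the tool.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ToolCallProposal:
    """A model's request to call a tool. It carries no authority to execute."""
    tool_name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def normalize_provider_output(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a hypothetical adapter payload to a provider-neutral output.

    Tool calls are surfaced as proposals; nothing is executed here.
    """
    if "tool_call" in raw:
        call = raw["tool_call"]
        proposal = ToolCallProposal(call["name"], dict(call.get("arguments", {})))
        return {"type": "tool_call_proposal", "value": proposal}
    return {"type": "text", "value": raw.get("text", "")}
```

In this sketch the Decision plane would receive the `tool_call_proposal` and decide whether to submit it to the Tool Gateway.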

Northbound contract

The Cognitive Core calls the gateway with a provider-neutral envelope.

{
  "request_id": "req_01j9",
  "run_id": "run_88",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant_id": "tenant_acme_prod",
  "intent_id": "support.refund",
  "risk_class": "destructive",
  "operation": "judge.plan",
  "compiled_context_ref": "compiled_ctx_41",
  "input": {
    "instructions": "Produce a plan that can be verified by the Critic.",
    "messages": [
      {
        "role": "user",
        "content": "Refund order ord_881"
      }
    ]
  },
  "requirements": {
    "structured_output": true,
    "tool_calling": false,
    "vision": false,
    "max_input_tokens": 24000,
    "max_output_tokens": 2000,
    "latency_slo_ms": 2500,
    "max_cost_usd": 0.08,
    "data_residency": "us"
  },
  "routing_hints": {
    "quality_tier": "standard",
    "fallback_allowed": true,
    "canary_allowed": false
  }
}
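
Before routing, a gateway would typically reject malformed envelopes. The sketch below shows one way to do that. The `validate_envelope` helper and its messages are hypothetical, and treating every listed field as required is an assumption; only the field names come from the example envelope.

```python
from typing import Any

# Top-level fields taken from the example envelope; treating all of them
# as required is an assumption of this sketch.
REQUIRED_FIELDS = (
    "request_id", "run_id", "trace_id", "tenant_id", "intent_id",
    "risk_class", "operation", "input", "requirements",
)


def _positive_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_envelope(envelope: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the envelope is valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in envelope]
    requirements = envelope.get("requirements", {})
    for key in ("max_input_tokens", "max_output_tokens"):
        if not _positive_int(requirements.get(key)):
            problems.append(f"requirements.{key} must be a positive integer")
    cost = requirements.get("max_cost_usd")
    if isinstance(cost, bool) or not isinstance(cost, (int, float)) or cost <= 0:
        problems.append("requirements.max_cost_usd must be a positive number")
    return problems
```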

The response carries both model output and the route decision that produced it.

{
  "request_id": "req_01j9",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "status": "ok",
  "output": {
    "type": "json",
    "value": {
      "plan_id": "plan_refund_01",
      "steps": []
    }
  },
  "usage": {
    "input_tokens": 18340,
    "output_tokens": 612,
    "estimated_cost_usd": 0.041
  },
  "route": {
    "routing_decision_id": "route_01j9",
    "model_profile_id": "profile_reasoning_standard_v7",
    "provider_adapter": "provider_a",
    "routing_rule_ids": ["ROUTE_STANDARD_REASONING_US"],
    "fallback_index": 0,
    "explanation": "Selected profile supports structured output within residency, latency, and budget constraints."
  }
}

Model profile registry

The router selects from model profiles, not hard-coded model names. A profile can map to one or more provider deployments behind the adapter layer.

{
  "model_profile_id": "profile_reasoning_standard_v7",
  "provider_adapter": "provider_a",
  "deployment_ref": "secret://model-deployments/provider-a/reasoning-standard",
  "capabilities": {
    "structured_output": true,
    "tool_calling": true,
    "vision": false,
    "long_context": true,
    "streaming": true
  },
  "limits": {
    "max_input_tokens": 128000,
    "max_output_tokens": 8192,
    "rpm": 300,
    "tpm": 300000
  },
  "policy": {
    "regions": ["us", "eu"],
    "data_classes_allowed": ["PUBLIC", "INTERNAL"],
    "stores_provider_state": false,
    "eligible_risk_classes": ["read_only", "local_write", "network", "delegated", "destructive"]
  },
  "score_hints": {
    "quality": 0.86,
    "latency_p95_ms": 1900,
    "cost_per_1k_input_usd": 0.002,
    "cost_per_1k_output_usd": 0.008
  },
  "status": "healthy"
}
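
The `score_hints` prices can feed a simple cost estimate and a worst-case budget check. This is a sketch, assuming linear per-token pricing; the function names are hypothetical, and the field names follow the profile and request examples.

```python
def estimate_cost_usd(score_hints: dict, input_tokens: int, output_tokens: int) -> float:
    """Linear estimate from per-1k-token prices (an assumption of this sketch)."""
    return (
        input_tokens / 1000 * score_hints["cost_per_1k_input_usd"]
        + output_tokens / 1000 * score_hints["cost_per_1k_output_usd"]
    )


def within_budget(score_hints: dict, requirements: dict) -> bool:
    """Worst-case check: price the caller's token ceilings against max_cost_usd."""
    worst_case = estimate_cost_usd(
        score_hints,
        requirements["max_input_tokens"],
        requirements["max_output_tokens"],
    )
    return worst_case <= requirements["max_cost_usd"]
```

With the example profile's prices and the usage from the response example (18,340 input and 612 output tokens), the estimate is about 0.0416 USD, in line with the 0.041 reported there. The worst case for the example request (24,000 input and 2,000 output tokens) is about 0.064 USD, under the 0.08 cap.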

Routing algorithm

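The router runs five stages in a fixed order. Stages 1 through 3 are deterministic: the same inputs and policy version always yield the same selection.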

  1. Filter candidates by data residency, data classification, required capabilities, active policy, health, quota, and context size.
  2. Score remaining candidates against quality, latency, cost, reliability, and evaluator feedback for the active intent.
  3. Select the highest-ranked profile with deterministic tie-breaks: policy priority, quality band, lower p95 latency, lower estimated cost, stable provider priority.
  4. Invoke through the selected provider adapter with timeout, retry, and streaming settings from the Run Context.
  5. Record route decision, usage, provider response model, fallback index, applied rules, and trace span.

This makes routing explainable. If the router chooses a cheaper model, the decision record must show that the cheaper profile remained inside the quality band for the current task.
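
Stages 1 through 3 can be sketched as a filter, a weighted score, and a sort with tie-breaks. The `Profile` type, the latency and cost normalizations, and the `provider_priority` field are assumptions of this sketch. Reliability, evaluator feedback, policy priority, and quality bands are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    profile_id: str
    regions: frozenset
    capabilities: frozenset
    max_input_tokens: int
    healthy: bool
    quality: float          # 0..1, as in score_hints.quality
    latency_p95_ms: int
    est_cost_usd: float     # estimated cost for this request
    provider_priority: int  # lower wins; hypothetical stable ordering


def filter_candidates(profiles, *, region, required_caps, input_tokens):
    """Stage 1: drop profiles that cannot satisfy the hard constraints."""
    return [
        p for p in profiles
        if p.healthy
        and region in p.regions
        and required_caps <= p.capabilities
        and input_tokens <= p.max_input_tokens
    ]


def score(p, *, weights, latency_slo_ms, max_cost_usd):
    """Stage 2: weighted sum; the normalizations are assumptions."""
    latency_score = max(0.0, 1.0 - p.latency_p95_ms / latency_slo_ms)
    cost_score = max(0.0, 1.0 - p.est_cost_usd / max_cost_usd)
    return (
        weights.get("quality", 0.0) * p.quality
        + weights.get("latency", 0.0) * latency_score
        + weights.get("cost", 0.0) * cost_score
    )


def select(candidates, **score_args):
    """Stage 3: highest score, then lower latency, lower cost, provider priority."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda p: (
            -round(score(p, **score_args), 6),
            p.latency_p95_ms,
            p.est_cost_usd,
            p.provider_priority,
            p.profile_id,  # final tie-break keeps the result stable
        ),
    )
```

Because every tie-break is an explicit sort key, the same candidates and weights always produce the same choice, which is what lets the decision record explain the route.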

Routing policy

Routing policy is a Trust-plane config artifact. It should be versioned, signed, evaluated in CI, and pinned per environment.

policy_id: route.support.standard.v4
owner_role: platform_runtime
default_profile: profile_general_standard_v5
rules:
  - rule_id: ROUTE_HIGH_RISK_STRUCTURED
    priority: 100
    when:
      risk_class: destructive
      structured_output: true
    require:
      data_residency: request
      capabilities:
        structured_output: true
      max_fallbacks: 1
    score:
      quality: 0.55
      reliability: 0.20
      latency: 0.15
      cost: 0.10
    candidates:
      - profile_reasoning_standard_v7
      - profile_reasoning_premium_v3
  - rule_id: ROUTE_LOW_RISK_FAST
    priority: 40
    when:
      risk_class: read_only
    score:
      latency: 0.45
      cost: 0.35
      quality: 0.20
    candidates:
      - profile_general_fast_v9
      - profile_general_standard_v5
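
One plausible way to evaluate such a policy is shown below, with the YAML mirrored as a Python dict. The matching semantics are assumptions: `when` keys are compared by equality against request facts, the highest-priority matching rule wins, and `default_profile` applies when nothing matches.

```python
POLICY = {
    "default_profile": "profile_general_standard_v5",
    "rules": [
        {
            "rule_id": "ROUTE_HIGH_RISK_STRUCTURED",
            "priority": 100,
            "when": {"risk_class": "destructive", "structured_output": True},
            "candidates": ["profile_reasoning_standard_v7", "profile_reasoning_premium_v3"],
        },
        {
            "rule_id": "ROUTE_LOW_RISK_FAST",
            "priority": 40,
            "when": {"risk_class": "read_only"},
            "candidates": ["profile_general_fast_v9", "profile_general_standard_v5"],
        },
    ],
}


def match_rule(policy, facts):
    """Return the highest-priority rule whose `when` clause matches, or None."""
    matching = [
        rule for rule in policy["rules"]
        if all(facts.get(key) == value for key, value in rule["when"].items())
    ]
    return max(matching, key=lambda rule: rule["priority"], default=None)


def candidates_for(policy, facts):
    """Return (rule_id, candidate profile ids) for the given request facts."""
    rule = match_rule(policy, facts)
    if rule is None:
        return None, [policy["default_profile"]]
    return rule["rule_id"], list(rule["candidates"])
```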

Fallback rules

Fallback is allowed only when it preserves the contract requested by the caller.

| Failure | Automatic response | Must record |
| --- | --- | --- |
| Provider timeout | Retry once if idempotent, then use next eligible fallback | timeout, retry count, fallback index |
| Provider rate limit | Move to next profile in same policy band | provider error class, selected fallback |
| Budget would exceed cap | Downgrade only if quality floor still holds; otherwise return BUDGET_EXCEEDED | estimated cost and rejected candidates |
| Structured output validation fails | Retry with repair prompt if policy allows; otherwise return SCHEMA_INVALID | schema ref, validation error |
| Residency mismatch | Do not fall back across residency boundary | denied region and required region |

Fallback must never broaden data exposure, remove required structured output, ignore risk class, or bypass policy.
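
A fallback guard along these lines could enforce that rule. This is a sketch: the `Candidate` type is hypothetical, and it assumes `fallback_index` 0 is the primary route, so `max_fallbacks` bounds the highest index that may be tried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    profile_id: str
    regions: frozenset
    capabilities: frozenset
    data_classes_allowed: frozenset
    eligible_risk_classes: frozenset


def preserves_contract(c, *, region, required_caps, data_class, risk_class):
    """True only if the candidate keeps every constraint the caller asked for."""
    return (
        region in c.regions
        and required_caps <= c.capabilities
        and data_class in c.data_classes_allowed
        and risk_class in c.eligible_risk_classes
    )


def next_fallback(ordered, current_index, max_fallbacks, **contract):
    """Return (index, candidate) for the next eligible fallback, or None."""
    for index in range(current_index + 1, len(ordered)):
        if index > max_fallbacks:
            break  # fallback budget from policy is exhausted
        if preserves_contract(ordered[index], **contract):
            return index, ordered[index]
    return None
```

Returning `None` rather than a weaker candidate forces the caller to surface a typed error instead of silently broadening the route.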

Streaming contract

The gateway may expose streaming, but the stream is still a typed envelope.

| Event | Meaning |
| --- | --- |
| route.selected | Router selected a profile and adapter. |
| output.delta | Provider emitted a text or structured-output chunk. |
| usage.delta | Token or cost estimate changed. |
| provider.warning | Adapter observed a recoverable provider warning. |
| output.final | Final normalized output is ready. |
| error.final | Terminal error with ContextOS error code. |

Streaming consumers should not treat raw provider chunks as final business state. The Decision plane consumes the normalized final output.
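
A consumer that respects this rule might fold the stream as follows. The event shape (`{"type": ..., "data": ...}`) is an assumption of this sketch; only the event names come from the table.

```python
from typing import Any, Iterable


def consume_stream(events: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Fold typed gateway events into one state dict.

    Deltas accumulate into a preview for display only; business logic
    should read `final` (or `error`), never the preview.
    """
    state: dict[str, Any] = {
        "route": None, "preview": "", "usage": None,
        "warnings": [], "final": None, "error": None,
    }
    for event in events:
        kind, data = event["type"], event.get("data")
        if kind == "route.selected":
            state["route"] = data
        elif kind == "output.delta":
            state["preview"] += data
        elif kind == "usage.delta":
            state["usage"] = data
        elif kind == "provider.warning":
            state["warnings"].append(data)
        elif kind == "output.final":
            state["final"] = data
            break  # terminal event
        elif kind == "error.final":
            state["error"] = data
            break  # terminal event
    return state
```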

Observability

Every gateway invocation emits OpenTelemetry spans and metrics tied to the parent trace_id.

Recommended span structure:

contextos.ai_gateway.invoke
  contextos.llm_router.select
  contextos.provider_adapter.invoke

Required attributes:

| Attribute | Why it matters |
| --- | --- |
| contextos.run_id | Correlates route decision to the agent run. |
| contextos.intent_id | Enables per-intent routing evaluation. |
| contextos.routing_decision_id | Links telemetry to audit record. |
| contextos.model_profile_id | Stable route target independent of provider model names. |
| contextos.provider_adapter | Provider boundary used for the call. |
| contextos.fallback_index | Shows whether the primary route held. |
| contextos.estimated_cost_usd | Makes budget enforcement auditable. |

Use OpenTelemetry GenAI semantic conventions where they fit, but pin the emitted convention version because those conventions are still evolving.
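
The required attributes can be assembled and checked before they are attached to a span. The sketch below uses only the standard library and stops short of calling a tracing SDK; the `span_attributes` helper is hypothetical, and the input shapes follow the request and response examples.

```python
from typing import Any

REQUIRED_ATTRIBUTES = (
    "contextos.run_id",
    "contextos.intent_id",
    "contextos.routing_decision_id",
    "contextos.model_profile_id",
    "contextos.provider_adapter",
    "contextos.fallback_index",
    "contextos.estimated_cost_usd",
)


def span_attributes(request: dict, route: dict, usage: dict) -> dict[str, Any]:
    """Build the attribute map for contextos.ai_gateway.invoke; fail if incomplete."""
    attrs = {
        "contextos.run_id": request.get("run_id"),
        "contextos.intent_id": request.get("intent_id"),
        "contextos.routing_decision_id": route.get("routing_decision_id"),
        "contextos.model_profile_id": route.get("model_profile_id"),
        "contextos.provider_adapter": route.get("provider_adapter"),
        "contextos.fallback_index": route.get("fallback_index"),
        "contextos.estimated_cost_usd": usage.get("estimated_cost_usd"),
    }
    # Compare against None so a fallback_index of 0 still counts as present.
    missing = [name for name in REQUIRED_ATTRIBUTES if attrs[name] is None]
    if missing:
        raise ValueError(f"missing required span attributes: {missing}")
    return attrs
```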

Audit record

The gateway persists a RoutingDecision record. It is not a substitute for the final DecisionRecord, but the final decision should reference it in lineage.

{
  "routing_decision_id": "route_01j9",
  "request_id": "req_01j9",
  "run_id": "run_88",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "policy_version": "route.support.standard.v4",
  "selected_profile": "profile_reasoning_standard_v7",
  "candidate_profiles": [
    "profile_reasoning_standard_v7",
    "profile_reasoning_premium_v3"
  ],
  "rejected_profiles": [
    {
      "model_profile_id": "profile_general_fast_v9",
      "reason": "missing_structured_output"
    }
  ],
  "usage": {
    "input_tokens": 18340,
    "output_tokens": 612,
    "estimated_cost_usd": 0.041
  },
  "fallback_index": 0,
  "created_at": "2026-05-09T10:30:00Z"
}
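
A record like this could be assembled by a small builder that also guards basic invariants. This is a sketch under assumptions: the `build_routing_decision` helper is hypothetical, and the rule that the selected profile must appear among the candidates is inferred from the example rather than stated by the design.

```python
from datetime import datetime, timezone
from typing import Optional


def build_routing_decision(*, routing_decision_id: str, request: dict,
                           policy_version: str, selected_profile: str,
                           candidate_profiles: list, rejected_profiles: list,
                           usage: dict, fallback_index: int,
                           now: Optional[datetime] = None) -> dict:
    """Assemble a RoutingDecision record with a UTC timestamp."""
    if selected_profile not in candidate_profiles:
        raise ValueError("selected profile must be one of the candidates")
    # Expects a timezone-aware datetime; defaults to the current UTC time.
    created = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "routing_decision_id": routing_decision_id,
        "request_id": request["request_id"],
        "run_id": request["run_id"],
        "trace_id": request["trace_id"],
        "policy_version": policy_version,
        "selected_profile": selected_profile,
        "candidate_profiles": list(candidate_profiles),
        "rejected_profiles": list(rejected_profiles),
        "usage": dict(usage),
        "fallback_index": fallback_index,
        "created_at": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```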

Security controls

The AI Gateway should enforce these controls before any provider call:

  • Redact secrets and disallowed data classes from provider-bound payloads.
  • Deny requests whose data residency cannot be satisfied by an eligible profile.
  • Keep provider credentials inside the gateway boundary.
  • Disable provider-side state unless the active profile and tenant policy explicitly permit it.
  • Store prompts, completions, and chunks according to classification and retention policy.
  • Refuse prompt-response caching for sensitive, tenant-private, or user-private payloads unless policy explicitly permits it.
  • Treat provider-native tool calls as proposals, not executed actions.
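
The first control can be sketched as a pre-flight step. The regular expressions below are illustrative placeholders, not a complete secret detector, and the `prepare_payload` helper is hypothetical. The data class names follow the `data_classes_allowed` field in the model profile example.

```python
import re

# Illustrative patterns only; a production gateway would rely on a vetted
# detector and policy-driven classification.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|password|token)\s*[:=]\s*\S+"),
    re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
]


def redact(text: str) -> str:
    """Replace anything that looks like a secret with a fixed marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def prepare_payload(messages, data_class, allowed_classes):
    """Deny disallowed data classes, then redact message content."""
    if data_class not in allowed_classes:
        raise PermissionError(f"data class {data_class!r} not allowed for this profile")
    return [{**message, "content": redact(message["content"])} for message in messages]
```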

Deployment model

[Figure: deployment diagram. Nodes: Gateway API pods, Validation + redaction, LLM Router, Routing policy registry, Model profile registry, Provider health + quota, Provider adapter A, Provider adapter B, Provider adapter C, Provider deployment (three), Telemetry pipeline, RoutingDecision store.]
Gateway deployment: policy and model-profile registries feed the router; provider adapters keep provider-specific behavior out of the Cognitive Core.

Evaluation loop

Routing quality should be evaluated like any other ContextOS runtime behavior.

| Signal | Used for |
| --- | --- |
| Golden-set score by intent | Validate that routing changes do not reduce decision quality. |
| Fallback rate | Detect provider instability or bad primary routes. |
| Cost per accepted decision | Catch cost regressions that do not improve quality. |
| Structured-output validation rate | Detect profile or prompt incompatibility. |
| Latency by route | Keep quality improvements inside user-facing SLOs. |
| Human override rate | Identify routes that look cheap but create review burden. |

Routing policy changes should pass replay before promotion. Canary routing is acceptable only when the Trust plane can compare outcomes against the pinned baseline.
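
Several of these signals reduce to simple aggregates over routing records, and the replay gate to a comparison against the pinned baseline. In the sketch below, the `accepted` and `overridden` fields and the `tolerance` parameter are hypothetical; `fallback_index` and `estimated_cost_usd` follow the audit record.

```python
def routing_metrics(records: list) -> dict:
    """Aggregate fallback rate, cost per accepted decision, and override rate."""
    total = len(records)
    if total == 0:
        return {"fallback_rate": None, "cost_per_accepted_decision": None,
                "human_override_rate": None}
    accepted = sum(1 for r in records if r["accepted"])
    total_cost = sum(r["estimated_cost_usd"] for r in records)
    return {
        "fallback_rate": sum(1 for r in records if r["fallback_index"] > 0) / total,
        # Cost of all calls divided by accepted decisions, so rejected work counts.
        "cost_per_accepted_decision": total_cost / accepted if accepted else None,
        "human_override_rate": sum(1 for r in records if r["overridden"]) / total,
    }


def passes_replay(baseline_score: float, candidate_score: float,
                  tolerance: float = 0.0) -> bool:
    """Gate promotion: the candidate policy must not score below the baseline."""
    return candidate_score >= baseline_score - tolerance
```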

Acceptance checklist

Before shipping an AI Gateway or router into production:

  • A single provider-neutral invocation envelope exists.
  • Model profiles are versioned and separate from provider names in business logic.
  • Routing policy is signed, pinned, and environment-scoped.
  • Route decisions are recorded and joinable to Decision Records.
  • Provider-side state is explicit and policy-controlled.
  • Structured output validation failures are typed.
  • Fallback cannot cross residency, data classification, or capability boundaries.
  • OTEL spans and token/cost metrics are emitted for every call.
  • Golden replay covers route changes before promotion.

Research basis

This design follows patterns from primary docs and keeps provider-specific details behind adapters: