AI Gateway and LLM Router

Reference design for the model-side gateway that normalizes provider APIs, enforces model-call policy, and routes requests under ContextOS budgets.

Reference Design. Last reviewed: 2026-05-09

At a glance

The AI Gateway is the model-side boundary for ContextOS. It sits between the Cognitive Core and model providers. It normalizes model invocation, applies policy, enforces budget, records telemetry, and delegates model selection to the LLM Router.

It is not the Tool Gateway. The Tool Gateway controls side effects. The AI Gateway controls model calls.

Design stance

The gateway exists because production systems should not scatter provider-specific model calls throughout planners, evaluators, adapters, and batch jobs. Provider APIs evolve. Model catalogs change. Pricing, context windows, tool support, data residency, and safety features vary. The rest of the runtime should depend on one ContextOS contract, not on every provider contract directly.

The router is not magic. It is a governed selection function over explicit inputs: task shape, risk, budget, latency, region, required capabilities, provider health, and evaluator feedback.

Responsibilities

Layer	Owns	Does not own
AI Gateway	auth, request validation, redaction, budget checks, provider adapters, streaming normalization, telemetry, audit	business planning, tool execution, memory promotion
LLM Router	candidate filtering, route scoring, fallback order, route explanation, canary selection	policy authoring, provider billing terms, final business decision
Provider adapter	provider-specific request and response mapping	ContextOS policy, decision semantics, tool authorization

Placement

The AI Gateway is model-side. Tool calls still route through the Tool Gateway.

The important boundary is authority: a model provider can produce text, structured output, or a model-native tool-call proposal. It cannot execute ContextOS tools directly. Any external effect still returns to the Decision plane and goes through the Tool Gateway.

Northbound contract

The Cognitive Core calls the gateway with a provider-neutral envelope.

{
  "request_id": "req_01j9",
  "run_id": "run_88",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant_id": "tenant_acme_prod",
  "intent_id": "support.refund",
  "risk_class": "destructive",
  "operation": "judge.plan",
  "compiled_context_ref": "compiled_ctx_41",
  "input": {
    "instructions": "Produce a plan that can be verified by the Critic.",
    "messages": [
      {
        "role": "user",
        "content": "Refund order ord_881"
      }
    ]
  },
  "requirements": {
    "structured_output": true,
    "tool_calling": false,
    "vision": false,
    "max_input_tokens": 24000,
    "max_output_tokens": 2000,
    "latency_slo_ms": 2500,
    "max_cost_usd": 0.08,
    "data_residency": "us"
  },
  "routing_hints": {
    "quality_tier": "standard",
    "fallback_allowed": true,
    "canary_allowed": false
  }
}
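A gateway would normally reject a malformed envelope before any routing work starts. The sketch below shows one way to express that check in Python. The required-field list and the numeric rules are assumptions inferred from the example above, not a published ContextOS schema, and `validate_envelope` is a hypothetical helper name.

```python
# Hypothetical validation sketch. Field names come from the example envelope;
# the required-field list and the numeric rules are assumptions.
REQUIRED_FIELDS = (
    "request_id", "run_id", "trace_id", "tenant_id", "intent_id",
    "risk_class", "operation", "input", "requirements",
)
POSITIVE_INT_REQUIREMENTS = ("max_input_tokens", "max_output_tokens", "latency_slo_ms")


def _is_positive_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_envelope(envelope: dict) -> list[str]:
    """Return validation problems; an empty list means the envelope passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in envelope]
    if problems:
        return problems  # the checks below need these fields to exist

    requirements = envelope["requirements"]
    for key in POSITIVE_INT_REQUIREMENTS:
        if not _is_positive_int(requirements.get(key)):
            problems.append(f"requirements.{key} must be a positive integer")

    max_cost = requirements.get("max_cost_usd")
    if isinstance(max_cost, bool) or not isinstance(max_cost, (int, float)) or max_cost <= 0:
        problems.append("requirements.max_cost_usd must be a positive number")

    if not requirements.get("data_residency"):
        problems.append("requirements.data_residency is required")
    return problems
```

Returning a list of problems rather than raising on the first one lets the gateway report every defect in a single typed error.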

The response carries both model output and the route decision that produced it.

{
  "request_id": "req_01j9",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "status": "ok",
  "output": {
    "type": "json",
    "value": {
      "plan_id": "plan_refund_01",
      "steps": []
    }
  },
  "usage": {
    "input_tokens": 18340,
    "output_tokens": 612,
    "estimated_cost_usd": 0.041
  },
  "route": {
    "routing_decision_id": "route_01j9",
    "model_profile_id": "profile_reasoning_standard_v7",
    "provider_adapter": "provider_a",
    "routing_rule_ids": ["ROUTE_HIGH_RISK_STRUCTURED"],
    "fallback_index": 0,
    "explanation": "Selected profile supports structured output within residency, latency, and budget constraints."
  }
}

Model profile registry

The router selects from model profiles, not hard-coded model names. A profile can map to one or more provider deployments behind the adapter layer.

{
  "model_profile_id": "profile_reasoning_standard_v7",
  "provider_adapter": "provider_a",
  "deployment_ref": "secret://model-deployments/provider-a/reasoning-standard",
  "capabilities": {
    "structured_output": true,
    "tool_calling": true,
    "vision": false,
    "long_context": true,
    "streaming": true
  },
  "limits": {
    "max_input_tokens": 128000,
    "max_output_tokens": 8192,
    "rpm": 300,
    "tpm": 300000
  },
  "policy": {
    "regions": ["us", "eu"],
    "data_classes_allowed": ["PUBLIC", "INTERNAL"],
    "stores_provider_state": false,
    "eligible_risk_classes": ["read_only", "local_write", "network", "delegated", "destructive"]
  },
  "score_hints": {
    "quality": 0.86,
    "latency_p95_ms": 1900,
    "cost_per_1k_input_usd": 0.002,
    "cost_per_1k_output_usd": 0.008
  },
  "status": "healthy"
}
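To illustrate how a profile like this could be checked against a request, the sketch below compares capabilities, residency, risk class, token limits, and a worst-case cost bound. The dictionary shapes follow the examples in this document. The function names and the rejection reason strings are hypothetical, and the worst-case cost rule (full input and output budgets at the profile's hinted rates) is an assumption.

```python
# Hypothetical eligibility check; shapes follow the profile and request examples,
# reason strings and the worst-case cost rule are illustrative assumptions.
def worst_case_cost_usd(profile: dict, requirements: dict) -> float:
    """Upper-bound cost if the call uses its full input and output token budgets."""
    hints = profile["score_hints"]
    return (
        requirements["max_input_tokens"] / 1000 * hints["cost_per_1k_input_usd"]
        + requirements["max_output_tokens"] / 1000 * hints["cost_per_1k_output_usd"]
    )


def rejection_reasons(profile: dict, request: dict) -> list[str]:
    """Return why a profile cannot serve the request; an empty list means eligible."""
    req = request["requirements"]
    reasons = []
    if profile["status"] != "healthy":
        reasons.append("unhealthy")
    for capability in ("structured_output", "tool_calling", "vision"):
        if req.get(capability) and not profile["capabilities"].get(capability):
            reasons.append(f"missing_{capability}")
    if req["data_residency"] not in profile["policy"]["regions"]:
        reasons.append("residency_mismatch")
    if request["risk_class"] not in profile["policy"]["eligible_risk_classes"]:
        reasons.append("risk_class_not_eligible")
    if req["max_input_tokens"] > profile["limits"]["max_input_tokens"]:
        reasons.append("input_exceeds_context")
    if req["max_output_tokens"] > profile["limits"]["max_output_tokens"]:
        reasons.append("output_exceeds_limit")
    if worst_case_cost_usd(profile, req) > req["max_cost_usd"]:
        reasons.append("over_budget")
    return reasons
```

With the request and profile shown in this document, the worst-case bound is 24 × 0.002 + 2 × 0.008 = 0.064 USD, which stays under the 0.08 USD cap.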

Routing algorithm

The router runs in five deterministic stages.

1. Filter candidates by data residency, data classification, required capabilities, active policy, health, quota, and context size.
2. Score remaining candidates against quality, latency, cost, reliability, and evaluator feedback for the active intent.
3. Select the highest-ranked profile with deterministic tie-breaks: policy priority, quality band, lower p95 latency, lower estimated cost, stable provider priority.
4. Invoke through the selected provider adapter with timeout, retry, and streaming settings from the Run Context.
5. Record route decision, usage, provider response model, fallback index, applied rules, and trace span.

This makes routing explainable. If the router chooses a cheaper model, the decision record must show that the cheaper profile remained inside the quality band for the current task.
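The scoring and selection stages can be sketched as a weighted score followed by a tuple sort key that encodes the tie-break order. This is a minimal illustration, not the router's actual implementation: the `Candidate` type, the normalization of latency and cost against the request's SLO and budget, the `reliability` field, and the 0.1-wide quality bands are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model_profile_id: str
    policy_priority: int       # higher wins (priority of the matching rule)
    quality: float             # 0..1, e.g. score_hints.quality
    reliability: float         # 0..1, hypothetical health-derived signal
    latency_p95_ms: int
    estimated_cost_usd: float
    provider_priority: int     # lower wins; stable configured order


def weighted_score(c: Candidate, weights: dict, latency_slo_ms: int, max_cost_usd: float) -> float:
    # Assumed normalization: headroom against the request's SLO and budget.
    latency_score = max(0.0, 1.0 - c.latency_p95_ms / latency_slo_ms)
    cost_score = max(0.0, 1.0 - c.estimated_cost_usd / max_cost_usd)
    return (
        weights.get("quality", 0.0) * c.quality
        + weights.get("reliability", 0.0) * c.reliability
        + weights.get("latency", 0.0) * latency_score
        + weights.get("cost", 0.0) * cost_score
    )


def rank(candidates, weights: dict, latency_slo_ms: int, max_cost_usd: float) -> list:
    """Return candidates best-first; index 0 is the primary, the rest the fallback order."""
    def sort_key(c: Candidate):
        score = round(weighted_score(c, weights, latency_slo_ms, max_cost_usd), 4)
        quality_band = int(c.quality * 10)  # assumed 0.1-wide bands
        return (
            -score,                 # highest score first
            -c.policy_priority,     # then policy priority
            -quality_band,          # then quality band
            c.latency_p95_ms,       # then lower p95 latency
            c.estimated_cost_usd,   # then lower estimated cost
            c.provider_priority,    # then stable provider priority
            c.model_profile_id,     # final guard so the order is total
        )
    return sorted(candidates, key=sort_key)
```

Because the key is a plain tuple of recorded inputs, the same values can be written into the route decision to explain why one profile outranked another.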

Routing policy

Routing policy is a Trust-plane config artifact. It should be versioned, signed, evaluated in CI, and pinned per environment.

policy_id: route.support.standard.v4
owner_role: platform_runtime
default_profile: profile_general_standard_v5
rules:
  - rule_id: ROUTE_HIGH_RISK_STRUCTURED
    priority: 100
    when:
      risk_class: destructive
      structured_output: true
    require:
      data_residency: request
      capabilities:
        structured_output: true
      max_fallbacks: 1
    score:
      quality: 0.55
      reliability: 0.20
      latency: 0.15
      cost: 0.10
    candidates:
      - profile_reasoning_standard_v7
      - profile_reasoning_premium_v3
  - rule_id: ROUTE_LOW_RISK_FAST
    priority: 40
    when:
      risk_class: read_only
    score:
      latency: 0.45
      cost: 0.35
      quality: 0.20
    candidates:
      - profile_general_fast_v9
      - profile_general_standard_v5
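One plausible reading of this policy is that the router picks the highest-priority rule whose `when` clause matches the request, and uses `default_profile` when nothing matches. The sketch below assumes exactly that, with the YAML already parsed into a dictionary. `match_rule` and `weights_are_normalized` are hypothetical helpers, and the requirement that score weights sum to 1 is an assumption suggested by the example values rather than a stated rule.

```python
# Policy as it might look after parsing the YAML above (trimmed to the fields used here).
POLICY = {
    "default_profile": "profile_general_standard_v5",
    "rules": [
        {
            "rule_id": "ROUTE_HIGH_RISK_STRUCTURED",
            "priority": 100,
            "when": {"risk_class": "destructive", "structured_output": True},
            "score": {"quality": 0.55, "reliability": 0.20, "latency": 0.15, "cost": 0.10},
            "candidates": ["profile_reasoning_standard_v7", "profile_reasoning_premium_v3"],
        },
        {
            "rule_id": "ROUTE_LOW_RISK_FAST",
            "priority": 40,
            "when": {"risk_class": "read_only"},
            "score": {"latency": 0.45, "cost": 0.35, "quality": 0.20},
            "candidates": ["profile_general_fast_v9", "profile_general_standard_v5"],
        },
    ],
}


def match_rule(policy: dict, facts: dict):
    """Return the highest-priority rule whose `when` clause matches, or None."""
    matching = [
        rule for rule in policy["rules"]
        if all(facts.get(key) == expected for key, expected in rule["when"].items())
    ]
    return max(matching, key=lambda rule: rule["priority"], default=None)


def candidate_profiles(policy: dict, facts: dict) -> list[str]:
    """Candidates of the matched rule, or the default profile when no rule matches."""
    rule = match_rule(policy, facts)
    return list(rule["candidates"]) if rule else [policy["default_profile"]]


def weights_are_normalized(rule: dict, tolerance: float = 1e-9) -> bool:
    # Assumed CI check: score weights should sum to 1 within float tolerance.
    return abs(sum(rule["score"].values()) - 1.0) <= tolerance
```

A check like `weights_are_normalized` is the kind of assertion that could run when the policy is evaluated in CI, before the artifact is signed and pinned.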

Fallback rules

Fallback is allowed only when it preserves the contract requested by the caller.

Failure	Automatic response	Must record
Provider timeout	Retry once if idempotent, then use next eligible fallback	timeout, retry count, fallback index
Provider rate limit	Move to next profile in same policy band	provider error class, selected fallback
Budget would exceed cap	Downgrade only if quality floor still holds; otherwise return `BUDGET_EXCEEDED`	estimated cost and rejected candidates
Structured output validation fails	Retry with repair prompt if policy allows; otherwise return `SCHEMA_INVALID`	schema ref, validation error
Residency mismatch	Do not fall back across a residency boundary	denied region and required region

Fallback must never broaden data exposure, remove required structured output, ignore risk class, or bypass policy.
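The failure table can be read as a small decision function. The sketch below is one hedged interpretation: `BUDGET_EXCEEDED` and `SCHEMA_INVALID` come from the table, while the `Failure` enum, the keyword flags, and the action labels `retry`, `fallback`, `downgrade`, `repair_retry`, `fail`, and `deny` are hypothetical names introduced for illustration. The table does not say what happens when no eligible fallback remains, so `fail` is an assumption.

```python
from enum import Enum


class Failure(Enum):
    PROVIDER_TIMEOUT = "provider_timeout"
    PROVIDER_RATE_LIMIT = "provider_rate_limit"
    BUDGET_WOULD_EXCEED = "budget_would_exceed"
    SCHEMA_VALIDATION_FAILED = "schema_validation_failed"
    RESIDENCY_MISMATCH = "residency_mismatch"


def next_action(
    failure: Failure,
    *,
    retries_used: int = 0,
    idempotent: bool = False,
    fallback_available: bool = False,   # an eligible fallback in the same policy band
    quality_floor_holds: bool = False,  # the cheaper candidate still meets the quality floor
    repair_allowed: bool = False,       # policy permits a repair prompt
) -> str:
    """Map a failure to the automatic response described in the fallback table."""
    if failure is Failure.PROVIDER_TIMEOUT:
        if idempotent and retries_used == 0:
            return "retry"  # retry once, and only for idempotent calls
        return "fallback" if fallback_available else "fail"
    if failure is Failure.PROVIDER_RATE_LIMIT:
        return "fallback" if fallback_available else "fail"
    if failure is Failure.BUDGET_WOULD_EXCEED:
        if fallback_available and quality_floor_holds:
            return "downgrade"
        return "BUDGET_EXCEEDED"
    if failure is Failure.SCHEMA_VALIDATION_FAILED:
        return "repair_retry" if repair_allowed else "SCHEMA_INVALID"
    if failure is Failure.RESIDENCY_MISMATCH:
        return "deny"  # never fall back across a residency boundary
    raise ValueError(f"unhandled failure: {failure}")
```

The caller is expected to compute `fallback_available` only over profiles that already satisfy residency, data classification, capability, and risk-class constraints, so no branch here can broaden the contract.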

Streaming contract

The gateway may expose streaming, but every event on the stream is still a typed envelope, never a raw provider chunk.

Event	Meaning
`route.selected`	Router selected a profile and adapter.
`output.delta`	Provider emitted a text or structured-output chunk.
`usage.delta`	Token or cost estimate changed.
`provider.warning`	Adapter observed a recoverable provider warning.
`output.final`	Final normalized output is ready.
`error.final`	Terminal error with ContextOS error code.

Streaming consumers should not treat raw provider chunks as final business state. The Decision plane consumes the normalized final output.
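A consumer that follows this rule could look like the sketch below. It keeps `output.delta` chunks only as a preview and returns business state solely from `output.final`. The event shape (`{"event": ..., "data": ...}`), the string type of delta payloads, the `GatewayStreamError` class, and the `STREAM_INCOMPLETE` code are assumptions made for illustration.

```python
class GatewayStreamError(Exception):
    """Raised when the stream ends with error.final or without a terminal event."""


def consume_stream(events):
    """Return (final_output, preview_text) once output.final arrives.

    Assumes each event is a dict shaped like {"event": <name>, "data": <payload>}.
    """
    preview_chunks = []
    for event in events:
        kind = event["event"]
        if kind == "output.delta":
            preview_chunks.append(event["data"])  # display only, never business state
        elif kind == "output.final":
            return event["data"], "".join(preview_chunks)
        elif kind == "error.final":
            raise GatewayStreamError(event["data"]["code"])
        # route.selected, usage.delta, and provider.warning are informational here.
    # Hypothetical code for a stream that closed without a terminal event.
    raise GatewayStreamError("STREAM_INCOMPLETE")
```

Only the first element of the returned tuple would be handed to the Decision plane; the preview text exists for user-facing display.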

Observability

Every gateway invocation emits OpenTelemetry spans and metrics tied to the parent trace_id.

Recommended span structure:

contextos.ai_gateway.invoke
  contextos.llm_router.select
  contextos.provider_adapter.invoke

Required attributes:

Attribute	Why it matters
`contextos.run_id`	Correlates route decision to the agent run.
`contextos.intent_id`	Enables per-intent routing evaluation.
`contextos.routing_decision_id`	Links telemetry to audit record.
`contextos.model_profile_id`	Stable route target independent of provider model names.
`contextos.provider_adapter`	Provider boundary used for the call.
`contextos.fallback_index`	Shows whether the primary route held.
`contextos.estimated_cost_usd`	Makes budget enforcement auditable.

Use OpenTelemetry GenAI semantic conventions where they fit, but pin the emitted convention version because those conventions are still evolving.
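As a small illustration, the required attributes can be assembled from the request and response envelopes and checked for completeness before a span is closed. This sketch uses only the standard library; `build_span_attributes` and `missing_span_attributes` are hypothetical helpers, and the mapping from envelope fields to attributes is an assumption based on the examples in this document.

```python
REQUIRED_SPAN_ATTRIBUTES = (
    "contextos.run_id",
    "contextos.intent_id",
    "contextos.routing_decision_id",
    "contextos.model_profile_id",
    "contextos.provider_adapter",
    "contextos.fallback_index",
    "contextos.estimated_cost_usd",
)


def build_span_attributes(request: dict, response: dict) -> dict:
    """Map the northbound request and gateway response onto span attributes."""
    route = response["route"]
    return {
        "contextos.run_id": request["run_id"],
        "contextos.intent_id": request["intent_id"],
        "contextos.routing_decision_id": route["routing_decision_id"],
        "contextos.model_profile_id": route["model_profile_id"],
        "contextos.provider_adapter": route["provider_adapter"],
        "contextos.fallback_index": route["fallback_index"],
        "contextos.estimated_cost_usd": response["usage"]["estimated_cost_usd"],
    }


def missing_span_attributes(attributes: dict) -> list[str]:
    """Return required attribute names that are absent or None."""
    # A fallback_index of 0 is a valid value, so test for None rather than truthiness.
    return [name for name in REQUIRED_SPAN_ATTRIBUTES if attributes.get(name) is None]
```

The resulting dictionary holds only strings and numbers, so it could be attached to an OpenTelemetry span with the Python API's `Span.set_attributes`.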

Audit record

The gateway persists a RoutingDecision record. It is not a substitute for the final DecisionRecord, but the final decision should reference it in lineage.

{
  "routing_decision_id": "route_01j9",
  "request_id": "req_01j9",
  "run_id": "run_88",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "policy_version": "route.support.standard.v4",
  "selected_profile": "profile_reasoning_standard_v7",
  "candidate_profiles": [
    "profile_reasoning_standard_v7",
    "profile_reasoning_premium_v3"
  ],
  "rejected_profiles": [
    {
      "model_profile_id": "profile_general_fast_v9",
      "reason": "missing_structured_output"
    }
  ],
  "usage": {
    "input_tokens": 18340,
    "output_tokens": 612,
    "estimated_cost_usd": 0.041
  },
  "fallback_index": 0,
  "created_at": "2026-05-09T10:30:00Z"
}
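Before persisting a record like this, the gateway could run a few internal consistency checks. The sketch below assumes that `candidate_profiles` and `rejected_profiles` are disjoint and that `fallback_index` counts positions in `candidate_profiles`; both hold for the example above but are not stated rules. `audit_record_problems` is a hypothetical helper.

```python
def audit_record_problems(record: dict) -> list[str]:
    """Return consistency problems in a RoutingDecision record; empty means consistent."""
    problems = []
    candidates = record["candidate_profiles"]
    rejected = {entry["model_profile_id"] for entry in record["rejected_profiles"]}
    selected = record["selected_profile"]

    if selected not in candidates:
        problems.append("selected_profile is not among candidate_profiles")
    overlap = sorted(rejected.intersection(candidates))
    if overlap:
        problems.append(f"profiles listed as both candidate and rejected: {overlap}")
    # Assumption: fallback_index counts positions in candidate_profiles.
    if not 0 <= record["fallback_index"] < max(len(candidates), 1):
        problems.append("fallback_index is out of range")
    for entry in record["rejected_profiles"]:
        if not entry.get("reason"):
            problems.append(f"rejected profile {entry['model_profile_id']} has no reason")
    return problems
```

Requiring a reason on every rejected profile keeps the record useful for explaining a route after the fact.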

Security controls

The AI Gateway should enforce these controls before any provider call:

Redact secrets and disallowed data classes from provider-bound payloads.
Deny requests whose data residency cannot be satisfied by an eligible profile.
Keep provider credentials inside the gateway boundary.
Disable provider-side state unless the active profile and tenant policy explicitly permit it.
Store prompts, completions, and chunks according to classification and retention policy.
Refuse prompt-response caching for sensitive, tenant-private, or user-private payloads unless policy explicitly permits it.
Treat provider-native tool calls as proposals, not executed actions.
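The first control can be sketched as a payload-preparation step that refuses disallowed data classes and masks obvious secrets. The regular expression below is purely illustrative and would miss most real secrets; a production gateway would rely on its own scanners and classification service. `prepare_provider_payload`, `DataClassDenied`, and the `data_class` argument are hypothetical names, while `data_classes_allowed` comes from the model profile example.

```python
import re

# Illustrative pattern only: matches "password: x", "api_key=x", "token = x".
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[:=]\s*\S+")


class DataClassDenied(Exception):
    """Raised when the payload's data class is not allowed for the profile."""


def prepare_provider_payload(messages: list, data_class: str, profile_policy: dict) -> list:
    """Return redacted copies of messages, or raise if the data class is disallowed."""
    if data_class not in profile_policy["data_classes_allowed"]:
        raise DataClassDenied(f"{data_class} is not allowed for this profile")
    return [
        {**message, "content": SECRET_PATTERN.sub(r"\1=[REDACTED]", message["content"])}
        for message in messages
    ]
```

The function returns new message dictionaries instead of mutating its input, so the unredacted originals can still be stored under the retention policy that applies to them.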

Deployment model

Gateway deployment: policy and model-profile registries feed the router; provider adapters keep provider-specific behavior out of the Cognitive Core.

Evaluation loop

Routing quality should be evaluated like any other ContextOS runtime behavior.

Signal	Used for
Golden-set score by intent	Validate that routing changes do not reduce decision quality.
Fallback rate	Detect provider instability or bad primary routes.
Cost per accepted decision	Catch cost regressions that do not improve quality.
Structured-output validation rate	Detect profile or prompt incompatibility.
Latency by route	Keep quality improvements inside user-facing SLOs.
Human override rate	Identify routes that look cheap but create review burden.

Routing policy changes should pass golden-set replay before promotion. Canary routing is acceptable only when the Trust plane can compare canary outcomes against the pinned baseline.
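Two of these signals are simple ratios over routing records, sketched below. The record shape is an assumption: `fallback_index` and `estimated_cost_usd` appear in the audit record, while the `accepted` flag is a hypothetical field marking whether the downstream decision was accepted. Charging the cost of rejected attempts to the accepted ones is also an assumption about how the metric is defined.

```python
def fallback_rate(records: list) -> float:
    """Share of calls that did not stay on the primary route."""
    if not records:
        return 0.0
    return sum(1 for record in records if record["fallback_index"] > 0) / len(records)


def cost_per_accepted_decision(records: list):
    """Total spend divided by accepted decisions; None when nothing was accepted.

    Spend on rejected attempts is included, so wasted calls raise the metric.
    """
    accepted = sum(1 for record in records if record["accepted"])
    if accepted == 0:
        return None
    total_cost = sum(record["estimated_cost_usd"] for record in records)
    return total_cost / accepted
```

Comparing both numbers between a candidate policy and the pinned baseline on the same replay set shows whether a cheaper route merely moved cost into retries and rejected decisions.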

Acceptance checklist

Before shipping an AI Gateway or router into production:

A single provider-neutral invocation envelope exists.
Model profiles are versioned, and business logic references profile IDs rather than provider model names.
Routing policy is signed, pinned, and environment-scoped.
Route decisions are recorded and joinable to Decision Records.
Provider-side state is explicit and policy-controlled.
Structured output validation failures are typed.
Fallback cannot cross residency, data classification, or capability boundaries.
OpenTelemetry spans and token/cost metrics are emitted for every call.
Golden replay covers route changes before promotion.

Research basis

This design follows patterns from primary vendor and standards documentation and keeps provider-specific details behind adapters:

OpenAI Responses API migration guide: typed response items, function calling, structured outputs, state, and tool-capable model interactions.
Amazon Bedrock intelligent prompt routing: routing for quality and cost, fallback models, route criteria, and traceable responses.
Microsoft Foundry model router: deployable model-router pattern, routing modes, model subsets, data-zone boundaries, and versioned underlying model sets.
OpenTelemetry GenAI semantic conventions: span, metric, and event naming guidance for generative AI systems.