Deployment Blueprint

Multi-region deployment of the five planes — service placement, tenancy partitions, replication, replay, SLOs.

Reference Design

Reference Architecture

This blueprint maps the five planes to a deployable topology with tenant isolation, data residency, replay determinism, and SLO-driven reliability.

Executive summary

The runtime decomposes into stateless Decision-plane services, regional AI Gateway / LLM Router pods, regional stateful Intelligence-plane stores, a single Action-plane Tool Gateway per region, and a globally replicated Trust-plane control plane. Replay is preserved by pinning content-addressed packs, model profiles, route decisions, and snapshots; audit is preserved by append-only, WORM-capable storage.

Deployment topology (reference)

[Diagram: cross-region replication of the pack registry, model profiles, routing policy, audit, and KG snapshots]

Two active regions run identical plane stacks; pack registry, routing policy, audit, and KG snapshots replicate across them.

Service placement by plane

| Plane | Class | Placement |
|---|---|---|
| Intelligence | stateful (KG, vectors, memory tier storage) | regional primary + cross-region replica; tenant-partitioned indices |
| Context | stateless (Compiler) | regional, autoscaled; cold-start dominated by pack registry hydration |
| Decision | stateless (Planner / Executor / Critic, subagent lanes) | regional; long-running session checkpoints in regional durable store |
| Model boundary | stateless (AI Gateway, LLM Router, provider adapters) | regional; routing decisions written to audit; model profiles signed by Trust control plane |
| Action | stateless (Tool Gateway) + stateful (idempotency cache) | regional; egress credentials brokered per call |
| Trust | global (policy bundle registry, audit store, pack registry) | active-active with strict write controls; signed promotion |

Multi-region strategy

  • Active/active for serving APIs (invokeAgent, retrieval, recall).
  • Regional AI Gateway / LLM Router for model calls; fallback cannot cross residency, classification, or capability boundaries.
  • Active/passive for batch processing and embedding pipelines.
  • Cross-region replication for KG snapshots, audit, pack registry, Decision Record store.
  • Region pinning for regulated tenants and data-residency compliance.
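
The fallback rule above (a fallback cannot cross residency, classification, or capability boundaries) can be read as a filter over candidate model profiles. The following is a minimal sketch under stated assumptions: `ModelProfile`, `RouteRequest`, and `eligible_fallbacks` are hypothetical names, and the idea that a profile carries a maximum data classification is an illustration, not a documented schema.

```python
from dataclasses import dataclass

# Ordered from least to most sensitive; these are the classes named in this blueprint.
CLASSIFICATION_ORDER = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]


@dataclass(frozen=True)
class ModelProfile:  # hypothetical shape
    name: str
    region: str
    max_classification: str   # highest data class this profile may receive (assumption)
    capabilities: frozenset


@dataclass(frozen=True)
class RouteRequest:  # hypothetical shape
    allowed_regions: frozenset      # residency constraint for the tenant
    classification: str
    required_capabilities: frozenset


def eligible_fallbacks(request, candidates):
    """Keep only candidates inside the residency, classification, and capability bounds."""
    level = CLASSIFICATION_ORDER.index(request.classification)
    return [
        p for p in candidates
        if p.region in request.allowed_regions
        and CLASSIFICATION_ORDER.index(p.max_classification) >= level
        and request.required_capabilities <= p.capabilities
    ]
```

If the filter returns an empty list, the sketch implies the call should fail rather than route to an ineligible profile.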

Tenancy partitions

  • Per-tenant namespaces for KG and memory indices.
  • Row / column security for source-of-truth stores.
  • Tenant-scoped model profile allowlists, routing budgets, and provider data-retention policy at the AI Gateway.
  • Scoped credentials and per-tenant rate budgets at the Tool Gateway.
  • Hard multi-tenancy (isolated clusters) for regulated tenants.

Caching

  • L1 in-process cache for CompiledContext (short TTL, keyed by pack_version + snapshot_version + request_hash).
  • L2 regional cache with tenant-keyed namespaces.
  • Cache invalidation tied to: pack promotion, snapshot promotion, policy bundle promotion, model-profile promotion, and routing-policy promotion.
  • Embedding cache keyed by (model_version, text_hash, tenant_id) when the pack permits.
  • Prompt-response caching is disabled by default at the AI Gateway and enabled only when tenant policy, classification, and profile allow it.
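
The cache keys described above can be sketched as follows. This is an illustration only: the function names, the key string layout, and the choice of SHA-256 over canonical JSON for `request_hash` are assumptions. Only the key components (pack_version, snapshot_version, request_hash, and the embedding tuple) come from the list above.

```python
import hashlib
import json


def request_hash(request: dict) -> str:
    """Hash a request canonically so logically equal requests share a hash."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compiled_context_key(tenant_id, pack_version, snapshot_version, request):
    """Tenant-namespaced key built from the three components named above."""
    return f"{tenant_id}:ctx:{pack_version}:{snapshot_version}:{request_hash(request)}"


def embedding_key(model_version, text, tenant_id):
    """Embedding cache key: (model_version, text_hash, tenant_id)."""
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return (model_version, text_hash, tenant_id)
```

Because the pack and snapshot versions are part of the key, a promotion changes the key, and stale entries are simply never hit again. Explicit invalidation on the promotion events listed above is still needed to reclaim space and to cover policy, model-profile, and routing-policy changes, which do not appear in the key.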

SLO wiring

  • Per-tenant Run Budget enforcement at the Conversation Manager boundary.
  • Model-call latency, token, and cost budgets enforced at the AI Gateway.
  • Step timeouts enforced by the Tool Gateway, not by the adapter.
  • SLO dashboards: p95 compile, p95 router decision, p95 gateway overhead, p95 retrieval, p95 recall, p99 end-to-end Run Context wall-clock.
  • Error budgets banded by risk_class.
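
As an illustration of budget enforcement at a boundary, the sketch below rejects a model call before it runs if the call would exceed a token or cost cap. `RunBudget`, `BudgetExceeded`, and the field names are hypothetical; the blueprint does not specify the Run Budget schema.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a call would push a run past one of its caps."""


@dataclass
class RunBudget:  # hypothetical shape; field names are illustrative
    max_tokens: int
    max_cost_usd: float
    tokens_used: int = 0
    cost_used_usd: float = 0.0

    def charge(self, tokens: int, cost_usd: float) -> None:
        """Admit the call only if both caps still hold afterwards."""
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if self.cost_used_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost budget exhausted")
        # Counters change only after both checks pass, so a refused call leaves no trace.
        self.tokens_used += tokens
        self.cost_used_usd += cost_usd
```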

Deterministic replay

  • Pack registry is content-addressed and signed.
  • Model profiles, routing rules, and RoutingDecision records are pinned.
  • KG snapshots are pinned per environment; promotions are deliberate.
  • Audit store retains tool transcripts, policy decisions, approval records.
  • Replay re-derives the Critic verdict and Decision Record without re-executing tools or silently changing the model route.
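
A minimal sketch of the pinning idea: a pack is addressed by the digest of its canonical serialization, and a replay first checks that the inputs it is given match the pins recorded with the original run. `ReplayPins`, `content_address`, and `verify_pins` are hypothetical names, and signature verification is deliberately omitted here.

```python
import hashlib
import json
from dataclasses import dataclass


def content_address(pack: dict) -> str:
    """Digest of the canonical JSON form; equal content yields an equal address."""
    canonical = json.dumps(pack, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ReplayPins:  # hypothetical record stored alongside a run
    pack_digest: str
    model_profile: str
    routing_decision_id: str
    snapshot_version: str


def verify_pins(recorded: ReplayPins, pack: dict, snapshot_version: str) -> None:
    """Refuse to replay if the pack or snapshot differs from what the run used."""
    if content_address(pack) != recorded.pack_digest:
        raise ValueError("pack content does not match the pinned digest")
    if snapshot_version != recorded.snapshot_version:
        raise ValueError("snapshot version does not match the pinned version")
```

Failing loudly on a mismatch is the point of the sketch: a replay that proceeds with a different pack or snapshot would no longer re-derive the same verdict.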

Stateful stores

  • KG primary: regional, sharded by tenant; cross-region replicas for read-only failover.
  • Vector / BM25 indices: rebuilt from the KG snapshot; an index must never be ahead of the snapshot it was built from.
  • Memory tier storage: regional primary with WAL replication; per-tenant TTL.
  • RoutingDecision store: append-only; joins model profile, routing rule, provider adapter, and trace ID.
  • Decision Record store: append-only; cross-region replicated; WORM-capable for audit class.
  • Audit store: append-only, WORM-capable, replicated.
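
To make the append-only contract concrete, the sketch below models a RoutingDecision record that joins the four fields listed above, plus an in-memory store that accepts new records and refuses overwrites. It is an illustration of the contract only: `decision_id` and the class names are assumptions, and a real store would rely on the storage layer (for example, WORM retention) rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:  # hypothetical record shape
    decision_id: str
    model_profile: str
    routing_rule: str
    provider_adapter: str
    trace_id: str


class AppendOnlyStore:
    """Accepts new records; exposes no update or delete operation."""

    def __init__(self):
        self._records = {}

    def append(self, record: RoutingDecision) -> None:
        if record.decision_id in self._records:
            raise ValueError(f"record {record.decision_id} already exists")
        self._records[record.decision_id] = record

    def get(self, decision_id: str) -> RoutingDecision:
        return self._records[decision_id]

    def __len__(self) -> int:
        return len(self._records)
```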

Scaling model

  • Stateless services: HPA by p95 latency or in-flight count.
  • Stateful services: scaled vertically and sharded by tenant or entity type.
  • Queues: backpressure with dead-letter queues for tool execution and batch jobs.
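
For the stateless tier, the Kubernetes Horizontal Pod Autoscaler computes its target as ceil(currentReplicas × currentMetric / targetMetric). The sketch below reproduces that formula with a clamp to a minimum and maximum replica count. The HPA's tolerance band and stabilization window are left out, and the example numbers are illustrative; driving the HPA from p95 latency or in-flight count assumes those values are exported as custom metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """HPA-style target: scale proportionally to metric/target, then clamp."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Illustrative: 6 pods at p95 = 375 ms against a 250 ms target, bounded to 6..12 pods.
print(desired_replicas(6, 375, 250, 6, 12))  # 9
```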

Data residency and isolation

  • Data classified at ingestion (PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED).
  • Residency constraints enforced at the Conversation Manager and storage layers.
  • Cross-region replication excluded for RESTRICTED classes.
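
The replication rule can be expressed as a small policy function. This is a sketch, assuming a hypothetical `replication_targets` helper and a `tenant_pinned` flag that stands in for region pinning of regulated tenants.

```python
from enum import Enum


class DataClass(Enum):
    PUBLIC = "PUBLIC"
    INTERNAL = "INTERNAL"
    CONFIDENTIAL = "CONFIDENTIAL"
    RESTRICTED = "RESTRICTED"


def replication_targets(data_class, home_region, replica_regions, tenant_pinned=False):
    """RESTRICTED data and region-pinned tenants never leave the home region."""
    if data_class is DataClass.RESTRICTED or tenant_pinned:
        return [home_region]
    return [home_region, *replica_regions]
```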

Runbooks (high level)

  • Provider outage: AI Gateway routes to an eligible alternate profile; degrade quality tier only within residency, capability, and budget constraints; log scorecard delta.
  • Latency spike: reduce context and graph hop budgets within policy, lower model route tier if allowed, and enable cache; do not silently drop required evidence.
  • Cost spike: enforce Run Budget caps, select lower-cost model profiles when policy allows, and use cached read-only aliases where valid; do not change approval-mode semantics for side-effecting tools.
  • KG index corruption: promote replica; replay from audit logs.
  • Audit store unavailability: refuse destructive actions; serve read_only and local_write only.
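
The audit-store runbook entry amounts to an admission gate: while the audit store is unreachable, only read_only and local_write actions are admitted. A minimal sketch follows, with `admit_action` as a hypothetical name; any action class other than the two named above is treated as refused in degraded mode.

```python
# Action classes that remain servable while the audit store is down (from the runbook above).
AUDIT_DEGRADED_ALLOWED = frozenset({"read_only", "local_write"})


def admit_action(action_class: str, audit_store_available: bool) -> bool:
    """Admit everything when audit is healthy; otherwise only the degraded-mode allowlist."""
    if audit_store_available:
        return True
    return action_class in AUDIT_DEGRADED_ALLOWED
```

An allowlist fails closed: an action class that nobody anticipated is refused rather than executed without an audit record.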

Reference SLOs (example)

| Service | p95 latency | Availability | Notes |
|---|---|---|---|
| Conversation Manager | 100 ms | 99.95% | Envelope acceptance |
| Context Pack Compiler | 250 ms | 99.9% | Cached on hot paths |
| AI Gateway | 300 ms | 99.95% | Gateway overhead, excludes provider generation |
| LLM Router | 50 ms | 99.95% | Candidate filter + route decision |
| Knowledge Substrate (GraphRAG) | 500 ms | 99.9% | Snapshot-pinned |
| Memory Service (recall) | 200 ms | 99.9% | Tenant-scoped |
| Tool Gateway | 1500 ms | 99.9% | Includes adapter call |
| Policy Engine | 100 ms | 99.95% | Pre/mid/post |
| Audit store write | 200 ms | 99.99% | WORM-capable |
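
An availability target translates directly into an error budget. The sketch below converts the targets above into allowed unavailability over a window; the 30-day window is an assumption, since the blueprint does not state one.

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability target."""
    window_minutes = window_days * 24 * 60
    return (100.0 - availability_pct) / 100.0 * window_minutes


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.2f} min per 30 days")
```

Over 30 days this gives roughly 43.2 minutes at 99.9%, 21.6 minutes at 99.95%, and 4.32 minutes at 99.99%.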

Sizing assumptions (example)

Workload tiers

  • Tier 1 (Dev/PoC): 50 RPS, p95 2s, single region, soft multi-tenancy.
  • Tier 2 (Prod mid): 300 RPS, p95 1.5s, active/active regions, partial isolation.
  • Tier 3 (Regulated): 1000 RPS, p95 1.2s, hard tenant isolation, residency pinned.
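
A rough concurrency estimate for each tier follows from Little's law (requests in flight = arrival rate × time in system). The sketch below uses the tier's p95 latency in place of the mean, which is an assumption that usually overestimates concurrency and therefore gives a conservative planning figure.

```python
def in_flight_requests(rps: int, latency_ms: int) -> float:
    """Little's law: concurrent requests = arrival rate x time each request spends in the system."""
    return rps * latency_ms / 1000


# Tier figures taken from the list above: (RPS, p95 latency in ms).
tiers = {"Tier 1": (50, 2000), "Tier 2": (300, 1500), "Tier 3": (1000, 1200)}
for name, (rps, p95_ms) in tiers.items():
    print(name, in_flight_requests(rps, p95_ms))
```

This yields about 100, 450, and 1200 concurrent requests for Tiers 1 to 3, which is a starting point for sizing in-flight-based autoscaling targets per pod.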

Service baseline (per region)

  • Conversation Manager / Compiler / Orchestrator: 6–12 pods, 2 vCPU / 4 GB; autoscale by p95 latency.
  • AI Gateway / LLM Router: 4–8 pods, 1–2 vCPU / 2–4 GB; autoscale by in-flight model calls and router latency.
  • Policy Engine: 4–8 pods, 1 vCPU / 2 GB; CPU-bound.
  • Knowledge Substrate: 6–10 pods, 4 vCPU / 8 GB; memory-heavy.
  • Memory Service: 4–8 pods, 2 vCPU / 4 GB; latency-sensitive.
  • Vector index: 3–6 shards, 8 vCPU / 32 GB; shard by tenant.
  • Graph store: 3 nodes, 8 vCPU / 32 GB; replicated.
  • Audit store: append-only; WORM-capable storage with daily compaction.