Deployment Blueprint

Multi-region deployment of the five planes — service placement, tenancy partitions, replication, replay, SLOs.

Reference Design. Last reviewed: 2026-05-09

At a glance

This blueprint maps the five planes to a deployable topology with tenant isolation, data residency, replay determinism, and SLO-driven reliability.

Executive summary

The runtime decomposes into stateless Decision-plane services, regional AI Gateway / LLM Router pods, regional stateful Intelligence-plane stores, a single Action-plane Tool Gateway per region, and a globally replicated Trust-plane control plane. Replay is preserved by pinning the content-addressed pack, the model profile, the route decision, and the snapshot; audit is preserved by append-only, WORM-capable storage.

Deployment topology (reference)

Two active regions run identical plane stacks; the pack registry, model profiles, routing policy, audit, and KG snapshots replicate across them.

Service placement by plane

Plane	Class	Placement
Intelligence	stateful (KG, vectors, memory tier storage)	regional primary + cross-region replica; tenant-partitioned indices
Context	stateless (Compiler)	regional, autoscaled; cold-start dominated by pack registry hydration
Decision	stateless (Planner / Executor / Critic, subagent lanes)	regional; long-running session checkpoints in regional durable store
Model boundary	stateless (AI Gateway, LLM Router, provider adapters)	regional; routing decisions written to audit; model profiles signed by Trust control plane
Action	stateless (Tool Gateway) + stateful (idempotency cache)	regional; egress credentials brokered per call
Trust	global (policy bundle registry, audit store, pack registry)	active-active with strict write controls; signed promotion

Multi-region strategy

Active/active for serving APIs (invokeAgent, retrieval, recall).
Regional AI Gateway / LLM Router for model calls; fallback cannot cross residency, classification, or capability boundaries.
Active/passive for batch processing and embedding pipelines.
Cross-region replication for KG snapshots, audit, pack registry, Decision Record store.
Region pinning for regulated tenants and data-residency compliance.
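
The fallback rule above (no crossing of residency, classification, or capability boundaries) can be read as a candidate filter that runs before any route is chosen. The following is a minimal sketch under stated assumptions: ModelProfile, RouteRequest, their field names, and the idea of a per-profile maximum classification are all hypothetical and are not the blueprint's actual schema.

```python
from dataclasses import dataclass

# Ordering of the classification levels named in this blueprint, lowest first.
CLASSIFICATION_ORDER = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]


@dataclass(frozen=True)
class ModelProfile:
    # Hypothetical shape of a model profile, for illustration only.
    name: str
    region: str
    max_classification: str   # highest data class the profile may receive
    capabilities: frozenset


@dataclass(frozen=True)
class RouteRequest:
    # Hypothetical shape of a routing request, for illustration only.
    allowed_regions: frozenset        # residency constraint for the tenant
    classification: str               # data class of the request payload
    required_capabilities: frozenset


def eligible_fallbacks(request, profiles):
    """Keep only profiles inside the residency, classification, and capability bounds."""
    level = CLASSIFICATION_ORDER.index(request.classification)
    return [
        p for p in profiles
        if p.region in request.allowed_regions
        and CLASSIFICATION_ORDER.index(p.max_classification) >= level
        and request.required_capabilities <= p.capabilities
    ]
```

In this sketch an empty result means the call fails rather than falling back across a boundary, which matches the intent of the rule.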

Tenancy partitions

Per-tenant namespaces for KG and memory indices.
Row / column security for source-of-truth stores.
Tenant-scoped model profile allowlists, routing budgets, and provider data-retention policy at the AI Gateway.
Scoped credentials and per-tenant rate budgets at the Tool Gateway.
Hard multi-tenancy (isolated clusters) for regulated tenants.

Caching

L1 in-process cache for compiled CompiledContext (short TTL keyed by pack_version + snapshot_version + request_hash).
L2 regional cache with tenant-keyed namespaces.
Cache invalidation tied to: pack promotion, snapshot promotion, policy bundle promotion, model-profile promotion, and routing-policy promotion.
Embedding cache keyed by (model_version, text_hash, tenant_id) when the pack permits.
Prompt-response caching is disabled by default at the AI Gateway and enabled only when tenant policy, classification, and profile allow it.
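
The cache keys described above can be sketched as follows. This is an illustration only: the key layout, the tenant prefix, and the function names are assumptions, and SHA-256 over canonical JSON is one reasonable way to build request_hash, not necessarily the one used in practice.

```python
import hashlib
import json


def request_hash(request: dict) -> str:
    """Hash the request over canonical JSON so key order does not change the digest."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compiled_context_key(tenant_id, pack_version, snapshot_version, request):
    # The tenant prefix mirrors the tenant-keyed namespaces of the L2 cache (assumed layout).
    return f"{tenant_id}:ctx:{pack_version}:{snapshot_version}:{request_hash(request)}"


def embedding_key(model_version, text, tenant_id):
    # Keyed by (model_version, text_hash, tenant_id) as listed above.
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{tenant_id}:emb:{model_version}:{text_hash}"
```

Because pack_version and snapshot_version are part of the key, a pack or snapshot promotion naturally stops hitting old entries. Policy-bundle, model-profile, and routing-policy promotions are not in this key, so they would still need explicit invalidation.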

SLO wiring

Per-tenant Run Budget enforcement at the Conversation Manager boundary.
Model-call latency, token, and cost budgets enforced at the AI Gateway.
Step timeouts enforced by the Tool Gateway, not by the adapter.
SLO dashboards: p95 compile, p95 router decision, p95 gateway overhead, p95 retrieval, p95 recall, p99 end-to-end Run Context wall-clock.
Error budgets banded by risk_class.
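
As an illustration of budget enforcement, the sketch below charges token and cost usage against a per-run cap and rejects a call before it is recorded if it would exceed either limit. RunBudget's fields and the BudgetExceeded exception are hypothetical names; the blueprint does not specify the budget's structure.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would push the run past one of its caps."""


@dataclass
class RunBudget:
    # Hypothetical fields; the real Run Budget may track more dimensions (e.g. latency).
    max_tokens: int
    max_cost_usd: float
    tokens_used: int = 0
    cost_used_usd: float = 0.0

    def charge(self, tokens: int, cost_usd: float) -> None:
        """Check both caps first, then record usage, so a rejected call leaves no trace."""
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if self.cost_used_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost budget exhausted")
        self.tokens_used += tokens
        self.cost_used_usd += cost_usd
```

Checking before mutating keeps the counters consistent when a call is refused, which matters if the same budget object is consulted again for a cheaper fallback route.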

Deterministic replay

Pack registry is content-addressed and signed.
Model profiles, routing rules, and RoutingDecision records are pinned.
KG snapshots are pinned per environment; promotions are deliberate.
Audit store retains tool transcripts, policy decisions, approval records.
Replay re-derives the Critic verdict and Decision Record without re-executing tools or silently changing the model route.
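
A minimal sketch of the pinning idea, assuming a pack can be serialized as JSON: the pack is addressed by the SHA-256 of its canonical form, and a replay checks that digest before proceeding. ReplayPins and its field names are hypothetical, and signature verification is deliberately left out of this sketch.

```python
import hashlib
import json
from dataclasses import dataclass


def content_address(artifact: dict) -> str:
    """Derive a content address from canonical JSON (sorted keys, no extra whitespace)."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ReplayPins:
    # Hypothetical record of everything a replay must hold fixed.
    pack_digest: str
    model_profile_id: str
    routing_decision_id: str
    snapshot_version: str


def verify_pins(pins: ReplayPins, pack: dict) -> bool:
    """A replay should refuse to run if the pack no longer hashes to the pinned digest."""
    return content_address(pack) == pins.pack_digest
```

Any change to the pack contents, however small, produces a different digest, so a replay against a modified pack is detected rather than silently accepted.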

Stateful stores

KG primary: regional, sharded by tenant; cross-region replicas for read-only failover.
Vector / BM25 indices: rebuilt from the KG snapshot; an index must never be ahead of the snapshot it was built from.
Memory tier storage: regional primary with WAL replication; per-tenant TTL.
RoutingDecision store: append-only; joins model profile, routing rule, provider adapter, and trace ID.
Decision Record store: append-only; cross-region replicated; WORM-capable for audit class.
Audit store: append-only, WORM-capable, replicated.
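
To make the append-only contract concrete, here is an in-memory stand-in for a RoutingDecision store. It is a sketch only: the decision_id field and the AppendOnlyStore class are hypothetical, and a real deployment would rely on the storage layer's WORM guarantees rather than on application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    # Joins the fields listed above; decision_id is an assumed unique identifier.
    decision_id: str
    trace_id: str
    model_profile: str
    routing_rule: str
    provider_adapter: str


class AppendOnlyStore:
    """Records can be appended and read back, never updated or deleted."""

    def __init__(self):
        self._records = []
        self._ids = set()

    def append(self, record: RoutingDecision) -> int:
        if record.decision_id in self._ids:
            raise ValueError(f"decision {record.decision_id} already recorded")
        self._ids.add(record.decision_id)
        self._records.append(record)
        return len(self._records) - 1   # offset of the newly appended record

    def read(self, offset: int) -> RoutingDecision:
        return self._records[offset]
```

The frozen dataclass prevents in-place edits to a stored record, and the store exposes no update or delete method, so the only way to correct a decision is to append a new one.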

Scaling model

Stateless services: HPA by p95 latency or in-flight count.
Stateful services: vertical scaling plus sharding by tenant or entity type.
Queues: backpressure with dead-letter queues for tool execution and batch jobs.
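
For intuition on HPA behavior, Kubernetes documents its scaling rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with a default tolerance of 0.1 around the target. The sketch below reproduces that arithmetic; the function name and the min/max defaults are illustrative, and a p95-latency or in-flight-count signal would need to be exposed to the autoscaler as a custom or external metric.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=100, tolerance=0.1):
    """Apply the documented HPA ratio rule, then clamp to the configured bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # within tolerance: no scaling action
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 6 pods observing a p95 of 400 ms against a 250 ms target scale to 10 pods when the bounds are 6 to 12, and the same pods at 260 ms stay at 6 because the ratio is within tolerance.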

Data residency and isolation

Data classified at ingestion (PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED).
Residency constraints enforced at the Conversation Manager and storage layers.
Cross-region replication excluded for RESTRICTED classes.
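
The replication rule above can be expressed as a small eligibility check. This is a sketch under assumptions: the function name, the per-tenant allowed_regions set, and the region names in the usage note are hypothetical.

```python
CLASSIFICATIONS = {"PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"}


def replication_targets(classification, home_region, allowed_regions, all_regions):
    """Return the regions a record may replicate to, excluding its home region."""
    if classification not in CLASSIFICATIONS:
        # Fail closed: an unclassified record is not replicated anywhere.
        raise ValueError(f"unknown classification: {classification}")
    if classification == "RESTRICTED":
        return []   # RESTRICTED data never leaves its home region
    return sorted(
        r for r in all_regions
        if r != home_region and r in allowed_regions
    )
```

With regions eu-west, eu-north, and us-east and a tenant pinned to the two EU regions, a CONFIDENTIAL record homed in eu-west replicates only to eu-north, and a RESTRICTED record replicates nowhere.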

Runbooks (high level)

Provider outage: AI Gateway routes to an eligible alternate profile; degrade quality tier only within residency, capability, and budget constraints; log scorecard delta.
Latency spike: reduce context and graph hop budgets within policy, lower model route tier if allowed, and enable cache; do not silently drop required evidence.
Cost spike: enforce Run Budget caps, select lower-cost model profiles when policy allows, and use cached read-only aliases where valid; do not change approval-mode semantics for side-effecting tools.
KG index corruption: promote replica; replay from audit logs.
Audit store unavailability: refuse destructive actions; serve read_only and local_write only.
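
The audit-unavailability runbook amounts to a fail-closed gate in front of tool execution. A minimal sketch follows, assuming each tool carries an effect class; read_only and local_write come from the runbook above, while any other class name (such as the "destructive" value in the usage note) is hypothetical.

```python
# Effect classes that may still be served while the audit store is unreachable.
ALLOWED_WHEN_AUDIT_DOWN = frozenset({"read_only", "local_write"})


def may_execute(tool_effect_class: str, audit_available: bool) -> bool:
    """Gate tool execution on audit availability; unknown classes are refused when it is down."""
    if audit_available:
        return True   # normal policy and approval checks still apply elsewhere
    return tool_effect_class in ALLOWED_WHEN_AUDIT_DOWN
```

Using an allowlist rather than a denylist means a newly introduced effect class is refused by default during an outage, which is the safer failure mode for actions that cannot be audited.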

Reference SLOs (example)

Service	p95 latency	Availability	Notes
Conversation Manager	100 ms	99.95%	Envelope acceptance
Context Pack Compiler	250 ms	99.9%	Cached on hot paths
AI Gateway	300 ms	99.95%	Gateway overhead, excludes provider generation
LLM Router	50 ms	99.95%	Candidate filter + route decision
Knowledge Substrate (GraphRAG)	500 ms	99.9%	Snapshot-pinned
Memory Service (recall)	200 ms	99.9%	Tenant-scoped
Tool Gateway	1500 ms	99.9%	Includes adapter call
Policy Engine	100 ms	99.95%	Pre/mid/post
Audit store write	200 ms	99.99%	WORM-capable
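
The availability targets translate directly into an error budget. The sketch below converts a target into allowed downtime per window; the 30-day window is an assumption, since the table does not state one.

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - availability_pct) / 100.0
```

Over 30 days (43,200 minutes), 99.9% allows about 43.2 minutes of downtime, 99.95% about 21.6 minutes, and 99.99% about 4.3 minutes.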

Sizing assumptions (example)

Workload tiers

Tier 1 (Dev/PoC): 50 RPS, p95 2 s, single region, soft multi-tenancy.
Tier 2 (Prod mid): 300 RPS, p95 1.5 s, active/active regions, partial isolation.
Tier 3 (Regulated): 1000 RPS, p95 1.2 s, hard tenant isolation, residency pinned.
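
A rough way to sanity-check these tiers is Little's law, which relates concurrency to arrival rate and time in system. Strictly, the law uses mean latency; plugging in the p95 figure, as the sketch below does, is an assumption that yields a conservative upper estimate of in-flight requests.

```python
def in_flight_estimate(rps: float, latency_s: float) -> float:
    """Little's law: concurrent requests = arrival rate * time each request spends in the system."""
    return rps * latency_s


# Tier figures taken from the list above; p95 is used in place of mean latency.
TIERS = {
    "Tier 1": (50, 2.0),
    "Tier 2": (300, 1.5),
    "Tier 3": (1000, 1.2),
}

ESTIMATES = {name: in_flight_estimate(rps, p95) for name, (rps, p95) in TIERS.items()}
```

Under that assumption the tiers work out to roughly 100, 450, and 1,200 concurrent requests respectively, which gives a starting point for pod counts and connection-pool sizes.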

Service baseline (per region)

Conversation Manager / Compiler / Orchestrator: 6–12 pods, 2 vCPU / 4 GB; autoscale by p95 latency.
AI Gateway / LLM Router: 4–8 pods, 1–2 vCPU / 2–4 GB; autoscale by in-flight model calls and router latency.
Policy Engine: 4–8 pods, 1 vCPU / 2 GB; CPU-bound.
Knowledge Substrate: 6–10 pods, 4 vCPU / 8 GB; memory-heavy.
Memory Service: 4–8 pods, 2 vCPU / 4 GB; latency-sensitive.
Vector index: 3–6 shards, 8 vCPU / 32 GB; shard by tenant.
Graph store: 3 nodes, 8 vCPU / 32 GB; replicated.
Audit store: append-only; WORM-capable storage with daily compaction.