Deployment Blueprint
Multi-region deployment of the five planes — service placement, tenancy partitions, replication, replay, SLOs.
This blueprint maps the five planes to a deployable topology with tenant isolation, data residency, replay determinism, and SLO-driven reliability.
Executive summary
The runtime decomposes into stateless Decision-plane services, regional AI Gateway / LLM Router pods, regional stateful Intelligence-plane stores, a single Action-plane Tool Gateway per region, and a globally-replicated Trust-plane control plane. Replay is preserved by content-addressed pack, model profile, route decision, and snapshot pinning; audit is preserved by append-only WORM-capable storage.
Deployment topology (reference)
Cross-region replication: pack registry, model profiles, routing policy, audit, KG snapshot
Service placement by plane
| Plane | Class | Placement |
|---|---|---|
| Intelligence | stateful (KG, vectors, memory tier storage) | regional primary + cross-region replica; tenant-partitioned indices |
| Context | stateless (Compiler) | regional, autoscaled; cold-start dominated by pack registry hydration |
| Decision | stateless (Planner / Executor / Critic, subagent lanes) | regional; long-running session checkpoints in regional durable store |
| Model boundary | stateless (AI Gateway, LLM Router, provider adapters) | regional; routing decisions written to audit; model profiles signed by Trust control plane |
| Action | stateless (Tool Gateway) + stateful (idempotency cache) | regional; egress credentials brokered per call |
| Trust | global (policy bundle registry, audit store, pack registry) | active-active with strict write controls; signed promotion |
Multi-region strategy
- Active/active for serving APIs (`invokeAgent`, retrieval, recall).
- Regional AI Gateway / LLM Router for model calls; fallback cannot cross residency, classification, or capability boundaries.
- Active/passive for batch processing and embedding pipelines.
- Cross-region replication for KG snapshots, audit, pack registry, Decision Record store.
- Region pinning for regulated tenants and data-residency compliance.
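The fallback rule above (no crossing of residency, classification, or capability boundaries) can be read as a candidate filter that runs before any route is chosen. The following is a minimal sketch under stated assumptions: `ModelProfile`, its fields, and `eligible_fallbacks` are hypothetical names for illustration, and the classification ordering reuses the labels this blueprint applies at ingestion.

```python
from dataclasses import dataclass

# Classification labels from this blueprint, ordered least to most sensitive.
CLASS_ORDER = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]


@dataclass(frozen=True)
class ModelProfile:
    # Hypothetical shape; the real model profile schema is not shown in the source.
    name: str
    region: str
    max_classification: str
    capabilities: frozenset


def eligible_fallbacks(profiles, *, allowed_regions, classification, required_capabilities):
    """Keep only profiles that stay inside all three boundaries."""
    rank = CLASS_ORDER.index(classification)
    return [
        p
        for p in profiles
        if p.region in allowed_regions                          # residency
        and CLASS_ORDER.index(p.max_classification) >= rank     # classification
        and required_capabilities <= p.capabilities             # capability
    ]
```

An empty result means there is no eligible alternate, in which case the call should fail rather than route outside the boundary.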
Tenancy partitions
- Per-tenant namespaces for KG and memory indices.
- Row / column security for source-of-truth stores.
- Tenant-scoped model profile allowlists, routing budgets, and provider data-retention policy at the AI Gateway.
- Scoped credentials and per-tenant rate budgets at the Tool Gateway.
- Hard multi-tenancy (isolated clusters) for regulated tenants.
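As an illustration of per-tenant rate budgets, the sketch below keeps one token bucket per tenant so that one tenant exhausting its budget does not affect another. The token-bucket algorithm is an assumption chosen for the example; the source does not say how the Tool Gateway meters budgets. `TokenBucket` and `TenantRateBudgets` are hypothetical names, and the clock is passed in explicitly to keep the behavior deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_per_s: float
    tokens: float
    updated_at: float

    def try_take(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, capped at capacity, then try to spend.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateBudgets:
    """One bucket per tenant, created lazily on first use."""

    def __init__(self, capacity: float, refill_per_s: float):
        self._capacity = capacity
        self._refill_per_s = refill_per_s
        self._buckets = {}

    def allow(self, tenant_id: str, now: float) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill_per_s, self._capacity, now)
            self._buckets[tenant_id] = bucket
        return bucket.try_take(now)
```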
Caching
- L1 in-process cache for compiled `CompiledContext` (short TTL keyed by `pack_version + snapshot_version + request_hash`).
- L2 regional cache with tenant-keyed namespaces.
- Cache invalidation tied to: pack promotion, snapshot promotion, policy bundle promotion, model-profile promotion, and routing-policy promotion.
- Embedding cache keyed by `(model_version, text_hash, tenant_id)` when the pack permits.
- Prompt-response caching is disabled by default at the AI Gateway and enabled only when tenant policy, classification, and profile allow it.
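The cache keys described above could be derived roughly as follows. This is a sketch, not the blueprint's implementation: SHA-256, canonical JSON serialization of the request, and the helper names are assumptions. The tenant prefix on the compiled-context key reflects the tenant-keyed namespaces of the L2 cache.

```python
import hashlib
import json


def request_hash(request: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that logically equal
    # requests hash identically regardless of key order.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compiled_context_key(tenant_id: str, pack_version: str,
                         snapshot_version: str, request: dict) -> str:
    # A new pack or snapshot version changes the key, so entries compiled
    # against an older version are never served after a promotion.
    return f"{tenant_id}:{pack_version}:{snapshot_version}:{request_hash(request)}"


def embedding_key(model_version: str, text: str, tenant_id: str) -> tuple:
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return (model_version, text_hash, tenant_id)
```

Putting the versions in the key complements explicit invalidation: a promotion makes old entries unreachable even if an invalidation message is delayed.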
SLO wiring
- Per-tenant Run Budget enforcement at the Conversation Manager boundary.
- Model-call latency, token, and cost budgets enforced at the AI Gateway.
- Step timeouts enforced by the Tool Gateway, not by the adapter.
- SLO dashboards: p95 compile, p95 router decision, p95 gateway overhead, p95 retrieval, p95 recall, p99 end-to-end Run Context wall-clock.
- Error budgets banded by `risk_class`.
Deterministic replay
- Pack registry is content-addressed and signed.
- Model profiles, routing rules, and `RoutingDecision` records are pinned.
- KG snapshots are pinned per environment; promotions are deliberate.
- Audit store retains tool transcripts, policy decisions, approval records.
- Replay re-derives the Critic verdict and Decision Record without re-executing tools or silently changing the model route.
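One way to picture content-addressed pinning is a manifest that maps each replay input to a digest, checked before a replay starts. The sketch below assumes SHA-256 digests and a flat manifest; `content_address`, `verify_pins`, and the manifest layout are hypothetical, and signature verification is out of scope here.

```python
import hashlib


def content_address(payload: bytes) -> str:
    # The address is derived from the bytes themselves, so any change to the
    # artifact produces a different address.
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def verify_pins(manifest: dict, artifacts: dict) -> list:
    """Return the names of pinned artifacts that are missing or have drifted."""
    mismatches = []
    for name, pinned_digest in manifest.items():
        blob = artifacts.get(name)
        if blob is None or content_address(blob) != pinned_digest:
            mismatches.append(name)
    return mismatches
```

A replay would proceed only when `verify_pins` returns an empty list; any mismatch means the replay would not be run against the inputs of the original run.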
Stateful stores
- KG primary: regional, sharded by tenant; cross-region replicas for read-only failover.
- Vector / BM25 indices: rebuilt from KG snapshot; do not lead the snapshot.
- Memory tier storage: regional primary with WAL replication; per-tenant TTL.
- RoutingDecision store: append-only; joins model profile, routing rule, provider adapter, and trace ID.
- Decision Record store: append-only; cross-region replicated; WORM-capable for audit class.
- Audit store: append-only, WORM-capable, replicated.
Scaling model
- Stateless services: HPA by p95 latency or in-flight count.
- Stateful services: vertical + sharded by tenant or entity type.
- Queues: backpressure with dead-letter queues for tool execution and batch jobs.
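For stateless services scaled by the Kubernetes Horizontal Pod Autoscaler, the documented scaling rule is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below restates that rule in Python for reasoning about pod counts; the function name and the min/max clamp arguments are illustrative, and the sample numbers in the note are assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 6 pods observing a p95 of 400 ms against a 250 ms target would scale to 10 pods, provided the configured bounds allow it.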
Data residency and isolation
- Data classified at ingestion (`PUBLIC` / `INTERNAL` / `CONFIDENTIAL` / `RESTRICTED`).
- Residency constraints enforced at the Conversation Manager and storage layers.
- Cross-region replication excluded for `RESTRICTED` classes.
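The replication rule can be expressed as a small decision function at the storage layer. This is a sketch under assumptions: the function name, the region identifiers, and the per-tenant `allowed_regions` set (standing in for residency pinning) are hypothetical.

```python
def replication_targets(classification: str, home_region: str,
                        replica_regions: list, allowed_regions: set) -> list:
    """Regions a record may be replicated to, given its class and residency."""
    # RESTRICTED data never leaves its home region.
    if classification == "RESTRICTED":
        return []
    # Other classes replicate only to regions the tenant's residency allows.
    return [
        region
        for region in replica_regions
        if region != home_region and region in allowed_regions
    ]
```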
Runbooks (high level)
- Provider outage: AI Gateway routes to an eligible alternate profile; degrade quality tier only within residency, capability, and budget constraints; log scorecard delta.
- Latency spike: reduce context and graph hop budgets within policy, lower model route tier if allowed, and enable cache; do not silently drop required evidence.
- Cost spike: enforce Run Budget caps, select lower-cost model profiles when policy allows, and use cached read-only aliases where valid; do not change approval-mode semantics for side-effecting tools.
- KG index corruption: promote replica; replay from audit logs.
- Audit store unavailability: refuse `destructive` actions; serve `read_only` and `local_write` only.
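The audit-unavailability runbook amounts to a fail-closed admission check. A minimal sketch, assuming each action carries one class label and that any class other than `read_only` and `local_write` is refused while the audit store is down; `admit_action` is a hypothetical name.

```python
# Classes that may still be served while the audit store is unavailable.
ALLOWED_WITHOUT_AUDIT = frozenset({"read_only", "local_write"})


def admit_action(action_class: str, audit_available: bool) -> bool:
    if audit_available:
        return True
    # Fail closed: destructive and unknown classes are refused, because their
    # effects could not be recorded.
    return action_class in ALLOWED_WITHOUT_AUDIT
```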
Reference SLOs (example)
| Service | p95 latency | Availability | Notes |
|---|---|---|---|
| Conversation Manager | 100 ms | 99.95% | Envelope acceptance |
| Context Pack Compiler | 250 ms | 99.9% | Cached on hot paths |
| AI Gateway | 300 ms | 99.95% | Gateway overhead, excludes provider generation |
| LLM Router | 50 ms | 99.95% | Candidate filter + route decision |
| Knowledge Substrate (GraphRAG) | 500 ms | 99.9% | Snapshot-pinned |
| Memory Service (recall) | 200 ms | 99.9% | Tenant-scoped |
| Tool Gateway | 1500 ms | 99.9% | Includes adapter call |
| Policy Engine | 100 ms | 99.95% | Pre/mid/post |
| Audit store write | 200 ms | 99.99% | WORM-capable |
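To make the availability column concrete, an availability target can be converted into an error budget of allowed downtime per window. The 30-day window below is an assumption; the source does not state a measurement window.

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.2f} min per 30 days")
```

Over 30 days, 99.9% allows roughly 43.2 minutes of downtime, 99.95% roughly 21.6 minutes, and 99.99% roughly 4.3 minutes.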
Sizing assumptions (example)
Workload tiers
- Tier 1 (Dev/PoC): 50 RPS, p95 2s, single region, soft multi-tenancy.
- Tier 2 (Prod mid): 300 RPS, p95 1.5s, active/active regions, partial isolation.
- Tier 3 (Regulated): 1000 RPS, p95 1.2s, hard tenant isolation, residency pinned.
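A rough way to sanity-check these tiers is Little's law: requests in flight are approximately arrival rate times latency. Using the p95 figure instead of the mean overstates typical concurrency, so the result is best read as a conservative estimate. The per-pod concurrency in the usage note is an assumption, not a figure from this blueprint.

```python
import math


def in_flight_requests(rps: float, latency_s: float) -> float:
    # Little's law: L = lambda * W.
    return rps * latency_s


def pods_needed(rps: float, latency_s: float, per_pod_concurrency: int) -> int:
    return math.ceil(in_flight_requests(rps, latency_s) / per_pod_concurrency)
```

Tier 3 at 1000 RPS and 1.2 s gives about 1200 requests in flight; at an assumed 100 concurrent requests per pod, that is 12 pods before any headroom is added.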
Service baseline (per region)
- Conversation Manager / Compiler / Orchestrator: 6–12 pods, 2 vCPU / 4 GB; autoscale by p95 latency.
- AI Gateway / LLM Router: 4–8 pods, 1–2 vCPU / 2–4 GB; autoscale by in-flight model calls and router latency.
- Policy Engine: 4–8 pods, 1 vCPU / 2 GB; CPU-bound.
- Knowledge Substrate: 6–10 pods, 4 vCPU / 8 GB; memory-heavy.
- Memory Service: 4–8 pods, 2 vCPU / 4 GB; latency-sensitive.
- Vector index: 3–6 shards, 8 vCPU / 32 GB; shard by tenant.
- Graph store: 3 nodes, 8 vCPU / 32 GB; replicated.
- Audit store: append-only; WORM-capable storage with daily compaction.