Security and Compliance

Trust-plane controls — policy outside agent code, identity propagation, approval-mode tiers, sandboxing, attestation, and OTEL-bound audit.

Living Document
At a glance

Security in ContextOS is a runtime primitive, not a perimeter add-on. It is implemented by the Trust plane, which sits over every other plane and decides what is allowed before any side effect can occur.

Definition

A coordinated set of controls: deterministic policy enforcement at every plane boundary; user-delegated and agent-workload identity propagation through every Run Context; the approval-mode tier taxonomy bound to every adapter capability; cryptographic provenance (signed artifacts, hash-chained audit, optional attestation); OTEL-tied audit; and replayable evidence. No control depends on model self-policing.

Why it exists

Agents act on production systems. Prompts leak, tools misbehave, models drift, third parties go rogue. Security has to be enforced where actions actually happen — at the Tool Gateway, at the policy boundary, at the memory promotion gate, at the sandbox kernel — and recorded in a way that survives audit and incident response.

How it works

  1. Compile-time controls — the Context Pack Compiler emits runtime_controls (must_refuse, must_escalate, approval_gates_active, redaction_rules_active) baked into the prompt manifests.
  2. Plan-time controls — the Critic verifies plans against tool allow-lists, evidence requirements, and approval-mode declarations.
  3. Execute-time controls — the Tool Gateway brokers identity, validates schemas, enforces approval modes, and emits trace + audit envelopes.
  4. Memory-write controls — the memory promotion pipeline gates writes by class, contradiction check, and consent.
  5. Audit — every applied policy decision, gate verdict, tool transcript, and memory promotion is recorded against the trace_id of the Run Context and is hash-chained for tamper-evidence.

Threat model

ContextOS treats the agent system as multi-principal under partial trust, following the framing in Trusted AI Agents in the Cloud:

  • The agent provider ships agent code and policies.
  • The model provider ships base and tuned models.
  • The tool ecosystem (MCP servers, OpenAPI services, A2A peers) defines capabilities.
  • The cloud / runtime hosts execution.
  • The enterprise / user delegates authority and demands proof.

No single principal is fully trusted. Compromise of the model, of any single tool, or of any single tenant must not compromise the whole system. The architecture’s job is to make every action mediated, evidenced, and attributable.

Attack classes the architecture defends against

| Class | Concrete examples | Primary defense |
| --- | --- | --- |
| Prompt / context injection | indirect injection from KG documents, tool outputs, memory recall | redaction at compile; deterministic policy at execute; tool surface narrowed to capability allow-list |
| Excessive agency | model invokes tools outside intent | capability_class + intent shape verified by the Critic; act denied for read_only intents |
| Credential exfiltration | tool argument carries a secret to a third-party domain | egress allow-list; STS-style per-call credential exchange; redaction of secrets in compiled context |
| Tool argument tampering | model emits an out-of-bounds amount, recipient, or scope | arg_constraints on the capability registration; Critic re-verify; param-level policy rules |
| Cross-tenant traversal | KG read or memory recall reaches another tenant’s data | storage-level tenant_id scoping; cross-tenant denials emitted as security events |
| Repeated / amplified calls | model loops a destructive action | per-(tenant, capability) rate budgets; idempotency keys; numCalls policy constraint |
| Memory poisoning | adversarial input promoted to long-term memory | promotion-aware memory pipeline with consent + classification + contradiction checks |
| Supply chain | unsigned tool, mutable image, unpinned model | Context Pack and adapter signing; content-hash image pinning; pinned model + pack version per environment |
| Sandbox escape | code execution capability breaks isolation | default-deny sandbox profile; signed images; per-call attestation for e2b / firecracker |
| Audit tampering | post-hoc edit of decision history | append-only, hash-chained Decision Records bound to W3C trace_id |

OWASP LLM Top 10 mapping

| ID | Risk | ContextOS control |
| --- | --- | --- |
| LLM01 | Prompt injection | redaction-at-compile + tool allow-list + Critic re-verify (see Prompt injection defense) |
| LLM02 | Insecure output handling | typed tool envelopes; outputs flow through promotion pipeline before reaching memory or context |
| LLM03 | Training data poisoning | spec scope: not in runtime threat model; managed by upstream model provider controls |
| LLM04 | Model DoS | RunBudget caps (total_tokens, wall_clock_ms, max_tool_calls); rate budgets per (tenant, capability) |
| LLM05 | Supply chain | signed Context Packs and adapters; pinned model + pack versions; image content hashes for sandbox kernels |
| LLM06 | Sensitive info disclosure | redaction rules in runtime_controls; data classification on every artifact |
| LLM07 | Insecure plugin / tool design | typed capability schemas; arg_constraints; capability_class enforcement |
| LLM08 | Excessive agency | approval-mode tiers; intent-shape verification; default-deny on actions |
| LLM09 | Overreliance | Decision Record exposes evidence_refs and confidence; evaluators score safety per run |
| LLM10 | Model theft | spec scope: out of scope at runtime; managed at the model-serving boundary |

The architecture also references MITRE ATLAS tactics for AI-system threat modeling; per-intent adversarial suites are tracked via the Attack Success Rate metric described in the trusted-cloud reference.

Defense in depth

Every plane has its own deterministic check. Compromise of one layer must not compromise the whole.

| Layer | Control point | What is checked |
| --- | --- | --- |
| Compile | Context Pack Compiler | resolved tools = Registry ∩ Permissions − Prohibitions; runtime_controls baked into manifests |
| Plan | Critic | tools used ⊆ surfaced; intent shape (read_only excludes act); evidence requirements |
| Execute | Tool Gateway | identity, schema, approval mode, rate budget, idempotency, egress allow-list |
| Sandbox | Sandbox kernel | profile invariants enforced by the kernel, not by the adapter |
| Memory write | Promotion gate | classification, consent, contradiction |
| Audit | OTEL + Decision Record | hash-chained envelope linked by trace_id |
| Replay | Replay harness | DecisionRecord must be reproducible from pinned snapshot |
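
The compile-time and plan-time checks above reduce to plain set arithmetic. The following is a minimal sketch, not the ContextOS implementation: the function names and the capability identifiers are hypothetical, and only the formula Registry ∩ Permissions − Prohibitions and the "tools used ⊆ surfaced" rule come from the text.

```python
def resolve_tool_surface(registry: set, permissions: set, prohibitions: set) -> set:
    """Compile step: Registry ∩ Permissions − Prohibitions."""
    return (registry & permissions) - prohibitions


def plan_within_surface(plan_tools: set, surfaced: set) -> bool:
    """Plan step: every tool the plan uses must be in the surfaced set."""
    return plan_tools <= surfaced


# Hypothetical capability ids, for illustration only.
registry = {"kb.search", "payments.issue_refund", "email.send"}
permissions = {"kb.search", "payments.issue_refund", "crm.update"}
prohibitions = {"payments.issue_refund"}

surface = resolve_tool_surface(registry, permissions, prohibitions)
print(sorted(surface))                                            # ['kb.search']
print(plan_within_surface({"kb.search", "email.send"}, surface))  # False
```

Because the surface is computed before the model sees any tool schema, a plan that names email.send fails the subset check regardless of what the prompt contained.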

Boundary enforcement model

Every plane boundary is a deterministic check, not a model check:

  • Context → Decision: CompiledContext only contains capabilities the Compiler resolved through Registry ∩ Permissions − Prohibitions with approval-mode filters applied.
  • Decision → Action: the Critic re-verifies the Plan; the Tool Gateway re-validates each ToolCallEnvelope at execution time.
  • Action → External: egress credentials are exchanged per call (STS-style); ingress credentials are verified for both the delegated user and the agent workload identity.
  • Memory write: capture is immutable; promotion checks consent, classification, and contradiction.
  • External → Audit: every accepted or denied effect emits a signed envelope; the chain breaks if any past entry is altered.

Identity propagation

Two identities flow through every Run Context, and the agent side is split into registry identity, workload identity, and a short-lived signed claim:

  • Delegated user identity: run_context.user.delegation carries an OAuth-style token reference, scopes, and subject claims. Used by delegated-mode tools.
  • Agent registry identity: run_context.agent.agent_urn is a versioned subject such as agent:contextos/support-refund@1.2.0. Used for ownership, lifecycle, manifest-scoped permissions, and audit.
  • Agent workload identity: run_context.agent.workload_identity is a SPIFFE-style URI (spiffe://contextos/agents/<role>). Used for service-to-service auth and for non-delegated calls.
  • Agent identity claim: run_context.agent.identity_claim binds the registry subject, workload, tenant, run ID, principal chain, scopes, expiry, and key ID for this run or A2A hop.

Both identities are signed metadata on every ToolCallEnvelope and on every audit record.

Token chaining and minimum privilege

  • Delegation tokens are exchanged at the Gateway for per-call scoped credentials (STS-style); long-lived bearer tokens never reach an adapter.
  • Scopes on the exchanged credential are the intersection of delegation.scopes, agent.identity_claim.scopes, permission.scopes, and the capability’s declared scope set — never the union.
  • A2A hops do not forward the inbound delegation token or blindly reuse the parent claim. The receiving agent re-attests its workload identity, receives a fresh narrower child claim, and re-evaluates its own policy bundle. See A2A trust boundary.
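
The "intersection, never the union" rule can be sketched in a few lines. This is an illustration under stated assumptions: effective_scopes and the scope strings are hypothetical, and the four input sets mirror delegation.scopes, agent.identity_claim.scopes, permission.scopes, and the capability's declared scope set.

```python
def effective_scopes(*scope_sets: frozenset) -> frozenset:
    """Scopes granted to the exchanged credential: the intersection of all inputs."""
    if not scope_sets:
        return frozenset()  # nothing declared means nothing granted
    return frozenset(scope_sets[0]).intersection(*scope_sets[1:])


# Hypothetical scope names, for illustration only.
delegation_scopes = frozenset({"refund:read", "refund:write", "orders:read"})
claim_scopes = frozenset({"refund:read", "refund:write"})
permission_scopes = frozenset({"refund:write", "orders:read"})
capability_scopes = frozenset({"refund:write"})

granted = effective_scopes(
    delegation_scopes, claim_scopes, permission_scopes, capability_scopes
)
print(sorted(granted))  # ['refund:write']
```

A scope missing from any one of the four sets never reaches the per-call credential, so widening one input alone cannot widen the result.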

Workload identity rotation and revocation

  • Agent workload identities rotate on a fixed cadence; rotation is a signed change recorded against the agent registration.
  • A workload identity that fails attestation is denied at the Gateway before any policy is evaluated; this is logged as security_event:workload_identity_attestation_failed.
  • Agent signing keys rotate under change control. Historical keys may remain trusted only for replay until a declared deprecation window expires.
  • A revoked agent registration is removed from active discovery and denied at claim minting, A2A dispatch, and Tool Gateway execution.
  • Claim hashes, key IDs, agent URNs, and principal-chain summaries are logged; raw bearer tokens and raw signing secrets are never logged.

Approval-mode tiers (the policy contract)

The five canonical modes (read_only / local_write / network / delegated / destructive) are defined in Governance. Security implications:

  • read_only — no credential exchange beyond ingress; full audit; no approval gate.
  • local_write — idempotency required; tenant-scoped credentials; full audit.
  • network — egress allow-list; per-(tenant, capability) rate budget.
  • delegated — valid user delegation token required; scopes minimum-privilege; per-call evidence captured.
  • destructive — named approver; frozen evidence snapshot at gate; post-execution audit + reconciliation.

Policy may select a lower effective approval mode for a specific bounded request when the capability’s declared maximum allows it, but it cannot exceed that maximum.
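
A minimal sketch of the "lower is allowed, higher is not" rule follows. It assumes the five modes are totally ordered in the sequence they are listed above; that ordering, the ApprovalMode enum, and effective_mode are illustrative assumptions, since the authoritative definition lives in Governance.

```python
from enum import IntEnum


class ApprovalMode(IntEnum):
    # Assumed ordering: least to most privileged, as listed in the text.
    READ_ONLY = 0
    LOCAL_WRITE = 1
    NETWORK = 2
    DELEGATED = 3
    DESTRUCTIVE = 4


def effective_mode(policy_choice: ApprovalMode, declared_max: ApprovalMode) -> ApprovalMode:
    """Return the mode policy selected, refusing anything above the declared maximum."""
    if policy_choice > declared_max:
        raise PermissionError(
            f"{policy_choice.name} exceeds declared maximum {declared_max.name}"
        )
    return policy_choice


print(effective_mode(ApprovalMode.READ_ONLY, ApprovalMode.DELEGATED).name)  # READ_ONLY
```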

Prompt injection defense

Prompts in ContextOS are treated as untrusted by default. The defense is structural, not heuristic:

  • Compile-time isolation. The Context Pack Compiler emits typed buckets (policy / tool / evidence / memory / business / session). Untrusted strings (KG documents, tool outputs, memory recall) flow through evidence and memory buckets that the Critic understands as advisory, not authoritative.
  • Tool surface narrowing. The model only sees the schemas of tools the Compiler resolved through Registry ∩ Permissions − Prohibitions. An injection that names a tool not in the surfaced set has no effect — the tool simply does not exist in the model’s view.
  • Deterministic policy outside the model. Every ToolCallEnvelope re-evaluates policy at the Gateway. Even if the model is fully manipulated by an injected instruction, the call is denied unless deterministic policy permits it.
  • Argument constraints, not just tool gates. arg_constraints (regex, enum, min/max, idempotency key required) bound the values the model can submit. An injection that says “send to attacker@…” against a capability with to_in: ["@…"] returns denied.
  • Redaction rules. runtime_controls.redaction_rules_active strip declared sensitive substrings before any prompt is sent to the model and on every ToolResultEnvelope ingestion.
  • Critic re-verify before execute. The plan’s tools, arg shapes, and intent class are re-checked deterministically.
  • Default deny on every plane boundary. An unrecognized capability, scope, or destination denies; it does not fall through.
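
To make the argument-constraint bullet concrete, here is a hedged sketch of a deterministic check. The constraint dictionary shape (required, enum, min, max, regex), the function names, and the example.com pattern are assumptions for illustration; the source only states that arg_constraints cover regex, enum, min/max, and a required idempotency key.

```python
import re


def arg_violations(args: dict, constraints: dict) -> list:
    """Return every constraint violation; an empty list means the arguments pass."""
    found = []
    for name in args:
        if name not in constraints:
            found.append(f"{name}: undeclared argument")  # default deny
    for name, rule in constraints.items():
        if name not in args:
            if rule.get("required", False):
                found.append(f"{name}: required")
            continue
        value = args[name]
        if "enum" in rule and value not in rule["enum"]:
            found.append(f"{name}: not in enum")
        if "min" in rule or "max" in rule:
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                found.append(f"{name}: not a number")
                continue
            if "min" in rule and value < rule["min"]:
                found.append(f"{name}: below min")
            if "max" in rule and value > rule["max"]:
                found.append(f"{name}: above max")
        if "regex" in rule and re.fullmatch(rule["regex"], str(value)) is None:
            found.append(f"{name}: pattern mismatch")
    return found


def verdict(args: dict, constraints: dict) -> str:
    return "denied" if arg_violations(args, constraints) else "allowed"


# Hypothetical constraint set for a refund capability.
REFUND_CONSTRAINTS = {
    "amount": {"required": True, "min": 0.01, "max": 500},
    "currency": {"required": True, "enum": ["USD", "EUR"]},
    "to": {"required": True, "regex": r"[^@\s]+@example\.com"},
    "idempotency_key": {"required": True, "regex": r"[A-Za-z0-9_-]{8,64}"},
}

injected = {"amount": 25, "currency": "USD",
            "to": "attacker@evil.test", "idempotency_key": "refund_0001"}
print(verdict(injected, REFUND_CONSTRAINTS))  # denied
```

The check never inspects the prompt or the model's reasoning; it only looks at the submitted values, which is what keeps it deterministic.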

Indirect injection from third-party content is the dominant LLM-era attack class. The defense is not “smarter prompts” — it is moving authority out of the prompt entirely.

Tenant isolation

  • Every read in the Knowledge Graph and every recall in Memory is scoped by tenant_id at the storage layer, not by application code.
  • The Tool Gateway rejects any ToolCallEnvelope whose argument-derived tenant scope does not match run_context.tenant_id.
  • Cross-tenant traversal denials emit a security_event:cross_tenant_denied with the offending capability, evidence_refs, and the offending value; on-call is paged on a non-zero rate.
  • Adapter credentials are scoped per tenant; the Tool Gateway exchanges them per call rather than holding long-lived bearer tokens.
  • Caches (including the cached read-only aliases) key on tenant_id first; a cache hit across tenants is structurally impossible.
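
One way to picture tenant-first cache keys is a cache whose lookup key always begins with tenant_id. The class below is a hypothetical in-memory sketch, not the ContextOS cache; it only illustrates why a read for one tenant cannot return another tenant's entry.

```python
from typing import Any, Optional


class TenantScopedCache:
    """In-memory cache whose keys are (tenant_id, key) pairs."""

    def __init__(self) -> None:
        self._entries = {}

    def put(self, tenant_id: str, key: str, value: Any) -> None:
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._entries[(tenant_id, key)] = value

    def get(self, tenant_id: str, key: str) -> Optional[Any]:
        # The tenant is part of the key, so another tenant's entry is never visible.
        return self._entries.get((tenant_id, key))


cache = TenantScopedCache()
cache.put("tenant_acme_prod", "policy:returns", "v1")
print(cache.get("tenant_acme_prod", "policy:returns"))  # v1
print(cache.get("tenant_other", "policy:returns"))      # None
```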

Cryptographic primitives

The Trust plane treats provenance as a typed contract, not a deployment detail.

| Artifact | Signed by | Verified at |
| --- | --- | --- |
| Context Pack | pack owner role | Compiler at load; runtime refuses unsigned or unpinned refs |
| Adapter / capability registration | adapter owner | Gateway at registration; periodic re-verification |
| Policy bundle | governance role | Policy Engine at load; bundle.signed_by recorded with every policy_decision_id |
| Sandbox profile | security_lead | Sandbox kernel before container start |
| Decision Record | runtime signing key | replay harness; auditor on export |
| Tool transcript | runtime signing key | included in audit.tool_transcript_id |

Hash-chained audit

Decision Records, tool transcripts, and policy_decisions[] form an append-only, hash-chained log keyed by trace_id. Each entry’s hash includes the previous entry’s hash; tampering is detectable on replay. Hash-chain construction follows the pattern in the trusted-cloud reference.
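
A hash chain of this kind can be sketched with the standard library alone. The entry layout (payload, prev_hash, hash), the all-zero genesis value, and the canonical JSON encoding are assumptions made for the example; the real construction follows the trusted-cloud reference and also involves signatures, which are omitted here.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev,
                  "hash": entry_hash(prev, payload)})


def first_broken_index(chain: list) -> Optional[int]:
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        if entry["prev_hash"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return index
        prev = entry["hash"]
    return None


chain = []
append_entry(chain, {"policy_decision_id": "pol_9901", "verdict": "deny"})
append_entry(chain, {"tool_transcript_id": "tool_tx_118"})
print(first_broken_index(chain))         # None

chain[0]["payload"]["verdict"] = "allow"  # post-hoc edit
print(first_broken_index(chain))         # 0
```

Note that a chain like this detects edits to existing entries but not removal of the newest entries; catching truncation requires comparing against a head hash recorded somewhere the attacker cannot rewrite.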

Key management

  • Signing keys are stored in a KMS; the runtime never holds a raw private key.
  • Rotation is a signed change recorded against the registry; verifiers accept any non-revoked key within the rotation window.
  • Revoked keys remain queryable for replay of historical runs against the keys valid at that time.

Secrets and credential exchange

The Tool Gateway is the only component that holds long-lived credentials, and only as references.

  • Per-call exchange. Every ToolCallEnvelope triggers an STS-style exchange producing a credential scoped to (tenant, capability, run_context.user.subject, scopes ∩ permission.scopes) with a maximum lifetime no longer than the wall-clock budget.
  • Vault references, not values. auth.delegated_user_token_ref and auth.agent_token_ref are opaque vault handles. Adapters never see raw tokens.
  • No secret transit through the model. Secrets that must reach a tool (e.g., per-tenant API keys) bypass the Compiler — they are resolved at the Gateway from the (tenant, capability) mapping.
  • Permitted env vars (sandbox) are an explicit allow-list. The closed list is defined per profile and signed.
  • Rotation. Rotation cadence per credential class is declared at the adapter; the registry refuses adapters without a declared rotation contract.

Egress controls

network, delegated, and destructive capabilities can produce outbound traffic. The Gateway constrains every dimension:

  • Destination allow-list. Each capability declares endpoint_in[]. Calls to any other host are denied at the Gateway, not at the network.
  • DNS pinning. Hostnames in endpoint_in resolve through the runtime’s DNS layer, which refuses CNAMEs to non-allow-listed origins.
  • TLS pinning (optional). A capability may declare pinned_sha256[], the acceptable certificate fingerprints; rotation is a signed registry change.
  • Rate budget. Every (tenant, capability) declares a per-window rate; the Gateway returns denied with error_code: rate_exceeded rather than queueing.
  • Egress audit. Every outbound request emits a span carrying trace_id, policy_decision_id, capability id, and destination — sufficient for traffic-side reconciliation against the Decision Record.
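
The destination allow-list and the rate budget are both simple deterministic checks. The sketch below assumes exact host matching against endpoint_in, an https-only rule, and a fixed-window counter; those choices, along with the names egress_allowed and RateBudget, are illustrative assumptions and not the Gateway's actual algorithm.

```python
from urllib.parse import urlsplit


def egress_allowed(url: str, endpoint_in: list) -> bool:
    """Allow only https URLs whose host exactly matches an allow-listed host."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.hostname is None:
        return False
    return parts.hostname in {host.lower() for host in endpoint_in}


class RateBudget:
    """Fixed-window budget per (tenant, capability); over-budget calls are denied, not queued."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._counts = {}

    def try_consume(self, tenant_id: str, capability_id: str, now_s: float) -> bool:
        key = (tenant_id, capability_id, int(now_s // self.window_s))
        used = self._counts.get(key, 0)
        if used >= self.limit:
            return False  # caller maps this to error_code: rate_exceeded
        self._counts[key] = used + 1
        return True


ENDPOINT_IN = ["api.payments.example"]  # hypothetical allow-list
print(egress_allowed("https://api.payments.example/v1/refunds", ENDPOINT_IN))   # True
print(egress_allowed("https://api.payments.example.evil.test/x", ENDPOINT_IN))  # False
```

Exact matching is deliberate in this sketch: suffix or substring matching would accept look-alike hosts such as the second URL.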

Data classification and redaction

Every artifact crossing a plane boundary carries a classification:

| Class | Examples | Default handling |
| --- | --- | --- |
| PUBLIC | non-tenant docs, marketing copy | retained per default policy |
| INTERNAL | aggregate metrics, generic intents | retained per tenant policy |
| CONFIDENTIAL | customer PII, financial detail | redaction-on-emit; promotion-gated for memory |
| RESTRICTED | secrets, IDV documents, regulated payloads | never enters the prompt; never promoted to memory |

Redaction rules live in runtime_controls.redaction_rules_active and are evaluated at the Compiler (before prompt construction) and at the Gateway (on ToolResultEnvelope ingestion). A redaction failure is a release-blocking Safety regression.
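
As a rough sketch of how classification and redaction might combine before prompt construction: RESTRICTED content is dropped, CONFIDENTIAL content is redacted, and other classes pass through. The two regular expressions and the prepare_for_prompt helper are hypothetical; real redaction rules come from runtime_controls.redaction_rules_active.

```python
import re
from enum import Enum
from typing import Optional


class DataClass(Enum):
    PUBLIC = "PUBLIC"
    INTERNAL = "INTERNAL"
    CONFIDENTIAL = "CONFIDENTIAL"
    RESTRICTED = "RESTRICTED"


# Hypothetical rules: long digit runs (card-like numbers) and email addresses.
REDACTION_RULES = [
    re.compile(r"\b\d{13,16}\b"),
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
]


def prepare_for_prompt(text: str, data_class: DataClass) -> Optional[str]:
    """Return the text that may enter the prompt, or None if it must not."""
    if data_class is DataClass.RESTRICTED:
        return None  # never enters the prompt
    if data_class is DataClass.CONFIDENTIAL:
        for rule in REDACTION_RULES:
            text = rule.sub("[REDACTED]", text)
    return text


print(prepare_for_prompt("card 4111111111111111 for jo@example.com",
                         DataClass.CONFIDENTIAL))
# card [REDACTED] for [REDACTED]
```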

Data residency and retention bands are declared per classification and tenant; the audit trail records the band against every persisted artifact.

Memory-write controls

Memory is a primary security boundary because it is the channel by which untrusted content can persist into future runs.

  • Capture is immutable. Raw evidence is stored append-only with classification, source, and trace_id.
  • Promotion is explicit. A candidate is promoted only when it passes the consent-record check, the contradiction check against existing high-confidence facts, and the classification check (no RESTRICTED content is promotable).
  • Consent ledger. Every promotion records the consent basis: explicit user opt-in, contractual basis, or operator-authored policy. A promotion without a consent reference is rejected.
  • Tenant scoping. Memory recall scopes by tenant_id and subject claims; a cross-subject recall is a security_event:memory_cross_subject_denied.
  • Decay and erasure. Subject-level erasure cascades to embeddings, evidence references, and any derived strategy rules; a deletion request is itself recorded against trace_id for replay-aware audit.
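
The promotion checks can be sketched as a single deterministic gate. This is a simplification under stated assumptions: the Candidate fields, the rejection reason strings, and the equality-based contradiction test are hypothetical stand-ins for the real consent ledger and contradiction check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    fact_key: str
    value: str
    classification: str          # PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED
    consent_ref: Optional[str]   # reference into the consent ledger, if any


def promotion_verdict(candidate: Candidate, high_confidence_facts: dict) -> str:
    """Apply the classification, consent, and contradiction checks in order."""
    if candidate.classification == "RESTRICTED":
        return "rejected:restricted_classification"
    if not candidate.consent_ref:
        return "rejected:missing_consent"
    existing = high_confidence_facts.get(candidate.fact_key)
    if existing is not None and existing != candidate.value:
        return "rejected:contradiction"
    return "promoted"


facts = {"customer.tier": "gold"}
poisoned = Candidate("customer.tier", "platinum", "INTERNAL", "consent_42")
print(promotion_verdict(poisoned, facts))  # rejected:contradiction
```

In the example, an adversarial input that tries to overwrite a trusted fact is stopped at the gate even though it carries a consent reference.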

Sandbox controls

For tool capabilities that execute untrusted code (e.g., user-provided scripts in a research workflow), the runtime exposes a Sandbox layer as a first-class component, not as a tool feature flag.

Sandbox profiles

A sandbox profile is a typed, signed contract declaring what the sandbox can and cannot do.

{
  "sandbox_profile_id": "sbx_research_default",
  "version": "1.0.0",
  "kernel": "docker",
  "image_pin": "sha256:...",
  "filesystem": { "host_mount": [], "tmpfs_mb": 256 },
  "network": { "ingress": "deny", "egress": "deny", "allowlist": [] },
  "compute": { "cpu_cores": 1.0, "memory_mb": 512, "wall_clock_ms": 30000 },
  "stdio": { "stdin_max_bytes": 65536, "stdout_max_bytes": 1048576, "stderr_max_bytes": 1048576 },
  "secrets": { "permitted_env_vars": [] },
  "result_classification": "INTERNAL"
}

Sandbox kernels

| Kernel | When |
| --- | --- |
| docker | local-first deployments; image is pinned by content hash |
| e2b | hosted ephemeral sandbox; per-call attestation |
| firecracker | regulated workloads; microvm isolation |
| wasm | pure-compute thinking helpers; no syscalls |

Invariants

  • Default deny everything. Capabilities that need code execution must reference a registered sandbox_profile_id; profiles cannot be inlined.
  • No host filesystem by default. Any host_mount requires a separate operator approval recorded as a signed change to the profile.
  • No inbound network. Egress only via explicit allowlist.
  • Hard wall-clock and memory caps enforced by the kernel, not the adapter.
  • Output is typed. stdout/stderr are captured as artifacts with classification; nothing flows into memory or the model context without going through the promotion pipeline.
  • Image pinning is mandatory. Tag-based image references (latest, stable) are refused.
  • Per-call attestation for e2b / firecracker kernels; attestation hash recorded against the Decision Record.
  • No secrets transit. permitted_env_vars is the closed allow-list; the sandbox never sees the broader credential store.
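
Several of these invariants can be checked statically when a profile is validated. The validator below is a hedged sketch over the profile shape shown earlier: profile_violations and its messages are hypothetical, and the digest pattern assumes a full sha256 content hash of 64 lowercase hex characters.

```python
import re

KNOWN_KERNELS = {"docker", "e2b", "firecracker", "wasm"}
CONTENT_HASH = re.compile(r"sha256:[0-9a-f]{64}")  # assumed digest format


def profile_violations(profile: dict) -> list:
    """Return the invariants a sandbox profile breaks; empty means it passes."""
    found = []
    if profile.get("kernel") not in KNOWN_KERNELS:
        found.append("unknown kernel")
    if CONTENT_HASH.fullmatch(str(profile.get("image_pin", ""))) is None:
        found.append("image_pin is not a content hash")
    if profile.get("filesystem", {}).get("host_mount"):
        found.append("host_mount requires a separately approved, signed change")
    network = profile.get("network", {})
    if network.get("ingress") != "deny":
        found.append("inbound network must be denied")
    if network.get("egress") != "deny" and not network.get("allowlist"):
        found.append("egress requires an explicit allowlist")
    return found


profile = {
    "sandbox_profile_id": "sbx_research_default",
    "kernel": "docker",
    "image_pin": "sha256:" + "a" * 64,  # placeholder digest, not a real image
    "filesystem": {"host_mount": [], "tmpfs_mb": 256},
    "network": {"ingress": "deny", "egress": "deny", "allowlist": []},
}
print(profile_violations(profile))                                   # []
print(profile_violations({**profile, "image_pin": "python:latest"}))
# ['image_pin is not a content hash']
```

Wall-clock and memory caps are not covered here because the text assigns their enforcement to the kernel at run time, not to static validation.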

Lifecycle

  • Profile authored under change control with required reviewer (security_lead).
  • Promoted into the pack registry alongside Context Packs.
  • Bound to capabilities via sandbox_profile_id on the adapter declaration.
  • Retired profiles remain queryable for replay; cannot be re-promoted without a new version.

A2A trust boundary

A2A (agent-to-agent) hops are the highest-leverage place for trust to leak. The contract:

  • No delegation forwarding. The receiving agent does not inherit the caller’s delegation token. It re-attests its own workload identity and re-evaluates its own policy bundle against the original run_context.user.
  • Narrow child claims. A parent A2A call may authorize a child only through a fresh signed claim whose scopes are a subset of the parent scopes and the receiver manifest’s identity-scope ceiling.
  • Typed message envelopes with explicit correlation_id and parent_decision_id so the receiving Decision Record links to the parent for replay.
  • Approval-mode propagation. A delegated or destructive parent call that triggers an A2A hop must be re-acknowledged by the receiving agent; the receiver may select a lower effective mode only when its policy and the capability’s declared maximum allow it.
  • Per-hop audit. Every A2A message emits a span with the agent.agent_urn, agent.workload_identity, claim hash, and principal chain of both sender and receiver; the chain is verifiable on replay.
  • Loop guard. A2A graph depth and cycle detection are enforced by the Orchestrator; cycles are rejected with error_code: a2a_cycle_detected.
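
Two parts of this contract, narrow child claims and the loop guard, lend themselves to a short sketch. Everything here is illustrative: A2AError, child_scopes, extend_chain, the max_depth parameter, and the error codes other than a2a_cycle_detected are assumptions, as is the second agent URN.

```python
class A2AError(Exception):
    def __init__(self, error_code: str) -> None:
        super().__init__(error_code)
        self.error_code = error_code


def child_scopes(parent_scopes: frozenset, receiver_ceiling: frozenset,
                 requested: frozenset) -> frozenset:
    """A child claim may only carry scopes inside parent ∩ receiver ceiling."""
    if not requested <= (parent_scopes & receiver_ceiling):
        raise A2AError("a2a_scope_escalation")  # hypothetical code
    return requested


def extend_chain(principal_chain: list, receiver_urn: str, max_depth: int) -> list:
    """Reject cycles and over-deep graphs before dispatching the hop."""
    if receiver_urn in principal_chain:
        raise A2AError("a2a_cycle_detected")
    if len(principal_chain) >= max_depth:
        raise A2AError("a2a_depth_exceeded")  # hypothetical code
    return [*principal_chain, receiver_urn]


chain = ["agent:contextos/support-refund@1.2.0"]
chain = extend_chain(chain, "agent:contextos/payments@2.0.0", max_depth=4)
print(len(chain))  # 2

try:
    extend_chain(chain, "agent:contextos/support-refund@1.2.0", max_depth=4)
except A2AError as err:
    print(err.error_code)  # a2a_cycle_detected
```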

Supply chain

The runtime treats every artifact as supply-chain-relevant:

  • Context Packs. Pack registry holds pack_id@semver with content hash, signing key id, and an SBOM-style dependency tree (referenced packs, decision catalog versions, prompt fragment refs).
  • Adapters. Each adapter declares image hash (for hosted), source repo + commit (for in-repo), and the auth contract. Tag-based references are refused.
  • Models. Model id + content hash recorded against every Decision Record’s lineage; mismatch on replay surfaces as a non-determinism event.
  • Sandbox images. Pinned by content hash; tag references like latest / stable are rejected at profile validation.
  • Build provenance. Packs and adapters are produced through a CI pipeline that emits provenance attestations (Sigstore-compatible); the registry refuses artifacts whose attestation does not verify against the declared signer.

Attestation

For deployments that require cryptographic proof of “what executed what against which inputs,” ContextOS aligns with the differential-attestation pattern in the Trusted AI Agents in the Cloud reference. The Decision Record can carry an attestation_ref that binds the platform measurement, the artifact manifest (pack, model, policies, tools), the input envelope, the policy decisions, and the result. The replay harness consumes the manifest to reproduce the verdict; an external auditor can verify the record without re-executing tools.

This is optional infrastructure, not a baseline requirement: deployments that do not need attested execution still get hash-chained audit and signed Decision Records.

Boundary controls (this repo)

This repo is the public spec surface. Even there, controls are explicit:

  • Centralized security headers in next.config.ts:
      • Content-Security-Policy: default-src 'self'; script-src includes 'wasm-unsafe-eval' for Pagefind WebAssembly search without enabling production 'unsafe-eval'; frame-ancestors 'none'; object-src 'none'; upgrade-insecure-requests; block-all-mixed-content
      • Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
      • X-Frame-Options: DENY
      • Referrer-Policy: strict-origin-when-cross-origin
      • Permissions-Policy: camera=(), microphone=(), geolocation=()
      • X-Content-Type-Options: nosniff
      • Cross-Origin-Opener-Policy: same-origin
      • Cross-Origin-Resource-Policy: same-origin for pages, cross-origin on static and generated social preview image routes
      • X-Permitted-Cross-Domain-Policies: none
  • Optional Basic Auth at the edge via src/proxy.ts for private review deployments (gated by BASIC_AUTH_ENABLED, bypassed for local development, loopback hosts, and canonical public domains).
  • CI enforces npm audit --audit-level=high, gitleaks for credential scanning (with # gitleaks:allow for documented examples), and Semgrep p/ci for static analysis.
  • Dependabot updates skip Vercel deployments via vercel.json’s ignoreCommand to keep the deploy queue clean.

Compliance mapping

ContextOS aligns with widely-adopted control frameworks. The mapping is intentionally control-to-primitive, not control-to-document.

| Framework | ContextOS primitive |
| --- | --- |
| NIST AI RMF Govern / Map / Measure / Manage | Trust-plane bundles / Decision Catalog / evaluators / audit + replay |
| ISO/IEC 42001 AI management system | policy lifecycle + evaluator runs + improvement-proposal change control |
| ISO/IEC 27001 / SOC 2 | tenant isolation, identity propagation, hash-chained audit, KMS key rotation, vendor (adapter) management |
| EU AI Act | see Governance regulatory timeline; transparency obligations encoded as runtime-enforced rules |
| GDPR / India DPDPA | consent ledger on memory promotion; subject erasure cascades; data classification + residency on every artifact |
| OWASP LLM Top 10 | see the mapping table |
| MITRE ATLAS | adversarial suites tracked per intent; Attack Success Rate metric |

Vulnerability disclosure

Security issues in this repository should be reported per SECURITY.md: email piyush@piyush.me with a description, impact, reproduction steps, and any PoC. Public disclosure is requested only after coordinated investigation. Only the latest release is supported for security updates.

Auditability

Every Decision Record carries enough to reconstruct the run:

  • trace_id, run_id, session_id, tenant_id, pinned pack version, snapshot version.
  • agent_identity.subject, agent_identity.claim_hash, principal_chain, and identity key ID.
  • policy_decisions[] with policy_decision_id and matched rule_ids[].
  • tool_transcripts[] with policy_decision_id, evidence_refs, and audit metadata.
  • approvals[] with approver identity, frozen evidence snapshot hash, effective approval mode.
  • controls_active[] (must_refuse, must_escalate, approval_gates_active, redaction_rules_active).
  • lineage — pack version, model id + hash, snapshot id, policy bundle versions.

Audit envelopes are append-only, hash-chained per trace_id, and signed by the runtime signing key. Tampering is detectable on replay because the chain breaks; replay against a tampered chain returns replay_status: tamper_detected rather than a recomputed verdict.

This is the substrate for replay.

Incident response and replay

Replay is the primary IR primitive. Given a trace_id:

  1. Resolve the pinned pack_version, snapshot, and policy bundle versions from lineage.
  2. Recover the recorded invokeAgent envelope and tool transcripts.
  3. Re-run the canonical loop offline against the recorded transcripts; verify the produced DecisionRecord byte-matches the persisted one.
  4. Re-score against the current evaluators; the delta isolates whether behavior has drifted in the runtime, the data, or the evaluator itself.
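
The byte-match in step 3 depends on a stable serialization of the record. The sketch below assumes canonical JSON (sorted keys, compact separators) as that serialization; the helper names and the match / mismatch statuses are hypothetical, while tamper_detected is the status named in the Auditability section.

```python
import json


def canonical_bytes(record: dict) -> bytes:
    """Assumed canonical form: sorted keys, compact separators, UTF-8."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def replay_status(persisted: dict, replayed: dict, chain_intact: bool) -> str:
    """Compare a replayed Decision Record with the persisted one."""
    if not chain_intact:
        return "tamper_detected"  # never recompute a verdict over a broken chain
    if canonical_bytes(persisted) == canonical_bytes(replayed):
        return "match"
    return "mismatch"


persisted = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "verdict": "allow"}
replayed = {"verdict": "allow", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}
print(replay_status(persisted, replayed, chain_intact=True))   # match
print(replay_status(persisted, replayed, chain_intact=False))  # tamper_detected
```

Key order differs between the two records in the example, which is why the comparison runs on the canonical bytes instead of on the raw stored text.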

Operational commitments:

  • Quarterly drill end-to-end against a chosen production trace_id.
  • Tail-based sampling retains every run that crossed destructive, that hit a loop guard, or that failed scorecard thresholds — these are the IR-relevant runs by construction.
  • Time-to-replay is an operational metric; an IR contract that cannot replay within hours is treated as a regression.
  • Rotation-aware verification. Replays against historical signing keys are accepted; rotation does not invalidate prior audit.

Interfaces

Inputs

  • Policy bundles (versioned, signed)
  • Identity assertions (delegated user, agent workload, optional attestation report)
  • Adapter capability declarations (with approval_mode, capability_class, arg_constraints, endpoint_in, rotation contract)
  • Sandbox profiles (signed)
  • Run Context (tenant_id, role, claims, budgets)

Outputs

  • Allow / deny verdicts with reasons and remediation hints
  • Approval-gate prompts with frozen evidence
  • Audit records bound to trace_id and signed
  • Security events (cross-tenant denials, credential rotations, sandbox violations, attestation failures, A2A cycle detections)
  • Replay datasets

Failure modes

  • Policy drift across environments.
  • Stale credentials at the Tool Gateway after a rotation.
  • Revoked agent registration still discoverable by a stale A2A card or adapter cache.
  • Child agent claim broader than parent claim or manifest ceiling.
  • Approval gates bypassed by a Planner that skips a checkpoint (mitigated by Critic re-verify).
  • Memory write that violates classification (mitigated by consent + classification check at candidate stage).
  • Audit gap when a custom adapter forgets to propagate W3C trace headers (caught by trace-coverage assertion).
  • Cached read-only alias hiding a permission change after a policy update.
  • Indirect prompt injection from a recently-promoted memory entry that bypassed candidate-stage classification.
  • Egress allow-list expressed as a tag/CNAME that resolves outside the intended origin (mitigated by DNS pinning).
  • A2A receiver that accepts inbound delegation without re-attesting workload identity.
  • Hash-chain truncation from a partial outage masquerading as a legitimate gap.

Operational concerns

  • Policy version pinning per environment; promotion is a deliberate step.
  • Secrets rotation cadence at the Tool Gateway, signed and registry-recorded.
  • Workload identity rotation cadence; revoked-keys window for replay.
  • Agent manifest lifecycle review and revoked-agent discovery sweeps.
  • Trusted-previous-key deprecation schedule after signing-key rotation.
  • Sampling stratified by risk tier; tail-based sampling forced for destructive runs.
  • Trace retention bands by data classification.
  • Quarterly IR drill that exercises replay end-to-end against a real trace_id.
  • DNS / TLS pinning maintenance for endpoint_in[] capabilities.
  • SBOM and adapter-attestation review on every adapter version bump.
  • Sandbox profile re-verification on kernel upgrade.

Evaluation metrics

  • Policy compliance rate (target: 100% on guardrails).
  • Approval-gate honored rate (target: 100%).
  • Cross-tenant denial rate (target: zero in steady state; non-zero pages on-call).
  • Audit coverage (fraction of runs with full trace + manifests + decision record).
  • Replay determinism rate (DecisionRecord byte-match against pinned snapshot).
  • Time to replay for a given trace_id.
  • Mean time to detect / respond on Trust-plane events.
  • Attack Success Rate on the per-intent adversarial suite (target: zero successful attacks on gated release suites; see the trusted-cloud reference).
  • Redaction failure rate (target: 0% on CONFIDENTIAL / RESTRICTED classes).

Example

A condensed Trust-plane receipt on a ToolResultEnvelope:

{
  "tool_call_id": "tc_118",
  "run_id": "req_9f3a12",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "capability_id": "adp_payments.issue_refund",
  "tenant_id": "tenant_acme_prod",
  "status": "completed",
  "citations": ["policy:POLICY_RETURNS_V1#R_HIGH_VALUE_REQUIRES_APPROVAL"],
  "mutations": [{ "mutation_ref": "tool:adp_payments.issue_refund:tc_118" }],
  "policy_decision_id": "pol_9901",
  "metadata": {
    "approval_mode_effective": "destructive",
    "approver": "user_finance_lead_77",
    "approval_evidence_snapshot_hash": "sha256:b2a1...",
    "tool_transcript_id": "tool_tx_118",
    "redaction_applied": false,
    "chain_prev_hash": "sha256:7c4a...",
    "signed_by": "kid_runtime_2026Q2"
  },
  "latency_ms": 242
}

Common misconceptions

  • Security is not a single perimeter. It is enforced at every plane boundary.
  • The model is not the security boundary. Policy outside agent code is.
  • Audit is not logging. Audit is structured, signed, hash-chained, replayable, and tied to a DecisionRecord.
  • Sandbox is not a backup plan. It is the default for any capability that runs untrusted code.
  • Prompt injection is not solved by better prompts. The defense is moving authority out of the prompt — tool surfacing, deterministic policy, arg constraints.
  • A2A is not a way to launder authority. Each hop re-attests, re-evaluates, and re-records.
  • Compliance is a byproduct. The runtime produces audit and replay material as part of normal operation; compliance reports are derived, not authored.