Most agentic AI systems start the same way: a strong model, a prompt, a tool list, a vector database, and a demo path where the agent appears to reason its way through work.
That stack is enough to impress a room. It is not enough to run a business workflow where money, customer state, regulated data, or production infrastructure can change.
The difference is not model intelligence. The difference is whether the system around the model can prove what context was used, what authority was granted, which policy ran, which tool executed, what evidence supported the decision, whether the result can be replayed, and how the next release will get safer.
That system is the harness.
ContextOS is the harness architecture for agentic AI. It turns an agent from “LLM plus tools” into a governed decision runtime.
Research readout
The current agent literature converges on a practical point: autonomy increases capability, but it also increases the need for runtime control.
| Source | Practical signal | What it means for production agents |
|---|---|---|
| Anthropic: Building effective agents | Start simple; add agentic complexity only when it demonstrably improves outcomes. Agentic systems trade latency and cost for task performance. Autonomous loops need guardrails, sandboxing, clear tools, and stopping conditions. | A production harness should make the simplest safe path easy and make extra autonomy pay rent through measured outcomes. |
| OpenAI Agents SDK docs | Agent apps need orchestration, tools, handoffs, guardrails, tracing, sandbox execution, human review, and evaluation loops as first-class concerns. | The platform shape is no longer “call model.” It is “operate runs.” ContextOS adds the cross-vendor contract around those runs. |
| OpenAI guardrails and human review | Guardrails decide whether a run should continue, pause, or stop; tool-level checks matter around side effects; approvals need resumable state. | Safety belongs at input, output, tool, and approval boundaries, not only in a system prompt. |
| LangSmith observability docs | Agent traces should capture model calls, tool calls, decision points, and production behavior. | Observability is necessary but not sufficient. ContextOS turns traces into replayable decision evidence. |
| NIST AI RMF Core | AI risk management needs context mapping, measurement, monitoring, human oversight, documented controls, and continuous updates as risks evolve. | Harness engineering is the runtime form of AI risk management: map, measure, manage, and govern every run. |
| OWASP Top 10 for LLM Applications 2025 and OWASP Agentic guidance | Agentic systems expand the security problem from bad output to tool misuse, excessive agency, memory poisoning, identity abuse, cascading failures, and rogue autonomous behavior. | A serious agent platform needs least privilege, memory governance, replay, and emergency rollback as normal runtime behavior. |
Before and after matrix
Read this like a design review checklist. If the “before” cell describes your current system, the row names the production gap.
| Area | Before ContextOS: common agent stack | What breaks | After ContextOS / harness engineering | Proof a geek can inspect |
|---|---|---|---|---|
| Unit of work | A user message enters a prompt and the agent decides what to do. | There is no stable runtime object to authorize, measure, replay, or compare. | Every request enters as a RunContext with tenant, user, delegation, intent, safety mode, trace, and budget. | run_id, trace_id, tenant_id, intent, safety_mode, run_budget in the invocation envelope. |
| Task boundary | “You are a support agent” plus natural-language instructions. | Scope expands by accident. The same prompt handles refunds, fraud, shipping, retention, and escalations. | The Intent-Task Catalog maps raw requests to canonical intents and approved task templates. | Intent classification record, intent_ref, task_template_id, allowed task list. |
| Context | RAG pulls whatever looks semantically close at run time. | Stale docs, policy conflicts, tenant leakage, and silent context truncation become invisible causes of bad actions. | A Context Pack declares source priorities, buckets, budgets, policies, tools, memory, evals, and decision specs. | pack_id@version, snapshot_version, context_ledger, budget_report, evidence_manifest. |
| Planning | The model loops until it thinks it is done. | Cost spikes, repeated tool calls, hidden no-progress loops, and non-deterministic branching. | The Decision plane runs Planner -> Critic.verify -> Executor -> Critic.score -> Consolidate under budget and loop guards. | Plan transcript, critic verdicts, max_tool_calls, max_replan_attempts, loop-guard events. |
| Tool surface | Function calling exposes a broad registry because “the model can choose.” | Excessive agency: the model can discover and combine capabilities the workflow never needed. | The Tool Gateway receives only the compiled tool manifest allowed by pack, policy, tenant, and safety mode. | CompiledContext.tool_manifest, adapter ids, capability ids, approval modes, arg constraints. |
| Authority | One backend service token executes most actions. | The agent inherits more privilege than the user, the workflow, or the risk class requires. | RunContext separates user delegation from agent workload identity; tool calls exchange scoped credentials per capability. | Delegation scopes, workload identity, policy_decision_id, credential exchange audit metadata. |
| Approval | A UI asks a human to click approve before high-risk actions. | The approver sees mutable data, the pause is not replayable, and denial handling differs per workflow. | Approval gates freeze evidence, bind approver identity, persist resumable state, and write the result to the DecisionRecord. | approval_event, gate_id, approver id, frozen evidence hash, status=APPROVED / REJECTED / DEFERRED. |
| Policy | Prompt says “do not refund above 5000” or a post-hoc checker reviews output. | The model can ignore policy, misunderstand policy, or comply in prose while tools still execute. | Policy bundles run outside the model at compile, plan, and execute boundaries. They produce typed decisions. | Policy bundle version, matched rule_ids[], policy_decision_id, input facts, verdict. |
| Guardrails | A classifier checks the first user message or the final answer. | Side effects happen between those two points. Tool arguments and tool results are under-checked. | Guardrails exist at input, output, tool, approval, memory, and release boundaries. | Tool guardrail results, redaction reports, must_refuse, must_escalate, evaluator verdicts. |
| Memory | The agent writes “important” facts into a vector store. | Prompt injection and stale assumptions persist across sessions; memory becomes a second ungoverned prompt. | Promotion-aware memory separates capture, candidate, review, promotion, decay, and erasure. | Memory write class, consent basis, contradiction check, classification, promotion status, trace link. |
| Observability | Logs contain prompt, response, tool call, maybe latency. | Logs show activity but not authority, policy, evidence, or replayability. | Observability emits W3C-style trace correlation plus typed runtime artifacts. | OTEL spans, trace_id, tool transcripts, scorecards, replay packet, DecisionRecord link. |
| Audit | After an incident, engineers reconstruct what happened from logs and Slack. | Audit depends on human interpretation and live systems that may have changed. | Every governed run emits a DecisionRecord as the durable receipt. | decision_key, evidence_refs, policy_decisions, approvals, controls, lineage, record_hash. |
| Replay | Re-run the prompt and hope the same model, docs, and tools behave similarly. | Live retrieval, model drift, policy edits, and tool state make reproduction impossible. | Replay uses pinned pack, snapshot, routing decision, transcripts, and policy bundle to re-derive the record offline. | replay_id, pinned versions, recorded tool transcripts, byte-match or named diff. |
| Evaluation | A few golden prompts and manual review before launch. | Regressions ship through prompt edits, model upgrades, policy changes, and tool changes. | The Evaluation and Observability plane scores Policy, Utility, Latency, Safety, and Economics on live samples and golden replays. | RunScore, golden set id, evaluator suite version, release-gate verdict. |
| Cost control | Token and tool spend are inspected after the bill arrives. | Agent loops hide cost inside “reasoning.” Expensive paths become normal. | RunBudget caps tokens, wall-clock, tool calls, model routing, and per-intent economics. | budget_report, cost per decision, router decision, model profile, p95/p99 by intent. |
| Rollback | Revert a prompt or disable a feature flag. | The rollback may not restore the old behavior because context, tools, policy, and model route changed independently. | Rollback re-pins the prior release tuple: pack, policy, tool manifest, evaluator suite, model profile, memory snapshot rule. | Release tuple, rollout stage, prior pin, replay against pre-release traces. |
| Improvement | Operators send feedback in Slack; someone updates the prompt. | Learning is anecdotal, unversioned, and hard to verify. | Corrections become typed StrategyRules or pack changes, then pass replay and release gates before promotion. | FeedbackRecord, StrategyRule proposal, diff, golden replay, approver, promotion record. |
| Multi-agent work | Subagents share whatever context and tools the orchestrator hands them. | Delegation leaks authority. Failures cascade across agents with weak provenance. | Subagent lanes inherit bounded context, scoped authority, parent decision refs, and explicit handoff contracts. | parent_decision_id, lane id, handoff envelope, per-lane budget, tool surface per lane. |
| Security posture | Security reviews the prompt, model provider, and API permissions. | Prompt injection, tool misuse, memory poisoning, and over-privileged credentials cross boundaries. | ContextOS treats every boundary as enforceable: compile, plan, execute, memory write, approval, replay, release. | Threat model mapped to controls, least-privilege manifests, redaction tests, memory review, emergency stop. |
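To make the matrix concrete, here is a minimal TypeScript sketch of the invocation envelope named in the “Unit of work” and “Cost control” rows. The field and type names are illustrative assumptions, not the canonical ContextOS schema.

```typescript
// Illustrative sketch of a RunContext envelope; names are assumptions, not the ContextOS schema.
type SafetyMode = "read_only" | "reversible" | "destructive";

interface RunBudget {
  maxTokens: number;          // hard cap on model tokens for the whole run
  maxToolCalls: number;       // hard cap on tool executions
  maxReplanAttempts: number;
  maxWallClockMs: number;
}

interface RunContext {
  runId: string;              // the stable unit of work to authorize, measure, replay, and compare
  traceId: string;            // W3C-style correlation id joining spans, records, and replays
  tenantId: string;           // tenant isolation boundary
  userId: string;             // human principal the run acts for
  delegationScopes: string[]; // authority delegated by the user, distinct from service credentials
  intent: string;             // canonical intent from the Intent-Task Catalog
  taskTemplateId: string;     // approved task template for that intent
  safetyMode: SafetyMode;     // maximum blast radius this run may reach
  budget: RunBudget;          // limits the Decision plane must respect
}

// Example envelope for the refund scenario used later in this article.
const run: RunContext = {
  runId: "run_7f2c",
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  tenantId: "tenant_acme_prod",
  userId: "usr_support_17",
  delegationScopes: ["orders:read", "payments:refund"],
  intent: "support.refund",
  taskTemplateId: "tt_refund_v3",
  safetyMode: "destructive",
  budget: { maxTokens: 60000, maxToolCalls: 8, maxReplanAttempts: 2, maxWallClockMs: 120000 },
};
```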
The important shift
The before-state agent is optimized for “can it finish the task?”
The after-state ContextOS agent is optimized for five harder questions:
| Question | Why it matters | ContextOS artifact |
|---|---|---|
| Was this the right task? | Autonomy is unsafe if the runtime cannot name the intent and allowed task. | Intent-Task Catalog, intent_ref, task template |
| Did it see the right context? | A correct model with wrong context still acts incorrectly. | Context Pack, CompiledContext, evidence manifest |
| Was it allowed to act? | Tools change real systems; authorization cannot live in prose. | Tool Gateway, approval-mode tiers, policy decisions |
| Can we prove what happened? | Trust collapses when incident review depends on interpretation. | DecisionRecord, trace, tool transcript, scorecard |
| Can we improve without drift? | Agent quality must improve through release engineering, not folklore. | Feedback Store, StrategyRule, replay, release gate |
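Those five questions only stay answerable if each one maps to a field on a durable artifact. Here is a sketch of a DecisionRecord shape that would carry them; the field names are assumptions, not the canonical schema.

```typescript
// Illustrative DecisionRecord shape; the canonical schema may differ.
interface PolicyDecisionRef {
  policyDecisionId: string;
  bundle: string;                 // e.g. "POLICY_RETURNS_V4"
  ruleIds: string[];
  verdict: "ALLOW" | "DENY" | "REQUIRE_APPROVAL";
}

interface ApprovalRef {
  gateId: string;
  approverId: string;
  status: "APPROVED" | "REJECTED" | "DEFERRED";
  frozenEvidenceHash: string;
}

interface DecisionRecord {
  decisionKey: string;            // which decision this run was making
  intentRef: string;              // "Was this the right task?"
  contextPackRef: string;         // "Did it see the right context?" e.g. "ctxpack.support@5.2.0"
  evidenceRefs: string[];         // hashes/ids of the evidence actually used
  policyDecisions: PolicyDecisionRef[]; // "Was it allowed to act?"
  approvals: ApprovalRef[];
  outcome: string;                // e.g. "REFUND_ISSUED"
  traceId: string;                // "Can we prove what happened?"
  scorecardRef: string;           // "Can we improve without drift?" links to evaluator scores
  lineage: { parentDecisionId?: string };
  recordHash: string;             // append-only integrity check for the receipt
}
```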
Why “just use an agent framework” is not enough
Frameworks help you build. They do not define your enterprise contract for you.
| Framework primitive | What it gives you | What ContextOS still has to define |
|---|---|---|
| Agent loop | Calls models and tools until completion. | Which intents deserve an autonomous loop, what budget applies, and when the Critic must stop. |
| Tool calling | Lets the model invoke functions or hosted tools. | Which capabilities are surfaced for this tenant, user, intent, risk class, and pack version. |
| Handoffs | Moves work between specialist agents. | Who owns the final decision, what authority transfers, and how parent and child records link. |
| Guardrails | Blocks or validates selected inputs, outputs, or tool calls. | Which policy bundle is authoritative, where checks run, and how denials become replayable evidence. |
| Tracing | Shows model calls, tool calls, handoffs, and spans. | Which trace fields are mandatory, how trace joins to DecisionRecord, and how replay verifies it. |
| Human review | Pauses sensitive tool calls for approval. | Which evidence is frozen, who may approve, what happens on timeout, and how approval affects audit. |
| Sandbox | Runs code or tools in a constrained environment. | Which sandbox profile is allowed by pack, how outputs are classified, and how attestations are recorded. |
The harness is the layer that turns those primitives into a production contract.
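As a sketch of what turning primitives into a contract looks like at the tool boundary, here is a hedged example of authorizing a framework tool call against a compiled manifest. The `CompiledToolEntry` shape, function name, and capability ids are assumptions for illustration, not the ContextOS API.

```typescript
// Sketch of the check the harness adds in front of framework tool calling.
// Names and shapes are illustrative assumptions.
interface CompiledToolEntry {
  capability: string;                       // e.g. "payments.issue_refund"
  approvalMode: "read_only" | "reversible" | "destructive";
  requiresGate?: string;                    // e.g. "GATE_FINANCE_APPROVAL"
}

interface ToolCallRequest {
  capability: string;
  args: Record<string, unknown>;
}

type GatewayVerdict =
  | { allowed: false; reason: string }
  | { allowed: true; requiresGate?: string };

function authorizeToolCall(manifest: CompiledToolEntry[], call: ToolCallRequest): GatewayVerdict {
  // The model only sees capabilities in the compiled manifest for this intent, tenant, and pack version.
  const entry = manifest.find((m) => m.capability === call.capability);
  if (!entry) {
    return { allowed: false, reason: `capability ${call.capability} is not in the compiled manifest` };
  }
  // Destructive capabilities must pass through their declared approval gate before execution.
  if (entry.approvalMode === "destructive") {
    return { allowed: true, requiresGate: entry.requiresGate };
  }
  return { allowed: true };
}
```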
Risk translation table
Agentic AI risks are not abstract. They map directly to missing runtime controls.
| Risk pattern | Before ContextOS symptom | Harness control that closes the gap |
|---|---|---|
| Prompt injection | Retrieved text or user input tells the model to ignore policy. | Context is treated as evidence, not authority; policy runs outside the model; Tool Gateway re-validates every call. |
| Excessive agency | The agent has access to tools that are unrelated to the request. | Compiled tool manifest exposes only capabilities allowed by pack, policy, tenant, safety mode, and delegation. |
| Tool misuse | The model calls the right tool with unsafe arguments. | Arg constraints, policy decisions, idempotency keys, approval-mode tiers, and Critic verification before execute. |
| Sensitive data leakage | Tool output or memory recall enters a response without classification. | Data classification, redaction rules, promotion-aware memory, output guardrails, and trace-linked evidence refs. |
| Memory poisoning | A malicious conversation creates durable future behavior. | Capture-only raw memory, promotion review, contradiction checks, consent basis, decay, and rollbackable memory refs. |
| Cascading failure | One bad intermediate step causes many downstream actions. | Bounded Planner / Executor / Critic loop, max replan attempts, tool-call caps, failure playbooks, and rollback. |
| Invisible drift | A model or policy update changes behavior with no obvious code diff. | Release tuple pinning, golden replay, evaluator scorecards, and decision-record comparison. |
| Audit tampering | Logs can be edited, incomplete, or impossible to join. | Append-only DecisionRecords, trace ids, record hashes, prior hashes, and replay packets. |
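The rows above share one pattern: the control is a typed decision produced outside the model. As one hedged example, a refund-amount check can be evaluated by the policy engine rather than the prompt; the function, rule ids, and bundle name below are illustrative assumptions.

```typescript
import { randomUUID } from "node:crypto";

// Illustrative typed policy decision emitted at the execute boundary; names are assumptions.
interface PolicyDecision {
  policyDecisionId: string;
  bundle: string;
  matchedRuleIds: string[];
  inputs: Record<string, unknown>;   // the facts the rules actually evaluated
  verdict: "ALLOW" | "DENY" | "REQUIRE_APPROVAL";
  requiredGate?: string;
}

// A refund over the cap requires the finance gate regardless of what the prompt says.
function evaluateRefundAmount(amountInr: number, capInr: number): PolicyDecision {
  const overCap = amountInr > capInr;
  return {
    policyDecisionId: randomUUID(),
    bundle: "POLICY_RETURNS_V4",
    matchedRuleIds: [overCap ? "refund.amount.requires_approval" : "refund.amount.within_cap"],
    inputs: { amountInr, capInr },
    verdict: overCap ? "REQUIRE_APPROVAL" : "ALLOW",
    requiredGate: overCap ? "GATE_FINANCE_APPROVAL" : undefined,
  };
}
```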
What changes in one run
Here is the same refund agent before and after the harness.
| Stage | Before: agent demo | After: ContextOS-governed run |
|---|---|---|
| Request | “Refund order ord_881 for INR 4200” enters the support prompt. | Request enters invokeAgent with tenant_id, user delegation, intent=support.refund, safety_mode=destructive, and trace_id. |
| Context | Agent searches docs and customer history. | Compiler builds CompiledContext from ctxpack.support@version, pinned KG snapshot, policy bundle, evidence manifest, and budget report. |
| Plan | Model says it will look up the order and issue refund. | Planner proposes steps; Critic checks tool allow-list, required evidence, approval mode, and argument bounds. |
| Read tool | Order lookup runs. | Tool Gateway executes adp_orders.lookup as read_only, emits toolResult with evidence ref and trace span. |
| Policy | Prompt says high-value refunds need approval. | Policy Engine emits policy_decision_id, matched rule ids, and requires_approval_gate=GATE_FINANCE_APPROVAL. |
| Approval | Agent asks a human in UI. | Approval gate freezes evidence snapshot, pauses the run, stores resumable state, and records approver verdict. |
| Write tool | Payment API is called. | Tool Gateway validates identity, idempotency key, amount cap, egress, approval event, and reversal token before execute. |
| Final answer | Agent writes “refund issued.” | Runtime emits DecisionRecord with outcome, evidence refs, approvals, controls, scorecard, lineage, trace, and record hash. |
| Later audit | Engineer searches logs. | Auditor starts with trace_id, fetches DecisionRecord, replays from pinned inputs, and verifies byte-match. |
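The approval stage in that run is easy to under-build. The point is that the approver's evidence is frozen and the pause is resumable. Below is a minimal sketch with assumed names and a simple SHA-256 snapshot hash; the real gate state and hashing scheme may differ.

```typescript
import { createHash } from "node:crypto";

// Minimal sketch of an approval gate that freezes evidence before pausing the run.
interface ApprovalGateState {
  gateId: string;                    // e.g. "GATE_FINANCE_APPROVAL"
  runId: string;
  frozenEvidenceHash: string;        // proves what the approver actually saw
  status: "PENDING" | "APPROVED" | "REJECTED" | "DEFERRED";
  approverId?: string;
  resumeToken: string;               // lets the paused run resume without re-planning
}

function openApprovalGate(gateId: string, runId: string, evidence: unknown[]): ApprovalGateState {
  // Hash the evidence snapshot so audit can later verify the approver saw this exact state.
  const frozenEvidenceHash = createHash("sha256").update(JSON.stringify(evidence)).digest("hex");
  return { gateId, runId, frozenEvidenceHash, status: "PENDING", resumeToken: `${runId}:${gateId}` };
}

function recordVerdict(
  gate: ApprovalGateState,
  approverId: string,
  status: "APPROVED" | "REJECTED" | "DEFERRED",
): ApprovalGateState {
  // The verdict is written back onto the gate state and later onto the DecisionRecord.
  return { ...gate, approverId, status };
}
```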
Sample prompts vs governed prompts
This is the piece most teams miss: ContextOS does not delete the prompt. It demotes the prompt from “the system boundary” to “one instruction field inside a governed runtime envelope.”
| Use case | Bare LLM prompt | Prompt plus tools | Prompt wrapped with ContextOS constructs | Why the third version is different |
|---|---|---|---|---|
| Customer refund | “You are a support agent. Decide whether to refund this order and respond politely.” | “You can call lookup_order and issue_refund. Follow refund policy. Ask for approval when needed.” | RunContext(intent=support.refund, safety_mode=destructive) + ContextPack(ctxpack.support@5.2.0) + DecisionSpec(support.refund.execute) + Tool Gateway exposing lookup_order and gated issue_refund. | The model can propose, but policy, approval, amount bounds, evidence, idempotency, and audit are enforced outside the prompt. |
| Regulated back office | “Review this exception and decide whether it can be approved.” | “Use the policy docs and case tools. Escalate risky cases.” | RunContext(risk_class=regulated) + policy bundle with rule ids + required evidence list + approval gate + DecisionRecord status values APPROVED, REJECTED, ESCALATED, DEFERRED. | The decision becomes comparable across reviewers because every outcome cites the same evidence contract and policy bundle. |
| Incident command | “Triage this incident and recommend next steps.” | “Use logs, metrics, and runbook tools. Page owners if needed.” | ContextPack(incident.command) pins runbooks, service graph, SLOs, escalation matrix, and allowed tools; Tool Gateway gates paging, rollback, and traffic-shift actions. | The agent can help coordinate, but high-blast-radius actions still require explicit controls, owner identity, and replayable evidence. |
| Software delivery | “Review this PR and merge it if tests look good.” | “Use GitHub, CI, and code search tools. Comment on problems.” | DecisionSpec(delivery.pr.open) declares required evidence: diff summary, tests, static checks, owner approval, risk classification, rollback note. | Merge authority is not a vibe. It is a typed decision with test evidence, approval provenance, and release-gate status. |
| Data stewardship | “Fix this data quality issue.” | “Query the warehouse and update records if confidence is high.” | ContextOS binds ontology version, CEID namespaces, data classification, lineage graph, write capability, and remediation DecisionRecord. | The system knows which entity is being changed, which source is authoritative, and whether the write is allowed for that data class. |
The prompt should get smaller as the harness gets stronger.
| Layer | Belongs in prompt | Belongs in ContextOS contract |
|---|---|---|
| Tone | “Be concise, neutral, and explain next steps.” | Customer communication templates and redaction rules. |
| Task shape | “Evaluate refund eligibility and explain the result.” | Intent, task template, DecisionSpec, required evidence, allowed outcomes. |
| Tool choice | “Look up the order before deciding.” | Compiled tool manifest, capability constraints, approval-mode tier, idempotency policy. |
| Policy | “Do not violate refund policy.” | Versioned policy bundle, JsonLogic rules, priority, policy_decision_id, release gate. |
| Safety | “Escalate risky cases.” | must_escalate, approval gates, risk class, evaluator thresholds, failure playbooks. |
| Audit | “Mention why you decided.” | DecisionRecord with evidence refs, approvals, controls, lineage, scorecard, trace, hash. |
Here is the shape in one compact envelope:
```json
{
  "model_prompt": "Evaluate whether this customer refund can proceed. Explain the customer-facing result without exposing internal rule ids.",
  "run_context": {
    "intent": "support.refund",
    "tenant_id": "tenant_acme_prod",
    "user_role": "support_agent",
    "safety_mode": "destructive",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "run_budget": { "max_tool_calls": 8, "max_replan_attempts": 2 }
  },
  "context_pack_ref": "ctxpack.support@5.2.0",
  "decision_spec_ref": "support.refund.execute@1.0.0",
  "compiled_tool_manifest": [
    { "capability": "orders.lookup", "approval_mode": "read_only" },
    { "capability": "payments.issue_refund", "approval_mode": "destructive", "requires_gate": "GATE_FINANCE_APPROVAL" }
  ],
  "runtime_controls": {
    "policy_bundle": "POLICY_RETURNS_V4",
    "required_evidence": ["identity_verified", "order_lookup", "policy_eval"],
    "redaction_rules": ["card_number", "pan", "cvv"],
    "emit": "DecisionRecord"
  }
}
```

That wrapper is the product point. The prompt still asks for judgment. ContextOS decides the safe context, allowed authority, enforcement path, evidence contract, and audit record around that judgment.
Adoption path
Do not try to build the whole control plane in one sprint. Ship the smallest load-bearing slice.
| Step | Build this first | Done when |
|---|---|---|
| 1. Name the run | RunContext plus canonical trace_id on every invocation. | Every agent call has tenant, user, intent, safety mode, budget, and trace. |
| 2. Compile context | One Context Pack for one workflow. | The model sees a bounded context envelope, not an ad-hoc prompt construction. |
| 3. Gate tools | Tool Gateway for one read tool and one side-effecting tool. | No tool call bypasses schema validation, policy, identity, idempotency, and trace. |
| 4. Emit records | DecisionRecord for the primary decision. | Every successful, rejected, escalated, or deferred run has a durable receipt. |
| 5. Replay one case | Offline replay for one normal run and one boundary run. | Replay proves the record from pinned inputs without executing live tools. |
| 6. Score releases | Policy / Utility / Latency / Safety / Economics scorecard. | A prompt, pack, policy, model, or tool change cannot promote without a verdict. |
| 7. Close improvement | Operator correction -> StrategyRule proposal -> replay -> approval. | Learning becomes a governed release path, not a prompt edit. |
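Step 5 is where most teams discover hidden live dependencies. A hedged sketch of the core check: re-derive the record from pinned inputs and recorded transcripts, then compare hashes. The packet shape, function names, and hashing scheme are assumptions.

```typescript
import { createHash } from "node:crypto";

// Illustrative replay packet: everything needed to re-derive the decision offline.
interface ReplayPacket {
  packRef: string;              // pinned pack, e.g. "ctxpack.support@5.2.0"
  snapshotVersion: string;      // pinned knowledge snapshot
  policyBundle: string;         // pinned policy bundle version
  routingDecision: string;      // which model profile served the run
  toolTranscripts: unknown[];   // recorded tool inputs and outputs; no live execution
}

function hashRecordBody(body: unknown): string {
  return createHash("sha256").update(JSON.stringify(body)).digest("hex");
}

// Returns true on a byte-match; a mismatch should produce a named diff, not a shrug.
function replayMatches(
  packet: ReplayPacket,
  originalRecordHash: string,
  rederive: (p: ReplayPacket) => unknown,
): boolean {
  const rederivedRecordBody = rederive(packet);
  return hashRecordBody(rederivedRecordBody) === originalRecordHash;
}
```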
Design review questions
Use this table before letting an agent act on anything important.
| Ask this | Good answer | Bad answer |
|---|---|---|
| What is the maximum approval mode this run can reach? | Declared in RunContext.safety_mode, tool manifests, policy bundle, and DecisionSpec. | “The prompt tells it to ask first.” |
| Which tools were visible to the model? | A compiled tool manifest for the intent and tenant. | “It had access to our internal tool server.” |
| What evidence supports the final action? | evidence_refs[] on the DecisionRecord, with source, hash, snapshot, and timestamp. | “It saw the customer record.” |
| How do we know policy ran? | policy_decisions[] with bundle version, rule ids, inputs, and verdict. | “The model was instructed with policy.” |
| Can we reproduce this tomorrow? | Replay packet pins pack, snapshot, transcripts, routing, policy, evaluator suite, and record hash. | “We can rerun the request.” |
| What happens if the agent loops? | RunBudget, max tool calls, max replans, no-progress detection, and failure playbook. | “The model usually stops.” |
| How does a correction become durable? | FeedbackRecord -> StrategyRule or pack diff -> golden replay -> release gate. | “We will update the prompt.” |
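The loop question in that table deserves a concrete answer. Here is a minimal sketch of budget and no-progress guards around the Planner/Executor loop; the counter names, thresholds, and function are illustrative assumptions.

```typescript
// Illustrative loop guards enforced by the runtime, not requested in the prompt.
interface LoopCounters {
  toolCalls: number;
  replanAttempts: number;
  stepsWithoutNewEvidence: number;   // basic no-progress detection
}

interface LoopLimits {
  maxToolCalls: number;
  maxReplanAttempts: number;
  noProgressLimit: number;
}

type GuardVerdict = "CONTINUE" | "STOP_BUDGET_EXCEEDED" | "STOP_NO_PROGRESS";

function checkLoopGuards(c: LoopCounters, l: LoopLimits): GuardVerdict {
  if (c.toolCalls >= l.maxToolCalls || c.replanAttempts >= l.maxReplanAttempts) {
    return "STOP_BUDGET_EXCEEDED";   // hand the run to the failure playbook instead of another replan
  }
  if (c.stepsWithoutNewEvidence >= l.noProgressLimit) {
    return "STOP_NO_PROGRESS";       // hidden loops become an explicit, recorded stop event
  }
  return "CONTINUE";
}
```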
What ContextOS is not
| Misread | Correct interpretation |
|---|---|
| “ContextOS replaces agent frameworks.” | No. It defines the contract around any framework: context, authority, records, replay, evaluation, and release. |
| “Harness engineering slows teams down.” | It removes repeated governance work. Teams move faster because context, tools, approvals, audit, and rollback are reusable primitives. |
| “This is only for regulated industries.” | Any workflow with customer state, money, production data, code changes, or durable memory needs these controls. Regulation only makes the need obvious. |
| “Observability is enough.” | Traces explain execution. DecisionRecords prove governed outcomes. Replay verifies them. You need all three. |
| “Better models will make this unnecessary.” | Better models increase the surface area worth automating. More useful autonomy raises the value of boundaries, not lowers it. |
The one-line version
Before ContextOS, an agent is a probabilistic worker surrounded by ad-hoc trust.
After ContextOS, an agent is a governed runtime participant: bounded by compiled context, constrained by policy and tool authority, inspected by a Critic, recorded as a DecisionRecord, replayed from pinned inputs, scored by evaluators, and improved through release gates.
That is the difference between a demo you can admire and a system an enterprise can adopt.
Read next: Harness Engineering, Context Pack, Decision Record, Tool Gateway, and Evaluation and Observability.