Most agentic AI systems start the same way: a strong model, a prompt, a tool list, a vector database, and a demo path where the agent appears to reason its way through work.
That stack is enough to impress a room. It is not enough to run a business workflow where money, customer state, regulated data, or production infrastructure can change.
The difference is not model intelligence. The difference is whether the system around the model can prove what context was used, what authority was granted, which policy ran, which tool executed, what evidence supported the decision, whether the result can be replayed, and how the next release will get safer.
That system is the harness.
ContextOS is the harness architecture for agentic AI. It turns an agent from “LLM plus tools” into a governed decision runtime.
Research readout
The current agent literature converges on a practical point: autonomy increases capability, but it also increases the need for runtime control.
| Source | Practical signal | What it means for production agents |
|---|---|---|
| Anthropic: Building effective agents | Start simple; add agentic complexity only when it demonstrably improves outcomes. Agentic systems trade latency and cost for task performance. Autonomous loops need guardrails, sandboxing, clear tools, and stopping conditions. | A production harness should make the simplest safe path easy and make extra autonomy pay rent through measured outcomes. |
| OpenAI Agents SDK docs | Agent apps need orchestration, tools, handoffs, guardrails, tracing, sandbox execution, human review, and evaluation loops as first-class concerns. | The platform shape is no longer “call model.” It is “operate runs.” ContextOS adds the cross-vendor contract around those runs. |
| OpenAI guardrails and human review | Guardrails decide whether a run should continue, pause, or stop; tool-level checks matter around side effects; approvals need resumable state. | Safety belongs at input, output, tool, and approval boundaries, not only in a system prompt. |
| LangSmith observability docs | Agent traces should capture model calls, tool calls, decision points, and production behavior. | Observability is necessary but not sufficient. ContextOS turns traces into replayable decision evidence. |
| NIST AI RMF Core | AI risk management needs context mapping, measurement, monitoring, human oversight, documented controls, and continuous updates as risks evolve. | Harness engineering is the runtime form of AI risk management: map, measure, manage, and govern every run. |
| OWASP Top 10 for LLM Applications 2025 and OWASP Agentic guidance | Agentic systems expand the security problem from bad output to tool misuse, excessive agency, memory poisoning, identity abuse, cascading failures, and rogue autonomous behavior. | A serious agent platform needs least privilege, memory governance, replay, and emergency rollback as normal runtime behavior. |
Before and after matrix
Read this like a design review checklist. If the “before” cell describes your current system, the row names the production gap.
| Area | Before ContextOS: common agent stack | What breaks | After ContextOS / harness engineering | Proof a geek can inspect |
|---|---|---|---|---|
| Unit of work | A user message enters a prompt and the agent decides what to do. | There is no stable runtime object to authorize, measure, replay, or compare. | Every request enters as a RunContext with tenant, user, delegation, intent, safety mode, trace, and budget. | run_id, trace_id, tenant_id, intent, safety_mode, run_budget in the invocation envelope. |
| Task boundary | “You are a support agent” plus natural-language instructions. | Scope expands by accident. The same prompt handles refunds, fraud, shipping, retention, and escalations. | The Intent-Task Catalog maps raw requests to canonical intents and approved task templates. | Intent classification record, intent_ref, task_template_id, allowed task list. |
| Context | RAG pulls whatever looks semantically close at run time. | Stale docs, policy conflicts, tenant leakage, and silent context truncation become invisible causes of bad actions. | A Context Pack declares source priorities, buckets, budgets, policies, tools, memory, evals, and decision specs. | pack_id@version, snapshot_version, context_ledger, budget_report, evidence_manifest. |
| Planning | The model loops until it thinks it is done. | Cost spikes, repeated tool calls, hidden no-progress loops, and non-deterministic branching. | The Decision plane runs Planner -> Critic.verify -> Executor -> Critic.score -> Consolidate under budget and loop guards. | Plan transcript, critic verdicts, max_tool_calls, max_replan_attempts, loop-guard events. |
| Tool surface | Function calling exposes a broad registry because “the model can choose.” | Excessive agency: the model can discover and combine capabilities the workflow never needed. | The Tool Gateway receives only the compiled tool manifest allowed by pack, policy, tenant, and safety mode. | CompiledContext.tool_manifest, adapter ids, capability ids, approval modes, arg constraints. |
| Authority | One backend service token executes most actions. | The agent inherits more privilege than the user, the workflow, or the risk class requires. | RunContext separates user delegation from agent workload identity; tool calls exchange scoped credentials per capability. | Delegation scopes, workload identity, policy_decision_id, credential exchange audit metadata. |
| Approval | A UI asks a human to click approve before high-risk actions. | The approver sees mutable data, the pause is not replayable, and denial handling differs per workflow. | Approval gates freeze evidence, bind approver identity, persist resumable state, and write the result to the DecisionRecord. | approval_event, gate_id, approver id, frozen evidence hash, status=APPROVED / REJECTED / DEFERRED. |
| Policy | Prompt says “do not refund above 5000” or a post-hoc checker reviews output. | The model can ignore policy, misunderstand policy, or comply in prose while tools still execute. | Policy bundles run outside the model at compile, plan, and execute boundaries. They produce typed decisions. | Policy bundle version, matched rule_ids[], policy_decision_id, input facts, verdict. |
| Guardrails | A classifier checks the first user message or the final answer. | Side effects happen between those two points. Tool arguments and tool results are under-checked. | Guardrails exist at input, output, tool, approval, memory, and release boundaries. | Tool guardrail results, redaction reports, must_refuse, must_escalate, evaluator verdicts. |
| Memory | The agent writes “important” facts into a vector store. | Prompt injection and stale assumptions persist across sessions; memory becomes a second ungoverned prompt. | Promotion-aware memory separates capture, candidate, review, promotion, decay, and erasure. | Memory write class, consent basis, contradiction check, classification, promotion status, trace link. |
| Observability | Logs contain prompt, response, tool call, maybe latency. | Logs show activity but not authority, policy, evidence, or replayability. | Observability emits W3C-style trace correlation plus typed runtime artifacts. | OTEL spans, trace_id, tool transcripts, scorecards, replay packet, DecisionRecord link. |
| Audit | After an incident, engineers reconstruct what happened from logs and Slack. | Audit depends on human interpretation and live systems that may have changed. | Every governed run emits a DecisionRecord as the durable receipt. | decision_key, evidence_refs, policy_decisions, approvals, controls, lineage, record_hash. |
| Replay | Re-run the prompt and hope the same model, docs, and tools behave similarly. | Live retrieval, model drift, policy edits, and tool state make reproduction impossible. | Replay uses pinned pack, snapshot, routing decision, transcripts, and policy bundle to re-derive the record offline. | replay_id, pinned versions, recorded tool transcripts, byte-match or named diff. |
| Evaluation | A few golden prompts and manual review before launch. | Regressions ship through prompt edits, model upgrades, policy changes, and tool changes. | The Evaluation and Observability plane scores Policy, Utility, Latency, Safety, and Economics on live samples and golden replays. | RunScore, golden set id, evaluator suite version, release-gate verdict. |
| Cost control | Token and tool spend are inspected after the bill arrives. | Agent loops hide cost inside “reasoning.” Expensive paths become normal. | RunBudget caps tokens, wall-clock, tool calls, model routing, and per-intent economics. | budget_report, cost per decision, router decision, model profile, p95/p99 by intent. |
| Rollback | Revert a prompt or disable a feature flag. | The rollback may not restore the old behavior because context, tools, policy, and model route changed independently. | Rollback re-pins the prior release tuple: pack, policy, tool manifest, evaluator suite, model profile, memory snapshot rule. | Release tuple, rollout stage, prior pin, replay against pre-release traces. |
| Improvement | Operators send feedback in Slack; someone updates the prompt. | Learning is anecdotal, unversioned, and hard to verify. | Corrections become typed StrategyRules or pack changes, then pass replay and release gates before promotion. | FeedbackRecord, StrategyRule proposal, diff, golden replay, approver, promotion record. |
| Multi-agent work | Subagents share whatever context and tools the orchestrator hands them. | Delegation leaks authority. Failures cascade across agents with weak provenance. | Subagent lanes inherit bounded context, scoped authority, parent decision refs, and explicit handoff contracts. | parent_decision_id, lane id, handoff envelope, per-lane budget, tool surface per lane. |
| Security posture | Security reviews the prompt, model provider, and API permissions. | Prompt injection, tool misuse, memory poisoning, and over-privileged credentials cross boundaries. | ContextOS treats every boundary as enforceable: compile, plan, execute, memory write, approval, replay, release. | Threat model mapped to controls, least-privilege manifests, redaction tests, memory review, emergency stop. |
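To make the matrix concrete, here is a minimal TypeScript sketch of the invocation envelope named in the “Unit of work” and “Cost control” rows. The field and type names are illustrative assumptions, not the canonical ContextOS schema.

```typescript
// Illustrative sketch of a RunContext envelope; names are assumptions, not the ContextOS schema.
type SafetyMode = "read_only" | "reversible" | "destructive";

interface RunBudget {
  maxTokens: number;          // hard cap on model tokens for the whole run
  maxToolCalls: number;       // hard cap on tool executions
  maxReplanAttempts: number;
  maxWallClockMs: number;
}

interface RunContext {
  runId: string;              // the stable unit of work to authorize, measure, replay, and compare
  traceId: string;            // W3C-style correlation id joining spans, records, and replays
  tenantId: string;           // tenant isolation boundary
  userId: string;             // human principal the run acts for
  delegationScopes: string[]; // authority delegated by the user, distinct from service credentials
  intent: string;             // canonical intent from the Intent-Task Catalog
  taskTemplateId: string;     // approved task template for that intent
  safetyMode: SafetyMode;     // maximum blast radius this run may reach
  budget: RunBudget;          // limits the Decision plane must respect
}

// Example envelope for the refund scenario used later in this article.
const run: RunContext = {
  runId: "run_7f2c",
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  tenantId: "tenant_acme_prod",
  userId: "usr_support_17",
  delegationScopes: ["orders:read", "payments:refund"],
  intent: "support.refund",
  taskTemplateId: "tt_refund_v3",
  safetyMode: "destructive",
  budget: { maxTokens: 60000, maxToolCalls: 8, maxReplanAttempts: 2, maxWallClockMs: 120000 },
};
```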
The important shift
The before-state agent is optimized for “can it finish the task?”
The after-state ContextOS agent is optimized for five harder questions:
| Question | Why it matters | ContextOS artifact |
|---|---|---|
| Was this the right task? | Autonomy is unsafe if the runtime cannot name the intent and allowed task. | Intent-Task Catalog, intent_ref, task template |
| Did it see the right context? | A correct model with wrong context still acts incorrectly. | Context Pack, CompiledContext, evidence manifest |
| Was it allowed to act? | Tools change real systems; authorization cannot live in prose. | Tool Gateway, approval-mode tiers, policy decisions |
| Can we prove what happened? | Trust collapses when incident review depends on interpretation. | DecisionRecord, trace, tool transcript, scorecard |
| Can we improve without drift? | Agent quality must improve through release engineering, not folklore. | Feedback Store, StrategyRule, replay, release gate |
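Those five questions only stay answerable if each one maps to a field on a durable artifact. Here is a sketch of a DecisionRecord shape that would carry them; the field names are assumptions, not the canonical schema.

```typescript
// Illustrative DecisionRecord shape; the canonical schema may differ.
interface PolicyDecisionRef {
  policyDecisionId: string;
  bundle: string;                 // e.g. "POLICY_RETURNS_V4"
  ruleIds: string[];
  verdict: "ALLOW" | "DENY" | "REQUIRE_APPROVAL";
}

interface ApprovalRef {
  gateId: string;
  approverId: string;
  status: "APPROVED" | "REJECTED" | "DEFERRED";
  frozenEvidenceHash: string;
}

interface DecisionRecord {
  decisionKey: string;            // which decision this run was making
  intentRef: string;              // "Was this the right task?"
  contextPackRef: string;         // "Did it see the right context?" e.g. "ctxpack.support@5.2.0"
  evidenceRefs: string[];         // hashes/ids of the evidence actually used
  policyDecisions: PolicyDecisionRef[]; // "Was it allowed to act?"
  approvals: ApprovalRef[];
  outcome: string;                // e.g. "REFUND_ISSUED"
  traceId: string;                // "Can we prove what happened?"
  scorecardRef: string;           // "Can we improve without drift?" links to evaluator scores
  lineage: { parentDecisionId?: string };
  recordHash: string;             // append-only integrity check for the receipt
}
```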
Why “just use an agent framework” is not enough
Frameworks help you build. They do not define your enterprise contract for you.
| Framework primitive | What it gives you | What ContextOS still has to define |
|---|---|---|
| Agent loop | Calls models and tools until completion. | Which intents deserve an autonomous loop, what budget applies, and when the Critic must stop. |
| Tool calling | Lets the model invoke functions or hosted tools. | Which capabilities are surfaced for this tenant, user, intent, risk class, and pack version. |
| Handoffs | Moves work between specialist agents. | Who owns the final decision, what authority transfers, and how parent and child records link. |
| Guardrails | Blocks or validates selected inputs, outputs, or tool calls. | Which policy bundle is authoritative, where checks run, and how denials become replayable evidence. |
| Tracing | Shows model calls, tool calls, handoffs, and spans. | Which trace fields are mandatory, how trace joins to DecisionRecord, and how replay verifies it. |
| Human review | Pauses sensitive tool calls for approval. | Which evidence is frozen, who may approve, what happens on timeout, and how approval affects audit. |
| Sandbox | Runs code or tools in a constrained environment. | Which sandbox profile is allowed by pack, how outputs are classified, and how attestations are recorded. |
The harness is the layer that turns those primitives into a production contract.
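As a sketch of what turning primitives into a contract looks like at the tool boundary, here is a hedged example of authorizing a framework tool call against a compiled manifest. The `CompiledToolEntry` shape, function name, and capability ids are assumptions for illustration, not the ContextOS API.

```typescript
// Sketch of the check the harness adds in front of framework tool calling.
// Names and shapes are illustrative assumptions.
interface CompiledToolEntry {
  capability: string;                       // e.g. "payments.issue_refund"
  approvalMode: "read_only" | "reversible" | "destructive";
  requiresGate?: string;                    // e.g. "GATE_FINANCE_APPROVAL"
}

interface ToolCallRequest {
  capability: string;
  args: Record<string, unknown>;
}

type GatewayVerdict =
  | { allowed: false; reason: string }
  | { allowed: true; requiresGate?: string };

function authorizeToolCall(manifest: CompiledToolEntry[], call: ToolCallRequest): GatewayVerdict {
  // The model only sees capabilities in the compiled manifest for this intent, tenant, and pack version.
  const entry = manifest.find((m) => m.capability === call.capability);
  if (!entry) {
    return { allowed: false, reason: `capability ${call.capability} is not in the compiled manifest` };
  }
  // Destructive capabilities must pass through their declared approval gate before execution.
  if (entry.approvalMode === "destructive") {
    return { allowed: true, requiresGate: entry.requiresGate };
  }
  return { allowed: true };
}
```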
Risk translation table
Agentic AI risks are not abstract. They map directly to missing runtime controls.
| Risk pattern | Before ContextOS symptom | Harness control that closes the gap |
|---|---|---|
| Prompt injection | Retrieved text or user input tells the model to ignore policy. | Context is treated as evidence, not authority; policy runs outside the model; Tool Gateway re-validates every call. |
| Excessive agency | The agent has access to tools that are unrelated to the request. | Compiled tool manifest exposes only capabilities allowed by pack, policy, tenant, safety mode, and delegation. |
| Tool misuse | The model calls the right tool with unsafe arguments. | Arg constraints, policy decisions, idempotency keys, approval-mode tiers, and Critic verification before execute. |
| Sensitive data leakage | Tool output or memory recall enters a response without classification. | Data classification, redaction rules, promotion-aware memory, output guardrails, and trace-linked evidence refs. |
| Memory poisoning | A malicious conversation creates durable future behavior. | Capture-only raw memory, promotion review, contradiction checks, consent basis, decay, and rollbackable memory refs. |
| Cascading failure | One bad intermediate step causes many downstream actions. | Bounded Planner / Executor / Critic loop, max replan attempts, tool-call caps, failure playbooks, and rollback. |
| Invisible drift | A model or policy update changes behavior with no obvious code diff. | Release tuple pinning, golden replay, evaluator scorecards, and decision-record comparison. |
| Audit tampering | Logs can be edited, incomplete, or impossible to join. | Append-only DecisionRecords, trace ids, record hashes, prior hashes, and replay packets. |
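The rows above share one pattern: the control is a typed decision produced outside the model. As one hedged example, a refund-amount check can be evaluated by the policy engine rather than the prompt; the function, rule ids, and bundle name below are illustrative assumptions.

```typescript
import { randomUUID } from "node:crypto";

// Illustrative typed policy decision emitted at the execute boundary; names are assumptions.
interface PolicyDecision {
  policyDecisionId: string;
  bundle: string;
  matchedRuleIds: string[];
  inputs: Record<string, unknown>;   // the facts the rules actually evaluated
  verdict: "ALLOW" | "DENY" | "REQUIRE_APPROVAL";
  requiredGate?: string;
}

// A refund over the cap requires the finance gate regardless of what the prompt says.
function evaluateRefundAmount(amountInr: number, capInr: number): PolicyDecision {
  const overCap = amountInr > capInr;
  return {
    policyDecisionId: randomUUID(),
    bundle: "POLICY_RETURNS_V4",
    matchedRuleIds: [overCap ? "refund.amount.requires_approval" : "refund.amount.within_cap"],
    inputs: { amountInr, capInr },
    verdict: overCap ? "REQUIRE_APPROVAL" : "ALLOW",
    requiredGate: overCap ? "GATE_FINANCE_APPROVAL" : undefined,
  };
}
```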
What changes in one run
Here is the same refund agent before and after the harness.
| Stage | Before: agent demo | After: ContextOS-governed run |
|---|---|---|
| Request | “Refund order ord_881 for INR 4200” enters the support prompt. | Request enters invokeAgent with tenant_id, user delegation, intent=support.refund, safety_mode=destructive, and trace_id. |
| Context | Agent searches docs and customer history. | Compiler builds CompiledContext from ctxpack.support@version, pinned KG snapshot, policy bundle, evidence manifest, and budget report. |
| Plan | Model says it will look up the order and issue refund. | Planner proposes steps; Critic checks tool allow-list, required evidence, approval mode, and argument bounds. |
| Read tool | Order lookup runs. | Tool Gateway executes adp_orders.lookup as read_only, emits toolResult with evidence ref and trace span. |
| Policy | Prompt says high-value refunds need approval. | Policy Engine emits policy_decision_id, matched rule ids, and requires_approval_gate=GATE_FINANCE_APPROVAL. |
| Approval | Agent asks a human in UI. | Approval gate freezes evidence snapshot, pauses the run, stores resumable state, and records approver verdict. |
| Write tool | Payment API is called. | Tool Gateway validates identity, idempotency key, amount cap, egress, approval event, and reversal token before execute. |
| Final answer | Agent writes “refund issued.” | Runtime emits DecisionRecord with outcome, evidence refs, approvals, controls, scorecard, lineage, trace, and record hash. |
| Later audit | Engineer searches logs. | Auditor starts with trace_id, fetches DecisionRecord, replays from pinned inputs, and verifies byte-match. |
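The approval stage in that run is easy to under-build. The point is that the approver's evidence is frozen and the pause is resumable. Below is a minimal sketch with assumed names and a simple SHA-256 snapshot hash; the real gate state and hashing scheme may differ.

```typescript
import { createHash } from "node:crypto";

// Minimal sketch of an approval gate that freezes evidence before pausing the run.
interface ApprovalGateState {
  gateId: string;                    // e.g. "GATE_FINANCE_APPROVAL"
  runId: string;
  frozenEvidenceHash: string;        // proves what the approver actually saw
  status: "PENDING" | "APPROVED" | "REJECTED" | "DEFERRED";
  approverId?: string;
  resumeToken: string;               // lets the paused run resume without re-planning
}

function openApprovalGate(gateId: string, runId: string, evidence: unknown[]): ApprovalGateState {
  // Hash the evidence snapshot so audit can later verify the approver saw this exact state.
  const frozenEvidenceHash = createHash("sha256").update(JSON.stringify(evidence)).digest("hex");
  return { gateId, runId, frozenEvidenceHash, status: "PENDING", resumeToken: `${runId}:${gateId}` };
}

function recordVerdict(
  gate: ApprovalGateState,
  approverId: string,
  status: "APPROVED" | "REJECTED" | "DEFERRED",
): ApprovalGateState {
  // The verdict is written back onto the gate state and later onto the DecisionRecord.
  return { ...gate, approverId, status };
}
```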
Sample prompts vs governed prompts
This is the piece most teams miss: ContextOS does not delete the prompt. It demotes the prompt from “the system boundary” to “one instruction field inside a governed runtime envelope.”
| Use case | Bare LLM prompt | Prompt plus tools | Prompt wrapped with ContextOS constructs | Why the third version is different |
|---|---|---|---|---|
| Customer refund | “You are a support agent. Decide whether to refund this order and respond politely.” | “You can call lookup_order and issue_refund. Follow refund policy. Ask for approval when needed.” | RunContext(intent=support.refund, safety_mode=destructive) + ContextPack(ctxpack.support@5.2.0) + DecisionSpec(support.refund.execute) + Tool Gateway exposing lookup_order and gated issue_refund. | The model can propose, but policy, approval, amount bounds, evidence, idempotency, and audit are enforced outside the prompt. |
| Regulated back office | “Review this exception and decide whether it can be approved.” | “Use the policy docs and case tools. Escalate risky cases.” | RunContext(risk_class=regulated) + policy bundle with rule ids + required evidence list + approval gate + DecisionRecord status values APPROVED, REJECTED, ESCALATED, DEFERRED. | The decision becomes comparable across reviewers because every outcome cites the same evidence contract and policy bundle. |
| Incident command | “Triage this incident and recommend next steps.” | “Use logs, metrics, and runbook tools. Page owners if needed.” | ContextPack(incident.command) pins runbooks, service graph, SLOs, escalation matrix, and allowed tools; Tool Gateway gates paging, rollback, and traffic-shift actions. | The agent can help coordinate, but high-blast-radius actions still require explicit controls, owner identity, and replayable evidence. |
| Software delivery | “Review this PR and merge it if tests look good.” | “Use GitHub, CI, and code search tools. Comment on problems.” | DecisionSpec(delivery.pr.open) declares required evidence: diff summary, tests, static checks, owner approval, risk classification, rollback note. | Merge authority is not a vibe. It is a typed decision with test evidence, approval provenance, and release-gate status. |
| Data stewardship | “Fix this data quality issue.” | “Query the warehouse and update records if confidence is high.” | ContextOS binds ontology version, CEID namespaces, data classification, lineage graph, write capability, and remediation DecisionRecord. | The system knows which entity is being changed, which source is authoritative, and whether the write is allowed for that data class. |
The prompt should get smaller as the harness gets stronger.
| Layer | Belongs in prompt | Belongs in ContextOS contract |
|---|---|---|
| Tone | “Be concise, neutral, and explain next steps.” | Customer communication templates and redaction rules. |
| Task shape | “Evaluate refund eligibility and explain the result.” | Intent, task template, DecisionSpec, required evidence, allowed outcomes. |
| Tool choice | “Look up the order before deciding.” | Compiled tool manifest, capability constraints, approval-mode tier, idempotency policy. |
| Policy | “Do not violate refund policy.” | Versioned policy bundle, JsonLogic rules, priority, policy_decision_id, release gate. |
| Safety | “Escalate risky cases.” | must_escalate, approval gates, risk class, evaluator thresholds, failure playbooks. |
| Audit | “Mention why you decided.” | DecisionRecord with evidence refs, approvals, controls, lineage, scorecard, trace, hash. |
Here is the shape in one compact envelope:
```json
{
  "model_prompt": "Evaluate whether this customer refund can proceed. Explain the customer-facing result without exposing internal rule ids.",
  "run_context": {
    "intent": "support.refund",
    "tenant_id": "tenant_acme_prod",
    "user_role": "support_agent",
    "safety_mode": "destructive",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "run_budget": { "max_tool_calls": 8, "max_replan_attempts": 2 }
  },
  "context_pack_ref": "ctxpack.support@5.2.0",
  "decision_spec_ref": "support.refund.execute@1.0.0",
  "compiled_tool_manifest": [
    { "capability": "orders.lookup", "approval_mode": "read_only" },
    { "capability": "payments.issue_refund", "approval_mode": "destructive", "requires_gate": "GATE_FINANCE_APPROVAL" }
  ],
  "runtime_controls": {
    "policy_bundle": "POLICY_RETURNS_V4",
    "required_evidence": ["identity_verified", "order_lookup", "policy_eval"],
    "redaction_rules": ["card_number", "pan", "cvv"],
    "emit": "DecisionRecord"
  }
}
```

That wrapper is the product point. The prompt still asks for judgment. ContextOS decides the safe context, allowed authority, enforcement path, evidence contract, and audit record around that judgment.
Adoption path
Do not try to build the whole control plane in one sprint. Ship the smallest load-bearing slice.
| Step | Build this first | Done when |
|---|---|---|
| 1. Name the run | RunContext plus canonical trace_id on every invocation. | Every agent call has tenant, user, intent, safety mode, budget, and trace. |
| 2. Compile context | One Context Pack for one workflow. | The model sees a bounded context envelope, not an ad-hoc prompt construction. |
| 3. Gate tools | Tool Gateway for one read tool and one side-effecting tool. | No tool call bypasses schema validation, policy, identity, idempotency, and trace. |
| 4. Emit records | DecisionRecord for the primary decision. | Every successful, rejected, escalated, or deferred run has a durable receipt. |
| 5. Replay one case | Offline replay for one normal run and one boundary run. | Replay proves the record from pinned inputs without executing live tools. |
| 6. Score releases | Policy / Utility / Latency / Safety / Economics scorecard. | A prompt, pack, policy, model, or tool change cannot promote without a verdict. |
| 7. Close improvement | Operator correction -> StrategyRule proposal -> replay -> approval. | Learning becomes a governed release path, not a prompt edit. |
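Step 5 is where most teams discover hidden live dependencies. A hedged sketch of the core check: re-derive the record from pinned inputs and recorded transcripts, then compare hashes. The packet shape, function names, and hashing scheme are assumptions.

```typescript
import { createHash } from "node:crypto";

// Illustrative replay packet: everything needed to re-derive the decision offline.
interface ReplayPacket {
  packRef: string;              // pinned pack, e.g. "ctxpack.support@5.2.0"
  snapshotVersion: string;      // pinned knowledge snapshot
  policyBundle: string;         // pinned policy bundle version
  routingDecision: string;      // which model profile served the run
  toolTranscripts: unknown[];   // recorded tool inputs and outputs; no live execution
}

function hashRecordBody(body: unknown): string {
  return createHash("sha256").update(JSON.stringify(body)).digest("hex");
}

// Returns true on a byte-match; a mismatch should produce a named diff, not a shrug.
function replayMatches(
  packet: ReplayPacket,
  originalRecordHash: string,
  rederive: (p: ReplayPacket) => unknown,
): boolean {
  const rederivedRecordBody = rederive(packet);
  return hashRecordBody(rederivedRecordBody) === originalRecordHash;
}
```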
Design review questions
Use this table before letting an agent act on anything important.
| Ask this | Good answer | Bad answer |
|---|---|---|
| What is the maximum approval mode this run can reach? | Declared in RunContext.safety_mode, tool manifests, policy bundle, and DecisionSpec. | “The prompt tells it to ask first.” |
| Which tools were visible to the model? | A compiled tool manifest for the intent and tenant. | “It had access to our internal tool server.” |
| What evidence supports the final action? | evidence_refs[] on the DecisionRecord, with source, hash, snapshot, and timestamp. | “It saw the customer record.” |
| How do we know policy ran? | policy_decisions[] with bundle version, rule ids, inputs, and verdict. | “The model was instructed with policy.” |
| Can we reproduce this tomorrow? | Replay packet pins pack, snapshot, transcripts, routing, policy, evaluator suite, and record hash. | “We can rerun the request.” |
| What happens if the agent loops? | RunBudget, max tool calls, max replans, no-progress detection, and failure playbook. | “The model usually stops.” |
| How does a correction become durable? | FeedbackRecord -> StrategyRule or pack diff -> golden replay -> release gate. | “We will update the prompt.” |
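The loop question in that table deserves a concrete answer. Here is a minimal sketch of budget and no-progress guards around the Planner/Executor loop; the counter names, thresholds, and function are illustrative assumptions.

```typescript
// Illustrative loop guards enforced by the runtime, not requested in the prompt.
interface LoopCounters {
  toolCalls: number;
  replanAttempts: number;
  stepsWithoutNewEvidence: number;   // basic no-progress detection
}

interface LoopLimits {
  maxToolCalls: number;
  maxReplanAttempts: number;
  noProgressLimit: number;
}

type GuardVerdict = "CONTINUE" | "STOP_BUDGET_EXCEEDED" | "STOP_NO_PROGRESS";

function checkLoopGuards(c: LoopCounters, l: LoopLimits): GuardVerdict {
  if (c.toolCalls >= l.maxToolCalls || c.replanAttempts >= l.maxReplanAttempts) {
    return "STOP_BUDGET_EXCEEDED";   // hand the run to the failure playbook instead of another replan
  }
  if (c.stepsWithoutNewEvidence >= l.noProgressLimit) {
    return "STOP_NO_PROGRESS";       // hidden loops become an explicit, recorded stop event
  }
  return "CONTINUE";
}
```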
What ContextOS is not
| Misread | Correct interpretation |
|---|---|
| “ContextOS replaces agent frameworks.” | No. It defines the contract around any framework: context, authority, records, replay, evaluation, and release. |
| “Harness engineering slows teams down.” | It removes repeated governance work. Teams move faster because context, tools, approvals, audit, and rollback are reusable primitives. |
| “This is only for regulated industries.” | Any workflow with customer state, money, production data, code changes, or durable memory needs these controls. Regulation only makes the need obvious. |
| “Observability is enough.” | Traces explain execution. DecisionRecords prove governed outcomes. Replay verifies them. You need all three. |
| “Better models will make this unnecessary.” | Better models increase the surface area worth automating. More useful autonomy raises the value of boundaries, not lowers it. |
The one-line version
Before ContextOS, an agent is a probabilistic worker surrounded by ad-hoc trust.
After ContextOS, an agent is a governed runtime participant: bounded by compiled context, constrained by policy and tool authority, inspected by a Critic, recorded as a DecisionRecord, replayed from pinned inputs, scored by evaluators, and improved through release gates.
That is the difference between a demo you can admire and a system an enterprise can adopt.
Read next: Harness Engineering, Context Pack, Decision Record, Tool Gateway, and Evaluation and Observability.