Building the runtime
May 9, 2026
By Piyush · 18 min read

Agentic AI Systems Before and After ContextOS

Tags: ContextOS · Harness Engineering · Agentic AI · Production Readiness · Before and After

Most agentic AI systems start the same way: a strong model, a prompt, a tool list, a vector database, and a demo path where the agent appears to reason its way through work.

That stack is enough to impress a room. It is not enough to run a business workflow where money, customer state, regulated data, or production infrastructure can change.

The difference is not model intelligence. The difference is whether the system around the model can prove what context was used, what authority was granted, which policy ran, which tool executed, what evidence supported the decision, whether the result can be replayed, and how the next release will get safer.

That system is the harness.

ContextOS is the harness architecture for agentic AI. It turns an agent from “LLM plus tools” into a governed decision runtime.

Core thesis
Agents do not fail only because they are wrong. They fail because the runtime cannot prove, bound, replay, or repair them.
The production upgrade is not “better prompts.” It is compiled context, least-privilege tools, policy outside the model, typed decision records, replay, scorecards, and release-gated improvement.
- **Before: agent as clever worker.** Prompt, tools, memory, logs, and human trust glued together per workflow.
- **After: agent as governed run.** Every request compiles, executes, records, scores, replays, and improves through one contract.
- **Proof: artifacts beat opinions.** ContextPack, ToolEnvelope, DecisionRecord, scorecard, replay packet, rollout tuple.
- **Outcome: adoptable autonomy.** The organization can let agents act because the blast radius is explicit and controllable.

Research readout

The current agent literature converges on a practical point: autonomy increases capability, but it also increases the need for runtime control.

| Source | Practical signal | What it means for production agents |
| --- | --- | --- |
| Anthropic: Building effective agents | Start simple; add agentic complexity only when it demonstrably improves outcomes. Agentic systems trade latency and cost for task performance. Autonomous loops need guardrails, sandboxing, clear tools, and stopping conditions. | A production harness should make the simplest safe path easy and make extra autonomy pay rent through measured outcomes. |
| OpenAI Agents SDK docs | Agent apps need orchestration, tools, handoffs, guardrails, tracing, sandbox execution, human review, and evaluation loops as first-class concerns. | The platform shape is no longer “call model.” It is “operate runs.” ContextOS adds the cross-vendor contract around those runs. |
| OpenAI guardrails and human review | Guardrails decide whether a run should continue, pause, or stop; tool-level checks matter around side effects; approvals need resumable state. | Safety belongs at input, output, tool, and approval boundaries, not only in a system prompt. |
| LangSmith observability docs | Agent traces should capture model calls, tool calls, decision points, and production behavior. | Observability is necessary but not sufficient. ContextOS turns traces into replayable decision evidence. |
| NIST AI RMF Core | AI risk management needs context mapping, measurement, monitoring, human oversight, documented controls, and continuous updates as risks evolve. | Harness engineering is the runtime form of AI risk management: map, measure, manage, and govern every run. |
| OWASP Top 10 for LLM Applications 2025 and OWASP Agentic guidance | Agentic systems expand the security problem from bad output to tool misuse, excessive agency, memory poisoning, identity abuse, cascading failures, and rogue autonomous behavior. | A serious agent platform needs least privilege, memory governance, replay, and emergency rollback as normal runtime behavior. |

Before and after matrix

Read this like a design review checklist. If the “before” cell describes your current system, the row names the production gap.

| Area | Before ContextOS: common agent stack | What breaks | After ContextOS / harness engineering | Proof a geek can inspect |
| --- | --- | --- | --- | --- |
| Unit of work | A user message enters a prompt and the agent decides what to do. | There is no stable runtime object to authorize, measure, replay, or compare. | Every request enters as a RunContext with tenant, user, delegation, intent, safety mode, trace, and budget. | run_id, trace_id, tenant_id, intent, safety_mode, run_budget in the invocation envelope. |
| Task boundary | “You are a support agent” plus natural-language instructions. | Scope expands by accident. The same prompt handles refunds, fraud, shipping, retention, and escalations. | The Intent-Task Catalog maps raw requests to canonical intents and approved task templates. | Intent classification record, intent_ref, task_template_id, allowed task list. |
| Context | RAG pulls whatever looks semantically close at run time. | Stale docs, policy conflicts, tenant leakage, and silent context truncation become invisible causes of bad actions. | A Context Pack declares source priorities, buckets, budgets, policies, tools, memory, evals, and decision specs. | pack_id@version, snapshot_version, context_ledger, budget_report, evidence_manifest. |
| Planning | The model loops until it thinks it is done. | Cost spikes, repeated tool calls, hidden no-progress loops, and non-deterministic branching. | The Decision plane runs Planner -> Critic.verify -> Executor -> Critic.score -> Consolidate under budget and loop guards. | Plan transcript, critic verdicts, max_tool_calls, max_replan_attempts, loop-guard events. |
| Tool surface | Function calling exposes a broad registry because “the model can choose.” | Excessive agency: the model can discover and combine capabilities the workflow never needed. | The Tool Gateway receives only the compiled tool manifest allowed by pack, policy, tenant, and safety mode. | CompiledContext.tool_manifest, adapter ids, capability ids, approval modes, arg constraints. |
| Authority | One backend service token executes most actions. | The agent inherits more privilege than the user, the workflow, or the risk class requires. | RunContext separates user delegation from agent workload identity; tool calls exchange scoped credentials per capability. | Delegation scopes, workload identity, policy_decision_id, credential exchange audit metadata. |
| Approval | A UI asks a human to click approve before high-risk actions. | The approver sees mutable data, the pause is not replayable, and denial handling differs per workflow. | Approval gates freeze evidence, bind approver identity, persist resumable state, and write the result to the DecisionRecord. | approval_event, gate_id, approver id, frozen evidence hash, status=APPROVED / REJECTED / DEFERRED. |
| Policy | Prompt says “do not refund above 5000” or a post-hoc checker reviews output. | The model can ignore policy, misunderstand policy, or comply in prose while tools still execute. | Policy bundles run outside the model at compile, plan, and execute boundaries. They produce typed decisions. | Policy bundle version, matched rule_ids[], policy_decision_id, input facts, verdict. |
| Guardrails | A classifier checks the first user message or the final answer. | Side effects happen between those two points. Tool arguments and tool results are under-checked. | Guardrails exist at input, output, tool, approval, memory, and release boundaries. | Tool guardrail results, redaction reports, must_refuse, must_escalate, evaluator verdicts. |
| Memory | The agent writes “important” facts into a vector store. | Prompt injection and stale assumptions persist across sessions; memory becomes a second ungoverned prompt. | Promotion-aware memory separates capture, candidate, review, promotion, decay, and erasure. | Memory write class, consent basis, contradiction check, classification, promotion status, trace link. |
| Observability | Logs contain prompt, response, tool call, maybe latency. | Logs show activity but not authority, policy, evidence, or replayability. | Observability emits W3C-style trace correlation plus typed runtime artifacts. | OTEL spans, trace_id, tool transcripts, scorecards, replay packet, DecisionRecord link. |
| Audit | After an incident, engineers reconstruct what happened from logs and Slack. | Audit depends on human interpretation and live systems that may have changed. | Every governed run emits a DecisionRecord as the durable receipt. | decision_key, evidence_refs, policy_decisions, approvals, controls, lineage, record_hash. |
| Replay | Re-run the prompt and hope the same model, docs, and tools behave similarly. | Live retrieval, model drift, policy edits, and tool state make reproduction impossible. | Replay uses pinned pack, snapshot, routing decision, transcripts, and policy bundle to re-derive the record offline. | replay_id, pinned versions, recorded tool transcripts, byte-match or named diff. |
| Evaluation | A few golden prompts and manual review before launch. | Regressions ship through prompt edits, model upgrades, policy changes, and tool changes. | Evaluation and Observability scores Policy, Utility, Latency, Safety, and Economics on live samples and golden replays. | RunScore, golden set id, evaluator suite version, release-gate verdict. |
| Cost control | Token and tool spend are inspected after the bill arrives. | Agent loops hide cost inside “reasoning.” Expensive paths become normal. | RunBudget caps tokens, wall-clock, tool calls, model routing, and per-intent economics. | budget_report, cost per decision, router decision, model profile, p95/p99 by intent. |
| Rollback | Revert a prompt or disable a feature flag. | The rollback may not restore the old behavior because context, tools, policy, and model route changed independently. | Rollback re-pins the prior release tuple: pack, policy, tool manifest, evaluator suite, model profile, memory snapshot rule. | Release tuple, rollout stage, prior pin, replay against pre-release traces. |
| Improvement | Operators send feedback in Slack; someone updates the prompt. | Learning is anecdotal, unversioned, and hard to verify. | Corrections become typed StrategyRules or pack changes, then pass replay and release gates before promotion. | FeedbackRecord, StrategyRule proposal, diff, golden replay, approver, promotion record. |
| Multi-agent work | Subagents share whatever context and tools the orchestrator hands them. | Delegation leaks authority. Failures cascade across agents with weak provenance. | Subagent lanes inherit bounded context, scoped authority, parent decision refs, and explicit handoff contracts. | parent_decision_id, lane id, handoff envelope, per-lane budget, tool surface per lane. |
| Security posture | Security reviews the prompt, model provider, and API permissions. | Prompt injection, tool misuse, memory poisoning, and over-privileged credentials cross boundaries. | ContextOS treats every boundary as enforceable: compile, plan, execute, memory write, approval, replay, release. | Threat model mapped to controls, least-privilege manifests, redaction tests, memory review, emergency stop. |
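The "unit of work" row is the keystone: everything else attaches to a typed run envelope. As a rough sketch only (the field names come from this article, but the class shape and defaults are my illustration, not ContextOS's actual schema), a minimal RunContext might look like this:

```python
from dataclasses import dataclass, field
import uuid

@dataclass(frozen=True)
class RunBudget:
    # Hard caps the runtime enforces regardless of what the model "wants".
    max_tool_calls: int = 8
    max_replan_attempts: int = 2

@dataclass(frozen=True)
class RunContext:
    # The stable runtime object that can be authorized, measured, and replayed.
    tenant_id: str
    user_role: str
    intent: str
    safety_mode: str  # e.g. "read_only" | "reversible" | "destructive"
    run_budget: RunBudget = field(default_factory=RunBudget)
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

ctx = RunContext(
    tenant_id="tenant_acme_prod",
    user_role="support_agent",
    intent="support.refund",
    safety_mode="destructive",
)
```

The frozen dataclass is deliberate: once a run is named, its identity, authority, and budget should not mutate mid-flight.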

The important shift

The before-state agent is optimized for “can it finish the task?”

The after-state ContextOS agent is optimized for five harder questions:

| Question | Why it matters | ContextOS artifact |
| --- | --- | --- |
| Was this the right task? | Autonomy is unsafe if the runtime cannot name the intent and allowed task. | Intent-Task Catalog, intent_ref, task template |
| Did it see the right context? | A correct model with wrong context still acts incorrectly. | Context Pack, CompiledContext, evidence manifest |
| Was it allowed to act? | Tools change real systems; authorization cannot live in prose. | Tool Gateway, approval-mode tiers, policy decisions |
| Can we prove what happened? | Trust collapses when incident review depends on interpretation. | DecisionRecord, trace, tool transcript, scorecard |
| Can we improve without drift? | Agent quality must improve through release engineering, not folklore. | Feedback Store, StrategyRule, replay, release gate |

Why “just use an agent framework” is not enough

Frameworks help you build. They do not automatically decide your enterprise contract.

| Framework primitive | What it gives you | What ContextOS still has to define |
| --- | --- | --- |
| Agent loop | Calls models and tools until completion. | Which intents deserve an autonomous loop, what budget applies, and when the Critic must stop. |
| Tool calling | Lets the model invoke functions or hosted tools. | Which capabilities are surfaced for this tenant, user, intent, risk class, and pack version. |
| Handoffs | Moves work between specialist agents. | Who owns the final decision, what authority transfers, and how parent and child records link. |
| Guardrails | Blocks or validates selected inputs, outputs, or tool calls. | Which policy bundle is authoritative, where checks run, and how denials become replayable evidence. |
| Tracing | Shows model calls, tool calls, handoffs, and spans. | Which trace fields are mandatory, how trace joins to DecisionRecord, and how replay verifies it. |
| Human review | Pauses sensitive tool calls for approval. | Which evidence is frozen, who may approve, what happens on timeout, and how approval affects audit. |
| Sandbox | Runs code or tools in a constrained environment. | Which sandbox profile is allowed by pack, how outputs are classified, and how attestations are recorded. |

The harness is the layer that turns those primitives into a production contract.
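To make "production contract" concrete, here is a hedged sketch of what a Tool Gateway check could look like. The manifest shape, function names, and the `GATE_FINANCE_APPROVAL` constant are illustrative stand-ins for the constructs the article names; a real gateway would also validate schemas, identity, and idempotency keys.

```python
class ToolDenied(Exception):
    """Raised when a call fails a gateway check before any side effect occurs."""

# Hypothetical compiled manifest for one intent. The field names mirror the
# article's CompiledContext.tool_manifest but are my own illustration.
MANIFEST = {
    "orders.lookup": {"approval_mode": "read_only"},
    "payments.issue_refund": {
        "approval_mode": "destructive",
        "requires_gate": "GATE_FINANCE_APPROVAL",
        "arg_constraints": {"amount_max": 5000},
    },
}

def gateway_call(capability, args, approvals=()):
    # 1. The capability must exist in the compiled manifest at all.
    entry = MANIFEST.get(capability)
    if entry is None:
        raise ToolDenied(f"{capability} not in compiled manifest")
    # 2. Argument bounds are enforced outside the model.
    limits = entry.get("arg_constraints", {})
    if "amount_max" in limits and args.get("amount", 0) > limits["amount_max"]:
        raise ToolDenied("amount exceeds policy cap")
    # 3. Gated capabilities require a recorded approval event.
    gate = entry.get("requires_gate")
    if gate and gate not in approvals:
        raise ToolDenied(f"missing approval gate {gate}")
    return {"capability": capability, "status": "executed", "args": args}
```

The point of the sketch: every denial path raises before execution, so "the model decided not to" is never the only safeguard.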

Risk translation table

Agentic AI risks are not abstract. They map directly to missing runtime controls.

| Risk pattern | Before ContextOS symptom | Harness control that closes the gap |
| --- | --- | --- |
| Prompt injection | Retrieved text or user input tells the model to ignore policy. | Context is treated as evidence, not authority; policy runs outside the model; Tool Gateway re-validates every call. |
| Excessive agency | The agent has access to tools that are unrelated to the request. | Compiled tool manifest exposes only capabilities allowed by pack, policy, tenant, safety mode, and delegation. |
| Tool misuse | The model calls the right tool with unsafe arguments. | Arg constraints, policy decisions, idempotency keys, approval-mode tiers, and Critic verification before execute. |
| Sensitive data leakage | Tool output or memory recall enters a response without classification. | Data classification, redaction rules, promotion-aware memory, output guardrails, and trace-linked evidence refs. |
| Memory poisoning | A malicious conversation creates durable future behavior. | Capture-only raw memory, promotion review, contradiction checks, consent basis, decay, and rollbackable memory refs. |
| Cascading failure | One bad intermediate step causes many downstream actions. | Bounded Planner / Executor / Critic loop, max replan attempts, tool-call caps, failure playbooks, and rollback. |
| Invisible drift | A model or policy update changes behavior with no obvious code diff. | Release tuple pinning, golden replay, evaluator scorecards, and decision-record comparison. |
| Audit tampering | Logs can be edited, incomplete, or impossible to join. | Append-only DecisionRecords, trace ids, record hashes, prior hashes, and replay packets. |
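The "excessive agency" control is easy to state and easy to implement: the tool surface a model sees is the intersection of what the pack declares, what policy allows, what the tenant enables, and what the safety mode permits. A minimal sketch, with tier names and function signature assumed rather than taken from ContextOS:

```python
def compile_tool_manifest(pack_tools, policy_allowed, tenant_enabled, safety_mode_max):
    """Expose a capability only if every layer independently allows it.

    pack_tools: dict of capability -> approval-mode tier declared by the pack.
    policy_allowed / tenant_enabled: sets of capability names.
    safety_mode_max: the highest tier this run may reach.
    """
    tiers = {"read_only": 0, "reversible": 1, "destructive": 2}
    # Set intersection: a capability missing from any layer never surfaces.
    surface = set(pack_tools) & set(policy_allowed) & set(tenant_enabled)
    return sorted(
        cap for cap, tier in pack_tools.items()
        if cap in surface and tiers[tier] <= tiers[safety_mode_max]
    )
```

Because the manifest is computed before the model runs, a prompt-injected request for an unlisted capability has nothing to call.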

What changes in one run

Here is the same refund agent before and after the harness.

| Stage | Before: agent demo | After: ContextOS-governed run |
| --- | --- | --- |
| Request | “Refund order ord_881 for INR 4200” enters the support prompt. | Request enters invokeAgent with tenant_id, user delegation, intent=support.refund, safety_mode=destructive, and trace_id. |
| Context | Agent searches docs and customer history. | Compiler builds CompiledContext from ctxpack.support@version, pinned KG snapshot, policy bundle, evidence manifest, and budget report. |
| Plan | Model says it will look up the order and issue refund. | Planner proposes steps; Critic checks tool allow-list, required evidence, approval mode, and argument bounds. |
| Read tool | Order lookup runs. | Tool Gateway executes adp_orders.lookup as read_only, emits toolResult with evidence ref and trace span. |
| Policy | Prompt says high-value refunds need approval. | Policy Engine emits policy_decision_id, matched rule ids, and requires_approval_gate=GATE_FINANCE_APPROVAL. |
| Approval | Agent asks a human in UI. | Approval gate freezes evidence snapshot, pauses the run, stores resumable state, and records approver verdict. |
| Write tool | Payment API is called. | Tool Gateway validates identity, idempotency key, amount cap, egress, approval event, and reversal token before execute. |
| Final answer | Agent writes “refund issued.” | Runtime emits DecisionRecord with outcome, evidence refs, approvals, controls, scorecard, lineage, trace, and record hash. |
| Later audit | Engineer searches logs. | Auditor starts with trace_id, fetches DecisionRecord, replays from pinned inputs, and verifies byte-match. |
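The "verifies byte-match" step deserves a concrete shape. One common construction, which I am assuming rather than quoting from ContextOS, is to hash each record together with its predecessor's hash (an append-only chain), so replay can re-derive the record from pinned inputs and compare hashes:

```python
import hashlib
import json

def record_hash(record, prior_hash):
    """Commit to both this record's payload and the previous record's hash.

    Canonical JSON (sorted keys, no whitespace) makes the hash deterministic.
    """
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prior_hash + payload).encode()).hexdigest()

def replay_matches(original, replayed, prior_hash):
    """Byte-match check: replaying from pinned inputs must reproduce the hash."""
    return record_hash(original, prior_hash) == record_hash(replayed, prior_hash)
```

With chained hashes, editing any historical record breaks every hash after it, which is what makes the DecisionRecord a receipt rather than a log line.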

Sample prompts vs governed prompts

This is the piece most teams miss: ContextOS does not delete the prompt. It demotes the prompt from “the system boundary” to “one instruction field inside a governed runtime envelope.”

| Use case | Bare LLM prompt | Prompt plus tools | Prompt wrapped with ContextOS constructs | Why the third version is different |
| --- | --- | --- | --- | --- |
| Customer refund | “You are a support agent. Decide whether to refund this order and respond politely.” | “You can call lookup_order and issue_refund. Follow refund policy. Ask for approval when needed.” | RunContext(intent=support.refund, safety_mode=destructive) + ContextPack(ctxpack.support@5.2.0) + DecisionSpec(support.refund.execute) + Tool Gateway exposing lookup_order and gated issue_refund. | The model can propose, but policy, approval, amount bounds, evidence, idempotency, and audit are enforced outside the prompt. |
| Regulated back office | “Review this exception and decide whether it can be approved.” | “Use the policy docs and case tools. Escalate risky cases.” | RunContext(risk_class=regulated) + policy bundle with rule ids + required evidence list + approval gate + DecisionRecord status values APPROVED, REJECTED, ESCALATED, DEFERRED. | The decision becomes comparable across reviewers because every outcome cites the same evidence contract and policy bundle. |
| Incident command | “Triage this incident and recommend next steps.” | “Use logs, metrics, and runbook tools. Page owners if needed.” | ContextPack(incident.command) pins runbooks, service graph, SLOs, escalation matrix, and allowed tools; Tool Gateway gates paging, rollback, and traffic-shift actions. | The agent can help coordinate, but high-blast-radius actions still require explicit controls, owner identity, and replayable evidence. |
| Software delivery | “Review this PR and merge it if tests look good.” | “Use GitHub, CI, and code search tools. Comment on problems.” | DecisionSpec(delivery.pr.open) declares required evidence: diff summary, tests, static checks, owner approval, risk classification, rollback note. | Merge authority is not a vibe. It is a typed decision with test evidence, approval provenance, and release-gate status. |
| Data stewardship | “Fix this data quality issue.” | “Query the warehouse and update records if confidence is high.” | ContextOS binds ontology version, CEID namespaces, data classification, lineage graph, write capability, and remediation DecisionRecord. | The system knows which entity is being changed, which source is authoritative, and whether the write is allowed for that data class. |

The prompt should get smaller as the harness gets stronger.

| Layer | Belongs in prompt | Belongs in ContextOS contract |
| --- | --- | --- |
| Tone | “Be concise, neutral, and explain next steps.” | Customer communication templates and redaction rules. |
| Task shape | “Evaluate refund eligibility and explain the result.” | Intent, task template, DecisionSpec, required evidence, allowed outcomes. |
| Tool choice | “Look up the order before deciding.” | Compiled tool manifest, capability constraints, approval-mode tier, idempotency policy. |
| Policy | “Do not violate refund policy.” | Versioned policy bundle, JsonLogic rules, priority, policy_decision_id, release gate. |
| Safety | “Escalate risky cases.” | must_escalate, approval gates, risk class, evaluator thresholds, failure playbooks. |
| Audit | “Mention why you decided.” | DecisionRecord with evidence refs, approvals, controls, lineage, scorecard, trace, hash. |

Here is the shape in one compact envelope:

```json
{
  "model_prompt": "Evaluate whether this customer refund can proceed. Explain the customer-facing result without exposing internal rule ids.",
  "run_context": {
    "intent": "support.refund",
    "tenant_id": "tenant_acme_prod",
    "user_role": "support_agent",
    "safety_mode": "destructive",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "run_budget": { "max_tool_calls": 8, "max_replan_attempts": 2 }
  },
  "context_pack_ref": "ctxpack.support@5.2.0",
  "decision_spec_ref": "support.refund.execute@1.0.0",
  "compiled_tool_manifest": [
    { "capability": "orders.lookup", "approval_mode": "read_only" },
    { "capability": "payments.issue_refund", "approval_mode": "destructive", "requires_gate": "GATE_FINANCE_APPROVAL" }
  ],
  "runtime_controls": {
    "policy_bundle": "POLICY_RETURNS_V4",
    "required_evidence": ["identity_verified", "order_lookup", "policy_eval"],
    "redaction_rules": ["card_number", "pan", "cvv"],
    "emit": "DecisionRecord"
  }
}
```

That wrapper is the product point. The prompt still asks for judgment. ContextOS decides the safe context, allowed authority, enforcement path, evidence contract, and audit record around that judgment.
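"Policy runs outside the model" can be shown in a few lines. The rule shape below is a simplified stand-in for the JsonLogic-style bundles the article mentions: rules, thresholds, and the `REQUIRE_GATE` verdict name are my illustration, while the bundle name, gate name, and typed-decision fields (matched rule ids, verdict, input facts) follow the article.

```python
# Hypothetical rule bundle; a real bundle would be versioned data, not lambdas.
POLICY_RETURNS_V4 = [
    {"rule_id": "RET-001", "when": lambda f: f["amount"] > 5000, "verdict": "DENY"},
    {"rule_id": "RET-002", "when": lambda f: f["amount"] > 1000,
     "verdict": "REQUIRE_GATE", "gate": "GATE_FINANCE_APPROVAL"},
]

def evaluate_policy(facts, bundle=POLICY_RETURNS_V4, bundle_version="POLICY_RETURNS_V4"):
    """Return a typed policy decision: every run gets a verdict plus the
    rule ids and input facts that produced it, whatever the model said."""
    matched = [r for r in bundle if r["when"](facts)]
    first = matched[0] if matched else None
    return {
        "policy_bundle": bundle_version,
        "matched_rule_ids": [r["rule_id"] for r in matched],
        "verdict": first["verdict"] if first else "ALLOW",
        "requires_approval_gate": first.get("gate") if first else None,
        "input_facts": facts,
    }
```

Because the decision is a typed artifact, not prose, the Tool Gateway can enforce it and the DecisionRecord can cite it.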

Adoption path

Do not try to build the whole control plane in one sprint. Ship the smallest load-bearing slice.

| Step | Build this first | Done when |
| --- | --- | --- |
| 1. Name the run | RunContext plus canonical trace_id on every invocation. | Every agent call has tenant, user, intent, safety mode, budget, and trace. |
| 2. Compile context | One Context Pack for one workflow. | The model sees a bounded context envelope, not an ad-hoc prompt construction. |
| 3. Gate tools | Tool Gateway for one read tool and one side-effecting tool. | No tool call bypasses schema validation, policy, identity, idempotency, and trace. |
| 4. Emit records | DecisionRecord for the primary decision. | Every successful, rejected, escalated, or deferred run has a durable receipt. |
| 5. Replay one case | Offline replay for one normal run and one boundary run. | Replay proves the record from pinned inputs without executing live tools. |
| 6. Score releases | Policy / Utility / Latency / Safety / Economics scorecard. | A prompt, pack, policy, model, or tool change cannot promote without a verdict. |
| 7. Close improvement | Operator correction -> StrategyRule proposal -> replay -> approval. | Learning becomes a governed release path, not a prompt edit. |
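Step 1's budget fields only matter if something enforces them. A minimal driver loop, with the step protocol (`"tool"`, `"replan"`, `"done"`) invented here for illustration, could enforce `max_tool_calls` and `max_replan_attempts` like this:

```python
class BudgetExceeded(Exception):
    """Raised when a run hits a hard cap instead of looping forever."""

def run_with_guards(step_fn, max_tool_calls=8, max_replan_attempts=2):
    """Drive an agent loop under hard caps.

    step_fn(state) returns ("tool", state), ("replan", state), or
    ("done", result). The caps live in the runtime, not the prompt.
    """
    tool_calls = replans = 0
    state = None
    while True:
        kind, state = step_fn(state)
        if kind == "done":
            return state
        if kind == "tool":
            tool_calls += 1
            if tool_calls > max_tool_calls:
                raise BudgetExceeded("max_tool_calls exceeded")
        elif kind == "replan":
            replans += 1
            if replans > max_replan_attempts:
                raise BudgetExceeded("max_replan_attempts exceeded")
```

A BudgetExceeded failure is itself an auditable outcome, which is what turns "the model usually stops" into a guarantee.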

Design review questions

Use this table before letting an agent act on anything important.

| Ask this | Good answer | Bad answer |
| --- | --- | --- |
| What is the maximum approval mode this run can reach? | Declared in RunContext.safety_mode, tool manifests, policy bundle, and DecisionSpec. | “The prompt tells it to ask first.” |
| Which tools were visible to the model? | A compiled tool manifest for the intent and tenant. | “It had access to our internal tool server.” |
| What evidence supports the final action? | evidence_refs[] on the DecisionRecord, with source, hash, snapshot, and timestamp. | “It saw the customer record.” |
| How do we know policy ran? | policy_decisions[] with bundle version, rule ids, inputs, and verdict. | “The model was instructed with policy.” |
| Can we reproduce this tomorrow? | Replay packet pins pack, snapshot, transcripts, routing, policy, evaluator suite, and record hash. | “We can rerun the request.” |
| What happens if the agent loops? | RunBudget, max tool calls, max replans, no-progress detection, and failure playbook. | “The model usually stops.” |
| How does a correction become durable? | FeedbackRecord -> StrategyRule or pack diff -> golden replay -> release gate. | “We will update the prompt.” |

What ContextOS is not

| Misread | Correct interpretation |
| --- | --- |
| “ContextOS replaces agent frameworks.” | No. It defines the contract around any framework: context, authority, records, replay, evaluation, and release. |
| “Harness engineering slows teams down.” | It removes repeated governance work. Teams move faster because context, tools, approvals, audit, and rollback are reusable primitives. |
| “This is only for regulated industries.” | Any workflow with customer state, money, production data, code changes, or durable memory needs these controls. Regulation only makes the need obvious. |
| “Observability is enough.” | Traces explain execution. DecisionRecords prove governed outcomes. Replay verifies them. You need all three. |
| “Better models will make this unnecessary.” | Better models increase the surface area worth automating. More useful autonomy raises the value of boundaries; it does not lower it. |

The one-line version

Before ContextOS, an agent is a probabilistic worker surrounded by ad-hoc trust.

After ContextOS, an agent is a governed runtime participant: bounded by compiled context, constrained by policy and tool authority, inspected by a Critic, recorded as a DecisionRecord, replayed from pinned inputs, scored by evaluators, and improved through release gates.

That is the difference between a demo you can admire and a system an enterprise can adopt.

Read next: Harness Engineering, Context Pack, Decision Record, Tool Gateway, and Evaluation and Observability.

Found this useful? Share it.
