Most “agent projects” fail for the same reason most microservice migrations failed a decade ago: the team shipped a shape (a chat box, a chain, a tool list) without shipping the system around it.
In ContextOS terms, the durable system is the harness—the controlled execution environment that decides what the model sees, what it may do, what counts as “done,” and how failures become versioned improvements. The model reasons; the harness governs execution.
This post is an end-to-end engineering path: contracts first, boundaries explicit, artifacts versioned, behavior replayable. It is the map; for a single run with code at every transition, pair it with End-to-end refund: twelve primitives in one production run.
If you want the conceptual spine first, read Harness Engineering. If you want the wire format, read API Contracts. This article connects those into a sequence a team can execute without losing the thread, and it belongs in the broader Agent Engineering series.
What the best current work agrees on
The field has converged on a useful discipline:
- Anthropic’s effective-agent guidance argues for the simplest working architecture first: fixed workflows when the path is known, autonomous agents only when the task genuinely needs flexible tool use and multi-step recovery.
- Anthropic’s context-engineering work treats context as a finite resource: use small, high-signal inputs, just-in-time retrieval, compaction, and focused subagents instead of dumping everything into the model.
- OpenAI’s agent-eval guidance puts traces before repeatable datasets when you are still debugging behavior, then moves to eval runs when you need regression discipline.
- OpenTelemetry’s GenAI semantic conventions now name inference, retrieval, and tool execution as first-class spans; that matters because agent behavior cannot be debugged from final strings alone.
- Meta-Harness makes the strategic point explicit: the harness itself—code, context selection, tools, and traces—is an optimization target, not just glue around a model.
ContextOS turns that research signal into a product rule: do not ship an agent as a prompt. Ship a harness that can be measured, replayed, rolled back, and improved.
TL;DR: the build order
Do these in order. Skipping a step is how you get “it worked in the demo” and “we cannot explain Tuesday.”
| Step | You ship | Before you touch |
|---|---|---|
| 0 | Intent + task contract + blast radius | Retrieval, model choice, “creative” copy |
| 1 | RunContext + trace propagation | Tool schemas |
| 2 | Pinned Context Pack + compile → CompiledContext | The Planner’s personality |
| 3 | Planner + critic.verify + critic.score + consolidate | Raw tool calling from the model |
| 4 | Tool Gateway + adapter manifests | Direct HTTP from agent code |
| 5 | DecisionRecord contract + storage | Chat logs as audit trail |
| 6 | OTEL traces + trace grading + five-evaluator scorecard | Dashboards nobody owns |
| 7 | Feature-flagged rollout stages + kill switch | Big-bang prompt deploys |
| 8 | Improvement loop + reviewer agents | Slack as your change log |
The mental model: agent versus harness
An agent, in the sense product and security care about, is not a prompt. It is a request-scoped execution unit with:
- a declared intent (what class of work this is),
- a RunContext (who, where, budget, safety posture),
- pinned harness artifacts (Context Pack, policies, tools, evaluators),
- a decision loop that produces a typed outcome,
- a Tool Gateway as the only path to external effect,
- a DecisionRecord that is audit-grade and replayable.
Everything else—model choice, retrieval heuristics, planner wording—is inside that frame. When those change, they change as versioned harness candidates, not as silent production mutation. See Harness candidates are model checkpoints and How great AI engineers build agents.
Workflow first, agent second
“Build an agent” is often the wrong first instruction. Choose the smallest runtime shape that can satisfy the scorecard:
| Runtime shape | Use it when | Harness implication |
|---|---|---|
| Single call + retrieval | The task is short, low-risk, and answer-shaped | Still emit RunContext, evaluator result, and trace id |
| Fixed workflow | The steps are known and decomposition improves accuracy | Add deterministic gates between steps |
| Planner / Executor / Critic loop | The system must choose tools or recover from changing evidence | Add loop guards, tool budgets, and critic verdicts |
| Orchestrator + workers | Subtasks are separable and parallel exploration pays off | Give each worker a bounded context and typed return |
| Long-running agent | Work crosses sessions, repos, or days | Persist progress notes, feature queues, tests, and replay handles |
The harness decision is not philosophical. It is empirical: ship the simpler shape until evals and traces prove it is exhausted.
The five planes in one glance
ContextOS decomposes the harness so ownership and failure modes stay legible:
| Plane | What you implement | Spec |
|---|---|---|
| Intelligence | Evidence, identity, memory recall, ontology-bound refs | Memory, Identity |
| Context | Pack → CompiledContext (compiler stages, budgets, manifests) | Cognitive Core, Context Pack |
| Decision | Planner → verify → execute → score → consolidate | Orchestration |
| Action | Typed tools, gateway, approval binding | Adapter Mesh |
| Trust | Policy, evaluators, replay, improvement | Governance, Evaluation |
Cross-cutting types (RunContext, RunBudget, ApprovalMode, CompiledContext, DecisionRecord) are the wire format between planes. Treat them like protobuf between services: stable, versioned, boring.
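A rough TypeScript sketch helps make the "wire format" framing concrete. Field names beyond those mentioned in this post are illustrative assumptions, not the ContextOS spec:

```typescript
// Sketch only: intermediate approval tiers and field names are assumptions,
// not the ContextOS contract.
type ApprovalMode = "read_only" | "reversible" | "destructive";

interface RunBudget {
  maxTokensPerBucket: Record<string, number>; // e.g. { policy: 1500, evidence: 4000 }
  maxToolCalls: number;
  wallClockMs: number;
  costCapUsd: number;
}

interface RunContext {
  runId: string;
  traceId: string;        // the join key for every span, tool call, and DecisionRecord
  tenantId: string;
  intent: string;         // canonical intent string, e.g. "support.refund"
  userDelegation: string; // who authorized this run
  agentIdentity: string;  // workload identity of the agent itself
  locale: string;
  safetyMode: "standard" | "restricted";
  budget: RunBudget;
}
```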
The canonical pipeline (what you are implementing)
The runtime contract is intentionally boring: one entry point, one loop shape, one outcome envelope.
```
invokeAgent(request_envelope, run_context)
  → compile(packs, request, run_context) → CompiledContext
  → loop {
      planner(CompiledContext) → Plan
      critic.verify(Plan) → ok | replan | reject
      executor(Plan, ToolGateway) → step_results, evidence
      critic.score(step_results) → accept | retry | replan | escalate
      consolidate(effects, evidence) → memory_proposals
    }
  → DecisionRecord(evidence_refs, approvals, controls_active, trace_id)
```

That is the canonical execution contract in API Contracts. Your implementation should be traceable to each step.
Sanity check: if critic.verify does not run before side effects, you have a script with extra LLM calls, not a harness. The mechanical story lives in The Critic: verify, score, consolidate.
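A minimal TypeScript skeleton of that loop shape, with verify as the interception point before any side effect. The interfaces here are illustrative assumptions, not the ContextOS API:

```typescript
// Minimal loop skeleton. All interfaces are illustrative assumptions; the point
// is structural: no tool call happens until the plan passes verify.
type Verdict = "ok" | "replan" | "reject";
type ScoreVerdict = "accept" | "retry" | "replan" | "escalate";

interface Plan { steps: unknown[] }
interface StepResult { evidenceRefs: string[] }
interface Planner { propose(ctx: unknown): Promise<Plan> }
interface Executor { execute(plan: Plan, gateway: unknown): Promise<StepResult[]> }
interface Critic {
  verify(plan: Plan, ctx: unknown): Promise<Verdict>;
  score(results: StepResult[], ctx: unknown): Promise<ScoreVerdict>;
  consolidate(results: StepResult[]): Promise<void>;
}

async function runDecisionLoop(
  compiled: unknown,
  deps: { planner: Planner; critic: Critic; executor: Executor; gateway: unknown },
  budget: { maxIterations: number },
): Promise<{ outcome: string; evidenceRefs: string[] }> {
  for (let i = 0; i < budget.maxIterations; i++) {
    const plan = await deps.planner.propose(compiled);

    // Interception point: verify refuses a bad plan before any side effect.
    const verdict = await deps.critic.verify(plan, compiled);
    if (verdict === "reject") return { outcome: "rejected", evidenceRefs: [] };
    if (verdict === "replan") continue;

    const results = await deps.executor.execute(plan, deps.gateway);
    const score = await deps.critic.score(results, compiled);
    if (score === "accept") {
      await deps.critic.consolidate(results); // memory proposals, not silent writes
      return { outcome: "accepted", evidenceRefs: results.flatMap((r) => r.evidenceRefs) };
    }
    if (score === "escalate") return { outcome: "escalated", evidenceRefs: [] };
    // "retry" and "replan" fall through to the next iteration.
  }
  return { outcome: "budget_exhausted", evidenceRefs: [] };
}
```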
Weak team habits versus strong team habits
| Weak habit | Strong habit |
|---|---|
| “We will add evals after launch” | Goldens and shadow scorecards before any customer traffic |
| Tools as ad-hoc fetch wrappers | Manifests, schemas, approval modes, idempotency keys |
| Chat transcript as audit log | DecisionRecord + evidence refs + replay |
| One giant prompt file in the repo | Versioned pack + policy + tool tuples |
| Model upgrade as first lever | Harness tuple change with replay diff |
| “The model should know the policy” | Policy engine at deterministic boundaries |
| Final-answer evals only | Trace grading + trajectory checks + held-out release set |
| Tool docs as afterthought | Agent-computer interface designed like an API |
| Rollout = deploy Friday | Staged harness rollout + kill switch + pinned rollback |
Phase 0 — Name the work: intent, task contract, and blast radius
Before you touch retrieval or tools, pin what this agent is allowed to be responsible for.
- Intent — Register a canonical intent string (for example `support.refund`) in your Intent–Task Catalog. Intents are the join key for policy, metrics, evaluators, and rollout gates.
- Task contract — Inputs, outputs, evidence requirements, escalation paths, and “must never” clauses. Vague contracts force the model to improvise; improvisation is where policy violations hide.
- Blast radius — Classify side effects. Annoying wrong answers still need traces; money-moving wrong answers need approval modes, gates, and replay from day one.
Common failure: ten intents share one prompt and one tool list. You cannot roll back support.refund without accidentally changing support.address_change.
Done when: a reviewer can read the intent page and predict which approval tier applies without reading your code.
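One way to keep the intent page predictable is to store catalog entries as typed records. The shape below is a hypothetical sketch, not the ContextOS schema; the "gated" tier in particular is an assumption:

```typescript
// Hypothetical Intent–Task Catalog entry; field names are illustrative.
interface IntentEntry {
  intent: string; // canonical join key, e.g. "support.refund"
  taskContract: {
    inputs: string[];
    outputs: string[];
    evidenceRequired: string[];
    escalation: string;
    mustNever: string[];
  };
  blastRadius: "annoying_if_wrong" | "money_moving" | "irreversible";
  approvalTier: "read_only" | "gated" | "destructive"; // "gated" is an assumed middle tier
}

const supportRefund: IntentEntry = {
  intent: "support.refund",
  taskContract: {
    inputs: ["order_id", "customer_message"],
    outputs: ["refund_decision", "customer_reply"],
    evidenceRequired: ["order_record", "payment_status", "refund_policy_rule"],
    escalation: "human_review_queue",
    mustNever: ["issue a refund above the policy cap without an approval gate"],
  },
  blastRadius: "money_moving",
  approvalTier: "gated",
};
```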
Phase 1 — RunContext: carry identity, budget, and safety once
Every envelope carries (or references) the same RunContext family of facts: tenant, user delegation, agent workload identity, intent, locale, safety mode, and RunBudget (tokens per bucket, max tool calls, wall clock, cost caps).
This sounds bureaucratic until you debug your first production incident without it. The RunContext is how you answer: who authorized this, what budget was in force, why did the gateway deny the call, and can we reproduce the decision exactly.
Common failure: tenant_id exists only in the HTTP handler while tools use a different implicit scope. The gateway cannot enforce what the runtime never sees.
Done when: every span, tool call, and final DecisionRecord joins on trace_id and run_id without ad-hoc logging fields.
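A small sketch of how those join keys can ride on spans using the OpenTelemetry JS API. The attribute names are placeholders; align them with your GenAI semantic-convention mapping rather than copying them literally:

```typescript
// Every child span (compile, planner, tool call, critic) inherits this active
// context, so the DecisionRecord can later join on the same trace_id + run_id.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-harness");

async function invokeWithTracing(runCtx: { runId: string; tenantId: string; intent: string }) {
  return tracer.startActiveSpan("invoke_agent", async (span) => {
    try {
      span.setAttribute("run.id", runCtx.runId);       // placeholder attribute names
      span.setAttribute("tenant.id", runCtx.tenantId);
      span.setAttribute("intent", runCtx.intent);
      const traceId = span.spanContext().traceId;      // persist this on the DecisionRecord
      return { traceId };
    } finally {
      span.end();
    }
  });
}
```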
Phase 2 — Context plane: ship a Context Pack, not a mega-prompt
The Context Pack is the compiler input that turns “everything we know about this request” into a CompiledContext the planner can consume. The compiler pipeline is multi-stage by design: intent materialization, policy surfacing, tool surfacing, evidence binding, memory recall, token budgeting, bucket assembly, manifests.
You do not need every layer perfect on day one. You do need version pins and a compile step that fails closed when compatibility breaks.
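A sketch of what "a release is a tuple of pins" and "compile fails closed" can look like, with assumed names and a toy compatibility matrix:

```typescript
// Illustrative only: a release is a tuple of artifact pins, and compile refuses
// to proceed when the pins are not known to be mutually compatible.
interface ReleaseTuple {
  contextPack: string;  // e.g. "ctxpack.support@5.2.0"
  policyBundle: string; // e.g. "policy.support@3.1.0"
  toolRegistry: string; // e.g. "tools.support@2.4.0"
  evalBundle: string;   // e.g. "evals.support@1.7.0"
}

function assertCompatible(pins: ReleaseTuple, compatMatrix: Record<string, string[]>): void {
  // Fail closed: an unknown or incompatible combination stops the compile;
  // it does not fall back to "whatever is on main".
  const allowedPolicies = compatMatrix[pins.contextPack] ?? [];
  if (!allowedPolicies.includes(pins.policyBundle)) {
    throw new Error(
      `compile refused: ${pins.contextPack} is not pinned against ${pins.policyBundle}`,
    );
  }
}
```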
Practical path:
- Follow Tutorial: your first Context Pack.
- Run the reference compiler harness locally: `npx tsx src/lib/contextos/test-harness.ts` (see the repo’s `CLAUDE.md`).
- Treat pack, policy, tool registry, and eval bundles as independent version lines—a release is a tuple of pins, not “whatever is on main.”
Common failure: “dynamic context” assembled at runtime with no snapshot id. Replay becomes non-deterministic; compliance cannot freeze what the model saw.
Done when: two engineers can diff pack 5.1.0 vs 5.2.0 and explain behavioral deltas in terms of manifests and policy rule IDs, not vibes.
Memory is part of the context contract, not a sidebar
Promotion-aware memory belongs in the same discipline as packs: what can be recalled, what requires promotion, how contradictions surface. See Memory and the in-depth walkthrough Promotion-aware memory in code. If consolidate emits memory proposals but nothing reviews them, you have recreated “the model remembered something” without an audit trail.
Context budget is a product decision
Do not ask “how much can we fit?” Ask “what is the smallest evidence set that makes the next decision safer?” A production compile should record:
| Context bucket | Keep | Drop or defer |
|---|---|---|
| Policy | Active rules, approval obligations, refusal criteria | Inactive policy prose |
| Evidence | Minimal source refs required for the decision | Full documents unless needed by a tool |
| Memory | Promoted, scoped facts with lineage | Raw conversations, unreviewed notes |
| Tools | Names, schemas, examples, side-effect class | Tools irrelevant to the intent |
| History | Last useful state transition and open obligations | Full chat scrollback |
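As a sketch, a compile snapshot might record per-bucket budget decisions so trims are auditable rather than implicit; the field names below are assumptions:

```typescript
// Hypothetical per-bucket record kept alongside the CompiledContext snapshot.
interface BucketUsage {
  bucket: "policy" | "evidence" | "memory" | "tools" | "history";
  tokenBudget: number;
  tokensUsed: number;
  keptRefs: string[];    // refs that made it into CompiledContext
  droppedRefs: string[]; // candidates trimmed or deferred, ids preserved for audit
}

const exampleSnapshot: BucketUsage[] = [
  { bucket: "policy", tokenBudget: 1200, tokensUsed: 950, keptRefs: ["rule.refund_cap"], droppedRefs: [] },
  { bucket: "evidence", tokenBudget: 4000, tokensUsed: 3100, keptRefs: ["order_4812", "payment_status"], droppedRefs: ["full_order_history"] },
];
```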
Long-running agents need an additional rule: state must survive outside the context window. Use progress files, task queues, replay handles, and compact summaries as harness artifacts rather than hoping the next model turn remembers enough.
Phase 3 — Decision plane: the loop is the product
The decision plane is where “LLM behavior” becomes governed execution:
- Planner proposes steps against the CompiledContext.
- Critic.verify enforces plan-level invariants before expensive or risky work—evidence sufficiency, policy obligations, approval-mode consistency.
- Executor issues tool calls only through the gateway.
- Critic.score decides whether results satisfy the rubric or require retry, replan, or escalation.
- Consolidate emits structured memory proposals instead of silent recall pollution.
Orchestration details and subagent patterns live in Orchestration; the cognitive loop framing lives in Cognitive Core.
Common failure: a single “agent” node that both plans and executes tools in one forward pass. You lose the interception point where verify can refuse a bad plan for free.
Done when: you can disable the planner in a test harness and still prove that critic.verify rejects illegal plans deterministically.
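A sketch of that "done when" check, assuming verify can be exercised as a pure, rule-based function with the planner disabled; tool names and thresholds are illustrative:

```typescript
// Deterministic test: an illegal plan is rejected by verify with no planner or
// model in the loop. Same input, same verdict, every run.
import { strict as assert } from "node:assert";

type PlanCheck = { requiredEvidence: string[]; approvalRequiredAboveUsd: number };
type DraftPlan = { steps: { tool: string; args: Record<string, unknown> }[]; evidenceRefs: string[] };

function verifyPlan(plan: DraftPlan, check: PlanCheck): "ok" | "reject" {
  const missingEvidence = check.requiredEvidence.filter((e) => !plan.evidenceRefs.includes(e));
  const ungatedHighValue = plan.steps.some(
    (s) => s.tool === "refund.issue" && Number(s.args.amountUsd) > check.approvalRequiredAboveUsd,
  );
  return missingEvidence.length > 0 || ungatedHighValue ? "reject" : "ok";
}

const illegalPlan: DraftPlan = {
  steps: [{ tool: "refund.issue", args: { orderId: "4812", amountUsd: 900 } }],
  evidenceRefs: [], // no evidence bound, no approval gate requested
};

assert.equal(
  verifyPlan(illegalPlan, { requiredEvidence: ["order_record"], approvalRequiredAboveUsd: 200 }),
  "reject",
);
```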
Phase 4 — Action plane: Tool Gateway or bust
If tools are “just HTTP calls the model chooses,” you have recreated RPC with extra steps. The Tool Gateway exists to enforce:
- schema correctness,
- capability and side-effect classification,
- tenant isolation and credential handling,
- approval-mode binding (`read_only` through `destructive` per Governance),
- structured outcomes (`success`, `denied`, `gate_pending`, `failed`, `timeout` per API Contracts).
Author tools the way you author APIs: explicit errors, idempotency keys for mutating calls, stable result shapes that include evidence refs the evaluators can grade. Implementation narrative: Build the Tool Gateway.
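As an illustration of that authoring discipline, a manifest entry and a structured outcome type might look like the sketch below. The shapes are assumptions, not the ContextOS manifest schema:

```typescript
// Illustrative manifest for one tool behind the gateway.
const refundLookupOrder = {
  name: "refund.lookup_order",
  sideEffect: "read_only" as const,
  approvalMode: "read_only" as const,
  args: {
    type: "object",
    properties: { orderId: { type: "string" } },
    required: ["orderId"],
    additionalProperties: false,
  },
  examples: [
    { kind: "success", args: { orderId: "4812" } },
    { kind: "denied", args: { orderId: "4812" }, reason: "tenant mismatch" },
    { kind: "malformed", args: { order: 4812 }, reason: "unknown field / wrong type" },
  ],
  doNotUseFor: ["issuing or reversing refunds", "looking up orders in another tenant"],
};

// Structured gateway outcome the evaluators can grade.
type ToolOutcome =
  | { status: "success"; evidenceRefs: string[]; idempotencyKey?: string }
  | { status: "denied"; policyRuleId: string }
  | { status: "gate_pending"; approvalId: string }
  | { status: "failed" | "timeout"; retryable: boolean };
```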
Treat tool definitions as an agent-computer interface:
| Tool surface | What good looks like |
|---|---|
| Name | Verb + noun + domain: refund.lookup_order, not getData |
| Args | Strongly typed, explicit units, no overloaded fields |
| Examples | At least one success, one denial, one malformed-input case |
| Result | Machine-readable status, evidence refs, retryability, side-effect id |
| Boundaries | Clear “do not use this tool for…” language |
| Tests | Golden prompts that verify selection, arguments, denial, and retry behavior |
Poor tool design makes the model look bad. Good tool design makes the safe path easier than the unsafe path.
Security note: the boundary between untrusted user content and privileged tools is a harness problem, not a prompt problem. Read Prompt injection is a boundary problem.
Done when: there is no alternate code path that reaches a payment adapter without passing through the gateway and leaving a toolResult audit trail.
Phase 5 — Trust plane: DecisionRecord as the receipt
The DecisionRecord is the typed outcome of the loop: what happened, what evidence supported it, which controls were active, which approvals were obtained, and which trace to open when something looks wrong.
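An illustrative TypeScript shape for that receipt; the authoritative contract lives in API Contracts, and the field names here are assumptions:

```typescript
// Hypothetical DecisionRecord shape: the receipt, not the chat transcript.
interface DecisionRecord {
  runId: string;
  traceId: string;         // open this trace when the receipt looks wrong
  intent: string;
  outcome: "accepted" | "rejected" | "escalated" | "budget_exhausted";
  evidenceRefs: string[];  // what supported the decision
  controlsActive: string[];// policy rule ids in force at decision time
  approvals: { approvalId: string; approver: string; mode: string }[];
  artifactPins: { contextPack: string; policyBundle: string; toolRegistry: string };
  replayHandle: string;    // enough to reproduce the run against pinned versions
}
```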
If your “agent logs” are chat transcripts, you will not pass a serious security review. If your receipts are DecisionRecords, you can replay. Deep dive: Replay is the real audit log and Replay harness in code.
Done when: compliance can answer “show me the rule and evidence bundle for this action” without a developer grepping unstructured logs.
Phase 6 — Observability and evaluation: scalars and diagnosis
Observable is not “we have Grafana.” Observable means: every decision is traceable end-to-end, and failures emit diagnostic signal for humans and for the improvement loop.
ContextOS standardizes five evaluators (policy, utility, latency, safety, economics) as scorecard dimensions—see Evaluation and Observability. The point is release gating and rollback discipline, not vanity charts. Wiring narrative: Wiring the five evaluators. Dataset discipline: Dataset-first agent engineering and Scorecards over vibes.
Use three layers of evidence:
| Layer | Question | Example gate |
|---|---|---|
| Final output | Did the user-facing answer satisfy the task? | Utility rubric ≥ threshold |
| Trajectory | Did the agent choose the right steps and tools? | Expected tool sequence or LLM trajectory judge passes |
| Trace | Why did the harness accept, retry, escalate, or deny? | Every policy, tool, and critic span is present |
This mirrors the direction of modern agent eval tooling: trace grading while behavior is still being understood, repeatable datasets once the failure modes are known, and trajectory checks when tool use is part of correctness.
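A sketch of a trajectory gate that checks the recorded tool sequence against the expected path for a golden scenario; the span shape and example data are assumptions about your trace store:

```typescript
// Trajectory check: did the run call the expected tools in the expected order?
interface ToolSpan { name: string; startMs: number }

function toolSequence(spans: ToolSpan[]): string[] {
  return [...spans].sort((a, b) => a.startMs - b.startMs).map((s) => s.name);
}

// Example spans as they might be pulled from the trace store by run_id.
const traceSpansForRun: ToolSpan[] = [
  { name: "refund.lookup_order", startMs: 0 },
  { name: "refund.check_policy", startMs: 120 },
  { name: "refund.issue", startMs: 480 },
];

const expected = ["refund.lookup_order", "refund.check_policy", "refund.issue"];
const actual = toolSequence(traceSpansForRun);
const trajectoryOk = actual.length === expected.length && actual.every((n, i) => n === expected[i]);
```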
Common failure: the same golden set is used for tuning and for ship gates. Scorecards stop being honest; see Meta-Harness discussion in Harness Engineering.
Done when: a pack change cannot merge unless golden scenarios move in the intended direction without violating safety and policy guardrails.
Phase 7 — Rollout: staged, kill-switched, replay-verified
Never ship harness changes as binary cutovers. The canonical stages (0%_shadow → 100%) exist so you detect regressions where they are cheap. Every stage needs a kill switch that pins the prior artifact tuple and restores traffic—validated by replay that reproduces prior DecisionRecord semantics.
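A sketch of a staged rollout plan with a pinned rollback target; the stage names, gate wording, and field names are illustrative rather than the canonical spec:

```typescript
// Illustrative rollout plan: each stage names its traffic share and gate, and
// the kill switch restores a pinned prior tuple verified by replay.
const rollout = {
  candidate: { contextPack: "ctxpack.support@5.3.0", policyBundle: "policy.support@3.1.0" },
  rollbackPin: { contextPack: "ctxpack.support@5.2.0", policyBundle: "policy.support@3.1.0" },
  stages: [
    { trafficPct: 0, mode: "shadow", gate: "scorecard deltas within thresholds" },
    { trafficPct: 5, mode: "live", gate: "no policy or safety regressions for 24h" },
    { trafficPct: 25, mode: "live", gate: "operator-correction rate flat or better" },
    { trafficPct: 100, mode: "live", gate: "final sign-off recorded" },
  ],
  killSwitch: {
    action: "restore rollbackPin and drain candidate traffic",
    verifiedBy: "replay reproduces prior DecisionRecord semantics",
  },
};
```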
See the rollout table in Harness Engineering. Code-oriented staging: Pack rollout in five stages.
Done when: on-call can revert a bad pack without redeploying application code.
Phase 8 — Improvement: reviewers, proposals, promotion
The harness is a search target: prompts, retrievers, planner skills, policies, and evaluators are all movable pieces under change control. The Improvement Loop turns operator corrections and failed runs into typed proposals that pass the same gates as code. Operator flywheel: From correction to StrategyRule and Autotune the harness.
Use Reviewer Agents to scale pre-human review—architecture, security, reliability, product, data, cost, compliance—without pretending LLMs replace sign-offs.
Bake the improvement loop into the harness interface:
```yaml
candidate:
  id: ctxpack.support@5.3.0-candidate.7
  changed:
    - context_pack.rules.refund_evidence_order
    - tool_manifest.refund.lookup_order.examples
  generated_from:
    - failed_run: run_2026_05_12_0187
    - operator_correction: corr_241
  must_improve:
    utility.operator_corrected_rate: -0.03
    economics.tool_calls_per_decision: -0.4
  must_not_regress:
    policy.rule_violation_rate: 0
    safety.unsupported_claim_rate: 0
    latency.p95_ms: "+10%"
  evidence:
    search_set: support_refund_search_v12
    release_set: support_refund_release_v8
    replay_bundle: replay_support_refund_2026w19
```

Autotune is allowed to propose candidates. It is not allowed to silently ship them. Promotion still requires scorecard gates, reviewer coverage, and staged rollout.
Done when: your team can show a month of proposals with acceptance or rejection reasons, not a month of Slack threads.
A four-week sequencing sketch (one squad)
This is not the only schedule; it is a dependency-respecting default when the team is small and the intent is real.
| Week | Focus | Exit criteria |
|---|---|---|
| 1 | Intent catalog + RunContext + trace plumbing | Single run visible end-to-end in your APM with stable join keys |
| 2 | First Context Pack pin + compile + empty Planner loop | CompiledContext snapshots stored; compile fails on incompatible pins |
| 3 | Tool Gateway for read-only tools + verify/score on canned plans | No direct adapter calls from model code paths |
| 4 | DecisionRecord persistence + golden evals + shadow rollout | Kill switch tested; replay reproduces at least one golden receipt |
Parallel track (never “later”): failure semantics and typed verdicts—Failure playbooks.
Quality gates by change type
Do not run the same checklist for every change. Gate the artifact that changed:
| Change | Required checks |
|---|---|
| Context Pack | Compile snapshot diff, retrieval coverage, evidence sufficiency, replay |
| Tool manifest | Schema validation, tool-selection evals, denial cases, idempotency tests |
| Policy bundle | Must-allow / must-deny goldens, approval-mode matrix, adversarial examples |
| Planner skill | Trajectory evals, loop-budget checks, stuck-run detection |
| Critic rubric | Judge calibration, disagreement review, held-out release set |
| Model upgrade | Replay diff, cost/latency scorecard, safety and policy hard floors |
| Memory rule | Promotion lineage, contradiction tests, recall precision |
This is how harness engineering stays practical. You keep the gate narrow enough to run often and strict enough to stop the dangerous class of regression.
The eight-property acceptance test (pre-launch audit)
Treat this table as the minimum bar for “production harness.” Each row maps to concrete mechanisms in the spec. Narrative version: Eight property harness audit.
| Property | What “good” looks like |
|---|---|
| Context-aware | CompiledContext reflects intent; irrelevant context is budget-trimmed, not hidden in prose. |
| Policy-governed | Deterministic policy at compile, plan, and execute boundaries. |
| Tool-controlled | Only declared tools; schemas enforced; approval modes bound. |
| Validated | Evaluators gate completion; rubrics versioned like code. |
| Observable | OTEL-first tracing; join keys consistent across planes. |
| Reversible | Idempotency, reversal tokens where applicable, replay and rollback documented. |
| Measurable | Scorecards and business metrics per intent and per artifact tuple. |
| Continuously improving | Proposals, reviewer coverage, disjoint train vs. test for tuning. |
Repo layout: legible to humans and coding agents
Put harness artifacts where CI can validate them and where agents can navigate without secret knowledge:
```
repo/
  AGENTS.md
  ARCHITECTURE.md
  harness/
    packs/
    policies/
    tools/
    evals/
    fixtures/
    validators/
    observability/
    reviewers/
    skills/
    feedback/
```

`AGENTS.md` should be a navigation file, not a prompt dump. See AGENTS.md done right.
Trace template: refund-shaped (where to go for code)
Use this as a narrative checklist for any high-risk intent:
- Inbound `invokeAgent` with pinned pack refs and a populated RunContext.
- Context Pack compiles `CompiledContext` with manifests and active controls.
- Planner proposes a small plan: policy checks, reads, then maybe a destructive step.
- `critic.verify` blocks until evidence obligations are satisfied (see the missing-evidence replan in the refund post).
- Policy requires an approval gate for high-value destructive actions.
- Tool Gateway returns `gate_pending` until approval resolves; then executes with idempotency and frozen evidence hash.
- `DecisionRecord` records evidence refs, approvals, controls, trace id.
- Scorecard runs; replay reproduces the receipt against pinned versions.
Step-by-step code for that path lives in End-to-end refund walkthrough. This article is the checklist and sequencing; that post is the annotated execution.
What to read next
Series
Specs
Runnable reference
Blog series (build order)
External references
- Building effective agents
- Effective context engineering for AI agents
- OpenAI agent evals
- OpenTelemetry GenAI span conventions
If you build in the order contract → context compile → loop → gateway → receipt → observability → rollout → improvement, you spend more time up front and far less time explaining incidents you cannot replay. That trade is the difference between a demo and a product.