Most “agent projects” fail for the same reason most microservice migrations failed a decade ago: the team shipped a shape (a chat box, a chain, a tool list) without shipping the system around it.
In ContextOS terms, the durable system is the harness—the controlled execution environment that decides what the model sees, what it may do, what counts as “done,” and how failures become versioned improvements. The model reasons; the harness governs execution.
This post is an end-to-end engineering path: contracts first, boundaries explicit, artifacts versioned, behavior replayable. It is the map; for a single run with code at every transition, pair it with End-to-end refund: twelve primitives in one production run.
If you want the conceptual spine first, read Harness Engineering. If you want the wire format, read API Contracts. This article connects those into a sequence a team can execute without losing the thread, and it belongs in the broader Agent Engineering series.
What the best current work agrees on
The field has converged on a useful discipline:
- Anthropic’s effective-agent guidance argues for the simplest working architecture first: fixed workflows when the path is known, autonomous agents only when the task genuinely needs flexible tool use and multi-step recovery.
- Anthropic’s context-engineering work treats context as a finite resource: use small, high-signal inputs, just-in-time retrieval, compaction, and focused subagents instead of dumping everything into the model.
- OpenAI’s agent-eval guidance puts traces before repeatable datasets when you are still debugging behavior, then moves to eval runs when you need regression discipline.
- OpenTelemetry’s GenAI semantic conventions now name inference, retrieval, and tool execution as first-class spans; that matters because agent behavior cannot be debugged from final strings alone.
- Meta-Harness makes the strategic point explicit: the harness itself—code, context selection, tools, and traces—is an optimization target, not just glue around a model.
ContextOS turns that research signal into a product rule: do not ship an agent as a prompt. Ship a harness that can be measured, replayed, rolled back, and improved.
TL;DR: the build order
Do these in order. Skipping a step is how you get “it worked in the demo” and “we cannot explain Tuesday.”
| Step | You ship | Before you touch |
|---|---|---|
| 0 | Intent + task contract + blast radius | Retrieval, model choice, “creative” copy |
| 1 | RunContext + trace propagation | Tool schemas |
| 2 | Pinned Context Pack + compile → CompiledContext | The Planner’s personality |
| 3 | Planner + critic.verify + critic.score + consolidate | Raw tool calling from the model |
| 4 | Tool Gateway + adapter manifests | Direct HTTP from agent code |
| 5 | DecisionRecord contract + storage | Chat logs as audit trail |
| 6 | OTEL traces + trace grading + five-evaluator scorecard | Dashboards nobody owns |
| 7 | Feature-flagged rollout stages + kill switch | Big-bang prompt deploys |
| 8 | Improvement loop + reviewer agents | Slack as your change log |
The mental model: agent versus harness
An agent, in the sense product and security care about, is not a prompt. It is a request-scoped execution unit with:
- a declared intent (what class of work this is),
- a RunContext (who, where, budget, safety posture),
- pinned harness artifacts (Context Pack, policies, tools, evaluators),
- a decision loop that produces a typed outcome,
- a Tool Gateway as the only path to external effect,
- a DecisionRecord that is audit-grade and replayable.
Everything else—model choice, retrieval heuristics, planner wording—is inside that frame. When those change, they change as versioned harness candidates, not as silent production mutation. See Harness candidates are model checkpoints and How great AI engineers build agents.
Workflow first, agent second
“Build an agent” is often the wrong first instruction. Choose the smallest runtime shape that can satisfy the scorecard:
| Runtime shape | Use it when | Harness implication |
|---|---|---|
| Single call + retrieval | The task is short, low-risk, and answer-shaped | Still emit RunContext, evaluator result, and trace id |
| Fixed workflow | The steps are known and decomposition improves accuracy | Add deterministic gates between steps |
| Planner / Executor / Critic loop | The system must choose tools or recover from changing evidence | Add loop guards, tool budgets, and critic verdicts |
| Orchestrator + workers | Subtasks are separable and parallel exploration pays off | Give each worker a bounded context and typed return |
| Long-running agent | Work crosses sessions, repos, or days | Persist progress notes, feature queues, tests, and replay handles |
The harness decision is not philosophical. It is empirical: ship the simpler shape until evals and traces prove it is exhausted.
The five planes in one glance
ContextOS decomposes the harness so ownership and failure modes stay legible:
| Plane | What you implement | Spec |
|---|---|---|
| Intelligence | Evidence, identity, memory recall, ontology-bound refs | Memory, Identity |
| Context | Pack → CompiledContext (compiler stages, budgets, manifests) | Cognitive Core, Context Pack |
| Decision | Planner → verify → execute → score → consolidate | Orchestration |
| Action | Typed tools, gateway, approval binding | Adapter Mesh |
| Trust | Policy, evaluators, replay, improvement | Governance, Evaluation |
Cross-cutting types (RunContext, RunBudget, ApprovalMode, CompiledContext, DecisionRecord) are the wire format between planes. Treat them like protobuf between services: stable, versioned, boring.
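A rough TypeScript sketch helps make the "wire format" framing concrete. Field names beyond those mentioned in this post are illustrative assumptions, not the ContextOS spec:

```typescript
// Sketch only: intermediate approval tiers and field names are assumptions,
// not the ContextOS contract.
type ApprovalMode = "read_only" | "reversible" | "destructive";

interface RunBudget {
  maxTokensPerBucket: Record<string, number>; // e.g. { policy: 1500, evidence: 4000 }
  maxToolCalls: number;
  wallClockMs: number;
  costCapUsd: number;
}

interface RunContext {
  runId: string;
  traceId: string;        // the join key for every span, tool call, and DecisionRecord
  tenantId: string;
  intent: string;         // canonical intent string, e.g. "support.refund"
  userDelegation: string; // who authorized this run
  agentIdentity: string;  // workload identity of the agent itself
  locale: string;
  safetyMode: "standard" | "restricted";
  budget: RunBudget;
}
```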
The canonical pipeline (what you are implementing)
The runtime contract is intentionally boring: one entry point, one loop shape, one outcome envelope.
```
invokeAgent(request_envelope, run_context)
  → compile(packs, request, run_context) → CompiledContext
  → loop {
      planner(CompiledContext) → Plan
      critic.verify(Plan) → ok | replan | reject
      executor(Plan, ToolGateway) → step_results, evidence
      critic.score(step_results) → accept | retry | replan | escalate
      consolidate(effects, evidence) → memory_proposals
    }
  → DecisionRecord(evidence_refs, approvals, controls_active, trace_id)
```

That is the canonical execution contract in API Contracts. Your implementation should be traceable to each step.
Sanity check: if critic.verify does not run before side effects, you have a script with extra LLM calls, not a harness. The mechanical story lives in The Critic: verify, score, consolidate.
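A minimal TypeScript skeleton of that loop shape, with verify as the interception point before any side effect. The interfaces here are illustrative assumptions, not the ContextOS API:

```typescript
// Minimal loop skeleton. All interfaces are illustrative assumptions; the point
// is structural: no tool call happens until the plan passes verify.
type Verdict = "ok" | "replan" | "reject";
type ScoreVerdict = "accept" | "retry" | "replan" | "escalate";

interface Plan { steps: unknown[] }
interface StepResult { evidenceRefs: string[] }
interface Planner { propose(ctx: unknown): Promise<Plan> }
interface Executor { execute(plan: Plan, gateway: unknown): Promise<StepResult[]> }
interface Critic {
  verify(plan: Plan, ctx: unknown): Promise<Verdict>;
  score(results: StepResult[], ctx: unknown): Promise<ScoreVerdict>;
  consolidate(results: StepResult[]): Promise<void>;
}

async function runDecisionLoop(
  compiled: unknown,
  deps: { planner: Planner; critic: Critic; executor: Executor; gateway: unknown },
  budget: { maxIterations: number },
): Promise<{ outcome: string; evidenceRefs: string[] }> {
  for (let i = 0; i < budget.maxIterations; i++) {
    const plan = await deps.planner.propose(compiled);

    // Interception point: verify refuses a bad plan before any side effect.
    const verdict = await deps.critic.verify(plan, compiled);
    if (verdict === "reject") return { outcome: "rejected", evidenceRefs: [] };
    if (verdict === "replan") continue;

    const results = await deps.executor.execute(plan, deps.gateway);
    const score = await deps.critic.score(results, compiled);
    if (score === "accept") {
      await deps.critic.consolidate(results); // memory proposals, not silent writes
      return { outcome: "accepted", evidenceRefs: results.flatMap((r) => r.evidenceRefs) };
    }
    if (score === "escalate") return { outcome: "escalated", evidenceRefs: [] };
    // "retry" and "replan" fall through to the next iteration.
  }
  return { outcome: "budget_exhausted", evidenceRefs: [] };
}
```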
Weak team habits versus strong team habits
| Weak habit | Strong habit |
|---|---|
| “We will add evals after launch” | Goldens and shadow scorecards before any customer traffic |
| Tools as ad-hoc fetch wrappers | Manifests, schemas, approval modes, idempotency keys |
| Chat transcript as audit log | DecisionRecord + evidence refs + replay |
| One giant prompt file in the repo | Versioned pack + policy + tool tuples |
| Model upgrade as first lever | Harness tuple change with replay diff |
| “The model should know the policy” | Policy engine at deterministic boundaries |
| Final-answer evals only | Trace grading + trajectory checks + held-out release set |
| Tool docs as afterthought | Agent-computer interface designed like an API |
| Rollout = deploy Friday | Staged harness rollout + kill switch + pinned rollback |
Phase 0 — Name the work: intent, task contract, and blast radius
Before you touch retrieval or tools, pin what this agent is allowed to be responsible for.
- Intent — Register a canonical intent string (for example `support.refund`) in your Intent–Task Catalog. Intents are the join key for policy, metrics, evaluators, and rollout gates.
- Task contract — Inputs, outputs, evidence requirements, escalation paths, and “must never” clauses. Vague contracts force the model to improvise; improvisation is where policy violations hide.
- Blast radius — Classify side effects. Annoying wrong answers still need traces; money-moving wrong answers need approval modes, gates, and replay from day one.
Common failure: ten intents share one prompt and one tool list. You cannot roll back support.refund without accidentally changing support.address_change.
Done when: a reviewer can read the intent page and predict which approval tier applies without reading your code.
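One way to keep the intent page predictable is to store catalog entries as typed records. The shape below is a hypothetical sketch, not the ContextOS schema; the "gated" tier in particular is an assumption:

```typescript
// Hypothetical Intent–Task Catalog entry; field names are illustrative.
interface IntentEntry {
  intent: string; // canonical join key, e.g. "support.refund"
  taskContract: {
    inputs: string[];
    outputs: string[];
    evidenceRequired: string[];
    escalation: string;
    mustNever: string[];
  };
  blastRadius: "annoying_if_wrong" | "money_moving" | "irreversible";
  approvalTier: "read_only" | "gated" | "destructive"; // "gated" is an assumed middle tier
}

const supportRefund: IntentEntry = {
  intent: "support.refund",
  taskContract: {
    inputs: ["order_id", "customer_message"],
    outputs: ["refund_decision", "customer_reply"],
    evidenceRequired: ["order_record", "payment_status", "refund_policy_rule"],
    escalation: "human_review_queue",
    mustNever: ["issue a refund above the policy cap without an approval gate"],
  },
  blastRadius: "money_moving",
  approvalTier: "gated",
};
```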
Phase 1 — RunContext: carry identity, budget, and safety once
Every envelope carries (or references) the same RunContext family of facts: tenant, user delegation, agent workload identity, intent, locale, safety mode, and RunBudget (tokens per bucket, max tool calls, wall clock, cost caps).
This sounds bureaucratic until you debug your first production incident without it. The RunContext is how you answer: who authorized this, what budget was in force, why did the gateway deny the call, and can we reproduce the decision exactly.
Common failure: tenant_id exists only in the HTTP handler while tools use a different implicit scope. The gateway cannot enforce what the runtime never sees.
Done when: every span, tool call, and final DecisionRecord joins on trace_id and run_id without ad-hoc logging fields.
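A small sketch of how those join keys can ride on spans using the OpenTelemetry JS API. The attribute names are placeholders; align them with your GenAI semantic-convention mapping rather than copying them literally:

```typescript
// Every child span (compile, planner, tool call, critic) inherits this active
// context, so the DecisionRecord can later join on the same trace_id + run_id.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-harness");

async function invokeWithTracing(runCtx: { runId: string; tenantId: string; intent: string }) {
  return tracer.startActiveSpan("invoke_agent", async (span) => {
    try {
      span.setAttribute("run.id", runCtx.runId);       // placeholder attribute names
      span.setAttribute("tenant.id", runCtx.tenantId);
      span.setAttribute("intent", runCtx.intent);
      const traceId = span.spanContext().traceId;      // persist this on the DecisionRecord
      return { traceId };
    } finally {
      span.end();
    }
  });
}
```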
Phase 2 — Context plane: ship a Context Pack, not a mega-prompt
The Context Pack is the compiler input that turns “everything we know about this request” into a CompiledContext the planner can consume. The compiler pipeline is multi-stage by design: intent materialization, policy surfacing, tool surfacing, evidence binding, memory recall, token budgeting, bucket assembly, manifests.
You do not need every layer perfect on day one. You do need version pins and a compile step that fails closed when compatibility breaks.
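A sketch of what "a release is a tuple of pins" and "compile fails closed" can look like, with assumed names and a toy compatibility matrix:

```typescript
// Illustrative only: a release is a tuple of artifact pins, and compile refuses
// to proceed when the pins are not known to be mutually compatible.
interface ReleaseTuple {
  contextPack: string;  // e.g. "ctxpack.support@5.2.0"
  policyBundle: string; // e.g. "policy.support@3.1.0"
  toolRegistry: string; // e.g. "tools.support@2.4.0"
  evalBundle: string;   // e.g. "evals.support@1.7.0"
}

function assertCompatible(pins: ReleaseTuple, compatMatrix: Record<string, string[]>): void {
  // Fail closed: an unknown or incompatible combination stops the compile;
  // it does not fall back to "whatever is on main".
  const allowedPolicies = compatMatrix[pins.contextPack] ?? [];
  if (!allowedPolicies.includes(pins.policyBundle)) {
    throw new Error(
      `compile refused: ${pins.contextPack} is not pinned against ${pins.policyBundle}`,
    );
  }
}
```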
Practical path:
- Follow Tutorial: your first Context Pack.
- Run the reference compiler harness locally: `npx tsx src/lib/contextos/test-harness.ts` (see the repo’s `CLAUDE.md`).
- Treat pack, policy, tool registry, and eval bundles as independent version lines—a release is a tuple of pins, not “whatever is on main.”
Common failure: “dynamic context” assembled at runtime with no snapshot id. Replay becomes non-deterministic; compliance cannot freeze what the model saw.
Done when: two engineers can diff pack 5.1.0 vs 5.2.0 and explain behavioral deltas in terms of manifests and policy rule IDs, not vibes.
Memory is part of the context contract, not a sidebar
Promotion-aware memory belongs in the same discipline as packs: what can be recalled, what requires promotion, how contradictions surface. See Memory and the in-depth walkthrough Promotion-aware memory in code. If consolidate emits memory proposals but nothing reviews them, you have recreated “the model remembered something” without an audit trail.
Context budget is a product decision
Do not ask “how much can we fit?” Ask “what is the smallest evidence set that makes the next decision safer?” A production compile should record:
| Context bucket | Keep | Drop or defer |
|---|---|---|
| Policy | Active rules, approval obligations, refusal criteria | Inactive policy prose |
| Evidence | Minimal source refs required for the decision | Full documents unless needed by a tool |
| Memory | Promoted, scoped facts with lineage | Raw conversations, unreviewed notes |
| Tools | Names, schemas, examples, side-effect class | Tools irrelevant to the intent |
| History | Last useful state transition and open obligations | Full chat scrollback |
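As a sketch, a compile snapshot might record per-bucket budget decisions so trims are auditable rather than implicit; the field names below are assumptions:

```typescript
// Hypothetical per-bucket record kept alongside the CompiledContext snapshot.
interface BucketUsage {
  bucket: "policy" | "evidence" | "memory" | "tools" | "history";
  tokenBudget: number;
  tokensUsed: number;
  keptRefs: string[];    // refs that made it into CompiledContext
  droppedRefs: string[]; // candidates trimmed or deferred, ids preserved for audit
}

const exampleSnapshot: BucketUsage[] = [
  { bucket: "policy", tokenBudget: 1200, tokensUsed: 950, keptRefs: ["rule.refund_cap"], droppedRefs: [] },
  { bucket: "evidence", tokenBudget: 4000, tokensUsed: 3100, keptRefs: ["order_4812", "payment_status"], droppedRefs: ["full_order_history"] },
];
```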
Long-running agents need an additional rule: state must survive outside the context window. Use progress files, task queues, replay handles, and compact summaries as harness artifacts rather than hoping the next model turn remembers enough.
Phase 3 — Decision plane: the loop is the product
The decision plane is where “LLM behavior” becomes governed execution:
- Planner proposes steps against the CompiledContext.
- Critic.verify enforces plan-level invariants before expensive or risky work—evidence sufficiency, policy obligations, approval-mode consistency.
- Executor issues tool calls only through the gateway.
- Critic.score decides whether results satisfy the rubric or require retry, replan, or escalation.
- Consolidate emits structured memory proposals instead of silent recall pollution.
Orchestration details and subagent patterns live in Orchestration; the cognitive loop framing lives in Cognitive Core.
Common failure: a single “agent” node that both plans and executes tools in one forward pass. You lose the interception point where verify can refuse a bad plan for free.
Done when: you can disable the planner in a test harness and still prove that critic.verify rejects illegal plans deterministically.
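A sketch of that "done when" check, assuming verify can be exercised as a pure, rule-based function with the planner disabled; tool names and thresholds are illustrative:

```typescript
// Deterministic test: an illegal plan is rejected by verify with no planner or
// model in the loop. Same input, same verdict, every run.
import { strict as assert } from "node:assert";

type PlanCheck = { requiredEvidence: string[]; approvalRequiredAboveUsd: number };
type DraftPlan = { steps: { tool: string; args: Record<string, unknown> }[]; evidenceRefs: string[] };

function verifyPlan(plan: DraftPlan, check: PlanCheck): "ok" | "reject" {
  const missingEvidence = check.requiredEvidence.filter((e) => !plan.evidenceRefs.includes(e));
  const ungatedHighValue = plan.steps.some(
    (s) => s.tool === "refund.issue" && Number(s.args.amountUsd) > check.approvalRequiredAboveUsd,
  );
  return missingEvidence.length > 0 || ungatedHighValue ? "reject" : "ok";
}

const illegalPlan: DraftPlan = {
  steps: [{ tool: "refund.issue", args: { orderId: "4812", amountUsd: 900 } }],
  evidenceRefs: [], // no evidence bound, no approval gate requested
};

assert.equal(
  verifyPlan(illegalPlan, { requiredEvidence: ["order_record"], approvalRequiredAboveUsd: 200 }),
  "reject",
);
```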
Phase 4 — Action plane: Tool Gateway or bust
If tools are “just HTTP calls the model chooses,” you have recreated RPC with extra steps. The Tool Gateway exists to enforce:
- schema correctness,
- capability and side-effect classification,
- tenant isolation and credential handling,
- approval-mode binding (`read_only` through `destructive` per Governance),
- structured outcomes (`success`, `denied`, `gate_pending`, `failed`, `timeout` per API Contracts).
Author tools the way you author APIs: explicit errors, idempotency keys for mutating calls, stable result shapes that include evidence refs the evaluators can grade. Implementation narrative: Build the Tool Gateway.
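As an illustration of that authoring discipline, a manifest entry and a structured outcome type might look like the sketch below. The shapes are assumptions, not the ContextOS manifest schema:

```typescript
// Illustrative manifest for one tool behind the gateway.
const refundLookupOrder = {
  name: "refund.lookup_order",
  sideEffect: "read_only" as const,
  approvalMode: "read_only" as const,
  args: {
    type: "object",
    properties: { orderId: { type: "string" } },
    required: ["orderId"],
    additionalProperties: false,
  },
  examples: [
    { kind: "success", args: { orderId: "4812" } },
    { kind: "denied", args: { orderId: "4812" }, reason: "tenant mismatch" },
    { kind: "malformed", args: { order: 4812 }, reason: "unknown field / wrong type" },
  ],
  doNotUseFor: ["issuing or reversing refunds", "looking up orders in another tenant"],
};

// Structured gateway outcome the evaluators can grade.
type ToolOutcome =
  | { status: "success"; evidenceRefs: string[]; idempotencyKey?: string }
  | { status: "denied"; policyRuleId: string }
  | { status: "gate_pending"; approvalId: string }
  | { status: "failed" | "timeout"; retryable: boolean };
```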
Treat tool definitions as an agent-computer interface:
| Tool surface | What good looks like |
|---|---|
| Name | Verb + noun + domain: refund.lookup_order, not getData |
| Args | Strongly typed, explicit units, no overloaded fields |
| Examples | At least one success, one denial, one malformed-input case |
| Result | Machine-readable status, evidence refs, retryability, side-effect id |
| Boundaries | Clear “do not use this tool for…” language |
| Tests | Golden prompts that verify selection, arguments, denial, and retry behavior |
Poor tool design makes the model look bad. Good tool design makes the safe path easier than the unsafe path.
Security note: the boundary between untrusted user content and privileged tools is a harness problem, not a prompt problem. Read Prompt injection is a boundary problem.
Done when: there is no alternate code path that reaches a payment adapter without passing through the gateway and leaving a toolResult audit trail.
Phase 5 — Trust plane: DecisionRecord as the receipt
The DecisionRecord is the typed outcome of the loop: what happened, what evidence supported it, which controls were active, which approvals were obtained, and which trace to open when something looks wrong.
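An illustrative TypeScript shape for that receipt; the authoritative contract lives in API Contracts, and the field names here are assumptions:

```typescript
// Hypothetical DecisionRecord shape: the receipt, not the chat transcript.
interface DecisionRecord {
  runId: string;
  traceId: string;         // open this trace when the receipt looks wrong
  intent: string;
  outcome: "accepted" | "rejected" | "escalated" | "budget_exhausted";
  evidenceRefs: string[];  // what supported the decision
  controlsActive: string[];// policy rule ids in force at decision time
  approvals: { approvalId: string; approver: string; mode: string }[];
  artifactPins: { contextPack: string; policyBundle: string; toolRegistry: string };
  replayHandle: string;    // enough to reproduce the run against pinned versions
}
```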
If your “agent logs” are chat transcripts, you will not pass a serious security review. If your receipts are DecisionRecords, you can replay. Deep dive: Replay is the real audit log and Replay harness in code.
Done when: compliance can answer “show me the rule and evidence bundle for this action” without a developer grepping unstructured logs.
Phase 6 — Observability and evaluation: scalars and diagnosis
Observable is not “we have Grafana.” Observable means: every decision is traceable end-to-end, and failures emit diagnostic signal for humans and for the improvement loop.
ContextOS standardizes five evaluators (policy, utility, latency, safety, economics) as scorecard dimensions—see Evaluation and Observability. The point is release gating and rollback discipline, not vanity charts. Wiring narrative: Wiring the five evaluators. Dataset discipline: Dataset-first agent engineering and Scorecards over vibes.
Use three layers of evidence:
| Layer | Question | Example gate |
|---|---|---|
| Final output | Did the user-facing answer satisfy the task? | Utility rubric ≥ threshold |
| Trajectory | Did the agent choose the right steps and tools? | Expected tool sequence or LLM trajectory judge passes |
| Trace | Why did the harness accept, retry, escalate, or deny? | Every policy, tool, and critic span is present |
This mirrors the direction of modern agent eval tooling: trace grading while behavior is still being understood, repeatable datasets once the failure modes are known, and trajectory checks when tool use is part of correctness.
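A sketch of a trajectory gate that checks the recorded tool sequence against the expected path for a golden scenario; the span shape and example data are assumptions about your trace store:

```typescript
// Trajectory check: did the run call the expected tools in the expected order?
interface ToolSpan { name: string; startMs: number }

function toolSequence(spans: ToolSpan[]): string[] {
  return [...spans].sort((a, b) => a.startMs - b.startMs).map((s) => s.name);
}

// Example spans as they might be pulled from the trace store by run_id.
const traceSpansForRun: ToolSpan[] = [
  { name: "refund.lookup_order", startMs: 0 },
  { name: "refund.check_policy", startMs: 120 },
  { name: "refund.issue", startMs: 480 },
];

const expected = ["refund.lookup_order", "refund.check_policy", "refund.issue"];
const actual = toolSequence(traceSpansForRun);
const trajectoryOk = actual.length === expected.length && actual.every((n, i) => n === expected[i]);
```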
Common failure: the same golden set is used for tuning and for ship gates. Scorecards stop being honest; see Meta-Harness discussion in Harness Engineering.
Done when: a pack change cannot merge unless golden scenarios move in the intended direction without violating safety and policy guardrails.
Phase 7 — Rollout: staged, kill-switched, replay-verified
Never ship harness changes as binary cutovers. The canonical stages (0%_shadow → 100%) exist so you detect regressions where they are cheap. Every stage needs a kill switch that pins the prior artifact tuple and restores traffic—validated by replay that reproduces prior DecisionRecord semantics.
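A sketch of a staged rollout plan with a pinned rollback target; the stage names, gate wording, and field names are illustrative rather than the canonical spec:

```typescript
// Illustrative rollout plan: each stage names its traffic share and gate, and
// the kill switch restores a pinned prior tuple verified by replay.
const rollout = {
  candidate: { contextPack: "ctxpack.support@5.3.0", policyBundle: "policy.support@3.1.0" },
  rollbackPin: { contextPack: "ctxpack.support@5.2.0", policyBundle: "policy.support@3.1.0" },
  stages: [
    { trafficPct: 0, mode: "shadow", gate: "scorecard deltas within thresholds" },
    { trafficPct: 5, mode: "live", gate: "no policy or safety regressions for 24h" },
    { trafficPct: 25, mode: "live", gate: "operator-correction rate flat or better" },
    { trafficPct: 100, mode: "live", gate: "final sign-off recorded" },
  ],
  killSwitch: {
    action: "restore rollbackPin and drain candidate traffic",
    verifiedBy: "replay reproduces prior DecisionRecord semantics",
  },
};
```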
See the rollout table in Harness Engineering. Code-oriented staging: Pack rollout in five stages.
Done when: on-call can revert a bad pack without redeploying application code.
Phase 8 — Improvement: reviewers, proposals, promotion
The harness is a search target: prompts, retrievers, planner skills, policies, and evaluators are all movable pieces under change control. The Improvement Loop turns operator corrections and failed runs into typed proposals that pass the same gates as code. Operator flywheel: From correction to StrategyRule and Autotune the harness.
Use Reviewer Agents to scale pre-human review—architecture, security, reliability, product, data, cost, compliance—without pretending LLMs replace sign-offs.
Bake the improvement loop into the harness interface:
```yaml
candidate:
  id: ctxpack.support@5.3.0-candidate.7
  changed:
    - context_pack.rules.refund_evidence_order
    - tool_manifest.refund.lookup_order.examples
  generated_from:
    - failed_run: run_2026_05_12_0187
    - operator_correction: corr_241
  must_improve:
    utility.operator_corrected_rate: -0.03
    economics.tool_calls_per_decision: -0.4
  must_not_regress:
    policy.rule_violation_rate: 0
    safety.unsupported_claim_rate: 0
    latency.p95_ms: "+10%"
  evidence:
    search_set: support_refund_search_v12
    release_set: support_refund_release_v8
    replay_bundle: replay_support_refund_2026w19
```

Autotune is allowed to propose candidates. It is not allowed to silently ship them. Promotion still requires scorecard gates, reviewer coverage, and staged rollout.
Done when: your team can show a month of proposals with acceptance or rejection reasons, not a month of Slack threads.
A four-week sequencing sketch (one squad)
This is not the only schedule; it is a dependency-respecting default when the team is small and the intent is real.
| Week | Focus | Exit criteria |
|---|---|---|
| 1 | Intent catalog + RunContext + trace plumbing | Single run visible end-to-end in your APM with stable join keys |
| 2 | First Context Pack pin + compile + empty Planner loop | CompiledContext snapshots stored; compile fails on incompatible pins |
| 3 | Tool Gateway for read-only tools + verify/score on canned plans | No direct adapter calls from model code paths |
| 4 | DecisionRecord persistence + golden evals + shadow rollout | Kill switch tested; replay reproduces at least one golden receipt |
Parallel track (never “later”): failure semantics and typed verdicts—Failure playbooks.
Quality gates by change type
Do not run the same checklist for every change. Gate the artifact that changed:
| Change | Required checks |
|---|---|
| Context Pack | Compile snapshot diff, retrieval coverage, evidence sufficiency, replay |
| Tool manifest | Schema validation, tool-selection evals, denial cases, idempotency tests |
| Policy bundle | Must-allow / must-deny goldens, approval-mode matrix, adversarial examples |
| Planner skill | Trajectory evals, loop-budget checks, stuck-run detection |
| Critic rubric | Judge calibration, disagreement review, held-out release set |
| Model upgrade | Replay diff, cost/latency scorecard, safety and policy hard floors |
| Memory rule | Promotion lineage, contradiction tests, recall precision |
This is how harness engineering stays practical. You keep the gate narrow enough to run often and strict enough to stop the dangerous class of regression.
The eight-property acceptance test (pre-launch audit)
Treat this table as the minimum bar for “production harness.” Each row maps to concrete mechanisms in the spec. Narrative version: Eight property harness audit.
| Property | What “good” looks like |
|---|---|
| Context-aware | CompiledContext reflects intent; irrelevant context is budget-trimmed, not hidden in prose. |
| Policy-governed | Deterministic policy at compile, plan, and execute boundaries. |
| Tool-controlled | Only declared tools; schemas enforced; approval modes bound. |
| Validated | Evaluators gate completion; rubrics versioned like code. |
| Observable | OTEL-first tracing; join keys consistent across planes. |
| Reversible | Idempotency, reversal tokens where applicable, replay and rollback documented. |
| Measurable | Scorecards and business metrics per intent and per artifact tuple. |
| Continuously improving | Proposals, reviewer coverage, disjoint train vs. test for tuning. |
Repo layout: legible to humans and coding agents
Put harness artifacts where CI can validate them and where agents can navigate without secret knowledge:
```
repo/
  AGENTS.md
  ARCHITECTURE.md
  harness/
    packs/
    policies/
    tools/
    evals/
    fixtures/
    validators/
    observability/
    reviewers/
    skills/
    feedback/
```

`AGENTS.md` should be a navigation file, not a prompt dump. See AGENTS.md done right.
Trace template: refund-shaped (where to go for code)
Use this as a narrative checklist for any high-risk intent:
- Inbound `invokeAgent` with pinned pack refs and a populated RunContext.
- Context Pack compiles `CompiledContext` with manifests and active controls.
- Planner proposes a small plan: policy checks, reads, then maybe a destructive step.
- `critic.verify` blocks until evidence obligations are satisfied (see the missing-evidence replan in the refund post).
- Policy requires an approval gate for high-value destructive actions.
- Tool Gateway returns `gate_pending` until approval resolves; then executes with idempotency and frozen evidence hash.
- `DecisionRecord` records evidence refs, approvals, controls, trace id.
- Scorecard runs; replay reproduces the receipt against pinned versions.
Step-by-step code for that path lives in End-to-end refund walkthrough. This article is the checklist and sequencing; that post is the annotated execution.
What to read next
Series
Specs
Runnable reference
Blog series (build order)
External references
- Building effective agents
- Effective context engineering for AI agents
- OpenAI agent evals
- OpenTelemetry GenAI span conventions
If you build in the order contract → context compile → loop → gateway → receipt → observability → rollout → improvement, you spend more time up front and far less time explaining incidents you cannot replay. That trade is the difference between a demo and a product.