Agent engineering series
May 12, 2026
by Piyush · 17 min read

How to Develop an Agent with an Agent Harness, End to End

Tags: ContextOS · Harness Engineering · Agents · Context Packs · Production

Most “agent projects” fail for the same reason most microservice migrations failed a decade ago: the team shipped a shape (a chat box, a chain, a tool list) without shipping the system around it.

In ContextOS terms, the durable system is the harness—the controlled execution environment that decides what the model sees, what it may do, what counts as “done,” and how failures become versioned improvements. The model reasons; the harness governs execution.

This post is an end-to-end engineering path: contracts first, boundaries explicit, artifacts versioned, behavior replayable. It is the map; for a single run with code at every transition, pair it with End-to-end refund: twelve primitives in one production run.

If you want the conceptual spine first, read Harness Engineering. If you want the wire format, read API Contracts. This article connects those into a sequence a team can execute without losing the thread, and it belongs in the broader Agent Engineering series.

What the best current work agrees on

The field has converged on a useful discipline:

  • Anthropic’s effective-agent guidance argues for the simplest working architecture first: fixed workflows when the path is known, autonomous agents only when the task genuinely needs flexible tool use and multi-step recovery.
  • Anthropic’s context-engineering work treats context as a finite resource: use small, high-signal inputs, just-in-time retrieval, compaction, and focused subagents instead of dumping everything into the model.
  • OpenAI’s agent-eval guidance puts traces before repeatable datasets when you are still debugging behavior, then moves to eval runs when you need regression discipline.
  • OpenTelemetry’s GenAI semantic conventions now name inference, retrieval, and tool execution as first-class spans; that matters because agent behavior cannot be debugged from final strings alone.
  • Meta-Harness makes the strategic point explicit: the harness itself—code, context selection, tools, and traces—is an optimization target, not just glue around a model.

ContextOS turns that research signal into a product rule: do not ship an agent as a prompt. Ship a harness that can be measured, replayed, rolled back, and improved.

TL;DR: the build order

Do these in order. Skipping a step is how you get “it worked in the demo” and “we cannot explain Tuesday.”

| Step | You ship | Before you touch |
| --- | --- | --- |
| 0 | Intent + task contract + blast radius | Retrieval, model choice, “creative” copy |
| 1 | RunContext + trace propagation | Tool schemas |
| 2 | Pinned Context Pack + compile → CompiledContext | The Planner’s personality |
| 3 | Planner + critic.verify + critic.score + consolidate | Raw tool calling from the model |
| 4 | Tool Gateway + adapter manifests | Direct HTTP from agent code |
| 5 | DecisionRecord contract + storage | Chat logs as audit trail |
| 6 | OTEL traces + trace grading + five-evaluator scorecard | Dashboards nobody owns |
| 7 | Feature-flagged rollout stages + kill switch | Big-bang prompt deploys |
| 8 | Improvement loop + reviewer agents | Slack as your change log |
Field rule
If you cannot point to the code, score, and trace for each box, you do not have a harness yet.
Compile, verify, gateway dispatch, receipt, replay, rollback, and improvement are not documentation metaphors. They are deployable modules with tests and release gates.

The mental model: agent versus harness

An agent, in the sense product and security care about, is not a prompt. It is a request-scoped execution unit with:

  • a declared intent (what class of work this is),
  • a RunContext (who, where, budget, safety posture),
  • pinned harness artifacts (Context Pack, policies, tools, evaluators),
  • a decision loop that produces a typed outcome,
  • a Tool Gateway as the only path to external effect,
  • a DecisionRecord that is audit-grade and replayable.

Everything else—model choice, retrieval heuristics, planner wording—is inside that frame. When those change, they change as versioned harness candidates, not as silent production mutation. See Harness candidates are model checkpoints and How great AI engineers build agents.

Workflow first, agent second

“Build an agent” is often the wrong first instruction. Choose the smallest runtime shape that can satisfy the scorecard:

| Runtime shape | Use it when | Harness implication |
| --- | --- | --- |
| Single call + retrieval | The task is short, low-risk, and answer-shaped | Still emit RunContext, evaluator result, and trace id |
| Fixed workflow | The steps are known and decomposition improves accuracy | Add deterministic gates between steps |
| Planner / Executor / Critic loop | The system must choose tools or recover from changing evidence | Add loop guards, tool budgets, and critic verdicts |
| Orchestrator + workers | Subtasks are separable and parallel exploration pays off | Give each worker a bounded context and typed return |
| Long-running agent | Work crosses sessions, repos, or days | Persist progress notes, feature queues, tests, and replay handles |

The harness decision is not philosophical. It is empirical: ship the simpler shape until evals and traces prove it is exhausted.

The five planes in one glance

ContextOS decomposes the harness so ownership and failure modes stay legible:

| Plane | What you implement | Spec |
| --- | --- | --- |
| Intelligence | Evidence, identity, memory recall, ontology-bound refs | Memory, Identity |
| Context | Pack → CompiledContext (compiler stages, budgets, manifests) | Cognitive Core, Context Pack |
| Decision | Planner → verify → execute → score → consolidate | Orchestration |
| Action | Typed tools, gateway, approval binding | Adapter Mesh |
| Trust | Policy, evaluators, replay, improvement | Governance, Evaluation |

Cross-cutting types (RunContext, RunBudget, ApprovalMode, CompiledContext, DecisionRecord) are the wire format between planes. Treat them like protobuf between services: stable, versioned, boring.
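Treated as code, those cross-cutting types might look like the following TypeScript sketch. Field names and shapes here are illustrative assumptions for exposition, not the ContextOS wire format; only the type names come from the article.

```typescript
// Illustrative sketch of the cross-cutting types named above.
// Field names and shapes are assumptions, not the spec.

// "read_only" and "destructive" are named in this article; the
// intermediate tiers here are assumed.
type ApprovalMode = "read_only" | "reversible" | "gated" | "destructive";

interface RunBudget {
  maxTokensPerBucket: number;
  maxToolCalls: number;
  wallClockMs: number;
  costCapUsd: number;
}

interface RunContext {
  tenantId: string;
  userDelegation: string; // who authorized this run
  agentIdentity: string;  // agent workload identity, not a user
  intent: string;         // e.g. "support.refund"
  locale: string;
  safetyMode: "standard" | "restricted";
  budget: RunBudget;
  traceId: string;
  runId: string;
}

interface DecisionRecord {
  runId: string;
  traceId: string;
  outcome: "accept" | "escalate" | "deny";
  evidenceRefs: string[];
  approvals: string[];
  controlsActive: string[];
}
```

The point of the "protobuf between services" framing is that these shapes change rarely and deliberately, so every plane can join on them.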

The canonical pipeline (what you are implementing)

The runtime contract is intentionally boring: one entry point, one loop shape, one outcome envelope.

invokeAgent(request_envelope, run_context)
  → compile(packs, request, run_context) → CompiledContext
  → loop {
       planner(CompiledContext)         → Plan
       critic.verify(Plan)              → ok | replan | reject
       executor(Plan, ToolGateway)      → step_results, evidence
       critic.score(step_results)       → accept | retry | replan | escalate
       consolidate(effects, evidence)   → memory_proposals
     }
  → DecisionRecord(evidence_refs, approvals, controls_active, trace_id)

That is the canonical execution contract in API Contracts. Your implementation should be traceable to each step.

Sanity check: if critic.verify does not run before side effects, you have a script with extra LLM calls, not a harness. The mechanical story lives in The Critic: verify, score, consolidate.
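As a minimal sketch, the loop can be written so that verify is structurally unavoidable before any side effect. All interfaces below are illustrative stand-ins for the contracts named earlier, not the ContextOS API.

```typescript
// Minimal loop skeleton: verify gates execution, score gates completion.
// Interfaces are illustrative stand-ins, not the ContextOS contracts.

type Verdict = "ok" | "replan" | "reject";
type Score = "accept" | "retry" | "replan" | "escalate";

interface Plan { steps: string[] }

interface Harness {
  planner(ctx: object): Plan;
  verify(plan: Plan): Verdict;
  execute(plan: Plan): { results: string[] };
  score(results: string[]): Score;
}

function runLoop(h: Harness, ctx: object, maxIters = 3): Score {
  for (let i = 0; i < maxIters; i++) {
    const plan = h.planner(ctx);
    const verdict = h.verify(plan); // gate BEFORE any side effects
    if (verdict === "reject") return "escalate";
    if (verdict === "replan") continue;
    const { results } = h.execute(plan); // only reached after verify passes
    const score = h.score(results);
    if (score === "accept" || score === "escalate") return score;
    // "retry" and "replan" fall through to the next iteration
  }
  return "escalate"; // loop budget exhausted
}
```

Because `execute` is only reachable through `verify`, there is no code path where the model's plan touches a tool without an interception point.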

Weak team habits versus strong team habits

| Weak habit | Strong habit |
| --- | --- |
| “We will add evals after launch” | Goldens and shadow scorecards before any customer traffic |
| Tools as ad-hoc fetch wrappers | Manifests, schemas, approval modes, idempotency keys |
| Chat transcript as audit log | DecisionRecord + evidence refs + replay |
| One giant prompt file in the repo | Versioned pack + policy + tool tuples |
| Model upgrade as first lever | Harness tuple change with replay diff |
| “The model should know the policy” | Policy engine at deterministic boundaries |
| Final-answer evals only | Trace grading + trajectory checks + held-out release set |
| Tool docs as afterthought | Agent-computer interface designed like an API |
| Rollout = deploy Friday | Staged harness rollout + kill switch + pinned rollback |

Phase 0 — Name the work: intent, task contract, and blast radius

Before you touch retrieval or tools, pin what this agent is allowed to be responsible for.

  1. Intent — Register a canonical intent string (for example support.refund) in your Intent–Task Catalog. Intents are the join key for policy, metrics, evaluators, and rollout gates.
  2. Task contract — Inputs, outputs, evidence requirements, escalation paths, and “must never” clauses. Vague contracts force the model to improvise; improvisation is where policy violations hide.
  3. Blast radius — Classify side effects. Annoying wrong answers still need traces; money-moving wrong answers need approval modes, gates, and replay from day one.

Common failure: ten intents share one prompt and one tool list. You cannot roll back support.refund without accidentally changing support.address_change.

Done when: a reviewer can read the intent page and predict which approval tier applies without reading your code.
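A Phase 0 catalog entry might look like the TypeScript sketch below. Every field name here is an assumption chosen to mirror the three items above; the real Intent–Task Catalog format lives in the spec.

```typescript
// Hypothetical intent catalog entry mirroring Phase 0's three pins:
// intent string, task contract, and blast-radius classification.
// All field names are illustrative assumptions.

const refundIntent = {
  intent: "support.refund", // canonical join key for policy, metrics, gates
  blastRadius: "money_moving" as const,
  contract: {
    inputs: ["order_id", "customer_message"],
    outputs: ["refund_decision", "evidence_bundle"],
    mustNever: ["issue a refund above the policy limit without approval"],
    escalation: "human_support_queue",
  },
  approvalTier: "gated_destructive",
};
```

A reviewer reading this entry can predict the approval tier from `blastRadius` alone, which is exactly the "done when" test.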

Phase 1 — RunContext: carry identity, budget, and safety once

Every envelope carries (or references) the same RunContext family of facts: tenant, user delegation, agent workload identity, intent, locale, safety mode, and RunBudget (tokens per bucket, max tool calls, wall clock, cost caps).

This sounds bureaucratic until you debug your first production incident without it. The RunContext is how you answer: who authorized this, what budget was in force, why did the gateway deny the call, and can we reproduce the decision exactly.

Common failure: tenant_id exists only in the HTTP handler while tools use a different implicit scope. The gateway cannot enforce what the runtime never sees.

Done when: every span, tool call, and final DecisionRecord joins on trace_id and run_id without ad-hoc logging fields.

Phase 2 — Context plane: ship a Context Pack, not a mega-prompt

The Context Pack is the compiler input that turns “everything we know about this request” into a CompiledContext the planner can consume. The compiler pipeline is multi-stage by design: intent materialization, policy surfacing, tool surfacing, evidence binding, memory recall, token budgeting, bucket assembly, manifests.

You do not need every layer perfect on day one. You do need version pins and a compile step that fails closed when compatibility breaks.

Practical path:

  1. Follow Tutorial: your first Context Pack.
  2. Run the reference compiler harness locally: npx tsx src/lib/contextos/test-harness.ts (see the repo’s CLAUDE.md).
  3. Treat pack, policy, tool registry, and eval bundles as independent version lines—a release is a tuple of pins, not “whatever is on main.”

Common failure: “dynamic context” assembled at runtime with no snapshot id. Replay becomes non-deterministic; compliance cannot freeze what the model saw.

Done when: two engineers can diff pack 5.1.0 vs 5.2.0 and explain behavioral deltas in terms of manifests and policy rule IDs, not vibes.
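A fail-closed compile guard over the release tuple can be sketched like this; the pin strings and tuple shape are illustrative, not the ContextOS format.

```typescript
// Illustrative pin-checking guard: a release is a tuple of pins, and
// compile fails closed when any pin is outside the supported set.

interface ReleaseTuple {
  pack: string;   // e.g. "ctxpack.support@5.2.0" (illustrative pin syntax)
  policy: string;
  tools: string;
  evals: string;
}

function assertCompatible(tuple: ReleaseTuple, supported: Set<string>): void {
  for (const pin of Object.values(tuple)) {
    if (!supported.has(pin)) {
      // Fail closed: refuse to compile rather than run with an unknown pin.
      throw new Error(`compile failed closed: unsupported pin ${pin}`);
    }
  }
}
```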

Memory is part of the context contract, not a sidebar

Promotion-aware memory belongs in the same discipline as packs: what can be recalled, what requires promotion, how contradictions surface. See Memory and the in-depth walkthrough Promotion-aware memory in code. If consolidate emits memory proposals but nothing reviews them, you have recreated “the model remembered something” without an audit trail.

Context budget is a product decision

Do not ask “how much can we fit?” Ask “what is the smallest evidence set that makes the next decision safer?” A production compile should record:

| Context bucket | Keep | Drop or defer |
| --- | --- | --- |
| Policy | Active rules, approval obligations, refusal criteria | Inactive policy prose |
| Evidence | Minimal source refs required for the decision | Full documents unless needed by a tool |
| Memory | Promoted, scoped facts with lineage | Raw conversations, unreviewed notes |
| Tools | Names, schemas, examples, side-effect class | Tools irrelevant to the intent |
| History | Last useful state transition and open obligations | Full chat scrollback |

Long-running agents need an additional rule: state must survive outside the context window. Use progress files, task queues, replay handles, and compact summaries as harness artifacts rather than hoping the next model turn remembers enough.
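One way to make the budget decision recordable is a per-bucket trimming pass that defers overflow instead of silently truncating. This sketch assumes a simple token count per item; real budgeting would be tokenizer-aware.

```typescript
// Hypothetical budget-trimming pass over the context buckets above.
// Overflow is deferred and recorded, so the compile snapshot shows
// exactly what was dropped and why.

interface BucketItem { id: string; tokens: number }
interface Bucket { name: string; allowance: number; items: BucketItem[] }

function trimBucket(b: Bucket): { kept: BucketItem[]; deferred: BucketItem[] } {
  const kept: BucketItem[] = [];
  const deferred: BucketItem[] = [];
  let used = 0;
  for (const item of b.items) {
    if (used + item.tokens <= b.allowance) {
      kept.push(item);
      used += item.tokens;
    } else {
      deferred.push(item); // recorded drop, never a silent mid-item truncation
    }
  }
  return { kept, deferred };
}
```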

Phase 3 — Decision plane: the loop is the product

The decision plane is where “LLM behavior” becomes governed execution:

  • Planner proposes steps against the CompiledContext.
  • Critic.verify enforces plan-level invariants before expensive or risky work—evidence sufficiency, policy obligations, approval-mode consistency.
  • Executor issues tool calls only through the gateway.
  • Critic.score decides whether results satisfy the rubric or require retry, replan, or escalation.
  • Consolidate emits structured memory proposals instead of silent recall pollution.

Orchestration details and subagent patterns live in Orchestration; the cognitive loop framing lives in Cognitive Core.

Common failure: a single “agent” node that both plans and executes tools in one forward pass. You lose the interception point where verify can refuse a bad plan for free.

Done when: you can disable the planner in a test harness and still prove that critic.verify rejects illegal plans deterministically.
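A deterministic critic.verify check is just a pure function over the plan, which is what makes "disable the planner and still prove rejection" testable. The rule below (destructive steps require evidence) is one illustrative invariant, not the spec's rule set.

```typescript
// Sketch of a deterministic plan-level invariant: pure function, no model
// call, so illegal plans are rejected the same way on every run.

interface PlanStep { tool: string; sideEffect: "read_only" | "destructive" }
interface Plan { steps: PlanStep[]; evidenceRefs: string[] }

function verify(plan: Plan): "ok" | "reject" {
  const destructive = plan.steps.some((s) => s.sideEffect === "destructive");
  if (destructive && plan.evidenceRefs.length === 0) {
    return "reject"; // destructive work with no evidence obligation satisfied
  }
  return "ok";
}
```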

Phase 4 — Action plane: Tool Gateway or bust

If tools are “just HTTP calls the model chooses,” you have recreated RPC with extra steps. The Tool Gateway exists to enforce:

  • schema correctness,
  • capability and side-effect classification,
  • tenant isolation and credential handling,
  • approval-mode binding (read_only through destructive per Governance),
  • structured outcomes (success, denied, gate_pending, failed, timeout per API Contracts).

Author tools the way you author APIs: explicit errors, idempotency keys for mutating calls, stable result shapes that include evidence refs the evaluators can grade. Implementation narrative: Build the Tool Gateway.

Treat tool definitions as an agent-computer interface:

| Tool surface | What good looks like |
| --- | --- |
| Name | Verb + noun + domain: refund.lookup_order, not getData |
| Args | Strongly typed, explicit units, no overloaded fields |
| Examples | At least one success, one denial, one malformed-input case |
| Result | Machine-readable status, evidence refs, retryability, side-effect id |
| Boundaries | Clear “do not use this tool for…” language |
| Tests | Golden prompts that verify selection, arguments, denial, and retry behavior |

Poor tool design makes the model look bad. Good tool design makes the safe path easier than the unsafe path.
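Put together, a manifest for refund.lookup_order following that table might look like the sketch below. Field names are assumptions, not the ContextOS manifest format.

```typescript
// Illustrative tool manifest following the table above: name, typed args,
// three example classes, boundaries, and side-effect classification.
// Field names are assumptions, not the ContextOS manifest schema.

const lookupOrderManifest = {
  name: "refund.lookup_order", // verb + noun + domain
  sideEffect: "read_only" as const,
  approvalMode: "read_only" as const,
  args: {
    orderId: { type: "string", description: "Canonical order id, e.g. ord_123" },
  },
  examples: [
    { kind: "success", args: { orderId: "ord_123" } },
    { kind: "denial", args: { orderId: "ord_999" }, reason: "tenant_mismatch" },
    { kind: "malformed", args: { orderId: "" }, reason: "empty order id" },
  ],
  boundaries: "Do not use this tool to look up subscriptions or invoices.",
};
```

The three example classes (success, denial, malformed) double as golden prompts for the selection and denial tests in the table.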

Security note: the boundary between untrusted user content and privileged tools is a harness problem, not a prompt problem. Read Prompt injection is a boundary problem.

Done when: there is no alternate code path that reaches a payment adapter without passing through the gateway and leaving a toolResult audit trail.
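The single-chokepoint property can be sketched as one dispatch function that returns the typed outcomes listed earlier. The denial and gating logic here is deliberately simplified; a real gateway would also enforce schemas, tenant isolation, and credential scoping.

```typescript
// Simplified gateway dispatch: undeclared tools are denied, destructive
// calls without approval are gated, and every outcome is typed. Statuses
// mirror the outcome list above.

type ToolStatus = "success" | "denied" | "gate_pending" | "failed" | "timeout";

interface ToolResult {
  status: ToolStatus;
  evidenceRefs: string[];
  idempotencyKey?: string; // required for mutating calls in a real gateway
}

function dispatch(
  tool: string,
  declaredTools: Set<string>,
  approved: boolean,
  destructive: boolean
): ToolResult {
  if (!declaredTools.has(tool)) return { status: "denied", evidenceRefs: [] };
  if (destructive && !approved) return { status: "gate_pending", evidenceRefs: [] };
  // A real adapter call would happen here, with tenant-scoped credentials.
  return {
    status: "success",
    evidenceRefs: [`evid_${tool}`],
    idempotencyKey: destructive ? `idem_${tool}` : undefined,
  };
}
```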

Phase 5 — Trust plane: DecisionRecord as the receipt

The DecisionRecord is the typed outcome of the loop: what happened, what evidence supported it, which controls were active, which approvals were obtained, and which trace to open when something looks wrong.

If your “agent logs” are chat transcripts, you will not pass a serious security review. If your receipts are DecisionRecords, you can replay. Deep dive: Replay is the real audit log and Replay harness in code.

Done when: compliance can answer “show me the rule and evidence bundle for this action” without a developer grepping unstructured logs.

Phase 6 — Observability and evaluation: scalars and diagnosis

Observable is not “we have Grafana.” Observable means: every decision is traceable end-to-end, and failures emit diagnostic signal for humans and for the improvement loop.

ContextOS standardizes five evaluators (policy, utility, latency, safety, economics) as scorecard dimensions—see Evaluation and Observability. The point is release gating and rollback discipline, not vanity charts. Wiring narrative: Wiring the five evaluators. Dataset discipline: Dataset-first agent engineering and Scorecards over vibes.

Use three layers of evidence:

| Layer | Question | Example gate |
| --- | --- | --- |
| Final output | Did the user-facing answer satisfy the task? | Utility rubric ≥ threshold |
| Trajectory | Did the agent choose the right steps and tools? | Expected tool sequence or LLM trajectory judge passes |
| Trace | Why did the harness accept, retry, escalate, or deny? | Every policy, tool, and critic span is present |

This mirrors the direction of modern agent eval tooling: trace grading while behavior is still being understood, repeatable datasets once the failure modes are known, and trajectory checks when tool use is part of correctness.

Common failure: the same golden set is used for tuning and for ship gates. Scorecards stop being honest; see Meta-Harness discussion in Harness Engineering.

Done when: a pack change cannot merge unless golden scenarios move in the intended direction without violating safety and policy guardrails.
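Such a merge gate reduces to a small pure function over before/after scorecards. The dimensions follow the five evaluators named above; the thresholds are illustrative, not recommended values.

```typescript
// Illustrative merge gate: utility must move in the intended direction
// while safety and policy remain hard floors. Thresholds are assumptions.

interface Scorecard {
  policyViolationRate: number;
  utility: number;
  p95LatencyMs: number;
  safetyViolationRate: number;
  costPerDecisionUsd: number;
}

function canMerge(before: Scorecard, after: Scorecard): boolean {
  if (after.policyViolationRate > 0) return false;  // hard floor
  if (after.safetyViolationRate > 0) return false;  // hard floor
  if (after.utility < before.utility) return false; // intended direction
  if (after.p95LatencyMs > before.p95LatencyMs * 1.1) return false; // ≤ +10%
  return true;
}
```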

Phase 7 — Rollout: staged, kill-switched, replay-verified

Never ship harness changes as binary cutovers. The canonical stages (from 0% shadow traffic through 100%) exist so you detect regressions where they are cheap. Every stage needs a kill switch that pins the prior artifact tuple and restores traffic—validated by replay that reproduces prior DecisionRecord semantics.

See the rollout table in Harness Engineering. Code-oriented staging: Pack rollout in five stages.

Done when: on-call can revert a bad pack without redeploying application code.
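That on-call property—revert without redeploying—falls out of keeping the active tuple as runtime state rather than build-time configuration. A minimal sketch, assuming an in-memory controller (a real one would persist the pins and emit an audit event):

```typescript
// Kill-switch sketch: rollback re-pins the prior artifact tuple at
// runtime, no application redeploy. Store shape is illustrative.

interface ArtifactTuple { pack: string; policy: string; tools: string }

class RolloutController {
  private active: ArtifactTuple;

  constructor(private prior: ArtifactTuple, candidate: ArtifactTuple) {
    this.active = candidate; // candidate serves traffic for its stage
  }

  killSwitch(): ArtifactTuple {
    this.active = this.prior; // restore the pinned prior tuple
    return this.active;
  }

  current(): ArtifactTuple {
    return this.active;
  }
}
```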

Phase 8 — Improvement: reviewers, proposals, promotion

The harness is a search target: prompts, retrievers, planner skills, policies, and evaluators are all movable pieces under change control. The Improvement Loop turns operator corrections and failed runs into typed proposals that pass the same gates as code. Operator flywheel: From correction to StrategyRule and Autotune the harness.

Use Reviewer Agents to scale pre-human review—architecture, security, reliability, product, data, cost, compliance—without pretending LLMs replace sign-offs.

Bake the improvement loop into the harness interface:

candidate:
  id: ctxpack.support@5.3.0-candidate.7
  changed:
    - context_pack.rules.refund_evidence_order
    - tool_manifest.refund.lookup_order.examples
  generated_from:
    - failed_run: run_2026_05_12_0187
    - operator_correction: corr_241
  must_improve:
    utility.operator_corrected_rate: -0.03
    economics.tool_calls_per_decision: -0.4
  must_not_regress:
    policy.rule_violation_rate: 0
    safety.unsupported_claim_rate: 0
    latency.p95_ms: "+10%"
  evidence:
    search_set: support_refund_search_v12
    release_set: support_refund_release_v8
    replay_bundle: replay_support_refund_2026w19

Autotune is allowed to propose candidates. It is not allowed to silently ship them. Promotion still requires scorecard gates, reviewer coverage, and staged rollout.

Done when: your team can show a month of proposals with acceptance or rejection reasons, not a month of Slack threads.

A four-week sequencing sketch (one squad)

This is not the only schedule; it is a dependency-respecting default when the team is small and the intent is real.

| Week | Focus | Exit criteria |
| --- | --- | --- |
| 1 | Intent catalog + RunContext + trace plumbing | Single run visible end-to-end in your APM with stable join keys |
| 2 | First Context Pack pin + compile + empty Planner loop | CompiledContext snapshots stored; compile fails on incompatible pins |
| 3 | Tool Gateway for read-only tools + verify/score on canned plans | No direct adapter calls from model code paths |
| 4 | DecisionRecord persistence + golden evals + shadow rollout | Kill switch tested; replay reproduces at least one golden receipt |

Parallel track (never “later”): failure semantics and typed verdicts—Failure playbooks.

Quality gates by change type

Do not run the same checklist for every change. Gate the artifact that changed:

| Change | Required checks |
| --- | --- |
| Context Pack | Compile snapshot diff, retrieval coverage, evidence sufficiency, replay |
| Tool manifest | Schema validation, tool-selection evals, denial cases, idempotency tests |
| Policy bundle | Must-allow / must-deny goldens, approval-mode matrix, adversarial examples |
| Planner skill | Trajectory evals, loop-budget checks, stuck-run detection |
| Critic rubric | Judge calibration, disagreement review, held-out release set |
| Model upgrade | Replay diff, cost/latency scorecard, safety and policy hard floors |
| Memory rule | Promotion lineage, contradiction tests, recall precision |

This is how harness engineering stays practical. You keep the gate narrow enough to run often and strict enough to stop the dangerous class of regression.

The eight-property acceptance test (pre-launch audit)

Treat this table as the minimum bar for “production harness.” Each row maps to concrete mechanisms in the spec. Narrative version: Eight property harness audit.

| Property | What “good” looks like |
| --- | --- |
| Context-aware | CompiledContext reflects intent; irrelevant context is budget-trimmed, not hidden in prose. |
| Policy-governed | Deterministic policy at compile, plan, and execute boundaries. |
| Tool-controlled | Only declared tools; schemas enforced; approval modes bound. |
| Validated | Evaluators gate completion; rubrics versioned like code. |
| Observable | OTEL-first tracing; join keys consistent across planes. |
| Reversible | Idempotency, reversal tokens where applicable, replay and rollback documented. |
| Measurable | Scorecards and business metrics per intent and per artifact tuple. |
| Continuously improving | Proposals, reviewer coverage, disjoint train vs. test for tuning. |

Repo layout: legible to humans and coding agents

Put harness artifacts where CI can validate them and where agents can navigate without secret knowledge:

repo/
  AGENTS.md
  ARCHITECTURE.md
  harness/
    packs/
    policies/
    tools/
    evals/
    fixtures/
    validators/
    observability/
    reviewers/
    skills/
    feedback/

AGENTS.md should be a navigation file, not a prompt dump. See AGENTS.md done right.

Trace template: refund-shaped (where to go for code)

Use this as a narrative checklist for any high-risk intent:

  1. Inbound invokeAgent with pinned pack refs and a populated RunContext.
  2. Context Pack compiles CompiledContext with manifests and active controls.
  3. Planner proposes a small plan: policy checks, reads, then maybe a destructive step.
  4. critic.verify blocks until evidence obligations are satisfied (see the missing-evidence replan in the refund post).
  5. Policy requires an approval gate for high-value destructive actions.
  6. Tool Gateway returns gate_pending until approval resolves; then executes with idempotency and frozen evidence hash.
  7. DecisionRecord records evidence refs, approvals, controls, trace id.
  8. Scorecard runs; replay reproduces the receipt against pinned versions.

Step-by-step code for that path lives in End-to-end refund walkthrough. This article is the checklist and sequencing; that post is the annotated execution.


If you build in the order contract → context compile → loop → gateway → receipt → observability → rollout → improvement, you spend more time up front and far less time explaining incidents you cannot replay. That trade is the difference between a demo and a product.
