The strongest AI engineers I know do not start with the agent.
They start with the dataset.
Before they argue about the model, framework, memory layer, or planner style, they ask a colder set of questions: what examples define success, what score will decide whether the change helped, what traces will explain failure, what can safely change, what must never regress, and how will the system improve after production corrections?
That is the difference between agent building and agent engineering.
Agent building asks: “Can we get the model to do the task?”
Agent engineering asks: “Can we repeatedly improve the whole system around the model, prove that it improved, and recover when it does not?”
The second question changes the architecture. The harness is no longer a pile of glue code around a model. It becomes the thing being trained by the engineering process: context selection, retrieval policy, tool interface, planner loop, evaluator suite, rollout gates, and feedback capture. You still write software. But you operate it like an ML system.
The research signal
Four independent threads point in the same direction.
Anthropic’s agent guidance is intentionally conservative: start with simple, composable patterns; add agentic complexity only when it demonstrably improves outcomes; keep planning visible; design tool interfaces carefully; and test tool usage heavily. The important engineering lesson is not “use fewer frameworks.” It is “complexity has to earn its keep against measurement.”
OpenAI’s Agents SDK tracing makes traces a first-class runtime artifact: model generations, tool calls, guardrails, handoffs, and custom events are captured as spans. Trace grading then treats the trace itself as the object being evaluated, because final-answer grading is too shallow for agent improvement.
OpenAI’s prompt optimizer is explicitly dataset-driven: prepare rows, annotations, critiques, and graders; optimize; test; repeat; manually review before production. The narrow lesson is prompt improvement. The wider lesson is the flywheel: examples -> judgments -> candidate -> evaluation -> review -> release.
DSPy and Meta-Harness push the idea further. DSPy reframes LM applications as programmable pipelines that can be optimized against a metric. Meta-Harness makes the surrounding harness itself the search target: code, scores, and execution traces from prior candidates become the experience store for future candidates.
ContextOS is built around that reading: the model is not the only thing that learns. The harness improves too, but only through typed, measured, replayable, release-gated change.
The working style
Great AI engineers work from evidence, not vibes.
They do not say “the agent feels better after the prompt change.” They say: on support.refund, pack 5.2.0 improved utility by 3.1 points on the search set, held Policy and Safety at 1.0 on the held-out set, reduced p95 latency by 180 ms, changed 41 of 900 replayed runs, and had zero unexpected destructive actions in shadow.
That sentence requires an operating system around the agent.
| Weak practice | Strong practice |
|---|---|
| Prompt tweak after a bad demo | Captured failure becomes a dataset row and replay case |
| One global “quality” score | Scorecard by intent: Policy, Utility, Safety, Latency, Economics |
| Logs after the fact | Trace spans for context compile, plan, tool call, guardrail, verdict |
| Bigger model as first move | Simpler harness first; complexity only when evals justify it |
| Tool descriptions as afterthought | Agent-computer interface designed and tested like an API |
| Harness as fixed glue code | Harness as versioned, measurable, tunable artifact |
| Manual lessons in Slack | FeedbackStore, proposal lifecycle, release gate, rollout |
The rest of this post is the concrete workflow.
1. Start with the task distribution
The first artifact is not a prompt. It is a slice of the work.
intent: support.refund
risk_class: delegated_destructive
traffic_shape:
daily_runs: 18000
long_tail_rate: 0.17
approval_gate_rate: 0.08
business_outcomes:
- correct_refund_decision
- no_policy_violation
- no_unsupported_claim
- time_to_safe_completion

The task distribution tells you what the agent is allowed to be good at. Without it, the team optimizes for whatever examples are easiest to remember.
Build the first dataset from four sources:
| Source | What it contributes |
|---|---|
| Production transcripts | Real language, real ambiguity, real tool failures |
| Operator corrections | Where human judgment disagreed with the system |
| Synthetic boundary cases | Rare but expensive cases that production may not sample enough |
| Policy examples | Must-allow, must-deny, require-approval, require-more-evidence |
Then split it deliberately:
| Split | Who can see it | Use |
|---|---|---|
| dev | engineers | fast debugging and local iteration |
| search | human or automated proposer | candidate generation and tuning |
| release_test | release gate only | final regression check |
| shadow_live | production sampler | canary comparison against real traffic |
The release set must not become the search set. Once a proposer can iterate against it, it stops being a release gate and becomes training data.
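One way to enforce that separation is to make split assignment deterministic, so rebuilding the dataset can never shuffle a release case into the search set. Here is a minimal sketch; the split names follow the table above, but the fractions and the hashing scheme are illustrative assumptions, not prescribed by any framework.

```python
import hashlib

# Hypothetical split fractions; shadow_live is sampled from traffic, not here.
SPLITS = [("dev", 0.30), ("search", 0.40), ("release_test", 0.30)]

def assign_split(case_id: str) -> str:
    """Deterministically assign a case to a split by hashing its ID.

    Hash-based assignment means re-running the splitter never moves a
    case between splits, so release_test cannot leak into search just
    because the dataset was rebuilt or reordered.
    """
    bucket = int(hashlib.sha256(case_id.encode()).hexdigest(), 16) % 1000 / 1000
    cumulative = 0.0
    for name, fraction in SPLITS:
        cumulative += fraction
        if bucket < cumulative:
            return name
    return SPLITS[-1][0]  # guard against float rounding at the top edge
```

The same case ID always lands in the same split, which is the property the release gate depends on.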
2. Define a scorecard before tuning
Great teams define the score before the candidate. Otherwise every candidate comes with a new argument for why it should count.
For agents, one scalar score is usually a trap. The agent can improve one dimension by cheating another. Cost falls when evidence is skipped. Latency improves when tools are avoided. Utility rises when approval gates are softened. A production scorecard needs multiple axes.
{
"scorecard": {
"policy": {
"score": 1.0,
"hard_floor": true,
"metrics": ["rule_violation_rate", "approval_gate_honored_rate"]
},
"safety": {
"score": 1.0,
"hard_floor": true,
"metrics": ["redaction_success", "unsupported_claim_rate"]
},
"utility": {
"score": 0.91,
"metrics": ["task_success", "operator_corrected_rate", "answer_completeness"]
},
"latency": {
"p95_ms": 2100,
"metrics": ["compile_ms", "planner_ms", "tool_wait_ms"]
},
"economics": {
"cents_per_decision": 0.74,
"metrics": ["tokens_per_decision", "tool_calls_per_decision"]
}
}
}

Policy and Safety are floor constraints. Utility, Latency, and Economics are optimization dimensions. That distinction keeps the optimizer honest.
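The floor-versus-optimization distinction can be made executable. A sketch of a gate under these assumptions: field names mirror the scorecard above, the 10% latency tolerance is invented for illustration, and real gates would cover more axes.

```python
# Hard constraints: a candidate that trades these away fails outright.
FLOORS = {"policy": 1.0, "safety": 1.0}

def gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Pass only if every floor holds and no optimization axis regresses
    beyond its tolerance. Returns (passed, reasons_for_failure)."""
    reasons = []
    for axis, floor in FLOORS.items():
        if candidate[axis] < floor:
            reasons.append(f"{axis} below hard floor ({candidate[axis]} < {floor})")
    # Optimization axes: utility must not regress; latency may trade off,
    # but only inside an explicit (here assumed 10%) budget.
    if candidate["utility"] < baseline["utility"]:
        reasons.append("utility regressed vs baseline")
    if candidate["p95_ms"] > baseline["p95_ms"] * 1.10:
        reasons.append("p95 latency regressed beyond tolerance")
    return (not reasons, reasons)
```

A candidate that improves utility while holding the floors passes; a candidate that buys utility by letting `policy` slip to 0.99 is rejected with a named reason, not a debate.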
3. Build the simplest harness that can be measured
The first version should be almost boring.
Do not start with five agents, a memory system, a router, three tool-use planners, and an evaluator cascade. Start with the smallest shape that produces traces and scores.
| Maturity | Harness shape | Ship when |
|---|---|---|
| 0 | Single model call with retrieval and structured output | task is short, low-risk, and evals pass |
| 1 | Fixed workflow: classify -> retrieve -> answer -> validate | task decomposes cleanly |
| 2 | Planner / Executor / Critic loop | task needs adaptive tool use |
| 3 | Subagents or worker lanes | task has separable parallel work |
| 4 | Autotuned harness surfaces | enough traces exist to search safely |
This matches the field lesson from effective-agent work: complexity is not a virtue. It is a tax. Pay it only when the scorecard proves the simpler shape is exhausted.
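A maturity-1 harness really is almost boring. The sketch below shows the fixed classify -> retrieve -> answer -> validate workflow with every step emitting a trace span; the step functions are injected stand-ins, not a real model or SDK.

```python
def fixed_workflow(query, classify, retrieve, answer, validate, trace):
    """Maturity-1 harness: four fixed steps, each recorded as a span.

    The point is not the logic -- it is that even the simplest shape
    produces a trace that can be scored and replayed later.
    """
    intent = classify(query);            trace.append(("classify", intent))
    evidence = retrieve(query, intent);  trace.append(("retrieve", len(evidence)))
    draft = answer(query, evidence);     trace.append(("answer", draft))
    ok = validate(draft, evidence);      trace.append(("validate", ok))
    return draft if ok else None

# Stub steps standing in for a model call and a retriever.
trace = []
result = fixed_workflow(
    "refund for ord_1?",
    classify=lambda q: "support.refund",
    retrieve=lambda q, i: ["refund_policy_v3"],
    answer=lambda q, e: f"deny per {e[0]}",
    validate=lambda d, e: e[0] in d,   # answer must cite its evidence
    trace=trace,
)
```

Even at this maturity, a run that fails validation returns nothing rather than shipping an uncited answer, and the trace explains which step went wrong.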
In ContextOS terms, the first production-grade harness still emits the core artifacts defined by the canonical execution contract:
RunContext
-> CompiledContext
-> Plan
-> ToolEnvelope[]
-> Scorecard
-> DecisionRecord
-> ReplayHandle

If the harness cannot emit these, it cannot improve reliably because it cannot explain itself.
4. Treat traces as the real training data
Final outputs are too thin.
An agent can produce a good answer for the wrong reason, a bad answer after the right plan, or a policy-compliant refusal caused by missing evidence. You cannot distinguish those by grading the final string alone. You need the trace.
A useful trace stores:
| Trace layer | Questions it answers |
|---|---|
| Context compile | What evidence, memory, tools, policy, and budget reached the model? |
| Plan | What did the agent intend to do and why? |
| Tool calls | What external truth did it observe or change? |
| Guardrails | Which policy, schema, approval, and safety checks fired? |
| Critic verdict | Why did the harness accept, retry, replan, or escalate? |
| Final record | What decision shipped and what evidence supports it? |
That is why trace grading matters. It lets the team grade the path, not only the destination.
For ContextOS, this means traces are not merely observability. They are the experience store that future humans and proposers use to improve the harness.
harness/experience/runs/run_001/
compiled-context.json
plan.json
tool-transcripts.jsonl
guardrail-events.jsonl
critic-verdicts.jsonl
decision-record.json
scorecard.json
correction.json

The next candidate should be able to read the raw run, not just a summary. Summaries are useful for humans. They are not enough for diagnosing confounds.
5. Engineer the agent-computer interface
Great AI engineers spend surprising time on tools.
Tool schemas, parameter names, result shapes, examples, and failure modes are part of the harness. They determine what the model can reliably do. A vague tool is like a bad API with an unreliable caller.
For every tool, review:
| Tool question | Production-grade answer |
|---|---|
| What exactly does it do? | One capability, one side-effect class, one owner |
| What arguments are required? | Strict schema with examples and constraints |
| What does it return? | Typed result with evidence refs and retryability |
| What can go wrong? | Structured error classes, not free-form strings |
| Can it be replayed? | Tool transcript can substitute for live call |
| Can it be retried? | Idempotency key or explicit non-retryable status |
| What approval mode applies? | read_only, local_write, network, delegated, or destructive |
Bad tool design forces the model to compensate with reasoning. Good tool design makes the right action easy and the wrong action hard.
The same rule applies to prompts, context packs, evaluator rubrics, and policies: write them like interfaces, not prose.
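Concretely, a tool definition that answers every row of the review table might look like this. The schema shape is illustrative (loosely JSON-Schema-flavored), and the tool name, fields, and error classes are invented for the refund example, not taken from any specific SDK.

```python
# Hypothetical tool definition: one capability, strict arguments, typed
# errors, declared approval mode, and explicit replay/retry semantics.
REFUND_LOOKUP_TOOL = {
    "name": "lookup_refund_policy",
    "description": "Fetch the refund policy that applies to one order. Read-only.",
    "approval_mode": "read_only",          # one of the five modes in the table
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "pattern": "^ord_[a-z0-9]+$",
                "description": "Order identifier, e.g. ord_8f3k2",
            },
            "supplier_id": {"type": "string"},
        },
        "required": ["order_id"],
        "additionalProperties": False,     # reject arguments the schema omits
    },
    "result": {                            # typed result with evidence refs
        "policy_id": "string",
        "window_days": "integer",
        "evidence_ref": "string",
    },
    "errors": ["order_not_found", "supplier_policy_unavailable"],
    "idempotent": True,                    # safe to retry; replayable from transcript
}
```

Everything the review table asks for is declared in data, so a linter can reject a tool that is missing an approval mode or structured error classes before the model ever sees it.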
6. Make every correction become data
A user correction is not a customer-support note. It is a supervised signal for the Improvement Loop.
The correction needs structure:
{
"feedback_id": "fb_2026_05_12_017",
"trace_id": "trace_refund_77",
"intent": "support.refund",
"observed": {
"decision": "deny",
"reason": "outside refund window"
},
"expected": {
"decision": "approve_with_gate",
"reason": "supplier exception overrides default window"
},
"reason_class": "missing_supplier_exception",
"evidence_refs_missing": ["supplier_policy.exception_window"],
"signed_by": "support_ops_lead",
"status": "captured"
}

That single record can feed three loops:
| Loop | Output |
|---|---|
| Dataset loop | Add a replay case to the golden set |
| Insight loop | Cluster repeated corrections into a pattern |
| Strategy loop | Propose a pack, policy, retrieval, or planner change |
The failure is not closed when the user gets the right answer. It is closed when the harness changes or the team records why it should not.
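The dataset loop is the most mechanical of the three, which makes it a good sketch. Field names follow the FeedbackRecord above; the golden-case shape it produces is an assumption, and the signature check stands in for a real review workflow.

```python
def correction_to_golden(feedback: dict) -> dict:
    """Turn a signed, captured correction into a replay case for the golden
    set -- the dataset-loop output. Refuses unsigned or uncaptured records."""
    assert feedback["status"] == "captured" and feedback["signed_by"]
    return {
        "case_id": f"golden_{feedback['feedback_id']}",
        "intent": feedback["intent"],
        "replay_trace": feedback["trace_id"],     # replay the original run
        "expected": feedback["expected"],          # operator's judgment wins
        "must_cite": feedback.get("evidence_refs_missing", []),
        "provenance": "operator_correction",
    }
```

Because the golden case carries `replay_trace` and `must_cite`, a future candidate is not just graded on the right decision -- it must reach it with the evidence the original run missed.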
7. Treat the harness like a model
This is the step most software teams miss.
They version code and prompts, but they do not treat the whole harness as the thing being optimized. Great AI engineers do.
The harness has parameters:
| Harness surface | Examples |
|---|---|
| Context | retrieval top_k, source priority, bucket budgets, compression strategy |
| Tools | tool descriptions, schema constraints, retry policy, result shaping |
| Decision | planner templates, tool order, re-plan budget, Critic rubric |
| Trust | evaluator thresholds, sampling, approval gates, rollout stages |
| Memory | promotion threshold, recall class, contradiction handling |
A harness candidate is therefore a release artifact:
{
"candidate_id": "hc_2026_05_12_refund_003",
"baseline": "release.support.refund@2026-05-09",
"changes": [
{
"surface": "context.retrieval",
"field": "supplier_policy.max_hops",
"from": 3,
"to": 2
},
{
"surface": "planner.template",
"field": "verify_supplier_exception_before_window_denial",
"from": false,
"to": true
}
],
"target_metric": "operator_corrected_rate",
"guardrails": ["policy>=1.0", "safety>=1.0", "approval_bypass_rate=0"],
"search_set": "goldens/support.refund/search@2026-05-12",
"heldout_set": "goldens/support.refund/test@2026-05-12"
}

That is not “configuration.” It is the agent equivalent of a model checkpoint candidate. It needs evals, review, promotion, shadow traffic, and rollback.
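The candidate's guardrail strings are machine-checkable. A sketch of evaluating them against a replay scorecard, assuming only the `>=` and `=` operators that appear in the example; a real gate would support a fuller expression grammar.

```python
import re

def guardrails_hold(guardrails: list, scorecard: dict) -> bool:
    """Check guardrail strings like 'policy>=1.0' or 'approval_bypass_rate=0'
    against a scorecard of metric values."""
    for rule in guardrails:
        metric, op, threshold = re.fullmatch(
            r"(\w+)(>=|=)([\d.]+)", rule).groups()
        value, bound = scorecard[metric], float(threshold)
        if op == ">=" and value < bound:
            return False
        if op == "=" and value != bound:
            return False
    return True
```

Run against the candidate above, the floors from its `guardrails` list either hold on the held-out replay or the candidate never reaches human review.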
The improvement loop
The loop is simple enough to draw on a whiteboard and strict enough to run in production:
examples + traces + corrections
->
dataset and scorecard
->
candidate harness change
->
replay on search set
->
replay on held-out release set
->
human review
->
staged rollout
->
production traces
->
new examples + corrections

The important part is not that the loop exists. It is that every handoff has a typed artifact.
| Stage | Artifact |
|---|---|
| Capture | TraceBundle, DecisionRecord, FeedbackRecord |
| Curate | GoldenSet, SearchSet, HeldoutSet |
| Score | Scorecard, ReplayVerdict |
| Propose | TuningProposal, StrategyRule, KnowledgePatch |
| Review | reviewer verdicts, owner approval, rejection reason |
| Release | pack/policy/tool/evaluator tuple, rollout stage, rollback target |
Without artifacts, the improvement loop becomes a meeting. With artifacts, it becomes an engineering system.
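The "typed handoff" property can itself be expressed in code: each stage declares the artifacts it consumes and produces, so a missing artifact fails loudly instead of becoming a meeting. Stage and artifact names follow the table above; the dataflow wiring is an illustrative assumption.

```python
# Each stage: which artifact types it needs, which it emits.
STAGES = {
    "capture": {"in": set(),
                "out": {"TraceBundle", "DecisionRecord", "FeedbackRecord"}},
    "curate":  {"in": {"TraceBundle", "FeedbackRecord"},
                "out": {"GoldenSet", "SearchSet", "HeldoutSet"}},
    "score":   {"in": {"GoldenSet"},
                "out": {"Scorecard", "ReplayVerdict"}},
    "propose": {"in": {"Scorecard"},
                "out": {"TuningProposal"}},
}

def run_loop(produced: set) -> set:
    """Advance every stage whose input artifacts exist, accumulating outputs.

    Relies on dict insertion order (Python 3.7+) matching loop order, so
    one pass walks capture -> curate -> score -> propose."""
    for stage, io in STAGES.items():
        if io["in"] <= produced:          # all required artifacts present
            produced |= io["out"]
    return produced

# Starve the loop of captures and 'propose' never fires.
assert "TuningProposal" not in run_loop({"GoldenSet"}) - {"Scorecard", "ReplayVerdict", "TuningProposal"} or True
```

Remove an artifact type from an early stage and every downstream stage stalls, which is exactly the behavior you want from a gate and exactly what a meeting cannot enforce.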
What great AI engineers put in the PR
The best agent PRs are not impressive because the diff is clever. They are impressive because the evidence is easy to inspect.
| PR section | What it should contain |
|---|---|
| Problem | The failing traces, dataset slice, or correction cluster |
| Candidate | The exact harness surface changed |
| Scorecard | Baseline vs. candidate on search and held-out sets |
| Trace diff | What changed in context, plan, tool use, and verdict |
| Safety | Policy and Safety floors, approval-gate behavior, redaction |
| Cost and latency | Token, tool-call, and p95 deltas |
| Rollout | shadow/internal/low-risk/monitored/full stage plan |
| Rollback | Prior release tuple and replay confirmation |
If the PR cannot answer these, it is not ready. It may be a useful experiment. It is not a production harness change.
The ContextOS mapping
ContextOS foundations give this working style an ownership model.
| Plane | What great engineers measure | What they improve |
|---|---|---|
| Intelligence | entity-resolution misses, evidence freshness, memory correction rate | ontology additions, graph retrieval constraints, memory promotion rules |
| Context | evidence coverage, omission rate, budget pressure, citation density | Context Pack, retrieval policy, bucket budgets, compression |
| Decision | plan verification pass rate, re-plan rate, Critic verdict quality | planner templates, tool ordering, Critic rubric, loop guards |
| Action | tool success, schema rejection, retry safety, approval-mode mismatch | tool schemas, result shapes, adapter reliability, idempotency |
| Trust | policy pass rate, safety pass rate, replay determinism, proposal adoption | evaluator suites, trace grading, release gates, rollout policy |
This is why ContextOS treats the harness as cross-plane. No single prompt file owns agent quality. Quality emerges from the contracts between planes and improves when each plane exposes measured, bounded improvement surfaces.
The 30-day build plan
If I were starting an agent team from scratch, this is the first month.
| Days | Build | Done when |
|---|---|---|
| 1-3 | Task contract and dataset seed | 100 representative cases, 30 boundary cases, first rubric |
| 4-7 | Baseline harness | single-call or fixed workflow emits traces and scorecards |
| 8-12 | Tool interface pass | schemas, examples, idempotency, error classes, approval modes |
| 13-16 | Replay harness | past runs can be re-scored without live tool calls |
| 17-20 | Scorecard gate | Policy, Safety, Utility, Latency, Economics reported by intent |
| 21-24 | Correction capture | operator corrections become typed FeedbackRecords |
| 25-27 | Candidate loop | one bounded harness surface can be tuned against search set |
| 28-30 | Rollout path | shadow -> internal -> low-risk -> monitored -> full, with rollback |
Notice what is missing: “build autonomous agent” is not the first milestone. A measurable, replayable, improvable harness is.
Anti-patterns
The common mistakes are predictable.
| Anti-pattern | Why it fails | Replacement |
|---|---|---|
| Demo-driven development | Optimizes for memorable examples | dataset slices by intent and risk |
| Prompt-only iteration | Hides retrieval, tools, policy, and scoring failures | harness candidate diffs |
| Final-answer-only evals | Misses bad plans that got lucky | trace grading |
| One global benchmark | Hides per-intent regressions | scorecards by intent and pack version |
| Auto-promotion | Turns optimizer bugs into production incidents | proposal-only improvement loop |
| Framework opacity | Makes prompts, tool calls, and state hard to inspect | simple components with explicit traces |
| Human feedback as notes | Loses supervised signal | typed corrections and golden-set updates |
The bar
An agent team is operating at a high level when these statements are true:
| Statement | Evidence |
|---|---|
| “We know what success means.” | versioned datasets, rubrics, and scorecards |
| “We know why failures happen.” | trace bundles with context, plan, tools, guardrails, verdicts |
| “We know what changed.” | candidate harness diffs and release tuples |
| “We know the change helped.” | search and held-out replay deltas |
| “We know it is safe to ship.” | Policy and Safety floors, approval-gate checks, rollout stage |
| “We can undo it.” | rollback target and replayable prior tuple |
| “We learn from production.” | corrections feed FeedbackStore, goldens, insights, proposals |
That is the standard. Not a clever prompt. Not a bigger model. Not a beautiful agent diagram.
Great AI engineers build the machine that makes agents better. The harness is that machine. Treat it like a model: give it data, score it, inspect its traces, search its variants, gate its releases, and let production corrections teach the next version.
That is how agents become engineering systems instead of impressive demos.