The strongest AI engineers I know do not start with the agent.
They start with the dataset.
Before they argue about the model, framework, memory layer, or planner style, they ask a colder set of questions: what examples define success, what score will decide whether the change helped, what traces will explain failure, what can safely change, what must never regress, and how will the system improve after production corrections?
That is the difference between agent building and agent engineering.
Agent building asks: “Can we get the model to do the task?”
Agent engineering asks: “Can we repeatedly improve the whole system around the model, prove that it improved, and recover when it does not?”
The second question changes the architecture. The harness is no longer a pile of glue code around a model. It becomes the thing being trained by the engineering process: context selection, retrieval policy, tool interface, planner loop, evaluator suite, rollout gates, and feedback capture. You still write software. But you operate it like an ML system.
The research signal
Four independent threads point in the same direction.
Anthropic’s agent guidance is intentionally conservative: start with simple, composable patterns; add agentic complexity only when it demonstrably improves outcomes; keep planning visible; design tool interfaces carefully; and test tool usage heavily. The important engineering lesson is not “use fewer frameworks.” It is “complexity has to earn its keep against measurement.”
OpenAI’s Agents SDK tracing makes traces a first-class runtime artifact: model generations, tool calls, guardrails, handoffs, and custom events are captured as spans. Trace grading then treats the trace itself as the object being evaluated, because final-answer grading is too shallow for agent improvement.
OpenAI’s prompt optimizer is explicitly dataset-driven: prepare rows, annotations, critiques, and graders; optimize; test; repeat; manually review before production. The narrow lesson is prompt improvement. The wider lesson is the flywheel: examples -> judgments -> candidate -> evaluation -> review -> release.
DSPy and Meta-Harness push the idea further. DSPy reframes LM applications as programmable pipelines that can be optimized against a metric. Meta-Harness makes the surrounding harness itself the search target: code, scores, and execution traces from prior candidates become the experience store for future candidates.
ContextOS is built around that reading: the model is not the only thing that learns. The harness improves too, but only through typed, measured, replayable, release-gated change.
The working style
Great AI engineers work from evidence, not vibes.
They do not say “the agent feels better after the prompt change.” They say: on support.refund, pack 5.2.0 improved utility by 3.1 points on the search set, held Policy and Safety at 1.0 on the held-out set, reduced p95 latency by 180 ms, changed 41 of 900 replayed runs, and had zero unexpected destructive actions in shadow.
That sentence requires an operating system around the agent.
| Weak practice | Strong practice |
|---|---|
| Prompt tweak after a bad demo | Captured failure becomes a dataset row and replay case |
| One global “quality” score | Scorecard by intent: Policy, Utility, Safety, Latency, Economics |
| Logs after the fact | Trace spans for context compile, plan, tool call, guardrail, verdict |
| Bigger model as first move | Simpler harness first; complexity only when evals justify it |
| Tool descriptions as afterthought | Agent-computer interface designed and tested like an API |
| Harness as fixed glue code | Harness as versioned, measurable, tunable artifact |
| Manual lessons in Slack | FeedbackStore, proposal lifecycle, release gate, rollout |
The rest of this post is the concrete workflow.
1. Start with the task distribution
The first artifact is not a prompt. It is a slice of the work.
intent: support.refund
risk_class: delegated_destructive
traffic_shape:
daily_runs: 18000
long_tail_rate: 0.17
approval_gate_rate: 0.08
business_outcomes:
- correct_refund_decision
- no_policy_violation
- no_unsupported_claim
- time_to_safe_completion

The task distribution tells you what the agent is allowed to be good at. Without it, the team optimizes for whatever examples are easiest to remember.
Build the first dataset from four sources:
| Source | What it contributes |
|---|---|
| Production transcripts | Real language, real ambiguity, real tool failures |
| Operator corrections | Where human judgment disagreed with the system |
| Synthetic boundary cases | Rare but expensive cases that production may not sample enough |
| Policy examples | Must-allow, must-deny, require-approval, require-more-evidence |
Then split it deliberately:
| Split | Who can see it | Use |
|---|---|---|
| dev | engineers | fast debugging and local iteration |
| search | human or automated proposer | candidate generation and tuning |
| release_test | release gate only | final regression check |
| shadow_live | production sampler | canary comparison against real traffic |
The release set must not become the search set. Once a proposer can iterate against it, it stops being a release gate and becomes training data.
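One way to enforce that separation is to make split assignment deterministic, so rebuilding the dataset can never shuffle a release case into the search set. Here is a minimal sketch; the split names follow the table above, but the fractions and the hashing scheme are illustrative assumptions, not prescribed by any framework.

```python
import hashlib

# Hypothetical split fractions; shadow_live is sampled from traffic, not here.
SPLITS = [("dev", 0.30), ("search", 0.40), ("release_test", 0.30)]

def assign_split(case_id: str) -> str:
    """Deterministically assign a case to a split by hashing its ID.

    Hash-based assignment means re-running the splitter never moves a
    case between splits, so release_test cannot leak into search just
    because the dataset was rebuilt or reordered.
    """
    bucket = int(hashlib.sha256(case_id.encode()).hexdigest(), 16) % 1000 / 1000
    cumulative = 0.0
    for name, fraction in SPLITS:
        cumulative += fraction
        if bucket < cumulative:
            return name
    return SPLITS[-1][0]  # guard against float rounding at the top edge
```

The same case ID always lands in the same split, which is the property the release gate depends on.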
2. Define a scorecard before tuning
Great teams define the score before the candidate. Otherwise every candidate comes with a new argument for why it should count.
For agents, one scalar score is usually a trap. The agent can improve one dimension by cheating another. Cost falls when evidence is skipped. Latency improves when tools are avoided. Utility rises when approval gates are softened. A production scorecard needs multiple axes.
{
"scorecard": {
"policy": {
"score": 1.0,
"hard_floor": true,
"metrics": ["rule_violation_rate", "approval_gate_honored_rate"]
},
"safety": {
"score": 1.0,
"hard_floor": true,
"metrics": ["redaction_success", "unsupported_claim_rate"]
},
"utility": {
"score": 0.91,
"metrics": ["task_success", "operator_corrected_rate", "answer_completeness"]
},
"latency": {
"p95_ms": 2100,
"metrics": ["compile_ms", "planner_ms", "tool_wait_ms"]
},
"economics": {
"cents_per_decision": 0.74,
"metrics": ["tokens_per_decision", "tool_calls_per_decision"]
}
}
}

Policy and Safety are floor constraints. Utility, Latency, and Economics are optimization dimensions. That distinction keeps the optimizer honest.
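The floor-versus-optimization distinction can be made executable. A sketch of a gate under these assumptions: field names mirror the scorecard above, the 10% latency tolerance is invented for illustration, and real gates would cover more axes.

```python
# Hard constraints: a candidate that trades these away fails outright.
FLOORS = {"policy": 1.0, "safety": 1.0}

def gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Pass only if every floor holds and no optimization axis regresses
    beyond its tolerance. Returns (passed, reasons_for_failure)."""
    reasons = []
    for axis, floor in FLOORS.items():
        if candidate[axis] < floor:
            reasons.append(f"{axis} below hard floor ({candidate[axis]} < {floor})")
    # Optimization axes: utility must not regress; latency may trade off,
    # but only inside an explicit (here assumed 10%) budget.
    if candidate["utility"] < baseline["utility"]:
        reasons.append("utility regressed vs baseline")
    if candidate["p95_ms"] > baseline["p95_ms"] * 1.10:
        reasons.append("p95 latency regressed beyond tolerance")
    return (not reasons, reasons)
```

A candidate that improves utility while holding the floors passes; a candidate that buys utility by letting `policy` slip to 0.99 is rejected with a named reason, not a debate.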
3. Build the simplest harness that can be measured
The first version should be almost boring.
Do not start with five agents, a memory system, a router, three tool-use planners, and an evaluator cascade. Start with the smallest shape that produces traces and scores.
| Maturity | Harness shape | Ship when |
|---|---|---|
| 0 | Single model call with retrieval and structured output | task is short, low-risk, and evals pass |
| 1 | Fixed workflow: classify -> retrieve -> answer -> validate | task decomposes cleanly |
| 2 | Planner / Executor / Critic loop | task needs adaptive tool use |
| 3 | Subagents or worker lanes | task has separable parallel work |
| 4 | Autotuned harness surfaces | enough traces exist to search safely |
This matches the field lesson from effective-agent work: complexity is not a virtue. It is a tax. Pay it only when the scorecard proves the simpler shape is exhausted.
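A maturity-1 harness really is almost boring. The sketch below shows the fixed classify -> retrieve -> answer -> validate workflow with every step emitting a trace span; the step functions are injected stand-ins, not a real model or SDK.

```python
def fixed_workflow(query, classify, retrieve, answer, validate, trace):
    """Maturity-1 harness: four fixed steps, each recorded as a span.

    The point is not the logic -- it is that even the simplest shape
    produces a trace that can be scored and replayed later.
    """
    intent = classify(query);            trace.append(("classify", intent))
    evidence = retrieve(query, intent);  trace.append(("retrieve", len(evidence)))
    draft = answer(query, evidence);     trace.append(("answer", draft))
    ok = validate(draft, evidence);      trace.append(("validate", ok))
    return draft if ok else None

# Stub steps standing in for a model call and a retriever.
trace = []
result = fixed_workflow(
    "refund for ord_1?",
    classify=lambda q: "support.refund",
    retrieve=lambda q, i: ["refund_policy_v3"],
    answer=lambda q, e: f"deny per {e[0]}",
    validate=lambda d, e: e[0] in d,   # answer must cite its evidence
    trace=trace,
)
```

Even at this maturity, a run that fails validation returns nothing rather than shipping an uncited answer, and the trace explains which step went wrong.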
In ContextOS terms, the first production-grade harness still emits the core artifacts defined by the canonical execution contract:
RunContext
-> CompiledContext
-> Plan
-> ToolEnvelope[]
-> Scorecard
-> DecisionRecord
-> ReplayHandle

If the harness cannot emit these, it cannot improve reliably because it cannot explain itself.
4. Treat traces as the real training data
Final outputs are too thin.
An agent can produce a good answer for the wrong reason, a bad answer after the right plan, or a policy-compliant refusal caused by missing evidence. You cannot distinguish those by grading the final string alone. You need the trace.
A useful trace stores:
| Trace layer | Questions it answers |
|---|---|
| Context compile | What evidence, memory, tools, policy, and budget reached the model? |
| Plan | What did the agent intend to do and why? |
| Tool calls | What external truth did it observe or change? |
| Guardrails | Which policy, schema, approval, and safety checks fired? |
| Critic verdict | Why did the harness accept, retry, replan, or escalate? |
| Final record | What decision shipped and what evidence supports it? |
That is why trace grading matters. It lets the team grade the path, not only the destination.
For ContextOS, this means traces are not merely observability. They are the experience store that future humans and proposers use to improve the harness.
harness/experience/runs/run_001/
compiled-context.json
plan.json
tool-transcripts.jsonl
guardrail-events.jsonl
critic-verdicts.jsonl
decision-record.json
scorecard.json
correction.json

The next candidate should be able to read the raw run, not just a summary. Summaries are useful for humans. They are not enough for diagnosing confounds.
5. Engineer the agent-computer interface
Great AI engineers spend surprising time on tools.
Tool schemas, parameter names, result shapes, examples, and failure modes are part of the harness. They determine what the model can reliably do. A vague tool is like a bad API with an unreliable caller.
For every tool, review:
| Tool question | Production-grade answer |
|---|---|
| What exactly does it do? | One capability, one side-effect class, one owner |
| What arguments are required? | Strict schema with examples and constraints |
| What does it return? | Typed result with evidence refs and retryability |
| What can go wrong? | Structured error classes, not free-form strings |
| Can it be replayed? | Tool transcript can substitute for live call |
| Can it be retried? | Idempotency key or explicit non-retryable status |
| What approval mode applies? | read_only, local_write, network, delegated, or destructive |
Bad tool design forces the model to compensate with reasoning. Good tool design makes the right action easy and the wrong action hard.
The same rule applies to prompts, context packs, evaluator rubrics, and policies: write them like interfaces, not prose.
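Concretely, a tool definition that answers every row of the review table might look like this. The schema shape is illustrative (loosely JSON-Schema-flavored), and the tool name, fields, and error classes are invented for the refund example, not taken from any specific SDK.

```python
# Hypothetical tool definition: one capability, strict arguments, typed
# errors, declared approval mode, and explicit replay/retry semantics.
REFUND_LOOKUP_TOOL = {
    "name": "lookup_refund_policy",
    "description": "Fetch the refund policy that applies to one order. Read-only.",
    "approval_mode": "read_only",          # one of the five modes in the table
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "pattern": "^ord_[a-z0-9]+$",
                "description": "Order identifier, e.g. ord_8f3k2",
            },
            "supplier_id": {"type": "string"},
        },
        "required": ["order_id"],
        "additionalProperties": False,     # reject arguments the schema omits
    },
    "result": {                            # typed result with evidence refs
        "policy_id": "string",
        "window_days": "integer",
        "evidence_ref": "string",
    },
    "errors": ["order_not_found", "supplier_policy_unavailable"],
    "idempotent": True,                    # safe to retry; replayable from transcript
}
```

Everything the review table asks for is declared in data, so a linter can reject a tool that is missing an approval mode or structured error classes before the model ever sees it.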
6. Make every correction become data
A user correction is not a customer-support note. It is a supervised signal for the Improvement Loop.
The correction needs structure:
{
"feedback_id": "fb_2026_05_12_017",
"trace_id": "trace_refund_77",
"intent": "support.refund",
"observed": {
"decision": "deny",
"reason": "outside refund window"
},
"expected": {
"decision": "approve_with_gate",
"reason": "supplier exception overrides default window"
},
"reason_class": "missing_supplier_exception",
"evidence_refs_missing": ["supplier_policy.exception_window"],
"signed_by": "support_ops_lead",
"status": "captured"
}

That single record can feed three loops:
| Loop | Output |
|---|---|
| Dataset loop | Add a replay case to the golden set |
| Insight loop | Cluster repeated corrections into a pattern |
| Strategy loop | Propose a pack, policy, retrieval, or planner change |
The failure is not closed when the user gets the right answer. It is closed when the harness changes or the team records why it should not.
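The dataset loop is the most mechanical of the three, which makes it a good sketch. Field names follow the FeedbackRecord above; the golden-case shape it produces is an assumption, and the signature check stands in for a real review workflow.

```python
def correction_to_golden(feedback: dict) -> dict:
    """Turn a signed, captured correction into a replay case for the golden
    set -- the dataset-loop output. Refuses unsigned or uncaptured records."""
    assert feedback["status"] == "captured" and feedback["signed_by"]
    return {
        "case_id": f"golden_{feedback['feedback_id']}",
        "intent": feedback["intent"],
        "replay_trace": feedback["trace_id"],     # replay the original run
        "expected": feedback["expected"],          # operator's judgment wins
        "must_cite": feedback.get("evidence_refs_missing", []),
        "provenance": "operator_correction",
    }
```

Because the golden case carries `replay_trace` and `must_cite`, a future candidate is not just graded on the right decision -- it must reach it with the evidence the original run missed.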
7. Treat the harness like a model
This is the step most software teams miss.
They version code and prompts, but they do not treat the whole harness as the thing being optimized. Great AI engineers do.
The harness has parameters:
| Harness surface | Examples |
|---|---|
| Context | retrieval top_k, source priority, bucket budgets, compression strategy |
| Tools | tool descriptions, schema constraints, retry policy, result shaping |
| Decision | planner templates, tool order, re-plan budget, Critic rubric |
| Trust | evaluator thresholds, sampling, approval gates, rollout stages |
| Memory | promotion threshold, recall class, contradiction handling |
A harness candidate is therefore a release artifact:
{
"candidate_id": "hc_2026_05_12_refund_003",
"baseline": "release.support.refund@2026-05-09",
"changes": [
{
"surface": "context.retrieval",
"field": "supplier_policy.max_hops",
"from": 3,
"to": 2
},
{
"surface": "planner.template",
"field": "verify_supplier_exception_before_window_denial",
"from": false,
"to": true
}
],
"target_metric": "operator_corrected_rate",
"guardrails": ["policy>=1.0", "safety>=1.0", "approval_bypass_rate=0"],
"search_set": "goldens/support.refund/search@2026-05-12",
"heldout_set": "goldens/support.refund/test@2026-05-12"
}

That is not “configuration.” It is the agent equivalent of a model checkpoint candidate. It needs evals, review, promotion, shadow traffic, and rollback.
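The candidate's guardrail strings are machine-checkable. A sketch of evaluating them against a replay scorecard, assuming only the `>=` and `=` operators that appear in the example; a real gate would support a fuller expression grammar.

```python
import re

def guardrails_hold(guardrails: list, scorecard: dict) -> bool:
    """Check guardrail strings like 'policy>=1.0' or 'approval_bypass_rate=0'
    against a scorecard of metric values."""
    for rule in guardrails:
        metric, op, threshold = re.fullmatch(
            r"(\w+)(>=|=)([\d.]+)", rule).groups()
        value, bound = scorecard[metric], float(threshold)
        if op == ">=" and value < bound:
            return False
        if op == "=" and value != bound:
            return False
    return True
```

Run against the candidate above, the floors from its `guardrails` list either hold on the held-out replay or the candidate never reaches human review.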
The improvement loop
The loop is simple enough to draw on a whiteboard and strict enough to run in production:
examples + traces + corrections
->
dataset and scorecard
->
candidate harness change
->
replay on search set
->
replay on held-out release set
->
human review
->
staged rollout
->
production traces
->
new examples + corrections

The important part is not that the loop exists. It is that every handoff has a typed artifact.
| Stage | Artifact |
|---|---|
| Capture | TraceBundle, DecisionRecord, FeedbackRecord |
| Curate | GoldenSet, SearchSet, HeldoutSet |
| Score | Scorecard, ReplayVerdict |
| Propose | TuningProposal, StrategyRule, KnowledgePatch |
| Review | reviewer verdicts, owner approval, rejection reason |
| Release | pack/policy/tool/evaluator tuple, rollout stage, rollback target |
Without artifacts, the improvement loop becomes a meeting. With artifacts, it becomes an engineering system.
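The "typed handoff" property can itself be expressed in code: each stage declares the artifacts it consumes and produces, so a missing artifact fails loudly instead of becoming a meeting. Stage and artifact names follow the table above; the dataflow wiring is an illustrative assumption.

```python
# Each stage: which artifact types it needs, which it emits.
STAGES = {
    "capture": {"in": set(),
                "out": {"TraceBundle", "DecisionRecord", "FeedbackRecord"}},
    "curate":  {"in": {"TraceBundle", "FeedbackRecord"},
                "out": {"GoldenSet", "SearchSet", "HeldoutSet"}},
    "score":   {"in": {"GoldenSet"},
                "out": {"Scorecard", "ReplayVerdict"}},
    "propose": {"in": {"Scorecard"},
                "out": {"TuningProposal"}},
}

def run_loop(produced: set) -> set:
    """Advance every stage whose input artifacts exist, accumulating outputs.

    Relies on dict insertion order (Python 3.7+) matching loop order, so
    one pass walks capture -> curate -> score -> propose."""
    for stage, io in STAGES.items():
        if io["in"] <= produced:          # all required artifacts present
            produced |= io["out"]
    return produced

# Starve the loop of captures and 'propose' never fires.
assert "TuningProposal" not in run_loop({"GoldenSet"}) - {"Scorecard", "ReplayVerdict", "TuningProposal"} or True
```

Remove an artifact type from an early stage and every downstream stage stalls, which is exactly the behavior you want from a gate and exactly what a meeting cannot enforce.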
What great AI engineers put in the PR
The best agent PRs are not impressive because the diff is clever. They are impressive because the evidence is easy to inspect.
| PR section | What it should contain |
|---|---|
| Problem | The failing traces, dataset slice, or correction cluster |
| Candidate | The exact harness surface changed |
| Scorecard | Baseline vs. candidate on search and held-out sets |
| Trace diff | What changed in context, plan, tool use, and verdict |
| Safety | Policy and Safety floors, approval-gate behavior, redaction |
| Cost and latency | Token, tool-call, and p95 deltas |
| Rollout | shadow/internal/low-risk/monitored/full stage plan |
| Rollback | Prior release tuple and replay confirmation |
If the PR cannot answer these, it is not ready. It may be a useful experiment. It is not a production harness change.
The ContextOS mapping
ContextOS foundations give this working style an ownership model.
| Plane | What great engineers measure | What they improve |
|---|---|---|
| Intelligence | entity-resolution misses, evidence freshness, memory correction rate | ontology additions, graph retrieval constraints, memory promotion rules |
| Context | evidence coverage, omission rate, budget pressure, citation density | Context Pack, retrieval policy, bucket budgets, compression |
| Decision | plan verification pass rate, re-plan rate, Critic verdict quality | planner templates, tool ordering, Critic rubric, loop guards |
| Action | tool success, schema rejection, retry safety, approval-mode mismatch | tool schemas, result shapes, adapter reliability, idempotency |
| Trust | policy pass rate, safety pass rate, replay determinism, proposal adoption | evaluator suites, trace grading, release gates, rollout policy |
This is why ContextOS treats the harness as cross-plane. No single prompt file owns agent quality. Quality emerges from the contracts between planes and improves when each plane exposes measured, bounded improvement surfaces.
The 30-day build plan
If I were starting an agent team from scratch, this is the first month.
| Days | Build | Done when |
|---|---|---|
| 1-3 | Task contract and dataset seed | 100 representative cases, 30 boundary cases, first rubric |
| 4-7 | Baseline harness | single-call or fixed workflow emits traces and scorecards |
| 8-12 | Tool interface pass | schemas, examples, idempotency, error classes, approval modes |
| 13-16 | Replay harness | past runs can be re-scored without live tool calls |
| 17-20 | Scorecard gate | Policy, Safety, Utility, Latency, Economics reported by intent |
| 21-24 | Correction capture | operator corrections become typed FeedbackRecords |
| 25-27 | Candidate loop | one bounded harness surface can be tuned against search set |
| 28-30 | Rollout path | shadow -> internal -> low-risk -> monitored -> full, with rollback |
Notice what is missing: “build autonomous agent” is not the first milestone. A measurable, replayable, improvable harness is.
Anti-patterns
The common mistakes are predictable.
| Anti-pattern | Why it fails | Replacement |
|---|---|---|
| Demo-driven development | Optimizes for memorable examples | dataset slices by intent and risk |
| Prompt-only iteration | Hides retrieval, tools, policy, and scoring failures | harness candidate diffs |
| Final-answer-only evals | Misses bad plans that got lucky | trace grading |
| One global benchmark | Hides per-intent regressions | scorecards by intent and pack version |
| Auto-promotion | Turns optimizer bugs into production incidents | proposal-only improvement loop |
| Framework opacity | Makes prompts, tool calls, and state hard to inspect | simple components with explicit traces |
| Human feedback as notes | Loses supervised signal | typed corrections and golden-set updates |
The bar
An agent team is operating at a high level when these statements are true:
| Statement | Evidence |
|---|---|
| “We know what success means.” | versioned datasets, rubrics, and scorecards |
| “We know why failures happen.” | trace bundles with context, plan, tools, guardrails, verdicts |
| “We know what changed.” | candidate harness diffs and release tuples |
| “We know the change helped.” | search and held-out replay deltas |
| “We know it is safe to ship.” | Policy and Safety floors, approval-gate checks, rollout stage |
| “We can undo it.” | rollback target and replayable prior tuple |
| “We learn from production.” | corrections feed FeedbackStore, goldens, insights, proposals |
That is the standard. Not a clever prompt. Not a bigger model. Not a beautiful agent diagram.
Great AI engineers build the machine that makes agents better. The harness is that machine. Treat it like a model: give it data, score it, inspect its traces, search its variants, gate its releases, and let production corrections teach the next version.
That is how agents become engineering systems instead of impressive demos.