Agent engineering series
May 12, 2026
By Piyush · 13 min read

How Great AI Engineers Build Agents: Datasets, Scores, and Harnesses That Improve

Tags: ContextOS · Harness Engineering · AI Engineering · Evaluation · Agents

The strongest AI engineers I know do not start with the agent.

They start with the dataset.

Before they argue about the model, framework, memory layer, or planner style, they ask a colder set of questions: what examples define success, what score will decide whether the change helped, what traces will explain failure, what can safely change, what must never regress, and how will the system improve after production corrections?

That is the difference between agent building and agent engineering.

Agent building asks: “Can we get the model to do the task?”

Agent engineering asks: “Can we repeatedly improve the whole system around the model, prove that it improved, and recover when it does not?”

The second question changes the architecture. The harness is no longer a pile of glue code around a model. It becomes the thing being trained by the engineering process: context selection, retrieval policy, tool interface, planner loop, evaluator suite, rollout gates, and feedback capture. You still write software. But you operate it like an ML system.

The research signal

Four independent threads point in the same direction.

Anthropic’s agent guidance is intentionally conservative: start with simple, composable patterns; add agentic complexity only when it demonstrably improves outcomes; keep planning visible; design tool interfaces carefully; and test tool usage heavily. The important engineering lesson is not “use fewer frameworks.” It is “complexity has to earn its keep against measurement.”

OpenAI’s Agents SDK tracing makes traces a first-class runtime artifact: model generations, tool calls, guardrails, handoffs, and custom events are captured as spans. Trace grading then treats the trace itself as the object being evaluated, because final-answer grading is too shallow for agent improvement.

OpenAI’s prompt optimizer is explicitly dataset-driven: prepare rows, annotations, critiques, and graders; optimize; test; repeat; manually review before production. The narrow lesson is prompt improvement. The wider lesson is the flywheel: examples -> judgments -> candidate -> evaluation -> review -> release.

DSPy and Meta-Harness push the idea further. DSPy reframes LM applications as programmable pipelines that can be optimized against a metric. Meta-Harness makes the surrounding harness itself the search target: code, scores, and execution traces from prior candidates become the experience store for future candidates.

ContextOS is built around that reading: the model is not the only thing that learns. The harness improves too, but only through typed, measured, replayable, release-gated change.

AI engineering rule: do not ship an agent. Ship an improvement system. The agent is one runtime shape. The durable product is the loop that turns examples, traces, failures, corrections, and proposals into safer harness versions.

| Artifact | Question it answers | Contents |
| --- | --- | --- |
| Dataset | What success means | Goldens, edge cases, corrected runs, held-out releases |
| Score | What improves | Policy, utility, safety, latency, economics |
| Trace | Why it failed | Context, plan, tools, guardrails, verdicts, corrections |
| Harness | What changes | Packs, retrieval, tools, planner rules, evaluators, rollout gates |

The working style

Great AI engineers work from evidence, not vibes.

They do not say “the agent feels better after the prompt change.” They say: on support.refund, pack 5.2.0 improved utility by 3.1 points on the search set, held Policy and Safety at 1.0 on the held-out set, reduced p95 latency by 180 ms, changed 41 of 900 replayed runs, and had zero unexpected destructive actions in shadow.

That sentence requires an operating system around the agent.

| Weak practice | Strong practice |
| --- | --- |
| Prompt tweak after a bad demo | Captured failure becomes a dataset row and replay case |
| One global “quality” score | Scorecard by intent: Policy, Utility, Safety, Latency, Economics |
| Logs after the fact | Trace spans for context compile, plan, tool call, guardrail, verdict |
| Bigger model as first move | Simpler harness first; complexity only when evals justify it |
| Tool descriptions as afterthought | Agent-computer interface designed and tested like an API |
| Harness as fixed glue code | Harness as versioned, measurable, tunable artifact |
| Manual lessons in Slack | FeedbackStore, proposal lifecycle, release gate, rollout |

The rest of this post is the concrete workflow.

1. Start with the task distribution

The first artifact is not a prompt. It is a slice of the work.

intent: support.refund
risk_class: delegated_destructive
traffic_shape:
  daily_runs: 18000
  long_tail_rate: 0.17
  approval_gate_rate: 0.08
business_outcomes:
  - correct_refund_decision
  - no_policy_violation
  - no_unsupported_claim
  - time_to_safe_completion

The task distribution tells you what the agent is allowed to be good at. Without it, the team optimizes for whatever examples are easiest to remember.

Build the first dataset from four sources:

| Source | What it contributes |
| --- | --- |
| Production transcripts | Real language, real ambiguity, real tool failures |
| Operator corrections | Where human judgment disagreed with the system |
| Synthetic boundary cases | Rare but expensive cases that production may not sample enough |
| Policy examples | Must-allow, must-deny, require-approval, require-more-evidence |

Then split it deliberately:

| Split | Who can see it | Use |
| --- | --- | --- |
| dev | engineers | fast debugging and local iteration |
| search | human or automated proposer | candidate generation and tuning |
| release_test | release gate only | final regression check |
| shadow_live | production sampler | canary comparison against real traffic |

The release set must not become the search set. Once a proposer can iterate against it, it stops being a release gate and becomes training data.
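One way to keep that boundary honest is to make split assignment a pure function of a stable case id, so re-running curation can never quietly move a release case into the search set. A minimal sketch, assuming hypothetical case ids and the split names from the table above:

```python
import hashlib

# shadow_live is sampled from production traffic, not carved from goldens.
SPLITS = [("dev", 0.20), ("search", 0.50), ("release_test", 0.30)]

def assign_split(case_id: str) -> str:
    """Hash the stable case id into [0, 1) and map it onto a split bucket."""
    digest = hashlib.sha256(case_id.encode()).hexdigest()
    point = int(digest[:12], 16) / 16**12  # deterministic float in [0, 1)
    cumulative = 0.0
    for name, share in SPLITS:
        cumulative += share
        if point < cumulative:
            return name
    return SPLITS[-1][0]

# The assignment depends only on the id, so re-running curation cannot move
# a release case into the search set.
print(assign_split("support.refund/case_00421"))
```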

2. Define a scorecard before tuning

Great teams define the score before the candidate. Otherwise every candidate comes with a new argument for why it should count.

For agents, one scalar score is usually a trap. The agent can improve one dimension by cheating another. Cost falls when evidence is skipped. Latency improves when tools are avoided. Utility rises when approval gates are softened. A production scorecard needs multiple axes.

{
  "scorecard": {
    "policy": {
      "score": 1.0,
      "hard_floor": true,
      "metrics": ["rule_violation_rate", "approval_gate_honored_rate"]
    },
    "safety": {
      "score": 1.0,
      "hard_floor": true,
      "metrics": ["redaction_success", "unsupported_claim_rate"]
    },
    "utility": {
      "score": 0.91,
      "metrics": ["task_success", "operator_corrected_rate", "answer_completeness"]
    },
    "latency": {
      "p95_ms": 2100,
      "metrics": ["compile_ms", "planner_ms", "tool_wait_ms"]
    },
    "economics": {
      "cents_per_decision": 0.74,
      "metrics": ["tokens_per_decision", "tool_calls_per_decision"]
    }
  }
}

Policy and Safety are floor constraints. Utility, Latency, and Economics are optimization dimensions. That distinction keeps the optimizer honest.
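A release gate can encode that distinction directly: floors are absolute thresholds, optimization axes are compared against the baseline with a tolerance. A minimal sketch, assuming the scorecard shape shown above; the specific tolerances are illustrative, not ContextOS defaults:

```python
# Floors and tolerances here are illustrative, not ContextOS defaults.
HARD_FLOORS = {"policy": 1.0, "safety": 1.0}

def passes_release_gate(baseline: dict, candidate: dict) -> bool:
    """Floors are absolute; optimization axes are compared against the baseline."""
    for axis, floor in HARD_FLOORS.items():
        if candidate[axis]["score"] < floor:
            return False  # no trade-off buys back a broken floor
    if candidate["utility"]["score"] < baseline["utility"]["score"] - 0.005:
        return False
    if candidate["latency"]["p95_ms"] > baseline["latency"]["p95_ms"] * 1.10:
        return False
    if candidate["economics"]["cents_per_decision"] > baseline["economics"]["cents_per_decision"] * 1.10:
        return False
    return True
```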

3. Build the simplest harness that can be measured

The first version should be almost boring.

Do not start with five agents, a memory system, a router, three tool-use planners, and an evaluator cascade. Start with the smallest shape that produces traces and scores.

| Maturity | Harness shape | Ship when |
| --- | --- | --- |
| 0 | Single model call with retrieval and structured output | task is short, low-risk, and evals pass |
| 1 | Fixed workflow: classify -> retrieve -> answer -> validate | task decomposes cleanly |
| 2 | Planner / Executor / Critic loop | task needs adaptive tool use |
| 3 | Subagents or worker lanes | task has separable parallel work |
| 4 | Autotuned harness surfaces | enough traces exist to search safely |

This matches the field lesson from effective-agent work: complexity is not a virtue. It is a tax. Pay it only when the scorecard proves the simpler shape is exhausted.

In ContextOS terms, the first production-grade harness still emits the core artifacts defined by the canonical execution contract:

RunContext
  -> CompiledContext
  -> Plan
  -> ToolEnvelope[]
  -> Scorecard
  -> DecisionRecord
  -> ReplayHandle

If the harness cannot emit these, it cannot improve reliably because it cannot explain itself.
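As a rough sketch of what "emit these" means in code, the contract can be written down as plain types; the field lists here are illustrative, not the canonical ContextOS schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CompiledContext:
    evidence_refs: list[str]
    policy_refs: list[str]
    token_budget: int

@dataclass
class Plan:
    steps: list[str]
    rationale: str

@dataclass
class ToolEnvelope:
    tool: str
    args: dict
    result: dict
    retryable: bool

@dataclass
class Scorecard:
    policy: float
    safety: float
    utility: float
    p95_ms: int
    cents_per_decision: float

@dataclass
class DecisionRecord:
    decision: str
    evidence_refs: list[str]

@dataclass
class RunArtifacts:
    """What even the simplest harness must emit to stay explainable."""
    context: CompiledContext
    plan: Plan
    tools: list[ToolEnvelope] = field(default_factory=list)
    scorecard: Optional[Scorecard] = None
    decision: Optional[DecisionRecord] = None
    replay_handle: str = ""  # where the raw run can be read back from
```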

4. Treat traces as the real training data

Final outputs are too thin.

An agent can produce a good answer for the wrong reason, a bad answer after the right plan, or a policy-compliant refusal caused by missing evidence. You cannot distinguish those by grading the final string alone. You need the trace.

A useful trace stores:

| Trace layer | Questions it answers |
| --- | --- |
| Context compile | What evidence, memory, tools, policy, and budget reached the model? |
| Plan | What did the agent intend to do and why? |
| Tool calls | What external truth did it observe or change? |
| Guardrails | Which policy, schema, approval, and safety checks fired? |
| Critic verdict | Why did the harness accept, retry, replan, or escalate? |
| Final record | What decision shipped and what evidence supports it? |

That is why trace grading matters. It lets the team grade the path, not only the destination.

For ContextOS, this means traces are not merely observability. They are the experience store that future humans and proposers use to improve the harness.

harness/experience/runs/run_001/
  compiled-context.json
  plan.json
  tool-transcripts.jsonl
  guardrail-events.jsonl
  critic-verdicts.jsonl
  decision-record.json
  scorecard.json
  correction.json

The next candidate should be able to read the raw run, not just a summary. Summaries are useful for humans. They are not enough for diagnosing confounds.
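A replay harness is mostly this: read the raw artifacts back, substitute recorded tool transcripts for live calls, and re-score with the current evaluators. A hedged sketch against the run layout above; the helper names are assumptions, not a ContextOS API:

```python
import json
import pathlib

def load_run(run_dir: str) -> dict:
    """Read one run's raw artifacts back into memory, keyed by file stem."""
    root = pathlib.Path(run_dir)
    run = {p.stem: json.loads(p.read_text()) for p in root.glob("*.json")}
    # Tool transcripts are line-delimited so long runs stream cleanly.
    run["tool-transcripts"] = [
        json.loads(line)
        for line in (root / "tool-transcripts.jsonl").read_text().splitlines()
        if line.strip()
    ]
    return run

def replayed_tool_result(run: dict, tool: str, args: dict):
    """Return the recorded result for a tool call instead of hitting the tool live."""
    for entry in run["tool-transcripts"]:
        if entry["tool"] == tool and entry["args"] == args:
            return entry["result"]
    raise KeyError(f"no recorded transcript for {tool} with these arguments")
```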

5. Engineer the agent-computer interface

Great AI engineers spend surprising time on tools.

Tool schemas, parameter names, result shapes, examples, and failure modes are part of the harness. They determine what the model can reliably do. A vague tool is like a bad API with an unreliable caller.

For every tool, review:

| Tool question | Production-grade answer |
| --- | --- |
| What exactly does it do? | One capability, one side-effect class, one owner |
| What arguments are required? | Strict schema with examples and constraints |
| What does it return? | Typed result with evidence refs and retryability |
| What can go wrong? | Structured error classes, not free-form strings |
| Can it be replayed? | Tool transcript can substitute for live call |
| Can it be retried? | Idempotency key or explicit non-retryable status |
| What approval mode applies? | read_only, local_write, network, delegated, or destructive |

Bad tool design forces the model to compensate with reasoning. Good tool design makes the right action easy and the wrong action hard.

The same rule applies to prompts, context packs, evaluator rubrics, and policies: write them like interfaces, not prose.
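Here is what "tool as interface" can look like in practice: one tool declared with a strict JSON-Schema-style parameter block, structured error classes, an idempotency key, and an approval mode. The field names are illustrative assumptions, not a ContextOS or vendor tool-calling schema:

```python
# Illustrative only: field names (approval_mode, errors) are assumptions.
ISSUE_REFUND_TOOL = {
    "name": "issue_refund",
    "description": "Issue a refund for a single order. One capability, one side-effect class.",
    "approval_mode": "delegated_destructive",
    "parameters": {
        "type": "object",
        "required": ["order_id", "amount_cents", "reason_code", "idempotency_key"],
        "properties": {
            "order_id": {"type": "string", "pattern": "^ord_[a-z0-9]+$"},
            "amount_cents": {"type": "integer", "minimum": 1, "maximum": 500000},
            "reason_code": {
                "type": "string",
                "enum": ["defective", "not_delivered", "supplier_exception"],
            },
            "idempotency_key": {
                "type": "string",
                "description": "Same key, same refund; retries are safe.",
            },
        },
        "additionalProperties": False,
    },
    # Structured error classes instead of free-form strings.
    "errors": [
        "ORDER_NOT_FOUND",
        "ALREADY_REFUNDED",
        "AMOUNT_EXCEEDS_POLICY",
        "UPSTREAM_TIMEOUT_RETRYABLE",
    ],
}
```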

6. Make every correction become data

A user correction is not a customer-support note. It is a supervised signal for the Improvement Loop.

The correction needs structure:

{
  "feedback_id": "fb_2026_05_12_017",
  "trace_id": "trace_refund_77",
  "intent": "support.refund",
  "observed": {
    "decision": "deny",
    "reason": "outside refund window"
  },
  "expected": {
    "decision": "approve_with_gate",
    "reason": "supplier exception overrides default window"
  },
  "reason_class": "missing_supplier_exception",
  "evidence_refs_missing": ["supplier_policy.exception_window"],
  "signed_by": "support_ops_lead",
  "status": "captured"
}

That single record can feed three loops:

| Loop | Output |
| --- | --- |
| Dataset loop | Add a replay case to the golden set |
| Insight loop | Cluster repeated corrections into a pattern |
| Strategy loop | Propose a pack, policy, retrieval, or planner change |

The failure is not closed when the user gets the right answer. It is closed when the harness changes or the team records why it should not.
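Turning a signed correction into a golden replay case is nearly mechanical once the FeedbackRecord is typed. A small sketch, assuming the record shape above; the golden-case field names are illustrative:

```python
def correction_to_golden(feedback: dict) -> dict:
    """Map a signed FeedbackRecord onto a replayable golden case."""
    return {
        "case_id": f"golden_from_{feedback['feedback_id']}",
        "intent": feedback["intent"],
        "replay_trace": feedback["trace_id"],  # replay from the stored run, no live tools
        "expected_decision": feedback["expected"]["decision"],
        "expected_reason": feedback["expected"]["reason"],
        "required_evidence": feedback.get("evidence_refs_missing", []),
        "provenance": {"source": "operator_correction", "signed_by": feedback["signed_by"]},
    }
```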

7. Treat the harness like a model

This is the step most software teams miss.

They version code and prompts, but they do not treat the whole harness as the thing being optimized. Great AI engineers do.

The harness has parameters:

| Harness surface | Examples |
| --- | --- |
| Context | retrieval top_k, source priority, bucket budgets, compression strategy |
| Tools | tool descriptions, schema constraints, retry policy, result shaping |
| Decision | planner templates, tool order, re-plan budget, Critic rubric |
| Trust | evaluator thresholds, sampling, approval gates, rollout stages |
| Memory | promotion threshold, recall class, contradiction handling |

A harness candidate is therefore a release artifact:

{
  "candidate_id": "hc_2026_05_12_refund_003",
  "baseline": "release.support.refund@2026-05-09",
  "changes": [
    {
      "surface": "context.retrieval",
      "field": "supplier_policy.max_hops",
      "from": 3,
      "to": 2
    },
    {
      "surface": "planner.template",
      "field": "verify_supplier_exception_before_window_denial",
      "from": false,
      "to": true
    }
  ],
  "target_metric": "operator_corrected_rate",
  "guardrails": ["policy>=1.0", "safety>=1.0", "approval_bypass_rate=0"],
  "search_set": "goldens/support.refund/search@2026-05-12",
  "heldout_set": "goldens/support.refund/test@2026-05-12"
}

That is not “configuration.” It is the agent equivalent of a model checkpoint candidate. It needs evals, review, promotion, shadow traffic, and rollback.
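Mechanically, evaluating a candidate means applying its surface diffs to the baseline config and refusing to ship unless the guardrail strings hold. A sketch under those assumptions; the config layout and helper names are illustrative, not the team's real machinery:

```python
import copy

def apply_changes(baseline_config: dict, changes: list[dict]) -> dict:
    """Build the candidate config by applying each surface/field diff from the spec."""
    candidate = copy.deepcopy(baseline_config)
    for change in changes:
        surface = candidate.setdefault(change["surface"], {})
        # Guard against baseline drift: the recorded 'from' value must still hold.
        assert surface.get(change["field"], change["from"]) == change["from"], "baseline drift"
        surface[change["field"]] = change["to"]
    return candidate

def guardrails_hold(summary: dict) -> bool:
    """Check the guardrail strings from the spec against a flat scorecard summary."""
    return (
        summary["policy"] >= 1.0
        and summary["safety"] >= 1.0
        and summary.get("approval_bypass_rate", 0) == 0
    )
```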

The improvement loop

The loop is simple enough to draw on a whiteboard and strict enough to run in production:

examples + traces + corrections
        ->
dataset and scorecard
        ->
candidate harness change
        ->
replay on search set
        ->
replay on held-out release set
        ->
human review
        ->
staged rollout
        ->
production traces
        ->
new examples + corrections

The important part is not that the loop exists. It is that every handoff has a typed artifact.

| Stage | Artifact |
| --- | --- |
| Capture | TraceBundle, DecisionRecord, FeedbackRecord |
| Curate | GoldenSet, SearchSet, HeldoutSet |
| Score | Scorecard, ReplayVerdict |
| Propose | TuningProposal, StrategyRule, KnowledgePatch |
| Review | reviewer verdicts, owner approval, rejection reason |
| Release | pack/policy/tool/evaluator tuple, rollout stage, rollback target |

Without artifacts, the improvement loop becomes a meeting. With artifacts, it becomes an engineering system.
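One way to make "every handoff has a typed artifact" concrete: the promotion decision consumes artifacts, not opinions. A minimal sketch with illustrative types; the real artifacts carry far more detail:

```python
from dataclasses import dataclass

@dataclass
class ReplayVerdict:
    dataset: str          # e.g. the search set or the held-out release set
    utility_delta: float  # candidate minus baseline
    floors_hold: bool     # Policy and Safety stayed at their hard floors

@dataclass
class ReviewVerdict:
    approved: bool
    reviewer: str
    reason: str

def promote(search: ReplayVerdict, heldout: ReplayVerdict, review: ReviewVerdict) -> str:
    """A missing or failing artifact means the candidate does not ship."""
    if not (search.floors_hold and heldout.floors_hold):
        return "rejected: Policy or Safety floor broken"
    if heldout.utility_delta < 0:
        return "rejected: regression on the held-out release set"
    if not review.approved:
        return f"rejected by {review.reviewer}: {review.reason}"
    return "promote to staged rollout: shadow -> internal -> low-risk -> monitored -> full"
```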

What great AI engineers put in the PR

The best agent PRs are not impressive because the diff is clever. They are impressive because the evidence is easy to inspect.

| PR section | What it should contain |
| --- | --- |
| Problem | The failing traces, dataset slice, or correction cluster |
| Candidate | The exact harness surface changed |
| Scorecard | Baseline vs. candidate on search and held-out sets |
| Trace diff | What changed in context, plan, tool use, and verdict |
| Safety | Policy and Safety floors, approval-gate behavior, redaction |
| Cost and latency | Token, tool-call, and p95 deltas |
| Rollout | shadow/internal/low-risk/monitored/full stage plan |
| Rollback | Prior release tuple and replay confirmation |

If the PR cannot answer these, it is not ready. It may be a useful experiment. It is not a production harness change.

The ContextOS mapping

ContextOS foundations give this working style an ownership model.

| Plane | What great engineers measure | What they improve |
| --- | --- | --- |
| Intelligence | entity-resolution misses, evidence freshness, memory correction rate | ontology additions, graph retrieval constraints, memory promotion rules |
| Context | evidence coverage, omission rate, budget pressure, citation density | Context Pack, retrieval policy, bucket budgets, compression |
| Decision | plan verification pass rate, re-plan rate, Critic verdict quality | planner templates, tool ordering, Critic rubric, loop guards |
| Action | tool success, schema rejection, retry safety, approval-mode mismatch | tool schemas, result shapes, adapter reliability, idempotency |
| Trust | policy pass rate, safety pass rate, replay determinism, proposal adoption | evaluator suites, trace grading, release gates, rollout policy |

This is why ContextOS treats the harness as cross-plane. No single prompt file owns agent quality. Quality emerges from the contracts between planes and improves when each plane exposes measured, bounded improvement surfaces.

The 30-day build plan

If I were starting an agent team from scratch, this is the first month.

| Days | Build | Done when |
| --- | --- | --- |
| 1-3 | Task contract and dataset seed | 100 representative cases, 30 boundary cases, first rubric |
| 4-7 | Baseline harness | single-call or fixed workflow emits traces and scorecards |
| 8-12 | Tool interface pass | schemas, examples, idempotency, error classes, approval modes |
| 13-16 | Replay harness | past runs can be re-scored without live tool calls |
| 17-20 | Scorecard gate | Policy, Safety, Utility, Latency, Economics reported by intent |
| 21-24 | Correction capture | operator corrections become typed FeedbackRecords |
| 25-27 | Candidate loop | one bounded harness surface can be tuned against the search set |
| 28-30 | Rollout path | shadow -> internal -> low-risk -> monitored -> full, with rollback |

Notice what is missing: “build autonomous agent” is not the first milestone. A measurable, replayable, improvable harness is.

Anti-patterns

The common mistakes are predictable.

| Anti-pattern | Why it fails | Replacement |
| --- | --- | --- |
| Demo-driven development | Optimizes for memorable examples | dataset slices by intent and risk |
| Prompt-only iteration | Hides retrieval, tools, policy, and scoring failures | harness candidate diffs |
| Final-answer-only evals | Misses bad plans that got lucky | trace grading |
| One global benchmark | Hides per-intent regressions | scorecards by intent and pack version |
| Auto-promotion | Turns optimizer bugs into production incidents | proposal-only improvement loop |
| Framework opacity | Makes prompts, tool calls, and state hard to inspect | simple components with explicit traces |
| Human feedback as notes | Loses supervised signal | typed corrections and golden-set updates |

The bar

An agent team is operating at a high level when these statements are true:

| Statement | Evidence |
| --- | --- |
| “We know what success means.” | versioned datasets, rubrics, and scorecards |
| “We know why failures happen.” | trace bundles with context, plan, tools, guardrails, verdicts |
| “We know what changed.” | candidate harness diffs and release tuples |
| “We know the change helped.” | search and held-out replay deltas |
| “We know it is safe to ship.” | Policy and Safety floors, approval-gate checks, rollout stage |
| “We can undo it.” | rollback target and replayable prior tuple |
| “We learn from production.” | corrections feed FeedbackStore, goldens, insights, proposals |

That is the standard. Not a clever prompt. Not a bigger model. Not a beautiful agent diagram.

Great AI engineers build the machine that makes agents better. The harness is that machine. Treat it like a model: give it data, score it, inspect its traces, search its variants, gate its releases, and let production corrections teach the next version.

That is how agents become engineering systems instead of impressive demos.
