Agent engineering series
May 12, 2026 · by Piyush · 7 min read

Dataset-First Agent Engineering: The Golden Sets Behind Reliable Agents

ContextOS · AI Engineering · Evaluation · Datasets · Agents

Most weak agent projects start with a prompt.

Most strong agent projects start with a spreadsheet.

Not because spreadsheets are elegant, but because they force the first hard question: what examples define the work? Until a team can answer that, every improvement is anecdotal. The agent looks better on the demo, worse on the long tail, safer on refusals, more expensive on easy cases, and nobody knows which tradeoff actually happened.

Dataset-first agent engineering means the agent is developed against a living task distribution. The dataset is not a benchmark someone downloads once. It is the operating surface that tells the team what success means, where the system is failing, which candidates deserve release, and which production corrections should become future tests.

This is the second piece in the Agent Engineering series. The first, How Great AI Engineers Build Agents, argued that strong teams treat the harness like a model. This post is about the first input to that model: the dataset.

The dataset is the product specification

For normal software, a test suite says “the implementation still satisfies the contract.”

For agents, the dataset says something larger: “this is the work distribution the harness is being optimized for.”

That distribution has to include ordinary requests, ambiguous requests, tool failures, policy boundaries, missing evidence, stale memory, adversarial text, and operator corrections. Otherwise the harness optimizes for the polished center and fails at the edges where production agents are expensive.

| If the dataset lacks | The harness learns |
| --- | --- |
| Policy denials | to answer when it should refuse |
| Approval gates | to treat risky actions like ordinary tool calls |
| Missing evidence | to improvise unsupported claims |
| Tool failures | to loop, retry, or hallucinate tool results |
| Operator corrections | to repeat mistakes humans already fixed |
| Easy cases | to overfit to exotic failures and hurt normal traffic |

The goal is not a huge dataset. The goal is a representative one that can explain what changed.

Start with a task distribution

Do not collect random examples. Define the task distribution first.

intent: support.refund
decision_key: support.refund.execute
risk_class: delegated_destructive
traffic:
  daily_runs: 18000
  approval_gate_rate: 0.08
  missing_evidence_rate: 0.12
  policy_denial_rate: 0.06
outcomes:
  primary: correct_refund_decision
  guardrails:
    - no_policy_violation
    - no_unsupported_claim
    - no_approval_bypass
    - no_duplicate_refund

This file is already an engineering asset. It tells the team what to sample, what to score, and what not to regress.

The task distribution should be owned by the same team that owns the business workflow. An AI team can help structure it, but a support refund dataset without support-ops ownership will drift into artificial examples.
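
One way to make that asset mechanical is to check the golden set's slice mix against the declared traffic rates. The sketch below assumes rows carry a labels.slice field and that the distribution file uses the *_rate keys shown above; none of this is a ContextOS API, just an illustration.

# Sketch: flag dataset slices that are under-represented relative to the task
# distribution. The labels.slice field and the 50% floor are illustrative
# assumptions, not ContextOS fields.
import json
from collections import Counter

import yaml  # pip install pyyaml

def slice_gaps(distribution_path: str, rows_path: str, floor: float = 0.5) -> list[dict]:
    with open(distribution_path) as f:
        dist = yaml.safe_load(f)
    with open(rows_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]

    counts = Counter(row["labels"].get("slice", "happy_path") for row in rows)
    total = len(rows) or 1

    gaps = []
    for key, expected in dist["traffic"].items():
        if not key.endswith("_rate"):
            continue  # skip absolute counts like daily_runs
        slice_name = key[: -len("_rate")]  # e.g. "policy_denial"
        observed = counts[slice_name] / total
        if observed < expected * floor:    # under-covered relative to production
            gaps.append({"slice": slice_name, "expected": expected, "observed": round(observed, 3)})
    return gaps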

Four dataset splits

A production agent needs more than one test set.

| Split | Who can use it | Purpose |
| --- | --- | --- |
| dev | engineers | local debugging, quick experiments, obvious failure reproduction |
| search | human and automated proposers | candidate generation, prompt tuning, retrieval tuning, planner tuning |
| release_test | release gate only | final held-out regression check |
| shadow_live | runtime sampler | canary comparison against current production traffic |

The most common mistake is using the same examples for all four. That gives fast improvement and false confidence. The team iterates against the same examples until the harness memorizes the shape of the test.

Keep the release set held out. Rotate it on a schedule. Do not let automated proposers inspect it during search. If the release set becomes part of candidate generation, it is no longer a release set.
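
One way to keep splits stable is to assign rows deterministically from the case_id, so a row never silently migrates into or out of the release set as the dataset grows. The weights below are illustrative; shadow_live is sampled from production at runtime, so it is not assigned here.

# Sketch: deterministic split assignment keyed on case_id. Weights are
# illustrative assumptions, not a ContextOS convention.
import hashlib

SPLITS = [("dev", 0.30), ("search", 0.50), ("release_test", 0.20)]

def assign_split(case_id: str) -> str:
    # Hash the stable case_id, not the row contents, so later edits to
    # fixtures or labels cannot move a row into or out of release_test.
    digest = hashlib.sha256(case_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for name, weight in SPLITS:
        cumulative += weight
        if bucket <= cumulative:
            return name
    return SPLITS[-1][0]

print(assign_split("case_refund_2026_05_12_017"))  # stable across runs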

The row schema

A dataset row should be replayable, not just readable.

{
  "case_id": "case_refund_2026_05_12_017",
  "intent": "support.refund",
  "risk_class": "delegated_destructive",
  "input": {
    "user_message": "Can you refund my order? The supplier approved an exception.",
    "tenant_id": "tenant_acme",
    "actor_claims": ["support.agent"]
  },
  "fixtures": {
    "kg_snapshot": "kg_support_2026_05_12",
    "tool_transcripts": "fixtures/refund_exception_017/tool-transcripts.jsonl",
    "policy_bundle": "policy.returns@4.1.0",
    "tool_manifest": "tools.support@3.7.0"
  },
  "expected": {
    "decision": "approve_with_gate",
    "required_evidence_refs": ["order:ord_881", "supplier_policy:exception_22"],
    "required_gate": "GATE_FINANCE_APPROVAL",
    "forbidden": ["issue_refund_without_gate", "claim_policy_without_evidence"]
  },
  "labels": {
    "source": "operator_correction",
    "reason_class": "missing_supplier_exception",
    "difficulty": "boundary",
    "owner": "support_ops"
  }
}

The important fields are not only input and expected. The fixtures make replay possible. The labels make slicing possible. The owner makes the row governable.

If a row cannot be replayed, it is still useful for human inspection, but it is weaker as an engineering artifact.
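
For concreteness, replay roughly means: run the harness with every external input pinned to the row's fixtures, then diff the outcome against the expected block. The harness interface below is a hypothetical stand-in; only the row fields come from the schema above.

# Sketch: replay one row against a candidate harness. The harness.run(...)
# signature is hypothetical, not the ContextOS API.
def replay_row(row: dict, harness) -> dict:
    fx = row["fixtures"]
    result = harness.run(
        user_message=row["input"]["user_message"],
        tenant_id=row["input"]["tenant_id"],
        kg_snapshot=fx["kg_snapshot"],            # frozen knowledge graph
        tool_transcripts=fx["tool_transcripts"],  # recorded tool responses
        policy_bundle=fx["policy_bundle"],        # pinned policy version
        tool_manifest=fx["tool_manifest"],        # pinned tool surface
    )
    expected = row["expected"]
    return {
        "case_id": row["case_id"],
        "decision_match": result.decision == expected["decision"],
        "gate_respected": expected["required_gate"] in result.gates,
        "forbidden_hits": [a for a in expected["forbidden"] if a in result.actions],
    }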

What belongs in a golden set

A golden set should cover the workflow, not only the happy path.

| Slice | Minimum content |
| --- | --- |
| Happy path | ordinary successful requests with complete evidence |
| Boundary path | threshold cases, policy exceptions, approval-required actions |
| Missing evidence | no source, stale source, conflicting source, inaccessible source |
| Tool behavior | timeout, retryable error, non-retryable error, malformed result |
| Safety and policy | must-refuse, must-escalate, redact, require approval |
| Memory | correct recall, stale recall, contradiction, no-consent recall |
| User behavior | ambiguous request, correction, hostile text, prompt-injection attempt |
| Regression cases | every production incident that should never recur |

The last row is where many teams fail. Incidents produce retros. They rarely produce durable test cases. A ContextOS-style Improvement Loop treats the incident replay as the durable artifact.

How examples enter the dataset

Do not let every production run become a golden case. That creates volume without judgment.

Use an intake queue:

production trace
  -> sampled candidate row
  -> reviewer labels source and expected behavior
  -> replay fixtures pinned
  -> row accepted into dev/search/release split

The reviewer is deciding whether this case teaches the harness something. A row can be rejected with a reason: duplicate, unclear expected outcome, not representative, policy pending, fixture incomplete.
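
A lightweight way to make that judgment auditable is to record every intake decision with an explicit reason code. The structure below is a sketch; only the reason list comes from the text above.

# Sketch: an intake decision record with explicit rejection reasons. The
# dataclass shape is illustrative.
from dataclasses import dataclass
from typing import Optional

REJECT_REASONS = {
    "duplicate",
    "unclear_expected_outcome",
    "not_representative",
    "policy_pending",
    "fixture_incomplete",
}

@dataclass
class IntakeDecision:
    case_id: str
    reviewer: str
    accepted: bool
    split: Optional[str] = None          # dev / search / release_test when accepted
    reject_reason: Optional[str] = None  # one of REJECT_REASONS when rejected

    def __post_init__(self) -> None:
        if not self.accepted and self.reject_reason not in REJECT_REASONS:
            raise ValueError(f"unknown rejection reason: {self.reject_reason}")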

Operator corrections should have a faster path:

operator correction
  -> FeedbackRecord
  -> replay fixture
  -> dev row immediately
  -> search row after reviewer confirms expected behavior
  -> release row only after dedupe and owner approval

Corrections are high-signal but not automatically correct. Operators can be wrong. The dataset needs provenance so a later correction can supersede an earlier one.
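
Provenance can be as simple as two pointers per row: what this row supersedes, and what later superseded it. A sketch with illustrative field names:

# Sketch: provenance pointers that let a later correction supersede an earlier
# one without deleting history. Field names are assumptions, not ContextOS types.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class RowProvenance:
    case_id: str
    source: str                           # "operator_correction", "incident", "sampled_trace"
    feedback_record_id: Optional[str] = None
    supersedes: Optional[str] = None      # case_id this row replaces
    superseded_by: Optional[str] = None   # case_id that later replaced this row
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def supersede(old: RowProvenance, new: RowProvenance) -> None:
    # The old row stays in the archive for audit; it just stops counting
    # toward coverage or release gating.
    old.superseded_by = new.case_id
    new.supersedes = old.case_id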

Dataset metrics

Dataset quality needs its own scorecard.

| Metric | Why it matters |
| --- | --- |
| Coverage by intent | prevents one high-volume workflow from hiding missing coverage elsewhere |
| Coverage by risk class | ensures destructive and delegated actions are tested |
| Boundary-case ratio | keeps the set from becoming mostly easy traffic |
| Correction incorporation lag | measures how fast production learning becomes test coverage |
| Replayability rate | percent of rows with pinned fixtures and tool transcripts |
| Label disagreement rate | tells whether the expected outcome is ambiguous |
| Staleness age | flags rows whose policies, tools, or source schemas are obsolete |
| Duplicate rate | prevents a large dataset from becoming a repeated dataset |

The best agent teams prune. They remove stale, duplicate, and low-signal rows. Dataset maintenance is part of harness maintenance.
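
Two of those metrics are cheap to compute directly from the rows. The sketch below follows the row schema from earlier; the duplicate heuristic is a deliberately crude assumption.

# Sketch: replayability and duplicate rate computed straight from rows that
# follow the schema above. The duplicate heuristic is intentionally crude.
from collections import Counter

def replayability_rate(rows: list[dict]) -> float:
    required = ("kg_snapshot", "tool_transcripts", "policy_bundle", "tool_manifest")
    pinned = sum(1 for r in rows if all(r.get("fixtures", {}).get(k) for k in required))
    return pinned / len(rows) if rows else 0.0

def duplicate_rate(rows: list[dict]) -> float:
    # Cheap proxy: identical user_message within the same intent.
    keys = Counter((r["intent"], r["input"]["user_message"].strip().lower()) for r in rows)
    dupes = sum(count - 1 for count in keys.values() if count > 1)
    return dupes / len(rows) if rows else 0.0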

The ContextOS version

In ContextOS, the dataset is not floating outside the runtime. It binds to the same artifacts the harness uses in production.

| Dataset field | ContextOS artifact |
| --- | --- |
| intent and risk class | Intent-Task Catalog |
| expected decision | DecisionSpec and DecisionRecord |
| evidence requirements | Context Pack and Knowledge Graph snapshot |
| tool fixtures | ToolEnvelope transcripts |
| policy expectations | policy bundle and approval-mode tiers |
| score dimensions | evaluator suite |
| correction source | FeedbackStore |

That binding matters because it stops the dataset from becoming prose. A release gate can replay a row against a candidate pack, policy, tool manifest, and evaluator suite, then produce a typed verdict.
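
What a typed verdict could look like as a data shape, purely as an assumption about structure rather than the ContextOS type:

# Sketch: a typed release-gate verdict, structured instead of prose. The shape
# is illustrative.
from dataclasses import dataclass, field

@dataclass
class ReleaseVerdict:
    case_id: str
    candidate: str                                                # e.g. pack + policy + tool manifest versions
    passed: bool
    failed_guardrails: list[str] = field(default_factory=list)    # e.g. ["no_approval_bypass"]
    score_deltas: dict[str, float] = field(default_factory=dict)  # evaluator dimension -> delta vs. baseline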

The first week

A small team can build a useful first golden set in a week.

| Day | Work |
| --- | --- |
| 1 | Pick one intent and write the task distribution |
| 2 | Collect 50 ordinary production-like cases |
| 3 | Add 25 boundary cases from policy and operator knowledge |
| 4 | Add 25 failure cases from past incidents or synthetic tool failures |
| 5 | Define row schema, fixtures, and expected outcomes |
| 6 | Split dev/search/release and run the baseline harness |
| 7 | Review the scorecard and fill the biggest missing slice |

Do not wait for a perfect dataset. A rough, owned, replayable dataset beats an ambitious benchmark nobody can operate.

The bar

A dataset is production-grade when:

| Check | Pass line |
| --- | --- |
| It has an owner | a domain owner signs expected outcomes |
| It is sliceable | rows carry intent, risk, source, difficulty, and reason labels |
| It is replayable | rows pin snapshots, policies, tools, and transcripts |
| It is held out | release examples are not used for candidate search |
| It learns | corrections and incidents become rows |
| It ages | stale rows are reviewed, updated, or retired |

Great agent engineering begins here. The dataset is not paperwork around the agent. It is the agent’s operating definition of reality.

