Agent engineering series
May 12, 2026 · by Piyush · 7 min read

Dataset-First Agent Engineering: The Golden Sets Behind Reliable Agents

ContextOS · AI Engineering · Evaluation · Datasets · Agents

Most weak agent projects start with a prompt.

Most strong agent projects start with a spreadsheet.

Not because spreadsheets are elegant, but because they force the first hard question: what examples define the work? Until a team can answer that, every improvement is anecdotal. The agent looks better on the demo, worse on the long tail, safer on refusals, more expensive on easy cases, and nobody knows which tradeoff actually happened.

Dataset-first agent engineering means the agent is developed against a living task distribution. The dataset is not a benchmark someone downloads once. It is the operating surface that tells the team what success means, where the system is failing, which candidates deserve release, and which production corrections should become future tests.

This is the second piece in the Agent Engineering series. The first, How Great AI Engineers Build Agents, argued that strong teams treat the harness like a model. This post is about the first input to that model: the dataset.

The dataset is the product specification

For normal software, a test suite says “the implementation still satisfies the contract.”

For agents, the dataset says something larger: “this is the work distribution the harness is being optimized for.”

That distribution has to include ordinary requests, ambiguous requests, tool failures, policy boundaries, missing evidence, stale memory, adversarial text, and operator corrections. Otherwise the harness optimizes for the polished center and fails at the edges where production agents are expensive.

| If the dataset lacks | The harness learns |
| --- | --- |
| Policy denials | to answer when it should refuse |
| Approval gates | to treat risky actions like ordinary tool calls |
| Missing evidence | to improvise unsupported claims |
| Tool failures | to loop, retry, or hallucinate tool results |
| Operator corrections | to repeat mistakes humans already fixed |
| Easy cases | to overfit to exotic failures and hurt normal traffic |

The goal is not a huge dataset. The goal is a representative one that can explain what changed.

Start with a task distribution

Do not collect random examples. Define the task distribution first.

intent: support.refund
decision_key: support.refund.execute
risk_class: delegated_destructive
traffic:
  daily_runs: 18000
  approval_gate_rate: 0.08
  missing_evidence_rate: 0.12
  policy_denial_rate: 0.06
outcomes:
  primary: correct_refund_decision
  guardrails:
    - no_policy_violation
    - no_unsupported_claim
    - no_approval_bypass
    - no_duplicate_refund

This file is already an engineering asset. It tells the team what to sample, what to score, and what not to regress.

The task distribution should be owned by the same team that owns the business workflow. An AI team can help structure it, but a support refund dataset without support-ops ownership will drift into artificial examples.
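
One way to make that asset mechanical is to check the golden set's slice mix against the declared traffic rates. The sketch below assumes rows carry a labels.slice field and that the distribution file uses the *_rate keys shown above; none of this is a ContextOS API, just an illustration.

# Sketch: flag dataset slices that are under-represented relative to the task
# distribution. The labels.slice field and the 50% floor are illustrative
# assumptions, not ContextOS fields.
import json
from collections import Counter

import yaml  # pip install pyyaml

def slice_gaps(distribution_path: str, rows_path: str, floor: float = 0.5) -> list[dict]:
    with open(distribution_path) as f:
        dist = yaml.safe_load(f)
    with open(rows_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]

    counts = Counter(row["labels"].get("slice", "happy_path") for row in rows)
    total = len(rows) or 1

    gaps = []
    for key, expected in dist["traffic"].items():
        if not key.endswith("_rate"):
            continue  # skip absolute counts like daily_runs
        slice_name = key[: -len("_rate")]  # e.g. "policy_denial"
        observed = counts[slice_name] / total
        if observed < expected * floor:    # under-covered relative to production
            gaps.append({"slice": slice_name, "expected": expected, "observed": round(observed, 3)})
    return gaps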

Four dataset splits

A production agent needs more than one test set.

| Split | Who can use it | Purpose |
| --- | --- | --- |
| dev | engineers | local debugging, quick experiments, obvious failure reproduction |
| search | human and automated proposers | candidate generation, prompt tuning, retrieval tuning, planner tuning |
| release_test | release gate only | final held-out regression check |
| shadow_live | runtime sampler | canary comparison against current production traffic |

The most common mistake is using the same examples for all four. That gives fast improvement and false confidence. The team iterates against the same examples until the harness memorizes the shape of the test.

Keep the release set held out. Rotate it on a schedule. Do not let automated proposers inspect it during search. If the release set becomes part of candidate generation, it is no longer a release set.
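
One way to keep splits stable is to assign rows deterministically from the case_id, so a row never silently migrates into or out of the release set as the dataset grows. The weights below are illustrative; shadow_live is sampled from production at runtime, so it is not assigned here.

# Sketch: deterministic split assignment keyed on case_id. Weights are
# illustrative assumptions, not a ContextOS convention.
import hashlib

SPLITS = [("dev", 0.30), ("search", 0.50), ("release_test", 0.20)]

def assign_split(case_id: str) -> str:
    # Hash the stable case_id, not the row contents, so later edits to
    # fixtures or labels cannot move a row into or out of release_test.
    digest = hashlib.sha256(case_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for name, weight in SPLITS:
        cumulative += weight
        if bucket <= cumulative:
            return name
    return SPLITS[-1][0]

print(assign_split("case_refund_2026_05_12_017"))  # stable across runs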

The row schema

A dataset row should be replayable, not just readable.

{
  "case_id": "case_refund_2026_05_12_017",
  "intent": "support.refund",
  "risk_class": "delegated_destructive",
  "input": {
    "user_message": "Can you refund my order? The supplier approved an exception.",
    "tenant_id": "tenant_acme",
    "actor_claims": ["support.agent"]
  },
  "fixtures": {
    "kg_snapshot": "kg_support_2026_05_12",
    "tool_transcripts": "fixtures/refund_exception_017/tool-transcripts.jsonl",
    "policy_bundle": "policy.returns@4.1.0",
    "tool_manifest": "tools.support@3.7.0"
  },
  "expected": {
    "decision": "approve_with_gate",
    "required_evidence_refs": ["order:ord_881", "supplier_policy:exception_22"],
    "required_gate": "GATE_FINANCE_APPROVAL",
    "forbidden": ["issue_refund_without_gate", "claim_policy_without_evidence"]
  },
  "labels": {
    "source": "operator_correction",
    "reason_class": "missing_supplier_exception",
    "difficulty": "boundary",
    "owner": "support_ops"
  }
}

The important fields are not only input and expected. The fixtures make replay possible. The labels make slicing possible. The owner makes the row governable.

If a row cannot be replayed, it is still useful for human inspection, but it is weaker as an engineering artifact.
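
For concreteness, replay roughly means: run the harness with every external input pinned to the row's fixtures, then diff the outcome against the expected block. The harness interface below is a hypothetical stand-in; only the row fields come from the schema above.

# Sketch: replay one row against a candidate harness. The harness.run(...)
# signature is hypothetical, not the ContextOS API.
def replay_row(row: dict, harness) -> dict:
    fx = row["fixtures"]
    result = harness.run(
        user_message=row["input"]["user_message"],
        tenant_id=row["input"]["tenant_id"],
        kg_snapshot=fx["kg_snapshot"],            # frozen knowledge graph
        tool_transcripts=fx["tool_transcripts"],  # recorded tool responses
        policy_bundle=fx["policy_bundle"],        # pinned policy version
        tool_manifest=fx["tool_manifest"],        # pinned tool surface
    )
    expected = row["expected"]
    return {
        "case_id": row["case_id"],
        "decision_match": result.decision == expected["decision"],
        "gate_respected": expected["required_gate"] in result.gates,
        "forbidden_hits": [a for a in expected["forbidden"] if a in result.actions],
    }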

What belongs in a golden set

A golden set should cover the workflow, not only the happy path.

| Slice | Minimum content |
| --- | --- |
| Happy path | ordinary successful requests with complete evidence |
| Boundary path | threshold cases, policy exceptions, approval-required actions |
| Missing evidence | no source, stale source, conflicting source, inaccessible source |
| Tool behavior | timeout, retryable error, non-retryable error, malformed result |
| Safety and policy | must-refuse, must-escalate, redact, require approval |
| Memory | correct recall, stale recall, contradiction, no-consent recall |
| User behavior | ambiguous request, correction, hostile text, prompt-injection attempt |
| Regression cases | every production incident that should never recur |

The last row is where many teams fail. Incidents produce retros. They rarely produce durable test cases. A ContextOS-style Improvement Loop treats the incident replay as the durable artifact.

How examples enter the dataset

Do not let every production run become a golden case. That creates volume without judgment.

Use an intake queue:

production trace
  -> sampled candidate row
  -> reviewer labels source and expected behavior
  -> replay fixtures pinned
  -> row accepted into dev/search/release split

The reviewer is deciding whether this case teaches the harness something. A row can be rejected with a reason: duplicate, unclear expected outcome, not representative, policy pending, fixture incomplete.
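
A lightweight way to make that judgment auditable is to record every intake decision with an explicit reason code. The structure below is a sketch; only the reason list comes from the text above.

# Sketch: an intake decision record with explicit rejection reasons. The
# dataclass shape is illustrative.
from dataclasses import dataclass
from typing import Optional

REJECT_REASONS = {
    "duplicate",
    "unclear_expected_outcome",
    "not_representative",
    "policy_pending",
    "fixture_incomplete",
}

@dataclass
class IntakeDecision:
    case_id: str
    reviewer: str
    accepted: bool
    split: Optional[str] = None          # dev / search / release_test when accepted
    reject_reason: Optional[str] = None  # one of REJECT_REASONS when rejected

    def __post_init__(self) -> None:
        if not self.accepted and self.reject_reason not in REJECT_REASONS:
            raise ValueError(f"unknown rejection reason: {self.reject_reason}")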

Operator corrections should have a faster path:

operator correction
  -> FeedbackRecord
  -> replay fixture
  -> dev row immediately
  -> search row after reviewer confirms expected behavior
  -> release row only after dedupe and owner approval

Corrections are high-signal but not automatically correct. Operators can be wrong. The dataset needs provenance so a later correction can supersede an earlier one.
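
Provenance can be as simple as two pointers per row: what this row supersedes, and what later superseded it. A sketch with illustrative field names:

# Sketch: provenance pointers that let a later correction supersede an earlier
# one without deleting history. Field names are assumptions, not ContextOS types.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class RowProvenance:
    case_id: str
    source: str                           # "operator_correction", "incident", "sampled_trace"
    feedback_record_id: Optional[str] = None
    supersedes: Optional[str] = None      # case_id this row replaces
    superseded_by: Optional[str] = None   # case_id that later replaced this row
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def supersede(old: RowProvenance, new: RowProvenance) -> None:
    # The old row stays in the archive for audit; it just stops counting
    # toward coverage or release gating.
    old.superseded_by = new.case_id
    new.supersedes = old.case_id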

Dataset metrics

Dataset quality needs its own scorecard.

| Metric | Why it matters |
| --- | --- |
| Coverage by intent | prevents one high-volume workflow from hiding missing coverage elsewhere |
| Coverage by risk class | ensures destructive and delegated actions are tested |
| Boundary-case ratio | keeps the set from becoming mostly easy traffic |
| Correction incorporation lag | measures how fast production learning becomes test coverage |
| Replayability rate | percent of rows with pinned fixtures and tool transcripts |
| Label disagreement rate | tells whether the expected outcome is ambiguous |
| Staleness age | flags rows whose policies, tools, or source schemas are obsolete |
| Duplicate rate | prevents a large dataset from becoming a repeated dataset |

The best agent teams prune. They remove stale, duplicate, and low-signal rows. Dataset maintenance is part of harness maintenance.
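
Two of those metrics are cheap to compute directly from the rows. The sketch below follows the row schema from earlier; the duplicate heuristic is a deliberately crude assumption.

# Sketch: replayability and duplicate rate computed straight from rows that
# follow the schema above. The duplicate heuristic is intentionally crude.
from collections import Counter

def replayability_rate(rows: list[dict]) -> float:
    required = ("kg_snapshot", "tool_transcripts", "policy_bundle", "tool_manifest")
    pinned = sum(1 for r in rows if all(r.get("fixtures", {}).get(k) for k in required))
    return pinned / len(rows) if rows else 0.0

def duplicate_rate(rows: list[dict]) -> float:
    # Cheap proxy: identical user_message within the same intent.
    keys = Counter((r["intent"], r["input"]["user_message"].strip().lower()) for r in rows)
    dupes = sum(count - 1 for count in keys.values() if count > 1)
    return dupes / len(rows) if rows else 0.0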

The ContextOS version

In ContextOS, the dataset is not floating outside the runtime. It binds to the same artifacts the harness uses in production.

| Dataset field | ContextOS artifact |
| --- | --- |
| intent and risk class | Intent-Task Catalog |
| expected decision | DecisionSpec and DecisionRecord |
| evidence requirements | Context Pack and Knowledge Graph snapshot |
| tool fixtures | ToolEnvelope transcripts |
| policy expectations | policy bundle and approval-mode tiers |
| score dimensions | evaluator suite |
| correction source | FeedbackStore |

That binding matters because it stops the dataset from becoming prose. A release gate can replay a row against a candidate pack, policy, tool manifest, and evaluator suite, then produce a typed verdict.
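
What a typed verdict could look like as a data shape, purely as an assumption about structure rather than the ContextOS type:

# Sketch: a typed release-gate verdict, structured instead of prose. The shape
# is illustrative.
from dataclasses import dataclass, field

@dataclass
class ReleaseVerdict:
    case_id: str
    candidate: str                                                # e.g. pack + policy + tool manifest versions
    passed: bool
    failed_guardrails: list[str] = field(default_factory=list)    # e.g. ["no_approval_bypass"]
    score_deltas: dict[str, float] = field(default_factory=dict)  # evaluator dimension -> delta vs. baseline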

The first week

A small team can build a useful first golden set in a week.

| Day | Work |
| --- | --- |
| 1 | Pick one intent and write the task distribution |
| 2 | Collect 50 ordinary production-like cases |
| 3 | Add 25 boundary cases from policy and operator knowledge |
| 4 | Add 25 failure cases from past incidents or synthetic tool failures |
| 5 | Define row schema, fixtures, and expected outcomes |
| 6 | Split dev/search/release and run the baseline harness |
| 7 | Review the scorecard and fill the biggest missing slice |

Do not wait for a perfect dataset. A rough, owned, replayable dataset beats an ambitious benchmark nobody can operate.

The bar

A dataset is production-grade when:

| Check | Pass line |
| --- | --- |
| It has an owner | a domain owner signs expected outcomes |
| It is sliceable | rows carry intent, risk, source, difficulty, and reason labels |
| It is replayable | rows pin snapshots, policies, tools, and transcripts |
| It is held out | release examples are not used for candidate search |
| It learns | corrections and incidents become rows |
| It ages | stale rows are reviewed, updated, or retired |

Great agent engineering begins here. The dataset is not paperwork around the agent. It is the agent’s operating definition of reality.

