Most weak agent projects start with a prompt.
Most strong agent projects start with a spreadsheet.
Not because spreadsheets are elegant, but because they force the first hard question: what examples define the work? Until a team can answer that, every improvement is anecdotal. The agent looks better on the demo, worse on the long tail, safer on refusals, more expensive on easy cases, and nobody knows which tradeoff actually happened.
Dataset-first agent engineering means the agent is developed against a living task distribution. The dataset is not a benchmark someone downloads once. It is the operating surface that tells the team what success means, where the system is failing, which candidates deserve release, and which production corrections should become future tests.
This is the second piece in the Agent Engineering series. The first, *How Great AI Engineers Build Agents*, argued that strong teams treat the harness like a model. This post is about the first input to that model: the dataset.
## The dataset is the product specification
For normal software, a test suite says “the implementation still satisfies the contract.”
For agents, the dataset says something larger: “this is the work distribution the harness is being optimized for.”
That distribution has to include ordinary requests, ambiguous requests, tool failures, policy boundaries, missing evidence, stale memory, adversarial text, and operator corrections. Otherwise the harness optimizes for the polished center and fails at the edges, which is exactly where production agents get expensive.
| If the dataset lacks | The harness learns |
|---|---|
| Policy denials | to answer when it should refuse |
| Approval gates | to treat risky actions like ordinary tool calls |
| Missing evidence | to improvise unsupported claims |
| Tool failures | to loop, retry, or hallucinate tool results |
| Operator corrections | to repeat mistakes humans already fixed |
| Easy cases | to overfit to exotic failures and hurt normal traffic |
The goal is not a huge dataset. The goal is a representative one that can explain what changed.
## Start with a task distribution
Do not collect random examples. Define the task distribution first.
```yaml
intent: support.refund
decision_key: support.refund.execute
risk_class: delegated_destructive
traffic:
  daily_runs: 18000
  approval_gate_rate: 0.08
  missing_evidence_rate: 0.12
  policy_denial_rate: 0.06
outcomes:
  primary: correct_refund_decision
  guardrails:
    - no_policy_violation
    - no_unsupported_claim
    - no_approval_bypass
    - no_duplicate_refund
```

This file is already an engineering asset. It tells the team what to sample, what to score, and what not to regress.
The task distribution should be owned by the same team that owns the business workflow. An AI team can help structure it, but a support refund dataset without support-ops ownership will drift into artificial examples.
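As a rough sketch, the traffic rates in that file can be turned directly into sampling targets for a golden set. The snippet below assumes the field names from the YAML above; the `sampling_targets` helper and the 200-row budget are illustrative. Proportional allocation is only a floor, since most teams deliberately oversample boundary slices beyond their raw traffic share.

```python
from math import ceil

# The spec mirrors the YAML above, embedded as a dict so the sketch runs standalone.
task_distribution = {
    "intent": "support.refund",
    "traffic": {
        "daily_runs": 18000,
        "approval_gate_rate": 0.08,
        "missing_evidence_rate": 0.12,
        "policy_denial_rate": 0.06,
    },
}

def sampling_targets(spec: dict, budget: int = 200) -> dict:
    """Allocate a golden-set row budget across traffic slices; remainder goes to happy path."""
    rates = {
        key.removesuffix("_rate"): value
        for key, value in spec["traffic"].items()
        if key.endswith("_rate")
    }
    targets = {slice_name: ceil(budget * rate) for slice_name, rate in rates.items()}
    targets["happy_path"] = budget - sum(targets.values())
    return targets

print(sampling_targets(task_distribution))
# {'approval_gate': 16, 'missing_evidence': 24, 'policy_denial': 12, 'happy_path': 148}
```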
## Four dataset splits
A production agent needs more than one test set.
| Split | Who can use it | Purpose |
|---|---|---|
| dev | engineers | local debugging, quick experiments, obvious failure reproduction |
| search | human and automated proposers | candidate generation, prompt tuning, retrieval tuning, planner tuning |
| release_test | release gate only | final held-out regression check |
| shadow_live | runtime sampler | canary comparison against current production traffic |
The most common mistake is using the same examples for all four. That gives fast improvement and false confidence. The team iterates against the same examples until the harness memorizes the shape of the test.
Keep the release set held out. Rotate it on a schedule. Do not let automated proposers inspect it during search. If the release set becomes part of candidate generation, it is no longer a release set.
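One way to keep the split assignment honest is to make it deterministic. The sketch below hashes the `case_id` so a row never wanders between splits; the weights, salt, and helper names are illustrative, and `shadow_live` is sampled from live traffic at runtime rather than assigned here. Rotating the salt rotates the release set on schedule.

```python
import hashlib

# Weights are illustrative. shadow_live is drawn from production traffic at runtime,
# so it is not assigned from the dataset.
SPLIT_WEIGHTS = {"dev": 0.35, "search": 0.45, "release_test": 0.20}

def assign_split(case_id: str, salt: str = "release-2026q2") -> str:
    """Deterministically map a case_id to a split; rotating the salt rotates the release set."""
    digest = hashlib.sha256(f"{salt}:{case_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for split, weight in SPLIT_WEIGHTS.items():
        cumulative += weight
        if bucket < cumulative:
            return split
    return "release_test"  # guards against floating-point rounding at the top of the range

def rows_for_proposers(rows: list) -> list:
    """Human and automated proposers never see release_test rows."""
    return [row for row in rows if assign_split(row["case_id"]) != "release_test"]
```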
## The row schema
A dataset row should be replayable, not just readable.
```json
{
  "case_id": "case_refund_2026_05_12_017",
  "intent": "support.refund",
  "risk_class": "delegated_destructive",
  "input": {
    "user_message": "Can you refund my order? The supplier approved an exception.",
    "tenant_id": "tenant_acme",
    "actor_claims": ["support.agent"]
  },
  "fixtures": {
    "kg_snapshot": "kg_support_2026_05_12",
    "tool_transcripts": "fixtures/refund_exception_017/tool-transcripts.jsonl",
    "policy_bundle": "policy.returns@4.1.0",
    "tool_manifest": "tools.support@3.7.0"
  },
  "expected": {
    "decision": "approve_with_gate",
    "required_evidence_refs": ["order:ord_881", "supplier_policy:exception_22"],
    "required_gate": "GATE_FINANCE_APPROVAL",
    "forbidden": ["issue_refund_without_gate", "claim_policy_without_evidence"]
  },
  "labels": {
    "source": "operator_correction",
    "reason_class": "missing_supplier_exception",
    "difficulty": "boundary",
    "owner": "support_ops"
  }
}
```

The important fields are not only input and expected. The fixtures make replay possible. The labels make slicing possible. The owner makes the row governable.
If a row cannot be replayed, it is still useful for human inspection, but it is weaker as an engineering artifact.
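A replay check does not need to be elaborate. The sketch below assumes a `run_harness` callable (hypothetical signature) that executes against the pinned fixtures and returns the decision it made, the evidence it cited, the actions it took, and the gates it requested; the checks mirror the `expected` block in the row above.

```python
from dataclasses import dataclass, field

@dataclass
class ReplayResult:
    case_id: str
    passed: bool
    failures: list = field(default_factory=list)

def replay_row(row: dict, run_harness) -> ReplayResult:
    """Replay one row against a candidate harness and score it against the expected block."""
    outcome = run_harness(input=row["input"], fixtures=row["fixtures"])
    expected = row["expected"]
    failures = []
    if outcome["decision"] != expected["decision"]:
        failures.append(f"decision {outcome['decision']} != {expected['decision']}")
    missing = set(expected["required_evidence_refs"]) - set(outcome.get("evidence_refs", []))
    if missing:
        failures.append(f"missing evidence refs: {sorted(missing)}")
    forbidden_hit = set(expected["forbidden"]) & set(outcome.get("actions", []))
    if forbidden_hit:
        failures.append(f"forbidden actions taken: {sorted(forbidden_hit)}")
    if expected.get("required_gate") and expected["required_gate"] not in outcome.get("gates", []):
        failures.append(f"gate {expected['required_gate']} not requested")
    return ReplayResult(row["case_id"], passed=not failures, failures=failures)
```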
## What belongs in a golden set
A golden set should cover the workflow, not only the happy path.
| Slice | Minimum content |
|---|---|
| Happy path | ordinary successful requests with complete evidence |
| Boundary path | threshold cases, policy exceptions, approval-required actions |
| Missing evidence | no source, stale source, conflicting source, inaccessible source |
| Tool behavior | timeout, retryable error, non-retryable error, malformed result |
| Safety and policy | must-refuse, must-escalate, redact, require approval |
| Memory | correct recall, stale recall, contradiction, no-consent recall |
| User behavior | ambiguous request, correction, hostile text, prompt-injection attempt |
| Regression cases | every production incident that should never recur |
The last row is where many teams fail. Incidents produce retros. They rarely produce durable test cases. A ContextOS-style Improvement Loop treats the incident replay as the durable artifact.
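A coverage check over these slices can run in CI. The sketch below assumes rows carry a `labels.slice` field and, for regression rows, an `incident_id`; neither is in the schema above, so treat them as illustrative extensions, and the minimum counts are placeholders.

```python
# Minimum counts per slice are illustrative, not a recommendation.
REQUIRED_SLICES = {
    "happy_path": 20,
    "boundary_path": 10,
    "missing_evidence": 5,
    "tool_behavior": 5,
    "safety_policy": 5,
    "memory": 5,
    "user_behavior": 5,
}

def coverage_gaps(rows: list, incident_ids: set) -> list:
    """Return human-readable gaps: under-filled slices and incidents with no regression row."""
    counts: dict = {}
    covered_incidents = set()
    for row in rows:
        slice_name = row["labels"].get("slice", "unlabeled")
        counts[slice_name] = counts.get(slice_name, 0) + 1
        if row["labels"].get("incident_id"):
            covered_incidents.add(row["labels"]["incident_id"])
    gaps = [
        f"slice '{name}' has {counts.get(name, 0)} rows, wants at least {minimum}"
        for name, minimum in REQUIRED_SLICES.items()
        if counts.get(name, 0) < minimum
    ]
    gaps += [f"incident {i} has no regression row" for i in sorted(incident_ids - covered_incidents)]
    return gaps
```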
## How examples enter the dataset
Do not let every production run become a golden case. That creates volume without judgment.
Use an intake queue:
```
production trace
  -> sampled candidate row
  -> reviewer labels source and expected behavior
  -> replay fixtures pinned
  -> row accepted into dev/search/release split
```

The reviewer is deciding whether this case teaches the harness something. A row can be rejected with a reason: duplicate, unclear expected outcome, not representative, policy pending, fixture incomplete.
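Part of that review can be mechanical. The sketch below shows illustrative `fingerprint` and `review` helpers that catch the rejection reasons a script can detect: duplicate, unclear expected outcome, incomplete fixture. Representativeness and pending policy stay human judgments.

```python
import hashlib
import json

def fingerprint(candidate: dict) -> str:
    """Stable hash of intent plus input, used only for crude dedupe."""
    payload = json.dumps(
        {"intent": candidate["intent"], "input": candidate["input"]}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

def review(candidate: dict, seen_fingerprints: set) -> tuple:
    """Return (accepted, reason). Mechanical checks only; a reviewer still signs off."""
    if fingerprint(candidate) in seen_fingerprints:
        return False, "duplicate"
    if not candidate.get("expected", {}).get("decision"):
        return False, "unclear expected outcome"
    if not candidate.get("fixtures", {}).get("tool_transcripts"):
        return False, "fixture incomplete"
    return True, "accepted"
```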
Operator corrections should have a faster path:
```
operator correction
  -> FeedbackRecord
  -> replay fixture
  -> dev row immediately
  -> search row after reviewer confirms expected behavior
  -> release row only after dedupe and owner approval
```

Corrections are high-signal but not automatically correct. Operators can be wrong. The dataset needs provenance so a later correction can supersede an earlier one.
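A sketch of that fast path, assuming a FeedbackRecord-shaped dict; the field names and the `supersedes` mechanism are illustrative, not a fixed schema.

```python
from datetime import datetime, timezone

def correction_to_dev_row(feedback: dict, fixture_path: str) -> dict:
    """Promote an operator correction into a dev row, keeping provenance for later supersession."""
    return {
        "case_id": f"case_{feedback['intent'].replace('.', '_')}_{feedback['feedback_id']}",
        "intent": feedback["intent"],
        "risk_class": feedback["risk_class"],
        "input": feedback["original_input"],
        "fixtures": {"tool_transcripts": fixture_path},
        "expected": feedback["corrected_outcome"],
        "labels": {
            "source": "operator_correction",
            "operator": feedback["operator_id"],
            "recorded_at": datetime.now(timezone.utc).isoformat(),
            # A later correction can point here to replace this row.
            "supersedes": feedback.get("supersedes_case_id"),
            "split": "dev",  # promoted to search/release only after review, dedupe, owner approval
        },
    }
```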
## Dataset metrics
Dataset quality needs its own scorecard.
| Metric | Why it matters |
|---|---|
| Coverage by intent | prevents one high-volume workflow from hiding missing coverage elsewhere |
| Coverage by risk class | ensures destructive and delegated actions are tested |
| Boundary-case ratio | keeps the set from becoming mostly easy traffic |
| Correction incorporation lag | measures how fast production learning becomes test coverage |
| Replayability rate | percent of rows with pinned fixtures and tool transcripts |
| Label disagreement rate | tells whether the expected outcome is ambiguous |
| Staleness age | flags rows whose policies, tools, or source schemas are obsolete |
| Duplicate rate | prevents a large dataset from becoming a repeated dataset |
The best agent teams prune. They remove stale, duplicate, and low-signal rows. Dataset maintenance is part of harness maintenance.
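Two of these checks are easy to automate against the row schema above. The sketch below treats "pinned to a non-current policy bundle" as one possible proxy for staleness; the helpers are illustrative, not a full scorecard.

```python
def replayability_rate(rows: list) -> float:
    """Share of rows with pinned tool transcripts and a pinned policy bundle."""
    if not rows:
        return 0.0
    replayable = sum(
        1
        for row in rows
        if row.get("fixtures", {}).get("tool_transcripts")
        and row.get("fixtures", {}).get("policy_bundle")
    )
    return replayable / len(rows)

def stale_rows(rows: list, current_policy_bundle: str) -> list:
    """Rows pinned to a policy bundle other than the deployed one are review candidates."""
    return [
        row["case_id"]
        for row in rows
        if row.get("fixtures", {}).get("policy_bundle") not in (None, current_policy_bundle)
    ]
```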
## The ContextOS version
In ContextOS, the dataset is not floating outside the runtime. It binds to the same artifacts the harness uses in production.
| Dataset field | ContextOS artifact |
|---|---|
| intent and risk class | Intent-Task Catalog |
| expected decision | DecisionSpec and DecisionRecord |
| evidence requirements | Context Pack and Knowledge Graph snapshot |
| tool fixtures | ToolEnvelope transcripts |
| policy expectations | policy bundle and approval-mode tiers |
| score dimensions | evaluator suite |
| correction source | FeedbackStore |
That binding matters because it stops the dataset from becoming prose. A release gate can replay a row against a candidate pack, policy, tool manifest, and evaluator suite, then produce a typed verdict.
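As a sketch of what a typed verdict could look like; the field names are illustrative rather than the ContextOS schema. The point is that the gate's output is data a pipeline can act on, not prose.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReleaseVerdict:
    case_id: str
    candidate_id: str          # the pack / prompt / tool-manifest combination under test
    policy_bundle: str         # pinned from the row's fixtures, e.g. "policy.returns@4.1.0"
    decision_match: bool
    evidence_complete: bool
    gates_respected: bool
    guardrail_violations: tuple = ()

    @property
    def passed(self) -> bool:
        return (
            self.decision_match
            and self.evidence_complete
            and self.gates_respected
            and not self.guardrail_violations
        )
```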
## The first week
A small team can build a useful first golden set in a week.
| Day | Work |
|---|---|
| 1 | Pick one intent and write the task distribution |
| 2 | Collect 50 ordinary production-like cases |
| 3 | Add 25 boundary cases from policy and operator knowledge |
| 4 | Add 25 failure cases from past incidents or synthetic tool failures |
| 5 | Define row schema, fixtures, and expected outcomes |
| 6 | Split dev/search/release and run the baseline harness |
| 7 | Review the scorecard and fill the biggest missing slice |
Do not wait for a perfect dataset. A rough, owned, replayable dataset beats an ambitious benchmark nobody can operate.
## The bar
A dataset is production-grade when:
| Check | Pass line |
|---|---|
| It has an owner | a domain owner signs expected outcomes |
| It is sliceable | rows carry intent, risk, source, difficulty, and reason labels |
| It is replayable | rows pin snapshots, policies, tools, and transcripts |
| It is held out | release examples are not used for candidate search |
| It learns | corrections and incidents become rows |
| It ages | stale rows are reviewed, updated, or retired |
Great agent engineering begins here. The dataset is not paperwork around the agent. It is the agent’s operating definition of reality.