A final answer can lie about the run that produced it.
The answer may be correct because the model guessed. It may be wrong even though retrieval found the right document. It may be safe only because the tool failed. It may be concise because the harness silently dropped the evidence. If you only grade the final text, you cannot tell which thing happened.
Trace review is the agent debugger.
The trace is the record of what the harness actually did: what context it compiled, what plan it proposed, what tools it called, what guardrails fired, what the Critic accepted, and what final decision shipped. For production agents, the trace is not an observability nice-to-have. It is the unit of diagnosis and improvement.
OpenAI’s tracing docs describe the same shape at the SDK layer: traces collect model generations, tool calls, handoffs, guardrails, and custom events. Trace grading then turns those traces into structured evaluation objects. ContextOS generalizes that idea across the five planes.
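For orientation, the SDK surface looks roughly like this. A minimal sketch using the OpenAI Agents SDK, with the workflow name and metadata chosen for this chapter's running example; check the SDK docs for current signatures:

```python
from agents import Agent, Runner, trace

# One trace wraps the whole run; the SDK collects generations, tool
# calls, handoffs, and guardrail spans under it automatically.
agent = Agent(name="support", instructions="Handle refund requests.")

with trace("support.refund", metadata={"run_id": "run_refund_017"}):
    result = Runner.run_sync(agent, "Refund order #1042")
```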
Why final-answer grading is too shallow
Final-answer grading answers one question: did the output look acceptable?
Agent debugging needs more:
| Failure | Final answer says | Trace says |
|---|---|---|
| Wrong source won retrieval | answer is wrong | evidence manifest admitted stale source |
| Tool failed silently | answer is vague | tool result was timeout but planner continued |
| Policy fired late | answer was blocked | action plan should have been rejected earlier |
| Approval was bypassed | answer looks helpful | destructive tool executed without gate |
| Model guessed correctly | answer is correct | evidence refs are missing |
| Over-retrieval inflated cost | answer is correct | context budget used 4x needed tokens |
If the trace does not exist, the team argues from symptoms. If the trace exists, the team can assign the failure to a plane and fix the harness.
The trace anatomy
A useful agent trace has seven layers.
| Layer | What it records |
|---|---|
| Intake | RunContext, actor, tenant, intent, risk class, budget, trace ID |
| Compile | Context Pack version, evidence manifest, memory manifest, policy manifest, tool manifest, omissions |
| Plan | typed plan, step order, dependencies, tool intents, approval requirements |
| Execute | tool calls, arguments, idempotency keys, tool results, retry metadata |
| Guardrail | policy decisions, schema checks, redactions, approvals, loop guards |
| Critic | verify verdicts, step scores, retry/replan/escalate reasons |
| Record | DecisionRecord, scorecard, replay handle, correction links |
The layer names matter less than the completeness. A trace that stops at “model returned” is not an agent trace. It is a model log.
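As a sketch, the seven layers can travel as one typed record per run. The type and field names below are illustrative, not a ContextOS schema:

```python
from dataclasses import dataclass, field

# The seven layers a complete agent trace must cover.
REQUIRED_LAYERS = {
    "intake", "compile", "plan", "execute", "guardrail", "critic", "record",
}

@dataclass
class TraceLayer:
    name: str  # one of REQUIRED_LAYERS
    attributes: dict = field(default_factory=dict)

@dataclass
class AgentTrace:
    trace_id: str
    run_id: str
    layers: list[TraceLayer] = field(default_factory=list)

    def is_complete(self) -> bool:
        # A trace that stops at "model returned" fails this check.
        return REQUIRED_LAYERS <= {layer.name for layer in self.layers}
```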
Span attributes that pay rent
Every span should carry enough identity to join later.
```json
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "run_id": "run_refund_017",
  "intent": "support.refund",
  "risk_class": "delegated_destructive",
  "pack_version": "ctxpack.support@5.2.0",
  "policy_bundle": "policy.returns@4.1.0",
  "tool_manifest": "tools.support@3.7.0",
  "decision_record_id": "dr_refund_017"
}
```

Then each plane adds its own fields:
| Plane | Span fields |
|---|---|
| Intelligence | kg_snapshot, entity_resolution_confidence, memory_candidate_count |
| Context | evidence_ref_count, budget_tokens_used, omission_count, redaction_count |
| Decision | plan_id, replan_attempt, critic_verdict, loop_guard_state |
| Action | tool_id, approval_mode_effective, idempotency_key, retry_count |
| Trust | policy_decision_id, approval_gate_id, scorecard_id, replay_id |
These are the fields that let an on-call engineer answer “what happened?” without asking the model to explain itself after the fact.
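A minimal joinability check, as a sketch: treat a span as a dict of attributes and require the shared identity keys plus that plane's fields. The key sets mirror the tables above; the names are assumptions:

```python
# Shared identity keys every span must carry, plus per-plane fields.
JOIN_KEYS = {"trace_id", "run_id", "intent", "risk_class", "pack_version"}

PLANE_KEYS = {
    "intelligence": {"kg_snapshot", "entity_resolution_confidence"},
    "context": {"evidence_ref_count", "budget_tokens_used", "omission_count"},
    "decision": {"plan_id", "replan_attempt", "critic_verdict"},
    "action": {"tool_id", "idempotency_key", "retry_count"},
    "trust": {"policy_decision_id", "scorecard_id", "replay_id"},
}

def missing_span_fields(span: dict, plane: str) -> set[str]:
    """Fields a span still needs before it can be joined and graded."""
    return (JOIN_KEYS | PLANE_KEYS.get(plane, set())) - span.keys()
```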
Trace grading
Trace grading means scoring the path, not just the output.
Examples:
| Grader | Checks |
|---|---|
| Evidence grader | final material claims are supported by evidence refs |
| Tool-use grader | right tool called with valid args and no forbidden tools |
| Policy grader | policy decision matched the expected allow/deny/gate |
| Plan grader | plan included required verification before action |
| Recovery grader | tool failure produced retry, replan, or escalation |
| Cost grader | context and tool usage stayed within target band |
The best graders are mixed. Some are deterministic rules. Some are schema checks. Some are LLM-as-judge with a pinned rubric. Some require human review. Do not force every concern through the same evaluator style.
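For example, the evidence grader can be a pure rule over the trace. A minimal sketch, assuming the compile layer recorded the evidence manifest and the record layer exposes structured final claims; all field names are assumptions:

```python
def grade_evidence(trace: dict) -> dict:
    """Deterministic check: every material claim cites admitted evidence."""
    manifest = {e["ref"] for e in trace["compile"]["evidence_manifest"]}
    claims = trace["record"]["final_claims"]
    unsupported = [
        claim["text"]
        for claim in claims
        if not set(claim.get("evidence_refs", [])) & manifest
    ]
    return {
        "grader": "evidence",
        "passed": not unsupported,
        "unsupported_claims": unsupported,
    }
```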
Trace review workflow
A weekly trace review should be short and mechanical.
- Sample runs by intent, risk class, scorecard failure, correction, and high cost.
- Pick five traces that changed after the last release.
- For each trace, assign the failure to a plane.
- Convert the recurring failures into dataset rows or proposals.
- Close with owners and release gates, not discussion notes.
The working table:
| Trace | Symptom | Plane | Harness fix |
|---|---|---|---|
| trace_017 | denied supplier exception | Context | retrieve supplier exception before default refund window |
| trace_044 | duplicate retry risk | Action | require idempotency key before retry |
| trace_081 | replan loop | Decision | lower re-plan budget and escalate on repeated tool timeout |
| trace_103 | unsupported claim | Trust | tighten evidence grader |
The point is not to admire traces. The point is to turn them into bounded changes.
What a good trace reveals
Consider a failed refund run.
The final answer:
```
I cannot refund this order because it is outside the 90-day refund window.
```

Final-answer grading says: wrong.
Trace review says:
```
Context compile:
  order evidence included
  default refund policy included
  supplier exception source omitted due to max_hops=2
Plan:
  s1 orders.lookup
  s2 policy.eval
  s3 deny
Tool:
  orders.lookup returned order age=104 days
Critic:
  accepted denial because supplier exception evidence was absent
Correction:
  operator supplied supplier_policy.exception_window
```

Now the fix is clear. This is not a better-refusal prompt problem. It is a context retrieval and evidence coverage problem. The right candidate is a Context Pack change or retrieval rule, not a model upgrade.
The trace-to-proposal bridge
A trace should be able to generate a proposal skeleton.
```json
{
  "proposal_id": "tp_trace_017",
  "source_trace": "trace_017",
  "failure_plane": "context",
  "failure_class": "missing_required_evidence",
  "candidate_change": {
    "target": "ctxpack.support@5.2.0",
    "surface": "retrieval",
    "patch": {
      "supplier_policy.max_hops": { "from": 2, "to": 3 }
    }
  },
  "replay_required": ["case_refund_supplier_exception_017"],
  "guardrails": ["policy==1.0", "safety==1.0"]
}
```

Humans can edit this. Autotune can generate variants. The release gate can replay it. The important point is that the trace produced an engineering artifact, not just a note.
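A sketch of that generation step, assuming the reviewer has already assigned the plane, failure class, surface, and patch; everything else is copied off the trace. Names mirror the JSON above and are illustrative:

```python
def proposal_from_trace(trace: dict, plane: str, failure_class: str,
                        surface: str, patch: dict) -> dict:
    """Build a proposal skeleton from one reviewed trace."""
    return {
        "proposal_id": f"tp_{trace['trace_id']}",
        "source_trace": trace["trace_id"],
        "failure_plane": plane,
        "failure_class": failure_class,
        "candidate_change": {
            "target": trace["compile"]["pack_version"],
            "surface": surface,
            "patch": patch,
        },
        "replay_required": trace["record"].get("replay_cases", []),
        "guardrails": ["policy==1.0", "safety==1.0"],
    }
```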
Sampling strategy
Do not review only failures. Review:
| Sample | Why |
|---|---|
| scorecard failures | obvious improvement candidates |
| operator corrections | high-signal supervised examples |
| expensive successes | cost and latency optimization |
| approval-gate runs | high-risk path verification |
| changed-by-candidate runs | release review |
| random successful runs | drift and blind-spot detection |
If the team only reviews failures, it misses quiet cost regressions and lucky successes. If it only reviews random runs, it misses rare high-risk paths. Sampling should be stratified.
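A minimal stratified sampler, as a sketch. It assumes each run record carries stratum tags matching the table above; the quotas are per-review-session and purely illustrative:

```python
import random

QUOTAS = {
    "scorecard_failure": 5,
    "operator_correction": 5,
    "expensive_success": 3,
    "approval_gate": 3,
    "changed_by_candidate": 5,
    "random_success": 4,
}

def sample_for_review(runs: list[dict], seed: int = 0) -> list[dict]:
    """Draw a fixed quota from each stratum instead of one global sample."""
    rng = random.Random(seed)
    picked = []
    for stratum, quota in QUOTAS.items():
        pool = [r for r in runs if stratum in r.get("tags", ())]
        picked.extend(rng.sample(pool, min(quota, len(pool))))
    return picked
```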
Privacy and retention
Trace review can expose sensitive data. Treat retention as part of the harness.
| Control | Requirement |
|---|---|
| Redaction | sensitive fields redacted or tokenized before broad access |
| Retention bands | longer retention for replay metadata, shorter for raw sensitive payloads |
| Access control | trace access scoped by tenant, role, and incident need |
| Replay fixtures | tool transcripts preserve behavior without exposing unnecessary raw data |
| Audit | trace reads are themselves auditable |
Do not solve privacy by deleting the trace shape. Solve it by separating payload, metadata, replay fixtures, and access.
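A sketch of that payload/metadata split, assuming the sensitive field names are known up front (a real system would drive this from a schema or classifier): raw values go to a short-retention store while spans keep stable tokens that still join and diff:

```python
import hashlib

# Assumed-known sensitive fields; illustrative only.
SENSITIVE_FIELDS = {"customer_email", "card_last4", "shipping_address"}

def redact_span(span: dict, salt: bytes) -> tuple[dict, dict]:
    """Split a span into (redacted_metadata, sensitive_payload).

    The redacted copy keeps stable tokens so traces still join and diff;
    the raw payload goes to a shorter-retention, tighter-access store.
    """
    redacted, payload = dict(span), {}
    for name in SENSITIVE_FIELDS & span.keys():
        digest = hashlib.sha256(salt + str(span[name]).encode()).hexdigest()
        redacted[name] = f"tok_{digest[:12]}"
        payload[name] = span[name]
    return redacted, payload
```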
The ContextOS version
In ContextOS, every production-grade trace should connect the canonical execution artifacts:
```
trace_id
  -> RunContext
  -> CompiledContext
  -> Plan
  -> ToolEnvelope[]
  -> PolicyDecision[]
  -> CriticVerdict[]
  -> Scorecard
  -> DecisionRecord
  -> ReplayHandle
```

That chain gives the Improvement Loop its raw material. The Insight Synthesizer clusters failures. The Strategy Compiler turns patterns into proposals. Autotune searches bounded variants. The release gate replays the candidate. None of that works from final answers alone.
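A sketch of walking that chain, assuming each artifact store is a mapping keyed by trace_id; store names are illustrative, not a ContextOS API:

```python
ARTIFACTS = [
    "run_context", "compiled_context", "plan", "tool_envelopes",
    "policy_decisions", "critic_verdicts", "scorecard",
    "decision_record", "replay_handle",
]

def assemble_chain(trace_id: str, stores: dict) -> dict:
    """Join the canonical execution artifacts for one run."""
    return {name: stores[name].get(trace_id) for name in ARTIFACTS}
```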
The bar
Trace review is production-grade when:
| Check | Pass line |
|---|---|
| End-to-end | spans cover compile, plan, execute, guardrail, critic, record |
| Joinable | every span carries run, intent, version, and trace IDs |
| Gradable | trace-level graders can score path correctness |
| Replayable | tool transcripts can substitute for live calls |
| Safe | sensitive payloads are redacted, scoped, or retained by policy |
| Actionable | failures become dataset rows or proposals |
Great agent debugging does not ask the model what went wrong. It reads the trace.