A final answer can lie about the run that produced it.
The answer may be correct because the model guessed. It may be wrong even though retrieval found the right document. It may be safe only because the tool failed. It may be concise because the harness silently dropped the evidence. If you only grade the final text, you cannot tell which thing happened.
Trace review is the agent debugger.
The trace is the record of what the harness actually did: what context it compiled, what plan it proposed, what tools it called, what guardrails fired, what the Critic accepted, and what final decision shipped. For production agents, the trace is not an observability nice-to-have. It is the unit of diagnosis and improvement.
OpenAI’s tracing docs describe the same shape at the SDK layer: traces collect model generations, tool calls, handoffs, guardrails, and custom events. Trace grading then turns those traces into structured evaluation objects. ContextOS generalizes that idea across the five planes.
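For orientation, the SDK surface looks roughly like this. A minimal sketch using the OpenAI Agents SDK, with the workflow name and metadata chosen for this chapter's running example; check the SDK docs for current signatures:

```python
from agents import Agent, Runner, trace

# One trace wraps the whole run; the SDK collects generations, tool
# calls, handoffs, and guardrail spans under it automatically.
agent = Agent(name="support", instructions="Handle refund requests.")

with trace("support.refund", metadata={"run_id": "run_refund_017"}):
    result = Runner.run_sync(agent, "Refund order #1042")
```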
Why final-answer grading is too shallow
Final-answer grading answers one question: did the output look acceptable?
Agent debugging needs more:
| Failure | Final answer says | Trace says |
|---|---|---|
| Wrong source won retrieval | answer is wrong | evidence manifest admitted stale source |
| Tool failed silently | answer is vague | tool result was timeout but planner continued |
| Policy fired late | answer was blocked | action plan should have been rejected earlier |
| Approval was bypassed | answer looks helpful | destructive tool executed without gate |
| Model guessed correctly | answer is correct | evidence refs are missing |
| Over-retrieval inflated cost | answer is correct | context budget used 4x needed tokens |
If the trace does not exist, the team argues from symptoms. If the trace exists, the team can assign the failure to a plane and fix the harness.
The trace anatomy
A useful agent trace has seven layers.
| Layer | What it records |
|---|---|
| Intake | RunContext, actor, tenant, intent, risk class, budget, trace ID |
| Compile | Context Pack version, evidence manifest, memory manifest, policy manifest, tool manifest, omissions |
| Plan | typed plan, step order, dependencies, tool intents, approval requirements |
| Execute | tool calls, arguments, idempotency keys, tool results, retry metadata |
| Guardrail | policy decisions, schema checks, redactions, approvals, loop guards |
| Critic | verify verdicts, step scores, retry/replan/escalate reasons |
| Record | DecisionRecord, scorecard, replay handle, correction links |
The layer names matter less than the completeness. A trace that stops at “model returned” is not an agent trace. It is a model log.
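As a sketch, the seven layers can travel as one typed record per run. The type and field names below are illustrative, not a ContextOS schema:

```python
from dataclasses import dataclass, field

# The seven layers a complete agent trace must cover.
REQUIRED_LAYERS = {
    "intake", "compile", "plan", "execute", "guardrail", "critic", "record",
}

@dataclass
class TraceLayer:
    name: str  # one of REQUIRED_LAYERS
    attributes: dict = field(default_factory=dict)

@dataclass
class AgentTrace:
    trace_id: str
    run_id: str
    layers: list[TraceLayer] = field(default_factory=list)

    def is_complete(self) -> bool:
        # A trace that stops at "model returned" fails this check.
        return REQUIRED_LAYERS <= {layer.name for layer in self.layers}
```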
Span attributes that pay rent
Every span should carry enough identity to join later.
```json
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "run_id": "run_refund_017",
  "intent": "support.refund",
  "risk_class": "delegated_destructive",
  "pack_version": "ctxpack.support@5.2.0",
  "policy_bundle": "policy.returns@4.1.0",
  "tool_manifest": "tools.support@3.7.0",
  "decision_record_id": "dr_refund_017"
}
```

Then each plane adds its own fields:
| Plane | Span fields |
|---|---|
| Intelligence | kg_snapshot, entity_resolution_confidence, memory_candidate_count |
| Context | evidence_ref_count, budget_tokens_used, omission_count, redaction_count |
| Decision | plan_id, replan_attempt, critic_verdict, loop_guard_state |
| Action | tool_id, approval_mode_effective, idempotency_key, retry_count |
| Trust | policy_decision_id, approval_gate_id, scorecard_id, replay_id |
These are the fields that let an on-call engineer answer “what happened?” without asking the model to explain itself after the fact.
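A minimal joinability check, as a sketch: treat a span as a dict of attributes and require the shared identity keys plus that plane's fields. The key sets mirror the tables above; the names are assumptions:

```python
# Shared identity keys every span must carry, plus per-plane fields.
JOIN_KEYS = {"trace_id", "run_id", "intent", "risk_class", "pack_version"}

PLANE_KEYS = {
    "intelligence": {"kg_snapshot", "entity_resolution_confidence"},
    "context": {"evidence_ref_count", "budget_tokens_used", "omission_count"},
    "decision": {"plan_id", "replan_attempt", "critic_verdict"},
    "action": {"tool_id", "idempotency_key", "retry_count"},
    "trust": {"policy_decision_id", "scorecard_id", "replay_id"},
}

def missing_span_fields(span: dict, plane: str) -> set[str]:
    """Fields a span still needs before it can be joined and graded."""
    return (JOIN_KEYS | PLANE_KEYS.get(plane, set())) - span.keys()
```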
Trace grading
Trace grading means scoring the path, not just the output.
Examples:
| Grader | Checks |
|---|---|
| Evidence grader | final material claims are supported by evidence refs |
| Tool-use grader | right tool called with valid args and no forbidden tools |
| Policy grader | policy decision matched the expected allow/deny/gate |
| Plan grader | plan included required verification before action |
| Recovery grader | tool failure produced retry, replan, or escalation |
| Cost grader | context and tool usage stayed within target band |
The best graders are mixed. Some are deterministic rules. Some are schema checks. Some are LLM-as-judge with a pinned rubric. Some require human review. Do not force every concern through the same evaluator style.
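For example, the evidence grader can be a pure rule over the trace. A minimal sketch, assuming the compile layer recorded the evidence manifest and the record layer exposes structured final claims; all field names are assumptions:

```python
def grade_evidence(trace: dict) -> dict:
    """Deterministic check: every material claim cites admitted evidence."""
    manifest = {e["ref"] for e in trace["compile"]["evidence_manifest"]}
    claims = trace["record"]["final_claims"]
    unsupported = [
        claim["text"]
        for claim in claims
        if not set(claim.get("evidence_refs", [])) & manifest
    ]
    return {
        "grader": "evidence",
        "passed": not unsupported,
        "unsupported_claims": unsupported,
    }
```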
Trace review workflow
A weekly trace review should be short and mechanical.
- Sample runs by intent, risk class, scorecard failure, correction, and high cost.
- Pick five traces that changed after the last release.
- For each trace, assign the failure to a plane.
- Convert the recurring failures into dataset rows or proposals.
- Close with owners and release gates, not discussion notes.
The working table:
| Trace | Symptom | Plane | Harness fix |
|---|---|---|---|
| trace_017 | denied supplier exception | Context | retrieve supplier exception before default refund window |
| trace_044 | duplicate retry risk | Action | require idempotency key before retry |
| trace_081 | replan loop | Decision | lower re-plan budget and escalate on repeated tool timeout |
| trace_103 | unsupported claim | Trust | tighten evidence grader |
The point is not to admire traces. The point is to turn them into bounded changes.
What a good trace reveals
Consider a failed refund run.
The final answer:
```
I cannot refund this order because it is outside the 90-day refund window.
```

Final-answer grading says: wrong.
Trace review says:
```
Context compile:
  order evidence included
  default refund policy included
  supplier exception source omitted due to max_hops=2
Plan:
  s1 orders.lookup
  s2 policy.eval
  s3 deny
Tool:
  orders.lookup returned order age=104 days
Critic:
  accepted denial because supplier exception evidence was absent
Correction:
  operator supplied supplier_policy.exception_window
```

Now the fix is clear. This is not a better-refusal prompt problem. It is a context retrieval and evidence coverage problem. The right candidate is a Context Pack change or retrieval rule, not a model upgrade.
The trace-to-proposal bridge
A trace should be able to generate a proposal skeleton.
```json
{
  "proposal_id": "tp_trace_017",
  "source_trace": "trace_017",
  "failure_plane": "context",
  "failure_class": "missing_required_evidence",
  "candidate_change": {
    "target": "ctxpack.support@5.2.0",
    "surface": "retrieval",
    "patch": {
      "supplier_policy.max_hops": { "from": 2, "to": 3 }
    }
  },
  "replay_required": ["case_refund_supplier_exception_017"],
  "guardrails": ["policy==1.0", "safety==1.0"]
}
```

Humans can edit this. Autotune can generate variants. The release gate can replay it. The important point is that the trace produced an engineering artifact, not just a note.
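A sketch of that generation step, assuming the reviewer has already assigned the plane, failure class, surface, and patch; everything else is copied off the trace. Names mirror the JSON above and are illustrative:

```python
def proposal_from_trace(trace: dict, plane: str, failure_class: str,
                        surface: str, patch: dict) -> dict:
    """Build a proposal skeleton from one reviewed trace."""
    return {
        "proposal_id": f"tp_{trace['trace_id']}",
        "source_trace": trace["trace_id"],
        "failure_plane": plane,
        "failure_class": failure_class,
        "candidate_change": {
            "target": trace["compile"]["pack_version"],
            "surface": surface,
            "patch": patch,
        },
        "replay_required": trace["record"].get("replay_cases", []),
        "guardrails": ["policy==1.0", "safety==1.0"],
    }
```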
Sampling strategy
Do not review only failures. Review:
| Sample | Why |
|---|---|
| scorecard failures | obvious improvement candidates |
| operator corrections | high-signal supervised examples |
| expensive successes | cost and latency optimization |
| approval-gate runs | high-risk path verification |
| changed-by-candidate runs | release review |
| random successful runs | drift and blind-spot detection |
If the team only reviews failures, it misses quiet cost regressions and lucky successes. If it only reviews random runs, it misses rare high-risk paths. Sampling should be stratified.
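A minimal stratified sampler, as a sketch. It assumes each run record carries stratum tags matching the table above; the quotas are per-review-session and purely illustrative:

```python
import random

QUOTAS = {
    "scorecard_failure": 5,
    "operator_correction": 5,
    "expensive_success": 3,
    "approval_gate": 3,
    "changed_by_candidate": 5,
    "random_success": 4,
}

def sample_for_review(runs: list[dict], seed: int = 0) -> list[dict]:
    """Draw a fixed quota from each stratum instead of one global sample."""
    rng = random.Random(seed)
    picked = []
    for stratum, quota in QUOTAS.items():
        pool = [r for r in runs if stratum in r.get("tags", ())]
        picked.extend(rng.sample(pool, min(quota, len(pool))))
    return picked
```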
Privacy and retention
Trace review can expose sensitive data. Treat retention as part of the harness.
| Control | Requirement |
|---|---|
| Redaction | sensitive fields redacted or tokenized before broad access |
| Retention bands | longer retention for replay metadata, shorter for raw sensitive payloads |
| Access control | trace access scoped by tenant, role, and incident need |
| Replay fixtures | tool transcripts preserve behavior without exposing unnecessary raw data |
| Audit | trace reads are themselves auditable |
Do not solve privacy by deleting the trace shape. Solve it by separating payload, metadata, replay fixtures, and access.
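A sketch of that payload/metadata split, assuming the sensitive field names are known up front (a real system would drive this from a schema or classifier): raw values go to a short-retention store while spans keep stable tokens that still join and diff:

```python
import hashlib

# Assumed-known sensitive fields; illustrative only.
SENSITIVE_FIELDS = {"customer_email", "card_last4", "shipping_address"}

def redact_span(span: dict, salt: bytes) -> tuple[dict, dict]:
    """Split a span into (redacted_metadata, sensitive_payload).

    The redacted copy keeps stable tokens so traces still join and diff;
    the raw payload goes to a shorter-retention, tighter-access store.
    """
    redacted, payload = dict(span), {}
    for name in SENSITIVE_FIELDS & span.keys():
        digest = hashlib.sha256(salt + str(span[name]).encode()).hexdigest()
        redacted[name] = f"tok_{digest[:12]}"
        payload[name] = span[name]
    return redacted, payload
```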
The ContextOS version
In ContextOS, every production-grade trace should connect the canonical execution artifacts:
```
trace_id
  -> RunContext
  -> CompiledContext
  -> Plan
  -> ToolEnvelope[]
  -> PolicyDecision[]
  -> CriticVerdict[]
  -> Scorecard
  -> DecisionRecord
  -> ReplayHandle
```

That chain gives the Improvement Loop its raw material. The Insight Synthesizer clusters failures. The Strategy Compiler turns patterns into proposals. Autotune searches bounded variants. The release gate replays the candidate. None of that works from final answers alone.
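A sketch of walking that chain, assuming each artifact store is a mapping keyed by trace_id; store names are illustrative, not a ContextOS API:

```python
ARTIFACTS = [
    "run_context", "compiled_context", "plan", "tool_envelopes",
    "policy_decisions", "critic_verdicts", "scorecard",
    "decision_record", "replay_handle",
]

def assemble_chain(trace_id: str, stores: dict) -> dict:
    """Join the canonical execution artifacts for one run."""
    return {name: stores[name].get(trace_id) for name in ARTIFACTS}
```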
The bar
Trace review is production-grade when:
| Check | Pass line |
|---|---|
| End-to-end | spans cover compile, plan, execute, guardrail, critic, record |
| Joinable | every span carries run, intent, version, and trace IDs |
| Gradable | trace-level graders can score path correctness |
| Replayable | tool transcripts can substitute for live calls |
| Safe | sensitive payloads are redacted, scoped, or retained by policy |
| Actionable | failures become dataset rows or proposals |
Great agent debugging does not ask the model what went wrong. It reads the trace.