Agent engineering series
May 12, 2026
by Piyush · 6 min read

Trace Review Is the Agent Debugger: Grade the Path, Not Just the Answer

ContextOS
AI Engineering
Tracing
Evaluation
Agents

A final answer can lie about the run that produced it.

The answer may be correct because the model guessed. It may be wrong even though retrieval found the right document. It may be safe only because the tool failed. It may be concise because the harness silently dropped the evidence. If you only grade the final text, you cannot tell which thing happened.

Trace review is the agent debugger.

The trace is the record of what the harness actually did: what context it compiled, what plan it proposed, what tools it called, what guardrails fired, what the Critic accepted, and what final decision shipped. For production agents, the trace is not an observability nice-to-have. It is the unit of diagnosis and improvement.

OpenAI’s tracing docs describe the same shape at the SDK layer: traces collect model generations, tool calls, handoffs, guardrails, and custom events. Trace grading then turns those traces into structured evaluation objects. ContextOS generalizes that idea across the five planes.

Why final-answer grading is too shallow

Final-answer grading answers one question: did the output look acceptable?

Agent debugging needs more:

| Failure | Final answer says | Trace says |
|---|---|---|
| Wrong source won retrieval | answer is wrong | evidence manifest admitted stale source |
| Tool failed silently | answer is vague | tool result was timeout but planner continued |
| Policy fired late | answer was blocked | action plan should have been rejected earlier |
| Approval was bypassed | answer looks helpful | destructive tool executed without gate |
| Model guessed correctly | answer is correct | evidence refs are missing |
| Over-retrieval inflated cost | answer is correct | context budget used 4x needed tokens |

If the trace does not exist, the team argues from symptoms. If the trace exists, the team can assign the failure to a plane and fix the harness.

The trace anatomy

A useful agent trace has seven layers.

| Layer | What it records |
|---|---|
| Intake | RunContext, actor, tenant, intent, risk class, budget, trace ID |
| Compile | Context Pack version, evidence manifest, memory manifest, policy manifest, tool manifest, omissions |
| Plan | typed plan, step order, dependencies, tool intents, approval requirements |
| Execute | tool calls, arguments, idempotency keys, tool results, retry metadata |
| Guardrail | policy decisions, schema checks, redactions, approvals, loop guards |
| Critic | verify verdicts, step scores, retry/replan/escalate reasons |
| Record | DecisionRecord, scorecard, replay handle, correction links |

The layer names matter less than the completeness. A trace that stops at “model returned” is not an agent trace. It is a model log.
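For concreteness, the seven layers can be collected into one record per run. The sketch below is a minimal illustration under assumed field shapes, not a ContextOS schema; the `AgentTrace` name and its method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentTrace:
    """One record per run; layer names mirror the table above (field shapes are assumptions)."""
    trace_id: str
    intake: dict[str, Any]              # RunContext, actor, tenant, intent, risk class, budget
    compile: dict[str, Any]             # Context Pack version, manifests, omissions
    plan: dict[str, Any]                # typed plan, step order, tool intents, approvals
    execute: list[dict[str, Any]] = field(default_factory=list)    # tool calls, results, retries
    guardrail: list[dict[str, Any]] = field(default_factory=list)  # policy decisions, approvals
    critic: list[dict[str, Any]] = field(default_factory=list)     # verdicts, retry/replan reasons
    record: dict[str, Any] = field(default_factory=dict)           # DecisionRecord, scorecard, replay handle

    def is_model_log_only(self) -> bool:
        """True when the trace stops at 'model returned' instead of covering the full path."""
        return not (self.execute or self.guardrail or self.critic or self.record)
```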

Span attributes that pay rent

Every span should carry enough identifying fields to join on later.

{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "run_id": "run_refund_017",
  "intent": "support.refund",
  "risk_class": "delegated_destructive",
  "pack_version": "ctxpack.support@5.2.0",
  "policy_bundle": "policy.returns@4.1.0",
  "tool_manifest": "tools.support@3.7.0",
  "decision_record_id": "dr_refund_017"
}

Then each plane adds its own fields:

| Plane | Span fields |
|---|---|
| Intelligence | kg_snapshot, entity_resolution_confidence, memory_candidate_count |
| Context | evidence_ref_count, budget_tokens_used, omission_count, redaction_count |
| Decision | plan_id, replan_attempt, critic_verdict, loop_guard_state |
| Action | tool_id, approval_mode_effective, idempotency_key, retry_count |
| Trust | policy_decision_id, approval_gate_id, scorecard_id, replay_id |

These are the fields that let an on-call engineer answer “what happened?” without asking the model to explain itself after the fact.
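One way to carry those fields is as span attributes, set at the moment each plane does its work. The sketch below uses the OpenTelemetry Python API; the span name, tracer name, and attribute values are illustrative, while the field names come from the tables above.

```python
from opentelemetry import trace

tracer = trace.get_tracer("contextos.agent")

# Identity fields stamped onto every span so graders can join on them later.
RUN_IDENTITY = {
    "run_id": "run_refund_017",
    "intent": "support.refund",
    "risk_class": "delegated_destructive",
    "pack_version": "ctxpack.support@5.2.0",
}

# Context plane: record what the compiler actually packed and what it omitted.
with tracer.start_as_current_span("context.compile") as span:
    for key, value in RUN_IDENTITY.items():
        span.set_attribute(key, value)
    span.set_attribute("evidence_ref_count", 12)
    span.set_attribute("budget_tokens_used", 5400)
    span.set_attribute("omission_count", 1)
```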

Trace grading

Trace grading means scoring the path, not just the output.

Examples:

| Grader | Checks |
|---|---|
| Evidence grader | final material claims are supported by evidence refs |
| Tool-use grader | right tool called with valid args and no forbidden tools |
| Policy grader | policy decision matched the expected allow/deny/gate |
| Plan grader | plan included required verification before action |
| Recovery grader | tool failure produced retry, replan, or escalation |
| Cost grader | context and tool usage stayed within target band |

The best graders are mixed. Some are deterministic rules. Some are schema checks. Some are LLM-as-judge with a pinned rubric. Some require human review. Do not force every concern through the same evaluator style.
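As a sketch of the deterministic end of that spectrum, a tool-use grader can be a pure function over the execute layer of a trace. The trace shape and field names here are assumptions for illustration, not a fixed schema.

```python
def grade_tool_use(trace: dict, allowed_tools: set[str]) -> dict:
    """Deterministic grader: right tools, valid args, failures handled (illustrative trace shape)."""
    failures = []
    for call in trace.get("execute", []):
        if call["tool_id"] not in allowed_tools:
            failures.append(f"forbidden tool: {call['tool_id']}")
        if call.get("args_schema_valid") is False:
            failures.append(f"invalid args for {call['tool_id']}")
        if call.get("result", {}).get("status") == "timeout" and not call.get("retried"):
            failures.append(f"timeout not handled for {call['tool_id']}")
    return {"grader": "tool_use", "pass": not failures, "failures": failures}
```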

Trace review workflow

A weekly trace review should be short and mechanical.

  1. Sample runs by intent, risk class, scorecard failure, correction, and high cost.
  2. Pick five traces that changed after the last release.
  3. For each trace, assign the failure to a plane.
  4. Convert the recurring failures into dataset rows or proposals.
  5. Close with owners and release gates, not discussion notes.

The working table:

| Trace | Symptom | Plane | Harness fix |
|---|---|---|---|
| trace_017 | denied supplier exception | Context | retrieve supplier exception before default refund window |
| trace_044 | duplicate retry risk | Action | require idempotency key before retry |
| trace_081 | replan loop | Decision | lower re-plan budget and escalate on repeated tool timeout |
| trace_103 | unsupported claim | Trust | tighten evidence grader |

The point is not to admire traces. The point is to turn them into bounded changes.

What a good trace reveals

Consider a failed refund run.

The final answer:

I cannot refund this order because it is outside the 90-day refund window.

Final-answer grading says: wrong.

Trace review says:

Context compile:
  order evidence included
  default refund policy included
  supplier exception source omitted due to max_hops=2

Plan:
  s1 orders.lookup
  s2 policy.eval
  s3 deny

Tool:
  orders.lookup returned order age=104 days

Critic:
  accepted denial because supplier exception evidence was absent

Correction:
  operator supplied supplier_policy.exception_window

Now the fix is clear. This is not a better-refusal prompt problem. It is a context retrieval and evidence coverage problem. The right candidate is a Context Pack change or retrieval rule, not a model upgrade.

The trace-to-proposal bridge

A trace should be able to generate a proposal skeleton.

{
  "proposal_id": "tp_trace_017",
  "source_trace": "trace_017",
  "failure_plane": "context",
  "failure_class": "missing_required_evidence",
  "candidate_change": {
    "target": "ctxpack.support@5.2.0",
    "surface": "retrieval",
    "patch": {
      "supplier_policy.max_hops": { "from": 2, "to": 3 }
    }
  },
  "replay_required": ["case_refund_supplier_exception_017"],
  "guardrails": ["policy==1.0", "safety==1.0"]
}

Humans can edit this. Autotune can generate variants. The release gate can replay it. The important point is that the trace produced an engineering artifact, not just a note.
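A small transform is enough to mechanize that bridge. The sketch below assumes the reviewed trace already carries a classified failure from the graders; the field names mirror the JSON above, but the function itself and the failure-record shape are illustrative.

```python
def proposal_from_trace(trace: dict, failure: dict) -> dict:
    """Turn a reviewed trace plus its classified failure into a proposal skeleton (illustrative)."""
    return {
        "proposal_id": f"tp_{trace['trace_id']}",
        "source_trace": trace["trace_id"],
        "failure_plane": failure["plane"],
        "failure_class": failure["failure_class"],
        "candidate_change": {
            "target": trace["pack_version"],
            "surface": failure.get("surface", "retrieval"),
            "patch": failure.get("suggested_patch", {}),
        },
        "replay_required": failure.get("replay_cases", []),
        "guardrails": ["policy==1.0", "safety==1.0"],
    }
```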

Sampling strategy

Do not review only failures. Review:

| Sample | Why |
|---|---|
| scorecard failures | obvious improvement candidates |
| operator corrections | high-signal supervised examples |
| expensive successes | cost and latency optimization |
| approval-gate runs | high-risk path verification |
| changed-by-candidate runs | release review |
| random successful runs | drift and blind-spot detection |

If the team only reviews failures, it misses quiet cost regressions and lucky successes. If it only reviews random runs, it misses rare high-risk paths. Sampling should be stratified.
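A minimal stratified sampler can look like the sketch below, assuming each run record already carries the scorecard, correction, cost, and approval flags the strata need; the record keys and thresholds are hypothetical.

```python
import random

STRATA = {
    "scorecard_failure": lambda r: not r["scorecard_pass"],
    "operator_correction": lambda r: r["corrected"],
    "expensive_success": lambda r: r["scorecard_pass"] and r["cost_usd"] > 1.0,
    "approval_gate": lambda r: r["approval_gate_fired"],
    "changed_by_candidate": lambda r: r["candidate_id"] is not None,
    "random_success": lambda r: r["scorecard_pass"],
}

def sample_for_review(runs: list[dict], per_stratum: int = 5) -> dict[str, list[dict]]:
    """Stratified sample so rare high-risk paths are not drowned out by random successes."""
    sample = {}
    for name, predicate in STRATA.items():
        matches = [r for r in runs if predicate(r)]
        sample[name] = random.sample(matches, min(per_stratum, len(matches)))
    return sample
```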

Privacy and retention

Trace review can expose sensitive data. Treat retention as part of the harness.

| Control | Requirement |
|---|---|
| Redaction | sensitive fields redacted or tokenized before broad access |
| Retention bands | longer retention for replay metadata, shorter for raw sensitive payloads |
| Access control | trace access scoped by tenant, role, and incident need |
| Replay fixtures | tool transcripts preserve behavior without exposing unnecessary raw data |
| Audit | trace reads are themselves auditable |

Do not solve privacy by deleting the trace shape. Solve it by separating payload, metadata, replay fixtures, and access.
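One way to make that separation concrete is to tokenize sensitive payload fields before they reach broadly readable storage, so joins still work on stable tokens. The field list and tokenization scheme below are assumptions for illustration.

```python
import hashlib

SENSITIVE_FIELDS = {"customer_email", "card_last4", "shipping_address"}  # illustrative list

def redact_payload(payload: dict) -> dict:
    """Replace sensitive values with stable tokens; metadata and replay joins keep working."""
    redacted = {}
    for key, value in payload.items():
        if key in SENSITIVE_FIELDS:
            token = hashlib.sha256(str(value).encode()).hexdigest()[:12]
            redacted[key] = f"tok_{token}"
        else:
            redacted[key] = value
    return redacted
```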

The ContextOS version

In ContextOS, every production-grade trace should connect the canonical execution artifacts:

trace_id
  -> RunContext
  -> CompiledContext
  -> Plan
  -> ToolEnvelope[]
  -> PolicyDecision[]
  -> CriticVerdict[]
  -> Scorecard
  -> DecisionRecord
  -> ReplayHandle

That chain gives the Improvement Loop its raw material. The Insight Synthesizer clusters failures. The Strategy Compiler turns patterns into proposals. Autotune searches bounded variants. The release gate replays the candidate. None of that works from final answers alone.

The bar

Trace review is production-grade when:

| Check | Pass line |
|---|---|
| End-to-end | spans cover compile, plan, execute, guardrail, critic, record |
| Joinable | every span carries run, intent, version, and trace IDs |
| Gradable | trace-level graders can score path correctness |
| Replayable | tool transcripts can substitute for live calls |
| Safe | sensitive payloads are redacted, scoped, or retained by policy |
| Actionable | failures become dataset rows or proposals |

Great agent debugging does not ask the model what went wrong. It reads the trace.
