Agent engineering series
May 12, 2026
by Piyush · 6 min read

Scorecards Over Vibes: The Five Metrics That Keep Agents Honest

Tags: ContextOS · AI Engineering · Evaluation · Scorecards · Agents

The sentence “the agent got better” is not an engineering statement.

Better at what? On which intent? Against which dataset? Did safety hold? Did policy hold? Did cost rise? Did latency move? Did it solve more tasks, or did it just answer more confidently?

Agent teams get into trouble when they collapse quality into one global feeling. A single score is seductive because it simplifies release decisions. It is also dangerous because agents can improve one dimension by damaging another. ContextOS uses Evaluation and Observability to keep those dimensions separate.

Cost improves when the harness retrieves less evidence. Latency improves when it avoids tools. Utility improves when it answers instead of escalating. User satisfaction improves when the agent says yes. None of those are improvements if the agent bypassed policy, skipped evidence, or created an irreversible side effect.

Great AI engineers use scorecards over vibes.

Five dimensions

A production agent scorecard needs at least five dimensions:

Dimension | Question
--- | ---
Policy | Did the harness obey the rules outside the model?
Safety | Did it avoid unsafe, unsupported, or sensitive behavior?
Utility | Did it complete the task correctly and usefully?
Latency | Did it finish within the workflow’s time budget?
Economics | Did it spend a reasonable amount to reach a verified outcome?

Policy and Safety are floor constraints. Utility, Latency, and Economics are optimization dimensions. That split is the first guardrail against metric gaming.

OpenAI’s evals and trace-grading work points the same way: final-output scoring is useful, but agent evaluation needs structured graders and trace-level inspection when tool calls, handoffs, and guardrails matter. A scorecard turns that into an operating practice.

The scorecard shape

Use one scorecard per run, then aggregate by intent, pack version, policy version, tool manifest, evaluator suite, model profile, and risk class.

{
  "run_id": "run_refund_017",
  "intent": "support.refund",
  "pack_version": "ctxpack.support@5.2.0",
  "policy_bundle": "policy.returns@4.1.0",
  "tool_manifest": "tools.support@3.7.0",
  "evaluator_suite": "evals.refund@2.4.0",
  "scores": {
    "policy": {
      "score": 1.0,
      "violations": [],
      "approval_gate_honored": true
    },
    "safety": {
      "score": 1.0,
      "unsupported_claim": false,
      "redaction_success": true
    },
    "utility": {
      "score": 0.92,
      "task_success": true,
      "operator_correction": false
    },
    "latency": {
      "wall_clock_ms": 1840,
      "p95_target_ms": 3000
    },
    "economics": {
      "tokens": 4720,
      "tool_calls": 3,
      "cost_cents": 0.91
    }
  }
}

The exact numbers matter less than the dimensions and joins. If you cannot slice by intent and version, you cannot know what changed.
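
Slicing like this is mechanical once the version fields ride on every scorecard. A minimal sketch in Python, assuming each scorecard is a dict shaped like the JSON above:

from collections import defaultdict

def aggregate_by_slice(scorecards):
    # Group per-run scorecards by (intent, pack, policy, tools, evals)
    # and average utility within each slice.
    slices = defaultdict(list)
    for sc in scorecards:
        key = (sc["intent"], sc["pack_version"], sc["policy_bundle"],
               sc["tool_manifest"], sc["evaluator_suite"])
        slices[key].append(sc["scores"]["utility"]["score"])
    return {key: sum(vals) / len(vals) for key, vals in slices.items()}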

Policy

Policy answers whether the harness obeyed rules outside model discretion.

Good policy evals are usually boring:

Check | Example
--- | ---
Must-deny | refund above limit without approval returns deny or gate
Must-approve-with-gate | destructive action requires the named gate
Must-redact | restricted attribute never reaches final answer
Must-not-use-tool | tool absent from manifest cannot be called
Must-record | policy decision ID appears on DecisionRecord

A Policy score of 0.99 is not always acceptable. For some workflows, one policy violation blocks release. Treat hard policy rules as floors, not averages.
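
In code, a floor is a minimum over runs, never a mean. A sketch against the policy block shown above, where a single violation or bypassed gate zeroes the whole slice:

def policy_floor(runs):
    # Hard floor: any violation or bypassed approval gate fails the slice.
    for run in runs:
        policy = run["scores"]["policy"]
        if policy["violations"] or not policy["approval_gate_honored"]:
            return 0.0
    return 1.0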

Safety

Safety is broader than policy. Policy is the explicit rule set. Safety catches unsupported claims, prompt-injection success, harmful tool-use patterns, leakage, and evidence gaps.

Useful safety metrics:

Metric | Meaning
--- | ---
unsupported_claim_rate | final claims without evidence refs
evidence_coverage | required evidence present before decision
prompt_injection_resistance | untrusted text did not become instruction
redaction_success | sensitive fields stayed out of unsafe surfaces
irreversible_action_without_gate | should always be zero

Safety evals should include adversarial and accidental failures. Most production safety failures are not cinematic attacks. They are stale documents, copied user text, confused identity, and missing evidence.
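
Most of these rates fall straight out of the per-run safety block. A sketch, again assuming one scorecard dict per run in the shape above:

def safety_rates(runs):
    # Slice-level rates from per-run safety fields; booleans sum as 0/1.
    n = len(runs)
    return {
        "unsupported_claim_rate":
            sum(r["scores"]["safety"]["unsupported_claim"] for r in runs) / n,
        "redaction_success_rate":
            sum(r["scores"]["safety"]["redaction_success"] for r in runs) / n,
    }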

Utility

Utility is where teams are tempted to spend all their time. It matters, but it is not enough.

For agents, utility should be task-grounded:

Metric | Better than
--- | ---
decision correctness | generic “helpfulness”
task completion | thumbs-up
operator-corrected rate | anecdotal feedback
answer completeness | length
escalation appropriateness | escalation count alone

The hard part is deciding expected outcomes. That is why the dataset has owners. An AI engineer should not invent refund policy labels without support-ops and policy ownership.
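
Once owners supply expected outcomes, grading decision correctness is simple. A sketch using a hypothetical decision field on each run and an owner-maintained label map; neither appears in the JSON above:

def decision_correctness(runs, expected_by_run_id):
    # expected_by_run_id: owner-labeled outcomes, e.g. {"run_refund_017": "approve"}
    correct = sum(1 for r in runs
                  if r.get("decision") == expected_by_run_id.get(r["run_id"]))
    return correct / len(runs)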

Latency

Latency is not one number. Agent latency is staged.

Stage | What to track
--- | ---
compile | retrieval, policy resolution, memory recall, budget assembly
model | generation time and queueing
tool | external service waits and retries
critic | validation and scoring time
approval | human gate wait, if applicable

Stage latency matters because fixes differ. A slow retrieval stage needs Context-plane indexing or budget work. A slow tool stage needs Adapter Mesh reliability. A slow Critic needs evaluator design. A slow approval stage needs operations, not model tuning.
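
A staged report makes the right fix obvious. A sketch, assuming per-stage timings are recorded on the trace; the field names here are illustrative:

def latency_report(stage_ms, p95_target_ms):
    # stage_ms e.g. {"compile": 410, "model": 930, "tool": 380,
    #                "critic": 90, "approval": 0}
    total = sum(stage_ms.values())
    return {
        "total_ms": total,
        "within_budget": total <= p95_target_ms,
        "slowest_stage": max(stage_ms, key=stage_ms.get),
    }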

Economics

Cost is meaningful only per verified outcome.

Do not report only tokens. Report:

Metric | Why
--- | ---
cents per decision | comparable across models and tools
cents per verified success | avoids cheap wrong answers
tokens by bucket | shows context pressure
tool calls per decision | exposes unnecessary orchestration
retry cost | captures unreliable tools and loops

Cost improvements are suspicious until Policy, Safety, and Utility hold. A cheaper agent that skips evidence is not efficient. It is under-instrumented.
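
Cents per verified success is the one worth automating first. A sketch over the scorecard shape above: unverified runs still pay into the numerator but never count in the denominator, so cheap wrong answers cannot look efficient.

def cents_per_verified_success(runs):
    # All spend counts; only runs that held policy, safety, and the task
    # count as successes.
    total_cents = sum(r["scores"]["economics"]["cost_cents"] for r in runs)
    verified = [r for r in runs
                if r["scores"]["policy"]["score"] == 1.0
                and r["scores"]["safety"]["score"] == 1.0
                and r["scores"]["utility"]["task_success"]]
    return total_cents / len(verified) if verified else float("inf")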

Scorecard deltas

The scorecard becomes useful when comparing a baseline to a candidate under a release gate.

candidate: ctxpack.support@5.3.0
baseline:  ctxpack.support@5.2.0
intent:    support.refund
set:       heldout release
 
policy:    1.000 -> 1.000   pass
safety:    1.000 -> 1.000   pass
utility:   0.901 -> 0.924   +0.023
latency:   2180ms -> 2010ms -170ms
economics: 0.91c -> 0.84c  -0.07c
changed:   37 / 900 runs
regressed: 0 destructive, 2 utility review
verdict:   ready_for_shadow

The changed count is important. A candidate that changes 37 runs is easier to review than one that changes 650. Blast radius is part of the release decision.
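
Computing blast radius only requires joining the two run sets on run_id. A sketch that uses task success as the changed-outcome signal; a real gate would compare more fields:

def changed_runs(baseline, candidate):
    # Join on run_id and report runs whose task outcome flipped.
    base = {r["run_id"]: r["scores"]["utility"]["task_success"] for r in baseline}
    return [r["run_id"] for r in candidate
            if r["run_id"] in base
            and r["scores"]["utility"]["task_success"] != base[r["run_id"]]]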

Release thresholds

Write thresholds before running the candidate.

release_gate:
  intent: support.refund
  hard_floors:
    policy: "== 1.0"
    safety: "== 1.0"
    approval_bypass_rate: "== 0"
  soft_bounds:
    utility_delta: ">= -0.01"
    p95_latency_delta_ms: "<= 250"
    cost_delta_cents: "<= 0.10"
  review_required_when:
    changed_runs: "> 100"
    utility_regressions: "> 0"
    destructive_path_changed: true

The release gate should return discrete verdicts:

Verdict | Meaning
--- | ---
pass | can advance to the next rollout stage
needs_human | within bounds but requires owner judgment
block | violates a hard floor or unacceptable soft bound
invalid | candidate or evaluation data is malformed

Ambiguous gates produce political releases. Typed gates produce engineering releases.
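
A typed gate is then a small pure function over the aggregated deltas. A sketch mirroring the release_gate config above; the field names and thresholds are illustrative, not a ContextOS API:

def release_verdict(agg):
    # agg: aggregated candidate-vs-baseline metrics for one intent.
    required = {"policy", "safety", "approval_bypass_rate", "utility_delta",
                "p95_latency_delta_ms", "cost_delta_cents", "changed_runs",
                "utility_regressions"}
    if agg is None or not required <= agg.keys():
        return "invalid"
    # Hard floors: never averaged, never negotiable.
    if agg["policy"] < 1.0 or agg["safety"] < 1.0 or agg["approval_bypass_rate"] > 0:
        return "block"
    # Soft bounds from the release_gate config.
    if (agg["utility_delta"] < -0.01
            or agg["p95_latency_delta_ms"] > 250
            or agg["cost_delta_cents"] > 0.10):
        return "block"
    # Judgment calls go to an owner, not to a threshold.
    if (agg["changed_runs"] > 100
            or agg["utility_regressions"] > 0
            or agg.get("destructive_path_changed")):
        return "needs_human"
    return "pass"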

Common scorecard failures

Failure | What happens
--- | ---
One global average | rare high-risk regressions disappear
No version dimensions | nobody can attribute the regression
No held-out set | candidate overfits search examples
No trace link | score says failed but not why
No business owner | expected labels become AI-team guesses
No cost per success | cheap wrong answers look good
No rejection memory | same bad proposal returns later

The last one is subtle. Rejected candidates are data. If a proposal failed because it reduced evidence coverage, the next search should know that path is bad.
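
Rejection memory can start as a plain log keyed by a proposal signature. A sketch, not a ContextOS API:

def remember_rejection(memory, proposal_signature, reason):
    # memory: any persistent mapping, e.g. a dict backed by a store.
    memory.setdefault(proposal_signature, []).append(reason)

def seen_failing(memory, proposal_signature):
    # Lets the next search prune proposals that already failed.
    return bool(memory.get(proposal_signature))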

The ContextOS version

In ContextOS, scorecards attach to DecisionRecords and release tuples:

DecisionRecord
  -> evidence_refs
  -> policy_decisions
  -> approvals
  -> tool_transcripts
  -> scorecard
  -> replay_handle

This makes scorecards auditable. A utility score can be traced back to evidence refs. A policy score can be traced back to a policy decision ID. A cost score can be traced to tokens and tool calls. A release gate can replay the run against the candidate harness.
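
For illustration, one possible in-memory shape for that record; the actual ContextOS schema may differ:

from dataclasses import dataclass

@dataclass
class DecisionRecord:
    evidence_refs: list[str]
    policy_decisions: list[str]   # policy decision IDs
    approvals: list[str]
    tool_transcripts: list[dict]
    scorecard: dict
    replay_handle: str            # lets a release gate replay this run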

The bar

A scorecard is production-grade when:

Check | Pass line
--- | ---
It is multi-dimensional | no single blended score controls release
It is intent-scoped | every metric can be sliced by intent and risk
It is versioned | pack, policy, tools, evaluator suite, model profile are attached
It has hard floors | Policy and Safety cannot be averaged away
It links to traces | failures are diagnosable
It gates rollout | candidates cannot ship on vibes

Great AI engineers do not ask whether the agent feels better. They ask which score moved, which guardrail held, which traces changed, and whether the release gate agrees.
