AI literacy series · May 13, 2026 · by Piyush · 4 min read

How to Judge AI Work: Scorecards, Not Vibes

Tags: ContextOS · AI Literacy · Evaluation · Scorecards · Agents

The worst way to judge an AI system is to watch one good demo.

Demos are useful for imagination. They are bad for trust.

Real AI work needs a scorecard.

The driving test analogy

You do not give someone a driving license because they drove one clean lap around an empty parking lot.

You test:

  • traffic,
  • parking,
  • signs,
  • night driving,
  • unexpected pedestrians,
  • bad weather,
  • judgment,
  • safe stopping.

AI agents need the same kind of test. One correct answer does not prove safe operation.

Five questions every scorecard should answer

| Question | Scorecard dimension |
| --- | --- |
| Did it follow rules? | Policy |
| Was it safe? | Safety |
| Did it complete the work? | Utility |
| Was it fast enough? | Latency |
| Was it worth the cost? | Economics |

This is the ContextOS Evaluation and Observability model.

The important idea is simple: one score is not enough.
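One way to keep that idea concrete: represent each evaluated run as a record with one field per dimension, and never collapse it to a single number. A minimal sketch; the class and field names are illustrative, not a ContextOS API.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    """One evaluated run, scored on every dimension separately.
    Sketch only: names are illustrative, not a ContextOS API."""
    policy_violations: int    # Did it follow rules?
    safety_violations: int    # Was it safe?
    task_completed: bool      # Did it complete the work?
    latency_ms: float         # Was it fast enough?
    cost_usd: float           # Was it worth the cost?
```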

Why one score fails

Imagine a support refund AI.

It could improve “customer satisfaction” by approving every refund. That is not good.

It could reduce cost by refusing every refund. That is not good.

It could be fast by skipping evidence. That is not good.

It could be safe by escalating everything. That is not useful.

A scorecard prevents one metric from hiding another.
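To see the failure mode concretely, here is a toy comparison (all numbers invented for illustration): a blended average looks healthy even when one dimension has collapsed, while per-dimension checks catch it.

```python
# Toy scores per dimension, 0.0 (bad) to 1.0 (good).
# The numbers are invented for illustration.
scores = {"policy": 0.95, "safety": 0.20, "utility": 0.98,
          "latency": 0.90, "economics": 0.92}

blended = sum(scores.values()) / len(scores)
print(f"blended average: {blended:.2f}")  # 0.79, looks fine

# Per-dimension thresholds expose the hidden safety failure.
threshold = 0.80
failing = [dim for dim, s in scores.items() if s < threshold]
print("failing dimensions:", failing)     # ['safety']
```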

Start with examples

Non-technical teams can help create the first examples.

Collect:

| Example type | Why |
| --- | --- |
| Easy success | Shows the expected path |
| Missing information | Tests whether the AI asks instead of guessing |
| Policy boundary | Tests rule-following |
| High-risk action | Tests the approval gate |
| Tool failure | Tests graceful recovery |
| Ambiguous request | Tests escalation |
| Prior mistake | Tests whether the system learned |

Each example should say what good looks like.

Write expected behavior

Bad example:

Customer asks for refund. AI should answer correctly.

Good example:

```yaml
input: "Refund my order. It was late."
expected:
  intent: support.refund.evaluate
  should:
    - look up order
    - check refund policy
    - ask for missing identity evidence if needed
    - explain result with policy evidence
  must_not:
    - issue refund without approval if amount is above threshold
    - invent delivery status
    - promise refund without policy support
```

This is understandable to business reviewers and useful to technical teams.
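A spec like this can also be made executable. A sketch under assumed names: the filename and the hard-coded `observed` list are illustrative, and a real harness would pull the observed actions from the agent's trace.

```python
import yaml  # requires PyYAML; illustrative dependency

with open("refund_example.yaml") as f:  # hypothetical filename
    spec = yaml.safe_load(f)

# `observed` would come from the agent's trace; hard-coded here
# purely for illustration.
observed = [
    "look up order",
    "check refund policy",
    "explain result with policy evidence",
]

missing = [step for step in spec["expected"]["should"]
           if step not in observed]
violated = [rule for rule in spec["expected"]["must_not"]
            if rule in observed]

# Conditional steps ("if needed") will show up as missing under this
# naive check; a real harness would treat them as optional.
print("missing steps:", missing)
print("violations:", violated)
```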

Judge the path, not only the answer

AI can get the final answer right for the wrong reason.

So ask:

  • What did it see?
  • What did it decide?
  • What tool did it use?
  • What did the tool return?
  • What policy applied?
  • Why did it accept or escalate?

That path is called a trace.

OpenAI’s agent-eval guidance emphasizes traces for diagnosing workflow issues, because a final answer alone is too thin a signal.
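In code, a trace can be as simple as an append-only list of structured steps that answers exactly those questions. A minimal sketch; the step types and field names are invented for illustration and are not a specific library's format.

```python
import json
import time

trace = []

def record(step_type: str, **details):
    """Append one structured, timestamped step to the trace."""
    trace.append({"t": time.time(), "type": step_type, **details})

# What did it see? What did it decide? What tool did it use?
record("input", text="Refund my order. It was late.")
record("decision", intent="support.refund.evaluate")
record("tool_call", tool="orders.lookup", args={"order_id": "A123"})
record("tool_result", tool="orders.lookup",
       result={"status": "delivered_late"})
record("policy", section="4.2", applies=True)
record("outcome", action="escalate",
       reason="refund amount exceeded delegated threshold")

print(json.dumps(trace, indent=2))
```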

Human corrections are gold

When a human corrects AI, capture the correction.

Not:

AI was wrong.

Capture:

```yaml
correction:
  what_ai_did: approved refund
  what_human_changed: escalated to finance
  why: refund amount exceeded delegated threshold
  evidence: policy section 4.2
  future_rule: high-value refund requires approval gate
```

That correction can improve the harness later.
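One concrete way it improves the harness is by converting each correction into a regression example automatically. A sketch: the `original_input` field is an assumption (it is not in the YAML above), and the output shape mirrors the expected-behavior format from earlier in this post.

```python
def correction_to_test_case(correction: dict) -> dict:
    """Turn a logged human correction into a regression example.
    Sketch only: field names are assumptions."""
    return {
        "input": correction["original_input"],  # assumed field
        "expected": {
            "should": [correction["what_human_changed"]],
            "must_not": [correction["what_ai_did"]],
        },
        "evidence": correction["evidence"],
        "source": "human correction",
    }

case = correction_to_test_case({
    "original_input": "Refund my order. It was late.",
    "what_ai_did": "approved refund",
    "what_human_changed": "escalated to finance",
    "evidence": "policy section 4.2",
})
print(case)
```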

The launch gate

Before launch, decide what must be true.

Example:

| Gate | Required |
| --- | --- |
| Policy | no rule violations |
| Safety | no unsupported claims |
| Utility | fewer than 10% human corrections |
| Latency | p95 below workflow target |
| Economics | cost below value threshold |
| Trust | every high-risk action has a receipt |

If a candidate fails the gate, it does not launch. That is discipline, not pessimism.
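A launch gate is naturally a hard conjunction: every requirement must hold, or the candidate stays unshipped. A sketch, with thresholds and metric names invented for illustration:

```python
# Measured over a shadow-run batch; numbers invented for illustration.
measured = {
    "policy_violations": 0,
    "unsupported_claims": 0,
    "human_correction_rate": 0.08,
    "p95_latency_s": 3.4,
    "cost_per_task_usd": 0.12,
    "high_risk_actions_without_receipt": 0,
}

gate = [
    ("Policy",    measured["policy_violations"] == 0),
    ("Safety",    measured["unsupported_claims"] == 0),
    ("Utility",   measured["human_correction_rate"] < 0.10),
    ("Latency",   measured["p95_latency_s"] <= 5.0),      # workflow target
    ("Economics", measured["cost_per_task_usd"] <= 0.50), # value threshold
    ("Trust",     measured["high_risk_actions_without_receipt"] == 0),
]

failures = [name for name, ok in gate if not ok]
print("LAUNCH" if not failures else f"BLOCKED by: {failures}")
```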

Business review questions

When reviewing an AI scorecard, ask:

  1. Were the examples realistic?
  2. Did we include hard cases?
  3. Did we test missing information?
  4. Did we test high-risk actions?
  5. Did humans review failures?
  6. Did we separate practice examples from final test examples?
  7. Did the AI leave receipts?
  8. Did the score improve for the right reason?

These questions are powerful even without technical details.

What good looks like

A good AI review says:

On 200 shadow runs, the system completed 82% without correction, escalated 13%, failed gracefully 5%, had zero policy violations, had zero unexpected high-risk actions, and reduced average handling time by 22%. The remaining failures cluster around missing bank verification evidence.

That is much better than:

The demo looked good.

Found this useful? Share it.
