The worst way to judge an AI system is to watch one good demo.
Demos are useful for imagination. They are bad for trust.
Real AI work needs a scorecard.
The driving test analogy
You do not give someone a driving license because they drove one clean lap around an empty parking lot.
You test:
- traffic,
- parking,
- signs,
- night driving,
- unexpected pedestrians,
- bad weather,
- judgment,
- safe stopping.
AI agents need the same kind of test. One correct answer does not prove safe operation.
Five questions every scorecard should answer
| Question | Scorecard dimension |
|---|---|
| Did it follow rules? | Policy |
| Was it safe? | Safety |
| Did it complete the work? | Utility |
| Was it fast enough? | Latency |
| Was it worth the cost? | Economics |
This is the ContextOS Evaluation and Observability model.
The important idea is simple: one score is not enough.
Why one score fails
Imagine a support refund AI.
It could improve “customer satisfaction” by approving every refund. That is not good.
It could reduce cost by refusing every refund. That is not good.
It could be fast by skipping evidence. That is not good.
It could be safe by escalating everything. That is not useful.
A scorecard prevents one metric from hiding another.
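The idea can be sketched in a few lines of Python. Everything here is hypothetical (the field names, budgets, and thresholds are illustrative, not part of any real scorecard schema): the point is only that a run passes when every dimension passes, so no single metric can hide another.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    """One run, scored on all five dimensions (hypothetical schema)."""
    policy_violations: int   # Policy: rule breaches
    unsafe_claims: int       # Safety: unsupported statements
    task_completed: bool     # Utility: did the work get done?
    latency_ms: float        # Latency
    cost_usd: float          # Economics

def passes(card: Scorecard, latency_budget_ms: float = 5000,
           cost_budget_usd: float = 0.50) -> bool:
    # A single failing dimension fails the run.
    return (card.policy_violations == 0
            and card.unsafe_claims == 0
            and card.task_completed
            and card.latency_ms <= latency_budget_ms
            and card.cost_usd <= cost_budget_usd)

# A fast, cheap, "successful" run that broke policy still fails:
bad = Scorecard(policy_violations=1, unsafe_claims=0,
                task_completed=True, latency_ms=800, cost_usd=0.02)
print(passes(bad))  # False
```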
Start with examples
Non-technical teams can help create the first examples.
Collect:
| Example type | Why |
|---|---|
| Easy success | Shows the expected path |
| Missing information | Tests whether AI asks instead of guessing |
| Policy boundary | Tests rule-following |
| High-risk action | Tests approval gate |
| Tool failure | Tests graceful recovery |
| Ambiguous request | Tests escalation |
| Prior mistake | Tests whether the system learned |
Each example should say what good looks like.
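The example types above can be collected as a small suite. The inputs and field names below are invented for illustration; the structure is what matters: every example carries its type and a statement of what good looks like.

```python
# A minimal example suite (all inputs and expectations are hypothetical).
example_suite = [
    {"type": "easy_success",
     "input": "Refund order 123, it arrived broken.",
     "expect": "completes the standard refund path"},
    {"type": "missing_information",
     "input": "Refund my order.",
     "expect": "asks which order instead of guessing"},
    {"type": "policy_boundary",
     "input": "Refund this $5,000 order.",
     "expect": "escalates above the delegated threshold"},
    {"type": "tool_failure",
     "input": "Refund order 456.",
     "expect": "recovers gracefully when the order lookup fails"},
]

for example in example_suite:
    print(example["type"], "->", example["expect"])
```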
Write expected behavior
Bad example:
Customer asks for refund. AI should answer correctly.
Good example:

```yaml
input: "Refund my order. It was late."
expected:
  intent: support.refund.evaluate
  should:
    - look up order
    - check refund policy
    - ask for missing identity evidence if needed
    - explain result with policy evidence
  must_not:
    - issue refund without approval if amount is above threshold
    - invent delivery status
    - promise refund without policy support
```

This is understandable to business reviewers and useful to technical teams.
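A spec like this is checkable by a few lines of code. The sketch below assumes one simplification (that actions are compared as plain strings; a real harness would match on structured tool calls): it reports which expected steps were missed and which forbidden ones occurred.

```python
def check_run(actions: list[str], should: list[str],
              must_not: list[str]) -> dict:
    """Compare the actions an AI actually took against an example's expectations."""
    return {
        "missing": [a for a in should if a not in actions],
        "violations": [a for a in must_not if a in actions],
    }

result = check_run(
    actions=["look up order", "check refund policy", "invent delivery status"],
    should=["look up order", "check refund policy",
            "explain result with policy evidence"],
    must_not=["invent delivery status"],
)
print(result)
# {'missing': ['explain result with policy evidence'],
#  'violations': ['invent delivery status']}
```

A run passes only when both lists come back empty.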
Judge the path, not only the answer
AI can get the final answer right for the wrong reason.
So ask:
- What did it see?
- What did it decide?
- What tool did it use?
- What did the tool return?
- What policy applied?
- Why did it accept or escalate?
That path is called a trace.
OpenAI’s agent-eval guidance emphasizes traces for understanding workflow issues because final answers alone are too thin.
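A trace can be as simple as one record per step answering the questions above. The field names here are hypothetical, but they show why traces matter: a step that acted without a policy reference is visible in the trace even when the final answer looked right.

```python
# A minimal trace: one entry per step (all field names are illustrative).
trace = [
    {"saw": "refund request for order 789",
     "decided": "need delivery status before evaluating",
     "tool": "orders.lookup",
     "tool_returned": "delivered 6 days late",
     "policy": "refunds.late_delivery",
     "outcome": "accepted: within delegated threshold"},
]

def steps_without_policy(trace: list[dict]) -> list[dict]:
    """Flag steps that acted without citing a policy; answer-only evals miss these."""
    return [step for step in trace if not step.get("policy")]

print(steps_without_policy(trace))  # []
```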
Human corrections are gold
When a human corrects AI, capture the correction.
Not:
AI was wrong.
Capture:
```yaml
correction:
  what_ai_did: approved refund
  what_human_changed: escalated to finance
  why: refund amount exceeded delegated threshold
  evidence: policy section 4.2
  future_rule: high-value refund requires approval gate
```

That correction can improve the harness later.
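One way that improvement can happen, sketched below with invented field names: a captured correction is mechanically turned into a new regression example, so the same mistake is tested forever after.

```python
correction = {
    "what_ai_did": "approved refund",
    "what_human_changed": "escalated to finance",
    "why": "refund amount exceeded delegated threshold",
    "evidence": "policy section 4.2",
    "future_rule": "high-value refund requires approval gate",
}

def to_test_case(c: dict) -> dict:
    """Turn a captured correction into a regression example for the harness."""
    return {
        "type": "prior_mistake",
        "must_not": [c["what_ai_did"]],       # what the AI did wrongly
        "should": [c["what_human_changed"]],  # what the human did instead
        "evidence": c["evidence"],
    }

case = to_test_case(correction)
print(case["type"], case["must_not"])
```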
The launch gate
Before launch, decide what must be true.
Example:
| Gate | Required |
|---|---|
| Policy | no rule violations |
| Safety | no unsupported claims |
| Utility | fewer than 10% human corrections |
| Latency | p95 below workflow target |
| Economics | cost below value threshold |
| Trust | every high-risk action has receipt |
If a candidate fails the gate, it does not launch. That is discipline, not pessimism.
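The gate table above translates directly into code. The metric names and thresholds below are assumptions for illustration; the shape is the point: the function returns every failed gate, and a candidate launches only when the list is empty.

```python
def launch_gate(m: dict) -> list[str]:
    """Return the failed gates; an empty list means the candidate may launch."""
    failures = []
    if m["policy_violations"] > 0:
        failures.append("Policy")
    if m["unsupported_claims"] > 0:
        failures.append("Safety")
    if m["human_correction_rate"] >= 0.10:
        failures.append("Utility")
    if m["latency_p95_ms"] > m["latency_target_ms"]:
        failures.append("Latency")
    if m["cost_per_task"] > m["value_threshold"]:
        failures.append("Economics")
    if m["high_risk_without_receipt"] > 0:
        failures.append("Trust")
    return failures

metrics = {"policy_violations": 0, "unsupported_claims": 0,
           "human_correction_rate": 0.12, "latency_p95_ms": 2000,
           "latency_target_ms": 3000, "cost_per_task": 0.20,
           "value_threshold": 0.50, "high_risk_without_receipt": 0}
print(launch_gate(metrics))  # ['Utility']
```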
Business review questions
When reviewing an AI scorecard, ask:
- Were the examples realistic?
- Did we include hard cases?
- Did we test missing information?
- Did we test high-risk actions?
- Did humans review failures?
- Did we separate practice examples from final test examples?
- Did the AI leave receipts?
- Did the score improve for the right reason?
These questions are powerful even without technical details.
What good looks like
A good AI review says:
On 200 shadow runs, the system completed 82% without correction, escalated 13%, failed gracefully 5%, had zero policy violations, had zero unexpected high-risk actions, and reduced average handling time by 22%. The remaining failures cluster around missing bank verification evidence.
That is much better than:
The demo looked good.
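A report in that shape falls straight out of logged outcomes. A minimal sketch, assuming outcomes are recorded as simple labels per run:

```python
from collections import Counter

def shadow_report(outcomes: list[str]) -> dict:
    """Summarize shadow-run outcomes as percentages."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# 200 shadow runs, with the same split as the example review above:
outcomes = (["completed"] * 164 + ["escalated"] * 26
            + ["failed_gracefully"] * 10)
print(shadow_report(outcomes))
# {'completed': 82.0, 'escalated': 13.0, 'failed_gracefully': 5.0}
```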