Evaluation
13 essays tagged with Evaluation.
Agent Harness: An Architectural Framework for Production AI Agents
A whitepaper on typed contracts, policy gates, traces, verification loops, and release control for production AI agents.
ContextOS: A Research-Grounded Architecture for Governed Agent Runtimes
A research-grounded framing of ContextOS as a governed runtime for context, tools, memory, security, evaluation, replay, and optimization.
Harness Improvement Loops Need Replayable Environments
Why harness improvement needs replayable episodes, bounded mutations, scorecards, source closure, and promotion gates.
How to Judge AI Work: Scorecards, Not Vibes
A practical guide for business teams evaluating AI agents with scorecards, examples, traces, human corrections, and launch gates instead of demos and vibes.
Scorecards Before Screens: Evals and Launch Gates for PMs Building Agents
A PM guide to defining agent quality with datasets, trace reviews, scorecards, release gates, and business metrics before building the agent UI.
Autotune the Harness: Baking the Improvement Loop into ContextOS
How ContextOS treats autotune as a gated loop over traces, scorecards, replay sets, bounded candidates, approval, and rollout.
Dataset-First Agent Engineering: The Golden Sets Behind Reliable Agents
A practical guide to golden sets, task distributions, corrected runs, held-out releases, and production slices for agent engineering.
How Great AI Engineers Build Agents: Datasets, Scores, and Harnesses That Improve
Why strong AI engineers build datasets, scorecards, traces, and improvement loops instead of treating agents as prompts plus tools.
Scorecards Over Vibes: The Five Metrics That Keep Agents Honest
Why agent scorecards should track five metrics: policy, utility, latency, safety, and economics.
Trace Review Is the Agent Debugger: Grade the Path, Not Just the Answer
How trace review evaluates the way an agent reached its answer by inspecting context, plans, tools, guardrails, critic verdicts, and corrections.
Agentic AI Systems Before and After ContextOS
A table-first guide to why agentic systems need bounded context, governed tools, typed decisions, replay, evaluation, and controlled improvement.
Building a Reliability Reviewer Agent: 70 Lines Past the Compliance One
How to extend the reviewer pattern for reliability: timeouts, retries, idempotency, fallback behavior, and rollback declarations.
Building a Compliance Reviewer Agent in 60 Lines and a Golden Set
How to build a compliance reviewer agent with a typed verdict envelope, rubric, golden set, and change-control queue.