Blog series
6 posts · 55 min read

Agent engineering series

How strong AI engineers build agents with datasets, scorecards, traces, and harness improvement loops.

1
May 12, 2026·17 min read

How to Develop an Agent with an Agent Harness, End to End

A field guide for building production agents as measurable harnesses: intent, RunContext, Context Pack compilation, Planner-Critic-Executor, Tool Gateway, DecisionRecords, trace evals, staged rollout, and the improvement loop.

2
May 12, 2026·13 min read

How Great AI Engineers Build Agents: Datasets, Scores, and Harnesses That Improve

Great agent engineers do not treat an agent as a prompt plus tools. They build datasets, scorecards, traces, and improvement loops, then treat the harness itself as a versioned artifact that can be measured and improved like a model.

3
May 12, 2026·7 min read

Dataset-First Agent Engineering: The Golden Sets Behind Reliable Agents

Great agent teams build datasets before they tune prompts. This is the practical guide to task distributions, golden sets, corrected runs, held-out release sets, and production slices for ContextOS-style harness engineering.

4
May 12, 2026·6 min read

Scorecards Over Vibes: The Five Metrics That Keep Agents Honest

Agent quality cannot be managed with one score. Production teams need scorecards that separate policy, safety, utility, latency, and economics, then gate harness changes by intent and version.

5
May 12, 2026·6 min read

Trace Review Is the Agent Debugger: Grade the Path, Not Just the Answer

Final-answer evals miss how agents actually fail. Trace review captures context, plans, tool calls, guardrails, critic verdicts, and corrections so teams can improve the harness where the failure really happened.

6
May 12, 2026·6 min read

Harness Candidates Are Model Checkpoints: How to Improve Agents Without Silent Mutation

If the harness is the real optimization target, then every prompt, retrieval, planner, tool, policy, and evaluator change should be treated like a checkpoint candidate: scored, reviewed, rolled out, and reversible.
