Harness Engineering
22 essays tagged with Harness Engineering.
Agent Harness: An Architectural Framework for Production AI Agents
A whitepaper on typed contracts, policy gates, traces, verification loops, and release control for production AI agents.
Harness Improvement Loops Need Replayable Environments
Why harness improvement needs replayable episodes, bounded mutations, scorecards, source closure, and promotion gates.
Product Managers: How to Think About and Build Complex Agentic Systems
A practical PM guide to building agentic systems with workflow maps, intents, context packs, tools, records, evals, and rollout gates.
How to Develop an Agent with an Agent Harness, End to End
An end-to-end field guide for building agents as measurable harnesses: context, planning, tools, records, evals, rollout, and learning.
Autotune the Harness: Baking the Improvement Loop into ContextOS
How ContextOS treats autotune as a gated loop over traces, scorecards, replay sets, bounded candidates, approval, and rollout.
How Great AI Engineers Build Agents: Datasets, Scores, and Harnesses That Improve
Why strong AI engineers build datasets, scorecards, traces, and improvement loops instead of treating agents as prompts plus tools.
Harness Candidates Are Model Checkpoints: How to Improve Agents Without Silent Mutation
How to treat every prompt, retrieval, tool, policy, and evaluator change as a scored, reviewed, reversible harness candidate.
AGENTS.md Done Right: The Navigation File That Actually Helps Coding Agents
How to write AGENTS.md as a short, scoped, testable navigation file for coding agents instead of a bloated prompt dump.
The Agent Harness Audit: A Production Readiness Checklist for Governed AI Agents
A production readiness audit for agent harnesses: forty runtime controls grouped into eight evidence-backed outcomes.
Replay Harness in Code: Reproducing a DecisionRecord Byte-for-Byte
A TypeScript build-along for replay: input loading, hash-chain verification, canonical loop replay, and DecisionRecord diffing.
End-to-End Refund: How 12 Primitives Compose in One Production Run
A single refund run traced through 12 ContextOS primitives, from the invokeAgent envelope to byte-equal replay.
Failure Playbooks: The Typed Verdict Map
How to replace generic retry loops with typed failure verdicts, compensations, escalation paths, and reversal-token checks.
Approval Gates in Code: The Destructive-Mode Handshake
A build-along for approval gates: frozen evidence, human signatures, gateway redemption, and replayable destructive-action handshakes.
Build the Tool Gateway: The Boundary That Actually Stops a Bad Action
A build-along for the Tool Gateway: adapter manifests, typed envelopes, resolver checks, dispatch, and destructive-action boundaries.
The Critic: verify, score, consolidate — in 80 Lines
A compact Critic implementation that verifies plans, scores outcomes, consolidates results, and records caveats.
Promotion-Aware Memory: Capture, Review, Promote, Recall in Code
A build-along for agent memory: capture, review, promote, recall, contradiction checks, and governed memory writes.
Build the Context Pack Compiler: Eight Stages, Eight Files
A build-along for the Context Pack compiler: eight deterministic stages that turn runtime inputs into a typed compiled context.
From Operator Correction to Released StrategyRule: The Improvement Loop, Coded
How one operator correction becomes a reviewed, replayed, versioned StrategyRule that prevents repeat agent failures.
Pack Rollout in Five Stages: Shipping a Context Pack Without Blowing Up Production
A five-stage rollout model for Context Packs: shadow, internal, low-risk, monitored expansion, and full release, plus rollback.
Wiring the Five Evaluators: Policy, Utility, Latency, Safety, Cost
A build-along for wiring policy, utility, latency, safety, and cost evaluators into a release-gated scorecard.
Building a Reliability Reviewer Agent: 70 Lines Past the Compliance One
How to extend the reviewer pattern for reliability: timeouts, retries, idempotency, fallback behavior, and rollback declarations.
Building a Compliance Reviewer Agent in 60 Lines and a Golden Set
How to build a compliance reviewer agent with a typed verdict envelope, rubric, golden set, and change-control queue.