Shipping an agent is not the finish line.
It is the point where the product starts producing evidence.
Every run creates traces. Every correction reveals a gap. Every approval delay shows operational friction. Every tool denial says something about authority or context. Every customer complaint may be a policy, UI, context, or workflow bug.
The product manager’s job after launch is to turn those signals into governed improvement.
In ContextOS, that is the Improvement Loop: observe, capture corrections, synthesize insights, compile strategy rules, queue research, propose tuning candidates, gate, release, and roll back when needed.
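A minimal sketch of that loop as configuration is below. The stage names follow the list above; the field layout and comments are illustrative, not a fixed ContextOS schema.

```yaml
# Illustrative only: stages follow the Improvement Loop described above;
# the structure is a sketch, not a prescribed ContextOS artifact.
improvement_loop:
  intent: vendor.onboarding.compliance_check   # example intent used later in this chapter
  stages:
    - observe               # collect traces and DecisionRecords from live runs
    - capture_corrections   # log operator edits against the originating trace
    - synthesize_insights   # cluster recurring corrections and failures
    - compile_strategy      # StrategyCompiler turns repeated corrections into rules
    - queue_research        # route missing domain knowledge to the ResearchQueue
    - propose_candidates    # bounded tuning proposals with expected effects
    - gate                  # policy, safety, and eval floors must hold
    - release               # staged rollout against a pinned harness tuple
    - rollback              # re-pin the previous tuple if floors break
```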
The PM operating model
A real agentic product needs three cadences:
| Cadence | Meeting | Purpose |
|---|---|---|
| Daily | Operator triage | unblock approvals, inspect severe failures |
| Weekly | Scorecard review | decide advance, hold, tune, or roll back |
| Monthly | Harness strategy | retire weak paths, add intents, refresh evals |
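Each cadence works better when it is pinned to an owner and a decision output. The sketch below assumes hypothetical role names; it is one way to write the table down, not a required format.

```yaml
# Hypothetical operating cadences; role names are placeholders.
operating_cadences:
  - cadence: daily
    meeting: operator_triage
    owner: ops_lead
    outputs: [unblocked_approvals, severe_failure_reviews]
  - cadence: weekly
    meeting: scorecard_review
    owner: product_manager
    outputs: [advance_hold_tune_or_rollback_decision]
  - cadence: monthly
    meeting: harness_strategy
    owner: product_manager
    outputs: [retired_paths, new_intents, eval_refresh_plan]
```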
This is not bureaucracy. It is how product learning becomes runtime improvement instead of folklore.
What changes after launch
Before launch, you ask:
Can this work?
After launch, you ask:
What did real work teach us about the harness?
The data changes:
| Signal | Product question |
|---|---|
| Trace | Where did the behavior diverge? |
| DecisionRecord | What was decided and why? |
| Operator correction | What did a human know that the system missed? |
| Approval gate | Which actions are blocked, delayed, or over-gated? |
| Tool denial | Which capability is missing or mis-scoped? |
| Escalation | Which intents are not ready for automation? |
| Customer feedback | Where does the mental model fail? |
Each signal should have an owner and a route.
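A sketch of what "an owner and a route" can look like in practice. The signal names mirror the table; the owners and destinations are placeholders to be replaced with your own.

```yaml
# Placeholder owners and routes; adapt to your team structure.
signal_routing:
  trace:                {owner: product_manager,   route: weekly_scorecard_review}
  decision_record:      {owner: product_manager,   route: incident_and_audit_review}
  operator_correction:  {owner: ops_lead,          route: correction_capture_flow}
  approval_gate:        {owner: ops_lead,          route: approval_latency_dashboard}
  tool_denial:          {owner: platform_engineer, route: tool_gateway_backlog}
  escalation:           {owner: support_owner,     route: intent_readiness_review}
  customer_feedback:    {owner: product_manager,   route: proposal_queue}
```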
Route feedback into typed buckets
Do not let every failure become “improve the prompt.”
Use this routing table:
| Feedback | Likely fix | ContextOS primitive |
|---|---|---|
| Agent had wrong facts | context or evidence policy | Context Pack |
| Agent chose wrong tool | tool description or planner rule | Tool Gateway / Planner |
| Agent skipped approval | policy or Critic verification | Governance |
| Agent over-escalated | risk classification or rubric | Intent Catalog / Critic |
| Agent answered badly but path was right | response rubric or examples | Evaluator / prompt |
| Agent got stuck | loop guard or replan budget | Orchestration |
| Operator repeated same correction | strategy rule proposal | StrategyCompiler |
| Missing domain knowledge | research task or memory promotion | ResearchQueue / Memory |
This table keeps the team from treating the model as the only change surface.
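The routing table can also live as a triage artifact the team applies during review. A sketch, mirroring the rows above with hypothetical key names:

```yaml
# Hypothetical triage rules; the mapping mirrors the routing table above.
feedback_triage:
  wrong_facts:            {fix: context_or_evidence_policy,       primitive: context_pack}
  wrong_tool:             {fix: tool_description_or_planner_rule, primitive: tool_gateway}
  skipped_approval:       {fix: policy_or_critic_verification,    primitive: governance}
  over_escalated:         {fix: risk_classification_or_rubric,    primitive: intent_catalog}
  bad_answer_right_path:  {fix: response_rubric_or_examples,      primitive: evaluator}
  stuck_in_loop:          {fix: loop_guard_or_replan_budget,      primitive: orchestration}
  repeated_correction:    {fix: strategy_rule_proposal,           primitive: strategy_compiler}
  missing_knowledge:      {fix: research_task_or_memory_promotion, primitive: research_queue}
```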
The weekly scorecard review
Run this every week during rollout.
Agenda:
- What changed in the harness tuple?
- Did Policy or Safety floors hold?
- Which intents improved Utility?
- What happened to latency and economics?
- Which traces represent new failure modes?
- Which operator corrections repeated?
- Which approvals are creating bottlenecks?
- Which proposals should be approved, rejected, or sent back?
- Should rollout advance, hold, or roll back?
Decision options:
| Verdict | Meaning |
|---|---|
| Advance | Next rollout stage allowed |
| Hold | Keep current stage; collect more traces |
| Tune | Create bounded proposal |
| Roll back | Re-pin previous harness tuple |
| Retire | Remove intent, lane, or tool path |
The review should end with a decision, not a discussion.
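One way to force a decision is to close every review with a small, durable record. This is a sketch; the field names and values are illustrative, and the harness version shown is hypothetical.

```yaml
# Illustrative weekly scorecard record; values are examples only.
scorecard_review:
  week: 2026-W20
  intent: vendor.onboarding.compliance_check
  harness_tuple: vendor_harness_4.1.1        # hypothetical version under review
  floors_held: {policy: true, safety: true}
  utility_delta: +0.03
  latency_p95_delta: "+4%"
  new_failure_modes: [trace_vendor_802]      # placeholder trace id
  repeated_corrections: [corr_182]
  approval_bottlenecks: [bank_verification_review]
  verdict: hold              # advance | hold | tune | roll_back | retire
  rationale: "Correction corr_182 recurred; bounded tuning proposal queued."
```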
The proposal queue
Every improvement should be a proposal with evidence.
```yaml
proposal:
  id: prop_vendor_onboarding_2026w20_07
  source:
    - correction: corr_182
    - trace: trace_vendor_771
    - decision_record: dr_vendor_771
  target:
    intent: vendor.onboarding.compliance_check
    layer: planner
  change:
    add_step: bank_verification_before_sanctions_summary
  expected_effect:
    utility.operator_correction_rate: -0.04
  guardrails:
    policy_score: 1.0
    safety_score: 1.0
    latency_p95_delta: "<= +10%"
  evals:
    search_set: vendor_compliance_search_v3
    release_set: vendor_compliance_release_v2
  rollout:
    start_stage: 0%_shadow
    rollback_tuple: vendor_harness_4.1.0
```

This is how a PM turns “the agent keeps missing bank verification” into a release-gated change.
Autotune with boundaries
Autotune is useful only when the PM defines the search space.
Bad:
Improve onboarding agent quality.
Good:
```yaml
autotune_request:
  intent: vendor.onboarding.compliance_check
  primary_metric: operator_correction_rate
  direction: decrease
  allowed_surfaces:
    - context_pack.evidence_order
    - planner.tool_preference
    - response_examples
  forbidden_surfaces:
    - approval_modes
    - policy_rules
    - destructive_tool_access
  guardrails:
    policy_score: 1.0
    safety_score: 1.0
    p95_latency_delta: "<= +10%"
```

The PM owns what may change. The optimizer proposes. The release gate decides.
Treat rollout as operations
A staged rollout is not a launch ritual. It is a learning plan.
| Stage | PM learning goal |
|---|---|
| 0%_shadow | Does the agent produce correct receipts without impact? |
| 1%_internal | Can trained operators correct and trust it? |
| 5%_low_risk | Does it hold on real safe traffic? |
| 25%_monitored | Does it generalize across tenants and edge cases? |
| 100% | Can support and operations sustain it? |
Each stage should have:
- entry criteria,
- exit criteria,
- rollback trigger,
- support owner,
- scorecard window,
- open-question list.
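A sketch of one stage definition under those six headings. The criteria, thresholds, and role names are examples to show the shape, not prescribed values.

```yaml
# Example stage definition; all thresholds and names are illustrative.
rollout_stage:
  name: 5%_low_risk
  entry_criteria:
    - floors_held_at_previous_stage: true
    - operator_correction_rate: "<= 0.08"
  exit_criteria:
    - verified_completion_rate: ">= 0.92"
    - consecutive_scorecard_reviews_passed: 2
  rollback_trigger:
    - any_policy_or_safety_floor_violation
    - p95_latency_delta: "> +15%"
  support_owner: vendor_ops_oncall        # hypothetical role
  scorecard_window: 14d
  open_questions:
    - "Does bank verification hold for non-domestic vendors?"
```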
Incident review for agent products
When an agent incident happens, avoid blame-first reviews.
Use the trace:
| Incident question | Artifact |
|---|---|
| What did the user ask? | request envelope |
| What intent was selected? | Intent Catalog resolution |
| What did the model see? | CompiledContext |
| What plan was made? | Plan |
| What was verified? | Critic.verify |
| What tools were called? | Tool Gateway transcript |
| What policy fired? | policy decision id |
| What receipt was emitted? | DecisionRecord |
| What proposal prevents recurrence? | Improvement Loop record |
The incident is not closed until the improvement proposal is accepted, rejected with rationale, or explicitly deferred.
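The table above can be carried as a template so every incident review collects the same artifacts. A sketch with placeholder identifiers, reusing ids from the proposal example:

```yaml
# Incident review template; identifiers are placeholders.
incident_review:
  incident_id: inc_2026_05_14_03
  request_envelope: req_90231
  intent_resolution: vendor.onboarding.compliance_check
  compiled_context: ctx_90231
  plan: plan_90231
  critic_verification: verify_90231
  tool_gateway_transcript: gateway_90231
  policy_decision_id: pol_4481
  decision_record: dr_vendor_771
  improvement_proposal: prop_vendor_onboarding_2026w20_07
  closure: accepted        # accepted | rejected_with_rationale | deferred
```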
PM metrics after launch
Track these by intent and version:
| Metric | Why it matters |
|---|---|
| Verified completion rate | Real utility |
| Operator correction rate | Human disagreement |
| Correction recurrence | Learning loop health |
| Approval latency | Operational bottleneck |
| Escalation quality | Human fallback usefulness |
| Replay determinism | Audit health |
| Policy/safety floor violations | Release blockers |
| Cost per verified completion | Economics |
| Harness proposal acceptance rate | Improvement quality |
Do not average all intents together. A safe average can hide a dangerous high-risk path.
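One way to avoid the safe-average trap is to report floors per intent rather than one blended number. The sketch below uses made-up values and a hypothetical second intent purely to show the breakdown.

```yaml
# Made-up values; the second intent is hypothetical.
metrics_by_intent:
  vendor.onboarding.compliance_check:      # high-risk path
    verified_completion_rate: 0.81
    operator_correction_rate: 0.11
    policy_floor_violations: 0
    cost_per_verified_completion: 0.42
  vendor.onboarding.profile_update:        # low-risk path
    verified_completion_rate: 0.97
    operator_correction_rate: 0.02
    policy_floor_violations: 0
    cost_per_verified_completion: 0.09
# A blended completion rate (~0.94 if traffic skews low-risk) would hide
# the weaker high-risk path, which is why floors apply per intent.
```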
Refresh the evals
Production changes the distribution.
Every month:
- promote representative corrected runs into `dev`,
- add recurring failures to `search`,
- add only stable, reviewed cases to `release_test`,
- retire stale examples,
- rebalance rare high-risk cases,
- review grader drift.
The eval set is not static documentation. It is the product’s regression memory.
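The monthly refresh can itself be a small plan artifact. A sketch: the set names echo the proposal example, and the case identifiers are placeholders.

```yaml
# Example monthly eval refresh plan; identifiers are placeholders.
eval_refresh:
  month: 2026-06
  promote_to_dev:
    - trace_vendor_771          # corrected run, representative
  add_to_search:
    - corr_182                  # recurring failure
  add_to_release_test:
    - vendor_case_0412          # stable, reviewed
  retire:
    - vendor_case_0021          # stale example
  rebalance:
    rare_high_risk_share: 0.15
  grader_drift_review: scheduled
```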
What the PM should not delegate
PMs can delegate implementation. They should not delegate these decisions:
- Which intent matters most?
- Which error is worse?
- Which human owns approval?
- Which score is a hard floor?
- Which rollout stage is acceptable?
- Which feedback becomes a release candidate?
- Which incidents justify pausing automation?
These are product decisions wearing technical clothing.
Operating checklist
Every shipped agentic product should have:
- an owner by intent,
- a weekly scorecard,
- a trace review queue,
- a correction capture flow,
- a proposal queue,
- release gates,
- rollback tuple,
- support runbook,
- monthly eval refresh,
- incident review template.
If those do not exist, the product is not operated. It is merely deployed.