Shipping an agent is not the finish line.
It is the point where the product starts producing evidence.
Every run creates traces. Every correction reveals a gap. Every approval delay shows operational friction. Every tool denial says something about authority or context. Every customer complaint may be a policy, UI, context, or workflow bug.
The product manager’s job after launch is to turn those signals into governed improvement.
In ContextOS, that is the Improvement Loop: observe, capture corrections, synthesize insights, compile strategy rules, queue research, propose tuning candidates, gate, release, and roll back when needed.
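A minimal sketch of that loop as configuration is below. The stage names follow the list above; the field layout and comments are illustrative, not a fixed ContextOS schema.

```yaml
# Illustrative only: stages follow the Improvement Loop described above;
# the structure is a sketch, not a prescribed ContextOS artifact.
improvement_loop:
  intent: vendor.onboarding.compliance_check   # example intent used later in this chapter
  stages:
    - observe               # collect traces and DecisionRecords from live runs
    - capture_corrections   # log operator edits against the originating trace
    - synthesize_insights   # cluster recurring corrections and failures
    - compile_strategy      # StrategyCompiler turns repeated corrections into rules
    - queue_research        # route missing domain knowledge to the ResearchQueue
    - propose_candidates    # bounded tuning proposals with expected effects
    - gate                  # policy, safety, and eval floors must hold
    - release               # staged rollout against a pinned harness tuple
    - rollback              # re-pin the previous tuple if floors break
```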
The PM operating model
A real agentic product needs three cadences:
| Cadence | Meeting | Purpose |
|---|---|---|
| Daily | Operator triage | unblock approvals, inspect severe failures |
| Weekly | Scorecard review | decide advance, hold, tune, or roll back |
| Monthly | Harness strategy | retire weak paths, add intents, refresh evals |
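Each cadence works better when it is pinned to an owner and a decision output. The sketch below assumes hypothetical role names; it is one way to write the table down, not a required format.

```yaml
# Hypothetical operating cadences; role names are placeholders.
operating_cadences:
  - cadence: daily
    meeting: operator_triage
    owner: ops_lead
    outputs: [unblocked_approvals, severe_failure_reviews]
  - cadence: weekly
    meeting: scorecard_review
    owner: product_manager
    outputs: [advance_hold_tune_or_rollback_decision]
  - cadence: monthly
    meeting: harness_strategy
    owner: product_manager
    outputs: [retired_paths, new_intents, eval_refresh_plan]
```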
This is not bureaucracy. It is how product learning becomes runtime improvement instead of folklore.
What changes after launch
Before launch, you ask:
Can this work?
After launch, you ask:
What did real work teach us about the harness?
The data changes:
| Signal | Product question |
|---|---|
| Trace | Where did the behavior diverge? |
| DecisionRecord | What was decided and why? |
| Operator correction | What did a human know that the system missed? |
| Approval gate | Which actions are blocked, delayed, or over-gated? |
| Tool denial | Which capability is missing or mis-scoped? |
| Escalation | Which intents are not ready for automation? |
| Customer feedback | Where does the mental model fail? |
Each signal should have an owner and a route.
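A sketch of what "an owner and a route" can look like in practice. The signal names mirror the table; the owners and destinations are placeholders to be replaced with your own.

```yaml
# Placeholder owners and routes; adapt to your team structure.
signal_routing:
  trace:                {owner: product_manager,   route: weekly_scorecard_review}
  decision_record:      {owner: product_manager,   route: incident_and_audit_review}
  operator_correction:  {owner: ops_lead,          route: correction_capture_flow}
  approval_gate:        {owner: ops_lead,          route: approval_latency_dashboard}
  tool_denial:          {owner: platform_engineer, route: tool_gateway_backlog}
  escalation:           {owner: support_owner,     route: intent_readiness_review}
  customer_feedback:    {owner: product_manager,   route: proposal_queue}
```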
Route feedback into typed buckets
Do not let every failure become “improve the prompt.”
Use this routing table:
| Feedback | Likely fix | ContextOS primitive |
|---|---|---|
| Agent had wrong facts | context or evidence policy | Context Pack |
| Agent chose wrong tool | tool description or planner rule | Tool Gateway / Planner |
| Agent skipped approval | policy or Critic verification | Governance |
| Agent over-escalated | risk classification or rubric | Intent Catalog / Critic |
| Agent answered badly but path was right | response rubric or examples | Evaluator / prompt |
| Agent got stuck | loop guard or replan budget | Orchestration |
| Operator repeated same correction | strategy rule proposal | StrategyCompiler |
| Missing domain knowledge | research task or memory promotion | ResearchQueue / Memory |
This table keeps the team from treating the model as the only change surface.
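The routing table can also live as a triage artifact the team applies during review. A sketch, mirroring the rows above with hypothetical key names:

```yaml
# Hypothetical triage rules; the mapping mirrors the routing table above.
feedback_triage:
  wrong_facts:            {fix: context_or_evidence_policy,       primitive: context_pack}
  wrong_tool:             {fix: tool_description_or_planner_rule, primitive: tool_gateway}
  skipped_approval:       {fix: policy_or_critic_verification,    primitive: governance}
  over_escalated:         {fix: risk_classification_or_rubric,    primitive: intent_catalog}
  bad_answer_right_path:  {fix: response_rubric_or_examples,      primitive: evaluator}
  stuck_in_loop:          {fix: loop_guard_or_replan_budget,      primitive: orchestration}
  repeated_correction:    {fix: strategy_rule_proposal,           primitive: strategy_compiler}
  missing_knowledge:      {fix: research_task_or_memory_promotion, primitive: research_queue}
```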
The weekly scorecard review
Run this every week during rollout.
Agenda:
- What changed in the harness tuple?
- Did Policy or Safety floors hold?
- Which intents improved Utility?
- What happened to latency and economics?
- Which traces represent new failure modes?
- Which operator corrections repeated?
- Which approvals are creating bottlenecks?
- Which proposals should be approved, rejected, or sent back?
- Should rollout advance, hold, or roll back?
Decision options:
| Verdict | Meaning |
|---|---|
| Advance | Next rollout stage allowed |
| Hold | Keep current stage; collect more traces |
| Tune | Create bounded proposal |
| Roll back | Re-pin previous harness tuple |
| Retire | Remove intent, lane, or tool path |
The review should end with a decision, not a discussion.
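One way to force a decision is to close every review with a small, durable record. This is a sketch; the field names and values are illustrative, and the harness version shown is hypothetical.

```yaml
# Illustrative weekly scorecard record; values are examples only.
scorecard_review:
  week: 2026-W20
  intent: vendor.onboarding.compliance_check
  harness_tuple: vendor_harness_4.1.1        # hypothetical version under review
  floors_held: {policy: true, safety: true}
  utility_delta: +0.03
  latency_p95_delta: "+4%"
  new_failure_modes: [trace_vendor_802]      # placeholder trace id
  repeated_corrections: [corr_182]
  approval_bottlenecks: [bank_verification_review]
  verdict: hold              # advance | hold | tune | roll_back | retire
  rationale: "Correction corr_182 recurred; bounded tuning proposal queued."
```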
The proposal queue
Every improvement should be a proposal with evidence.
```yaml
proposal:
  id: prop_vendor_onboarding_2026w20_07
  source:
    - correction: corr_182
    - trace: trace_vendor_771
    - decision_record: dr_vendor_771
  target:
    intent: vendor.onboarding.compliance_check
    layer: planner
  change:
    add_step: bank_verification_before_sanctions_summary
  expected_effect:
    utility.operator_correction_rate: -0.04
  guardrails:
    policy_score: 1.0
    safety_score: 1.0
    latency_p95_delta: "<= +10%"
  evals:
    search_set: vendor_compliance_search_v3
    release_set: vendor_compliance_release_v2
  rollout:
    start_stage: 0%_shadow
    rollback_tuple: vendor_harness_4.1.0
```

This is how a PM turns “the agent keeps missing bank verification” into a release-gated change.
Autotune with boundaries
Autotune is useful only when the PM defines the search space.
Bad:
Improve onboarding agent quality.
Good:
```yaml
autotune_request:
  intent: vendor.onboarding.compliance_check
  primary_metric: operator_correction_rate
  direction: decrease
  allowed_surfaces:
    - context_pack.evidence_order
    - planner.tool_preference
    - response_examples
  forbidden_surfaces:
    - approval_modes
    - policy_rules
    - destructive_tool_access
  guardrails:
    policy_score: 1.0
    safety_score: 1.0
    p95_latency_delta: "<= +10%"
```

The PM owns what may change. The optimizer proposes. The release gate decides.
Treat rollout as operations
A staged rollout is not a launch ritual. It is a learning plan.
| Stage | PM learning goal |
|---|---|
| 0%_shadow | Does the agent produce correct receipts without impact? |
| 1%_internal | Can trained operators correct and trust it? |
| 5%_low_risk | Does it hold on real safe traffic? |
| 25%_monitored | Does it generalize across tenants and edge cases? |
| 100% | Can support and operations sustain it? |
Each stage should have:
- entry criteria,
- exit criteria,
- rollback trigger,
- support owner,
- scorecard window,
- open-question list.
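A sketch of one stage definition under those six headings. The criteria, thresholds, and role names are examples to show the shape, not prescribed values.

```yaml
# Example stage definition; all thresholds and names are illustrative.
rollout_stage:
  name: 5%_low_risk
  entry_criteria:
    - floors_held_at_previous_stage: true
    - operator_correction_rate: "<= 0.08"
  exit_criteria:
    - verified_completion_rate: ">= 0.92"
    - consecutive_scorecard_reviews_passed: 2
  rollback_trigger:
    - any_policy_or_safety_floor_violation
    - p95_latency_delta: "> +15%"
  support_owner: vendor_ops_oncall        # hypothetical role
  scorecard_window: 14d
  open_questions:
    - "Does bank verification hold for non-domestic vendors?"
```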
Incident review for agent products
When an agent incident happens, avoid blame-first reviews.
Use the trace:
| Incident question | Artifact |
|---|---|
| What did the user ask? | request envelope |
| What intent was selected? | Intent Catalog resolution |
| What did the model see? | CompiledContext |
| What plan was made? | Plan |
| What was verified? | Critic.verify |
| What tools were called? | Tool Gateway transcript |
| What policy fired? | policy decision id |
| What receipt was emitted? | DecisionRecord |
| What proposal prevents recurrence? | Improvement Loop record |
The incident is not closed until the improvement proposal is accepted, rejected with rationale, or explicitly deferred.
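The table above can be carried as a template so every incident review collects the same artifacts. A sketch with placeholder identifiers, reusing ids from the proposal example:

```yaml
# Incident review template; identifiers are placeholders.
incident_review:
  incident_id: inc_2026_05_14_03
  request_envelope: req_90231
  intent_resolution: vendor.onboarding.compliance_check
  compiled_context: ctx_90231
  plan: plan_90231
  critic_verification: verify_90231
  tool_gateway_transcript: gateway_90231
  policy_decision_id: pol_4481
  decision_record: dr_vendor_771
  improvement_proposal: prop_vendor_onboarding_2026w20_07
  closure: accepted        # accepted | rejected_with_rationale | deferred
```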
PM metrics after launch
Track these by intent and version:
| Metric | Why it matters |
|---|---|
| Verified completion rate | Real utility |
| Operator correction rate | Human disagreement |
| Correction recurrence | Learning loop health |
| Approval latency | Operational bottleneck |
| Escalation quality | Human fallback usefulness |
| Replay determinism | Audit health |
| Policy/safety floor violations | Release blockers |
| Cost per verified completion | Economics |
| Harness proposal acceptance rate | Improvement quality |
Do not average all intents together. A safe average can hide a dangerous high-risk path.
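One way to avoid the safe-average trap is to report floors per intent rather than one blended number. The sketch below uses made-up values and a hypothetical second intent purely to show the breakdown.

```yaml
# Made-up values; the second intent is hypothetical.
metrics_by_intent:
  vendor.onboarding.compliance_check:      # high-risk path
    verified_completion_rate: 0.81
    operator_correction_rate: 0.11
    policy_floor_violations: 0
    cost_per_verified_completion: 0.42
  vendor.onboarding.profile_update:        # low-risk path
    verified_completion_rate: 0.97
    operator_correction_rate: 0.02
    policy_floor_violations: 0
    cost_per_verified_completion: 0.09
# A blended completion rate (~0.94 if traffic skews low-risk) would hide
# the weaker high-risk path, which is why floors apply per intent.
```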
Refresh the evals
Production changes the distribution.
Every month:
- promote representative corrected runs into `dev`,
- add recurring failures to `search`,
- add only stable, reviewed cases to `release_test`,
- retire stale examples,
- rebalance rare high-risk cases,
- review grader drift.
The eval set is not static documentation. It is the product’s regression memory.
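The monthly refresh can itself be a small plan artifact. A sketch: the set names echo the proposal example, and the case identifiers are placeholders.

```yaml
# Example monthly eval refresh plan; identifiers are placeholders.
eval_refresh:
  month: 2026-06
  promote_to_dev:
    - trace_vendor_771          # corrected run, representative
  add_to_search:
    - corr_182                  # recurring failure
  add_to_release_test:
    - vendor_case_0412          # stable, reviewed
  retire:
    - vendor_case_0021          # stale example
  rebalance:
    rare_high_risk_share: 0.15
  grader_drift_review: scheduled
```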
What the PM should not delegate
PMs can delegate implementation. They should not delegate these decisions:
- Which intent matters most?
- Which error is worse?
- Which human owns approval?
- Which score is a hard floor?
- Which rollout stage is acceptable?
- Which feedback becomes a release candidate?
- Which incidents justify pausing automation?
These are product decisions wearing technical clothing.
Operating checklist
Every shipped agentic product should have:
- an owner by intent,
- a weekly scorecard,
- a trace review queue,
- a correction capture flow,
- a proposal queue,
- release gates,
- rollback tuple,
- support runbook,
- monthly eval refresh,
- incident review template.
If those do not exist, the product is not operated. It is merely deployed.