Most enterprises are asking the wrong question about AI agents.
They ask:
How autonomous should this agent be?
The better question is:
How much autonomy has this agent earned for this task, this user, this context, and this risk level?
That question needs a management primitive. Call it the Autonomy Budget.
Autonomy is not a feature you switch on. It is a budget an agent earns, spends, loses, and regains based on risk, evidence, eval performance, policy confidence, tool reliability, approval history, and operational outcomes.
That is the sharp distinction: Agent Harness is the runtime container around the model. Autonomy Budget is the control ledger inside the harness that decides what the agent is allowed to do now.
Autonomy Budget is a runtime control model that defines how much authority an AI agent is allowed to exercise in a specific context. It is computed from task risk, user intent clarity, evidence quality, tool confidence, policy constraints, eval history, financial exposure, and approval requirements.
The timing matters. The World Economic Forum says 82% of executives plan to adopt AI agents within one to three years while oversight remains immature. Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027 because of escalating cost, unclear business value, or inadequate risk controls. Capgemini reports that only 2% of organizations have deployed agents at scale, while 61% are still exploring deployment. McKinsey points to the same operating lesson: companies need strong data foundations, workflow selection, and operating models before scaling autonomy.
So the issue is not whether enterprises will try agents. They will.
The issue is whether each agent receives an explicit operating budget before it is allowed to affect customers, money, records, infrastructure, or regulated decisions.
Why binary autonomy fails
“Copilot or autonomous” is not an operating model.
Binary autonomy hides the real question: how much authority is safe for this workflow under these facts? An agent might be safe to answer policy questions, risky when drafting exception language, and unacceptable when executing a refund without evidence.
The useful modes are more granular:
| Autonomy mode | What the agent may do | Budget posture |
|---|---|---|
| Read | Retrieve, summarize, and cite evidence. | allow when access and data classification pass |
| Recommend | Produce a decision with evidence and uncertainty. | allow when policy is clear enough to score |
| Draft | Prepare a response, case note, form, or action packet. | require human edit or confirmation before effect |
| Execute with approval | Execute only after user, operator, or policy approval. | allow when approval state is resumable and logged |
| Auto-execute | Execute inside narrow, reversible, measured bounds. | allow only when evals, policy, and rollback are mature |
These are product operating modes, not a replacement for ContextOS’s canonical approval-mode tiers. The five approval modes remain exactly read_only, local_write, network, delegated, and destructive. Approval mode classifies side-effect risk. Autonomy mode says how far the agent may carry the workflow before it must ask, simulate, or escalate.
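One way to keep the two vocabularies from blurring is to model them as separate types. The sketch below is illustrative only: the class names `AutonomyMode` and `ApprovalMode` and the helper `must_pause` are hypothetical and are not ContextOS APIs. Only the mode names themselves come from the text above.

```python
from enum import Enum, IntEnum


class AutonomyMode(IntEnum):
    """Product operating modes, ordered from least to most authority."""
    READ = 1
    RECOMMEND = 2
    DRAFT = 3
    EXECUTE_WITH_APPROVAL = 4
    AUTO_EXECUTE = 5


class ApprovalMode(Enum):
    """The five approval-mode tiers named in the text (side-effect risk)."""
    READ_ONLY = "read_only"
    LOCAL_WRITE = "local_write"
    NETWORK = "network"
    DELEGATED = "delegated"
    DESTRUCTIVE = "destructive"


def must_pause(granted: AutonomyMode, needed: AutonomyMode) -> bool:
    """Return True when a step needs more authority than the budget grants."""
    return needed > granted
```

Because `AutonomyMode` is ordered and `ApprovalMode` is not, a comparison such as `must_pause(AutonomyMode.DRAFT, AutonomyMode.AUTO_EXECUTE)` is meaningful, while accidentally comparing an autonomy mode with an approval tier is easier to spot in review.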
The budget formula
No universal formula will fit every enterprise. But every serious agent program needs a scoring model explicit enough to review.
Start with the signals:
| Signal | Meaning | Example |
|---|---|---|
| Task risk | What can go wrong? | refund, payment, cancellation, legal response |
| Evidence confidence | Is the agent grounded? | retrieved policy, booking record, inventory state |
| Policy confidence | Are rules clear and applicable? | refund eligibility, fare rules, compliance policy |
| Tool reliability | Can downstream systems execute safely? | idempotent API, dry-run available, rollback available |
| Eval score | Has this workflow passed tests? | simulation pass rate, regression score, replay stability |
| User impact | How severe is the consequence? | money loss, missed flight, wrong medical or legal advice |
| Human approval need | Does this action need review? | approve, reject, escalate, ask user |
A simple first-pass model can look like this:
```
budget_score =
    0.20 * evidence_confidence
  + 0.20 * policy_confidence
  + 0.20 * eval_score
  + 0.15 * tool_reliability
  + 0.15 * reversibility
  - 0.10 * task_risk
  - 0.10 * user_impact
```

Then apply hard caps:

```
autonomy_budget =
    min(budget_score, policy_cap, workflow_cap, user_delegation_cap, tool_cap, eval_cap, exposure_cap)
```

The score suggests how much autonomy the agent has earned. The caps decide what it can never exceed.
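A minimal sketch of the score and the caps in Python may make the arithmetic easier to review. It rests on two assumptions that the text does not state: every signal is normalized to the range 0.0 to 1.0, and the earned score takes part in the `min` alongside the caps so that the lowest value binds. The names `Signals`, `budget_score`, and `autonomy_budget` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Input signals; each is assumed to be normalized to 0.0-1.0."""
    evidence_confidence: float
    policy_confidence: float
    eval_score: float
    tool_reliability: float
    reversibility: float
    task_risk: float
    user_impact: float


def budget_score(s: Signals) -> float:
    """Weighted first-pass score using the weights from the text."""
    return (
        0.20 * s.evidence_confidence
        + 0.20 * s.policy_confidence
        + 0.20 * s.eval_score
        + 0.15 * s.tool_reliability
        + 0.15 * s.reversibility
        - 0.10 * s.task_risk
        - 0.10 * s.user_impact
    )


def autonomy_budget(score: float, caps: dict) -> float:
    """The earned score can never exceed the lowest hard cap."""
    return min([score, *caps.values()])


signals = Signals(0.9, 0.9, 0.8, 0.9, 1.0, task_risk=0.3, user_impact=0.2)
score = budget_score(signals)  # approximately 0.755
budget = autonomy_budget(score, {"policy_cap": 0.6, "eval_cap": 0.9})  # policy cap binds
```

Under these weights and the 0-to-1 assumption, the raw score ranges from -0.20 to 0.90, so any thresholds built on it should be calibrated to that range rather than to 0-to-1.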
| Budget result | Runtime behavior |
|---|---|
| Very low | ask for clarification, refuse, or escalate |
| Low | read and explain only |
| Medium | recommend or draft with evidence |
| High | execute with approval inside declared limits |
| Very high | auto-execute only for low-risk, reversible, well-tested cases |
Higher budget means the agent can act with fewer approvals. Lower budget means it must ask, explain, simulate, or escalate.
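The band table can be expressed as a small lookup. This is a sketch under stated assumptions: the numeric cut-offs are invented for illustration and would need calibration against real eval and incident data, and the names `BANDS`, `BEHAVIOR`, and `runtime_band` are hypothetical. The behavior strings are taken from the table above.

```python
# Hypothetical cut-offs, highest first; not taken from the article.
BANDS = [
    (0.75, "very_high"),
    (0.60, "high"),
    (0.45, "medium"),
    (0.30, "low"),
]

BEHAVIOR = {
    "very_low": "ask for clarification, refuse, or escalate",
    "low": "read and explain only",
    "medium": "recommend or draft with evidence",
    "high": "execute with approval inside declared limits",
    "very_high": "auto-execute only for low-risk, reversible, well-tested cases",
}


def runtime_band(budget: float) -> str:
    """Map a capped budget value to one of the five bands."""
    for floor, band in BANDS:
        if budget >= floor:
            return band
    return "very_low"
```

For example, `BEHAVIOR[runtime_band(0.5)]` yields the "recommend or draft with evidence" posture under these assumed cut-offs.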
The eight dimensions
The score is not enough. The budget also needs dimensions that product, risk, operations, and engineering can inspect.
| Dimension | Budget question | Runtime expression |
|---|---|---|
| Decision scope | What decisions can the agent make without another actor deciding? | intent, task template, DecisionSpec |
| Tool scope | Which capabilities can it see and call? | compiled tool manifest, Tool Gateway policy |
| Money scope | What financial exposure can it create or recommend? | policy rule, approval gate, RunBudget.max_cost_cents |
| Data scope | What data can it read, write, retain, or reveal? | data classification, redaction, evidence refs |
| Reversibility | Can the action be undone, compensated, or only explained afterward? | reversal token, idempotency key, failure playbook |
| Evidence requirement | What proof must exist before the agent acts? | evidence manifest, required refs, frozen snapshot |
| Confidence threshold | What score must the run meet before action or promotion? | evaluator targets, Critic score, release gate |
| Blast radius | How many users, records, systems, or cases can be affected? | cohort rule, rate limit, rollout stage, kill switch |
The table prevents autonomy from becoming a personality trait. Autonomy is not whether the agent feels capable. It is the size of the envelope around its decisions and effects.
How agents earn, spend, and lose autonomy
Autonomy should not be static. It should evolve as the system proves or loses reliability.
| Event | Budget impact |
|---|---|
| Passes offline evals | Budget increases within the eval cap. |
| Performs well in shadow mode | Budget increases for that workflow and cohort. |
| Executes safely with human approval | Budget increases slowly after repeated clean traces. |
| Produces replay-stable DecisionRecords | Budget can expand to a larger rollout slice. |
| Fails policy check | Budget decreases immediately. |
| Causes user-visible error | Budget decreases sharply and routes traces to review. |
| Creates high-value financial exposure | Budget is capped regardless of score. |
| Encounters ambiguous user intent | Budget is capped until clarified. |
| Uses a new tool or new workflow version | Budget resets or drops until evals and shadow runs pass. |
In effect, this combines three established practices and applies them to agents: SRE error budgets, risk-based access control, and eval-driven release management.
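The earn-and-lose dynamic can be sketched as a small ledger. Everything numeric here is an assumption made for illustration: the event names, the size of each adjustment, the reset floor, and the default eval cap are hypothetical. The only ideas taken from the table are that gains are gradual, losses are sharp, new tools or workflow versions reset the budget, and the eval cap bounds any increase. Temporary caps (ambiguous intent, high financial exposure) are left to the hard caps and are not modeled here.

```python
from dataclasses import dataclass, field

# Hypothetical per-event adjustments: small gains, large losses.
ADJUSTMENTS = {
    "offline_eval_pass": 0.05,
    "shadow_mode_success": 0.05,
    "approved_execution_clean": 0.02,
    "policy_check_failed": -0.20,
    "user_visible_error": -0.30,
}
RESET_EVENTS = {"new_tool", "new_workflow_version"}
RESET_FLOOR = 0.30  # assumed starting budget after a reset


@dataclass
class BudgetLedger:
    score: float = 0.30
    eval_cap: float = 0.80
    history: list = field(default_factory=list)

    def record(self, event: str) -> float:
        """Apply one event, clamp to [0, eval_cap], and log the result."""
        if event in RESET_EVENTS:
            self.score = min(self.score, RESET_FLOOR)
        else:
            self.score += ADJUSTMENTS[event]  # unknown events raise KeyError
        self.score = max(0.0, min(self.score, self.eval_cap))
        self.history.append((event, self.score))
        return self.score
```

Keeping the history alongside the score means the ledger itself can answer why the budget sits where it does.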
The promotion path should be explicit:
| Move | Required proof |
|---|---|
| Read to recommend | Evidence coverage is high and missing evidence is handled clearly. |
| Recommend to draft | Humans rarely rewrite the decision packet, and policy citations are stable. |
| Draft to execute with approval | Approval packets pass review, tool arguments stay within schema, and replay is deterministic. |
| Execute with approval to auto-execute | Production slices show stable utility, zero approval bypasses, bounded cost, low correction rate, and clean rollback drills. |
Demotion should be automatic enough that it does not depend on a heroic operator noticing a trend.
| Trigger | Demotion response |
|---|---|
| Policy or safety violation | Drop one mode immediately and route affected traces to review. |
| Failed replay determinism | Freeze promotion, pin prior pack or policy, and open an incident. |
| Cost spike | Reduce tool budget or model route until cost per verified success returns to target. |
| Evidence gap | Disable actions that depend on the missing evidence source. |
| Correction cluster | Move repeated human corrections into a replay case before increasing autonomy. |
| Approval override spike | Lower execution authority and inspect whether the budget or policy is wrong. |
| Tool denial spike | Reduce tool scope or fix the manifest before the agent learns workarounds. |
| Tenant or cohort cliff | Pause rollout for the affected slice rather than averaging it away. |
If promotion is manual and demotion is political, the budget will drift upward. A real Autonomy Budget has revocation paths.
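The promotion and demotion rules above can be sketched as a one-step state machine. This is an illustration, not a prescribed design: the trigger names, the split between triggers that drop a mode and triggers that only freeze promotion, and the function names are assumptions. The mode order and the "drop one mode" and "freeze promotion" responses come from the tables.

```python
MODES = ["read", "recommend", "draft", "execute_with_approval", "auto_execute"]

# Hypothetical trigger names, grouped by the response the tables describe.
DROP_TRIGGERS = {"policy_violation", "safety_violation", "approval_override_spike"}
FREEZE_TRIGGERS = {"replay_nondeterminism"}


def promote(mode: str, proofs_met: bool, frozen: bool) -> str:
    """Move up one mode only when the required proof exists and no freeze is active."""
    i = MODES.index(mode)
    if proofs_met and not frozen and i < len(MODES) - 1:
        return MODES[i + 1]
    return mode


def demote(mode: str, trigger: str) -> tuple:
    """Return (new_mode, promotion_frozen) for a demotion trigger."""
    i = MODES.index(mode)
    if trigger in DROP_TRIGGERS:
        return MODES[max(i - 1, 0)], True
    if trigger in FREEZE_TRIGGERS:
        return mode, True
    return mode, False
```

The asymmetry is deliberate in this sketch: `promote` needs explicit proof and an unfrozen state, while `demote` needs only a trigger, which is one way to keep revocation from depending on an operator's judgment call.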
Runtime architecture inside ContextOS
In ContextOS, the Autonomy Budget is not a document. It is a runtime object.
The Planner proposes actions. The Policy layer computes the allowed autonomy. The Evaluation layer checks historical and simulated reliability. The Observability layer records every decision. The Approval layer decides whether the agent can execute, ask, simulate, or escalate.
OpenAI’s guardrails and human review guidance makes the same boundary visible at the SDK level: guardrails and human review define whether a run should continue, pause, or stop. ContextOS turns that boundary into a cross-plane operating contract.
| ContextOS plane | Role in the Autonomy Budget |
|---|---|
| Intelligence | Supplies identity, ontology, knowledge, memory, and provenance so the budget binds to stable subjects and facts. |
| Context | Compiles the budget into evidence requirements, source priorities, redaction, tool visibility, and RunBudget limits. |
| Decision | Runs Planner, Critic, Executor, scoring, and consolidation inside the declared decision scope and confidence thresholds. |
| Action | Enforces tool scope, argument constraints, idempotency, delegation, approval mode, and side-effect boundaries. |
| Trust | Owns policy, approvals, scorecards, replay, rollout gates, incident review, and promotion or demotion decisions. |
The budget should appear in the artifacts that matter: pack metadata, policy bundles, rollout config, scorecard thresholds, approval packets, DecisionRecords, replay packets, and operating review notes.
If it exists only in a slide, it will not govern anything.
What must be logged
An Autonomy Budget cannot work without observability. Every autonomy decision should leave a trace that explains why the agent was allowed to continue, paused for review, or stopped.
| Trace field | Why it matters |
|---|---|
| User intent | What the agent believed the user wanted. |
| Task class | Refund, booking, search, modification, support, content generation. |
| Risk tier | Low, medium, high, critical. |
| Budget granted | Read, recommend, draft, execute with approval, auto-execute. |
| Evidence used | Policies, APIs, memory, retrieved documents. |
| Policy checks | Passed, failed, uncertain. |
| Eval context | Golden set, shadow score, regression score, production slice. |
| Tool calls | What was called and with what arguments. |
| Approval status | Auto-approved, user-approved, human-approved, rejected. |
| Outcome | Success, failure, rollback, escalation. |
| Budget adjustment | Increased, unchanged, reduced. |
This trace is what lets a business owner ask “why did the agent get this much freedom?” and receive an answer that is more precise than “the prompt said it could.”
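As a rough sketch, the trace fields in the table could be captured in a single record that serializes to JSON. The class name `AutonomyTrace`, the field types, and the sample values are assumptions for illustration; only the field list comes from the table.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AutonomyTrace:
    user_intent: str
    task_class: str
    risk_tier: str          # low | medium | high | critical
    budget_granted: str     # read | recommend | draft | execute_with_approval | auto_execute
    evidence_used: list
    policy_checks: dict     # check name -> passed | failed | uncertain
    eval_context: dict
    tool_calls: list        # each entry: tool name plus arguments
    approval_status: str
    outcome: str
    budget_adjustment: str  # increased | unchanged | reduced

    def to_json(self) -> str:
        """Serialize with stable key order so traces diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example record for a refund request.
trace = AutonomyTrace(
    user_intent="initiate refund",
    task_class="refund",
    risk_tier="medium",
    budget_granted="draft",
    evidence_used=["booking_policy", "payment_status"],
    policy_checks={"refund_eligibility": "passed"},
    eval_context={"regression_score": 0.94},
    tool_calls=[{"tool": "payments.calculate_refund", "args": {"booking": "B-1"}}],
    approval_status="user-approved",
    outcome="success",
    budget_adjustment="unchanged",
)
```

A record like this gives the business owner's question a concrete target: the `budget_granted`, `policy_checks`, and `eval_context` fields together explain the freedom the agent received on that run.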
Example: customer support refund agent
Refund support is a good example because the same user journey moves from harmless explanation to real financial exposure.
| Scenario | Budget decision |
|---|---|
| User asks for the refund policy. | Agent can answer directly with policy citations. |
| User asks whether a booking is eligible. | Agent can retrieve the booking and explain eligibility. |
| User asks to initiate a refund. | Agent can draft an action packet and ask for confirmation. |
| Refund amount is small and policy is clear. | Agent can execute only if policy permits delegated execution and the user confirms. |
| Refund amount is high or policy is ambiguous. | Agent must escalate with a frozen evidence packet. |
| User is angry and asks for an exception. | Agent can summarize the case but cannot approve the exception. |
That operating envelope can be represented as a management-level contract. This is not a replacement for the Context Pack schema. It is the budget object that should compile into pack policy, run budgets, approval gates, tool manifests, scorecards, and rollout rules.
```json
{
  "budget_id": "autobudget_support_refund_l3",
  "agent_id": "agent:support/refund@3.4.0",
  "workflow": "support.refund_resolution",
  "autonomy_mode": "execute_with_approval",
  "decision_scope": [
    "eligibility_check",
    "refund_offer_draft",
    "case_note_update"
  ],
  "tool_scope": {
    "auto_execute": [
      "orders.read_booking",
      "payments.calculate_refund",
      "cases.create_note"
    ],
    "requires_approval": [
      "payments.issue_refund"
    ]
  },
  "money_scope": {
    "currency": "INR",
    "max_recommendation_amount": 5000,
    "approval_required_above": 1000,
    "auto_execute_amount": 0
  },
  "data_scope": {
    "max_classification": "CONFIDENTIAL",
    "write_targets": ["case_note"],
    "retention_policy": "tenant_default"
  },
  "evidence_required": [
    "booking_policy",
    "payment_status",
    "customer_identity",
    "refund_history"
  ],
  "confidence_thresholds": {
    "policy": 1.0,
    "safety": 1.0,
    "utility": 0.92
  },
  "blast_radius": {
    "max_users_per_run": 1,
    "max_cases_per_hour": 50,
    "rollout_stage": "25%_monitored"
  },
  "rollback_mode": "pin_previous_pack_and_route_to_review_queue"
}
```

Notice the shape: the agent can do useful work without receiving unrestricted authority over money movement. It can read, calculate, recommend, draft, and write case notes. It can prepare a refund action packet, but actual refund execution remains approval-gated unless a later budget change earns a narrower execution path through replay and scorecard evidence.
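To show how such a budget object might be enforced at the tool boundary, here is a minimal gate over the `tool_scope` and `money_scope` fields. It is a sketch, not the ContextOS Tool Gateway: the function name `gate_tool_call`, the three return values, and the reading of `max_recommendation_amount` as a hard ceiling are assumptions made for this example.

```python
# Subset of the budget object above, inlined so the example is self-contained.
BUDGET = {
    "tool_scope": {
        "auto_execute": [
            "orders.read_booking",
            "payments.calculate_refund",
            "cases.create_note",
        ],
        "requires_approval": ["payments.issue_refund"],
    },
    "money_scope": {
        "currency": "INR",
        "max_recommendation_amount": 5000,
        "approval_required_above": 1000,
        "auto_execute_amount": 0,
    },
}


def gate_tool_call(budget: dict, tool: str, amount: int = 0, approved: bool = False) -> str:
    """Return 'allow', 'needs_approval', or 'deny' for a proposed tool call."""
    scope = budget["tool_scope"]
    money = budget["money_scope"]
    if tool in scope["auto_execute"]:
        return "allow"
    if tool in scope["requires_approval"]:
        # Assumed rule: amounts above the recommendation ceiling must escalate.
        if amount > money["max_recommendation_amount"]:
            return "deny"
        return "allow" if approved else "needs_approval"
    return "deny"  # tool is outside the declared scope
```

With this gate, `payments.issue_refund` for 800 INR returns `needs_approval` until a confirmation is recorded, while a tool that is not listed in either scope is denied outright rather than silently attempted.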
The budget is not anti-autonomy. It is how autonomy becomes promotable.
Autonomy Budget checklist
Before an agent gains more authority, the review should answer these questions:
- What value did the current autonomy mode create?
- Which decisions did humans still correct, and why?
- Which evidence sources were missing, stale, or disputed?
- Which tools were denied, retried, or used near limits?
- Which cohorts performed worse than the average?
- Did cost per verified success improve or degrade?
- Could replay reproduce the accepted DecisionRecords?
- Has rollback been rehearsed, not only documented?
- What exact budget field changes if autonomy increases?
- Who owns the risk if the promotion is wrong?
The last question keeps the system honest. Autonomy is not earned by a model in the abstract. It is granted by an organization that can prove the agent is creating value inside a controlled envelope.
Governed freedom
The future of enterprise agents is not full autonomy.
It is budgeted autonomy: earned, bounded, observable, reversible, and continuously evaluated.
The future enterprise will not ask:
Do we trust AI agents?
It will ask:
What autonomy budget has this agent earned?
That phrase changes the conversation. It moves the room from hype to governance, from fear to operating design, and from demos to evidence.
An agent without an Autonomy Budget is just a system with hidden delegation. An agent with one can be promoted, demoted, audited, replayed, and improved.
That is the difference between trying agents and operating them.
