Agentic Incident Command Center

Multi-agent IT and security incident response with policy gates, evidence snapshots, and reversible execution.

Use Case Playbook
At a glance

Purpose

Run an incident from signal to verified remediation without turning the agent into an unbounded admin. ContextOS coordinates detector, diagnosis, change-planning, communications, and remediation agents against the same evidence snapshot, then gates every network, privileged, or destructive action.

Why this is agentic-first

Classic automation works when the incident matches a runbook. Real incidents often cross APM, SIEM, cloud, ticketing, IAM, and customer communications. Gartner uses cybersecurity threat response as an example of task-specific agents that scan logs and behavior, assess the issue, and initiate response. ServiceNow describes agent orchestration across network management, SIEM, APM, and human network approval for incident resolution.

This is a ContextOS-native use case because the problem is not “summarize logs.” It is “decide what can safely be changed, who must approve it, and how to replay the decision later.”

Context Pack

| Layer | Required entries |
|---|---|
| decision_layer.decision_specs[] | incident.triage.classify, incident.remediation.execute, incident.customer_update.publish |
| policy_layer.policy_bundles[] | Severity policy, change-freeze policy, service ownership policy, data-disclosure policy |
| policy_layer.approval_gates[] | GATE_SRE_APPROVAL, GATE_SECURITY_APPROVAL, GATE_CHANGE_MANAGER |
| tooling_layer.adapter_registry[] | adp_apm.query, adp_siem.search, adp_cloud.scale, adp_iam.revoke_session, adp_statuspage.post, adp_ticket.update |
| memory_layer.write_classes_allowed | incident_pattern, runbook_correction, decision_outcome |
| evaluation_layer.eval_targets[] | false-positive rate, remediation rollback rate, approval latency, time-to-diagnosis |
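
As an illustration only, the Context Pack above could be held as a nested mapping and checked for completeness before an incident run starts. The layer and entry names come from the table; the dictionary layout, the short policy bundle identifiers, and the `missing_entries` helper are assumptions made for this sketch, not a documented ContextOS schema.

```python
# Hypothetical in-memory form of the Context Pack. Layer and key names follow
# the table above; the dict layout and bundle identifiers are assumptions.
CONTEXT_PACK = {
    "decision_layer": {
        "decision_specs": [
            "incident.triage.classify",
            "incident.remediation.execute",
            "incident.customer_update.publish",
        ],
    },
    "policy_layer": {
        "policy_bundles": [
            "severity", "change_freeze", "service_ownership", "data_disclosure",
        ],
        "approval_gates": [
            "GATE_SRE_APPROVAL", "GATE_SECURITY_APPROVAL", "GATE_CHANGE_MANAGER",
        ],
    },
    "tooling_layer": {
        "adapter_registry": [
            "adp_apm.query", "adp_siem.search", "adp_cloud.scale",
            "adp_iam.revoke_session", "adp_statuspage.post", "adp_ticket.update",
        ],
    },
    "memory_layer": {
        "write_classes_allowed": [
            "incident_pattern", "runbook_correction", "decision_outcome",
        ],
    },
    "evaluation_layer": {
        "eval_targets": [
            "false_positive_rate", "remediation_rollback_rate",
            "approval_latency", "time_to_diagnosis",
        ],
    },
}

# Every (layer, key) pair the table lists as required.
REQUIRED_ENTRIES = [
    ("decision_layer", "decision_specs"),
    ("policy_layer", "policy_bundles"),
    ("policy_layer", "approval_gates"),
    ("tooling_layer", "adapter_registry"),
    ("memory_layer", "write_classes_allowed"),
    ("evaluation_layer", "eval_targets"),
]


def missing_entries(pack: dict) -> list[str]:
    """Return the dotted names of required entries that are absent or empty."""
    missing = []
    for layer, key in REQUIRED_ENTRIES:
        if not pack.get(layer, {}).get(key):
            missing.append(f"{layer}.{key}")
    return missing
```

A pack that omits `policy_layer.approval_gates` would be reported as incomplete, which is one way to refuse an incident run before any agent is invoked.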

Agent roles

| Agent | Responsibility | Boundary |
|---|---|---|
| Signal Agent | Correlates alerts, traces, deploys, and user-impact signals. | Read-only across observability and ticketing. |
| Diagnosis Agent | Proposes likely root cause and confidence. | Cannot change systems. |
| Remediation Planner | Produces candidate actions with blast radius and rollback plan. | Cannot execute without policy decision. |
| Communications Agent | Drafts internal and external status updates. | Publication is gated. |
| Guardian Agent | Verifies policy, scope, evidence, and rollback readiness. | Can block or escalate any step. |
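
One way to make these boundaries enforceable is a deny-by-default capability table. The sketch below is an assumption about how the roles might be encoded; the `Capability` enum, the agent keys, and `is_allowed` are hypothetical names. In this sketch no agent holds `EXECUTE` or `PUBLISH`, reflecting that execution and publication happen only after a gate decision.

```python
from enum import Enum


class Capability(Enum):
    READ = "read"
    PROPOSE = "propose"
    DRAFT = "draft"
    BLOCK = "block"
    ESCALATE = "escalate"
    EXECUTE = "execute"   # held by no agent in this sketch
    PUBLISH = "publish"   # held by no agent in this sketch


# Hypothetical encoding of the role table. Granting READ to every agent is an
# assumption; the table only states it explicitly for the Signal Agent.
AGENT_CAPABILITIES = {
    "signal_agent": {Capability.READ},
    "diagnosis_agent": {Capability.READ, Capability.PROPOSE},
    "remediation_planner": {Capability.READ, Capability.PROPOSE},
    "communications_agent": {Capability.READ, Capability.DRAFT},
    "guardian_agent": {Capability.READ, Capability.BLOCK, Capability.ESCALATE},
}


def is_allowed(agent: str, capability: Capability) -> bool:
    """Deny by default: unknown agents and unlisted capabilities are refused."""
    return capability in AGENT_CAPABILITIES.get(agent, set())
```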

Execution flow

  1. invokeAgent arrives with intent=incident.resolve, service, symptoms, and active alert IDs.
  2. Compiler pins the service map, ownership graph, change calendar, policy bundle, and observability snapshot.
  3. Signal Agent correlates APM, logs, traces, deploy history, and open incidents.
  4. Diagnosis Agent emits candidate causes with evidence references.
  5. Remediation Planner emits one or more plans, each with risk tier, rollback, and expected user impact.
  6. Guardian Agent checks blast radius and policy constraints.
  7. Low-risk read-only checks execute inline. Scaling, IAM, config, database, or network actions route through the required gate.
  8. The approved action executes through the Tool Gateway with idempotency and rollback metadata.
  9. Communications Agent drafts updates; publication is gated when customer-visible.
  10. ContextOS emits a DecisionRecord with evidence, approvals, controls, trace, and post-action verification.
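
Step 7 can be sketched as a small routing function: read-only checks need no gate, and everything else is sent to at least one gate. The category-to-gate mapping below is an illustrative assumption (the playbook names the categories and the gates but does not pair them one to one), and `PlannedAction` and `required_gates` are hypothetical names.

```python
from dataclasses import dataclass

# Assumed mapping from the action categories in step 7 to approval gates.
GATE_BY_CATEGORY = {
    "scaling": "GATE_SRE_APPROVAL",
    "config": "GATE_SRE_APPROVAL",
    "database": "GATE_SRE_APPROVAL",
    "iam": "GATE_SECURITY_APPROVAL",
    "network": "GATE_SECURITY_APPROVAL",
}


@dataclass(frozen=True)
class PlannedAction:
    adapter: str                    # e.g. "adp_cloud.scale"
    category: str                   # "read", "scaling", "iam", ...
    in_change_freeze: bool = False  # taken from the pinned change calendar


def required_gates(action: PlannedAction) -> list[str]:
    """Return the gates an action must pass; an empty list means inline."""
    if action.category == "read":
        return []
    if action.category not in GATE_BY_CATEGORY:
        # Fail closed: an unclassified action must never run ungated.
        raise ValueError(f"unclassified action category: {action.category!r}")
    gates = [GATE_BY_CATEGORY[action.category]]
    if action.in_change_freeze:
        gates.append("GATE_CHANGE_MANAGER")
    return gates
```

Raising on an unknown category, rather than defaulting to inline execution, keeps a misclassified action from bypassing approval.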

Decision gates

| Gate | Trigger | Required evidence |
|---|---|---|
| GATE_SRE_APPROVAL | Any production remediation with customer impact. | service owner, active incident, rollback plan, blast radius |
| GATE_SECURITY_APPROVAL | IAM, token revocation, firewall, containment, or forensic action. | security severity, affected identities, containment rationale |
| GATE_CHANGE_MANAGER | Change-freeze window, regulated service, or irreversible migration. | change calendar, exception reason, business owner |
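
A gate request can be rejected before it reaches a human if its evidence is incomplete. The following is a minimal sketch under that assumption: the evidence field names are snake_case renderings of the table's "Required evidence" column, and `missing_evidence` is a hypothetical helper.

```python
# Required evidence per gate, taken from the table above. The snake_case
# field names are an assumption made for this sketch.
REQUIRED_EVIDENCE = {
    "GATE_SRE_APPROVAL": {
        "service_owner", "active_incident", "rollback_plan", "blast_radius",
    },
    "GATE_SECURITY_APPROVAL": {
        "security_severity", "affected_identities", "containment_rationale",
    },
    "GATE_CHANGE_MANAGER": {
        "change_calendar", "exception_reason", "business_owner",
    },
}


def missing_evidence(gate: str, evidence: dict) -> list[str]:
    """Return required evidence fields that are absent or empty, sorted.

    Raises KeyError for an unknown gate so that a typo cannot silently
    approve a request.
    """
    required = REQUIRED_EVIDENCE[gate]
    provided = {
        name for name, value in evidence.items()
        if value not in (None, "", [], {})
    }
    return sorted(required - provided)
```

A request to `GATE_SRE_APPROVAL` with an empty `rollback_plan` would return `["rollback_plan"]`, and the approval step could refuse to proceed until that list is empty.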

Failure modes

  • Alert storm - evidence pack must deduplicate alerts by service, symptom, deploy, and user impact before planning.
  • Wrong root cause - Guardian requires post-action verification and can force replan when the metric does not recover.
  • Unsafe remediation - policy blocks actions without rollback, owner approval, or tenant boundary proof.
  • Tool side-effect race - all privileged tools require idempotency keys and current-state preconditions.
  • Incident communication leak - external updates must pass data-disclosure policy before publishing.
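
Two of these mitigations lend themselves to a short sketch: deduplicating an alert storm, and guarding a privileged call with an idempotency key plus a current-state precondition. The alert field names, `deduplicate_alerts`, and `GuardedExecutor` are hypothetical, and the executor keeps results in process memory only, so it is neither durable nor thread-safe.

```python
from typing import Any, Callable


def deduplicate_alerts(alerts: list[dict]) -> list[dict]:
    """Keep the first alert per (service, symptom, deploy, user_impact)."""
    seen = set()
    unique = []
    for alert in alerts:
        key = (
            alert["service"],
            alert["symptom"],
            alert.get("deploy"),
            alert.get("user_impact"),
        )
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


class GuardedExecutor:
    """Runs an action at most once per idempotency key, and only when the
    observed state still matches the state the plan was built against."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def execute(
        self,
        idempotency_key: str,
        expected_state: Any,
        read_state: Callable[[], Any],
        action: Callable[[], Any],
    ) -> Any:
        # A retry with the same key returns the stored result, no re-run.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        current = read_state()
        if current != expected_state:
            raise RuntimeError(
                f"precondition failed: expected {expected_state!r}, "
                f"found {current!r}"
            )
        result = action()
        self._results[idempotency_key] = result
        return result
```

Checking the precondition immediately before the call narrows the race window but does not close it; a real gateway would likely need the target system to enforce the precondition atomically.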

Metrics

  • Mean time to classify, mean time to propose, mean time to approve, mean time to restore.
  • Remediation acceptance rate by severity.
  • Rollback rate and failed-change rate.
  • False-correlation rate between alert and root cause.
  • Evidence completeness at approval.
  • Human override and Guardian block rate.
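
The timing metrics can be derived from per-incident stage timestamps. The sketch below assumes each incident is a mapping with a `detected` timestamp and optional `classified`, `proposed`, `approved`, and `restored` timestamps, and that every "mean time to" value is measured from detection; both the record shape and that reference point are assumptions, as are the function names.

```python
from datetime import datetime
from statistics import mean

# Stage timestamp name -> metric name (stage names are assumptions).
METRIC_BY_STAGE = {
    "classified": "mean_time_to_classify",
    "proposed": "mean_time_to_propose",
    "approved": "mean_time_to_approve",
    "restored": "mean_time_to_restore",
}


def mean_times(incidents: list[dict[str, datetime]]) -> dict[str, float]:
    """Mean seconds from detection to each stage.

    Incidents that never reached a stage are left out of that stage's mean;
    a stage no incident reached is omitted from the result.
    """
    result = {}
    for stage, metric in METRIC_BY_STAGE.items():
        samples = [
            (inc[stage] - inc["detected"]).total_seconds()
            for inc in incidents
            if stage in inc
        ]
        if samples:
            result[metric] = mean(samples)
    return result


def rollback_rate(executed: int, rolled_back: int) -> float:
    """Share of executed remediations that were rolled back."""
    return rolled_back / executed if executed else 0.0
```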

Research signals