About Piyush Kumar
I am Piyush Kumar, an AI platform builder and systems thinker focused on making AI agents reliable for real-world enterprise work.
Through ContextOS AI, I write about the architecture of governed intelligence systems: agents with memory, context, tools, policies, evaluations, observability, and human oversight.
I help technology leaders, product leaders, architects, and builders move from impressive AI demos to production-grade agentic systems.
Building governed intelligence systems for the agentic era
My focus is not AI as a chatbot layer. It is AI as a new execution layer for business, one that needs context, memory, tools, policies, evaluations, observability, and governance.
ContextOS AI is my attempt to explain this shift in simple, practical language. The goal is to make complex AI architecture understandable, actionable, and useful for people building serious systems.
What I care about
The next phase of AI will not be won by simply calling larger models. It will be won by organizations that design systems where AI can:
My work
AI Platforms
Reusable foundations for agents, copilots, personalization systems, and decisioning engines.
Agentic AI Architecture
Planners, executors, memory, tools, guardrails, evaluations, and human approval loops.
Personalization and Context
Systems that understand users, journeys, intent, preferences, and situational context.
Data and Decision Infrastructure
Data platforms, knowledge graphs, feature systems, embeddings, telemetry, and business rules.
Evaluation and Observability
Scorecards, traces, simulations, and feedback systems for AI that is useful, safe, and improving.
Why ContextOS AI exists
Most AI discussions focus on models. In production, the model is only one part of the system.
Real enterprise AI needs an operating environment around the model: context assembly, memory retrieval, tool orchestration, policy enforcement, approval workflows, audit trails, cost controls, quality evaluations, and continuous learning loops.
ContextOS AI exists to document that architecture shift.
My perspective
I believe AI agents should not be treated as magic workers. They should be treated as governed digital operators.
Every agent needs a clear contract:
Professional background
Over the last several years, I have worked on large-scale consumer technology platforms across AI, personalization, data infrastructure, marketing technology, customer experience, and digital commerce.
I have built and scaled high-traffic, high-reliability systems with high business impact across personalization, growth, conversational AI, customer experience automation, data platforms, evaluation systems, and AI-native product experiences.
That practical exposure shapes the way I write: less abstract AI hype, more systems that can survive production reality.
Current areas of exploration
ContextOS
A governed runtime where memory, tools, policies, evaluations, and observability form a common intelligence layer.
Agent Harness Engineering
The discipline required to make agents repeatable, measurable, safe, and production-ready.
AI Memory Systems
How long-term memory, short-term context, user preferences, knowledge graphs, and retrieval should work together.
Tokenomics
How enterprises should measure the cost, value, and reliability of AI beyond simple token cost.
Evaluation-first AI
Scorecards, simulation, regression testing, and feedback loops before systems are trusted.
Business Leadership in AI
AI adoption as operating infrastructure, not a collection of isolated tools.
A simple belief
AI will not replace systems thinking. It will reward it.
The organizations that win with AI will be the ones that combine models, data, tools, workflows, policies, and people into coherent systems. That is the future I am interested in building and explaining.
Writing
Essays, frameworks, architecture notes, and implementation-oriented thinking by Piyush Kumar.
Reversibility Is the Missing Safety Primitive for AI Agents
Prevention decides whether agents may act. Reversibility lets them survive being wrong through reversal contracts, compensation, and blast-radius caps.
AI Tokenomics: From Cost per Token to Cost per Trusted Outcome
AI tokenomics connects cost per token, agentic cost multipliers, routing, evals, governance, and cost per trusted outcome.
The Autonomy Budget: How Enterprises Should Decide What AI Agents Are Allowed to Do
A practical governance model for granting AI agents bounded authority based on risk, evidence, policy confidence, evals, and approval.
Antahkarana Stack: A Cognitive Layer for Local-First Agents
A builder-facing explanation of Antahkarana as an engineering layer inspired by the inner faculties of Manas, Buddhi, Chitta, and Ahamkara.
Agent Harness: An Architectural Framework for Production AI Agents
A whitepaper on typed contracts, policy gates, traces, verification loops, and release control for production AI agents.
Agent Identity Is the New Trust Boundary
A practical model for separating agent identity, workload proof, user delegation, scoped authority, and audit across MCP and A2A.
ContextOS: A Research-Grounded Architecture for Governed Agent Runtimes
A research-grounded framing of ContextOS as a governed runtime for context, tools, memory, security, evaluation, replay, and optimization.
Agentic Incident Command Center: Agents Can Coordinate, Boundaries Still Decide
How incident-response agents can coordinate signal, diagnosis, remediation, communications, and approvals without bypassing operational boundaries.
AI Gateway and LLM Router: Model Choice Is a Runtime Decision
How an AI Gateway and LLM Router make model choice policy-bound, budgeted, observable, and replayable across production agent workflows.
Financial Crime Operations: Agentic AI Needs Evidence, Not Autonomy
How KYC, AML, sanctions, and fraud casework can use agentic workflows while preserving evidence, policy gates, and human adjudication.
The Identity Layer: Agents Need Two Identities, Not One
Why governed agent runs need entity identity, delegated user identity, and workload identity in the same RunContext.
MCP Adapters in Production: The Manifest Is the Safety Boundary
How MCP fits behind a production adapter manifest with schemas, auth, approval modes, idempotency, observability, and replay.
Harness Improvement Loops Need Replayable Environments
Why harness improvement needs replayable episodes, bounded mutations, scorecards, source closure, and promotion gates.
AI Does Not Launch Once: Feedback Loops After Go-Live
A plain-English guide to operating agents after launch: corrections, recurring failures, proposal queues, rollout, rollback, and review.
How to Judge AI Work: Scorecards, Not Vibes
A practical guide for business teams evaluating AI agents with scorecards, examples, traces, human corrections, and launch gates instead of demos and vibes.
Trusting AI at Work: Approvals, Boundaries, and Receipts
A plain-English guide to agent trust: what AI can read, draft, send, change, approve, and how receipts make decisions accountable.
Before Your Team Asks for an AI Agent, Map the Real Work
A practical guide for business teams mapping real work before building agents: actors, evidence, tools, decisions, risks, exceptions, and feedback loops.
AI Agents for Business Leaders: Build the Airport, Not Just the Plane
A practical executive playbook for agentic AI: define the work, evidence, authority, scorecards, approvals, security, observability, and improvement loop.
Operating Agent Products: Feedback, Rollout, and the Improvement Loop
A PM operating model for shipped agents: trace review, corrections, proposal queues, scorecards, rollout, and rollback.
Trust Is a Product Surface: Approval Modes and Human Control for Agentic Products
How PMs should design trust for real agentic products: approval modes, human roles, evidence snapshots, DecisionRecords, policy gates, and graceful failure.
Scorecards Before Screens: Evals and Launch Gates for PMs Building Agents
A PM guide to defining agent quality with datasets, trace reviews, scorecards, release gates, and business metrics before building the agent UI.
The Control Tower Pattern: How PMs Should Design Multi-Agent Products
A PM guide to splitting multi-agent systems into specialist lanes while keeping orchestration governed and inspectable.
From PRD to Intent Catalog: The PM Spec for Agentic Products
How PMs turn vague agent ideas into intent catalogs, task templates, authority models, DecisionRecords, and launch criteria.
Product Managers: How to Think About and Build Complex Agentic Systems
A practical PM guide to building agentic systems with workflow maps, intents, context packs, tools, records, evals, and rollout gates.
How to Develop an Agent with an Agent Harness, End to End
An end-to-end field guide for building agents as measurable harnesses: context, planning, tools, records, evals, rollout, and learning.
Autotune the Harness: Baking the Improvement Loop into ContextOS
How ContextOS treats autotune as a gated loop over traces, scorecards, replay sets, bounded candidates, approval, and rollout.
Dataset-First Agent Engineering: The Golden Sets Behind Reliable Agents
A practical guide to golden sets, task distributions, corrected runs, held-out releases, and production slices for agent engineering.
How Great AI Engineers Build Agents: Datasets, Scores, and Harnesses That Improve
Why strong AI engineers build datasets, scorecards, traces, and improvement loops instead of treating agents as prompts plus tools.
Harness Candidates Are Model Checkpoints: How to Improve Agents Without Silent Mutation
How to treat every prompt, retrieval, tool, policy, and evaluator change as a scored, reviewed, reversible harness candidate.
Scorecards Over Vibes: The Five Metrics That Keep Agents Honest
The five metrics that keep agents honest: policy, utility, latency, safety, and economics.
Trace Review Is the Agent Debugger: Grade the Path, Not Just the Answer
How trace review grades the path, not just the answer, by inspecting context, plans, tools, guardrails, critic verdicts, and corrections.
Agentic AI Systems Before and After ContextOS
A table-first guide to why agentic systems need bounded context, governed tools, typed decisions, replay, evaluation, and controlled improvement.
AGENTS.md Done Right: The Navigation File That Actually Helps Coding Agents
How to write AGENTS.md as a short, scoped, testable navigation file for coding agents instead of a bloated prompt dump.
The Agent Harness Audit: A Production Readiness Checklist for Governed AI Agents
A production readiness audit for agent harnesses: forty runtime controls grouped into eight evidence-backed outcomes.
Replay Harness in Code: Reproducing a DecisionRecord Byte-for-Byte
A TypeScript build-along for replay: input loading, hash-chain verification, canonical loop replay, and DecisionRecord diffing.
End-to-End Refund: How 12 Primitives Compose in One Production Run
A single refund run traced through 12 ContextOS primitives, from invokeAgent envelope to byte-equal replay.
Failure Playbooks: The Typed Verdict Map
How to replace generic retry loops with typed failure verdicts, compensations, escalation paths, and reversal-token checks.
Approval Gates in Code: The Destructive-Mode Handshake
A build-along for approval gates: frozen evidence, human signatures, gateway redemption, and replayable destructive-action handshakes.
Build the Tool Gateway: The Boundary That Actually Stops a Bad Action
A build-along for the Tool Gateway: adapter manifests, typed envelopes, resolver checks, dispatch, and destructive-action boundaries.
The Critic: verify, score, consolidate — in 80 Lines
A compact Critic implementation that verifies plans, scores outcomes, consolidates results, and records caveats.
The Five Planes of Agentic Operating Systems
A working decomposition for production agent systems: Intelligence, Context, Decision, Action, and Trust.
Promotion-Aware Memory: Capture, Review, Promote, Recall in Code
A build-along for agent memory: capture, review, promote, recall, contradiction checks, and governed memory writes.
Build the Context Pack Compiler: Eight Stages, Eight Files
A build-along for the Context Pack compiler: eight deterministic stages that turn runtime inputs into a typed compiled context.
Context Graphs: Decision Lineage as a System of Record
How hash-chained DecisionRecords turn execution-time context into a queryable lineage graph for why an agent acted.
From Operator Correction to Released StrategyRule: The Improvement Loop, Coded
How one operator correction becomes a reviewed, replayed, versioned StrategyRule that prevents repeat agent failures.
Pack Rollout in Five Stages: Shipping a Context Pack Without Blowing Up Production
A five-stage rollout model for Context Packs: shadow, internal, low-risk, monitored expansion, and full release, plus rollback.
Replay Is the Real Audit Log
Why "we have logs" is not an audit story, and what a hash-chained Decision Record plus canonical replay actually buys you when an incident hits.
Wiring the Five Evaluators: Policy, Utility, Latency, Safety, Cost
A build-along for wiring policy, utility, latency, safety, and cost evaluators into a release-gated scorecard.
Context Packs in Practice: From Spec to Run
A practical walkthrough of Context Packs: buckets, policy bundles, evaluation gates, lifecycle, and the compile pipeline.
Building a Reliability Reviewer Agent: 70 Lines Past the Compliance One
How to extend the reviewer pattern for reliability: timeouts, retries, idempotency, fallback behavior, and rollback declarations.
Building a Compliance Reviewer Agent in 60 Lines and a Golden Set
How to build a compliance reviewer agent with a typed verdict envelope, rubric, golden set, and change-control queue.
Approval-Mode Tiers: A Risk Taxonomy You Can Actually Ship
Why ad-hoc approval gates rot in production, and how five canonical risk tiers turn governance from a meeting into a contract.
Beyond Prompts: The Architecture of Trust for Agentic AI
Building a governed decision runtime across Intelligence, Context, Decision, Action, and Trust — with evaluator scoring, approval tiers, and replay-bound audit.
Prompt Injection Is a Boundary Problem, Not a Prompt Problem
Why "smarter prompts" don't defend against indirect prompt injection, and what changes when authority lives outside the model's view.
Context Engineering in Production
Why most agent failures are context failures rather than model failures, and what changes when context becomes a versioned, testable, replayable contract.