About Piyush Kumar

I am Piyush Kumar, an AI platform builder and systems thinker focused on making AI agents reliable for real-world enterprise work.

Through ContextOS AI, I write about the architecture of governed intelligence systems: agents with memory, context, tools, policies, evaluations, observability, and human oversight.

Focus

Helping technology leaders, product leaders, architects, and builders move from impressive AI demos to production-grade agentic systems.

Building governed intelligence systems for the agentic era

My focus is not AI as a chatbot layer. It is AI as a new execution layer for business, one that needs context, memory, tools, policies, evaluations, observability, and governance.

ContextOS AI is my attempt to explain this shift in simple, practical language. The goal is to make complex AI architecture understandable, actionable, and useful for people building serious systems.

What I care about

The next phase of AI will not be won by simply calling larger models. It will be won by organizations that design systems where AI can:

understand business context
reason over changing situations
call the right tools safely
remember what matters
operate within clear authority boundaries
explain why an action was taken
improve through feedback and evaluations
remain observable, auditable, and governed

My work

AI Platforms

Reusable foundations for agents, copilots, personalization systems, and decisioning engines.

Agentic AI Architecture

Planners, executors, memory, tools, guardrails, evaluations, and human approval loops.

Personalization and Context

Systems that understand users, journeys, intent, preferences, and situational context.

Data and Decision Infrastructure

Data platforms, knowledge graphs, feature systems, embeddings, telemetry, and business rules.

Evaluation and Observability

Scorecards, traces, simulations, and feedback systems for useful, safe, improving AI.

Why ContextOS AI exists

Most AI discussions focus on models. In production, the model is only one part of the system.

Real enterprise AI needs an operating environment around the model: context assembly, memory retrieval, tool orchestration, policy enforcement, approval workflows, audit trails, cost controls, quality evaluations, and continuous learning loops.

ContextOS AI exists to document that architecture shift.

What makes an AI agent trustworthy?
How should enterprises govern autonomous workflows?
What is the right architecture for agent memory?
How should AI decisions be evaluated?
What does observability mean for agentic systems?
How should business leaders think about AI beyond chatbots?
How do we move from prompt engineering to system engineering?

My perspective

I believe AI agents should not be treated as magic workers. They should be treated as governed digital operators.

Every agent needs a clear contract:

what work it can do
what evidence it must use
what tools it can call
what authority it has
what risks it must check
when it must ask for approval
how output will be evaluated
how behavior will improve over time
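
One way to picture such a contract is as a small typed record that a runtime checks before an agent acts. The sketch below is illustrative only: `AgentContract`, `decide`, the field names, and the refund example are hypothetical and are not taken from ContextOS. It assumes authority can be approximated by a single spend cap and that approval is triggered by an amount threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentContract:
    """Hypothetical contract mirroring the eight items listed above."""
    allowed_work: frozenset       # what work it can do
    required_evidence: frozenset  # what evidence it must use
    allowed_tools: frozenset      # what tools it can call
    max_amount: float             # what authority it has (illustrative spend cap)
    risk_checks: tuple            # what risks it must check (names only here)
    approval_above: float         # when it must ask for approval
    scorecard: tuple              # how output will be evaluated
    feedback_channel: str         # how behavior will improve over time


def decide(contract, task, tool, evidence, amount):
    """Return 'deny', 'needs_approval', or 'allow' for a proposed action."""
    # Work and tools outside the contract are refused outright.
    if task not in contract.allowed_work or tool not in contract.allowed_tools:
        return "deny"
    # Every required piece of evidence must be present.
    if not contract.required_evidence <= set(evidence):
        return "deny"
    # Amounts beyond the agent's authority are refused, not escalated.
    if amount > contract.max_amount:
        return "deny"
    # Within authority but above the threshold: a human must approve.
    if amount > contract.approval_above:
        return "needs_approval"
    return "allow"


# Assumed example: a refund agent with a 500 cap and approval above 100.
refund_contract = AgentContract(
    allowed_work=frozenset({"refund"}),
    required_evidence=frozenset({"order_record", "payment_record"}),
    allowed_tools=frozenset({"issue_refund"}),
    max_amount=500.0,
    risk_checks=("duplicate_refund",),
    approval_above=100.0,
    scorecard=("policy", "utility"),
    feedback_channel="operator_corrections",
)
```

The point of the sketch is that the decision lives outside the model: the agent proposes an action, and the contract decides whether it runs, waits for a human, or is refused.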

Professional background

Over the last several years, I have worked on large-scale consumer technology platforms across AI, personalization, data infrastructure, marketing technology, customer experience, and digital commerce.

My experience includes building and scaling systems with high traffic, high reliability, and high business impact across personalization, growth, conversational AI, customer experience automation, data platforms, evaluation systems, and AI-native product experiences.

That practical exposure shapes the way I write: less abstract AI hype, more systems that can survive production reality.

Current areas of exploration

ContextOS

A governed runtime where memory, tools, policies, evaluations, and observability form a common intelligence layer.

Agent Harness Engineering

The discipline required to make agents repeatable, measurable, safe, and production-ready.

AI Memory Systems

How long-term memory, short-term context, user preferences, knowledge graphs, and retrieval should work together.

Tokenomics

How enterprises should measure the cost, value, and reliability of AI beyond simple token cost.

Evaluation-first AI

Scorecards, simulation, regression testing, and feedback loops before systems are trusted.

Business Leadership in AI

AI adoption as operating infrastructure, not a collection of isolated tools.

A simple belief

AI will not replace systems thinking. It will reward it.

The organizations that win with AI will be the ones that combine models, data, tools, workflows, policies, and people into coherent systems. That is the future I am interested in building and explaining.

Writing

Essays, frameworks, architecture notes, and implementation-oriented thinking by Piyush Kumar.

55 essays

May 30, 2026·18 min read·Intermediate

Reversibility Is the Missing Safety Primitive for AI Agents

Prevention decides whether agents may act. Reversibility lets them survive being wrong through reversal contracts, compensation, and blast-radius caps.

May 26, 2026·16 min read·Beginner

AI Tokenomics: From Cost per Token to Cost per Trusted Outcome

AI tokenomics connects cost per token, agentic cost multipliers, routing, evals, governance, and cost per trusted outcome.

May 23, 2026·12 min read·Intermediate

The Autonomy Budget: How Enterprises Should Decide What AI Agents Are Allowed to Do

A practical governance model for granting AI agents bounded authority based on risk, evidence, policy confidence, evals, and approval.

May 20, 2026·16 min read·Intermediate

Antahkarana Stack: A Cognitive Layer for Local-First Agents

A builder-facing explanation of Antahkarana as an engineering layer inspired by the inner faculties of Manas, Buddhi, Chitta, and Ahamkara.

May 19, 2026·33 min read·Intermediate

Agent Harness: An Architectural Framework for Production AI Agents

A whitepaper on typed contracts, policy gates, traces, verification loops, and release control for production AI agents.

May 17, 2026·13 min read·Intermediate

Agent Identity Is the New Trust Boundary

A practical model for separating agent identity, workload proof, user delegation, scoped authority, and audit across MCP and A2A.

May 16, 2026·28 min read·Intermediate

ContextOS: A Research-Grounded Architecture for Governed Agent Runtimes

A research-grounded framing of ContextOS as a governed runtime for context, tools, memory, security, evaluation, replay, and optimization.

May 14, 2026·5 min read·Beginner

Agentic Incident Command Center: Agents Can Coordinate, Boundaries Still Decide

How incident-response agents can coordinate signal, diagnosis, remediation, communications, and approvals without bypassing operational boundaries.

May 14, 2026·6 min read·Intermediate

AI Gateway and LLM Router: Model Choice Is a Runtime Decision

How an AI Gateway and LLM Router make model choice policy-bound, budgeted, observable, and replayable across production agent workflows.

May 14, 2026·5 min read·Beginner

Financial Crime Operations: Agentic AI Needs Evidence, Not Autonomy

How KYC, AML, sanctions, and fraud casework can use agentic workflows while preserving evidence, policy gates, and human adjudication.

May 14, 2026·6 min read·Intermediate

The Identity Layer: Agents Need Two Identities, Not One

Why governed agent runs need entity identity, delegated user identity, and workload identity in the same RunContext.

May 14, 2026·5 min read·Intermediate

MCP Adapters in Production: The Manifest Is the Safety Boundary

How MCP fits behind a production adapter manifest with schemas, auth, approval modes, idempotency, observability, and replay.

May 14, 2026·7 min read·Intermediate

Harness Improvement Loops Need Replayable Environments

Why harness improvement needs replayable episodes, bounded mutations, scorecards, source closure, and promotion gates.

May 13, 2026·8 min read·Beginner

AI Does Not Launch Once: Feedback Loops After Go-Live

A plain-English guide to operating agents after launch: corrections, recurring failures, proposal queues, rollout, rollback, and review.

May 13, 2026·4 min read·Beginner

How to Judge AI Work: Scorecards, Not Vibes

A practical guide for business teams evaluating AI agents with scorecards, examples, traces, human corrections, and launch gates instead of demos and vibes.

May 13, 2026·4 min read·Beginner

Trusting AI at Work: Approvals, Boundaries, and Receipts

A plain-English guide to agent trust: what AI can read, draft, send, change, approve, and how receipts make decisions accountable.

May 13, 2026·4 min read·Beginner

Before Your Team Asks for an AI Agent, Map the Real Work

A practical guide for business teams mapping real work before building agents: actors, evidence, tools, decisions, risks, exceptions, and feedback loops.

May 13, 2026·20 min read·Beginner

AI Agents for Business Leaders: Build the Airport, Not Just the Plane

A practical executive playbook for agentic AI: define the work, evidence, authority, scorecards, approvals, security, observability, and improvement loop.

May 13, 2026·6 min read·Beginner

Operating Agent Products: Feedback, Rollout, and the Improvement Loop

A PM operating model for shipped agents: trace review, corrections, proposal queues, scorecards, rollout, and rollback.

May 13, 2026·6 min read·Beginner

Trust Is a Product Surface: Approval Modes and Human Control for Agentic Products

How PMs should design trust for real agentic products: approval modes, human roles, evidence snapshots, DecisionRecords, policy gates, and graceful failure.

May 13, 2026·6 min read·Beginner

Scorecards Before Screens: Evals and Launch Gates for PMs Building Agents

A PM guide to defining agent quality with datasets, trace reviews, scorecards, release gates, and business metrics before building the agent UI.

May 13, 2026·7 min read·Beginner

The Control Tower Pattern: How PMs Should Design Multi-Agent Products

A PM guide to splitting multi-agent systems into specialist lanes while keeping orchestration governed and inspectable.

May 13, 2026·6 min read·Beginner

From PRD to Intent Catalog: The PM Spec for Agentic Products

How PMs turn vague agent ideas into intent catalogs, task templates, authority models, DecisionRecords, and launch criteria.

May 13, 2026·18 min read·Beginner

Product Managers: How to Think About and Build Complex Agentic Systems

A practical PM guide to building agentic systems with workflow maps, intents, context packs, tools, records, evals, and rollout gates.

May 12, 2026·17 min read·Intermediate

How to Develop an Agent with an Agent Harness, End to End

An end-to-end field guide for building agents as measurable harnesses: context, planning, tools, records, evals, rollout, and learning.

May 12, 2026·11 min read·Intermediate

Autotune the Harness: Baking the Improvement Loop into ContextOS

How ContextOS treats autotune as a gated loop over traces, scorecards, replay sets, bounded candidates, approval, and rollout.

May 12, 2026·7 min read·Intermediate

Dataset-First Agent Engineering: The Golden Sets Behind Reliable Agents

A practical guide to golden sets, task distributions, corrected runs, held-out releases, and production slices for agent engineering.

May 12, 2026·13 min read·Intermediate

How Great AI Engineers Build Agents: Datasets, Scores, and Harnesses That Improve

Why strong AI engineers build datasets, scorecards, traces, and improvement loops instead of treating agents as prompts plus tools.

May 12, 2026·6 min read·Intermediate

Harness Candidates Are Model Checkpoints: How to Improve Agents Without Silent Mutation

How to treat every prompt, retrieval, tool, policy, and evaluator change as a scored, reviewed, reversible harness candidate.

May 12, 2026·6 min read·Intermediate

Scorecards Over Vibes: The Five Metrics That Keep Agents Honest

The five metrics that keep agents honest: policy, utility, latency, safety, and economics.

May 12, 2026·6 min read·Intermediate

Trace Review Is the Agent Debugger: Grade the Path, Not Just the Answer

How trace review grades the path, not just the answer, by inspecting context, plans, tools, guardrails, critic verdicts, and corrections.

May 9, 2026·18 min read·Intermediate

Agentic AI Systems Before and After ContextOS

A table-first guide to why agentic systems need bounded context, governed tools, typed decisions, replay, evaluation, and controlled improvement.

May 9, 2026·9 min read·Intermediate

AGENTS.md Done Right: The Navigation File That Actually Helps Coding Agents

How to write AGENTS.md as a short, scoped, testable navigation file for coding agents instead of a bloated prompt dump.

May 9, 2026·18 min read·Intermediate

The Agent Harness Audit: A Production Readiness Checklist for Governed AI Agents

A production readiness audit for agent harnesses: forty runtime controls grouped into eight evidence-backed outcomes.

May 9, 2026·6 min read·Intermediate

Replay Harness in Code: Reproducing a DecisionRecord Byte-for-Byte

A TypeScript build-along for replay: input loading, hash-chain verification, canonical loop replay, and DecisionRecord diffing.

May 8, 2026·5 min read·Intermediate

End-to-End Refund: How 12 Primitives Compose in One Production Run

A single refund run traced through 12 ContextOS primitives, from invokeAgent envelope to byte-equal replay.

May 7, 2026·6 min read·Intermediate

Failure Playbooks: The Typed Verdict Map

How to replace generic retry loops with typed failure verdicts, compensations, escalation paths, and reversal-token checks.

May 6, 2026·5 min read·Intermediate

Approval Gates in Code: The Destructive-Mode Handshake

A build-along for approval gates: frozen evidence, human signatures, gateway redemption, and replayable destructive-action handshakes.

May 5, 2026·5 min read·Intermediate

Build the Tool Gateway: The Boundary That Actually Stops a Bad Action

A build-along for the Tool Gateway: adapter manifests, typed envelopes, resolver checks, dispatch, and destructive-action boundaries.

May 2, 2026·5 min read·Intermediate

The Critic: verify, score, consolidate — in 80 Lines

A compact Critic implementation that verifies plans, scores outcomes, consolidates results, and records caveats.

April 29, 2026·10 min read·Intermediate

The Five Planes of Agentic Operating Systems

A working decomposition for production agent systems: Intelligence, Context, Decision, Action, and Trust.

April 25, 2026·6 min read·Intermediate

Promotion-Aware Memory: Capture, Review, Promote, Recall in Code

A build-along for agent memory: capture, review, promote, recall, contradiction checks, and governed memory writes.

April 21, 2026·6 min read·Intermediate

Build the Context Pack Compiler: Eight Stages, Eight Files

A build-along for the Context Pack compiler: eight deterministic stages that turn runtime inputs into a typed compiled context.

April 18, 2026·8 min read·Intermediate

Context Graphs: Decision Lineage as a System of Record

How hash-chained DecisionRecords turn execution-time context into a queryable lineage graph for why an agent acted.

April 15, 2026·7 min read·Intermediate

From Operator Correction to Released StrategyRule: The Improvement Loop, Coded

How one operator correction becomes a reviewed, replayed, versioned StrategyRule that prevents repeat agent failures.

April 11, 2026·7 min read·Intermediate

Pack Rollout in Five Stages: Shipping a Context Pack Without Blowing Up Production

A five-stage rollout model for Context Packs: shadow, internal, low-risk, monitored expansion, and full release, with rollback.

April 8, 2026·8 min read·Intermediate

Replay Is the Real Audit Log

Why "we have logs" is not an audit story, and what a hash-chained Decision Record plus canonical replay actually buys you when an incident hits.

April 5, 2026·6 min read·Intermediate

Wiring the Five Evaluators: Policy, Utility, Latency, Safety, Cost

A build-along for wiring policy, utility, latency, safety, and cost evaluators into a release-gated scorecard.

March 26, 2026·8 min read·Intermediate

Context Packs in Practice: From Spec to Run

A practical walkthrough of Context Packs: buckets, policy bundles, evaluation gates, lifecycle, and the compile pipeline.

March 18, 2026·4 min read·Intermediate

Building a Reliability Reviewer Agent: 70 Lines Past the Compliance One

How to extend the reviewer pattern for reliability: timeouts, retries, idempotency, fallback behavior, and rollback declarations.

March 15, 2026·6 min read·Intermediate

Building a Compliance Reviewer Agent in 60 Lines and a Golden Set

How to build a compliance reviewer agent with a typed verdict envelope, rubric, golden set, and change-control queue.

March 14, 2026·8 min read·Intermediate

Approval-Mode Tiers: A Risk Taxonomy You Can Actually Ship

Why ad-hoc approval gates rot in production, and how five canonical risk tiers turn governance from a meeting into a contract.

March 2, 2026·14 min read·Intermediate

Beyond Prompts: The Architecture of Trust for Agentic AI

Building a governed decision runtime across Intelligence, Context, Decision, Action, and Trust — with evaluator scoring, approval tiers, and replay-bound audit.

February 21, 2026·9 min read·Intermediate

Prompt Injection Is a Boundary Problem, Not a Prompt Problem

Why "smarter prompts" don't defend against indirect prompt injection, and what changes when authority lives outside the model's view.

February 4, 2026·8 min read·Intermediate

Context Engineering in Production

Why most agent failures are not model failures — they are context failures — and what changes when context becomes a versioned, testable, replayable contract.
