AI Agent Platform Checklist: Identity, Memory, and Governance
Most platform evaluations fail at the same spot. Not on model quality. Not on pricing. They fail because the buyer asks ‘do you support governance?’ and the vendor says yes — and both parties agree that means something.
It doesn’t. ‘We have admin settings’ is not governance. Governance means you can prove who acted, what the agent decided, and why — and reconstruct that chain after the fact. The gap between those two definitions is where production incidents live. If you’re evaluating an AI agent platform, this is the checklist you should be running.
There’s a specific failure mode that shows up more than any other during production approvals. I’ll get to it in the identity section — but first, the framework.
Why Governance Gaps Don’t Show Up Until It’s Too Late
Traditional automation was deterministic. A workflow engine executed predefined steps. You could map every branch in advance. If something broke, you knew exactly which branch to inspect.
Agentic AI doesn’t work that way. The agent chooses its steps dynamically. That’s what makes it useful — and what makes it harder to predict. A chatbot has a low blast radius. Robotic process automation (software that clicks buttons and fills forms like a person would) is predictable because it follows scripts. An agent that selects its own tools and routes its own decisions introduces a control challenge neither of those systems required.
This is why point-of-login protection isn’t enough. Some agent sessions run for hours. An agent approved to act at 9 AM may still be executing at midnight — and anything that authenticated it at login but doesn’t validate it continuously is a control gap, not a control.
The 4 Areas Every Platform Evaluation Must Cover
Evaluation questions cluster into four categories. Run through all four before any platform conversation goes past demo stage.
Identity & Lifecycle
How the platform provisions, authenticates, and deprovisions agent identities. Non-human identities carry the same oversight requirements as human ones — they just don't have someone watching their inbox.
Persistent Context & Memory
Whether agents maintain coherent state across sessions, channels, and workflows — and whether that state is isolated per agent or shared in ways that create unexpected cross-contamination.
Approvals & Policy Gates
Whether human-in-the-loop checkpoints exist for high-risk actions, and whether those checkpoints are enforced by the platform or just described in documentation.
Auditability & Explainability
Whether every decision can be reconstructed — not as an aggregate summary, but as a traceable chain: input, reasoning, tool call, output, and any approval that preceded it.
The Declared-vs-Observed Gap Nobody Checks
Here’s the failure mode I flagged earlier. It’s the most common reason agents get approved for production when they shouldn’t.
An agent declares access to 47 APIs. In practice, it touches 3. The other 44 are either dormant attack surface or undetected drift — capabilities the agent could reach if its behavior shifted, but that nobody is actively monitoring. The declared permission set and the observed behavior set don’t match. And nobody flags the delta.
This isn’t a configuration mistake. It’s a structural gap in how most platforms handle identity. They provision access at setup and never reconcile it against what the agent actually uses. The fix requires a platform that can show you both numbers — declared and observed — and alert you when they diverge.
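The two numbers are cheap to compute once tool calls are logged. Here is a minimal sketch, assuming a permission manifest available as a set of tool names and a call log where each entry carries a `tool` field. The names `AccessReport` and `reconcile` and the log shape are illustrative assumptions, not any platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessReport:
    declared: frozenset[str]
    observed: frozenset[str]

    @property
    def unused(self) -> frozenset[str]:
        # Declared but never called: dormant attack surface.
        return self.declared - self.observed

    @property
    def undeclared(self) -> frozenset[str]:
        # Called but never declared: drift or a policy bypass.
        return self.observed - self.declared


def reconcile(declared: set[str], tool_call_log: list[dict]) -> AccessReport:
    """Compare a permission manifest against logged tool calls."""
    observed = {entry["tool"] for entry in tool_call_log}
    return AccessReport(frozenset(declared), frozenset(observed))


# Hypothetical data mirroring the 47-declared, 3-used example.
declared = {f"api_{i}" for i in range(47)}
log = [{"tool": "api_0"}, {"tool": "api_1"}, {"tool": "api_2"}, {"tool": "api_0"}]
report = reconcile(declared, log)
print(len(report.declared), len(report.observed), len(report.unused))  # 47 3 44
```

A non-empty `undeclared` set is the more urgent alert. A non-empty `unused` set is the one that quietly grows if nobody looks.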
The Identity Gate: What to Verify
Treat every agent like a workforce identity with a defined job role and least-privilege access. That’s not a metaphor — it’s the actual control model. An agent that can reach a tool it doesn’t need for its defined function is an agent with unnecessary blast radius.
- Centralized provisioning: Is there a single place where agent identities are created, scoped, and deprovisioned? Or does each team/workflow spin up agents independently?
- Lifecycle management: When a project ends or a workflow changes, is the agent identity automatically deprovisioned — or does it persist indefinitely with full credentials?
- Dynamic controls: Are access policies evaluated continuously, or only at session start? Static controls can’t keep up with agents that operate for hours.
- Declared vs. observed reconciliation: Does the platform log which tools the agent actually calls, and does it surface a comparison against declared permissions?
- Cross-agent isolation: If one agent is compromised or behaves unexpectedly, can it affect another agent’s memory, credentials, or execution context?
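To make the dynamic-controls question concrete, here is a small sketch of a policy check that runs on every tool call instead of once at session start. `AgentIdentity` and `authorize` are hypothetical names, and the three checks (deprovisioned, expired, out of scope) are an assumed minimum, not a complete policy engine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AgentIdentity:
    agent_id: str
    allowed_tools: frozenset[str]
    expires_at: datetime
    revoked: bool = False


def authorize(identity: AgentIdentity, tool: str, now: datetime) -> tuple[bool, str]:
    """Evaluate policy at call time, not at session start."""
    if identity.revoked:
        return False, "identity deprovisioned"
    if now >= identity.expires_at:
        return False, "credential expired"
    if tool not in identity.allowed_tools:
        return False, "tool outside declared scope"
    return True, "ok"


start = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
agent = AgentIdentity("invoice-bot", frozenset({"read_invoice"}), start + timedelta(hours=8))

print(authorize(agent, "read_invoice", start))                        # allowed at 9 AM
print(authorize(agent, "read_invoice", start + timedelta(hours=15)))  # denied at midnight
agent.revoked = True
print(authorize(agent, "read_invoice", start + timedelta(hours=1)))   # denied once revoked
```

The point of the design is that revocation takes effect on the next call. A session-start check would have let all three calls through.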
The Memory Gate: Persistent Context That Doesn’t Leak
Persistent memory is what separates an agent from a stateless chatbot. But persistent memory without isolation controls is a liability.
The concern isn’t just data retention — it’s whether context bleeds between agents, between users, or between workflow runs in ways that produce unexpected behavior. A support agent that retains conversation context from a previous customer’s session and surfaces it in the next one isn’t persisting memory. It’s leaking it.
- Session continuity: Does the agent maintain context across disconnected sessions — or does each conversation start cold?
- Isolation model: Is memory scoped to the specific agent instance, or shared across agents? What are the boundaries?
- Shared state across channels: If the same agent handles a request via email and then via messaging, does it maintain coherent state — or does it behave differently depending on entry point?
- Retention policy: How long is context retained? Is there a mechanism for expiry or controlled deletion?
- Long-session controls: For workflows that run for hours, are there checkpoints that validate the agent’s context hasn’t been tampered with or corrupted mid-run?
The enterprise standard is one agent with shared state and shared control — not parallel implementations that behave differently depending on where or how the conversation starts. If a platform can’t demonstrate consistent behavior across entry points, shared state is a marketing claim, not a feature.
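One way to picture the isolation and retention questions together is a store keyed by both agent and subject, with expiry applied on read. This is a sketch only: `ScopedMemory`, its method names, and the sample fact are hypothetical, and a real platform would persist state somewhere other than a Python dict.

```python
import time


class ScopedMemory:
    """In-memory store keyed by (agent_id, subject_id) with time-based expiry."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: dict[tuple[str, str], list[tuple[float, str]]] = {}

    def remember(self, agent_id: str, subject_id: str, fact: str) -> None:
        self._items.setdefault((agent_id, subject_id), []).append((self._clock(), fact))

    def recall(self, agent_id: str, subject_id: str) -> list[str]:
        # Only the exact (agent, subject) scope is readable; expired entries are dropped.
        key = (agent_id, subject_id)
        cutoff = self._clock() - self._ttl
        kept = [(t, f) for t, f in self._items.get(key, []) if t > cutoff]
        if kept:
            self._items[key] = kept
        else:
            self._items.pop(key, None)
        return [f for _, f in kept]

    def forget(self, agent_id: str, subject_id: str) -> None:
        # Controlled deletion for one scope.
        self._items.pop((agent_id, subject_id), None)


now = [0.0]  # fake clock so the example is deterministic
memory = ScopedMemory(ttl_seconds=3600, clock=lambda: now[0])
memory.remember("support-agent", "customer-a", "order delayed")
print(memory.recall("support-agent", "customer-a"))  # ['order delayed']
print(memory.recall("support-agent", "customer-b"))  # []
now[0] = 7200.0
print(memory.recall("support-agent", "customer-a"))  # []
```

The second call is the leak test from the support-agent example: the same agent, a different customer, and nothing carried over.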
The Approvals Gate: Human-in-the-Loop That Actually Enforces
This is the section where most platforms fail quietly. They have approval workflows in the documentation. Those workflows are rarely enforced by the platform itself.
There’s a difference between ‘you can configure approvals’ and ‘the platform requires approvals for high-risk tool categories.’ The first is a setting. The second is a control.
- Policy gates per tool call: Is there a mechanism to require human approval before specific tool categories — financial transactions, data deletion, external communications?
- Approval enforcement vs. suggestion: Does the platform block action until approval is received, or does it proceed and notify after the fact?
- Blast radius tolerance: Has the platform defined — and can it enforce — maximum impact bounds for autonomous action? E.g., transaction caps, rate limits on external calls.
- Escalation paths: If an agent can’t get approval (human unavailable, timeout), what happens? Does it halt, retry, or proceed?
- Approval audit trail: Is each approval decision logged with timestamp, approver identity, and the action that was approved?
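The difference between a setting and a control shows up in what happens when approval does not arrive. A rough sketch, assuming a hypothetical `request_approval` callback that returns a decision and an approver: the action runs only on an explicit approval, and a timeout halts it. The category names and function names are illustrative.

```python
from datetime import datetime, timezone
from enum import Enum


class Decision(Enum):
    APPROVED = "approved"
    DENIED = "denied"
    TIMED_OUT = "timed_out"


class ApprovalRequired(Exception):
    """Raised when a high-risk action is blocked at the gate."""


# Assumed high-risk categories; a real platform would load these from policy.
HIGH_RISK = {"payments", "data_deletion", "external_email"}


def execute_tool(category, action, request_approval, audit_log):
    """Run `action` only if the gate clears. Anything but APPROVED halts."""
    if category in HIGH_RISK:
        decision, approver = request_approval(category)
        audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "category": category,
            "decision": decision.value,
            "approver": approver,
        })
        if decision is not Decision.APPROVED:
            raise ApprovalRequired(f"{category} blocked: {decision.value}")
    return action()


approval_log, sent = [], []
try:
    # Simulate an approver who never responds.
    execute_tool("external_email", lambda: sent.append("mail"),
                 lambda category: (Decision.TIMED_OUT, None), approval_log)
except ApprovalRequired as exc:
    print(exc)  # external_email blocked: timed_out
print(sent)     # []
```

Halting on timeout is a choice made for this sketch. Whatever a platform chooses, the behavior should be explicit and logged, which is why the decision is recorded before the action is attempted.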
The Auditability Gate: Black Box vs. Reconstructable System
If your agent can take action but you cannot reconstruct the full chain of decision, policy check, tool call, and human approval afterward, you don’t have an auditable system. You have a black box with logs.
The distinction matters for compliance and it matters operationally. When something goes wrong — and eventually something will — ‘the agent did it’ is not a root cause. You need to show what data the agent considered, what reasoning led to the action, which tool was called, and whether a policy gate was cleared.
Regulatory explainability requirements are moving in one direction: every individual decision needs to be traceable. Not in aggregate. Not as a statistical summary. Every decision, from input to reasoning to output, reconstructable after the fact.
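Reconstruction is easiest to reason about when every log event carries a decision identifier and a stage. The sketch below assumes exactly that: the field names `decision_id`, `seq`, and `stage`, and the four required stages, are assumptions for illustration and not a standard log schema.

```python
REQUIRED_STAGES = ("input", "reasoning", "policy_check", "tool_call")


def reconstruct(events: list[dict], decision_id: str) -> tuple[list[dict], list[str]]:
    """Return the ordered chain for one decision plus any stages the log lacks."""
    chain = sorted(
        (e for e in events if e["decision_id"] == decision_id),
        key=lambda e: e["seq"],
    )
    present = {e["stage"] for e in chain}
    missing = [s for s in REQUIRED_STAGES if s not in present]
    return chain, missing


# Hypothetical events, deliberately out of order and missing a policy check.
events = [
    {"decision_id": "d-1", "seq": 3, "stage": "tool_call", "detail": "refund issued"},
    {"decision_id": "d-1", "seq": 1, "stage": "input", "detail": "customer requested refund"},
    {"decision_id": "d-2", "seq": 1, "stage": "input", "detail": "unrelated request"},
    {"decision_id": "d-1", "seq": 2, "stage": "reasoning", "detail": "within refund policy"},
]
chain, missing = reconstruct(events, "d-1")
print([e["stage"] for e in chain])  # ['input', 'reasoning', 'tool_call']
print(missing)                      # ['policy_check']
```

A non-empty `missing` list is the black-box signal: the action is in the log, but the chain that justifies it is not.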
- Immutable audit trail: Is every action logged in a tamper-proof record? Can that log be altered by agents or admins after the fact?
- Decision-level traceability: Does the log capture the specific data the agent considered, not just the final action?
- Reasoning chain: Is the agent’s chain of reasoning — not just its output — captured in a form that can be reviewed?
- Tool call logging: Are individual tool invocations logged with inputs, outputs, and timestamps?
- Cross-session query: Can you query across agents and sessions to find all actions that touched a specific dataset, user, or resource?
- Compliance export: Can audit logs be exported in formats required by your compliance framework?
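A common way to make a log tamper-evident is to chain each entry to the hash of the one before it. The sketch below shows the idea with Python's standard `hashlib` and `json` modules. `AuditLog` and its methods are hypothetical, and hash chaining is a technique I am swapping in for illustration, not something the checklist prescribes. It detects edits but does not prevent them, so the latest hash still needs to be anchored somewhere agents and admins cannot rewrite.

```python
import hashlib
import json


def _digest(prev_hash: str, record: dict) -> str:
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


class AuditLog:
    GENESIS = "0" * 64

    def __init__(self):
        self.entries: list[dict] = []

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append({"record": record, "prev": prev, "hash": _digest(prev, record)})

    def verify(self) -> bool:
        # Recompute every hash; any edited record breaks the chain.
        prev = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["record"]):
                return False
            prev = entry["hash"]
        return True

    def touching(self, resource: str) -> list[dict]:
        # Cross-agent query: every record that references one resource.
        return [e["record"] for e in self.entries if e["record"].get("resource") == resource]


audit = AuditLog()
audit.append({"agent": "billing-bot", "tool": "refund", "resource": "acct-7", "input": {"amount": 40}})
audit.append({"agent": "support-bot", "tool": "lookup", "resource": "acct-7", "input": {}})
print(audit.verify())                                  # True
print(len(audit.touching("acct-7")))                   # 2
audit.entries[0]["record"]["input"]["amount"] = 4000   # simulated tampering
print(audit.verify())                                  # False
```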
Where Evaluations Break Down: Three Patterns We’ve Seen
The identity gap is the most common. Three other patterns are worth naming.
The governance vocabulary problem. Many vendors use ‘AI governance’ to mean ‘we have admin settings.’ Ask them to demonstrate governance in a live session — specifically: show me a logged decision with its full reasoning chain. Most can’t. The ones that can are the ones to keep evaluating.
The pilot expansion trap. Governance controls that work in a controlled pilot often break when the agent handles higher volume, more tool categories, or multi-agent orchestration. The standard recommendation is to start with one narrow workflow and expand only after controls hold up in real use — not in a sandbox.
The session duration blind spot. Most security reviews focus on login and authentication. They miss the continuous validation problem: an agent authenticated at session start may run for hours, and anything that doesn’t re-validate access continuously treats a long session as one point-in-time trust decision.
Your Pre-Production Governance Checklist
Run these before any platform goes live, and before any expansion of an agent’s scope or tool access.
Map every agent identity
List all agents currently provisioned — names, owners, tools they can access, and when they were last reviewed. If you can't produce this list in under 10 minutes, centralized governance isn't operational yet.
Compare declared vs. observed access
Pull the permission manifest for each agent and compare it against actual tool usage logs from the last 30 days. Any tool declared but never observed is either dormant attack surface or configuration drift. Flag both.
Verify continuous session validation
Check whether your platform re-validates agent credentials and access scope during long-running sessions, not only at login. If validation happens only at login, every session that outlives its initial check is a gap: a scope change or revocation made mid-session is not enforced until the session ends or its token expires.
Test approval enforcement on a high-risk tool
Trigger a tool call that should require approval. Verify that the platform blocks the action until approval is received — not that it proceeds and notifies. If it proceeds, your approvals are notifications, not gates.
Reconstruct a recent decision
Pick a logged agent action from the past week and try to reconstruct the full chain: what data was considered, what reasoning led to the action, which tool was called, and whether a policy gate was cleared. If you can't do this in under 15 minutes, auditability is incomplete.
Define your re-approval triggers
Document the conditions that require a new production approval: new tool access, expanded scope, behavioral baseline shift, or any incident. Agents that never require re-approval are agents that accumulate drift.
Set blast radius bounds before expanding
Before adding tool categories or moving from a single workflow to multi-agent orchestration, define the maximum impact the system may have when acting autonomously: transaction caps, rate limits, data scope. Write these down and verify that the platform can enforce them, not just document them.
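Written-down bounds only count if something checks them on every action. As an illustration, here is a sketch of a guard that enforces a per-transaction cap and a sliding-window rate limit. `BlastRadiusGuard` and its parameters are hypothetical, and the caller passes in the time so the example stays deterministic.

```python
from collections import deque


class BlastRadiusGuard:
    """Enforce a per-transaction cap and a sliding-window rate limit."""

    def __init__(self, max_amount: float, max_calls: int, window_seconds: float):
        self.max_amount = max_amount
        self.max_calls = max_calls
        self.window = window_seconds
        self._calls: deque[float] = deque()

    def check(self, amount: float, now: float) -> tuple[bool, str]:
        # Drop calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window:
            self._calls.popleft()
        if amount > self.max_amount:
            return False, "transaction cap exceeded"
        if len(self._calls) >= self.max_calls:
            return False, "rate limit exceeded"
        self._calls.append(now)  # only permitted calls count against the limit
        return True, "ok"


guard = BlastRadiusGuard(max_amount=500, max_calls=2, window_seconds=60)
print(guard.check(100, now=0))   # (True, 'ok')
print(guard.check(900, now=1))   # (False, 'transaction cap exceeded')
print(guard.check(100, now=2))   # (True, 'ok')
print(guard.check(100, now=3))   # (False, 'rate limit exceeded')
print(guard.check(100, now=61))  # (True, 'ok')
```

The last call succeeds because the first permitted call has aged out of the 60-second window.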
If you want a reference point for how these controls map to platform-level configuration, the BrainRoad Console Guide covers how the console surfaces agent identity and tool scope — useful context for understanding what these controls look like in practice.
The Governance Standard That Actually Holds
Pick the platform that can answer the hard questions in a live demo, not on a slide.
Ask them to show you declared permissions next to observed tool usage. Ask them to reconstruct a logged decision. Ask them what happens when an agent can’t get approval in time. Ask them what triggers re-approval. The answers tell you whether governance is a feature or a marketing term.
For a deeper look at what governance architecture actually looks like at the platform level — including how identity, memory, and approval boundaries interact — the article on AI governance platforms covers the structural questions that sit underneath any specific checklist.
If you want the operator-side product proof after that, read What Is the BrainRoad AI Company? Your First 15 Minutes. It shows what these controls feel like in a real first-run path instead of a procurement worksheet.
Use the parent platform lens before you buy.
Go back to the AI agent platform pillar if you want the higher-level buying framework, then use this checklist during live demos and production reviews.
What This Means for Your Platform Decision
- 91% of organizations are running AI agents in production. Only 10% have a well-developed governance strategy. The gap isn’t awareness — it’s operational verification.
- The declared-vs-observed permission gap is the most common reason agents get approved for production when they shouldn’t. Any platform evaluation must surface both numbers.
- Session security for long-running agents requires continuous validation — not just point-of-login authentication.
- Governance means reconstructable decisions, not admin settings. Every individual decision should be traceable from input to reasoning to output.
- Start with one narrow workflow. Expand only after controls hold in real use. This is not a best-practice suggestion; it’s the only approach that validates governance under actual conditions.
- Platform governance claims should be verified in a live demo: show me the decision chain, show me declared vs. observed, show me what happens when approval isn’t received in time.
Frequently Asked Questions
What's the difference between agent identity and user identity in an AI platform?
User identities belong to humans who interact with the system. Agent identities belong to the agents themselves — the non-human software entities that take actions, call tools, and execute workflows. Both require lifecycle management: provisioning, scoping, and deprovisioning. The difference is that agent identities don’t have someone watching their inbox when permissions accumulate or credentials expire. Centralized governance of agent identities closes that gap.
Why aren't static access controls sufficient for AI agents?
Static controls are evaluated once, usually at session start. AI agent sessions can run for hours — which means a control that validated access at 9 AM isn’t checking anything at 11 PM when the same session is still executing. Dynamic controls re-validate continuously, ensuring every action is authenticated, scoped, and within policy at the moment it happens — not just when the session opened.
What does a complete audit trail actually require?
Logging that an agent took an action isn’t sufficient for compliance or incident response. A complete, immutable audit trail captures what data the agent considered, the reasoning chain it followed, the specific tool call it made, any policy gate it cleared or triggered, and any human approval that preceded the action. Every individual decision should be reconstructable — not summarized in aggregate, but traceable from input to output.
How do you verify a platform's governance claims during evaluation?
Ask for a live demonstration of three things: (1) declared permissions alongside observed tool usage for a production agent, (2) a reconstructed decision chain from a real logged action, and (3) the behavior when an agent triggers a high-risk action that requires approval. If the demo stays on slides for any of these, the platform can’t demonstrate the capability — only describe it.
When should re-approval be required for an agent already in production?
Any change to an agent’s tool access, workflow scope, or behavioral baseline should trigger a re-approval process. Other triggers include security incidents involving the agent, compliance requirement changes, and significant model updates that affect the agent’s decision-making. Agents that never require re-approval are agents whose governance posture drifts over time without accountability.
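These triggers can be checked mechanically by comparing the configuration that was approved against the one currently running. The sketch below assumes a hypothetical `AgentSnapshot` record. The field names are illustrative, and behavioral-baseline drift and compliance changes are left out because they need signals this example does not model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSnapshot:
    tools: frozenset[str]
    scope: str
    model_version: str
    open_incidents: int = 0


def reapproval_reasons(approved: AgentSnapshot, current: AgentSnapshot) -> list[str]:
    """List every reason the current state differs from what was approved."""
    reasons = []
    added = current.tools - approved.tools
    if added:
        reasons.append(f"new tool access: {sorted(added)}")
    if current.scope != approved.scope:
        reasons.append("scope changed")
    if current.model_version != approved.model_version:
        reasons.append("model updated")
    if current.open_incidents > 0:
        reasons.append("incident involving agent")
    return reasons


approved = AgentSnapshot(frozenset({"lookup"}), "tier-1 support", "m-1")
current = AgentSnapshot(frozenset({"lookup", "refund"}), "tier-1 support", "m-1")
print(reapproval_reasons(approved, current))  # ["new tool access: ['refund']"]
```

An empty list means the agent is still running as approved. Anything else is a reason to pause expansion until someone signs off again.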
Sources
- Okta AI Agent Identity Readiness Checklist
- ARMO: CISO’s AI Agent Production Approval Checklist
- Agentica: AI Governance Checklist for Enterprise Buyers
- Mongoose Cloud: Designing Auditable Agent Orchestration
- Scadea: Agentic AI Security Checklist for Enterprise Workflows
- The Cloud Life: How to Evaluate AI Platforms for Governance and Auditability
- Rasa: The Enterprise Guide to AI Agent Orchestration