NIST AI RMF Compliance Testing for AI Agents
Apply the NIST AI RMF to AI agent systems. Four core functions, CSA Agentic Profile extensions, and practical vulnerability assessment requirements.
Your AI agents are autonomous. They call APIs, write to databases, make decisions, and delegate tasks to other agents. The NIST AI Risk Management Framework was not built for this.
Published in January 2023 as NIST AI 100-1, the AI RMF predates the explosion of agentic AI. It has no concept of autonomy tiers, tool-use risk, or delegation chains. For organizations deploying AI agents in production — especially those selling to enterprises, government, or regulated industries — the base framework leaves dangerous gaps.
The Cloud Security Alliance recognized this and published the Agentic NIST AI RMF Profile in March 2026. It is the first structured attempt to extend the NIST framework for autonomous AI agent systems. NIST itself launched the CAISI AI Agent Standards Initiative in February 2026, but finalized agent-specific standards are not expected until 2027.
This guide breaks down what NIST AI RMF compliance testing actually means for AI agent deployments — what you need to test, how the CSA Agentic Profile changes the requirements, and how to build a testing program that satisfies both the framework and the enterprises asking about it.
What Is the NIST AI Risk Management Framework?
NIST AI 100-1 is a voluntary, sector-agnostic framework for managing risks in AI systems. It does not prescribe specific technical controls. Instead, it provides a structured approach organized around four core functions:
Govern
The cross-cutting function. Govern establishes policies, accountability structures, and organizational culture for AI risk management. It spans six categories (GV-1 through GV-6) covering governance policies, accountability, workforce diversity, organizational culture, stakeholder engagement, and third-party risk.
For AI agents, Govern is where you define who is responsible when an autonomous agent takes an action that causes harm. It is also where you establish your AI system inventory — every agent, its capabilities, its access scope, and its ownership.
Map
Map establishes context and frames risks. Five categories (MP-1 through MP-5) cover system categorization, capabilities documentation, usage context, risk-benefit analysis, and impact characterization.
For AI agents, Map is where you document the knowledge limits of your agents, specify their intended application scope, and characterize the likelihood and magnitude of impacts from their autonomous actions.
Measure
Measure provides the quantitative and qualitative tools to analyze, benchmark, and monitor AI risk. Four categories (MS-1 through MS-4) with detailed subcategories covering metrics selection, trustworthiness evaluation (safety, security, resilience, transparency, fairness, privacy), risk tracking, and measurement effectiveness.
MS-2.6 (safety risk evaluation) and MS-2.7 (security and resilience evaluation) are the subcategories most directly relevant to AI agent security testing.
Manage
Manage allocates resources to mapped and measured risks. Four categories (MG-1 through MG-4) cover risk prioritization, impact minimization, third-party risk management, and documentation. MG-2.4 specifically addresses system disengagement or deactivation mechanisms — critical for agents that can take irreversible actions.
Why the Base Framework Falls Short for AI Agents
The base NIST AI RMF was designed for traditional AI systems — classification models, recommendation engines, prediction systems. It assumes a relatively static system where inputs and outputs are well-defined and human oversight is straightforward.
AI agents break these assumptions. They operate with varying degrees of autonomy, use tools that affect external systems, chain multiple reasoning steps, delegate tasks to sub-agents, and exhibit emergent behaviors that were never explicitly programmed.
The CSA Agentic NIST AI RMF Profile identifies four structural gaps:
- No autonomy tier concept. The base framework treats all AI systems the same regardless of how much autonomous decision-making they perform. A recommendation engine and a fully autonomous agent that executes financial transactions get identical treatment.
- No tool-use risk modeling. When an agent can call APIs, write to databases, send emails, or execute code, every tool becomes a potential attack vector. The base framework has no mechanism for classifying or tracking tool-level risk.
- Insufficient runtime monitoring. Traditional AI testing focuses on pre-deployment validation. Agents exhibit behavioral patterns during operation (action velocity spikes, permission escalation, cross-boundary invocations) that only surface at runtime.
- No delegation oversight boundaries. When Agent A delegates a task to Agent B, which then calls Agent C, the accountability chain becomes opaque. The base framework has no concept of delegation tracking.
The CSA Agentic Profile: What It Adds
The CSA Agentic Profile supplements (does not replace) the base framework with agent-specific extensions using an “AG” prefix. Here is what each function gains:
Govern Extensions
AG-GV.1: Autonomy Tier Classification. A four-tier system that scales governance obligations with agent autonomy:
| Tier | Description | Governance Requirement |
|---|---|---|
| Tier 1 | Fully supervised | Standard oversight |
| Tier 2 | Constrained autonomy | Annual behavioral assessment |
| Tier 3 | Broad autonomy | Quarterly assessment, defined escalation conditions |
| Tier 4 | Full autonomy | Monthly assessment, continuous monitoring, documented fail-safe conditions, response playbooks |
AG-GV.2: Delegation Accountability. Requires a formal “agent accountability register” connecting every autonomous action to a responsible human officer. Documents action scope authorization, escalation conditions, and accountability lineage.
AG-GV.3: Agent Inventory and Lifecycle. Real-time tracking of every agent’s authorities, tool access, delegation relationships, and authority review schedules.
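One way to make the accountability register and the inventory concrete is a small record per agent plus a completeness check. The sketch below is illustrative only: `AgentRecord`, its fields, and `inventory_gaps` are hypothetical names chosen for this example, not structures defined by the CSA profile.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    """One inventory entry (hypothetical schema for illustration)."""
    agent_id: str
    autonomy_tier: int                 # 1..4, per the tier table above
    responsible_officer: str           # empty string means nobody is assigned
    tools: list[str] = field(default_factory=list)
    delegates_to: list[str] = field(default_factory=list)


def inventory_gaps(records):
    """Return (agent_id, problem) pairs for entries that break accountability."""
    known = {r.agent_id for r in records}
    gaps = []
    for r in records:
        if not r.responsible_officer.strip():
            gaps.append((r.agent_id, "no responsible officer"))
        for target in r.delegates_to:
            if target not in known:
                gaps.append((r.agent_id, f"delegates to uninventoried agent {target}"))
    return gaps
```

Running a check like this on every inventory change is a cheap way to catch agents that have no named owner or that delegate to something nobody has registered.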
Map Extensions
AG-MP.1: Agent Tool Risk Classification. Tool inventories classified across four dimensions:
- Consequence scope — read-only to destructive
- Reversibility — can the action be undone?
- Authentication requirements — what credentials does the tool require?
- Compositional risk — what happens when tools are combined?
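The four dimensions above can be captured in a per-tool profile. The following is a minimal sketch; the `ToolProfile` type, the `ConsequenceScope` levels, and especially the additive weights in `risk_score` are assumptions made for illustration, since the profile as summarized here names the dimensions but not a scoring formula.

```python
from dataclasses import dataclass
from enum import IntEnum


class ConsequenceScope(IntEnum):
    READ_ONLY = 0
    WRITE = 1
    DESTRUCTIVE = 2


@dataclass(frozen=True)
class ToolProfile:
    name: str
    scope: ConsequenceScope
    reversible: bool                        # can the action be undone?
    requires_auth: bool                     # does the tool demand credentials?
    risky_combinations: tuple[str, ...] = ()  # other tools it is dangerous next to


def risk_score(tool: ToolProfile) -> int:
    """Illustrative additive score; the weights are arbitrary assumptions."""
    score = int(tool.scope)                     # 0 to 2
    score += 0 if tool.reversible else 2        # irreversible actions weigh most
    score += 0 if tool.requires_auth else 1     # unauthenticated tools are easier to abuse
    score += 1 if tool.risky_combinations else 0
    return score
```

A score like this is only useful for ranking tools against each other so that review effort goes to the riskiest integrations first.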
AG-MP.2: Action-Consequence Analysis. “Consequence graphs” that map potential tool invocation sequences to real-world outcomes. This is where you identify failure modes — what happens if the agent calls Tool A, then Tool B, in an unintended sequence?
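A consequence graph can be approximated as a directed graph of which tool may follow which, then searched for sequences that match a known-bad ordering. This is a sketch under assumptions: the graph shape, the function names, and the idea of expressing a failure mode as an ordered pattern are all illustrative choices, not part of the profile.

```python
def enumerate_sequences(graph, start, max_len):
    """Yield every tool sequence of at most max_len steps reachable from start."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        yield path
        if len(path) < max_len:          # the length bound also stops cycles
            for nxt in graph.get(path[-1], []):
                stack.append(path + [nxt])


def contains_in_order(path, pattern):
    """True if pattern appears in path as an ordered (not necessarily adjacent) subsequence."""
    remaining = iter(path)
    return all(step in remaining for step in pattern)


def flag_sequences(graph, start, pattern, max_len=4):
    """Return all reachable sequences that match the risky ordered pattern."""
    return [p for p in enumerate_sequences(graph, start, max_len)
            if contains_in_order(p, pattern)]
```

For example, with a graph in which `db_read` can lead to `api_call` or `email_send`, the pattern `("db_read", "email_send")` surfaces every path where data read from the database could leave by email.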
AG-MP.3: Multi-Agent Topology Risk. Analysis of interaction patterns, trust boundaries, and compromise propagation risks across agent networks. If one agent in your system is compromised, how far can the damage spread?
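The "how far can the damage spread" question is a reachability problem. A minimal sketch, assuming trust relationships are recorded as a mapping from each agent to the agents it can invoke or delegate to (the agent names and the `blast_radius` helper are hypothetical):

```python
from collections import deque


def blast_radius(trust_edges, compromised):
    """Return every agent reachable from the compromised one via trust edges."""
    reached = {compromised}
    queue = deque([compromised])
    while queue:
        agent = queue.popleft()
        for downstream in trust_edges.get(agent, ()):
            if downstream not in reached:
                reached.add(downstream)
                queue.append(downstream)
    reached.discard(compromised)   # report only the agents affected downstream
    return reached
```

Computing this for each agent in turn highlights the ones whose compromise reaches the most of the system, which is where trust boundaries are worth tightening first.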
Measure Extensions
AG-MS.1: Agentic Behavioral Telemetry. Required runtime metrics for Tier 2+ deployments:
- Action velocity (actions per minute)
- Permission escalation rate
- Cross-boundary invocations
- Delegation depth
- Exception rates
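The metrics above can be derived from a per-action event log. The sketch below assumes each agent action is logged with a few boolean flags; the `ActionEvent` schema and the exact metric definitions (for example, rates as a fraction of actions in the window) are assumptions for illustration, since the profile as summarized here names the metrics without fixing formulas.

```python
from dataclasses import dataclass


@dataclass
class ActionEvent:
    """One logged agent action (hypothetical schema)."""
    escalated_permission: bool = False
    crossed_boundary: bool = False
    raised_exception: bool = False
    delegation_depth: int = 0


def summarize(events, window_minutes):
    """Aggregate one observation window of events into the five telemetry metrics."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    n = len(events)

    def rate(count):
        return count / n if n else 0.0

    return {
        "action_velocity": n / window_minutes,  # actions per minute
        "permission_escalation_rate": rate(sum(e.escalated_permission for e in events)),
        "cross_boundary_invocations": sum(e.crossed_boundary for e in events),
        "max_delegation_depth": max((e.delegation_depth for e in events), default=0),
        "exception_rate": rate(sum(e.raised_exception for e in events)),
    }
```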
AG-MS.2: Autonomy-Calibration Assessment. Periodic evaluation of whether an agent’s demonstrated performance justifies its current autonomy tier. Assessment frequency scales with tier — annually for Tier 2, monthly for Tier 4.
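The tier-based cadence lends itself to a simple scheduling check. This sketch uses the cadences quoted in this guide (annual for Tier 2, quarterly for Tier 3, monthly for Tier 4); the day counts are approximations, and treating Tier 1 as having no scheduled assessment is an assumption, since only "standard oversight" is stated for it.

```python
from datetime import date, timedelta
from typing import Optional

# Approximate days between assessments per tier; None means no stated cadence.
ASSESSMENT_INTERVAL_DAYS = {1: None, 2: 365, 3: 91, 4: 30}


def next_assessment_due(tier: int, last_assessed: date) -> Optional[date]:
    """Return the date the next autonomy-calibration assessment is due, if any."""
    if tier not in ASSESSMENT_INTERVAL_DAYS:
        raise ValueError(f"unknown autonomy tier: {tier}")
    interval = ASSESSMENT_INTERVAL_DAYS[tier]
    if interval is None:
        return None
    return last_assessed + timedelta(days=interval)


def is_overdue(tier: int, last_assessed: date, today: date) -> bool:
    """True when the agent has passed its assessment due date."""
    due = next_assessment_due(tier, last_assessed)
    return due is not None and today > due
```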
AG-MS.3: Delegation Chain Monitoring. Tracking actual vs. planned delegation patterns, unauthorized authority expansion, and sub-agent scope violations.
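Comparing actual against planned delegation reduces to checking each observed hop against an allow-list and bounding chain depth. A minimal sketch, assuming delegation chains are logged as ordered lists of agent IDs (the function name and the finding labels are hypothetical):

```python
def audit_delegations(planned, observed, max_depth):
    """Compare observed delegation chains with the planned (parent, child) pairs.

    planned:  set of (parent, child) tuples that are authorized
    observed: list of chains, each an ordered list of agent IDs
    """
    findings = []
    for chain in observed:
        if len(chain) - 1 > max_depth:           # hops = agents in the chain minus one
            findings.append(("depth_exceeded", tuple(chain)))
        for parent, child in zip(chain, chain[1:]):
            if (parent, child) not in planned:
                findings.append(("unauthorized_delegation", (parent, child)))
    return findings
```

An empty result means every observed hop was authorized and no chain ran deeper than allowed.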
Manage Extensions
AG-MG.1: Agent Compromise Incident Response. Playbooks for agent compromise, behavioral hijack, runaway agent scenarios, and delegation chain compromise. Emphasizes pre-authorized automatic containment responses.
AG-MG.2: Behavioral Drift Correction. Protocols for drift characterization, root cause analysis, and remediation — including scope reduction, tier demotion, or redeployment.
AG-MG.3: Agent Decommissioning. Memory disposition, credential revocation, external system notification, audit log preservation, and downstream system updates.
What You Actually Need to Test
Translating framework language into a practical testing program means covering two phases: pre-deployment validation and continuous runtime testing.
Pre-Deployment Testing
Red-team testing. Adversarial testing targeting the agent layer — prompt injection, indirect prompt injection, unsafe tool invocation, jailbreaking, and data exfiltration through model outputs. This is not traditional penetration testing. It targets the model and agent layer, not infrastructure. See our OWASP Top 10 for AI agents testing guide for the specific attack categories.
Tool risk classification. Inventory every tool your agent can access. Classify each across the four dimensions from AG-MP.1 (consequence scope, reversibility, authentication, compositional risk). An agent with 15 Model Context Protocol (MCP) tool integrations has a fundamentally different risk profile than one with 3 read-only APIs.
Action-consequence analysis. Build consequence graphs for your agent’s tool invocation sequences. Identify the worst-case outcomes from unintended tool combinations. What happens if the agent chains a database read, an API call, and an email send in an unexpected sequence?
Escalation validation. Test that your agent correctly escalates ambiguous or high-stakes decisions to human oversight rather than acting autonomously. This maps directly to the Article 14 human oversight requirements in the EU AI Act.
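Escalation behavior can be tested like any other decision function: feed it labeled cases and report disagreements. In the sketch below, `ProposedAction`, the reference policy `must_escalate`, and its 0.8 confidence threshold are all assumptions for illustration; in practice `decide` would wrap whatever escalation hook your agent exposes.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    tool: str
    reversible: bool
    confidence: float   # agent's self-reported confidence, 0.0 to 1.0


def must_escalate(action, min_confidence=0.8):
    """Reference policy: escalate irreversible or low-confidence actions."""
    return (not action.reversible) or action.confidence < min_confidence


def run_escalation_checks(decide, cases):
    """Return the actions where decide(action) disagrees with the expected outcome."""
    return [action for action, expected in cases if decide(action) != expected]
```

A non-empty result lists the scenarios where the agent acted autonomously when it should have deferred to a human, or the reverse.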
Behavioral baseline establishment. Before deployment, establish normal operating parameters — typical action velocity, tool usage patterns, delegation depth. Without a baseline, you cannot detect drift in production.
Continuous Runtime Testing
Behavioral telemetry monitoring. Instrument your agents to emit the metrics specified in AG-MS.1. Track action velocity, permission escalation rate, cross-boundary invocations, delegation depth, and exception rates. Alert on anomalies.
Drift detection. Monitor for behavioral changes over time. An agent that gradually increases its action velocity or starts using tools it previously avoided may be exhibiting behavioral drift or adversarial manipulation.
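A simple way to connect the pre-deployment baseline to drift detection is a z-score on a single metric such as action velocity. This is a deliberately basic sketch: the threshold of 3.0 and the choice of a mean and standard deviation baseline are assumptions, and production systems often need seasonality-aware or multivariate methods instead.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (mean, sample standard deviation) for a pre-deployment metric series."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return mean(samples), stdev(samples)


def is_drifting(baseline, recent, z_threshold=3.0):
    """Flag drift when the recent mean sits more than z_threshold deviations from baseline."""
    mu, sigma = baseline
    recent_mean = mean(recent)
    if sigma == 0:
        return recent_mean != mu      # any change from a perfectly flat baseline counts
    return abs(recent_mean - mu) / sigma > z_threshold
```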
Delegation chain auditing. For multi-agent systems, continuously verify that actual delegation patterns match planned patterns. Unauthorized delegation expansion — an agent delegating to sub-agents it was not authorized to use — is a critical finding.
Autonomy-calibration assessments. At the frequency specified by your agent’s autonomy tier, evaluate whether current performance justifies current autonomy levels. This is the mechanism for tier demotion if an agent’s behavior degrades.
Audit trail validation. Verify that your logging captures timestamps, decision metadata, tool usage history, policy check results, and identity mappings for every agent action. Without complete audit trails, you cannot demonstrate compliance.
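Audit completeness can be checked mechanically by validating each log record against the fields listed above. The field names in this sketch are hypothetical; map them to whatever keys your logging pipeline actually emits.

```python
# Hypothetical key names for the five elements every audit record should carry.
REQUIRED_FIELDS = (
    "timestamp",
    "decision_metadata",
    "tool_usage",
    "policy_check_results",
    "identity",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in one audit record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [], {})]


def incomplete_records(records):
    """Return (index, missing) pairs for every record that fails the check."""
    return [(i, gaps) for i, r in enumerate(records) if (gaps := missing_fields(r))]
```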
How security.aivyuh.com Maps to the Framework
Our security assessment services are designed to cover both the base NIST AI RMF and the CSA Agentic Profile extensions:
| NIST AI RMF Function | CSA Agentic Extension | Our Service |
|---|---|---|
| Govern — policies, accountability | AG-GV.1 Autonomy Tiers, AG-GV.2 Delegation Accountability | Governance gap analysis and agent inventory audit |
| Map — risk identification | AG-MP.1 Tool Risk, AG-MP.2 Consequence Analysis, AG-MP.3 Multi-Agent Topology | Tool risk classification, consequence graph construction, attack surface mapping |
| Measure — adversarial testing, metrics | AG-MS.1 Behavioral Telemetry, AG-MS.2 Autonomy Calibration | Red-team testing, behavioral baseline establishment, telemetry design |
| Manage — incident response, remediation | AG-MG.1 Incident Response, AG-MG.2 Drift Correction | Incident playbook development, remediation guidance, security checklist validation |
A compliance-driven assessment mapped to NIST AI RMF typically runs $20K–$75K depending on system complexity, agent count, and tool integrations.
The Regulatory Landscape: Where NIST AI RMF Sits
NIST AI RMF is voluntary — but the market is making it functionally mandatory for certain segments.
Executive Order 14110 (October 2023) directed federal agencies to incorporate the AI RMF and resulted in the NIST AI 600-1 Generative AI Profile. Executive Order 14148 (January 2025) rescinded EO 14110 as part of a shift toward deregulation. The AI RMF remains voluntary.
However, federal procurement, defense contracting, and enterprise vendor assessments increasingly reference NIST AI RMF alignment. If you sell AI agents to the U.S. government, Fortune 500 companies, or regulated industries, expect to answer questions about your RMF alignment.
NIST’s own roadmap signals that agent-specific standards are coming. The February 2026 CAISI AI Agent Standards Initiative is developing formal guidance. SP 800-53 control overlays for single-agent and multi-agent AI systems are in development — covering least-privilege tool access, agent action containment, multi-agent trust boundaries, and chain-of-custody logging.
The CSA is building the implementation layer. Beyond the Agentic Profile, the CSA published the AAGATE reference architecture in December 2025 — a Kubernetes-native runtime governance overlay aligned with the AI RMF. Their AI Controls Matrix (243 controls, 18 domains, published July 2025) provides granular control mappings.
The EU AI Act takes a harder regulatory approach. For organizations operating in both U.S. and EU markets, NIST AI RMF compliance testing provides the foundation — and EU AI Act conformity assessment adds the legally binding layer.
Getting Started: A Practical Roadmap
If you are deploying AI agents and need to demonstrate NIST AI RMF alignment, here is the sequence:
Weeks 1–2: Agent inventory and autonomy classification. Document every agent, its capabilities, tool access, and delegation relationships. Classify each agent into an autonomy tier using the CSA Agentic Profile’s four-tier system.
Weeks 2–4: Tool risk classification and consequence analysis. Inventory every tool integration. Classify each across the four risk dimensions. Build consequence graphs for critical tool chains.
Weeks 4–6: Red-team testing. Conduct adversarial testing targeting the agent and tool layer. Map findings to specific RMF subcategories (MS-2.6 safety, MS-2.7 security and resilience) and Agentic Profile extensions.
Weeks 6–8: Telemetry and monitoring design. Instrument agents for behavioral telemetry. Establish baselines. Configure alerting for anomalies in action velocity, permission escalation, and delegation patterns.
Ongoing: Autonomy-calibration assessments. At the frequency specified by each agent’s tier, evaluate performance and adjust autonomy levels as needed.
Need help scoping your NIST AI RMF compliance program? Start with our AI agent security self-assessment to identify your highest-priority gaps, or contact us for a compliance-driven assessment.
Related reading
NIST AI RMF compliance intersects with EU regulation — read our companion guide on EU AI Act compliance testing for AI agents to see how a single testing program can satisfy both frameworks.
For the broader picture of why dedicated agent security matters, the AI Vyuh blog covers why AI agents need their own security assessment and how the AI agent economy is driving demand for purpose-built security infrastructure.