Essential Points
- Anthropic’s RSP v3 took effect February 24, 2026, introducing mandatory Frontier Safety Roadmaps and periodic Risk Reports every 3 to 6 months
- ASL-3 safeguards were provisionally activated for Claude Opus 4 in May 2025 as a precautionary measure against CBRN weapon misuse risk, initially focused on biological weapons
- RSP v3 separates what Anthropic will implement unilaterally from broader industry-wide safety recommendations it cannot commit to alone
- Higher-ASL mitigations require security roughly in line with RAND SL4, a standard no single company can currently guarantee without multilateral coordination
Anthropic rewrote the rulebook on AI safety, and the implications reach beyond one company. The third version of its Responsible Scaling Policy (RSP), effective February 24, 2026, is a structural overhaul that honestly confronts where earlier versions fell short and what safety at the frontier actually requires. If you follow AI governance, frontier model development, or enterprise AI risk, this policy shift bears directly on how the industry handles increasingly powerful systems.
What the Responsible Scaling Policy Actually Does
The RSP is Anthropic’s voluntary framework for managing catastrophic risks from advanced AI systems, first published in September 2023. It establishes how Anthropic identifies and evaluates risks, makes decisions about AI development and deployment, and aims to ensure that model benefits exceed their costs. The core logic is conditional: specific capability thresholds require specific safeguards before a model can be deployed or trained further.
Each tier of safeguard corresponds to an AI Safety Level, abbreviated as ASL. ASL-2 covers baseline protections applied to all current models, including training models to refuse dangerous requests and defenses against opportunistic weight theft. ASL-3, now provisionally active for Claude Opus 4, addresses risks from models capable of materially assisting individuals or groups with undergraduate STEM backgrounds in creating, obtaining, or deploying chemical and biological weapons with serious potential for catastrophic damage.
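The conditional structure described above — specific capability thresholds requiring specific safeguards before deployment — can be sketched as a simple gating check. This is an illustrative model only; the safeguard names and the `may_deploy` function are assumptions for exposition, not Anthropic's internal representation.

```python
# Illustrative sketch of the RSP's conditional logic: each AI Safety Level
# (ASL) gates deployment on a set of required safeguards. All safeguard
# names here are stand-ins, not Anthropic's actual terminology.

REQUIRED_SAFEGUARDS = {
    "ASL-2": {"refusal_training", "opportunistic_theft_defenses"},
    "ASL-3": {"refusal_training", "opportunistic_theft_defenses",
              "constitutional_classifiers", "weight_security_controls"},
}

def may_deploy(assessed_level: str, active_safeguards: set[str]) -> bool:
    """A model may be deployed only if every safeguard required at its
    assessed capability level is currently active (subset check)."""
    return REQUIRED_SAFEGUARDS[assessed_level] <= active_safeguards
```

Under this reading, a model assessed at ASL-3 with only ASL-2 protections in place fails the gate, which mirrors why Anthropic activated additional safeguards for Claude Opus 4 before resolving its exact capability level.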
Why Anthropic Rewrote RSP Before Reaching ASL-4
The RSP has always been a living document. Versions 1.0 through 2.2 were published between September 2023 and May 2025, with version 3.0 representing the most significant structural rethinking. The core problem driving this rewrite is a collective action problem: the overall level of catastrophic risk from AI depends on the actions of multiple developers, not just one.
Anthropic’s prior RSP committed to implementing mitigations that would reduce its models’ risk to acceptable levels regardless of what competitors did. Version 3.0 acknowledges this approach is unsustainable: if one developer paused to implement safety measures while others moved forward without equivalent protections, responsible developers would lose both their competitive position and their ability to conduct safety research. Higher-ASL mitigations, particularly those needed against well-resourced threat actors, require security roughly equivalent to RAND SL4, a standard that depends on industry-wide or government coordination no single company can guarantee.
The 3 Core Structural Changes in RSP Version 3.0
Anthropic restructured the policy around three concrete innovations:
1. Separating company commitments from industry-wide recommendations
RSP v3 now presents a three-column table mapping capability thresholds to mitigations. The middle column documents what Anthropic plans to implement as a company regardless of what any other developer does. The right column describes the more ambitious mitigations the entire industry should adopt to keep catastrophic risks reliably low. Anthropic cannot commit to following the industry-wide column unilaterally, but these recommendations drive its policy advocacy, public goal-setting, and competitor-contingent commitments outlined in Appendix A of the document.
2. Mandatory Frontier Safety Roadmap
RSP v3 requires Anthropic to maintain and publish a Frontier Safety Roadmap laying out ambitious but achievable goals across four domains: Security, Alignment, Safeguards, and Policy. These are public goals against which Anthropic will openly grade its own progress. They are not hard commitments, but the expectation is that Anthropic will avoid revising them downward simply because they are difficult to achieve. The current Roadmap includes goals such as completing moonshot R&D for security, achieving an “eyes on everything” state for internal AI development through comprehensive centralized logging, performing systematic alignment assessments incorporating mechanistic interpretability and adversarial red-teaming, and developing internal red-teaming that outperforms the collective contributions of hundreds of bug bounty participants.
3. Periodic Risk Reports with structured external review
Anthropic will publish Risk Reports every 3 to 6 months covering the safety profiles of all publicly deployed models. These reports go beyond system cards, integrating capability evaluations, threat model analyses, active mitigations, and an overall risk-benefit determination. A mandatory external review process applies whenever a Risk Report covers a “highly capable” model and is significantly redacted, where “highly capable” is currently operationalized as a model capable of compressing two years of 2018 to 2024 AI progress into a single year. External reviewers must have significant expertise in dangerous capability evaluations, must have no financial interest in Anthropic, and must have no close personal relationships with Anthropic staff.
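The external-review trigger and reviewer-eligibility rules just described are two conjunctive tests, which a short sketch makes explicit. The dataclass field names are illustrative assumptions; only the logic (review required when both conditions hold; reviewers need expertise and independence) comes from the policy text.

```python
from dataclasses import dataclass

# Hedged sketch of RSP v3's external-review rules as described above.
# Field names are assumptions for illustration.

@dataclass
class RiskReport:
    covers_highly_capable_model: bool  # ~2 years of 2018-2024 progress in 1
    significantly_redacted: bool

@dataclass
class Reviewer:
    dangerous_capability_expertise: bool
    financial_interest_in_anthropic: bool
    close_personal_ties_to_staff: bool

def external_review_required(report: RiskReport) -> bool:
    # Mandatory only when BOTH conditions hold.
    return report.covers_highly_capable_model and report.significantly_redacted

def reviewer_eligible(r: Reviewer) -> bool:
    return (r.dangerous_capability_expertise
            and not r.financial_interest_in_anthropic
            and not r.close_personal_ties_to_staff)
```

Note the asymmetry: a heavily redacted report on a less capable model, or an unredacted report on a highly capable one, does not by itself trigger mandatory review.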
The ASL-3 Activation: What Actually Happened
Anthropic activated ASL-3 Deployment and Security Standards for Claude Opus 4 in May 2025. This was a precautionary and provisional action. Anthropic explicitly states it has not yet determined whether Claude Opus 4 definitively crossed the Capability Threshold requiring ASL-3 protections. The activation occurred because continued improvements in the model's CBRN-related knowledge made it impossible to clearly rule out ASL-3 risks, as had been possible for every previous model.
The ASL-3 Deployment Standard is narrowly focused. It targets extended, end-to-end CBRN workflows that would be additive to what is already possible without large language models, and specifically addresses universal jailbreaks that could enable consistent extraction of harmful CBRN information. The three-part implementation approach includes Constitutional Classifiers that monitor model inputs and outputs in real time, a bug bounty program focused on stress-testing those classifiers, and iterative classifier retraining using synthetic jailbreak data. Importantly, the ASL-3 Standard is designed to defend against sophisticated non-state attackers. Nation-state threats and sophisticated insider risks are explicitly out of scope of the ASL-3 Standard.
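The classifier-gated deployment described above amounts to screening both the input and the output of a model call. The sketch below shows that shape under loudly labeled assumptions: `looks_like_cbrn_workflow` is a trivial stand-in for a trained classifier, and the refusal strings are invented; Anthropic's Constitutional Classifiers are far more sophisticated.

```python
# Minimal sketch of input/output classifier gating, in the spirit of the
# Constitutional Classifiers described above. The heuristic and messages
# below are illustrative placeholders, not Anthropic's actual system.

def looks_like_cbrn_workflow(text: str) -> bool:
    # Placeholder heuristic; the real system uses trained classifiers
    # retrained iteratively on synthetic jailbreak data.
    return "synthesis route" in text.lower()

def guarded_generate(prompt: str, model) -> str:
    if looks_like_cbrn_workflow(prompt):       # screen the input
        return "Request declined by input classifier."
    output = model(prompt)
    if looks_like_cbrn_workflow(output):       # screen the output
        return "Response withheld by output classifier."
    return output
```

Screening both sides matters because a universal jailbreak may slip a benign-looking prompt past the input check while still eliciting harmful content, which the output check can then catch.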
The ASL-3 Security Standard involves more than 100 different security controls targeting model weight protection, including two-party authorization for weight access, enhanced change management protocols, binary allowlisting, and preliminary egress bandwidth controls that limit outbound data flow from secure environments where model weights reside.
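Among those controls, two-party authorization has a simple logical core: the requester alone can never grant access to model weights. The sketch below is one plausible reading of that rule, with all names and the exact policy details assumed for illustration.

```python
# Hedged sketch of a two-party authorization check for model-weight access,
# one of the 100+ ASL-3 security controls mentioned above. The rule as
# implemented at Anthropic may differ; this is illustrative only.

def weight_access_authorized(requester: str, approvers: set[str]) -> bool:
    """Two-party rule: access requires at least one independent approver
    distinct from the requester, so no single person can act alone."""
    return len(approvers - {requester}) >= 1
```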
ASL Framework and Capability Thresholds
| Capability Threshold | Anthropic’s Planned Mitigations | Threat Actors in Scope |
|---|---|---|
| Non-novel CBRN weapons (undergraduate STEM background actors) | ASL-3 Constitutional Classifiers, access controls, bug bounty, 100+ security controls for weight protection | Sophisticated non-state attackers |
| Novel CBRN weapons (moderately resourced expert-backed teams) | ASL-3 protections extended to additional use cases; policy recommendations for early detection shared with governments | Well-resourced non-state teams; requires RAND SL4-equivalent security |
| High-stakes sabotage by AI systems | Risk Reports documenting AI capabilities, propensities, and monitoring practices | Internal AI systems with extensive access and autonomous operation capacity |
| Automated R&D acceleration | Moonshot security R&D, “eyes on everything” internal logging, systematic alignment assessments | Top-tier state-backed actors; requires RAND SL4-equivalent security |
Limitations and Honest Trade-Offs
Anthropic acknowledges that its industry-wide recommendations cannot be implemented unilaterally. The company openly states that if one developer pauses while others advance, the developers with the weakest protections set the pace and responsible developers lose influence over safety outcomes. This structural tension, not any technical failure, is what makes RSP v3 a fundamentally different document from its predecessors.
Risk Reports will be published with minimal redactions, but certain information may be withheld for legal compliance, intellectual property protection, or public safety considerations. Anthropic acknowledges that external reviewers may publicly disagree with redaction decisions and that this tension is intentional: reviewer commentary carries significant weight precisely because reviewers are selected to be candid rather than to validate Anthropic’s conclusions.
Frequently Asked Questions (FAQs)
What is Anthropic’s Responsible Scaling Policy?
The RSP is Anthropic’s voluntary framework for managing catastrophic risks from advanced AI. It establishes how Anthropic identifies risks, evaluates model capabilities, and decides whether to continue development or deployment. Version 3.0 took effect February 24, 2026, and is the most structurally significant update since the original.
What is ASL-3 and which models does it currently apply to?
ASL-3 is the third AI Safety Level in Anthropic’s framework, activated provisionally for Claude Opus 4 in May 2025. It targets risks from models that could assist individuals with undergraduate STEM backgrounds in creating CBRN weapons. Anthropic has not yet confirmed that Claude Opus 4 definitively crossed the ASL-3 capability threshold.
What threat actors does ASL-3 protect against?
ASL-3 is designed to defend against sophisticated non-state attackers. Nation-state threats and sophisticated insider risks are explicitly out of scope of the ASL-3 Standard. Higher capability thresholds in RSP v3 address nation-state-level threats and require security roughly equivalent to RAND SL4.
How is RSP v3 structurally different from version 2?
RSP v3 replaces a single mitigation framework with a three-column table separating Anthropic’s unilateral company plans from ambitious industry-wide recommendations it cannot commit to alone. It adds mandatory Frontier Safety Roadmaps with publicly graded goals and periodic Risk Reports every 3 to 6 months with structured external review.
What is the Frontier Safety Roadmap?
The Frontier Safety Roadmap is a new mandatory element of RSP v3 documenting Anthropic’s concrete plans across Security, Alignment, Safeguards, and Policy. Goals include moonshot security R&D, comprehensive internal AI development logging, systematic alignment assessments using mechanistic interpretability, and internal red-teaming that surpasses hundreds of bug bounty contributors.
When and how will Risk Reports be published?
Risk Reports will be published every 3 to 6 months and cover all publicly deployed models. They are not tied to individual model releases. When a significantly more capable model is deployed between reports, Anthropic will publish a supplementary discussion within its system card. External review is mandatory when a report covers a highly capable model and contains significant redactions.
What defines a “highly capable” model under RSP v3?
A model is currently classified as highly capable if Anthropic determines it could compress two years of 2018 to 2024 AI research progress into a single year. This operationalization focuses specifically on AI R&D acceleration and is intended to expand to additional domains as measurement methods mature.

