
    Claude Opus 4.6: Anthropic’s Smartest AI Model Redefines Coding and Knowledge Work



    Quick Brief

    • Claude Opus 4.6 outperforms GPT-5.2 by 144 Elo points on the GDPval-AA knowledge work benchmark
    • Features a 1M token context window (beta) and 128K output capacity, both firsts for Opus models
    • Pricing remains $5/$25 per million tokens; available globally via API and cloud platforms
    • Scores 65.4% on Terminal-Bench 2.0 for agentic coding, highest among Claude models

    Anthropic released Claude Opus 4.6 on February 5, 2026, establishing new performance benchmarks across coding, financial analysis, and complex reasoning tasks. The model delivers a 190 Elo point improvement over Claude Opus 4.5 (from 1416 to 1606) and surpasses OpenAI’s GPT-5.2 by 144 points on GDPval-AA, an independent evaluation measuring performance on finance, legal, and professional domain tasks. Developers gain immediate access through the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure AI Foundry at unchanged pricing.

    State-of-the-Art Performance Across Industry Benchmarks

    Claude Opus 4.6 achieves 65.4% on Terminal-Bench 2.0, surpassing Opus 4.5’s 59.8% and Sonnet 4.5’s 51.0% on agentic coding tasks that require planning, execution, and error correction. The model leads all frontier models on Humanity’s Last Exam, a multidisciplinary test designed to evaluate complex problem-solving, scoring 53.1% with tools. On BrowseComp, which measures information retrieval accuracy from web content, Opus 4.6 scores 84.0%, outperforming every competing model.

    The GDPval-AA benchmark reveals significant performance gaps: Opus 4.6 scores 1606 Elo compared to GPT-5.2’s 1462 and Gemini 3 Pro’s 1195. This 144-point advantage translates to Opus 4.6 producing higher-quality outputs approximately 70% of the time on knowledge work tasks. The benchmark evaluates real work products including presentations, spreadsheets, financial analyses, and legal documents across multiple professional domains.

    Additional verified benchmarks include OSWorld at 72.7%, Finance Agent at 60.7%, and ARC AGI 2 at 68.8%, nearly double Opus 4.5’s 37.6%. These results demonstrate substantial gains in visual reasoning, financial analysis, and abstract pattern recognition.

    What makes Claude Opus 4.6 different from GPT-5.2?

    Claude Opus 4.6 outperforms GPT-5.2 by 144 Elo points on GDPval-AA, an independent benchmark for knowledge work tasks. It scores 1606 Elo versus GPT-5.2’s 1462, delivering superior results on financial analysis, legal reasoning, and document creation approximately 70% of the time.

    1M Token Context Window Eliminates Information Loss

    Opus 4.6 introduces Anthropic’s first Opus-class model with a 1 million token context window in beta, enabling processing of approximately 750,000 words simultaneously. On the MRCR v2 8-needle benchmark at 1M context, Opus 4.6 scores 76% compared to Sonnet 4.5’s 18.5%, roughly 4x better information retrieval from massive text volumes. At 256K context, the model achieves 93% accuracy, demonstrating minimal performance degradation as context length increases.

    This advancement directly addresses context deterioration, where AI models lose track of information in long conversations. The model maintains peak performance across hundreds of thousands of tokens, retrieving buried details that previous models consistently missed. Premium pricing applies for prompts exceeding 200,000 tokens at $10 input and $37.50 output per million tokens.

    Adaptive Thinking and Four-Level Effort Controls

    Anthropic replaced the binary extended thinking toggle with adaptive thinking, allowing Claude Opus 4.6 to determine when deeper reasoning adds value. Developers select from four effort levels: low (skips thinking for straightforward tasks), medium (moderate reasoning), high (default, recommended for production), and max (maximum capability for complex problems). The model adjusts its reasoning depth based on task complexity without developer intervention when set to high.

    Internal testing at Anthropic reveals Opus 4.6 “thinks more deeply and carefully revisits its reasoning before settling on an answer,” according to engineering teams who use Claude Code daily. This produces superior results on challenging problems but adds latency on simpler queries. Engineers can adjust effort levels using the /effort parameter in the API.
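The four effort levels can be sketched as a request builder. The payload shape below is an assumption for illustration: the article names only the four levels, an /effort control, and the claude-opus-4-6 model identifier, so the real field name and placement belong to Anthropic's API documentation, not this sketch.

```python
# Hypothetical request builder for effort-level selection.
# The "effort" field name is an assumption; only the level names and the
# claude-opus-4-6 model ID come from the article.

EFFORT_LEVELS = ("low", "medium", "high", "max")

def build_request(prompt: str, effort: str = "high") -> dict:
    """Build an illustrative Messages-style payload with an effort setting."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    return {
        "model": "claude-opus-4-6",   # model ID cited in the article
        "max_tokens": 4096,
        "effort": effort,             # assumed field name, for illustration
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Summarize this design document.", effort="max")
```

Defaulting to "high" mirrors the article's recommendation for production, with "max" reserved for requests where added latency is acceptable.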

    How does adaptive thinking work in Claude Opus 4.6?

    Adaptive thinking enables Claude Opus 4.6 to decide when extended reasoning helps, replacing the previous binary on/off toggle. At the default high effort level, the model automatically uses deeper thinking for complex tasks while moving quickly through straightforward ones, reducing unnecessary latency.

    Coding Capabilities and Multi-Agent Teams

    Claude Opus 4.6 handles large codebases with senior engineer-level proficiency, according to early access partners. SentinelOne’s Chief AI Officer reported the model “handled a multi-million-line codebase migration like a senior engineer, planning up front and finishing in half the time”. Cognition noted “increased bug catching rates” when using Opus 4.6 in Devin Review.

    Claude Code now supports agent teams in research preview, allowing multiple agents to work in parallel and coordinate autonomously. This proves effective for tasks that split into independent work, like codebase reviews, where developers can take control of any subagent using Shift+Up/Down or tmux. Rakuten demonstrated this capability when Opus 4.6 “autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories”.

    Enhanced Office Integration: Excel and PowerPoint

    Claude in Excel received substantial upgrades for handling long-running and complex tasks with improved planning capabilities. The model ingests unstructured data and infers correct structure without guidance, processing multi-step changes in one pass. Claude in PowerPoint launches as a research preview for Max, Team, and Enterprise plans. The model reads layouts, fonts, and slide masters to maintain brand consistency whether building from templates or generating full decks from descriptions.

    Context Compaction for Long-Running Tasks

    Context compaction, in beta, automatically summarizes and replaces older conversation context when approaching configurable limits. This enables Claude to perform longer agentic tasks without hitting context window boundaries. The feature triggers at developer-specified thresholds, allowing continuous operation on extended workflows. Combined with the 1M token window and tools such as web search and code execution, compaction supports tasks requiring up to 3 million total tokens.
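A minimal sketch of the compaction idea, assuming a crude character-count token estimate and a placeholder summarizer; Anthropic's actual trigger and summarization mechanics are not described in this article beyond "developer-specified thresholds", so every function below is illustrative.

```python
# Illustrative compaction loop: once the estimated context exceeds a
# threshold, collapse the oldest turns into a single summary message.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (assumption, not a tokenizer).
    return max(1, len(text) // 4)

def compact(messages: list[dict], limit: int, keep_recent: int = 4) -> list[dict]:
    """Replace older messages with a summary once estimated size exceeds limit."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= limit or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Stand-in summary: a production system would ask the model to summarize `old`.
    summary = f"[summary of {len(old)} earlier messages]"
    return [{"role": "user", "content": summary}] + recent
```

Keeping the most recent turns verbatim while summarizing the rest is one common design choice; the threshold and window sizes would be tuned per workload.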

    What is context compaction in Claude Opus 4.6?

    Context compaction automatically summarizes and replaces older conversation content when approaching the context window limit. This beta feature allows Claude Opus 4.6 to handle longer tasks without hitting boundaries, supporting up to 3 million total tokens when combined with other tools.

    128K Output Tokens Enable Comprehensive Responses

    Opus 4.6 supports outputs up to 128,000 tokens, double the previous 64,000-token limit. This expansion eliminates the need to break large-output tasks into multiple requests, enabling complete generation of extensive documents, codebases, or analyses in single responses. The increased output capacity particularly benefits financial modeling, legal document generation, and technical documentation creation where comprehensive coverage matters.

    Pricing and Global Availability

    Claude Opus 4.6 maintains identical pricing at $5 per million input tokens and $25 per million output tokens. Developers access the model through the Claude API using the identifier claude-opus-4-6, plus Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure AI Foundry. Prompt caching reduces costs by up to 90%, while batch processing offers 50% savings.

    US-only inference runs at 1.1x token pricing for workloads requiring United States-based processing. The model launched globally on February 5, 2026, with availability across Pro, Max, Team, and Enterprise plans on claude.ai. Premium pricing applies for prompts exceeding 200,000 tokens at $10 input and $37.50 output per million tokens when utilizing the full 1M context window.
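The rate card above can be turned into a rough per-request cost estimator. The rates, the 200,000-token threshold, and the 1.1x US-only multiplier come from the article; whether premium rates apply to the whole prompt or only the excess is not specified, so this sketch assumes the whole prompt is billed at the premium rate.

```python
# Rough cost estimator from the published Opus 4.6 rate card.
# Assumption: prompts over the threshold are billed entirely at premium rates.

PREMIUM_THRESHOLD = 200_000  # tokens; long-context pricing kicks in above this

def estimate_cost(input_tokens: int, output_tokens: int,
                  us_only: bool = False) -> float:
    """Estimated USD cost for one request at per-million-token rates."""
    if input_tokens > PREMIUM_THRESHOLD:
        in_rate, out_rate = 10.00, 37.50   # long-context premium tier
    else:
        in_rate, out_rate = 5.00, 25.00    # standard tier
    cost = (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate
    if us_only:
        cost *= 1.1                         # US-only inference multiplier
    return round(cost, 4)
```

For example, a 100K-input/10K-output request costs $0.75 at standard rates, while a 500K-input prompt at the same output length jumps to $5.375 under the premium tier. Prompt-caching and batch discounts would be applied on top of these figures.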

    Safety Profile and Alignment Testing

    Anthropic subjected Opus 4.6 to the most comprehensive safety evaluation set applied to any Claude model. The automated behavioral audit shows low rates of misaligned behaviors including deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall alignment matches or exceeds Claude Opus 4.5, previously Anthropic’s most-aligned frontier model.

    Opus 4.6 displays the lowest over-refusal rate of any recent Claude model, reducing instances where the model unnecessarily declines benign requests. New evaluations cover user wellbeing, complex refusal scenarios for dangerous requests, and updated tests for detecting surreptitious harmful actions. Anthropic applied six new cybersecurity probes to track potential misuse forms, given the model’s enhanced security analysis capabilities.

    Enterprise Adoption and Real-World Performance

    Cursor’s Co-founder Michael Truell reported Opus 4.6 as “the new frontier on long-running tasks from our internal benchmarks and testing” with high effectiveness at reviewing code. Harvey AI achieved a 90.2% BigLaw Bench score with Opus 4.6, the highest of any Claude model, with 40% perfect scores and 84% of scores above 0.8 for legal reasoning. Box documented a 10% performance lift, reaching 68% versus a 58% baseline, with near-perfect scores in technical domains.

    Thomson Reuters’ CTO noted the model “represents a meaningful leap in long-context performance” with “consistency that strengthens how we design and deploy complex research workflows”. Norway’s sovereign wealth fund NBIM conducted 40 cybersecurity investigations where Opus 4.6 “produced the best results 38 of 40 times in a blind ranking against Claude 4.5 models”, running end-to-end with up to 9 subagents and 100+ tool calls.

    How does Claude Opus 4.6 handle cybersecurity tasks?

    In blind testing across 40 cybersecurity investigations, Claude Opus 4.6 produced superior results 38 out of 40 times compared to Claude 4.5 models. Each model ran on identical agentic systems with up to 9 subagents and over 100 tool calls, demonstrating Opus 4.6’s investigative capabilities.

    Frequently Asked Questions (FAQs)

    Is Claude Opus 4.6 available globally?

    Yes, Claude Opus 4.6 launched globally on February 5, 2026 through the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure AI Foundry. It is available on claude.ai across Pro, Max, Team, and Enterprise plans with no regional restrictions.

    What is the context window size for Claude Opus 4.6?

    Claude Opus 4.6 features a 1 million token context window in beta, making it Anthropic’s first Opus model with this capacity. Standard pricing applies up to 200,000 tokens; premium pricing of $10/$37.50 per million tokens applies beyond that threshold.

    How much does Claude Opus 4.6 cost?

    Claude Opus 4.6 costs $5 input and $25 output per million tokens. Prompt caching reduces costs by up to 90% and batch processing offers 50% savings. Premium pricing of $10/$37.50 applies for prompts exceeding 200,000 tokens.

    What programming languages does Claude Opus 4.6 support?

    Claude Opus 4.6 supports all major programming languages including Python, JavaScript, TypeScript, Java, and C++. The model demonstrates enhanced multilingual coding performance and excels at large codebase navigation and debugging.

    Can Claude Opus 4.6 replace human software engineers?

    No, Claude Opus 4.6 functions as a powerful assistant rather than a replacement. While it handles complex coding tasks and debugging with senior-level proficiency, human oversight remains essential for architectural decisions, security considerations, and business logic validation.

    What is the maximum output length for Claude Opus 4.6?

    Claude Opus 4.6 supports outputs up to 128,000 tokens, approximately 96,000 words. This doubling from the previous 64,000 token limit enables complete generation of extensive documents without breaking into multiple requests.

    How does effort control affect response quality?

    Effort control determines reasoning depth: low skips extended thinking for simple tasks, medium applies moderate reasoning, high (default) uses extended thinking when valuable, and max provides maximum capability for complex problems. Higher effort improves quality on difficult tasks but adds latency.

    Does Claude Opus 4.6 work with Microsoft Office tools?

    Yes, Claude integrates with Excel and PowerPoint. Claude in Excel handles complex data tasks with improved planning capabilities. Claude in PowerPoint (research preview) generates presentations while respecting brand guidelines, available for Max, Team, and Enterprise plans.

    Mohammad Kashif
    Senior Technology Analyst and Writer at AdwaitX, specializing in the convergence of Mobile Silicon, Generative AI, and Consumer Hardware. Moving beyond spec sheets, his reviews rigorously test "real-world" metrics analyzing sustained battery efficiency, camera sensor behavior, and long-term software support lifecycles. Kashif’s data-driven approach helps enthusiasts and professionals distinguish between genuine innovation and marketing hype, ensuring they invest in devices that offer lasting value.
