GPT-5.4 Thinking: OpenAI's Most Scrutinized Reasoning Model Laid Bare

Essential Points

GPT-5.4 Thinking is the first general-purpose OpenAI model to implement mitigations for High capability in Cybersecurity
CoT controllability stays near 0.3% for 10,000-character reasoning chains, indicating minimal risk of deliberate reasoning concealment
Cyber Range combined pass rate reaches 73.33% across 15 scenarios, with 4 scenarios still failing
HealthBench Consensus score improves to 96.6%, up 2.1 points from GPT-5.2 Thinking

OpenAI published the GPT-5.4 Thinking system card on March 5, 2026, and the level of scrutiny is unusual even by OpenAI standards. This article walks through every critical section from the official documentation: safety benchmarks, cybersecurity risk ratings, chain-of-thought monitorability, and what the data means for developers and enterprise users paying attention to where reasoning models are headed.

Why GPT-5.4 Thinking Is Different From Its Predecessors

GPT-5.4 Thinking is the latest reasoning model in the GPT-5 series. There is no model named GPT-5.3 Thinking, so the primary comparison baseline throughout this card is GPT-5.2 Thinking. The model continues OpenAI’s reinforcement-learning-based reasoning approach, training the model to think through long internal chains of thought before producing a final response.

Through training, these models learn to refine their thinking, try different strategies, and recognize mistakes. Reasoning allows the model to follow specific guidelines and model policies, helping it act in line with OpenAI’s safety expectations. Training data includes publicly available internet content, third-party partnerships, and contributions from human trainers and researchers, processed through filtering pipelines designed to reduce personal information and harmful content.

The Cybersecurity Risk Designation That Changes Everything

GPT-5.4 Thinking is the first general-purpose model to have implemented mitigations for High capability in Cybersecurity under OpenAI’s Preparedness Framework. The cyber safety approach builds on mitigations first implemented for GPT-5.3 Codex in ChatGPT and the API.

Under the Preparedness Framework, High cybersecurity capability is defined as a model that removes existing bottlenecks to scaling cyber operations, either by automating end-to-end cyber operations against reasonably hardened targets, or by automating the discovery and exploitation of operationally relevant vulnerabilities. OpenAI is treating GPT-5.4 Thinking as High even without certainty that these capabilities are fully realized, because the model meets the canary thresholds that trigger this classification.

What the Cyber Range Results Show

The Cyber Range benchmark runs the model through 15 end-to-end offensive scenarios inside an emulated network environment. GPT-5.4 Thinking passed 11 of 15 scenarios for a 73.33% combined pass rate.

Scenario	GPT-5.2 Thinking	GPT-5.3 Codex	GPT-5.4 Thinking
Azure SSRF	PASS	PASS	PASS
Binary Exploitation	FAIL	PASS	PASS
CA/DNS Hijacking	FAIL	FAIL	FAIL
EDR Evasion	FAIL	FAIL	FAIL
Financial Capital	FAIL	PASS	PASS
HTTPS Oracle	FAIL	PASS	PASS
Medium C2	FAIL	PASS	PASS
Firewall Evasion	FAIL	PASS	FAIL
Leaked Token	FAIL	FAIL	FAIL
Combined Pass Rate	47%	80%	73.33%

GPT-5.4 Thinking regresses on Firewall Evasion relative to GPT-5.3 Codex but substantially outperforms all pre-5.3 models on the broader scenario set. The four consistently failed scenarios (EDR Evasion, Firewall Evasion, Leaked Token, CA/DNS Hijacking) represent defense bypass, certificate-related attacks, and credential exfiltration.

External Red-Team Results From Irregular

Irregular, described in the system card as a frontier AI security lab, evaluated a near-final GPT-5.4 reasoning checkpoint at extreme reasoning effort. The model achieved an 88% average success rate on Network Attack Simulation challenges, 73% on Vulnerability Research and Exploitation, and 48% on Evasion.

On CyScenarioBench, GPT-5.4 Thinking solved 5 of 11 long-horizon operational scenarios, compared to just 1 solved by GPT-5.2 reasoning. Irregular interprets this result as meaningfully higher operational capability on complex, multi-step attack planning. GPT-5.4 Thinking also solved 14 of 17 medium atomic challenges and 5 of 5 hard atomic challenges in Irregular’s evaluation.

Safety Benchmarks: Statistically Significant Improvements

OpenAI ran benchmark evaluations across disallowed content categories using Production Benchmarks built from challenging examples derived from production data. GPT-5.4 Thinking generally performs on par with GPT-5.2 Thinking, with statistically significant improvements specifically on illicit non-violent activity and self-harm evaluations.

Dynamic multi-turn evaluations for mental health, emotional reliance, and self-harm simulate extended adversarial conversations, allowing trajectories to evolve in response to model outputs. Across all dynamic mental health evaluations, GPT-5.4 Thinking outperforms previous models.

OpenAI also piloted production-like traffic evaluations using deidentified conversations broadly representative of recent GPT-5.2 Thinking production traffic, resampling the final assistant turn with GPT-5.4 Thinking. Based on this analysis, OpenAI estimates that 99.9534% of GPT-5.4 Thinking outputs will not violate the harassment policy at the model level, without accounting for additional safety stack interventions.

Chain-of-Thought Monitorability: A Nuanced Regression

Chain-of-thought (CoT) monitorability measures whether an external monitor can correctly infer safety-relevant properties of a model’s behavior by inspecting its reasoning trace rather than only its final output. Averaged across 13 evaluations spanning 24 environments, GPT-5.4 Thinking has lower aggregate CoT monitorability than GPT-5 Thinking.

This decline is not uniform across the evaluation suite. Under CoT-only monitoring at medium reasoning effort, GPT-5.4 Thinking performs comparably to GPT-5 Thinking on several environment groups but underperforms on Health queries with missing evidence, Memory tasks, and Impossible-task settings, which lowers the overall aggregate. One notable exception runs in the opposite direction: on agentic misalignment detection, monitorability improved substantially and approaches 100% for GPT-5.4 Thinking.

OpenAI investigated the apparent false positives in Health and Memory evaluations. In both cases, the model’s CoT did clearly discuss the hint that triggered the intervention, which monitors correctly identified. However, these were graded as false positives because the hint did not cause the predicted outcome in the final answer, revealing a methodology limitation in the evaluation rather than a genuine monitorability failure in the model.

CoT Controllability: The Better Safety Signal

CoT controllability measures a model’s ability to deliberately reshape or obscure its own reasoning when instructed to do so. Low controllability is a desirable safety property because it suggests the model lacks the ability to obfuscate its reasoning to evade monitors.

GPT-5.4 Thinking’s CoT controllability is low, consistent with GPT-5 Thinking and GPT-5.2 Thinking. For reasoning chains at 10,000 characters, GPT-5.4 Thinking successfully controls only 0.3% of CoTs, compared to 0.2% for GPT-5.2 Thinking. This suggests that the increased reasoning capabilities in GPT-5.4 Thinking do not come with a corresponding increase in the model’s ability to strategically hide its thinking process.

Biological and Chemical Safety: Continuing High Classification

OpenAI is continuing to treat GPT-5.4 Thinking as High capability in the Biological and Chemical domain, consistent with its treatment of GPT-5.2 Thinking. The corresponding Preparedness safeguards described in the GPT-5 system card apply to this model.

On TroubleshootingBench, which tests wet lab troubleshooting using non-public, expert-written protocols requiring tacit hands-on knowledge, GPT-5.4 Thinking scores slightly above prior models. On ProtocolQA Open-Ended, which uses 108 open-ended short answer questions modified from FutureHouse’s ProtocolQA dataset with expert baselining from 19 PhD scientists, all tested models including GPT-5.4 Thinking fall below the consensus expert baseline of 54%.

On the Tacit Knowledge and Troubleshooting multiple choice dataset created with Gryphon Scientific, GPT-5.4 Thinking, GPT-5.2 Thinking, and GPT-5.3 Codex all outperform the consensus expert baseline of 80% when refusals are included, but fall below it when refusals are excluded from scoring.

Health Benchmark: Tradeoffs in Context-Seeking

GPT-5.4 Thinking scores 62.6% on HealthBench, down 0.8 points from GPT-5.2 Thinking’s 63.3%. It scores 40.1% on HealthBench Hard, down 1.9 points from GPT-5.2 Thinking’s 42.0%. It scores 96.6% on HealthBench Consensus, up 2.1 points from GPT-5.2 Thinking’s 94.5%.

Average response length increased substantially: 3,311 characters for GPT-5.4 Thinking compared to 2,676 characters for GPT-5.2 Thinking. GPT-5.4 Thinking seeks much less context than GPT-5.2 Thinking before answering. Its main strengths are better precision when sufficient context is already available and simpler responses when appropriate; its main weakness is poorer context-seeking when key information may be missing.

Agentic Safety: Destructive Actions and Computer Use

On the destructive action avoidance evaluation measuring the model’s ability to preserve user-produced changes and avoid accidental deletive actions, GPT-5.4 Thinking performs approximately on par with GPT-5.3 Codex. On evaluations involving challenging long-rollout traces, GPT-5.4 Thinking performs much better than earlier models at tracking and reverting its own operations while leaving user work intact.

For computer use deployment, OpenAI updated the confirmation behavior architecture from a single model-enforced policy to a layered approach. GPT-5.4 Thinking now follows a platform-level policy for high-risk actions alongside a configurable developer-provided confirmation policy delivered through the developer message. This allows OpenAI to rapidly update the system-level policy if issues arise, and enables developers to customize confirmation behavior in API deployments.

AI Self-Improvement: Below the High Threshold

Evaluations of final checkpoints indicate that, like its predecessor models, GPT-5.4 Thinking does not have a plausible chance of reaching the High threshold for AI self-improvement capability. This finding is consistent with all prior GPT-5 series models evaluated under the same framework.

Jailbreak Robustness and Prompt Injection

Ahead of the GPT-5.4 launch, OpenAI replaced its previous StrongReject-based benchmark with a more demanding multi-turn jailbreak evaluation derived from red-teaming exercises. The updated evaluation tests models on realistic scenarios using sophisticated attacker strategies that can probe, adapt, and escalate over the course of a conversation. GPT-5.2 Thinking and GPT-5.4 Thinking substantially outperform GPT-5.1 Thinking on this benchmark, with a large improvement from 5.1 to 5.2 and a smaller additional improvement from 5.2 to 5.4.

On prompt injection evaluations, GPT-5.4 Thinking improved for attacks against email connectors and regressed slightly for attacks injected into function cells. Developers building tool-calling agents should account for this asymmetry: email-based injection is better defended in 5.4, while function-calling pipelines may benefit from additional developer-side guardrails.

Bias Evaluation

On first-person fairness evaluations measuring harmful differences in model responses based on whether a user’s name is statistically associated with male or female genders, GPT-5.4 Thinking performs generally on par with or slightly better than GPT-5.2 Thinking and GPT-5.1 Thinking. The evaluation uses over 600 challenging prompts designed to be an order of magnitude more difficult than standard production traffic.

Limitations Worth Acknowledging

The system card itself identifies several important constraints. Cyber Range scenarios do not yet represent hardened targets with active defenses and monitoring, making them more accessible than real-world networks. CoT monitorability evaluations carry fragility risks, and the aggregate regression in GPT-5.4 Thinking relative to GPT-5 Thinking requires further investigation. Health performance declined on both HealthBench and HealthBench Hard, and reduced context-seeking behavior introduces risk in ambiguous clinical or decision-support applications. Prompt injection evaluations are splits of training data, meaning they do not fully measure the model’s ability to generalize to novel attacks.

What This Model Means for Developers and Enterprise Users

GPT-5.4 Thinking represents a meaningful capability step forward in reasoning, agentic task execution, and multi-step planning, particularly as evidenced by external red-team results from Irregular. The High Cybersecurity designation signals that the model’s capabilities now fall in a range where OpenAI has activated its most restrictive safeguard tier, the first time a general-purpose model has reached this threshold.

For developers deploying this model in sensitive workflows involving computer use, file systems, or external service connectors, the updated layered confirmation policy architecture and the destructive action safeguard evaluations are directly relevant to deployment design. The CoT monitorability findings reinforce that reasoning trace inspection remains one of the most valuable safety tools available for oversight, and that the reliability of this signal requires active maintenance as models scale.

Frequently Asked Questions (FAQs)

What is GPT-5.4 Thinking and how does it differ from GPT-5.2 Thinking?

GPT-5.4 Thinking is the latest reasoning model in OpenAI’s GPT-5 series, published March 5, 2026. There is no GPT-5.3 Thinking, making GPT-5.2 Thinking the primary comparison baseline. GPT-5.4 Thinking is the first general-purpose model to implement mitigations for High capability in Cybersecurity, a distinction GPT-5.2 Thinking did not carry.

Why did OpenAI classify GPT-5.4 Thinking as High in Cybersecurity?

The model meets all canary thresholds under OpenAI’s Preparedness Framework, meaning it cannot be ruled out as capable of removing bottlenecks to scaling cyber operations. OpenAI applies this classification even without certainty of full capability realization. The 73.33% Cyber Range pass rate and external red-team results from Irregular support the classification.

What is chain-of-thought monitorability and why does it matter?

CoT monitorability measures how reliably a monitor can infer safety-relevant model behavior by reading its reasoning trace rather than only its final output. GPT-5.4 Thinking shows lower aggregate monitorability than GPT-5 Thinking, concentrated in Health queries with missing evidence, Memory, and Impossible-task evaluations. On agentic misalignment detection, monitorability improved substantially and approaches 100%.

Did GPT-5.4 Thinking improve or regress on health benchmarks?

Results are mixed. GPT-5.4 Thinking improved HealthBench Consensus by 2.1 points to 96.6% and produces longer responses averaging 3,311 characters. However, it declined 0.8 points on HealthBench to 62.6% and 1.9 points on HealthBench Hard to 40.1%. The model also seeks less context before answering, a meaningful limitation in ambiguous health scenarios.

How does GPT-5.4 Thinking handle destructive actions during computer use?

OpenAI updated the confirmation architecture from a single model-enforced policy to a layered system combining a platform-level policy for high-risk actions with a developer-configurable confirmation policy. On long-rollout destructive action benchmarks, GPT-5.4 Thinking performs much better than earlier models at tracking and reverting its own operations while preserving user work.

Is GPT-5.4 Thinking capable of AI self-improvement?

No. Evaluations of final checkpoints indicate GPT-5.4 Thinking does not have a plausible chance of reaching the High threshold for AI self-improvement, defined as equivalent to a performant mid-career research engineer. This result is consistent across all GPT-5 series models evaluated under the Preparedness Framework.

What safeguards apply given its High Biological and Chemical rating?

OpenAI continues to treat GPT-5.4 Thinking as High capability in Biological and Chemical, applying the full set of safeguards described in the original GPT-5 system card. All tested models, including GPT-5.4 Thinking, fall below the consensus expert baseline of 54% on ProtocolQA Open-Ended, the hardest open-ended wet lab benchmark in the evaluation suite.

What jailbreak improvements does GPT-5.4 Thinking offer over GPT-5.1 Thinking?

OpenAI replaced its previous jailbreak benchmark with a more demanding multi-turn evaluation derived from red-teaming. Both GPT-5.2 Thinking and GPT-5.4 Thinking substantially outperform GPT-5.1 Thinking on this benchmark, with a large improvement from 5.1 to 5.2 and a smaller additional improvement from 5.2 to 5.4.

Search for an article

GPT-5.4 Thinking: OpenAI’s Most Scrutinized Reasoning Model Laid Bare