
OpenAI Codex Security Rejects SAST: The Real Reason Behind a Bold Design Choice


Essential Points

  • Codex Security starts from the repository itself, not a SAST findings list, to avoid premature narrowing of investigation scope
  • The hardest vulnerabilities are not dataflow problems but failures in whether a security check actually guarantees the property the system relies on
  • A DryRun Security study found 26 of 30 AI-generated pull requests contained at least one vulnerability, an 87% rate, with logic and authorization flaws dominating
  • DryRun’s 2025 SAST Accuracy Report found its contextual analysis tool identified 88% of seeded vulnerabilities, with the largest performance gap on logic-level findings

OpenAI published a formal explanation on March 16, 2026, for why Codex Security excludes Static Application Security Testing (SAST) reports as a starting point for its agent. The decision is a deliberate architectural choice grounded in what actually determines whether a vulnerability exists.

What SAST Does Well and Where It Stops

SAST is built around an elegant model: identify a source of untrusted input, trace data through the program, and flag cases where that data reaches a sensitive sink without sanitization. It covers a large class of real bugs and remains effective for enforcing secure coding standards and catching known patterns at scale.
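The source-to-sink model is easiest to see on the textbook case. Here is a minimal Python sketch of the pattern SAST rules are designed to flag, and the parameterized fix they recognize (the schema and function names are illustrative):

```python
import sqlite3

def find_user(conn, username):
    # `username` is the untrusted source; the f-string carries it
    # straight into the SQL sink without sanitization. A taint rule
    # flags exactly this flow.
    query = f"SELECT id FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, username):
    # Parameterized query: the driver treats the value as data, never
    # as SQL, which is the "sanitized" pattern SAST rules look for.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchall()
```

A payload like `' OR '1'='1` turns the first query into `WHERE name = '' OR '1'='1'`, matching every row, while the parameterized version matches none.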

In practice, SAST has to make approximations to stay tractable across real codebases with indirection, dynamic dispatch, callbacks, reflection, and framework-heavy control flow. Those approximations are not a flaw in SAST design but an inherent constraint of reasoning about code without executing it.

The Core Problem: Whether the Defense Actually Works

OpenAI’s central argument goes beyond dataflow coverage. Even when a SAST tool correctly traces input across multiple functions and layers, it still has to answer the question that determines whether a vulnerability actually exists: did the defense really work?

Consider a sanitizer call before rendering untrusted content. A static analyzer can confirm the sanitizer ran. What it cannot determine is whether that sanitizer is sufficient for the specific rendering context, template engine, encoding behavior, and downstream transformations involved. The difference between “the code calls a sanitizer” and “the system is safe” is precisely where the most critical vulnerabilities live.
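A small Python sketch makes the gap concrete. The sanitizer here is the standard library's `html.escape`, which is correct for HTML text content but does nothing for a URL attribute context (the template line is illustrative):

```python
import html

# html.escape neutralizes markup injection in HTML *text* content:
assert html.escape("<script>") == "&lt;script&gt;"

# But a javascript: URL contains none of the characters it escapes,
# so the "sanitized" value passes through unchanged...
user_link = "javascript:alert(1)"
escaped = html.escape(user_link)
assert escaped == "javascript:alert(1)"

# ...and if the template drops it into an href attribute, clicking
# the link still executes script. The sanitizer ran; the system is
# not safe for this rendering context.
rendered = f'<a href="{escaped}">profile</a>'
print(rendered)
```

A static analyzer that checks only "was a sanitizer called on this path" marks this flow as clean.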

Order of Operations: A Real-World Example

OpenAI uses a concrete pattern to illustrate the problem. A web application receives a JSON payload, extracts a redirect URL, validates it against an allowlist regex, URL-decodes it, and passes the result to a redirect handler.

A standard source-to-sink SAST report describes the flow cleanly as: untrusted input, regex check, URL decode, redirect. But the real question is whether the regex check still constrains the value after the URL decoding transformation that follows it. Answering that requires reasoning about the entire transformation chain: what the regex allows, how decoding and normalization behave, how URL parsing treats edge cases, and how the redirect logic resolves schemes and authorities.
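The same order-of-operations flaw can be sketched in a few lines of Python. The regex and function names are hypothetical, not the actual vulnerable code; the point is that validation happens on the still-encoded value and decoding happens afterward:

```python
import re
from urllib.parse import unquote

# Hypothetical allowlist: "relative paths only" -- word characters,
# percent signs (for encoded bytes), dots, slashes, hyphens.
SAFE_PATH = re.compile(r"^[\w%./-]+$")

def redirect_target(raw: str):
    # Step 1: validate the still-encoded value against the allowlist.
    if not SAFE_PATH.fullmatch(raw):
        return None
    # Step 2: URL-decode AFTER validation -- this ordering is the flaw.
    return unquote(raw)

# An encoded payload passes the regex character-by-character...
payload = "%2F%2Fevil.example"
# ...but decodes to a protocol-relative URL, which browsers resolve
# as an absolute redirect to evil.example.
print(redirect_target(payload))  # -> //evil.example
```

The dataflow report for this code looks clean: untrusted input, regex check, decode, redirect. The constraint the regex established simply no longer holds after the decode.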

This is not a theoretical scenario. CVE-2024-29041 affected Express.js through an open redirect issue where malformed URLs bypassed common allowlist implementations because of how redirect targets were encoded and then interpreted. The dataflow was visible. The weakness was in how constraints propagated through the transformation chain.

How Codex Security Approaches Vulnerability Discovery

Rather than triaging pre-generated findings, Codex Security starts from the repository’s architecture, trust boundaries, and intended behavior. When the system encounters something that looks like validation or sanitization, it does not treat that as a checkbox. It attempts to understand what the code is trying to guarantee and then tries to falsify that guarantee.

In practice, that involves four methods working together:

  • Reading the relevant code path with full repository context, the way a security researcher would, looking for mismatches between intent and implementation
  • Reducing the problem to the smallest testable slice, such as the transformation pipeline around a single input, then writing micro-fuzzers for that slice
  • Reasoning about constraints across transformations rather than treating each check independently, including using z3-solver for integer overflows or complex input constraint problems
  • Executing hypotheses in a sandboxed validation environment to distinguish “this could be a problem” from “this is a problem,” with full end-to-end proof-of-concept artifacts

The key shift is moving from “a check exists” to “the invariant holds or it does not, and here is the evidence.”
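The micro-fuzzing step can be sketched against the validate-then-decode pipeline from the earlier redirect example. This is an illustrative toy, not Codex Security's internals: state the invariant the code assumes, generate inputs over the pipeline's own alphabet, and collect counterexamples:

```python
import random
import re
import string
from urllib.parse import unquote

# Toy slice under test: a hypothetical "relative paths only" allowlist
# applied before URL decoding.
SAFE_PATH = re.compile(r"^[\w%./-]+$")

def invariant_holds(raw: str) -> bool:
    """Claimed invariant: any input the allowlist accepts is still a
    same-origin relative path after URL decoding."""
    if not SAFE_PATH.fullmatch(raw):
        return True  # rejected inputs are out of scope
    decoded = unquote(raw)
    return not decoded.startswith("//") and ":" not in decoded

def micro_fuzz(trials: int = 100_000, seed: int = 0) -> list:
    alphabet = string.ascii_letters + string.digits + "%./-"
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        candidate = "".join(
            rng.choice(alphabet) for _ in range(rng.randint(2, 12))
        )
        if not invariant_holds(candidate):
            failures.append(candidate)
    return failures

counterexamples = micro_fuzz()
print(len(counterexamples), counterexamples[:3])
```

Even this crude generator surfaces inputs like `//...` prefixes and `%3A`-encoded colons that pass the allowlist but violate the invariant, turning "this check looks weak" into concrete evidence.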

Why Seeding an AI Agent with SAST Output Backfires

OpenAI identified three specific failure modes that arise when an agent starts from a SAST report.

First, it encourages premature narrowing. A findings list is a map of where a tool already looked. Starting there biases the agent toward the same regions, using the same abstractions, and missing entire classes of issues that do not fit the prior tool’s worldview.

Second, it introduces implicit judgments that are difficult to unwind. Many SAST findings encode assumptions about sanitization, validation, or trust boundaries. If those assumptions are incomplete, feeding them into the reasoning loop shifts the agent from “investigate” to “confirm or dismiss.”

Third, it makes it harder to evaluate the reasoning system itself. If the pipeline starts with SAST output, it becomes impossible to separate what the agent discovered through independent analysis from what it inherited from another tool, which is necessary for the system to improve over time.

What Independent Research Shows About AI Agent Security Gaps

A March 2026 study by DryRun Security, reported by Help Net Security, tested Claude Code with Sonnet 4.6, OpenAI Codex with GPT-5.2, and Google Gemini with 2.5 Pro across 38 scans covering 30 pull requests. The three agents produced 143 security issues, and 26 of those 30 pull requests contained at least one vulnerability, an 87% rate.

The recurring vulnerability patterns across all three agents were broken access control, business logic failures where game scores and balances were accepted from the client without server-side validation, and OAuth implementation failures including missing state parameters in every social login implementation. Most notably, WebSocket authentication was missing from every final game codebase. All three agents wired REST authentication middleware correctly but failed to extend it to the WebSocket upgrade handler, a logic-level gap that appeared in every final scan regardless of which agent wrote the code.
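The WebSocket gap is structural rather than framework-specific. A toy sketch (no real framework; the handlers and request shape are invented for illustration) shows how auth middleware can guard every REST route while the WebSocket upgrade path never passes through it:

```python
def require_auth(handler):
    """Auth middleware: reject requests with no authenticated user."""
    def wrapped(request):
        if not request.get("user"):
            return {"status": 401}
        return handler(request)
    return wrapped

@require_auth
def get_scores(request):
    # REST route: correctly wrapped by the middleware.
    return {"status": 200, "scores": [100, 250]}

def websocket_upgrade(request):
    # WebSocket upgrade handler: registered through a separate code
    # path, so the auth check never runs -- the gap DryRun observed
    # in every final game codebase.
    return {"status": 101}

print(get_scores({"user": None}))        # rejected
print(websocket_upgrade({"user": None})) # accepted without auth
```

No tainted value reaches a sink here; a pattern-based scanner that confirms "auth middleware exists and is applied to routes" sees nothing wrong.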

The study also confirmed the SAST coverage gap directly. DryRun noted that pattern-based static analysis tools do not trace whether middleware is mounted, whether authentication policies apply to every connection type, or whether server-side validation actually runs. DryRun’s own 2025 SAST Accuracy Report found its contextual analysis tool identified 88% of seeded vulnerabilities across four application stacks, with the largest gap on logic-level findings.

What the Semgrep Benchmark Reveals About Codex and SAST-Style Detection

A September 2025 study by Semgrep researchers evaluated Codex (v0.2.0, o4-mini) and Claude Code (v1.0.32, Sonnet 4) against 11 large, real-world Python web applications spanning Django, Flask, and FastAPI. Together the two agents produced over 400 security findings, all manually triaged.

| Vulnerability Class | Codex True Positive Rate | Claude Code True Positive Rate |
| --- | --- | --- |
| Auth Bypass | 13% (5/37) | 10% (6/58) |
| IDOR | 0% (0/5) | 22% (13/59) |
| Path Traversal | 47% (8/17) | 13% (5/36) |
| SQL Injection | 0% (0/5) | 5% (2/38) |
| SSRF | 34% (8/23) | 12% (8/65) |
| XSS | 0% (0/28) | 16% (12/74) |

Overall, Codex reported 21 true vulnerabilities at an 18% true positive rate, while Claude Code found 46 at a 14% true positive rate. Codex performed best on Path Traversal at 47% but produced zero correct IDOR, SQL Injection, or XSS findings. Semgrep attributed the weakness on injection classes directly to difficulty tracking taint flow across procedure boundaries, exactly the gap where SAST-style dataflow reasoning breaks down even when assisted by an AI agent.

The study also identified a significant non-determinism problem. Running the same prompt on the same codebase multiple times produced different results every time. In one application, three identical runs produced 3, 6, and then 11 distinct findings. This inconsistency creates risks for vulnerability management systems that assume a previously detected issue is fixed when it disappears from a scan.

Vulnerability Classes That Go Beyond Dataflow

OpenAI explicitly calls out a class of bugs that SAST cannot structurally address: state and invariant problems. These include workflow bypasses, authorization gaps, and failures where “the system is in the wrong state.” For these bugs, no tainted value reaches a dangerous sink. The risk is in what the program assumes will always be true.
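A minimal workflow-bypass sketch (hypothetical order model, invented for illustration) shows why no taint analysis can see this class. Every value is trusted; the bug is a missing guard on a state transition:

```python
class Order:
    """Toy order workflow: created -> paid -> shipped."""

    def __init__(self):
        self.state = "created"

    def pay(self):
        self.state = "paid"

    def ship(self):
        # BUG: nothing enforces the invariant "shipped implies paid".
        # The missing guard would be:
        #   if self.state != "paid":
        #       raise RuntimeError("order not paid")
        self.state = "shipped"

order = Order()
order.ship()  # payment step skipped entirely
print(order.state)  # the system is now in a state it assumed impossible
```

There is no source, no sink, and no dangerous data anywhere in this code, yet the system ships unpaid orders. Finding it requires knowing what the program assumes will always be true.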

This maps directly to what DryRun found in practice. The most persistent unresolved findings in its study were logic-level issues: a temporary token bypass that persisted through all of Codex’s final code, Claude introducing a 2FA-disable bypass not seen in any other agent’s build, and Gemini carrying OAuth CSRF and invite bypass issues through to the final scan. None of these are dataflow findings. All of them are constraint and invariant failures.

Limitations Worth Knowing

Codex Security is still in research preview as of March 2026, and full documentation is available at developers.openai.com/codex/security. The sandbox validation model adds compute overhead compared to a standard SAST run, and coverage across all language ecosystems and framework types is not yet complete.

OpenAI is also direct that SAST tools remain important for defense-in-depth. SAST is excellent at enforcing secure coding standards, catching known source-to-sink issues, and detecting patterns at scale with predictable tradeoffs. Codex Security targets the work that costs security teams the most: turning “this looks suspicious” into “this is real, here is how it fails, and here is a fix that matches system intent.”

Frequently Asked Questions (FAQs)

Why doesn’t OpenAI Codex Security include a SAST report?

OpenAI determined that starting from SAST output creates three failure modes: premature narrowing of investigation scope, inherited incorrect assumptions about security checks, and blurred attribution between the agent’s own findings and those inherited from the prior tool. Codex Security starts from the repository itself instead.

What is the core difference between SAST and Codex Security’s approach?

SAST traces data from untrusted sources to sensitive sinks. Codex Security asks a deeper question: does the security check in the code actually guarantee the property the system relies on? It then tries to falsify that guarantee using micro-fuzzers, constraint solving, and sandboxed proof-of-concept validation.

What is CVE-2024-29041 and why does OpenAI reference it?

CVE-2024-29041 is an open redirect vulnerability in Express.js where malformed URLs bypassed allowlist validation because of how redirect targets were encoded then decoded. OpenAI uses it as a concrete example of an order-of-operations flaw where the dataflow looks clean in a SAST report but the constraint fails in practice.

How accurate are AI coding agents at finding vulnerabilities today?

In Semgrep’s September 2025 benchmark across 11 real-world Python applications, OpenAI Codex achieved an 18% true positive rate overall, with strong performance on Path Traversal at 47% but zero correct findings for IDOR, SQL Injection, and XSS. Claude Code achieved a 14% overall true positive rate.

What security vulnerabilities do AI coding agents introduce most often?

DryRun Security’s March 2026 study found the most consistent patterns were broken access control with unauthenticated endpoints, business logic failures where client-supplied values were trusted without server-side validation, missing WebSocket authentication despite correct REST auth wiring, and weak JWT secret management with hardcoded fallback values.

Is SAST still worth using if Codex Security is available?

Yes. OpenAI explicitly states SAST is valuable for enforcing coding standards, detecting known bug patterns at scale, and providing predictable coverage with consistent tradeoffs. The two tools serve different purposes. SAST provides broad, fast pattern coverage. Codex Security provides deep, validated findings for complex and logic-level vulnerabilities.

How does Codex Security validate findings before showing them to developers?

The system attempts to reproduce each finding in an isolated sandboxed environment, compiling code in debug mode and generating full end-to-end proof-of-concept artifacts. A finding only reaches a developer’s queue after successful reproduction, which significantly reduces alert fatigue compared to unvalidated SAST output.

Why do AI coding agents miss WebSocket authentication even when REST authentication is correct?

DryRun’s study found all three tested agents wired REST authentication middleware correctly but did not connect that middleware to WebSocket upgrade handlers. This is a structural logic gap. The agents implemented the security policy for one connection type and did not reason about whether that policy needed to extend across all connection types in the same application.

Mohammad Kashif
Senior Technology Analyst and Writer at AdwaitX, specializing in the convergence of Mobile Silicon, Generative AI, and Consumer Hardware. Moving beyond spec sheets, his reviews rigorously test "real-world" metrics analyzing sustained battery efficiency, camera sensor behavior, and long-term software support lifecycles. Kashif’s data-driven approach helps enthusiasts and professionals distinguish between genuine innovation and marketing hype, ensuring they invest in devices that offer lasting value.
