
GitHub Security Lab’s AI Framework Found 80+ Real Vulnerabilities in Open Source Code


Essential Points

  • GitHub Security Lab’s AI framework has reported over 80 vulnerabilities across 40 open source repositories, with approximately 20 already publicly disclosed as of March 2026
  • Business logic issues had the highest confirmed vulnerability rate at 25%, with IDOR/Access control producing the largest absolute count of confirmed findings at 38
  • The Rocket.Chat finding allowed login as any user with a password set, caused by a missing await keyword that made a Promise always evaluate as truthy
  • The framework is fully open source, runs via GitHub Codespaces, and requires a paid GitHub Copilot license to use premium model requests

An AI tool read TypeScript code across multiple files, followed an unawaited Promise through the call stack, and concluded that any password would unlock any Rocket.Chat account. GitHub Security Lab’s open source Taskflow Agent framework, published March 6, 2026, automates the kind of deep security auditing that typically requires expert human review. This article breaks down exactly how it works, what it found, and how to run it on your own codebase.

What the GitHub Security Lab Taskflow Agent Actually Does

The seclab-taskflow-agent is not a simple CVE scanner. It uses large language models to perform full security code audits by simulating the reasoning process of a human penetration tester. The framework divides a repository into functional components, maps entry points and intended privilege, models threats, and then runs a structured audit.

The tool operates on YAML-based taskflow files that chain LLM prompts sequentially. Each task receives only the context it needs, which keeps focus tight and reduces hallucinations. This modular design also enables custom taskflows targeting specific vulnerability classes.
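The chaining idea can be sketched in a few lines. This is a hypothetical illustration of scoped sequential tasks, not the framework's actual YAML schema or runtime; the task names and fields are invented:

```typescript
// Hypothetical sketch of sequential task chaining with scoped context.
// Each task declares exactly which prior outputs it needs, so no step
// sees more context than necessary (illustrative only).
interface Task {
  name: string;
  inputs: string[]; // names of earlier tasks whose output this task consumes
  run: (context: Record<string, string>) => string;
}

function runTaskflow(tasks: Task[]): Record<string, string> {
  const outputs: Record<string, string> = {};
  for (const task of tasks) {
    // Pass only the declared inputs, keeping each step's context tight.
    const scoped: Record<string, string> = {};
    for (const key of task.inputs) scoped[key] = outputs[key];
    outputs[task.name] = task.run(scoped);
  }
  return outputs;
}

// Usage: threat model feeds suggestion, suggestion feeds audit.
const result = runTaskflow([
  { name: "threat_model", inputs: [], run: () => "entry points: /login, /orders" },
  { name: "suggest", inputs: ["threat_model"], run: (c) => `candidates for ${c.threat_model}` },
  { name: "audit", inputs: ["suggest"], run: (c) => `verified: ${c.suggest}` },
]);
```

The design choice this illustrates is isolation: because the audit step only ever sees the suggestion output, it cannot be biased by the full conversation history of earlier stages.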

The Three-Stage Audit Pipeline Explained

Every audit runs through three distinct phases before anything is flagged as a confirmed vulnerability.

Stage 1: Threat Modeling

The LLM inspects the repository’s source code and documentation, then divides it into components along distinct security boundaries so each can be audited separately. It identifies entry points, HTTP methods, user actions, and the intended privilege level of each component. This context prevents false positives, such as flagging SSRF in a reverse proxy that is intentionally designed to forward requests.

Stage 2: Issue Suggestion

With the threat model loaded, the LLM suggests types of vulnerabilities most likely to appear per component, based on entry point exposure and intended use. The model is explicitly instructed not to audit at this stage, only to suggest. This separation preserves the integrity of the next stage’s triage process.

Stage 3: Issue Audit

A fresh LLM context receives the suggestions as unvalidated hypotheses and audits each one against the actual source code. The model must produce:

  • A realistic attack scenario with file paths and line numbers
  • Evidence of privilege gain not intended by the component’s design
  • A conclusion that explicitly allows “no vulnerability” as a valid outcome

This three-stage design produced 19 impactful findings from 91 deduplicated issues across 40 repositories, with the majority rated high or critical severity.
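The output contract of the audit stage can be modeled as a result type. The type and field names below are hypothetical, chosen to illustrate the three requirements listed above; the real framework defines its own schema:

```typescript
// Hypothetical types illustrating the audit-stage output contract
// (illustrative only, not the framework's actual schema).
interface AttackScenario {
  description: string;
  filePath: string;   // evidence must point at concrete code
  lineNumber: number;
}

type AuditConclusion =
  | { kind: "vulnerable"; scenario: AttackScenario; privilegeGained: string }
  | { kind: "no_vulnerability"; reason: string }; // explicitly a valid outcome

// Toy audit step: stage-2 hypotheses arrive unvalidated, and each must
// be judged against actual code evidence rather than assumed true.
function auditHypothesis(
  hypothesis: string,
  evidence: AttackScenario | null
): AuditConclusion {
  if (evidence === null) {
    return { kind: "no_vulnerability", reason: `No code path supports: ${hypothesis}` };
  }
  return { kind: "vulnerable", scenario: evidence, privilegeGained: hypothesis };
}
```

Making "no vulnerability" a first-class outcome matters: a model forced to always confirm something would inflate the false-positive rate.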

Three Real Vulnerabilities This Tool Discovered

Privilege Escalation in Outline (CVE-2025-64487)

Outline is a multi-user collaborative document platform with complex per-user and per-team permissions. The AI taskflow identified that the documents.add_group and documents.remove_group endpoints authorized using the weaker update permission instead of the required manageUsers permission. A user with only ReadWrite membership on a document could add any group, including granting Admin-level access to themselves if they were a member of that group. With Admin membership, the attacker could then archive, delete, move, or manage arbitrary users on the document. The Outline project fixed this and a second reported issue within three days.
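The bug class reduces to checking the wrong permission level before a membership change. The following sketch is a hypothetical reconstruction of that pattern, not Outline’s actual source; the permission names mirror those in the report:

```typescript
// Hypothetical reconstruction of the authorization bug class
// (illustrative only, not Outline's actual code).
type Permission = "read" | "update" | "manageUsers";

function can(userPerms: Permission[], required: Permission): boolean {
  return userPerms.includes(required);
}

// Buggy: a ReadWrite collaborator (who holds "update") may add any group,
// including one that grants Admin-level access back to themselves.
function addGroupBuggy(userPerms: Permission[]): boolean {
  return can(userPerms, "update"); // BUG: should require "manageUsers"
}

// Fixed: membership changes demand the stronger permission.
function addGroupFixed(userPerms: Permission[]): boolean {
  return can(userPerms, "manageUsers");
}
```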

Ecommerce Cart Data Exposure (CVE-2025-15033 and CVE-2026-25758)

In the PHP-based WooCommerce project, the taskflows found a way for signed-in users to view all guest orders, including names, addresses, and phone numbers. After Automattic patched that issue, the team extended scans to other ecommerce platforms. The Ruby-based Spree commerce application contained two similar vulnerabilities. The more critical one (CVE-2026-25758) allowed unauthenticated users to enumerate guest order addresses and phone numbers by incrementing a sequential number in the request. These authorization logic bugs had been undiscovered for years.
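The enumeration pattern behind CVE-2026-25758 is a classic IDOR: a sequential identifier with no ownership or token check. This sketch is a hypothetical illustration of that pattern, not Spree’s actual code (which is Ruby):

```typescript
// Hypothetical IDOR sketch (illustrative only, not Spree's actual code):
// guest orders keyed by a sequential id with no access check.
interface Order { id: number; guestEmail: string; address: string; phone: string }

const orders: Order[] = [
  { id: 1, guestEmail: "a@example.com", address: "1 Main St", phone: "555-0001" },
  { id: 2, guestEmail: "b@example.com", address: "2 Oak Ave", phone: "555-0002" },
];

// Buggy: any caller can walk ids 1, 2, 3, ... and read every guest order.
function getOrderBuggy(id: number): Order | undefined {
  return orders.find((o) => o.id === id);
}

// Fixed: the caller must present a per-order credential (here simulated
// by the guest email) before any data is returned.
function getOrderFixed(id: number, token: string): Order | undefined {
  const order = orders.find((o) => o.id === id);
  return order && order.guestEmail === token ? order : undefined;
}
```

Pattern-based scanners rarely catch this, because every individual line is syntactically unremarkable; only reasoning about who should be allowed to see the record reveals the flaw.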

Rocket.Chat Password Authentication Bypass (CVE-2026-28514)

The AI traced a missing await keyword across multiple TypeScript files and correctly concluded that validatePassword() returned a Promise<boolean>, not a boolean. Since a Promise object is always truthy in JavaScript, the if (!valid) return false check never triggered when a bcrypt hash existed. Any password succeeded for any account with a bcrypt password set. Once authenticated, the attacker could also connect to arbitrary chat channels and read messages sent to them. The LLM identified this subtle async bug entirely through code comprehension across multiple files.
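The failure mode is easy to reproduce in miniature. The sketch below demonstrates the Promise-truthiness bug class with invented names; it is not Rocket.Chat’s actual code, and the async comparison stands in for a real bcrypt check:

```typescript
// Minimal sketch of the missing-await bug class
// (hypothetical names, not Rocket.Chat's actual code).
async function validatePassword(input: string, stored: string): Promise<boolean> {
  // Stand-in for an async bcrypt comparison.
  return input === stored;
}

// Buggy caller: without `await`, `valid` is a Promise<boolean>,
// and any Promise object is truthy, so the rejection branch never runs.
function checkLoginBuggy(input: string, stored: string): boolean {
  const valid = validatePassword(input, stored); // BUG: missing await
  if (!valid) return false; // never taken: a Promise is always truthy
  return true; // any password "succeeds"
}

// Fixed caller: awaiting resolves the Promise to the real boolean.
async function checkLoginFixed(input: string, stored: string): Promise<boolean> {
  const valid = await validatePassword(input, stored);
  if (!valid) return false;
  return true;
}
```

The one-token diff between the two callers is why the bug survived review: the buggy version type-checks, runs without errors, and fails only in its security outcome.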

How to Run It on Your Own Repository

Running the framework requires a GitHub Copilot license and takes under five minutes to configure.

  1. Go to the seclab-taskflows repository and start a GitHub Codespace
  2. Wait a few minutes for the Codespace to initialize
  3. Run ./scripts/audit/run_audit.sh myorg/myrepo in the terminal
  4. Wait 1 to 2 hours for a medium-sized codebase to complete
  5. Open the SQLite viewer that launches automatically and filter the audit_results table for rows with a check in the has_vulnerability column

Running the audit twice is recommended because LLMs are non-deterministic and a second run can produce entirely different results. For better coverage, GitHub recommends using different models across runs, for example GPT 5.2 on one pass and Claude Opus 4.6 on another. The framework also works on private repositories, but the Codespace configuration must be modified to allow private repo access, as it is not permitted by default.

What the Data Reveals About AI Security Auditing

After processing 1,003 suggested issues across 40 repositories, the audit stage marked 139 as confirmed vulnerabilities. Manual review deduplicated those to 91 unique findings. Of those 91:

  • 19 (21%) were kept as impactful enough to report, all serious with the majority rated high or critical severity
  • 20 (22%) were rejected as false positives that could not be reproduced manually
  • 52 (57%) were rejected as low severity, such as blind SSRF returning only an HTTP status code

This data was collected using gpt-5.x as the model for code analysis and audit tasks.

Issue Category             All Suggested   Confirmed Vulnerable   Confirmation Rate
IDOR / Access Control      241             38                     15.8%
XSS                        131             17                     13.0%
CSRF                       110             17                     15.5%
Authentication Issue       91              15                     16.5%
Security Misconfiguration  75              13                     17.3%
Path Traversal             61              10                     16.4%
SSRF                       45              7                      15.6%
Command Injection          39              5                      12.8%
Business Logic Issue       24              6                      25.0%
Remote Code Execution      24              1                      4.2%
Template Injection         24              1                      4.2%
File Upload Handling       18              2                      11.1%
Insecure Deserialization   17              0                      0.0%
Open Redirect              16              0                      0.0%
SQL Injection              9               0                      0.0%
Sensitive Data Exposure    8               0                      0.0%
XXE                        4               0                      0.0%
Memory Safety              3               0                      0.0%
Others                     66              7                      10.6%

Why LLMs Excel at Logic Bugs

The IDOR/Access Control category alone produced more confirmed findings (38) than the next two categories combined (XSS and CSRF at 17 each). Business logic issues had the highest confirmation rate at 25%. This performance reflects the LLM’s strength in understanding intended code behavior, following control flow across files, and reasoning about what a user should versus should not be permitted to do. These are precisely the capabilities that traditional static analysis tools lack.

SQL injection, memory safety, and insecure deserialization produced zero confirmed findings. This aligns with expectations: pattern-matching tools and fuzzers cover those categories more effectively. The framework’s clear strength is authorization logic, authentication flows, and access control, where understanding intent matters more than recognizing syntax patterns.

Where the Tool Has Limits

The framework struggles most with threat modeling of desktop applications, because it is often unclear whether other processes running on a user’s desktop should be treated as trusted or untrusted. When multiple layers of authentication are in place, the LLM can sometimes miss nested permission checks deeper in the call stack, leading to false positives.

None of the 20 false positives were hallucinations. All had sound code-level evidence, and the researchers could follow each report to locate the relevant endpoints and test suggested payloads. The false positives reflected genuine complexity, such as browser-side XSS mitigations not visible in server code, rather than fabricated findings.

How This Changes Developer Security Workflows

The framework complements existing SAST tools rather than replacing them. Running both in parallel covers a wider spectrum: traditional tools for pattern-based vulnerability classes, and taskflows for novel logic flaws that require contextual reasoning. The open source design also enables community contributions, allowing new taskflow files to target specific vulnerability types, compliance frameworks, or language-specific patterns over time.

GitHub Security Lab continues to update its advisories page as new disclosures are made, building a growing public record of AI-assisted security research findings.

Frequently Asked Questions (FAQs)

What is the GitHub Security Lab Taskflow Agent?

It is an open source AI framework that uses LLMs to automate security code audits through a three-stage pipeline covering threat modeling, issue suggestion, and issue verification. It has reported over 80 vulnerabilities across 40 open source repositories as of March 2026.

Do I need to pay to use the seclab-taskflow-agent?

The framework itself is free and open source on GitHub. However, a paid GitHub Copilot license is required because the auditing taskflows consume premium model requests to run LLM-based analysis. The cost depends on repository size and the number of runs performed.

How long does a full security audit take?

A medium-sized repository typically takes 1 to 2 hours to complete one full audit run. GitHub recommends running the audit at least twice because LLMs are non-deterministic and a second pass can surface entirely different vulnerability candidates.

What types of vulnerabilities does this AI framework find best?

The framework performs strongest on logic-based flaws including IDOR, authentication bypasses, privilege escalation, and business logic gaps. Business logic issues had the highest confirmation rate at 25%. SQL injection, memory safety, and insecure deserialization produced zero confirmed findings, where traditional tooling remains more effective.

Can the framework scan private repositories?

Yes, but the default GitHub Codespace configuration does not allow access to private repositories. You need to modify the Codespace settings to grant the necessary permissions before pointing the tool at a private codebase.

How accurate are the findings?

Of 91 manually reviewed findings, 21% were reported as impactful vulnerabilities (majority high or critical severity), 22% were false positives, and 57% were low severity issues. Importantly, none of the false positives were hallucinations; all had verifiable code-level evidence.

Which AI models were used in GitHub’s testing?

GitHub’s data table was collected using gpt-5.x as the model for code analysis and audit tasks. For best coverage, GitHub recommends using two different models across two runs, specifically GPT 5.2 on one pass and Claude Opus 4.6 on another.

Mohammad Kashif
Senior Technology Analyst and Writer at AdwaitX, specializing in the convergence of Mobile Silicon, Generative AI, and Consumer Hardware. Moving beyond spec sheets, his reviews rigorously test "real-world" metrics analyzing sustained battery efficiency, camera sensor behavior, and long-term software support lifecycles. Kashif’s data-driven approach helps enthusiasts and professionals distinguish between genuine innovation and marketing hype, ensuring they invest in devices that offer lasting value.
