Why SAST Tools Fail on Agentic Code (and How to Measure It)
Current static analysis tools weren't built for agentic systems where dangerous capabilities are features, not bugs. Here's a benchmark to measure what they actually catch.
AI agents run commands, fetch URLs, read files, and talk to databases. That's not a vulnerability. That's the product.
But point a SAST tool at an agentic codebase and watch what happens. It flags subprocess.run() as command injection. It flags requests.get() as SSRF. It flags every outbound call, every file read, every shell invocation. It doesn't understand that these are intentional capabilities behind proper guards, not exploitable sinks.
The result is noise. A wall of false positives that teaches developers to ignore their security scanner. And buried somewhere in that noise, the real vulnerability: the one tool that forgot to check its allowlist.
This is the fundamental problem with running traditional SAST on agentic code. The tools were built for a world where subprocess.run() appearing in your web app was almost certainly a bug. In agentic systems, it's a feature. The question isn't whether the call exists, it's whether it's guarded.
The Capability-Safe Blind Spot
Consider this Python code from an AI coding agent:
```python
import shlex
import subprocess

ALLOWED_COMMANDS = ["pytest", "ruff", "mypy"]

def run_command(cmd: str) -> str:
    normalized = shlex.split(cmd)
    binary = normalized[0]
    if binary not in ALLOWED_COMMANDS:
        raise PermissionError(f"Command not allowed: {binary}")
    result = subprocess.run(
        normalized,
        shell=False,
        capture_output=True,
        timeout=120,
    )
    return result.stdout[:500]
```
This code calls subprocess.run(). Every pattern-based SAST rule for command injection will flag it. But look closer:
- There's an explicit allowlist of three binaries
- Input is tokenized with shlex.split, not concatenated into a shell string
- shell=False means arguments are passed directly to the binary, not through sh -c
- Output is truncated and a timeout is enforced
This is what we call capability-safe code. The dangerous API is intentional, and the guards are correct. A scanner that flags this is producing a capability false positive, and in a codebase with dozens of guarded tool invocations, these false positives drown out the real findings.
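For contrast, the unguarded variant that every scanner should flag might look like this (a hypothetical sketch for illustration, not a case from the benchmark):

```python
import subprocess

def run_command_unsafe(cmd: str) -> str:
    # No allowlist, and shell=True hands the raw string to sh -c,
    # so attacker-controlled input like "; rm -rf /" would execute.
    # A scanner finding here is a true positive.
    result = subprocess.run(
        cmd,
        shell=True,
        capture_output=True,
        text=True,
        timeout=120,
    )
    return result.stdout[:500]
```

The two functions hit the same sink with the same API; only the guards differ. That difference is exactly what pattern-based rules can't see.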
The problem is that no existing SAST benchmark tests for this. Traditional benchmarks like Juliet or OWASP Benchmark only include vulnerable and not-vulnerable cases. They never ask: can your scanner recognize that a dangerous API call is properly guarded in the context of an agent that's supposed to have that capability?
Measuring What Matters
We built SASTbench to answer this question. It's an open-source benchmark specifically designed for evaluating SAST tools on agentic codebases.
The benchmark has two tracks:
- Core Track: 17 self-contained synthetic cases across Python, TypeScript, and Rust. No external dependencies, deterministic, fast to run.
- Full Track: 40 additional cases built from real-world disclosed vulnerabilities in open-source agentic projects (adding Swift to the language set), pinned at git snapshots with known CVEs.
How SASTbench Works
Cases flow through adapters to produce agentic-aware scores
Every case is defined by a case.json that specifies annotated source regions with line ranges, expected vulnerability kinds, and whether the scanner should or should not flag each region. The scoring engine compares scanner findings against these annotations using path and line-range overlap, then classifies each finding as a true positive, false positive, capability false positive, or false negative.
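In rough terms, that classification step could be sketched like this (field names are illustrative, not the exact case.json schema):

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Two inclusive line ranges overlap if neither ends before the other starts."""
    return a_start <= b_end and b_start <= a_end

def classify_finding(finding: dict, regions: list[dict]) -> str:
    """Classify one scanner finding against annotated regions.

    Each region carries a path, a line range, whether the scanner
    should flag it, and (for guarded code) a capability-safe marker.
    A sketch of the scoring idea only, under assumed field names.
    """
    for region in regions:
        if finding["path"] != region["path"]:
            continue
        if not overlaps(finding["startLine"], finding["endLine"],
                        region["startLine"], region["endLine"]):
            continue
        if region["shouldFlag"]:
            return "true_positive"
        if region.get("capabilitySafe"):
            return "capability_false_positive"
        return "false_positive"
    # A finding that lands outside every annotated region is noise.
    return "false_positive"
```

False negatives fall out the other way: any shouldFlag region with no overlapping finding is one.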
Three Types of Cases
What makes SASTbench different from existing benchmarks is the case types. Each tests a distinct dimension of scanner capability.
Four Case Types
Each tests a different dimension of scanner capability

```python
def fetch(url):
    # no validation
    return requests.get(url)
```

```python
if cmd not in ALLOWLIST:
    raise Denied(cmd)
subprocess.run(cmd, shell=False)
```

Synthetic Vulnerable cases contain real vulnerabilities in agentic code: an SSRF in a medical research agent's reference fetcher, command injection in a debug executor, path traversal in a file manager tool. The scanner must detect these. This is traditional SAST territory.
Capability Safe cases are the novel addition. These contain deliberately guarded dangerous operations: subprocess.run() behind an allowlist with shell=False, requests.get() with a host allowlist, file operations confined to a workspace root. Each case specifies the required guards (allowlist, workspace_root, host_allowlist, scheme_allowlist, parameterized_query, etc.) and the scanner must recognize these as safe. Any finding here is a capability false positive.
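A host-and-scheme guard of the kind these cases annotate might look like the following sketch (the hosts and the function name are hypothetical; the real cases ship their own implementations, and the actual network call is elided here to keep the sketch self-contained):

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"doi.org", "api.example.org"}  # hypothetical allowlist
ALLOWED_SCHEMES = {"https"}

def guarded_fetch(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise PermissionError(f"Scheme not allowed: {parsed.scheme}")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise PermissionError(f"Host not allowed: {parsed.hostname}")
    # requests.get(url, timeout=10) would go here; a scanner flagging
    # that call despite these guards is producing a capability false
    # positive under the benchmark's definitions.
    return f"fetched {url}"
```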
Mixed Intent cases combine both in a single codebase. One tool fetcher is allowlisted and safe. Another is unvalidated and vulnerable. Same codebase, same patterns, different intent. The scanner must flag the vulnerable tool without flagging the safe one. This tests intent discrimination, the ability to distinguish between "this code is dangerous" and "this code handles danger correctly."
A fourth type, Real World Disclosed, anchors these synthetic dimensions against actual CVEs from production agentic projects — covered in the next section.
This distinction between capability and vulnerability is what existing benchmarks miss entirely. Traditional benchmarks label code as "vulnerable" or "not vulnerable." SASTbench adds a third category: code that uses a dangerous API correctly, with formally annotated guards (requiredGuards in the case schema). This turns "does the scanner flag dangerous calls?" into the more useful question: "does the scanner understand which dangerous calls are defended?"
Real-World Disclosed Vulnerabilities
Synthetic cases test scanner logic in isolation, but they don't capture the messiness of real codebases — thousands of files, complex dependency trees, and vulnerabilities buried in context that only makes sense with the full project around it. So the Full Track includes 40 cases from real disclosed CVEs in open-source agentic projects, pinned at the exact vulnerable commit.
Full Track: 40 Real-World Cases
Pinned git snapshots from disclosed CVEs in production agentic projects
Each case references the GHSA advisory and CVE, specifies the vulnerable commit and fix commit, and annotates the exact file and line range where the vulnerability lives. This means you can validate your scanner against the same bugs that were disclosed in production.
The Agentic Score
Traditional SAST metrics (precision, recall) don't capture what matters for agentic code. A scanner with high recall but a high capability false positive rate will generate alert fatigue on every properly-built agent. So SASTbench introduces a composite metric:
The Agentic Score is the geometric mean of three components:
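The exact component definitions live in the benchmark's scoring code. Purely as an illustration of why a geometric mean is the right combiner, assume the three components are recall, precision, and a capability-safety rate (1 minus the capability false positive rate), each in [0, 1]:

```python
def agentic_score(recall: float, precision: float, capability_safety: float) -> float:
    """Geometric mean of three [0, 1] components.

    Component names are assumptions for illustration. The property that
    matters: a near-zero score on any one component (say, flagging every
    guarded capability) drags the composite toward zero, where an
    arithmetic mean would let strong recall paper over it.
    """
    return (recall * precision * capability_safety) ** (1 / 3)
```

For example, a scanner with recall 0.9 and precision 0.8 but capability safety 0.1 gets a composite well under its arithmetic average of 0.6.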
Why PR Mode Matters
Most real security scanning happens at PR time. A developer adds a new tool to an agent, and the scanner needs to decide: is this new capability properly guarded, or did they forget the allowlist?
PR Mode: Diff-Aware Scanning
Detect introduced vulnerabilities without flagging pre-existing code
SASTbench includes a PR simulation mode. For core track cases, it ships vendored base/head directory trees. For full track cases, it uses git commit pairs from the actual repository history (the commit that introduced the vulnerability and the commit that fixed it).
The scanner gets both trees plus the diff and must identify vulnerabilities introduced in the change without flagging pre-existing code or properly guarded new capabilities. PR mode measures:
This matters because agentic PRs frequently add new capabilities. A scanner that flags every new subprocess.run() or fetch() call regardless of context creates friction that slows down development without improving security.
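A naive version of that introduced-vs-pre-existing split might look like this (a sketch only: it matches findings by rule and path, since exact line numbers shift across a diff, and it ignores renames and moved code that real diff-aware scoring must handle):

```python
def split_findings(base_findings: list[dict], head_findings: list[dict]):
    """Split head-scan findings into introduced vs pre-existing.

    A finding counts as pre-existing if a finding with the same rule
    and path already appeared in the base scan. Dict keys follow the
    adapter's normalized finding shape.
    """
    base_keys = {(f["ruleId"], f["path"]) for f in base_findings}
    introduced, preexisting = [], []
    for f in head_findings:
        if (f["ruleId"], f["path"]) in base_keys:
            preexisting.append(f)
        else:
            introduced.append(f)
    return introduced, preexisting
```

Only the introduced list should be scored against the change; penalizing the pre-existing list is exactly the friction PR mode is designed to measure.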
Plugging In Your Scanner
SASTbench is scanner-agnostic. You write a Python adapter module with two functions:
```python
from pathlib import Path

def get_version() -> str:
    """Return your scanner's version string."""

def scan(scan_root: Path, language: str) -> list[dict]:
    """Run your scanner and return normalized findings."""
    return [
        {
            "ruleId": "your-rule-id",
            "mappedKind": "command_injection",  # one of 6 canonical kinds
            "path": "tools/executor.py",
            "startLine": 21,
            "endLine": 55,
            "severity": "high",
            "message": "Unsanitized input passed to subprocess",
        }
    ]
```
Findings are normalized to six canonical vulnerability kinds that map to the OWASP Agentic Top 10: command_injection, path_traversal, ssrf, auth_bypass, authz_bypass, and sql_injection. This lets the benchmark compare scanners that use different rule taxonomies on equal footing.
The benchmark ships with adapters for Semgrep and Bandit. Writing a new adapter typically takes an afternoon.
The harness validates every case schema, captures per-finding audit trails with raw scanner output, and exposes a stable adapter interface, so results are reproducible rather than hand-curated.
What We Need as a Community
Agentic systems are shipping to production. The security tooling hasn't caught up. Pattern-matching SAST rules written for traditional web apps don't understand that an AI agent calling subprocess.run() might be doing exactly what it's supposed to do.
We need scanners that understand intent: not just "is this API dangerous?" but "is this dangerous API properly guarded for its intended use?" We need benchmarks that measure this capability, not just vulnerability detection. And we need the data to know which tools actually work on the codebases we're building today.
SASTbench is our attempt at providing that measurement. The benchmark, the cases, the scoring, and the adapter interface are all open source. Run it against your scanner. Write an adapter. Add cases from your own disclosed vulnerabilities. File issues when the scoring gets it wrong.
The gap between what SAST tools test for and what agentic code actually needs is wide. The only way to close it is to measure it.