Why SAST Tools Fail on Agentic Code (and How to Measure It)
Current static analysis tools weren't built for agentic systems where dangerous capabilities are features, not bugs. Here's a benchmark to measure what they actually catch.
AI agents run commands, fetch URLs, read files, and talk to databases. That's not a vulnerability. That's the product.
But point a SAST tool at an agentic codebase and watch what happens. It flags subprocess.run() as command injection. It flags requests.get() as SSRF. It flags every outbound call, every file read, every shell invocation. It doesn't understand that these are intentional capabilities behind proper guards, not exploitable sinks.
The result is noise. A wall of false positives that teaches developers to ignore their security scanner. And buried somewhere in that noise, the real vulnerability: the one tool that forgot to check its allowlist.
This is the fundamental problem with running traditional SAST on agentic code. The tools were built for a world where subprocess.run() appearing in your web app was almost certainly a bug. In agentic systems, it's a feature. The question isn't whether the call exists, it's whether it's guarded.
The Capability-Safe Blind Spot
Consider this Python code from an AI coding agent:
```python
import shlex
import subprocess

ALLOWED_COMMANDS = ["pytest", "ruff", "mypy"]

def run_command(cmd: str) -> str:
    normalized = shlex.split(cmd)
    binary = normalized[0]
    if binary not in ALLOWED_COMMANDS:
        raise PermissionError(f"Command not allowed: {binary}")
    result = subprocess.run(
        normalized,
        shell=False,
        capture_output=True,
        timeout=120,
    )
    return result.stdout[:500]
```
This code calls subprocess.run(). Every pattern-based SAST rule for command injection will flag it. But look closer:
- There's an explicit allowlist of three binaries
- Input is tokenized with shlex.split, not concatenated into a shell string
- shell=False means arguments are passed directly to the binary, not through sh -c
- Output is truncated and a timeout is enforced
This is what we call capability-safe code. The dangerous API is intentional, and the guards are correct. A scanner that flags this is producing a capability false positive, and in a codebase with dozens of guarded tool invocations, these false positives drown out the real findings.
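For contrast, the unguarded variant that every scanner should flag might look like this (a hypothetical sketch for illustration, not a case from the benchmark):

```python
import subprocess

def run_command_unsafe(cmd: str) -> str:
    # No allowlist, and shell=True hands the raw string to sh -c,
    # so attacker-controlled input like "; rm -rf /" would execute.
    # A scanner finding here is a true positive.
    result = subprocess.run(
        cmd,
        shell=True,
        capture_output=True,
        text=True,
        timeout=120,
    )
    return result.stdout[:500]
```

The two functions hit the same sink with the same API; only the guards differ. That difference is exactly what pattern-based rules can't see.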
The problem is that no existing SAST benchmark tests for this. Traditional benchmarks like Juliet or OWASP Benchmark only include vulnerable and not-vulnerable cases. They never ask: can your scanner recognize that a dangerous API call is properly guarded in the context of an agent that's supposed to have that capability?
Measuring What Matters
We built SASTbench to answer this question. It's an open-source benchmark specifically designed for evaluating SAST tools on agentic codebases.
The benchmark has two tracks:
- Core Track: 17 self-contained synthetic cases across Python, TypeScript, and Rust. No external dependencies, deterministic, fast to run.
- Full Track: 40 additional cases built from real-world disclosed vulnerabilities in open-source agentic projects (adding Swift to the language set), pinned at git snapshots with known CVEs.
How SASTbench Works
Cases flow through adapters to produce agentic-aware scores
Every case is defined by a case.json that specifies annotated source regions with line ranges, expected vulnerability kinds, and whether the scanner should or should not flag each region. The scoring engine compares scanner findings against these annotations using path and line-range overlap, then classifies each finding as a true positive, false positive, capability false positive, or false negative.
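In rough terms, that classification step could be sketched like this (field names are illustrative, not the exact case.json schema):

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Two inclusive line ranges overlap if neither ends before the other starts."""
    return a_start <= b_end and b_start <= a_end

def classify_finding(finding: dict, regions: list[dict]) -> str:
    """Classify one scanner finding against annotated regions.

    Each region carries a path, a line range, whether the scanner
    should flag it, and (for guarded code) a capability-safe marker.
    A sketch of the scoring idea only, under assumed field names.
    """
    for region in regions:
        if finding["path"] != region["path"]:
            continue
        if not overlaps(finding["startLine"], finding["endLine"],
                        region["startLine"], region["endLine"]):
            continue
        if region["shouldFlag"]:
            return "true_positive"
        if region.get("capabilitySafe"):
            return "capability_false_positive"
        return "false_positive"
    # A finding that lands outside every annotated region is noise.
    return "false_positive"
```

False negatives fall out the other way: any shouldFlag region with no overlapping finding is one.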
Three Types of Cases
What makes SASTbench different from existing benchmarks is the case types. Each tests a distinct dimension of scanner capability.
Four Case Types
Each tests a different dimension of scanner capability

```python
def fetch(url):
    # no validation
    return requests.get(url)
```

```python
if cmd not in ALLOWLIST:
    raise Denied(cmd)
subprocess.run(cmd, shell=False)
```

Synthetic Vulnerable cases contain real vulnerabilities in agentic code: an SSRF in a medical research agent's reference fetcher, command injection in a debug executor, path traversal in a file manager tool. The scanner must detect these. This is traditional SAST territory.
Capability Safe cases are the novel addition. These contain deliberately guarded dangerous operations: subprocess.run() behind an allowlist with shell=False, requests.get() with a host allowlist, file operations confined to a workspace root. Each case specifies the required guards (allowlist, workspace_root, host_allowlist, scheme_allowlist, parameterized_query, etc.) and the scanner must recognize these as safe. Any finding here is a capability false positive.
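A host-and-scheme guard of the kind these cases annotate might look like the following sketch (the hosts and the function name are hypothetical; the real cases ship their own implementations, and the actual network call is elided here to keep the sketch self-contained):

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"doi.org", "api.example.org"}  # hypothetical allowlist
ALLOWED_SCHEMES = {"https"}

def guarded_fetch(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise PermissionError(f"Scheme not allowed: {parsed.scheme}")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise PermissionError(f"Host not allowed: {parsed.hostname}")
    # requests.get(url, timeout=10) would go here; a scanner flagging
    # that call despite these guards is producing a capability false
    # positive under the benchmark's definitions.
    return f"fetched {url}"
```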
Mixed Intent cases combine both in a single codebase. One tool fetcher is allowlisted and safe. Another is unvalidated and vulnerable. Same codebase, same patterns, different intent. The scanner must flag the vulnerable tool without flagging the safe one. This tests intent discrimination, the ability to distinguish between "this code is dangerous" and "this code handles danger correctly."
A fourth type, Real World Disclosed, anchors these synthetic dimensions against actual CVEs from production agentic projects — covered in the next section.
This distinction between capability and vulnerability is what existing benchmarks miss entirely. Traditional benchmarks label code as "vulnerable" or "not vulnerable." SASTbench adds a third category: code that uses a dangerous API correctly, with formally annotated guards (requiredGuards in the case schema). This turns "does the scanner flag dangerous calls?" into the more useful question: "does the scanner understand which dangerous calls are defended?"
Real-World Disclosed Vulnerabilities
Synthetic cases test scanner logic in isolation, but they don't capture the messiness of real codebases — thousands of files, complex dependency trees, and vulnerabilities buried in context that only makes sense with the full project around it. So the Full Track includes 40 cases from real disclosed CVEs in open-source agentic projects, pinned at the exact vulnerable commit.
Full Track: 40 Real-World Cases
Pinned git snapshots from disclosed CVEs in production agentic projects
Each case references the GHSA advisory and CVE, specifies the vulnerable commit and fix commit, and annotates the exact file and line range where the vulnerability lives. This means you can validate your scanner against the same bugs that were disclosed in production.
The Agentic Score
Traditional SAST metrics (precision, recall) don't capture what matters for agentic code. A scanner with high recall but a high capability false positive rate will generate alert fatigue on every properly-built agent. So SASTbench introduces a composite metric:
The Agentic Score is the geometric mean of three components:
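The exact component definitions live in the benchmark's scoring code. Purely as an illustration of why a geometric mean is the right combiner, assume the three components are recall, precision, and a capability-safety rate (1 minus the capability false positive rate), each in [0, 1]:

```python
def agentic_score(recall: float, precision: float, capability_safety: float) -> float:
    """Geometric mean of three [0, 1] components.

    Component names are assumptions for illustration. The property that
    matters: a near-zero score on any one component (say, flagging every
    guarded capability) drags the composite toward zero, where an
    arithmetic mean would let strong recall paper over it.
    """
    return (recall * precision * capability_safety) ** (1 / 3)
```

For example, a scanner with recall 0.9 and precision 0.8 but capability safety 0.1 gets a composite well under its arithmetic average of 0.6.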
Why PR Mode Matters
Most real security scanning happens at PR time. A developer adds a new tool to an agent, and the scanner needs to decide: is this new capability properly guarded, or did they forget the allowlist?
PR Mode: Diff-Aware Scanning
Detect introduced vulnerabilities without flagging pre-existing code
SASTbench includes a PR simulation mode. For core track cases, it ships vendored base/head directory trees. For full track cases, it uses git commit pairs from the actual repository history (the commit that introduced the vulnerability and the commit that fixed it).
The scanner gets both trees plus the diff and must identify vulnerabilities introduced in the change without flagging pre-existing code or properly guarded new capabilities. PR mode measures:
This matters because agentic PRs frequently add new capabilities. A scanner that flags every new subprocess.run() or fetch() call regardless of context creates friction that slows down development without improving security.
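A naive version of that introduced-vs-pre-existing split might look like this (a sketch only: it matches findings by rule and path, since exact line numbers shift across a diff, and it ignores renames and moved code that real diff-aware scoring must handle):

```python
def split_findings(base_findings: list[dict], head_findings: list[dict]):
    """Split head-scan findings into introduced vs pre-existing.

    A finding counts as pre-existing if a finding with the same rule
    and path already appeared in the base scan. Dict keys follow the
    adapter's normalized finding shape.
    """
    base_keys = {(f["ruleId"], f["path"]) for f in base_findings}
    introduced, preexisting = [], []
    for f in head_findings:
        if (f["ruleId"], f["path"]) in base_keys:
            preexisting.append(f)
        else:
            introduced.append(f)
    return introduced, preexisting
```

Only the introduced list should be scored against the change; penalizing the pre-existing list is exactly the friction PR mode is designed to measure.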
Plugging In Your Scanner
SASTbench is scanner-agnostic. You write a Python adapter module with two functions:
```python
from pathlib import Path

def get_version() -> str:
    """Return your scanner's version string."""

def scan(scan_root: Path, language: str) -> list[dict]:
    """Run your scanner and return normalized findings."""
    return [
        {
            "ruleId": "your-rule-id",
            "mappedKind": "command_injection",  # one of 6 canonical kinds
            "path": "tools/executor.py",
            "startLine": 21,
            "endLine": 55,
            "severity": "high",
            "message": "Unsanitized input passed to subprocess",
        }
    ]
```
Findings are normalized to six canonical vulnerability kinds that map to the OWASP Agentic Top 10: command_injection, path_traversal, ssrf, auth_bypass, authz_bypass, and sql_injection. This lets the benchmark compare scanners that use different rule taxonomies on equal footing.
The benchmark ships with adapters for Semgrep and Bandit. Writing a new adapter typically takes an afternoon.
The harness validates every case schema, captures per-finding audit trails with raw scanner output, and exposes a stable adapter interface, so results are reproducible rather than hand-curated.
What We Need as a Community
Agentic systems are shipping to production. The security tooling hasn't caught up. Pattern-matching SAST rules written for traditional web apps don't understand that an AI agent calling subprocess.run() might be doing exactly what it's supposed to do.
We need scanners that understand intent: not just "is this API dangerous?" but "is this dangerous API properly guarded for its intended use?" We need benchmarks that measure this capability, not just vulnerability detection. And we need the data to know which tools actually work on the codebases we're building today.
SASTbench is our attempt at providing that measurement. The benchmark, the cases, the scoring, and the adapter interface are all open source. Run it against your scanner. Write an adapter. Add cases from your own disclosed vulnerabilities. File issues when the scoring gets it wrong.
The gap between what SAST tools test for and what agentic code actually needs is wide. The only way to close it is to measure it.