
Benchmarking SAST
Choosing the right static analyzer — comparing SAST tools using precision, recall, and F1 metrics on realistic codebases.
Static Application Security Testing (SAST) is a powerful approach for catching vulnerabilities early in the development lifecycle. By analyzing source code without executing it, SAST tools can surface defects before they reach production. However, without an objective benchmarking framework, organizations struggle to select the most suitable solution.
Why Benchmarking SAST Tools Matters
Relying on vendor claims or informal testing creates risk. Many organizations lack resources for comprehensive evaluations, instead conducting superficial scans of random open-source projects. This unreliable approach can misrepresent tool capabilities.
Effective benchmarking provides:
- Objective scoring using tool-agnostic metrics (precision, recall, F1) that remain consistent across versions
- Clear signal-to-noise quantification showing real vulnerability detection versus false positives
- Standardized testing conditions enabling fair tool comparison
- Continuous improvement incentives for tool developers
Existing Synthetic Benchmarks
OWASP Benchmark (Java-only): A free, open-source suite designed to test detection accuracy. Its narrow scope means results can be irrelevant to non-Java stacks.
SARD / Juliet (NIST): Extensive C/C++ test sets mapped to CWEs. Paired "good()/bad()" patterns may artificially inflate accuracy scores without reflecting real application complexity.
OpenSSF CVE Benchmark (JS/TS): Uses 200+ authentic JavaScript/TypeScript CVEs, testing tools against vulnerable and patched versions. Stronger realism than synthetic suites, but limited to JS/TS coverage.
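To make the vulnerable-versus-patched methodology concrete, here is a minimal, hypothetical sketch (not the benchmark's actual scoring code) of how a CVE-based harness could grade a tool: credit is given only when the tool flags the vulnerable version and stays quiet on the patched one, since flagging both suggests noise rather than real detection.

```python
from dataclasses import dataclass

@dataclass
class CVEResult:
    # Hypothetical record: did the tool alert on the vulnerable
    # and on the patched version of the same project?
    flagged_vulnerable: bool
    flagged_patched: bool

def score_cve_results(results: list[CVEResult]) -> dict:
    """Score in the spirit of a vulnerable/patched benchmark: a detection
    counts only if the vulnerable version is flagged AND the patch is not."""
    if not results:
        return {"detection_rate": 0.0, "false_alarm_rate": 0.0}
    detected = sum(r.flagged_vulnerable and not r.flagged_patched for r in results)
    noisy = sum(r.flagged_patched for r in results)
    return {
        "detection_rate": detected / len(results),
        "false_alarm_rate": noisy / len(results),
    }

results = [
    CVEResult(flagged_vulnerable=True, flagged_patched=False),   # real detection
    CVEResult(flagged_vulnerable=True, flagged_patched=True),    # likely noise
    CVEResult(flagged_vulnerable=False, flagged_patched=False),  # missed
    CVEResult(flagged_vulnerable=True, flagged_patched=False),   # real detection
]
scores = score_cve_results(results)
print(scores)  # {'detection_rate': 0.5, 'false_alarm_rate': 0.25}
```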
Core Evaluation Metrics
1. Accuracy (Signal-to-Noise Ratio)
Precision measures the fraction of reported findings that are real vulnerabilities: true positives divided by total alerts (true positives plus false positives). Excessive false positives overwhelm developers and encourage tool abandonment; noise is the number-one factor that deters developers from adopting SAST products.
2. Completeness
Recall measures the fraction of actual vulnerabilities a tool detects: true positives divided by the sum of true positives and false negatives. Perfect completeness is unattainable; some bugs inevitably slip through. The challenge is balancing detection rate against false positive volume.
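Both metrics, plus F1 (the harmonic mean of precision and recall used throughout this article), follow directly from true positive, false positive, and false negative counts. A minimal Python sketch:

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported findings that are real vulnerabilities."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of real vulnerabilities the tool actually reported."""
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; penalizes imbalance."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

# Example: a scan that found 80 real bugs, raised 20 false alarms,
# and missed 20 vulnerabilities present in the benchmark.
print(precision(80, 20))  # 0.8
print(recall(80, 20))     # 0.8
```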
3. Additional Qualitative Factors
- Coverage & depth: Language/framework support quality and modern framework handling
- Rule quality: Well-crafted defaults with easy customization
- Maintenance velocity: Active development and rigorous community-driven improvements
- Integration: Seamless CI/CD, IDE, and pull request workflow incorporation
Benchmark Limitations
Language fragmentation: No single suite covers multiple languages. OWASP Benchmark addresses Java exclusively; intentionally vulnerable applications (WebGoat, NodeGoat, Juice Shop) target specific frameworks.
Realism gap: Synthetic test cases feature linear data flows rarely matching production complexity. Real vulnerabilities span modules, use complex library chains, or depend on framework configurations absent from benchmarks.
Overfitting risk: Once benchmarks become standards, vendors optimize specifically for those tests, producing impressive scores disconnected from real-world performance.
Multiple tool trade-offs: Running parallel SAST engines theoretically improves coverage, but multiplies false positives, duplicates, and configuration overhead, creating unsustainable triage burdens in continuous pipelines.
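To illustrate why multi-tool triage overhead grows, here is a hedged sketch of the deduplication layer such a pipeline would need. The finding schema (file, line, CWE keys) is an assumption for illustration, not any real tool's output format.

```python
def dedupe_findings(findings: list[dict]) -> list[dict]:
    """Merge findings from several SAST engines, keeping the first
    occurrence of each (file, line, CWE) triple."""
    seen: set[tuple] = set()
    unique = []
    for f in findings:
        key = (f["file"], f["line"], f["cwe"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique

combined = [
    {"tool": "engine-a", "file": "auth.py", "line": 42, "cwe": "CWE-89"},
    {"tool": "engine-b", "file": "auth.py", "line": 42, "cwe": "CWE-89"},  # duplicate
    {"tool": "engine-b", "file": "upload.py", "line": 7, "cwe": "CWE-22"},
]
print(len(dedupe_findings(combined)))  # 2
```

Even this naive keying is fragile: tools disagree on line numbers and CWE assignments for the same defect, which is exactly why cross-tool triage rarely stays cheap.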
Future Direction: AI-Powered Benchmarking
I'm exploring a benchmarking framework that leverages automation and AI to run multiple SAST tools and aggregate their performance into a single score. A standardized "SAST scorecard" could simplify tool selection without requiring extensive manual evaluation.
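As a sketch of what such a scorecard could look like, the following hypothetical aggregation weights each tool's precision, recall, and F1 into a single 0-100 score. The weights and tool names are illustrative assumptions, not a published standard.

```python
def scorecard(metrics: dict, weights: dict = None) -> float:
    """Combine precision, recall, and F1 into one 0-100 score.
    The default weights are an arbitrary illustrative choice."""
    weights = weights or {"precision": 0.4, "recall": 0.4, "f1": 0.2}
    return 100 * sum(metrics[k] * w for k, w in weights.items())

# Hypothetical per-tool metrics from a benchmark run.
tools = {
    "tool-a": {"precision": 0.90, "recall": 0.60, "f1": 0.72},
    "tool-b": {"precision": 0.70, "recall": 0.80, "f1": 0.75},
}
for name, m in sorted(tools.items(), key=lambda kv: -scorecard(kv[1])):
    print(f"{name}: {scorecard(m):.1f}")
```

The interesting design question is less the arithmetic than the weighting: a CI-gating use case might weight precision heavily, while a security-audit use case would favor recall.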
References
- OWASP Foundation. OWASP Benchmark Project
- van Schaik, Bas. "Introducing the OpenSSF CVE Benchmark." OpenSSF Blog, 2020
- NIST Software Quality Group (SAMATE). Software Assurance Reference Dataset (SARD)
- Gigleux, Alexandre. "Enhancing SAST Detection: Sonar's Scoring on the Top 3 Java SAST Benchmarks." Sonar Blog, 2023
- Biton, Asaf, and Shani Gal. "3 parameters to measure SAST testing." Snyk Blog, 2021