| name | create-semgrep-rule |
| description | Create custom Semgrep rules for vulnerability detection. Use when writing new rules for specific vulnerability patterns, creating org-specific detections, or building rules for novel attack vectors discovered during bug bounty hunting. |
Create Custom Semgrep Rules
Expert workflow for creating high-quality, low-false-positive Semgrep rules for security vulnerability detection.
When to Create Custom Rules
Create custom rules when:
- Novel vulnerability patterns not covered by p/default or existing custom rules
- Org-specific code patterns (custom frameworks, internal APIs, coding conventions)
- Chained vulnerabilities requiring multi-step detection
- Language/framework-specific bugs (e.g., PHP parse_url bypass, Go unsafe patterns)
- High-value targets warranting deeper, targeted analysis
- CVE variant hunting - Finding the same vulnerable pattern in other codebases
CVE-to-Rule Workflow
When creating rules from CVEs, the goal is to find the underlying vulnerable code pattern in OTHER codebases - NOT to detect the vulnerable library (SCA tools like Dependabot/Snyk do that better).
Anti-Pattern: SCA-Style Detection (DON'T DO THIS)
# WRONG - This is SCA work, not pattern detection
# Dependabot/Snyk already do this, and do it better
patterns:
- pattern: require("loader-utils").parseQuery(...)
- pattern: import { parseQuery } from "loader-utils"
- pattern: require("vulnerable-package")
This approach:
- Duplicates what SCA tools already do
- Only finds the specific library, not the pattern
- Misses the same vulnerability in custom code
- Provides no value for bug bounty hunting
Correct Approach: Pattern Detection
Step 1: Fetch and analyze the fix commit
# Get the patch diff
curl -s https://github.com/org/repo/commit/abc123.patch
Ask yourself:
- What was the root cause of the vulnerability?
- What code pattern made it exploitable?
- How did the fix address the root cause?
- What would this pattern look like in custom code?
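One way to make the diff easier to reason about is to separate what the fix removed from what it added. The sketch below is a minimal, hypothetical helper (`split_patch` and `PatchChanges` are names introduced here, not part of any tool) and assumes the patch text is a standard unified diff such as the `.patch` output fetched above.

```python
from typing import NamedTuple


class PatchChanges(NamedTuple):
    removed: list[str]  # lines deleted by the fix (candidate vulnerable code)
    added: list[str]    # lines introduced by the fix (candidate safe code)


def split_patch(patch_text: str) -> PatchChanges:
    """Separate removed and added lines found inside the hunks of a unified diff."""
    removed, added = [], []
    in_hunk = False
    for line in patch_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # file headers (---/+++) follow; skip until next hunk
        elif line.startswith("@@"):
            in_hunk = True
        elif line == "-- ":
            in_hunk = False  # mail-style signature at the end of a .patch file
        elif in_hunk and line.startswith("-"):
            removed.append(line[1:].strip())
        elif in_hunk and line.startswith("+"):
            added.append(line[1:].strip())
    return PatchChanges(removed, added)


# Illustrative input only; the file name is made up.
sample = """diff --git a/example.js b/example.js
--- a/example.js
+++ b/example.js
@@ -1,2 +1,2 @@
-const result = {};
+const result = Object.create(null);
 result[key] = value;
"""
changes = split_patch(sample)
print("removed:", changes.removed)
print("added:  ", changes.added)
```

Reading the removed lines next to the added ones usually points straight at the root cause; the unchanged context lines are skipped on purpose.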
Step 2: Abstract the pattern
The key question: "If a developer wrote similar functionality from scratch, what would the vulnerable version look like?"
Don't think about the library. Think about the category of code that has this problem.
Step 3: Create a library-agnostic rule
The rule should find the SAME MISTAKE anywhere, not just in the specific library.
Example: CVE-2022-37601 (loader-utils Prototype Pollution)
Fix commit analysis:
// BEFORE (vulnerable)
const result = {}; // Has prototype chain
result[key] = value; // key could be "__proto__"
// AFTER (fixed)
const result = Object.create(null); // No prototype chain
result[key] = value; // "__proto__" is just a regular key
Root cause: Query string parsing into {} with unsanitized dynamic keys.
Abstracted pattern: Any code that:
- Creates an object with {} (not Object.create(null))
- Assigns properties using dynamic/user-controlled keys
- Doesn't validate against __proto__, constructor, prototype
Rule focus: Find custom query parsers, config loaders, merge utilities, or any key-value processing with this antipattern.
What to detect:
// DETECT: Custom query parser with same vulnerability
function parseConfig(input) {
  const config = {}; // Vulnerable: has prototype
  for (const [key, val] of entries) {
    config[key] = val; // Unsanitized key assignment
  }
  return config;
}
// DETECT: Custom merge/extend function
function merge(target, source) {
  for (const key in source) {
    target[key] = source[key]; // Prototype pollution sink
  }
}
What NOT to detect:
// SKIP: Using the library (SCA handles this)
const { parseQuery } = require("loader-utils");
// SKIP: Already using safe pattern
const result = Object.create(null);
result[key] = value;
// SKIP: Has prototype pollution guard
if (key === "__proto__" || key === "constructor") continue;
CVE-to-Rule Checklist
Before writing the rule, verify:
| Check | Question |
|---|---|
| Root cause identified | What code pattern caused the vulnerability? |
| Pattern abstracted | Would I find this in custom code, not just the library? |
| Not SCA | Am I detecting a pattern, not a library import? |
| Realistic matches | Will this find bugs in real-world code? |
| Low FP rate | Are there clear safe patterns to exclude? |
Common CVE Pattern Categories
| CVE Type | Root Cause Pattern | Rule Focus |
|---|---|---|
| Prototype Pollution | obj[userKey] = val on {} | Custom parsers, merge functions |
| Template Injection | User input in template options | Custom template rendering |
| Command Injection | String concat to shell exec | Custom exec wrappers |
| Path Traversal | User input in file paths | Custom file handlers |
| SSRF | User input in URL construction | Custom HTTP clients |
| Deserialization | Untrusted data to deserializer | Custom data loaders |
Rule Broadness: When Patterns Are Too Generic
Some vulnerability patterns are too common to detect without drowning in false positives. Before writing a rule, assess whether it will produce signal or noise.
Pattern Frequency Spectrum
| Signal Level | Pattern Type | Example | Approach |
|---|---|---|---|
| HIGH | Rare sink + user input | res.render(tpl, req.query) | Direct detection, HIGH confidence |
| MEDIUM | Common pattern + specific context | obj[key] = val in loops | Audit rule, MEDIUM confidence |
| LOW | Ubiquitous pattern | obj[key] = val anywhere | Skip or sink-focused only |
Example: Prototype Pollution
Too broad (produces noise):
# This matches almost every JS file
pattern: $OBJ[$KEY] = $VALUE
Specific enough (produces signal):
# Recursive descent pattern - characteristic of vulnerable merge functions
patterns:
  - pattern: $SMTH = $SMTH[$A]
  - pattern-inside: |
      for (...) { ... }
Sink-focused (best signal):
# Detect where pollution becomes exploitable
pattern-sinks:
- pattern: res.render($T, $OPTS) # Template options = RCE
- pattern: spawn($CMD, $ARGS, $OPTS) # child_process options
When to Use Audit vs Vuln Rules
| Rule Type | Confidence | Use Case |
|---|---|---|
| subcategory: vuln | HIGH | Rare pattern, clear exploit, few FPs |
| subcategory: audit | LOW-MEDIUM | Common pattern, needs manual review |
If you can't achieve HIGH confidence, mark the rule as audit with LOW confidence.
The official Semgrep registry does this for prototype pollution:
metadata:
  subcategory: audit
  confidence: LOW
  likelihood: LOW
Sink-Focused vs Pattern-Focused Rules
When a vulnerability pattern is too common to detect directly, focus on the sinks where it becomes exploitable:
| Vulnerability | Pattern-Focused (noisy) | Sink-Focused (high signal) |
|---|---|---|
| Prototype Pollution | obj[key] = val | Template options, child_process options |
| XSS | String concatenation | innerHTML, document.write |
| SQLi | String + variable | cursor.execute, ORM raw queries |
Rule of thumb: If the source pattern is ubiquitous, detect at the sink instead.
Project Structure
custom-rules/
├── 0xdea-semgrep-rules/ # Third-party: Memory safety, C/C++ vulns
├── open-semgrep-rules/ # Third-party: Multi-language security rules
├── web-vulns/ # Web-specific injection rules
└── custom/ # YOUR custom rules
    ├── org-specific/ # Rules targeting specific organizations
    │   └── <org-name>/ # Per-org rule directories
    └── novel-vulns/ # Novel vulnerability patterns
CRITICAL: Rule Quality Standards
Custom rules must meet these standards before use:
- LOW false positive rate - Every FP wastes time; add exclusions aggressively
- Clear security impact - Rule must detect exploitable vulnerabilities, not code smells
- Tested against real code - Validate on target repos before adding to pipeline
- Complete metadata - CWE, severity, confidence, references
- Path exclusions for performance - Exclude bundled/minified files to prevent timeouts
CRITICAL: Path Exclusions for Performance
Taint mode rules are computationally expensive and can time out on large bundled/minified files. Always add path exclusions to your rules.
Required Path Exclusions
Add this paths block to EVERY rule (especially taint mode):
rules:
  - id: my-taint-rule
    mode: taint
    paths:
      exclude:
        # Package managers
        - "**/node_modules/**"
        - "**/vendor/**"
        # Build output
        - "**/dist/**"
        - "**/build/**"
        # Minified/bundled files (specific patterns only)
        - "**/*.min.js"
        - "**/*.min.mjs"
        - "**/*.bundle.js"
        - "**/*.chunk.js"
        - "**/*.chunk.mjs"
        - "**/*-init.mjs"
        # NOTE: Do NOT use broad patterns like "**/js/*.js" or "**/assets/**"
        # as they exclude legitimate source files in some repos
    # ... rest of rule
Why This Matters
| File Type | Typical Size | Taint Mode Behavior |
|---|---|---|
| Source file | 1-50 KB | Fast analysis |
| Bundled JS | 100KB-2MB | TIMEOUT (per-rule, per-file limit set by --timeout) |
| Minified JS | 50KB-500KB | TIMEOUT or very slow |
Real example: A 588KB Vite bundle (viewer-init.mjs) caused 3 timeout errors and blocked rule execution until path exclusions were added.
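Before running a taint rule, it can help to list files that look bundled or minified so the exclusions are in place up front. The following is a rough sketch only: `find_bundle_candidates` is a hypothetical helper, the 100 KB size threshold is borrowed from the table above, and the average-line-length threshold of 500 characters is an assumption, not a Semgrep setting.

```python
from pathlib import Path


def find_bundle_candidates(root, max_bytes=100_000, max_avg_line=500,
                           suffixes=(".js", ".mjs")):
    """Return (path, size_in_bytes, avg_line_length) for files that look bundled or minified."""
    candidates = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        data = path.read_bytes()
        size = len(data)
        line_count = data.count(b"\n") + 1
        avg_line = size / line_count
        # Large files and files with very long lines are the usual timeout culprits.
        if size >= max_bytes or avg_line >= max_avg_line:
            candidates.append((path, size, round(avg_line)))
    return candidates


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size, avg_line in find_bundle_candidates(target):
        print(f"{size:>10} bytes  avg line {avg_line:>7}  {path}")
```

Each reported file is a candidate for a specific `paths.exclude` entry; review the list by hand rather than excluding whole directories.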
Signs You Need More Exclusions
When running your rule, watch for:
Warning: 3 timeout error(s) in path/to/file.mjs when running rules...
Semgrep stopped running rules on path/to/file.mjs after 3 timeout error(s).
Add the problematic file pattern to your paths.exclude list.
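When a scan log is long, the affected files can be pulled out mechanically. This sketch assumes the warning text has exactly the shape shown above; `timed_out_paths` and `suggest_exclude` are hypothetical helpers, and the suggested glob is only a starting point to review by hand.

```python
import re
from pathlib import PurePosixPath

# Matches the warning line shown above, e.g.
# "Warning: 3 timeout error(s) in path/to/file.mjs when running rules..."
TIMEOUT_RE = re.compile(r"(\d+) timeout error\(s\) in (\S+)")


def timed_out_paths(log_text):
    """Map each file named in a timeout warning to its highest reported timeout count."""
    paths = {}
    for count, path in TIMEOUT_RE.findall(log_text):
        paths[path] = max(paths.get(path, 0), int(count))
    return paths


def suggest_exclude(path):
    """Propose a narrow exclusion glob that targets only the offending file name."""
    return "**/" + PurePosixPath(path).name


log = (
    "Warning: 3 timeout error(s) in path/to/file.mjs when running rules...\n"
    "Semgrep stopped running rules on path/to/file.mjs after 3 timeout error(s).\n"
)
for path, count in timed_out_paths(log).items():
    print(f'{count} timeouts: {path} -> consider excluding "{suggest_exclude(path)}"')
```

The glob is deliberately narrow (file name only), in line with the note above about avoiding broad exclusion patterns.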
Workflow
Step 1: Define the Vulnerability
Before writing any YAML, answer these questions:
Vulnerability Type: [e.g., Command Injection, SSRF, SQLi]
CWE ID: [e.g., CWE-78]
Security Impact: [e.g., Remote code execution as web server user]
Vulnerable Pattern: [e.g., os.system() with user-controlled input]
Exploit Scenario: [e.g., Attacker controls filename parameter, injects shell commands]
Find 2-3 real examples from target codebase to guide pattern creation.
Step 2: Choose Rule Mode
| Mode | Use When | Example |
|---|---|---|
| Pattern-based | Single function calls, hardcoded values, dangerous API usage | eval(), hardcoded secrets, weak crypto |
| Taint mode | Data flows from user input to dangerous sink | SQLi, XSS, command injection, SSRF |
Decision guide:
- "Is user input involved?" → Taint mode
- "Is it a dangerous function regardless of input?" → Pattern mode
- "Do I need to track data across variables/functions?" → Taint mode
Step 3: Write the Rule
Pattern-Based Rule Template
rules:
  - id: <org>-<vuln-type>-<specific-pattern>
    languages:
      - python
    message: |
      <Clear description of what was detected and why it's dangerous>
      Remediation: <Specific fix recommendation>
    severity: ERROR # ERROR, WARNING, or INFO
    metadata:
      cwe: "CWE-XX"
      owasp:
        - "A03:2021-Injection"
      category: security
      confidence: HIGH # HIGH, MEDIUM, LOW
      author: "Your Name"
      references:
        - https://cwe.mitre.org/data/definitions/XX.html
    patterns:
      - pattern-either:
          - pattern: dangerous_function($ARG)
          - pattern: other_dangerous_function($ARG)
      - pattern-not: safe_wrapper(...)
      - pattern-not-inside: |
          if $X is None:
            ...
Taint Mode Rule Template
rules:
  - id: <org>-<vuln-type>-taint
    mode: taint
    languages:
      - python # or javascript, typescript, etc.
    # CRITICAL: Always include path exclusions for taint mode
    paths:
      exclude:
        - "**/node_modules/**"
        - "**/vendor/**"
        - "**/dist/**"
        - "**/build/**"
        - "**/*.min.js"
        - "**/*.min.mjs"
        - "**/*.bundle.js"
        - "**/*.chunk.js"
        - "**/*.chunk.mjs"
        - "**/*-init.mjs"
    message: |
      User input flows to <dangerous sink> without proper sanitization.
      This could allow <attack type>.
      Remediation: <Specific fix>
    severity: ERROR
    metadata:
      cwe: "CWE-XX"
      owasp:
        - "A03:2021-Injection"
      category: security
      confidence: HIGH
      author: "Your Name"
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json[...]
    pattern-sinks:
      # focus-metavariable must sit inside a patterns list, next to the pattern it narrows
      - patterns:
          - pattern: cursor.execute($QUERY, ...)
          - focus-metavariable: $QUERY
    pattern-sanitizers:
      - pattern: escape(...)
      - pattern: int(...)
      - pattern: parameterized_query(...)
Step 4: Reduce False Positives
This is the most critical step. For every rule, consider:
Exclusion patterns to add:
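# Exclude hardcoded/literal strings (not user input)
- pattern-not: $FUNC("...", ...)
# Exclude safe wrappers
- pattern-not: safe_execute(...)
# Exclude already-validated contexts
- pattern-not-inside: |
    if validate($INPUT):
      ...
# Exclude test files (if not already in .semgrepignore).
# "def test_...:" is not a valid Semgrep pattern, so use a rule-level paths
# block instead (this key sits beside patterns, not inside it; the globs
# below are examples to adapt per repo)
paths:
  exclude:
    - "**/tests/**"
    - "**/test_*.py"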
Common FP sources:
- Hardcoded strings (not user-controlled)
- Test/example code
- Already-sanitized inputs
- Framework auto-escaping
- Admin-only code paths
Step 5: Test the Rule
Create test file alongside rule:
custom-rules/custom/novel-vulns/
├── command-injection-eval.yml
└── command-injection-eval.py # Test cases
Test file format:
# ruleid: command-injection-eval
eval(user_input)
# ruleid: command-injection-eval
exec(request.args.get('code'))
# ok: command-injection-eval
eval("2 + 2") # Hardcoded, safe
# ok: command-injection-eval
safe_eval(user_input) # Uses sanitizer
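To sanity-check a test file before running Semgrep, the annotations can be listed together with the line each one covers. This is an illustrative sketch, not part of Semgrep: `expected_results` is a hypothetical helper, and it assumes Python-style `#` comments with each annotation placed on the line directly above the code it describes, as in the format above.

```python
import re

# Matches "# ruleid: <rule-id>" and "# ok: <rule-id>" annotation comments.
ANNOTATION_RE = re.compile(r"#\s*(ruleid|ok):\s*([\w.-]+)")


def expected_results(test_source):
    """Return (line_number, kind, rule_id) for the code line each annotation covers."""
    expectations = []
    for index, line in enumerate(test_source.splitlines()):
        match = ANNOTATION_RE.match(line.strip())
        if match:
            kind, rule_id = match.groups()
            # index is 0-based, so the annotated code sits on 1-based line index + 2.
            expectations.append((index + 2, kind, rule_id))
    return expectations


test_file = """# ruleid: command-injection-eval
eval(user_input)
# ruleid: command-injection-eval
exec(request.args.get('code'))
# ok: command-injection-eval
eval("2 + 2")
"""
for line_number, kind, rule_id in expected_results(test_file):
    print(f"line {line_number}: {kind} for {rule_id}")
```

A test file with no `ok` entries is a hint that the false-positive cases from Step 4 have not been covered yet.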
Run validation:
# Test rule syntax and test cases
semgrep --config custom-rules/custom/novel-vulns/command-injection-eval.yml \
--test custom-rules/custom/novel-vulns/
# Test against real target repo
semgrep --config custom-rules/custom/novel-vulns/command-injection-eval.yml \
repos/<org>/<repo>/
# Count findings
semgrep --config custom-rules/custom/novel-vulns/command-injection-eval.yml \
repos/<org>/ --json | jq '.results | length'
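A raw count hides whether the findings are spread across the codebase or concentrated in one noisy file. The sketch below groups the `--json` output by rule and by path; it relies only on the `results`, `check_id`, and `path` fields of the report, and `summarize_findings` plus the sample data are made up for illustration.

```python
import json
from collections import Counter


def summarize_findings(report_json):
    """Return (total, findings per rule, findings per file) from a Semgrep JSON report."""
    report = json.loads(report_json)
    results = report.get("results", [])
    by_rule = Counter(result["check_id"] for result in results)
    by_path = Counter(result["path"] for result in results)
    return len(results), by_rule, by_path


# Synthetic report used only to demonstrate the helper.
sample_report = json.dumps({
    "results": [
        {"check_id": "command-injection-eval", "path": "app/views.py"},
        {"check_id": "command-injection-eval", "path": "app/views.py"},
        {"check_id": "command-injection-eval", "path": "app/utils.py"},
    ]
})
total, by_rule, by_path = summarize_findings(sample_report)
print("total findings:", total)
for path, count in by_path.most_common():
    print(f"{count:>4}  {path}")
```

A single file that accounts for most of the findings is often either a real hotspot or a sign that the rule needs another exclusion.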
Step 5b: Test Performance (CRITICAL for Taint Mode)
Taint mode rules can time out on large files. Always test on repos with bundled JS:
# Test against a repo known to have bundled files
time semgrep --config my-rule.yaml repos/<org>/<repo-with-bundles>/ 2>&1 | grep -E "(timeout|Error|Ran)"
Watch for these warning signs:
Warning: 3 timeout error(s) in path/to/file.mjs when running rules...
If you see timeouts:
Check which files are causing issues:
ls -la path/to/problematic/file.mjs # Check file size
head -c 200 path/to/problematic/file.mjs # Check if minified
Add path exclusions to your rule:
paths:
  exclude:
    - "**/path/pattern/*.mjs"
Re-test until no timeouts:
# Should complete in seconds, not time out
time semgrep --config my-rule.yaml repos/<org>/<repo>/
Performance targets:
| Repo Size | Expected Time | Action if Slower |
|---|---|---|
| Small (<100 files) | < 5 seconds | Check for bundled files |
| Medium (100-1000 files) | < 30 seconds | Add path exclusions |
| Large (1000+ files) | < 2 minutes | Verify exclusions working |
Verify findings still work after exclusions:
# Run on source directory only (where real vulns are)
semgrep --config my-rule.yaml repos/<org>/<repo>/src/
Step 6: Integrate with Pipeline
Rules in custom-rules/ are automatically included when running:
./scripts/scan-semgrep.sh <org-name>
To use only your custom rule:
semgrep --config custom-rules/custom/novel-vulns/my-rule.yml repos/<org>/
Pattern Operators Reference
Basic Matching
| Operator | Purpose | Example |
|---|---|---|
| pattern | Match exact code | os.system($CMD) |
| pattern-either | Match any (OR) | Multiple dangerous functions |
| patterns | Match all (AND) | Function + constraint |
Metavariables
| Syntax | Meaning |
|---|---|
| $VAR | Capture any expression |
| $_ | Match anything (no capture) |
| $...ARGS | Match multiple arguments |
| <... $X ...> | Match $X nested at any depth |
| ... | Match any statements between |
Exclusions (Critical for FP reduction)
pattern-not: safe_function(...) # Exclude specific pattern
pattern-not-inside: | # Exclude if inside context
  if validated($X):
    ...
Metavariable Constraints
# Regex match on captured variable
metavariable-regex:
  metavariable: $FUNC
  regex: "(system|exec|popen)"
# Pattern match on captured variable
metavariable-pattern:
  metavariable: $ARG
  pattern-either:
    - pattern: request.args[...]
    - pattern: request.form[...]
# Entropy analysis (detect secrets)
metavariable-analysis:
  analyzer: entropy
  metavariable: $VALUE
# Highlight specific variable in output
focus-metavariable: $DANGEROUS_ARG
Taint Mode Operators
mode: taint # Enable taint tracking
pattern-sources: # Where tainted data enters
  - pattern: request.args[...]
pattern-sinks: # Where tainted data causes harm
  - patterns: # focus-metavariable needs an enclosing patterns list
      - pattern: cursor.execute($Q)
      - focus-metavariable: $Q
pattern-sanitizers: # Functions that clean data
  - pattern: escape(...)
  - pattern: int(...)
pattern-propagators: # Custom taint spread (Pro only)
  - pattern: $TO = transform($FROM)
    from: $FROM
    to: $TO
Common Rule Patterns
Command Injection
patterns:
  - pattern-either:
      - pattern: os.system($CMD)
      - pattern: os.popen($CMD)
      - pattern: subprocess.call($CMD, shell=True, ...)
      - pattern: subprocess.Popen($CMD, shell=True, ...)
  - pattern-not: $FUNC("...", ...) # Exclude hardcoded strings
SQL Injection (Taint)
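mode: taint
pattern-sources:
  - pattern: request.$METHOD[...]
  - pattern: request.$METHOD.get(...)
pattern-sinks:
  # Only the query string is the sink. Parameterized calls such as
  # $CURSOR.execute("...", ($PARAM,)) are not reported, because tainted data
  # in the bound parameters never reaches $QUERY.
  - patterns:
      - pattern-either:
          - pattern: $CURSOR.execute($QUERY, ...)
          - pattern: $CURSOR.executemany($QUERY, ...)
      - focus-metavariable: $QUERY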
Hardcoded Secrets
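patterns:
  - pattern: $VAR = "$VALUE"
  - metavariable-regex:
      metavariable: $VAR
      regex: "(?i)(password|secret|api_key|token|private_key)"
  # Run entropy analysis on the string value, not on the variable name
  - metavariable-analysis:
      analyzer: entropy
      metavariable: $VALUE
  # Semgrep patterns do not match code comments, so "# Example: ..." markers
  # cannot be excluded here; exclude example/docs files with paths.exclude instead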
Insecure Cryptography
pattern-either:
- pattern: hashlib.md5(...)
- pattern: hashlib.sha1(...)
- pattern: DES.new(...)
- pattern: Blowfish.new(...)
- pattern: ARC4.new(...)
Path Traversal
mode: taint
pattern-sources:
- pattern: request.args.get("...")
- pattern: request.form["..."]
pattern-sinks:
- pattern: open($PATH, ...)
- pattern: os.path.join(..., $PATH, ...)
pattern-sanitizers:
- pattern: os.path.basename(...)
- pattern: secure_filename(...)
Metadata Standards
Every rule MUST include:
metadata:
  # Required
  cwe: "CWE-78" # Primary CWE ID
  category: security # Always "security" for vulns
  confidence: HIGH # HIGH, MEDIUM, LOW
  # Recommended
  owasp:
    - "A03:2021-Injection" # OWASP Top 10 2021
  likelihood: HIGH # Exploitation probability
  impact: HIGH # Damage if exploited
  subcategory:
    - vuln # vuln, audit, guardrail
  # For custom rules
  author: "Your Name"
  created: "2025-01-15"
  tested_against: "org-name" # Where you validated it
  references:
    - https://cwe.mitre.org/...
    - https://blog.example.com/... # Writeups explaining the vuln
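The required fields can be checked mechanically before a rule is committed. The sketch below works on a plain dict (for example, the `metadata` mapping of a rule that has already been loaded from YAML); `metadata_problems` is a hypothetical helper, and the checks encode only the standards listed above, not anything Semgrep itself enforces.

```python
import re

REQUIRED_KEYS = ("cwe", "category", "confidence")
LEVELS = {"HIGH", "MEDIUM", "LOW"}


def metadata_problems(metadata):
    """Return human-readable problems found in a rule's metadata mapping."""
    problems = [f"missing required key: {key}"
                for key in REQUIRED_KEYS if key not in metadata]
    cwe = metadata.get("cwe")
    # Accept "CWE-78" as well as longer forms that start with the ID.
    if cwe is not None and not re.match(r"CWE-\d+", str(cwe)):
        problems.append(f"cwe should start with an ID such as CWE-78, got {cwe!r}")
    if "category" in metadata and metadata["category"] != "security":
        problems.append("category should be 'security' for vulnerability rules")
    for key in ("confidence", "likelihood", "impact"):
        value = metadata.get(key)
        if value is not None and value not in LEVELS:
            problems.append(f"{key} should be HIGH, MEDIUM, or LOW, got {value!r}")
    return problems


good = {"cwe": "CWE-78", "category": "security", "confidence": "HIGH"}
template_left_unfilled = {"cwe": "CWE-XX", "category": "security"}
print(metadata_problems(good))
print(metadata_problems(template_left_unfilled))
```

The second call flags both the placeholder CWE and the missing confidence, which are the two most common leftovers from copying a template.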
Severity Guidelines
| Severity | Use For | Examples |
|---|---|---|
| ERROR | Exploitable vulns with high impact | RCE, SQLi, auth bypass |
| WARNING | Likely vulns needing verification | Potential XSS, weak crypto |
| INFO | Code smells, audit points | Missing headers, debug code |
Pro Engine Features
When running with --pro (our default), you get:
- Cross-file taint tracking - Follow data across imports
- Interprocedural analysis - Track through function calls
- Field sensitivity - Track object properties
These are automatic; no rule changes needed.
Debugging Rules
Rule not matching expected code?
# Verbose output shows matching attempts
semgrep --config rule.yml target/ --debug
# Test specific pattern interactively
semgrep --lang python --pattern 'os.system($X)' target/
Too many false positives?
- Add pattern-not for safe patterns
- Add pattern-not-inside for safe contexts
- Use metavariable-regex to constrain variable names
- Lower confidence in metadata if FPs are expected
Output
Save completed rules to:
custom-rules/custom/
├── org-specific/<org-name>/ # Org-targeted rules
└── novel-vulns/ # General novel patterns
Rules are automatically picked up by ./scripts/scan-semgrep.sh.
References
- Semgrep Rule Syntax
- Taint Mode Overview
- Advanced Taint Techniques
- Semgrep Playground - Interactive rule testing