| name | pr-review-standards |
| description | Code quality standards for PR review agents including try/except rules, logging patterns, type hints, and verification requirements. Use when reviewing code, spawning PR review agents, or validating code quality. |
| allowed-tools | Read |
PR Review Standards
Purpose: Critical standards for code review agents to ensure consistent, evidence-based PR reviews with proper verification requirements.
Use when:
- Spawning PR review agents (code-analysis, security, performance, tests, verification)
- Reviewing code changes manually
- Validating code quality during implementation
- Setting up PR review workflows
Core Principles
| Principle | Description | Why It Matters |
|---|---|---|
| NO Assumptions | If uncertain, flag as "needs_verification" | Prevents false positives and wasted effort |
| PROOF Required | Every finding needs evidence (code snippet, trace, test) | Enables actionable fixes, not vague suggestions |
| Read IN CONTEXT | Entire function, not just flagged line | Prevents missing validation/handling elsewhere |
| Trace Execution | Confirm issue actually occurs | Distinguishes ACTUAL bugs from POTENTIAL risks |
| Context-Aware | External APIs need higher scrutiny than internal utils | Focuses effort on real risk areas |
Code Quality Standards
Try/Except Usage Rules
| Allowed For | NEVER Wrap | Evidence Pattern |
|---|---|---|
Network calls (requests.*, http.*) |
dict.get() |
Connection errors only |
Database ops (pymongo.*, sqlalchemy.*, redis.*) |
json.loads() |
DB connection failures |
| External APIs (cache, message queues) | File I/O (files auto-close) | API timeouts |
| Type conversions | ||
List operations (list[0], list.append()) |
||
| String operations |
Exception wrapping security: try/except can hide auth failures
- ❌ Bad:
try: verify_token() except: pass(silent auth bypass) - ✅ Good:
if not verify_token(): return abort(401)(explicit)
Location: templates/pr-review-code-analysis.md:69-73
Logging Standards
| Requirement | Pattern | Anti-Pattern |
|---|---|---|
| Import | import logging; LOG = logging.getLogger(__name__) |
import logging (root logger) |
| Usage | LOG.info('msg', extra={'key': val}) |
print() or logging.info() |
| NO PII | extra={'user_id': id} |
f"User {email}" |
| NO Secrets | Never log passwords, tokens, API keys | LOG.debug(f"API key: {key}") |
Location: templates/pr-review-code-analysis.md:20
Type Hints and Line Limits
| Standard | Requirement | Exception |
|---|---|---|
| Type hints | Required for new code | Legacy code not being modified |
| Line length | 80 chars max | Long strings, URLs (with # noqa + reason) |
# noqa |
Must document reason | Never use without comment |
Location: templates/pr-review-code-analysis.md:21-22
Code Quality Checks
| Check | Look For | Severity |
|---|---|---|
| Single-line wrappers | Functions that just call another function | Medium |
| Commented code | Code blocks in comments (should be deleted) | Medium |
| Error messages | Include context for debugging | High |
| Error handling | Errors logged with sufficient context | High |
Location: templates/pr-review-code-analysis.md:23-24
Verification Requirements
Evidence Standards
| Finding Type | Required Evidence | Example |
|---|---|---|
| Logic bug | Code snippet + reproduce scenario | average = total / count # count can be 0 if list empty |
| Security | Vulnerable code + exploitation scenario | query = f"SELECT * WHERE id={user_id}" + user_id="1 OR 1=1" |
| Performance | Before/after + quantified impact | 1000 users = 1000 queries (was 1 batch) |
| Breaking change | Changed code + affected callers | def func(a, b): → def func(a, b, c): + caller list |
Location: templates/pr-review-code-analysis.md:26-31, pr-review-security.md:28-32
Proof Requirements by Category
Code Analysis
- Show code snippet with issue
- Execution trace showing how to trigger
- Fix suggestion with exact code
- File:line reference required
Security
- Vulnerable code snippet
- Exploitation scenario (how attacker exploits)
- Impact assessment (what attacker gains)
- Mitigation with exact code
Performance
- Quantified impact (N queries, O(n²), X MB RAM)
- Before/after comparison
- Calculation showing regression
- Frequency/volume context (hot path vs cold)
Tests
- Function being tested (with signature)
- Missing test scenarios (list specific cases)
- Current coverage assessment
- Risk level (critical functions vs utilities)
Location: All templates have variations, consolidated here
Severity Guidelines
STRICT Severity Requirements (Updated)
critical: WILL break production (with proof)
- Must include: Reproduction scenario OR exploit proof
- Examples:
TypeErrorwith inputs, SQL injection with payload, crash on empty list - NOT "critical": "Could crash", "Might be exploited", "Should add validation"
- Evidence: Actual failing test case or exploitation demo
high: WILL cause user-visible problems (with proof)
- Must include: Quantified impact >10x OR API contract break with caller list
- Examples: N+1 with 1000 queries, API field removal + 5 callers affected
- NOT "high": "Poor performance", "Breaking change" without affected callers
- Evidence: Performance calculation OR list of broken callers
medium: Should fix, increases bug risk
- Must include: Specific scenario where issue manifests
- Examples: No error handling in critical path, uncovered edge case
- NOT "medium": "Could be better", "Consider refactoring"
- Evidence: Code path showing risk
low: Nice to have, not urgent
- Optimization opportunities with minor benefit
- Code clarity improvements
- NOT "low": Personal preferences, style nitpicks
DO NOT ASSIGN SEVERITY (don't flag at all):
- Personal preferences (naming, comments, formatting)
- Theoretical issues without proof
- "Could be refactored" without benefit
- Style suggestions
THE THRESHOLD: Can you write a failing test? If NO, don't flag it.
Universal Severity Matrix
| Severity | Code Analysis | Security | Performance | Tests |
|---|---|---|---|---|
| Critical | Crashes, data corruption, breaking changes | Exploitable vulnerabilities (injection, auth bypass, secrets) | 100x+ regression (O(n²), GB RAM) | Critical functions untested (payment, auth) |
| High | Logic bugs with user impact, missing error handling | Security weaknesses (missing validation, PII in logs) | 10x+ regression (N+1 queries, removed caching) | Modified functions tests not updated, incorrect tests |
| Medium | Code quality issues, improper try/except, missing edge cases | Defense-in-depth issues (missing size limits, weak crypto) | 2-5x regression (suboptimal algorithm) | Missing edge cases, test organization issues |
| Low | Style issues, optimization opportunities | Best practice violations (not immediate risk) | <2x regression, micro-optimizations | Minor improvements (better names, fixtures) |
| Uncertain | Can't confirm without human review | Can't confirm exploit without context | Can't measure without profiling | Test looks wrong but can't confirm |
Location: templates/pr-review-code-analysis.md:178-184, pr-review-security.md:263-269, pr-review-performance.md:201-207, pr-review-tests.md:231-237
When to Use "Uncertain"
Use "uncertain" severity when:
- Can't determine if issue actually occurs
- Context is missing (lock elsewhere, config-dependent, etc.)
- Need human judgment (business logic, architectural decision)
- Can't confirm without running code/tests
DON'T use "uncertain" for:
- Issues you're confident about (use appropriate severity)
- Laziness in investigation (do the work first)
- Style preferences (shouldn't be flagged at all)
Location: All templates in needs_verification sections
Context-Aware Scrutiny (Security)
Security Scrutiny Levels
| Level | Applies To | Additional Checks | Skip Checks |
|---|---|---|---|
| HIGH | Public APIs, payment logic, PII handling, auth/authz | Rate limiting, CSRF, input size limits, output encoding, SQL parameterization, audit logging | None |
| MEDIUM | Internal APIs, backend services, no direct user input | Input validation (less paranoid), SQL parameterization, no secrets in logs | Rate limiting, CSRF, output encoding |
| LOW | Pure utilities, data transformations, no I/O | Logic correctness, edge cases | Most security checks (no attack surface) |
Detection patterns:
- External API:
@app.route,@api.route, FastAPI decorators - Payment:
stripe,payment,transaction,billing - PII:
email,ssn,phone,addressin user models - Internal:
internal,rpc,service-to-service
Location: templates/pr-review-security.md:34-79
Testing Standards (Quick Reference)
Core Requirements
| Standard | Requirement | Location |
|---|---|---|
| 1:1 mapping | One test file per production file | tests/unit/test_<module>.py for src/<module>.py |
| Coverage | 95%+ line, 100% function | Unit tests |
| Pattern | AAA (Arrange-Act-Assert) | All tests |
| Mocking | Mock EVERYTHING external + OTHER INTERNAL FUNCTIONS | Use @patch() |
| Parametrization | Use @pytest.mark.parametrize for multiple cases |
When >3 similar test cases |
| Integration | 2-4 files per module (ADD to existing) | Don't create new files |
Location: templates/pr-review-tests.md:16-27, TESTING_STANDARDS.md (referenced)
What to Mock
| Always Mock | NEVER Mock |
|---|---|
Database calls (db.*, cursor.*) |
Function being tested |
API calls (requests.*, external services) |
Pure functions (no side effects) |
File I/O (open(), write()) |
Constants/config |
Cache operations (redis.*, memcache.*) |
|
| Other internal functions called by function under test |
WHY: Test function in ISOLATION, not with dependencies
Location: templates/pr-review-tests.md:22-24
Important Rules (All Review Types)
Universal Rules
| Rule | Why | Applies To |
|---|---|---|
| Every finding MUST have file:line reference | No vague claims | All reviews |
| Every finding MUST have evidence | Actionable fixes | All reviews |
| Be specific and actionable | Exact fix, not "fix this" | All reviews |
| If uncertain, flag as needs_verification | Prevent false positives | All reviews |
| Context matters - read entire function | Catch validation elsewhere | All reviews |
| Check ALL changed files | Don't skip tests/config | All reviews |
Location: templates/pr-review-code-analysis.md:186-194
Analysis Approach
Progressive investigation:
- Read smallest changed file to understand changes
- Trace data flow through changes
- Use Grep to find callers (understand impact)
- Read callers to check for breaking changes
- Check error paths (what happens when operations fail?)
- Check edge cases (None, empty, 0, negative, large values)
If >10 issues in one file: Flag for architectural discussion (not nitpicking)
Location: templates/pr-review-code-analysis.md:196-207
Response Format Requirements
JSON Structure (All Review Types)
Required fields:
status: "COMPLETE"[category]_issues: Array of findingsneeds_verification: Array of uncertain findingspositive_findings: Array of good patterns found
Each finding MUST include:
file: File pathline: Line number (or range)issue/vulnerability/problem: What's wrongevidence: Code snippet showing issueseverity: critical/high/medium/low/uncertainfix: Specific actionable solution
Optional but recommended:
impact: What happens if not fixedbefore: Previous code (for comparisons)after: Current code
Location: All templates include JSON response format sections
Edge Cases to Check (Universal)
Standard Edge Cases
| Category | Cases to Test | Example |
|---|---|---|
| None/Null | Does function handle None gracefully? | if value is None: return default |
| Empty | Empty list, dict, string | if not items: return [] |
| Zero | Numeric 0 (division by zero) | if count == 0: return 0.0 |
| Negative | Negative numbers where positive expected | if amount < 0: raise ValueError |
| Large | Very large numbers, long strings | Overflow, memory issues |
| Unicode | Non-ASCII characters, emojis | "test 🎉".encode('utf-8') |
| Duplicates | Duplicate items in list | Deduplication logic |
| Boundary | Min/max values (INT_MAX, 1-item list) | Off-by-one errors |
Location: templates/pr-review-tests.md:283-293
Performance Patterns
Common Performance Issues
| Pattern | Evidence | Impact Calculation |
|---|---|---|
| N+1 queries | DB query inside loop | N items × 1 query = N queries (should be 1) |
| Algorithm regression | Nested loops replacing single pass | O(n) → O(n²): 1000 items = 1M ops |
| Removed caching | @lru_cache deleted |
1 calc + cache → N calcs (1000x) |
| No batching | for item: db.insert(item) |
100 items = 100 round-trips (should be 1) |
| Memory bloat | list(db.find()) on large collection |
1M records × 1KB = 1GB RAM |
Location: templates/pr-review-performance.md:40-67, 219-247
Quantification Requirements
Always quantify performance impact:
- ✅ "Adds 1000 queries per request"
- ✅ "O(n) → O(n²) = 1M operations for 1000 items"
- ✅ "Called 1000 times = 1000x more work"
- ❌ "Might be slow" (too vague)
Location: templates/pr-review-performance.md:210-217
Verification Mission (Verification Agents)
Verdict Types
| Verdict | When to Use | Evidence Required |
|---|---|---|
| CONFIRMED | Issue is real | Additional proof, execution trace |
| CONFIRMED + DOWNGRADED | Issue real but less severe | Mitigating factors found |
| FALSE_POSITIVE | Issue doesn't exist | Code preventing issue, explanation why |
| UNCERTAIN | Can't determine | What's needed to resolve, why uncertain |
Location: templates/pr-review-verification.md:17-24
Verification Decision Tree
1. Read code in full context
↓
2. Can you find evidence issue exists?
YES → Continue to 3
NO → Continue to 6
↓
3. Trace execution - can issue actually be triggered?
YES → CONFIRMED
NO → Continue to 4
↓
4. Is there code preventing issue that agent missed?
YES → FALSE_POSITIVE
NO → Continue to 5
↓
5. Unclear if issue can be triggered?
YES → UNCERTAIN
↓
6. Can you prove issue doesn't exist?
YES → FALSE_POSITIVE
NO → UNCERTAIN
Location: templates/pr-review-verification.md:232-256
Common False Positive Patterns
| Pattern | What Agent Missed | Resolution |
|---|---|---|
| Validation in caller | Agent checked function, not caller | FALSE_POSITIVE |
| Error handling | try/except or guard clause | FALSE_POSITIVE |
| Type checking | Type hints guarantee non-None | FALSE_POSITIVE |
| Business logic | "Missing auth" is intentional (public endpoint) | FALSE_POSITIVE |
Location: templates/pr-review-verification.md:258-304
Quick Reference Checklist
Before Spawning PR Review Agents
Include these standards in agent prompt:
Code Analysis Agent:
- try/except ONLY for connection errors
- Logging:
LOG = logging.getLogger(__name__) - Type hints required
- 80 char limit,
# noqaneeds reason - NO assumptions - flag as needs_verification
Security Agent:
- Context-aware scrutiny (high/medium/low)
- Provide exploitation scenarios
- Check entire data flow (input → DB/logs/response)
- NO theoretical "could be" suggestions
Performance Agent:
- QUANTIFY impact (N queries, O(n²), X MB)
- Compare before/after
- Check if hot path vs cold path
- Consider request volume
Test Agent:
- 1:1 file mapping required
- Mock EVERYTHING external + other internal functions
- 95%+ coverage
- Check edge cases (None, empty, 0, negative)
Verification Agent:
- Read entire function (not just flagged line)
- Check callers (validation might be there)
- Provide verdict with evidence
- If uncertain, return UNCERTAIN (not FALSE_POSITIVE)
Integration with agent-prompting Skill
Relationship: pr-review-standards provides the WHAT (standards to check), agent-prompting provides the HOW (how to write prompts).
When to load both:
- agent-prompting: When writing agent prompts (structure, format)
- pr-review-standards: When defining review criteria (standards, severity)
Typical usage:
- Load agent-prompting for prompt template
- Load pr-review-standards for standards to include inline
- Combine: Template + Standards = Effective agent prompt
Summary
The Golden Rules
- NO Assumptions - If uncertain, flag for verification
- PROOF Required - Every finding needs evidence
- Read IN CONTEXT - Entire function, not just one line
- Quantify Impact - Numbers, not vague claims
- Context-Aware - External APIs ≠ internal utilities
- File:Line Required - Every finding must have location
The Test
Can you prove the issue exists with concrete evidence? If not, flag as "needs_verification".
The Goal
Evidence-based reviews that produce actionable findings, minimize false positives, and respect context.
Remember: Better to flag as uncertain than to claim something is wrong without proof.