expert-debugging-and-lint-fixing

@auldsyababua/instructor-workflow

SKILL.md

name: expert-debugging-and-lint-fixing
description: Systematic debugging workflow to reproduce, isolate, and fix hard software bugs, resolve related lint issues, and add tests and guardrails to prevent regressions. This skill should be used for complex bugs that teams struggle to fix, flaky/intermittent failures, production-only bugs, and environment-specific issues where code changes must be lint-clean.
version: 1.0.0
dependencies: python>=3.8

Expert Debugging & Lint Fixing

Overview

This skill provides a systematic debugging playbook for extremely hard bugs that teams struggle to fix. It focuses on reproducing and isolating bugs through hypothesis-driven experiments, fixing root causes, explicitly resolving lint issues and static analysis findings in touched code, and adding tests and guardrails to prevent regressions.

When to Use This Skill

Use this skill for:

  • Complex bugs that were not solved by normal debugging approaches
  • Flaky or intermittent failures that are difficult to reproduce
  • Production-only or environment-specific bugs
  • Cases where code changes also need to be lint-clean and align with project lint rules
  • Bugs requiring systematic investigation with measurable progress
  • Heisenbugs that change behavior when being debugged
  • Performance issues, race conditions, or data corruption bugs

Core Principles

Follow these principles throughout the debugging process:

  • Always define a precise bug contract: "Given X, expect A, got B"
  • Prioritize reproducibility before attempting a fix: If the bug cannot be reproduced, the current task is to make it reproducible
  • Use hypothesis-driven debugging, not random code edits: Each change must test a specific hypothesis
  • Aggressively shrink the search space: Use divide-and-conquer to isolate the failure
  • Treat lint and static analysis violations as bugs: These must be resolved in changed areas
  • Finish with root-cause prevention, not just symptom fixes: Add systemic guardrails to prevent recurrence

Step-by-Step Debugging Protocol

1. Frame the Problem Precisely

Convert vague bug reports into a precise bug contract that can be verified.

Actions:

  • State the bug as: "Given state X and input Y, expected A, got B"
  • Capture all relevant context:
    • Who: Which user/role/environment
    • What: Exact behavior observed
    • When: Timing/frequency/conditions
    • Where: Component/service/layer
    • How often: Always/intermittent/once
    • Since when: Recent change or longstanding issue
  • Identify the primary symptom and impact (user-visible, data corruption, performance degradation, security risk)
  • Write one precise sentence describing the bug

Gate: If the bug cannot be stated in one precise sentence, do not touch code yet. Continue refining the problem statement.

2. Make It Reproducible (or Tightest Approximation)

Create the smallest, fastest, most reliable reproduction case possible.

Actions:

  • Start from the real failing path (same API/UI/job and environment if possible)
  • Strip down to the smallest input + environment that still fails
  • Turn this into a script, unit test, or integration test that can be re-run
  • For flaky bugs:
    • Run in a loop (100+ iterations)
    • Log every run with timestamps and relevant state
    • Capture failing cases for analysis
    • Look for patterns (timing, resource usage, specific inputs)
  • Document the repro steps clearly

Gate: If the bug is not reproducible, the current task is: make it reproducible. Do not proceed until there is a way to trigger the failure on demand or with high probability.
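
A minimal sketch of such a repro harness, assuming the failure has been captured as a command that exits non-zero when the bug triggers (the `pytest` invocation below is a placeholder):

```python
#!/usr/bin/env python3
"""Re-run a flaky repro command many times and capture the output of failing runs."""
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

# Placeholder repro command; replace with the script or test that triggers the bug.
REPRO_CMD = ["pytest", "-x", "tests/test_checkout.py::test_total"]
ITERATIONS = 100
FAILURE_DIR = Path("repro_failures")


def main() -> int:
    FAILURE_DIR.mkdir(exist_ok=True)
    failures = 0
    for i in range(1, ITERATIONS + 1):
        started = datetime.now(timezone.utc).isoformat()
        result = subprocess.run(REPRO_CMD, capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
            # Keep the full output of every failing run so failures can be compared later.
            log = FAILURE_DIR / f"run_{i:04d}.log"
            log.write_text(f"started={started}\nexit={result.returncode}\n\n"
                           f"{result.stdout}\n{result.stderr}")
        print(f"[{started}] run {i}/{ITERATIONS}: "
              f"{'FAIL' if result.returncode else 'ok'}")
    print(f"{failures}/{ITERATIONS} runs failed")
    return 0 if failures == 0 else 1


if __name__ == "__main__":
    sys.exit(main())
```

Saving every failing run's output makes it possible to diff failures against each other and look for the timing, resource, or input patterns mentioned above.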

3. Add/Improve Observability

Ensure logs, metrics, and traces illuminate the failing path.

What to log:

  • Key parameters and branch decisions at each step
  • External calls (database, cache, HTTP, queue operations)
  • Concurrency boundaries (locks acquired/released, queues, async operations)
  • State transitions and invariant checks
  • Timing information (timestamps, durations)

Logging quality:

  • Each log should answer: "What did we know? What did we decide? What happened next?"
  • If logs are noisy and uninformative, refine them as part of the fix
  • Use structured logging with correlation IDs to track requests through the system
  • Include context that helps differentiate between different executions
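
As one concrete shape for this, a sketch using Python's standard `logging` and `contextvars` modules to attach a correlation ID to every log line; the `checkout` logger and `handle_request` function are illustrative names:

```python
"""Structured logging with a per-request correlation ID (illustrative sketch)."""
import contextvars
import logging
import uuid

# Holds the correlation ID for the current execution context (also works across async tasks).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [%(correlation_id)s] %(name)s: %(message)s",
)
logger = logging.getLogger("checkout")
logger.addFilter(CorrelationFilter())


def handle_request(cart_size: int) -> None:
    correlation_id.set(uuid.uuid4().hex[:8])  # one ID per request
    # Each line answers: what did we know, what did we decide, what happened next?
    logger.info("received checkout request, cart_size=%d", cart_size)
    logger.info("decided to call payment service, retries=%d", 2)
    logger.info("payment call returned status=%s", "ok")


if __name__ == "__main__":
    handle_request(cart_size=3)
```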

4. Shrink the Search Space

Use systematic techniques to narrow down where the bug occurs.

Binary search on time:

  • Use git bisect between known-good and known-bad commits
  • Identify the exact commit that introduced the bug

Binary search on code path:

  • Temporarily short-circuit sections or use feature flags to disable blocks
  • See if the bug disappears when specific code paths are bypassed

Isolate layers:

  • Replace real dependencies with fakes/mocks
  • Try in-process vs over-the-network variants
  • Test with minimal/maximal configurations

Evaluation criteria:

  • Each experiment must make the failure demonstrably more or less likely, ruling hypotheses in or out
  • Avoid inconclusive changes that provide no information
  • Document what each experiment revealed
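
`git bisect` can drive the time bisection automatically when given a check script that exits 0 on good commits and non-zero on bad ones (`git bisect start <bad> <good>` followed by `git bisect run python bisect_check.py`). A minimal sketch of such a script, reusing the placeholder repro command from step 2:

```python
#!/usr/bin/env python3
"""Exit 0 if this commit is good, 1 if it reproduces the bug (for `git bisect run`)."""
import subprocess
import sys

# Placeholder repro; swap in the script or test built in step 2.
REPRO_CMD = ["pytest", "-x", "tests/test_checkout.py::test_total"]

result = subprocess.run(REPRO_CMD, capture_output=True, text=True)
# git bisect run treats exit code 0 as "good", 125 as "cannot test this commit"
# (e.g., it does not build), and other codes from 1-127 as "bad".
sys.exit(0 if result.returncode == 0 else 1)
```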

5. Classify the Bug Type

The bug's category dictates which tools to use and which experiments to run next.

Bug categories:

  • Logic: Wrong condition, off-by-one error, incorrect algorithm
  • Data: Bad/inconsistent records, violated invariants, corrupted state
  • Environment/config: Environment variables, version mismatches, feature flags
  • Concurrency/race: Shared mutable state, timing-dependent behavior, deadlocks
  • Performance: Memory leaks, excessive CPU usage, inefficient algorithms
  • Integration: API contract violations, dependency issues, protocol errors

Actions:

  • Identify the most likely category based on symptoms
  • Select appropriate debugging tools for that category
  • Prepare specific experiments to confirm or rule out the classification

6. Hypothesis-Driven Experiments

Conduct systematic experiments to identify the root cause.

Process:

  1. List top hypotheses (3–5 most likely causes)
  2. For each hypothesis, define:
    • "If this is the cause, then doing X should produce observable Y"
    • How to test it (minimal change, log addition, config tweak)
    • What outcome would falsify it
  3. Run minimal, fast experiments to falsify hypotheses
  4. Discard falsified hypotheses quickly; don't cling to favorite theories
  5. After each experiment, explicitly state:
    • What was tested
    • What was observed
    • What was learned
    • Which hypotheses remain viable

Avoid:

  • Testing multiple hypotheses at once (confounds results)
  • Making large changes without clear predictions
  • Confirmation bias (looking only for evidence that supports preferred theory)
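
Keeping the experiment log in a structured form helps enforce the "state what was tested, observed, and learned" discipline; a small sketch with illustrative field names:

```python
"""Record hypotheses and experiment outcomes so falsified ideas stay discarded."""
from dataclasses import dataclass
from typing import List


@dataclass
class Experiment:
    hypothesis: str        # "If X is the cause, doing Y should produce observable Z"
    test: str              # the minimal change, log addition, or config tweak tried
    prediction: str        # the outcome that would confirm or falsify the hypothesis
    observed: str = ""     # what actually happened
    verdict: str = "open"  # "supported", "falsified", or "open"


log: List[Experiment] = [
    Experiment(
        hypothesis="Shared cache state leaks between requests",
        test="Disable the cache behind a feature flag, re-run the repro loop 200 times",
        prediction="Failure rate drops to zero if the cache is the cause",
        observed="Failure rate unchanged",
        verdict="falsified",
    ),
]

for e in log:
    print(f"[{e.verdict}] {e.hypothesis} -> {e.observed or 'not yet run'}")
```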

7. Use the Right Tools (Including Linters)

Select debugging tools appropriate for the bug type.

For logic bugs:

  • Debuggers with breakpoints, conditional breakpoints, watchpoints
  • Print debugging with strategic log placement
  • Unit tests that isolate specific functions

For performance/leak issues:

  • Profilers (CPU, memory, I/O)
  • Memory leak detectors
  • Performance monitoring tools

For concurrency issues:

  • Thread sanitizers
  • Race condition detectors
  • Stress testing with high parallelism

For all bugs:

  • Run all relevant linter and static analysis tools on:
    • The changed files
    • Ideally the impacted module/package
  • Tools include: ESLint, Flake8, mypy, Pylint, go vet, clippy, custom linters
  • Treat new or existing lint violations in the changed area as part of the work to fix
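
A sketch of wiring this into the workflow for a Python project: collect the files changed against a base branch and run the project's linters on just those files. The `origin/main` base and the Flake8/mypy pairing are assumptions to adapt:

```python
#!/usr/bin/env python3
"""Run linters and static analysis on files changed relative to a base branch."""
import subprocess
import sys
from typing import List

BASE = "origin/main"                    # assumed diff base; use the project's default branch
LINTERS = [["flake8"], ["mypy"]]        # assumed project tools; swap in your own


def changed_python_files() -> List[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=d", BASE],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.endswith(".py")]


def main() -> int:
    files = changed_python_files()
    if not files:
        print("No changed Python files.")
        return 0
    status = 0
    for linter in LINTERS:
        # Non-zero exit from any linter marks the whole run as failed.
        status |= subprocess.run(linter + files).returncode
    return status


if __name__ == "__main__":
    sys.exit(main())
```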

8. Handle Flaky and Heisenbugs

For bugs that appear/disappear unpredictably, use amplification techniques.

Amplify the bug:

  • Run repro in tight loops (1000+ iterations)
  • Run in parallel with multiple processes
  • Run under stress (high CPU/memory/disk usage)
  • Introduce jitter and delays to surface race conditions
  • Use chaos engineering techniques (network delays, packet loss, resource constraints)

Capture evidence:

  • Take snapshots/dumps on detected bad states if tools allow
  • Correlate logs/traces/metrics with unique request or correlation ID
  • Record timing information to identify patterns
  • Save state before/after failure for comparison

Reduce noise:

  • Disable unrelated jobs/traffic/features to simplify
  • Use minimal test data
  • Isolate the system under test from external dependencies
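
A minimal sketch of an amplification harness that runs the repro in parallel with random start-time jitter, using only the standard library (the repro command is again a placeholder, and parallel runs assume the repro tolerates concurrent execution):

```python
#!/usr/bin/env python3
"""Amplify a suspected race by running the repro in parallel with random jitter."""
import random
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

REPRO_CMD = ["pytest", "-x", "tests/test_checkout.py::test_total"]  # placeholder repro
WORKERS = 8
RUNS = 200


def one_run(i: int) -> bool:
    time.sleep(random.uniform(0, 0.05))  # jitter start times to vary interleavings
    return subprocess.run(REPRO_CMD, capture_output=True).returncode != 0


with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    failures = sum(pool.map(one_run, range(RUNS)))

print(f"{failures}/{RUNS} parallel runs failed")
```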

9. Implement the Fix

Make the smallest, clearest change that eliminates the failing behavior.

Fix quality criteria:

  • Eliminates the failing behavior in the repro case
  • Respects the system's invariants and constraints
  • Maintains or improves code clarity
  • Does not introduce new bugs or performance issues
  • Aligns with team coding standards

Approach:

  • Prefer refactors that increase clarity over patchy hacks
  • Keep commits focused and well-described for easy review
  • Include comments explaining non-obvious aspects
  • Consider edge cases and boundary conditions
  • Update any affected documentation

Before committing:

  • Verify the fix resolves the original bug
  • Check for unintended side effects
  • Ensure the fix doesn't just move the problem elsewhere

10. Validation: Tests, Lint, and CI

Ensure the fix is complete and won't regress.

Test the fix:

  • Turn the repro into an automated test that fails on the old code
  • Confirm:
    • The new test fails on the pre-fix commit
    • The new test passes on the fix
  • Add negative/edge-case tests around the bug
  • Test related functionality for regressions
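
As an illustration, a regression test distilled from a hypothetical bug contract; `compute_cart_total` and the expected values are placeholders for the real code under test:

```python
"""Regression test derived from the repro: fails on the pre-fix commit, passes after."""
import pytest

from checkout import compute_cart_total  # hypothetical module under test


def test_cart_total_with_duplicate_line_items():
    # Bug contract: given a cart with duplicate line items, expected 30.00, got 15.00.
    cart = [{"sku": "A1", "price": 15.00}, {"sku": "A1", "price": 15.00}]
    assert compute_cart_total(cart) == pytest.approx(30.00)


def test_cart_total_empty_cart_edge_case():
    # Negative/edge case added around the bug.
    assert compute_cart_total([]) == pytest.approx(0.00)
```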

Lint and static analysis:

  • Run the full lint suite relevant to the project
  • Fix all lint issues in the changed files
  • Do not ignore or suppress lint errors unless there is a clear, documented reason
  • Ensure static analysis tools pass (type checkers, security scanners)

CI validation:

  • Run the existing test suite (or at least impacted subset)
  • Ensure all CI checks pass (build, tests, linting, security scans)
  • Verify no new warnings are introduced
  • Check that code coverage hasn't decreased

11. Root Cause Analysis & Systemic Prevention

Prevent similar bugs from occurring in the future.

Write a root cause summary:

  • When was the bug introduced? (specific commit/release)
  • When was it detected? (how long did it exist?)
  • Why was it not caught earlier? (gaps in testing, code review, monitoring)
  • What was the underlying cause? (not just the symptom)

Conduct "5 Whys" analysis:

  1. Why did the bug occur? (immediate cause)
  2. Why did that happen? (contributing factor)
  3. Why was that possible? (system weakness)
  4. Why wasn't this caught? (process gap)
  5. Why does this pattern exist? (root cause)

Add systemic guardrails:

  • New tests (unit, integration, property-based, regression)
  • Stronger validation and type checks
  • Runtime invariant assertions
  • Lint rules or static checks that catch similar issues early (when possible)
  • Monitoring/alerting to detect similar failures
  • Documentation updates (architecture docs, code comments, runbooks)
  • Code review checklist items for this module
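
One possible shape for the runtime invariant assertions listed above, sketched in Python with illustrative names; the point is to fail fast with a clear message instead of letting corrupted state propagate:

```python
"""Guardrail: fail fast with a clear message when an invariant is violated."""


class InvariantError(RuntimeError):
    """Raised when internal state violates an assumed invariant."""


def check_invariant(condition: bool, message: str) -> None:
    # Unlike a bare `assert`, this is not stripped out when running under `python -O`.
    if not condition:
        raise InvariantError(message)


def apply_discount(total: float, discount: float) -> float:
    check_invariant(0.0 <= discount <= total, f"discount {discount} outside [0, {total}]")
    new_total = total - discount
    check_invariant(new_total >= 0.0, f"negative total {new_total} after discount")
    return new_total
```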

Share learnings:

  • Update team documentation
  • Present findings in team meetings
  • Add to incident postmortem if applicable

12. Team Protocol for "Impossible Bugs"

When escalating difficult bugs to the team, provide comprehensive context.

Required information:

  1. One-sentence bug contract ("Given X, expect A, got B")
  2. Repro script or test that demonstrates the failure
  3. Logs/metrics/traces for a failing run
  4. Top hypotheses and experiments tried with outcomes
  5. Relevant commit range and config/environment diffs
  6. Impact assessment (severity, affected users, workarounds)

Escalation gate:

  • Use this checklist as a gate before escalating to senior engineers
  • Ensures sufficient investigation has been done
  • Provides context for effective collaboration

Linting and Static Analysis Protocol

Linting and static analysis are integral to the debugging process, not optional cleanup.

Always Run Relevant Tools

For every file modified during debugging:

  • Run all relevant linters (language-specific and project-specific)
  • Run static analyzers (type checkers, security scanners)
  • Run code formatters if the project uses them

Common tools by language:

  • JavaScript/TypeScript: ESLint (with typescript-eslint), Prettier
  • Python: Flake8, Pylint, Black, mypy, Bandit
  • Go: go vet, staticcheck, revive
  • Rust: clippy, rustfmt
  • Java: Checkstyle, SpotBugs, PMD
  • Ruby: RuboCop
  • C/C++: clang-tidy, cppcheck

Fix Lint Issues

What to fix:

  • All newly introduced lint issues (from the changes made)
  • Existing lint issues in the touched code region (where feasible)
  • Critical security or correctness issues flagged by static analysis

When to suppress: Only suppress or relax lint rules when:

  • There is a clear, written justification
  • The justification is documented in a code comment
  • The suppression is made in configuration where possible, rather than via scattered inline disables
  • The team agrees this is an appropriate exception

Never:

  • Ignore lint errors by disabling the linter
  • Commit code with unresolved lint violations without justification
  • Suppress entire categories of checks without review

Treat Passing Lint as Part of "Done"

A bug fix is not complete until:

  • The bug is fixed
  • Tests are added
  • All lint checks pass
  • CI pipeline is green

Example Usage

Example 1: Flaky CI Test

User prompt: "We have a flaky test in our CI that fails randomly on our Node service. Help us track it down and fix it."

How this skill applies:

  1. Frame the problem: Identify which test fails, under what conditions, and how frequently
  2. Make it reproducible: Run the test in a loop locally (1000+ iterations), capture failing cases
  3. Add observability: Add detailed logging around the failing assertions, log timing information
  4. Shrink search space: Isolate the test from others, run with minimal fixtures
  5. Classify: Likely a race condition or timing issue
  6. Hypothesize: Test hypotheses like "async operation not awaited", "shared state between tests", "timing-dependent assertion"
  7. Use tools: Run with stress testing, check for race conditions
  8. Fix: Add proper awaits, isolate test state, or fix timing assumptions
  9. Validate: Ensure test passes 10,000+ times in a row, run ESLint/Prettier on changed files
  10. Prevent: Add test isolation guards, document async patterns

Example 2: Production 500 Errors

User prompt: "Production is throwing intermittent 500s on checkout; logs are unclear. Guide me through reproducing and fixing this."

How this skill applies:

  1. Frame the problem: "Given checkout request with cart X, expect 200 OK, got 500 Internal Server Error"
  2. Make it reproducible:
    • Gather production logs with correlation IDs
    • Identify common patterns in failing requests
    • Create test with similar cart composition/state
  3. Add observability:
    • Enhance logging in checkout flow
    • Log all external service calls (payment, inventory)
    • Log state transitions
  4. Shrink search space:
    • Test each checkout step in isolation
    • Mock external services to identify which dependency causes failures
  5. Classify: Likely an integration issue (external service) or data issue (bad cart state)
  6. Hypothesize: Test theories like "payment service timeout", "inventory service race condition", "invalid cart state"
  7. Fix: Add proper error handling, validate cart state earlier, implement retry logic
  8. Validate: Run lint tools on modified files, add integration tests for edge cases
  9. Prevent: Add input validation, monitoring alerts for 500s, circuit breakers for external services
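
If the "payment service timeout" hypothesis holds, the fix typically includes bounded retries with backoff around the external call; a minimal sketch in which `RetryableError` and the wrapped callable are placeholders for the real client:

```python
"""Bounded retry with exponential backoff around a flaky external call (sketch)."""
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Placeholder for the transient errors raised by the external client."""


def call_with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.2) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == attempts:
                raise  # give up and let the caller return a clear error
            # Exponential backoff with jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
    raise AssertionError("unreachable")
```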

Example 3: Lint Violations After Quick Patch

User prompt: "I applied a quick patch to stop a crash, but now our linter is complaining all over that file."

How this skill applies:

  1. Re-evaluate the patch: Check if the quick fix actually addresses the root cause or just the symptom
  2. Frame the original problem: Define what crash was occurring and why
  3. Improve the implementation:
    • Refactor the patch to follow code standards
    • Address the root cause properly, not just the symptom
  4. Fix lint issues:
    • Run linter and address all violations in the file
    • Do not suppress the linter without justification
    • Ensure the fix follows team coding standards
  5. Validate: Add tests that prove the crash is fixed, ensure lint passes
  6. Prevent: Add guardrails to prevent similar crashes, document the fix

Example 4: Memory Leak in Long-Running Service

User prompt: "Our API service's memory usage grows unbounded over days. Find and fix the leak."

How this skill applies:

  1. Frame the problem: "After X hours of operation, memory usage reaches Y GB and service crashes"
  2. Make it reproducible:
    • Create load test that simulates days of traffic in minutes
    • Monitor memory usage during test
  3. Add observability:
    • Add memory profiling
    • Log object creation/destruction for suspected components
  4. Classify: Memory leak (performance bug)
  5. Use tools:
    • Memory profilers (heapdump, valgrind, etc.)
    • Analyze heap snapshots over time
  6. Hypothesize: Test theories like "event listeners not removed", "cache growing unbounded", "circular references"
  7. Fix: Remove event listeners, add cache eviction, break circular references
  8. Validate:
    • Run extended load test, verify memory stays stable
    • Run linter on changed files
    • Add regression test that monitors memory growth
  9. Prevent: Add memory monitoring alerts, document lifecycle management patterns
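
If the leaking service is written in Python, heap snapshots over time can be compared with the standard `tracemalloc` module; a minimal sketch with a deliberately leaky placeholder workload:

```python
"""Compare heap snapshots over time to locate where a leak is allocated (Python sketch)."""
import tracemalloc

leaky_cache = []  # stand-in for a suspected unbounded structure


def handle_request(i: int) -> None:
    leaky_cache.append("x" * 1024)  # placeholder workload that leaks on purpose


tracemalloc.start()
before = tracemalloc.take_snapshot()

for i in range(10_000):
    handle_request(i)

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:5]:
    # Largest growth by allocation site points at the leak's origin.
    print(stat)
```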

Related Skills

This skill focuses on systematically debugging hard bugs. For related tasks, use:

  • test-specialist: Writing comprehensive tests after fixing bugs (TDD approach, test coverage analysis)
  • code-validation: Validating fixes don't introduce red flags (secrets, test disabling, security issues)
  • test-quality-audit: Auditing test quality after adding regression tests
  • test-standards: Ensuring test code follows project standards
  • webapp-testing: Browser-based debugging and E2E test creation for web applications
  • chrome-devtools: Browser performance debugging and Core Web Vitals measurement

Resources

This skill does not require bundled scripts, references, or assets. The debugging protocol is entirely procedural and can be applied to any programming language, framework, or bug type.

If language-specific debugging scripts or checklists would be helpful for your team's common debugging scenarios, they can be added to the scripts/ directory. Similarly, team-specific debugging runbooks or reference materials can be added to references/.