name	ci-monitoring
description	Use after creating PR - monitor CI pipeline, resolve failures cyclically until green or issue is identified as unresolvable

CI Monitoring

Overview

Monitor CI pipeline and resolve failures until green.

CRITICAL: CI is validation, not discovery.

If CI finds a bug you didn't find locally, your local testing was insufficient.

Before blaming CI, ask yourself:

Did you run all tests locally?

Did you test against local services (postgres, redis)?

Did you run the same checks CI runs?

Did you run integration tests, not just unit tests with mocks?

CI should only fail for: environment differences, flaky tests, or infrastructure issues—never for bugs you could have caught locally.

Core principle: CI failures are blockers. But they should never be surprises.

Announce at start: "I'm monitoring CI and will resolve any failures."

The CI Loop

PR Created
     │
     ▼
┌─────────────┐
│ Wait for CI │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ CI Status?  │
└──────┬──────┘
       │
   ┌───┴───┐
   │       │
 Green   Red/Failed
   │       │
   ▼       ▼
 DONE   ┌─────────────┐
        │ Diagnose    │
        │ failure     │
        └──────┬──────┘
               │
               ▼
        ┌─────────────┐
        │ Fixable?    │
        └──────┬──────┘
               │
          ┌────┴────┐
          │         │
         Yes        No
          │         │
          ▼         ▼
     ┌─────────┐  ┌─────────────┐
     │ Fix and │  │ Document as │
     │ push    │  │ unresolvable│
     └────┬────┘  └─────────────┘
          │
          └────► Back to "Wait for CI"

Checking CI Status

Using GitHub CLI

# Check all CI checks
gh pr checks [PR_NUMBER]

# Watch CI in real-time
gh pr checks [PR_NUMBER] --watch

# Get detailed status
gh pr view [PR_NUMBER] --json statusCheckRollup

Expected Output

All checks were successful
0 failing, 0 pending, 5 passing

CHECKS
✓  build          1m23s
✓  lint           45s
✓  test           3m12s
✓  typecheck      1m05s
✓  security-scan  2m30s

Handling Failures

Step 1: Identify the Failure

# Get failed check details
gh pr checks [PR_NUMBER]

# View workflow run logs
gh run view [RUN_ID] --log-failed

Step 2: Diagnose the Cause

Common failure types:

Type	Symptoms	Cause
Test failure	`FAIL` in test output	Code bug or test bug
Build failure	Compilation errors	Type errors, syntax errors
Lint failure	Style violations	Formatting, conventions
Typecheck failure	Type errors	Missing types, wrong types
Timeout	Job exceeded time limit	Performance issue or stuck test
Flaky test	Passes locally, fails CI	Race condition, environment difference

Step 3: Fix the Issue

Test Failures

# Reproduce locally
pnpm test

# Run specific failing test
pnpm test --grep "test name"

# Fix the code or test
# Commit and push

Build Failures

# Reproduce locally
pnpm build

# Fix compilation errors
# Commit and push

Lint Failures

# Check lint errors
pnpm lint

# Auto-fix what's possible
pnpm lint:fix

# Manually fix remaining
# Commit and push

Type Failures

# Check type errors
pnpm typecheck

# Fix type issues
# Commit and push

Step 4: Push Fix and Wait

# Commit fix
git add .
git commit -m "fix(ci): Resolve test failure in user validation"

# Push
git push

# Wait for CI again
gh pr checks [PR_NUMBER] --watch

Step 5: Repeat Until Green

Loop through diagnose → fix → push → wait until all checks pass.

Flaky Tests

Identifying Flakiness

Test passes locally
Test fails in CI
Test passes on retry in CI

Handling Flakiness

Don't just retry - Find the root cause
Check for race conditions - Timing-dependent code
Check for environment differences - Paths, env vars, services
Check for state pollution - Tests affecting each other

// Common flaky pattern: timing dependency
// BAD
await saveData();
await delay(100);  // Hoping 100ms is enough
const result = await loadData();

// GOOD: Wait for condition
await saveData();
await waitFor(() => dataExists());
const result = await loadData();

Unresolvable Failures

Sometimes failures can't be fixed in the current PR:

Legitimate Unresolvable Cases

Case	Example
CI infrastructure issue	Service down, rate limited
Pre-existing flaky test	Not introduced by this PR
Upstream dependency issue	External API changed
Requires manual intervention	Needs secrets, permissions

Process for Unresolvable

Document the issue

gh pr comment [PR_NUMBER] --body "## CI Issue

The \`security-scan\` check is failing due to a known issue with the scanner service (see #999).

This is not related to changes in this PR. The scan passes when run locally.

Requesting bypass approval from @maintainer."

Create issue if new

gh issue create \
  --title "CI: Security scanner service timeout" \
  --body "The security scanner is timing out in CI..."

Request bypass if appropriate

Some teams allow merging with known infrastructure failures.

Do NOT merge with real failures

If the failure is from your code, it must be fixed.

CI Best Practices

Run Locally First (MANDATORY)

CI is the last resort, not the first check.

Before pushing, run EVERYTHING CI will run:

# Run the same checks CI will run
pnpm lint
pnpm typecheck
pnpm test              # Unit tests
pnpm test:integration  # Integration tests against real services
pnpm build

# If you have database changes
docker-compose up -d postgres
pnpm migrate

If your project has docker-compose services:

Start them before testing: docker-compose up -d
Run integration tests against real services
Verify migrations apply to real database
Don't rely on mocks alone

Skill: local-service-testing

Commit Incrementally

Don't push 10 commits at once. Push smaller changes:

# Small fix, push, verify
git push

# Wait for CI
gh pr checks --watch

# Then next change

Monitor Actively

Don't "push and forget":

# Watch CI after each push
gh pr checks [PR_NUMBER] --watch

Checklist

For each CI run:

Waited for CI to complete
All checks examined
Failures diagnosed
Fixes implemented
Re-pushed and re-checked
All green before proceeding

For unresolvable issues:

Root cause identified
Not caused by PR changes
Documented in PR comment
Issue created if new problem
Bypass approval requested if appropriate

Integration

This skill is called by:

issue-driven-development - Step 13

This skill follows:

pr-creation - PR exists

This skill precedes:

verification-before-merge - Final checks

This skill may trigger:

error-recovery - If CI reveals deeper issues

ci-monitoring

Install Skill

SKILL.md