---
name: test-writer
description: MANDATORY - INVOKE BEFORE writing ANY test code (def test_*, class Test*). Prevents brittle tests. Read this skill first, then write tests.
---
# test-writer Skill
## 🚨 CRITICAL: MANDATORY FOR ALL TEST WRITING AND UPDATING

**YOU CANNOT WRITE OR UPDATE TESTS WITHOUT THIS SKILL.**
If you write or update tests without following this skill, you will:
- Write brittle tests with hardcoded library outputs
- Create self-evident tests that provide zero value
- Use fixtures incorrectly (overuse for simple cases, underuse for complex)
- Test Python/library behavior instead of YOUR code's contracts
This skill is your checklist. Follow it step-by-step. No shortcuts.
## 🚨 CRITICAL FOR TEST WRITING
- BEFORE writing tests → Use test-writer skill (MANDATORY - analyzes code type, dependencies, contract)
- AFTER writing tests → Invoke pytest-test-reviewer agent (validates patterns)
- YOU CANNOT WRITE TESTS WITHOUT test-writer SKILL - No exceptions, no shortcuts, every test, every time
## When to Use This Skill
Use this skill when:
- ✅ User asks "write tests for X"
- ✅ You're creating a new test file (`test_*.py`)
- ✅ You're adding tests to an existing test file
- ✅ User says "test this" or "add test coverage"
- ✅ You've just written code and need to test it
- ✅ You're updating/modifying existing tests (e.g., when test-fixer needs to update test expectations)
- ✅ Tests are failing and need to be fixed (use this skill to understand what to change)
DO NOT write or update tests without using this skill. PERIOD.
## 🔄 How This Skill Interacts With Other Skills
- Called by test-fixer when modifying test files - determines if code or contract is wrong
- Can call sql-reader to query production data model and design realistic fixtures
- MUST call semantic-search before writing tests to find existing test patterns and fixtures:
  `docker exec arsenal-semantic-search-cli code-search find "test <feature>"`
- Check for existing fixtures, test utilities, and similar test patterns
- Works autonomously but flags UX contract changes: "⚠️ UX contract change: [explain]"
## 🚨 CRITICAL: Don't Encode Broken Behavior
When updating tests, ask:
- Is the CODE wrong? → Fix code, keep test
- Is the TEST wrong? → Update test (legitimate contract change)
- Is this encoding BROKEN behavior? → Flag to user and continue
Red flags:
- "Code changed so I'll update the test" ← DANGER
- Test passed → code changed → test fails → changing test instead of code ← DANGER
Safe updates:
- Intentional contract change (documented in spec)
- Refactoring (same behavior, different implementation)
- Fixing brittle tests (testing implementation not contract)
When in doubt: Flag it and continue autonomously: "⚠️ This may encode broken behavior: [explain]"
## Step 1: Analyze the Code Being Tested
Before writing A SINGLE LINE of test code, answer these questions:
### Question 1: What type of code is this?

- **Pure function** (no side effects, no state, deterministic)
  - Example: `def calculate_total(items: list[Item]) -> float`
  - Example: `def infer_timezone_from_phone(phone: str) -> str | None`
- **Database model/ORM** (models with relationships, DB operations)
  - Example: `create_intervention(message: Message, user: User) -> Intervention`
  - Example: `get_conversation_messages(conversation_id: int) -> list[Message]`
- **API endpoint** (FastAPI routes, HTTP handlers)
  - Example: `POST /webhook/sendblue`
  - Example: `GET /conversations/{id}/messages`
- **External service integration** (calls to OpenAI, Langfuse, SendBlue, etc.)
  - Example: `send_intervention_via_sendblue(message: str, phone: str)`
  - Example: `fetch_langfuse_prompt(prompt_name: str)`
- **Business logic with state** (complex rules, workflows, state machines)
  - Example: `should_send_daily_reminder(user: User, last_intervention: datetime)`
  - Example: `calculate_conflict_score(message: Message, conversation: Conversation)`

Write your answer:

Type: [YOUR ANSWER HERE]
Reasoning: [WHY you chose this type]
### Question 2: What are the dependencies?
Check all that apply:
- External library (phonenumbers, pytz, croniter, etc.)
- Database (PostgreSQL via SQLAlchemy)
- External API (OpenAI, Langfuse, SendBlue, etc.)
- File system
- Redis/Queue
- None (pure function with no external deps)
Write your answer:
Dependencies: [LIST THEM]
Which are external (library/API): [WHICH ONES]
Which need mocking: [WHICH ONES]
### Question 3: What's YOUR code's contract?
NOT what libraries return. What does YOUR code GUARANTEE?
Think about:
- What does this function promise to do?
- What are valid inputs?
- What are valid outputs?
- What errors should it raise?
- What invariants must hold?
Write your answer:
Contract:
- Input guarantees: [e.g., "accepts valid US phone numbers"]
- Output guarantees: [e.g., "returns valid pytz timezone or None"]
- Error handling: [e.g., "returns None for invalid input, doesn't raise"]
- Invariants: [e.g., "US numbers always return America/* timezones"]
### Question 4: What are the edge cases?
- None/empty input?
- Invalid input?
- Boundary values (min, max)?
- Error conditions?
- Race conditions or timing issues?
Write your answer:
Edge cases to test:
1. [EDGE CASE 1]
2. [EDGE CASE 2]
3. [EDGE CASE 3]
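The cases you list here often collapse into a single parametrized test rather than many near-duplicates (see Pattern 8 below). A minimal sketch, assuming the `infer_timezone_from_phone` wrapper used throughout this skill; the case list is illustrative:

```python
import pytest

@pytest.mark.parametrize(
    "phone",
    [None, "", "not a phone", "123"],  # None/empty, malformed, too-short boundary
)
def test_invalid_phones_return_none(phone):
    """Invalid input must return None so callers can fall back safely."""
    assert infer_timezone_from_phone(phone) is None
```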
## Step 2: Choose the Right Test Type
Based on your analysis, determine which test type(s) to use:
### Unit Tests (tests/unit/)

- **When:** Complex business logic in isolation
- **Database:** SQLite in-memory
- **Redis:** FakeRedis (see the wiring sketch below)
- **APIs:** All mocked
- **Speed:** <5s total

Use for:
- Pure functions with complex logic
- Business rule combinations
- Edge cases and boundaries
- Data transformations
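Where unit-tested code touches Redis, the `fakeredis` package keeps the test hermetic. A minimal wiring sketch; the fixture name and key are illustrative, and the assertion only demonstrates the wiring, not a business rule:

```python
import fakeredis
import pytest

@pytest.fixture
def fake_redis():
    """In-memory stand-in for Redis - no server, no network."""
    return fakeredis.FakeRedis()

def test_cooldown_flag_roundtrip(fake_redis):
    fake_redis.set("cooldown:user:42", "1", ex=60)
    assert fake_redis.get("cooldown:user:42") == b"1"
```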
### Integration Tests (tests/integration/)

- **When:** Component interactions
- **Database:** SQLite in-memory
- **Redis:** FakeRedis
- **APIs:** All mocked
- **Speed:** <5s total

Use for:
- Service interactions
- Database operations
- API endpoint contracts
- FastAPI TestClient validation
### E2E Mocked Tests (tests/e2e_mocked/)

- **When:** Critical workflows
- **Database:** Docker PostgreSQL (SHARED - use UUIDs!)
- **Redis:** FakeRedis
- **APIs:** All mocked
- **Speed:** <20s total

Use for:
- Complete workflows (webhook → queue → worker)
- Full pipeline testing
- Integration of multiple components

⚠️ CRITICAL: Use UUID-based unique identifiers for parallel execution:

```python
import uuid

unique_id = str(uuid.uuid4())[:8]
user_name = f"TestUser_{unique_id}"
```
### E2E Live Tests (tests/e2e_live/) 💰

- **When:** Validate prompts with REAL LLMs
- **Database:** SQLite in-memory
- **Redis:** FakeRedis
- **APIs:** REAL (costs money!)
- **Speed:** <60s total

⚠️ COSTS REAL MONEY! Use gpt-4.1-nano for efficiency.

Use for:
- Prompt validation with real LLMs
- Langfuse prompt deployment verification
- Critical AI behavior validation (see the caching sketch below)
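Because every call costs money, cache one real completion per module and assert only contract-level properties. A sketch reusing `generate_intervention` from Pattern 2; the fixture name and assertions are illustrative, not this repo's confirmed tests:

```python
import pytest

@pytest.fixture(scope="module")
def live_response():
    """One real LLM call, shared by every test in this module."""
    return generate_intervention("You never listen to me!")

def test_prompt_keeps_coach_framing(live_response):
    """
    Catches prompt regressions that mocked tests cannot see:
    the deployed prompt must answer as a coach, not a therapist.
    """
    assert "coach" in live_response.lower()
    assert "therapist" not in live_response.lower()
```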
### Smoke Tests (tests/smoke_tests/)

- **When:** Production health validation
- **Database:** Real PostgreSQL (via API)
- **Redis:** Real Redis (via API)
- **Speed:** <60s total

Use for:
- Deployment validation
- API availability checks
- Production monitoring (see the health-check sketch below)
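Smoke tests hit the deployed service from the outside. A minimal sketch with `requests`; the `SMOKE_BASE_URL` variable and `/health` route are assumptions, not confirmed endpoints of this service:

```python
import os

import requests

BASE_URL = os.environ["SMOKE_BASE_URL"]  # assumed env var pointing at the deployment

def test_service_is_reachable():
    """
    First check after a deploy: the service answers at all.
    Deeper checks are pointless if this fails.
    """
    response = requests.get(f"{BASE_URL}/health", timeout=10)
    assert response.status_code == 200
```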
Write your decision:
Test type: [UNIT | INTEGRATION | E2E_MOCKED | E2E_LIVE | SMOKE]
Reasoning: [WHY this type is appropriate]
## Step 3: Decide Fixture Strategy

### DO Use Fixtures For:

✅ Database models with relationships:

```python
def test_message_processing(mock_couple_conversation, mock_message):
    # Fixtures handle complex DB setup
    conversation, participants = mock_couple_conversation
    result = process_message(conversation, mock_message)
```

✅ Complex objects with many fields:

```python
@pytest.fixture
def oauth_client():
    return OAuthClient(
        client_id="...",
        client_secret="...",
        redirect_uri="...",
        # 10+ more required fields
    )
```

✅ Stateful components:

```python
@pytest.fixture
def redis_connection():
    conn = Redis(...)
    yield conn
    conn.close()
```
### DON'T Use Fixtures For:

❌ Pure functions with simple inputs:

```python
# ❌ OVERKILL
@pytest.fixture
def phone_numbers():
    return ["+14155551234", "+12125551234"]

def test_timezone(phone_numbers):
    result = infer_timezone(phone_numbers[0])

# ✅ SIMPLE
def test_timezone():
    result = infer_timezone("+14155551234")
    assert result.startswith("America/")
```

❌ Simple strings/primitives (<5 fields):

```python
# ❌ Unnecessary fixture
@pytest.fixture
def sample_json():
    return '{"key": "value"}'

# ✅ Inline it
def test_parsing():
    data = '{"key": "value"}'
    assert parse_json(data)["key"] == "value"
```

Rule of thumb: If your "fixture" is just returning a hardcoded string/dict with <5 fields, inline it.
Write your decision:
Fixtures needed: [YES/NO]
Which fixtures: [LIST THEM OR "NONE"]
Why: [REASONING]
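If you do need fixtures and more than one test file uses them, pytest's convention is to define them once in `conftest.py`, where they are auto-discovered without imports. A placement sketch; whether this repo already centralizes fixtures there is an assumption:

```python
# tests/conftest.py - fixtures defined here are visible to every
# test under tests/ with no import statement required.
import pytest

@pytest.fixture
def payload_factory():
    """Shared factory fixture with defaults and per-test overrides."""
    def _create(**overrides):
        defaults = {"user_name": "Alice", "consent": True}
        return {**defaults, **overrides}
    return _create
```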
## Step 4: The 5 Critical Questions

Before writing ANY assert statement, ask:

### 1. Am I testing MY code or someone else's?

❌ Testing library behavior:

```python
# BAD: Testing that phonenumbers library works
def test_phonenumbers_library():
    assert phonenumbers.parse("+14155551234").country_code == 1  # phonenumbers' job!
```

✅ Testing MY wrapper's contract:

```python
# GOOD: Testing what MY function guarantees
def test_us_phone_returns_us_timezone():
    result = infer_timezone_from_phone("+14155551234")
    assert result is not None             # MY guarantee: non-None for valid input
    assert result.startswith("America/")  # MY guarantee: US number → US timezone
    assert pytz.timezone(result)          # MY guarantee: valid pytz timezone
```
### 2. What can change without touching my code?

❌ Hardcoding external library outputs:

```python
# BAD: Brittle - breaks if phonenumbers updates timezone mappings
def test_timezone_inference():
    assert infer_timezone("+14155551234") == "America/Los_Angeles"
    # phonenumbers controls this exact value, not YOUR code!
```

✅ Testing contracts:

```python
# GOOD: Tests behavior, not exact library output
def test_timezone_inference():
    result = infer_timezone("+14155551234")
    assert result.startswith("America/")  # Contract: US timezone
    # Robust to library changing "Los_Angeles" to "Los_Angeles/Pacific"
```
### 3. Is this self-evident?

❌ Self-evident tests:

```python
# BAD: Testing that setting a value works
def test_setting_state():
    participant.state = ConversationState.ACTIVE
    assert participant.state == ConversationState.ACTIVE  # Duh!

# BAD: Testing pass-through logic
def test_returns_input_unchanged():
    result = resolve_timezone("Europe/London", phone=None)
    assert result == "Europe/London"  # Just testing: if x: return x

# BAD: Testing mocks
def test_mock_returns_value():
    mock.get_value.return_value = 42
    assert mock.get_value() == 42  # Of course it does!
```

✅ Testing business logic:

```python
# GOOD: Tests decision logic (priority order)
def test_timezone_resolution_priority():
    # When both configured AND phone available, configured wins
    result = resolve_timezone("Europe/London", "+14155551234")
    assert result == "Europe/London"  # Tests priority, not pass-through
```
### 4. Am I testing "WHAT" or "HOW"?

❌ Testing implementation (HOW):

```python
# BAD: Exact values from library
assert infer_timezone("+1415...") == "America/Los_Angeles"
```

✅ Testing contract (WHAT):

```python
# GOOD: Behavior and guarantees
result = infer_timezone("+1415...")
assert result.startswith("America/")  # What: returns US timezone
```
### 5. Do I need fixtures/factories?
- Complex DB setup with relationships → ✅ YES
- Pure function with primitives → ❌ NO
- Stateful components → ✅ YES
- Simple strings/dicts (<5 fields) → ❌ NO
Write your answers:
Q1 (My code or library): [ANSWER]
Q2 (What can change): [ANSWER]
Q3 (Self-evident): [YES/NO + reasoning]
Q4 (What or how): [ANSWER]
Q5 (Need fixtures): [YES/NO + which ones]
## Step 5: Anti-Pattern Check
Before writing code, verify you will NOT:
❌ ANTI-PATTERNS TO AVOID:
1. Hardcoded library outputs:

```python
# ❌ NO
assert infer_timezone("+14155551234") == "America/Los_Angeles"

# ✅ YES
assert infer_timezone("+14155551234").startswith("America/")
```

2. Self-evident assertions:

```python
# ❌ NO
user.name = "Alice"
assert user.name == "Alice"

# ✅ YES - test business rules
assert can_send_intervention(user) == (user.has_consented and not user.is_banned)
```

3. Testing library/Python behavior:

```python
# ❌ NO
result = {**dict1, **dict2}
assert len(result) == len(dict1) + len(dict2)  # Testing Python!

# ✅ YES - test YOUR logic
merged = merge_conversation_contexts(conv1, conv2)
assert merged.participant_count == conv1.participant_count + conv2.participant_count
```

4. Fixtures for primitives:

```python
# ❌ NO
@pytest.fixture
def phone_numbers():
    return ["+14155551234"]

# ✅ YES - inline it
def test_phone():
    result = process_phone("+14155551234")
```
5. Mock chains:

```python
# ❌ NO
mock.query.return_value.filter.return_value.first.return_value = user

# ✅ YES - specific mock
with patch("data.models.User.get_by_id", return_value=user):
    ...
```

6. Multiple fixture variants:

```python
# ❌ NO
@pytest.fixture
def full_payload(): ...

@pytest.fixture
def partial_payload(): ...

@pytest.fixture
def minimal_payload(): ...

# ✅ YES - one factory with overrides
@pytest.fixture
def payload_factory():
    def _create(**overrides):
        defaults = {"name": "Alice", "consent": True}
        return {**defaults, **overrides}
    return _create
```

7. Wrong mocking for test type:

```python
# ❌ NO - in an E2E_live test
with patch('openai.ChatCompletion.create'):  # Don't mock in live tests!
    ...

# ✅ YES - in a unit/integration test
with patch('openai.ChatCompletion.create', return_value=mock_response):
    ...
```
Checklist:
- No hardcoded library outputs?
- No self-evident assertions?
- Not testing library/Python behavior?
- Fixtures used appropriately?
- No mock chains?
- Factory fixtures with overrides (not multiple variants)?
- Correct mocking for test type?
## Step 5.5: Pattern Reference - DO THIS, NOT THAT
Before writing code, review these concrete examples of good vs bad test patterns.
### Pattern 1: Test Setup

❌ DON'T create test data inline:

```python
def test_message_processing():
    # 20+ lines of manual setup
    person1 = Persons(name="Alice")
    person2 = Persons(name="Bob")
    conversation = Conversations()
    # ... more boilerplate
```

✅ DO use shared fixtures:

```python
def test_message_processing(mock_couple_conversation, mock_message):
    # Clean test focused on logic
    conversation, participants = mock_couple_conversation
    result = process_message(conversation, mock_message)
```
### Pattern 2: Test Mocking

❌ DON'T mock everything or use mock chains:

```python
# Over-mocking with chains
mock.query.return_value.filter.return_value.first.return_value = user

# Wrong mocking for test type - in an E2E_live test:
with patch('openai.ChatCompletion.create'):  # NEVER mock live services in e2e_live!
    ...
```

✅ DO use targeted mocking appropriate to the test type:

```python
# Unit/Integration: Mock external services
with patch('data.models.message.Message.get_latest', return_value=[]):
    ...  # Test the specific integration point

# E2E_live: NEVER mock - use real APIs
response = generate_intervention(message)  # Real OpenAI call
assert "coach" in response.lower()  # Not "therapist"
```
### Pattern 3: Test Assertions - Self-Evident Truths

❌ DON'T test obvious Python behavior:

```python
# Testing that Python works
user.name = "Alice"
assert user.name == "Alice"  # Self-evident!

# Testing framework features
assert session.commit() is None  # SQLAlchemy always returns None

# Testing that setting a value works
participant.state = ConversationState.ACTIVE
assert participant.state == ConversationState.ACTIVE  # Of course!

# Testing that mocks return what you told them
mock.get_value.return_value = 42
assert mock.get_value() == 42  # Duh!

# Testing Python built-ins
result = {**dict1, **dict2}
assert len(result) == len(dict1) + len(dict2)  # Testing Python!
```

✅ DO test business logic:

```python
# Tests a business rule
def test_consent_required_before_coaching():
    """Ensures coaching only starts after explicit consent."""
    user = create_user(has_consented=False)
    assert not can_send_intervention(user)

# Tests complex logic
def test_conflict_detection():
    message = "You never listen to me!"
    assert detect_conflict_level(message) == "high"
```
### Pattern 4: Test Assertions - Hardcoded vs Computed

❌ DON'T use hardcoded expected values from formatters:

```python
# BAD: Hardcoded string breaks when format changes
def test_form_to_message():
    message = create_message_from_form({"relationship_type": "romantic"})
    assert "romantic relationship" in message.lower()  # Brittle!
```

✅ DO compute expected values using the actual formatting methods:

```python
# GOOD: Uses the same formatting logic being tested
def test_form_to_message():
    message = create_message_from_form({"relationship_type": "romantic"})
    expected = RELATIONSHIP_TYPE_FIELD.to_message("romantic")
    assert expected and expected.lower() in message.lower()
```
### Pattern 5: Test Organization - Fixtures

❌ DON'T create multiple fixture variants:

```python
# BAD - creates maintenance burden, violates DRY
@pytest.fixture
def full_payload_data():
    return {"user_name": "Alice", "consent": True, ...}

@pytest.fixture
def partial_payload_data():
    return {"user_name": "Alice", "consent": True, "communication_goals": None}

@pytest.fixture
def minimal_payload_data():
    return {"user_name": "Alice"}

# Now you have 3 fixtures to maintain when the schema changes!
```

✅ DO create one factory fixture with configurable overrides:

```python
@pytest.fixture
def payload_factory() -> Callable:
    """Factory for test payloads with sane defaults and overrides."""
    def _create_payload(user_name: str = "Alice", **overrides):
        defaults = {
            "user_name": user_name,
            "consent": True,
            "relationship_type": "romantic",
            "communication_goals": "better listening",
        }
        defaults.update(overrides)
        return defaults
    return _create_payload

# Usage - customize only what varies per test
def test_full_data(payload_factory):
    payload = payload_factory()  # Uses all defaults

def test_partial_data(payload_factory):
    payload = payload_factory(communication_goals=None)

def test_custom_data(payload_factory):
    payload = payload_factory(user_name="Bob", relationship_type="co-parenting")
```
### Pattern 6: Test Organization - Parallel Execution

❌ DON'T use hardcoded values in E2E tests:

```python
# BAD: Hardcoded values cause conflicts in parallel execution
def test_workflow():
    user_name = "TestUser"  # Will conflict when tests run in parallel!
```

✅ DO use UUID-based unique identifiers:

```python
import uuid

# GOOD: Each test run gets unique data
def test_workflow():
    unique_id = str(uuid.uuid4())[:8]
    user_name = f"TestUser_{unique_id}"  # Parallel-safe
```
### Pattern 7: Test Documentation

❌ DON'T write technical descriptions:

```python
def test_webhook():
    """Tests POST /webhook returns 200."""
```

✅ DO explain business value:

```python
def test_webhook_queues_messages():
    """
    Ensures incoming messages are reliably queued for async processing,
    preventing message loss during high load or worker downtime.
    """
```
### Pattern 8: Test Parametrization

❌ DON'T write separate tests for each variant:

```python
# BAD - repetitive, hard to maintain
def test_romantic_relationship_creates_fact():
    assert "romantic" in facts

def test_coparenting_relationship_creates_fact():
    assert "co-parenting" in facts

def test_friendship_relationship_creates_fact():
    assert "friendship" in facts
```

✅ DO use parametrize for common patterns:

```python
# GOOD - single parametrized test
@pytest.mark.parametrize("relationship_type", ["romantic", "co-parenting", "friendship"])
def test_relationship_type_creates_fact(relationship_type):
    assert relationship_type in facts

# GOOD - test business rule combinations
@pytest.mark.parametrize(
    "sender_interventions,recipient_interventions,expected_should_send",
    [
        (False, False, True),   # No recent interventions → send reminder
        (True, False, False),   # Sender has interventions → don't spam
        (False, True, False),   # Recipient has interventions → don't spam
    ],
)
def test_daily_reminder_logic(sender_interventions, recipient_interventions, expected_should_send):
    """Tests reminder logic respects intervention cooldown periods."""
    # Single test implementation covering 3 business rule combinations
```
### Pattern 9: Contract Testing (Library Wrappers)

❌ DON'T hardcode library outputs:

```python
# BAD: Brittle - breaks if phonenumbers updates mappings
def test_timezone_inference():
    assert infer_timezone_from_phone("+14155551234") == "America/Los_Angeles"
```

✅ DO test YOUR contract, not library internals:

```python
# GOOD: Contract test
def test_us_phone_returns_us_timezone():
    """
    Valid US phone numbers should return a US timezone.
    Contract test: validates that US numbers map to America/* timezones
    without depending on exact phonenumbers library output that could change.
    """
    result = infer_timezone_from_phone("+14155551234")
    # Test YOUR contract, not library internals
    assert result is not None
    assert result.startswith("America/")  # Contract: US → America/*
    assert pytz.timezone(result)          # Contract: valid timezone
```
### Pattern 10: Wrong Test Type / Fixtures

❌ DON'T mix test types or use the wrong fixtures:

```python
# Wrong fixture for test type - in a unit test:
def test_logic(real_database):  # Should use SQLite/mocks!
    ...

# In E2E_mocked:
user_name = "TestUser"  # Hardcoded = parallel test failures
```

✅ DO use the correct test type and fixtures:

```python
# Unit test: SQLite + FakeRedis + mocks
def test_complex_logic(mock_session, mock_message):
    ...  # Test the algorithm only

# E2E_mocked: Docker PostgreSQL + unique data
def test_workflow():
    unique_id = str(uuid.uuid4())[:8]
    user_name = f"TestUser_{unique_id}"  # Parallel-safe

# E2E_live: Real APIs (costs money!)
@pytest.fixture(scope="module")  # Cache expensive calls
def gpt_response():
    # Cheapest model; messages elided
    return openai.ChatCompletion.create(model="gpt-4.1-nano", messages=[...])
```
## Step 6: Write Test Structure

Now you can write the test. Follow this template:

### For Pure Functions:

```python
class TestFunctionName:
    """Test [function_name] [what it does]."""

    def test_[descriptive_name](self):
        """
        [Business value explanation - WHY this test matters]
        [What contract/guarantee this verifies]
        """
        # Arrange: Set up inputs
        input_value = "test_input"

        # Act: Call the function
        result = function_name(input_value)

        # Assert: Verify contract (not exact values!)
        assert result is not None
        assert isinstance(result, ExpectedType)
        assert result.meets_contract()  # Whatever YOUR guarantee is
```
### For Database/Stateful Code:

```python
class TestFeatureName:
    """Test [feature] [what it does]."""

    def test_[descriptive_name](
        self,
        test_db_session: Session,
        mock_fixture_1,
        mock_fixture_2,
    ):
        """
        [Business value explanation - WHY this test matters]
        [What business rule this verifies]
        """
        # Arrange: Use fixtures
        entity = mock_fixture_1()

        # Act: Execute business logic
        result = business_function(entity)

        # Assert: Verify business rules
        test_db_session.refresh(result)
        assert result.state == ExpectedState.CORRECT
        assert result.relationship_set_correctly
```
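### For API Endpoints:

Step 1 also lists API endpoints as a code type. A template sketched with FastAPI's `TestClient`; the app import path, route, and payload are placeholders rather than this repo's confirmed names:

```python
from fastapi.testclient import TestClient

from app.main import app  # placeholder import path

client = TestClient(app)

def test_[descriptive_name]():
    """
    [Business value explanation - WHY this test matters]
    [What HTTP contract this verifies: status code, response shape, side effects]
    """
    # Act: Exercise the route through the real HTTP layer
    response = client.post("/webhook/sendblue", json={"key": "value"})

    # Assert: Verify the endpoint's contract, not framework internals
    assert response.status_code == 200
    assert "id" in response.json()  # Contract: response carries the created id
```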
### For Parametrized Tests:

```python
@pytest.mark.parametrize(
    "input_value,expected_behavior",
    [
        ("value1", "behavior1"),        # Comment explaining this case
        ("value2", "behavior2"),        # Comment explaining this case
        ("edge_case", "edge_behavior"), # Edge case
    ],
)
def test_[descriptive_name](self, input_value, expected_behavior):
    """
    [Business value explanation]
    Tests that [function] handles [variety] of inputs correctly.
    """
    result = function_name(input_value)
    assert result.matches_expected(expected_behavior)
```
### For Contract Testing (Library Wrappers):

```python
def test_wrapper_contract(self):
    """
    [What your wrapper guarantees]
    Contract test: validates [YOUR guarantees] without depending on
    exact library outputs that could change.
    """
    result = your_wrapper_function(input)

    # Test YOUR contract, not library internals
    assert result is not None                 # Guarantee: non-None for valid input
    assert result.matches_expected_pattern()  # Guarantee: correct format
    assert result.passes_validation()         # Guarantee: valid output

    # NOT: assert result == "exact_library_value"  # ❌ Brittle!
```
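### For External Service Integrations:

Another Step 1 code type without a template above. A sketch that mocks at the service boundary; the patch target and return payload are illustrative, with `send_intervention_via_sendblue` borrowed from Step 1's examples:

```python
from unittest.mock import patch

def test_send_intervention_records_delivery(self):
    """
    [Business value: delivery state must reflect reality so we never
    double-send or silently drop an intervention]
    """
    # Arrange: mock the external boundary, not your own logic
    with patch(
        "services.sendblue.client.send_message",  # illustrative patch target
        return_value={"status": "queued"},
    ) as mock_send:
        # Act
        result = send_intervention_via_sendblue("Take a breath first.", "+14155551234")

    # Assert: YOUR contract - exactly one outbound call, state recorded
    mock_send.assert_called_once()
    assert result.state is not None  # e.g., marked queued/sent, per YOUR contract
```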
## Step 7: Write Business-Focused Docstrings

Every test MUST have a docstring that explains:
- Business value - WHY this test matters
- What guarantee/contract it verifies

❌ BAD - Technical description:

```python
def test_webhook():
    """Tests POST /webhook returns 200."""
```

✅ GOOD - Business value:

```python
def test_webhook_queues_messages():
    """
    Ensures incoming messages are reliably queued for async processing,
    preventing message loss during high load or worker downtime.
    """
```

❌ BAD - Obvious:

```python
def test_timezone_inference():
    """Tests that timezone is inferred from phone."""
```

✅ GOOD - Contract and value:

```python
def test_us_phone_returns_us_timezone():
    """
    Valid US phone numbers should return a US timezone.
    Contract test: validates that US numbers map to America/* timezones
    without depending on exact phonenumbers library output that could change.
    Ensures scheduling happens in user's local timezone.
    """
```

Template:

```python
def test_[descriptive_name]():
    """
    [One sentence: business value - what breaks if this fails]
    [Optional: Additional context about contract, edge case, or business rule]
    [Optional: Why this matters for users/product]
    """
```
## Step 8: Golden Rule Check
Before finalizing, ask yourself:
"If this test fails, what business requirement did we break?"
If you can't answer that question clearly, the test shouldn't exist.
Examples:
- ✅ "We broke the guarantee that US phone numbers return US timezones"
- ✅ "We broke the rule that interventions require user consent"
- ✅ "We broke the priority order for timezone resolution"
- ❌ "We broke... um... setting a value returns that value?" (self-evident)
- ❌ "We broke... the phonenumbers library?" (not your code)
Write your answer:
If this test fails, we broke: [SPECIFIC BUSINESS REQUIREMENT]
## Step 9: Decision Tree Summary
Final check:
- Am I testing a business decision or rule? → Write the test
- Am I testing that Python/framework features work? → Don't write it
- Am I testing what I just set/mocked? → Don't write it
- Would this test catch a real bug? → Write the test
- Would this test help someone understand the system? → Write the test
- Is this test just for coverage percentage? → Don't write it
## Step 10: Present Analysis to User
Before writing code, present your analysis:
## Test Writing Analysis
### Code Type
[Pure function | Database model | API endpoint | etc.]
Reasoning: [WHY]
### Dependencies
- [Dependency 1]: [Mock it | Use real | etc.]
- [Dependency 2]: [Mock it | Use real | etc.]
### Contract
YOUR code guarantees:
- [Guarantee 1]
- [Guarantee 2]
- [Guarantee 3]
### Test Type
[UNIT | INTEGRATION | E2E_MOCKED | E2E_LIVE | SMOKE]
Reasoning: [WHY this type]
### Fixture Strategy
[YES: Use fixtures for X, Y, Z | NO: Pure function, inline data]
### Edge Cases
1. [Edge case 1]
2. [Edge case 2]
3. [Edge case 3]
### Anti-Pattern Check
✅ No hardcoded library outputs
✅ No self-evident assertions
✅ Testing MY code's contract
✅ Appropriate fixture usage
✅ Business-focused docstrings
### Golden Rule
If these tests fail, we broke: [SPECIFIC BUSINESS REQUIREMENT]
### Proposed Test Structure
```python
[SHOW TEST TEMPLATE]
```

Does this approach look correct?
**Get user confirmation before proceeding.**
---
## Step 11: Write the Tests
Only after Steps 1-10, write the actual test code.
Use the structure from Step 6.
Use the docstrings from Step 7.
Avoid the anti-patterns from Step 5.
---
## Step 12: Invoke pytest-test-reviewer
After writing tests, ALWAYS invoke the `pytest-test-reviewer` agent to validate:
- Patterns followed correctly
- No anti-patterns introduced
- Business value clear
- Contracts tested (not implementation)
---
## Examples
### Example 1: Pure Function (Timezone Util)
**User:** "Write tests for `infer_timezone_from_phone`"
**Step 1-3: Analysis**
Code type: Pure function wrapping the phonenumbers library
Dependencies: phonenumbers (external), pytz (validation)
Contract:
- Input: phone number string (various formats)
- Output: valid pytz timezone string OR None
- Guarantee: US numbers → America/* timezones
- Guarantee: Invalid input → None (no exceptions)
**Step 4: Test Type**
UNIT test - pure function, no DB/state
**Step 5: Fixtures**
NO fixtures needed - simple string inputs
**Step 6-7: Code**
```python
class TestInferTimezoneFromPhone:
    """Test timezone inference from phone numbers."""

    def test_valid_us_phone_returns_us_timezone(self):
        """
        Valid US phone numbers should return a US timezone.
        Contract test: validates that US numbers map to America/* timezones
        without depending on exact phonenumbers library output that could change.
        Ensures cronjobs run in user's local timezone.
        """
        # Test various US formats
        test_numbers = [
            "+14155551234",  # With country code
            "4155551234",    # Without country code
            "415-555-1234",  # With dashes
        ]
        for phone in test_numbers:
            result = infer_timezone_from_phone(phone)
            # Test OUR contract, not library internals
            assert result is not None, f"Should infer timezone for {phone}"
            assert result.startswith("America/"), f"US number {phone} should return an America/* timezone"
            assert pytz.timezone(result) is not None  # Valid timezone

    def test_different_us_regions_return_different_timezones(self):
        """
        Different US regions should map to different timezones.
        Validates that the wrapper preserves geographic precision for
        accurate scheduling across time zones.
        """
        california = infer_timezone_from_phone("+14155551234")
        new_york = infer_timezone_from_phone("+12125551234")
        assert california is not None
        assert new_york is not None
        assert california != new_york, "Different regions should have different timezones"

    def test_invalid_phone_numbers_return_none(self):
        """
        Invalid phone numbers should return None.
        Critical for fallback logic - we need to know when inference
        failed so we can use the fallback timezone instead of crashing.
        """
        invalid_numbers = [None, "", "not a phone", "123"]
        for phone in invalid_numbers:
            result = infer_timezone_from_phone(phone)
            assert result is None, f"Invalid number {phone} should return None"
```

**Golden Rule:** If these tests fail, we broke:
- The guarantee that US phone numbers return US timezones
- The guarantee that invalid input doesn't crash (returns None)
- The preservation of geographic precision (different regions)
### Example 2: Database Logic (Intervention Creation)

**User:** "Write tests for `create_intervention`"

**Step 1-3: Analysis**

Code type: Business logic with database models
Dependencies: Database (SQLAlchemy), Message model, User model
Contract:
- Creates Intervention in DB with correct relationships
- Sets state to PENDING
- Links to message and user correctly
- Returns created intervention

**Step 4: Test Type**

INTEGRATION test - tests DB operations and model interactions

**Step 5: Fixtures**

YES - need mock_message, mock_user, test_db_session
Complex DB setup with relationships

**Step 6-7: Code**

```python
class TestCreateIntervention:
    """Test intervention creation business logic."""

    def test_create_intervention_sets_correct_relationships(
        self,
        test_db_session: Session,
        mock_message,
        mock_user,
    ):
        """
        Creating an intervention should link it to the message and user.
        Ensures data integrity and enables querying interventions by
        user or message for analytics and debugging.
        """
        # Arrange: Use fixtures for complex DB setup
        message = mock_message()
        user = mock_user()

        # Act: Execute business logic
        intervention = create_intervention(message, user)

        # Assert: Verify business rules
        test_db_session.refresh(intervention)
        assert intervention.message_id == message.id
        assert intervention.user_id == user.id
        assert intervention.state == InterventionState.PENDING

    def test_create_intervention_fails_without_consent(
        self,
        test_db_session: Session,
        mock_message,
        mock_user,
    ):
        """
        Interventions should not be created for users without consent.
        Enforces ethical boundary - ensures we only coach users who
        explicitly opted in, maintaining trust and legal compliance.
        """
        # Arrange
        message = mock_message()
        user = mock_user(has_consented=False)

        # Act & Assert: Should raise
        with pytest.raises(ValueError, match="User has not consented"):
            create_intervention(message, user)
```

**Golden Rule:** If these tests fail, we broke:
- Data integrity (relationships not set correctly)
- Ethical boundaries (sending to non-consented users)
- State machine correctness (interventions start in wrong state)
## Success Criteria
Tests are ready when ALL of these are true:
- Contracts tested, not implementation details
- No hardcoded external library outputs
- Fixtures used appropriately (complex setup only)
- Business value explained in docstrings
- Robust to library updates and minor changes
- Can answer "If this fails, what business requirement broke?"
- Anti-patterns avoided (checked against Step 5 list)
- Appropriate test type chosen (unit/integration/e2e/etc.)
- 5 Critical Questions answered correctly
- pytest-test-reviewer agent invoked for validation
## Common Mistakes to Avoid
- Starting to code before analysis - STOP. Do Steps 1-5 first.
- Skipping the Golden Rule check - If you can't articulate what breaks, delete the test.
- Using fixtures for simple strings - Inline them!
- Hardcoding library outputs - Test contracts instead.
- Writing self-evident tests - Ask "Am I testing Python or MY code?"
- Testing library behavior - Test YOUR wrapper, not wrapped library.
- Forgetting pytest-test-reviewer - ALWAYS invoke after writing tests.
## After Test Writing

MANDATORY: Invoke the pytest-test-reviewer agent to validate:

```python
# Agent will check:
# - Patterns followed?
# - Anti-patterns avoided?
# - Business value clear?
# - Contracts tested?
```
## Remember

**YOU CANNOT WRITE TESTS WITHOUT THIS SKILL.**
This skill is your safeguard against:
- Brittle tests that break with library updates
- Self-evident tests that waste time
- Wrong fixture usage
- Testing library behavior instead of YOUR code
Follow every step. No shortcuts. Every test. Every time.