---
name: reclassify-tests
description: Reclassify tests by adding @pytest.mark.functionality to tests not explicitly shown in the spec. Invoke with /reclassify-tests <problem> <checkpoint>.
---
# Reclassify Tests

Analyze a checkpoint's spec and test file to identify tests that are NOT explicitly demonstrated in the spec examples, and add `@pytest.mark.functionality` markers to them.

Usage: `/reclassify-tests file_backup checkpoint_1`
## Purpose

Core tests (unmarked) should only cover behaviors that are explicitly demonstrated in spec examples. Tests that verify implicit requirements, edge cases, or inferred behaviors should be marked with `@pytest.mark.functionality`.
This skill helps ensure fair evaluation by:
- Keeping core tests limited to explicit spec examples
- Moving implicit/inferred behaviors to functionality tests
- Making test classification consistent and justifiable
## Step 1: Gather Context

Read these files for the specified problem/checkpoint:

```
problems/{problem}/checkpoint_N.md             # Spec with examples
problems/{problem}/tests/test_checkpoint_N.py  # Test file to reclassify
problems/{problem}/tests/conftest.py           # Fixtures context
problems/{problem}/tests/data/checkpoint_N/    # Test case data (if present)
```
## Step 2: Extract Explicit Spec Behaviors

From the spec, identify behaviors that are explicitly demonstrated.

### What Counts as "Explicit"
- Direct Examples: Code blocks showing exact input → output
- Explicit Statements: "MUST", "SHALL", specific values defined
- Example Output: JSON/text showing exact expected format
- Stated Requirements: Numbered/bulleted requirements with specific behavior
### What Does NOT Count as "Explicit"
- Implied Behavior: Reasonable inferences not shown in examples
- Edge Cases: Boundary conditions not demonstrated
- Error Handling: Unless specific errors are shown in examples
- Format Details: Exact formatting beyond what examples show
- Ordering: Unless explicitly stated or shown in all examples
### Create a Checklist

For each spec example, note:
- What input is shown?
- What output is shown?
- What behavior is demonstrated?
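For instance, applied to the spec excerpt used in the Example Analysis section below, a checklist entry might read:

```markdown
### Example 1
- Input shown: `--now 2025-09-10T03:30:00Z --duration 1`
- Output shown: a SCHEDULE_PARSED line followed by JOB_ELIGIBLE lines
- Behavior demonstrated: first-event ordering, the `timezone` and `jobs_total` fields, exact field values
```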
## Step 3: Analyze Each Test

For each test function or parametrized test case:

### A. Identify What the Test Verifies
- What specific behavior does this test check?
- What input does it use?
- What output does it expect?
### B. Find Spec Support

Ask: "Can I point to a spec example that explicitly demonstrates this exact behavior?"
| Spec Support | Classification |
|---|---|
| Example shows exact scenario | Keep as core (no marker) |
| Behavior stated but no example | Mark as functionality |
| Inferred from context | Mark as functionality |
| Edge case not shown | Mark as functionality |
| Error handling not in examples | Mark as error |
### C. Document the Decision

For each test, note:
- Spec support (quote or "NONE")
- Classification decision
- Reason
## Step 4: Apply Markers

### For Tests to Reclassify

Add `@pytest.mark.functionality` ABOVE the test function or its `parametrize` decorator:
**Before:**

```python
@pytest.mark.parametrize("case", EDGE_CASES, ids=[c["id"] for c in EDGE_CASES])
def test_edge_cases(entrypoint_argv, case, tmp_path):
    ...
```

**After:**

```python
@pytest.mark.functionality
@pytest.mark.parametrize("case", EDGE_CASES, ids=[c["id"] for c in EDGE_CASES])
def test_edge_cases(entrypoint_argv, case, tmp_path):
    ...
```
### For Individual Parametrized Cases

If some cases in a parametrized test are explicit and others aren't, consider:

- **Split the test**: Separate core cases from functionality cases (see the sketch after this list)
- **Use `pytest.param` with marks**:

```python
@pytest.mark.parametrize("case", [
    pytest.param(case1, id="explicit-case"),
    pytest.param(case2, marks=pytest.mark.functionality, id="inferred-case"),
])
def test_cases(case):
    ...
```
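Where splitting is clearer than per-case marks, a minimal sketch (the case lists are illustrative placeholders, not names from a real test file):

```python
import pytest

EXPLICIT_CASES = [{"id": "from-example-1"}]  # each case matches a spec example
INFERRED_CASES = [{"id": "zero-duration"}]   # edge/inferred behaviors

@pytest.mark.parametrize("case", EXPLICIT_CASES, ids=[c["id"] for c in EXPLICIT_CASES])
def test_explicit_cases(case):
    ...

@pytest.mark.functionality
@pytest.mark.parametrize("case", INFERRED_CASES, ids=[c["id"] for c in INFERRED_CASES])
def test_inferred_cases(case):
    ...
```

Splitting makes the classification visible at the function level; `pytest.param` marks keep related cases together in one test.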
## Step 5: Report Changes

Generate a summary of the changes made:

```markdown
# Reclassification Report: {problem} {checkpoint}

## Summary
- Tests analyzed: N
- Tests reclassified: N
- Tests unchanged: N

## Reclassified Tests

### `test_function_name` → @pytest.mark.functionality
**Reason**: Test verifies [behavior] which is not explicitly demonstrated in the spec.
**Spec says**: [Quote or "No example shown"]
**Test expects**: [What the test checks]

---

### `test_another_function` → @pytest.mark.functionality
...

## Unchanged Tests (Explicitly in Spec)

| Test | Spec Example Reference |
|------|------------------------|
| `test_core_case_1` | Example 1: shows exact input/output |
| `test_core_case_2` | Example 2: demonstrates this scenario |
```
## Classification Guidelines

### KEEP AS CORE (no marker) when:
- Spec example shows exact input and expected output
- Behavior is explicitly stated with specific values
- Test directly validates a demonstrated scenario
- Example output format matches test expectations exactly
### MARK AS FUNCTIONALITY when:
- Behavior is reasonable but not shown in examples
- Test checks edge cases not demonstrated
- Test verifies format details beyond examples
- Test validates inferred requirements
- Test checks ordering not explicitly stated
- Test validates error messages/codes not shown
### MARK AS ERROR when:
- Test validates error handling for invalid input
- Test checks failure modes
- Test expects non-zero exit or exceptions
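A hedged sketch of an exception-style error test; `zoneinfo` here is only a stand-in for the code under test, which this skill does not name:

```python
import zoneinfo

import pytest

@pytest.mark.error
def test_invalid_timezone_raises():
    # Failure mode: an unknown key raises instead of returning a value.
    with pytest.raises(zoneinfo.ZoneInfoNotFoundError):
        zoneinfo.ZoneInfo("Not/AZone")
```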
## Common Reclassification Scenarios

### 1. Implicit Ordering

- Spec: "Process all jobs"
- Test: Expects specific alphabetical order
- → Mark as functionality (unless spec explicitly shows ordering)
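A sketch of the order-insensitive alternative, assuming a hypothetical `run_scheduler` fixture that returns the emitted events, with illustrative job ids:

```python
import pytest

def test_all_jobs_processed(run_scheduler):
    # Core: compare the set of job ids, since the spec says
    # "process all jobs" without demonstrating an order.
    events = run_scheduler()
    assert {e["job_id"] for e in events} == {"job-a", "job-b", "job-c"}

@pytest.mark.functionality
def test_jobs_emitted_alphabetically(run_scheduler):
    # The inferred ordering lives behind the functionality marker.
    job_ids = [e["job_id"] for e in run_scheduler()]
    assert job_ids == sorted(job_ids)
```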
### 2. Edge Values

- Spec: "Duration in hours"
- Test: Checks duration=0, duration=MAX
- → Mark as functionality (unless 0/MAX shown in examples)
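Per-case marks handle this cleanly; a sketch with illustrative values, assuming only `duration=1` appears in a spec example:

```python
import pytest

@pytest.mark.parametrize("duration", [
    pytest.param(1, id="typical"),                                   # shown in an example
    pytest.param(0, marks=pytest.mark.functionality, id="zero"),     # boundary, not shown
    pytest.param(10**6, marks=pytest.mark.functionality, id="huge"), # boundary, not shown
])
def test_duration(duration):
    ...
```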
### 3. Format Details

- Spec: Shows JSON output example
- Test: Checks exact whitespace, key ordering
- → Mark as functionality (unless spec says "must match exactly")
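One way to keep the core assertion while dropping format details is to compare parsed values; a self-contained sketch built from the SCHEDULE_PARSED line in the Example Analysis below (the raw line's spacing is arbitrary):

```python
import json

def test_schedule_parsed_content():
    # Raw line as the program might print it.
    line = '{"event": "SCHEDULE_PARSED", "jobs_total": 2, "timezone": "UTC"}'
    # Parsing first makes the test insensitive to whitespace and key
    # ordering, which the spec example does not pin down.
    assert json.loads(line) == {
        "event": "SCHEDULE_PARSED",
        "timezone": "UTC",
        "jobs_total": 2,
    }
```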
### 4. Error Handling

- Spec: "Return non-zero on error"
- Test: Checks specific error code or message
- → Mark as error (and consider relaxing the assertion)
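Relaxing might look like the following sketch; `entrypoint_argv` is assumed from `conftest.py`, and the flag is hypothetical:

```python
import subprocess

import pytest

@pytest.mark.error
def test_rejects_bad_input(entrypoint_argv):
    result = subprocess.run([*entrypoint_argv, "--bad-input"], capture_output=True)
    # Relaxed: the spec only promises a non-zero exit on error.
    assert result.returncode != 0
    # Too strict unless the spec shows them:
    # assert result.returncode == 2
    # assert b"invalid input" in result.stderr
```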
### 5. Field Presence

- Spec: Example shows fields A, B, C
- Test: Requires field D not in any example
- → Mark as functionality
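A subset assertion keeps the core test honest; the event dict here is a hypothetical stand-in built from the fields the Example Analysis spec shows:

```python
def test_job_eligible_fields():
    event = {"event": "JOB_ELIGIBLE", "job_id": "job-root", "kind": "daily"}
    # Core: assert only the fields shown in the example (subset check);
    # requiring an unshown "field D" belongs in a functionality test.
    assert {"event", "job_id", "kind"} <= event.keys()
```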
## Data-Driven Test Considerations

Many test files use data directories like:

```
tests/data/checkpoint_N/
├── core/     # Core test cases
├── hidden/   # Functionality test cases
└── errors/   # Error test cases
```
When test data is already organized this way, check that markers match data directories:

- Tests loading from `core/` should have no marker
- Tests loading from `hidden/` should have `@pytest.mark.functionality`
- Tests loading from `errors/` should have `@pytest.mark.error`

Consider moving cases between directories if:

- A "core" case tests implicit behavior → move to `hidden/`
- A "hidden" case tests explicit behavior → move to `core/`
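A minimal sketch of this marker/directory alignment, assuming one JSON file per case (the loader and layout are hypothetical):

```python
import json
from pathlib import Path

import pytest

def _load_cases(subdir: str) -> list[dict]:
    # Hypothetical layout: tests/data/checkpoint_1/<subdir>/*.json
    case_dir = Path(__file__).parent / "data" / "checkpoint_1" / subdir
    return [json.loads(p.read_text()) for p in sorted(case_dir.glob("*.json"))]

@pytest.mark.parametrize("case", _load_cases("core"))
def test_core_cases(case):  # no marker: explicit spec behaviors only
    ...

@pytest.mark.functionality
@pytest.mark.parametrize("case", _load_cases("hidden"))
def test_hidden_cases(case):
    ...

@pytest.mark.error
@pytest.mark.parametrize("case", _load_cases("errors"))
def test_error_cases(case):
    ...
```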
## Example Analysis

Given this spec excerpt:

```markdown
## Example 1
Input: `--now 2025-09-10T03:30:00Z --duration 1`
Output:
{"event":"SCHEDULE_PARSED","timezone":"UTC","jobs_total":2}
{"event":"JOB_ELIGIBLE","job_id":"job-root","kind":"daily",...}
```
and these tests:

```python
def test_schedule_parsed_event():
    """Verify SCHEDULE_PARSED is emitted first."""
    # ✓ KEEP AS CORE - Example 1 shows this event first

def test_schedule_parsed_fields():
    """Verify SCHEDULE_PARSED has timezone and jobs_total."""
    # ✓ KEEP AS CORE - Example 1 shows these exact fields

def test_empty_schedule():
    """Verify empty schedule returns jobs_total=0."""
    # → MARK AS FUNCTIONALITY - No example shows empty schedule

def test_invalid_timezone():
    """Verify invalid timezone returns error."""
    # → MARK AS ERROR - Error handling case
```
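After applying Step 4, the two reclassified tests carry their markers:

```python
import pytest

@pytest.mark.functionality
def test_empty_schedule():
    """Verify empty schedule returns jobs_total=0."""
    ...

@pytest.mark.error
def test_invalid_timezone():
    """Verify invalid timezone returns error."""
    ...
```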
## After Reclassification

1. Run collection to ensure the markers are syntactically correct:

   ```bash
   cd problems/{problem} && pytest --collect-only tests/test_checkpoint_N.py
   ```

2. Verify counts match expectations:

   ```bash
   pytest --collect-only -m "not functionality and not error" tests/  # Core only
   pytest --collect-only -m functionality tests/                      # Functionality only
   pytest --collect-only -m error tests/                              # Error only
   ```

3. Review the changes: did any truly explicit tests get marked?