---
name: ds-plan
version: 1
description: REQUIRED Phase 2 of /ds workflow. Profiles data and creates analysis task breakdown.
---
Announce: "Using ds-plan (Phase 2) to profile data and create task breakdown."
# Planning (Data Profiling + Task Breakdown)
Profile the data and create an analysis plan based on the spec.
Requires .claude/SPEC.md from /ds-brainstorm first.
SPEC MUST EXIST BEFORE PLANNING. This is not negotiable.
Before exploring data or creating tasks, you MUST have:
- `.claude/SPEC.md` with objectives and constraints
- Clear success criteria
- User-approved spec
If .claude/SPEC.md doesn't exist, run /ds-brainstorm first.
## Rationalization Table - STOP If You Think:
| Excuse | Reality | Do Instead |
|---|---|---|
| "Data looks clean, profiling unnecessary" | Your data is never clean | PROFILE to discover issues |
| "I can profile as I go" | You'll miss systemic issues | PROFILE comprehensively NOW |
| "Quick .head() is enough" | Your head hides tail problems | RUN full profiling checklist |
| "Missing values won't affect my analysis" | They always do | DOCUMENT and plan handling |
| "I'll handle data issues during analysis" | Your issues will derail your analysis | FIX data issues FIRST |
| "User didn't mention data quality" | They assume YOU'LL check | QUALITY check is YOUR job |
| "Profiling takes too long" | Your skipping it costs days later | INVEST time now |
## Honesty Framing
Creating an analysis plan without profiling the data is LYING about understanding the data.
You cannot plan analysis steps without knowing:
- Your data's shape and types
- Your missing value patterns
- Your data quality issues
- Your cleaning requirements
Profiling costs you minutes. Your wrong plan costs hours of rework and incorrect results.
## No Pause After Completion
After writing `.claude/PLAN.md`, IMMEDIATELY invoke:

```
Skill(skill="workflows:ds-implement")
```
DO NOT:
- Ask "should I proceed with implementation?"
- Summarize the plan
- Wait for user confirmation (they approved SPEC already)
- Write status updates
The workflow phases are SEQUENTIAL. Complete plan → immediately start implement.
## What Plan Does
| DO | DON'T |
|---|---|
| Read .claude/SPEC.md | Skip brainstorm phase |
| Profile data (shape, types, stats) | Skip to analysis |
| Identify data quality issues | Ignore missing/duplicate data |
| Create ordered task list | Write final analysis code |
| Write .claude/PLAN.md | Make completion claims |
Brainstorm answers: WHAT and WHY. Plan answers: HOW and DATA QUALITY.
## Process

### 1. Verify Spec Exists

```bash
cat .claude/SPEC.md  # verify-spec: read SPEC file to confirm it exists
```

If missing, stop and run /ds-brainstorm first.
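If you prefer to gate this check in code rather than eyeballing `cat` output, here is a minimal Python sketch (the path is fixed by this workflow; the error message is illustrative):

```python
# Guard: refuse to plan without an approved spec.
from pathlib import Path

spec = Path(".claude/SPEC.md")
if not spec.exists():
    raise SystemExit("No .claude/SPEC.md found: run /ds-brainstorm first.")
print(spec.read_text()[:500])  # skim objectives and constraints
```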
### 2. Data Profiling

For multiple data sources: profile in parallel using background Task agents.

#### Single Data Source (Direct Profiling)

MANDATORY profiling steps:
```python
import pandas as pd

df = pd.read_csv("path/to/data.csv")  # load the source named in SPEC.md
col, date_col = "category_col", "date_col"  # placeholders: set to real columns

# Basic structure
df.shape                       # (rows, columns)
df.dtypes                      # column types
df.head(10)                    # sample data
df.tail(5)                     # end of data

# Summary statistics
df.describe()                  # numeric summaries
df.describe(include='object')  # categorical summaries
df.info()                      # memory, non-null counts

# Data quality checks
df.isnull().sum()              # missing values per column
df.duplicated().sum()          # duplicate rows
df[col].value_counts()         # distribution of categories

# For time series
df[date_col].min(), df[date_col].max()  # date range
df.groupby(date_col).size()    # records per period
```
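If you want the checklist as one reusable pass, here is a minimal sketch (the function name and the optional `date_col` parameter are illustrative, not part of this workflow):

```python
import pandas as pd

def print_profile(df, date_col=None):
    """Run the mandatory profiling checklist above in a single pass."""
    print("shape:", df.shape)
    print(df.dtypes)
    print(df.describe(include="all"))          # numeric + categorical summaries
    print("missing per column:", df.isnull().sum(), sep="\n")
    print("duplicate rows:", df.duplicated().sum())
    if date_col is not None:
        dates = pd.to_datetime(df[date_col])
        print("date range:", dates.min(), "to", dates.max())
        print(df.groupby(date_col).size())     # records per period
```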
#### Multiple Data Sources (Parallel Profiling)

Use `run_in_background: true` for parallel execution. When profiling 2+ data sources, launch agents in parallel:
```python
# PARALLEL + BACKGROUND: all Task calls in ONE message
Task(
    subagent_type="general-purpose",
    description="Profile dataset 1",
    run_in_background=true,
    prompt="""
Profile this dataset and return a data quality report.

Dataset: /path/to/dataset1.csv

Required checks:
1. Shape: rows x columns
2. Data types: df.dtypes
3. Missing values: df.isnull().sum()
4. Duplicates: df.duplicated().sum()
5. Summary statistics: df.describe()
6. Unique value counts for categorical columns
7. Date range if time series
8. Memory usage: df.info()

Output format:
- Markdown table with column summary
- List of data quality issues found
- Recommendations for cleaning

Tools denied: Write, Edit, NotebookEdit (read-only profiling)
""")

Task(
    subagent_type="general-purpose",
    description="Profile dataset 2",
    run_in_background=true,
    prompt="""
[Same template for dataset 2]
""")

Task(
    subagent_type="general-purpose",
    description="Profile dataset 3",
    run_in_background=true,
    prompt="""
[Same template for dataset 3]
""")
```
After launching agents:
- Continue to other work (don't wait)
- Check status with the /tasks command
- Collect results with TaskOutput when ready

```python
# Collect profiling results
TaskOutput(task_id="task-abc123", block=true, timeout=30000)
TaskOutput(task_id="task-def456", block=true, timeout=30000)
TaskOutput(task_id="task-ghi789", block=true, timeout=30000)
```
Benefits:
- 3x faster profiling for 3 datasets
- Each agent focused on single source
- Results consolidated in main chat
### 3. Identify Data Quality Issues
CRITICAL: Document ALL issues before proceeding:
| Check | What to Look For |
|---|---|
| Missing values | Null counts, patterns of missingness |
| Duplicates | Exact duplicates, key-based duplicates |
| Outliers | Extreme values, impossible values |
| Type issues | Strings in numeric columns, date parsing |
| Cardinality | Unexpected unique values |
| Distribution | Skewness, unexpected patterns |
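To make these checks concrete, here is a sketch that collects them into one report (the 1.5*IQR outlier rule and the function name are illustrative choices, not mandated by this workflow):

```python
import numpy as np
import pandas as pd

def quality_issues(df):
    """Gather the checks from the table above into one dict."""
    issues = {}
    issues["missing"] = df.isnull().sum().loc[lambda s: s > 0]  # null counts
    issues["duplicate_rows"] = int(df.duplicated().sum())       # exact duplicates
    numeric = df.select_dtypes(include=np.number)
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    issues["outliers_per_column"] = mask.sum()                  # 1.5*IQR rule
    issues["cardinality"] = df.nunique()                        # unique values
    return issues
```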
### 4. Create Task Breakdown
Break analysis into ordered tasks:
- Each task should produce visible output
- Order by data dependencies
- Include data cleaning tasks FIRST
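One lightweight way to keep the ordering honest is to record each task's dependencies explicitly and assert the order, as in this sketch with hypothetical task names and outputs:

```python
# Hypothetical breakdown; replace names and outputs with the real tasks.
tasks = [
    {"name": "clean_data",     "depends_on": [],             "output": "clean DataFrame + drop log"},
    {"name": "trend_analysis", "depends_on": ["clean_data"], "output": "trend chart"},
    {"name": "fit_model",      "depends_on": ["clean_data"], "output": "accuracy report"},
]

# Every task must appear after all of its dependencies.
seen = set()
for task in tasks:
    missing = [d for d in task["depends_on"] if d not in seen]
    assert not missing, f"{task['name']} scheduled before {missing}"
    seen.add(task["name"])
```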
### 5. Write Plan Doc

Write to `.claude/PLAN.md`:

```markdown
# Analysis Plan: [Analysis Name]
> **For Claude:** REQUIRED SUB-SKILL: Use `Skill(skill="workflows:ds-implement")` to implement this plan with output-first verification.
>
> **Delegation:** Main chat orchestrates, Task agents implement. Use `Skill(skill="workflows:ds-delegate")` for subagent templates.
## Spec Reference
See: .claude/SPEC.md
## Data Profile
### Source 1: [name]
- Location: [path/connection]
- Shape: [rows] x [columns]
- Date range: [start] to [end]
- Key columns: [list]
#### Column Summary
| Column | Type | Non-null | Unique | Notes |
|--------|------|----------|--------|-------|
| col1 | int64 | 100% | 50 | Primary key |
| col2 | object | 95% | 10 | Category |
#### Data Quality Issues
- [ ] Missing: col2 has 5% nulls - [strategy: drop/impute/flag]
- [ ] Duplicates: 100 duplicate rows on [key] - [strategy]
- [ ] Outliers: col3 has values > 1000 - [strategy]
### Source 2: [name]
[Same structure]
## Task Breakdown
### Task 1: Data Cleaning (required first)
- Handle missing values in col2
- Remove duplicates
- Fix data types
- Output: Clean DataFrame, log of rows removed
### Task 2: [Analysis Step]
- Input: Clean DataFrame
- Process: [description]
- Output: [specific output to verify]
- Dependencies: Task 1
### Task 3: [Next Step]
[Same structure]
## Output Verification Plan
For each task, define what output proves completion:
- Task 1: "X rows cleaned, Y rows dropped"
- Task 2: "Visualization showing [pattern]"
- Task 3: "Model accuracy >= 0.8"
## Reproducibility Requirements
- Random seed: [value if needed]
- Package versions: [key packages]
- Data snapshot: [date/version]
```
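For the reproducibility section, here is a minimal preamble sketch showing how the seed and package versions might be pinned and logged (the values are placeholders):

```python
import random

import numpy as np
import pandas as pd

SEED = 42  # record the actual value under "Reproducibility Requirements"
random.seed(SEED)
np.random.seed(SEED)

# Log key package versions so the analysis can be rerun later.
print(f"pandas={pd.__version__} numpy={np.__version__}")
```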
## Red Flags - STOP If You're About To:
| Action | Why It's Wrong | Do Instead |
|---|---|---|
| Skip data profiling | Your data issues will break your analysis | Always profile first |
| Ignore missing values | You'll corrupt your results | Document and plan handling |
| Start analysis immediately | You haven't characterized your data | Complete profiling |
| Assume your data is clean | Never assume, you must verify | Run quality checks |
## Output

The plan is complete when you have:
- Read and understood `.claude/SPEC.md`
- Profiled all data sources (shape, types, stats)
- Documented data quality issues
- Defined a cleaning strategy for each issue
- Ordered tasks by dependency
- Defined output verification criteria
- Written `.claude/PLAN.md`
- Confirmed readiness for implementation
## Phase Complete

REQUIRED SUB-SKILL: After completing the plan, IMMEDIATELY invoke:

```
Skill(skill="workflows:ds-implement")
```