| name | data-science-tools |
| description | Documentation of available data science libraries (scipy, numpy, pandas, sklearn) and best practices for statistical analysis, regression modeling, and organizing analysis scripts. **CRITICAL:** All analysis scripts MUST be placed in reports/{topic}/scripts/, NOT in root scripts/ directory. |
Data Science Tools Skill
Purpose
This skill documents the data science ecosystem available in this project, including:
- Which Python libraries are installed and available
- How to use them for statistical analysis and regression
- WHERE to place analysis scripts (reports/{topic}/scripts/ - NOT root scripts/)
- Best practices for reproducible data science
🚨 CRITICAL: Script Organization Rule
ALL regression, modeling, and analysis scripts MUST go in:
reports/{topic}_{timestamp}/scripts/
NEVER in:
scripts/ ❌ (root scripts/ is only for reusable utilities)
See Script Organization Best Practices section below.
Available Libraries
Installed in .venv Virtual Environment
The following data science libraries are installed and ready to use:
| Library | Version | Purpose |
|---|---|---|
| numpy | Latest | Numerical computing, arrays, linear algebra |
| scipy | 1.16.3+ | Scientific computing, optimization, statistics |
| pandas | 2.3.3+ | Data manipulation, DataFrames, time series |
| scikit-learn | 1.7.2+ | Machine learning, regression, clustering |
Activating the Virtual Environment
All Python scripts must use the virtual environment:
source .venv/bin/activate && python scripts/your_script.py
Or add a shebang and run the script directly (note: `#!/usr/bin/env python3` only picks up the venv's interpreter if the venv is already activated; otherwise point the shebang at `.venv/bin/python`):
#!/usr/bin/env python3
# Then run directly: ./scripts/your_script.py
In Bash tool calls:
source .venv/bin/activate && python scripts/analysis.py
Common Use Cases
1. Regression Modeling (scipy.optimize.curve_fit)
Purpose: Fit non-linear models to data (S-curves, exponential, etc.)
Example: Fitting a Logistic S-Curve
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
# Define model
def logistic(t, L, k, t0):
"""Logistic S-curve: L / (1 + exp(-k*(t - t0)))"""
return L / (1 + np.exp(-k * (t - t0)))
# Prepare data
years = np.array([1993, 1994, ...]) # Time points
shares = np.array([0.004, 0.005, ...]) # Observed values
t = years - 1993 # Normalize time
# Fit model with bounds
p0 = [80, 0.5, 30] # Initial guess: L=80%, k=0.5, t0=30
bounds = ([50, 0.1, 20], [100, 2.0, 50]) # Parameter bounds
params, covariance = curve_fit(
logistic, t, shares,
p0=p0,
bounds=bounds,
maxfev=10000
)
L, k, t0 = params
# Validate
predictions = logistic(t, L, k, t0)
r2 = r2_score(shares, predictions)
rmse = np.sqrt(np.mean((shares - predictions)**2))
print(f"Fitted parameters: L={L:.2f}, k={k:.4f}, t0={t0:.2f}")
print(f"R² = {r2:.6f}, RMSE = {rmse:.4f}")
⚠️ Important: Always use curve_fit with:
- An initial guess (`p0`)
- Bounds on parameters (prevents unrealistic values)
- `maxfev` large enough to allow sufficient iterations
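curve_fit also returns a covariance matrix that the example above stores but does not use; a minimal sketch of turning it into standard errors and rough 95% intervals (assuming params and covariance from that fit):

```python
import numpy as np

# Standard errors are the square roots of the covariance matrix diagonal
perr = np.sqrt(np.diag(covariance))

for name, value, err in zip(['L', 'k', 't0'], params, perr):
    # Rough 95% interval: +/- 1.96 standard errors (assumes approximate normality)
    print(f"{name} = {value:.3f} ± {1.96 * err:.3f}")
```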
2. Model Comparison
Compare multiple models to find best fit:
models = {
'logistic': (logistic, [80, 0.5, 30], ([50, 0.1, 20], [100, 2.0, 50])),
'gompertz': (gompertz, [80, 0.2, 30], ([50, 0.05, 20], [100, 1.0, 50])),
}
results = {}
for name, (func, p0, bounds) in models.items():
params, _ = curve_fit(func, t, shares, p0=p0, bounds=bounds)
pred = func(t, *params)
r2 = r2_score(shares, pred)
results[name] = {'params': params, 'r2': r2}
# Find best
best_model = max(results.items(), key=lambda x: x[1]['r2'])
print(f"Best model: {best_model[0]} (R² = {best_model[1]['r2']:.6f})")
3. Data Manipulation with Pandas
Read CSV, filter, aggregate:
import pandas as pd
# Read data
df = pd.read_csv('data/ev_annual_bil10.csv')
# Filter
recent = df[df['year'] >= 2015]
# Aggregate
yearly_avg = df.groupby('year')['ev_share_pct'].mean()
# Export
df.to_csv('data/results.csv', index=False)
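To hand a DataFrame like this to curve_fit, convert the relevant columns to numpy arrays first; a short sketch assuming the year and ev_share_pct columns used throughout this skill:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data/ev_annual_bil10.csv')
df = df.sort_values('year')              # keep the time axis monotonic
years = df['year'].to_numpy()
shares = df['ev_share_pct'].to_numpy()
t = years - years.min()                  # normalized time, as in the curve_fit example
```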
4. Statistical Analysis
from scipy import stats
# Correlation
corr, p_value = stats.pearsonr(x, y)
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# T-test
t_stat, p_value = stats.ttest_ind(group1, group2)
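A minimal, self-contained example of the linregress call above, using illustrative numbers rather than real DST data:

```python
import numpy as np
from scipy import stats

x = np.array([2015, 2016, 2017, 2018, 2019, 2020], dtype=float)
y = np.array([0.5, 0.7, 1.1, 1.9, 3.2, 5.0])  # illustrative values only

result = stats.linregress(x, y)
print(f"slope={result.slope:.3f}, r²={result.rvalue**2:.3f}, p={result.pvalue:.3g}")
```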
Script Organization Best Practices
Directory Structure
dst_skills/
├── scripts/ # Reusable utilities ONLY
│ ├── fetch_and_store.py
│ ├── db/
│ │ └── helpers.py
│ └── utils.py
│
├── data/ # Raw data and databases
│ ├── dst.db
│ └── *.csv
│
└── reports/ # Generated reports
└── {topic}_{timestamp}/
├── report.html
├── visualizations.html
├── data/ # Report-specific intermediate data
│ └── *.csv
└── scripts/ # ⚠️ ALL analysis scripts go HERE
├── README.md
├── fit_models.py
├── validate.py
└── requirements.txt
IMPORTANT: Do NOT create analysis scripts in root scripts/ directory.
All regression, modeling, and analysis scripts must be in the report's scripts/ folder.
When to Place Scripts in reports/{topic}/scripts/
✅ ALWAYS for Analysis
Use this for ALL report-specific analysis:
- Regression modeling (curve_fit, forecasting, etc.)
- Statistical analysis (hypothesis tests, correlations, etc.)
- Data transformation specific to this report
- Validation and model comparison

Placing scripts here gives you:
- Reproducibility - the reader can re-run your exact analysis
- Documentation - shows exactly what was done
- Versioning - freezes the code with the report at the time of publication
✅ ALL of these belong in reports/{topic}/scripts/:
- `fit_ev_models.py` - Regression modeling
- `validate_models.py` - Model validation
- `verify_regression_models.py` - scipy verification
- `forecast_scenarios.py` - Forecasting
- `statistical_tests.py` - Hypothesis testing
Example structure:
reports/elbiler_danmark_20251031/
├── report.html
├── visualizations.html
├── data/ # Intermediate data for THIS analysis
│ ├── model_fits.csv
│ ├── forecasts.csv
│ └── residuals.csv
└── scripts/ # ✅ ALL analysis scripts here
├── README.md # Explains how to reproduce
├── fit_ev_models.py # Main regression analysis
├── validate_models.py # Cross-validation
├── verify_regression_models.py # scipy verification
└── requirements.txt # Dependencies snapshot
When to Use scripts/ (Root Level)
⚠️ ONLY for Reusable Utilities
Root scripts/ is ONLY for infrastructure utilities that are shared across ALL reports:
- Database utilities (`db/helpers.py`, `db/validate.py`)
- Data fetching (`fetch_and_store.py`)
- Generic helpers (`utils.py`)
- NOT for analysis: no regression, modeling, or statistics
❌ NEVER put these in root scripts/:
- Regression models
- Statistical analysis
- Data transformations
- Forecasting
- Model validation
✅ Root scripts/ should ONLY contain:
# scripts/db/helpers.py - OK (reusable DB utility)
def safe_numeric_cast(column_name):
"""Helper for casting DST suppressed values."""
return f"CASE WHEN {column_name} != '..' THEN CAST({column_name} AS NUMERIC) ELSE NULL END"
# scripts/utils.py - OK (generic utility)
from datetime import datetime

def format_timestamp():
    """Standard timestamp format for filenames."""
    return datetime.now().strftime('%Y%m%d_%H%M%S')
# scripts/fetch_and_store.py - OK (reusable infrastructure)
def fetch_dst_table(table_id, filters):
"""Fetch data from DST API and store in DuckDB."""
# ... implementation
If you're doing curve_fit, forecasting, or statistics → reports/{topic}/scripts/ ✅
Template: Report Analysis Script
#!/usr/bin/env python3
"""
EV Adoption Model Fitting and Validation
=========================================
Report: Danmarks Elbilsudvikling 2050
Date: 2025-10-31
Author: Claude Code
Purpose:
Fit multiple regression models to EV adoption data and compare.
Usage:
cd reports/{report_name}/scripts/
source ../../../.venv/bin/activate
python fit_ev_models.py
Outputs:
- ../data/model_parameters.csv
- ../data/forecasts.csv
- stdout: Model comparison table
"""
import sys
import os
import csv
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
def main():
# 1. Load data using relative path from scripts/ directory
script_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.join(script_dir, '../../..')
# Path to project-level data
data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')
print(f"Loading data from {data_path}...")
years = []
shares = []
with open(data_path, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
years.append(int(row['year']))
shares.append(float(row['ev_share_pct']))
years = np.array(years)
shares = np.array(shares)
# 2. Fit models
print("\nFitting models...")
# ... implementation
# 3. Save results to report's data/ directory
output_dir = os.path.join(script_dir, '../data')
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, 'model_parameters.csv')
print(f"\nSaving results to {output_path}...")
# ... save implementation
if __name__ == '__main__':
main()
Key points:
- Use `os.path` for cross-platform compatibility
- Always use relative paths from the script's location
- Project data: `../../../data/`
- Report data: `../data/`
- Activate the venv before running
README.md Template for Report Scripts
# Analysis Scripts for EV Adoption Report
## Report Details
- **Topic:** Danmarks Elbilsudvikling til 2050
- **Generated:** 2025-10-31
- **Data:** BIL10, BIL52, BIL51 (Danmarks Statistik)
## Reproducibility
### Prerequisites
```bash
# From project root
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
```

### Run Analysis

```bash
cd reports/elbiler_danmark_20251031/scripts/
python fit_ev_models.py
python validate_models.py
```

## Scripts

- `fit_ev_models.py` - Fits logistic, Gompertz, exponential models
- `validate_models.py` - Cross-validation and residual analysis
- `export_forecasts.py` - Generate 2026-2050 predictions

## Outputs

Results saved to `../data/`:

- `model_parameters.csv` - Fitted parameters (L, k, t0)
- `forecasts.csv` - Year-by-year predictions
- `validation_metrics.csv` - R², RMSE, etc.

## Model Details

See `../report.html`, Section 3: Methodology
## Common Pitfalls and Solutions
### 1. ModuleNotFoundError
**Problem:**
```bash
ModuleNotFoundError: No module named 'scipy'
```

**Solution:**

```bash
# Always activate venv first
source .venv/bin/activate
python scripts/your_script.py
```
2. curve_fit Fails to Converge
Problem:
OptimizeWarning: Covariance of the parameters could not be estimated
Solutions:
- Improve the initial guess `p0`
- Tighten bounds (e.g., L: [60, 90] instead of [50, 100])
- Increase `maxfev` to 20000
- Normalize/scale your data first
- Try different optimization methods
# Better bounds
bounds = ([65, 0.3, 25], [95, 0.8, 40]) # Tighter
# Or use different method
from scipy.optimize import minimize, differential_evolution
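If curve_fit still will not converge, a global optimizer such as differential_evolution can minimize the squared error directly over the bounded parameter space. A sketch, reusing the t and shares arrays prepared earlier (an assumption of this example):

```python
import numpy as np
from scipy.optimize import differential_evolution

def logistic(t, L, k, t0):
    return L / (1 + np.exp(-k * (t - t0)))

def sse(params, t, y):
    """Sum of squared errors for a candidate (L, k, t0)."""
    return np.sum((y - logistic(t, *params)) ** 2)

# Bounds as (min, max) pairs for L, k, t0; t and shares come from the data-prep step
result = differential_evolution(sse, bounds=[(50, 100), (0.1, 2.0), (20, 50)],
                                args=(t, shares), seed=42)
print("Global best parameters:", result.x)
```

The result can then serve as an improved `p0` for a final curve_fit refinement.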
3. Grid Search vs Optimization
Bad (inefficient):
best_r2 = 0
for L in [70, 75, 80, 85, 90, 95]:
for k in np.arange(0.1, 2.0, 0.05):
# ... fit and compare
Good (use scipy):
params, _ = curve_fit(logistic, t, shares, p0=[80, 0.5, 30])
When grid search is acceptable:
- Quick prototyping to find a good `p0`
- Testing specific scenarios (e.g., compare L=70% vs L=90%)
- Educational purposes
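If you do use a grid, keep it coarse and use it only to seed the optimizer; a sketch assuming the logistic, t, and shares objects defined earlier:

```python
import numpy as np
from scipy.optimize import curve_fit

# Coarse grid over plausible starting points, scored by squared error
candidates = [(L0, k0, 30) for L0 in (70, 80, 90) for k0 in (0.2, 0.5, 1.0)]
best_p0 = min(candidates,
              key=lambda p: np.sum((shares - logistic(t, *p)) ** 2))

# Refine from the best starting point with scipy
params, _ = curve_fit(logistic, t, shares, p0=best_p0,
                      bounds=([50, 0.1, 20], [100, 2.0, 50]))
```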
4. Overfitting
Warning signs:
- R² > 0.999 on historical data
- Model fits noise, not signal
- Poor performance on holdout set
Solutions:
# Train-test split (shuffle=False keeps the time order, so the last 20% is held out)
from sklearn.model_selection import train_test_split
train_t, test_t, train_y, test_y = train_test_split(
    t, shares, test_size=0.2, shuffle=False)

# Fit on train, validate on test
params, _ = curve_fit(logistic, train_t, train_y, p0=[80, 0.5, 30])
test_pred = logistic(test_t, *params)
test_r2 = r2_score(test_y, test_pred)
if test_r2 < 0.9:
    print("⚠️ Warning: Poor generalization")
Installation and Verification
Check Installed Packages
source .venv/bin/activate
pip list | grep -E "(numpy|scipy|pandas|scikit)"
Expected output:
numpy 1.x.x
pandas 2.3.3
scikit-learn 1.7.2
scipy 1.16.3
Verify scipy.optimize Works
source .venv/bin/activate
python -c "from scipy.optimize import curve_fit; print('✓ scipy.optimize available')"
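Beyond the import check, a short fit on synthetic data confirms the optimizer actually runs; a minimal smoke test:

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit a line to exactly linear data; the recovered slope/intercept should be ~2 and ~1
x = np.linspace(0, 10, 20)
y = 2 * x + 1
params, _ = curve_fit(lambda x, a, b: a * x + b, x, y)
print("✓ curve_fit works:", np.allclose(params, [2.0, 1.0]))
```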
Install Missing Packages
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
Integration with DST Skills Workflow
Typical Workflow
1. Discovery: /dst-discover → Find tables
2. Fetch: /dst-fetch → Download data to data/
3. Analysis: /dst-analyze → SQL queries, basic calculations
4. Modeling: Create a script in reports/{topic}/scripts/ for regression
5. Visualize: /dst-visualize → Create charts from results
6. Report: /dst-report → Generate HTML with all findings
Where Each Step Happens
| Step | Location | Examples |
|---|---|---|
| Data fetching | data/ | dst.db, *.csv |
| SQL queries | Agent (ephemeral) | Aggregations, joins |
| Regression/modeling | reports/{topic}/scripts/ ✅ | curve_fit, forecasting |
| Results | reports/{topic}/data/ | model_parameters.csv |
| Report | reports/{topic}/ | report.html |
Example: Complete Regression Analysis
Step 1: Create analysis script in report folder
File: reports/elbiler_danmark_20251031/scripts/fit_logistic_model.py
#!/usr/bin/env python3
"""
Fit a logistic growth model to EV adoption data.
Usage:
cd reports/elbiler_danmark_20251031/scripts/
source ../../../.venv/bin/activate
python fit_logistic_model.py
"""
import csv
import os
import numpy as np
from scipy.optimize import curve_fit
def main():
# Load data from project data/
script_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.join(script_dir, '../../..')
data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')
# 1. Load data
years = []
shares = []
with open(data_path, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
years.append(int(row['year']))
shares.append(float(row['ev_share_pct']))
years = np.array(years)
shares = np.array(shares)
t = years - years.min()
# 2. Define and fit model
def logistic(t, L, k, t0):
return L / (1 + np.exp(-k * (t - t0)))
params, _ = curve_fit(logistic, t, shares,
p0=[80, 0.5, 30],
bounds=([50, 0.1, 20], [100, 2.0, 50]))
L, k, t0 = params
# 3. Forecast
future_years = np.arange(years.max() + 1, 2051)
future_t = future_years - years.min()
forecast = logistic(future_t, L, k, t0)
# 4. Export to report's data/ folder
output_dir = os.path.join(script_dir, '../data')
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, 'forecast.csv')
with open(output_path, 'w') as f:
writer = csv.writer(f)
writer.writerow(['year', 'predicted_share'])
for year, pred in zip(future_years, forecast):
writer.writerow([year, pred])
print(f"✓ Forecast exported: {output_path}")
print(f" Model: L={L:.1f}%, k={k:.3f}, t0={t0:.1f}")
if __name__ == '__main__':
main()
Step 2: Run from report's scripts/ directory
cd reports/elbiler_danmark_20251031/scripts/
source ../../../.venv/bin/activate
python fit_logistic_model.py
Step 3: Use results in visualization and report
The forecast.csv is now in reports/elbiler_danmark_20251031/data/ and can be used by /dst-visualize and /dst-report.
✅ Benefits of this approach:
- Script stays with report (reproducibility)
- Relative paths work from any machine
- Clear separation: data fetching vs analysis vs reporting
- Easy to version control and share
References
Documentation
- scipy.optimize.curve_fit: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
- sklearn metrics: https://scikit-learn.org/stable/modules/model_evaluation.html
- pandas: https://pandas.pydata.org/docs/
Regression Theory
- Logistic growth: Bass diffusion model, technology adoption
- Gompertz curve: Asymmetric S-curve for market saturation
- Model selection: AIC, BIC, cross-validation
Best Practices
- Script placement: ALWAYS put analysis scripts in reports/{topic}/scripts/
- Validation: Use a train-test split for model validation
- Reporting: Always report R², RMSE, and residual plots
- Documentation: Document assumptions and limitations in script docstrings
- Reproducibility: Version-control analysis scripts WITH the report they generate
- Data paths: Use relative paths with os.path for cross-platform compatibility
- Virtual env: Always activate .venv before running scipy/numpy code
Quick Reference: Where Does It Go?
| What | Where | Example |
|---|---|---|
| Regression scripts | reports/{topic}/scripts/ | fit_models.py |
| Validation scripts | reports/{topic}/scripts/ | verify_regression_models.py |
| Forecasting scripts | reports/{topic}/scripts/ | forecast_scenarios.py |
| Statistical tests | reports/{topic}/scripts/ | hypothesis_tests.py |
| Intermediate results | reports/{topic}/data/ | model_parameters.csv |
| Raw data | data/ (project root) | dst.db, ev_annual_bil10.csv |
| Reusable utilities | scripts/ (project root) | db/helpers.py, fetch_and_store.py |
Simple rule: If it uses scipy/curve_fit/statistics → reports/{topic}/scripts/ ✅