| name | data-science-tools |
| description | Documentation of available data science libraries (scipy, numpy, pandas, sklearn) and best practices for statistical analysis, regression modeling, and organizing analysis scripts. **CRITICAL:** All analysis scripts MUST be placed in reports/{topic}/scripts/, NOT in root scripts/ directory. |
Data Science Tools Skill
Purpose
This skill documents the data science ecosystem available in this project, including:
- Which Python libraries are installed and available
- How to use them for statistical analysis and regression
- WHERE to place analysis scripts (reports/{topic}/scripts/ - NOT root scripts/)
- Best practices for reproducible data science
🚨 CRITICAL: Script Organization Rule
ALL regression, modeling, and analysis scripts MUST go in:
reports/{topic}_{timestamp}/scripts/
NEVER in:
scripts/ ❌ (root scripts/ is only for reusable utilities)
See Script Organization Best Practices section below.
Available Libraries
Installed in .venv Virtual Environment
The following data science libraries are installed and ready to use:
| Library | Version | Purpose |
|---|---|---|
| numpy | Latest | Numerical computing, arrays, linear algebra |
| scipy | 1.16.3+ | Scientific computing, optimization, statistics |
| pandas | 2.3.3+ | Data manipulation, DataFrames, time series |
| scikit-learn | 1.7.2+ | Machine learning, regression, clustering |
Activating the Virtual Environment
All Python scripts must use the virtual environment:
source .venv/bin/activate && python scripts/your_script.py
Or add a shebang and run the script directly (note: `#!/usr/bin/env python3` only picks up the venv's interpreter if the venv is already activated; otherwise point the shebang at `.venv/bin/python`):
#!/usr/bin/env python3
# Then run directly: ./scripts/your_script.py
In Bash tool calls:
source .venv/bin/activate && python scripts/analysis.py
Common Use Cases
1. Regression Modeling (scipy.optimize.curve_fit)
Purpose: Fit non-linear models to data (S-curves, exponential, etc.)
Example: Fitting a Logistic S-Curve
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
# Define model
def logistic(t, L, k, t0):
"""Logistic S-curve: L / (1 + exp(-k*(t - t0)))"""
return L / (1 + np.exp(-k * (t - t0)))
# Prepare data
years = np.array([1993, 1994, ...]) # Time points
shares = np.array([0.004, 0.005, ...]) # Observed values
t = years - 1993 # Normalize time
# Fit model with bounds
p0 = [80, 0.5, 30] # Initial guess: L=80%, k=0.5, t0=30
bounds = ([50, 0.1, 20], [100, 2.0, 50]) # Parameter bounds
params, covariance = curve_fit(
logistic, t, shares,
p0=p0,
bounds=bounds,
maxfev=10000
)
L, k, t0 = params
# Validate
predictions = logistic(t, L, k, t0)
r2 = r2_score(shares, predictions)
rmse = np.sqrt(np.mean((shares - predictions)**2))
print(f"Fitted parameters: L={L:.2f}, k={k:.4f}, t0={t0:.2f}")
print(f"R² = {r2:.6f}, RMSE = {rmse:.4f}")
⚠️ Important: Always use curve_fit with:
- An initial guess (`p0`)
- Bounds on parameters (prevents unrealistic values)
- `maxfev` large enough to allow sufficient iterations
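curve_fit also returns a covariance matrix that the example above stores but does not use; a minimal sketch of turning it into standard errors and rough 95% intervals (assuming params and covariance from that fit):

```python
import numpy as np

# Standard errors are the square roots of the covariance matrix diagonal
perr = np.sqrt(np.diag(covariance))

for name, value, err in zip(['L', 'k', 't0'], params, perr):
    # Rough 95% interval: +/- 1.96 standard errors (assumes approximate normality)
    print(f"{name} = {value:.3f} ± {1.96 * err:.3f}")
```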
2. Model Comparison
Compare multiple models to find best fit:
models = {
'logistic': (logistic, [80, 0.5, 30], ([50, 0.1, 20], [100, 2.0, 50])),
'gompertz': (gompertz, [80, 0.2, 30], ([50, 0.05, 20], [100, 1.0, 50])),
}
results = {}
for name, (func, p0, bounds) in models.items():
params, _ = curve_fit(func, t, shares, p0=p0, bounds=bounds)
pred = func(t, *params)
r2 = r2_score(shares, pred)
results[name] = {'params': params, 'r2': r2}
# Find best
best_model = max(results.items(), key=lambda x: x[1]['r2'])
print(f"Best model: {best_model[0]} (R² = {best_model[1]['r2']:.6f})")
3. Data Manipulation with Pandas
Read CSV, filter, aggregate:
import pandas as pd
# Read data
df = pd.read_csv('data/ev_annual_bil10.csv')
# Filter
recent = df[df['year'] >= 2015]
# Aggregate
yearly_avg = df.groupby('year')['ev_share_pct'].mean()
# Export
df.to_csv('data/results.csv', index=False)
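To hand a DataFrame like this to curve_fit, convert the relevant columns to numpy arrays first; a short sketch assuming the year and ev_share_pct columns used throughout this skill:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data/ev_annual_bil10.csv')
df = df.sort_values('year')              # keep the time axis monotonic
years = df['year'].to_numpy()
shares = df['ev_share_pct'].to_numpy()
t = years - years.min()                  # normalized time, as in the curve_fit example
```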
4. Statistical Analysis
from scipy import stats
# Correlation
corr, p_value = stats.pearsonr(x, y)
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# T-test
t_stat, p_value = stats.ttest_ind(group1, group2)
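A minimal, self-contained example of the linregress call above, using illustrative numbers rather than real DST data:

```python
import numpy as np
from scipy import stats

x = np.array([2015, 2016, 2017, 2018, 2019, 2020], dtype=float)
y = np.array([0.5, 0.7, 1.1, 1.9, 3.2, 5.0])  # illustrative values only

result = stats.linregress(x, y)
print(f"slope={result.slope:.3f}, r²={result.rvalue**2:.3f}, p={result.pvalue:.3g}")
```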
Script Organization Best Practices
Directory Structure
dst_skills/
├── scripts/ # Reusable utilities ONLY
│ ├── fetch_and_store.py
│ ├── db/
│ │ └── helpers.py
│ └── utils.py
│
├── data/ # Raw data and databases
│ ├── dst.db
│ └── *.csv
│
└── reports/ # Generated reports
└── {topic}_{timestamp}/
├── report.html
├── visualizations.html
├── data/ # Report-specific intermediate data
│ └── *.csv
└── scripts/ # ⚠️ ALL analysis scripts go HERE
├── README.md
├── fit_models.py
├── validate.py
└── requirements.txt
IMPORTANT: Do NOT create analysis scripts in root scripts/ directory.
All regression, modeling, and analysis scripts must be in the report's scripts/ folder.
When to Place Scripts in reports/{topic}/scripts/
✅ ALWAYS for Analysis
Use this for ALL report-specific analysis:
- Regression modeling (curve_fit, forecasting, etc.)
- Statistical analysis (hypothesis tests, correlations, etc.)
- Data transformation specific to this report
- Validation and model comparison

Placing scripts here gives you:
- Reproducibility - the reader can re-run your exact analysis
- Documentation - shows exactly what was done
- Versioning - freezes the code with the report at the time of publication
✅ ALL of these belong in reports/{topic}/scripts/:
- `fit_ev_models.py` - Regression modeling
- `validate_models.py` - Model validation
- `verify_regression_models.py` - scipy verification
- `forecast_scenarios.py` - Forecasting
- `statistical_tests.py` - Hypothesis testing
Example structure:
reports/elbiler_danmark_20251031/
├── report.html
├── visualizations.html
├── data/ # Intermediate data for THIS analysis
│ ├── model_fits.csv
│ ├── forecasts.csv
│ └── residuals.csv
└── scripts/ # ✅ ALL analysis scripts here
├── README.md # Explains how to reproduce
├── fit_ev_models.py # Main regression analysis
├── validate_models.py # Cross-validation
├── verify_regression_models.py # scipy verification
└── requirements.txt # Dependencies snapshot
When to Use scripts/ (Root Level)
⚠️ ONLY for Reusable Utilities
Root scripts/ is ONLY for infrastructure utilities that are shared across ALL reports:
- Database utilities (`db/helpers.py`, `db/validate.py`)
- Data fetching (`fetch_and_store.py`)
- Generic helpers (`utils.py`)
- NOT for analysis: no regression, modeling, or statistics
❌ NEVER put these in root scripts/:
- Regression models
- Statistical analysis
- Data transformations
- Forecasting
- Model validation
✅ Root scripts/ should ONLY contain:
# scripts/db/helpers.py - OK (reusable DB utility)
def safe_numeric_cast(column_name):
"""Helper for casting DST suppressed values."""
return f"CASE WHEN {column_name} != '..' THEN CAST({column_name} AS NUMERIC) ELSE NULL END"
# scripts/utils.py - OK (generic utility)
from datetime import datetime

def format_timestamp():
    """Standard timestamp format for filenames."""
    return datetime.now().strftime('%Y%m%d_%H%M%S')
# scripts/fetch_and_store.py - OK (reusable infrastructure)
def fetch_dst_table(table_id, filters):
"""Fetch data from DST API and store in DuckDB."""
# ... implementation
If you're doing curve_fit, forecasting, or statistics → reports/{topic}/scripts/ ✅
Template: Report Analysis Script
#!/usr/bin/env python3
"""
EV Adoption Model Fitting and Validation
=========================================
Report: Danmarks Elbilsudvikling 2050
Date: 2025-10-31
Author: Claude Code
Purpose:
Fit multiple regression models to EV adoption data and compare.
Usage:
cd reports/{report_name}/scripts/
source ../../../.venv/bin/activate
python fit_ev_models.py
Outputs:
- ../data/model_parameters.csv
- ../data/forecasts.csv
- stdout: Model comparison table
"""
import sys
import os
import csv
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
def main():
# 1. Load data using relative path from scripts/ directory
script_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.join(script_dir, '../../..')
# Path to project-level data
data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')
print(f"Loading data from {data_path}...")
years = []
shares = []
with open(data_path, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
years.append(int(row['year']))
shares.append(float(row['ev_share_pct']))
years = np.array(years)
shares = np.array(shares)
# 2. Fit models
print("\nFitting models...")
# ... implementation
# 3. Save results to report's data/ directory
output_dir = os.path.join(script_dir, '../data')
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, 'model_parameters.csv')
print(f"\nSaving results to {output_path}...")
# ... save implementation
if __name__ == '__main__':
main()
Key points:
- Use `os.path` for cross-platform compatibility
- Always use relative paths from the script's location
- Project data: `../../../data/`
- Report data: `../data/`
- Activate the venv before running
README.md Template for Report Scripts
# Analysis Scripts for EV Adoption Report
## Report Details
- **Topic:** Danmarks Elbilsudvikling til 2050
- **Generated:** 2025-10-31
- **Data:** BIL10, BIL52, BIL51 (Danmarks Statistik)
## Reproducibility
### Prerequisites
```bash
# From project root
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
```

### Run Analysis

```bash
cd reports/elbiler_danmark_20251031/scripts/
python fit_ev_models.py
python validate_models.py
```

## Scripts

- `fit_ev_models.py` - Fits logistic, Gompertz, exponential models
- `validate_models.py` - Cross-validation and residual analysis
- `export_forecasts.py` - Generate 2026-2050 predictions

## Outputs

Results saved to `../data/`:

- `model_parameters.csv` - Fitted parameters (L, k, t0)
- `forecasts.csv` - Year-by-year predictions
- `validation_metrics.csv` - R², RMSE, etc.

## Model Details

See `../report.html`, Section 3: Methodology
## Common Pitfalls and Solutions
### 1. ModuleNotFoundError
**Problem:**
```bash
ModuleNotFoundError: No module named 'scipy'
```

**Solution:**

```bash
# Always activate venv first
source .venv/bin/activate
python scripts/your_script.py
```
2. curve_fit Fails to Converge
Problem:
OptimizeWarning: Covariance of the parameters could not be estimated
Solutions:
- Improve the initial guess `p0`
- Tighten bounds (e.g., L: [60, 90] instead of [50, 100])
- Increase `maxfev` to 20000
- Normalize/scale your data first
- Try different optimization methods
# Better bounds
bounds = ([65, 0.3, 25], [95, 0.8, 40]) # Tighter
# Or use different method
from scipy.optimize import minimize, differential_evolution
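If curve_fit still will not converge, a global optimizer such as differential_evolution can minimize the squared error directly over the bounded parameter space. A sketch, reusing the t and shares arrays prepared earlier (an assumption of this example):

```python
import numpy as np
from scipy.optimize import differential_evolution

def logistic(t, L, k, t0):
    return L / (1 + np.exp(-k * (t - t0)))

def sse(params, t, y):
    """Sum of squared errors for a candidate (L, k, t0)."""
    return np.sum((y - logistic(t, *params)) ** 2)

# Bounds as (min, max) pairs for L, k, t0; t and shares come from the data-prep step
result = differential_evolution(sse, bounds=[(50, 100), (0.1, 2.0), (20, 50)],
                                args=(t, shares), seed=42)
print("Global best parameters:", result.x)
```

The result can then serve as an improved `p0` for a final curve_fit refinement.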
3. Grid Search vs Optimization
Bad (inefficient):
best_r2 = 0
for L in [70, 75, 80, 85, 90, 95]:
for k in np.arange(0.1, 2.0, 0.05):
# ... fit and compare
Good (use scipy):
params, _ = curve_fit(logistic, t, shares, p0=[80, 0.5, 30])
When grid search is acceptable:
- Quick prototyping to find a good `p0`
- Testing specific scenarios (e.g., compare L=70% vs L=90%)
- Educational purposes
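If you do use a grid, keep it coarse and use it only to seed the optimizer; a sketch assuming the logistic, t, and shares objects defined earlier:

```python
import numpy as np
from scipy.optimize import curve_fit

# Coarse grid over plausible starting points, scored by squared error
candidates = [(L0, k0, 30) for L0 in (70, 80, 90) for k0 in (0.2, 0.5, 1.0)]
best_p0 = min(candidates,
              key=lambda p: np.sum((shares - logistic(t, *p)) ** 2))

# Refine from the best starting point with scipy
params, _ = curve_fit(logistic, t, shares, p0=best_p0,
                      bounds=([50, 0.1, 20], [100, 2.0, 50]))
```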
4. Overfitting
Warning signs:
- R² > 0.999 on historical data
- Model fits noise, not signal
- Poor performance on holdout set
Solutions:
# Train-test split (shuffle=False keeps the time order, so the last 20% is held out)
from sklearn.model_selection import train_test_split
train_t, test_t, train_y, test_y = train_test_split(
    t, shares, test_size=0.2, shuffle=False)

# Fit on train, validate on test
params, _ = curve_fit(logistic, train_t, train_y, p0=[80, 0.5, 30])
test_pred = logistic(test_t, *params)
test_r2 = r2_score(test_y, test_pred)
if test_r2 < 0.9:
    print("⚠️ Warning: Poor generalization")
Installation and Verification
Check Installed Packages
source .venv/bin/activate
pip list | grep -E "(numpy|scipy|pandas|scikit)"
Expected output:
numpy 1.x.x
pandas 2.3.3
scikit-learn 1.7.2
scipy 1.16.3
Verify scipy.optimize Works
source .venv/bin/activate
python -c "from scipy.optimize import curve_fit; print('✓ scipy.optimize available')"
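Beyond the import check, a short fit on synthetic data confirms the optimizer actually runs; a minimal smoke test:

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit a line to exactly linear data; the recovered slope/intercept should be ~2 and ~1
x = np.linspace(0, 10, 20)
y = 2 * x + 1
params, _ = curve_fit(lambda x, a, b: a * x + b, x, y)
print("✓ curve_fit works:", np.allclose(params, [2.0, 1.0]))
```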
Install Missing Packages
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
Integration with DST Skills Workflow
Typical Workflow
1. Discovery: /dst-discover → Find tables
2. Fetch: /dst-fetch → Download data to data/
3. Analysis: /dst-analyze → SQL queries, basic calculations
4. Modeling: Create a script in reports/{topic}/scripts/ for regression
5. Visualize: /dst-visualize → Create charts from results
6. Report: /dst-report → Generate HTML with all findings
Where Each Step Happens
| Step | Location | Examples |
|---|---|---|
| Data fetching | data/ | dst.db, *.csv |
| SQL queries | Agent (ephemeral) | Aggregations, joins |
| Regression/modeling | reports/{topic}/scripts/ ✅ | curve_fit, forecasting |
| Results | reports/{topic}/data/ | model_parameters.csv |
| Report | reports/{topic}/ | report.html |
Example: Complete Regression Analysis
Step 1: Create analysis script in report folder
File: reports/elbiler_danmark_20251031/scripts/fit_logistic_model.py
#!/usr/bin/env python3
"""
Fit a logistic growth model to EV adoption data.
Usage:
cd reports/elbiler_danmark_20251031/scripts/
source ../../../.venv/bin/activate
python fit_logistic_model.py
"""
import csv
import os
import numpy as np
from scipy.optimize import curve_fit
def main():
# Load data from project data/
script_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.join(script_dir, '../../..')
data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')
# 1. Load data
years = []
shares = []
with open(data_path, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
years.append(int(row['year']))
shares.append(float(row['ev_share_pct']))
years = np.array(years)
shares = np.array(shares)
t = years - years.min()
# 2. Define and fit model
def logistic(t, L, k, t0):
return L / (1 + np.exp(-k * (t - t0)))
params, _ = curve_fit(logistic, t, shares,
p0=[80, 0.5, 30],
bounds=([50, 0.1, 20], [100, 2.0, 50]))
L, k, t0 = params
# 3. Forecast
future_years = np.arange(years.max() + 1, 2051)
future_t = future_years - years.min()
forecast = logistic(future_t, L, k, t0)
# 4. Export to report's data/ folder
output_dir = os.path.join(script_dir, '../data')
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, 'forecast.csv')
with open(output_path, 'w') as f:
writer = csv.writer(f)
writer.writerow(['year', 'predicted_share'])
for year, pred in zip(future_years, forecast):
writer.writerow([year, pred])
print(f"✓ Forecast exported: {output_path}")
print(f" Model: L={L:.1f}%, k={k:.3f}, t0={t0:.1f}")
if __name__ == '__main__':
main()
Step 2: Run from report's scripts/ directory
cd reports/elbiler_danmark_20251031/scripts/
source ../../../.venv/bin/activate
python fit_logistic_model.py
Step 3: Use results in visualization and report
The forecast.csv is now in reports/elbiler_danmark_20251031/data/ and can be used by /dst-visualize and /dst-report.
✅ Benefits of this approach:
- Script stays with report (reproducibility)
- Relative paths work from any machine
- Clear separation: data fetching vs analysis vs reporting
- Easy to version control and share
References
Documentation
- scipy.optimize.curve_fit: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
- sklearn metrics: https://scikit-learn.org/stable/modules/model_evaluation.html
- pandas: https://pandas.pydata.org/docs/
Regression Theory
- Logistic growth: Bass diffusion model, technology adoption
- Gompertz curve: Asymmetric S-curve for market saturation
- Model selection: AIC, BIC, cross-validation
Best Practices
- Script placement: ALWAYS put analysis scripts in reports/{topic}/scripts/
- Validation: Use a train-test split for model validation
- Reporting: Always report R², RMSE, and residual plots
- Documentation: Document assumptions and limitations in script docstrings
- Reproducibility: Version-control analysis scripts WITH the report they generate
- Data paths: Use relative paths with os.path for cross-platform compatibility
- Virtual env: Always activate .venv before running scipy/numpy code
Quick Reference: Where Does It Go?
| What | Where | Example |
|---|---|---|
| Regression scripts | reports/{topic}/scripts/ | fit_models.py |
| Validation scripts | reports/{topic}/scripts/ | verify_regression_models.py |
| Forecasting scripts | reports/{topic}/scripts/ | forecast_scenarios.py |
| Statistical tests | reports/{topic}/scripts/ | hypothesis_tests.py |
| Intermediate results | reports/{topic}/data/ | model_parameters.csv |
| Raw data | data/ (project root) | dst.db, ev_annual_bil10.csv |
| Reusable utilities | scripts/ (project root) | db/helpers.py, fetch_and_store.py |
Simple rule: If it uses scipy/curve_fit/statistics → reports/{topic}/scripts/ ✅