| name | atft-pipeline |
| description | Manage J-Quants ingestion, feature graph generation, and cache hygiene for the ATFT-GAT-FAN dataset pipeline. |
| proactive | true |
# ATFT Pipeline Skill

## Mission
- Provision fresh or historical parquet datasets for ATFT-GAT-FAN with GPU-accelerated ETL.
- Maintain deterministic feature graphs (approx. 395 engineered factors, 307 active).
- Guard J-Quants API quota, credential sanity, and cache health to prevent training stalls.
## When To Engage
- Any request mentioning dataset builds, ETL, J-Quants, cache, RAPIDS/cuDF, or feature graph refresh.
- Pre-training sanity checks (“ensure latest dataset”, “verify cache integrity”).
- Recovery tasks (“resume interrupted dataset job”, “clean corrupted cache shards”).
## Preflight Checklist
- Confirm `nvidia-smi` reports at least one free A100 80GB GPU (scripted below); fall back to CPU only if no GPU is available.
- Validate credentials: `.env` contains `JQUANTS_AUTH_EMAIL`/`PASSWORD` and `JQUANTS_PLAN_TIER`.
- Ensure `python -m pip install -e .` has already been run (dependencies + entry points).
- Check the latest health snapshot: `tools/project-health-check.sh --section dataset`.
- Inspect the existing dataset for reuse: `ls -lh output/ml_dataset_latest_full.parquet`.
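A minimal scripted version of the first two checks, assuming `.env` is already loaded into the environment and that the password variable is spelled `JQUANTS_AUTH_PASSWORD` (this doc abbreviates the pair); the 70 GB free-memory threshold is an illustrative stand-in for "one free A100 80GB":

```python
# preflight.py: sketch of the GPU and credential checks. Assumes .env is
# already loaded; JQUANTS_AUTH_PASSWORD and the 70 GB threshold are assumptions.
import os
import shutil
import subprocess

REQUIRED_ENV = ("JQUANTS_AUTH_EMAIL", "JQUANTS_AUTH_PASSWORD", "JQUANTS_PLAN_TIER")

def gpu_free(min_free_mib: int = 70_000) -> bool:
    """True if any visible GPU reports at least min_free_mib MiB free."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False
    return any(int(mib) >= min_free_mib for mib in result.stdout.split())

if __name__ == "__main__":
    if not gpu_free():
        print("WARN: no sufficiently free GPU; only the CPU fallback path is safe.")
    missing = [name for name in REQUIRED_ENV if not os.environ.get(name)]
    if missing:
        raise SystemExit(f"missing credentials: {', '.join(missing)}")
    print("preflight OK")
```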
## Core Playbooks

### 1. Background Five-Year Refresh (default)
- `make dataset-check-strict` — GPU + secrets verification.
- `make dataset-bg START=<optional> END=<optional>` — SSH-safe background run with logging in `_logs/dataset`.
- `tail -f _logs/dataset/*.log` — monitor progress (auto-prints PID + PGID).
- `make cache-stats` — ensure cache hit-rate & size are within expected bounds (<2.5 TB).
- `python scripts/pipelines/run_full_dataset.py --dry-run` — confirm metadata integrity without a rebuild (a sketch of this check follows).
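For the dry-run step, a check along these lines can confirm coverage from the metadata file listed under Observability Hooks; the key names (`start_date`, `end_date`, `columns`) are assumptions about the schema, not its contract:

```python
# check_metadata.py: illustrative only; the key names below are assumed,
# adjust to whatever run_full_dataset.py actually emits.
import json
from pathlib import Path

meta = json.loads(Path("output/ml_dataset_latest_full_metadata.json").read_text())

start, end = meta.get("start_date"), meta.get("end_date")
n_columns = len(meta.get("columns", []))
print(f"coverage: {start} -> {end}, {n_columns} columns")

# 307 active factors per the Mission section; ids/targets push columns higher.
assert n_columns >= 307, "fewer columns than the documented 307 active factors"
```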
### 2. Hotfix / Forced Refresh
- `make dataset-gpu-refresh START=YYYY-MM-DD END=YYYY-MM-DD` — bypasses cached parquet; API-throttle aware.
- `make datasets-prune` — keep only the latest dataset generation.
- `make cache-prune CACHE_TTL_DAYS=90` — evict stale graph shards to recover disk (eviction sketched below).
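`make cache-prune` boils down to TTL-based eviction; a sketch, assuming shards live as plain files under `cache/`:

```python
# cache_prune.py: TTL eviction in the spirit of `make cache-prune`.
# The cache layout (plain files under cache/) is an assumption.
import time
from pathlib import Path

TTL_DAYS = 90
cutoff = time.time() - TTL_DAYS * 86_400  # TTL in seconds

for shard in Path("cache").rglob("*"):
    if shard.is_file() and shard.stat().st_mtime < cutoff:
        print(f"evicting {shard}")
        shard.unlink()
```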
### 3. Resource-Constrained Fallback
- `make dataset-check` — relaxed diagnostics (CPU acceptable).
- `make dataset-cpu START=YYYY-MM-DD END=YYYY-MM-DD` — chunked pandas path (chunking sketched below).
- `make dataset-safe-resume` — resume from the last safe checkpoint if memory pressure triggered a fallback.
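The chunked pandas path amounts to building date windows independently and concatenating; in the sketch below, `build_window` is a hypothetical stand-in for the real fetch-and-feature step, and the quarterly window size is arbitrary:

```python
# dataset_cpu_chunks.py: the chunked-pandas idea behind `make dataset-cpu`.
# build_window is a placeholder, not the pipeline's real ETL step.
import pandas as pd

def build_window(start: pd.Timestamp, end: pd.Timestamp) -> pd.DataFrame:
    # Placeholder: real code would call the J-Quants fetch and feature builders.
    return pd.DataFrame({"date": pd.date_range(start, end, freq="B")})

start, end = pd.Timestamp("2020-01-01"), pd.Timestamp("2025-01-01")
frames = []
for window_start in pd.date_range(start, end, freq="3MS"):  # quarterly chunks
    window_end = min(window_start + pd.DateOffset(months=3), end)
    frames.append(build_window(window_start, window_end))  # toy version: boundaries overlap by a day

pd.concat(frames).to_parquet("output/ml_dataset_latest_full.parquet")
```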
### 4. Graph Feature Investigation
- `python scripts/pipelines/run_full_dataset.py --inspect-graph --start YYYY-MM-DD --end YYYY-MM-DD`.
- `python -c "import polars as pl; df = pl.read_parquet('output/ml_dataset_latest_full.parquet'); print(df.select(pl.all().is_null().sum()))"` — null audit (expanded below).
- `make cache-monitor` — per-window edge density + overlap stats.
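The null-audit one-liner, expanded into a small report that flags only problem columns (the 1% cutoff is an arbitrary example, not a pipeline default):

```python
# null_audit.py: the one-liner above, expanded to report only problem columns.
import polars as pl

df = pl.read_parquet("output/ml_dataset_latest_full.parquet")
total = df.height

null_counts = df.null_count().row(0, named=True)  # one null count per column
for column, count in sorted(null_counts.items(), key=lambda kv: -kv[1]):
    share = count / total
    if share > 0.01:  # arbitrary 1% threshold
        print(f"{column}: {share:.2%} null ({count}/{total})")
```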
## Observability Hooks
- `_logs/dataset/` for job logs; `cache/*.json` metadata for cache state.
- `ml_dataset_latest_full_metadata.json` for column coverage & horizon alignment.
- `benchmark_output/dataset_timestamps.json` to confirm pipeline duration vs. baseline (target: <42 min on the GPU path; comparison sketched below).
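A quick comparison against the 42-minute target, assuming `dataset_timestamps.json` carries epoch-second `start`/`end` keys (unverified; adjust to the file's actual schema):

```python
# check_duration.py: compares a run against the 42-minute GPU target.
# The start/end epoch-second keys are an assumed schema.
import json

with open("benchmark_output/dataset_timestamps.json") as fh:
    stamps = json.load(fh)

duration_min = (stamps["end"] - stamps["start"]) / 60
status = "OK" if duration_min < 42 else "REGRESSION"
print(f"pipeline took {duration_min:.1f} min vs 42 min target -> {status}")
```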
## Failure Triage
- Credential errors → run `python scripts/pipelines/run_full_dataset.py --auth-test`.
- CUDA OOM → rerun with `make dataset-safe` (40 GB RMM pool pre-configured).
- API rate limits → throttle via `make dataset-gpu REFRESH_THROTTLE=1`.
- Corrupted parquet → `make dataset-rebuild`, then `python tools/parquet_validator.py output/ml_dataset_latest_full.parquet` (minimal probe sketched below).
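Not the actual `tools/parquet_validator.py`, but a minimal sketch of the kind of probe it can run: reading every row group forces decompression, which surfaces corrupt shards:

```python
# validate_parquet.py: minimal corruption probe, not tools/parquet_validator.py.
# Decoding each row group raises on corrupt pages.
import pyarrow.parquet as pq

path = "output/ml_dataset_latest_full.parquet"
pf = pq.ParquetFile(path)
for i in range(pf.num_row_groups):
    try:
        pf.read_row_group(i)
    except Exception as exc:
        print(f"row group {i} failed: {exc}")
        break
else:
    print(f"{path}: {pf.num_row_groups} row groups read cleanly")
```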
## Codex Collaboration
- Escalate complex ETL debugging or architectural refactors via `./tools/codex.sh "Diagnose dataset pipeline bottleneck"` (leverages OpenAI Codex deep reasoning).
- For long-running autonomous maintenance, schedule `./tools/codex.sh --max --exec "Perform full dataset pipeline audit"` off-hours (uses `.mcp.json` from the Codex repo for filesystem/git context).
- When Codex proposes changes, sync the learnings back here and refresh the dataset runbooks if any commands or defaults shift.
## Handoff Notes
- Always update `dataset_features_detail.json` if the schema changes.
- Announce each new dataset snapshot in `EXPERIMENT_STATUS.md` with generation timestamp and settings (append helper below).
- Surface anomalies (missing tickers, new features) via `docs/data_quality/reports`.
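A trivial append helper for the `EXPERIMENT_STATUS.md` announcement; the entry format and the example settings are illustrative only:

```python
# announce_snapshot.py: appends an entry to EXPERIMENT_STATUS.md.
# Entry format and the example START/END settings are illustrative.
from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
entry = f"- {stamp} - new dataset snapshot (START=2020-01-01 END=2025-01-01, GPU path)\n"

with open("EXPERIMENT_STATUS.md", "a") as fh:
    fh.write(entry)
```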