| name | dst-data |
| description | Fetch actual data from Danmarks Statistik API and store in DuckDB. Use when user wants to download and store specific DST table data for analysis. |
DST Data Skill
Purpose
Fetch statistical data from Danmarks Statistik API and persist it in the local DuckDB database for analysis. This is the final step in the data acquisition workflow, transforming remote API data into queryable local tables.
Critical API Quirks
BULK Format Requirements
DST's BULK format (unlimited streaming) has a critical requirement:
ALL variables must be specified in filters - no auto-elimination like other formats.
Error: "Der skal vælges værdier for variabel: [VAR_NAME]"
Solution:
- Get all variables from tableinfo
- Specify each in filters: {"VAR1":["*"], "VAR2":["*"], ...}
- Use the wildcard "*" to get all values for a variable (see the sketch below)
Suppressed Values ("..")
DST uses ".." to indicate suppressed/confidential data.
Impact:
- Causes errors when casting to numeric: CAST('..' AS INTEGER) fails
- Common in small population cells and rare events
- Must be filtered before numeric operations
Solutions:
- Filter in SQL: WHERE column != '..'
- Safe casting: CASE WHEN column != '..' THEN CAST(column AS INTEGER) END
- Use helpers: safe_numeric_cast(column) from scripts/db/helpers.py (see the sketch below)
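A small DuckDB sketch of both approaches, using an in-memory table and an illustrative column name (indhold); the columns in your stored tables may differ:

```python
import duckdb

con = duckdb.connect()  # in-memory database, for illustration only
con.execute("CREATE TABLE demo (region TEXT, indhold TEXT)")  # indhold is a made-up column name
con.execute("INSERT INTO demo VALUES ('000', '1234'), ('101', '..')")

# Option 1: drop suppressed cells before casting
clean = con.execute(
    "SELECT region, CAST(indhold AS INTEGER) AS value FROM demo WHERE indhold != '..'"
).fetchall()

# Option 2: keep the rows and map '..' to NULL with a guarded cast
guarded = con.execute("""
    SELECT region,
           CASE WHEN indhold != '..' THEN CAST(indhold AS INTEGER) END AS value
    FROM demo
""").fetchall()

print(clean)    # suppressed row removed
print(guarded)  # suppressed row kept with NULL value
```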
Cell Limits
- CSV/JSON formats: 1,000,000 cell maximum
- BULK format: Unlimited (streaming)
- Calculation: cells = var1_count × var2_count × ... × varN_count (see the sketch below)
Error: "Forespørgslen returnerer for mange observationer" (the query returns too many observations, REQUEST-LIMIT)
Solution: Use the BULK format (fetch_and_store.py does this automatically)
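The cell calculation is worth doing before any large fetch; a quick sketch with made-up value counts:

```python
from math import prod

# Value counts per variable, as reported by tableinfo (made-up numbers)
value_counts = {"OMRÅDE": 99, "KØN": 3, "ALDER": 126, "TID": 18}

cells = prod(value_counts.values())
print(f"{cells:,} cells")                       # 673,596 cells in this example
print("Needs BULK format:", cells > 1_000_000)  # False here
```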
Common Error Codes
EXTRACT-NOTALLOWED
Message: "Der skal vælges værdier for variabel: X" Cause: BULK format missing variable in filters Fix: Add variable to filters with at least one value
EXTRACT-NOTFOUND
Messages:
- "Tabellen blev ikke fundet" (table not found)
- "Variablen blev ikke fundet" (variable not found)
- "Værdien blev ikke fundet" (value not found)
Fix: Verify IDs against tableinfo output
REQUEST-LIMIT
Message: "Forespørgslen returnerer for mange observationer" Cause: Request exceeds 1M cells (non-BULK formats) Fix: Use BULK format or add more filters
When to Use
- User requests specific table data to be downloaded
- Ready to download data after reviewing table structure (tableinfo)
- Need to refresh existing data with latest from DST
- Want to store data for offline analysis
- Building a local data repository from DST
How to Use - Recommended (Combined Workflow)
The fetch_and_store.py script combines fetching and storing in one command. This is the recommended approach.
Basic Usage
Fetch and store a complete table:
python scripts/fetch_and_store.py --table-id <TABLE_ID>
With Filters
Fetch only specific data using filters:
python scripts/fetch_and_store.py --table-id <TABLE_ID> --filters '<JSON>'
Example filters:
# Filter by region
--filters '{"OMRÅDE":["000"]}'
# Filter by time period
--filters '{"TID":["2024*"]}'
# Multiple filters
--filters '{"OMRÅDE":["000","101"],"TID":["2024*"]}'
Overwrite Existing Data
Replace existing table:
python scripts/fetch_and_store.py --table-id <TABLE_ID> --overwrite
Skip if Fresh
Only fetch if data is older than threshold:
python scripts/fetch_and_store.py --table-id <TABLE_ID> --skip-if-fresh --max-age-days 30
How to Use - Advanced (Separate Steps)
For advanced users who need more control, fetch and store can be done separately.
Step 1: Fetch Data
Download data from API:
python scripts/api/fetch_data.py --table-id <TABLE_ID> --output data.json
Step 2: Store Data
Save to DuckDB:
python scripts/db/store_data.py --table-id <TABLE_ID> --data-file data.json
Expected Output
Data Storage
- Data stored in a table named dst_{table_id} (lowercase; see the sketch below)
- Example: Table FOLK1A → dst_folk1a
- Metadata updated in the dst_metadata table
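The naming rule can be expressed in one line; a hypothetical helper for reference:

```python
def storage_table_name(table_id: str) -> str:
    """DuckDB table name for a DST table: dst_ prefix plus the lowercased ID."""
    return f"dst_{table_id.lower()}"

print(storage_table_name("FOLK1A"))  # dst_folk1a
```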
Success Message
✓ Created table dst_folk1a with 45231 records
If Skipped (Fresh Data)
⊘ Skipped FOLK1A: data_is_fresh
Exit codes (see the sketch below):
- 0: Success
- 1: Error
- 2: Skipped (when using --skip-if-fresh)
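If you call the script from another program, these exit codes can be branched on directly; a hedged sketch using subprocess:

```python
import subprocess
import sys

# Run the combined fetch/store script and branch on its documented exit codes
result = subprocess.run(
    [sys.executable, "scripts/fetch_and_store.py",
     "--table-id", "FOLK1A", "--skip-if-fresh", "--max-age-days", "30"],
)

if result.returncode == 0:
    print("Stored fresh data")
elif result.returncode == 2:
    print("Skipped: local copy is still fresh")
else:
    print("Fetch failed; see the script output above")
```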
Important Considerations
Data Format
- Default format: BULK - Streaming CSV with no cell limit
- Semicolon-separated (;), not comma
- UTF-8 encoding with BOM (see the sketch below)
- Alternative formats: CSV, JSONSTAT, XLSX (1M cell limit)
- BULK requires ALL variables specified - no auto-elimination
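If you ever read a raw BULK file yourself (the bundled scripts normally handle this), the delimiter and BOM matter; a minimal sketch assuming a previously downloaded file named folk1a_bulk.csv:

```python
import csv

# utf-8-sig strips the BOM; BULK output is semicolon-separated
with open("folk1a_bulk.csv", encoding="utf-8-sig", newline="") as fh:
    reader = csv.DictReader(fh, delimiter=";")
    for i, row in enumerate(reader):
        print(row)
        if i == 4:  # show the first five rows only
            break
```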
Cell Limits & Size
- Non-streaming formats: 1,000,000 cell limit (CSV, JSONSTAT, XLSX)
- BULK format: Unlimited cells (streaming)
- Large tables may take significant time to download
- Check table info first to estimate size: cells = var1_count × var2_count × ... × varN_count
- Use filters to limit data volume when possible
- Error if exceeded: "Forespørgslen returnerer for mange observationer" (REQUEST-LIMIT)
Existing Data
- Default behavior: Fail if table already exists
- Use --overwrite to replace existing data
- Overwrites completely - no partial updates
Network and Time
- Requires stable internet connection
- Large downloads may timeout - retry if needed
- Recommended: 4-5 requests/second maximum (self-imposed limit; see the sketch below)
- Timeout recommendations: 30s (small), 60s (medium), 120s+ (large)
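A rough sketch of what honoring the self-imposed rate limit and a simple retry could look like; the polite_get helper is illustrative and not part of the bundled scripts:

```python
import time
import urllib.request

MIN_INTERVAL = 0.25  # roughly 4 requests/second
_last_call = 0.0

def polite_get(url: str, timeout: float = 60.0, retries: int = 3) -> bytes:
    """Throttle requests and retry transient failures with simple backoff."""
    global _last_call
    for attempt in range(retries):
        wait = MIN_INTERVAL - (time.monotonic() - _last_call)
        if wait > 0:
            time.sleep(wait)
        _last_call = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError("unreachable")
```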
Filters Format
- Filters must be valid JSON
- Keys are variable IDs from table info
- Values are arrays of allowed values
- Wildcard patterns: * (all), prefix*, *suffix, >=value, <=value
- Time patterns: (1) (latest), (-n+5) (last 5), 2024* (year match)
- BULK format validation: must specify all variables; check with tableinfo first (see the sketch below)
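Building the filter dict in Python and serializing it avoids most quoting mistakes; a small sketch:

```python
import json
import shlex

filters = {
    "OMRÅDE": ["000", "101"],  # variable IDs exactly as listed by tableinfo
    "Tid": ["(-n+5)"],         # last 5 periods
}
arg = json.dumps(filters, ensure_ascii=False)
print(f"--filters {shlex.quote(arg)}")
# --filters '{"OMRÅDE": ["000", "101"], "Tid": ["(-n+5)"]}'
```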
Verification Steps
After storing data, verify success:
1. Check Metadata
python scripts/db/query_metadata.py --table-id <TABLE_ID>
2. Spot Check Data
python scripts/db/query_data.py --sql "SELECT * FROM dst_<table_id> LIMIT 5"
3. Verify Record Count
Compare the record count in metadata with expectations from table info (see the sketch below).
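The same checks can be run directly against DuckDB; a sketch assuming the database file is dst.duckdb (adjust the path to wherever the skill stores it):

```python
import duckdb

con = duckdb.connect("dst.duckdb", read_only=True)  # hypothetical database path

count = con.execute("SELECT COUNT(*) FROM dst_folk1a").fetchone()[0]
sample = con.execute("SELECT * FROM dst_folk1a LIMIT 5").fetchall()

print(f"dst_folk1a holds {count:,} rows")
for row in sample:
    print(row)
```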
Next Steps
After successfully storing data:
- Data is ready for analysis
- Switch to DST Analyst Agent for querying and analysis
- Run SQL queries using dst-query skill
- Create summaries using table_summary script
Example:
python scripts/db/table_summary.py --table-id FOLK1A
Examples
Example 1: Simple fetch and store
python scripts/fetch_and_store.py --table-id FOLK1A
Example 2: Overwrite existing data
python scripts/fetch_and_store.py --table-id FOLK1A --overwrite
Example 3: Fetch with regional filter
python scripts/fetch_and_store.py --table-id FOLK1A --filters '{"OMRÅDE":["000","101"]}'
Example 4: Skip if recently fetched
python scripts/fetch_and_store.py --table-id FOLK1A --skip-if-fresh --max-age-days 7
Example 5: Advanced - separate steps
# Step 1: Fetch
python scripts/api/fetch_data.py --table-id FOLK1A --output folk1a.json
# Step 2: Store
python scripts/db/store_data.py --table-id FOLK1A --data-file folk1a.json
Example 6: Export to CSV
python scripts/api/fetch_data.py --table-id FOLK1A --format csv --output data.csv
Tips
Before Large Downloads
- Check table info to understand data size
- Use filters to limit to needed data
- Test with small filter first
For Regular Updates
- Use --skip-if-fresh to avoid unnecessary downloads
- Set appropriate max-age-days based on update frequency
- Use --overwrite when refresh is needed
Performance
- Filter at API level (not after download)
- Monitor disk space for large datasets
- Consider time of day (API load varies)
Data Quality
- Always verify after storage
- Check record counts match expectations
- Spot check data values for sanity
Common Filter Patterns
Time Filters
# Latest period (recommended for current data)
--filters '{"Tid":["(1)"]}'
# Last 5 periods
--filters '{"Tid":["(-n+5)"]}'
# Specific year
--filters '{"Tid":["2024"]}'
# Year range (wildcard)
--filters '{"Tid":["202*"]}'
# Specific quarters (K=Kvartal)
--filters '{"Tid":["2024K1","2024K2"]}'
# All Q1 quarters
--filters '{"Tid":["*K1"]}'
# Range operator
--filters '{"Tid":[">=2023K1<=2024K4"]}'
Geographic Filters
# Whole country
--filters '{"OMRÅDE":["000"]}'
# Specific regions
--filters '{"OMRÅDE":["101","147"]}'
# All regions (wildcard)
--filters '{"OMRÅDE":["*"]}'
# Region codes: 000=Denmark, 101=Copenhagen, 147=Frederiksberg
Combined Filters
# Multiple dimensions with patterns
--filters '{"OMRÅDE":["000"],"Tid":["(1)"],"KØN":["*"]}'
# Complex filtering
--filters '{"OMRÅDE":["101","147"],"Tid":[">=2020"],"KØN":["1","2"]}'
Pattern Reference
- * - All values
- prefix* - Starts with prefix
- *suffix - Ends with suffix
- (1) - Latest period (nth rule)
- (-n+N) - Last N periods
- >=value - From value onwards
- <=value - Up to value
- >=A<=B - Between A and B
Troubleshooting
"Table already exists"
- Use the --overwrite flag to replace existing data
- Or query the existing data first to check whether a refresh is needed
"Network timeout"
- Large table - try with filters to reduce size
- Increase timeout in script if needed (120s+ for large datasets)
- Retry the operation
- Check internet connection
"REQUEST-LIMIT: Too many cells"
Error message: "Forespørgslen returnerer for mange observationer"
- Solution: Script should use BULK format automatically for large requests
- Verify format=BULK is being used
- Add more filters to reduce cell count
- Cell limit: 1,000,000 for non-streaming formats
"EXTRACT-NOTALLOWED: Missing variable"
Error message: "Der skal vælges værdier for variabel: [VARIABLE_NAME]"
- BULK format requires ALL variables specified
- Check tableinfo to get all variable IDs
- Ensure every variable has a value selection
- Cannot rely on auto-elimination with BULK
"EXTRACT-NOTFOUND: Invalid code"
Error messages: "Tabellen blev ikke fundet" / "Variablen blev ikke fundet" / "Værdien blev ikke fundet"
- Verify table ID is correct (case-insensitive)
- Check variable IDs match tableinfo exactly
- Verify value codes exist in tableinfo
- Run tableinfo first to validate codes
"Invalid filters"
- Verify JSON syntax is correct (use single quotes around JSON, double quotes inside)
- Check variable IDs match table info exactly
- Ensure values exist in table info
- Test patterns: wildcards need an asterisk (*), ranges need >= / <=
"No data returned"
- Filters may be too restrictive
- Verify table has data for requested filters
- Check table info for available values
- Try with fewer/broader filters first
"Cannot sort time for streaming formats"
Error message: "Der kan ikke vælges sortering af tid for streamede formater"
- BULK format does not support time sorting
- Sort after the data is stored in DuckDB (see the sketch below)
- Or use CSV format instead (if under 1M cells)
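Sorting locally is straightforward once the data is stored; a sketch assuming a dst.duckdb file and a TID column in the stored table (both are assumptions, adjust to your setup):

```python
import duckdb

con = duckdb.connect("dst.duckdb", read_only=True)  # hypothetical database path
latest_first = con.execute(
    "SELECT * FROM dst_folk1a ORDER BY TID DESC LIMIT 10"  # assumes a TID column
).fetchall()
for row in latest_first:
    print(row)
```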
Disk Space Issues
- Monitor available space before large downloads
- Clean up old/unused tables
- Use filters to limit data volume
- Check estimated size: run tableinfo and calculate cells