| name | dst-query |
| description | Execute SQL queries on Danmarks Statistik data stored in DuckDB. Use when user needs specific data analysis, filtering, aggregation, or joins. Also includes table summary functionality. |
DST Query Skill
Purpose
Execute SQL queries to analyze DST data stored in DuckDB. This is the core skill for data analysis - enabling filtering, aggregation, joins, and extracting insights from stored statistical data.
DST Data Patterns
Handling Suppressed Values
DST uses ".." for suppressed/confidential data. Must handle before numeric operations.
Filter approach:
SELECT * FROM dst_table
WHERE INDHOLD != '..' -- Filter out suppressed
Safe casting approach:
SELECT
TID,
CASE
WHEN INDHOLD != '..' THEN CAST(INDHOLD AS INTEGER)
ELSE NULL
END as value
FROM dst_table
Using helpers:
from scripts.db.helpers import safe_numeric_cast
query = f"SELECT TID, {safe_numeric_cast('INDHOLD')} as value FROM dst_table"
Aggregate Codes
DST uses special codes for totals/aggregates:
"TOT"- Total"I alt"- Total (Danish)"Drivmidler i alt"- All fuel types"IALT"- Total (alternative)
Filter them out when analyzing details:
SELECT * FROM dst_table
WHERE fuel_type NOT IN ('Drivmidler i alt', 'I alt')
AND gender NOT IN ('TOT', 'IALT')
Time Format Handling
DST time codes as strings don't sort chronologically:
"2024K1"- Quarter 1, 2024"2024M01"- January 2024"2024"- Year 2024
Extract for proper sorting:
-- For quarters
SELECT
TID,
CAST(SUBSTRING(TID, 1, 4) AS INTEGER) as year,
CAST(SUBSTRING(TID, 6, 1) AS INTEGER) as quarter
FROM dst_table
WHERE TID LIKE '%K%'
ORDER BY year, quarter
When to Use
- User asks analytical questions about the data
- Need to filter or aggregate data
- Joining multiple DST tables
- Extracting specific insights or trends
- Computing custom statistics
- Exploring table structure and contents (use table summary)
Table Summary
Purpose
Get a quick overview of table structure and statistics before detailed querying.
Usage
python scripts/db/table_summary.py --table-id <TABLE_ID>
When to Use
- Before writing complex queries
- Understanding table structure
- Checking available columns
- Seeing sample data
- Getting quick statistics
Output Includes
- Record count
- Column names and types
- Sample rows (first 5)
- Statistics for numeric columns (min, max, avg, median)
- Distinct value counts
- NULL counts
- Top values for categorical columns
Running SQL Queries
Basic Usage
Execute a SQL query:
python scripts/db/query_data.py --sql "<QUERY>"
Output Formats
Table format (default - console-friendly):
--format table
JSON format (for programmatic use):
--format json
CSV format (for exports):
--format csv
Save to File
Save query results:
--output <file>
Safety Limit
Add automatic LIMIT:
--limit 100
Table Naming Convention
All DST tables in DuckDB follow this pattern:
- Format:
dst_{table_id}(lowercase) - Example: FOLK1A →
dst_folk1a - Example: AUP01 →
dst_aup01
Important: Always use lowercase in queries.
Data Format in DuckDB
Tables stored from DST API use these conventions:
- Separator: Data was fetched as semicolon-separated CSV (
;) - Encoding: UTF-8 with BOM (handled automatically)
- Column names: Based on variable IDs from tableinfo
- Value codes: Exact codes from DST (e.g., "000", "101", "2024K1")
- Data types: DuckDB infers types (usually strings for codes, numeric for values)
Common Query Patterns
1. Explore Data
SELECT * FROM dst_folk1a LIMIT 10
2. Count Records
SELECT COUNT(*) FROM dst_folk1a
3. Check Column Structure
-- See what columns exist
DESCRIBE dst_folk1a;
-- Or use table summary (recommended)
-- python scripts/db/table_summary.py --table-id FOLK1A
4. Aggregation
SELECT region, SUM(population) as total_pop
FROM dst_folk1a
GROUP BY region
ORDER BY total_pop DESC
5. Time Series Analysis
-- Note: Time codes from DST (e.g., "2024K1" for Q1 2024)
SELECT tid, value
FROM dst_folk1a
WHERE område = '000' -- Whole country
ORDER BY tid
6. Filtering with DST Codes
-- Use exact codes from tableinfo
SELECT *
FROM dst_folk1a
WHERE tid LIKE '2024%' -- All 2024 periods
AND område IN ('000', '101') -- Denmark and Copenhagen
AND køn IN ('1', '2') -- Men and women (not 'TOT')
7. Multiple Aggregations
SELECT
område,
COUNT(*) as record_count,
AVG(CAST(indhold AS DOUBLE)) as avg_value,
MAX(CAST(indhold AS DOUBLE)) as max_value
FROM dst_folk1a
GROUP BY område
8. Join Tables
SELECT
a.område,
a.indhold as population,
b.indhold as employment
FROM dst_folk1a a
JOIN dst_aup01 b ON a.område = b.område AND a.tid = b.tid
WHERE a.tid = '2024K1'
9. Percentages
SELECT
område,
CAST(indhold AS DOUBLE) as value,
100.0 * CAST(indhold AS DOUBLE) / SUM(CAST(indhold AS DOUBLE)) OVER () as percentage
FROM dst_folk1a
WHERE tid = '2024K1' AND køn = 'TOT'
10. Latest Period Analysis
-- Find most recent quarter
WITH latest AS (
SELECT MAX(tid) as max_tid FROM dst_folk1a
)
SELECT område, SUM(CAST(indhold AS DOUBLE)) as total
FROM dst_folk1a
WHERE tid = (SELECT max_tid FROM latest)
GROUP BY område
ORDER BY total DESC
Best Practices
Query Development
- Start with table summary to understand structure
- Use LIMIT for exploratory queries
- Build incrementally - test small queries first
- Check record counts before expensive operations
Performance
- Use WHERE clauses to filter early
- Add indexes if querying repeatedly (advanced)
- Aggregate before joining when possible
- Be mindful of large result sets
Safety
- Queries are READ-ONLY (SELECT only)
- Cannot modify data (no INSERT/UPDATE/DELETE)
- Cannot alter schema (no DROP/CREATE/ALTER)
- Script validates queries before execution
Data Quality
- Handle NULL values explicitly
- Use COALESCE for NULL handling
- Verify data types before operations
- Check for duplicates if unexpected
Understanding Data Freshness
Before analyzing, check when data was last updated:
SELECT table_id, last_updated, row_count
FROM dst_metadata
WHERE table_id = 'FOLK1A'
Recommendations:
- Check freshness before major analysis
- Re-fetch if data is stale (use dst-check-freshness skill)
- Note DST update frequency varies by table
- Some tables update quarterly, others monthly or annually
Examples
Example 1: Get table summary
python scripts/db/table_summary.py --table-id FOLK1A
Example 2: Simple exploration
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a LIMIT 5"
Example 3: Aggregation by year
python scripts/db/query_data.py --sql "SELECT year, SUM(population) as total FROM dst_folk1a GROUP BY year ORDER BY year"
Example 4: Regional analysis
python scripts/db/query_data.py --sql "SELECT region, AVG(value) as avg_val FROM dst_folk1a WHERE year >= 2020 GROUP BY region"
Example 5: Export to CSV
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a" --format csv --output results.csv
Example 6: JSON output
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a LIMIT 100" --format json --output data.json
Example 7: With safety limit
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a" --limit 1000
Advanced Queries
Window Functions
SELECT
year,
value,
LAG(value) OVER (ORDER BY year) as prev_year_value,
value - LAG(value) OVER (ORDER BY year) as change
FROM dst_folk1a
WHERE region = '000'
ORDER BY year
Pivoting Data
SELECT
region,
MAX(CASE WHEN year = 2023 THEN value END) as val_2023,
MAX(CASE WHEN year = 2024 THEN value END) as val_2024
FROM dst_folk1a
GROUP BY region
Percentiles
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) as p25,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY value) as median,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75
FROM dst_folk1a
Complex Filtering
SELECT *
FROM dst_folk1a
WHERE year BETWEEN 2020 AND 2024
AND region IN (SELECT DISTINCT region FROM dst_aup01 WHERE employment > 1000)
AND value IS NOT NULL
ORDER BY value DESC
LIMIT 100
Tips
Before Querying
- Run table summary first to see structure
- Check column names and types (often Danish: område, tid, køn, etc.)
- Review sample data to understand value formats
- Verify table has data and check last_updated in dst_metadata
- Note: Column names are lowercase variable IDs from DST
Working with DST Data Codes
- Time codes:
"2024K1"(quarterly),"2024M01"(monthly),"2024"(annual) - Geographic codes:
"000"(whole country),"101"(Copenhagen), etc. - Aggregate codes:
"TOT","IALT"often represent totals - Use LIKE: For pattern matching time periods:
tid LIKE '2024%' - Cast when needed: Value columns may be strings:
CAST(indhold AS DOUBLE)
Query Writing
- Use table aliases for clarity (a, b, etc.)
- Format SQL for readability
- Comment complex queries
- Test with LIMIT first
- Handle Danish characters properly (æ, ø, å)
Analysis Workflow
- Understand: Get table summary
- Check freshness: Query dst_metadata for last_updated
- Explore: Simple SELECT with LIMIT
- Filter: Add WHERE clauses with exact DST codes
- Aggregate: Use GROUP BY (cast numeric columns first)
- Refine: Add ORDER BY, calculations
- Export: Save final results
Performance Tips
- Filter first, aggregate second
- Use specific columns, not SELECT *
- Add LIMIT for large tables
- Use indexes on commonly filtered columns (advanced)
- Consider creating views for repeated queries (advanced)
- Cache results locally if running same query multiple times
Troubleshooting
"Table not found"
- Verify table exists:
python scripts/db/list_tables.py(dst-list-tables skill) - Check table name is lowercase
- Ensure format: dst_{table_id}
- Example: FOLK1A becomes
dst_folk1a
"Column not found"
- Run table summary to see actual column names
- Column names are lowercase DST variable IDs
- Common columns:
område(region),tid(time),køn(gender),indhold(value) - Check spelling including Danish characters (æ, ø, å)
- Verify column exists in that specific table
Data Type Issues
- Value columns often stored as strings (e.g.,
indhold) - Cast to numeric for calculations:
CAST(indhold AS DOUBLE) - Time codes are strings: use LIKE for patterns
- Don't assume numeric types without checking
Unexpected Results
- Empty results: Check if data was actually fetched for that table
- Wrong aggregations: Verify you're filtering out 'TOT' codes if needed
- Time ordering issues: Time codes as strings may not sort chronologically
- Solution: Extract year/quarter or use CASE statements
- Duplicate rows: Table may have multiple dimensions - check GROUP BY
Large Result Sets
- Add LIMIT clause for exploration
- Use aggregation to reduce rows
- Export to file instead of console:
--output results.csv - Check row count first:
SELECT COUNT(*) FROM table
Slow Queries
- Check WHERE filters are effective
- Filter by indexed columns (primary keys)
- Simplify joins
- Reduce columns selected
- Check data size with COUNT first
- Avoid SELECT * on large tables
Query Syntax Errors
- Verify SQL syntax (DuckDB follows PostgreSQL conventions)
- Check quotes: use single quotes for string literals
- Danish characters: ensure UTF-8 encoding
- Test simple version first
- Review error message carefully
Character Encoding Issues
- DuckDB handles UTF-8 automatically
- If seeing odd characters, verify terminal encoding
- CSV exports preserve Danish characters (æ, ø, å)
Common Workflows
Workflow 1: Explore New Table
# 1. Get summary
python scripts/db/table_summary.py --table-id FOLK1A
# 2. See sample data
python scripts/db/query_data.py --sql "SELECT * FROM dst_folk1a LIMIT 5"
# 3. Check record count
python scripts/db/query_data.py --sql "SELECT COUNT(*) FROM dst_folk1a"
# 4. Explore key dimensions
python scripts/db/query_data.py --sql "SELECT DISTINCT region FROM dst_folk1a"
Workflow 2: Trend Analysis
# 1. Get summary statistics
python scripts/db/table_summary.py --table-id FOLK1A
# 2. Query time series
python scripts/db/query_data.py --sql "SELECT year, SUM(value) as total FROM dst_folk1a GROUP BY year ORDER BY year"
# 3. Calculate growth
python scripts/db/query_data.py --sql "SELECT year, value, value - LAG(value) OVER (ORDER BY year) as growth FROM dst_folk1a WHERE region = '000'"
# 4. Export results
python scripts/db/query_data.py --sql "..." --format csv --output analysis.csv
Workflow 3: Compare Regions
# 1. Get regional breakdown
python scripts/db/query_data.py --sql "SELECT region, COUNT(*) as records, AVG(value) as avg_value FROM dst_folk1a GROUP BY region ORDER BY avg_value DESC"
# 2. Top regions
python scripts/db/query_data.py --sql "SELECT region, SUM(value) as total FROM dst_folk1a WHERE year = 2024 GROUP BY region ORDER BY total DESC LIMIT 10"
# 3. Compare specific regions
python scripts/db/query_data.py --sql "SELECT year, region, value FROM dst_folk1a WHERE region IN ('000', '101', '147') ORDER BY year, region"
SQL Reference
Useful DuckDB Functions
Aggregation:
- COUNT, SUM, AVG, MIN, MAX
- MEDIAN, STDDEV, VARIANCE
- STRING_AGG (concatenate strings)
String Functions:
- UPPER, LOWER, TRIM
- SUBSTRING, CONCAT
- LIKE, ILIKE (case-insensitive)
Date Functions:
- CURRENT_DATE, CURRENT_TIMESTAMP
- DATE_DIFF, DATE_ADD
- EXTRACT (year, month, day)
Window Functions:
- ROW_NUMBER, RANK, DENSE_RANK
- LAG, LEAD
- FIRST_VALUE, LAST_VALUE
Conditional:
- CASE WHEN ... THEN ... END
- COALESCE (handle NULLs)
- NULLIF
Best Practices Summary
- Always start with table summary
- Use LIMIT during development
- Build queries incrementally
- Handle NULLs explicitly
- Use clear aliases and formatting
- Test before running on full dataset
- Export results for further analysis
- Document complex queries