| name | dbt-materializations |
| description | Expert guidance for choosing and implementing dbt materializations (ephemeral, view, table, incremental, snapshots, Python models). Use this skill when deciding on materialization strategy, implementing incremental models, setting up snapshots for SCD Type 2 tracking, or creating Python models for machine learning workloads. |
dbt Materializations
Purpose
Transform AI agents into experts on dbt materializations, providing guidance on choosing the right materialization strategy based on model purpose, size, update frequency, and query patterns, plus implementation details for each type including advanced features like snapshots and Python models.
When to Use This Skill
Activate this skill when users ask about:
- Choosing the right materialization for a model
- Implementing incremental models with merge/append/delete+insert strategies
- Setting up snapshots for SCD Type 2 historical tracking
- Converting table materializations to incremental
- Creating Python models for ML or advanced analytics
- Understanding trade-offs between ephemeral, view, and table
- Optimizing materialization performance
- Implementing slowly changing dimensions
Official dbt Documentation: Materializations
Decision Matrix
| Materialization | Use Case | Build Time | Storage | Query Speed | Best For |
|---|---|---|---|---|---|
| ephemeral | Staging, reusable logic | Fast (CTE) | None | N/A | Bronze layer |
| view | Simple transforms | Fast | Minimal | Slow | Always-fresh data |
| table | Complex logic | Slow | High | Fast | Dimensions |
| incremental | Large datasets | Fast | Medium | Fast | Large facts |
Ephemeral Materialization
When to Use: Staging models, reusable intermediate logic that doesn't need to be queried directly
{{ config(materialized='ephemeral') }}
select
customer_id,
customer_name,
upper(trim(email)) as email_clean
from {{ source('crm', 'customers') }}
How it Works:
- Compiled as CTE in downstream models
- No physical table created
- Zero storage cost
- Cannot be queried directly
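For illustration, suppose the model above is saved as stg_customers (a hypothetical name) and a downstream model runs select * from {{ ref('stg_customers') }}. dbt injects the ephemeral model into the downstream model's compiled SQL as a CTE, roughly like the sketch below (the exact CTE prefix and resolved source names vary by dbt version and source configuration):

```sql
-- Compiled SQL of the downstream model; the ephemeral model becomes a CTE,
-- and no stg_customers object is ever created in the warehouse
with __dbt__cte__stg_customers as (
    select
        customer_id,
        customer_name,
        upper(trim(email)) as email_clean
    from crm.customers
)
select *
from __dbt__cte__stg_customers
```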
Best For:
- Bronze/staging layer models
- Reusable intermediate transformations
- Models referenced by only 1-2 downstream models
Performance Note: If an ephemeral model is referenced by many downstream models or contains complex logic, consider changing it to a table materialization so the logic isn't recomputed in every downstream model.
View Materialization
When to Use: Simple transformations where you always need fresh data and query performance isn't critical
{{ config(materialized='view') }}
select
customer_id,
count(*) as order_count,
sum(order_amount) as total_spent
from {{ ref('stg_orders') }}
group by customer_id
How it Works:
- Creates database view
- Query runs every time view is accessed
- Minimal storage (just view definition)
- Always shows latest data
Best For:
- Simple aggregations
- Always-fresh reporting needs
- Development/prototyping
When to Avoid:
- Complex transformations
- Frequently queried models
- Performance-critical queries
Table Materialization
When to Use: Dimensions, complex transformations, frequently queried models
{{ config(
materialized='table',
cluster_by=['first_order_date'] -- Snowflake optimization (cluster key must exist in the model output)
) }}
select
customer_id,
customer_name,
count(distinct order_id) as lifetime_orders,
sum(order_amount) as lifetime_value,
min(order_date) as first_order_date
from {{ ref('stg_customers') }} c
join {{ ref('stg_orders') }} o using (customer_id)
group by customer_id, customer_name
How it Works:
- Drops and recreates table on every run (DROP + CREATE TABLE AS)
- Full refresh every time
- Fast query performance
- Higher storage cost
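Behind the scenes, each run rebuilds the table wholesale. On Snowflake the generated DDL is roughly the statement below (schema and model names are illustrative; other adapters may drop and recreate in separate steps):

```sql
-- Approximate statement dbt issues for a table materialization on every run
create or replace table analytics.dim_customers as (
    select
        customer_id,
        customer_name,
        count(distinct order_id) as lifetime_orders,
        sum(order_amount) as lifetime_value,
        min(order_date) as first_order_date
    from analytics.stg_customers c
    join analytics.stg_orders o using (customer_id)
    group by customer_id, customer_name
);
```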
Best For:
- Dimension tables
- Gold layer marts
- Complex silver layer transformations
- Models with frequent queries
Performance Optimization:
{{ config(
materialized='table',
cluster_by=['date_column', 'category'], -- Snowflake clustering
snowflake_warehouse='LARGE_WH' -- Custom warehouse for complex logic
) }}
Incremental Materialization
When to Use: Large fact tables (millions+ rows), time-series data, event logs, append-only data
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
merge_exclude_columns=['dbt_inserted_at'],
cluster_by=['order_date', 'customer_id']
) }}
select
order_id,
customer_id,
order_date,
order_amount,
order_status,
current_timestamp() as dbt_inserted_at, -- original value preserved on updates via merge_exclude_columns
current_timestamp() as dbt_updated_at
from {{ ref('stg_orders') }}
{% if is_incremental() %}
-- Only process new/updated records
where order_date > (select max(order_date) from {{ this }})
{% endif %}
How it Works:
- First run: Loads all data (acts like table)
- Subsequent runs: Insert/update only new records
- Uses `unique_key` to identify records
- Dramatically faster builds for large tables
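As a rough sketch of how the is_incremental() branch changes the compiled SQL (the relation names analytics.stg_orders and analytics.fct_orders are illustrative):

```sql
-- First run (or --full-refresh): is_incremental() is false, the filter compiles
-- away and the full history is loaded
select order_id, customer_id, order_date, order_amount, order_status
from analytics.stg_orders;

-- Subsequent runs: is_incremental() is true, so only rows newer than the current
-- max(order_date) in the target are scanned, then merged in on the unique_key
select order_id, customer_id, order_date, order_amount, order_status
from analytics.stg_orders
where order_date > (select max(order_date) from analytics.fct_orders);
```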
Performance Benefits:
- Reduces build time from hours to minutes
- Lower compute costs
- Enables more frequent refreshes
Incremental Strategies
1. Merge Strategy (Most Common)
Use For: Updateable records, handling late-arriving data, SCD Type 1
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
merge_exclude_columns=['dbt_inserted_at'] -- Preserve original timestamp
) }}
select
order_id,
customer_id,
order_status, -- Can change over time
order_amount,
updated_at, -- source timestamp used by the incremental filter below
current_timestamp() as dbt_inserted_at, -- original value preserved on updates via merge_exclude_columns
current_timestamp() as dbt_updated_at
from {{ ref('stg_orders') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
How It Works:
- Matches records on `unique_key`
- Updates existing records
- Inserts new records
- Preserves columns listed in `merge_exclude_columns`
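On warehouses that support MERGE (Snowflake, BigQuery, Databricks, etc.), dbt generates a statement roughly like the sketch below. dbt_inserted_at is missing from the UPDATE SET list because it is listed in merge_exclude_columns, which is what preserves the original insert timestamp (object and staging-relation names are illustrative):

```sql
-- Approximate MERGE generated by the merge incremental strategy
merge into analytics.fct_orders as target
using (
    -- compiled model SQL, including the is_incremental() filter
    select order_id, customer_id, order_status, order_amount, updated_at,
           dbt_inserted_at, dbt_updated_at
    from analytics.fct_orders__dbt_tmp
) as source
    on target.order_id = source.order_id  -- unique_key
when matched then update set
    customer_id    = source.customer_id,
    order_status   = source.order_status,
    order_amount   = source.order_amount,
    updated_at     = source.updated_at,
    dbt_updated_at = source.dbt_updated_at
    -- dbt_inserted_at excluded: existing value is kept
when not matched then insert
    (order_id, customer_id, order_status, order_amount, updated_at,
     dbt_inserted_at, dbt_updated_at)
values
    (source.order_id, source.customer_id, source.order_status, source.order_amount,
     source.updated_at, source.dbt_inserted_at, source.dbt_updated_at);
```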
2. Append Strategy (Fastest)
Use For: Immutable event data, logs, clickstreams
{{ config(
materialized='incremental',
unique_key='event_id',
incremental_strategy='append'
) }}
select
event_id,
user_id,
event_type,
event_timestamp,
event_properties
from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
How It Works:
- Only inserts new records
- No updates or deletes
- Fastest incremental strategy
- Assumes data is immutable
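The append strategy compiles to a plain INSERT ... SELECT against the existing table (dbt typically stages the batch in a temporary relation first; the sketch below is simplified and uses illustrative names). Because nothing is matched or deduplicated, an overlapping filter will insert duplicates, so reserve it for truly immutable data:

```sql
-- Approximate statement generated by the append strategy on incremental runs
insert into analytics.fct_events
    (event_id, user_id, event_type, event_timestamp, event_properties)
select event_id, user_id, event_type, event_timestamp, event_properties
from analytics.stg_events
where event_timestamp > (select max(event_timestamp) from analytics.fct_events);
```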
3. Delete+Insert Strategy
Use For: Partitioned data, date-based reprocessing
{{ config(
materialized='incremental',
unique_key='order_date',
incremental_strategy='delete+insert'
) }}
select
order_date,
customer_id,
count(*) as daily_orders,
sum(order_amount) as daily_revenue
from {{ ref('stg_orders') }}
{% if is_incremental() %}
where order_date >= dateadd(day, -7, current_date()) -- Reprocess last 7 days
{% endif %}
group by order_date, customer_id
How It Works:
- Deletes records matching `unique_key` values
- Inserts all new records
- Good for reprocessing entire partitions
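Conceptually, delete+insert runs two statements: delete every row whose unique_key value appears in the new batch, then insert the batch. A sketch, assuming dbt has staged the compiled model results in a temporary relation (names are illustrative):

```sql
-- Approximate statements generated by the delete+insert strategy
delete from analytics.fct_daily_orders
where order_date in (select distinct order_date from fct_daily_orders__dbt_tmp);

insert into analytics.fct_daily_orders (order_date, customer_id, daily_orders, daily_revenue)
select order_date, customer_id, daily_orders, daily_revenue
from fct_daily_orders__dbt_tmp;
```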
Official dbt Docs: Incremental Models
Incremental Best Practices
1. Always Include is_incremental() Check
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
2. Add Lookback for Late Data
{% if is_incremental() %}
where order_date >= dateadd(day, -3, (select max(order_date) from {{ this }}))
{% endif %}
3. Limit Source Scans
{% if is_incremental() %}
-- Only scan recent source data
where source_updated_at >= dateadd(day, -30, current_date())
and source_updated_at > (select max(updated_at) from {{ this }})
{% endif %}
4. Use Clustering for Performance
{{ config(
cluster_by=['event_date', 'user_id'] -- Commonly filtered/joined columns
) }}
5. Handle Full Refresh
# Force rebuild from scratch
dbt build --full-refresh --select model_name
Snapshots (SCD Type 2)
Purpose: Track historical changes to slowly changing dimensions
When to Use: Customer attributes, product catalogs, employee records, pricing history
-- snapshots/dim_customers_scd.sql
{% snapshot dim_customers_scd %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
invalidate_hard_deletes=True
)
}}
select * from {{ ref('stg_customers') }}
{% endsnapshot %}
Generated Columns:
- `dbt_valid_from` - When the record became active
- `dbt_valid_to` - When the record was superseded (NULL for the current version)
- `dbt_scd_id` - Unique identifier for each version
- `dbt_updated_at` - Last snapshot processing time
Snapshot Strategies
Timestamp Strategy (Recommended)
Use When: Source has reliable updated_at timestamp
{% snapshot customers_snapshot_timestamp %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at'
)
}}
select * from {{ source('crm', 'customers') }}
{% endsnapshot %}
Advantages:
- Faster performance
- Only checks timestamp
- More efficient
Check Strategy
Use When: No reliable timestamp, need to check specific columns
{% snapshot customers_snapshot_check %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='check',
check_cols=['customer_name', 'customer_email', 'customer_tier']
)
}}
select * from {{ source('crm', 'customers') }}
{% endsnapshot %}
Advantages:
- Works without timestamp
- Tracks specific column changes
- More control over what triggers new version
Querying Snapshots
Get Current Records Only:
select *
from {{ ref('dim_customers_scd') }}
where dbt_valid_to is null
Point-in-Time Query:
select *
from {{ ref('dim_customers_scd') }}
where dbt_valid_from <= '2024-01-15'
  and coalesce(dbt_valid_to, '9999-12-31') > '2024-01-15'
Change History:
select
customer_id,
customer_name,
dbt_valid_from,
dbt_valid_to
from {{ ref('dim_customers_scd') }}
order by customer_id, dbt_valid_from
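A common consumption pattern is joining a fact table to the snapshot on the validity window, so each fact row picks up the dimension attributes as they were at the time of the event. A minimal sketch, assuming a fct_orders model with customer_id and order_date columns:

```sql
-- Point-in-time join: each order sees the customer attributes valid on its order_date
select
    o.order_id,
    o.order_date,
    c.customer_name,
    c.customer_tier
from {{ ref('fct_orders') }} o
join {{ ref('dim_customers_scd') }} c
    on o.customer_id = c.customer_id
    and o.order_date >= c.dbt_valid_from
    and o.order_date < coalesce(c.dbt_valid_to, '9999-12-31')
```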
Running Snapshots:
dbt snapshot # Runs all snapshots
dbt snapshot --select dim_customers_scd # Specific snapshot
Official dbt Docs: Snapshots
Python Models
Purpose: Machine learning, statistical analysis, complex transformations beyond SQL
When to Use: ML models, clustering, advanced analytics, Python library integration (pandas, scikit-learn, etc.)
# models/silver/customer_clustering.py
def model(dbt, session):
"""Cluster customers using K-Means"""
dbt.config(
materialized="table",
packages=["scikit-learn", "pandas"]
)
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Get data from dbt model
df = dbt.ref("int_customers__metrics").to_pandas()
# Select features
features = ['total_orders', 'lifetime_value', 'avg_order_value']
X = df[features].fillna(0)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Perform clustering
kmeans = KMeans(n_clusters=5, random_state=42)
df['cluster_id'] = kmeans.fit_predict(X_scaled)
df['cluster_label'] = df['cluster_id'].map({
0: 'Low Value',
1: 'Medium Value',
2: 'High Value',
3: 'VIP',
4: 'At Risk'
})
# Return final dataframe
return df[['customer_id', 'cluster_id', 'cluster_label'] + features]
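Downstream of a Python model, the results behave like any other model in the DAG: a SQL model can simply ref it. A minimal sketch of a hypothetical downstream model:

```sql
-- models/gold/vip_customers.sql (hypothetical consumer of the Python model)
select
    customer_id,
    cluster_label,
    lifetime_value
from {{ ref('customer_clustering') }}
where cluster_label = 'VIP'
```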
Python Model Best Practices
1. Use SQL for Data Preparation
# Let SQL handle filtering, joins, aggregations
df = dbt.ref("int_customers__prepared").to_pandas()
2. Python for ML/Complex Analytics Only
# Don't use Python for simple transformations
# Use SQL instead
3. Specify Required Packages
dbt.config(
packages=["scikit-learn==1.3.0", "pandas", "numpy"]
)
4. Test Python Models
# Can use standard dbt tests
models:
- name: customer_clustering
columns:
- name: customer_id
tests:
- dbt_constraints.primary_key
- name: cluster_id
tests:
- not_null
- accepted_values:
values: [0, 1, 2, 3, 4]
Official dbt Docs: Python Models
When to Change Materializations
Change Ephemeral/View to Table When:
Model is Referenced Multiple Times
- If 3+ downstream models reference it
- Avoids recomputing same logic
Complex Transformations
- Heavy aggregations or window functions
- Self-joins or complex CTEs
Memory Issues
- Queries failing due to memory constraints
- CTE becoming too large
Change Table to Incremental When (a conversion sketch follows this list):
Large Data Volumes
- Table has millions+ rows
- Full refresh takes > 5 minutes
Time-Series Data
- Append-only event logs
- Daily/hourly data loads
Performance Requirements
- Need faster, more frequent refreshes
- Cost optimization needed
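Converting an existing table model is usually just a config change plus an is_incremental() filter; a minimal sketch, assuming a typical order fact model (column and model names are illustrative):

```sql
-- Previously configured as materialized='table'; switch the config and add a filter
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

select
    order_id,
    customer_id,
    order_date,
    order_amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
where order_date > (select max(order_date) from {{ this }})
{% endif %}
```

After changing the materialization, it is usually worth running dbt build --full-refresh --select model_name once so the table is rebuilt cleanly before incremental runs begin.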
Folder-Level Configuration
Configure materializations at folder level in dbt_project.yml:
models:
your_project:
bronze:
+materialized: ephemeral
+tags: ["bronze", "staging"]
silver:
+materialized: ephemeral
+tags: ["silver"]
gold:
+materialized: table
+tags: ["gold", "marts"]
Override at model level only for special cases (incremental, Python, etc.).
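For example, a single gold model can override the folder default right in its .sql file while the rest of the folder keeps the table default (model name is illustrative):

```sql
-- models/gold/fct_events.sql: overrides the folder-level +materialized: table
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select * from {{ ref('stg_events') }}

{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```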
Helping Users Choose Materializations
Strategy for Assisting Users
When users ask about materializations:
- Understand purpose: What is the model for?
- Assess size: How many rows? How often updated?
- Check usage: How many downstream models? Query frequency?
- Recommend materialization: Based on layer and requirements
- Provide configuration: Complete working example
- Explain trade-offs: Build time, storage, query performance
Common User Questions
"Should this be ephemeral or table?"
- Ephemeral: Staging, lightweight intermediate models
- Table: Dimensions, marts, complex intermediate models
- Consider: Reusability, complexity, downstream usage
"When should I use incremental?"
- Large tables (millions+ rows)
- Time-series or append-only data
- Performance/cost optimization needed
- Full rebuild takes too long
"How do I set up SCD Type 2?"
- Use snapshots with timestamp or check strategy
- Configure unique_key and updated_at
- Query with dbt_valid_to IS NULL for current records
Related Official Documentation
Goal: Transform AI agents into experts on dbt materializations who guide users to optimal materialization choices based on data characteristics, usage patterns, and performance requirements.