| name | lineage-analyzer |
| description | Analyzes dbt model dependencies, traces upstream sources and downstream consumers, identifies circular dependencies, and visualizes data lineage. Use when user asks about model dependencies, wants to understand data flow, needs impact analysis, or wants to trace where data comes from or goes to. |
| allowed-tools | Read, Grep, Glob, Bash |
dbt Lineage Analyzer
Expert skill for analyzing and tracing dbt model dependencies and data lineage.
When to Use This Skill
Activate when the user mentions:
- "Show me the lineage for this model"
- "What models depend on X"
- "Where does this data come from"
- "Trace dependencies"
- "Impact analysis"
- "Upstream/downstream models"
- "DAG structure"
- "Circular dependencies"
Core Capabilities
1. Dependency Discovery
Extract dependencies from dbt models by finding:
{{ ref('model_name') }}- model references{{ source('source_name', 'table_name') }}- source references- Dependencies in Jinja loops and conditionals
2. Lineage Tracing Methods
Method 1: File Scanning (Primary)
# Find all model files
find models/ -name "*.sql" -type f
# Extract refs from a model
grep -o "ref('[^']*')" model_file.sql | sed "s/ref('\(.*\)')/\1/"
# Extract sources
grep -o "source('[^']*',\s*'[^']*')" model_file.sql
Method 2: Parse manifest.json (If Available)
# Check for dbt manifest
if [ -f "target/manifest.json" ]; then
# Parse using jq or Python
cat target/manifest.json | jq '.nodes[] | select(.resource_type == "model")'
fi
3. Build Dependency Graph
For each model, extract:
- Model name and path
- Direct parent refs
- Direct parent sources
- Config (materialization, schema)
Create bidirectional mapping:
- parents[model] = list of dependencies
- children[model] = list of consumers
4. Recursive Tracing
Upstream Traversal (Sources)
def trace_upstream(model_name, graph, visited=None):
if visited is None:
visited = set()
if model_name in visited:
return [] # Circular dependency
visited.add(model_name)
upstream = []
for parent in graph['parents'].get(model_name, []):
upstream.append(parent)
upstream.extend(trace_upstream(parent, graph, visited))
return upstream
Downstream Traversal (Consumers)
def trace_downstream(model_name, graph, visited=None):
if visited is None:
visited = set()
if model_name in visited:
return []
visited.add(model_name)
downstream = []
for child in graph['children'].get(model_name, []):
downstream.append(child)
downstream.extend(trace_downstream(child, graph, visited))
return downstream
5. Visualization Formats
Tree Format (Compact)
fct_orders
├── int_order_details
│ ├── stg_orders
│ │ └── source.raw.orders
│ └── stg_order_items
│ └── source.raw.order_items
└── dim_customers
└── stg_customers
└── source.raw.customers
Detailed Format
Model: fct_orders
Type: model
Path: models/marts/core/fct_orders.sql
Materialization: incremental
Schema: analytics
DIRECT DEPENDENCIES:
→ int_order_details (ephemeral)
→ dim_customers (table)
UPSTREAM LINEAGE (3 levels deep):
Level 1:
- int_order_details
- dim_customers
Level 2:
- stg_orders
- stg_order_items
- stg_customers
Level 3:
- source.raw.orders
- source.raw.order_items
- source.raw.customers
DOWNSTREAM CONSUMERS:
← rpt_daily_revenue (view)
← dashboard_sales_metrics (table)
STATISTICS:
- Dependency Depth: 3
- Total Upstream Models: 5
- Total Sources: 3
- Direct Children: 2
- Total Downstream: 2
Mermaid Diagram Format
graph TD
source_orders[(source: orders)]
source_customers[(source: customers)]
stg_orders[stg_orders]
stg_customers[stg_customers]
int_order_details[int_order_details]
fct_orders[fct_orders]
source_orders --> stg_orders
source_customers --> stg_customers
stg_orders --> int_order_details
stg_customers --> int_order_details
int_order_details --> fct_orders
6. Impact Analysis
When tracing lineage, also identify:
Breaking Change Impact
- If this model's schema changes, what breaks?
- Models that depend on specific columns
- Tests that would fail
Performance Impact
- Models on the critical path
- Long-running models blocking others
- Warehouse resource consumption
Freshness Impact
- Source freshness checks
- Incremental load dependencies
- SLA-critical models
7. Circular Dependency Detection
def find_cycles(graph):
"""Detect circular dependencies using DFS"""
def visit(node, path, visited, cycles):
if node in path:
cycle_start = path.index(node)
cycles.append(path[cycle_start:] + [node])
return
if node in visited:
return
visited.add(node)
path.append(node)
for child in graph.get(node, []):
visit(child, path.copy(), visited, cycles)
cycles = []
visited = set()
for node in graph:
if node not in visited:
visit(node, [], visited, cycles)
return cycles
Practical Workflows
Workflow 1: Single Model Lineage
User: "Show lineage for fct_orders"
Steps:
1. Find the model file: models/marts/core/fct_orders.sql
2. Extract refs and sources from the file
3. Build immediate parent list
4. Recursively trace upstream to sources
5. Trace downstream to find consumers
6. Generate visualization
7. Calculate statistics
Workflow 2: Column-Level Lineage
User: "Where does the 'customer_ltv' column come from?"
Steps:
1. Find model containing customer_ltv
2. Parse SELECT statement to see source column
3. If derived, trace back through transformations
4. Identify originating source table and column
5. Map full transformation path
Workflow 3: Impact Analysis
User: "What breaks if I change stg_customers?"
Steps:
1. Find all downstream models referencing stg_customers
2. For each downstream model:
- Check which columns it uses from stg_customers
- Identify tests on those columns
- Mark as impacted
3. Trace to final downstream consumers
4. Generate impact report
Workflow 4: Critical Path Analysis
User: "What's the critical path in my dbt project?"
Steps:
1. Build full dependency graph
2. Identify leaf nodes (final marts/reports)
3. Trace back to sources
4. Calculate depth for each path
5. Identify longest paths
6. Highlight bottleneck models
Implementation Approach
Step 1: Scan Project Structure
# Find all dbt models
find models/ -type f -name "*.sql" > model_list.txt
# Find sources
find models/ -type f -name "*.yml" | xargs grep -l "sources:"
Step 2: Extract Dependencies
For each model file:
import re
def extract_refs(sql_content):
# Pattern: {{ ref('model_name') }}
pattern = r"{{\s*ref\(['\"]([^'\"]+)['\"]\)\s*}}"
return re.findall(pattern, sql_content)
def extract_sources(sql_content):
# Pattern: {{ source('source', 'table') }}
pattern = r"{{\s*source\(['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\)\s*}}"
return re.findall(pattern, sql_content)
Step 3: Build Graph Structure
graph = {
'nodes': {}, # model_name: {path, type, config}
'edges': [], # (parent, child) tuples
'sources': {} # source_name.table_name: metadata
}
for model_file in model_files:
model_name = extract_model_name(model_file)
refs = extract_refs(read_file(model_file))
sources = extract_sources(read_file(model_file))
graph['nodes'][model_name] = {
'path': model_file,
'type': 'model',
'parents': refs + sources
}
for ref in refs:
graph['edges'].append((ref, model_name))
Step 4: Query and Visualize
def get_lineage(model_name, graph):
upstream = trace_upstream(model_name, graph)
downstream = trace_downstream(model_name, graph)
return {
'model': model_name,
'upstream': upstream,
'downstream': downstream,
'depth': calculate_depth(model_name, graph),
'stats': calculate_stats(model_name, graph)
}
Output Examples
Example 1: Simple Lineage
Model: stg_orders
Path: models/staging/ecommerce/stg_orders.sql
Type: View
UPSTREAM:
└── source.raw.orders
DOWNSTREAM:
├── int_order_details (ephemeral)
└── fct_daily_orders (table)
Stats: 1 level deep, 1 source, 2 consumers
Example 2: Complex Lineage with Circular Dependency
⚠️ CIRCULAR DEPENDENCY DETECTED:
model_a → model_b → model_c → model_a
Path: models/intermediate/model_a.sql
models/intermediate/model_b.sql
models/intermediate/model_c.sql
This creates an infinite loop. Please refactor to break the cycle.
Example 3: Critical Path
CRITICAL PATH ANALYSIS
Longest dependency chain (6 levels):
source.raw.events
→ stg_events (view)
→ int_event_sessions (ephemeral)
→ int_user_sessions (ephemeral)
→ fct_user_activity (table, 45min build)
→ rpt_user_dashboard (view)
Bottleneck: fct_user_activity (45min)
Recommendation: Consider incremental materialization
Troubleshooting
Issue: Can't find model file
- Check project structure and model naming
- Look in subdirectories under models/
- Verify model isn't disabled in dbt_project.yml
Issue: Missing dependencies in output
- Check for dynamic refs:
{{ ref(var('model_name')) }} - Look for refs in Jinja loops
- Check macros that might generate refs
Issue: Manifest not available
- Fall back to file scanning method
- Run
dbt compilefirst to generate manifest - Parse .sql files directly
Quality Checklist
- All refs and sources extracted correctly
- Circular dependencies identified
- Depth calculations accurate
- Downstream consumers found
- Visualization clear and readable
- Statistics include all relevant metrics
- Special cases handled (dynamic refs, macros)