| name | databricks-asset-bundles |
| description | Modern deployment with Databricks Asset Bundles (DAB), supporting multi-environment configurations and CI/CD integration. |
| triggers | databricks asset bundles, dab deployment, bundle configuration, multi environment, infrastructure as code |
| category | deployment |
Databricks Asset Bundles Skill
Overview
Databricks Asset Bundles (DABs) provide a modern deployment framework that packages notebooks, Delta Live Tables (DLT) pipelines, jobs, and configuration into versioned, environment-aware bundles, bringing Infrastructure as Code to Databricks.
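At its smallest, a bundle is a single databricks.yml declaring a name and at least one target; everything else in this skill layers onto that file. A minimal sketch (the host URL is a placeholder):
# databricks.yml -- minimal bundle
bundle:
  name: hello-bundle
targets:
  dev:
    mode: development
    workspace:
      host: https://your-workspace.databricks.com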
Key Benefits:
- Infrastructure as Code
- Multi-environment support (dev, staging, prod)
- Version control for all artifacts
- Automated deployment
- Environment-specific configurations
- Integrated with CI/CD
When to Use This Skill
Use Databricks Asset Bundles when you need to:
- Deploy pipelines across multiple environments
- Implement Infrastructure as Code
- Automate deployment workflows
- Manage environment-specific configurations
- Version control Databricks artifacts
- Enable collaborative development
- Standardize deployment processes
Core Concepts
1. Bundle Structure
Standard Bundle Layout:
my-bundle/
├── databricks.yml # Main configuration
├── environments/
│ ├── dev.yml # Development overrides
│ ├── staging.yml # Staging overrides
│ └── prod.yml # Production overrides
├── src/
│ ├── notebooks/
│ │ ├── bronze_ingestion.py
│ │ └── silver_transformation.py
│ └── pipelines/
│ └── dlt_pipeline.py
├── resources/
│ ├── jobs.yml
│ ├── pipelines.yml
│ └── clusters.yml
└── tests/
└── test_transformations.py
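You do not need to create this layout by hand; the Databricks CLI can scaffold a starter bundle from a template (available template names depend on your CLI version):
# Scaffold a new bundle interactively
databricks bundle init
# Or choose a template directly, e.g. the Python starter
databricks bundle init default-python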
2. Main Configuration
databricks.yml:
bundle:
name: data-platform-bundle
# Optional git configuration
git:
branch: main
origin_url: https://github.com/org/repo.git
workspace:
host: https://your-workspace.databricks.com
root_path: /Workspace/bundles/${bundle.name}
# Define variables
variables:
catalog_name:
description: "Unity Catalog name"
default: "dev_catalog"
storage_path:
description: "Base storage path"
default: "/mnt/dev/data"
cluster_size:
description: "Cluster size"
default: "small"
# Include other configuration files
include:
- resources/*.yml
- environments/*.yml # per-target overrides (see section 3)
# Define resources
resources:
jobs:
daily_pipeline:
name: "[${bundle.environment}] Daily Pipeline"
tasks:
- task_key: bronze_ingestion
notebook_task:
notebook_path: ./src/notebooks/bronze_ingestion
source: WORKSPACE
base_parameters:
catalog: ${var.catalog_name}
storage: ${var.storage_path}
new_cluster:
num_workers: 2
spark_version: 13.3.x-scala2.12
node_type_id: i3.xlarge
spark_conf:
spark.databricks.delta.preview.enabled: "true"
- task_key: silver_transformation
depends_on:
- task_key: bronze_ingestion
notebook_task:
notebook_path: ./src/notebooks/silver_transformation
source: WORKSPACE
job_cluster_key: shared_cluster
job_clusters:
- job_cluster_key: shared_cluster
new_cluster:
num_workers: "${var.cluster_size == 'small' ? 2 : 8}"
spark_version: 13.3.x-scala2.12
node_type_id: i3.xlarge
schedule:
quartz_cron_expression: "0 0 1 * * ?" # Daily at 1 AM
timezone_id: "America/New_York"
email_notifications:
on_failure:
- data-team@company.com
pipelines:
bronze_to_gold:
name: "[${bundle.environment}] Bronze to Gold Pipeline"
target: ${var.catalog_name}
storage: ${var.storage_path}/dlt
libraries:
- notebook:
path: ./src/pipelines/dlt_pipeline.py
clusters:
- label: default
num_workers: 4
node_type_id: i3.xlarge
configuration:
source_path: ${var.storage_path}/landing
checkpoint_path: ${var.storage_path}/checkpoints
development: false
continuous: false
targets:
dev:
mode: development
workspace:
host: https://dev-workspace.databricks.com
root_path: /Workspace/dev/${bundle.name}
variables:
catalog_name: dev_catalog
storage_path: /mnt/dev/data
cluster_size: small
staging:
mode: production
workspace:
host: https://staging-workspace.databricks.com
root_path: /Workspace/staging/${bundle.name}
variables:
catalog_name: staging_catalog
storage_path: /mnt/staging/data
cluster_size: medium
prod:
mode: production
workspace:
host: https://prod-workspace.databricks.com
root_path: /Workspace/prod/${bundle.name}
variables:
catalog_name: prod_catalog
storage_path: /mnt/prod/data
cluster_size: large
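Variables can also be overridden at deploy time without editing any file, which is useful for ad-hoc testing (the --var flag requires a reasonably recent CLI version):
# Override a bundle variable for a single deployment
databricks bundle deploy -t dev --var="catalog_name=sandbox_catalog"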
3. Environment-Specific Configuration
environments/prod.yml:
# Production-specific overrides. Included files are merged into the bundle
# configuration, so per-target overrides must be nested under `targets`.
targets:
  prod:
    variables:
      catalog_name: prod_catalog
      storage_path: /mnt/prod/data
      cluster_size: large
    resources:
      jobs:
        daily_pipeline:
          # Production-specific settings
          max_concurrent_runs: 1
          timeout_seconds: 7200
          job_clusters:
            - job_cluster_key: shared_cluster
              new_cluster:
                node_type_id: i3.2xlarge
                autoscale: # autoscale replaces a fixed num_workers
                  min_workers: 4
                  max_workers: 16
          email_notifications:
            on_start:
              - data-team@company.com
            on_success:
              - data-team@company.com
            on_failure:
              - data-team@company.com
              - oncall@company.com
      pipelines:
        bronze_to_gold:
          development: false
          continuous: true # Continuous processing in prod
          clusters:
            - label: default
              node_type_id: i3.2xlarge
              autoscale:
                min_workers: 4
                max_workers: 16
          notifications:
            - email_recipients:
                - data-team@company.com
              alerts:
                - on-update-failure
4. Deployment Workflow
CLI Commands:
# Install Databricks CLI
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
# Authenticate
databricks auth login --host https://your-workspace.databricks.com
# Validate bundle
databricks bundle validate -t dev
# Deploy to development
databricks bundle deploy -t dev
# Run a job
databricks bundle run -t dev daily_pipeline
# Deploy to production
databricks bundle deploy -t prod
# Destroy bundle (cleanup)
databricks bundle destroy -t dev
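Two more commands worth knowing on recent CLI versions: validate can print the fully resolved configuration, and summary shows what the last deploy created:
# Print the resolved bundle configuration for a target
databricks bundle validate -t dev --output json
# Show the resources created by the last deploy
databricks bundle summary -t dev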
Implementation Patterns
Pattern 1: Multi-Environment Pipeline
Complete Bundle with Environment Variations:
# databricks.yml
bundle:
name: customer-analytics
variables:
environment:
description: "Deployment environment"
catalog:
description: "Unity Catalog"
min_workers:
description: "Minimum cluster workers"
default: 2
max_workers:
description: "Maximum cluster workers"
default: 8
resources:
jobs:
customer_pipeline:
name: "[${var.environment}] Customer Analytics Pipeline"
tasks:
- task_key: ingest
notebook_task:
notebook_path: ./notebooks/ingest_customers
new_cluster:
num_workers: ${var.min_workers}
spark_version: 13.3.x-scala2.12
node_type_id: i3.xlarge
- task_key: transform
depends_on:
- task_key: ingest
notebook_task:
notebook_path: ./notebooks/transform_customers
new_cluster:
autoscale:
min_workers: ${var.min_workers}
max_workers: ${var.max_workers}
spark_version: 13.3.x-scala2.12
node_type_id: i3.xlarge
- task_key: aggregate
depends_on:
- task_key: transform
notebook_task:
notebook_path: ./notebooks/aggregate_metrics
new_cluster:
num_workers: ${var.min_workers}
spark_version: 13.3.x-scala2.12
node_type_id: i3.xlarge
targets:
dev:
variables:
environment: dev
catalog: dev_catalog
min_workers: 2
max_workers: 4
prod:
variables:
environment: prod
catalog: prod_catalog
min_workers: 4
max_workers: 16
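Deploying the same bundle to each target creates independently named resources (for example, "[dev] Customer Analytics Pipeline" alongside "[prod] Customer Analytics Pipeline"), so both deployments can coexist:
# Same bundle, two isolated deployments
databricks bundle deploy -t dev
databricks bundle deploy -t prod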
Pattern 2: Modular Configuration
Split Configuration Across Files:
# databricks.yml
bundle:
name: data-platform
include:
- resources/jobs/*.yml
- resources/pipelines/*.yml
- resources/clusters/*.yml
# resources/jobs/ingestion_jobs.yml
resources:
jobs:
ingest_customers:
name: "[${bundle.environment}] Ingest Customers"
tasks:
- task_key: main
notebook_task:
notebook_path: ./notebooks/ingest_customers
ingest_orders:
name: "[${bundle.environment}] Ingest Orders"
tasks:
- task_key: main
notebook_task:
notebook_path: ./notebooks/ingest_orders
# resources/pipelines/dlt_pipelines.yml
resources:
pipelines:
customer_pipeline:
name: "[${bundle.environment}] Customer DLT Pipeline"
target: ${var.catalog}.customer
libraries:
- notebook:
path: ./pipelines/customer_dlt
order_pipeline:
name: "[${bundle.environment}] Order DLT Pipeline"
target: ${var.catalog}.orders
libraries:
- notebook:
path: ./pipelines/order_dlt
Pattern 3: Python Deployment Script
Automated Deployment:
"""
Automated bundle deployment script.
"""
import subprocess
import sys
class BundleDeployer:
"""Deploy Databricks Asset Bundles."""
def __init__(self, bundle_path: str):
self.bundle_path = bundle_path
def validate(self, target: str) -> bool:
"""Validate bundle configuration."""
print(f"Validating bundle for target: {target}")
result = subprocess.run(
["databricks", "bundle", "validate", "-t", target],
cwd=self.bundle_path,
capture_output=True,
text=True
)
if result.returncode != 0:
print(f"Validation failed: {result.stderr}")
return False
print("Validation successful")
return True
def deploy(self, target: str, force: bool = False) -> bool:
"""Deploy bundle to target environment."""
if not self.validate(target):
return False
print(f"Deploying bundle to {target}")
cmd = ["databricks", "bundle", "deploy", "-t", target]
if force:
cmd.append("--force")
result = subprocess.run(
cmd,
cwd=self.bundle_path,
capture_output=True,
text=True
)
if result.returncode != 0:
print(f"Deployment failed: {result.stderr}")
return False
print(f"Deployment successful: {result.stdout}")
return True
def run_job(self, target: str, job_key: str) -> bool:
"""Run a specific job from bundle."""
print(f"Running job: {job_key} on {target}")
result = subprocess.run(
["databricks", "bundle", "run", "-t", target, job_key],
cwd=self.bundle_path,
capture_output=True,
text=True
)
if result.returncode != 0:
print(f"Job run failed: {result.stderr}")
return False
print(f"Job started: {result.stdout}")
return True
def destroy(self, target: str, auto_approve: bool = False) -> bool:
"""Destroy bundle resources."""
print(f"WARNING: Destroying bundle resources in {target}")
cmd = ["databricks", "bundle", "destroy", "-t", target]
if auto_approve:
cmd.append("--auto-approve")
result = subprocess.run(
cmd,
cwd=self.bundle_path,
capture_output=True,
text=True
)
if result.returncode != 0:
print(f"Destroy failed: {result.stderr}")
return False
print("Bundle resources destroyed")
return True
# Usage
if __name__ == "__main__":
deployer = BundleDeployer("./my-bundle")
# Deploy to development
if deployer.deploy("dev"):
deployer.run_job("dev", "daily_pipeline")
# Deploy to production (requires approval)
if len(sys.argv) > 1 and sys.argv[1] == "--prod":
deployer.deploy("prod")
Pattern 4: GitOps Integration
GitHub Actions Workflow:
# .github/workflows/bundle-deploy.yml
name: Deploy Databricks Bundle
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
workflow_dispatch:
inputs:
environment:
description: 'Target environment'
required: true
type: choice
options:
- dev
- staging
- prod
jobs:
validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install Databricks CLI
run: |
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
- name: Validate Bundle
env:
DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
run: |
cd bundle/
databricks bundle validate -t dev
deploy-dev:
needs: validate
if: github.ref == 'refs/heads/develop'
runs-on: ubuntu-latest
environment: development
steps:
- uses: actions/checkout@v3
- name: Install Databricks CLI
run: |
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
- name: Deploy to Development
env:
DATABRICKS_HOST: ${{ secrets.DEV_DATABRICKS_HOST }}
DATABRICKS_TOKEN: ${{ secrets.DEV_DATABRICKS_TOKEN }}
run: |
cd bundle/
databricks bundle deploy -t dev
deploy-prod:
needs: validate
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v3
- name: Install Databricks CLI
run: |
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
- name: Deploy to Production
env:
DATABRICKS_HOST: ${{ secrets.PROD_DATABRICKS_HOST }}
DATABRICKS_TOKEN: ${{ secrets.PROD_DATABRICKS_TOKEN }}
run: |
cd bundle/
databricks bundle deploy -t prod
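Note that the workflow above declares a workflow_dispatch input but never consumes it. A sketch of a manual-deploy job that does (shared secret names are an assumption; per-environment secrets as above would also work):
deploy-manual:
  if: github.event_name == 'workflow_dispatch'
  runs-on: ubuntu-latest
  environment: ${{ inputs.environment }}
  steps:
    - uses: actions/checkout@v3
    - name: Install Databricks CLI
      run: |
        curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
    - name: Deploy to Selected Environment
      env:
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      run: |
        cd bundle/
        databricks bundle deploy -t ${{ inputs.environment }}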
Best Practices
1. Bundle Organization
- Keep bundle files under version control
- Use environment-specific overrides
- Separate resources into logical files
- Document variable purposes
- Include README for bundle usage
2. Environment Management
# Use consistent naming
targets:
dev:
mode: development # Enables faster iterations
staging:
mode: production # Production-like behavior
prod:
mode: production # Full production settings
3. Variable Usage
# Define reusable variables
variables:
project_name:
description: "Project identifier"
default: "customer-analytics"
# Use variables consistently (substitutions work in values, not in resource keys)
resources:
  jobs:
    project_job:
      name: "[${bundle.target}] ${var.project_name}"
4. Testing Strategy
# Test bundle locally
databricks bundle validate -t dev
# Deploy to dev for testing
databricks bundle deploy -t dev
# Run integration tests
databricks bundle run -t dev test_job
# Deploy to prod after validation
databricks bundle deploy -t prod
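The unit tests under tests/ can run locally before any deploy reaches a workspace (pytest as the test runner is an assumption about the project):
# Run unit tests locally before deploying anywhere
pip install pytest
pytest tests/ -v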
Common Pitfalls to Avoid
Don't:
- Hard-code environment-specific values
- Skip validation before deployment
- Modify resources outside of bundles
- Use development mode in production
- Deploy without testing
Do:
- Use variables for environment differences
- Always validate before deploying
- Manage all resources through bundles
- Use production mode for prod
- Test in lower environments first
Complete Examples
See /examples/ directory for:
- complete_bundle_project/: Full bundle structure
- multi_workspace_deployment/: Cross-workspace deployment
Related Skills
- delta-live-tables: Deploy DLT pipelines
- cicd-workflows: Automate deployments
- testing-patterns: Test before deploying
- data-products: Deploy data products