name	data-lake-platform
description	Universal data lake and lakehouse patterns covering ingestion (dlt, Airbyte), transformation (SQLMesh, dbt), storage formats (Iceberg, Delta, Hudi, Parquet), query engines (ClickHouse, DuckDB, Doris, StarRocks), streaming (Kafka, Flink), orchestration (Dagster, Airflow, Prefect), and visualization (Metabase, Superset, Grafana). Self-hosted and cloud options.

Data Lake Platform — Quick Reference

Build production data lakes and lakehouses: ingest from any source, transform with SQL, store in open formats, query at scale.

When to Use This Skill

Design data lake/lakehouse architecture (medallion, data mesh, lambda/kappa)
Set up data ingestion pipelines (dlt, Airbyte)
Build SQL transformation layers (SQLMesh, dbt)
Choose and configure table formats (Iceberg, Delta Lake, Hudi)
Deploy analytical query engines (ClickHouse, DuckDB, Doris, StarRocks)
Implement streaming pipelines (Kafka, Flink, Spark Streaming)
Set up orchestration (Dagster, Airflow, Prefect)
Implement data quality and governance (Great Expectations, DataHub)
Optimize storage costs and query performance
Migrate from legacy warehouse to lakehouse

Quick Reference

Layer	Tools	Templates	When to Use
Ingestion	dlt, Airbyte	`templates/ingestion/`	Extract from APIs, databases, files
Transformation	SQLMesh, dbt	`templates/transformation/`	SQL-based data modeling
Storage	Iceberg, Delta, Hudi	`templates/storage/`	Open table formats with ACID
Query Engines	ClickHouse, DuckDB, Doris	`templates/query-engines/`	Fast analytical queries
Streaming	Kafka, Flink, Spark	`templates/streaming/`	Real-time data pipelines
Orchestration	Dagster, Airflow, Prefect	`templates/orchestration/`	Pipeline scheduling
Cloud	Snowflake, BigQuery, Redshift	`templates/cloud/`	Managed warehouses
Visualization	Metabase, Superset, Grafana	`templates/visualization/`	Dashboards, BI, monitoring

Decision Tree: Choosing Your Stack

Data Lake Architecture?
    ├─ Self-hosted priority?
    │   ├─ OLAP queries? → ClickHouse (Yandex origin, Russian-friendly)
    │   ├─ Embedded analytics? → DuckDB (in-process, no server)
    │   ├─ Complex joins + updates? → Apache Doris or StarRocks
    │   └─ Data lakehouse? → Iceberg + Trino/Spark
    │
    ├─ Cloud-native?
    │   ├─ AWS? → Redshift, Athena + Iceberg
    │   ├─ GCP? → BigQuery
    │   └─ Multi-cloud? → Snowflake
    │
    ├─ Ingestion tool?
    │   ├─ Python-first, simple? → dlt (recommended)
    │   ├─ GUI, many connectors? → Airbyte
    │   └─ Enterprise, CDC? → Debezium, Fivetran
    │
    ├─ Transformation tool?
    │   ├─ SQL-first, CI/CD? → SQLMesh (recommended)
    │   ├─ Large community? → dbt
    │   └─ Python transformations? → Pandas + Great Expectations
    │
    ├─ Table format?
    │   ├─ Multi-engine reads? → Apache Iceberg (industry standard)
    │   ├─ Databricks ecosystem? → Delta Lake
    │   └─ CDC, real-time updates? → Apache Hudi
    │
    ├─ Streaming?
    │   ├─ Event streaming? → Apache Kafka
    │   ├─ Stream processing? → Apache Flink
    │   └─ Batch + streaming? → Spark Streaming
    │
    └─ Orchestration?
        ├─ Data-aware, modern? → Dagster (recommended)
        ├─ Mature, many integrations? → Airflow
        └─ Python-native, simple? → Prefect

Architecture Patterns

Pattern 1: Medallion Architecture (Bronze/Silver/Gold)

Use when: Building enterprise data lake with clear data quality tiers.

Sources → Bronze (raw) → Silver (cleaned) → Gold (business-ready)
            ↓              ↓                  ↓
         Iceberg        Iceberg            Iceberg
         append-only    deduped            aggregated

Tools: dlt → SQLMesh → ClickHouse/DuckDB → Metabase

See templates/cross-platform/template-medallion-architecture.md

Pattern 2: Data Mesh (Domain-Oriented)

Use when: Large organization, multiple domains, decentralized ownership.

Domain A ──→ Domain A Lake ──→ Data Products
Domain B ──→ Domain B Lake ──→ Data Products
Domain C ──→ Domain C Lake ──→ Data Products
                   ↓
            Federated Catalog (DataHub/OpenMetadata)

Tools: dlt (per domain) → SQLMesh → Iceberg → DataHub

See resources/architecture-patterns.md

Pattern 3: Lambda/Kappa (Streaming + Batch)

Use when: Real-time + historical analytics required.

Lambda: Kafka → Flink (speed layer) ───→ Serving
                   ↓                        ↑
        Batch (Spark) → Iceberg ───────────┘

Kappa:  Kafka → Flink → Iceberg → Serving (single path)

Tools: Kafka → Flink → Iceberg → ClickHouse

See resources/streaming-patterns.md

Core Capabilities

1. Data Ingestion (dlt, Airbyte)

Extract data from any source: REST APIs, databases, files, SaaS platforms.

dlt (Python-native, recommended):

Simple pip install, Python scripts
100+ verified sources
Incremental loading, schema evolution
Destinations: ClickHouse, DuckDB, Snowflake, BigQuery, Postgres

Airbyte (GUI, connectors):

300+ pre-built connectors
Self-hosted or cloud
CDC replication
Declarative YAML configuration

See templates/ingestion/dlt/ and templates/ingestion/airbyte/

2. SQL Transformation (SQLMesh, dbt)

Build data models with SQL, manage dependencies, test data quality.

SQLMesh (recommended):

Virtual data environments
Automatic change detection
Plan/apply workflow (like Terraform)
Built-in unit testing

dbt (popular alternative):

Large community
Many packages
Cloud offering
Jinja templating

See templates/transformation/sqlmesh/ and templates/transformation/dbt/

3. Open Table Formats (Iceberg, Delta, Hudi)

Store data in open formats with ACID transactions, time travel, schema evolution.

Apache Iceberg (recommended):

Multi-engine support (Spark, Trino, Flink, ClickHouse)
Hidden partitioning
Snapshot isolation
Industry momentum (Snowflake, Databricks, AWS)

Delta Lake:

Databricks native
Z-ordering optimization
Change data feed

Apache Hudi:

Optimized for CDC/updates
Record-level indexing
Near real-time ingestion

See templates/storage/

4. Query Engines (ClickHouse, DuckDB, Doris, StarRocks)

Fast analytical queries on large datasets.

ClickHouse (self-hosted priority):

100+ GB/s scan speed
Columnar storage + compression
Russian origin (Yandex), fully open source
MergeTree engine family

DuckDB (embedded):

In-process, no server
Parquet/CSV native
Python/R integration
Laptop-scale analytics

Apache Doris / StarRocks:

MPP architecture
Complex joins
Real-time updates
MySQL protocol compatible

See templates/query-engines/

5. Streaming (Kafka, Flink, Spark)

Real-time data pipelines and stream processing.

Apache Kafka:

Event streaming platform
Durable log storage
Connect ecosystem

Apache Flink:

True stream processing
Exactly-once semantics
Stateful computations

Spark Streaming:

Micro-batch processing
Unified batch + stream API
ML integration

See templates/streaming/

6. Orchestration (Dagster, Airflow, Prefect)

Schedule and monitor data pipelines.

Dagster (recommended):

Software-defined assets
Data-aware scheduling
Built-in observability
Type system

Airflow:

Mature, battle-tested
Many operators
Large community

Prefect:

Python-native
Hybrid execution
Simple deployment

See templates/orchestration/

7. Visualization (Metabase, Superset, Grafana)

Business intelligence dashboards and operational monitoring.

Metabase (business users):

Intuitive question builder
Self-serve analytics
Signed embedding
Caching and permissions

Apache Superset (analysts):

SQL Lab for exploration
Advanced visualizations
Row-level security
Custom chart plugins

Grafana (infrastructure):

Time-series dashboards
Real-time alerting
Prometheus/InfluxDB native
Data pipeline monitoring

See templates/visualization/ and resources/bi-visualization-patterns.md

Navigation

Resources (Deep Guides)

Resource	Description
architecture-patterns.md	Medallion, data mesh, lakehouse design
ingestion-patterns.md	dlt vs Airbyte, CDC, incremental loading
transformation-patterns.md	SQLMesh vs dbt, testing, CI/CD
storage-formats.md	Iceberg vs Delta vs Hudi comparison
query-engine-patterns.md	ClickHouse, DuckDB, Doris optimization
streaming-patterns.md	Kafka, Flink, Spark Streaming
orchestration-patterns.md	Dagster, Airflow, Prefect comparison
governance-catalog.md	DataHub, OpenMetadata, lineage
cost-optimization.md	Storage, compute, query optimization
operational-playbook.md	Monitoring, incidents, migrations
bi-visualization-patterns.md	Metabase, Superset, Grafana operations

Templates (Copy-Paste Ready)

Category	Templates
Cross-platform	medallion, pipeline, incremental, quality, schema, partitioning, cost, migration
Ingestion/dlt	pipeline, rest-api, database-source, incremental, warehouse-loading
Ingestion/Airbyte	connection, custom-connector
Transformation/SQLMesh	project, model, incremental, dag, testing, production, security
Transformation/dbt	project, incremental, testing
Storage	iceberg-table, iceberg-maintenance, delta-table, hudi-table, parquet-optimization
Query Engines	clickhouse (setup, ingestion, optimization, materialized-views, replication), duckdb, doris, starrocks
Streaming	kafka-ingestion, flink-processing, spark-streaming
Orchestration	dagster-pipeline, airflow-dag, prefect-flow
Cloud	snowflake-setup, bigquery-setup, redshift-setup
Visualization	metabase (connection-checklist, dashboard-request, incident-playbook)

Related Skills

ai-mlops — ML deployment, drift detection, model registry
ai-ml-data-science — Feature engineering, ML modeling, EDA
data-sql-optimization — OLTP database optimization (PostgreSQL, MySQL, Oracle)
ops-devops-platform — Infrastructure, CI/CD, Kubernetes

Quick Start Examples

Example 1: dlt + ClickHouse + Metabase

# Install dlt with ClickHouse
pip install "dlt[clickhouse]"

# Initialize pipeline
dlt init rest_api clickhouse

# Configure and run
python pipeline.py

Example 2: SQLMesh + DuckDB (Local Development)

# Install SQLMesh
pip install sqlmesh

# Initialize project
sqlmesh init duckdb

# Plan and apply
sqlmesh plan
sqlmesh run

Example 3: Dagster + dlt + SQLMesh

# dagster_pipeline.py
from dagster import asset, Definitions
import dlt
import sqlmesh

@asset
def raw_data():
    pipeline = dlt.pipeline("my_pipeline", destination="duckdb")
    return pipeline.run(source_data())

@asset(deps=[raw_data])
def transformed_data():
    ctx = sqlmesh.Context()
    ctx.run()

Sources

See data/sources.json for 100+ curated sources covering:

Official documentation (dlt, SQLMesh, ClickHouse, Iceberg, etc.)
Architecture guides (medallion, data mesh)
Comparison articles (2024-2025 benchmarks)
Community resources (Discord, Stack Overflow)
Russian ecosystem (Yandex DataLens, Arenadata, Altinity)

data-lake-platform

Install Skill

SKILL.md