Claude Code Plugins

Community-maintained marketplace


Install Skill

1. Download skill
2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please review the skill's instructions and verify them before using it.

SKILL.md

name: data-lake-platform
description: Universal data lake and lakehouse patterns covering ingestion (dlt, Airbyte), transformation (SQLMesh, dbt), storage formats (Iceberg, Delta, Hudi, Parquet), query engines (ClickHouse, DuckDB, Doris, StarRocks), streaming (Kafka, Flink), orchestration (Dagster, Airflow, Prefect), and visualization (Metabase, Superset, Grafana). Self-hosted and cloud options.

Data Lake Platform — Quick Reference

Build production data lakes and lakehouses: ingest from any source, transform with SQL, store in open formats, query at scale.


When to Use This Skill

  • Design data lake/lakehouse architecture (medallion, data mesh, lambda/kappa)
  • Set up data ingestion pipelines (dlt, Airbyte)
  • Build SQL transformation layers (SQLMesh, dbt)
  • Choose and configure table formats (Iceberg, Delta Lake, Hudi)
  • Deploy analytical query engines (ClickHouse, DuckDB, Doris, StarRocks)
  • Implement streaming pipelines (Kafka, Flink, Spark Streaming)
  • Set up orchestration (Dagster, Airflow, Prefect)
  • Implement data quality and governance (Great Expectations, DataHub)
  • Optimize storage costs and query performance
  • Migrate from legacy warehouse to lakehouse

Quick Reference

| Layer | Tools | Templates | When to Use |
|---|---|---|---|
| Ingestion | dlt, Airbyte | templates/ingestion/ | Extract from APIs, databases, files |
| Transformation | SQLMesh, dbt | templates/transformation/ | SQL-based data modeling |
| Storage | Iceberg, Delta, Hudi | templates/storage/ | Open table formats with ACID |
| Query Engines | ClickHouse, DuckDB, Doris | templates/query-engines/ | Fast analytical queries |
| Streaming | Kafka, Flink, Spark | templates/streaming/ | Real-time data pipelines |
| Orchestration | Dagster, Airflow, Prefect | templates/orchestration/ | Pipeline scheduling |
| Cloud | Snowflake, BigQuery, Redshift | templates/cloud/ | Managed warehouses |
| Visualization | Metabase, Superset, Grafana | templates/visualization/ | Dashboards, BI, monitoring |

Decision Tree: Choosing Your Stack

Data Lake Architecture?
    ├─ Self-hosted priority?
    │   ├─ OLAP queries? → ClickHouse (Yandex origin, Russian-friendly)
    │   ├─ Embedded analytics? → DuckDB (in-process, no server)
    │   ├─ Complex joins + updates? → Apache Doris or StarRocks
    │   └─ Data lakehouse? → Iceberg + Trino/Spark
    │
    ├─ Cloud-native?
    │   ├─ AWS? → Redshift, Athena + Iceberg
    │   ├─ GCP? → BigQuery
    │   └─ Multi-cloud? → Snowflake
    │
    ├─ Ingestion tool?
    │   ├─ Python-first, simple? → dlt (recommended)
    │   ├─ GUI, many connectors? → Airbyte
    │   └─ Enterprise, CDC? → Debezium, Fivetran
    │
    ├─ Transformation tool?
    │   ├─ SQL-first, CI/CD? → SQLMesh (recommended)
    │   ├─ Large community? → dbt
    │   └─ Python transformations? → Pandas + Great Expectations
    │
    ├─ Table format?
    │   ├─ Multi-engine reads? → Apache Iceberg (industry standard)
    │   ├─ Databricks ecosystem? → Delta Lake
    │   └─ CDC, real-time updates? → Apache Hudi
    │
    ├─ Streaming?
    │   ├─ Event streaming? → Apache Kafka
    │   ├─ Stream processing? → Apache Flink
    │   └─ Batch + streaming? → Spark Streaming
    │
    └─ Orchestration?
        ├─ Data-aware, modern? → Dagster (recommended)
        ├─ Mature, many integrations? → Airflow
        └─ Python-native, simple? → Prefect

Architecture Patterns

Pattern 1: Medallion Architecture (Bronze/Silver/Gold)

Use when: Building enterprise data lake with clear data quality tiers.

Sources → Bronze (raw) → Silver (cleaned) → Gold (business-ready)
            ↓              ↓                  ↓
         Iceberg        Iceberg            Iceberg
         append-only    deduped            aggregated

Tools: dlt → SQLMesh → ClickHouse/DuckDB → Metabase

See templates/cross-platform/template-medallion-architecture.md
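
To make the layering concrete, the sketch below expresses the three tiers as SQL run through DuckDB from Python. This is a minimal stand-in: in production the tables would typically live in Iceberg and be managed as SQLMesh models, and the orders source, column names, and file paths are hypothetical.

```python
import duckdb

con = duckdb.connect("lake.duckdb")  # local stand-in for the lake

# Bronze: raw, append-only landing zone (loaded by dlt/Airbyte in practice)
con.sql("""
    CREATE TABLE IF NOT EXISTS bronze_orders AS
    SELECT * FROM read_parquet('landing/orders/*.parquet')
""")

# Silver: deduplicated, keeping only the latest version of each order
con.sql("""
    CREATE OR REPLACE TABLE silver_orders AS
    SELECT * FROM bronze_orders
    QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
""")

# Gold: business-ready aggregate for dashboards
con.sql("""
    CREATE OR REPLACE TABLE gold_daily_revenue AS
    SELECT order_date, sum(amount) AS revenue
    FROM silver_orders
    GROUP BY order_date
""")
```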

Pattern 2: Data Mesh (Domain-Oriented)

Use when: Large organization, multiple domains, decentralized ownership.

Domain A ──→ Domain A Lake ──→ Data Products
Domain B ──→ Domain B Lake ──→ Data Products
Domain C ──→ Domain C Lake ──→ Data Products
                   ↓
            Federated Catalog (DataHub/OpenMetadata)

Tools: dlt (per domain) → SQLMesh → Iceberg → DataHub

See resources/architecture-patterns.md

Pattern 3: Lambda/Kappa (Streaming + Batch)

Use when: Real-time + historical analytics required.

Lambda: Kafka → Flink (speed layer) ───→ Serving
                   ↓                        ↑
        Batch (Spark) → Iceberg ───────────┘

Kappa:  Kafka → Flink → Iceberg → Serving (single path)

Tools: Kafka → Flink → Iceberg → ClickHouse

See resources/streaming-patterns.md


Core Capabilities

1. Data Ingestion (dlt, Airbyte)

Extract data from any source: REST APIs, databases, files, SaaS platforms.

dlt (Python-native, recommended):

  • Simple pip install, Python scripts
  • 100+ verified sources
  • Incremental loading, schema evolution
  • Destinations: ClickHouse, DuckDB, Snowflake, BigQuery, Postgres

Airbyte (GUI, connectors):

  • 300+ pre-built connectors
  • Self-hosted or cloud
  • CDC replication
  • Declarative YAML configuration

See templates/ingestion/dlt/ and templates/ingestion/airbyte/
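
A minimal dlt pipeline, as a sketch: it pulls a hypothetical REST endpoint incrementally and loads it into DuckDB. The URL, cursor field, and dataset name are placeholders, and the destination string can be swapped for "clickhouse", "bigquery", etc.

```python
import dlt
import requests


@dlt.resource(name="events", write_disposition="append")
def events(updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01")):
    # Hypothetical endpoint; dlt tracks the cursor value between runs
    response = requests.get(
        "https://api.example.com/events",
        params={"updated_after": updated_at.last_value},
    )
    response.raise_for_status()
    yield response.json()


pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",
    destination="duckdb",
    dataset_name="raw",
)
print(pipeline.run(events()))
```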

2. SQL Transformation (SQLMesh, dbt)

Build data models with SQL, manage dependencies, test data quality.

SQLMesh (recommended):

  • Virtual data environments
  • Automatic change detection
  • Plan/apply workflow (like Terraform)
  • Built-in unit testing

dbt (popular alternative):

  • Large community
  • Many packages
  • Cloud offering
  • Jinja templating

See templates/transformation/sqlmesh/ and templates/transformation/dbt/
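
If you prefer driving SQLMesh from Python instead of the CLI, the plan/apply workflow looks roughly like the sketch below; the project path ./lake is hypothetical, and option names may differ slightly between SQLMesh versions.

```python
from sqlmesh import Context

ctx = Context(paths="./lake")  # directory containing config.yaml and models/

plan = ctx.plan("dev")  # compute changes against the dev environment
ctx.apply(plan)         # apply the plan and backfill affected models
ctx.run("dev")          # execute models that are due per their cron schedule
```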

3. Open Table Formats (Iceberg, Delta, Hudi)

Store data in open formats with ACID transactions, time travel, schema evolution.

Apache Iceberg (recommended):

  • Multi-engine support (Spark, Trino, Flink, ClickHouse)
  • Hidden partitioning
  • Snapshot isolation
  • Industry momentum (Snowflake, Databricks, AWS)

Delta Lake:

  • Databricks native
  • Z-ordering optimization
  • Change data feed

Apache Hudi:

  • Optimized for CDC/updates
  • Record-level indexing
  • Near real-time ingestion

See templates/storage/
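
A small PyIceberg sketch for creating an Iceberg table and appending an Arrow batch is shown below; the catalog URI, warehouse path, and table name are placeholders, and the exact catalog properties depend on whether you use a REST, Glue, or Hive catalog.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog and S3 warehouse
catalog = load_catalog(
    "lake",
    **{"uri": "http://localhost:8181", "warehouse": "s3://my-bucket/warehouse"},
)

rows = pa.table({"id": [1, 2], "amount": [10.5, 3.2]})

table = catalog.create_table_if_not_exists("bronze.orders", schema=rows.schema)
table.append(rows)  # ACID append; earlier snapshots stay available for time travel
```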

4. Query Engines (ClickHouse, DuckDB, Doris, StarRocks)

Fast analytical queries on large datasets.

ClickHouse (self-hosted priority):

  • 100+ GB/s scan speed
  • Columnar storage + compression
  • Russian origin (Yandex), fully open source
  • MergeTree engine family

DuckDB (embedded):

  • In-process, no server
  • Parquet/CSV native
  • Python/R integration
  • Laptop-scale analytics

Apache Doris / StarRocks:

  • MPP architecture
  • Complex joins
  • Real-time updates
  • MySQL protocol compatible

See templates/query-engines/
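
As a quick illustration of the embedded option, DuckDB can query Parquet files in place with no server or load step; the file path and column names below are placeholders.

```python
import duckdb

con = duckdb.connect()  # in-memory; pass a file path to persist

result = con.sql("""
    SELECT event_date, count(*) AS events
    FROM read_parquet('data/silver/events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").df()  # .df() returns a pandas DataFrame (requires pandas)
print(result)
```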

5. Streaming (Kafka, Flink, Spark)

Real-time data pipelines and stream processing.

Apache Kafka:

  • Event streaming platform
  • Durable log storage
  • Connect ecosystem

Apache Flink:

  • True stream processing
  • Exactly-once semantics
  • Stateful computations

Spark Streaming:

  • Micro-batch processing
  • Unified batch + stream API
  • ML integration

See templates/streaming/
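
A minimal sketch with the kafka-python client (one option among several, alongside confluent-kafka): the broker address and the orders topic are placeholders for your own cluster.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 42, "amount": 10.5})
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="lake-ingest",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # hand off to Flink/Spark, or write to Iceberg
    break
```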

6. Orchestration (Dagster, Airflow, Prefect)

Schedule and monitor data pipelines.

Dagster (recommended):

  • Software-defined assets
  • Data-aware scheduling
  • Built-in observability
  • Type system

Airflow:

  • Mature, battle-tested
  • Many operators
  • Large community

Prefect:

  • Python-native
  • Hybrid execution
  • Simple deployment

See templates/orchestration/
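
For comparison with the Dagster example in the Quick Start section below, a minimal Prefect flow looks roughly like this; the task bodies are placeholders.

```python
from prefect import flow, task


@task(retries=2)
def extract() -> list[dict]:
    # Placeholder extraction; swap in dlt, an API call, a database read, ...
    return [{"id": 1}, {"id": 2}]


@task
def load(rows: list[dict]) -> None:
    print(f"loading {len(rows)} rows")


@flow(log_prints=True)
def daily_load():
    load(extract())


if __name__ == "__main__":
    daily_load()
```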

7. Visualization (Metabase, Superset, Grafana)

Business intelligence dashboards and operational monitoring.

Metabase (business users):

  • Intuitive question builder
  • Self-serve analytics
  • Signed embedding
  • Caching and permissions

Apache Superset (analysts):

  • SQL Lab for exploration
  • Advanced visualizations
  • Row-level security
  • Custom chart plugins

Grafana (infrastructure):

  • Time-series dashboards
  • Real-time alerting
  • Prometheus/InfluxDB native
  • Data pipeline monitoring

See templates/visualization/ and resources/bi-visualization-patterns.md
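
As one concrete example, Metabase's signed embedding works by signing a short-lived JWT with the embedding secret from the admin settings; a rough sketch using PyJWT follows, where the site URL, secret, and dashboard id are placeholders.

```python
import time

import jwt  # PyJWT

METABASE_SITE_URL = "https://metabase.example.com"   # placeholder
METABASE_SECRET_KEY = "embedding-secret-from-admin"  # placeholder

payload = {
    "resource": {"dashboard": 12},    # id of the dashboard to embed
    "params": {},                     # locked filter values, if any
    "exp": round(time.time()) + 600,  # token expires in 10 minutes
}
token = jwt.encode(payload, METABASE_SECRET_KEY, algorithm="HS256")

iframe_url = f"{METABASE_SITE_URL}/embed/dashboard/{token}#bordered=true&titled=true"
print(iframe_url)
```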


Navigation

Resources (Deep Guides)

| Resource | Description |
|---|---|
| architecture-patterns.md | Medallion, data mesh, lakehouse design |
| ingestion-patterns.md | dlt vs Airbyte, CDC, incremental loading |
| transformation-patterns.md | SQLMesh vs dbt, testing, CI/CD |
| storage-formats.md | Iceberg vs Delta vs Hudi comparison |
| query-engine-patterns.md | ClickHouse, DuckDB, Doris optimization |
| streaming-patterns.md | Kafka, Flink, Spark Streaming |
| orchestration-patterns.md | Dagster, Airflow, Prefect comparison |
| governance-catalog.md | DataHub, OpenMetadata, lineage |
| cost-optimization.md | Storage, compute, query optimization |
| operational-playbook.md | Monitoring, incidents, migrations |
| bi-visualization-patterns.md | Metabase, Superset, Grafana operations |

Templates (Copy-Paste Ready)

| Category | Templates |
|---|---|
| Cross-platform | medallion, pipeline, incremental, quality, schema, partitioning, cost, migration |
| Ingestion/dlt | pipeline, rest-api, database-source, incremental, warehouse-loading |
| Ingestion/Airbyte | connection, custom-connector |
| Transformation/SQLMesh | project, model, incremental, dag, testing, production, security |
| Transformation/dbt | project, incremental, testing |
| Storage | iceberg-table, iceberg-maintenance, delta-table, hudi-table, parquet-optimization |
| Query Engines | clickhouse (setup, ingestion, optimization, materialized-views, replication), duckdb, doris, starrocks |
| Streaming | kafka-ingestion, flink-processing, spark-streaming |
| Orchestration | dagster-pipeline, airflow-dag, prefect-flow |
| Cloud | snowflake-setup, bigquery-setup, redshift-setup |
| Visualization | metabase (connection-checklist, dashboard-request, incident-playbook) |

Related Skills


Quick Start Examples

Example 1: dlt + ClickHouse + Metabase

```bash
# Install dlt with ClickHouse support
pip install "dlt[clickhouse]"

# Initialize a pipeline from the REST API source
dlt init rest_api clickhouse

# Configure and run
python pipeline.py
```

Example 2: SQLMesh + DuckDB (Local Development)

```bash
# Install SQLMesh
pip install sqlmesh

# Initialize a project using the DuckDB dialect
sqlmesh init duckdb

# Plan, apply, and run
sqlmesh plan
sqlmesh run
```

Example 3: Dagster + dlt + SQLMesh

```python
# dagster_pipeline.py
from dagster import Definitions, asset
import dlt
import sqlmesh


@dlt.resource(name="events")
def source_data():
    # Placeholder source; swap in a real dlt source (REST API, database, ...)
    yield [{"id": 1, "value": "example"}]


@asset
def raw_data():
    # Land raw data in DuckDB via dlt
    pipeline = dlt.pipeline("my_pipeline", destination="duckdb", dataset_name="raw")
    return pipeline.run(source_data())


@asset(deps=[raw_data])
def transformed_data():
    # Run SQLMesh models on top of the raw tables
    ctx = sqlmesh.Context(paths=".")  # assumes a SQLMesh project in this directory
    ctx.run()


defs = Definitions(assets=[raw_data, transformed_data])
```

Sources

See data/sources.json for 100+ curated sources covering:

  • Official documentation (dlt, SQLMesh, ClickHouse, Iceberg, etc.)
  • Architecture guides (medallion, data mesh)
  • Comparison articles (2024-2025 benchmarks)
  • Community resources (Discord, Stack Overflow)
  • Russian ecosystem (Yandex DataLens, Arenadata, Altinity)