---
name: data-architecture
description: Use when designing data platforms, choosing between data lakes/lakehouses/warehouses, or implementing data mesh patterns. Covers modern data architecture approaches.
allowed-tools: Read, Glob, Grep
---

Data Architecture

Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.

When to Use This Skill

  • Choosing between data lake, warehouse, and lakehouse
  • Designing a modern data platform
  • Implementing data mesh principles
  • Planning data storage strategy
  • Understanding data architecture trade-offs

Data Architecture Evolution

Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics

Generation 2: Data Lake (2010s)
- All data types (structured, semi, unstructured)
- Schema-on-read
- Hadoop/HDFS-based
- Cheap storage, complex processing

Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML

Architecture Comparison

Data Warehouse

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                              │
                                              ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘

Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Structured data only
- High data quality
- Expensive storage

Best for:
- Business intelligence
- Financial reporting
- Structured analytics

Data Lake

┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │   (Raw)     │
└─────────────┘     └─────────────┘
                          │
         ┌────────────────┼────────────────┐
         ▼                ▼                ▼
    ┌─────────┐     ┌─────────┐     ┌─────────┐
    │   ML    │     │   ETL   │     │  Spark  │
    │ Training│     │ to DW   │     │ Analysis│
    └─────────┘     └─────────┘     └─────────┘

Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"

Best for:
- Data science/ML
- Unstructured data
- Experimental analysis

Data Lakehouse

┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │         Data Lakehouse          │
│    (All)    │     │  ┌──────────────────────────┐   │
└─────────────┘     │  │    Metadata Layer        │   │
                    │  │ (Delta/Iceberg/Hudi)     │   │
                    │  └──────────────────────────┘   │
                    │  ┌──────────────────────────┐   │
                    │  │    Storage Layer         │   │
                    │  │    (Object Storage)      │   │
                    │  └──────────────────────────┘   │
                    └─────────────────────────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
         ┌─────────┐         ┌─────────┐         ┌─────────┐
         │   SQL   │         │   ML    │         │  Stream │
         │   BI    │         │ Workload│         │ Process │
         └─────────┘         └─────────┘         └─────────┘

Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats

Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms

Architecture Selection Guide

| Factor | Warehouse | Lake | Lakehouse |
|-------------------|------------|----------------|--------------|
| Data types | Structured | All | All |
| Query performance | Excellent | Poor to medium | Good |
| Data quality | High | Variable | Configurable |
| Cost | High | Low | Medium |
| ML workloads | Limited | Excellent | Excellent |
| Real-time | Limited | Good | Good |
| Governance | Strong | Weak | Strong |
| Complexity | Low | High | Medium |

Decision Tree:

Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse

Lakehouse Technologies

Delta Lake (Databricks)

Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)

File format: Parquet + Delta log
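
A minimal PySpark sketch of Delta's time travel, assuming the delta-spark package is installed; the table path is hypothetical:

```python
from pyspark.sql import SparkSession

# Spark session with the Delta Lake extensions enabled
spark = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Each write appends a new version to the Delta transaction log
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/lake/events")  # hypothetical path

# Time travel: read the table exactly as it looked at version 0
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/events")
```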

Apache Iceberg (Netflix)

Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral

File format: Parquet/ORC/Avro + metadata
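
Hidden partitioning can be illustrated with Iceberg's Spark SQL DDL; the sketch below assumes an Iceberg catalog named `demo` has already been configured on the session:

```python
# Partition by a transform of ts instead of a separate derived column;
# filters on ts get partition pruning without readers knowing the layout.
spark.sql("""
    CREATE TABLE demo.db.events (
        id      BIGINT,
        ts      TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Consumers query the natural column; they never mis-specify a partition key
spark.sql("SELECT * FROM demo.db.events WHERE ts >= TIMESTAMP '2024-01-01 00:00:00'")
```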

Apache Hudi (Uber)

Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming

File format: Parquet + Hudi metadata
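
A sketch of a record-level upsert with the Hudi Spark datasource, assuming the Hudi Spark bundle is on the classpath; key, field, and path names are hypothetical:

```python
# Rows whose record key already exists are updated, new keys are inserted;
# the precombine field picks the winner when a key appears more than once.
df = spark.createDataFrame(
    [(1, "2024-01-01T00:00:00", "a")], ["id", "ts", "value"]
)
(df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("/tmp/lake/hudi/events"))  # hypothetical path
```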

Technology Comparison

| Feature | Delta Lake | Iceberg | Hudi |
|------------------|--------------------|-----------|-----------|
| ACID | Yes | Yes | Yes |
| Time travel | Yes | Yes | Yes |
| Schema evolution | Good | Excellent | Good |
| Streaming | Excellent | Good | Excellent |
| Ecosystem | Databricks-centric | Wide | Wide |
| Performance | Excellent | Excellent | Good |
| Community | Large | Growing | Medium |

Data Mesh

Principles

Data Mesh = Decentralized data architecture

Four Principles:

1. Domain Ownership
   - Data owned by domain teams
   - Not centralized data team

2. Data as a Product
   - Treat data like a product
   - Quality, discoverability, usability

3. Self-Serve Platform
   - Platform enables domain teams
   - Reduces friction

4. Federated Governance
   - Global standards
   - Local implementation

Data Products

Data Product = Autonomous unit of data

Components:
┌──────────────────────────────────────┐
│           Data Product               │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Data   │  │     Metadata     │  │
│  │ (Tables) │  │ (Schema, docs)   │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Code   │  │      APIs        │  │
│  │ (ETL)    │  │  (Access layer)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────────────────────────────┐│
│  │         Quality + SLAs           ││
│  └──────────────────────────────────┘│
└──────────────────────────────────────┘
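
One way to make the "data as a product" contract concrete is a small machine-readable descriptor. The dataclass below is purely illustrative; every field name is a hypothetical choice, not a data mesh standard:

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    """Illustrative contract for one data-mesh data product."""
    name: str                 # discoverable product name
    owner_team: str           # domain ownership, not a central team
    output_port: str          # where consumers read it (table, API, topic)
    schema_ref: str           # published schema / documentation location
    freshness_sla_hours: int  # quality/SLA commitment to consumers

orders = DataProduct(
    name="orders",
    owner_team="checkout",
    output_port="lakehouse.checkout.orders_v1",   # hypothetical table name
    schema_ref="catalog://checkout/orders/v1",    # hypothetical URI
    freshness_sla_hours=4,
)
```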

Data Mesh vs Centralized

| Aspect | Centralized | Data Mesh |
|------------------|---------------------|--------------------|
| Ownership | Central data team | Domain teams |
| Scaling | Team bottleneck | Scales with org |
| Domain knowledge | Lost in translation | Preserved |
| Governance | Centralized | Federated |
| Implementation | Uniform | Heterogeneous |
| Complexity | Lower initially | Higher initially |

Data Modeling Patterns

Star Schema

        ┌─────────────┐
        │  Dim_Time   │
        └──────┬──────┘
               │
┌───────────┐  │  ┌────────────┐
│Dim_Product├──┼──┤Dim_Customer│
└───────────┘  │  └────────────┘
               │
        ┌──────┴──────┐
        │ Fact_Sales  │
        └─────────────┘

Pros: Simple, fast queries
Cons: Denormalized, redundancy
Best for: BI, reporting
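
The payoff of a star schema is that typical BI questions become a single fact-to-dimension join fan-out. A sketch in Spark SQL, with hypothetical table and column names assumed to be registered in the session:

```python
# Revenue by region and year: one fact table joined to two dimensions
sales_by_region = spark.sql("""
    SELECT c.region, t.year, SUM(f.amount) AS revenue
    FROM   fact_sales   f
    JOIN   dim_customer c ON f.customer_key = c.customer_key
    JOIN   dim_time     t ON f.time_key     = t.time_key
    GROUP  BY c.region, t.year
""")
```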

Snowflake Schema

Normalized dimensions:
Dim_Product → Dim_Subcategory → Dim_Category

Pros: Less redundancy
Cons: More joins, slower
Best for: Complex hierarchies

Data Vault

Hubs (business keys) are connected by Links (relationships); Satellites (descriptive attributes, versioned over time) hang off both hubs and links.

Pros: Auditable, flexible, scalable
Cons: Complex, learning curve
Best for: Enterprise data warehouse
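
A sketch of two of the three Data Vault table types as Spark SQL DDL (a link table relating hub keys would follow the same shape); table and column names are hypothetical, and `USING delta` assumes a Delta-enabled session:

```python
spark.sql("""
    CREATE TABLE hub_customer (          -- Hub: one row per business key
        customer_hk   STRING,            -- hash of the business key
        customer_bk   STRING,            -- the business key itself
        load_ts       TIMESTAMP,         -- when the key was first seen
        record_source STRING             -- audit: where it came from
    ) USING delta
""")

spark.sql("""
    CREATE TABLE sat_customer_details (  -- Satellite: versioned attributes
        customer_hk   STRING,            -- points back to the hub
        load_ts       TIMESTAMP,         -- each attribute change = new row
        name          STRING,
        email         STRING,
        record_source STRING
    ) USING delta
""")
```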

Storage Layers

Bronze/Silver/Gold (Medallion Architecture)

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘

Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates, features
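
A minimal PySpark sketch of the three hops, assuming Delta tables and hypothetical paths and column names:

```python
# Bronze: land raw events as-is, append-only
raw = spark.read.json("/landing/events/")  # hypothetical source
raw.write.format("delta").mode("append").save("/lake/bronze/events")

# Silver: deduplicate and validate the bronze data
bronze = spark.read.format("delta").load("/lake/bronze/events")
silver = (bronze
          .dropDuplicates(["event_id"])
          .filter("event_ts IS NOT NULL"))
silver.write.format("delta").mode("overwrite").save("/lake/silver/events")

# Gold: business-level aggregate ready for BI
gold = silver.groupBy("customer_id").count()
gold.write.format("delta").mode("overwrite").save("/lake/gold/events_per_customer")
```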

Zones in Data Lake

Landing Zone: Raw files from sources
Raw Zone: Structured raw data
Curated Zone: Transformed, quality-checked
Consumption Zone: Ready for analytics
Sandbox Zone: Exploration and experimentation

Best Practices

Data Quality

Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
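
A minimal sketch of such gates in PySpark, failing the pipeline before bad data is promoted; the path and column names are hypothetical:

```python
from pyspark.sql import functions as F

df = spark.read.format("delta").load("/lake/silver/orders")  # hypothetical path

# Null check: every order must have a key
null_keys = df.filter(F.col("order_id").isNull()).count()

# Range validation: amounts must be plausible
bad_amounts = df.filter((F.col("amount") < 0) | (F.col("amount") > 1_000_000)).count()

# Hard gate: refuse to promote the table if either check fails
assert null_keys == 0, f"{null_keys} rows with null order_id"
assert bad_amounts == 0, f"{bad_amounts} rows with out-of-range amount"
```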

Governance

Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging

Performance

Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
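
With Delta Lake (OSS 2.0+ or Databricks), compaction and clustering can be combined in one statement; a sketch assuming a Delta table at a hypothetical path:

```python
# Rewrite small files into larger ones and co-locate rows sharing
# customer_id, so filters on that column can skip more files
spark.sql("OPTIMIZE delta.`/lake/silver/events` ZORDER BY (customer_id)")
```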

Related Skills

  • etl-elt-patterns - Data transformation
  • stream-processing - Real-time data
  • database-scaling - Database patterns