Data Architecture
Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.
When to Use This Skill
- Choosing between data lake, warehouse, and lakehouse
- Designing a modern data platform
- Implementing data mesh principles
- Planning data storage strategy
- Understanding data architecture trade-offs
Data Architecture Evolution
Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics
Generation 2: Data Lake (2010s)
- All data types (structured, semi, unstructured)
- Schema-on-read
- Hadoop/HDFS based
- Cheap storage, complex processing
Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML
Architecture Comparison
Data Warehouse
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘
Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Structured data only
- High data quality
- Expensive storage
Best for:
- Business intelligence
- Financial reporting
- Structured analytics
Data Lake
┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │    (Raw)    │
└─────────────┘     └─────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
     ┌─────────┐      ┌─────────┐      ┌─────────┐
     │   ML    │      │   ETL   │      │  Spark  │
     │ Training│      │  to DW  │      │ Analysis│
     └─────────┘      └─────────┘      └─────────┘
Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"
Best for:
- Data science/ML
- Unstructured data
- Experimental analysis
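The schema-on-read idea above can be sketched in a few lines: raw records land in the lake untouched, and each consumer applies its own schema when it reads. This is a minimal illustration with plain JSON strings, not any particular lake engine; the record contents and field names are made up.

```python
import json

# Raw records land in the lake exactly as produced; nothing is validated at write time.
raw_records = [
    '{"user_id": 1, "event": "click", "ts": "2024-01-01T10:00:00"}',
    '{"user_id": 2, "event": "view"}',                  # missing ts
    '{"user_id": "3", "event": "click", "extra": 42}',  # string id, extra field
]

def read_events(lines, schema):
    """Schema-on-read: each consumer picks the fields it needs and coerces types."""
    for line in lines:
        raw = json.loads(line)
        try:
            yield {field: cast(raw[field]) for field, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            pass  # in practice: quarantine or log the malformed record

events = list(read_events(raw_records, schema={"user_id": int, "event": str}))
print(events)
# [{'user_id': 1, 'event': 'click'}, {'user_id': 2, 'event': 'view'},
#  {'user_id': 3, 'event': 'click'}]
```

Note that all three records pass because this particular reader only requires `user_id` and `event`; a different consumer could demand `ts` and reject the second record. That per-reader flexibility is also what makes data swamps possible.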
Data Lakehouse
┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │         Data Lakehouse          │
│    (All)    │     │  ┌──────────────────────────┐   │
└─────────────┘     │  │      Metadata Layer      │   │
                    │  │   (Delta/Iceberg/Hudi)   │   │
                    │  └──────────────────────────┘   │
                    │  ┌──────────────────────────┐   │
                    │  │      Storage Layer       │   │
                    │  │     (Object Storage)     │   │
                    │  └──────────────────────────┘   │
                    └─────────────────────────────────┘
                                     │
                ┌────────────────────┼────────────────────┐
                ▼                    ▼                    ▼
           ┌─────────┐          ┌─────────┐          ┌─────────┐
           │   SQL   │          │   ML    │          │ Stream  │
           │   BI    │          │ Workload│          │ Process │
           └─────────┘          └─────────┘          └─────────┘
Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats
Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms
Architecture Selection Guide
| Factor            | Warehouse  | Lake        | Lakehouse    |
|-------------------|------------|-------------|--------------|
| Data types        | Structured | All         | All          |
| Query performance | Excellent  | Poor-Medium | Good         |
| Data quality      | High       | Variable    | Configurable |
| Cost              | High       | Low         | Medium       |
| ML workloads      | Limited    | Excellent   | Excellent    |
| Real-time         | Limited    | Good        | Good         |
| Governance        | Strong     | Weak        | Strong       |
| Complexity        | Low        | High        | Medium       |
Decision Tree:

Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse
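The decision tree can be encoded as a small function. This is purely illustrative: the question names are paraphrased from the tree above, and real selection involves more factors than three booleans.

```python
def choose_architecture(structured_bi_focus: bool,
                        ml_and_bi_on_same_data: bool,
                        primarily_ml_or_unstructured: bool) -> str:
    """Encode the selection decision tree; order of checks mirrors the tree."""
    if structured_bi_focus:
        return "Data Warehouse"
    if ml_and_bi_on_same_data:
        return "Lakehouse"
    if primarily_ml_or_unstructured:
        return "Data Lake"
    return "Lakehouse"  # default branch for mixed or unclear workloads

print(choose_architecture(False, True, False))   # Lakehouse
print(choose_architecture(True, False, False))   # Data Warehouse
```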
Lakehouse Technologies
Delta Lake (Databricks)
Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)
File format: Parquet + Delta log
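The core idea behind time travel is an append-only log of immutable table versions: each commit creates a new snapshot, and readers can query any historical one. The toy class below illustrates that idea in plain Python; it is a conceptual sketch, not the Delta Lake API, and the commit model is far simpler than a real transaction log.

```python
class VersionedTable:
    """Toy versioned table: each commit snapshots a new immutable version."""

    def __init__(self):
        self._versions = [[]]  # version 0: empty table

    def commit(self, rows):
        # A commit produces a new version; all older versions stay readable.
        self._versions.append(self._versions[-1] + list(rows))

    def read(self, version=None):
        # "Time travel": read any historical version; default is the latest.
        return list(self._versions[-1 if version is None else version])

t = VersionedTable()
t.commit([{"id": 1}])
t.commit([{"id": 2}])
print(t.read(version=1))  # [{'id': 1}]  -- the table as of the first commit
print(t.read())           # [{'id': 1}, {'id': 2}]
```

In Delta Lake the same effect comes from the transaction log: queries can address a table by version number or timestamp, which is what enables reproducible ML training runs and easy rollback.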
Apache Iceberg (Netflix)
Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral
File format: Parquet/ORC/Avro + metadata
Apache Hudi (Uber)
Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming
File format: Parquet + Hudi metadata
Technology Comparison
| Feature          | Delta Lake | Iceberg   | Hudi      |
|------------------|------------|-----------|-----------|
| ACID             | Yes        | Yes       | Yes       |
| Time Travel      | Yes        | Yes       | Yes       |
| Schema Evolution | Good       | Excellent | Good      |
| Streaming        | Excellent  | Good      | Excellent |
| Ecosystem        | Databricks | Wide      | Wide      |
| Performance      | Excellent  | Excellent | Good      |
| Community        | Large      | Growing   | Medium    |
Data Mesh
Principles
Data Mesh = Decentralized data architecture
Four Principles:
1. Domain Ownership
- Data owned by domain teams
- Not centralized data team
2. Data as a Product
- Treat data like a product
- Quality, discoverability, usability
3. Self-Serve Platform
- Platform enables domain teams
- Reduces friction
4. Federated Governance
- Global standards
- Local implementation
Data Products
Data Product = Autonomous unit of data
Components:
┌──────────────────────────────────────┐
│             Data Product             │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Data   │  │     Metadata     │  │
│  │ (Tables) │  │  (Schema, docs)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Code   │  │       APIs       │  │
│  │  (ETL)   │  │  (Access layer)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────────────────────────────┐│
│  │          Quality + SLAs          ││
│  └──────────────────────────────────┘│
└──────────────────────────────────────┘
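A data product's contract can be made explicit as a descriptor alongside its code. The sketch below is a hypothetical shape, assuming a simple catalog entry with an owner domain, schema metadata, an output port, and a freshness SLA; the field names are not from any specific data mesh platform.

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    """Illustrative descriptor for one autonomous data product."""
    name: str                  # e.g. "orders.daily_revenue"
    owner_domain: str          # the domain team accountable for it
    schema: dict               # column name -> type (metadata for discovery)
    output_port: str           # how consumers access it (table, API, topic)
    freshness_sla_hours: int   # part of the quality/SLA contract
    docs_url: str = ""         # discoverability

orders = DataProduct(
    name="orders.daily_revenue",
    owner_domain="orders",
    schema={"date": "date", "revenue": "decimal"},
    output_port="warehouse.orders.daily_revenue",
    freshness_sla_hours=24,
)
print(orders.owner_domain)  # orders
```

Registering such descriptors in a central catalog is one way federated governance keeps global discoverability while leaving implementation to each domain.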
Data Mesh vs Centralized
| Aspect           | Centralized         | Data Mesh        |
|------------------|---------------------|------------------|
| Ownership        | Central data team   | Domain teams     |
| Scaling          | Team bottleneck     | Scales with org  |
| Domain knowledge | Lost in translation | Preserved        |
| Governance       | Centralized         | Federated        |
| Implementation   | Uniform             | Heterogeneous    |
| Complexity       | Lower initially     | Higher initially |
Data Modeling Patterns
Star Schema
                ┌─────────────┐
                │  Dim_Time   │
                └──────┬──────┘
                       │
┌───────────┐          │          ┌────────────┐
│Dim_Product├──────────┼──────────┤Dim_Customer│
└───────────┘          │          └────────────┘
                       │
                ┌──────┴──────┐
                │ Fact_Sales  │
                └─────────────┘
Pros: Simple, fast queries
Cons: Denormalized, redundancy
Best for: BI, reporting
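A minimal star schema fits in a few lines of SQL. The sketch below uses Python's built-in sqlite3 with made-up table and column names: one fact table keyed into two dimensions, queried with one join per dimension, which is exactly the shallow-join shape that makes star schemas fast for BI.

```python
import sqlite3

# One fact table with foreign keys into flat (denormalized) dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (
        product_id  INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount      REAL
    );
    INSERT INTO dim_product  VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_customer VALUES (10, 'EU'), (11, 'US');
    INSERT INTO fact_sales   VALUES (1, 10, 5.0), (1, 11, 7.5), (2, 10, 3.0);
""")

# A typical BI query: aggregate the fact, labelled via a dimension join.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)  # [('Gadget', 3.0), ('Widget', 12.5)]
```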
Snowflake Schema
Normalized dimensions:
Dim_Product → Dim_Subcategory → Dim_Category
Pros: Less redundancy
Cons: More joins, slower
Best for: Complex hierarchies
Data Vault
Hub (business keys) ←→ Link (relationships) ←→ Satellite (attributes)
Pros: Auditable, flexible, scalable
Cons: Complex, learning curve
Best for: Enterprise data warehouse
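The hub/link/satellite split above can be sketched with plain dictionaries. This is a toy illustration of the Data Vault roles, with made-up keys and rows, not a modeling tool: hubs hold stable business keys, links hold relationships, and satellites hold attributes over time so history stays auditable.

```python
# Hub: stable business keys only (hash key + business key).
hub_customer = [
    {"customer_hk": "h1", "customer_bk": "C-1001"},
]

# Link: relationships between hubs, nothing else.
link_order_customer = [
    {"link_hk": "l1", "order_hk": "h9", "customer_hk": "h1"},
]

# Satellite: descriptive attributes, one row per change, never updated in place.
sat_customer = [
    {"customer_hk": "h1", "load_ts": "2024-01-01", "name": "Ada"},
    {"customer_hk": "h1", "load_ts": "2024-06-01", "name": "Ada L."},
]

# Current state = latest satellite row per hub key; the full history remains.
latest = max(sat_customer, key=lambda r: r["load_ts"])
print(latest["name"])  # Ada L.
```

The separation is what makes the model flexible (new sources add satellites, not schema changes) and auditable (satellites are append-only), at the cost of many more joins than a star schema.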
Storage Layers
Bronze/Silver/Gold (Medallion Architecture)
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘
Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates, features
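The three layers can be sketched as plain functions, assuming dicts stand in for table rows. The records and the revenue-per-country aggregate are invented for illustration; in a real lakehouse each layer would be a managed table, not an in-memory list.

```python
# Bronze: raw ingestion, append-only, bad records and all.
bronze = [
    {"order_id": "1", "amount": "10.0", "country": "de"},
    {"order_id": "2", "amount": "oops", "country": "DE"},  # invalid amount
    {"order_id": "3", "amount": "5.5",  "country": "FR"},
]

def to_silver(rows):
    """Silver: cast types, normalize codes, drop (or quarantine) invalid rows."""
    out = []
    for r in rows:
        try:
            out.append({"order_id": int(r["order_id"]),
                        "amount": float(r["amount"]),
                        "country": r["country"].upper()})
        except ValueError:
            pass  # in practice: route to a quarantine table for inspection
    return out

def to_gold(rows):
    """Gold: business-level aggregate -- revenue per country."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'DE': 10.0, 'FR': 5.5}
```

Keeping Bronze untouched means Silver and Gold can always be rebuilt when cleaning rules change, which is the main argument for the layered layout.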
Zones in Data Lake
Landing Zone: Raw files from sources
Raw Zone: Structured raw data
Curated Zone: Transformed, quality-checked
Consumption Zone: Ready for analytics
Sandbox Zone: Exploration and experimentation
Best Practices
Data Quality
Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
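The first four gates in the list can be expressed as one pass over a batch before it is promoted. The function below is a simple sketch; the rule set, row data, and thresholds are illustrative assumptions, and production systems would use a framework rather than hand-rolled checks.

```python
def run_quality_gates(rows, schema, required, ranges):
    """Return (row_index, column, failure_kind) for every violated gate."""
    failures = []
    for i, row in enumerate(rows):
        for col, typ in schema.items():           # schema validation
            v = row.get(col)
            if v is not None and not isinstance(v, typ):
                failures.append((i, col, "type"))
        for col in required:                      # null / missing checks
            if row.get(col) is None:
                failures.append((i, col, "null"))
        for col, (lo, hi) in ranges.items():      # range validation
            v = row.get(col)
            if isinstance(v, (int, float)) and not lo <= v <= hi:
                failures.append((i, col, "range"))
    return failures

rows = [{"id": 1, "age": 34}, {"id": 2, "age": -5}, {"id": None, "age": 40}]
failures = run_quality_gates(
    rows,
    schema={"id": int, "age": int},
    required=["id"],
    ranges={"age": (0, 130)},
)
print(failures)  # [(1, 'age', 'range'), (2, 'id', 'null')]
```

An empty result means the batch may be promoted (e.g. Silver to Gold); a non-empty one should block promotion and alert the owning team.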
Governance
Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging
Performance
Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
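Partitioning by date and region usually means Hive-style `key=value` directory paths, which lets engines prune whole directories from a scan. The sketch below shows the layout and the pruning idea with a made-up bucket path; real engines do this from table metadata, not string matching.

```python
from datetime import date

def partition_path(table_root: str, d: date, region: str) -> str:
    """Hive-style partition layout: one directory level per partition key."""
    return (f"{table_root}/year={d.year}/month={d.month:02d}"
            f"/day={d.day:02d}/region={region}")

path = partition_path("s3://lake/sales", date(2024, 3, 7), "EU")
print(path)  # s3://lake/sales/year=2024/month=03/day=07/region=EU

# Partition pruning: a query filtered on day and region touches only the
# matching directory instead of scanning every file in the table.
paths = [partition_path("s3://lake/sales", date(2024, 3, d), r)
         for d in (6, 7) for r in ("EU", "US")]
pruned = [p for p in paths if "/day=07/" in p and "/region=EU" in p]
print(len(pruned))  # 1 of 4 partitions read
```

Choosing partition keys is a trade-off: too coarse and pruning buys little, too fine and you drown in small files, which is what compaction then has to fix.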
Related Skills
etl-elt-patterns - Data transformation
stream-processing - Real-time data
database-scaling - Database patterns