name	data-modeling
description	Schema design, entity relationships, normalization, and database patterns. Use when designing database schemas, modeling domain entities, deciding between normalized and denormalized structures, choosing between relational and NoSQL approaches, or planning schema migrations. Covers ER modeling, normal forms, and data evolution strategies.

Data Modeling

When to Use

Designing new database schemas from domain requirements
Analyzing existing schemas for optimization opportunities
Deciding between normalized and denormalized structures
Choosing appropriate data stores (relational vs NoSQL)
Planning schema evolution and migration strategies
Modeling complex entity relationships

Philosophy

Data models outlive applications. A well-designed schema encodes business rules, enforces integrity, and enables performance optimization. The goal is to create models that are correct first, then optimize for access patterns while maintaining data integrity.

Entity-Relationship Modeling

Identifying Entities

Entities represent distinct business concepts that have identity and lifecycle.

Entity Identification Checklist:

Has unique identity across the system
Has attributes that describe it
Participates in relationships with other entities
Has a meaningful lifecycle (created, modified, archived)
Would be stored and retrieved independently

Common Entity Patterns:

Core domain objects (User, Product, Order)
Reference/lookup data (Country, Status, Category)
Transactional records (Payment, LogEntry, Event)
Associative entities (OrderItem, Enrollment, Permission)

Relationship Types

Type	Notation	Example	Implementation
One-to-One	1:1	User - Profile	FK with unique constraint or same table
One-to-Many	1:N	Customer - Orders	FK on the "many" side
Many-to-Many	M:N	Students - Courses	Junction/bridge table

Relationship Considerations:

Cardinality: minimum and maximum on each side
Optionality: required vs optional participation
Direction: unidirectional vs bidirectional navigation
Cascade behavior: what happens on delete/update

Attribute Analysis

Attribute Types:

Simple: single atomic value (name, price)
Composite: structured value (address = street + city + postal)
Derived: calculated from other attributes (age from birthdate)
Multi-valued: repeating values (phone numbers, tags)

Key Types:

Natural key: business-meaningful identifier (SSN, ISBN)
Surrogate key: system-generated identifier (UUID, auto-increment)
Composite key: multiple columns forming identity
Candidate key: any attribute(s) that could serve as primary key

Best Practice: Prefer surrogate keys for primary keys; use natural keys as unique constraints.

Normalization

Normal Forms Progression

Each normal form builds on the previous. Normalize until requirements dictate otherwise.

First Normal Form (1NF)

Rule: Eliminate repeating groups; each cell contains atomic values.

Violation Example:

Order(id, customer, items: "widget,gadget,thing")

Resolution:

Order(id, customer)
OrderItem(order_id, item_name)

Second Normal Form (2NF)

Rule: Remove partial dependencies on composite keys.

Violation Example:

OrderItem(order_id, product_id, product_name, quantity)
                                 ^-- depends only on product_id

Resolution:

OrderItem(order_id, product_id, quantity)
Product(product_id, product_name)

Third Normal Form (3NF)

Rule: Remove transitive dependencies; non-key columns depend only on the key.

Violation Example:

Employee(id, department_id, department_name)
                            ^-- depends on department_id, not employee id

Resolution:

Employee(id, department_id)
Department(id, name)

Boyce-Codd Normal Form (BCNF)

Rule: Every determinant is a candidate key.

Violation Example:

CourseOffering(student, course, instructor)
-- Constraint: each instructor teaches only one course
-- instructor -> course (but instructor is not a candidate key)

Resolution:

InstructorCourse(instructor, course) -- instructor is key
Enrollment(student, instructor) -- references instructor

When to Stop Normalizing

Stop at 3NF for most OLTP systems. Consider BCNF when:

Update anomalies cause data corruption
Data integrity is paramount
Write frequency is high

Denormalization Strategies

Denormalize intentionally for read performance, not out of laziness.

Calculated Columns

Store derived values to avoid repeated computation.

Order
  - subtotal (calculated once on item changes)
  - tax_amount (calculated once)
  - total (calculated once)

Trade-off: Faster reads, more complex writes, potential consistency issues.

Materialized Relationships

Embed frequently-accessed related data.

Post
  - author_id
  - author_name (copied from User.name)
  - author_avatar_url (copied from User.avatar_url)

Trade-off: Eliminates joins, requires synchronization on source changes.

Aggregation Tables

Pre-compute summaries for reporting.

DailySales
  - date
  - product_id
  - units_sold (sum)
  - revenue (sum)

Trade-off: Fast analytics, storage overhead, stale until refreshed.

Denormalization Decision Matrix

Factor	Normalize	Denormalize
Write frequency	High	Low
Read frequency	Low	High
Data consistency	Critical	Eventual OK
Query complexity	Simple	Complex joins
Data size	Small	Large

NoSQL Data Modeling Patterns

Document Stores (MongoDB, DynamoDB)

Embedding Pattern: Embed related data that is read together and has 1:few relationship.

{
  "order_id": "123",
  "customer": {
    "id": "456",
    "name": "Jane Doe",
    "email": "jane@example.com"
  },
  "items": [
    {"product_id": "A1", "name": "Widget", "quantity": 2}
  ]
}

Referencing Pattern: Reference related data when it changes independently or is shared.

{
  "order_id": "123",
  "customer_id": "456",
  "item_ids": ["A1", "B2"]
}

Hybrid Pattern: Embed summary data, reference for full details.

{
  "order_id": "123",
  "customer_summary": {
    "id": "456",
    "name": "Jane Doe"
  },
  "items": [
    {"product_id": "A1", "name": "Widget", "quantity": 2}
  ]
}

Key-Value Stores

Access Pattern Design: Design keys around query patterns.

USER:{user_id} -> user data
USER:{user_id}:ORDERS -> list of order ids
ORDER:{order_id} -> order data

Composite Keys: Combine entity type with identifiers for namespacing.

Wide-Column Stores (Cassandra, HBase)

Partition Key Design: Choose partition keys for even distribution and access locality.

Primary Key: (user_id, order_date)
             ^-- partition key (distribution)
                       ^-- clustering column (ordering)

Avoid:

High-cardinality partition keys causing hot spots
Large partitions exceeding recommended sizes
Scatter-gather queries across partitions

Graph Databases

Node and Relationship Design:

Nodes: entities with properties
Relationships: named, directed, with properties
Labels: categorize nodes for efficient traversal

(User)-[:PURCHASED {date, amount}]->(Product)
(User)-[:FOLLOWS]->(User)
(Product)-[:BELONGS_TO]->(Category)

Schema Evolution Strategies

Additive Changes (Safe)

Add new nullable columns
Add new tables
Add new indexes
Add new optional fields (NoSQL)

Breaking Changes (Require Migration)

Remove columns/tables
Rename columns/tables
Change data types
Add non-nullable columns without defaults

Migration Patterns

Expand-Contract Pattern:

Add new column alongside old
Backfill new column from old
Update application to use new column
Remove old column

Blue-Green Schema:

Create new version of schema
Dual-write to both versions
Migrate reads to new version
Drop old version

Versioned Documents (NoSQL):

{
  "_schema_version": 2,
  "name": "Jane",
  "email": "jane@example.com"
}

Handle multiple versions in application code during transition.

Best Practices

Model the domain first, then optimize for access patterns
Use surrogate keys for primary keys; natural keys as unique constraints
Normalize to 3NF for OLTP; denormalize deliberately for read-heavy loads
Document all foreign key relationships and cascade behaviors
Version control all schema changes as migration scripts
Test migrations on production-like data volumes
Consider query patterns when designing NoSQL schemas
Plan for schema evolution from day one

Anti-Patterns

Designing schemas around UI forms instead of domain concepts
Using generic columns (field1, field2, field3)
Entity-Attribute-Value (EAV) for structured data
Storing comma-separated values in single columns
Circular foreign key dependencies
Missing indexes on foreign key columns
Hard-deleting data without soft-delete consideration
Ignoring temporal aspects (effective dates, audit trails)

References

templates/schema-design-template.md - Structured template for schema documentation

data-modeling

Install Skill

SKILL.md

Data Modeling

When to Use

Philosophy

Entity-Relationship Modeling

Identifying Entities

Relationship Types

Attribute Analysis

Normalization

Normal Forms Progression

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Boyce-Codd Normal Form (BCNF)

When to Stop Normalizing

Denormalization Strategies

Calculated Columns

Materialized Relationships

Aggregation Tables

Denormalization Decision Matrix

NoSQL Data Modeling Patterns

Document Stores (MongoDB, DynamoDB)

Key-Value Stores

Wide-Column Stores (Cassandra, HBase)

Graph Databases

Schema Evolution Strategies

Additive Changes (Safe)

Breaking Changes (Require Migration)

Migration Patterns

Best Practices

Anti-Patterns

References