---
name: data-warehouse
description: Data warehouse (big data / DW) ops skill for designing and operating analytical data platforms. Use for tasks like defining the source of truth, building ETL/ELT pipelines, dimensional modeling (star schema), data quality checks, partitioning strategies, cost/performance tuning, governance, lineage, and SLA monitoring.
---
# data-warehouse

Use this skill for big data / data warehouse (DW) build-out and operations: from data ingestion to metric delivery and governance.
## Defaults / assumptions to confirm
- Warehouse tech: BigQuery/Snowflake/Redshift/Hive/ClickHouse/etc.
- Orchestration & transformation: Airflow/Dagster/Argo for scheduling, dbt for in-warehouse transforms
- Ingestion: CDC vs batch, streaming (Kafka) vs file
- Data consumers: BI dashboards, product analytics, ML features
## Core outputs
- Warehouse architecture (sources → staging → warehouse → marts)
- Data model (facts/dimensions) + metric definitions
- Pipeline plan (DAGs, schedules, dependencies, SLAs)
- Data quality plan (checks, thresholds, alerts)
- Cost/performance plan (partitioning, clustering, materialization)
- Governance plan (access control, PII handling, retention)
## Workflow
- Understand the business questions
  - What decisions will this warehouse support?
  - Define critical metrics and their definitions (single source of truth).
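The "single source of truth" step above can be sketched as a small in-code metric registry so every consumer reads the same definition. This is a hypothetical sketch; the metric name, fields, and SQL expression are illustrative, not part of this skill's spec:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    definition: str      # human-readable business definition
    sql_expression: str  # canonical expression every model must reuse
    time_grain: str      # e.g. "day", "week"

# Registry keyed by metric name; frozen dataclasses keep definitions immutable.
METRICS = {
    m.name: m
    for m in [
        Metric(
            name="weekly_active_users",
            definition="Distinct users with >= 1 event in a 7-day window",
            sql_expression="COUNT(DISTINCT user_id)",
            time_grain="week",
        ),
    ]
}

def get_metric(name: str) -> Metric:
    # Fail loudly on undefined metrics instead of letting teams improvise.
    return METRICS[name]
```

Keeping definitions in one registry (or a dbt/semantic-layer equivalent) is what makes "single source of truth" enforceable rather than aspirational.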
- Source & ingestion design
  - Identify systems of record and ownership.
  - Choose ingestion: CDC for mutable OLTP tables, append-only logs for events.
  - Define a late-arriving-data strategy and backfill procedures.
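One common late-arriving-data strategy is a rolling reprocessing window: each run reprocesses the most recent N daily partitions so late rows are picked up without a manual backfill. A minimal sketch, assuming daily partitions and an illustrative 3-day default lookback:

```python
from datetime import date, timedelta

def partitions_to_process(run_date: date, lookback_days: int = 3) -> list[date]:
    """Return the run date plus the trailing `lookback_days` partitions,
    oldest first, so each run re-materializes recent days idempotently."""
    return [run_date - timedelta(days=d) for d in range(lookback_days, -1, -1)]
```

The lookback should be sized from observed event lateness (e.g. the p99 arrival delay), and a separate one-off backfill path still covers anything older.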
- Modeling (practical)
  - Prefer a star schema for BI: fact tables + dimension tables.
  - Keep the grain explicit (what does one row represent?).
  - Separate raw/staging layers from curated models; avoid mixing them.
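"Keep the grain explicit" is testable: if the declared grain of a fact table is (order_id, order_date), then no two rows may share that key. A minimal grain check, with assumed example column names:

```python
from collections import Counter

def grain_violations(rows: list[dict], grain: tuple[str, ...]) -> list[tuple]:
    """Return grain-key tuples that appear more than once, i.e. rows that
    violate the declared grain of the table."""
    counts = Counter(tuple(r[k] for k in grain) for r in rows)
    return [key for key, n in counts.items() if n > 1]
```

Running a check like this after every build (dbt's `unique` test on a surrogate key is the equivalent) catches accidental fan-out from bad joins early.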
- ETL/ELT pipelines
  - Define DAGs with clear inputs/outputs and idempotent tasks.
  - Handle retries, partial failures, and reprocessing windows.
  - Provide backfill procedures and a runbook.
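Idempotency in this context usually means delete-then-write (or overwrite) of a whole partition, so retries and backfills converge to the same state. A sketch using an in-memory dict as a stand-in for a real table; table and column names are illustrative:

```python
def load_partition(warehouse: dict, table: str, ds: str, rows: list[dict]) -> None:
    """Overwrite one date partition (`ds`, an ISO date string) atomically.
    Overwrite-never-append is what makes reruns and retries safe."""
    partitions = warehouse.setdefault(table, {})
    partitions[ds] = list(rows)

wh: dict = {}
load_partition(wh, "fct_orders", "2024-01-01", [{"order_id": 1}])
# A retry of the same run produces the same partition, not duplicates:
load_partition(wh, "fct_orders", "2024-01-01", [{"order_id": 1}])
```

The same pattern maps to `INSERT OVERWRITE` in Hive/Spark or partition-replace writes in BigQuery/Snowflake.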
- Data quality & observability
  - Validate freshness, volume, schema drift, null ratios, and (logical) referential consistency.
  - Add anomaly detection for key metrics.
  - Track pipeline success rate, duration, and SLA misses.
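Two of the checks above, freshness and null ratio, can be sketched in a few lines; the 24-hour lag threshold is an assumed default to be tuned per table:

```python
from datetime import datetime, timedelta

def is_fresh(latest_ts: datetime, now: datetime,
             max_lag: timedelta = timedelta(hours=24)) -> bool:
    """True if the newest row landed within the allowed lag."""
    return now - latest_ts <= max_lag

def null_ratio(rows: list[dict], column: str) -> float:
    """Fraction of rows where `column` is NULL/missing; 0.0 for empty input."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)
```

In practice these run as scheduled assertions (dbt tests, Great Expectations, or custom DAG tasks) that page the owner when a threshold is crossed.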
- Performance & cost
  - Partition large fact tables by date/time; cluster by common filter/join keys.
  - Materialize expensive queries (summary tables, incremental models).
  - Control scan cost with column pruning and predicate pushdown.
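The cost argument for date partitioning is simple arithmetic: a query filtered to N partitions scans roughly N/total of the table's bytes instead of all of them. A back-of-envelope sketch with illustrative numbers:

```python
def scanned_bytes(total_bytes: int, total_partitions: int, partitions_queried: int) -> int:
    """Approximate bytes scanned when partition pruning applies, assuming
    evenly sized partitions (a simplification)."""
    return total_bytes * partitions_queried // total_partitions

# 1 TB fact table with 365 daily partitions, 7-day dashboard query:
# pruning cuts the scan from 1 TB to ~19 GB.
weekly_scan = scanned_bytes(1_000_000_000_000, 365, 7)
```

Clustering/sort keys then reduce cost further within each partition by skipping blocks that fail the filter.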
- Governance & security
  - PII classification, masking, and retention.
  - RBAC/ABAC for datasets; audit logging.
  - Document lineage and ownership for every table/model.
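Two common masking patterns worth distinguishing: deterministic pseudonymization (join-safe, same input always maps to the same token) versus redaction (display-safe, irreversible by design). A simplified sketch; real deployments need managed salts/keys and a reviewed PII policy:

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    """Deterministic salted hash: preserves joinability across tables
    without exposing the raw identifier."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Redact the local part of an email for display, keeping the domain."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain
```

Pseudonymized columns typically live in curated layers open to analysts, while raw PII stays in a restricted zone behind RBAC and audit logging.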
## Templates
### Table spec
- Name:
- Grain:
- Partition/cluster keys:
- Key columns:
- Sources:
- Refresh cadence:
- Consumers:
- Quality checks:
- Owner:
### Metric spec
- Name:
- Definition:
- Numerator/denominator:
- Filters:
- Time window:
- Known caveats: