SKILL.md

---
name: bigquery-etl-core
description: The core skill for working within the bigquery-etl repository. Use this skill when understanding project structure, conventions, and common patterns. Works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.
---

BigQuery ETL Core

Composable: Foundation skill that works with the model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.
When to use: Understanding project structure, conventions, common patterns, and finding schema descriptions for construction.

Project Overview

The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.

Each table/query typically consists of three files:

  • query.sql OR query.py - The query definition (SQL or Python)
  • metadata.yaml - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
  • schema.yaml - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

Note: Most tables use query.sql (~95%). Use query.py for API calls, multi-project queries, or complex Python operations. See query-writer skill for details.
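As an illustration, a minimal query.sql for a hypothetical daily table might look like the sketch below. The table name, source table, and purpose comment are invented for this example; see the query-writer skill and assets/query_structure_example.sql for real patterns.

```sql
-- Query for telemetry_derived.example_daily_v1 (hypothetical table)
-- Purpose: count events per client per day.
SELECT
  @submission_date AS submission_date,
  client_id,
  COUNT(*) AS n_events
FROM
  `moz-fx-data-shared-prod.telemetry_stable.event_v4`
WHERE
  DATE(submission_timestamp) = @submission_date
GROUP BY
  client_id
```

The accompanying metadata.yaml and schema.yaml for such a table are covered by the metadata-manager skill.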

🚨 REQUIRED READING - Start Here

When starting work in bigquery-etl, READ these foundational references:

  1. Naming Conventions: READ references/naming_conventions.md

    • Table naming patterns
    • Dataset organization
    • Version suffix conventions
  2. Dataset Organization: READ references/dataset_naming_conventions.md

    • Common dataset suffixes (_derived, _stable, _live)
    • When to use each dataset type
    • Dataset naming rules
  3. Schema Resources: READ references/discovery_resources.md

    • Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
    • Priority order for schema lookup during construction
    • Common mozfun UDFs
  4. Privacy Guidelines: READ references/privacy_guidelines.md

    • Data handling requirements
    • PII considerations
    • Workgroup access patterns

Directory Structure

sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml

See assets/directory_structure_example.txt for detailed examples.

Key principles:

  • Always flat: sql/{project}/{dataset}/{table_name}/
  • Never use subdirectories within table directories
  • Table names always include version suffix (_v1, _v2, etc.)
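For instance, one of the real tables referenced later in this skill lays out as:

```
sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/
├── query.sql
├── metadata.yaml
└── schema.yaml
```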

Schema & Description Resources for Construction

Finding Schema Descriptions

Priority order for schema lookup during construction:

  1. Local files first: Check sql/*/schema.yaml and metadata.yaml files

    • Most reliable and up-to-date source
    • Contains field descriptions written by table owners
  2. Glean Dictionary: For _live and _stable tables

  3. ProbeInfo API: For Glean metric metadata

  • Endpoint pattern: https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics
    • Provides metric definitions and descriptions programmatically
    • Use for validating metric references in queries
  4. DataHub MCP: Only as last resort

    • MUST READ references/datahub_best_practices.md BEFORE any DataHub queries
    • Use for schema lookup when not available in local files or Glean Dictionary
    • Extract ONLY necessary fields (column names, types, descriptions)
    • Use for downstream impact analysis when modifying tables
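A minimal sketch of the ProbeInfo step in this lookup chain. The endpoint pattern comes from this skill; the product name "fenix" and the exact response shape (a per-metric "history" list whose last entry carries the current "description") are assumptions, so verify against a live response before relying on them.

```python
# Sketch: looking up Glean metric descriptions via the ProbeInfo API.
import json
import urllib.request

PROBEINFO_BASE = "https://probeinfo.telemetry.mozilla.org/glean"


def metrics_url(product: str) -> str:
    """Build the per-product metrics endpoint URL."""
    return f"{PROBEINFO_BASE}/{product}/metrics"


def fetch_metrics(product: str) -> dict:
    """Fetch the metrics payload for a product (network call)."""
    with urllib.request.urlopen(metrics_url(product)) as resp:
        return json.load(resp)


def extract_descriptions(metrics: dict) -> dict:
    """Map metric name -> current description, taking the latest entry
    from each metric's "history" list (assumed response shape)."""
    out = {}
    for name, entry in metrics.items():
        history = entry.get("history") or []
        if history:
            out[name] = history[-1].get("description", "")
    return out
```

A call like `extract_descriptions(fetch_metrics("fenix"))` would then yield metric-name to description pairs, useful for validating metric references in a query.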

See references/discovery_resources.md for:

  • Detailed guidance on each schema source
  • ProbeInfo API endpoints and usage patterns
  • Glean Dictionary URL patterns for different products
  • DataHub MCP best practices for construction
  • Common mozfun UDFs
  • Key documentation links

Naming Conventions

Table Names:

  • Use snake_case with version suffix: clients_daily_event_v1
  • Common suffixes: _daily, _hourly, _aggregates, _summary

Field Names:

  • Use snake_case: submission_date, client_id, n_total_events
  • Prefix counts with n_: n_events, n_sessions
  • Standard Mozilla fields: submission_date, client_id, sample_id, normalized_channel, normalized_country_code, app_version

See references/naming_conventions.md for:

  • Complete naming patterns and conventions
  • Reserved/common patterns to avoid
  • BigQuery project naming conventions

Dataset Organization

See references/dataset_naming_conventions.md for:

  • Dataset naming patterns by suffix (_derived, _external, etc.)
  • Common dataset prefixes by product/source
  • Table versioning patterns
  • Incremental vs full refresh query patterns

Privacy & Data Handling

Mozilla follows strict data privacy policies:

  • No PII in derived tables
  • Use client-level identifiers (client_id) not individual identifiers
  • Respect data retention policies (~2 years for client-level data)
  • Label client-level tables with table_type: client_level in metadata.yaml
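A hedged sketch of the relevant metadata.yaml fragment; the field layout here is illustrative and the metadata-manager skill is authoritative:

```yaml
# Hypothetical metadata.yaml fragment for a client-level table.
friendly_name: Example Daily
description: >
  Daily per-client example aggregates (illustrative).
labels:
  table_type: client_level
```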

See references/privacy_guidelines.md for:

  • Key principles from Mozilla's data platform
  • Geo IP lookup and user agent parsing policies
  • Best practices for data handling
  • Deletion request support
  • Sample ID usage for sampling

BigQuery & Mozilla Conventions

Partitioning & Clustering

  • Most tables use day partitioning on submission_date
  • Clustering improves query performance for filtered/joined fields
  • See metadata-manager skill for detailed partitioning and clustering configuration
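As a hedged sketch only, the bigquery section of metadata.yaml typically expresses this configuration roughly as follows (exact keys and values are the metadata-manager skill's domain):

```yaml
# Illustrative: day partitioning on submission_date plus clustering.
bigquery:
  time_partitioning:
    type: day
    field: submission_date
    require_partition_filter: true
  clustering:
    fields:
      - sample_id
      - normalized_channel
```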

Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:

  • mozfun.map.get_key() - Extract values from key-value maps
  • mozfun.norm.truncate_version() - Normalize version strings
  • mozfun.stats.mode_last() - Statistical mode calculation

UDF source code lives in the sql/mozfun/ directory.
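A sketch of how these UDFs typically appear in a query. The table and column names are illustrative, and the exact signatures should be checked against the mozfun documentation:

```sql
SELECT
  client_id,
  mozfun.norm.truncate_version(app_version, "major") AS major_version,
  mozfun.map.get_key(experiments, "my-experiment") AS experiment_branch
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_daily_v1`  -- hypothetical
```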

Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

Key concepts:

  • Metric types: Counter, boolean, string, event, etc.
  • Pings: Collections of metrics (e.g., baseline, events, metrics)
  • Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

Common Glean datasets in BigQuery:

  • Pattern: {app_id}.{ping_name} (e.g., org_mozilla_fenix.baseline)
  • All have auto-generated schemas based on metric definitions
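For example, a query over one day of Fenix baseline pings might look like this (the metric column is illustrative; consult the Glean Dictionary for the actual schema):

```sql
SELECT
  client_info.client_id,
  normalized_channel,
  metrics.string.glean_baseline_locale  -- illustrative metric column
FROM
  `moz-fx-data-shared-prod.org_mozilla_fenix_stable.baseline_v1`
WHERE
  DATE(submission_timestamp) = @submission_date
```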

See references/glean_overview.md for:

  • What is Glean and how it differs from Firefox Desktop Telemetry
  • Glean SDK and metric type details
  • Common Glean datasets in BigQuery
  • When to use Glean Dictionary

bigquery-etl CLI Commands

See references/bqetl_cli_commands.md for:

  • Key bqetl CLI commands for query creation, validation, schema updates
  • How to find the right DAG for scheduling
  • Backfill creation commands
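As a hedged illustration of the workflow (subcommand names and flags may differ by version; references/bqetl_cli_commands.md is authoritative):

```
# Scaffold a new query directory (hypothetical table name)
./bqetl query create telemetry_derived.example_daily_v1

# Validate formatting and dry-run the query
./bqetl query validate telemetry_derived.example_daily_v1

# Derive/update schema.yaml from the query
./bqetl query schema update telemetry_derived.example_daily_v1
```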

Best Practices

General principles:

  • Always include field descriptions in schema.yaml (see metadata-manager skill)
  • Add header comments explaining query purpose (see query-writer skill)
  • Reference bug/ticket numbers for context
  • Document any data exclusions or filtering logic

See assets/query_structure_example.sql for standard query structure.

Version migration:

  • Create new _v2 table when making breaking schema changes
  • Keep _v1 running during migration period
  • Update views to point to new version
  • Coordinate with downstream consumers before deprecating old version
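For example, the view-update step might look like the following sketch (project, dataset, and table names are hypothetical):

```sql
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.example_daily`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_daily_v2`
```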

For detailed best practices, see the references listed under Bundled Resources below.

Integration with Other Skills

bigquery-etl-core serves as the foundation skill that other skills build upon:

Works with model-requirements

  • Provides naming conventions for new tables and datasets
  • Supplies common field naming patterns for requirements gathering
  • Offers privacy guidelines for data model planning

Works with query-writer

  • Provides project structure and naming conventions
  • Supplies common patterns and mozfun UDF references
  • Offers schema description lookup guidance for construction

Works with metadata-manager

  • Provides DAG naming patterns and scheduling conventions
  • Supplies partitioning and clustering best practices
  • Offers ownership and labeling patterns

Works with sql-test-generator

  • Provides test structure and fixture naming conventions
  • Supplies common table patterns for test creation
  • Offers query parameter conventions

Works with bigconfig-generator

  • Provides table naming conventions for Bigeye monitoring configuration
  • Supplies dataset organization patterns
  • Offers field naming standards for data quality checks

This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.

Reference Examples

Real query examples in the repository:

  • Simple query: sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql
  • Aggregation with GROUP BY: sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql
  • Complex query with CTEs: sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql
  • Python ETL (INFORMATION_SCHEMA): sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py
  • Python ETL (External API): sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py

For more examples, explore the sql/moz-fx-data-shared-prod/ directory.

Bundled Resources

References

  • references/discovery_resources.md - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
  • references/naming_conventions.md - Complete naming patterns for tables, fields, and projects
  • references/dataset_naming_conventions.md - Dataset organization and versioning patterns
  • references/privacy_guidelines.md - Mozilla data privacy policies and best practices
  • references/glean_overview.md - Glean SDK concepts and BigQuery dataset structures
  • references/bqetl_cli_commands.md - Key CLI commands and DAG discovery

DataHub Usage (CRITICAL for Token Efficiency)

BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:

  • READ references/datahub_best_practices.md - Comprehensive token optimization strategies
  • Follow priority order: local files → documentation → DataHub (only as last resort)
  • Use search-first patterns and extract minimal fields from responses

Assets

  • assets/query_structure_example.sql - Standard query.sql structure with common patterns
  • assets/directory_structure_example.txt - File organization examples