---
name: bigquery-etl-core
description: The core skill for working within the bigquery-etl repository. Use this skill when understanding project structure, conventions, and common patterns. Works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.
---
# BigQuery ETL Core

**Composable**: Foundation skill that works with the model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.

**When to use**: Understanding project structure, conventions, common patterns, and finding schema descriptions during construction.
## Project Overview

The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.

Each table/query typically consists of three files:

- `query.sql` OR `query.py` - The query definition (SQL or Python)
- `metadata.yaml` - Metadata about scheduling, ownership, and dependencies (see the metadata-manager skill)
- `schema.yaml` - BigQuery schema definition with field types and descriptions (see the metadata-manager skill)

**Note**: Most tables use `query.sql` (~95%). Use `query.py` for API calls, multi-project queries, or complex Python operations. See the query-writer skill for details.
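For orientation, the general shape of a `query.py` ETL can be sketched as below. This is a hedged, stdlib-only illustration, not the repository's actual template: the function names and the NDJSON output are assumptions, and real `query.py` files follow repo conventions (see the query-writer skill).

```python
"""Hypothetical sketch of a query.py-style ETL script (illustrative only)."""
import argparse
import json
from datetime import date


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description=__doc__)
    # ETL runs are parameterized by the logical run date, analogous to
    # the @submission_date parameter in query.sql files.
    parser.add_argument("--date", type=date.fromisoformat, required=True)
    return parser.parse_args(argv)


def extract(run_date):
    """Placeholder for the real extraction step (API call, cross-project query, ...)."""
    return [{"submission_date": run_date.isoformat(), "n_events": 0}]


def main(argv=None):
    args = parse_args(argv)
    rows = extract(args.date)
    # Emit newline-delimited JSON; a real script might instead write rows
    # into the destination table via a BigQuery client library.
    return "\n".join(json.dumps(row) for row in rows)


if __name__ == "__main__":
    print(main())
```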
## 🚨 REQUIRED READING - Start Here

When starting work in bigquery-etl, READ these foundational references:

1. **Naming Conventions**: READ `references/naming_conventions.md`
   - Table naming patterns
   - Dataset organization
   - Version suffix conventions
2. **Dataset Organization**: READ `references/dataset_naming_conventions.md`
   - Common dataset suffixes (`_derived`, `_stable`, `_live`)
   - When to use each dataset type
   - Dataset naming rules
3. **Schema Resources**: READ `references/discovery_resources.md`
   - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
   - Priority order for schema lookup during construction
   - Common mozfun UDFs
4. **Privacy Guidelines**: READ `references/privacy_guidelines.md`
   - Data handling requirements
   - PII considerations
   - Workgroup access patterns
## Directory Structure

```
sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml
```

See `assets/directory_structure_example.txt` for detailed examples.

Key principles:

- **Always flat**: `sql/{project}/{dataset}/{table_name}/` - never use subdirectories within table directories
- Table names always include a version suffix (`_v1`, `_v2`, etc.)
## Schema & Description Resources for Construction

### Finding Schema Descriptions

Priority order for schema lookup during construction:

1. **Local files first**: Check `sql/*/schema.yaml` and `metadata.yaml` files
   - Most reliable and up-to-date source
   - Contains field descriptions written by table owners
2. **Glean Dictionary**: For `_live` and `_stable` tables
   - URL: https://dictionary.telemetry.mozilla.org/
   - Contains metric descriptions from Glean schema definitions
   - Use WebFetch with targeted prompts to extract specific field descriptions
3. **ProbeInfo API**: For Glean metric metadata
   - Endpoint: `https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics`
   - Provides metric definitions and descriptions programmatically
   - Use for validating metric references in queries
4. **DataHub MCP**: Only as a last resort
   - MUST READ `references/datahub_best_practices.md` BEFORE any DataHub queries
   - Use for schema lookup when not available in local files or the Glean Dictionary
   - Extract ONLY necessary fields (column names, types, descriptions)
   - Use for downstream impact analysis when modifying tables
See `references/discovery_resources.md` for:
- Detailed guidance on each schema source
- ProbeInfo API endpoints and usage patterns
- Glean Dictionary URL patterns for different products
- DataHub MCP best practices for construction
- Common mozfun UDFs
- Key documentation links
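The ProbeInfo lookup described above can be sketched in Python. The endpoint pattern is taken from this document; the helper names and the assumed response shape (a dict keyed by metric name with a `history` list of definitions) are illustrative and should be verified against the live API before relying on them.

```python
"""Sketch of querying the ProbeInfo API for Glean metric descriptions."""
import json
import urllib.request

# Endpoint pattern from references/discovery_resources.md.
PROBEINFO_GLEAN_METRICS = "https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics"


def metrics_url(product: str) -> str:
    """Build the per-product metrics endpoint, e.g. for 'fenix'."""
    return PROBEINFO_GLEAN_METRICS.format(product=product)


def fetch_metric_descriptions(product: str) -> dict:
    """Fetch metric definitions and return {metric_name: latest_description}.

    Requires network access. The response shape assumed here (history list
    with per-entry descriptions) should be confirmed against the API.
    """
    with urllib.request.urlopen(metrics_url(product)) as resp:
        metrics = json.load(resp)
    return {
        name: info.get("history", [{}])[-1].get("description", "")
        for name, info in metrics.items()
    }
```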
## Naming Conventions

**Table names:**

- Use snake_case with a version suffix: `clients_daily_event_v1`
- Common suffixes: `_daily`, `_hourly`, `_aggregates`, `_summary`

**Field names:**

- Use snake_case: `submission_date`, `client_id`, `n_total_events`
- Prefix counts with `n_`: `n_events`, `n_sessions`
- Standard Mozilla fields: `submission_date`, `client_id`, `sample_id`, `normalized_channel`, `normalized_country_code`, `app_version`

See `references/naming_conventions.md` for:

- Complete naming patterns and conventions
- Reserved/common patterns to avoid
- BigQuery project naming conventions
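As a quick sanity check, the naming rules above can be encoded as regexes. This is an illustrative sketch, not a repo tool; `references/naming_conventions.md` remains the authority on edge cases.

```python
"""Regex sanity checks for the table and field naming conventions."""
import re

# snake_case base name followed by a required version suffix like _v1, _v2, ...
TABLE_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*_v[0-9]+$")
# snake_case field names: lowercase words separated by single underscores.
FIELD_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def is_valid_table_name(name: str) -> bool:
    """True for names like 'clients_daily_event_v1'."""
    return TABLE_NAME_RE.fullmatch(name) is not None


def is_valid_field_name(name: str) -> bool:
    """True for names like 'submission_date' or 'n_total_events'."""
    return FIELD_NAME_RE.fullmatch(name) is not None
```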
## Dataset Organization

See `references/dataset_naming_conventions.md` for:

- Dataset naming patterns by suffix (`_derived`, `_external`, etc.)
- Common dataset prefixes by product/source
- Table versioning patterns
- Incremental vs full refresh query patterns
## Privacy & Data Handling

Mozilla follows strict data privacy policies:

- No PII in derived tables
- Use client-level identifiers (`client_id`), not individual identifiers
- Respect data retention policies (~2 years for client-level data)
- Label client-level tables with `table_type: client_level` in metadata.yaml
See `references/privacy_guidelines.md` for:
- Key principles from Mozilla's data platform
- Geo IP lookup and user agent parsing policies
- Best practices for data handling
- Deletion request support
- Sample ID usage for sampling
## BigQuery & Mozilla Conventions

### Partitioning & Clustering

- Most tables use day partitioning on `submission_date`
- Clustering improves query performance for filtered/joined fields
- See the metadata-manager skill for detailed partitioning and clustering configuration

### Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:

- `mozfun.map.get_key()` - Extract values from key-value maps
- `mozfun.norm.truncate_version()` - Normalize version strings
- `mozfun.stats.mode_last()` - Statistical mode calculation

UDF source code lives in the `sql/mozfun/` directory.
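To make the UDF and partitioning conventions concrete, here is a hedged sketch of a query carried as a Python string, as a `query.py` might do. The table name and the `event_extra` column are hypothetical; confirm exact mozfun signatures in `sql/mozfun/` or the mozfun docs before use.

```python
# Illustrative only: `example_events_v1` and `event_extra` are hypothetical,
# and the mozfun signatures shown should be confirmed in sql/mozfun/.
QUERY = """
SELECT
  submission_date,
  client_id,
  -- Normalize a raw version string such as '119.0.1' to its major version.
  mozfun.norm.truncate_version(app_version, 'major') AS app_major_version,
  -- Extract one value from a key-value map column.
  mozfun.map.get_key(event_extra, 'engine') AS engine
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_events_v1`
WHERE
  -- Always filter day-partitioned tables on the partition column;
  -- @submission_date is the standard run-date query parameter.
  submission_date = @submission_date
"""
```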
## Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

Key concepts:

- **Metric types**: counter, boolean, string, event, etc.
- **Pings**: Collections of metrics (e.g., `baseline`, `events`, `metrics`)
- **Applications**: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

Common Glean datasets in BigQuery:

- Pattern: `{app_id}.{ping_name}` (e.g., `org_mozilla_fenix.baseline`)
- All have auto-generated schemas based on metric definitions

See `references/glean_overview.md` for:

- What Glean is and how it differs from Firefox Desktop Telemetry
- Glean SDK and metric type details
- Common Glean datasets in BigQuery
- When to use the Glean Dictionary
## bigquery-etl CLI Commands

See `references/bqetl_cli_commands.md` for:

- Key bqetl CLI commands for query creation, validation, and schema updates
- How to find the right DAG for scheduling
- Backfill creation commands
## Best Practices
General principles:
- Always include field descriptions in schema.yaml (see metadata-manager skill)
- Add header comments explaining query purpose (see query-writer skill)
- Reference bug/ticket numbers for context
- Document any data exclusions or filtering logic
See `assets/query_structure_example.sql` for the standard query structure.

Version migration:

- Create a new `_v2` table when making breaking schema changes
- Keep `_v1` running during the migration period
- Update views to point to the new version
- Coordinate with downstream consumers before deprecating the old version
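The "update views" step can be illustrated with a small helper. The dataset layout assumed here (an unversioned user-facing view over a versioned `_derived` table) is an assumption for illustration; actual view definitions live in the repo and may carry additional logic.

```python
"""Hypothetical helper for pointing a stable view at a new table version."""


def view_ddl(project: str, dataset: str, table: str, version: int) -> str:
    """Return DDL that repoints an unversioned view at a versioned table.

    Assumes the common layout where the view lives in `{dataset}` and the
    versioned table in `{dataset}_derived` - verify per table before use.
    """
    return (
        f"CREATE OR REPLACE VIEW `{project}.{dataset}.{table}` AS\n"
        f"SELECT * FROM `{project}.{dataset}_derived.{table}_v{version}`"
    )
```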
For detailed best practices, see:
- Query writing: query-writer skill
- Metadata configuration: metadata-manager skill
- Performance optimization: https://docs.telemetry.mozilla.org/cookbooks/bigquery/optimization.html
- Recommended practices: https://mozilla.github.io/bigquery-etl/reference/recommended_practices/
## Integration with Other Skills
bigquery-etl-core serves as the foundation skill that other skills build upon:
### Works with model-requirements
- Provides naming conventions for new tables and datasets
- Supplies common field naming patterns for requirements gathering
- Offers privacy guidelines for data model planning
### Works with query-writer
- Provides project structure and naming conventions
- Supplies common patterns and mozfun UDF references
- Offers schema description lookup guidance for construction
### Works with metadata-manager
- Provides DAG naming patterns and scheduling conventions
- Supplies partitioning and clustering best practices
- Offers ownership and labeling patterns
### Works with sql-test-generator
- Provides test structure and fixture naming conventions
- Supplies common table patterns for test creation
- Offers query parameter conventions
### Works with bigconfig-generator
- Provides table naming conventions for Bigeye monitoring configuration
- Supplies dataset organization patterns
- Offers field naming standards for data quality checks
This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.
## Reference Examples

Real query examples in the repository:

- Simple query: `sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql`
- Aggregation with GROUP BY: `sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql`
- Complex query with CTEs: `sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql`
- Python ETL (INFORMATION_SCHEMA): `sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py`
- Python ETL (external API): `sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py`

For more examples, explore the `sql/moz-fx-data-shared-prod/` directory.
## Bundled Resources

### References

- `references/discovery_resources.md` - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
- `references/naming_conventions.md` - Complete naming patterns for tables, fields, and projects
- `references/dataset_naming_conventions.md` - Dataset organization and versioning patterns
- `references/privacy_guidelines.md` - Mozilla data privacy policies and best practices
- `references/glean_overview.md` - Glean SDK concepts and BigQuery dataset structures
- `references/bqetl_cli_commands.md` - Key CLI commands and DAG discovery

### DataHub Usage (CRITICAL for Token Efficiency)

BEFORE using any DataHub MCP tools (`mcp__datahub-cloud__*`), you MUST:

1. READ `references/datahub_best_practices.md` - Comprehensive token optimization strategies
2. Follow the priority order: local files → documentation → DataHub (only as a last resort)
3. Use search-first patterns and extract minimal fields from responses

### Assets

- `assets/query_structure_example.sql` - Standard query.sql structure with common patterns
- `assets/directory_structure_example.txt` - File organization examples