| name | gcp-bq-table-management |
| description | Use when creating BigQuery tables, implementing partitioning or clustering, managing table schemas, or optimizing table structure. Covers time-based partitioning, range partitioning, clustering strategies, DDL commands, and table configuration. |
BigQuery Table Management
Use this skill when creating, modifying, or optimizing BigQuery table structures with partitioning and clustering.
Creating Tables
Basic Table Creation
Using bq mk:
bq mk -t \
--schema 'customer_id:STRING,amount:FLOAT,date:DATE' \
--description "Customer orders table" \
project:dataset.orders
Using SQL DDL:
CREATE TABLE `project.dataset.orders` (
customer_id STRING,
amount FLOAT64,
date DATE,
created_at TIMESTAMP
);
Partitioning Strategies
Time-Based Partitioning
Create time-partitioned table:
bq mk -t \
--schema 'timestamp:TIMESTAMP,customer_id:STRING,amount:FLOAT' \
--time_partitioning_field timestamp \
--time_partitioning_type DAY \
project:dataset.orders
SQL DDL version:
CREATE TABLE `project.dataset.orders` (
timestamp TIMESTAMP,
customer_id STRING,
amount FLOAT64
)
PARTITION BY DATE(timestamp);
Partitioning options:
- DAY - Daily partitions (most common)
- HOUR - Hourly partitions (for high-volume data)
- MONTH - Monthly partitions (for historical data; see the example after this list)
- YEAR - Yearly partitions (for very old data)
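For coarser granularity, wrap the column in TIMESTAMP_TRUNC (or DATE_TRUNC for DATE columns). A minimal sketch of monthly partitioning, reusing the hypothetical orders schema from above:
CREATE TABLE `project.dataset.orders_monthly` (
timestamp TIMESTAMP,
customer_id STRING,
amount FLOAT64
)
PARTITION BY TIMESTAMP_TRUNC(timestamp, MONTH);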
Ingestion-Time Partitioning
Create table with automatic _PARTITIONTIME:
bq mk -t \
--schema 'customer_id:STRING,amount:FLOAT' \
--time_partitioning_type DAY \
project:dataset.orders
Query with ingestion-time partition:
SELECT * FROM `project.dataset.orders`
WHERE _PARTITIONTIME >= '2024-01-01'
Range Partitioning
Create range-partitioned table:
bq mk -t \
--schema 'customer_id:INTEGER,region:STRING,sales:FLOAT' \
--range_partitioning=customer_id,0,100,10 \
project:dataset.sales
Parameters: field,start,end,interval
- Creates partitions: [0,10), [10,20), [20,30), ..., [90,100)
SQL DDL version:
CREATE TABLE `project.dataset.sales` (
customer_id INT64,
region STRING,
sales FLOAT64
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10));
Clustering
Basic Clustering
Create clustered table:
bq mk -t \
--schema 'timestamp:TIMESTAMP,customer_id:STRING,product_id:STRING,amount:FLOAT' \
--clustering_fields customer_id,product_id \
project:dataset.orders
SQL DDL version:
CREATE TABLE `project.dataset.orders` (
timestamp TIMESTAMP,
customer_id STRING,
product_id STRING,
amount FLOAT64
)
CLUSTER BY customer_id, product_id;
Clustering rules:
- Up to 4 clustering columns
- Order matters (first column has most impact)
- Works best with WHERE, GROUP BY, JOIN filters
Partitioning + Clustering (Recommended)
Combined approach:
bq mk -t \
--schema 'timestamp:TIMESTAMP,customer_id:STRING,transaction_amount:FLOAT' \
--time_partitioning_field timestamp \
--clustering_fields customer_id \
--description "Partitioned by day, clustered by customer" \
project:dataset.transactions
SQL DDL version:
CREATE TABLE `project.dataset.transactions` (
timestamp TIMESTAMP,
customer_id STRING,
transaction_amount FLOAT64
)
PARTITION BY DATE(timestamp)
CLUSTER BY customer_id;
Query benefits:
-- Partition pruning + clustering optimization
SELECT * FROM `project.dataset.transactions`
WHERE DATE(timestamp) = '2024-01-15' -- Partition filter
AND customer_id = 'CUST123' -- Cluster filter
Table Configuration Options
Expiration
Set table expiration:
bq mk -t \
--expiration 2592000 \
--schema 'field:TYPE' \
project:dataset.temp_table
Expiration in seconds: 2592000 = 30 days
Update existing table:
bq update --expiration 604800 project:dataset.table
Remove expiration:
bq update --expiration 0 project:dataset.table
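In SQL, the same settings map to the expiration_timestamp table option; a sketch using the same hypothetical table:
-- Set a 30-day expiration (mirrors --expiration 2592000)
ALTER TABLE `project.dataset.temp_table`
SET OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY));
-- Remove the expiration
ALTER TABLE `project.dataset.temp_table`
SET OPTIONS (expiration_timestamp = NULL);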
Labels
Add labels:
bq mk -t \
--schema 'field:TYPE' \
--label environment:production \
--label team:analytics \
project:dataset.table
Update labels:
bq update --set_label environment:staging project:dataset.table
Description
Set description:
bq update \
--description "Customer transaction history with daily partitioning" \
project:dataset.transactions
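Description and labels can also be managed in SQL via ALTER TABLE SET OPTIONS (table name here is a placeholder):
ALTER TABLE `project.dataset.transactions`
SET OPTIONS (
description = "Customer transaction history with daily partitioning",
labels = [("environment", "production"), ("team", "analytics")]
);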
Schema Management
Adding Columns
You cannot add a REQUIRED column to an existing table; new columns must be NULLABLE or REPEATED:
# Add optional column
bq query --use_legacy_sql=false \
'ALTER TABLE `project.dataset.table`
ADD COLUMN new_field STRING'
Changing Column Modes
REQUIRED → NULLABLE (allowed):
ALTER TABLE `project.dataset.table`
ALTER COLUMN field_name DROP NOT NULL;
NULLABLE → REQUIRED (not allowed on an existing table; recreate or overwrite the table instead)
Relaxing Column Types
Allowed type changes (widening numeric conversions):
- INT64 → NUMERIC ✅
- INT64 → BIGNUMERIC ✅
- INT64 → FLOAT64 ✅
- NUMERIC → BIGNUMERIC, FLOAT64 ✅
Narrowing conversions and conversions to STRING are not supported by ALTER COLUMN SET DATA TYPE.
Example:
ALTER TABLE `project.dataset.table`
ALTER COLUMN amount SET DATA TYPE FLOAT64;
External Tables
Create External Table (GCS)
CSV in GCS:
bq mk \
--external_table_definition='customer_id:STRING,amount:FLOAT@CSV=gs://bucket/*.csv' \
project:dataset.external_orders
Parquet in GCS (schema auto-detected):
bq mk \
--external_table_definition=PARQUET=gs://bucket/*.parquet \
project:dataset.external_data
Definition format: [schema@]FORMAT=URI (the schema part can be omitted for self-describing formats)
Supported formats: CSV, JSON, AVRO, PARQUET, ORC
External Table Limitations
- No DML operations (INSERT, UPDATE, DELETE)
- No guaranteed performance SLAs
- Data must be in GCS, Drive, or Bigtable
- Cannot be partitioned like native tables, though hive-style partitioning of GCS paths is supported (see the example after this list)
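For data laid out in hive-style paths (for example gs://bucket/events/dt=2024-01-15/), the partition keys can be inferred from the path. A sketch with hypothetical bucket paths:
CREATE EXTERNAL TABLE `project.dataset.external_events`
WITH PARTITION COLUMNS
OPTIONS (
format = 'PARQUET',
uris = ['gs://bucket/events/*'],
hive_partition_uri_prefix = 'gs://bucket/events'
);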
Snapshots and Clones
Table Snapshots
Create snapshot:
CREATE SNAPSHOT TABLE `project.dataset.orders_snapshot`
CLONE `project.dataset.orders`;
Restore from snapshot:
CREATE OR REPLACE TABLE `project.dataset.orders`
CLONE `project.dataset.orders_snapshot`;
Snapshot retention: snapshots are kept until they expire or you delete them; set an expiration explicitly, otherwise the dataset's default table expiration (if any) applies.
Table Clones
Create table clone:
CREATE TABLE `project.dataset.orders_clone`
CLONE `project.dataset.orders`;
Difference from snapshot:
- Clone = writable, independent copy (storage is billed only for data that diverges from the base table)
- Snapshot = read-only, point-in-time copy
Time Travel
Query historical data:
-- Query table as it was 1 hour ago
SELECT * FROM `project.dataset.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
-- Query table at specific time
SELECT * FROM `project.dataset.orders`
FOR SYSTEM_TIME AS OF '2024-01-15 10:00:00 UTC';
Time travel window: 7 days (168 hours) by default
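The window can be shortened at the dataset level (48 to 168 hours, in multiples of 24); a sketch assuming a dataset named analytics:
ALTER SCHEMA analytics
SET OPTIONS (max_time_travel_hours = 48);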
Best Practices
Partition Selection
Use time-based partitioning when:
- Data has timestamp/date column
- Queries filter by time ranges
- Data arrives chronologically
- Want automatic partition management
Use range partitioning when:
- Partitioning on integer column (ID, age, etc.)
- Predictable value distribution
- Fixed range boundaries
Use ingestion-time partitioning when:
- No natural timestamp column
- Loading data from streaming sources
- Want simple partition management
Clustering Selection
Cluster on columns that are:
- Frequently used in WHERE clauses
- Used in JOIN conditions
- Used in GROUP BY
- High cardinality (many distinct values)
Order matters:
- Most filtered column first
- Then second most filtered
- Up to 4 columns total
Partitioning + Clustering Strategy
Optimal pattern:
CREATE TABLE `project.dataset.optimized` (
event_timestamp TIMESTAMP, -- Partition on this
customer_id STRING, -- Cluster on this (1st)
product_category STRING, -- Cluster on this (2nd)
amount FLOAT64
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY customer_id, product_category;
Query pattern:
-- Both partition and cluster benefit
SELECT SUM(amount)
FROM `project.dataset.optimized`
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
AND customer_id = 'CUST123'
GROUP BY product_category;
Checking Table Metadata
Get table information:
bq show --format=prettyjson project:dataset.table
Check partition info:
SELECT
partition_id,
total_rows,
total_logical_bytes,
last_modified_time
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'orders'
ORDER BY partition_id DESC
LIMIT 10;
Check clustering info:
SELECT
table_name,
clustering_ordinal_position,
column_name
FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE clustering_ordinal_position IS NOT NULL
ORDER BY table_name, clustering_ordinal_position;
Common Pitfalls
❌ Too many partitions
- Problem: Creating 100,000+ partitions
- Limit: 10,000 partitions per table
- Solution: Use coarser partition granularity (MONTH instead of DAY)
❌ Wrong partition column
- Problem: Partitioning on a column not used in queries
- Solution: Partition on columns used in WHERE clauses
❌ Not filtering on partition
- Problem: Query scans all partitions
- Solution: Always include a partition filter in WHERE (see the example after this list)
❌ Clustering too many columns
- Problem: Clustering on 5+ columns
- Limit: 4 columns maximum
- Solution: Choose the most selective columns
❌ Wrong cluster order
- Problem: Least selective column first
- Solution: Put the most selective column first
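To guard against accidental full scans from a missing partition filter, the filter can be made mandatory; a sketch using the hypothetical transactions table from earlier:
ALTER TABLE `project.dataset.transactions`
SET OPTIONS (require_partition_filter = TRUE);
-- Queries that do not filter on the partitioning column now fail instead of scanning every partition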
Table Maintenance
Update partition expiration
Set partition expiration:
bq update \
--time_partitioning_expiration 2592000 \
project:dataset.partitioned_table
Effect: partitions older than 30 days (2,592,000 seconds) are deleted automatically.
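The SQL equivalent uses the partition_expiration_days table option:
ALTER TABLE `project.dataset.partitioned_table`
SET OPTIONS (partition_expiration_days = 30);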
Optimize table storage
Run optimization query:
-- BigQuery automatically optimizes storage
-- No manual VACUUM or OPTIMIZE needed
BigQuery automatically:
- Compacts data
- Sorts by clustering columns
- Removes deleted rows
- Optimizes storage format
Access Control & Security
Row-Level Security
Row-level access policies filter data based on user/group membership. They coexist with column-level security.
Creating row-level policies:
CREATE ROW ACCESS POLICY policy_name
ON dataset.table
GRANT TO ("user:[email protected]")
FILTER USING (region = "US");
Multiple policies:
-- Policy for US users
CREATE ROW ACCESS POLICY us_users_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");
-- Policy for managers (see all regions)
CREATE ROW ACCESS POLICY managers_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (TRUE);
Viewing policies:
SELECT * FROM dataset.INFORMATION_SCHEMA.ROW_ACCESS_POLICIES
WHERE table_name = 'orders';
Dropping policies:
DROP ROW ACCESS POLICY policy_name ON dataset.table;
Column-Level Security
Use policy tags from Data Catalog to restrict access to sensitive columns:
Creating table with policy tags:
CREATE TABLE dataset.customers (
customer_id STRING,
name STRING,
email STRING,
ssn STRING OPTIONS(
policy_tags=["projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG"]
),
credit_score INT64 OPTIONS(
policy_tags=["projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/SENSITIVE_TAG"]
)
);
Adding policy tags to existing columns:
ALTER TABLE dataset.customers
ALTER COLUMN ssn SET OPTIONS(
policy_tags=["projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG"]
);
How it works:
- Create taxonomy and policy tags in Data Catalog
- Apply policy tags to table columns
- Grant IAM access on the policy tags (roles/datacatalog.categoryFineGrainedReader)
- Users without that role cannot select the tagged columns (see the example below)
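A user without the fine-grained reader role can still query the table as long as the protected columns are not referenced, for example by excluding them explicitly:
-- Succeeds even without access to the PII/SENSITIVE policy tags
SELECT * EXCEPT (ssn, credit_score)
FROM dataset.customers;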
Authorized Views
Views that allow users to query data without direct table access:
Use cases:
- Sharing specific columns/rows without full table access
- Implementing business logic in access control
- Best performance for row/column filtering
Setup process:
-- 1. Create the view in a separate dataset (dataset B)
CREATE VIEW datasetB.public_orders AS
SELECT order_id, customer_id, amount, order_date
FROM datasetA.orders
WHERE status = 'completed';
-- 2. Authorize the view on dataset A so it can read the table
-- (in the Cloud Console under dataset A's sharing settings, or by editing the dataset's access list with bq)
bq show --format=prettyjson project:datasetA > datasetA.json
# add the view to the "access" list in datasetA.json, then:
bq update --source datasetA.json project:datasetA
Example authorized view:
-- View in shared_views dataset
CREATE VIEW shared_views.customer_summary AS
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spent
FROM private_data.orders
GROUP BY customer_id;
-- Grant access to view (not underlying table)
-- Users can query shared_views.customer_summary
-- but cannot access private_data.orders
Benefits:
- Row/column filtering without policy overhead
- Business logic in SQL (e.g., only show completed orders)
- Best query performance
- Centralized access control
Security Best Practices
1. Layered security:
- Use row-level policies for user-based filtering
- Use column-level security for sensitive data (PII, PHI)
- Use authorized views for complex access patterns
2. Performance:
- Authorized views: Best performance (compiled into query)
- Row-level policies: Slight overhead (filter applied)
- Column-level: No performance impact
3. Combining approaches:
-- Table with column-level security AND row-level policy
CREATE TABLE dataset.sensitive_data (
user_id STRING,
region STRING,
created_at TIMESTAMP,
ssn STRING OPTIONS(policy_tags=["..."]),
data JSON
)
PARTITION BY DATE(created_at);
-- Row-level policy
CREATE ROW ACCESS POLICY regional_access
ON dataset.sensitive_data
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");
4. Auditing: Monitor access with Cloud Audit Logs (this query assumes the data-access audit logs are exported to a BigQuery dataset via a log sink):
SELECT
timestamp,
protopayload_auditlog.authenticationInfo.principalEmail,
protopayload_auditlog.resourceName,
protopayload_auditlog.methodName
FROM `PROJECT.DATASET.cloudaudit_googleapis_com_data_access_*`
WHERE resource.type LIKE 'bigquery%'
ORDER BY timestamp DESC;
Quick Reference
Partition types:
- Time-based: HOUR, DAY, MONTH, YEAR
- Ingestion-time: Automatic _PARTITIONTIME
- Range: Integer column ranges
Clustering:
- Max 4 columns
- Order matters
- Works with or without partitioning
Security:
- Row-level: Filter by user/group
- Column-level: Policy tags for sensitive data
- Authorized views: Business logic filtering
Limits:
- 10,000 partitions per table
- 4 clustering columns
- 7-day time travel window
- 10,000 columns per table
- 100 row-level policies per table