| name | gcp-bq-data-loading |
| description | Use when loading data into BigQuery from CSV, JSON, Avro, Parquet files, Cloud Storage, or local files. Covers bq load command, source formats, schema detection, incremental loading, and handling parsing errors. |
BigQuery Data Loading
Use this skill when importing data into BigQuery from various file formats and sources.
Basic Load Command
bq load \
--location=LOCATION \
--source_format=FORMAT \
PROJECT:DATASET.TABLE \
SOURCE_PATH \
SCHEMA
Loading from CSV
Basic CSV Load
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv \
customer_id:STRING,amount:FLOAT,date:DATE
CSV with Schema File
# Create schema.json
echo '[
{"name": "customer_id", "type": "STRING"},
{"name": "amount", "type": "FLOAT"},
{"name": "date", "type": "DATE"}
]' > schema.json
# Load with schema file
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv \
./schema.json
CSV Options
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
--field_delimiter=',' \
--quote='"' \
--allow_quoted_newlines \
--allow_jagged_rows \
--max_bad_records=100 \
--null_marker='NULL' \
dataset.table \
gs://bucket/data.csv \
schema.json
Key flags:
- --skip_leading_rows=N - Skip header rows
- --field_delimiter=',' - Column separator (default: comma)
- --allow_quoted_newlines - Allow newlines inside quoted fields
- --allow_jagged_rows - Allow rows with missing trailing columns
- --max_bad_records=N - Tolerate up to N parsing errors
- --null_marker='NULL' - String representing NULL values
CSV with Auto-detect
bq load \
--source_format=CSV \
--autodetect \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv
Warning: Auto-detect is convenient but not recommended for production; the inferred schema may change between loads.
Loading from JSON
Newline-Delimited JSON
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
dataset.table \
gs://bucket/data.json \
customer_id:STRING,amount:FLOAT,date:DATE
Required format (one JSON object per line):
{"customer_id": "C001", "amount": 99.99, "date": "2024-01-15"}
{"customer_id": "C002", "amount": 149.99, "date": "2024-01-15"}
NOT a standard JSON array:
// ❌ This won't work
[
{"customer_id": "C001", "amount": 99.99},
{"customer_id": "C002", "amount": 149.99}
]
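If your source file is a standard JSON array, convert it to newline-delimited JSON before loading. A minimal sketch using jq (assumes data.json holds a top-level array; filenames are placeholders):
# Emit one compact JSON object per line
jq -c '.[]' data.json > data.ndjson
gsutil cp data.ndjson gs://bucket/data.json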
JSON with Auto-detect
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--autodetect \
dataset.table \
gs://bucket/data.json
JSON with Nested Fields
Schema with nested STRUCT:
[
{"name": "customer_id", "type": "STRING"},
{"name": "address", "type": "RECORD", "fields": [
{"name": "street", "type": "STRING"},
{"name": "city", "type": "STRING"},
{"name": "zip", "type": "STRING"}
]},
{"name": "orders", "type": "RECORD", "mode": "REPEATED", "fields": [
{"name": "order_id", "type": "STRING"},
{"name": "amount", "type": "FLOAT"}
]}
]
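A matching newline-delimited record for this schema looks like this (values are illustrative):
{"customer_id": "C001", "address": {"street": "1 Main St", "city": "Springfield", "zip": "12345"}, "orders": [{"order_id": "O1", "amount": 99.99}, {"order_id": "O2", "amount": 149.99}]}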
Loading from Avro
Basic Avro Load
bq load \
--source_format=AVRO \
dataset.table \
gs://bucket/data.avro
Key benefit: Schema is auto-detected from Avro metadata. No schema specification needed!
Avro with Wildcards
bq load \
--source_format=AVRO \
dataset.table \
"gs://bucket/path/to/*.avro"
Note: Use quotes around wildcard paths.
Avro Advantages
- Self-describing: Schema embedded in file
- Efficient: Compact binary format
- Type-safe: Strong typing preserved
- No schema drift: Schema always matches data
Loading from Parquet
Basic Parquet Load
bq load \
--source_format=PARQUET \
dataset.table \
gs://bucket/data.parquet
Like Avro: Schema auto-detected from Parquet metadata.
Parquet with Compression
bq load \
--source_format=PARQUET \
dataset.table \
gs://bucket/data.snappy.parquet
Supported compression: SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD
Loading Strategies
Append Data (Default)
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/new_data.csv \
schema.json
Behavior: Adds rows to existing table.
Replace Table
bq load \
--source_format=CSV \
--replace \
dataset.table \
gs://bucket/data.csv \
schema.json
Behavior: Deletes all existing data, loads new data.
Overwrite Partition
bq load \
--source_format=CSV \
--replace \
--time_partitioning_field=date \
dataset.table\$20240115 \
gs://bucket/data_20240115.csv \
schema.json
Syntax: TABLE$YYYYMMDD targets a specific partition. In bash, escape the decorator (table\$20240115, as above) or quote the table reference so the shell does not expand it.
Behavior: Replaces only that partition, leaves others intact.
Append or Skip
# Skip load if table already has data
bq load \
--source_format=CSV \
--if_not_exists \
dataset.table \
gs://bucket/data.csv \
schema.json
Loading from Cloud Storage
Single File
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv \
schema.json
Multiple Files (List)
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/file1.csv,gs://bucket/file2.csv,gs://bucket/file3.csv \
schema.json
Wildcard Pattern
bq load \
--source_format=CSV \
dataset.table \
"gs://bucket/path/data-*.csv" \
schema.json
Patterns:
- gs://bucket/*.csv - All CSV files in the bucket root
- gs://bucket/2024/data-*.csv - All matching files under a prefix
Note: BigQuery supports a single * wildcard per URI; character classes like [0-9] are not supported.
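To preview which objects a pattern matches before loading, list them with gsutil (gsutil supports richer wildcards than bq load, so keep patterns simple):
gsutil ls "gs://bucket/path/data-*.csv"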
Loading from Local Files
Local CSV
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
/path/to/local/data.csv \
schema.json
Limitation: Files are uploaded first, then loaded. Slower for large files. Use GCS for better performance.
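For large local files, a common pattern is staging to Cloud Storage first and loading from there (bucket and paths are placeholders):
# Stage locally-generated data, then load from GCS
gsutil cp /path/to/local/data.csv gs://bucket/staging/data.csv
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/staging/data.csv \
schema.json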
Schema Handling
Inline Schema
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv \
customer_id:STRING,amount:FLOAT64,order_date:DATE,active:BOOLEAN
Schema File
[
{"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
{"name": "amount", "type": "FLOAT64"},
{"name": "order_date", "type": "DATE"},
{"name": "metadata", "type": "JSON"}
]
Modes:
- REQUIRED - Field must have a value
- NULLABLE - Field can be NULL (default)
- REPEATED - Field is an array
Auto-detect Schema
bq load \
--source_format=CSV \
--autodetect \
dataset.table \
gs://bucket/data.csv
Pros:
- Quick for exploration
- No schema definition needed
Cons:
- Not reliable for production
- Schema may change between loads
- May misinterpret types
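A common middle ground: auto-detect once during exploration, then export the inferred schema and commit it for production loads:
# Capture the schema BigQuery inferred, then reuse it explicitly
bq show --schema --format=prettyjson dataset.table > schema.json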
Schema Evolution
Add new columns:
bq load \
--source_format=CSV \
--schema_update_option=ALLOW_FIELD_ADDITION \
dataset.table \
gs://bucket/data_with_new_column.csv \
schema.json
Relax required columns:
bq load \
--schema_update_option=ALLOW_FIELD_RELAXATION \
dataset.table \
gs://bucket/data.csv \
schema.json
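To allow both additions and relaxation in a single load, the flag can be repeated (a sketch; verify the behavior against your bq CLI version):
bq load \
--source_format=CSV \
--schema_update_option=ALLOW_FIELD_ADDITION \
--schema_update_option=ALLOW_FIELD_RELAXATION \
dataset.table \
gs://bucket/data.csv \
schema.json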
Compression
Compressed Files
BigQuery automatically detects compression:
# Gzip compressed CSV
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv.gz \
schema.json
Supported: GZIP for CSV and JSON; DEFLATE and SNAPPY for Avro data blocks. Parquet codecs are listed in the Parquet section above.
Performance note: Uncompressed files load faster (parallel processing). Use compression only if network/storage is bottleneck.
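If load throughput matters more than storage, one option is decompressing in transit (paths are placeholders):
# Stream-decompress a gzip CSV into an uncompressed GCS object
gsutil cat gs://bucket/data.csv.gz | gunzip | gsutil cp - gs://bucket/data.csv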
Error Handling
Allow Bad Records
bq load \
--source_format=CSV \
--max_bad_records=1000 \
dataset.table \
gs://bucket/data.csv \
schema.json
Behavior: Skip up to 1000 rows with errors, load the rest.
Ignore Unknown Values
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--ignore_unknown_values \
dataset.table \
gs://bucket/data.json \
schema.json
Behavior: Ignore JSON fields not in schema.
Validate Only (Dry Run)
bq load \
--dry_run \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv \
schema.json
Behavior: Validate schema and format without loading data.
Loading to Partitioned Tables
Time-Partitioned Table
bq load \
--source_format=CSV \
--time_partitioning_field=order_date \
--time_partitioning_type=DAY \
dataset.partitioned_orders \
gs://bucket/orders_2024.csv \
order_id:STRING,customer_id:STRING,order_date:DATE,amount:FLOAT
Load Specific Partition
# Load into 2024-01-15 partition
bq load \
--source_format=CSV \
dataset.orders\$20240115 \
gs://bucket/orders_20240115.csv \
schema.json
Range-Partitioned Table
bq load \
--source_format=CSV \
--range_partitioning=customer_id,0,1000,100 \
dataset.range_partitioned \
gs://bucket/data.csv \
customer_id:INTEGER,amount:FLOAT
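Flag format: --range_partitioning=FIELD,START,END,INTERVAL. Here customer_id values from 0 up to (but not including) 1000 are split into partitions 100 wide.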
Performance Optimization
Parallel Loading
BigQuery automatically parallelizes loads from:
- Multiple files with wildcards
- Uncompressed files (internal parallelization)
Optimal: Split large files into 1GB chunks
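A sketch of pre-splitting a large CSV before upload (row count and names are illustrative; the header is stripped first, so --skip_leading_rows is not needed when loading):
# Strip the header, split into ~1M-row chunks, upload in parallel
tail -n +2 big.csv | split -l 1000000 - chunk_ --additional-suffix=.csv
gsutil -m cp chunk_*.csv gs://bucket/chunks/
Then load with the wildcard "gs://bucket/chunks/chunk_*.csv".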
File Format Recommendations
Best performance:
- Avro - Fastest, self-describing, splittable
- Parquet - Fast, columnar, good compression
- CSV uncompressed - Good for parallel processing
- JSON - Flexible but slower
Avoid:
- Very large single files (>10GB)
- CSV with complex escaping
- Highly compressed files (limits parallelism)
Monitoring Loads
Check Load Jobs
bq ls --jobs --max_results=10
Job Details
bq show -j JOB_ID
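In scripts, block until a job completes and pick up its exit status (JOB_ID is a placeholder):
bq wait JOB_ID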
Failed Loads
SELECT
job_id,
user_email,
error_result.message as error_message,
creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
job_type = 'LOAD'
AND state = 'DONE'
AND error_result IS NOT NULL
AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY creation_time DESC;
Common Patterns
Daily Batch Load
#!/bin/bash
DATE=$(date +%Y%m%d)
bq load \
--source_format=CSV \
--replace \
dataset.daily_data\$$DATE \
gs://bucket/data_$DATE.csv \
schema.json
Incremental Load with Deduplication
# Load to staging
bq load \
--source_format=CSV \
dataset.staging_orders \
gs://bucket/new_orders.csv \
schema.json
# Merge to production (dedup)
bq query --use_legacy_sql=false '
MERGE `project.dataset.orders` T
USING `project.dataset.staging_orders` S
ON T.order_id = S.order_id
WHEN NOT MATCHED THEN INSERT ROW
'
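Two practical notes on this pattern: load the staging table with --replace so reruns don't accumulate duplicates, and clear it after a successful merge:
# Empty the staging table once the merge succeeds
bq query --use_legacy_sql=false '
TRUNCATE TABLE `project.dataset.staging_orders`
'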
Troubleshooting
"Too many errors"
Problem: Exceeds max_bad_records
Solution: Increase --max_bad_records or fix data quality
"Schema mismatch"
Problem: CSV columns don't match schema
Solution: Verify column order and count
"Invalid CSV format"
Problem: Unescaped quotes or newlines
Solution: Use --allow_quoted_newlines
"Permission denied"
Problem: No access to GCS bucket
Solution: Grant the BigQuery service account the Storage Object Viewer role
Quick Reference
Format priorities (fastest first):
- Avro (splittable, self-describing)
- Parquet (columnar, efficient)
- CSV uncompressed (parallel)
- JSON (flexible)
Schema strategies:
- Production: Explicit schema file
- Development: Auto-detect
- Evolution: Allow field addition/relaxation
Loading strategies:
- New data: Append (default)
- Replace all: --replace
- Replace partition: TABLE$YYYYMMDD
- Incremental: Load staging → MERGE