| name | pyiceberg-storage |
| description | Research-driven PyIceberg development. Injects research steps for Iceberg table operations, ACID transactions, catalog integration, and schema evolution. Use when working with Iceberg table storage, time-travel queries, or data lake operations. |
PyIceberg Storage Development (Research-Driven)
Philosophy
This skill does NOT prescribe specific Iceberg table patterns. Instead, it guides you to:
- Research the current PyIceberg version and capabilities
- Discover existing Iceberg table usage in the codebase
- Validate your implementations against PyIceberg SDK documentation
- Verify integration with Polaris catalog and storage layer
Pre-Implementation Research Protocol
Step 1: Verify Runtime Environment
ALWAYS run this first:
python -c "import pyiceberg; print(f'PyIceberg {pyiceberg.__version__}')"
Critical Questions to Answer:
- What version is installed? (0.10.x recommended as of 2025)
- Are required dependencies installed? (pyarrow, fsspec, cloud SDKs)
- Does it match the documented requirements?
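A minimal sketch of that environment check, assuming pyarrow and fsspec are the optional dependencies your project relies on (adjust the list to match your requirements):

```python
# Report versions of PyIceberg and the optional dependencies this codebase is
# assumed to need; swap in your own module list.
import importlib

for name in ("pyiceberg", "pyarrow", "fsspec"):
    try:
        module = importlib.import_module(name)
        print(f"{name} {getattr(module, '__version__', 'unknown')}")
    except ImportError:
        print(f"{name} is NOT installed")
```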
Step 2: Research SDK State (if unfamiliar)
When to research: If you encounter unfamiliar PyIceberg features or need to validate patterns
Research queries (use WebSearch):
- "PyIceberg [feature] documentation 2025" (e.g., "PyIceberg ACID transactions 2025")
- "PyIceberg catalog REST API configuration 2025"
- "PyIceberg [cloud] integration 2025" (e.g., "PyIceberg AWS S3 integration 2025")
Official documentation: https://py.iceberg.apache.org
Key documentation sections:
- API Reference: https://py.iceberg.apache.org/reference/
- Configuration: https://py.iceberg.apache.org/configuration/
- Table Operations: https://py.iceberg.apache.org/reference/pyiceberg/table/
Step 3: Discover Existing Patterns
BEFORE creating new Iceberg operations, search for existing implementations:
```bash
# Find existing PyIceberg usage
rg "from pyiceberg|import pyiceberg" --type py

# Find catalog configurations
rg "load_catalog|Catalog" --type py

# Find table operations
rg "\.load_table|\.create_table" --type py

# Find transaction usage
rg "\.transaction|\.append|\.overwrite" --type py
```
Key questions:
- What catalog is configured? (REST, Hive, Glue, etc.)
- What storage is used? (S3, Azure, GCS, local filesystem)
- What table operations are already implemented?
- How are schemas defined?
Step 4: Validate Against Architecture
Check architecture docs for integration requirements:
- Read /docs/ for Iceberg storage requirements
- Understand Polaris catalog integration
- Verify compute targets and storage mappings
- Check partition strategy requirements
Implementation Guidance (Not Prescriptive)
Catalog Configuration
Core concept: Catalogs manage Iceberg table metadata and coordinate with storage
Research questions:
- What catalog type should be used? (REST for Polaris, Hive, Glue)
- How should credentials be managed? (vended credentials, access delegation)
- What warehouse configuration is needed?
- How should the catalog be initialized in code?
SDK features to research:
- `load_catalog()`: Initialize catalog from configuration (see the sketch below)
- REST catalog: Connection to Polaris or other REST APIs
- Catalog types: `rest`, `hive`, `glue`, `dynamodb`, `sql`
- Configuration: YAML files, environment variables, Python dicts
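A minimal sketch of initializing a catalog from an explicit Python dict; the URI and warehouse values are placeholders, and the YAML/environment-variable forms are noted as alternatives to verify against the configuration docs:

```python
from pyiceberg.catalog import load_catalog

# Option 1: pass configuration explicitly as keyword arguments.
catalog = load_catalog(
    "default",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",  # placeholder REST catalog endpoint
        "warehouse": "my_warehouse",     # placeholder warehouse name
    },
)

# Option 2 (verify in the configuration docs): a ~/.pyiceberg.yaml file with a
# `catalog:` section, or environment variables such as
# PYICEBERG_CATALOG__DEFAULT__URI, are also picked up by load_catalog("default").
```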
Table Operations
Core concept: Tables represent datasets with schema, partitioning, and ACID guarantees
Research questions:
- Should I create a new table or load existing?
- What schema should the table have?
- What partitioning strategy? (time-based, dimension-based)
- What sort order for performance?
SDK features to research:
- `catalog.create_table()`: Create new Iceberg table
- `catalog.load_table()`: Load existing table
- `table.scan()`: Query table data
- `table.schema()`: Get table schema
- `table.spec()`: Get partition spec
- `table.properties()`: Get table properties
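A minimal sketch of loading an existing table and scanning a projection; the table name, columns, and filter are placeholders:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("my_namespace.my_table")

# Inspect metadata before touching data.
print(table.schema())  # current schema
print(table.spec())    # partition spec

# Plan a scan over only the rows and columns needed, then materialize it.
arrow_table = table.scan(
    row_filter="id > 100",
    selected_fields=("id", "name"),
).to_arrow()
print(arrow_table.num_rows)
```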
ACID Transactions
Core concept: Iceberg uses optimistic concurrency and snapshots for ACID guarantees
Research questions:
- What operation am I performing? (append, overwrite, delete)
- Do I need transactional guarantees?
- How should concurrent writes be handled?
- What snapshot isolation level is needed?
SDK features to research:
- `table.append()`: Append data (creates APPEND snapshot)
- `table.overwrite()`: Overwrite data (creates OVERWRITE snapshot)
- `table.delete()`: Delete data (creates DELETE snapshot)
- Transaction API: Stage changes before committing (see the sketch below)
- Optimistic concurrency: Snapshot-based isolation
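A minimal sketch of staging a write in a transaction so it commits as a single atomic snapshot; `df` is a placeholder PyArrow table assumed to match the table's schema:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("my_namespace.my_table")

df = pa.table({"id": [4, 5], "name": ["Dana", "Eve"]})  # placeholder data

# Changes staged on the transaction commit as one snapshot when the block exits;
# if the optimistic commit fails (a concurrent writer won), nothing is applied.
with table.transaction() as txn:
    txn.append(df)

# One-shot writes are also available and each produce their own snapshot:
# table.append(df)
# table.overwrite(df)
```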
Schema Evolution
Core concept: Iceberg supports safe schema changes without rewriting data
Research questions:
- What schema changes are needed? (add column, rename, type promotion)
- Are changes backward compatible?
- How should schema evolution be tracked?
- What validation is needed?
SDK features to research:
- `table.update_schema()`: Modify table schema (see the sketch below)
- Schema operations: `add_column()`, `rename_column()`, `update_column()`
- Type promotion: Safe type changes (int → long, float → double)
- Schema versioning: Iceberg tracks schema evolution
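A minimal sketch of evolving a schema with the `update_schema()` builder; the column names and types are illustrative:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.types import LongType

catalog = load_catalog("default")
table = catalog.load_table("my_namespace.my_table")

# All staged changes commit as a single schema update when the block exits.
with table.update_schema() as update:
    update.add_column("score", LongType(), doc="hypothetical new column")
    update.rename_column("name", "full_name")
```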
Partitioning
Core concept: Partitioning improves query performance by organizing data
Research questions:
- What partition strategy? (daily, hourly, by dimension)
- What columns should be partition keys?
- Should partitioning be hidden (partition transforms)?
- How does partitioning affect queries?
SDK features to research:
- Partition specs: Define partitioning strategy
- Partition transforms: `day()`, `hour()`, `month()`, `year()`, `bucket()`, `truncate()`
- Hidden partitioning: Partition transforms applied automatically
- Dynamic partitioning: Partitions created as data arrives
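A minimal sketch of a hidden, time-based partition spec on a timestamp column; the field IDs, column names, and namespace are illustrative:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import LongType, NestedField, StringType, TimestampType

schema = Schema(
    NestedField(1, "id", LongType(), required=True),
    NestedField(2, "name", StringType(), required=False),
    NestedField(3, "event_ts", TimestampType(), required=True),
)

# Partition by day(event_ts): queries filter on event_ts and Iceberg prunes
# partitions automatically (hidden partitioning).
spec = PartitionSpec(
    PartitionField(source_id=3, field_id=1000, transform=DayTransform(), name="event_day"),
)

catalog = load_catalog("default")
table = catalog.create_table(
    "my_namespace.events",
    schema=schema,
    partition_spec=spec,
)
```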
Time Travel
Core concept: Iceberg snapshots enable querying historical data
Research questions:
- What snapshot should be queried? (timestamp, snapshot ID)
- How far back does history need to go?
- Should old snapshots be expired?
- What metadata is available in snapshots?
SDK features to research:
- `table.scan().use_ref()`: Query specific snapshot/tag/branch
- `table.history()`: List table snapshots
- `table.refs()`: List named references (branches, tags)
- Snapshot expiration: Clean up old snapshots
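A minimal sketch of inspecting snapshot history and reading an older snapshot; which snapshot or ref to pick is workload-specific, and the tag name is hypothetical:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("my_namespace.my_table")

# List the snapshot log and any named refs (branches/tags).
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)
print(table.refs())

# Read as of the oldest snapshot by ID ...
snapshots = table.snapshots()
if snapshots:
    old_data = table.scan(snapshot_id=snapshots[0].snapshot_id).to_arrow()

# ... or by a named tag/branch (assumes a ref called "my_tag" exists).
# tagged = table.scan().use_ref("my_tag").to_arrow()
```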
Data I/O
Core concept: PyIceberg reads/writes data using PyArrow and fsspec
Research questions:
- What data format? (Parquet, Avro, ORC)
- What file system? (S3, Azure, GCS, local)
- How should data be batched?
- What compression should be used?
SDK features to research:
- `table.scan().to_arrow()`: Read data as PyArrow Table
- `table.scan().to_pandas()`: Read data as Pandas DataFrame
- `table.scan().to_duckdb()`: Read data into DuckDB
- FileIO implementations: S3, Azure, GCS, fsspec
- Data file formats: Parquet (recommended), Avro, ORC
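A minimal sketch of the main read paths; the table and column names are placeholders, and the pandas/duckdb paths require the corresponding extras to be installed:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("my_namespace.my_table")

scan = table.scan(selected_fields=("id", "name"))

arrow_table = scan.to_arrow()   # PyArrow Table
pandas_df = scan.to_pandas()    # Pandas DataFrame (needs the pandas extra)
con = scan.to_duckdb(table_name="my_table")  # DuckDB connection (needs the duckdb extra)
print(con.execute("SELECT count(*) FROM my_table").fetchone())
```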
Validation Workflow
Before Implementation
- ✅ Verified PyIceberg version and dependencies
- ✅ Searched for existing catalog and table configurations
- ✅ Read architecture docs for Polaris integration
- ✅ Identified storage layer (S3, Azure, GCS)
- ✅ Researched unfamiliar PyIceberg features
During Implementation
- ✅ Using `load_catalog()` with proper configuration
- ✅ Type hints on ALL functions and parameters
- ✅ Proper error handling for catalog/table operations
- ✅ Transaction API for ACID guarantees
- ✅ Schema evolution following Iceberg best practices
- ✅ Partitioning strategy aligned with query patterns
After Implementation
- ✅ Verify catalog connection works
- ✅ Test table creation/loading
- ✅ Test data write operations (append, overwrite)
- ✅ Test data read operations (scan, time travel)
- ✅ Verify schema evolution works
- ✅ Check integration with dbt/Dagster (if applicable)
Context Injection (For Future Claude Instances)
When this skill is invoked, you should:
Verify runtime state (don't assume):
- `python -c "import pyiceberg; print(pyiceberg.__version__)"`

Discover existing patterns (don't invent):
- `rg "load_catalog" --type py`
- `rg "\.load_table|\.create_table" --type py`

Research when uncertain (don't guess):
- Use WebSearch for "PyIceberg [feature] documentation 2025"
- Check official docs: https://py.iceberg.apache.org
Validate against architecture (don't assume requirements):
- Read relevant architecture docs in /docs/
- Understand Polaris catalog integration
- Check storage layer configuration
Check Polaris integration (if using REST catalog):
- Verify Polaris endpoint configuration
- Check credential management (vended credentials)
- Understand access delegation model
Quick Reference: Common Research Queries
Use these WebSearch queries when encountering specific needs:
- Catalog setup: "PyIceberg REST catalog configuration 2025"
- Polaris integration: "PyIceberg Polaris catalog integration 2025"
- Table operations: "PyIceberg create table append data 2025"
- ACID transactions: "PyIceberg transaction API ACID guarantees 2025"
- Schema evolution: "PyIceberg schema evolution add column 2025"
- Partitioning: "PyIceberg partition transforms hidden partitioning 2025"
- Time travel: "PyIceberg snapshot time travel query 2025"
- AWS S3: "PyIceberg S3 FileIO configuration 2025"
- Azure: "PyIceberg Azure Blob Storage configuration 2025"
- GCS: "PyIceberg Google Cloud Storage configuration 2025"
Integration Points to Research
Polaris Catalog Integration
Key question: How does PyIceberg connect to Polaris REST catalog?
Research areas:
- REST catalog configuration (`type: rest`, `uri`, `warehouse`)
- Credential management (`credential`, `token-refresh-enabled`)
- Access delegation (`header.X-Iceberg-Access-Delegation: vended-credentials`)
- Scope and permissions (`scope: PRINCIPAL_ROLE:ALL`)
- FileIO implementation (`py-io-impl: pyiceberg.io.fsspec.FsspecFileIO`) — see the sketch below
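Tying those keys together, a minimal sketch of a Polaris-style REST catalog configuration; the URI, credential, and warehouse values are placeholders to replace with your deployment's settings:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",
    **{
        "type": "rest",
        "uri": "https://<account>.snowflakecomputing.com/polaris/api/catalog",  # placeholder
        "credential": "<client_id>:<client_secret>",                            # placeholder
        "warehouse": "<catalog_name>",                                          # placeholder
        "scope": "PRINCIPAL_ROLE:ALL",
        "token-refresh-enabled": "true",
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    },
)
print(catalog.list_namespaces())
```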
dbt → PyIceberg Integration
Key question: How do dbt models write to Iceberg tables?
Research areas:
- dbt Python models using PyIceberg
- External tables (dbt creates views, PyIceberg writes data)
- Post-hooks for Iceberg table creation
- Schema mapping (dbt schema → Iceberg schema)
Dagster → PyIceberg Integration
Key question: How do Dagster assets materialize Iceberg tables?
Research areas:
- Custom IOManager for Iceberg tables
- PyIceberg in asset functions
- Partition mapping (Dagster partitions → Iceberg partitions)
- Metadata propagation (Dagster metadata → Iceberg properties)
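For orientation only, a hypothetical minimal IOManager that appends PyArrow outputs to an Iceberg table named after the asset; the catalog name, namespace, and the one-table-per-asset design are assumptions to validate against your actual Dagster setup:

```python
import pyarrow as pa
from dagster import ConfigurableIOManager, InputContext, OutputContext
from pyiceberg.catalog import load_catalog

class IcebergIOManager(ConfigurableIOManager):
    """Hypothetical IOManager: one Iceberg table per asset, keyed by asset name."""

    catalog_name: str = "default"
    namespace: str = "analytics"

    def _table(self, asset_name: str):
        catalog = load_catalog(self.catalog_name)
        return catalog.load_table(f"{self.namespace}.{asset_name}")

    def handle_output(self, context: OutputContext, obj: pa.Table) -> None:
        # Append the asset's output; a real implementation would also handle
        # create-if-missing, full-refresh overwrites, and partition mapping.
        self._table(context.asset_key.path[-1]).append(obj)

    def load_input(self, context: InputContext) -> pa.Table:
        return self._table(context.asset_key.path[-1]).scan().to_arrow()
```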
Storage Layer Configuration
Key question: How is cloud storage configured for Iceberg?
Research areas:
- S3 configuration (bucket, region, credentials)
- Azure Blob Storage configuration
- Google Cloud Storage configuration
- Local filesystem for development
- Vended credentials vs static credentials
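A minimal sketch of static S3-style FileIO properties for local development (e.g. a MinIO endpoint); all values are placeholders, and with vended credentials from Polaris these static keys are typically unnecessary:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",   # placeholder catalog endpoint
        "warehouse": "my_warehouse",      # placeholder warehouse
        # Static credentials for a local MinIO-style S3 endpoint:
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "minio",
        "s3.secret-access-key": "minio123",
        "s3.region": "us-east-1",
    },
)
```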
PyIceberg Development Workflow
Local Development
```bash
# Install PyIceberg with extras (the S3 filesystem extra is named "s3fs")
pip install "pyiceberg[s3fs,pyarrow,pandas,duckdb]"

# Verify installation
python -c "import pyiceberg; print(pyiceberg.__version__)"

# Test catalog connection
python -c "from pyiceberg.catalog import load_catalog; catalog = load_catalog('default')"
```
Example: Load Catalog and Query Table
```python
from pyiceberg.catalog import load_catalog

# Load catalog from configuration
catalog = load_catalog(
    "polaris_catalog",
    **{
        "type": "rest",
        "uri": "https://account.snowflakecomputing.com/polaris/api/catalog",
        "credential": "client_id:client_secret",
        "warehouse": "my_warehouse",
    },
)

# Load table
table = catalog.load_table("my_namespace.my_table")

# Query data
df = table.scan().to_pandas()
```
Example: Create Table and Write Data
```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import IntegerType, NestedField, StringType

catalog = load_catalog("polaris_catalog")

# Define schema
schema = Schema(
    NestedField(1, "id", IntegerType(), required=True),
    NestedField(2, "name", StringType(), required=False),
)

# Create table
table = catalog.create_table(
    "my_namespace.my_new_table",
    schema=schema,
)

# Append data, matching the Iceberg schema's types and nullability exactly
arrow_schema = pa.schema([
    pa.field("id", pa.int32(), nullable=False),
    pa.field("name", pa.string(), nullable=True),
])
data = pa.table(
    {"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]},
    schema=arrow_schema,
)
table.append(data)
```
References
- PyIceberg Documentation: https://py.iceberg.apache.org
- PyIceberg on PyPI: https://pypi.org/project/pyiceberg/
- Configuration Guide: https://py.iceberg.apache.org/configuration/
- API Reference: https://py.iceberg.apache.org/reference/
- GitHub Repository: https://github.com/apache/iceberg-python
- Apache Iceberg Spec: https://iceberg.apache.org/spec/
Remember: This skill provides research guidance, NOT prescriptive patterns. Always:
- Verify the PyIceberg version and dependencies
- Discover existing catalog and table configurations
- Research SDK capabilities when needed (use WebSearch liberally)
- Validate against actual Polaris integration requirements
- Test catalog connection and table operations before considering complete
- Understand ACID guarantees and snapshot isolation for correctness