| name | polars |
| description | Lightning-fast DataFrame library written in Rust for high-performance data manipulation and analysis. Use when user wants blazing fast data transformations, working with large datasets, lazy evaluation pipelines, or needs better performance than pandas. Ideal for ETL, data wrangling, aggregations, joins, and reading/writing CSV, Parquet, JSON files. |
Polars
Overview
Polars is a blazingly fast DataFrame library written in Rust with Python bindings. Built for performance and memory efficiency, Polars leverages parallel execution and lazy evaluation to process data faster than pandas, especially on large datasets.
When to Use This Skill
Activate when the user:
- Wants to work with DataFrames and needs high performance
- Mentions Polars explicitly or asks for "fast" data processing
- Needs to process large datasets (millions of rows)
- Wants lazy evaluation for query optimization
- Asks for data transformations, filtering, grouping, or aggregations
- Needs to read/write CSV, Parquet, JSON, or other data formats
- Wants to combine with DuckDB for SQL + DataFrame workflows
Installation
Check if Polars is installed:
python3 -c "import polars as pl; print(pl.__version__)"
If not installed:
pip3 install polars
For full features including Parquet support:
pip3 install 'polars[pyarrow]'
For DuckDB integration:
pip3 install polars duckdb 'polars[pyarrow]'
Core Capabilities
1. Reading Data
Polars can read data from various formats:
import polars as pl
# Read CSV
df = pl.read_csv('data.csv')
# Read Parquet (fast, columnar format)
df = pl.read_parquet('data.parquet')
# Read JSON
df = pl.read_json('data.json')
# Read multiple files
df = pl.read_csv('data/*.csv')
# Read with lazy evaluation (doesn't load until needed)
lazy_df = pl.scan_csv('large_data.csv')
lazy_df = pl.scan_parquet('data/*.parquet')
2. Data Selection and Filtering
import polars as pl
df = pl.read_csv('data.csv')
# Select columns
result = df.select(['name', 'age', 'city'])
# Select with expressions
result = df.select([
pl.col('name'),
pl.col('age'),
pl.col('salary').alias('annual_salary')
])
# Filter rows
result = df.filter(pl.col('age') > 25)
# Multiple conditions
result = df.filter(
(pl.col('age') > 25) &
(pl.col('city') == 'NYC')
)
# Filter with string methods
result = df.filter(pl.col('name').str.contains('John'))
3. Transformations and New Columns
import polars as pl
df = pl.read_csv('sales.csv')
# Add new columns
result = df.with_columns([
(pl.col('quantity') * pl.col('price')).alias('total'),
pl.col('product').str.to_uppercase().alias('product_upper'),
pl.when(pl.col('quantity') > 10)
.then(pl.lit('bulk'))
.otherwise(pl.lit('retail'))
.alias('sale_type')
])
# Modify existing columns
result = df.with_columns([
pl.col('price').round(2),
pl.col('date').str.strptime(pl.Date, '%Y-%m-%d')
])
# Rename columns
result = df.rename({'old_name': 'new_name'})
4. Aggregations and Grouping
import polars as pl
df = pl.read_csv('sales.csv')
# Group by and aggregate
result = df.group_by('category').agg([
pl.col('sales').sum().alias('total_sales'),
pl.col('sales').mean().alias('avg_sales'),
pl.col('sales').count().alias('num_sales'),
pl.col('product').n_unique().alias('unique_products')
])
# Multiple group by columns
result = df.group_by(['category', 'region']).agg([
pl.col('revenue').sum(),
pl.col('customer_id').n_unique().alias('unique_customers')
])
# Aggregation without grouping
stats = df.select([
pl.col('sales').sum(),
pl.col('sales').mean(),
pl.col('sales').median(),
pl.col('sales').std(),
pl.col('sales').min(),
pl.col('sales').max()
])
5. Sorting and Ranking
import polars as pl
df = pl.read_csv('data.csv')
# Sort by single column
result = df.sort('age')
# Sort descending
result = df.sort('salary', descending=True)
# Sort by multiple columns
result = df.sort(['department', 'salary'], descending=[False, True])
# Add rank column
result = df.with_columns([
pl.col('salary').rank(method='dense').over('department').alias('dept_rank')
])
6. Joins
import polars as pl
customers = pl.read_csv('customers.csv')
orders = pl.read_csv('orders.csv')
# Inner join
result = customers.join(orders, on='customer_id', how='inner')
# Left join
result = customers.join(orders, on='customer_id', how='left')
# Join on different column names
result = customers.join(
orders,
left_on='id',
right_on='customer_id',
how='inner'
)
# Join on multiple columns
result = df1.join(df2, on=['col1', 'col2'], how='inner')
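Polars joins also support semi and anti joins, which keep or drop rows of the left frame based on whether a matching key exists on the right. A minimal sketch, reusing the customers and orders frames from above:
# Semi join: customers that have at least one order (no columns from orders are added)
with_orders = customers.join(orders, on='customer_id', how='semi')
# Anti join: customers that have no orders
without_orders = customers.join(orders, on='customer_id', how='anti')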
7. Window Functions
import polars as pl
df = pl.read_csv('sales.csv')
# Calculate running total
result = df.with_columns([
pl.col('sales').cum_sum().over('region').alias('running_total')
])
# Calculate rolling average
result = df.with_columns([
pl.col('sales').rolling_mean(window_size=7).alias('7_day_avg')
])
# Rank within groups
result = df.with_columns([
pl.col('sales').rank().over('category').alias('category_rank')
])
# Lag and lead
result = df.with_columns([
pl.col('sales').shift(1).over('product').alias('prev_sales'),
pl.col('sales').shift(-1).over('product').alias('next_sales')
])
Lazy Evaluation for Performance
Polars' lazy API optimizes queries before execution:
import polars as pl
# Start with lazy scan (doesn't load data yet)
lazy_df = (
pl.scan_csv('large_data.csv')
.filter(pl.col('date') >= '2024-01-01')
.select(['customer_id', 'product', 'sales', 'date'])
.group_by('customer_id')
.agg([
pl.col('sales').sum().alias('total_sales'),
pl.col('product').n_unique().alias('unique_products')
])
.filter(pl.col('total_sales') > 1000)
.sort('total_sales', descending=True)
)
# Execute the optimized query
result = lazy_df.collect()
# Or get execution plan
print(lazy_df.explain())
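For queries over data that does not fit in memory, the same lazy query can be executed with Polars' streaming engine. A minimal sketch; the exact flag has changed between releases (streaming=True in many recent versions), so verify it against your installed version:
# Execute the optimized query in streaming mode (processes the input in chunks)
result = lazy_df.collect(streaming=True)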
Common Patterns
Pattern 1: ETL Pipeline
import polars as pl
from datetime import datetime
# Extract and Transform
result = (
pl.scan_csv('raw_data.csv')
# Clean data
.filter(
(pl.col('amount') > 0) &
(pl.col('quantity') > 0)
)
# Transform columns
.with_columns([
pl.col('date').str.strptime(pl.Date, '%Y-%m-%d'),
pl.col('product').str.strip_chars().str.to_uppercase(),
(pl.col('quantity') * pl.col('amount')).alias('total'),
pl.when(pl.col('quantity') > 10)
.then(pl.lit('bulk'))
.otherwise(pl.lit('retail'))
.alias('order_type')
])
# Aggregate
.group_by(['date', 'product', 'order_type'])
.agg([
pl.col('total').sum().alias('daily_total'),
pl.col('quantity').sum().alias('daily_quantity'),
pl.len().alias('num_orders')
])
.collect()
)
# Load (save results)
result.write_parquet('processed_data.parquet')
Pattern 2: Data Exploration
import polars as pl
df = pl.read_csv('data.csv')
# Quick overview
print(df.head())
print(df.describe())
print(df.schema)
# Column statistics
print(df.select([
pl.col('age').min(),
pl.col('age').max(),
pl.col('age').mean(),
pl.col('age').median(),
pl.col('age').std()
]))
# Count nulls
print(df.null_count())
# Value counts
print(df['category'].value_counts())
# Unique values
print(df['status'].n_unique())
Pattern 3: Combining with DuckDB
Use Polars for data loading and DuckDB for SQL analytics:
import polars as pl, duckdb
# Load data with Polars
df = pl.read_parquet('data/*.parquet')
# Use DuckDB for complex SQL
result = duckdb.sql("""
SELECT
category,
DATE_TRUNC('month', date) as month,
SUM(revenue) as monthly_revenue,
COUNT(DISTINCT customer_id) as unique_customers
FROM df
WHERE date >= '2024-01-01'
GROUP BY category, month
ORDER BY month DESC, monthly_revenue DESC
""").pl() # Convert back to Polars DataFrame
# Continue with Polars
final = result.with_columns([
(pl.col('monthly_revenue') / pl.col('unique_customers')).alias('revenue_per_customer')
])
Pattern 4: Writing Data
import polars as pl
df = pl.read_csv('data.csv')
# Write to CSV
df.write_csv('output.csv')
# Write to Parquet (recommended for large data)
df.write_parquet('output.parquet')
# Write to JSON
df.write_json('output.json')
# Write partitioned Parquet files
df.write_parquet('output/', partition_by='date')
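When the data comes from a lazy scan, results can also be streamed straight to disk without materializing the full DataFrame first. A minimal sketch using sink_parquet, assuming a hypothetical large_data.csv with an amount column:
import polars as pl
# Write the filtered result directly to Parquet without collecting it into memory
(
    pl.scan_csv('large_data.csv')
    .filter(pl.col('amount') > 0)
    .sink_parquet('streamed_output.parquet')
)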
Expression Chaining
Polars uses a powerful expression syntax:
import polars as pl
result = df.select([
# String operations
pl.col('name').str.to_lowercase().str.strip_chars().alias('clean_name'),
# Arithmetic
(pl.col('price') * 1.1).round(2).alias('price_with_tax'),
# Conditional logic
pl.when(pl.col('age') < 18)
.then(pl.lit('minor'))
.when(pl.col('age') < 65)
.then(pl.lit('adult'))
.otherwise(pl.lit('senior'))
.alias('age_group'),
# Date operations
pl.col('date').dt.year().alias('year'),
pl.col('date').dt.month().alias('month'),
# List operations
pl.col('tags').list.len().alias('num_tags'),
])
Performance Tips
- Use lazy evaluation for large datasets - it lets Polars optimize the whole query before execution
- Use Parquet format - columnar, compressed, much faster than CSV
- Filter early - push filters before other operations
- Avoid row iteration - use vectorized expressions instead
- Prefer native expressions over per-row Python functions (e.g. map_elements) - they run in parallel inside the Rust engine; see the contrast sketch after the example below
# Good: Lazy + filter early
result = (
pl.scan_parquet('large.parquet')
.filter(pl.col('date') >= '2024-01-01') # Filter first
.select(['col1', 'col2', 'col3']) # Then select
.collect()
)
# Less efficient: Eager loading
df = pl.read_parquet('large.parquet')
result = df.filter(pl.col('date') >= '2024-01-01').select(['col1', 'col2', 'col3'])
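To illustrate the last two tips, a small sketch contrasting a per-row Python function with a single vectorized expression (the price column and the 10% markup are hypothetical):
import polars as pl
df = pl.DataFrame({'price': [10.0, 20.0, 30.0]})
# Slow: calls a Python lambda once per row
slow = df.with_columns(
    pl.col('price').map_elements(lambda p: p * 1.1, return_dtype=pl.Float64).alias('price_with_tax')
)
# Fast: one vectorized expression evaluated in parallel by the Rust engine
fast = df.with_columns((pl.col('price') * 1.1).alias('price_with_tax'))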
Polars vs Pandas
Key differences:
- Immutability: Polars DataFrames are immutable (operations return new DataFrames)
- Performance: Polars is often 5-10x faster than pandas, depending on the workload and data size
- Lazy evaluation: Polars can optimize queries before execution
- Expressions: Polars uses a composable expression API instead of pandas-style column indexing and assignment
- Parallel: Polars automatically parallelizes operations
# Pandas style
df['new_col'] = df['col1'] * df['col2']
# Polars style
df = df.with_columns([
(pl.col('col1') * pl.col('col2')).alias('new_col')
])
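Because Polars DataFrames are immutable, every operation returns a new DataFrame, and "modifying" one really means rebinding the name. A small sketch with a made-up frame:
import polars as pl
df = pl.DataFrame({'col1': [1, -2, 3], 'col2': [4, 5, 6]})
filtered = df.filter(pl.col('col1') > 0)  # new DataFrame; df itself is unchanged
df = df.drop('col2')                      # to "mutate", reassign the name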
Integration with DuckDB
For the best of both worlds, combine Polars and DuckDB:
import polars as pl, duckdb
# Polars: Fast data loading and transformation
df = (
pl.scan_parquet('data/*.parquet')
.filter(pl.col('active'))
.collect()
)
# DuckDB: SQL analytics
result = duckdb.sql("""
SELECT
category,
SUM(amount) as total,
AVG(amount) as average
FROM df
GROUP BY category
""").pl()
See the duckdb skill for more SQL capabilities and the references/api_reference.md file for detailed Polars API documentation.
Error Handling
import polars as pl
try:
df = pl.read_csv('data.csv')
except FileNotFoundError:
print("File not found")
except pl.exceptions.ComputeError as e:
print(f"Polars compute error: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
Resources
- references/api_reference.md: Detailed Polars API documentation and examples
- Official docs: https://docs.pola.rs/
- API reference: https://docs.pola.rs/api/python/stable/reference/