| name | earthquake_data_tsunami-query |
| description | Query and analyze earthquake_data_tsunami.xlsx data using conversational AI. Automatically generated skill for Excel file with 783 rows and 13 columns across 1 sheet(s). |
earthquake_data_tsunami Query Skill
Auto-generated skill for querying earthquake_data_tsunami.xlsx
Dataset Overview
- Original File: earthquake_data_tsunami.xlsx
- File Size: 0.05 MB
- Sheets: 1
- Total Rows: 783
- Total Columns: 13
- Formulas: 0
- Data Format: Parquet (optimized for fast querying)
Available Sheets
earthquake_data_tsunami
- Rows: 783
- Columns: 13
- Key columns: magnitude, cdi, mmi, sig, nst
Query Capabilities
This skill enables natural language querying of the Excel data. You can:
Filtering and Selection
- Filter rows based on conditions
- Select specific columns
- Combine multiple conditions
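For example, a filter-and-select query over this dataset might look like the following sketch (the exact Parquet file name inside the data directory is an assumption; adjust it to the actual layout):

```python
import polars as pl

# Hypothetical file path; the real layout of the Parquet directory may differ
df = pl.scan_parquet("earthquake_data_tsunami_parquet/earthquake_data_tsunami.parquet")

# Combine conditions (strong, shallow earthquakes) and select only location columns
result = (
    df.filter((pl.col("magnitude") >= 6.5) & (pl.col("depth") < 70))
      .select(["magnitude", "depth", "latitude", "longitude"])
      .collect(streaming=True)
)
print(result.head(100))
```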
Aggregations
- Group by categories
- Calculate sums, averages, counts
- Find min/max values
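A sketch of a grouped aggregation on this dataset (same path assumption as above):

```python
import polars as pl

df = pl.scan_parquet("earthquake_data_tsunami_parquet/earthquake_data_tsunami.parquet")  # hypothetical path

# Group by year and compute a count, an average, and a maximum per group
yearly = (
    df.group_by("Year")
      .agg(
          pl.len().alias("quake_count"),                      # events per year
          pl.col("magnitude").mean().alias("avg_magnitude"),
          pl.col("magnitude").max().alias("max_magnitude"),
      )
      .sort("Year")
      .collect(streaming=True)
)
print(yearly)
```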
Analysis
- Compare across groups
- Identify trends and patterns
- Generate insights from the data
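For cross-group comparison, grouping on the tsunami flag gives a quick side-by-side summary (a sketch under the same path assumption):

```python
import polars as pl

df = pl.scan_parquet("earthquake_data_tsunami_parquet/earthquake_data_tsunami.parquet")  # hypothetical path

# Compare tsunami vs. non-tsunami events on a few summary statistics
comparison = (
    df.group_by("tsunami")
      .agg(
          pl.len().alias("events"),
          pl.col("magnitude").mean().alias("avg_magnitude"),
          pl.col("sig").mean().alias("avg_significance"),
      )
      .collect(streaming=True)
)
print(comparison)
```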
Example Queries
"Show me total sales by age group"
"What's the average revenue for customers over 25?"
"Filter rows where status is 'active' and created after 2024-01-01"
"Group by category and calculate sum of revenue"
"Find the top 10 products by sales volume"
"Compare performance across different regions"
Formula Information
This dataset contains no Excel formulas; every value in the Parquet data is a plain, pre-computed value.
Key Formulas
No formulas found. formula_map.json is included for consistency but has no entries; when formulas are present, it documents each one with its cell location and dependencies.
Technical Details
- Storage Format: Parquet (columnar, compressed)
- Query Engine: Polars with streaming support
- Memory Efficiency: Lazy loading, data loaded on-demand
- Performance: ~30x faster than direct Excel queries
- Data Location: `earthquake_data_tsunami_parquet/`
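The Parquet directory can also be scanned directly with Polars' lazy API; a minimal sketch (the glob pattern assumes one Parquet file per sheet inside the directory):

```python
import polars as pl

# Lazy scan: nothing is read from disk until .collect() is called
lf = pl.scan_parquet("earthquake_data_tsunami_parquet/*.parquet")

# Only the columns and rows the query actually needs are materialized
print(lf.select(pl.len()).collect())  # total row count
```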
Instructions for Claude
When a user requests to query this data:
Step 1: Load Required Resources
```python
import polars as pl
from pathlib import Path
import json

# Load data dictionary to understand schema
with open('data_dictionary.json', 'r') as f:
    schema = json.load(f)

# Load formula map if needed (empty for this dataset)
with open('formula_map.json', 'r') as f:
    formulas = json.load(f)
```
Step 2: Use Query Helper
```python
from query_helper import QueryHelper

# Initialize helper
helper = QueryHelper('earthquake_data_tsunami_parquet')

# Load a sheet as a LazyFrame (this file's only sheet is 'earthquake_data_tsunami')
df = helper.load_sheet('earthquake_data_tsunami', lazy=True)

# Execute query with streaming, using columns that exist in this dataset
result = df.filter(
    pl.col("magnitude") > 6.0
).group_by("tsunami").agg([
    pl.len().alias("count"),
    pl.mean("depth").alias("avg_depth")
]).collect(streaming=True)

# Display results (paginated)
print(result.head(100))
```
Step 3: Handle Large Results
- Always use `.head(100)` or `.limit(100)` for initial results
- Offer to show more if the user requests it
- Use streaming mode for queries: `.collect(streaming=True)`
- Paginate large outputs (see the sketch below)
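A minimal pagination pattern using LazyFrame.slice, so only one page is ever materialized (the glob path is an assumption):

```python
import polars as pl

lf = pl.scan_parquet("earthquake_data_tsunami_parquet/*.parquet")  # hypothetical glob

PAGE_SIZE = 100

def get_page(page: int) -> pl.DataFrame:
    """Return one page of results without loading the full table."""
    return lf.slice(page * PAGE_SIZE, PAGE_SIZE).collect(streaming=True)

print(get_page(0))  # rows 0-99
print(get_page(1))  # rows 100-199
```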
Step 4: Reference Documentation
Check `data_dictionary.json` for:
- Column names and data types
- Sample values
- Formula indicators

Check `formula_map.json` for:
- Excel formula definitions
- Cell locations
- Dependencies

Check `sample_data.json` for:
- Representative data examples
- Data patterns and formats
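A sketch of validating requested columns against data_dictionary.json before querying, assuming the `column_details` layout used in the workflow example later in this document:

```python
import json

with open('data_dictionary.json', 'r') as f:
    schema = json.load(f)

# Assumed layout: {sheet_name: {"column_details": [{"name": ...}, ...]}}
valid_columns = {
    col['name']
    for col in schema.get('earthquake_data_tsunami', {}).get('column_details', [])
}

requested = ['magnitude', 'depth', 'tsunamii']  # deliberate typo for illustration
unknown = [c for c in requested if c not in valid_columns]
if unknown:
    print(f"Unknown columns (check data_dictionary.json): {unknown}")
```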
Column Reference
| Sheet | Column | Type | Has Formulas |
|---|---|---|---|
| earthquake_data_tsunami | magnitude | numeric | No |
| earthquake_data_tsunami | cdi | numeric | No |
| earthquake_data_tsunami | mmi | numeric | No |
| earthquake_data_tsunami | sig | numeric | No |
| earthquake_data_tsunami | nst | numeric | No |
| earthquake_data_tsunami | dmin | numeric | No |
| earthquake_data_tsunami | gap | numeric | No |
| earthquake_data_tsunami | depth | numeric | No |
| earthquake_data_tsunami | latitude | numeric | No |
| earthquake_data_tsunami | longitude | numeric | No |
| earthquake_data_tsunami | Year | numeric | No |
| earthquake_data_tsunami | Month | numeric | No |
| earthquake_data_tsunami | tsunami | numeric | No |
For complete column information, see data_dictionary.json.
Data Dictionary Location
All detailed schema information is in data_dictionary.json:
- Column names, types, and sample values
- Formula locations and definitions
- Sheet relationships
- Data statistics
Best Practices
- Always use lazy loading with `pl.scan_parquet()` for large datasets
- Stream results with `.collect(streaming=True)` to avoid memory issues
- Limit initial results to 100 rows and offer pagination
- Check the data dictionary before constructing queries
- Handle nulls gracefully in user-facing outputs (see the sketch below)
- Validate column names against the schema before querying
- Use appropriate aggregations based on data types
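For the null-handling practice above, a sketch (the glob path is an assumption):

```python
import polars as pl

lf = pl.scan_parquet("earthquake_data_tsunami_parquet/*.parquet")  # hypothetical glob

# Drop rows where the key measurement is missing; fill optional fields instead
clean = (
    lf.drop_nulls(subset=["magnitude"])          # require a magnitude
      .with_columns(pl.col("nst").fill_null(0))  # treat missing station counts as 0
)
print(clean.collect(streaming=True).head(100))
```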
Example Complete Workflow
```python
import polars as pl
import json
from query_helper import QueryHelper

# 1. Initialize
helper = QueryHelper('earthquake_data_tsunami_parquet')

# 2. Load schema
with open('data_dictionary.json', 'r') as f:
    schema = json.load(f)

# 3. Check available columns
sheet_info = schema.get('earthquake_data_tsunami', {})
columns = [col['name'] for col in sheet_info.get('column_details', [])]
print(f"Available columns: {', '.join(columns[:10])}")

# 4. Execute query (lazy load, streamed collection)
df = helper.load_sheet('earthquake_data_tsunami', lazy=True)
result = df.filter(
    pl.col('magnitude') > 6.0
).select(['magnitude', 'depth', 'tsunami']).head(100).collect(streaming=True)

# 5. Display formatted results
total_rows = df.select(pl.len()).collect().item()
print(result)
print(f"\nShowing {len(result)} of {total_rows} total rows")
```
Troubleshooting
Q: Column not found error?
A: Check `data_dictionary.json` for exact column names (they are case-sensitive).
Q: Memory issues with large queries?
A: Use `.head()` to limit results and collect with `.collect(streaming=True)`.
Q: Formula not working?
A: Check `formula_map.json`; formulas are pre-computed in the Parquet data (this dataset has none).
Q: Sheet name not found?
A: List available sheets from the schema or use `helper.list_sheets()`.
Note: This is an auto-generated skill. The quality of query results depends on the data quality in the source Excel file.