---
name: Test Data Generation & Validation
description: Generate real Cassandra 5.0 test data using Docker containers, export SSTables with proper directory structure, validate parsing against sstabledump, and manage test datasets. Use when working with test data generation, dataset creation, SSTable export, validation, fixture management, or sstabledump comparison.
---
# Test Data Generation & Validation
This skill provides guidance on generating real Cassandra 5.0 test data and validating parsing correctness.
## When to Use This Skill
- Generating test data with specific schemas
- Creating test fixtures for property tests
- Exporting SSTables from Cassandra
- Validating parsed data against sstabledump
- Managing test datasets
- Creating reproducible test scenarios
## Overview
CQLite uses real Cassandra 5.0 instances to generate test data, ensuring:
- Format correctness (real Cassandra writes)
- Edge case coverage (nulls, empty values, large values)
- Compression validation (actual compressed SSTables)
- Schema variety (all CQL types)
## Test Data Workflow

See `dataset-generation.md` for complete workflow details.

### Quick Start
```bash
cd test-data

# 1. Start clean Cassandra 5 with schemas
./scripts/start-clean.sh

# 2. Generate data (N rows per table)
ROWS=1000 ./scripts/generate.sh

# 3. Export SSTables
./scripts/export.sh

# 4. Shutdown and clean volumes
./scripts/shutdown-clean.sh
```
## Generation Scripts

### start-clean.sh

Starts the Cassandra 5.0 container and applies schemas.

What it does:
- Starts the `cassandra-5-0` container via docker-compose
- Waits for Cassandra to be healthy
- Applies schemas from `schemas/core.list`
- Verifies keyspaces and tables were created

Environment variables:
- `SCHEMA_SET=core` - Use the curated schema list (default)
- `SCHEMA_SET=all` - Use all `*.cql` files

Example:

```bash
# Use default core schemas
./scripts/start-clean.sh

# Use all schemas
SCHEMA_SET=all ./scripts/start-clean.sh
```
### generate.sh

Generates test data using the Python data generator.

What it does:
- Connects to the running Cassandra container
- Generates type-correct data for each table
- Inserts rows using prepared statements
- Flushes memtables to SSTables
- Produces `metadata.yml` with row counts

Environment variables:
- `ROWS=N` - Rows per table (default varies by SCALE)
- `TABLES=table1,table2` - Generate only for specific tables
- `SCALE=SMALL|MEDIUM|LARGE` - Preset sizes

Example:

```bash
# Generate 1000 rows per table
ROWS=1000 ./scripts/generate.sh

# Generate only for specific tables
TABLES=simple_table,collection_table ROWS=500 ./scripts/generate.sh

# Use LARGE scale preset
SCALE=LARGE ./scripts/generate.sh
```
### export.sh

Exports SSTables from the Cassandra data directory.

What it does:
- Stops Cassandra to ensure a consistent snapshot
- Copies SSTables from the container to `datasets/sstables/`
- Preserves the directory structure (keyspace/table/files)
- Copies `metadata.yml`
- Creates metadata about the dataset

Output structure:

```
test-data/datasets/
├── metadata.yml              # Generated by generate.sh
└── sstables/
    ├── test_basic/
    │   └── simple_table/
    │       ├── *-Data.db
    │       ├── *-Index.db
    │       ├── *-Statistics.db
    │       ├── *-Summary.db
    │       └── *-TOC.txt
    ├── test_collections/
    └── test_timeseries/
```
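As a post-export sanity check, a small script along these lines can confirm that every exported table directory contains the expected SSTable components. This is a sketch, not part of `export.sh`: the `missing_components`/`check_export` names are illustrative, and the component list simply mirrors the output structure shown above.

```python
from pathlib import Path

# Components each exported table directory should contain,
# per the output structure above.
EXPECTED_SUFFIXES = ["Data.db", "Index.db", "Statistics.db", "Summary.db", "TOC.txt"]

def missing_components(table_dir: Path) -> list[str]:
    """Return the component suffixes with no matching file in table_dir."""
    names = [p.name for p in table_dir.iterdir() if p.is_file()]
    return [
        suffix for suffix in EXPECTED_SUFFIXES
        if not any(name.endswith(suffix) for name in names)
    ]

def check_export(sstables_root: Path) -> dict[str, list[str]]:
    """Map 'keyspace/table' to its missing components (empty list = complete)."""
    return {
        f"{ks.name}/{tbl.name}": missing_components(tbl)
        for ks in sorted(sstables_root.iterdir()) if ks.is_dir()
        for tbl in sorted(ks.iterdir()) if tbl.is_dir()
    }
```

Running this over `datasets/sstables/` after an export makes a silently incomplete copy fail fast instead of surfacing later as a confusing parser error.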
### shutdown-clean.sh
Stops Cassandra and removes Docker volumes.
What it does:
- Stops all containers
- Removes Docker volumes (clean slate)
- Prepares for next generation cycle
Use when:
- Done with current dataset
- Want to regenerate from scratch
- Cleaning up after tests
## Test Schemas

Schemas live in `test-data/schemas/`:

### basic-types.cql
Simple table with all primitive types:
- Partition key: uuid
- No clustering
- Columns: int, text, timestamp, boolean, etc.
### collections.cql
Collection types:
- list
- set
- map<text, int>
- Nested frozen collections
### time-series.cql
Time-series pattern:
- Partition key: sensor_id
- Clustering: timestamp (DESC)
- Columns: temperature, humidity, pressure
### wide-rows.cql
Wide partition testing:
- Single partition key
- Many clustering rows (1000+)
- Tests pagination and offset handling
### Custom Schemas
Add your own:
```bash
# Create schema
echo "CREATE TABLE test_keyspace.my_table (...);" > schemas/my-schema.cql

# Add to core.list
echo "my-schema.cql" >> schemas/core.list

# Generate
./scripts/start-clean.sh
./scripts/generate.sh
```
## Validation Workflow

See `validation-workflow.md` for the complete validation process.

### Validate Against sstabledump
```bash
# 1. Generate sstabledump reference
sstabledump test-data/datasets/sstables/keyspace/table/*-Data.db \
  > reference.json

# 2. Parse with cqlite
cargo run --bin cqlite -- \
  --data-dir test-data/datasets/sstables/keyspace/table \
  --schema test-data/schemas/schema.cql \
  --out json > cqlite.json

# 3. Compare (ignoring formatting)
jq -S '.' reference.json > ref-sorted.json
jq -S '.' cqlite.json > cql-sorted.json
diff ref-sorted.json cql-sorted.json
```
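Where `jq` is unavailable, the same order-insensitive comparison can be sketched in Python. Re-serializing both files with sorted keys mirrors `jq -S`, so only semantic differences remain; the function names here are illustrative.

```python
import json

def normalized(path: str) -> str:
    """Load a JSON file and re-serialize it with sorted keys and stable
    indentation, so pure formatting differences cannot show up as diffs."""
    with open(path) as f:
        data = json.load(f)
    return json.dumps(data, sort_keys=True, indent=2)

def files_match(reference: str, candidate: str) -> bool:
    """True when the two files contain semantically identical JSON."""
    return normalized(reference) == normalized(candidate)
```

Note that, like `jq -S`, this treats array order as significant, which is what you want when comparing row output: a reordered key is formatting, a reordered row is a real difference.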
### Automated Validation

Run the validation tests:
```bash
# Validate all test tables
cargo test --test sstable_validation

# Validate a specific table
cargo test --test sstable_validation -- simple_table
```
### Property Testing

Generate random data for property tests:
```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn test_row_parsing_roundtrip(
        partition_key in any::<i32>(),
        text_value in "\\PC*", // Any valid unicode
        int_value in any::<i32>(),
    ) {
        // Generate test data in Cassandra
        insert_test_row(partition_key, &text_value, int_value)?;
        flush_memtable()?;

        // Parse with cqlite
        let parsed = parse_sstable()?;

        // Validate roundtrip
        assert_eq!(parsed.get_int("partition_key"), partition_key);
        assert_eq!(parsed.get_text("text_col"), text_value);
        assert_eq!(parsed.get_int("int_col"), int_value);
    }
}
```
## Dataset Packaging

Package datasets for CI or distribution:
```bash
# Package the current dataset
./scripts/package_datasets.sh

# Output: test-data/cqlite-test-data-v5.0-<date>.tar.gz
```
Contents:
- All SSTables
- metadata.yml
- Schema files
- README with generation parameters
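Before publishing an archive, it is cheap to list its members and confirm the expected contents are actually inside. The sketch below is illustrative (the `archive_contains` helper and the exact member paths are assumptions based on the contents list above, not part of `package_datasets.sh`):

```python
import tarfile

def archive_contains(archive_path: str, required: list[str]) -> list[str]:
    """Return the required entries missing from a .tar.gz archive.
    An entry counts as present if any member path contains it."""
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getnames()
    return [
        needle for needle in required
        if not any(needle in member for member in members)
    ]

# Per the contents list above; exact in-archive paths are assumptions.
REQUIRED = ["metadata.yml", "sstables", "README"]
```

An empty return value means the archive is complete; anything else names what the packaging step dropped.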
## CI Integration

### Smoke Test

Quick validation in CI:
```bash
# Use the packaged dataset
tar xzf cqlite-test-data-v5.0.tar.gz

# Run core tests
./scripts/ci-one-shot-smoke.sh

# Validates:
# - Basic parsing
# - All CQL types
# - Compression
# - Collections
```
See `test-data/scripts/CI_SMOKE_TEST_USAGE.md` for details.
## Common Scenarios

### Scenario 1: Test a New CQL Type
```bash
# 1. Add a column to the schema
echo "ALTER TABLE test_basic.simple_table ADD duration_col duration;" \
  >> schemas/basic-types.cql

# 2. Regenerate data
./scripts/start-clean.sh
./scripts/generate.sh
./scripts/export.sh

# 3. Validate parsing
cargo test --test sstable_validation
```
### Scenario 2: Test Large Values
```bash
# Generate with a specific row size
ROWS=100 SCALE=LARGE ./scripts/generate.sh

# Validates:
# - Large text values (1MB+)
# - Large blob values
# - Large collections (1000+ elements)
```
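The large values themselves can be built deterministically so runs are reproducible. This is a sketch of what such a builder might look like; `make_large_values` and the exact sizes are illustrative, not the generator's actual API:

```python
import random

def make_large_values(seed: int = 42) -> dict:
    """Build edge-size values matching the cases above: 1MB+ text, a large
    binary payload, and a 1000+ element collection. Seeded for reproducibility."""
    rng = random.Random(seed)
    return {
        "large_text": "x" * (1024 * 1024),       # 1 MB of text
        "large_blob": rng.randbytes(512 * 1024),  # 512 KB binary payload
        "large_list": [rng.randint(0, 10**6) for _ in range(1500)],  # 1000+ elements
    }
```

Seeding matters here: when a large-value row trips a parser bug, the exact same bytes can be regenerated for the regression test.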
### Scenario 3: Test Edge Cases
```python
# Modify generate_comprehensive_test_data.py
import uuid

def generate_edge_cases(session):
    # Null values (only the partition key bound)
    session.execute("INSERT INTO table (pk) VALUES (%s)", [uuid.uuid4()])

    # Empty collections
    session.execute("INSERT INTO table (pk, tags) VALUES (%s, [])",
                    [uuid.uuid4()])

    # Empty strings
    session.execute("INSERT INTO table (pk, name) VALUES (%s, '')",
                    [uuid.uuid4()])
```
## PRD Alignment
Supports Milestone M1 (Core Reading Library):
- 95% test coverage goal
- All CQL types validated
- Real Cassandra data ensures format correctness
Supports All Milestones:
- Regression testing with frozen datasets
- Property-based testing for edge cases
- CI integration for PR validation
## Troubleshooting

### Cassandra Won't Start
```bash
# Check logs
docker logs cassandra-5-0

# Common issue: port 9042 already in use
lsof -i :9042

# Kill the process or change the port in docker-compose-cassandra5.yml
```
### Generation Fails
```bash
# Check generator logs
cat test-data/logs/data_generation.log

# Verify the schema was applied
docker exec cassandra-5-0 cqlsh -e "DESCRIBE KEYSPACES;"
```
### Export Produces No Files
```bash
# Verify data exists in the container
docker exec cassandra-5-0 ls -la /var/lib/cassandra/data/

# Check whether a flush happened
docker logs cassandra-5-0 | grep flush
```
## Dataset Repository
Packaged datasets available at:
https://github.com/pmcfadin/cqlite/releases/tag/test-data-v5.0
Download for:
- CI without Docker
- Reproducible benchmarks
- Offline development
## Next Steps

When creating new tests:
- Design the schema in `schemas/`
- Generate data with `generate.sh`
- Export SSTables with `export.sh`
- Write the parser test
- Validate with sstabledump
- Add to CI smoke test suite
See documentation:
- `dataset-generation.md` - Full workflow
- `validation-workflow.md` - Validation process