---
name: Test Data Generation & Validation
description: Generate real Cassandra 5.0 test data using Docker containers, export SSTables with proper directory structure, validate parsing against sstabledump, and manage test datasets. Use when working with test data generation, dataset creation, SSTable export, validation, fixture management, or sstabledump comparison.
---
# Test Data Generation & Validation
This skill provides guidance on generating real Cassandra 5.0 test data and validating parsing correctness.
## When to Use This Skill
- Generating test data with specific schemas
- Creating test fixtures for property tests
- Exporting SSTables from Cassandra
- Validating parsed data against sstabledump
- Managing test datasets
- Creating reproducible test scenarios
## Overview
CQLite uses real Cassandra 5.0 instances to generate test data, ensuring:
- Format correctness (real Cassandra writes)
- Edge case coverage (nulls, empty values, large values)
- Compression validation (actual compressed SSTables)
- Schema variety (all CQL types)
## Test Data Workflow

See `dataset-generation.md` for complete workflow details.

### Quick Start
```bash
cd test-data

# 1. Start clean Cassandra 5 with schemas
./scripts/start-clean.sh

# 2. Generate data (N rows per table)
ROWS=1000 ./scripts/generate.sh

# 3. Export SSTables
./scripts/export.sh

# 4. Shutdown and clean volumes
./scripts/shutdown-clean.sh
```
## Generation Scripts

### start-clean.sh

Starts the Cassandra 5.0 container and applies schemas.

What it does:
- Starts the `cassandra-5-0` container via docker-compose
- Waits for Cassandra to be healthy
- Applies schemas from `schemas/core.list`
- Verifies keyspaces and tables were created

Environment variables:
- `SCHEMA_SET=core` - Use the curated schema list (default)
- `SCHEMA_SET=all` - Use all `*.cql` files

Example:

```bash
# Use default core schemas
./scripts/start-clean.sh

# Use all schemas
SCHEMA_SET=all ./scripts/start-clean.sh
```
### generate.sh

Generates test data using the Python data generator.

What it does:
- Connects to the running Cassandra container
- Generates type-correct data for each table
- Inserts rows using prepared statements
- Flushes memtables to SSTables
- Produces `metadata.yml` with row counts

Environment variables:
- `ROWS=N` - Rows per table (default varies by SCALE)
- `TABLES=table1,table2` - Generate only for specific tables
- `SCALE=SMALL|MEDIUM|LARGE` - Preset sizes

Example:

```bash
# Generate 1000 rows per table
ROWS=1000 ./scripts/generate.sh

# Generate only for specific tables
TABLES=simple_table,collection_table ROWS=500 ./scripts/generate.sh

# Use LARGE scale preset
SCALE=LARGE ./scripts/generate.sh
```
### export.sh

Exports SSTables from the Cassandra data directory.

What it does:
- Stops Cassandra to ensure a consistent snapshot
- Copies SSTables from the container to `datasets/sstables/`
- Preserves the directory structure (keyspace/table/files)
- Copies `metadata.yml`
- Creates metadata about the dataset

Output structure:

```
test-data/datasets/
├── metadata.yml              # Generated by generate.sh
└── sstables/
    ├── test_basic/
    │   └── simple_table/
    │       ├── *-Data.db
    │       ├── *-Index.db
    │       ├── *-Statistics.db
    │       ├── *-Summary.db
    │       └── *-TOC.txt
    ├── test_collections/
    └── test_timeseries/
```
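As a post-export sanity check, a small script along these lines can confirm that every exported table directory contains the expected SSTable components. This is a sketch, not part of `export.sh`: the `missing_components`/`check_export` names are illustrative, and the component list simply mirrors the output structure shown above.

```python
from pathlib import Path

# Components each exported table directory should contain,
# per the output structure above.
EXPECTED_SUFFIXES = ["Data.db", "Index.db", "Statistics.db", "Summary.db", "TOC.txt"]

def missing_components(table_dir: Path) -> list[str]:
    """Return the component suffixes with no matching file in table_dir."""
    names = [p.name for p in table_dir.iterdir() if p.is_file()]
    return [
        suffix for suffix in EXPECTED_SUFFIXES
        if not any(name.endswith(suffix) for name in names)
    ]

def check_export(sstables_root: Path) -> dict[str, list[str]]:
    """Map 'keyspace/table' to its missing components (empty list = complete)."""
    return {
        f"{ks.name}/{tbl.name}": missing_components(tbl)
        for ks in sorted(sstables_root.iterdir()) if ks.is_dir()
        for tbl in sorted(ks.iterdir()) if tbl.is_dir()
    }
```

Running this over `datasets/sstables/` after an export makes a silently incomplete copy fail fast instead of surfacing later as a confusing parser error.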
### shutdown-clean.sh
Stops Cassandra and removes Docker volumes.
What it does:
- Stops all containers
- Removes Docker volumes (clean slate)
- Prepares for next generation cycle
Use when:
- Done with current dataset
- Want to regenerate from scratch
- Cleaning up after tests
## Test Schemas

Schemas live in `test-data/schemas/`:

### basic-types.cql
Simple table with all primitive types:
- Partition key: uuid
- No clustering
- Columns: int, text, timestamp, boolean, etc.
### collections.cql
Collection types:
- list
- set
- map<text, int>
- Nested frozen collections
### time-series.cql
Time-series pattern:
- Partition key: sensor_id
- Clustering: timestamp (DESC)
- Columns: temperature, humidity, pressure
### wide-rows.cql
Wide partition testing:
- Single partition key
- Many clustering rows (1000+)
- Tests pagination and offset handling
### Custom Schemas
Add your own:
```bash
# Create schema
echo "CREATE TABLE test_keyspace.my_table (...);" > schemas/my-schema.cql

# Add to core.list
echo "my-schema.cql" >> schemas/core.list

# Generate
./scripts/start-clean.sh
./scripts/generate.sh
```
## Validation Workflow

See `validation-workflow.md` for the complete validation process.

### Validate Against sstabledump
```bash
# 1. Generate sstabledump reference
sstabledump test-data/datasets/sstables/keyspace/table/*-Data.db \
  > reference.json

# 2. Parse with cqlite
cargo run --bin cqlite -- \
  --data-dir test-data/datasets/sstables/keyspace/table \
  --schema test-data/schemas/schema.cql \
  --out json > cqlite.json

# 3. Compare (ignoring formatting)
jq -S '.' reference.json > ref-sorted.json
jq -S '.' cqlite.json > cql-sorted.json
diff ref-sorted.json cql-sorted.json
```
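Where `jq` is unavailable, the same order-insensitive comparison can be sketched in Python. Re-serializing both files with sorted keys mirrors `jq -S`, so only semantic differences remain; the function names here are illustrative.

```python
import json

def normalized(path: str) -> str:
    """Load a JSON file and re-serialize it with sorted keys and stable
    indentation, so pure formatting differences cannot show up as diffs."""
    with open(path) as f:
        data = json.load(f)
    return json.dumps(data, sort_keys=True, indent=2)

def files_match(reference: str, candidate: str) -> bool:
    """True when the two files contain semantically identical JSON."""
    return normalized(reference) == normalized(candidate)
```

Note that, like `jq -S`, this treats array order as significant, which is what you want when comparing row output: a reordered key is formatting, a reordered row is a real difference.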
### Automated Validation

Run the validation tests:
```bash
# Validate all test tables
cargo test --test sstable_validation

# Validate a specific table
cargo test --test sstable_validation -- simple_table
```
### Property Testing

Generate random data for property tests:
```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn test_row_parsing_roundtrip(
        partition_key in any::<i32>(),
        text_value in "\\PC*", // Any valid unicode
        int_value in any::<i32>(),
    ) {
        // Generate test data in Cassandra
        insert_test_row(partition_key, &text_value, int_value)?;
        flush_memtable()?;

        // Parse with cqlite
        let parsed = parse_sstable()?;

        // Validate roundtrip
        assert_eq!(parsed.get_int("partition_key"), partition_key);
        assert_eq!(parsed.get_text("text_col"), text_value);
        assert_eq!(parsed.get_int("int_col"), int_value);
    }
}
```
## Dataset Packaging

Package datasets for CI or distribution:
```bash
# Package the current dataset
./scripts/package_datasets.sh

# Output: test-data/cqlite-test-data-v5.0-<date>.tar.gz
```
Contents:
- All SSTables
- metadata.yml
- Schema files
- README with generation parameters
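Before publishing an archive, it is cheap to list its members and confirm the expected contents are actually inside. The sketch below is illustrative (the `archive_contains` helper and the exact member paths are assumptions based on the contents list above, not part of `package_datasets.sh`):

```python
import tarfile

def archive_contains(archive_path: str, required: list[str]) -> list[str]:
    """Return the required entries missing from a .tar.gz archive.
    An entry counts as present if any member path contains it."""
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getnames()
    return [
        needle for needle in required
        if not any(needle in member for member in members)
    ]

# Per the contents list above; exact in-archive paths are assumptions.
REQUIRED = ["metadata.yml", "sstables", "README"]
```

An empty return value means the archive is complete; anything else names what the packaging step dropped.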
## CI Integration

### Smoke Test

Quick validation in CI:
```bash
# Use the packaged dataset
tar xzf cqlite-test-data-v5.0.tar.gz

# Run core tests
./scripts/ci-one-shot-smoke.sh

# Validates:
# - Basic parsing
# - All CQL types
# - Compression
# - Collections
```
See `test-data/scripts/CI_SMOKE_TEST_USAGE.md` for details.
## Common Scenarios

### Scenario 1: Test a New CQL Type
```bash
# 1. Add a column to the schema
echo "ALTER TABLE test_basic.simple_table ADD duration_col duration;" \
  >> schemas/basic-types.cql

# 2. Regenerate data
./scripts/start-clean.sh
./scripts/generate.sh
./scripts/export.sh

# 3. Validate parsing
cargo test --test sstable_validation
```
### Scenario 2: Test Large Values
```bash
# Generate with a specific row size
ROWS=100 SCALE=LARGE ./scripts/generate.sh

# Validates:
# - Large text values (1MB+)
# - Large blob values
# - Large collections (1000+ elements)
```
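The large values themselves can be built deterministically so runs are reproducible. This is a sketch of what such a builder might look like; `make_large_values` and the exact sizes are illustrative, not the generator's actual API:

```python
import random

def make_large_values(seed: int = 42) -> dict:
    """Build edge-size values matching the cases above: 1MB+ text, a large
    binary payload, and a 1000+ element collection. Seeded for reproducibility."""
    rng = random.Random(seed)
    return {
        "large_text": "x" * (1024 * 1024),       # 1 MB of text
        "large_blob": rng.randbytes(512 * 1024),  # 512 KB binary payload
        "large_list": [rng.randint(0, 10**6) for _ in range(1500)],  # 1000+ elements
    }
```

Seeding matters here: when a large-value row trips a parser bug, the exact same bytes can be regenerated for the regression test.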
### Scenario 3: Test Edge Cases
```python
# Modify generate_comprehensive_test_data.py
import uuid

def generate_edge_cases(session):
    # Null values (only the partition key bound)
    session.execute("INSERT INTO table (pk) VALUES (%s)", [uuid.uuid4()])

    # Empty collections
    session.execute("INSERT INTO table (pk, tags) VALUES (%s, [])",
                    [uuid.uuid4()])

    # Empty strings
    session.execute("INSERT INTO table (pk, name) VALUES (%s, '')",
                    [uuid.uuid4()])
```
## PRD Alignment
Supports Milestone M1 (Core Reading Library):
- 95% test coverage goal
- All CQL types validated
- Real Cassandra data ensures format correctness
Supports All Milestones:
- Regression testing with frozen datasets
- Property-based testing for edge cases
- CI integration for PR validation
## Troubleshooting

### Cassandra Won't Start
```bash
# Check logs
docker logs cassandra-5-0

# Common issue: port 9042 already in use
lsof -i :9042

# Kill the process or change the port in docker-compose-cassandra5.yml
```
### Generation Fails
```bash
# Check generator logs
cat test-data/logs/data_generation.log

# Verify the schema was applied
docker exec cassandra-5-0 cqlsh -e "DESCRIBE KEYSPACES;"
```
### Export Produces No Files
```bash
# Verify data exists in the container
docker exec cassandra-5-0 ls -la /var/lib/cassandra/data/

# Check whether a flush happened
docker logs cassandra-5-0 | grep flush
```
## Dataset Repository
Packaged datasets available at:
https://github.com/pmcfadin/cqlite/releases/tag/test-data-v5.0
Download for:
- CI without Docker
- Reproducible benchmarks
- Offline development
## Next Steps

When creating new tests:
- Design the schema in `schemas/`
- Generate data with `generate.sh`
- Export SSTables with `export.sh`
- Write the parser test
- Validate with sstabledump
- Add to CI smoke test suite
See documentation:
- `dataset-generation.md` - Full workflow
- `validation-workflow.md` - Validation process