| name | moai-formats-data |
| description | Data format specialist covering TOON encoding, JSON/YAML optimization, serialization patterns, and data validation for modern applications |
| version | 1.0.0 |
| category | library |
| allowed-tools | Read, Write, Edit, Grep, Glob |
| tags | formats, data, toon, serialization, validation, optimization |
| updated | 2025-12-06 |
| status | active |
| author | MoAI-ADK Team |
Data Format Specialist
Quick Reference (30 seconds)
Advanced Data Format Management - Comprehensive data handling covering TOON encoding, JSON/YAML optimization, serialization patterns, and data validation for performance-critical applications.
Core Capabilities:
- TOON Encoding: 40-60% token reduction vs JSON for LLM communication
- JSON/YAML Optimization: Efficient serialization and parsing patterns
- Data Validation: Schema validation, type checking, error handling
- Format Conversion: Seamless transformation between data formats
- Performance: Optimized data structures and caching strategies
- Schema Management: Dynamic schema generation and evolution
When to Use:
- Optimizing data transmission to LLMs within token budgets
- High-performance serialization/deserialization
- Schema validation and data integrity
- Format conversion and data transformation
- Large dataset processing and optimization
Quick Start:
# TOON encoding (40-60% token reduction)
from moai_formats_data import TOONEncoder
encoder = TOONEncoder()
compressed = encoder.encode({"user": "John", "age": 30})
original = encoder.decode(compressed)
# Fast JSON processing
from moai_formats_data import JSONOptimizer
optimizer = JSONOptimizer()
fast_json = optimizer.serialize_fast(large_dataset)
# Data validation
from moai_formats_data import DataValidator
validator = DataValidator()
schema = validator.create_schema({"name": {"type": "string", "required": True}})
result = validator.validate({"name": "John"}, schema)
Implementation Guide (5 minutes)
Core Concepts
TOON (Token-Optimized Object Notation):
- Custom binary-compatible format optimized for LLM token usage
- Type markers: `#` (numbers), `!` (booleans), `@` (timestamps), `~` (null)
- 40-60% size reduction vs JSON for typical data structures
- Lossless round-trip encoding/decoding (see the marker sketch after this list)
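To make the markers concrete, here is a minimal, hypothetical sketch of marker-based value tagging. The real wire format is produced by TOONEncoder (see modules/toon-encoding.md); the helper below only illustrates the idea and is not the library API.
# Illustrative only: marker-based tagging of scalar values (hypothetical helper)
from datetime import datetime
def tag_value(value):
    if value is None:
        return "~"                      # null marker
    if isinstance(value, bool):
        return f"!{int(value)}"         # boolean marker (checked before int)
    if isinstance(value, (int, float)):
        return f"#{value}"              # number marker
    if isinstance(value, datetime):
        return f"@{value.isoformat()}"  # timestamp marker
    return str(value)                   # strings pass through untagged
print(tag_value(30))    # "#30"
print(tag_value(True))  # "!1"
print(tag_value(None))  # "~"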
Performance Optimization:
- Ultra-fast JSON processing with orjson (2-5x faster than the standard json module; see the sketch after this list)
- Streaming processing for large datasets using ijson
- Intelligent caching with LRU eviction and memory management
- Schema compression and validation optimization
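JSONOptimizer.serialize_fast/deserialize_fast are assumed to wrap a fast parser such as orjson; the sketch below shows the underlying orjson calls directly (requires `pip install orjson`).
# Plain orjson usage; the JSONOptimizer wrapper is assumed to build on calls like these
import orjson
record = {"user": "John", "age": 30, "active": True}
payload = orjson.dumps(record)    # bytes output, typically 2-5x faster than json.dumps
restored = orjson.loads(payload)  # parse back to Python objects
assert restored == record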
Data Validation:
- Type-safe validation with custom rules and patterns (see the jsonschema sketch after this list)
- Schema evolution and migration support
- Cross-field validation and dependency checking
- Performance-optimized batch validation
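For reference, the same validation idea expressed with the jsonschema library listed in the Technology Stack; this is an equivalent sketch, not the DataValidator implementation.
# Plain jsonschema equivalent of the validation idea (assumes `pip install jsonschema`)
from jsonschema import Draft7Validator
schema = {
    "type": "object",
    "properties": {
        "username": {"type": "string", "minLength": 3},
        "age": {"type": "integer", "minimum": 13},
    },
    "required": ["username"],
}
validator = Draft7Validator(schema)
errors = [e.message for e in validator.iter_errors({"username": "jo", "age": 10})]
print(errors)  # e.g. ["'jo' is too short", "10 is less than the minimum of 13"]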
Basic Implementation
from moai_formats_data import TOONEncoder, JSONOptimizer, DataValidator
from datetime import datetime
# 1. TOON Encoding for LLM optimization
encoder = TOONEncoder()
data = {
    "user": {"id": 123, "name": "John", "active": True, "created": datetime.now()},
    "permissions": ["read", "write", "admin"]
}
# Encode and compare sizes
toon_data = encoder.encode(data)
original_data = encoder.decode(toon_data)
# 2. Fast JSON Processing
optimizer = JSONOptimizer()
# Ultra-fast serialization
json_bytes = optimizer.serialize_fast(data)
parsed_data = optimizer.deserialize_fast(json_bytes)
# Schema compression for repeated validation
schema = {"type": "object", "properties": {"name": {"type": "string"}}}
compressed_schema = optimizer.compress_schema(schema)
# 3. Data Validation
validator = DataValidator()
# Create validation schema
user_schema = validator.create_schema({
    "username": {"type": "string", "required": True, "min_length": 3},
    "email": {"type": "email", "required": True},
    "age": {"type": "integer", "required": False, "min_value": 13}
})
# Validate data
user_data = {"username": "john_doe", "email": "john@example.com", "age": 30}
result = validator.validate(user_data, user_schema)
if result['valid']:
    print("Data is valid!")
    sanitized = result['sanitized_data']
else:
    print("Validation errors:", result['errors'])
Common Use Cases
API Response Optimization:
# Optimize API responses for LLM consumption
from typing import Dict
from moai_formats_data import TOONEncoder
def optimize_api_response(data: Dict) -> str:
    encoder = TOONEncoder()
    return encoder.encode(data)
# Parse optimized responses
def parse_optimized_response(toon_data: str) -> Dict:
    encoder = TOONEncoder()
    return encoder.decode(toon_data)
Configuration Management:
# Fast YAML configuration loading
from moai_formats_data import YAMLOptimizer
yaml_optimizer = YAMLOptimizer()
config = yaml_optimizer.load_fast("config.yaml")
# Merge multiple configurations
merged = yaml_optimizer.merge_configs(base_config, env_config, user_config)
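YAMLOptimizer.load_fast is assumed to rely on PyYAML's C-based loader when libyaml is available; a plain PyYAML sketch of that pattern, with a pure-Python fallback:
# Plain PyYAML fast-loading pattern (assumes `pip install pyyaml`; libyaml optional)
import yaml
try:
    from yaml import CSafeLoader as SafeLoader  # C-based loader, much faster
except ImportError:
    from yaml import SafeLoader                  # pure-Python fallback
with open("config.yaml") as f:
    config = yaml.load(f, Loader=SafeLoader)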
Large Dataset Processing:
# Stream processing for large JSON files
from moai_formats_data import StreamProcessor
processor = StreamProcessor(chunk_size=8192)
# Process items one at a time without loading the whole file into memory
def process_item(item):
    print(f"Processing: {item['id']}")
processor.process_json_stream("large_dataset.json", process_item)
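StreamProcessor.process_json_stream is assumed to wrap a streaming parser such as ijson; for a file containing a top-level JSON array, the equivalent raw ijson sketch looks like this:
# Plain ijson streaming (assumes `pip install ijson`) over a top-level JSON array
import ijson
with open("large_dataset.json", "rb") as f:
    for item in ijson.items(f, "item"):  # yields one array element at a time
        process_item(item)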
Advanced Features (10+ minutes)
Advanced TOON Features
Custom Type Handlers:
# Extend TOON encoder with custom types
class CustomTOONEncoder(TOONEncoder):
    def _encode_value(self, value):
        # Handle UUID objects
        if hasattr(value, 'hex') and len(value.hex) == 32:  # UUID
            return f'${value.hex}'
        # Handle Decimal objects
        elif hasattr(value, 'as_tuple'):  # Decimal
            return f'&{str(value)}'
        return super()._encode_value(value)
    def _parse_value(self, s):
        # Parse custom UUIDs
        if s.startswith('$') and len(s) == 33:
            import uuid
            return uuid.UUID(s[1:])
        # Parse custom Decimals
        elif s.startswith('&'):
            from decimal import Decimal
            return Decimal(s[1:])
        return super()._parse_value(s)
Streaming TOON Processing:
from typing import Dict, List
# Process TOON data in streaming mode
def stream_toon_data(data_generator):
    encoder = TOONEncoder()
    for data in data_generator:
        yield encoder.encode(data)
# Batch TOON processing
def batch_encode_toon(data_list: List[Dict], batch_size: int = 1000):
    encoder = TOONEncoder()
    results = []
    for i in range(0, len(data_list), batch_size):
        batch = data_list[i:i + batch_size]
        encoded_batch = [encoder.encode(item) for item in batch]
        results.extend(encoded_batch)
    return results
Advanced Validation Patterns
Cross-Field Validation:
# Validate relationships between fields
from typing import Dict
class CrossFieldValidator:
    def __init__(self):
        self.base_validator = DataValidator()
    def validate_user_data(self, data: Dict) -> Dict:
        # Base validation
        schema = self.base_validator.create_schema({
            "password": {"type": "string", "required": True, "min_length": 8},
            "confirm_password": {"type": "string", "required": True},
            "email": {"type": "email", "required": True}
        })
        result = self.base_validator.validate(data, schema)
        # Cross-field validation
        if data.get("password") != data.get("confirm_password"):
            result['errors']['password_mismatch'] = "Passwords do not match"
            result['valid'] = False
        return result
Schema Evolution:
# Handle schema changes over time
from typing import Dict
from moai_formats_data import SchemaEvolution
evolution = SchemaEvolution()
# Define schema versions
v1_schema = {"name": {"type": "string"}, "age": {"type": "integer"}}
v2_schema = {"full_name": {"type": "string"}, "age": {"type": "integer"}, "email": {"type": "email"}}
# Register schemas
evolution.register_schema("v1", v1_schema)
evolution.register_schema("v2", v2_schema)
# Add migration function
def migrate_v1_to_v2(data: Dict) -> Dict:
    return {
        "full_name": data["name"],
        "age": data["age"],
        "email": None  # New required field
    }
evolution.add_migration("v1", "v2", migrate_v1_to_v2)
# Migrate data
old_data = {"name": "John Doe", "age": 30}
new_data = evolution.migrate_data(old_data, "v1", "v2")
Performance Optimization
Intelligent Caching:
import time
from typing import Dict
from moai_formats_data import SmartCache
# Create cache with memory constraints
cache = SmartCache(max_memory_mb=50, max_items=10000)
@cache.cache_result(ttl=1800)  # 30 minutes
def expensive_data_processing(data: Dict) -> Dict:
    # Simulate expensive computation
    time.sleep(0.1)
    return {"processed": True, "data": data}
# Cache statistics
print(cache.get_stats())
# Cache warming
def warm_common_data():
    common_queries = [
        {"type": "user", "id": 1},
        {"type": "user", "id": 2},
        {"type": "config", "key": "app"}
    ]
    for query in common_queries:
        expensive_data_processing(query)
warm_common_data()
Batch Processing Optimization:
# Optimized batch validation
from typing import Dict, List
def validate_batch_optimized(data_list: List[Dict], schema: Dict) -> List[Dict]:
    validator = DataValidator()
    # Pre-compile patterns for performance
    validator._compile_schema_patterns(schema)
    # Process in batches for memory efficiency
    batch_size = 1000
    results = []
    for i in range(0, len(data_list), batch_size):
        batch = data_list[i:i + batch_size]
        batch_results = [validator.validate(data, schema) for data in batch]
        results.extend(batch_results)
    return results
Integration Patterns
LLM Integration:
# Prepare data for LLM consumption
from typing import Dict
def prepare_for_llm(data: Dict, max_tokens: int = 2000) -> str:
    encoder = TOONEncoder()
    toon_data = encoder.encode(data)
    # Rough token estimate based on whitespace splitting
    estimated_tokens = len(toon_data.split())
    if estimated_tokens > max_tokens:
        # Fall back to a data reduction strategy
        reduced_data = reduce_data_complexity(data, max_tokens)
        toon_data = encoder.encode(reduced_data)
    return toon_data
def reduce_data_complexity(data: Dict, max_tokens: int) -> Dict:
    """Reduce data complexity to fit the token budget."""
    # Keep only high-priority fields first
    priority_fields = ["id", "name", "email", "status"]
    reduced = {k: v for k, v in data.items() if k in priority_fields}
    # Further reduction if still over budget
    encoder = TOONEncoder()
    while len(encoder.encode(reduced).split()) > max_tokens:
        if len(reduced) <= 1:
            break
        # Drop the most recently kept field
        reduced.popitem()
    return reduced
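The whitespace split above is only a rough token estimate; when an exact count matters and a tokenizer such as tiktoken is installed, the estimate can be replaced with a real encoding count (a hedged sketch, not part of this skill's API):
# Optional: exact token counting with tiktoken (assumes `pip install tiktoken`)
import tiktoken
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))
print(count_tokens("user:John age:#30"))  # counts real tokens instead of whitespace words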
Database Integration:
# Optimize database query results with format conversion
from typing import Dict, List
def optimize_db_response(db_data: List[Dict]) -> Dict:
    # Convert database results to an optimized format
    optimizer = JSONOptimizer()
    # Compress and cache the shared schema
    common_schema = {"type": "object", "properties": {"id": {"type": "integer"}}}
    compressed_schema = optimizer.compress_schema(common_schema)
    # Transform items one by one (StreamProcessor can take over for very large result sets)
    processor = StreamProcessor()
    processed_data = []
    for item in db_data:
        # Apply validation and transformation
        processed_item = transform_db_item(item)
        processed_data.append(processed_item)
    return {
        "data": processed_data,
        "count": len(processed_data),
        "schema": compressed_schema
    }
Works Well With
- moai-domain-backend - Backend data serialization and API responses
- moai-domain-database - Database data format optimization
- moai-foundation-core - Core data architecture principles and MCP data serialization/transmission patterns
- moai-docs-generation - Documentation data formatting
Module References
Core Implementation Modules:
- `modules/toon-encoding.md` - TOON encoding implementation and examples
- `modules/json-optimization.md` - High-performance JSON/YAML processing
- `modules/data-validation.md` - Advanced validation and schema management
- `modules/caching-performance.md` - Caching strategies and performance optimization
Usage Examples
CLI Usage
# Encode data to TOON format
moai-formats encode-toon --input data.json --output data.toon
# Validate data against schema
moai-formats validate --schema schema.json --data data.json
# Convert between formats
moai-formats convert --input data.json --output data.yaml --format yaml
# Optimize JSON structure
moai-formats optimize-json --input large-data.json --output optimized.json
Python API
from moai_formats_data import TOONEncoder, DataValidator, JSONOptimizer
# TOON encoding
encoder = TOONEncoder()
toon_data = encoder.encode({"user": "John", "age": 30})
original_data = encoder.decode(toon_data)
# Data validation
validator = DataValidator()
schema = validator.create_schema({
    "name": {"type": "string", "required": True, "min_length": 2},
    "email": {"type": "email", "required": True}
})
result = validator.validate({"name": "John", "email": "john@example.com"}, schema)
# JSON optimization
optimizer = JSONOptimizer()
fast_json = optimizer.serialize_fast(large_dataset)
parsed_data = optimizer.deserialize_fast(fast_json)
Technology Stack
Core Libraries:
- orjson: Ultra-fast JSON parsing and serialization
- PyYAML: YAML processing with C-based loaders
- ijson: Streaming JSON parser for large files
- python-dateutil: Advanced datetime parsing
- regex: Advanced regular expression support
Performance Tools:
- lru_cache: Built-in memoization
- pickle: Object serialization
- hashlib: Hash generation for caching (see the cache-key sketch after this list)
- functools: Function decorators and utilities
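A common pattern that combines these tools is deriving a stable cache key from a serialized payload and memoizing lookups; a minimal illustrative sketch (not the SmartCache internals):
# Minimal cache-key pattern using hashlib + functools (illustrative only)
import hashlib
import json
from functools import lru_cache
def cache_key(data: dict) -> str:
    # Stable key: serialize with sorted keys, then hash
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()
@lru_cache(maxsize=1024)
def lookup(key: str) -> str:
    return f"computed-for-{key}"
print(lookup(cache_key({"type": "user", "id": 1})))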
Validation Libraries:
- jsonschema: JSON Schema validation
- cerberus: Lightweight data validation
- marshmallow: Object serialization/deserialization
- pydantic: Data validation using Python type hints (see the sketch below)
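For comparison, the same user validation expressed with pydantic type hints (illustrative only; not the DataValidator internals):
# Type-hint based validation with pydantic (assumes pydantic is installed)
from pydantic import BaseModel, Field, ValidationError
class User(BaseModel):
    username: str = Field(min_length=3)
    age: int = 0
user = User(username="john_doe", age=30)  # passes validation
try:
    User(username="jo")                    # too short -> raises ValidationError
except ValidationError as exc:
    print(exc.errors())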
Status: Production Ready | Last Updated: 2025-11-30 | Maintained by: MoAI-ADK Data Team