| name | sqlite-db-truncate |
| description | Guidance for recovering data from corrupted or truncated SQLite database files through binary analysis and manual parsing. This skill applies when working with damaged SQLite databases that cannot be opened with standard tools, particularly when corruption is due to binary truncation, incomplete writes, or filesystem errors. |
SQLite Truncated Database Recovery
This skill provides systematic approaches for recovering data from SQLite database files that have been corrupted through binary truncation. It emphasizes understanding the SQLite file format before attempting recovery and avoiding common pitfalls that lead to multiple failed iterations.
When to Use This Skill
This skill applies when:
- A SQLite database file cannot be opened with standard
sqlite3commands - The database error indicates corruption or malformed data
- File size is smaller than expected (suggesting truncation)
- Standard recovery tools like
.recovercommand fail - Manual binary parsing of SQLite page structure is required
Initial Assessment Strategy
Before writing any recovery code, perform a thorough analysis of the corrupted file:
Step 1: Examine File Characteristics
# Check file size and basic properties
ls -lh database.db
file database.db
# Create hex dump for analysis
hexdump -C database.db | head -100
Key observations to make:
- File size: SQLite pages are typically 4096 bytes. Check if size aligns with page boundaries
- Magic bytes: Valid SQLite files start with "SQLite format 3\000" (16 bytes)
- First byte after header: Identifies page type (0x0d = table leaf page with actual data)
Step 2: Identify Corruption Pattern
Common truncation scenarios:
- Header-only file: Only the 100-byte header remains
- Missing header: File starts with a data page (first byte is 0x0d, 0x05, 0x0a, or 0x02)
- Partial page: File ends mid-page, truncating some cells
If the file lacks the standard "SQLite format 3" magic header but starts with 0x0d, this indicates the file contains only a table leaf page without the database header.
Step 3: Try Standard Tools First
Always attempt standard recovery before manual parsing:
# Check if sqlite3 can read the file
sqlite3 database.db ".schema" 2>&1
sqlite3 database.db "SELECT * FROM sqlite_master" 2>&1
# Try built-in recovery
sqlite3 database.db ".recover" > recovered.sql 2>&1
# Try integrity check
sqlite3 database.db "PRAGMA integrity_check;"
If these fail with "database disk image is malformed" or similar errors, proceed to manual binary parsing.
SQLite Page Structure Overview
Understanding the page structure is essential before writing recovery code.
Table Leaf Page Layout (Page Type 0x0d)
Offset Size Description
------ ---- -----------
0 1 Page type (0x0d for table leaf)
1 2 First freeblock offset (big-endian)
3 2 Number of cells on page (big-endian)
5 2 Cell content area start offset (big-endian)
7 1 Fragmented free bytes count
8+ varies Cell pointer array (2 bytes per cell, big-endian)
... [Gap/free space]
End Cell data (grows backward from page end)
Cell Structure
Each cell contains a database row:
[Payload size: varint]
[Row ID: varint]
[Header size: varint]
[Serial type 1: varint]
[Serial type 2: varint]
...
[Column 1 value]
[Column 2 value]
...
Varint Encoding
SQLite uses variable-length integers (varints):
- Bytes 1-8: Use 7 bits for data, high bit (0x80) indicates continuation
- Byte 9: Uses all 8 bits (no continuation)
Serial Types
Serial types indicate how to interpret column data:
| Type | Size | Meaning |
|---|---|---|
| 0 | 0 | NULL |
| 1 | 1 | 8-bit signed integer |
| 2 | 2 | 16-bit big-endian signed integer |
| 3 | 3 | 24-bit big-endian signed integer |
| 4 | 4 | 32-bit big-endian signed integer |
| 7 | 8 | IEEE 754 64-bit float (big-endian) |
| 8 | 0 | Integer constant 0 |
| 9 | 0 | Integer constant 1 |
| N >= 12, even | (N-12)/2 | BLOB |
| N >= 13, odd | (N-13)/2 | Text string (UTF-8) |
Example: Serial type 0x21 (33) = text string of length (33-13)/2 = 10 bytes.
Recovery Approach
Build a Single, Modular Script
Avoid creating multiple separate recovery scripts. Instead, build one script iteratively with clear debug output:
import struct
import json
DEBUG = True
def read_varint(data, offset):
"""Read SQLite variable-length integer."""
value = 0
for i in range(9):
if offset + i >= len(data):
return None, offset
byte = data[offset + i]
if i == 8:
value = (value << 8) | byte
return value, offset + i + 1
value = (value << 7) | (byte & 0x7f)
if (byte & 0x80) == 0:
return value, offset + i + 1
return value, offset
def decode_value(data, offset, serial_type):
"""Decode value based on serial type."""
if serial_type == 0:
return None, offset
elif serial_type == 1:
return struct.unpack('>b', data[offset:offset+1])[0], offset + 1
elif serial_type == 2:
return struct.unpack('>h', data[offset:offset+2])[0], offset + 2
elif serial_type == 4:
return struct.unpack('>i', data[offset:offset+4])[0], offset + 4
elif serial_type == 7:
return struct.unpack('>d', data[offset:offset+8])[0], offset + 8
elif serial_type == 8:
return 0, offset
elif serial_type == 9:
return 1, offset
elif serial_type >= 12:
if serial_type % 2 == 0:
length = (serial_type - 12) // 2
return data[offset:offset+length], offset + length
else:
length = (serial_type - 13) // 2
return data[offset:offset+length].decode('utf-8', errors='replace'), offset + length
return None, offset
Parse Incrementally with Debug Output
Parse one cell completely and verify before processing all cells:
def parse_cell(data, cell_offset, debug=DEBUG):
"""Parse a single cell with detailed debug output."""
if debug:
print(f"\nParsing cell at offset {cell_offset} (0x{cell_offset:04x})")
# Read payload size
payload_size, offset = read_varint(data, cell_offset)
if debug:
print(f" Payload size: {payload_size}")
# Read row ID
row_id, offset = read_varint(data, offset)
if debug:
print(f" Row ID: {row_id}")
# Read header size
header_size, header_start = read_varint(data, offset)
if debug:
print(f" Header size: {header_size}")
# Parse serial types
serial_types = []
current = header_start
header_end = offset + header_size
while current < header_end:
st, current = read_varint(data, current)
serial_types.append(st)
if debug:
print(f" Serial types: {serial_types}")
# Parse values
values = []
for st in serial_types:
val, current = decode_value(data, current, st)
values.append(val)
if debug:
print(f" Values: {values}")
return {'row_id': row_id, 'values': values}
Common Pitfalls and Prevention
Pitfall 1: Not Understanding the Corruption Pattern
Mistake: Assuming the file has a standard SQLite header when it may only contain a data page.
Prevention: Always examine the first few bytes with hexdump. If the file starts with 0x0d instead of "SQLite format 3", the header is missing. Adjust parsing offsets accordingly (no 100-byte header offset needed).
Pitfall 2: Multiple Script Iterations
Mistake: Creating many separate recovery scripts (recover1.py, recover2.py, etc.) based on trial and error.
Prevention:
- Read the hex dump thoroughly first and annotate the structure manually
- Build one script with debug flags
- Reference the SQLite file format specification before coding
Pitfall 3: Reading Strings Beyond Their Boundaries
Mistake: Reading string data without checking the serial type length, resulting in incorrect strings (e.g., "testword052" instead of "testword05").
Prevention: Always calculate string length from serial type: length = (serial_type - 13) // 2. Read exactly that many bytes.
Pitfall 4: Syntax Errors in Generated Code
Mistake: Missing spaces in operators like if48 instead of if 48, or 12and instead of 12 and.
Prevention: Validate syntax before running:
python3 -m py_compile recovery_script.py
Pitfall 5: Wrong Byte Order
Mistake: Reading multi-byte integers with little-endian instead of big-endian.
Prevention: SQLite uses big-endian for all multi-byte integers. Always use struct.unpack('>...', data) with the > prefix.
Pitfall 6: Not Handling Truncation Gracefully
Mistake: Script crashes when encountering truncated data at end of file.
Prevention: Check bounds before every read operation:
def safe_read(data, offset, length):
if offset + length > len(data):
return None
return data[offset:offset+length]
Verification Strategy
Step 1: Validate Cell Count
Compare the number of cells reported in the page header (offset 3-4) with actual cells found.
Step 2: Validate Data Patterns
If expected patterns are known (e.g., words matching "testwordXY"), verify extracted strings match the pattern.
Step 3: Check Value Ranges
Verify extracted numeric values are within expected ranges. Watch for:
- Unexpected negative numbers (sign bit interpretation)
- Very large numbers (byte order issues)
- NaN or infinity for floats
Step 4: Compare with Expected Output Format
Before finalizing output, ensure JSON structure matches requirements:
# Validate output structure
for record in recovered_data:
assert 'word' in record and 'value' in record
assert isinstance(record['word'], str)
assert isinstance(record['value'], (int, float))
Output Generation
Format recovered data according to the required output specification:
def generate_output(recovered_rows, output_path):
"""Format and save recovered data."""
results = []
for row in recovered_rows:
if len(row['values']) >= 2:
results.append({
'word': row['values'][0],
'value': row['values'][1]
})
with open(output_path, 'w') as f:
json.dump(results, f, indent=2)
print(f"Recovered {len(results)} records to {output_path}")
return results
Summary Checklist
Before writing recovery code:
- Examined file with hexdump to understand corruption extent
- Identified whether header is present or missing
- Tried standard SQLite tools first
- Reviewed SQLite file format specification
During implementation:
- Using a single script with debug output (not multiple scripts)
- Validated Python syntax before running
- Using big-endian byte order for all multi-byte integers
- Calculating string lengths from serial types
- Handling truncation with bounds checking
After recovery:
- Verified cell count matches expectation
- Validated string patterns if known
- Checked numeric value ranges
- Confirmed output format matches requirements