name	sqlite-db-truncate
description	Guidance for recovering data from corrupted or truncated SQLite database files through binary analysis and manual parsing. This skill applies when working with damaged SQLite databases that cannot be opened with standard tools, particularly when corruption is due to binary truncation, incomplete writes, or filesystem errors.

SQLite Truncated Database Recovery

This skill provides systematic approaches for recovering data from SQLite database files that have been corrupted through binary truncation. It emphasizes understanding the SQLite file format before attempting recovery and avoiding common pitfalls that lead to multiple failed iterations.

When to Use This Skill

This skill applies when:

A SQLite database file cannot be opened with standard sqlite3 commands
The database error indicates corruption or malformed data
File size is smaller than expected (suggesting truncation)
Standard recovery tools like .recover command fail
Manual binary parsing of SQLite page structure is required

Initial Assessment Strategy

Before writing any recovery code, perform a thorough analysis of the corrupted file:

Step 1: Examine File Characteristics

# Check file size and basic properties
ls -lh database.db
file database.db

# Create hex dump for analysis
hexdump -C database.db | head -100

Key observations to make:

File size: SQLite pages are typically 4096 bytes. Check if size aligns with page boundaries
Magic bytes: Valid SQLite files start with "SQLite format 3\000" (16 bytes)
First byte after header: Identifies page type (0x0d = table leaf page with actual data)

Step 2: Identify Corruption Pattern

Common truncation scenarios:

Header-only file: Only the 100-byte header remains
Missing header: File starts with a data page (first byte is 0x0d, 0x05, 0x0a, or 0x02)
Partial page: File ends mid-page, truncating some cells

If the file lacks the standard "SQLite format 3" magic header but starts with 0x0d, this indicates the file contains only a table leaf page without the database header.

Step 3: Try Standard Tools First

Always attempt standard recovery before manual parsing:

# Check if sqlite3 can read the file
sqlite3 database.db ".schema" 2>&1
sqlite3 database.db "SELECT * FROM sqlite_master" 2>&1

# Try built-in recovery
sqlite3 database.db ".recover" > recovered.sql 2>&1

# Try integrity check
sqlite3 database.db "PRAGMA integrity_check;"

If these fail with "database disk image is malformed" or similar errors, proceed to manual binary parsing.

SQLite Page Structure Overview

Understanding the page structure is essential before writing recovery code.

Table Leaf Page Layout (Page Type 0x0d)

Offset  Size   Description
------  ----   -----------
0       1      Page type (0x0d for table leaf)
1       2      First freeblock offset (big-endian)
3       2      Number of cells on page (big-endian)
5       2      Cell content area start offset (big-endian)
7       1      Fragmented free bytes count
8+      varies Cell pointer array (2 bytes per cell, big-endian)
...            [Gap/free space]
End            Cell data (grows backward from page end)

Cell Structure

Each cell contains a database row:

[Payload size: varint]
[Row ID: varint]
[Header size: varint]
[Serial type 1: varint]
[Serial type 2: varint]
...
[Column 1 value]
[Column 2 value]
...

Varint Encoding

SQLite uses variable-length integers (varints):

Bytes 1-8: Use 7 bits for data, high bit (0x80) indicates continuation
Byte 9: Uses all 8 bits (no continuation)

Serial Types

Serial types indicate how to interpret column data:

Type	Size	Meaning
0	0	NULL
1	1	8-bit signed integer
2	2	16-bit big-endian signed integer
3	3	24-bit big-endian signed integer
4	4	32-bit big-endian signed integer
7	8	IEEE 754 64-bit float (big-endian)
8	0	Integer constant 0
9	0	Integer constant 1
N >= 12, even	(N-12)/2	BLOB
N >= 13, odd	(N-13)/2	Text string (UTF-8)

Example: Serial type 0x21 (33) = text string of length (33-13)/2 = 10 bytes.

Recovery Approach

Build a Single, Modular Script

Avoid creating multiple separate recovery scripts. Instead, build one script iteratively with clear debug output:

import struct
import json

DEBUG = True

def read_varint(data, offset):
    """Read SQLite variable-length integer."""
    value = 0
    for i in range(9):
        if offset + i >= len(data):
            return None, offset
        byte = data[offset + i]
        if i == 8:
            value = (value << 8) | byte
            return value, offset + i + 1
        value = (value << 7) | (byte & 0x7f)
        if (byte & 0x80) == 0:
            return value, offset + i + 1
    return value, offset

def decode_value(data, offset, serial_type):
    """Decode value based on serial type."""
    if serial_type == 0:
        return None, offset
    elif serial_type == 1:
        return struct.unpack('>b', data[offset:offset+1])[0], offset + 1
    elif serial_type == 2:
        return struct.unpack('>h', data[offset:offset+2])[0], offset + 2
    elif serial_type == 4:
        return struct.unpack('>i', data[offset:offset+4])[0], offset + 4
    elif serial_type == 7:
        return struct.unpack('>d', data[offset:offset+8])[0], offset + 8
    elif serial_type == 8:
        return 0, offset
    elif serial_type == 9:
        return 1, offset
    elif serial_type >= 12:
        if serial_type % 2 == 0:
            length = (serial_type - 12) // 2
            return data[offset:offset+length], offset + length
        else:
            length = (serial_type - 13) // 2
            return data[offset:offset+length].decode('utf-8', errors='replace'), offset + length
    return None, offset

Parse Incrementally with Debug Output

Parse one cell completely and verify before processing all cells:

def parse_cell(data, cell_offset, debug=DEBUG):
    """Parse a single cell with detailed debug output."""
    if debug:
        print(f"\nParsing cell at offset {cell_offset} (0x{cell_offset:04x})")

    # Read payload size
    payload_size, offset = read_varint(data, cell_offset)
    if debug:
        print(f"  Payload size: {payload_size}")

    # Read row ID
    row_id, offset = read_varint(data, offset)
    if debug:
        print(f"  Row ID: {row_id}")

    # Read header size
    header_size, header_start = read_varint(data, offset)
    if debug:
        print(f"  Header size: {header_size}")

    # Parse serial types
    serial_types = []
    current = header_start
    header_end = offset + header_size
    while current < header_end:
        st, current = read_varint(data, current)
        serial_types.append(st)

    if debug:
        print(f"  Serial types: {serial_types}")

    # Parse values
    values = []
    for st in serial_types:
        val, current = decode_value(data, current, st)
        values.append(val)

    if debug:
        print(f"  Values: {values}")

    return {'row_id': row_id, 'values': values}

Common Pitfalls and Prevention

Pitfall 1: Not Understanding the Corruption Pattern

Mistake: Assuming the file has a standard SQLite header when it may only contain a data page.

Prevention: Always examine the first few bytes with hexdump. If the file starts with 0x0d instead of "SQLite format 3", the header is missing. Adjust parsing offsets accordingly (no 100-byte header offset needed).

Pitfall 2: Multiple Script Iterations

Mistake: Creating many separate recovery scripts (recover1.py, recover2.py, etc.) based on trial and error.

Prevention:

Read the hex dump thoroughly first and annotate the structure manually
Build one script with debug flags
Reference the SQLite file format specification before coding

Pitfall 3: Reading Strings Beyond Their Boundaries

Mistake: Reading string data without checking the serial type length, resulting in incorrect strings (e.g., "testword052" instead of "testword05").

Prevention: Always calculate string length from serial type: length = (serial_type - 13) // 2. Read exactly that many bytes.

Pitfall 4: Syntax Errors in Generated Code

Mistake: Missing spaces in operators like if48 instead of if 48, or 12and instead of 12 and.

Prevention: Validate syntax before running:

python3 -m py_compile recovery_script.py

Pitfall 5: Wrong Byte Order

Mistake: Reading multi-byte integers with little-endian instead of big-endian.

Prevention: SQLite uses big-endian for all multi-byte integers. Always use struct.unpack('>...', data) with the > prefix.

Pitfall 6: Not Handling Truncation Gracefully

Mistake: Script crashes when encountering truncated data at end of file.

Prevention: Check bounds before every read operation:

def safe_read(data, offset, length):
    if offset + length > len(data):
        return None
    return data[offset:offset+length]

Verification Strategy

Step 1: Validate Cell Count

Compare the number of cells reported in the page header (offset 3-4) with actual cells found.

Step 2: Validate Data Patterns

If expected patterns are known (e.g., words matching "testwordXY"), verify extracted strings match the pattern.

Step 3: Check Value Ranges

Verify extracted numeric values are within expected ranges. Watch for:

Unexpected negative numbers (sign bit interpretation)
Very large numbers (byte order issues)
NaN or infinity for floats

Step 4: Compare with Expected Output Format

Before finalizing output, ensure JSON structure matches requirements:

# Validate output structure
for record in recovered_data:
    assert 'word' in record and 'value' in record
    assert isinstance(record['word'], str)
    assert isinstance(record['value'], (int, float))

Output Generation

Format recovered data according to the required output specification:

def generate_output(recovered_rows, output_path):
    """Format and save recovered data."""
    results = []
    for row in recovered_rows:
        if len(row['values']) >= 2:
            results.append({
                'word': row['values'][0],
                'value': row['values'][1]
            })

    with open(output_path, 'w') as f:
        json.dump(results, f, indent=2)

    print(f"Recovered {len(results)} records to {output_path}")
    return results

Summary Checklist

Before writing recovery code:

Examined file with hexdump to understand corruption extent
Identified whether header is present or missing
Tried standard SQLite tools first
Reviewed SQLite file format specification

During implementation:

Using a single script with debug output (not multiple scripts)
Validated Python syntax before running
Using big-endian byte order for all multi-byte integers
Calculating string lengths from serial types
Handling truncation with bounds checking

After recovery:

Verified cell count matches expectation
Validated string patterns if known
Checked numeric value ranges
Confirmed output format matches requirements

sqlite-db-truncate

Install Skill

SKILL.md