| name | corpus-investigation |
| description | Systematically investigate large corpus sections (100GB+) using stratified sampling, pattern recognition, and computational verification. Produces comprehensive section analyses with metadata schemas, chunking strategies, and RAG integration recommendations. Use when analyzing large datasets, investigating archive structures, studying corpus organization, conducting investigation and study tasks, or documenting dataset characteristics for RAG pipeline design. |
| allowed-tools | Read, Grep, Glob, Bash, Write |
# Corpus Investigation Skill

**Purpose**: Enable systematic, reproducible, token-efficient investigation of large corpus sections to inform RAG architecture design.

**Methodology**: Based on the proven investigation framework used to analyze the 121GB Marxists Internet Archive, achieving 95% token reduction through computational verification and stratified sampling.
## When to Use This Skill
Activate this skill when the user requests:
- "investigate corpus section"
- "analyze archive structure"
- "study dataset organization"
- "document corpus characteristics"
- "investigation and study" (Maoist reference to systematic research)
- Analysis of large file collections (>1GB)
- Metadata schema extraction from document sets
- Chunking strategy recommendations for RAG
- Dataset preparation for knowledge base ingestion
## Core Investigation Framework

Follow this 5-phase methodology for all corpus investigations:

### Phase 1: Reconnaissance (10% of effort)

**Goal**: Understand section scope without deep reading

**Tasks**:

- Read the index page for the section (if exists)
- Run directory structure analysis:
```bash
cd /path/to/section

# Get directory tree with sizes (3 levels deep)
find . -type d -maxdepth 3 | head -50
du -h --max-depth=2 | sort -h | tail -20

# Count files by type
find . -type f -name "*.html" | wc -l
find . -type f -name "*.htm" | wc -l
find . -type f -name "*.pdf" | wc -l

# Get size distribution by subdirectory
du -sh */ 2>/dev/null | sort -h
```

- Identify subsections and document hierarchy
- Calculate total size and file counts

**Output**: Section overview with statistics in markdown format
### Phase 2: Stratified Sampling (40% of effort)

**Goal**: Sample representative files across key dimensions

**Stratification Dimensions**:
- Size-based: Large (>1GB), medium (100MB-1GB), small (<100MB) subsections
- Temporal: Early, mid, late periods (if time-based organization)
- Type-based: HTML vs PDF, different file naming patterns
- Depth-based: Index pages, category pages, content pages
**Sampling Strategy**:

```bash
# Sample large subsections (>1GB) - prioritize
# Read 10-15 files from largest sections

# Sample medium subsections (100MB-1GB)
# Read 5-10 files from mid-size sections

# Sample small subsections (<100MB)
# Read 3-5 files total from small sections

# Sample different time periods (if applicable)
# Find files with year patterns
find /path/to/section -name "*19[0-9][0-9]*" -o -name "*20[0-9][0-9]*" | head -20

# Sample different file depths
find /path/to/section -name "index.htm*" | head -5          # Index pages
find /path/to/section -type f -name "*.htm*" | shuf | head -10   # Random content
```
**Target Sample Size**: 15-25 files total across all dimensions
For each sampled file:
- Extract structure (headings, meta tags, first paragraph)
- Do NOT read full content unless necessary
- Document patterns observed
**Token Optimization**: Extract structure only, not full content

```bash
# When reading HTML, extract structure not content:
# - DOCTYPE and charset
# - All meta tags (name and content)
# - All heading tags (h1-h6)
# - CSS classes used
# - First paragraph only
# - Link patterns (internal, external, anchors)
# - Total word count estimate
```
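A minimal Python sketch of this structure-only read, using stdlib regex rather than a full HTML parser (the `extract_structure` helper and its field names are illustrative, not part of the methodology spec):

```python
import re

def _first(pattern, html, group=0):
    m = re.search(pattern, html, re.I | re.S)
    return m.group(group) if m else None

def extract_structure(path):
    """Pull structural signals from an HTML file without keeping the full content."""
    html = open(path, encoding="utf-8", errors="replace").read()
    return {
        "doctype": _first(r"<!DOCTYPE[^>]*>", html),
        "charset": _first(r'charset=["\']?([\w-]+)', html, group=1),
        "meta": dict(re.findall(r'<meta name="([^"]+)" content="([^"]*)"', html, re.I)),
        "headings": re.findall(r"<h[1-6][^>]*>.*?</h[1-6]>", html, re.I | re.S)[:20],
        "css_classes": sorted(set(re.findall(r'class="([^"]+)"', html))),
        "first_paragraph": (_first(r"<p[^>]*>(.*?)</p>", html, group=1) or "")[:500],
        "link_count": len(re.findall(r'href="', html)),
        "word_estimate": len(re.sub(r"<[^>]+>", " ", html).split()),
    }
```

Keeping only these fields holds each sampled file to a few hundred tokens instead of the full document.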
### Phase 3: Pattern Verification (30% of effort)

**Goal**: Verify that patterns observed in samples hold across the entire section

Use computational tools (grep/find), NOT exhaustive reading.

**Verification Commands**:
```bash
# 1. Meta tag consistency
grep -r '<meta name="author"' /path/to/section | wc -l
grep -r '<meta name="description"' /path/to/section | wc -l
grep -roh '<meta name="[^"]*"' /path/to/section | sed 's/<meta name="\([^"]*\)".*/\1/' | sort | uniq -c | sort -rn

# 2. CSS class usage patterns
grep -roh 'class="[^"]*"' /path/to/section | sort | uniq -c | sort -rn | head -30

# 3. Link patterns
grep -roh 'href="[^"]*"' /path/to/section | head -100
grep -roh 'href="#[^"]*"' /path/to/section | sort | uniq -c | sort -rn | head -20

# 4. File naming conventions
find /path/to/section -type f -name "*.htm*" | sed 's/.*\///' | sort | uniq -c | sort -rn | head -30

# 5. Year/date patterns in filenames
find /path/to/section -name "*.htm*" -o -name "*.pdf" | grep -oE '(19|20)[0-9]{2}' | sort | uniq -c

# 6. DOCTYPE declarations
grep -roh '<!DOCTYPE[^>]*>' /path/to/section | sort | uniq -c

# 7. Character encoding
grep -roh 'charset=[^"]*' /path/to/section | sort | uniq -c | sort -rn
```
**Document Confidence Levels**:
- 100% = Universal pattern (all files)
- 90-99% = Standard pattern (rare exceptions)
- 75-89% = Common pattern (notable exceptions)
- 50-74% = Frequent pattern (not standard)
- <50% = Occasional pattern
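When writing up Phase 3 results, a small helper can map raw coverage numbers onto these bands (a sketch; `confidence_label` is a hypothetical name, and the thresholds mirror the list above):

```python
def confidence_label(matching: int, total: int) -> str:
    """Map pattern coverage to the confidence bands used in the analysis document."""
    pct = 100 * matching / total if total else 0
    if pct == 100:
        return f"{pct:.0f}% - universal pattern"
    if pct >= 90:
        return f"{pct:.0f}% - standard pattern (rare exceptions)"
    if pct >= 75:
        return f"{pct:.0f}% - common pattern (notable exceptions)"
    if pct >= 50:
        return f"{pct:.0f}% - frequent pattern (not standard)"
    return f"{pct:.0f}% - occasional pattern"
```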
### Phase 4: Edge Case Analysis (10% of effort)

**Goal**: Identify exceptions and unusual patterns

**Sample These Outliers**:
```bash
# 1. Largest files (top 5)
find /path/to/section -type f \( -name "*.htm*" -o -name "*.pdf" \) | xargs ls -lh | sort -k5 -hr | head -5

# 2. Smallest files (bottom 5)
find /path/to/section -type f \( -name "*.htm*" -o -name "*.pdf" \) | xargs ls -lh | sort -k5 -h | head -5

# 3. Files with unusual names (no standard patterns)
find /path/to/section -type f -name "*.htm*" | grep -v 'index\|chapter\|ch[0-9]\|[0-9]\{4\}'

# 4. Deepest nested files
find /path/to/section -type f -name "*.htm*" | awk '{print gsub(/\//,"/"), $0}' | sort -rn | head -10

# 5. Files without meta tags (if meta tags expected)
for file in $(find /path/to/section -name "*.htm*" | head -100); do
  grep -q '<meta name' "$file" || echo "$file"
done
```
Read 3-5 edge case files to understand why they differ.

### Phase 5: Synthesis (10% of effort)

**Goal**: Compile findings into an actionable specification

Produce a Section Analysis Document following this structure:
# {Section Name} Analysis
**Section Path**: /absolute/path/to/section/
**Total Size**: {X}GB
**File Count**: {N} HTML, {M} PDFs
**Investigation Date**: YYYY-MM-DD
---
## 1. Executive Summary
[2-3 paragraph overview: purpose, size, key findings]
## 2. Directory Structure
[Hierarchical organization with sizes]
## 3. File Type Analysis
### HTML Files
- Count: {N}
- Naming conventions: [patterns]
- Size range: {min} - {max}
### PDF Files
- Count: {M}
- OCR quality: [assessed from samples]
- Purpose: [scanned books/periodicals/etc]
## 4. HTML Structure Patterns
### DOCTYPE and Encoding
[Common declarations]
### Meta Tag Schema
| Meta Tag | Occurrence % | Example |
|----------|--------------|---------|
| author | 95% | "Marx" |
| ... | ... | ... |
### Semantic CSS Classes
[Classes with semantic meaning]
## 5. Metadata Extraction Schema
[Define 5-layer metadata schema]
**Layer 1: File System**
- section, subsection, author (from path)
- year, chapter (from filename)
**Layer 2: HTML Meta Tags**
- title, author, description, keywords
**Layer 3: Breadcrumb Navigation**
- breadcrumb trail, category path
**Layer 4: Semantic CSS Classes**
- curator context, provenance, annotations
**Layer 5: Content-Derived**
- word_count, has_footnotes, has_images, reading_time
**Example Metadata Record**:
```json
{
  "section": "...",
  "author": "...",
  "title": "...",
  "year": 1867,
  "word_count": 8500,
  ...
}
```
## 6. Linking Architecture

- Internal link patterns
- Cross-section references
- Anchor/footnote system
- Link topology (hierarchical/cross-reference/citation)

## 7. Unique Characteristics

[What makes this section different?]

## 8. Chunking Recommendations

### Recommended Chunk Boundary

[Paragraph / Section / Article / Entry]

### Rationale

[Why this boundary is optimal for RAG]

### Chunk Size Estimates

[Expected token counts]

### Hierarchical Context Preservation

[How to preserve work → chapter → section hierarchy]

## 9. RAG Integration Strategy

- Processing pipeline steps
- Metadata enrichment approach
- Indexing strategy (vector DB collections, filters)
- Expected query patterns
- Cross-section integration

## 10. Edge Cases and Exceptions

- Unusual files
- Missing metadata
- Broken links
- Encoding issues

## 11. Processing Risks and Mitigations

| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| ... | ... | ... | ... |

## 12. Sample Files Analyzed

[List all files read during investigation]

- /path/to/file1.htm - Index page, navigation structure
- /path/to/file2.htm - Typical content, has footnotes
- ...

## 13. Pattern Verification Commands

[Document exact bash commands used]

```bash
# Example verification commands
grep -r '<meta name="author"' /path | wc -l
```

## 14. Recommendations

- Priority: HIGH / MEDIUM / LOW for RAG inclusion
- Processing complexity: SIMPLE / MODERATE / COMPLEX
- Special requirements: OCR / translation / cleanup needed
- Integration dependencies: [other sections required first]

## 15. Open Questions

- [Question requiring user decision]
- [Question needing further investigation]
---
## Token Efficiency Tactics
**CRITICAL**: Use these tactics to achieve 95% token reduction
### 1. Computational Tools Over Reading
**DON'T**: Read 100 files to find patterns (500k tokens)
**DO**: Use grep to extract patterns + read 5 samples (25k tokens)
### 2. Strategic Sampling Over Exhaustive Coverage
**DON'T**: Read every author's archive (1M+ tokens)
**DO**: Read 3 authors (large/medium/small) + verify with find (50k tokens)
### 3. Extract Structure, Not Content
**DON'T**: Read entire documents
**DO**: Extract headings, meta tags, first paragraph only
**Implementation**:
- Read only `<head>` section for meta tags
- Extract only `<h1>-<h6>` tags for structure
- Read only first `<p>` for content sample
- Count links/images, don't read them
### 4. Aggregate Statistics Over Individual Analysis
**DON'T**: Describe each file individually
**DO**: Compute aggregate statistics and describe patterns
**Example**:
Instead of:

- "file1.htm has 3 meta tags"
- "file2.htm has 3 meta tags"
- "file3.htm has 2 meta tags"

Write:

- "95% of files have 3 meta tags (author, description, classification)"
- "5% are missing description tag"
### 5. Reference Examples, Don't Reproduce
**DON'T**: Include full HTML of 10 example files (50k tokens)
**DO**: Include 1-2 representative examples + reference paths (5k tokens)
### 6. Use Shell Commands for Verification
**Always prefer**:
- `grep -r` for pattern extraction
- `find` for file discovery
- `wc -l` for counting
- `sort | uniq -c` for frequency analysis
- `du -sh` for size calculations
**Over**:
- Reading files individually
- Manual counting
- Exhaustive sampling
---
## Metadata Extraction Protocol
**Extract metadata in 5 layers** for comprehensive documentation:
### Layer 1: File System Metadata
Extract from file paths using regex:
```python
# Example patterns to extract:
# - Section from /path/{section}/...
# - Author from /archive/{author}/...
# - Year from .../{year}/... or filename
# - Work slug from path structure
# - Chapter from ch##.htm filenames
```
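A minimal sketch of those extractions, assuming a path layout like `/archive/{author}/works/{year}/ch01.htm` (the layout, the `filesystem_metadata` helper, and its field names are illustrative; adjust the regexes to the actual section):

```python
import re
from pathlib import Path

def filesystem_metadata(path: str) -> dict:
    parts = Path(path).parts
    year = re.search(r"(18|19|20)\d{2}", path)
    chapter = re.search(r"ch(\d+)\.html?$", Path(path).name)
    return {
        "section": parts[1] if len(parts) > 1 else None,   # e.g. "archive"
        "author": parts[2] if len(parts) > 2 else None,     # e.g. "marx"
        "year": int(year.group(0)) if year else None,
        "chapter": int(chapter.group(1)) if chapter else None,
        "work_slug": Path(path).parent.name,
    }

# filesystem_metadata("/archive/marx/works/1867/ch01.htm")
# -> {"section": "archive", "author": "marx", "year": 1867, "chapter": 1, "work_slug": "1867"}
```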
### Layer 2: HTML Meta Tags

Parse `<meta>` tags in `<head>`:

```bash
# Extract all meta tag names
grep -roh '<meta name="[^"]*"' /path | sed 's/<meta name="\([^"]*\)".*/\1/' | sort | uniq -c

# Extract specific meta tag
grep -roh '<meta name="author" content="[^"]*"' /path | head -20
```

### Layer 3: Breadcrumb Navigation

Extract breadcrumb trails (usually `<p class="breadcrumb">` or similar):

```bash
# Find breadcrumb patterns
grep -r 'class="breadcrumb"' /path | head -10
grep -r 'class="path"' /path | head -10
grep -r '<nav' /path | head -10
```

### Layer 4: Semantic CSS Classes

Identify CSS classes with semantic meaning:

```bash
# Extract all CSS classes
grep -roh 'class="[^"]*"' /path | sed 's/class="\([^"]*\)".*/\1/' | tr ' ' '\n' | sort | uniq -c | sort -rn | head -30

# Common semantic classes to look for:
# - "context" (curator annotations)
# - "information" (provenance)
# - "quoteb" (block quotes)
# - "fst" (first paragraph)
# - "title" (work titles)
```
### Layer 5: Content-Derived Metadata

Calculate from document content:

- Word count (approximate: `wc -w`)
- Has footnotes (check for anchor links: `href="#"`)
- Has images (check for `<img>` tags)
- Reading time (word_count / 250 words per minute)
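A minimal sketch of these Layer 5 calculations for a single HTML file (the `content_metadata` helper and its field names are illustrative; the 250 wpm figure comes from the list above):

```python
import re

def content_metadata(path: str) -> dict:
    html = open(path, encoding="utf-8", errors="replace").read()
    text = re.sub(r"<[^>]+>", " ", html)   # strip tags for a rough word count
    word_count = len(text.split())
    return {
        "word_count": word_count,
        "has_footnotes": 'href="#' in html,
        "has_images": "<img" in html.lower(),
        "reading_time_min": round(word_count / 250, 1),
    }
```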
## Sample Selection Algorithm

Use this strategy to select representative samples.

For a section with N total files:

1. ALWAYS read the top-level index.htm (if it exists)
2. Identify subsections by size:
   - Large (>1GB): Sample 10-15 files
   - Medium (100MB-1GB): Sample 5-10 files
   - Small (<100MB): Sample 3-5 files total
3. Within each subsection, stratify by:
   - File type (HTML vs PDF)
   - Depth (index vs category vs content pages)
   - Time period (if applicable)
4. Use random sampling within strata:
   `find /path -name "*.htm*" | shuf | head -10`
5. Include edge cases:
   - Largest file
   - Smallest file
   - Unusual naming pattern
   - Deepest nested file

**Target**: 15-25 files total for sections <10GB
**Target**: 25-40 files total for sections >10GB

A Python sketch of this selection logic appears below.
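A minimal sketch of the selection logic under the assumptions above (the `select_samples` helper is hypothetical; size thresholds and per-stratum quotas follow the list):

```python
import random
from pathlib import Path

GB = 1024 ** 3

def select_samples(section: Path, seed: int = 42) -> list[Path]:
    random.seed(seed)  # fixed seed keeps the investigation reproducible
    picks = []

    index = section / "index.htm"
    if index.exists():
        picks.append(index)   # always read the top-level index

    for sub in (d for d in section.iterdir() if d.is_dir()):
        files = [f for f in sub.rglob("*.htm*") if f.is_file()]
        size = sum(f.stat().st_size for f in files)
        quota = 12 if size > GB else 7 if size > 0.1 * GB else 3
        picks.extend(random.sample(files, min(quota, len(files))))

    return picks
```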
## Verification Without Exhaustive Reading

**Principle**: Trust but verify.

After identifying a pattern from samples, verify it holds using computational tools.

### Confidence Level Calculation
```bash
# Pattern: "All files have author meta tag"

# Count total files
total=$(find /path -name "*.htm*" | wc -l)

# Count files with pattern
with_pattern=$(grep -rl '<meta name="author"' /path | wc -l)

# Calculate percentage
echo "Coverage: $with_pattern / $total = $(( 100 * with_pattern / total ))%"
```
### Statistical Sampling for Very Large Sections

For sections >10GB:
```bash
# Get random sample of 1000 files
find /path -name "*.htm*" | shuf -n 1000 > sample_files.txt

# Check pattern in sample
while read file; do
  grep -q '<meta name="author"' "$file" && echo "1" || echo "0"
done < sample_files.txt | awk '{sum+=$1; count++} END {print "Coverage: " sum/count*100 "%"}'
```
## Quality Assurance Checklist

Before completing the investigation, verify:

### Completeness
- Directory structure fully documented with sizes
- File counts accurate (HTML, PDF, other)
- Meta tag schema extracted (all fields identified)
- CSS class inventory complete
- Link patterns documented
- Unique characteristics identified
- Chunking strategy defined with rationale
- RAG integration approach proposed
- Edge cases documented
- Sample files listed with notes
- Verification commands included
- Confidence levels specified for patterns
- Open questions documented
### Reproducibility
- Sampling strategy documented
- Bash commands included (exact commands)
- File paths specified (actual paths, not placeholders)
- Investigation date included
- Corpus version/snapshot documented
### Token Efficiency
- Used grep/find instead of reading where applicable
- Sampled strategically (not exhaustively)
- Aggregated statistics (not individual descriptions)
- Referenced examples (not reproduced unnecessarily)
- Extracted structure only (not full content)
### Actionability
- Metadata schema is code-ready
- Chunking strategy is specific (not vague)
- Processing pipeline defined step-by-step
- Risks identified with mitigations
- Priority level assigned (HIGH/MEDIUM/LOW)
## Common Bash Commands Reference

### File Discovery

```bash
# Find all HTML files
find /path -type f -name "*.htm*"

# Find all PDFs
find /path -type f -name "*.pdf"

# Find files modified in last 30 days
find /path -type f -mtime -30

# Find files larger than 1MB
find /path -type f -size +1M

# Find files by depth
find /path -maxdepth 2 -type f
```
### Pattern Extraction

```bash
# Extract meta tag names
grep -roh '<meta name="[^"]*"' /path | sed 's/<meta name="\([^"]*\)".*/\1/' | sort | uniq -c | sort -rn

# Extract CSS classes
grep -roh 'class="[^"]*"' /path | sort | uniq -c | sort -rn

# Extract link patterns
grep -roh 'href="[^"]*"' /path | head -100

# Extract year patterns
find /path -name "*.htm*" | grep -oE '(19|20)[0-9]{2}' | sort | uniq -c

# Extract DOCTYPE
grep -roh '<!DOCTYPE[^>]*>' /path | sort | uniq -c
```
### Statistics

```bash
# Count files by extension
find /path -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn

# Size distribution by directory
du -sh /path/*/ | sort -h

# Average file size
find /path -type f -name "*.htm*" | xargs ls -l | awk '{sum+=$5; count++} END {print sum/count/1024 " KB average"}'

# Word count distribution
find /path -name "*.htm*" | head -100 | xargs wc -w | sort -n
```
### Sampling

```bash
# Random sample of 10 files
find /path -name "*.htm*" | shuf | head -10

# Stratified sample (first, middle, last)
files=$(find /path -name "*.htm*" | sort)
total=$(echo "$files" | wc -l)
echo "$files" | sed -n "1p; $(($total/2))p; ${total}p"

# Sample by size (smallest, median, largest)
sized=$(find /path -type f -name "*.htm*" | xargs ls -lh | sort -k5 -h)
count=$(echo "$sized" | wc -l)
echo "$sized" | sed -n "1p; $(($count/2))p; ${count}p"
```
## Investigation Tips

### Do's
- Start with reconnaissance - Understand scope before sampling
- Use stratified sampling - Cover all dimensions (size, time, type, depth)
- Verify patterns computationally - grep/find are your friends
- Document confidence levels - Specify % of files matching patterns
- Extract structure, not content - Save tokens
- Reference file paths - Use actual paths, enable reproduction
- Include bash commands - Document what you ran
- Identify edge cases - Sample outliers explicitly
- Be specific in recommendations - "Chunk by paragraph" not "chunk appropriately"
- Ask clarifying questions - If scope is unclear, ask user
### Don'ts
- Don't read exhaustively - Sample strategically instead
- Don't include full HTML - Extract structure only
- Don't describe individual files - Aggregate statistics instead
- Don't make assumptions - Verify patterns with grep/find
- Don't skip edge cases - Outliers reveal important variations
- Don't use vague language - Be specific and actionable
- Don't forget verification - Every pattern needs confidence level
- Don't ignore errors - Document broken files, encoding issues
- Don't reinvent chunking - Use proven strategies from methodology
- Don't exceed token budget - If approaching limit, increase aggregation
## Example Investigation Workflow

Here's a concrete example workflow:

**USER**: "Investigate the /data/archive/philosophy section"

**AGENT RESPONSE**:

I'll systematically investigate the philosophy section using the 5-phase methodology.
## Phase 1: Reconnaissance
[Run directory structure commands]
[Document size, file counts, hierarchy]
Section overview:
- Size: 2.3GB
- Files: 1,847 HTML, 234 PDFs
- Structure: Organized by author (45 authors)
- Time span: 1840-1950 (from filenames)
## Phase 2: Stratified Sampling
[Select 20 representative files across:]
- Large authors: Hegel (400MB), Kant (300MB), Marx (250MB)
- Medium authors: Nietzsche (150MB), Spinoza (120MB)
- Small authors: Feuerbach (40MB), Stirner (30MB)
[For each sample, extract structure:]
- Meta tags: author, title, classification
- Headings: h1 (title), h2 (chapters), h3 (sections)
- CSS classes: "title", "quoteb", "context"
- First paragraph: [sample text]
- Links: Internal chapter links, footnote anchors
## Phase 3: Pattern Verification
[Run verification commands:]
```bash
# Meta tag coverage
grep -r '<meta name="author"' /data/archive/philosophy | wc -l
# Result: 1,823 / 1,847 files = 99% coverage

# CSS class distribution
grep -roh 'class="[^"]*"' /data/archive/philosophy | sort | uniq -c | sort -rn
# Result: "quoteb" (8,234), "title" (1,847), "context" (892)
```

Confidence levels:
- Author meta tag: 99% (universal)
- Title meta tag: 95% (standard)
- Classification meta tag: 78% (common)
## Phase 4: Edge Cases
[Sample outliers:]
- Largest file: hegel-phenomenology.htm (2.3MB, 450k words)
- Smallest file: stirner-intro.htm (3KB, 200 words)
- Missing meta tags: 24 files without author tag (early scans)
- Encoding issues: 12 files with mojibake (German umlauts)
## Phase 5: Synthesis

[Write complete section analysis document following template]

**Chunking Recommendation**: Section-level chunks (h3 boundaries)
- Rationale: Philosophical arguments span multiple paragraphs
- Average chunk: 800-1200 words, 1000-1500 tokens
- Preserve hierarchy: author → work → chapter → section
**RAG Integration Strategy**:
- Collection: "philosophy"
- Metadata filters: author, work_title, year, classification
- Expected queries: "What did Kant say about...", "Compare Hegel and Marx on..."
**Priority**: HIGH (core theoretical content)
**Processing Complexity**: MODERATE (encoding issues for German texts)
**Open Questions**:
- Should German texts be translated or kept in original?
- Include secondary literature or only primary sources?
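To make the chunking recommendation in this example concrete, here is a minimal sketch of section-level splitting at `<h3>` boundaries with hierarchical context attached (the `chunk_by_h3` helper and its metadata fields are illustrative, not part of the archive's tooling):

```python
import re

def chunk_by_h3(html: str, author: str, work: str, chapter: str) -> list[dict]:
    """Split a chapter's HTML at <h3> boundaries, keeping author/work/chapter context."""
    pieces = re.split(r"(<h3[^>]*>.*?</h3>)", html, flags=re.S | re.I)
    chunks, current_heading = [], None
    for piece in pieces:
        if re.match(r"<h3", piece, re.I):
            current_heading = re.sub(r"<[^>]+>", "", piece).strip()
            continue
        text = re.sub(r"<[^>]+>", " ", piece).strip()
        if len(text.split()) < 50:   # skip navigation fragments and empty slices
            continue
        chunks.append({
            "text": text,
            "author": author,
            "work": work,
            "chapter": chapter,
            "section_heading": current_heading,
        })
    return chunks
```

Each chunk carries its author → work → chapter → section trail, so retrieval results can be cited with full provenance.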
---
## Reference Materials
**Full Methodology Specification**:
See `/home/user/projects/marxist-rag/docs/corpus-analysis/00-investigation-methodology-spec.md`
**Example Section Analysis**:
See `/home/user/projects/marxist-rag/docs/corpus-analysis/01-archive-section-analysis.md`
**Corpus Overview**:
See `/home/user/projects/marxist-rag/docs/corpus-analysis/00-corpus-overview.md`
---
## Success Criteria
A successful corpus investigation produces:
1. **Complete section analysis document** following the template
2. **Metadata extraction schema** ready for implementation
3. **Chunking strategy** with clear rationale
4. **RAG integration approach** with concrete steps
5. **Confidence levels** for all identified patterns
6. **Verification commands** enabling reproduction
7. **Edge case documentation** with handling recommendations
8. **Open questions** for user decision
**Token Budget**: 15,000-30,000 tokens per section investigation
**Time Estimate**: 30-60 minutes for AI agent
**Output Size**: 5,000-10,000 token specification document
---
## Investigation Principles Summary
1. **Sample strategically, verify computationally**
2. **Extract structure, not content**
3. **Document patterns, not instances**
4. **Specify confidence levels**
5. **Produce actionable specifications**
**Remember**: The goal is not to read every file, but to understand the corpus structure well enough to design an optimal RAG processing pipeline.
Use computational tools (grep, find, wc, sort, uniq) to verify patterns across thousands of files without reading them individually. This achieves 95% token reduction while maintaining investigative rigor.
---
## Notes
- This methodology was proven on the 121GB Marxists Internet Archive investigation
- Investigations following this framework are reproducible by other AI agents
- Token efficiency tactics are critical for large-scale corpus analysis
- Stratified sampling ensures representative coverage without exhaustive reading
- Computational verification provides confidence without manual checking
- Structured output enables direct implementation of RAG pipelines
**For questions or methodology improvements, see the reference documentation in `/home/user/projects/marxist-rag/docs/corpus-analysis/`**