Claude Code Plugins

Community-maintained marketplace

content-ingestion

@boringdata/kurt-demo

Ingest web content into Kurt. Map sitemaps to discover URLs, then fetch content selectively.

Install Skill

  1. Download skill
  2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
  3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: content-ingestion
description: Ingest web content into Kurt. Map sitemaps to discover URLs, then fetch content selectively.

Content Ingestion

Overview

This skill enables efficient web content ingestion with a map-then-fetch workflow. Discover URLs from sitemaps first, review what was found, then selectively download only the content you need. Supports single document fetching and fast parallel batch downloads.

Content is extracted as markdown with metadata (title, author, date, categories) and stored in the sources/ directory.

Quick Start

# 1. Discover URLs from sitemap (fast, no downloads)
kurt content map https://www.anthropic.com

# 1a. OR discover with publish dates (slower, extracts dates from blogrolls)
kurt content map https://docs.getdbt.com/sitemap.xml --discover-dates

# 2. Review what was found
kurt content list --status NOT_FETCHED

# 3. Fetch content (parallel batch)
kurt content fetch --url-prefix https://www.anthropic.com/
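The same three steps can also be scripted from Python using the functions documented in the Python API section at the end of this skill. This is a sketch, not a verified recipe: map_sitemap(url, limit) and fetch_documents_batch(document_ids, max_concurrent) appear in that section, but the return shape of map_sitemap (documents exposing an id) is an assumption.

from kurt.ingest_map import map_sitemap
from kurt.ingest_fetch import fetch_documents_batch

# 1. Discover URLs from the sitemap (no downloads)
docs = map_sitemap("https://www.anthropic.com", limit=100)

# 2. Review what was found before committing to downloads
for doc in docs[:10]:
    print(doc)

# 3. Fetch the discovered documents in parallel
results = fetch_documents_batch(
    document_ids=[doc.id for doc in docs],  # assumption: each document carries an `id`
    max_concurrent=5,
)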

Map-Then-Fetch Workflow

Why two steps?

  • Sitemaps often contain hundreds or thousands of URLs
  • Map step is fast (no downloads) - lets you review before committing
  • Fetch step is slow (downloads + extraction) - run selectively
  • Saves time, bandwidth, and storage

Three-step process:

  1. Map: Discover URLs and create NOT_FETCHED records
  2. Review: Examine discovered URLs using document management commands
  3. Fetch: Download content selectively (single or batch)

Integration with Iterative Source Gathering

When invoked from /create-project or /resume-project, use map-then-fetch to provide a preview:

  1. Map first - Show user what URLs were discovered

    kurt content map https://docs.example.com
    echo "Discovered X URLs. Preview:"
    kurt content list --url-prefix https://docs.example.com --status NOT_FETCHED | head -10
    
  2. Get approval - Ask if user wants to fetch all or selective

    Found 150 URLs from docs.example.com
    
    Preview (first 10):
    1. https://docs.example.com/quickstart
    2. https://docs.example.com/api/authentication
    ...
    
    Fetch all 150 pages? Or selective? (all/selective/cancel)
    
  3. Fetch approved content

    # If all:
    kurt content fetch --url-prefix https://docs.example.com
    
    # If selective (user specifies path pattern):
    kurt content fetch --url-prefix https://docs.example.com/api/
    

This provides Checkpoint 1 (preview) for the iterative source gathering pattern.

Core Operations

Map Sitemap URLs

Discover URLs from sitemaps without downloading content.

# Discover all URLs from sitemap
kurt content map https://www.anthropic.com

# Discover with publish dates from blogrolls (recommended for blogs/docs)
kurt content map https://docs.getdbt.com/sitemap.xml --discover-dates

# Limit discovery (useful for testing)
kurt content map https://example.com --limit 10

# Map and fetch immediately
kurt content map https://example.com --fetch

# Discover dates with custom blogroll limit
kurt content map https://example.com --discover-dates --max-blogrolls 5

# JSON output for scripts
kurt content map https://example.com --output json

What happens:

  • Automatically finds sitemap URLs (checks /sitemap.xml, robots.txt, etc.)
  • Creates database records with NOT_FETCHED status
  • Skips duplicate URLs gracefully
  • Returns list of discovered documents

Example output:

✓ Found 317 URLs from sitemap
  Created: 317 new documents

  ✓ https://www.anthropic.com
     ID: 6203468a | Status: NOT_FETCHED
  ✓ https://www.anthropic.com/news/claude-3-7-sonnet
     ID: bc2bcf48 | Status: NOT_FETCHED

Discovering Publish Dates from Blogrolls

When to use --discover-dates:

  • Content freshness is critical (tutorials, documentation, blog posts)
  • You need to identify outdated content for updates
  • Building a chronological content inventory
  • Working with blogs, changelogs, or release notes

How it works:

  1. Maps sitemap normally (all URLs)
  2. Uses LLM to identify blog indexes, changelogs, release notes pages
  3. Scrapes those pages to extract individual post URLs + dates (see the sketch after this list)
  4. Creates/updates document records with published_date populated
  5. Marks discovered posts with is_chronological=True and discovery_method="blogroll"
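The scraping in step 3 can be pictured roughly as follows. This is a conceptual sketch only, not Kurt's implementation: it assumes a typical blog index that wraps each post in an <article> element with a <time datetime="..."> tag, and uses httpx and BeautifulSoup (illustrative choices, not libraries the skill claims to use) to pull URL/date pairs.

import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_blogroll(blogroll_url: str) -> list[dict]:
    # Download the blogroll/changelog page
    html = httpx.get(blogroll_url, follow_redirects=True).text
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    # Assumption: each post is an <article> containing a link and a <time datetime="...">
    for article in soup.find_all("article"):
        link = article.find("a", href=True)
        time_tag = article.find("time")
        if link and time_tag and time_tag.get("datetime"):
            posts.append({
                "url": urljoin(blogroll_url, link["href"]),
                "published_date": time_tag["datetime"][:10],  # YYYY-MM-DD
            })
    return posts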

Example:

# Discover docs.getdbt.com with date extraction
kurt content map https://docs.getdbt.com/sitemap.xml --discover-dates

# Output shows date discovery:
# ✓ Found 500 URLs from sitemap
# --- Discovering blogroll/changelog pages ---
# Found 8 potential blogroll/changelog pages
#
# Scraping https://docs.getdbt.com/blog...
#   Found 45 posts with dates
# Scraping https://docs.getdbt.com/docs/dbt-versions/...
#   Found 23 posts with dates
#
# ✓ Total documents discovered from blogrolls: 68
#   New: 12 (not in sitemap)
#   Existing: 56 (enriched with dates)

Performance:

  • Regular mapping: ~2-5 seconds (just sitemap parsing)
  • With --discover-dates: ~5-15 minutes (includes LLM analysis + page scraping)
  • Controlled with --max-blogrolls (default: 10 pages)

Benefits:

  • Dates captured upfront (before fetch)
  • Discovers additional posts not in sitemap
  • Enriches existing records with publish dates
  • Enables date-based filtering and relevance tracking

Troubleshooting: Sitemap Discovery Failures

If automatic discovery fails with "No sitemap found," use this fallback workflow:

Tier 1: Try Common Sitemap Paths Directly

Most sites use standard paths. Try these directly (rather than the base URL):

# Standard path (most common)
kurt content map https://docs.getdbt.com/sitemap.xml

# Alternative paths
kurt content map https://docs.getdbt.com/sitemap_index.xml
kurt content map https://docs.getdbt.com/sitemap/sitemap.xml
kurt content map https://docs.getdbt.com/sitemaps/sitemap.xml

Success if: Kurt finds and processes the sitemap.
Next if it fails: Try Tier 2.

Tier 2: Check robots.txt

Use WebFetch to find sitemap URL in robots.txt:

WebFetch URL: https://docs.getdbt.com/robots.txt
Prompt: "Extract all Sitemap URLs from this robots.txt file"

Then try the discovered sitemap URLs:

kurt content map <sitemap-url-from-robots>

Success if: A sitemap URL found in robots.txt works.
Next if it fails: Try Tier 3.

Tier 3: Search for Sitemap

Use WebSearch to discover sitemap location:

WebSearch: "site:docs.getdbt.com sitemap"
WebSearch: "docs.getdbt.com sitemap.xml"

Or check common documentation pages for sitemap links.

Then try the discovered URLs with kurt content map.

Success if: Sitemap found via search.
Next if it fails: Try Tier 4.

Tier 4: Manual URL Collection

If no sitemap exists or the sitemap is inaccessible, collect URLs manually:

Option A: Use WebSearch to find pages

WebSearch: "site:docs.getdbt.com tutorial"
WebSearch: "site:docs.getdbt.com guide"

Option B: Crawl from homepage

  • Use WebFetch on homepage
  • Extract navigation links
  • Add each URL manually

Option C: User provides URL list

  • Ask user for key URLs to ingest
  • Import from CSV or list

Then add URLs manually:

kurt content add https://docs.getdbt.com/page1
kurt content add https://docs.getdbt.com/page2
kurt content add https://docs.getdbt.com/page3
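For longer lists (Option C above), the add_document function from the Python API section can be looped over instead of calling the CLI repeatedly. A sketch; urls.txt is a placeholder file with one URL per line:

from kurt.ingest_fetch import add_document

# Read user-provided URLs (placeholder file, one URL per line)
with open("urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    add_document(url)  # creates a NOT_FETCHED record; fetch later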

Real Example: docs.getdbt.com

# ❌ Automatic discovery fails
kurt content map https://docs.getdbt.com
# Error: No sitemap found

# ✅ Direct sitemap URL works!
kurt content map https://docs.getdbt.com/sitemap.xml --limit 100
# Success: Found 100 URLs from sitemap

Quick Diagnostic Commands

Test if sitemap exists:

# Use WebFetch to check
WebFetch URL: https://example.com/sitemap.xml
Prompt: "Does this sitemap exist? If yes, describe its structure."

Why automatic discovery fails:

  • Anti-bot protection (site blocks trafilatura but not WebFetch)
  • Sitemap not in robots.txt
  • Non-standard sitemap location
  • Dynamic/JavaScript-rendered sitemaps

When to use each tier:

  • Tier 1: Always try first (5 seconds to test)
  • Tier 2: Standard sites with robots.txt (1 minute)
  • Tier 3: Unusual configurations (2-3 minutes)
  • Tier 4: No sitemap or heavily protected sites (ongoing)

Fetch Single Document

Download content for a specific document.

# Fetch by document ID
kurt content fetch 6203468a-e3dc-48f2-8e1f-6e1da34dab05

# Fetch by URL (creates document if needed)
kurt content fetch https://www.anthropic.com/company

What happens:

  • Downloads HTML content
  • Extracts markdown with trafilatura (sketched below)
  • Saves to sources/{domain}/{path}/page.md
  • Updates database: FETCHED status, content metadata
  • Returns document details
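For reference, the fetch-and-extract flow above can be approximated with trafilatura directly. This is an illustrative sketch, not Kurt's implementation: markdown output requires a recent trafilatura release, and the exact sources/ layout may differ from what Kurt produces.

from pathlib import Path
from urllib.parse import urlparse

from trafilatura import fetch_url, extract

def fetch_to_sources(url: str, root: str = "sources") -> Path:
    html = fetch_url(url)                                          # download HTML (None on failure)
    markdown = extract(html, output_format="markdown") if html else None
    parsed = urlparse(url)
    rel = parsed.path.strip("/") or "index"                        # e.g. "news/claude-3-7-sonnet"
    out_path = Path(root) / parsed.netloc / f"{rel}.md"            # sources/{domain}/{path}.md
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(markdown or "", encoding="utf-8")
    return out_path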

Batch Fetch Documents

Download multiple documents in parallel (5-10x faster than sequential).

# Fetch all from domain
kurt content fetch --url-prefix https://www.anthropic.com/

# Fetch all blog posts
kurt content fetch --url-contains /blog/

# Fetch everything NOT_FETCHED
kurt content fetch --all

# Increase parallelism (default: 5)
kurt content fetch --url-prefix https://example.com/ --max-concurrent 10

# Retry failed documents
kurt content fetch --status ERROR --url-prefix https://example.com/

What happens:

  • Fetches documents concurrently (default: 5 parallel)
  • Uses async httpx for fast downloads (see the sketch after this list)
  • Extracts metadata: title, author, date, categories, language
  • Stores content fingerprint for deduplication
  • Updates all document records in batch
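The concurrency pattern behind batch fetch (bounded parallelism with async httpx) looks roughly like the following. This is a minimal sketch, not Kurt's code; the URLs, timeout, and concurrency limit are placeholders.

import asyncio
import httpx

async def fetch_all(urls: list[str], max_concurrent: int = 5) -> dict[str, str]:
    semaphore = asyncio.Semaphore(max_concurrent)
    results: dict[str, str] = {}

    async def fetch_one(client: httpx.AsyncClient, url: str) -> None:
        async with semaphore:  # at most max_concurrent downloads in flight
            resp = await client.get(url, follow_redirects=True, timeout=30)
            resp.raise_for_status()
            results[url] = resp.text

    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(fetch_one(client, u) for u in urls))
    return results

# Example: asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))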

Performance:

  • Sequential: ~2-3 seconds per document
  • Parallel (5 concurrent): ~0.4-0.6 seconds per document
  • Example: 82 documents in ~10 seconds vs ~3 minutes

File structure after fetch:

sources/
└── www.anthropic.com/
    ├── news/
    │   └── claude-3-7-sonnet.md
    └── company.md

Alternative: Manual URL Addition

When sitemap discovery fails or you want to add specific URLs.

Add Single URLs

# Add URL without fetching
kurt content add https://example.com/page1
kurt content add https://example.com/page2

# Then fetch when ready
kurt content fetch https://example.com/page1

Direct Fetch (Add + Fetch)

Create document record and fetch content in one step.

# Creates the document if it doesn't exist, then fetches
kurt content fetch https://example.com/specific-page

WebFetch Fallback (When Kurt Fetch Fails)

If kurt content fetch fails due to anti-bot protection or other issues, use WebFetch as a fallback with automatic import.

Workflow

  1. Attempt Kurt fetch first (creates ERROR record if it fails)
  2. Use WebFetch to retrieve content with metadata
  3. Save with YAML frontmatter
  4. Auto-import hook handles the rest

WebFetch with Metadata Extraction

When using WebFetch, extract FULL metadata to preserve in the file:

Use WebFetch with this prompt:
"Extract ALL metadata from this page including:
- Full page title
- Description / meta description
- Author(s)
- Published date or last modified date
- Any structured data or meta tags
- The complete page content as markdown

Format the response to clearly show:
1. All metadata fields found
2. The markdown content"

Save Content with Frontmatter

Save the fetched content with YAML frontmatter:

---
title: "Full Page Title from WebFetch"
url: https://example.com/page
description: "Meta description or summary"
author: "Author Name or Organization"
published_date: "2025-10-22"
last_modified: "2025-10-22"
fetched_via: WebFetch
fetched_at: "2025-10-23"
---

# Page Content

[Markdown content from WebFetch...]

Auto-Import Process

Once the file is saved to sources/, the PostToolUse hook automatically:

  1. Detects the new .md file
  2. Finds the ERROR record in Kurt DB
  3. Updates status to FETCHED
  4. Parses frontmatter and populates metadata fields (a parsing sketch follows below)
  5. Links the file to the database record
  6. Runs metadata extraction (kurt index)
  7. Shows confirmation message

Result: File is fully indexed in Kurt with proper title, author, dates, and description!
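The hook's internals are not shown here, but the frontmatter parsing it depends on can be sketched with PyYAML. This is an illustration, not the actual hook code; the file path is the example used further below.

from pathlib import Path
import yaml

def read_frontmatter(md_path: str) -> tuple[dict, str]:
    # Split the leading "---" YAML block from the markdown body
    text = Path(md_path).read_text(encoding="utf-8")
    if text.startswith("---"):
        _, fm, body = text.split("---", 2)
        return yaml.safe_load(fm) or {}, body.lstrip()
    return {}, text

meta, body = read_frontmatter("sources/docs.example.com/guide.md")
print(meta.get("title"), meta.get("published_date"), meta.get("fetched_via"))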

Example: Complete WebFetch Fallback

# 1. Kurt fetch fails
kurt content fetch https://docs.example.com/guide
# → Creates ERROR record

# 2. Use WebFetch (Claude does this automatically)
# Extracts metadata + content

# 3. Save with frontmatter to sources/
# File: sources/docs.example.com/guide.md

# 4. Auto-import hook triggers
# → Updates ERROR record to FETCHED
# → Populates title, author, date from frontmatter
# → Links file to database

# 5. Verify
kurt content get-metadata <doc-id>
# Shows proper title, metadata, FETCHED status

Benefits of WebFetch with Frontmatter

  • ✅ Preserves all page metadata (title, author, dates)
  • ✅ Automatic import via hook
  • ✅ No manual database updates needed
  • ✅ Content is queryable and searchable
  • ✅ Metadata extraction works same as native fetch
  • ✅ Transparent fallback - just works!

Advanced Usage

For custom extraction behavior beyond the CLI, use the trafilatura Python library directly.

Custom Crawling

Control crawl depth, URL patterns, language filters, and more.

# See scripts/advanced_crawl_and_import.py
from trafilatura.spider import focused_crawler

homepage = "https://example.com"

# focused_crawler returns a (to_visit, known_links) tuple
to_visit, known_links = focused_crawler(
    homepage,
    max_seen_urls=100,
    max_known_urls=50
)

Trafilatura Crawls Documentation

Custom Extraction Settings

Fine-tune extraction: precision vs recall, include comments, handle tables.

# See scripts/advanced_fetch_custom_extraction.py
from trafilatura import fetch_url, extract

html = fetch_url("https://example.com/page")

content = extract(
    html,
    include_comments=False,
    include_tables=True,
    favor_precision=True
)

Trafilatura Core Functions

Custom Extraction Config

Configure timeouts, minimum text size, date extraction.

# See scripts/custom_extraction_config.py
from trafilatura.settings import use_config

config = use_config()
config.set('DEFAULT', 'MIN_EXTRACTED_SIZE', '500')

Trafilatura Settings

Quick Reference

| Task | Command | Performance |
|------|---------|-------------|
| Map sitemap | kurt content map <url> | Fast (no downloads) |
| Map with dates | kurt content map <url> --discover-dates | ~5-15 min (LLM scraping) |
| Fetch single | kurt content fetch <id\|url> | ~2-3s per doc |
| Batch fetch | kurt content fetch --url-prefix <url> | ~0.4-0.6s per doc |
| Add URL | kurt content add <url> | Instant |
| Review discovered | kurt content list --status NOT_FETCHED | Instant |
| Retry failures | kurt content fetch --status ERROR | Varies |

Python API

# URL Discovery
from kurt.ingest_map import (
    map_sitemap,              # Discover URLs from sitemap
    map_blogrolls,            # Discover from blogroll/changelog pages
    identify_blogroll_candidates,  # Find potential blogroll pages
    extract_chronological_content,  # Extract posts with dates
)

# Content Fetching
from kurt.ingest_fetch import (
    add_document,             # Add single URL
    fetch_document,           # Fetch single document
    fetch_documents_batch,    # Batch fetch (async parallel)
)

# Map sitemap
docs = map_sitemap("https://example.com", limit=100)

# Add document
doc = add_document("https://example.com/page")

# Fetch single
result = fetch_document(document_id="abc-123")

# Batch fetch
results = fetch_documents_batch(
    document_ids=["abc-123", "def-456"],
    max_concurrent=10
)

Troubleshooting

| Issue | Solution |
|-------|----------|
| "No sitemap found" | See detailed guide above: try direct sitemap URLs (.../sitemap.xml), check robots.txt, or use WebSearch. Full 4-tier fallback documented in the "Troubleshooting: Sitemap Discovery Failures" section. |
| Slow batch fetch | Increase --max-concurrent (default: 5, try 10) |
| Extraction quality low | See advanced extraction scripts for custom settings |
| Duplicate content | Kurt automatically deduplicates using content hashes |
| Rate limiting | Reduce --max-concurrent or add delays |
| Content fetch fails | Use WebFetch fallback with frontmatter (see "WebFetch Fallback" section) |

Next Steps