name	semtools
description	This skill provides semantic search capabilities using embedding-based similarity matching for code and text. Enables meaning-based search beyond keyword matching, with optional document parsing (PDF, DOCX, PPTX) support.
license	MIT

Semtools: Semantic Search

Perform semantic (meaning-based) search across code and documents using embedding-based similarity matching.

Purpose

The semtools skill provides access to Semtools, a high-performance Rust-based CLI for semantic search and document processing. Traditional text search (ripgrep) matches literal strings and regular expressions, and structural search (ast-grep) matches syntax patterns; semtools instead ranks text by how close its embedding is to the embedding of the query.

Key capabilities:

Semantic Search: Find code/text by meaning, not just keywords
Workspace Management: Index large codebases for fast repeated searches
Document Parsing: Convert PDFs, DOCX, PPTX to searchable text (requires API key)

Semtools excels at discovery: finding relevant code when you don't know the exact keywords, function names, or syntax patterns.

When to Use This Skill

Use the semtools skill when you need meaning-based search:

Semantic Code Discovery:

Finding code that implements a concept ("error handling", "data validation")
Discovering similar functionality across different modules
Locating examples of a pattern when you don't know exact names
Understanding what code does without reading everything

Documentation & Knowledge:

Searching documentation by concept, not keywords
Finding related discussions in comments or docs
Discovering similar issues or solutions
Analyzing technical documents (PDFs, reports)

Use Cases:

"Find all authentication-related code" (without knowing function names)
"Show me error handling patterns" (regardless of specific error types)
"Find code similar to this implementation" (semantic similarity)
"Search research papers for 'distributed consensus'" (document search)

Choose semtools over file-search (ripgrep/ast-grep) when:

You know the concept but not the keywords
Exact string matching misses relevant results
You want semantically similar code, not exact matches
Searching across languages or mixed content

Still use file-search when:

You know exact keywords, function names, or patterns
You need structural code matching (ast-grep)
Speed is critical (ripgrep is faster for exact matches)
You're searching for specific symbols or references

Available Commands

Semtools provides three CLI commands you can use via execute_command:

search - Semantic search across code and text files
workspace - Manage workspaces for caching embeddings
parse - Convert documents (PDF, DOCX, PPTX) to searchable text

The search and workspace commands work out of the box in your execution environment. The parse command additionally requires a LlamaParse API key, for example via the LLAMA_CLOUD_API_KEY environment variable.

Core Operations

1. Semantic Search (`search`)

Find files and code sections by semantic meaning:

# Basic semantic search
search "authentication logic" src/

# Search with more context (5 lines before/after)
search "error handling" --n-lines 5 src/

# Get more results (default: 3)
search "database queries" --top-k 10 src/

# Control the distance threshold (0.0-1.0, lower = stricter)
search "API endpoints" --max-distance 0.4 src/

Parameters:

--n-lines N: Show N lines of context around matches (default: 3)
--top-k K: Return top K most similar matches (default: 3)
--max-distance D: Maximum embedding distance (0.0-1.0, default: 0.3)
-i: Case-insensitive matching
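
When the command is assembled from a script, the parameters above can be validated before anything is run. The following is a minimal sketch that uses only the flags listed in this section; the helper name `build_search_command` is hypothetical and not part of Semtools.

```python
from typing import List, Sequence


def build_search_command(
    query: str,
    paths: Sequence[str],
    n_lines: int = 3,
    top_k: int = 3,
    max_distance: float = 0.3,
    ignore_case: bool = False,
) -> List[str]:
    """Assemble an argv list for `search` from the documented parameters."""
    if not 0.0 <= max_distance <= 1.0:
        raise ValueError("max_distance must be between 0.0 and 1.0")
    if n_lines < 0 or top_k < 1:
        raise ValueError("n_lines must be >= 0 and top_k must be >= 1")
    cmd = [
        "search", query,
        "--n-lines", str(n_lines),
        "--top-k", str(top_k),
        "--max-distance", str(max_distance),
    ]
    if ignore_case:
        cmd.append("-i")
    cmd.extend(paths)  # files or directories to search
    return cmd
```

Passing the resulting list to `subprocess.run(cmd, capture_output=True, text=True)` avoids shell quoting problems with multi-word queries.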

Output format (the value in parentheses is the embedding distance, so lower means a closer match):

Match 1 (similarity: 0.12)
File: src/auth/handlers.py
Lines: 42-47
----
def authenticate_user(username: str, password: str) -> Optional[User]:
    """Authenticate user credentials against database."""
    user = get_user_by_username(username)
    if user and verify_password(password, user.password_hash):
        return user
    return None
----

Match 2 (similarity: 0.18)
File: src/middleware/auth.py
...

2. Workspace Management (`workspace`)

For large codebases, create workspaces to cache embeddings and enable fast repeated searches:

# Create/activate workspace
workspace use my-project

# Set workspace via environment variable
export SEMTOOLS_WORKSPACE=my-project

# Index files in workspace (workspace auto-detected from env var)
search "query" src/

# Check workspace status
workspace status

# Clean up old workspaces
workspace prune

Benefits:

Fast repeated searches: Embeddings cached, no re-computation
Large codebases: IVF_PQ indexing for scalability
Session persistence: Maintain context across multiple searches

When to use workspaces:

Searching the same codebase multiple times
Very large projects (1000+ files)
Interactive exploration sessions
CI/CD pipelines with repeated searches
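
The benefit of cached embeddings can be illustrated with a toy cache. This is a conceptual sketch only, not how Semtools stores its index: `EmbeddingCache` and `fake_embed` are hypothetical, and keying the cache by a hash of the text is an assumption made for illustration.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingCache:
    """Toy cache: recompute an embedding only when the text changes."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._store: Dict[str, Vector] = {}
        self.computed = 0  # how many times the expensive embed function ran

    def get(self, text: str) -> Vector:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed(text)
            self.computed += 1
        return self._store[key]


def fake_embed(text: str) -> Vector:
    """Stand-in for a real embedding model (hypothetical, for illustration)."""
    return [float(len(text)), float(text.count(" "))]
```

Looking up the same text twice runs `fake_embed` once, which is the effect a workspace has on repeated searches over unchanged files.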

3. Document Parsing (`parse`) ⚠️ Requires API Key

Convert documents to searchable markdown (requires LlamaParse API key):

# Parse PDFs to markdown
parse research_papers/*.pdf

# Parse Word documents
parse reports/*.docx

# Parse presentations
parse slides/*.pptx

# Parse and pipe to search
parse docs/*.pdf | xargs search "neural networks"

Supported formats:

PDF (.pdf)
Word (.docx)
PowerPoint (.pptx)

Configuration:

# Via environment variable
export LLAMA_CLOUD_API_KEY="llx-..."

# Via config file
cat > ~/.parse_config.json << EOF
{
  "api_key": "llx-...",
  "max_concurrent_requests": 10,
  "timeout_seconds": 3600
}
EOF

Important: Document parsing is optional. Semantic search works without it.
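
Before calling `parse` from a script, a quick check can confirm that a key is available from either source shown above. A minimal sketch: the function name `parse_key_available` is hypothetical, and it deliberately does not assume which source takes precedence when both are set.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def parse_key_available(
    environ: Mapping[str, str] = os.environ,
    config_path: Optional[Path] = None,
) -> bool:
    """Return True if an API key is present in the environment or the config file."""
    if environ.get("LLAMA_CLOUD_API_KEY"):
        return True
    path = config_path or Path.home() / ".parse_config.json"
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False  # missing, unreadable, or malformed config file
    return isinstance(config, dict) and bool(config.get("api_key"))
```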

Workflow Patterns

Pattern 1: Concept Discovery

When you know what you're looking for conceptually but not by name:

# Step 1: Broad semantic search
search "rate limiting implementation" src/

# Step 2: Review results, refine query
search "throttle requests per user" src/ --top-k 10

# Step 3: Use ripgrep for exact follow-up
rg "RateLimiter" --type py src/
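
The hand-off from step 2 to step 3 can be partly automated by pulling definition names out of the matched snippets. The sketch below assumes the snippets are Python source; `definition_names` and `ripgrep_pattern` are hypothetical helpers.

```python
import re
from typing import List

DEFINITION = re.compile(
    r"^\s*(?:async\s+def|def|class)\s+([A-Za-z_]\w*)", re.MULTILINE
)


def definition_names(snippet: str) -> List[str]:
    """Collect Python function and class names defined in a matched snippet."""
    seen: List[str] = []
    for name in DEFINITION.findall(snippet):
        if name not in seen:
            seen.append(name)
    return seen


def ripgrep_pattern(names: List[str]) -> str:
    """Join names into a word-bounded alternation suitable for `rg`."""
    if not names:
        raise ValueError("no names to search for")
    return r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
```

The resulting pattern can be passed to ripgrep directly, for example `rg '\b(?:RateLimiter|allow)\b' --type py src/`.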

Pattern 2: Similar Code Finder

When you want to find code similar to a reference implementation:

# Step 1: Extract key concepts from reference code
# [Read example_auth.py and identify key concepts]

# Step 2: Search for similar implementations
search "user authentication with JWT tokens" src/

# Step 3: Compare implementations
# [Review semantic matches to find similar approaches]

Pattern 3: Documentation Search

When researching concepts in documentation or comments:

# Search code comments semantically
search "thread safety guarantees" src/ --n-lines 10

# Search markdown documentation
search "deployment best practices" docs/

# Combined search across code and documentation
search "performance optimization" src/ docs/ --top-k 20

Pattern 4: Cross-Language Search

When searching for concepts across different languages:

# Semantic search works across languages
search "connection pooling" src/

# May find:
# - Java: "ConnectionPool manager"
# - Python: "database connection reuse"
# - Go: "pool of persistent connections"
# All semantically related despite different terminology

Pattern 5: Document Analysis (with API key)

When analyzing PDFs or documents:

# Step 1: Parse documents to markdown (parse prints the paths of the converted files)
# Step 2: Search the converted content by passing those paths to search
parse research/*.pdf | xargs search "transformer architecture"

# Step 3: Combine with code search
search "attention mechanism implementation" src/

Integration with file-search

Semtools and file-search (ripgrep/ast-grep) are complementary tools. Use them together for comprehensive search:

Search Strategy Matrix

You Know	Use First	Then Use	Why
Exact keywords	ripgrep	search	Fast exact match, then find similar
Concept only	search	ripgrep	Find relevant code, then search specifics
Function name	ripgrep	search	Find definition, then find similar usage
Code pattern	ast-grep	search	Find structure, then find similar logic
Approximate idea	search	ripgrep + ast-grep	Discover, then drill down
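
For scripted use, the matrix can be encoded as a small lookup. This sketch only transcribes the rows above; the `STRATEGY` table and `plan` function are hypothetical names.

```python
from typing import Dict, List, Tuple

# (first tool, follow-up tools), transcribed from the matrix above.
STRATEGY: Dict[str, Tuple[str, List[str]]] = {
    "exact keywords": ("ripgrep", ["search"]),
    "concept only": ("search", ["ripgrep"]),
    "function name": ("ripgrep", ["search"]),
    "code pattern": ("ast-grep", ["search"]),
    "approximate idea": ("search", ["ripgrep", "ast-grep"]),
}


def plan(you_know: str) -> List[str]:
    """Return the ordered tool sequence for one row of the matrix."""
    first, then = STRATEGY[you_know.strip().lower()]
    return [first, *then]
```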

Layered Search Approach

# Layer 1: Semantic discovery (what's related?)
search "user session management" --top-k 10

# Layer 2: Exact text search (what's the implementation?)
rg "SessionManager|session_store" --type py

# Layer 3: Structural search (how is it used?)
sg --pattern 'session.$METHOD($$$)' --lang python

# Layer 4: Reference tracking (where is it called?)
# [Use serena skill for symbol-level tracking]

Best Practices

1. Start Broad, Then Narrow

Use semantic search for discovery, then narrow with exact search:

# GOOD: Broad semantic discovery first
search "authentication" src/ --top-k 10
# [Review results to learn terminology]
rg "authenticate|verify_credentials" --type py src/

# AVOID: Starting too narrow and missing variations
rg "authenticate" --type py  # Misses "verify_credentials", "check_auth", etc.

2. Adjust Similarity Threshold

Tune --max-distance based on results:

# Too many irrelevant results? Decrease distance (more strict)
search "query" --max-distance 0.2

# Missing relevant results? Increase distance (more lenient)
search "query" --max-distance 0.5

# Default (0.3) works well for most cases
search "query"
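
The tuning loop described above can be sketched as a function that widens or tightens the threshold until the result count falls in a target range. This is a minimal sketch under stated assumptions: `run_search` is a hypothetical callable that runs one search at a given `--max-distance` and returns its results, and the step size and bounds are illustrative defaults.

```python
from typing import Callable, List, Tuple


def tune_max_distance(
    run_search: Callable[[float], List[object]],
    start: float = 0.3,
    step: float = 0.1,
    min_results: int = 1,
    max_results: int = 10,
    max_rounds: int = 5,
) -> Tuple[float, List[object]]:
    """Adjust the distance threshold until the result count is acceptable."""
    distance = start
    results = run_search(distance)
    for _ in range(max_rounds):
        if len(results) < min_results and distance < 1.0:
            distance = min(1.0, round(distance + step, 2))  # more lenient
        elif len(results) > max_results and distance > 0.0:
            distance = max(0.0, round(distance - step, 2))  # stricter
        else:
            break
        results = run_search(distance)
    return distance, results
```

The `max_rounds` bound keeps the loop from oscillating indefinitely when no threshold yields a count inside the target range.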

3. Use Workspaces for Repeated Searches

For interactive exploration, always use workspaces:

# GOOD: Create workspace once, search many times
export SEMTOOLS_WORKSPACE=my-analysis
search "concept1" src/
search "concept2" src/
search "concept3" src/

# INEFFICIENT: Without SEMTOOLS_WORKSPACE set, embeddings are recomputed on every search
search "concept1" src/
search "concept2" src/

4. Combine with Context Tools

Get more context around semantic matches:

# Find semantically similar code
search "retry logic" src/ --n-lines 2

# Get more context with ripgrep
rg -C 10 "retry" src/specific_file.py

# Or read the full file
cat src/specific_file.py
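
When the file and line range of a match are already known, the surrounding lines can also be extracted programmatically. A minimal sketch; `context_window` is a hypothetical helper that works on a list of lines, so it is independent of how the file was read.

```python
from typing import List, Sequence


def context_window(
    lines: Sequence[str], start: int, end: int, context: int = 10
) -> List[str]:
    """Return lines start..end (1-based, inclusive) plus `context` lines on each side."""
    first = max(1, start - context)
    last = min(len(lines), end + context)
    return [f"{n}: {lines[n - 1]}" for n in range(first, last + 1)]
```

For example, `context_window(Path("src/specific_file.py").read_text().splitlines(), 42, 47)` would show lines 32 to 57 of that file, clipped to the file's length.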

5. Phrase Queries Conceptually

Write queries as concepts, not exact keywords:

# GOOD: Conceptual queries
search "handling network timeouts"
search "user input validation"
search "concurrent data access"

# LESS EFFECTIVE: Exact keyword queries (use ripgrep instead)
search "timeout"  # Use: rg "timeout"
search "validate"  # Use: rg "validate"
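
This rule of thumb can be approximated with a simple heuristic: a single identifier-like token goes to ripgrep, and a multi-word phrase goes to semantic search. The sketch below is an assumption about what counts as "keyword-like", and `suggest_tool` is a hypothetical name.

```python
import re

# One token made of letters, digits, underscores, or dots (e.g. a symbol name).
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")


def suggest_tool(query: str) -> str:
    """Return "rg" for identifier-like queries and "search" for phrases."""
    q = query.strip()
    if not q:
        raise ValueError("empty query")
    if IDENTIFIER.match(q):
        return "rg"
    return "search"
```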

Understanding Semantic Distance

Semtools uses embedding vectors to measure semantic similarity:

Distance 0.0: Identical meaning
Distance 0.1-0.2: Very similar (synonyms, paraphrases)
Distance 0.2-0.3: Related concepts (default threshold)
Distance 0.3-0.4: Loosely related
Distance 0.5+: Weakly related or unrelated
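
The scale above can be related to vector geometry. The source does not name the metric, so the sketch below assumes cosine distance (1 minus cosine similarity), a common choice for embeddings; `cosine_distance` is an illustrative helper, not a Semtools API.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b); 0.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return 1.0 - dot / (norm_a * norm_b)
```

Under this assumption, vectors pointing the same way have distance close to 0.0 and orthogonal vectors have distance 1.0, which matches the 0.0-1.0 range used by `--max-distance`.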

Practical guidelines:

# Strict matching (only close matches)
--max-distance 0.2

# Balanced matching (default, recommended)
--max-distance 0.3

# Lenient matching (exploratory search)
--max-distance 0.4

# Very lenient (may include false positives)
--max-distance 0.5

Local vs. Cloud Embeddings

Semantic Search (Local):

Uses local embeddings (model2vec, potion-multilingual-128M)
No API calls or cloud dependencies
Fast, private, no cost
Works offline

Document Parsing (Cloud):

Uses LlamaParse API (cloud-based)
Requires API key and internet connection
Processes PDFs, DOCX, PPTX
Usage-based pricing (check LlamaIndex pricing)

Privacy consideration: Semantic search is 100% local. Only document parsing sends data to the LlamaParse API.

Performance Considerations

Speed Characteristics

Without workspace:

First search: ~2-5 seconds (embedding computation)
Subsequent searches: ~2-5 seconds each (re-compute embeddings)

With workspace (cached embeddings):

First search: ~2-5 seconds (builds index)
Subsequent searches: ~0.1-0.5 seconds (cached)
Large codebases: IVF_PQ indexing for scalability

Comparison:

ripgrep: 0.01-0.1 seconds (fastest, exact match)
ast-grep: 0.1-0.5 seconds (fast, structural)
semtools (cached): 0.1-0.5 seconds (fast, semantic)
semtools (uncached): 2-5 seconds (slower, semantic)
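
The figures above are approximate and depend on the codebase and hardware, so it is worth measuring on your own project. A minimal timing sketch; `time_calls` is a hypothetical helper, and the callable you pass in decides which command is measured.

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run fn `repeats` times and return each wall-clock duration in seconds."""
    durations: List[float] = []
    for _ in range(repeats):
        started = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - started)
    return durations
```

For example, `time_calls(lambda: subprocess.run(["search", "query", "src/"], capture_output=True))` run with and without `SEMTOOLS_WORKSPACE` set would show how the first and later runs compare.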

Optimization Tips

# 1. Use workspaces for repeated searches
export SEMTOOLS_WORKSPACE=my-project

# 2. Limit search scope to relevant directories
search "query" src/specific_module/

# 3. Use --top-k to control result count
search "query" --top-k 5

# 4. Pipe to head for quick preview
search "query" | head -50

Unix Pipeline Integration

Semtools is designed for Unix-style composition:

# Find and parse PDFs, then search
find docs/ -name "*.pdf" | xargs parse | xargs search "topic"

# Search and filter with grep
search "authentication" src/ | grep -i "jwt"

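# Count matches (each match block starts with a "Match N" header)
search "error handling" src/ | grep -c "^Match "

# Combine with other tools: pass the matched file paths to ripgrep
search "API" src/ | sed -n 's/^File: //p' | sort -u | xargs rg -l "REST"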

Limitations

When NOT to Use Semtools

Exact keyword search: Use ripgrep for known keywords

# WRONG TOOL: Semantic search for exact function name
search "authenticate_user"

# RIGHT TOOL: Use ripgrep for exact matches
rg "authenticate_user" --type py

Structural code patterns: Use ast-grep for syntax matching

# WRONG TOOL: Semantic search for code structure
search "class with constructor"

# RIGHT TOOL: Use ast-grep for structure
sg --pattern 'class $NAME { constructor($$$) { $$$ } }'

Symbol references: Use serena for LSP-based tracking

# WRONG TOOL: Semantic search for all usages
search "MyClass usage"

# RIGHT TOOL: Use serena for precise references
serena find_referencing_symbols --name 'MyClass'

Small codebases: Overhead not worth it for <100 files; ripgrep is faster and simpler for small projects

Known Edge Cases

Ambiguous queries: Vague concepts return broad results
Technical jargon: Domain-specific terms may have lower accuracy
Short code snippets: Limited context reduces embedding quality
Mixed languages: The embedding model is multilingual, but results tend to be strongest for English text
Generated code: Repetitive patterns may cluster together

Troubleshooting

No Semantic Matches Found

If semantic search returns zero results:

Verify files exist: Use ripgrep to confirm content
```
rg "concept" src/
```
Increase distance threshold: Be more lenient
```
search "query" --max-distance 0.5
```

Rephrase query: Try different terminology

search "user authentication"
search "verify user credentials"
search "login validation"

Check file types: Ensure searching correct extensions
```
search "query" src/*.py  # Target specific types
```
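
When several phrasings of the same query are tried, their results can be merged so that each location appears once with its best score. A minimal sketch: the `Hit` tuple layout and `merge_hits` function are hypothetical, and the hits are assumed to have been extracted from the search output beforehand.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[float, str, int, int]  # (distance, file, start line, end line)


def merge_hits(result_sets: Iterable[Iterable[Hit]]) -> List[Hit]:
    """Combine hits from several phrasings, keeping the smallest distance per location."""
    best: Dict[Tuple[str, int, int], float] = {}
    for hits in result_sets:
        for distance, file, start, end in hits:
            key = (file, start, end)
            if key not in best or distance < best[key]:
                best[key] = distance
    merged = [(d, f, s, e) for (f, s, e), d in best.items()]
    return sorted(merged)  # closest matches first
```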

Too Many Irrelevant Results

If semantic search returns too much noise:

Decrease distance threshold: Be more strict
```
search "query" --max-distance 0.2
```
Limit result count: Review top matches only
```
search "query" --top-k 3
```
Narrow directory scope: Search specific paths
```
search "query" src/specific_module/
```

Refine query: Add more specific concepts

# Vague
search "data"

# Specific
search "data validation with regex patterns"

Document Parsing Fails

If parse fails:

Verify API key is set:
```
[ -n "$LLAMA_CLOUD_API_KEY" ] && echo "API key is set" || echo "API key is missing"  # avoids printing the secret
```
Check file format: Ensure supported format (PDF, DOCX, PPTX)
```
file document.pdf  # Verify file type
```
Check file size: Large files may time out
```
du -h document.pdf  # Check size
```
Review parse config: Adjust timeouts if needed
```
cat ~/.parse_config.json
```
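
The format check can be applied to a whole batch before any files are sent for parsing. A minimal sketch based on the three supported extensions listed in this document; `split_by_support` is a hypothetical helper, and it only inspects file names, whereas the `file` command inspects content.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

SUPPORTED = {".pdf", ".docx", ".pptx"}


def split_by_support(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition paths into (supported, unsupported) by file extension."""
    ok: List[str] = []
    rejected: List[str] = []
    for p in paths:
        if Path(p).suffix.lower() in SUPPORTED:
            ok.append(p)
        else:
            rejected.append(p)
    return ok, rejected
```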

Workspace Issues

If workspace commands fail:

# Check workspace status
workspace status

# Prune corrupted workspaces
workspace prune

# Recreate workspace
rm -rf ~/.semtools/workspaces/my-workspace
export SEMTOOLS_WORKSPACE=my-workspace
