| name | analyzing-text-patterns |
| description | Extract and analyze recurring patterns from log messages, span names, and event names using punctuation-based template discovery. Use when you need to understand log diversity, identify common message structures, detect unusual formats, or prepare for log parser development. Works by removing variable content and preserving structural markers. |
Analyzing Text Patterns
Extract recurring patterns from textual data by removing variable content (usernames, IDs, timestamps) and preserving structural markers (punctuation, spacing). This reveals common message templates, helps identify log diversity, and detects unusual or anomalous formats.
Use when you need to:
- Understand what types of logs/messages exist in your system
- Find recurring message templates for parser development
- Detect unusual or rare log formats (anomaly detection)
- Analyze log diversity across services
- Identify which services generate which log formats
- Prepare for structured log extraction
Key Concepts
Pattern Extraction vs Log Parsing
Pattern Extraction (this skill):
- Discovers structural templates
- Groups messages by shape/structure
- Ignores actual content (IDs, names, values)
- Fast, no ML required
- Example:
"User alice logged in"and"User bob99 logged in"→ same pattern
Log Parsing (different skill):
- Extracts field values from logs
- Requires known pattern/regex
- Captures actual content
- Example: Parse username from "User {username} logged in" (see the sketch below)
How Pattern Extraction Works
The Technique:
- Remove all word characters (letters, digits, underscore)
- Replace whitespace with underscores
- Result: Only punctuation and structure remain
Example:
Original: "2025-11-15 [INFO] User alice logged in at 10:30"
Step 1: "-- [] User alice logged in at :"
Step 2: "--_[]___logged_in_at_:"
This "punctuation pattern" represents the log template structure.
When It Works Best
✅ Ideal for:
- Mixed log sources (multiple services, different formats)
- Discovering log templates in unfamiliar systems
- Understanding log diversity
- Finding rare/unusual log formats
- Analyzing unstructured logs
❌ Not ideal for:
- Homogeneous single-service logs (little diversity)
- When numbers are meaningful (error codes, IDs)
- Semantic grouping (grouping by meaning, not structure)
- Already-structured JSON logs (structure is explicit)
Pattern 1: Basic Structure Extraction
Concept: Extract structural template by removing all variable content
When to use:
- Analyzing mixed log sources
- Understanding log diversity
- Finding common templates
Query:
limit 100000
make_col punct:replace_regex(replace_regex(string(body), /\w/, ""), /\s+/, "_")
make_col pattern_id:encode_base64(string(hash(punct)))
statsby count:count(), sample:any(body), group_by(pattern_id, punct)
sort desc(count)
make_col "%":round(100*count/window(sum(count), group_by()),1)
How it works:
- replace_regex(string(body), /\w/, "") - Remove word characters (letters, digits, underscore)
- replace_regex(..., /\s+/, "_") - Normalize whitespace to a single underscore
- hash(punct) - Create a unique ID for each pattern
- encode_base64(...) - Make the pattern ID readable
- statsby count:count() - Count occurrences per pattern
- sample:any(body) - Keep one example message
- The final make_col calculates each pattern's % of total logs
Example result:
pattern_id | count | % | sample
-----------------------------------|-------|------|------------------------------------------
LTcwOTA5MDA2MTQ0NDM3NTA2ODA= | 3405 | 13.0 | 2025-11-15T21:59:38.985Z info MetricsExporter...
MzYzODU1NzM5OTY2NjQyMDU0NQ== | 2877 | 11.0 | [2025-11-15 21:59:55,877] INFO Deleted log...
Interpretation:
- 13% of logs match Pattern 1 (JSON metrics exporter format)
- 11% match Pattern 2 (Kafka deletion logs)
- Shows clear log diversity
Pros:
- Fast and simple
- Works on any text field
- No configuration needed
Cons:
- Loses numeric patterns (error codes)
- Sensitive to formatting changes
- No semantic understanding
Pattern 2: Preserve Numeric Patterns
Concept: Keep numbers to distinguish error codes, status codes, counts
When to use:
- Error code analysis (404 vs 500)
- Status code patterns
- When counts/IDs are structurally significant
Query:
limit 100000
make_col punct:replace_regex(replace_regex(string(body), /[a-zA-Z_]/, ""), /\s+/, "_")
make_col pattern_id:encode_base64(string(hash(punct)))
statsby count:count(), sample:any(body), group_by(pattern_id, punct)
sort desc(count)
make_col "%":round(100*count/window(sum(count), group_by()),1)
Key difference: /[a-zA-Z_]/ removes only letters and underscores, preserving digits
Example differentiation:
Original Pattern 1:
"Error 404: Not found" → "_:__"
"Error 500: Server error" → "_:__"
(Same pattern - can't distinguish!)
With Preserved Digits (Pattern 2):
"Error 404: Not found" → "_404:__"
"Error 500: Server error" → "_500:__"
(Different patterns - can distinguish!)
When this helps:
- HTTP status code analysis
- Error code frequency
- Version number patterns
- Timestamp patterns (year, hour, etc.)
Tradeoff:
- More unique patterns (less grouping)
- Timestamps create per-second patterns (a workaround sketch follows below)
- May be too granular
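If timestamp granularity is the problem, one workaround is to blank out dates and clock times before extracting the digit-preserving pattern. A sketch, assuming ISO-style timestamps (adjust the regexes to your log format):
limit 100000
# Strip dates and times so they don't create per-second patterns
make_col cleaned:replace_regex(replace_regex(string(body), /\d{4}-\d{2}-\d{2}/, ""), /\d{2}:\d{2}:\d{2}/, "")
# Digit-preserving extraction now keeps error/status codes but not timestamps
make_col punct:replace_regex(replace_regex(cleaned, /[a-zA-Z_]/, ""), /\s+/, "_")
statsby count:count(), sample:any(body), group_by(punct)
sort desc(count)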
Pattern 3: Hybrid Approach
Concept: Combine structural and numeric patterns for maximum precision
When to use:
- Need both structure and number patterns
- Building comprehensive log taxonomy
- Maximum differentiation needed
Query:
limit 100000
make_col struct:replace_regex(replace_regex(string(body), /\w/, ""), /\s+/, "_")
make_col numeric:replace_regex(replace_regex(string(body), /[a-zA-Z_]/, ""), /\s+/, "_")
make_col pattern_id:encode_base64(string(hash(struct + "|" + numeric)))
statsby count:count(), sample:any(body), group_by(pattern_id, struct, numeric)
sort desc(count)
make_col "%":round(100*count/window(sum(count), group_by()),1)
Combines:
- struct: The punctuation pattern (from Pattern 1)
- numeric: The numeric pattern (from Pattern 2)
- Together they create a unique signature
Example:
Message: "2025-11-15 [ERROR] Request failed with status 500"
struct: "--_[]___with_"
numeric: "2025-11-15_500"
pattern: Combination of both
Pros:
- Most precise differentiation
- Captures both structure and numbers
- Best for complex analysis
Cons:
- More complex query
- May create too many unique patterns
- Harder to interpret results
Pattern 4: Apply to Different Fields
For Span Names:
limit 100000
make_col punct:replace_regex(replace_regex(span_name, /\w/, ""), /\s+/, "_")
make_col pattern_id:encode_base64(string(hash(punct)))
statsby count:count(), sample:any(span_name), group_by(pattern_id, punct)
sort desc(count)
make_col "%":round(100*count/window(sum(count), group_by()),1)
Example results:
28.2%: "" (no punctuation) - HTTP methods: "POST", "GET"
26.4%: "_" (single space) - "router frontend egress"
12.5%: "./" - gRPC: "oteldemo.CartService/GetCart"
Use cases:
- Identify naming conventions across services
- Find services using different protocols
- Understand span taxonomy
For Event Names:
limit 100000
make_col punct:replace_regex(replace_regex(event_name, /\w/, ""), /\s+/, "_")
make_col pattern_id:encode_base64(string(hash(punct)))
statsby count:count(), sample:any(event_name), group_by(pattern_id, punct)
sort desc(count)
make_col "%":round(100*count/window(sum(count), group_by()),1)
Use cases:
- Categorize span event types
- Find common event patterns
- Identify event naming consistency
Common Use Cases
Use Case 1: Log Diversity Analysis
Goal: Understand what types of logs exist in your system
Query: Basic pattern extraction (Pattern 1)
Workflow:
- Run pattern extraction on log dataset
- Review top 10-20 patterns
- Identify dominant log types
- Look for unexpected patterns
Example:
limit 100000
make_col punct:replace_regex(replace_regex(string(body), /\w/, ""), /\s+/, "_")
statsby count:count(), sample:any(body), group_by(punct)
sort desc(count)
limit 20
Interpretation:
- If top pattern is >80%: Very homogeneous logging
- If top 10 patterns <50%: High diversity, many log types (the sketch below checks this directly)
- Rare patterns (<1%): Unusual events or errors
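To test these thresholds, compute how much of the data the top patterns cover. A sketch reusing the cumulative-percentage window trick from Use Case 4 below:
limit 100000
make_col punct:replace_regex(replace_regex(string(body), /\w/, ""), /\s+/, "_")
statsby count:count(), group_by(punct)
sort desc(count)
make_col "%":round(100*count/window(sum(count), group_by()),1)
make_col cumulative_pct:window(sum("%"), group_by(), order_by(desc(count)))
limit 10   # cumulative_pct on the last row <50 means high diversity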
Use Case 2: Anomaly Detection (Rare Patterns)
Goal: Find unusual or one-off log patterns
Query:
limit 100000
make_col punct:replace_regex(replace_regex(string(body), /\w/, ""), /\s+/, "_")
statsby count:count(), sample:any(body), group_by(punct)
sort asc(count) # Sort by rarest first
filter count < 10 # Only rare patterns
limit 20
Use cases:
- Identify malformed logs
- Find new/unexpected error messages
- Detect logging bugs (incorrect format)
- Security: Find unusual access patterns
Example findings:
count: 1 | sample: "FATAL: Database connection pool exhausted"
count: 2 | sample: "CRITICAL: Disk space at 99%"
count: 3 | sample: "<?xml malformed payload"
Use Case 3: Service Identification by Pattern
Goal: Map log patterns to services
Query:
limit 100000
make_col punct:replace_regex(replace_regex(string(body), /\w/, ""), /\s+/, "_")
make_col svc:string(resource_attributes."k8s.deployment.name")
statsby count:count(), sample:any(body), group_by(punct, svc)
sort desc(count)
limit 30
Use cases:
- Understand service-specific logging styles
- Identify cross-service patterns
- Find which services need log standardization
Example insights:
- Pattern [--_::,]_INFO_... → only the Kafka service
- Pattern --::.Z_info_... → only the OpenTelemetry collector
- Pattern ::_=_[]_ → Phoenix/Elixir services
Use Case 4: Log Parser Development
Goal: Build regex parsers for each log template
Workflow:
- Run pattern extraction
- Review top patterns (cover 80%+ of logs)
- For each pattern:
- Review sample message
- Write regex to extract fields (applied in the sketch after the example below)
- Test on sample
- Prioritize by frequency (%)
Query:
limit 100000
make_col punct:replace_regex(replace_regex(string(body), /\w/, ""), /\s+/, "_")
statsby count:count(), sample:any(body), group_by(punct)
sort desc(count)
make_col "%":round(100*count/window(sum(count), group_by()),1)
make_col cumulative_pct:window(sum("%"), group_by(), order_by(desc(count)))
filter cumulative_pct <= 80 # Patterns covering first 80% of logs
Example:
Pattern 1 (40%): "--_::,_INFO_..."
Sample: "[2025-11-15 23:16:06,906] INFO Kafka message"
Regex: /\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\] (\w+) (.+)/
Pattern 2 (25%): "--T::.Z_info_..."
Sample: "2025-11-15T23:16:12.525Z\tinfo\tMetricsExporter..."
Regex: /(\d{4}-\d{2}-\d{2}T[\d:.]+Z)\t(\w+)\t(.+)/
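Once a pattern's regex is written, it can be applied to produce structured columns. A sketch for Pattern 1 above, assuming OPAL's extract_regex verb with RE2-style named groups (the column names ts, level, and msg are illustrative):
# Extract timestamp, level, and message from logs matching Pattern 1
extract_regex string(body), /\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\] (?P<level>\w+) (?P<msg>.+)/
filter not is_null(level)   # Keep only rows that matched this template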
OPAL Syntax Key Points
Regex Functions
replace_regex:
replace_regex(string, pattern, replacement)
Common patterns:
- /\w/ - Word characters (letters, digits, underscore)
- /[a-zA-Z_]/ - Letters and underscore only (preserves digits)
- /[a-z0-9_]/ - Lowercase letters, digits, underscore (preserves uppercase)
- /\s+/ - One or more whitespace characters
Hash and Encoding
Create pattern ID:
make_col pattern_id:encode_base64(string(hash(punct)))
- hash(punct) - Creates a numeric hash of the pattern
- string(...) - Converts the hash to a string
- encode_base64(...) - Makes the ID human-readable
Why use hash?
- Deterministic (same pattern = same ID)
- Short and readable with base64
- Efficient for grouping
Window Functions for Percentages
Calculate % of total:
make_col "%":round(100*count/window(sum(count), group_by()),1)
- window(sum(count), group_by()) - Sum of all counts (total logs)
- count/window(...) - Fraction of the total
- 100*... - Converts to a percentage
- round(..., 1) - Rounds to one decimal place
Cumulative percentages:
make_col cumulative_pct:window(sum("%"), group_by(), order_by(desc(count)))
- Sums percentages in descending count order
- Shows "top N patterns cover X% of logs"
Limit Best Practices
Always use limit:
limit 100000 # At start of query
Why:
- Prevents analyzing millions of logs (expensive)
- 100K is usually enough for pattern discovery
- Faster query execution
Adjust based on diversity:
- Homogeneous logs: 10K may be enough
- High diversity: 100K or more
- Start small, increase if needed (see the probe sketch below)
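A quick way to pick a sample size is to probe diversity at a small limit first. A sketch, assuming OPAL's count_distinct aggregation:
limit 10000
make_col punct:replace_regex(replace_regex(string(body), /\w/, ""), /\s+/, "_")
statsby patterns:count_distinct(punct)
# A handful of patterns: 10K is plenty; hundreds: raise the limit and rerun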
Troubleshooting
Issue: Too many unique patterns
Cause: High log diversity or timestamps creating unique patterns
Solutions:
- Increase grouping tolerance: Use basic pattern (removes all numbers)
- Focus on top patterns: Add limit 20 to the results
- Filter by minimum count: filter count > 10
- Remove timestamps first (most useful with the digit-preserving variant, where timestamps otherwise explode the pattern count):
  make_col cleaned:replace_regex(string(body), /\d{4}-\d{2}-\d{2}/, "DATE")
  make_col punct:replace_regex(replace_regex(cleaned, /[a-zA-Z_]/, ""), /\s+/, "_")
Issue: Not enough differentiation
Cause: Logs have similar structure but different meanings
Solutions:
- Preserve digits: Use Pattern 2 (preserve numbers)
- Use hybrid approach: Pattern 3 (structure + numbers)
- Add semantic fields: Group by service/level too:
statsby count:count(), sample:any(body), group_by(punct, severity)
Issue: Patterns not meaningful
Cause: Field doesn't have enough structure (too simple or too complex)
Example failures:
- Too simple: event_name = "message" (44% of events)
- Too complex: Random JSON blobs
Solutions:
- Try a different field:
  - Instead of body, try event_name
  - Instead of the full message, extract the first 50 chars
- Pre-filter:
  filter not contains(body, "{") # Skip JSON logs
- Consider an alternative approach: Structured extraction for JSON logs
Issue: Performance slow with large dataset
Cause: Processing too many logs
Solutions:
- Reduce the limit: Start with limit 10000
- Add a time filter: Reduce the time range (1h instead of 24h)
- Sample randomly:
  filter random() < 0.1 # Sample 10% of logs
  limit 100000
Issue: Punctuation pattern empty or same
Cause: Logs have no punctuation or all same punctuation
Example:
"User logged in" → "_logged_in" (no punctuation)
"Server started" → "_" (no punctuation)
Solutions:
- Accept that this is not a good dataset for the technique - try parsing instead
- Use the preserve-case variation (keeps uppercase letters as structure markers):
  make_col punct:replace_regex(replace_regex(string(body), /[a-z0-9_]/, ""), /\s+/, "_")
- Combine with length patterns (group by word count instead):
  make_col words:array_length(split(string(body), " "))
  statsby count:count(), sample:any(body), group_by(words)
Key Takeaways
Pattern extraction discovers structure, not content
- Groups messages by template/shape
- Ignores variable data (usernames, IDs, timestamps)
- Fast and deterministic
Three pattern variations available
- Basic: Remove all word characters (best for general use)
- Preserve digits: Keep numbers (error codes, status codes)
- Hybrid: Combine both (maximum precision)
Works on any text field
- Log messages (body)
- Span names (span_name)
- Event names (event_name)
- Choose the field based on what you're analyzing
Best for heterogeneous data
- Mixed services with different formats
- Understanding log diversity
- Finding unusual patterns
- Less useful for homogeneous single-service logs
Always use limit
- limit 100000 prevents overwhelming results
- A 100K sample is usually sufficient for pattern discovery
- Adjust based on dataset size and diversity
Combine with other dimensions
- Group by service, severity, namespace
- Map patterns to sources
- Identify service-specific formats
Not the same as log parsing
- Pattern extraction: Find templates
- Log parsing: Extract field values
- Use pattern extraction first, then build parsers
Review sample messages
- Always include sample:any(body) in statsby
- The sample shows what each pattern represents
- Critical for interpretation
Percentages show distribution
- Top pattern >80%: Homogeneous
- Top 10 <50%: High diversity
- Use to prioritize parser development
Rare patterns indicate anomalies
- filter count < 10 finds unusual logs
- Useful for error detection
- Security monitoring
When to Use This Skill
Use analyzing-text-patterns skill when:
- Exploring unfamiliar log datasets
- Building log parsers (need to know patterns first)
- Analyzing log diversity across services
- Detecting unusual or malformed logs
- Understanding message structure
- Identifying service-specific logging styles
- Finding rare error patterns
- Preparing for structured log extraction
Cross-references:
- investigating-textual-data (for log parsing and field extraction)
- filtering-event-datasets (for basic log filtering)
- detecting-anomalies (for statistical anomaly detection)