| name | Trace Debugger |
| description | Debug performance issues and understand code flow using AILANG telemetry traces. Use when user asks to debug slow compilation, analyze benchmarks, find bottlenecks, investigate hangs, or understand system behavior. |
Trace Debugger
Debug and analyze AILANG operations using OpenTelemetry distributed tracing. This skill helps identify performance bottlenecks, understand code flow, and debug issues using trace data from GCP Cloud Trace.
Quick Start
Most common usage:
# User says: "Why is compilation slow?"
# This skill will:
# 1. Check telemetry is configured
# 2. Run the slow operation with tracing
# 3. Query recent traces with ailang trace list
# 4. Analyze timing breakdown per phase
# 5. Identify the bottleneck
# Check telemetry status
ailang trace status
# List recent traces
ailang trace list --hours 2 --limit 20
# View specific trace hierarchy
ailang trace view <trace-id>
When to Use This Skill
Invoke this skill when:
- User asks to debug slow compilation or type checking
- User wants to analyze benchmark performance
- User mentions "bottleneck", "slow", "hang", or "performance"
- User wants to understand execution flow across components
- User asks "why is X taking so long?"
- User needs to compare timing between runs
Telemetry Prerequisites
Before debugging with traces:
# Option 1: Google Cloud Trace (recommended)
export GOOGLE_CLOUD_PROJECT=your-project-id
# Option 2: Local Jaeger
docker run -d -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
# Option 3: AILANG Observatory Dashboard (local UI)
ailang server # Starts server on localhost:1957
# View traces at http://localhost:1957 → Observatory tab
# Verify configuration
ailang trace status
Observatory Dashboard Setup (v0.6.3+)
The Observatory provides a local dashboard for viewing traces from Claude Code, Gemini CLI, and AILANG.
Start the Server
ailang server
# Or: make services-start
Configure Claude Code
Add to ~/.claude/settings.json:
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "http/json",
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:1957",
"OTEL_RESOURCE_ATTRIBUTES": "ailang.source=user"
}
}
What Claude Code sends: Events via OTLP logs (token counts, costs, model, session info)
Configure Gemini CLI
Add to ~/.gemini/settings.json:
{
"telemetry": {
"enabled": true
}
}
And add to shell profile (~/.zshenv, ~/.bashrc):
export GEMINI_TELEMETRY_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:1957
export OTEL_EXPORTER_OTLP_PROTOCOL=http/json
export OTEL_RESOURCE_ATTRIBUTES="ailang.source=user"
What Gemini CLI sends: Full traces (complete span hierarchy)
Environment Variables Reference
| Variable | Purpose | Example |
|---|---|---|
CLAUDE_CODE_ENABLE_TELEMETRY |
Enable Claude Code telemetry | 1 |
OTEL_LOGS_EXPORTER |
Log export protocol | otlp |
OTEL_METRICS_EXPORTER |
Metrics export protocol | otlp |
OTEL_EXPORTER_OTLP_PROTOCOL |
Transport protocol | http/json |
OTEL_EXPORTER_OTLP_ENDPOINT |
Observatory URL | http://localhost:1957 |
OTEL_RESOURCE_ATTRIBUTES |
Span metadata | ailang.source=user |
GEMINI_TELEMETRY_ENABLED |
Enable Gemini CLI telemetry | true |
OTLP Endpoints
The Observatory receives data on:
/v1/traces- Trace spans (Gemini CLI, AILANG)/v1/logs- Log records (Claude Code events)/v1/metrics- Metrics data
Both protobuf and JSON formats are supported.
Verify Telemetry is Working
- Ensure
ailang serveris running - Run a Claude Code or Gemini CLI command
- Open http://localhost:1957 → Observatory tab
- New traces should appear automatically
Note: If server is not running, OTLP exports fail silently (no impact on CLI tools).
Available Scripts
scripts/check_traces.sh [hours] [filter]
Quick check for recent traces with optional filtering.
Usage:
# Check last hour of traces
.claude/skills/trace-debugger/scripts/check_traces.sh
# Check last 4 hours, filter by eval
.claude/skills/trace-debugger/scripts/check_traces.sh 4 "eval.suite"
# Check compilation traces
.claude/skills/trace-debugger/scripts/check_traces.sh 1 "compile"
scripts/analyze_compilation.sh <file.ail>
Run a file with tracing and analyze compilation phases.
Usage:
# Analyze compilation timing
.claude/skills/trace-debugger/scripts/analyze_compilation.sh examples/runnable/factorial.ail
Workflow
1. Verify Telemetry Configuration
ailang trace status
Expected output shows either GCP or OTLP mode enabled. If disabled, set environment variables.
2. Reproduce the Issue with Tracing
Run the operation that's slow/problematic:
# For compilation issues
GOOGLE_CLOUD_PROJECT=your-project ailang run --caps IO --entry main file.ail
# For eval issues
GOOGLE_CLOUD_PROJECT=your-project ailang eval-suite --models gpt5-mini --benchmarks simple_hello
# For message system issues
GOOGLE_CLOUD_PROJECT=your-project ailang messages list
3. Query Recent Traces
# List recent traces
ailang trace list --hours 1 --limit 10
# Filter by operation type
ailang trace list --filter "compile"
ailang trace list --filter "eval.suite"
ailang trace list --filter "messages"
4. Analyze Trace Hierarchy
# Get full trace details
ailang trace view <trace-id>
Look for:
- Deep nesting: Indicates recursive operations
- Long durations: Shows bottlenecks
- Missing child spans: May indicate early exit or error
- Parallel spans: Shows concurrent operations
5. Interpret Results
Compiler Pipeline Spans:
| Span | What to Look For |
|---|---|
compile.parse |
Long = complex syntax, large file |
compile.elaborate |
Long = many surface→core transforms |
compile.typecheck |
Long = complex type inference, possible hang |
compile.validate |
Long = many nodes to validate |
compile.lower |
Long = complex operator lowering |
Eval Harness Spans:
| Span | What to Look For |
|---|---|
eval.suite |
Total benchmark run time |
eval.benchmark |
Individual benchmark, check benchmark.success |
*.generate |
AI API call time (openai, anthropic, gemini) |
Messaging Spans:
| Span | What to Look For |
|---|---|
messages.send |
Message creation time |
messages.list |
Query time, check list.result_count |
messages.search |
Semantic search time |
Instrumented Components
Current trace coverage in AILANG:
✅ Fully Instrumented
- Compiler Pipeline (
compile.*) - All 6 phases traced - Eval Harness (
eval.suite,eval.benchmark) - Suite and per-benchmark - Messaging (
messages.*) - Send, list, read, search - AI Providers (
anthropic.generate,openai.generate,gemini.generate,ollama.generate) - Server (HTTP middleware) - Request/response tracing
- Coordinator (
coordinator.execute_task) - Task lifecycle
🔜 Prioritized Future Instrumentation
Based on analysis of 280+ implemented design docs and actual bug patterns:
| Priority | Component | Spans | Debug Value |
|---|---|---|---|
| P1 | Type System | types.unify, types.substitute |
4+ hours saved per cyclic type/metadata bug |
| P2 | Module Resolution | modules.resolve, modules.load |
1-2 hours saved per import error |
| P3 | Codegen | codegen.type_lookup, codegen.record |
Catch fallbacks before Go compile |
| P4 | Pattern Matching | match.compile, match.coverage |
Rare but complex debugging |
See resources/trace_patterns.md for detailed span definitions and implementation patterns.
Resources
Trace Patterns Reference
See `resources/trace_patterns.md` for:
- Common debugging patterns
- Trace attribute reference
- Performance baseline expectations
Span Reference
See docs/docs/guides/telemetry.md for:
- Complete span list with attributes
- Environment variable configuration
- Architecture diagrams
Progressive Disclosure
This skill loads information progressively:
- Always loaded: This SKILL.md file (workflow overview)
- Execute as needed: Scripts in
scripts/directory - Load on demand:
resources/trace_patterns.md(detailed patterns)
Notes
- Traces require telemetry environment variables set
- GCP traces may take 30-60 seconds to appear in console
- Local Jaeger provides instant visibility
- Zero overhead when telemetry is disabled
- Use
--jsonflag for programmatic trace analysis
Proactive Trace Improvement
When debugging with traces, actively look for opportunities to add more instrumentation!
If you encounter:
- A debugging session where traces didn't help identify the issue
- A component that would benefit from finer-grained spans
- Missing attributes that would have been useful
Suggest adding traces by:
- Noting the component and what information would help
- Proposing span name and attributes (see resources/trace_patterns.md)
- Creating a design doc for significant additions
Example suggestion format:
Debugging [X] was difficult because traces didn't show [Y].
Suggested addition:
- Span: `component.operation`
- Attributes: `input`, `output`, `duration_ms`
- Location: `internal/package/file.go`
- Debug value: Would show [specific insight]
This helps continuously improve AILANG's observability based on real debugging needs.
Tracing Scope Limitation
Traces only cover AILANG tooling, NOT generated Go code!
| What IS Traced | What is NOT Traced |
|---|---|
ailang compile phases |
Generated Go binary execution |
ailang run (AILANG interpreter) |
Go code after go build |
ailang eval-suite benchmarks |
The actual AI-generated code running |
ailang messages operations |
User application runtime |
To debug generated Go code:
- Use Go's standard profiling (
go tool pprof) - Add your own tracing in generated code templates
- Use
DEBUG_CODEGEN=1to see what code is generated - Add
log.Printftointernal/codegen/templates/if needed
Future possibility: Generate OTEL spans INTO Go code for runtime tracing (not implemented)