| name | breenix-interrupt-syscall-development |
| description | Enforce pristine interrupt/syscall paths with no logging, diagnostics, or heavy operations. Use when developing interrupt handlers, syscall entry/exit, context switches, or timer code. |
Breenix Interrupt and Syscall Path Development
CRITICAL PATH REQUIREMENTS - MANDATORY RULES
These requirements are NON-NEGOTIABLE. Violations will cause catastrophic performance degradation and subtle timing bugs that are extremely difficult to debug.
ABSOLUTELY FORBIDDEN in interrupt/syscall paths:
NO logging or serial output
- No
serial_println!orprintln!of any kind - No debug prints, trace messages, or diagnostics
- This includes "temporary" debug code that you plan to remove later
- EXCEPTION: Only in initialization code that runs once at boot, before interrupts are enabled
- No
NO page table walks or memory diagnostics
- No calls to
translate_address()or similar - No memory allocation diagnostics
- No stack usage analysis
- No heap introspection
- No calls to
NO function calls that allocate or take locks
- No heap allocations (Box, Vec, String, etc.)
- No lock acquisitions (except purpose-built lock-free critical sections)
- No panic!() - handlers must be infallible or use explicit error codes
Interrupt handlers must complete in < 1000 cycles
- On modern x86_64 at 2GHz, this is ~500 nanoseconds
- Timer fires every 10ms = 20,000,000 cycles
- You have 0.005% of that budget
- Serial output takes 10,000+ cycles per character
Context switch path must be deterministic and fast
- No conditional diagnostics based on process state
- No complex validation beyond assertions
- Register save/restore must be branchless where possible
Why These Rules Exist
Violating these rules causes a specific failure pattern that is extremely difficult to diagnose:
Case Study: The trace_iretq_to_ring3 Bug (December 2024)
Symptom: Userspace process appeared to be created successfully but never executed a single instruction. Process remained stuck at RIP=0x40000000 (entry point) through 2,389 scheduler iterations.
Root Cause: Heavy diagnostics in trace_iretq_to_ring3():
// THIS WAS THE BUG - DO NOT DO THIS
unsafe fn trace_iretq_to_ring3(frame: &TrapFrame) {
serial_println!("=== IRETQ to Ring 3 ===");
serial_println!(" RIP: {:#x}", frame.rip);
serial_println!(" RSP: {:#x}", frame.rsp);
// ... more logging ...
let phys_addr = current_page_table.translate_address(VirtAddr::new(frame.rip));
// ... even more diagnostics ...
}
What Actually Happened:
- Kernel prepared perfect interrupt frame for userspace
iretqinstruction successfully transitioned to Ring 3- CPU immediately took timer interrupt within 100-500 cycles
- Userspace RIP never changed from 0x40000000 - not even one instruction executed
- Timer preempted userspace before it could do anything
- Scheduler selected same process again
- Loop repeated 2,389 times with zero forward progress
Why This Happened:
Serial output in trace_iretq_to_ring3() consumed ~10,000-50,000 cycles. Timer fires every 20,000,000 cycles. The logging pushed the kernel dangerously close to the next timer interrupt. By the time iretq completed:
- Timer interrupt was already pending or fired within microseconds
- Userspace had 100-500 cycles of execution time (enough for ~50-250 instructions theoretically)
- But context switch overhead consumed most of that window
- Process never got to execute even its first
movinstruction
Evidence:
- 2,389 scheduler loop iterations with ZERO userspace progress
- RIP frozen at 0x40000000 across all iterations
- All registers frozen in initial state
- Timer interrupt dominated CPU time
The Fix:
Remove ALL diagnostics from the interrupt return path:
// CORRECT - pristine path
unsafe fn return_to_ring3(frame: &TrapFrame) {
// No logging. No diagnostics. Just return.
core::arch::asm!(
"iretq",
in("rsp") frame as *const _ as u64,
options(noreturn)
);
}
After this fix: Userspace executed immediately and correctly.
Key Insight
You cannot debug interrupt paths by adding logging to interrupt paths.
The act of observation changes the system behavior so dramatically that what you're debugging no longer exists. It's a kernel-level Heisenbug - the diagnostic itself destroys the evidence.
APPROVED PATTERNS - What IS Allowed
These operations are acceptable in interrupt/syscall paths:
1. Atomic Counter Increments (for statistics)
TIMER_TICKS.fetch_add(1, Ordering::Relaxed);
SYSCALL_COUNT.fetch_add(1, Ordering::SeqCst);
2. Simple Flag Checks
if should_reschedule.load(Ordering::Acquire) {
schedule();
}
3. Direct Register Manipulation
unsafe {
core::arch::asm!(
"mov {}, cr3",
out(reg) cr3_value
);
}
4. Calling Other Minimal Inline Functions
#[inline(always)]
fn save_context(regs: &mut Registers) {
// Direct memory writes, no allocations
regs.rax = read_rax();
regs.rbx = read_rbx();
// ...
}
5. Assertions (but use sparingly)
debug_assert!(is_kernel_address(stack_ptr));
debug_assert_eq!(cs & 3, 0); // Must be ring 0
Assertions compile to nothing in release builds, so they're safe for sanity checks.
DEBUGGING ALTERNATIVES - How to Debug Without Breaking the Path
Since you cannot add logging to interrupt paths, use these techniques instead:
1. QEMU Interrupt Tracing (BEST OPTION)
# External interrupt tracing - zero overhead
cargo run --bin qemu-uefi -- -d int,cpu_reset -D /tmp/qemu.log
# Then grep the log
grep "interrupt" /tmp/qemu.log
This shows every interrupt entry/exit with register state. No kernel instrumentation required.
2. GDB Breakpoints (SECOND BEST)
# Terminal 1: Start QEMU with GDB server
cargo run --bin qemu-uefi -- -s -S
# Terminal 2: Attach GDB
gdb target/x86_64-breenix/release/breenix
(gdb) target remote :1234
(gdb) break timer_interrupt_handler
(gdb) continue
GDB lets you inspect state without modifying the timing characteristics.
3. Conditional Compilation (USE SPARINGLY)
#[cfg(debug_assertions)]
{
// This compiles out in release builds
INTERRUPT_COUNT.fetch_add(1, Ordering::Relaxed);
}
Only for counters and flags - never for serial output.
4. Post-Interrupt Diagnostics
fn timer_interrupt_handler() {
// PRISTINE PATH - no logging
TIMER_TICKS.fetch_add(1, Ordering::Relaxed);
acknowledge_interrupt();
schedule_if_needed();
}
// Later, in a non-critical path:
pub fn print_timer_stats() {
let ticks = TIMER_TICKS.load(Ordering::Relaxed);
serial_println!("Timer ticks: {}", ticks);
}
Accumulate data in the hot path, log it in a cold path.
5. Hardware Performance Counters (ADVANCED)
// Use CPU performance monitoring to count cycles
// This has minimal overhead (~10 cycles per read)
let start = read_tsc();
critical_operation();
let end = read_tsc();
CYCLE_COUNTS[operation_id].fetch_add(end - start, Ordering::Relaxed);
CODE REVIEW CHECKLIST
Before merging ANY code that touches interrupt or syscall paths, ask these questions:
Mandatory Rejection Criteria
- Does this add any serial output? → REJECT IMMEDIATELY
- Does this add any memory allocation? → REJECT IMMEDIATELY
- Does this add any heap diagnostics? → REJECT IMMEDIATELY
- Does this add page table walks in hot path? → REJECT IMMEDIATELY
Careful Review Required
- Does this take any locks? → If yes, document why it's necessary and prove it's lock-free
- Does this call external functions? → Audit those functions for the above violations
- What's the worst-case cycle count? → Must be < 1000 cycles for interrupt handlers
- Is this code conditional on debug mode? → Verify it compiles out in release
Acceptable Operations
- Atomic counter increments? → OK if relaxed ordering
- Simple flag checks? → OK if no branching complexity
- Direct register reads/writes? → OK, encouraged
- Inline function calls? → OK if callee is also pristine
Testing Requirements
- Did you test in release mode? → Debug builds have different timing
- Did you verify userspace makes forward progress? → RIP should change
- Did you run for multiple seconds? → Timing bugs may be intermittent
- Did you check QEMU interrupt logs? → Verify interrupt frequency is sane
IMPLEMENTATION WORKFLOW
When implementing interrupt or syscall handlers:
1. Design the Pristine Path First
// Write this FIRST - the minimal handler
pub extern "C" fn syscall_entry() {
// Save registers
// Dispatch to handler
// Restore registers
// Return
}
2. Add Atomic Counters for Observability
static SYSCALL_COUNTS: [AtomicU64; 256] = [/* ... */];
pub extern "C" fn syscall_entry(syscall_num: u64) {
SYSCALL_COUNTS[syscall_num as usize].fetch_add(1, Ordering::Relaxed);
// ... rest of handler
}
3. Test in Release Mode
cargo run --release --bin qemu-uefi
4. Verify with External Tracing
cargo run --release --bin qemu-uefi -- -d int -D /tmp/qemu.log
5. Only Then Add Debug-Only Diagnostics (if absolutely necessary)
#[cfg(all(debug_assertions, feature = "interrupt-trace"))]
{
INTERRUPT_TRACE_BUFFER[index].store(rip, Ordering::Relaxed);
}
ANTI-PATTERNS TO AVOID
Anti-Pattern 1: "Temporary" Debug Logging
// NO! This is how bugs hide for months
fn timer_handler() {
serial_println!("Timer fired"); // "I'll remove this later"
// ...
}
Why it's wrong: You'll forget to remove it, or it will get copied to other handlers. When the bug appears in production, the logging will be the bug.
Anti-Pattern 2: Conditional Diagnostics on Process State
// NO! This creates timing-dependent behavior
fn context_switch(next: &Process) {
if next.is_userspace() {
serial_println!("Switching to PID {}", next.pid); // WRONG
}
// ...
}
Why it's wrong: Userspace processes get slower context switches than kernel processes. Creates Heisenbugs where timing depends on process type.
Anti-Pattern 3: "Just This Once" Memory Allocation
// NO! Interrupt handlers must be infallible
fn interrupt_handler() {
let msg = format!("Interrupt at {:#x}", rip); // WRONG - allocates
log_message(&msg);
}
Why it's wrong: What if the allocator is out of memory? What if the allocator lock is held? Handler must never fail.
Anti-Pattern 4: Complex Validation
// NO! Validation should be in non-critical paths
fn syscall_handler(num: u64, args: &[u64]) {
validate_user_memory(args[0]); // May page fault, walk tables, allocate
validate_file_descriptor(args[1]); // May take locks
// ...
}
Why it's wrong: Validation can be expensive. Do it before entering the critical path, or use hardware-based validation (page faults).
SUCCESS METRICS
You've implemented a pristine interrupt/syscall path if:
Userspace makes forward progress immediately
- RIP changes on every scheduler quantum
- Processes complete in expected time
- No infinite loops at entry point
Interrupt frequency is stable
- Timer fires at configured rate (e.g., 100 Hz = every 10ms)
- No interrupt storms
- QEMU logs show regular intervals
Zero overhead in release builds
- No serial output
- No allocations
- Cycle counts < 1000 per handler
Debuggable with external tools
- QEMU tracing shows correct behavior
- GDB breakpoints work
- Performance counters provide visibility
WHEN TO USE THIS SKILL
Invoke this skill when:
- Implementing new interrupt handlers (timer, keyboard, syscall, page fault)
- Modifying syscall entry/exit paths
- Implementing context switch logic
- Adding scheduler hooks
- Debugging performance issues where userspace seems "stuck"
- Reviewing PRs that touch interrupt.asm, syscall/entry.asm, or timer.rs
- Investigating "works in debug, fails in release" bugs (timing-related)
RELATED SKILLS
breenix-kernel-debug-loop: For iterative debugging with external toolsbreenix-systematic-debugging: For documenting root cause analysisbreenix-code-quality-check: For enforcing zero-warning buildsbreenix-interrupt-trace: For QEMU-based interrupt tracingbreenix-gdb-attach: For breakpoint-based debugging
REFERENCES
- Intel Software Developer Manual Vol 3, Chapter 6: Interrupt and Exception Handling
- AMD64 Architecture Manual Vol 2, Chapter 8: Exceptions and Interrupts
- Linux kernel: arch/x86/entry/entry_64.S - pristine syscall entry path
- FreeBSD: sys/amd64/amd64/exception.S - interrupt handling
VERSION HISTORY
- v1.0 (2024-12-02): Initial skill creation after fixing trace_iretq_to_ring3 bug