name	breenix-interrupt-syscall-development
description	Enforce pristine interrupt/syscall paths with no logging, diagnostics, or heavy operations. Use when developing interrupt handlers, syscall entry/exit, context switches, or timer code.

Breenix Interrupt and Syscall Path Development

CRITICAL PATH REQUIREMENTS - MANDATORY RULES

These requirements are NON-NEGOTIABLE. Violations will cause catastrophic performance degradation and subtle timing bugs that are extremely difficult to debug.

ABSOLUTELY FORBIDDEN in interrupt/syscall paths:

NO logging or serial output
- No serial_println! or println! of any kind
- No debug prints, trace messages, or diagnostics
- This includes "temporary" debug code that you plan to remove later
- EXCEPTION: Only in initialization code that runs once at boot, before interrupts are enabled
NO page table walks or memory diagnostics
- No calls to translate_address() or similar
- No memory allocation diagnostics
- No stack usage analysis
- No heap introspection
NO function calls that allocate or take locks
- No heap allocations (Box, Vec, String, etc.)
- No lock acquisitions (except purpose-built lock-free critical sections)
- No panic!() - handlers must be infallible or use explicit error codes
Interrupt handlers must complete in < 1000 cycles
- On modern x86_64 at 2GHz, this is ~500 nanoseconds
- Timer fires every 10ms = 20,000,000 cycles
- You have 0.005% of that budget
- Serial output takes 10,000+ cycles per character
Context switch path must be deterministic and fast
- No conditional diagnostics based on process state
- No complex validation beyond assertions
- Register save/restore must be branchless where possible

Why These Rules Exist

Violating these rules causes a specific failure pattern that is extremely difficult to diagnose:

Case Study: The trace_iretq_to_ring3 Bug (December 2024)

Symptom: Userspace process appeared to be created successfully but never executed a single instruction. Process remained stuck at RIP=0x40000000 (entry point) through 2,389 scheduler iterations.

Root Cause: Heavy diagnostics in trace_iretq_to_ring3():

// THIS WAS THE BUG - DO NOT DO THIS
unsafe fn trace_iretq_to_ring3(frame: &TrapFrame) {
    serial_println!("=== IRETQ to Ring 3 ===");
    serial_println!("  RIP: {:#x}", frame.rip);
    serial_println!("  RSP: {:#x}", frame.rsp);
    // ... more logging ...
    let phys_addr = current_page_table.translate_address(VirtAddr::new(frame.rip));
    // ... even more diagnostics ...
}

What Actually Happened:

Kernel prepared perfect interrupt frame for userspace
iretq instruction successfully transitioned to Ring 3
CPU immediately took timer interrupt within 100-500 cycles
Userspace RIP never changed from 0x40000000 - not even one instruction executed
Timer preempted userspace before it could do anything
Scheduler selected same process again
Loop repeated 2,389 times with zero forward progress

Why This Happened:

Serial output in trace_iretq_to_ring3() consumed ~10,000-50,000 cycles. Timer fires every 20,000,000 cycles. The logging pushed the kernel dangerously close to the next timer interrupt. By the time iretq completed:

Timer interrupt was already pending or fired within microseconds
Userspace had 100-500 cycles of execution time (enough for ~50-250 instructions theoretically)
But context switch overhead consumed most of that window
Process never got to execute even its first mov instruction

Evidence:

2,389 scheduler loop iterations with ZERO userspace progress
RIP frozen at 0x40000000 across all iterations
All registers frozen in initial state
Timer interrupt dominated CPU time

The Fix:

Remove ALL diagnostics from the interrupt return path:

// CORRECT - pristine path
unsafe fn return_to_ring3(frame: &TrapFrame) {
    // No logging. No diagnostics. Just return.
    core::arch::asm!(
        "iretq",
        in("rsp") frame as *const _ as u64,
        options(noreturn)
    );
}

After this fix: Userspace executed immediately and correctly.

Key Insight

You cannot debug interrupt paths by adding logging to interrupt paths.

The act of observation changes the system behavior so dramatically that what you're debugging no longer exists. It's a kernel-level Heisenbug - the diagnostic itself destroys the evidence.

APPROVED PATTERNS - What IS Allowed

These operations are acceptable in interrupt/syscall paths:

1. Atomic Counter Increments (for statistics)

TIMER_TICKS.fetch_add(1, Ordering::Relaxed);
SYSCALL_COUNT.fetch_add(1, Ordering::SeqCst);

2. Simple Flag Checks

if should_reschedule.load(Ordering::Acquire) {
    schedule();
}

3. Direct Register Manipulation

unsafe {
    core::arch::asm!(
        "mov {}, cr3",
        out(reg) cr3_value
    );
}

4. Calling Other Minimal Inline Functions

#[inline(always)]
fn save_context(regs: &mut Registers) {
    // Direct memory writes, no allocations
    regs.rax = read_rax();
    regs.rbx = read_rbx();
    // ...
}

5. Assertions (but use sparingly)

debug_assert!(is_kernel_address(stack_ptr));
debug_assert_eq!(cs & 3, 0); // Must be ring 0

Assertions compile to nothing in release builds, so they're safe for sanity checks.

DEBUGGING ALTERNATIVES - How to Debug Without Breaking the Path

Since you cannot add logging to interrupt paths, use these techniques instead:

1. QEMU Interrupt Tracing (BEST OPTION)

# External interrupt tracing - zero overhead
cargo run --bin qemu-uefi -- -d int,cpu_reset -D /tmp/qemu.log

# Then grep the log
grep "interrupt" /tmp/qemu.log

This shows every interrupt entry/exit with register state. No kernel instrumentation required.

2. GDB Breakpoints (SECOND BEST)

# Terminal 1: Start QEMU with GDB server
cargo run --bin qemu-uefi -- -s -S

# Terminal 2: Attach GDB
gdb target/x86_64-breenix/release/breenix
(gdb) target remote :1234
(gdb) break timer_interrupt_handler
(gdb) continue

GDB lets you inspect state without modifying the timing characteristics.

3. Conditional Compilation (USE SPARINGLY)

#[cfg(debug_assertions)]
{
    // This compiles out in release builds
    INTERRUPT_COUNT.fetch_add(1, Ordering::Relaxed);
}

Only for counters and flags - never for serial output.

4. Post-Interrupt Diagnostics

fn timer_interrupt_handler() {
    // PRISTINE PATH - no logging
    TIMER_TICKS.fetch_add(1, Ordering::Relaxed);
    acknowledge_interrupt();
    schedule_if_needed();
}

// Later, in a non-critical path:
pub fn print_timer_stats() {
    let ticks = TIMER_TICKS.load(Ordering::Relaxed);
    serial_println!("Timer ticks: {}", ticks);
}

Accumulate data in the hot path, log it in a cold path.

5. Hardware Performance Counters (ADVANCED)

// Use CPU performance monitoring to count cycles
// This has minimal overhead (~10 cycles per read)
let start = read_tsc();
critical_operation();
let end = read_tsc();
CYCLE_COUNTS[operation_id].fetch_add(end - start, Ordering::Relaxed);

CODE REVIEW CHECKLIST

Before merging ANY code that touches interrupt or syscall paths, ask these questions:

Mandatory Rejection Criteria

Does this add any serial output? → REJECT IMMEDIATELY
Does this add any memory allocation? → REJECT IMMEDIATELY
Does this add any heap diagnostics? → REJECT IMMEDIATELY
Does this add page table walks in hot path? → REJECT IMMEDIATELY

Careful Review Required

Does this take any locks? → If yes, document why it's necessary and prove it's lock-free
Does this call external functions? → Audit those functions for the above violations
What's the worst-case cycle count? → Must be < 1000 cycles for interrupt handlers
Is this code conditional on debug mode? → Verify it compiles out in release

Acceptable Operations

Atomic counter increments? → OK if relaxed ordering
Simple flag checks? → OK if no branching complexity
Direct register reads/writes? → OK, encouraged
Inline function calls? → OK if callee is also pristine

Testing Requirements

Did you test in release mode? → Debug builds have different timing
Did you verify userspace makes forward progress? → RIP should change
Did you run for multiple seconds? → Timing bugs may be intermittent
Did you check QEMU interrupt logs? → Verify interrupt frequency is sane

IMPLEMENTATION WORKFLOW

When implementing interrupt or syscall handlers:

1. Design the Pristine Path First

// Write this FIRST - the minimal handler
pub extern "C" fn syscall_entry() {
    // Save registers
    // Dispatch to handler
    // Restore registers
    // Return
}

2. Add Atomic Counters for Observability

static SYSCALL_COUNTS: [AtomicU64; 256] = [/* ... */];

pub extern "C" fn syscall_entry(syscall_num: u64) {
    SYSCALL_COUNTS[syscall_num as usize].fetch_add(1, Ordering::Relaxed);
    // ... rest of handler
}

3. Test in Release Mode

cargo run --release --bin qemu-uefi

4. Verify with External Tracing

cargo run --release --bin qemu-uefi -- -d int -D /tmp/qemu.log

5. Only Then Add Debug-Only Diagnostics (if absolutely necessary)

#[cfg(all(debug_assertions, feature = "interrupt-trace"))]
{
    INTERRUPT_TRACE_BUFFER[index].store(rip, Ordering::Relaxed);
}

ANTI-PATTERNS TO AVOID

Anti-Pattern 1: "Temporary" Debug Logging

// NO! This is how bugs hide for months
fn timer_handler() {
    serial_println!("Timer fired"); // "I'll remove this later"
    // ...
}

Why it's wrong: You'll forget to remove it, or it will get copied to other handlers. When the bug appears in production, the logging will be the bug.

Anti-Pattern 2: Conditional Diagnostics on Process State

// NO! This creates timing-dependent behavior
fn context_switch(next: &Process) {
    if next.is_userspace() {
        serial_println!("Switching to PID {}", next.pid); // WRONG
    }
    // ...
}

Why it's wrong: Userspace processes get slower context switches than kernel processes. Creates Heisenbugs where timing depends on process type.

Anti-Pattern 3: "Just This Once" Memory Allocation

// NO! Interrupt handlers must be infallible
fn interrupt_handler() {
    let msg = format!("Interrupt at {:#x}", rip); // WRONG - allocates
    log_message(&msg);
}

Why it's wrong: What if the allocator is out of memory? What if the allocator lock is held? Handler must never fail.

Anti-Pattern 4: Complex Validation

// NO! Validation should be in non-critical paths
fn syscall_handler(num: u64, args: &[u64]) {
    validate_user_memory(args[0]); // May page fault, walk tables, allocate
    validate_file_descriptor(args[1]); // May take locks
    // ...
}

Why it's wrong: Validation can be expensive. Do it before entering the critical path, or use hardware-based validation (page faults).

SUCCESS METRICS

You've implemented a pristine interrupt/syscall path if:

Userspace makes forward progress immediately
- RIP changes on every scheduler quantum
- Processes complete in expected time
- No infinite loops at entry point
Interrupt frequency is stable
- Timer fires at configured rate (e.g., 100 Hz = every 10ms)
- No interrupt storms
- QEMU logs show regular intervals
Zero overhead in release builds
- No serial output
- No allocations
- Cycle counts < 1000 per handler
Debuggable with external tools
- QEMU tracing shows correct behavior
- GDB breakpoints work
- Performance counters provide visibility

WHEN TO USE THIS SKILL

Invoke this skill when:

Implementing new interrupt handlers (timer, keyboard, syscall, page fault)
Modifying syscall entry/exit paths
Implementing context switch logic
Adding scheduler hooks
Debugging performance issues where userspace seems "stuck"
Reviewing PRs that touch interrupt.asm, syscall/entry.asm, or timer.rs
Investigating "works in debug, fails in release" bugs (timing-related)

RELATED SKILLS

breenix-kernel-debug-loop: For iterative debugging with external tools
breenix-systematic-debugging: For documenting root cause analysis
breenix-code-quality-check: For enforcing zero-warning builds
breenix-interrupt-trace: For QEMU-based interrupt tracing
breenix-gdb-attach: For breakpoint-based debugging

REFERENCES

Intel Software Developer Manual Vol 3, Chapter 6: Interrupt and Exception Handling
AMD64 Architecture Manual Vol 2, Chapter 8: Exceptions and Interrupts
Linux kernel: arch/x86/entry/entry_64.S - pristine syscall entry path
FreeBSD: sys/amd64/amd64/exception.S - interrupt handling

VERSION HISTORY

v1.0 (2024-12-02): Initial skill creation after fixing trace_iretq_to_ring3 bug

Install Skill

SKILL.md