| name | livekit-omnimodal |
| description | LiveKit omni-modal continuous coaching with stick-breaking color selection, |
| version | 1.0.0 |
LiveKit Omni-Modal Coaching
Overview
Real-time multi-modal coaching via LiveKit with:
- Continuous listening: Always-on voice input from participants
- Continuous coaching: Persistent guidance via "The Queen" voice persona
- Stick-breaking modality selection: Poisson-Dirichlet weights determine which modality gets attention
- Dynamic sufficiency gating: ε-machine prevents action without verified skills
- Symbolic expression output: All observations become s-expressions for categorical processing
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ OMNI-MODAL LIVEKIT COACHING SYSTEM │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────┐
│ LiveKit Room │
│ (WebRTC SFU) │
└──────────┬──────────┘
│
┌─────────────────────────┼─────────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────────┐ ┌─────────────────┐
│ Audio Stream │ │ Video Stream │ │ Data Track │
│ (continuous) │ │ (screenshare) │ │ (CRDT sync) │
└────────┬────────┘ └──────────┬──────────┘ └────────┬────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STICK-BREAKING MODALITY SELECTOR │
│ │
│ ├────────────────┤←────────────┤←────────────────────────────────────────┤│
│ w₁ = 0.45 w₂ = 0.30 w₃ = 0.25 │
│ (audio) (video) (data) │
│ SELECTED fallback fallback │
│ │
│ Max fraction color: #E12A4E (audio segment wins) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DYNAMIC SUFFICIENCY GATE │
│ │
│ Task: "process audio for coaching feedback" │
│ Causal State: (domain=audio, operation=transcribe, tools=(whisper,)) │
│ │
│ Required Skills: │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ say-narration│ │ signal-msg │ │ whitehole │ │
│ │ (-1) │ │ (0) │ │ (+1) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ε-Machine: PROCEED (coverage=1.0, missing=0) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE QUEEN'S VOICE OUTPUT │
│ │
│ Voice: Serena (Premium) - English UK - "Bertha Swirles" persona │
│ Trit: Computed from stick-breaking max-fraction color │
│ │
│ Output: S-expression for categorical processing: │
│ │
│ (coaching-event │
│ :timestamp 1735689600 │
│ :modality :audio │
│ :weight 0.45 │
│ :color "#E12A4E" │
│ :trit +1 │
│ :observation "participant mentioned confusion about types" │
│ :guidance "Consider explaining the relationship...") │
└─────────────────────────────────────────────────────────────────────────────┘
GF(3) Conservation
say-narration (-1) ⊗ livekit-omnimodal (0) ⊗ whitehole-audio (+1) = 0 ✓
| Role | Skill | Function |
|---|---|---|
| MINUS (-1) | say-narration | Queen voice output (constraints) |
| ERGODIC (0) | livekit-omnimodal | THIS SKILL - coordinates modalities |
| PLUS (+1) | whitehole-audio | Audio loopback routing (generation) |
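The conservation rule above amounts to checking that the three trits sum to zero modulo 3, which is what the `:sum 0` field of the gf3-triplet record expresses. A minimal Python sketch of that check follows; the helper names `gf3_sum` and `is_conserved` are illustrative assumptions, not part of any skill's API.

```python
from typing import Mapping


def gf3_sum(trits: Mapping[str, int]) -> int:
    """Sum trits in GF(3) and return the balanced representative (-1, 0, or +1)."""
    for name, trit in trits.items():
        if trit not in (-1, 0, 1):
            raise ValueError(f"{name}: trit must be -1, 0, or +1, got {trit}")
    residue = sum(trits.values()) % 3  # Python's % yields 0, 1, or 2 here
    return residue - 3 if residue == 2 else residue


def is_conserved(trits: Mapping[str, int]) -> bool:
    """A triplet is conserved when its trits cancel modulo 3."""
    return gf3_sum(trits) == 0


# Trit assignments copied from the role table above.
triplet = {"say-narration": -1, "livekit-omnimodal": 0, "whitehole-audio": +1}
print(gf3_sum(triplet), is_conserved(triplet))
```

Returning the balanced representative keeps the result in the same -1/0/+1 vocabulary the role table uses.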
The Queen's Voice
"The Queen" is Serena (Premium) - a British English voice representing Bertha Swirles (quantum physicist). She provides continuous coaching guidance with gravitas and clarity.
# The Queen speaks
say -v "Serena (Premium)" "I observe you're struggling with the type system. Consider that types are propositions and programs are proofs."
Stick-Breaking Modality Selection
Each incoming modality stream is assigned a stick segment whose length, drawn from a Poisson-Dirichlet process, serves as its attention weight:
using Gay.WorldStickBreaking
# Each modality gets a stick segment
modalities = [:audio, :video, :screenshare, :data, :chat]
pd = world_stick_breaking(alpha=1.0, n_segments=length(modalities), seed=session_seed)
# Select dominant modality
selected = world_max_fraction_color(pd)
# => (color="#E12A4E", weight=0.45, index=1, trit=1)
# Audio wins with 45% of attention weight
dominant_modality = modalities[selected.index] # :audio
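For readers without the Gay.WorldStickBreaking package, the general technique can be sketched in Python. This is a generic truncated stick-breaking construction (each break takes a Beta(1, alpha) share of the remaining stick), not the package's actual implementation; the function names and the seed value 42 are assumptions made for illustration, and the resulting weights will not reproduce the 0.45 shown above.

```python
import random
from typing import List, Sequence, Tuple


def stick_breaking_weights(alpha: float, n_segments: int, seed: int) -> List[float]:
    """Break a unit stick into n_segments non-negative pieces that sum to 1."""
    if alpha <= 0 or n_segments < 1:
        raise ValueError("alpha must be positive and n_segments at least 1")
    rng = random.Random(seed)  # seeded so a session reproduces its weights
    remaining = 1.0
    weights = []
    for _ in range(n_segments - 1):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # truncation: the last segment takes what is left
    return weights


def select_dominant(modalities: Sequence[str], weights: Sequence[float]) -> Tuple[str, float]:
    """Return the modality holding the largest stick segment and its weight."""
    index = max(range(len(weights)), key=weights.__getitem__)
    return modalities[index], weights[index]


modalities = ["audio", "video", "screenshare", "data", "chat"]
weights = stick_breaking_weights(alpha=1.0, n_segments=len(modalities), seed=42)
modality, weight = select_dominant(modalities, weights)
print(modality, round(weight, 3))
```

A smaller alpha tends to concentrate weight in the first few segments, while a larger alpha spreads it more evenly across modalities.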
Dynamic Sufficiency Integration
Before any coaching action, verify skill coverage via the ε-machine:
from sufficiency import EpsilonMachine, Action, CoverageResult
# Create action representing coaching intent
action = Action(
    operation="coach",
    domain="audio",
    language="natural",
    tool="whisper"
)
# Check sufficiency
epsilon_machine = EpsilonMachine()
state = epsilon_machine.infer_state(action)
coverage = epsilon_machine.check_coverage(action, loaded_skills)
if coverage.is_sufficient:
    # Proceed with coaching
    emit_sexp(coaching_event)
else:
    # Load missing skills first
    for skill in coverage.missing:
        load_skill(skill)
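The gate's decision can be illustrated without the `sufficiency` module. The sketch below assumes coverage is simply the fraction of required skills that are loaded, compared against a threshold; the `Coverage` class and `check_coverage` function are hypothetical stand-ins and may differ from what `EpsilonMachine.check_coverage` actually does. The three required skill names mirror the gate in the architecture diagram.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Coverage:
    score: float             # fraction of required skills that are loaded
    missing: FrozenSet[str]  # required skills not yet loaded
    is_sufficient: bool      # True when score reaches the threshold


def check_coverage(required: Iterable[str], loaded: Iterable[str],
                   threshold: float = 1.0) -> Coverage:
    """Compare required skills against loaded ones (assumed coverage rule)."""
    required_set = frozenset(required)
    missing = required_set - frozenset(loaded)
    score = 1.0 if not required_set else 1.0 - len(missing) / len(required_set)
    return Coverage(score, missing, score >= threshold)


required = ["say-narration", "signal-messaging", "whitehole-audio"]
loaded = {"say-narration", "whitehole-audio"}

before = check_coverage(required, loaded)
loaded |= before.missing  # stand-in for calling load_skill on each missing skill
after = check_coverage(required, loaded)
print(before.is_sufficient, sorted(before.missing), after.is_sufficient)
# prints: False ['signal-messaging'] True
```

Re-running the check after loading confirms that the gate opens before any coaching event is emitted.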
S-Expression Output Format
All observations and coaching events are emitted as s-expressions for categorical processing:
;; Coaching event structure
(coaching-event
:id "CE-2026-01-01-001"
:timestamp 1735689600
:session-id "room-xyz"
;; Modality selection (from stick-breaking)
:modality :audio
:weight 0.45
:color "#E12A4E"
:trit +1
;; Dynamic sufficiency result
:causal-state (audio transcribe (whisper))
:coverage 1.0
:sufficient t
;; Observation from modality
:observation "participant expressed confusion about monads"
:observation-embedding #<vector 1024>
;; Queen's guidance
:guidance "A monad is simply a monoid in the category of endofunctors."
:voice "Serena (Premium)"
:confidence 0.92)
;; GF(3) conservation record
(gf3-triplet
:minus (say-narration -1)
:ergodic (livekit-omnimodal 0)
:plus (whitehole-audio +1)
:sum 0
:conserved t)
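A downstream consumer needs to read these records back. The following Python sketch parses a keyword-style form such as `coaching-event` into a dictionary; it is a minimal reader written for illustration, not the skill's own parser, and names such as `Symbol`, `tokenize`, `parse`, and `plist_to_dict` are assumptions. It handles comments, strings, numbers, and nested lists, but not unreadable objects like `#<vector 1024>`.

```python
import re
from typing import Any, Dict, List, Tuple


class Symbol(str):
    """A bare atom such as coaching-event, :audio, t, or nil."""


# One token per match: a ; comment, a parenthesis, a quoted string, or a bare atom.
TOKEN = re.compile(r'\s*(?:(;[^\n]*)|([()])|"((?:\\.|[^"\\])*)"|([^\s()";]+))')


def tokenize(text: str) -> List[Tuple[str, str]]:
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        pos = match.end()
        _comment, paren, string, atom = match.groups()
        if paren is not None:
            tokens.append(("paren", paren))
        elif string is not None:
            tokens.append(("str", re.sub(r"\\(.)", r"\1", string)))  # unescape
        elif atom is not None:
            tokens.append(("atom", atom))
        # comments produce no token
    return tokens


def parse(tokens: List[Tuple[str, str]], i: int = 0) -> Tuple[Any, int]:
    """Parse one expression starting at tokens[i]; return (value, next index)."""
    kind, value = tokens[i]
    if kind == "paren":
        if value == ")":
            raise ValueError("unexpected closing parenthesis")
        items, i = [], i + 1
        while tokens[i] != ("paren", ")"):
            item, i = parse(tokens, i)
            items.append(item)
        return items, i + 1
    if kind == "str":
        return value, i + 1
    for cast in (int, float):  # numbers first, then fall back to a symbol
        try:
            return cast(value), i + 1
        except ValueError:
            pass
    return Symbol(value), i + 1


def plist_to_dict(form: list) -> Dict[str, Any]:
    """Turn (head :key value ...) into {"form": head, "fields": {key: value}}."""
    head, *rest = form
    if len(rest) % 2 != 0:
        raise ValueError("expected alternating :key value pairs")
    fields = {}
    for key, value in zip(rest[0::2], rest[1::2]):
        if not (isinstance(key, Symbol) and key.startswith(":")):
            raise ValueError(f"expected a keyword, got {key!r}")
        fields[key[1:]] = value
    return {"form": str(head), "fields": fields}


text = """
;; Coaching event structure
(coaching-event
  :timestamp 1735689600
  :modality :audio
  :weight 0.45
  :trit +1
  :causal-state (audio transcribe (whisper))
  :sufficient t
  :observation "participant expressed confusion about monads")
"""
form, _ = parse(tokenize(text))
event = plist_to_dict(form)
print(event["fields"]["modality"], event["fields"]["trit"])
```

Keeping symbols distinct from strings preserves the difference between `:audio` and `"audio"`, which matters if the records are re-emitted later.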
Required Skills (Dependency Analysis)
Currently Have ✓
| Skill | Trit | Status |
|---|---|---|
| say-narration | -1 | ✓ Installed |
| whitehole-audio | +1 | ✓ Installed |
| dynamic-sufficiency | -1 | ✓ Installed |
| gay-mcp | +1 | ✓ Installed |
| signal-messaging | 0 | ✓ Installed |
Skills to Acquire ✗
| Skill | Trit | Purpose | Priority |
|---|---|---|---|
| whisper-transcribe | 0 | Real-time audio→text | HIGH |
| livekit-spectral | +1 | WebRTC + spectral gap walks | HIGH |
| vision-llm | 0 | Screenshare understanding | MEDIUM |
| crdt-livekit | -1 | Data track synchronization | MEDIUM |
| prosody-analyzer | +1 | Voice emotion/tone analysis | LOW |
Skill Gap S-Expression
(skill-gap-analysis
:task "livekit-omnimodal-coaching"
:have (say-narration whitehole-audio dynamic-sufficiency gay-mcp signal-messaging)
:need (whisper-transcribe livekit-spectral vision-llm crdt-livekit)
:coverage 0.55
:sufficient nil
:action :load-skills
:priority-order (whisper-transcribe livekit-spectral vision-llm crdt-livekit prosody-analyzer))
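One plausible way to derive the `:priority-order` field is to drop skills that are already installed and sort the remainder by the Priority column, keeping table order within each level. The sketch below assumes that rule; `PRIORITY_RANK`, `WANTED`, and `load_order` are hypothetical names introduced for illustration.

```python
from typing import Iterable, List, Tuple

PRIORITY_RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

# (skill, priority) rows copied from the "Skills to Acquire" table.
WANTED: List[Tuple[str, str]] = [
    ("whisper-transcribe", "HIGH"),
    ("livekit-spectral", "HIGH"),
    ("vision-llm", "MEDIUM"),
    ("crdt-livekit", "MEDIUM"),
    ("prosody-analyzer", "LOW"),
]


def load_order(wanted: Iterable[Tuple[str, str]], installed: Iterable[str]) -> List[str]:
    """Return the skills still to load, highest priority first."""
    installed_set = set(installed)
    pending = [(skill, prio) for skill, prio in wanted if skill not in installed_set]
    # list.sort is stable, so table order is preserved within a priority level.
    pending.sort(key=lambda row: PRIORITY_RANK[row[1]])
    return [skill for skill, _ in pending]


order = load_order(WANTED, installed=["say-narration", "whitehole-audio"])
print(f":priority-order ({' '.join(order)})")
```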
Usage
Start Coaching Session
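import asyncio
import os
from livekit import api, rtc
from livekit_omnimodal import OmnimodalCoach
# Assumption: the server URL and API credentials are supplied via environment variables
LIVEKIT_URL = os.environ["LIVEKIT_URL"]
async def main():
    # Mint a join token; the identity and room name here are placeholders
    token = (
        api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
        .with_identity("queen-coach")
        .with_grants(api.VideoGrants(room_join=True, room="room-xyz"))
        .to_jwt()
    )
    # Connect to LiveKit room
    room = rtc.Room()
    await room.connect(LIVEKIT_URL, token)
    # Initialize coach with Queen voice
    coach = OmnimodalCoach(
        room=room,
        voice="Serena (Premium)",
        stick_alpha=1.0,  # Poisson-Dirichlet concentration
        sufficiency_threshold=0.8,
    )
    # Start continuous listening + coaching
    await coach.start()
    # Coach emits s-expressions for each observation
    async for sexp in coach.events():
        print(sexp)
        # (coaching-event :modality :audio :guidance "...")
asyncio.run(main())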
Emit S-Expression
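def _quote(text: str) -> str:
    """Render text as an s-expression string literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
# CoachingEvent is assumed to expose the attributes used below
def emit_coaching_sexp(event: CoachingEvent) -> str:
    """Convert coaching event to s-expression."""
    return f"""(coaching-event
  :id {_quote(event.id)}
  :timestamp {event.timestamp}
  :modality :{event.modality}
  :weight {event.weight:.3f}
  :color {_quote(event.color)}
  :trit {event.trit:+d}
  :observation {_quote(event.observation)}
  :guidance {_quote(event.guidance)}
  :voice {_quote(event.voice)})"""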
Local Sand / Italian Woman Mystery
The "Italian woman always talking about local sand" is Emma (Enhanced) or Federica (Enhanced) from the say-narration skill. These Italian voices speaking English are used for agent announcements.
Resolution: The Queen, Serena (Premium), replaces the Italian voices for coaching output. To stop the Italian voices:
# Kill any running say processes
pkill -x say
# Check which skills trigger announcements
grep -r "say -v" ~/.claude/skills/*/SKILL.md
The phrase "local sand" may be a mishearing of "locale" + "sans" (French) in multi-locale voice announcements.
Files
| File | Purpose |
|---|---|
| livekit_omnimodal.py | Main coaching implementation |
| stick_modality.py | Modality selection via stick-breaking |
| sexp_emitter.py | S-expression output |
| queen_voice.py | Voice persona configuration |
Related Skills
- say-narration: Voice output personas
- whitehole-audio: Audio routing
- dynamic-sufficiency: ε-machine gating
- gay-mcp: Color generation
- iroh-p2p: P2P data sync
- signal-messaging: Fallback messaging