---
name: tts-livekit-plugin
description: >-
  Build self-hosted TTS APIs using HuggingFace models (Parler-TTS, F5-TTS,
  XTTS-v2) and create LiveKit voice agent plugins with streaming support.
  Use when creating production-ready text-to-speech systems that need:
  (1) self-hosted TTS with full control, (2) LiveKit voice agent integration,
  (3) streaming audio for low-latency conversations, (4) custom voice
  characteristics, (5) cost-effective alternatives to cloud TTS providers
  like ElevenLabs or Google TTS.
---
# Self-Hosted TTS API and LiveKit Plugin
Build production-ready self-hosted Text-to-Speech APIs using HuggingFace models and integrate them with LiveKit voice agents through custom plugins.
## Overview
This skill enables you to:
- Build TTS APIs using state-of-the-art HuggingFace models (Parler-TTS, F5-TTS, XTTS-v2)
- Create LiveKit plugins that connect voice agents to your self-hosted TTS
- Implement streaming for low-latency real-time synthesis
- Deploy to production with Docker, Kubernetes, or cloud platforms
**When to use this skill:**
- Creating cost-effective voice agents without cloud TTS fees
- Requiring custom voice characteristics or multilingual support
- Building privacy-focused systems with on-premise TTS
- Developing voice agents that need streaming audio synthesis
## Implementation Process

### Step 1: Choose Your TTS Model

Select the best model for your use case from the HuggingFace ecosystem.

Load model comparison: TTS Models Reference

**Quick Selection Guide:**
| Use Case | Recommended Model | Why |
|---|---|---|
| Production voice agents | Parler-TTS Mini | Fast, CPU-friendly, text-based voice control |
| High-quality synthesis | Parler-TTS Large / F5-TTS | Superior natural quality |
| Multilingual support | XTTS-v2 | 17+ languages, voice cloning |
| Cost optimization | Parler-TTS Mini on CPU | Runs efficiently without GPU |
**Example decision:**
- User needs: "Fast, conversational voice for customer support agent"
- Model: `parler-tts/parler-tts-mini-v1`
- Device: CPU (for cost) or GPU (for speed)
- Voice description: "A friendly, professional voice with moderate pace"
### Step 2: Build the TTS API Server

Create a FastAPI server that hosts the TTS model with both batch and streaming endpoints.

**Use the provided implementation:**

The skill includes a complete TTS API server at `tts-api/main.py` that supports:
- Batch synthesis: `POST /synthesize` for simple text-to-speech
- Streaming synthesis: `WebSocket /ws/synthesize` for real-time incremental synthesis
- Multiple models: Parler-TTS, F5-TTS, XTTS-v2 (configurable)
- Optimizations: model caching, async synthesis, sentence-level streaming (a compressed endpoint sketch follows this list)
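For orientation, here is a heavily compressed sketch of the batch endpoint's shape. `tts-api/main.py` is the authoritative implementation; `run_model` below is a placeholder for the real model call, not part of the bundled code.

```python
# Sketch of the /synthesize endpoint's shape; run_model is a stand-in
# for the model call that the bundled tts-api/main.py performs.
import io

import numpy as np
import soundfile as sf
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()

class TTSRequest(BaseModel):
    text: str
    format: str = "wav"

def run_model(text: str) -> tuple[np.ndarray, int]:
    """Placeholder: return (samples, sample_rate) from the loaded model."""
    return np.zeros(24000, dtype=np.int16), 24000  # 1 s of silence at 24 kHz

@app.post("/synthesize")
async def synthesize(req: TTSRequest) -> Response:
    samples, sample_rate = run_model(req.text)
    buf = io.BytesIO()
    sf.write(buf, samples, sample_rate, format="WAV")  # encode PCM as WAV
    return Response(content=buf.getvalue(), media_type="audio/wav")
```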
**Quick start:**

```bash
# Navigate to TTS API directory
cd tts-api

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env to set:
#   TTS_MODEL_TYPE=parler
#   TTS_MODEL_NAME=parler-tts/parler-tts-mini-v1
#   TTS_DEVICE=cpu

# Run the server
python main.py
```
The server will:
- Load the model on startup (may take 1-2 minutes)
- Listen on port 8001
- Provide a health check at `GET /health`
- Accept synthesis requests at `POST /synthesize` and `WS /ws/synthesize`
**Test the API:**

```bash
# Test batch synthesis
curl -X POST http://localhost:8001/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello! This is a test.", "format": "wav"}' \
  --output test.wav

# Check health
curl http://localhost:8001/health
```
For detailed implementation patterns and best practices, see the API Implementation Guide under Reference Files.
**Key implementation details:**

The provided `tts-api/main.py` includes:
- Model loading: singleton pattern with a startup event
- Sentence-level streaming: splits text at sentence boundaries for natural prosody (see the sketch after this list)
- Keepalive messages: prevent WebSocket timeouts (5-second interval)
- End-of-stream signaling: explicit completion messages
- Error handling: graceful failures with detailed error messages
- Audio formats: PCM int16 (streaming), WAV, MP3 (batch)
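The sentence-splitting idea is simple enough to show in a few lines. This is an illustrative sketch, not the exact splitter in `tts-api/main.py`: each yielded sentence is synthesized and sent as its own chunk, so playback can begin before the full text is processed.

```python
# Illustrative sentence-level splitter: break text at sentence boundaries
# so each streamed chunk carries natural prosody.
import re
from typing import Iterator

_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text: str) -> Iterator[str]:
    """Yield complete sentences so each chunk sounds natural on its own."""
    for sentence in _SENTENCE_END.split(text.strip()):
        if sentence:
            yield sentence

for s in split_sentences("Hello there! This is streamed. One chunk at a time?"):
    print(s)  # in the server, each sentence is synthesized and sent immediately
```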
### Step 3: Create the LiveKit TTS Plugin

Build a LiveKit plugin that connects your voice agents to the self-hosted TTS API.

**Use the provided plugin implementation:**

The skill includes a complete LiveKit plugin at `livekit-plugin-custom-tts/` with:
- `TTS` class: main plugin interface
- `ChunkedStream`: streaming synthesis session
- WebSocket communication: bi-directional streaming
- Examples: voice agent integration and standalone usage
**Install the plugin:**

```bash
# Navigate to plugin directory
cd livekit-plugin-custom-tts

# Install in development mode
pip install -e .

# Or install from source
pip install .
```
**Use in a voice agent:**

```python
from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import openai, deepgram, silero
from livekit.plugins import custom_tts

async def entrypoint(ctx: agents.JobContext):
    # Initialize session with custom TTS
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-2-general"),
        llm=openai.LLM(model="gpt-4o-mini"),
        # Use custom self-hosted TTS
        tts=custom_tts.TTS(
            api_url="http://localhost:8001",
            options=custom_tts.TTSOptions(
                voice_description="A friendly, conversational voice.",
                sample_rate=24000,
            ),
        ),
    )

    await ctx.connect()
    await session.start(agent=YourAgent(), room=ctx.room)
```
For detailed plugin development patterns, see the Plugin Development Guide under Reference Files.
**Key plugin features:**

The provided implementation includes:
- Streaming synthesis: iterates over audio chunks as they're generated
- Keepalive: maintains long-running WebSocket connections (see the sketch below)
- Error recovery: graceful handling of connection failures
- Resource cleanup: proper task cancellation and WebSocket closure
- LiveKit integration: follows the LiveKit TTS plugin interface exactly
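The keepalive behavior follows a common pattern: a background task pings on a fixed interval until it is cancelled. Here is a sketch of that pattern; the JSON message schema is an assumption for illustration, and the bundled plugin defines the real one.

```python
# Keepalive pattern sketch: ping every `interval` seconds until cancelled.
# The {"type": "keepalive"} payload is assumed, not the plugin's actual schema.
import asyncio
import json

async def _keepalive_loop(ws, interval: float = 5.0) -> None:
    """Send a small JSON ping periodically so idle connections aren't reaped."""
    try:
        while True:
            await asyncio.sleep(interval)
            await ws.send(json.dumps({"type": "keepalive"}))
    except asyncio.CancelledError:
        pass  # expected when the stream finishes and cancels this task
```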
### Step 4: Test the Integration

Verify that the TTS API and plugin work together correctly.

**Testing levels:**

1. **API-level testing:**
```bash
# Start the TTS API
cd tts-api
python main.py

# In another terminal, test synthesis
curl -X POST http://localhost:8001/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Testing TTS API", "format": "wav"}' \
  --output test.wav

# Play the audio
ffplay test.wav  # or open test.wav
```
2. **Plugin-level testing:**

Use the provided example script:

```bash
cd livekit-plugin-custom-tts/examples
python basic_usage.py
```
This will:
- Connect to the TTS API
- Synthesize test text
- Save audio to `output.wav`
3. **Voice agent testing:**

Use the provided voice agent example:

```bash
# Set environment variables
export LIVEKIT_URL="wss://your-livekit.cloud"
export LIVEKIT_API_KEY="your-api-key"
export LIVEKIT_API_SECRET="your-api-secret"
export OPENAI_API_KEY="your-openai-key"
export DEEPGRAM_API_KEY="your-deepgram-key"

# Run the voice agent
cd livekit-plugin-custom-tts/examples
python voice_agent.py start
```
**Verification checklist:**

- [ ] TTS API health endpoint returns OK
- [ ] Batch synthesis produces audio files
- [ ] WebSocket streaming works without disconnections
- [ ] Plugin synthesizes text successfully
- [ ] Voice agent speaks using custom TTS
- [ ] Audio quality is acceptable
- [ ] Latency is reasonable (<1 second for short sentences; see the probe sketch below)
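A quick way to check the last two items is a small latency probe against the batch endpoint. This sketch uses the `requests` package and the endpoints from the Quick Reference section; the 1-second target applies to short sentences like the one below.

```python
# Latency probe: confirm /health is OK, then time one short batch synthesis.
import time

import requests

assert requests.get("http://localhost:8001/health", timeout=5).ok

t0 = time.perf_counter()
resp = requests.post(
    "http://localhost:8001/synthesize",
    json={"text": "Hello! This is a latency probe.", "format": "wav"},
    timeout=30,
)
resp.raise_for_status()
print(f"synthesis took {time.perf_counter() - t0:.2f}s, {len(resp.content)} bytes")
```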
### Step 5: Deploy to Production

Deploy the TTS API and voice agent to production infrastructure.

**Deployment options:**

**Option 1: Docker Compose (Quick Start)**

Use the provided configuration:
```bash
# Create deployment directory
mkdir deployment
cd deployment

# Copy TTS API
cp -r ../tts-api .

# Set environment variables
export LIVEKIT_URL="wss://your-livekit.cloud"
export LIVEKIT_API_KEY="your-api-key"
# ... other vars

# Run docker-compose
docker-compose up -d
```
**Option 2: Kubernetes (Production Scale)**

Use the provided Kubernetes manifests in the deployment guide.

For comprehensive deployment instructions, see the Deployment Guide under Reference Files.
**Production deployment includes:**
- Docker containerization for both API and agent
- GPU allocation for faster synthesis
- Health checks and monitoring
- Horizontal scaling with load balancing
- Secrets management
- Persistent model caching
- Logging and metrics collection
**Production checklist:**

- [ ] GPU properly allocated (if using GPU)
- [ ] Model cache persisted (avoids re-downloading)
- [ ] Health checks implemented
- [ ] Monitoring and alerting set up
- [ ] Autoscaling configured (if using Kubernetes)
- [ ] Secrets properly managed
- [ ] Network policies applied (if using Kubernetes)
- [ ] Load testing completed
- [ ] Backup strategy in place
## Common Patterns

### Pattern 1: Multiple Voice Characteristics

Create different agents with different voices:
```python
# Customer support agent
support_agent = AgentSession(
    tts=custom_tts.TTS(
        api_url="http://localhost:8001",
        options=custom_tts.TTSOptions(
            voice_description="A professional, clear voice with measured pace.",
        ),
    ),
)

# Sales agent
sales_agent = AgentSession(
    tts=custom_tts.TTS(
        api_url="http://localhost:8001",
        options=custom_tts.TTSOptions(
            voice_description="An energetic, friendly voice with upbeat delivery.",
        ),
    ),
)
```
### Pattern 2: Multilingual Voice Agent

Use XTTS-v2 for multilingual support:

```bash
# Configure TTS API for XTTS
TTS_MODEL_TYPE=xtts
TTS_MODEL_NAME=tts_models/multilingual/multi-dataset/xtts_v2
```

```python
# In your agent
session = AgentSession(
    # ... vad/stt/llm as in Step 3
    tts=custom_tts.TTS(
        api_url="http://localhost:8001",
        options=custom_tts.TTSOptions(
            voice_description="Multilingual voice",
            sample_rate=24000,
        ),
    ),
)
```
### Pattern 3: Failover to Cloud TTS

Implement fallback to cloud TTS when the self-hosted API is unavailable. Note that constructing the plugin alone does not open a connection, so probe the health endpoint explicitly before choosing:

```python
import logging

import aiohttp

logger = logging.getLogger(__name__)

async def create_tts():
    """Create TTS with fallback to a cloud provider."""
    try:
        # Probe the self-hosted API first; the TTS() constructor by itself
        # would not detect an unreachable server.
        async with aiohttp.ClientSession() as http:
            async with http.get(
                "http://localhost:8001/health",
                timeout=aiohttp.ClientTimeout(total=5),
            ) as resp:
                resp.raise_for_status()
        return custom_tts.TTS(api_url="http://localhost:8001")
    except Exception as e:
        logger.warning(f"Self-hosted TTS unavailable: {e}")
        # Fallback to cloud provider
        return openai.TTS(voice="alloy")

session = AgentSession(
    tts=await create_tts(),
    # ... other config
)
```
### Pattern 4: Response Caching

Cache common phrases to reduce synthesis latency. Note that `functools.lru_cache` does not work correctly on async functions (it caches the single-use coroutine object rather than the result), so cache the audio bytes explicitly:

```python
_audio_cache: dict[str, bytes] = {}

async def synthesize_cached(text: str) -> bytes:
    """Synthesize with caching for common phrases."""
    # First time: synthesize and cache
    # Subsequent times: return cached audio
    if text not in _audio_cache:
        _audio_cache[text] = await synthesize_async(text)
    return _audio_cache[text]

# Common greetings, confirmations, etc. are cached
audio = await synthesize_cached("Hello! How can I help you today?")
```
## Troubleshooting

### Issue: TTS API is slow

**Symptoms:** High latency (>2 seconds per sentence)

**Solutions:**
1. Use GPU acceleration:
   - Set `TTS_DEVICE=cuda` in the environment
   - Ensure a GPU is available (`nvidia-smi`)
   - Allocate a GPU in Docker/Kubernetes

2. Use a faster model:
   - Switch from `parler-tts-large-v1` to `parler-tts-mini-v1`
   - Consider CPU-optimized models

3. Optimize inference:

   ```python
   # Add to TTS API startup
   import torch
   model = torch.compile(model, mode="reduce-overhead")
   ```

4. Implement sentence-level streaming:
   - Already included in the provided implementation
   - Reduces perceived latency
### Issue: WebSocket connection drops

**Symptoms:** Connection lost during long syntheses

**Solutions:**

1. Verify keepalive is enabled:
   - Check that the plugin implementation has `_keepalive_loop()`
   - Ensure a 5-second interval

2. Increase timeouts:

   ```python
   # In plugin
   async with asyncio.timeout(60):  # 60-second timeout
       message = await self._ws.recv()
   ```

3. Check network policies:
   - Verify the firewall allows WebSocket connections
   - Test a direct connection without proxies
### Issue: Audio quality is poor

**Symptoms:** Robotic voice, artifacts, glitches

**Solutions:**

1. Use a larger model:
   - Switch to `parler-tts-large-v1` or F5-TTS

2. Adjust the voice description:

   ```python
   # More detailed description = better quality
   voice_description = "A warm, natural voice speaking clearly with good articulation and moderate pace"
   ```

3. Check the audio format:
   - Ensure the sample rate matches (24 kHz)
   - Verify PCM int16 encoding
### Issue: High memory usage

**Symptoms:** Server runs out of memory, crashes

**Solutions:**

1. Use a smaller model: `parler-tts-mini-v1` has ~880M parameters vs. ~2.4B for the large model

2. Limit concurrent requests:

   ```python
   MAX_CONCURRENT = 5
   semaphore = asyncio.Semaphore(MAX_CONCURRENT)

   @app.post("/synthesize")
   async def synthesize(request: TTSRequest):
       async with semaphore:
           return await synthesize_async(request)
   ```

3. Clear audio buffers:
   - Ensure buffers are released after sending
   - Already handled in the provided implementation
### Issue: Model download fails

**Symptoms:** Error loading model from HuggingFace

**Solutions:**

1. Check internet connectivity:

   ```bash
   curl -I https://huggingface.co
   ```

2. Use an authentication token (if needed):

   ```bash
   export HF_TOKEN="your-huggingface-token"
   ```

3. Pre-download models:

   ```python
   from huggingface_hub import snapshot_download
   snapshot_download("parler-tts/parler-tts-mini-v1")
   ```
## Best Practices

### API Development

✅ DO:
- Load the model once at startup, not per request (see the lifespan sketch below)
- Implement health checks for monitoring
- Use sentence-level streaming for natural prosody
- Send keepalive messages every 5 seconds
- Explicitly signal end-of-stream
- Handle errors gracefully with detailed messages
- Use async synthesis to avoid blocking

❌ DON'T:
- Load the model for each request (very slow)
- Block the event loop with synchronous synthesis
- Split text mid-word (it breaks natural speech)
- Forget keepalive (it causes timeouts)
- Close connections without signaling
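As an illustration of the load-once rule, here is a sketch using FastAPI's lifespan hook, the modern equivalent of the startup-event pattern the bundled server uses; the `MODELS` dict name is an assumption for illustration.

```python
# Load-once pattern sketch: the model is loaded one time at startup and
# shared by all request handlers instead of being reloaded per request.
from contextlib import asynccontextmanager

from fastapi import FastAPI

MODELS = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    from parler_tts import ParlerTTSForConditionalGeneration
    MODELS["tts"] = ParlerTTSForConditionalGeneration.from_pretrained(
        "parler-tts/parler-tts-mini-v1"
    )
    yield  # requests are served while the model stays resident
    MODELS.clear()  # release the model on shutdown

app = FastAPI(lifespan=lifespan)
```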
### Plugin Development

✅ DO:
- Follow the LiveKit TTS plugin interface exactly
- Implement proper cleanup in `aclose()` (see the sketch below)
- Use async iterators for streaming
- Handle WebSocket disconnections gracefully
- Log errors for debugging
- Test with various text lengths

❌ DON'T:
- Modify LiveKit interfaces
- Leave WebSocket connections open
- Use synchronous code in an async context
- Ignore errors or exceptions
- Skip testing edge cases
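Cleanup ordering matters: cancel background tasks before closing the socket, so the keepalive loop never writes to a closed connection. An illustrative shape (the class and attribute names here are assumptions, not LiveKit API):

```python
# Illustrative aclose() cleanup: cancel the keepalive task first, then
# close the WebSocket. Names (_ws, _keepalive_task) are hypothetical.
import asyncio

class StreamSession:
    def __init__(self, ws, keepalive_task: asyncio.Task):
        self._ws = ws
        self._keepalive_task = keepalive_task

    async def aclose(self) -> None:
        self._keepalive_task.cancel()  # stop pings before closing the socket
        try:
            await self._keepalive_task
        except asyncio.CancelledError:
            pass  # expected: the task was cancelled deliberately
        await self._ws.close()  # then shut the WebSocket down cleanly
```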
### Deployment
✅ DO:
- Persist model cache to avoid re-downloading
- Use GPU for production (faster synthesis)
- Implement monitoring and alerting
- Use horizontal scaling for high traffic
- Secure API with authentication
- Test failover scenarios
- Document configuration
❌ DON'T:
- Re-download models on each restart
- Run production on CPU only (too slow)
- Deploy without monitoring
- Use single instance for production
- Expose API publicly without auth
- Skip disaster recovery planning
## Reference Files

### Core Documentation

Load these resources as needed for detailed information:

- **TTS Models Comparison**: Comprehensive comparison of HuggingFace TTS models (Parler-TTS, F5-TTS, XTTS-v2) with performance benchmarks, code examples, and a selection guide.
- **API Implementation Guide**: Best practices for building TTS APIs, including streaming patterns, model management, audio format handling, security, and optimization strategies.
- **Plugin Development Guide**: Detailed guide to implementing LiveKit TTS plugins, including WebSocket communication, async patterns, error handling, and testing.
- **Deployment Guide**: Production deployment with Docker Compose, Kubernetes, scaling strategies, monitoring, and security best practices.
### Code Resources

**TTS API Server:**
- `tts-api/main.py`: Complete FastAPI server with batch and streaming endpoints
- `tts-api/requirements.txt`: Python dependencies
- `tts-api/Dockerfile`: Container configuration
- `tts-api/.env.example`: Environment variables template

**LiveKit Plugin:**
- `livekit-plugin-custom-tts/`: Complete plugin package
- `livekit-plugin-custom-tts/livekit/plugins/custom_tts/tts.py`: Plugin implementation
- `livekit-plugin-custom-tts/examples/voice_agent.py`: Voice agent integration example
- `livekit-plugin-custom-tts/examples/basic_usage.py`: Standalone usage example
## Quick Reference

### TTS API Endpoints

```
GET  /              - Health check and info
GET  /health        - Health status
POST /synthesize    - Batch synthesis
WS   /ws/synthesize - Streaming synthesis
```
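For the streaming endpoint, a minimal client sketch using the `websockets` package is shown below. The exact message schema is defined by `tts-api/main.py`; the request payload and the `{"type": "end"}` signal here are assumptions consistent with the end-of-stream signaling described in Step 2.

```python
# Minimal streaming client sketch; the message schema is assumed, not
# taken from tts-api/main.py.
import asyncio
import json

import websockets

async def stream_tts(text: str) -> bytes:
    """Collect streamed PCM chunks until the server signals end-of-stream."""
    audio = bytearray()
    async with websockets.connect("ws://localhost:8001/ws/synthesize") as ws:
        await ws.send(json.dumps({"text": text}))
        async for message in ws:
            if isinstance(message, bytes):
                audio.extend(message)  # PCM int16 audio chunk
            elif json.loads(message).get("type") == "end":
                break  # explicit end-of-stream message
    return bytes(audio)

pcm = asyncio.run(stream_tts("Hello from the streaming endpoint!"))
print(f"received {len(pcm)} bytes of PCM audio")
```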
### Environment Variables

```bash
# TTS API
TTS_MODEL_TYPE=parler   # parler, f5, xtts
TTS_MODEL_NAME=parler-tts/parler-tts-mini-v1
TTS_DEVICE=cuda         # cuda or cpu

# LiveKit Agent
LIVEKIT_URL=wss://your-livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
TTS_API_URL=http://localhost:8001
```
### Model Selection
| Model | Size | Speed | Quality | Multilingual |
|---|---|---|---|---|
| parler-tts-mini-v1 | 880M | ★★★★★ | ★★★★☆ | Limited |
| parler-tts-large-v1 | 2.4B | ★★★☆☆ | ★★★★★ | Limited |
| F5-TTS | Varies | ★★★☆☆ | ★★★★★ | Good |
| XTTS-v2 | 1.2B | ★★★★☆ | ★★★★★ | Excellent |
### Voice Description Examples

- "A friendly, conversational voice with moderate pace."
- "A professional, clear voice speaking slowly and deliberately."
- "An energetic, upbeat voice with fast delivery and enthusiasm."
- "A calm, soothing voice with gentle, measured speech."
## Additional Resources
- HuggingFace TTS: https://huggingface.co/tasks/text-to-speech
- LiveKit Agents Docs: https://docs.livekit.io/agents/
- Parler-TTS Repository: https://github.com/huggingface/parler-tts
- F5-TTS Repository: https://github.com/SWivid/F5-TTS
- XTTS-v2 Model: https://huggingface.co/coqui/XTTS-v2
- Modal Blog on TTS: https://modal.com/blog/open-source-tts