Build self-hosted speech-to-text APIs using Hugging Face models (Whisper, Wav2Vec2) and create LiveKit voice agent plugins. Use when building STT infrastructure, creating custom LiveKit plugins, deploying self-hosted transcription services, or integrating Whisper/HF models with LiveKit agents. Includes FastAPI server templates, LiveKit plugin implementation, model selection guides, and production deployment patterns.
Installation

After installing, this skill will be available to your AI coding assistant.

Verify installation:

```shell
npx agent-skills-cli list
```

Skill Instructions

name: livekit-stt-selfhosted
LiveKit Self-Hosted STT Plugin
Build self-hosted speech-to-text APIs and LiveKit voice agent plugins using Hugging Face models.
Overview
This skill provides templates and guidance for:
- Building a self-hosted STT API server using FastAPI + Whisper/HF models
- Creating a LiveKit plugin that connects to your self-hosted API
- Deploying and scaling in production
Quick Start
Option 1: Build Both (API + Plugin)
When the user wants the complete setup:

1. Create the API server:

```shell
python scripts/setup_api_server.py my-stt-server --model openai/whisper-medium
cd my-stt-server
pip install -r requirements.txt
python main.py
```

2. Create the plugin:

```shell
python scripts/setup_plugin.py custom-stt
cd livekit-plugins-custom-stt
pip install -e .
```

3. Use in a LiveKit agent:

```python
from livekit.plugins import custom_stt

stt = custom_stt.STT(api_url="ws://localhost:8000/ws/transcribe")
```
Option 2: API Server Only
When the user only needs the API server:

- Use `scripts/setup_api_server.py` with the desired model
- See `references/api_server_guide.md` for implementation details
- Template in `assets/api-server/`
Option 3: Plugin Only
When the user has an existing API and needs a LiveKit plugin:

- Use `scripts/setup_plugin.py` with the plugin name
- See `references/plugin_implementation.md` for details
- Template in `assets/plugin-template/`
Model Selection
Help user choose the right model:
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Best accuracy | openai/whisper-large-v3 | SOTA quality, requires GPU |
| Production balance | openai/whisper-medium | Good quality, reasonable speed |
| Real-time/fast | openai/whisper-small | Fast, acceptable quality |
| CPU-only | openai/whisper-tiny | Can run without GPU |
| English-only | facebook/wav2vec2-large-960h | Optimized for English |
For detailed comparison and optimization tips, see references/models_comparison.md.
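The table above can be folded into a small lookup helper. This is an illustrative sketch only; the `pick_model` function and its use-case keys are not part of the skill's templates:

```python
def pick_model(use_case: str) -> str:
    """Map a use case from the selection table to a Hugging Face model ID."""
    models = {
        "best_accuracy": "openai/whisper-large-v3",      # SOTA quality, requires GPU
        "production": "openai/whisper-medium",           # good quality, reasonable speed
        "realtime": "openai/whisper-small",              # fast, acceptable quality
        "cpu_only": "openai/whisper-tiny",               # can run without a GPU
        "english_only": "facebook/wav2vec2-large-960h",  # optimized for English
    }
    try:
        return models[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
```

Passing the result to `scripts/setup_api_server.py --model` (or the server's `MODEL_ID` setting) keeps the selection logic in one place.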
Implementation Workflow
Building the API Server
1. Use the template: Start with `assets/api-server/main.py`
2. Key components:
   - FastAPI app with WebSocket endpoint
   - Model loading at startup (kept in memory)
   - Audio buffer management
   - WebSocket protocol for streaming
3. Customization points:
   - Model selection (change `MODEL_ID` in `.env`)
   - Audio processing parameters
   - Batch size and optimization
   - Error handling
For complete implementation guide, see references/api_server_guide.md.
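The buffer-management component above can be sketched as a small chunker that accumulates raw PCM and releases fixed-duration pieces (1-second chunks at 16 kHz work well for streaming). The class and method names here are illustrative, not taken from the template:

```python
class AudioChunker:
    """Accumulate raw 16-bit mono PCM and release fixed-duration chunks.

    One second at 16 kHz = 16000 samples * 2 bytes = 32000 bytes.
    """

    def __init__(self, sample_rate: int = 16000, chunk_seconds: float = 1.0):
        self.chunk_bytes = int(sample_rate * chunk_seconds) * 2  # 16-bit mono
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Append incoming audio; return every complete chunk now available."""
        self._buf.extend(data)
        chunks = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[: self.chunk_bytes]))
            del self._buf[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return any trailing partial chunk (e.g. when an 'end' message arrives)."""
        rest, self._buf = bytes(self._buf), bytearray()
        return rest
```

Each chunk returned by `feed()` can be handed to the model pipeline while the remainder stays buffered for the next WebSocket frame.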
Building the LiveKit Plugin
1. Use the template: Start with `assets/plugin-template/`
2. Required implementations:
   - `_recognize_impl()` - Non-streaming recognition
   - `stream()` - Return a `SpeechStream` instance
   - `SpeechStream` class - Handle streaming
3. Key considerations:
   - Audio format conversion (16 kHz, mono, 16-bit PCM)
   - WebSocket connection management
   - Event emission (interim/final transcripts)
   - Error handling and cleanup
For complete implementation guide, see references/plugin_implementation.md.
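The audio format conversion step can be illustrated with two small stdlib-only helpers; the function names are hypothetical, and the template may handle conversion differently:

```python
import array

def floats_to_pcm16(samples: list) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit PCM (native byte order)."""
    ints = array.array("h")
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clip out-of-range samples
        ints.append(int(s * 32767))
    return ints.tobytes()

def stereo_to_mono_pcm16(data: bytes) -> bytes:
    """Downmix interleaved 16-bit stereo PCM to mono by averaging channels."""
    stereo = array.array("h", data)
    mono = array.array("h", (
        (stereo[i] + stereo[i + 1]) // 2 for i in range(0, len(stereo), 2)
    ))
    return mono.tobytes()
```

Resampling to 16 kHz is not shown; in practice a library such as `soxr` or `scipy.signal.resample_poly` would handle that stage.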
Deployment
Development
```shell
# API Server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Test against the WebSocket endpoint at `ws://localhost:8000/ws/transcribe`.
Production
Docker (Recommended):

```shell
docker-compose up
```
Kubernetes: Use manifests in deployment guide
Cloud Platforms: AWS ECS, GCP Cloud Run, Azure Container Instances
For complete deployment guide including scaling, monitoring, and security, see references/deployment.md.
WebSocket Protocol
Client → Server

- Audio: binary frames (16-bit PCM, 16 kHz)
- Config: `{"type": "config", "language": "en"}`
- End: `{"type": "end"}`

Server → Client

- Interim: `{"type": "interim", "text": "..."}`
- Final: `{"type": "final", "text": "...", "language": "en"}`
- Error: `{"type": "error", "message": "..."}`
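A client can build and validate these protocol messages with a few stdlib helpers. This is a sketch; the function names are illustrative and not part of the plugin template:

```python
import json

def config_msg(language: str = "en") -> str:
    """Build the initial configuration message sent over the WebSocket."""
    return json.dumps({"type": "config", "language": language})

def end_msg() -> str:
    """Build the end-of-stream message."""
    return json.dumps({"type": "end"})

def parse_server_msg(raw: str) -> dict:
    """Parse a server message and check it against the protocol above."""
    msg = json.loads(raw)
    if msg.get("type") not in {"interim", "final", "error"}:
        raise ValueError(f"unexpected message type: {msg.get('type')!r}")
    return msg
```

Audio itself is sent as raw binary WebSocket frames, so only the control and result messages go through JSON.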
Common Tasks
Change Model
Edit `.env`:

```shell
MODEL_ID=openai/whisper-small  # Faster model
```
Add Language Support
In plugin usage:

```python
stt = custom_stt.STT(language="es")         # Spanish
stt = custom_stt.STT(detect_language=True)  # Auto-detect
```
Enable GPU
In the API server:

```shell
DEVICE=cuda:0  # Use GPU
```
Scale Horizontally
Deploy multiple API server instances behind a load balancer. See references/deployment.md for Nginx configuration.
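For Nginx in particular, WebSocket traffic needs the HTTP/1.1 upgrade headers to pass through the proxy. A minimal sketch with placeholder backend addresses (the full configuration lives in `references/deployment.md`):

```nginx
upstream stt_backend {
    # Multiple self-hosted STT instances (placeholder addresses)
    server 10.0.0.11:8000;
    server 10.0.0.12:8000;
}

server {
    listen 80;

    location /ws/transcribe {
        proxy_pass http://stt_backend;
        # Required for the WebSocket upgrade handshake
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # Long-lived streams: raise the idle read timeout
        proxy_read_timeout 300s;
    }
}
```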
Troubleshooting
Out of Memory
- Use a smaller model (`whisper-small` or `whisper-tiny`)
- Reduce `batch_size` in the pipeline
- Enable `low_cpu_mem_usage=True`
Slow Transcription
- Ensure the GPU is enabled (`DEVICE=cuda:0`)
- Use FP16 precision (automatic on GPU)
- Increase `batch_size`
- Use a smaller model
Connection Issues
- Verify WebSocket support in load balancer
- Check firewall rules
- Increase timeout settings
Scripts
- `scripts/setup_api_server.py` - Generate API server from template
- `scripts/setup_plugin.py` - Generate LiveKit plugin from template
References
Load these as needed for detailed information:
- `references/api_server_guide.md` - Complete API implementation guide
- `references/plugin_implementation.md` - LiveKit plugin development
- `references/models_comparison.md` - Model selection and optimization
- `references/deployment.md` - Production deployment best practices
Assets
Ready-to-use templates:
- `assets/api-server/` - Complete FastAPI server with Whisper
- `assets/plugin-template/` - LiveKit STT plugin structure
Best Practices
- Keep models in memory - Load once at startup, not per request
- Use appropriate model size - Balance quality vs. speed for your use case
- Process audio in chunks - 1-second chunks work well for streaming
- Implement proper cleanup - Close WebSocket connections gracefully
- Monitor metrics - Track latency, throughput, GPU utilization
- Use Docker - Ensures consistent deployments
- Enable authentication - Secure production APIs
- Scale horizontally - Use load balancer for high availability
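As one way to act on the monitoring point above, a rolling-window latency tracker can be kept per server instance. This is an illustrative sketch, not part of the templates:

```python
from collections import deque

class LatencyTracker:
    """Keep a rolling window of request latencies for basic monitoring."""

    def __init__(self, window: int = 100):
        self._samples = deque(maxlen=window)

    def observe(self, seconds: float) -> None:
        """Record one request's latency in seconds."""
        self._samples.append(seconds)

    def p95(self) -> float:
        """95th-percentile latency over the window (0.0 if no samples yet)."""
        if not self._samples:
            return 0.0
        ordered = sorted(self._samples)
        idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
        return ordered[idx]
```

Calling `observe()` around each transcription request and exporting `p95()` to a metrics endpoint gives an early signal when the server needs a smaller model or another replica.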