Build self-hosted speech-to-text APIs using Hugging Face models (Whisper, Wav2Vec2) and create LiveKit voice agent plugins. Use when building STT infrastructure, creating custom LiveKit plugins, deploying self-hosted transcription services, or integrating Whisper/HF models with LiveKit agents. Includes FastAPI server templates, LiveKit plugin implementation, model selection guides, and production deployment patterns.
Installation

After installing, this skill will be available to your AI coding assistant.

Verify installation:

```shell
npx agent-skills-cli list
```

Skill Instructions

name: livekit-stt-selfhosted
LiveKit Self-Hosted STT Plugin
Build self-hosted speech-to-text APIs and LiveKit voice agent plugins using Hugging Face models.
Overview
This skill provides templates and guidance for:
- Building a self-hosted STT API server using FastAPI + Whisper/HF models
- Creating a LiveKit plugin that connects to your self-hosted API
- Deploying and scaling in production
Quick Start
Option 1: Build Both (API + Plugin)
When the user wants the complete setup:

1. Create the API server:

```shell
python scripts/setup_api_server.py my-stt-server --model openai/whisper-medium
cd my-stt-server
pip install -r requirements.txt
python main.py
```

2. Create the plugin:

```shell
python scripts/setup_plugin.py custom-stt
cd livekit-plugins-custom-stt
pip install -e .
```

3. Use in a LiveKit agent:

```python
from livekit.plugins import custom_stt

stt = custom_stt.STT(api_url="ws://localhost:8000/ws/transcribe")
```
Option 2: API Server Only
When the user only needs the API server:

- Use `scripts/setup_api_server.py` with the desired model
- See `references/api_server_guide.md` for implementation details
- Template in `assets/api-server/`
Option 3: Plugin Only
When the user has an existing API and needs a LiveKit plugin:

- Use `scripts/setup_plugin.py` with the plugin name
- See `references/plugin_implementation.md` for details
- Template in `assets/plugin-template/`
Model Selection
Help user choose the right model:
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Best accuracy | openai/whisper-large-v3 | SOTA quality, requires GPU |
| Production balance | openai/whisper-medium | Good quality, reasonable speed |
| Real-time/fast | openai/whisper-small | Fast, acceptable quality |
| CPU-only | openai/whisper-tiny | Can run without GPU |
| English-only | facebook/wav2vec2-large-960h | Optimized for English |
For detailed comparison and optimization tips, see references/models_comparison.md.
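The table above can be folded into a small lookup helper. This is an illustrative sketch only; the `pick_model` function and its use-case keys are not part of the skill's templates:

```python
def pick_model(use_case: str) -> str:
    """Map a use case from the selection table to a Hugging Face model ID."""
    models = {
        "best_accuracy": "openai/whisper-large-v3",      # SOTA quality, requires GPU
        "production": "openai/whisper-medium",           # good quality, reasonable speed
        "realtime": "openai/whisper-small",              # fast, acceptable quality
        "cpu_only": "openai/whisper-tiny",               # can run without a GPU
        "english_only": "facebook/wav2vec2-large-960h",  # optimized for English
    }
    try:
        return models[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
```

Passing the result to `scripts/setup_api_server.py --model` (or the server's `MODEL_ID` setting) keeps the selection logic in one place.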
Implementation Workflow
Building the API Server
1. Use the template: Start with `assets/api-server/main.py`
2. Key components:
   - FastAPI app with WebSocket endpoint
   - Model loading at startup (kept in memory)
   - Audio buffer management
   - WebSocket protocol for streaming
3. Customization points:
   - Model selection (change `MODEL_ID` in `.env`)
   - Audio processing parameters
   - Batch size and optimization
   - Error handling
For complete implementation guide, see references/api_server_guide.md.
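The buffer-management component above can be sketched as a small chunker that accumulates raw PCM and releases fixed-duration pieces (1-second chunks at 16 kHz work well for streaming). The class and method names here are illustrative, not taken from the template:

```python
class AudioChunker:
    """Accumulate raw 16-bit mono PCM and release fixed-duration chunks.

    One second at 16 kHz = 16000 samples * 2 bytes = 32000 bytes.
    """

    def __init__(self, sample_rate: int = 16000, chunk_seconds: float = 1.0):
        self.chunk_bytes = int(sample_rate * chunk_seconds) * 2  # 16-bit mono
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Append incoming audio; return every complete chunk now available."""
        self._buf.extend(data)
        chunks = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[: self.chunk_bytes]))
            del self._buf[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return any trailing partial chunk (e.g. when an 'end' message arrives)."""
        rest, self._buf = bytes(self._buf), bytearray()
        return rest
```

Each chunk returned by `feed()` can be handed to the model pipeline while the remainder stays buffered for the next WebSocket frame.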
Building the LiveKit Plugin
1. Use the template: Start with `assets/plugin-template/`
2. Required implementations:
   - `_recognize_impl()` - Non-streaming recognition
   - `stream()` - Return a `SpeechStream` instance
   - `SpeechStream` class - Handle streaming
3. Key considerations:
   - Audio format conversion (16 kHz, mono, 16-bit PCM)
   - WebSocket connection management
   - Event emission (interim/final transcripts)
   - Error handling and cleanup
For complete implementation guide, see references/plugin_implementation.md.
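The audio format conversion step can be illustrated with two small stdlib-only helpers; the function names are hypothetical, and the template may handle conversion differently:

```python
import array

def floats_to_pcm16(samples: list) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit PCM (native byte order)."""
    ints = array.array("h")
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clip out-of-range samples
        ints.append(int(s * 32767))
    return ints.tobytes()

def stereo_to_mono_pcm16(data: bytes) -> bytes:
    """Downmix interleaved 16-bit stereo PCM to mono by averaging channels."""
    stereo = array.array("h", data)
    mono = array.array("h", (
        (stereo[i] + stereo[i + 1]) // 2 for i in range(0, len(stereo), 2)
    ))
    return mono.tobytes()
```

Resampling to 16 kHz is not shown; in practice a library such as `soxr` or `scipy.signal.resample_poly` would handle that stage.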
Deployment
Development
```shell
# API Server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Test against the WebSocket endpoint at `ws://localhost:8000/ws/transcribe`.
Production
Docker (Recommended):

```shell
docker-compose up
```
Kubernetes: Use manifests in deployment guide
Cloud Platforms: AWS ECS, GCP Cloud Run, Azure Container Instances
For complete deployment guide including scaling, monitoring, and security, see references/deployment.md.
WebSocket Protocol
Client → Server

- Audio: binary frames (16-bit PCM, 16 kHz)
- Config: `{"type": "config", "language": "en"}`
- End: `{"type": "end"}`

Server → Client

- Interim: `{"type": "interim", "text": "..."}`
- Final: `{"type": "final", "text": "...", "language": "en"}`
- Error: `{"type": "error", "message": "..."}`
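A client can build and validate these protocol messages with a few stdlib helpers. This is a sketch; the function names are illustrative and not part of the plugin template:

```python
import json

def config_msg(language: str = "en") -> str:
    """Build the initial configuration message sent over the WebSocket."""
    return json.dumps({"type": "config", "language": language})

def end_msg() -> str:
    """Build the end-of-stream message."""
    return json.dumps({"type": "end"})

def parse_server_msg(raw: str) -> dict:
    """Parse a server message and check it against the protocol above."""
    msg = json.loads(raw)
    if msg.get("type") not in {"interim", "final", "error"}:
        raise ValueError(f"unexpected message type: {msg.get('type')!r}")
    return msg
```

Audio itself is sent as raw binary WebSocket frames, so only the control and result messages go through JSON.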
Common Tasks
Change Model
Edit `.env`:

```shell
MODEL_ID=openai/whisper-small  # Faster model
```
Add Language Support
In plugin usage:

```python
stt = custom_stt.STT(language="es")         # Spanish
stt = custom_stt.STT(detect_language=True)  # Auto-detect
```
Enable GPU
In the API server:

```shell
DEVICE=cuda:0  # Use GPU
```
Scale Horizontally
Deploy multiple API server instances behind a load balancer. See references/deployment.md for Nginx configuration.
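For Nginx in particular, WebSocket traffic needs the HTTP/1.1 upgrade headers to pass through the proxy. A minimal sketch with placeholder backend addresses (the full configuration lives in `references/deployment.md`):

```nginx
upstream stt_backend {
    # Multiple self-hosted STT instances (placeholder addresses)
    server 10.0.0.11:8000;
    server 10.0.0.12:8000;
}

server {
    listen 80;

    location /ws/transcribe {
        proxy_pass http://stt_backend;
        # Required for the WebSocket upgrade handshake
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # Long-lived streams: raise the idle read timeout
        proxy_read_timeout 300s;
    }
}
```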
Troubleshooting
Out of Memory
- Use a smaller model (`whisper-small` or `whisper-tiny`)
- Reduce `batch_size` in the pipeline
- Enable `low_cpu_mem_usage=True`
Slow Transcription
- Ensure the GPU is enabled (`DEVICE=cuda:0`)
- Use FP16 precision (automatic on GPU)
- Increase `batch_size`
- Use a smaller model
Connection Issues
- Verify WebSocket support in load balancer
- Check firewall rules
- Increase timeout settings
Scripts
- `scripts/setup_api_server.py` - Generate API server from template
- `scripts/setup_plugin.py` - Generate LiveKit plugin from template
References
Load these as needed for detailed information:
- `references/api_server_guide.md` - Complete API implementation guide
- `references/plugin_implementation.md` - LiveKit plugin development
- `references/models_comparison.md` - Model selection and optimization
- `references/deployment.md` - Production deployment best practices
Assets
Ready-to-use templates:
- `assets/api-server/` - Complete FastAPI server with Whisper
- `assets/plugin-template/` - LiveKit STT plugin structure
Best Practices
- Keep models in memory - Load once at startup, not per request
- Use appropriate model size - Balance quality vs. speed for your use case
- Process audio in chunks - 1-second chunks work well for streaming
- Implement proper cleanup - Close WebSocket connections gracefully
- Monitor metrics - Track latency, throughput, GPU utilization
- Use Docker - Ensures consistent deployments
- Enable authentication - Secure production APIs
- Scale horizontally - Use load balancer for high availability
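As one way to act on the monitoring point above, a rolling-window latency tracker can be kept per server instance. This is an illustrative sketch, not part of the templates:

```python
from collections import deque

class LatencyTracker:
    """Keep a rolling window of request latencies for basic monitoring."""

    def __init__(self, window: int = 100):
        self._samples = deque(maxlen=window)

    def observe(self, seconds: float) -> None:
        """Record one request's latency in seconds."""
        self._samples.append(seconds)

    def p95(self) -> float:
        """95th-percentile latency over the window (0.0 if no samples yet)."""
        if not self._samples:
            return 0.0
        ordered = sorted(self._samples)
        idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
        return ordered[idx]
```

Calling `observe()` around each transcription request and exporting `p95()` to a metrics endpoint gives an early signal when the server needs a smaller model or another replica.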