livekit-stt-selfhosted

@Okeysir198/livekit-stt-selfhosted · 3 forks · Updated 4/12/2026

Build self-hosted speech-to-text APIs using Hugging Face models (Whisper, Wav2Vec2) and create LiveKit voice agent plugins. Use when building STT infrastructure, creating custom LiveKit plugins, deploying self-hosted transcription services, or integrating Whisper/HF models with LiveKit agents. Includes FastAPI server templates, LiveKit plugin implementation, model selection guides, and production deployment patterns.

Installation

$ npx agent-skills-cli install @Okeysir198/livekit-stt-selfhosted

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: .claude/skills/livekit-stt-selfhosted/SKILL.md
Branch: main
Scoped Name: @Okeysir198/livekit-stt-selfhosted

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: livekit-stt-selfhosted
description: Build self-hosted speech-to-text APIs using Hugging Face models (Whisper, Wav2Vec2) and create LiveKit voice agent plugins. Use when building STT infrastructure, creating custom LiveKit plugins, deploying self-hosted transcription services, or integrating Whisper/HF models with LiveKit agents. Includes FastAPI server templates, LiveKit plugin implementation, model selection guides, and production deployment patterns.

LiveKit Self-Hosted STT Plugin

Build self-hosted speech-to-text APIs and LiveKit voice agent plugins using Hugging Face models.

Overview

This skill provides templates and guidance for:

  1. Building a self-hosted STT API server using FastAPI + Whisper/HF models
  2. Creating a LiveKit plugin that connects to your self-hosted API
  3. Deploying and scaling in production

Quick Start

Option 1: Build Both (API + Plugin)

When the user wants the complete setup:

  1. Create the API server:

     python scripts/setup_api_server.py my-stt-server --model openai/whisper-medium
     cd my-stt-server
     pip install -r requirements.txt
     python main.py

  2. Create the plugin:

     python scripts/setup_plugin.py custom-stt
     cd livekit-plugins-custom-stt
     pip install -e .

  3. Use it in a LiveKit agent:

     from livekit.plugins import custom_stt

     stt = custom_stt.STT(api_url="ws://localhost:8000/ws/transcribe")

Option 2: API Server Only

When the user only needs the API server:

  • Use scripts/setup_api_server.py with desired model
  • See references/api_server_guide.md for implementation details
  • Template in assets/api-server/

Option 3: Plugin Only

When the user has an existing API and needs a LiveKit plugin:

  • Use scripts/setup_plugin.py with plugin name
  • See references/plugin_implementation.md for details
  • Template in assets/plugin-template/

Model Selection

Help the user choose the right model:

Use Case            Recommended Model             Rationale
Best accuracy       openai/whisper-large-v3       SOTA quality, requires GPU
Production balance  openai/whisper-medium         Good quality, reasonable speed
Real-time / fast    openai/whisper-small          Fast, acceptable quality
CPU-only            openai/whisper-tiny           Can run without GPU
English-only        facebook/wav2vec2-large-960h  Optimized for English

For detailed comparison and optimization tips, see references/models_comparison.md.

Implementation Workflow

Building the API Server

  1. Use the template: Start with assets/api-server/main.py

  2. Key components:

    • FastAPI app with WebSocket endpoint
    • Model loading at startup (kept in memory)
    • Audio buffer management
    • WebSocket protocol for streaming
  3. Customization points:

    • Model selection (change MODEL_ID in .env)
    • Audio processing parameters
    • Batch size and optimization
    • Error handling

For complete implementation guide, see references/api_server_guide.md.

Building the LiveKit Plugin

  1. Use the template: Start with assets/plugin-template/

  2. Required implementations:

    • _recognize_impl() - Non-streaming recognition
    • stream() - Return SpeechStream instance
    • SpeechStream class - Handle streaming
  3. Key considerations:

    • Audio format conversion (16kHz, mono, 16-bit PCM)
    • WebSocket connection management
    • Event emission (interim/final transcripts)
    • Error handling and cleanup

For complete implementation guide, see references/plugin_implementation.md.
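The audio-format conversion in point 3 can be sketched with the standard library alone. This is a minimal helper, not the plugin's actual code: it assumes incoming samples are floats in [-1.0, 1.0], downmixes interleaved channels by averaging, and leaves resampling to 16 kHz to an upstream step.

```python
import array


def to_pcm16_mono(samples: list[float], channels: int = 1) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit mono PCM bytes.

    Interleaved multi-channel input is downmixed by averaging each frame.
    Output uses native byte order (little-endian on most platforms).
    """
    if channels > 1:
        # Average each frame's channels into one mono sample.
        samples = [
            sum(samples[i:i + channels]) / channels
            for i in range(0, len(samples), channels)
        ]
    out = array.array("h")  # signed 16-bit integers
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range values
        out.append(int(s * 32767))
    return out.tobytes()
```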

Deployment

Development

# API Server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Test WebSocket
ws://localhost:8000/ws/transcribe

Production

Docker (Recommended):

docker-compose up

Kubernetes: Use manifests in deployment guide

Cloud Platforms: AWS ECS, GCP Cloud Run, Azure Container Instances

For complete deployment guide including scaling, monitoring, and security, see references/deployment.md.
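For the recommended Docker path, a minimal `docker-compose.yml` might look like the following. The service name, port, GPU reservation block, and the `MODEL_ID`/`DEVICE` variables are illustrative; adapt them to the generated template.

```yaml
services:
  stt-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      MODEL_ID: openai/whisper-medium
      DEVICE: cuda:0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```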

WebSocket Protocol

Client → Server

  • Audio: Binary (16-bit PCM, 16kHz)
  • Config: {"type": "config", "language": "en"}
  • End: {"type": "end"}

Server → Client

  • Interim: {"type": "interim", "text": "..."}
  • Final: {"type": "final", "text": "...", "language": "en"}
  • Error: {"type": "error", "message": "..."}
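On the client side, the protocol above reduces to a few pure helpers. These are illustrative functions (not part of the templates); only the field names come from the message shapes documented here.

```python
import json


def config_message(language: str) -> str:
    """Build the initial config frame."""
    return json.dumps({"type": "config", "language": language})


def end_message() -> str:
    """Build the end-of-stream frame."""
    return json.dumps({"type": "end"})


def parse_server_message(raw: str) -> tuple[str, str]:
    """Return (kind, payload) where kind is 'interim', 'final', or 'error'."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind in ("interim", "final"):
        return kind, msg.get("text", "")
    if kind == "error":
        return kind, msg.get("message", "")
    raise ValueError(f"unexpected message type: {kind!r}")
```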

Common Tasks

Change Model

Edit .env:

MODEL_ID=openai/whisper-small  # Faster model

Add Language Support

In plugin usage:

stt = custom_stt.STT(language="es")         # Spanish
stt = custom_stt.STT(detect_language=True)  # Auto-detect

Enable GPU

In API server:

DEVICE=cuda:0  # Use GPU

Scale Horizontally

Deploy multiple API server instances behind load balancer. See references/deployment.md for Nginx configuration.
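The load-balancer detail that most often breaks WebSockets is the upgrade handshake. A hedged Nginx sketch (upstream names and ports are placeholders; see references/deployment.md for the full configuration):

```nginx
upstream stt_backend {
    least_conn;
    server stt-1:8000;
    server stt-2:8000;
}

server {
    listen 80;
    location /ws/ {
        proxy_pass http://stt_backend;
        # Forward the WebSocket upgrade handshake.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # Long-lived streams need a generous read timeout.
        proxy_read_timeout 300s;
    }
}
```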

Troubleshooting

Out of Memory

  • Use smaller model (whisper-small or whisper-tiny)
  • Reduce batch_size in pipeline
  • Enable low_cpu_mem_usage=True

Slow Transcription

  • Ensure GPU is enabled (DEVICE=cuda:0)
  • Use FP16 precision (automatic on GPU)
  • Increase batch_size
  • Use smaller model

Connection Issues

  • Verify WebSocket support in load balancer
  • Check firewall rules
  • Increase timeout settings

Scripts

  • scripts/setup_api_server.py - Generate API server from template
  • scripts/setup_plugin.py - Generate LiveKit plugin from template

References

Load these as needed for detailed information:

  • references/api_server_guide.md - Complete API implementation guide
  • references/plugin_implementation.md - LiveKit plugin development
  • references/models_comparison.md - Model selection and optimization
  • references/deployment.md - Production deployment best practices

Assets

Ready-to-use templates:

  • assets/api-server/ - Complete FastAPI server with Whisper
  • assets/plugin-template/ - LiveKit STT plugin structure

Best Practices

  1. Keep models in memory - Load once at startup, not per request
  2. Use appropriate model size - Balance quality vs. speed for your use case
  3. Process audio in chunks - 1-second chunks work well for streaming
  4. Implement proper cleanup - Close WebSocket connections gracefully
  5. Monitor metrics - Track latency, throughput, GPU utilization
  6. Use Docker - Ensures consistent deployments
  7. Enable authentication - Secure production APIs
  8. Scale horizontally - Use load balancer for high availability
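Practice 3 (1-second chunks) comes down to a small accumulator: at 16 kHz mono 16-bit PCM, one second is 16000 × 2 = 32000 bytes. A minimal sketch (the class name is hypothetical, not from the templates):

```python
class ChunkAccumulator:
    """Accumulate raw PCM bytes and release fixed-duration chunks.

    Defaults assume 16 kHz mono 16-bit PCM (2 bytes per sample).
    """

    def __init__(self, sample_rate: int = 16000, bytes_per_sample: int = 2,
                 chunk_seconds: float = 1.0):
        self.chunk_bytes = int(sample_rate * bytes_per_sample * chunk_seconds)
        self._buf = bytearray()

    def feed(self, data: bytes) -> list[bytes]:
        """Add data; return every complete chunk now available."""
        self._buf.extend(data)
        chunks = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[:self.chunk_bytes]))
            del self._buf[:self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever partial chunk remains (e.g. on stream end)."""
        rest = bytes(self._buf)
        self._buf.clear()
        return rest
```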


Build self-hosted TTS APIs using HuggingFace models (Parler-TTS, F5-TTS, XTTS-v2) and create LiveKit voice agent plugins with streaming support. Use when creating production-ready text-to-speech systems that need: (1) Self-hosted TTS with full control, (2) LiveKit voice agent integration, (3) Streaming audio for low-latency conversations, (4) Custom voice characteristics, (5) Cost-effective alternatives to cloud TTS providers like ElevenLabs or Google TTS.