Add OpenAI-compatible Audio Support (ASR + TTS) via Caddy Routing #16

@eshork

Motivation

NeuralDrive is an excellent portable LLM appliance with a clean architecture: immutable rootfs, Caddy as the central reverse proxy on :8443, OpenAI-compatible API, LRU model management, and a unified System API + TUI.

Users are increasingly asking for voice capabilities (speech-to-text and text-to-speech) to build local voice agents, hands-free interfaces, or full conversational systems. Adding audio support in a way that feels native to the existing design would significantly increase the platform's usefulness.

Goal

Expose standard OpenAI-compatible audio endpoints through the existing :8443 gateway:

  • POST /v1/audio/transcriptions — ASR (Speech-to-Text)
  • POST /v1/audio/speech — TTS (Text-to-Speech)

These endpoints should preserve everything the text API already provides:

  • Bearer token authentication
  • The same OpenAI-compatible request and response style used for text models
  • Integration with the System API and Textual TUI
  • LRU-based model loading/unloading
  • Storage on the persistence partition
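From a client's point of view, the two endpoints above could be exercised roughly as follows. This is a minimal sketch using only the Python standard library: the hostname `neuraldrive.local`, the helper names, and any model or voice values passed in are placeholders, not part of NeuralDrive. The request shapes follow the OpenAI audio API (JSON body for speech, multipart form for transcriptions).

```python
import json
import uuid
import urllib.request

# Hypothetical appliance address; only the :8443 port comes from the proposal.
BASE_URL = "https://neuraldrive.local:8443"


def build_speech_request(token, text, model, voice, base_url=BASE_URL):
    """Build a POST /v1/audio/speech request (TTS) with Bearer auth."""
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech", data=body, headers=headers, method="POST"
    )


def build_transcription_request(token, audio_bytes, model,
                                filename="clip.wav", base_url=BASE_URL):
    """Build a POST /v1/audio/transcriptions request (ASR) as multipart/form-data."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Plain form field carrying the model name.
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n".encode("utf-8"),
        # File field carrying the raw audio bytes.
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode("utf-8"),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode("utf-8"),
    ])
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions", data=body, headers=headers, method="POST"
    )
```

Either request can then be sent with `urllib.request.urlopen(req)`. If the appliance uses a self-signed certificate, the caller would also need to supply a suitable SSL context.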

Proposed Architecture

Caddy (:8443)
├── /v1/chat/completions          → Ollama (existing)
├── /v1/audio/transcriptions      → faster-whisper service (new, internal :8001)
├── /v1/audio/speech              → Kokoro-FastAPI (new, internal :8002)
└── /system/* + /v1/models        → FastAPI System API (extend)

Internal Services (new systemd units):

  • neuraldrive-whisper.service — faster-whisper + thin FastAPI wrapper
  • neuraldrive-kokoro.service — Official ghcr.io/remsky/kokoro-fastapi-* image

Recommended Starting Models

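Type | Model                             | Reason                                             | OpenAI Compatible?
-----|-----------------------------------|----------------------------------------------------|-------------------
ASR  | faster-whisper (large-v3 / Turbo) | Best balance of speed & accuracy                   | Yes (via wrapper)
TTS  | Kokoro-82M                        | Highest quality open-source TTS, tiny, runs on CPU | Yes (native)
TTS  | Piper (optional)                  | Extremely lightweight & fast                       | Easy to wrap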

Later phases can add Fish Speech S2, Qwen3 audio models, etc.

Implementation Details

1. Caddyfile Changes (minimal)

:8443 {
    # ... existing routes ...

    handle /v1/audio/transcriptions {
        reverse_proxy localhost:8001
    }

    handle /v1/audio/speech {
        reverse_proxy localhost:8002
    }
}
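To make the intended routing easy to reason about (and to unit-test), the route table can be modeled in a few lines of Python. This is a simplified sketch, not Caddy's actual matcher semantics: `ROUTES` and `resolve_upstream` are hypothetical names, the upstream labels `ollama` and `system-api` stand in for whatever addresses the existing config uses, and only ports 8001 and 8002 come from the proposal.

```python
# Simplified model of the proposed gateway routing (not Caddy's real matcher).
# A pattern ending in "/*" is treated as a prefix match; anything else is exact.
ROUTES = [
    ("/v1/chat/completions", "ollama"),              # existing
    ("/v1/audio/transcriptions", "localhost:8001"),  # new: faster-whisper
    ("/v1/audio/speech", "localhost:8002"),          # new: Kokoro-FastAPI
    ("/v1/models", "system-api"),                    # existing
    ("/system/*", "system-api"),                     # existing
]


def resolve_upstream(path):
    """Return the upstream label for a request path, or None if nothing matches."""
    for pattern, upstream in ROUTES:
        if pattern.endswith("/*"):
            # "/system/*" matches any path starting with "/system/".
            if path.startswith(pattern[:-1]):
                return upstream
        elif path == pattern:
            return upstream
    return None
```

Under this model, `resolve_upstream("/v1/audio/models")` returns `None`, which suggests the proposed `/v1/audio/models` System API endpoints may need their own `handle` block unless an existing route already covers them.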

2. New Systemd Services

  • Run as unprivileged users with proper hardening (matching current services)
  • Store models in /persistence/models/audio/

3. System API Extensions (:3001)

Add these endpoints to the existing FastAPI backend:

  • GET /v1/audio/models
  • POST /v1/audio/models/{name}/load
  • POST /v1/audio/models/{name}/unload
  • GET /system/audio/status (shows loaded models + resource usage)

Reuse the existing LRU eviction logic where possible.
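As a starting point for discussion, the load/unload/status behavior behind these endpoints could look something like the sketch below. It is not the existing NeuralDrive LRU code: `AudioModelPool`, its methods, and the megabyte budget are assumptions made for illustration, and the real implementation would start or stop the model in the backing service rather than just track names.

```python
from collections import OrderedDict


class AudioModelPool:
    """Hypothetical registry of loaded audio models with LRU eviction."""

    def __init__(self, budget_mb):
        self.budget_mb = budget_mb
        # Maps model name -> size in MB; least recently used entry comes first.
        self._loaded = OrderedDict()

    def used_mb(self):
        return sum(self._loaded.values())

    def load(self, name, size_mb):
        """Mark a model as loaded; return the names evicted to make room."""
        if size_mb > self.budget_mb:
            raise ValueError(f"{name} ({size_mb} MB) exceeds the pool budget")
        if name in self._loaded:
            self._loaded.move_to_end(name)  # already loaded: just refresh recency
            return []
        evicted = []
        while self._loaded and self.used_mb() + size_mb > self.budget_mb:
            victim, _ = self._loaded.popitem(last=False)  # drop least recently used
            evicted.append(victim)
        self._loaded[name] = size_mb
        return evicted

    def touch(self, name):
        """Record that a loaded model just served a request."""
        self._loaded.move_to_end(name)

    def unload(self, name):
        """Unload a model; return True if it was loaded."""
        return self._loaded.pop(name, None) is not None

    def status(self):
        """Payload shape that GET /system/audio/status might return."""
        return {
            "loaded": list(self._loaded),
            "used_mb": self.used_mb(),
            "budget_mb": self.budget_mb,
        }
```

`POST /v1/audio/models/{name}/load` would map to `load`, `.../unload` to `unload`, and each inference request to `touch`, so that recently used models are evicted last.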

4. TUI Updates

  • Show loaded audio models alongside text models
  • Add quick load/unload commands
  • Display real-time audio inference stats

5. Model Storage

Use the existing persistence partition (/persistence/models/audio/) so models survive reboots and are portable.

Phased Implementation Roadmap

Phase | Scope                                            | Priority
------|--------------------------------------------------|-------------
1     | Add Kokoro-FastAPI + faster-whisper behind Caddy | High
2     | Integrate into System API + TUI                  | High
3     | Add Piper as lightweight alternative             | Medium
4     | Support additional models (Fish Speech, etc.)    | Medium
5     | Full voice agent pipeline (ASR → LLM → TTS)      | Nice-to-have

Benefits

  • Keeps the "one appliance, one API" philosophy
  • No breaking changes for existing users
  • Leverages the mature OpenAI ecosystem (tools, SDKs, clients)
  • Enables powerful new use cases (local voice agents, accessibility, etc.)
  • Maintains the existing security model (TLS + Bearer auth)

Questions for Discussion

  1. Should we support streaming TTS responses (stream: true) from day one?
  2. Do we want a higher-level /v1/audio/voice-chat endpoint that chains ASR → LLM → TTS?
  3. Should audio models participate in the same VRAM management pool as text models, or have separate limits?
  4. Any preference between using a thin FastAPI wrapper vs. running the official Kokoro image directly?
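For question 2, the chained endpoint is conceptually just three calls in sequence. The sketch below shows that shape with the three stages injected as plain callables, so it makes no assumptions about how each stage talks to faster-whisper, Ollama, or Kokoro. The function name `voice_chat` and the keys of the returned dict are hypothetical.

```python
from typing import Callable, Dict, Union


def voice_chat(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # ASR stage: audio in, text out
    complete: Callable[[str], str],       # LLM stage: prompt in, reply text out
    synthesize: Callable[[str], bytes],   # TTS stage: text in, audio out
) -> Dict[str, Union[str, bytes]]:
    """Chain ASR -> LLM -> TTS and return every intermediate result."""
    transcript = transcribe(audio)
    reply_text = complete(transcript)
    reply_audio = synthesize(reply_text)
    return {
        "transcript": transcript,
        "reply_text": reply_text,
        "reply_audio": reply_audio,
    }
```

Returning the intermediate text along with the audio would let clients display a transcript, and keeping the stages injectable means the same function could be tested with stubs or wired to the real services.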

Metadata

    Labels

    API (API endpoints and compatibility), audio (Audio/voice related features), enhancement (New feature or request), roadmap (Planned feature on the project roadmap)
