Add OpenAI-compatible Audio Support (ASR + TTS) via Caddy Routing

## Motivation

NeuralDrive is an excellent portable LLM appliance with a clean architecture: immutable rootfs, Caddy as the central reverse proxy on `:8443`, OpenAI-compatible API, LRU model management, and a unified System API + TUI.

Users are increasingly asking for **voice capabilities** (speech-to-text and text-to-speech) to build local voice agents, hands-free interfaces, or full conversational systems. Adding audio support in a way that feels native to the existing design would significantly increase the platform's usefulness.

## Goal

Expose standard OpenAI-compatible audio endpoints through the existing `:8443` gateway:
- `POST /v1/audio/transcriptions` — ASR (Speech-to-Text)
- `POST /v1/audio/speech` — TTS (Text-to-Speech)

These endpoints should keep the appliance's existing properties:
- Bearer token authentication
- The same OpenAI-compatible style as text models
- Integration with the System API and Textual TUI
- LRU-based model loading/unloading
- Storage on the persistence partition
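As a rough illustration of what clients would send to the two endpoints, here is a stdlib-only sketch that builds (but does not send) both requests. The hostname `neuraldrive.local`, the token value, the model names `kokoro` and `large-v3`, and the voice `af_heart` are assumptions for illustration only; the field names (`model`, `input`, `voice`, `response_format`, `file`) follow the OpenAI audio API.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://neuraldrive.local:8443"  # hypothetical hostname
API_KEY = "sk-example"                       # placeholder Bearer token


def build_speech_request(text, voice="af_heart", model="kokoro",
                         base_url=BASE_URL, api_key=API_KEY):
    """Build a JSON POST for /v1/audio/speech (TTS)."""
    payload = {"model": model, "input": text, "voice": voice,
               "response_format": "mp3"}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def build_transcription_request(audio_bytes, filename="clip.wav",
                                model="large-v3",
                                base_url=BASE_URL, api_key=API_KEY):
    """Build a multipart/form-data POST for /v1/audio/transcriptions (ASR)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Plain form field carrying the model name.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode("utf-8"),
        # File field carrying the raw audio bytes.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\n'
         f"Content-Type: application/octet-stream\r\n\r\n").encode("utf-8"),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode("utf-8"),
    ])
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing either request to `urllib.request.urlopen` would send it. Existing OpenAI SDK clients should produce equivalent requests once their base URL points at the gateway.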

## Proposed Architecture

```
Caddy (:8443)
├── /v1/chat/completions          → Ollama (existing)
├── /v1/audio/transcriptions      → faster-whisper service (new, internal :8001)
├── /v1/audio/speech              → Kokoro-FastAPI (new, internal :8002)
└── /system/* + /v1/models        → FastAPI System API (extend)
```

**Internal Services (new systemd units):**
- `neuraldrive-whisper.service` — faster-whisper + thin FastAPI wrapper
- `neuraldrive-kokoro.service` — Official `ghcr.io/remsky/kokoro-fastapi-*` image

## Recommended Starting Models

| Type | Model                                       | Reason                                                   | OpenAI Compatible?       |
|------|---------------------------------------------|----------------------------------------------------------|--------------------------|
| ASR  | faster-whisper (large-v3 or large-v3-turbo) | Good balance of speed and accuracy                       | Yes (via wrapper)        |
| TTS  | Kokoro-82M                                  | High-quality output for its size (82M parameters); runs on CPU | Yes (via Kokoro-FastAPI) |
| TTS  | Piper (optional)                            | Extremely lightweight and fast                           | No (needs a thin wrapper) |

Later phases can add Fish Speech S2, Qwen3 audio models, etc.

## Implementation Details

### 1. Caddyfile Changes (minimal)

```caddyfile
:8443 {
    # ... existing routes ...

    handle /v1/audio/transcriptions {
        reverse_proxy localhost:8001
    }

    handle /v1/audio/speech {
        reverse_proxy localhost:8002
    }
}
```

### 2. New Systemd Services

- Run as unprivileged users with proper hardening (matching current services)
- Store models in `/persistence/models/audio/`

### 3. System API Extensions (`:3001`)

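Add these endpoints to the existing FastAPI backend. Per the routing diagram above, only `/system/*` and `/v1/models` reach the System API, so the `/v1/audio/models` paths would also need a Caddy route: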
- `GET /v1/audio/models`
- `POST /v1/audio/models/{name}/load`
- `POST /v1/audio/models/{name}/unload`
- `GET /system/audio/status` (shows loaded models + resource usage)

Reuse the existing LRU eviction logic where possible.
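The existing eviction code is not shown in this issue, so the following is only a minimal sketch of how LRU eviction under a memory budget could work for audio models. The class name `AudioModelPool`, the `unload` callback, and the sizes in MB are all hypothetical.

```python
from collections import OrderedDict


class AudioModelPool:
    """Tracks loaded models and evicts the least recently used over budget."""

    def __init__(self, budget_mb, unload=None):
        self.budget_mb = budget_mb
        self._loaded = OrderedDict()  # name -> size_mb, least recent first
        self._unload = unload or (lambda name: None)

    def used_mb(self):
        return sum(self._loaded.values())

    def loaded(self):
        return list(self._loaded)

    def touch(self, name, size_mb):
        """Mark a model as used, loading it if needed.

        Returns the names of models evicted to make room.
        """
        if size_mb > self.budget_mb:
            raise ValueError(f"{name} ({size_mb} MB) exceeds the budget")
        if name in self._loaded:
            # Already resident: just refresh its recency.
            self._loaded.move_to_end(name)
            return []
        evicted = []
        while self._loaded and self.used_mb() + size_mb > self.budget_mb:
            victim, _ = self._loaded.popitem(last=False)  # oldest entry
            self._unload(victim)
            evicted.append(victim)
        self._loaded[name] = size_mb
        return evicted
```

The `load` and `unload` endpoints could call into a structure like this, and `GET /system/audio/status` could report `loaded()` and `used_mb()`. Whether audio and text models share one budget is the open question raised under Questions for Discussion.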

### 4. TUI Updates

- Show loaded audio models alongside text models
- Add quick load/unload commands
- Display real-time audio inference stats

### 5. Model Storage

Use the existing persistence partition (`/persistence/models/audio/`) so models survive reboots and are portable.
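One possible way to resolve model directories under that root, sketched with `pathlib`. The `asr`/`tts` subdirectory layout and the helper name `model_dir` are assumptions, not part of the current design; the check rejects names that would escape the audio model root.

```python
from pathlib import Path

AUDIO_MODEL_ROOT = Path("/persistence/models/audio")


def model_dir(kind, name, root=AUDIO_MODEL_ROOT):
    """Return the directory for a model, refusing path traversal."""
    if kind not in {"asr", "tts"}:
        raise ValueError(f"unknown model kind: {kind!r}")
    base = (root / kind).resolve()
    candidate = (root / kind / name).resolve()
    # The resolved path must sit strictly below <root>/<kind>.
    if base not in candidate.parents:
        raise ValueError(f"invalid model name: {name!r}")
    return candidate
```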

## Phased Implementation Roadmap

| Phase | Scope                                      | Priority |
|-------|--------------------------------------------|----------|
| 1     | Add Kokoro-FastAPI + faster-whisper behind Caddy | High |
| 2     | Integrate into System API + TUI            | High |
| 3     | Add Piper as lightweight alternative       | Medium |
| 4     | Support additional models (Fish Speech, etc.) | Medium |
| 5     | Full voice agent pipeline (ASR → LLM → TTS) | Nice-to-have |

## Benefits

- Keeps the "one appliance, one API" philosophy
- No breaking changes to existing users
- Leverages the mature OpenAI ecosystem (tools, SDKs, clients)
- Enables powerful new use cases (local voice agents, accessibility, etc.)
- Maintains security model (TLS + Bearer auth)

## Questions for Discussion

1. Should we support streaming TTS responses (`stream: true`) from day one?
2. Do we want a higher-level `/v1/audio/voice-chat` endpoint that chains ASR → LLM → TTS?
3. Should audio models participate in the same VRAM management pool as text models, or have separate limits?
4. Any preference between using a thin FastAPI wrapper vs. running the official Kokoro image directly?
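To make question 2 concrete, the ASR → LLM → TTS chain could be as small as the sketch below. The three stages are injected as plain callables so the example stays independent of any particular backend; the names `voice_chat` and `VoiceChatResult` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceChatResult:
    transcript: str  # what the ASR stage heard
    reply: str       # what the LLM answered
    audio: bytes     # synthesized speech for the reply


def voice_chat(audio: bytes,
               transcribe: Callable[[bytes], str],
               chat: Callable[[str], str],
               synthesize: Callable[[str], bytes]) -> VoiceChatResult:
    """Chain the three stages: speech in, speech out."""
    transcript = transcribe(audio)
    if not transcript.strip():
        # Nothing intelligible was heard, so skip the LLM and TTS calls.
        raise ValueError("empty transcript")
    reply = chat(transcript)
    return VoiceChatResult(transcript, reply, synthesize(reply))
```

In a real endpoint each callable would wrap a request to the corresponding internal service, which keeps the chaining logic testable with stubs.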
