
fix(deriver): break worker-pool deadlock on hung LLM calls #8

Merged

offendingcommit merged 1 commit into main from fix/deriver-pool-deadlock on May 7, 2026
Conversation

@offendingcommit
Owner

Two compounding bugs caused the deriver worker pool to wedge after a single CF-Gateway-streamed Gemini response failed to terminate:

  1. `process_work_unit` holds `async with self.semaphore` across the inner LLM call (`process_representation_batch` / `process_item`). With no asyncio-level timeout, a hung HTTP read held the slot forever. Eight workers × one hung call each = pool fully locked.

  2. `polling_loop` gated `cleanup_stale_work_units` behind `if self.semaphore.locked(): continue`, so once the pool was full the stale-AQS cleanup never ran and `STALE_SESSION_TIMEOUT_MINUTES` was effectively never enforced. Pod restarts didn't help: new pods reclaimed the same poisoned `work_unit_keys` and re-wedged within minutes. (A simplified sketch of both buggy paths follows this list.)
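
A minimal sketch of the pre-fix shape, for illustration only: the method names follow this PR, but the class scaffolding, worker count, and bodies are placeholders, and `process_item` follows the same pattern as `process_representation_batch`.

```python
import asyncio


class DeriverSketch:
    """Illustrative skeleton only; not the actual deriver implementation."""

    def __init__(self, workers: int = 8) -> None:
        self.semaphore = asyncio.Semaphore(workers)

    async def process_representation_batch(self, work_unit) -> None:
        ...  # inner LLM call; a streamed response that never terminates blocks here

    async def cleanup_stale_work_units(self) -> None:
        ...  # reaps active_queue_sessions rows past STALE_SESSION_TIMEOUT_MINUTES

    async def process_work_unit(self, work_unit) -> None:
        async with self.semaphore:  # slot held across the entire LLM call
            # Bug 1: no asyncio-level timeout, so a hung HTTP read keeps
            # this semaphore slot forever.
            await self.process_representation_batch(work_unit)

    async def polling_loop(self) -> None:
        while True:
            # Bug 2: once every slot is held, the early `continue` also
            # skips stale-work-unit cleanup, so the pool can never recover.
            if self.semaphore.locked():
                await asyncio.sleep(1)
                continue
            await self.cleanup_stale_work_units()
            # ... claim and dispatch new work units ...
            await asyncio.sleep(1)
```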

Fixes:

  - Add `DERIVER_WORK_UNIT_TIMEOUT_SECONDS` (default 600 s) and wrap both `process_representation_batch` and `process_item` in `asyncio.wait_for`. A `TimeoutError` now propagates to `_handle_processing_error`, the `async with` unwinds, and the semaphore slot is released.
  - Move `cleanup_stale_work_units` above the semaphore-locked check so AQS rows are reaped on every poll tick, even with a full pool. Cleanup is cheap; running it unconditionally costs one index scan per poll. (A sketch of both fixed paths follows.)
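
The corresponding post-fix shape, continuing the sketch above. This is a hypothetical illustration: the `os.getenv` lookup and the `_handle_processing_error` signature are assumptions, not the real config plumbing or error path.

```python
import asyncio
import os

# New setting from this PR (600 s default); reading it straight from the
# environment here is illustrative only.
WORK_UNIT_TIMEOUT = float(os.getenv("DERIVER_WORK_UNIT_TIMEOUT_SECONDS", "600"))


class FixedDeriverSketch(DeriverSketch):
    async def _handle_processing_error(self, work_unit, exc) -> None:
        ...  # existing error path; TimeoutError now lands here too (signature assumed)

    async def process_work_unit(self, work_unit) -> None:
        async with self.semaphore:
            try:
                # Fix 1: bound the LLM call. A hung streamed read now raises
                # TimeoutError, the `async with` unwinds, and the slot frees.
                await asyncio.wait_for(
                    self.process_representation_batch(work_unit),
                    timeout=WORK_UNIT_TIMEOUT,
                )
            except Exception as exc:
                await self._handle_processing_error(work_unit, exc)

    async def polling_loop(self) -> None:
        while True:
            # Fix 2: reap stale AQS rows before the pool-full check, so
            # cleanup runs on every poll tick even when all slots are held.
            await self.cleanup_stale_work_units()
            if self.semaphore.locked():
                await asyncio.sleep(1)
                continue
            # ... claim and dispatch new work units ...
            await asyncio.sleep(1)
```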

Symptoms before the fix: `active_queue_sessions` rows aging past `STALE_SESSION_TIMEOUT_MINUTES`, the `queue.processed = false` count climbing into the thousands across all task types, and the deriver pod alive (PID 1 ok) but emitting no log output for hours.

offendingcommit merged commit 5095a24 into main on May 7, 2026
1 of 2 checks passed