
fix(deriver): break worker-pool deadlock on hung LLM calls #8

Merged

offendingcommit merged 1 commit into main from fix/deriver-pool-deadlock on May 7, 2026
Conversation

@offendingcommit
Owner

Two compounding bugs caused the deriver worker pool to wedge after a single CF-Gateway-streamed Gemini response failed to terminate:

  1. `process_work_unit` holds `async with self.semaphore` across the inner LLM call (`process_representation_batch` / `process_item`). With no asyncio-level timeout, a hung HTTP read held the slot forever. Eight workers × one hung call each = pool fully locked.

  2. `polling_loop` gated `cleanup_stale_work_units` behind `if self.semaphore.locked(): continue`, so once the pool was full the stale-AQS cleanup never ran and `STALE_SESSION_TIMEOUT_MINUTES` was effectively never enforced. Pod restarts didn't help: new pods reclaimed the same poisoned `work_unit_keys` and re-wedged within minutes. (A simplified sketch of both buggy paths follows this list.)
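
A minimal sketch of the pre-fix shape, for illustration only: the method names follow this PR, but the class scaffolding, worker count, and bodies are placeholders, and `process_item` follows the same pattern as `process_representation_batch`.

```python
import asyncio


class DeriverSketch:
    """Illustrative skeleton only; not the actual deriver implementation."""

    def __init__(self, workers: int = 8) -> None:
        self.semaphore = asyncio.Semaphore(workers)

    async def process_representation_batch(self, work_unit) -> None:
        ...  # inner LLM call; a streamed response that never terminates blocks here

    async def cleanup_stale_work_units(self) -> None:
        ...  # reaps active_queue_sessions rows past STALE_SESSION_TIMEOUT_MINUTES

    async def process_work_unit(self, work_unit) -> None:
        async with self.semaphore:  # slot held across the entire LLM call
            # Bug 1: no asyncio-level timeout, so a hung HTTP read keeps
            # this semaphore slot forever.
            await self.process_representation_batch(work_unit)

    async def polling_loop(self) -> None:
        while True:
            # Bug 2: once every slot is held, the early `continue` also
            # skips stale-work-unit cleanup, so the pool can never recover.
            if self.semaphore.locked():
                await asyncio.sleep(1)
                continue
            await self.cleanup_stale_work_units()
            # ... claim and dispatch new work units ...
            await asyncio.sleep(1)
```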

Fixes:

  - Add `DERIVER_WORK_UNIT_TIMEOUT_SECONDS` (default 600 s) and wrap both `process_representation_batch` and `process_item` in `asyncio.wait_for`. A `TimeoutError` now propagates to `_handle_processing_error`, the `async with` unwinds, and the semaphore slot is released.
  - Move `cleanup_stale_work_units` above the semaphore-locked check so AQS rows are reaped on every poll tick, even with a full pool. Cleanup is cheap; running it unconditionally costs one index scan per poll. (A sketch of both fixed paths follows.)
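
The corresponding post-fix shape, continuing the sketch above. This is a hypothetical illustration: the `os.getenv` lookup and the `_handle_processing_error` signature are assumptions, not the real config plumbing or error path.

```python
import asyncio
import os

# New setting from this PR (600 s default); reading it straight from the
# environment here is illustrative only.
WORK_UNIT_TIMEOUT = float(os.getenv("DERIVER_WORK_UNIT_TIMEOUT_SECONDS", "600"))


class FixedDeriverSketch(DeriverSketch):
    async def _handle_processing_error(self, work_unit, exc) -> None:
        ...  # existing error path; TimeoutError now lands here too (signature assumed)

    async def process_work_unit(self, work_unit) -> None:
        async with self.semaphore:
            try:
                # Fix 1: bound the LLM call. A hung streamed read now raises
                # TimeoutError, the `async with` unwinds, and the slot frees.
                await asyncio.wait_for(
                    self.process_representation_batch(work_unit),
                    timeout=WORK_UNIT_TIMEOUT,
                )
            except Exception as exc:
                await self._handle_processing_error(work_unit, exc)

    async def polling_loop(self) -> None:
        while True:
            # Fix 2: reap stale AQS rows before the pool-full check, so
            # cleanup runs on every poll tick even when all slots are held.
            await self.cleanup_stale_work_units()
            if self.semaphore.locked():
                await asyncio.sleep(1)
                continue
            # ... claim and dispatch new work units ...
            await asyncio.sleep(1)
```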

Symptoms before the fix: `active_queue_sessions` rows aging past `STALE_SESSION_TIMEOUT_MINUTES`, the `queue.processed = false` count climbing into the thousands across all task types, and the deriver pod alive (PID 1 ok) but emitting no log output for hours.

offendingcommit merged commit 5095a24 into main on May 7, 2026
1 of 2 checks passed