fix(deriver): break worker-pool deadlock on hung LLM calls #8
Merged
Two compounding bugs caused the deriver worker pool to wedge after a single CF-Gateway-streamed Gemini response failed to terminate:

1. process_work_unit holds `async with self.semaphore` across the inner LLM call (process_representation_batch / process_item). With no asyncio-level timeout, a hung HTTP read held the slot forever. Eight workers x one hung call each = pool fully locked.
2. polling_loop gated cleanup_stale_work_units behind `if self.semaphore.locked(): continue`, so once the pool was full the stale-AQS cleanup never ran. STALE_SESSION_TIMEOUT_MINUTES became dead-lettered.

Pod restarts didn't help: new pods reclaimed the same poisoned work_unit_keys and re-wedged within minutes.

Fixes:

- Add DERIVER_WORK_UNIT_TIMEOUT_SECONDS (default 600s) and wrap both process_representation_batch and process_item in asyncio.wait_for. TimeoutError propagates to _handle_processing_error, the `async with` unwinds, the semaphore slot releases.
- Move cleanup_stale_work_units above the semaphore-locked check so AQS rows always get reaped on every poll tick, even with a full pool. Cleanup is cheap; running it unconditionally costs one index scan per poll.

Symptoms before fix: active_queue_sessions rows aging past STALE_SESSION_TIMEOUT_MINUTES, queue.processed=false count climbing into thousands across all task types, deriver pod alive (PID 1 ok) but log output silent for hours.
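For reviewers who want to see the two fixes in isolation, here is a minimal sketch of the pattern. It is not the deriver code. The `Worker` class, `hung_llm_call`, `poll_once`, and the shortened `WORK_UNIT_TIMEOUT_SECONDS` are hypothetical stand-ins, and the timeout is recorded in a list rather than routed through `_handle_processing_error`. Only `asyncio.Semaphore`, `asyncio.wait_for`, and `Semaphore.locked()` are the real APIs the description refers to.

```python
import asyncio

# Hypothetical stand-in for DERIVER_WORK_UNIT_TIMEOUT_SECONDS (600s by default
# in this PR); shortened here so the demo finishes quickly.
WORK_UNIT_TIMEOUT_SECONDS = 0.05


class Worker:
    """Toy model of the worker pool; every name here is illustrative."""

    def __init__(self, workers: int) -> None:
        self.semaphore = asyncio.Semaphore(workers)
        self.errors: list[BaseException] = []
        self.cleanups = 0

    async def hung_llm_call(self) -> None:
        # Simulates a streamed response that never terminates.
        await asyncio.Event().wait()

    async def process_work_unit(self) -> None:
        async with self.semaphore:
            try:
                # Fix 1: bound the inner call so the slot cannot be held forever.
                await asyncio.wait_for(
                    self.hung_llm_call(), timeout=WORK_UNIT_TIMEOUT_SECONDS
                )
            except asyncio.TimeoutError as exc:
                # Assumption: the real code hands this to its error handler.
                self.errors.append(exc)
        # Leaving the `async with` block releases the semaphore slot.

    async def cleanup_stale_work_units(self) -> None:
        self.cleanups += 1

    async def poll_once(self) -> bool:
        # Fix 2: cleanup runs before the pool-full check, on every tick.
        await self.cleanup_stale_work_units()
        if self.semaphore.locked():
            return False  # pool full, claim nothing this tick
        return True  # a slot is free, new work could be claimed


async def demo() -> tuple[int, int, bool, bool]:
    worker = Worker(workers=2)
    tasks = [asyncio.create_task(worker.process_work_unit()) for _ in range(2)]
    await asyncio.sleep(0)  # let both tasks take their slots
    claimed_while_full = await worker.poll_once()
    await asyncio.gather(*tasks)  # both calls time out and release
    claimed_after_timeout = await worker.poll_once()
    return (
        len(worker.errors),
        worker.cleanups,
        claimed_while_full,
        claimed_after_timeout,
    )
```

Running `asyncio.run(demo())` fills the pool with hung calls, shows that cleanup still runs while the pool is full, and shows the slots being released once the timeout fires.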