Skip to content

Fix: Resolve watcher service deadlock during self-restart on folder deletion#1269

Open
DeveloperAmrit wants to merge 1 commit into
AOSSIE-Org:mainfrom
DeveloperAmrit:fix-issue-1031
Open

Fix: Resolve watcher service deadlock during self-restart on folder deletion#1269
DeveloperAmrit wants to merge 1 commit into
AOSSIE-Org:mainfrom
DeveloperAmrit:fix-issue-1031

Conversation

@DeveloperAmrit
Copy link
Copy Markdown
Member

@DeveloperAmrit DeveloperAmrit commented May 20, 2026

Addressed Issues:

Fixes #1031

Screenshots/Recordings:

image

Additional Notes:

Description
This PR fixes a bug where the sync-microservice watcher service would freeze and eventually fail when attempting to restart itself after detecting a deleted folder.

Previously, when a folder was deleted, the background thread (watcher_thread) triggered a restart sequence that called watcher_util_stop_folder_watcher(). Inside that stop function, the thread attempted to execute watcher_thread.join(). Since it was joining itself, this caused a deadlock that hung for the 5-second timeout, preventing a clean restart and throwing a runtime error.

Changes made

  • watcher.py
    • Prevent Self-Joining: Added a thread-safety check in watcher_util_stop_folder_watcher() so that .join() is only called if threading.current_thread() != watcher_thread.
    • Spawn Restart Worker: Updated watcher_util_restart_folder_watcher() to check if it's being executed by the watcher_thread. If so, it spawns a temporary background daemon thread to handle the stop/start sequence, allowing the original watcher thread to return and terminate gracefully.

How to test

  1. Start the sync-microservice.
  2. Add a folder to the watchlist (this automatically starts the watcher).
  3. Manually delete that folder from the filesystem while the service is running.
  4. Observe the logs: The watcher should now log "Processing 1 deleted folders", followed immediately by successfully stopping the watcher thread and restarting it, without the 5-second hang or the thread-stopping warning.

AI Usage Disclosure:

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.

Check one of the checkboxes below:

  • [x ] This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: TODO

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

  • Documentation

    • Updated OpenAPI schema organization in backend documentation
  • Bug Fixes

    • Fixed folder watcher thread management issues during stop operations
    • Enhanced folder watcher restart functionality to correctly handle internal thread operations
    • Improved overall stability of folder watching processes

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

Walkthrough

This PR addresses a deadlock in the watcher service thread synchronization and updates OpenAPI schema documentation. The watcher changes prevent the worker thread from joining itself by detecting caller context and delegating internal restarts to daemon threads. Schema definitions are reordered for consistency.

Changes

Watcher Thread Self-Restart Deadlock Prevention

Layer / File(s) Summary
Conditional thread join in stop function
sync-microservice/app/utils/watcher.py
watcher_util_stop_folder_watcher() detects whether it is called from within the watcher thread or externally. When called from inside the watcher thread, it skips the thread.join() call to avoid self-join deadlock and logs an internal-call message. External calls proceed normally with thread joining.
Async restart delegation for in-thread calls
sync-microservice/app/utils/watcher.py
watcher_util_restart_folder_watcher() checks the caller context; if called from inside the watcher thread, it spawns a daemon restart_worker thread to perform stop/start asynchronously and returns immediately. External calls continue to synchronously stop and start the watcher.

OpenAPI Schema Documentation Updates

Layer / File(s) Summary
Error response schema reordering and structure cleanup
docs/backend/backend_python/openapi.json
Error response schemas in components.schemas are reordered: app__schemas__face_clusters__ErrorResponse now appears before app__schemas__folders__ErrorResponse. The transition area of DeleteFoldersResponse is adjusted to ensure clean schema definition boundaries.

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Python, Documentation

Suggested reviewers

  • rahulharpal1603

🐰 A watcher thread that joins itself with glee,
Now detects with care and stays deadlock-free,
With daemon threads dancing when restarts arise,
And schemas reordered with documentation's eyes! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning The openapi.json schema reordering appears unrelated to the deadlock fix objective; it only reorganizes error schema definitions without structural changes, suggesting potential scope creep. Remove the openapi.json schema reordering changes as they are unrelated to the watcher deadlock fix and should be addressed in a separate PR focused on documentation/schema organization.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main fix: resolving a deadlock issue in the watcher service during self-restart on folder deletion, which matches the core problem addressed in the PR.
Linked Issues check ✅ Passed The PR successfully addresses the deadlock issue #1031 by implementing thread-safety checks in watcher_util_stop_folder_watcher() and spawning a daemon thread in watcher_util_restart_folder_watcher() when called from the watcher thread, preventing self-join deadlock.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added backend bug Something isn't working labels May 20, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
sync-microservice/app/utils/watcher.py (1)

334-351: ⚡ Quick win

Clarify restart return semantics for the async path.

The function returns True with two different meanings: immediately after scheduling (in-thread path) versus after actual completion (external path). The docstring and main caller in routes/watcher.py treat the return value as completion status ("Folder watcher restarted successfully"), but the in-thread path only schedules the restart asynchronously. This semantic mismatch should be resolved by updating the docstring to distinguish between scheduled and completed states, or add a note clarifying the return value is "True if restart completed (external) or scheduled (in-thread)".

Proposed patch
 def watcher_util_restart_folder_watcher() -> bool:
     """
     Restart the folder watcher by stopping the current one and starting fresh.

     Returns:
-        True if restart was successful, False otherwise
+        True if restart completed (external call) or was successfully scheduled
+        (in-thread call), False otherwise
     """
     logger.info("Restarting folder watcher...")
     
     if watcher_thread and threading.current_thread() == watcher_thread:
         # We are inside the watcher thread, so we can't join ourselves.
         # Launch a background thread to do the restart.
         def restart_worker():
             watcher_util_stop_folder_watcher()
             watcher_util_start_folder_watcher()

         threading.Thread(target=restart_worker, daemon=True).start()
+        logger.info("Watcher restart scheduled on background thread")
         return True
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@sync-microservice/app/utils/watcher.py` around lines 334 - 351, The function
watcher_util_restart_folder_watcher currently returns True both when the restart
is scheduled asynchronously from inside the watcher thread and when it has
completed from an external caller, causing ambiguous semantics for callers
(e.g., routes/watcher.py). Update the watcher_util_restart_folder_watcher
docstring to explicitly state the two return meanings: "True if restart
completed when called externally, or True if restart was scheduled when called
from inside the watcher thread (async)"; also mention that callers who need
confirmation of completion should call
watcher_util_stop_folder_watcher/watcher_util_start_folder_watcher directly or
implement a synchronization mechanism. Reference
watcher_util_restart_folder_watcher, watcher_util_stop_folder_watcher,
watcher_util_start_folder_watcher and the caller routes/watcher.py in your
change so readers can find the relevant logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@sync-microservice/app/utils/watcher.py`:
- Around line 343-351: Add an automated regression test that exercises the
self-restart path inside the watcher worker so we never hang on a self-join:
spawn a real watcher_thread (or simulate it via running the watcher worker
function on a new Thread), ensure threading.current_thread() == watcher_thread
when calling the restart logic, and invoke the code path that creates
restart_worker (the branch referencing watcher_thread and
threading.current_thread()). In the test, mock or patch
watcher_util_stop_folder_watcher and watcher_util_start_folder_watcher to record
calls, start the restart path from inside that watcher thread, and assert within
a short timeout that (a) the background restart thread was started, (b) mocked
stop/start functions were called, and (c) the test thread does not block (use a
join with timeout or an event to fail on hang). Target functions/symbols to
locate code: watcher_thread, restart_worker, watcher_util_stop_folder_watcher,
watcher_util_start_folder_watcher, and the branch that checks
threading.current_thread() == watcher_thread.

---

Nitpick comments:
In `@sync-microservice/app/utils/watcher.py`:
- Around line 334-351: The function watcher_util_restart_folder_watcher
currently returns True both when the restart is scheduled asynchronously from
inside the watcher thread and when it has completed from an external caller,
causing ambiguous semantics for callers (e.g., routes/watcher.py). Update the
watcher_util_restart_folder_watcher docstring to explicitly state the two return
meanings: "True if restart completed when called externally, or True if restart
was scheduled when called from inside the watcher thread (async)"; also mention
that callers who need confirmation of completion should call
watcher_util_stop_folder_watcher/watcher_util_start_folder_watcher directly or
implement a synchronization mechanism. Reference
watcher_util_restart_folder_watcher, watcher_util_stop_folder_watcher,
watcher_util_start_folder_watcher and the caller routes/watcher.py in your
change so readers can find the relevant logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a3f3ca87-e0f6-4f81-b9e1-9fafe54e788d

📥 Commits

Reviewing files that changed from the base of the PR and between e3ee7e6 and b036cc9.

📒 Files selected for processing (2)
  • docs/backend/backend_python/openapi.json
  • sync-microservice/app/utils/watcher.py

Comment on lines +343 to +351
if watcher_thread and threading.current_thread() == watcher_thread:
# We are inside the watcher thread, so we can't join ourselves.
# Launch a background thread to do the restart.
def restart_worker():
watcher_util_stop_folder_watcher()
watcher_util_start_folder_watcher()

threading.Thread(target=restart_worker, daemon=True).start()
return True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add an automated regression test for the self-restart path.

This is critical thread-lifecycle logic. Please add a test that triggers restart from inside the watcher worker path and verifies there is no self-join hang and restart completes.

As per coding guidelines, "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@sync-microservice/app/utils/watcher.py` around lines 343 - 351, Add an
automated regression test that exercises the self-restart path inside the
watcher worker so we never hang on a self-join: spawn a real watcher_thread (or
simulate it via running the watcher worker function on a new Thread), ensure
threading.current_thread() == watcher_thread when calling the restart logic, and
invoke the code path that creates restart_worker (the branch referencing
watcher_thread and threading.current_thread()). In the test, mock or patch
watcher_util_stop_folder_watcher and watcher_util_start_folder_watcher to record
calls, start the restart path from inside that watcher thread, and assert within
a short timeout that (a) the background restart thread was started, (b) mocked
stop/start functions were called, and (c) the test thread does not block (use a
join with timeout or an event to fail on hang). Target functions/symbols to
locate code: watcher_thread, restart_worker, watcher_util_stop_folder_watcher,
watcher_util_start_folder_watcher, and the branch that checks
threading.current_thread() == watcher_thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Sync-service: Deadlock when Watcher Self-Restarts after Folder Deletion

1 participant