
Clean up orphaned multiprocessing workers #192

Open

dll-create wants to merge 1 commit into AgentFlocks:main from dll-create:fix-orphaned-multiprocessing-workers

Conversation

@dll-create (Contributor) commented Apr 25, 2026

Summary

  • Detect orphaned Python multiprocessing workers that belong to the current Flocks install.
  • Include those orphaned workers in flocks stop target discovery and force-kill refreshes.
  • Add regression coverage for matching only Flocks-owned orphan workers and stopping when no pid file or port listener remains.

Evidence

On macOS, multiple stale workers were observed with PPID 1 and sustained CPU usage after the Flocks service was no longer reachable through the tracked backend pid tree:

```
/opt/homebrew/.../Python -c from multiprocessing.spawn import spawn_main; spawn_main(...) --multiprocessing-fork
```

`lsof -p <pid>` showed those processes had cwd/open files under a Flocks checkout and its virtualenv. Because they were reparented to PID 1 and did not own the backend port, the existing stop logic could miss them: it only gathered the tracked pid tree, runtime record pids, process group, and port listeners.

This change keeps the cleanup narrow: a candidate must be a Python multiprocessing spawn process reparented to PID 1 on a Unix host, and lsof must show open files under the current Flocks repo, its .venv, or the state directory.
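A minimal sketch of that heuristic, using the helper name `orphaned_flocks_worker_pids()` that the review below refers to. The root paths and the exact ps/lsof plumbing here are illustrative assumptions, not the PR's literal code:

```python
import subprocess
from pathlib import Path

# Hypothetical roots; the real PR derives these from the active install.
FLOCKS_ROOTS = (Path.cwd(), Path.cwd() / ".venv", Path.home() / ".flocks")

def orphaned_flocks_worker_pids() -> list[int]:
    """Find PPID=1 multiprocessing spawn workers whose open files
    fall under the current Flocks checkout, venv, or state dir."""
    pids: list[int] = []
    ps = subprocess.run(
        ["ps", "-axo", "pid=,ppid=,command="],
        capture_output=True, text=True, check=False,
    )
    for line in ps.stdout.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, ppid, command = int(parts[0]), int(parts[1]), parts[2]
        # Candidate: reparented to PID 1 and a multiprocessing spawn worker.
        if ppid != 1 or "multiprocessing.spawn" not in command:
            continue
        # Ownership: lsof must show open files under a Flocks root.
        lsof = subprocess.run(
            ["lsof", "-p", str(pid)],
            capture_output=True, text=True, check=False,
        )
        if any(str(root) in lsof.stdout for root in FLOCKS_ROOTS):
            pids.append(pid)
    return pids
```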

Tests

```
python -m pytest tests/cli/test_service_manager.py
66 passed
```

dll-create force-pushed the fix-orphaned-multiprocessing-workers branch from 4a26e3e to 1309c0e on April 25, 2026 12:00
dll-create force-pushed the fix-orphaned-multiprocessing-workers branch from 1309c0e to e1b4247 on April 25, 2026 12:01
xiami762 self-requested a review on April 28, 2026 10:17
@xiami762 (Contributor) left a comment

Thank you for sharing the PR. Here is the review.

Findings

  1. High: This change wires backend orphan-worker cleanup into the generic stop_one() helper, so the WebUI stop path can also kill backend-owned Python workers. That is too broad: stop_one() is a shared helper, stop_all() stops the WebUI first, and the upgrade flow also calls stop_one(..., "WebUI", ...) directly. With orphaned_flocks_worker_pids() added unconditionally to both the initial target list and the refreshed stop targets, "stop WebUI" can now clean up orphaned backend workers and attribute that work to the WebUI shutdown path. At minimum, this logic should be limited to the backend-specific path, not the generic stop helper; a minimal gating sketch follows the excerpts below.
```python
def stop_one(port: int, pid_file: Path, name: str, console) -> None:
    """Stop a single service by tracked pid and/or listening port."""
    cleanup_stale_pid_file(pid_file)
    runtime_record = read_runtime_record(pid_file)
    tracked_pid = runtime_record.pid if runtime_record else None
    listeners = port_owner_pids(port)

    target_pids: list[int] = []
    if tracked_pid is not None:
        target_pids = append_unique_pids(target_pids, collect_process_tree_pids(tracked_pid))
    target_pids = append_unique_pids(target_pids, listeners)
```

Call sites that route WebUI shutdown through the same helper:

```python
"""Stop frontend then backend while reusing the caller's lock."""
fe_port, be_port = _resolve_stop_ports(paths, config)
_resolve_upgrade_runtime(console, frontend_port=fe_port, attempt_recover=False)
stop_one(fe_port, paths.frontend_pid, "WebUI", console)
stop_one(be_port, paths.backend_pid, "后端", console)
```

```python
console = _NullConsole()
paths = service_manager.ensure_runtime_dirs()
frontend_port = service_manager._recorded_port(paths.frontend_pid, config.frontend_port)
service_manager.stop_one(frontend_port, paths.frontend_pid, "WebUI", console)
```
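A minimal sketch of the suggested narrowing, assuming a hypothetical include_orphans keyword that only the backend stop call opts into (the helper names are the ones quoted above):

```python
def stop_one(port: int, pid_file: Path, name: str, console,
             *, include_orphans: bool = False) -> None:
    """Stop one service; orphan-worker cleanup is opt-in, not automatic."""
    cleanup_stale_pid_file(pid_file)
    runtime_record = read_runtime_record(pid_file)
    tracked_pid = runtime_record.pid if runtime_record else None
    listeners = port_owner_pids(port)

    target_pids: list[int] = []
    if tracked_pid is not None:
        target_pids = append_unique_pids(target_pids, collect_process_tree_pids(tracked_pid))
    target_pids = append_unique_pids(target_pids, listeners)
    if include_orphans:
        # Only the backend path should sweep reparented workers.
        target_pids = append_unique_pids(target_pids, orphaned_flocks_worker_pids())

# WebUI keeps the old behavior; only the backend opts in:
# stop_one(fe_port, paths.frontend_pid, "WebUI", console)
# stop_one(be_port, paths.backend_pid, "后端", console, include_orphans=True)
```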
  2. Medium: The new ownership heuristic is too broad and can kill unrelated Python multiprocessing jobs started from the same checkout or virtualenv. Based on the PR, a process is treated as Flocks-owned if it has PPID=1, its command contains multiprocessing.spawn, and lsof output mentions the current repo, the current .venv, or the Flocks state directory. That does not prove the process came from the Flocks backend: any local script, test run, or other daemon started from the same repo/venv could match and be killed by flocks stop. The ownership check needs a stronger Flocks-specific signal, such as a backend-specific command signature, a runtime artifact association, or another marker that ties the worker to the active Flocks service rather than to the checkout in general; one possible realization is sketched below.
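One way to realize the "runtime artifact association" this finding asks for; a sketch only, where record_worker_pid, recorded_orphan_pids, and the JSON record layout are assumptions rather than existing Flocks APIs. The backend records each worker pid it spawns, so flocks stop only ever considers pids the backend itself wrote:

```python
import json
import os
from pathlib import Path

RUNTIME_RECORD = Path.home() / ".flocks" / "backend.json"  # assumed location

def record_worker_pid(pid: int) -> None:
    """Called by the backend when it spawns a worker (sketch)."""
    record = json.loads(RUNTIME_RECORD.read_text()) if RUNTIME_RECORD.exists() else {}
    workers = set(record.get("worker_pids", []))
    workers.add(pid)
    record["worker_pids"] = sorted(workers)
    RUNTIME_RECORD.write_text(json.dumps(record))

def recorded_orphan_pids() -> list[int]:
    """Stop path: only pids the backend itself recorded are candidates."""
    if not RUNTIME_RECORD.exists():
        return []
    alive: list[int] = []
    for pid in json.loads(RUNTIME_RECORD.read_text()).get("worker_pids", []):
        try:
            os.kill(pid, 0)  # signal 0: existence check, kills nothing
            alive.append(pid)
        except (ProcessLookupError, PermissionError):
            continue  # gone, or no longer ours: never a kill target
    return alive
```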

