Fix VNC container startup failure when x11 volume is in corrupt state #14

Merged
cooperj merged 6 commits into main from copilot/fix-vnc-container-start-issue
May 5, 2026

Conversation

Contributor

Copilot AI commented May 5, 2026

The named x11 Docker volume can retain stale lock files and sockets (/tmp/.X1-lock, /tmp/.X11-unix/X1) from a crashed or restarted container, causing TurboVNC to silently fail and the entrypoint to hang indefinitely.

Changes

  • Verbose x11 cleanup — replaces silent sudo rm -rf ... > /dev/null 2>&1 with a cleanup_x11() function that reports what it finds/removes and warns if removal fails; falls back to sudo if a plain rm is denied
  • Startup timeouts — both the TurboVNC display-ready loop and the xfce4 session loop now time out after 30 s (0.5 s poll interval) instead of hanging forever
  • Auto-recovery on VNC failure — if TurboVNC doesn't come up within the timeout, dumps /tmp/vnc.log to stderr, kills the failed screen session, re-runs cleanup, and retries once
  • Fatal exit with recovery instructions — if the retry also fails, exits with a clear message and the manual recovery command:
    docker compose down -v && docker compose up
    
  • xfce4 failure logging — timeout on xfce4 wait loop dumps /tmp/xfce4.log before exiting
  • Docker health check — adds docker/vnc-healthcheck.sh which verifies the X11 display (xdpyinfo -display :1) and noVNC endpoint (curl localhost:5801/vnc.html); the container is marked unhealthy if either check fails, allowing dependent containers using condition: service_healthy to hold until the full VNC stack is ready
  • Dockerfile HEALTHCHECK instruction — the health check script is baked into the image (HEALTHCHECK --start-period=60s) so it works correctly even without Docker Compose
  • start_period: 60s added to the healthcheck in compose.yml and compose.cuda.yml to allow the VNC stack time to fully initialise before retries start counting
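
As a rough illustration of the verbose cleanup described in the first bullet, the sketch below shows one way such a function could look. It is not the code merged in this PR: the `X11_TMP` override is a hypothetical addition so the function can be exercised outside the container, and the use of `sudo -n` (non-interactive) for the fallback is an assumption.

```shell
# Hypothetical sketch of a verbose X11 cleanup; not the PR's actual cleanup_x11().
# X11_TMP is an assumed override for testing; it defaults to /tmp.
cleanup_x11() {
    local tmp="${X11_TMP:-/tmp}"
    local had_stale=false
    local path
    for path in "$tmp/.X1-lock" "$tmp/.X11-unix/X1"; do
        # -e misses dangling symlinks, so check -L as well.
        if [ -e "$path" ] || [ -L "$path" ]; then
            had_stale=true
            echo "  [x11] Found stale $path, removing..."
            # Try a plain rm first; fall back to non-interactive sudo if denied.
            rm -f "$path" 2>/dev/null || sudo -n rm -f "$path" 2>/dev/null || true
            if [ -e "$path" ] || [ -L "$path" ]; then
                echo "  [x11] WARNING: could not remove $path" >&2
                return 1
            fi
        fi
    done
    if "$had_stale"; then
        echo "  [x11] Stale X11 files cleared."
    else
        echo "  [x11] No stale X11 files found."
    fi
}
```

Re-checking for the file after the removal attempt, rather than trusting the exit status of `rm`, is what lets the function warn reliably when the volume is in a state it cannot fix.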

Copilot AI changed the title [WIP] Fix VNC container startup failures due to corrupt x11 volume Fix VNC container startup failure when x11 volume is in corrupt state May 5, 2026
Copilot AI requested a review from cooperj May 5, 2026 09:06
Copilot finished work on behalf of cooperj May 5, 2026 09:06
Member

cooperj commented May 5, 2026

@copilot Integrate this as a Docker health check, so that when there are issues the container is marked as 'unhealthy' and both dependent containers can hold, etc.

Contributor Author

Copilot AI commented May 5, 2026

> @copilot Integrate this as a Docker health check, so that when there are issues the container is marked as 'unhealthy' and both dependent containers can hold, etc.

Done in commit 66b09ef.

Added docker/vnc-healthcheck.sh which checks both the X11 display (xdpyinfo -display :1) and the noVNC endpoint (curl localhost:5801/vnc.html), exiting 1 with a clear message when either fails. This marks the container as unhealthy so dependent containers (ros, etc.) holding on condition: service_healthy will wait until the full VNC stack is genuinely up.
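
For readers who want a picture of the two-probe check described above, here is a minimal sketch. It is not the merged docker/vnc-healthcheck.sh: the function names, the `DISPLAY_NUM` and `NOVNC_URL` overrides, and the 3-second curl limit are all assumptions made for illustration. Only the display `:1` and the `localhost:5801/vnc.html` endpoint come from the PR.

```shell
# Hypothetical sketch of a two-probe health check; names and overrides are assumed.
check_x11() {
    # xdpyinfo exits non-zero when it cannot connect to the display.
    xdpyinfo -display "${DISPLAY_NUM:-:1}" >/dev/null 2>&1
}

check_novnc() {
    # -f: treat HTTP errors as failure, -s: silent,
    # --max-time: stay under the Docker health check timeout (assumed 3 s here).
    curl -fs --max-time 3 -o /dev/null "${NOVNC_URL:-http://localhost:5801/vnc.html}"
}

vnc_healthcheck() {
    if ! check_x11; then
        echo "UNHEALTHY: X11 display ${DISPLAY_NUM:-:1} is not responding" >&2
        return 1
    fi
    if ! check_novnc; then
        echo "UNHEALTHY: noVNC endpoint is not responding" >&2
        return 1
    fi
    echo "healthy"
}
```

In a standalone script the last line would be something like `vnc_healthcheck || exit 1`, since Docker treats exit code 0 as healthy and 1 as unhealthy.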

The Dockerfile now also declares HEALTHCHECK --interval=10s --timeout=5s --start-period=60s --retries=5 CMD ["/vnc-healthcheck.sh"] so it works even without compose. Both compose.yml and compose.cuda.yml have been updated to use the same script and add start_period: 60s to give the VNC stack time to initialise before retries start counting.
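
To see the resulting health state from the host, one option is to poll `docker inspect` for `.State.Health.Status`, which reports `starting`, `healthy`, or `unhealthy` for containers that define a health check. The helper below is a sketch only: `wait_healthy`, its default of 30 tries, and the `POLL_INTERVAL` variable are hypothetical and not part of this PR.

```shell
# Hypothetical helper: poll a container's health status until it settles.
# Usage: wait_healthy CONTAINER [MAX_TRIES]
wait_healthy() {
    local container="$1"
    local tries="${2:-30}"
    local status
    while [ "$tries" -gt 0 ]; do
        # An empty status means the container is missing or has no health check.
        status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) || status=""
        case "$status" in
            healthy)
                return 0
                ;;
            unhealthy)
                echo "$container is unhealthy" >&2
                return 1
                ;;
        esac
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    echo "timed out waiting for $container (last status: ${status:-unknown})" >&2
    return 1
}
```

Inside Compose this polling is unnecessary, because `condition: service_healthy` already gates dependent services; the helper is only useful when the image is run without Compose.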

Copilot finished work on behalf of cooperj May 5, 2026 09:18
@cooperj cooperj marked this pull request as ready for review May 5, 2026 10:40
@cooperj cooperj requested review from Copilot May 5, 2026 10:40
Contributor

Copilot AI left a comment

Pull request overview

This PR hardens the VNC container startup path so stale X11 state does not leave the service hanging indefinitely, and so Compose can gate dependent services on VNC readiness.

Changes:

  • Adds explicit X11 cleanup, TurboVNC/XFCE startup timeouts, retry-on-failure logic, and failure log output in the VNC entrypoint.
  • Adds an in-image VNC health check script and Dockerfile HEALTHCHECK for X11 and noVNC readiness.
  • Updates both Compose variants to use the new health check script with a startup grace period.
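
The timeout and retry-once behaviour summarised above can be sketched as follows. This is an illustration under stated assumptions, not the entrypoint from this PR: `wait_for`, `start_with_retry`, and the `VNC_TIMEOUT` variable are hypothetical names, and the start and readiness commands are passed in as arguments instead of being hard-coded. The 30 s timeout, 0.5 s poll interval, and recovery message are taken from the PR description.

```shell
# wait_for TIMEOUT_S CMD...: poll CMD every 0.5 s until it succeeds or TIMEOUT_S elapses.
wait_for() {
    local attempts=$(( $1 * 2 ))   # 0.5 s poll interval gives two attempts per second
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 0.5
    done
    return 1
}

# start_with_retry START_CMD READY_CMD: start, wait for readiness, retry once on timeout.
# A real entrypoint would also dump logs and re-run cleanup between attempts.
start_with_retry() {
    local start_cmd="$1"
    local ready_cmd="$2"
    local attempt
    for attempt in 1 2; do
        "$start_cmd"
        if wait_for "${VNC_TIMEOUT:-30}" "$ready_cmd"; then
            echo "display is up"
            return 0
        fi
        echo "attempt $attempt timed out" >&2
    done
    echo "To recover manually, remove the x11 volume and restart:" >&2
    echo "  docker compose down -v && docker compose up" >&2
    return 1
}
```

Bounding the loop by an attempt count instead of wall-clock time keeps the sketch simple, at the cost of running slightly over the nominal timeout when the readiness command itself is slow.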

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| vnc.dockerfile | Copies the healthcheck script into the VNC image and adds an image-level HEALTHCHECK. |
| docker/vnc-healthcheck.sh | New healthcheck script that probes the X display and the noVNC HTTP endpoint. |
| docker/vnc-entrypoint.sh | Reworks VNC startup with X11 cleanup, timeouts, retry handling, and failure diagnostics. |
| compose.yml | Switches the VNC service healthcheck to the bundled script and adds start_period. |
| compose.cuda.yml | Mirrors the Compose healthcheck update for the CUDA variant. |

Comment thread docker/vnc-entrypoint.sh
echo "display is up"
if [ -e /tmp/.X1-lock ]; then
echo " [x11] Found stale lock file /tmp/.X1-lock — removing..."
rm -f /tmp/.X1-lock || sudo rm -f /tmp/.X1-lock
Comment thread docker/vnc-entrypoint.sh

if [ -e /tmp/.X11-unix/X1 ]; then
echo " [x11] Found stale socket /tmp/.X11-unix/X1 — removing..."
rm -f /tmp/.X11-unix/X1 || sudo rm -f /tmp/.X11-unix/X1
Comment thread docker/vnc-entrypoint.sh Outdated
Comment on lines +36 to +39
if $had_stale; then
echo " [x11] Stale X11 files from previous run cleared successfully."
else
echo " [x11] No stale X11 files found."
Comment thread docker/vnc-entrypoint.sh Outdated
Comment thread docker/vnc-entrypoint.sh Outdated
Comment on lines +95 to +96
echo "To recover manually, remove the x11 volume and restart:" >&2
echo " docker compose down -v && docker compose up" >&2
cooperj and others added 3 commits May 5, 2026 12:05
This allows us to clear the x11 volume on crashes

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@cooperj cooperj merged commit 9b774da into main May 5, 2026
7 checks passed
@cooperj cooperj deleted the copilot/fix-vnc-container-start-issue branch May 5, 2026 12:53

Development

Successfully merging this pull request may close these issues.

VNC container fails to start if x11 volume is in corrupt state

3 participants