Skip to content

NE-2664: add option to separate router and haproxy containers#772

Open
jcmoraisjr wants to merge 1 commit intoopenshift:masterfrom
jcmoraisjr:NE-2664-haproxy-sidecar
Open

NE-2664: add option to separate router and haproxy containers#772
jcmoraisjr wants to merge 1 commit intoopenshift:masterfrom
jcmoraisjr:NE-2664-haproxy-sidecar

Conversation

@jcmoraisjr
Copy link
Copy Markdown
Member

@jcmoraisjr jcmoraisjr commented May 7, 2026

Add an opt-in configuration that informs the router that HAProxy runs on a sidecar container, so its process should be managed via its admin socket.

Add also a script that manages signals sent from kubelet to HAProxy. The script should be the entry point of the container in order to capture terminating signals, converting to SIGUSR1, and than into SIGTERM when the timeout expires. The former asks HAProxy to finish as soon as all the current connections close, and the later asks HAProxy to close (FIN) connections and finish just after receiving FIN+ACK from both sides of all connections.

The script should be moved later to a new HAProxy image, separated from the router one.

https://redhat.atlassian.net/browse/NE-2664

Summary by CodeRabbit

Release Notes

  • New Features
    • Added graceful shutdown capability for HAProxy with configurable delays, enabling safer application updates and deployments
    • Enhanced router reload functionality to support both embedded and external HAProxy control methods
    • Improved reliability of reload operations with validation of success and enhanced error reporting
    • Better process lifecycle management with structured logging of router operations

Add an opt-in configuration that informs the router that HAProxy runs on
a sidecar container, so its process should be managed via its admin
socket.

Add also a script that manages signals sent from kubelet to HAProxy. The
script should be the entry point of the container in order to capture
terminating signals, converting to SIGUSR1, and than into SIGTERM when
the timeout expires. The former asks HAProxy to finish as soon as all the
current connections close, and the later asks HAProxy to close (FIN)
connections and finish just after receiving FIN+ACK from both sides of
all connections.

The script should be moved to a new HAProxy image, separated from the
router one.

https://redhat.atlassian.net/browse/NE-2664
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented May 7, 2026

@jcmoraisjr: This pull request references NE-2664 which is a valid jira issue.

Details

In response to this:

Add an opt-in configuration that informs the router that HAProxy runs on a sidecar container, so its process should be managed via its admin socket.

Add also a script that manages signals sent from kubelet to HAProxy. The script should be the entry point of the container in order to capture terminating signals, converting to SIGUSR1, and than into SIGTERM when the timeout expires. The former asks HAProxy to finish as soon as all the current connections close, and the later asks HAProxy to close (FIN) connections and finish just after receiving FIN+ACK from both sides of all connections.

The script should be moved later to a new HAProxy image, separated from the router one.

https://redhat.atlassian.net/browse/NE-2664

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 7, 2026

Walkthrough

This PR adds HAProxy graceful shutdown capability via a new entrypoint script and refactors the router's reload logic to support both embedded and external configuration reload via HAProxy's admin socket.

Changes

HAProxy Graceful Shutdown Entrypoint

Layer / File(s) Summary
Process Setup
images/router/haproxy/start-haproxy
New Bash entrypoint validates ROUTER_GRACEFUL_SHUTDOWN_DELAY format (^[0-9]+s$), starts HAProxy in background, and captures its PID.
Signal Handling & Graceful Shutdown
images/router/haproxy/start-haproxy
Installs traps for SIGTERM, SIGUSR1, SIGINT, and SIGHUP. stopHAProxy function sends SIGUSR1 first, waits up to the timeout, then sends SIGTERM on timeout. SIGHUP is forwarded directly to HAProxy.
Process Monitoring & Exit
images/router/haproxy/start-haproxy
Polls process liveness using kill -0 and wait in a loop, then exits with HAProxy's recorded status code (or 0 if stopped via signal).

Router Configuration Reload Logic

Layer / File(s) Summary
Import Additions
pkg/router/template/router.go
Added context import and external reload support imports: go-haproxy, k8s.io/apimachinery/pkg/util/wait.
Reload Dispatcher
pkg/router/template/router.go
New reloadRouter function routes between embedded reload and external HAProxy admin socket reload based on ROUTER_HAPROXY_ADMIN_UNIX_SOCKET environment variable.
Embedded Reload Path
pkg/router/template/router.go
Updated embedded reload to log structured success message with "embedded" mode indicator and command output; preserves shutdown behavior via ROUTER_SHUTDOWN=true.
External Reload Path
pkg/router/template/router.go
New reloadRouterExternal waits for HAProxy admin socket, issues reload command via haproxy.HAProxyClient with timeout, validates Success=1 output prefix, and returns descriptive errors on failure.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main changes: adding HAProxy sidecar container separation with a new entrypoint script and reload logic changes to support external HAProxy management via admin socket.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR does not modify any Ginkgo tests. The repository uses standard Go testing, not Ginkgo. No test files were added or modified.
Test Structure And Quality ✅ Passed The PR contains no Ginkgo test files. Changes are limited to a Bash script (start-haproxy) and Go source code (router.go). The custom check for Ginkgo test structure and quality is not applicable.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. The changes consist of a Bash startup script and Go source code modifications for router reload logic. The custom check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR adds no Ginkgo e2e tests. Changes are production code only: a Bash script and a Go source file with router reload logic. SNO compatibility check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds a shell script entrypoint and refactors router reload logic. No deployment manifests, scheduling constraints, or topology assumptions are introduced.
Ote Binary Stdout Contract ✅ Passed Not applicable. openshift-router is a standard Kubernetes application, not an OTE test binary. New code uses structured logging only. The Bash script is a separate container entrypoint.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR does not add any Ginkgo e2e tests. Changes are a Bash entrypoint script and router config code. Check applies only to e2e tests, so it is not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign candita for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@images/router/haproxy/start-haproxy`:
- Line 22: The unguarded kill invocations using haproxyPID (the calls to "kill
-s USR1 $haproxyPID" and "kill -s TERM $haproxyPID") will abort the script under
set -e if HAProxy already exited; change them to tolerate a missing PID by
either checking the process first (use "kill -0 $haproxyPID" and only send the
signal if that succeeds) or temporarily disable exit-on-error around the signal
send (e.g., set +e; kill -s USR1 $haproxyPID || true; set -e) so the kill in the
shutdown/poll loop (the code that currently uses kill -0 for polling) does not
cause the script to terminate on a race.

In `@pkg/router/template/router.go`:
- Around line 706-713: The poll currently uses context.Background() and discards
its error, which can block reloads forever; change reloadRouterExternal to
accept a ctx context.Context (threaded from the caller), pass that ctx into
wait.PollUntilContextCancel instead of context.Background(), capture the
returned error (remove the "_ =" discard), and if the poll returns an error or
context cancellation, abort/return the error before calling RunCommand against
adminSocket so you don't attempt to use a missing socket; update callers to
propagate the application context into reloadRouterExternal.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 74d9dbab-230a-49fa-b4eb-04fa960ee6bc

📥 Commits

Reviewing files that changed from the base of the PR and between b75bab2 and 70a6f54.

📒 Files selected for processing (2)
  • images/router/haproxy/start-haproxy
  • pkg/router/template/router.go

# HAProxy handles SIGUSR1 by finishing its process as soon as all the current connections,
# active or not, are closed by either the client or the backend server.
echo "Sending SIGUSR1 to HAProxy process $haproxyPID"
kill -s USR1 $haproxyPID
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

kill commands are unguarded under set -e — risk of premature script termination

set -euo pipefail is active for the whole script. Both kill -s USR1 $haproxyPID (line 22) and kill -s TERM $haproxyPID (line 37) will cause the script to abort if HAProxy has already exited. The race on line 37 is realistically reachable: HAProxy can drain and die naturally during the kill -0 poll loop, then the loop falls through by timeout — at that point HAProxy is already gone and kill -s TERM fails. A similar, narrower race exists at line 22.

🐛 Proposed fix
-    kill -s USR1 $haproxyPID
+    kill -s USR1 $haproxyPID 2>/dev/null || true
-    kill -s TERM $haproxyPID
+    kill -s TERM $haproxyPID 2>/dev/null || true

Also applies to: 37-37

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@images/router/haproxy/start-haproxy` at line 22, The unguarded kill
invocations using haproxyPID (the calls to "kill -s USR1 $haproxyPID" and "kill
-s TERM $haproxyPID") will abort the script under set -e if HAProxy already
exited; change them to tolerate a missing PID by either checking the process
first (use "kill -0 $haproxyPID" and only send the signal if that succeeds) or
temporarily disable exit-on-error around the signal send (e.g., set +e; kill -s
USR1 $haproxyPID || true; set -e) so the kill in the shutdown/poll loop (the
code that currently uses kill -0 for polling) does not cause the script to
terminate on a race.

Comment on lines +706 to +713
_ = wait.PollUntilContextCancel(context.Background(), 2*time.Second, true, func(ctx context.Context) (done bool, err error) {
_, errstat := os.Lstat(adminSocket)
if errstat != nil {
log.Info("waiting for haproxy socket", "message", errstat.Error())
return false, nil
}
return true, nil
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Unbounded socket poll blocks the reload goroutine permanently if HAProxy's admin socket never appears

context.Background() never cancels, so if the sidecar never creates the socket (crash, misconfigured path, wrong volume mount), wait.PollUntilContextCancel loops forever and the router's commit/reload flow is permanently blocked — no further route changes will ever be applied.

Additionally, the error result is discarded (_ =); even after a proper context is plumbed in, a cancelled poll will be silently ignored and execution will proceed to RunCommand against a socket that may not exist.

🛡️ Proposed fix
-    // TODO missing application's context
-    _ = wait.PollUntilContextCancel(context.Background(), 2*time.Second, true, func(ctx context.Context) (done bool, err error) {
+    pollCtx, cancel := context.WithTimeout(ctx, 30*time.Second) // propagate caller's context; add a sensible upper bound
+    defer cancel()
+    if err := wait.PollUntilContextCancel(pollCtx, 2*time.Second, true, func(ctx context.Context) (done bool, err error) {
         _, errstat := os.Lstat(adminSocket)
         if errstat != nil {
             log.Info("waiting for haproxy socket", "message", errstat.Error())
             return false, nil
         }
         return true, nil
-    })
+    }); err != nil {
+        return fmt.Errorf("timed out waiting for haproxy admin socket %q: %w", adminSocket, err)
+    }

The caller (reloadRouterExternal) would need to accept a ctx context.Context parameter, threaded from the application-level context.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/router.go` around lines 706 - 713, The poll currently
uses context.Background() and discards its error, which can block reloads
forever; change reloadRouterExternal to accept a ctx context.Context (threaded
from the caller), pass that ctx into wait.PollUntilContextCancel instead of
context.Background(), capture the returned error (remove the "_ =" discard), and
if the poll returns an error or context cancellation, abort/return the error
before calling RunCommand against adminSocket so you don't attempt to use a
missing socket; update callers to propagate the application context into
reloadRouterExternal.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

@jcmoraisjr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-serial-2of2 70a6f54 link true /test e2e-aws-serial-2of2
ci/prow/e2e-aws-serial-1of2 70a6f54 link true /test e2e-aws-serial-1of2
ci/prow/e2e-upgrade 70a6f54 link true /test e2e-upgrade
ci/prow/e2e-aws-fips 70a6f54 link true /test e2e-aws-fips
ci/prow/e2e-agnostic 70a6f54 link true /test e2e-agnostic

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants