NE-2664: add option to separate router and haproxy containers by jcmoraisjr · Pull Request #772 · openshift/router

jcmoraisjr · 2026-05-07T16:04:42Z

Add an opt-in configuration that informs the router that HAProxy runs on a sidecar container, so its process should be managed via its admin socket.

Add also a script that manages signals sent from kubelet to HAProxy. The script should be the entry point of the container in order to capture terminating signals, converting to SIGUSR1, and than into SIGTERM when the timeout expires. The former asks HAProxy to finish as soon as all the current connections close, and the later asks HAProxy to close (FIN) connections and finish just after receiving FIN+ACK from both sides of all connections.

The script should be moved later to a new HAProxy image, separated from the router one.

https://redhat.atlassian.net/browse/NE-2664

Summary by CodeRabbit

Release Notes

New Features
- Added graceful shutdown capability for HAProxy with configurable delays, enabling safer application updates and deployments
- Enhanced router reload functionality to support both embedded and external HAProxy control methods
- Improved reliability of reload operations with validation of success and enhanced error reporting
- Better process lifecycle management with structured logging of router operations

Add an opt-in configuration that informs the router that HAProxy runs on a sidecar container, so its process should be managed via its admin socket. Add also a script that manages signals sent from kubelet to HAProxy. The script should be the entry point of the container in order to capture terminating signals, converting to SIGUSR1, and than into SIGTERM when the timeout expires. The former asks HAProxy to finish as soon as all the current connections close, and the later asks HAProxy to close (FIN) connections and finish just after receiving FIN+ACK from both sides of all connections. The script should be moved to a new HAProxy image, separated from the router one. https://redhat.atlassian.net/browse/NE-2664

openshift-ci-robot · 2026-05-07T16:04:46Z

@jcmoraisjr: This pull request references NE-2664 which is a valid jira issue.

Details

In response to this:

Add an opt-in configuration that informs the router that HAProxy runs on a sidecar container, so its process should be managed via its admin socket.

Add also a script that manages signals sent from kubelet to HAProxy. The script should be the entry point of the container in order to capture terminating signals, converting to SIGUSR1, and than into SIGTERM when the timeout expires. The former asks HAProxy to finish as soon as all the current connections close, and the later asks HAProxy to close (FIN) connections and finish just after receiving FIN+ACK from both sides of all connections.

The script should be moved later to a new HAProxy image, separated from the router one.

https://redhat.atlassian.net/browse/NE-2664

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-07T16:04:57Z

Walkthrough

This PR adds HAProxy graceful shutdown capability via a new entrypoint script and refactors the router's reload logic to support both embedded and external configuration reload via HAProxy's admin socket.

Changes

HAProxy Graceful Shutdown Entrypoint

Layer / File(s)	Summary
Process Setup `images/router/haproxy/start-haproxy`	New Bash entrypoint validates `ROUTER_GRACEFUL_SHUTDOWN_DELAY` format (`^[0-9]+s$`), starts HAProxy in background, and captures its PID.
Signal Handling & Graceful Shutdown `images/router/haproxy/start-haproxy`	Installs traps for `SIGTERM`, `SIGUSR1`, `SIGINT`, and `SIGHUP`. `stopHAProxy` function sends `SIGUSR1` first, waits up to the timeout, then sends `SIGTERM` on timeout. `SIGHUP` is forwarded directly to HAProxy.
Process Monitoring & Exit `images/router/haproxy/start-haproxy`	Polls process liveness using `kill -0` and `wait` in a loop, then exits with HAProxy's recorded status code (or `0` if stopped via signal).

Router Configuration Reload Logic

Layer / File(s)	Summary
Import Additions `pkg/router/template/router.go`	Added `context` import and external reload support imports: `go-haproxy`, `k8s.io/apimachinery/pkg/util/wait`.
Reload Dispatcher `pkg/router/template/router.go`	New `reloadRouter` function routes between embedded reload and external HAProxy admin socket reload based on `ROUTER_HAPROXY_ADMIN_UNIX_SOCKET` environment variable.
Embedded Reload Path `pkg/router/template/router.go`	Updated embedded reload to log structured success message with "embedded" mode indicator and command output; preserves shutdown behavior via `ROUTER_SHUTDOWN=true`.
External Reload Path `pkg/router/template/router.go`	New `reloadRouterExternal` waits for HAProxy admin socket, issues reload command via `haproxy.HAProxyClient` with timeout, validates `Success=1` output prefix, and returns descriptive errors on failure.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 12

✅ Passed checks (12 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main changes: adding HAProxy sidecar container separation with a new entrypoint script and reload logic changes to support external HAProxy management via admin socket.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	This PR does not modify any Ginkgo tests. The repository uses standard Go testing, not Ginkgo. No test files were added or modified.
Test Structure And Quality	✅ Passed	The PR contains no Ginkgo test files. Changes are limited to a Bash script (start-haproxy) and Go source code (router.go). The custom check for Ginkgo test structure and quality is not applicable.
Microshift Test Compatibility	✅ Passed	No new Ginkgo e2e tests are added in this PR. The changes consist of a Bash startup script and Go source code modifications for router reload logic. The custom check is not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	This PR adds no Ginkgo e2e tests. Changes are production code only: a Bash script and a Go source file with router reload logic. SNO compatibility check does not apply.
Topology-Aware Scheduling Compatibility	✅ Passed	PR adds a shell script entrypoint and refactors router reload logic. No deployment manifests, scheduling constraints, or topology assumptions are introduced.
Ote Binary Stdout Contract	✅ Passed	Not applicable. openshift-router is a standard Kubernetes application, not an OTE test binary. New code uses structured logging only. The Bash script is a separate container entrypoint.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR does not add any Ginkgo e2e tests. Changes are a Bash entrypoint script and router config code. Check applies only to e2e tests, so it is not applicable.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-05-07T16:06:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign candita for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@images/router/haproxy/start-haproxy`:
- Line 22: The unguarded kill invocations using haproxyPID (the calls to "kill
-s USR1 $haproxyPID" and "kill -s TERM $haproxyPID") will abort the script under
set -e if HAProxy already exited; change them to tolerate a missing PID by
either checking the process first (use "kill -0 $haproxyPID" and only send the
signal if that succeeds) or temporarily disable exit-on-error around the signal
send (e.g., set +e; kill -s USR1 $haproxyPID || true; set -e) so the kill in the
shutdown/poll loop (the code that currently uses kill -0 for polling) does not
cause the script to terminate on a race.

In `@pkg/router/template/router.go`:
- Around line 706-713: The poll currently uses context.Background() and discards
its error, which can block reloads forever; change reloadRouterExternal to
accept a ctx context.Context (threaded from the caller), pass that ctx into
wait.PollUntilContextCancel instead of context.Background(), capture the
returned error (remove the "_ =" discard), and if the poll returns an error or
context cancellation, abort/return the error before calling RunCommand against
adminSocket so you don't attempt to use a missing socket; update callers to
propagate the application context into reloadRouterExternal.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 74d9dbab-230a-49fa-b4eb-04fa960ee6bc

📥 Commits

Reviewing files that changed from the base of the PR and between b75bab2 and 70a6f54.

📒 Files selected for processing (2)

images/router/haproxy/start-haproxy
pkg/router/template/router.go

coderabbitai · 2026-05-07T16:12:41Z

+    # HAProxy handles SIGUSR1 by finishing its process as soon as all the current connections,
+    # active or not, are closed by either the client or the backend server.
+    echo "Sending SIGUSR1 to HAProxy process $haproxyPID"
+    kill -s USR1 $haproxyPID


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

kill commands are unguarded under set -e — risk of premature script termination

set -euo pipefail is active for the whole script. Both kill -s USR1 $haproxyPID (line 22) and kill -s TERM $haproxyPID (line 37) will cause the script to abort if HAProxy has already exited. The race on line 37 is realistically reachable: HAProxy can drain and die naturally during the kill -0 poll loop, then the loop falls through by timeout — at that point HAProxy is already gone and kill -s TERM fails. A similar, narrower race exists at line 22.

🐛 Proposed fix

- kill -s USR1 $haproxyPID + kill -s USR1 $haproxyPID 2>/dev/null || true

- kill -s TERM $haproxyPID + kill -s TERM $haproxyPID 2>/dev/null || true

Also applies to: 37-37

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@images/router/haproxy/start-haproxy` at line 22, The unguarded kill invocations using haproxyPID (the calls to "kill -s USR1 $haproxyPID" and "kill -s TERM $haproxyPID") will abort the script under set -e if HAProxy already exited; change them to tolerate a missing PID by either checking the process first (use "kill -0 $haproxyPID" and only send the signal if that succeeds) or temporarily disable exit-on-error around the signal send (e.g., set +e; kill -s USR1 $haproxyPID || true; set -e) so the kill in the shutdown/poll loop (the code that currently uses kill -0 for polling) does not cause the script to terminate on a race.

coderabbitai · 2026-05-07T16:12:41Z

+	_ = wait.PollUntilContextCancel(context.Background(), 2*time.Second, true, func(ctx context.Context) (done bool, err error) {
+		_, errstat := os.Lstat(adminSocket)
+		if errstat != nil {
+			log.Info("waiting for haproxy socket", "message", errstat.Error())
+			return false, nil
+		}
+		return true, nil
+	})


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Unbounded socket poll blocks the reload goroutine permanently if HAProxy's admin socket never appears

context.Background() never cancels, so if the sidecar never creates the socket (crash, misconfigured path, wrong volume mount), wait.PollUntilContextCancel loops forever and the router's commit/reload flow is permanently blocked — no further route changes will ever be applied.

Additionally, the error result is discarded (_ =); even after a proper context is plumbed in, a cancelled poll will be silently ignored and execution will proceed to RunCommand against a socket that may not exist.

🛡️ Proposed fix

- // TODO missing application's context - _ = wait.PollUntilContextCancel(context.Background(), 2*time.Second, true, func(ctx context.Context) (done bool, err error) { + pollCtx, cancel := context.WithTimeout(ctx, 30*time.Second) // propagate caller's context; add a sensible upper bound + defer cancel() + if err := wait.PollUntilContextCancel(pollCtx, 2*time.Second, true, func(ctx context.Context) (done bool, err error) { _, errstat := os.Lstat(adminSocket) if errstat != nil { log.Info("waiting for haproxy socket", "message", errstat.Error()) return false, nil } return true, nil - }) + }); err != nil { + return fmt.Errorf("timed out waiting for haproxy admin socket %q: %w", adminSocket, err) + }

The caller (reloadRouterExternal) would need to accept a ctx context.Context parameter, threaded from the application-level context.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/template/router.go` around lines 706 - 713, The poll currently uses context.Background() and discards its error, which can block reloads forever; change reloadRouterExternal to accept a ctx context.Context (threaded from the caller), pass that ctx into wait.PollUntilContextCancel instead of context.Background(), capture the returned error (remove the "_ =" discard), and if the poll returns an error or context cancellation, abort/return the error before calling RunCommand against adminSocket so you don't attempt to use a missing socket; update callers to propagate the application context into reloadRouterExternal.

openshift-ci · 2026-05-07T20:27:45Z

@jcmoraisjr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-serial-2of2	`70a6f54`	link	true	`/test e2e-aws-serial-2of2`
ci/prow/e2e-aws-serial-1of2	`70a6f54`	link	true	`/test e2e-aws-serial-1of2`
ci/prow/e2e-upgrade	`70a6f54`	link	true	`/test e2e-upgrade`
ci/prow/e2e-aws-fips	`70a6f54`	link	true	`/test e2e-aws-fips`
ci/prow/e2e-agnostic	`70a6f54`	link	true	`/test e2e-agnostic`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026

openshift-ci Bot requested review from grzpiotrowski and ironcladlou May 7, 2026 16:06

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NE-2664: add option to separate router and haproxy containers#772

NE-2664: add option to separate router and haproxy containers#772
jcmoraisjr wants to merge 1 commit intoopenshift:masterfrom
jcmoraisjr:NE-2664-haproxy-sidecar

jcmoraisjr commented May 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-ci-robot commented May 7, 2026 •

edited by openshift-ci Bot

Loading

Uh oh!

coderabbitai Bot commented May 7, 2026 •

edited

Loading

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 7, 2026

Uh oh!

coderabbitai Bot May 7, 2026

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jcmoraisjr commented May 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

openshift-ci-robot commented May 7, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jcmoraisjr commented May 7, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented May 7, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented May 7, 2026 •

edited

Loading