NE-2664: add option to separate router and haproxy containers #772
jcmoraisjr wants to merge 1 commit into openshift:master
Conversation
Add an opt-in configuration that informs the router that HAProxy runs in a sidecar container, so its process should be managed via its admin socket. Also add a script that manages signals sent from kubelet to HAProxy. The script should be the entry point of the container in order to capture terminating signals, converting them to SIGUSR1, and then into SIGTERM when the timeout expires. The former asks HAProxy to finish as soon as all the current connections close, and the latter asks HAProxy to close (FIN) connections and finish just after receiving FIN+ACK from both sides of all connections. The script should later be moved to a new HAProxy image, separate from the router one. https://redhat.atlassian.net/browse/NE-2664
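To make the described signal flow concrete, here is a minimal sketch of what such an entrypoint could look like. Everything in it is illustrative: the config path, the HAPROXY_SHUTDOWN_TIMEOUT variable, and the default timeout are assumptions, not taken from this PR, and the guarded kill calls anticipate the review note below.

```bash
#!/bin/bash
# Illustrative sketch of a graceful-shutdown entrypoint; names, paths, and
# the default timeout are assumptions, not code from this PR.
set -euo pipefail

timeout="${HAPROXY_SHUTDOWN_TIMEOUT:-30}"

graceful_shutdown() {
  echo "Sending SIGUSR1 to HAProxy process $haproxyPID"
  # Soft stop: HAProxy finishes once every current connection has closed.
  kill -s USR1 "$haproxyPID" 2>/dev/null || true
  for _ in $(seq "$timeout"); do
    kill -0 "$haproxyPID" 2>/dev/null || return 0
    sleep 1
  done
  echo "Timeout expired, sending SIGTERM to HAProxy process $haproxyPID"
  # Hard stop: HAProxy closes the remaining connections and exits.
  kill -s TERM "$haproxyPID" 2>/dev/null || true
}

trap graceful_shutdown TERM INT

haproxy -db -f /usr/local/etc/haproxy/haproxy.cfg &
haproxyPID=$!

# wait returns early when the trapped signal arrives; wait again to reap HAProxy.
wait "$haproxyPID" || true
wait "$haproxyPID" || true
```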
@jcmoraisjr: This pull request references NE-2664 which is a valid jira issue.

In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough

This PR adds HAProxy graceful shutdown capability via a new entrypoint script and refactors the router's reload logic to support both embedded and external configuration reload via HAProxy's admin socket.

Changes
- HAProxy graceful shutdown entrypoint
- Router configuration reload logic
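For context on what "external reload via the admin socket" can look like: with HAProxy running in master-worker mode (-W) and a master CLI socket (-S), a reload can be requested over that socket on recent HAProxy versions. A hedged sketch follows; the socket path is an assumption, and the PR itself drives this through the router's RunCommand against adminSocket rather than socat.

```bash
# Sketch only: the socket path is an assumption, not taken from this PR.
# Assumes HAProxy was started in master-worker mode with a master CLI socket:
#   haproxy -W -S /var/lib/haproxy/run/haproxy-master.sock -f /etc/haproxy/haproxy.cfg
echo "reload" | socat stdio UNIX-CONNECT:/var/lib/haproxy/run/haproxy-master.sock
```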
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ Passed checks (12 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@images/router/haproxy/start-haproxy`:
- Line 22: The unguarded kill invocations using haproxyPID (the calls to "kill
-s USR1 $haproxyPID" and "kill -s TERM $haproxyPID") will abort the script under
set -e if HAProxy already exited; change them to tolerate a missing PID by
either checking the process first (use "kill -0 $haproxyPID" and only send the
signal if that succeeds) or temporarily disable exit-on-error around the signal
send (e.g., set +e; kill -s USR1 $haproxyPID || true; set -e) so the kill in the
shutdown/poll loop (the code that currently uses kill -0 for polling) does not
cause the script to terminate on a race.
In `@pkg/router/template/router.go`:
- Around line 706-713: The poll currently uses context.Background() and discards
its error, which can block reloads forever; change reloadRouterExternal to
accept a ctx context.Context (threaded from the caller), pass that ctx into
wait.PollUntilContextCancel instead of context.Background(), capture the
returned error (remove the "_ =" discard), and if the poll returns an error or
context cancellation, abort/return the error before calling RunCommand against
adminSocket so you don't attempt to use a missing socket; update callers to
propagate the application context into reloadRouterExternal.
📒 Files selected for processing (2)
- images/router/haproxy/start-haproxy
- pkg/router/template/router.go
```bash
# HAProxy handles SIGUSR1 by finishing its process as soon as all the current connections,
# active or not, are closed by either the client or the backend server.
echo "Sending SIGUSR1 to HAProxy process $haproxyPID"
kill -s USR1 $haproxyPID
```
kill commands are unguarded under set -e — risk of premature script termination
set -euo pipefail is active for the whole script. Both kill -s USR1 $haproxyPID (line 22) and kill -s TERM $haproxyPID (line 37) will cause the script to abort if HAProxy has already exited. The race on line 37 is realistically reachable: HAProxy can drain and die naturally during the kill -0 poll loop, then the loop falls through by timeout — at that point HAProxy is already gone and kill -s TERM fails. A similar, narrower race exists at line 22.
🐛 Proposed fix
```diff
-kill -s USR1 $haproxyPID
+kill -s USR1 $haproxyPID 2>/dev/null || true
```

```diff
-kill -s TERM $haproxyPID
+kill -s TERM $haproxyPID 2>/dev/null || true
```

Also applies to: 37-37
```go
_ = wait.PollUntilContextCancel(context.Background(), 2*time.Second, true, func(ctx context.Context) (done bool, err error) {
	_, errstat := os.Lstat(adminSocket)
	if errstat != nil {
		log.Info("waiting for haproxy socket", "message", errstat.Error())
		return false, nil
	}
	return true, nil
})
```
Unbounded socket poll blocks the reload goroutine permanently if HAProxy's admin socket never appears
context.Background() never cancels, so if the sidecar never creates the socket (crash, misconfigured path, wrong volume mount), wait.PollUntilContextCancel loops forever and the router's commit/reload flow is permanently blocked — no further route changes will ever be applied.
Additionally, the error result is discarded (_ =); even after a proper context is plumbed in, a cancelled poll will be silently ignored and execution will proceed to RunCommand against a socket that may not exist.
🛡️ Proposed fix
```diff
-	// TODO missing application's context
-	_ = wait.PollUntilContextCancel(context.Background(), 2*time.Second, true, func(ctx context.Context) (done bool, err error) {
+	pollCtx, cancel := context.WithTimeout(ctx, 30*time.Second) // propagate caller's context; add a sensible upper bound
+	defer cancel()
+	if err := wait.PollUntilContextCancel(pollCtx, 2*time.Second, true, func(ctx context.Context) (done bool, err error) {
 		_, errstat := os.Lstat(adminSocket)
 		if errstat != nil {
 			log.Info("waiting for haproxy socket", "message", errstat.Error())
 			return false, nil
 		}
 		return true, nil
-	})
+	}); err != nil {
+		return fmt.Errorf("timed out waiting for haproxy admin socket %q: %w", adminSocket, err)
+	}
```
+ }The caller (reloadRouterExternal) would need to accept a ctx context.Context parameter, threaded from the application-level context.
@jcmoraisjr: The following tests failed:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.