feat: update spec when db versions change#377
Conversation
📝 Walkthrough

Adds a leader-backed periodic databases monitor that reconciles instance Postgres/Spock versions into stored instance records and specs, introduces parsing/normalization for pgEdge versions, wires up monitoring/election and logging, extends storage txn comparisons, and includes tests, docs, and changelog entries.

Database Version Reconciliation & Monitoring
🚥 Pre-merge checks: ✅ 4 passed · ❌ 1 failed (1 warning)
Up to standards ✅

🟢 Issues

| Category | Results |
|---|---|
| Complexity | 3 medium |

🟢 Metrics

| Metric | Results |
|---|---|
| Complexity | 52 |
| Duplication | 3 |
Actionable comments posted: 3
🧹 Nitpick comments (1)
clustertest/utils_test.go (1)
163-176: ⚡ Quick win — dockerCmd: add a context/timeout and surface captured output on failure

Two independent issues:

1. No timeout — `exec.Command` has no deadline. A hung Docker daemon or a blocking `docker service update` will stall the entire test suite indefinitely. Compare with `buildImage`, which wraps its `exec.Command` in a 300 s `CommandContext`. The test receives `t testing.TB`, so `t.Context()` or an explicit background timeout can be threaded through.
2. Docker output dropped on failure — stderr/stdout are captured into `out`, but `require.NoError(t, err)` only surfaces the OS error string (e.g., `"exit status 1"`). The actual Docker error message is silently discarded, making CI failures much harder to diagnose.

🔧 Proposed fix

```diff
-func dockerCmd(t testing.TB, args ...string) string {
+func dockerCmd(t testing.TB, ctx context.Context, args ...string) string {
 	t.Helper()
 	tLogf(t, "executing command: docker %s", strings.Join(args, " "))
 	var out strings.Builder
-	cmd := exec.Command("docker", args...)
+	cmd := exec.CommandContext(ctx, "docker", args...)
 	cmd.Stdout = &out
 	cmd.Stderr = &out
 	err := cmd.Run()
-	require.NoError(t, err)
+	require.NoError(t, err, "docker %s output:\n%s", strings.Join(args, " "), out.String())
 	return strings.TrimSpace(out.String())
 }
```

Callers can pass `t.Context()` for a test-lifetime bound, or a tighter `context.WithTimeout(context.Background(), 60*time.Second)` for specific operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 140-142: The tLogf call in the error handling for database
deletion uses the fmt.Errorf-specific %w verb which tLogf (a printf-style
wrapper around t.Logf) doesn't support; update the format string in the error
log that references databaseID and err (the call using tLogf) to use %v for the
error value instead of %w so the error message prints correctly (i.e., change
the format argument in the tLogf call that logs "failed to delete database '%s':
..." to use %v for err).
In `@lima/roles/install_prerequisites/tasks/main.yaml`:
- Around line 18-33: The two yum_repository tasks named "Add pgEdge old
repository" and "Add pgEdge noarch old repository" disable package signature
verification by setting gpgcheck: no and hardcode OS major as 9 in baseurl;
update these tasks to (1) avoid gpgcheck: no — instead set repo_gpgcheck: no if
you only want to skip repo metadata but keep package RPM signature verification,
or keep gpgcheck and add sslverify: yes to assert HTTPS validation, and (2) make
the baseurl dynamic by replacing the literal 9 with
{{ansible_facts['distribution_major_version']}} so the repository URL matches
the host OS major version. Ensure the updated fields are applied in the tasks
with those task names.
In `@server/internal/monitor/provide.go`:
- Around line 60-61: The code unconditionally creates an election candidate (via
electionSvc.NewCandidate) which opens an etcd watch and is never stopped if
DatabasesMonitorIntervalSeconds == 0 because DatabasesMonitor is not created and
candidate.Start/Stop are never called; to fix, avoid creating the candidate when
the databases monitor is disabled by only constructing the candidate when
cfg.DatabasesMonitorIntervalSeconds > 0 (i.e., move the electionSvc.NewCandidate
call behind that conditional), or alternatively change NewCandidate to defer
starting/creating the watch until Candidate.Start is invoked and ensure
Candidate.Stop can safely close any resources even if Start was never called;
update provide.go to create/pass a nil/no-op candidate to NewService when the
monitor is disabled and reference electionSvc.NewCandidate, Candidate.Start,
Candidate.Stop, DatabasesMonitor and DatabasesMonitorIntervalSeconds in your
change.
---
Nitpick comments:
In `@clustertest/utils_test.go`:
- Around line 163-176: The dockerCmd function lacks a context/timeout and
discards Docker output on failure; update dockerCmd to accept/use a context (use
t.Context() when available or wrap with context.WithTimeout(ctx,
60*time.Second)) and run exec.CommandContext instead of exec.Command, keep
capturing stdout/stderr into the out buffer, and when the command returns an
error include the trimmed out.String() in the test failure (e.g., via
require.NoErrorf/require.Failf or t.Fatalf) so the Docker error message is
surfaced; reference the dockerCmd function and model the pattern after
buildImage which uses CommandContext and a 300s timeout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: da1e0e34-a22e-484a-9dde-4138a08808f5
📒 Files selected for processing (32)
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/config/config.go
- server/internal/database/instance.go
- server/internal/database/instance_store.go
- server/internal/database/provide.go
- server/internal/database/reconcile_versions.go
- server/internal/database/reconcile_versions_test.go
- server/internal/database/service.go
- server/internal/database/spec.go
- server/internal/database/status.go
- server/internal/ds/versions.go
- server/internal/ds/versions_test.go
- server/internal/host/host_test.go
- server/internal/logging/factory.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/host_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- server/internal/monitor/service.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/orchestrator/swarm/images.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
- server/internal/storage/interface.go
- server/internal/storage/txn.go
- server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
- server/internal/database/status.go
Force-pushed from 99a663b to 0efec38 (Compare)
Instances are updated sequentially and switchovers can happen during that sequence. We want to use the most recently fetched primary instance ID in the node resource to account for those switchovers. This change is backward compatible and does not need any migration. PLAT-512
This test still causes issues occasionally due to how quickly it cancels the task after starting it. This sleep is a workaround until we're able to find and fix the underlying issue. PLAT-512
Improves the ds.PgEdgeVersion initializers by:
- Renaming `NewPgEdgeVersion`, which parses a `PgEdgeVersion` from strings, to `ParsePgEdgeVersion`.
- Adding a method to normalize `PgEdgeVersion`s so that the Postgres version contains major and minor components and the Spock version contains just a major component. This matches the way that versions are currently provided in our API.

PLAT-512
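The renamed parser and the new normalization method could look roughly like this sketch. The separator format, field names, and `keepComponents` helper are assumptions for illustration, not the real `ds` package API:

```go
package main

import (
	"fmt"
	"strings"
)

// PgEdgeVersion pairs a Postgres version with a Spock version.
// Field names are illustrative, not the real ds package types.
type PgEdgeVersion struct {
	Postgres string // e.g. "17.2.1"
	Spock    string // e.g. "4.0.1"
}

// ParsePgEdgeVersion splits a combined version string into its
// Postgres and Spock parts. The "+spock" separator is assumed here.
func ParsePgEdgeVersion(s string) (PgEdgeVersion, error) {
	pg, spock, ok := strings.Cut(s, "+spock")
	if !ok {
		return PgEdgeVersion{}, fmt.Errorf("invalid pgEdge version %q", s)
	}
	return PgEdgeVersion{Postgres: pg, Spock: spock}, nil
}

// Normalize trims the Postgres version to major.minor and the Spock
// version to just its major component, matching how the API reports them.
func (v PgEdgeVersion) Normalize() PgEdgeVersion {
	return PgEdgeVersion{
		Postgres: keepComponents(v.Postgres, 2),
		Spock:    keepComponents(v.Spock, 1),
	}
}

// keepComponents keeps at most n dot-separated components of a version.
func keepComponents(version string, n int) string {
	parts := strings.Split(version, ".")
	if len(parts) > n {
		parts = parts[:n]
	}
	return strings.Join(parts, ".")
}

func main() {
	v, _ := ParsePgEdgeVersion("17.2.1+spock4.0.1")
	fmt.Println(v.Normalize()) // {17.2 4}
}
```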
Adds an `AddConditions` method to `storage.Txn` to make it possible to add additional constraints to a transaction. This enables us to enforce other constraints that are outside of the values we're updating. PLAT-512
Modifies the existing host monitor to also update the host record periodically. This accounts for system changes that happen while the Control Plane is running, such as system package updates. PLAT-512
PLAT-512
Updates the existing instance monitor to query and record the Postgres and Spock versions from replica instances. We'll need this to determine when a node has been updated by an external process, such as a system package update. PLAT-512
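As a small illustration of the version-recording step, here is a hedged sketch of extracting a reportable version from the string Postgres returns for `SHOW server_version`, which can carry build metadata. The helper name is hypothetical; the real monitor may query the versions differently (e.g., via `server_version_num`):

```go
package main

import (
	"fmt"
	"strings"
)

// parseServerVersion extracts the numeric portion from the value of
// `SHOW server_version`, e.g. "16.4 (Ubuntu 16.4-1.pgdg22.04+1)" -> "16.4".
// This is an illustrative sketch, not the instance monitor's actual code.
func parseServerVersion(raw string) string {
	// The numeric version is the first space-delimited token.
	version, _, _ := strings.Cut(strings.TrimSpace(raw), " ")
	return version
}

func main() {
	fmt.Println(parseServerVersion("16.4 (Ubuntu 16.4-1.pgdg22.04+1)")) // 16.4
	fmt.Println(parseServerVersion("17.2"))                             // 17.2
}
```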
Adds a new database service method to reconcile the Postgres and Spock versions that we observe through the instance monitors with the versions in the database spec. The domain logic of this process is split into a standalone function with unit tests to enforce its behavior. The service method uses a transaction with additional constraints to handle conflicts with any database operations that run while we're performing this check. The summary of this new behavior is that when an instance is updated, we update its corresponding `Instance` record in our database. If all of the instances have been updated for a node, we update the node's entry in the database spec. If all of the nodes have been updated, we normalize the database spec by setting the top-level version fields and zeroing out the node-level version fields. PLAT-512
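The node/spec promotion rule described above can be sketched as a pure function. All type and field names here are illustrative stand-ins for the real spec types, and the sketch collapses the per-instance step into per-node observed versions:

```go
package main

import "fmt"

// Spec is a toy stand-in for the database spec: a top-level version
// plus optional per-node version overrides.
type Spec struct {
	Version      string            // top-level version
	NodeVersions map[string]string // node-level version overrides
}

// reconcileSpec applies the promotion rule: while nodes disagree, record
// each node's observed version; once every node reports the same version,
// set the top-level field and zero out the node-level entries.
func reconcileSpec(spec Spec, observed map[string]string) Spec {
	allSame := true
	var first string
	for _, v := range observed {
		if first == "" {
			first = v
		} else if v != first {
			allSame = false
		}
	}
	if allSame && first != "" {
		return Spec{Version: first, NodeVersions: map[string]string{}}
	}
	out := Spec{Version: spec.Version, NodeVersions: map[string]string{}}
	for node, v := range observed {
		out.NodeVersions[node] = v
	}
	return out
}

func main() {
	spec := Spec{Version: "16.3"}
	// Mid-upgrade: versions diverge, so keep per-node entries.
	mid := reconcileSpec(spec, map[string]string{"n1": "16.4", "n2": "16.3"})
	fmt.Println(mid.Version, mid.NodeVersions["n1"], mid.NodeVersions["n2"])
	// All nodes upgraded: normalize to the top-level field.
	done := reconcileSpec(spec, map[string]string{"n1": "16.4", "n2": "16.4"})
	fmt.Println(done.Version, len(done.NodeVersions))
}
```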
Adds a 'databases' monitor to periodically run the database version reconciliation process. The monitor's interval is configurable and can be set to '0' to disable it entirely. This setting is useful in testing and in clusters with a large number of databases where the operator knows that the database versions will never change outside of the Control Plane API. PLAT-512
Force-pushed from 2742e61 to 2ba5138 (Compare)
Actionable comments posted: 5
🧹 Nitpick comments (2)
e2e/cancel_task_test.go (1)
54-58: ⚡ Quick win — Replace fixed sleep with condition-based waiting to reduce flakiness

Line 57 introduces a hard-coded delay that can still race in CI and makes the test slower than needed. Prefer waiting until the task reaches a cancelable state (or until `CancelDatabaseTask` succeeds) with a bounded timeout.

Proposed test-stability refactor

```diff
-	// TODO: even after many attempts at fixing this, rapidly cancelling a task
-	// still occasionally causes problems. This sleep is a workaround until
-	// we're able to track down the issue.
-	time.Sleep(500 * time.Millisecond)
+	// Wait until cancellation becomes reliably accepted instead of sleeping a fixed duration.
+	require.Eventually(t, func() bool {
+		_, err := fixture.Client.CancelDatabaseTask(t.Context(), &controlplane.CancelDatabaseTaskPayload{
+			DatabaseID: database.ID,
+			TaskID:     controlplane.Identifier(creation_task.TaskID),
+		})
+		return err == nil
+	}, 10*time.Second, 200*time.Millisecond, "task did not become cancelable in time")
```

server/internal/database/node_resource.go (1)
86-95: 💤 Low value — Optional: extract the comparison into a tiny helper for readability.

Not blocking — the loop body now mixes "skip" filtering with "is-newer" tracking. A small helper (e.g., `isMoreRecent(prevID string, prevAt, curAt time.Time) bool`) or a comment-anchored `latestPrimary` struct would make intent obvious if this pattern grows (e.g., once the version-reconciliation monitor lands and similar timestamp-driven picks appear elsewhere).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 264-265: The call to waitForTaskComplete (waiting on
updateResp.Task.TaskID) assigns err but never checks it; add an assertion
immediately after that call to fail the test if err is non-nil (e.g., use
t.Fatalf/t.Fatalf-like or require.NoError) so a timeout or failure is not
ignored before the subsequent assertions that depend on the task completing.
Ensure the check references the err returned from waitForTaskComplete(ctx,
cluster.Client(), databaseID, updateResp.Task.TaskID, 3*time.Minute).
- Around line 134-145: Replace the unbounded ctx := context.Background() used
for the cleanup DeleteDatabase call with a bounded context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>); defer
cancel()) so cluster.Client().DeleteDatabase(...) cannot hang indefinitely;
update the DeleteDatabase call site that uses ctx and add the necessary time
import, referencing the ctx variable and the DeleteDatabase method and
databaseID/testConfig.skipCleanup flow.
In `@docs/installation/systemd.md`:
- Around line 248-250: Replace the fenced code blocks inside the numbered steps
with indented code blocks (prefix each command line with four spaces) so
markdownlint MD046 no longer flags nested fenced blocks; specifically update the
snippet containing "sudo dnf upgrade pgedge-postgresql18" and the other
occurrences referenced (around the blocks at the next two locations) to use
indented code formatting instead of ``` fences.
- Around line 28-30: Change the phrase "Minor versions upgrades" to "Minor
version upgrades" in the limitations paragraph (the sentence currently beginning
"Database upgrades are not supported via the API. Minor versions upgrades can be
performed manually..."); update that sentence to read "Minor version upgrades
can be performed manually by a system administrator." to fix the grammar.
- Around line 268-269: The sentence hardcodes "30 seconds" even though that
delay is controlled by the configuration key databases_monitor_interval_seconds;
update the wording in docs/installation/systemd.md to reference the configured
value instead of a hardcoded time (e.g., say "up to the configured
databases_monitor_interval_seconds (defaults to 30s)" or "up to the configured
databases_monitor_interval_seconds value"), and ensure the doc mentions where to
change that setting so readers know it is configurable.
---
Nitpick comments:
In `@e2e/cancel_task_test.go`:
- Around line 54-58: Replace the fixed time.Sleep with a bounded,
condition-based wait: inside the test (cancel_task_test) poll in a loop with a
short backoff until either CancelDatabaseTask returns success or the task's
status indicates it is cancellable (use the existing CancelDatabaseTask and the
task-status getter used elsewhere, e.g., GetDatabaseTask/GetTaskStatus),
breaking on success or when a context timeout (e.g., 2–5s) elapses; return/fail
the test on timeout. Ensure the loop uses a small sleep (10–50ms) between
attempts to avoid busy-waiting and that the timeout is enforced via
context.WithTimeout or time.After to keep the test bounded.
In `@server/internal/database/node_resource.go`:
- Around line 86-95: Extract the "is newer" logic into a small helper to improve
readability: create a helper like isMoreRecent(prevID string, prevAt, curAt
time.Time) bool (or a tiny latestPrimary struct with an UpdateIfNewer method)
and replace the inline condition in the loop that compares
instance.PrimaryInstanceIDUpdatedAt with primaryInstanceUpdatedAt and sets
primaryInstanceID/primaryInstanceUpdatedAt; keep the existing skip/continue for
empty PrimaryInstanceID and call the helper to decide whether to assign the
current instance's PrimaryInstanceID and PrimaryInstanceIDUpdatedAt.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 746ee34d-9c87-4823-8c88-e174405bfca8
📒 Files selected for processing (36)
- Makefile
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- e2e/cancel_task_test.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/config/config.go
- server/internal/database/instance.go
- server/internal/database/instance_resource.go
- server/internal/database/instance_store.go
- server/internal/database/node_resource.go
- server/internal/database/provide.go
- server/internal/database/reconcile_versions.go
- server/internal/database/reconcile_versions_test.go
- server/internal/database/service.go
- server/internal/database/spec.go
- server/internal/database/status.go
- server/internal/ds/versions.go
- server/internal/ds/versions_test.go
- server/internal/host/host_test.go
- server/internal/logging/factory.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/host_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- server/internal/monitor/service.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/orchestrator/swarm/images.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
- server/internal/storage/interface.go
- server/internal/storage/txn.go
- server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
- server/internal/database/status.go
✅ Files skipped from review due to trivial changes (7)
- server/internal/storage/interface.go
- changes/unreleased/Changed-20260504-152348.yaml
- changes/unreleased/Added-20260504-150235.yaml
- server/internal/workflows/plan_update.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/host/host_test.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
🚧 Files skipped from review as they are similar to previous changes (15)
- server/internal/orchestrator/swarm/images.go
- server/internal/database/instance_store.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- clustertest/host_test.go
- server/internal/database/provide.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/database/reconcile_versions.go
- server/internal/database/instance.go
- server/internal/database/service.go
- server/internal/monitor/host_monitor.go
- server/internal/database/spec.go
- server/internal/database/reconcile_versions_test.go
- server/internal/ds/versions.go
Adds a cluster test to verify the behavior of the new database version reconciliation process. PLAT-512
pgEdge only makes the latest package versions available in the `release` repositories. Older versions of packages are only available in the `release_old` repositories. This commit adds the `release_old` repositories to the `dev-lima` environment so that we can install older versions of Postgres and other components in this environment. PLAT-512
Adds the new configuration setting and logging component to the configuration guide. PLAT-512
Updates the systemd document to describe the Postgres minor version upgrade process. PLAT-512
Adds changelog entries for db version reconciliation and the instance monitor change. PLAT-512
Force-pushed from 2ba5138 to 25094c9 (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (1)
clustertest/external_upgrade_test.go (1)
168: ⚡ Quick win — Fixed `time.Sleep` calls are fragile after async Docker service updates

`docker service update` is asynchronous — the new task may still be starting when `sleepDuration` expires. Within 5 s the test requires: the container to restart with the new image, the instance monitor to report the new version, and the databases monitor (interval = 3 s) to reconcile the spec. In a slow CI environment this is a tight race that can produce intermittent failures.

`require.Eventually` with a generous deadline and a short poll interval would remove the dependency on a fixed sleep and give clear assertion failure messages when the deadline expires:

```go
require.Eventually(t, func() bool {
	db, err = cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{
		DatabaseID: databaseID,
	})
	if err != nil {
		return false
	}
	// inline the version checks; return false on mismatch
	return specAndInstanceVersionsMatch(db, ...)
}, 30*time.Second, 500*time.Millisecond)
```

Apply the same pattern at lines 168, 198, 225, and 270.

Also applies to: 198, 225, 270
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Line 55: The assertion uses require.Equal with arguments reversed; change the
call so the expected value client.InstanceStateAvailable is the second-to-last
parameter and the runtime value instance.State is the last (i.e., call
require.Equal(t, client.InstanceStateAvailable, instance.State)) so failures
show "expected" as the known constant and "actual" as the observed
instance.State.
---
Nitpick comments:
In `@clustertest/external_upgrade_test.go`:
- Line 168: Replace the fragile time.Sleep(sleepDuration) usages with
require.Eventually polling: for each occurrence (the calls using sleepDuration
at the sites referencing databaseID, ctx, cluster.Client().GetDatabase and
specAndInstanceVersionsMatch), call require.Eventually with a generous timeout
(e.g. 30s) and short tick (e.g. 500ms) and inside the predicate perform the
cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{DatabaseID:
databaseID}), return false on error or version mismatches, and true only when
specAndInstanceVersionsMatch(...) and instance version checks pass; apply this
replacement to the four locations that currently use time.Sleep (the ones around
lines referencing sleepDuration, databaseID, GetDatabase and
specAndInstanceVersionsMatch).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a4038ff5-5bcb-44e5-a2bc-3448c7337c6d
📒 Files selected for processing (9)
- Makefile
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- lima/roles/install_prerequisites/tasks/main.yaml
Flagged lines from clustertest/external_upgrade_test.go:

```go
actualNodeHostVersions := map[string]map[string]string{}
for _, instance := range instances {
	require.Equal(t, instance.State, client.InstanceStateAvailable)
```
require.Equal arguments are in the wrong order.
Testify's require.Equal(t, expected, actual) convention expects the constant/known value first and the runtime value second. With instance.State as the first argument and client.InstanceStateAvailable as the second, a failure would read:
expected: "some_other_state", actual: "available"
which is the reverse of what you'd expect, making failures harder to diagnose.
🐛 Proposed fix
```diff
-	require.Equal(t, instance.State, client.InstanceStateAvailable)
+	require.Equal(t, client.InstanceStateAvailable, instance.State)
```
Summary
A database spec can fall out of sync with its running instances when a user upgrades packages outside of our system. We expect this to happen in systemd clusters from typical system package upgrade workflows. So, we need a way to update the database spec when we observe version changes in the running instances.
This PR adds a new monitor that loops over all databases and reconciles the observed versions with those in the instance record and the spec. This monitor uses the existing election system to ensure that only one host runs the reconciliation process.
Changes
This PR is split into granular commits. I highly recommend reviewing the commits separately rather than looking at the entire diff at once. The most notable of the commits are:
- fix: use most recently fetched primary instance id
- feat: update host in host monitor
- feat: record versions for replica instances
- feat: add service method to reconcile db versions
- feat: add databases monitor
- docs: minor postgres version upgrades for systemd

Testing
I've added a new cluster test that exercises this feature. You can also test this by hand using the instructions in this section.
Using the dev-lima environment
Using the compose environment
PLAT-512