
feat: update spec when db versions change #377

Open
jason-lynch wants to merge 14 commits into main from
feat/PLAT-512/update-database-version-ii

Conversation

@jason-lynch
Member

@jason-lynch jason-lynch commented May 4, 2026

Summary

A database spec can fall out of sync with its running instances when a user upgrades packages outside of our system. We expect this to happen in systemd clusters from typical system package upgrade workflows. So, we need a way to update the database spec when we observe version changes in the running instances.

This PR adds a new monitor that loops over all databases and reconciles the observed versions with those in the instance record and the spec. This monitor uses the existing election system to ensure that only one host runs the reconciliation process.
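For context, here is a minimal sketch of what a leader-gated reconciliation loop looks like. The type and method names (leaderCandidate, databaseReconciler, the logger wiring) are simplified stand-ins for illustration and are not the exact code in this PR:

```go
package monitor

import (
	"context"
	"log/slog"
	"time"
)

// leaderCandidate and databaseReconciler are illustrative stand-ins for the
// election candidate and database service used by the real monitor.
type leaderCandidate interface{ IsLeader() bool }

type databaseReconciler interface {
	ReconcileAllDatabaseVersions(ctx context.Context) error
}

// DatabasesMonitor periodically reconciles observed instance versions into
// the stored records and spec, but only while this host holds the election.
type DatabasesMonitor struct {
	interval  time.Duration // from databases_monitor_interval_seconds
	candidate leaderCandidate
	databases databaseReconciler
	logger    *slog.Logger
}

func (m *DatabasesMonitor) run(ctx context.Context) {
	ticker := time.NewTicker(m.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Only the current election leader runs the reconciliation pass,
			// so a single host updates instance records and specs at a time.
			if !m.candidate.IsLeader() {
				continue
			}
			if err := m.databases.ReconcileAllDatabaseVersions(ctx); err != nil {
				m.logger.Error("failed to reconcile database versions", "error", err)
			}
		}
	}
}
```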

Changes

This PR is split into granular commits. I highly recommend reviewing the commits separately rather than looking at the entire diff at once. The most notable of the commits are:

  • fix: use most recently fetched primary instance id
    • This fixes an existing bug that I ran into while testing the scenarios in this PR description and the cluster test. Before, we were just picking the first non-empty primary instance ID in the node resource. This could cause a problem if there was a switchover after that primary instance field was updated. This change accounts for that in a backward-compatible way.
  • feat: update host in host monitor
    • This small change ensures that we update the list of supported versions when packages change on the host.
  • feat: record versions for replica instances
    • This was a necessary change for us to determine whether all instances in a node have been updated. It's also a usability improvement to show when a replica has been updated.
  • feat: add service method to reconcile db versions
    • This commit contains the core logic of this feature.
  • feat: add databases monitor
    • This is the mechanism that invokes the core logic of this feature.
  • docs: minor postgres version upgrades for systemd
    • This is a small doc update to describe how we can accommodate minor version upgrades ahead of supporting package management in the control plane API. Externally-invoked major version upgrades are not supported.

Testing

I've added a new cluster test that exercises this feature. You can also test this by hand using the instructions in this section.

Using the dev-lima environment

# reset your environment
make dev-lima-reset
# deploy the fixture updates to add the 'pgedge old' repos
make dev-lima-deploy

# switch to the dev-lima environment
use-dev-lima

# downgrade to the previous version of postgres 18 on hosts 1-3
cp1-ssh 'sudo dnf downgrade -y pgedge-postgresql18'
cp2-ssh 'sudo dnf downgrade -y pgedge-postgresql18'
cp3-ssh 'sudo dnf downgrade -y pgedge-postgresql18'

# start the dev-lima environment in one terminal
make dev-lima-run

# then in another terminal
# switch to the dev-lima environment
use-dev-lima

# initialize the cluster
cp-init

# check the supported versions for hosts 1-3
# you should see 18.2 listed for all three.
cp1-req list-hosts | jq '.hosts[] 
        | select(.id | IN("host-1", "host-2", "host-3"))
        | {id, supported_pgedge_versions}'

# create a database using hosts 1-3
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "patroni_port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] },
      { "name": "n2", "host_ids": ["host-3"] }
    ]
  }
}
EOF

# check the instance and spec versions
# you should see that each instance is running 18.2 and the version is set at
# the top-level field in the database spec.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# upgrade postgres on host-3
# you might see messages like 'Shared connection to 127.0.0.1 closed.', but the
# commands should still exit 0 and the messages can be ignored.
cp3-ssh 'sudo dnf upgrade -y pgedge-postgresql18'
cp3-ssh 'sudo systemctl try-restart patroni-*'

# within 15 seconds, you should see version 18.3 in the supported versions for
# host-3:
cp1-req list-hosts | jq '.hosts[] 
        | select(.id == "host-3")
        | {id, supported_pgedge_versions}'

# within 30 seconds, you should see a message in the control plane server logs
# that says it detected an updated instance version and an updated node version.
# you should also see the instance is reporting an updated version in the
# databases API and that the database spec has been updated with a version
# override to 18.3 for n2.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# next, upgrade host-2
cp2-ssh 'sudo dnf upgrade -y pgedge-postgresql18'
cp2-ssh 'sudo systemctl try-restart patroni-*'

# within 15 seconds, you should see version 18.3 in the supported versions for
# host-2:
cp1-req list-hosts | jq '.hosts[] 
        | select(.id == "host-2")
        | {id, supported_pgedge_versions}'

# within 30 seconds, you should see that the host-2 instance is reporting an
# updated version; however, the spec should look the same because n1 is not yet
# fully upgraded.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# finally, upgrade host-1
cp1-ssh 'sudo dnf upgrade -y pgedge-postgresql18'
cp1-ssh 'sudo systemctl try-restart patroni-*'

# within 15 seconds, you should see version 18.3 in the supported versions for
# host-1:
cp1-req list-hosts | jq '.hosts[] 
        | select(.id == "host-1")
        | {id, supported_pgedge_versions}'

# within 30 seconds, you should see that the host-1 instance is reporting an
# updated version. You should also see that the spec is updated with version
# 18.3 in the top-level version field, and no more per-node overrides.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# perform a no-op update
# we still expect to see some resource updates in the server logs because the
# version number is written into a few resource types. however, these updates
# won't actually change anything.
cp1-req update-database storefront <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "patroni_port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] },
      { "name": "n2", "host_ids": ["host-3"] }
    ]
  }
}
EOF

# wait 5 seconds to ensure the instance monitor interval has elapsed, then
# re-fetch the updated database. it should look identical to the previous get-database
# request with all instances running 18.3 and 18.3 set in the top-level spec
# field.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# clean up by running
make dev-lima-reset

Using the compose environment

# reset your environment
make dev-reset

# ensure you're using the compose environment
use-compose

# start the compose environment in one terminal
make dev-watch

# then in another terminal
# ensure you're using the compose environment
use-compose

# initialize the cluster
cp-init

# create a database using hosts 1-3
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "postgres_version": "18.2",
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] },
      { "name": "n2", "host_ids": ["host-3"] }
    ]
  }
}
EOF

# check the instance and spec versions
# you should see that each instance is running 18.2 and the version is set at
# the top-level field in the database spec.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# upgrade postgres on the host-3 instance
docker service update \
    --image=ghcr.io/pgedge/pgedge-postgres:18.3-spock5.0.6-standard-1 \
    postgres-storefront-n2-ant97dj4

# within 30 seconds, you should see a message in the control plane server logs
# that says it detected an updated instance version and an updated node version.
# you should also see the instance is reporting an updated version in the
# databases API and that the database spec has been updated with a version
# override to 18.3 for n2.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# upgrade postgres on the host-2 instance
docker service update \
    --image=ghcr.io/pgedge/pgedge-postgres:18.3-spock5.0.6-standard-1 \
    postgres-storefront-n1-9ptayhma

# within 30 seconds, you should see that the host-2 instance is reporting an
# updated version; however, the spec should look the same because n1 is not yet
# fully upgraded.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# finally, upgrade host-1
docker service update \
    --image=ghcr.io/pgedge/pgedge-postgres:18.3-spock5.0.6-standard-1 \
    postgres-storefront-n1-689qacsi

# within 30 seconds, you should see that the host-1 instance is reporting an
# updated version. You should also see that the spec is updated with version
# 18.3 in the top-level version field, and no more per-node overrides.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# perform a no-op update
# we still expect to see some resource updates in the server logs because the
# version number is written into a few resource types. this will trigger a
# patroni reload because the swarm patroni config resource includes the entire
# instance spec. it will also trigger a switchover because the docker CLI sorts
# volume mounts. however, these updates won't change the database versions
# or other configuration.
cp1-req update-database storefront <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] },
      { "name": "n2", "host_ids": ["host-3"] }
    ]
  }
}
EOF

# wait 5 seconds to ensure the instance monitor interval has elapsed, then
# re-fetch the updated database. it should look identical to the previous get-database
# request with all instances running 18.3 and 18.3 set in the top-level spec
# field.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# clean up by running
make dev-reset

PLAT-512

@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough


Adds a leader-backed periodic databases monitor that reconciles instance Postgres/Spock versions into stored instance records and specs, introduces parsing/normalization for PgEdge versions, wires monitoring/election and logging, extends storage txn comparisons, and includes tests, docs, and changelog entries.

Database Version Reconciliation & Monitoring

  • Version Parsing / Data Shape (server/internal/ds/versions.go, server/internal/ds/versions_test.go): Introduce Version.MajorMinorVersion() and PgEdgeVersion.Normalize(); replace MustPgEdgeVersion/NewPgEdgeVersion with MustParsePgEdgeVersion/ParsePgEdgeVersion.
  • Config & Logging Consts (server/internal/config/config.go, server/internal/logging/factory.go): Add DatabasesMonitorIntervalSeconds config (default 30) and ComponentDatabaseService logging component.
  • Instance/Spec Model Changes (server/internal/database/instance.go, server/internal/database/spec.go, server/internal/database/instance_store.go): Add InstanceStatus.IsStale(), Spec.NormalizePostgresVersions(), and InstanceStore.Update() op builder.
  • Storage Transaction Support (server/internal/storage/interface.go, server/internal/storage/txn.go): Add Txn.AddConditions(...) to the interface and txn implementation to collect/include etcd clientv3.Cmp conditions in commits.
  • Reconciliation Core (server/internal/database/reconcile_versions.go, server/internal/database/reconcile_versions_test.go): Add Service.ReconcileAllDatabaseVersions and ReconcileVersions, which reconcile instance pgEdge versions, derive observed node versions, stage instance/spec updates, and commit guarded transactions; tests exercise scenarios.
  • Service Logging & Wiring (server/internal/database/service.go, server/internal/database/provide.go): Add logger to Service; NewService accepts loggerFactory and the provider injects it. Replace some NewPgEdgeVersion calls with ParsePgEdgeVersion.
  • Monitor Implementation (server/internal/monitor/databases_monitor.go): Add DatabasesMonitor that periodically calls reconciliation when its election.Candidate is leader; interval from config.
  • Monitor Integration (server/internal/monitor/provide.go, server/internal/monitor/service.go): Provider resolves *election.Service, creates a candidate, and passes it to NewService; Service conditionally constructs, starts, and stops databasesMonitor.
  • Monitor & Host/Instance Adjustments (server/internal/monitor/instance_monitor.go, server/internal/monitor/host_monitor.go): Instance monitor always refreshes DB-connection-derived fields (primary-only fields guarded internally); host monitor now runs UpdateHost and UpdateHostStatus each refresh.
  • Instance Primary Tracking & Node Selection (server/internal/database/instance_resource.go, server/internal/database/node_resource.go): Track PrimaryInstanceIDUpdatedAt on InstanceResource and select node primary by newest timestamp across instances.
  • Populate/Plan Parsing Updates (server/internal/workflows/plan_update.go, server/internal/database/service.go, server/internal/database/spec.go): Replace ds.NewPgEdgeVersion calls with ds.ParsePgEdgeVersion in spec/node parsing and default population; add spec normalization call.
  • Tests & Integration (clustertest/external_upgrade_test.go, clustertest/utils_test.go, clustertest/host_test.go, server/internal/*_test.go): Add TestExternalUpgrade, introduce dockerCmd helper, update tests to new parse APIs, adjust utility behaviors and timeouts.
  • Docs, Changelog & Installer (docs/installation/configuration.md, docs/installation/systemd.md, changes/unreleased/*.yaml, lima/roles/install_prerequisites/tasks/main.yaml): Document databases_monitor_interval_seconds, add manual systemd Postgres minor-upgrade procedure and changelog entries; add old RPM repos and idempotency fixes to the installer.
  • Misc Test/Infra Adjustments (Makefile, e2e/cancel_task_test.go, clustertest/host_test.go, clustertest/*): Increase cluster test timeout, add small sleep in cancel test, propagate context in log helpers, and other test/util tweaks.

🐰 I sniff the versions, one by one,
Leader-bound I watch what hosts have done;
Instances change, the spec learns through,
Manual upgrades whisper, then come true —
A soft hop, and the cluster hums anew.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 3.33%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title 'feat: update spec when db versions change' directly describes the main feature added in this PR: a monitor that updates the database spec when it detects version changes in running instances.
  • Description check ✅ Passed: The PR description includes all required sections: a clear Summary explaining the motivation and what the PR adds, a detailed Changes section listing notable commits with explanations, comprehensive Testing instructions for both dev-lima and compose environments, and a complete Checklist with all items addressed.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@codacy-production

codacy-production Bot commented May 4, 2026

Up to standards ✅

🟢 Issues: 3 medium

Results: 3 new issues

  • Complexity: 3 medium

View in Codacy

🟢 Metrics: 52 complexity · 3 duplication

  • Complexity: 52
  • Duplication: 3

View in Codacy



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
clustertest/utils_test.go (1)

163-176: ⚡ Quick win

dockerCmd: add a context/timeout and surface captured output on failure

Two independent issues:

  1. No timeout: exec.Command has no deadline. A hung Docker daemon or a blocking docker service update will stall the entire test suite indefinitely. Compare with buildImage, which wraps its exec.Command in a 300 s CommandContext. The test receives t testing.TB, so a t.Context() or an explicit background timeout can be threaded through.

  2. Docker output dropped on failure — stderr/stdout are captured into out, but require.NoError(t, err) only surfaces the OS error string (e.g., "exit status 1"). The actual Docker error message is silently discarded, making CI failures much harder to diagnose.

🔧 Proposed fix
-func dockerCmd(t testing.TB, args ...string) string {
+func dockerCmd(t testing.TB, ctx context.Context, args ...string) string {
 	t.Helper()
 
 	tLogf(t, "executing command: docker %s", strings.Join(args, " "))
 
 	var out strings.Builder
-	cmd := exec.Command("docker", args...)
+	cmd := exec.CommandContext(ctx, "docker", args...)
 	cmd.Stdout = &out
 	cmd.Stderr = &out
 	err := cmd.Run()
-	require.NoError(t, err)
+	require.NoError(t, err, "docker %s output:\n%s", strings.Join(args, " "), out.String())
 
 	return strings.TrimSpace(out.String())
 }

Callers can pass t.Context() for a test-lifetime bound, or a tighter context.WithTimeout(context.Background(), 60*time.Second) for specific operations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clustertest/utils_test.go` around lines 163 - 176, The dockerCmd function
lacks a context/timeout and discards Docker output on failure; update dockerCmd
to accept/use a context (use t.Context() when available or wrap with
context.WithTimeout(ctx, 60*time.Second)) and run exec.CommandContext instead of
exec.Command, keep capturing stdout/stderr into the out buffer, and when the
command returns an error include the trimmed out.String() in the test failure
(e.g., via require.NoErrorf/require.Failf or t.Fatalf) so the Docker error
message is surfaced; reference the dockerCmd function and model the pattern
after buildImage which uses CommandContext and a 300s timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 140-142: The tLogf call in the error handling for database
deletion uses the fmt.Errorf-specific %w verb which tLogf (a printf-style
wrapper around t.Logf) doesn't support; update the format string in the error
log that references databaseID and err (the call using tLogf) to use %v for the
error value instead of %w so the error message prints correctly (i.e., change
the format argument in the tLogf call that logs "failed to delete database '%s':
..." to use %v for err).

In `@lima/roles/install_prerequisites/tasks/main.yaml`:
- Around line 18-33: The two yum_repository tasks named "Add pgEdge old
repository" and "Add pgEdge noarch old repository" disable package signature
verification by setting gpgcheck: no and hardcode OS major as 9 in baseurl;
update these tasks to (1) avoid gpgcheck: no — instead set repo_gpgcheck: no if
you only want to skip repo metadata but keep package RPM signature verification,
or keep gpgcheck and add sslverify: yes to assert HTTPS validation, and (2) make
the baseurl dynamic by replacing the literal 9 with
{{ansible_facts['distribution_major_version']}} so the repository URL matches
the host OS major version. Ensure the updated fields are applied in the tasks
with those task names.

In `@server/internal/monitor/provide.go`:
- Around line 60-61: The code unconditionally creates an election candidate (via
electionSvc.NewCandidate) which opens an etcd watch and is never stopped if
DatabasesMonitorIntervalSeconds == 0 because DatabasesMonitor is not created and
candidate.Start/Stop are never called; to fix, avoid creating the candidate when
the databases monitor is disabled by only constructing the candidate when
cfg.DatabasesMonitorIntervalSeconds > 0 (i.e., move the electionSvc.NewCandidate
call behind that conditional), or alternatively change NewCandidate to defer
starting/creating the watch until Candidate.Start is invoked and ensure
Candidate.Stop can safely close any resources even if Start was never called;
update provide.go to create/pass a nil/no-op candidate to NewService when the
monitor is disabled and reference electionSvc.NewCandidate, Candidate.Start,
Candidate.Stop, DatabasesMonitor and DatabasesMonitorIntervalSeconds in your
change.

---

Nitpick comments:
In `@clustertest/utils_test.go`:
- Around line 163-176: The dockerCmd function lacks a context/timeout and
discards Docker output on failure; update dockerCmd to accept/use a context (use
t.Context() when available or wrap with context.WithTimeout(ctx,
60*time.Second)) and run exec.CommandContext instead of exec.Command, keep
capturing stdout/stderr into the out buffer, and when the command returns an
error include the trimmed out.String() in the test failure (e.g., via
require.NoErrorf/require.Failf or t.Fatalf) so the Docker error message is
surfaced; reference the dockerCmd function and model the pattern after
buildImage which uses CommandContext and a 300s timeout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da1e0e34-a22e-484a-9dde-4138a08808f5

📥 Commits

Reviewing files that changed from the base of the PR and between ea89e93 and 99a663b.

📒 Files selected for processing (32)
  • changes/unreleased/Added-20260504-150235.yaml
  • changes/unreleased/Changed-20260504-152348.yaml
  • clustertest/external_upgrade_test.go
  • clustertest/host_test.go
  • clustertest/utils_test.go
  • docs/installation/configuration.md
  • docs/installation/systemd.md
  • lima/roles/install_prerequisites/tasks/main.yaml
  • server/internal/config/config.go
  • server/internal/database/instance.go
  • server/internal/database/instance_store.go
  • server/internal/database/provide.go
  • server/internal/database/reconcile_versions.go
  • server/internal/database/reconcile_versions_test.go
  • server/internal/database/service.go
  • server/internal/database/spec.go
  • server/internal/database/status.go
  • server/internal/ds/versions.go
  • server/internal/ds/versions_test.go
  • server/internal/host/host_test.go
  • server/internal/logging/factory.go
  • server/internal/monitor/databases_monitor.go
  • server/internal/monitor/host_monitor.go
  • server/internal/monitor/instance_monitor.go
  • server/internal/monitor/provide.go
  • server/internal/monitor/service.go
  • server/internal/orchestrator/common/patroni_config_generator_test.go
  • server/internal/orchestrator/swarm/images.go
  • server/internal/orchestrator/swarm/rag_instance_resources_test.go
  • server/internal/storage/interface.go
  • server/internal/storage/txn.go
  • server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
  • server/internal/database/status.go

Comment thread clustertest/external_upgrade_test.go
Comment thread lima/roles/install_prerequisites/tasks/main.yaml
Comment thread server/internal/monitor/provide.go
@jason-lynch jason-lynch force-pushed the feat/PLAT-512/update-database-version-ii branch from 99a663b to 0efec38 on May 5, 2026 12:51
Instances are updated sequentially and switchovers can happen during
that sequence. We want to use the most recently fetched primary instance
ID in the node resource to account for those switchovers.

This change is backward compatible and does not need any migration.

PLAT-512
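To make the commit above concrete, the selection now looks roughly like the sketch below, which picks the most recently reported primary rather than the first non-empty one. The field names follow the walkthrough (PrimaryInstanceID, PrimaryInstanceIDUpdatedAt); the actual node resource code may differ:

```go
package sketch

import "time"

// InstanceResource is a trimmed-down stand-in for the real resource type.
type InstanceResource struct {
	PrimaryInstanceID          string
	PrimaryInstanceIDUpdatedAt time.Time
}

// latestPrimaryInstanceID returns the primary instance ID reported most
// recently, so a switchover that happens mid-sequence wins over stale records.
func latestPrimaryInstanceID(instances []InstanceResource) string {
	var (
		primaryID string
		updatedAt time.Time
	)
	for _, inst := range instances {
		if inst.PrimaryInstanceID == "" {
			continue // this instance has not reported a primary yet
		}
		if inst.PrimaryInstanceIDUpdatedAt.After(updatedAt) {
			primaryID = inst.PrimaryInstanceID
			updatedAt = inst.PrimaryInstanceIDUpdatedAt
		}
	}
	return primaryID
}
```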
This test still causes issues occasionally due to how quickly it cancels
the task after starting it. This sleep is a workaround until we're able
to find and fix the underlying issue.

PLAT-512
Improves the ds.PgEdgeVersion initializers by:

- Renaming `NewPgEdgeVersion`, which parses a `PgEdgeVersion` from
  strings, to `ParsePgEdgeVersion`.
- Adding a method to normalize `PgEdgeVersion`s so that the Postgres
  version contains major and minor components and the Spock version
  contains just a major component. This matches the way that versions
  are currently provided in our API.

PLAT-512
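To illustrate the normalization described above, here is a minimal sketch assuming a simplified version shape; the real ds types and method set may differ:

```go
package sketch

// Version and PgEdgeVersion are simplified stand-ins for the ds types.
type Version struct{ Major, Minor, Patch int }

// MajorMinorVersion drops the patch component.
func (v Version) MajorMinorVersion() Version { return Version{Major: v.Major, Minor: v.Minor} }

// majorOnly drops everything but the major component (hypothetical helper).
func (v Version) majorOnly() Version { return Version{Major: v.Major} }

type PgEdgeVersion struct {
	Postgres Version
	Spock    Version
}

// Normalize trims versions to the precision the API provides today:
// major.minor for Postgres, major only for Spock.
func (v PgEdgeVersion) Normalize() PgEdgeVersion {
	return PgEdgeVersion{
		Postgres: v.Postgres.MajorMinorVersion(),
		Spock:    v.Spock.majorOnly(),
	}
}
```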
Adds an `AddConditions` method to `storage.Txn` to make it possible to
add additional constraints to a transaction. This enables us to enforce
other constraints that are outside of the values we're updating.

PLAT-512
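For illustration, a guarded write using the new AddConditions hook might look like the sketch below; the other Txn methods and the key layout here are assumptions, not the exact interface in this PR:

```go
package sketch

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// txn mirrors the relevant slice of the storage.Txn interface for this
// sketch; the Put and Commit signatures are assumed.
type txn interface {
	AddConditions(cmps ...clientv3.Cmp)
	Put(key string, value []byte)
	Commit(ctx context.Context) error
}

// updateSpecIfInstanceUnchanged stages a spec write that only commits if an
// instance record we read earlier has not been modified since. The guard is
// on a key we are not writing, which is what AddConditions enables.
func updateSpecIfInstanceUnchanged(ctx context.Context, t txn,
	instanceKey string, instanceModRevision int64,
	specKey string, specValue []byte) error {
	t.AddConditions(clientv3.Compare(clientv3.ModRevision(instanceKey), "=", instanceModRevision))
	t.Put(specKey, specValue)
	// If the instance record changed since we read it, the transaction fails
	// and the reconciliation can simply be retried on the next pass.
	return t.Commit(ctx)
}
```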
Modifies the existing host monitor to also update the host record
periodically. This accounts for system changes that happen while the
Control Plane is running, such as system package updates.

PLAT-512
Updates the existing instance monitor to query and record the Postgres
and Spock versions from replica instances. We'll need this to determine
when a node has been updated by an external process, such as a system
package update.

PLAT-512
Adds a new database service method to reconcile the Postgres and Spock
versions that we observe through the instance monitors with the versions
in the database spec. The domain logic of this process is split into a
standalone function with unit tests to enforce its behavior. The service
method uses a transaction with additional constraints to handle
conflicts with any database operations that run while we're performing
this check.

The summary of this new behavior is that when an instance is updated, we
update its corresponding `Instance` record in our database. If all of
the instances have been updated for a node, we update the node's entry
in the database spec. If all of the nodes have been updated, we
normalize the database spec by setting the top-level version fields and
zeroing out the node-level version fields.

PLAT-512
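A simplified sketch of the decision logic described above (Postgres only, with assumed types; the real reconcile_versions.go also handles Spock versions, instance record updates, and the transaction guards):

```go
package sketch

type nodeSpec struct {
	Name            string
	PostgresVersion string // per-node override; empty means "use the top-level version"
}

type databaseSpec struct {
	PostgresVersion string
	Nodes           []nodeSpec
}

// reconcileSpecVersions applies observed instance versions to the spec: a
// node's entry is updated only once every one of its instances reports the
// new version, and the top-level field is updated (with node overrides
// cleared) only once every node agrees.
func reconcileSpecVersions(spec *databaseSpec, observed map[string][]string) {
	for i, node := range spec.Nodes {
		versions := observed[node.Name]
		if len(versions) == 0 || !allEqual(versions) {
			continue // node still mid-upgrade; leave its spec entry alone
		}
		spec.Nodes[i].PostgresVersion = versions[0]
	}

	// Normalize: if every node has converged on one version, promote it to
	// the top-level field and drop the per-node overrides.
	var nodeVersions []string
	for _, node := range spec.Nodes {
		v := node.PostgresVersion
		if v == "" {
			v = spec.PostgresVersion
		}
		nodeVersions = append(nodeVersions, v)
	}
	if len(nodeVersions) > 0 && allEqual(nodeVersions) {
		spec.PostgresVersion = nodeVersions[0]
		for i := range spec.Nodes {
			spec.Nodes[i].PostgresVersion = ""
		}
	}
}

func allEqual(values []string) bool {
	for _, v := range values {
		if v != values[0] {
			return false
		}
	}
	return true
}
```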
Adds a 'databases' monitor to periodically run the database version
reconciliation process. The monitor's interval is configurable and can
be set to '0' to disable it entirely. This setting is useful in testing
and in clusters with a large number of databases where the operator
knows that the database versions will never change outside of the
Control Plane API.

PLAT-512
@jason-lynch jason-lynch force-pushed the feat/PLAT-512/update-database-version-ii branch 2 times, most recently from 2742e61 to 2ba5138 on May 5, 2026 19:09

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
e2e/cancel_task_test.go (1)

54-58: ⚡ Quick win

Replace fixed sleep with condition-based waiting to reduce flakiness

Line 57 introduces a hard-coded delay that can still race in CI and makes the test slower than needed. Prefer waiting until the task reaches a cancelable state (or until CancelDatabaseTask succeeds) with a bounded timeout.

Proposed test-stability refactor
-	// TODO: even after many attempts at fixing this, rapidly cancelling a task
-	// still occasionally causes problems. This sleep is a workaround until
-	// we're able to track down the issue.
-	time.Sleep(500 * time.Millisecond)
+	// Wait until cancellation becomes reliably accepted instead of sleeping a fixed duration.
+	require.Eventually(t, func() bool {
+		_, err := fixture.Client.CancelDatabaseTask(t.Context(), &controlplane.CancelDatabaseTaskPayload{
+			DatabaseID: database.ID,
+			TaskID:     controlplane.Identifier(creation_task.TaskID),
+		})
+		return err == nil
+	}, 10*time.Second, 200*time.Millisecond, "task did not become cancelable in time")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/cancel_task_test.go` around lines 54 - 58, Replace the fixed time.Sleep
with a bounded, condition-based wait: inside the test (cancel_task_test) poll in
a loop with a short backoff until either CancelDatabaseTask returns success or
the task's status indicates it is cancellable (use the existing
CancelDatabaseTask and the task-status getter used elsewhere, e.g.,
GetDatabaseTask/GetTaskStatus), breaking on success or when a context timeout
(e.g., 2–5s) elapses; return/fail the test on timeout. Ensure the loop uses a
small sleep (10–50ms) between attempts to avoid busy-waiting and that the
timeout is enforced via context.WithTimeout or time.After to keep the test
bounded.
server/internal/database/node_resource.go (1)

86-95: 💤 Low value

Optional: extract the comparison into a tiny helper for readability.

Not blocking — the loop body now mixes "skip" filtering with "is-newer" tracking. A small helper (e.g., isMoreRecent(prevID string, prevAt, curAt time.Time) bool) or a comment-anchored latestPrimary struct would make intent obvious if this pattern grows (e.g., once the version-reconciliation monitor lands and similar timestamp-driven picks appear elsewhere).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/database/node_resource.go` around lines 86 - 95, Extract the
"is newer" logic into a small helper to improve readability: create a helper
like isMoreRecent(prevID string, prevAt, curAt time.Time) bool (or a tiny
latestPrimary struct with an UpdateIfNewer method) and replace the inline
condition in the loop that compares instance.PrimaryInstanceIDUpdatedAt with
primaryInstanceUpdatedAt and sets primaryInstanceID/primaryInstanceUpdatedAt;
keep the existing skip/continue for empty PrimaryInstanceID and call the helper
to decide whether to assign the current instance's PrimaryInstanceID and
PrimaryInstanceIDUpdatedAt.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 264-265: The call to waitForTaskComplete (waiting on
updateResp.Task.TaskID) assigns err but never checks it; add an assertion
immediately after that call to fail the test if err is non-nil (e.g., use
t.Fatalf/t.Fatalf-like or require.NoError) so a timeout or failure is not
ignored before the subsequent assertions that depend on the task completing.
Ensure the check references the err returned from waitForTaskComplete(ctx,
cluster.Client(), databaseID, updateResp.Task.TaskID, 3*time.Minute).
- Around line 134-145: Replace the unbounded ctx := context.Background() used
for the cleanup DeleteDatabase call with a bounded context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>); defer
cancel()) so cluster.Client().DeleteDatabase(...) cannot hang indefinitely;
update the DeleteDatabase call site that uses ctx and add the necessary time
import, referencing the ctx variable and the DeleteDatabase method and
databaseID/testConfig.skipCleanup flow.

In `@docs/installation/systemd.md`:
- Around line 248-250: Replace the fenced code blocks inside the numbered steps
with indented code blocks (prefix each command line with four spaces) so
markdownlint MD046 no longer flags nested fenced blocks; specifically update the
snippet containing "sudo dnf upgrade pgedge-postgresql18" and the other
occurrences referenced (around the blocks at the next two locations) to use
indented code formatting instead of ``` fences.
- Around line 28-30: Change the phrase "Minor versions upgrades" to "Minor
version upgrades" in the limitations paragraph (the sentence currently beginning
"Database upgrades are not supported via the API. Minor versions upgrades can be
performed manually..."); update that sentence to read "Minor version upgrades
can be performed manually by a system administrator." to fix the grammar.
- Around line 268-269: The sentence hardcodes "30 seconds" even though that
delay is controlled by the configuration key databases_monitor_interval_seconds;
update the wording in docs/installation/systemd.md to reference the configured
value instead of a hardcoded time (e.g., say "up to the configured
databases_monitor_interval_seconds (defaults to 30s)" or "up to the configured
databases_monitor_interval_seconds value"), and ensure the doc mentions where to
change that setting so readers know it is configurable.

---

Nitpick comments:
In `@e2e/cancel_task_test.go`:
- Around line 54-58: Replace the fixed time.Sleep with a bounded,
condition-based wait: inside the test (cancel_task_test) poll in a loop with a
short backoff until either CancelDatabaseTask returns success or the task's
status indicates it is cancellable (use the existing CancelDatabaseTask and the
task-status getter used elsewhere, e.g., GetDatabaseTask/GetTaskStatus),
breaking on success or when a context timeout (e.g., 2–5s) elapses; return/fail
the test on timeout. Ensure the loop uses a small sleep (10–50ms) between
attempts to avoid busy-waiting and that the timeout is enforced via
context.WithTimeout or time.After to keep the test bounded.

In `@server/internal/database/node_resource.go`:
- Around line 86-95: Extract the "is newer" logic into a small helper to improve
readability: create a helper like isMoreRecent(prevID string, prevAt, curAt
time.Time) bool (or a tiny latestPrimary struct with an UpdateIfNewer method)
and replace the inline condition in the loop that compares
instance.PrimaryInstanceIDUpdatedAt with primaryInstanceUpdatedAt and sets
primaryInstanceID/primaryInstanceUpdatedAt; keep the existing skip/continue for
empty PrimaryInstanceID and call the helper to decide whether to assign the
current instance's PrimaryInstanceID and PrimaryInstanceIDUpdatedAt.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 746ee34d-9c87-4823-8c88-e174405bfca8

📥 Commits

Reviewing files that changed from the base of the PR and between 0efec38 and 2742e61.

📒 Files selected for processing (36)
  • Makefile
  • changes/unreleased/Added-20260504-150235.yaml
  • changes/unreleased/Changed-20260504-152348.yaml
  • clustertest/external_upgrade_test.go
  • clustertest/host_test.go
  • clustertest/utils_test.go
  • docs/installation/configuration.md
  • docs/installation/systemd.md
  • e2e/cancel_task_test.go
  • lima/roles/install_prerequisites/tasks/main.yaml
  • server/internal/config/config.go
  • server/internal/database/instance.go
  • server/internal/database/instance_resource.go
  • server/internal/database/instance_store.go
  • server/internal/database/node_resource.go
  • server/internal/database/provide.go
  • server/internal/database/reconcile_versions.go
  • server/internal/database/reconcile_versions_test.go
  • server/internal/database/service.go
  • server/internal/database/spec.go
  • server/internal/database/status.go
  • server/internal/ds/versions.go
  • server/internal/ds/versions_test.go
  • server/internal/host/host_test.go
  • server/internal/logging/factory.go
  • server/internal/monitor/databases_monitor.go
  • server/internal/monitor/host_monitor.go
  • server/internal/monitor/instance_monitor.go
  • server/internal/monitor/provide.go
  • server/internal/monitor/service.go
  • server/internal/orchestrator/common/patroni_config_generator_test.go
  • server/internal/orchestrator/swarm/images.go
  • server/internal/orchestrator/swarm/rag_instance_resources_test.go
  • server/internal/storage/interface.go
  • server/internal/storage/txn.go
  • server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
  • server/internal/database/status.go
✅ Files skipped from review due to trivial changes (7)
  • server/internal/storage/interface.go
  • changes/unreleased/Changed-20260504-152348.yaml
  • changes/unreleased/Added-20260504-150235.yaml
  • server/internal/workflows/plan_update.go
  • lima/roles/install_prerequisites/tasks/main.yaml
  • server/internal/host/host_test.go
  • server/internal/orchestrator/swarm/rag_instance_resources_test.go
🚧 Files skipped from review as they are similar to previous changes (15)
  • server/internal/orchestrator/swarm/images.go
  • server/internal/database/instance_store.go
  • server/internal/monitor/databases_monitor.go
  • server/internal/monitor/instance_monitor.go
  • server/internal/monitor/provide.go
  • clustertest/host_test.go
  • server/internal/database/provide.go
  • server/internal/orchestrator/common/patroni_config_generator_test.go
  • server/internal/database/reconcile_versions.go
  • server/internal/database/instance.go
  • server/internal/database/service.go
  • server/internal/monitor/host_monitor.go
  • server/internal/database/spec.go
  • server/internal/database/reconcile_versions_test.go
  • server/internal/ds/versions.go

Comment thread clustertest/external_upgrade_test.go Outdated
Comment thread clustertest/external_upgrade_test.go
Comment thread docs/installation/systemd.md Outdated
Comment thread docs/installation/systemd.md
Comment thread docs/installation/systemd.md
Adds a cluster test to verify the behavior of the new database version
reconciliation process.

PLAT-512
pgEdge only makes the latest package versions available in the `release`
repositories. Older versions of packages are only available in the
`release_old` repositories.

This commit adds the `release_old` repositories to the `dev-lima`
environment so that we can install older versions of Postgres and other
components in this environment.

PLAT-512
Adds the new configuration setting and logging component to the
configuration guide.

PLAT-512
Updates the systemd document to describe the Postgres minor version
upgrade process.

PLAT-512
Adds changelog entries for db version reconciliation and the instance
monitor change.

PLAT-512
@jason-lynch jason-lynch force-pushed the feat/PLAT-512/update-database-version-ii branch from 2ba5138 to 25094c9 on May 5, 2026 19:42
@jason-lynch jason-lynch marked this pull request as ready for review on May 5, 2026 21:20

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
clustertest/external_upgrade_test.go (1)

168-168: ⚡ Quick win

Fixed time.Sleep calls are fragile after async Docker service updates.

docker service update is asynchronous — the new task may still be starting when sleepDuration expires. Within 5 s the test requires: the container to restart with the new image, the instance monitor to report the new version, and the databases monitor (interval = 3 s) to reconcile the spec. In a slow CI environment this is a tight race that can produce intermittent failures.

require.Eventually with a generous deadline and a short poll interval would remove the dependency on a fixed sleep and give clear assertion failure messages when the deadline expires:

require.Eventually(t, func() bool {
    db, err = cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{
        DatabaseID: databaseID,
    })
    if err != nil {
        return false
    }
    // inline the version checks; return false on mismatch
    return specAndInstanceVersionsMatch(db, ...)
}, 30*time.Second, 500*time.Millisecond)

Apply the same pattern at Lines 168, 198, 225, and 270.

Also applies to: 198-198, 225-225, 270-270

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clustertest/external_upgrade_test.go` at line 168, Replace the fragile
time.Sleep(sleepDuration) usages with require.Eventually polling: for each
occurrence (the calls using sleepDuration at the sites referencing databaseID,
ctx, cluster.Client().GetDatabase and specAndInstanceVersionsMatch), call
require.Eventually with a generous timeout (e.g. 30s) and short tick (e.g.
500ms) and inside the predicate perform the cluster.Client().GetDatabase(ctx,
&controlplane.GetDatabasePayload{DatabaseID: databaseID}), return false on error
or version mismatches, and true only when specAndInstanceVersionsMatch(...) and
instance version checks pass; apply this replacement to the four locations that
currently use time.Sleep (the ones around lines referencing sleepDuration,
databaseID, GetDatabase and specAndInstanceVersionsMatch).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Line 55: The assertion uses require.Equal with arguments reversed; change the
call so the expected value client.InstanceStateAvailable is the second-to-last
parameter and the runtime value instance.State is the last (i.e., call
require.Equal(t, client.InstanceStateAvailable, instance.State)) so failures
show "expected" as the known constant and "actual" as the observed
instance.State.

---

Nitpick comments:
In `@clustertest/external_upgrade_test.go`:
- Line 168: Replace the fragile time.Sleep(sleepDuration) usages with
require.Eventually polling: for each occurrence (the calls using sleepDuration
at the sites referencing databaseID, ctx, cluster.Client().GetDatabase and
specAndInstanceVersionsMatch), call require.Eventually with a generous timeout
(e.g. 30s) and short tick (e.g. 500ms) and inside the predicate perform the
cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{DatabaseID:
databaseID}), return false on error or version mismatches, and true only when
specAndInstanceVersionsMatch(...) and instance version checks pass; apply this
replacement to the four locations that currently use time.Sleep (the ones around
lines referencing sleepDuration, databaseID, GetDatabase and
specAndInstanceVersionsMatch).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a4038ff5-5bcb-44e5-a2bc-3448c7337c6d

📥 Commits

Reviewing files that changed from the base of the PR and between 2742e61 and 25094c9.

📒 Files selected for processing (9)
  • Makefile
  • changes/unreleased/Added-20260504-150235.yaml
  • changes/unreleased/Changed-20260504-152348.yaml
  • clustertest/external_upgrade_test.go
  • clustertest/host_test.go
  • clustertest/utils_test.go
  • docs/installation/configuration.md
  • docs/installation/systemd.md
  • lima/roles/install_prerequisites/tasks/main.yaml


actualNodeHostVersions := map[string]map[string]string{}
for _, instance := range instances {
	require.Equal(t, instance.State, client.InstanceStateAvailable)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

require.Equal arguments are in the wrong order.

Testify's require.Equal(t, expected, actual) convention expects the constant/known value first and the runtime value second. With instance.State as the first argument and client.InstanceStateAvailable as the second, a failure would read:

expected: "some_other_state", actual: "available"

which is the reverse of what you'd expect, making failures harder to diagnose.

🐛 Proposed fix
-			require.Equal(t, instance.State, client.InstanceStateAvailable)
+			require.Equal(t, client.InstanceStateAvailable, instance.State)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clustertest/external_upgrade_test.go` at line 55, The assertion uses
require.Equal with arguments reversed; change the call so the expected value
client.InstanceStateAvailable is the second-to-last parameter and the runtime
value instance.State is the last (i.e., call require.Equal(t,
client.InstanceStateAvailable, instance.State)) so failures show "expected" as
the known constant and "actual" as the observed instance.State.

