
feat: update spec when db versions change #377

Open
jason-lynch wants to merge 14 commits into main from
feat/PLAT-512/update-database-version-ii

Conversation

@jason-lynch
Member

@jason-lynch jason-lynch commented May 4, 2026

Summary

A database spec can fall out of sync with its running instances when a user upgrades packages outside of our system. We expect this to happen in systemd clusters from typical system package upgrade workflows. So, we need a way to update the database spec when we observe version changes in the running instances.

This PR adds a new monitor that loops over all databases and reconciles the observed versions with those in the instance record and the spec. This monitor uses the existing election system to ensure that only one host runs the reconciliation process.
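For context, here is a minimal sketch of what a leader-gated reconciliation loop looks like. The type and method names (leaderCandidate, databaseReconciler, the logger wiring) are simplified stand-ins for illustration and are not the exact code in this PR:

```go
package monitor

import (
	"context"
	"log/slog"
	"time"
)

// leaderCandidate and databaseReconciler are illustrative stand-ins for the
// election candidate and database service used by the real monitor.
type leaderCandidate interface{ IsLeader() bool }

type databaseReconciler interface {
	ReconcileAllDatabaseVersions(ctx context.Context) error
}

// DatabasesMonitor periodically reconciles observed instance versions into
// the stored records and spec, but only while this host holds the election.
type DatabasesMonitor struct {
	interval  time.Duration // from databases_monitor_interval_seconds
	candidate leaderCandidate
	databases databaseReconciler
	logger    *slog.Logger
}

func (m *DatabasesMonitor) run(ctx context.Context) {
	ticker := time.NewTicker(m.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Only the current election leader runs the reconciliation pass,
			// so a single host updates instance records and specs at a time.
			if !m.candidate.IsLeader() {
				continue
			}
			if err := m.databases.ReconcileAllDatabaseVersions(ctx); err != nil {
				m.logger.Error("failed to reconcile database versions", "error", err)
			}
		}
	}
}
```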

Changes

This PR is split into granular commits. I highly recommend reviewing the commits separately rather than looking at the entire diff at once. The most notable of the commits are:

  • fix: use most recently fetched primary instance id
    • This fixes an existing bug that I ran into while testing the scenarios in this PR description and the cluster test. Before, we were just picking the first non-empty primary instance ID in the node resource. This could cause a problem if there was a switchover after that primary instance field was updated. This change accounts for that in a backward-compatible way.
  • feat: update host in host monitor
    • This small change ensures that we update the list of supported versions when packages change on the host.
  • feat: record versions for replica instances
    • This was a necessary change for us to determine whether all instances in a node have been updated. It's also a usability improvement to show when a replica has been updated.
  • feat: add service method to reconcile db versions
    • This commit contains the core logic of this feature.
  • feat: add databases monitor
    • This is the mechanism that invokes the core logic of this feature.
  • docs: minor postgres version upgrades for systemd
    • This is a small doc update to describe how we can accommodate minor version upgrades ahead of supporting package management in the control plane API. Externally-invoked major version upgrades are not supported.

Testing

I've added a new cluster test that exercises this feature. You can also test this by hand using the instructions in this section.

Using the dev-lima environment

# reset your environment
make dev-lima-reset
# deploy the fixture updates to add the 'pgedge old' repos
make dev-lima-deploy

# switch to the dev-lima environment
use-dev-lima

# downgrade to the previous version of postgres 18 on hosts 1-3
cp1-ssh 'sudo dnf downgrade -y pgedge-postgresql18'
cp2-ssh 'sudo dnf downgrade -y pgedge-postgresql18'
cp3-ssh 'sudo dnf downgrade -y pgedge-postgresql18'

# start the dev-lima environment in one terminal
make dev-lima-run

# then in another terminal
# switch to the dev-lima environment
use-dev-lima

# initialize the cluster
cp-init

# check the supported versions for hosts 1-3
# you should see 18.2 listed for all three.
cp1-req list-hosts | jq '.hosts[] 
        | select(.id | IN("host-1", "host-2", "host-3"))
        | {id, supported_pgedge_versions}'

# create a database using hosts 1-3
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "patroni_port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] },
      { "name": "n2", "host_ids": ["host-3"] }
    ]
  }
}
EOF

# check the instance and spec versions
# you should see that each instance is running 18.2 and the version is set at
# the top-level field in the database spec.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# upgrade postgres on host-3
# you might see messages like 'Shared connection to 127.0.0.1 closed.', but the
# commands should still exit 0 and the messages can be ignored.
cp3-ssh 'sudo dnf upgrade -y pgedge-postgresql18'
cp3-ssh 'sudo systemctl try-restart patroni-*'

# within 15 seconds, you should see version 18.3 in the supported versions for
# host-3:
cp1-req list-hosts | jq '.hosts[] 
        | select(.id == "host-3")
        | {id, supported_pgedge_versions}'

# within 30 seconds, you should see a message in the control plane server logs
# that says it detected an updated instance version and an updated node version.
# you should also see the instance is reporting an updated version in the
# databases API and that the database spec has been updated with a version
# override to 18.3 for n2.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# next, upgrade host-2
cp2-ssh 'sudo dnf upgrade -y pgedge-postgresql18'
cp2-ssh 'sudo systemctl try-restart patroni-*'

# within 15 seconds, you should see version 18.3 in the supported versions for
# host-2:
cp1-req list-hosts | jq '.hosts[] 
        | select(.id == "host-2")
        | {id, supported_pgedge_versions}'

# within 30 seconds, you should see that the host-2 instance is reporting an
# updated version; however, the spec should look the same because n1 is not yet
# fully upgraded.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# finally, upgrade host-1
cp1-ssh 'sudo dnf upgrade -y pgedge-postgresql18'
cp1-ssh 'sudo systemctl try-restart patroni-*'

# within 15 seconds, you should see version 18.3 in the supported versions for
# host-1:
cp1-req list-hosts | jq '.hosts[] 
        | select(.id == "host-1")
        | {id, supported_pgedge_versions}'

# within 30 seconds, you should see that the host-1 instance is reporting an
# updated version. You should also see that the spec is updated with version
# 18.3 in the top-level version field, and no more per-node overrides.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# perform a no-op update
# we still expect to see some resource updates in the server logs because the
# version number is written into a few resource types. however, these updates
# won't actually change anything.
cp1-req update-database storefront <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "patroni_port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] },
      { "name": "n2", "host_ids": ["host-3"] }
    ]
  }
}
EOF

# wait 5 seconds to ensure the instance monitor interval has elapsed, then
# re-fetch the updated database. it should look identical to the previous get-database
# request with all instances running 18.3 and 18.3 set in the top-level spec
# field.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# clean up by running
make dev-lima-reset

Using the compose environment

# reset your environment
make dev-reset

# ensure you're using the compose environment
use-compose

# start the compose environment in one terminal
make dev-watch

# then in another terminal
# ensure you're using the compose environment
use-compose

# initialize the cluster
cp-init

# create a database using hosts 1-3
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "postgres_version": "18.2",
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] },
      { "name": "n2", "host_ids": ["host-3"] }
    ]
  }
}
EOF

# check the instance and spec versions
# you should see that each instance is running 18.2 and the version is set at
# the top-level field in the database spec.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# upgrade postgres on the host-3 instance
docker service update \
    --image=ghcr.io/pgedge/pgedge-postgres:18.3-spock5.0.6-standard-1 \
    postgres-storefront-n2-ant97dj4

# within 30 seconds, you should see a message in the control plane server logs
# that says it detected an updated instance version and an updated node version.
# you should also see the instance is reporting an updated version in the
# databases API and that the database spec has been updated with a version
# override to 18.3 for n2.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# upgrade postgres on the host-2 instance
docker service update \
    --image=ghcr.io/pgedge/pgedge-postgres:18.3-spock5.0.6-standard-1 \
    postgres-storefront-n1-9ptayhma

# within 30 seconds, you should see that the host-2 instance is reporting an
# updated version; however, the spec should look the same because n1 is not yet
# fully upgraded.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# finally, upgrade host-1
docker service update \
    --image=ghcr.io/pgedge/pgedge-postgres:18.3-spock5.0.6-standard-1 \
    postgres-storefront-n1-689qacsi

# within 30 seconds, you should see that the host-1 instance is reporting an
# updated version. You should also see that the spec is updated with version
# 18.3 in the top-level version field, and no more per-node overrides.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# perform a no-op update
# we still expect to see some resource updates in the server logs because the
# version number is written into a few resource types. this will trigger a
# patroni reload because the swarm patroni config resource includes the entire
# instance spec. it will also trigger a switchover because the docker CLI sorts
# volume mounts. however, these updates won't change the database versions
# or other configuration.
cp1-req update-database storefront <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] },
      { "name": "n2", "host_ids": ["host-3"] }
    ]
  }
}
EOF

# wait 5 seconds to ensure the instance monitor interval has elapsed, then
# re-fetch the updated database. it should look identical to the previous get-database
# request with all instances running 18.3 and 18.3 set in the top-level spec
# field.
cp1-req get-database storefront | jq '{
    instances : [.instances[] | {id, host_id, node_name, postgres }],
    spec: (.spec | { postgres_version, spock_version, nodes }) }'

# clean up by running
make dev-reset

PLAT-512

@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough


Adds a leader-backed periodic databases monitor that reconciles instance Postgres/Spock versions into stored instance records and specs, introduces parsing/normalization for PgEdge versions, wires monitoring/election and logging, extends storage txn comparisons, and includes tests, docs, and changelog entries.

Database Version Reconciliation & Monitoring

  • Version Parsing / Data Shape (server/internal/ds/versions.go, server/internal/ds/versions_test.go): Introduce Version.MajorMinorVersion() and PgEdgeVersion.Normalize(); replace MustPgEdgeVersion/NewPgEdgeVersion with MustParsePgEdgeVersion/ParsePgEdgeVersion.
  • Config & Logging Consts (server/internal/config/config.go, server/internal/logging/factory.go): Add DatabasesMonitorIntervalSeconds config (default 30) and ComponentDatabaseService logging component.
  • Instance/Spec Model Changes (server/internal/database/instance.go, server/internal/database/spec.go, server/internal/database/instance_store.go): Add InstanceStatus.IsStale(), Spec.NormalizePostgresVersions(), and InstanceStore.Update() op builder.
  • Storage Transaction Support (server/internal/storage/interface.go, server/internal/storage/txn.go): Add Txn.AddConditions(...) to the interface and txn implementation to collect/include etcd clientv3.Cmp conditions in commits.
  • Reconciliation Core (server/internal/database/reconcile_versions.go, server/internal/database/reconcile_versions_test.go): Add Service.ReconcileAllDatabaseVersions and ReconcileVersions, which reconcile instance pgEdge versions, derive observed node versions, stage instance/spec updates, and commit guarded transactions; tests exercise scenarios.
  • Service Logging & Wiring (server/internal/database/service.go, server/internal/database/provide.go): Add logger to Service; NewService accepts loggerFactory and the provider injects it. Replace some NewPgEdgeVersion calls with ParsePgEdgeVersion.
  • Monitor Implementation (server/internal/monitor/databases_monitor.go): Add DatabasesMonitor that periodically calls reconciliation when its election.Candidate is leader; interval from config.
  • Monitor Integration (server/internal/monitor/provide.go, server/internal/monitor/service.go): Provider resolves *election.Service, creates a candidate, and passes it to NewService; Service conditionally constructs, starts, and stops databasesMonitor.
  • Monitor & Host/Instance Adjustments (server/internal/monitor/instance_monitor.go, server/internal/monitor/host_monitor.go): Instance monitor always refreshes DB-connection-derived fields (primary-only fields guarded internally); host monitor now runs UpdateHost and UpdateHostStatus each refresh.
  • Instance Primary Tracking & Node Selection (server/internal/database/instance_resource.go, server/internal/database/node_resource.go): Track PrimaryInstanceIDUpdatedAt on InstanceResource and select node primary by newest timestamp across instances.
  • Populate/Plan Parsing Updates (server/internal/workflows/plan_update.go, server/internal/database/service.go, server/internal/database/spec.go): Replace ds.NewPgEdgeVersion calls with ds.ParsePgEdgeVersion in spec/node parsing and default population; add spec normalization call.
  • Tests & Integration (clustertest/external_upgrade_test.go, clustertest/utils_test.go, clustertest/host_test.go, server/internal/*_test.go): Add TestExternalUpgrade, introduce dockerCmd helper, update tests to new parse APIs, adjust utility behaviors and timeouts.
  • Docs, Changelog & Installer (docs/installation/configuration.md, docs/installation/systemd.md, changes/unreleased/*.yaml, lima/roles/install_prerequisites/tasks/main.yaml): Document databases_monitor_interval_seconds, add manual systemd Postgres minor-upgrade procedure and changelog entries; add old RPM repos and idempotency fixes to the installer.
  • Misc Test/Infra Adjustments (Makefile, e2e/cancel_task_test.go, clustertest/host_test.go, clustertest/*): Increase cluster test timeout, add small sleep in cancel test, propagate context in log helpers, and other test/util tweaks.

🐰 I sniff the versions, one by one,
Leader-bound I watch what hosts have done;
Instances change, the spec learns through,
Manual upgrades whisper, then come true —
A soft hop, and the cluster hums anew.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 3.33%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title 'feat: update spec when db versions change' directly describes the main feature added in this PR: a monitor that updates the database spec when it detects version changes in running instances.
  • Description check ✅ Passed: The PR description includes all required sections: a clear Summary explaining the motivation and what the PR adds, a detailed Changes section listing notable commits with explanations, comprehensive Testing instructions for both dev-lima and compose environments, and a complete Checklist with all items addressed.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@codacy-production

codacy-production Bot commented May 4, 2026

Up to standards ✅

🟢 Issues: 3 medium

Results: 3 new issues

  • Complexity: 3 medium

View in Codacy

🟢 Metrics: 52 complexity · 3 duplication

  • Complexity: 52
  • Duplication: 3

View in Codacy



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
clustertest/utils_test.go (1)

163-176: ⚡ Quick win

dockerCmd: add a context/timeout and surface captured output on failure

Two independent issues:

  1. No timeout: exec.Command has no deadline. A hung Docker daemon or a blocking docker service update will stall the entire test suite indefinitely. Compare with buildImage, which wraps its exec.Command in a 300 s CommandContext. The test receives t testing.TB, so a t.Context() or an explicit background timeout can be threaded through.

  2. Docker output dropped on failure — stderr/stdout are captured into out, but require.NoError(t, err) only surfaces the OS error string (e.g., "exit status 1"). The actual Docker error message is silently discarded, making CI failures much harder to diagnose.

🔧 Proposed fix
-func dockerCmd(t testing.TB, args ...string) string {
+func dockerCmd(t testing.TB, ctx context.Context, args ...string) string {
 	t.Helper()
 
 	tLogf(t, "executing command: docker %s", strings.Join(args, " "))
 
 	var out strings.Builder
-	cmd := exec.Command("docker", args...)
+	cmd := exec.CommandContext(ctx, "docker", args...)
 	cmd.Stdout = &out
 	cmd.Stderr = &out
 	err := cmd.Run()
-	require.NoError(t, err)
+	require.NoError(t, err, "docker %s output:\n%s", strings.Join(args, " "), out.String())
 
 	return strings.TrimSpace(out.String())
 }

Callers can pass t.Context() for a test-lifetime bound, or a tighter context.WithTimeout(context.Background(), 60*time.Second) for specific operations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clustertest/utils_test.go` around lines 163 - 176, The dockerCmd function
lacks a context/timeout and discards Docker output on failure; update dockerCmd
to accept/use a context (use t.Context() when available or wrap with
context.WithTimeout(ctx, 60*time.Second)) and run exec.CommandContext instead of
exec.Command, keep capturing stdout/stderr into the out buffer, and when the
command returns an error include the trimmed out.String() in the test failure
(e.g., via require.NoErrorf/require.Failf or t.Fatalf) so the Docker error
message is surfaced; reference the dockerCmd function and model the pattern
after buildImage which uses CommandContext and a 300s timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 140-142: The tLogf call in the error handling for database
deletion uses the fmt.Errorf-specific %w verb which tLogf (a printf-style
wrapper around t.Logf) doesn't support; update the format string in the error
log that references databaseID and err (the call using tLogf) to use %v for the
error value instead of %w so the error message prints correctly (i.e., change
the format argument in the tLogf call that logs "failed to delete database '%s':
..." to use %v for err).

In `@lima/roles/install_prerequisites/tasks/main.yaml`:
- Around line 18-33: The two yum_repository tasks named "Add pgEdge old
repository" and "Add pgEdge noarch old repository" disable package signature
verification by setting gpgcheck: no and hardcode OS major as 9 in baseurl;
update these tasks to (1) avoid gpgcheck: no — instead set repo_gpgcheck: no if
you only want to skip repo metadata but keep package RPM signature verification,
or keep gpgcheck and add sslverify: yes to assert HTTPS validation, and (2) make
the baseurl dynamic by replacing the literal 9 with
{{ansible_facts['distribution_major_version']}} so the repository URL matches
the host OS major version. Ensure the updated fields are applied in the tasks
with those task names.

In `@server/internal/monitor/provide.go`:
- Around line 60-61: The code unconditionally creates an election candidate (via
electionSvc.NewCandidate) which opens an etcd watch and is never stopped if
DatabasesMonitorIntervalSeconds == 0 because DatabasesMonitor is not created and
candidate.Start/Stop are never called; to fix, avoid creating the candidate when
the databases monitor is disabled by only constructing the candidate when
cfg.DatabasesMonitorIntervalSeconds > 0 (i.e., move the electionSvc.NewCandidate
call behind that conditional), or alternatively change NewCandidate to defer
starting/creating the watch until Candidate.Start is invoked and ensure
Candidate.Stop can safely close any resources even if Start was never called;
update provide.go to create/pass a nil/no-op candidate to NewService when the
monitor is disabled and reference electionSvc.NewCandidate, Candidate.Start,
Candidate.Stop, DatabasesMonitor and DatabasesMonitorIntervalSeconds in your
change.

---

Nitpick comments:
In `@clustertest/utils_test.go`:
- Around line 163-176: The dockerCmd function lacks a context/timeout and
discards Docker output on failure; update dockerCmd to accept/use a context (use
t.Context() when available or wrap with context.WithTimeout(ctx,
60*time.Second)) and run exec.CommandContext instead of exec.Command, keep
capturing stdout/stderr into the out buffer, and when the command returns an
error include the trimmed out.String() in the test failure (e.g., via
require.NoErrorf/require.Failf or t.Fatalf) so the Docker error message is
surfaced; reference the dockerCmd function and model the pattern after
buildImage which uses CommandContext and a 300s timeout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da1e0e34-a22e-484a-9dde-4138a08808f5

📥 Commits

Reviewing files that changed from the base of the PR and between ea89e93 and 99a663b.

📒 Files selected for processing (32)
  • changes/unreleased/Added-20260504-150235.yaml
  • changes/unreleased/Changed-20260504-152348.yaml
  • clustertest/external_upgrade_test.go
  • clustertest/host_test.go
  • clustertest/utils_test.go
  • docs/installation/configuration.md
  • docs/installation/systemd.md
  • lima/roles/install_prerequisites/tasks/main.yaml
  • server/internal/config/config.go
  • server/internal/database/instance.go
  • server/internal/database/instance_store.go
  • server/internal/database/provide.go
  • server/internal/database/reconcile_versions.go
  • server/internal/database/reconcile_versions_test.go
  • server/internal/database/service.go
  • server/internal/database/spec.go
  • server/internal/database/status.go
  • server/internal/ds/versions.go
  • server/internal/ds/versions_test.go
  • server/internal/host/host_test.go
  • server/internal/logging/factory.go
  • server/internal/monitor/databases_monitor.go
  • server/internal/monitor/host_monitor.go
  • server/internal/monitor/instance_monitor.go
  • server/internal/monitor/provide.go
  • server/internal/monitor/service.go
  • server/internal/orchestrator/common/patroni_config_generator_test.go
  • server/internal/orchestrator/swarm/images.go
  • server/internal/orchestrator/swarm/rag_instance_resources_test.go
  • server/internal/storage/interface.go
  • server/internal/storage/txn.go
  • server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
  • server/internal/database/status.go

Comment thread clustertest/external_upgrade_test.go
Comment thread lima/roles/install_prerequisites/tasks/main.yaml
Comment thread server/internal/monitor/provide.go
@jason-lynch jason-lynch force-pushed the feat/PLAT-512/update-database-version-ii branch from 99a663b to 0efec38 on May 5, 2026 12:51
Instances are updated sequentially and switchovers can happen during
that sequence. We want to use the most recently fetched primary instance
ID in the node resource to account for those switchovers.

This change is backward compatible and does not need any migration.

PLAT-512
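To make the commit above concrete, the selection now looks roughly like the sketch below, which picks the most recently reported primary rather than the first non-empty one. The field names follow the walkthrough (PrimaryInstanceID, PrimaryInstanceIDUpdatedAt); the actual node resource code may differ:

```go
package sketch

import "time"

// InstanceResource is a trimmed-down stand-in for the real resource type.
type InstanceResource struct {
	PrimaryInstanceID          string
	PrimaryInstanceIDUpdatedAt time.Time
}

// latestPrimaryInstanceID returns the primary instance ID reported most
// recently, so a switchover that happens mid-sequence wins over stale records.
func latestPrimaryInstanceID(instances []InstanceResource) string {
	var (
		primaryID string
		updatedAt time.Time
	)
	for _, inst := range instances {
		if inst.PrimaryInstanceID == "" {
			continue // this instance has not reported a primary yet
		}
		if inst.PrimaryInstanceIDUpdatedAt.After(updatedAt) {
			primaryID = inst.PrimaryInstanceID
			updatedAt = inst.PrimaryInstanceIDUpdatedAt
		}
	}
	return primaryID
}
```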
This test still causes issues occasionally due to how quickly it cancels
the task after starting it. This sleep is a workaround until we're able
to find and fix the underlying issue.

PLAT-512
Improves the ds.PgEdgeVersion initializers by:

- Renaming `NewPgEdgeVersion`, which parses a `PgEdgeVersion` from
  strings, to `ParsePgEdgeVersion`.
- Adding a method to normalize `PgEdgeVersion`s so that the Postgres
  version contains major and minor components and the Spock version
  contains just a major component. This matches the way that versions
  are currently provided in our API.

PLAT-512
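To illustrate the normalization described above, here is a minimal sketch assuming a simplified version shape; the real ds types and method set may differ:

```go
package sketch

// Version and PgEdgeVersion are simplified stand-ins for the ds types.
type Version struct{ Major, Minor, Patch int }

// MajorMinorVersion drops the patch component.
func (v Version) MajorMinorVersion() Version { return Version{Major: v.Major, Minor: v.Minor} }

// majorOnly drops everything but the major component (hypothetical helper).
func (v Version) majorOnly() Version { return Version{Major: v.Major} }

type PgEdgeVersion struct {
	Postgres Version
	Spock    Version
}

// Normalize trims versions to the precision the API provides today:
// major.minor for Postgres, major only for Spock.
func (v PgEdgeVersion) Normalize() PgEdgeVersion {
	return PgEdgeVersion{
		Postgres: v.Postgres.MajorMinorVersion(),
		Spock:    v.Spock.majorOnly(),
	}
}
```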
Adds an `AddConditions` method to `storage.Txn` to make it possible to
add additional constraints to a transaction. This enables us to enforce
other constraints that are outside of the values we're updating.

PLAT-512
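For illustration, a guarded write using the new AddConditions hook might look like the sketch below; the other Txn methods and the key layout here are assumptions, not the exact interface in this PR:

```go
package sketch

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// txn mirrors the relevant slice of the storage.Txn interface for this
// sketch; the Put and Commit signatures are assumed.
type txn interface {
	AddConditions(cmps ...clientv3.Cmp)
	Put(key string, value []byte)
	Commit(ctx context.Context) error
}

// updateSpecIfInstanceUnchanged stages a spec write that only commits if an
// instance record we read earlier has not been modified since. The guard is
// on a key we are not writing, which is what AddConditions enables.
func updateSpecIfInstanceUnchanged(ctx context.Context, t txn,
	instanceKey string, instanceModRevision int64,
	specKey string, specValue []byte) error {
	t.AddConditions(clientv3.Compare(clientv3.ModRevision(instanceKey), "=", instanceModRevision))
	t.Put(specKey, specValue)
	// If the instance record changed since we read it, the transaction fails
	// and the reconciliation can simply be retried on the next pass.
	return t.Commit(ctx)
}
```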
Modifies the existing host monitor to also update the host record
periodically. This accounts for system changes that happen while the
Control Plane is running, such as system package updates.

PLAT-512
Updates the existing instance monitor to query and record the Postgres
and Spock versions from replica instances. We'll need this to determine
when a node has been updated by an external process, such as a system
package update.

PLAT-512
Adds a new database service method to reconcile the Postgres and Spock
versions that we observe through the instance monitors with the versions
in the database spec. The domain logic of this process is split into a
standalone function with unit tests to enforce its behavior. The service
method uses a transaction with additional constraints to handle
conflicts with any database operations that run while we're performing
this check.

The summary of this new behavior is that when an instance is updated, we
update its corresponding `Instance` record in our database. If all of
the instances have been updated for a node, we update the node's entry
in the database spec. If all of the nodes have been updated, we
normalize the database spec by setting the top-level version fields and
zeroing out the node-level version fields.

PLAT-512
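A simplified sketch of the decision logic described above (Postgres only, with assumed types; the real reconcile_versions.go also handles Spock versions, instance record updates, and the transaction guards):

```go
package sketch

type nodeSpec struct {
	Name            string
	PostgresVersion string // per-node override; empty means "use the top-level version"
}

type databaseSpec struct {
	PostgresVersion string
	Nodes           []nodeSpec
}

// reconcileSpecVersions applies observed instance versions to the spec: a
// node's entry is updated only once every one of its instances reports the
// new version, and the top-level field is updated (with node overrides
// cleared) only once every node agrees.
func reconcileSpecVersions(spec *databaseSpec, observed map[string][]string) {
	for i, node := range spec.Nodes {
		versions := observed[node.Name]
		if len(versions) == 0 || !allEqual(versions) {
			continue // node still mid-upgrade; leave its spec entry alone
		}
		spec.Nodes[i].PostgresVersion = versions[0]
	}

	// Normalize: if every node has converged on one version, promote it to
	// the top-level field and drop the per-node overrides.
	var nodeVersions []string
	for _, node := range spec.Nodes {
		v := node.PostgresVersion
		if v == "" {
			v = spec.PostgresVersion
		}
		nodeVersions = append(nodeVersions, v)
	}
	if len(nodeVersions) > 0 && allEqual(nodeVersions) {
		spec.PostgresVersion = nodeVersions[0]
		for i := range spec.Nodes {
			spec.Nodes[i].PostgresVersion = ""
		}
	}
}

func allEqual(values []string) bool {
	for _, v := range values {
		if v != values[0] {
			return false
		}
	}
	return true
}
```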
Adds a 'databases' monitor to periodically run the database version
reconciliation process. The monitor's interval is configurable and can
be set to '0' to disable it entirely. This setting is useful in testing
and in clusters with a large number of databases where the operator
knows that the database versions will never change outside of the
Control Plane API.

PLAT-512
@jason-lynch jason-lynch force-pushed the feat/PLAT-512/update-database-version-ii branch 2 times, most recently from 2742e61 to 2ba5138 on May 5, 2026 19:09

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
e2e/cancel_task_test.go (1)

54-58: ⚡ Quick win

Replace fixed sleep with condition-based waiting to reduce flakiness

Line 57 introduces a hard-coded delay that can still race in CI and makes the test slower than needed. Prefer waiting until the task reaches a cancelable state (or until CancelDatabaseTask succeeds) with a bounded timeout.

Proposed test-stability refactor
-	// TODO: even after many attempts at fixing this, rapidly cancelling a task
-	// still occasionally causes problems. This sleep is a workaround until
-	// we're able to track down the issue.
-	time.Sleep(500 * time.Millisecond)
+	// Wait until cancellation becomes reliably accepted instead of sleeping a fixed duration.
+	require.Eventually(t, func() bool {
+		_, err := fixture.Client.CancelDatabaseTask(t.Context(), &controlplane.CancelDatabaseTaskPayload{
+			DatabaseID: database.ID,
+			TaskID:     controlplane.Identifier(creation_task.TaskID),
+		})
+		return err == nil
+	}, 10*time.Second, 200*time.Millisecond, "task did not become cancelable in time")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/cancel_task_test.go` around lines 54 - 58, Replace the fixed time.Sleep
with a bounded, condition-based wait: inside the test (cancel_task_test) poll in
a loop with a short backoff until either CancelDatabaseTask returns success or
the task's status indicates it is cancellable (use the existing
CancelDatabaseTask and the task-status getter used elsewhere, e.g.,
GetDatabaseTask/GetTaskStatus), breaking on success or when a context timeout
(e.g., 2–5s) elapses; return/fail the test on timeout. Ensure the loop uses a
small sleep (10–50ms) between attempts to avoid busy-waiting and that the
timeout is enforced via context.WithTimeout or time.After to keep the test
bounded.
server/internal/database/node_resource.go (1)

86-95: 💤 Low value

Optional: extract the comparison into a tiny helper for readability.

Not blocking — the loop body now mixes "skip" filtering with "is-newer" tracking. A small helper (e.g., isMoreRecent(prevID string, prevAt, curAt time.Time) bool) or a comment-anchored latestPrimary struct would make intent obvious if this pattern grows (e.g., once the version-reconciliation monitor lands and similar timestamp-driven picks appear elsewhere).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/database/node_resource.go` around lines 86 - 95, Extract the
"is newer" logic into a small helper to improve readability: create a helper
like isMoreRecent(prevID string, prevAt, curAt time.Time) bool (or a tiny
latestPrimary struct with an UpdateIfNewer method) and replace the inline
condition in the loop that compares instance.PrimaryInstanceIDUpdatedAt with
primaryInstanceUpdatedAt and sets primaryInstanceID/primaryInstanceUpdatedAt;
keep the existing skip/continue for empty PrimaryInstanceID and call the helper
to decide whether to assign the current instance's PrimaryInstanceID and
PrimaryInstanceIDUpdatedAt.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 264-265: The call to waitForTaskComplete (waiting on
updateResp.Task.TaskID) assigns err but never checks it; add an assertion
immediately after that call to fail the test if err is non-nil (e.g., use
t.Fatalf/t.Fatalf-like or require.NoError) so a timeout or failure is not
ignored before the subsequent assertions that depend on the task completing.
Ensure the check references the err returned from waitForTaskComplete(ctx,
cluster.Client(), databaseID, updateResp.Task.TaskID, 3*time.Minute).
- Around line 134-145: Replace the unbounded ctx := context.Background() used
for the cleanup DeleteDatabase call with a bounded context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>); defer
cancel()) so cluster.Client().DeleteDatabase(...) cannot hang indefinitely;
update the DeleteDatabase call site that uses ctx and add the necessary time
import, referencing the ctx variable and the DeleteDatabase method and
databaseID/testConfig.skipCleanup flow.

In `@docs/installation/systemd.md`:
- Around line 248-250: Replace the fenced code blocks inside the numbered steps
with indented code blocks (prefix each command line with four spaces) so
markdownlint MD046 no longer flags nested fenced blocks; specifically update the
snippet containing "sudo dnf upgrade pgedge-postgresql18" and the other
occurrences referenced (around the blocks at the next two locations) to use
indented code formatting instead of ``` fences.
- Around line 28-30: Change the phrase "Minor versions upgrades" to "Minor
version upgrades" in the limitations paragraph (the sentence currently beginning
"Database upgrades are not supported via the API. Minor versions upgrades can be
performed manually..."); update that sentence to read "Minor version upgrades
can be performed manually by a system administrator." to fix the grammar.
- Around line 268-269: The sentence hardcodes "30 seconds" even though that
delay is controlled by the configuration key databases_monitor_interval_seconds;
update the wording in docs/installation/systemd.md to reference the configured
value instead of a hardcoded time (e.g., say "up to the configured
databases_monitor_interval_seconds (defaults to 30s)" or "up to the configured
databases_monitor_interval_seconds value"), and ensure the doc mentions where to
change that setting so readers know it is configurable.

---

Nitpick comments:
In `@e2e/cancel_task_test.go`:
- Around line 54-58: Replace the fixed time.Sleep with a bounded,
condition-based wait: inside the test (cancel_task_test) poll in a loop with a
short backoff until either CancelDatabaseTask returns success or the task's
status indicates it is cancellable (use the existing CancelDatabaseTask and the
task-status getter used elsewhere, e.g., GetDatabaseTask/GetTaskStatus),
breaking on success or when a context timeout (e.g., 2–5s) elapses; return/fail
the test on timeout. Ensure the loop uses a small sleep (10–50ms) between
attempts to avoid busy-waiting and that the timeout is enforced via
context.WithTimeout or time.After to keep the test bounded.

In `@server/internal/database/node_resource.go`:
- Around line 86-95: Extract the "is newer" logic into a small helper to improve
readability: create a helper like isMoreRecent(prevID string, prevAt, curAt
time.Time) bool (or a tiny latestPrimary struct with an UpdateIfNewer method)
and replace the inline condition in the loop that compares
instance.PrimaryInstanceIDUpdatedAt with primaryInstanceUpdatedAt and sets
primaryInstanceID/primaryInstanceUpdatedAt; keep the existing skip/continue for
empty PrimaryInstanceID and call the helper to decide whether to assign the
current instance's PrimaryInstanceID and PrimaryInstanceIDUpdatedAt.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 746ee34d-9c87-4823-8c88-e174405bfca8

📥 Commits

Reviewing files that changed from the base of the PR and between 0efec38 and 2742e61.

📒 Files selected for processing (36)
  • Makefile
  • changes/unreleased/Added-20260504-150235.yaml
  • changes/unreleased/Changed-20260504-152348.yaml
  • clustertest/external_upgrade_test.go
  • clustertest/host_test.go
  • clustertest/utils_test.go
  • docs/installation/configuration.md
  • docs/installation/systemd.md
  • e2e/cancel_task_test.go
  • lima/roles/install_prerequisites/tasks/main.yaml
  • server/internal/config/config.go
  • server/internal/database/instance.go
  • server/internal/database/instance_resource.go
  • server/internal/database/instance_store.go
  • server/internal/database/node_resource.go
  • server/internal/database/provide.go
  • server/internal/database/reconcile_versions.go
  • server/internal/database/reconcile_versions_test.go
  • server/internal/database/service.go
  • server/internal/database/spec.go
  • server/internal/database/status.go
  • server/internal/ds/versions.go
  • server/internal/ds/versions_test.go
  • server/internal/host/host_test.go
  • server/internal/logging/factory.go
  • server/internal/monitor/databases_monitor.go
  • server/internal/monitor/host_monitor.go
  • server/internal/monitor/instance_monitor.go
  • server/internal/monitor/provide.go
  • server/internal/monitor/service.go
  • server/internal/orchestrator/common/patroni_config_generator_test.go
  • server/internal/orchestrator/swarm/images.go
  • server/internal/orchestrator/swarm/rag_instance_resources_test.go
  • server/internal/storage/interface.go
  • server/internal/storage/txn.go
  • server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
  • server/internal/database/status.go
✅ Files skipped from review due to trivial changes (7)
  • server/internal/storage/interface.go
  • changes/unreleased/Changed-20260504-152348.yaml
  • changes/unreleased/Added-20260504-150235.yaml
  • server/internal/workflows/plan_update.go
  • lima/roles/install_prerequisites/tasks/main.yaml
  • server/internal/host/host_test.go
  • server/internal/orchestrator/swarm/rag_instance_resources_test.go
🚧 Files skipped from review as they are similar to previous changes (15)
  • server/internal/orchestrator/swarm/images.go
  • server/internal/database/instance_store.go
  • server/internal/monitor/databases_monitor.go
  • server/internal/monitor/instance_monitor.go
  • server/internal/monitor/provide.go
  • clustertest/host_test.go
  • server/internal/database/provide.go
  • server/internal/orchestrator/common/patroni_config_generator_test.go
  • server/internal/database/reconcile_versions.go
  • server/internal/database/instance.go
  • server/internal/database/service.go
  • server/internal/monitor/host_monitor.go
  • server/internal/database/spec.go
  • server/internal/database/reconcile_versions_test.go
  • server/internal/ds/versions.go

Comment thread clustertest/external_upgrade_test.go Outdated
Comment thread clustertest/external_upgrade_test.go
Comment thread docs/installation/systemd.md Outdated
Comment thread docs/installation/systemd.md
Comment thread docs/installation/systemd.md
Adds a cluster test to verify the behavior of the new database version
reconciliation process.

PLAT-512
pgEdge only makes the latest package versions available in the `release`
repositories. Older versions of packages are only available in the
`release_old` repositories.

This commit adds the `release_old` repositories to the `dev-lima`
environment so that we can install older versions of Postgres and other
components in this environment.

PLAT-512
Adds the new configuration setting and logging component to the
configuration guide.

PLAT-512
Updates the systemd document to describe the Postgres minor version
upgrade process.

PLAT-512
Adds changelog entries for db version reconciliation and the instance
monitor change.

PLAT-512
@jason-lynch jason-lynch force-pushed the feat/PLAT-512/update-database-version-ii branch from 2ba5138 to 25094c9 on May 5, 2026 19:42
@jason-lynch jason-lynch marked this pull request as ready for review on May 5, 2026 21:20

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
clustertest/external_upgrade_test.go (1)

168-168: ⚡ Quick win

Fixed time.Sleep calls are fragile after async Docker service updates.

docker service update is asynchronous — the new task may still be starting when sleepDuration expires. Within 5 s the test requires: the container to restart with the new image, the instance monitor to report the new version, and the databases monitor (interval = 3 s) to reconcile the spec. In a slow CI environment this is a tight race that can produce intermittent failures.

require.Eventually with a generous deadline and a short poll interval would remove the dependency on a fixed sleep and give clear assertion failure messages when the deadline expires:

require.Eventually(t, func() bool {
    db, err = cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{
        DatabaseID: databaseID,
    })
    if err != nil {
        return false
    }
    // inline the version checks; return false on mismatch
    return specAndInstanceVersionsMatch(db, ...)
}, 30*time.Second, 500*time.Millisecond)

Apply the same pattern at Lines 168, 198, 225, and 270.

Also applies to: 198-198, 225-225, 270-270

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clustertest/external_upgrade_test.go` at line 168, Replace the fragile
time.Sleep(sleepDuration) usages with require.Eventually polling: for each
occurrence (the calls using sleepDuration at the sites referencing databaseID,
ctx, cluster.Client().GetDatabase and specAndInstanceVersionsMatch), call
require.Eventually with a generous timeout (e.g. 30s) and short tick (e.g.
500ms) and inside the predicate perform the cluster.Client().GetDatabase(ctx,
&controlplane.GetDatabasePayload{DatabaseID: databaseID}), return false on error
or version mismatches, and true only when specAndInstanceVersionsMatch(...) and
instance version checks pass; apply this replacement to the four locations that
currently use time.Sleep (the ones around lines referencing sleepDuration,
databaseID, GetDatabase and specAndInstanceVersionsMatch).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Line 55: The assertion uses require.Equal with arguments reversed; change the
call so the expected value client.InstanceStateAvailable is the second-to-last
parameter and the runtime value instance.State is the last (i.e., call
require.Equal(t, client.InstanceStateAvailable, instance.State)) so failures
show "expected" as the known constant and "actual" as the observed
instance.State.

---

Nitpick comments:
In `@clustertest/external_upgrade_test.go`:
- Line 168: Replace the fragile time.Sleep(sleepDuration) usages with
require.Eventually polling: for each occurrence (the calls using sleepDuration
at the sites referencing databaseID, ctx, cluster.Client().GetDatabase and
specAndInstanceVersionsMatch), call require.Eventually with a generous timeout
(e.g. 30s) and short tick (e.g. 500ms) and inside the predicate perform the
cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{DatabaseID:
databaseID}), return false on error or version mismatches, and true only when
specAndInstanceVersionsMatch(...) and instance version checks pass; apply this
replacement to the four locations that currently use time.Sleep (the ones around
lines referencing sleepDuration, databaseID, GetDatabase and
specAndInstanceVersionsMatch).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a4038ff5-5bcb-44e5-a2bc-3448c7337c6d

📥 Commits

Reviewing files that changed from the base of the PR and between 2742e61 and 25094c9.

📒 Files selected for processing (9)
  • Makefile
  • changes/unreleased/Added-20260504-150235.yaml
  • changes/unreleased/Changed-20260504-152348.yaml
  • clustertest/external_upgrade_test.go
  • clustertest/host_test.go
  • clustertest/utils_test.go
  • docs/installation/configuration.md
  • docs/installation/systemd.md
  • lima/roles/install_prerequisites/tasks/main.yaml


actualNodeHostVersions := map[string]map[string]string{}
for _, instance := range instances {
	require.Equal(t, instance.State, client.InstanceStateAvailable)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

require.Equal arguments are in the wrong order.

Testify's require.Equal(t, expected, actual) convention expects the constant/known value first and the runtime value second. With instance.State as the first argument and client.InstanceStateAvailable as the second, a failure would read:

expected: "some_other_state", actual: "available"

which is the reverse of what you'd expect, making failures harder to diagnose.

🐛 Proposed fix
-			require.Equal(t, instance.State, client.InstanceStateAvailable)
+			require.Equal(t, client.InstanceStateAvailable, instance.State)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clustertest/external_upgrade_test.go` at line 55, The assertion uses
require.Equal with arguments reversed; change the call so the expected value
client.InstanceStateAvailable is the second-to-last parameter and the runtime
value instance.State is the last (i.e., call require.Equal(t,
client.InstanceStateAvailable, instance.State)) so failures show "expected" as
the known constant and "actual" as the observed instance.State.

