Add secret rotation tracking with composite hash drift detection #1781

Open

lmiccini wants to merge 1 commit into openstack-k8s-operators:main from lmiccini:nodeset_rmqu_finalizer_configmap

Conversation

@lmiccini
Contributor

@lmiccini lmiccini commented Jan 27, 2026

Track which nodes in a nodeset have been deployed with the latest secret
versions, enabling safe credential rotation by blocking deletion of old
credentials until all nodes are updated.

Design:

  • Compute a composite SHA-256 hash from all tracked secrets, sorted by name
    for determinism, using lib-common's secret.Hash() for content-based hashing
  • Store the deployed node list in a ConfigMap (<nodeset>-secret-tracking)
    to avoid bloating CR status for large nodesets. The ConfigMap contains
    {deployedSecretHash, deployedNodes[]}
  • Detect drift on every reconcile by fetching tracked secrets from the cluster
    and comparing the composite hash against DeployedSecretHash in status. On
    mismatch, set AllNodesUpdated=false and UpdatedNodes=0 immediately
  • Accumulate nodes across partial deployments using AnsibleLimit: when the
    deployment hash matches the stored hash, nodes are appended; when the hash
    differs (secrets changed), the node list resets
  • Skip stale deployments whose secret hashes no longer match the cluster state
  • Prune removed nodes by intersecting DeployedNodes with current Spec.Nodes

SecretDeploymentStatus fields in NodeSet status:

  • AllNodesUpdated: all nodes deployed with current secrets (consumed by
    infra-operator's RabbitMQUser controller to block credential deletion)
  • TotalNodes / UpdatedNodes: node counts for observability
  • DeployedSecretHash: composite hash at last successful deployment
  • LastUpdateTime: timestamp of last status update

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/56ac80bd0e7547ad88350eb0206886b5

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 18m 47s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 23m 38s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 37m 31s
❌ adoption-standalone-to-crc-ceph-provider FAILURE in 3h 01m 55s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 51m 23s
❌ openstack-operator-docs-preview POST_FAILURE in 2m 32s

@stuggi stuggi requested a review from slagle January 28, 2026 08:13
Comment thread on api/core/v1beta1/openstackcontrolplane_webhook.go (Outdated)
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/db62c9cd33b34a538c7eccf243769b6a

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 02m 26s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 20m 56s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 36m 03s
❌ adoption-standalone-to-crc-ceph-provider FAILURE in 1h 46m 57s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 34m 08s
❌ openstack-operator-docs-preview POST_FAILURE in 3m 15s

@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 2 times, most recently from 3885c4a to c1fe8f8 on February 7, 2026 18:56
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/b5d3972863e64857b2da5055f867ef55

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 20m 43s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 21m 41s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 36m 22s
❌ adoption-standalone-to-crc-ceph-provider FAILURE in 2h 05m 30s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 43m 01s
✔️ openstack-operator-docs-preview SUCCESS in 3m 14s

@lmiccini
Contributor Author

lmiccini commented Feb 8, 2026

/retest

@lmiccini
Contributor Author

lmiccini commented Feb 8, 2026

recheck

@lmiccini
Contributor Author

lmiccini commented Feb 8, 2026

/test openstack-operator-build-deploy-kuttl-4-18

@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 2 times, most recently from cbfbb7c to f52529a on February 8, 2026 15:01
@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from f52529a to 017d2ca on February 10, 2026 06:41
@lmiccini
Contributor Author

/test functional

@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 2 times, most recently from 97bb482 to b1d9350 on February 12, 2026 15:46
@lmiccini
Contributor Author

/test openstack-operator-build-deploy-kuttl-4-18

@slagle
Contributor

slagle commented Mar 4, 2026

I think this change actually encompasses several pieces of functionality that could be expressed as different features, each of which probably needs a good design and a discussion about how we want them to work. A couple I can see right off:

  1. partial NodeSet deployments due to ansible limit, and how to represent that
  2. secret/configmap tracking of what was used by deployments
  3. whether it is safe to delete old credentials (are all NodeSets updated, etc.)

I feel like we'd short-circuit some of the design/acceptance/testing cycle if we merge this. This is a big change to the NodeSet type and controller, which is an area the edpm team is largely responsible for, but I don't feel like that team has a good handle on taking ownership of this functionality (testing, docs, etc.).

Can't we just tell users to do a deployment across all nodesets without ansible limit before deleting old credentials? I'm not sure that's something we need to solve in code right away.

@lmiccini
Contributor Author

lmiccini commented Mar 5, 2026

Absolutely. Right now we are documenting how to check for and eventually remove the rotated users, so we are not in a rush to merge this as is.
We can just use this PR as a base to see whether the proposed approach (or part of it) is viable for solving some of the challenges you raise. I'll rebase it and set up an env so we can poke at it.

@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from cd2cef1 to 490e443 on March 5, 2026 17:17
@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from 490e443 to 3157e2b on March 5, 2026 17:17
@github-actions

github-actions Bot commented Mar 5, 2026

OpenStackControlPlane CRD Size Report

Metric            Value
CRD JSON size     322464 bytes (315KB)
Base branch size  322464 bytes
Change            +0.00%
Status            🟡 yellow — growing

Threshold reference:

Color      Range       Meaning
🟢 green   < 300KB     Comfortable
🟡 yellow  300–400KB   Growing
🟠 orange  400–750KB   Concerning
🔴 red     > 750KB     Approaching 1.5MB etcd limit (cut in half to allow space for updates)

@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 2 times, most recently from 8db6b16 to d974015 on March 6, 2026 07:02
@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from d974015 to 45f39f8 on April 24, 2026 15:49
@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 4 times, most recently from 5b509f3 to 5880a65 on April 25, 2026 10:44
@lmiccini
Contributor Author

/test openstack-operator-build-deploy-kuttl-4-18

@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from 5880a65 to 187eecc on April 27, 2026 07:16
@lmiccini lmiccini changed the title from "Add per-node secret rotation tracking with drift detection" to "Add secret rotation tracking with composite hash drift detection" on April 27, 2026
@lmiccini lmiccini requested a review from slagle April 27, 2026 07:33
Track which nodes in a nodeset have been deployed with the latest secret
versions, enabling safe credential rotation by blocking deletion of old
credentials until all nodes are updated.

Design:
- Compute a composite SHA-256 hash from all tracked secrets, sorted by name
  for determinism, using lib-common's secret.Hash() for content-based hashing
- Store the deployed node list in a ConfigMap (<nodeset>-secret-tracking)
  to avoid bloating CR status for large nodesets. The ConfigMap contains
  {deployedSecretHash, deployedNodes[]}
- Detect drift on every reconcile by fetching tracked secrets from the cluster
  and comparing the composite hash against DeployedSecretHash in status. On
  mismatch, set AllNodesUpdated=false and UpdatedNodes=0 immediately
- Accumulate nodes across partial deployments using AnsibleLimit: when the
  deployment hash matches the stored hash, nodes are appended; when the hash
  differs (secrets changed), the node list resets
- Skip stale deployments whose secret hashes no longer match the cluster state
- Prune removed nodes by intersecting DeployedNodes with current Spec.Nodes

SecretDeploymentStatus fields in NodeSet status:
- AllNodesUpdated: all nodes deployed with current secrets (consumed by
  infra-operator's RabbitMQUser controller to block credential deletion)
- TotalNodes / UpdatedNodes: node counts for observability
- DeployedSecretHash: composite hash at last successful deployment
- LastUpdateTime: timestamp of last status update

API exports:
- GetSecretTrackingConfigMapName(nodesetName) returns the deterministic
  ConfigMap name for a given nodeset, allowing consumers to reference it

Watch configuration:
- Secret watch uses ResourceVersionChangedPredicate to avoid reconciling
  on metadata-only updates
- Field index on status.secretHashes.keys enables O(1) lookup of nodesets
  tracking a given secret, replacing O(secrets * nodesets) list scans

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from 187eecc to 9f7cae7 on April 27, 2026 07:38
@lmiccini
Contributor Author

/test openstack-operator-build-deploy-kuttl-4-18

@openshift-ci
Contributor

openshift-ci Bot commented Apr 27, 2026

@lmiccini: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/openstack-operator-build-deploy-kuttl
Commit: 1698305
Required: true
Rerun command: /test openstack-operator-build-deploy-kuttl


@lmiccini
Contributor Author

/test openstack-operator-build-deploy-kuttl-4-18

@rabi
Contributor

rabi commented Apr 28, 2026

Why can't infra-operator block deletion if there is a mismatch between the latest deployment and nodeset secret hashes? I am not sure we should go down this path, as it is orthogonal to the original design. Also, users should be discouraged from doing secret rotations using ansible_limits. It is there for scenarios where nodes are not reachable for some reason, and users are expected to eventually do a full deployment across all nodes of the nodeset to keep things in sync.

@lmiccini
Contributor Author

Why can't infra-operator block deletion if there is a mismatch between the latest deployment and nodeset secret hashes? I am not sure we should go down this path, as it is orthogonal to the original design. Also, users should be discouraged from doing secret rotations using ansible_limits. It is there for scenarios where nodes are not reachable for some reason, and users are expected to eventually do a full deployment across all nodes of the nodeset to keep things in sync.

Thanks @rabi. One of the challenges with that approach is that the last deployment may not be representative of the overall status of the nodeset. AFAIU we would have to introspect the deployments to check whether ansible_limits was used and/or whether only a subset of the services were deployed, and repeat this pattern not just in infra-operator but in every operator that needs to know whether the dataplane is in sync (keystone for appcreds, possibly others?).
I thought a generic "am I up to date" status field would be in line with the loosely coupled approach we have for the dataplane. We would also be able to use it for credential rotation (rabbitmq, appcred, etc.) and to facilitate cleanup after we migrate from mirrored to quorum queues, among other things. I am sure there are more use cases for it.
