From f1284e1c02e60d68877af82c245d1e9c8588f133 Mon Sep 17 00:00:00 2001 From: Qian Deng Date: Thu, 14 May 2026 11:15:41 +0000 Subject: [PATCH] docs: add OpenSpec baseline and follow-up change proposals Introduce OpenSpec as the spec source-of-truth for the controller: - Archive the baseline-existing-system change that captures current behavior of Requirement, Operation, AppDeployment, and Cache CRDs as four capability specs under openspec/specs/. - Add two follow-up change proposals: - guarantee-operation-id-uniqueness: tighten operation ID assignment so uniqueness is enforced (via metadata.uid), not probabilistic. - document-cache-mode-annotation: document the cache-mode annotation emitted on cache-created Operations as an operator-facing contract. - Seed openspec/config.yaml with project context. - Add a Specifications section to README pointing future PRs at the spec deltas workflow. No runtime or API changes. Co-Authored-By: Claude Opus 4 --- README.md | 5 +- .../.openspec.yaml | 2 + .../design.md | 57 ++++++++++++++++++ .../proposal.md | 30 ++++++++++ .../specs/appdeployment-execution/spec.md | 43 ++++++++++++++ .../specs/cache-pool/spec.md | 56 ++++++++++++++++++ .../specs/operation-orchestration/spec.md | 51 ++++++++++++++++ .../specs/requirement-management/spec.md | 43 ++++++++++++++ .../tasks.md | 29 +++++++++ .../.openspec.yaml | 2 + .../document-cache-mode-annotation/design.md | 47 +++++++++++++++ .../proposal.md | 27 +++++++++ .../specs/cache-pool/spec.md | 16 +++++ .../document-cache-mode-annotation/tasks.md | 16 +++++ .../.openspec.yaml | 2 + .../design.md | 54 +++++++++++++++++ .../proposal.md | 27 +++++++++ .../specs/operation-orchestration/spec.md | 21 +++++++ .../tasks.md | 20 +++++++ openspec/config.yaml | 19 ++++++ .../specs/appdeployment-execution/spec.md | 46 +++++++++++++++ openspec/specs/cache-pool/spec.md | 59 +++++++++++++++++++ .../specs/operation-orchestration/spec.md | 54 +++++++++++++++++ openspec/specs/requirement-management/spec.md | 46 +++++++++++++++ 24 files changed, 771 insertions(+), 1 deletion(-) create mode 100644 openspec/changes/archive/2026-05-13-baseline-existing-system/.openspec.yaml create mode 100644 openspec/changes/archive/2026-05-13-baseline-existing-system/design.md create mode 100644 openspec/changes/archive/2026-05-13-baseline-existing-system/proposal.md create mode 100644 openspec/changes/archive/2026-05-13-baseline-existing-system/specs/appdeployment-execution/spec.md create mode 100644 openspec/changes/archive/2026-05-13-baseline-existing-system/specs/cache-pool/spec.md create mode 100644 openspec/changes/archive/2026-05-13-baseline-existing-system/specs/operation-orchestration/spec.md create mode 100644 openspec/changes/archive/2026-05-13-baseline-existing-system/specs/requirement-management/spec.md create mode 100644 openspec/changes/archive/2026-05-13-baseline-existing-system/tasks.md create mode 100644 openspec/changes/document-cache-mode-annotation/.openspec.yaml create mode 100644 openspec/changes/document-cache-mode-annotation/design.md create mode 100644 openspec/changes/document-cache-mode-annotation/proposal.md create mode 100644 openspec/changes/document-cache-mode-annotation/specs/cache-pool/spec.md create mode 100644 openspec/changes/document-cache-mode-annotation/tasks.md create mode 100644 openspec/changes/guarantee-operation-id-uniqueness/.openspec.yaml create mode 100644 openspec/changes/guarantee-operation-id-uniqueness/design.md create mode 100644 openspec/changes/guarantee-operation-id-uniqueness/proposal.md create mode 100644 openspec/changes/guarantee-operation-id-uniqueness/specs/operation-orchestration/spec.md create mode 100644 openspec/changes/guarantee-operation-id-uniqueness/tasks.md create mode 100644 openspec/config.yaml create mode 100644 openspec/specs/appdeployment-execution/spec.md create mode 100644 openspec/specs/cache-pool/spec.md create mode 100644 openspec/specs/operation-orchestration/spec.md create mode 100644 openspec/specs/requirement-management/spec.md diff --git a/README.md b/README.md index 874c8be..9ab3683 100644 --- a/README.md +++ b/README.md @@ -5,5 +5,8 @@ A k8s controller used to manage operations and cache the outcome of that operation +## Specifications -**Trademarks** This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies. \ No newline at end of file +OpenSpec capability baselines live under `openspec/specs/`. Future PRs that change CRD fields or reconciler behavior must include an OpenSpec change with deltas against those specs so reviewers can compare the intended behavior before implementation details. + +**Trademarks** This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies. diff --git a/openspec/changes/archive/2026-05-13-baseline-existing-system/.openspec.yaml b/openspec/changes/archive/2026-05-13-baseline-existing-system/.openspec.yaml new file mode 100644 index 0000000..93831bd --- /dev/null +++ b/openspec/changes/archive/2026-05-13-baseline-existing-system/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-05-13 diff --git a/openspec/changes/archive/2026-05-13-baseline-existing-system/design.md b/openspec/changes/archive/2026-05-13-baseline-existing-system/design.md new file mode 100644 index 0000000..a1b1e0c --- /dev/null +++ b/openspec/changes/archive/2026-05-13-baseline-existing-system/design.md @@ -0,0 +1,57 @@ +## Context + +The Operation Cache Controller is a kubebuilder v4 operator already running in production. It manages four CRDs in `controller.azure.github.com/v1alpha1` (`Requirement`, `Operation`, `AppDeployment`, `Cache`) using a handler-based reconciliation pattern in `internal/controller/` and `internal/handler/`. There is no machine-readable spec, so this change introduces OpenSpec as the source of truth going forward without modifying any code paths. + +Stakeholders: controller maintainers (need a baseline to diff against), reviewers (want spec-level PRs), new contributors (want to learn intent without reading 22 test files). + +## Goals / Non-Goals + +**Goals:** +- Capture the *current* externally observable behavior of each CRD as testable scenarios under `openspec/specs//spec.md`. +- One capability spec per top-level CRD; nested concepts (Jobs, conditions, finalizers) appear as scenarios within the parent capability rather than separate capabilities. +- Use SHALL/MUST language so each requirement is normative and verifiable against existing Ginkgo tests. + +**Non-Goals:** +- No code, CRD field, RBAC, or runtime behavior changes. +- No reorganization of `internal/` packages. +- No documentation of internal helper packages (`internal/utils/...`); these are implementation detail. +- No webhook spec — webhooks are scaffolded but not implemented. + +## Decisions + +### Decision: One capability per top-level CRD + +We map exactly one capability to each of `Requirement`, `Operation`, `AppDeployment`, `Cache`. Alternative considered: a single `cache-controller` capability covering everything. Rejected — it would force every future delta to touch one giant spec and erase the natural CRD-level review boundary the team already uses. + +### Decision: Scenarios written in WHEN/THEN against observable cluster state + +Each scenario describes what an external observer (kubectl / a watcher) sees, not what handler functions do internally. Alternative: describe internal handler invocations. Rejected — couples specs to implementation, defeats the purpose of being able to refactor handlers without rewriting specs. + +### Decision: Document constants (finalizers, annotations, label keys) as part of requirements + +Names like `finalizer.operation.controller.azure.com` and `operation.controller.azure.com/acquired` are part of the *contract* with cluster operators and cannot be silently renamed. They appear inline in the relevant scenarios. Alternative: keep them only in code constants. Rejected — anyone integrating with the controller (dashboards, GitOps tooling) needs to discover these from the spec. + +### Decision: No delta operations in this change + +Because no prior specs exist, every requirement uses `## ADDED Requirements`. Future changes will use `MODIFIED`/`REMOVED`/`RENAMED` against these baselines. This keeps the baseline change reviewable as additive-only. + +## Risks / Trade-offs + +- **Risk:** Spec drifts from code if it under-describes edge cases (e.g., race conditions on cache acquisition). → **Mitigation:** Treat baseline as v1; correct in follow-up changes when discrepancies are found during real reviews. Do not block this PR on exhaustive coverage. +- **Risk:** Naming a capability after a CRD ties spec churn to API renames. → **Mitigation:** API is `v1alpha1` and stable in practice; if a CRD is ever renamed, that change uses `RENAMED Requirements` plus a folder rename — supported by OpenSpec. +- **Risk:** Cache-hit acquisition logic is genuinely non-trivial (annotation timestamps, ownership transfer). Spec may oversimplify. → **Mitigation:** Scenarios cite the annotation key and ownership-transfer behavior explicitly so future deltas have something concrete to amend. + +## Migration Plan + +1. Land this change. Archive after `/opsx:verify` passes. +2. Going forward, every PR that touches `api/v1alpha1/` or reconciler behavior MUST include an OpenSpec change with deltas against the baseline. +3. No rollback needed — pure documentation. + +## Open Questions + +- Should `internal/utils/reconciler/operations.go` (sequential-operations pattern) be promoted to its own capability later? Deferred; it's an implementation detail today. +- Webhook scaffolding exists but is unused — do we add a placeholder capability now or wait until it's wired up? Deferred to when webhooks are activated. +- Follow-up test gap: no existing controller or handler test covers the current absence of `Requirement` finalizer reconciliation, despite the API constant `RequirementFinalizerName`. +- Follow-up test gap: no existing controller or handler test asserts that two independently initialized `Operation` resources receive distinct `status.operationId` values; helper coverage currently checks only that generated IDs are non-empty. +- Follow-up test gap: cache acquisition removal from `Cache.status.availableCaches` is covered indirectly by ownership-transfer behavior, but no end-to-end controller or handler test verifies the acquired operation disappearing from cache status after a cache reconcile. +- Follow-up test gap: no existing controller or handler test explicitly covers the `Cache` no-finalizer contract or garbage-collection behavior after user deletion. diff --git a/openspec/changes/archive/2026-05-13-baseline-existing-system/proposal.md b/openspec/changes/archive/2026-05-13-baseline-existing-system/proposal.md new file mode 100644 index 0000000..380cab9 --- /dev/null +++ b/openspec/changes/archive/2026-05-13-baseline-existing-system/proposal.md @@ -0,0 +1,30 @@ +## Why + +The Operation Cache Controller has been built and is running in production, but it has no machine-readable specification. New contributors must reverse-engineer behavior from Go code, and future OpenSpec change proposals have nothing to diff against. This baseline change captures the *current* behavior of the four CRDs and their reconcilers as OpenSpec capability specs so that subsequent work can propose true deltas. + +## What Changes + +- Document the existing system as four capability specs derived from the current `internal/controller/` and `internal/handler/` implementations. +- No source code, CRDs, RBAC, or runtime behavior change. This is a documentation-only baseline. +- Establish naming conventions for future capability deltas (one capability per top-level CRD). +- Record the cache hit/miss data flow and finalizer/ownership rules as testable requirements. + +## Capabilities + +### New Capabilities + +- `requirement-management`: Reconciliation of `Requirement` CRs into `Operation` and (optionally) `Cache` resources, including cache-key derivation and cache-hit acquisition. +- `operation-orchestration`: Lifecycle of `Operation` CRs — fan-out to per-app `AppDeployment` children, status aggregation, finalizer-driven teardown, and acquisition annotations. +- `appdeployment-execution`: Translation of `AppDeployment` CRs into provision/teardown Kubernetes `Job`s, job-status reconciliation, and finalizer cleanup. +- `cache-pool`: Pre-provisioning of `Operation`s under a `Cache`, auto-count maintenance, cache-duration expiry, and label-based cache-key indexing. + +### Modified Capabilities + + + +## Impact + +- Affected code: none (read-only documentation pass over `api/v1alpha1/`, `internal/controller/`, `internal/handler/`). +- Affected APIs: none. The four CRDs in `controller.azure.github.com/v1alpha1` are described, not changed. +- Affected processes: future changes must now ship spec deltas against these baselines instead of free-form proposals. +- Risk: low — if the spec mis-describes current behavior, it is corrected in a follow-up change; runtime is unaffected. diff --git a/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/appdeployment-execution/spec.md b/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/appdeployment-execution/spec.md new file mode 100644 index 0000000..fb5a643 --- /dev/null +++ b/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/appdeployment-execution/spec.md @@ -0,0 +1,43 @@ +## ADDED Requirements + +### Requirement: AppDeployment runs the provision Job on creation + +The controller SHALL, when an `AppDeployment` is created, launch a Kubernetes `Job` derived from `spec.provision` owned by the `AppDeployment`, transition `status.phase` from `""` → `Pending` → `Deploying`, and set `status.phase=Ready` when the provision `Job` reports `Complete=True`. + +#### Scenario: Successful provision Job promotes AppDeployment to Ready + +- **WHEN** an `AppDeployment` is created and its provision `Job` succeeds +- **THEN** `status.phase` of the `AppDeployment` becomes `Ready` + +#### Scenario: Failed provision Job is retried and keeps AppDeployment out of Ready + +- **WHEN** the provision `Job` reports `Failed=True` +- **THEN** the controller deletes the failed provision `Job`, creates a replacement provision `Job`, and keeps the `AppDeployment` out of `Ready` until a provision `Job` succeeds + +### Requirement: AppDeployment respects declared dependencies + +The controller SHALL NOT launch the provision `Job` for an `AppDeployment` whose `spec.dependencies` include sibling app names until every dependency `AppDeployment` (sharing the same parent `Operation` via `spec.opId`) has reached `status.phase=Ready`. + +#### Scenario: Dependent app waits for its dependency + +- **WHEN** `AppDeployment` `app-b` declares `spec.dependencies=["app-a"]` and `app-a` is not yet `Ready` +- **THEN** `app-b` remains in `status.phase=Pending` and no provision `Job` is created for it +- **AND** once `app-a` reaches `status.phase=Ready`, the controller launches `app-b`'s provision `Job` + +### Requirement: AppDeployment runs the teardown Job on deletion + +The controller SHALL add the finalizer `finalizer.appdeployment.devinfra.goms.io` to every `AppDeployment`, and on deletion SHALL launch a `Job` derived from `spec.teardown` owned by the `AppDeployment`, transition `status.phase` to `Deleting`, and remove the finalizer after `status.phase=Deleted`. The controller sets `status.phase=Deleted` when the teardown `Job` succeeds, or after it observes and deletes a failed teardown `Job` while emitting a warning event. + +#### Scenario: Teardown Job runs before AppDeployment is removed + +- **WHEN** an `AppDeployment` in `status.phase=Ready` is deleted +- **THEN** a teardown `Job` is created, the `AppDeployment` reports `status.phase=Deleting` until the teardown attempt completes or fails, then reports `status.phase=Deleted` and has its finalizer removed + +### Requirement: AppDeployment owns its Jobs for cascade deletion + +The controller SHALL set the `AppDeployment` as the controller `ownerReference` of every provision and teardown `Job` it creates, so deleting the `AppDeployment` causes Kubernetes garbage collection of orphaned `Job`s. + +#### Scenario: Deleting AppDeployment removes its Jobs + +- **WHEN** an `AppDeployment` and its provision `Job` exist, and the `AppDeployment` is deleted +- **THEN** the provision `Job` is garbage-collected by Kubernetes via `ownerReferences` diff --git a/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/cache-pool/spec.md b/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/cache-pool/spec.md new file mode 100644 index 0000000..829877b --- /dev/null +++ b/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/cache-pool/spec.md @@ -0,0 +1,56 @@ +## ADDED Requirements + +### Requirement: Cache maintains a pool of pre-provisioned Operations + +The controller SHALL, for every `Cache`, set `status.keepAlive` to the controller's fixed keep-alive count, list owned `Operation`s, publish the names of owned `Operation`s with `status.phase=Reconciled` in `status.availableCaches`, and create additional owned `Operation`s from `spec.operationTemplate` only when the total owned `Operation` count is below `status.keepAlive`. + +#### Scenario: Empty Cache provisions Operations to fill the pool + +- **WHEN** a `Cache` is created with `spec.operationTemplate` set and no owned `Operation`s exist +- **THEN** the controller creates owned `Operation`s from the template until the total owned `Operation` count reaches `status.keepAlive` +- **AND** after owned `Operation`s reach `status.phase=Reconciled`, `status.availableCaches` lists their names + +### Requirement: Cache CR is keyed by a deterministic cache key + +The controller SHALL compute `status.cacheKey` deterministically from `spec.operationTemplate` and SHALL apply the label `operation-cache-controller.azure.github.com/cache-key=` to every `Operation` it creates for the cache, truncating the label value to the Kubernetes label-value limit when required. The controller SHALL NOT add this cache-key label to the `Cache` resource itself. + +#### Scenario: Two Caches with identical templates compute the same key + +- **WHEN** two `Cache` CRs with byte-identical `spec.operationTemplate` are created +- **THEN** both report the same `status.cacheKey`, and cached `Operation`s created for either `Cache` carry the matching `cache-key` label value + +### Requirement: Cache surfaces available entries in status + +The controller SHALL keep `status.availableCaches` synchronized with the names of owned `Operation`s that are `Reconciled`. When a cached `Operation` is acquired, ownership transfers away from the `Cache`, so the next cache reconcile SHALL omit that `Operation` from `status.availableCaches`. + +#### Scenario: Acquired cached Operation disappears from status + +- **WHEN** a `Requirement` acquires a cached `Operation` +- **THEN** that `Operation`'s name is removed from the `Cache`'s `status.availableCaches` within one reconcile cycle + +### Requirement: Cache expires entries past spec.expireTime + +The controller SHALL delete the `Cache` CR once the wall-clock time exceeds `Cache.spec.expireTime` (when set) and SHALL stop processing additional cache-pool operations in that reconcile. Cleanup of owned cached `Operation`s relies on Kubernetes `ownerReferences` cascade deletion. + +#### Scenario: Past-expiry Cache is deleted + +- **WHEN** the current time is after `Cache.spec.expireTime` +- **THEN** the controller deletes the `Cache` CR and does not create replacement cached `Operation`s in that reconcile + +### Requirement: Cache reconciliation runs at least every 60 seconds + +The controller SHALL re-reconcile every `Cache` CR within 60 seconds even with no watch events, so pool depletion and expiry are eventually observed. + +#### Scenario: Idle Cache is re-evaluated + +- **WHEN** a `Cache` exists and no events fire against it +- **THEN** the controller re-reconciles it within 60 seconds + +### Requirement: Cache does not use a finalizer + +The controller SHALL NOT register a finalizer on `Cache` CRs; cleanup of owned `Operation`s relies on Kubernetes `ownerReferences` cascade deletion alone. + +#### Scenario: Deleting a Cache cascades via owner references + +- **WHEN** a `Cache` with owned cached `Operation`s is deleted +- **THEN** Kubernetes garbage-collects those `Operation`s via `ownerReferences` without any finalizer interaction diff --git a/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/operation-orchestration/spec.md b/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/operation-orchestration/spec.md new file mode 100644 index 0000000..61b20ba --- /dev/null +++ b/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/operation-orchestration/spec.md @@ -0,0 +1,51 @@ +## ADDED Requirements + +### Requirement: Operation fans out to one AppDeployment per application + +The controller SHALL, for every `Operation`, create exactly one owned `AppDeployment` per entry in `spec.applications`, copying the application's `provision`, `teardown`, and `dependencies` fields and stamping `spec.opId` with the `Operation`'s unique `status.operationId`. + +#### Scenario: Multi-application Operation creates matching AppDeployments + +- **WHEN** an `Operation` is created with two `ApplicationSpec` entries named `app-a` and `app-b` +- **THEN** the controller creates two `AppDeployment` resources owned by the `Operation`, each carrying the parent `operationId` in `spec.opId`, and the `Operation` reports `status.phase=Reconciling` + +### Requirement: Operation status aggregates child AppDeployment phases + +The controller SHALL set `status.phase=Reconciled` on an `Operation` only when every owned `AppDeployment` reports `status.phase=Ready`, and SHALL keep `status.phase=Reconciling` otherwise. + +#### Scenario: Operation becomes Reconciled when all children are Ready + +- **WHEN** all `AppDeployment`s owned by an `Operation` report `status.phase=Ready` +- **THEN** the `Operation` reports `status.phase=Reconciled` + +#### Scenario: One pending child keeps the Operation reconciling + +- **WHEN** at least one owned `AppDeployment` is not yet `Ready` +- **THEN** the `Operation` continues to report `status.phase=Reconciling` + +### Requirement: Operation acquisition is recorded via annotation + +When an `Operation` is acquired from a cache by a `Requirement`, the controller SHALL stamp the annotation `operation.controller.azure.com/acquired` on the `Operation` with an RFC3339 timestamp and SHALL transfer the `ownerReference` from the `Cache` to the acquiring `Requirement`. + +#### Scenario: Acquired Operation carries the timestamp annotation + +- **WHEN** a `Requirement` acquires a cached `Operation` +- **THEN** the `Operation` has annotation `operation.controller.azure.com/acquired` set to the acquisition time and its sole controller `ownerReference` points to the `Requirement` + +### Requirement: Operation uses a finalizer to record deletion lifecycle + +The controller SHALL add the finalizer `finalizer.operation.controller.azure.com` to every `Operation`, transition a deleting `Operation` through `status.phase=Deleting` and `status.phase=Deleted`, and then remove the finalizer. Deletion of owned `AppDeployment`s is delegated to Kubernetes `ownerReferences` and the `AppDeployment` controller; the `Operation` controller does not wait for every child to report `Deleted`. + +#### Scenario: Operation deletion records Deleting and Deleted phases + +- **WHEN** an `Operation` is deleted +- **THEN** the `Operation` enters `status.phase=Deleting`, then `status.phase=Deleted`, after which the controller removes `finalizer.operation.controller.azure.com` and the API server may remove the `Operation` + +### Requirement: Operation IDs are unique within the cluster + +The controller SHALL assign every `Operation` a `status.operationId` value that is unique across all `Operation` resources in the cluster. + +#### Scenario: Two independently created Operations get distinct IDs + +- **WHEN** two `Operation` CRs are created in any namespaces +- **THEN** their `status.operationId` values differ diff --git a/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/requirement-management/spec.md b/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/requirement-management/spec.md new file mode 100644 index 0000000..921c3bf --- /dev/null +++ b/openspec/changes/archive/2026-05-13-baseline-existing-system/specs/requirement-management/spec.md @@ -0,0 +1,43 @@ +## ADDED Requirements + +### Requirement: Requirement CR creates an Operation when caching is disabled + +The controller SHALL, for every `Requirement` whose `spec.enableCache` is `false`, create one owned `Operation` named after the `Requirement` whose `spec` is a copy of `requirement.spec.template`, set `status.operationName` and `status.operationId` accordingly, and transition `status.phase` from `""` → `Operating` → `Ready` as the child `Operation` becomes `Reconciled`. + +#### Scenario: Cache-disabled requirement provisions a fresh Operation + +- **WHEN** a `Requirement` is created with `spec.enableCache=false` and a valid `spec.template` +- **THEN** the controller records the `Requirement` name in `status.operationName`, creates an `Operation` owned by the `Requirement`, and reports `status.phase=Operating` +- **AND** when the child `Operation` reaches `status.phase=Reconciled`, the `Requirement` reports `status.phase=Ready` with `status.operationId` copied from the child `Operation` + +### Requirement: Requirement CR consults the cache when caching is enabled + +The controller SHALL, for every `Requirement` whose `spec.enableCache` is `true`, derive a deterministic cache key from `spec.template`, look up a `Cache` CR by that key, and either acquire an existing cached `Operation` (cache hit) or create a new `Operation` (cache miss). + +#### Scenario: Cache hit acquires a pre-provisioned Operation + +- **WHEN** a `Requirement` with `enableCache=true` is reconciled and a matching `Cache` CR exists with at least one available cached `Operation` +- **THEN** the controller records one cached `Operation` name in `status.operationName`, transfers ownership of that `Operation` to the `Requirement`, stamps the annotation `operation.controller.azure.com/acquired` on it with the acquisition timestamp, sets `status.phase=Ready`, and records condition `OperationReady=True` with reason `CacheHit` + +#### Scenario: Cache miss creates a new Operation and a Cache CR if missing + +- **WHEN** a `Requirement` with `enableCache=true` is reconciled and either the `Cache` CR is absent or has no available cached `Operation` +- **THEN** the controller ensures a `Cache` CR exists for the derived cache key (creating it from `spec.template` if necessary), creates a new owned `Operation` named after the `Requirement`, and records cache-miss status with reason `CacheMiss` when an existing `Cache` has no available entries + +### Requirement: Requirement CR does not run finalizer cleanup + +The controller SHALL NOT add the declared finalizer `finalizer.requirement.devinfra.goms.io` during `Requirement` reconciliation; cleanup of owned `Operation` resources relies on Kubernetes `ownerReferences` cascade deletion. + +#### Scenario: Deletion is not blocked by a Requirement finalizer + +- **WHEN** a user deletes a `Requirement` that owns an `Operation` +- **THEN** the controller does not add a `Requirement` finalizer or drive a `Deleting` phase, and Kubernetes garbage collection removes owned `Operation` resources via `ownerReferences` + +### Requirement: Requirement reconciliation is periodically requeued + +The controller SHALL requeue every `Requirement` for re-reconciliation at most every 10 minutes even when no watch event fires, so that `spec.expireAt` and cache state changes are eventually observed. + +#### Scenario: Idle requirement is re-checked + +- **WHEN** a `Requirement` is in `status.phase=Ready` and no events occur +- **THEN** the controller re-evaluates it within 10 minutes diff --git a/openspec/changes/archive/2026-05-13-baseline-existing-system/tasks.md b/openspec/changes/archive/2026-05-13-baseline-existing-system/tasks.md new file mode 100644 index 0000000..84a1c03 --- /dev/null +++ b/openspec/changes/archive/2026-05-13-baseline-existing-system/tasks.md @@ -0,0 +1,29 @@ +## 1. Verify spec accuracy against existing code + +- [x] 1.1 Cross-check `requirement-management` scenarios against `internal/controller/requirement_controller.go` and `internal/handler/requirement.go`; note any divergence in PR review thread +- [x] 1.2 Cross-check `operation-orchestration` scenarios against `internal/controller/operation_controller.go` and `internal/handler/operation.go` +- [x] 1.3 Cross-check `appdeployment-execution` scenarios against `internal/controller/appdeployment_controller.go` and `internal/handler/appdeployment.go` +- [x] 1.4 Cross-check `cache-pool` scenarios against `internal/controller/cache_controller.go` and `internal/handler/cache.go` +- [x] 1.5 Confirm constant names quoted in specs (finalizers, annotation keys, label keys) match `api/v1alpha1/*_types.go` exactly + +## 2. Validate OpenSpec artifacts + +- [x] 2.1 Run `openspec status --change baseline-existing-system` and confirm all four artifacts are `done` +- [x] 2.2 Run `openspec validate --change baseline-existing-system` (or equivalent) and resolve any schema errors +- [x] 2.3 Verify each spec file lives under `openspec/changes/baseline-existing-system/specs//spec.md` + +## 3. Reconcile with existing tests + +- [x] 3.1 For each requirement, identify at least one existing Ginkgo test in `internal/controller/*_test.go` or `internal/handler/*_test.go` that exercises it; note coverage gaps as follow-up issues (do NOT add tests in this change) +- [x] 3.2 Record any requirement that has no covering test as an open question for the next change + +## 4. Project documentation hookup + +- [x] 4.1 Add a one-paragraph "Specifications" section to `README.md` (or `doc/arch/`) pointing at `openspec/specs/` and explaining that future PRs must ship deltas +- [x] 4.2 Optionally seed `openspec/config.yaml` `context:` with the controller's tech-stack summary (kubebuilder v4, Go 1.26, controller-runtime, Ginkgo) so future LLM runs have project context + +## 5. Archive + +- [x] 5.1 Run `/opsx:verify` and address any findings +- [x] 5.2 Run `/opsx:archive` to promote the four capability specs from `openspec/changes/baseline-existing-system/specs/` into `openspec/specs/` +- [x] 5.3 Confirm `openspec/specs/{requirement-management,operation-orchestration,appdeployment-execution,cache-pool}/spec.md` exist after archive diff --git a/openspec/changes/document-cache-mode-annotation/.openspec.yaml b/openspec/changes/document-cache-mode-annotation/.openspec.yaml new file mode 100644 index 0000000..66dd08a --- /dev/null +++ b/openspec/changes/document-cache-mode-annotation/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-05-14 diff --git a/openspec/changes/document-cache-mode-annotation/design.md b/openspec/changes/document-cache-mode-annotation/design.md new file mode 100644 index 0000000..fba0c48 --- /dev/null +++ b/openspec/changes/document-cache-mode-annotation/design.md @@ -0,0 +1,47 @@ +## Context + +`CacheHandler.initOperationFromCache` stamps `operation-cache-controller.azure.github.com/cache-mode=true` on every `Operation` it creates from `Cache.spec.operationTemplate`. The baseline `cache-pool` spec already documents cache-owned Operation creation and the cache-key label, but it does not mention this annotation. + +Because annotations are visible to cluster operators and automation, this change decides whether the emitted cache-mode marker is part of the controller contract. + +## Goals / Non-Goals + +**Goals:** +- Document the cache-mode annotation as observable metadata on cache-created `Operation`s. +- Add tests that fail if cached Operation creation drops or changes the annotation. +- Keep the annotation scoped to Operations created by the Cache controller. + +**Non-Goals:** +- No CRD schema changes. +- No change to cache hit acquisition, ownerReference transfer, or `status.availableCaches` behavior. +- No requirement that acquired Operations keep or remove the annotation after ownership transfer unless a later change specifies that lifecycle. + +## Decisions + +### Decision: Treat cache-mode as an operator-facing marker on cache-created Operations + +The delta updates `cache-pool` because the annotation is created by Cache reconciliation and identifies Operations that originated from a cache pool. The requirement is limited to creation time, matching current behavior without inventing lifecycle semantics for acquired Operations. + +Alternative considered: omit the annotation from specs as internal. Rejected for this proposal because operators can already observe and select on annotations, and the baseline policy documents observable constants that carry integration value. + +### Decision: Keep the value as the existing string literal `true` + +The spec names both the key and value so future refactors cannot silently change the contract. This follows the baseline approach for finalizers, labels, and acquisition annotations. + +Alternative considered: specify only that the annotation exists. Rejected because a marker annotation with changing values is harder for operators to consume consistently. + +## Risks / Trade-offs + +- **Risk:** Maintainers may decide cache-mode should be an internal implementation detail. -> **Mitigation:** Resolve before apply; if internal, change this proposal to remove the annotation from code instead of documenting it. +- **Risk:** Documenting the annotation freezes a currently undocumented constant. -> **Mitigation:** The scope is narrow: only `Operation`s created by `Cache` reconciliation must carry it at creation. +- **Risk:** Existing tests use mocks and may not inspect created Operation metadata. -> **Mitigation:** Capture the `Create` argument in the cache handler test or add a fake-client controller test. + +## Migration Plan + +1. Add a `cache-pool` spec delta for the annotation on cached Operation creation. +2. Add focused coverage around `CacheHandler.AdjustCache` or `initOperationFromCache` proving the annotation key and value. +3. Run the existing Go test suite and OpenSpec validation. + +## Open Questions + +- Should acquired cached Operations keep `operation-cache-controller.azure.github.com/cache-mode=true` after the ownerReference transfers to a Requirement, or should a later acquisition lifecycle delta clear it? diff --git a/openspec/changes/document-cache-mode-annotation/proposal.md b/openspec/changes/document-cache-mode-annotation/proposal.md new file mode 100644 index 0000000..d129b9f --- /dev/null +++ b/openspec/changes/document-cache-mode-annotation/proposal.md @@ -0,0 +1,27 @@ +## Why + +`operation-cache-controller.azure.github.com/cache-mode` is written to cached `Operation`s, but the baseline specs do not say whether cluster operators may rely on it. This change clarifies that observable annotation contract before future cache changes drift around an undocumented behavior. + +## What Changes + +- Document the cache-mode annotation on `Operation`s created for a `Cache`. +- Specify the expected value and when the annotation is applied. +- Add focused coverage that cached `Operation`s carry the annotation. +- Do not change cache acquisition, cache key derivation, or pool sizing behavior. + +## Capabilities + +### New Capabilities + + + +### Modified Capabilities + +- `cache-pool`: Add the externally observable cache-mode annotation to the cached `Operation` creation contract. + +## Impact + +- Affected code: `internal/handler/cache.go`, `internal/utils/controller/const.go`, and related cache tests. +- Affected APIs: no CRD schema changes; this documents an annotation already emitted on child `Operation`s. +- Affected systems: dashboards, scripts, or GitOps checks can treat the annotation as an intentional signal after this lands. +- Risk: low; the annotation already exists, so the main risk is documenting a behavior that maintainers later decide should remain internal. diff --git a/openspec/changes/document-cache-mode-annotation/specs/cache-pool/spec.md b/openspec/changes/document-cache-mode-annotation/specs/cache-pool/spec.md new file mode 100644 index 0000000..d4732d5 --- /dev/null +++ b/openspec/changes/document-cache-mode-annotation/specs/cache-pool/spec.md @@ -0,0 +1,16 @@ +## MODIFIED Requirements + +### Requirement: Cache maintains a pool of pre-provisioned Operations + +The controller SHALL, for every `Cache`, set `status.keepAlive` to the controller's fixed keep-alive count, list owned `Operation`s, publish the names of owned `Operation`s with `status.phase=Reconciled` in `status.availableCaches`, and create additional owned `Operation`s from `spec.operationTemplate` only when the total owned `Operation` count is below `status.keepAlive`. Every `Operation` created by the `Cache` controller for the pool SHALL carry the annotation `operation-cache-controller.azure.github.com/cache-mode=true`. + +#### Scenario: Empty Cache provisions Operations to fill the pool + +- **WHEN** a `Cache` is created with `spec.operationTemplate` set and no owned `Operation`s exist +- **THEN** the controller creates owned `Operation`s from the template until the total owned `Operation` count reaches `status.keepAlive` +- **AND** after owned `Operation`s reach `status.phase=Reconciled`, `status.availableCaches` lists their names + +#### Scenario: Cached Operations carry cache-mode annotation + +- **WHEN** the controller creates an `Operation` to fill a `Cache` pool +- **THEN** the created `Operation` has annotation `operation-cache-controller.azure.github.com/cache-mode` set to `true` diff --git a/openspec/changes/document-cache-mode-annotation/tasks.md b/openspec/changes/document-cache-mode-annotation/tasks.md new file mode 100644 index 0000000..82dddd7 --- /dev/null +++ b/openspec/changes/document-cache-mode-annotation/tasks.md @@ -0,0 +1,16 @@ +## 1. Contract Confirmation + +- [ ] 1.1 Confirm maintainers want `operation-cache-controller.azure.github.com/cache-mode=true` to be an operator-facing contract +- [ ] 1.2 If the annotation should remain internal, revise this change before implementation to remove or stop relying on the emitted annotation + +## 2. Test Coverage + +- [ ] 2.1 Add cache handler coverage that captures created `Operation`s and verifies the cache-mode annotation key and value +- [ ] 2.2 Keep existing cache-key label assertions passing while adding the annotation assertion +- [ ] 2.3 Add or update coverage so the test fails if cached Operation creation drops annotations while preserving labels and ownerReferences + +## 3. Validation + +- [ ] 3.1 Run `go test ./...` +- [ ] 3.2 Run `openspec validate --change document-cache-mode-annotation` +- [ ] 3.3 Review the `cache-pool` delta for consistency with the existing cache-key label requirement diff --git a/openspec/changes/guarantee-operation-id-uniqueness/.openspec.yaml b/openspec/changes/guarantee-operation-id-uniqueness/.openspec.yaml new file mode 100644 index 0000000..66dd08a --- /dev/null +++ b/openspec/changes/guarantee-operation-id-uniqueness/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-05-14 diff --git a/openspec/changes/guarantee-operation-id-uniqueness/design.md b/openspec/changes/guarantee-operation-id-uniqueness/design.md new file mode 100644 index 0000000..e38e5cb --- /dev/null +++ b/openspec/changes/guarantee-operation-id-uniqueness/design.md @@ -0,0 +1,54 @@ +## Context + +`OperationHandler.EnsureAllAppsAreReady` initializes an empty `Operation` by clearing conditions and assigning `status.operationId` from `OperationHelper.NewOperationId()`. That helper returns a random UUID string with hyphens removed. The baseline `operation-orchestration` spec now treats the ID as cluster-unique, but random generation only makes collision probability small. + +The ID is externally observable through `Operation.status.operationId`, copied into `Requirement.status.operationId`, and stamped into child `AppDeployment.spec.opId` and generated resource names. + +## Goals / Non-Goals + +**Goals:** +- Make `status.operationId` strictly unique across `Operation` resources before child resources consume it. +- Keep the current string shape compatible with generated names where practical. +- Add tests that prove collision handling or deterministic uniqueness. + +**Non-Goals:** +- No CRD schema changes. +- No migration of existing non-empty `status.operationId` values. +- No change to cache acquisition, AppDeployment fan-out semantics, or Requirement status copying beyond using the guaranteed ID. + +## Decisions + +### Decision: Prefer Kubernetes UID-derived IDs for new Operations + +Use the `Operation` object's Kubernetes `metadata.uid` as the uniqueness source and normalize it to the existing 32-character lowercase UUID-without-hyphens format. Kubernetes assigns UIDs at object creation and uses them as the stable identity for namespaced resources, which avoids the race inherent in list-then-generate checks under concurrent reconciles. + +Alternative considered: keep random IDs and list all `Operation`s before status update. Rejected as the primary mechanism because two reconcilers can list the same pre-update state and still publish the same candidate. A list check is useful as a defensive assertion, but it is not a strict uniqueness primitive by itself. + +### Decision: Preserve existing non-empty IDs + +Only initialize `status.operationId` when it is empty. Existing Operations with a populated ID continue through reconciliation unchanged so this change does not rename already-created AppDeployments or invalidate Requirement status references. + +Alternative considered: rewrite all IDs to UID-derived values. Rejected because it would churn child names and externally visible status for live resources. + +### Decision: Keep the random helper available only for non-contract use or remove it after refactor + +If no production path needs `OperationHelper.NewOperationId()` after UID-derived assignment, remove it and its narrow test. If the helper remains, rename or document it so callers do not mistake it for a uniqueness guarantee. + +Alternative considered: leave `NewOperationId()` unchanged and wrap it with collision retries. Rejected unless UID-derived IDs prove incompatible, because retry logic adds code while still needing concurrency-safe proof. + +## Risks / Trade-offs + +- **Risk:** Some consumer expects operation IDs to be random rather than UID-derived. -> **Mitigation:** The API only exposes the value as an opaque ID; keep the same normalized UUID format. +- **Risk:** Fake-client or unit-test objects may lack `metadata.uid`. -> **Mitigation:** Test helper setup should assign UIDs for initialization cases; production reconcilers should treat an empty UID as a transient error and requeue instead of publishing a fallback random ID. +- **Risk:** Existing duplicate IDs could remain if they were already published before this change. -> **Mitigation:** This change guarantees new assignments only; add a defensive test/documentation note rather than mutating live IDs. + +## Migration Plan + +1. Refactor operation ID initialization to derive from `Operation.UID` when `status.operationId` is empty. +2. Ensure status is updated only after a non-empty, normalized ID is available. +3. Add handler/controller tests for two Operations in different namespaces receiving distinct IDs and for preserving existing IDs. +4. Run the existing Go test suite and OpenSpec validation. + +## Open Questions + +- Should the implementation also list existing Operations and emit an event if it detects a legacy duplicate ID? That would aid diagnosis but is not required for strict uniqueness of new assignments. diff --git a/openspec/changes/guarantee-operation-id-uniqueness/proposal.md b/openspec/changes/guarantee-operation-id-uniqueness/proposal.md new file mode 100644 index 0000000..6075e24 --- /dev/null +++ b/openspec/changes/guarantee-operation-id-uniqueness/proposal.md @@ -0,0 +1,27 @@ +## Why + +The baseline spec says `Operation.status.operationId` is unique across the cluster, but the current implementation relies only on random ID generation. That makes collisions unlikely, not impossible, so the implementation should enforce the contract before the spec is treated as strict. + +## What Changes + +- Add a cluster-wide uniqueness check before assigning a generated `status.operationId`. +- Regenerate candidate IDs when a collision with an existing `Operation` is found. +- Preserve the existing operation ID format unless implementation proves it must change. +- Add tests that force a collision and verify the controller resolves it before publishing status. + +## Capabilities + +### New Capabilities + + + +### Modified Capabilities + +- `operation-orchestration`: Strengthen the existing operation ID requirement so uniqueness is actively enforced, not only probabilistically expected. + +## Impact + +- Affected code: `internal/handler/operation.go`, `internal/utils/controller/operation_helper.go`, and related tests. +- Affected APIs: no CRD schema changes; `status.operationId` remains the externally visible field. +- Affected systems: controller-runtime client behavior may include an extra `OperationList` or indexed lookup during ID assignment. +- Risk: low to moderate; retry logic must avoid publishing an ID until a non-conflicting value is found. diff --git a/openspec/changes/guarantee-operation-id-uniqueness/specs/operation-orchestration/spec.md b/openspec/changes/guarantee-operation-id-uniqueness/specs/operation-orchestration/spec.md new file mode 100644 index 0000000..eb6d81b --- /dev/null +++ b/openspec/changes/guarantee-operation-id-uniqueness/specs/operation-orchestration/spec.md @@ -0,0 +1,21 @@ +## MODIFIED Requirements + +### Requirement: Operation IDs are unique within the cluster + +The controller SHALL assign every `Operation` a non-empty `status.operationId` value that is unique across all `Operation` resources in the cluster. For newly initialized `Operation`s, the controller SHALL derive or validate the ID before publishing status so two `Operation`s cannot receive the same `status.operationId` even when reconciled concurrently. + +#### Scenario: Two independently created Operations get distinct IDs + +- **WHEN** two `Operation` CRs are created in any namespaces +- **THEN** their `status.operationId` values differ + +#### Scenario: Concurrent initialization does not publish duplicate IDs + +- **WHEN** two empty `Operation` CRs are reconciled at the same time +- **THEN** each `Operation` publishes a non-empty `status.operationId` +- **AND** no `status.operationId` value is shared by both `Operation`s + +#### Scenario: Existing Operation ID is preserved + +- **WHEN** an `Operation` already has a non-empty `status.operationId` +- **THEN** reconciliation does not replace that value while the `Operation` continues normal fan-out and status aggregation diff --git a/openspec/changes/guarantee-operation-id-uniqueness/tasks.md b/openspec/changes/guarantee-operation-id-uniqueness/tasks.md new file mode 100644 index 0000000..449be0c --- /dev/null +++ b/openspec/changes/guarantee-operation-id-uniqueness/tasks.md @@ -0,0 +1,20 @@ +## 1. Operation ID Assignment + +- [ ] 1.1 Add or refactor an operation ID helper that normalizes `Operation.metadata.uid` into the existing lowercase UUID-without-hyphens format +- [ ] 1.2 Update `OperationHandler.EnsureAllAppsAreReady` so empty `status.operationId` values are initialized from the Operation UID before AppDeployments are created +- [ ] 1.3 Ensure reconciliation requeues without publishing status if an initializing Operation has an empty UID +- [ ] 1.4 Preserve non-empty existing `status.operationId` values during later reconciles +- [ ] 1.5 Remove or rename `OperationHelper.NewOperationId()` if it is no longer used for contract-level ID assignment + +## 2. Tests + +- [ ] 2.1 Add helper coverage for UID normalization and empty UID handling +- [ ] 2.2 Add handler or controller coverage proving two Operations in different namespaces receive distinct non-empty `status.operationId` values +- [ ] 2.3 Add coverage proving an existing non-empty `status.operationId` is not rewritten +- [ ] 2.4 Add coverage that child AppDeployments still receive the parent `status.operationId` in `spec.opId` + +## 3. Validation + +- [ ] 3.1 Run `go test ./...` +- [ ] 3.2 Run `openspec validate --change guarantee-operation-id-uniqueness` +- [ ] 3.3 Review the archived baseline requirement and this delta for wording consistency before applying diff --git a/openspec/config.yaml b/openspec/config.yaml new file mode 100644 index 0000000..d835b48 --- /dev/null +++ b/openspec/config.yaml @@ -0,0 +1,19 @@ +schema: spec-driven + +# Project context (optional) +# This is shown to AI when creating artifacts. +# Add your tech stack, conventions, style guides, domain knowledge, etc. +context: | + Tech stack: Go 1.26, kubebuilder v4, controller-runtime, Ginkgo/Gomega, gomock. + Domain: Kubernetes operator for Requirement, Operation, AppDeployment, and Cache CRDs in controller.azure.github.com/v1alpha1. + Conventions: document externally observable CRD behavior in OpenSpec; future reconciler or API behavior changes should ship spec deltas. + +# Per-artifact rules (optional) +# Add custom rules for specific artifacts. +# Example: +# rules: +# proposal: +# - Keep proposals under 500 words +# - Always include a "Non-goals" section +# tasks: +# - Break tasks into chunks of max 2 hours diff --git a/openspec/specs/appdeployment-execution/spec.md b/openspec/specs/appdeployment-execution/spec.md new file mode 100644 index 0000000..f1f654b --- /dev/null +++ b/openspec/specs/appdeployment-execution/spec.md @@ -0,0 +1,46 @@ +# appdeployment-execution Specification + +## Purpose +Specify AppDeployment reconciliation for dependency-gated provision jobs, teardown jobs, deletion finalizers, and Job ownership. +## Requirements +### Requirement: AppDeployment runs the provision Job on creation + +The controller SHALL, when an `AppDeployment` is created, launch a Kubernetes `Job` derived from `spec.provision` owned by the `AppDeployment`, transition `status.phase` from `""` → `Pending` → `Deploying`, and set `status.phase=Ready` when the provision `Job` reports `Complete=True`. + +#### Scenario: Successful provision Job promotes AppDeployment to Ready + +- **WHEN** an `AppDeployment` is created and its provision `Job` succeeds +- **THEN** `status.phase` of the `AppDeployment` becomes `Ready` + +#### Scenario: Failed provision Job is retried and keeps AppDeployment out of Ready + +- **WHEN** the provision `Job` reports `Failed=True` +- **THEN** the controller deletes the failed provision `Job`, creates a replacement provision `Job`, and keeps the `AppDeployment` out of `Ready` until a provision `Job` succeeds + +### Requirement: AppDeployment respects declared dependencies + +The controller SHALL NOT launch the provision `Job` for an `AppDeployment` whose `spec.dependencies` include sibling app names until every dependency `AppDeployment` (sharing the same parent `Operation` via `spec.opId`) has reached `status.phase=Ready`. + +#### Scenario: Dependent app waits for its dependency + +- **WHEN** `AppDeployment` `app-b` declares `spec.dependencies=["app-a"]` and `app-a` is not yet `Ready` +- **THEN** `app-b` remains in `status.phase=Pending` and no provision `Job` is created for it +- **AND** once `app-a` reaches `status.phase=Ready`, the controller launches `app-b`'s provision `Job` + +### Requirement: AppDeployment runs the teardown Job on deletion + +The controller SHALL add the finalizer `finalizer.appdeployment.devinfra.goms.io` to every `AppDeployment`, and on deletion SHALL launch a `Job` derived from `spec.teardown` owned by the `AppDeployment`, transition `status.phase` to `Deleting`, and remove the finalizer after `status.phase=Deleted`. The controller sets `status.phase=Deleted` when the teardown `Job` succeeds, or after it observes and deletes a failed teardown `Job` while emitting a warning event. + +#### Scenario: Teardown Job runs before AppDeployment is removed + +- **WHEN** an `AppDeployment` in `status.phase=Ready` is deleted +- **THEN** a teardown `Job` is created, the `AppDeployment` reports `status.phase=Deleting` until the teardown attempt completes or fails, then reports `status.phase=Deleted` and has its finalizer removed + +### Requirement: AppDeployment owns its Jobs for cascade deletion + +The controller SHALL set the `AppDeployment` as the controller `ownerReference` of every provision and teardown `Job` it creates, so deleting the `AppDeployment` causes Kubernetes garbage collection of orphaned `Job`s. + +#### Scenario: Deleting AppDeployment removes its Jobs + +- **WHEN** an `AppDeployment` and its provision `Job` exist, and the `AppDeployment` is deleted +- **THEN** the provision `Job` is garbage-collected by Kubernetes via `ownerReferences` diff --git a/openspec/specs/cache-pool/spec.md b/openspec/specs/cache-pool/spec.md new file mode 100644 index 0000000..3fc2fd4 --- /dev/null +++ b/openspec/specs/cache-pool/spec.md @@ -0,0 +1,59 @@ +# cache-pool Specification + +## Purpose +Specify Cache reconciliation for deterministic cache keys, fixed keep-alive pools, available cached Operations, expiry, and no-finalizer deletion. +## Requirements +### Requirement: Cache maintains a pool of pre-provisioned Operations + +The controller SHALL, for every `Cache`, set `status.keepAlive` to the controller's fixed keep-alive count, list owned `Operation`s, publish the names of owned `Operation`s with `status.phase=Reconciled` in `status.availableCaches`, and create additional owned `Operation`s from `spec.operationTemplate` only when the total owned `Operation` count is below `status.keepAlive`. + +#### Scenario: Empty Cache provisions Operations to fill the pool + +- **WHEN** a `Cache` is created with `spec.operationTemplate` set and no owned `Operation`s exist +- **THEN** the controller creates owned `Operation`s from the template until the total owned `Operation` count reaches `status.keepAlive` +- **AND** after owned `Operation`s reach `status.phase=Reconciled`, `status.availableCaches` lists their names + +### Requirement: Cache CR is keyed by a deterministic cache key + +The controller SHALL compute `status.cacheKey` deterministically from `spec.operationTemplate` and SHALL apply the label `operation-cache-controller.azure.github.com/cache-key=` to every `Operation` it creates for the cache, truncating the label value to the Kubernetes label-value limit when required. The controller SHALL NOT add this cache-key label to the `Cache` resource itself. + +#### Scenario: Two Caches with identical templates compute the same key + +- **WHEN** two `Cache` CRs with byte-identical `spec.operationTemplate` are created +- **THEN** both report the same `status.cacheKey`, and cached `Operation`s created for either `Cache` carry the matching `cache-key` label value + +### Requirement: Cache surfaces available entries in status + +The controller SHALL keep `status.availableCaches` synchronized with the names of owned `Operation`s that are `Reconciled`. When a cached `Operation` is acquired, ownership transfers away from the `Cache`, so the next cache reconcile SHALL omit that `Operation` from `status.availableCaches`. + +#### Scenario: Acquired cached Operation disappears from status + +- **WHEN** a `Requirement` acquires a cached `Operation` +- **THEN** that `Operation`'s name is removed from the `Cache`'s `status.availableCaches` within one reconcile cycle + +### Requirement: Cache expires entries past spec.expireTime + +The controller SHALL delete the `Cache` CR once the wall-clock time exceeds `Cache.spec.expireTime` (when set) and SHALL stop processing additional cache-pool operations in that reconcile. Cleanup of owned cached `Operation`s relies on Kubernetes `ownerReferences` cascade deletion. + +#### Scenario: Past-expiry Cache is deleted + +- **WHEN** the current time is after `Cache.spec.expireTime` +- **THEN** the controller deletes the `Cache` CR and does not create replacement cached `Operation`s in that reconcile + +### Requirement: Cache reconciliation runs at least every 60 seconds + +The controller SHALL re-reconcile every `Cache` CR within 60 seconds even with no watch events, so pool depletion and expiry are eventually observed. + +#### Scenario: Idle Cache is re-evaluated + +- **WHEN** a `Cache` exists and no events fire against it +- **THEN** the controller re-reconciles it within 60 seconds + +### Requirement: Cache does not use a finalizer + +The controller SHALL NOT register a finalizer on `Cache` CRs; cleanup of owned `Operation`s relies on Kubernetes `ownerReferences` cascade deletion alone. + +#### Scenario: Deleting a Cache cascades via owner references + +- **WHEN** a `Cache` with owned cached `Operation`s is deleted +- **THEN** Kubernetes garbage-collects those `Operation`s via `ownerReferences` without any finalizer interaction diff --git a/openspec/specs/operation-orchestration/spec.md b/openspec/specs/operation-orchestration/spec.md new file mode 100644 index 0000000..9a9c45a --- /dev/null +++ b/openspec/specs/operation-orchestration/spec.md @@ -0,0 +1,54 @@ +# operation-orchestration Specification + +## Purpose +Specify Operation reconciliation for AppDeployment fan-out, status aggregation, cache acquisition metadata, unique operation IDs, and deletion lifecycle. +## Requirements +### Requirement: Operation fans out to one AppDeployment per application + +The controller SHALL, for every `Operation`, create exactly one owned `AppDeployment` per entry in `spec.applications`, copying the application's `provision`, `teardown`, and `dependencies` fields and stamping `spec.opId` with the `Operation`'s unique `status.operationId`. + +#### Scenario: Multi-application Operation creates matching AppDeployments + +- **WHEN** an `Operation` is created with two `ApplicationSpec` entries named `app-a` and `app-b` +- **THEN** the controller creates two `AppDeployment` resources owned by the `Operation`, each carrying the parent `operationId` in `spec.opId`, and the `Operation` reports `status.phase=Reconciling` + +### Requirement: Operation status aggregates child AppDeployment phases + +The controller SHALL set `status.phase=Reconciled` on an `Operation` only when every owned `AppDeployment` reports `status.phase=Ready`, and SHALL keep `status.phase=Reconciling` otherwise. + +#### Scenario: Operation becomes Reconciled when all children are Ready + +- **WHEN** all `AppDeployment`s owned by an `Operation` report `status.phase=Ready` +- **THEN** the `Operation` reports `status.phase=Reconciled` + +#### Scenario: One pending child keeps the Operation reconciling + +- **WHEN** at least one owned `AppDeployment` is not yet `Ready` +- **THEN** the `Operation` continues to report `status.phase=Reconciling` + +### Requirement: Operation acquisition is recorded via annotation + +When an `Operation` is acquired from a cache by a `Requirement`, the controller SHALL stamp the annotation `operation.controller.azure.com/acquired` on the `Operation` with an RFC3339 timestamp and SHALL transfer the `ownerReference` from the `Cache` to the acquiring `Requirement`. + +#### Scenario: Acquired Operation carries the timestamp annotation + +- **WHEN** a `Requirement` acquires a cached `Operation` +- **THEN** the `Operation` has annotation `operation.controller.azure.com/acquired` set to the acquisition time and its sole controller `ownerReference` points to the `Requirement` + +### Requirement: Operation uses a finalizer to record deletion lifecycle + +The controller SHALL add the finalizer `finalizer.operation.controller.azure.com` to every `Operation`, transition a deleting `Operation` through `status.phase=Deleting` and `status.phase=Deleted`, and then remove the finalizer. Deletion of owned `AppDeployment`s is delegated to Kubernetes `ownerReferences` and the `AppDeployment` controller; the `Operation` controller does not wait for every child to report `Deleted`. + +#### Scenario: Operation deletion records Deleting and Deleted phases + +- **WHEN** an `Operation` is deleted +- **THEN** the `Operation` enters `status.phase=Deleting`, then `status.phase=Deleted`, after which the controller removes `finalizer.operation.controller.azure.com` and the API server may remove the `Operation` + +### Requirement: Operation IDs are unique within the cluster + +The controller SHALL assign every `Operation` a `status.operationId` value that is unique across all `Operation` resources in the cluster. + +#### Scenario: Two independently created Operations get distinct IDs + +- **WHEN** two `Operation` CRs are created in any namespaces +- **THEN** their `status.operationId` values differ diff --git a/openspec/specs/requirement-management/spec.md b/openspec/specs/requirement-management/spec.md new file mode 100644 index 0000000..793a967 --- /dev/null +++ b/openspec/specs/requirement-management/spec.md @@ -0,0 +1,46 @@ +# requirement-management Specification + +## Purpose +Specify Requirement reconciliation for fresh Operation creation, cache lookup and acquisition, cache misses, deletion behavior, and periodic rechecks. +## Requirements +### Requirement: Requirement CR creates an Operation when caching is disabled + +The controller SHALL, for every `Requirement` whose `spec.enableCache` is `false`, create one owned `Operation` named after the `Requirement` whose `spec` is a copy of `requirement.spec.template`, set `status.operationName` and `status.operationId` accordingly, and transition `status.phase` from `""` → `Operating` → `Ready` as the child `Operation` becomes `Reconciled`. + +#### Scenario: Cache-disabled requirement provisions a fresh Operation + +- **WHEN** a `Requirement` is created with `spec.enableCache=false` and a valid `spec.template` +- **THEN** the controller records the `Requirement` name in `status.operationName`, creates an `Operation` owned by the `Requirement`, and reports `status.phase=Operating` +- **AND** when the child `Operation` reaches `status.phase=Reconciled`, the `Requirement` reports `status.phase=Ready` with `status.operationId` copied from the child `Operation` + +### Requirement: Requirement CR consults the cache when caching is enabled + +The controller SHALL, for every `Requirement` whose `spec.enableCache` is `true`, derive a deterministic cache key from `spec.template`, look up a `Cache` CR by that key, and either acquire an existing cached `Operation` (cache hit) or create a new `Operation` (cache miss). + +#### Scenario: Cache hit acquires a pre-provisioned Operation + +- **WHEN** a `Requirement` with `enableCache=true` is reconciled and a matching `Cache` CR exists with at least one available cached `Operation` +- **THEN** the controller records one cached `Operation` name in `status.operationName`, transfers ownership of that `Operation` to the `Requirement`, stamps the annotation `operation.controller.azure.com/acquired` on it with the acquisition timestamp, sets `status.phase=Ready`, and records condition `OperationReady=True` with reason `CacheHit` + +#### Scenario: Cache miss creates a new Operation and a Cache CR if missing + +- **WHEN** a `Requirement` with `enableCache=true` is reconciled and either the `Cache` CR is absent or has no available cached `Operation` +- **THEN** the controller ensures a `Cache` CR exists for the derived cache key (creating it from `spec.template` if necessary), creates a new owned `Operation` named after the `Requirement`, and records cache-miss status with reason `CacheMiss` when an existing `Cache` has no available entries + +### Requirement: Requirement CR does not run finalizer cleanup + +The controller SHALL NOT add the declared finalizer `finalizer.requirement.devinfra.goms.io` during `Requirement` reconciliation; cleanup of owned `Operation` resources relies on Kubernetes `ownerReferences` cascade deletion. + +#### Scenario: Deletion is not blocked by a Requirement finalizer + +- **WHEN** a user deletes a `Requirement` that owns an `Operation` +- **THEN** the controller does not add a `Requirement` finalizer or drive a `Deleting` phase, and Kubernetes garbage collection removes owned `Operation` resources via `ownerReferences` + +### Requirement: Requirement reconciliation is periodically requeued + +The controller SHALL requeue every `Requirement` for re-reconciliation at most every 10 minutes even when no watch event fires, so that `spec.expireAt` and cache state changes are eventually observed. + +#### Scenario: Idle requirement is re-checked + +- **WHEN** a `Requirement` is in `status.phase=Ready` and no events occur +- **THEN** the controller re-evaluates it within 10 minutes