5 changes: 4 additions & 1 deletion README.md
@@ -5,5 +5,8 @@

A Kubernetes controller that manages operations and caches their outcomes

## Specifications

OpenSpec capability baselines live under `openspec/specs/`. Future PRs that change CRD fields or reconciler behavior must include an OpenSpec change with deltas against those specs so reviewers can compare the intended behavior before implementation details.

**Trademarks** This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-05-13
@@ -0,0 +1,57 @@
## Context

The Operation Cache Controller is a kubebuilder v4 operator already running in production. It manages four CRDs in `controller.azure.github.com/v1alpha1` (`Requirement`, `Operation`, `AppDeployment`, `Cache`) using a handler-based reconciliation pattern in `internal/controller/` and `internal/handler/`. There is no machine-readable spec, so this change introduces OpenSpec as the source of truth going forward without modifying any code paths.

Stakeholders: controller maintainers (need a baseline to diff against), reviewers (want spec-level PRs), new contributors (want to learn intent without reading 22 test files).

## Goals / Non-Goals

**Goals:**
- Capture the *current* externally observable behavior of each CRD as testable scenarios under `openspec/specs/<capability>/spec.md`.
- One capability spec per top-level CRD; nested concepts (Jobs, conditions, finalizers) appear as scenarios within the parent capability rather than separate capabilities.
- Use SHALL/MUST language so each requirement is normative and verifiable against existing Ginkgo tests.

**Non-Goals:**
- No code, CRD field, RBAC, or runtime behavior changes.
- No reorganization of `internal/` packages.
- No documentation of internal helper packages (`internal/utils/...`); these are implementation details.
- No webhook spec — webhooks are scaffolded but not implemented.

## Decisions

### Decision: One capability per top-level CRD

We map exactly one capability to each of `Requirement`, `Operation`, `AppDeployment`, `Cache`. Alternative considered: a single `cache-controller` capability covering everything. Rejected — it would force every future delta to touch one giant spec and erase the natural CRD-level review boundary the team already uses.

### Decision: Scenarios written in WHEN/THEN against observable cluster state

Each scenario describes what an external observer (kubectl / a watcher) sees, not what handler functions do internally. Alternative: describe internal handler invocations. Rejected: it would couple the specs to the implementation, so handlers could no longer be refactored without rewriting specs.

### Decision: Document constants (finalizers, annotations, label keys) as part of requirements

Names like `finalizer.operation.controller.azure.com` and `operation.controller.azure.com/acquired` are part of the *contract* with cluster operators and cannot be silently renamed. They appear inline in the relevant scenarios. Alternative: keep them only in code constants. Rejected — anyone integrating with the controller (dashboards, GitOps tooling) needs to discover these from the spec.

### Decision: Additive-only baseline (no `MODIFIED`/`REMOVED`/`RENAMED` deltas)

Because no prior specs exist, every requirement is listed under `## ADDED Requirements`, the only delta operation this change uses. Future changes will use `MODIFIED`/`REMOVED`/`RENAMED` against these baselines. This keeps the baseline change reviewable as additive-only.

## Risks / Trade-offs

- **Risk:** Spec drifts from code if it under-describes edge cases (e.g., race conditions on cache acquisition). → **Mitigation:** Treat baseline as v1; correct in follow-up changes when discrepancies are found during real reviews. Do not block this PR on exhaustive coverage.
- **Risk:** Naming a capability after a CRD ties spec churn to API renames. → **Mitigation:** API is `v1alpha1` and stable in practice; if a CRD is ever renamed, that change uses `RENAMED Requirements` plus a folder rename — supported by OpenSpec.
- **Risk:** Cache-hit acquisition logic is genuinely non-trivial (annotation timestamps, ownership transfer). Spec may oversimplify. → **Mitigation:** Scenarios cite the annotation key and ownership-transfer behavior explicitly so future deltas have something concrete to amend.

## Migration Plan

1. Land this change. Archive after `/opsx:verify` passes.
2. Going forward, every PR that touches `api/v1alpha1/` or reconciler behavior MUST include an OpenSpec change with deltas against the baseline.
3. No rollback needed — pure documentation.

## Open Questions

- Should `internal/utils/reconciler/operations.go` (sequential-operations pattern) be promoted to its own capability later? Deferred; it's an implementation detail today.
- Webhook scaffolding exists but is unused — do we add a placeholder capability now or wait until it's wired up? Deferred to when webhooks are activated.
- Follow-up test gap: the API defines the constant `RequirementFinalizerName`, but the controller does not currently reconcile a `Requirement` finalizer, and no existing controller or handler test covers that absence.
- Follow-up test gap: no existing controller or handler test asserts that two independently initialized `Operation` resources receive distinct `status.operationId` values; helper coverage currently checks only that generated IDs are non-empty.
- Follow-up test gap: removal of an acquired `Operation` from `Cache.status.availableCaches` is covered indirectly by ownership-transfer behavior, but no end-to-end controller or handler test verifies that the acquired `Operation` disappears from the cache status after a cache reconcile.
- Follow-up test gap: no existing controller or handler test explicitly covers the `Cache` no-finalizer contract or garbage-collection behavior after user deletion.
@@ -0,0 +1,30 @@
## Why

The Operation Cache Controller has been built and is running in production, but it has no machine-readable specification. New contributors must reverse-engineer behavior from Go code, and future OpenSpec change proposals have nothing to diff against. This baseline change captures the *current* behavior of the four CRDs and their reconcilers as OpenSpec capability specs so that subsequent work can propose true deltas.

## What Changes

- Document the existing system as four capability specs derived from the current `internal/controller/` and `internal/handler/` implementations.
- No source code, CRDs, RBAC, or runtime behavior change. This is a documentation-only baseline.
- Establish naming conventions for future capability deltas (one capability per top-level CRD).
- Record the cache hit/miss data flow and finalizer/ownership rules as testable requirements.

## Capabilities

### New Capabilities

- `requirement-management`: Reconciliation of `Requirement` CRs into `Operation` and (optionally) `Cache` resources, including cache-key derivation and cache-hit acquisition.
- `operation-orchestration`: Lifecycle of `Operation` CRs — fan-out to per-app `AppDeployment` children, status aggregation, finalizer-driven teardown, and acquisition annotations.
- `appdeployment-execution`: Translation of `AppDeployment` CRs into provision/teardown Kubernetes `Job`s, job-status reconciliation, and finalizer cleanup.
- `cache-pool`: Pre-provisioning of `Operation`s under a `Cache`, auto-count maintenance, cache-duration expiry, and label-based cache-key indexing.

### Modified Capabilities

<!-- None: this is the initial baseline; no prior specs exist. -->

## Impact

- Affected code: none (read-only documentation pass over `api/v1alpha1/`, `internal/controller/`, `internal/handler/`).
- Affected APIs: none. The four CRDs in `controller.azure.github.com/v1alpha1` are described, not changed.
- Affected processes: future changes must now ship spec deltas against these baselines instead of free-form proposals.
- Risk: low — if the spec mis-describes current behavior, it is corrected in a follow-up change; runtime is unaffected.
@@ -0,0 +1,43 @@
## ADDED Requirements

### Requirement: AppDeployment runs the provision Job on creation

The controller SHALL, when an `AppDeployment` is created, launch a Kubernetes `Job` derived from `spec.provision` owned by the `AppDeployment`, transition `status.phase` from `""` → `Pending` → `Deploying`, and set `status.phase=Ready` when the provision `Job` reports `Complete=True`.

#### Scenario: Successful provision Job promotes AppDeployment to Ready

- **WHEN** an `AppDeployment` is created and its provision `Job` succeeds
- **THEN** `status.phase` of the `AppDeployment` becomes `Ready`

#### Scenario: Failed provision Job is retried and keeps AppDeployment out of Ready

- **WHEN** the provision `Job` reports `Failed=True`
- **THEN** the controller deletes the failed provision `Job`, creates a replacement provision `Job`, and keeps the `AppDeployment` out of `Ready` until a provision `Job` succeeds
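
The phase transitions in this requirement can be sketched as a small state function. This is a minimal illustration, not the controller's handler code: `JobState` and `nextProvisionPhase` are hypothetical names, dependency gating is ignored, and keeping the phase at `Deploying` while a failed `Job` is replaced is an assumption, since the spec only requires that the `AppDeployment` stays out of `Ready`.

```go
package main

import "fmt"

// JobState is a hypothetical summary of the provision Job's observed conditions.
type JobState int

const (
	JobAbsent   JobState = iota // no provision Job observed
	JobRunning                  // Job exists, neither Complete nor Failed
	JobComplete                 // Job reports Complete=True
	JobFailed                   // Job reports Failed=True
)

// nextProvisionPhase returns the phase expected after one reconcile and
// whether a (replacement) provision Job has to be created.
func nextProvisionPhase(phase string, job JobState) (next string, createJob bool) {
	switch phase {
	case "":
		return "Pending", false
	case "Pending":
		// Assumption: dependencies are already satisfied at this point.
		return "Deploying", true
	case "Deploying":
		switch job {
		case JobComplete:
			return "Ready", false
		case JobFailed:
			// The failed Job is deleted and a replacement is created.
			return "Deploying", true
		}
	}
	// Any other combination leaves the phase unchanged.
	return phase, false
}

func main() {
	phase := ""
	for _, job := range []JobState{JobAbsent, JobAbsent, JobFailed, JobComplete} {
		var create bool
		phase, create = nextProvisionPhase(phase, job)
		fmt.Printf("phase=%s createJob=%v\n", phase, create)
	}
}
```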

### Requirement: AppDeployment respects declared dependencies

The controller SHALL NOT launch the provision `Job` for an `AppDeployment` whose `spec.dependencies` include sibling app names until every dependency `AppDeployment` (sharing the same parent `Operation` via `spec.opId`) has reached `status.phase=Ready`.

#### Scenario: Dependent app waits for its dependency

- **WHEN** `AppDeployment` `app-b` declares `spec.dependencies=["app-a"]` and `app-a` is not yet `Ready`
- **THEN** `app-b` remains in `status.phase=Pending` and no provision `Job` is created for it
- **AND** once `app-a` reaches `status.phase=Ready`, the controller launches `app-b`'s provision `Job`
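
The dependency gate above reduces to a simple check over sibling phases. The sketch below assumes a hypothetical helper `dependenciesReady` and a map from sibling app name to `status.phase` for all `AppDeployment`s that share the same `spec.opId`; how the real controller looks up siblings is not described in this spec.

```go
package main

import "fmt"

// dependenciesReady reports whether every named dependency has reached Ready.
// phases maps sibling app names (same spec.opId) to their status.phase.
// A dependency missing from the map counts as not ready.
func dependenciesReady(deps []string, phases map[string]string) bool {
	for _, d := range deps {
		if phases[d] != "Ready" {
			return false
		}
	}
	return true
}

func main() {
	phases := map[string]string{"app-a": "Deploying"}
	fmt.Println(dependenciesReady([]string{"app-a"}, phases)) // app-b must wait

	phases["app-a"] = "Ready"
	fmt.Println(dependenciesReady([]string{"app-a"}, phases)) // app-b may launch
}
```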

### Requirement: AppDeployment runs the teardown Job on deletion

The controller SHALL add the finalizer `finalizer.appdeployment.devinfra.goms.io` to every `AppDeployment`, and on deletion SHALL launch a `Job` derived from `spec.teardown` owned by the `AppDeployment`, transition `status.phase` to `Deleting`, and remove the finalizer after `status.phase=Deleted`. The controller sets `status.phase=Deleted` when the teardown `Job` succeeds, or after it observes and deletes a failed teardown `Job` while emitting a warning event.

#### Scenario: Teardown Job runs before AppDeployment is removed

- **WHEN** an `AppDeployment` in `status.phase=Ready` is deleted
- **THEN** a teardown `Job` is created, the `AppDeployment` reports `status.phase=Deleting` until the teardown attempt completes or fails, then reports `status.phase=Deleted` and has its finalizer removed
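
As a rough illustration of the deletion flow, the following sketch models one reconcile step of a deleting `AppDeployment`. The names `teardownStep` and `teardownNext` are hypothetical, and the step-by-step split (one phase change per reconcile) is an assumption made for readability.

```go
package main

import "fmt"

// JobState is a hypothetical summary of the teardown Job's observed conditions.
type JobState int

const (
	JobAbsent JobState = iota
	JobRunning
	JobComplete
	JobFailed
)

// teardownStep describes what one reconcile of a deleting AppDeployment does.
type teardownStep struct {
	Phase           string // resulting status.phase
	CreateJob       bool   // launch the Job derived from spec.teardown
	RemoveFinalizer bool   // drop the AppDeployment finalizer
	WarningEvent    bool   // emit a warning for a failed teardown Job
}

// teardownNext models the deletion flow: Deleting until the teardown attempt
// completes or fails, then Deleted, then finalizer removal.
func teardownNext(phase string, job JobState) teardownStep {
	switch {
	case phase == "Deleted":
		return teardownStep{Phase: "Deleted", RemoveFinalizer: true}
	case phase != "Deleting":
		return teardownStep{Phase: "Deleting", CreateJob: true}
	case job == JobComplete:
		return teardownStep{Phase: "Deleted"}
	case job == JobFailed:
		// The failed Job is deleted and a warning event is emitted.
		return teardownStep{Phase: "Deleted", WarningEvent: true}
	default:
		return teardownStep{Phase: "Deleting"}
	}
}

func main() {
	fmt.Printf("%+v\n", teardownNext("Ready", JobAbsent))
	fmt.Printf("%+v\n", teardownNext("Deleting", JobFailed))
	fmt.Printf("%+v\n", teardownNext("Deleted", JobAbsent))
}
```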

### Requirement: AppDeployment owns its Jobs for cascade deletion

The controller SHALL set the `AppDeployment` as the controller `ownerReference` of every provision and teardown `Job` it creates, so deleting the `AppDeployment` causes Kubernetes to garbage-collect its dependent `Job`s.

#### Scenario: Deleting AppDeployment removes its Jobs

- **WHEN** an `AppDeployment` and its provision `Job` exist, and the `AppDeployment` is deleted
- **THEN** the provision `Job` is garbage-collected by Kubernetes via `ownerReferences`
@@ -0,0 +1,56 @@
## ADDED Requirements

### Requirement: Cache maintains a pool of pre-provisioned Operations

The controller SHALL, for every `Cache`, set `status.keepAlive` to the controller's fixed keep-alive count, list owned `Operation`s, publish the names of owned `Operation`s with `status.phase=Reconciled` in `status.availableCaches`, and create additional owned `Operation`s from `spec.operationTemplate` only when the total owned `Operation` count is below `status.keepAlive`.

#### Scenario: Empty Cache provisions Operations to fill the pool

- **WHEN** a `Cache` is created with `spec.operationTemplate` set and no owned `Operation`s exist
- **THEN** the controller creates owned `Operation`s from the template until the total owned `Operation` count reaches `status.keepAlive`
- **AND** after owned `Operation`s reach `status.phase=Reconciled`, `status.availableCaches` lists their names
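
The pool-maintenance rule can be sketched as a pure function over the owned `Operation`s. `ownedOp` and `poolPlan` are hypothetical names used only for illustration; the real controller works on Kubernetes objects and its keep-alive count is a fixed value not stated in this spec.

```go
package main

import "fmt"

// ownedOp is a hypothetical stand-in for an Operation owned by the Cache.
type ownedOp struct {
	Name  string
	Phase string
}

// poolPlan returns the names to publish in status.availableCaches and the
// number of Operations to create so that the owned total reaches keepAlive.
func poolPlan(owned []ownedOp, keepAlive int) (available []string, toCreate int) {
	for _, o := range owned {
		if o.Phase == "Reconciled" {
			available = append(available, o.Name)
		}
	}
	// Only the total owned count matters, not how many are already Reconciled.
	if len(owned) < keepAlive {
		toCreate = keepAlive - len(owned)
	}
	return available, toCreate
}

func main() {
	owned := []ownedOp{{"op-1", "Reconciled"}, {"op-2", "Reconciling"}}
	available, toCreate := poolPlan(owned, 5)
	fmt.Println(available, toCreate)
}
```

Because an acquired `Operation` is no longer owned by the `Cache`, it drops out of the `owned` input on the next reconcile and therefore out of the published list as well.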

### Requirement: Cache CR is keyed by a deterministic cache key

The controller SHALL compute `status.cacheKey` deterministically from `spec.operationTemplate` and SHALL apply the label `operation-cache-controller.azure.github.com/cache-key=<key>` to every `Operation` it creates for the cache, truncating the label value to the Kubernetes label-value limit when required. The controller SHALL NOT add this cache-key label to the `Cache` resource itself.

#### Scenario: Two Caches with identical templates compute the same key

- **WHEN** two `Cache` CRs with byte-identical `spec.operationTemplate` are created
- **THEN** both report the same `status.cacheKey`, and cached `Operation`s created for either `Cache` carry the matching `cache-key` label value
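
One way to obtain a deterministic key with label-safe truncation is to hash a canonical encoding of the template. The sketch below is an assumption throughout: the spec does not state the hash function or encoding, and `operationTemplate`, `applicationSpec`, `cacheKey`, and `labelValue` are hypothetical names with only a subset of plausible fields. The 63-character limit is the Kubernetes label-value limit.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// maxLabelValueLen is the Kubernetes limit for a label value.
const maxLabelValueLen = 63

// applicationSpec and operationTemplate are simplified, hypothetical
// stand-ins for the real CRD types.
type applicationSpec struct {
	Name         string   `json:"name"`
	Dependencies []string `json:"dependencies,omitempty"`
}

type operationTemplate struct {
	Applications []applicationSpec `json:"applications"`
}

// cacheKey hashes the JSON encoding of the template (assumed scheme).
// Struct fields are encoded in declaration order, so equal templates
// always produce equal keys.
func cacheKey(t operationTemplate) (string, error) {
	raw, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// labelValue truncates the key to the label-value limit when required.
func labelValue(key string) string {
	if len(key) > maxLabelValueLen {
		return key[:maxLabelValueLen]
	}
	return key
}

func main() {
	tmpl := operationTemplate{Applications: []applicationSpec{{Name: "app-a"}}}
	key, err := cacheKey(tmpl)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(key), len(labelValue(key)))
}
```

With this scheme the hex digest is 64 characters long, so the label value is always one character shorter than `status.cacheKey`; consumers matching on the label would need to apply the same truncation.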

### Requirement: Cache surfaces available entries in status

The controller SHALL keep `status.availableCaches` synchronized with the names of owned `Operation`s that are `Reconciled`. When a cached `Operation` is acquired, ownership transfers away from the `Cache`, so the next cache reconcile SHALL omit that `Operation` from `status.availableCaches`.

#### Scenario: Acquired cached Operation disappears from status

- **WHEN** a `Requirement` acquires a cached `Operation`
- **THEN** that `Operation`'s name is removed from the `Cache`'s `status.availableCaches` within one reconcile cycle

### Requirement: Cache expires entries past spec.expireTime

The controller SHALL delete the `Cache` CR once the wall-clock time exceeds `Cache.spec.expireTime` (when set) and SHALL stop processing additional cache-pool operations in that reconcile. Cleanup of owned cached `Operation`s relies on Kubernetes `ownerReferences` cascade deletion.

#### Scenario: Past-expiry Cache is deleted

- **WHEN** the current time is after `Cache.spec.expireTime`
- **THEN** the controller deletes the `Cache` CR and does not create replacement cached `Operation`s in that reconcile
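
The expiry check itself is a single time comparison. The sketch below assumes that `spec.expireTime` is an RFC3339 string and that an empty value means "never expires"; the field's actual type is not stated in this spec, and `cacheExpired` and `exampleNow` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// exampleNow is a fixed clock value used only for the demonstration.
var exampleNow = time.Date(2026, 5, 13, 12, 0, 0, 0, time.UTC)

// cacheExpired reports whether now is strictly after expireTime.
// Assumption: expireTime is RFC3339; an empty value disables expiry.
func cacheExpired(expireTime string, now time.Time) (bool, error) {
	if expireTime == "" {
		return false, nil
	}
	deadline, err := time.Parse(time.RFC3339, expireTime)
	if err != nil {
		return false, err
	}
	return now.After(deadline), nil
}

func main() {
	for _, expire := range []string{"", "2026-05-13T11:00:00Z", "2026-05-13T13:00:00Z"} {
		expired, err := cacheExpired(expire, exampleNow)
		fmt.Printf("expireTime=%q expired=%v err=%v\n", expire, expired, err)
	}
}
```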

### Requirement: Cache reconciliation runs at least every 60 seconds

The controller SHALL re-reconcile every `Cache` CR within 60 seconds even with no watch events, so pool depletion and expiry are eventually observed.

#### Scenario: Idle Cache is re-evaluated

- **WHEN** a `Cache` exists and no events fire against it
- **THEN** the controller re-reconciles it within 60 seconds

### Requirement: Cache does not use a finalizer

The controller SHALL NOT register a finalizer on `Cache` CRs; cleanup of owned `Operation`s relies on Kubernetes `ownerReferences` cascade deletion alone.

#### Scenario: Deleting a Cache cascades via owner references

- **WHEN** a `Cache` with owned cached `Operation`s is deleted
- **THEN** Kubernetes garbage-collects those `Operation`s via `ownerReferences` without any finalizer interaction
@@ -0,0 +1,51 @@
## ADDED Requirements

### Requirement: Operation fans out to one AppDeployment per application

The controller SHALL, for every `Operation`, create exactly one owned `AppDeployment` per entry in `spec.applications`, copying the application's `provision`, `teardown`, and `dependencies` fields and stamping `spec.opId` with the `Operation`'s unique `status.operationId`.

#### Scenario: Multi-application Operation creates matching AppDeployments

- **WHEN** an `Operation` is created with two `ApplicationSpec` entries named `app-a` and `app-b`
- **THEN** the controller creates two `AppDeployment` resources owned by the `Operation`, each carrying the parent `operationId` in `spec.opId`, and the `Operation` reports `status.phase=Reconciling`
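
The fan-out can be pictured as a one-to-one mapping from `spec.applications` entries to child objects. The types below are simplified, hypothetical stand-ins (plain strings replace the real provision and teardown templates), and naming each child after its application is an assumption; the spec does not define child names.

```go
package main

import "fmt"

// ApplicationSpec is a simplified stand-in for an entry in spec.applications.
type ApplicationSpec struct {
	Name         string
	Provision    string // placeholder for the real provision template
	Teardown     string // placeholder for the real teardown template
	Dependencies []string
}

// AppDeployment is a simplified stand-in for the child resource.
type AppDeployment struct {
	Name         string
	OwnerOp      string // name of the owning Operation
	OpID         string // spec.opId, copied from status.operationId
	Provision    string
	Teardown     string
	Dependencies []string
}

// fanOut builds exactly one AppDeployment per application entry.
func fanOut(opName, operationID string, apps []ApplicationSpec) []AppDeployment {
	children := make([]AppDeployment, 0, len(apps))
	for _, app := range apps {
		children = append(children, AppDeployment{
			Name:      app.Name, // assumed naming scheme
			OwnerOp:   opName,
			OpID:      operationID,
			Provision: app.Provision,
			Teardown:  app.Teardown,
			// Copy the slice so children do not alias the parent's spec.
			Dependencies: append([]string(nil), app.Dependencies...),
		})
	}
	return children
}

func main() {
	apps := []ApplicationSpec{
		{Name: "app-a"},
		{Name: "app-b", Dependencies: []string{"app-a"}},
	}
	for _, child := range fanOut("op-1", "id-123", apps) {
		fmt.Printf("%s opId=%s deps=%v\n", child.Name, child.OpID, child.Dependencies)
	}
}
```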

### Requirement: Operation status aggregates child AppDeployment phases

The controller SHALL set `status.phase=Reconciled` on an `Operation` only when every owned `AppDeployment` reports `status.phase=Ready`, and SHALL keep `status.phase=Reconciling` otherwise.

#### Scenario: Operation becomes Reconciled when all children are Ready

- **WHEN** all `AppDeployment`s owned by an `Operation` report `status.phase=Ready`
- **THEN** the `Operation` reports `status.phase=Reconciled`

#### Scenario: One pending child keeps the Operation reconciling

- **WHEN** at least one owned `AppDeployment` is not yet `Ready`
- **THEN** the `Operation` continues to report `status.phase=Reconciling`
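
The aggregation rule is an all-of check over child phases. In this sketch, `operationPhase` is a hypothetical helper, and treating children that do not exist yet as "not Ready" (by comparing against the expected count from `spec.applications`) is an assumption added to avoid reporting `Reconciled` for an `Operation` with no children.

```go
package main

import "fmt"

// operationPhase returns Reconciled only when all expected children exist
// and every one of them reports Ready; otherwise it returns Reconciling.
func operationPhase(expected int, childPhases []string) string {
	if len(childPhases) < expected {
		return "Reconciling" // some children have not been created yet
	}
	for _, phase := range childPhases {
		if phase != "Ready" {
			return "Reconciling"
		}
	}
	return "Reconciled"
}

func main() {
	fmt.Println(operationPhase(2, []string{"Ready", "Deploying"}))
	fmt.Println(operationPhase(2, []string{"Ready", "Ready"}))
}
```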

### Requirement: Operation acquisition is recorded via annotation

When an `Operation` is acquired from a cache by a `Requirement`, the controller SHALL stamp the annotation `operation.controller.azure.com/acquired` on the `Operation` with an RFC3339 timestamp and SHALL transfer the `ownerReference` from the `Cache` to the acquiring `Requirement`.

#### Scenario: Acquired Operation carries the timestamp annotation

- **WHEN** a `Requirement` acquires a cached `Operation`
- **THEN** the `Operation` has annotation `operation.controller.azure.com/acquired` set to the acquisition time and its sole controller `ownerReference` points to the `Requirement`
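
Acquisition combines two observable changes: the timestamp annotation and the owner swap. The sketch below uses simplified, hypothetical types (`ownerRef`, `operation`) and a hypothetical `acquire` helper; only the annotation key and the RFC3339 format come from the requirement above.

```go
package main

import (
	"fmt"
	"time"
)

const acquiredAnnotation = "operation.controller.azure.com/acquired"

// exampleNow is a fixed clock value used only for the demonstration.
var exampleNow = time.Date(2026, 5, 13, 12, 0, 0, 0, time.UTC)

// ownerRef and operation are simplified stand-ins for the Kubernetes types.
type ownerRef struct {
	Kind       string
	Name       string
	Controller bool
}

type operation struct {
	Annotations map[string]string
	Owners      []ownerRef
}

// acquire stamps the acquisition time and replaces the Cache owner with the
// acquiring Requirement as the sole controller owner.
func acquire(op *operation, requirementName string, now time.Time) {
	if op.Annotations == nil {
		op.Annotations = map[string]string{}
	}
	op.Annotations[acquiredAnnotation] = now.UTC().Format(time.RFC3339)
	op.Owners = []ownerRef{{Kind: "Requirement", Name: requirementName, Controller: true}}
}

func main() {
	cached := &operation{Owners: []ownerRef{{Kind: "Cache", Name: "cache-1", Controller: true}}}
	acquire(cached, "req-1", exampleNow)
	fmt.Println(cached.Annotations[acquiredAnnotation], cached.Owners)
}
```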

### Requirement: Operation uses a finalizer to record deletion lifecycle

The controller SHALL add the finalizer `finalizer.operation.controller.azure.com` to every `Operation`, transition a deleting `Operation` through `status.phase=Deleting` and `status.phase=Deleted`, and then remove the finalizer. Deletion of owned `AppDeployment`s is delegated to Kubernetes `ownerReferences` and the `AppDeployment` controller; the `Operation` controller does not wait for every child to report `Deleted`.

#### Scenario: Operation deletion records Deleting and Deleted phases

- **WHEN** an `Operation` is deleted
- **THEN** the `Operation` enters `status.phase=Deleting`, then `status.phase=Deleted`, after which the controller removes `finalizer.operation.controller.azure.com` and the API server may remove the `Operation`
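
Finalizer handling amounts to idempotent add and remove operations on `metadata.finalizers`. The helpers below are a self-contained sketch over a plain string slice; a kubebuilder project would more likely call controller-runtime's `controllerutil.AddFinalizer` and `controllerutil.RemoveFinalizer`, and whether this controller does so is not stated in the spec.

```go
package main

import "fmt"

const operationFinalizer = "finalizer.operation.controller.azure.com"

// addFinalizer appends name unless it is already present (idempotent).
func addFinalizer(finalizers []string, name string) []string {
	for _, f := range finalizers {
		if f == name {
			return finalizers
		}
	}
	return append(finalizers, name)
}

// removeFinalizer returns a copy of finalizers without name.
func removeFinalizer(finalizers []string, name string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	finalizers := addFinalizer(nil, operationFinalizer)
	fmt.Println(finalizers)

	// After status.phase reaches Deleted, the finalizer is removed so the
	// API server may delete the object.
	finalizers = removeFinalizer(finalizers, operationFinalizer)
	fmt.Println(len(finalizers))
}
```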

### Requirement: Operation IDs are unique within the cluster

The controller SHALL assign every `Operation` a `status.operationId` value that is unique across all `Operation` resources in the cluster.

#### Scenario: Two independently created Operations get distinct IDs

- **WHEN** two `Operation` CRs are created in any namespaces
- **THEN** their `status.operationId` values differ
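
One common way to get IDs that are unique in practice is to draw them from a cryptographic random source. The sketch below is purely illustrative: the spec does not say how `status.operationId` is generated, and `newOperationID` is a hypothetical helper. Random 128-bit IDs make collisions overwhelmingly unlikely rather than strictly impossible.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newOperationID returns 16 random bytes encoded as 32 hex characters.
func newOperationID() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	first, err := newOperationID()
	if err != nil {
		panic(err)
	}
	second, err := newOperationID()
	if err != nil {
		panic(err)
	}
	fmt.Println(first != second)
}
```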