chore: code review fixes 2026-05 (with breaking API tightenings)#2
Merged
Conversation
VerifySession parsed Exp but never validated it, leaving caller correctness as the only gate against expired cookies. Reject sessions where Exp is set and in the past; Exp == 0 still means "no expiry" so callers that opt out of expiration keep working. Downstream consumers in epoch and glance already perform a redundant Exp check at the call site; those will be cleaned up in coordinated PRs.
Swallowing rand.Read's error meant a degraded OS RNG would silently return all-zero state, defeating CSRF protection. Panic instead — the syscall is documented to never fail on supported platforms, and if it ever does, halting is the only safe option.
LoadOrGenerateCert previously accepted any keypair that parsed, so a kubelet whose mounted secret rotated out from under it would silently keep serving an expired cert until the API server refused TLS — a diagnosis-hostile failure mode. Parse the leaf cert after a successful keypair load; if NotAfter is in the past, log a warning and mint a self-signed cert instead. Parse errors on the cert chain are logged but non-fatal so a working keypair is never gratuitously discarded. Keypair load failures (corrupt PEM, mismatched key) continue to surface as fatal — those signal operator misconfiguration and silent substitution would hide them.
Status arrays without an x-kubernetes-list-type were treated as atomic by Server-Side Apply, so two controllers (e.g. the operator and vk) patching different slots would clobber each other on every write. Tag Agents with listMapKey=slot and Toolboxes with listMapKey=name so apiserver merges entries by their natural identifier. Regenerate the CRD YAML.
Move the const block above the type declaration to match the rest of the package, and back IsTerminal with a package-private terminalStates map so adding or removing a terminal state is a single-line edit instead of a switch-case maintenance burden.
Replace each per-type Default's if-empty-return ladder with cmp.Or, and each IsValid's hand-rolled OR chain with slices.Contains against a package-private value table. Adding a new enum member is now a single edit to the const block plus one entry in the validity table — no parallel switch to keep in sync. Behavior is unchanged: all existing IsValid/Default tests pass without modification.
meta.go was a 160-line grab-bag of every constant and helper in the package. Split by concern so each file matches one responsibility: keys.go const vocabulary (annotations, labels, roles) owner.go toleration + owner-reference helpers vmname.go VM name builders + slot extraction connection.go connection-protocol inference meta.go keeps only the package godoc. No exported symbol changes; godoc strings on each annotation/label const are new.
LabelManagedBy was declared but never referenced — not by cocoon-common, not by any downstream consumer (vk-cocoon, cocoon-operator, cocoon-webhook, glance, epoch). Dead vocabulary in the public API costs maintenance with no payoff.
Setup used to call Fatalf on failure, which killed the process before the caller could run defers or surface a structured exit. The new signature returns the wrapped error so callers in main packages can log via their own logger and choose the exit policy. Breaking change: all downstream consumers must check the returned error. Each consumer (vk-cocoon, cocoon-operator, cocoon-webhook, glance, epoch, cocoon-net) will be updated in a coordinated PR.
The previous string-only return conflated three different signals into "": no owners at all, no ReplicaSet among the owners, and a ReplicaSet whose name had no hash suffix. Callers had no way to tell those apart and either treated empty as benign (silently dropping the "no owner" case) or wrote a duplicate check on the input. Return a comma-ok pair instead so the caller decides. Add table-driven tests for each branch. Breaking change: downstream callers (none today outside cocoon-common, mentioned only in cocoon-specs docs) must adopt the comma-ok form.
Returning "127.0.0.1" when net.InterfaceAddrs failed (or returned only
loopback / IPv6) made misconfigured hosts look healthy: a vk-cocoon
that couldn't see its own primary IP would advertise localhost,
register fine with the API server, then break every kubelet TLS
handshake from outside.
Surface the failure instead. Callers that want the previous fallback
write `if err != nil { ip = "127.0.0.1" }` and own the decision.
ErrNoNodeIP is exported so callers can branch on the "no IPv4
interface" case specifically.
Breaking change: vk-cocoon main.go must adopt the new signature in a
coordinated PR.
The function only checks the Toleration Key; Operator, Value, and Effect are ignored. The previous name implied a stronger match than the implementation gave, so any caller wanting strict semantics had to read the body to find out. Rename to make the looseness explicit: the cocoon-webhook gate is intentionally permissive and a stricter check belongs at the call site. Breaking change: downstream callers (cocoon-webhook, glance) must adopt the new name.
The legacy ExtractSlotFromVMName / MainAgentVMName / InferRoleFromVMName
trio split on the last dash, which misclassifies a toolbox whose
pod-name happens to end in -N (e.g. toolbox "db-2" produces vmName
vk-NS-CS-db-2 and reads as slot 2). The bug surfaces in glance role
inference and in vmspec.ShouldSnapshotVM under MainOnly policy.
Add a (namespace, cocoonSet)-anchored variant family:
AgentVMNamePrefix shared agent-VM-name prefix builder
ExtractAgentSlot parses the slot only when the prefix matches
and the suffix has no inner dash
MainAgentVMNameFor constant function — no parsing involved
InferRoleFromAgentSlot pure int -> role mapping
Existing helpers are kept and marked Deprecated in the godoc so
downstream consumers can migrate one call site at a time.
The PR-context framing ("the previous implementation returned X")
belongs in the commit message, not in code that has to live with
every future reader.
Trim narrative godocs that restate the signature or describe historical behavior. Keep only the contract and non-obvious WHY.
k8s/tls.go: LoadOrGenerateCert now takes ctx as its first argument and threads it through tryLoadDiskCert and isCertExpired so library logging no longer fabricates context.Background(). Drop the redundant os.Stat pre-check in tryLoadDiskCert; LoadX509KeyPair's fs.ErrNotExist covers the "missing cert" path and stops swallowing permission errors. Capture time.Now() once in GenerateSelfSignedCert. meta/lifecycle.go: inline LifecycleStatus.Apply so the per-pod reconcile path no longer allocates a throwaway map[string]any with boxed values just to iterate it. meta/connection.go: compare against string(cocoonv1.OS*) instead of raw "android"/"windows" literals so renames break the build.
There was a problem hiding this comment.
Pull request overview
This PR applies a set of cross-package refactors and API tightenings across auth, k8s, meta, log, and apis/v1 to improve correctness (e.g., session expiry handling, node IP detection), make failure modes explicit (returning errors instead of fatal exits), and clarify/extend metadata helpers (agent/toolbox VM naming disambiguation, owner parsing).
Changes:
meta: split helpers/keys into focused files, add agent-slot-aware VM naming helpers, rename toleration helper, and tightenOwnerDeploymentNameto return(string, bool).k8s: makeDetectNodeIPreturn(string, error)and enhance TLS cert loading to fall back to self-signed when disk certs are expired.auth/log/apis/v1: session verification now rejects expired sessions, log setup returns an error, and enum validation/defaulting is modernized; CRD list-map markers added for status arrays.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| meta/vmname.go | Introduces agent-vs-toolbox-safe VM naming helpers and deprecates legacy parsing helpers. |
| meta/owner.go | Renames toleration helper and changes OwnerDeploymentName to return (string, bool). |
| meta/meta.go | Converts meta.go into package-level documentation after splitting helpers/keys into new files. |
| meta/meta_test.go | Adds tests for new VM naming helpers and updated owner/toleration behaviors. |
| meta/lifecycle.go | Refactors IsTerminal to use a lookup table. |
| meta/keys.go | Moves metadata keys/constants into a dedicated file with one-line docs. |
| meta/connection.go | Moves ConnectionType helper into a dedicated file. |
| log/log.go | Changes Setup to return an error instead of fatal logging. |
| k8s/tls.go | Refactors cert loading to fall back to self-signed on expired disk certs; adds logging. |
| k8s/tls_test.go | Adds coverage for expired disk cert fallback and updates node IP test for new error return. |
| k8s/netip.go | Changes DetectNodeIP to return (string, error) and introduces ErrNoNodeIP. |
| auth/state.go | Makes RandomState panic on crypto/rand.Read failure. |
| auth/session.go | Updates VerifySession to reject expired sessions. |
| auth/session_test.go | Adds tests for session expiry and Exp == 0 behavior. |
| apis/v1/enums.go | Modernizes enum IsValid/Default implementations using slices + cmp.Or. |
| apis/v1/crds/cocoonset.cocoonstack.io_cocoonsets.yaml | Regenerates CRD with list-map markers for status lists. |
| apis/v1/cocoonset_types.go | Adds +listType=map / +listMapKey markers to status fields for stable patch semantics. |
Comments suppressed due to low confidence (3)
meta/vmname.go:105
- This deprecation guidance suggests
InferRoleFromAgentSlot(ExtractAgentSlot(ns, cs, vmName)), but ExtractAgentSlot returns -1 for non-agent names and InferRoleFromAgentSlot currently maps that to RoleSubAgent. Update the guidance to include a slot<0 branch (or adjust InferRoleFromAgentSlot) so the suggested migration path can’t misclassify toolboxes.
if idx < 0 {
return s, "", false
}
return s[:idx], s[idx+len(sep):], true
}
meta/owner.go:41
- OwnerDeploymentName claims to return ok=false when the ReplicaSet name has no recognizable hash suffix, but the implementation treats any name containing a '-' as OK (e.g. "demo-" or "demo-foo" would return ok=true). Either validate the suffix matches the expected Deployment ReplicaSet hash format or adjust the doc/comment to match the actual behavior.
for _, ref := range ownerRefs {
if ref.Kind != "ReplicaSet" {
continue
}
if before, _, ok := lastCut(ref.Name, "-"); ok {
return before, true
}
}
return "", false
}
auth/state.go:20
- The doc says RandomState panics with a wrapped message on rand.Read failure, but the code panics with a formatted string, losing the original error type/chain. Prefer panicking with an error value (e.g. fmt.Errorf("crypto/rand.Read: %w", err)) or update the comment to avoid implying wrapping.
// Panics on crypto/rand failure — a weak nonce silently breaks CSRF.
func RandomState() string {
b := make([]byte, 16)
if _, err := rand.Read(b); err != nil {
panic(fmt.Sprintf("crypto/rand.Read: %v", err))
}
return hex.EncodeToString(b)
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Without this defensive branch, the suggested migration chain InferRoleFromAgentSlot(ExtractAgentSlot(ns, cs, vmName)) misclassified toolbox VM names as RoleSubAgent — ExtractAgentSlot returns -1 for non-agent names but the legacy slot != 0 fall-through still hit RoleSubAgent. Also drop the unused MainAgentVMName helper.
Contributor
Author
|
@copilot review |
Previously, tryLoadDiskCert silently fell back to self-signed for any fs.ErrNotExist from LoadX509KeyPair — including the asymmetric case where the cert exists but the key is missing (or vice versa). Stat the cert path explicitly so only "cert path missing" triggers the fallback; other errors (e.g. key missing while cert exists, permission denied) propagate so operators can detect the misconfiguration.
Contributor
Author
|
@copilot review |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Round of code-review fixes across
auth,k8s,apis/v1,meta, andlog— a mix of bug fixes, API tightenings, and structural refactors.Breaking API changes (coordinated downstream PRs required)
The following signatures or names changed. Each is annotated in the commit with a
!and a body explaining the migration. Downstream projects (vk-cocoon,cocoon-operator,cocoon-webhook,glance,epoch,cocoon-net) need follow-up PRs.auth.VerifySession— now also rejects expired sessions (whenExp != 0and in the past). Downstreamepoch/server/auth_session.goandglance/auth/sso/session.goperform a redundantsess.Exp < time.Now().Unix()check that can be deleted in their respective PRs.log.Setup(ctx, envVar) error— wasfunc Setup(ctx, envVar)and calledFatalfinternally. Callers now choose the failure policy. Consumers:vk-cocoon/main.go,cocoon-operator/main.go,cocoon-webhook/main.go,glance/main.go,epoch/main.go,cocoon-net/cmd/root.go.meta.OwnerDeploymentName(ownerRefs) (string, bool)— wasstring. The bool disambiguates "no owner" from "owner present but no parseable hash suffix".k8s.DetectNodeIP() (string, error)— wasstring. Previously returned"127.0.0.1"on failure, masking misconfigured network namespaces.vk-cocoon/main.go:77needs to handle the new return.meta.HasCocoonToleration->meta.HasCocoonTolerationKey— rename only, semantics unchanged. The new name makes the legacy looseness (onlyKeyis matched) explicit. Consumers:cocoon-webhook/admission/pod_mutator.go,cocoon-webhook/admission/workload_validator.go,glance/discovery/kubernetes/metadata.go.Additive / soft-deprecation
meta.ExtractAgentSlot(ns, cs, vmName),meta.MainAgentVMNameFor(ns, cs),meta.InferRoleFromAgentSlot(slot),meta.AgentVMNamePrefix(ns, cs)are new. The legacyExtractSlotFromVMName/MainAgentVMName/InferRoleFromVMNametrio is now godoc-deprecated because it misclassifies toolbox VM names whose pod-name ends in-N(e.g. toolboxdb-2). Downstream call sites inglance/discovery/kubernetes/ownership.goandcocoon-common/meta/vmspec.go::ShouldSnapshotVMstill work today but should migrate.auth.RandomStatenow panics (with a wrapped message) oncrypto/rand.Readfailure instead of silently returning an all-zero state. Behavior change only on the previously-unreachable failure path.k8s.LoadOrGenerateCertnow falls back to self-signed when the on-disk cert is expired. Atls.LoadX509KeyPairfailure still surfaces as fatal.apis/v1.CocoonSetStatus.{Agents,Toolboxes}gained+listType=mapmarkers. CRD YAML regenerated.Non-API improvements
metapackage split:meta.gois now the package godoc; constants moved tokeys.go(with one-line docs), helpers split intoowner.go,vmname.go,connection.go.meta.LabelManagedByremoved (no consumers in any cocoonstack project).apis/v1/enums.gomodernized:cmp.OrforDefault,slices.Containsagainst per-type value tables forIsValid.meta.LifecycleState.IsTerminalnow table-driven; const block moved above the type to match the package convention.Test plan
make lint0 issues onGOOS=linuxandGOOS=darwingo test -race -count=1 ./...clean (all packages green)