OAuth Support#646
Conversation
28c980f to
6baab17
Compare
End-to-end validation: third-party OIDC consumer (JupyterHub)To validate that this PR isn't just "Grafana with extra steps", we ran the entire flow with JupyterHub as the OIDC client - installed as a separate Helm chart, not packaged with ACS, no code in this repo aware that it exists. What worked: What we did on the ACS side
What we did on the JupyterHub sideStandard Zero-to-JupyterHub install with auth_state_groups_key: oauth_user.fp_permissions
admin_groups:
- "<jupyterhub-admin permission UUID>"
allowed_groups:
- "<jupyterhub-admin permission UUID>"
- "<jupyterhub-user permission UUID>"
That's the entire integration. No glue code, no shim. The application reads UUIDs out of a JWT claim and gates on UUIDs it was told about. Verified end-to-end
One sharp edge we found and documentedKeycloak's user attribute cache memoises Docs added in this PR
This makes the PR's value proposition concrete: operators can connect any OIDC application to ACS by adding one values entry and creating F+ permission objects in the UI - no ACS internals touched. JupyterHub is the proof; Grafana is the in-tree reference. |
Follow-up: dropped the cache-invalidation footgunThe JupyterHub validation surfaced a sharp edge: F+ permission grants didn't propagate to consumers' tokens without a
Fix (commit 367197d):
Docs updated in the same commit.
End-state of the operator story is now: one values entry adds the OIDC client, F+ permission objects live in ConfigDB, ACL grants in Access Control, and changes propagate within a minute - all via the supported |
Follow-up: SPI bug that NO_CACHE exposed (commit
|
Adds the empty Maven module that will house the Keycloak User Storage SPI plugin (Java 17, Keycloak 26.1.1 SPI deps in 'provided' scope, JUnit 5 + Mockito + AssertJ + Testcontainers for tests). No source files yet. Also lands the shaping artefacts: the pitch (2-month appetite to make Factory+ the single source of truth for Keycloak users/groups via a custom federation provider) and the implementation plan with Phase 1 detailed and Phases 2-12 outlined.
The website is factoryplus.app.amrc.co.uk; reversed strictly that's uk.co.amrc.app.factoryplus, not uk.co.amrc.factoryplus (which the existing hivemq-krb module uses but drops the .app segment). Renamed the empty package directories (not tracked by git, doesn't show in the diff) and updated pom.xml groupId + every package reference in the implementation plan.
Wires the new module into the existing AMRC Java multi-module build:
* Inherits dependency versions (JUnit BOM, SLF4J, JSON libs, Apache
HTTP, RxJava, Vavr, Jetty) from base-pom 0.0.0-intree
* Sets acs.top to .. so the local-build profile finds the in-tree
Maven repo at lib/mvn (one level less deep than lib/X children)
* Pins keycloak/mockito/assertj/testcontainers locally since the
parent doesn't manage them
Defers adding the java-service-client dep to Phase 2. The in-tree
publication is 0.1-SNAPSHOT (predating base-pom inheritance) while the
parent's dependencyManagement uses ${acs.version} = 0.0.0-intree, which
isn't a published artefact. We'll resolve when we actually need the
client (pin 0.1-SNAPSHOT, rebuild the lib, or upstream a fix).
Plan updated: Phase 2 uses FPAuth, Phase 4 uses FPGssClientKeytab with
the existing sv1openid keytab, Phase 8 adds maven-shade-plugin to bundle
+ relocate transitives to dodge Keycloak classloader collisions.
Two SPI classes Keycloak needs to recognise the federation provider:
* FactoryPlusUserStorageProviderFactory.getId() returns 'factoryplus',
matching the existing kerberos federation's id pattern. One factory
per JVM, holds no per-request state.
* FactoryPlusUserStorageProvider is the per-request instance Keycloak
constructs via the factory. Empty for now; close() is a no-op until
we hold real resources (HTTP client, GSS context).
Required adding keycloak-model-storage to pom.xml: in Keycloak 26 the
UserStorageProvider + Factory interfaces moved out of keycloak-server-spi
into this dedicated artefact. Discovered via a compile failure that
referenced classes resolvable by name but not on the classpath.
TDD: factory-id test failed with 'cannot find symbol', then went green
after the implementation classes existed. mvn -B test reports 1/1.
Adds the ServiceLoader metadata file that makes Keycloak discover FactoryPlusUserStorageProviderFactory at startup. Without this file the jar ships fine and the classes load on demand, but Keycloak never knows the factory exists, so no realm can configure 'factoryplus' as a federation provider. The file is a single line containing the fully-qualified class name. Maven copies src/main/resources/* into the jar at build time, landing it at META-INF/services/org.keycloak.storage.UserStorageProviderFactory inside the jar where Keycloak's ServiceLoader scan finds it. SpiRegistrationTest asserts the file is on the classpath at the expected path with the expected contents - a guard against renames or typos that would otherwise only surface as 'no such provider' at Keycloak startup. mvn -B test now runs 2/2 green.
Replaces the planned hardcoded fixed-user with a real seam from day one,
so no throwaway code is ever committed and Phase 2 just slots a real
HTTP-backed store implementation in behind the existing interface.
New production types:
* FactoryPlusUser - immutable record DTO (uuid, username, email)
* FactoryPlusUserStore - read-only interface: findByUuid|Username|Email,
each returning Optional<FactoryPlusUser>
* FactoryPlusUserAdapter - extends Keycloak's AbstractUserAdapter to
expose F+ data as a UserModel; overrides only getUsername, getEmail,
getId (uses F+ UUID, not username, as the federated external id so
renames don't break references), credentialManager (standard plumbing,
Phase 6 wires real validation)
* NullFactoryPlusUserStore - singleton null-object impl returning
Optional.empty for every lookup; used by the factory as the
production default until Phase 2
The provider holds a FactoryPlusUserStore via constructor injection and
implements UserLookupProvider by delegating findBy* and wrapping results
in adapters. The factory hands NullFactoryPlusUserStore.INSTANCE to each
provider for now.
Test coverage (17 unit tests, 5 classes, all green):
* FactoryPlusUserAdapterTest: username/email exposure, F+ UUID
accessor, storage id format (f:<modelId>:<userUuid>), null email
handling
* FactoryPlusUserStorageProviderTest: delegation for username/email/id
lookups, including null-on-empty and adapter type assertion
* NullFactoryPlusUserStoreTest: every lookup returns empty (pins the
contract against future regressions)
Mockito bumped 5.11.0 -> 5.18.0 because 5.11 cannot mock interfaces on
JDK 25 ("Could not modify all classes [interface
org.keycloak.models.KeycloakSession]"). The system JVM here is 25.
Failsafe + Testcontainers + Keycloak admin client integration test that
bakes the built jar into a real quay.io/keycloak/keycloak:26.1.1
container, then asserts:
1. An admin can register the 'factoryplus' provider on a realm.
Keycloak's components endpoint returns 201, which it only does if
the SPI was discovered at startup via META-INF/services.
2. With the null store wired, user search through the realm returns
no matches and doesn't error - proving end-to-end that the
federation chain falls through cleanly.
Failsafe wired to run *IT.java in the verify phase, after package, with
spi.jar system property pointing at the freshly-built artefact. Also
forwards DOCKER_HOST (env + sysprop) to the forked test JVM for setups
where Docker isn't at the default /var/run/docker.sock.
The IT is gated by @EnabledIf("dockerAvailable") so it skips cleanly on
dev machines where Testcontainers can't auto-detect Docker (notably
macOS + OrbStack/Colima/Rancher Desktop, where the JVM can't follow the
/var/run/docker.sock symlink). CI on ubuntu-latest has Docker at the
default path so the IT runs there. The class doc explains the local
workarounds.
Bumped Mockito 5.11.0 -> 5.18.0 (JDK 25 inline-mock-maker compatibility,
already needed before this commit) and Testcontainers 1.19.7 -> 1.21.3
(better socket detection on rootless/alt Docker hosts). Added
keycloak-admin-client 26.0.5 (test scope; admin client artefact lags the
main 26.1.1 release but the REST contract is stable across 26.x). Added
slf4j-simple test dep so Testcontainers' diagnostic logs surface.
The earlier diagnostic guessed at SIP/sandboxing as the reason
Testcontainers couldn't see Docker. Real cause: OrbStack stops the VM
when idle, removing /Users/<user>/.orbstack/run/docker.sock. The docker
CLI auto-starts the VM on demand (so 'docker version' works) but
Testcontainers just stat()s the socket file without that wake-up. With
OrbStack running ('orb start'), DOCKER_HOST is honoured, the
EnvironmentAndSystemPropertyClientProviderStrategy applies, and the IT
runs end-to-end (~9s for both tests on this machine).
Updated the @EnabledIf comment to describe the real cause and the fix.
Trimmed simplelogger.properties from trace to info; trace is left as a
commented-out toggle for future debugging.
GitHub Actions workflow runs on push/PR touching the SPI module, the Java parent POM, the in-tree Maven repo, or the workflow itself. ubuntu-latest has Docker preinstalled at the standard path so the Testcontainers IT runs there without configuration. JDK 21 with Maven caching for fast iteration. README documents build/test/run-locally, the Phase 1 status caveat (SPI loads but the null store returns no users), the OrbStack workaround for local IT runs, the project layout, the recipe for adding a new SPI capability, and the architecture sketch showing how the FactoryPlusUserStore seam absorbs Phase 2 changes.
Pinned to 0.1-SNAPSHOT explicitly to override base-pom's
dependencyManagement entry (which references ${acs.version} =
0.0.0-intree, mismatched with the in-tree publication). Reconciling
the parent POM's version scheme is a separate cleanup outside this
phase's scope.
Compile + full test suite (17 unit + 2 IT) still green.
First real implementation of FactoryPlusUserStore: calls the F+ auth
service over HTTP using bare java.net.http.HttpClient. The class
Javadoc documents the expected HTTP contract that Phase 3 implements
in acs-auth:
GET /v2/principal/{uuid} -> { uuid, name, email? } or 404
GET /v2/principal?name={name} -> { uuid, name, email? } or 404
GET /v2/principal?email={email} -> { uuid, name, email } or 404
Added FactoryPlusAuthException (RuntimeException) for transport / 5xx /
malformed-response failures. Surfaces upward so Keycloak's federation
chain can fall through; silent emptys would look like 'user not found'
instead of the real fault.
Phase 4 will swap the bare HttpClient for lib/java-service-client's
FPGssClientKeytab when we need real SPNEGO auth against a real F+; the
FactoryPlusUserStore interface absorbs that change. The
java-service-client dep added in 2.1 is unused at compile time today
but stays declared for that future Phase 4 swap.
Tests (10 new, all Wiremock-driven, 27 unit tests total green):
* find_by_uuid happy path, 404, 5xx
* find_by_username happy path, 404, URL-encoding edge case
* find_by_email happy path, 404
* principal with no email yields null in the DTO
* malformed JSON throws FactoryPlusAuthException
Wiremock 3.13.1 added as test dep; org.json (managed by base-pom) added
as compile dep.
The realm admin now configures the federation provider with a F+ auth
URL via the Keycloak admin UI:
Federation -> Add provider -> factoryplus
Factory+ auth URL: http://acs-auth.../...
Request timeout (seconds): 5
The factory's create() reads the ComponentModel config:
* If auth.url is set, builds an FPAuthBackedUserStore pointing at it
with the configured timeout
* Otherwise (or if blank) returns NullFactoryPlusUserStore.INSTANCE
so the federation loads cleanly without F+ dependency
The two config keys (auth.url, auth.timeout.seconds) are advertised to
Keycloak via getConfigProperties() so they appear as form fields in the
admin UI. Phase 4 adds auth.principal and auth.keytab.path for SPNEGO.
Test coverage (4 new, 31 unit tests total green):
* No auth.url -> NullFactoryPlusUserStore
* auth.url set -> FPAuthBackedUserStore
* auth.url blank -> NullFactoryPlusUserStore (defensive)
* getConfigProperties() exposes auth.url + auth.timeout.seconds
Provider gained a package-private getStore() accessor purely for these
factory tests; production code never calls it.
Adds a third IT case: stand up Wiremock as an in-process sibling (exposed inside the Keycloak container via host.testcontainers.internal), configure a 'factoryplus' federation pointing at it via the ComponentRepresentation config (auth.url + auth.timeout.seconds), and assert (a) Keycloak accepts the configuration with HTTP 201 and (b) the config round-trips back through Keycloak's component store. Originally tried to assert end-to-end via realm.users().searchByUsername. That requires a UserQueryProvider implementation; my first attempt at adding one made admin-client REST searches return HTTP 400 (root cause not pinned down without further digging into Keycloak internals; likely a quirk of how Keycloak 26.1.1 dispatches searches when a federation provider implements UserQueryProvider, vs how admin-client 26.0.5 calls the search endpoint). Pragmatic call: defer UserQueryProvider to Phase 5, which is where it naturally lives anyway (alongside group lookup + caching). The FPAuthBackedUserStore's behaviour is already fully covered by the 10 Wiremock unit tests in FPAuthBackedUserStoreTest. Phase 4 (real F+ integration) and Phase 5 will add the end-to-end coverage. Also switched FPAuthBackedUserStore from org.json to Jackson - Keycloak doesn't ship org.json, so the SPI got a NoClassDefFoundError at runtime. Jackson is part of Keycloak's classpath (transitively via keycloak-core) so we mark it 'provided' and don't bundle it. Tests: 31 unit + 3 IT = 34 green. Test count up from 19 in Phase 1.
The Phase 2 Wiremock contract assumed a /v2/principal?name=... +
?email=... endpoint shape that doesn't exist in F+ today and would have
required schema/API extensions in acs-auth. While planning Phase 3 I
realised the existing F+ identity API is already sufficient if we adapt
the SPI to its shape:
GET /v2/principal/{uuid}
200 -> { uuid, kerberos: 'alice@FACTORYPLUS.LOCAL' }
410 -> principal does not exist (F+ uses 410, not 404)
GET /v2/identity/kerberos/{upn}
200 -> '"<uuid>"' (JSON string, not object)
410 -> identity does not exist
So Phase 3 ships zero acs-auth changes; the work is entirely in the SPI:
* findByUuid: 1 call to /v2/principal/{uuid}, parse {uuid, kerberos}
* findByUsername: 2 calls (resolve UPN -> UUID, then UUID -> principal).
Login is rare; Phase 5 caching will fold both into a single warm path.
* findByEmail: returns Optional.empty without making any HTTP call.
F+ has no email field. Reserved for a possible future schema bump.
Username = full Kerberos UPN (e.g. 'alice@FACTORYPLUS.LOCAL'), matching
the convention Keycloak's existing kerberos federation uses. Status 410
treated as not-found (404 also accepted in case a proxy rewrites it).
'Principal exists but has no kerberos identity' is treated as not-found
rather than NPE-able.
Updated Wiremock unit tests (12 in this class, 33 total unit) to mirror
the real F+ shape, including:
* URL encoding for UPNs containing / and @
* The two-call race where principal disappears between identity and
principal lookups
* Defensive 'identity endpoint returns object instead of string' check
* findByEmail makes zero HTTP calls (asserted via wiremock.getAllServeEvents)
Plan doc updated to record the contract revision. Group/membership
endpoints originally pencilled into Phase 3 are now part of Phase 5
alongside GroupLookupProvider.
Tests: 33 unit + 3 IT = 36 green.
Replaces the stock keycloak.v2 login pages with bespoke FreeMarker templates and CSS that mirror acs-admin's shadcn-vue look (slate borders, rounded-lg, sm shadow, slate-900 button) and swap the "FACTORY_PLUS" wordmark for the ACS "+" mark plus an "ACS" wordmark. The theme ships in the acs-keycloak image; service-setup activates it by setting loginTheme=factory-plus on the realm and exposes openid.loginTheme in values.yaml so deployments can opt out by setting it empty. Covers login, logout-confirm, error and info pages; account console keeps the upstream theme since it isn't a user-facing surface in ACS.
Keycloak doesn't always populate `locale` on the login template (e.g.
fresh auth requests before the language cookie is set), so
`${locale.currentLanguageTag!'en'}` exploded with InvalidReferenceException
because the `!` default only covers the last step of the chain. Wrap the
whole expression in parens so the fallback also fires when `locale`
itself is missing, not just `currentLanguageTag`.
Surfaces as HTTP 500 -> our error.ftl rendering "We are sorry... An
internal server error has occurred" the moment the realm flips to
loginTheme=factory-plus.
* Render Back to Application arrow as a styled <span>; override backToApplication message to drop upstream's "«" entity, which FreeMarker's auto-escape was rendering as the literal text "«". * Custom ease-out cubic-bezier curves replace `ease`/`ease-out` keywords for snappier UI feel. * Buttons get scale(0.97) on :active for tactile press feedback. * Card has a subtle hover ring (matches acs-admin LoginDialog) and a scale(0.97) → scale(1) entrance via @starting-style. * Card width 24rem -> 26rem and padding 1.5rem -> 1.75rem so the form has more presence on the otherwise empty page. * Hover/active states gated to (hover: hover) and (pointer: fine), and reduced-motion drops the entrance + press scale.
F+ identities are username+Kerberos. Keycloak's loginWithEmailAllowed defaulted true caused the login form to render "Username or email", inviting users to try their personal email; the F+ federation provider has no email field to query so any such attempt fails with bad creds. Force the field to "Username".
The workflow ran mvn verify against the SPI but never built the in-tree java-service-client dep first; it broke the moment that dep was added in the Phase 2.1 commit. The SPI is still built and tested as part of the acs-keycloak Docker image build on any image rebuild, so we don't need a standalone CI workflow on top.
The custom Keycloak image (acs-keycloak/Dockerfile) bakes the F+ User Storage SPI into Keycloak's providers/ directory, and the Helm chart references it via openid.image.repository=acs-keycloak. Without a CI job producing the image, tagged releases would fail to pull it on install. Adds a dedicated build-keycloak job rather than slotting acs-keycloak into build-x86-js: the Dockerfile is a standalone multi-stage build (Maven + Keycloak base) with no dependency on acs-base-js, and it needs a second build context (spi=./acs-keycloak-spi) that the JS matrix doesn't pass. Wired into build-helm-chart's needs list so the helm package job waits for the image to publish.
Previously the build relied on lib/mvn being pre-populated by `make -C lib` before `docker buildx build` ran. That works on a workstation but breaks in CI on a clean checkout, because lib/mvn is gitignored and never gets created. The build-keycloak job's `COPY --from=lib mvn /build/lib/mvn` blew up with "/mvn: not found". Make the Dockerfile self-contained: copy java-base-pom and java-service-client source via the lib build context and mvn-install each into Maven's local cache before building the SPI. The cache mount keeps incremental builds fast.
Includes comprehensive instructions for setup, usage, auto-closeout, and PR creation. Documents critical rules, review process, and teardown commands to ensure consistent workflows and prevent production issues.
The dep was declared as 0.1-SNAPSHOT but every in-tree publication of lib/java-service-client (its parent's <version>) is 0.0.0-intree, so a fresh-checkout build fails to resolve. Past local builds only worked because someone had 0.1-SNAPSHOT cached in ~/.m2/repository from an earlier flow; CI's clean cache surfaced the mismatch. The dep isn't imported by the SPI source yet (Phase 2 uses bare java.net.http); we keep the declaration so Phase 4's FPGssClientKeytab wiring is one line away.
Commit 55659de pinned the import to `isomorphic-git/http/node/index.cjs` to work around an ESM resolution failure after the dep was bumped in 861ed71. That workaround is no longer valid: current isomorphic-git publishes a strict `exports` map that exposes only `./http/node` (auto-resolves to index.js for ESM consumers, index.cjs for require), not the deep `/index.cjs` path. The bumped image fails on import: Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath './http/node/index.cjs' is not defined by "exports" in /home/node/app/node_modules/isomorphic-git/package.json Revert to the bare `./http/node` subpath the package actually exports.
The seed Secret was templating `admin-password` to the empty string, which made Grafana fall back to its built-in default of "admin" on first start. Combined with the earlier mistake of using `admin@<realm>` as `admin-user`, this is what produced the "user already exists" failure when the F+ admin signed in via OAuth: Grafana already had a local user owning that login. The seed admin is meant to be a break-glass credential and nothing more - the F+ admin principal gets Grafana Admin via the fp_permissions claim mapping, not via the local account. Stop hardcoding any value derived from the realm or from Grafana's default: generate a random 32-char password on first install, preserved across helm upgrades by the existing resource-policy: keep and `if not lookup` guard. Existing installs that hit the collision need a one-time manual cleanup: Server Admin -> Users -> rename the realm-UPN login to something safe.
* Worked example: JupyterHub via Z2JH. End-to-end recipe for an OIDC consumer installed alongside ACS (not shipped with it) and gated by fp_permissions, with no ACS code touching the consumer. * Correct the "60s lag" claim - we hit a second cache layer in Keycloak's user attribute cache that requires `kubectl rollout restart deploy/openid` to flush. Note this and point at the proper fix (cachePolicy: NO_CACHE on the federation component). * Add a troubleshooting section walking through how to actually inspect what's in a user's JWT: MCP-based ACL verification, temporarily enabling direct access grants, password-grant + decode, flushing the user cache, then turning direct grants back off.
…book Reads as a complete recreate-from-zero recipe a year later: every values block in full, every UI click spelled out, the gotchas inline (IngressClassName=acs-traefik vs upstream Z2JH default of traefik; auth_state_groups_key vs deprecated claim_groups_key; manage_groups required alongside admin_groups), the cache-flush step for grant changes, and a "what this exercises" map back to the F+ machinery so the reader knows which pieces generalise to other OIDC consumers.
Keycloak's UserStorageProvider has a per-user attribute cache that defaults to "DEFAULT" (cache forever, invalidated only on restart). Combined with our SPI's permission lookups, this meant an F+ ACL grant added after a user had been loaded into Keycloak's cache stayed invisible in `fp_permissions` until the openid pod was restarted - a surprising operator footgun. Set the component's `cachePolicy` to NO_CACHE so each token issuance re-reads attributes via the SPI. The SPI's own `cache.ttl.seconds` (60) still shields the F+ Auth service from per-request load. Update docs/services/oauth-clients.md to drop the now-obsolete "restart openid after granting" workaround and reflect the simpler operator story.
Keycloak's AbstractUserAdapter exposes username/email through getUsername()/getEmail() but does NOT route them through the per-name attribute API (getAttributeStream / getFirstAttribute). The standard oidc-usermodel-attribute-mapper - which Keycloak attaches to the realm's built-in `profile` and `email` client scopes for the `preferred_username` and `email` claims respectively - reads from the per-name attribute API. Our adapter's per-name overrides returned `Stream.empty()` / `null` for everything but the F+ attributes, masking username/email at this layer. The mistake was the same one the author already encountered for getAttributes() (see the existing comment at line 97): mistakenly returning "from scratch" instead of delegating to super for the inherited fields. The default user-cache wrapper hid this from us because Keycloak's CachedUserModel snapshots `getAttributes()` (which we *do* delegate to super correctly) and serves cached lookups from there. The bug surfaces the moment cachePolicy is anything other than DEFAULT - in our case, the recent NO_CACHE switch made tokens come out with `preferred_username: null` and `email: null`, which broke Grafana's OAuth flow (Grafana falls back to a GitHub-shape /userinfo/emails endpoint when it can't extract an email, and 500s). Fix: route name=USERNAME and name=EMAIL through getUsername()/ getEmail() in both getAttributeStream and getFirstAttribute, and delegate everything else to super. Tests added for the username, email and FP attribute paths.
The OAuth work introduced four LocalSecret resources to populate the
keycloak-clients Secret with random passwords. That works on a fresh
install of v5.2.x (where the chart ships the LocalSecret CRD via the
crds/ directory), but breaks helm upgrade from 5.1.x: helm 3 only
applies CRDs in crds/ on first install, never on upgrade. Any cluster
running 5.1.0 without an edge cluster deployed (and therefore no
LocalSecret CRD applied by the edge-cluster subchart) fails the
upgrade with "no matches for kind LocalSecret" the moment the chart
tries to render the new resources.
Replace the four LocalSecrets with a single native K8s Secret template
using lookup + randAlphaNum to generate values on first install and
preserve them across upgrades. The pattern mirrors grafana-admin-user.
- deploy/templates/openid/keycloak-clients.yaml replaces
openid/local-secrets.yaml
- deploy/crds/local-secret-crd.yaml is removed; the edge-cluster
subchart still ships its own copy for driver/OPC-server passwords
- krb-keys-operator RBAC drops the localsecrets verbs since the
central operator no longer reconciles any
Same Secret name and key structure, so the keycloak StatefulSet env,
service-setup volume mount, and openid pod consumers stay unchanged.
Follow-up: drop LocalSecret CRD dependency on the central chartHit a fresh bug running the 5.1.0 to OAuth migration test on a freshly-nuked AGO cluster. Root causeThe OAuth templates introduced four That works on a fresh So on the upgrade path from 5.1.x (which never created any LocalSecret resources on the central chart, so never had the CRD applied) to the OAuth release, the API server has no The reason this never bit anyone before: the edge-cluster subchart ships the LocalSecret CRD in its own Solutions considered
Went with option 3. The LocalSecret pattern stays appropriate for edge agent / OPC server / driver passwords on edge clusters, where the operator's password rotation and Sealed-Secret integration matter. On the central cluster it was buying us nothing that a native Secret with What this commit does
Downstream consumers - Keycloak StatefulSet env ( Behaviour matrix
The first three cases are zero-intervention. Edge clusters unaffectedLocalSecret usage in |
docker/login-action has no built-in retry (upstream feature request docker/login-action#750 is still open) and ghcr.io's token endpoint flakes often enough that a single timeout fails an otherwise valid release build. Replace the action with a bash retry loop using docker login --password-stdin in both places we authenticate: - .github/actions/prepare/action.yml (composite action used by every per-service build) - .github/workflows/publish.yml (Helm chart publish step) Five attempts with linear backoff (5s, 10s, 15s, 20s, 25s) covering up to ~75s of transient unavailability before failing the job. --password-stdin keeps the token out of ps and the shell history.
Keycloak's OIDCAttributeMapperHelper omits the fp_permissions claim entirely when a user has zero F+ permission grants - even with multivalued=true, an empty Collection collapses to null and the helper skips emission. The previous JMESPath called contains(fp_permissions[*], ...) which gets null as its subject when the claim is absent. With go-jmespath (what Grafana uses) this leaks a single space through the || chain instead of falling through to 'Viewer', and role_attribute_strict=true then rejects the login with "Integration requires a valid org role assigned in idP. Assigned role: ' '". Wrap each occurrence of fp_permissions in not_null(..., `[]`) so the subject of contains() is guaranteed to be an array. Users with no grants now resolve to 'Viewer' as intended.
Grafana's OAuth user-lookup refuses to claim a pre-existing user by email match when email_verified=false in the OIDC claims, unless oauth_allow_insecure_email_lookup is set in grafana.ini. On the AuthProxy -> OAuth migration path that surfaces as "Login Failed: user already exists" because Grafana then falls into the create-new path and collides with the existing DB row. The email we issue is either the F+ Auth email or the kerberos UPN, both derived from the same trusted F+ Principal record the SPI already accepted, so emails are kerberos-authoritative by construction. Override isEmailVerified() to return true so OIDC consumers can treat the email as a safe matching key without needing the "insecure" flag flipped on every OAuth client downstream.
Two more issues found and fixed while running the AuthProxy to OAuth migration testHit both of these in the same upgrade test (5.1.0 install with AuthProxy-created Grafana users, then upgrade to a dev release and try OAuth sign-in as those existing users). Each blocks the migration entirely on its own. 1. JMESPath leaks
|
Grafana's OAuth client (pkg/services/authn/clients/oauth.go) only populates lookupParams.Email when oauth_allow_insecure_email_lookup is true - and crucially lookupParams.Login is never populated at all on the OAuth path regardless of this flag. With the flag at its default of false, lookupByOneOf sees both Email and Login as nil, returns ErrUserNotFound, Grafana drops into the createUser path, and the create collides with the existing row's email/login uniqueness check producing the unhelpful "user already exists" error. This breaks the AuthProxy -> OAuth migration: every pre-existing Grafana user (admin and the test users) hits the "user already exists" path and OAuth sign-in fails despite the SPI emitting email_verified=true and the kerberos UPNs matching the existing login columns byte-for-byte. The "insecure" naming refers to OAuth providers that don't verify the email they emit; ours is our own Keycloak federating F+ Auth, and FactoryPlusUserAdapter.isEmailVerified() (28f6dd6) already stamps email_verified=true on the OIDC claims. The email is kerberos-authoritative by construction. Enabling the flag here is the only knob that makes the lookup actually happen.
Correction + final fix:
|
| Commit | Fix |
|---|---|
7c760887 |
Drop LocalSecret dependency for keycloak-clients Secret (CRD-upgrade trap) |
b6e60bfa |
Retry ghcr.io docker login on transient flakes |
7b5f4bc7 |
Wrap fp_permissions in not_null(..., \[]`)` so role JMESPath falls through to Viewer when claim absent |
28f6dd6e |
SPI stamps email_verified=true on federated users |
3983525f |
Set oauth_allow_insecure_email_lookup: true — the actual migration unblock |
The merged effect: 5.1.x clusters upgrading through this branch get OAuth that handles both never-had-an-edge-cluster CRD state, transient registry flakes during release builds, no-F+-permission OAuth users, and the AuthProxy-to-OAuth user-id migration — all with no manual operator intervention beyond helm upgrade.
Login form ergonomics: a local user who types just "alice" should not have to know or type the cluster's kerberos realm to sign in. Cross-realm users who type their full UPN should keep working unchanged. Add a default.realm config property on the SPI user-storage provider. FPAuthBackedUserStore.canonicaliseUpn appends "@<defaultRealm>" (uppercased defensively) when the input has no @, otherwise passes the input through verbatim. service-setup wires the property to the chart's identity.realm value when it provisions the federation, so the affordance is on by default for every deployment. Behaviour matrix: alice + realm=FP.LOCAL -> alice@FP.LOCAL alice@OTHER.REALM + realm=FP.LOCAL -> alice@OTHER.REALM alice@fp.local + realm=FP.LOCAL -> alice@FP.LOCAL (realm uppercased, existing behaviour) alice + realm="" -> alice (legacy pass-through) alice + realm unset -> alice (legacy pass-through) Five new wiremock-driven tests in FPAuthBackedUserStoreTest cover each row of that matrix. SPI test count 74 -> 79, all passing.
Previous canonicaliseUpn() uppercased the realm part of every @-suffixed input. Uppercase Kerberos realms are convention but not universal - some clusters legitimately run with mixed- or lowercase realm names, and force-casing the input silently broke identity lookups against F+ entries that didn't match the forced casing. Drop the uppercasing on both paths: respect the realm in typed UPNs verbatim, and append the default.realm configured value exactly as supplied (without case-folding it on the way in either). The configured default realm now uses whatever case the chart's identity.realm value carries, which is what the operator intended in the first place. Renamed canonicaliseUpn to applyDefaultRealm since it no longer canonicalises anything other than appending the realm for short names. Updated short_name_default_realm_is_uppercased to assert verbatim usage, added typed_upn_realm_passes_through_verbatim as a regression test. Tests 79 -> 80, all passing.
Summary
Adds OAuth/OIDC sign-in to ACS via a self-hosted Keycloak that federates against the F+ Auth service. Grafana now logs in through OIDC instead of being seeded with admin credentials, and Grafana roles are driven directly from F+ ACL grants. The whole stack is provisioned automatically by
service-setuponhelm upgrade.What's in the box
Keycloak deployment
acs-keycloakDocker image (Keycloak 26.1.1 + custom F+ User Storage SPI baked into/opt/keycloak/providers/).openidDeployment, Service, Ingress and Postgres-backed JAAS config in the chart (deploy/templates/openid/*).service-setup(acs-service-setup/lib/openid.js).F+ User Storage SPI (
acs-keycloak-spi/)FactoryPlusUserStorageProviderlooks users up againstacs-authinstead of Keycloak's local user table; passwords validate against Kerberos.auth.url,auth.principal,auth.keytab.path, cache TTL).fp_principal_uuid- the F+ principal UUID for the signed-in user.fp_permissions- the user's F+ permission UUIDs (multi-valued, JMESPath-friendly).Grafana wiring
role_attribute_pathmatchesfp_permissionsagainst ACS-defined Grafana role groups (Admin/Editor) seeded byservice-setup.role_attribute_strictenforced - revoking the F+ permission demotes the user on the next login. Grafana role capped at Admin (no GrafanaAdmin escalation path through ACL).ACS-branded login theme
factory-plusKeycloak theme (acs-keycloak/themes/factory-plus/) replacing the stockkeycloak.v2look with shadcn-vue styling that matchesacs-admin.messages_en.properties.realm.loginTheme = factory-plusfromservice-setup;openid.loginTheme: ""invalues.yamlreverts to stock.Supporting changes
service-setupallows per-client URL overrides (rootUrl, redirect, post-logout) and asks Keycloak to push backchannel logout to OIDC clients.LocalSecretCRD and RBAC updates so a fresh install can bootstrap Keycloak client secrets.krb5.confgetsrdns = falseto avoid libkrb5 reverse-DNS hangs in the openid pod.makeis portable on macOS BSD/GNU make, parallel subdir builds viaSUBDIRS_PARALLEL, per-subdir log files with a summary.acs-mqtt: portable maven-output filename expansion (BSD vs GNU).acs-edge:skipLibCheckso the top-level build doesn't trip on transitive type errors.Docs
docs/services/oauth-clients.md- operator guide for adding OIDC clients.docs/plans/2026-05-07-keycloak-fp-spi-plan.md- the staged plan this branch implemented (kept for archival).docs/plans/2026-05-07-keycloak-fp-user-storage-spi.md- SPI architecture notes.Migration
For existing deployments,
helm upgradeis sufficient. The first run will:factory_plusrealm, thefactoryplusfederation provider, and the configured OIDC clients (grafanaby default).openidandmqttto pick up regenerated config.To opt out of the branded login theme set
openid.loginTheme: "". To run upstream Keycloak without the SPI overrideopenid.image.repository/registry- but the F+ federation provider won't load.Test plan
helm upgradesucceeds against an existing cluster; no service-setup errors.factory_plusexists withloginTheme=factory-plus, federation providerfactoryplus, and thegrafanaclient.+mark +ACSwordmark, slate-900 button, no«artifact).fp_principal_uuidand a multi-valuedfp_permissionsclaim.mvn -pl acs-keycloak-spi verify).fplus-edge. Validate that password for_bootstrapis the same before and after inkeycloak-clients.