OAuth Support #646

Merged

AlexGodbehere merged 78 commits into main from dn/oAuth on May 13, 2026

Conversation

@AlexGodbehere (Contributor) commented May 8, 2026

Summary

Adds OAuth/OIDC sign-in to ACS via a self-hosted Keycloak that federates against the F+ Auth service. Grafana now logs in through OIDC instead of being seeded with admin credentials, and Grafana roles are driven directly from F+ ACL grants. The whole stack is provisioned automatically by service-setup on helm upgrade.

What's in the box

Keycloak deployment

  • New acs-keycloak Docker image (Keycloak 26.1.1 + custom F+ User Storage SPI baked into /opt/keycloak/providers/).
  • New openid Deployment, Service, Ingress and Postgres-backed JAAS config in the chart (deploy/templates/openid/*).
  • Realm, federation provider, claim mappers and OIDC clients all created idempotently by service-setup (acs-service-setup/lib/openid.js).

F+ User Storage SPI (acs-keycloak-spi/)

  • FactoryPlusUserStorageProvider looks users up against acs-auth instead of Keycloak's local user table; passwords validate against Kerberos.
  • Caching store with TTL so login flows aren't HTTP-bound.
  • Per-realm component config (auth.url, auth.principal, auth.keytab.path, cache TTL).
  • SPNEGO outbound auth for the SPI's own F+ HTTP calls.
  • Two protocol mappers stamp every issued token:
    • fp_principal_uuid - the F+ principal UUID for the signed-in user.
    • fp_permissions - the user's F+ permission UUIDs (multi-valued, JMESPath-friendly).
  • Comprehensive Wiremock + Testcontainers integration tests.

Grafana wiring

  • Generic OAuth flow against the new realm; admin seed account is renamed so it doesn't collide with OIDC sign-in.
  • role_attribute_path matches fp_permissions against ACS-defined Grafana role groups (Admin/Editor) seeded by service-setup.
  • role_attribute_strict enforced - revoking the F+ permission demotes the user on the next login. Grafana role capped at Admin (no GrafanaAdmin escalation path through ACL).

ACS-branded login theme

  • New factory-plus Keycloak theme (acs-keycloak/themes/factory-plus/) replacing the stock keycloak.v2 look with shadcn-vue styling that matches acs-admin.
  • Hand-rolled CSS, FreeMarker templates for login / logout-confirm / error / info pages, and a scoped messages_en.properties.
  • Activated via realm.loginTheme = factory-plus from service-setup; openid.loginTheme: "" in values.yaml reverts to stock.

Supporting changes

  • service-setup allows per-client URL overrides (rootUrl, redirect, post-logout) and asks Keycloak to push backchannel logout to OIDC clients.
  • LocalSecret CRD and RBAC updates so a fresh install can bootstrap Keycloak client secrets.
  • krb5.conf gets rdns = false to avoid libkrb5 reverse-DNS hangs in the openid pod.
  • Build infrastructure: the top-level make now works under both BSD and GNU make (so macOS works out of the box), subdirectories build in parallel via SUBDIRS_PARALLEL, and each subdirectory gets its own log file plus a summary.
  • acs-mqtt: portable maven-output filename expansion (BSD vs GNU).
  • acs-edge: skipLibCheck so the top-level build doesn't trip on transitive type errors.

Docs

  • docs/services/oauth-clients.md - operator guide for adding OIDC clients.
  • docs/plans/2026-05-07-keycloak-fp-spi-plan.md - the staged plan this branch implemented (kept for archival).
  • docs/plans/2026-05-07-keycloak-fp-user-storage-spi.md - SPI architecture notes.

Migration

For existing deployments, helm upgrade is sufficient. The first run will:

  1. Provision the factory_plus realm, the factoryplus federation provider, and the configured OIDC clients (grafana by default).
  2. Mount the SPI keytab into the openid pod and seed Grafana role groups in F+.
  3. Restart openid and mqtt to pick up regenerated config.

To opt out of the branded login theme, set openid.loginTheme: "". To run upstream Keycloak without the SPI, override openid.image.repository/registry; note that the F+ federation provider won't load in that case.
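
For concreteness, a values override covering both opt-outs might look like the sketch below. The key paths (openid.loginTheme and openid.image.registry/repository) are the ones named above; the image values are illustrative only and should be checked against the chart's values.yaml.

# Hedged sketch: key paths as described above, image values illustrative.
openid:
  loginTheme: ""            # revert to the stock Keycloak login theme
  image:
    registry: quay.io       # run upstream Keycloak instead of acs-keycloak...
    repository: keycloak/keycloak
    # ...but the F+ federation provider won't load without the bundled SPI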

Test plan

  • helm upgrade succeeds against an existing cluster; no service-setup errors.
  • Realm factory_plus exists with loginTheme=factory-plus, federation provider factoryplus, and the grafana client.
  • Sign in to Grafana via OIDC as a user in the F+ Admin role group; user lands as Grafana Admin.
  • Revoke the F+ Admin grant; on next Grafana login the user is demoted (role_attribute_strict).
  • Sign-in page renders with the ACS branding (+ mark + ACS wordmark, slate-900 button, no « artifact).
  • Logout flow renders the branded logout-confirm page; "Back to Application" link returns to the originating client.
  • Token introspection on a Grafana access token shows fp_principal_uuid and a multi-valued fp_permissions claim.
  • SPI Testcontainers integration tests pass (mvn -pl acs-keycloak-spi verify).
  • Helm upgrade (same version) or a restart of the kerberos-keys operators, both centrally and on fplus-edge; validate that the _bootstrap password in keycloak-clients is the same before and after.
  • Follow the docs guide to set up a third-party OAuth application.

@AlexGodbehere (Contributor, Author) commented:

End-to-end validation: third-party OIDC consumer (JupyterHub)

To validate that this PR isn't just "Grafana with extra steps", we ran the entire flow with JupyterHub as the OIDC client: installed as a separate Helm chart, not packaged with ACS, with no code in this repo aware that it exists.

What we did on the ACS side

  1. One values entry, no code change. Added a jupyterhub block to serviceSetup.config.openidClients in the existing ACS values (a sketch of such an entry follows this list). helm upgrade and that was it: service-setup created the Keycloak client, attached the fp_principal_uuid and fp_permissions claim mappers, and dropped the auto-generated client secret into keycloak-clients/jupyterhub. The generic openidClients extension point did everything; we didn't touch service-setup logic.

  2. Created two F+ Permission objects via the ACS Admin UI. Navigated to ConfigDB -> Create Object, selected Permission (Rank 1 class), created JupyterHub admin and JupyterHub user with chosen UUIDs. These live in F+ as ordinary Permission objects - JupyterHub-specific only by name.

  3. Granted the admin permission to admin@<realm> via Access Control -> Principals, with target = Wildcard. Verified the grant via the ACS MCP server (find_grants, check_acl).
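
As a rough illustration of step 1, the values entry might look something like the sketch below. The openidClients block and the per-client URL overrides (rootUrl, redirect, post-logout) are described elsewhere in this PR, but the exact field names here are guesses and should be checked against acs-service-setup's config schema.

# Hedged sketch only: field names illustrative, not the authoritative schema.
serviceSetup:
  config:
    openidClients:
      jupyterhub:
        rootUrl: https://jupyterhub.example.org
        redirectUris:
          - https://jupyterhub.example.org/hub/oauth_callback
        postLogoutRedirectUris:
          - https://jupyterhub.example.org/
# service-setup then creates the Keycloak client, attaches the fp_* claim mappers,
# and writes the generated client secret into keycloak-clients/jupyterhub.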

What we did on the JupyterHub side

Standard Zero-to-JupyterHub install with GenericOAuthenticator, pointing the auth/token/userinfo/logout URLs at our Keycloak. The two ACS-aware bits in JupyterHub's config:

auth_state_groups_key: oauth_user.fp_permissions
admin_groups:
  - "<jupyterhub-admin permission UUID>"
allowed_groups:
  - "<jupyterhub-admin permission UUID>"
  - "<jupyterhub-user permission UUID>"

auth_state_groups_key tells the OAuthenticator (>=17) to look in the post-OAuth auth_state dict for the groups list at the dotted path oauth_user.fp_permissions - i.e. the array of UUIDs that this PR's acs-keycloak-spi FactoryPlusPermissionsMapper stamps into every issued token. admin_groups and allowed_groups are just lists of UUIDs JH compares against that array. Membership of the admin UUID -> hub admin; membership of the user UUID -> regular access; neither -> 403.

That's the entire integration. No glue code, no shim. The application reads UUIDs out of a JWT claim and gates on UUIDs it was told about.
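
For completeness, the full Z2JH hub config (the URL wiring mentioned above plus the group gating already shown) would look roughly like the sketch below. Hostnames and the client secret are placeholders, the option names should be checked against the oauthenticator release in use (>=17), and the OIDC endpoints assume Keycloak's standard /realms/<realm>/protocol/openid-connect/* paths; only the group-gating keys are taken directly from this PR.

# Hedged sketch of the Z2JH values for the factory_plus realm.
hub:
  config:
    JupyterHub:
      authenticator_class: generic-oauth
    GenericOAuthenticator:
      client_id: jupyterhub
      client_secret: "<from the keycloak-clients/jupyterhub Secret>"
      oauth_callback_url: https://jupyterhub.example.org/hub/oauth_callback
      authorize_url: https://keycloak.example.org/realms/factory_plus/protocol/openid-connect/auth
      token_url: https://keycloak.example.org/realms/factory_plus/protocol/openid-connect/token
      userdata_url: https://keycloak.example.org/realms/factory_plus/protocol/openid-connect/userinfo
      username_claim: preferred_username
      scope: [openid, profile]
      enable_auth_state: true   # auth state must be enabled for auth_state_groups_key to see the token
      manage_groups: true       # required alongside admin_groups (see the runbook commit)
      auth_state_groups_key: oauth_user.fp_permissions
      admin_groups:
        - "<jupyterhub-admin permission UUID>"
      allowed_groups:
        - "<jupyterhub-admin permission UUID>"
        - "<jupyterhub-user permission UUID>"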

Verified end-to-end

  • Sign in to http://jupyterhub.<acs.baseUrl> -> redirected to ACS-branded Keycloak login.
  • Authenticate as admin@<realm> (the F+ principal, password validated via Kerberos through the F+ User Storage SPI in acs-keycloak-spi).
  • Token issued with fp_permissions containing the JupyterHub admin UUID we created.
  • JupyterHub honours admin_groups, lands the user in JupyterLab as a hub admin.
  • Confirmed permission changes propagate: removed and re-granted the permission, verified the change via the userinfo endpoint, then hit JupyterHub again.

One sharp edge we found and documented

Keycloak's user attribute cache memoises fp_permissions per-user-load, on top of the SPI's own 60s lookup cache. Granting a new permission in F+ does not show up in tokens until Keycloak's cache is invalidated. Today the fix is kubectl rollout restart deploy/openid. Better fix queued in the docs: set cachePolicy: NO_CACHE on the F+ federation component so attribute changes propagate without a restart.

Docs added in this PR

docs/services/oauth-clients.md now contains:

  • "Worked example: JupyterHub" - the runbook above, complete with values, the ACS Admin UI clickthrough for creating Permissions and grants, and the Z2JH values.
  • Troubleshooting -> "Inspecting what a user's JWT actually contains" - the technique we used to debug, including how to (temporarily) enable direct access grants on a client, password-grant a token, and decode it. Plus the cache-flush step for when grants don't propagate.
  • Constraints and gotchas updated to note the two-layer cache (the previous "60s lag" claim under-described it).

This makes the PR's value proposition concrete: operators can connect any OIDC application to ACS by adding one values entry and creating F+ permission objects in the UI - no ACS internals touched. JupyterHub is the proof; Grafana is the in-tree reference.

@AlexGodbehere (Contributor, Author) commented:

Follow-up: dropped the cache-invalidation footgun

The JupyterHub validation surfaced a sharp edge: F+ permission grants didn't propagate to consumers' tokens without a kubectl rollout restart deploy/openid. Two cache layers were responsible:

  1. The SPI's own 60s TTL on F+ lookups (intentional, configurable, fine).
  2. Keycloak's per-user attribute cache on the UserStorageProvider component, which defaults to DEFAULT (cache forever, invalidated only on restart). Once a user had been loaded into Keycloak's cache, attribute reads served the snapshot taken at first load - so a new ACL grant in F+ stayed invisible in fp_permissions indefinitely.

Fix (commit 367197d): acs-service-setup/lib/openid.js now sets cachePolicy: NO_CACHE on the factoryplus federation component when it provisions or updates it. Each token issuance re-reads attributes via the SPI; the SPI's cache.ttl.seconds=60 still shields F+ Auth from per-request load. After this lands:

  • Grant a permission in ACS Admin -> wait up to 60s for the SPI cache -> user signs in again -> new permission is in their fp_permissions claim.
  • No openid restart needed for routine grant changes.
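
In Keycloak terms, the federation component that service-setup provisions now carries configuration along these lines. This is a sketch of the relevant keys only, serialised as YAML for readability; the admin API takes a JSON ComponentRepresentation, and component config values are lists of strings.

# Sketch of the relevant component config keys, not the full representation.
name: factoryplus
providerId: factoryplus
providerType: org.keycloak.storage.UserStorageProvider
config:
  cachePolicy: ["NO_CACHE"]        # was DEFAULT: cache until the pod restarts
  auth.url: ["http://acs-auth..."] # F+ Auth base URL, as configured by service-setup
  cache.ttl.seconds: ["60"]        # the SPI's own lookup cache still applies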

Docs updated in the same commit. docs/services/oauth-clients.md:

  • Constraints and gotchas now describes a single 60s lag, not two cache layers.
  • The JupyterHub example's "If you grant or change a permission later" section drops the openid-restart step and just says wait + re-login. Restart is mentioned only as a debugging shortcut to bypass the SPI's 60s cache.

End-state of the operator story is now: one values entry adds the OIDC client, F+ permission objects live in ConfigDB, ACL grants in Access Control, and changes propagate within a minute - all via the supported openidClients extension point with no ACS code changes.

@AlexGodbehere (Contributor, Author) commented:

Follow-up: SPI bug that NO_CACHE exposed (commit 5de7c16b)

The cachePolicy: NO_CACHE change in the previous commit revealed a pre-existing bug in FactoryPlusUserAdapter. Symptom on the test cluster (fpd-fry, fresh install of 5.2.0-ago.14): Grafana sign-in fails with HTTP 500 -> Error getting email address -> /userinfo/emails 404 (the GitHub-shape /emails fallback Grafana does when it can't extract an email from the userinfo response).

What was actually wrong

Keycloak exposes user properties two ways and the SPI was handling them inconsistently:

  • getAttributes() (bulk) - already delegated to super (the author fixed this previously, see the comment in the file).
  • getAttributeStream(name) / getFirstAttribute(name) (per-name) - returned Stream.empty() / null for anything but the F+ attributes, masking username and email.

The standard oidc-usermodel-attribute-mapper Keycloak attaches to the realm's profile / email client scopes - the ones that produce the preferred_username and email claims - reads via the per-name path. With both null, tokens came out with preferred_username: null and email: null:

// fry, ago.14, NO_CACHE on the federation:
{
  "sub": "f:...:...",
  "fp_principal_uuid": "...",
  "fp_permissions": [...],
  "email_verified": false
  // no preferred_username, no email
}

Why we hadn't noticed

Keycloak's UserStorageProvider user cache (cachePolicy: DEFAULT) snapshots a user's attributes via getAttributes() on first load and serves all subsequent lookups from that snapshot. The bulk path was correct, so the snapshot contained username and email. Every per-name lookup hit the cache instead of our adapter, and the bug stayed hidden.

The previous commit set cachePolicy: NO_CACHE on the federation provider so ACL changes propagate without an openid restart. That switch sent the per-name lookups straight to the buggy SPI methods, which is what fry was hitting.

ago still has cache=DEFAULT so it shows the cached behaviour (preferred_username present, looking healthy) - but it has the same latent bug; flipping its cache off would have produced the same Grafana failure.

The fix

FactoryPlusUserAdapter.getAttributeStream(name) / getFirstAttribute(name) now route name=username -> getUsername() and name=email -> getEmail(), and delegate everything else to super. Tests added that would have caught this. No change to the bulk method - it was already correct.

Net result:

  • Keep cachePolicy: NO_CACHE from the previous commit - ACL grant changes still propagate without restart.
  • Grafana's OAuth flow works again - tokens come back with preferred_username and email populated.
  • The same fix applies on ago too; it just wasn't symptomatic there because of the cache.

Why the SPI test suite didn't catch this earlier

The existing tests only asserted on getUsername() and getEmail() directly. Those still worked - what was broken was the per-name attribute methods that Keycloak's mappers actually use. New tests in FactoryPlusUserAdapterTest cover both paths (username, email, and the F+ attribute names).

djnewbould and others added 19 commits May 13, 2026 09:47
Adds the empty Maven module that will house the Keycloak User Storage SPI
plugin (Java 17, Keycloak 26.1.1 SPI deps in 'provided' scope, JUnit 5 +
Mockito + AssertJ + Testcontainers for tests). No source files yet.

Also lands the shaping artefacts: the pitch (2-month appetite to make
Factory+ the single source of truth for Keycloak users/groups via a
custom federation provider) and the implementation plan with Phase 1
detailed and Phases 2-12 outlined.
The website is factoryplus.app.amrc.co.uk; reversed strictly that's
uk.co.amrc.app.factoryplus, not uk.co.amrc.factoryplus (which the
existing hivemq-krb module uses but drops the .app segment).

Renamed the empty package directories (not tracked by git, doesn't show
in the diff) and updated pom.xml groupId + every package reference in
the implementation plan.
Wires the new module into the existing AMRC Java multi-module build:

  * Inherits dependency versions (JUnit BOM, SLF4J, JSON libs, Apache
    HTTP, RxJava, Vavr, Jetty) from base-pom 0.0.0-intree
  * Sets acs.top to .. so the local-build profile finds the in-tree
    Maven repo at lib/mvn (one level less deep than lib/X children)
  * Pins keycloak/mockito/assertj/testcontainers locally since the
    parent doesn't manage them

Defers adding the java-service-client dep to Phase 2. The in-tree
publication is 0.1-SNAPSHOT (predating base-pom inheritance) while the
parent's dependencyManagement uses ${acs.version} = 0.0.0-intree, which
isn't a published artefact. We'll resolve when we actually need the
client (pin 0.1-SNAPSHOT, rebuild the lib, or upstream a fix).

Plan updated: Phase 2 uses FPAuth, Phase 4 uses FPGssClientKeytab with
the existing sv1openid keytab, Phase 8 adds maven-shade-plugin to bundle
+ relocate transitives to dodge Keycloak classloader collisions.
Two SPI classes Keycloak needs to recognise the federation provider:

  * FactoryPlusUserStorageProviderFactory.getId() returns 'factoryplus',
    matching the existing kerberos federation's id pattern. One factory
    per JVM, holds no per-request state.
  * FactoryPlusUserStorageProvider is the per-request instance Keycloak
    constructs via the factory. Empty for now; close() is a no-op until
    we hold real resources (HTTP client, GSS context).

Required adding keycloak-model-storage to pom.xml: in Keycloak 26 the
UserStorageProvider + Factory interfaces moved out of keycloak-server-spi
into this dedicated artefact. Discovered via a compile failure that
referenced classes resolvable by name but not on the classpath.

TDD: factory-id test failed with 'cannot find symbol', then went green
after the implementation classes existed. mvn -B test reports 1/1.
Adds the ServiceLoader metadata file that makes Keycloak discover
FactoryPlusUserStorageProviderFactory at startup. Without this file the
jar ships fine and the classes load on demand, but Keycloak never knows
the factory exists, so no realm can configure 'factoryplus' as a
federation provider.

The file is a single line containing the fully-qualified class name.
Maven copies src/main/resources/* into the jar at build time, landing
it at META-INF/services/org.keycloak.storage.UserStorageProviderFactory
inside the jar where Keycloak's ServiceLoader scan finds it.

SpiRegistrationTest asserts the file is on the classpath at the
expected path with the expected contents - a guard against renames or
typos that would otherwise only surface as 'no such provider' at
Keycloak startup. mvn -B test now runs 2/2 green.
Replaces the planned hardcoded fixed-user with a real seam from day one,
so no throwaway code is ever committed and Phase 2 just slots a real
HTTP-backed store implementation in behind the existing interface.

New production types:

  * FactoryPlusUser - immutable record DTO (uuid, username, email)
  * FactoryPlusUserStore - read-only interface: findByUuid|Username|Email,
    each returning Optional<FactoryPlusUser>
  * FactoryPlusUserAdapter - extends Keycloak's AbstractUserAdapter to
    expose F+ data as a UserModel; overrides only getUsername, getEmail,
    getId (uses F+ UUID, not username, as the federated external id so
    renames don't break references), credentialManager (standard plumbing,
    Phase 6 wires real validation)
  * NullFactoryPlusUserStore - singleton null-object impl returning
    Optional.empty for every lookup; used by the factory as the
    production default until Phase 2

The provider holds a FactoryPlusUserStore via constructor injection and
implements UserLookupProvider by delegating findBy* and wrapping results
in adapters. The factory hands NullFactoryPlusUserStore.INSTANCE to each
provider for now.

Test coverage (17 unit tests, 5 classes, all green):
  * FactoryPlusUserAdapterTest: username/email exposure, F+ UUID
    accessor, storage id format (f:<modelId>:<userUuid>), null email
    handling
  * FactoryPlusUserStorageProviderTest: delegation for username/email/id
    lookups, including null-on-empty and adapter type assertion
  * NullFactoryPlusUserStoreTest: every lookup returns empty (pins the
    contract against future regressions)

Mockito bumped 5.11.0 -> 5.18.0 because 5.11 cannot mock interfaces on
JDK 25 ("Could not modify all classes [interface
org.keycloak.models.KeycloakSession]"). The system JVM here is 25.
Failsafe + Testcontainers + Keycloak admin client integration test that
bakes the built jar into a real quay.io/keycloak/keycloak:26.1.1
container, then asserts:

  1. An admin can register the 'factoryplus' provider on a realm.
     Keycloak's components endpoint returns 201, which it only does if
     the SPI was discovered at startup via META-INF/services.
  2. With the null store wired, user search through the realm returns
     no matches and doesn't error - proving end-to-end that the
     federation chain falls through cleanly.

Failsafe wired to run *IT.java in the verify phase, after package, with
spi.jar system property pointing at the freshly-built artefact. Also
forwards DOCKER_HOST (env + sysprop) to the forked test JVM for setups
where Docker isn't at the default /var/run/docker.sock.

The IT is gated by @EnabledIf("dockerAvailable") so it skips cleanly on
dev machines where Testcontainers can't auto-detect Docker (notably
macOS + OrbStack/Colima/Rancher Desktop, where the JVM can't follow the
/var/run/docker.sock symlink). CI on ubuntu-latest has Docker at the
default path so the IT runs there. The class doc explains the local
workarounds.

Bumped Mockito 5.11.0 -> 5.18.0 (JDK 25 inline-mock-maker compatibility,
already needed before this commit) and Testcontainers 1.19.7 -> 1.21.3
(better socket detection on rootless/alt Docker hosts). Added
keycloak-admin-client 26.0.5 (test scope; admin client artefact lags the
main 26.1.1 release but the REST contract is stable across 26.x). Added
slf4j-simple test dep so Testcontainers' diagnostic logs surface.
The earlier diagnostic guessed at SIP/sandboxing as the reason
Testcontainers couldn't see Docker. Real cause: OrbStack stops the VM
when idle, removing /Users/<user>/.orbstack/run/docker.sock. The docker
CLI auto-starts the VM on demand (so 'docker version' works) but
Testcontainers just stat()s the socket file without that wake-up. With
OrbStack running ('orb start'), DOCKER_HOST is honoured, the
EnvironmentAndSystemPropertyClientProviderStrategy applies, and the IT
runs end-to-end (~9s for both tests on this machine).

Updated the @EnabledIf comment to describe the real cause and the fix.
Trimmed simplelogger.properties from trace to info; trace is left as a
commented-out toggle for future debugging.
GitHub Actions workflow runs on push/PR touching the SPI module, the
Java parent POM, the in-tree Maven repo, or the workflow itself.
ubuntu-latest has Docker preinstalled at the standard path so the
Testcontainers IT runs there without configuration. JDK 21 with Maven
caching for fast iteration.

README documents build/test/run-locally, the Phase 1 status caveat (SPI
loads but the null store returns no users), the OrbStack workaround
for local IT runs, the project layout, the recipe for adding a new SPI
capability, and the architecture sketch showing how the FactoryPlusUserStore
seam absorbs Phase 2 changes.
Pinned to 0.1-SNAPSHOT explicitly to override base-pom's
dependencyManagement entry (which references ${acs.version} =
0.0.0-intree, mismatched with the in-tree publication). Reconciling
the parent POM's version scheme is a separate cleanup outside this
phase's scope.

Compile + full test suite (17 unit + 2 IT) still green.
First real implementation of FactoryPlusUserStore: calls the F+ auth
service over HTTP using bare java.net.http.HttpClient. The class
Javadoc documents the expected HTTP contract that Phase 3 implements
in acs-auth:

  GET /v2/principal/{uuid}        -> { uuid, name, email? } or 404
  GET /v2/principal?name={name}   -> { uuid, name, email? } or 404
  GET /v2/principal?email={email} -> { uuid, name, email  } or 404

Added FactoryPlusAuthException (RuntimeException) for transport / 5xx /
malformed-response failures. Surfaces upward so Keycloak's federation
chain can fall through; silent emptys would look like 'user not found'
instead of the real fault.

Phase 4 will swap the bare HttpClient for lib/java-service-client's
FPGssClientKeytab when we need real SPNEGO auth against a real F+; the
FactoryPlusUserStore interface absorbs that change. The
java-service-client dep added in 2.1 is unused at compile time today
but stays declared for that future Phase 4 swap.

Tests (10 new, all Wiremock-driven, 27 unit tests total green):
  * find_by_uuid happy path, 404, 5xx
  * find_by_username happy path, 404, URL-encoding edge case
  * find_by_email happy path, 404
  * principal with no email yields null in the DTO
  * malformed JSON throws FactoryPlusAuthException

Wiremock 3.13.1 added as test dep; org.json (managed by base-pom) added
as compile dep.
The realm admin now configures the federation provider with a F+ auth
URL via the Keycloak admin UI:

  Federation -> Add provider -> factoryplus
    Factory+ auth URL: http://acs-auth.../...
    Request timeout (seconds): 5

The factory's create() reads the ComponentModel config:
  * If auth.url is set, builds an FPAuthBackedUserStore pointing at it
    with the configured timeout
  * Otherwise (or if blank) returns NullFactoryPlusUserStore.INSTANCE
    so the federation loads cleanly without F+ dependency

The two config keys (auth.url, auth.timeout.seconds) are advertised to
Keycloak via getConfigProperties() so they appear as form fields in the
admin UI. Phase 4 adds auth.principal and auth.keytab.path for SPNEGO.

Test coverage (4 new, 31 unit tests total green):
  * No auth.url           -> NullFactoryPlusUserStore
  * auth.url set          -> FPAuthBackedUserStore
  * auth.url blank        -> NullFactoryPlusUserStore (defensive)
  * getConfigProperties() exposes auth.url + auth.timeout.seconds

Provider gained a package-private getStore() accessor purely for these
factory tests; production code never calls it.
Adds a third IT case: stand up Wiremock as an in-process sibling
(exposed inside the Keycloak container via host.testcontainers.internal),
configure a 'factoryplus' federation pointing at it via the
ComponentRepresentation config (auth.url + auth.timeout.seconds), and
assert (a) Keycloak accepts the configuration with HTTP 201 and
(b) the config round-trips back through Keycloak's component store.

Originally tried to assert end-to-end via realm.users().searchByUsername.
That requires a UserQueryProvider implementation; my first attempt at
adding one made admin-client REST searches return HTTP 400 (root cause
not pinned down without further digging into Keycloak internals; likely
a quirk of how Keycloak 26.1.1 dispatches searches when a federation
provider implements UserQueryProvider, vs how admin-client 26.0.5
calls the search endpoint).

Pragmatic call: defer UserQueryProvider to Phase 5, which is where it
naturally lives anyway (alongside group lookup + caching). The
FPAuthBackedUserStore's behaviour is already fully covered by the 10
Wiremock unit tests in FPAuthBackedUserStoreTest. Phase 4 (real F+
integration) and Phase 5 will add the end-to-end coverage.

Also switched FPAuthBackedUserStore from org.json to Jackson - Keycloak
doesn't ship org.json, so the SPI got a NoClassDefFoundError at runtime.
Jackson is part of Keycloak's classpath (transitively via keycloak-core)
so we mark it 'provided' and don't bundle it.

Tests: 31 unit + 3 IT = 34 green. Test count up from 19 in Phase 1.
The Phase 2 Wiremock contract assumed a /v2/principal?name=... +
?email=... endpoint shape that doesn't exist in F+ today and would have
required schema/API extensions in acs-auth. While planning Phase 3 I
realised the existing F+ identity API is already sufficient if we adapt
the SPI to its shape:

  GET /v2/principal/{uuid}
      200 -> { uuid, kerberos: 'alice@FACTORYPLUS.LOCAL' }
      410 -> principal does not exist (F+ uses 410, not 404)

  GET /v2/identity/kerberos/{upn}
      200 -> '"<uuid>"' (JSON string, not object)
      410 -> identity does not exist

So Phase 3 ships zero acs-auth changes; the work is entirely in the SPI:

  * findByUuid: 1 call to /v2/principal/{uuid}, parse {uuid, kerberos}
  * findByUsername: 2 calls (resolve UPN -> UUID, then UUID -> principal).
    Login is rare; Phase 5 caching will fold both into a single warm path.
  * findByEmail: returns Optional.empty without making any HTTP call.
    F+ has no email field. Reserved for a possible future schema bump.

Username = full Kerberos UPN (e.g. 'alice@FACTORYPLUS.LOCAL'), matching
the convention Keycloak's existing kerberos federation uses. Status 410
treated as not-found (404 also accepted in case a proxy rewrites it).
'Principal exists but has no kerberos identity' is treated as not-found
rather than NPE-able.

Updated Wiremock unit tests (12 in this class, 33 total unit) to mirror
the real F+ shape, including:
  * URL encoding for UPNs containing / and @
  * The two-call race where principal disappears between identity and
    principal lookups
  * Defensive 'identity endpoint returns object instead of string' check
  * findByEmail makes zero HTTP calls (asserted via wiremock.getAllServeEvents)

Plan doc updated to record the contract revision. Group/membership
endpoints originally pencilled into Phase 3 are now part of Phase 5
alongside GroupLookupProvider.

Tests: 33 unit + 3 IT = 36 green.
Replaces the stock keycloak.v2 login pages with bespoke FreeMarker
templates and CSS that mirror acs-admin's shadcn-vue look (slate
borders, rounded-lg, sm shadow, slate-900 button) and swap the
"FACTORY_PLUS" wordmark for the ACS "+" mark plus an "ACS" wordmark.
The theme ships in the acs-keycloak image; service-setup activates it
by setting loginTheme=factory-plus on the realm and exposes
openid.loginTheme in values.yaml so deployments can opt out by setting
it empty.

Covers login, logout-confirm, error and info pages; account console
keeps the upstream theme since it isn't a user-facing surface in ACS.
Keycloak doesn't always populate `locale` on the login template (e.g.
fresh auth requests before the language cookie is set), so
`${locale.currentLanguageTag!'en'}` exploded with InvalidReferenceException
because the `!` default only covers the last step of the chain. Wrap the
whole expression in parens so the fallback also fires when `locale`
itself is missing, not just `currentLanguageTag`.

Surfaces as HTTP 500 -> our error.ftl rendering "We are sorry... An
internal server error has occurred" the moment the realm flips to
loginTheme=factory-plus.
* Render Back to Application arrow as a styled <span>; override
  backToApplication message to drop upstream's "&laquo;" entity, which
  FreeMarker's auto-escape was rendering as the literal text "&laquo;".
* Custom ease-out cubic-bezier curves replace `ease`/`ease-out` keywords
  for snappier UI feel.
* Buttons get scale(0.97) on :active for tactile press feedback.
* Card has a subtle hover ring (matches acs-admin LoginDialog) and a
  scale(0.97) → scale(1) entrance via @starting-style.
* Card width 24rem -> 26rem and padding 1.5rem -> 1.75rem so the form
  has more presence on the otherwise empty page.
* Hover/active states gated to (hover: hover) and (pointer: fine), and
  reduced-motion drops the entrance + press scale.
F+ identities are username+Kerberos. Keycloak's loginWithEmailAllowed defaulting to true caused the login form to render "Username or email", inviting users to try their personal email; the F+ federation provider has no email field to query, so any such attempt fails with bad creds. Force the field to "Username".
The workflow ran mvn verify against the SPI but never built the in-tree
java-service-client dep first; it broke the moment that dep was added in
the Phase 2.1 commit. The SPI is still built and tested as part of the
acs-keycloak Docker image build on any image rebuild, so we don't need a
standalone CI workflow on top.
The custom Keycloak image (acs-keycloak/Dockerfile) bakes the F+ User
Storage SPI into Keycloak's providers/ directory, and the Helm chart
references it via openid.image.repository=acs-keycloak. Without a CI job
producing the image, tagged releases would fail to pull it on install.

Adds a dedicated build-keycloak job rather than slotting acs-keycloak
into build-x86-js: the Dockerfile is a standalone multi-stage build
(Maven + Keycloak base) with no dependency on acs-base-js, and it needs
a second build context (spi=./acs-keycloak-spi) that the JS matrix
doesn't pass.

Wired into build-helm-chart's needs list so the helm package job waits
for the image to publish.
Previously the build relied on lib/mvn being pre-populated by
`make -C lib` before `docker buildx build` ran. That works on a
workstation but breaks in CI on a clean checkout, because lib/mvn is
gitignored and never gets created. The build-keycloak job's
`COPY --from=lib mvn /build/lib/mvn` blew up with "/mvn: not found".

Make the Dockerfile self-contained: copy java-base-pom and
java-service-client source via the lib build context and mvn-install
each into Maven's local cache before building the SPI. The cache
mount keeps incremental builds fast.
Includes comprehensive instructions for setup, usage, auto-closeout, and PR creation. Documents critical rules, review process, and teardown commands to ensure consistent workflows and prevent production issues.
The dep was declared as 0.1-SNAPSHOT but every in-tree publication of
lib/java-service-client (its parent's <version>) is 0.0.0-intree, so a
fresh-checkout build fails to resolve. Past local builds only worked
because someone had 0.1-SNAPSHOT cached in ~/.m2/repository from an
earlier flow; CI's clean cache surfaced the mismatch.

The dep isn't imported by the SPI source yet (Phase 2 uses bare
java.net.http); we keep the declaration so Phase 4's FPGssClientKeytab
wiring is one line away.
Commit 55659de pinned the import to `isomorphic-git/http/node/index.cjs`
to work around an ESM resolution failure after the dep was bumped in
861ed71. That workaround is no longer valid: current isomorphic-git
publishes a strict `exports` map that exposes only `./http/node`
(auto-resolves to index.js for ESM consumers, index.cjs for require),
not the deep `/index.cjs` path. The bumped image fails on import:

  Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath
  './http/node/index.cjs' is not defined by "exports" in
  /home/node/app/node_modules/isomorphic-git/package.json

Revert to the bare `./http/node` subpath the package actually exports.
The seed Secret was templating `admin-password` to the empty string,
which made Grafana fall back to its built-in default of "admin" on
first start. Combined with the earlier mistake of using
`admin@<realm>` as `admin-user`, this is what produced the
"user already exists" failure when the F+ admin signed in via OAuth:
Grafana already had a local user owning that login.

The seed admin is meant to be a break-glass credential and nothing
more - the F+ admin principal gets Grafana Admin via the
fp_permissions claim mapping, not via the local account. Stop
hardcoding any value derived from the realm or from Grafana's
default: generate a random 32-char password on first install,
preserved across helm upgrades by the existing resource-policy: keep
and `if not lookup` guard.

Existing installs that hit the collision need a one-time manual
cleanup: Server Admin -> Users -> rename the realm-UPN login to
something safe.
* Worked example: JupyterHub via Z2JH. End-to-end recipe for an OIDC
  consumer installed alongside ACS (not shipped with it) and gated
  by fp_permissions, with no ACS code touching the consumer.
* Correct the "60s lag" claim - we hit a second cache layer in
  Keycloak's user attribute cache that requires `kubectl rollout
  restart deploy/openid` to flush. Note this and point at the
  proper fix (cachePolicy: NO_CACHE on the federation component).
* Add a troubleshooting section walking through how to actually
  inspect what's in a user's JWT: MCP-based ACL verification,
  temporarily enabling direct access grants, password-grant + decode,
  flushing the user cache, then turning direct grants back off.
…book

Reads as a complete recreate-from-zero recipe a year later: every
values block in full, every UI click spelled out, the gotchas inline
(IngressClassName=acs-traefik vs upstream Z2JH default of traefik;
auth_state_groups_key vs deprecated claim_groups_key; manage_groups
required alongside admin_groups), the cache-flush step for grant
changes, and a "what this exercises" map back to the F+ machinery
so the reader knows which pieces generalise to other OIDC consumers.
Keycloak's UserStorageProvider has a per-user attribute cache that
defaults to "DEFAULT" (cache forever, invalidated only on restart).
Combined with our SPI's permission lookups, this meant an F+ ACL
grant added after a user had been loaded into Keycloak's cache stayed
invisible in `fp_permissions` until the openid pod was restarted -
a surprising operator footgun.

Set the component's `cachePolicy` to NO_CACHE so each token issuance
re-reads attributes via the SPI. The SPI's own `cache.ttl.seconds`
(60) still shields the F+ Auth service from per-request load.

Update docs/services/oauth-clients.md to drop the now-obsolete
"restart openid after granting" workaround and reflect the simpler
operator story.
Keycloak's AbstractUserAdapter exposes username/email through
getUsername()/getEmail() but does NOT route them through the per-name
attribute API (getAttributeStream / getFirstAttribute). The standard
oidc-usermodel-attribute-mapper - which Keycloak attaches to the
realm's built-in `profile` and `email` client scopes for the
`preferred_username` and `email` claims respectively - reads from the
per-name attribute API.

Our adapter's per-name overrides returned `Stream.empty()` / `null`
for everything but the F+ attributes, masking username/email at this
layer. The mistake was the same one the author already encountered
for getAttributes() (see the existing comment at line 97): mistakenly
returning "from scratch" instead of delegating to super for the
inherited fields.

The default user-cache wrapper hid this from us because Keycloak's
CachedUserModel snapshots `getAttributes()` (which we *do* delegate
to super correctly) and serves cached lookups from there. The bug
surfaces the moment cachePolicy is anything other than DEFAULT - in
our case, the recent NO_CACHE switch made tokens come out with
`preferred_username: null` and `email: null`, which broke Grafana's
OAuth flow (Grafana falls back to a GitHub-shape /userinfo/emails
endpoint when it can't extract an email, and 500s).

Fix: route name=USERNAME and name=EMAIL through getUsername()/
getEmail() in both getAttributeStream and getFirstAttribute, and
delegate everything else to super. Tests added for the username,
email and FP attribute paths.
KavanPrice previously approved these changes May 13, 2026
The OAuth work introduced four LocalSecret resources to populate the
keycloak-clients Secret with random passwords. That works on a fresh
install of v5.2.x (where the chart ships the LocalSecret CRD via the
crds/ directory), but breaks helm upgrade from 5.1.x: helm 3 only
applies CRDs in crds/ on first install, never on upgrade. Any cluster
running 5.1.0 without an edge cluster deployed (and therefore no
LocalSecret CRD applied by the edge-cluster subchart) fails the
upgrade with "no matches for kind LocalSecret" the moment the chart
tries to render the new resources.

Replace the four LocalSecrets with a single native K8s Secret template
using lookup + randAlphaNum to generate values on first install and
preserve them across upgrades. The pattern mirrors grafana-admin-user.

  - deploy/templates/openid/keycloak-clients.yaml replaces
    openid/local-secrets.yaml
  - deploy/crds/local-secret-crd.yaml is removed; the edge-cluster
    subchart still ships its own copy for driver/OPC-server passwords
  - krb-keys-operator RBAC drops the localsecrets verbs since the
    central operator no longer reconciles any

Same Secret name and key structure, so the keycloak StatefulSet env,
service-setup volume mount, and openid pod consumers stay unchanged.
@AlexGodbehere (Contributor, Author) commented:

Follow-up: drop LocalSecret CRD dependency on the central chart

Hit a fresh bug running the 5.1.0 to OAuth migration test on a freshly-nuked AGO cluster. helm upgrade failed with:

Error: UPGRADE FAILED: [resource mapping not found for name: "keycloak-admin-bootstrap" namespace: "factory-plus" from "": no matches for kind "LocalSecret" in version "factoryplus.app.amrc.co.uk/v1"
ensure CRDs are installed first, ... (same for keycloak-admin-admin, keycloak-client-grafana, keycloak-client-jupyterhub)]

Root cause

The OAuth templates introduced four LocalSecret resources in openid/local-secrets.yaml so that krb-keys-operator would generate random passwords for Keycloak's admin / admin-cli / per-client credentials.

That works on a fresh helm install of any v5.2.x because the chart ships the LocalSecret CRD via deploy/crds/ (added in 14c13ca), and Helm 3 applies entries from crds/ on first install. It does NOT apply them on helm upgrade - that's a deliberate Helm 3 design choice to stop charts breaking existing CRD instances.

So on the upgrade path from 5.1.x (which never created any LocalSecret resources on the central chart, so never had the CRD applied) to the OAuth release, the API server has no LocalSecret kind and the upgrade fails before any resources are written.

The reason this never bit anyone before: the edge-cluster subchart ships the LocalSecret CRD in its own crds/ directory, so any cluster that ever deployed an edge cluster has the CRD on the API server as a side-effect. Most real clusters have done that at least once. Fresh central-only clusters - exactly what nuking AGO produced - had never had the CRD applied.

Solutions considered

  1. Pre-install/pre-upgrade Job that kubectl applys the CRD - works in every scenario (zero manual intervention, idempotent regardless of CRD origin) but costs ~60 lines of YAML (Job + ServiceAccount + ClusterRole + ClusterRoleBinding + hook annotations + the inline CRD apply). Adds a moving part on every helm operation.
  2. Move CRD to templates/ with helm.sh/resource-policy: keep - clean for clusters that get the CRD from this chart only, but breaks for clusters where edge-cluster already installed the CRD (Helm errors "resource already exists, not managed by Helm"). Hits the take-ownership wart.
  3. Drop LocalSecret on the central chart entirely - the four LocalSecrets only exist to generate random passwords; that's the exact use case the lookup + randAlphaNum pattern already serves (we ship that today for grafana-admin-user).

Went with option 3. The LocalSecret pattern stays appropriate for edge agent / OPC server / driver passwords on edge clusters, where the operator's password rotation and Sealed-Secret integration matter. On the central cluster it was buying us nothing that a native Secret with lookup doesn't already do for free.

What this commit does

  • Replaces deploy/templates/openid/local-secrets.yaml with deploy/templates/openid/keycloak-clients.yaml - a native K8s Secret with the same name and key structure (_bootstrap, _admin, plus one key per enabled non-builtin OIDC client).
  • Uses lookup per-key so existing values are preserved across upgrades; new keys (e.g. adding a new OIDC client) get a fresh randAlphaNum 32 on the first upgrade that includes them.
  • helm.sh/resource-policy: keep so helm uninstall doesn't take the Secret out from under any consumers.
  • Removes deploy/crds/local-secret-crd.yaml (no longer needed on the central chart; edge-cluster subchart still ships its own copy).
  • Trims localsecrets / localsecrets/status verbs from the krb-keys-operator ClusterRole, since the central operator no longer reconciles any.

Downstream consumers - Keycloak StatefulSet env (KEYCLOAK_ADMIN_PASSWORD from _bootstrap), service-setup volume mount, openid pod - are all unchanged.
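
The pattern is the standard Helm lookup-or-generate idiom. Stripped down to a single key it looks roughly like the sketch below; the real template iterates over _bootstrap, _admin and each enabled non-builtin client, and may differ in detail.

# Hedged sketch of the lookup + randAlphaNum pattern, not the actual chart template.
apiVersion: v1
kind: Secret
metadata:
  name: keycloak-clients
  annotations:
    helm.sh/resource-policy: keep        # survive helm uninstall
type: Opaque
data:
  {{- $existing := (lookup "v1" "Secret" .Release.Namespace "keycloak-clients").data | default dict }}
  {{- if hasKey $existing "_bootstrap" }}
  _bootstrap: {{ index $existing "_bootstrap" }}   # preserve the existing value across upgrades
  {{- else }}
  _bootstrap: {{ randAlphaNum 32 | b64enc }}       # fresh random value on first install
  {{- end }}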

Behaviour matrix

Cluster state -> helm upgrade --install outcome:

  • Fresh cluster, never had ACS: Secret created with fresh random passwords. No CRD needed.
  • 5.1.0, no edge cluster ever deployed (the AGO test case): Secret created with fresh random passwords. No CRD needed. No manual kubectl apply.
  • 5.1.0 + edge clusters (LocalSecret CRD present from edge-cluster subchart): Secret created with fresh random passwords. CRD now unused on the central side but harmless.
  • Previously on a pre-fix OAuth dev release (LocalSecrets exist, Secret is operator-owned): helm upgrade --take-ownership may be needed once if the existing keycloak-clients Secret has owner references to the now-deleted LocalSecrets. Only affects internal dev clusters.

The first three cases are zero-intervention.

Edge clusters unaffected

LocalSecret usage in edge-helm-charts/charts/edge-agent and edge-helm-charts/charts/opcua-server is untouched. The edge-cluster subchart still ships the CRD when an edge cluster is deployed, and krb-keys-operator running on edge clusters still has the RBAC it needs there.

docker/login-action has no built-in retry (upstream feature request
docker/login-action#750 is still open) and ghcr.io's token endpoint
flakes often enough that a single timeout fails an otherwise valid
release build. Replace the action with a bash retry loop using
docker login --password-stdin in both places we authenticate:

  - .github/actions/prepare/action.yml (composite action used by
    every per-service build)
  - .github/workflows/publish.yml (Helm chart publish step)

Five attempts with linear backoff (5s, 10s, 15s, 20s, 25s) covering
up to ~75s of transient unavailability before failing the job.
--password-stdin keeps the token out of ps and the shell history.
Keycloak's OIDCAttributeMapperHelper omits the fp_permissions claim
entirely when a user has zero F+ permission grants - even with
multivalued=true, an empty Collection collapses to null and the
helper skips emission. The previous JMESPath called
contains(fp_permissions[*], ...) which gets null as its subject when
the claim is absent. With go-jmespath (what Grafana uses) this leaks
a single space through the || chain instead of falling through to
'Viewer', and role_attribute_strict=true then rejects the login with
"Integration requires a valid org role assigned in idP. Assigned
role: ' '".

Wrap each occurrence of fp_permissions in not_null(..., `[]`) so the
subject of contains() is guaranteed to be an array. Users with no
grants now resolve to 'Viewer' as intended.
Grafana's OAuth user-lookup refuses to claim a pre-existing user by
email match when email_verified=false in the OIDC claims, unless
oauth_allow_insecure_email_lookup is set in grafana.ini. On the
AuthProxy -> OAuth migration path that surfaces as "Login Failed:
user already exists" because Grafana then falls into the create-new
path and collides with the existing DB row.

The email we issue is either the F+ Auth email or the kerberos UPN,
both derived from the same trusted F+ Principal record the SPI
already accepted, so emails are kerberos-authoritative by
construction. Override isEmailVerified() to return true so OIDC
consumers can treat the email as a safe matching key without
needing the "insecure" flag flipped on every OAuth client downstream.
@AlexGodbehere (Contributor, Author) commented:

Two more issues found and fixed while running the AuthProxy to OAuth migration test

Hit both of these in the same upgrade test (5.1.0 install with AuthProxy-created Grafana users, then upgrade to a dev release and try OAuth sign-in as those existing users). Each blocks the migration entirely on its own.

1. JMESPath leaks ' ' instead of falling through to 'Viewer' (7b5f4bc)

Symptom: users with zero F+ permission grants couldn't sign in. Grafana refused with:

Integration requires a valid org role assigned in idP. Assigned role: " "

role_attribute_strict: true was working as designed - the JMESPath was returning a literal single space, which isn't a valid Grafana role.

Diagnosis: dumped the userinfo Keycloak issues for a no-grants user via kcadm.sh get clients/<id>/evaluate-scopes/generate-example-userinfo:

{
  "sub": "f:fbc6a05f-e2c5-4a1e-8060-ca065e251585:...",
  "email_verified": false,
  "fp_principal_uuid": "...",
  "preferred_username": "me1newuser@AMRC-FPD-AGO.SHEF.AC.UK",
  "email": "me1newuser@AMRC-FPD-AGO.SHEF.AC.UK"
}

The fp_permissions claim is entirely absent. Even though our FactoryPlusPermissionsMapper has multivalued: true set, Keycloak's OIDCAttributeMapperHelper.mapAttributeValue collapses an empty Collection to null before emission, so the claim is dropped from the token rather than being emitted as [].

The previous JMESPath called contains(fp_permissions[*], '<uuid>') && 'Admin' || contains(fp_permissions[*], '<uuid>') && 'Editor' || 'Viewer'. With the claim missing, fp_permissions[*] resolves to null, and contains(null, '<uuid>') interacts with go-jmespath (what Grafana uses) in a way that leaks a single space through the || chain rather than falling through to 'Viewer'.

Verified against python-jmespath as a reference: with no fp_permissions key, contains(fp_permissions[*], ...) throws JMESPathTypeError. go-jmespath is more lenient but produces the same broken end result.

Fix: wrap each fp_permissions reference in not_null(fp_permissions, `[]`) so the subject of contains() is guaranteed to be an array.

role_attribute_path: >-
  contains(not_null(fp_permissions, `[]`)[*], '{{ $perms.admin }}') && 'Admin'
  || contains(not_null(fp_permissions, `[]`)[*], '{{ $perms.editor }}') && 'Editor'
  || 'Viewer'

Verified against all five scenarios: no claim / empty array / admin uuid / editor uuid / unrelated uuid - now resolves to the expected role in every case.

2. email_verified: false blocks email-based user lookup, breaking AuthProxy migration (28f6dd6)

Symptom: for any user already in the Grafana DB from the AuthProxy era (admin and the test users), OAuth sign-in failed at the post-token-issued stage with:

Login Failed
user already exists

Diagnosis: Grafana's OAuth user lookup, when it encounters a pre-existing row that matches by login or email, will only link the OAuth identity to that row if the IdP claims email_verified: true - OR if oauth_allow_insecure_email_lookup is set in grafana.ini. Neither was true:

  • Our SPI's FactoryPlusUserAdapter extends AbstractUserAdapter, which has isEmailVerified() returning false by default. So Keycloak emits email_verified: false.
  • We never set oauth_allow_insecure_email_lookup in our Grafana config.

With both off, Grafana drops into the "create new user" path, hits a unique-constraint collision on the existing row, and surfaces "user already exists". The original AuthProxy user_id and its folder ACLs become unreachable through OAuth.

The "insecure" wording is for situations where the OAuth provider doesn't actually verify emails. Our IdP is our own Keycloak federating our own F+ Auth - emails are derived from the kerberos principal name on the F+ Principal record, which the SPI has already accepted. They're kerberos-authoritative, so it's correct to mark them verified.

Fix: override isEmailVerified() in the user adapter to return true. One line plus a test.

@Override
public boolean isEmailVerified() {
    return true;
}

Tests bumped from 9 to 10 on FactoryPlusUserAdapterTest, total SPI test count 74, all passing.

What this unblocks

With both fixes deployed:

  • Users with no F+ permission grants resolve to Grafana Viewer on OAuth sign-in (instead of being rejected for an invalid role).
  • AuthProxy-created users from 5.1.0 are claimed by their OAuth identities on first sign-in instead of triggering "user already exists", preserving their user_id (and therefore folder ACLs, dashboard ownership, etc.).

The migration test is now actually testing migration - rather than failing at the protocol layer before user lookup can happen.

What's still on the radar

  • The role_attribute_strict: true behaviour means F+ permission grants are still required on every OAuth login for anything above Viewer. That's by design (per the existing comment in values.yaml), but worth flagging in the release notes - existing AuthProxy Admins/Editors who don't have a matching F+ permission grant will be quietly demoted to Viewer on first OAuth sign-in. Manual ACLs (per-folder, per-dashboard) survive because they're per-user, not per-role.

Grafana's OAuth client (pkg/services/authn/clients/oauth.go) only
populates lookupParams.Email when oauth_allow_insecure_email_lookup
is true - and crucially lookupParams.Login is never populated at
all on the OAuth path regardless of this flag. With the flag at its
default of false, lookupByOneOf sees both Email and Login as nil,
returns ErrUserNotFound, Grafana drops into the createUser path,
and the create collides with the existing row's email/login
uniqueness check producing the unhelpful "user already exists"
error.

This breaks the AuthProxy -> OAuth migration: every pre-existing
Grafana user (admin and the test users) hits the "user already
exists" path and OAuth sign-in fails despite the SPI emitting
email_verified=true and the kerberos UPNs matching the existing
login columns byte-for-byte.

The "insecure" naming refers to OAuth providers that don't verify
the email they emit; ours is our own Keycloak federating F+ Auth,
and FactoryPlusUserAdapter.isEmailVerified() (28f6dd6) already
stamps email_verified=true on the OIDC claims. The email is
kerberos-authoritative by construction. Enabling the flag here is
the only knob that makes the lookup actually happen.
@AlexGodbehere (Contributor, Author) commented:

Correction + final fix: oauth_allow_insecure_email_lookup is the real gate

My previous comment leaned on the email_verified SPI fix as the migration unblock. That was wrong in terms of mechanism. Posting the correction with the actual root cause for the record.

What actually controls OAuth user lookup in Grafana

pkg/services/authn/clients/oauth.go:206-210 (Grafana main):

lookupParams := login.UserLookupParams{}
allowInsecureEmailLookup := c.settingsProviderSvc.KeyValue(
    "auth", "oauth_allow_insecure_email_lookup").MustBool(false)
if allowInsecureEmailLookup {
    lookupParams.Email = &userInfo.Email
}

That's the entire user-lookup setup for the OAuth flow. Note that:

  • lookupParams.Login is never populated on the OAuth path. The login claim from the IdP is ignored for user-lookup purposes entirely.
  • lookupParams.Email is only populated when oauth_allow_insecure_email_lookup is true.
  • email_verified has no effect on this decision — it's not consulted at all by the OAuth lookup code.

When both Email and Login in lookupParams are nil, lookupByOneOf (pkg/services/authn/authnimpl/sync/user_sync.go:711) skips both branches and returns ErrUserNotFound. Grafana then falls into the createUser path, which collides with the existing row's email/login uniqueness check and surfaces the unhelpful "user already exists" error.

So the SPI isEmailVerified() change (28f6dd6) doesn't unblock the migration on its own — it's semantically correct for our setup but doesn't touch the relevant code path.

Actual fix: enable the flag (commit 3983525)

auth.generic_oauth:
  ...
  oauth_allow_insecure_email_lookup: true

The "insecure" wording in the setting name had me chasing the wrong theory — it's not "trust unverified emails", it's literally "perform the email-based lookup that's otherwise disabled." For our setup the email is kerberos-authoritative by construction (derived from the F+ Principal record the SPI has already accepted), so the flag is the right thing to flip.

Migration test result

After deploying the fixed chart, both AuthProxy-era test users (created in 5.1.0 via the AuthProxy header path) successfully signed in via OAuth and retained their per-user folder ACLs — confirming:

  • user_id preserved (login matched the existing row by email)
  • Origin column flipped from "Synced via Auth Proxy" to "Synced via Auth Module"
  • Folder Edit grants survive because they're keyed on user_id, which didn't change
  • role_attribute_strict: true correctly demotes users without F+ grants to Viewer on first OAuth login (expected, by design)

Why keep the isEmailVerified() SPI change anyway

email_verified=true is the semantically correct claim for our IdP — emails are kerberos-authoritative. Future consumers (or future Grafana versions that do gate behaviour on the claim) get the right answer for free. It's the right default; it just wasn't the thing fixing the lookup here.

Summary of fixes that actually landed this branch since the original PR

  • 7c760887: Drop LocalSecret dependency for the keycloak-clients Secret (CRD-upgrade trap)
  • b6e60bfa: Retry ghcr.io docker login on transient flakes
  • 7b5f4bc7: Wrap fp_permissions in not_null(..., `[]`) so the role JMESPath falls through to Viewer when the claim is absent
  • 28f6dd6e: SPI stamps email_verified=true on federated users
  • 3983525f: Set oauth_allow_insecure_email_lookup: true, the actual migration unblock

The merged effect: 5.1.x clusters upgrading through this branch get OAuth that handles both never-had-an-edge-cluster CRD state, transient registry flakes during release builds, no-F+-permission OAuth users, and the AuthProxy-to-OAuth user-id migration — all with no manual operator intervention beyond helm upgrade.

Login form ergonomics: a local user who types just "alice" should
not have to know or type the cluster's kerberos realm to sign in.
Cross-realm users who type their full UPN should keep working
unchanged.

Add a default.realm config property on the SPI user-storage
provider. FPAuthBackedUserStore.canonicaliseUpn appends
"@<defaultRealm>" (uppercased defensively) when the input has no @,
otherwise passes the input through verbatim. service-setup wires
the property to the chart's identity.realm value when it provisions
the federation, so the affordance is on by default for every
deployment.

Behaviour matrix:
  alice                      + realm=FP.LOCAL  -> alice@FP.LOCAL
  alice@OTHER.REALM          + realm=FP.LOCAL  -> alice@OTHER.REALM
  alice@fp.local             + realm=FP.LOCAL  -> alice@FP.LOCAL (realm uppercased, existing behaviour)
  alice                      + realm=""        -> alice (legacy pass-through)
  alice                      + realm unset     -> alice (legacy pass-through)

Five new wiremock-driven tests in FPAuthBackedUserStoreTest cover
each row of that matrix. SPI test count 74 -> 79, all passing.
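
For reference, the chart-side wiring for this is small: service-setup reads the chart's existing identity.realm value and passes it through as the provider's default.realm property. A sketch with an illustrative realm name:

# Illustrative value; only the identity.realm key path is taken from this PR.
identity:
  realm: FACTORYPLUS.EXAMPLE   # appended to short usernames: "alice" -> "alice@FACTORYPLUS.EXAMPLE"
# A typed UPN such as "alice@OTHER.REALM" passes through with its own realm.
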
Previous canonicaliseUpn() uppercased the realm part of every
@-suffixed input. Uppercase Kerberos realms are convention but not
universal - some clusters legitimately run with mixed- or
lowercase realm names, and force-casing the input silently broke
identity lookups against F+ entries that didn't match the forced
casing.

Drop the uppercasing on both paths: respect the realm in typed
UPNs verbatim, and append the default.realm configured value
exactly as supplied (without case-folding it on the way in either).
The configured default realm now uses whatever case the chart's
identity.realm value carries, which is what the operator intended
in the first place.

Renamed canonicaliseUpn to applyDefaultRealm since it no longer
canonicalises anything other than appending the realm for short
names. Updated short_name_default_realm_is_uppercased to assert
verbatim usage, added typed_upn_realm_passes_through_verbatim as
a regression test. Tests 79 -> 80, all passing.
@AlexGodbehere AlexGodbehere merged commit 83f0416 into main May 13, 2026
59 checks passed
@AlexGodbehere AlexGodbehere deleted the dn/oAuth branch May 13, 2026 14:58