OAuth Support #646

Merged

AlexGodbehere merged 78 commits into main from dn/oAuth on May 13, 2026

Conversation

@AlexGodbehere (Contributor) commented May 8, 2026

Summary

Adds OAuth/OIDC sign-in to ACS via a self-hosted Keycloak that federates against the F+ Auth service. Grafana now logs in through OIDC instead of being seeded with admin credentials, and Grafana roles are driven directly from F+ ACL grants. The whole stack is provisioned automatically by service-setup on helm upgrade.

What's in the box

Keycloak deployment

  • New acs-keycloak Docker image (Keycloak 26.1.1 + custom F+ User Storage SPI baked into /opt/keycloak/providers/).
  • New openid Deployment, Service, Ingress and Postgres-backed JAAS config in the chart (deploy/templates/openid/*).
  • Realm, federation provider, claim mappers and OIDC clients all created idempotently by service-setup (acs-service-setup/lib/openid.js).

F+ User Storage SPI (acs-keycloak-spi/)

  • FactoryPlusUserStorageProvider looks users up against acs-auth instead of Keycloak's local user table; passwords validate against Kerberos.
  • Caching store with TTL so login flows aren't HTTP-bound.
  • Per-realm component config (auth.url, auth.principal, auth.keytab.path, cache TTL).
  • SPNEGO outbound auth for the SPI's own F+ HTTP calls.
  • Two protocol mappers stamp every issued token:
    • fp_principal_uuid - the F+ principal UUID for the signed-in user.
    • fp_permissions - the user's F+ permission UUIDs (multi-valued, JMESPath-friendly).
  • Comprehensive Wiremock + Testcontainers integration tests.

Grafana wiring

  • Generic OAuth flow against the new realm; admin seed account is renamed so it doesn't collide with OIDC sign-in.
  • role_attribute_path matches fp_permissions against ACS-defined Grafana role groups (Admin/Editor) seeded by service-setup.
  • role_attribute_strict enforced - revoking the F+ permission demotes the user on the next login. Grafana role capped at Admin (no GrafanaAdmin escalation path through ACL).

ACS-branded login theme

  • New factory-plus Keycloak theme (acs-keycloak/themes/factory-plus/) replacing the stock keycloak.v2 look with shadcn-vue styling that matches acs-admin.
  • Hand-rolled CSS, FreeMarker templates for login / logout-confirm / error / info pages, and a scoped messages_en.properties.
  • Activated via realm.loginTheme = factory-plus from service-setup; openid.loginTheme: "" in values.yaml reverts to stock.

Supporting changes

  • service-setup allows per-client URL overrides (rootUrl, redirect, post-logout) and asks Keycloak to push backchannel logout to OIDC clients.
  • LocalSecret CRD and RBAC updates so a fresh install can bootstrap Keycloak client secrets.
  • krb5.conf gets rdns = false to avoid libkrb5 reverse-DNS hangs in the openid pod.
  • Build infrastructure: the top-level make now works under both BSD and GNU make (so macOS works out of the box), subdirectories build in parallel via SUBDIRS_PARALLEL, and each subdirectory gets its own log file plus a summary.
  • acs-mqtt: portable maven-output filename expansion (BSD vs GNU).
  • acs-edge: skipLibCheck so the top-level build doesn't trip on transitive type errors.

Docs

  • docs/services/oauth-clients.md - operator guide for adding OIDC clients.
  • docs/plans/2026-05-07-keycloak-fp-spi-plan.md - the staged plan this branch implemented (kept for archival).
  • docs/plans/2026-05-07-keycloak-fp-user-storage-spi.md - SPI architecture notes.

Migration

For existing deployments, helm upgrade is sufficient. The first run will:

  1. Provision the factory_plus realm, the factoryplus federation provider, and the configured OIDC clients (grafana by default).
  2. Mount the SPI keytab into the openid pod and seed Grafana role groups in F+.
  3. Restart openid and mqtt to pick up regenerated config.

To opt out of the branded login theme, set openid.loginTheme: "". To run upstream Keycloak without the SPI, override openid.image.repository/registry; note that the F+ federation provider won't load in that case.
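
For concreteness, a values override covering both opt-outs might look like the sketch below. The key paths (openid.loginTheme and openid.image.registry/repository) are the ones named above; the image values are illustrative only and should be checked against the chart's values.yaml.

# Hedged sketch: key paths as described above, image values illustrative.
openid:
  loginTheme: ""            # revert to the stock Keycloak login theme
  image:
    registry: quay.io       # run upstream Keycloak instead of acs-keycloak...
    repository: keycloak/keycloak
    # ...but the F+ federation provider won't load without the bundled SPI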

Test plan

  • helm upgrade succeeds against an existing cluster; no service-setup errors.
  • Realm factory_plus exists with loginTheme=factory-plus, federation provider factoryplus, and the grafana client.
  • Sign in to Grafana via OIDC as a user in the F+ Admin role group; user lands as Grafana Admin.
  • Revoke the F+ Admin grant; on next Grafana login the user is demoted (role_attribute_strict).
  • Sign-in page renders with the ACS branding (+ mark + ACS wordmark, slate-900 button, no « artifact).
  • Logout flow renders the branded logout-confirm page; "Back to Application" link returns to the originating client.
  • Token introspection on a Grafana access token shows fp_principal_uuid and a multi-valued fp_permissions claim.
  • SPI Testcontainers integration tests pass (mvn -pl acs-keycloak-spi verify).
  • Helm upgrade (same version) or a restart of the kerberos-keys operators, both centrally and on fplus-edge; validate that the _bootstrap password in keycloak-clients is the same before and after.
  • Follow the docs guide to set up a third-party OAuth application.

@AlexGodbehere (Contributor, Author) commented:

End-to-end validation: third-party OIDC consumer (JupyterHub)

To validate that this PR isn't just "Grafana with extra steps", we ran the entire flow with JupyterHub as the OIDC client: installed as a separate Helm chart, not packaged with ACS, with no code in this repo aware that it exists.

What we did on the ACS side

  1. One values entry, no code change. Added a jupyterhub block to serviceSetup.config.openidClients in the existing ACS values (a sketch of such an entry follows this list). helm upgrade and that was it: service-setup created the Keycloak client, attached the fp_principal_uuid and fp_permissions claim mappers, and dropped the auto-generated client secret into keycloak-clients/jupyterhub. The generic openidClients extension point did everything; we didn't touch service-setup logic.

  2. Created two F+ Permission objects via the ACS Admin UI. Navigated to ConfigDB -> Create Object, selected Permission (Rank 1 class), created JupyterHub admin and JupyterHub user with chosen UUIDs. These live in F+ as ordinary Permission objects - JupyterHub-specific only by name.

  3. Granted the admin permission to admin@<realm> via Access Control -> Principals, with target = Wildcard. Verified the grant via the ACS MCP server (find_grants, check_acl).
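
As a rough illustration of step 1, the values entry might look something like the sketch below. The openidClients block and the per-client URL overrides (rootUrl, redirect, post-logout) are described elsewhere in this PR, but the exact field names here are guesses and should be checked against acs-service-setup's config schema.

# Hedged sketch only: field names illustrative, not the authoritative schema.
serviceSetup:
  config:
    openidClients:
      jupyterhub:
        rootUrl: https://jupyterhub.example.org
        redirectUris:
          - https://jupyterhub.example.org/hub/oauth_callback
        postLogoutRedirectUris:
          - https://jupyterhub.example.org/
# service-setup then creates the Keycloak client, attaches the fp_* claim mappers,
# and writes the generated client secret into keycloak-clients/jupyterhub.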

What we did on the JupyterHub side

Standard Zero-to-JupyterHub install with GenericOAuthenticator, pointing the auth/token/userinfo/logout URLs at our Keycloak. The two ACS-aware bits in JupyterHub's config:

auth_state_groups_key: oauth_user.fp_permissions
admin_groups:
  - "<jupyterhub-admin permission UUID>"
allowed_groups:
  - "<jupyterhub-admin permission UUID>"
  - "<jupyterhub-user permission UUID>"

auth_state_groups_key tells the OAuthenticator (>=17) to look in the post-OAuth auth_state dict for the groups list at the dotted path oauth_user.fp_permissions - i.e. the array of UUIDs that this PR's acs-keycloak-spi FactoryPlusPermissionsMapper stamps into every issued token. admin_groups and allowed_groups are just lists of UUIDs JH compares against that array. Membership of the admin UUID -> hub admin; membership of the user UUID -> regular access; neither -> 403.

That's the entire integration. No glue code, no shim. The application reads UUIDs out of a JWT claim and gates on UUIDs it was told about.
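
For completeness, the full Z2JH hub config (the URL wiring mentioned above plus the group gating already shown) would look roughly like the sketch below. Hostnames and the client secret are placeholders, the option names should be checked against the oauthenticator release in use (>=17), and the OIDC endpoints assume Keycloak's standard /realms/<realm>/protocol/openid-connect/* paths; only the group-gating keys are taken directly from this PR.

# Hedged sketch of the Z2JH values for the factory_plus realm.
hub:
  config:
    JupyterHub:
      authenticator_class: generic-oauth
    GenericOAuthenticator:
      client_id: jupyterhub
      client_secret: "<from the keycloak-clients/jupyterhub Secret>"
      oauth_callback_url: https://jupyterhub.example.org/hub/oauth_callback
      authorize_url: https://keycloak.example.org/realms/factory_plus/protocol/openid-connect/auth
      token_url: https://keycloak.example.org/realms/factory_plus/protocol/openid-connect/token
      userdata_url: https://keycloak.example.org/realms/factory_plus/protocol/openid-connect/userinfo
      username_claim: preferred_username
      scope: [openid, profile]
      enable_auth_state: true   # auth state must be enabled for auth_state_groups_key to see the token
      manage_groups: true       # required alongside admin_groups (see the runbook commit)
      auth_state_groups_key: oauth_user.fp_permissions
      admin_groups:
        - "<jupyterhub-admin permission UUID>"
      allowed_groups:
        - "<jupyterhub-admin permission UUID>"
        - "<jupyterhub-user permission UUID>"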

Verified end-to-end

  • Sign in to http://jupyterhub.<acs.baseUrl> -> redirected to ACS-branded Keycloak login.
  • Authenticate as admin@<realm> (the F+ principal, password validated via Kerberos through the F+ User Storage SPI in acs-keycloak-spi).
  • Token issued with fp_permissions containing the JupyterHub admin UUID we created.
  • JupyterHub honours admin_groups, lands the user in JupyterLab as a hub admin.
  • Confirmed permission changes propagate: removed and re-granted the permission, verified the change via the userinfo endpoint, then hit JupyterHub again.

One sharp edge we found and documented

Keycloak's user attribute cache memoises fp_permissions per-user-load, on top of the SPI's own 60s lookup cache. Granting a new permission in F+ does not show up in tokens until Keycloak's cache is invalidated. Today the fix is kubectl rollout restart deploy/openid. Better fix queued in the docs: set cachePolicy: NO_CACHE on the F+ federation component so attribute changes propagate without a restart.

Docs added in this PR

docs/services/oauth-clients.md now contains:

  • "Worked example: JupyterHub" - the runbook above, complete with values, the ACS Admin UI clickthrough for creating Permissions and grants, and the Z2JH values.
  • Troubleshooting -> "Inspecting what a user's JWT actually contains" - the technique we used to debug, including how to (temporarily) enable direct access grants on a client, password-grant a token, and decode it. Plus the cache-flush step for when grants don't propagate.
  • Constraints and gotchas updated to note the two-layer cache (the previous "60s lag" claim under-described it).

This makes the PR's value proposition concrete: operators can connect any OIDC application to ACS by adding one values entry and creating F+ permission objects in the UI - no ACS internals touched. JupyterHub is the proof; Grafana is the in-tree reference.

@AlexGodbehere (Contributor, Author) commented:

Follow-up: dropped the cache-invalidation footgun

The JupyterHub validation surfaced a sharp edge: F+ permission grants didn't propagate to consumers' tokens without a kubectl rollout restart deploy/openid. Two cache layers were responsible:

  1. The SPI's own 60s TTL on F+ lookups (intentional, configurable, fine).
  2. Keycloak's per-user attribute cache on the UserStorageProvider component, which defaults to DEFAULT (cache forever, invalidated only on restart). Once a user had been loaded into Keycloak's cache, attribute reads served the snapshot taken at first load - so a new ACL grant in F+ stayed invisible in fp_permissions indefinitely.

Fix (commit 367197d): acs-service-setup/lib/openid.js now sets cachePolicy: NO_CACHE on the factoryplus federation component when it provisions or updates it. Each token issuance re-reads attributes via the SPI; the SPI's cache.ttl.seconds=60 still shields F+ Auth from per-request load. After this lands:

  • Grant a permission in ACS Admin -> wait up to 60s for the SPI cache -> user signs in again -> new permission is in their fp_permissions claim.
  • No openid restart needed for routine grant changes.
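
In Keycloak terms, the federation component that service-setup provisions now carries configuration along these lines. This is a sketch of the relevant keys only, serialised as YAML for readability; the admin API takes a JSON ComponentRepresentation, and component config values are lists of strings.

# Sketch of the relevant component config keys, not the full representation.
name: factoryplus
providerId: factoryplus
providerType: org.keycloak.storage.UserStorageProvider
config:
  cachePolicy: ["NO_CACHE"]        # was DEFAULT: cache until the pod restarts
  auth.url: ["http://acs-auth..."] # F+ Auth base URL, as configured by service-setup
  cache.ttl.seconds: ["60"]        # the SPI's own lookup cache still applies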

Docs updated in the same commit. docs/services/oauth-clients.md:

  • Constraints and gotchas now describes a single 60s lag, not two cache layers.
  • The JupyterHub example's "If you grant or change a permission later" section drops the openid-restart step and just says wait + re-login. Restart is mentioned only as a debugging shortcut to bypass the SPI's 60s cache.

End-state of the operator story is now: one values entry adds the OIDC client, F+ permission objects live in ConfigDB, ACL grants in Access Control, and changes propagate within a minute - all via the supported openidClients extension point with no ACS code changes.

@AlexGodbehere (Contributor, Author) commented:

Follow-up: SPI bug that NO_CACHE exposed (commit 5de7c16b)

The cachePolicy: NO_CACHE change in the previous commit revealed a pre-existing bug in FactoryPlusUserAdapter. Symptom on the test cluster (fpd-fry, fresh install of 5.2.0-ago.14): Grafana sign-in fails with HTTP 500 -> Error getting email address -> /userinfo/emails 404 (the GitHub-shape /emails fallback Grafana does when it can't extract an email from the userinfo response).

What was actually wrong

Keycloak exposes user properties two ways and the SPI was handling them inconsistently:

  • getAttributes() (bulk) - already delegated to super (the author fixed this previously, see the comment in the file).
  • getAttributeStream(name) / getFirstAttribute(name) (per-name) - returned Stream.empty() / null for anything but the F+ attributes, masking username and email.

The standard oidc-usermodel-attribute-mapper Keycloak attaches to the realm's profile / email client scopes - the ones that produce the preferred_username and email claims - reads via the per-name path. With both null, tokens came out with preferred_username: null and email: null:

// fry, ago.14, NO_CACHE on the federation:
{
  "sub": "f:...:...",
  "fp_principal_uuid": "...",
  "fp_permissions": [...],
  "email_verified": false
  // no preferred_username, no email
}

Why we hadn't noticed

Keycloak's UserStorageProvider user cache (cachePolicy: DEFAULT) snapshots a user's attributes via getAttributes() on first load and serves all subsequent lookups from that snapshot. The bulk path was correct, so the snapshot contained username and email. Every per-name lookup hit the cache instead of our adapter, and the bug stayed hidden.

The previous commit set cachePolicy: NO_CACHE on the federation provider so ACL changes propagate without an openid restart. That switch sent the per-name lookups straight to the buggy SPI methods, which is what fry was hitting.

ago still has cache=DEFAULT so it shows the cached behaviour (preferred_username present, looking healthy) - but it has the same latent bug; flipping its cache off would have produced the same Grafana failure.

The fix

FactoryPlusUserAdapter.getAttributeStream(name) / getFirstAttribute(name) now route name=username -> getUsername() and name=email -> getEmail(), and delegate everything else to super. Tests added that would have caught this. No change to the bulk method - it was already correct.

Net result:

  • Keep cachePolicy: NO_CACHE from the previous commit - ACL grant changes still propagate without restart.
  • Grafana's OAuth flow works again - tokens come back with preferred_username and email populated.
  • The same fix applies on ago too; it just wasn't symptomatic there because of the cache.

Why the SPI test suite didn't catch this earlier

The existing tests only asserted on getUsername() and getEmail() directly. Those still worked - what was broken was the per-name attribute methods that Keycloak's mappers actually use. New tests in FactoryPlusUserAdapterTest cover both paths (username, email, and the F+ attribute names).

djnewbould and others added 19 commits May 13, 2026 09:47
Adds the empty Maven module that will house the Keycloak User Storage SPI
plugin (Java 17, Keycloak 26.1.1 SPI deps in 'provided' scope, JUnit 5 +
Mockito + AssertJ + Testcontainers for tests). No source files yet.

Also lands the shaping artefacts: the pitch (2-month appetite to make
Factory+ the single source of truth for Keycloak users/groups via a
custom federation provider) and the implementation plan with Phase 1
detailed and Phases 2-12 outlined.
The website is factoryplus.app.amrc.co.uk; reversed strictly that's
uk.co.amrc.app.factoryplus, not uk.co.amrc.factoryplus (which the
existing hivemq-krb module uses but drops the .app segment).

Renamed the empty package directories (not tracked by git, doesn't show
in the diff) and updated pom.xml groupId + every package reference in
the implementation plan.
Wires the new module into the existing AMRC Java multi-module build:

  * Inherits dependency versions (JUnit BOM, SLF4J, JSON libs, Apache
    HTTP, RxJava, Vavr, Jetty) from base-pom 0.0.0-intree
  * Sets acs.top to .. so the local-build profile finds the in-tree
    Maven repo at lib/mvn (one level less deep than lib/X children)
  * Pins keycloak/mockito/assertj/testcontainers locally since the
    parent doesn't manage them

Defers adding the java-service-client dep to Phase 2. The in-tree
publication is 0.1-SNAPSHOT (predating base-pom inheritance) while the
parent's dependencyManagement uses ${acs.version} = 0.0.0-intree, which
isn't a published artefact. We'll resolve when we actually need the
client (pin 0.1-SNAPSHOT, rebuild the lib, or upstream a fix).

Plan updated: Phase 2 uses FPAuth, Phase 4 uses FPGssClientKeytab with
the existing sv1openid keytab, Phase 8 adds maven-shade-plugin to bundle
+ relocate transitives to dodge Keycloak classloader collisions.
Two SPI classes Keycloak needs to recognise the federation provider:

  * FactoryPlusUserStorageProviderFactory.getId() returns 'factoryplus',
    matching the existing kerberos federation's id pattern. One factory
    per JVM, holds no per-request state.
  * FactoryPlusUserStorageProvider is the per-request instance Keycloak
    constructs via the factory. Empty for now; close() is a no-op until
    we hold real resources (HTTP client, GSS context).

Required adding keycloak-model-storage to pom.xml: in Keycloak 26 the
UserStorageProvider + Factory interfaces moved out of keycloak-server-spi
into this dedicated artefact. Discovered via a compile failure that
referenced classes resolvable by name but not on the classpath.

TDD: factory-id test failed with 'cannot find symbol', then went green
after the implementation classes existed. mvn -B test reports 1/1.
Adds the ServiceLoader metadata file that makes Keycloak discover
FactoryPlusUserStorageProviderFactory at startup. Without this file the
jar ships fine and the classes load on demand, but Keycloak never knows
the factory exists, so no realm can configure 'factoryplus' as a
federation provider.

The file is a single line containing the fully-qualified class name.
Maven copies src/main/resources/* into the jar at build time, landing
it at META-INF/services/org.keycloak.storage.UserStorageProviderFactory
inside the jar where Keycloak's ServiceLoader scan finds it.

SpiRegistrationTest asserts the file is on the classpath at the
expected path with the expected contents - a guard against renames or
typos that would otherwise only surface as 'no such provider' at
Keycloak startup. mvn -B test now runs 2/2 green.
Replaces the planned hardcoded fixed-user with a real seam from day one,
so no throwaway code is ever committed and Phase 2 just slots a real
HTTP-backed store implementation in behind the existing interface.

New production types:

  * FactoryPlusUser - immutable record DTO (uuid, username, email)
  * FactoryPlusUserStore - read-only interface: findByUuid|Username|Email,
    each returning Optional<FactoryPlusUser>
  * FactoryPlusUserAdapter - extends Keycloak's AbstractUserAdapter to
    expose F+ data as a UserModel; overrides only getUsername, getEmail,
    getId (uses F+ UUID, not username, as the federated external id so
    renames don't break references), credentialManager (standard plumbing,
    Phase 6 wires real validation)
  * NullFactoryPlusUserStore - singleton null-object impl returning
    Optional.empty for every lookup; used by the factory as the
    production default until Phase 2

The provider holds a FactoryPlusUserStore via constructor injection and
implements UserLookupProvider by delegating findBy* and wrapping results
in adapters. The factory hands NullFactoryPlusUserStore.INSTANCE to each
provider for now.

Test coverage (17 unit tests, 5 classes, all green):
  * FactoryPlusUserAdapterTest: username/email exposure, F+ UUID
    accessor, storage id format (f:<modelId>:<userUuid>), null email
    handling
  * FactoryPlusUserStorageProviderTest: delegation for username/email/id
    lookups, including null-on-empty and adapter type assertion
  * NullFactoryPlusUserStoreTest: every lookup returns empty (pins the
    contract against future regressions)

Mockito bumped 5.11.0 -> 5.18.0 because 5.11 cannot mock interfaces on
JDK 25 ("Could not modify all classes [interface
org.keycloak.models.KeycloakSession]"). The system JVM here is 25.
Failsafe + Testcontainers + Keycloak admin client integration test that
bakes the built jar into a real quay.io/keycloak/keycloak:26.1.1
container, then asserts:

  1. An admin can register the 'factoryplus' provider on a realm.
     Keycloak's components endpoint returns 201, which it only does if
     the SPI was discovered at startup via META-INF/services.
  2. With the null store wired, user search through the realm returns
     no matches and doesn't error - proving end-to-end that the
     federation chain falls through cleanly.

Failsafe wired to run *IT.java in the verify phase, after package, with
spi.jar system property pointing at the freshly-built artefact. Also
forwards DOCKER_HOST (env + sysprop) to the forked test JVM for setups
where Docker isn't at the default /var/run/docker.sock.

The IT is gated by @EnabledIf("dockerAvailable") so it skips cleanly on
dev machines where Testcontainers can't auto-detect Docker (notably
macOS + OrbStack/Colima/Rancher Desktop, where the JVM can't follow the
/var/run/docker.sock symlink). CI on ubuntu-latest has Docker at the
default path so the IT runs there. The class doc explains the local
workarounds.

Bumped Mockito 5.11.0 -> 5.18.0 (JDK 25 inline-mock-maker compatibility,
already needed before this commit) and Testcontainers 1.19.7 -> 1.21.3
(better socket detection on rootless/alt Docker hosts). Added
keycloak-admin-client 26.0.5 (test scope; admin client artefact lags the
main 26.1.1 release but the REST contract is stable across 26.x). Added
slf4j-simple test dep so Testcontainers' diagnostic logs surface.
The earlier diagnostic guessed at SIP/sandboxing as the reason
Testcontainers couldn't see Docker. Real cause: OrbStack stops the VM
when idle, removing /Users/<user>/.orbstack/run/docker.sock. The docker
CLI auto-starts the VM on demand (so 'docker version' works) but
Testcontainers just stat()s the socket file without that wake-up. With
OrbStack running ('orb start'), DOCKER_HOST is honoured, the
EnvironmentAndSystemPropertyClientProviderStrategy applies, and the IT
runs end-to-end (~9s for both tests on this machine).

Updated the @EnabledIf comment to describe the real cause and the fix.
Trimmed simplelogger.properties from trace to info; trace is left as a
commented-out toggle for future debugging.
GitHub Actions workflow runs on push/PR touching the SPI module, the
Java parent POM, the in-tree Maven repo, or the workflow itself.
ubuntu-latest has Docker preinstalled at the standard path so the
Testcontainers IT runs there without configuration. JDK 21 with Maven
caching for fast iteration.

README documents build/test/run-locally, the Phase 1 status caveat (SPI
loads but the null store returns no users), the OrbStack workaround
for local IT runs, the project layout, the recipe for adding a new SPI
capability, and the architecture sketch showing how the FactoryPlusUserStore
seam absorbs Phase 2 changes.
Pinned to 0.1-SNAPSHOT explicitly to override base-pom's
dependencyManagement entry (which references ${acs.version} =
0.0.0-intree, mismatched with the in-tree publication). Reconciling
the parent POM's version scheme is a separate cleanup outside this
phase's scope.

Compile + full test suite (17 unit + 2 IT) still green.
First real implementation of FactoryPlusUserStore: calls the F+ auth
service over HTTP using bare java.net.http.HttpClient. The class
Javadoc documents the expected HTTP contract that Phase 3 implements
in acs-auth:

  GET /v2/principal/{uuid}        -> { uuid, name, email? } or 404
  GET /v2/principal?name={name}   -> { uuid, name, email? } or 404
  GET /v2/principal?email={email} -> { uuid, name, email  } or 404

Added FactoryPlusAuthException (RuntimeException) for transport / 5xx /
malformed-response failures. Surfaces upward so Keycloak's federation
chain can fall through; silent emptys would look like 'user not found'
instead of the real fault.

Phase 4 will swap the bare HttpClient for lib/java-service-client's
FPGssClientKeytab when we need real SPNEGO auth against a real F+; the
FactoryPlusUserStore interface absorbs that change. The
java-service-client dep added in 2.1 is unused at compile time today
but stays declared for that future Phase 4 swap.

Tests (10 new, all Wiremock-driven, 27 unit tests total green):
  * find_by_uuid happy path, 404, 5xx
  * find_by_username happy path, 404, URL-encoding edge case
  * find_by_email happy path, 404
  * principal with no email yields null in the DTO
  * malformed JSON throws FactoryPlusAuthException

Wiremock 3.13.1 added as test dep; org.json (managed by base-pom) added
as compile dep.
The realm admin now configures the federation provider with a F+ auth
URL via the Keycloak admin UI:

  Federation -> Add provider -> factoryplus
    Factory+ auth URL: http://acs-auth.../...
    Request timeout (seconds): 5

The factory's create() reads the ComponentModel config:
  * If auth.url is set, builds an FPAuthBackedUserStore pointing at it
    with the configured timeout
  * Otherwise (or if blank) returns NullFactoryPlusUserStore.INSTANCE
    so the federation loads cleanly without F+ dependency

The two config keys (auth.url, auth.timeout.seconds) are advertised to
Keycloak via getConfigProperties() so they appear as form fields in the
admin UI. Phase 4 adds auth.principal and auth.keytab.path for SPNEGO.

Test coverage (4 new, 31 unit tests total green):
  * No auth.url           -> NullFactoryPlusUserStore
  * auth.url set          -> FPAuthBackedUserStore
  * auth.url blank        -> NullFactoryPlusUserStore (defensive)
  * getConfigProperties() exposes auth.url + auth.timeout.seconds

Provider gained a package-private getStore() accessor purely for these
factory tests; production code never calls it.
Adds a third IT case: stand up Wiremock as an in-process sibling
(exposed inside the Keycloak container via host.testcontainers.internal),
configure a 'factoryplus' federation pointing at it via the
ComponentRepresentation config (auth.url + auth.timeout.seconds), and
assert (a) Keycloak accepts the configuration with HTTP 201 and
(b) the config round-trips back through Keycloak's component store.

Originally tried to assert end-to-end via realm.users().searchByUsername.
That requires a UserQueryProvider implementation; my first attempt at
adding one made admin-client REST searches return HTTP 400 (root cause
not pinned down without further digging into Keycloak internals; likely
a quirk of how Keycloak 26.1.1 dispatches searches when a federation
provider implements UserQueryProvider, vs how admin-client 26.0.5
calls the search endpoint).

Pragmatic call: defer UserQueryProvider to Phase 5, which is where it
naturally lives anyway (alongside group lookup + caching). The
FPAuthBackedUserStore's behaviour is already fully covered by the 10
Wiremock unit tests in FPAuthBackedUserStoreTest. Phase 4 (real F+
integration) and Phase 5 will add the end-to-end coverage.

Also switched FPAuthBackedUserStore from org.json to Jackson - Keycloak
doesn't ship org.json, so the SPI got a NoClassDefFoundError at runtime.
Jackson is part of Keycloak's classpath (transitively via keycloak-core)
so we mark it 'provided' and don't bundle it.

Tests: 31 unit + 3 IT = 34 green. Test count up from 19 in Phase 1.
The Phase 2 Wiremock contract assumed a /v2/principal?name=... +
?email=... endpoint shape that doesn't exist in F+ today and would have
required schema/API extensions in acs-auth. While planning Phase 3 I
realised the existing F+ identity API is already sufficient if we adapt
the SPI to its shape:

  GET /v2/principal/{uuid}
      200 -> { uuid, kerberos: 'alice@FACTORYPLUS.LOCAL' }
      410 -> principal does not exist (F+ uses 410, not 404)

  GET /v2/identity/kerberos/{upn}
      200 -> '"<uuid>"' (JSON string, not object)
      410 -> identity does not exist

So Phase 3 ships zero acs-auth changes; the work is entirely in the SPI:

  * findByUuid: 1 call to /v2/principal/{uuid}, parse {uuid, kerberos}
  * findByUsername: 2 calls (resolve UPN -> UUID, then UUID -> principal).
    Login is rare; Phase 5 caching will fold both into a single warm path.
  * findByEmail: returns Optional.empty without making any HTTP call.
    F+ has no email field. Reserved for a possible future schema bump.

Username = full Kerberos UPN (e.g. 'alice@FACTORYPLUS.LOCAL'), matching
the convention Keycloak's existing kerberos federation uses. Status 410
treated as not-found (404 also accepted in case a proxy rewrites it).
'Principal exists but has no kerberos identity' is treated as not-found
rather than NPE-able.

Updated Wiremock unit tests (12 in this class, 33 total unit) to mirror
the real F+ shape, including:
  * URL encoding for UPNs containing / and @
  * The two-call race where principal disappears between identity and
    principal lookups
  * Defensive 'identity endpoint returns object instead of string' check
  * findByEmail makes zero HTTP calls (asserted via wiremock.getAllServeEvents)

Plan doc updated to record the contract revision. Group/membership
endpoints originally pencilled into Phase 3 are now part of Phase 5
alongside GroupLookupProvider.

Tests: 33 unit + 3 IT = 36 green.
Replaces the stock keycloak.v2 login pages with bespoke FreeMarker
templates and CSS that mirror acs-admin's shadcn-vue look (slate
borders, rounded-lg, sm shadow, slate-900 button) and swap the
"FACTORY_PLUS" wordmark for the ACS "+" mark plus an "ACS" wordmark.
The theme ships in the acs-keycloak image; service-setup activates it
by setting loginTheme=factory-plus on the realm and exposes
openid.loginTheme in values.yaml so deployments can opt out by setting
it empty.

Covers login, logout-confirm, error and info pages; account console
keeps the upstream theme since it isn't a user-facing surface in ACS.
Keycloak doesn't always populate `locale` on the login template (e.g.
fresh auth requests before the language cookie is set), so
`${locale.currentLanguageTag!'en'}` exploded with InvalidReferenceException
because the `!` default only covers the last step of the chain. Wrap the
whole expression in parens so the fallback also fires when `locale`
itself is missing, not just `currentLanguageTag`.

Surfaces as HTTP 500 -> our error.ftl rendering "We are sorry... An
internal server error has occurred" the moment the realm flips to
loginTheme=factory-plus.
* Render Back to Application arrow as a styled <span>; override
  backToApplication message to drop upstream's "&laquo;" entity, which
  FreeMarker's auto-escape was rendering as the literal text "&laquo;".
* Custom ease-out cubic-bezier curves replace `ease`/`ease-out` keywords
  for snappier UI feel.
* Buttons get scale(0.97) on :active for tactile press feedback.
* Card has a subtle hover ring (matches acs-admin LoginDialog) and a
  scale(0.97) → scale(1) entrance via @starting-style.
* Card width 24rem -> 26rem and padding 1.5rem -> 1.75rem so the form
  has more presence on the otherwise empty page.
* Hover/active states gated to (hover: hover) and (pointer: fine), and
  reduced-motion drops the entrance + press scale.
F+ identities are username+Kerberos. Keycloak's loginWithEmailAllowed defaulting to true caused the login form to render "Username or email", inviting users to try their personal email; the F+ federation provider has no email field to query, so any such attempt fails with bad creds. Force the field to "Username".
The workflow ran mvn verify against the SPI but never built the in-tree
java-service-client dep first; it broke the moment that dep was added in
the Phase 2.1 commit. The SPI is still built and tested as part of the
acs-keycloak Docker image build on any image rebuild, so we don't need a
standalone CI workflow on top.
The custom Keycloak image (acs-keycloak/Dockerfile) bakes the F+ User
Storage SPI into Keycloak's providers/ directory, and the Helm chart
references it via openid.image.repository=acs-keycloak. Without a CI job
producing the image, tagged releases would fail to pull it on install.

Adds a dedicated build-keycloak job rather than slotting acs-keycloak
into build-x86-js: the Dockerfile is a standalone multi-stage build
(Maven + Keycloak base) with no dependency on acs-base-js, and it needs
a second build context (spi=./acs-keycloak-spi) that the JS matrix
doesn't pass.

Wired into build-helm-chart's needs list so the helm package job waits
for the image to publish.
Previously the build relied on lib/mvn being pre-populated by
`make -C lib` before `docker buildx build` ran. That works on a
workstation but breaks in CI on a clean checkout, because lib/mvn is
gitignored and never gets created. The build-keycloak job's
`COPY --from=lib mvn /build/lib/mvn` blew up with "/mvn: not found".

Make the Dockerfile self-contained: copy java-base-pom and
java-service-client source via the lib build context and mvn-install
each into Maven's local cache before building the SPI. The cache
mount keeps incremental builds fast.
Includes comprehensive instructions for setup, usage, auto-closeout, and PR creation. Documents critical rules, review process, and teardown commands to ensure consistent workflows and prevent production issues.
The dep was declared as 0.1-SNAPSHOT but every in-tree publication of
lib/java-service-client (its parent's <version>) is 0.0.0-intree, so a
fresh-checkout build fails to resolve. Past local builds only worked
because someone had 0.1-SNAPSHOT cached in ~/.m2/repository from an
earlier flow; CI's clean cache surfaced the mismatch.

The dep isn't imported by the SPI source yet (Phase 2 uses bare
java.net.http); we keep the declaration so Phase 4's FPGssClientKeytab
wiring is one line away.
Commit 55659de pinned the import to `isomorphic-git/http/node/index.cjs`
to work around an ESM resolution failure after the dep was bumped in
861ed71. That workaround is no longer valid: current isomorphic-git
publishes a strict `exports` map that exposes only `./http/node`
(auto-resolves to index.js for ESM consumers, index.cjs for require),
not the deep `/index.cjs` path. The bumped image fails on import:

  Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath
  './http/node/index.cjs' is not defined by "exports" in
  /home/node/app/node_modules/isomorphic-git/package.json

Revert to the bare `./http/node` subpath the package actually exports.
The seed Secret was templating `admin-password` to the empty string,
which made Grafana fall back to its built-in default of "admin" on
first start. Combined with the earlier mistake of using
`admin@<realm>` as `admin-user`, this is what produced the
"user already exists" failure when the F+ admin signed in via OAuth:
Grafana already had a local user owning that login.

The seed admin is meant to be a break-glass credential and nothing
more - the F+ admin principal gets Grafana Admin via the
fp_permissions claim mapping, not via the local account. Stop
hardcoding any value derived from the realm or from Grafana's
default: generate a random 32-char password on first install,
preserved across helm upgrades by the existing resource-policy: keep
and `if not lookup` guard.

Existing installs that hit the collision need a one-time manual
cleanup: Server Admin -> Users -> rename the realm-UPN login to
something safe.
* Worked example: JupyterHub via Z2JH. End-to-end recipe for an OIDC
  consumer installed alongside ACS (not shipped with it) and gated
  by fp_permissions, with no ACS code touching the consumer.
* Correct the "60s lag" claim - we hit a second cache layer in
  Keycloak's user attribute cache that requires `kubectl rollout
  restart deploy/openid` to flush. Note this and point at the
  proper fix (cachePolicy: NO_CACHE on the federation component).
* Add a troubleshooting section walking through how to actually
  inspect what's in a user's JWT: MCP-based ACL verification,
  temporarily enabling direct access grants, password-grant + decode,
  flushing the user cache, then turning direct grants back off.
…book

Reads as a complete recreate-from-zero recipe a year later: every
values block in full, every UI click spelled out, the gotchas inline
(IngressClassName=acs-traefik vs upstream Z2JH default of traefik;
auth_state_groups_key vs deprecated claim_groups_key; manage_groups
required alongside admin_groups), the cache-flush step for grant
changes, and a "what this exercises" map back to the F+ machinery
so the reader knows which pieces generalise to other OIDC consumers.
Keycloak's UserStorageProvider has a per-user attribute cache that
defaults to "DEFAULT" (cache forever, invalidated only on restart).
Combined with our SPI's permission lookups, this meant an F+ ACL
grant added after a user had been loaded into Keycloak's cache stayed
invisible in `fp_permissions` until the openid pod was restarted -
a surprising operator footgun.

Set the component's `cachePolicy` to NO_CACHE so each token issuance
re-reads attributes via the SPI. The SPI's own `cache.ttl.seconds`
(60) still shields the F+ Auth service from per-request load.

Update docs/services/oauth-clients.md to drop the now-obsolete
"restart openid after granting" workaround and reflect the simpler
operator story.
Keycloak's AbstractUserAdapter exposes username/email through
getUsername()/getEmail() but does NOT route them through the per-name
attribute API (getAttributeStream / getFirstAttribute). The standard
oidc-usermodel-attribute-mapper - which Keycloak attaches to the
realm's built-in `profile` and `email` client scopes for the
`preferred_username` and `email` claims respectively - reads from the
per-name attribute API.

Our adapter's per-name overrides returned `Stream.empty()` / `null`
for everything but the F+ attributes, masking username/email at this
layer. The mistake was the same one the author already encountered
for getAttributes() (see the existing comment at line 97): mistakenly
returning "from scratch" instead of delegating to super for the
inherited fields.

The default user-cache wrapper hid this from us because Keycloak's
CachedUserModel snapshots `getAttributes()` (which we *do* delegate
to super correctly) and serves cached lookups from there. The bug
surfaces the moment cachePolicy is anything other than DEFAULT - in
our case, the recent NO_CACHE switch made tokens come out with
`preferred_username: null` and `email: null`, which broke Grafana's
OAuth flow (Grafana falls back to a GitHub-shape /userinfo/emails
endpoint when it can't extract an email, and 500s).

Fix: route name=USERNAME and name=EMAIL through getUsername()/
getEmail() in both getAttributeStream and getFirstAttribute, and
delegate everything else to super. Tests added for the username,
email and FP attribute paths.
KavanPrice previously approved these changes May 13, 2026
The OAuth work introduced four LocalSecret resources to populate the
keycloak-clients Secret with random passwords. That works on a fresh
install of v5.2.x (where the chart ships the LocalSecret CRD via the
crds/ directory), but breaks helm upgrade from 5.1.x: helm 3 only
applies CRDs in crds/ on first install, never on upgrade. Any cluster
running 5.1.0 without an edge cluster deployed (and therefore no
LocalSecret CRD applied by the edge-cluster subchart) fails the
upgrade with "no matches for kind LocalSecret" the moment the chart
tries to render the new resources.

Replace the four LocalSecrets with a single native K8s Secret template
using lookup + randAlphaNum to generate values on first install and
preserve them across upgrades. The pattern mirrors grafana-admin-user.

  - deploy/templates/openid/keycloak-clients.yaml replaces
    openid/local-secrets.yaml
  - deploy/crds/local-secret-crd.yaml is removed; the edge-cluster
    subchart still ships its own copy for driver/OPC-server passwords
  - krb-keys-operator RBAC drops the localsecrets verbs since the
    central operator no longer reconciles any

Same Secret name and key structure, so the keycloak StatefulSet env,
service-setup volume mount, and openid pod consumers stay unchanged.
@AlexGodbehere (Contributor, Author) commented:

Follow-up: drop LocalSecret CRD dependency on the central chart

Hit a fresh bug running the 5.1.0 to OAuth migration test on a freshly-nuked AGO cluster. helm upgrade failed with:

Error: UPGRADE FAILED: [resource mapping not found for name: "keycloak-admin-bootstrap" namespace: "factory-plus" from "": no matches for kind "LocalSecret" in version "factoryplus.app.amrc.co.uk/v1"
ensure CRDs are installed first, ... (same for keycloak-admin-admin, keycloak-client-grafana, keycloak-client-jupyterhub)]

Root cause

The OAuth templates introduced four LocalSecret resources in openid/local-secrets.yaml so that krb-keys-operator would generate random passwords for Keycloak's admin / admin-cli / per-client credentials.

That works on a fresh helm install of any v5.2.x because the chart ships the LocalSecret CRD via deploy/crds/ (added in 14c13ca), and Helm 3 applies entries from crds/ on first install. It does NOT apply them on helm upgrade - that's a deliberate Helm 3 design choice to stop charts breaking existing CRD instances.

So on the upgrade path from 5.1.x (which never created any LocalSecret resources on the central chart, so never had the CRD applied) to the OAuth release, the API server has no LocalSecret kind and the upgrade fails before any resources are written.

The reason this never bit anyone before: the edge-cluster subchart ships the LocalSecret CRD in its own crds/ directory, so any cluster that ever deployed an edge cluster has the CRD on the API server as a side-effect. Most real clusters have done that at least once. Fresh central-only clusters - exactly what nuking AGO produced - had never had the CRD applied.

Solutions considered

  1. Pre-install/pre-upgrade Job that kubectl applys the CRD - works in every scenario (zero manual intervention, idempotent regardless of CRD origin) but costs ~60 lines of YAML (Job + ServiceAccount + ClusterRole + ClusterRoleBinding + hook annotations + the inline CRD apply). Adds a moving part on every helm operation.
  2. Move CRD to templates/ with helm.sh/resource-policy: keep - clean for clusters that get the CRD from this chart only, but breaks for clusters where edge-cluster already installed the CRD (Helm errors "resource already exists, not managed by Helm"). Hits the take-ownership wart.
  3. Drop LocalSecret on the central chart entirely - the four LocalSecrets only exist to generate random passwords; that's the exact use case the lookup + randAlphaNum pattern already serves (we ship that today for grafana-admin-user).

Went with option 3. The LocalSecret pattern stays appropriate for edge agent / OPC server / driver passwords on edge clusters, where the operator's password rotation and Sealed-Secret integration matter. On the central cluster it was buying us nothing that a native Secret with lookup doesn't already do for free.

What this commit does

  • Replaces deploy/templates/openid/local-secrets.yaml with deploy/templates/openid/keycloak-clients.yaml - a native K8s Secret with the same name and key structure (_bootstrap, _admin, plus one key per enabled non-builtin OIDC client).
  • Uses lookup per-key so existing values are preserved across upgrades; new keys (e.g. adding a new OIDC client) get a fresh randAlphaNum 32 on the first upgrade that includes them.
  • helm.sh/resource-policy: keep so helm uninstall doesn't take the Secret out from under any consumers.
  • Removes deploy/crds/local-secret-crd.yaml (no longer needed on the central chart; edge-cluster subchart still ships its own copy).
  • Trims localsecrets / localsecrets/status verbs from the krb-keys-operator ClusterRole, since the central operator no longer reconciles any.

Downstream consumers - Keycloak StatefulSet env (KEYCLOAK_ADMIN_PASSWORD from _bootstrap), service-setup volume mount, openid pod - are all unchanged.
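
The pattern is the standard Helm lookup-or-generate idiom. Stripped down to a single key it looks roughly like the sketch below; the real template iterates over _bootstrap, _admin and each enabled non-builtin client, and may differ in detail.

# Hedged sketch of the lookup + randAlphaNum pattern, not the actual chart template.
apiVersion: v1
kind: Secret
metadata:
  name: keycloak-clients
  annotations:
    helm.sh/resource-policy: keep        # survive helm uninstall
type: Opaque
data:
  {{- $existing := (lookup "v1" "Secret" .Release.Namespace "keycloak-clients").data | default dict }}
  {{- if hasKey $existing "_bootstrap" }}
  _bootstrap: {{ index $existing "_bootstrap" }}   # preserve the existing value across upgrades
  {{- else }}
  _bootstrap: {{ randAlphaNum 32 | b64enc }}       # fresh random value on first install
  {{- end }}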

Behaviour matrix

Cluster state -> helm upgrade --install outcome:

  • Fresh cluster, never had ACS: Secret created with fresh random passwords. No CRD needed.
  • 5.1.0, no edge cluster ever deployed (the AGO test case): Secret created with fresh random passwords. No CRD needed. No manual kubectl apply.
  • 5.1.0 + edge clusters (LocalSecret CRD present from edge-cluster subchart): Secret created with fresh random passwords. CRD now unused on the central side but harmless.
  • Previously on a pre-fix OAuth dev release (LocalSecrets exist, Secret is operator-owned): helm upgrade --take-ownership may be needed once if the existing keycloak-clients Secret has owner references to the now-deleted LocalSecrets. Only affects internal dev clusters.

The first three cases are zero-intervention.

Edge clusters unaffected

LocalSecret usage in edge-helm-charts/charts/edge-agent and edge-helm-charts/charts/opcua-server is untouched. The edge-cluster subchart still ships the CRD when an edge cluster is deployed, and krb-keys-operator running on edge clusters still has the RBAC it needs there.

docker/login-action has no built-in retry (upstream feature request
docker/login-action#750 is still open) and ghcr.io's token endpoint
flakes often enough that a single timeout fails an otherwise valid
release build. Replace the action with a bash retry loop using
docker login --password-stdin in both places we authenticate:

  - .github/actions/prepare/action.yml (composite action used by
    every per-service build)
  - .github/workflows/publish.yml (Helm chart publish step)

Five attempts with linear backoff (5s, 10s, 15s, 20s, 25s) covering
up to ~75s of transient unavailability before failing the job.
--password-stdin keeps the token out of ps and the shell history.
Keycloak's OIDCAttributeMapperHelper omits the fp_permissions claim
entirely when a user has zero F+ permission grants - even with
multivalued=true, an empty Collection collapses to null and the
helper skips emission. The previous JMESPath called
contains(fp_permissions[*], ...) which gets null as its subject when
the claim is absent. With go-jmespath (what Grafana uses) this leaks
a single space through the || chain instead of falling through to
'Viewer', and role_attribute_strict=true then rejects the login with
"Integration requires a valid org role assigned in idP. Assigned
role: ' '".

Wrap each occurrence of fp_permissions in not_null(..., `[]`) so the
subject of contains() is guaranteed to be an array. Users with no
grants now resolve to 'Viewer' as intended.
Grafana's OAuth user-lookup refuses to claim a pre-existing user by
email match when email_verified=false in the OIDC claims, unless
oauth_allow_insecure_email_lookup is set in grafana.ini. On the
AuthProxy -> OAuth migration path that surfaces as "Login Failed:
user already exists" because Grafana then falls into the create-new
path and collides with the existing DB row.

The email we issue is either the F+ Auth email or the kerberos UPN,
both derived from the same trusted F+ Principal record the SPI
already accepted, so emails are kerberos-authoritative by
construction. Override isEmailVerified() to return true so OIDC
consumers can treat the email as a safe matching key without
needing the "insecure" flag flipped on every OAuth client downstream.
@AlexGodbehere (Contributor, Author) commented:

Two more issues found and fixed while running the AuthProxy to OAuth migration test

Hit both of these in the same upgrade test (5.1.0 install with AuthProxy-created Grafana users, then upgrade to a dev release and try OAuth sign-in as those existing users). Each blocks the migration entirely on its own.

1. JMESPath leaks ' ' instead of falling through to 'Viewer' (7b5f4bc)

Symptom: users with zero F+ permission grants couldn't sign in. Grafana refused with:

Integration requires a valid org role assigned in idP. Assigned role: " "

role_attribute_strict: true was working as designed - the JMESPath was returning a literal single space, which isn't a valid Grafana role.

Diagnosis: dumped the userinfo Keycloak issues for a no-grants user via kcadm.sh get clients/<id>/evaluate-scopes/generate-example-userinfo:

{
  "sub": "f:fbc6a05f-e2c5-4a1e-8060-ca065e251585:...",
  "email_verified": false,
  "fp_principal_uuid": "...",
  "preferred_username": "me1newuser@AMRC-FPD-AGO.SHEF.AC.UK",
  "email": "me1newuser@AMRC-FPD-AGO.SHEF.AC.UK"
}

The fp_permissions claim is entirely absent. Even though our FactoryPlusPermissionsMapper has multivalued: true set, Keycloak's OIDCAttributeMapperHelper.mapAttributeValue collapses an empty Collection to null before emission, so the claim is dropped from the token rather than being emitted as [].

The previous JMESPath called contains(fp_permissions[*], '<uuid>') && 'Admin' || contains(fp_permissions[*], '<uuid>') && 'Editor' || 'Viewer'. With the claim missing, fp_permissions[*] resolves to null, and contains(null, '<uuid>') interacts with go-jmespath (what Grafana uses) in a way that leaks a single space through the || chain rather than falling through to 'Viewer'.

Verified against python-jmespath as a reference: with no fp_permissions key, contains(fp_permissions[*], ...) throws JMESPathTypeError. go-jmespath is more lenient but produces the same broken end result.

Fix: wrap each fp_permissions reference in not_null(fp_permissions, `[]`) so the subject of contains() is guaranteed to be an array.

role_attribute_path: >-
  contains(not_null(fp_permissions, `[]`)[*], '{{ $perms.admin }}') && 'Admin'
  || contains(not_null(fp_permissions, `[]`)[*], '{{ $perms.editor }}') && 'Editor'
  || 'Viewer'

Verified against all five scenarios: no claim / empty array / admin uuid / editor uuid / unrelated uuid - now resolves to the expected role in every case.

2. email_verified: false blocks email-based user lookup, breaking AuthProxy migration (28f6dd6)

Symptom: for any user already in the Grafana DB from the AuthProxy era (admin and the test users), OAuth sign-in failed at the post-token-issued stage with:

Login Failed
user already exists

Diagnosis: Grafana's OAuth user lookup, when it encounters a pre-existing row that matches by login or email, will only link the OAuth identity to that row if the IdP claims email_verified: true - OR if oauth_allow_insecure_email_lookup is set in grafana.ini. Neither was true:

  • Our SPI's FactoryPlusUserAdapter extends AbstractUserAdapter, which has isEmailVerified() returning false by default. So Keycloak emits email_verified: false.
  • We never set oauth_allow_insecure_email_lookup in our Grafana config.

With both off, Grafana drops into the "create new user" path, hits a unique-constraint collision on the existing row, and surfaces "user already exists". The original AuthProxy user_id and its folder ACLs become unreachable through OAuth.

The "insecure" wording is for situations where the OAuth provider doesn't actually verify emails. Our IdP is our own Keycloak federating our own F+ Auth - emails are derived from the kerberos principal name on the F+ Principal record, which the SPI has already accepted. They're kerberos-authoritative, so it's correct to mark them verified.

Fix: override isEmailVerified() in the user adapter to return true. One line plus a test.

@Override
public boolean isEmailVerified() {
    return true;
}

Tests bumped from 9 to 10 on FactoryPlusUserAdapterTest, total SPI test count 74, all passing.

What this unblocks

With both fixes deployed:

  • Users with no F+ permission grants resolve to Grafana Viewer on OAuth sign-in (instead of being rejected for an invalid role).
  • AuthProxy-created users from 5.1.0 are claimed by their OAuth identities on first sign-in instead of triggering "user already exists", preserving their user_id (and therefore folder ACLs, dashboard ownership, etc.).

The migration test is now actually testing migration - rather than failing at the protocol layer before user lookup can happen.

What's still on the radar

  • The role_attribute_strict: true behaviour means F+ permission grants are still required on every OAuth login for anything above Viewer. That's by design (per the existing comment in values.yaml), but worth flagging in the release notes - existing AuthProxy Admins/Editors who don't have a matching F+ permission grant will be quietly demoted to Viewer on first OAuth sign-in. Manual ACLs (per-folder, per-dashboard) survive because they're per-user, not per-role.

Grafana's OAuth client (pkg/services/authn/clients/oauth.go) only
populates lookupParams.Email when oauth_allow_insecure_email_lookup
is true - and crucially lookupParams.Login is never populated at
all on the OAuth path regardless of this flag. With the flag at its
default of false, lookupByOneOf sees both Email and Login as nil,
returns ErrUserNotFound, Grafana drops into the createUser path,
and the create collides with the existing row's email/login
uniqueness check producing the unhelpful "user already exists"
error.

This breaks the AuthProxy -> OAuth migration: every pre-existing
Grafana user (admin and the test users) hits the "user already
exists" path and OAuth sign-in fails despite the SPI emitting
email_verified=true and the kerberos UPNs matching the existing
login columns byte-for-byte.

The "insecure" naming refers to OAuth providers that don't verify
the email they emit; ours is our own Keycloak federating F+ Auth,
and FactoryPlusUserAdapter.isEmailVerified() (28f6dd6) already
stamps email_verified=true on the OIDC claims. The email is
kerberos-authoritative by construction. Enabling the flag here is
the only knob that makes the lookup actually happen.
@AlexGodbehere (Contributor, Author) commented:

Correction + final fix: oauth_allow_insecure_email_lookup is the real gate

My previous comment leaned on the email_verified SPI fix as the migration unblock. That was wrong in terms of mechanism. Posting the correction with the actual root cause for the record.

What actually controls OAuth user lookup in Grafana

pkg/services/authn/clients/oauth.go:206-210 (Grafana main):

lookupParams := login.UserLookupParams{}
allowInsecureEmailLookup := c.settingsProviderSvc.KeyValue(
    "auth", "oauth_allow_insecure_email_lookup").MustBool(false)
if allowInsecureEmailLookup {
    lookupParams.Email = &userInfo.Email
}

That's the entire user-lookup setup for the OAuth flow. Note that:

  • lookupParams.Login is never populated on the OAuth path. The login claim from the IdP is ignored for user-lookup purposes entirely.
  • lookupParams.Email is only populated when oauth_allow_insecure_email_lookup is true.
  • email_verified has no effect on this decision — it's not consulted at all by the OAuth lookup code.

When both Email and Login in lookupParams are nil, lookupByOneOf (pkg/services/authn/authnimpl/sync/user_sync.go:711) skips both branches and returns ErrUserNotFound. Grafana then falls into the createUser path, which collides with the existing row's email/login uniqueness check and surfaces the unhelpful "user already exists" error.

So the SPI isEmailVerified() change (28f6dd6) doesn't unblock the migration on its own — it's semantically correct for our setup but doesn't touch the relevant code path.

Actual fix: enable the flag (commit 3983525)

auth.generic_oauth:
  ...
  oauth_allow_insecure_email_lookup: true

The "insecure" wording in the setting name had me chasing the wrong theory — it's not "trust unverified emails", it's literally "perform the email-based lookup that's otherwise disabled." For our setup the email is kerberos-authoritative by construction (derived from the F+ Principal record the SPI has already accepted), so the flag is the right thing to flip.

Migration test result

After deploying the fixed chart, both AuthProxy-era test users (created in 5.1.0 via the AuthProxy header path) successfully signed in via OAuth and retained their per-user folder ACLs — confirming:

  • user_id preserved (login matched the existing row by email)
  • Origin column flipped from "Synced via Auth Proxy" to "Synced via Auth Module"
  • Folder Edit grants survive because they're keyed on user_id, which didn't change
  • role_attribute_strict: true correctly demotes users without F+ grants to Viewer on first OAuth login (expected, by design)

Why keep the isEmailVerified() SPI change anyway

email_verified=true is the semantically correct claim for our IdP — emails are kerberos-authoritative. Future consumers (or future Grafana versions that do gate behaviour on the claim) get the right answer for free. It's the right default; it just wasn't the thing fixing the lookup here.

Summary of fixes that actually landed this branch since the original PR

  • 7c760887: Drop LocalSecret dependency for the keycloak-clients Secret (CRD-upgrade trap)
  • b6e60bfa: Retry ghcr.io docker login on transient flakes
  • 7b5f4bc7: Wrap fp_permissions in not_null(..., `[]`) so the role JMESPath falls through to Viewer when the claim is absent
  • 28f6dd6e: SPI stamps email_verified=true on federated users
  • 3983525f: Set oauth_allow_insecure_email_lookup: true, the actual migration unblock

The merged effect: 5.1.x clusters upgrading through this branch get OAuth that handles both never-had-an-edge-cluster CRD state, transient registry flakes during release builds, no-F+-permission OAuth users, and the AuthProxy-to-OAuth user-id migration — all with no manual operator intervention beyond helm upgrade.

Login form ergonomics: a local user who types just "alice" should
not have to know or type the cluster's kerberos realm to sign in.
Cross-realm users who type their full UPN should keep working
unchanged.

Add a default.realm config property on the SPI user-storage
provider. FPAuthBackedUserStore.canonicaliseUpn appends
"@<defaultRealm>" (uppercased defensively) when the input has no @,
otherwise passes the input through verbatim. service-setup wires
the property to the chart's identity.realm value when it provisions
the federation, so the affordance is on by default for every
deployment.

Behaviour matrix:
  alice                      + realm=FP.LOCAL  -> alice@FP.LOCAL
  alice@OTHER.REALM          + realm=FP.LOCAL  -> alice@OTHER.REALM
  alice@fp.local             + realm=FP.LOCAL  -> alice@FP.LOCAL (realm uppercased, existing behaviour)
  alice                      + realm=""        -> alice (legacy pass-through)
  alice                      + realm unset     -> alice (legacy pass-through)

Five new wiremock-driven tests in FPAuthBackedUserStoreTest cover
each row of that matrix. SPI test count 74 -> 79, all passing.
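
For reference, the chart-side wiring for this is small: service-setup reads the chart's existing identity.realm value and passes it through as the provider's default.realm property. A sketch with an illustrative realm name:

# Illustrative value; only the identity.realm key path is taken from this PR.
identity:
  realm: FACTORYPLUS.EXAMPLE   # appended to short usernames: "alice" -> "alice@FACTORYPLUS.EXAMPLE"
# A typed UPN such as "alice@OTHER.REALM" passes through with its own realm.
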
Previous canonicaliseUpn() uppercased the realm part of every
@-suffixed input. Uppercase Kerberos realms are convention but not
universal - some clusters legitimately run with mixed- or
lowercase realm names, and force-casing the input silently broke
identity lookups against F+ entries that didn't match the forced
casing.

Drop the uppercasing on both paths: respect the realm in typed
UPNs verbatim, and append the default.realm configured value
exactly as supplied (without case-folding it on the way in either).
The configured default realm now uses whatever case the chart's
identity.realm value carries, which is what the operator intended
in the first place.

Renamed canonicaliseUpn to applyDefaultRealm since it no longer
canonicalises anything other than appending the realm for short
names. Updated short_name_default_realm_is_uppercased to assert
verbatim usage, added typed_upn_realm_passes_through_verbatim as
a regression test. Tests 79 -> 80, all passing.
@AlexGodbehere AlexGodbehere merged commit 83f0416 into main May 13, 2026
59 checks passed
@AlexGodbehere AlexGodbehere deleted the dn/oAuth branch May 13, 2026 14:58