
fix(TermCache): log+skip anchorless orphan terms instead of throwing #2523

Open
sriram-atlan wants to merge 1 commit into main from fix-termcache-skip-anchorless


@sriram-atlan
Contributor

Summary

When TermCache.refreshCache() encounters a single GlossaryTerm whose anchor relationship can't be resolved (e.g. all anchor edges are soft-deleted), it currently throws and aborts the entire cache initialisation. Every downstream test that depends on the cache then fails with an unrelated stack trace.

Wrap the resolve call in identityForAssetOrLog() — catch IllegalStateException, log a structured warning, return null. Both call sites (refreshCache and lookupById) treat null as "skip this term, continue with the rest of the cache." getIdentityForAsset itself still throws (right contract for callers that actually need a resolved identity); the safe variant is opt-in at bulk-scan call sites.

Trigger

Daily Test (leangraph-test) workflow run 26269147160: 10 failures, 5 of them in asset-import (chunks 0 through 4), each failing with:

java.lang.IllegalStateException: Term found with no anchor: {
  "guid":"98ef065d-0a0c-449c-8e86-25f66e2b4199",
  "name":"move-catterm-1779369779-d92b04",
  "status":"ACTIVE",
  "attributes":{"qualifiedName":"V281NCOPyFunNTc096In9@OSO7b3NpeKujuYgwZCrW7", "name":"move-catterm-1779369779-d92b04"}
}
    at com.atlan.pkg.cache.TermCache.getIdentityForAsset(TermCache.kt:109)

Root cause (data side — not in this PR)

Direct probes against leangraph-test:

  • The orphan term IS in ES with __state = ACTIVE
  • Its anchor relationship in the entity API points to a glossary but with "relationshipStatus": "DELETED"
  • 16 such move-* entities (4 glossaries + 6 terms + 6 categories) were residue from atlas-metastore's nightly dev-support/test-harness cron at 04:30 UTC, whose cleanup had partially failed
  • The atlan-java daily workflow at 04:53 UTC then picked them up
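
The inconsistent state described in the probes can be expressed as a small predicate. This is only a sketch: `TermRecord`, `AnchorEdge`, and `isAnchorlessOrphan` are hypothetical names used for illustration, not Atlas or atlan-java types, and it assumes an edge counts as usable only when its status is ACTIVE.

```kotlin
// Hypothetical, simplified shapes (assumptions, NOT the Atlas or atlan-java model classes).
data class AnchorEdge(val glossaryGuid: String, val relationshipStatus: String)

data class TermRecord(
    val guid: String,
    val state: String, // stands in for the __state value seen in ES
    val anchors: List<AnchorEdge>,
)

// A term is treated as an anchorless orphan when it is ACTIVE itself
// but none of its anchor edges is ACTIVE (for example, all are DELETED).
fun isAnchorlessOrphan(term: TermRecord): Boolean =
    term.state == "ACTIVE" && term.anchors.none { it.relationshipStatus == "ACTIVE" }

fun main() {
    val healthy = TermRecord("t-1", "ACTIVE", listOf(AnchorEdge("g-1", "ACTIVE")))
    val orphan = TermRecord("t-2", "ACTIVE", listOf(AnchorEdge("g-1", "DELETED")))
    // Only the second record matches the orphan condition.
    println(listOf(healthy, orphan).filter(::isAnchorlessOrphan).map { it.guid })
}
```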

The 16 orphans have been purged manually on the tenant to unblock the next workflow run. The cron-collision and harness cleanup gaps are tracked separately for follow-up:

  • Stagger workflow schedules so atlan-java + atlas-metastore test-harness don't overlap on the same shared tenant
  • Make test_glossary_qn_moves.py cleanup use ?deleteType=PURGE and run unconditionally on test failure

But the SDK should not blow up everyone else's tests when it encounters a single tenant-side anomaly — that's what this PR addresses.

What this PR changes

TermCache.refreshCache() and TermCache.lookupById() now call identityForAssetOrLog(term) instead of getIdentityForAsset(term) directly:

private fun identityForAssetOrLog(asset: GlossaryTerm): String? =
    try {
        getIdentityForAsset(asset)
    } catch (e: IllegalStateException) {
        logger.warn { "Skipping term ${asset.guid} (name='${asset.name}') with no resolvable anchor — ..." }
        null
    }

getIdentityForAsset itself is unchanged — still throws on inconsistent data, so code that needs a resolved identity will fail loudly. The wrapper is opt-in at bulk-scan call sites where one bad term shouldn't kill the whole refresh.
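
To illustrate how a bulk-scan call site can treat null as "skip and continue", here is a minimal, self-contained sketch. It is not the actual TermCache code: the `Term` class, the identity-to-guid map, and the use of stderr in place of the real logger are assumptions made so the example runs on the Kotlin standard library alone.

```kotlin
// Simplified stand-in for GlossaryTerm (an assumption for this sketch);
// glossaryName is null when the anchor cannot be resolved.
data class Term(val guid: String, val name: String, val glossaryName: String?)

// Strict variant: throws on inconsistent data.
fun getIdentityForAsset(term: Term): String {
    val glossary = term.glossaryName
        ?: throw IllegalStateException("Term found with no anchor: ${term.guid}")
    return "${term.name}@$glossary"
}

// Safe variant: logs a warning and returns null instead of throwing.
fun identityForAssetOrLog(term: Term): String? =
    try {
        getIdentityForAsset(term)
    } catch (e: IllegalStateException) {
        System.err.println("WARN Skipping term ${term.guid} (name='${term.name}') with no resolvable anchor")
        null
    }

// Bulk scan: a null identity means "skip this term, continue with the rest".
fun refreshCache(terms: List<Term>): Map<String, String> {
    val byIdentity = mutableMapOf<String, String>()
    for (term in terms) {
        val identity = identityForAssetOrLog(term) ?: continue
        byIdentity[identity] = term.guid
    }
    return byIdentity
}

fun main() {
    val terms = listOf(
        Term("guid-1", "Revenue", "Finance"),
        Term("guid-2", "orphan-term", null), // hypothetical orphan with no resolvable anchor
        Term("guid-3", "Churn", "Marketing"),
    )
    // The orphan is logged and skipped; the two healthy terms are still cached.
    println(refreshCache(terms))
}
```

In this sketch the strict function stays available for callers that need a resolved identity, and only the loop opts into the null-returning wrapper.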

Test plan

  • PR CI green (existing unit tests should still pass; this is purely additive)
  • After merge, the next Daily Test (leangraph-test) workflow run completes asset-import without the "Term found with no anchor" crash, even if new orphan terms appear in the tenant. Skipped terms appear as WARN lines in the test logs.
  • No regression in any existing TermCache behaviour for tenants without orphan data — terms still cache by name@glossaryName

🤖 Generated with Claude Code

TermCache.refreshCache() scans every active GlossaryTerm in the tenant
and calls getIdentityForAsset(term) on each. That method throws
IllegalStateException("Term found with no anchor: ...") if the term's
anchor relationship can't be resolved to a glossary name. The throw is
unconditional — a single inconsistent term aborts the entire cache
refresh, and every downstream test that depends on the cache being
initialised then fails with the same stack trace.

This is exactly what happened on the leangraph-test daily workflow run
26269147160: another nightly job (atlas-metastore
dev-support/test-harness suite test_glossary_qn_moves.py, cron 04:30
UTC) created `move-*` terms, moved them between glossaries, and a
partially-failed cleanup pass left 6 ACTIVE terms whose anchor edge had
relationshipStatus=DELETED. The atlan-java workflow (dispatched 22 min
later) then crashed every asset-import chunk with:

  java.lang.IllegalStateException: Term found with no anchor: { ... }

The anchor inconsistency is real data and there are deeper fixes
warranted elsewhere (test harness should fully PURGE its residue,
workflow schedules shouldn't overlap on the shared tenant). But the
SDK should not blow up everyone else's tests when it encounters a
single tenant-side anomaly.

Wrap the call in identityForAssetOrLog() — catch IllegalStateException,
log a structured warning identifying the offending term's guid + name,
return null. Both call sites (refreshCache + lookupById) treat null as
"skip this term, continue with the rest of the cache." getIdentityForAsset
itself still throws (it's the right contract — the data IS inconsistent
and code that depends on a resolved identity should fail loudly); the
new safe variant is opt-in at the call sites that perform bulk scans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sriram-atlan requested a review from cmgrote as a code owner on May 22, 2026 06:24