fix(TermCache): log+skip anchorless orphan terms instead of throwing#2523
Open
sriram-atlan wants to merge 1 commit into
Open
fix(TermCache): log+skip anchorless orphan terms instead of throwing#2523sriram-atlan wants to merge 1 commit into
sriram-atlan wants to merge 1 commit into
Conversation
TermCache.refreshCache() scans every active GlossaryTerm in the tenant
and calls getIdentityForAsset(term) on each. That method throws
IllegalStateException("Term found with no anchor: ...") if the term's
anchor relationship can't be resolved to a glossary name. The throw is
unconditional — a single inconsistent term aborts the entire cache
refresh, and every downstream test that depends on the cache being
initialised then fails with the same stack trace.
This is exactly what happened on the leangraph-test daily workflow run
26269147160: another nightly job (atlas-metastore
dev-support/test-harness suite test_glossary_qn_moves.py, cron 04:30
UTC) created `move-*` terms, moved them between glossaries, and a
partially-failed cleanup pass left 6 ACTIVE terms whose anchor edge had
relationshipStatus=DELETED. The atlan-java workflow (dispatched 22 min
later) then crashed every asset-import chunk with:
java.lang.IllegalStateException: Term found with no anchor: { ... }
The anchor inconsistency is real data and there are deeper fixes
warranted elsewhere (test harness should fully PURGE its residue,
workflow schedules shouldn't overlap on the shared tenant). But the
SDK should not blow up everyone else's tests when it encounters a
single tenant-side anomaly.
Wrap the call in identityForAssetOrLog() — catch IllegalStateException,
log a structured warning identifying the offending term's guid + name,
return null. Both call sites (refreshCache + lookupById) treat null as
"skip this term, continue with the rest of the cache." getIdentityForAsset
itself still throws (it's the right contract — the data IS inconsistent
and code that depends on a resolved identity should fail loudly); the
new safe variant is opt-in at the call sites that perform bulk scans.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When
TermCache.refreshCache()encounters a singleGlossaryTermwhoseanchorrelationship can't be resolved (e.g. all anchor edges are soft-deleted), it currently throws and aborts the entire cache initialisation. Every downstream test that depends on the cache then fails with an unrelated stack trace.Wrap the resolve call in
identityForAssetOrLog()— catchIllegalStateException, log a structured warning, return null. Both call sites (refreshCacheandlookupById) treat null as "skip this term, continue with the rest of the cache."getIdentityForAssetitself still throws (right contract for callers that actually need a resolved identity); the safe variant is opt-in at bulk-scan call sites.Trigger
Daily
Test (leangraph-test)workflow run 26269147160 — 10 failures, of which 5 wereasset-import: chunk 0/1/2/3/4all blowing up with:Root cause (data side — not in this PR)
Direct probes against
leangraph-test:__state = ACTIVEanchorrelationship in the entity API points to a glossary but with"relationshipStatus": "DELETED"move-*entities (4 glossaries + 6 terms + 6 categories) were residue from atlas-metastore's nightlydev-support/test-harnesscron at 04:30 UTC, which had partially-failed cleanupThe 16 orphans have been purged manually on the tenant to unblock the next workflow run. The cron-collision and harness cleanup gaps are tracked separately for follow-up:
test_glossary_qn_moves.pycleanup use?deleteType=PURGEand run unconditionally on test failureBut the SDK should not blow up everyone else's tests when it encounters a single tenant-side anomaly — that's what this PR addresses.
What this PR changes
TermCache.refreshCache()andTermCache.lookupById()now callidentityForAssetOrLog(term)instead ofgetIdentityForAsset(term)directly:getIdentityForAssetitself is unchanged — still throws on inconsistent data, so code that needs a resolved identity will fail loudly. The wrapper is opt-in at bulk-scan call sites where one bad term shouldn't kill the whole refresh.Test plan
Test (leangraph-test)workflow run completes asset-import without the "Term found with no anchor" crash, even if new orphan terms appear in the tenant. Skipped terms appear asWARNlines in the test logs.name@glossaryName🤖 Generated with Claude Code