Facets show vocabulary URIs instead of human-readable labels (cross-repo) #148

@rdhyee

Problem

On the Interactive Explorer, the Material / Sampled Feature / Specimen Type facets render raw vocabulary URIs as their labels:

https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite (1,059,025)
https://w3id.org/isample/biology/biosampledfeature/1.0/Animalia (188,361)

The Source facet is fine (short codes — SESAR, OPENCONTEXT, GEOME, SMITHSONIAN). Everything that comes through the SKOS vocabularies is broken.

Root Cause

The bug lives upstream of the Explorer. Sampling the wide parquet:

SELECT row_id, label, scheme_name
FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')
WHERE otype = 'IdentifiedConcept'
LIMIT 5;

returns rows like these:

(19401546, 'https://w3id.org/isample/vocabulary/materialsampleobjecttype/1.0/wholeorganism', NULL)
(19401547, 'https://w3id.org/isample/opencontext/materialsampleobjecttype/0.1/tile',         NULL)
(19401548, 'https://w3id.org/isample/opencontext/materialsampleobjecttype/0.1/clothing',     NULL)

So IdentifiedConcept.label already holds the URI rather than the SKOS prefLabel, and scheme_name is NULL. When pqg facet-summaries runs (pqg/__main__.py @ aefd465), it executes:

SELECT 'material' AS facet_type, c.label AS facet_value, c.scheme_name AS scheme, COUNT(*) AS count
FROM (...) JOIN (... otype = 'IdentifiedConcept') c ON c.row_id = s.material_id

and faithfully copies the URI through into facet_summaries.parquet. The Explorer's OJS code (tutorials/isamples_explorer.qmd:119-174) renders r.value directly.

Why a Quick Fix Isn't Right

Easy options on the table:

  • In-page string mangling (x.split('/').pop()): renders pasthumanoccupationsite, not Past Human Occupation Site. Loses fidelity. Doesn't help notebooks or future consumers.
  • Bake labels into facet_summaries.parquet at build time only: helps the Explorer but not the wide parquet, narrow parquet, or any Python notebook that touches IdentifiedConcept.label.
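
To make the first option's shortcoming concrete, here is a minimal Python sketch of the same mangling (the page itself would do it in JavaScript). `uri_tail` is a hypothetical helper name, not something in the Explorer; the URI is the one from the Problem section.

```python
def uri_tail(uri: str) -> str:
    """Return the last path segment of a vocabulary URI (the 'quick fix')."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


uri = "https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite"

# The tail is one lowercase token; the word boundaries of the real
# prefLabel ("Past Human Occupation Site") cannot be recovered from it.
print(uri_tail(uri))  # pasthumanoccupationsite
```

Any casing or spacing applied on top of this would be a guess, which is why the label has to come from the vocabulary itself.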

Per @rdhyee on 2026-04-28: "let's do (2) properly and make sure the work we do is useful throughout our work (Quarto, and Python, etc)."

Proposed Approach: canonical `vocab_labels` artifact

Build a single, canonical lookup artifact from the SKOS vocabularies (already at https://github.com/isamplesorg/vocabularies and consumed by `vocab_tools` in `scripts/generate_vocab_docs.sh`):

`vocab_labels.parquet` (or .csv) with columns:

  • `uri` (PK)
  • `pref_label` (SKOS prefLabel, en)
  • `scheme` (the vocabulary it belongs to)
  • `definition` (skos:definition, optional but cheap to include)
  • `alt_labels` (list, optional)

Hosted at `data.isamples.org/vocab_labels.parquet` (versioned + `/current/` alias, like the other parquets).
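
A rough sketch of the row shape, using only the Python standard library and the `.csv` variant. Assumptions: the class name `VocabLabel`, the helper `to_csv`, the `|` separator for `alt_labels`, and the `scheme` value shown are all placeholders for illustration. Extracting the rows from the TTL files is out of scope here.

```python
import csv
import io
from dataclasses import asdict, dataclass

COLUMNS = ["uri", "pref_label", "scheme", "definition", "alt_labels"]


@dataclass(frozen=True)
class VocabLabel:
    uri: str                           # primary key
    pref_label: str                    # SKOS prefLabel (en)
    scheme: str                        # vocabulary the concept belongs to
    definition: str = ""               # skos:definition, optional
    alt_labels: tuple[str, ...] = ()   # optional


def to_csv(rows) -> str:
    """Serialize rows to CSV text, enforcing uniqueness of `uri`."""
    seen = set()
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row.uri in seen:
            raise ValueError(f"duplicate uri: {row.uri}")
        seen.add(row.uri)
        record = asdict(row)
        # CSV has no list type; join with "|" (an assumed convention).
        record["alt_labels"] = "|".join(row.alt_labels)
        writer.writerow(record)
    return buf.getvalue()


rows = [
    VocabLabel(
        uri="https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite",
        pref_label="Past Human Occupation Site",
        scheme="sampledfeature",  # placeholder scheme value
    )
]
csv_text = to_csv(rows)
print(csv_text)
```

A parquet writer could keep `alt_labels` as a real list column; the CSV form needs some flattening convention like the one assumed above.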

Consumers

| Consumer | Use |
| --- | --- |
| `pqg` SQL converter | Populate `IdentifiedConcept.label` with prefLabel during PQG conversion (replaces the URI). Keep URI in a separate `uri` column. |
| `pqg facet-summaries` | No change needed if upstream is fixed; otherwise LEFT JOIN onto `vocab_labels`. |
| Explorer (Quarto/OJS) | Optional secondary lookup if we want short-term fix without regenerating wide parquet. |
| isamples-python notebooks | Loaded as a small DataFrame for any analysis touching concepts. |
| Future React / detail-panel UIs | JSON dump (`{uri: prefLabel}`) ships from same source. |
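
A minimal sketch of the join-on-read fallback and the JSON dump, using in-memory SQLite so it runs with only the standard library. Assumptions: the simplified `facet_summaries` columns, the `sampled_feature` facet type string, and the `scheme` value are illustrative; the real pipeline reads parquet. The URIs and counts are the two from the Problem section, and Animalia is deliberately left out of the lookup to show the fallback.

```python
import json
import sqlite3

PAST_SITE = "https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite"
ANIMALIA = "https://w3id.org/isample/biology/biosampledfeature/1.0/Animalia"

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE facet_summaries (facet_type TEXT, facet_value TEXT, count INTEGER);
    CREATE TABLE vocab_labels (uri TEXT PRIMARY KEY, pref_label TEXT, scheme TEXT);
    """
)
conn.executemany(
    "INSERT INTO facet_summaries VALUES (?, ?, ?)",
    [("sampled_feature", PAST_SITE, 1059025), ("sampled_feature", ANIMALIA, 188361)],
)
# Only one concept has a label here; the other exercises the fallback.
conn.execute(
    "INSERT INTO vocab_labels VALUES (?, ?, ?)",
    (PAST_SITE, "Past Human Occupation Site", "sampledfeature"),
)

# LEFT JOIN keeps every facet row; COALESCE falls back to the raw URI
# when the lookup has no entry, so nothing disappears from the facet.
rows = conn.execute(
    """
    SELECT f.facet_type,
           f.facet_value AS uri,
           COALESCE(v.pref_label, f.facet_value) AS label,
           f.count
    FROM facet_summaries AS f
    LEFT JOIN vocab_labels AS v ON v.uri = f.facet_value
    ORDER BY f.count DESC
    """
).fetchall()

# The {uri: prefLabel} dump for UI consumers comes from the same table.
label_map = dict(conn.execute("SELECT uri, pref_label FROM vocab_labels").fetchall())
label_json = json.dumps(label_map, indent=2)

for facet_type, uri, label, count in rows:
    print(f"{label} ({count:,})")
```

Keeping `uri` next to `label` in the result means the UI can still filter on the stable identifier while displaying the readable text.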

Open Design Questions

  1. Where does it get built? Most natural home is `isamplesorg/vocabularies` (publish prefLabel CSV alongside the TTL) or `vocab_tools` (add a `vocab labels` subcommand that emits parquet/csv). Voting weakly for `vocab_tools` since it already parses the same TTLs.
  2. Multilingual? SKOS prefLabels can be tagged. Default `en`, ship one row per (uri, lang) so consumers can filter.
  3. Versioning? Vocabularies are versioned in their URIs (`.../1.0/...`). Should `vocab_labels.parquet` be versioned per release of the vocabulary set, or rebuilt-on-write?
  4. Fix at IdentifiedConcept-creation time vs join-on-read? Fixing at IdentifiedConcept-creation time is cleanest (every downstream consumer gets correct labels for free) but means regenerating the wide parquet. Join-on-read is incremental but proliferates the workaround.
  5. Extension vocabs. The biology extension (`https://w3id.org/isample/biology/biosampledfeature/...`) and Earth Science extension live in different repos. Need a discovery list, or each vocab repo emits its own `vocab_labels` shard and we concatenate.
  6. What about `scheme_name` being NULL in current IdentifiedConcept rows? Same upstream bug; fixing prefLabel is a chance to fix scheme too.
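
For questions 2 and 5, one possible shape is sketched below: per-repo shards with one row per (uri, lang), concatenated with a conflict check, then filtered by language with an `en` fallback. Everything here is hypothetical: the `example.org` URIs, the labels, and the function names `concat_shards` and `labels_for` are made up for illustration and are not taken from the iSamples vocabularies.

```python
from itertools import chain

# Hypothetical shards, one per vocabulary repo; rows are (uri, lang, pref_label).
core_shard = [
    ("https://example.org/vocab/1.0/a", "en", "Label A"),
    ("https://example.org/vocab/1.0/a", "fr", "Libellé A"),
]
extension_shard = [
    ("https://example.org/ext/1.0/b", "en", "Label B"),
]


def concat_shards(*shards):
    """Merge shards into {(uri, lang): label}, rejecting conflicting rows."""
    merged = {}
    for uri, lang, label in chain.from_iterable(shards):
        key = (uri, lang)
        if key in merged and merged[key] != label:
            raise ValueError(f"conflicting labels for {key}")
        merged[key] = label
    return merged


def labels_for(merged, lang, fallback="en"):
    """Pick one label per URI: the requested language, else the fallback."""
    out = {}
    for (uri, row_lang), label in merged.items():
        if row_lang == lang:
            out[uri] = label            # exact match always wins
        elif row_lang == fallback:
            out.setdefault(uri, label)  # used only if no exact match is set
    return out


merged = concat_shards(core_shard, extension_shard)
print(labels_for(merged, "fr"))
```

With (uri, lang) as the key, `uri` alone stops being unique, so the PK in the column list above would need to become the pair if the multilingual layout is adopted.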

Affected Repos

Suggested First Step (low-risk)

Add a `vocab labels` subcommand to whichever `vocab_tools` we control that emits `vocab_labels.parquet` from a list of vocabulary TTL URLs. Publish to `data.isamples.org/vocab_labels.parquet`. Don't touch any other repo yet — that gives every consumer a stable artifact to JOIN against, and we decide the at-pipeline-time fix once the artifact exists.

cc @rdhyee — filed at your direction (2026-04-28); the Explorer fix is gated on this.

Labels: enhancement, infrastructure, needs-discussion
