Problem
On the Interactive Explorer, the Material / Sampled Feature / Specimen Type facets render raw vocabulary URIs as their labels:
https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite (1,059,025)
https://w3id.org/isample/biology/biosampledfeature/1.0/Animalia (188,361)
The Source facet is fine (short codes — SESAR, OPENCONTEXT, GEOME, SMITHSONIAN). Everything that comes through the SKOS vocabularies is broken.
Root Cause
The bug lives upstream of the Explorer. Sampling the wide parquet:
SELECT row_id, label, scheme_name
FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')
WHERE otype = 'IdentifiedConcept'
LIMIT 5;
returns:
(19401546, 'https://w3id.org/isample/vocabulary/materialsampleobjecttype/1.0/wholeorganism', NULL)
(19401547, 'https://w3id.org/isample/opencontext/materialsampleobjecttype/0.1/tile', NULL)
(19401548, 'https://w3id.org/isample/opencontext/materialsampleobjecttype/0.1/clothing', NULL)
So `IdentifiedConcept.label` already holds the URI rather than the SKOS `prefLabel`, and `scheme_name` is NULL. By the time `pqg facet-summaries` runs (pqg/__main__.py @ aefd465), it does:
SELECT 'material' AS facet_type, c.label AS facet_value, c.scheme_name AS scheme, COUNT(*) AS count
FROM (...) JOIN (... otype = 'IdentifiedConcept') c ON c.row_id = s.material_id
and faithfully copies the URI through into `facet_summaries.parquet`. The Explorer's OJS code (tutorials/isamples_explorer.qmd:119-174) renders `r.value` directly.
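One way to confirm the symptom in any extract is to flag labels that still look like URIs. The sketch below runs that check over the three rows sampled above. The helper name `is_unresolved_label` is hypothetical, and treating any `http(s)://` prefix as "unresolved" is an assumption, not something the pipeline does today.

```python
from typing import Optional

# Rows copied from the sample query output above: (row_id, label, scheme_name).
SAMPLED_ROWS: list[tuple[int, str, Optional[str]]] = [
    (19401546, "https://w3id.org/isample/vocabulary/materialsampleobjecttype/1.0/wholeorganism", None),
    (19401547, "https://w3id.org/isample/opencontext/materialsampleobjecttype/0.1/tile", None),
    (19401548, "https://w3id.org/isample/opencontext/materialsampleobjecttype/0.1/clothing", None),
]


def is_unresolved_label(label: Optional[str]) -> bool:
    """Return True when a label is missing or still a raw URI (assumed heuristic)."""
    if not label:
        return True
    return label.startswith(("http://", "https://"))


unresolved = [row_id for row_id, label, _ in SAMPLED_ROWS if is_unresolved_label(label)]
missing_scheme = [row_id for row_id, _, scheme in SAMPLED_ROWS if scheme is None]

print(f"{len(unresolved)}/{len(SAMPLED_ROWS)} labels unresolved")
print(f"{len(missing_scheme)}/{len(SAMPLED_ROWS)} rows missing scheme_name")
```

A check along these lines could also serve as a regression test once the upstream fix lands: both counts should drop to zero.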
Why a Quick Fix Isn't Right
Easy options on the table:
- In-page string mangling (`x.split('/').pop()`): renders `pasthumanoccupationsite`, not `Past Human Occupation Site`. Loses fidelity. Doesn't help notebooks or future consumers.
- Bake labels into `facet_summaries.parquet` at build time only: helps the Explorer but not the wide parquet, narrow parquet, or any Python notebook that touches `IdentifiedConcept.label`.
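To make the fidelity loss in the first option concrete, here is a minimal Python sketch comparing the last-path-segment trick with a lookup. The one-entry `PREF_LABELS` dict and both function names are hypothetical stand-ins for whatever lookup is eventually published.

```python
URI = "https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite"


def last_segment(uri: str) -> str:
    """Python equivalent of the in-page x.split('/').pop() workaround."""
    return uri.rsplit("/", 1)[-1]


# Hypothetical one-entry lookup; a real one would be generated from the vocabularies.
PREF_LABELS = {URI: "Past Human Occupation Site"}


def display_label(uri: str) -> str:
    """Prefer the curated label; fall back to the last path segment."""
    return PREF_LABELS.get(uri, last_segment(uri))


print(last_segment(URI))   # pasthumanoccupationsite
print(display_label(URI))  # Past Human Occupation Site
```

The word boundaries and capitalization in the curated label cannot be recovered from the URI alone, which is why string mangling can only ever be a fallback.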
Per @rdhyee on 2026-04-28: "let's do (2) properly and make sure the work we do is useful throughout our work (Quarto, and Python, etc)."
Proposed Approach: canonical `vocab_labels` artifact
Build a single, canonical lookup artifact from the SKOS vocabularies (already at https://github.com/isamplesorg/vocabularies and consumed by `vocab_tools` in `scripts/generate_vocab_docs.sh`):
`vocab_labels.parquet` (or .csv) with columns:
- `uri` (PK)
- `pref_label` (SKOS prefLabel, en)
- `scheme` (the vocabulary it belongs to)
- `definition` (skos:definition, optional but cheap to include)
- `alt_labels` (list, optional)
Hosted at `data.isamples.org/vocab_labels.parquet` (versioned + `/current/` alias, like the other parquets).
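As a rough illustration of the row shape, the sketch below models the columns listed above as a dataclass and serializes them to the CSV variant using only the standard library. It assumes one English row per URI, joins `alt_labels` with `|`, and uses a placeholder `scheme` value; the class and function names are hypothetical, and parsing the TTL sources is out of scope here.

```python
import csv
import io
from dataclasses import asdict, dataclass

FIELDS = ["uri", "pref_label", "scheme", "definition", "alt_labels"]


@dataclass(frozen=True)
class VocabLabel:
    uri: str                          # primary key
    pref_label: str                   # SKOS prefLabel (en)
    scheme: str                       # vocabulary the concept belongs to
    definition: str = ""              # skos:definition, optional
    alt_labels: tuple[str, ...] = ()  # optional


def to_csv(rows: list[VocabLabel]) -> str:
    """Serialize rows to CSV, rejecting duplicate URIs since uri is the key."""
    seen: set[str] = set()
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row.uri in seen:
            raise ValueError(f"duplicate uri: {row.uri}")
        seen.add(row.uri)
        record = asdict(row)
        record["alt_labels"] = "|".join(row.alt_labels)  # assumed list encoding
        writer.writerow(record)
    return buf.getvalue()


example = VocabLabel(
    uri="https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite",
    pref_label="Past Human Occupation Site",
    scheme="sampledfeature",  # placeholder value, not the real scheme identifier
)
csv_text = to_csv([example])
print(csv_text)
```

Enforcing the uniqueness of `uri` at write time keeps the artifact safe to JOIN against without fan-out.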
Consumers
| Consumer | Use |
| --- | --- |
| `pqg` SQL converter | Populate `IdentifiedConcept.label` with prefLabel during PQG conversion (replaces the URI). Keep URI in a separate `uri` column. |
| `pqg facet-summaries` | No change needed if upstream is fixed; otherwise LEFT JOIN onto `vocab_labels`. |
| Explorer (Quarto/OJS) | Optional secondary lookup if we want short-term fix without regenerating wide parquet. |
| isamples-python notebooks | Loaded as a small DataFrame for any analysis touching concepts. |
| Future React / detail-panel UIs | JSON dump (`{uri: prefLabel}`) ships from same source. |
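The LEFT JOIN fallback mentioned for `pqg facet-summaries` might look like the sketch below. It uses the standard-library `sqlite3` module purely as a stand-in for the real engine, and the two in-memory tables are simplified assumptions rather than the actual parquet schemas. Only one of the two URIs has a label row, to show the fallback path.

```python
import sqlite3

FEATURE_URI = "https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite"
ANIMALIA_URI = "https://w3id.org/isample/biology/biosampledfeature/1.0/Animalia"

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE facet_summaries (facet_value TEXT, n INTEGER);
    CREATE TABLE vocab_labels (uri TEXT PRIMARY KEY, pref_label TEXT);
    """
)
con.executemany(
    "INSERT INTO facet_summaries VALUES (?, ?)",
    [(FEATURE_URI, 1059025), (ANIMALIA_URI, 188361)],
)
# Deliberately no row for ANIMALIA_URI, to exercise the fallback.
con.execute(
    "INSERT INTO vocab_labels VALUES (?, ?)",
    (FEATURE_URI, "Past Human Occupation Site"),
)

# LEFT JOIN keeps every facet row; COALESCE falls back to the raw URI
# when the lookup has no entry for it.
rows = con.execute(
    """
    SELECT f.facet_value AS uri,
           COALESCE(v.pref_label, f.facet_value) AS label,
           f.n
    FROM facet_summaries AS f
    LEFT JOIN vocab_labels AS v ON v.uri = f.facet_value
    ORDER BY f.n DESC
    """
).fetchall()

for uri, label, n in rows:
    print(f"{label} ({n:,})")
```

Keeping the URI column alongside the label means consumers can still filter on the stable identifier while displaying the readable name.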
Open Design Questions
- Where does it get built? Most natural home is `isamplesorg/vocabularies` (publish prefLabel CSV alongside the TTL) or `vocab_tools` (add a `vocab labels` subcommand that emits parquet/csv). Voting weakly for `vocab_tools` since it already parses the same TTLs.
- Multilingual? SKOS prefLabels can be tagged. Default `en`, ship one row per (uri, lang) so consumers can filter.
- Versioning? Vocabularies are versioned in their URIs (`.../1.0/...`). Should `vocab_labels.parquet` be versioned per release of the vocabulary set, or rebuilt-on-write?
- Fix at IdentifiedConcept-creation time vs join-on-read? Fixing at IdentifiedConcept-creation time is cleanest (every downstream consumer gets correct labels for free) but means regenerating the wide parquet. Join-on-read is incremental but proliferates the workaround.
- Extension vocabs. The biology extension (`https://w3id.org/isample/biology/biosampledfeature/...`) and Earth Science extension live in different repos. Need a discovery list, or each vocab repo emits its own `vocab_labels` shard and we concatenate.
- What about `scheme_name` being NULL in current IdentifiedConcept rows? Same upstream bug; fixing prefLabel is a chance to fix scheme too.
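Two of the questions above (one row per (uri, lang), and per-repo shards that get concatenated) can be sketched together. The code below is a minimal illustration under those assumptions. The function names are hypothetical, and every label except `Past Human Occupation Site` is an explicit placeholder, not real vocabulary content.

```python
from typing import Iterable

Row = tuple[str, str, str]  # (uri, lang, pref_label)

CORE_URI = "https://w3id.org/isample/vocabulary/sampledfeature/1.0/pasthumanoccupationsite"
BIO_URI = "https://w3id.org/isample/biology/biosampledfeature/1.0/Animalia"

core_shard: list[Row] = [
    (CORE_URI, "en", "Past Human Occupation Site"),
    (CORE_URI, "fr", "<fr label placeholder>"),  # placeholder, not a real translation
]
bio_shard: list[Row] = [
    (BIO_URI, "en", "<en label placeholder>"),  # placeholder, real prefLabel not known here
]


def merge_shards(shards: Iterable[Iterable[Row]]) -> dict[tuple[str, str], str]:
    """Concatenate shards, failing on conflicting labels for the same (uri, lang)."""
    merged: dict[tuple[str, str], str] = {}
    for shard in shards:
        for uri, lang, label in shard:
            key = (uri, lang)
            if key in merged and merged[key] != label:
                raise ValueError(f"conflicting labels for {key}")
            merged[key] = label
    return merged


def labels_for(merged: dict[tuple[str, str], str], lang: str,
               default_lang: str = "en") -> dict[str, str]:
    """Pick the requested language per URI, falling back to default_lang."""
    out: dict[str, str] = {}
    for (uri, row_lang), label in merged.items():
        if row_lang == lang:
            out[uri] = label           # exact language match always wins
        elif row_lang == default_lang:
            out.setdefault(uri, label)  # fallback only if nothing chosen yet
    return out


merged = merge_shards([core_shard, bio_shard])
print(labels_for(merged, "fr"))
```

Failing loudly on conflicting (uri, lang) pairs is one possible policy; a discovery list with an explicit precedence order would be an alternative.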
Affected Repos
Suggested First Step (low-risk)
Add a `vocab labels` subcommand to whichever `vocab_tools` we control that emits `vocab_labels.parquet` from a list of vocabulary TTL URLs. Publish to `data.isamples.org/vocab_labels.parquet`. Don't touch any other repo yet — that gives every consumer a stable artifact to JOIN against, and we decide the at-pipeline-time fix once the artifact exists.
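For discussion, the command-line surface of such a subcommand could be sketched with `argparse` as below. This is not the existing `vocab_tools` CLI: the `--output` and `--lang` flags, their defaults, and the positional `TTL_URL` arguments are all hypothetical, and the example URL is a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical `vocab labels` command line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="vocab")
    sub = parser.add_subparsers(dest="command", required=True)

    labels = sub.add_parser("labels", help="emit a uri -> prefLabel lookup table")
    labels.add_argument("ttl_urls", nargs="+", metavar="TTL_URL",
                        help="vocabulary TTL URLs to read")
    labels.add_argument("--output", default="vocab_labels.parquet",
                        help="path of the artifact to write")
    labels.add_argument("--lang", default="en",
                        help="language tag of the prefLabels to keep")
    return parser


# Placeholder URL; a real invocation would list the published vocabulary TTLs.
args = build_parser().parse_args(
    ["labels", "--output", "vocab_labels.parquet", "https://example.org/vocab.ttl"]
)
print(args.command, args.output, args.lang, args.ttl_urls)
```

Taking the TTL URLs as positional arguments keeps the extension-vocabulary question open: a discovery list could simply be expanded into that argument list.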
cc @rdhyee — filed at your direction (2026-04-28); the Explorer fix is gated on this.