fix(install): align tier name to registry canon + pass TIER to model-init + fail loud on unknown tier #1085

Open
joelteply wants to merge 26 commits into canary from fix/install-tier-name-divergence


Conversation

@joelteply
Contributor

Lane A PR-1 — addresses RTX 5090 silent-no-replies install root cause

Per @continuum-8e97's 2026-05-11 RTX VDD finding: the install seeded only voice models and no Qwen GGUF, so personas had no local provider to invoke, and the fail-hard rule (#1077) didn't fire because the silent skip happened pre-resolver in download-models.sh.

Root cause — two compounding bugs

(1) Tier-name divergence: install.sh sets CONTINUUM_TIER='primary' for 32GB+ Macs but src/shared/models.json + src/scripts/download-models.sh canon is 'full'. If 'primary' ever leaks to model-init's TIER env, the jq lookup auto_download.by_tier[primary] returns [] (silent), leaving install with always[] (voice/embedding/whisper/piper/kokoro/silero) only — no Qwen.
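The silent-empty jq behavior is easy to reproduce against a minimal registry-shaped fixture. Everything below is illustrative: the model names are made up, and the `// []` guard is an assumption about the script's exact jq expression (the real registry is src/shared/models.json).

```shell
#!/usr/bin/env bash
# Hypothetical minimal fixture mimicking the models.json auto_download shape.
registry='{"auto_download":{"by_tier":{"mba":["qwen-0.5b"],"mid":["qwen-7b"],"full":["qwen-14b"]}}}'

# Canonical tier: the lookup yields the model list.
full_models=$(echo "$registry" | jq -r '.auto_download.by_tier["full"] // [] | .[]')

# Divergent tier: by_tier["primary"] is null, the '// []' guard turns it into
# an empty list, and jq still exits 0 -- nothing downloaded, nothing reported.
primary_models=$(echo "$registry" | jq -r '.auto_download.by_tier["primary"] // [] | .[]')

echo "full:    ${full_models}"
echo "primary: '${primary_models}'"   # empty string, silently
```

The lookup never distinguishes "tier exists but is empty" from "tier name is wrong", which is exactly why change 3 validates the name up front.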

(2) Container /proc/meminfo blindspot: docker-compose.yml has mem_limit: ${MODEL_INIT_MEM:-2g} on model-init. download-models.sh:30 reads RAM from /proc/meminfo INSIDE the container. With cgroups-aware /proc/meminfo, that's the 2GB limit, NOT host RAM. Result: TIER auto-detects to mba regardless of host (RTX 5090 / 32GB+ Mac / 8GB MBA all see 2GB). Even when CONTINUUM_TIER isn't set externally, the in-container detection silently bottoms-out at the smallest tier.
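A sketch of why the in-container detection bottoms out. The `detect_tier` function and its GB thresholds are hypothetical, mirroring the MemTotal read at download-models.sh:30, not the script's actual cutoffs.

```shell
#!/usr/bin/env bash
# Hypothetical tier detection based on /proc/meminfo's MemTotal line.
detect_tier() {
  local kb gb
  kb=$(awk '/^MemTotal:/ {print $2}' "$1")
  gb=$((kb / 1024 / 1024))
  if   [ "$gb" -ge 24 ]; then echo full
  elif [ "$gb" -ge 12 ]; then echo mid
  else                        echo mba
  fi
}

# Host view: 64GB. Container view under mem_limit: 2g with a cgroups-aware
# /proc/meminfo: ~2GB -- so every host detects as the smallest tier.
printf 'MemTotal:       67108864 kB\n' > /tmp/meminfo.host
printf 'MemTotal:        2097152 kB\n' > /tmp/meminfo.container
detect_tier /tmp/meminfo.host        # full
detect_tier /tmp/meminfo.container   # mba, even on an RTX 5090 / 64GB host
```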

Changes — 3 files, single-purpose, additive

  1. install.sh: rename CONTINUUM_TIER='primary' → 'full' (single source of truth = src/shared/models.json tiers keys). Updates inline comment + case-stmt fallback default. Three textual occurrences of the legacy name converted; new comment block explaining the canon.

  2. docker-compose.yml: pass TIER=${CONTINUUM_TIER:-full} to model-init's env. Makes install.sh's hardware-tier choice flow through to the downloader. The :-full default guarantees headed installs (no install.sh) still pull the full multimodal Qwen set rather than bottoming-out at mba.

  3. src/scripts/download-models.sh: validate $TIER against {mba|mid|full} BEFORE the jq lookup. Unknown tier (e.g. residual 'primary' or any future divergence) errors loudly with the registry's actual valid set + the most likely cause. Per Joel's "no silent fallback to placeholder models" rule.
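The guard in change 3 can be sketched as follows; the function name and message wording are hypothetical, but the shape (validate before the jq lookup, fail loud with the valid set and likely cause) is the change described above.

```shell
#!/usr/bin/env bash
# Sketch of the change-3 guard: reject any tier outside the registry canon
# BEFORE the jq lookup, so an unknown tier fails loud instead of silently
# resolving to an empty model list.
validate_tier() {
  case "$1" in
    mba|mid|full) return 0 ;;
    *)
      echo "ERROR: unknown TIER='$1'. Valid tiers (src/shared/models.json): mba mid full" >&2
      echo "Likely cause: a legacy CONTINUUM_TIER value (e.g. 'primary') leaked into model-init's env" >&2
      return 1 ;;
  esac
}

validate_tier "${TIER:-full}"   # the real script exits non-zero here on divergence
```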

Validation

  • bash -n install.sh: syntax OK
  • bash -n src/scripts/download-models.sh: syntax OK
  • docker compose config --quiet: parses OK
  • Precommit hook (TypeScript build + browser ping): PASS

Out of scope (separate followups)

  • Lane B Docker volume/profile mechanics (@continuum-8e97 owns)
  • Verify #1077 ("refactor(persona): fail hard on missing model selection") fail-hard fires when NO local model present at runtime (currently fires only when SOME model fails to resolve; the silent skip pre-resolver may be a separate regression — Lane A PR-2 territory)
  • Linux install path doesn't set CONTINUUM_TIER; now defaults to full on non-Mac, which is right for Linux+RTX. MBA on Linux would need explicit env override — acceptable since the in-container /proc/meminfo bottom-out is now fail-loud rather than silent-mba

Cross-platform validation: continuum-8e97's RTX rerun should now either pull Qwen models (success) or fail loud at install with a tier-validation error (correct loud-fail). Either result confirms the silent-skip is gone.

🤖 Generated with Claude Code

joelteply added a commit that referenced this pull request May 11, 2026
…oad-avatar-models.sh (#1090)

Per the issue: third-party CDN failures (RTX install hit OpenGameArt curl
exit 11 = CURLE_FTP_WEIRD_PASS_REPLY on vroid-female-base.vrm) propagated
through `set -e` and exited the entire script, which made the model-init
container exit non-zero. Compounded with #1085 (tier-name canon) for the
"RTX install ships with no Qwen" symptom.

Fix shape per #1087's recommended Option A:
- Wrap each per-VRM curl/wget call in `set +e ... set -e` so a single
  download failure increments a FAILED counter instead of killing the
  script. The script-level `set -e` invariant is preserved everywhere
  else (jq, mkdir, mv, etc. still hard-fail on real bugs).
- Capture and log the actual curl exit code on each failure (Joel's
  "never swallow errors — evidence is for the debugger" rule). The
  warning includes the exit code, the failed name, and the source URL
  so the next debugger has everything they need.
- Run summary at the end emits a "DEGRADED" structured warning naming
  exactly which VRMs failed + the upstream cause (third-party CDN, not
  a Continuum bug) + the re-run command. Operator visibility, not
  silent suppression.
- Script unconditionally exits 0 — partial avatar set is acceptable
  (Bevy live mode degrades to whatever VRMs are present), and a
  third-party CDN blip should NOT block install. The summary above
  carries the diagnostic; downstream consumers see clean exit + warning.
- Bonus: replace hardcoded `8` with EXPECTED constant; quote tmpzip /
  tmpdir / vrm_file mktemp captures (shellcheck SC2155).
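The per-download wrapping described above can be sketched as below. `fetch_vrm` and the dead URL are hypothetical; the real loop lives in download-avatar-models.sh, and the real script unconditionally exits 0 after the summary.

```shell
#!/usr/bin/env bash
set -e   # script-level invariant preserved; only the fetch is exempted

FAILED=0
FAILED_NAMES=""

# Hypothetical per-VRM fetch mirroring Option A: a single CDN failure
# increments FAILED and logs the evidence instead of killing the script.
fetch_vrm() {  # fetch_vrm <name> <url> <dest>
  local rc
  set +e
  curl -fsSL -o "$3" "$2"
  rc=$?
  set -e
  if [ "$rc" -ne 0 ]; then
    # Never swallow the evidence: exit code + name + source URL.
    echo "WARN: download failed (curl exit $rc): $1 <- $2" >&2
    FAILED=$((FAILED + 1))
    FAILED_NAMES="$FAILED_NAMES $1"
  fi
}

# A dead URL degrades the run instead of aborting it:
fetch_vrm vroid-female-base 'file:///nonexistent/base.vrm' /tmp/base.vrm
if [ "$FAILED" -gt 0 ]; then
  echo "DEGRADED: $FAILED VRM(s) failed:$FAILED_NAMES (third-party CDN, not a Continuum bug)" >&2
fi
```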

Smoke-tested locally: MODELS_DIR=/tmp/avatar-smoke-test bash -x
download-avatar-models.sh → all 8 VRMs downloaded successfully on host
with working CDN + exit 0. Failure path code is symmetric (set +e capture
exit, log, increment FAILED, continue) — same shape proven by the
existing per-file failure handling in download-models.sh:115-124.

Closes #1087.

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply added a commit that referenced this pull request May 11, 2026
…typed error (#1089)

Lane A PR-2 — surfaces install-time-no-Qwen as observable runtime health
rather than process panic. Pairs with #1085 (install fix for the SOURCE
of the no-Qwen state) by making the runtime VISIBILITY of "no local
model loadable" testable + integrable.

Background: continuum-8e97 RTX 5090 install (2026-05-11) had cuda stack
ready, VRAM available, zero personas replying — root cause was no Qwen
GGUF seeded. The existing `LlamaCppAdapter::new()` would have panicked
with the right message, but it is constructed LAZILY (on the first
generate_text call). Personas silently skipped pre-resolver, so the panic
was never reached; the adapter never tried to load.

Changes:

- New typed error `NoLocalModelLoadable { provider_id, rows_in_registry,
  rows_with_gguf_local_path }` with thiserror Display naming the
  actionable remediation ("Install seeded no local Qwen GGUF — run
  model-init downloader or seed manually").

- New `LlamaCppAdapter::try_new() -> Result<Self, NoLocalModelLoadable>`:
  Result-returning variant. Boot-time health checks (continuum status,
  ai/status, install-time validators) MUST use this so an install with
  no Qwen seeded reports the typed error cleanly instead of crash-looping
  later when a persona attempts to invoke.

- New `LlamaCppAdapter::try_new_from<'a, I>(models: I)` pure variant
  taking a model iterator directly, mirroring my model_resolver.rs
  pattern. Lets tests assemble synthetic registries without going
  through the global() singleton. `try_new()` calls
  `try_new_from(global().models_for_provider("llamacpp-local"))`.

- Legacy `LlamaCppAdapter::new()` preserved (panics on err) — same
  observable behavior as before for callers that haven't migrated.

3 tests covering the contract:

- try_new_from_errors_when_no_llamacpp_local_rows: empty iterator →
  NoLocalModelLoadable with rows_in_registry=0, error message contains
  "model-init" remediation hint
- try_new_from_errors_when_llamacpp_rows_exist_but_none_have_gguf_path:
  registry has llamacpp-local rows but artifact resolver couldn't find
  any GGUF on disk → NoLocalModelLoadable with rows_in_registry=2,
  rows_with_gguf_local_path=0 (the RTX 5090 case Codex's #1085 +
  upstream model-init bug produces)
- try_new_from_succeeds_with_at_least_one_resolved_path: mixed registry
  (one resolved, one not) → adapter picks resolved row, model_path +
  default_model match

Validation:
- cargo test --features metal,accelerate -p continuum-core --lib
  inference::llamacpp_adapter: 3/3 pass

Out of scope (separate followups):
- Wire `try_new()` into a runtime boot health check (Lane A PR-3 or
  ai/status integration), surfaces the typed error to operators via
  jtag command output. PR-2 ships the primitive; integration is next.
- The artifact resolver behavior when explicit gguf path doesn't exist
  on disk — silently falls through to other resolvers (artifacts.rs:73).
  Worth a separate audit but doesn't change PR-2's contract.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size: XL and removed size: M labels May 11, 2026
@joelteply joelteply force-pushed the fix/install-tier-name-divergence branch 7 times, most recently from 7331be6 to faa5827 Compare May 12, 2026 05:36
joelteply and others added 13 commits May 13, 2026 14:31
… fail loud on unknown tier

ROOT CAUSE for 2026-05-11 RTX 5090 silent-no-replies finding (continuum-8e97
VDD report): install seeded only voice-models, no Qwen GGUF, personas had
no local provider to invoke, fail-hard rule (#1077) didn't fire because the
silent skip happened pre-resolver in download-models.sh.

Two compounding bugs:

(1) Tier-name divergence — install.sh sets CONTINUUM_TIER='primary' for
32GB+ Macs but src/shared/models.json + src/scripts/download-models.sh
canon is 'full'. If 'primary' ever leaks to model-init's TIER env, the jq
lookup `auto_download.by_tier[primary]` returns [] (silent), leaving install
with always[] (voice/embedding/whisper/piper/kokoro/silero) only — no Qwen.

(2) Container /proc/meminfo blindspot — docker-compose has
`mem_limit: ${MODEL_INIT_MEM:-2g}` on model-init. download-models.sh:30 reads
RAM from /proc/meminfo INSIDE the container. With cgroups-aware /proc/meminfo,
that's the 2GB limit, NOT host RAM. Result: TIER auto-detects to `mba`
regardless of host (RTX 5090 / 32GB+ Mac / 8GB MBA all see 2GB). Even when
CONTINUUM_TIER isn't set externally, in-container detection silently
bottoms-out at the smallest tier.

Three changes — all single-purpose, additive (no semantic shifts elsewhere):

1. install.sh: rename CONTINUUM_TIER='primary' → 'full' (single source of
   truth = src/shared/models.json `tiers` keys). Updates inline comment +
   case-stmt fallback default. Three textual occurrences of the legacy
   name converted to the canonical name plus a note in the comment block
   explaining why.

2. docker-compose.yml: pass `TIER=${CONTINUUM_TIER:-full}` to model-init's
   env. Makes install.sh's hardware-tier choice flow through to the
   downloader instead of having the container guess from its own
   /proc/meminfo. The `:-full` default guarantees headed installs (no
   install.sh) still pull the full multimodal Qwen set rather than
   bottoming-out at mba.

3. src/scripts/download-models.sh: validate $TIER against {mba|mid|full}
   BEFORE the jq lookup. Unknown tier (e.g. residual 'primary' or any
   future divergence) errors loudly with the registry's actual valid
   set + the most likely cause. Per Joel's "no silent fallback to
   placeholder models" rule.

Validation:
- bash -n install.sh: syntax OK
- bash -n src/scripts/download-models.sh: syntax OK
- docker compose config --quiet: parses OK

Out of scope (separate followups):
- Lane B Docker volume/profile mechanics (continuum-8e97 owns)
- Verify #1077 fail-hard fires when NO local model present at runtime
- Linux install path doesn't set CONTINUUM_TIER; now defaults to `full`
  on non-Mac, which is right for Linux+RTX. MBA on Linux would need
  explicit env override — acceptable since the in-container
  /proc/meminfo bottom-out is now fail-loud rather than silent-mba

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joelteply joelteply force-pushed the fix/install-tier-name-divergence branch from 61bdeb4 to 7ac34d3 Compare May 13, 2026 19:39
joelteply pushed a commit that referenced this pull request May 14, 2026
…tion (#1120)

Symptom from #1120 (claude-tab-2 reported, validating PR #1085):
  npm test -- --runTestsByPath src/tests/unit/seed-install-tier.test.ts
fails before any test runs because src/scripts/test-with-server.ts
imports a non-existent './system-startup' module.

The canonical entry for npm-test mode lives at
src/system/core/SystemOrchestrator.ts as
SystemOrchestration.forTesting() — same factory used by the rest of
the testing path. Update the import + replace startSystem('npm-test')
with the canonical call. Loud-throw on failure so test runs surface
startup errors rather than silently mis-behaving.

Validation: npm run build:ts passes clean. Hooks ran without --no-verify.

Card: continuum#1120.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>