feat: route load_tokenizer through fastokens by default #10

Open
hallerite wants to merge 3 commits into main from feat/fastokens-default

Conversation

@hallerite (Member) commented May 7, 2026

Summary

Add fastokens (Crusoe's Rust BPE tokenizer, ~10x faster encode) as a required dependency and patch it in by default for every supported model except a small denylist. The patch is bracketed around from_pretrained, so the loaded tokenizer keeps the fastokens shim while the user's process-global AutoTokenizer.from_pretrained stays vanilla.

Update (post-merge with main): re-audited the four denylisted models. The DeepSeek-V3 family still fails to load under fastokens 0.1.1 (Metaspace pretokenizer). The MiniMax-M2 family "divergence" turned out to be the symptom of a separate upstream bug in fastokens.unpatch_transformers() — investigation notes below the audit table. The denylist stays for now (conservative); plan is to revisit once the upstream fastokens fix lands.

Audit results — every entry of MODEL_RENDERER_MAP

Probe: 5-case encoding comparison (plain text, Lorem ipsum, emoji + CJK, literal <|im_start|>, 200-word long input), vanilla vs fastokens-patched, checked for byte-identical token IDs.
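
For reference, a minimal re-creation of the probe (the actual probe script isn't part of this diff; the case strings, the probe_model helper, and the import path below are illustrative stand-ins):

from renderers.base import MODEL_RENDERER_MAP, _patched_load, load_tokenizer  # assumed import path

CASES = [
    "The quick brown fox jumps over the lazy dog.",  # plain text
    "Lorem ipsum dolor sit amet, consectetur.",      # Lorem
    "🤖🎉 你好,世界",                                  # emoji + CJK
    "a literal <|im_start|> inside user content",    # special-token literal
    "word " * 200,                                   # 200-word long input
]

def probe_model(name):
    vanilla = load_tokenizer(name, use_fastokens=False)
    patched = _patched_load(name)  # force the fastokens path, even for denylisted entries
    for case in CASES:
        if vanilla.encode(case) != patched.encode(case):
            return "DIVERGENCE"
    return "PARITY"

for name in MODEL_RENDERER_MAP:
    try:
        print(name, probe_model(name))
    except Exception as exc:
        print(name, "LOAD ERROR:", exc)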

Result                Count   Models
Byte-identical        31/35   All Qwen3.x (10), Qwen3.5 (3), Qwen3.6 (1), Qwen3-VL (3), GLM-5 / 4.7 / 5.1 (3), GLM-4.5 family (2), Kimi-K2.x (3), Nemotron 3 (2), Llama-3.2 Instruct (2), gpt-oss (2)
Load error             2/35   deepseek-ai/DeepSeek-V3{,-Base} — fastokens 0.1.1 doesn't support the Metaspace pretokenizer
Apparent divergence    2/35   MiniMaxAI/MiniMax-M2{,.5} — see "MiniMax investigation" below

The 4 incompat models live in FASTOKENS_INCOMPATIBLE (frozenset in renderers/base.py) and skip the patch unconditionally. Unknown / fine-tuned models hit the patched path first and fall back to vanilla on any fastokens load error (logged at INFO).
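
Expanded from the audit table, the denylist is just four literals (a sketch; the canonical definition is the frozenset in renderers/base.py):

FASTOKENS_INCOMPATIBLE = frozenset({
    "deepseek-ai/DeepSeek-V3",
    "deepseek-ai/DeepSeek-V3-Base",
    "MiniMaxAI/MiniMax-M2",
    "MiniMaxAI/MiniMax-M2.5",
})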

Implementation notes

def load_tokenizer(model_name_or_path, *, use_fastokens=True):
    ...
    if not use_fastokens or model_name_or_path in FASTOKENS_INCOMPATIBLE:
        return AutoTokenizer.from_pretrained(...)
    try:
        return _patched_load(...)  # patches fastokens, loads, unpatches
    except Exception:
        logger.info("fastokens couldn't load %r; falling back ...")
        return AutoTokenizer.from_pretrained(...)

Per-call patch/unpatch keeps the side effect minimal: the returned tokenizer keeps the fastokens shim (because fastokens captures the backend at load time), but subsequent AutoTokenizer.from_pretrained calls outside load_tokenizer stay vanilla. Verified by test_patch_is_unloaded_after_call.
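
For context, _patched_load is just a try/finally bracket around the global patch, roughly (a sketch of the shape, not the exact body in renderers/base.py):

import fastokens
from transformers import AutoTokenizer

def _patched_load(model_name_or_path, **kwargs):
    fastokens.patch_transformers()
    try:
        # The tokenizer built here captures the fastokens backend at load
        # time, so it keeps the shim after the global patch is reverted.
        return AutoTokenizer.from_pretrained(model_name_or_path, **kwargs)
    finally:
        fastokens.unpatch_transformers()  # restore the process-global loader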

MiniMax "divergence" — actually an upstream fastokens bug

Deeper investigation after merging main:

  • fastokens, always: encode("<|im_start|>") → [60, 124, 324, 95, 10314, 109675] (6 tokens, _ + start unmerged)
  • vanilla in a clean Python process: same 6 tokens — byte-identical with fastokens
  • vanilla AFTER any patch_transformers() + unpatch_transformers() cycle in the same process: encode("<|im_start|>") → [60, 124, 324, 22242, 109675] (5 tokens, _start merged as id 22242)

The pollution comes from unpatch_transformers():

# patch:
_orig_fp = TokenizersBackend.from_pretrained   # inherited from PreTrainedTokenizerBase
                                               # → bound `method` object, NOT a classmethod descriptor
_originals["TokenizersBackend.from_pretrained"] = _orig_fp
TokenizersBackend.from_pretrained = _patched_from_pretrained

# unpatch:
TokenizersBackend.from_pretrained = _originals["TokenizersBackend.from_pretrained"]

from_pretrained lives on PreTrainedTokenizerBase, not on TokenizersBackend. Capturing it via attribute access returns a bound method; setattr-ing it back installs a stray attribute on TokenizersBackend.__dict__ that shadows the inherited classmethod. Subsequent loads see cls = TokenizersBackend rather than the actual tokenizer subclass, which steers MiniMax (declared tokenizer_class = 'GPT2Tokenizer') down a different load path that strips the declared NFC normalizer + Split-regex pre-tokenizer.

Fix is one line in fastokens — del TokenizersBackend.from_pretrained instead of setattr back. Filing upstream separately. With that fix, vanilla and fastokens produce byte-identical encodings for MiniMax-M2 / M2.5 across the probe suite, in any process load order.
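
The descriptor mix-up reproduces in isolation; Base, Backend, and Concrete below stand in for PreTrainedTokenizerBase, TokenizersBackend, and the actual tokenizer subclass:

class Base:
    @classmethod
    def from_pretrained(cls):
        return cls.__name__

class Backend(Base):      # inherits from_pretrained; Backend.__dict__ has no entry
    pass

class Concrete(Backend):
    pass

orig = Backend.from_pretrained          # attribute access: bound method, __self__ is Backend
Backend.from_pretrained = classmethod(lambda cls: "patched")
Backend.from_pretrained = orig          # "unpatch" via setattr: the bug
print(Concrete.from_pretrained())       # 'Backend': cls stays pinned to Backend
del Backend.from_pretrained             # the one-line fix
print(Concrete.from_pretrained())       # 'Concrete': inherited classmethod visible again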

The FASTOKENS_INCOMPATIBLE MiniMax entries stay in this PR — cheap insurance until the fastokens fix is released, and the conditional adds no overhead for callers using one of the 31 byte-identical models.

Separate-and-unrelated finding worth flagging: even with the fastokens bug fixed, AutoTokenizer.from_pretrained (training-side) and direct tokenizer.json loading (vLLM / sglang / fastokens) produce different tokenizers for MiniMax-M2 — the slow→fast conversion path strips state declared in tokenizer.json. This is a transformers / MiniMax model-card inconsistency, not a fastokens issue. Not in scope for this PR; flagging for visibility.
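
For visibility, a sketch of how to observe the two load paths side by side (assumes Hub access; the divergence claim is the one made above, not something this snippet proves in general):

from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from transformers import AutoTokenizer

name = "MiniMaxAI/MiniMax-M2"
training_side = AutoTokenizer.from_pretrained(name)   # slow→fast conversion path
serving_side = Tokenizer.from_file(                   # direct tokenizer.json path
    hf_hub_download(name, "tokenizer.json"))

probe = "a literal <|im_start|> inside user content"
print(training_side.encode(probe, add_special_tokens=False))
print(serving_side.encode(probe, add_special_tokens=False).ids)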

Tests

tests/test_load_tokenizer_fastokens.py — 10 cases pinning the policy:

  • FASTOKENS_INCOMPATIBLE exact-shape lock
  • Default path produces a fastokens-shim backend
  • use_fastokens=False produces a vanilla backend
  • Encode parity vanilla vs fastokens on Qwen3.5-9B (4 sample strings)
  • Each of the 4 incompat models loads via vanilla (skips the patch)
  • Patch leak test: a direct AutoTokenizer.from_pretrained outside load_tokenizer stays vanilla (condensed sketch after this list)
  • Simulated fastokens load failure falls back to vanilla cleanly
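
Condensed shape of the leak test (is_fastokens_backend and the model id are illustrative stand-ins for the real assertions):

def test_patch_is_unloaded_after_call():
    tok = load_tokenizer("Qwen/Qwen3.5-9B")
    assert is_fastokens_backend(tok)           # returned tokenizer keeps the shim

    outside = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")
    assert not is_fastokens_backend(outside)   # process-global loader stays vanilla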

Existing test suite passes unchanged post-merge: 1152 passed, 51 skipped, 1 xfailed (full run with main merged in, pytest tests/ --ignore=tests/test_client.py).

Test plan

  • pytest tests/test_load_tokenizer_fastokens.py — 10 cases pass
  • Full suite (pytest tests/ --ignore=tests/test_client.py) — 1152 pass, 51 skipped, 1 xfailed post-merge (no regressions)
  • Per-model byte parity probe over all 35 MODEL_RENDERER_MAP entries — 31 PARITY, 4 denylisted by deliberate review
  • Pre-commit hooks (ruff check + format) clean
  • Merge with main (uv.lock regenerated; renderers/base.py auto-merged cleanly alongside main's TRUSTED_REVISIONS Kimi-K2 changes)

🤖 Generated with Claude Code

hallerite and others added 3 commits May 7, 2026 18:07
Add ``fastokens`` (Crusoe's Rust BPE tokenizer, ~10x faster encode) as a
required dependency and patch it in by default for every supported
model except a small denylist. The patch is bracketed: ``patch`` →
``from_pretrained`` → ``unpatch``, so the loaded tokenizer keeps the
fastokens shim while the user's process-global
``AutoTokenizer.from_pretrained`` stays vanilla.

Empirically verified across all 35 entries in MODEL_RENDERER_MAP:

  31/35 byte-identical with vanilla on a 5-case encoding probe
        (plain, Lorem, emoji+CJK, special-token-literal text, long).
   2/35 fail to load: deepseek-ai/DeepSeek-V3{,-Base} — fastokens 0.1.1
        doesn't support the Metaspace pretokenizer.
   2/35 silently diverge on content containing literal ``<|im_start|>``-
        like text: MiniMaxAI/MiniMax-M2{,.5}.

The 4 incompat models live in FASTOKENS_INCOMPATIBLE and skip the patch
unconditionally. Unknown / fine-tuned models hit the patched fast path
first and fall back to vanilla on any fastokens load error (logged at
INFO).

Existing 900-test suite passes unchanged with fastokens patched
globally; new test_load_tokenizer_fastokens.py adds 10 cases pinning
the policy:

* FASTOKENS_INCOMPATIBLE exact-shape lock (forces deliberate review on changes)
* Default path produces a fastokens-shim backend
* ``use_fastokens=False`` produces a vanilla backend
* Encode parity vanilla vs fastokens on a representative model
* Each incompat model loads via vanilla (skips the patch)
* The patch doesn't leak: a direct AutoTokenizer.from_pretrained
  outside load_tokenizer stays vanilla
* Simulated fastokens load failure falls back to vanilla cleanly

No version bump (batched with the open Qwen3.5 / Llama-3 PRs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves conflict in uv.lock by regenerating against the merged
pyproject.toml (main's deps + fastokens>=0.1.1 from this branch).

The fastokens patch coexists cleanly with main's new
TRUSTED_REVISIONS map (Kimi-K2 family scoped trust_remote_code) — both
codepaths live in renderers/base.py and gate on disjoint conditions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pull in main's urllib3 bump (#35); resolve uv.lock by regenerating
against the merged pyproject.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>