feat: route load_tokenizer through fastokens by default #10

Open
hallerite wants to merge 3 commits into main from feat/fastokens-default

Conversation

@hallerite (Member) commented May 7, 2026

Summary

Add fastokens (Crusoe's Rust BPE tokenizer, ~10x faster encode) as a required dependency and patch it in by default for every supported model except a small denylist. The patch is bracketed around from_pretrained, so the loaded tokenizer keeps the fastokens shim while the user's process-global AutoTokenizer.from_pretrained stays vanilla.

Update (post-merge with main): re-audited the four denylisted models. The DeepSeek-V3 family still fails to load under fastokens 0.1.1 (Metaspace pretokenizer). The MiniMax-M2 family "divergence" turned out to be the symptom of a separate upstream bug in fastokens.unpatch_transformers() — investigation notes below the audit table. The denylist stays for now (conservative); plan is to revisit once the upstream fastokens fix lands.

Audit results — every entry of MODEL_RENDERER_MAP

Probe: 5-case encoding comparison (plain text, Lorem ipsum, emoji + CJK, literal <|im_start|>, 200-word long input), vanilla vs fastokens-patched, checked for byte-identical token IDs.
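
For reference, a minimal re-creation of the probe (the actual probe script isn't part of this diff; the case strings, the probe_model helper, and the import path below are illustrative stand-ins):

from renderers.base import MODEL_RENDERER_MAP, _patched_load, load_tokenizer  # assumed import path

CASES = [
    "The quick brown fox jumps over the lazy dog.",  # plain text
    "Lorem ipsum dolor sit amet, consectetur.",      # Lorem
    "🤖🎉 你好,世界",                                  # emoji + CJK
    "a literal <|im_start|> inside user content",    # special-token literal
    "word " * 200,                                   # 200-word long input
]

def probe_model(name):
    vanilla = load_tokenizer(name, use_fastokens=False)
    patched = _patched_load(name)  # force the fastokens path, even for denylisted entries
    for case in CASES:
        if vanilla.encode(case) != patched.encode(case):
            return "DIVERGENCE"
    return "PARITY"

for name in MODEL_RENDERER_MAP:
    try:
        print(name, probe_model(name))
    except Exception as exc:
        print(name, "LOAD ERROR:", exc)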

Result                Count   Models
Byte-identical        31/35   All Qwen3.x (10), Qwen3.5 (3), Qwen3.6 (1), Qwen3-VL (3), GLM-5 / 4.7 / 5.1 (3), GLM-4.5 family (2), Kimi-K2.x (3), Nemotron 3 (2), Llama-3.2 Instruct (2), gpt-oss (2)
Load error             2/35   deepseek-ai/DeepSeek-V3{,-Base} — fastokens 0.1.1 doesn't support the Metaspace pretokenizer
Apparent divergence    2/35   MiniMaxAI/MiniMax-M2{,.5} — see "MiniMax investigation" below

The 4 incompat models live in FASTOKENS_INCOMPATIBLE (frozenset in renderers/base.py) and skip the patch unconditionally. Unknown / fine-tuned models hit the patched path first and fall back to vanilla on any fastokens load error (logged at INFO).
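
Expanded from the audit table, the denylist is just four literals (a sketch; the canonical definition is the frozenset in renderers/base.py):

FASTOKENS_INCOMPATIBLE = frozenset({
    "deepseek-ai/DeepSeek-V3",
    "deepseek-ai/DeepSeek-V3-Base",
    "MiniMaxAI/MiniMax-M2",
    "MiniMaxAI/MiniMax-M2.5",
})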

Implementation notes

def load_tokenizer(model_name_or_path, *, use_fastokens=True):
    ...
    if not use_fastokens or model_name_or_path in FASTOKENS_INCOMPATIBLE:
        return AutoTokenizer.from_pretrained(...)
    try:
        return _patched_load(...)  # patches fastokens, loads, unpatches
    except Exception:
        logger.info("fastokens couldn't load %r; falling back ...")
        return AutoTokenizer.from_pretrained(...)

Per-call patch/unpatch keeps the side effect minimal: the returned tokenizer keeps the fastokens shim (because fastokens captures the backend at load time), but subsequent AutoTokenizer.from_pretrained calls outside load_tokenizer stay vanilla. Verified by test_patch_is_unloaded_after_call.
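
For context, _patched_load is just a try/finally bracket around the global patch, roughly (a sketch of the shape, not the exact body in renderers/base.py):

import fastokens
from transformers import AutoTokenizer

def _patched_load(model_name_or_path, **kwargs):
    fastokens.patch_transformers()
    try:
        # The tokenizer built here captures the fastokens backend at load
        # time, so it keeps the shim after the global patch is reverted.
        return AutoTokenizer.from_pretrained(model_name_or_path, **kwargs)
    finally:
        fastokens.unpatch_transformers()  # restore the process-global loader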

MiniMax "divergence" — actually an upstream fastokens bug

Deeper investigation after merging main:

  • fastokens, always: encode("<|im_start|>") → [60, 124, 324, 95, 10314, 109675] (6 tokens, _ + start unmerged)
  • vanilla in a clean Python process: same 6 tokens — byte-identical with fastokens
  • vanilla AFTER any patch_transformers() + unpatch_transformers() cycle in the same process: encode("<|im_start|>") → [60, 124, 324, 22242, 109675] (5 tokens, _start merged as id 22242)

The pollution comes from unpatch_transformers():

# patch:
_orig_fp = TokenizersBackend.from_pretrained   # inherited from PreTrainedTokenizerBase
                                               # → bound `method` object, NOT a classmethod descriptor
_originals["TokenizersBackend.from_pretrained"] = _orig_fp
TokenizersBackend.from_pretrained = _patched_from_pretrained

# unpatch:
TokenizersBackend.from_pretrained = _originals["TokenizersBackend.from_pretrained"]

from_pretrained lives on PreTrainedTokenizerBase, not on TokenizersBackend. Capturing it via attribute access returns a bound method; setattr-ing it back installs a stray attribute on TokenizersBackend.__dict__ that shadows the inherited classmethod. Subsequent loads see cls = TokenizersBackend rather than the actual tokenizer subclass, which steers MiniMax (declared tokenizer_class = 'GPT2Tokenizer') down a different load path that strips the declared NFC normalizer + Split-regex pre-tokenizer.

Fix is one line in fastokens — del TokenizersBackend.from_pretrained instead of setattr back. Filing upstream separately. With that fix, vanilla and fastokens produce byte-identical encodings for MiniMax-M2 / M2.5 across the probe suite, in any process load order.
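
The descriptor mix-up reproduces in isolation; Base, Backend, and Concrete below stand in for PreTrainedTokenizerBase, TokenizersBackend, and the actual tokenizer subclass:

class Base:
    @classmethod
    def from_pretrained(cls):
        return cls.__name__

class Backend(Base):      # inherits from_pretrained; Backend.__dict__ has no entry
    pass

class Concrete(Backend):
    pass

orig = Backend.from_pretrained          # attribute access: bound method, __self__ is Backend
Backend.from_pretrained = classmethod(lambda cls: "patched")
Backend.from_pretrained = orig          # "unpatch" via setattr: the bug
print(Concrete.from_pretrained())       # 'Backend': cls stays pinned to Backend
del Backend.from_pretrained             # the one-line fix
print(Concrete.from_pretrained())       # 'Concrete': inherited classmethod visible again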

The FASTOKENS_INCOMPATIBLE MiniMax entries stay in this PR — cheap insurance until the fastokens fix is released, and the conditional adds no overhead for callers using one of the 31 byte-identical models.

Separate-and-unrelated finding worth flagging: even with the fastokens bug fixed, AutoTokenizer.from_pretrained (training-side) and direct tokenizer.json loading (vLLM / sglang / fastokens) produce different tokenizers for MiniMax-M2 — the slow→fast conversion path strips state declared in tokenizer.json. This is a transformers / MiniMax model-card inconsistency, not a fastokens issue. Not in scope for this PR; flagging for visibility.
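
For visibility, a sketch of how to observe the two load paths side by side (assumes Hub access; the divergence claim is the one made above, not something this snippet proves in general):

from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from transformers import AutoTokenizer

name = "MiniMaxAI/MiniMax-M2"
training_side = AutoTokenizer.from_pretrained(name)   # slow→fast conversion path
serving_side = Tokenizer.from_file(                   # direct tokenizer.json path
    hf_hub_download(name, "tokenizer.json"))

probe = "a literal <|im_start|> inside user content"
print(training_side.encode(probe, add_special_tokens=False))
print(serving_side.encode(probe, add_special_tokens=False).ids)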

Tests

tests/test_load_tokenizer_fastokens.py — 10 cases pinning the policy:

  • FASTOKENS_INCOMPATIBLE exact-shape lock
  • Default path produces a fastokens-shim backend
  • use_fastokens=False produces a vanilla backend
  • Encode parity vanilla vs fastokens on Qwen3.5-9B (4 sample strings)
  • Each of the 4 incompat models loads via vanilla (skips the patch)
  • Patch leak test: a direct AutoTokenizer.from_pretrained outside load_tokenizer stays vanilla (condensed sketch after this list)
  • Simulated fastokens load failure falls back to vanilla cleanly
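
Condensed shape of the leak test (is_fastokens_backend and the model id are illustrative stand-ins for the real assertions):

def test_patch_is_unloaded_after_call():
    tok = load_tokenizer("Qwen/Qwen3.5-9B")
    assert is_fastokens_backend(tok)           # returned tokenizer keeps the shim

    outside = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")
    assert not is_fastokens_backend(outside)   # process-global loader stays vanilla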

Existing test suite passes unchanged post-merge: 1152 passed, 51 skipped, 1 xfailed (full run with main merged in, pytest tests/ --ignore=tests/test_client.py).

Test plan

  • pytest tests/test_load_tokenizer_fastokens.py — 10 cases pass
  • Full suite (pytest tests/ --ignore=tests/test_client.py) — 1152 pass, 51 skipped, 1 xfailed post-merge (no regressions)
  • Per-model byte parity probe over all 35 MODEL_RENDERER_MAP entries — 31 PARITY, 4 denylisted by deliberate review
  • Pre-commit hooks (ruff check + format) clean
  • Merge with main (uv.lock regenerated; renderers/base.py auto-merged cleanly alongside main's TRUSTED_REVISIONS Kimi-K2 changes)

🤖 Generated with Claude Code

hallerite and others added 3 commits May 7, 2026 18:07
Add ``fastokens`` (Crusoe's Rust BPE tokenizer, ~10x faster encode) as a
required dependency and patch it in by default for every supported
model except a small denylist. The patch is bracketed: ``patch`` →
``from_pretrained`` → ``unpatch``, so the loaded tokenizer keeps the
fastokens shim while the user's process-global
``AutoTokenizer.from_pretrained`` stays vanilla.

Empirically verified across all 35 entries in MODEL_RENDERER_MAP:

  31/35 byte-identical with vanilla on a 5-case encoding probe
        (plain, Lorem, emoji+CJK, special-token-literal text, long).
   2/35 fail to load: deepseek-ai/DeepSeek-V3{,-Base} — fastokens 0.1.1
        doesn't support the Metaspace pretokenizer.
   2/35 silently diverge on content containing literal ``<|im_start|>``-
        like text: MiniMaxAI/MiniMax-M2{,.5}.

The 4 incompat models live in FASTOKENS_INCOMPATIBLE and skip the patch
unconditionally. Unknown / fine-tuned models hit the patched fast path
first and fall back to vanilla on any fastokens load error (logged at
INFO).

Existing 900-test suite passes unchanged with fastokens patched
globally; new test_load_tokenizer_fastokens.py adds 10 cases pinning
the policy:

* FASTOKENS_INCOMPATIBLE exact-shape lock (forces deliberate review on changes)
* Default path produces a fastokens-shim backend
* ``use_fastokens=False`` produces a vanilla backend
* Encode parity vanilla vs fastokens on a representative model
* Each incompat model loads via vanilla (skips the patch)
* The patch doesn't leak: a direct AutoTokenizer.from_pretrained
  outside load_tokenizer stays vanilla
* Simulated fastokens load failure falls back to vanilla cleanly

No version bump (batched with the open Qwen3.5 / Llama-3 PRs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves conflict in uv.lock by regenerating against the merged
pyproject.toml (main's deps + fastokens>=0.1.1 from this branch).

The fastokens patch coexists cleanly with main's new
TRUSTED_REVISIONS map (Kimi-K2 family scoped trust_remote_code) — both
codepaths live in renderers/base.py and gate on disjoint conditions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pull in main's urllib3 bump (#35); resolve uv.lock by regenerating
against the merged pyproject.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>