feat: route load_tokenizer through fastokens by default #10
Open
hallerite wants to merge 3 commits into
Conversation
Add ``fastokens`` (Crusoe's Rust BPE tokenizer, ~10x faster encode) as a
required dependency and patch it in by default for every supported
model except a small denylist. The patch is bracketed: ``patch`` →
``from_pretrained`` → ``unpatch``, so the loaded tokenizer keeps the
fastokens shim while the user's process-global
``AutoTokenizer.from_pretrained`` stays vanilla.
Empirically verified across all 35 entries in MODEL_RENDERER_MAP:
31/35 byte-identical with vanilla on a 5-case encoding probe
(plain, Lorem, emoji+CJK, special-token-literal text, long).
2/35 fail to load: deepseek-ai/DeepSeek-V3{,-Base} — fastokens 0.1.1
doesn't support the Metaspace pretokenizer.
2/35 silently diverge on content containing literal ``<|im_start|>``-
like text: MiniMaxAI/MiniMax-M2{,.5}.
The 4 incompat models live in FASTOKENS_INCOMPATIBLE and skip the patch
unconditionally. Unknown / fine-tuned models hit the patched fast path
first and fall back to vanilla on any fastokens load error (logged at
INFO).
Existing 900-test suite passes unchanged with fastokens patched
globally; new test_load_tokenizer_fastokens.py adds 10 cases pinning
the policy:
* FASTOKENS_INCOMPATIBLE exact shape (changes require deliberate review)
* Default path produces a fastokens-shim backend
* ``use_fastokens=False`` produces a vanilla backend
* Encode parity vanilla vs fastokens on a representative model
* Each incompat model loads via vanilla (skips the patch)
* The patch doesn't leak: a direct AutoTokenizer.from_pretrained
outside load_tokenizer stays vanilla
* Simulated fastokens load failure falls back to vanilla cleanly
No version bump (batched with the open Qwen3.5 / Llama-3 PRs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflict in uv.lock by regenerating against the merged pyproject.toml (main's deps + fastokens>=0.1.1 from this branch). The fastokens patch coexists cleanly with main's new TRUSTED_REVISIONS map (Kimi-K2 family scoped trust_remote_code) — both codepaths live in renderers/base.py and gate on disjoint conditions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull in main's urllib3 bump (#35); resolve uv.lock by regenerating against the merged pyproject.toml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Add `fastokens` (Crusoe's Rust BPE tokenizer, ~10x faster encode) as a required dependency and patch it in by default for every supported model except a small denylist. The patch is bracketed around `from_pretrained`, so the loaded tokenizer keeps the fastokens shim while the user's process-global `AutoTokenizer.from_pretrained` stays vanilla.

Audit results — every entry of `MODEL_RENDERER_MAP`

Probe: 5-case encoding (plain text, Lorem, emoji + CJK, literal `<|im_start|>`, 200-word long), vanilla vs fastokens-patched; a sketch of the probe shape follows below. 31/35 entries are byte-identical with vanilla; the four exceptions:

* `deepseek-ai/DeepSeek-V3{,-Base}` — fastokens 0.1.1 doesn't support the Metaspace pretokenizer
* `MiniMaxAI/MiniMax-M2{,.5}` — see "MiniMax investigation" below

The 4 incompatible models live in `FASTOKENS_INCOMPATIBLE` (a `frozenset` in `renderers/base.py`) and skip the patch unconditionally. Unknown / fine-tuned models hit the patched path first and fall back to vanilla on any fastokens load error (logged at INFO).
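For concreteness, here is a minimal sketch of the kind of parity probe described above. The probe strings and helper names are illustrative stand-ins rather than the actual audit script, and it assumes `patch_transformers` / `unpatch_transformers` are importable from the top-level `fastokens` package:

```python
# Illustrative parity probe (stand-in strings; not the actual audit script).
from transformers import AutoTokenizer
from fastokens import patch_transformers, unpatch_transformers  # assumed import path

PROBES = [
    "The quick brown fox jumps over the lazy dog.",               # plain text
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",   # Lorem
    "emoji 🙂🚀 plus CJK 東京 北京 서울",                           # emoji + CJK
    "literal <|im_start|>user<|im_end|> inside ordinary text",    # special-token literal
    " ".join(["word"] * 200),                                      # long input
]

def probe_encodings(model_name: str) -> list[list[int]]:
    tok = AutoTokenizer.from_pretrained(model_name)
    return [tok.encode(text) for text in PROBES]

def byte_identical(model_name: str) -> bool:
    vanilla = probe_encodings(model_name)
    patch_transformers()
    try:
        patched = probe_encodings(model_name)
    finally:
        unpatch_transformers()
    return vanilla == patched
```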
Implementation notes

Per-call patch/unpatch keeps the side effect minimal: the returned tokenizer keeps the fastokens shim (because fastokens captures the backend at load time), but subsequent `AutoTokenizer.from_pretrained` calls outside `load_tokenizer` stay vanilla. Verified by `test_patch_is_unloaded_after_call`.
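The load path looks roughly like the sketch below. This is the shape of the logic, not the exact code in `renderers/base.py`; the denylist entries are spelled out from the brace notation above, and the import path for the fastokens patch helpers is assumed:

```python
# Sketch of the bracketed patch; shape only, not the exact renderers/base.py code.
import logging

from transformers import AutoTokenizer
from fastokens import patch_transformers, unpatch_transformers  # assumed import path

logger = logging.getLogger(__name__)

FASTOKENS_INCOMPATIBLE = frozenset({
    "deepseek-ai/DeepSeek-V3",        # Metaspace pretokenizer unsupported in 0.1.1
    "deepseek-ai/DeepSeek-V3-Base",
    "MiniMaxAI/MiniMax-M2",           # see the MiniMax section below
    "MiniMaxAI/MiniMax-M2.5",
})

def load_tokenizer(model_name: str, use_fastokens: bool = True, **kwargs):
    if not use_fastokens or model_name in FASTOKENS_INCOMPATIBLE:
        return AutoTokenizer.from_pretrained(model_name, **kwargs)

    patch_transformers()
    try:
        # The returned tokenizer keeps the fastokens shim because the backend
        # is captured at load time; the process-global AutoTokenizer goes back
        # to vanilla as soon as the finally block unpatches.
        return AutoTokenizer.from_pretrained(model_name, **kwargs)
    except Exception as exc:
        logger.info("fastokens load failed for %s; falling back to vanilla: %s",
                    model_name, exc)
    finally:
        unpatch_transformers()

    # Only reached on a fastokens load failure, after unpatching.
    return AutoTokenizer.from_pretrained(model_name, **kwargs)
```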
MiniMax "divergence" — actually an upstream fastokens bug

Deeper investigation after merging `main`:

* `encode("<|im_start|>")` → `[60, 124, 324, 95, 10314, 109675]` (6 tokens, `_` + `start` unmerged)
* after a `patch_transformers()` + `unpatch_transformers()` cycle in the same process: `[60, 124, 324, 22242, 109675]` (5 tokens, with `_start` id 22242 merged)

The pollution comes from `unpatch_transformers()`: `from_pretrained` lives on `PreTrainedTokenizerBase`, not on `TokenizersBackend`. Capturing it via attribute access returns a bound method; `setattr`-ing it back installs a stray attribute on `TokenizersBackend.__dict__` that shadows the inherited classmethod. Subsequent loads see `cls = TokenizersBackend` rather than the actual tokenizer subclass, which steers MiniMax (declared `tokenizer_class = 'GPT2Tokenizer'`) down a different load path that strips the declared NFC normalizer + Split-regex pre-tokenizer.

Fix is one line in fastokens — `del TokenizersBackend.from_pretrained` instead of `setattr`-ing it back. Filing upstream separately. With that fix, vanilla and fastokens produce byte-identical encodings for MiniMax-M2 / M2.5 across the probe suite, in any process load order.
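The mechanism is easy to reproduce in isolation. The sketch below uses stand-in classes rather than the real transformers / fastokens ones, but shows the same failure mode: a classmethod captured via attribute access is already bound, so `setattr`-ing it back bakes `cls` in, while `del` restores the inherited descriptor:

```python
# Stand-in classes only; the real players are PreTrainedTokenizerBase,
# TokenizersBackend and the concrete tokenizer subclass.

class Base:                       # plays PreTrainedTokenizerBase
    @classmethod
    def from_pretrained(cls, name):
        return f"loaded {name} via {cls.__name__}"

class Backend(Base):              # plays TokenizersBackend
    pass

class GPT2Like(Backend):          # plays the concrete tokenizer subclass
    pass

def patched_from_pretrained(cls, name):
    return f"patched load of {name} via {cls.__name__}"

# Buggy patch/unpatch cycle: the capture is a *bound* method (cls baked in).
captured = Backend.from_pretrained
Backend.from_pretrained = classmethod(patched_from_pretrained)
# ... patched loads happen here ...
setattr(Backend, "from_pretrained", captured)   # stray entry in Backend.__dict__

print(GPT2Like.from_pretrained("minimax"))      # -> loaded minimax via Backend  (wrong cls)

# One-line fix: drop the shadow so lookup falls back to the inherited classmethod.
del Backend.from_pretrained
print(GPT2Like.from_pretrained("minimax"))      # -> loaded minimax via GPT2Like
```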
The `FASTOKENS_INCOMPATIBLE` MiniMax entries stay in this PR — cheap insurance until the fastokens fix is released, and the conditional adds no overhead for callers using one of the 31 byte-identical models.

Separate and unrelated finding worth flagging: even with the fastokens bug fixed, `AutoTokenizer.from_pretrained` (training-side) and direct `tokenizer.json` loading (vLLM / sglang / fastokens) produce different tokenizers for MiniMax-M2 — the slow→fast conversion path strips state declared in `tokenizer.json`. This is a transformers / MiniMax model-card inconsistency, not a fastokens issue. Not in scope for this PR; flagging for visibility.
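A minimal way to see the two load paths side by side (illustrative only, not part of the test suite; depending on the model card, extra load kwargs such as `trust_remote_code` may be needed):

```python
# Compare the training-side load path with direct tokenizer.json loading.
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from transformers import AutoTokenizer

MODEL = "MiniMaxAI/MiniMax-M2"
TEXT = "<|im_start|>user"

auto_tok = AutoTokenizer.from_pretrained(MODEL)                           # training-side path
raw_tok = Tokenizer.from_file(hf_hub_download(MODEL, "tokenizer.json"))   # vLLM / sglang-style path

print(auto_tok.encode(TEXT, add_special_tokens=False))
print(raw_tok.encode(TEXT, add_special_tokens=False).ids)
```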
Tests

`tests/test_load_tokenizer_fastokens.py` — 10 cases pinning the policy, including:

* `FASTOKENS_INCOMPATIBLE` exact-shape lock
* `use_fastokens=False` produces a vanilla backend
* `AutoTokenizer.from_pretrained` outside `load_tokenizer` stays vanilla

Existing test suite passes unchanged post-merge: `1152 passed, 51 skipped, 1 xfailed` (full run with main merged in, `pytest tests/ --ignore=tests/test_client.py`).
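For flavor, here is a sketch of what two of these pins might look like. The model name, import path, and assertions are illustrative, not copied from the new test file:

```python
# Illustrative pins only; not the actual tests/test_load_tokenizer_fastokens.py.
from renderers.base import FASTOKENS_INCOMPATIBLE, load_tokenizer  # assumed import path

MODEL = "some-org/some-supported-model"   # stand-in for a representative entry

def test_incompatible_set_is_pinned():
    # Exact-shape lock: any change to the denylist has to touch this test.
    assert FASTOKENS_INCOMPATIBLE == frozenset({
        "deepseek-ai/DeepSeek-V3",
        "deepseek-ai/DeepSeek-V3-Base",
        "MiniMaxAI/MiniMax-M2",
        "MiniMaxAI/MiniMax-M2.5",
    })

def test_encode_parity_vanilla_vs_fastokens():
    fast = load_tokenizer(MODEL)                         # default: fastokens path
    vanilla = load_tokenizer(MODEL, use_fastokens=False)
    probe = "literal <|im_start|> inside ordinary text 🙂"
    assert fast.encode(probe) == vanilla.encode(probe)
```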
Test plan

* `pytest tests/test_load_tokenizer_fastokens.py` — 10 cases pass
* `pytest tests/ --ignore=tests/test_client.py` — 1152 pass, 51 skipped, 1 xfailed post-merge (no regressions)
* Merged `main` (uv.lock regenerated; renderers/base.py auto-merged cleanly alongside main's `TRUSTED_REVISIONS` Kimi-K2 changes)

🤖 Generated with Claude Code