feat!: multilingual text-to-speech #1134
8380a2a to eb999a7
msluszniak
left a comment
You should also update the code snippets in the documentation, and the documentation in general. Please also address the lint warnings; many of them are words that need to be added to the cspell ignore list.
Also, if this PR adds a breaking change, please describe it directly below the "Introduces a breaking change?" section in the PR body.
…nsion/react-native-executorch into @is/multilingual-tts
10e8e1c to 38340f6
- tests/CMakeLists.txt: build phonemis from source (add_subdirectory) and propagate its include dir to rntests_core. The previous IMPORTED STATIC pointed at a libphonemis.a that nothing builds.
- FrameTransformTest, ObjectDetectionTest, InstanceSegmentationTest: update bbox member access for #1130's BBox refactor (.x1/.y1/.x2/.y2 → .p1.x/.p1.y/.p2.x/.p2.y).
- PoseEstimationTest: keypoint type became float in #1130; update the static_assert from int32_t to float.
- FrameTransformTest: make the three Right_* tests platform-aware. Production inverseRotateBbox/inverseRotatePoints are a no-op on Android for Right (front-cam upright portrait); rotateFrameForModel rotates CW on Android vs CCW on iOS. Tests now have #if defined(__APPLE__) branches matching production.
- SpeechToTextTest: GTEST_SKIP TranscribeReturnsValidChars with a TODO; known-failing on this branch, needs separate investigation.
- run_tests.sh: fix two stale Hugging Face URLs (fsmn-vad and yolo26n-pose filenames had changed upstream, causing wget to 404 and silently abort the script).
msluszniak
left a comment
Please make sure that iOS is also tested, since I don't have an iOS device for testing.
I've tested the Speech app on iOS and standard usage is fine. There are, however, some lurking bugs: in the Text to Speech app, generating speech with the 'AF Heart' voice from text written in the Hindi script (e.g. 'शुभ प्रभात') crashes the whole app with "offset + count out of range" at audio = audio.subspan(0, lastTokenTimestamp); in rnexecutorch::models::text_to_speech::kokoro::Kokoro::synthesize.
Uuu, good catch. Have you tried other language-specific characters, like ü?
Yeah, I tried the same for German-, French-, and Spanish-specific characters and there wasn't any problem.
Fixed.
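The crash message suggests subspan was asked for more samples than the buffer held. How the branch actually fixed it is not shown in this thread; one defensive option is to clamp the requested count, sketched here with a std::vector stand-in and a hypothetical helper name (leadingSamples):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the first `requested` samples, but never more
// than the buffer actually holds, so an oversized timestamp cannot go out of range.
inline std::vector<float> leadingSamples(const std::vector<float> &audio,
                                         std::size_t requested) {
  const std::size_t count = std::min(requested, audio.size());
  return std::vector<float>(audio.begin(),
                            audio.begin() + static_cast<std::ptrdiff_t>(count));
}
```

Clamping hides the symptom only; if lastTokenTimestamp is wrong for out-of-vocabulary scripts, the upstream cause would still need its own check.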
@IgorSwat, inspired by Bartek's finding, I'm trying some other attacks to expose more problems. I'll come back with my findings.
TTS edge-case findings from stress testing
Ran a battery of inputs against
1.
| speed | result |
|---|---|
| 0 | throws bare std::exception (no message, no error code) |
| NaN / Infinity / -1 / 1e9 | silently accepted; emits ~5500 samples regardless of text |
| 1e-6 | emits 150 945 samples (≈6.3 s) for "Hello world" |
1e-6 is the most worrying — audioLength = kTicksPerDuration * effectiveDuration is int32_t (Kokoro.cpp:349); a small enough speed overflows that and the synthesizer allocates unbounded memory. Suggested guard: reject non-finite or ≤ 0 speeds at the JS boundary and in Kokoro::generate, with a real RnExecutorchError(InvalidUserInput, …).
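A minimal sketch of the suggested guard, under stated assumptions: the names isValidSpeed, requireValidSpeed and checkedAudioLength are hypothetical, kTicksPerDurationSketch is a made-up value, and std::invalid_argument stands in for RnExecutorchError, whose constructor is not shown in this thread.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Assumed value for illustration only; the real kTicksPerDuration is defined in the project.
constexpr std::int32_t kTicksPerDurationSketch = 600;

// True only for finite speeds strictly greater than zero
// (rejects NaN, +/-Infinity, 0 and negative values).
inline bool isValidSpeed(float speed) {
  return std::isfinite(speed) && speed > 0.0f;
}

// Throws before synthesis starts; std::invalid_argument is a stand-in
// for the project's own error type.
inline void requireValidSpeed(float speed) {
  if (!isValidSpeed(speed)) {
    throw std::invalid_argument("speed must be a finite number greater than 0");
  }
}

// Multiplies in 64 bits and refuses any result that would not fit in int32_t,
// instead of letting the product wrap around.
inline std::int32_t checkedAudioLength(std::int64_t effectiveDuration) {
  constexpr std::int64_t kMax = std::numeric_limits<std::int32_t>::max();
  if (effectiveDuration < 0 ||
      effectiveDuration > kMax / kTicksPerDurationSketch) {
    throw std::length_error("audio length does not fit in int32_t");
  }
  return static_cast<std::int32_t>(effectiveDuration * kTicksPerDurationSketch);
}
```

Note that 1e-6 is finite and positive, so the speed check alone lets it through; the length check is what bounds the allocation. Whether to also enforce a minimum supported speed is a separate decision this sketch does not make.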
2. Streaming worker hangs on non-EOS content
Kokoro.cpp:171-189:
size_t chunkSize = (eosIt != inputTextBuffer_.rend())
                       ? std::distance(eosIt, inputTextBuffer_.rend())
                       : 0;
if (chunkSize > 0 ||
    streamSkippedIterations >= params::kStreamMaxSkippedIterations) {
  input = inputTextBuffer_.substr(0, chunkSize); // chunkSize still 0
  inputTextBuffer_.erase(0, chunkSize);          // erases nothing
  streamSkippedIterations = 0;                   // reset, loop forever
}
When streamInsert content has no end-of-sentence character, the buffer never drains. The skip-threshold force-flush path fires but uses chunkSize=0 to extract the chunk, so it produces an empty input and resets the counter. streamStop(false) then waits forever for the buffer to empty. streamStop(true) is the only recovery.
Repros: streamInsert('a'), streamInsert('hello world'), 2000× U+200D — all permanently hang the worker.
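The stall can also be reproduced outside the app with a small standalone model of the flush decision. This is a sketch, not the worker itself: the constant value, the EOS character set (".!?"), the else branch that increments the counter, and the names StreamState, stepAsReported and drainsWithin are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>

// Assumed threshold; the real one is params::kStreamMaxSkippedIterations.
constexpr int kMaxSkippedSketch = 3;

struct StreamState {
  std::string buffer;
  int skippedIterations = 0;
};

// Length of the prefix ending at the last end-of-sentence character, or 0 if none.
inline std::size_t eosChunkSize(const std::string &buffer) {
  const std::size_t pos = buffer.find_last_of(".!?");
  return pos == std::string::npos ? 0 : pos + 1;
}

// One worker iteration as reported: the force-flush branch reuses chunkSize == 0,
// so it extracts an empty chunk, erases nothing and resets the counter.
inline std::string stepAsReported(StreamState &state) {
  const std::size_t chunkSize = eosChunkSize(state.buffer);
  std::string input;
  if (chunkSize > 0 || state.skippedIterations >= kMaxSkippedSketch) {
    input = state.buffer.substr(0, chunkSize);
    state.buffer.erase(0, chunkSize);
    state.skippedIterations = 0;
  } else {
    ++state.skippedIterations;
  }
  return input;
}

// Runs the step up to `iterations` times and reports whether the buffer ever emptied.
inline bool drainsWithin(const std::string &text, int iterations) {
  StreamState state;
  state.buffer = text;
  for (int i = 0; i < iterations; ++i) {
    stepAsReported(state);
    if (state.buffer.empty()) {
      return true;
    }
  }
  return false;
}
```

In this model, text without a terminator never drains no matter how many iterations run, which matches the hang seen with streamStop(false).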
Suggested fix:
if (chunkSize > 0) {
  // normal flush by EOS
} else if (streamSkippedIterations >= params::kStreamMaxSkippedIterations) {
  input = inputTextBuffer_.substr(0, searchLimit);
  inputTextBuffer_.erase(0, searchLimit);
  streamSkippedIterations = 0;
} else {
  streamSkippedIterations++;
}
3. streamStop(true) drops in-flight audio silently
Kokoro.cpp:137-145:
auto nativeCallback = [this, callback](const std::vector<float> &audioVec) {
  if (this->isStreaming_) { // false after streamStop(true)
    this->callInvoker_->invokeAsync(...);
  }
};
If streamStop(true) lands while a chunk is mid-synthesis, the synthesizer finishes the chunk and then the callback no-ops; the audio is generated and discarded with no signal. In a captioning / live-narration context that's a silently lost sentence.
Suggested fix: deliver the chunk that completed before the stop, or surface "aborted with in-flight chunk discarded" through onEnd/onCancel.
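The second option (surfacing the discard) could look roughly like the sketch below. It is a standalone model, not the module's code: ChunkRouter, StreamCallbacks and onDiscard are hypothetical names, and the real path goes through callInvoker_->invokeAsync rather than direct std::function calls.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical callback pair: onChunk receives audio, onDiscard receives
// the number of samples that were synthesized but dropped after a stop.
struct StreamCallbacks {
  std::function<void(const std::vector<float> &)> onChunk;
  std::function<void(std::size_t)> onDiscard;
};

class ChunkRouter {
public:
  explicit ChunkRouter(StreamCallbacks callbacks)
      : callbacks_(std::move(callbacks)) {}

  // Models streamStop(true): later chunks are no longer delivered.
  void stop() { streaming_.store(false); }

  // Called when the synthesizer finishes a chunk. Instead of silently
  // no-op'ing after a stop, the drop is reported through onDiscard.
  void deliver(const std::vector<float> &audio) {
    if (streaming_.load()) {
      if (callbacks_.onChunk) {
        callbacks_.onChunk(audio);
      }
    } else if (callbacks_.onDiscard) {
      callbacks_.onDiscard(audio.size());
    }
  }

private:
  StreamCallbacks callbacks_;
  std::atomic<bool> streaming_{true};
};
```

Reporting the sample count keeps the stop semantics unchanged while giving the caller enough information to log or retry the lost sentence.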
4. Observational (optional): solo punctuation produces ~0.5 s of audible artifacts
Single-character inputs like ., !, ?, ... produce 12k–17k samples of mostly-silent audio that contains low-amplitude artifacts the model emits while filling the duration predictor's window. stripAudio's silence threshold doesn't catch them, so the user hears a faint click/breath. Not a crash, not strictly wrong — flagging as observed behavior in case it's worth a guard later (e.g. early-return when all content phonemes are punctuation).
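If the guard is ever wanted, a rough version could run on the raw text before phonemization. This is a sketch under assumptions: hasNoSpeakableContent is a hypothetical name, it inspects input bytes rather than phonemes, and it only recognizes ASCII punctuation and whitespace in the default "C" locale.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True when every byte is ASCII punctuation or whitespace (also true for an
// empty string). Bytes of multi-byte UTF-8 characters are neither in the "C"
// locale, so non-ASCII text always counts as speakable content.
inline bool hasNoSpeakableContent(const std::string &text) {
  return std::all_of(text.begin(), text.end(), [](unsigned char c) {
    return std::ispunct(c) != 0 || std::isspace(c) != 0;
  });
}
```

A caller could return an empty audio buffer when this is true. Unicode punctuation such as a single-character ellipsis would not be caught, so a phoneme-level check would be more thorough.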
Reproducing
Stress-test version of the speech app lives on branch @ms/tts-stress-tests — the only changed file is apps/speech/screens/TextToSpeechScreen.tsx (preset chip rows + force-stop button + a switch from model.stream() to model.forward() so the hook's silent period-append doesn't mask the actual model behavior).
git fetch origin
git checkout @ms/tts-stress-tests
yarn install
cd apps/speech && yarn android
adb logcat -c && adb logcat ReactNativeJS:V AndroidRuntime:E DEBUG:V libc:E '*:F'
Open the app → Text To Speech screen.
Top row — "Test presets" drives forward() (one-shot synthesis, no streaming wrapper):
- space, spaces, newline, dot, excl, q, ... - punctuation/empty edge cases (relates to finding 4)
- Hindi, Arabic, Chinese, Japanese, Hebrew, Russian, Korean - script-mismatch with English voice (Bartek's family ;p)
- emoji, emoji-mix, ZW-chars, NUL, EN+Hindi, diacritics - non-vocab / mixed-script
- speed=0, speed=NaN, speed=Inf, speed=-1, speed=1e-6, speed=1e9 - drives finding 1
- noPh:EN, noPh:nums, noPh:syms - phonemize: false with non-phoneme input
Bottom row — "Streaming tests" drives streamInsert + stream() directly (bypassing the hook's . append):
- no-term:a, no-term:long - finding 2; these will hang the worker (tap the red Force stop button to recover)
- many-EOS - sanity check (multiple sentences in one insert)
- insert-flood-EOS - concurrency / buffer growth under load (no race observed)
- race:stop-during-synth - finding 3
- race:insert-during-synth - sanity check (no data loss observed)
Tap Force stop any time a streaming test hangs — it calls streamStop(true) so you can keep testing.
Interpreting the logs. Each tap emits one of:
I ReactNativeJS: [TTS-test] text=<json> speed=<n> phonemize=<bool>
I ReactNativeJS: [TTS-test] forward() returned <N> samples
I ReactNativeJS: [TTS-test] threw: <message>
I ReactNativeJS: [TTS-stream] start: <label>
I ReactNativeJS: [TTS-stream] <label> chunk #<n>: <N> samples (t=<ms>)
I ReactNativeJS: [TTS-stream] end: <label> — <chunks> chunks, <samples> samples, <ms>ms
I ReactNativeJS: [TTS-stream] threw: <label> -> <message>
Quick decoder:
- returned 0 samples → safe no-op (input had no usable phonemes).
- returned <large N> with weird input → check whether the model produced unintended audio (findings 1 / 4).
- threw: std::exception with no detail → wrap with a proper RnExecutorchError somewhere upstream.
- start: <label> followed by no chunks and a multi-minute duration → streaming worker is hung (finding 2); you need Force stop.
- start: <label> followed by no chunks but quick end (≲1 s) → in-flight chunk dropped (finding 3); synthesizer ran, callback no-op'd.
- chunk #N lines mean audio was delivered to JS. The streaming sanity tests (many-EOS, insert-flood-EOS, race:insert-during-synth) should each produce several chunks.
81c1766 to 81daf0d
I agree that 4 is rather good to skip. Regarding 2:
Where do we add this dot?
OK, what about textToSpeechModule? I can't see a similar hack there.
We can't do that in textToSpeechModule. And I honestly don't see the point in doing so if it is already done in the hook.
Hmmm, the very minimum we need to do is escalate this into a separate issue, since it is a serious problem. After the 0.9 release I will work on a solution that both solves this issue and does not break the LLM integration. And by the way, the hook is a completely separate mechanism. If we had this hack in the module that is reused by the hook, then OK. But the other way around, I disagree.
msluszniak
left a comment
As said, either you or I need to work on the EOS issue, but apart from that I have nothing else to add. Great job overall! :))
URL refresh
- Every URL constant in `modelUrls.ts`, `ocr/models.ts` follows the restructured HF layout under `resolve/v0.9.0` (`<model>_<size>_<backend>_<precision>.pte`, per-size + per-backend directories). Multi-backend URLs are hoisted to `modelUrls.ts` so the registry stays declarative. The `lfm2_5_350m_xnnpack_8w4da.pte` typo is corrected to `_8da4w.pte`.
- `versions.ts`: `VERSION_TAG → resolve/v0.9.0`; `PREVIOUS_VERSION_TAG = resolve/v0.8.0` retained for the @deprecated Llama QLoRA aliases.

`models` accessor
- New `constants/modelRegistry.ts` exports `models`, a typed accessor grouped one-to-one with hooks. Each entry is a function: call it (optionally with `{ quant, backend }`) to get the resolved config.
- Groups: `llm` (includes vision-capable LLMs like `lfm2_5_vl_*`), `classification`, `privacy_filter`, `object_detection`, `pose_estimation`, `semantic_segmentation`, `instance_segmentation`, `style_transfer`, `speech_to_text`, `text_to_speech`, `text_embedding`, `image_embedding`, `image_generation`, `vad`, `ocr`.
- `text_to_speech` is nested by language (`en_us`, `en_gb`, `fr`, `es`, `it`, `pt`, `hi`, `pl`, `de`) and returns the new bundled Kokoro presets from #1134.
- The `backend` parameter is typed to exactly the backends each model ships with; asking for a backend a model doesn't publish is a compile-time error. Defaults to the quantized variant.
- `ocr({ language })` is parameterized by ISO language token.
- ESLint's `camelcase` rule is relaxed to `properties: 'never'` so the snake_case property keys pass while bindings stay camelCase.

Apps + docs
- All example apps migrated to `models.*()`. Heavily-used groups are destructured at the top of the file (`const segmentation = models.semantic_segmentation;`).
- Picker entries compared by `modelName` to handle accessor-function values.
- `bare-rn` and the LLM playground default to LFM-2.5 1.2B Instruct.
- Every documentation code snippet that selected a model via a named constant is rewritten to use the typed `models.<group>.<entry>()` accessor across `03-hooks/**`, `04-typescript-api/**`, fundamentals, and the Model Registry guide.
- The 0.8.x version's Model Registry anchor is repointed to its own version so the link survives the rename on `next`.

Deprecations
- `LLAMA3_2_3B_QLORA`, `LLAMA3_2_1B_QLORA`: `@deprecated`; the .pte files stay at `v0.8.0` and the constants still resolve those URLs. Use `LLAMA3_2_*_SPINQUANT` going forward.
…eens
- `models.speech_to_text.whisper_{tiny,base,small}_en` switch from
`pair(base, quantized)` to `base(...)`. The quantized whisper builds
are slated for removal (#1134-followup) — picking them up by default
was wrong even before they go away.
- `apps/speech/screens/{TextToSpeechScreen,TextToSpeechLLMScreen,Quiz}.tsx`
migrated from the bare `KOKORO_*` constants to
`models.text_to_speech.<lang>.<voice>()`.
- `SpeechToTextScreen` and `voice_chat` pickers drop the "Whisper Tiny Q"
duplicate entry now that there's a single Whisper Tiny variant.
Description
Introduces major changes to the text-to-speech module based on the Kokoro model, including:
Current status of supported languages:
Introduces a breaking change?
There are 2 major breaking changes introduced by this PR:
Changed "synthesis from phonemes" API.
Old API:
New API:
Changed predefined model - voice setups. Now both model files & voice/phonemization files are bundled together, due to languages like Polish or German having fine-tuned model weights.
Old API:
New API:
Type of change
Tested on
Testing instructions
Play around with the demo speech apps.
Unit tests for RNE-specific code will be added later on.
The Phonemis package has its own wide range of unit tests (see the Phonemis repo).
Screenshots
Related issues
#712
Checklist
Additional notes