✨[0.3.39] Release Note: Dynamic GGML Backends, Qwen3-ASR/MiniCPM-V-4.6, On-Device Hybrid Checkpoint, and Granular Logging #128

JamePeng · 2026-05-17T15:46:15Z

JamePeng
May 17, 2026
Maintainer

Release 0.3.39: Dynamic GGML Backends, Qwen3-ASR/MiniCPM-V-4.6, On-Device Hybrid Checkpoint, and Granular Logging

Hi everyone, this is JamePeng.

This update is one of the more important maintenance and architecture updates for my llama-cpp-python fork. The main focus is to keep the Python package closer to the latest upstream llama.cpp / ggml runtime layout, while also improving Windows/Linux wheel compatibility, multimodal handler support, logging control, and hybrid model cache behavior.

1. Dynamic GGML backend wheels

This change will allow the n_cpu_moe and cpu_moe features added in v0.3.37 to achieve optimal CPU compatibility and AVX instruction acceleration.

Starting from this preview release, the prebuilt CUDA wheels are moving toward the new GGML_BACKEND_DL + GGML_CPU_ALL_VARIANTS runtime layout.

In simple terms, the wheel now ships GGML backends as dynamically loadable runtime libraries. For example, Windows wheels can include backend DLLs such as:

ggml-cpu-x64.dll
ggml-cpu-sse42.dll
ggml-cpu-haswell.dll
ggml-cpu-alderlake.dll
ggml-cpu-zen4.dll
ggml-cuda.dll

These libraries are packaged under:

site-packages/llama_cpp/lib

The Python runtime now explicitly loads packaged GGML dynamic backends from this directory after llama_backend_init().

This is important because the newer dynamic backend layout means llama_backend_init() alone may not register every packaged backend automatically. The package now calls ggml_backend_load_all_from_path() so that CPU variants and optional accelerator backends can be discovered before model loading.

This also means the old Basic / AVX2 wheel split is no longer the preferred direction. CPU instruction compatibility is now handled by GGML’s dynamic CPU backend selection rather than by publishing many CPU-specific wheel variants.

For Windows builds, I also moved the workflow toward the LLVM/Clang toolchain. MSVC may skip some x64 CPU variants such as zen4, cooperlake, or sapphirerapids due to compiler intrinsic support limitations, so LLVM/Clang is the better path for full CPU variant coverage.

2. Cleaner and more focused wheel builds

The CMake installation logic has been updated to better match the new upstream GGML backend layout.

Targets are now grouped more clearly into:

LLAMA_CPP_TARGETS
GGML_CORE_TARGETS
GGML_CPU_VARIANT_TARGETS
GGML_BACKEND_TARGETS

Missing targets such as llama-common and the separated ggml-cpu-* CPU backend variants are now explicitly included in the Python package installation path.

I also disabled non-wheel build targets such as examples, tests, tools, server, embedded UI, and curl support for prebuilt wheels. The goal is to keep wheel artifacts focused on runtime usage, not development or server-side auxiliary binaries.

On Windows, development-only files such as .lib, cmake/, and pkgconfig/ entries are cleaned from the Python runtime directories so that the wheel contains only the files needed at runtime.

3. CUDA wheel updates

The prebuilt wheel workflows are being updated around the new backend layout.

Supported CUDA versions are moving toward:

CUDA 12.4
CUDA 12.6
CUDA 12.8
CUDA 13.1

CUDA wheels now use a simpler local version suffix such as:

+cu131

instead of older suffixes such as:

+cu131.basic

CPU all-variants is now an internal runtime layout detail, not a user-facing wheel variant.

The CUDA architecture coverage is still intentionally broad. This makes the wheel larger than single-architecture builds, but it also makes the wheel more convenient across different GPU generations.

4. Qwen3-ASR support

This release adds a new Qwen3ASRChatHandler for Qwen3-ASR models.

The handler integrates MTMD multimodal logic for audio input and supports both audio_url and OpenAI-style base64 input_audio payloads. It injects audio data into the expected Qwen3-ASR template sequence:

<|audio_start|><|audio_pad|>[DATA]<|audio_end|>

The README has also been updated with a dedicated Qwen3-ASR usage example, including a helper for encoding local .wav and .mp3 files into input_audio payloads.

One important note: for Qwen3-ASR, I strongly recommend using BF16 quantization for the multimodal projector (mmproj). Lower precision projector quantization may noticeably degrade audio understanding quality.

Another detail is that Qwen3-ASR’s template behavior may drop normal text content, so task instructions should be placed in the system role.

5. MiniCPM-V-4.6 support

This release also adds a MiniCPMV46ChatHandler for MiniCPM-V-4.6.

This continues the work of keeping the multimodal handler layer aligned with newer upstream multimodal model formats.

6. Fine-grained logging API

The native llama.cpp / ggml logging system has been refactored and exposed more directly through the Llama class.

Previously, verbose=True/False was too limited. It mostly behaved like a binary switch between very quiet and very noisy native logs.

Now the package supports a more detailed verbosity scale:

0 = output only
1 = error
2 = warning
3 = info
4 = trace
5 = debug

The Llama class now accepts additional logging parameters such as:

verbosity
log_filters
log_filters_case_sensitive

It also exposes runtime methods such as:

set_verbosity
get_verbosity
set_log_filters
add_log_filters
clear_log_filters
reset_log_filters

This should make it much easier to suppress noisy backend logs, including repeated CUDA Graph messages, without hardcoding patches into the runtime.

The native logger is still process-global because llama.cpp / ggml use a global log callback, so changing verbosity or filters affects all Llama instances in the same Python process.

7. HybridCheckpointCache on-device support

This release adds on-device hybrid checkpoint support.

HybridCheckpointCache now supports both host mode and on-device mode.

Host mode remains the default and keeps Python-owned rollback history.

On-device mode uses LLAMA_STATE_SEQ_FLAGS_ON_DEVICE to keep checkpoint tensor payloads in llama_context-owned device buffers. This can reduce host-device copy overhead for hybrid or recurrent models.

There are also additional safety guards to avoid restoring stale on-device checkpoints, since device-side payloads may be overwritten by the backend. The default ctx_checkpoints value has also been reduced from 32 to 16.

The docs for Llama and LlamaCache have been updated to explain the new checkpoint_on_device option and the difference between host-owned and device-owned checkpoint data.

8. MTP status

I also synced the latest llama.cpp / ggml / mtmd API bindings, including the recent MTP-related API and context variables.

However, I want to be clear: MTP is not fully adapted in this release yet.

At the moment, the MTP-related changes are mainly there to keep the API and context variable layer compatible and to prevent runtime errors when using newer upstream code. I have not yet completed a full Python-side MTP integration.

My current plan is to wait another 1–2 weeks for upstream llama.cpp to stabilize the MTP implementation and related APIs before doing a more complete adaptation in this fork.

I think this is safer than rushing an integration while the upstream behavior is still moving quickly.

9. Upstream sync

This release also updates the vendored llama.cpp code and synchronizes llama / mtmd / ggml API bindings with the latest upstream changes available at the time of this release.

More details can be found in the compare link from the release changelog.

Final notes

This preview release is mainly about preparing the package for the new GGML dynamic backend world.

The biggest changes are not only new model handlers or new CUDA versions, but the shift toward:

dynamic backend loading
CPU all-variant runtime selection
cleaner wheel packaging
better logging control
safer hybrid checkpoint behavior
closer alignment with upstream llama.cpp

As always, this is still a fast-moving area. If you test the new wheels, especially on Windows with different CPU/GPU generations, feedback is very welcome.

Thanks to everyone who submitted issues and feedback to help shape these features.

— JamePeng

wudioql · 2026-05-18T11:36:51Z

wudioql
May 18, 2026

哎我去怎么升级cuda13.1了，cu130该怎么搞啊😂

1 reply

JamePeng May 18, 2026
Maintainer Author

13.0和13.1一样用，低于13.2的区别不大

JamePeng · 2026-05-18T13:15:51Z

JamePeng
May 18, 2026
Maintainer Author

Supplementary Notice: 2026-05-17 Windows Wheels Temporarily Removed

Hi everyone, this is JamePeng.

I have temporarily removed the Windows wheels published on 2026-05-17.

The reason is that the Windows packages were built with the new GGML_BACKEND_DL + GGML_CPU_ALL_VARIANTS dynamic backend layout, but the wheel packaging missed one required runtime dependency:

libomp140.x86_64.dll

GGML CPU all-variant backends built with LLVM/Clang + OpenMP depend on this runtime DLL. Since ggml-cpu-*.dll files are loaded dynamically through:

ggml_backend_load_all_from_path()

the OpenMP runtime must be packaged next to them under:

site-packages/llama_cpp/lib

Without libomp140.x86_64.dll, Windows may fail to dynamically load the packaged ggml-cpu-* backend DLLs at runtime.

I am rebuilding the 0.3.39 Windows wheels with the corrected packaging logic so that the LLVM OpenMP runtime is included properly.

Also Update llama.cpp to ggml-org/llama.cpp/commit/d14ce3dab4de197adec5166faa54ac5db8262f26

The rebuilt wheels are expected to be completed and published within the next 4 hours.

Special thanks to @qqba for reporting and helping identify this issue in #129. This feedback helped catch an important packaging gap in the new dynamic backend wheel layout.

Sorry for the inconvenience, and thank you for your patience.

--JamePeng

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

✨[0.3.39] Release Note: Dynamic GGML Backends, Qwen3-ASR/MiniCPM-V-4.6, On-Device Hybrid Checkpoint, and Granular Logging #128

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

✨[0.3.39] Release Note: Dynamic GGML Backends, Qwen3-ASR/MiniCPM-V-4.6, On-Device Hybrid Checkpoint, and Granular Logging #128

Uh oh!

JamePeng May 17, 2026 Maintainer

Release 0.3.39: Dynamic GGML Backends, Qwen3-ASR/MiniCPM-V-4.6, On-Device Hybrid Checkpoint, and Granular Logging

1. Dynamic GGML backend wheels

2. Cleaner and more focused wheel builds

3. CUDA wheel updates

4. Qwen3-ASR support

5. MiniCPM-V-4.6 support

6. Fine-grained logging API

7. HybridCheckpointCache on-device support

8. MTP status

9. Upstream sync

Final notes

Replies: 2 comments · 1 reply

Uh oh!

wudioql May 18, 2026

Uh oh!

Uh oh!

JamePeng May 18, 2026 Maintainer Author

Uh oh!

Uh oh!

JamePeng May 18, 2026 Maintainer Author

Supplementary Notice: 2026-05-17 Windows Wheels Temporarily Removed

JamePeng
May 17, 2026
Maintainer

Replies: 2 comments 1 reply

wudioql
May 18, 2026

JamePeng May 18, 2026
Maintainer Author

JamePeng
May 18, 2026
Maintainer Author