✨[0.3.39] Release Note: Dynamic GGML Backends, Qwen3-ASR/MiniCPM-V-4.6, On-Device Hybrid Checkpoint, and Granular Logging #128
Replies: 2 comments 1 reply
-
|
哎我去怎么升级cuda13.1了,cu130该怎么搞啊😂 |
Beta Was this translation helpful? Give feedback.
-
Supplementary Notice: 2026-05-17 Windows Wheels Temporarily RemovedHi everyone, this is JamePeng. I have temporarily removed the Windows wheels published on 2026-05-17. The reason is that the Windows packages were built with the new GGML CPU all-variant backends built with LLVM/Clang + OpenMP depend on this runtime DLL. Since the OpenMP runtime must be packaged next to them under: Without I am rebuilding the 0.3.39 Windows wheels with the corrected packaging logic so that the LLVM OpenMP runtime is included properly. Also Update llama.cpp to ggml-org/llama.cpp/commit/d14ce3dab4de197adec5166faa54ac5db8262f26 The rebuilt wheels are expected to be completed and published within the next 4 hours. Special thanks to @qqba for reporting and helping identify this issue in #129. This feedback helped catch an important packaging gap in the new dynamic backend wheel layout. Sorry for the inconvenience, and thank you for your patience. --JamePeng |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Release 0.3.39: Dynamic GGML Backends, Qwen3-ASR/MiniCPM-V-4.6, On-Device Hybrid Checkpoint, and Granular Logging
Hi everyone, this is JamePeng.
This update is one of the more important maintenance and architecture updates for my
llama-cpp-pythonfork. The main focus is to keep the Python package closer to the latest upstreamllama.cpp/ggmlruntime layout, while also improving Windows/Linux wheel compatibility, multimodal handler support, logging control, and hybrid model cache behavior.1. Dynamic GGML backend wheels
This change will allow the
n_cpu_moeandcpu_moefeatures added inv0.3.37to achieve optimal CPU compatibility and AVX instruction acceleration.Starting from this preview release, the prebuilt CUDA wheels are moving toward the new
GGML_BACKEND_DL+GGML_CPU_ALL_VARIANTSruntime layout.In simple terms, the wheel now ships GGML backends as dynamically loadable runtime libraries. For example, Windows wheels can include backend DLLs such as:
ggml-cpu-x64.dllggml-cpu-sse42.dllggml-cpu-haswell.dllggml-cpu-alderlake.dllggml-cpu-zen4.dllggml-cuda.dllThese libraries are packaged under:
The Python runtime now explicitly loads packaged GGML dynamic backends from this directory after
llama_backend_init().This is important because the newer dynamic backend layout means
llama_backend_init()alone may not register every packaged backend automatically. The package now callsggml_backend_load_all_from_path()so that CPU variants and optional accelerator backends can be discovered before model loading.This also means the old
Basic/AVX2wheel split is no longer the preferred direction. CPU instruction compatibility is now handled by GGML’s dynamic CPU backend selection rather than by publishing many CPU-specific wheel variants.For Windows builds, I also moved the workflow toward the LLVM/Clang toolchain. MSVC may skip some x64 CPU variants such as
zen4,cooperlake, orsapphirerapidsdue to compiler intrinsic support limitations, so LLVM/Clang is the better path for full CPU variant coverage.2. Cleaner and more focused wheel builds
The CMake installation logic has been updated to better match the new upstream GGML backend layout.
Targets are now grouped more clearly into:
LLAMA_CPP_TARGETSGGML_CORE_TARGETSGGML_CPU_VARIANT_TARGETSGGML_BACKEND_TARGETSMissing targets such as
llama-commonand the separatedggml-cpu-*CPU backend variants are now explicitly included in the Python package installation path.I also disabled non-wheel build targets such as examples, tests, tools, server, embedded UI, and curl support for prebuilt wheels. The goal is to keep wheel artifacts focused on runtime usage, not development or server-side auxiliary binaries.
On Windows, development-only files such as
.lib,cmake/, andpkgconfig/entries are cleaned from the Python runtime directories so that the wheel contains only the files needed at runtime.3. CUDA wheel updates
The prebuilt wheel workflows are being updated around the new backend layout.
Supported CUDA versions are moving toward:
CUDA wheels now use a simpler local version suffix such as:
instead of older suffixes such as:
CPU all-variants is now an internal runtime layout detail, not a user-facing wheel variant.
The CUDA architecture coverage is still intentionally broad. This makes the wheel larger than single-architecture builds, but it also makes the wheel more convenient across different GPU generations.
4. Qwen3-ASR support
This release adds a new
Qwen3ASRChatHandlerfor Qwen3-ASR models.The handler integrates MTMD multimodal logic for audio input and supports both
audio_urland OpenAI-style base64input_audiopayloads. It injects audio data into the expected Qwen3-ASR template sequence:The README has also been updated with a dedicated Qwen3-ASR usage example, including a helper for encoding local
.wavand.mp3files intoinput_audiopayloads.One important note: for Qwen3-ASR, I strongly recommend using BF16 quantization for the multimodal projector (
mmproj). Lower precision projector quantization may noticeably degrade audio understanding quality.Another detail is that Qwen3-ASR’s template behavior may drop normal text content, so task instructions should be placed in the
systemrole.5. MiniCPM-V-4.6 support
This release also adds a
MiniCPMV46ChatHandlerfor MiniCPM-V-4.6.This continues the work of keeping the multimodal handler layer aligned with newer upstream multimodal model formats.
6. Fine-grained logging API
The native llama.cpp / ggml logging system has been refactored and exposed more directly through the
Llamaclass.Previously,
verbose=True/Falsewas too limited. It mostly behaved like a binary switch between very quiet and very noisy native logs.Now the package supports a more detailed verbosity scale:
The
Llamaclass now accepts additional logging parameters such as:verbositylog_filterslog_filters_case_sensitiveIt also exposes runtime methods such as:
set_verbosityget_verbosityset_log_filtersadd_log_filtersclear_log_filtersreset_log_filtersThis should make it much easier to suppress noisy backend logs, including repeated CUDA Graph messages, without hardcoding patches into the runtime.
The native logger is still process-global because llama.cpp / ggml use a global log callback, so changing verbosity or filters affects all
Llamainstances in the same Python process.7. HybridCheckpointCache on-device support
This release adds on-device hybrid checkpoint support.
HybridCheckpointCachenow supports both host mode and on-device mode.Host mode remains the default and keeps Python-owned rollback history.
On-device mode uses
LLAMA_STATE_SEQ_FLAGS_ON_DEVICEto keep checkpoint tensor payloads inllama_context-owned device buffers. This can reduce host-device copy overhead for hybrid or recurrent models.There are also additional safety guards to avoid restoring stale on-device checkpoints, since device-side payloads may be overwritten by the backend. The default
ctx_checkpointsvalue has also been reduced from 32 to 16.The docs for
LlamaandLlamaCachehave been updated to explain the newcheckpoint_on_deviceoption and the difference between host-owned and device-owned checkpoint data.8. MTP status
I also synced the latest llama.cpp / ggml / mtmd API bindings, including the recent MTP-related API and context variables.
However, I want to be clear: MTP is not fully adapted in this release yet.
At the moment, the MTP-related changes are mainly there to keep the API and context variable layer compatible and to prevent runtime errors when using newer upstream code. I have not yet completed a full Python-side MTP integration.
My current plan is to wait another 1–2 weeks for upstream
llama.cppto stabilize the MTP implementation and related APIs before doing a more complete adaptation in this fork.I think this is safer than rushing an integration while the upstream behavior is still moving quickly.
9. Upstream sync
This release also updates the vendored llama.cpp code and synchronizes llama / mtmd / ggml API bindings with the latest upstream changes available at the time of this release.
More details can be found in the compare link from the release changelog.
Final notes
This preview release is mainly about preparing the package for the new GGML dynamic backend world.
The biggest changes are not only new model handlers or new CUDA versions, but the shift toward:
As always, this is still a fast-moving area. If you test the new wheels, especially on Windows with different CPU/GPU generations, feedback is very welcome.
Thanks to everyone who submitted issues and feedback to help shape these features.
— JamePeng
Beta Was this translation helpful? Give feedback.
All reactions