Implemented generic multimodal chat handler.#125
Conversation
|
You can take a look at how to improve the injection process. #110 |
|
It seems there's no work on how to perform URL injection for multimedia; simply replacing it with a media marker isn't enough. This code also needs to be removed: Architecture-based tag guessing should not default unknown models to Qwen-style tags. Prefer detecting media tags from the actual chat template, or better, avoid tag guessing by normalizing OpenAI content parts into placeholders before rendering. and In addition, a check is needed to ensure that the number of replacement markers matches the number of incoming media. |
|
@JamePeng What do you think of this code? |
e1caafb to
628373c
Compare
|
You can test the multimodal usage of qwen3vl, qwen3.5/3.6, and gemma4. |
Signed-off-by: JamePeng <jame_peng@sina.com>
- Add a PowerShell step to the Windows CI workflow to locate and copy `libomp140.x86_64.dll` from the Visual Studio redistributables. - Place the runtime DLL into the `llama_cpp\lib` package directory. This ensures that the dynamically loaded `ggml-cpu-*.dll` variants (which are built with LLVM OpenMP on Windows) have their required dependencies packaged in the wheel. Without this, `ggml_backend_load_all_from_path()` can silently fail to load the CPU backends at runtime on end-user machines. Signed-off-by: JamePeng <jame_peng@sina.com>
What does it do?
It automatically uses the model's chat template and replaces all of the model's multimodal tags with the
media_markertag.This allows a much easier implementation for multimodal models, since the chat template doesn't need to be hard-coded for each model.
How to use it?
It is as simple as passing the
clip_model_pathparameter to theLlamaclass when created.Note
Using the previous implementation (e.g.
Qwen35ChatHandler) still works.I'm also looking forward to implement more model architectures. Please, reply if you want me to implement any.