Name and Version
Updated to the 0.1.2 precompiled binaries on my Windows 11 machine with an RTX 5090, using the CUDA 13.1 DLLs.
Operating systems
Windows 11
GGML backends
CUDA 13.1
Hardware
9800X3D + RTX 5090 + 64GB RAM
Models
Gemma-Dense-31B-Q4-Unsloth
Qwen3.6-27B
Problem description & steps to reproduce
I'm running llama.cpp in router mode with the .ini below (I also tried running it directly from CMD; a rough CMD equivalent is sketched after the config):
; =====================================================
; GLOBAL SETTINGS (shared by all models)
; =====================================================
[*]
c = 128000
b = 2048
ub = 1024
;threads = 12
fa = on
cache-type-v = turbo3_tcq
cache-type-k = turbo4
no-mmap = true
mlock = true
np = 1
jinja = true
cache-ram = 0
defrag-thold = 0.3
ctx-checkpoints = 0
; =====================================================
; Gemma-4 31B Q4 Unsloth
; =====================================================
[Gemma-Dense-31B-Q4-Unsloth]
model = models/Gemma4/gemma-4-31B-it-UD-Q4_K_XL.gguf
mmproj = models/Gemma4/mmproj-gemma-4-bf16.gguf
alias = Gemma-Dense-31B-XL-Unsloth
c = 128000
temp = 0.7
top-p = 0.95
top-k = 20
min-p = 0.0
repeat-penalty = 1.0
reasoning = on
chat-template-kwargs = {"enable_thinking": true}
b = 2048
ub = 512
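For reference, this is roughly how the global settings plus the Gemma section above translate into a direct llama-server invocation from CMD. The stock flags are the usual llama-server ones; I've left out the turbo cache types, cache-ram, ctx-checkpoints and reasoning keys, since those come from the fork-specific build and I'm not sure of their exact CLI spelling. Exact flag syntax (e.g. whether -fa takes a value) may also differ slightly between builds:
llama-server ^
  -m models/Gemma4/gemma-4-31B-it-UD-Q4_K_XL.gguf ^
  --mmproj models/Gemma4/mmproj-gemma-4-bf16.gguf ^
  --alias Gemma-Dense-31B-XL-Unsloth ^
  -c 128000 -b 2048 -ub 512 -fa on ^
  --no-mmap --mlock -np 1 --jinja --defrag-thold 0.3 ^
  --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0 ^
  --chat-template-kwargs "{\"enable_thinking\": true}"
Launching it this way shows the same behaviour as router mode, so the rest of the report uses the router config.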
When testing with llama.cpp's WebUI, the Gemma4 model's prompt processing speed drops to around 400 tok/s once it is reading 20K+ tokens of context.
I thought this might just be a limitation of the model. However, when I copy-pasted the same model config into the earlier TheTom turboquant fork, prompt processing jumped to around 2000 tok/s with the exact same model and stayed in that range even as the context grew past 80K.
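To put numbers on this outside the WebUI, this is roughly how I would benchmark prompt processing at different context depths with the bundled llama-bench (assuming the precompiled package ships llama-bench and that the build has the -d/--n-depth option; flag syntax may differ between the mainline and fork binaries):
llama-bench -m models/Gemma4/gemma-4-31B-it-UD-Q4_K_XL.gguf -fa 1 -p 2048 -n 32 -d 0,20480,81920
Running the same command against both builds should show whether the pp2048 numbers really diverge once the depth goes past 20K.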
I don't know if this is relevant, but I've also been quite disappointed with the DFlash draft models in general: I don't see much of a speed difference between them and the regular models. Here's an example of how I'm using them:
; =====================================================
; Qwen3.6-27B with DFlash (max speed config)
; =====================================================
[qwen36-27b-Q5-dflash-uncensored]
model = models/Qwen3.6/Qwen3.6-27B-NEO-CODE-HERE-2T-OT-Q5_K_M.gguf
mmproj = models/Qwen3.6/mmproj-BF16.gguf
spec-draft-model = models/Qwen3.6/dflash-draft-3.6-q8_0.gguf
alias = qwen36-Q5-Uncensored-DFlash,Qwen3.6-27B-Q5-Uncesnsored-DFlash
spec-type = dflash
spec-draft-ngl = all
spec-dflash-cross-ctx = 1024
c = 256000
kv-unified = true
cache-ram = 0
b = 2048
ub = 512
temp = 0.6
top-k = 20
min-p = 0.0
repeat-penalty = 1.0
reasoning = on
chat-template-kwargs = {"preserve_thinking": true}
defrag-thold = -1
It seems like DFlash token generation is about 30-40 tok/s faster up to roughly 5K context, but beyond that it drops to the same level as the regular Q4 variants.
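The way I compare them is a plain completion request against the running server and reading the timings it reports back (this assumes the default 127.0.0.1:8080 address and that the fork's /completion response includes the usual timings object; both are assumptions on my side):
curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d "{\"prompt\": \"<same long test prompt for both models>\", \"n_predict\": 256}"
I then compare prompt_per_second and predicted_per_second from the timings field of the response for the DFlash and the regular model at the same context length.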
PS: It's not a VRAM issue. I actively monitor my VRAM and make sure usage never goes above 90%.