Name and Version
Updated to the 0.1.2 precompiled binaries on my Windows 11 machine with an RTX 5090, using the CUDA 13.1 DLLs.
Operating systems
Windows 11
GGML backends
CUDA 13.1
Hardware
9800X3D + RTX 5090 + 64GB RAM
Models
Gemma-Dense-31B-Q4-Unsloth
Qwen3.6-27B
Problem description & steps to reproduce
I'm running llama.cpp in router mode with the .ini below (I also tried running it directly from CMD; a rough CMD equivalent is sketched after the config):
; =====================================================
; GLOBAL SETTINGS (shared by all models)
; =====================================================
[*]
c = 128000
b = 2048
ub = 1024
;threads = 12
fa = on
cache-type-v = turbo3_tcq
cache-type-k = turbo4
no-mmap = true
mlock = true
np = 1
jinja = true
cache-ram = 0
defrag-thold = 0.3
ctx-checkpoints = 0
; =====================================================
; Gemma-4 31B Q4 Unsloth
; =====================================================
[Gemma-Dense-31B-Q4-Unsloth]
model = models/Gemma4/gemma-4-31B-it-UD-Q4_K_XL.gguf
mmproj = models/Gemma4/mmproj-gemma-4-bf16.gguf
alias = Gemma-Dense-31B-XL-Unsloth
c = 128000
temp = 0.7
top-p = 0.95
top-k = 20
min-p = 0.0
repeat-penalty = 1.0
reasoning = on
chat-template-kwargs = {"enable_thinking": true}
b = 2048
ub = 512
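For reference, this is roughly how the global settings plus the Gemma section above translate into a direct llama-server invocation from CMD. The stock flags are the usual llama-server ones; I've left out the turbo cache types, cache-ram, ctx-checkpoints and reasoning keys, since those come from the fork-specific build and I'm not sure of their exact CLI spelling. Exact flag syntax (e.g. whether -fa takes a value) may also differ slightly between builds:
llama-server ^
  -m models/Gemma4/gemma-4-31B-it-UD-Q4_K_XL.gguf ^
  --mmproj models/Gemma4/mmproj-gemma-4-bf16.gguf ^
  --alias Gemma-Dense-31B-XL-Unsloth ^
  -c 128000 -b 2048 -ub 512 -fa on ^
  --no-mmap --mlock -np 1 --jinja --defrag-thold 0.3 ^
  --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0 ^
  --chat-template-kwargs "{\"enable_thinking\": true}"
Launching it this way shows the same behaviour as router mode, so the rest of the report uses the router config.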
When testing with llama.cpp's WebUI, the Gemma4 model's prompt processing speed drops to around 400 tok/s once it is reading 20K+ tokens of context.
I thought this might just be a limitation of the model. However, when I copy-pasted the same model config into the earlier TheTom turboquant fork, prompt processing jumped to around 2000 tok/s with the exact same model and stayed in that range even as the context grew past 80K.
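To put numbers on this outside the WebUI, this is roughly how I would benchmark prompt processing at different context depths with the bundled llama-bench (assuming the precompiled package ships llama-bench and that the build has the -d/--n-depth option; flag syntax may differ between the mainline and fork binaries):
llama-bench -m models/Gemma4/gemma-4-31B-it-UD-Q4_K_XL.gguf -fa 1 -p 2048 -n 32 -d 0,20480,81920
Running the same command against both builds should show whether the pp2048 numbers really diverge once the depth goes past 20K.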
I don't know if this is relevant, but I've also been quite disappointed with the DFlash draft models in general: I don't see much of a speed difference between them and the regular models. Here's an example of how I'm using them:
; =====================================================
; Qwen3.6-27B with DFlash (max speed config)
; =====================================================
[qwen36-27b-Q5-dflash-uncensored]
model = models/Qwen3.6/Qwen3.6-27B-NEO-CODE-HERE-2T-OT-Q5_K_M.gguf
mmproj = models/Qwen3.6/mmproj-BF16.gguf
spec-draft-model = models/Qwen3.6/dflash-draft-3.6-q8_0.gguf
alias = qwen36-Q5-Uncensored-DFlash,Qwen3.6-27B-Q5-Uncesnsored-DFlash
spec-type = dflash
spec-draft-ngl = all
spec-dflash-cross-ctx = 1024
c = 256000
kv-unified = true
cache-ram = 0
b = 2048
ub = 512
temp = 0.6
top-k = 20
min-p = 0.0
repeat-penalty = 1.0
reasoning = on
chat-template-kwargs = {"preserve_thinking": true}
defrag-thold = -1
It seems like DFlash token generation is about 30-40 tok/s faster up to roughly 5K context, but beyond that it drops to the same level as the regular Q4 variants.
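The way I compare them is a plain completion request against the running server and reading the timings it reports back (this assumes the default 127.0.0.1:8080 address and that the fork's /completion response includes the usual timings object; both are assumptions on my side):
curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d "{\"prompt\": \"<same long test prompt for both models>\", \"n_predict\": 256}"
I then compare prompt_per_second and predicted_per_second from the timings field of the response for the DFlash and the regular model at the same context length.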
PS: It's not a VRAM issue. I actively monitor my VRAM and make sure usage never goes above 90%.