# sokann/Qwen3.6-27B-GGUF-5.076bpw
This is a 5.076 BPW quantized model for the GPU poor with 24 GiB of VRAM. It uses the SOTA IQK quants and therefore works only in ik_llama.cpp.

From local testing with llama-perplexity (wiki.test.raw, 580 chunks), it offers the best combination of quality and speed in its size class:
| quant | this | bartowski Q4_K_L | unsloth UD-Q4_K_XL | mradermacher i1.Q4_K_M | ubergarm IQ5_KS |
|---|---|---|---|---|---|
| Size (BPW) | 5.076 | 5.493 | 5.235 | 4.919 | 5.919 |
| Size (GiB) | 15.893 | 17.198 | 16.393 | 15.401 | 18.532 |
| VRAM usage (GiB) | 14.931 | 15.940 | 15.727 | 14.735 | 17.570 |
| Mean PPL(Q) | 6.982381 ± 0.046281 | 6.992025 ± 0.046419 | 6.970005 ± 0.046042 | 6.938483 ± 0.045668 | 6.931115 ± 0.045750 |
| Mean PPL(base) | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.47% | 99.40% | 99.45% | 99.32% | 99.79% |
| Mean KLD | 0.019613 ± 0.000587 | 0.020410 ± 0.000637 | 0.020399 ± 0.000643 | 0.024690 ± 0.000686 | 0.008430 ± 0.000466 |
| Maximum KLD | 22.204016 | 21.812332 | 20.409454 | 20.942572 | 21.321548 |
| 99.9% KLD | 1.703208 | 2.161336 | 2.067440 | 2.667204 | 0.698605 |
| RMS Δp | 3.843 ± 0.058 % | 3.812 ± 0.059 % | 3.778 ± 0.058 % | 4.264 ± 0.062 % | 2.467 ± 0.061 % |
| Same top p | 94.767 ± 0.058 % | 94.824 ± 0.058 % | 94.824 ± 0.058 % | 94.203 ± 0.061 % | 96.618 ± 0.047 % |
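The numbers above come from llama-perplexity's KL-divergence mode. A sketch of that workflow, with hypothetical model and file names, assuming the standard llama-perplexity flags:

```shell
# Hypothetical paths. First, save the base model's logits over the test set:
./llama-perplexity -m Qwen3.6-27B-BF16.gguf -f wiki.test.raw \
    --kl-divergence-base base-logits.bin

# Then score a quant against the saved logits; this reports Mean PPL,
# Mean/99.9%/Max KLD, RMS Δp, and same-top-p, as tabulated above.
./llama-perplexity -m Qwen3.6-27B-5.076bpw.gguf -f wiki.test.raw \
    --kl-divergence-base base-logits.bin --kl-divergence
```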
With 24 GiB of VRAM, we can fit a context size of 128000 with an F16 KV cache:

```
-c 128000 -wgt 1
```

or a context size of 262144 with a quantized KV cache:

```
-c 262144 -wgt 1 -ctk q8_0 -khad -ctv q6_0 -vhad
```
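Putting it together, a full llama-server invocation might look like this (the model path is hypothetical; the flags are the ones listed above):

```shell
# Hypothetical model path; flags as recommended above for 24 GiB of VRAM
# with a quantized KV cache on ik_llama.cpp.
./llama-server -m Qwen3.6-27B-5.076bpw.gguf -ngl 99 \
    -c 262144 -wgt 1 -ctk q8_0 -khad -ctv q6_0 -vhad
```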
## Size

Size from the llama-server output:

```
llm_load_print_meta: model size = 15.893 GiB (5.076 BPW)
llm_load_print_meta: repeating layers = 13.969 GiB (4.927 BPW, 24.353 B parameters)
...
llm_load_tensors: CUDA_Host buffer size = 985.16 MiB
llm_load_tensors: CUDA0 buffer size = 15288.91 MiB
```
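As a quick sanity check, the BPW figures follow directly from the sizes and parameter counts in the log (a small awk sketch; 1 GiB = 2^30 bytes):

```shell
# Sanity-check the figures reported by llama-server above.
# BPW = GiB * 2^30 * 8 / parameters; the inverse recovers the parameter count.
awk 'BEGIN {
    printf "repeating layers: %.3f BPW\n", 13.969 * 2^30 * 8 / 24.353e9
    printf "whole model: %.1f B params\n",  15.893 * 2^30 * 8 / 5.076 / 1e9
}'
# → repeating layers: 4.927 BPW
# → whole model: 26.9 B params
```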
This is slightly bigger than the Qwen3.5-27B-4.915bpw quant, due to these changes:
- attention: mixture of IQ6_K and IQ5_K => Q6_0
- token_embd: IQ4_K => Q6_0
- output: IQ6_K => Q6_0
The recipe is almost identical to the IQ4_KS + Q6_0 recipe shared by IK in https://github.com/ikawrakow/ik_llama.cpp/discussions/1663, except that ssm_alpha and ssm_beta get a slight bump from Q6_0 to Q8_0. IQ4_KS + Q6_0 form a good and fast combo, as IK notes in that same discussion:
> - `IQ4_KS` is better than `IQ4_XS`, has the same size, has the same performance on CUDA and CPU
> - `IQ5_K` will wipe the floor with `Q5_K` in terms of quantization accuracy at the same bpw. One issue is that `IQ5_K` PP is lower on CUDA because of the block size of 16. It is about on par for TG on CUDA, about on par for TG on the CPU, and I think slightly faster PP on the CPU. If one does not want to take the CUDA performance penalty, one could replace `Q5_K` with `IQ5_KS`. This will be strictly faster, will use 0.25 bpw less than `Q5_K`, and will have about the same quantization accuracy.
> - I think in many cases one can replace `Q6_K` with `Q6_0`, which is quite a bit faster on CUDA while giving about the same quantization accuracy as `Q6_K`. `IQ6_K` is better, but slower.
> - The situation for `Q4_K` vs `IQ4_KS` and `IQ4_K` is similar to `Q5_K` vs `IQ5_KS` and `IQ5_K`.
## Recipe

```
blk\..*\.attn_q\.weight=q6_0
blk\..*\.attn_k\.weight=q6_0
blk\..*\.attn_v\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0
blk\..*\.attn_gate\.weight=q6_0
blk\..*\.attn_qkv\.weight=q6_0
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q6_0
blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks
token_embd\.weight=q6_0
output\.weight=q6_0
```
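A sketch of how such a regex=type recipe could be applied, assuming ik_llama.cpp's llama-quantize with its `--custom-q` option; the file names and the `recipe.txt` file (one regex=type rule per line, as listed above) are hypothetical:

```shell
# Join the per-line rules into the single comma-separated string
# that --custom-q expects (assumption: recipe.txt holds the rules above).
CUSTOM_Q=$(paste -sd, recipe.txt)

# Quantize with the imatrix and the custom per-tensor rules;
# iq4_ks serves as the default type for any tensor not matched by a rule.
./llama-quantize --imatrix imatrix.dat --custom-q "$CUSTOM_Q" \
    Qwen3.6-27B-BF16.gguf Qwen3.6-27B-5.076bpw.gguf iq4_ks
```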
## Speed

llama-sweep-bench results on an RTX 3090, with flags `-ngl 99 -mqkv -muge -cuda graphs=1 -c 128000 -wgt 1 -wb`:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 0.350 | 1461.47 | 3.088 | 41.45 |
| 512 | 128 | 10240 | 0.389 | 1315.93 | 3.230 | 39.62 |
| 512 | 128 | 20480 | 0.433 | 1183.72 | 3.311 | 38.65 |
| 512 | 128 | 30720 | 0.474 | 1080.64 | 3.413 | 37.50 |
| 512 | 128 | 40960 | 0.517 | 990.04 | 3.532 | 36.24 |
| 512 | 128 | 51200 | 0.560 | 914.01 | 3.636 | 35.20 |
| 512 | 128 | 61440 | 0.604 | 847.72 | 3.725 | 34.36 |
| 512 | 128 | 71680 | 0.647 | 791.61 | 3.826 | 33.46 |
| 512 | 128 | 81920 | 0.691 | 741.16 | 3.945 | 32.44 |
| 512 | 128 | 92160 | 0.735 | 696.16 | 4.053 | 31.58 |
| 512 | 128 | 102400 | 0.779 | 657.14 | 4.143 | 30.89 |
| 512 | 128 | 112640 | 0.823 | 622.27 | 4.261 | 30.04 |
| 512 | 128 | 122880 | 0.866 | 591.36 | 4.391 | 29.15 |
## Performance

This quant uses the imatrix from mradermacher. There are long reasoning tasks that the full-precision model, served from https://dashscope-intl.aliyuncs.com/compatible-mode/v1, solves at about a 50:50 chance; this quant, built with mradermacher's imatrix, solves them at about the same rate. Without any imatrix, the quant cannot solve these tasks at all. This finding underlines the importance of the importance matrix.
On agentic tasks tested with pi agent, the tasks that can be reliably solved by the full precision model can also be reliably solved by this quant.
For mainline llama.cpp users with 24 GiB of VRAM, I would recommend the i1-Q4_K_M from mradermacher, which also performs quite well in limited testing.

For ik_llama.cpp users who need even higher quality, I would recommend the IQ5_KS from ubergarm, which is near lossless.
## Verdict
We get Sonnet 4.5 at home with a used RTX 3090.