# sokann/Qwen3.6-27B-GGUF-4.256bpw
This is a 4.256 BPW quantized model for the GPU poor with 16 GiB of VRAM. It works in both ik_llama.cpp and mainline llama.cpp.
It was quantized with the simplest possible recipe: Q8_0 for the tiny ssm_alpha and ssm_beta tensors, and IQ4_XS for everything else.
From local testing with llama-perplexity (wiki.test.raw, 580 chunks), it offers the best quality in its size class, as measured by mean KLD and top-token agreement (speed figures are in the Speed section below):
| Metric | this quant | bartowski Q3_K_M | unsloth UD-Q3_K_XL | mradermacher i1-IQ4_XS | bartowski IQ4_XS | unsloth IQ4_XS |
|---|---|---|---|---|---|---|
| Size (BPW) | 4.256 | 4.270 | 4.302 | 4.483 | 4.556 | 4.589 |
| Size (GiB) | 13.327 | 13.370 | 13.469 | 14.036 | 14.266 | 14.369 |
| VRAM usage (GiB) | 12.698 | 12.861 | 12.803 | 13.407 | 13.637 | 13.703 |
| Mean PPL(Q) | 7.098696 ± 0.047344 | 6.993009 ± 0.046208 | 6.995519 ± 0.046227 | 7.020660 ± 0.046587 | 6.996323 ± 0.046332 | 6.950126 ± 0.045846 |
| Mean PPL(base) | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.19% | 98.52% | 98.82% | 99.30% | 99.32% | 99.38% |
| Mean KLD | 0.033452 ± 0.000723 | 0.058818 ± 0.000881 | 0.046348 ± 0.000841 | 0.027289 ± 0.000660 | 0.026270 ± 0.000653 | 0.024728 ± 0.000603 |
| Maximum KLD | 23.255085 | 24.616274 | 24.175169 | 18.568180 | 22.992002 | 21.687405 |
| 99.9% KLD | 2.907350 | 3.986622 | 3.614290 | 2.667850 | 2.385293 | 2.201674 |
| RMS Δp | 4.936 ± 0.054 % | 6.690 ± 0.059 % | 5.867 ± 0.060 % | 4.449 ± 0.057 % | 4.352 ± 0.057 % | 4.264 ± 0.056 % |
| Same top p | 92.427 ± 0.069 % | 90.350 ± 0.077 % | 91.829 ± 0.071 % | 93.903 ± 0.062 % | 93.888 ± 0.062 % | 93.997 ± 0.062 % |
- Compared to Q3_K_M from bartowski and UD-Q3_K_XL from unsloth, this IQ4_XS quant uses slightly less VRAM while offering better quality (lower mean KLD, higher top-token agreement).
- The plain IQ4_XS quants from mradermacher, bartowski, and unsloth have better quality still, but they use more VRAM and are harder to fit into 16 GiB.
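For reference, figures like these come from llama-perplexity's KL-divergence mode. A minimal sketch of the two-step workflow, assuming mainline llama.cpp and hypothetical file names (the exact commands used for this card are not stated):

```bash
# 1) Run the unquantized model once and save its logits for wiki.test.raw
./llama-perplexity -m Qwen3.6-27B-BF16.gguf -f wiki.test.raw \
  --kl-divergence-base base-logits.bin

# 2) Run the quant against the saved logits to get Mean PPL, Mean KLD, RMS Δp, Same top p, ...
./llama-perplexity -m Qwen3.6-27B-GGUF-4.256bpw.gguf -f wiki.test.raw \
  --kl-divergence-base base-logits.bin --kl-divergence
```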
With 16 GiB of VRAM, we can fit a context size of 65536 with a quantized KV cache:

```
# mainline llama.cpp
-c 65536 -ctk q8_0 -ctv q8_0 -np 1
```

For brave souls who seek the TurboQuant experience (see #21038), we can also fit a context size of 128000 with a more heavily quantized KV cache:

```
# mainline llama.cpp
-c 128000 -ctk q4_0 -ctv q4_0 -np 1
```
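Put together, a llama-server invocation for the 65536-token setup might look like the sketch below (hypothetical model path; the other flags are taken from above, plus -ngl 99 to offload all layers). Depending on the llama.cpp version, a quantized V cache may also require flash attention to be enabled.

```bash
# mainline llama.cpp; model filename is hypothetical
./llama-server -m Qwen3.6-27B-GGUF-4.256bpw.gguf \
  -ngl 99 -c 65536 -ctk q8_0 -ctv q8_0 -np 1
```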
## Size
Size from the llama-server output:

```
llm_load_print_meta: model size = 13.327 GiB (4.256 BPW)
llm_load_print_meta: repeating layers = 12.069 GiB (4.257 BPW, 24.353 B parameters)
...
llm_load_tensors: CUDA_Host buffer size = 644.14 MiB
llm_load_tensors: CUDA0 buffer size = 13003.14 MiB
```
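As a quick sanity check (not from the card itself), the two buffers above add up to the reported model size:

```bash
# 644.14 MiB (CUDA_Host) + 13003.14 MiB (CUDA0) ≈ 13.327 GiB
echo "scale=3; (644.14 + 13003.14) / 1024" | bc   # -> 13.327
```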
## Recipe
```
blk\..*\.attn_q\.weight=iq4_xs
blk\..*\.attn_k\.weight=iq4_xs
blk\..*\.attn_v\.weight=iq4_xs
blk\..*\.attn_output\.weight=iq4_xs
blk\..*\.attn_gate\.weight=iq4_xs
blk\..*\.attn_qkv\.weight=iq4_xs
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq4_xs
blk\..*\.ffn_down\.weight=iq4_xs
blk\..*\.ffn_(gate|up)\.weight=iq4_xs
token_embd\.weight=iq4_xs
output\.weight=iq4_xs
```
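The exact quantization command is not stated, but a recipe in this regex=type form maps naturally onto ik_llama.cpp's llama-quantize with its --custom-q rules (mainline's --tensor-type overrides can express the same idea). A sketch under those assumptions, with hypothetical file names; IQ4_XS as the base type already covers every tensor not matched by an override:

```bash
# hypothetical paths; --custom-q takes comma-separated regex=type rules (ik_llama.cpp)
./llama-quantize --imatrix imatrix.dat \
  --custom-q "blk\..*\.ssm_alpha\.weight=q8_0,blk\..*\.ssm_beta\.weight=q8_0" \
  Qwen3.6-27B-BF16.gguf Qwen3.6-27B-GGUF-4.256bpw.gguf IQ4_XS
```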
## Speed
llama-sweep-bench results on an RTX 3090, with flags `-ngl 99 -mqkv -muge -cuda graphs=1 -c 128000 -wgt 1 -wb` (a minimal invocation sketch follows the table):
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 0.335 | 1526.19 | 2.632 | 48.64 |
| 512 | 128 | 10240 | 0.376 | 1362.66 | 2.787 | 45.93 |
| 512 | 128 | 20480 | 0.416 | 1231.97 | 2.870 | 44.60 |
| 512 | 128 | 30720 | 0.457 | 1119.71 | 2.964 | 43.19 |
| 512 | 128 | 40960 | 0.500 | 1024.24 | 3.080 | 41.56 |
| 512 | 128 | 51200 | 0.545 | 940.27 | 3.183 | 40.21 |
| 512 | 128 | 61440 | 0.589 | 868.63 | 3.277 | 39.06 |
| 512 | 128 | 71680 | 0.630 | 812.78 | 3.378 | 37.89 |
| 512 | 128 | 81920 | 0.673 | 760.29 | 3.497 | 36.60 |
| 512 | 128 | 92160 | 0.716 | 715.36 | 3.605 | 35.51 |
| 512 | 128 | 102400 | 0.761 | 672.98 | 3.696 | 34.64 |
| 512 | 128 | 112640 | 0.802 | 638.68 | 3.798 | 33.70 |
| 512 | 128 | 122880 | 0.843 | 607.28 | 3.917 | 32.68 |
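A minimal way to run a similar sweep with ik_llama.cpp's llama-sweep-bench (a sketch; only common flags are shown, and the extra flags listed above come from the author's specific setup):

```bash
# hypothetical model path; sweeps prompt processing and generation speed across the KV cache
./llama-sweep-bench -m Qwen3.6-27B-GGUF-4.256bpw.gguf -ngl 99 -c 128000
```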
## Performance
This quant uses the imatrix from mradermacher. It performs well enough in long reasoning tasks and agentic tasks.