sokann/Qwen3.6-27B-GGUF-5.076bpw

This is a 5.076 BPW quantized model for the GPU-poor with 24 GiB of VRAM. It uses the SOTA IQK quants, and thus works only with ik_llama.cpp.

From local testing with llama-perplexity (wiki.test.raw, 580 chunks), it offers the best quality and speed in its size class:

| quant | this | bartowski Q4_K_L | unsloth UD-Q4_K_XL | mradermacher i1-Q4_K_M | ubergarm IQ5_KS |
|---|---|---|---|---|---|
| Size (BPW) | 5.076 | 5.493 | 5.235 | 4.919 | 5.919 |
| Size (GiB) | 15.893 | 17.198 | 16.393 | 15.401 | 18.532 |
| VRAM usage (GiB) | 14.931 | 15.940 | 15.727 | 14.735 | 17.570 |
| Mean PPL(Q) | 6.982381 ± 0.046281 | 6.992025 ± 0.046419 | 6.970005 ± 0.046042 | 6.938483 ± 0.045668 | 6.931115 ± 0.045750 |
| Mean PPL(base) | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.47% | 99.40% | 99.45% | 99.32% | 99.79% |
| Mean KLD | 0.019613 ± 0.000587 | 0.020410 ± 0.000637 | 0.020399 ± 0.000643 | 0.024690 ± 0.000686 | 0.008430 ± 0.000466 |
| Maximum KLD | 22.204016 | 21.812332 | 20.409454 | 20.942572 | 21.321548 |
| 99.9% KLD | 1.703208 | 2.161336 | 2.067440 | 2.667204 | 0.698605 |
| RMS Δp | 3.843 ± 0.058 % | 3.812 ± 0.059 % | 3.778 ± 0.058 % | 4.264 ± 0.062 % | 2.467 ± 0.061 % |
| Same top p | 94.767 ± 0.058 % | 94.824 ± 0.058 % | 94.824 ± 0.058 % | 94.203 ± 0.061 % | 96.618 ± 0.047 % |
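
These figures come from llama-perplexity's KL-divergence mode. A minimal sketch of the two-pass workflow, assuming placeholder file names (save the base model's logits once, then score each quant against them):

# Pass 1: save the full-precision logits over wiki.test.raw
./llama-perplexity -m Qwen3.6-27B-F16.gguf -f wiki.test.raw \
    --kl-divergence-base base_logits.bin

# Pass 2: score a quant against the saved logits; this prints Mean PPL(Q),
# Mean KLD, RMS Δp, Same top p, etc.
./llama-perplexity -m Qwen3.6-27B-GGUF-5.076bpw.gguf \
    --kl-divergence-base base_logits.bin --kl-divergence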

With 24 GiB of VRAM, we can fit a context size of 128000 with F16 KV cache:

-c 128000 -wgt 1

or a context size of 262144 with quantized KV cache:

-c 262144 -wgt 1 -ctk q8_0 -khad -ctv q6_0 -vhad
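
A full server invocation might look like the following sketch; the model path and port are placeholders, and the ik_llama.cpp flags are the ones shown above:

./llama-server -m Qwen3.6-27B-GGUF-5.076bpw.gguf \
    -ngl 99 -c 262144 -wgt 1 \
    -ctk q8_0 -khad -ctv q6_0 -vhad \
    --port 8080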

Size

Size from llama-server output:

llm_load_print_meta: model size       = 15.893 GiB (5.076 BPW)
llm_load_print_meta: repeating layers = 13.969 GiB (4.927 BPW, 24.353 B parameters)
...
llm_load_tensors:  CUDA_Host buffer size =   985.16 MiB
llm_load_tensors:      CUDA0 buffer size = 15288.91 MiB
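
As a back-of-the-envelope sanity check, the reported size and BPW are consistent with a ~27B-parameter model (taking 1 GiB = 2^30 bytes):

awk 'BEGIN { print 15.893 * 2^30 * 8 / 5.076 / 1e9 }'   # ≈ 26.9 billion parameters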

This is slightly bigger than the Qwen3.5-27B-4.915bpw quant, due to these changes:

  • attention: mixture of IQ6_K and IQ5_K => Q6_0
  • token_embd: IQ4_K => Q6_0
  • output: IQ6_K => Q6_0

The recipe is almost identical to the IQ4_KS + Q6_0 recipe shared by IK in https://github.com/ikawrakow/ik_llama.cpp/discussions/1663, with ssm_alpha and ssm_beta getting a slight bump from Q6_0 to Q8_0.

IQ4_KS + Q6_0 make a good and fast combo, as noted by IK in the same discussion:

  • IQ4_KS is better than IQ4_XS, has the same size, has the same performance on CUDA and CPU
  • IQ5_K will wipe the floor with Q5_K in terms of quantization accuracy at the same bpw. One issue is that IQ5_K PP is lower on CUDA because of the block size of 16. It is about on par for TG on CUDA, about on par for TG on the CPU, and I think slightly faster PP on the CPU. If one does not want to take the CUDA performance penalty, one could replace Q5_K with IQ5_KS. This will be strictly faster, will use 0.25 bpw less than Q5_K, and will have about the same quantization accuracy.
  • I think in many cases one can replace Q6_K with Q6_0, which is quite a bit faster on CUDA while giving about the same quantization accuracy as Q6_K. IQ6_K is better, but slower.
  • The situation for Q4_K vs IQ4_KS and IQ4_K is similar to Q5_K vs IQ5_KS and IQ5_K.
Recipe

blk\..*\.attn_q\.weight=q6_0
blk\..*\.attn_k\.weight=q6_0
blk\..*\.attn_v\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0
blk\..*\.attn_gate\.weight=q6_0
blk\..*\.attn_qkv\.weight=q6_0

blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q6_0

blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

token_embd\.weight=q6_0
output\.weight=q6_0
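
For reference, a recipe like this can be applied with ik_llama.cpp's llama-quantize via its --custom-q option. A sketch with placeholder file names; the positional iq4_ks default covers the ffn tensors, and the explicit rules mirror the recipe above:

./llama-quantize --imatrix qwen3.6-27b.imatrix \
    --custom-q "blk\..*\.attn_(q|k|v|output|gate|qkv)\.weight=q6_0,blk\..*\.ssm_(alpha|beta)\.weight=q8_0,blk\..*\.ssm_out\.weight=q6_0,token_embd\.weight=q6_0,output\.weight=q6_0" \
    Qwen3.6-27B-F16.gguf Qwen3.6-27B-GGUF-5.076bpw.gguf iq4_ks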

Speed

llama-sweep-bench result with an RTX 3090, using the flags -ngl 99 -mqkv -muge -cuda graphs=1 -c 128000 -wgt 1 -wb:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|--------|--------|----------|--------|----------|
| 512 | 128 | 0      | 0.350  | 1461.47  | 3.088  | 41.45    |
| 512 | 128 | 10240  | 0.389  | 1315.93  | 3.230  | 39.62    |
| 512 | 128 | 20480  | 0.433  | 1183.72  | 3.311  | 38.65    |
| 512 | 128 | 30720  | 0.474  | 1080.64  | 3.413  | 37.50    |
| 512 | 128 | 40960  | 0.517  | 990.04   | 3.532  | 36.24    |
| 512 | 128 | 51200  | 0.560  | 914.01   | 3.636  | 35.20    |
| 512 | 128 | 61440  | 0.604  | 847.72   | 3.725  | 34.36    |
| 512 | 128 | 71680  | 0.647  | 791.61   | 3.826  | 33.46    |
| 512 | 128 | 81920  | 0.691  | 741.16   | 3.945  | 32.44    |
| 512 | 128 | 92160  | 0.735  | 696.16   | 4.053  | 31.58    |
| 512 | 128 | 102400 | 0.779  | 657.14   | 4.143  | 30.89    |
| 512 | 128 | 112640 | 0.823  | 622.27   | 4.261  | 30.04    |
| 512 | 128 | 122880 | 0.866  | 591.36   | 4.391  | 29.15    |
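
To reproduce the run (the model path is a placeholder; the flags are the ones listed above):

./llama-sweep-bench -m Qwen3.6-27B-GGUF-5.076bpw.gguf \
    -ngl 99 -mqkv -muge -cuda graphs=1 -c 128000 -wgt 1 -wb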

Task performance

This quant uses the imatrix from mradermacher. Some long reasoning tasks that the full-precision model (served from https://dashscope-intl.aliyuncs.com/compatible-mode/v1) solves at roughly 50:50 odds are also solved at roughly 50:50 odds by this quant when built with that imatrix. Without any imatrix, the quant cannot solve these tasks at all. This finding vindicates the importance of the importance matrix.

On agentic tasks tested with the pi agent, tasks that the full-precision model solves reliably are also solved reliably by this quant.

For mainline llama.cpp users with 24 GiB of VRAM, I recommend the i1-Q4_K_M from mradermacher, which also performs quite well in limited testing.

For ik_llama.cpp users who need even higher quality, I recommend the IQ5_KS from ubergarm, which is near lossless.

Verdict

We get Sonnet 4.5 at home with a used RTX 3090.
