# sokann/Qwen3.6-27B-GGUF-5.076bpw
This is a 5.076 BPW quantized model for the GPU poor with 24 GiB of VRAM. It uses the SOTA IQK quants and therefore works only in ik_llama.cpp.

From local testing with llama-perplexity (wiki.test.raw, 580 chunks), it offers the best combination of quality and speed in its size class:
| quant | this | bartowski Q4_K_L | unsloth UD-Q4_K_XL | mradermacher i1.Q4_K_M | ubergarm IQ5_KS |
|---|---|---|---|---|---|
| Size (BPW) | 5.076 | 5.493 | 5.235 | 4.919 | 5.919 |
| Size (GiB) | 15.893 | 17.198 | 16.393 | 15.401 | 18.532 |
| VRAM usage (GiB) | 14.931 | 15.940 | 15.727 | 14.735 | 17.570 |
| Mean PPL(Q) | 6.982381 ± 0.046281 | 6.992025 ± 0.046419 | 6.970005 ± 0.046042 | 6.938483 ± 0.045668 | 6.931115 ± 0.045750 |
| Mean PPL(base) | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.47% | 99.40% | 99.45% | 99.32% | 99.79% |
| Mean KLD | 0.019613 ± 0.000587 | 0.020410 ± 0.000637 | 0.020399 ± 0.000643 | 0.024690 ± 0.000686 | 0.008430 ± 0.000466 |
| Maximum KLD | 22.204016 | 21.812332 | 20.409454 | 20.942572 | 21.321548 |
| 99.9% KLD | 1.703208 | 2.161336 | 2.067440 | 2.667204 | 0.698605 |
| RMS Δp | 3.843 ± 0.058 % | 3.812 ± 0.059 % | 3.778 ± 0.058 % | 4.264 ± 0.062 % | 2.467 ± 0.061 % |
| Same top p | 94.767 ± 0.058 % | 94.824 ± 0.058 % | 94.824 ± 0.058 % | 94.203 ± 0.061 % | 96.618 ± 0.047 % |
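The numbers above come from llama-perplexity's KL-divergence mode. A sketch of that workflow, with hypothetical model and file names, assuming the standard llama-perplexity flags:

```shell
# Hypothetical paths. First, save the base model's logits over the test set:
./llama-perplexity -m Qwen3.6-27B-BF16.gguf -f wiki.test.raw \
    --kl-divergence-base base-logits.bin

# Then score a quant against the saved logits; this reports Mean PPL,
# Mean/99.9%/Max KLD, RMS Δp, and same-top-p, as tabulated above.
./llama-perplexity -m Qwen3.6-27B-5.076bpw.gguf -f wiki.test.raw \
    --kl-divergence-base base-logits.bin --kl-divergence
```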
With 24 GiB of VRAM, we can fit a context size of 128000 with an F16 KV cache:

```
-c 128000 -wgt 1
```

or a context size of 262144 with a quantized KV cache:

```
-c 262144 -wgt 1 -ctk q8_0 -khad -ctv q6_0 -vhad
```
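Putting it together, a full llama-server invocation might look like this (the model path is hypothetical; the flags are the ones listed above):

```shell
# Hypothetical model path; flags as recommended above for 24 GiB of VRAM
# with a quantized KV cache on ik_llama.cpp.
./llama-server -m Qwen3.6-27B-5.076bpw.gguf -ngl 99 \
    -c 262144 -wgt 1 -ctk q8_0 -khad -ctv q6_0 -vhad
```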
## Size

Size from the llama-server output:

```
llm_load_print_meta: model size = 15.893 GiB (5.076 BPW)
llm_load_print_meta: repeating layers = 13.969 GiB (4.927 BPW, 24.353 B parameters)
...
llm_load_tensors: CUDA_Host buffer size = 985.16 MiB
llm_load_tensors: CUDA0 buffer size = 15288.91 MiB
```
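As a quick sanity check, the BPW figures follow directly from the sizes and parameter counts in the log (a small awk sketch; 1 GiB = 2^30 bytes):

```shell
# Sanity-check the figures reported by llama-server above.
# BPW = GiB * 2^30 * 8 / parameters; the inverse recovers the parameter count.
awk 'BEGIN {
    printf "repeating layers: %.3f BPW\n", 13.969 * 2^30 * 8 / 24.353e9
    printf "whole model: %.1f B params\n",  15.893 * 2^30 * 8 / 5.076 / 1e9
}'
# → repeating layers: 4.927 BPW
# → whole model: 26.9 B params
```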
This is slightly bigger than the Qwen3.5-27B-4.915bpw quant, due to these changes:
- attention: mixture of IQ6_K and IQ5_K => Q6_0
- token_embd: IQ4_K => Q6_0
- output: IQ6_K => Q6_0
The recipe is almost identical to the IQ4_KS + Q6_0 recipe shared by IK in https://github.com/ikawrakow/ik_llama.cpp/discussions/1663, except that ssm_alpha and ssm_beta get a slight bump from Q6_0 to Q8_0. IQ4_KS + Q6_0 form a good and fast combo, as IK notes in that same discussion:
> - `IQ4_KS` is better than `IQ4_XS`, has the same size, has the same performance on CUDA and CPU
> - `IQ5_K` will wipe the floor with `Q5_K` in terms of quantization accuracy at the same bpw. One issue is that `IQ5_K` PP is lower on CUDA because of the block size of 16. It is about on par for TG on CUDA, about on par for TG on the CPU, and I think slightly faster PP on the CPU. If one does not want to take the CUDA performance penalty, one could replace `Q5_K` with `IQ5_KS`. This will be strictly faster, will use 0.25 bpw less than `Q5_K`, and will have about the same quantization accuracy.
> - I think in many cases one can replace `Q6_K` with `Q6_0`, which is quite a bit faster on CUDA while giving about the same quantization accuracy as `Q6_K`. `IQ6_K` is better, but slower.
> - The situation for `Q4_K` vs `IQ4_KS` and `IQ4_K` is similar to `Q5_K` vs `IQ5_KS` and `IQ5_K`.
## Recipe

```
blk\..*\.attn_q\.weight=q6_0
blk\..*\.attn_k\.weight=q6_0
blk\..*\.attn_v\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0
blk\..*\.attn_gate\.weight=q6_0
blk\..*\.attn_qkv\.weight=q6_0
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q6_0
blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks
token_embd\.weight=q6_0
output\.weight=q6_0
```
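A sketch of how such a regex=type recipe could be applied, assuming ik_llama.cpp's llama-quantize with its `--custom-q` option; the file names and the `recipe.txt` file (one regex=type rule per line, as listed above) are hypothetical:

```shell
# Join the per-line rules into the single comma-separated string
# that --custom-q expects (assumption: recipe.txt holds the rules above).
CUSTOM_Q=$(paste -sd, recipe.txt)

# Quantize with the imatrix and the custom per-tensor rules;
# iq4_ks serves as the default type for any tensor not matched by a rule.
./llama-quantize --imatrix imatrix.dat --custom-q "$CUSTOM_Q" \
    Qwen3.6-27B-BF16.gguf Qwen3.6-27B-5.076bpw.gguf iq4_ks
```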
## Speed

llama-sweep-bench results on an RTX 3090, with flags `-ngl 99 -mqkv -muge -cuda graphs=1 -c 128000 -wgt 1 -wb`:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 0.350 | 1461.47 | 3.088 | 41.45 |
| 512 | 128 | 10240 | 0.389 | 1315.93 | 3.230 | 39.62 |
| 512 | 128 | 20480 | 0.433 | 1183.72 | 3.311 | 38.65 |
| 512 | 128 | 30720 | 0.474 | 1080.64 | 3.413 | 37.50 |
| 512 | 128 | 40960 | 0.517 | 990.04 | 3.532 | 36.24 |
| 512 | 128 | 51200 | 0.560 | 914.01 | 3.636 | 35.20 |
| 512 | 128 | 61440 | 0.604 | 847.72 | 3.725 | 34.36 |
| 512 | 128 | 71680 | 0.647 | 791.61 | 3.826 | 33.46 |
| 512 | 128 | 81920 | 0.691 | 741.16 | 3.945 | 32.44 |
| 512 | 128 | 92160 | 0.735 | 696.16 | 4.053 | 31.58 |
| 512 | 128 | 102400 | 0.779 | 657.14 | 4.143 | 30.89 |
| 512 | 128 | 112640 | 0.823 | 622.27 | 4.261 | 30.04 |
| 512 | 128 | 122880 | 0.866 | 591.36 | 4.391 | 29.15 |
## Performance

This quant uses the imatrix from mradermacher. There are long reasoning tasks that the full-precision model, served from https://dashscope-intl.aliyuncs.com/compatible-mode/v1, solves at about a 50:50 chance; this quant, built with mradermacher's imatrix, solves them at about the same rate. Without any imatrix, the quant cannot solve these tasks at all. This finding underlines the importance of the importance matrix.
On agentic tasks tested with pi agent, the tasks that can be reliably solved by the full precision model can also be reliably solved by this quant.
For mainline llama.cpp users with 24 GiB of VRAM, I would recommend the i1-Q4_K_M from mradermacher, which also performs quite well in limited testing.

For ik_llama.cpp users who need even higher quality, I would recommend the IQ5_KS from ubergarm, which is near lossless.
## Verdict
We get Sonnet 4.5 at home with a used RTX 3090.