
sokann/Qwen3.6-27B-GGUF-4.256bpw


This is a 4.256 BPW quant for the GPU-poor with 16 GiB of VRAM. It works in both ik_llama.cpp and mainline llama.cpp.

It was quantized with the simplest possible recipe: Q8_0 for the tiny ssm_alpha and ssm_beta tensors, IQ4_XS for everything else.

From local testing with llama-perplexity (wiki.test.raw, 580 chunks), it offers the best quality and speed in its size class:

| quant | this | bartowski Q3_K_M | unsloth UD-Q3_K_XL | mradermacher i1.IQ4_XS | bartowski IQ4_XS | unsloth IQ4_XS |
|---|---|---|---|---|---|---|
| Size (BPW) | 4.256 | 4.270 | 4.302 | 4.483 | 4.556 | 4.589 |
| Size (GiB) | 13.327 | 13.370 | 13.469 | 14.036 | 14.266 | 14.369 |
| VRAM usage (GiB) | 12.698 | 12.861 | 12.803 | 13.407 | 13.637 | 13.703 |
| Mean PPL(Q) | 7.098696 ± 0.047344 | 6.993009 ± 0.046208 | 6.995519 ± 0.046227 | 7.020660 ± 0.046587 | 6.996323 ± 0.046332 | 6.950126 ± 0.045846 |
| Mean PPL(base) | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.19% | 98.52% | 98.82% | 99.30% | 99.32% | 99.38% |
| Mean KLD | 0.033452 ± 0.000723 | 0.058818 ± 0.000881 | 0.046348 ± 0.000841 | 0.027289 ± 0.000660 | 0.026270 ± 0.000653 | 0.024728 ± 0.000603 |
| Maximum KLD | 23.255085 | 24.616274 | 24.175169 | 18.568180 | 22.992002 | 21.687405 |
| 99.9% KLD | 2.907350 | 3.986622 | 3.614290 | 2.667850 | 2.385293 | 2.201674 |
| RMS Δp | 4.936 ± 0.054 % | 6.690 ± 0.059 % | 5.867 ± 0.060 % | 4.449 ± 0.057 % | 4.352 ± 0.057 % | 4.264 ± 0.056 % |
| Same top p | 92.427 ± 0.069 % | 90.350 ± 0.077 % | 91.829 ± 0.071 % | 93.903 ± 0.062 % | 93.888 ± 0.062 % | 93.997 ± 0.062 % |

  • Compared to Q3_K_M from bartowski and UD-Q3_K_XL from unsloth, this IQ4_XS quant uses slightly less VRAM while delivering better quality.
  • The IQ4_XS quants from mradermacher, bartowski, and unsloth have much better quality, but they use more VRAM and are harder to fit into 16 GiB of VRAM.

With 16 GiB of VRAM, we can fit a context size of 65536 with quantized KV cache:

# mainline llama.cpp
-c 65536 -ctk q8_0 -ctv q8_0 -np 1
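
Putting the flags together, a full server launch might look like the following sketch. The model filename and path are illustrative, not the actual file name of this release; `-ngl 99` is taken from the benchmark flags further down.

```shell
# Sketch of a launch command; substitute the real path to the downloaded GGUF.
llama-server \
  -m ./Qwen3.6-27B-GGUF-4.256bpw.gguf \
  -ngl 99 \
  -c 65536 -ctk q8_0 -ctv q8_0 -np 1
```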

For brave souls who seek the TurboQuant experience (see #21038), a context size of 128000 also fits with a more heavily quantized KV cache:

# mainline llama.cpp
-c 128000 -ctk q4_0 -ctv q4_0 -np 1
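
Nearly doubling the context at roughly the same memory cost works out because llama.cpp stores q8_0 at 8.5 bits per value (34 bytes per block of 32) and q4_0 at 4.5 bits per value (18 bytes per block of 32), and the KV cache scales linearly with context length. A quick back-of-the-envelope check:

```shell
# KV cache footprint ~ context_length * bits_per_value, so compare
# 128000 tokens at q4_0 (4.5 bpv) against 65536 tokens at q8_0 (8.5 bpv).
awk 'BEGIN {
  ratio = (128000 * 4.5) / (65536 * 8.5)
  printf "q4_0@128000 / q8_0@65536 = %.2f\n", ratio   # prints 1.03
}'
```

So the q4_0 cache at 128000 tokens is only about 3% larger than the q8_0 cache at 65536.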

Size

Size from llama-server output:

llm_load_print_meta: model size       = 13.327 GiB (4.256 BPW)
llm_load_print_meta: repeating layers = 12.069 GiB (4.257 BPW, 24.353 B parameters)
...
llm_load_tensors:  CUDA_Host buffer size =   644.14 MiB
llm_load_tensors:      CUDA0 buffer size = 13003.14 MiB
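
The size and BPW figures above are mutually consistent: BPW is just total bits divided by parameter count. A quick check on the repeating-layer line:

```shell
# 12.069 GiB of repeating-layer weights over 24.353 B parameters.
awk 'BEGIN {
  bits = 12.069 * 8 * 1024^3          # GiB -> bits
  printf "%.3f BPW\n", bits / 24.353e9   # prints 4.257 BPW
}'
```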

Recipe

blk\..*\.attn_q\.weight=iq4_xs
blk\..*\.attn_k\.weight=iq4_xs
blk\..*\.attn_v\.weight=iq4_xs
blk\..*\.attn_output\.weight=iq4_xs
blk\..*\.attn_gate\.weight=iq4_xs
blk\..*\.attn_qkv\.weight=iq4_xs

blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq4_xs

blk\..*\.ffn_down\.weight=iq4_xs
blk\..*\.ffn_(gate|up)\.weight=iq4_xs

token_embd\.weight=iq4_xs
output\.weight=iq4_xs
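
The recipe above can be applied in one pass with ik_llama.cpp's --custom-q override, which layers comma-separated regex=type pairs on top of a base quant type. A sketch only: the imatrix path and GGUF file names below are assumptions, not part of this release, and since everything except ssm_alpha/ssm_beta is IQ4_XS, a single override suffices.

```shell
# Hypothetical paths; only the ssm_alpha/ssm_beta tensors deviate from the
# iq4_xs base type, so one regex covers the whole recipe.
./llama-quantize \
  --imatrix imatrix.dat \
  --custom-q "blk\..*\.ssm_(alpha|beta)\.weight=q8_0" \
  Qwen3.6-27B-BF16.gguf Qwen3.6-27B-IQ4_XS.gguf iq4_xs
```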

Speed

llama-sweep-bench results on an RTX 3090, with flags -ngl 99 -mqkv -muge -cuda graphs=1 -c 128000 -wgt 1 -wb:

|  PP |  TG |   N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|----:|----:|-------:|-------:|---------:|-------:|---------:|
| 512 | 128 |      0 |  0.335 |  1526.19 |  2.632 |    48.64 |
| 512 | 128 |  10240 |  0.376 |  1362.66 |  2.787 |    45.93 |
| 512 | 128 |  20480 |  0.416 |  1231.97 |  2.870 |    44.60 |
| 512 | 128 |  30720 |  0.457 |  1119.71 |  2.964 |    43.19 |
| 512 | 128 |  40960 |  0.500 |  1024.24 |  3.080 |    41.56 |
| 512 | 128 |  51200 |  0.545 |   940.27 |  3.183 |    40.21 |
| 512 | 128 |  61440 |  0.589 |   868.63 |  3.277 |    39.06 |
| 512 | 128 |  71680 |  0.630 |   812.78 |  3.378 |    37.89 |
| 512 | 128 |  81920 |  0.673 |   760.29 |  3.497 |    36.60 |
| 512 | 128 |  92160 |  0.716 |   715.36 |  3.605 |    35.51 |
| 512 | 128 | 102400 |  0.761 |   672.98 |  3.696 |    34.64 |
| 512 | 128 | 112640 |  0.802 |   638.68 |  3.798 |    33.70 |
| 512 | 128 | 122880 |  0.843 |   607.28 |  3.917 |    32.68 |
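
For perspective, the steady slowdown as the KV cache fills amounts to roughly a one-third drop in generation speed over the full context:

```shell
# Generation speed falls from 48.64 t/s at an empty cache
# to 32.68 t/s at N_KV = 122880.
awk 'BEGIN { printf "%.1f%% slower\n", 100 * (48.64 - 32.68) / 48.64 }'
```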

Performance

This quant uses the imatrix from mradermacher. In local use it holds up well in long reasoning and agentic tasks.
