# cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF

Qwen3.6-27B-i1-IQ4_XS (Fully Optimized)
## Motivation
Recent updates in the llama.cpp repository (specifically commit 1dab5f5a44) introduced a hardcoded minimum quantization of q5_K for attn_qkv layers. While this was likely intended to preserve model quality, it causes a noticeable bloat in the final file sizes.
For comparison, the highly efficient Qwen3.5-27B iq4_xs by mradermacher weighed in at 14.7 GB, whereas the equivalent Qwen3.6 i1-GGUF built under the new commit's rules swelled to over 15.1 GB.
## Methodology
To restore the optimal balance of size and performance, I modified the llama.cpp source so that attn_qkv layers are once again quantized as pure IQ4_XS. This mirrors, 1:1, the layer quantization strategy originally used in mradermacher's Qwen3.5-27B release.
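For readers who want to reproduce the patch, the shape of the change looks roughly like the following. This is a minimal sketch, not the literal diff: the helper name `pick_attn_qkv_type` is hypothetical, and the real rule sits inside llama.cpp's per-tensor type selection (`llama_tensor_get_type`), whose exact conditions vary between commits.

```cpp
#include <string>
#include "ggml.h"  // ggml_type / GGML_TYPE_* enums from the llama.cpp tree

// Illustrative sketch only: the real logic lives in llama.cpp's per-tensor
// type selection and its exact conditions differ by commit.
static ggml_type pick_attn_qkv_type(const std::string & tensor_name,
                                    ggml_type requested_type,
                                    bool enforce_q5k_floor) {
    if (enforce_q5k_floor &&
        tensor_name.find("attn_qkv.weight") != std::string::npos &&
        requested_type == GGML_TYPE_IQ4_XS) {
        // Upstream behaviour after commit 1dab5f5a44: bump attn_qkv to q5_K.
        return GGML_TYPE_Q5_K;
    }
    // Reverted behaviour used for this build: keep pure IQ4_XS everywhere,
    // matching the layer map of mradermacher's Qwen3.5-27B release.
    return requested_type;
}
```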
This model was quantized using the imatrix provided by mradermacher: Qwen3.6-27B-i1-GGUF.
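The quantization step itself follows the standard `llama-quantize` flow with the downloaded imatrix; the file names below are placeholders for your local F16 conversion and imatrix files.

```bash
# Quantize the F16 GGUF to IQ4_XS using the external imatrix.
# File names are placeholders; point them at your local copies.
./llama-quantize --imatrix Qwen3.6-27B.imatrix \
    Qwen3.6-27B-F16.gguf Qwen3.6-27B.i1-IQ4_XS.gguf IQ4_XS
```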
## Performance vs. Size Trade-off
Extensive perplexity testing (llama-perplexity on pg19.txt, 65k context, Q8_0 KV cache) confirms that forcing pure IQ4_XS across all layers results in a statistically insignificant quality drop (+0.0039 PPL) while noticeably reducing the memory footprint.
```bash
# Baseline: standard IQ4_XS build (attn_qkv forced to q5_K)
./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
# Custom: pure IQ4_XS build (attn_qkv kept at IQ4_XS)
./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
```
## 🧠 Intelligence (Perplexity) Comparison
| Model Version | Perplexity (PPL) | Difference / Quality Drop |
|---|---|---|
| Standard IQ4_XS (with q5_K attn_qkv) | 7.3765 ± 0.02760 | Baseline |
| Custom IQ4_XS (pure IQ4_XS, all layers) | 7.3804 ± 0.02762 | +0.0039 (negligible) |
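The "~0.05%" figure quoted in the conclusion follows directly from the table:

$$\frac{7.3804 - 7.3765}{7.3765} = \frac{0.0039}{7.3765} \approx 5.3 \times 10^{-4} \approx 0.05\%$$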
## Conclusion

By using this custom build, you save 375 MiB of active memory and bring the static file size back toward the 14.7 GB mark, with a practically non-existent impact on output quality (~0.05% relative PPL increase).