thetom-ai/Qwen3.6-27B-ConfigI-MLX
Qwen3.6-27B - TurboQuant+ Config-I (MLX)
27B-parameter dense model with Config-I mixed-precision quantization. Standard MLX format - works with stock mlx_lm and mlx-swift-lm. No custom loaders required.
Config-I quantization of Qwen/Qwen3.6-27B (27B dense, 64 layers, hybrid GatedDeltaNet + full attention). The policy applies 4-bit to the middle layers, protects the boundary layers at 8-bit, and shields the embeddings and lm_head at 8-bit (see the policy table below). See the Config-I paper for the policy derivation.
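Quickstart with stock mlx_lm, as a minimal sketch (assumes `pip install mlx-lm` on Apple Silicon; the prompt is illustrative):

```python
# Minimal generation sketch with stock mlx_lm -- no custom loader needed.
from mlx_lm import load, generate

model, tokenizer = load("thetom-ai/Qwen3.6-27B-ConfigI-MLX")
print(generate(model, tokenizer, prompt="Hello", max_tokens=64))
```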
Compression
| Variant | Size |
|---|---|
| bf16 source | ~54 GB |
| Config-I (mixed 4/8-bit) | ~20 GB |
Note: For dense models, Config-I's primary advantage over uniform 4-bit is quality preservation at boundary layers and embeddings, not size reduction. The aggressive 2-3 bit expert compression that drives size wins on MoE models does not apply here (no experts to compress).
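A quick consistency check on those sizes (decimal GB; the averaging is a back-of-envelope, not exact tensor accounting):

```python
# Sanity-check the table: bf16 bytes and the implied average bits per weight.
params = 27e9
print(params * 2 / 1e9)    # bf16 at 2 bytes/param -> 54 GB
print(20e9 * 8 / params)   # ~5.9 bits/param average for the ~20 GB artifact,
                           # consistent with mostly 4-bit weights plus per-group
                           # scales and 8-bit boundary/embedding tensors
```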
Tested with vllm-swift
vllm-swift is a native Swift/Metal backend for vLLM. Install with Homebrew:
```bash
brew tap TheTom/tap && brew install vllm-swift
```
Or from source:
```bash
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh   # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh     # sets DYLD_LIBRARY_PATH (generated by install.sh)
```
Serve this model:
```bash
vllm-swift download thetom-ai/Qwen3.6-27B-ConfigI-MLX
vllm-swift serve ~/models/Qwen3.6-27B-ConfigI-MLX \
  --served-model-name qwen3.6-27b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --additional-config '{"kv_scheme": "turbo4", "kv_bits": 4}'
```
This gives you an OpenAI-compatible API at http://localhost:8000 with tool calling support and 3.2x KV cache compression.
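Querying the endpoint from Python, as a minimal sketch (assumes `pip install openai` and the serve command above; the prompt is illustrative):

```python
from openai import OpenAI

# Any api_key works -- the local server does not check it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # must match --served-model-name
    messages=[{"role": "user", "content": "One-sentence summary of KV cache quantization?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```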
Decode Speed (M5 Max 128GB)
| Backend | Decode |
|---|---|
| vllm-swift (Swift/Metal) | 23.7 tok/s |
| mlx-lm (Python/MLX) | 19.8 tok/s |
vllm-swift is +20% faster than Python mlx-lm on single-request decode.
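A rough way to reproduce the mlx-lm number, as a sketch (wall-clock over `generate`, so it folds prompt processing into the rate and will read slightly low versus pure decode):

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("thetom-ai/Qwen3.6-27B-ConfigI-MLX")

start = time.perf_counter()
text = generate(model, tokenizer,
                prompt="Explain KV cache quantization in three sentences.",
                max_tokens=256)
elapsed = time.perf_counter() - start

print(f"{len(tokenizer.encode(text)) / elapsed:.1f} tok/s (includes prefill)")
```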
KV Cache Compression (TurboQuant)
Measured at 4K context on M5 Max 128GB via vllm-swift:
| Scheme | Compression | PPL | Decode | PPL vs fp16 | Decode vs fp16 |
|---|---|---|---|---|---|
| none (fp16) | 1.0x | 1.16 | 22.8 tok/s | — | — |
| turbo4 (K4V4) | 3.2x | 1.29 | 20.8 tok/s | +11% | -9% |
| turbo4v2 (K4V2) | 3.8x | 1.34 | 20.9 tok/s | +15% | -8% |
| turbo3v2 (K3V2) | 4.6x | 1.43 | 20.8 tok/s | +23% | -9% |
| turbo3 (K3V3) | 4.6x | 1.46 | 20.4 tok/s | +26% | -11% |
Recommendation: turbo4 symmetric (K4V4). Best PPL (+11% vs fp16) with 3.2x compression at only a 9% decode cost. All schemes are usable; even turbo3 at +26% relative PPL remains coherent (an absolute PPL of 1.46 is still low).
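For sizing the cache itself, a back-of-envelope estimator. The GQA geometry below (8 KV heads, head_dim 128) is an assumption for illustration, not a published spec; only the 16 full-attention layers grow a KV cache, since GDN layers keep fixed-size recurrent state:

```python
FULL_ATTN_LAYERS = 64 // 4       # every 4th layer has full attention
KV_HEADS, HEAD_DIM = 8, 128      # ASSUMED geometry, for illustration only

def kv_gb(tokens: int, compression: float) -> float:
    bits = 16 / compression      # effective bits/element vs fp16
    return tokens * FULL_ATTN_LAYERS * 2 * KV_HEADS * HEAD_DIM * bits / 8 / 1e9

for name, ratio in [("fp16", 1.0), ("turbo4", 3.2), ("turbo4v2", 3.8), ("turbo3", 4.6)]:
    print(f"{name:9s} ~{kv_gb(40_960, ratio):.2f} GB at 40960-token context")
```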
Config-I Policy (Qwen3.6 Dense Adaptation)
64 layers with hybrid attention (GatedDeltaNet + full attention at every 4th layer).
| Component | Bits | Layers | Rationale |
|---|---|---|---|
| Attention Q/K/V/O | 4-bit | middle 60 | Standard attention compression |
| FFN gate/up | 4-bit | middle 60 | Read projections |
| FFN down | 4-bit | middle 60 | Write-back projections |
| GDN projections | 4-bit | middle 60 | Linear attention layers |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | Boundary layer protection |
| Embeddings + lm_head | 8-bit | - | Protected |
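The same policy as code, a restatement of the table above rather than the converter's actual implementation (embeddings and lm_head sit outside the per-layer loop, at 8-bit):

```python
N_LAYERS = 64

def config_i_bits(layer: int) -> int:
    """Weight bits for a 0-indexed transformer layer under Config-I."""
    # First 2 and last 2 layers: boundary protection at 8-bit.
    if layer < 2 or layer >= N_LAYERS - 2:
        return 8
    return 4  # middle 60 layers: attention, GDN, and FFN tensors

assert [config_i_bits(i) for i in (0, 1, 2, 61, 62, 63)] == [8, 8, 4, 4, 8, 8]
```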
What is Config-I?
Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Through systematic A/B isolation, it was discovered that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight: compression policy matters more than compression math - which tensors to compress, which to protect, and how aggressively.
Config-I has been validated on MiniMax M2.7 (93.5% MMLU, PPL 4.604, 12/12 NIAH) and across the Qwen and Phi model families at 27-38% size reduction for a +1.0-3.9% PPL increase. See MiniMax M2.7 Config-I results for a fully benchmarked reference.
Compatibility
| Field | Value |
|---|---|
| Format | MLX safetensors (standard) |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift), vllm-swift |
| Platform | Apple Silicon (M-series with 32GB+) |
| Quantized on | 2026-04-24 |
No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with config.json quantization metadata will work.
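To see what those tools read, you can inspect the metadata directly. A sketch assuming the MLX convention of a top-level "quantization" dict in config.json with per-module overrides as nested entries (exact keys can vary by converter version):

```python
import json, pathlib

cfg_path = pathlib.Path("~/models/Qwen3.6-27B-ConfigI-MLX/config.json").expanduser()
quant = json.loads(cfg_path.read_text()).get("quantization", {})

print("defaults:", {k: v for k, v in quant.items() if not isinstance(v, dict)})
for name, spec in quant.items():   # per-module overrides (e.g. 8-bit boundaries)
    if isinstance(spec, dict):
        print(name, spec)
```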
Links
- vllm-swift — Native Swift/Metal backend for vLLM
- Config-I Paper
- Getting Started Guide
- TurboQuant+ Repository
- MiniMax M2.7 Config-I (fully benchmarked)
Quantized by @thetom-ai | GitHub | X | Sponsor