
thetom-ai/Qwen3.6-27B-ConfigI-MLX


Qwen3.6-27B - TurboQuant+ Config-I (MLX)

27B-parameter dense model with Config-I mixed-precision quantization. Standard MLX format - works with stock mlx_lm and mlx-swift-lm. No custom loaders required.
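
For instance, a minimal generation run with stock mlx_lm (the repo id below is this card's; install mlx-lm via pip first):

```python
# Minimal sketch: load and generate with stock mlx_lm, no custom loader.
from mlx_lm import load, generate

# load() accepts a Hub repo id or a local path.
model, tokenizer = load("thetom-ai/Qwen3.6-27B-ConfigI-MLX")
text = generate(
    model,
    tokenizer,
    prompt="Explain KV cache quantization in two sentences.",
    max_tokens=128,
)
print(text)
```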

Config-I quantization of Qwen/Qwen3.6-27B (27B dense, 64 layers, hybrid GatedDeltaNet + full attention). The policy applies 4-bit to the middle layers, protects the boundary layers at 8-bit, and shields the embeddings and lm_head at 8-bit. See the Config-I paper for the policy derivation.

Compression

| Variant | Size |
|---|---|
| bf16 source | ~54 GB |
| Config-I (mixed 4/8-bit) | ~20 GB |

Note: For dense models, Config-I's primary advantage over uniform 4-bit is quality preservation at boundary layers and embeddings, not size reduction. The aggressive 2-3 bit expert compression that drives size wins on MoE models does not apply here (no experts to compress).
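
As a sanity check, the implied average precision follows directly from the two sizes above (pure arithmetic, nothing measured):

```python
# Average bits per weight implied by the reported sizes.
params = 27e9            # parameter count from this card
bf16_bytes = 54e9        # ~54 GB bf16 source
configi_bytes = 20e9     # ~20 GB Config-I

print(bf16_bytes * 8 / params)     # 16.0 -> consistent with bf16
print(configi_bytes * 8 / params)  # ~5.9 -> mostly 4-bit weights plus
                                   # 8-bit protected tensors and
                                   # group-scale overhead
```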

Tested with vllm-swift


vllm-swift is a native Swift/Metal backend for vLLM. Install with Homebrew:

brew tap TheTom/tap && brew install vllm-swift

Or from source:

git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh      # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH (generated by install.sh)

Serve this model:

vllm-swift download thetom-ai/Qwen3.6-27B-ConfigI-MLX
vllm-swift serve ~/models/Qwen3.6-27B-ConfigI-MLX \
    --served-model-name qwen3.6-27b \
    --max-model-len 40960 \
    --enable-auto-tool-choice --tool-call-parser hermes \
    --additional-config '{"kv_scheme": "turbo4", "kv_bits": 4}'

This gives you an OpenAI-compatible API at http://localhost:8000 with tool calling support and 3.2x KV cache compression.
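
A quick smoke test with the standard OpenAI Python client (the model name matches --served-model-name above; the placeholder API key assumes the server was started without one):

```python
# Minimal request against the local OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # matches --served-model-name
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

Tool calling goes through the same endpoint via the standard tools parameter; the hermes parser configured above handles the model's tool-call format server-side.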

Decode Speed (M5 Max 128GB)

| Backend | Decode |
|---|---|
| vllm-swift (Swift/Metal) | 23.7 tok/s |
| mlx-lm (Python/MLX) | 19.8 tok/s |

vllm-swift is about 20% faster than Python mlx-lm on single-request decode.

KV Cache Compression (TurboQuant)

Measured at 4K context on M5 Max 128GB via vllm-swift:

| Scheme | Compression | PPL | Decode | PPL vs fp16 | Decode vs fp16 |
|---|---|---|---|---|---|
| none (fp16) | 1.0x | 1.16 | 22.8 tok/s | - | - |
| turbo4 (K4V4) | 3.2x | 1.29 | 20.8 tok/s | +11% | -9% |
| turbo4v2 (K4V2) | 3.8x | 1.34 | 20.9 tok/s | +15% | -8% |
| turbo3v2 (K3V2) | 4.6x | 1.43 | 20.8 tok/s | +23% | -9% |
| turbo3 (K3V3) | 4.6x | 1.46 | 20.4 tok/s | +26% | -11% |

Recommendation: turbo4 symmetric (K4V4). Best PPL (+11% vs fp16) with 3.2x compression and only -9% decode cost. All schemes are usable — even turbo3 at +26% PPL is still coherent (PPL 1.46 is low).
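
The ratios translate directly into effective bits per cached element against the 16-bit fp16 baseline; the gap above the nominal K/V widths presumably reflects quantization metadata such as per-group scales (our inference, not stated on this card):

```python
# Effective bits per KV element implied by the measured compression ratios.
schemes = {
    "turbo4 (K4V4)":   3.2,
    "turbo4v2 (K4V2)": 3.8,
    "turbo3v2 (K3V2)": 4.6,
    "turbo3 (K3V3)":   4.6,
}
for name, ratio in schemes.items():
    print(f"{name}: {16 / ratio:.1f} bits/element")
# turbo4 comes out at 5.0 bits against a nominal 4,
# i.e. about 1 bit/element of overhead.
```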

Config-I Policy (Qwen3.6 Dense Adaptation)

64 layers with hybrid attention (GatedDeltaNet + full attention at every 4th layer).

| Component | Bits | Layers | Rationale |
|---|---|---|---|
| Attention Q/K/V/O | 4-bit | middle 60 | Standard attention compression |
| FFN gate/up | 4-bit | middle 60 | Read projections |
| FFN down | 4-bit | middle 60 | Write-back projections |
| GDN projections | 4-bit | middle 60 | Linear attention layers |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | Boundary layer protection |
| Embeddings + lm_head | 8-bit | - | Protected |
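
Expressed as code, the per-layer assignment is a few lines. This is an illustrative rendering of the table above, not the actual TurboQuant+ implementation:

```python
# Hypothetical sketch of the Config-I per-layer bit policy described above.
NUM_LAYERS = 64

def layer_bits(layer: int) -> int:
    """Bit-width for all quantized tensors in a given layer."""
    if layer < 2 or layer >= NUM_LAYERS - 2:
        return 8  # boundary layer protection (first 2 + last 2)
    return 4      # middle 60: attention, FFN, and GDN projections

# Embeddings and lm_head are assigned 8-bit outside the per-layer loop.
assert layer_bits(0) == 8 and layer_bits(63) == 8
assert sum(layer_bits(i) == 4 for i in range(NUM_LAYERS)) == 60
```

Stock mlx_lm can apply per-layer policies like this at conversion time via its quantization predicate hook, which is presumably why no custom loader is needed at inference.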

What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Through systematic A/B isolation, it was discovered that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight: compression policy matters more than compression math - which tensors to compress, which to protect, and how aggressively.

Config-I has been validated on MiniMax M2.7 (93.5% MMLU, PPL 4.604, 12/12 NIAH) and across Qwen/Phi model families at 27-38% size reduction with +1.0-3.9% PPL. See MiniMax M2.7 Config-I results for a fully benchmarked reference.

Compatibility

| Field | Value |
|---|---|
| Format | MLX safetensors (standard) |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift), vllm-swift |
| Platform | Apple Silicon (M-series, 32 GB+ unified memory) |
| Quantized on | 2026-04-24 |

No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with config.json quantization metadata will work.
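
To see that metadata, a few lines suffice (this assumes the common MLX layout where config.json holds a top-level "quantization" dict with per-tensor overrides for mixed-precision models; verify against the actual file):

```python
# Inspect MLX quantization metadata (assumed layout; keys may vary).
import json

with open("Qwen3.6-27B-ConfigI-MLX/config.json") as f:
    quant = json.load(f).get("quantization", {})

print("defaults:", quant.get("bits"), "bits, group size", quant.get("group_size"))
for key, value in quant.items():
    if isinstance(value, dict):  # per-tensor override, e.g. an 8-bit boundary layer
        print(f"{key}: {value.get('bits')} bits")
```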

Links


Quantized by @thetom-ai | GitHub | X | Sponsor
