rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm

Qwen3.6-27B — PrismaQuant 5.5 bpp

PrismaQuant source · License: Apache-2.0 · vLLM native

Mixed-precision quantization of Qwen/Qwen3.6-27B produced by PrismaQuant — a per-Linear sensitivity-driven allocator that chooses each Linear module's format individually under a total-bit budget. It shares the allocator and activation-aware export stack with the 35B-A3B sibling; sibling coupling is pre-aggregated into the DP, so the achieved bpp hits the target exactly (5.500, not 5.28).

This checkpoint sits at the Pareto knee of the Δloss-vs-bpp curve — see Why 5.5 bpp below for the full sweep and selection rationale.


At a glance

| Metric | BF16 source | This artifact | Delta |
|---|---|---|---|
| Size on disk | 54 GB | ~19 GB | −65 % |
| Fraction of original weights | 100 % | 35 % | |
| Average bits per param | 16 | 5.50 | |
| Multimodal (vision + text) | ✓ | ✓ | |
| MTP speculative decoding head | ✓ | ✓ | |
| Loads in vLLM (stock compressed-tensors) | | ✓ | |
| Runtime backend | any | vLLM only | |

Precision mix

Selected per-Linear by the allocator from measured Fisher sensitivity. On this dense 27B the allocator hit the 5.5 bpp budget exactly:

| Format | W | A | Use | Count (after expansion) |
|---|---|---|---|---|
| NVFP4 | 4-bit (FP4, group_size=16 with per-group FP8 scale + per-tensor global) | 4-bit (dynamic) | Bulk dense MLPs + medium-sensitivity attention + most visual Linears | 349 |
| MXFP8 | 8-bit (E4M3, group_size=32 with per-group E8M0 scale) | 8-bit (dynamic) | High-sensitivity dense Linears the allocator won't risk at 4-bit | 35 |
| BF16 | 16-bit | 16-bit | Top-sensitivity dense Linears (this dense model has no router) + norms + biases + embed / lm_head / pos_embed | 112 (linear) + 352 (layer_passthrough) |
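
For intuition, the storage cost each format implies follows directly from the group sizes and scale widths above. A rough sketch (per-tensor global scales and packing metadata ignored; illustration, not PrismaQuant code):

```python
# Effective bits per parameter implied by each weight format in the table above.
def effective_bpp(weight_bits: float, scale_bits: float, group_size: int) -> float:
    return weight_bits + scale_bits / group_size

nvfp4 = effective_bpp(4, 8, 16)   # 4.50 bpp: FP4 weights + one FP8 scale per 16 weights
mxfp8 = effective_bpp(8, 8, 32)   # 8.25 bpp: E4M3 weights + one E8M0 scale per 32 weights
bf16  = 16.0                      # stored as-is
print(nvfp4, mxfp8, bf16)
```

Mixing these three per-format rates across the Linears is how the allocator lands the parameter-weighted average exactly on the 5.5 bpp target.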

The allocator pre-aggregates fused-projection siblings, qkv_proj (q/k/v share one format) and gate_up_proj (gate+up share one format), as single DP items. Previously, sibling coupling was enforced as a post-pass that inflated the achieved bpp by up to 0.5 above target; the new pre-aggregation path collapses each group into one multi-choice item, so the DP's solution is already sibling-consistent.
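
The idea can be pictured with a short sketch. This is not the PrismaQuant API; the Choice/item structure and the integer bit discretization are assumptions made for illustration. Sibling Linears are merged into one multi-choice item before the knapsack DP runs, so whatever the DP picks is automatically sibling-consistent:

```python
from dataclasses import dataclass

@dataclass
class Choice:
    fmt: str      # "nvfp4" | "mxfp8" | "bf16"
    bits: int     # discretized total bit cost of the item at this format
    dloss: float  # predicted loss increase at this format

def aggregate_siblings(groups):
    """Collapse each sibling group (e.g. q/k/v or gate/up) into one DP item:
    for a fixed format, bit costs and predicted loss increases simply add."""
    items = []
    for group in groups:                  # group = list of per-Linear choice lists
        merged = []
        for fmt_choices in zip(*group):   # same format index across all siblings
            merged.append(Choice(fmt_choices[0].fmt,
                                 sum(c.bits for c in fmt_choices),
                                 sum(c.dloss for c in fmt_choices)))
        items.append(merged)
    return items

def allocate(items, bit_budget):
    """Multi-choice knapsack DP: one format per item, minimizing total
    predicted loss increase subject to the total bit budget."""
    INF = float("inf")
    best = [INF] * (bit_budget + 1)
    best[0] = 0.0
    trace = []                            # per-item backpointers
    for choices in items:
        nxt = [INF] * (bit_budget + 1)
        back = [None] * (bit_budget + 1)
        for b, cost in enumerate(best):
            if cost == INF:
                continue
            for c in choices:
                nb = b + c.bits
                if nb <= bit_budget and cost + c.dloss < nxt[nb]:
                    nxt[nb] = cost + c.dloss
                    back[nb] = (b, c.fmt)
        trace.append(back)
        best = nxt
    # backtrack from the cheapest feasible budget
    b = min(range(bit_budget + 1), key=lambda i: best[i])
    chosen = []
    for back in reversed(trace):
        b, fmt = back[b]
        chosen.append(fmt)
    return list(reversed(chosen))
```

In the full pipeline the dloss entries come from the Fisher-probed, RTN-costed stats described below and the budget corresponds to the bpp target; the sketch only shows the shape of the DP and the sibling merge.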

Activation-aware passes applied during export

On every NVFP4 weight the exporter runs, in order:

  1. GPTQ-OBS one-shot rounding — block-wise error propagation along the group-quant structure using the calibration Hessian. Closed-form, not iterative.
  2. Closed-form per-group scale sweep — for each 16-weight NVFP4 group, enumerate grid=32 candidate scales spanning [0.5·s₀, 1.5·s₀], round each weight to its nearest codebook neighbor at every candidate scale, pick the (scale, rounding-set) configuration minimizing activation-weighted per-group MSE. Sub-second per Linear. Closed-form analog of Intel's AutoRound.
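
To make item 2 concrete, here is a single-group sketch (illustration, not the exporter code): the FP4/E2M1 codebook, the 32-point scale grid over [0.5·s₀, 1.5·s₀], and the activation-weighted MSE selection follow the description above, while the FP8 encoding of the chosen scale and the per-tensor global scale are omitted. The activation weights act_w stand in for the cached calibration statistics.

```python
import numpy as np

# Representable magnitudes of the E2M1 (FP4) format, mirrored for sign.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_CODEBOOK = np.concatenate([-FP4_LEVELS[::-1], FP4_LEVELS])

def sweep_group_scale(w, act_w, grid=32):
    """Return (scale, dequantized group) minimizing activation-weighted MSE."""
    s0 = np.abs(w).max() / FP4_LEVELS.max()          # naive RTN scale
    best_scale, best_q, best_err = s0, w.copy(), np.inf
    for s in np.linspace(0.5 * s0, 1.5 * s0, grid):
        if s == 0:
            continue
        # round each weight to its nearest codebook neighbor at this scale
        idx = np.abs(w[:, None] / s - FP4_CODEBOOK[None, :]).argmin(axis=1)
        q = s * FP4_CODEBOOK[idx]
        err = np.sum(act_w * (w - q) ** 2)           # activation-weighted group MSE
        if err < best_err:
            best_scale, best_q, best_err = s, q, err
    return best_scale, best_q

# one 16-weight NVFP4 group with per-position activation importances
w = np.random.randn(16)
act_w = np.abs(np.random.randn(16))
scale, w_hat = sweep_group_scale(w, act_w)
```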

Measured per-Linear output-MSE vs RTN baseline (family-level measurement on Qwen3.6-35B-A3B; same pipeline applied here):

| Pipeline variant | out_mse ratio vs RTN |
|---|---|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| GPTQ + scale_sweep (this artifact) | 0.33 |

Why 5.5 bpp

Before quantizing we ran the allocator across the full target sweep {4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 7.0, 8.25} on the same Fisher-probed + RTN-costed stats this artifact was built from. Thanks to allocator pre-aggregation of fused siblings + convergence-based tightening, every target lands its budget exactly — achieved = target within 0.001 bpp — so the curve below is a true Δloss-vs-bpp trade-off across the Pareto frontier, not an apples-to-oranges approximation.

| Target bpp | Achieved bpp | Predicted Δloss | NVFP4 / MXFP8 / BF16 | vs 5.5 bpp |
|---|---|---|---|---|
| 4.5 | 4.500 | 948 | 416 / 1 / 0 | +99% Δloss, −18% size |
| 4.75 | 4.750 | 704 | 373 / 12 / 32 | +48% Δloss, −14% size |
| 5.0 | 5.000 | 604 | 347 / 14 / 56 | +27% Δloss, −9% size |
| 5.25 | 5.250 | 532 | 321 / 20 / 76 | +12% Δloss, −5% size |
| 5.5 | 5.500 | 477 | 300 / 30 / 87 | ← this artifact |
| 6.0 | 6.000 | 393 | 270 / 35 / 112 | −18% Δloss, +9% size |
| 7.0 | 7.000 | 276 | 211 / 62 / 144 | −42% Δloss, +27% size |
| 8.25 | 8.249 | 180 | 152 / 73 / 192 | −62% Δloss, +50% size |

(Layer counts are at the un-expanded allocator level — per-Linear expansion inflates each count 1.0-1.4× after broadcasting sibling-group formats to members.)

Selection rationale. The Kneedle algorithm (Satopää et al.) places the knee at 5.5 bpp: on the normalized Δloss-vs-bpp curve, the farthest point below the chord from (min_bpp, max_Δloss) to (max_bpp, min_Δloss) is target 5.5. Reading across the frontier instead of committing to a single anchor like "4.75" or "6" makes the trade-off explicit:

  • Below 5.5 the loss curve steepens: 4.75 bpp saves 14% disk but pays +48% Δloss; 4.5 bpp saves 18% and pays +99%. Dense 27B can't be aggressively NVFP4'd the way MoE-A3B can, because every body Linear is active for every token — there are no "cheap" low-utilization experts to compress hard.
  • Above 5.5 the loss curve flattens: jumping to 6.0 bpp costs +9% disk for only −18% Δloss — a softer marginal gain than the 5.25→5.5 step just below the knee (+5% size for −12% Δloss).
  • At the knee, 5.5 bpp sits at the maximum distance below the chord — the point beyond which additional bit budget buys less marginal Δloss reduction than the bits already spent.
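
The knee selection itself is only a few lines. A minimal sketch over the sweep table above (illustration only; it reuses the predicted-Δloss column directly and is not the PrismaQuant selection code):

```python
import numpy as np

# (target bpp, predicted Δloss) pairs from the "Why 5.5 bpp" sweep table
bpp   = np.array([4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 7.0, 8.25])
dloss = np.array([948, 704, 604, 532, 477, 393, 276, 180], dtype=float)

# normalize both axes to the unit square
x = (bpp - bpp.min()) / (bpp.max() - bpp.min())
y = (dloss - dloss.min()) / (dloss.max() - dloss.min())

# the chord from (min_bpp, max_Δloss) to (max_bpp, min_Δloss) becomes y = 1 - x;
# the knee is the point farthest below that chord
below_chord = (1.0 - x) - y
knee = bpp[np.argmax(below_chord)]
print(knee)   # 5.5
```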

PrismaQuant's precision mix at this knee: 300 Linears at NVFP4 (bulk dense MLP + medium-sensitivity attention + visual), 30 at MXFP8 (high-sensitivity dense Linears the allocator won't risk at 4-bit), 87 at BF16 (highest-sensitivity Linears preserved lossless).


Which layers are quantized

Text body (DeltaNet linear-attention + dense MLP, 64 layers)

  • Full attention Linears (q_proj / k_proj / v_proj / o_proj): qkv siblings share one format per layer (pre-aggregated)
  • DeltaNet linear-attention Linears (in_proj_qkv / in_proj_z / in_proj_a / in_proj_b / in_proj_ba / out_proj): each Linear's format chosen independently
  • Dense MLP (gate_proj / up_proj / down_proj): gate+up siblings share one format per layer; down chosen independently

Multi-token-prediction (MTP) head

  • One full-attention + dense-MLP decoder layer at the model tail, quantized by the same per-Linear policy — so --speculative-config method=mtp drafts at the same precision profile as the body.

Visual encoder (27 blocks — Qwen3.6-VL vision tower)

  • Fisher-driven per-Linear allocation: 108 of 110 visual Linears were placed by the full DP allocator on the basis of per-Linear activation-weighted cost (8 multimodal calibration samples).
  • The remaining 2 unprobed visual Linears (patch_embed.proj edges the probe didn't tap) are stamped NVFP4 uniformly.
  • model.visual.pos_embed stays BF16 — it's a learnable Parameter, not an nn.Linear, and vLLM's compressed-tensors loader cannot consume a quantized Parameter layout.

Passthrough (unquantized)

  • lm_head — kept at BF16 because vLLM's ParallelLMHead module only accepts a single weight parameter. The allocator measures lm_head's Fisher sensitivity and would pick NVFP4 for it, but the compressed-tensors runtime rejects a compressed lm_head with KeyError: lm_head.input_global_scale. This is a vLLM runtime limitation, not a PrismaQuant design decision.
  • RMSNorm weights (all layers + MTP + visual)
  • All biases
  • embed_tokens
  • model.visual.pos_embed

Serving (vLLM only)

This artifact is only runnable via vLLM's stock compressed-tensors support — there is no transformers-native runtime path for mixed NVFP4 + MXFP8 today. vLLM 0.11+ or equivalent is required.

vllm serve rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm \
    --trust-remote-code \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  • FlashInfer NVFP4 attention is picked up automatically; set VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass to make the preference explicit.
  • MTP speculative decoding at n=3 is the measured optimum for this family on DGX Spark (n=2 leaves ~10% tok/s on the table, n=4 regresses).
  • Visual inputs work via vLLM's standard image-text-to-text chat API — no special flags.
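
To sanity-check the multimodal path once the server is up, a request through vLLM's OpenAI-compatible chat API could look like the sketch below (base URL, API key, and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.png"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```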

A full recipe with the flashinfer-cutlass backends, reasoning/tool parsers and chat-template pinning is available at spark-vllm-fresh/recipes/qwen3.6-27b.yaml.


Reproducing this artifact

Full pipeline is in the PrismaQuant repo:

  1. Sensitivity probe — streaming per-shard empirical-Fisher trace (diagonal) across body + MTP + visual Linears. Shard granularity and layer-cache budget are auto-derived from available RAM via prismaquant.autoscale. Checkpoint-level reuse (per-Linear stats are pooled across prior shard pickles) means mid-run crashes resume cleanly regardless of LAYERS_PER_SHARD changes. A minimal sketch of the Fisher accumulation follows this list.
  2. Per-(Linear, format) cost measurement — for each Linear and each candidate format, the per-group RTN error weighted by cached input activations.
  3. Multi-choice knapsack allocator — picks one format per Linear minimizing total predicted Δloss under the bit budget. Fused-sibling groups pre-aggregated into DP items to avoid post-pass overshoot. Target 5.5 bpp; achieved 5.500 bpp.
  4. Export — streams each body / visual / MTP shard, applies GPTQ + scale_sweep to its NVFP4 entries, writes the compressed-tensors format. lm_head passthrough at BF16 enforced at this stage.
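
For orientation, step 1 boils down to accumulating squared gradients per Linear over calibration batches. A minimal in-memory sketch (the real probe streams shard-by-shard and caches per-Linear stats; the forward call assumes an HF-style causal LM that returns .loss):

```python
import torch

@torch.enable_grad()
def fisher_trace_per_linear(model, calib_batches):
    """Accumulate sum(grad**2), the diagonal empirical-Fisher trace, per Linear."""
    linears = {name: m for name, m in model.named_modules()
               if isinstance(m, torch.nn.Linear)}
    trace = {name: 0.0 for name in linears}
    for batch in calib_batches:                    # dicts with input_ids, attention_mask
        model.zero_grad(set_to_none=True)
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        for name, m in linears.items():
            if m.weight.grad is not None:
                trace[name] += float((m.weight.grad.detach().float() ** 2).sum())
    return trace                                   # larger value = more sensitive Linear
```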

Wall-clock on a DGX Spark (128 GB unified memory): ~2 h cold probe + ~15 min cost + ~20 min export. Subsequent iterations at different bpp targets reuse probe + cost artifacts and complete in minutes.


Known issues / limitations

  • vLLM only at serve time. No transformers-runtime path for this precision mix today.
  • lm_head stays BF16 because vLLM's ParallelLMHead does not register the NVFP4/MXFP8 compressed-tensors schemes. The allocator measured it and would have picked NVFP4; the runtime limitation forces BF16, adding ~770 MB to the disk footprint.
  • MTP n=4 regresses on this family. Stick to n=3 unless you verify against the draft-head acceptance-rate trace.

Citation

@software{prismaquant2026,
  title        = {PrismaQuant: per-Linear sensitivity-driven mixed-precision
                  quantization for LLMs},
  author       = {Tand, Rob},
  year         = 2026,
  url          = {https://github.com/RobTand/prismaquant},
}