# sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
NVFP4-quantized text-only sibling of Qwen/Qwen3.6-27B, with the MTP (Multi-Token Prediction) head restored in bf16 so speculative decoding actually works.
## What's different from sakamakismile/Qwen3.6-27B-NVFP4
| | This repo (-Text-NVFP4-MTP) | Qwen3.6-27B-NVFP4 |
|---|---|---|
| Quantization format | modelopt (vLLM SM120 native path) | compressed-tensors |
| MTP head | Restored in bf16, working | Dropped during export → 0% draft acceptance |
| Vision tower | Stripped (text-only) | Present (kept for VLM use) |
| Suggested launch | with --speculative-config | without speculation |
The original Qwen3.6-27B-NVFP4 is left untouched so existing users (~15K downloads) are not disrupted. This is a focused text-only sibling for users who want maximum speed and don't need vision input.
## Why this exists
Two HF Discussion threads on the original repo prompted this:
- #5 — slower than official FP8 on Blackwell — root cause is the `compressed-tensors` NVFP4 path being slower than `modelopt` on Blackwell SM120; this repo uses `modelopt` natively.
- #7 — MTP not responding — `AutoModelForCausalLM.from_pretrained` does not load the MTP head, so it gets dropped during quantization, leading to 0% draft acceptance. This repo grafts the 15 `mtp.*` tensors (bf16) back into the quantized checkpoint and adds them to the quantization ignore list.
Recipe is adapted from osoleve/Qwen3.5-27B-Text-NVFP4-MTP — credit and thanks.
## Reproduce this quantization
This model was produced by the open-source lna-lab/GGUF-to-NVFP4-SM120 pipeline — Lna-Lab's production line for converting Qwen3.5 / 3.6 / Gemma 4 checkpoints into modelopt-format NVFP4 + working MTP, ready for vLLM on Blackwell SM120. The exact script is src/quantize/qwen36_27b_text_mtp.py; the 5-step MTP graft recipe is documented in docs/MTP_GRAFT_RECIPE.md.
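For orientation, the core of the graft step is: copy the bf16 `mtp.*` tensors out of the base checkpoint, write them into the quantized export, and register them in its weight map. The sketch below illustrates that idea only; the paths and shard names are placeholders, and the exact, tested procedure is the recipe and script referenced above.

```python
# Sketch of the graft idea only; paths and shard names here are placeholders.
# The tested procedure lives in docs/MTP_GRAFT_RECIPE.md in the pipeline repo.
import json
from pathlib import Path

from safetensors.torch import load_file, save_file

BASE = Path("Qwen3.6-27B-bf16")           # original bf16 checkpoint (placeholder path)
QUANT = Path("Qwen3.6-27B-Text-NVFP4")    # modelopt NVFP4 export (placeholder path)

# 1. Find the bf16 mtp.* tensors in the base checkpoint's weight map.
base_index = json.loads((BASE / "model.safetensors.index.json").read_text())
by_shard: dict[str, list[str]] = {}
for name, shard in base_index["weight_map"].items():
    if name.startswith("mtp."):            # the 15 MTP tensors, ~850 MB in bf16
        by_shard.setdefault(shard, []).append(name)

mtp_tensors = {}
for shard, names in by_shard.items():
    weights = load_file(str(BASE / shard))
    for n in names:
        mtp_tensors[n] = weights[n]

# 2. Write them into an extra shard of the quantized checkpoint and register
#    them in its index so from_pretrained / vLLM can find them.
save_file(mtp_tensors, str(QUANT / "model-mtp.safetensors"))
quant_index_path = QUANT / "model.safetensors.index.json"
quant_index = json.loads(quant_index_path.read_text())
quant_index["weight_map"].update({n: "model-mtp.safetensors" for n in mtp_tensors})
quant_index["metadata"]["total_size"] += sum(
    t.numel() * t.element_size() for t in mtp_tensors.values()
)
quant_index_path.write_text(json.dumps(quant_index, indent=2))

# 3. Keep mtp.* on the quantization ignore list in the exported quant config,
#    so nothing downstream tries to treat these bf16 tensors as NVFP4.
```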
## Quantization details
- Base: `Qwen/Qwen3.6-27B` (bf16, 27.78B params, hybrid linear-attn + full-attn, 64 layers, 1 MTP layer)
- Quantizer: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- Calibration: 20 samples from `neuralmagic/calibration` (LLM split), max_seq_len 8192
- Ignored from quantization (kept in bf16):
  - `lm_head`
  - All `model.visual.*` (vision tower) — then physically deleted in the text-only build
  - All `*linear_attn.conv1d*` (Mamba-style SSM convolutions, 48 of the 64 layers)
  - All `mtp.*` modules (the 1-layer MTP head: 15 tensors total, ~850 MB bf16)
  - Other defaults from `NVFP4_DEFAULT_CFG`: `*router*`, `*mlp.gate.*`, `*block_sparse_moe.gate*`, `*output_layer*`
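A condensed sketch of how this ignore list maps onto the `modelopt` API is shown below. It is not the production script (`src/quantize/qwen36_27b_text_mtp.py` is); the calibration prompts are placeholders, and the exact config keys should be checked against your `nvidia-modelopt` version.

```python
# Condensed sketch of the calibration/quantization step, not the exact
# production script. Calibration prompts below are placeholders.
import copy

import modelopt.torch.quantization as mtq
import torch
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Start from the NVFP4 defaults, then disable quantization for the modules
# the card lists as "kept in bf16".
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
for pattern in ("*lm_head*", "*linear_attn.conv1d*", "*mtp*", "*model.visual*"):
    cfg["quant_cfg"][pattern] = {"enable": False}

# Placeholder calibration prompts; the real run used 20 samples from
# neuralmagic/calibration (LLM split) at max_seq_len 8192.
calib_texts = ["The quick brown fox jumps over the lazy dog."] * 20

def forward_loop(m):
    # Run the calibration prompts through the model so modelopt can collect
    # activation statistics for the NVFP4 scales.
    with torch.no_grad():
        for text in calib_texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=8192).input_ids.to(m.device)
            m(ids)

model = mtq.quantize(model, cfg, forward_loop)

# Export in modelopt HF-checkpoint format (the graft step then adds mtp.* back).
export_hf_checkpoint(model, export_dir="Qwen3.6-27B-Text-NVFP4")
```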
## Usage with vLLM (Blackwell, SM120)

### Recommended production launch — 256K context · KV FP8 · MTP n=3 · max-num-seqs 2
```bash
vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
  --trust-remote-code \
  --quantization modelopt \
  --language-model-only \
  --max-model-len 262144 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
```
This is the production launch on a single RTX PRO 6000 Blackwell — full 256K context, two concurrent slots, KV FP8 for concurrency headroom. The four flags that are easy to skip but matter:
- `--max-model-len 262144` — full 256K context (Qwen3.6 trained max).
- `--kv-cache-dtype fp8` — halves KV memory; lifts max concurrency at 256K from ~4× (BF16, won't fit) to 7.0× with the same VRAM. Costs ~5–10 % per-token decode overhead, more than paid back by capacity.
- `--max-num-seqs 2` — load-bearing. `--max-num-seqs 4` plus `--kv-cache-dtype fp8` plus `--speculative-config` n=3 plus `--max-model-len 262144` will silently OOM during cuda-graph capture on this build of vLLM (0.19.1rc1).
- `num_speculative_tokens: 3` — vLLM applies the single MTP layer (`mtp_num_hidden_layers=1`) recursively three times per draft pass. Per-position acceptance is ~87 / 72 / 61 %, mean accepted length ≈ 3.0 / 4.0. The `qwen3_5_mtp` handler is internally normalized to `mtp` (the deprecated-name warning is harmless).
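To confirm speculation is actually engaged rather than silently falling back, the server's Prometheus endpoint exposes spec-decode counters. Metric names vary across vLLM versions, so the sketch below just filters for them instead of hard-coding names:

```python
# Quick check that speculative decoding is active on a running server.
# Metric names differ between vLLM versions, so filter rather than hard-code.
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in metrics.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)
```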
The `mtp.fc` weight is kept in bf16 in the safetensors (not NVFP4) — equivalent to the Lorbus-style "dequantize the fusion layer in the file" trick, applied to NVFP4 instead of AutoRound. It falls out of the `*mtp*` ignore entry in the modelopt config as a side effect, but it is load-bearing for the n=3 throughput.
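Once the server is up, any OpenAI-compatible client works. A minimal example, assuming the default host and port (adjust `base_url` if you changed them):

```python
# Minimal OpenAI-compatible client call against the vLLM server started above.
# Assumes the default endpoint http://localhost:8000; adjust if needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    messages=[{"role": "user", "content": "Summarize NVFP4 quantization in two sentences."}],
    max_tokens=256,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```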
### Smaller-context launch (16K, no fp8) — fastest single-request decode
```bash
vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --language-model-only \
  --quantization modelopt \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
```
## Verified throughput vs the family baseline (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1)
Same 256K + KV FP8 + max-num-seqs 2 production launch, temperature 0:
| Repo | Format | MTP | Single S/M/L (tok/s) | 2-parallel aggregate M/L (tok/s) | vs baseline |
|---|---|---|---|---|---|
| Qwen3.6-27B-NVFP4 (the family baseline) | compressed-tensors | ❌ | 56 / 59 / 59 | 119 / 119 | 1.0× |
| Qwen3.6-27B-Text-NVFP4-MTP (this repo) | modelopt | ✅ n=3 | 104 / 98 / 100 | 189 / 207 | 1.67× / 1.74× |
| Carnice-V2-27b-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 107 / 98 / 102 | 193 / 194 | 1.68× / 1.63× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 117 / 96 / 101 | 203 / 183 | 1.65× / 1.54× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (VLM) | modelopt | ✅ n=3 | 118 / 97 / 100 | 183 / 198 | 1.66× / 1.66× |
(S = 50-token, M = 350-token, L = 700-token decodes.)
This repo lands at 1.74× the baseline's 2-parallel aggregate throughput on long-form decodes (207 tok/s vs 119) — the gain comes from two compounding fixes:
- `modelopt` NVFP4 export — vLLM's native fast path on Blackwell SM120, vs the `compressed-tensors` slow fallback the baseline lands on.
- bf16-restored MTP head + `num_speculative_tokens=3` — the single MTP layer applied recursively, for a ~1.9× decode multiplier via speculative decoding.
KV cache capacity with the 256K + KV FP8 launch: 491,200 tokens, which vLLM reports as a maximum concurrency of 6.98× at the full 256K per-request context. Mean acceptance length is 1.93 / 2.0 at n=1 and ~3.0 / 4.0 at n=3.
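As a sanity check, the ~3.0 mean accepted length at n=3 follows from the per-position acceptance rates quoted above (≈87 / 72 / 61 %), if one assumes, as a simplification, that acceptances at successive draft positions are independent:

```python
# Back-of-envelope check of the mean accepted length at n=3.
# Assumes per-position acceptances are independent (an approximation):
# tokens emitted per verify step = 1 bonus token + expected accepted drafts.
p = [0.87, 0.72, 0.61]   # per-position draft acceptance quoted in the card

expected = 1.0           # the verified "bonus" token
running = 1.0
for pi in p:
    running *= pi        # position i is only reached if all earlier drafts were accepted
    expected += running

print(f"{expected:.2f}")  # ~2.9, consistent with the ~3.0 mean accepted length
```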
### Smaller-context single-request bench (16K, no fp8)
| Prompt | Tokens | n=1 tok/s | n=3 tok/s |
|---|---|---|---|
| Short (50 tok) | 50 | ~71 | 132.5 |
| Medium (350 tok) | 350 | ~85 | 105.5 |
| Long-form (700 tok) | 700 | ~85 | 106.5 |
GPU memory at load: ~15 GB.
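The tok/s figures above are decode throughput. A rough way to reproduce a comparable number with any OpenAI-compatible client is to time a fixed-length completion; this is an illustrative probe (it includes prefill time and is not the exact benchmark harness used for the tables):

```python
# Rough decode-throughput probe against a running server: time a fixed-length
# completion and divide emitted tokens by wall-clock time. Includes prefill,
# so it slightly understates pure decode tok/s.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    messages=[{"role": "user", "content": "Write a 700-token essay on FP4 quantization."}],
    max_tokens=700,
    temperature=0.0,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")
```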
## Hardware target
Built and tested on NVIDIA RTX PRO 6000 Blackwell (SM120). Should also work on RTX 5090 and other Blackwell consumer/workstation cards with sufficient VRAM (the model is roughly 14 GB after NVFP4 + ~850 MB of bf16 MTP/conv1d/lm_head).
## Acknowledgements
- osoleve — for the MTP-restoration recipe on Qwen3.5
- Qwen — for the base model
- The `nvidia-modelopt` team
- The reporters of Discussions #5 and #7 — for catching this cleanly
## Support the Base Model Authors
If you find this model useful, please consider supporting:
- Qwen Team (original model): Star the Qwen repo
## License
This model inherits the Apache 2.0 license.