sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

NVFP4-quantized text-only sibling of Qwen/Qwen3.6-27B, with the MTP (Multi-Token Prediction) head restored in bf16 so speculative decoding actually works.

What's different from sakamakismile/Qwen3.6-27B-NVFP4

|  | This repo (-Text-NVFP4-MTP) | Qwen3.6-27B-NVFP4 |
| --- | --- | --- |
| Quantization format | modelopt (vLLM SM120 native path) | compressed-tensors |
| MTP head | Restored in bf16, working | Dropped during export → 0% draft acceptance |
| Vision tower | Stripped (text-only) | Present (kept for VLM use) |
| Suggested launch | with --speculative-config | without speculation |

The original Qwen3.6-27B-NVFP4 is left untouched so existing users (~15K downloads) are not disrupted. This is a focused text-only sibling for users who want maximum speed and don't need vision input.

Why this exists

Two HF Discussion threads on the original repo prompted this:

  • #5 — slower than official FP8 on Blackwell — root cause is the compressed-tensors NVFP4 path being slower than modelopt on Blackwell SM120; this repo uses modelopt natively.
  • #7 — MTP not responding: AutoModelForCausalLM.from_pretrained does not load the MTP head, so it was dropped during quantization, leading to 0% draft acceptance. This repo grafts the 15 mtp.* tensors (bf16) back into the quantized checkpoint and adds them to the quantization ignore list; a sketch of the graft follows this list.
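
The graft itself is conceptually simple. Below is a minimal sketch of the idea, not the exact pipeline script: the shard file names are hypothetical, the tensor-name prefix depends on the checkpoint layout, and a real run also has to update the safetensors index and the quantization ignore list in the exported config (the full 5-step recipe is in docs/MTP_GRAFT_RECIPE.md).

# Hypothetical shard names; adjust to the actual checkpoint layout.
from safetensors.torch import load_file, save_file

base = load_file("Qwen3.6-27B-bf16/model-00012-of-00012.safetensors")         # bf16 source
quant = load_file("Qwen3.6-27B-Text-NVFP4/model-00004-of-00004.safetensors")  # NVFP4 target

# Copy the mtp.* tensors back, keeping them in bf16.
mtp_tensors = {k: v for k, v in base.items() if ".mtp." in k or k.startswith("mtp.")}
print(f"grafting {len(mtp_tensors)} tensors")  # expect 15
quant.update(mtp_tensors)

save_file(quant, "Qwen3.6-27B-Text-NVFP4-MTP/model-00004-of-00004.safetensors",
          metadata={"format": "pt"})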

Recipe is adapted from osoleve/Qwen3.5-27B-Text-NVFP4-MTP — credit and thanks.

Reproduce this quantization

This model was produced by the open-source lna-lab/GGUF-to-NVFP4-SM120 pipeline — Lna-Lab's production line for converting Qwen3.5 / 3.6 / Gemma 4 checkpoints into modelopt-format NVFP4 + working MTP, ready for vLLM on Blackwell SM120. The exact script is src/quantize/qwen36_27b_text_mtp.py; the 5-step MTP graft recipe is documented in docs/MTP_GRAFT_RECIPE.md.

Quantization details

  • Base: Qwen/Qwen3.6-27B (bf16, 27.78B params, hybrid linear-attn + full-attn, 64 layers, 1 MTP layer)
  • Quantizer: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
  • Calibration: 20 samples from neuralmagic/calibration (LLM split), max_seq_len 8192
  • Ignored from quantization (kept in bf16):
    • lm_head
    • All model.visual.* (vision tower) — then physically deleted in the text-only build
    • All *linear_attn.conv1d* (Mamba-style SSM convolutions, 48 of the 64 layers)
    • All mtp.* modules (the 1-layer MTP head: 15 tensors total, ~850 MB bf16)
    • Other defaults from NVFP4_DEFAULT_CFG: *router*, *mlp.gate.*, *block_sparse_moe.gate*, *output_layer*
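
In modelopt terms, the ignore list above becomes disabled quantizer patterns layered on top of NVFP4_DEFAULT_CFG. The sketch below only shows the shape of the call, assuming the modelopt Python API (mtq.quantize) and a hypothetical calibration iterable; the exact script used for this repo is src/quantize/qwen36_27b_text_mtp.py in the pipeline repo.

# Sketch only: calibration loading and the exact wildcard patterns are assumptions.
import copy
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-27B", torch_dtype="bfloat16", trust_remote_code=True
).cuda()

cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
# Keep these modules in bf16, on top of the NVFP4_DEFAULT_CFG defaults:
for pattern in ("*lm_head*", "*visual*", "*linear_attn.conv1d*", "*mtp*"):
    cfg["quant_cfg"][pattern] = {"enable": False}

def forward_loop(m):
    # 20 samples from neuralmagic/calibration (LLM split), max_seq_len 8192
    for batch in calib_batches:  # hypothetical iterable of tokenized tensors
        m(batch.to(m.device))

model = mtq.quantize(model, cfg, forward_loop)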

Usage with vLLM (Blackwell, SM120)

Recommended production launch — 256K context · KV FP8 · MTP n=3 · max-num-seqs 2

vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
    --trust-remote-code \
    --quantization modelopt \
    --language-model-only \
    --max-model-len 262144 \
    --max-num-seqs 2 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

This is the production launch on a single RTX PRO 6000 Blackwell — full 256K context, two concurrent slots, KV FP8 for concurrency headroom. The four flags that are easy to skip but matter:

  • --max-model-len 262144 — full 256K context (Qwen3.6 trained max).
  • --kv-cache-dtype fp8 — halves KV memory; at 256K it lifts max concurrency from ~4× (BF16, which won't fit) to 7.0× in the same VRAM. The ~5–10 % per-token decode overhead is more than paid back by the extra capacity.
  • --max-num-seqs 2 — load-bearing. --max-num-seqs 4 plus --kv-cache-dtype fp8 plus --speculative-config n=3 plus --max-model-len 262144 will silently OOM during cuda-graph capture on this build of vLLM (0.19.1rc1).
  • num_speculative_tokens: 3 — vLLM applies the single MTP layer (mtp_num_hidden_layers=1) recursively three times per draft pass. Per-position acceptance is ~87 / 72 / 61 %, for a mean accepted length of ≈3.0 out of a possible 4.0 tokens per step (a quick arithmetic check follows this list). The qwen3_5_mtp method name is internally normalized to mtp, so the deprecated-name warning is harmless.
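
That mean accepted length checks out with simple arithmetic, treating the per-position acceptance rates as roughly independent (an assumption made only for this back-of-the-envelope check; it is not how vLLM computes its reported metric):

# Per-position draft acceptance from the bullet above (~87 / 72 / 61 %).
p1, p2, p3 = 0.87, 0.72, 0.61

# Each decode step yields 1 target token plus the consecutively accepted drafts,
# so the expected tokens per step is:
expected = 1 + p1 + p1 * p2 + p1 * p2 * p3
print(f"{expected:.2f}")  # ≈ 2.88, consistent with the quoted ≈3.0 of a possible 4.0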

The mtp.fc weight is kept in bf16 in the safetensors (not NVFP4) — equivalent to the Lorbus-style "dequantize the fusion layer in the file" trick applied to NVFP4 instead of AutoRound. This is a side effect of the *mtp* ignore entry in the modelopt config, but it is load-bearing for the n=3 throughput.

Smaller-context launch (16K, no fp8) — fastest single-request decode

vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --language-model-only \
    --quantization modelopt \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
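
Either launch exposes the OpenAI-compatible API on port 8000 by default. Below is a minimal client sketch that also prints a rough tokens-per-second figure; the prompt, port, and throughput accounting are illustrative, not the harness used for the numbers below.

# Minimal OpenAI-compatible client against the vLLM server started above.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    messages=[{"role": "user", "content": "Summarize NVFP4 quantization in three sentences."}],
    max_tokens=350,
    temperature=0,
)
elapsed = time.perf_counter() - start

out = resp.usage.completion_tokens
print(resp.choices[0].message.content)
print(f"{out} tokens in {elapsed:.1f}s ≈ {out / elapsed:.0f} tok/s (includes prefill)")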

Verified throughput vs the family baseline (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1)

Same 256K + KV FP8 + max-num-seqs 2 production launch, T = 0:

| Repo | Format | MTP | Single (S/M/L, tok/s) | 2-parallel agg (M/L, tok/s) | vs baseline |
| --- | --- | --- | --- | --- | --- |
| Qwen3.6-27B-NVFP4 (the family baseline) | compressed-tensors | — | 56 / 59 / 59 | 119 / 119 | 1.0× |
| Qwen3.6-27B-Text-NVFP4-MTP (this repo) | modelopt | ✅ n=3 | 104 / 98 / 100 | 189 / 207 | 1.67× / 1.74× |
| Carnice-V2-27b-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 107 / 98 / 102 | 193 / 194 | 1.68× / 1.63× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 117 / 96 / 101 | 203 / 183 | 1.65× / 1.54× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (VLM) | modelopt | ✅ n=3 | 118 / 97 / 100 | 183 / 198 | 1.66× / 1.66× |

(S = 50-token, M = 350-token, L = 700-token decodes.)

This repo lands at 1.74× the baseline's 2-parallel aggregate throughput on long-form decodes (207 tok/s vs 119) — the gain comes from two compounding fixes:

  • modelopt NVFP4 export — vLLM's native fast path on Blackwell SM120, vs the compressed-tensors slow fallback the baseline lands on.
  • bf16-restored MTP head + num_speculative_tokens=3 — single MTP layer applied recursively for ~1.9× decode multiplier via speculative decoding.

KV cache capacity at 256K + fp8: 491,200 tokens → max concurrency 6.98× at the full 256K per-request context. Mean acceptance length: 1.93 of a possible 2.0 at n=1, ~3.0 of 4.0 at n=3.
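
To verify the acceptance numbers on your own hardware, vLLM exposes speculative-decoding counters on its Prometheus /metrics endpoint. The metric names vary between vLLM versions, so the substrings below are assumptions to grep for rather than a stable API:

# Compute overall draft acceptance from the vLLM /metrics endpoint.
# Metric names differ across versions; the substrings here are assumptions.
import urllib.request

text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()

def metric_sum(substring: str) -> float:
    total = 0.0
    for line in text.splitlines():
        if substring in line and not line.startswith("#"):
            total += float(line.rsplit(" ", 1)[-1])
    return total

accepted = metric_sum("spec_decode_num_accepted_tokens")
drafted = metric_sum("spec_decode_num_draft_tokens")
if drafted:
    print(f"overall draft acceptance: {accepted / drafted:.1%}")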

Smaller-context single-request bench (16K, no fp8)

| Prompt | Tokens | n=1 tok/s | n=3 tok/s |
| --- | --- | --- | --- |
| Short (50 tok) | 50 | ~71 | 132.5 |
| Medium (350 tok) | 350 | ~85 | 105.5 |
| Long-form (700 tok) | 700 | ~85 | 106.5 |

GPU memory at load: ~15 GB.

Hardware target

Built and tested on NVIDIA RTX PRO 6000 Blackwell (SM120). Should also work on RTX 5090 and other Blackwell consumer/workstation cards with sufficient VRAM (the model is roughly 14 GB after NVFP4 + ~850 MB of bf16 MTP/conv1d/lm_head).
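
Those footprint figures are consistent with simple arithmetic (assuming ~4 bits per weight for the NVFP4 layers, ignoring block-scale overhead, plus the ~850 MB kept in bf16):

# Back-of-the-envelope VRAM check; the 4-bit and ~850 MB figures come from the card.
params = 27.78e9
nvfp4_gb = params * 4 / 8 / 1e9   # ≈ 13.9 GB of 4-bit weights ("roughly 14 GB")
bf16_gb = 0.85                    # MTP head / conv1d / lm_head kept in bf16
print(f"≈ {nvfp4_gb + bf16_gb:.1f} GB")  # ≈ 14.7 GB, close to the ~15 GB observed at load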

Acknowledgements

  • osoleve — for the MTP-restoration recipe on Qwen3.5
  • Qwen — for the base model
  • nvidia-modelopt team
  • The reporters of Discussions #5 and #7 — for catching this cleanly

Support the Base Model Authors

If you find this model useful, please consider supporting:

License

This model inherits the Apache 2.0 license.
