# sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
NVFP4-quantized text-only sibling of Qwen/Qwen3.6-27B, with the MTP (Multi-Token Prediction) head restored in bf16 so speculative decoding actually works.
## What's different from sakamakismile/Qwen3.6-27B-NVFP4
| | This repo (-Text-NVFP4-MTP) | Qwen3.6-27B-NVFP4 |
|---|---|---|
| Quantization format | modelopt (vLLM SM120 native path) | compressed-tensors |
| MTP head | Restored in bf16, working | Dropped during export → 0% draft acceptance |
| Vision tower | Stripped (text-only) | Present (kept for VLM use) |
| Suggested launch | with --speculative-config | without speculation |
The original Qwen3.6-27B-NVFP4 is left untouched so existing users (~15K downloads) are not disrupted. This is a focused text-only sibling for users who want maximum speed and don't need vision input.
## Why this exists
Two HF Discussion threads on the original repo prompted this:
- #5 — slower than official FP8 on Blackwell — root cause is the `compressed-tensors` NVFP4 path being slower than `modelopt` on Blackwell SM120; this repo uses `modelopt` natively.
- #7 — MTP not responding — `AutoModelForCausalLM.from_pretrained` does not load the MTP head, so it gets dropped during quantization, leading to 0% draft acceptance. This repo grafts the 15 `mtp.*` tensors (bf16) back into the quantized checkpoint and adds them to the quantization ignore list.
Recipe is adapted from osoleve/Qwen3.5-27B-Text-NVFP4-MTP — credit and thanks.
## Reproduce this quantization
This model was produced by the open-source lna-lab/GGUF-to-NVFP4-SM120 pipeline — Lna-Lab's production line for converting Qwen3.5 / 3.6 / Gemma 4 checkpoints into modelopt-format NVFP4 + working MTP, ready for vLLM on Blackwell SM120. The exact script is src/quantize/qwen36_27b_text_mtp.py; the 5-step MTP graft recipe is documented in docs/MTP_GRAFT_RECIPE.md.
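For orientation, the core of the graft step is: copy the bf16 `mtp.*` tensors out of the base checkpoint, write them into the quantized export, and register them in its weight map. The sketch below illustrates that idea only; the paths and shard names are placeholders, and the exact, tested procedure is the recipe and script referenced above.

```python
# Sketch of the graft idea only; paths and shard names here are placeholders.
# The tested procedure lives in docs/MTP_GRAFT_RECIPE.md in the pipeline repo.
import json
from pathlib import Path

from safetensors.torch import load_file, save_file

BASE = Path("Qwen3.6-27B-bf16")           # original bf16 checkpoint (placeholder path)
QUANT = Path("Qwen3.6-27B-Text-NVFP4")    # modelopt NVFP4 export (placeholder path)

# 1. Find the bf16 mtp.* tensors in the base checkpoint's weight map.
base_index = json.loads((BASE / "model.safetensors.index.json").read_text())
by_shard: dict[str, list[str]] = {}
for name, shard in base_index["weight_map"].items():
    if name.startswith("mtp."):            # the 15 MTP tensors, ~850 MB in bf16
        by_shard.setdefault(shard, []).append(name)

mtp_tensors = {}
for shard, names in by_shard.items():
    weights = load_file(str(BASE / shard))
    for n in names:
        mtp_tensors[n] = weights[n]

# 2. Write them into an extra shard of the quantized checkpoint and register
#    them in its index so from_pretrained / vLLM can find them.
save_file(mtp_tensors, str(QUANT / "model-mtp.safetensors"))
quant_index_path = QUANT / "model.safetensors.index.json"
quant_index = json.loads(quant_index_path.read_text())
quant_index["weight_map"].update({n: "model-mtp.safetensors" for n in mtp_tensors})
quant_index["metadata"]["total_size"] += sum(
    t.numel() * t.element_size() for t in mtp_tensors.values()
)
quant_index_path.write_text(json.dumps(quant_index, indent=2))

# 3. Keep mtp.* on the quantization ignore list in the exported quant config,
#    so nothing downstream tries to treat these bf16 tensors as NVFP4.
```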
## Quantization details
- Base: `Qwen/Qwen3.6-27B` (bf16, 27.78B params, hybrid linear-attn + full-attn, 64 layers, 1 MTP layer)
- Quantizer: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- Calibration: 20 samples from `neuralmagic/calibration` (LLM split), max_seq_len 8192
- Ignored from quantization (kept in bf16):
  - `lm_head`
  - All `model.visual.*` (vision tower) — then physically deleted in the text-only build
  - All `*linear_attn.conv1d*` (Mamba-style SSM convolutions, 48 of the 64 layers)
  - All `mtp.*` modules (the 1-layer MTP head: 15 tensors total, ~850 MB bf16)
  - Other defaults from `NVFP4_DEFAULT_CFG`: `*router*`, `*mlp.gate.*`, `*block_sparse_moe.gate*`, `*output_layer*`
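A condensed sketch of how this ignore list maps onto the `modelopt` API is shown below. It is not the production script (`src/quantize/qwen36_27b_text_mtp.py` is); the calibration prompts are placeholders, and the exact config keys should be checked against your `nvidia-modelopt` version.

```python
# Condensed sketch of the calibration/quantization step, not the exact
# production script. Calibration prompts below are placeholders.
import copy

import modelopt.torch.quantization as mtq
import torch
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Start from the NVFP4 defaults, then disable quantization for the modules
# the card lists as "kept in bf16".
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
for pattern in ("*lm_head*", "*linear_attn.conv1d*", "*mtp*", "*model.visual*"):
    cfg["quant_cfg"][pattern] = {"enable": False}

# Placeholder calibration prompts; the real run used 20 samples from
# neuralmagic/calibration (LLM split) at max_seq_len 8192.
calib_texts = ["The quick brown fox jumps over the lazy dog."] * 20

def forward_loop(m):
    # Run the calibration prompts through the model so modelopt can collect
    # activation statistics for the NVFP4 scales.
    with torch.no_grad():
        for text in calib_texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=8192).input_ids.to(m.device)
            m(ids)

model = mtq.quantize(model, cfg, forward_loop)

# Export in modelopt HF-checkpoint format (the graft step then adds mtp.* back).
export_hf_checkpoint(model, export_dir="Qwen3.6-27B-Text-NVFP4")
```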
## Usage with vLLM (Blackwell, SM120)

### Recommended production launch — 256K context · KV FP8 · MTP n=3 · max-num-seqs 2
```bash
vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
  --trust-remote-code \
  --quantization modelopt \
  --language-model-only \
  --max-model-len 262144 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
```
This is the production launch on a single RTX PRO 6000 Blackwell — full 256K context, two concurrent slots, KV FP8 for concurrency headroom. The four flags that are easy to skip but matter:
- `--max-model-len 262144` — full 256K context (Qwen3.6 trained max).
- `--kv-cache-dtype fp8` — halves KV memory; lifts max concurrency at 256K from ~4× (BF16, won't fit) to 7.0× with the same VRAM. Costs ~5–10 % per-token decode overhead, more than paid back by capacity.
- `--max-num-seqs 2` — load-bearing. `--max-num-seqs 4` plus `--kv-cache-dtype fp8` plus `--speculative-config` n=3 plus `--max-model-len 262144` will silently OOM during cuda-graph capture on this build of vLLM (0.19.1rc1).
- `num_speculative_tokens: 3` — vLLM applies the single MTP layer (`mtp_num_hidden_layers=1`) recursively three times per draft pass. Per-position acceptance is ~87 / 72 / 61 %, mean accepted length ≈ 3.0 / 4.0. The `qwen3_5_mtp` handler is internally normalized to `mtp` (the deprecated-name warning is harmless).
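To confirm speculation is actually engaged rather than silently falling back, the server's Prometheus endpoint exposes spec-decode counters. Metric names vary across vLLM versions, so the sketch below just filters for them instead of hard-coding names:

```python
# Quick check that speculative decoding is active on a running server.
# Metric names differ between vLLM versions, so filter rather than hard-code.
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in metrics.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)
```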
The `mtp.fc` weight is kept in bf16 in the safetensors (not NVFP4) — equivalent to the Lorbus-style "dequantize the fusion layer in the file" trick, applied to NVFP4 instead of AutoRound. It falls out of the `*mtp*` ignore entry in the modelopt config as a side effect, but it is load-bearing for the n=3 throughput.
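Once the server is up, any OpenAI-compatible client works. A minimal example, assuming the default host and port (adjust `base_url` if you changed them):

```python
# Minimal OpenAI-compatible client call against the vLLM server started above.
# Assumes the default endpoint http://localhost:8000; adjust if needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    messages=[{"role": "user", "content": "Summarize NVFP4 quantization in two sentences."}],
    max_tokens=256,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```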
### Smaller-context launch (16K, no fp8) — fastest single-request decode
```bash
vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --language-model-only \
  --quantization modelopt \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
```
## Verified throughput vs the family baseline (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1)
Same 256K + KV FP8 + max-num-seqs 2 production launch, temperature 0:
| Repo | Format | MTP | Single S/M/L (tok/s) | 2-parallel aggregate M/L (tok/s) | vs baseline |
|---|---|---|---|---|---|
| Qwen3.6-27B-NVFP4 (the family baseline) | compressed-tensors | ❌ | 56 / 59 / 59 | 119 / 119 | 1.0× |
| Qwen3.6-27B-Text-NVFP4-MTP (this repo) | modelopt | ✅ n=3 | 104 / 98 / 100 | 189 / 207 | 1.67× / 1.74× |
| Carnice-V2-27b-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 107 / 98 / 102 | 193 / 194 | 1.68× / 1.63× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 117 / 96 / 101 | 203 / 183 | 1.65× / 1.54× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (VLM) | modelopt | ✅ n=3 | 118 / 97 / 100 | 183 / 198 | 1.66× / 1.66× |
(S = 50-token, M = 350-token, L = 700-token decodes.)
This repo lands at 1.74× the baseline's 2-parallel aggregate throughput on long-form decodes (207 tok/s vs 119) — the gain comes from two compounding fixes:
- `modelopt` NVFP4 export — vLLM's native fast path on Blackwell SM120, vs the `compressed-tensors` slow fallback the baseline lands on.
- bf16-restored MTP head + `num_speculative_tokens=3` — the single MTP layer applied recursively, for a ~1.9× decode multiplier via speculative decoding.
KV cache capacity with the 256K + KV FP8 launch: 491,200 tokens, which vLLM reports as a maximum concurrency of 6.98× at the full 256K per-request context. Mean acceptance length is 1.93 / 2.0 at n=1 and ~3.0 / 4.0 at n=3.
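As a sanity check, the ~3.0 mean accepted length at n=3 follows from the per-position acceptance rates quoted above (≈87 / 72 / 61 %), if one assumes, as a simplification, that acceptances at successive draft positions are independent:

```python
# Back-of-envelope check of the mean accepted length at n=3.
# Assumes per-position acceptances are independent (an approximation):
# tokens emitted per verify step = 1 bonus token + expected accepted drafts.
p = [0.87, 0.72, 0.61]   # per-position draft acceptance quoted in the card

expected = 1.0           # the verified "bonus" token
running = 1.0
for pi in p:
    running *= pi        # position i is only reached if all earlier drafts were accepted
    expected += running

print(f"{expected:.2f}")  # ~2.9, consistent with the ~3.0 mean accepted length
```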
### Smaller-context single-request bench (16K, no fp8)
| Prompt | Tokens | n=1 tok/s | n=3 tok/s |
|---|---|---|---|
| Short (50 tok) | 50 | ~71 | 132.5 |
| Medium (350 tok) | 350 | ~85 | 105.5 |
| Long-form (700 tok) | 700 | ~85 | 106.5 |
GPU memory at load: ~15 GB.
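The tok/s figures above are decode throughput. A rough way to reproduce a comparable number with any OpenAI-compatible client is to time a fixed-length completion; this is an illustrative probe (it includes prefill time and is not the exact benchmark harness used for the tables):

```python
# Rough decode-throughput probe against a running server: time a fixed-length
# completion and divide emitted tokens by wall-clock time. Includes prefill,
# so it slightly understates pure decode tok/s.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    messages=[{"role": "user", "content": "Write a 700-token essay on FP4 quantization."}],
    max_tokens=700,
    temperature=0.0,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")
```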
## Hardware target
Built and tested on NVIDIA RTX PRO 6000 Blackwell (SM120). Should also work on RTX 5090 and other Blackwell consumer/workstation cards with sufficient VRAM (the model is roughly 14 GB after NVFP4 + ~850 MB of bf16 MTP/conv1d/lm_head).
## Acknowledgements
- osoleve — for the MTP-restoration recipe on Qwen3.5
- Qwen — for the base model
- The `nvidia-modelopt` team
- The reporters of Discussions #5 and #7 — for catching this cleanly
## Support the Base Model Authors
If you find this model useful, please consider supporting:
- Qwen Team (original model): Star the Qwen repo
## License
This model inherits the Apache 2.0 license.