# Qwen3.6-27B-NVFP4 Model Card
## Overview
This is an NVFP4-quantized version of Qwen/Qwen3.6-27B, produced by Lna-Lab using custom Blackwell NVFP4 GEMM kernels.

It is the first NVFP4 release in our Qwen3.6-27B family: compressed-tensors format, vision tower preserved, no MTP head. It has seen ~35K downloads since release. For new deployments we strongly recommend the faster siblings below unless you have a reason to stay on compressed-tensors.
**Key Compression Stats:**
- Original Size: 55.6 GB
- Quantized Size: 19.7 GB (0.35× of the original size, ≈2.8× smaller)
- Vision Tower: Preserved in BF16
- Hardware: Runs on a single NVIDIA Blackwell GPU
## Faster siblings: modelopt + MTP format
Verified throughput with the same production launch configuration (1× RTX PRO 6000 Blackwell, vLLM 0.19.1rc1, 256K context, KV FP8, max-num-seqs 2):
| Repo | Format | MTP | Single tok/s (medium / long) | 2-parallel agg tok/s (medium / long) | vs this repo |
|---|---|---|---|---|---|
| Qwen3.6-27B-NVFP4 (this) | compressed-tensors | ❌ | 58 / 58 | 119 / 119 | 1.0× (baseline) |
| Qwen3.6-27B-Text-NVFP4-MTP | modelopt | ✅ n=3 | 98 / 100 | 189 / 207 | 1.67× / 1.74× |
| Carnice-V2-27b-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 98 / 102 | 193 / 194 | 1.68× / 1.63× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 96 / 101 | 203 / 183 | 1.65× / 1.54× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (VLM) | modelopt | ✅ n=3 | 97 / 100 | 183 / 198 | 1.66× / 1.66× |
The 1.6–1.7× jump comes from two compounding fixes that this older repo doesn't have:
1. **modelopt NVFP4 export**: vLLM's native fast path on Blackwell SM120, versus the `compressed-tensors` slow fallback this repo lands on.
2. **bf16-restored MTP head with `num_speculative_tokens=3`**: the single MTP layer is applied recursively three times per draft pass, lifting decode throughput by ≈1.9× via speculative decoding.
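For intuition on where a ≈1.9× decode lift can come from, here is a back-of-the-envelope model of speculative decoding throughput. The acceptance probability `p` is a hypothetical illustration, not a value measured in these benchmarks.

```python
# Back-of-the-envelope speculative decoding model (illustrative only).
# Assumption: each of the n drafted tokens is accepted independently with
# probability p; every verify pass also yields one token from the target model.

def expected_tokens_per_step(n: int, p: float) -> float:
    accepted = sum(p**k for k in range(1, n + 1))  # E[accepted draft prefix]
    return accepted + 1.0                          # +1 from the verify pass itself

for p in (0.6, 0.7, 0.8):
    print(f"p={p}: ~{expected_tokens_per_step(3, p):.2f} tokens per target step")
# p≈0.7 gives ~2.5 tokens per step; after draft and verification overhead, a
# net ~1.9x decode speedup (58 -> ~100 tok/s) is in a plausible range.
```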
This repo is left untouched so existing setups (≈ 35K downloads) are not disrupted, but if you can switch quantization formats, the modelopt + MTP siblings are clearly faster.
## NVFP4 Quantization Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-27B |
| Quantization Scheme | NVFP4 (W4A4 — weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor + blackwell-geforce-nvfp4-gemm |
| Size | 19.7 GB (single safetensors shard) |
| Requirements | NVIDIA Blackwell GPU (SM 120), vLLM >= 0.19 |
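To make the W4A4 layout concrete, the sketch below shows the shape of NVFP4-style block quantization: values are snapped to the 4-bit E2M1 grid with one scale per 16-element block. It is illustrative only; the real kernels additionally quantize the block scales to FP8 and apply a global FP32 scale, both omitted here.

```python
import numpy as np

# Positive half of the FP4 (E2M1) value grid; the format is symmetric.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])

def nvfp4_block_quantize(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one 16-element block: per-block scale + nearest FP4 grid value."""
    assert block.shape == (16,)
    scale = float(np.abs(block).max()) / 6.0  # map block max to largest FP4 value
    scale = max(scale, 1e-12)                 # guard against all-zero blocks
    idx = np.abs(block[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx], scale               # 4-bit codes + per-block scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s = nvfp4_block_quantize(w)
print("max abs reconstruction error:", np.abs(w - q * s).max())
```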
### Quantization Recipe
```yaml
QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
  scheme: NVFP4
```
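A minimal sketch of how a recipe like this is typically applied through llm-compressor's `oneshot` entry point. The calibration dataset and sample count below are illustrative assumptions, not the settings used for this release; NVFP4 activation scales generally require some calibration data.

```python
# Sketch only: applying the recipe above with llm-compressor's oneshot API.
# Calibration dataset/sample count are hypothetical, not this release's settings.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*visual.*",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
    ],
)

oneshot(
    model="Qwen/Qwen3.6-27B",
    recipe=recipe,
    dataset="open_platypus",       # illustrative calibration set
    num_calibration_samples=512,   # illustrative
    max_seq_length=2048,
    output_dir="Qwen3.6-27B-NVFP4",
)
```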
### What's Quantized vs. Preserved

- **Quantized (NVFP4):** all `Linear` layers in the language model
- **Kept in BF16:** `lm_head`, all vision layers (`model.visual.*`), MLP gates
## Quick Start (vLLM)
### Production-config launch (256K context, KV FP8, max-num-seqs 2)

```bash
vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3
```
### Minimal launch (smaller context)

```bash
vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code
```
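Once either server is up, you can query it through vLLM's OpenAI-compatible endpoint. A minimal example, assuming the default bind address (adjust `base_url` if you changed `--host`/`--port`):

```python
# Minimal chat request against the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 quantization in two sentences."}],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)
```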
## Verified throughput (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1)
Production-config benchmark (256K context, KV FP8, max-num-seqs 2), temperature 0:
| Prompt | Single tok/s | 2-parallel agg tok/s | 2-parallel per-request tok/s |
|---|---|---|---|
| Short (50 tok) | 56.3 | 108.7 | 54.9 |
| Medium (350 tok) | 58.6 | 118.9 | 59.6 |
| Long-form (700 tok) | 58.7 | 118.8 | 59.4 |
KV cache capacity at 256K context with FP8 KV: 529,984 tokens, for a reported maximum concurrency of 7.91× for full 256K-token requests (highest in the family: with no MTP draft model, more of the memory budget goes to KV cache). Available KV memory: 64.79 GiB on a 96 GB Blackwell card.
These numbers are the baseline that the modelopt + MTP siblings are measured against (see table at the top). Single-request decode is ~58 tok/s here vs ~100 tok/s on the MTP siblings; 2-parallel aggregate is ~119 tok/s vs ~190–207 tok/s.
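A rough way to reproduce the single-stream decode numbers is to stream a completion and time the generated tokens. A minimal sketch; the endpoint is assumed to be the default, and counting one token per streamed chunk is an approximation:

```python
# Crude single-stream decode tok/s check against the running vLLM server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-NVFP4",
    messages=[{"role": "user", "content": "Write a 300-word overview of FP4 formats."}],
    temperature=0.0,
    max_tokens=512,
    stream=True,
)

tokens, start = 0, None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if start is None:
            start = time.perf_counter()  # start timing at first token (skip prefill)
        tokens += 1

if start is not None and tokens > 1:
    # Inter-token rate: (tokens after the first) / elapsed since the first token.
    print(f"~{(tokens - 1) / (time.perf_counter() - start):.1f} tok/s decode")
```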
## Tested Environment
| Component | Version |
|---|---|
| vLLM | 0.19.1rc1+ (nightly) |
| Transformers | 5.5.4 |
| PyTorch | 2.11.0+cu130 |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
## Credits
- Original Model: Qwen Team (Alibaba Group)
- NVFP4 Quantization: Lna-Lab
- Blackwell NVFP4 GEMM Kernels: lna-lab/blackwell-geforce-nvfp4-gemm
- Quantization Framework: vllm-project/llm-compressor
## Base Model Information (Qwen3.6-27B)
### Model Specifications
- Type: Causal Language Model with Vision Encoder
- Parameters: 27B
- Training Stages: Pre-training & Post-training
- Context Length: 262,144 tokens (natively), extensible up to 1,010,000 tokens
- Architecture:
- Hidden Dimension: 5,120
- Layers: 64
- Layout: 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
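The layout expression expands as in the sketch below, which is just an illustration of how the 64 layers interleave; the layer names are descriptive labels, not the model's actual module names.

```python
# Expand the documented hybrid layout:
# 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
block = ["gated_deltanet"] * 3 + ["gated_attention"]  # descriptive labels only
layers = block * 16

assert len(layers) == 64                     # matches "Layers: 64"
assert layers.count("gated_attention") == 16  # one attention layer per block
print(layers[:8])  # the first two blocks' worth of layer types
```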
### Key Capabilities

**Agentic Coding:**
- Frontend workflows and repository-level reasoning
- Handling complex software engineering tasks
- Enhanced precision in code generation
**Thinking Preservation:**
- Retains reasoning context from historical messages
- Streamlines iterative development
- Reduces computational overhead
## License
Apache 2.0 - See LICENSE