# Qwen3.6-27B-NVFP4 Model Card
## Overview
This is an NVFP4-quantized version of Qwen/Qwen3.6-27B, produced by Lna-Lab using custom Blackwell NVFP4 GEMM kernels.

It is the first NVFP4 release in our Qwen3.6-27B family: compressed-tensors format, vision tower preserved, no MTP head. It has seen ~35K downloads since release. For new deployments we strongly recommend the faster siblings below unless you have a reason to stay on compressed-tensors.
**Key Compression Stats:**
- Original Size: 55.6 GB
- Quantized Size: 19.7 GB (0.35× of the original size, ≈2.8× smaller)
- Vision Tower: Preserved in BF16
- Hardware: Runs on a single NVIDIA Blackwell GPU
## Faster siblings: modelopt + MTP format
Verified throughput with the same production launch configuration (1× RTX PRO 6000 Blackwell, vLLM 0.19.1rc1, 256K context, KV FP8, max-num-seqs 2):
| Repo | Format | MTP | Single tok/s (medium / long) | 2-parallel agg tok/s (medium / long) | vs this repo |
|---|---|---|---|---|---|
| Qwen3.6-27B-NVFP4 (this) | compressed-tensors | ❌ | 58 / 58 | 119 / 119 | 1.0× (baseline) |
| Qwen3.6-27B-Text-NVFP4-MTP | modelopt | ✅ n=3 | 98 / 100 | 189 / 207 | 1.67× / 1.74× |
| Carnice-V2-27b-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 98 / 102 | 193 / 194 | 1.68× / 1.63× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 96 / 101 | 203 / 183 | 1.65× / 1.54× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (VLM) | modelopt | ✅ n=3 | 97 / 100 | 183 / 198 | 1.66× / 1.66× |
The 1.6–1.7× jump comes from two compounding fixes that this older repo doesn't have:
1. **modelopt NVFP4 export**: vLLM's native fast path on Blackwell SM120, versus the `compressed-tensors` slow fallback this repo lands on.
2. **bf16-restored MTP head with `num_speculative_tokens=3`**: the single MTP layer is applied recursively three times per draft pass, lifting decode throughput by ≈1.9× via speculative decoding.
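For intuition on where a ≈1.9× decode lift can come from, here is a back-of-the-envelope model of speculative decoding throughput. The acceptance probability `p` is a hypothetical illustration, not a value measured in these benchmarks.

```python
# Back-of-the-envelope speculative decoding model (illustrative only).
# Assumption: each of the n drafted tokens is accepted independently with
# probability p; every verify pass also yields one token from the target model.

def expected_tokens_per_step(n: int, p: float) -> float:
    accepted = sum(p**k for k in range(1, n + 1))  # E[accepted draft prefix]
    return accepted + 1.0                          # +1 from the verify pass itself

for p in (0.6, 0.7, 0.8):
    print(f"p={p}: ~{expected_tokens_per_step(3, p):.2f} tokens per target step")
# p≈0.7 gives ~2.5 tokens per step; after draft and verification overhead, a
# net ~1.9x decode speedup (58 -> ~100 tok/s) is in a plausible range.
```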
This repo is left untouched so existing setups (≈ 35K downloads) are not disrupted, but if you can switch quantization formats, the modelopt + MTP siblings are clearly faster.
## NVFP4 Quantization Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-27B |
| Quantization Scheme | NVFP4 (W4A4 — weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor + blackwell-geforce-nvfp4-gemm |
| Size | 19.7 GB (single safetensors shard) |
| Requirements | NVIDIA Blackwell GPU (SM 120), vLLM >= 0.19 |
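To make the W4A4 layout concrete, the sketch below shows the shape of NVFP4-style block quantization: values are snapped to the 4-bit E2M1 grid with one scale per 16-element block. It is illustrative only; the real kernels additionally quantize the block scales to FP8 and apply a global FP32 scale, both omitted here.

```python
import numpy as np

# Positive half of the FP4 (E2M1) value grid; the format is symmetric.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])

def nvfp4_block_quantize(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one 16-element block: per-block scale + nearest FP4 grid value."""
    assert block.shape == (16,)
    scale = float(np.abs(block).max()) / 6.0  # map block max to largest FP4 value
    scale = max(scale, 1e-12)                 # guard against all-zero blocks
    idx = np.abs(block[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx], scale               # 4-bit codes + per-block scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s = nvfp4_block_quantize(w)
print("max abs reconstruction error:", np.abs(w - q * s).max())
```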
### Quantization Recipe
```yaml
QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
  scheme: NVFP4
```
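A minimal sketch of how a recipe like this is typically applied through llm-compressor's `oneshot` entry point. The calibration dataset and sample count below are illustrative assumptions, not the settings used for this release; NVFP4 activation scales generally require some calibration data.

```python
# Sketch only: applying the recipe above with llm-compressor's oneshot API.
# Calibration dataset/sample count are hypothetical, not this release's settings.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*visual.*",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
    ],
)

oneshot(
    model="Qwen/Qwen3.6-27B",
    recipe=recipe,
    dataset="open_platypus",       # illustrative calibration set
    num_calibration_samples=512,   # illustrative
    max_seq_length=2048,
    output_dir="Qwen3.6-27B-NVFP4",
)
```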
### What's Quantized vs. Preserved

- **Quantized (NVFP4):** all `Linear` layers in the language model
- **Kept in BF16:** `lm_head`, all vision layers (`model.visual.*`), MLP gates
## Quick Start (vLLM)
### Production-config launch (256K context, KV FP8, max-num-seqs 2)

```bash
vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3
```
### Minimal launch (smaller context)

```bash
vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code
```
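Once either server is up, you can query it through vLLM's OpenAI-compatible endpoint. A minimal example, assuming the default bind address (adjust `base_url` if you changed `--host`/`--port`):

```python
# Minimal chat request against the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 quantization in two sentences."}],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)
```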
## Verified throughput (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1)
Production-config benchmark (256K context, KV FP8, max-num-seqs 2), temperature 0:
| Prompt | Single tok/s | 2-parallel agg tok/s | 2-parallel per-request tok/s |
|---|---|---|---|
| Short (50 tok) | 56.3 | 108.7 | 54.9 |
| Medium (350 tok) | 58.6 | 118.9 | 59.6 |
| Long-form (700 tok) | 58.7 | 118.8 | 59.4 |
KV cache capacity at 256K context with FP8 KV: 529,984 tokens, for a reported maximum concurrency of 7.91× for full 256K-token requests (highest in the family: with no MTP draft model, more of the memory budget goes to KV cache). Available KV memory: 64.79 GiB on a 96 GB Blackwell card.
These numbers are the baseline that the modelopt + MTP siblings are measured against (see table at the top). Single-request decode is ~58 tok/s here vs ~100 tok/s on the MTP siblings; 2-parallel aggregate is ~119 tok/s vs ~190–207 tok/s.
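A rough way to reproduce the single-stream decode numbers is to stream a completion and time the generated tokens. A minimal sketch; the endpoint is assumed to be the default, and counting one token per streamed chunk is an approximation:

```python
# Crude single-stream decode tok/s check against the running vLLM server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-NVFP4",
    messages=[{"role": "user", "content": "Write a 300-word overview of FP4 formats."}],
    temperature=0.0,
    max_tokens=512,
    stream=True,
)

tokens, start = 0, None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if start is None:
            start = time.perf_counter()  # start timing at first token (skip prefill)
        tokens += 1

if start is not None and tokens > 1:
    # Inter-token rate: (tokens after the first) / elapsed since the first token.
    print(f"~{(tokens - 1) / (time.perf_counter() - start):.1f} tok/s decode")
```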
## Tested Environment
| Component | Version |
|---|---|
| vLLM | 0.19.1rc1+ (nightly) |
| Transformers | 5.5.4 |
| PyTorch | 2.11.0+cu130 |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
## Credits
- Original Model: Qwen Team (Alibaba Group)
- NVFP4 Quantization: Lna-Lab
- Blackwell NVFP4 GEMM Kernels: lna-lab/blackwell-geforce-nvfp4-gemm
- Quantization Framework: vllm-project/llm-compressor
## Base Model Information (Qwen3.6-27B)
### Model Specifications
- Type: Causal Language Model with Vision Encoder
- Parameters: 27B
- Training Stages: Pre-training & Post-training
- Context Length: 262,144 tokens (natively), extensible up to 1,010,000 tokens
- Architecture:
- Hidden Dimension: 5,120
- Layers: 64
- Layout: 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
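The layout expression expands as in the sketch below, which is just an illustration of how the 64 layers interleave; the layer names are descriptive labels, not the model's actual module names.

```python
# Expand the documented hybrid layout:
# 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
block = ["gated_deltanet"] * 3 + ["gated_attention"]  # descriptive labels only
layers = block * 16

assert len(layers) == 64                     # matches "Layers: 64"
assert layers.count("gated_attention") == 16  # one attention layer per block
print(layers[:8])  # the first two blocks' worth of layer types
```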
### Key Capabilities

**Agentic Coding:**
- Frontend workflows and repository-level reasoning
- Handling complex software engineering tasks
- Enhanced precision in code generation
**Thinking Preservation:**
- Retains reasoning context from historical messages
- Streamlines iterative development
- Reduces computational overhead
## License
Apache 2.0 - See LICENSE