Qwen3.6-27B-NVFP4 Model Card

Overview

This is an NVFP4-quantized version of Qwen/Qwen3.6-27B by Lna-Lab, built with custom Blackwell NVFP4 GEMM kernels.

This is the first NVFP4 release in our Qwen3.6-27B family — compressed-tensors format, vision tower preserved, no MTP head. ~35K downloads since release. For new deployments we strongly recommend the faster siblings below unless you have a reason to stay on compressed-tensors.

Key Compression Stats:

  • Original Size: 55.6 GB
  • Quantized Size: 19.7 GB (0.35× of the original, a ~2.8× size reduction)
  • Vision Tower: Preserved in BF16
  • Hardware: Runs on a single NVIDIA Blackwell GPU

Faster siblings — modelopt + MTP format

Verified throughput at the same production launch (a single RTX PRO 6000 Blackwell, vLLM 0.19.1rc1, 256K context, KV FP8, max-num-seqs 2):

| Repo | Format | MTP | Single tok/s | 2-parallel agg tok/s | vs this repo |
|---|---|---|---|---|---|
| Qwen3.6-27B-NVFP4 (this) | compressed-tensors | ✗ | 58 (M / L) | 119 (M / L) | 1.0× (baseline) |
| Qwen3.6-27B-Text-NVFP4-MTP | modelopt | ✅ n=3 | 98 / 100 | 189 / 207 | 1.67× / 1.74× |
| Carnice-V2-27b-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 98 / 102 | 193 / 194 | 1.68× / 1.63× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 96 / 101 | 203 / 183 | 1.65× / 1.54× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (VLM) | modelopt | ✅ n=3 | 97 / 100 | 183 / 198 | 1.66× / 1.66× |

(M / L = medium (350-token) / long-form (700-token) prompts from the benchmark section below.)

The 1.6–1.7× jump comes from two compounding fixes that this older repo doesn't have:

  • modelopt NVFP4 export — vLLM's native fast path on Blackwell SM120, vs the compressed-tensors slow fallback this repo lands on.
  • bf16-restored MTP head + num_speculative_tokens=3 — the single MTP layer is applied recursively three times per draft pass, lifting decode by ≈ 1.9× via speculative decoding.

This repo is left untouched so existing setups (≈ 35K downloads) are not disrupted, but if you can switch quantization formats, the modelopt + MTP siblings are clearly faster.
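
For reference, here is a minimal sketch of launching one of those MTP siblings with speculative decoding through vLLM's Python API. The owner namespace and the "mtp" method string are assumptions not confirmed by this card (modelopt MTP checkpoints may also configure this automatically); treat it as illustrative, not the exact serving setup behind the table.

from vllm import LLM, SamplingParams

# Sketch only: the repo namespace and the "mtp" method name are assumptions.
llm = LLM(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    max_model_len=262144,
    kv_cache_dtype="fp8",
    trust_remote_code=True,
    speculative_config={
        "method": "mtp",              # assumed generic MTP method string
        "num_speculative_tokens": 3,  # the n=3 recursive draft depth from the table
    },
)

out = llm.generate(["Explain NVFP4 in one sentence."], SamplingParams(temperature=0, max_tokens=64))
print(out[0].outputs[0].text)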


NVFP4 Quantization Details

| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-27B |
| Quantization Scheme | NVFP4 (W4A4: weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor + blackwell-geforce-nvfp4-gemm |
| Size | 19.7 GB (single safetensors shard) |
| Requirements | NVIDIA Blackwell GPU (SM 120), vLLM >= 0.19 |
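
For intuition about what W4A4 NVFP4 means at the tensor level, here is a minimal dequantization sketch. It assumes the standard NVFP4 layout (16-element blocks of E2M1 FP4 codes, one FP8 scale per block, plus a global FP32 scale) and is purely illustrative; it is not the GEMM kernel used by this repo.

import numpy as np

# The 16 representable E2M1 FP4 values (high bit is the sign).
FP4_E2M1_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_nvfp4_block(codes, block_scale, global_scale):
    """Map one block of 16 unpacked 4-bit codes back to real values.
    block_scale stands in for the stored FP8 (E4M3) per-block scale."""
    return FP4_E2M1_VALUES[np.asarray(codes)] * block_scale * global_scale

print(dequantize_nvfp4_block([1, 7, 15, 0] * 4, block_scale=0.25, global_scale=2.0))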

Quantization Recipe

QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
  scheme: NVFP4
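
A rough reproduction of this recipe through llm-compressor's Python API might look like the sketch below. The oneshot arguments (output directory, calibration setup) are assumptions; the author's exact pipeline is not documented here.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Mirror the YAML recipe above as a Python modifier.
recipe = QuantizationModifier(
    targets=["Linear"],
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

# output_dir is illustrative; NVFP4 requires a recent llm-compressor release.
oneshot(
    model="Qwen/Qwen3.6-27B",
    recipe=recipe,
    output_dir="./Qwen3.6-27B-NVFP4",
)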

What's Quantized vs. Preserved

  • Quantized (NVFP4): All Linear layers in the language model
  • Kept in BF16: lm_head, all vision layers (model.visual.*), MLP gates

Quick Start (vLLM)

Production-config launch — 256K context, KV FP8, max-num-seqs 2

vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3

Minimal launch (smaller context)

vllm serve sakamakismile/Qwen3.6-27B-NVFP4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code
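
Once either server is up, any OpenAI-compatible client can talk to it. For example, assuming the default port 8000:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the api_key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 quantization in two sentences."}],
    temperature=0,
)
print(resp.choices[0].message.content)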

Verified throughput (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1)

Production-config benchmark (256K context, KV FP8, max-num-seqs 2), temperature = 0:

| Prompt | Single tok/s | 2-parallel agg tok/s | 2-parallel per-request tok/s |
|---|---|---|---|
| Short (50 tok) | 56.3 | 108.7 | 54.9 |
| Medium (350 tok) | 58.6 | 118.9 | 59.6 |
| Long-form (700 tok) | 58.7 | 118.8 | 59.4 |

KV cache capacity at 256K context with fp8 KV: 529,984 tokens, giving a reported maximum concurrency of 7.91× at the full 256K request length (the highest in the family: with no MTP draft model, more of the memory budget goes to KV cache). Available KV memory: 64.79 GiB on a 96 GB Blackwell card. That concurrency exceeds the naive 529,984 / 262,144 ≈ 2× presumably because only the Gated Attention layers hold per-token KV cache, while the Gated DeltaNet layers keep a fixed-size recurrent state (see the architecture layout below).
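
A back-of-envelope check of that figure, under the assumption above that only the 16 Gated Attention layers out of 64 consume per-token KV cache:

# Assumption, not a vLLM log: the 16/64 attention fraction comes from the
# architecture layout in the base-model section below.
kv_capacity_tokens = 529_984
max_model_len = 262_144
attention_layer_fraction = 16 / 64

effective_kv_tokens_per_request = max_model_len * attention_layer_fraction  # 65,536
print(kv_capacity_tokens / effective_kv_tokens_per_request)  # ~8.1, near the reported 7.91x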

These numbers are the baseline that the modelopt + MTP siblings are measured against (see table at the top). Single-request decode is ~58 tok/s here vs ~100 tok/s on the MTP siblings; 2-parallel aggregate is ~119 tok/s vs ~190–207 tok/s.


Tested Environment

| Component | Version |
|---|---|
| vLLM | 0.19.1rc1+ (nightly) |
| Transformers | 5.5.4 |
| PyTorch | 2.11.0+cu130 |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96 GB) |

Credits


Base Model Information (Qwen3.6-27B)

Model Specifications

  • Type: Causal Language Model with Vision Encoder
  • Parameters: 27B
  • Training Stages: Pre-training & Post-training
  • Context Length: 262,144 tokens natively, extensible up to 1,010,000 tokens
  • Architecture:
    • Hidden Dimension: 5,120
    • Layers: 64
    • Layout: 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
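
A tiny sketch of that repeating pattern (illustrative names only), which also shows why only a quarter of the layers are full attention:

# 16 macro-blocks, each: 3x (Gated DeltaNet -> FFN) then 1x (Gated Attention -> FFN),
# for 64 token-mixing layers in total.
def layer_layout(num_blocks: int = 16) -> list[str]:
    layers: list[str] = []
    for _ in range(num_blocks):
        layers += ["gated_deltanet"] * 3 + ["gated_attention"]
    return layers

layers = layer_layout()
assert len(layers) == 64
assert layers.count("gated_attention") == 16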

Key Capabilities

Agentic Coding:

  • Frontend workflows and repository-level reasoning
  • Handling complex software engineering tasks
  • Enhanced precision in code generation

Thinking Preservation:

  • Retains reasoning context from earlier messages in a conversation
  • Streamlines iterative development
  • Reduces computational overhead

License

Apache 2.0 - See LICENSE
