palmfuture/Qwen3.6-35B-A3B-GPTQ-Int4
GPTQ Int4 quantization of Qwen/Qwen3.6-35B-A3B, produced on consumer multi-GPU hardware (4× RTX 3060 12GB) using Python 3.13t free-threading.
This v2 release ships MTP (Multi-Token Prediction) speculative decoding weights verified working on both vLLM 0.19.1 and SGLang 0.5.10.
Quality
| Metric | Value |
|---|---|
| GPTQ success rate | 97.42% |
| RTN fallback rate | 2.58% |
| Loss mean | 1.38e-04 |
| Loss median | 9.29e-05 |
| Loss max | 2.14e-03 |
| Total modules | 30,720 |
| Perplexity (wikitext-2-raw-v1) | 6.1846 (~97.9% BF16 retention) |
Model specs
| Property | Value |
|---|---|
| Base model | Qwen3.6-35B-A3B (MoE, 35B total / 3B active) |
| Architecture | Qwen3_5MoeForConditionalGeneration (vision + text) |
| Experts | 256 (top-8 routing per token) |
| Hidden layers | 40 |
| Context length | 262,144 tokens |
| Quantization | GPTQ v2, 4-bit, group_size=128, symmetric |
| Quantized size | 24.4 GB (incl. MTP weights) |
| KV cache support | fp16, bf16, fp8_e4m3 (storage-only on Ampere) |
| MTP head | Included (BF16, 785 keys, split per-expert format) |
What's quantized vs kept bf16
Quantized (int4): All MoE expert weights (mlp.experts.*) across layers 0–39
Kept bf16 (per Qwen3.6 recipe):
- Attention layers (*.attn.*)
- MoE routers (*.mlp.gate)
- Shared experts (*.shared_expert.*)
- Multi-token prediction heads (*.mtp.*) — see Speculative decoding below
- Vision encoder (*.visual.*)
- Embeddings and lm_head
Calibration recipe
Domain-mixed calibration set to ensure all 256 experts receive meaningful activation signal:
| Source | Samples | Purpose |
|---|---|---|
| allenai/c4 | 102 | General English text |
| allenai/tulu-3-sft-mixture | 77 | Instruction-following |
| codeparrot/codeparrot-clean-valid | 51 | Code generation |
| HuggingFaceH4/MATH-500 | 26 | Mathematical reasoning |
| Total | 256 | seq_len=1024 per sample |
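For reference, a minimal sketch of how a domain-mixed calibration list like this can be assembled with the datasets library. This is not the exact script used for this release; the splits, text-field names, and flattening of chat turns are assumptions.

```python
from datasets import load_dataset

# Illustrative only: splits and field names below are assumptions,
# not the exact preprocessing used for this release.
MIX = [
    ("allenai/c4", "en", "train", "text", 102),
    ("allenai/tulu-3-sft-mixture", None, "train", "messages", 77),
    ("codeparrot/codeparrot-clean-valid", None, "train", "content", 51),
    ("HuggingFaceH4/MATH-500", None, "test", "problem", 26),
]

calibration_texts = []
for name, config, split, field, n in MIX:
    ds = load_dataset(name, config, split=split, streaming=True)
    for row in ds.take(n):
        value = row[field]
        # Chat datasets store a list of {role, content} turns; flatten them to plain text.
        if isinstance(value, list):
            value = "\n".join(turn["content"] for turn in value)
        calibration_texts.append(value)

# The 256 texts are then tokenized to seq_len=1024 and passed to the quantizer.
```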
Hardware used for quantization
- GPUs: 4× NVIDIA RTX 3060 12GB
- Motherboard: SuperMicro C9X299-RPGF (LGA 2066)
- CPU: Intel i9-7900X (10c/20t, Skylake-X)
- RAM: 32 GB DDR4-2666
- Runtime: ~4h 20m wall-clock
Toolchain
| Component | Version |
|---|---|
| GPTQModel | 6.0.3 |
| Flash Linear Attention (FLA) | 0.4.2 |
| PyTorch | 2.11.0+cu128 |
| Triton | 3.6.0 (cp313t wheel) |
| Python | 3.13.13t (free-threading, no-GIL) |
| CUDA | 12.8 |
Python 3.13t was the enabler for multi-GPU quantization on consumer cards — the no-GIL runtime let GPTQModel's data-parallel quantizer actually use all 4 GPUs without serializing through the interpreter lock.
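To make the benefit concrete, here is a minimal sketch of the thread-per-GPU pattern that free-threading unlocks. It is not GPTQModel's actual internals; the worker function and module sharding are illustrative. On 3.13t the four worker threads execute the Python-side quantization bookkeeping truly in parallel instead of serializing on the GIL.

```python
import threading
import torch

def quantize_shard(device_index: int, modules: list) -> None:
    """Hypothetical per-GPU worker: each thread owns one device and walks its
    share of modules. Under the GIL the Python portions of this loop serialize;
    under 3.13t free-threading they run concurrently."""
    device = torch.device(f"cuda:{device_index}")
    for module in modules:
        module.to(device)
        # ... Hessian accumulation, GPTQ solve, int4 packing, move weights back ...

shards = [[] for _ in range(torch.cuda.device_count())]  # module assignment omitted
threads = [
    threading.Thread(target=quantize_shard, args=(i, shard))
    for i, shard in enumerate(shards)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```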
Usage
SGLang (recommended for production)
python -m sglang.launch_server \
--model-path palmfuture/Qwen3.6-35B-A3B-GPTQ-Int4 \
--quantization moe_wna16 \
--tp-size 4 \
--mem-fraction-static 0.89 \
--kv-cache-dtype fp8_e4m3 \
--context-length 262144 \
--port 30000
Verified on 4× RTX 3060 12GB with SGLang 0.5.10 (max_total_num_tokens=415184, context_len=262144, max_running_requests=39).
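Once the server is up, it can be queried through the standard OpenAI-compatible endpoint. A minimal client sketch (the prompt is a placeholder):

```python
from openai import OpenAI

# SGLang exposes an OpenAI-compatible API on the port given to --port.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="palmfuture/Qwen3.6-35B-A3B-GPTQ-Int4",
    messages=[{"role": "user", "content": "Summarize the GPTQ algorithm in two sentences."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```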
vLLM
vllm serve palmfuture/Qwen3.6-35B-A3B-GPTQ-Int4 \
--tensor-parallel-size 4 \
--max-model-len 200000 \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8 \
--dtype bfloat16 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code
Important: Do not pass --quantization moe_wna16 to vLLM. Let vLLM auto-detect quantization from config.json. Forcing the flag triggers a KeyError: 'experts.w2_weight' in the Qwen3_5MoeMTP loader path even when MTP is disabled.
Transformers (single-GPU, for testing)
Requires trust_remote_code=True for the Qwen3.5-MoE architecture.
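A minimal text-only loading sketch. The auto class is an assumption (if the mapping fails for the vision+text architecture, use the class named in config.json), and since the 24.4 GB checkpoint does not fit on a single 12 GB card, device_map="auto" is used to spread or offload the weights.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "palmfuture/Qwen3.6-35B-A3B-GPTQ-Int4"

# trust_remote_code is required for the Qwen3.5-MoE architecture.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keeps the bf16 modules in bf16
    device_map="auto",    # spreads/offloads layers across available GPUs and CPU
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```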
Recommended sampling
Follow the official Qwen3.6 sampling guidance. For workloads where you want to cap reasoning length (agents, coding tasks), see vllm-default-thinking-budget — a vLLM plugin I built for setting default thinking_token_budget and presence_penalty:
git clone https://github.com/palmfuture/vllm-default-thinking-budget
./vllm-default-thinking-budget/install.sh /path/to/your/vllm/venv
export VLLM_DEFAULT_THINKING_BUDGET=8192
export VLLM_DEFAULT_PRESENCE_PENALTY=1.0
vllm serve palmfuture/Qwen3.6-35B-A3B-GPTQ-Int4 ...
Speculative decoding (MTP)
This release ships MTP (Multi-Token Prediction) weights in the per-expert split format expected by vLLM and SGLang loaders (785 MTP keys total, all BF16). Speculative decoding is verified working on both engines.
vLLM with MTP
vllm serve palmfuture/Qwen3.6-35B-A3B-GPTQ-Int4 \
--tensor-parallel-size 4 \
--max-model-len 200000 \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8 \
--dtype bfloat16 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 2}' \
--trust-remote-code
Verified on vLLM 0.19.1 + 4× RTX 3060 12GB:
| Metric | Value |
|---|---|
| Steady-state decode | 56–82 t/s |
| Avg draft acceptance rate | 70–89% (peak 88.9%) |
| Per-position acceptance | token 1: ~0.93, token 2: ~0.85 |
| Mean acceptance length | 2.4–2.8 / 2 draft tokens |
| KV pool | 160,800 tokens (262K max-model-len, 2.34× concurrency) |
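As a sanity check on these numbers: with the per-position acceptance rates above and drafts verified left to right, the expected tokens emitted per decode step lands in the reported mean-acceptance-length range (a back-of-the-envelope estimate, assuming independent positions).

```python
# Per-position draft acceptance rates from the table above.
a1, a2 = 0.93, 0.85

# Each step emits the verified/bonus token plus any accepted drafts;
# draft 2 can only be accepted if draft 1 was.
expected_tokens_per_step = 1 + a1 + a1 * a2
print(round(expected_tokens_per_step, 2))  # ~2.72, in line with the 2.4-2.8 range
```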
SGLang with MTP (EAGLE)
SGLANG_ENABLE_SPEC_V2=1 \
SGLANG_MAMBA_CONV_DTYPE=float16 \
python -m sglang.launch_server \
--model-path palmfuture/Qwen3.6-35B-A3B-GPTQ-Int4 \
--tp-size 4 \
--dtype float16 \
--quantization moe_wna16 \
--context-length 200000 \
--mem-fraction-static 0.80 \
--mamba-scheduler-strategy extra_buffer \
--speculative-algorithm EAGLE \
--speculative-eagle-topk 1 \
--speculative-num-steps 3 \
--speculative-num-draft-tokens 4 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--port 30000
Verified on SGLang 0.5.10 + 4× RTX 3060 12GB: 52–141 t/s decode (peak with batching), 34–70% acceptance rate, 1.4–2.8 mean acceptance length.
Notes on MTP weight layout
This release stores MTP experts in per-expert split format (mtp.layers.0.mlp.experts.{i}.{gate,up,down}_proj.weight, 256 experts × 3 projections = 768 expert keys), matching the layout that the upstream vLLM/SGLang MTP loaders expect.
The Qwen3.6 BF16 base model stores these as fused 3D tensors (gate_up_proj shape [E, 2I, H], down_proj shape [E, H, I]). They are split bit-for-bit during release packaging — there is no quantization or numerical transformation of the MTP head, only a tensor reshape.
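A sketch of that reshape. The fused key names and the gate-before-up row ordering are assumptions inferred from the split layout described above; only the slicing pattern is the point.

```python
import torch

def split_mtp_experts(fused_gate_up: torch.Tensor, fused_down: torch.Tensor,
                      prefix: str = "mtp.layers.0.mlp.experts") -> dict[str, torch.Tensor]:
    """Split fused MTP expert tensors into the per-expert layout described above.

    fused_gate_up: [E, 2I, H] (assumed gate rows first, then up rows)
    fused_down:    [E, H, I]
    Pure slicing -- no values are changed.
    """
    num_experts, two_i, _ = fused_gate_up.shape
    inter = two_i // 2
    out = {}
    for e in range(num_experts):
        out[f"{prefix}.{e}.gate_proj.weight"] = fused_gate_up[e, :inter, :].contiguous()
        out[f"{prefix}.{e}.up_proj.weight"] = fused_gate_up[e, inter:, :].contiguous()
        out[f"{prefix}.{e}.down_proj.weight"] = fused_down[e].contiguous()
    return out
```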
Reproducibility
The full per-module quantization log is published as quant_log.csv (1.8 MB). Each row records the layer, module, GPTQ loss (or RTN fallback marker), sample count, damping value, and wall-clock time — making the run fully auditable.
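For example, per-layer RTN fallback rates like the ones quoted below can be recomputed directly from the log. The column names in this sketch are assumptions; check the header row of the published CSV.

```python
import pandas as pd

log = pd.read_csv("quant_log.csv")

# Assumed columns: "layer", "module", "loss" (numeric GPTQ loss, or an RTN marker string).
loss = pd.to_numeric(log["loss"], errors="coerce")
is_rtn = loss.isna()

print(is_rtn.groupby(log["layer"]).mean().mul(100).round(2))  # per-layer RTN rate (%)
print(loss.describe())                                        # GPTQ loss distribution
```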
Known characteristics
- Per-layer RTN distribution. Layers 32–39 show ~7–10% RTN rate vs ~1–3% for early layers. Consistent with MoE routing concentration in deeper layers and memory pressure late in the run.
- Cold experts. 157 of 256 expert IDs fell back to RTN in at least one layer; 99 always got GPTQ. Top cold experts: 235, 249, 234, 197, 237. These rarely route at inference.
- Ampere-only limitations. On RTX 3060 (SM86), fp8 KV cache is storage-only (dequantized for attention). No FP8 compute path.
- vLLM --quantization flag. Do not pass --quantization moe_wna16 to vLLM — it triggers a KeyError in the MTP loader path. SGLang requires the flag; vLLM must auto-detect.
Credits
- Qwen Team — base model
- ModelCloud / GPTQModel — quantization framework
- sustcsonglin / flash-linear-attention — GDN hybrid attention support
- Python 3.13 free-threading working group — no-GIL runtime
Quantized by @palmfuture.