shieldstar/Qwen3.6-35B-A3B-int4-AutoRound-EC
Extended Calibration (EC) INT4 AutoRound quantization of Qwen/Qwen3.6-35B-A3B, a 35B MoE (3B active, 128 experts) multimodal model. Drop-in replacement for Intel/Qwen3.6-35B-A3B-int4-AutoRound with wider calibration settings for improved quality on long-context and reasoning-heavy workloads.
Calibration — Extended vs Intel default
| Parameter | Intel (v0.13.0) | EC (this model) |
|---|---|---|
| iters | 200 (default) | 400 |
| nsamples | 128 (default) | 256 |
| seqlen | 512 (default) | 4096 |
| batch_size | 8 (default) | 8 (default) |
| grad_accum | 1 (default) | 1 (default) |
| ignore_layers | `mtp.fc` | `mtp.fc` |
| bits / group_size | 4 / 128 | 4 / 128 |
| sym | true | true |
| packing_format | `auto_round:auto_gptq` | `auto_round:auto_gptq` |
Intel does not publish their iters/nsamples/seqlen in the released artifact
or README. Our EC values are 2× Intel's CLI defaults on iters and
nsamples, and 8× on seqlen — targeting better activation coverage for
long-context and multimodal workloads.
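For anyone driving auto-round from Python rather than the CLI, a minimal sketch of the same EC recipe. Keyword names are assumed to mirror the 0.13 CLI flags and the `mtp.fc` exclusion is omitted here; check your auto-round version's `AutoRound` signature:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-35B-A3B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# EC settings from the table above; everything else stays at defaults.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    iters=400,
    nsamples=256,
    seqlen=4096,
    batch_size=8,
)
autoround.quantize()
autoround.save_quantized("./Qwen3.6-35B-A3B-int4-EC", format="auto_gptq")
```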
Environment
| Component | Version |
|---|---|
| auto-round | 0.13.0 (merge commit 2dda035 — PR #1705) |
| transformers | 5.5.1 (matches Intel's exact pin for this variant) |
| torch | 2.6.0+cu124 |
| safetensors | 0.7.0 |
| huggingface_hub | 1.11.0 |
| Hardware | RunPod H200 SXM (1× 141 GB HBM3e) |
Reproducible invocation
```bash
# Pin exact auto-round version used (merge SHA, never a branch name)
pip install 'setuptools>=76,<81' 'setuptools_scm<8' 'packaging>=24.2'
pip install -U 'torch==2.6.0' 'torchvision==0.21.0' --index-url https://download.pytorch.org/whl/cu124
pip install 'transformers==5.5.1' safetensors huggingface_hub
pip install 'git+https://github.com/intel/auto-round.git@2dda035b275a297464565ba8d4d2cc24ae6a07a9'

# Quantize (EC params)
auto-round "Qwen/Qwen3.6-35B-A3B" \
  --output_dir "./Qwen3.6-35B-A3B-int4-EC" \
  --ignore_layers mtp.fc \
  --iters 400 \
  --nsamples 256 \
  --seqlen 4096
```
No gptqmodel install needed — auto-round 0.13 has GPTQ packing built in.
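Once the shards are downloaded, a minimal text-only smoke test. This is a sketch: for this multimodal checkpoint the right entry point may be the model-specific conditional-generation class rather than `AutoModelForCausalLM`, depending on your transformers version:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "shieldstar/Qwen3.6-35B-A3B-int4-AutoRound-EC"

# The quantization_config embedded in config.json tells transformers how to
# unpack the w4g128 GPTQ shards; no manual dequantization step is needed.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, device_map="auto", torch_dtype=torch.bfloat16
)

inputs = tokenizer("Briefly explain AutoRound quantization.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```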
Architecture
- Layers: 40 transformer blocks (`model.language_model.layers.0` through `model.language_model.layers.39`)
- MoE: 128 experts, 3B active per token (`model.language_model.layers.*.mlp.experts`)
- Attention: hybrid, with layers `[3, 7, 11, 15, 19, 23, 27, 31, 35, 39]` using `self_attn` and the rest `linear_attn` (DeltaNet)
- MTP: 1 multi-token-prediction head (`mtp.layers.0`, `mtp.fc`, `mtp.norm`)
- Visual: 27 transformer blocks (`model.visual.blocks.0` through `model.visual.blocks.26`) + patch embed + merger
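The self-attention placement above follows a regular stride; a tiny sketch encoding just that schedule (derived from the index list in this section, not read from the released config):

```python
# Hybrid attention schedule: one full self-attention block every 4 layers
# (indices where i % 4 == 3), DeltaNet linear attention everywhere else.
def attn_kind(layer_idx: int) -> str:
    return "self_attn" if layer_idx % 4 == 3 else "linear_attn"

assert [i for i in range(40) if attn_kind(i) == "self_attn"] == list(range(3, 40, 4))
```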
Files
| Path | What |
|---|---|
| `model-000{01..10}-of-00010.*` | Quantized language-model shards (INT4 GPTQ, w4g128) |
| `model_extra_tensors.safetensors` | Visual encoder + MTP + ignored-layer gates (WOQ INT4 / BF16 passthrough) |
| `config.json` | Multimodal config with embedded `quantization_config` |
| `quantization_config.json` | Standalone quant config (redundant with `config.json`) |
| `model.safetensors.index.json` | Weight → shard map |
| `generation_config.json` | Default generation params |
| `preprocessor_config.json` | Vision preprocessor config |
| `processor_config.json` | Multimodal processor config |
| `chat_template.jinja` | Qwen3 chat template (reasoning-enabled) |
| `tokenizer.json` / `tokenizer_config.json` | Tokenizer |
Total size: ~20.9 GB (10 shards ≈ 19.6 GB + extras ≈ 1.35 GB) — Intel-layout-matched (1.00× ratio vs Intel/Qwen3.6-35B-A3B-int4-AutoRound).
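To sanity-check a local snapshot against these sizes, a small sketch (the directory path is a placeholder; the extras file sits outside the shard index, so it is summed separately):

```python
import json
import os

repo_dir = "./Qwen3.6-35B-A3B-int4-EC"  # placeholder: your local snapshot path

with open(os.path.join(repo_dir, "model.safetensors.index.json")) as f:
    weight_map = json.load(f)["weight_map"]

shards = sorted(set(weight_map.values()))
shard_bytes = sum(os.path.getsize(os.path.join(repo_dir, s)) for s in shards)
extra_bytes = os.path.getsize(os.path.join(repo_dir, "model_extra_tensors.safetensors"))

# Expect ~10 shards / ~19.6 GB plus ~1.35 GB extras (~20.9 GB total).
print(f"{len(shards)} shards: {shard_bytes / 1e9:.1f} GB + extras {extra_bytes / 1e9:.2f} GB")
```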
Layout note — Intel-layout-matched (after in-place fix)
Earlier builds of this repo had a ~21 GB `model_extra_tensors.safetensors`
because auto-round 0.13 (PR #1705) auto-expanded `block_name_to_quantize`
to include `model.visual.blocks` — the visual encoder ended up
INT4-quantized inside extras, and several main-shard keys were duplicated
into extras by `missing_tensors.py`. That broke vLLM's `qwen3_vl.py`
loader, which expects visual weights to be BF16.
We rebuilt the repo in place (no re-quant): main shards untouched;
`model_extra_tensors.safetensors` now contains BF16 visual tensors pulled
from the base model (Qwen/Qwen3.6-35B-A3B, 167 tensors) plus the MTP
head. Duplicated main-shard keys were dropped from the index. Net:
~20.9 GB total (1.00× Intel's reference size), no INT4 visual, and vLLM
loads cleanly.
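Both failure modes described above (an INT4 visual encoder in extras, and main-shard keys duplicated into extras) are detectable without loading the model. A small sketch using `safetensors`; the snapshot path is a placeholder:

```python
import json
from safetensors import safe_open

repo_dir = "./Qwen3.6-35B-A3B-int4-EC"  # placeholder: your local snapshot path

with open(f"{repo_dir}/model.safetensors.index.json") as f:
    indexed_keys = set(json.load(f)["weight_map"])

with safe_open(f"{repo_dir}/model_extra_tensors.safetensors", framework="pt") as f:
    extra_keys = set(f.keys())
    # Visual tensors must be BF16 passthrough for vLLM's qwen3_vl.py loader.
    non_bf16_visual = [
        k for k in extra_keys
        if k.startswith("model.visual.") and f.get_slice(k).get_dtype() != "BF16"
    ]

dupes = indexed_keys & extra_keys  # should be empty after the in-place fix
print(f"duplicated keys: {len(dupes)}, non-BF16 visual tensors: {len(non_bf16_visual)}")
```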
MTP handling (project policy: RTN)
This repo's `model_extra_tensors.safetensors` contains INT4 MTP tensors
(~2,335 per-expert weights under `mtp.layers.0.*`) produced via auto-round's
native `missing_tensors.py::copy_missing_tensors_from_source` post-save
path. The quantization method is deterministic symmetric RTN
(group_size=128, no calibration data), which is the standard auto-round
behavior for Qwen3.6 MoE models — and the same path Intel's published
reference (Intel/Qwen3.6-35B-A3B-int4-AutoRound) uses.
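For reference, RTN here is a single deterministic rounding pass with no optimization loop. A minimal PyTorch sketch of the w4g128 math; the `absmax/7` scale convention is one common symmetric-INT4 choice and an assumption here, not necessarily auto-round's exact kernel:

```python
import torch

def rtn_w4g128(w: torch.Tensor, group_size: int = 128):
    """Per-group symmetric round-to-nearest INT4 along the input dimension."""
    out_dim, in_dim = w.shape
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    scale = (g.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(g / scale), -8, 7)  # symmetric INT4 levels
    return q.to(torch.int8), scale

w = torch.randn(16, 256)
q, scale = rtn_w4g128(w)
w_hat = (q.float() * scale).reshape(16, 256)  # dequantize
assert (w - w_hat).abs().max() <= scale.max() / 2 + 1e-6  # worst case: half a step
```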
Why RTN and not calibrated EC like our main LM blocks:
- `transformers>=5.5.1` loads `Qwen3_5MoeForConditionalGeneration` with `_keys_to_ignore_on_load_unexpected = [r"^mtp.*"]` at two locations (`modeling_qwen3_5_moe.py` L881, L1902), silently stripping MTP tensors during load. The instantiated model has no `mtp` submodules for auto-round's calibrated block iteration to discover.
- We tried exposing MTP via a custom extension (`roles/quant/code/qwen36_mtp_extension.py::attach_mtp`) on 2026-04-23. The infrastructure worked (auto-round discovered `mtp.layers.0` as a block family with 0 missing weights), but naive forward-hook wiring produced NaN calibration loss: mtp takes `concat(embed(next_token), model.norm(hidden_states)) → mtp.fc → mtp.layers[0]`, not raw hidden states, so the calibration search diverged (the correct wiring is sketched after this list).
- Shipping calibrated MTP with a broken forward pipeline would be strictly worse than RTN (effectively random rounding). Proper EC requires implementing the full MTP forward; deferred as a future research task (see `docs/quant-new-model-checklist.md` §10.4, policy C).
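For concreteness, a hypothetical sketch of the forward wiring a calibrated MTP pass would need, following the pipeline in the second bullet. Attribute names are taken from the architecture section and the pipeline notation, not from the real class, and the MTP block would additionally need position/attention inputs that are elided here:

```python
import torch

def mtp_hidden(model, hidden_states, next_token_ids):
    """What mtp.layers[0] actually consumes, per the pipeline above.

    hidden_states: [B, T, H] output of the last LM block, pre-norm.
    next_token_ids: [B, T] input ids shifted left by one position.
    """
    h = model.norm(hidden_states)            # final-norm'd states, NOT raw hidden_states
    e = model.embed_tokens(next_token_ids)   # embedding of the *next* token
    x = torch.cat([e, h], dim=-1)            # [B, T, 2H]
    x = model.mtp.fc(x)                      # project 2H -> H (the ignored mtp.fc layer)
    return model.mtp.layers[0](x)            # the single MTP transformer block
```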
Practical implication: the 40 main LM blocks use our EC calibration (iters=400, nsamples=256, seqlen=4096). The 1 MTP block uses RTN. For speculative-decoding workloads where MTP acceptance rate matters, this is the same baseline as Intel's reference. For non-spec-dec deployments (most vLLM serving), MTP is not loaded and this choice is invisible.
License
Apache 2.0 (inherits from Qwen/Qwen3.6-35B-A3B).
Shoutouts
- Qwen team for the base Qwen3.6-35B-A3B model.
- Intel for the reference AutoRound INT4 recipe and post-quant checkpoint layout we built on. Special thanks to @lvliang-intel and @wenhuach21 for PR #1705 which added Qwen3.6 support.
- auto-round for the quantization tooling.