shieldstar/Qwen3.6-35B-A3B-int4-AutoRound-EC
Extended Calibration (EC) INT4 AutoRound quantization of Qwen/Qwen3.6-35B-A3B, a 35B MoE (3B active, 128 experts) multimodal model. Drop-in replacement for Intel/Qwen3.6-35B-A3B-int4-AutoRound with wider calibration settings for improved quality on long-context and reasoning-heavy workloads.
Calibration — Extended vs Intel default
| Parameter | Intel (v0.13.0) | EC (this model) |
|---|---|---|
| iters | 200 (default) | 400 |
| nsamples | 128 (default) | 256 |
| seqlen | 512 (default) | 4096 |
| batch_size | 8 (default) | 8 (default) |
| grad_accum | 1 (default) | 1 (default) |
| ignore_layers | `mtp.fc` | `mtp.fc` |
| bits / group_size | 4 / 128 | 4 / 128 |
| sym | true | true |
| packing_format | `auto_round:auto_gptq` | `auto_round:auto_gptq` |
Intel does not publish their iters/nsamples/seqlen in the released artifact
or README. Our EC values are 2× Intel's CLI defaults on iters and
nsamples, and 8× on seqlen — targeting better activation coverage for
long-context and multimodal workloads.
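For anyone driving auto-round from Python rather than the CLI, a minimal sketch of the same EC recipe. Keyword names are assumed to mirror the 0.13 CLI flags and the `mtp.fc` exclusion is omitted here; check your auto-round version's `AutoRound` signature:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-35B-A3B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# EC settings from the table above; everything else stays at defaults.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    iters=400,
    nsamples=256,
    seqlen=4096,
    batch_size=8,
)
autoround.quantize()
autoround.save_quantized("./Qwen3.6-35B-A3B-int4-EC", format="auto_gptq")
```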
Environment
| Component | Version |
|---|---|
| auto-round | 0.13.0 (merge commit 2dda035 — PR #1705) |
| transformers | 5.5.1 (matches Intel's exact pin for this variant) |
| torch | 2.6.0+cu124 |
| safetensors | 0.7.0 |
| huggingface_hub | 1.11.0 |
| Hardware | RunPod H200 SXM (1× 141 GB HBM3e) |
Reproducible invocation
```bash
# Pin exact auto-round version used (merge SHA, never a branch name)
pip install 'setuptools>=76,<81' 'setuptools_scm<8' 'packaging>=24.2'
pip install -U 'torch==2.6.0' 'torchvision==0.21.0' --index-url https://download.pytorch.org/whl/cu124
pip install 'transformers==5.5.1' safetensors huggingface_hub
pip install 'git+https://github.com/intel/auto-round.git@2dda035b275a297464565ba8d4d2cc24ae6a07a9'

# Quantize (EC params)
auto-round "Qwen/Qwen3.6-35B-A3B" \
  --output_dir "./Qwen3.6-35B-A3B-int4-EC" \
  --ignore_layers mtp.fc \
  --iters 400 \
  --nsamples 256 \
  --seqlen 4096
```
No gptqmodel install needed — auto-round 0.13 has GPTQ packing built in.
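Once the shards are downloaded, a minimal text-only smoke test. This is a sketch: for this multimodal checkpoint the right entry point may be the model-specific conditional-generation class rather than `AutoModelForCausalLM`, depending on your transformers version:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "shieldstar/Qwen3.6-35B-A3B-int4-AutoRound-EC"

# The quantization_config embedded in config.json tells transformers how to
# unpack the w4g128 GPTQ shards; no manual dequantization step is needed.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, device_map="auto", torch_dtype=torch.bfloat16
)

inputs = tokenizer("Briefly explain AutoRound quantization.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```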
Architecture
- Layers: 40 transformer blocks (`model.language_model.layers.0` through `model.language_model.layers.39`)
- MoE: 128 experts, 3B active per token (`model.language_model.layers.*.mlp.experts`)
- Attention: hybrid, with layers `[3, 7, 11, 15, 19, 23, 27, 31, 35, 39]` using `self_attn` and the rest `linear_attn` (DeltaNet)
- MTP: 1 multi-token-prediction head (`mtp.layers.0`, `mtp.fc`, `mtp.norm`)
- Visual: 27 transformer blocks (`model.visual.blocks.0` through `model.visual.blocks.26`) + patch embed + merger
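The self-attention placement above follows a regular stride; a tiny sketch encoding just that schedule (derived from the index list in this section, not read from the released config):

```python
# Hybrid attention schedule: one full self-attention block every 4 layers
# (indices where i % 4 == 3), DeltaNet linear attention everywhere else.
def attn_kind(layer_idx: int) -> str:
    return "self_attn" if layer_idx % 4 == 3 else "linear_attn"

assert [i for i in range(40) if attn_kind(i) == "self_attn"] == list(range(3, 40, 4))
```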
Files
| Path | What |
|---|---|
| `model-000{01..10}-of-00010.*` | Quantized language-model shards (INT4 GPTQ, w4g128) |
| `model_extra_tensors.safetensors` | Visual encoder + MTP + ignored-layer gates (WOQ INT4 / BF16 passthrough) |
| `config.json` | Multimodal config with embedded `quantization_config` |
| `quantization_config.json` | Standalone quant config (redundant with `config.json`) |
| `model.safetensors.index.json` | Weight → shard map |
| `generation_config.json` | Default generation params |
| `preprocessor_config.json` | Vision preprocessor config |
| `processor_config.json` | Multimodal processor config |
| `chat_template.jinja` | Qwen3 chat template (reasoning-enabled) |
| `tokenizer.json` / `tokenizer_config.json` | Tokenizer |
Total size: ~20.9 GB (10 shards ≈ 19.6 GB + extras ≈ 1.35 GB) — Intel-layout-matched (1.00× ratio vs Intel/Qwen3.6-35B-A3B-int4-AutoRound).
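To sanity-check a local snapshot against these sizes, a small sketch (the directory path is a placeholder; the extras file sits outside the shard index, so it is summed separately):

```python
import json
import os

repo_dir = "./Qwen3.6-35B-A3B-int4-EC"  # placeholder: your local snapshot path

with open(os.path.join(repo_dir, "model.safetensors.index.json")) as f:
    weight_map = json.load(f)["weight_map"]

shards = sorted(set(weight_map.values()))
shard_bytes = sum(os.path.getsize(os.path.join(repo_dir, s)) for s in shards)
extra_bytes = os.path.getsize(os.path.join(repo_dir, "model_extra_tensors.safetensors"))

# Expect ~10 shards / ~19.6 GB plus ~1.35 GB extras (~20.9 GB total).
print(f"{len(shards)} shards: {shard_bytes / 1e9:.1f} GB + extras {extra_bytes / 1e9:.2f} GB")
```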
Layout note — Intel-layout-matched (after in-place fix)
Earlier builds of this repo had a ~21 GB `model_extra_tensors.safetensors`
because auto-round 0.13 (PR #1705) auto-expanded `block_name_to_quantize`
to include `model.visual.blocks` — the visual encoder ended up
INT4-quantized inside extras, and several main-shard keys were duplicated
into extras by `missing_tensors.py`. That broke vLLM's `qwen3_vl.py`
loader, which expects visual weights to be BF16.
We rebuilt the repo in place (no re-quant): main shards untouched;
`model_extra_tensors.safetensors` now contains BF16 visual tensors pulled
from the base model (Qwen/Qwen3.6-35B-A3B, 167 tensors) plus the MTP
head. Duplicated main-shard keys were dropped from the index. Net:
~20.9 GB total (1.00× Intel's reference size), no INT4 visual, and vLLM
loads cleanly.
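Both failure modes described above (an INT4 visual encoder in extras, and main-shard keys duplicated into extras) are detectable without loading the model. A small sketch using `safetensors`; the snapshot path is a placeholder:

```python
import json
from safetensors import safe_open

repo_dir = "./Qwen3.6-35B-A3B-int4-EC"  # placeholder: your local snapshot path

with open(f"{repo_dir}/model.safetensors.index.json") as f:
    indexed_keys = set(json.load(f)["weight_map"])

with safe_open(f"{repo_dir}/model_extra_tensors.safetensors", framework="pt") as f:
    extra_keys = set(f.keys())
    # Visual tensors must be BF16 passthrough for vLLM's qwen3_vl.py loader.
    non_bf16_visual = [
        k for k in extra_keys
        if k.startswith("model.visual.") and f.get_slice(k).get_dtype() != "BF16"
    ]

dupes = indexed_keys & extra_keys  # should be empty after the in-place fix
print(f"duplicated keys: {len(dupes)}, non-BF16 visual tensors: {len(non_bf16_visual)}")
```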
MTP handling (project policy: RTN)
This repo's `model_extra_tensors.safetensors` contains INT4 MTP tensors
(~2,335 per-expert weights under `mtp.layers.0.*`) produced via auto-round's
native `missing_tensors.py::copy_missing_tensors_from_source` post-save
path. The quantization method is deterministic symmetric RTN
(group_size=128, no calibration data), which is the standard auto-round
behavior for Qwen3.6 MoE models — and the same path Intel's published
reference (Intel/Qwen3.6-35B-A3B-int4-AutoRound) uses.
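For reference, RTN here is a single deterministic rounding pass with no optimization loop. A minimal PyTorch sketch of the w4g128 math; the `absmax/7` scale convention is one common symmetric-INT4 choice and an assumption here, not necessarily auto-round's exact kernel:

```python
import torch

def rtn_w4g128(w: torch.Tensor, group_size: int = 128):
    """Per-group symmetric round-to-nearest INT4 along the input dimension."""
    out_dim, in_dim = w.shape
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    scale = (g.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(g / scale), -8, 7)  # symmetric INT4 levels
    return q.to(torch.int8), scale

w = torch.randn(16, 256)
q, scale = rtn_w4g128(w)
w_hat = (q.float() * scale).reshape(16, 256)  # dequantize
assert (w - w_hat).abs().max() <= scale.max() / 2 + 1e-6  # worst case: half a step
```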
Why RTN and not calibrated EC like our main LM blocks:
- `transformers>=5.5.1` loads `Qwen3_5MoeForConditionalGeneration` with `_keys_to_ignore_on_load_unexpected = [r"^mtp.*"]` at two locations (`modeling_qwen3_5_moe.py` L881, L1902), silently stripping MTP tensors during load. The instantiated model has no `mtp` submodules for auto-round's calibrated block iteration to discover.
- We tried exposing MTP via a custom extension (`roles/quant/code/qwen36_mtp_extension.py::attach_mtp`) on 2026-04-23. The infrastructure worked (auto-round discovered `mtp.layers.0` as a block family with 0 missing weights), but naive forward-hook wiring produced NaN calibration loss: mtp takes `concat(embed(next_token), model.norm(hidden_states)) → mtp.fc → mtp.layers[0]`, not raw hidden states, so the calibration search diverged (the correct wiring is sketched after this list).
- Shipping calibrated MTP with a broken forward pipeline would be strictly worse than RTN (effectively random rounding). Proper EC requires implementing the full MTP forward; deferred as a future research task (see `docs/quant-new-model-checklist.md` §10.4, policy C).
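For concreteness, a hypothetical sketch of the forward wiring a calibrated MTP pass would need, following the pipeline in the second bullet. Attribute names are taken from the architecture section and the pipeline notation, not from the real class, and the MTP block would additionally need position/attention inputs that are elided here:

```python
import torch

def mtp_hidden(model, hidden_states, next_token_ids):
    """What mtp.layers[0] actually consumes, per the pipeline above.

    hidden_states: [B, T, H] output of the last LM block, pre-norm.
    next_token_ids: [B, T] input ids shifted left by one position.
    """
    h = model.norm(hidden_states)            # final-norm'd states, NOT raw hidden_states
    e = model.embed_tokens(next_token_ids)   # embedding of the *next* token
    x = torch.cat([e, h], dim=-1)            # [B, T, 2H]
    x = model.mtp.fc(x)                      # project 2H -> H (the ignored mtp.fc layer)
    return model.mtp.layers[0](x)            # the single MTP transformer block
```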
Practical implication: the 40 main LM blocks use our EC calibration (iters=400, nsamples=256, seqlen=4096). The 1 MTP block uses RTN. For speculative-decoding workloads where MTP acceptance rate matters, this is the same baseline as Intel's reference. For non-spec-dec deployments (most vLLM serving), MTP is not loaded and this choice is invisible.
License
Apache 2.0 (inherits from Qwen/Qwen3.6-35B-A3B).
Shoutouts
- Qwen team for the base Qwen3.6-35B-A3B model.
- Intel for the reference AutoRound INT4 recipe and post-quant checkpoint layout we built on. Special thanks to @lvliang-intel and @wenhuach21 for PR #1705 which added Qwen3.6 support.
- auto-round for the quantization tooling.