shieldstar/Qwen3.6-35B-A3B-int4-AutoRound-EC

Extended Calibration (EC) INT4 AutoRound quantization of Qwen/Qwen3.6-35B-A3B, a 35B MoE (3B active, 128 experts) multimodal model. Drop-in replacement for Intel/Qwen3.6-35B-A3B-int4-AutoRound with wider calibration settings for improved quality on long-context and reasoning-heavy workloads.

Calibration — Extended vs Intel default

| Parameter | Intel (v0.13.0) | EC (this model) |
| --- | --- | --- |
| iters | 200 (default) | 400 |
| nsamples | 128 (default) | 256 |
| seqlen | 512 (default) | 4096 |
| batch_size | 8 (default) | 8 (default) |
| grad_accum | 1 (default) | 1 (default) |
| ignore_layers | mtp.fc | mtp.fc |
| bits / group | 4 / 128 | 4 / 128 |
| sym | true | true |
| packing_format | auto_round:auto_gptq | auto_round:auto_gptq |

Intel does not publish its iters/nsamples/seqlen in the released artifact or README. Our EC values are 2× Intel's CLI defaults on iters and nsamples, and 8× on seqlen, targeting better activation coverage for long-context and multimodal workloads.
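
For reference, the same EC recipe expressed through auto-round's Python API. This is a minimal sketch under assumptions: the generic AutoModelForCausalLM/AutoTokenizer loaders and the save format argument are illustrative, and per-layer exclusion of mtp.fc is omitted here; the authoritative recipe is the CLI call under "Reproducible invocation" below.

```python
# Sketch only: EC calibration settings via the auto-round Python API.
# Loader classes and the save format argument are assumptions; the CLI call
# under "Reproducible invocation" is what was actually run.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "Qwen/Qwen3.6-35B-A3B"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    iters=400,      # EC: 2x the 200-iteration CLI default
    nsamples=256,   # EC: 2x the 128-sample CLI default
    seqlen=4096,    # EC: 8x the 512-token CLI default
)
autoround.quantize()
# The released repo uses packing_format auto_round:auto_gptq (see table above).
autoround.save_quantized("./Qwen3.6-35B-A3B-int4-EC", format="auto_round")
```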

Environment

| Component | Version |
| --- | --- |
| auto-round | 0.13.0 (merge commit 2dda035 — PR #1705) |
| transformers | 5.5.1 (matches Intel's exact pin for this variant) |
| torch | 2.6.0+cu124 |
| safetensors | 0.7.0 |
| huggingface_hub | 1.11.0 |
| Hardware | RunPod H200 SXM (1× 141 GB HBM3e) |

Reproducible invocation

# Pin exact auto-round version used (merge SHA, never a branch name)
pip install 'setuptools>=76,<81' 'setuptools_scm<8' 'packaging>=24.2'
pip install -U 'torch==2.6.0' 'torchvision==0.21.0' --index-url https://download.pytorch.org/whl/cu124
pip install 'transformers==5.5.1' safetensors huggingface_hub
pip install 'git+https://github.com/intel/auto-round.git@2dda035b275a297464565ba8d4d2cc24ae6a07a9'

# Quantize (EC params)
auto-round "Qwen/Qwen3.6-35B-A3B" \
    --output_dir "./Qwen3.6-35B-A3B-int4-EC" \
    --ignore_layers mtp.fc \
    --iters 400 \
    --nsamples 256 \
    --seqlen 4096

No gptqmodel install needed — auto-round 0.13 has GPTQ packing built in.
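
Serving the result is standard vLLM. A minimal offline-inference sketch, assuming a vLLM build that already supports this architecture (engine arguments are illustrative, not tuned):

```python
# Minimal vLLM offline-inference sketch. Assumes a vLLM build with support for
# this architecture; max_model_len is illustrative, not a tuned recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="shieldstar/Qwen3.6-35B-A3B-int4-AutoRound-EC",
    trust_remote_code=True,
    max_model_len=32768,
)
params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize what AutoRound calibration does."], params)
print(out[0].outputs[0].text)
```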

Architecture

  • Layers: 40 transformer blocks (model.language_model.layers.0 through model.language_model.layers.39)
  • MoE: 128 experts, 3B active per token (model.language_model.layers.*.mlp.experts)
  • Attention: hybrid — layers [3, 7, 11, 15, 19, 23, 27, 31, 35, 39] (every fourth block) are self_attn, the rest are linear_attn (DeltaNet); the split can be verified from the shard index, as sketched after this list
  • MTP: 1 multi-token-prediction head (mtp.layers.0, mtp.fc, mtp.norm)
  • Visual: 27 transformer blocks (model.visual.blocks.0 through model.visual.blocks.26) + patch embed + merger
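
A quick way to confirm the hybrid-attention split without loading any weights is to scan the shard index for which blocks carry self_attn projections. A small sketch; the key prefixes are assumed to match the layer names listed above:

```python
# Sketch: classify language-model blocks as self_attn vs linear_attn by scanning
# the weight map in model.safetensors.index.json. Key prefixes are assumed to
# follow the naming above (model.language_model.layers.N.self_attn / .linear_attn).
import json
import re

with open("model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

kinds = {"self_attn": set(), "linear_attn": set()}
pattern = re.compile(r"model\.language_model\.layers\.(\d+)\.(self_attn|linear_attn)\.")
for key in weight_map:
    m = pattern.match(key)
    if m:
        kinds[m.group(2)].add(int(m.group(1)))

print("self_attn blocks:  ", sorted(kinds["self_attn"]))    # expect [3, 7, ..., 39]
print("linear_attn blocks:", len(kinds["linear_attn"]))     # expect the remaining 30
```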

Files

| Path | What |
| --- | --- |
| model-000{01..10}-of-00010.* | Quantized language-model shards (INT4 GPTQ, w4g128) |
| model_extra_tensors.safetensors | Visual encoder + MTP + ignored-layer gates (WOQ INT4 / BF16 passthrough) |
| config.json | Multimodal config with embedded quantization_config |
| quantization_config.json | Standalone quant config (redundant with config.json) |
| model.safetensors.index.json | Weight-map → shard |
| generation_config.json | Default generation params |
| preprocessor_config.json | Vision preprocessor config |
| processor_config.json | Multimodal processor config |
| chat_template.jinja | Qwen3 chat template (reasoning-enabled) |
| tokenizer.json / tokenizer_config.json | Tokenizer |

Total size: ~20.9 GB (10 shards ≈ 19.6 GB + extras ≈ 1.35 GB) — Intel-layout-matched (1.00× ratio vs Intel/Qwen3.6-35B-A3B-int4-AutoRound).
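
The size figure can be sanity-checked without downloading anything by summing per-file sizes from the repo metadata (assuming the repo is reachable on the Hugging Face Hub under the same id):

```python
# Sum per-file sizes for the published repo and compare against the ~20.9 GB figure.
from huggingface_hub import HfApi

info = HfApi().model_info(
    "shieldstar/Qwen3.6-35B-A3B-int4-AutoRound-EC", files_metadata=True
)
total_bytes = sum((f.size or 0) for f in info.siblings)
print(f"total: {total_bytes / 1e9:.2f} GB")
```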

Layout note — Intel-layout-matched (after in-place fix)

Earlier builds of this repo had a ~21 GB model_extra_tensors.safetensors because auto-round 0.13 (PR #1705) auto-expanded block_name_to_quantize to include model.visual.blocks — the visual encoder ended up INT4'd inside extras, and several main-shard keys were duplicated into extras by missing_tensors.py. That broke vLLM's qwen3_vl.py loader, which expects visual weights to be BF16.

We rebuilt the repo in place (no re-quant): main shards untouched; model_extra_tensors.safetensors now contains BF16 visual tensors pulled from the base model (Qwen/Qwen3.6-35B-A3B, 167 tensors) plus the MTP head. Duplicated main-shard keys were dropped from the index. Net: ~20.9 GB total (1.00× Intel's reference size), no INT4 visual, vLLM loads cleanly.
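
In outline, the repair is mechanical: rebuild model_extra_tensors.safetensors from BF16 visual tensors plus the existing MTP tensors, then prune index entries that point at extras but already live in a main shard. A condensed sketch of that logic (paths and the base-checkpoint layout are illustrative; this is not the exact script used for the fix):

```python
# Condensed sketch of the in-place extras rebuild described above. Paths and the
# base-checkpoint layout are illustrative; not the exact script used for the fix.
import json
import torch
from safetensors.torch import load_file, save_file

# 1) BF16 visual tensors from the base checkpoint (shown here as a single file;
#    the real base model is sharded).
base = load_file("base/Qwen3.6-35B-A3B.safetensors")
visual = {k: v.to(torch.bfloat16) for k, v in base.items() if k.startswith("model.visual.")}

# 2) Keep the MTP head from the old extras file, drop everything else.
old_extras = load_file("model_extra_tensors.safetensors")
mtp = {k: v for k, v in old_extras.items() if k.startswith("mtp.")}

new_extras = {**visual, **mtp}
save_file(new_extras, "model_extra_tensors.safetensors")

# 3) Prune weight-map entries that pointed at the old extras file but are not in
#    the rebuilt one (those keys already live in the main shards).
with open("model.safetensors.index.json") as f:
    index = json.load(f)
index["weight_map"] = {
    k: shard for k, shard in index["weight_map"].items()
    if shard != "model_extra_tensors.safetensors" or k in new_extras
}
with open("model.safetensors.index.json", "w") as f:
    json.dump(index, f, indent=2)
```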

MTP handling (project policy: RTN)

This repo's model_extra_tensors.safetensors contains INT4 MTP tensors (~2,335 per-expert weights under mtp.layers.0.*) produced via auto-round's native missing_tensors.py::copy_missing_tensors_from_source post-save path. The quantization method is deterministic symmetric RTN (group_size=128, no calibration data), which is the standard auto-round behavior for Qwen3.6 MoE models — and the same path Intel's published reference (Intel/Qwen3.6-35B-A3B-int4-AutoRound) uses.

Why RTN and not calibrated EC like our main LM blocks:

  • transformers>=5.5.1 loads Qwen3_5MoeForConditionalGeneration with _keys_to_ignore_on_load_unexpected = [r"^mtp.*"] at two locations (modeling_qwen3_5_moe.py L881, L1902), silently stripping MTP tensors during load. The instantiated model has no mtp submodules for auto-round's calibrated block iteration to discover.
  • We tried exposing MTP via a custom extension (roles/quant/code/qwen36_mtp_extension.py::attach_mtp) on 2026-04-23. The infrastructure worked (auto-round discovered mtp.layers.0 as a block family with 0 missing weights), but naive forward-hook wiring produced NaN calibration loss — mtp takes concat(embed(next_token), model.norm(hidden_states)) → mtp.fc → mtp.layers[0], not raw hidden_states, so the calibration search diverged (the intended wiring is sketched after this list).
  • Shipping calibrated MTP with a broken forward pipeline would be strictly worse than RTN (effectively random rounding). Proper EC requires implementing the full MTP forward; deferred as a future research task (see docs/quant-new-model-checklist.md §10.4 policy C).
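
For concreteness, the MTP wiring described in the second bullet is the pipeline a future calibrated-EC run would have to feed correctly. A hedged sketch: module handles are passed in explicitly because the exact attribute paths and block call signature in the modeling code are assumptions:

```python
# Hedged sketch of the intended MTP forward pipeline described above:
# concat(embed(next_token), final_norm(hidden_states)) -> mtp.fc -> mtp.layers[0] -> mtp.norm.
# Module handles are passed explicitly; real attribute paths / call signatures may differ.
import torch

def mtp_forward(embed_tokens, final_norm, mtp_fc, mtp_block, mtp_norm,
                hidden_states, next_token_ids):
    tok = embed_tokens(next_token_ids)          # [batch, seq, hidden]
    hid = final_norm(hidden_states)             # [batch, seq, hidden]
    x = mtp_fc(torch.cat([tok, hid], dim=-1))   # [batch, seq, 2*hidden] -> [batch, seq, hidden]
    x = mtp_block(x)
    if isinstance(x, tuple):                    # decoder blocks commonly return tuples
        x = x[0]
    return mtp_norm(x)
```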

Practical implication: the 40 main LM blocks use our EC calibration (iters=400, nsamples=256, seqlen=4096). The 1 MTP block uses RTN. For speculative-decoding workloads where MTP acceptance rate matters, this is the same baseline as Intel's reference. For non-spec-dec deployments (most vLLM serving), MTP is not loaded and this choice is invisible.
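
For reference, deterministic symmetric RTN at group_size=128 is just round-to-nearest against a per-group max-abs scale. A minimal sketch of the arithmetic under one common convention (qmax = 7); it illustrates what the MTP block gets, not auto-round's actual missing_tensors.py code path:

```python
# Sketch of deterministic symmetric RTN INT4 quantization with group_size=128.
# Illustrates the arithmetic only; auto-round's exact scale convention may differ.
import torch

def rtn_int4_sym(weight: torch.Tensor, group_size: int = 128):
    out_features, in_features = weight.shape            # assumes in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # Per-group symmetric scale: map the max magnitude onto qmax = 7.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    w_hat = (q.float() * scale).reshape(out_features, in_features)
    return q, scale, w_hat

w = torch.randn(256, 512)
q, scale, w_hat = rtn_int4_sym(w)
print("max abs reconstruction error:", (w - w_hat).abs().max().item())
```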

License

Apache 2.0 (inherits from Qwen/Qwen3.6-35B-A3B).

Shoutouts

  • Qwen team for the base Qwen3.6-35B-A3B model.
  • Intel for the reference AutoRound INT4 recipe and post-quant checkpoint layout we built on. Special thanks to @lvliang-intel and @wenhuach21 for PR #1705 which added Qwen3.6 support.
  • auto-round for the quantization tooling.