thetom-ai/Qwen3.6-35B-A3B-ConfigI-MLX

Qwen3.6-35B-A3B - TurboQuant+ Config-I (MLX)

35B-parameter MoE compressed to 15 GB with Config-I mixed-precision quantization. Standard MLX format - works with stock mlx_lm and mlx-swift-lm. No custom loaders required.

Config-I quantization of Qwen/Qwen3.6-35B-A3B (35B total, 256 experts, top-8 active, ~3B active per token). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, keeps boundary layers at 8-bit, and leaves MoE routing at full precision. See the Config-I paper for the policy derivation.

Status: Available for testing. PPL, NIAH, and coherency results are reported below; MMLU is still pending. Use at your own risk.
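
As a quick sanity check, the model should load with the stock mlx_lm Python API out of the box. A minimal sketch; the repo id is from this card, while the prompt, sampling settings, and chat-template usage are illustrative:

```python
# Minimal sketch: load and run this Config-I export with stock mlx_lm.
# The repo id is taken from this card; prompt and max_tokens are illustrative.
from mlx_lm import load, generate

model, tokenizer = load("thetom-ai/Qwen3.6-35B-A3B-ConfigI-MLX")

messages = [{"role": "user", "content": "Explain mixture-of-experts in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```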

Compression

| Variant | Size |
| --- | --- |
| bf16 source | 72 GB |
| Uniform MLX 4-bit | 20 GB |
| Config-I (3.613 bpw) | 15 GB |

Config-I Policy (Qwen3.6 MoE Adaptation)

256 experts, top-8 active per token. 40 layers with hybrid attention (GatedDeltaNet + full attention).

| Component | Bits | Layers | Rationale |
| --- | --- | --- | --- |
| Expert MLP gate/up | 2-bit | middle 36 | 98%+ of params, MoE-tolerant |
| Expert MLP down | 3-bit | middle 36 | Write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 36 | Uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | Boundary layer protection |
| MoE router | f16 | all | Routing precision critical |
| Embeddings + lm_head | 8-bit | - | Protected |
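
For orientation, a policy of this shape can be expressed through mlx_lm's per-layer quantization hook. The sketch below is illustrative only, not the script used to produce this export: it assumes the `quant_predicate` callback of `mlx_lm.convert`, and the tensor path patterns (`switch_mlp`, `mlp.gate`, and so on) are guesses for this model family.

```python
# Illustrative sketch of a Config-I-style mixed-precision policy via
# mlx_lm.convert's quant_predicate hook. NOT the export script for this
# model; tensor path patterns and layer parsing are assumptions.
import re
from mlx_lm import convert

NUM_LAYERS = 40
BOUNDARY = {0, 1, NUM_LAYERS - 2, NUM_LAYERS - 1}  # first 2 + last 2 layers

def config_i_predicate(path, module, config):
    """Return False to keep f16, or a dict of quantization kwargs."""
    m = re.search(r"layers\.(\d+)\.", path)
    layer = int(m.group(1)) if m else None

    if path.endswith("mlp.gate"):                     # MoE router: keep f16
        return False
    if layer in BOUNDARY or "embed" in path or "lm_head" in path:
        return {"bits": 8, "group_size": 64}          # boundary + embeddings + head
    if "switch_mlp" in path or "experts" in path:     # expert MLPs
        bits = 3 if "down_proj" in path else 2        # down_proj is write-back sensitive
        return {"bits": bits, "group_size": 64}
    return {"bits": 4, "group_size": 64}              # attention and everything else

convert(
    "Qwen/Qwen3.6-35B-A3B",
    mlx_path="Qwen3.6-35B-A3B-ConfigI-MLX",
    quantize=True,
    quant_predicate=config_i_predicate,
)
```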

What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight: compression policy matters more than compression math - which tensors to compress, which to protect, and how aggressively.

For MoE models, expert MLPs dominate parameter count but tolerate aggressive compression because only a fraction of experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.

Config-I has been validated on MiniMax M2.7 (93.5% MMLU, PPL 4.604, 12/12 NIAH) and across Qwen/Phi model families at 27-38% size reduction with +1.0-3.9% PPL. See MiniMax M2.7 Config-I results for a fully benchmarked reference.

Compatibility

| Field | Value |
| --- | --- |
| Format | MLX safetensors (standard) |
| Avg bits | 3.613 bpw |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift) |
| Platform | Apple Silicon (M-series with 24 GB+) |
| Quantized on | 2026-04-16 |

No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with config.json quantization metadata will work.
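
One way to confirm a runtime sees the mixed-precision layout is to read that metadata directly. A small sketch; the nested-override layout follows current MLX conventions and key names may vary between mlx_lm versions:

```python
# Sketch: inspect per-layer quantization metadata in a local copy of the repo.
# The "quantization" layout follows current MLX conventions; key names may vary.
import json
from pathlib import Path

cfg = json.loads(Path("Qwen3.6-35B-A3B-ConfigI-MLX/config.json").read_text())
quant = cfg.get("quantization", {})

print("default:", quant.get("bits"), "bits, group size", quant.get("group_size"))
overrides = {k: v for k, v in quant.items() if isinstance(v, dict)}   # per-layer bits
kept_fp = [k for k, v in quant.items() if v is False]                 # left in f16
print(f"{len(overrides)} layers with non-default bits, {len(kept_fp)} kept at f16")
```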

Speed (Swift mlx-swift-lm, Config-I)

Measured on an M5 Max (128 GB), 2026-04-16, with the Swift mlx-swift-lm alpha branch (includes the chunked-prefill asyncEval fix from PR #49, merged into alpha). MLX_BENCH_METHOD=summarization.

| Context (tokens) | Prompt (tok/s) | Generation (tok/s) | Peak memory |
| --- | --- | --- | --- |
| 128 | 822 | 122 | 15.0 GB |
| 512 | 2,240 | 120 | 15.1 GB |
| 1,024 | 2,983 | 113 | 15.2 GB |
| 2,048 | 3,920 | 120 | 15.5 GB |
| 4,096 | 3,690 | 106 | 15.5 GB |
| 8,192 | 3,364 | 112 | 15.6 GB |
| 16,384 | 2,816 | 108 | 15.7 GB |
| 32,768 | 2,025 | 99 | 16.1 GB |

Requires alpha branch ≥ commit 83ff028 (PR #49 merged).

KV cache compression (TurboQuant)

Persistent KV cache footprint at 32K context is 5x smaller with turbo4v2 (K 4-bit, V 2-bit) asymmetric compression, with no measured speed cost:

| Config | KV at 128 | KV at 1k | KV at 4k | KV at 32k | tg @ 32k |
| --- | --- | --- | --- | --- | --- |
| none (fp16) | 111 MB | 308 MB | 979 MB | 7.07 GB | 99 tok/s |
| turbo4 (K4/V4 sym) | 29 MB | 82 MB | 260 MB | 1.88 GB | 93 tok/s |
| turbo3 (K3/V3 sym) | 23 MB | 63 MB | 199 MB | 1.44 GB | 98 tok/s |
| turbo4v2 (K4/V2 asym) | 23 MB | 63 MB | 199 MB | 1.44 GB | 100 tok/s |

Recommended: turbo4v2. Enable with MLX_BENCH_KV=turbo4v2 in the benchmark harness, or via the TurboQuantKVCache API in mlx-swift-lm alpha.

Prefill tok/s is within ±3% across all KV configs. Peak GPU memory during inference is dominated by prefill activations, so it does not change; the win is steady-state serving memory, which multiplies the number of concurrent contexts that fit on a given GPU.
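
Conceptually, turbo4v2 just quantizes the two cache tensors at different precisions. The sketch below illustrates the idea with MLX core ops; it is not the TurboQuantKVCache implementation from the alpha branch, and real cache mechanics (growth, rotation, per-head layout) are omitted.

```python
# Conceptual sketch of asymmetric K/V cache quantization (K at 4-bit, V at 2-bit)
# with MLX core ops. NOT the TurboQuantKVCache implementation; cache growth,
# rotation, and per-head layout are omitted.
import mlx.core as mx

GROUP = 64  # head_dim must be divisible by the group size

def compress_kv(keys, values):
    k_q = mx.quantize(keys, group_size=GROUP, bits=4)    # (packed, scales, biases)
    v_q = mx.quantize(values, group_size=GROUP, bits=2)
    return k_q, v_q

def decompress_kv(k_q, v_q):
    keys = mx.dequantize(*k_q, group_size=GROUP, bits=4)
    values = mx.dequantize(*v_q, group_size=GROUP, bits=2)
    return keys, values
```

Stock mlx_lm also ships a symmetric quantized KV cache (same bits for K and V, exposed via --kv-bits in recent versions); the asymmetric K4/V2 split is what the turbo4v2 numbers above measure.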

NIAH (Needle-in-a-Haystack)

Retrieval accuracy — a known verification code (BLUE TIGER 42) is inserted at 5 depth positions (10%, 25%, 50%, 75%, 90%) of a filler document at each context size. The model must output only the code.

| ctx | @10% | @25% | @50% | @75% | @90% | pass |
| --- | --- | --- | --- | --- | --- | --- |
| 1024 | ✓ | ✓ | ✓ | ✓ | ✓ | 5/5 |
| 4096 | ✓ | ✓ | ✓ | ✓ | ✓ | 5/5 |
| 8192 | ✓ | ✓ | ✓ | ✓ | ✓ | 5/5 |
| 16384 | ✓ | ✓ | ✓ | ✓ | ✓ | 5/5 |

20 / 20 PASS. No retrieval drift from the 2-bit expert MLP compression. Tested with no KV compression (--kv none).
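
The insertion logic is straightforward to reproduce. A minimal sketch; the filler text, needle wording, and token budgeting are illustrative, not the exact harness:

```python
# Minimal sketch of the needle-in-a-haystack prompt construction described above.
# Filler text, needle wording, and token budgeting are illustrative.
DEPTHS = (0.10, 0.25, 0.50, 0.75, 0.90)
NEEDLE = " The verification code is BLUE TIGER 42. "
QUESTION = "\n\nWhat is the verification code? Reply with only the code."

def build_niah_prompt(filler_sentence, target_tokens, depth, tokenizer):
    tokens_per_rep = max(len(tokenizer.encode(filler_sentence)), 1)
    filler = filler_sentence * (target_tokens // tokens_per_rep)
    cut = int(len(filler) * depth)        # character-level depth is close enough
    return filler[:cut] + NEEDLE + filler[cut:] + QUESTION

# One prompt per (context size, depth) pair; a run passes if the model's reply
# contains "BLUE TIGER 42" and nothing else of substance.
```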

Perplexity (WikiText-2)

Standard LM perplexity via forced decode on the WikiText-2 test split. Word-level PPL = exp(total_NLL / word_count) — tokenizer-agnostic.
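
A sketch of that computation with the stock mlx_lm model API; the WikiText-2 loader and chunking to the context window are omitted, and the file path is illustrative:

```python
# Sketch of word-level perplexity via forced decode: exp(total_NLL / word_count).
# Chunking to the context window and the WikiText-2 loader are omitted.
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("thetom-ai/Qwen3.6-35B-A3B-ConfigI-MLX")
text = open("wikitext-2-test.txt").read()            # illustrative path

tokens = mx.array(tokenizer.encode(text))[None]      # (1, T); real runs chunk by ctx
logits = model(tokens[:, :-1])                       # score every next token
total_nll = nn.losses.cross_entropy(logits, tokens[:, 1:], reduction="sum")

ppl = mx.exp(total_nll / len(text.split()))          # word-level, tokenizer-agnostic
print(f"word-level PPL: {ppl.item():.2f}")
```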

| ctx | Base uniform 4-bit (20 GB) | Config-I + turbo4v2 (15 GB) | Δ |
| --- | --- | --- | --- |
| 1,024 | 7.28 | 8.21 | +13% |
| 2,048 | 6.73 | 7.71 | +15% |
| 4,096 | 7.76 | 8.81 | +14% |

Config-I weight compression costs ~13-15% PPL vs uniform 4-bit for a 25% size reduction (20 GB → 15 GB). TurboQuant KV (turbo4v2) adds zero additional PPL: identical to 4 decimal places vs Config-I without KV compression.

Coherency

Sanity samples across context lengths. The prompt is padded with technical text on linear attention and then asks an unrelated question about mixture-of-experts, testing whether the model still reasons coherently at long context (including noticing the context/question mismatch). Output is capped at 256 tokens; the reasoning trace is included.
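
A compact sketch of how the probe is constructed; the padding paragraph and question are illustrative, not the exact prompts behind the samples below:

```python
# Sketch of the coherency probe: pad with on-topic text about linear attention,
# then ask an unrelated question about MoE. Wording is illustrative.
PAD = ("Linear attention with a gated delta rule keeps a compact state matrix that "
       "is updated recurrently, giving O(T) prefill cost and O(1) decode time. ")
QUESTION = ("\n\nIn 2-3 sentences, explain what a mixture-of-experts transformer "
            "does and why it saves compute.")

def build_coherency_prompt(target_tokens, tokenizer):
    reps = max(1, target_tokens // len(tokenizer.encode(PAD)))
    return PAD * reps + QUESTION
```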

ctx ≈ 128 (prompt 32 tokens)

1.  **Analyze User Request:**
   - **Topic:** Mixture-of-Experts (MoE) transformer
   - **Requirements:** Explain what it does and why it saves compute
   - **Constraint:** 2-3 sentences

2.  **Identify Key Concepts:**
   - What is a MoE transformer? It replaces standard dense feed-forward layers
     with multiple "expert" neural networks.
   - How does it work? A routing mechanism selects a small subset of experts
     (usually 1 or 2) per token/input to process.
   - Why does it save compute? Only the selected experts are activated, so the
     total number of parameters is large, but the active parameters per token
     are small, reducing computation per forward pass.

ctx ≈ 1024 (prompt 895 tokens)

1.  **Deconstruct the user's request:**
    *   Context provided: Linear attention, specifically the "gated delta rule"
        variant. Key points: recurrent update, compact state matrix $S_t$,
        recurrence rule S_t = g_t * S_{t-1} + β_t * (v_t - k_t^T * S_{t-1}) * k_t^T,
        learned decay gate, write strength, selective forgetting/overwriting,
        O(T) prefill cost, O(1) decode time.
    *   Task: Explain what a "mixture-of-experts transformer" does and why it
        saves compute, based on the context.

2.  **Analyze the context vs. the task:**
    *   The provided text is about Linear Attention / Gated Delta Rules.
    *   The user asks about Mixture-of-Experts — a different architectural idea.
    *   Does the provided text mention MoE? No.

ctx ≈ 4096 (prompt 3870 tokens)

1.  **Deconstruct the User's Request:**
    *   Context: The user provided a large block of text about linear attention
        (repeated many times). Describes gated delta rule, recurrent updates,
        state matrices, O(T) prefill, O(1) decode, decay gates, write strength.
    *   Task: Explain MoE transformer based on the context.

2.  Analysis of the mismatch: The user asks about MoE but provides text about
    Linear Attention. The two are different architectural innovations —
    linear attention is about how tokens attend, MoE is about which subset of
    the FFN is active per token. They compose but don't substitute.

ctx ≈ 16384 (prompt 6488 tokens)

1.  **Analyze User Input:**
   - Context Provided: Repetitive paragraph about linear attention
     (recurrent updates, state matrices, O(T) prefill, O(1) decode,
     gated delta rule). Clearly padding, but the core concept is clear.
   - Question Asked: MoE transformer — what it does, why it saves compute.
   - Key Mismatch: Context is about linear attention, question about MoE.
     The model should address the question directly, not confuse the two.

2.  **Identify Core Concepts:**
   - What is a Mixture-of-Experts (MoE) transformer?
   - How does it save compute?

ctx ≈ 32768 (prompt 6488 tokens — bounded by prompt budget)

1.  **Analyze User Input:**
   - Context Provided: Repetitive paragraph about linear attention.
     Core concept: linear attention with gated delta rule.
   - Question: MoE transformer, compute savings.
   - Same context/question mismatch as shorter runs; model stays on task.

At all tested context lengths the model produces a coherent reasoning trace, correctly identifies the concept mismatch between padding text (linear attention) and the actual question (MoE), and stays on task.

Quantized by @thetom-ai