thetom-ai/Qwen3.6-35B-A3B-ConfigI-MLX

Qwen3.6-35B-A3B - TurboQuant+ Config-I (MLX)

35B-parameter MoE compressed to 15 GB with Config-I mixed-precision quantization. Standard MLX format - works with stock mlx_lm and mlx-swift-lm. No custom loaders required.

Config-I quantization of Qwen/Qwen3.6-35B-A3B (35B total, 256 experts, top-8 active, ~3B active per token). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, keeps boundary layers at 8-bit, and leaves MoE routing at full precision. See the Config-I paper for the policy derivation.

Status: Available for testing. PPL, NIAH, and coherency results are reported below; MMLU is still pending. Use at your own risk.
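
As a quick sanity check, the model should load with the stock mlx_lm Python API out of the box. A minimal sketch; the repo id is from this card, while the prompt, sampling settings, and chat-template usage are illustrative:

```python
# Minimal sketch: load and run this Config-I export with stock mlx_lm.
# The repo id is taken from this card; prompt and max_tokens are illustrative.
from mlx_lm import load, generate

model, tokenizer = load("thetom-ai/Qwen3.6-35B-A3B-ConfigI-MLX")

messages = [{"role": "user", "content": "Explain mixture-of-experts in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```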

Compression

| Variant | Size |
| --- | --- |
| bf16 source | 72 GB |
| Uniform MLX 4-bit | 20 GB |
| Config-I (3.613 bpw) | 15 GB |

Config-I Policy (Qwen3.6 MoE Adaptation)

256 experts, top-8 active per token. 40 layers with hybrid attention (GatedDeltaNet + full attention).

| Component | Bits | Layers | Rationale |
| --- | --- | --- | --- |
| Expert MLP gate/up | 2-bit | middle 36 | 98%+ of params, MoE-tolerant |
| Expert MLP down | 3-bit | middle 36 | Write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 36 | Uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | Boundary layer protection |
| MoE router | f16 | all | Routing precision critical |
| Embeddings + lm_head | 8-bit | - | Protected |
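
For orientation, a policy of this shape can be expressed through mlx_lm's per-layer quantization hook. The sketch below is illustrative only, not the script used to produce this export: it assumes the `quant_predicate` callback of `mlx_lm.convert`, and the tensor path patterns (`switch_mlp`, `mlp.gate`, and so on) are guesses for this model family.

```python
# Illustrative sketch of a Config-I-style mixed-precision policy via
# mlx_lm.convert's quant_predicate hook. NOT the export script for this
# model; tensor path patterns and layer parsing are assumptions.
import re
from mlx_lm import convert

NUM_LAYERS = 40
BOUNDARY = {0, 1, NUM_LAYERS - 2, NUM_LAYERS - 1}  # first 2 + last 2 layers

def config_i_predicate(path, module, config):
    """Return False to keep f16, or a dict of quantization kwargs."""
    m = re.search(r"layers\.(\d+)\.", path)
    layer = int(m.group(1)) if m else None

    if path.endswith("mlp.gate"):                     # MoE router: keep f16
        return False
    if layer in BOUNDARY or "embed" in path or "lm_head" in path:
        return {"bits": 8, "group_size": 64}          # boundary + embeddings + head
    if "switch_mlp" in path or "experts" in path:     # expert MLPs
        bits = 3 if "down_proj" in path else 2        # down_proj is write-back sensitive
        return {"bits": bits, "group_size": 64}
    return {"bits": 4, "group_size": 64}              # attention and everything else

convert(
    "Qwen/Qwen3.6-35B-A3B",
    mlx_path="Qwen3.6-35B-A3B-ConfigI-MLX",
    quantize=True,
    quant_predicate=config_i_predicate,
)
```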

What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight: compression policy matters more than compression math - which tensors to compress, which to protect, and how aggressively.

For MoE models, expert MLPs dominate parameter count but tolerate aggressive compression because only a fraction of experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.

Config-I has been validated on MiniMax M2.7 (93.5% MMLU, PPL 4.604, 12/12 NIAH) and across Qwen/Phi model families at 27-38% size reduction with +1.0-3.9% PPL. See MiniMax M2.7 Config-I results for a fully benchmarked reference.

Compatibility

| Field | Value |
| --- | --- |
| Format | MLX safetensors (standard) |
| Avg bits | 3.613 bpw |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift) |
| Platform | Apple Silicon (M-series with 24 GB+) |
| Quantized on | 2026-04-16 |

No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with config.json quantization metadata will work.
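
One way to confirm a runtime sees the mixed-precision layout is to read that metadata directly. A small sketch; the nested-override layout follows current MLX conventions and key names may vary between mlx_lm versions:

```python
# Sketch: inspect per-layer quantization metadata in a local copy of the repo.
# The "quantization" layout follows current MLX conventions; key names may vary.
import json
from pathlib import Path

cfg = json.loads(Path("Qwen3.6-35B-A3B-ConfigI-MLX/config.json").read_text())
quant = cfg.get("quantization", {})

print("default:", quant.get("bits"), "bits, group size", quant.get("group_size"))
overrides = {k: v for k, v in quant.items() if isinstance(v, dict)}   # per-layer bits
kept_fp = [k for k, v in quant.items() if v is False]                 # left in f16
print(f"{len(overrides)} layers with non-default bits, {len(kept_fp)} kept at f16")
```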

Speed (Swift mlx-swift-lm, Config-I)

Measured on an M5 Max (128 GB), 2026-04-16, with the Swift mlx-swift-lm alpha branch (includes the chunked-prefill asyncEval fix from PR #49, merged into alpha). MLX_BENCH_METHOD=summarization.

| Context (tokens) | Prompt (tok/s) | Generation (tok/s) | Peak memory |
| --- | --- | --- | --- |
| 128 | 822 | 122 | 15.0 GB |
| 512 | 2,240 | 120 | 15.1 GB |
| 1,024 | 2,983 | 113 | 15.2 GB |
| 2,048 | 3,920 | 120 | 15.5 GB |
| 4,096 | 3,690 | 106 | 15.5 GB |
| 8,192 | 3,364 | 112 | 15.6 GB |
| 16,384 | 2,816 | 108 | 15.7 GB |
| 32,768 | 2,025 | 99 | 16.1 GB |

Requires alpha branch ≥ commit 83ff028 (PR #49 merged).

KV cache compression (TurboQuant)

Persistent KV cache footprint at 32K context is 5x smaller with turbo4v2 (K 4-bit, V 2-bit) asymmetric compression, with no measured speed cost:

| Config | KV at 128 | KV at 1k | KV at 4k | KV at 32k | tg @ 32k |
| --- | --- | --- | --- | --- | --- |
| none (fp16) | 111 MB | 308 MB | 979 MB | 7.07 GB | 99 tok/s |
| turbo4 (K4/V4 sym) | 29 MB | 82 MB | 260 MB | 1.88 GB | 93 tok/s |
| turbo3 (K3/V3 sym) | 23 MB | 63 MB | 199 MB | 1.44 GB | 98 tok/s |
| turbo4v2 (K4/V2 asym) | 23 MB | 63 MB | 199 MB | 1.44 GB | 100 tok/s |

Recommended: turbo4v2. Enable with MLX_BENCH_KV=turbo4v2 in the benchmark harness, or via the TurboQuantKVCache API in mlx-swift-lm alpha.

Prefill tok/s is within ±3% across all KV configs. Peak GPU memory during inference is dominated by prefill activations, so it does not change; the win is steady-state serving memory, which multiplies the number of concurrent contexts that fit on a given GPU.
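
Conceptually, turbo4v2 just quantizes the two cache tensors at different precisions. The sketch below illustrates the idea with MLX core ops; it is not the TurboQuantKVCache implementation from the alpha branch, and real cache mechanics (growth, rotation, per-head layout) are omitted.

```python
# Conceptual sketch of asymmetric K/V cache quantization (K at 4-bit, V at 2-bit)
# with MLX core ops. NOT the TurboQuantKVCache implementation; cache growth,
# rotation, and per-head layout are omitted.
import mlx.core as mx

GROUP = 64  # head_dim must be divisible by the group size

def compress_kv(keys, values):
    k_q = mx.quantize(keys, group_size=GROUP, bits=4)    # (packed, scales, biases)
    v_q = mx.quantize(values, group_size=GROUP, bits=2)
    return k_q, v_q

def decompress_kv(k_q, v_q):
    keys = mx.dequantize(*k_q, group_size=GROUP, bits=4)
    values = mx.dequantize(*v_q, group_size=GROUP, bits=2)
    return keys, values
```

Stock mlx_lm also ships a symmetric quantized KV cache (same bits for K and V, exposed via --kv-bits in recent versions); the asymmetric K4/V2 split is what the turbo4v2 numbers above measure.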

NIAH (Needle-in-a-Haystack)

Retrieval accuracy — a known verification code (BLUE TIGER 42) is inserted at 5 depth positions (10%, 25%, 50%, 75%, 90%) of a filler document at each context size. The model must output only the code.

| ctx | @10% | @25% | @50% | @75% | @90% | pass |
| --- | --- | --- | --- | --- | --- | --- |
| 1024 | ✓ | ✓ | ✓ | ✓ | ✓ | 5/5 |
| 4096 | ✓ | ✓ | ✓ | ✓ | ✓ | 5/5 |
| 8192 | ✓ | ✓ | ✓ | ✓ | ✓ | 5/5 |
| 16384 | ✓ | ✓ | ✓ | ✓ | ✓ | 5/5 |

20 / 20 PASS. No retrieval drift from the 2-bit expert MLP compression. Tested with no KV compression (--kv none).
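
The insertion logic is straightforward to reproduce. A minimal sketch; the filler text, needle wording, and token budgeting are illustrative, not the exact harness:

```python
# Minimal sketch of the needle-in-a-haystack prompt construction described above.
# Filler text, needle wording, and token budgeting are illustrative.
DEPTHS = (0.10, 0.25, 0.50, 0.75, 0.90)
NEEDLE = " The verification code is BLUE TIGER 42. "
QUESTION = "\n\nWhat is the verification code? Reply with only the code."

def build_niah_prompt(filler_sentence, target_tokens, depth, tokenizer):
    tokens_per_rep = max(len(tokenizer.encode(filler_sentence)), 1)
    filler = filler_sentence * (target_tokens // tokens_per_rep)
    cut = int(len(filler) * depth)        # character-level depth is close enough
    return filler[:cut] + NEEDLE + filler[cut:] + QUESTION

# One prompt per (context size, depth) pair; a run passes if the model's reply
# contains "BLUE TIGER 42" and nothing else of substance.
```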

Perplexity (WikiText-2)

Standard LM perplexity via forced decode on the WikiText-2 test split. Word-level PPL = exp(total_NLL / word_count) — tokenizer-agnostic.
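
A sketch of that computation with the stock mlx_lm model API; the WikiText-2 loader and chunking to the context window are omitted, and the file path is illustrative:

```python
# Sketch of word-level perplexity via forced decode: exp(total_NLL / word_count).
# Chunking to the context window and the WikiText-2 loader are omitted.
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("thetom-ai/Qwen3.6-35B-A3B-ConfigI-MLX")
text = open("wikitext-2-test.txt").read()            # illustrative path

tokens = mx.array(tokenizer.encode(text))[None]      # (1, T); real runs chunk by ctx
logits = model(tokens[:, :-1])                       # score every next token
total_nll = nn.losses.cross_entropy(logits, tokens[:, 1:], reduction="sum")

ppl = mx.exp(total_nll / len(text.split()))          # word-level, tokenizer-agnostic
print(f"word-level PPL: {ppl.item():.2f}")
```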

| ctx | Base uniform 4-bit (20 GB) | Config-I + turbo4v2 (15 GB) | Δ |
| --- | --- | --- | --- |
| 1,024 | 7.28 | 8.21 | +13% |
| 2,048 | 6.73 | 7.71 | +15% |
| 4,096 | 7.76 | 8.81 | +14% |

Config-I weight compression costs ~13-15% PPL vs uniform 4-bit for a 25% size reduction (20 GB → 15 GB). TurboQuant KV (turbo4v2) adds zero additional PPL: identical to 4 decimal places vs Config-I without KV compression.

Coherency

Sanity samples across context lengths. The prompt is padded with technical text on linear attention and then asks an unrelated question about mixture-of-experts, testing whether the model still reasons coherently at long context (including noticing the context/question mismatch). Output is capped at 256 tokens; the reasoning trace is included.
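
A compact sketch of how the probe is constructed; the padding paragraph and question are illustrative, not the exact prompts behind the samples below:

```python
# Sketch of the coherency probe: pad with on-topic text about linear attention,
# then ask an unrelated question about MoE. Wording is illustrative.
PAD = ("Linear attention with a gated delta rule keeps a compact state matrix that "
       "is updated recurrently, giving O(T) prefill cost and O(1) decode time. ")
QUESTION = ("\n\nIn 2-3 sentences, explain what a mixture-of-experts transformer "
            "does and why it saves compute.")

def build_coherency_prompt(target_tokens, tokenizer):
    reps = max(1, target_tokens // len(tokenizer.encode(PAD)))
    return PAD * reps + QUESTION
```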

ctx ≈ 128 (prompt 32 tokens)

1.  **Analyze User Request:**
   - **Topic:** Mixture-of-Experts (MoE) transformer
   - **Requirements:** Explain what it does and why it saves compute
   - **Constraint:** 2-3 sentences

2.  **Identify Key Concepts:**
   - What is a MoE transformer? It replaces standard dense feed-forward layers
     with multiple "expert" neural networks.
   - How does it work? A routing mechanism selects a small subset of experts
     (usually 1 or 2) per token/input to process.
   - Why does it save compute? Only the selected experts are activated, so the
     total number of parameters is large, but the active parameters per token
     are small, reducing computation per forward pass.

ctx ≈ 1024 (prompt 895 tokens)

1.  **Deconstruct the user's request:**
    *   Context provided: Linear attention, specifically the "gated delta rule"
        variant. Key points: recurrent update, compact state matrix $S_t$,
        recurrence rule S_t = g_t * S_{t-1} + β_t * (v_t - k_t^T * S_{t-1}) * k_t^T,
        learned decay gate, write strength, selective forgetting/overwriting,
        O(T) prefill cost, O(1) decode time.
    *   Task: Explain what a "mixture-of-experts transformer" does and why it
        saves compute, based on the context.

2.  **Analyze the context vs. the task:**
    *   The provided text is about Linear Attention / Gated Delta Rules.
    *   The user asks about Mixture-of-Experts — a different architectural idea.
    *   Does the provided text mention MoE? No.

ctx ≈ 4096 (prompt 3870 tokens)

1.  **Deconstruct the User's Request:**
    *   Context: The user provided a large block of text about linear attention
        (repeated many times). Describes gated delta rule, recurrent updates,
        state matrices, O(T) prefill, O(1) decode, decay gates, write strength.
    *   Task: Explain MoE transformer based on the context.

2.  Analysis of the mismatch: The user asks about MoE but provides text about
    Linear Attention. The two are different architectural innovations —
    linear attention is about how tokens attend, MoE is about which subset of
    the FFN is active per token. They compose but don't substitute.

ctx ≈ 16384 (prompt 6488 tokens)

1.  **Analyze User Input:**
   - Context Provided: Repetitive paragraph about linear attention
     (recurrent updates, state matrices, O(T) prefill, O(1) decode,
     gated delta rule). Clearly padding, but the core concept is clear.
   - Question Asked: MoE transformer — what it does, why it saves compute.
   - Key Mismatch: Context is about linear attention, question about MoE.
     The model should address the question directly, not confuse the two.

2.  **Identify Core Concepts:**
   - What is a Mixture-of-Experts (MoE) transformer?
   - How does it save compute?

ctx ≈ 32768 (prompt 6488 tokens — bounded by prompt budget)

1.  **Analyze User Input:**
   - Context Provided: Repetitive paragraph about linear attention.
     Core concept: linear attention with gated delta rule.
   - Question: MoE transformer, compute savings.
   - Same context/question mismatch as shorter runs; model stays on task.

At all tested context lengths the model produces a coherent reasoning trace, correctly identifies the concept mismatch between padding text (linear attention) and the actual question (MoE), and stays on task.

Quantized by @thetom-ai