rdtand/MiniMax-M2.7-PrismaQuant-3.20bit-vllm
MiniMax-M2.7 — PrismaQuant 3.20 bpp (vLLM)
Mixed-precision quantization with joint expert pruning. Fits a 228 B parameter MoE in 90 GB on a single DGX Spark, served natively by vLLM with no patches.
Source: MiniMaxAI/MiniMax-M2.7
Quantizer: prismaquant · commit pinned in this artifact's mixed_native_manifest.json
TL;DR
| Metric | Value |
|---|---|
| Disk size | 90 GB (-58 % vs FP8 source 215 GB; -80 % vs BF16 ~456 GB) |
| Achieved bpp | 3.20 |
| Format mix | 30,780 NVFP4 + 2,204 FP8_SOURCE Linears |
| Experts kept | 10,912 of 15,872 (69 %) — 4,960 dropped via REAP saliency |
| Per-MoE-layer kept | uniform 176 of 256 (top-k=8) |
| Decode throughput on Spark | ~14 tok/s (single-stream, T0, 32k context) |
| vLLM patches required | 0 |
How it was produced
prismaquant solves a multiple-choice knapsack over per-Linear format choices, each priced by memory cost and predicted quality loss, to land on a target bit budget. Two distinct contributions go into this artifact:
1. Closed-form Δloss proxy
For each (Linear, format) pair, the cost is
$$\Delta\mathrm{loss} \approx \tfrac{1}{2} \cdot H_{\text{trace}} \cdot \mathrm{MSE}_W$$
- `H_trace` is the empirical Fisher diagonal trace, captured in one streaming forward+backward pass over the calibration set. It measures how curved the cross-entropy loss is at this Linear: high `H_trace` means a small weight perturbation moves the loss a lot.
- `MSE_W` is the measured per-format weight round-trip error (NVFP4, FP8, BF16). Not an analytical formula: we run RTN on the actual weights and compute the error directly.
Multiplying gives the second-order Taylor estimate of how much the model's loss will rise if you replace the BF16 weight with the format's quantized version. The allocator picks per-Linear formats that minimize total Δloss subject to a total-bit budget.
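Below is a minimal sketch of the proxy under stated assumptions: the helper names are illustrative (not prismaquant's actual API), and a crude per-tensor absmax RTN grid stands in for the real NVFP4/FP8 block-scaled codecs.

```python
import torch

def fisher_trace(linear: torch.nn.Linear, loss: torch.Tensor) -> float:
    """Empirical Fisher diagonal trace: sum of squared per-weight gradients
    of the calibration cross-entropy, from one backward pass."""
    (grad,) = torch.autograd.grad(loss, linear.weight, retain_graph=True)
    return grad.pow(2).sum().item()

def rtn_mse(weight: torch.Tensor, fmt: str) -> float:
    """Measured round-trip MSE of round-to-nearest quantization. A crude
    per-tensor absmax grid stands in for real NVFP4/FP8 block scaling."""
    levels = {"NVFP4": 6.0, "FP8": 448.0}[fmt]   # max representable magnitudes
    scale = weight.abs().amax().clamp(min=1e-12) / levels
    deq = (weight / scale).round().clamp(-levels, levels) * scale
    return (deq - weight).pow(2).mean().item()

def delta_loss(linear: torch.nn.Linear, loss: torch.Tensor, fmt: str) -> float:
    """Second-order Taylor estimate: 0.5 * H_trace * MSE_W."""
    return 0.5 * fisher_trace(linear, loss) * rtn_mse(linear.weight.detach(), fmt)
```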
2. Joint expert-prune + format choice
For MoE layers, prismaquant treats each MoE choice as a pair:
(quantization_format, dropped_expert_ids)
Both the format and the prune set are priced in the same knapsack via REAP-style saliency:
$$S_j = \frac{1}{T_{\text{cal}}} \sum_t g_j(t) \cdot \lVert f_j(t) \rVert_2^2$$
This is the dropout-loss estimate from the REAP family of MoE expert-importance scores: how much the layer's output norm drops in expectation when expert j is removed, weighted by the gradient signal flowing through that expert. Averaged over the calibration tokens, it yields a per-(router, expert) score in Δloss units, directly comparable to the quantization Δloss.
Per-layer prune candidates drop the floor(R · num_experts) lowest-S experts at each ratio R; the DP picks (R, format) jointly. After the Pareto sweep, prismaquant emits a uniform-kept prune manifest so vLLM's MoE kernel sees a single num_local_experts per layer (this artifact: 176 of 256 kept everywhere).
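A sketch of how a prune set and a format get priced together, with illustrative names (not prismaquant's actual API); the real allocator runs this enumeration inside the knapsack DP:

```python
import math
import torch

def reap_saliency(gates: torch.Tensor, expert_out: torch.Tensor) -> torch.Tensor:
    """gates: [T, E] per-token weight g_j(t) for expert j (the gate/gradient
    signal in the formula above); expert_out: [T, E, D] expert outputs.
    Returns S_j = mean_t g_j(t) * ||f_j(t)||^2, one score per expert."""
    return (gates * expert_out.pow(2).sum(dim=-1)).mean(dim=0)

def prune_candidates(saliency: torch.Tensor, ratios, fmt_dloss: dict):
    """Each (format, ratio) pair drops the floor(R * E) lowest-S experts and
    adds their summed saliency to that format's quantization dloss, so prune
    and precision choices compete in the same knapsack units."""
    order = saliency.argsort()            # ascending: cheapest experts to drop
    E = saliency.numel()
    cands = []
    for fmt, dl_quant in fmt_dloss.items():
        for r in ratios:
            k = math.floor(r * E)
            dl = dl_quant + saliency[order[:k]].sum().item()
            cands.append({"format": fmt, "kept": E - k, "dloss": dl})
    return cands
```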
3. Pareto sweep + kneedle pick
Before committing to a target bit budget, prismaquant computes the full Pareto curve. Below is the actual sweep that produced this artifact:
| target bpp | achieved | size on disk | predicted Δloss | NVFP4 super-Linears | FP8_SOURCE super-Linears | expert Linears dropped |
|---|---|---|---|---|---|---|
| 3.10 | 3.10 | 88.4 GB | 5,518 | 227 | 83 | 17,856 |
| 3.16 | 3.16 | 90.1 GB | 3,775 ← kneedle | 279 | 31 | 14,880 |
| 3.20 | 3.20 | 91.2 GB | 3,734 | 272 | 38 | 14,880 ← shipped |
| 3.25 | 3.25 | 92.6 GB | 3,733 | 271 | 39 | 14,880 |
| 3.30 | 3.30 | 94.1 GB | 3,733 | 236 | 74 | 14,880 |
| 3.40 | 3.40 | 96.9 GB | 3,732 | 199 | 111 | 14,880 |
| 3.50 | 3.50 | 99.7 GB | 2,496 | 268 | 42 | 11,904 |
| 3.60 | 3.60 | 102.6 GB | 2,495 | 217 | 93 | 11,904 |
The Δloss plateau between 3.16 and 3.40 (~3,732) shows the allocator is already squeezing most of the available signal in that band. The dramatic drop at 3.50 (-33 %) comes from relaxing the prune ratio (14,880 → 11,904 pruned expert Linears, i.e. 4,960 → 3,968 experts). The user-specified target was the 90-95 GB band; 3.20 was picked as the smallest practical size that captures essentially all the available quality in the band.
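For reference, a tiny chord-distance knee finder (a simple stand-in for the Kneedle algorithm, not prismaquant's implementation) reproduces the annotated pick on the sweep above:

```python
def knee(xs, ys):
    # Normalize both axes to [0, 1]; the decreasing Δloss curve becomes an
    # increasing concave one, and the knee sits furthest above the y = x chord.
    nx = [(x - xs[0]) / (xs[-1] - xs[0]) for x in xs]
    ny = [(y - ys[0]) / (ys[-1] - ys[0]) for y in ys]
    gaps = [y - x for x, y in zip(nx, ny)]
    return xs[max(range(len(xs)), key=gaps.__getitem__)]

bpp   = [3.10, 3.16, 3.20, 3.25, 3.30, 3.40, 3.50, 3.60]
dloss = [5518, 3775, 3734, 3733, 3733, 3732, 2496, 2495]
print(knee(bpp, dloss))  # -> 3.16, matching the kneedle row in the sweep
```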
Format mix on disk
- NVFP4: 30,780 Linears (93.3 %), experts + most attention/MLP projections
- FP8_SOURCE: 2,204 Linears (6.7 %), passthrough of natively-FP8 source weights
- BF16: 62 routers, output dim shrunk to kept-expert count
- PRUNED: 14,880 Linear slots (4,960 experts × 3 weights), dropped per REAP
Sample of per-layer assignments:
| Layer | Format mix |
|---|---|
| L00 (dense pre-MoE) | 532 FP8_SOURCE |
| L01 (dense pre-MoE) | 532 FP8_SOURCE |
| L02 (first MoE) | 532 NVFP4 |
| L30 (mid MoE) | 529 NVFP4 + 3 FP8_SOURCE |
| L61 (last layer) | 532 NVFP4 |
The allocator kept the early dense layers (which dominate semantic embedding pathways) at FP8 for safety, then dropped the bulk of MoE expert weights to NVFP4 once their 0.5 · H_trace · MSE_W cost showed it was safe. A few attention projections in mid-layers were pinned to FP8_SOURCE where the per-Linear sensitivity flagged NVFP4 as too aggressive.
Calibration
Calibration data: cal-mix-v1, a multi-domain mix balancing agentic, math, and coding sequences:
- Agentic: tool-call traces, multi-step reasoning chains, planning + execution dialogues
- Math: word problems, step-by-step solutions, symbolic manipulation
- Coding: Python / Rust / SQL / shell, both authoring and reading patterns
Volume: 32 chunks × 4 samples × 2048 seq-len ≈ 262 k tokens through the streaming probe. Each chunk runs a phase-1 forward pass (saliency capture) and a phase-3 reverse sweep (per-Linear Fisher). All chunks share the same multi-domain composition, so every calibration token contributes signal to all three downstream regimes.
prismaquant's per-domain-saliency feature exists (allocator can use union / intersection / mean across domains), but for this release the calibration was domain-merged. Per-domain runs are a follow-up.
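For illustration, those combination modes reduce to simple elementwise reductions over per-domain score tensors (a hypothetical helper, not prismaquant's API):

```python
import torch

def combine_saliency(per_domain: dict[str, torch.Tensor], mode: str) -> torch.Tensor:
    """per_domain maps domain name -> [num_experts] saliency tensor."""
    stacked = torch.stack(list(per_domain.values()))   # [domains, experts]
    if mode == "mean":
        return stacked.mean(dim=0)   # average importance across domains
    if mode == "union":
        return stacked.amax(dim=0)   # keep an expert if ANY domain needs it
    if mode == "intersection":
        return stacked.amin(dim=0)   # keep only if ALL domains need it
    raise ValueError(mode)
```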
Quality
Spot-checked at temperature 0 across agentic / math / coding:
| Test | Result |
|---|---|
| Multi-segment train problem (math) | Step-by-step reasoning, exact answer 240 mi / 68.571 mph |
| Python `is_palindrome` | Clean, correct |
| Python `quicksort` | Clean, correct |
| Python `binary_search` | Clean, correct |
| Python `longest_substring_without_repeat` | Sliding-window, correct |
| Python `merge_two_lists` (linked list) | Clean, correct |
| Python `fibonacci` | Iterative, with worked example |
| Rust `Point::distance` | Uses `.hypot()` (numerically stable) |
| SQL top-5 customers by 2024 volume | Clean, proper date-range filter |
| Tool calling | Clean function-call JSON emission |
| Reasoning content via `<think>` | Captured by `--reasoning-parser minimax_m2` |
Formal benchmarks (MMLU, GSM8K, HumanEval) are deferred. The artifact is positioned as "fits on Spark and serves coherently" across the three calibration domains; rigorous benchmark numbers will follow in a later release.
Serving
```bash
vllm serve <this-repo> \
  --quantization compressed-tensors \
  --trust-remote-code \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2
```
Recommended on UMA hardware: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to keep the CUDA allocator from hoarding freed blocks.
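A quick smoke test against the server above, assuming vLLM's default OpenAI-compatible endpoint on localhost:8000:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
model_id = client.models.list().data[0].id   # whatever name vLLM registered

resp = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Write a Python is_palindrome."}],
    temperature=0,
)
print(resp.choices[0].message.content)
# With --reasoning-parser set, the <think> content is split out of the
# message; inspect resp.choices[0].message for the parsed reasoning field.
```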
Limitations / caveats
- Calibration scale: 262 k tokens is moderate. Heavy reasoning-chain or long-context workloads may benefit from a re-export with more diverse calibration.
- Domain-merged saliency: per-domain prune policies (union/intersection) are supported by prismaquant but were not exercised here. A re-export with domain-tagged calibration is a candidate next iteration.
- No formal benchmarks yet: MMLU / GSM8K / HumanEval pending. Headline result is "fits + coherent across cal-mix".
- No MTP heads: MiniMax-M2 has no MTP head (unlike Qwen3.5/3.6). No speculative-decoding accelerator.
- Pruned experts are gone: 4,960 of 15,872 (31 %) dropped per REAP. Tasks heavily dependent on those specific experts could see degradation; empirical probes showed none on agentic/math/coding prompts.
Reproduction
This artifact was produced by:
```bash
# 1. Probe + cost (multi-chunk, adaptive sampling, deferred Fisher sync)
python -m prismaquant.multi_chunk_probe \
  --chunks-dir /work/chunks \
  --model <minimax-m2.7-snapshot> \
  --output /work/artifacts/probe.pkl \
  --activation-cache-dir /work/act \
  --work-dir /work/work \
  --layers-per-shard 4 --unified-sweep \
  --no-include-mtp --no-include-visual --no-include-lm-head \
  --prefetch-lookahead 4 --prefetch-workers 2 \
  --activation-rows-limit 256 \
  --calibration-modality text-only \
  --retain-cross-chunk-cache \
  --adaptive-sampling \
  --run-cost --cost-output /work/artifacts/cost.pkl \
  --cost-formats NVFP4,MXFP8_E4M3,FP8_SOURCE,BF16
```
```bash
# 2. Allocator (target_bits=3.20 picks the kneedle within the 90-95 GB band)
python -m prismaquant.allocator \
  --probe /work/artifacts/probe.pkl \
  --costs /work/artifacts/cost.pkl \
  --formats NVFP4,MXFP8_E4M3,FP8_SOURCE,BF16 \
  --target-bits 3.20 \
  --pareto-targets 3.10,3.16,3.20,3.25,3.30,3.40,3.50,3.60 \
  --enable-expert-prune \
  --prune-ratios 0.0,0.125,0.1875,0.25,0.3125,0.375 \
  --prune-alpha 0.15 \
  --layer-config /work/artifacts/layer_config_prune.json
```
```bash
# 3. Export (native compressed-tensors, GPTQ + scale-sweep activation-aware)
python -m prismaquant.export_native_compressed \
  --model <minimax-m2.7-snapshot> \
  --layer-config /work/artifacts/layer_config_prune.json \
  --prune-manifest /work/artifacts/layer_config_prune.json.prune.json \
  --output /work/exported \
  --activation-cache-dir /work/act \
  --device cuda
```
Full source + reproduction notes: https://github.com/RobTand/prismaquant
Acknowledgements
- MiniMaxAI — source model.
- vLLM — compressed-tensors serving stack with native NVFP4 + FP8 MoE kernels.
- REAP-style per-expert dropout-loss saliency.
- HAQ / HAWQ-V1/V2/V3 (Wang, Dong, Yao, et al.) — mixed-precision allocation foundations.
- GPTQ (Frantar et al. 2022), AutoRound — per-Linear quantizer building blocks.
License
Inherits the MiniMax-M2.7 license from the source model. See base model card for terms.