# dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF

GGUF quantizations of dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B, the first publicly available REAP-40% pruned variant of MiniMax-M2.7.


## Available quantizations

Sizes are approximate; the model card will be updated as each quant is uploaded to this repo.

| Variant | Approx. size | Target hardware | Notes |
|---|---|---|---|
| Q4_K_M | ~84 GB | 96 GB Apple Silicon (Mac Studio M4 Max) | Recommended sweet spot. Smoke-test verified 5/5. |
| IQ4_XS | ~74 GB | 96 GB Apple Silicon with extra headroom | Smaller than Q4_K_M, marginally lower quality. |
| Q3_K_M | ~66 GB | 64 GB Mac / 2× RTX 3090 | Budget option; expect some reasoning loss. |
| Q6_K | ~114 GB | 128 GB Mac Ultra | High quality. |
| Q8_0 | ~148 GB | 192+ GB systems | Near-lossless. |
| IQ4_NL-MoE | ~80 GB | 96 GB Mac / 2× RTX 3090 | MoE-aware: attn=Q8_0, experts=IQ4_NL, embed/output=Q6_K. Mirrors ubergarm's mainline-compatible recipe. |
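For reference, a recipe like IQ4_NL-MoE can in principle be reproduced with llama.cpp's `llama-quantize` and per-tensor overrides. This is a hedged sketch, not the exact command used for this repo: the `--tensor-type` override flag exists only in recent llama.cpp builds, and the tensor-name patterns below are assumptions; check `llama-quantize --help` and your GGUF's actual tensor names before running.

```bash
# Sketch of a MoE-aware quantization pass (flags and patterns assumed):
#   attention tensors  -> Q8_0
#   expert FFN tensors -> IQ4_NL
#   embeddings/output  -> Q6_K
./llama-quantize \
  --token-embedding-type q6_k \
  --output-tensor-type q6_k \
  --tensor-type "attn=q8_0" \
  --tensor-type "ffn_.*_exps=iq4_nl" \
  MiniMax-M2.7-REAP-139B-A10B-F16.gguf \
  MiniMax-M2.7-REAP-139B-A10B-IQ4_NL-MoE.gguf \
  iq4_nl
```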

## Which should you pick?

- **96 GB Apple Silicon (Mac Studio M4 Max):** Q4_K_M. Its ~84 GB leaves ~12 GB for KV cache at ~16K context; a download-and-launch sketch follows this list.
- **64 GB Mac:** Q3_K_M is the only variant that fits. Expect some reasoning-quality degradation.
- **128 GB Mac Ultra / 2× A6000:** Q6_K for near-baseline quality.
- **192+ GB system (dual H100 / RTX 6000 Ada):** Q8_0 for minimal quality loss.
- **Alternative to Q4_K_M on 96 GB:** IQ4_NL-MoE keeps attention at Q8_0 and quantizes only the expert FFN tensors. Similar size, often better code/reasoning.
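A minimal fetch-and-serve sketch for the recommended pick on 96 GB Apple Silicon. The exact GGUF filename is an assumption; check the repo's file list (for split GGUFs, pass the first shard to `-m`).

```bash
# Download only the Q4_K_M files (filename pattern assumed):
huggingface-cli download dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Full GPU offload at ~16K context, which fits the ~12 GB of headroom:
./llama-server \
  -m ./models/MiniMax-M2.7-REAP-139B-A10B-Q4_K_M.gguf \
  -c 16384 -ngl 99
```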

## Evaluation

**HumanEval pass@1 on Q4_K_M (completed-reasoning subset): 83.3% (90/108)**

For the 108 problems where the model completed its `<think>` reasoning within the 32K-token generation budget, the Q4_K_M quant solved 90 correctly.

**Strict pass@1 (all 164 problems, cap-outs counted as fails): 54.9%**

56 of the 164 problems exhausted the 32K-token reasoning budget mid-`<think>` and are counted as fails under strict academic scoring. Allocate ≥64K tokens to approach the 83% ceiling.

Methodology: 2× H100 80 GB, llama.cpp `/v1/chat/completions`, native `<think>` enabled, `temperature=0.2`, `top_p=0.95`, `max_tokens=32000`.
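For reproducibility, this is roughly the per-problem request shape implied by that methodology. The host/port and the prompt wording are assumptions; only the sampling parameters come from the methodology above.

```bash
# One HumanEval-style request against a running llama-server (endpoint assumed):
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user",
       "content": "Complete the following Python function:\n\ndef has_close_elements(numbers, threshold):\n    ..."}
    ],
    "temperature": 0.2,
    "top_p": 0.95,
    "max_tokens": 32000
  }'
```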

Prior methodology note: an earlier evaluation using raw `/v1/completions` with chat-prose stripping (non-canonical for reasoning models) reported 65.2%. The numbers above use the canonical chat-completion path.

Smoke test (5 diverse pre-publish prompts): 5 / 5 PASS — trivial arithmetic, Python Fibonacci, Norwegian response, MoE semantic explanation, JSON tool-call echo.

## Memory & context sizing for consumer hardware

### 96 GB Apple Silicon (primary target)

| Variant | File size | ctx 8K | ctx 32K | ctx 60K | ctx 131K |
|---|---|---|---|---|---|
| Q4_K_M | 84 GB | ✓ | ✓ w/ KV q8_0 | ✓ w/ KV q4_0 | requires KV q4_0 |
| IQ4_XS | 74 GB | ✓ | ✓ | ✓ w/ KV q8_0 | ✓ w/ KV q4_0 |
| Q3_K_M | 66 GB | ✓ | ✓ | ✓ | ✓ w/ KV q8_0 |
| IQ4_NL-MoE | 80 GB | ✓ | ✓ w/ KV q8_0 | ✓ w/ KV q4_0 | requires KV q4_0 |
| Q6_K / Q8_0 | 114 / 148 GB | too large for a 96 GB system | | | |

A bare ✓ means the variant fits with the default FP16 KV cache.

The native FP16 KV cache costs ~0.25 GB per 1K tokens for this architecture: 62 layers × 1024 KV dim × 2 bytes × 2 (K and V) ≈ 254 KB per token. That is non-trivial at long context: Q4_K_M at ctx=60K needs ~15 GB of KV cache alone.
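A quick back-of-envelope check of those figures, using only the numbers already stated above:

```bash
# KV bytes per token = layers * kv_dim * bytes_per_elem * 2 (K and V)
echo $(( 62 * 1024 * 2 * 2 ))                       # 253952 B ≈ 0.25 MB/token
# FP16 KV cache at ctx=60K, in GB:
echo $(( 62 * 1024 * 2 * 2 * 60000 / 1000000000 ))  # ≈ 15 GB
```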

### KV cache quantization: essential for long context on 96 GB

llama.cpp supports quantizing the KV cache with near-zero quality loss:

```bash
# -fa enables flash attention, which llama.cpp requires for a quantized V cache.
./llama-server -m MiniMax-M2.7-REAP-139B-A10B-Q4_K_M.gguf \
  -c 65536 -ngl 99 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0
```
| KV type | Size @ ctx=60K | Quality impact |
|---|---|---|
| FP16 (default) | 15 GB | baseline |
| q8_0 | 7.5 GB | essentially lossless (recommended) |
| q4_0 / q4_1 | 3.8 GB | very small degradation, worth it for extreme context |

### Other systems

- **64 GB Mac / 2× RTX 3090:** Q3_K_M with q8_0 KV fits at ctx=32K (see the dual-GPU sketch after this list).
- **128 GB Mac Ultra:** Q6_K comfortably at ctx=32K, tight at longer context.
- **Dual H100 (160 GB) / 192 GB+ systems:** Q8_0 near-lossless, full context.
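On a dual-GPU box, llama.cpp splits layers across visible GPUs by default; the sketch below pins an even split explicitly. The filename is an assumption; check the repo's file list.

```bash
# Hypothetical 2× RTX 3090 launch of Q3_K_M at 32K context with q8_0 KV:
./llama-server -m MiniMax-M2.7-REAP-139B-A10B-Q3_K_M.gguf \
  -c 32768 -ngl 99 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --tensor-split 1,1
```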

## Known minor imperfection

During the integrity audit, one layer (layer 0) had expert keep-indices that differ from the REAP-retained set in ~86 of 154 positions. The bias-value mismatch is bounded by the natural variance of the layer-0 biases (max |Δ| = 0.75 on values in [8.06, 8.88]), so router behavior is essentially unchanged, as confirmed by the 5/5 smoke test above. All 61 other layers are bit-perfect. Details are in the safetensors model card.

## Citation

See the safetensors repo for full citation details. Core references:

- Lasby et al., *REAP the Experts* (arXiv:2510.13999)
- MiniMax-M2.7 base model (MiniMaxAI)

## License

Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.


Published by m51Lab — open-source LLM contributions from the M51 AI OS group.
