# AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP
Deployment, operations & benchmarks → github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash

The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, the full configuration reference, measured benchmarks, and `AGENTS.md`, an operator's manual that pre-empts common stale-documentation traps.
Reference recipe credit: The modelopt + MTP graft pipeline used to build this variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config, the per-projection quantization choices, and the MTP-head graft technique on the un-abliterated base; we adapted the same recipe to AEON-Ultimate's abliterated weights. The reference benchmark numbers cited below are theirs. Full credit for the recipe goes to sakamakismile.
## Variants
| Format | Size | Use case |
|---|---|---|
| BF16 | 51 GB | Full-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning) |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark / GB10 – production validated with DFlash speculative decoding. Patched vllm-aeon-ultimate-dflash container. |
| Multimodal-NVFP4-MTP (this repo) | 27 GB | High-bandwidth dedicated GPUs (RTX 5090, RTX PRO 6000, B100/B200) with MTP speculative decoding via the model's native `mtp.*` head. modelopt format, `--quantization modelopt`. Vision tower preserved. |
| Text-NVFP4-MTP | 20 GB | Same as this repo but with vision tower stripped. Smaller footprint for text-only deployments on tighter VRAM. |
## What this is

This is the modelopt-format NVFP4 variant with MTP speculative decoding, multimodal-preserved, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16, the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).
Specifically:

- Body quantized to NVFP4 via `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`. This is the modelopt compressed-tensors format that vLLM serves through `--quantization modelopt` (a different code path from the `-NVFP4` sibling release, which uses `--quantization compressed-tensors`).
- Linear-attn / GatedDeltaNet layers preserved in BF16 (432 keys across 48 GDN layers). NVFP4 quantization on Mamba/SSM state collapses the recurrence; modelopt's `*linear_attn.conv1d*` ignore plus our explicit `*linear_attn*` exclude keeps these intact.
- Vision tower preserved in BF16 (333 keys). Multimodal inference fully functional.
- MTP head grafted from the base `Qwen/Qwen3.6-27B` checkpoint (15 tensors, BF16). The base contains MTP heads, but `Qwen3_5ForConditionalGeneration.from_pretrained` drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for `--speculative-config '{"method":"qwen3_5_mtp",...}'`.
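The graft step above reduces to a key-filter and state-dict merge. A minimal sketch (the `mtp.` prefix and 15-tensor count come from this card; the function name and plain-dict interface are illustrative, and the real pipeline operates on safetensors shards):

```python
def graft_mtp_head(quantized_state, base_state, prefix="mtp."):
    """Merge the base checkpoint's MTP tensors into the quantized state dict.

    from_pretrained drops the mtp.* tensors, so the modelopt export contains
    none of them; we copy them back verbatim (BF16, no re-quantization).
    """
    mtp_keys = sorted(k for k in base_state if k.startswith(prefix))
    if not mtp_keys:
        raise ValueError("base checkpoint has no MTP tensors to graft")
    grafted = dict(quantized_state)        # leave the input dict untouched
    for key in mtp_keys:
        grafted[key] = base_state[key]     # verbatim copy from the base
    return grafted, mtp_keys
```

In the real pipeline the dicts would hold tensors loaded with `safetensors.torch.load_file` and the merged result would be written back with `save_file`; here plain values stand in for tensors.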
## Why MTP, and where it actually wins
Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained `mtp.*` head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself: same architecture, same weights, same distribution.
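A back-of-envelope model shows why acceptance rate matters. This is a standard simplification, not something measured from this repo: it assumes each of the `k` drafted tokens is accepted independently with probability `p`, and counts the bonus token the target model emits on verification:

```python
def expected_tokens_per_step(p, k):
    """Expected tokens emitted per target forward pass with k draft tokens,
    each accepted with probability p (chain acceptance + 1 bonus token).

    Equals sum(p**i for i in range(k + 1)) = (1 - p**(k+1)) / (1 - p).
    """
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return (1 - p ** (k + 1)) / (1 - p)
```

At the measured ~67.7 % acceptance with `num_speculative_tokens=3`, this gives roughly 2.4 tokens per target pass; raising `k` only helps while the drafter stays on-distribution, which is why 3 is the canonical setting below.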
### Measured numbers on AEON-Ultimate (this exact variant)
| Hardware | Median tok/s | Peak tok/s | Spec-decode acceptance |
|---|---|---|---|
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | ~92 (this variant) / 111.4 (XS sibling) | 124.7 (XS sibling) | 67.7 % regular / 69.2 % XS |
| DGX Spark / GB10 (unified memory) – MTP method | 24.1 (XS sibling) | 27.5 | 66.3 % |
| DGX Spark / GB10 – DFlash method on this body | 38.5 tok/s thinking-on / 38.1 thinking-off | 71.3 tok/s thinking-on / 68.4 off | DFlash v2 |
| RTX 5090, B100 / B200 | not yet measured by us – community benchmarks welcome | – | – |
### Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)

- Single-stream short prompts at `n=3`: ~132 tok/s
- Single-stream long-form: ~105 tok/s
- 2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
- Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)
### The hardware-routing punchline

On RTX PRO 6000, the XS sibling outruns DFlash-class throughput (~111 tok/s vs the ~85 we'd expect from DFlash there). On DGX Spark, DFlash beats MTP by 26 % median / 52 % peak: the unified-memory bandwidth caps how much MTP's high acceptance rate can translate into throughput. So MTP is a dedicated-VRAM-Blackwell variant, not a universal upgrade. Full bench data: GitHub repo Performance section.
## When to pick this variant: measured hardware routing
The right speculative-decode method depends on memory architecture:
| Hardware tier | Recommended variant | Why |
|---|---|---|
| DGX Spark / GB10 (sm_121a, unified memory) | -NVFP4 (DFlash), not this MTP variant | Bench on Spark: DFlash beats MTP by +26 % median, +52 % peak. Spark's unified-memory bandwidth doesn't reward MTP's high acceptance rate. Don't run MTP on Spark. |
| RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM) | This variant (Multimodal-NVFP4-MTP) if you need vision; Text if text-only | MTP wins on dedicated VRAM. ~92 tok/s median measured with GDN BF16; dedicated-VRAM bandwidth lets the MTP head's high acceptance rate translate to throughput. |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | Multimodal-XS if you use vision; Text-XS if text-only | XS variants fit comfortably in 32 GB. 111.4 tok/s median measured on RTX PRO 6000; RTX 5090 should land near or above that. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper, so there is no benefit. |
| B100 / B200 (sm_100, dedicated FP4) | This variant (Multimodal) or Text variant | Native FP4 + dedicated VRAM = MTP territory. |
Full bench numbers: GitHub repo Performance section.
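The routing table above can be condensed into a small helper. This is a sketch: the returned strings are this card's variant names, but the function and its inputs are hypothetical conveniences, not part of any official tooling:

```python
def pick_variant(unified_memory, native_fp4, needs_vision, vram_gb):
    """Route a deployment to a variant per the measured hardware table."""
    if not native_fp4:                 # A100 / H100: FP4 dequantizes to BF16
        return "BF16"
    if unified_memory:                 # DGX Spark / GB10: DFlash beats MTP
        return "NVFP4 (DFlash)"
    if vram_gb < 40:                   # RTX 5090 class: XS fits in 32 GB
        return "Multimodal-XS" if needs_vision else "Text-XS"
    # RTX PRO 6000 / B100 / B200: dedicated VRAM and native FP4 = MTP territory
    return "Multimodal-NVFP4-MTP" if needs_vision else "Text-NVFP4-MTP"
```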
## Usage

### vLLM serve
```bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp \
  --quantization modelopt \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.94 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
```
`num_speculative_tokens=3` is the canonical setting for `qwen3_5_mtp`. Higher values diverge the drafter further from the target distribution, and acceptance falls.
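Shell quoting of the inline JSON is a common failure point when templating the serve command. One way to sidestep it is to generate the flag from a dict (a convenience sketch using only the standard library; the helper name is made up):

```python
import json
import shlex

def speculative_config_flag(method="qwen3_5_mtp", num_speculative_tokens=3):
    """Build a shell-safe --speculative-config argument for vllm serve."""
    cfg = {"method": method, "num_speculative_tokens": num_speculative_tokens}
    # json.dumps produces valid JSON; shlex.quote makes it safe to paste
    # into a shell command line.
    return "--speculative-config " + shlex.quote(json.dumps(cfg))
```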
### Configuration notes

- `--quantization modelopt` is required (not `compressed-tensors`; they are different formats).
- `--speculative-config '{"method":"qwen3_5_mtp", ...}'` activates the grafted MTP head as the spec-decode drafter. No external drafter download needed; the head is in this repo's safetensors.
- `--gpu-memory-utilization 0.94` is the validated cap on RTX PRO 6000; `0.95` causes the FlashInfer NVFP4 GEMM autotuner to OOM on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.
## Quantization recipe

- Tool: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- Loader: `Qwen3_5ForConditionalGeneration.from_pretrained` (multimodal-preserved class)
- Calibration: `neuralmagic/calibration` LLM split, 20 samples × 8192 tokens
- Excluded from quantization (kept BF16):
  - `lm_head`, `proj_out.*`, `*router*`, `*mlp.gate.*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn.conv1d*`, `*mixer.conv1d*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn*` (added – full GDN preservation)
  - `*visual*` (added – vision tower preservation)
  - `*mtp*` (added – MTP head preservation)
  - `*output_layer*`, `output.*`
- MTP graft: 15 tensors copied BF16 from `Qwen/Qwen3.6-27B` after modelopt export (`AutoModelForCausalLM.from_pretrained` drops them; the explicit graft restores them)
- Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
## Provenance & credits

- BF16 source: `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16`. See that card for the full abliteration pipeline.
- MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (`docs/MTP_GRAFT_RECIPE.md`)
- Reference benchmark recipes: `sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP`
- Quantization: NVIDIA TensorRT Model Optimizer (`nvidia-modelopt` 0.43.0)
- Base: Alibaba Qwen team, `Qwen/Qwen3.6-27B`
## License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own: you supply the opinions, the judgment, and the ethics.
## Support the work

If this release has been useful, tips are deeply appreciated. They go directly toward more compute, more models, and more open releases.

- Bitcoin (BTC): `bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4`
- Ethereum (ETH): `0x1512667F6D61454ad531d2E45C0a5d1fd82D0500`
- Solana (SOL): `DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t`
- Monero (XMR): `836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd`

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.



