AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS
Deployment, operations & benchmarks: github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash
The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, the full configuration reference, measured benchmarks, and `AGENTS.md`, an operator's manual that pre-empts common stale-documentation traps.
DGX Spark performance: current production (v3 image, 2026-04-29)
Served with DFlash spec decode (not the MTP head) on this XS body, the v3 image (`ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3`) clocks 38.5 tok/s median / 71.3 tok/s peak thinking-on and 38.1 / 68.4 thinking-off: a +18 % median / +26 % peak lift over the prior v2.1 image, and a +17 % / +21 % stacked lift vs the original NVFP4 (compressed-tensors) production. Median TTFT is 247 ms (was 325 ms, a 24 % reduction). See the GitHub Performance section for the four-config comparison table.
Reference recipe credit: The conv1d-preserved NVFP4 + MTP graft pipeline used to build this XS variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config, including the strategic decision to quantize the GDN projection matmuls to NVFP4 while preserving `linear_attn.conv1d` at BF16, and the MTP-head graft technique. We adapted the recipe to AEON-Ultimate's abliterated weights and ship both the conv1d-preserved-only XS variant (matching their footprint) and a heavier regular-MTP variant that additionally keeps the projections at BF16. Full credit for the underlying recipe: sakamakismile.
What "XS" means ā and what it's not
This is the extra-small footprint sibling of -Multimodal-NVFP4-MTP. XS is not "everything to FP4." It is a deliberate, principled split: the heavy GDN matmul projections drop to NVFP4 (where they're bandwidth-bound and FP4 wins big), while the SSM-critical linear_attn.conv1d kernel stays BF16 (where FP4 has documented stability problems on long-context recurrence).
|  | Multimodal-NVFP4-MTP (regular) | Multimodal-NVFP4-MTP-XS (this repo) |
|---|---|---|
| `linear_attn` projections (`in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj`) | preserved BF16 (~11 GB) | quantized to NVFP4 (~3 GB) |
| `linear_attn.conv1d` (SSM 1D convolution, recurrence-critical) | preserved BF16 | preserved BF16 ✓ |
| `linear_attn` SSM state vectors (`A_log`, `dt_bias`, `norm.weight`) | preserved BF16 | preserved BF16 ✓ |
| `mtp.*` head (grafted BF16 from base, bit-exact verified) | yes | yes |
| Vision tower | preserved BF16 | preserved BF16 |
| Total disk | ~27 GB | ~21 GB |
| VRAM footprint at runtime | ~28 GB | ~22 GB |
This is a smart, strategic quantization, not a precision compromise. The conv1d preservation matters: the GatedDeltaNet recurrence depends on the 1D convolution behaving numerically like its training distribution, and FP4 quantization of conv1d has been observed to cause drift on long-context inference in community testing. By keeping `conv1d` BF16 while quantizing the projections (which are bandwidth-limited matmuls where FP4 is a clean win), we get the ~6 GB footprint reduction without sacrificing the part of the model that's actually fragile under quantization. This is the same principle modelopt's `NVFP4_DEFAULT_CFG` applies by default, and the same recipe sakamakismile validated across their Qwen3.6-NVFP4-MTP series (22K+ downloads).
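If you want to confirm the split on disk, the per-tensor dtypes are visible in the exported safetensors shards. A minimal inspection sketch, assuming the checkpoint has been downloaded to the same local path used in the Usage section below (exact key names follow modelopt's export and may differ slightly between releases):

```python
# Inspect the exported shards: conv1d / SSM-state / mtp tensors should report BF16,
# while the GDN projection weights show up as packed FP4 data plus scale tensors.
from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("./aeon-ultimate-multimodal-nvfp4-mtp-xs")  # local download path (assumption)
patterns = ("conv1d", "A_log", "dt_bias", "mtp.")           # tensors expected to stay BF16

for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            if any(p in key for p in patterns):
                t = f.get_tensor(key)
                print(f"{key:80s} {str(t.dtype):14s} {tuple(t.shape)}")
```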
When to pick which:
- Pick the regular variant if you have ≥ 48 GB VRAM. Keeping the projection weights at BF16 as well gives a small additional safety margin on long-context recurrence stability.
- Pick this XS variant if you have 24–32 GB VRAM (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation keeps the SSM recurrence numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.

We ship both because we have the headroom on RTX PRO 6000 / B100/B200 to run the larger, more numerically conservative version, and several users on tighter cards have asked for the smaller one. Neither variant quantizes `linear_attn.conv1d`; that would be a different (and not recommended) variant we have explicitly chosen not to ship.
Variants
| Format | Size | Use case |
|---|---|---|
| BF16 | 51 GB | Full-precision reference weights |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark: DFlash spec decode, validated |
| Multimodal-NVFP4-MTP | 27 GB | RTX PRO 6000 / B100/B200: MTP, GDN preserved BF16 |
| Text-NVFP4-MTP | 26 GB | Same as above without vision tower |
| Multimodal-NVFP4-MTP-XS (this repo) | 21 GB | RTX 5090 / smaller dedicated VRAM: MTP, full FP4 incl. GDN projections |
| Text-NVFP4-MTP-XS | 20 GB | Same as this repo without vision tower |
What this is
The modelopt-format NVFP4 + MTP variant, multimodal-preserved, with `linear_attn` projections fully quantized, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16: the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).
Specifically:
- Body quantized to NVFP4 via `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`. modelopt format, served by vLLM through `--quantization modelopt`.
- Linear-attn / GatedDeltaNet projections quantized to NVFP4 (this is the XS difference). Only `linear_attn.conv1d` is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
- Vision tower preserved BF16 (333 keys). Multimodal inference fully functional.
- MTP head grafted from the base `Qwen/Qwen3.6-27B` checkpoint (15 tensors, BF16, bit-exact verified). Powers `--speculative-config '{"method":"qwen3_5_mtp",...}'` for self-speculative decoding without a separate drafter.
Why MTP
Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained `mtp.*` head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself: same architecture, same weights, same distribution.
Indicative published numbers (sakamakismile's reference recipe on RTX 5090):
- Single-stream short prompts at `n=3`: ~132 tok/s
- Single-stream long-form: ~105 tok/s
- 2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
- Mean acceptance length: ~3.0–4.0 (compared to DFlash chains of ~2.0–2.3)
Validated benchmarks of the AEON-Ultimate XS variant land in the GitHub repo once measured.
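For intuition on what those acceptance lengths buy, here is a back-of-envelope relation between mean acceptance length and decode speedup. This is a rough sketch only: the per-step drafting overhead is an assumed constant, not a measurement, and real throughput also depends on batching, KV-cache pressure, and kernel efficiency.

```python
# Rough speculative-decode speedup model: each verify step costs one target forward pass
# plus a small drafting overhead, and emits ~mean_accepted tokens instead of 1.
def mtp_speedup(mean_accepted: float, draft_overhead: float = 0.15) -> float:
    """Estimated decode speedup vs plain autoregressive decoding (assumed overhead)."""
    return mean_accepted / (1.0 + draft_overhead)

for acc in (2.0, 3.0, 4.0):
    print(f"mean acceptance {acc:.1f} -> ~{mtp_speedup(acc):.1f}x decode speedup (rough)")
```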
When to pick this variant: measured hardware routing
The right speculative-decode method depends on memory architecture:
| Hardware tier | Recommended variant | Why |
|---|---|---|
| DGX Spark / GB10 (sm_121a, unified memory) | Either: `-NVFP4` (DFlash) (simpler, validated) or this XS body served with `--speculative-config '{"method":"dflash",...}'` (highest measured throughput; see note below) | Spark prefers DFlash regardless of body. The XS body with DFlash spec lands at 37.6 tok/s median, 68.7 tok/s peak on Spark, the highest measured config. The grafted MTP head in this repo is unused in that path. Never use `--speculative-config '{"method":"qwen3_5_mtp",...}'` on Spark; that lands at only 24.1 tok/s median. |
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | Multimodal-NVFP4-MTP (GDN BF16 for best long-context fidelity), or this XS variant for ~10 % faster decode | XS measured 111.4 tok/s median vs the regular's 101.5 on RTX PRO 6000. Both win against DFlash on dedicated VRAM. |
| B100 / B200 (sm_100, dedicated FP4) | Multimodal-NVFP4-MTP (preferred; GDN BF16 fits) or this XS | Native FP4 + dedicated VRAM = MTP territory. Whichever fits cleanly. |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | This XS variant if you use vision; Text-XS if text-only | XS variants fit comfortably in 32 GB; matches sakamakismile's reference footprint. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper; no benefit. |
Full bench numbers: GitHub repo Performance section.
Usage
vLLM serve: dedicated-VRAM Blackwell (default: MTP via grafted head)
```bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
--local-dir ./aeon-ultimate-multimodal-nvfp4-mtp-xs
# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
--quantization modelopt \
--trust-remote-code \
--max-model-len 262144 \
--max-num-seqs 32 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.94 \
--enable-chunked-prefill \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
```
`num_speculative_tokens=3` is the canonical setting for `qwen3_5_mtp`. Higher values push the draft further from the target distribution and acceptance drops.
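Once the server above is up, a minimal client-side smoke test. This is a sketch with assumptions: the vLLM default port 8000, no API key configured (any placeholder string works), the served model name equal to the local path passed to `vllm serve`, and a stand-in image URL for the multimodal request.

```python
# Smoke-test the OpenAI-compatible endpoint: one text request (MTP spec decode is
# transparent to the client) and one multimodal request (vision tower is BF16-preserved).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "./aeon-ultimate-multimodal-nvfp4-mtp-xs"  # served model name (assumption)

text = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain NVFP4 quantization in two sentences."}],
    max_tokens=256,
)
print(text.choices[0].message.content)

vision = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(vision.choices[0].message.content)
```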
vLLM serve: DGX Spark (DFlash spec, not MTP; measured winning config)
For DGX Spark, swap the spec method to DFlash. The XS body still benefits from FP4 silicon, but DFlash's k=15 chains are decisively better than MTP's n=3 on unified memory.
```bash
# Pull the DFlash drafter alongside this body
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./qwen36-27b-dflash
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
--quantization modelopt \
--trust-remote-code \
--max-model-len 200000 \
--max-num-seqs 16 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--attention-backend flash_attn \
  --speculative-config '{"method":"dflash","model":"./qwen36-27b-dflash","num_speculative_tokens":15}'
```
Production-validated v3 image: `ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3`. Measured 38.1 tok/s median, 68.4 tok/s peak thinking-off and 38.5 / 71.3 thinking-on: the highest single-stream config we've measured on Spark.
Configuration notes
- `--quantization modelopt` is required for this body (not `compressed-tensors`; different format).
- `--speculative-config '{"method":"qwen3_5_mtp", ...}'` uses the grafted MTP head; correct for dedicated-VRAM Blackwell. Don't use this on DGX Spark.
- `--speculative-config '{"method":"dflash", ...}'` uses an external DFlash drafter; correct for DGX Spark. The grafted MTP head in this repo sits unused in this path (~0.85 GB dead weight). Don't use this on RTX PRO 6000 or B100/B200; they prefer MTP.
- `--gpu-memory-utilization 0.94` is the validated cap on RTX PRO 6000; `0.85` is the cap on DGX Spark (unified memory thrashes higher).
Quantization recipe
- Tool: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- Loader: `Qwen3_5ForConditionalGeneration.from_pretrained` (multimodal-preserved class)
- Calibration: `neuralmagic/calibration` LLM split, 20 samples × 8192 tokens
- Excluded from quantization (kept BF16), XS-variant differences from the regular variant in bold; a condensed sketch of this exclusion set follows the list:
  - `lm_head`, `proj_out.*`, `*router*`, `*mlp.gate.*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn.conv1d*`, `*mixer.conv1d*` (NVFP4_DEFAULT_CFG default; kept BF16 because FP4 quantization of the SSM 1D convolution causes drift on long-context recurrence, and this is the recurrence-critical kernel of the GatedDeltaNet block. Both regular and XS variants preserve this.)
  - **`*linear_attn*` is NOT broadly excluded** (XS difference: the projection matmuls `in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj` get NVFP4-quantized; saves ~8 GB; FP4 is a clean win on bandwidth-bound matmuls)
  - `*visual*` (vision tower preservation)
  - `*mtp*` (MTP head preservation)
  - `*output_layer*`, `output.*`
- MTP graft: 15 tensors copied BF16 from `Qwen/Qwen3.6-27B` after modelopt export
- Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
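For orientation, a condensed sketch of what that exclusion set looks like as a modelopt PTQ call. This is not the production script: the loader is simplified to `AutoModelForCausalLM` (the real pipeline uses the multimodal Qwen3_5 class), the calibration loop is a placeholder, the exact wildcard keys are assumptions, and the MTP graft happens after export.

```python
# Sketch of the XS quantization pass with nvidia-modelopt (simplified; see caveats above).
import copy

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM  # assumption: production uses the Qwen3_5 multimodal class

model = AutoModelForCausalLM.from_pretrained(
    "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
# NVFP4_DEFAULT_CFG already skips lm_head / routers / gates / conv1d-style modules;
# additionally keep the vision tower, MTP head and output layers at BF16.
# The GDN projection matmuls are deliberately NOT excluded; that is the XS difference.
for pattern in ("*linear_attn.conv1d*", "*mixer.conv1d*", "*visual*", "*mtp*",
                "*output_layer*", "output.*"):
    cfg["quant_cfg"][pattern] = {"enable": False}

def calib_forward(m):
    # Placeholder: stream ~20 samples x 8192 tokens from neuralmagic/calibration through m.
    ...

model = mtq.quantize(model, cfg, forward_loop=calib_forward)
export_hf_checkpoint(model, export_dir="./aeon-ultimate-multimodal-nvfp4-mtp-xs")
```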
Provenance & credits
- BF16 source: `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16`. See that card for the full abliteration pipeline.
- MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (`docs/MTP_GRAFT_RECIPE.md`)
- Reference benchmark recipes: `sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP`
- Quantization: NVIDIA TensorRT Model Optimizer (`nvidia-modelopt` 0.43.0)
- Base: Alibaba Qwen team, `Qwen/Qwen3.6-27B`
License + responsibility
Apache 2.0, inherited from `Qwen/Qwen3.6-27B`. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own; you supply the opinions, the judgment, and the ethics.
Support the work
If this release has been useful, tips are deeply appreciated; they go directly toward more compute, more models, and more open releases.
- Bitcoin (BTC): `bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4`
- Ethereum (ETH): `0x1512667F6D61454ad531d2E45C0a5d1fd82D0500`
- Solana (SOL): `DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t`
- Monero (XMR): `836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd`

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.



