AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS
Deployment, operations & benchmarks: github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash
The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, the full configuration reference, measured benchmarks, and `AGENTS.md`, an operator's manual that pre-empts common stale-documentation traps.
DGX Spark performance: current production (v3 image, 2026-04-29)
Served with DFlash spec decode (not the MTP head) on this XS body, the v3 image (`ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3`) clocks 38.5 tok/s median / 71.3 tok/s peak thinking-on and 38.1 / 68.4 thinking-off: a +18 % median / +26 % peak lift over the prior v2.1 image, and a +17 % / +21 % stacked lift vs the original NVFP4 (compressed-tensors) production. Median TTFT is 247 ms (was 325 ms, a 24 % reduction). See the GitHub Performance section for the four-config comparison table.
Reference recipe credit: The conv1d-preserved NVFP4 + MTP graft pipeline used to build this XS variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config, including the strategic decision to quantize the GDN projection matmuls to NVFP4 while preserving `linear_attn.conv1d` at BF16, and the MTP-head graft technique. We adapted the recipe to AEON-Ultimate's abliterated weights and ship both the conv1d-preserved-only XS variant (matching their footprint) and a heavier regular-MTP variant that additionally keeps the projections at BF16. Full credit for the underlying recipe: sakamakismile.
What "XS" means ā and what it's not
This is the extra-small footprint sibling of -Multimodal-NVFP4-MTP. XS is not "everything to FP4." It is a deliberate, principled split: the heavy GDN matmul projections drop to NVFP4 (where they're bandwidth-bound and FP4 wins big), while the SSM-critical linear_attn.conv1d kernel stays BF16 (where FP4 has documented stability problems on long-context recurrence).
|  | Multimodal-NVFP4-MTP (regular) | Multimodal-NVFP4-MTP-XS (this repo) |
|---|---|---|
| `linear_attn` projections (`in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj`) | preserved BF16 (~11 GB) | quantized to NVFP4 (~3 GB) |
| `linear_attn.conv1d` (SSM 1D convolution, recurrence-critical) | preserved BF16 | preserved BF16 ✓ |
| `linear_attn` SSM state vectors (`A_log`, `dt_bias`, `norm.weight`) | preserved BF16 | preserved BF16 ✓ |
| `mtp.*` head (grafted BF16 from base, bit-exact verified) | yes | yes |
| Vision tower | preserved BF16 | preserved BF16 |
| Total disk | ~27 GB | ~21 GB |
| VRAM footprint at runtime | ~28 GB | ~22 GB |
This is a smart, strategic quantization, not a precision compromise. The conv1d preservation matters: the GatedDeltaNet recurrence depends on the 1D convolution behaving numerically like its training distribution, and FP4 quantization of conv1d has been observed to cause drift on long-context inference in community testing. By keeping `conv1d` BF16 while quantizing the projections (which are bandwidth-limited matmuls where FP4 is a clean win), we get the ~6 GB footprint reduction without sacrificing the part of the model that's actually fragile under quantization. This is the same principle modelopt's `NVFP4_DEFAULT_CFG` applies by default, and the same recipe sakamakismile validated across their Qwen3.6-NVFP4-MTP series (22K+ downloads).
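If you want to confirm the split on disk, the per-tensor dtypes are visible in the exported safetensors shards. A minimal inspection sketch, assuming the checkpoint has been downloaded to the same local path used in the Usage section below (exact key names follow modelopt's export and may differ slightly between releases):

```python
# Inspect the exported shards: conv1d / SSM-state / mtp tensors should report BF16,
# while the GDN projection weights show up as packed FP4 data plus scale tensors.
from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("./aeon-ultimate-multimodal-nvfp4-mtp-xs")  # local download path (assumption)
patterns = ("conv1d", "A_log", "dt_bias", "mtp.")           # tensors expected to stay BF16

for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            if any(p in key for p in patterns):
                t = f.get_tensor(key)
                print(f"{key:80s} {str(t.dtype):14s} {tuple(t.shape)}")
```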
When to pick which:
- Pick the regular variant if you have ≥ 48 GB VRAM. Keeping the projection weights at BF16 as well gives a small additional safety margin on long-context recurrence stability.
- Pick this XS variant if you have 24–32 GB VRAM (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation keeps the SSM recurrence numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.

We ship both because we have the headroom on RTX PRO 6000 / B100/B200 to run the larger, more numerically conservative version, and several users on tighter cards have asked for the smaller one. Neither variant quantizes `linear_attn.conv1d`; that would be a different (and not recommended) variant we have explicitly chosen not to ship.
Variants
| Format | Size | Use case |
|---|---|---|
| BF16 | 51 GB | Full-precision reference weights |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark: DFlash spec decode, validated |
| Multimodal-NVFP4-MTP | 27 GB | RTX PRO 6000 / B100/B200: MTP, GDN preserved BF16 |
| Text-NVFP4-MTP | 26 GB | Same as above without vision tower |
| Multimodal-NVFP4-MTP-XS (this repo) | 21 GB | RTX 5090 / smaller dedicated VRAM: MTP, full FP4 incl. GDN projections |
| Text-NVFP4-MTP-XS | 20 GB | Same as this repo without vision tower |
What this is
The modelopt-format NVFP4 + MTP variant, multimodal-preserved, with `linear_attn` projections fully quantized, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16: the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).
Specifically:
- Body quantized to NVFP4 via `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`. modelopt format, served by vLLM through `--quantization modelopt`.
- Linear-attn / GatedDeltaNet projections quantized to NVFP4 (this is the XS difference). Only `linear_attn.conv1d` is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
- Vision tower preserved BF16 (333 keys). Multimodal inference fully functional.
- MTP head grafted from the base `Qwen/Qwen3.6-27B` checkpoint (15 tensors, BF16, bit-exact verified). Powers `--speculative-config '{"method":"qwen3_5_mtp",...}'` for self-speculative decoding without a separate drafter.
Why MTP
Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained `mtp.*` head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself: same architecture, same weights, same distribution.
Indicative published numbers (sakamakismile's reference recipe on RTX 5090):
- Single-stream short prompts at `n=3`: ~132 tok/s
- Single-stream long-form: ~105 tok/s
- 2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
- Mean acceptance length: ~3.0–4.0 (compared to DFlash chains of ~2.0–2.3)
Validated benchmarks of the AEON-Ultimate XS variant land in the GitHub repo once measured.
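For intuition on what those acceptance lengths buy, here is a back-of-envelope relation between mean acceptance length and decode speedup. This is a rough sketch only: the per-step drafting overhead is an assumed constant, not a measurement, and real throughput also depends on batching, KV-cache pressure, and kernel efficiency.

```python
# Rough speculative-decode speedup model: each verify step costs one target forward pass
# plus a small drafting overhead, and emits ~mean_accepted tokens instead of 1.
def mtp_speedup(mean_accepted: float, draft_overhead: float = 0.15) -> float:
    """Estimated decode speedup vs plain autoregressive decoding (assumed overhead)."""
    return mean_accepted / (1.0 + draft_overhead)

for acc in (2.0, 3.0, 4.0):
    print(f"mean acceptance {acc:.1f} -> ~{mtp_speedup(acc):.1f}x decode speedup (rough)")
```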
When to pick this variant: measured hardware routing
The right speculative-decode method depends on memory architecture:
| Hardware tier | Recommended variant | Why |
|---|---|---|
| DGX Spark / GB10 (sm_121a, unified memory) | Either: `-NVFP4` (DFlash) (simpler, validated) or this XS body served with `--speculative-config '{"method":"dflash",...}'` (highest measured throughput; see note below) | Spark prefers DFlash regardless of body. The XS body with DFlash spec lands at 37.6 tok/s median, 68.7 tok/s peak on Spark, the highest measured config. The grafted MTP head in this repo is unused in that path. Never use `--speculative-config '{"method":"qwen3_5_mtp",...}'` on Spark; that lands at only 24.1 tok/s median. |
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | Multimodal-NVFP4-MTP (GDN BF16 for best long-context fidelity), or this XS variant for ~10 % faster decode | XS measured 111.4 tok/s median vs the regular's 101.5 on RTX PRO 6000. Both win against DFlash on dedicated VRAM. |
| B100 / B200 (sm_100, dedicated FP4) | Multimodal-NVFP4-MTP (preferred; GDN BF16 fits) or this XS | Native FP4 + dedicated VRAM = MTP territory. Whichever fits cleanly. |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | This XS variant if you use vision; Text-XS if text-only | XS variants fit comfortably in 32 GB; matches sakamakismile's reference footprint. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper; no benefit. |
Full bench numbers: GitHub repo Performance section.
Usage
vLLM serve: dedicated-VRAM Blackwell (default: MTP via grafted head)
```bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
--local-dir ./aeon-ultimate-multimodal-nvfp4-mtp-xs
# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
--quantization modelopt \
--trust-remote-code \
--max-model-len 262144 \
--max-num-seqs 32 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.94 \
--enable-chunked-prefill \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
```
`num_speculative_tokens=3` is the canonical setting for `qwen3_5_mtp`. Higher values push the draft further from the target distribution and acceptance drops.
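Once the server above is up, a minimal client-side smoke test. This is a sketch with assumptions: the vLLM default port 8000, no API key configured (any placeholder string works), the served model name equal to the local path passed to `vllm serve`, and a stand-in image URL for the multimodal request.

```python
# Smoke-test the OpenAI-compatible endpoint: one text request (MTP spec decode is
# transparent to the client) and one multimodal request (vision tower is BF16-preserved).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "./aeon-ultimate-multimodal-nvfp4-mtp-xs"  # served model name (assumption)

text = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain NVFP4 quantization in two sentences."}],
    max_tokens=256,
)
print(text.choices[0].message.content)

vision = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(vision.choices[0].message.content)
```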
vLLM serve: DGX Spark (DFlash spec, not MTP; measured winning config)
For DGX Spark, swap the spec method to DFlash. The XS body still benefits from FP4 silicon, but DFlash's k=15 chains are decisively better than MTP's n=3 on unified memory.
```bash
# Pull the DFlash drafter alongside this body
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./qwen36-27b-dflash
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
--quantization modelopt \
--trust-remote-code \
--max-model-len 200000 \
--max-num-seqs 16 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--attention-backend flash_attn \
  --speculative-config '{"method":"dflash","model":"./qwen36-27b-dflash","num_speculative_tokens":15}'
```
Production-validated v3 image: `ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3`. Measured 38.1 tok/s median, 68.4 tok/s peak thinking-off and 38.5 / 71.3 thinking-on: the highest single-stream config we've measured on Spark.
Configuration notes
- `--quantization modelopt` is required for this body (not `compressed-tensors`; different format).
- `--speculative-config '{"method":"qwen3_5_mtp", ...}'` uses the grafted MTP head; correct for dedicated-VRAM Blackwell. Don't use this on DGX Spark.
- `--speculative-config '{"method":"dflash", ...}'` uses an external DFlash drafter; correct for DGX Spark. The grafted MTP head in this repo sits unused in this path (~0.85 GB dead weight). Don't use this on RTX PRO 6000 or B100/B200; they prefer MTP.
- `--gpu-memory-utilization 0.94` is the validated cap on RTX PRO 6000; `0.85` is the cap on DGX Spark (unified memory thrashes higher).
Quantization recipe
- Tool: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- Loader: `Qwen3_5ForConditionalGeneration.from_pretrained` (multimodal-preserved class)
- Calibration: `neuralmagic/calibration` LLM split, 20 samples × 8192 tokens
- Excluded from quantization (kept BF16), XS-variant differences from the regular variant in bold; a condensed sketch of this exclusion set follows the list:
  - `lm_head`, `proj_out.*`, `*router*`, `*mlp.gate.*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn.conv1d*`, `*mixer.conv1d*` (NVFP4_DEFAULT_CFG default; kept BF16 because FP4 quantization of the SSM 1D convolution causes drift on long-context recurrence, and this is the recurrence-critical kernel of the GatedDeltaNet block. Both regular and XS variants preserve this.)
  - **`*linear_attn*` is NOT broadly excluded** (XS difference: the projection matmuls `in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj` get NVFP4-quantized; saves ~8 GB; FP4 is a clean win on bandwidth-bound matmuls)
  - `*visual*` (vision tower preservation)
  - `*mtp*` (MTP head preservation)
  - `*output_layer*`, `output.*`
- MTP graft: 15 tensors copied BF16 from `Qwen/Qwen3.6-27B` after modelopt export
- Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
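For orientation, a condensed sketch of what that exclusion set looks like as a modelopt PTQ call. This is not the production script: the loader is simplified to `AutoModelForCausalLM` (the real pipeline uses the multimodal Qwen3_5 class), the calibration loop is a placeholder, the exact wildcard keys are assumptions, and the MTP graft happens after export.

```python
# Sketch of the XS quantization pass with nvidia-modelopt (simplified; see caveats above).
import copy

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM  # assumption: production uses the Qwen3_5 multimodal class

model = AutoModelForCausalLM.from_pretrained(
    "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
# NVFP4_DEFAULT_CFG already skips lm_head / routers / gates / conv1d-style modules;
# additionally keep the vision tower, MTP head and output layers at BF16.
# The GDN projection matmuls are deliberately NOT excluded; that is the XS difference.
for pattern in ("*linear_attn.conv1d*", "*mixer.conv1d*", "*visual*", "*mtp*",
                "*output_layer*", "output.*"):
    cfg["quant_cfg"][pattern] = {"enable": False}

def calib_forward(m):
    # Placeholder: stream ~20 samples x 8192 tokens from neuralmagic/calibration through m.
    ...

model = mtq.quantize(model, cfg, forward_loop=calib_forward)
export_hf_checkpoint(model, export_dir="./aeon-ultimate-multimodal-nvfp4-mtp-xs")
```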
Provenance & credits
- BF16 source: `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16`. See that card for the full abliteration pipeline.
- MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (`docs/MTP_GRAFT_RECIPE.md`)
- Reference benchmark recipes: `sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP`
- Quantization: NVIDIA TensorRT Model Optimizer (`nvidia-modelopt` 0.43.0)
- Base: Alibaba Qwen team, `Qwen/Qwen3.6-27B`
License + responsibility
Apache 2.0, inherited from `Qwen/Qwen3.6-27B`. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own; you supply the opinions, the judgment, and the ethics.
Support the work
If this release has been useful, tips are deeply appreciated; they go directly toward more compute, more models, and more open releases.
- Bitcoin (BTC): `bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4`
- Ethereum (ETH): `0x1512667F6D61454ad531d2E45C0a5d1fd82D0500`
- Solana (SOL): `DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t`
- Monero (XMR): `836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd`

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.



