AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4
Deployment, operations & benchmarks – github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash
The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, full configuration reference, measured throughput benchmarks (32 tok/s median / 56 tok/s peak / 350 ms TTFT on DGX Spark), and AGENTS.md – an operator's manual that pre-empts common stale-documentation traps for AI coding agents working on this stack.

Production container: ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 (2026-04-28) – sm_121a-tuned vLLM v0.20.0 release commit, patched CUTLASS NVFP4, DFlash speculative decoding, FlashInfer 0.6.9 stable, and async scheduling enabled by default. v2.1 remains pullable for rollback.

v3 DGX Spark headline (current production, measured 2026-04-29)
| Metric | Thinking OFF | Thinking ON (default) |
|---|---|---|
| Median tok/s | 38.1 | 38.5 |
| Peak tok/s | 68.4 | 71.3 |
| Median TTFT | 247 ms | 249 ms |

Cumulative gain vs original production (v2.1 + regular -NVFP4 + old DFlash): +17 % median / +21 % peak thinking-off, +18 % / +26 % thinking-on, −24 % TTFT (325 → 247 ms). Three stacked wins: v3 image (vLLM v0.20.0 release + FlashInfer 0.6.9 stable), XS body via modelopt, and DFlash drafter v2 (z-lab 2026-04-27 push). Full data and four-config comparison: GitHub Performance section.
Variants
| Format | HuggingFace repo | Disk | Quant tool | Spec decode | Hardware target | When to pick this |
|---|---|---|---|---|---|---|
| NVFP4 (this repo) | …-NVFP4 | 26 GB | llm-compressor | DFlash k=15 | DGX Spark (GB10 / sm_121a) | Production-validated for DGX Spark with the patched vllm-aeon-ultimate-dflash container. |
| Multimodal-NVFP4-MTP | …-Multimodal-NVFP4-MTP | 27 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX PRO 6000 Blackwell · B100/B200 | MTP via the model's native mtp.* head (grafted bf16 from base). modelopt format, --quantization modelopt. Vision tower preserved. GDN linear attention preserved BF16 for best long-context fidelity. |
| Text-NVFP4-MTP | …-Text-NVFP4-MTP | 26 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX PRO 6000 · text-only | Same recipe as the Multimodal MTP sibling but with the vision tower stripped. GDN preserved BF16. |
| Multimodal-NVFP4-MTP-XS | …-Multimodal-NVFP4-MTP-XS | 21 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX 5090 · tighter dedicated VRAM | Strategic split: GDN projection matmuls (in_proj_qkv/z/a/b, out_proj) → NVFP4; linear_attn.conv1d kept BF16 to preserve the recurrence-critical SSM convolution. Vision tower preserved. |
| Text-NVFP4-MTP-XS | …-Text-NVFP4-MTP-XS | 20 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX 5090 text-only · 24 GB cards | Same conv1d-preserved strategic split as Multimodal-XS, vision tower stripped. The smallest variant we ship. |
| BF16 | …-BF16 | 51 GB | – | – | A100 / H100 80 GB · multi-GPU | Full-precision reference weights. Ampere / Hopper / pre-Blackwell hardware, fine-tuning, or quant-recipe development. |
Hardware routing – measured, not theoretical
Pick by memory architecture, not just GPU model:
| Hardware class | Use this | Why |
|---|---|---|
| DGX Spark / GB10 (unified memory, sm_121a) | this -NVFP4 (DFlash) repo, or the modelopt -Multimodal-NVFP4-MTP-XS body served with DFlash for +15–21 % more throughput (see the note below) | Bench on Spark: DFlash beats the MTP method by +56 % median (37.6 vs 24.1 tok/s) and +150 % peak (68.7 vs 27.5) on the same XS body. Don't run the MTP method on Spark. |
| RTX PRO 6000 / RTX 5090 / B100 / B200 (dedicated VRAM, sm_120/sm_100) | -NVFP4-MTP or -NVFP4-MTP-XS | MTP wins on dedicated VRAM. RTX PRO 6000 measured: XS hits 111.4 tok/s median with 69 % MTP acceptance – beats no-spec by ~10 %. |
| A100 / H100 (no native FP4) | -BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper – no benefit. |

Full bench numbers: GitHub repo Performance section.
Regular MTP vs XS – strategic quantization, not a precision compromise
The GatedDeltaNet linear_attn.* block has two distinct components: the heavy projection matmuls (in_proj_qkv, in_proj_z, in_proj_a/b, out_proj – ~11 GB total) and the SSM 1D convolution kernel (linear_attn.conv1d – small, but recurrence-critical).
- Regular MTP variants keep both at BF16. Maximum numerical safety margin, larger footprint.
- XS variants quantize the projection matmuls to NVFP4 (saves ~6 GB; FP4 is a clean win on bandwidth-bound matmuls) but explicitly preserve linear_attn.conv1d at BF16. FP4 quantization of conv1d has been observed to cause drift on long-context recurrence in community testing, so we keep it at BF16 – the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads). This is not "everything to FP4" – that would be a different (and not-recommended) variant we have explicitly chosen not to ship. A minimal sketch of what a conv1d-preserving split looks like follows below.
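To make the split concrete, here is an illustrative modelopt-style sketch of a conv1d-preserving NVFP4 config. It is not the shipped XS recipe: the wildcard patterns and the calibration loop are assumptions for illustration; only NVFP4_DEFAULT_CFG and the quantize()/forward_loop API are modelopt's published surface.

import modelopt.torch.quantization as mtq

# Start from modelopt's published NVFP4 preset (NVFP4 weights, block-16 FP8 scales),
# then disable quantization for the recurrence-critical conv1d and the vision tower.
# The wildcard strings below are illustrative, not the exact patterns in the shipped recipe.
cfg = dict(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"] = dict(cfg["quant_cfg"])
cfg["quant_cfg"]["*linear_attn.conv1d*"] = {"enable": False}   # keep the SSM convolution BF16
cfg["quant_cfg"]["*visual*"] = {"enable": False}               # keep the vision tower BF16

# model = <load the BF16 source>
# mtq.quantize(model, cfg, forward_loop=calibration_loop)      # calibration_loop: your own calibration pass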
New on Spark (2026-04-28): XS body + DFlash spec is the new winning config

If you want maximum DGX Spark throughput today, the highest-measured configuration is:

- Model body: -Multimodal-NVFP4-MTP-XS (modelopt format)
- Spec method: DFlash k=15 via z-lab/Qwen3.6-27B-DFlash v2 (2026-04-27 push) – not the MTP head that ships with the XS variant
- Same Spark settings (--max-num-seqs 16, --gpu-memory-utilization 0.85, --max-model-len 200000)
- vLLM args: --quantization modelopt --speculative-config '{"method":"dflash","model":"/path/to/dflash-drafter","num_speculative_tokens":15}'

Measured on the v3 image (2026-04-28): 38.1 / 68.4 tok/s thinking-off and 38.5 / 71.3 tok/s thinking-on, vs this -NVFP4 repo's prior 32.5 / 56.7 thinking-off baseline. The XS body's NVFP4-quantized GDN projections share the same dispatch path as the rest of the body (one fewer BF16→FP4 cast per layer per token), and the new DFlash drafter v2 has measurably better acceptance (mean accepted length 2.60 per round vs 2.0–2.3 prior). This -NVFP4 (compressed-tensors) repo + DFlash remains the simpler, validated path; the XS+DFlash combo is the higher-throughput path once you've been through one boot to populate the autotuner cache. Full numbers in the GitHub Performance section.
The production deployment format for Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 on Blackwell-class hardware. Same model, same 0/100 refusal rate, same preserved-and-enhanced capabilities of the BF16 source – compressed from 51 GB BF16 to 26 GB NVFP4 for native FP4 tensor-core throughput on DGX Spark (GB10 / sm_121a), B100 / B200, and RTX PRO 6000 Blackwell.
The BF16 source is itself the product of 72 hours of continuous research drawing on hundreds of parallel AI research agents, the industry's best published methodologies, custom in-house techniques, and yet-unreleased pre-public branches of the next-generation abliteration toolchain. See the BF16 model card for the full pipeline narrative and capability data.
Why NVFP4 – and Why It's Effectively Lossless
NVFP4 is not a "compressed lite" version. It is the format NVIDIA designed as the production deployment path for Blackwell-and-later silicon: accuracy on par with BF16, the throughput of true 4-bit compute, no compromise required.
The accuracy guarantee comes from a two-level scaling structure that older 4-bit formats (INT4, Q4_0/Q4_K, NF4) do not have:
- E2M1 element format – 4-bit floating point per weight (sign / 2-bit exponent / 1-bit mantissa).
- Block size 16 with FP8 E4M3 per-block scales – every 16 weights share an 8-bit floating-point scale, which dramatically out-resolves the INT8 scales used by older schemes when the local weight distribution is heavy-tailed.
- FP32 per-tensor scale – global re-scale applied at the kernel boundary so block-level FP8 scales never have to span the full tensor's dynamic range.
The combined effect is that local outliers – the long-tailed weights that destroy older 4-bit formats – are absorbed by the per-block FP8 scale rather than smearing the whole quantization grid. Typical KL divergence vs the BF16 source for recipe-class NVFP4 quantization is ≤ 0.001, which is below the noise floor of stochastic sampling. A user cannot observe the difference between this model and its BF16 source; the difference is smaller than the variance from changing your random seed.
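A toy numeric sketch of the two-level rescale described above (illustrative only; the real kernels pack, round, and store FP8/FP32 scales differently, and the helper names here are made up):

import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # magnitudes representable in E2M1

def quantize_block(block, tensor_scale):
    """One 16-weight block: pick an FP8-style block scale, snap elements to E2M1."""
    block_scale = np.abs(block).max() / (6.0 * tensor_scale)      # stored as FP8 E4M3 in the real format
    scaled = block / (block_scale * tensor_scale)                 # now inside the E2M1 range [-6, 6]
    idx = np.abs(E2M1_GRID[None, :] - np.abs(scaled)[:, None]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], block_scale

def dequantize_block(q, block_scale, tensor_scale):
    return q * block_scale * tensor_scale                         # the two-level rescale

# A block with one local outlier: the outlier inflates only this block's scale,
# so every other 16-weight block in the tensor is completely unaffected.
block = np.array([0.01, -0.02, 0.015, 0.03, -0.01, 0.02, 0.005, -0.015,
                  0.025, -0.03, 0.01, 0.02, -0.005, 0.015, 0.9, -0.02])
q, bs = quantize_block(block, tensor_scale=1.0)
print(np.abs(dequantize_block(q, bs, 1.0) - block).max())         # worst-case error here stays ~0.03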
On native FP4 silicon – Blackwell tcgen05 / UTCQMMA paths, sm_121a CUTLASS on GB10 – this format runs at full FP4 tensor-core throughput. The GPU does not dequantize back to BF16 internally. You get the speed of true 4-bit compute and the accuracy of 16-bit weights at the same time. On older silicon (A100, H100) NVFP4 dequantizes at kernel boundaries – it works correctly, but with no throughput advantage; for those cards use the BF16 release directly.
This release is multimodal-preserved (vision tower stays BF16 – text + image inference fully functional) and hybrid-attention-preserved (the 48 linear-attention / GatedDeltaNet layers stay BF16; FP4 applies only to the 16 full-attention layers' attention projections (q/k/v/o) and all MLPs, where it is well-behaved). Mamba state and SSM dynamics are mathematically incompatible with FP4 and remain in BF16 by design, not by compromise.
What Changed vs BF16
| Aspect | BF16 (source) | NVFP4 (this release) |
|---|---|---|
| Disk size | 51 GB | 26 GB (49% reduction) |
| Refusal rate | 0/100 | 0/100 inherited (KL ≤ 0.001 from source – below sampling noise) |
| Multimodal | preserved | preserved (vision BF16, no degradation) |
| Hybrid SSM | repaired + intact | intact (linear_attn BF16-preserved) |
| Hardware target | A100 / H100 / RTX PRO 6000 BF16 | DGX Spark (GB10), B100/B200, RTX PRO 6000 Blackwell with native FP4 throughput |
| KL vs BF16 source | n/a | expected ≤ 0.001 (typical for this recipe class) |
The NVFP4 quantization scheme is NVIDIA-mandated: E2M1 element format, block_size=16, FP8 E4M3 per-block scales, FP32 per-tensor scale, symmetric signed.
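As a hedged sketch (not the evaluation harness used for this card), a spot-check of next-token KL divergence against the BF16 source could look like this; it assumes you have memory for both checkpoints, and the result should be averaged over many prompts and positions before comparing against the ≤ 0.001 figure:

import torch
import torch.nn.functional as F
from transformers import AutoModelForImageTextToText, AutoTokenizer

bf16_id  = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16"
nvfp4_id = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4"

tok = AutoTokenizer.from_pretrained(bf16_id, trust_remote_code=True)
ref = AutoModelForImageTextToText.from_pretrained(bf16_id,  dtype=torch.bfloat16, device_map="cuda:0", trust_remote_code=True)
qnt = AutoModelForImageTextToText.from_pretrained(nvfp4_id, dtype=torch.bfloat16, device_map="cuda:1", trust_remote_code=True)

inputs = tok("The two-level scaling structure in NVFP4 means that", return_tensors="pt")
with torch.no_grad():
    log_p = F.log_softmax(ref(**inputs.to(ref.device)).logits[0, -1].float(), dim=-1)
    log_q = F.log_softmax(qnt(**inputs.to(qnt.device)).logits[0, -1].float(), dim=-1).to(log_p.device)

# KL(P_bf16 || Q_nvfp4) for a single next-token distribution; average over a prompt set in practice.
print(F.kl_div(log_q, log_p, log_target=True, reduction="sum").item())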
Quantization Recipe
Tool: llm-compressor 0.10.1.dev107 (vllm-project) using QuantizationModifier(scheme="NVFP4") post-training quantization.
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",                  # always
        "re:.*embed_tokens.*",      # always
        "re:.*\\.visual\\..*",      # vision tower BF16 – preserves multimodal
        "re:.*visual\\..*",
        "re:.*linear_attn\\..*",    # SSM/GDN BF16 – Mamba state collapses under FP4
        "re:.*norm.*",
        "re:.*q_norm.*",
        "re:.*k_norm.*",
    ],
)
Calibration: open-platypus, 512 samples × 4096 tokens.
Pipeline: sequential with sequential_targets=["Qwen3_5DecoderLayer"] – required for hybrid stacks (mixed full + linear attention layers); without explicit targeting, llm-compressor's auto-discovery silently skips layers.
Loader: AutoModelForImageTextToText to preserve the Qwen3_5ForConditionalGeneration multimodal class.
Processor: passed explicitly to oneshot() to avoid the "model processor required when a dataset is provided" failure on multimodal builds without torchvision.
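For context, an end-to-end call wiring these pieces together could look roughly like the sketch below. It follows llm-compressor's published oneshot() examples; exact argument names (processor, sequential_targets, the dataset alias) vary between versions, so treat this as a sketch rather than the exact script used for this release.

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot

src_id = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16"
model = AutoModelForImageTextToText.from_pretrained(src_id, dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(src_id, trust_remote_code=True)

oneshot(
    model=model,
    processor=processor,                          # passed explicitly – see note above
    recipe=recipe,                                # the QuantizationModifier defined above
    dataset="open_platypus",                      # calibration set used for this release
    num_calibration_samples=512,
    max_seq_length=4096,
    sequential_targets=["Qwen3_5DecoderLayer"],   # explicit targeting for the hybrid stack
    output_dir="Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4",
)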
Verification (pass):
- 1 shard, 1952 keys
- 64 quantized full-attention projections (16 layers × 4 q/k/v/o)
- 432 linear_attn.* keys preserved BF16 (48 layers × 9 modules)
- 333 visual.* keys preserved BF16 (vision tower intact)
- 319 norm keys preserved BF16
- lm_head and embed_tokens preserved BF16
- NVFP4-packed weights present
- input_global_scale magnitudes 142–346 (healthy range)
Wall-clock quant time: ~57 minutes on 1× RTX PRO 6000 Blackwell (96 GB).
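If you want to reproduce a rough version of the key audit above on the downloaded checkpoint, a small script along these lines works; the substring filters are illustrative, and the exact packed-weight key names depend on the compressed-tensors version:

import glob
import os
from collections import Counter
from huggingface_hub import snapshot_download
from safetensors import safe_open

path = snapshot_download(
    "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4",
    allow_patterns=["*.safetensors"],
)

keys = []
for shard in glob.glob(os.path.join(path, "*.safetensors")):   # single shard for this release
    with safe_open(shard, framework="pt") as f:
        keys.extend(f.keys())

counts = Counter()
for k in keys:
    if "linear_attn." in k:
        counts["linear_attn"] += 1
    if "visual." in k:
        counts["visual"] += 1
    if "norm" in k:
        counts["norm"] += 1

print(len(keys), dict(counts))   # compare against the verification counts above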
Deployment
vLLM on DGX Spark (GB10 / sm_121a) – recommended
Use the production-validated patched image ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 (:latest also points here as of 2026-04-30). It bundles the SM121 CUTLASS NVFP4 patches, FlashInfer 0.6.9 stable, TurboQuant, and the DFlash drafter integration. The patched CUTLASS path uses native FP4 tensor-core kernels and outperforms the Marlin fallback – do NOT force VLLM_NVFP4_GEMM_BACKEND=marlin (that's the workaround for stock vLLM builds where CUTLASS is broken on SM121).
For a fully-flagged production setup including DFlash speculative decoding (k=15), use the docker-compose recipe in the deployment repo. For a minimal manual docker run without DFlash:
docker run --gpus all --ipc=host --network=host \
-e TORCH_CUDA_ARCH_LIST="12.0+PTX" \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
-v /path/to/model:/models/aeon-ultimate \
ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 \
vllm serve /models/aeon-ultimate \
--served-model-name aeon-ultimate \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 1 \
--dtype auto \
--quantization compressed-tensors \
--max-model-len 262144 \
--max-num-seqs 64 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--no-enable-prefix-caching \
--load-format safetensors \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend flash_attn \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm
Key settings (tuned for DGX Spark 128 GB unified memory):
- --max-num-seqs 64 – Conservative for 262K context. Raise to 128 only for short-context workloads. The DGX Spark's 128 GB is unified between CPU and GPU; KV cache for 128 concurrent long-context sequences will exhaust it.
- --max-num-batched-tokens 32768 – Safe prefill budget on DGX Spark. This matches vLLM's inductor compile-range ceiling for this image (compile_ranges_endpoints: [32768]); above 32k, prefill falls back to eager mode. The stock vLLM default of 65536 will OOM under concurrent long-context requests on Spark's unified memory.
- --gpu-memory-utilization 0.85 – Leaves 15 % headroom for KV cache spikes. Do not push above 0.88 on DGX Spark – unified memory means 0.90+ thrashes.
- --max-model-len 262144 – Full context window. Reduce to 131072 if you need more concurrent sequences.
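Once the server is up, any OpenAI-compatible client can talk to it; for example (the model name matches --served-model-name above, the prompt is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")   # vLLM ignores the key
resp = client.chat.completions.create(
    model="aeon-ultimate",                       # matches --served-model-name
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    temperature=0.7,
)
print(resp.choices[0].message.content)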
Python (transformers) – for testing or non-vLLM serving
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch
model_id = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
model_id,
dtype=torch.bfloat16, # vision tower + non-quantized weights
device_map="cuda:0",
trust_remote_code=True,
)
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Requires compressed-tensors >= 0.12 for NVFP4 dequant on the fly.
Hardware notes
| Hardware | Notes |
|---|---|
| DGX Spark (GB10, sm_121a) | Primary target. Use patched vLLM CUTLASS path. Expect ~50 tok/s single-stream after warmup. |
| B100 / B200 (sm_100) | Native FP4 compute via tcgen05/UTCQMMA – fastest hardware for this format. |
| RTX PRO 6000 Blackwell (sm_120) | Native FP4 via CUTLASS path. Excellent throughput. |
| A100 / H100 (sm_80, sm_90) | NVFP4 dequantizes to BF16/FP8 at the kernel level – works but no FP4 throughput advantage. Use the BF16 release instead for best perf on these. |
Provenance
- BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 – see the source card for the full pipeline (FernflowerAI SSM repair → abliterix-v1.4 abliteration → trial 46 of 50 selected for capability preservation).
- Original base: Qwen/Qwen3.6-27B by Alibaba.
- Quantization tool: llm-compressor by vllm-project.
- NVFP4 scheme: NVIDIA NVFP4 specification.
User Responsibility & Arbitration Clause
By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:
- Sole Responsibility. You, the user, are solely and exclusively responsible for every prompt issued, every response produced, every downstream action taken in reliance on those responses, and any harm – direct, indirect, consequential, or otherwise – that results.
- No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.
- Legal Compliance. You are responsible for ensuring that your use complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.
- Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.
- Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater – not lesser – caution, forethought, and ethical discipline when operating this model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.
- No Endorsement of Outputs. The authors, contributors, and publishers do not endorse, adopt, or take responsibility for any specific output. Outputs are a stochastic function of the prompt, the weights, and the sampler state – not a statement of position by any human.
- Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.
- Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.
- Severability. If any provision is held unenforceable in a given jurisdiction, the remaining provisions remain in full force, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.
- Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.
This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.
License
Apache 2.0 (inherited from Qwen/Qwen3.6-27B).
Support the work
If this release has been useful, tips are deeply appreciated โ they go directly toward more compute, more models, and more open releases.
- Bitcoin (BTC): bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
- Ethereum (ETH): 0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
- Solana (SOL): DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
- Monero (XMR): 836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.



