AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4
Deployment, operations & benchmarks – github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash
The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, full configuration reference, measured throughput benchmarks (32 tok/s median / 56 tok/s peak / 350 ms TTFT on DGX Spark), and AGENTS.md – an operator's manual that pre-empts common stale-documentation traps for AI coding agents working on this stack.

Production container: ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 (2026-04-28) – sm_121a-tuned vLLM v0.20.0 release commit, patched CUTLASS NVFP4, DFlash speculative decoding, FlashInfer 0.6.9 stable, and async scheduling enabled by default. v2.1 remains pullable for rollback.

v3 DGX Spark headline (current production, measured 2026-04-29)
| Metric | Thinking OFF | Thinking ON (default) |
|---|---|---|
| Median tok/s | 38.1 | 38.5 |
| Peak tok/s | 68.4 | 71.3 |
| Median TTFT | 247 ms | 249 ms |

Cumulative gain vs original production (v2.1 + regular -NVFP4 + old DFlash): +17 % median / +21 % peak thinking-off, +18 % / +26 % thinking-on, −24 % TTFT (325 → 247 ms). Three stacked wins: v3 image (vLLM v0.20.0 release + FlashInfer 0.6.9 stable), XS body via modelopt, and DFlash drafter v2 (z-lab 2026-04-27 push). Full data and four-config comparison: GitHub Performance section.
Variants
| Format | HuggingFace repo | Disk | Quant tool | Spec decode | Hardware target | When to pick this |
|---|---|---|---|---|---|---|
| NVFP4 (this repo) | …-NVFP4 | 26 GB | llm-compressor | DFlash k=15 | DGX Spark (GB10 / sm_121a) | Production-validated for DGX Spark with the patched vllm-aeon-ultimate-dflash container. |
| Multimodal-NVFP4-MTP | …-Multimodal-NVFP4-MTP | 27 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX PRO 6000 Blackwell · B100/B200 | MTP via the model's native mtp.* head (grafted bf16 from base). modelopt format, --quantization modelopt. Vision tower preserved. GDN linear attention preserved BF16 for best long-context fidelity. |
| Text-NVFP4-MTP | …-Text-NVFP4-MTP | 26 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX PRO 6000 · text-only | Same recipe as the Multimodal MTP sibling but with the vision tower stripped. GDN preserved BF16. |
| Multimodal-NVFP4-MTP-XS | …-Multimodal-NVFP4-MTP-XS | 21 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX 5090 · tighter dedicated VRAM | Strategic split: GDN projection matmuls (in_proj_qkv/z/a/b, out_proj) → NVFP4; linear_attn.conv1d kept BF16 to preserve the recurrence-critical SSM convolution. Vision tower preserved. |
| Text-NVFP4-MTP-XS | …-Text-NVFP4-MTP-XS | 20 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX 5090 text-only · 24 GB cards | Same conv1d-preserved strategic split as Multimodal-XS, vision tower stripped. The smallest variant we ship. |
| BF16 | …-BF16 | 51 GB | – | – | A100 / H100 80 GB · multi-GPU | Full-precision reference weights. Ampere / Hopper / pre-Blackwell hardware, fine-tuning, or quant-recipe development. |
Hardware routing – measured, not theoretical
Pick by memory architecture, not just GPU model:
| Hardware class | Use this | Why |
|---|---|---|
| DGX Spark / GB10 (unified memory, sm_121a) | this -NVFP4 (DFlash) repo, or the modelopt -Multimodal-NVFP4-MTP-XS body served with DFlash for +15–21 % more throughput (see the note below) | Bench on Spark: DFlash beats the MTP method by +56 % median (37.6 vs 24.1 tok/s) and +150 % peak (68.7 vs 27.5) on the same XS body. Don't run the MTP method on Spark. |
| RTX PRO 6000 / RTX 5090 / B100 / B200 (dedicated VRAM, sm_120/sm_100) | -NVFP4-MTP or -NVFP4-MTP-XS | MTP wins on dedicated VRAM. RTX PRO 6000 measured: XS hits 111.4 tok/s median with 69 % MTP acceptance – beats no-spec by ~10 %. |
| A100 / H100 (no native FP4) | -BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper – no benefit. |

Full bench numbers: GitHub repo Performance section.
Regular MTP vs XS – strategic quantization, not a precision compromise
The GatedDeltaNet linear_attn.* block has two distinct components: the heavy projection matmuls (in_proj_qkv, in_proj_z, in_proj_a/b, out_proj – ~11 GB total) and the SSM 1D convolution kernel (linear_attn.conv1d – small, but recurrence-critical).
- Regular MTP variants keep both at BF16. Maximum numerical safety margin, larger footprint.
- XS variants quantize the projection matmuls to NVFP4 (saves ~6 GB; FP4 is a clean win on bandwidth-bound matmuls) but explicitly preserve linear_attn.conv1d at BF16. FP4 quantization of conv1d has been observed to cause drift on long-context recurrence in community testing, so we keep it at BF16 – the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads). This is not "everything to FP4" – that would be a different (and not-recommended) variant we have explicitly chosen not to ship. A minimal sketch of what a conv1d-preserving split looks like follows below.
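To make the split concrete, here is an illustrative modelopt-style sketch of a conv1d-preserving NVFP4 config. It is not the shipped XS recipe: the wildcard patterns and the calibration loop are assumptions for illustration; only NVFP4_DEFAULT_CFG and the quantize()/forward_loop API are modelopt's published surface.

import modelopt.torch.quantization as mtq

# Start from modelopt's published NVFP4 preset (NVFP4 weights, block-16 FP8 scales),
# then disable quantization for the recurrence-critical conv1d and the vision tower.
# The wildcard strings below are illustrative, not the exact patterns in the shipped recipe.
cfg = dict(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"] = dict(cfg["quant_cfg"])
cfg["quant_cfg"]["*linear_attn.conv1d*"] = {"enable": False}   # keep the SSM convolution BF16
cfg["quant_cfg"]["*visual*"] = {"enable": False}               # keep the vision tower BF16

# model = <load the BF16 source>
# mtq.quantize(model, cfg, forward_loop=calibration_loop)      # calibration_loop: your own calibration pass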
New on Spark (2026-04-28): XS body + DFlash spec is the new winning config

If you want maximum DGX Spark throughput today, the highest-measured configuration is:

- Model body: -Multimodal-NVFP4-MTP-XS (modelopt format)
- Spec method: DFlash k=15 via z-lab/Qwen3.6-27B-DFlash v2 (2026-04-27 push) – not the MTP head that ships with the XS variant
- Same Spark settings (--max-num-seqs 16, --gpu-memory-utilization 0.85, --max-model-len 200000)
- vLLM args: --quantization modelopt --speculative-config '{"method":"dflash","model":"/path/to/dflash-drafter","num_speculative_tokens":15}'

Measured on the v3 image (2026-04-28): 38.1 / 68.4 tok/s thinking-off and 38.5 / 71.3 tok/s thinking-on, vs this -NVFP4 repo's prior 32.5 / 56.7 thinking-off baseline. The XS body's NVFP4-quantized GDN projections share the same dispatch path as the rest of the body (one fewer BF16→FP4 cast per layer per token), and the new DFlash drafter v2 has measurably better acceptance (mean accepted length 2.60 per round vs 2.0–2.3 prior). This -NVFP4 (compressed-tensors) repo + DFlash remains the simpler, validated path; the XS+DFlash combo is the higher-throughput path once you've been through one boot to populate the autotuner cache. Full numbers in the GitHub Performance section.
The production deployment format for Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 on Blackwell-class hardware. Same model, same 0/100 refusal rate, same preserved-and-enhanced capabilities of the BF16 source – compressed from 51 GB BF16 to 26 GB NVFP4 for native FP4 tensor-core throughput on DGX Spark (GB10 / sm_121a), B100 / B200, and RTX PRO 6000 Blackwell.
The BF16 source is itself the product of 72 hours of continuous research drawing on hundreds of parallel AI research agents, the industry's best published methodologies, custom in-house techniques, and yet-unreleased pre-public branches of the next-generation abliteration toolchain. See the BF16 model card for the full pipeline narrative and capability data.
Why NVFP4 – and Why It's Effectively Lossless
NVFP4 is not a "compressed lite" version. It is the format NVIDIA designed as the production deployment path for Blackwell-and-later silicon: accuracy on par with BF16, the throughput of true 4-bit compute, no compromise required.
The accuracy guarantee comes from a two-level scaling structure that older 4-bit formats (INT4, Q4_0/Q4_K, NF4) do not have:
- E2M1 element format – 4-bit floating point per weight (sign / 2-bit exponent / 1-bit mantissa).
- Block size 16 with FP8 E4M3 per-block scales – every 16 weights share an 8-bit floating-point scale, which dramatically out-resolves the INT8 scales used by older schemes when the local weight distribution is heavy-tailed.
- FP32 per-tensor scale – global re-scale applied at the kernel boundary so block-level FP8 scales never have to span the full tensor's dynamic range.
The combined effect is that local outliers – the long-tailed weights that destroy older 4-bit formats – are absorbed by the per-block FP8 scale rather than smearing the whole quantization grid. Typical KL divergence vs the BF16 source for recipe-class NVFP4 quantization is ≤ 0.001, which is below the noise floor of stochastic sampling. A user cannot observe the difference between this model and its BF16 source; the difference is smaller than the variance from changing your random seed.
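A toy numeric sketch of the two-level rescale described above (illustrative only; the real kernels pack, round, and store FP8/FP32 scales differently, and the helper names here are made up):

import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # magnitudes representable in E2M1

def quantize_block(block, tensor_scale):
    """One 16-weight block: pick an FP8-style block scale, snap elements to E2M1."""
    block_scale = np.abs(block).max() / (6.0 * tensor_scale)      # stored as FP8 E4M3 in the real format
    scaled = block / (block_scale * tensor_scale)                 # now inside the E2M1 range [-6, 6]
    idx = np.abs(E2M1_GRID[None, :] - np.abs(scaled)[:, None]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], block_scale

def dequantize_block(q, block_scale, tensor_scale):
    return q * block_scale * tensor_scale                         # the two-level rescale

# A block with one local outlier: the outlier inflates only this block's scale,
# so every other 16-weight block in the tensor is completely unaffected.
block = np.array([0.01, -0.02, 0.015, 0.03, -0.01, 0.02, 0.005, -0.015,
                  0.025, -0.03, 0.01, 0.02, -0.005, 0.015, 0.9, -0.02])
q, bs = quantize_block(block, tensor_scale=1.0)
print(np.abs(dequantize_block(q, bs, 1.0) - block).max())         # worst-case error here stays ~0.03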
On native FP4 silicon – Blackwell tcgen05 / UTCQMMA paths, sm_121a CUTLASS on GB10 – this format runs at full FP4 tensor-core throughput. The GPU does not dequantize back to BF16 internally. You get the speed of true 4-bit compute and the accuracy of 16-bit weights at the same time. On older silicon (A100, H100) NVFP4 dequantizes at kernel boundaries – it works correctly, but with no throughput advantage; for those cards use the BF16 release directly.
This release is multimodal-preserved (vision tower stays BF16 – text + image inference fully functional) and hybrid-attention-preserved (the 48 linear-attention / GatedDeltaNet layers stay BF16; FP4 applies only to the 16 full-attention layers' attention projections (q/k/v/o) and all MLPs, where it is well-behaved). Mamba state and SSM dynamics are mathematically incompatible with FP4 and remain in BF16 by design, not by compromise.
What Changed vs BF16
| Aspect | BF16 (source) | NVFP4 (this release) |
|---|---|---|
| Disk size | 51 GB | 26 GB (49% reduction) |
| Refusal rate | 0/100 | 0/100 inherited (KL ≤ 0.001 from source – below sampling noise) |
| Multimodal | preserved | preserved (vision BF16, no degradation) |
| Hybrid SSM | repaired + intact | intact (linear_attn BF16-preserved) |
| Hardware target | A100 / H100 / RTX PRO 6000 BF16 | DGX Spark (GB10), B100/B200, RTX PRO 6000 Blackwell with native FP4 throughput |
| KL vs BF16 source | n/a | expected ≤ 0.001 (typical for this recipe class) |
The NVFP4 quantization scheme is NVIDIA-mandated: E2M1 element format, block_size=16, FP8 E4M3 per-block scales, FP32 per-tensor scale, symmetric signed.
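As a hedged sketch (not the evaluation harness used for this card), a spot-check of next-token KL divergence against the BF16 source could look like this; it assumes you have memory for both checkpoints, and the result should be averaged over many prompts and positions before comparing against the ≤ 0.001 figure:

import torch
import torch.nn.functional as F
from transformers import AutoModelForImageTextToText, AutoTokenizer

bf16_id  = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16"
nvfp4_id = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4"

tok = AutoTokenizer.from_pretrained(bf16_id, trust_remote_code=True)
ref = AutoModelForImageTextToText.from_pretrained(bf16_id,  dtype=torch.bfloat16, device_map="cuda:0", trust_remote_code=True)
qnt = AutoModelForImageTextToText.from_pretrained(nvfp4_id, dtype=torch.bfloat16, device_map="cuda:1", trust_remote_code=True)

inputs = tok("The two-level scaling structure in NVFP4 means that", return_tensors="pt")
with torch.no_grad():
    log_p = F.log_softmax(ref(**inputs.to(ref.device)).logits[0, -1].float(), dim=-1)
    log_q = F.log_softmax(qnt(**inputs.to(qnt.device)).logits[0, -1].float(), dim=-1).to(log_p.device)

# KL(P_bf16 || Q_nvfp4) for a single next-token distribution; average over a prompt set in practice.
print(F.kl_div(log_q, log_p, log_target=True, reduction="sum").item())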
Quantization Recipe
Tool: llm-compressor 0.10.1.dev107 (vllm-project) using QuantizationModifier(scheme="NVFP4") post-training quantization.
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",                  # always
        "re:.*embed_tokens.*",      # always
        "re:.*\\.visual\\..*",      # vision tower BF16 – preserves multimodal
        "re:.*visual\\..*",
        "re:.*linear_attn\\..*",    # SSM/GDN BF16 – Mamba state collapses under FP4
        "re:.*norm.*",
        "re:.*q_norm.*",
        "re:.*k_norm.*",
    ],
)
Calibration: open-platypus, 512 samples × 4096 tokens.
Pipeline: sequential with sequential_targets=["Qwen3_5DecoderLayer"] – required for hybrid stacks (mixed full + linear attention layers); without explicit targeting, llm-compressor's auto-discovery silently skips layers.
Loader: AutoModelForImageTextToText to preserve the Qwen3_5ForConditionalGeneration multimodal class.
Processor: passed explicitly to oneshot() to avoid the "model processor required when a dataset is provided" failure on multimodal builds without torchvision.
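For context, an end-to-end call wiring these pieces together could look roughly like the sketch below. It follows llm-compressor's published oneshot() examples; exact argument names (processor, sequential_targets, the dataset alias) vary between versions, so treat this as a sketch rather than the exact script used for this release.

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot

src_id = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16"
model = AutoModelForImageTextToText.from_pretrained(src_id, dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(src_id, trust_remote_code=True)

oneshot(
    model=model,
    processor=processor,                          # passed explicitly – see note above
    recipe=recipe,                                # the QuantizationModifier defined above
    dataset="open_platypus",                      # calibration set used for this release
    num_calibration_samples=512,
    max_seq_length=4096,
    sequential_targets=["Qwen3_5DecoderLayer"],   # explicit targeting for the hybrid stack
    output_dir="Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4",
)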
Verification (pass):
- 1 shard, 1952 keys
- 64 quantized full-attention projections (16 layers × 4 q/k/v/o)
- 432 linear_attn.* keys preserved BF16 (48 layers × 9 modules)
- 333 visual.* keys preserved BF16 (vision tower intact)
- 319 norm keys preserved BF16
- lm_head and embed_tokens preserved BF16
- NVFP4-packed weights present
- input_global_scale magnitudes 142–346 (healthy range)
Wall-clock quant time: ~57 minutes on 1× RTX PRO 6000 Blackwell (96 GB).
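If you want to reproduce a rough version of the key audit above on the downloaded checkpoint, a small script along these lines works; the substring filters are illustrative, and the exact packed-weight key names depend on the compressed-tensors version:

import glob
import os
from collections import Counter
from huggingface_hub import snapshot_download
from safetensors import safe_open

path = snapshot_download(
    "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4",
    allow_patterns=["*.safetensors"],
)

keys = []
for shard in glob.glob(os.path.join(path, "*.safetensors")):   # single shard for this release
    with safe_open(shard, framework="pt") as f:
        keys.extend(f.keys())

counts = Counter()
for k in keys:
    if "linear_attn." in k:
        counts["linear_attn"] += 1
    if "visual." in k:
        counts["visual"] += 1
    if "norm" in k:
        counts["norm"] += 1

print(len(keys), dict(counts))   # compare against the verification counts above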
Deployment
vLLM on DGX Spark (GB10 / sm_121a) – recommended
Use the production-validated patched image ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 (:latest also points here as of 2026-04-30). It bundles the SM121 CUTLASS NVFP4 patches, FlashInfer 0.6.9 stable, TurboQuant, and the DFlash drafter integration. The patched CUTLASS path uses native FP4 tensor-core kernels and outperforms the Marlin fallback – do NOT force VLLM_NVFP4_GEMM_BACKEND=marlin (that's the workaround for stock vLLM builds where CUTLASS is broken on SM121).
For a fully-flagged production setup including DFlash speculative decoding (k=15), use the docker-compose recipe in the deployment repo. For a minimal manual docker run without DFlash:
docker run --gpus all --ipc=host --network=host \
-e TORCH_CUDA_ARCH_LIST="12.0+PTX" \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
-v /path/to/model:/models/aeon-ultimate \
ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 \
vllm serve /models/aeon-ultimate \
--served-model-name aeon-ultimate \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 1 \
--dtype auto \
--quantization compressed-tensors \
--max-model-len 262144 \
--max-num-seqs 64 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--no-enable-prefix-caching \
--load-format safetensors \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend flash_attn \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm
Key settings (tuned for DGX Spark 128 GB unified memory):
- --max-num-seqs 64 – Conservative for 262K context. Raise to 128 only for short-context workloads. The DGX Spark's 128 GB is unified between CPU and GPU; KV cache for 128 concurrent long-context sequences will exhaust it.
- --max-num-batched-tokens 32768 – Safe prefill budget on DGX Spark. This matches vLLM's inductor compile-range ceiling for this image (compile_ranges_endpoints: [32768]); above 32k, prefill falls back to eager mode. The stock vLLM default of 65536 will OOM under concurrent long-context requests on Spark's unified memory.
- --gpu-memory-utilization 0.85 – Leaves 15 % headroom for KV cache spikes. Do not push above 0.88 on DGX Spark – unified memory means 0.90+ thrashes.
- --max-model-len 262144 – Full context window. Reduce to 131072 if you need more concurrent sequences.
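Once the server is up, any OpenAI-compatible client can talk to it; for example (the model name matches --served-model-name above, the prompt is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")   # vLLM ignores the key
resp = client.chat.completions.create(
    model="aeon-ultimate",                       # matches --served-model-name
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    temperature=0.7,
)
print(resp.choices[0].message.content)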
Python (transformers) – for testing or non-vLLM serving
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch
model_id = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
model_id,
dtype=torch.bfloat16, # vision tower + non-quantized weights
device_map="cuda:0",
trust_remote_code=True,
)
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Requires compressed-tensors >= 0.12 for NVFP4 dequant on the fly.
Hardware notes
| Hardware | Notes |
|---|---|
| DGX Spark (GB10, sm_121a) | Primary target. Use patched vLLM CUTLASS path. Expect ~50 tok/s single-stream after warmup. |
| B100 / B200 (sm_100) | Native FP4 compute via tcgen05/UTCQMMA – fastest hardware for this format. |
| RTX PRO 6000 Blackwell (sm_120) | Native FP4 via CUTLASS path. Excellent throughput. |
| A100 / H100 (sm_80, sm_90) | NVFP4 dequantizes to BF16/FP8 at the kernel level – works but no FP4 throughput advantage. Use the BF16 release instead for best perf on these. |
Provenance
- BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 – see the source card for the full pipeline (FernflowerAI SSM repair → abliterix-v1.4 abliteration → trial 46 of 50 selected for capability preservation).
- Original base: Qwen/Qwen3.6-27B by Alibaba.
- Quantization tool: llm-compressor by vllm-project.
- NVFP4 scheme: NVIDIA NVFP4 specification.
User Responsibility & Arbitration Clause
By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:
- Sole Responsibility. You, the user, are solely and exclusively responsible for every prompt issued, every response produced, every downstream action taken in reliance on those responses, and any harm – direct, indirect, consequential, or otherwise – that results.
- No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.
- Legal Compliance. You are responsible for ensuring that your use complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.
- Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.
- Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater – not lesser – caution, forethought, and ethical discipline when operating this model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.
- No Endorsement of Outputs. The authors, contributors, and publishers do not endorse, adopt, or take responsibility for any specific output. Outputs are a stochastic function of the prompt, the weights, and the sampler state – not a statement of position by any human.
- Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.
- Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.
- Severability. If any provision is held unenforceable in a given jurisdiction, the remaining provisions remain in full force, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.
- Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.
This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.
License
Apache 2.0 (inherited from Qwen/Qwen3.6-27B).
Support the work
If this release has been useful, tips are deeply appreciated โ they go directly toward more compute, more models, and more open releases.
- Bitcoin (BTC): bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
- Ethereum (ETH): 0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
- Solana (SOL): DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
- Monero (XMR): 836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.



