AMAImedia/Qwen3.5-35B-A3B-Darwin-Opus-NOESIS-AWQ-INT4

Custom AWQ-style INT4 quantization of FINAL-Bench/Darwin-35B-A3B-Opus converted from Q8_0 GGUF, optimized for RAM-constrained machines (64 GB RAM, RTX 3060 6 GB).

Released as part of the NOESIS Professional Multilingual Dubbing Automation Platform (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).


⚠️ License notice

This model is derived from FINAL-Bench/Darwin-35B-A3B-Opus, which itself is derived from Qwen/Qwen3.5-35B-A3B — both licensed under Apache 2.0. This INT4 quantization retains the same Apache 2.0 license — see the LICENSE file in this repository for the full text.


Model summary

| Property | Value |
| --- | --- |
| Base model | FINAL-Bench/Darwin-35B-A3B-Opus |
| Quantization source | FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF (Q8_0, ~36.9 GB) |
| Architecture | qwen3_5_moe (Qwen3.5 MoE with Gated DeltaNet) |
| Total parameters | 35B |
| Active parameters | ~3B per forward pass (8 routed + 1 shared expert) |
| Experts per layer | 256 routed + 1 shared |
| Layers | 40 (hybrid: 30 GDN/linear_attention + 10 full_attention, every 4th) |
| Hidden size | 2 048 |
| Original vocab size | 248 320 |
| Context length | 262 144 tokens (native) |
| Languages | 201 |
| Quantization format | Custom nibble AWQ-INT4 (group_size=128, symmetric, no AutoAWQ) |
| Precision: linear layers | nibble uint8 (weight_i4 [out, in//2] + weight_scale_i4 [n_groups, out]) |
| Precision: MoE experts | nibble uint8 3D (gate_up_proj_q4 [256, out, in//2] + scales/zeros) |
| Precision: lm_head | BF16 (AWQ standard: output projection kept full precision) |
| Precision: embed_tokens | BF16 |
| Disk footprint | ~17.8 GB |
| Inference RAM (CPU offload) | ~20 GB RAM + ~5.4 GB VRAM (device_map="auto") |
| trust_remote_code | required |
| Quantization library | Custom pipeline (NOESIS v14.7), no AutoAWQ dependency |
| RNG seed | 1729 (NOESIS reproducibility lock) |
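For orientation, here is a minimal sketch of what symmetric group-wise INT4 quantization with group_size=128 (zero-point fixed at 0) looks like in PyTorch. This illustrates the scheme named in the table, not the actual NOESIS pipeline; the real format additionally packs two values per uint8 and stores scales in a different layout.

```python
import torch

def quantize_symmetric_int4(w: torch.Tensor, group_size: int = 128):
    """Group-wise symmetric INT4 quantization of a [out, in] weight matrix.

    Each row is split into groups of `group_size` input channels; every
    group gets one scale. Values map to signed int4 in [-8, 7] with a
    fixed zero-point of 0 (symmetric).
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    scale = (g.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)
    return q.reshape(out_f, in_f), scale.squeeze(-1)

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    """Inverse mapping: multiply each group by its scale."""
    out_f, in_f = q.shape
    g = q.reshape(out_f, in_f // group_size, group_size).float()
    return (g * scale.unsqueeze(-1)).reshape(out_f, in_f)

w = torch.randn(4, 256)
q, s = quantize_symmetric_int4(w)
w_hat = dequantize_int4(q, s)
```

Because rounding lands each value within half a quantization step, the per-element reconstruction error is bounded by scale / 2 for its group.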

Architecture note: Darwin-35B-A3B-Opus was created with Darwin V5 — a diagnostic-guided evolutionary merge engine (DARE-TIES via mergekit).

  • Father: Qwen/Qwen3.5-35B-A3B (base architecture + RLHF)
  • Mother: Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled (LoRA SFT)

Key diagnostic finding: Mother had 50–65% dead experts (activation < 5%) from text-only LoRA SFT. Darwin V5 compensated by reducing Mother density and using Father's living experts to fill inactive slots. Layer 38 (reasoning core) uses 90% Mother weights (peak probe cosine distance).
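As an illustration of how such a dead-expert diagnostic can be computed, assuming access to the router's top-k expert indices per token (the function name, threshold usage, and the toy routing below are hypothetical, not Darwin V5's actual probe):

```python
import torch

def expert_activation_rates(topk_indices: torch.Tensor, num_experts: int = 256):
    """Fraction of tokens that route to each expert.

    topk_indices: [num_tokens, k] tensor of routed expert ids for one MoE layer.
    Returns a [num_experts] tensor of activation rates.
    """
    counts = torch.bincount(topk_indices.flatten(), minlength=num_experts)
    return counts.float() / topk_indices.shape[0]

# Toy example: 1000 tokens, top-8 routing over 256 experts, with routing
# artificially concentrated on the first 64 experts (the rest never fire).
idx = torch.randint(0, 64, (1000, 8))
rates = expert_activation_rates(idx)
dead = (rates < 0.05).sum().item()  # experts activated by <5% of tokens
```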


Benchmark results (original BF16 model, Q8_0 ≈ BF16)

| Benchmark | Darwin-35B-A3B-Opus | Father (Qwen3.5-35B-A3B) | Mother (Claude 4.6 Opus Distilled) |
| --- | --- | --- | --- |
| GPQA Diamond | 90.0% | 84.2% | 85.0% |
| MMMLU (29 langs) | 85.0% | 85.2% | n/a |

Why a custom format (not AutoAWQ / transformers AwqConfig)

AutoAWQ and transformers AwqConfig only quantize standard nn.Linear modules. Darwin-35B stores all 256 routed experts as merged nn.Parameter tensors [256, out_features, in_features] inside Qwen3_5MoeExperts — not as 256 individual nn.Linear modules. AutoAWQ skips them, leaving ~80% of the model weights in BF16 and causing OOM on any device with less than ~65 GB RAM.

This quantization handles both components with a single custom pass:

| Component | Approach |
| --- | --- |
| All nn.Linear (attn, MLP shared expert, router) | Linear4bit: nibble uint8, dequantize on forward |
| mlp.experts (256 routed experts per layer) | Darwin35BExpertsInt4: nibble uint8 3D, dequantize on forward |
| lm_head, in_proj_a/b | BF16 (kept full precision) |

Source was the Q8_0 GGUF (not BF16 safetensors), processed layer-by-layer: peak RAM during quantization ~22 GB (one transformer block ~800 MB BF16 at a time).
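The "nibble uint8" storage referenced above can be sketched as follows: two signed int4 values per byte, shifted by +8 into the unsigned nibble range. The function names are illustrative, not the repo's actual Linear4bit / Darwin35BExpertsInt4 internals.

```python
import torch

def pack_nibbles(q: torch.Tensor) -> torch.Tensor:
    """Pack a signed-int4 tensor [out, in] into uint8 nibbles [out, in//2].

    Each byte holds two consecutive int4 values, stored with an offset of
    +8 so the signed range [-8, 7] fits in an unsigned nibble [0, 15].
    """
    u = (q + 8).to(torch.uint8)            # signed int4 -> unsigned nibble
    return (u[:, 0::2] << 4) | u[:, 1::2]  # high nibble = even index

def unpack_nibbles(packed: torch.Tensor) -> torch.Tensor:
    """Recover the signed int4 values from the packed uint8 tensor."""
    hi = (packed >> 4).to(torch.int8) - 8
    lo = (packed & 0x0F).to(torch.int8) - 8
    out = torch.empty(packed.shape[0], packed.shape[1] * 2, dtype=torch.int8)
    out[:, 0::2] = hi
    out[:, 1::2] = lo
    return out

q = torch.randint(-8, 8, (4, 256), dtype=torch.int8)
packed = pack_nibbles(q)       # [4, 128] uint8, half the bytes of int8 storage
restored = unpack_nibbles(packed)
```

The round trip is lossless, which is why the shapes in the table above halve only the last dimension (in//2).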


How to use

Requires trust_remote_code=True — uses custom Darwin35BForCausalLMInt4 class. Do NOT use AutoAWQForCausalLM.from_quantized() — this is not AutoAWQ GEMV format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "amaimedia/Qwen3.5-35B-A3B-Darwin-Opus-NOESIS-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "5.4GiB", "cpu": "54GiB"},
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

messages = [{"role": "user", "content": "Explain the Mixture of Experts architecture."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

CPU-only inference (no GPU):

```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cpu",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
```

Note: This format dequantizes weights to BF16 on each forward pass (no dedicated INT4 CUDA kernel). Inference speed is proportional to your CPU/RAM bandwidth. For production fast inference, use the AWQ-INT8 variant (higher quality, larger) or the original GGUF Q8_0 with llama.cpp.
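To make the dequantize-on-forward cost concrete, here is a simplified stand-in for such a layer (not the repo's actual Linear4bit class): the full weight matrix is rebuilt in BF16 on every call, which is why throughput tracks memory bandwidth rather than compute.

```python
import torch
import torch.nn as nn

class DequantLinear(nn.Module):
    """Toy dequantize-on-forward linear layer.

    Stores int4 values (held in int8) plus per-group scales; rebuilds the
    BF16 weight matrix on every forward pass instead of using a fused
    INT4 kernel, so each call pays full dequantization + matmul cost.
    """

    def __init__(self, q: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
        super().__init__()
        self.register_buffer("q", q)          # [out, in] int4 values in int8
        self.register_buffer("scale", scale)  # [out, in // group_size]
        self.group_size = group_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out_f, in_f = self.q.shape
        g = self.q.reshape(out_f, -1, self.group_size).to(torch.bfloat16)
        w = (g * self.scale.unsqueeze(-1).to(torch.bfloat16)).reshape(out_f, in_f)
        return x @ w.t()

q = torch.randint(-8, 8, (8, 256), dtype=torch.int8)
scale = torch.rand(8, 2) * 0.1
layer = DequantLinear(q, scale)
y = layer(torch.randn(1, 256, dtype=torch.bfloat16))
```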


Thinking mode

Darwin-35B-A3B-Opus supports thinking mode (enabled by default at temperature ≤ 0.6). Use <think> tags or set the generation config to control reasoning:

```python
# Disable thinking (faster, less verbose)
out = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=1.0,
    do_sample=True,
)

# Enable extended thinking (default at temperature ≤ 0.6)
out = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.6,
    do_sample=True,
)
```
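If you post-process outputs, the reasoning block can be separated from the final answer with a small helper. This assumes the model emits at most one leading <think>...</think> block; check the tokenizer's chat template for the exact markers your checkpoint uses.

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split generated text into (thinking, answer).

    Assumes at most one leading <think>...</think> block; returns an
    empty thinking string when no block is present.
    """
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", text.strip()

thinking, answer = split_thinking("<think>compare experts</think>MoE routes tokens.")
```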

NOESIS context

In NOESIS this model serves as a high-capability reasoning teacher for Specialists M4-CHAT, M5-CODE, and M6-RESEARCH during knowledge distillation (step110 in extraction_master.py). Proposed KD weight: w=0.25.

⚠️ KD pipeline note: Darwin-35B-A3B-Opus has vocab_size=248 320 (Qwen3.5 extended vocab including codec and vision tokens), while NOESIS student models use Qwen3-8B native vocab 151 936. Logit extraction requires vocab head truncation to index 151 936 via purify_logits() before ensemble aggregation in build_ensemble_labels.py.
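A minimal sketch of the required truncation (the real purify_logits() lives in the NOESIS pipeline and its signature is not shown here; this also assumes the extended vocab appends its extra codec/vision tokens after the shared base ids, so a prefix slice keeps token ids aligned):

```python
import torch

STUDENT_VOCAB = 151_936  # Qwen3-8B native vocab size (per the KD note above)

def purify_logits(teacher_logits: torch.Tensor) -> torch.Tensor:
    """Truncate teacher logits [..., 248320] to the student vocab.

    Illustrative sketch only: valid if the extended vocab's extra tokens
    occupy ids >= 151936, so slicing the last dim preserves alignment.
    """
    return teacher_logits[..., :STUDENT_VOCAB]

logits = torch.randn(1, 4, 248_320)
pure = purify_logits(logits)
```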

| ID | Role | Size |
| --- | --- | --- |
| M1 | ASR (150+ langs) | 10B/3B |
| M2 | Dubbing LM (30 langs full) | 10B/3B |
| M3 | TTS + voice cloning | 10B/3B |
| M4 | Chat + creative writing | 10B/3B |
| M5 | Code + math | 10B/3B |
| M6 | Deep research (1M ctx) | 10B/3B |
| M7 | Prompt engineering | 4B/0.8B |
| M8 | Quality control (PRM) | 4B/0.8B |
| M9 | Orchestrator + routing | 4B/0.8B |

Provenance

A noesis_provenance.json file ships alongside the model weights with the full quantization trace: source GGUF path, NOESIS version, quantization methodology, group size, and specialist assignment.


Acknowledgements & citation

Base model: Darwin-35B-A3B-Opus by FINAL-Bench (Darwin V5 evolutionary merge of Qwen3.5-35B-A3B + Claude 4.6 Opus Reasoning Distilled).

```bibtex
@misc{darwin35b_opus,
  title     = {Darwin-35B-A3B-Opus},
  author    = {FINAL-Bench},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus}
}

@misc{darwin35b_opus_gguf,
  title     = {Darwin-35B-A3B-Opus-Q8-GGUF},
  author    = {VIDRAFT},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF}
}
```

Quantization & NOESIS integration:

```bibtex
@misc{noesis_v14,
  title     = {NOESIS v14.7: DHCF-FNO Multilingual Dubbing Platform},
  author    = {Bolotnikov, Ilia},
  year      = {2026},
  publisher = {AMAImedia},
  url       = {https://amaimedia.com}
}
```