
FINAL-Bench/Darwin-28B-Opus


Darwin-28B-Opus: Qwen3.6-27B × Opus-Distilled Evolutionary Merge


Qwen3.6-27B dense · 27.6B parameters · Hybrid Linear/Full Attention · BF16 · Thinking Mode · Apache 2.0
Darwin V7 evolutionary merge: Father × Opus-distilled Mother → 88.89% on GPQA Diamond (3-stage adaptive evaluation)


Abstract

Darwin-28B-Opus is the first reasoning model of the Darwin series built on the Qwen3.6-generation backbone. Produced by the Darwin V7 evolutionary breeding engine from two publicly available parents, it combines the strong bilingual reasoning of Qwen3.6-27B with distilled Claude Opus 4-style chain-of-thought behaviour.

On the GPQA Diamond graduate-level reasoning benchmark (198 PhD-level questions), Darwin-28B-Opus scores 88.89% under the standard 3-stage adaptive evaluation, slightly edging out its larger MoE sibling Darwin-36B-Opus (88.4%) and clearly surpassing its Qwen3.5-generation counterpart Darwin-27B-Opus (86.9%).


🧬 Model Lineage

| Role | Model | Role in the Merge |
| --- | --- | --- |
| Father (父) | Qwen/Qwen3.6-27B | Qwen3.6-generation dense backbone with hybrid linear/full attention. |
| Mother (母) | rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled | Claude Opus reasoning-distilled variant of the same backbone (Jackrong-style distillation, 14k traces). |
| Offspring | Darwin-28B-Opus (this model) | Darwin V7 evolutionary merge; Qwen3.6 architecture retained, Opus reasoning style inherited. |

Why 28B? The 28B label denotes the Qwen3.6-generation member of the Darwin lineup (+1 over the Qwen3.5-era Darwin-27B-Opus). The actual parameter count is 27.6B, and the architecture follows Qwen3.6-27B exactly.
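The merge procedure itself is not published beyond the "Mother-centric Ratio Interpolation" label in the citation below. As a rough illustration only, a mother-weighted parameter interpolation could look like the sketch below; the alpha value and the uniform per-tensor policy are assumptions, not the actual Darwin V7 recipe.

# Hypothetical sketch of a mother-weighted linear interpolation merge.
# alpha and the uniform per-tensor policy are illustrative assumptions;
# the real Darwin V7 procedure is unpublished.
def interpolate_state_dicts(father, mother, alpha=0.6):
    """Blend two same-architecture checkpoints; alpha weights the mother (Opus-distilled) parent."""
    return {name: alpha * mother[name] + (1.0 - alpha) * w for name, w in father.items()}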


โš™๏ธ Technical Specifications

| Component | Value |
| --- | --- |
| Architecture | Qwen3_5ForConditionalGeneration (Qwen3.6 generation, hybrid linear + full attention) |
| Parameters | 27.6B (BF16) |
| Hidden size | 5,120 |
| Intermediate size | 17,408 |
| Head dim | 256 |
| Layers | 64 (3 linear : 1 full attention, full_attention_interval = 4) |
| Precision | bfloat16 |
| Context length | Inherited from base (long-chain reasoning supported) |
| License | Apache 2.0 |
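To make the hybrid layout concrete: assuming full_attention_interval = 4 means every fourth layer runs full attention (an assumption; the base model's config is authoritative), the 64 layers split into 48 linear and 16 full-attention layers:

# Illustrative layer schedule; verify the real mapping in the base config.
FULL_ATTENTION_INTERVAL = 4
NUM_LAYERS = 64
layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]
assert layer_types.count("full_attention") == 16   # 64 / 4
assert layer_types.count("linear_attention") == 48  # 3 linear per full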

๐Ÿ† Benchmark โ€” GPQA Diamond (198 questions)

Darwin-28B-Opus is evaluated under the standard 3-stage adaptive evaluation protocol used across the entire Darwin series.

| Stage | Decoding Protocol | Cost | Accuracy |
| --- | --- | --- | --- |
| Stage 1 | Single-shot greedy baseline | 1× | 74.75% (148 / 198) |
| Stage 2 | Majority vote ×8 at temperature 0.7 on Stage-1 wrongs | 8× | 83.84% (166 / 198) |
| Stage 3 | Adaptive ensemble refinement (close-tie tiebreaker + iterative MTI on residual hard questions) | ≈ 20× | 🥇 88.89% (176 / 198) |

Key performance indicators:

  • Stage 1 → Stage 3: +14.14 %p through the adaptive protocol
  • vs Darwin-27B-Opus (86.9%): +1.99 %p
  • vs Darwin-36B-Opus (88.4%): +0.49 %p
  • vs Darwin-31B-Opus (85.9%): +2.99 %p

🚀 Usage

Standard inference (Stage 1 baseline)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained(
    "FINAL-Bench/Darwin-28B-Opus",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-28B-Opus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user",
     "content": "Solve: If f(x) = x³ − 3x + 2, find all critical points and classify them."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False) corresponds to the Stage 1 baseline.
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
# Decode only the newly generated tokens, dropping the prompt.
print(tok.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
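The card lists Thinking Mode among the base model's features. If the Qwen3.6 tokenizer follows the Qwen3 chat-template convention, the mode can be toggled through the template; treat the flag below as an assumption and confirm it against the base tokenizer's documentation.

# Assumption: Qwen3-style templates accept an enable_thinking switch.
text = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # confirm this flag exists for the Qwen3.6 template
)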

Enhanced accuracy (Stage 2-3 adaptive)

For leaderboard-grade accuracy, combine:

  1. Stage 1 greedy baseline,
  2. Stage 2 maj@8 temperature sampling on low-confidence answers,
  3. Stage 3 adaptive refinement on still-disputed answers.

Reference implementation is provided in the Darwin-series evaluation harness.
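The harness is not reproduced here, but the Stage 2 step reduces to maj@8 sampling on the Stage-1 wrongs. A minimal sketch follows, reusing model, tok, and inputs from the snippet above; extract_final_answer is a hypothetical parser for the model's final choice and is not part of this repository.

from collections import Counter

# Minimal Stage 2 sketch: resample a Stage-1 wrong question n times at T=0.7
# and keep the most common final answer.
def majority_vote_at_8(model, tok, inputs, n=8, temperature=0.7):
    votes = []
    for _ in range(n):
        out = model.generate(**inputs, max_new_tokens=2048,
                             do_sample=True, temperature=temperature)
        completion = tok.decode(out[0][inputs.input_ids.shape[-1]:],
                                skip_special_tokens=True)
        votes.append(extract_final_answer(completion))  # hypothetical helper
    return Counter(votes).most_common(1)[0][0]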


🎯 Recommended Use-Cases

  • Graduate-level STEM reasoning (GPQA / science qualifying exams)
  • Mathematical problem solving (MATH, AIME-style problems)
  • Code generation and debugging (HumanEval, MBPP)
  • Complex multi-step chain-of-thought tasks
  • Bilingual reasoning (strong English + Korean; also Chinese / Japanese)

โš ๏ธ Limitations

  • At 27.6B parameters in bfloat16, full inference requires ≈ 55 GB of VRAM (e.g., a single A100-80GB or B200); see the quick estimate below.
  • Optimised for English first, with secondary support for Korean, Chinese, and Japanese.
  • Deep Opus-style reasoning traces tend to be verbose; control with max_new_tokens as needed.
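The ≈ 55 GB figure is a weights-only estimate (27.6B parameters × 2 bytes in bfloat16); KV cache and activations come on top:

# Back-of-envelope VRAM check: weights only, excluding KV cache/activations.
params = 27.6e9
print(f"{params * 2 / 1e9:.1f} GB")  # 55.2 GB in bfloat16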

📚 Citation

@misc{darwin28b_opus_2026,
  title  = {Darwin-28B-Opus: Evolutionary Merging of Qwen3.6-27B with Claude-Opus-Distilled Reasoning},
  author = {FINAL-Bench / Darwin Research Team},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-28B-Opus}},
  note   = {Darwin V7 · Mother-centric Ratio Interpolation merge · 88.89% GPQA Diamond (3-stage)}
}

🔗 Related Darwin Models

  • Darwin-36B-Opus – MoE 36B, Qwen3.6-35B-A3B × Opus distilled, GPQA 88.4%
  • Darwin-31B-Opus – 31B dense, multilingual-strong reasoning, GPQA 85.9%
  • Darwin-27B-Opus – 27B dense (Qwen3.5 generation), GPQA 86.9%
  • Darwin-9B-NEG – 9B with Native Entropy Gating, GPQA 84.3%
  • Darwin-9B-Opus – the Qwen3.5-9B Darwin member
  • Darwin-4B-Genesis – smallest Darwin member

Darwin V7 · Qwen3.6 generation flagship · Sealed 2026-04-25 · FINAL-Bench
