
FINAL-Bench/Darwin-4B-David

Darwin-4B-David — The First Second-Generation Darwin Model

Gemma 4 E4B Dense | 4.5B Params | Thinking Mode | 128K Context | 140+ Languages | BF16 | Apache 2.0
The first-ever second-generation Darwin model — "Evolution of Evolution"


Overview

Darwin-4B-David is the first second-generation (Generation 2) model in Darwin history — a model evolved from an already-evolved model.

The first-generation Darwin-4B-Opus (Father) was evolved from the original gemma-4-E4B-it using the Darwin V6 engine. Darwin-4B-David was born by crossbreeding this first-generation evolved model with DavidAU's DECKARD-Expresso-Universe (Mother). This is the first realization of Darwin's core concept: "Merge = Evolve" applied recursively.

The name "David" pays tribute to the Mother model's creator, DavidAU, while evoking the biblical David who defeated Goliath — symbolizing how a small 4.5B model challenges models many times its size.


Family Tree

(Diagram: Darwin-4B-David family tree)

Generation Comparison

| | Gen 0 (Original) | Gen 1 (Opus) | Gen 2 (David) |
|---|---|---|---|
| Model | gemma-4-E4B-it | Darwin-4B-Opus | Darwin-4B-David |
| Parents | Google training | Original + Claude distill | Evolved model + DECKARD |
| GPQA Diamond | 58.6% | | 85.0% (+26.4%p) |
| Recursive evolution | None | | 2× (evolution of evolution) |
| Core genes | General-purpose | Claude reasoning | Reasoning + Creativity + Thinking |

Parent Models

| Role | Model | Characteristics |
|---|---|---|
| Father (Gen-1 evolved) | FINAL-Bench/Darwin-4B-Opus | Darwin V6 Gen-1, ARC-C 82.92%, Claude Opus reasoning distillation |
| Mother | DavidAU/DECKARD-Expresso-Universe | BF16, Unsloth deep tuning (5 in-house datasets), Universe logic/insight enhancement, thinking mode on by default |

Model Diagnostic Scan (MDS)

(Images: Father and Mother MDS scans, side by side)

Left: Father (Darwin-4B-Opus) — REASONING concentration in later layers (dist 0.4), MATH activation throughout. Already optimized through Gen-1 evolution.
Right: Mother (DECKARD-Expresso-Universe) — Strong KOREAN hotspot (dist 1.5), signature of Unsloth deep tuning. Remaining regions show uniform distribution.
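
The MDS tool itself ships inside Darwin V6 and is not public, but the core idea, a layer-by-layer scan of weight divergence between the two parents, fits in a few lines. In the hedged sketch below, the relative-L2 metric and the shape filtering are assumptions rather than the actual diagnostic; only the two repository IDs come from this card.

```python
# Hedged sketch of a weight-divergence scan in the spirit of MDS.
# The metric (relative L2 distance per shared tensor) is an assumption.
import torch
from transformers import AutoModelForCausalLM

father = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-4B-Opus", torch_dtype=torch.bfloat16)
mother = AutoModelForCausalLM.from_pretrained(
    "DavidAU/DECKARD-Expresso-Universe", torch_dtype=torch.bfloat16)

father_sd = father.state_dict()
for name, w_m in mother.state_dict().items():
    w_f = father_sd.get(name)
    if w_f is None or w_f.shape != w_m.shape:
        continue  # only compare tensors both parents share
    delta = (w_f.float() - w_m.float()).norm()
    rel = (delta / (w_f.float().norm() + 1e-8)).item()
    print(f"{name}\trelative L2 divergence = {rel:.4f}")
```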


Benchmarks

Key Results

| Benchmark | gemma-4-E4B-it (Original) | Darwin-4B-David (Gen-2) | Improvement | Conditions |
|---|---|---|---|---|
| GPQA Diamond | 58.6% | 85.0% | +26.4%p | Generative, maj@8, 50Q sampling |
| ARC-Challenge | 64.93% | 64.93% | ±0 | 25-shot, chat template, BF16, loglikelihood |
| KMMLU | 48.47% | 48.46% | ±0 | 5-shot, 225Q, loglikelihood |

GPQA Diamond Evaluation Details

GPQA Diamond (graduate-level scientific reasoning) was scored generatively, with thinking mode enabled.

| Setting | Value |
|---|---|
| Dataset | Idavidrein/gpqa, gpqa_diamond split |
| Questions | 50 (sampled from 198 total) |
| Evaluation method | maj@8 (8 independent generations per question; majority vote determines the final answer) |
| Prompt format | Epoch AI standard (`ANSWER: LETTER`) |
| Thinking mode | Enabled (chat_template, enable_thinking) |
| max_new_tokens | 4,096 |
| temperature | 1.0 |
| top_p / top_k | 0.95 / 64 |
| Precision | BF16 |
| Choice shuffling | Fixed seed per question (MD5 hash) |

Why maj@8:

  • A single sampled generation (pass@1) is vulnerable to run-to-run stochastic variation when do_sample is enabled
  • 8 independent generations with majority voting reflects the model's stable reasoning capability
  • maj@k is standard practice in frontier model benchmarks (AIME, MATH, etc.)

Note on 50-question sampling:

  • GPQA Diamond contains 198 questions in total; the 50 sampled questions cover 25.3% of the full set
  • 50 questions × 8 samples = 400 total generations, which substantially reduces single-run sampling variance
  • A full 198-question evaluation is planned; the maj@8 voting procedure is sketched below
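
The evaluation harness is not published, so the following is a hedged reconstruction of the protocol from the settings table alone. The MD5-seeded choice shuffle and the `ANSWER: LETTER` parsing follow the table; `generate_fn`, a callable that draws one sampled completion (temperature 1.0, top_p 0.95, top_k 64, up to 4,096 new tokens), is hypothetical glue.

```python
# Hedged reconstruction of the maj@8 protocol from the settings table.
# `generate_fn` is a hypothetical callable that samples one completion
# with temperature=1.0, top_p=0.95, top_k=64, max_new_tokens=4096.
import hashlib
import random
import re
from collections import Counter

def shuffle_choices(question: str, choices: list[str]) -> list[str]:
    # Fixed per-question seed derived from an MD5 hash, per the table.
    seed = int(hashlib.md5(question.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    shuffled = choices[:]
    rng.shuffle(shuffled)
    return shuffled

def maj_at_8(generate_fn, prompt: str) -> str | None:
    # 8 independent generations; the majority letter is the final answer.
    votes = []
    for _ in range(8):
        completion = generate_fn(prompt)
        match = re.search(r"ANSWER:\s*([A-D])", completion)
        if match:
            votes.append(match.group(1))
    return Counter(votes).most_common(1)[0][0] if votes else None
```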

Note on lm-eval Loglikelihood Results

ARC-Challenge and KMMLU scores are identical to the original model. This is expected when a DARE-TIES merge is evaluated with the loglikelihood method: it only compares token probabilities across the fixed answer choices, so it cannot capture differences in generation quality, reasoning chains, or creativity. The evolution effect shows up in generative evaluation (GPQA Diamond), where the gains emerge during step-by-step thinking-mode reasoning.
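
To make the distinction concrete, here is a minimal sketch of loglikelihood multiple-choice scoring, in the style of lm-eval but not its actual code: each answer string is scored by the summed log-probability of its tokens given the question, and the highest-scoring choice wins. No text is generated, so thinking-mode reasoning never enters the comparison.

```python
# Minimal sketch of loglikelihood choice scoring (not lm-eval's code).
# The choice is scored by the sum of its token log-probs given the context;
# nothing is ever generated, so reasoning-chain quality is invisible here.
import torch

def choice_logprob(model, tokenizer, context: str, choice: str) -> float:
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    # Position i predicts token i+1; sum over the choice tokens only.
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(ctx_len - 1, full_ids.shape[1] - 1)
    )

# predicted = max(choices, key=lambda c: choice_logprob(model, tokenizer, q, c))
```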


MRI-Guided Evolution Recipe

Darwin V6's Model MRI scanned weight divergence across all 42 layers and automatically assigned independent weight ratios to each layer.

| Layer Range | Weight | Strategy |
|---|---|---|
| Layer 0-3 | 0.81 | Absorb Mother's embedding-adjacent layers |
| Layer 15-16 | 0.91 | Maximum reinforcement of Mother's creativity/character layers |
| Layer 22-25 | 0.95 | Maximum absorption of Mother's KOREAN hotspot |
| Layer 26-27 | 0.40 | Father priority preservation zone |
| Layer 30-40 | 0.48 | Father REASONING/MATH preservation |
| Layer 40-42 | 0.62 | Output layer balance |
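
Darwin V6 itself is not released, so the following is a hedged sketch of how per-layer weights like those above could drive a DARE-style blend. The helper names (`mother_ratio`, `dare_delta`, `merge_tensor`) are hypothetical, the base-delta formulation is textbook DARE, and density 0.825 is simply the midpoint of the published 0.800 ~ 0.850 range; the transplant thresholds anticipate the comparison table below.

```python
# Hedged sketch of a per-layer DARE-style blend driven by the recipe above.
# Darwin V6 is not public; helper names and the base-model delta formulation
# are assumptions. Boundary layer 40 is assigned to the output band here.
import torch

LAYER_WEIGHTS = {
    range(0, 4): 0.81,    # absorb Mother's embedding-adjacent layers
    range(15, 17): 0.91,  # Mother creativity/character reinforcement
    range(22, 26): 0.95,  # Mother KOREAN hotspot (transplant territory)
    range(26, 28): 0.40,  # Father priority preservation
    range(30, 40): 0.48,  # Father REASONING/MATH preservation
    range(40, 43): 0.62,  # output layer balance
}

def mother_ratio(layer_idx: int, default: float = 0.5) -> float:
    for span, w in LAYER_WEIGHTS.items():
        if layer_idx in span:
            return w
    return default

def dare_delta(parent: torch.Tensor, base: torch.Tensor, density: float) -> torch.Tensor:
    # DARE: randomly drop (1 - density) of the task delta, rescale survivors.
    delta = parent - base
    keep = torch.rand_like(delta.float()) < density
    return delta * keep / density

def merge_tensor(father: torch.Tensor, mother: torch.Tensor, base: torch.Tensor,
                 layer_idx: int, density: float = 0.825) -> torch.Tensor:
    w = mother_ratio(layer_idx)
    # Transplant rule (see the Darwin V6 comparison below): extreme ratios
    # copy one parent outright instead of interpolating.
    if w < 0.15:
        return father.clone()
    if w > 0.85:
        return mother.clone()
    return base + (1 - w) * dare_delta(father, base, density) \
                + w * dare_delta(mother, base, density)
```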

Parent Comparison

(Figure: Father vs Mother layer-wise importance comparison)

Evolution Parameters

| Setting | Value |
|---|---|
| Merge method | DARE-TIES (direct PyTorch, no mergekit dependency) |
| Density | 0.800 ~ 0.850 |
| Normalization | normalize: true |
| Evolution method | Darwin mergekit (MRI-guided) |
| Population size | 20 |
| Phase 1 (proxy search) | 200 steps |
| Phase 2 (real merge) | 10 steps, top-5 elite |
| Fitness function | kmmlu_lite (Korean knowledge) |
| Best fitness | 0.8412 (84.12%) |
| Total time | 45.3 minutes (1× H100) |
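
As a rough picture of the two-phase search, the sketch below uses the open-source pycma package as a stand-in for Darwin V6's internal CMA-ES; the population size, step counts, and elite count come from the table, while both fitness functions are toy placeholders (the real Phase 2 merges the models and scores kmmlu_lite).

```python
# Hedged sketch of the two-phase CMA-ES search, with the open-source pycma
# package standing in for Darwin V6's internal optimizer. Both fitness
# functions are toy placeholders, not the real scoring pipeline.
import cma

N_LAYERS = 42

def proxy_fitness(genome):
    # Placeholder for the fast Phase-1 proxy (seconds per genome).
    return -sum((g - 0.7) ** 2 for g in genome)

def real_merge_fitness(genome):
    # Placeholder for Phase 2: perform the real merge, score kmmlu_lite.
    return proxy_fitness(genome)

es = cma.CMAEvolutionStrategy(
    [0.5] * N_LAYERS,                    # genome: per-layer Mother ratio
    0.2,                                 # initial step size (assumption)
    {"popsize": 20, "bounds": [0.0, 1.0], "verbose": -9},
)

# Phase 1: 200 proxy-search steps, no real merges performed.
for _ in range(200):
    genomes = es.ask()
    es.tell(genomes, [-proxy_fitness(g) for g in genomes])  # CMA-ES minimizes

# Phase 2: real merges for the top-5 elite only.
elites = sorted(es.ask(), key=proxy_fitness, reverse=True)[:5]
best_genome = max(elites, key=real_merge_fitness)
```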

Darwin V6 vs Conventional Merging

| Capability | mergekit (DARE-TIES) | Darwin V6 |
|---|---|---|
| Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
| Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (independent ratios per tensor) |
| Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
| Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
| Post-merge validation | Benchmark score only | Layer-by-layer health check: child vs. both parents, interference and function-loss detection |
| Search method | Manual tuning | CMA-ES evolution with adaptive genome |
| Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
| GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds each) → Phase 2 real merge (top-k only evaluated) |
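
The reproducibility row is straightforward to picture: if every stochastic step, such as DARE's random drop masks, is seeded from a hash of the genome, then identical genomes replay identical merges. The hashing scheme below is an assumption, not Darwin V6's actual genome_hash.

```python
# Sketch of genome_hash-style reproducibility: derive a deterministic seed
# from the genome so every random drop mask replays identically.
# The exact hashing scheme used by Darwin V6 is an assumption here.
import hashlib
import torch

def genome_seed(genome: list[float]) -> int:
    payload = ",".join(f"{g:.6f}" for g in genome).encode()
    return int(hashlib.sha256(payload).hexdigest(), 16) % (2**31)

genome = [0.81, 0.91, 0.95, 0.40, 0.48, 0.62]
torch.manual_seed(genome_seed(genome))  # same genome -> same merge output
```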

Significance of Second-Generation Evolution

  1. Proof of "Evolution of Evolution": The first systematic case of recursive evolution (2+ generations) in the open-source model merging community. Darwin V6 + MRI automates the entire process.

  2. 85% GPQA Diamond at 4.5B parameters: +26.4%p over the original 58.6%. This surpasses the 31B-class gemma-4-31B (84.3%) with only 4.5B parameters — an exceptional result in parameter efficiency.

  3. Apache 2.0 + Edge deployment: Preserves the Gemma 4 E4B architecture, enabling deployment on Jetson Orin NX 16GB and consumer GPUs with no commercial restrictions.

  4. Multimodal preservation: Father's vision encoder (~150M) and audio encoder (~300M) are frozen during evolution, maintaining image/video/audio input capabilities.

  5. Community synergy: Mother model creator DavidAU is an active contributor on HuggingFace. Darwin-4B-David symbolizes collaborative evolution within the open-source ecosystem.


Model Specifications

| Specification | Value |
|---|---|
| Architecture | Gemma 4 E4B Dense |
| Effective parameters | 4.5B (8B total with embeddings) |
| Layers | 42 |
| Sliding window | 512 tokens |
| Precision | BF16 |
| Context | 128K |
| Vocabulary | 262K |
| Languages | 140+ |
| Thinking | enable_thinking=True chain-of-thought |
| Vision encoder | ~150M (image, video) |
| Audio encoder | ~300M (speech recognition) |
| License | Apache 2.0 |

Usage

Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-4B-David", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-4B-David",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Disable Thinking Mode

```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```

VRAM Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 full precision | ~16 GB | |
| NVIDIA RTX 4090 24GB | 24 GB | Single GPU, very comfortable |
| NVIDIA RTX 3090 24GB | 24 GB | Single GPU, comfortable |
| NVIDIA RTX 4080 16GB | 16 GB | Single GPU |
| NVIDIA T4 16GB | 16 GB | Cloud/Colab friendly |
| Jetson Orin NX 16GB | 16 GB | Edge deployment ready |

Darwin Opus Family

| Model | Gen | Architecture | Parameters | Context | Base | GPQA Diamond |
|---|---|---|---|---|---|---|
| Darwin-4B-David | 🥈 Gen 2 | Dense (E4B) | 4.5B | 128K | Darwin-4B-Opus × DECKARD | 85.0% |
| Darwin-4B-Opus | Gen 1 | Dense (E4B) | 4.5B | 128K | gemma-4-E4B-it | |
| Darwin-9B-Opus | Gen 1 | Dense | 9B | 131K | Qwen3.5-9B | |
| Darwin-31B-Opus | Gen 1 | Dense | 31B | 256K | gemma-4-31B-it | |
| Darwin-35B-A3B-Opus | Gen 1 | MoE | 35B (3B active) | 256K | Qwen3.5-35B-A3B | 90.0% |

Roadmap

  • Full 198-question GPQA Diamond evaluation (maj@8)
  • MTI (Minimal Test-Time Intervention) serving — expected additional +9-11% reasoning accuracy
  • GRPO + TinyLoRA reinforcement learning
  • SSD self-distillation
  • Cross-architecture breeding research (Transformer × Mamba FFN transplantation)

Built By

| Item | Detail |
|---|---|
| Developer | VIDRAFT |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Generation | Generation 2 — first in Darwin history |
| Architecture | Gemma-4-E4B Dense |
| License | Apache 2.0 |

Citation

```bibtex
@misc{vidraft_darwin_4b_david_2026,
  title        = {Darwin-4B-David: First Second-Generation Evolutionary Merge Model},
  subtitle     = {Recursive Evolution Achieves 85\% GPQA Diamond with 4.5B Parameters},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-4B-David}}
}
```