
FINAL-Bench/Darwin-4B-David

Darwin-4B-David — The First Second-Generation Darwin Model

Gemma 4 E4B Dense | 4.5B Params | Thinking Mode | 128K Context | 140+ Languages | BF16 | Apache 2.0
The first-ever second-generation Darwin model — "Evolution of Evolution"


Overview

Darwin-4B-David is the first second-generation (Generation 2) model in Darwin history — a model evolved from an already-evolved model.

The first-generation Darwin-4B-Opus (Father) was evolved from the original gemma-4-E4B-it using the Darwin V6 engine. Darwin-4B-David was born by crossbreeding this first-generation evolved model with DavidAU's DECKARD-Expresso-Universe (Mother). This is the first realization of Darwin's core concept: "Merge = Evolve" applied recursively.

The name "David" pays tribute to the Mother model's creator, DavidAU, while evoking the biblical David who defeated Goliath — symbolizing how a small 4.5B model challenges models many times its size.


Family Tree

(Diagram: Darwin-4B-David family tree)

Generation Comparison

| | Gen 0 (Original) | Gen 1 (Opus) | Gen 2 (David) |
|---|---|---|---|
| Model | gemma-4-E4B-it | Darwin-4B-Opus | Darwin-4B-David |
| Parents | Google training | Original + Claude distill | Evolved model + DECKARD |
| GPQA Diamond | 58.6% | | 85.0% (+26.4%p) |
| Recursive evolution | None | | 2× (evolution of evolution) |
| Core genes | General-purpose | Claude reasoning | Reasoning + Creativity + Thinking |

Parent Models

| Role | Model | Characteristics |
|---|---|---|
| Father (Gen-1 evolved) | FINAL-Bench/Darwin-4B-Opus | Darwin V6 Gen-1, ARC-C 82.92%, Claude Opus reasoning distillation |
| Mother | DavidAU/DECKARD-Expresso-Universe | BF16, Unsloth deep tuning (5 in-house datasets), Universe logic/insight enhancement, thinking mode on by default |

Model Diagnostic Scan (MDS)

(Images: Father and Mother MDS scans, side by side)

Left: Father (Darwin-4B-Opus) — REASONING concentration in later layers (dist 0.4), MATH activation throughout. Already optimized through Gen-1 evolution.
Right: Mother (DECKARD-Expresso-Universe) — Strong KOREAN hotspot (dist 1.5), signature of Unsloth deep tuning. Remaining regions show uniform distribution.
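
The MDS tool itself ships inside Darwin V6 and is not public, but the core idea, a layer-by-layer scan of weight divergence between the two parents, fits in a few lines. In the hedged sketch below, the relative-L2 metric and the shape filtering are assumptions rather than the actual diagnostic; only the two repository IDs come from this card.

```python
# Hedged sketch of a weight-divergence scan in the spirit of MDS.
# The metric (relative L2 distance per shared tensor) is an assumption.
import torch
from transformers import AutoModelForCausalLM

father = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-4B-Opus", torch_dtype=torch.bfloat16)
mother = AutoModelForCausalLM.from_pretrained(
    "DavidAU/DECKARD-Expresso-Universe", torch_dtype=torch.bfloat16)

father_sd = father.state_dict()
for name, w_m in mother.state_dict().items():
    w_f = father_sd.get(name)
    if w_f is None or w_f.shape != w_m.shape:
        continue  # only compare tensors both parents share
    delta = (w_f.float() - w_m.float()).norm()
    rel = (delta / (w_f.float().norm() + 1e-8)).item()
    print(f"{name}\trelative L2 divergence = {rel:.4f}")
```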


Benchmarks

Key Results

| Benchmark | gemma-4-E4B-it (Original) | Darwin-4B-David (Gen-2) | Improvement | Conditions |
|---|---|---|---|---|
| GPQA Diamond | 58.6% | 85.0% | +26.4%p | Generative, maj@8, 50Q sampling |
| ARC-Challenge | 64.93% | 64.93% | ±0 | 25-shot, chat template, BF16, loglikelihood |
| KMMLU | 48.47% | 48.46% | ±0 | 5-shot, 225Q, loglikelihood |

GPQA Diamond Evaluation Details

GPQA Diamond (graduate-level scientific reasoning) was scored generatively, with thinking mode enabled.

| Setting | Value |
|---|---|
| Dataset | Idavidrein/gpqa, gpqa_diamond split |
| Questions | 50 (sampled from 198 total) |
| Evaluation method | maj@8 (8 independent generations per question; majority vote determines the final answer) |
| Prompt format | Epoch AI standard (`ANSWER: LETTER`) |
| Thinking mode | Enabled (chat_template, enable_thinking) |
| max_new_tokens | 4,096 |
| temperature | 1.0 |
| top_p / top_k | 0.95 / 64 |
| Precision | BF16 |
| Choice shuffling | Fixed seed per question (MD5 hash) |

Why maj@8:

  • A single sampled generation (pass@1) is vulnerable to run-to-run stochastic variation when do_sample is enabled
  • 8 independent generations with majority voting reflects the model's stable reasoning capability
  • maj@k is standard practice in frontier model benchmarks (AIME, MATH, etc.)

Note on 50-question sampling:

  • GPQA Diamond contains 198 questions in total; the 50 sampled questions cover 25.3% of the full set
  • 50 questions × 8 samples = 400 total generations, which substantially reduces single-run sampling variance
  • A full 198-question evaluation is planned; the maj@8 voting procedure is sketched below
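
The evaluation harness is not published, so the following is a hedged reconstruction of the protocol from the settings table alone. The MD5-seeded choice shuffle and the `ANSWER: LETTER` parsing follow the table; `generate_fn`, a callable that draws one sampled completion (temperature 1.0, top_p 0.95, top_k 64, up to 4,096 new tokens), is hypothetical glue.

```python
# Hedged reconstruction of the maj@8 protocol from the settings table.
# `generate_fn` is a hypothetical callable that samples one completion
# with temperature=1.0, top_p=0.95, top_k=64, max_new_tokens=4096.
import hashlib
import random
import re
from collections import Counter

def shuffle_choices(question: str, choices: list[str]) -> list[str]:
    # Fixed per-question seed derived from an MD5 hash, per the table.
    seed = int(hashlib.md5(question.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    shuffled = choices[:]
    rng.shuffle(shuffled)
    return shuffled

def maj_at_8(generate_fn, prompt: str) -> str | None:
    # 8 independent generations; the majority letter is the final answer.
    votes = []
    for _ in range(8):
        completion = generate_fn(prompt)
        match = re.search(r"ANSWER:\s*([A-D])", completion)
        if match:
            votes.append(match.group(1))
    return Counter(votes).most_common(1)[0][0] if votes else None
```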

Note on lm-eval Loglikelihood Results

ARC-Challenge and KMMLU scores are identical to the original model. This is expected when a DARE-TIES merge is evaluated with the loglikelihood method: it only compares token probabilities across the fixed answer choices, so it cannot capture differences in generation quality, reasoning chains, or creativity. The evolution effect shows up in generative evaluation (GPQA Diamond), where the gains emerge during step-by-step thinking-mode reasoning.
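
To make the distinction concrete, here is a minimal sketch of loglikelihood multiple-choice scoring, in the style of lm-eval but not its actual code: each answer string is scored by the summed log-probability of its tokens given the question, and the highest-scoring choice wins. No text is generated, so thinking-mode reasoning never enters the comparison.

```python
# Minimal sketch of loglikelihood choice scoring (not lm-eval's code).
# The choice is scored by the sum of its token log-probs given the context;
# nothing is ever generated, so reasoning-chain quality is invisible here.
import torch

def choice_logprob(model, tokenizer, context: str, choice: str) -> float:
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    # Position i predicts token i+1; sum over the choice tokens only.
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(ctx_len - 1, full_ids.shape[1] - 1)
    )

# predicted = max(choices, key=lambda c: choice_logprob(model, tokenizer, q, c))
```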


MRI-Guided Evolution Recipe

Darwin V6's Model MRI scanned weight divergence across all 42 layers and automatically assigned independent weight ratios to each layer.

| Layer Range | Weight | Strategy |
|---|---|---|
| Layer 0-3 | 0.81 | Absorb Mother's embedding-adjacent layers |
| Layer 15-16 | 0.91 | Maximum reinforcement of Mother's creativity/character layers |
| Layer 22-25 | 0.95 | Maximum absorption of Mother's KOREAN hotspot |
| Layer 26-27 | 0.40 | Father priority preservation zone |
| Layer 30-40 | 0.48 | Father REASONING/MATH preservation |
| Layer 40-42 | 0.62 | Output layer balance |
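
Darwin V6 itself is not released, so the following is a hedged sketch of how per-layer weights like those above could drive a DARE-style blend. The helper names (`mother_ratio`, `dare_delta`, `merge_tensor`) are hypothetical, the base-delta formulation is textbook DARE, and density 0.825 is simply the midpoint of the published 0.800 ~ 0.850 range; the transplant thresholds anticipate the comparison table below.

```python
# Hedged sketch of a per-layer DARE-style blend driven by the recipe above.
# Darwin V6 is not public; helper names and the base-model delta formulation
# are assumptions. Boundary layer 40 is assigned to the output band here.
import torch

LAYER_WEIGHTS = {
    range(0, 4): 0.81,    # absorb Mother's embedding-adjacent layers
    range(15, 17): 0.91,  # Mother creativity/character reinforcement
    range(22, 26): 0.95,  # Mother KOREAN hotspot (transplant territory)
    range(26, 28): 0.40,  # Father priority preservation
    range(30, 40): 0.48,  # Father REASONING/MATH preservation
    range(40, 43): 0.62,  # output layer balance
}

def mother_ratio(layer_idx: int, default: float = 0.5) -> float:
    for span, w in LAYER_WEIGHTS.items():
        if layer_idx in span:
            return w
    return default

def dare_delta(parent: torch.Tensor, base: torch.Tensor, density: float) -> torch.Tensor:
    # DARE: randomly drop (1 - density) of the task delta, rescale survivors.
    delta = parent - base
    keep = torch.rand_like(delta.float()) < density
    return delta * keep / density

def merge_tensor(father: torch.Tensor, mother: torch.Tensor, base: torch.Tensor,
                 layer_idx: int, density: float = 0.825) -> torch.Tensor:
    w = mother_ratio(layer_idx)
    # Transplant rule (see the Darwin V6 comparison below): extreme ratios
    # copy one parent outright instead of interpolating.
    if w < 0.15:
        return father.clone()
    if w > 0.85:
        return mother.clone()
    return base + (1 - w) * dare_delta(father, base, density) \
                + w * dare_delta(mother, base, density)
```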

Parent Comparison

(Figure: Father vs Mother layer-wise importance comparison)

Evolution Parameters

| Setting | Value |
|---|---|
| Merge method | DARE-TIES (direct PyTorch, no mergekit dependency) |
| Density | 0.800 ~ 0.850 |
| Normalization | normalize: true |
| Evolution method | Darwin mergekit (MRI-guided) |
| Population size | 20 |
| Phase 1 (proxy search) | 200 steps |
| Phase 2 (real merge) | 10 steps, top-5 elite |
| Fitness function | kmmlu_lite (Korean knowledge) |
| Best fitness | 0.8412 (84.12%) |
| Total time | 45.3 minutes (1× H100) |
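
As a rough picture of the two-phase search, the sketch below uses the open-source pycma package as a stand-in for Darwin V6's internal CMA-ES; the population size, step counts, and elite count come from the table, while both fitness functions are toy placeholders (the real Phase 2 merges the models and scores kmmlu_lite).

```python
# Hedged sketch of the two-phase CMA-ES search, with the open-source pycma
# package standing in for Darwin V6's internal optimizer. Both fitness
# functions are toy placeholders, not the real scoring pipeline.
import cma

N_LAYERS = 42

def proxy_fitness(genome):
    # Placeholder for the fast Phase-1 proxy (seconds per genome).
    return -sum((g - 0.7) ** 2 for g in genome)

def real_merge_fitness(genome):
    # Placeholder for Phase 2: perform the real merge, score kmmlu_lite.
    return proxy_fitness(genome)

es = cma.CMAEvolutionStrategy(
    [0.5] * N_LAYERS,                    # genome: per-layer Mother ratio
    0.2,                                 # initial step size (assumption)
    {"popsize": 20, "bounds": [0.0, 1.0], "verbose": -9},
)

# Phase 1: 200 proxy-search steps, no real merges performed.
for _ in range(200):
    genomes = es.ask()
    es.tell(genomes, [-proxy_fitness(g) for g in genomes])  # CMA-ES minimizes

# Phase 2: real merges for the top-5 elite only.
elites = sorted(es.ask(), key=proxy_fitness, reverse=True)[:5]
best_genome = max(elites, key=real_merge_fitness)
```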

Darwin V6 vs Conventional Merging

| Capability | mergekit (DARE-TIES) | Darwin V6 |
|---|---|---|
| Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
| Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (independent ratios per tensor) |
| Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
| Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
| Post-merge validation | Benchmark score only | Layer-by-layer health check: child vs. both parents, interference and function-loss detection |
| Search method | Manual tuning | CMA-ES evolution with adaptive genome |
| Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
| GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds each) → Phase 2 real merge (top-k only evaluated) |
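
The reproducibility row is straightforward to picture: if every stochastic step, such as DARE's random drop masks, is seeded from a hash of the genome, then identical genomes replay identical merges. The hashing scheme below is an assumption, not Darwin V6's actual genome_hash.

```python
# Sketch of genome_hash-style reproducibility: derive a deterministic seed
# from the genome so every random drop mask replays identically.
# The exact hashing scheme used by Darwin V6 is an assumption here.
import hashlib
import torch

def genome_seed(genome: list[float]) -> int:
    payload = ",".join(f"{g:.6f}" for g in genome).encode()
    return int(hashlib.sha256(payload).hexdigest(), 16) % (2**31)

genome = [0.81, 0.91, 0.95, 0.40, 0.48, 0.62]
torch.manual_seed(genome_seed(genome))  # same genome -> same merge output
```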

Significance of Second-Generation Evolution

  1. Proof of "Evolution of Evolution": The first systematic case of recursive evolution (2+ generations) in the open-source model merging community. Darwin V6 + MRI automates the entire process.

  2. 85% GPQA Diamond at 4.5B parameters: +26.4%p over the original 58.6%. This surpasses the 31B-class gemma-4-31B (84.3%) with only 4.5B parameters — an exceptional result in parameter efficiency.

  3. Apache 2.0 + Edge deployment: Preserves the Gemma 4 E4B architecture, enabling deployment on Jetson Orin NX 16GB and consumer GPUs with no commercial restrictions.

  4. Multimodal preservation: Father's vision encoder (~150M) and audio encoder (~300M) are frozen during evolution, maintaining image/video/audio input capabilities.

  5. Community synergy: Mother model creator DavidAU is an active contributor on HuggingFace. Darwin-4B-David symbolizes collaborative evolution within the open-source ecosystem.


Model Specifications

| Specification | Value |
|---|---|
| Architecture | Gemma 4 E4B Dense |
| Effective parameters | 4.5B (8B total with embeddings) |
| Layers | 42 |
| Sliding window | 512 tokens |
| Precision | BF16 |
| Context | 128K |
| Vocabulary | 262K |
| Languages | 140+ |
| Thinking | enable_thinking=True chain-of-thought |
| Vision encoder | ~150M (image, video) |
| Audio encoder | ~300M (speech recognition) |
| License | Apache 2.0 |

Usage

Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-4B-David", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-4B-David",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Disable Thinking Mode

```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```

VRAM Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 full precision | ~16 GB | |
| NVIDIA RTX 4090 24GB | 24 GB | Single GPU, very comfortable |
| NVIDIA RTX 3090 24GB | 24 GB | Single GPU, comfortable |
| NVIDIA RTX 4080 16GB | 16 GB | Single GPU |
| NVIDIA T4 16GB | 16 GB | Cloud/Colab friendly |
| Jetson Orin NX 16GB | 16 GB | Edge deployment ready |

Darwin Opus Family

| Model | Gen | Architecture | Parameters | Context | Base | GPQA Diamond |
|---|---|---|---|---|---|---|
| Darwin-4B-David | 🥈 Gen 2 | Dense (E4B) | 4.5B | 128K | Darwin-4B-Opus × DECKARD | 85.0% |
| Darwin-4B-Opus | Gen 1 | Dense (E4B) | 4.5B | 128K | gemma-4-E4B-it | |
| Darwin-9B-Opus | Gen 1 | Dense | 9B | 131K | Qwen3.5-9B | |
| Darwin-31B-Opus | Gen 1 | Dense | 31B | 256K | gemma-4-31B-it | |
| Darwin-35B-A3B-Opus | Gen 1 | MoE | 35B (3B active) | 256K | Qwen3.5-35B-A3B | 90.0% |

Roadmap

  • Full 198-question GPQA Diamond evaluation (maj@8)
  • MTI (Minimal Test-Time Intervention) serving — expected additional +9-11% reasoning accuracy
  • GRPO + TinyLoRA reinforcement learning
  • SSD self-distillation
  • Cross-architecture breeding research (Transformer × Mamba FFN transplantation)

Built By

| Item | Detail |
|---|---|
| Developer | VIDRAFT |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Generation | Generation 2 — first in Darwin history |
| Architecture | Gemma-4-E4B Dense |
| License | Apache 2.0 |

Citation

```bibtex
@misc{vidraft_darwin_4b_david_2026,
  title        = {Darwin-4B-David: First Second-Generation Evolutionary Merge Model},
  subtitle     = {Recursive Evolution Achieves 85\% GPQA Diamond with 4.5B Parameters},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-4B-David}}
}
```