Darwin V6: Diagnostic-Guided Evolutionary Model Merging
Full Model Family
Introducing the Darwin model family.
The Darwin V6 engine diagnoses two AI models at the tensor level, then uses evolutionary algorithms to find optimal merge ratios and combines them into a single model. Currently 6 models are publicly available across Gemma 4 and Qwen 3.5 architectures, with 8 repositories including GGUF quantized versions.
Model Family
Darwin-35B-A3B-Opus (Qwen 3.5 MoE)
| | |
|---|---|
| Father | Qwen3.5-35B-A3B-it |
| Mother | Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled |
| Architecture | 35B total / 3B active (MoE) |
| GPQA Diamond | 90.0% (loglikelihood, full 198 questions) |
| ARC-Challenge | 85.08% |
| MMMLU | 85.0% |
| vs Father | GPQA +5.8 pp |
| Model | Darwin-35B-A3B-Opus |
Darwin-35B-A3B-Opus Q8 GGUF (Official Quantization)
8-bit quantized version. Compatible with llama.cpp, Ollama, and LM Studio.
Darwin-35B-A3B-Opus GGUF (bartowski Quantization)
Multiple quantization levels by bartowski (Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc.). Community-standard quantization format.
bartowski/FINAL-Bench_Darwin-35B-A3B-Opus-GGUF
Darwin-31B-Opus (Gemma 4 Dense)
| | |
|---|---|
| Father | google/gemma-4-31B-it |
| Mother | TeichAI/gemma-4-31B-it-Claude-Opus-Distill |
| Architecture | Dense 31B, 256K context, 140+ languages, Vision, Thinking mode |
| GPQA Diamond | 66.0% (generative thinking, greedy, 50Q) |
| Father (same condition) | 60.0% (+10% relative improvement) |
| ARC-Challenge | 82.89% |
| Model | Darwin-31B-Opus |
| Demo | Live Demo |
Darwin-9B-Opus (Qwen 3.5 Dense)
| | |
|---|---|
| Father | Qwen3.5-9B |
| Mother | Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled |
| Architecture | Dense 9B |
| Model | Darwin-9B-Opus |
| Demo | Live Demo |
Darwin-4B-Opus (Gemma 4 E4B)
| | |
|---|---|
| Father | google/gemma-4-E4B-it |
| Mother | arsovskidev/Gemma-4-E4B-Claude-4.6-Opus-Reasoning-Distilled |
| Architecture | Effective 4B (total 11.4B), 128K context, text + image + audio |
| ARC-Challenge | 82.92% |
| Note | Can run in-browser via WebGPU after ONNX conversion |
| Model | Darwin-4B-Opus |
Model Diagnostic Scan (MDS)
Left: Father (gemma-4-E4B-it) — balanced generalist. Right: Mother (Claude-Opus-Distill) — reasoning concentration in late layers from Claude Opus distillation.
What Darwin V6 Does
Conventional merging tools (mergekit, etc.) apply a single ratio to all tensors. Set ratio=0.5 and every tensor in the model blends at the same proportion, with no distinction between which tensors matter for reasoning versus coding.
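The uniform approach can be sketched in a few lines. This is a minimal illustration in NumPy; the state dicts and tensor names are toy stand-ins, not real model weights:

```python
import numpy as np

def uniform_merge(father: dict, mother: dict, ratio: float = 0.5) -> dict:
    """Blend every tensor at the same ratio -- no per-tensor distinction."""
    return {name: (1.0 - ratio) * father[name] + ratio * mother[name]
            for name in father}

# Toy state dicts standing in for real model weights.
father = {"ffn.w": np.ones((2, 2)), "attn.w": np.zeros((2, 2))}
mother = {"ffn.w": np.zeros((2, 2)), "attn.w": np.ones((2, 2))}
child = uniform_merge(father, mother, ratio=0.5)
```

Whether `ffn.w` carries reasoning and `attn.w` carries long-context handling is irrelevant here: both get the same 50/50 blend.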
Darwin V6 diagnoses both parent models at the tensor level before merging. This process is called MDS (Model Diagnostic Scan) and consists of two stages.
First, static tensor analysis. It measures Shannon entropy (information density), standard deviation (activation spread), and L2 norm (energy) for every tensor.
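The three static measures can be sketched as follows, with plain NumPy arrays standing in for model tensors (the 64-bin histogram for entropy is an illustrative assumption, not Darwin's actual binning):

```python
import numpy as np

def static_profile(tensor: np.ndarray, bins: int = 64) -> dict:
    """Static diagnostics for one tensor: Shannon entropy of its value
    histogram (information density), standard deviation (activation
    spread), and L2 norm (energy)."""
    hist, _ = np.histogram(tensor, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                                  # drop empty bins
    return {
        "entropy": float(-(p * np.log2(p)).sum()),
        "std": float(tensor.std()),
        "l2_norm": float(np.linalg.norm(tensor)),
    }
```

A constant tensor scores zero on all three measures; a tensor with a rich value distribution scores high entropy, signaling dense information.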
Second, functional probing. Five diagnostic prompts (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) are passed through the model, measuring cosine distance when each layer is skipped. This determines each layer's functional importance.
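The layer-skip idea can be illustrated with a toy residual stack. The forward function, layer shapes, and single probe vector here are illustrative assumptions; the real engine runs actual domain prompts through the transformer:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def layer_importance(layers, x: np.ndarray, skip: int) -> float:
    """Compare the full forward pass against one with layer `skip`
    removed; a large cosine distance means the layer matters."""
    def forward(skip_idx=None):
        h = x.copy()
        for i, w in enumerate(layers):
            if i == skip_idx:
                continue
            h = h + np.tanh(w @ h)                # toy residual block
        return h
    return cosine_distance(forward(), forward(skip))

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
x = rng.standard_normal(8)
scores = [layer_importance(layers, x, i) for i in range(4)]
```

Running this once per probe domain yields a per-layer, per-domain importance map, which is what the MDS heatmaps visualize.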
The two results are combined to produce per-tensor optimal ratios:
```
combined    = static(entropy, std, norm) * 0.4 + probe(cosine_distance) * 0.6
final_ratio = mri_ratio * mri_trust + genome_ratio * (1 - mri_trust)
```
When one parent is overwhelmingly superior for a tensor (ratio < 0.15 or > 0.85), Darwin transplants that parent's tensor directly instead of interpolating, so no interpolation noise is introduced. The mri_trust parameter itself is optimized by a CMA-ES evolutionary algorithm, so the optimal transplant intensity is determined automatically for each model pair.
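Putting the two formulas and the transplant rule together gives a small sketch (the 0.4/0.6 weights and 0.15/0.85 thresholds follow the text; function names and the NumPy setting are illustrative):

```python
import numpy as np

def per_tensor_ratio(static_score, probe_score, genome_ratio, mri_trust):
    """Blend the MDS-derived ratio with the evolved genome ratio."""
    mri_ratio = 0.4 * static_score + 0.6 * probe_score
    return mri_trust * mri_ratio + (1.0 - mri_trust) * genome_ratio

def merge_tensor(father, mother, ratio, lo=0.15, hi=0.85):
    """Interpolate, except at extreme ratios, where the dominant
    parent's tensor is transplanted directly (no interpolation noise)."""
    if ratio < lo:
        return father.copy()
    if ratio > hi:
        return mother.copy()
    return (1.0 - ratio) * father + ratio * mother
```

At `ratio=0.10` the Father tensor is copied bit-for-bit; at `ratio=0.50` the two parents blend evenly.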
After merging, a Health Check compares the child model against both parents layer-by-layer, detecting interference or function loss.
The base merge operations (DARE-TIES, SLERP, Linear) are implemented directly in PyTorch. mergekit is not used. The core of Darwin is not the merge algorithm itself, but the per-tensor diagnostic system and evolutionary ratio optimization built on top of it.
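As one example of such a primitive, SLERP between two weight tensors might look like the following. This is a generic sketch of spherical interpolation in NumPy, not Darwin's exact PyTorch code:

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two tensors, treated as
    flat vectors; falls back to plain lerp when nearly colinear."""
    a_flat, b_flat = a.ravel(), b.ravel()
    a_n = a_flat / (np.linalg.norm(a_flat) + eps)
    b_n = b_flat / (np.linalg.norm(b_flat) + eps)
    dot = np.clip(a_n @ b_n, -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < 1e-4:                      # nearly parallel: plain lerp
        return (1.0 - t) * a + t * b
    s = np.sin(theta)
    out = (np.sin((1.0 - t) * theta) / s) * a_flat \
        + (np.sin(t * theta) / s) * b_flat
    return out.reshape(a.shape)
```

Unlike linear interpolation, SLERP follows the arc between the two weight directions, which preserves vector magnitude better at intermediate `t`.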
Darwin V6 vs mergekit
| Capability | mergekit | Darwin V6 |
|---|---|---|
| Ratio selection | Uniform ratio across all tensors | Independent ratio per tensor |
| Pre-merge analysis | None | Static tensor profiling + 5-probe functional analysis |
| Post-merge validation | Benchmark score only | Layer-by-layer Health Check (interference + function loss) |
| Search method | Manual tuning | CMA-ES evolutionary search, 14-dimensional adaptive genome |
| Transplant | Not supported | Direct transplant when ratio is extreme, zero interpolation |
What the Evolutionary Algorithm Discovered
The optimal genome for Darwin-31B-Opus reveals a striking pattern.
ffn_ratio=0.93 — Mother (Claude Opus Distill) dominates FFN layers at 93%. The evolutionary algorithm independently discovered that the core of reasoning capability is stored in FFN weights.
block_5 (L50-L59)=0.86 — The final 10 layers out of 60 favor Mother at 86%. The reasoning core is concentrated in the latter half of the model.
attn_ratio=0.32 — Attention layers go the opposite direction, with Father (Gemma 4) at 68%. This preserves the original multimodal and long-context processing capabilities.
This pattern aligns precisely with the MDS heatmap showing Mother's functional distribution across layers. The evolutionary algorithm reached the same conclusion without directly seeing the MDS results.
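A toy (mu, lambda) evolution strategy illustrates how such a search converges on a pattern like this. The real engine uses CMA-ES with covariance adaptation; the fitness function, population sizes, and target vector here are illustrative stand-ins built from the ratios quoted above:

```python
import numpy as np

def toy_es(fitness, dim=14, pop=16, elite=4, sigma=0.15, gens=40, seed=0):
    """Minimal (mu, lambda) evolution strategy: sample a Gaussian cloud
    around the mean, keep the elite, recenter, shrink the step size.
    A stand-in for CMA-ES search over a 14-dimensional merge genome."""
    rng = np.random.default_rng(seed)
    mean = np.full(dim, 0.5)                    # start at a 50/50 merge
    for _ in range(gens):
        cand = np.clip(mean + sigma * rng.standard_normal((pop, dim)), 0.0, 1.0)
        scores = np.array([fitness(g) for g in cand])
        mean = cand[np.argsort(scores)[-elite:]].mean(axis=0)
        sigma *= 0.95                           # crude step-size decay
    return mean

# Hypothetical fitness rewarding high FFN ratios and low attention
# ratios, echoing the 0.93 / 0.32 pattern the real search found.
target = np.array([0.93] * 7 + [0.32] * 7)
best = toy_es(lambda g: -np.abs(g - target).sum())
```

The search is never told where reasoning lives; it only sees fitness, yet the genome drifts toward the same FFN-heavy, attention-light pattern.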
Benchmark Summary
| Model | Benchmark | Score | Father | Improvement |
|---|---|---|---|---|
| Darwin-35B-A3B-Opus | GPQA Diamond (loglikelihood, 198Q) | 90.0% | 84.2% | +5.8 pp |
| Darwin-35B-A3B-Opus | MMMLU | 85.0% | - | - |
| Darwin-35B-A3B-Opus | ARC-Challenge | 85.08% | - | - |
| Darwin-31B-Opus | GPQA Diamond (generative, 50Q) | 66.0% | 60.0% | +10% relative |
| Darwin-31B-Opus | ARC-Challenge | 82.89% | - | - |
| Darwin-4B-Opus | ARC-Challenge | 82.92% | - | - |
All benchmarks were measured under identical conditions (same questions, same seed, same decoding settings) for each Darwin model and its Father. The Gemma 4 architecture's multimodal wrapper limits compatibility with lm-eval's loglikelihood method, so only generative evaluation produces valid results for Gemma 4 based models.
Try It
Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-35B-A3B-Opus")
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-35B-A3B-Opus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
GGUF (Ollama)
```shell
ollama run FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF
```
Live Demos
Run Darwin V6 Yourself
The Darwin V6 engine is available as a Space. If you have a compatible model pair, you can run diagnostic-guided merging yourself:
All Links
Models
| Model | Link |
|---|---|
| Darwin-35B-A3B-Opus | huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus |
| Darwin-31B-Opus | huggingface.co/FINAL-Bench/Darwin-31B-Opus |
| Darwin-9B-Opus | huggingface.co/FINAL-Bench/Darwin-9B-Opus |
| Darwin-4B-Opus | huggingface.co/FINAL-Bench/Darwin-4B-Opus |
GGUF
| Version | Link |
|---|---|
| Q8 Official | FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF |
| bartowski | bartowski/FINAL-Bench_Darwin-35B-A3B-Opus-GGUF |
Demos
| Model | Link |
|---|---|
| 31B Demo | spaces/FINAL-Bench/Darwin-31B-Opus |
| 35B Demo | spaces/FINAL-Bench/Darwin-35B-A3B-Opus |
| 9B Demo | spaces/FINAL-Bench/Darwin-9B-Opus |
Benchmarks
| Leaderboard | Link |
|---|---|
| FINAL Bench | spaces/FINAL-Bench/Leaderboard |
| ALL Bench | spaces/FINAL-Bench/all-bench-leaderboard |
License & Credits
All Darwin models are Apache 2.0.
DARE-TIES algorithm: DARE (Yu et al., 2023) and TIES-Merging (Yadav et al., 2023), re-implemented directly rather than library-dependent.
Parent models by: Google DeepMind (Gemma 4), Alibaba (Qwen 3.5), TeichAI, Jackrong, arsovskidev (Claude Opus Distill).
Darwin V6 engine and models by [VIDRAFT]