Training mRNA Language Models Across 25 Species for $165
Part II: Building the Pipeline, From Structure Prediction to Codon Optimization
By OpenMed, Open-Source Agentic AI for Healthcare & Life Sciences
TL;DR: We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.
Contents
- What We Built
- The Architecture Exploration
- The Pipeline
- Scaling to Multi-Species
- The End-to-End Workflow
- Where This Stands and What's Next
- References
Imagine going from a therapeutic protein concept to a synthesis-ready, codon-optimized DNA sequence in an afternoon. That is the pipeline OpenMed set out to build, and this post documents the process from start to finish.
In Part I, we mapped the landscape of protein AI: the architectures powering structure prediction, the open-source tools available for protein design, and the ecosystem of models from AlphaFold to ESMFold. That was a survey. This is the build.
At OpenMed, we set out to build a complete pipeline that takes a protein idea from concept to expression-ready DNA. That means three stages: predict the 3D structure of a protein, design amino acid sequences that fold into that structure, and optimize the underlying DNA codons so the protein actually expresses in the target organism. Along the way, we ran extensive experiments comparing transformer architectures for codon optimization, scaled our best model to 25 species, and built tooling that ties it all together.
This is not a polished success story. It is a transparent account of what worked, what surprised us, and what we would do differently, with runnable code and full results at every step.
1. What We Built
The pipeline has three components, each addressing a different stage of the protein engineering workflow described in Part I. Structure prediction determines what shape a protein takes. Sequence design determines which amino acids will produce that shape. Codon optimization determines which DNA will produce those amino acids efficiently in a living cell.
| Component | What We Did | Key Result |
|---|---|---|
| Protein Folding | ESMFold v1 predictions on 30 protein chains | Avg PTM: 0.79, working batch pipeline |
| Sequence Design | ProteinMPNN on scaffold 7K00 | 42% sequence recovery |
| mRNA Optimization | Trained multiple transformer variants on 250k CDS, then scaled to 381k sequences across 25 species | CodonRoBERTa-large-v2: perplexity 4.10, CAI 0.40; Multi-species suite: 4 models spanning 25 organisms (55 GPU-hours) |
The mRNA optimization work is where we invested the most effort, and where we have the most to share. The folding and design components use established tools (ESMFold from Meta, ProteinMPNN from the Baker Lab, both covered in depth in Part I). The codon optimization component is entirely ours: new models, new training infrastructure, new evaluation metrics.
2. The Architecture Exploration
In Part I, we surveyed the protein AI landscape and noted that most biological language models are adaptations of NLP architectures. The open question was which architecture. BERT variants dominate protein modeling (ESM-2, ProtTrans), but codon sequences have different statistical properties than both natural language and amino acid sequences. Codons are triplets drawn from a small 64-token alphabet, with strong positional dependencies and species-specific usage biases. We needed to find out what works from first principles.
The core question: which transformer architecture works best for codon-level language modeling?
This matters because codon optimization is crucial for therapeutic mRNA, vaccines, and recombinant protein production. The genetic code is degenerate: the same protein can be encoded by astronomically many different DNA sequences, but some codon arrangements express 100x better than others. The Pfizer-BioNTech COVID vaccine, for example, was codon-optimized for human expression. We wanted to build a model that could learn these preferences directly from natural coding sequences, rather than relying on hand-crafted frequency tables.
The Contenders
We started with a small CodonBERT baseline (6M params, following Sanofi's published architecture) and scaled up through two families: ModernBERT, which represented the latest efficiency innovations from the NLP community, and RoBERTa, the proven workhorse behind Meta's ESM protein language models.
| Model | Parameters | Architecture | Hypothesis |
|---|---|---|---|
| CodonBERT (baseline) | 6M | BERT-tiny (6 layers) | Minimal baseline to establish floor performance |
| ModernBERT-base | 90M | ModernBERT (22 layers, RoPE) | Modern innovations: long context, efficient attention |
| CodonRoBERTa-base | 92M | RoBERTa (12 layers) | Proven MLM architecture, same family as ESM-2 |
| CodonRoBERTa-large | 312M | RoBERTa (24 layers) | Test whether more parameters improve codon modeling |
| CodonRoBERTa-large-v2 | 312M | RoBERTa (24 layers, refined) | Same architecture, better hyperparameters |
The choice of RoBERTa was deliberate. As we discussed in Part I, Meta's ESM-2 (which powers ESMFold) is itself a RoBERTa variant trained on protein sequences. We hypothesized that the same architecture family that learned amino acid patterns might also learn codon patterns. ModernBERT was the counterpoint: a 2024 architecture with RoPE embeddings, Flash Attention, and alternating local/global attention layers, representing everything the NLP community had learned since RoBERTa's 2019 release.
The Training Setup
To ensure a fair comparison, every model was trained on identical data with the same evaluation protocol. We used 250,000 coding sequences (CDS) from E. coli RefSeq, covering chromosome and complete assembly accessions. This is a clean, well-annotated dataset where codon usage patterns are well-characterized in the literature, giving us ground truth to validate against.
Our tokenizer maps each codon to a single token: 64 codons plus 5 special tokens (PAD, UNK, CLS, SEP, MASK) for a 69-token vocabulary. This is intentionally minimal. Unlike BPE tokenizers used in NLP, where subword boundaries are statistically learned, codon boundaries are biologically defined. Every three nucleotides encode one amino acid. Our tokenizer respects this.
Training ran on 4 A100 GPUs (80GB) with FSDP sharding, using 15,000 to 25,000 steps depending on model size. All models used masked language modeling (MLM) with 15% masking rate, the same objective used by ESM-2 for protein sequences.
The Results
| Model | Perplexity | CAI Spearman | Synonymous Recovery | Status |
|---|---|---|---|---|
| CodonRoBERTa-large-v2 | 4.10 | 0.404 | 7.7% | Best Overall |
| CodonRoBERTa-base | 4.01 | 0.219 | 8.5% | Best Efficiency |
| CodonRoBERTa-large | 4.01 | 0.025 | 7.6% | Good MLM, weak bio |
| ModernBERT-base | 26.24 | 0.070 | 8.5% | Underperformed |
| CodonBERT (baseline) | 17.18 | -0.629 | 0.0% | Baseline |
The result was unambiguous: RoBERTa outperformed ModernBERT by 6x on perplexity (4.01 vs 26.24). This was not a marginal difference. ModernBERT, despite its modern attention patterns and efficient architecture, fundamentally underperformed the classic RoBERTa design on codon sequences.
What We Learned
1. Pre-trained NLP weights do not transfer to biology
We initialized ModernBERT from its published English-language checkpoint, expecting the learned attention patterns to provide a useful starting point. They did not. Our best explanation: ModernBERT's pre-training on English text instilled inductive biases (subword frequency distributions, positional attention patterns) that actively interfere with learning codon statistics. RoBERTa, initialized randomly and trained purely on biological data, had no such baggage. This aligns with what the field has seen more broadly: ESM-2 and ProtTrans both train from scratch on biological data rather than fine-tuning from NLP checkpoints.
2. Hyperparameter tuning unlocked biological alignment
This was the most surprising and practically important finding of the exploration. Compare CodonRoBERTa-large v1 and v2:
| Version | Perplexity | CAI Spearman |
|---|---|---|
| v1 (lr=1e-4) | 4.01 | 0.025 |
| v2 (lr=5e-5, longer warmup) | 4.10 | 0.404 |
Same architecture. Same data. Same number of parameters. The only differences: half the learning rate and a longer warmup (2,000 steps vs 1,000). Yet v2's predicted codon likelihoods are 16x better correlated with real biological codon preferences, as measured by Codon Adaptation Index.
The perplexity actually got slightly worse (4.10 vs 4.01), which means v2 is marginally less accurate at predicting the exact masked codon. But it is dramatically better at predicting codons that biology actually uses. The slower training schedule let the model settle into representations that capture genuine biological signal rather than overfitting to surface statistics.
This is a crucial insight for anyone training biological language models: MLM loss alone does not measure biological relevance. Domain-specific metrics are essential. In our case, CAI correlation turned out to be the metric that separates a useful model from a technically impressive but biologically meaningless one.
3. The base model is remarkably efficient
CodonRoBERTa-base (92M params) achieved nearly identical perplexity to the large model (4.01 vs 4.10) with 3.4x fewer parameters and proportionally less training time. Its CAI correlation (0.219) is lower than v2's (0.404), but still well above the baseline and ModernBERT. For teams without access to multi-GPU clusters, the base model is a practical choice that captures most of the codon modeling performance at a fraction of the cost.
3. The Pipeline
In Part I, we described the three-stage workflow that most computational protein engineering projects follow: predict structure, design sequences, optimize codons. Here we run each stage with real data and report what we actually got.
- Fold: Predict the 3D structure (ESMFold)
- Design: Generate sequences that fold into that structure (ProteinMPNN)
- Optimize: Choose the best codons for expression (CodonRoBERTa)
3.1 Protein Folding with ESMFold
ESMFold architecture. The model parses a single amino acid sequence through the ESM-2 encoder, then predicts 3D coordinates via a folding trunk and structure module. Figure from Bertoline et al., Biomolecules 2024, CC-BY 4.0.
As covered in Part I, ESMFold is Meta's single-sequence structure predictor. It uses ESM-2, a 15-billion-parameter protein language model trained on 65 million UniRef sequences, as its backbone. The key advantage over AlphaFold 2 is speed: ESMFold skips the computationally expensive multiple sequence alignment (MSA) step and predicts structures directly from a single amino acid sequence. That makes it seconds per protein instead of hours.
The tradeoff is accuracy. ESMFold achieves ~0.87 TM-score on CASP14 targets vs. AlphaFold's ~0.92. For rapid prototyping and candidate screening, that gap is acceptable. When a pipeline generates 100 designed sequences and needs to refold all of them to check viability, speed matters more than the last few percentage points of accuracy.
Our Results: 30 Protein Chains
We ran ESMFold on 30 protein chains sourced from the Protein Data Bank. These are real experimental structures with known ground truth, spanning sequence lengths from 211 to 519 residues. The set deliberately includes both easy targets (single-domain proteins) and challenging ones (chains from a multi-chain ribosomal complex, PDB 7K00) to stress-test the model.
import json
# Load our actual results
metrics = json.load(open('outputs/esmfold_metrics.json'))
# Summary statistics
n_chains = len(metrics) # 30
avg_plddt = sum(m['mean_plddt'] for m in metrics) / n_chains # 33.8
avg_ptm = sum(m['ptm'] for m in metrics) / n_chains # 0.79
print(f"Chains: {n_chains}")
print(f"Average pLDDT: {avg_plddt:.1f}")
print(f"Average PTM: {avg_ptm:.2f}")
Results breakdown:
| Metric | Value | Interpretation |
|---|---|---|
| Chains predicted | 30 | - |
| Average pLDDT | 33.8 | Per-residue confidence (lower than expected) |
| Average PTM | 0.79 | Topology confidence (good) |
| Sequence lengths | 211-519 | Typical protein sizes |
The PTM scores are solid: anything above 0.5 suggests the model has the overall topology correct, and our average of 0.79 indicates high confidence in the predicted folds. The pLDDT scores are lower than published ESMFold benchmarks, which initially concerned us. The explanation turned out to be our test set composition: the ribosomal chains from 7K00 are part of a large multi-chain complex, and ESMFold (which predicts single chains in isolation) cannot model the inter-chain contacts that stabilize these structures. For single-domain proteins in our set, pLDDT scores were consistently above 70.
Running ESMFold
# Activate environment
source .env_esmfold/bin/activate
# Batch prediction
python scripts/esmfold_batch.py \
--seq_dir data/pdb/sequences \
--out_dir data/esmfold/out \
--metrics outputs/esmfold_metrics.json \
--device cuda:0
Each prediction takes ~10-30 seconds on an A100. The output includes:
- PDB structure files
- pLDDT scores (per-residue confidence, 0-100)
- PTM scores (topology confidence, 0-1)
- Predicted Aligned Error (PAE) matrices
3.2 Sequence Design with ProteinMPNN
ProteinMPNN architecture. (A) The encoder processes backbone atom distances; the decoder autoregressively generates amino acid sequences. (B) Random decoding order improves diversity. (C) Tied positions enable symmetric and multi-state design. Figure from Dauparas et al., Science 2022, CC-BY 4.0.
As we described in Part I, protein design is the inverse of protein folding. Folding goes sequence to structure: given amino acids, predict the 3D shape. Inverse folding goes the other way: given a target 3D shape, find amino acid sequences that will fold into it.
ProteinMPNN, from David Baker's lab at the University of Washington, is the current gold standard for this task. It was published in Science in 2022 and has since been validated experimentally: designed sequences fold into their target structures at rates far exceeding random or earlier computational methods. The architecture treats the protein backbone as a graph, where nodes are amino acid positions and edges connect spatially proximate residues (K-nearest neighbors in 3D). A message-passing neural network propagates information through this graph, then autoregressively generates a sequence one residue at a time.
Our Results: Scaffold 7K00
We ran ProteinMPNN on PDB structure 7K00 (a large multi-chain ribosomal complex):
python proteinmpnn/protein_mpnn_run.py \
--pdb_path data/pdb/raw/7K00.cif \
--out_folder outputs/proteinmpnn_smoke \
--num_seq_per_target 3 \
--sampling_temp 0.1
Results:
| Metric | Value |
|---|---|
| Sequences generated | 3 |
| Best score | 0.89 |
| Sequence recovery | ~42% |
Here's what the output looks like:
>7K00, score=1.7100, global_score=1.7100
GIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKI/MKRTFQPSVLK...
>T=0.1, sample=1, score=0.8857, seq_recovery=0.4203
SKKVVIKLVCSCGCGFEYCDFRDIEKNPEKIERVLYCPICQKYVLFTEAPL/PPGPFRPDREV...
The first line is the native (natural) sequence extracted from the crystal structure. Subsequent lines are ProteinMPNN's designed variants. At temperature 0.1 (low randomness), the model recovers ~42% of the original amino acids, purely from 3D geometry. This is a strong result: it means the model independently rediscovered nearly half the residues that evolution selected, using only the backbone coordinates as input.
Several practical notes from running ProteinMPNN. Scores are negative log-likelihoods, so lower is better. The 42% recovery rate is typical for well-resolved structures and consistent with the original paper's benchmarks. Higher sampling temperatures produce more diverse but riskier sequences. For real design work, the most powerful feature is partial design: catalytic residues, binding site amino acids, or any positions with known functional importance can be fixed in place, while ProteinMPNN redesigns only the scaffold around them. This is the standard approach for engineering a more stable version of an enzyme without disrupting its active site.
3.3 mRNA Optimization
This is where the pipeline transitions from existing tools to our own models. ESMFold and ProteinMPNN are established, well-validated software that we integrated. Codon optimization is where we built something new.
Why Codon Choice Matters
Codon usage frequencies vary dramatically between organisms. These heatmaps compare codon preferences across E. coli, yeast, and CHO cells, the three expression hosts covered by our multi-species models. Figure from Kim et al., J. Microbiol. Biotechnol. 2025, CC-BY 4.0.
The genetic code is degenerate: most amino acids are encoded by multiple codons. Leucine, for example, has six: TTA, TTG, CTT, CTC, CTA, and CTG. All six produce the same amino acid in the final protein. Methionine and tryptophan are the exceptions, with only one codon each.
This redundancy means that for any given protein, there are astronomically many DNA sequences that encode it. A typical 300-amino-acid protein has roughly 10^150 possible codon combinations. They all produce the same amino acid chain, but they do not all produce the same amount of protein. Codon choice affects translation speed (because tRNA molecules are not equally abundant for all codons), mRNA stability (because the nucleotide sequence affects how quickly the transcript degrades), co-translational folding (because translation pauses at rare codons give the protein time to fold), and immune recognition (because the innate immune system in mammalian cells can distinguish native from foreign mRNA patterns). In practice, bad codon choices can reduce protein expression by 100x. This is why every mRNA vaccine, every recombinant protein therapeutic, and every gene therapy vector goes through codon optimization.
The Traditional Approach and Why It Is Limited
The scale of the codon optimization problem. For a typical mRNA, there are over 10^600 possible codon sequences encoding the same protein. The challenge is finding the arrangement that maximizes expression. Figure from Zhang et al. (LinearDesign), Nature 2023, CC-BY 4.0.
The classical method is simple: measure which codons appear most frequently in highly-expressed genes of the target organism, then replace every codon with the most frequent synonym. This is codified as the Codon Adaptation Index (CAI), a per-sequence score that measures how closely the codon usage matches the organism's preferred distribution.
CAI-based optimization works, but it is crude. It treats each codon position independently, ignoring the sequence context. It produces repetitive sequences (the same "optimal" codon used everywhere for a given amino acid), which can cause ribosome stalling and mRNA secondary structure problems. And it misses complex dependencies: the optimal codon at position 50 might depend on what codons are at positions 48 and 52, which a frequency table cannot capture.
Our Approach: Masked Language Modeling
We reframe codon optimization as a language modeling problem. Instead of looking up frequencies in a table, we train a transformer on hundreds of thousands of natural coding sequences using masked language modeling (MLM), the same pre-training objective used by BERT, RoBERTa, and Meta's ESM protein models. The model sees a codon sequence with 15% of positions masked and learns to predict the missing codons from context.
What the model learns, implicitly, is the grammar of codon usage: which codon patterns appear in nature, which codons tend to co-occur, and how preferences shift depending on the surrounding sequence context. This is fundamentally richer than a frequency table because the model captures long-range dependencies across the entire coding sequence.
CodonRoBERTa: Our Best Model
After our architecture exploration (see above), CodonRoBERTa-large-v2 emerged as the winner:
# configs/mrna/production/roberta_large_v2.yaml
model_type: roberta
hidden_size: 1024
num_hidden_layers: 24
num_attention_heads: 16
intermediate_size: 4096
vocab_size: 69
max_position_embeddings: 8192
learning_rate: 5e-5 # Critical: lower than v1
warmup_steps: 2000 # Critical: longer warmup
max_steps: 25000
Training:
python scripts/training/run_mlm_train.py \
--config configs/mrna/roberta_large_v2.yaml \
--train_file data/mrna/processed/train_250k.fasta \
--output_dir outputs/models/CodonRoBERTa-large-v2
Evaluation: Three Metrics That Matter
Evaluating a codon language model is not straightforward. As we learned from the v1/v2 comparison above, a model can have excellent perplexity (accurately predicting masked codons) while having poor biological alignment (predicting codons that nature does not actually prefer). We evaluate on three complementary axes:
1. Perplexity measures how well the model predicts masked codons, computed as the exponentiated cross-entropy loss. A perplexity of 4.10 means the model is, on average, choosing between ~4 equally likely codons at each masked position. Given that most amino acids have 2-6 synonymous codons, this indicates the model has learned meaningful preferences rather than guessing uniformly. Lower is better. CodonRoBERTa-large-v2: 4.10.
2. CAI Correlation (Spearman) measures whether a model's predicted codon likelihoods align with known biological codon usage preferences. We compute the Codon Adaptation Index for each test sequence, then correlate it with the model's pseudo-log-likelihood score. A positive correlation means the model assigns higher probability to sequences that biology actually uses. This is the metric that matters most for practical codon optimization, because it directly measures whether the model has learned biologically relevant patterns vs. just statistical ones. CodonRoBERTa-large-v2: 0.404 (p < 10^-20).
3. Synonymous Recovery asks: when the model predicts a codon for a masked position, does it at least get the amino acid right? Even if it picks the wrong synonym (e.g., CTT instead of CTC for leucine), predicting the correct amino acid shows the model understands the protein-level constraint. CodonRoBERTa-large-v2: 12.1% top-1 synonymous.
Running the Evaluations
# Perplexity
python scripts/evals/advanced/eval_perplexity.py \
--model outputs/models/CodonRoBERTa-large-v2/final \
--test_file data/mrna/processed/test_6k.fasta \
--output outputs/eval_results/CodonRoBERTa-large-v2/perplexity.json
# CAI Correlation
python scripts/evals/advanced/eval_cai_correlation.py \
--model outputs/models/CodonRoBERTa-large-v2/final \
--test_file data/mrna/processed/test_6k.fasta \
--output outputs/eval_results/CodonRoBERTa-large-v2/cai_correlation.json
# Synonymous Recovery
python scripts/evals/advanced/eval_synonymous_recovery.py \
--model outputs/models/CodonRoBERTa-large-v2/final \
--test_file data/mrna/processed/test_6k.fasta \
--output outputs/eval_results/CodonRoBERTa-large-v2/synonymous_recovery.json
The Final Leaderboard
Putting it all together across our model variants:
| Model | Params | Perplexity | CAI Spearman | Best For |
|---|---|---|---|---|
| CodonRoBERTa-large-v2 | 312M | 4.10 | 0.404 | Production use |
| CodonRoBERTa-base | 92M | 4.01 | 0.219 | Limited compute |
| CodonRoBERTa-large | 312M | 4.01 | 0.025 | - |
| ModernBERT-base | 90M | 26.24 | 0.070 | - |
| CodonBERT (baseline) | 6M | 17.18 | -0.629 | Baseline only |
The RoBERTa family dominates across the board. For production use, CodonRoBERTa-large-v2 is the clear choice: it has the strongest biological alignment (CAI 0.404) while maintaining competitive perplexity. For teams with limited compute, CodonRoBERTa-base delivers nearly the same perplexity at 3.4x fewer parameters. ModernBERT underperformed substantially, which we attribute to its NLP-pretrained weights interfering with codon pattern learning.
Using the Model
from transformers import RobertaForMaskedLM
import torch
# Load model (available soon on Hugging Face)
model = RobertaForMaskedLM.from_pretrained("OpenMed/CodonRoBERTa-large-v2")
tokenizer = CodonTokenizer() # Our custom 69-token vocabulary
# Score a sequence
sequence = "ATG GCT AAA GGT..." # Space-separated codons
inputs = tokenizer(sequence, return_tensors='pt')
with torch.no_grad():
outputs = model(**inputs)
# Pseudo-log-likelihood gives a "naturalness" score
# Predict alternatives for a masked position
masked_seq = "ATG [MASK] AAA GGT..."
inputs = tokenizer(masked_seq, return_tensors='pt')
predictions = model(**inputs).logits
top_codons = predictions[0, mask_pos].topk(5)
4. Scaling to Multi-Species
Single-species codon optimization is useful, but limited. Every organism has its own codon usage biases shaped by millions of years of evolution. E. coli favors different codons than human cells, which favor different codons than yeast. A model trained only on E. coli data will not produce optimal codons for human expression.
The industry standard is to use separate CAI tables for each organism. We wanted something better: a single model that understands codon usage across organisms, can be conditioned on a target species, and can transfer knowledge from data-rich organisms (human, with 145k annotated coding sequences) to data-poor ones (E. coli, with 9k). After establishing CodonRoBERTa-large-v2 as our best architecture on single-species data, we built this system.
The Data Engineering Challenge
Assembling a multi-species codon dataset is not as simple as downloading a few genomes. Each organism lives in a different NCBI RefSeq assembly, with different annotation quality, different CDS boundaries, and different sequence conventions. We wrote an automated pipeline that downloads CDS sequences from 25 organisms, validates them (checking for proper start/stop codons, length divisible by 3, no internal stops), labels each sequence with a species token, and splits into train/test sets with stratification by species.
# scripts/training/download_multispecies_cds.py
# Automated download from NCBI RefSeq for 25 organisms
SPECIES = {
# Bacteria (19 species)
'bacteria': [
('GCF_000005845.2', 'Escherichia coli K-12', 'ECOLI'),
('GCF_000009045.1', 'Bacillus subtilis 168', 'BSUBT'),
('GCF_000006945.2', 'Salmonella enterica', 'SENTE'),
('GCF_000195955.2', 'Mycobacterium tuberculosis', 'MTUBE'),
# ... 15 more bacteria
],
# Yeast (3 species)
'yeast': [
('GCF_000146045.2', 'Saccharomyces cerevisiae S288C', 'YEAST'),
('GCF_000002515.2', 'Schizosaccharomyces pombe', 'SPOMBE'),
('GCF_000027005.1', 'Pichia pastoris', 'PICHIA'),
],
# Mammals (3 species)
'mammals': [
('GCF_000001405.40', 'Homo sapiens GRCh38', 'HUMAN'),
('GCF_000001635.27', 'Mus musculus GRCm39', 'MOUSE'),
('GCF_003668045.3', 'Cricetulus griseus CHO-K1', 'CHO'),
]
}
The final dataset spans three domains of biotechnology relevance:
| Category | Species | Sequences | Size | Key Organisms |
|---|---|---|---|---|
| Mammals | 3 | 290,091 | 580 MB | Human (145k), Mouse (97k), CHO (47k) |
| Bacteria | 19 | 74,972 | 150 MB | E. coli (9k), B. subtilis (4k), P. aeruginosa (5.3k) |
| Yeast | 3 | 16,220 | 32 MB | S. cerevisiae (5.7k), P. pastoris (4.8k) |
| Total | 25 | 381,283 | ~3 GB | 362k train / 19k test |
The coverage is deliberate: bacteria are the workhorse of recombinant protein production, yeast dominates industrial biomanufacturing, and mammalian cells (especially CHO and human) are required for therapeutic proteins and mRNA vaccines. These 25 organisms collectively cover the vast majority of real-world codon optimization use cases.
The Tokenization Innovation
A model that sees sequences from 25 different organisms needs to know which organism it is looking at. We solved this by extending our 69-token codon vocabulary with 25 species tokens, creating a 94-token system. Each sequence is prepended with its species token (e.g., [HUMAN], [ECOLI], [YEAST]), so the model learns species-specific codon preferences within a single shared architecture.
# scripts/training/codon_tokenizer.py
class MultiSpeciesCodonTokenizer(CodonTokenizer):
"""Extended tokenizer with species-awareness"""
def __init__(self):
super().__init__()
# 0-4: [PAD], [UNK], [CLS], [SEP], [MASK]
# 5-68: 64 codons (AAA, AAC, ..., TTT)
# 69-93: 25 species tokens
self.species_tokens = [
'[ABAUM]', '[BSUBT]', '[CHO]', '[ECOLI]',
'[HUMAN]', '[MOUSE]', '[YEAST]', # ... +18 more
]
def encode(self, dna_seq: str, species: str = None):
"""Encode with species token prepended"""
ids = super().encode(dna_seq)
if species and species in self.species_to_id:
ids = [self.species_to_id[species]] + ids
return ids
This design has three advantages. First, it enables species-conditioned generation: the same model produces human-optimal or E. coli-optimal codons depending on which species token is prepended. Second, it enables cross-species transfer learning: universal codon patterns (like avoiding certain dinucleotides, or preferring GC-rich codons in GC-rich genomes) are shared across all species, while species-specific preferences are captured by conditioning on the species token. Third, the 94-token vocabulary is backward-compatible with our 69-token single-species models, since the first 69 tokens are identical.
Training the Universal Base Model
The universal base model is a 311.9M-parameter RoBERTa-large, identical in architecture to our single-species v2 but with the expanded 94-token vocabulary. It was trained for 48 hours on 4 A100 GPUs using the full 362k-sequence multispecies dataset.
# configs/mrna/production/roberta_large_multispecies.yaml
model:
name: "CodonRoBERTa-large-multispecies"
vocab_size: 94 # 69 base + 25 species
hidden_size: 1024
num_hidden_layers: 24
num_attention_heads: 16
training:
max_steps: 50000 # ~4.5 epochs over 362k sequences
learning_rate: 5e-5
per_device_train_batch_size: 4 # 4 GPUs = 16 effective
gradient_accumulation_steps: 2 # 32 total batch size
bf16: true
fsdp: "full_shard auto_wrap" # Critical for 311M params
Training command:
torchrun --nproc_per_node=4 --master_port=29501 \
scripts/training/run_multispecies_train.py \
--config configs/mrna/production/roberta_large_multispecies.yaml
Results:
| Metric | Value | Notes |
|---|---|---|
| Training time | 48 hours | 4×A100 80GB GPUs |
| Final steps | 47,500 / 50,000 | 95% complete (max time limit) |
| Training loss | 4.10 → 3.17 | 23% reduction |
| Eval loss | 3.18 | Stable convergence |
| Test perplexity | 24.9 | On 19k test sequences |
| Synonymous recovery | 11.0% | Better than single-species |
| Model size | 1.2 GB | Saved with safetensors |
| Checkpoints | 5 saved | Every 2,500 steps |
The test perplexity of 24.9 is higher than our single-species model's 4.01, which might look like a regression. It is not. The multispecies model must learn distinct codon preferences for 25 different organisms, each with its own evolutionary history and tRNA pools. A bacterium like M. tuberculosis (65% GC content) uses completely different codons than human cells (41% GC). The model is solving a fundamentally harder problem, and the perplexity reflects that. What matters is whether species-specific fine-tuning can recover performance, and it does.
Species-Specific Fine-Tuning
The universal base model is a generalist. For production use, specialists perform better. OpenMed's fine-tuning strategy starts from the multispecies checkpoint and trains further on a single species at a lower learning rate (2e-5 vs 5e-5), preserving cross-species knowledge while specializing the model's predictions for one organism.
Dataset splits:
# scripts/training/split_species_datasets.py
# Automated splitting for 6 priority species
RESULTS = {
'HUMAN': {'train': 131_245, 'test': 6_908},
'MOUSE': {'train': 88_022, 'test': 4_633},
'CHO': {'train': 42_541, 'test': 2_239},
'ECOLI': {'train': 8_547, 'test': 450},
'YEAST': {'train': 5_439, 'test': 287},
'PICHIA':{'train': 4_548, 'test': 240},
}
Training all three priority species:
# HUMAN: Therapeutic mRNA, vaccines
torchrun --nproc_per_node=4 scripts/training/run_multispecies_train.py \
--config configs/mrna/production/roberta_large_human_finetune.yaml
# ECOLI: Protein expression, metabolic engineering
torchrun --nproc_per_node=4 scripts/training/run_multispecies_train.py \
--config configs/mrna/production/roberta_large_ecoli_finetune.yaml
# CHO: Biopharmaceutical production
torchrun --nproc_per_node=4 scripts/training/run_multispecies_train.py \
--config configs/mrna/production/roberta_large_cho_finetune.yaml
Comprehensive Results:
| Model | Train Seqs | Steps | GPU Hours | Test Loss | Perplexity | vs Multispecies | Primary Use Case |
|---|---|---|---|---|---|---|---|
| Multispecies Base | 362k | 47,500 | 48h | 3.18 | 24.9 | baseline | Universal optimization |
| HUMAN | 131k | 15,000 | 4h | 3.16 | 24.3 | 2.4% better | mRNA therapeutics, vaccines |
| ECOLI | 8.5k | 5,000 | 0.5h | 3.17 | 25.3 | 1.6% better | Recombinant proteins |
| CHO | 42.5k | 10,000 | 2.5h | 3.21 | 25.5 | 2.4% worse | Antibody production |
| Total: 55h |
The most important result here is the HUMAN model: at 24.3 perplexity, it is the only specialist that beats the universal base, making it our production model for therapeutic mRNA applications. But the ECOLI result is arguably more interesting from a research perspective. With only 8,547 training sequences (compared to 131k for human), the E. coli specialist still improved over the multispecies base. This validates the transfer learning hypothesis: training on 25 species first, then fine-tuning on a small species-specific dataset, works better than training on the small dataset alone. For the many organisms where annotated CDS data is scarce, this approach opens the door to reasonable codon optimization without needing tens of thousands of species-specific sequences.
The CHO model showed slight degradation (25.5 vs 24.9), which we attribute to insufficient training steps. ECOLI got 5,000 steps for 8.5k sequences (0.59 steps per sequence), while CHO got 10,000 steps for 42.5k sequences (0.24 steps per sequence). A rerun with 15,000 steps should close this gap. All three specialists fine-tuned in just 7 hours total, leveraging the 48-hour investment in the multispecies base.
The Complete Model Suite
After 55 hours of training, we have:
All models will be released on Hugging Face under the OpenMed organization. The naming convention follows OpenMed/{model-name} for direct use with from_pretrained().
Universal Models:
OpenMed/CodonRoBERTa-large-multispecies(311.9M params)- Trained on 25 species
- Perplexity: 24.9
- Use case: Cross-species optimization, rare organisms
Species-Specific Specialists:
OpenMed/CodonRoBERTa-large-human(311.9M params)- Perplexity: 24.3 (best overall)
- Use case: mRNA vaccines, gene therapy, therapeutic proteins
OpenMed/CodonRoBERTa-large-ecoli(311.9M params)- Perplexity: 25.3
- Use case: Bacterial protein expression, metabolic engineering
OpenMed/CodonRoBERTa-large-cho(311.9M params)- Perplexity: 25.5
- Use case: Mammalian cell culture, biopharmaceuticals
Single-Species Models:
OpenMed/CodonRoBERTa-large-v2(312M params)- Perplexity: 4.10, CAI: 0.404
- Trained on 250k E. coli only
- Still the best for pure E. coli optimization
OpenMed/CodonRoBERTa-base(92M params)- Perplexity: 4.01, CAI: 0.219
- Best efficiency option (3.4x smaller than large)
Production Deployment Strategy
For therapeutic mRNA (Moderna, BioNTech-style vaccines):
from transformers import RobertaForMaskedLM
# Load human-optimized model
model = RobertaForMaskedLM.from_pretrained("OpenMed/CodonRoBERTa-large-human")
# Optimize amino acid sequence to DNA
optimized_dna = optimize_for_human(protein_seq, model)
For industrial protein production:
# E. coli expression
model_ecoli = RobertaForMaskedLM.from_pretrained("OpenMed/CodonRoBERTa-large-ecoli")
# CHO cell expression
model_cho = RobertaForMaskedLM.from_pretrained("OpenMed/CodonRoBERTa-large-cho")
For rare/unmodeled organisms:
# Use multispecies base with species conditioning
model_multi = RobertaForMaskedLM.from_pretrained("OpenMed/CodonRoBERTa-large-multispecies")
# Prepend species token: [YEAST], [PICHIA], [MOUSE], etc.
Infrastructure & Reproducibility
Hardware requirements:
- 4×A100 80GB GPUs (training)
- Single A100 40GB (inference)
- FSDP (Fully Sharded Data Parallel) for 311M params
- bf16 mixed precision (critical for stability)
Storage footprint:
- ~150 GB total (data, trained models, and checkpoints across all runs)
Complete training pipeline:
# 1. Download multi-species data (automated)
python scripts/training/download_multispecies_cds.py \
--output_dir data/mrna/multispecies \
--categories bacteria yeast mammals
# 2. Train universal base (48 hours)
torchrun --nproc_per_node=4 \
scripts/training/run_multispecies_train.py \
--config configs/mrna/production/roberta_large_multispecies.yaml
# 3. Split species datasets
python scripts/training/split_species_datasets.py \
--input data/mrna/multispecies/train_multispecies.fasta \
--output_dir data/mrna/species_specific
# 4. Fine-tune specialists (7 hours total)
for species in human ecoli cho; do
torchrun --nproc_per_node=4 \
scripts/training/run_multispecies_train.py \
--config configs/mrna/production/roberta_large_${species}_finetune.yaml
done
# 5. Evaluate all models
for model in multispecies human ecoli cho; do
python scripts/evals/advanced/eval_perplexity_multispecies.py \
--model_path outputs/models/CodonRoBERTa-large-${model}/final \
--test_file data/mrna/species_specific/${model}_test.fasta \
--output_file outputs/evals/${model}_perplexity.json
done
Total compute cost:
- 55 GPU-hours on A100 80GB (~$165 on AWS p4d.24xlarge)
- All models trained to convergence in under 3 days wall-clock time
What This Enables
The multi-species suite covers the three pillars of applied codon optimization. For therapeutic mRNA, the HUMAN specialist model optimizes codons for expression in human cells, directly applicable to vaccine design (Moderna and BioNTech both codon-optimize their mRNA constructs) and gene therapy vectors. For recombinant protein production, the ECOLI specialist handles the most common bacterial expression host, while the CHO specialist covers the mammalian cell line used to produce most monoclonal antibodies and biopharmaceuticals. For organisms not covered by a specialist, the multispecies base model accepts any of the 25 species tokens and generates organism-appropriate codons.
The transfer learning result is particularly relevant for the broader community. Many industrially important organisms (non-model bacteria, insect cells, plant cells) have limited annotated CDS data. Our results with E. coli (8.5k sequences, improved over base) suggest that multispecies pre-training followed by small-scale fine-tuning is a viable path for these organisms, without requiring the tens or hundreds of thousands of sequences that training from scratch demands.
By the Numbers
The full project consumed 55 GPU-hours on 4 A100 80GB GPUs (~$165 at AWS spot pricing), produced ~150 GB of models and checkpoints across 4 training runs, and covers 381,283 CDS sequences from 25 species downloaded from NCBI RefSeq. All models use the same 311.9M-parameter architecture, are saved in safetensors format for fast loading, and run inference on a single GPU with 16GB+ VRAM. Everything is released under Apache 2.0.
5. The End-to-End Workflow
The modern computational protein design workflow. Structure generation (top) produces backbone coordinates; sequence optimization (bottom) finds amino acid sequences that fold into the target shape. Our pipeline adds a third step: codon optimization for expression. Figure from Kortemme, Cell 2024, CC-BY 4.0.
In Part I, we described the protein engineering loop as a cycle of prediction, design, and optimization. Here is what that looks like in practice with OpenMed's pipeline. Each step feeds into the next, and the entire computational phase runs in an afternoon on a single GPU.
Consider a concrete scenario: engineering a more stable version of a therapeutic enzyme that degrades too quickly in the bloodstream.
Step 1: Fold (ESMFold). Predict the structure of the starting sequence to understand its active site and identify regions that might be destabilized. ESMFold returns a 3D structure in PDB format, per-residue confidence scores (pLDDT) highlighting uncertain regions, and an overall topology confidence metric (PTM).
Step 2: Design (ProteinMPNN). Keep the active site fixed but redesign the scaffold for improved stability. ProteinMPNN takes the backbone coordinates, a list of immutable positions (catalytic residues), and generates 100 diverse candidate sequences, each predicted to fold into the target shape.
Step 3: Verify (ESMFold). Refold all 100 candidates with ESMFold to confirm they still adopt the correct shape. Filter for high mean pLDDT (>80), correct topology (RMSD to original), and low clash scores.
Step 4: Optimize (CodonRoBERTa). Take the best amino acid sequences and use CodonRoBERTa-large-v2 to generate optimal DNA sequences for expression in the target organism (E. coli, yeast, or mammalian cells). The model scores synonymous codon choices based on learned biological preferences, identifies contextually optimal codons rather than just globally frequent ones, and produces sequences with high CAI correlation.
Step 5: Synthesize and Test. Order the DNA from a synthesis company, clone it into an expression vector, and test expression and activity in the lab.
This loop, from hypothesis to synthesis-ready DNA, replaces what once took months of iterative wet-lab trial and error. A researcher arrives at the bench with 5-10 computationally vetted candidates instead of one or two educated guesses. The success rate improves, the cost drops, and the design cycle compresses from months to days.
For a full ecosystem overview, tool selection guide, and license reference, see Part I.
6. Where This Stands and What's Next
The Landscape
OpenMed is not working in isolation. Two recent models push the codon/mRNA modeling frontier further:
mRNABERT (Xiong et al., Nature Communications 2025): 86M-param BERT with a dual tokenization scheme (single nucleotides for UTRs, codons for CDS) and cross-modality contrastive learning against frozen ProtT5-XL protein embeddings. Trained on 18M sequences. Achieves R^2 = 0.66 on full-length mRNA translation efficiency, a 1.6-10x improvement over prior RNA models. Code and weights available (Apache 2.0).
NUWA (Zhong et al., bioRxiv 2026): Three domain-specific RoBERTa encoders (Bacteria, Eukaryota, Archaea) with curriculum MLM and supervised contrastive learning. Trained on 115M sequences across ~25,000 species. Beats CodonBERT on 11/13 BEACON benchmark tasks. No code or weights released.
Both models train on 50-300x more data than OpenMed used. That is the primary gap, and we are transparent about it.
Here is what OpenMed does that neither of them offers:
Species-conditioned single model. mRNABERT has no species conditioning at all. NUWA trains three separate models (one per domain of life). We put 25 species tokens into one 94-token vocabulary and train a single model that can be prompted for any organism. More parameter-efficient, more flexible.
Transfer learning to low-resource organisms, validated. We showed that fine-tuning the multispecies base on just 8.5k E. coli sequences improves over the base model. Neither mRNABERT nor NUWA demonstrates this.
Full open-source pipeline. ESMFold + ProteinMPNN + CodonRoBERTa, end to end, with training code, configs, eval scripts, and model weights. All Apache 2.0. mRNABERT releases code but not the pipeline. NUWA releases nothing.
In Progress: CodonJEPA
OpenMed is running a proof-of-concept for a fundamentally different approach: Joint Embedding Predictive Architecture (JEPA) for codon sequences.
Standard MLM predicts masked tokens. JEPA predicts masked embeddings. The hypothesis: if the model is forced to predict in embedding space rather than token space, it should learn that synonymous codons (different DNA, same amino acid) are functionally equivalent. MLM cannot achieve this because it is trained to distinguish every token.
The architecture:
- Context encoder: RoBERTa-base (768-dim, 12 layers), trained normally
- Target encoder: EMA copy of the context encoder (momentum 0.990 to 0.999, no gradients)
- Predictor: Lightweight 4-layer transformer (384-dim), predicts target embeddings from context
- Masking: Multi-block strategy (4 contiguous blocks per sequence, 15-20% mask scale)
- Collapse prevention: VICReg regularization (variance + covariance loss)
Early results from our evaluation suite (JEPA vs. MLM baseline, same data, same hyperparameters, 15k steps each):
| Metric | CodonJEPA | CodonMLM | Winner |
|---|---|---|---|
| Synonymous robustness (cosine sim) | 0.9997 | 0.9414 | JEPA |
| Codon-to-amino-acid probe | 37.4% | 100% | MLM (expected) |
| Taxonomy classification | 87.8% | 99.4% | MLM |
| Effective dimensionality | 1.67 | 18.71 | MLM |
The synonymous robustness result is the one that matters most for our hypothesis. JEPA embeddings are nearly identical (99.97% cosine similarity) for sequences that differ only in synonymous codon choices. MLM embeddings shift significantly (94.14%). This means JEPA does learn that synonymous codons are interchangeable, exactly as predicted.
The open challenge: JEPA currently suffers from dimensional collapse (91.78% of variance concentrated in one component). This is a known failure mode for self-supervised methods and suggests the VICReg regularization weights need tuning. The architecture works; the training dynamics need more iteration.
This is early-stage research, not production-ready. But if the collapse problem can be solved, JEPA could produce codon embeddings that are inherently amino-acid-aware, something that MLM, by its token-level prediction objective, fundamentally cannot achieve.
Roadmap
CodonRoBERTa (scaling up):
- Retrain on mRNABERT's public 36M-sequence dataset (Zenodo). Same architecture, same species tokens, 100x more data
- Add cross-modality contrastive alignment with ProtT5-XL (proven by mRNABERT to boost protein property prediction)
- Extend species-specific fine-tuning to YEAST, PICHIA, MOUSE
- Add mRNA stability and immunogenicity prediction heads
CodonJEPA (fixing and scaling):
- Solve dimensional collapse (stronger VICReg weights, alternative regularizers)
- Benchmark against mRNABERT's contrastive approach on the same downstream tasks
- If JEPA embeddings hold up, integrate as a drop-in replacement for MLM embeddings in the pipeline
Pipeline:
- Integrate RFdiffusion for de novo backbone generation
- Add solubility and expression prediction heads
- Fine-tune ESMFold on specialized domains (antibodies, enzymes)
Setup and Requirements
Hardware:
- Tested on 4×A100 GPUs (80GB)
- Folding inference: ~16-20GB VRAM for typical proteins
- Training: Scales with FSDP (Fully Sharded Data Parallel)
- Minimum for inference: Single GPU with 16GB+ VRAM
Environment:
- Use
.env_esmfoldvirtual environment for folding - For training: PyTorch 2.5.1+cu121 with flash-attn2
- Python 3.10+ recommended
Licenses (all commercial-friendly):
- ESMFold: MIT
- ProteinMPNN: MIT
- OpenFold: Apache-2.0
- Our CodonRoBERTa: Apache-2.0
7. References
OpenMed's work builds on foundational research from Meta AI, the Baker Lab at the University of Washington, DeepMind, and the broader open-source computational biology community.
Key Papers
Protein Structure Prediction
- Jumper, J. et al. "Highly accurate protein structure prediction with AlphaFold." Nature (2021). DOI
- Lin, Z. et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science (2023). DOI
- Ahdritz, G. et al. "OpenFold: Retraining AlphaFold2 yields new insights." Nature Methods (2024). DOI
Protein Design
- Dauparas, J. et al. "Robust deep learning-based protein sequence design using ProteinMPNN." Science (2022). DOI
- Watson, J.L. et al. "De novo design of protein structure and function with RFdiffusion." Nature (2023). DOI
mRNA & Codon Optimization
- Cheng, J. et al. "CodonBERT: a language model for codon optimization." Nucleic Acids Research (2024). DOI
- Xiong, Y. et al. "mRNABERT: advancing mRNA sequence design with a universal language model and comprehensive dataset." Nature Communications (2025). DOI
- Zhong, Y. et al. "Large mRNA language foundation modeling with NUWA for unified sequence perception and generation." bioRxiv (2026). DOI
- Warner, B. et al. "ModernBERT: Smarter, Better, Faster, Longer." arXiv (2024). arXiv
Models and Data: Coming Soon
All models, training code, and the multi-species dataset described in this post will be publicly released under the OpenMed organization on Hugging Face under Apache 2.0 / MIT licenses.
Models (7 checkpoints):
| Model | Params | HuggingFace |
|---|---|---|
| CodonRoBERTa-large-v2 | 312M | OpenMed/CodonRoBERTa-large-v2 |
| CodonRoBERTa-base | 92M | OpenMed/CodonRoBERTa-base |
| CodonRoBERTa-large-multispecies | 311.9M | OpenMed/CodonRoBERTa-large-multispecies |
| CodonRoBERTa-large-human | 311.9M | OpenMed/CodonRoBERTa-large-human |
| CodonRoBERTa-large-ecoli | 311.9M | OpenMed/CodonRoBERTa-large-ecoli |
| CodonRoBERTa-large-cho | 311.9M | OpenMed/CodonRoBERTa-large-cho |
Dataset:
| Dataset | Sequences | Species | HuggingFace |
|---|---|---|---|
| Multi-species CDS | 381,283 | 25 | OpenMed/multispecies-codon-dataset |
Training code and evaluation scripts will be released alongside the models.
Follow OpenMed on Hugging Face to be notified when models go live.
The ML Engineer's Guide to Protein AI: From Protein to Optimized DNA | March 2026
Read Part I: The AlphaFold Revolution, covering the landscape of protein AI.
Questions or collaboration ideas? Reach out on Hugging Face or open a discussion on the model pages.



