CodonTranslator

CodonTranslator is a protein-conditioned codon sequence generation model trained on the representative-only data_v3 release.

This repository is the public model and training-code release. It contains:

  • final_model/: inference-ready weights
  • src/, train.py, sampling.py: training and inference code
  • resplit_data_v3.py: the data_v3 reconstruction pipeline
  • slurm/: the single-node H200 training and data rebuild submission scripts
  • CodonTranslator/ and pyproject.toml: a lightweight packaged inference wrapper

Training configuration

  • Architecture: hidden=750, layers=20, heads=15, mlp_ratio=3.2
  • Attention: mha
  • Precision: bf16
  • Parallelism: FSDP full shard
  • Effective global batch: 1536
  • Weight decay: 1e-4
  • Dataset: alegendaryfish/CodonTranslator-data
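As a sanity check, the stated effective global batch is consistent with the per-device batch and gradient accumulation used in the training command below, assuming the 8 GPUs implied by the "H200 8x" Slurm script names. The sketch below also gives a rough parameter estimate; the 8-GPU assumption and the simplified transformer formula are ours, not values stated in this card:

```python
# Sanity-check sketch: derive effective batch and a rough parameter count
# from the config above. The 8-GPU single-node assumption and the simplified
# parameter formula are illustrative assumptions, not stated in this card.

hidden, layers, heads, mlp_ratio = 750, 20, 15, 3.2
batch_size, grad_accum, gpus = 48, 4, 8  # gpus assumed from "h200_8x" naming

effective_batch = batch_size * grad_accum * gpus
print(effective_batch)  # 1536, matching the stated effective global batch

# Rough per-layer parameters: 4*h^2 for attention (Q, K, V, O projections)
# plus 2 * h * (mlp_ratio * h) for the MLP up/down projections.
per_layer = 4 * hidden**2 + 2 * hidden * int(mlp_ratio * hidden)
approx_params = layers * per_layer
print(f"~{approx_params / 1e6:.0f}M parameters (excluding embeddings)")
```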

Dataset release

The corresponding public dataset and species embedding release is:

  • alegendaryfish/CodonTranslator-data

That dataset repo contains:

  • final representative-only train/, val/, test/ parquet shards
  • embeddings_v2/
  • split audit files and reconstruction metadata

Quick start

Install

git clone https://huggingface.co/alegendaryfish/CodonTranslator
cd CodonTranslator
conda env create -f environment.yml
conda activate codontranslator
pip install -r requirements.txt
pip install -e .

Both import styles are supported:

from CodonTranslator import CodonTranslator
from codontranslator import CodonTranslator

Train

python train.py \
  --train_data /path/to/train \
  --val_data /path/to/val \
  --embeddings_dir /path/to/embeddings_v2 \
  --output_dir outputs \
  --fsdp \
  --bf16 \
  --attn mha \
  --hidden 750 \
  --layers 20 \
  --heads 15 \
  --mlp_ratio 3.2 \
  --batch_size 48 \
  --grad_accum 4 \
  --epochs 3 \
  --lr 7e-5 \
  --weight_decay 1e-4

The included Slurm launchers use the same training flags as the local single-node H200 workflow:

  • slurm/train_v3_h200_8x_single.sbatch
  • slurm/submit_train_v3_h200_8x_chain.sh

Sample

python sampling.py \
  --model_path final_model \
  --embeddings_dir /path/to/embeddings_v2 \
  --species "Panicum hallii" \
  --protein_sequence "MSEQUENCE" \
  --strict_species_lookup
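The --strict_species_lookup flag controls what happens when the requested species has no precomputed embedding. A minimal sketch of that behavior, with a hypothetical function name and an illustrative case-insensitive fallback (the real lookup lives in sampling.py and reads embeddings_v2):

```python
# Hypothetical sketch of species-embedding lookup with a strict flag.
# The dict contents, function name, and fallback behaviour are illustrative
# assumptions; the actual implementation is in sampling.py.

def lookup_species_embedding(embeddings, species, strict=False):
    """Return the conditioning embedding for a species name."""
    if species in embeddings:
        return embeddings[species]
    if strict:
        # --strict_species_lookup: an unknown species is an error, not a guess.
        raise KeyError(f"No embedding for species {species!r}")
    # Non-strict: fall back to a case-insensitive match if one exists.
    lowered = {k.lower(): v for k, v in embeddings.items()}
    return lowered.get(species.lower())

embeddings = {"Panicum hallii": [0.1, 0.2], "Zea mays": [0.3, 0.4]}
print(lookup_species_embedding(embeddings, "panicum hallii"))  # [0.1, 0.2]
```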

Notes

  • Training uses precomputed embeddings_v2 for species conditioning.
  • The data split is built in protein space with MMseqs clustering and a binomial-species test holdout.
  • final_model/ is the published inference entrypoint.
  • For compatibility, released model directories contain both trainer_config.json and config.json.
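The point of clustering in protein space before splitting is that whole clusters, not individual sequences, are assigned to a split, so near-identical homologs cannot leak between train and test. A minimal sketch of that idea, with an assumed hash-based assignment and illustrative ratios (the real pipeline is resplit_data_v3.py):

```python
# Illustrative sketch of a cluster-aware split: every member of an MMseqs
# cluster lands in the same split, preventing homolog leakage. The hashing
# scheme and split fractions are assumptions, not the actual pipeline.
import hashlib

def split_for_cluster(cluster_id, val_frac=0.05, test_frac=0.05):
    """Deterministically map a cluster ID to train/val/test."""
    h = int(hashlib.sha256(cluster_id.encode()).hexdigest(), 16) % 10_000
    if h < val_frac * 10_000:
        return "val"
    if h < (val_frac + test_frac) * 10_000:
        return "test"
    return "train"

seq_to_cluster = {"seqA": "clu1", "seqB": "clu1", "seqC": "clu2"}
splits = {s: split_for_cluster(c) for s, c in seq_to_cluster.items()}
# Members of the same cluster always land in the same split:
print(splits["seqA"] == splits["seqB"])  # True
```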

Sampling arguments

  • enforce_mapping: when True, each generated codon is constrained to encode the provided amino acid at that position.
  • temperature: softmax temperature. Lower values are more deterministic; 0 selects argmax greedily.
  • top_k: keep only the k highest-logit codon candidates before sampling.
  • top_p: nucleus sampling threshold; keep the smallest set of highest-probability candidates whose cumulative probability reaches p.
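The arguments above can be read as one pipeline applied at each decoding step: mask, temperature-scale, truncate, sample. A self-contained sketch under assumptions (a toy codon table and hand-picked logits; real logits come from the model in sampling.py):

```python
# Sketch of how the sampling arguments could compose at one decoding step.
# The codon table fragment, logits, and function are illustrative only.
import math
import random

CODONS = ["ATG", "AAA", "AAG", "TTT", "TTC", "TAA"]
AA_OF = {"ATG": "M", "AAA": "K", "AAG": "K", "TTT": "F", "TTC": "F", "TAA": "*"}

def sample_codon(logits, amino_acid, enforce_mapping=True,
                 temperature=1.0, top_k=0, top_p=1.0, rng=random):
    pairs = list(zip(CODONS, logits))
    if enforce_mapping:
        # Constrain to codons that actually encode the target amino acid.
        pairs = [(c, l) for c, l in pairs if AA_OF[c] == amino_acid]
    pairs.sort(key=lambda p: p[1], reverse=True)
    if temperature == 0:
        return pairs[0][0]  # greedy argmax
    if top_k > 0:
        pairs = pairs[:top_k]  # keep only the k highest-logit candidates
    probs = [math.exp(l / temperature) for _, l in pairs]
    total = sum(probs)
    probs = [p / total for p in probs]
    names = [c for c, _ in pairs]
    if top_p < 1.0:
        # Nucleus: keep the smallest prefix whose cumulative mass reaches p.
        kept_names, kept_probs, mass = [], [], 0.0
        for name, p in zip(names, probs):
            kept_names.append(name)
            kept_probs.append(p)
            mass += p
            if mass >= top_p:
                break
        total = sum(kept_probs)
        names, probs = kept_names, [p / total for p in kept_probs]
    return rng.choices(names, weights=probs)[0]

logits = [2.0, 1.5, 0.5, -1.0, -0.5, -3.0]
print(sample_codon(logits, "K", temperature=0))  # AAA (argmax among K codons)
```

With enforce_mapping=True, the sampled codon always encodes the requested amino acid regardless of the other settings, because the mask is applied before any truncation or sampling.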