# CodonTranslator

CodonTranslator is a protein-conditioned codon sequence generation model trained on the representative-only `data_v3` release.
This repository is the public model and training-code release. It contains:

- `final_model/`: inference-ready weights
- `src/`, `train.py`, `sampling.py`: training and inference code
- `resplit_data_v3.py`: the `data_v3` reconstruction pipeline
- `slurm/`: the single-node H200 training and data-rebuild submission scripts
- `CodonTranslator/` and `pyproject.toml`: a lightweight packaged inference wrapper
## Training configuration

- Architecture: `hidden=750`, `layers=20`, `heads=15`, `mlp_ratio=3.2`
- Attention: `mha`
- Precision: `bf16`
- Parallelism: FSDP full shard
- Effective global batch: `1536`
- Weight decay: `1e-4`
- Dataset: `alegendaryfish/CodonTranslator-data`
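The stated effective global batch of 1536 is consistent with the per-GPU flags in the train command below, assuming the single-node 8× H200 layout implied by the Slurm scripts:

```python
# Effective global batch = per-GPU batch × grad accumulation × data-parallel ranks.
# The 8-GPU count is an assumption based on the "8x" H200 Slurm scripts.
per_gpu_batch = 48   # --batch_size
grad_accum = 4       # --grad_accum
num_gpus = 8         # single H200 node (assumed)

effective_batch = per_gpu_batch * grad_accum * num_gpus
print(effective_batch)  # → 1536
```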
## Dataset release

The corresponding public dataset and species-embedding release is `alegendaryfish/CodonTranslator-data`. That dataset repo contains:

- final representative-only `train/`, `val/`, `test/` parquet shards
- `embeddings_v2/` precomputed species embeddings
- split audit files and reconstruction metadata
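After downloading the dataset repo, the per-split shards can be enumerated with a few lines of standard-library Python (the local checkout path is an assumption):

```python
from pathlib import Path

# Hypothetical local checkout of alegendaryfish/CodonTranslator-data; adjust to your path.
data_root = Path("CodonTranslator-data")

# Collect the representative-only parquet shards for each split.
shards = {
    split: sorted((data_root / split).glob("*.parquet"))
    for split in ("train", "val", "test")
}
for split, files in shards.items():
    print(f"{split}: {len(files)} shard(s)")
```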
## Quick start

### Install

```bash
git clone https://huggingface.co/alegendaryfish/CodonTranslator
cd CodonTranslator
conda env create -f environment.yml
conda activate codontranslator
pip install -r requirements.txt
pip install -e .
```
Both import styles are supported:

```python
from CodonTranslator import CodonTranslator
from codontranslator import CodonTranslator
```
### Train

```bash
python train.py \
  --train_data /path/to/train \
  --val_data /path/to/val \
  --embeddings_dir /path/to/embeddings_v2 \
  --output_dir outputs \
  --fsdp \
  --bf16 \
  --attn mha \
  --hidden 750 \
  --layers 20 \
  --heads 15 \
  --mlp_ratio 3.2 \
  --batch_size 48 \
  --grad_accum 4 \
  --epochs 3 \
  --lr 7e-5 \
  --weight_decay 1e-4
```
The included Slurm launchers use the same training flags as the local single-node H200 workflow:

- `slurm/train_v3_h200_8x_single.sbatch`
- `slurm/submit_train_v3_h200_8x_chain.sh`
### Sample

```bash
python sampling.py \
  --model_path final_model \
  --embeddings_dir /path/to/embeddings_v2 \
  --species "Panicum hallii" \
  --protein_sequence "MSEQUENCE" \
  --strict_species_lookup
```
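The `--strict_species_lookup` flag suggests the sampler resolves the species name exactly against the precomputed embeddings rather than falling back to a looser match. A minimal sketch of what strict vs. lenient resolution could look like (the actual behavior and `embeddings_v2` layout are assumptions):

```python
# Illustrative only: the real embeddings_v2 format and lookup logic may differ.
def lookup_species(embeddings: dict, species: str, strict: bool = True):
    if species in embeddings:
        return embeddings[species]
    if strict:
        # Strict mode: an unknown species name is an error, not a silent fallback.
        raise KeyError(f"species not found in embeddings: {species!r}")
    # Lenient fallback: case-insensitive match on the binomial name.
    lowered = {name.lower(): vec for name, vec in embeddings.items()}
    return lowered.get(species.lower())

embeddings = {"Panicum hallii": [0.1, 0.2, 0.3]}  # toy stand-in for embeddings_v2
print(lookup_species(embeddings, "Panicum hallii"))
print(lookup_species(embeddings, "panicum hallii", strict=False))
```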
## Notes

- Training uses precomputed `embeddings_v2` for species conditioning.
- The data split is built in protein space with MMseqs clustering and a binomial-species test holdout.
- `final_model/` is the published inference entrypoint.
- For compatibility, released model directories contain both `trainer_config.json` and `config.json`.
## Sampling arguments

- `enforce_mapping`: when `True`, each generated codon is constrained to encode the provided amino acid at that position.
- `temperature`: softmax temperature. Lower values are more deterministic; `0` selects the argmax greedily.
- `top_k`: keep only the `k` highest-logit codon candidates before sampling.
- `top_p`: nucleus sampling threshold; keep the smallest probability mass whose cumulative sum reaches `p`.
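How `temperature`, `top_k`, and `top_p` interact in a single decoding step can be sketched in plain Python (a generic illustration of these standard sampling controls, not the repository's actual implementation):

```python
import math
import random

def filter_and_sample(logits, temperature=1.0, top_k=0, top_p=1.0):
    """One illustrative decoding step over codon logits."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
    # Temperature-scaled softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Rank candidates by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]  # keep only the k highest-logit candidates
    if top_p < 1.0:
        kept, total = [], 0.0
        for i in order:  # nucleus: smallest mass whose cumulative sum reaches p
            kept.append(i)
            total += probs[i]
            if total >= top_p:
                break
        order = kept
    # Renormalize over the surviving candidates and sample one index.
    mass = sum(probs[i] for i in order)
    r = random.random() * mass
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]

print(filter_and_sample([2.0, 1.0, 0.1], temperature=0))  # → 0 (greedy)
```

With `top_k=1` or a very small `top_p`, only the single most likely codon survives filtering, so sampling becomes deterministic regardless of temperature.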