CodonTranslator

CodonTranslator is a protein-conditioned codon sequence generation model trained on the representative-only data_v3 release.

This repository is the public model and training-code release. It contains:

  • final_model/: inference-ready weights
  • src/, train.py, sampling.py: training and inference code
  • resplit_data_v3.py: the data_v3 reconstruction pipeline
  • slurm/: the single-node H200 training and data rebuild submission scripts
  • CodonTranslator/ and pyproject.toml: a lightweight packaged inference wrapper

Training configuration

  • Architecture: hidden=750, layers=20, heads=15, mlp_ratio=3.2
  • Attention: mha
  • Precision: bf16
  • Parallelism: FSDP full shard
  • Effective global batch: 1536
  • Weight decay: 1e-4
  • Dataset: alegendaryfish/CodonTranslator-data
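As a sanity check, the stated effective global batch is consistent with the per-device batch and gradient accumulation used in the training command below, assuming the 8 GPUs implied by the "H200 8x" Slurm script names. The sketch below also gives a rough parameter estimate; the 8-GPU assumption and the simplified transformer formula are ours, not values stated in this card:

```python
# Sanity-check sketch: derive effective batch and a rough parameter count
# from the config above. The 8-GPU single-node assumption and the simplified
# parameter formula are illustrative assumptions, not stated in this card.

hidden, layers, heads, mlp_ratio = 750, 20, 15, 3.2
batch_size, grad_accum, gpus = 48, 4, 8  # gpus assumed from "h200_8x" naming

effective_batch = batch_size * grad_accum * gpus
print(effective_batch)  # 1536, matching the stated effective global batch

# Rough per-layer parameters: 4*h^2 for attention (Q, K, V, O projections)
# plus 2 * h * (mlp_ratio * h) for the MLP up/down projections.
per_layer = 4 * hidden**2 + 2 * hidden * int(mlp_ratio * hidden)
approx_params = layers * per_layer
print(f"~{approx_params / 1e6:.0f}M parameters (excluding embeddings)")
```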

Dataset release

The corresponding public dataset and species embedding release is:

  • alegendaryfish/CodonTranslator-data

That dataset repo contains:

  • final representative-only train/, val/, test/ parquet shards
  • embeddings_v2/
  • split audit files and reconstruction metadata

Quick start

Install

git clone https://huggingface.co/alegendaryfish/CodonTranslator
cd CodonTranslator
conda env create -f environment.yml
conda activate codontranslator
pip install -r requirements.txt
pip install -e .

Both import styles are supported:

from CodonTranslator import CodonTranslator
from codontranslator import CodonTranslator

Train

python train.py \
  --train_data /path/to/train \
  --val_data /path/to/val \
  --embeddings_dir /path/to/embeddings_v2 \
  --output_dir outputs \
  --fsdp \
  --bf16 \
  --attn mha \
  --hidden 750 \
  --layers 20 \
  --heads 15 \
  --mlp_ratio 3.2 \
  --batch_size 48 \
  --grad_accum 4 \
  --epochs 3 \
  --lr 7e-5 \
  --weight_decay 1e-4

The included Slurm launchers use the same training flags as the local single-node H200 workflow:

  • slurm/train_v3_h200_8x_single.sbatch
  • slurm/submit_train_v3_h200_8x_chain.sh

Sample

python sampling.py \
  --model_path final_model \
  --embeddings_dir /path/to/embeddings_v2 \
  --species "Panicum hallii" \
  --protein_sequence "MSEQUENCE" \
  --strict_species_lookup
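The --strict_species_lookup flag controls what happens when the requested species has no precomputed embedding. A minimal sketch of that behavior, with a hypothetical function name and an illustrative case-insensitive fallback (the real lookup lives in sampling.py and reads embeddings_v2):

```python
# Hypothetical sketch of species-embedding lookup with a strict flag.
# The dict contents, function name, and fallback behaviour are illustrative
# assumptions; the actual implementation is in sampling.py.

def lookup_species_embedding(embeddings, species, strict=False):
    """Return the conditioning embedding for a species name."""
    if species in embeddings:
        return embeddings[species]
    if strict:
        # --strict_species_lookup: an unknown species is an error, not a guess.
        raise KeyError(f"No embedding for species {species!r}")
    # Non-strict: fall back to a case-insensitive match if one exists.
    lowered = {k.lower(): v for k, v in embeddings.items()}
    return lowered.get(species.lower())

embeddings = {"Panicum hallii": [0.1, 0.2], "Zea mays": [0.3, 0.4]}
print(lookup_species_embedding(embeddings, "panicum hallii"))  # [0.1, 0.2]
```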

Notes

  • Training uses precomputed embeddings_v2 for species conditioning.
  • The data split is built in protein space with MMseqs clustering and a binomial-species test holdout.
  • final_model/ is the published inference entrypoint.
  • For compatibility, released model directories contain both trainer_config.json and config.json.
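The point of clustering in protein space before splitting is that whole clusters, not individual sequences, are assigned to a split, so near-identical homologs cannot leak between train and test. A minimal sketch of that idea, with an assumed hash-based assignment and illustrative ratios (the real pipeline is resplit_data_v3.py):

```python
# Illustrative sketch of a cluster-aware split: every member of an MMseqs
# cluster lands in the same split, preventing homolog leakage. The hashing
# scheme and split fractions are assumptions, not the actual pipeline.
import hashlib

def split_for_cluster(cluster_id, val_frac=0.05, test_frac=0.05):
    """Deterministically map a cluster ID to train/val/test."""
    h = int(hashlib.sha256(cluster_id.encode()).hexdigest(), 16) % 10_000
    if h < val_frac * 10_000:
        return "val"
    if h < (val_frac + test_frac) * 10_000:
        return "test"
    return "train"

seq_to_cluster = {"seqA": "clu1", "seqB": "clu1", "seqC": "clu2"}
splits = {s: split_for_cluster(c) for s, c in seq_to_cluster.items()}
# Members of the same cluster always land in the same split:
print(splits["seqA"] == splits["seqB"])  # True
```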

Sampling arguments

  • enforce_mapping: when True, each generated codon is constrained to encode the provided amino acid at that position.
  • temperature: softmax temperature. Lower values are more deterministic; 0 selects argmax greedily.
  • top_k: keep only the k highest-logit codon candidates before sampling.
  • top_p: nucleus sampling threshold; keep the smallest set of highest-probability candidates whose cumulative probability reaches p.
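The arguments above can be read as one pipeline applied at each decoding step: mask, temperature-scale, truncate, sample. A self-contained sketch under assumptions (a toy codon table and hand-picked logits; real logits come from the model in sampling.py):

```python
# Sketch of how the sampling arguments could compose at one decoding step.
# The codon table fragment, logits, and function are illustrative only.
import math
import random

CODONS = ["ATG", "AAA", "AAG", "TTT", "TTC", "TAA"]
AA_OF = {"ATG": "M", "AAA": "K", "AAG": "K", "TTT": "F", "TTC": "F", "TAA": "*"}

def sample_codon(logits, amino_acid, enforce_mapping=True,
                 temperature=1.0, top_k=0, top_p=1.0, rng=random):
    pairs = list(zip(CODONS, logits))
    if enforce_mapping:
        # Constrain to codons that actually encode the target amino acid.
        pairs = [(c, l) for c, l in pairs if AA_OF[c] == amino_acid]
    pairs.sort(key=lambda p: p[1], reverse=True)
    if temperature == 0:
        return pairs[0][0]  # greedy argmax
    if top_k > 0:
        pairs = pairs[:top_k]  # keep only the k highest-logit candidates
    probs = [math.exp(l / temperature) for _, l in pairs]
    total = sum(probs)
    probs = [p / total for p in probs]
    names = [c for c, _ in pairs]
    if top_p < 1.0:
        # Nucleus: keep the smallest prefix whose cumulative mass reaches p.
        kept_names, kept_probs, mass = [], [], 0.0
        for name, p in zip(names, probs):
            kept_names.append(name)
            kept_probs.append(p)
            mass += p
            if mass >= top_p:
                break
        total = sum(kept_probs)
        names, probs = kept_names, [p / total for p in kept_probs]
    return rng.choices(names, weights=probs)[0]

logits = [2.0, 1.5, 0.5, -1.0, -0.5, -3.0]
print(sample_codon(logits, "K", temperature=0))  # AAA (argmax among K codons)
```

With enforce_mapping=True, the sampled codon always encodes the requested amino acid regardless of the other settings, because the mask is applied before any truncation or sampling.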