Model Card for ProtGPT3-MSA

Model Details

Model Description

ProtGPT3-MSA is a multiple-sequence, homolog-conditioned autoregressive protein language model. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models for protein sequence generation.

Unlike the single-sequence ProtGPT3 checkpoints, ProtGPT3-MSA can be prompted with sets of homologous protein sequences, enabling few-shot, family-conditioned protein generation without task-specific fine-tuning. At inference time, users can provide homologous protein sequences as context and generate additional family-consistent sequences.

ProtGPT3-MSA was initialized from the final ProtGPT3-112M training checkpoint and further trained to autoregressively model sets of 16 concatenated protein sequences. The model supports both aligned and unaligned prompting modes.

Uses

Direct Use

ProtGPT3-MSA is intended for few-shot, homolog-conditioned protein sequence generation. Users can prompt the model with related protein sequences from a target protein family to generate additional family-consistent sequences.

Downstream Use

ProtGPT3-MSA can be used in protein design workflows where users have a small set of homologous sequences and want to generate plausible additional sequences from the same family. It may be combined with computational screening, structural prediction, fitness prediction, solubility filtering, or other downstream validation pipelines.

Out-of-Scope Use

The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated sequences require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, synthesizable, or experimentally successful proteins.

The model should not be used for irresponsible or harmful biological design applications.

Bias, Risks, and Limitations

ProtGPT3-MSA learns from public protein sequence and MSA datasets and may reproduce biases present in those datasets. The model depends on the quality, relevance, and diversity of the homologous sequences provided in the prompt. Poor, unrelated, noisy, contaminated, or incorrectly aligned prompts may reduce generation quality.

Generated sequences may be nonfunctional, unstable, insoluble, repetitive, low-complexity, or biologically implausible. As with other generative protein models, ProtGPT3-MSA may present dual-use risks if applied irresponsibly.

Recommendations

Users should provide high-quality homologous protein sequences and validate generated sequences with appropriate downstream computational and experimental methods. For family-conditioned generation, users should carefully curate prompts and assess generated sequences using task-relevant criteria such as sequence identity, structural confidence, family-level consistency, solubility, and functional plausibility.
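
Two of the simpler criteria above, sequence identity and a low-complexity check, can be sketched in plain Python. The helper names aligned_identity and residue_entropy are hypothetical and not part of any ProtGPT3 release. The identity function assumes both sequences are already aligned to the same length, and any pass/fail thresholds are left to the user.

```python
import math
from collections import Counter


def aligned_identity(a: str, b: str, gap: str = "-") -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def residue_entropy(seq: str, gap: str = "-") -> float:
    """Shannon entropy (bits) of the residue composition; low values suggest low complexity."""
    residues = [c for c in seq.upper() if c != gap]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


# Example: compare a hypothetical generated sequence against one prompt homolog
prompt_seq = "MKTAYIAKQRQI--SFVKSHFSRQDILD"
generated = "MKTVYIAKQRQINNSFVKSHFSRQDILD"
print(f"identity: {aligned_identity(prompt_seq, generated):.3f}")
print(f"entropy:  {residue_entropy(generated):.2f} bits")
```

These checks are cheap first-pass filters only; structural confidence, solubility, and functional plausibility need dedicated tools.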

How to Get Started with the Model

Install dependencies:

pip install transformers accelerate torch

Load the model and tokenizer:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import random
import re

# ---- Initialise helper functions for prompting ProtGPT3-MSA ----
def process_style(seq: str, gap: bool) -> str:
    """Uppercase the sequence and drop X; remove gap characters unless gap=True."""
    seq = seq.upper()
    if not gap:
        # Unaligned mode: strip alignment gaps
        seq = seq.replace("-", "")
    return re.sub(r"[X]", "", seq)

def build_prompt(
    sequences: list,
    gap: bool = False,
) -> str:
    """Build a prompt for ProtGPT3-MSA from a list of homologous sequences."""

    # Shuffle a copy so the caller's list is not reordered in place
    sequences = random.sample(sequences, k=len(sequences))

    direction = "1"  # change this to "2" for reversed C-to-N generation

    if gap:
        gap_token = "<gap>"
        assert all(len(s) == len(sequences[0]) for s in sequences), (
            "Sequences in the prompt have different lengths but should be aligned; "
            "either align them or use no_gap mode (gap=False)"
        )
    else:
        gap_token = "<no_gap>"

    tokens: list[str] = ["<|bos|>", direction, gap_token]
    for seq in sequences:
        tokens.append("<s>")
        tokens.extend(list(process_style(seq, gap=gap)))

    # Match train-time separator before continuation
    tokens.append("<s>")
    return " ".join(tokens)
## --------------------------------------

model_id = "protgpt3/ProtGPT3-MSA"  # Replace with the final checkpoint name

# Load tokenizer for generation
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=False,  # BOS token is added manually in build_prompt
    add_eos_token=False,
    padding_side="left",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

model.eval()

Few-shot generation with unaligned homologs

Use the <no_gap> modality token for unaligned sequences. Separate homologous sequences with the <s> separator token.

import torch


homologs = [
    "MKTAYIAKQRQISFVKSHFSRQDILD",
    "MKTVYIAKQRQISFVKSHFSRQDILD",
    "MKTAYIAKQRQINNVKSHFSRQNILD",
    # Add up to 15 homologous protein sequences
]

prompt = build_prompt(sequences=homologs)

inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

generated = tokenizer.decode(output_ids[0], skip_special_tokens=False)
print(generated)

Few-shot generation with aligned homologs

Use the <gap> modality token for aligned sequences. Gap characters may be included in the prompted sequences.

import torch

# must have the same length and be aligned
aligned_homologs = [
    "MKTAYIAKQRQI--SFVKSHFSRQDILD",
    "MKTVYIAKQRQI--SFVKSHFSRQDILD",
    "MKTAYIAKQRQINNSFVKSHFSRQNILD",
]

prompt = build_prompt(sequences=aligned_homologs, gap=True)

inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

generated = tokenizer.decode(output_ids[0], skip_special_tokens=False)
print(generated)

Extracting the newly generated sequence

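Because ProtGPT3-MSA is a decoder-only model, generate() returns the prompt tokens followed by the continuation, so the decoded string contains the full prompt as well as the new sequence. A simple post-processing approach is to split on the sequence separator token and inspect the final segment. If generation continues past one sequence, the final segment is the last sequence generated, not necessarily the first new one: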

decoded = tokenizer.decode(output_ids[0], skip_special_tokens=False)

segments = decoded.split("<s>")
generated_sequence = segments[-1].replace(tokenizer.eos_token or "", "").strip()

print(generated_sequence)
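
When the model emits more than one new sequence, it can help to parse every segment and keep only those that follow the prompt. The sketch below works on the decoded string alone. It assumes the decoded text has the same layout that build_prompt produces (space-separated tokens, with <s> before each sequence) and that every special token has the form <...>. The helper name split_generated and the <|eos|> string in the example are hypothetical and should be checked against the actual tokenizer.

```python
import re


def split_generated(decoded: str, n_prompt: int, sep: str = "<s>") -> list:
    """Return the sequences that follow the first `n_prompt` prompt sequences."""
    # Everything before the first separator is the header (<|bos|>, direction, modality).
    _, *segments = decoded.split(sep)
    sequences = []
    for segment in segments:
        # Drop any remaining special tokens of the form <...>, then all whitespace.
        cleaned = re.sub(r"<[^<>]*>", "", segment)
        cleaned = "".join(cleaned.split())
        if cleaned:
            sequences.append(cleaned)
    return sequences[n_prompt:]


# Example with a hand-written decoded string (two prompt sequences, one new sequence)
decoded_example = "<|bos|> 1 <no_gap> <s> M K T <s> M K V <s> M K A <|eos|>"
print(split_generated(decoded_example, n_prompt=2))
```

In aligned mode the returned sequences keep their gap characters, which can be removed afterwards with a plain string replace if ungapped sequences are needed.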

Notes on prompting

  • Use <no_gap> for unaligned homologous sequences.
  • Use <gap> for aligned MSA-style inputs containing gap characters.
  • Separate protein sequences with <s>.
  • Provide up to 15 homologous sequences as context.
  • Sampling parameters such as temperature and top_p can affect sequence quality, diversity, and family consistency.
  • Generated sequences should be validated before experimental use.
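
The prompting rules above can be checked before a prompt is built. The following is a minimal sketch, not part of the released code: check_prompt_sequences is a hypothetical helper, and the 20-letter residue alphabet is an assumption about the tokenizer vocabulary. It is deliberately stricter than build_prompt, which silently drops X (and gap characters in unaligned mode) instead of reporting them.

```python
# Assumed vocabulary: the 20 standard amino acids
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
# From the notes above: up to 15 homologs as context
MAX_CONTEXT_SEQUENCES = 15


def check_prompt_sequences(sequences: list, gap: bool = False) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not sequences:
        problems.append("no sequences provided")
    if len(sequences) > MAX_CONTEXT_SEQUENCES:
        problems.append(
            f"{len(sequences)} sequences given, at most {MAX_CONTEXT_SEQUENCES} expected"
        )
    if gap and len({len(s) for s in sequences}) > 1:
        problems.append("aligned mode requires equal-length sequences")
    # Gap characters are only expected in aligned mode
    allowed = VALID_RESIDUES | ({"-"} if gap else set())
    for i, seq in enumerate(sequences):
        unexpected = sorted(set(seq.upper()) - allowed)
        if unexpected:
            problems.append(f"sequence {i}: unexpected characters {unexpected}")
    return problems


# Example: an aligned prompt with mismatched lengths is reported
print(check_prompt_sequences(["MKTA--YIAK", "MKTAQRYIAK", "MKTAYIAK"], gap=True))
```

Returning a list of problems, instead of raising on the first one, lets a caller report everything wrong with a prompt in one pass.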

Training Details

Training Data

ProtGPT3-MSA was trained on approximately 8.5M MSAs from the OpenProteinSet Uniclust30 dataset. From each MSA, 16 sequences were sampled without replacement and concatenated in random order. This process was repeated 15 times for each MSA, resulting in approximately 560B training tokens.
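
As a rough consistency check on these figures, the average example size can be derived from the stated numbers. The values below are back-of-the-envelope estimates, not figures reported by the authors; they assume each of the 15 sampling rounds yields one training example per MSA and that the 560B tokens cover all examples once. They come to roughly 4,400 tokens per example, or about 275 tokens per sequence (special tokens and, in aligned mode, gap tokens included).

```python
n_msas = 8.5e6          # approximate number of MSAs (from the text)
repeats_per_msa = 15    # sampling rounds per MSA (from the text)
seqs_per_example = 16   # sequences concatenated per example (from the text)
total_tokens = 560e9    # approximate training tokens (from the text)

# Derived quantities (estimates under the assumptions stated above)
n_examples = n_msas * repeats_per_msa
tokens_per_example = total_tokens / n_examples
tokens_per_sequence = tokens_per_example / seqs_per_example

print(f"examples:            {n_examples:,.0f}")
print(f"tokens per example:  {tokens_per_example:,.0f}")
print(f"tokens per sequence: {tokens_per_sequence:,.0f}")
```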

Training Procedure

Preprocessing

Each training example consisted of 16 concatenated protein sequences sampled from the same MSA. A special sequence separator token, <s>, was used to mark sequence boundaries.

Training included both aligned and unaligned modalities:

  • <gap>: aligned modality, where sequences include gap tokens
  • <no_gap>: unaligned modality, where sequences are provided without gaps

The model was trained autoregressively to predict concatenated protein sequences token by token.
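
The preprocessing described above might look roughly like the sketch below. This is an illustration under assumptions, not the authors' pipeline: make_training_example is a hypothetical helper, the header layout (<|bos|>, a direction token, a modality token) is borrowed from the build_prompt function in the getting-started code, and details the card does not specify, such as how an example is terminated or how MSAs with fewer than 16 sequences are handled, are left out or turned into an error.

```python
import random


def make_training_example(msa, gap, n_seqs=16, rng=None):
    """Sample `n_seqs` sequences from one MSA and join them into a token string."""
    rng = rng or random.Random()
    if len(msa) < n_seqs:
        # Handling of small MSAs is not described in the card; this sketch refuses them.
        raise ValueError(f"MSA has {len(msa)} sequences, need at least {n_seqs}")

    # random.sample draws without replacement and returns the picks in random order
    sampled = rng.sample(msa, n_seqs)

    # Header layout assumed to match build_prompt: BOS, direction, modality
    tokens = ["<|bos|>", "1", "<gap>" if gap else "<no_gap>"]
    for seq in sampled:
        # Aligned mode keeps gap characters; unaligned mode strips them
        seq = seq.upper() if gap else seq.replace("-", "").upper()
        tokens.append("<s>")  # separator marks each sequence boundary
        tokens.extend(seq)    # one token per residue (or gap)
    return " ".join(tokens)


# Example with a toy MSA of 20 short aligned sequences
toy_msa = ["MK-T" + residue for residue in "ACDEFGHIKLMNPQRSTVWY"]
print(make_training_example(toy_msa, gap=True, rng=random.Random(0))[:60])
```

Passing a seeded random.Random makes the sampling reproducible, which is convenient when comparing aligned and unaligned versions of the same example.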

Technical Specifications

Model Architecture and Objective

ProtGPT3-MSA is a decoder-only autoregressive protein language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained to model concatenated sets of related protein sequences, enabling homolog-conditioned generation through prompting.

The model processes up to 16 concatenated protein sequences and supports both aligned and unaligned modalities. During inference, users may provide up to 15 homologous sequences and generate an additional sequence conditioned on the prompt.

Compute Infrastructure

Software

Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.

Citation

BibTeX:

@article{protgpt3,
  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
  author={Anonymous Authors},
  year={2026}
}

More Information

All models and code are released through the Hugging Face ecosystem and accompanying code repository.
