RNAErnie2

RNAErnie2 is a BERT-based RNA language model trained from scratch on a large-scale RNA sequence dataset, with a context length of up to 2048 nucleotides. It is a retrained successor to RNAErnie that replaces the PaddlePaddle-based ERNIE backbone with a standard PyTorch BERT architecture, extends the pretraining corpus to RNACentral v22 (~31M sequences, length <= 2048), and switches to an RNA-native vocabulary (U instead of T).

Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 11 |
| Positional encoding | Absolute learned |
| Architecture | Post-LN BERT / BertForMaskedLM |
| Max sequence length | 2048 |
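
As a sanity check on the table above, the parameter count can be estimated directly from these hyperparameters. The sketch below is a back-of-the-envelope calculation, not code from the repository. It assumes a standard BERT layout with a token-type vocabulary of 2, an MLM head whose decoder weight is tied to the word embeddings, and no pooler; none of these details are stated in the table. Under those assumptions the result lands at the 87.2M parameters reported for the checkpoint.

```python
def bert_mlm_param_count(vocab=11, hidden=768, layers=12,
                         intermediate=3072, max_pos=2048, type_vocab=2):
    """Estimate parameters of a BERT MLM model (tied decoder, no pooler).

    type_vocab=2 and the tied decoder are assumptions, not values
    taken from the model card.
    """
    layer_norm = 2 * hidden  # weight + bias

    # Word, position and token-type embeddings, plus the embedding LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Q, K, V and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two feed-forward projections (weights + biases), plus LayerNorm.
    ffn = ((hidden * intermediate + intermediate)
           + (intermediate * hidden + hidden)
           + layer_norm)

    encoder = layers * (attention + ffn)

    # MLM head: dense transform, LayerNorm, and the output bias.
    # The decoder weight is assumed tied to the word embeddings.
    mlm_head = (hidden * hidden + hidden) + layer_norm + vocab

    return embeddings + encoder + mlm_head


total = bert_mlm_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# prints "87,230,987 parameters (~87.2M)"
```

Almost all of the parameters sit in the 12 encoder layers (about 85.1M). The 11-token vocabulary contributes only a few thousand, while the 2048 learned position embeddings add about 1.6M.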

Vocabulary: [PAD]=0, [UNK]=1, [CLS]=2, [EOS]=3, [SEP]=4, [MASK]=5, A=6, U=7, C=8, G=9, N=10

Pretraining

  • Objective: Masked language modelling (MLM)
  • Data: RNACentral v22, ~31 million RNA sequences with length <= 2048
  • Source checkpoint: LLM-EDA/RNAErnie on HuggingFace Hub
  • Tokenisation note: Sequences use U (not T). Input T is silently converted to U by the tokenizer.
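
To make the vocabulary and the T-to-U note concrete, here is a minimal sketch of the nucleotide-to-ID mapping. It is an illustration only, not the tokenizer shipped with the model: `encode_nucleotides` is a hypothetical helper, it expects upper-case input, and it leaves out special-token insertion ([CLS], [SEP], [EOS]) because the card does not describe how the real tokenizer adds them.

```python
# Token IDs as listed in the Vocabulary line of this card.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[EOS]": 3, "[SEP]": 4, "[MASK]": 5,
    "A": 6, "U": 7, "C": 8, "G": 9, "N": 10,
}


def encode_nucleotides(seq: str) -> list[int]:
    """Map an upper-case nucleotide string to token IDs (hypothetical helper).

    DNA-style T is converted to U first; any character outside the
    vocabulary falls back to [UNK]. Special tokens are not added.
    """
    rna = seq.replace("T", "U")
    return [VOCAB.get(base, VOCAB["[UNK]"]) for base in rna]


print(encode_nucleotides("ATGC"))  # T is treated as U
print(encode_nucleotides("AUGC"))  # same IDs as the line above
```

Because T is converted before lookup, DNA-style and RNA-style spellings of the same sequence produce identical IDs.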

Checkpoint selection

There is a single publicly released RNAErnie2 checkpoint. The weights are taken from LLM-EDA/RNAErnie with one minor adjustment: cls.predictions.decoder.bias is stored explicitly (it was implicitly tied to cls.predictions.bias in the original save and was absent from the file).

Parity Verification

Hidden-state representations and MLM logits were verified to match the original BertForMaskedLM (max abs diff < 2e-5) at all 13 representation levels (embedding output + 12 layers). Verification was run on GPU with PyTorch 2.7 / CUDA 12.
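
A check of this kind can be sketched as follows. This is not the verification script that was actually used; `max_abs_diff_per_level` and `check_parity` are hypothetical helpers. They assume both models were run on the same batch with `output_hidden_states=True`, so that each yields a tuple of 13 tensors. Random stand-in tensors are used here so the block runs without downloading either checkpoint.

```python
import torch


def max_abs_diff_per_level(ref_states, new_states):
    """Return the max absolute difference for each pair of hidden-state tensors."""
    if len(ref_states) != len(new_states):
        raise ValueError("models expose a different number of representation levels")
    return [
        (ref.float() - new.float()).abs().max().item()
        for ref, new in zip(ref_states, new_states)
    ]


def check_parity(ref_states, new_states, tol=2e-5):
    """True if every level stays below the tolerance, plus the per-level diffs."""
    diffs = max_abs_diff_per_level(ref_states, new_states)
    return all(d < tol for d in diffs), diffs


# Stand-ins for the 13 levels (embedding output + 12 layers) of two models.
torch.manual_seed(0)
ref = tuple(torch.randn(2, 16, 768) for _ in range(13))
close = tuple(t + 1e-6 for t in ref)   # tiny numerical drift
far = tuple(t + 1e-3 for t in ref)     # a real mismatch

ok_close, diffs_close = check_parity(ref, close)
ok_far, _ = check_parity(ref, far)
print(ok_close, ok_far)
```

The same comparison applies to the MLM logits by passing them as one-element tuples.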

Implementation Notes

The model uses a custom BERT implementation (modeling_rnaernie2.py) with eager, SDPA, and Flash Attention 2 attention backends, following the architecture of Taykhoom/BERT-updated. The original LLM-EDA/RNAErnie used the standard HF BERT implementation with no custom attention backends.

Related Models

See the full RNAErnie collection.

| Model | Context | Training data | Notes |
|---|---|---|---|
| RNAErnie | 512 | RNACentral (nts<=512) | Original; PaddlePaddle backbone |
| RNAErnie2 | 2048 | RNACentral v22 (~31M seqs) | This model; PyTorch BERT |

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model.eval()

sequences = ["AUGCAUGCAUGC", "GCUGCAUGCUAGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
token_emb = out.last_hidden_state           # (batch, seq_len, 768)

# Intermediate layers: hidden_states[0] is the embedding output, [1]..[12] are the layers
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]       # (batch, seq_len, 768)

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model.eval()

enc = tokenizer(["AUG[MASK]AUG"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 11)

SDPA / Flash Attention 2

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "Taykhoom/RNAErnie2",
    attn_implementation="sdpa",   # or "flash_attention_2"
    trust_remote_code=True,
)

Fine-tuning

Fine-tuning follows standard HF conventions. For sequence-level tasks, use the CLS token embedding (last_hidden_state[:, 0, :]) as input to a classification head.
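
A classification head on the CLS embedding might look like the sketch below. `ClsClassificationHead` is a hypothetical module, not part of this repository, and the dropout rate and label count are arbitrary choices for illustration. A random tensor stands in for the encoder's `last_hidden_state` so the block runs without loading the model.

```python
import torch
from torch import nn


class ClsClassificationHead(nn.Module):
    """Linear classifier over the CLS position (hypothetical example head)."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]          # (batch, hidden_size)
        return self.classifier(self.dropout(cls_emb))  # (batch, num_labels)


head = ClsClassificationHead(num_labels=3)

# Stand-in for model(**enc).last_hidden_state: (batch, seq_len, 768).
dummy_hidden = torch.randn(4, 20, 768)
labels = torch.tensor([0, 1, 2, 1])

logits = head(dummy_hidden)
loss = nn.functional.cross_entropy(logits, labels)
print(logits.shape, loss.item())
```

In a real fine-tuning loop, `dummy_hidden` would be replaced by the encoder output, and the encoder and head parameters would be passed to the same optimizer.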

Citation

@article{wang2024_rnaernie,
  title   = {Multi-purpose {RNA} language modelling with motif-aware pretraining and type-guided fine-tuning},
  author  = {Wang, Ning and Bian, Jiang and Li, Yuchen and Li, Xuhong and Mumtaz, Shahid and Kong, Linghe and Xiong, Haoyi},
  journal = {Nature Machine Intelligence},
  volume  = {6},
  pages   = {548--557},
  year    = {2024},
  doi     = {10.1038/s42256-024-00836-4}
}

Credits

Original model and code by Wang et al. Source: GitHub / HuggingFace. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

Apache 2.0, following the original repository.

Model size: 87.2M params (Safetensors, tensor type F32)