# Orthrus
Mamba-based mature mRNA foundation model. Produces 512-dimensional global embeddings from full mRNA sequences for downstream property prediction (half-life, ribosome load, localization, RBP interaction, isoform function).
- Paper: Nature Methods 2026
- Upstream: https://github.com/bowang-lab/Orthrus
- License: MIT (code + weights)
- Device: GPU only — Mamba's selective-scan kernel is CUDA-only in the bundled mamba_ssm wheel. Skipped under -profile cpu with a warning.
- Single image variant: rnazoo-orthrus:latest
## What it does
Orthrus is a self-supervised foundation model trained on 32.7 million transcripts from GENCODE, RefSeq, and Zoonomia ortholog alignments (10 model organisms, 400+ mammalian species), using contrastive learning over splice-isoform pairs and orthologous transcript pairs. The encoder is a Mamba state-space model — unlike transformer-based foundation models (RNA-FM, RiNALMo, ERNIE-RNA) which scale O(L²) in attention memory, Mamba scales linearly in sequence length, so Orthrus handles long mRNAs (>10 kb) without OOM.
## Why only 4-track?
Orthrus has two upstream variants and this module ships only the 4-track v1 model (orthrus_v1_4_track, 512-d output):
| Variant | Channels | Output | What's needed at inference |
|---|---|---|---|
| 4-track (bundled) | A, C, G, U one-hot (4) | 512-d | A FASTA. That's it. |
| 6-track | nucleotides (4) + CDS-mask + splice-junction-mask | 512-d | The mature spliced sequence and per-position CDS bounds and exon-junction positions. Upstream uses GenomeKit, which precompiles a GTF/GFF annotation together with a 2bit reference genome (~1 GB per assembly). |
The 6-track model produces better embeddings on downstream property tasks because it gets told upfront where the protein-coding region is and where introns were spliced out. But the input contract is much heavier: users would need to provide either (a) transcript IDs + a bundled reference genome + annotation, or (b) a custom format that pre-encodes the two extra channels. The 4-track FASTA-in path matches how the rest of the model zoo works (RNA-FM, RiNALMo, ERNIE-RNA all take plain FASTA), so we ship that and revisit 6-track if a user needs it.
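To make the channel difference concrete, here is a minimal sketch of the two input encodings. The channel order and mask conventions below are illustrative assumptions, not the upstream implementation:

```python
import numpy as np

NUC_INDEX = {"A": 0, "C": 1, "G": 2, "U": 3}

def four_track(seq):
    """(L, 4) one-hot over A/C/G/U; all the bundled 4-track model needs at inference."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, nt in enumerate(seq):
        x[i, NUC_INDEX[nt]] = 1.0
    return x

def six_track(seq, cds_start, cds_end, junctions):
    """(L, 6): one-hot plus a CDS mask and an exon-junction mask (illustrative layout)."""
    x = np.zeros((len(seq), 6), dtype=np.float32)
    x[:, :4] = four_track(seq)
    x[cds_start:cds_end, 4] = 1.0   # CDS track: 1 inside the coding region
    x[list(junctions), 5] = 1.0     # splice track: 1 at exon-junction positions
    return x
```

This is why the 6-track path needs annotation (GenomeKit plus a reference genome) at inference time: the two extra masks cannot be derived from the FASTA sequence alone.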
## Input format
FASTA of complete mature mRNA sequences (5'UTR + CDS + 3'UTR + poly-A, or as much as you have of the spliced transcript). DNA (T) is auto-converted to U at parse time.
Important: Orthrus was trained exclusively on full mature transcripts. Partial sequences (e.g. CDS only, single exons, ncRNA fragments) are out-of-distribution and produce embeddings that do not reflect the model's learned mRNA representations. The wrapper warns when sequences are shorter than --min-len (default 200 nt) but does not refuse them.
Example (tests/data/orthrus_test.fa): two synthetic ~500 nt mature-mRNA-shaped sequences with 5'UTR + ORF + 3'UTR + poly-A structure.
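The wrapper performs the T-to-U conversion and the short-sequence warning itself; a minimal pre-flight check you could run on your FASTA beforehand (plain Python, no extra dependencies) might look like this:

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
        if header is not None:
            yield header, "".join(chunks)

MIN_LEN = 200  # mirrors the wrapper's --min-len default

for name, seq in read_fasta("input.fa"):
    rna = seq.replace("T", "U")  # same conversion the wrapper applies at parse time
    if len(rna) < MIN_LEN:
        print(f"WARNING: {name} is {len(rna)} nt (<{MIN_LEN}); likely not a full mature transcript")
```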
## Output format
A directory containing:
- sequence_embeddings.npy: NumPy array of shape (N, 512) — one 512-d embedding per input sequence (mean-pooled across non-padding positions by model.representation())
- labels.txt: one FASTA header per line, in the same order as the embedding rows
With --per-token:
- <label>_tokens.npy: per-sequence NumPy array of shape (L, 512) — one 512-d embedding per nucleotide position
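The per-token arrays can be picked up with a glob, for example; the exact filename produced for a given header depends on how the wrapper sanitizes FASTA labels, so treat the pattern below as an assumption:

```python
from pathlib import Path
import numpy as np

for path in sorted(Path("orthrus_out").glob("*_tokens.npy")):
    tokens = np.load(path)  # (L, 512): one embedding per nucleotide position
    print(path.name, tokens.shape)
```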
## Run with Docker
See the Direct Docker guide for the shared docker run recipe.
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
-v /path/to/input.fa:/data/input.fa \
-v /path/to/output:/out \
ghcr.io/ericmalekos/rnazoo-orthrus:latest \
orthrus_predict.py -i /data/input.fa -o /out
Add --per-token for per-token embeddings.
## Run with Nextflow
Under -profile cpu the process logs a warning and skips. Results appear in results/orthrus/orthrus_out/.
## Parameters
| Parameter | Default | Description |
|---|---|---|
| --orthrus_variant | v1_4_track | Model variant. Currently only v1_4_track is bundled. |
| --orthrus_per_token | false | Also output per-token (L x 512) embeddings per sequence. |
## Reading the output
import numpy as np
embeddings = np.load("orthrus_out/sequence_embeddings.npy") # (N, 512)
labels = open("orthrus_out/labels.txt").read().strip().split("\n")
for label, emb in zip(labels, embeddings):
    print(f"{label}: {emb.shape}")  # (512,)
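The pooled embeddings are meant to be used as frozen features for downstream property prediction. A minimal sketch of a linear probe, assuming you have one numeric label per sequence (e.g. measured half-life, in the same order as labels.txt) and scikit-learn installed:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = np.load("orthrus_out/sequence_embeddings.npy")  # (N, 512) frozen Orthrus features
y = np.loadtxt("half_life.txt")                     # hypothetical labels, one value per sequence

# Ridge-regression probe on the embeddings, scored by 5-fold cross-validated R^2.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(f"mean R^2 across folds: {scores.mean():.3f}")
```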
## Why Mamba (linear memory)
Compared to the transformer foundations in RNAZoo:
| Model | Embedding | Architecture | Memory at L=10k nt |
|---|---|---|---|
| RNA-FM | 640-d | Transformer (12-layer) | ~2.5 GB attention matrix (full attn) |
| RiNALMo | 1280-d | Transformer (33-layer, 650M params) | ~7 GB attention matrix (full attn) |
| ERNIE-RNA | 768-d | Transformer (12-layer) | ~2.5 GB attention matrix (full attn) |
| Orthrus | 512-d | Mamba SSM (6-layer, ~10M params) | Linear (~MB scale) |
For mRNAs >5 kb, Orthrus is often the only foundation model in the zoo that fits on a single consumer GPU.
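A rough back-of-the-envelope version of the table, assuming fp16 activations, 12 attention heads, and a single layer (head and layer counts vary across the models above, so treat the constants as assumptions):

```python
def attention_matrix_bytes(seq_len, n_heads=12, bytes_per_elem=2):
    # One L x L score matrix per head per layer: quadratic in sequence length.
    return seq_len ** 2 * n_heads * bytes_per_elem

def ssm_activation_bytes(seq_len, d_model=512, bytes_per_elem=2):
    # A state-space layer keeps O(L * d) activations, never an L x L matrix.
    return seq_len * d_model * bytes_per_elem

for L in (1_000, 5_000, 10_000):
    attn_gb = attention_matrix_bytes(L) / 1e9
    ssm_mb = ssm_activation_bytes(L) / 1e6
    print(f"L={L:>6}: attention ~{attn_gb:.2f} GB/layer, SSM ~{ssm_mb:.1f} MB/layer")
```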
## Limitations
- Mature transcripts only. Partial sequences are OOD.
- GPU required. No CPU fallback in the bundled image.
- 4-track only. The 6-track variant (which adds CDS/splice tracks for slightly better embeddings) is not exposed; GenomeKit + reference genome staging would be needed to wire it in.
- Embedding dimension is 512 — smaller than RiNALMo (1280) and ERNIE-RNA (768), comparable to RNA-FM (640). This is the SSM hidden dim and is fixed by the architecture.