Orthrus

Mamba-based mature mRNA foundation model. Produces 512-dimensional global embeddings from full mRNA sequences for downstream property prediction (half-life, ribosome load, localization, RBP interaction, isoform function).

  • Paper: Nature Methods 2026
  • Upstream: https://github.com/bowang-lab/Orthrus
  • License: MIT (code + weights)
  • Device: GPU only. Mamba's selective-scan kernel is CUDA-only in the bundled mamba_ssm wheel, so the process is skipped under -profile cpu with a warning.
  • Image: single variant, rnazoo-orthrus:latest

What it does

Orthrus is a self-supervised foundation model trained on 32.7 million transcripts from GENCODE, RefSeq, and Zoonomia ortholog alignments (10 model organisms, 400+ mammalian species), using contrastive learning over splice-isoform pairs and orthologous transcript pairs. The encoder is a Mamba state-space model — unlike transformer-based foundation models (RNA-FM, RiNALMo, ERNIE-RNA) which scale O(L²) in attention memory, Mamba scales linearly in sequence length, so Orthrus handles long mRNAs (>10 kb) without OOM.
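The training objective is not needed to run inference, but for intuition: each transcript is embedded, and embeddings of paired views (a splice isoform of the same gene, or an ortholog of the same transcript) are pulled together while other transcripts in the batch are pushed apart. The snippet below is a generic InfoNCE-style sketch of that idea, not the upstream training code; all names are illustrative.

import torch
import torch.nn.functional as F

def paired_contrastive_loss(z_a, z_b, temperature=0.07):
    # z_a, z_b: (batch, dim) embeddings of two views of the same transcripts,
    # e.g. a transcript and one of its splice isoforms or orthologs.
    # Row i of z_a and row i of z_b form a positive pair; every other row
    # in the batch acts as a negative.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature                       # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, targets)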

Why only 4-track?

Orthrus has two upstream variants and this module ships only the 4-track v1 model (orthrus_v1_4_track, 512-d output):

| Variant | Channels | Output | What's needed at inference |
| --- | --- | --- | --- |
| 4-track (bundled) | A, C, G, U one-hot (4) | 512-d | A FASTA. That's it. |
| 6-track | nucleotides (4) + CDS mask + splice-junction mask | 512-d | The mature spliced sequence plus per-position CDS bounds and exon-junction positions. Upstream uses GenomeKit, which precompiles a GTF/GFF annotation together with a 2bit reference genome (~1 GB per assembly). |

The 6-track model produces better embeddings on downstream property tasks because it gets told upfront where the protein-coding region is and where introns were spliced out. But the input contract is much heavier: users would need to provide either (a) transcript IDs + a bundled reference genome + annotation, or (b) a custom format that pre-encodes the two extra channels. The 4-track FASTA-in path matches how the rest of the model zoo works (RNA-FM, RiNALMo, ERNIE-RNA all take plain FASTA), so we ship that and revisit 6-track if a user needs it.
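For concreteness, a 6-track input is roughly the 4-track one-hot plus two binary mask channels. The helper below is a purely illustrative sketch of that layout (it is not part of this module and not the upstream GenomeKit code); the CDS bounds and junction positions are hypothetical arguments the caller would have to supply.

import numpy as np

NT_INDEX = {"A": 0, "C": 1, "G": 2, "U": 3}

def six_track_encode(seq, cds_start, cds_end, junctions):
    # seq:       mature spliced mRNA (A/C/G/U; T is tolerated and converted)
    # cds_start: 0-based start of the CDS within the spliced sequence
    # cds_end:   0-based exclusive end of the CDS
    # junctions: 0-based positions of exon-exon junctions in the spliced sequence
    # Returns a (length, 6) array: 4 one-hot nucleotide channels,
    # a CDS-mask channel, and a splice-junction-mask channel.
    seq = seq.upper().replace("T", "U")
    tracks = np.zeros((len(seq), 6), dtype=np.float32)
    for i, nt in enumerate(seq):
        if nt in NT_INDEX:
            tracks[i, NT_INDEX[nt]] = 1.0
    tracks[cds_start:cds_end, 4] = 1.0       # CDS mask
    tracks[list(junctions), 5] = 1.0         # splice-junction mask
    return tracks

The bundled 4-track path is the same array without the last two channels, which is why a plain FASTA is enough.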

Input format

FASTA of complete mature mRNA sequences (5'UTR + CDS + 3'UTR + poly-A, or as much as you have of the spliced transcript). DNA (T) is auto-converted to U at parse time.

Important: Orthrus was trained exclusively on full mature transcripts. Partial sequences (e.g. CDS only, single exons, ncRNA fragments) are out-of-distribution and produce embeddings that do not reflect the model's learned mRNA representations. The wrapper warns when sequences are shorter than --min-len (default 200 nt) but does not refuse them.
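If you want to catch questionable inputs before launching the container, a small pre-check along these lines (plain Python, not shipped with the image; MIN_LEN mirrors the wrapper's default) flags short records and applies the same T-to-U conversion:

MIN_LEN = 200  # mirrors the wrapper's --min-len default

def check_fasta(path):
    # Parse a FASTA, convert DNA (T) to RNA (U), and warn on short records.
    records, label, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if label is not None:
                    records.append((label, "".join(chunks)))
                label, chunks = line[1:], []
            elif line:
                chunks.append(line.upper().replace("T", "U"))
        if label is not None:
            records.append((label, "".join(chunks)))
    for name, rna in records:
        if len(rna) < MIN_LEN:
            print(f"WARNING: {name} is {len(rna)} nt (< {MIN_LEN}); likely out-of-distribution")
    return records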

Example (tests/data/orthrus_test.fa): two synthetic ~500 nt mature-mRNA-shaped sequences with 5'UTR + ORF + 3'UTR + poly-A structure.

Output format

A directory containing:

  • sequence_embeddings.npy: NumPy array of shape (N, 512) — one 512-d embedding per input sequence (mean-pooled across non-padding positions by model.representation())
  • labels.txt: one FASTA header per line, in the same order as the embedding rows

With --per-token:

  • <label>_tokens.npy: per-sequence NumPy array of shape (L, 512) — one 512-d embedding per nucleotide position
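Something like the following reads the per-token arrays back; the glob pattern simply matches the <label>_tokens.npy naming described above.

from pathlib import Path
import numpy as np

out_dir = Path("orthrus_out")
for token_file in sorted(out_dir.glob("*_tokens.npy")):
    tokens = np.load(token_file)      # (L, 512): one row per nucleotide position
    print(f"{token_file.stem}: {tokens.shape}")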

Run with Docker

See the Direct Docker guide for the shared docker run recipe.

docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  -v /path/to/input.fa:/data/input.fa \
  -v /path/to/output:/out \
  ghcr.io/ericmalekos/rnazoo-orthrus:latest \
  orthrus_predict.py -i /data/input.fa -o /out

Add --per-token for per-token embeddings.

Run with Nextflow

nextflow run main.nf -profile docker,gpu --orthrus_input /path/to/input.fa

Under -profile cpu the process logs a warning and skips. Results appear in results/orthrus/orthrus_out/.

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --orthrus_variant | v1_4_track | Model variant. Currently only v1_4_track is bundled. |
| --orthrus_per_token | false | Also output per-token (L × 512) embeddings per sequence. |

Reading the output

import numpy as np

embeddings = np.load("orthrus_out/sequence_embeddings.npy")  # (N, 512)
labels = open("orthrus_out/labels.txt").read().strip().split("\n")

for label, emb in zip(labels, embeddings):
    print(f"{label}: {emb.shape}")  # (512,)

Why Mamba (linear memory)

Compared with the transformer-based foundation models in RNAZoo:

| Model | Embedding | Architecture | Memory at L = 10k nt |
| --- | --- | --- | --- |
| RNA-FM | 640-d | Transformer (12-layer) | ~2.5 GB attention matrix (full attention) |
| RiNALMo | 1280-d | Transformer (33-layer, 650M params) | ~7 GB attention matrix (full attention) |
| ERNIE-RNA | 768-d | Transformer (12-layer) | ~2.5 GB attention matrix (full attention) |
| Orthrus | 512-d | Mamba SSM (6-layer, ~10M params) | Linear (~MB scale) |

For mRNAs >5 kb, Orthrus is often the only foundation model in the zoo that fits on a single consumer GPU.
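The back-of-the-envelope behind the table: a full attention matrix costs roughly L² entries per head per layer, while the selective scan's activations grow linearly with L. The estimates below are rough (fp16, and the head count and state size are assumptions), intended only to show the scaling gap, not to reproduce measured numbers.

def attention_matrix_bytes(seq_len, n_heads, bytes_per_el=2):
    # One layer's full attention matrices: seq_len x seq_len per head (fp16).
    return seq_len ** 2 * n_heads * bytes_per_el

def mamba_scan_bytes(seq_len, d_model, d_state=16, bytes_per_el=2):
    # Selective-scan activations: linear in sequence length.
    return seq_len * d_model * d_state * bytes_per_el

L = 10_000
print(f"attention, 12 heads, one layer: {attention_matrix_bytes(L, 12) / 1e9:.1f} GB")
print(f"mamba scan, d_model=512:        {mamba_scan_bytes(L, 512) / 1e6:.0f} MB")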

Limitations

  • Mature transcripts only. Partial sequences are OOD.
  • GPU required. No CPU fallback in the bundled image.
  • 4-track only. The 6-track variant (which adds CDS/splice tracks for slightly better embeddings) is not exposed; GenomeKit + reference genome staging would be needed to wire it in.
  • Embedding dimension is 512 — smaller than RiNALMo (1280) and ERNIE-RNA (768), comparable to RNA-FM (640). This is the SSM hidden dim and is fixed by the architecture.