RNA-FM
Extract general-purpose RNA embeddings from sequence using a pretrained foundation model.
- Paper: Nature Machine Intelligence 2024
- Upstream: https://github.com/ml4bio/RNA-FM
- License: MIT
- Device: CPU or GPU. Two image variants:
  - rnazoo-rnafm:latest (CUDA-enabled; default, used with -profile gpu)
  - rnazoo-rnafm-cpu:latest (CPU-only; smaller, used with -profile cpu)
What it does
RNA-FM is a BERT-style foundation model pretrained on 23 million non-coding RNA sequences. It produces 640-dimensional embeddings for each nucleotide position, which can be used as features for downstream tasks (structure prediction, function annotation, etc.). Per-sequence embeddings are computed by mean-pooling over positions.
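The mean-pooling step can be sketched with NumPy as below. This is a minimal illustration on random data, not the module's actual code: the array `token_embeddings` and the helper `mean_pool` are hypothetical names, and it is an assumption here that only nucleotide positions (not BOS/EOS) enter the average.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Average per-nucleotide embeddings of shape (L, D) into one (D,) vector."""
    if token_embeddings.ndim != 2:
        raise ValueError("expected a 2-D array of shape (L, D)")
    return token_embeddings.mean(axis=0)

# Stand-in for RNA-FM output: 100 nucleotides x 640 dimensions (random values).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(100, 640)).astype(np.float32)

sequence_embedding = mean_pool(token_embeddings)
print(sequence_embedding.shape)  # (640,)
```

Averaging gives a fixed-size vector regardless of sequence length, which is what makes the per-sequence output stackable into a single (N, 640) array.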
Input format
FASTA file of RNA sequences using the RNA alphabet (A, C, G, U). In DNA sequences, T is automatically converted to U.
Maximum sequence length: 1022 nt. RNA-FM has 1024 usable positional slots; BOS/EOS consume 2. Longer sequences are truncated with a warning.
Example (tests/data/rnafm_test.fa):
>test_rna_1
GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCUGACUGUCUUUCCGAACGGGCGUUUCUUUUCCUCCGCGCUACCUGCCAGG
>test_rna_2
AUUCCGAGAGCUAACGGAGAACUCUGUUCGAUUUAAGCUGUAAGAUGGCAGUAGCUUACUAGGCAGGAAAAGACCCUGUUGAGCUUGACUCUAGUU
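The input handling described above (T to U conversion and truncation at 1022 nt) can be sketched as follows. This is an assumed approximation, not the code in rnafm_predict.py: the function name `normalize_rna`, the uppercasing step, and the wording of the warning are all hypothetical.

```python
import warnings

MAX_LEN = 1022  # 1024 positional slots minus BOS and EOS

def normalize_rna(seq: str, max_len: int = MAX_LEN) -> str:
    """Uppercase, map T to U, and truncate to max_len nucleotides."""
    seq = seq.strip().upper().replace("T", "U")
    if len(seq) > max_len:
        warnings.warn(f"sequence of {len(seq)} nt truncated to {max_len} nt")
        seq = seq[:max_len]
    return seq

print(normalize_rna("ggatcc"))           # GGAUCC
print(len(normalize_rna("ACGT" * 300)))  # 1022
```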
Output format
A directory containing:
- sequence_embeddings.npy: NumPy array of shape (N, 640), one 640-d embedding per input sequence (mean-pooled over positions)
- labels.txt: one FASTA header per line, in the same order as the embedding rows
With --per-token:
- <label>_tokens.npy: per-sequence NumPy array of shape (L, 640) — one 640-d embedding per nucleotide position
Run with Docker
See the Direct Docker guide for the shared docker run recipe (UID, HOME, USER env vars, and GPU flag). Below are the model-specific parts.
# CPU
docker run --rm \
-v /path/to/input.fa:/data/input.fa \
-v /path/to/output:/out \
ghcr.io/ericmalekos/rnazoo-rnafm-cpu:latest \
rnafm_predict.py -i /data/input.fa -o /out
# GPU
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
-v /path/to/input.fa:/data/input.fa \
-v /path/to/output:/out \
ghcr.io/ericmalekos/rnazoo-rnafm:latest \
rnafm_predict.py -i /data/input.fa -o /out
Add --per-token to either invocation for per-token embeddings.
Run with Nextflow
# CPU
nextflow run main.nf -profile docker,cpu --rnafm_input /path/to/input.fa
# GPU
nextflow run main.nf -profile docker,gpu --rnafm_input /path/to/input.fa
Only models whose input is provided will run, so no ignore flags are needed.
Results appear in results/rnafm/rnafm_out/.
Parameters
| Parameter | Default | Description |
|---|---|---|
| --rnafm_per_token | false | Also output per-token (L x 640) embeddings per sequence |
Reading the output
import numpy as np
# Per-sequence embeddings
embeddings = np.load("rnafm_out/sequence_embeddings.npy") # (N, 640)
with open("rnafm_out/labels.txt") as fh:
    labels = fh.read().splitlines()
for label, emb in zip(labels, embeddings):
    print(f"{label}: {emb.shape}")  # (640,)
# Per-token embeddings (if --per-token was used)
tokens = np.load("rnafm_out/test_rna_1_tokens.npy") # (L, 640)
print(f"Per-token shape: {tokens.shape}")
Example output
Each row of sequence_embeddings.npy is a 640-dimensional vector representing one input RNA sequence. These embeddings can be used directly as input features for classifiers, regressors, or clustering.
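As one example of downstream use, the sketch below computes pairwise cosine similarity between sequence embeddings with NumPy. It runs on a small random stand-in array so that it is self-contained; in practice the array would come from sequence_embeddings.npy. The helper name `cosine_similarity_matrix` is hypothetical, and cosine similarity is only one reasonable choice of metric, not something this module prescribes.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the (N, N) matrix of cosine similarities between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Stand-in for np.load("rnafm_out/sequence_embeddings.npy"): 4 sequences x 640 dims.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 640))

sim = cosine_similarity_matrix(embeddings)
print(sim.shape)  # (4, 4)
```

Row i of `sim` lines up with line i of labels.txt, so the most similar sequence to a given label can be looked up by index.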
Limitations
- Maximum input length is 1022 nucleotides (1024 positional slots minus 2 for BOS/EOS). Longer sequences are truncated.
- This is the ncRNA model. An mRNA-specific model (mRNA-FM, 1280-d) also exists but is not included in this module.
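Because truncation discards everything past position 1022, it can help to check an input file for over-length sequences before running the model. The sketch below is a hypothetical pre-check that is not part of this module: `read_fasta` and `find_overlong` are illustrative names, and the parser assumes plain FASTA with unique `>` header lines.

```python
from typing import Dict, Iterable, List, Tuple

MAX_LEN = 1022  # 1024 positional slots minus BOS and EOS

def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, joining wrapped sequence lines."""
    records: Dict[str, List[str]] = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}

def find_overlong(records: Dict[str, str], max_len: int = MAX_LEN) -> List[Tuple[str, int]]:
    """Return (header, length) for every sequence longer than max_len."""
    return [(h, len(s)) for h, s in records.items() if len(s) > max_len]

example = [">short", "ACGU" * 10, ">long", "ACGU" * 300]
print(find_overlong(read_fasta(example)))  # [('long', 1200)]
```

To check a real file, pass an open file handle to `read_fasta`, since iterating over a text file yields its lines.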