Skip to content

RNA-FM

Extract general-purpose RNA embeddings from sequence using a pretrained foundation model.

  • Paper: Nature Machine Intelligence 2024
  • Upstream: https://github.com/ml4bio/RNA-FM
  • License: MIT
  • Device: CPU or GPU. Two image variants:
    • rnazoo-rnafm:latest — CUDA-enabled (default, used with -profile gpu)
    • rnazoo-rnafm-cpu:latest — CPU-only (smaller, used with -profile cpu)

What it does

RNA-FM is a BERT-style foundation model pretrained on 23 million non-coding RNA sequences. It produces 640-dimensional embeddings for each nucleotide position, which can be used as features for downstream tasks (structure prediction, function annotation, etc.). Per-sequence embeddings are computed by mean-pooling over positions.

Input format

FASTA file of RNA sequences using RNA alphabet (A, C, G, U). DNA sequences (with T) are automatically converted to U.

Maximum sequence length: 1022 nt. RNA-FM has 1024 usable positional slots; BOS/EOS consume 2. Longer sequences are truncated with a warning.

Example (tests/data/rnafm_test.fa):

>test_rna_1
GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCUGACUGUCUUUCCGAACGGGCGUUUCUUUUCCUCCGCGCUACCUGCCAGG
>test_rna_2
AUUCCGAGAGCUAACGGAGAACUCUGUUCGAUUUAAGCUGUAAGAUGGCAGUAGCUUACUAGGCAGGAAAAGACCCUGUUGAGCUUGACUCUAGUU

Output format

A directory containing:

  • sequence_embeddings.npy: NumPy array of shape (N, 640) — one 640-d embedding per input sequence (mean-pooled over positions)
  • labels.txt: one FASTA header per line, in the same order as the embedding rows

With --per-token: - <label>_tokens.npy: per-sequence NumPy array of shape (L, 640) — one 640-d embedding per nucleotide position

Run with Docker

See the Direct Docker guide for the shared docker run recipe (UID, HOME, USER env vars, and GPU flag). Below are the model-specific parts.

# CPU
docker run --rm \
  -v /path/to/input.fa:/data/input.fa \
  -v /path/to/output:/out \
  ghcr.io/ericmalekos/rnazoo-rnafm-cpu:latest \
  rnafm_predict.py -i /data/input.fa -o /out

# GPU
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  -v /path/to/input.fa:/data/input.fa \
  -v /path/to/output:/out \
  ghcr.io/ericmalekos/rnazoo-rnafm:latest \
  rnafm_predict.py -i /data/input.fa -o /out

Add --per-token to either invocation for per-token embeddings.

Run with Nextflow

# CPU
nextflow run main.nf -profile docker,cpu --rnafm_input /path/to/input.fa

# GPU
nextflow run main.nf -profile docker,gpu --rnafm_input /path/to/input.fa

Only models with input provided will run — no ignore flags needed.

Results appear in results/rnafm/rnafm_out/.

Parameters

Parameter Default Description
--rnafm_per_token false Also output per-token (L x 640) embeddings per sequence

Reading the output

import numpy as np

# Per-sequence embeddings
embeddings = np.load("rnafm_out/sequence_embeddings.npy")  # (N, 640)
labels = open("rnafm_out/labels.txt").read().strip().split("\n")

for label, emb in zip(labels, embeddings):
    print(f"{label}: {emb.shape}")  # (640,)

# Per-token embeddings (if --per-token was used)
tokens = np.load("rnafm_out/test_rna_1_tokens.npy")  # (L, 640)
print(f"Per-token shape: {tokens.shape}")

Example output

Shape: (2, 640)
Labels:
test_rna_1
test_rna_2

Each row is a 640-dimensional vector representing the RNA sequence. These embeddings can be used directly as input features for classifiers, regressors, or clustering.

Limitations

  • Maximum input length is 1022 nucleotides (1024 positional slots minus 2 for BOS/EOS). Longer sequences are truncated.
  • This is the ncRNA model. An mRNA-specific model (mRNA-FM, 1280-d) also exists but is not included in this module.