RNAZoo

A Nextflow pipeline model zoo for RNA deep learning.

What's included

16 models across 5 tracks. Every container has its model weights baked in at build time, so nothing is downloaded at runtime. Image sizes below are compressed download sizes from GHCR; on disk, images roughly double in size after extraction.

| Model | Track | Training set | Input | Output | GPU image | CPU image |
|---|---|---|---|---|---|---|
| RiboNN | Translation | 78 human cell-type TE | TSV (tx_id, UTR5, CDS, UTR3) | per-cell-type TE TSV | 2.8 GB | 1.2 GB |
| Riboformer | Translation | ribo-seq, 5 species | Dir (WIG + GFF + FASTA) | model_prediction.txt | 4.0 GB | 2.3 GB |
| RiboTIE | Translation | human ribo-seq (8 SRRs) | Dir (FASTA + GTF + BAMs + YAML) | per-sample GTF / CSV / NPY | 3.9 GB | 1.3 GB |
| seq2ribo | Translation | 4 human cell-line ribo-seq + sTASEP sim | FASTA mRNA | seq2ribo_output.json | 10.3 GB | — (GPU only) |
| TranslationAI | Translation | 47K human RefSeq mRNAs | FASTA mRNA | *_predTIS / *_predTTS / *_predORFs.txt | 1.9 GB | 0.6 GB |
| Saluki | Translation | 66 mRNA-decay datasets (human + mouse) | FASTA (UTR lowercase, CDS UPPERCASE) | preds.npy | 4.2 GB | 1.4 GB |
| CodonTransformer | Translation | 1M genes across 164 organisms | FASTA protein | optimized DNA FASTA | 3.7 GB | 1.2 GB |
| RNA-FM | Foundation | 23M ncRNAs (RNAcentral) | FASTA RNA | sequence_embeddings.npy + labels.txt | 4.2 GB | 1.7 GB |
| RiNALMo | Foundation | 36M ncRNAs (RNAcentral) | FASTA RNA | sequence_embeddings.npy + labels.txt | 5.6 GB | 3.1 GB |
| ERNIE-RNA | Foundation | 20M ncRNAs (RNAcentral) | FASTA RNA | sequence_embeddings.npy + labels.txt | 5.7 GB | — (single image) |
| Orthrus | Foundation | 32.7M mRNAs (GENCODE+RefSeq+Zoonomia, contrastive) | FASTA mature mRNA (4-track) | sequence_embeddings.npy + labels.txt | ~5 GB | — (GPU only) |
| RNAformer | Structure | bpRNA + PDB (LoRA-finetuned) | FASTA RNA | structures.txt (dot-bracket) | 3.8 GB | — (single image) |
| RhoFold | Structure | PDB + bpRNA self-distillation | FASTA RNA | PDB + ss.ct + results.npz | 4.2 GB | 1.7 GB |
| SPOT-RNA | Structure | bpRNA + PDB + Rfam | FASTA RNA | structures.txt + per-seq bpseq / ct / prob | 2.7 GB | 0.6 GB |
| MultiRM | Modification | ~300K human modification sites | FASTA RNA | modification_scores.tsv + predicted_sites.tsv | 3.5 GB | 1.0 GB |
| UTR-LM | mRNA Design | 5'UTRs, 5 species + MPRA (MRL) | FASTA 5'UTR | predictions.tsv | 4.9 GB | 2.4 GB |

Totals: CPU set is ~28 GB across 14 images; GPU set is ~70 GB across 16 images. See the installation page for the matching pre-pull commands.
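
As a rough planning aid, the doubling rule of thumb can be turned into a quick disk estimate. The sketch below is an illustration and not part of RNAZoo: `estimate_disk_gb` and `free_gb` are hypothetical helper names, the 2x factor is only the approximation stated above, and checking the current directory is an assumption (container runtimes often store images elsewhere, for example under /var/lib/docker).

```shell
# Hypothetical helpers, not part of RNAZoo.

# Estimated on-disk size in GB: compressed size times 2 (approximate factor).
estimate_disk_gb() {
    awk -v gb="$1" 'BEGIN { printf "%.0f\n", gb * 2 }'
}

# Free space in GB on the filesystem that holds the given path.
# Assumption: pass the directory where your container runtime stores images.
free_gb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%.0f\n", $4 / 1024 / 1024 }'
}

echo "CPU set: ~$(estimate_disk_gb 28) GB on disk"
echo "GPU set: ~$(estimate_disk_gb 70) GB on disk"
echo "Free here: $(free_gb .) GB"
```

With the totals above, this works out to roughly 56 GB for the CPU set and 140 GB for the GPU set.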

Quick start

# Run the test suite (13 models on CPU, ~5 min)
nextflow run . -profile test,docker,cpu

# Run a single model — only models you provide input for will run
nextflow run . -profile docker,cpu --rnafm_input my_sequences.fa

# Run multiple models in parallel
nextflow run . -profile docker,cpu \
  --rnafm_input seqs.fa \
  --rnaformer_input seqs.fa \
  --multirm_input seqs.fa

# Use a YAML params file for complex runs
nextflow run . -profile docker,cpu -params-file my_params.yml
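
For reference, a params file for the three-model run might look like the sketch below. The keys reuse the `--*_input` parameter names from the commands above; `my_params.yml` and `seqs.fa` are placeholder names, and any other keys the pipeline may accept are not shown here.

```shell
# Write a minimal params file (placeholder file names).
# Each YAML key corresponds to the --<key> command-line parameter.
cat > my_params.yml <<'EOF'
rnafm_input: seqs.fa
rnaformer_input: seqs.fa
multirm_input: seqs.fa
EOF

# Then pass it to the pipeline:
# nextflow run . -profile docker,cpu -params-file my_params.yml
```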

With plain Docker (no Nextflow required)

# Run one model against a FASTA (CPU)
# Create the output directory first so it is owned by your user, not root
mkdir -p out
docker run --rm \
    -u "$(id -u):$(id -g)" -e HOME=/tmp -e USER="$(whoami)" \
    -v "$PWD/seqs.fa:/data/input.fa" -v "$PWD/out:/out" \
    ghcr.io/ericmalekos/rnazoo-rnafm-cpu:latest \
    rnafm_predict.py -i /data/input.fa -o /out

See the Direct Docker guide for invocations of every model.
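
To run the same image over several FASTA files, the command above can be wrapped in a small shell function. This is a minimal sketch, not something RNAZoo ships: `run_rnafm`, the `DRY_RUN` switch, and the `out/<name>/` layout are all hypothetical, and only the image name, script, and flags are taken from the example above.

```shell
# Hypothetical wrapper around the docker command above (not part of RNAZoo).
# Usage: run_rnafm <fasta>    results go to ./out/<fasta basename>/
# Set DRY_RUN=1 to print the command instead of running it.
run_rnafm() {
    fasta=$1
    case $fasta in
        /*) ;;                      # already an absolute path
        *)  fasta=$PWD/$fasta ;;    # docker -v treats a bare name as a volume
    esac
    name=$(basename "$fasta")
    outdir=$PWD/out/${name%.*}

    # Build the command as the function's positional parameters.
    set -- docker run --rm \
        -u "$(id -u):$(id -g)" -e HOME=/tmp -e USER="$(whoami)" \
        -v "$fasta:/data/input.fa" -v "$outdir:/out" \
        ghcr.io/ericmalekos/rnazoo-rnafm-cpu:latest \
        rnafm_predict.py -i /data/input.fa -o /out

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        # Create the output directory as the current user before mounting it.
        mkdir -p "$outdir" && "$@"
    fi
}

# Preview the command for one file without starting a container.
DRY_RUN=1 run_rnafm seqs.fa
```

A loop such as `for f in *.fa; do run_rnafm "$f"; done` would then process each file in turn, writing each result set to its own directory.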

Design principles

  • One Docker image per model — weights baked in at build time, no runtime downloads
  • CPU by default — GPU-only models auto-skip under -profile cpu
  • Per-model input/output — each model uses its native format, no forced preprocessing
  • Portable — runs anywhere with Docker or Singularity + Nextflow

License

RNAZoo pipeline code is open source. Individual models carry their own licenses — see each model's page for details. Most are MIT/Apache-2.0; some have non-commercial restrictions noted on their pages.