UTR-LM

Predict 5'UTR expression metrics: mean ribosome loading, translation efficiency, or expression level.

  • Paper: Nature Machine Intelligence 2024
  • Upstream: https://github.com/a96123155/UTR-LM
  • License: GPL-3.0
  • Device: CPU or GPU (lightweight ~5M parameter model). Two image variants:
    • rnazoo-utrlm:latest — CUDA-enabled (default, used with -profile gpu)
    • rnazoo-utrlm-cpu:latest — CPU-only (smaller, used with -profile cpu)

What it does

UTR-LM is a 5'UTR language model pretrained on RNA sequences from 5 species using masked language modeling with secondary structure and minimum free energy supervision. It predicts expression-related metrics from 5'UTR sequences:

  • MRL (Mean Ribosome Loading): from synthetic 50-nt UTR library (Sample et al.)
  • TE (Translation Efficiency): log-transformed, cell-line specific
  • EL (Expression Level): log-transformed RNA-seq expression, cell-line specific

Input format

FASTA file of 5'UTR DNA sequences (A, C, G, T alphabet; U is auto-converted to T):

>synthetic_utr_1
AATTCCGGAATTCCGGAATTCCGGAATTCCGGAATTCCGGAATTCCGGAA
>synthetic_utr_2
GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC
  • MRL task: uses 50-nt sequences (truncated to the last 50 nt if longer)
  • TE/EL tasks: use the last 100 nt of the 5'UTR

Shorter sequences are automatically padded.
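
Because longer sequences are truncated to the 3'-most window, it can be worth checking lengths before a run. A minimal sketch (handles multi-line FASTA records; the input filename is illustrative):

# Print each record's header and total sequence length
awk '/^>/{if(h)print h"\t"l; h=substr($0,2); l=0; next}{l+=length($0)} END{if(h)print h"\t"l}' input.fa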

Output format

predictions.tsv — one prediction per sequence:

header  sequence    mean_ribosome_loading
synthetic_utr_1 AATTCCGGAATTCCGGAATTCCGGAATTCCGGAATTCCGGAATTCCGGAA  8.053340
synthetic_utr_2 GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC  5.337085
synthetic_utr_3 TGATGATGATGATGATGATGATGATGATGATGATGATGATGATGATGATG  6.575327
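
Because predictions.tsv is plain TSV, standard tools work for quick triage. For example, to rank sequences by predicted value (assumes the three-column layout shown above):

# Top 10 predictions, highest first (skip header, general-numeric sort on column 3)
tail -n +2 predictions.tsv | sort -t$'\t' -k3,3gr | head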

Run with Docker

See the Direct Docker guide for the shared docker run recipe (UID, HOME, USER env vars, and GPU flag). Below are the model-specific parts.

# CPU — MRL prediction
docker run --rm \
  -v /path/to/input.fa:/data/input.fa \
  -v /path/to/output:/out \
  ghcr.io/ericmalekos/rnazoo-utrlm-cpu:latest \
  utrlm_predict.py -i /data/input.fa -o /out --task mrl --model-dir /opt/utrlm/Model

# GPU — same command, swap image and add nvidia runtime
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  -v /path/to/input.fa:/data/input.fa \
  -v /path/to/output:/out \
  ghcr.io/ericmalekos/rnazoo-utrlm:latest \
  utrlm_predict.py -i /data/input.fa -o /out --task mrl --model-dir /opt/utrlm/Model

For TE/EL prediction, add --task te|el and --cell-line HEK|pc3|Muscle. For example, TE against HEK on the CPU image:
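
# CPU: TE prediction, HEK cell line (only the task flags change relative to the MRL command)
docker run --rm \
  -v /path/to/input.fa:/data/input.fa \
  -v /path/to/output:/out \
  ghcr.io/ericmalekos/rnazoo-utrlm-cpu:latest \
  utrlm_predict.py -i /data/input.fa -o /out --task te --cell-line HEK --model-dir /opt/utrlm/Model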

Run with Nextflow

# CPU
nextflow run main.nf -profile docker,cpu --utrlm_input /path/to/input.fa --utrlm_task mrl

# GPU
nextflow run main.nf -profile docker,gpu --utrlm_input /path/to/input.fa --utrlm_task mrl

Only models with an input provided will run; no ignore flags are needed.

Results appear in results/utrlm/utrlm_out/.
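
TE/EL runs additionally take a cell line from the parameter table below, e.g.:

# TE prediction, HEK cell line
nextflow run main.nf -profile docker,cpu \
  --utrlm_input /path/to/input.fa \
  --utrlm_task te \
  --utrlm_cell_line HEK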

Parameters

Parameter          Default  Description
--utrlm_task       mrl      Prediction task: mrl, te, or el
--utrlm_cell_line  HEK      Cell line for TE/EL tasks: HEK, pc3, or Muscle

Available tasks

Task  Label                         Input Length  Cell Lines        Description
mrl   Mean Ribosome Loading         50 nt         N/A               From synthetic UTR library
te    Translation Efficiency (log)  100 nt        HEK, pc3, Muscle  Cell-line specific
el    Expression Level (log)        100 nt        HEK, pc3, Muscle  RNA-seq based

Fine-tuning on your own data

UTR-LM can be fine-tuned on your own expression data (MRL, TE, or EL measurements for 5'UTR sequences).

Input format

CSV or TSV file with columns: name, utr (5'UTR sequence), and a numeric label column. Example:

name,utr,my_mrl
seq1,AATTCCGGAATTCCGG...,5.2
seq2,GCGCGCGCGCGCGCGC...,3.8
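
If your measurements live in a separate table, the CSV can be assembled with standard shell tools. A minimal sketch, assuming a 2-line-per-record utrs.fa and a headerless labels.tsv whose rows follow the same order with the value in column 2 (both filenames are hypothetical):

# Pair each FASTA record with its label value and emit name,utr,my_mrl rows
{
  echo "name,utr,my_mrl"
  paste <(awk '/^>/{h=substr($0,2); getline s; print h "\t" s}' utrs.fa) \
        <(cut -f2 labels.tsv) |
    tr '\t' ','
} > my_training_data.csv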

Step 1: Fine-tune

nextflow run main.nf -profile docker,cpu \
  --utrlm_finetune_input my_training_data.csv \
  --utrlm_finetune_label my_mrl \
  --utrlm_finetune_task mrl

Output: utrlm_finetune/best_model.pt (the fine-tuned checkpoint) and utrlm_finetune/predictions.tsv.

Step 2: Predict with fine-tuned model

Use the saved checkpoint for subsequent predictions on new sequences:

nextflow run main.nf -profile docker,cpu \
  --utrlm_input new_sequences.fa \
  --utrlm_checkpoint utrlm_finetune/best_model.pt \
  --utrlm_task mrl

Fine-tuning parameters

Parameter                    Default     Description
--utrlm_finetune_label       (required)  Column name with target values
--utrlm_finetune_task        mrl         Task type: mrl, te, or el (determines backbone + input length)
--utrlm_finetune_epochs      100         Training epochs
--utrlm_finetune_patience    20          Early stopping patience
--utrlm_finetune_lr          0.01        Learning rate
--utrlm_finetune_pretrained  (none)      Optional: initialize from an existing checkpoint
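
Putting these together, a longer run initialized from an earlier checkpoint might look like the following (hyperparameter values are illustrative, and prior_run/best_model.pt is a hypothetical path):

# Fine-tune with custom hyperparameters, warm-started from a prior checkpoint
nextflow run main.nf -profile docker,gpu \
  --utrlm_finetune_input my_training_data.csv \
  --utrlm_finetune_label my_mrl \
  --utrlm_finetune_task mrl \
  --utrlm_finetune_epochs 200 \
  --utrlm_finetune_patience 30 \
  --utrlm_finetune_lr 0.001 \
  --utrlm_finetune_pretrained prior_run/best_model.pt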

Technical notes

  • Uses a custom fork of Facebook's ESM library with RNA-specific alphabet and secondary structure heads.
  • Architecture: 6-layer ESM2 transformer (128-d, 16 heads) + linear classification head.
  • Weights are bundled in the upstream repo (~1.1 GB total for all tasks/folds).
  • MRL has 1 fold; TE/EL have 10 folds each (use --folds all for ensemble averaging).
  • The model uses the CLS/BOS token embedding for per-sequence prediction.
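
For TE/EL, the 10-fold ensemble from the note above is requested at prediction time. Assuming --folds is a flag of utrlm_predict.py (the note names the flag but not where it attaches), a direct Docker call would be:

# CPU: TE prediction averaged over all 10 folds
# (assumes utrlm_predict.py accepts --folds, per the note above)
docker run --rm \
  -v /path/to/input.fa:/data/input.fa \
  -v /path/to/output:/out \
  ghcr.io/ericmalekos/rnazoo-utrlm-cpu:latest \
  utrlm_predict.py -i /data/input.fa -o /out --task te --cell-line HEK --folds all --model-dir /opt/utrlm/Model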