Bioinformatics Seminar

Time: 11AM
Venue: Davis Auditorium and Online

22 October 2024

Evaluation of de novo assembly for long-read RNAseq: a step towards reference unbiased transcriptome analysis

Alex Yan
WEHI Bioinformatics/BCBC

Long-read sequencing has significantly advanced transcriptomics by enabling full-length transcripts to be read. However, current analysis methods often depend on a high-quality reference genome and gene annotation. Recently, de novo assembly methods have been developed to utilise long-read data in cases where a reference genome is unavailable, such as in non-model species. Despite the potential of these tools, there remains a lack of benchmarking and established protocols for optimal reference-free, long-read transcriptome assembly and differential expression analysis. Here, we comprehensively evaluate the current state-of-the-art long-read de novo transcriptome assembly tools, RATTLE, RNA-Bloom2 and isONform, and compare their performance to one of the leading short-read assemblers, Trinity. We assess various metrics, including assembly quality and computational efficiency, across a range of datasets. These include simulated data and spike-in sequin transcripts, where ground truth is known, and real data from human cell line and pea (Pisum sativum) samples, where the reference-guided assembler Bambu is used to define truth. To represent contemporary analysis scenarios, the datasets cover depths from 6 million to 60 million reads and both cDNA and direct RNA sequencing. Critically, we also assess the downstream impact of assembly choice on the detection of differential gene and transcript expression. Our results confirm that long-read sequencing outperforms short-read sequencing for reference-free analysis, though limitations remain compared to reference-guided approaches, suggesting scope for improvement. Of the de novo pipelines, RNA-Bloom2, coupled with Corset for transcript clustering, performed best in both accuracy and computational efficiency. Our findings provide valuable insights for selecting the most effective strategy for long-read differential expression analysis when a high-quality reference genome is unavailable.

The Walter and Eliza Hall Institute of Medical Research