Bioinformatics Seminars

Time: 10:45am Tuesdays.
Venue:
Level 7 Seminar Room 2, WEHI1

12 March 2019

Removing unwanted variation from TCGA RNA-seq data

Ramyar Molania
WEHI Bioinformatics

The Cancer Genome Atlas (TCGA) Research Network profiled a large number of cancer tissue samples from over 33 human tumor types for aberrations at the DNA, RNA, and protein levels to deepen our biological insight into cancers. This enormous amount of molecular information has been widely used in research and clinical applications. To generate the data for each cancer type, fresh frozen samples were collected from multiple institutions, allocated to different batches, and processed using different procedures at multiple time points. All of these factors can introduce unwanted variation that may obscure biological signals or compromise downstream analysis.

Importantly, the TCGA consortium changed flow cell chemistry in 2012. We identified a substantial batch effect in the TCGA RNA-seqV2 data set that was introduced by this change. We demonstrated that the batch effect influences downstream analyses such as the identification of co-expressed genes and the comparison of paired primary and metastatic samples. We proposed an approach based on our recently developed normalization method, RUV-III, to remove this batch effect. In the absence of true replicates, we used pseudo-replicates to remove batch effects.

We used a range of statistical tools, including relative log expression (RLE) plots and principal component analysis, as well as biological positive controls, to assess how effectively RUV-III removed batch effects and preserved biological heterogeneity. We demonstrated that RUV-III normalization led to accurate estimates of gene co-expression and more precise classification of molecular subtypes of cancer.
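
For readers unfamiliar with RLE plots, the following is a minimal Python sketch of the underlying calculation, not the analysis code used in this work. It assumes a genes-by-samples matrix of log-scale expression values; the function names `rle` and `rle_summary` and the simulated data are hypothetical and for illustration only. Each gene is centred on its median across samples, and each sample is then summarised by the median and interquartile range of its deviations. In data free of unwanted variation, the per-sample medians should sit near zero with similar spreads.

```python
import numpy as np


def rle(log_expr):
    """Relative log expression: subtract each gene's median across samples.

    log_expr is a genes x samples array of log-scale expression values.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    gene_medians = np.median(log_expr, axis=1, keepdims=True)
    return log_expr - gene_medians


def rle_summary(log_expr):
    """Per-sample median and interquartile range of the RLE values."""
    dev = rle(log_expr)
    medians = np.median(dev, axis=0)
    q75, q25 = np.percentile(dev, [75, 25], axis=0)
    return medians, q75 - q25


# Simulated example (hypothetical data): 200 genes, 6 samples,
# with the last three samples shifted upward to mimic a batch effect.
rng = np.random.default_rng(0)
base = rng.normal(loc=8.0, scale=2.0, size=(200, 1))    # gene-level means
noise = rng.normal(loc=0.0, scale=0.1, size=(200, 6))   # sample-level noise
log_expr = base + noise
log_expr[:, 3:] += 1.0                                  # simulated batch shift

medians, iqrs = rle_summary(log_expr)
print("RLE medians:", np.round(medians, 2))
print("RLE IQRs:   ", np.round(iqrs, 2))
```

In this simulated case the shifted samples show RLE medians well above those of the unshifted samples, which is the kind of pattern a boxplot of the RLE values per sample would reveal.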

In summary, the use of RUV-III based on pseudo-technical replicates and suitably chosen negative control genes can lead to satisfactory normalization of RNA-seq data where current normalization methods exhibit shortcomings.

