Bioinformatics Seminars

Time:
Venue: Na

13 August 2019

SVEnsemble: an algorithm for ensemble structural variant calling using re-evaluated quality scores via probabilistic random forest

Ruining Dong
WEHI Bioinformatics

The ever-growing amount of deep whole-genome sequencing (WGS) data allows detection of complex genomic rearrangements, including single-nucleotide variations, copy number variants, and structural variants (SVs). These advancements have led to significant discoveries in the mechanisms of human diseases, especially cancer. However, due to the complexity of genomic rearrangements in human genomes, SV calling remains a challenging problem. So far, dozens of SV calling tools have contributed to human cancer studies, yet each tool has its own strengths and weaknesses.

One strategy for consolidating these tools is ensemble SV calling, which combines the results of multiple callers into a single output. Ensemble callers have been shown to overcome the weaknesses of individual callers, yielding higher precision as well as sensitivity. However, such outcomes rely on a smart selection of SV calls to combine. Despite the potential for superior performance, there is currently no optimised method for building SV ensembles. Furthermore, the quality of SVs is assessed primarily from the quality scores output by individual callers, which are unreliable for this purpose.

In this seminar, I will report on a work-in-progress project that addresses these challenges: SVEnsemble, an SV ensemble algorithm that merges call sets from multiple SV callers and re-calibrates SV quality scores using probabilistic machine learning models. We apply multiple SV-calling tools to one WGS sample. Selected features from the results are merged and trained against a database of SVs for medical and population genetics, generating a unique prediction model for that sample. Each SV is assigned a re-calibrated quality score derived from the model's probabilistic score. The algorithm is evaluated on multiple reference datasets accompanied by high-quality truth sets. It shows an all-round performance improvement over conventional ensemble callers as well as individual callers. SVEnsemble also identifies novel SV calls that are not reported in the gnomAD-SV database.
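
The merging step can be pictured with a minimal Python sketch. It is not the SVEnsemble implementation: the `SVCall` record, the `merge_calls` and `feature_row` helpers, the caller names, and the 100 bp matching window are all assumptions made for illustration. The sketch groups calls that agree on chromosome and SV type and lie close together, then turns each merged site into a row of per-caller quality scores, the kind of feature vector a probabilistic classifier could be trained on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    """One SV call from one caller (hypothetical, simplified record)."""
    caller: str
    chrom: str
    pos: int
    svtype: str
    qual: float


def merge_calls(calls, max_dist=100):
    """Greedily merge calls with the same chromosome and SV type whose
    position lies within max_dist bp of the first call of the current site.
    The 100 bp default is an illustrative assumption."""
    merged = []
    # Sorting makes calls with equal (chrom, svtype) contiguous and ordered by position.
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.pos)):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last["chrom"] == call.chrom
            and last["svtype"] == call.svtype
            and call.pos - last["pos"] <= max_dist
        ):
            # Keep the highest score if a caller reports the site more than once.
            best = max(call.qual, last["quals"].get(call.caller, 0.0))
            last["quals"][call.caller] = best
        else:
            merged.append({
                "chrom": call.chrom,
                "pos": call.pos,
                "svtype": call.svtype,
                "quals": {call.caller: call.qual},
            })
    return merged


def feature_row(site, callers):
    """Per-caller quality scores in a fixed order; 0.0 means the caller made no call."""
    return [site["quals"].get(name, 0.0) for name in callers]


# Toy input with two hypothetical callers.
calls = [
    SVCall("callerA", "chr1", 1000, "DEL", 50.0),
    SVCall("callerB", "chr1", 1040, "DEL", 30.0),
    SVCall("callerA", "chr1", 5000, "DUP", 20.0),
    SVCall("callerB", "chr2", 1000, "DEL", 10.0),
]
sites = merge_calls(calls)
rows = [feature_row(site, ["callerA", "callerB"]) for site in sites]
```

In this toy input the two chr1 deletions, 40 bp apart, collapse into one site supported by both callers, while the other two calls remain single-caller sites. Rows of this kind, labelled by whether the site matches a reference SV database, are the sort of input on which a random forest could be trained to produce re-calibrated scores.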

