Bioinformatics Seminar

Time: 11AM
Venue: Zoom Webinar

7 September 2021

This is a PhD Confirmation Seminar

Towards high precision, low-missing-data label-free quantitative mass spectrometry by (deep) machine learning

Soroor Zadeh
WEHI Bioinformatics

In this talk, I will cover topics that are broadly related to the modelling and estimation of missing values in label-free quantification (LFQ). I will present the latest developments and results for msImputev2, which uses techniques from machine learning to model and impute missing values in the data. However, imputation can obscure the amount of available information, particularly if imputed values are treated as being as certain as the observed values in downstream processing (e.g. differential abundance analysis). Peptide identity propagation (PIP) can substantially reduce missing values in label-free mass spectrometry quantification by transferring peptides identified by tandem mass (MS/MS) spectra in one run to runs where the peptides are not identified by MS/MS. The existing frameworks for matching identifications between runs perform peak tracing and propagation based on the similarity of peptide features, using only a limited number of the dimensions available in MS1 data. These approaches are also inherently limited by the lack of a probability measure to assign confidence and filter out likely false positive results. We introduce an embedding-based approach for PIP that uses a higher-dimensional representation of MS1 measurements, optimized by deep neural networks to capture peptide identities. We developed a propagation framework for DDA-PASEF datasets that works entirely on MaxQuant results. Unlike current workflows, our PIP framework performs propagation in more than two dimensions, does not rely on deterministic tolerances for identification transfer, and assigns a probability to each transferred identity. The proposed embedding approach enables quantification of the empirical false discovery rate (FDR) for peptide identification, and increases depth of coverage by co-embedding the runs from the experiment with experimental libraries.
In a number of published datasets with technical and biological variability, acquired with different retention gradient lengths, we demonstrate that our embedding-based PIP framework reduces missing values in MaxQuant results while maintaining high quantification precision and accuracy and a low false transfer rate.

The Walter and Eliza Hall Institute of Medical Research