Ivana Blaženović Postdoctoral Researcher

Data Processing and Compound Identification in Untargeted Metabolomics and Exposome Research Ivana Blaženović Postdoctoral Researcher West Coast Metabolomics Center Pittcon 2017

Metabolomics: analysis of the metabolome (by mass spectrometry). Metabolome = the complete set of small molecules found in a biological sample. D. Grapov (WCMC, 2015)

Analysis of Metabolomic Data: pre-analysis, data processing, statistical analysis, multivariate analysis, significant compounds, structure elucidation, validation, biomarker. Diagram labels: extraction; separation, detection, bioinformatics; bioknowledge. T. Kind (WCMC, 2015)

The central dilemma in metabolomics T. Kind (WCMC, 2015)

Omics data complexity
4 bases = Genomics
20 amino acids = Proteomics
2 × 10^6 chemicals = Metabolomics
Chemical complexity: data complexity increases with the number of structures present in the analyzed sample.
Wishart, D. S. Bioanalysis 2011, 3, 1769-1782 (adapted)

Why are there so many unknown compounds?
Endogenous pathways: ~1,000 metabolites, ~5,000 lipids
The epimetabolome: modified metabolites with specific biological functions, ~1,000 metabolites, e.g.
diacetyl-spermine → cancer
methyl-glycine → cancer
dimethyl-arginine → asthma
oxylipins → inflammation
methyl-nicotinamide → pluripotency
We are exposed to many compounds: chemicals, food metabolites…
Food: ~200,000 metabolites, some / many in circulation
O. Fiehn (WCMC, 2015)

Why is it (still) so hard to identify compounds? In silico fragmentation tools retrieve candidate structures, fragment them (using different algorithms and approaches), and then compare the predicted fragments to the product ions in a measured spectrum. Each candidate is assigned a score that reflects how well it explains the measured spectrum. Only 0.088% of known chemicals have MS/MS spectra. Mass spectral libraries are very small and lack diversity. T. Kind (WCMC, 2015)
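The compare-and-score step described above can be sketched in a few lines. This is a minimal illustration, not the scoring used by any of the tools named in this talk: the function names, the 5 ppm tolerance, and the "fraction of explained intensity" score are all assumptions chosen for clarity.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def explained_intensity(candidate_fragments, spectrum, tol_ppm=5.0):
    """Fraction of the spectrum's total intensity matched by predicted fragments.

    candidate_fragments: predicted fragment m/z values for one candidate.
    spectrum: measured product ions as (m/z, intensity) pairs.
    The score itself is a hypothetical stand-in for a real tool's scoring.
    """
    total = sum(intensity for _, intensity in spectrum)
    if total == 0:
        return 0.0
    explained = 0.0
    for mz, intensity in spectrum:
        if any(abs(ppm_error(mz, frag)) <= tol_ppm for frag in candidate_fragments):
            explained += intensity
    return explained / total


def rank_candidates(candidates, spectrum, tol_ppm=5.0):
    """Return (candidate, score) pairs, best-scoring candidate first."""
    scored = [
        (name, explained_intensity(frags, spectrum, tol_ppm))
        for name, frags in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Real tools differ mainly in how the predicted fragments are generated (bond dissociation, learned models, rules) and in how matches are weighted, so this only shows the shared outline.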

Critical Assessment of Small Molecule Analysis (CASMI). The organizers of CASMI change every year, and so does the focus of the contest. However, the contest addresses the bottlenecks in metabolomics research. www.casmi-contest.org/2016/index.shtml

Critical Assessment of Small Molecule Analysis (CASMI) 2016
Objective: structure elucidation of unknown natural products
Provided data sets: training (312 MS/MS spectra) and challenge (208 MS/MS spectra)
MS/MS spectra: ESI, Q Exactive Plus Orbitrap, <5 ppm mass accuracy, MS/MS resolution of 35,000, nominal HCD collision energies of 20/35/50
Category 1: Best Structure Identification on Natural Products
Category 2: Best Automatic Structural Identification – In Silico Fragmentation Only
Category 3: Best Automatic Structural Identification – Full Information
Same data
Automated methods mimic the approach of an experienced chemist when determining the correct structure from MS/MS data. This is important because many analysts who use metabolomics platforms do not necessarily have a chemistry background. Spectral metadata included the ChemSpider ID, compound name, monoisotopic mass, molecular formula, SMILES, InChI, and InChIKey.

In silico fragmentation tools
Selection criteria: open source, user friendly, participants of the CASMI 2016 contest
MetFrag: retrieves candidate structures and fragments them using a bond dissociation approach
CFM-ID: employs a method for learning a generative model of collision-induced dissociation fragmentation
MAGMa+: parameter-optimized version of the original MAGMa software; a Python wrapper script that analyzes substructures and utilizes different bond dissociations
MS-FINDER: rule-based tool that accounts for mass accuracy, isotopic ratio, neutral loss assignment, and the existence of the compound in the built-in comprehensive database

Objectives A) In silico only (Category 2): performance evaluation; improve the existing results

Objectives: In silico + metadata (Category 3): performance evaluation; improve the existing results

Training set results, Category 2: Best Automatic Structural Identification – In Silico Fragmentation Only
Software     Top hit   Top 10
MetFrag      17%       57%
MAGMa+       16%       48%
CFM-ID       15%       55%
MS-FINDER    10%       38%
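The "Top hit" and "Top 10" columns are the share of challenges in which the correct structure was ranked first or within the first ten candidates. A minimal sketch of that bookkeeping follows; the function name and the convention of using None for "correct structure not returned" are assumptions made for illustration.

```python
def top_k_rate(ranks, k):
    """Fraction of challenges whose correct structure is ranked within the top k.

    ranks: one entry per challenge, giving the rank of the correct structure
    (1 = best), or None if the tool did not return it at all.
    """
    if not ranks:
        return 0.0
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)
```

For example, `top_k_rate(ranks, 1)` gives the top-hit rate and `top_k_rate(ranks, 10)` the top-10 rate for one tool.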

Voting / consensus model: Criteria A) In silico only (Category 2)
# 1 Presence of each candidate / software
# 2 \omega = \frac{\text{correctly assigned structures}}{\text{correctly assigned} + \text{falsely assigned structures}}
# 3 S = \sum_{A} \text{Ranking}_{\text{software } A} \cdot \omega(\text{top 10, software } A)
# 4 1st + 2nd + 3rd = input for voting/consensus model
We developed a voting/consensus model that combines the results of all the tools we tested and creates a new ranking for every candidate structure based on two criteria. First, if a tool placed a candidate structure high, meaning in the top 20, we wanted to rank that structure high as well. Moreover, if all 4 tested tools placed the candidate structure high, we wanted to give it an additional boost. We therefore assigned primary scores ranging from 1 to 4. Second, because different tools are expected to differ in quality, we used a training data set to calculate how accurately each tool ranks the correct structure. Blaženović et al. (under review, 2017)
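The weighting idea can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the linear rank-to-score mapping is hypothetical, the 1 to 4 primary scores are not reproduced, and all function and parameter names are invented for the example. Only the weight formula (correct over correct plus false assignments) comes from the slide.

```python
def tool_weight(correctly_assigned, falsely_assigned):
    """Per-tool weight: share of correct assignments on the training set."""
    return correctly_assigned / (correctly_assigned + falsely_assigned)


def consensus_scores(rankings, weights, top_n=20):
    """Combine several tools' rankings into one weighted ranking.

    rankings: tool name -> list of candidate IDs, best first.
    weights:  tool name -> weight from tool_weight().
    Only each tool's top_n candidates contribute. The linear rank score
    (1.0 for rank 1, decreasing to 1/top_n) is an assumption for this sketch.
    """
    scores = {}
    for tool, ranked in rankings.items():
        for position, candidate in enumerate(ranked[:top_n], start=1):
            rank_score = (top_n - position + 1) / top_n
            scores[candidate] = scores.get(candidate, 0.0) + rank_score * weights[tool]
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Because scores accumulate across tools, a candidate that several tools rank highly ends up above one that only a single tool favors, which is the boost the text describes.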

Voting / consensus model – Training set (Category 2: in silico only)
Rank   Model                                                              Top hit   Top 10
# 1    MetFrag + CFM-ID in silico voting/consensus                        22%       62%
# 2    MetFrag + CFM-ID + MAGMa+ in silico voting/consensus               20%       60%
# 3    MetFrag + MS-FINDER + CFM-ID + MAGMa+ in silico voting/consensus   19%       58%
Voting / consensus model improved correct annotations by only 5%
Blaženović et al. (under review, 2017)

Voting / consensus model: Criteria B) Metadata allowed (Category 3)
# 1 In silico consensus rank
# 2 DB presence (derived from MS-FINDER DB)
# 3 2 × DB STOFF-IDENT
# 4 4 × DB MS/MS
\text{Final score} = \text{In silico consensus rank} + \text{DB presence} + 2 \times \text{DB STOFF-IDENT} + 4 \times \text{DB MS/MS}
Blaženović et al. (under review, 2017)
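As a worked reading of the final-score formula, the sketch below treats each database term as a 0/1 indicator and the in silico term as a score where higher is better. Both are assumptions; the slide does not say how the terms are encoded, and the function and parameter names are invented for illustration.

```python
def final_score(in_silico_consensus, in_db, in_stoffident, in_msms_library):
    """Final score = in silico consensus + DB presence + 2 x STOFF-IDENT + 4 x MS/MS.

    in_silico_consensus: numeric consensus value (assumed: higher is better).
    in_db, in_stoffident, in_msms_library: booleans, assumed 0/1 indicators.
    """
    return (
        in_silico_consensus
        + int(in_db)
        + 2 * int(in_stoffident)
        + 4 * int(in_msms_library)
    )
```

Under this reading, a candidate with an in silico value of 0.75 that is present in the database and in an MS/MS library, but not in STOFF-IDENT, would score 0.75 + 1 + 0 + 4 = 5.75.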

Voting / consensus model – Training set (Category 3: database boosting)
Rank   Model                                                   Top hit   Top 10
# 1    MetFrag + CFM-ID + DB voting/consensus                  77.9%     94.9%
# 2    MetFrag + MS-FINDER + CFM-ID + DB voting/consensus      77.6%     95.5%
# 3    MetFrag + CFM-ID + MAGMa(+) + DB voting/consensus       76.9%     95.2%
# 4    MS-FINDER + DB                                          76.6%     94.2%
Voting / consensus model with database boosting improved correct annotations by 56%
Blaženović et al. (under review, 2017)

Voting / consensus model – Training set (Category 3: power of metadata)
Rank   Model                                                          Top hit   Top 10
# 1    MetFrag + CFM-ID + DB + MS/MS voting/consensus                 92.9%     98.1%
# 2    CFM-ID + MAGMa+ ID_sorted + DB + MS/MS voting/consensus        92.6%
# 3    MAGMa+ ID_sorted + DB + MS/MS voting/consensus                 92.3%     98.4%
Voting / consensus model with database and MS/MS boosting improved correct annotations by an additional 15%
Blaženović et al. (under review, 2017)

What about using only mass spectral similarity search?
Software: NIST MS PepSearch
NIST and MassBank MS/MS libraries were searched with a 5 ppm precursor window
Data set                           Number of hits   Dot product score
Training (312 MS/MS spectra)       88.4%            183 - 999
Validation (208 MS/MS spectra)     60%              441 - 999
Most analysts will rely on a dot product score of 700 and above
Example spectrum: Training 109, 4,7-Phenanthroline
m/z        Int
111.0231   117308
121.0075   193638.5
139.018    113030.4
181.0759   1.05E+08
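To make the dot product score concrete, here is a minimal sketch of a cosine-style similarity between two peak lists, scaled to 0-999. It is not the NIST MS PepSearch algorithm, which applies its own peak weighting and matching rules; the function name, the 0.01 m/z matching tolerance, and the greedy nearest-peak matching are assumptions made for illustration.

```python
import math


def dot_product_score(query, library, tol=0.01):
    """Cosine similarity of two spectra, scaled to the 0-999 range.

    query, library: lists of (m/z, intensity) pairs.
    tol: absolute m/z tolerance for calling two peaks a match (assumed value).
    """
    matched = 0.0
    used = set()  # library peaks already paired with a query peak
    for q_mz, q_int in query:
        best = None
        for i, (l_mz, _) in enumerate(library):
            if i in used or abs(q_mz - l_mz) > tol:
                continue
            if best is None or abs(q_mz - l_mz) < abs(q_mz - library[best][0]):
                best = i
        if best is not None:
            used.add(best)
            matched += q_int * library[best][1]
    norm_q = math.sqrt(sum(i * i for _, i in query))
    norm_l = math.sqrt(sum(i * i for _, i in library))
    if norm_q == 0 or norm_l == 0:
        return 0
    return round(999 * matched / (norm_q * norm_l))
```

Identical spectra score 999 and spectra with no shared peaks score 0, so a cutoff such as 700 asks for most of the intensity to fall on shared peaks.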

Validation set performance
CASMI 2016 Category 3 winner: Tobias Kind with 70% correct top hits
Correct hits                Training set   Validation set
In silico + DB + MS/MS      93%            87%
In silico + DB              78%            73%
In silico only              22%            25%
Blaženović et al. (under review, 2017)

Training vs. validation data set: molecular descriptors. The training set did not fully represent the challenge / validation set. Blaženović et al. (under review, 2017)

Summary
Pure in silico algorithms alone identified only 17% of the compounds correctly
A voting/consensus model was established and applied to CASMI Categories 2 and 3, resulting in about 93% correct annotations on the training set
True challenge: identification of the "unknown – unknown" compounds that are not present in any DB
Sharing MS/MS spectra is needed for in silico software improvement

Acknowledgement: Dr. Oliver Fiehn, Dr. Tobias Kind, Hrvoje Torbašinović, Slobodan Obrenović, Sajjan S. Mehta, Dr. Hiroshi Tsugawa, Jian Ji, Dr. Shen Tong

Thank you! Fiehn Lab UC Davis