Computational Challenges in Metabolomics (Part 1)
David Wishart, University of Alberta
Dagstuhl Seminar on Computational Mass Spectrometry
Schloss Dagstuhl, Germany, Aug. 23-28, 2015
The Pyramid of Life
[Figure: pyramid with the levels Genome (Genomics), Proteome (Proteomics) and Metabolome (Metabolomics); labels: Physiological Influence, Environmental Influence]
Why Small Molecules Count
- 100% of all agricultural products (herbicides, pesticides, fertilizers) are small molecules
- >99% of all compounds that give food or drinks their aroma, color and taste are small molecules
- 91% of all known drugs are small molecules
- >85% of all common clinical assays test for small molecules
- 60% of all drugs are derived from pre-existing metabolites
- 10-15% of identified genetic disorders involve diseases of small molecule metabolism
Proteomics vs. Metabolomics
Proteomics:
- Very MS or MS/MS oriented
- Good separation is critical
- Generates lots of raw data
- Peptide and protein ID
- Isotopic labeling (ICAT) helps
- Possible to derive 3D structure
- Permits protein imaging
- Very dependent on databases
- Spectral processing and deconvolution is challenging
- Quantitation is challenging
- Data analysis requires MV stats
- Data integration is challenging
- Better software is needed
Metabolomics:
- Very MS or MS/MS oriented
- Good separation is critical
- Generates lots of raw data
- Chemical ID
- Isotopic labeling (SIL) helps
- Possible to derive 3D structure
- Permits metabolite imaging
- Very dependent on databases
- Spectral processing and deconvolution is challenging
- Quantitation is challenging
- Data analysis requires MV stats
- Data integration is challenging
- Better software is needed
Proteomics Workflow
[Figure: workflow diagram; labels: Biofluid/Extracts, HPLC or PAGE, Tryptic Digest, MALDI plate, MS analysis, Mass Fingerprint, Protein ID]
Protein ID by PMF-MS
Metabolomics Workflow
[Figure: workflow diagram; labels: Biological or Tissue Samples, Extraction, Biofluids or Extracts, LC-MS or GC-MS, LC/GC-MS Spectra, Compound ID]
Compound ID by GC/LC-MS
[Figure: LC/GC-MS total ion chromatogram with an identified compound structure]
Proteomics vs. Metabolomics
Proteomics:
- Polymers of 20 amino acids (chemically similar)
- 185 million sequences (from DNA sequencing)
- Sequence defines MS & MS/MS spectra
- Trypsin gives definable cleavages
- MS alone can ID proteins (PMF)
- MS/MS fragmentation at 1 fixed energy
- MS/MS fragmentation is easily predictable and very distinct
- 30 common PTMs
- PTMs are somewhat predictable
Metabolomics:
- 1000s of distinct chemical classes (chemically diverse)
- No information from DNA sequencing
- Structure defines MS & MS/MS spectra (adducts, fragments)
- No trypsin for small molecules (CID only)
- MS alone cannot ID metabolites
- Different energies for different molecules
- MS/MS & EI-MS fragments not easily predictable, often similar
- >400 PTMs via metabolism
- PTMs are hard to predict
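The proteomics side of this comparison ("Trypsin gives definable cleavages", "MS alone can ID proteins (PMF)") can be made concrete with a small sketch. The Python below is an illustration only, not code from the talk: the function names, the example sequence and the 0.01 Da tolerance are assumptions, and the cleavage rule is the commonly used simplification (cut after K or R unless followed by P) with no missed cleavages or modifications.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # charge carrier for the singly protonated ion


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mh(peptide):
    """Monoisotopic m/z of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def mass_fingerprint(sequence):
    """Sorted list of theoretical [M+H]+ values for all tryptic peptides."""
    return sorted(peptide_mh(p) for p in tryptic_digest(sequence))


def count_matches(observed, theoretical, tol=0.01):
    """Number of observed masses within tol (Da) of any theoretical mass."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


if __name__ == "__main__":
    seq = "AKRPGKSEQR"  # hypothetical sequence, for illustration only
    print(tryptic_digest(seq))
    print([round(m, 4) for m in mass_fingerprint(seq)])
```

No comparable rule exists for small molecules, which is the contrast the metabolomics list draws: there is no sequence from which masses and fragments can be enumerated this directly.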
Challenges for Metabolomics
- Most MS-based metabolomics studies ID <100 cmpds (<1% of the known metabolome)
- Metabolite ID requires accurate, referential MS/MS or EI-MS spectra and/or RT information
- Limited experimental MS/MS, EI-MS & RT data
- The chemical space of most metabolomes is not fully known (perhaps >5 million compounds total)
- <1% of the chemicals in PubChem are relevant to metabolomics
- Metabolomics needs specialized compound and spectral (MS/MS, EI-MS, NMR) databases
- Metabolomics needs computational tools to predict biologically viable metabolites and their spectra
LC-MS Spectral DBs
- MoNA – 236,604 spectra, 69,946 cmpds** (12,000)
- METLIN – 68,124 spectra, 13,048 cmpds
- mzCloud – 422,349 spectra, 2975 cmpds
- NIST14 MS/MS – 234,284 spectra, 9344 cmpds
- MassBank – 28,185 spectra, 11,500 cmpds
- Wiley LC-MSn – >10,000 spectra, 4500 poisons
- ReSpect – 9107 spectra, 3595 cmpds
- GNPS – 9000 spectra, 4200 natural products
Total #compounds with exp. MS/MS spectra ~20,000
Less than 60% are biologically relevant
How to Get Missing Spectra?
- Obtain or synthesize all biologically relevant molecules (metabolites, HPVs, drugs, pollutants, foods, etc.), prepare or synthesize their metabolites and collect their NMR, LC-MS and GC-MS spectra
  COST: 5,000,000 cmpds x $1000/cmpd = $5 billion
OR
- Do this entire exercise computationally
  COST: 5,000,000 cmpds x $0.10/cmpd = $500,000
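The two cost estimates can be checked with a few lines of Python. This is only a worked version of the slide's arithmetic; the variable and function names are illustrative, and per-compound costs are held in integer cents to avoid floating-point rounding.

```python
N_COMPOUNDS = 5_000_000

# Per-compound costs from the slide, expressed in cents.
EXPERIMENTAL_COST_CENTS = 1000 * 100   # $1000 per compound
COMPUTATIONAL_COST_CENTS = 10          # $0.10 per compound


def total_cost_dollars(n_compounds, cost_cents):
    """Total cost in whole dollars for n_compounds at cost_cents each."""
    return n_compounds * cost_cents // 100


experimental_total = total_cost_dollars(N_COMPOUNDS, EXPERIMENTAL_COST_CENTS)
computational_total = total_cost_dollars(N_COMPOUNDS, COMPUTATIONAL_COST_CENTS)
savings_factor = experimental_total // computational_total

print(f"Experimental:  ${experimental_total:,}")
print(f"Computational: ${computational_total:,}")
print(f"Ratio:         {savings_factor:,}x")
```

On these figures the computational route is cheaper by a factor of 10,000, which is the per-compound cost ratio ($1000 versus $0.10).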
Computational Metabolomics
- Known biomolecules (50,000)
- Predicted biotransformations (50,000 --> 5,000,000)
- Predicted MS/MS, NMR, GC-MS spectra of knowns + biotransformed
- Match observed spectra to predicted spectra to ID compounds
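The last step, matching observed spectra to predicted spectra, can be sketched with a binned cosine similarity. The slide does not say which scoring function is used, so cosine scoring is an assumption here, chosen because it is a common baseline. The function names, the bin width and the peak lists in the usage example are all made up for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.01):
    """Sum peak intensities into m/z bins of the given width."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned


def cosine_score(spectrum_a, spectrum_b, bin_width=0.01):
    """Cosine similarity between two binned spectra (0 = disjoint, 1 = identical shape)."""
    va = bin_spectrum(spectrum_a, bin_width)
    vb = bin_spectrum(spectrum_b, bin_width)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(observed, predicted_library):
    """Return (name, score) of the predicted spectrum most similar to the observed one."""
    scored = ((name, cosine_score(observed, peaks))
              for name, peaks in predicted_library.items())
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up (m/z, intensity) peak lists, not real reference data.
    observed = [(91.05, 100.0), (119.05, 40.0)]
    library = {
        "candidate_A": [(91.05, 90.0), (119.05, 50.0)],
        "candidate_B": [(77.04, 100.0), (105.03, 30.0)],
    }
    print(best_match(observed, library))
```

A production matcher would also need to handle m/z tolerance across bin edges, precursor mass filtering and intensity weighting, all of which this sketch leaves out.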
The Human Metabolome Database (HMDB)
- A web-accessible resource containing detailed information on 41,993 “quantified”, “detected” and “expected” metabolites
- Data mined from the literature and other eDBs
- 100s of drug metabolites
- 1000s of xenobiotics
- >10,000 reference spectra
- Supports sequence, spectral, structure and text searches as well as compound browsing
- Full data downloads
http://www.hmdb.ca
The Drug Database (DrugBank v. 4.3)
- 1602 small molecule drugs
- >5000 experimental drugs
- Data mined from the literature and other eDBs
- >1000 drugs with metabolizing enzyme data
- >1200 drug metabolites
- >600 MS+NMR spectra
- >4200 unique drug targets
- 208 data fields/drug
- Supports sequence, spectral, structure and text searches as well as compound browsing
- Full data downloads
http://www.drugbank.ca
The Toxic Exposome Database (T3DB)
- Comprehensive data on toxic compounds (drugs, pesticides, herbicides, endocrine disruptors, solvents, carcinogens, etc.)
- Data mined from the literature and other eDBs
- >3600 toxic compounds
- >1900 reference spectra
- ~2100 toxic targets
- Supports sequence, spectral, structure and text searches as well as compound browsing
- Full data downloads
http://www.t3db.ca
Secondary Metabolism
[Figure: structures of Diazepam, Temazepam, Oxazepam, Nordazepam and N-(2-Benzoyl-4-chlorophenyl)-2-acetamidoacetamide]
BioTransformer
BioTransformer - Flowchart
[Flowchart; labels: Query Molecule; Enzyme metabolite? (Machine Learning), YES/NO; SOM Predictor (Machine Learning), NO SOMs; Reaction-specific structural constraints; Phase I; Other Reactions; Metabolite Generator, NO Metabolites; No metabolites]
All structures are generated as SMILES, SDF or MOL files
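One possible reading of the flowchart's control flow is sketched below. The order of the steps is an assumption reconstructed from the flowchart labels; the slide does not show BioTransformer's implementation. Every name here is hypothetical, the three predictors are passed in as plain callables, and the usage example works on strings rather than real chemistry.

```python
def predict_metabolites(molecule, is_enzyme_substrate, predict_soms, reactions):
    """Toy control flow: substrate check -> SOM prediction -> constrained reactions.

    reactions is a list of (name, satisfies_constraints, apply_reaction) tuples,
    where both functions take (molecule, som). Returns a list of
    (reaction_name, product) pairs; an empty list means "no metabolites".
    """
    # Step 1: classifier decides whether an enzyme would act on the molecule.
    if not is_enzyme_substrate(molecule):
        return []

    # Step 2: predict sites of metabolism (SOMs); stop if none are found.
    soms = predict_soms(molecule)
    if not soms:
        return []

    # Step 3: apply each reaction only where its structural constraints hold.
    metabolites = []
    for name, satisfies_constraints, apply_reaction in reactions:
        for som in soms:
            if satisfies_constraints(molecule, som):
                metabolites.append((name, apply_reaction(molecule, som)))
    return metabolites


if __name__ == "__main__":
    # Stand-in predictors operating on a string; not a chemical model.
    toy_reactions = [
        ("toy_hydroxylation",
         lambda mol, i: mol[i] == "C",
         lambda mol, i: mol[:i + 1] + "(O)" + mol[i + 1:]),
    ]
    result = predict_metabolites(
        "CCO",
        is_enzyme_substrate=lambda mol: "O" in mol,
        predict_soms=lambda mol: [i for i, ch in enumerate(mol) if ch == "C"],
        reactions=toy_reactions,
    )
    print(result)
```

Passing the classifier, SOM predictor and reaction rules in as arguments keeps the control flow separate from the models, so each could be replaced by a trained predictor or a cheminformatics toolkit without changing the pipeline.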
BioTransformer – SOM Prediction
- Preference learning based on 100 atomic features (e.g. atom type) and 10 molecular features (e.g. mass)
- SOM predictor was trained for 9 CYP450s
- Average prediction accuracy of 84.54%
- Structures generated based on 92 Phase I reactions
BioTransformer Results
5,000 compounds
--> 6,230 Phase I metabolites
--> 9,510 Phase II metabolites
--> 6,110 Microbial metabolites
--> 12,340 Other metabolites
34,000 metabolites ~220,000
Computational Challenges in Metabolomics (Part 2)
Sebastian Böcker, Friedrich Schiller University
Dagstuhl Seminar on Computational Mass Spectrometry
Schloss Dagstuhl, Germany, Aug. 23-28, 2015