1 Metabolomics a Promising ‘omics Science By Susan Simmons University of North Carolina Wilmington.

Presentation transcript:

1 Metabolomics a Promising ‘omics Science By Susan Simmons University of North Carolina Wilmington

2 Collaborators
• Dr. David Banks, Duke
• Dr. Chris Beecher, University of Michigan
• Dr. Xiaodong Lin, University of Cincinnati
• Dr. Young Truong, UNC
• Dr. Jackie Hughes-Oliver, NC State
• Dr. Stanley Young, NISS
• Dr. Ann Stapleton, UNCW Biology
• Dr. Robert Simmons, MD

3 What is Metabolomics?
• The word "metabolome" was first used in 1998, less than a decade ago, and refers to all low-molecular-mass compounds synthesized and modified by a living cell or organism (Villas-Boas, 2007).
• The complete human metabolome consists of endogenous metabolites (~1,800) and exogenous metabolites (many more).
• Human Metabolome Project

4

5 Fluorene degradation: reference pathway (Kyoto Encyclopedia of Genes and Genomes)

6 Mass Distribution of Compounds in the Human Metabolome
• Metabolome (natively biosynthesized, monomeric)
• Complex metabolites
• Xenobiome

7 History of Metabolomics
• Machinery to detect metabolites has existed since the late 1960s.
• The first paper appeared in 1971 (Robinson and Pauling).
• The first paper to use the term "metabolomics" appeared in the late 1990s.

8 Why Metabolomics can be promising
• Easy-to-use screening for disease
• Assistance in identifying gene function
• Drug discovery
• Assessment of toxicity (especially liver toxicity) in new drugs
• Nutrigenomics and diet strategies

9 Genomics, Proteomics and Metabolomics

10 The emerging science of Metabolomics

11 Metabolomics DNA → RNA → Protein → Biochemicals (Metabolites)
• Genomics: 25,000 genes
• Transcriptomics: 100,000 transcripts
• Proteomics: 1,000,000 proteins
• Metabolomics: 1,800 compounds

12 Biochemical Profile Map to Metabolic Pathways Biochemical Profile

13 Data Collection and Measurement Issues To obtain data, a tissue sample is taken from a patient. Then:
• The sample is prepped and placed into wells on a silicon plate.
• Each well's aliquot is subjected to gas and/or liquid chromatography.
• After separation, the sample goes to a mass spectrometer.

14 MS platforms (workflow diagram; stages: Preparation, Analysis, Informatics)
• Sample preparation
• Chromatography and mass spectrometry: GC with MS/ei; LC with MS/+ and MS/-
• Data extraction: peak identification, peak alignment, peak deconvolution
• Chemical identification: reference databases, ion spectra, grouping related ions, compound id
• Quantitation, quality control, data reduction
• Other labels in the diagram: Data Set, Metabolyzer, LIMS, Interpretation Interface

15 Data Collection and Measurement Issues The sample prep involves stabilizing the sample, adding spiked-in calibrants, and creating multiple aliquots (some are frozen) for QC purposes. This is roboticized. Sources of error in this step include:
• within-subject variation
• within-tissue variation
• contamination by cleaning solvents
• calibrant uncertainty
• evaporation of volatiles

16 Data Collection and Measurement Issues The result is a set of m/z values and timestamps for each ion, which can be viewed as a 2-D histogram in the m/z × time plane. One then estimates the amount of each metabolite. This entails normalization, which also introduces error. The caveats pointed out in Baggerly et al. (Proteomics, 2003) apply.

17 Data Collection and Measurement Issues
• Baseline correction
• Alignment
• Estimating the quantity of specific metabolites

18

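19 Data Collection and Measurement Issues Let $\mathbf{z}$ be the vector of raw data, and let $\mathbf{x}$ be the vector of estimates. Then the measurement equation is $G(\mathbf{z}) = \mathbf{x} = \boldsymbol{\mu} + \boldsymbol{\varepsilon}$, where $\boldsymbol{\mu}$ is the vector of unknown true values and $\boldsymbol{\varepsilon}$ is decomposable into separate components. For metabolite $i$, the estimate $X_i$ is $g_i(\mathbf{z}) = \ln \sum_j w_{ij} \iint [\mathrm{sm}(\mathbf{z}) - c(m,t)]\, dm\, dt$.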

20 Data Collection and Measurement Issues The law of propagation of error (essentially the delta method) says that the variance of $X = g(\mathbf{z})$ is approximately $\sum_{i=1}^{n} (\partial g/\partial z_i)^2\, \mathrm{Var}[z_i] + 2 \sum_{i<k} (\partial g/\partial z_i)(\partial g/\partial z_k)\, \mathrm{Cov}[z_i, z_k]$, with each pair of distinct inputs counted once. The weights depend on the values of the spiked-in calibrants, so this gets complicated.
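A minimal sketch of the propagation-of-error calculation, under simplifying assumptions that are not in the slides: the estimate is taken to be g(z) = ln(Σ w_j z_j) for two raw inputs, and the weights, variances, and covariance are made-up illustrative numbers. The function names are hypothetical.

```python
def log_weighted_sum_gradient(z, w):
    """Gradient of g(z) = ln(sum_j w_j * z_j): dg/dz_j = w_j / sum(w * z)."""
    total = sum(wj * zj for wj, zj in zip(w, z))
    return [wj / total for wj in w]


def delta_method_variance(grad, cov):
    """First-order variance of g(z) from its gradient and Cov[z]."""
    n = len(grad)
    # Variance terms: (dg/dz_i)^2 * Var[z_i]
    var = sum(grad[i] ** 2 * cov[i][i] for i in range(n))
    # Covariance terms: each unordered pair i < k counted once, with factor 2
    var += 2 * sum(
        grad[i] * grad[k] * cov[i][k]
        for i in range(n)
        for k in range(i + 1, n)
    )
    return var


# Illustrative (assumed) values: two raw intensities with correlated noise
z = [100.0, 50.0]
w = [1.0, 2.0]
cov = [[4.0, 1.0],
       [1.0, 9.0]]

grad = log_weighted_sum_gradient(z, w)
approx_var = delta_method_variance(grad, cov)
print(approx_var)
```

With these numbers the gradient is (0.005, 0.01), so the three contributions are 0.0001, 0.0009, and 0.0001. In practice the weights themselves are uncertain, which adds further terms that this sketch ignores.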

21 Data Collection and Measurement Issues Cross-platform experiments are also crucial for medical use. This leads to key comparison designs, in which the same sample (or aliquots of a standard solution or sample) is sent to multiple labs. Each lab produces its spectrogram. It is impossible to decide which lab is best, but one can estimate how to adjust for interlab differences.

22 Data Collection and Measurement Issues The Mandel bundle-of-lines model is what we suggest for interlaboratory comparisons. This assumes $X_{ik} = \alpha_i + \beta_i \theta_k + \varepsilon_{ik}$, where $X_{ik}$ is the estimate at lab $i$ for metabolite $k$, $\theta_k$ is the unknown true quantity of metabolite $k$, and $\varepsilon_{ik} \sim N(0, \sigma_{ik}^2)$.

23 Data Collection and Measurement Issues To solve the equations given values from the labs, one must impose constraints. A Bayesian can put priors on the laboratory coefficients and the error variance. Metabolomics needs a multivariate version, with models for the rates at which compounds volatilize.
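One simple way to make the bundle-of-lines model identifiable can be sketched as follows. The constraints used here (lab intercepts average to 0, lab slopes average to 1) are an assumption chosen for illustration, not something the slides specify, and this is a least-squares fit rather than the Bayesian approach mentioned above. Under those constraints the cross-lab mean for each metabolite estimates θ_k, and each lab's line is then fitted by ordinary least squares against those means. The function name and data are hypothetical.

```python
def fit_bundle_of_lines(X):
    """Fit X[i][k] ~ alpha_i + beta_i * theta_k (rows = labs, columns = metabolites).

    Assumed constraints: mean(alpha) = 0 and mean(beta) = 1, so the
    column means serve as the estimates of theta.
    """
    n_labs, n_met = len(X), len(X[0])
    theta = [sum(X[i][k] for i in range(n_labs)) / n_labs for k in range(n_met)]
    t_bar = sum(theta) / n_met
    sxx = sum((t - t_bar) ** 2 for t in theta)

    alphas, betas = [], []
    for row in X:
        r_bar = sum(row) / n_met
        sxy = sum((t - t_bar) * (x - r_bar) for t, x in zip(theta, row))
        beta = sxy / sxx                 # OLS slope for this lab
        betas.append(beta)
        alphas.append(r_bar - beta * t_bar)  # OLS intercept for this lab
    return alphas, betas, theta


# Noise-free illustrative data: 3 labs, 4 metabolites
true_alpha = [-1.0, 0.0, 1.0]
true_beta = [0.9, 1.0, 1.1]
true_theta = [10.0, 20.0, 30.0, 40.0]
X = [[a + b * t for t in true_theta] for a, b in zip(true_alpha, true_beta)]

alphas, betas, theta = fit_bundle_of_lines(X)
print(alphas, betas, theta)
```

Because the fitted slopes average to 1 and the intercepts to 0 by construction, the lab coefficients are only defined relative to the consensus of the participating labs, which matches the remark that one cannot decide which lab is best.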

24

25

26 Statistical issues
• Many missing values
• Outliers
• Metabolite abundances are not normally distributed
• n < p (fewer samples than metabolites)
• Correlated metabolites

27 Statistical Issues
• PCA or ICA
• Partial Least Squares
• Clustering
• Random Forest, SVM
• rSVD

28 Statistical issues: dealing with missing values
• Replacing missing values with 0 is not necessarily a good idea, since the underlying abundance is not truly 0.
• Alternatives: the minimum, half the minimum, or a draw from uniform(0, minimum)
• Random forest imputation
• Observing the conditional distribution (Dr. Young Truong at UNC)

29 Statistical Issues: prediction and classification
• Partial least squares
• Random Forest
• SVM
• Neural networks

30 Statistical Issues: identifying relationships
• MDS
• Clustering
• rSVD (PowerMV from NISS)
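The three simple replacement rules listed above (minimum, half-minimum, uniform(0, minimum)) can be sketched for a single metabolite as follows. This is an illustration only: the function name, the use of `None` to mark a missing value, and the example abundances are all assumptions, and the minimum is taken over the observed values of that one metabolite.

```python
import random


def impute_below_detection(values, method="half_min", rng=None):
    """Replace None entries using the smallest observed value as a reference."""
    if method not in ("min", "half_min", "uniform"):
        raise ValueError(f"unknown method: {method}")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    floor = min(observed)
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible

    def fill():
        if method == "min":
            return floor
        if method == "half_min":
            return floor / 2.0
        return rng.uniform(0.0, floor)  # random draw between 0 and the minimum

    return [fill() if v is None else v for v in values]


# Hypothetical abundances for one metabolite; None marks a missing value
abundances = [12.0, None, 4.0, 30.0, None]

by_min = impute_below_detection(abundances, "min")
by_half = impute_below_detection(abundances, "half_min")
by_unif = impute_below_detection(abundances, "uniform")
print(by_min, by_half, by_unif)
```

The first two rules put every imputed value at the same point, which creates ties and shrinks the variance; the uniform draw avoids the ties at the cost of adding randomness to the analysis.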

31 ALS metabolomic data set We had abundance data on 317 metabolites from 63 subjects. Of these, 32 were healthy, 22 had ALS but were not on medication, and 9 had ALS and were taking medication. The goal was to classify the two ALS groups and the healthy group. Here p>n. Also, some abundances were below detectability.

32 ALS metabolomic data set Using the Breiman-Cutler code for Random Forests, the out-of-bag error rate was 7.94% (5 of 63 subjects misclassified); 29 of the 31 ALS patients and 29 of the 32 healthy subjects were correctly classified. Of the 317 metabolites, 20 were important in the classification, and three were dominant. RF can also detect outliers via proximity scores; four were found.

33 ALS Metabolomic data set Several support vector machine approaches were tried on this data:
• Linear SVM
• Polynomial SVM
• Gaussian SVM
• L1 SVM (Bradley and Mangasarian, 1998)
• SCAD SVM (Fan and Li, 2000)
The SCAD SVM had the best leave-one-out (LOO) error rate, 14.3%.

34 ALS Metabolomic data set Robust SVD (Liu et al., 2003) is used to simultaneously cluster patients (rows) and metabolites (columns). Given the patient-by-metabolite matrix $X$, one writes $X_{ik} = r_i c_k + \varepsilon_{ik}$, where $r_i$ and $c_k$ are row and column effects. Then one can sort the array by the effect magnitudes.

35 ALS metabolomic data set To compute an rSVD, use alternating L1 regression, without an intercept, to estimate the row and column effects: first fit the row effects as a function of the column effects, then reverse the roles, and repeat. Robustness comes from using L1 regression rather than OLS. Applying the same procedure to the residuals gives the second singular-value solution.
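As a quick arithmetic check, the reported out-of-bag error rate can be reproduced from the counts given in the slides, treating the two ALS groups (22 unmedicated, 9 medicated) as a single ALS class of 31. The per-class rates are derived here for illustration and are not reported in the slides.

```python
def error_rate(n_correct, n_total):
    """Fraction of subjects misclassified."""
    return 1.0 - n_correct / n_total


# Counts taken from the slides
group_sizes = {"ALS": 22 + 9, "healthy": 32}
correct = {"ALS": 29, "healthy": 29}

overall = error_rate(sum(correct.values()), sum(group_sizes.values()))
per_class = {g: error_rate(correct[g], group_sizes[g]) for g in group_sizes}

print(f"overall OOB error: {overall:.2%}")  # prints 7.94%
for group, rate in per_class.items():
    print(f"{group}: {rate:.2%}")
```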
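The alternating scheme described above can be sketched for the first (rank-one) component as follows. This is an illustration, not the Liu et al. implementation: the starting values (column medians), the fixed number of iterations, and all function names are assumptions. It relies on the fact that an L1 regression through the origin, minimizing Σ|y_j − b·x_j|, is solved by a weighted median of the ratios y_j/x_j with weights |x_j|.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def l1_slope(y, x):
    """Slope b minimizing sum |y_j - b * x_j| (L1 regression, no intercept)."""
    ratios = [yj / xj for yj, xj in zip(y, x) if xj != 0]
    weights = [abs(xj) for xj in x if xj != 0]
    if not ratios:
        return 0.0
    return weighted_median(ratios, weights)


def robust_rank_one(X, n_iter=10):
    """Alternating L1 fits of X[i][k] ~ r[i] * c[k]."""
    n_rows, n_cols = len(X), len(X[0])
    cols = [[X[i][k] for i in range(n_rows)] for k in range(n_cols)]
    c = [sorted(col)[n_rows // 2] for col in cols]  # assumed start: column medians
    r = [0.0] * n_rows
    for _ in range(n_iter):
        r = [l1_slope(X[i], c) for i in range(n_rows)]     # rows given columns
        c = [l1_slope(cols[k], r) for k in range(n_cols)]  # columns given rows
    return r, c


# Illustrative rank-one matrix with one gross outlier in the top-left cell
effects = [1.0, 2.0, 3.0, 4.0, 5.0]
X = [[ri * ck for ck in effects] for ri in effects]
X[0][0] = 100.0

r, c = robust_rank_one(X)
residual = X[0][0] - r[0] * c[0]
print(r[0] * c[0], residual)
```

Only the products r_i·c_k are identified, so the row and column effects come out on an arbitrary scale. In this example the fit reproduces the clean rank-one structure and leaves the outlier as a large residual, whereas a least-squares fit would be pulled toward it.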

36

37 NCI data set
• NCI 60 cell lines
• 9 cancer types: breast, CNS, colon, melanoma, renal, leukemia, prostate, ovarian, lung
• GC-LS
• Melanoma vs CNS (8 cell lines for melanoma and 6 cell lines for CNS)

38 Variable Importance using RF

39 Component 1 versus 2

40 Useful websites
• Deconvolution of peaks: AMDIS software (NIST, Gaithersburg, USA)
• Human Metabolome Database
• KEGG
• Many, many others

41 Concluding Remarks
• Many interesting statistical issues in analyzing metabolomic data remain open.
• Measurement issues and interlaboratory differences need to be properly addressed.
• Metabolomics is an important part of understanding systems biology.