1 Statistics in Metabolomics David Banks ISDS Duke University.

Presentation transcript:

1 Statistics in Metabolomics David Banks ISDS Duke University

2 1. Background. Metabolomics is the next step after genomics and proteomics. There are about 25,000 genes, most of which have unknown functions. There are about 1,000,000 proteins, most of which are unstudied.

3

4 In contrast to the other *omics areas: There are only about 900 main metabolites, and we know their chemical structures. Also, we know (pretty well) the biochemical pathways that determine their production rates. Metabolites are low-molecular-weight compounds produced in the course of processing raw materials.

5 Some common metabolites include: cholesterol; glucose, sucrose, fructose; amino acids; lactic acid, uric acid; ATP, ADP; drug metabolites, legal and illegal. These are produced in metabolic pathways, such as the Krebs (citrate) cycle for oxidation of glucose.

6

7 These pathways contain important information about the amount of each metabolite: Stoichiometric equations show how much material is produced in a given reaction; i.e., mass balance. Rate equations govern the speed at which reactions take place, and the location of the Gibbs equilibrium. This gives metabolomics an edge.

8 [Figure: a biochemical profile mapped to metabolic pathways.]

9 The purposes of metabolomics are: Early detection of disease, such as necrosis, ALS, Alzheimer’s, and infection or inflammation. Assessment of toxicity (especially liver toxicity) in new drugs. Diet strategies, drug testing. Elucidating biochemical pathways. There is less raw information than for other *omics, but more context.

10 2. Measurement Issues. To obtain data, a tissue sample is taken from a patient. Then: The sample is prepped and put onto wells on a silicon plate. Each well’s aliquot is subjected to gas and/or liquid chromatography. After separation, the sample goes to a mass spectrometer.

11 The sample prep involves stabilizing the sample, adding spiked-in calibrants, and creating multiple aliquots (some are frozen) for QC purposes. This is roboticized. Sources of error in this step include: within-subject variation; within-tissue variation; contamination by cleaning solvents; calibrant uncertainty; evaporation of volatiles.

12 Gas chromatography creates an ionized aerosol, and each droplet evaporates to a single ion. This is separated by mass in the column, then ejected to the spectrometer. Sources of error in this step include: imperfect evaporation; adhesion in the column; ion fragmentation or adduct formation.

13 The Fourier mass spectrometer determines the mass-to-charge ratio of the ion from the field strength required to keep the ion spinning in a circle. This avoids the entry-time uncertainty in TOF machines, so the only main error is uncertainty about the field strength. Some laboratories use MALDI-TOF equipment, and the error sources are slightly different.

14

15 The result of this is a set of m/z ratios and timestamps for each ion, which can be viewed as a 2-D histogram in the m/z × time plane. One now estimates the amount of each metabolite. This entails normalization, which also introduces error. The caveats pointed out in Baggerly et al. (Proteomics, 2003) apply.

16

17 3. Statistical Problems. Understanding the uncertainty budget in metabolomic data, which entails both quality control and cross-platform comparisons. Identifying the peaks in the m/z × t plane, and estimating the quantity of specific metabolites. Finding markers for disease or toxicity, or measuring change.

18 Uncertainty. The classical NIST approach to this is to: build a model for the error terms; do a designed experiment with replicated measurements; fit a measurement equation to the data. See Cameron, “Error Analysis,” ESS Vol. 9, 1982.

19 Let z be the vector of raw data, and let x be the estimates. Then the measurement equation is $G(z) = x = \mu + \varepsilon$, where $\mu$ is the vector of unknown true values and $\varepsilon$ is decomposable into separate components. For metabolite i, the estimate $X_i$ is $g_i(z) = \ln \sum_j w_{ij} \iint sm(z) - c(m,t) \, dm \, dt$.

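20 The law of propagation of error (this is essentially the delta method) says that the variance in X is about $\sum_{i=1}^{n} (\partial g / \partial z_i)^2 \, \mathrm{Var}[z_i] + \sum_{i<k} 2 \, (\partial g / \partial z_i)(\partial g / \partial z_k) \, \mathrm{Cov}[z_i, z_k]$, where the second sum counts each pair of distinct components once. The weights depend upon the values of the spiked-in calibrants, so this gets complicated.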
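As a rough illustration of the propagation-of-error formula, the sketch below evaluates the first-order variance for a toy function. The function g(z) = ln(z1 + z2), the evaluation point, and the covariance matrix are all hypothetical and are not taken from the talk; the cross terms are summed once per unordered pair i < k, each with a factor of 2.

```python
def propagate_variance(grad, cov):
    """First-order (delta-method) variance of g(z).

    grad: partial derivatives of g with respect to each z_i, evaluated at z.
    cov:  covariance matrix of z, as a list of lists.
    """
    n = len(grad)
    # Variance terms: (dg/dz_i)^2 * Var[z_i]
    var = sum(grad[i] ** 2 * cov[i][i] for i in range(n))
    # Covariance terms: 2 * (dg/dz_i)(dg/dz_k) * Cov[z_i, z_k], one per pair i < k
    var += sum(2 * grad[i] * grad[k] * cov[i][k]
               for i in range(n) for k in range(i + 1, n))
    return var


# Hypothetical example: g(z) = ln(z1 + z2) at z = (3, 1).
z = [3.0, 1.0]
grad = [1.0 / sum(z), 1.0 / sum(z)]  # dg/dz_i = 1 / (z1 + z2)
cov = [[0.04, 0.01],
       [0.01, 0.09]]                 # assumed covariance of the raw data
approx_var = propagate_variance(grad, cov)
print(approx_var)
```

In a real uncertainty budget the gradient would come from the measurement equation and the covariance from replicated measurements; both are simply assumed here.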

21 Cross-platform experiments are also crucial for medical use. This leads to key comparison designs. Here the same sample (or aliquots of a standard solution or sample) is sent to multiple labs. Each lab produces its spectrogram. It is impossible to decide which lab is best, but one can estimate how to adjust for interlab differences.

22 The Mandel bundle-of-lines model is what we suggest for interlaboratory comparisons. This assumes: $X_{ik} = \alpha_i + \beta_i \theta_k + \varepsilon_{ik}$, where $X_{ik}$ is the estimate at lab i for metabolite k, $\theta_k$ is the unknown true quantity of metabolite k, and $\varepsilon_{ik} \sim N(0, \sigma_{ik}^2)$.

23 To solve the equations given values from the labs, one must impose constraints. A Bayesian can put priors on the laboratory coefficients and the error variance. Metabolomics needs a multivariate version, with models for the rates at which compounds volatilize. We plan to use this model to compare the Metabolon lab in RTP to Chris Newgard’s lab at Duke.
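One non-Bayesian way to see why constraints are needed is to fit the bundle-of-lines model by alternating least squares. The sketch below is only an illustration: the constraint choice (lab intercepts averaging 0, lab slopes averaging 1), the function name `fit_bundle_of_lines`, and the noiseless example data are assumptions, not the method used in the talk. Without some such constraint, any affine rescaling of the θ values could be absorbed into the lab coefficients.

```python
from statistics import mean


def fit_bundle_of_lines(X, n_iter=50):
    """Alternating least squares for X[i][k] ~ alpha[i] + beta[i] * theta[k].

    Identifiability constraints (an assumption of this sketch):
    mean(alpha) = 0 and mean(beta) = 1.
    Assumes theta is not constant and mean(beta) is nonzero.
    """
    n_labs, n_met = len(X), len(X[0])
    # Start from the across-lab mean for each metabolite.
    theta = [mean(X[i][k] for i in range(n_labs)) for k in range(n_met)]
    for _ in range(n_iter):
        # Step 1: ordinary least-squares line for each lab, given theta.
        t_bar = mean(theta)
        s_tt = sum((t - t_bar) ** 2 for t in theta)
        alpha, beta = [], []
        for row in X:
            x_bar = mean(row)
            b = sum((t - t_bar) * (x - x_bar) for t, x in zip(theta, row)) / s_tt
            beta.append(b)
            alpha.append(x_bar - b * t_bar)
        # Step 2: least-squares update of each theta_k, given the lab lines.
        s_bb = sum(b * b for b in beta)
        theta = [sum(beta[i] * (X[i][k] - alpha[i]) for i in range(n_labs)) / s_bb
                 for k in range(n_met)]
        # Step 3: re-impose the constraints without changing the fitted values.
        a_bar, b_bar = mean(alpha), mean(beta)
        theta = [a_bar + b_bar * t for t in theta]
        beta = [b / b_bar for b in beta]
        alpha = [a - b * a_bar for a, b in zip(alpha, beta)]
    return alpha, beta, theta


# Hypothetical noiseless data: 3 labs, 4 metabolites.
true_alpha = [0.2, 0.5, 1.1]
true_beta = [0.8, 1.0, 1.5]
true_theta = [1.0, 2.0, 4.0, 7.0]
X = [[a + b * t for t in true_theta] for a, b in zip(true_alpha, true_beta)]

alpha, beta, theta = fit_bundle_of_lines(X)
max_resid = max(abs(X[i][k] - (alpha[i] + beta[i] * theta[k]))
                for i in range(len(X)) for k in range(len(X[0])))
print(max_resid)
```

The recovered θ values are on the consensus scale fixed by the constraints, not on the scale of `true_theta`; that is the sense in which one can adjust for interlab differences without deciding which lab is best.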

24 Peak Identification. A classic problem in proteomics is to locate peaks and estimate their area or volume. Unlike in proteomics, metabolite peak locations are mostly known. So Bayesian methods seem good (cf. Clyde and House). Metabolon uses proprietary software.

25

26

27

28 Data Mining. Different tools are appropriate for different kinds of metabolomic studies. The work we have done focuses on: Random Forests; Support Vector Machines; Robust Singular Value Decomposition.

29 We had abundance data on 317 metabolites from 63 subjects. Of these, 32 were healthy, 22 had ALS but were not on medication, and 9 had ALS and were taking medication. The goal was to classify the two ALS groups and the healthy group. Here p>n. Also, some abundances were below detectability.

30 Using the Breiman-Cutler code for Random Forests, the out-of-bag error rate was 7.94%; 29 of the ALS patients and 29 of the healthy patients were correctly classified. 20 of the 317 metabolites were important in the classification, and three were dominant. RF can detect outliers via proximity scores. There were four such.

31 Several support vector machine approaches were tried on this data: Linear SVM; Polynomial SVM; Gaussian SVM; L1 SVM (Bradley and Mangasarian, 1998); SCAD SVM (Fan and Li, 2000). The SCAD SVM had the best leave-one-out (LOO) error rate, 14.3%.

32 The L1 SVM attempts to mimic the automatic variable selection in the LASSO (Tibshirani, 1996) by solving the programming problem: $\min_{b,w} \sum_{i=1}^{n} [1 - y_i (b + w^T x_i)]_+ + \lambda \sum_{k=1}^{p} |w_k|$, where the first sum is over the n subjects and the second is over the p variables. SCAD replaces the L1 penalty with a nonconvex penalty.
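To make the two pieces of the L1 SVM objective concrete, the following sketch evaluates the hinge loss plus the L1 penalty for a given (b, w). It only evaluates the objective and does not solve the optimization problem; the function name `l1_svm_objective` and the tiny data set are hypothetical.

```python
def l1_svm_objective(b, w, X, y, lam):
    """Hinge loss summed over subjects plus lam times the L1 norm of w.

    X: list of feature vectors, y: labels in {-1, +1}.
    """
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = b + sum(wk * xk for wk, xk in zip(w, xi))
        hinge += max(0.0, 1.0 - yi * score)  # [1 - y_i (b + w^T x_i)]_+
    penalty = lam * sum(abs(wk) for wk in w)
    return hinge + penalty


# Hypothetical data: three subjects, two metabolites.
X = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
y = [1, 1, -1]
objective = l1_svm_objective(b=0.0, w=[1.0, 0.0], X=X, y=y, lam=0.5)
print(objective)  # hinge terms 0 + 1 + 0, penalty 0.5
```

Because the penalty is a sum of absolute values, minimizers tend to set some weights exactly to zero, which is the variable-selection behavior the slide refers to.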

33 The SCAD SVM selected 18 of the metabolites as being important; the L1 SVM selected 32. This suggests that the automatic variable selection in L1 SVM is not very effective. A further multiple tree analysis with FIRMPlus™ software from the GoldenHelix Co. did not achieve good classification. So Random Forests wins. And the selected metabolites make sense.

34 Robust SVD (Liu et al., 2003) is used to simultaneously cluster patients (rows) and metabolites (columns). Given the patient by metabolite matrix X, one writes $X_{ik} = r_i c_k + \varepsilon_{ik}$, where $r_i$ and $c_k$ are row and column effects. Then one can sort the array by the effect magnitudes.

35 To do an rSVD, use alternating L1 regression, without an intercept, to estimate the row and column effects. First fit the row effect as a function of the column effect, and then reverse. Robustness stems from not using OLS. Doing similar work on the residuals gives the second singular value solution.
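The alternating scheme can be sketched in a few lines, using the fact that an L1 regression through the origin on a single predictor reduces to a weighted median of ratios. This is a minimal illustration under assumed details (all-ones starting values, a fixed number of sweeps, a hypothetical 3 × 3 matrix with one planted outlier); it is not the implementation of Liu et al.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2.0
    running = 0.0
    for v, w in sorted(zip(values, weights)):
        running += w
        if running >= half:
            return v


def l1_slope(y, x):
    """Slope s minimizing sum_j |y_j - s * x_j| (L1 regression, no intercept).

    Equivalent to the weighted median of y_j / x_j with weights |x_j|.
    Assumes at least one x_j is nonzero.
    """
    ratios = [yj / xj for yj, xj in zip(y, x) if xj != 0]
    weights = [abs(xj) for xj in x if xj != 0]
    return weighted_median(ratios, weights)


def robust_rank_one(X, n_iter=20):
    """Alternate L1 fits of row effects r and column effects c for X ~ r_i * c_k."""
    n_rows, n_cols = len(X), len(X[0])
    c = [1.0] * n_cols  # assumed starting values
    for _ in range(n_iter):
        r = [l1_slope(X[i], c) for i in range(n_rows)]
        c = [l1_slope([X[i][k] for i in range(n_rows)], r) for k in range(n_cols)]
    return r, c


# Hypothetical rank-one matrix (rows 1, 2, 3 times columns 2, 1, 4) with one outlier.
X = [[50.0, 1.0, 4.0],   # the clean value in the first cell would be 2.0
     [4.0, 2.0, 8.0],
     [6.0, 3.0, 12.0]]
r, c = robust_rank_one(X)
fitted = [[ri * ck for ck in c] for ri in r]
print(fitted[0][0], X[0][0] - fitted[0][0])
```

Only the products r_i c_k are determined; the overall scale can be moved between r and c, and normalizing both vectors would give the analogue of a singular value. Repeating the procedure on the residual matrix would give the second component.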

36

37 Preterm Labor. The NIH wanted to decide whether amniotic fluid samples from women in preterm labor could support classification into: term delivery; preterm delivery with inflammation; preterm delivery without inflammation.

38 The analysis had samples from 113 women in preterm labor. We tried all of the usual classification methods. As before, Random Forests gave the best results. The various SVMs were about 5-10% less predictive. The main information was contained in amino acids and carbohydrates.

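39 [Table: confusion matrix of true class (Term, Inflamm., No Inf.) against predicted class (Term, Inflamm., No Inf.); the cell counts did not survive in this transcript.] RF accuracy was 100/113 = 88.5%.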

40 For those with term delivery, amino acids were low, carbohydrates were high. For those who had preterm delivery without inflammation, both amino acids and carbohydrates were low. For those who had inflammation, the carbohydrates were very low and the amino acids were high.

41 My collaborators in this research are: Chris Beecher, Metabolon, Inc.; Adele Cutler, USU; Leanna House, Duke University; Jackie Hughes-Oliver, NCSU; Xiadong Lin, U. of Cincinnati; Susan Simmons, UNC-Wilmington; Young Truong, UNC-Chapel Hill; Stan Young, NISS.