Mining the Unknown: A Systems Approach to Metabolite Identification Combining Genetic and Metabolic Information PLoS Genetics 2012;8(10)



GWAS
– GWAS on metabolic quantitative traits uncover the genetically determined metabolic individuality in the general population.
– Many of the identified genetic variants associate with levels of specific metabolites and lie within or in close proximity to metabolic enzymes or transporters with known disease or pharmaceutical relevance.
– Moreover, compared to GWAS with clinical endpoints, the effect sizes of the genotypes are exceptionally high.

GWAS and metabolites
Metabolomics techniques used:
– targeted mass spectrometry (MS)-based approach
– untargeted nuclear magnetic resonance (NMR)-based approach
– untargeted MS-based approach

GWAS and metabolites
– Previous GWAS focused on metabolic features with known identity.
– Untargeted metabolomics approaches also provide quantifications of so-called "unknown metabolites".
– Unknown metabolite: a small molecule whose chemical identity has not been identified.

GWAS and metabolites
– LC-MS unknown: a specific retention time, one or multiple masses (e.g. from adducts), and a characteristic fragmentation pattern of the primary ion(s)
– NMR spectroscopy unknown: a pattern in the chemical shifts
– Unknowns can be previously undocumented small molecules, such as rare xenobiotics or secondary products of metabolism
– Unknowns can also be molecules from established pathways that could not be assigned using current libraries of MS fragmentation patterns or NMR reference spectra

GWAS and metabolites
Approaches to identifying unknowns:
– A graph-theoretical approach elucidates structural information about an unknown metabolite by attempting to reconstruct the underlying fragmentation tree from mass spectra at varying collision energies.
– Other approaches exclude false candidates for a given unknown by comparing observed and predicted chromatography retention times, or by automatically determining sum formulas from isotope distributions.
– Others integrate public metabolic pathway information with correlating peak pairs in order to facilitate metabolite identification.

GWAS and metabolites
– Methods that rely on mass spectra might not be applicable to high-throughput metabolomics datasets produced in a fee-for-service manner, since the mass spectra as such might not be readily available.
– Proposed: a novel functional metabolomics method that predicts the identities of unknown metabolites within a systems biology framework, combining high-throughput genotyping data, metabolomics data, and literature-derived metabolic pathway information.

GWAS and metabolites
The concept of our approach: GWAS with metabolic traits
– reveal functional relationships between genetic loci encoding metabolic enzymes and metabolite concentration levels in the blood
– a genetic variant can, for instance, alter the expression levels of mRNAs or affect the properties of the respective enzymes (e.g. enzyme activity, substrate specificity) through changes in the protein sequence

GWAS and metabolites
– Gaussian graphical models (GGMs), based on partial correlation coefficients, identify biochemically related metabolites from high-throughput metabolomics data.
– If an unknown compound displays a statistical association with a genetic locus in a GWAS or with a known metabolite in a GGM, that association provides specific information about where the compound is located in the metabolic network.

GWAS and metabolites
1. Conduct a full genome-wide association study of genotyped SNPs against concentrations of known and unknown metabolites.
2. Compute a Gaussian graphical model including both known and unknown metabolites.
3. Integrate the results of the GWAS and GGM computations and combine them with metabolic pathway information from public databases to derive predictions for unknown metabolites.

GWAS
– Study on a German population cohort (n = 1768)
– Metabolic profiling: UPLC and GC coupled with tandem mass spectrometry
– Genotyping: Affymetrix GeneChip array 6.0
– SNP filters: call rate > 95%, Hardy-Weinberg equilibrium p-value p(HWE) > 10^-6, MAF > 1%
– 655,658 SNPs

GWAS
– To avoid spurious false-positive associations, exclude metabolic traits with fewer than 300 non-missing values, and exclude data points of metabolic traits that lie more than 3 standard deviations from the mean.
– Genotypes are represented by 0, 1, and 2 for major-allele homozygous, heterozygous, and minor-allele homozygous.
– A linear model tests for association between a SNP and a metabolite, assuming an additive mode of inheritance.
– Statistical tests: PLINK software, with age and gender as covariates.
– Significance: p-values < 1.6×10^-10, based on a conservative Bonferroni correction.
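
The additive test described on this slide can be sketched in a few lines. This is a minimal illustration, not the PLINK implementation: the function names `additive_snp_test` and `bonferroni_threshold` are hypothetical, and the p-value uses a large-sample normal approximation to the t-test, which is an assumption made here to keep the example dependency-light.

```python
import math
import numpy as np

def additive_snp_test(genotype, metabolite, covariates):
    """Regress a metabolite on SNP dosage (0/1/2) plus covariates.

    Returns (beta_snp, p_value). Hypothetical helper, not PLINK's API.
    """
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(metabolite, dtype=float)
    c = np.asarray(covariates, dtype=float).reshape(len(y), -1)
    # Design matrix: intercept, SNP dosage, then covariates (e.g. age, gender).
    X = np.column_stack([np.ones_like(y), g, c])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    z = beta[1] / math.sqrt(cov_beta[1, 1])
    # Assumption: normal approximation to the two-sided t-test (reasonable for large n).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta[1], p

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests
```

Coding the genotype as a single 0/1/2 column is what makes the model additive: each extra copy of the minor allele shifts the expected metabolite level by the same amount `beta_snp`.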

– 34 distinct loci reach the genome-wide significance level.
– 15 of the 34 loci associate with at least one unknown compound.
– At 12 of these 15 loci, an unknown compound constitutes the strongest association of all tested compounds.

GWAS
– Known metabolites: SNPs in or near enzymes associate with functionally related metabolites.
– Unknown metabolites: GWAS data can therefore be used to derive hypotheses on their potential identity.
– For instance, SNP rs296391, in close proximity to the SULT2A1 gene (whose product converts steroids and bile acids into water-soluble sulfate conjugates for excretion), strongly associates with the concentrations of the unknown metabolites X and X-11244 (p = 1.7×10^-43 and p = 2.1×10^-26, respectively).
– One may speculate that X and X-11244 are biochemically related to steroids, bile acids, or water-soluble sulfate conjugates.

Gaussian graphical modeling (GGM)
– Gaussian graphical models are induced by full-order partial correlation coefficients, i.e. pairwise correlations corrected against all remaining (n-2) variables (n here denotes the number of variables).
– They are based on linear regressions with multiple predictor variables.

GGM
– When two random variables X and Y are each regressed on the remaining variables in the dataset, the partial correlation coefficient between X and Y is given by the Pearson correlation of the residuals from both regressions.
– If the dataset contains more samples than variables, full-order partial correlations can be conveniently calculated by a matrix inversion operation.
– Significance cutoff: α = 0.05 with Bonferroni correction
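
The equivalence of the two routes (residual correlation and matrix inversion) can be checked numerically. The sketch below is illustrative only; the function names are hypothetical and the data layout (samples in rows, variables in columns) is an assumption.

```python
import numpy as np

def partial_corr_by_inversion(data):
    """Full-order partial correlations from the inverse correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    omega = np.linalg.inv(corr)          # requires more samples than variables
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def partial_corr_by_residuals(data, i, j):
    """Pearson correlation of the residuals of columns i and j,
    each regressed (with intercept) on all remaining columns."""
    n, m = data.shape
    others = [k for k in range(m) if k not in (i, j)]
    Z = np.column_stack([np.ones(n), data[:, others]])
    res_i = data[:, i] - Z @ np.linalg.lstsq(Z, data[:, i], rcond=None)[0]
    res_j = data[:, j] - Z @ np.linalg.lstsq(Z, data[:, j], rcond=None)[0]
    return np.corrcoef(res_i, res_j)[0, 1]
```

For any pair (i, j), the entry `partial_corr_by_inversion(data)[i, j]` should agree with `partial_corr_by_residuals(data, i, j)` up to floating-point error; the inversion route simply yields all pairs at once.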

GGM
– Age, gender and SNP effects are removed by adding the respective variables and SNP states to the data matrix.
– For each pair of variables, the GGM removes the effects of all remaining variables on their correlation, so adding a variable to the data matrix automatically removes its confounding effect on the correlations of all other variables.
– Age, gender and SNPs are not investigated as actual nodes; they are used only for the correction procedure.

– Only metabolite-metabolite edges in the network are considered.
– SNP states are coded as numerical values 0, 1 and 2, so that the linear regressions underlying the GGM correspond to an additive genetic model.
– Gender is a "dummy variable" in the linear regression model, taking only the values 1 (male) and 0 (female).
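
The correction procedure amounts to appending covariate columns before inverting, then discarding their rows and columns. A minimal sketch, assuming samples in rows; `metabolite_partial_corr` is a hypothetical name, not a function from the original analysis.

```python
import numpy as np

def metabolite_partial_corr(metabolites, covariates):
    """Partial correlations among metabolites, corrected for covariates.

    Covariates (e.g. age, gender coded 0/1, SNP dosages coded 0/1/2) are
    appended to the data matrix so their effects are regressed out, but only
    the metabolite-metabolite block is returned.
    """
    met = np.asarray(metabolites, dtype=float)
    cov = np.asarray(covariates, dtype=float).reshape(len(met), -1)
    m = met.shape[1]
    full = np.column_stack([met, cov])
    omega = np.linalg.inv(np.corrcoef(full, rowvar=False))
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    # Covariates are used for correction only, never as network nodes.
    return pcor[:m, :m]
```

Two metabolites that correlate only because both depend on the same SNP should show a sizeable raw correlation but a partial correlation near zero once the SNP column is included.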

GGM
– Focus: intrinsic relations between the measured metabolites, and associations between known and unknown metabolites
– Goal: reconstruct pathways involving directly related metabolites from cross-sectional blood serum metabolomics data
– Each known metabolite is annotated with a super-pathway (general metabolic class) and a sub-pathway (more specific metabolic pathway).
– A partial correlation was included in the model if it was significantly different from zero at α = 0.05 after Bonferroni correction, yielding a corrected significance level of 7.9×10^-7 and a corresponding absolute partial correlation cutoff ζ.

Computation of correlation network and Gaussian graphical model
– Let X = (x_kl) ∈ ℝ^{n×m} be the matrix of logarithmized metabolite concentrations (either measured data samples or computer-simulated steady states).
– n = number of samples, m = number of metabolites

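GGM
Computation of correlation network and Gaussian graphical model
– Standard Pearson product-moment correlation coefficients P = (ρ_ij) between metabolites are calculated as
$$\rho_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2 \, \sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}}$$
where x̄_i represents the mean value of metabolite i.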

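GGM
– The corresponding partial correlation coefficients (ζ_ij) can be calculated from the inverse of the correlation matrix, (ω_ij) = P^{-1}, as
$$\zeta_{ij} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}}$$
– The partial correlation value ζ_ij denotes the pairwise correlation of metabolites i and j corrected for the effects of all remaining metabolites.
– To assess the significance of partial correlations, p-values p(ζ_ij) were calculated using Fisher's z-transform, given here in its standard two-sided form:
$$z(\zeta_{ij}) = \frac{1}{2}\ln\frac{1+\zeta_{ij}}{1-\zeta_{ij}}, \qquad p(\zeta_{ij}) = 2\left(1 - \phi\left(\sqrt{n-(m-2)-3}\;\left|z(\zeta_{ij})\right|\right)\right)$$
where ϕ stands for the cumulative distribution function of the standard normal distribution.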

GGM
– Bonferroni correction is applied to account for multiple testing.
– For the given significance level after Bonferroni correction, the test on ζ_ij yields a minimum absolute partial correlation coefficient; all partial correlations smaller than the negative of this cutoff or larger than the cutoff are considered significant.
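
The link between a corrected significance level and a minimum absolute partial correlation can be made concrete by inverting Fisher's z-transform. The sketch below assumes a two-sided test with the usual standard error 1/sqrt(n - (m - 2) - 3) for a full-order partial correlation among m variables; both function names are hypothetical, and the slides do not state which variant was actually used.

```python
import math
from statistics import NormalDist

def fisher_z_pvalue(zeta, n_samples, n_variables):
    """Two-sided p-value for a full-order partial correlation (assumed form)."""
    z = math.atanh(zeta)  # Fisher's z-transform: 0.5 * ln((1 + zeta) / (1 - zeta))
    # A full-order partial correlation conditions on the other (n_variables - 2) variables.
    scale = math.sqrt(n_samples - (n_variables - 2) - 3)
    # erfc gives an accurate two-sided normal tail, even for very small p-values.
    return math.erfc(abs(z) * scale / math.sqrt(2.0))

def min_abs_partial_corr(alpha_corrected, n_samples, n_variables):
    """Smallest |zeta| that is significant at the corrected level."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha_corrected / 2.0)
    scale = math.sqrt(n_samples - (n_variables - 2) - 3)
    return math.tanh(z_crit / scale)
```

Because the transform is monotone, thresholding p-values at the corrected level and thresholding |ζ| at `min_abs_partial_corr(...)` select the same set of edges.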

GGM
Network modularity calculation
– Define the adjacency matrix (ξ_ij) of a new unweighted, undirected graph induced by all significantly positive partial correlations in (ζ_ij): ξ_ij = 1 if ζ_ij is positive and significant at level α, and ξ_ij = 0 otherwise, where α represents the significance level after multiple testing correction.
– Let (V_1, ..., V_6) be the partitioning of the metabolites into the six metabolite classes: acyl-carnitines, diacyl-PCs, lyso-PCs, acyl-alkyl-PCs, sphingomyelins and amino acids.

– The relative out-degree R = (R_ij) ∈ ℝ^{6×6} from each class to the other classes is calculated, i.e. the proportion of its edges that each class shares with the other classes.

– The modularity Q compares within-class edges with the edges to the rest of the network: the more edges there are within each class in comparison to the other classes, the higher Q will be.
– Randomization: randomly pick two edges from the network and exchange the target nodes of each edge. To achieve sufficient randomization, this operation is repeated 5·e times, where e represents the number of edges in the graph.
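
The edge-swapping randomization can be sketched as follows. The slides do not say how self-loops or duplicate edges are handled, so rejecting such swaps is an assumption made here; `rewire_edges` and `node_degrees` are hypothetical names.

```python
import random
from collections import Counter

def node_degrees(edges):
    """Degree of every node in an undirected edge list."""
    return Counter(node for edge in edges for node in edge)

def rewire_edges(edges, rng, swaps_per_edge=5):
    """Randomize a graph by exchanging the target nodes of random edge pairs.

    The swap is attempted swaps_per_edge * e times (5 * e by default).
    """
    edges = [tuple(e) for e in edges]
    if len(edges) < 2:
        return edges
    present = {frozenset(e) for e in edges}
    for _ in range(swaps_per_edge * len(edges)):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_i, new_j = (a, d), (c, b)
        # Assumption (not stated in the slides): skip swaps that would create
        # a self-loop or an edge that already exists.
        if a == d or c == b:
            continue
        if frozenset(new_i) in present or frozenset(new_j) in present:
            continue
        present -= {frozenset(edges[i]), frozenset(edges[j])}
        present |= {frozenset(new_i), frozenset(new_j)}
        edges[i], edges[j] = new_i, new_j
    return edges
```

Each swap leaves every node's degree unchanged, so the randomized graphs differ from the original only in which nodes are connected, which is the property a modularity comparison against random networks relies on.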

Combining GGMs and GWAS
– Integrate the GGM and GWAS approaches with general pathway information from external databases to generate concrete predictions of the unknowns' metabolic pathway memberships.

Combining GGMs and GWAS
– For unknowns that did not have a known metabolite neighbor in the GGM, the 2- and 3-neighborhoods were investigated; these hits represent weaker evidence than a direct GGM neighbor.
– Functional annotations come from three sources:
1. the sub-pathway assignment provided for each known metabolite in the GGM neighborhood
2. GO functional terms for the associated gene of each genome-wide significant GWAS hit
3. KEGG pathways on which the associated genes lie
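
The neighborhood lookup can be illustrated with a breadth-first search over the GGM edges. This is a sketch under assumptions: the network is given as an adjacency dictionary, `known_neighbors_by_distance` is a hypothetical helper, and the caller supplies the test for whether a node is a known metabolite.

```python
from collections import deque

def known_neighbors_by_distance(adjacency, start, is_known, max_dist=3):
    """Breadth-first search from an unknown metabolite.

    Returns {distance: sorted list of known metabolites first reached at
    that distance}, for distances 1..max_dist.
    """
    seen = {start}
    queue = deque([(start, 0)])
    hits = {}
    while queue:
        node, dist = queue.popleft()
        if dist == max_dist:
            continue  # do not expand beyond the largest neighborhood
        for neighbor in adjacency.get(node, ()):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if is_known(neighbor):
                hits.setdefault(dist + 1, []).append(neighbor)
            queue.append((neighbor, dist + 1))
    return {d: sorted(names) for d, names in hits.items()}
```

Keeping the distance with each hit preserves the distinction drawn on the slide: a known metabolite at distance 1 is stronger evidence than one found only in the 2- or 3-neighborhood.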

– No consistent mapping between the annotations from the different data sources is available for both metabolites and genes, so this step of the analysis is non-automatic.
– The different functional classes are interpreted manually to derive a single consensus pathway annotation.
– This yields 16 pathway predictions for unknowns with both GGM and GWAS evidence.

Experimental validation: STEROID scenario
– X-11244 is an unknown metabolite for which both GGM and GWAS data strongly indicate an identity related to steroid-hormone compounds.
– X-11244 is tightly linked via GGM edges to dehydroepiandrosterone sulfate and two other unknowns, which in turn connect to epiandrosterone sulfate and androsterone sulfate.
– X-11244 displays a highly significant genetic association with rs296391, which lies in strong LD with the SULT2A1 gene locus.
– Based on the GGM and GWAS results, X-11244 was hypothesized to be a steroid sulfate related to androstane.