Mining the Unknown: A Systems Approach to Metabolite Identification Combining Genetic and Metabolic Information PLoS Genetics 2012;8(10)

1 Mining the Unknown: A Systems Approach to Metabolite Identification Combining Genetic and Metabolic Information PLoS Genetics 2012;8(10)

2 GWAS
– GWAS on metabolic quantitative traits uncover genetically determined metabolic individuality in the general population.
– The associated variants lie within or in close proximity to genes encoding metabolic enzymes or transporters, often with known disease or pharmaceutical relevance, and associate with the levels of specific metabolites.
– Compared to GWAS with clinical endpoints, the effect sizes of the genotypes are exceptionally high.

3 GWAS and metabolites
Metabolomics techniques used:
– targeted mass spectrometry (MS)-based approach
– untargeted nuclear magnetic resonance (NMR)-based metabolomics techniques
– untargeted MS-based approach

4 GWAS and metabolites
– Previous GWAS focused on metabolic features with known identity.
– Untargeted metabolomics approaches also provide quantifications of so-called "unknown metabolites".
– Unknown metabolite: a small molecule whose chemical identity has not been identified.

5 GWAS and metabolites
– LC-MS unknown: a specific retention time, one or multiple masses (e.g. from adducts), and a characteristic fragmentation pattern of the primary ion(s)
– NMR spectroscopy unknown: a pattern in the chemical shifts
– Unknowns can be:
  – previously undocumented small molecules, such as rare xenobiotics or secondary products of metabolism
  – molecules from established pathways that could not be assigned using current libraries of MS fragmentation patterns or NMR reference spectra

6 GWAS and metabolites
Existing approaches to identifying unknowns:
– a graph-theoretical approach that elucidates structural information about an unknown metabolite by attempting to reconstruct the underlying fragmentation tree from mass spectra at varying collision energies
– excluding false candidates for a given unknown by comparing observed and predicted chromatography retention times, or by automatically determining sum formulas from isotope distributions
– integrating public metabolic pathway information with correlating peak pairs to facilitate metabolite identification

7 GWAS and metabolites
– These approaches might not be applicable to high-throughput metabolomics datasets produced in a fee-for-service manner, since the mass spectra as such might not be readily available.
– Here: a novel functional metabolomics method that predicts the identities of unknown metabolites using a systems biological framework.
– It combines high-throughput genotyping data, metabolomics data, and literature-derived metabolic pathway information.

8

9

10 GWAS and metabolites: the concept of our approach
GWAS with metabolic traits:
– reveal functional relationships between genetic loci encoding metabolic enzymes and metabolite concentration levels in the blood
– a genetic variant can, for instance, alter the expression levels of mRNAs or affect the properties of the respective enzymes through changes of the protein sequence (e.g. enzyme activity, substrate specificity)

11 GWAS and metabolites
GGMs based on partial correlation coefficients:
– identify biochemically related metabolites from high-throughput metabolomics data
– Hypothesis: if an unknown compound displays a statistical association with a genetic locus in a GWAS, or with a known metabolite in a GGM, this provides specific information on where the unknown is located in the metabolic network

12 GWAS and metabolites
1. Conduct a full genome-wide association study of genotyped SNPs with the concentrations of known and unknown metabolites.
2. Compute a Gaussian graphical model including both known and unknown metabolites.
3. Integrate the results of the GWAS and GGM computations and combine them with metabolic pathway information from public databases to derive predictions for unknown metabolites.

13 GWAS
– Study on a German population cohort (n = 1768)
– Metabolic profiling: UP-LC and GC coupled with tandem mass spectrometry
– Genotyping: Affymetrix GeneChip array 6.0
  – call rate > 95%
  – Hardy-Weinberg equilibrium p-value p(HWE) > 10^-6
  – MAF > 1%
  – 655,658 SNPs
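The genotype quality-control thresholds on this slide can be read as a simple per-SNP filter. The following is a minimal sketch, assuming per-SNP summary statistics are already available; the `SnpStats` record, the `passes_qc` helper, and the example SNP names are hypothetical and not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary record (not the study's data format)."""
    snp_id: str
    call_rate: float  # fraction of samples with a genotype call
    hwe_p: float      # Hardy-Weinberg equilibrium test p-value
    maf: float        # minor allele frequency


def passes_qc(s, min_call_rate=0.95, min_hwe_p=1e-6, min_maf=0.01):
    """Apply the three thresholds quoted on the slide."""
    return s.call_rate > min_call_rate and s.hwe_p > min_hwe_p and s.maf > min_maf


# Made-up example SNPs, each failing at most one criterion.
snps = [
    SnpStats("snp_a", 0.99, 0.40, 0.25),
    SnpStats("snp_b", 0.90, 0.40, 0.25),   # fails call rate
    SnpStats("snp_c", 0.99, 1e-9, 0.25),   # fails HWE
    SnpStats("snp_d", 0.99, 0.40, 0.005),  # fails MAF
]
kept = [s.snp_id for s in snps if passes_qc(s)]
print(kept)
```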

14 GWAS
– To avoid spurious false-positive associations:
  – exclude metabolic traits with fewer than 300 non-missing values
  – exclude data points of metabolic traits that lie more than 3 standard deviations from the mean
– Genotypes are coded 0, 1, and 2 for major-allele homozygous, heterozygous, and minor-allele homozygous
– A linear model tests for association between a SNP and a metabolite, assuming an additive mode of inheritance
– Statistical tests: PLINK software with age and gender as covariates
– Significance: p-values < 1.6×10^-10, based on a conservative Bonferroni correction
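The additive linear model on this slide can be illustrated for a single SNP-metabolite pair. This is a sketch with NumPy, not the PLINK implementation used in the study: the function name `additive_association`, the simulated data, and the normal approximation for the p-value are assumptions made for illustration.

```python
import math
import numpy as np

GENOME_WIDE_ALPHA = 1.6e-10  # Bonferroni-corrected threshold quoted on the slide


def additive_association(dosage, trait, covariates):
    """Regress a trait on allele dosage (0/1/2) plus covariates.

    Returns the dosage effect estimate and a two-sided p-value based on a
    normal approximation (an assumption that is reasonable for large n).
    """
    dosage = np.asarray(dosage, dtype=float)
    trait = np.asarray(trait, dtype=float)
    covariates = np.asarray(covariates, dtype=float)

    # Drop missing values and points more than 3 SD from the trait mean.
    keep = ~np.isnan(trait)
    mean, sd = trait[keep].mean(), trait[keep].std()
    keep &= np.abs(trait - mean) <= 3 * sd

    design = np.column_stack([np.ones(keep.sum()), dosage[keep], covariates[keep]])
    y = trait[keep]
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = design.shape[0] - design.shape[1]
    sigma2 = resid @ resid / dof
    se = math.sqrt(sigma2 * np.linalg.inv(design.T @ design)[1, 1])
    t_stat = coef[1] / se
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return coef[1], p_value


# Simulated data (not from the study): true dosage effect of 0.5.
rng = np.random.default_rng(0)
n = 500
dosage = rng.binomial(2, 0.3, size=n)
age = rng.uniform(30, 70, size=n)
sex = rng.integers(0, 2, size=n)
trait = 0.5 * dosage + 0.01 * age + 0.2 * sex + rng.normal(0, 0.5, size=n)

beta, p = additive_association(dosage, trait, np.column_stack([age, sex]))
print(beta, p, p < GENOME_WIDE_ALPHA)
```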

15
– 34 distinct loci reach the genome-wide significance level
– 15 of the 34 loci associate with at least one unknown compound
– For 12 of these 15 loci, an unknown compound constitutes the strongest association of all tested compounds

16 GWAS
– Known metabolites: SNPs in or near enzyme-coding genes associate with functionally related metabolites
– Unknown metabolites: GWAS data can therefore be used to derive hypotheses on their potential identity
  – For instance, SNP rs296391, in close proximity to the SULT2A1 gene (which converts steroids and bile acids into water-soluble sulfate conjugates for excretion), strongly associates with the concentrations of the unknown metabolites X-11440 and X-11244 (p = 1.7×10^-43 and p = 2.1×10^-26, respectively)
  – One may speculate that X-11440 and X-11244 are biochemically related to steroids, bile acids, or water-soluble sulfate conjugates

17 Gaussian graphical modeling (GGM)
– Gaussian graphical models are induced by full-order partial correlation coefficients, i.e. pairwise correlations corrected against all remaining (n-2) variables
– They are based on linear regressions with multiple predictor variables

18 GGM
– When two random variables X and Y are each regressed on the remaining variables in the data set, the partial correlation coefficient between X and Y is given by the Pearson correlation of the residuals from both regressions
– If the dataset contains more samples than variables, full-order partial correlations can be conveniently calculated by a matrix inversion operation
– Significance cutoff of α = 0.05 with Bonferroni correction
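The two routes to a full-order partial correlation mentioned on this slide (regression residuals and matrix inversion) give the same number, which a short NumPy sketch can show. The function names and the simulated data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def partial_corr_by_inversion(data):
    """Full-order partial correlations from the inverse correlation matrix."""
    precision = np.linalg.inv(np.corrcoef(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


def partial_corr_by_residuals(data, i, j):
    """Pearson correlation of the residuals after regressing columns i and j
    on all remaining columns (plus an intercept)."""
    others = [k for k in range(data.shape[1]) if k not in (i, j)]
    design = np.column_stack([np.ones(data.shape[0]), data[:, others]])
    res_i = data[:, i] - design @ np.linalg.lstsq(design, data[:, i], rcond=None)[0]
    res_j = data[:, j] - design @ np.linalg.lstsq(design, data[:, j], rcond=None)[0]
    return np.corrcoef(res_i, res_j)[0, 1]


# Simulated chain 0 -> 1 -> 2 plus two unrelated variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
data[:, 1] += data[:, 0]
data[:, 2] += data[:, 1]

by_inversion = partial_corr_by_inversion(data)
by_residuals = partial_corr_by_residuals(data, 0, 1)
print(by_inversion[0, 1], by_residuals)
```

The inversion route requires an invertible correlation matrix, which is why the slide restricts it to datasets with more samples than variables.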

19 GGM
– Age, gender and SNP effects are removed by adding the respective variables and SNP states to the data matrix.
– For each pair of variables, the GGM removes the effects of all remaining variables on their correlation.
– Adding a variable to the data matrix therefore automatically removes the confounding effects of this variable on the correlations of all other variables.
– Age, gender and SNPs are not investigated as actual nodes; they are only used for the correction procedure.

20
– Only metabolite-metabolite edges are considered in the network.
– SNP states are coded as numerical values of 0, 1 and 2, such that the linear regressions that underlie the GGM correspond to an additive genetic model.
– Gender is represented by a "dummy variable" in the linear regression model, which only takes the values 1 (male) and 0 (female).
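The correction procedure described on the last two slides can be sketched as follows: covariate columns are appended to the data matrix, partial correlations are computed on the augmented matrix, and only the metabolite-metabolite block is kept. The simulated data and all variable names are assumptions for illustration; two metabolites are driven by the same SNP and are otherwise independent.

```python
import numpy as np


def partial_correlations(data):
    """Full-order partial correlations via the inverse correlation matrix."""
    precision = np.linalg.inv(np.corrcoef(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Simulated data (not from the study).
rng = np.random.default_rng(1)
n = 1000
snp = rng.binomial(2, 0.4, size=n).astype(float)  # additive coding 0/1/2
age = rng.uniform(30, 70, size=n)
sex = rng.integers(0, 2, size=n).astype(float)    # dummy variable, 1 = male
met_a = 1.5 * snp + rng.normal(size=n)
met_b = 1.5 * snp + rng.normal(size=n)

metabolites = np.column_stack([met_a, met_b])
raw_corr = np.corrcoef(metabolites, rowvar=False)[0, 1]

# Covariates enter as extra columns and are dropped again afterwards,
# so they never appear as nodes in the network.
augmented = np.column_stack([metabolites, age, sex, snp])
corrected = partial_correlations(augmented)[:2, :2]
print(raw_corr, corrected[0, 1])
```

In this toy setting the raw correlation between the two metabolites is substantial, while the partial correlation after including the SNP column is close to zero.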

21 GGM
– Focus on intrinsic relations between the measured metabolites and on associations between known and unknown metabolites
– Reconstruct pathways involving directly related metabolites from cross-sectional blood serum metabolomics data
– Each known metabolite is annotated with:
  – a super-pathway: general metabolic class
  – a sub-pathway: more specific metabolic pathway
– A partial correlation is included in the model if it is significantly different from zero at α = 0.05 after Bonferroni correction, yielding a corrected significance level of 7.9×10^-7 and an absolute partial correlation cutoff of ζ = 0.178

22 Computation of correlation network and Gaussian graphical model
– Let $X = (x_{kl}) \in \mathbb{R}^{n \times m}$ be the matrix of logarithmized metabolite concentrations (either measured data samples or computer-simulated steady states)
– n = number of samples and m = number of metabolites

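23 GGM: Computation of correlation network and Gaussian graphical model
– Standard Pearson product-moment correlation coefficients $P = (\rho_{ij})$ between metabolites are calculated as
  $$\rho_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2 \, \sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}}$$
  where $\bar{x}_i$ represents the mean value of metabolite i.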

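24 GGM
– With $(\omega_{ij}) = P^{-1}$, the corresponding partial correlation coefficients can be calculated as
  $$\zeta_{ij} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}}$$
– The partial correlation value $\zeta_{ij}$ denotes the pairwise correlation of metabolites i and j corrected for the effects of all remaining metabolites
– To assess the significance of partial correlations, p-values $p(\zeta_{ij})$ were calculated using Fisher's z-transform:
  $$z(\zeta_{ij}) = \frac{1}{2} \ln\frac{1 + \zeta_{ij}}{1 - \zeta_{ij}}, \qquad p(\zeta_{ij}) = 2\left(1 - \phi\left(\sqrt{n - (m-2) - 3}\;\lvert z(\zeta_{ij})\rvert\right)\right)$$
  where $\phi$ stands for the cumulative distribution function of the standard normal distribution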

25 GGM
– Bonferroni correction is applied to account for multiple testing
– For the given significance level, the Bonferroni-corrected threshold corresponds to a minimum absolute partial correlation coefficient of 0.1619
– All partial correlations $\zeta_{ij}$ smaller than -0.1619 or larger than 0.1619 are considered significant
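The link between a Bonferroni-corrected significance level and a minimum absolute partial correlation can be sketched with the Fisher z-transform described on these slides. The code below assumes the standard large-sample form of the test for a full-order partial correlation with m variables and n samples, and one test per metabolite pair, i.e. m(m-1)/2 tests; both function names and the example values of n and m are hypothetical, so the resulting cutoff is not expected to reproduce the values quoted on the slides.

```python
import math
from statistics import NormalDist


def partial_corr_pvalue(zeta, n, m):
    """Two-sided p-value for a full-order partial correlation via Fisher's z.

    Assumes sqrt(n - (m - 2) - 3) * |z| is approximately standard normal.
    """
    stat = math.sqrt(n - (m - 2) - 3) * abs(math.atanh(zeta))
    return math.erfc(stat / math.sqrt(2))


def min_abs_partial_corr(alpha, n, m):
    """Smallest |zeta| that is significant at a Bonferroni-corrected alpha,
    assuming one test per variable pair."""
    n_tests = m * (m - 1) // 2
    z_crit = NormalDist().inv_cdf(1 - (alpha / n_tests) / 2)
    return math.tanh(z_crit / math.sqrt(n - (m - 2) - 3))


# Hypothetical sample and variable counts, chosen only for illustration.
n_samples, n_variables = 1000, 100
cutoff = min_abs_partial_corr(0.05, n_samples, n_variables)
print(cutoff, partial_corr_pvalue(cutoff, n_samples, n_variables))
```

By construction, the p-value evaluated at the cutoff equals the corrected significance level, so thresholding on |ζ| and thresholding on p are equivalent.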

26 GGM: Network modularity calculation
– Define the adjacency matrix $\xi_{ij}$ of a new unweighted, undirected graph induced by all significantly positive partial correlations in $\zeta_{ij}$:
  $$\xi_{ij} = \begin{cases} 1 & \text{if } \zeta_{ij} > 0 \text{ and } p(\zeta_{ij}) \le \alpha \\ 0 & \text{otherwise} \end{cases}$$
  where α represents the significance level after multiple testing correction
– Let $(V_1, \dots, V_6)$ be the partitioning of the metabolites into the six metabolite classes: acyl-carnitines, diacyl-PCs, lyso-PCs, acyl-alkyl-PCs, sphingomyelins and amino acids

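27
– The relative out-degree $R_{ij} \in \mathbb{R}^{6 \times 6}$ from each class to the other classes (i.e. the proportion of its edges each class shares with the other classes) is calculated as:
  $$R_{ij} = \frac{\sum_{k \in V_i} \sum_{l \in V_j} \xi_{kl}}{\sum_{k \in V_i} \sum_{l} \xi_{kl}}$$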
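One way to read the relative out-degree is as a row-normalised count of edge endpoints between classes. The sketch below follows that reading, which is an assumption based on the verbal description on the slide; the function name `relative_out_degree` and the four-node toy graph are hypothetical.

```python
import numpy as np


def relative_out_degree(adjacency, classes, n_classes):
    """For each class i, the share of its edge endpoints that lead to class j.

    adjacency: symmetric 0/1 matrix; classes: class index of each node.
    """
    adjacency = np.asarray(adjacency)
    classes = np.asarray(classes)
    R = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        rows = adjacency[classes == i]
        total = rows.sum()
        if total == 0:
            continue  # a class without edges keeps an all-zero row
        for j in range(n_classes):
            R[i, j] = rows[:, classes == j].sum() / total
    return R


# Toy path graph 0 - 1 - 2 - 3 with two classes {0, 1} and {2, 3}.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
R = relative_out_degree(adjacency, classes=[0, 0, 1, 1], n_classes=2)
print(R)
```

Each non-empty row of R sums to 1, and a large diagonal entry indicates that a class is mostly connected to itself.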

28
– The modularity Q compares within-class edges with the edges to the rest of the network: the more edges there are within each class in comparison to the other classes, the higher Q will be.
– Randomization: randomly pick two edges from the network and exchange the target nodes of each edge.
– To achieve sufficient randomization, this operation is repeated 5 · e times, where e represents the number of edges in the graph.
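The edge-exchange randomization described here can be sketched in plain Python. The slide does not say how self-loops or duplicate edges are handled, so skipping such swaps is an assumption of this sketch; as a result the loop performs 5 · e attempted swaps rather than 5 · e successful ones. The function names and the toy graph are hypothetical.

```python
import random
from collections import Counter


def rewire(edges, swaps_per_edge=5, seed=0):
    """Randomize a graph by repeatedly exchanging the target nodes of two edges.

    Swaps that would create a self-loop or a duplicate edge are skipped
    (an assumption; the source does not specify this).
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(swaps_per_edge * len(edges)):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in present or new_2 in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new_1, new_2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def degree_counts(edges):
    """Number of edges touching each node."""
    return Counter(node for edge in edges for node in edge)


# Toy graph: a ring of 10 nodes plus two chords.
original = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
randomized = rewire(original)
print(randomized)
```

Because each swap only exchanges endpoints, every node keeps its degree, so the randomized graphs differ from the original only in which nodes are connected.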

29

30

31 Combining GGMs and GWAS
– Integrate the GGM and GWAS approaches with general pathway information from external databases to generate concrete predictions for the unknowns' metabolic pathway memberships

32 Combining GGMs and GWAS
– For unknowns that did not have a known metabolite neighbor in the GGM:
  – the 2- and 3-neighborhoods were investigated
  – these hits certainly represent weaker evidence than a direct GGM neighbor
– Functional annotations come from three sources:
  1. the sub-pathway assignment provided for each known metabolite in the GGM neighborhood
  2. GO functional terms for the associated gene of all genome-wide significant GWAS hits
  3. KEGG pathways on which the associated genes lie
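The neighborhood search on this slide amounts to a breadth-first search from an unknown metabolite, limited to three steps. The sketch below is an illustration under assumptions: the function name, the adjacency-dictionary format, and the toy graph with nodes "U1" to "U3" and "K1", "K2" are all hypothetical, and the search is allowed to pass through any node, which the slide does not specify.

```python
from collections import deque


def known_neighbors_by_distance(graph, start, is_known, max_dist=3):
    """Breadth-first search from `start`, collecting known metabolites
    grouped by their network distance (1 = direct GGM neighbor)."""
    dist = {start: 0}
    queue = deque([start])
    hits = {}
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not expand beyond the maximum neighborhood
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
                if is_known(neighbor):
                    hits.setdefault(dist[neighbor], []).append(neighbor)
    return hits


# Toy GGM: names starting with "U" are unknowns, "K" are known metabolites.
ggm = {
    "U1": ["U2"],
    "U2": ["U1", "K1", "U3"],
    "U3": ["U2", "K2"],
    "K1": ["U2"],
    "K2": ["U3"],
}
hits = known_neighbors_by_distance(ggm, "U1", lambda name: name.startswith("K"))
print(hits)
```

Grouping the hits by distance keeps the weaker 2- and 3-neighborhood evidence separate from direct neighbors.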

33
– No consistent mapping exists between the annotations from the different data sources available for metabolites and genes
– Therefore a non-automatic step is performed in the analysis:
  – manual interpretation of the different functional classes
  – derivation of a single consensus pathway annotation
– Result: 16 pathway predictions for unknowns with both GGM and GWAS evidence

34

35 Experimental validation: STEROID scenario
– An unknown metabolite (X-11244) for which both GGM and GWAS data strongly indicate an identity related to steroid-hormone compounds
– X-11244 is tightly linked via GGM edges to dehydroepiandrosterone sulfate and two other unknowns, which in turn connect to epiandrosterone sulfate and androsterone sulfate
– X-11244 displays a highly significant genetic association with rs296391, which lies in strong LD with the SULT2A1 gene locus
– Based on the GGM and GWAS results, X-11244 was hypothesized to be a steroid sulfate related to androstane

