Metabolic Network Inference from Multiple Types of Genomic Data Yoshihiro Yamanishi Centre de Bio-informatique, Ecole des Mines de Paris
Outline Motivation: metabolic network Method: network inference - Supervised network inference - Multiple data integration Application - Global network prediction Concluding remarks
Metabolic network The metabolic network consists of enzyme proteins and chemical compounds. 6018 genes in the yeast genome; 1120 genes with EC numbers; 668 genes with pathway information (in KEGG as of Sep. 2004). Problem: unknown parts of pathways and many missing enzyme genes.
Network inference methods For gene regulatory networks: Bayesian network (Friedman et al., 2000; Imoto et al., 2002); Boolean network (Akutsu et al., 2000); graphical modeling (Toh et al., 2001). For protein interaction networks: joint graph method (Marcotte et al., 1999); mirror tree method (Pazos et al., 2001).
Objectives Develop a method to infer metabolic gene networks in a supervised context. Integrate heterogeneous genomic data in the framework of network inference. Reconstruct unknown pathways and identify genes for missing enzymes.
Kernel in this study Kernel: representation of the similarity between two genes (e.g., correlation coefficient). Kernel matrix: similarity matrix of a set of genes.
An example of the kernel Suppose we have a set of genes x_1, x_2, …, x_N and represent them by gene expression profiles.
An example of kernel matrix This can be regarded as a kind of similarity matrix
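A minimal sketch of how such a kernel matrix could be computed, assuming the Pearson correlation coefficient as the similarity (the slides mention it only as an example). The function names `pearson` and `kernel_matrix` and the expression values are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kernel_matrix(profiles):
    """N x N similarity matrix: entry (i, j) compares gene i with gene j."""
    return [[pearson(p, q) for q in profiles] for p in profiles]

# Toy expression profiles (values invented for illustration only).
profiles = [
    [0.1, 0.4, 0.6, 0.2],
    [0.2, 0.8, 1.2, 0.4],  # exactly twice the first profile
    [0.6, 0.2, 0.1, 0.5],
]
K = kernel_matrix(profiles)
```

Here genes 1 and 2 are perfectly correlated, so their entry is 1 up to rounding, while gene 3 is negatively correlated with gene 1.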
Direct network inference Assumption: connected proteins in the network share high similarity in the data Similarity matrix based on a genomic dataset Configuration of genes
Direct network inference Assumption: connected proteins in the network share high similarity in the data Similarity matrix Predicted network
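The direct approach can be sketched as thresholding the similarity matrix: every gene pair whose similarity reaches a cut-off is predicted to be connected. The cut-off values and the matrix below are invented; the slides do not state how the threshold is chosen, so treat `threshold` as an assumed parameter.

```python
def predict_edges(K, genes, threshold):
    """Return the gene pairs whose similarity is at least `threshold`."""
    edges = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            if K[i][j] >= threshold:
                edges.append((genes[i], genes[j]))
    return edges

genes = ["g1", "g2", "g3", "g4"]
# Toy similarity matrix (values invented for illustration only).
K = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.7, 0.3],
    [0.2, 0.7, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
]
strict = predict_edges(K, genes, threshold=0.85)  # only the strongest pair
loose = predict_edges(K, genes, threshold=0.65)   # more pairs as the cut-off drops
```

Lowering the threshold adds edges in order of decreasing similarity, which is what makes it possible to trace a ROC curve for this approach.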
Evaluation of the direct approach: using gene expression data (157 experiments, SMD) Gold standard data: metabolic network of 668 yeast genes in KEGG/Pathway. [Figure: ROC curve; axes: false positives, true positives]
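A minimal sketch of how a ROC curve can be traced for predicted gene pairs: rank the pairs by similarity and, walking down the ranking, count how many gold-standard edges and non-edges have been accepted so far. The function name `roc_points` and the scores are illustrative, and ties between scores are not treated specially here.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate) from ranked pairs.

    scores: similarity of each candidate gene pair
    labels: 1 if the pair is an edge in the gold standard, else 0
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Highest-scoring pairs are predicted first.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Four candidate pairs (values invented for illustration only).
points = roc_points([0.9, 0.8, 0.7, 0.3], [1, 0, 1, 0])
```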
Outline Motivation: metabolic network Method: network inference - Supervised network inference - Multiple data integration Application - Global network prediction - Missing enzyme gene estimation Concluding remarks
An illustration of the formalism [Figure labels: protein network; unknown pathway; similarity matrix in expression; training]
Supervised network inference Key idea: use of partially known network information. [Figure: original space; legend: training set, edge predicted by direct approach, true edge]
Supervised network inference 1/2 Step 1: map proteins to a feature space where interacting proteins are close to each other. [Figure: original space and feature space; legend: training set, true edge]
Supervised network inference 2/2 Step 2: predict interacting protein pairs involving the test set. [Figure: original space and feature space; legend: training set, test set, true edge]
Algorithm Kernel CCA (Yamanishi et al., 2004) Distance metric learning (Vert et al., 2004)
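The two steps can be illustrated with a deliberately simple stand-in. This is NOT kernel CCA and NOT the metric learning method cited above: it only rescales each input feature so that genes connected in the training network end up close, then predicts pairs involving test genes by distance. All names (`learn_feature_weights`, `predict_test_edges`), the cut-off and the data are hypothetical.

```python
from itertools import combinations

def learn_feature_weights(profiles, train, edges, eps=1e-9):
    """Step 1 (stand-in): weight features so that known neighbours become close.

    A feature gets a large weight when genes joined by a known edge differ
    little on it compared with arbitrary pairs of training genes.
    """
    def mean_sq_diff(pairs, d):
        return sum((profiles[u][d] - profiles[v][d]) ** 2 for u, v in pairs) / len(pairs)

    all_pairs = list(combinations(train, 2))
    dims = range(len(profiles[train[0]]))
    return [mean_sq_diff(all_pairs, d) / (mean_sq_diff(edges, d) + eps) for d in dims]

def distance(profiles, weights, u, v):
    """Weighted squared distance between two genes in the rescaled space."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, profiles[u], profiles[v]))

def predict_test_edges(profiles, weights, train, test, cutoff):
    """Step 2: predict pairs that involve at least one test gene."""
    candidates = [(u, v) for u in test for v in train] + list(combinations(test, 2))
    return [(u, v) for u, v in candidates if distance(profiles, weights, u, v) <= cutoff]

# Toy data (invented): feature 0 follows the network, feature 1 is noise.
profiles = {
    "a": [0.0, 5.0], "b": [0.1, -4.0],
    "c": [3.0, 4.5], "d": [3.1, -5.0],
    "x": [0.05, 0.0], "y": [3.05, 1.0],
}
train, test = ["a", "b", "c", "d"], ["x", "y"]
known_edges = [("a", "b"), ("c", "d")]

weights = learn_feature_weights(profiles, train, known_edges)
predicted = predict_test_edges(profiles, weights, train, test, cutoff=50.0)
```

In this toy case the unweighted distance would rank the pair (x, y) as the closest, whereas after rescaling x attaches to a and b, and y to c and d. The point is only to show how known edges change the geometry before prediction.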
Result of the supervised learning: ROC curves by cross-validation [Figure panels: direct approach; supervised approach]
Various genomic data Data: phylogenetic profile, localization data, gene expression. Structure: bit strings, numerical vectors. Gene-gene relationship: evolutionary similarity, co-localization similarity, co-expression similarity.
Data of the yeast S. cerevisiae Expression: 6059 genes with 157 experiments (SMD database). Localization: 6059 proteins with 23 intracellular locations (Huh et al., 2003). Phylogenetic profile: 6059 proteins with 145 organisms (KEGG/Ortholog Cluster).
Gene expression profiles Numerical vectors of the gene expression ratio (rows: genes; columns: experiments or time series)
        exp1  exp2  exp3  exp4  exp5  …  exp P
gene 1  ( 0.1,  0.4,  0.6,  0.2, -0.3, …,  1.5)
gene 2  ( 0.2,  0.9,  1.8,  0.7, -0.3, …,  0.4)
gene 3  ( 0.6,  0.7, -1.0,  0.8,  1.2, …,  0.6)
…
gene N  ( 1.2,  0.3,  1.9, -0.1, -0.7, …,  0.1)
Phylogenetic profiles Bit strings in which the presence and absence of the genes are coded as 1 or 0 across organisms (rows: genes; columns: organisms)
        org1 org2 org3 org4 org5 … org P
gene 1  ( 1,   1,   0,   0,   0,  …,  1)
gene 2  ( 1,   0,   1,   0,   1,  …,  0)
gene 3  ( 0,   1,   0,   0,   1,  …,  0)
…
gene N  ( 1,   0,   1,   0,   0,  …,  1)
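For bit strings, one simple similarity is the fraction of organisms in which two genes agree (both present or both absent). This is an assumed choice for illustration; the slides do not say which similarity is applied to phylogenetic profiles. The example uses only the first five columns shown on the slide, since the remaining columns are elided.

```python
def profile_agreement(p, q):
    """Fraction of organisms where two genes share the same presence/absence bit."""
    matches = sum(1 for a, b in zip(p, q) if a == b)
    return matches / len(p)

# First five organisms of genes 1 to 3 as shown on the slide (truncated profiles).
gene1 = [1, 1, 0, 0, 0]
gene2 = [1, 0, 1, 0, 1]
gene3 = [0, 1, 0, 0, 1]

s12 = profile_agreement(gene1, gene2)
s13 = profile_agreement(gene1, gene3)
s23 = profile_agreement(gene2, gene3)
```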
An illustration of our network inference procedure INPUT: gene expression, protein localization, phylogenetic profile (each as a similarity matrix of genes). OUTPUT: inferred gene network.
Data representation and integration [Figure labels: genomic data; similarity matrix]
Evaluating the weight for each data source 1. Apply the method to each data source individually. 2. Evaluate its biological relevance by the ROC score (the area under the ROC curve).
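The ROC score can be computed without drawing the curve: the area under the ROC curve equals the probability that a randomly chosen true edge scores higher than a randomly chosen non-edge, with ties counted as one half. A small sketch with an illustrative function name `roc_score` and invented scores:

```python
def roc_score(scores, labels):
    """Area under the ROC curve via pairwise comparison of edges and non-edges."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Toy pair scores and gold-standard labels (invented for illustration only).
mixed = roc_score([0.9, 0.8, 0.7, 0.3], [1, 0, 1, 0])
perfect = roc_score([0.9, 0.8, 0.7, 0.3], [1, 1, 0, 0])
```

A score of 0.5 corresponds to a random ranking and 1.0 to a perfect one. The pairwise loop is quadratic, so this form suits small examples only.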
Evaluating the weight by the ROC scores For each data source, compute (ROC score - 0.5) and use it as the weight. [Figure panels: expression; localization; phylogenetic profile] Evolutionary information seems to be useful.
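A sketch of the weighting rule, assuming the weights are normalized to sum to one and that a source scoring below 0.5 gets weight zero (both are assumptions of this sketch). The ROC scores below are invented and are not the values from the talk.

```python
def normalized_weights(roc_scores):
    """Weight each data source by (ROC score - 0.5), normalized to sum to 1."""
    # Clipping at zero is an assumption: a source no better than random gets no weight.
    raw = {name: max(score - 0.5, 0.0) for name, score in roc_scores.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no data source performs better than random")
    return {name: w / total for name, w in raw.items()}

# Hypothetical ROC scores (invented for illustration only).
weights = normalized_weights({
    "expression": 0.60,
    "localization": 0.55,
    "phylogenetic profile": 0.75,
})
```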
The effect of data integration [Figure: the resulting normalized weights; ROC curve]
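A sketch of the integration step, assuming the combined similarity matrix is a weighted sum of the per-source matrices. The extracted slides do not spell out the formula, so this form, the function name `integrate_kernels` and all numbers are assumptions.

```python
def integrate_kernels(kernels, weights):
    """Entry-wise weighted sum of similarity matrices defined over the same genes."""
    names = list(kernels)
    n = len(kernels[names[0]])
    return [
        [sum(weights[name] * kernels[name][i][j] for name in names) for j in range(n)]
        for i in range(n)
    ]

# Two toy 2 x 2 similarity matrices and weights (invented for illustration only).
kernels = {
    "expression": [[1.0, 0.2], [0.2, 1.0]],
    "phylogenetic profile": [[1.0, 0.8], [0.8, 1.0]],
}
weights = {"expression": 0.25, "phylogenetic profile": 0.75}
K = integrate_kernels(kernels, weights)
```

With these weights the integrated similarity of the two genes is pulled toward the phylogenetic profile value.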
Comprehensive prediction of a global gene network We predicted a network of 6059 genes. Possible biological applications: 1. Estimate unknown pathways. 2. Predict biochemical function for hypothetical proteins. 3. Identify missing enzyme genes.
Prediction for a role in pathways YJR137C (its detailed function was unknown as of Sep. 2003) is connected with EC: and EC: in the predicted network
Prediction for a role in pathways Recently, there has been a report that YJR137C is annotated as EC:
Summary We developed supervised approaches to infer the metabolic network from multiple genomic data. The accuracy improved with supervised learning and weighted data integration. We showed some possibilities for obtaining new biological findings.
Collaborators For the methods: Jean-Philippe Vert (Ecole des Mines), Minoru Kanehisa (Kyoto University). For the biochemical experiments: Hisaaki Mihara, Motoharu Ohsaki, Hisashi Muramatsu, Nobuyoshi Esaki (Kyoto University).