Inferring strengths of protein-protein interactions from experimental data using linear programming Morihiro Hayashida, Nobuhisa Ueda, Tatsuya Akutsu Bioinformatics Center, Kyoto University
Overview Background Probabilistic model Related work Biological experimental data Proposed methods For binary data For numerical data Results of computational experiments Conclusion
Background (1/3) Understanding protein-protein interactions helps in understanding protein functions. Examples: transcription factors (proteins interact with a factor and regulate the gene), receptors, etc.
Background (2/3) Various methods have been developed for inferring protein-protein interactions. Gene fusion/Rosetta stone (Enright et al. and Marcotte et al., 1999): the number of genes to which it can be applied is limited. Molecular dynamics: requires long CPU time, and precise prediction is difficult.
Background (3/3) A model based on domain-domain interactions has been proposed. It uses domains defined by databases such as InterPro or Pfam.
Probabilistic model of interaction (1/2) Model (Deng et al., 2002): two proteins interact if and only if at least one pair of their domains interacts. Interactions between domains are independent events. [Figure: proteins P1 and P2 and their domains D1 to D4]
Probabilistic model of interaction (2/2) Notation: P_ij = 1 means that proteins P_i and P_j interact; D_mn = 1 means that domains D_m and D_n interact; a domain pair (D_m, D_n) may be included in a protein pair P_i × P_j.
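The model stated on these two slides can be written down directly. Under the independence assumption, two proteins fail to interact only if every domain pair they contain fails to interact, so Pr(P_ij = 1) = 1 - prod(1 - Pr(D_mn = 1)) over the included domain pairs. The sketch below is a minimal illustration, not the authors' code: the function name, the dictionary layout, the treatment of domain pairs as unordered, and the default probability of 0 for unknown pairs are all assumptions.

```python
from itertools import product


def protein_interaction_probability(domains_i, domains_j, domain_pair_prob):
    """Return Pr(P_ij = 1) = 1 - prod(1 - Pr(D_mn = 1)) over included domain pairs."""
    no_interaction = 1.0
    for d_m, d_n in product(domains_i, domains_j):
        # Assumption: domain pairs are unordered, and a pair with no
        # entry in the table has interaction probability 0.
        p = domain_pair_prob.get(frozenset((d_m, d_n)), 0.0)
        no_interaction *= 1.0 - p
    return 1.0 - no_interaction


# Hypothetical probabilities, for illustration only.
probs = {frozenset(("D1", "D2")): 0.5, frozenset(("D2", "D4")): 0.2}
p12 = protein_interaction_probability(["D1", "D2", "D3"], ["D2", "D4"], probs)
```

Here only two of the six domain pairs have a nonzero probability, so the result is 1 - (0.5 × 0.8).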
Overview Background Probabilistic model Related work Association method (Sprinzak et al., 2001) EM method (Deng et al., 2002) Biological experimental data Proposed methods Results of computational experiments Conclusion
Related work INPUT: interacting protein pairs (positive examples) and non-interacting protein pairs (negative examples). OUTPUT: Pr(D_mn = 1) for all domain pairs.
Association method (Sprinzak et al., 2001) Infers probabilities of domain-domain interactions from ratios of frequencies: the number of interacting protein pairs that include (D_m, D_n), divided by the number of all protein pairs that include (D_m, D_n).
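A minimal sketch of this frequency ratio is given below. It is an illustration under stated assumptions, not the published implementation: the function name, the input layout, the treatment of domain pairs as unordered, and the choice to count each domain pair at most once per protein pair are all assumptions.

```python
from collections import Counter
from itertools import product


def association_scores(protein_domains, pairs, labels):
    """Estimate Pr(D_mn = 1) as (interacting pairs with D_mn) / (all pairs with D_mn)."""
    interacting = Counter()
    total = Counter()
    for (p_i, p_j), label in zip(pairs, labels):
        # A set ensures each domain pair is counted once per protein pair.
        domain_pairs = {
            frozenset((m, n))
            for m, n in product(protein_domains[p_i], protein_domains[p_j])
        }
        for dp in domain_pairs:
            total[dp] += 1
            if label == 1:
                interacting[dp] += 1
    return {dp: interacting[dp] / total[dp] for dp in total}


# Hypothetical toy data, for illustration only.
protein_domains = {"P1": ["D1"], "P2": ["D2"], "P3": ["D2", "D3"]}
pairs = [("P1", "P2"), ("P1", "P3")]
labels = [1, 0]  # 1 = interacting, 0 = non-interacting
scores = association_scores(protein_domains, pairs, labels)
```

In this toy input the pair (D1, D2) appears in two protein pairs, one of which interacts, so its score is 0.5.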
EM method (Deng et al., 2002) Define the likelihood L as the probability that the experimental data {O_ij ∈ {0,1}} are observed. Use the EM algorithm to (locally) maximize L and thereby estimate Pr(D_mn = 1).
Biological experimental data Related methods (Association and EM) use only binary data (interact or not). Experimental data from yeast two-hybrid assays: Ito et al. (2000, 2001) and Uetz et al. (2001). For many protein pairs, different results (O_ij ∈ {0,1}) were observed. We developed new methods that use the raw numerical data.
Numerical data Ito et al. (2000, 2001): for each protein pair, experiments were performed multiple times. IST (Interaction Sequence Tag): number of observed interactions. By applying a threshold, we obtain binary data.
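The thresholding step can be sketched in a few lines. The function name, the use of "greater than or equal to" as the comparison, and the threshold value of 3 are assumptions chosen for illustration; the slide does not state which threshold was used.

```python
def binarize_ist_counts(ist_counts, threshold):
    """Map each protein pair to 1 if its IST count reaches the threshold, else 0."""
    return {
        pair: 1 if count >= threshold else 0
        for pair, count in ist_counts.items()
    }


# Hypothetical IST counts and threshold, for illustration only.
ist_counts = {("P1", "P2"): 5, ("P1", "P3"): 1, ("P2", "P3"): 3}
binary = binarize_ist_counts(ist_counts, threshold=3)
```

Thresholding discards how often an interaction was observed, which is the information the numerical methods aim to keep.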
Proposed methods It seems difficult to modify the EM method for numerical data. Approach: linear programming. For binary data: LPBN; combined methods LPEM and EMLP; SVM-based method. For numerical data: ASNM; LPNM.
LPBN (LP-based method) (1/2) Transformation into linear inequalities of the condition that P_i and P_j interact.
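One way to obtain such linear inequalities from the probabilistic model of Deng et al. is to take logarithms of the product form. The derivation below is a sketch under assumptions: the threshold symbol \Theta (with 0 < \Theta < 1) and the variable name x_{mn} are introduced here for illustration, and every Pr(D_{mn} = 1) is assumed to be strictly less than 1 so that the logarithms are defined.

```latex
\begin{align*}
\Pr(P_{ij}=1) \ge \Theta
&\iff 1 - \prod_{(D_m,D_n)\in P_i\times P_j} \bigl(1-\Pr(D_{mn}=1)\bigr) \ge \Theta \\
&\iff \prod_{(D_m,D_n)\in P_i\times P_j} \bigl(1-\Pr(D_{mn}=1)\bigr) \le 1-\Theta \\
&\iff \sum_{(D_m,D_n)\in P_i\times P_j} \ln\bigl(1-\Pr(D_{mn}=1)\bigr) \le \ln(1-\Theta)
\end{align*}
```

With the substitution x_{mn} = \ln(1 - \Pr(D_{mn}=1)) \le 0, the last line is linear in the unknowns x_{mn}, which is what makes an LP formulation possible.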
LPBN (LP-based method)(2/2) Linear programming for inference of protein-protein interactions
Combination of EM and LPBN LPEM method: use the results of LPBN as initial parameter values for EM. EMLP method: add constraints to LPBN, in the form of inequalities, so that the LP solutions are close to the EM solutions.
Simple SVM-based method Feature vector and a simple linear kernel. Interacting pairs = positive examples; non-interacting pairs = negative examples.
Strength of protein-protein interaction For each protein pair, experiments were performed multiple times. The ratio K_ij / M_ij can be considered as the strength of the interaction. K_ij: number of observed interactions for a protein pair (P_i, P_j). M_ij: number of experiments for (P_i, P_j).
LPNM method (1/2) Minimize the gap between Pr(P_ij = 1) and the observed ratio K_ij / M_ij using LP.
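As a small worked example of the ratio, the sketch below computes the strength from the two counts. The function name and the input validation are assumptions added for illustration.

```python
def interaction_strength(k_ij, m_ij):
    """Return the strength K_ij / M_ij of a protein pair.

    k_ij: number of observed interactions; m_ij: number of experiments.
    """
    if m_ij <= 0:
        raise ValueError("number of experiments must be positive")
    if not 0 <= k_ij <= m_ij:
        raise ValueError("observed interactions must lie between 0 and m_ij")
    return k_ij / m_ij


# Hypothetical counts: 3 observed interactions in 4 experiments.
strength = interaction_strength(3, 4)
```

The result always lies in [0, 1], so it can be compared directly with a probability such as Pr(P_ij = 1).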
LPNM method (2/2) Linear programming for inference of strengths of protein-protein interactions
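The formulation itself is not preserved in this text. One possible linear program in this spirit, given here only as a sketch and not as the authors' formulation, measures the gap after a logarithmic transformation of the probabilistic model. The symbols \rho_{ij} = K_{ij}/M_{ij}, x_{mn} = \ln(1 - \Pr(D_{mn}=1)), and the slack variables \alpha_{ij} are assumptions introduced for illustration, and \rho_{ij} < 1 is assumed so that the logarithm is defined.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{i,j} \alpha_{ij} \\
\text{subject to}\quad
& -\alpha_{ij} \;\le\; \sum_{(D_m,D_n)\in P_i\times P_j} x_{mn} \;-\; \ln(1-\rho_{ij}) \;\le\; \alpha_{ij}, \\
& x_{mn} \le 0, \qquad \alpha_{ij} \ge 0 .
\end{align*}
```

Each \alpha_{ij} bounds the absolute difference between \ln(1 - \Pr(P_{ij}=1)) and \ln(1 - \rho_{ij}), so this sketch minimizes the gap in log space rather than the gap between the probabilities themselves. Domain-pair probabilities are recovered as \Pr(D_{mn}=1) = 1 - e^{x_{mn}}.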
ASNM Modified Association method for numerical data; the original Association method (Sprinzak et al., 2001) is for binary data.
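The ASNM formula is not preserved in this text. One natural way to extend the Association ratio to numerical data, which may differ from the definition on the original slide, is to replace each 0/1 label with the observed strength and average it over the protein pairs that contain the domain pair. The function name, the input layout, and the treatment of domain pairs as unordered are assumptions.

```python
from collections import defaultdict
from itertools import product


def mean_strength_scores(protein_domains, strengths):
    """Score each domain pair by the mean strength of protein pairs containing it.

    Assumed generalization of the Association ratio: with 0/1 strengths it
    reduces to (interacting pairs) / (all pairs) for each domain pair.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (p_i, p_j), rho in strengths.items():
        # A set ensures each domain pair is counted once per protein pair.
        domain_pairs = {
            frozenset((m, n))
            for m, n in product(protein_domains[p_i], protein_domains[p_j])
        }
        for dp in domain_pairs:
            sums[dp] += rho
            counts[dp] += 1
    return {dp: sums[dp] / counts[dp] for dp in counts}


# Hypothetical toy data, for illustration only.
protein_domains = {"P1": ["D1"], "P2": ["D2"], "P3": ["D2", "D3"]}
strengths = {("P1", "P2"): 0.75, ("P1", "P3"): 0.25}
numeric_scores = mean_strength_scores(protein_domains, strengths)
```

With strengths restricted to 0 and 1, this sketch gives the same values as the binary Association ratio.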
Computational experiments for binary data DIP database (Xenarios et al., 2002): 1767 protein pairs as positive examples; 2/3 of the pairs for training, 1/3 for test. Computational environment: Xeon processor, 2.8 GHz. LP solver: loqo.
Results on training data (binary data) [Figure: comparison of SVM, EM, LPBN, and Association]
Results on test data (binary data) [Figure: comparison of SVM, EM, EMLP, Association, and LPEM]
Computational experiments for numerical data YIP database (Ito et al., 2001, 2002): IST (Interaction Sequence Tag) data for 1586 protein pairs; 4/5 for training, 1/5 for test. Computational environment: Xeon processor, 2.8 GHz. LP solver: lp_solve.
Results on test data (numerical data) [Figure: comparison of ASNM, EM, LPNM, and Association]
Results on test data (numerical data) LPNM is the best. EM and Association methods classify Pr(P_ij = 1) into either 0 or 1. [Table: average error and CPU time (sec.) for LPNM, ASNM, EM, and Association]
Conclusion We have defined a new problem: inferring strengths of protein-protein interactions. We have proposed LP-based methods. For binary data: LPBN, LPEM, EMLP, and an SVM-based method. For numerical data: ASNM and LPNM. LPNM outperformed the other methods.
Future work Improve the methods to avoid overfitting. Improve the probabilistic model to understand protein-protein interactions more accurately.