Inferring strengths of protein-protein interactions from experimental data using linear programming Morihiro Hayashida, Nobuhisa Ueda, Tatsuya Akutsu Bioinformatics.

Inferring strengths of protein-protein interactions from experimental data using linear programming
Morihiro Hayashida, Nobuhisa Ueda, Tatsuya Akutsu
Bioinformatics Center, Kyoto University

Overview
- Background
- Probabilistic model
- Related work
- Biological experimental data
- Proposed methods
  - For binary data
  - For numerical data
- Results of computational experiments
- Conclusion

Background (1/3)
- Understanding protein-protein interactions is useful for understanding protein functions.
- Transcription factors
  - Proteins interact with a factor.
  - Regulate the gene.
- Receptors, etc.

Background (2/3)
- Various methods have been developed for inferring protein-protein interactions.
- Gene fusion / Rosetta stone (Enright et al. and Marcotte et al., 1999)
  - The number of genes to which it can be applied is limited.
- Molecular dynamics
  - Long CPU time
  - Difficult to predict precisely

Background (3/3)
- A model based on domain-domain interactions has been proposed.
- It uses domains defined by databases such as InterPro or Pfam.
[Figure: domain illustration]

Probabilistic model of interaction (1/2)
- Model (Deng et al., 2002)
  - Two proteins interact if and only if at least one pair of their domains interacts.
  - Interactions between domains are independent events.
[Figure: proteins P1 and P2 composed of domains D1 to D4]

Probabilistic model of interaction (2/2)
- P_ij = 1: proteins P_i and P_j interact
- D_mn = 1: domains D_m and D_n interact
- (D_m, D_n) ∈ P_i × P_j: domain pair (D_m, D_n) is included in protein pair P_i × P_j
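The two model assumptions (proteins interact exactly when at least one domain pair interacts, and domain-pair interactions are independent) imply Pr(P_ij = 1) = 1 - prod(1 - Pr(D_mn = 1)), with the product taken over the domain pairs in P_i × P_j. The following is a minimal sketch of that computation. The function names, the dictionary layout, and the treatment of domain pairs as unordered are assumptions made for illustration and are not taken from the slides.

```python
from itertools import product
from math import prod


def domain_pairs(domains_i, domains_j):
    """Return the set of unordered domain pairs in P_i x P_j (assumed unordered)."""
    return {tuple(sorted(pair)) for pair in product(domains_i, domains_j)}


def interaction_probability(domains_i, domains_j, pr_domain):
    """Pr(P_ij = 1) = 1 - prod(1 - Pr(D_mn = 1)) over domain pairs in P_i x P_j.

    pr_domain maps an unordered domain pair to Pr(D_mn = 1); pairs that are
    missing from the map are treated as non-interacting (probability 0).
    """
    pairs = domain_pairs(domains_i, domains_j)
    return 1.0 - prod(1.0 - pr_domain.get(pair, 0.0) for pair in pairs)


# Hypothetical toy input: P1 has domains D1 and D2, P2 has domain D3.
pr = {("D1", "D3"): 0.5, ("D2", "D3"): 0.5}
print(interaction_probability(["D1", "D2"], ["D3"], pr))  # 0.75
```

With two candidate domain pairs at probability 0.5 each, the chance that neither interacts is 0.25, so the protein pair interacts with probability 0.75.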

Related work
- INPUT:
  - interacting protein pairs (positive examples)
  - non-interacting protein pairs (negative examples)
- OUTPUT: Pr(D_mn = 1) for all domain pairs

Association method (Sprinzak et al., 2001)
- Infers probabilities of domain-domain interactions using ratios of frequencies.
- Pr(D_mn = 1) is estimated as the number of interacting protein pairs that include (D_m, D_n) divided by the number of protein pairs that include (D_m, D_n).
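The frequency ratio can be sketched in a few lines. This is an illustrative reading of the slide, not the authors' code: the input layout (a map from protein name to its domains, and a map from protein pair to a 0/1 observation) and the function name are assumptions, and domain pairs are again treated as unordered.

```python
from collections import Counter
from itertools import product


def association_scores(protein_domains, observations):
    """Estimate Pr(D_mn = 1) as (# interacting pairs with D_mn) / (# pairs with D_mn).

    protein_domains: dict mapping protein name -> list of domain names.
    observations: dict mapping (protein_i, protein_j) -> 1 (interact) or 0.
    """
    interacting = Counter()
    total = Counter()
    for (p_i, p_j), observed in observations.items():
        pairs = {tuple(sorted(pair))
                 for pair in product(protein_domains[p_i], protein_domains[p_j])}
        for pair in pairs:
            total[pair] += 1
            if observed == 1:
                interacting[pair] += 1
    return {pair: interacting[pair] / total[pair] for pair in total}


# Hypothetical toy data: domain pair (D1, D2) occurs in two protein pairs,
# only one of which is observed to interact.
domains = {"P1": ["D1"], "P2": ["D2"], "P3": ["D2"]}
obs = {("P1", "P2"): 1, ("P1", "P3"): 0}
print(association_scores(domains, obs))  # {('D1', 'D2'): 0.5}
```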

EM method (Deng et al., 2002)
- Considers the probability (likelihood L) that the experimental data {O_ij}, with O_ij ∈ {0,1}, are observed.
- Uses the EM algorithm to (locally) maximize L.
- Estimates Pr(D_mn = 1).
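The likelihood formula did not survive in the transcript. As a hedged reconstruction, the form usually given for this model treats each observation as an independent Bernoulli trial, with false positive and false negative rates (written here as fp and fn) linking the true interaction P_ij to the observation O_ij. The symbols fp and fn are assumptions and do not appear on the slide.

```latex
\begin{align*}
L &= \prod_{i,j} \Pr(O_{ij}=1)^{O_{ij}} \,\bigl(1-\Pr(O_{ij}=1)\bigr)^{1-O_{ij}}, \\
\Pr(O_{ij}=1) &= \Pr(P_{ij}=1)\,(1-fn) + \bigl(1-\Pr(P_{ij}=1)\bigr)\,fp, \\
\Pr(P_{ij}=1) &= 1 - \prod_{(D_m,D_n)\in P_i\times P_j} \bigl(1-\Pr(D_{mn}=1)\bigr).
\end{align*}
```

Because L is not concave in the parameters Pr(D_mn = 1), EM is only guaranteed to reach a local maximum, which matches the "(locally)" qualifier on the slide.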

Biological experimental data
- Related methods (Association and EM) use only binary data (interact or not).
- Experimental data from yeast two-hybrid experiments
  - Ito et al. (2000, 2001)
  - Uetz et al. (2001)
- For many protein pairs, different results (O_ij ∈ {0,1}) were observed.
- We developed new methods using raw numerical data.

Numerical data
- Ito et al. (2000, 2001)
- For each protein pair, experiments were performed multiple times.
- IST (Interaction Sequence Tag): number of observed interactions
- By applying a threshold, we obtain binary data.
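The thresholding step that turns IST counts into binary labels might look like the sketch below. The slide does not state the threshold value or whether the comparison is inclusive, so both the value 3 used in the example and the use of `>=` are assumptions chosen only for illustration.

```python
def binarize(ist_counts, threshold):
    """Map each protein pair's IST count to 1 if count >= threshold, else 0.

    The inclusive comparison is an assumption; the slides only say that
    "a threshold" is used.
    """
    return {pair: int(count >= threshold) for pair, count in ist_counts.items()}


# Hypothetical IST counts and an illustrative threshold of 3.
counts = {("P1", "P2"): 5, ("P1", "P3"): 1}
print(binarize(counts, threshold=3))  # {('P1', 'P2'): 1, ('P1', 'P3'): 0}
```

Any such cutoff discards how often an interaction was seen, which is the information the numerical-data methods aim to use.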

Proposed methods
- It seems difficult to modify the EM method for numerical data.
- Linear programming
- For binary data
  - LPBN
  - Combined methods: LPEM, EMLP
  - SVM-based method
- For numerical data
  - ASNM
  - LPNM

LPBN (LP-based method) (1/2)
- Transformation into linear inequalities
[Equation: condition under which P_i and P_j interact, written as linear inequalities]
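The inequalities themselves are missing from the transcript. One way to obtain linear inequalities from the probabilistic model is to take logarithms, sketched below. The cutoff symbol Θ (a pair is predicted to interact when Pr(P_ij = 1) ≥ Θ, with Θ < 1) and the variable name γ_mn are assumptions introduced for this sketch.

```latex
\begin{align*}
\Pr(P_{ij}=1) \ge \Theta
&\iff 1 - \prod_{(D_m,D_n)\in P_i\times P_j} \bigl(1-\Pr(D_{mn}=1)\bigr) \ge \Theta \\
&\iff \prod_{(D_m,D_n)\in P_i\times P_j} \bigl(1-\Pr(D_{mn}=1)\bigr) \le 1-\Theta \\
&\iff \sum_{(D_m,D_n)\in P_i\times P_j} \gamma_{mn} \le \ln(1-\Theta),
\qquad \gamma_{mn} = \ln\bigl(1-\Pr(D_{mn}=1)\bigr) \le 0 .
\end{align*}
```

The last line is linear in the variables γ_mn, and each probability is recovered afterwards as Pr(D_mn = 1) = 1 - exp(γ_mn). The transformation requires Pr(D_mn = 1) < 1 so that the logarithm is finite.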

LPBN (LP-based method) (2/2)
- Linear programming for inference of protein-protein interactions
[Equation: linear program]
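The linear program on this slide was not preserved. The block below is a plausible soft-constraint formulation, offered as an assumption rather than the authors' exact program. It uses γ_mn = ln(1 - Pr(D_mn = 1)), a cutoff Θ < 1, and one slack variable ε_ij per protein pair that measures how far a training example is from being classified correctly. All of these symbols are introduced here for illustration.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{i,j} \varepsilon_{ij} \\
\text{subject to}\quad
& \sum_{(D_m,D_n)\in P_i\times P_j} \gamma_{mn} \le \ln(1-\Theta) + \varepsilon_{ij}
  && \text{for interacting pairs } (O_{ij}=1), \\
& \sum_{(D_m,D_n)\in P_i\times P_j} \gamma_{mn} \ge \ln(1-\Theta) - \varepsilon_{ij}
  && \text{for non-interacting pairs } (O_{ij}=0), \\
& \gamma_{mn} \le 0, \qquad \varepsilon_{ij} \ge 0 .
\end{align*}
```

If the training data are perfectly separable under the model, the optimum is zero and every slack vanishes. Otherwise the objective penalizes the total violation.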

Combination of EM and LPBN
- LPEM method: use the results of LPBN as initial parameter values for EM.
- EMLP method: add the following inequalities to LPBN as constraints so that the LP solutions are close to the EM solutions.
[Equation: EMLP constraints]

Simple SVM-based method
- Feature vector [Equation: feature vector definition]
- Simple linear kernel with [Equation]
- Interacting pairs = positive examples
- Non-interacting pairs = negative examples
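The feature vector definition is missing from the transcript. A natural choice, assumed here and not confirmed by the slides, is one binary feature per domain pair that is set to 1 when the protein pair contains that domain pair. The sketch below builds such vectors and evaluates a linear kernel between them. It stops short of training a classifier, and all names are hypothetical.

```python
from itertools import product


def pair_set(domains_i, domains_j):
    """Unordered domain pairs contained in a protein pair."""
    return {tuple(sorted(pair)) for pair in product(domains_i, domains_j)}


def feature_vector(domains_i, domains_j, feature_index):
    """Binary vector with a 1 at the position of every domain pair present.

    feature_index maps each known domain pair to its position in the vector.
    """
    vec = [0] * len(feature_index)
    for pair in pair_set(domains_i, domains_j):
        if pair in feature_index:
            vec[feature_index[pair]] = 1
    return vec


def linear_kernel(u, v):
    """Plain dot product; for binary vectors it counts shared domain pairs."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical index over three domain pairs.
index = {("D1", "D3"): 0, ("D2", "D3"): 1, ("D2", "D4"): 2}
x = feature_vector(["D1", "D2"], ["D3"], index)   # [1, 1, 0]
y = feature_vector(["D2"], ["D3", "D4"], index)   # [0, 1, 1]
print(x, y, linear_kernel(x, y))  # [1, 1, 0] [0, 1, 1] 1
```

Vectors for interacting pairs would be labeled positive and those for non-interacting pairs negative before being passed to an SVM trainer of choice.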

Strength of protein-protein interaction
- For each protein pair, experiments were performed multiple times.
- The ratio K_ij / M_ij can be considered as the strength.
- K_ij: number of observed interactions for a protein pair (P_i, P_j)
- M_ij: number of experiments for (P_i, P_j)
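As a small worked example of the ratio, the sketch below computes the strength for one protein pair and validates its inputs. The function name and the validation rules are assumptions added for illustration.

```python
def interaction_strength(k_ij, m_ij):
    """Return K_ij / M_ij, the fraction of experiments in which an
    interaction was observed for a protein pair."""
    if m_ij <= 0:
        raise ValueError("M_ij must be a positive number of experiments")
    if not 0 <= k_ij <= m_ij:
        raise ValueError("K_ij must lie between 0 and M_ij")
    return k_ij / m_ij


# Hypothetical pair: interaction observed in 3 of 4 experiments.
print(interaction_strength(3, 4))  # 0.75
```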

LPNM method (1/2)
- Minimize the gap between Pr(P_ij = 1) and the observed strength K_ij / M_ij using LP.

LPNM method (2/2)
- Linear programming for inference of strengths of protein-protein interactions
[Equation: linear program]
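The LPNM program is not preserved in the transcript. The block below shows one standard way to cast "minimize the gap" as a linear program: measure the gap after the logarithmic transformation γ_mn = ln(1 - Pr(D_mn = 1)) and split each absolute deviation into two non-negative slack variables. This is an assumption about the formulation, not the authors' stated program. The symbols ρ_ij, α_ij, and β_ij are introduced here, and the gap is measured in the transformed space rather than directly between probabilities.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{i,j} (\alpha_{ij} + \beta_{ij}) \\
\text{subject to}\quad
& \sum_{(D_m,D_n)\in P_i\times P_j} \gamma_{mn} + \alpha_{ij} - \beta_{ij}
  = \ln(1-\rho_{ij}), \qquad \rho_{ij} = \frac{K_{ij}}{M_{ij}}, \\
& \gamma_{mn} \le 0, \qquad \alpha_{ij} \ge 0, \qquad \beta_{ij} \ge 0 .
\end{align*}
```

At the optimum, α_ij + β_ij equals the absolute difference between ln(1 - Pr(P_ij = 1)) and ln(1 - ρ_ij). Pairs with ρ_ij = 1 make the right-hand side undefined, so a sketch like this would need ρ_ij clipped slightly below 1.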

ASNM
- Modified Association method for numerical data [Equation]
- For binary data (Sprinzak et al., 2001) [Equation]
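The ASNM formula is missing from the transcript. A natural way to extend the Association ratio to numerical data, assumed here, is to replace the count of interacting protein pairs with the sum of their observed strengths, so that Pr(D_mn = 1) becomes the average strength over protein pairs containing (D_m, D_n). The sketch below follows that reading, with hypothetical names throughout.

```python
from collections import defaultdict
from itertools import product


def asnm_scores(protein_domains, strengths):
    """Average observed strength over the protein pairs containing each domain pair.

    protein_domains: dict mapping protein name -> list of domain names.
    strengths: dict mapping (protein_i, protein_j) -> strength in [0, 1].
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (p_i, p_j), rho in strengths.items():
        pairs = {tuple(sorted(pair))
                 for pair in product(protein_domains[p_i], protein_domains[p_j])}
        for pair in pairs:
            sums[pair] += rho
            counts[pair] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


# Hypothetical toy data: (D1, D2) appears in two protein pairs with
# strengths 0.75 and 0.25, so its score is their mean.
domains = {"P1": ["D1"], "P2": ["D2"], "P3": ["D2"]}
rho = {("P1", "P2"): 0.75, ("P1", "P3"): 0.25}
print(asnm_scores(domains, rho))  # {('D1', 'D2'): 0.5}
```

When every strength is 0 or 1, this reduces to the binary Association ratio.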

Computational experiments for binary data
- DIP database (Xenarios et al., 2002)
- 1767 protein pairs as positive examples
- 2/3 of the pairs for training, 1/3 for testing
- Computational environment
  - Xeon processor, 2.8 GHz
  - LP solver: loqo

Results on training data (binary data)
[Figure: comparison of SVM, EM, LPBN, and Association]

Results on test data (binary data)
[Figure: comparison of SVM, EM, EMLP, Association, and LPEM]

Computational experiments for numerical data
- YIP database (Ito et al., 2001, 2002)
- IST (Interaction Sequence Tag)
- 1586 protein pairs
- 4/5 for training, 1/5 for testing
- Computational environment
  - Xeon processor, 2.8 GHz
  - LP solver: lp_solve

Results on test data (numerical data)
[Figure: comparison of ASNM, EM, LPNM, and Association]

Results on test data (numerical data)
- LPNM is the best.
- EM and Association methods classify Pr(P_ij = 1) into either 0 or 1.
[Table: average error and CPU time (sec.) for LPNM, ASNM, EM, and Association]
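The slide reports an average error per method but the transcript does not define it. A reasonable reading, assumed here, is the mean absolute difference between the predicted Pr(P_ij = 1) and the observed strength K_ij / M_ij over the test pairs. The sketch below computes that quantity on made-up numbers that are not results from the talk.

```python
def average_error(predicted, observed):
    """Mean absolute difference between predicted probabilities and
    observed strengths, taken over the protein pairs in `observed`."""
    if not observed:
        raise ValueError("observed must contain at least one protein pair")
    total = sum(abs(predicted[pair] - observed[pair]) for pair in observed)
    return total / len(observed)


# Hypothetical values for two test pairs (illustration only).
pred = {("P1", "P2"): 0.5, ("P1", "P3"): 1.0}
obs = {("P1", "P2"): 0.75, ("P1", "P3"): 0.5}
print(average_error(pred, obs))  # 0.375
```

Under this measure, a method that outputs only 0 or 1 is penalized on every pair whose observed strength is intermediate.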

Conclusion
- We have defined a new problem: inferring strengths of protein-protein interactions.
- We have proposed LP-based methods.
  - For binary data
    - LPBN, LPEM, EMLP
    - SVM-based method
  - For numerical data
    - ASNM
    - LPNM
- LPNM outperformed the other methods.

Future work
- Improve the methods to avoid overfitting.
- Improve the probabilistic model to understand protein-protein interactions more accurately.