

1 Computational Proteomics: Structure/Function Prediction & the Protein Interactome. Jaime Carbonell (jgc@cs.cmu.edu), with Betty Cheng, Yan Liu, Eric Xing, Yanjun Qi, Judith Klein-Seetharaman, and Oznur Tastan. Carnegie Mellon University, Pittsburgh PA, USA. December 2008.

2 © 2003, Jaime Carbonell. Simplified View of Biology (Nobelprize.org): protein sequence → protein structure.

3 Proteins: Sequence → Structure → Function (normal)
Primary sequence: MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
→ 3D structure (folding) → complex function within a network of proteins
(Borrowed from Judith Klein-Seetharaman)

4 Proteins: Sequence → Structure → Function (disease)
The same pipeline, illustrated for a disease protein: primary sequence → 3D structure (folding) → complex function within a network of proteins.

5 Motivation: Protein Structure and Function Prediction
Ultimate goal: Sequence → Function
– ...and Function → Sequence (drug design, ...)
– Potential active binding sites are a good start, but what about stability, external accessibility, energetics, ...?
Intermediate goal: Sequence → Structure
– Only 1.2% of known protein sequences have been structurally resolved
– What-if analysis (a precursor to mutagenesis experiments)
Machine Learning & Language Technologies methods
– Powerful tools to model and predict structure & function
– Computational biology challenges are starting to drive new research in Machine Learning & Language Technologies

6 OUTLINE
Motivation: sequence → structure → function
Vocabulary-based classification approaches (Betty Cheng, Jaime Carbonell, Judith Klein-Seetharaman)
– GPCR subfamily classification
– Protein-protein coupling specificity
Solving the "Folding Problem": Machine Learning approaches to structure prediction (Yan Liu, Jaime Carbonell, et al.)
– Tertiary folds: β-helix prediction via segmented CRFs
– Quaternary folds: viral adhesin and capsid complexes
Conclusions and future directions

7 GPCR Superfamily: G-Protein Coupled Receptors
Transmembrane proteins; the target of ~60% of drugs (Moller, 2002).
Involved in cancer, cardiovascular disease, Alzheimer's and Parkinson's diseases, stroke, diabetes, and inflammatory and respiratory diseases.
[Figure: seven-transmembrane topology (helices I-VII), N-terminus and extracellular loops outside the membrane, C-terminus and intracellular loops inside]

8 Protein Family & Subfamily Classification (applied to GPCRs). Subfamily classification is based on pharmaceutical properties.

9 Comparative Study (Karchin et al., 2002)
Classifiers by complexity, simple to complex: BLAST and k-nearest neighbours; hidden Markov models; decision trees and Naïve Bayes; support vector machines, neural nets, clustering.
Traditionally, hidden Markov models, k-nearest neighbours and BLAST have been used; more complex classifiers have been adopted recently. Karchin et al. (2002) studied a range of classifiers of varied complexity for GPCR subfamily classification and found SVMs best for subfamily classification. But what about the simple classifiers at the other end of the scale?
Hypothesis: bio-vocabulary selection is crucial for subfamily classification (and for protein-protein interaction prediction).

10 Study "segments" with different vocabularies: amino acids, chemical groups, and properties of amino acids.

11 Computing Chi-Square
Compare the observed number of sequences with feature x in class c against the expected number, derived from the total number of sequences, the number of sequences with feature x, and the number of sequences in class c.
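The quantities on the slide map onto a standard 2×2 contingency test. A minimal sketch (the function name and the exact cell layout are assumptions, since the slide only names the inputs):

```python
def chi_square(n_xc, n_x, n_c, n):
    """Chi-square statistic for a binary n-gram feature x vs. class c.

    n_xc: observed # of sequences in class c containing feature x
    n_x : # of sequences containing feature x (any class)
    n_c : # of sequences in class c
    n   : total # of sequences
    """
    chi2 = 0.0
    # the four cells of the 2x2 table: (observed count, row total, column total)
    for obs, row, col in (
        (n_xc, n_c, n_x),
        (n_x - n_xc, n - n_c, n_x),
        (n_c - n_xc, n_c, n - n_x),
        (n - n_x - n_c + n_xc, n - n_c, n - n_x),
    ):
        exp = row * col / n          # expected count under independence
        if exp > 0:
            chi2 += (obs - exp) ** 2 / exp
    return chi2
```

Features are then ranked by this statistic and the top-scoring n-grams kept.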

12 Level I Subfamily Optimization
[Figure: accuracy vs. number of features, for Decision Trees and Naïve Bayes, with binary features and n-gram counts]

13 Level I Subfamily Results

Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 5500-7700 | Binary | 93.0%
Naïve Bayes | 3300-6900 | N-gram counts | 90.6%
Naïve Bayes | All (9702) | N-gram counts | 90.0%
SVM | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 88.4%
BLAST | | Local sequence alignment | 83.3%
Decision Tree | 900-2800 | Binary | 77.3%
Decision Tree | 700-5600 | N-gram counts | 77.3%
Decision Tree | All (9723) | N-gram counts | 77.2%
SAM-T2K HMM | | An HMM built for each protein subfamily | 69.9%
kernNN | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 64.0%

14 Level II Subfamily Results

Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 8100 | Binary | 92.4%
SVM | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 86.3%
Naïve Bayes | 5600 | N-gram counts | 84.2%
SVMtree | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 82.9%
Naïve Bayes | All (9702) | N-gram counts | 81.9%
BLAST | | Local sequence alignment | 74.5%
Decision Tree | 1200 | N-gram counts | 70.8%
Decision Tree | 2300 | Binary | 70.2%
SAM-T2K HMM | | An HMM built for each protein subfamily | 70.0%
Decision Tree | All (9723) | N-gram counts | 66.0%
kernNN | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 51.0%

15 The top 20 selected "words" for Class B GPCRs correlate with identified motifs. Helices 3 and 7 are known to be important for signal transduction; loop 1 is a suspected common binding site.

16 Generalization to Other Superfamilies: Nuclear Receptors

Dataset | Feature Type | # of Features | Validation Acc. | Testing Acc.
Family | Binary | 1500-4200 | 96.96% | 94.53%
Family | N-gram counts | 400-4900 | 95.75% | 91.79%
Level I Subfamily | Binary | 1500-3100 | 98.09% | 97.77%
Level I Subfamily | N-gram counts | 500-1100 | 93.95% | 91.40%
Level II Subfamily | Binary | 1500-2100 | 95.32% | 93.62%
Level II Subfamily | N-gram counts | 3100-5600 | 86.39% | 85.54%

17 G-Protein Coupling Specificity Problem
Predict which one or more families of G-proteins a GPCR can couple with, given the GPCR sequence.
Locate the regions in the GPCR sequence where the majority of the coupling-specificity information lies.

G-Protein Family | Function
Gs | Activates adenylyl cyclase
Gi/o | Inhibits adenylyl cyclase
Gq/11 | Activates phospholipase C
G12/13 | Unknown

18 N-gram Based Component
Extract n-grams from all possible reading frames.
Use a set of binary k-NN classifiers, one per G-protein family, to predict whether the receptor couples to that family.
Predict coupling if the k-NN outputs a probability at or above a trained threshold.
[Flowchart: test sequence (e.g. MGNASNDSQSEDCETRQWLPPGESPAI...) → counts of all n-grams → k-NN classifier → Pr(coupling to family C) ≥ threshold? Yes: predict coupling to family C; No: predict no coupling]
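The pipeline above can be sketched end to end. The bigram default, the cosine similarity, and the function names are assumptions for illustration; the slide does not specify the n-gram length or the k-NN distance measure:

```python
import math
from collections import Counter

def ngram_counts(seq, n=2):
    # counts of all overlapping n-grams of the sequence
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def cosine(a, b):
    # cosine similarity between two sparse count vectors (an assumed metric)
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_couples(test_seq, train, k=3, threshold=0.5, n=2):
    """train: list of (sequence, couples_to_family) pairs for ONE family.
    Returns True if the estimated coupling probability clears the threshold."""
    q = ngram_counts(test_seq, n)
    neighbours = sorted(train, key=lambda t: -cosine(q, ngram_counts(t[0], n)))[:k]
    prob = sum(1 for _, y in neighbours if y) / k
    return prob >= threshold
```

One such binary classifier would be trained and thresholded per G-protein family.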

19 Alignment-Based Component
A set of binary k-NN classifiers, one per G-protein family, predicts whether the receptor couples to that family.
Predict coupling if more than x% of the retrieved sequences couple to the family.
Two parameters: the number of neighbours K, and the threshold x%.

20 Our Hybrid Method: Combining Alignment and N-grams
[Flowchart: test sequence → BLAST k-NN with x% = 100%; if Yes, predict coupling to family C; if No, fall back to the n-gram k-NN, whose Yes/No answer decides coupling vs. no coupling]
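The cascade in the flowchart reduces to a few lines. This is a sketch of the decision logic only; the function name and the default threshold are illustrative assumptions:

```python
def hybrid_predict(alignment_votes, ngram_prob, ngram_threshold=0.66):
    """alignment_votes: coupling labels of the BLAST-retrieved neighbours.
    Trust the alignment component only when it is unanimous (x% = 100%);
    otherwise defer to the n-gram k-NN probability."""
    if alignment_votes and all(alignment_votes):
        return True                       # all retrieved sequences couple
    return ngram_prob >= ngram_threshold  # fall back to the n-gram component
```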

21 Evaluation Metrics & Dataset

Predicted \ Truth | Couplings | Non-couplings
Couplings | A | B
Non-couplings | C | D

Dataset of Cao et al. (2003): 81.3% training set, same test set.

22 Results on Cao et al. Dataset

Method | N-gram Threshold | Prec | Recall | F1
Hybrid | 0.66 | 0.698 | 0.952 | 0.805
N-gram | 0.34 | 0.658 | 0.794 | 0.719
Cao et al. | | 0.577 | 0.889 | 0.700

Method | Maximizing | Prec | Recall | F1
Whole Seq Alignment | F1 | 0.779 | 0.841 | 0.809
Hybrid | F1 | 0.775 | 0.873 | 0.821
Whole Seq Alignment | Precision | 0.793 | 0.730 | 0.760
Hybrid | Precision | 0.803 | 0.778 | 0.790

The hybrid method outperformed Cao et al. in precision, recall and F1. Its gains over each component alone suggest that n-grams contain information not found in the alignment, and that the alignment contains information not found in the n-grams.

23 Feature Selection of N-grams
A pre-processing step to remove noisy or redundant features that may confuse the classifier. Many feature selection algorithms are available; chi-square was used because of its success in GPCR subfamily classification.

24 IC Domain Combination Analysis

IC domains | Prec | Rec | F1 | Acc
1 | 0.782 | 0.703 | 0.739 | 0.796
2 | 0.820 | 0.799 | 0.808 | 0.845
3 | 0.661 | 0.721 | 0.682 | 0.730
4 | 0.632 | 0.755 | 0.670 | 0.694
1, 2 | 0.820 | 0.805 | 0.811 | 0.847
1, 3 | 0.799 | 0.765 | 0.780 | 0.825
1, 4 | 0.780 | 0.755 | 0.765 | 0.807
2, 3 | 0.837 | 0.825 | 0.828 | 0.861
2, 4 | 0.828 | 0.816 | 0.821 | 0.853
3, 4 | 0.773 | 0.807 | 0.788 | 0.821
1, 2, 3 | 0.822 | 0.814 | 0.816 | 0.850
1, 2, 4 | 0.807 | 0.809 | 0.807 | 0.843
1, 3, 4 | 0.792 | 0.807 | 0.797 | 0.832
2, 3, 4 | 0.839 | 0.820 | 0.828 | 0.861
1, 2, 3, 4 | 0.824 | 0.813 | 0.817 | 0.853

Of the 4 intracellular domains, the 2nd yielded the best F1, followed by the 1st, 3rd and 4th. Most of the information in IC1 is already found in IC2.

25 Tertiary Protein Fold Prediction
Protein function is strongly modulated by structure.
Predicting folds, domains and other regular structures requires modeling local and long-distance interactions in low-homology sequences.
– Long distance: not addressed by n-grams, HMMs, etc.
– Low homology: not addressed by BLAST-style algorithms.
We focus on minimal mathematical structural modeling:
– Segmented conditional random fields
– Layered graphical models
– Fully trainable to recognize new instances of structures
First acid test: β-helix super-secondary structure prediction (with data and guidance from Prof. J. King at MIT).

26 Protein Structure Determination
Lab experiments involve time, cost and uncertainty:
– X-ray crystallography (months to crystallize, uncertain outcome); Nobel Prize, Kendrew & Perutz, 1962
– NMR spectroscopy (only works for small proteins or domains); Nobel Prize, Kurt Wüthrich, 2002
The gap between sequence and structure necessitates computational methods of protein structure determination:
– 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
[Structures shown: 1MBN, 1BUS]

27 Predicting Protein Structures
Protein structure is a key determinant of protein function.
Resolving protein structures experimentally in vitro by crystallography is very expensive, and NMR can only resolve very small proteins.
The gap between known protein sequences and structures:
– 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
– Therefore we need to predict structures in silico.

28 Predicting Tertiary Folds
Super-secondary structures: common protein domains and scaffolding patterns, such as regular combinations of β-sheets and/or α-helices.
Our task: given a protein sequence, predict super-secondary structures and their components (e.g. β-helices and the location of each rung therein).
Examples: parallel right-handed β-helix; leucine-rich repeats.

29 Parallel Right-handed β-Helix
Structure: a regular super-secondary structure, an elongated helix whose successive rungs are composed of β-strands, with a highly conserved T2 turn.
Computational importance: long-range interactions; repeat patterns.
Biological importance: functions such as bacterial infection of plants, binding the O-antigen, etc.

30 Conditional Random Fields
Hidden Markov models (HMMs) [Rabiner, 1989]
Conditional random fields (CRFs) [Lafferty et al., 2001]
– Model the conditional probability directly (discriminative models, directly optimizable)
– Allow arbitrary dependencies in the observation
– Adapt to different loss functions and regularizers
– Promising results in multiple applications
– But need to scale up (computationally) and extend to long-distance dependencies

31 Our Solution: Conditional Graphical Models
Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i}
Feature definition: node features, local-interaction features, and long-range interaction features (capturing both local and long-range dependency).

32 Linked Segmentation CRF
Nodes: secondary structure elements and/or simple folds (with joint labels).
Edges: local interactions plus long-range inter-chain and intra-chain interactions.
L-SCRF defines the conditional probability of y given x over this segmentation graph.
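A standard form for a segmentation CRF with pairwise segment interactions, given here as a hedged reconstruction rather than the authors' exact definition:

```latex
P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big( \sum_{w_i \in V} \lambda^{\top} f(x, w_i) \;+\; \sum_{(w_i, w_j) \in E} \mu^{\top} g(x, w_i, w_j) \Big)
```

where the segments w_i are the graph's nodes, f are node and local-interaction features, g are long-range (inter- and intra-chain) interaction features, λ and μ are learned weights, and Z(x) normalizes over all candidate segmentations y.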

33 Linked Segmentation CRF (II)
Training (classification): learn the model parameters λ by minimizing the regularized negative log loss, using iterative search algorithms that seek the direction whose empirical feature values agree with their expectations.
Complex graphs result in huge computational complexity.

34 Model Roadmap
Conditional random fields [Lafferty et al., 2001]
Semi-Markov CRFs [Sarawagi & Cohen, 2005]: beyond Markov dependencies
Segmentation CRFs (Liu & Carbonell, 2005): long-range dependencies
Chain graph model (Liu, Xing & Carbonell, 2006): trade-off between local and long-range
Linked segmentation CRFs (Liu & Carbonell, 2007): inter-chain long-range dependencies
All generalized discriminative graphical models.

35 Tertiary Fold Recognition: β-Helix Fold
Histogram and ranks for known β-helices scored against the PDB-minus dataset.
The chain graph model reduces the real running time of the SCRF model by around 50 times.

36 Fold Alignment Prediction: β-Helix. Predicted alignment for known β-helices in cross-family validation.

37 Discovery of New Potential β-Helices
Run the structural predictor seeking potential β-helices in the Uniprot (structurally unresolved) databases. The full list (98 new predictions) can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html
Verification on 3 proteins, from different organisms, whose structures were later experimentally resolved:
– 1YP2: potato tuber ADP-glucose pyrophosphorylase
– 1PXZ: the major allergen from cedar pollen
– GP14 of Shigella bacteriophage as a β-helix protein
Not a single false positive!

38 Predicting Quaternary Folds
Triple β-spirals [van Raaij et al., Nature 1999]: virus fibers in adenovirus, reovirus and PRD1.
Double-barrel trimer [Benson et al., 2004]: coat proteins of adenovirus, PRD1, STIV, PBCV.

39 Features for Protein Fold Recognition

40 Experiment Results: Quaternary Fold Recognition (double-barrel trimers; triple β-spirals)

41 Experiment Results: Alignment Prediction
Triple β-spirals, with four states: B1, B2, T1 and T2.
Correct alignment: B1: i-o; B2: a-h.
[Figure: predicted alignment for B1 and B2]

42 Experiment Results: Discovering New Membership Proteins
Predicted membership proteins of the triple β-spiral fold can be accessed at http://www.cs.cmu.edu/~yanliu/swissprot_list.xls
Membership proteins of the double-barrel trimer suggested by biologists [Benson, 2005] are compared with the L-SCRF predictions.

43 Conclusions & Challenges for Protein Structure/Function Prediction
Methods from modern Machine Learning and Language Technologies really work in Computational Proteomics:
– Family/subfamily/sub-subfamily predictions
– Protein-protein interactions (GPCRs with G-proteins)
– Accurate tertiary & quaternary fold structural predictions
Next generation of model sophistication, addressing new challenges:
– Structure → Function: structural predictions combined with binding-site & specificity analysis
– Predictive inversion: Function → Structure → Sequence for new hyper-specific drug design (anti-viral, oncology)

44 Proteins and Interactions
Every function in the living cell depends on proteins.
Proteins are made of a linear sequence of amino acids folded into a unique 3D structure.
Proteins can bind physically to other proteins, which enables them to carry out diverse cellular functions.

45 Protein-Protein Interaction (PPI) Network
PPIs play key roles in many biological systems. A complete PPI network (naturally a graph) is critical for analyzing protein functions and understanding the cell, and essential for disease studies and drug discovery.

46 PPI Biological Experiments
Small-scale PPI experiments: one or several proteins at a time; small amount of available data; expensive and slow lab process.
Large-scale PPI experiments: hundreds or thousands of proteins at a time; noisy and incomplete data; little overlap among different sets.
⇒ A large portion of the PPI network is still missing or noisy!

47 Learning PPI Networks
Goal I: pairwise PPIs (the links of the PPI graph). Most pairwise protein-protein interactions have not been identified, or are noisy ⇒ missing-link prediction.
Goal II: complexes (important groups). Proteins often interact stably and perform functions together as one unit (a "complex"); most complexes have not been discovered ⇒ important-group detection.

48 Goal I: Missing Link Prediction (pairwise interactions in the PPI network)

49 Related Biological Data
Four categories overall:
– Direct high-throughput experimental data: two-hybrid screens (Y2H) and mass spectrometry (MS)
– Indirect high-throughput data: gene expression, protein-DNA binding, etc.
– Functional annotation data: Gene Ontology annotation, MIPS annotation, etc.
– Sequence-based data sources: domain information, gene fusion, homology-based PPIs, etc.
⇒ Utilize the indirect, implicit evidence together with the available direct experimental results.

50 Related Data Evidence
Relational evidence between proteins: e.g. synthetic lethality.
Attribute evidence of each protein: expression, structure, sequence, annotation, ...
Attribute evidence is expanded into relations between protein pairs.

51 Feature Vector for Protein Pairs
– Data that already describes protein-protein pairs (e.g. synthetic lethal: 1) is used directly.
– Data that describes a single protein (gene), such as its sequence or gene-expression profile, is converted into a biologically meaningful similarity between the two proteins for each evidence source (e.g. sequence similarity, gene-expression correlation coefficient).
Pair A-B: fea1, fea2, fea3, ...
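The two bullet points can be sketched as a small feature builder. The dictionary keys and the choice of Pearson correlation for expression profiles are illustrative assumptions:

```python
import math

def pearson(xs, ys):
    # correlation coefficient of two gene-expression profiles
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def pair_features(prot_a, prot_b, pair_evidence):
    """pair_evidence (e.g. {'synthetic_lethal': 1}) is used directly;
    per-protein data is collapsed into a pairwise similarity feature."""
    feats = dict(pair_evidence)
    feats["geneexp_corr"] = pearson(prot_a["geneexp"], prot_b["geneexp"])
    return feats
```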

52 Problem Setting
For each protein-protein pair, the target function is: interacts or not? Treat it as a binary classification task.
Feature set: features are heterogeneous; most are noisy; most have missing values.
Reference set:
– A small-scale PPI set as positive training data (hundreds to thousands of pairs)
– No negative set (non-interacting pairs) available
– Highly skewed class distribution: many more non-interacting pairs than interacting pairs (estimated 1 in ~600 pairs interacting in yeast; 1 in ~1000 in human)

53 PPI Inference via ML Methods
Jansen, R., et al., Science 2003: Bayes classifier
Lee, I., et al., Science 2004: sum of log-likelihood ratios
Zhang, L., et al., BMC Bioinformatics 2004: decision tree
Bader, J., et al., Nature Biotech 2004: logistic regression
Ben-Hur, A., et al., ISMB 2005: kernel method
Rhodes, D.R., et al., Nature Biotech 2005: Naïve Bayes
Present focus: Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006

54 Predicting Pairwise PPIs (Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006)
Prediction target (three types): physical interaction; co-complex relationship; pathway co-membership inference.
Feature encoding: (1) "detailed" style and (2) "summary" style; feature importance varies.
Classification methods: Random Forest & Support Vector Machine.
Details in the paper.

55 Human Membrane Receptors
[Diagram: extracellular ligands bind membrane receptors of Type I and Type II (GPCR) and other membrane proteins; transmembrane and cytoplasmic regions relay intracellular signal transduction cascades]

56 PPI Predictions for Human Membrane Receptors (Y. Qi, et al., 2008)
A combined approach: binary classification; global graph analysis; biological feedback & validation.

57 Random Forest Classifier (Binary Classification)
– A collection of independent decision trees (an ensemble classifier)
– Each tree is grown on a bootstrap sample of the training set
– Within each tree's training, at each node the split is chosen from a bootstrap sample of the attributes
Robust to noisy features; can handle different types of features.
[Example trees split on features such as Y2H, HMS-PCI, gene expression, GO process, GO localization, gene co-occurrence, domain, SynExpress, protein expression]
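The two kinds of bagging on the slide (rows per tree, attributes per split) can be illustrated with a deliberately tiny variant that uses one-level trees (decision stumps) on binary features; all names and the stump simplification are assumptions, not the paper's implementation:

```python
import random
from collections import Counter

def train_stump(rows, feats):
    """rows: list of (feature_dict, label) with binary features.
    Pick the feature whose majority-label split makes the fewest errors."""
    best, best_err = None, float("inf")
    for f in feats:
        split = {0: Counter(), 1: Counter()}
        for x, y in rows:
            split[x[f]][y] += 1
        err = sum(sum(c.values()) - max(c.values()) for c in split.values() if c)
        if err < best_err:
            best_err = err
            best = (f, {v: c.most_common(1)[0][0] for v, c in split.items() if c})
    return best

def train_forest(rows, n_trees=25, seed=0):
    rng = random.Random(seed)
    all_feats = sorted(rows[0][0])
    forest = []
    for _ in range(n_trees):
        boot = [rng.choice(rows) for _ in rows]          # bootstrap sample of the rows
        feats = rng.sample(all_feats, max(1, (len(all_feats) + 1) // 2))  # and of the attributes
        forest.append(train_stump(boot, feats))
    return forest

def forest_predict(forest, x):
    votes = Counter(pred.get(x[f], 0) for f, pred in forest)  # majority vote
    return votes.most_common(1)[0][0]
```

A real random forest grows full trees and re-samples attributes at every node, but the ensemble-of-bootstraps idea is the same.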

58 Compare Classifiers
Receptor PPI (sub-network) vs. general human PPI prediction.
Classifier comparison: 27 features extracted from 8 different data sources, modified with biological feedback.

59 Global Graph Analysis
Degree distribution / hub analysis / disease checking
Graph module analysis (from a bi-clustering study)
Protein-family-based graph patterns (receptors, receptor subclasses, ligands, etc.)

60 Global Graph Analysis
Network analysis reveals interesting features of the human membrane receptor PPI graph. For instance, of the two receptor types (GPCR and non-GPCR Type I), GPCRs are less densely connected than non-GPCRs. (Green: non-GPCR receptors; blue: GPCRs.)

61 Experimental Validation (Y. Qi, et al., 2008)
Five predictions were chosen for experiments and three were verified:
– EGFR with HCK (pull-down assay)
– EGFR with Dynamin-2 (pull-down assay)
– RHO with CXCL11 (functional assays, fluorescence spectroscopy, docking)
Experiments at the U. Pittsburgh School of Medicine. Details in the paper.

62 Motivation
Current situation of the PPI task:
– Only a small positive (interacting) set available; no negative (non-interacting) set available
– Highly skewed class distribution: many more non-interacting pairs than interacting pairs
– The cost of misclassifying an interacting pair is higher than that of a non-interacting pair
– An accuracy measure is therefore not appropriate here
Handle the task with ranking instead: rank the known positive pairs as high as possible, while also being able to rank the unknown positive pairs high.

63 Split Features into Multiple Views (Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, BMC Bioinformatics 2007)
Four feature groups overall:
– P (Direct): direct high-throughput experimental data: two-hybrid screens (Y2H) and mass spectrometry (MS)
– E (Genomic): indirect high-throughput data: gene expression, protein-DNA binding, etc.
– F (Functional): functional annotation data: Gene Ontology annotation, MIPS annotation, etc.
– S (Sequence): sequence-based data sources: domain information, gene fusion, homology-based PPIs, etc.

64 Mixture of Feature Experts (MFE)
Make protein interaction predictions by weighted voting among the four roughly homogeneous feature categories (P, E, F, S), treating each feature group as a prediction expert. The weights depend on the input example: a hidden variable M modulates the choice of expert.

65 Mixture of Four Feature Experts
Parameters are trained using EM; the experts and the root gate use logistic regression (ridge estimator).
Expert P: direct PPI high-throughput experimental data. Expert E: indirect high-throughput experimental data. Expert F: functional annotation of proteins. Expert S: sequence- or structure-based evidence.

66 Mixture of Four Feature Experts (II)
Handling missing values: add an indicator column for each feature with low coverage; MFE uses this present/absent information when weighting the feature groups.
The posterior weight of expert i for pair n indicates the importance of that feature view (expert) for this specific pair.
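The standard mixture-of-experts responsibility, computed in the E-step, has the following form; this is a hedged reconstruction of the posterior weight the slide refers to, not necessarily the paper's exact notation:

```latex
w_{i,n} \;=\; P(M_n = i \mid x_n, y_n) \;=\; \frac{g_i(x_n)\, P_i(y_n \mid x_n)}{\sum_{j} g_j(x_n)\, P_j(y_n \mid x_n)}
```

where g_i is the (logistic-regression) gate's probability of choosing expert i for input x_n, and P_i is expert i's predicted probability of the label y_n.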

67 Performance
162 features for the yeast physical PPI prediction task, extracted in the "detail" encoding. Under the "detail" encoding, the ranking method performs almost the same as Random Forest (not shown).

68 Functional Expert Dominates
Figure: the frequency at which each of the four experts makes the maximum contribution among validated and predicted pairs.
Of 300 candidate protein pairs, 51 interactions were predicted: 33 already validated and 18 newly predicted.

69 Protein Complex
Proteins form stable associations with multiple binding partners (termed "complexes"). Each complex member interacts with part of the group, and the members work together as one unit. Identifying these important sub-structures is essential to understanding activities in the cell ⇒ group detection within the PPI network.

70 Identifying Complexes in the PPI Graph
Treat the PPI network as a weighted undirected graph, with edge weights derived from supervised PPI predictions.
Previous work: unsupervised graph clustering, all relying on the assumption that complexes correspond to dense regions of the network.
Related facts: many other topological structures are possible; a small number of complexes are available from reliable experiments; complexes also have functional and biological properties (weight, size, ...).

71 Possible Topological Structures (edge weight color-coded)
Make use of the small number of known complexes ⇒ supervised learning.
Model the possible topological structures ⇒ subgraph statistics.
Model the biological properties of complexes ⇒ subgraph features.

72 Properties of a Subgraph
Subgraph properties serve as features in the Bayesian network: various topological properties of the graph plus biological attributes of complexes.
1. Vertex size
2. Graph density
3. Edge weight: average / variance
4. Node degree: average / max
5. Degree correlation: average / max
6. Clustering coefficient: average / max
7. Topological coefficient: average / max
8. First two eigenvalues
9. Fraction of edge weights above a certain cutoff
10. Complex member protein size: average / max
11. Complex member protein weight: average / max

73 Model Complexes Probabilistically
Assume a probabilistic model (Bayesian network) for representing complex subgraphs:
– C: whether this subgraph is a complex (1) or not (0)
– N: number of nodes in the subgraph
– Xi: properties of the subgraph

74 Model Complexes Probabilistically (II)
BN parameters are trained with maximum likelihood estimation from known complexes and randomly sampled non-complexes; continuous features are discretized, and a Bayesian prior smooths the multinomial parameters.
Candidate subgraphs are evaluated with a log-ratio score L.
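With discretized features and smoothed multinomials, the log-ratio score can be sketched as follows; the naive factorization over features and the function names are simplifying assumptions (the slide's BN also conditions on subgraph size N):

```python
import math

def log_ratio_score(feats, complex_tables, background_tables, prior=1.0):
    """L = log P(features | complex) - log P(features | non-complex).
    Each table maps feature name -> {discretized value: count};
    a symmetric Bayesian prior smooths the multinomials."""
    def loglik(tables):
        total = 0.0
        for f, v in feats.items():
            counts = tables[f]
            denom = sum(counts.values()) + prior * len(counts)
            total += math.log((counts.get(v, 0) + prior) / denom)
        return total
    return loglik(complex_tables) - loglik(background_tables)
```

A candidate subgraph with L > 0 looks more like the known complexes than like the random background.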

75 Experimental Setup
Positive training data:
– Set 1: MIPS yeast complex catalog, a curated set of ~100 protein complexes
– Set 2: TAP05 yeast complex catalog, a reliable experimental set of ~130 complexes
– Complex size (number of nodes) follows a power law
Negative training data: generated from randomly selected nodes in the graph, with a size distribution following the same power law as the positive complexes.
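One simple way to realize the negative-sampling step, offered as a sketch (drawing sizes from the empirical positive-size distribution matches any size law, power law included; the function name is an assumption):

```python
import random

def sample_negatives(nodes, positive_sizes, n_samples, seed=0):
    """Random node sets whose sizes are drawn from the empirical
    size distribution of the known (positive) complexes."""
    rng = random.Random(seed)
    nodes = sorted(nodes)                    # deterministic ordering
    negatives = []
    for _ in range(n_samples):
        size = rng.choice(positive_sizes)    # mirrors the positives' sizes
        negatives.append(set(rng.sample(nodes, size)))
    return negatives
```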

76 Evaluation
Train-test style (Set 1 & Set 2); precision / recall / F1 measures.
A cluster "detects" a complex based on their overlap, where A is the number of proteins only in the cluster, B the number only in the complex, and C the number shared; the overlapping threshold p is set to 50%.
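The detection criterion on the slide was partly lost in transcription; a common form using its A/B/C counts, given as an assumption, requires the shared proteins to cover at least fraction p of both the cluster and the complex:

```python
def detects(cluster, complex_, p=0.5):
    """cluster, complex_: sets of proteins. True if the overlap clears
    the threshold p on both sides (assumed form of the slide's rule)."""
    c = len(cluster & complex_)   # proteins shared
    if c == 0:
        return False
    a = len(cluster - complex_)   # proteins only in the cluster
    b = len(complex_ - cluster)   # proteins only in the complex
    return c / (a + c) >= p and c / (b + c) >= p
```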

77 Performance Comparison
On the yeast predicted PPI graph (~2000 nodes), compared to: MCODE, a popular complex-detection package that searches for highly interconnected regions; local search relying on density evidence only; and local search with a complex score from an SVM (also supervised).

Method | Precision | Recall | F1
Density | 0.180 | 0.462 | 0.253
MCODE | 0.219 | 0.075 | 0.111
SVM | 0.211 | 0.377 | 0.269
BN | 0.266 | 0.513 | 0.346

78 Learning PPI Networks: Roadmap
From pairwise interactions to protein complexes, domain/motif interactions, and pathway/function implications; human PPI (revised 08) and HIV-human PPI.
Venues: PSB 05, Proteins 06, BMC Bioinformatics 07, CCR 08, ISMB 08, Genome Biology 08 (in preparation).

79 Inter-Species Interactome
What are the interacting proteins between two organisms?

80 HIV-1 Host Protein Interactions
HIV-1 depends on the cellular machinery in every aspect of its life cycle: fusion, reverse transcription, transcription, maturation, budding. (Peterlin and Trono, Nature Reviews Immunology 2003.)

81 HIV-1 Host Protein Interactions
[Figure: bipartite interaction network between human proteins and HIV proteins]

82 FIN. Questions?

