Protein Interaction Networks Thanks to Mehmet Koyuturk.


1 Protein Interaction Networks Thanks to Mehmet Koyuturk

2 Protein-Protein Interactions  Physical association between proteins  Signal transduction, phosphorylation  Docking, complex formation  Permanent vs. transient interactions  Co-location of proteins  Proteins that work in the same cellular component  Soluble location: lysosome, mitochondrial stroma  Membrane location: receptors in plasma membrane, transporters in mitochondrial membrane  Functional association of proteins  Proteins involved in the same biomolecular activity  Enzymes in the same pathway, co-regulated proteins 7. Protein Interaction Networks 2

3 Permanent vs Transient Interactions  Permanent interactions  Some proteins form a stable protein complex that carries out a structural or functional biomolecular role  These proteins are protein subunits of the complex and they work together  ATPase subunits, subunits of nuclear pore  Transient interactions  Proteins that come together in certain cellular states to undertake a biomolecular function  DNA replicative complex, signal transduction 7. Protein Interaction Networks 3

4 Signal Transduction  Phosphorylation  Protein-kinase interaction  Enzyme activation  Signaling cascade

5 Why Study Protein Interactions?  Identification of functional modules and interconnections between these modules  Functional annotation based on binding partners and interaction patterns  Identification of evolutionarily conserved pathways  Identification of drug target proteins to minimize side effects 7. Protein Interaction Networks 5

6 Identification of Protein Interactions  Traditionally, protein interactions are identified by wetlab experiments based on hypotheses on candidate proteins  Small scale assays  Coimmunoprecipitation: Immunoprecipitate one protein, see if other is also precipitated  Reliable, but can only verify interactions between suspected partners  High throughput screening  Throw in thousands of ORFs and see which ones bind to each other  Yeast two hybrid, tandem affinity purification  Large scale, but a lot of noise 7. Protein Interaction Networks 6

7 Yeast Two Hybrid  Split the yeast GAL4 gene, which encodes a transcription factor required for activation of the GAL genes, into two parts  Activating domain, binding domain  The split protein does not work unless the two parts are in physical contact

8 Protein Interaction Networks  Organize all identified interactions in a network, where proteins are represented by nodes and interactions are represented by edges  TAP identifies a group of proteins that are captured by a target (bait) protein  Spoke model (star network) vs. matrix model (clique)

9 Functional Modularity in PPI Networks  A protein complex  Dense subgraph  A signal transduction pathway  Simple path, parallel paths  A protein with a common, key, fundamental role (e.g., a kinase)  Hub node

10 Computational Prediction of PPIs  Functional association is a higher level conceptualization of interaction  Proteins that act as enzymes catalyzing reactions in the same metabolic pathway  Functionally associated proteins are likely to show up in similar contexts  Co-regulation, co-expression, co-evolution, co-citation…  Functional association between proteins can be computationally identified by looking at different sources of data such as sequences, gene expression, literature  Can also be extended to capture physical associations, for example, by taking into account evolution at structural level 7. Protein Interaction Networks 10

11 Conservation of Gene Neighborhood  In bacteria, the genome of an organism is organized in such a way that functionally related proteins are coded by neighboring regions  Operons  When more than one bacterial species is considered, it is observed that this neighborhood relationship becomes even more relevant  Figure: Distribution of neighboring genes in H. influenzae and E. coli into functional classes

12 Comparison of Nine Bacterial Genomes  trpB-trpA is the only gene pair whose proximity is conserved across nine prokaryotic genomes  These genes encode the two subunits of tryptophan synthase that interact and catalyze a single reaction 7. Protein Interaction Networks 12

13 Close Orthologs  Run of genes  A set of genes on one strand, such that the gap between adjacent genes is less than a threshold (in practice, 300 bp)  Any pair of genes on the same run are said to be close  Bidirectional best hits  Genes X1 and X2 from genomes G1 and G2 are BBH if their sequence similarity is significant and there is no Y1 (Y2) in G1 (G2) that is more similar to X2 (X1) than X1 (X2) is  Pair of close bidirectional best hits: Xa, Ya close in G1; Xb, Yb close in G2; Xa & Xb BBH; Ya & Yb BBH
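The two definitions on this slide can be sketched in a few lines. This is a minimal illustration, not the original authors' implementation: the gene tuple layout `(name, strand, start, end)`, the score table keyed by `(gene_in_G1, gene_in_G2)`, and the function names are all assumptions made for the example.

```python
def runs_of_genes(genes, max_gap=300):
    """Group genes on the same strand into runs.

    genes: list of (name, strand, start, end) tuples (hypothetical layout).
    Adjacent genes stay in one run while the gap between them is below max_gap.
    """
    runs = []
    for strand in ("+", "-"):
        on_strand = sorted((g for g in genes if g[1] == strand), key=lambda g: g[2])
        current, prev_end = [], None
        for name, _, start, end in on_strand:
            if current and start - prev_end >= max_gap:
                runs.append(current)
                current = []
            current.append(name)
            prev_end = end
        if current:
            runs.append(current)
    return runs


def best_hit(scores, query_index):
    """Best-scoring partner for each gene on one side of the score table."""
    best = {}
    for pair, score in scores.items():
        query, target = pair[query_index], pair[1 - query_index]
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {q: t for q, (t, _) in best.items()}


def bidirectional_best_hits(scores, min_score):
    """scores: {(gene_in_G1, gene_in_G2): similarity}; only significant hits count."""
    significant = {pair: s for pair, s in scores.items() if s >= min_score}
    forward = best_hit(significant, 0)   # G1 gene -> its best G2 gene
    backward = best_hit(significant, 1)  # G2 gene -> its best G1 gene
    return {(x1, x2) for x1, x2 in forward.items() if backward.get(x2) == x1}


# Toy data (made up for illustration).
genes = [("a", "+", 0, 900), ("b", "+", 1000, 1900),
         ("c", "+", 2500, 3000), ("d", "-", 100, 500)]
runs = runs_of_genes(genes)

scores = {("x1", "x2"): 0.9, ("x1", "y2"): 0.4,
          ("y1", "y2"): 0.8, ("y1", "x2"): 0.5}
bbh = bidirectional_best_hits(scores, min_score=0.3)
print(runs, sorted(bbh))
```

A pair of close bidirectional best hits is then any two BBH pairs whose G1 members share a run in G1 and whose G2 members share a run in G2.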

14 Predicting Interactions  For each pair of close orthologs (occurring in at least one pair of genomes), calculate a score  The score should increase with the phylogenetic distance between the two genomes, since closely related organisms are more likely to have similar genes nearby due to chance alone  Existence of a triplet (P1, P2, P3) should be stronger evidence than the existence of two pairs (P1, P2 and P1, P3)  Triplet distance can be estimated as the minimum distance between any pair of organisms (in addition to pair score)

15 Reconstructing Pathways  Figure: Purine Metabolism  Can identify the association between unknown proteins and known pathways!

16 Projection of Gene Neighborhood  The composition of operons is evolutionarily variable  A particular set of functionally related genes does not always comprise an operon  The application of gene neighborhood based interaction prediction is limited for a single organism  With multiple organisms, it is possible to statistically strengthen conclusions and project findings on other organisms  If an operon with functionally related genes exists in several genomes, a functional association can be predicted for other organisms, even if the corresponding genes are scattered  Variability turns out to be an advantage for prediction

17 Gene Neighborhood - Limitations  It is only directly applicable to bacteria (and archaea), because relevance of gene order does not necessarily extend to eukaryotes  For closely related species, conserved gene order might just be due to lack of time for genome rearrangements  We are interested in selective constraints that preserve gene order  Compared species should be distant enough  But not too distant, because we need sufficient number of orthologs to be able to derive statistically meaningful results 7. Protein Interaction Networks 17

18 Gene Fusion  Domain fusion events  Two protein domains that act as independent proteins (components) in one organism may form (part of) a single polypeptide chain (composite) in another organism  Most proteins that are involved in domain fusion events are known to be subunits of multiprotein complexes (76% in E. coli metabolic network)

19 Gene Fusion Based PPI Prediction  A pair of proteins in the query genome is a candidate interacting pair if  They show (local) sequence similarity to the same protein (Rosetta stone) in a reference genome  They do not show sequence similarity with each other  Complete genomes!
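The two conditions above translate directly into a filter over similarity search results. The sketch below assumes the similarity searches have already been run; the input shapes (`hits` mapping each query protein to the reference proteins it matches, `self_similar` listing query pairs that resemble each other) and all protein names are hypothetical.

```python
from itertools import combinations


def rosetta_candidates(hits, self_similar):
    """Return {(a, b): set of reference proteins} for candidate interacting pairs.

    hits: {query_protein: set of reference proteins with local similarity}
    self_similar: set of frozensets of query proteins that are similar to each other
    """
    by_reference = {}
    for query, references in hits.items():
        for ref in references:
            by_reference.setdefault(ref, set()).add(query)

    candidates = {}
    for ref, queries in by_reference.items():
        for a, b in combinations(sorted(queries), 2):
            # Condition 2: the two query proteins must not resemble each other.
            if frozenset((a, b)) not in self_similar:
                candidates.setdefault((a, b), set()).add(ref)
    return candidates


# Toy data: P1, P2, P3 all match reference protein R1; P1 and P3 are homologs.
hits = {"P1": {"R1"}, "P2": {"R1"}, "P3": {"R1"}, "P4": {"R2"}}
self_similar = {frozenset(("P1", "P3"))}
candidates = rosetta_candidates(hits, self_similar)
print(candidates)
```

A fuller version would also check that the two query proteins align to different regions of the Rosetta stone protein; that check is omitted here to keep the sketch short.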

20 Predicted Interactions  Figure panels: Known physical interactions; Proteins in the same pathway

21 Gene Fusion Based Prediction - Results  Interactions predicted based on gene fusion events  Distance on circle shows distance on genome 7. Protein Interaction Networks 21

22 Co-evolution of Interacting Proteins  Selective pressure is likely to act on common function  Proteins that are interacting are expected to either be conserved together along with their interactions, or not conserved at all  Hypothesis I: Orthologs of interacting proteins also interact in other species (supported by evidence, but there are subtleties, which we will discuss later)  Hypothesis II: If two proteins are interacting, then they will show similar conservation patterns  Phylogenetic profiles

23 Phylogenetic Profiles 7. Protein Interaction Networks 23

24 Correlation of Phylogenetic Profiles  Assume we have N genomes, protein X has homologs in x of them, Y has y, and they co-occur in z genomes  Hamming distance:  Pearson correlation:  Mutual information:  Statistical significance: 7. Protein Interaction Networks 24
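The formulas on this slide did not survive extraction. The sketch below shows one standard way to compute each quantity from the counts N, x, y, z defined above. These are textbook forms for binary presence/absence vectors and may differ in detail from what the slide showed; in particular, using the hypergeometric tail for significance is an assumption.

```python
from math import comb, log2, sqrt


def profile_counts(profile_x, profile_y):
    """Profiles are equal-length 0/1 lists, one entry per genome."""
    n = len(profile_x)
    x, y = sum(profile_x), sum(profile_y)
    z = sum(a and b for a, b in zip(profile_x, profile_y))
    return n, x, y, z


def hamming(n, x, y, z):
    # Genomes where exactly one of the two proteins is present.
    return x + y - 2 * z


def pearson(n, x, y, z):
    # Correlation of two binary vectors; undefined if x or y is 0 or n.
    return (n * z - x * y) / sqrt(x * (n - x) * y * (n - y))


def mutual_information(n, x, y, z):
    joint = {(1, 1): z, (1, 0): x - z, (0, 1): y - z, (0, 0): n - x - y + z}
    marg_x = {1: x, 0: n - x}
    marg_y = {1: y, 0: n - y}
    return sum((c / n) * log2(c * n / (marg_x[a] * marg_y[b]))
               for (a, b), c in joint.items() if c > 0)


def hypergeometric_pvalue(n, x, y, z):
    # Probability of at least z co-occurrences if the y genomes were drawn at random.
    return sum(comb(x, k) * comb(n - x, y - k)
               for k in range(z, min(x, y) + 1)) / comb(n, y)


counts = profile_counts([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 0])
print(counts, hamming(*counts), pearson(*counts),
      mutual_information(*counts), hypergeometric_pvalue(*counts))
```

A small Hamming distance, a high correlation or mutual information, and a small p-value all point to the same conclusion: the two proteins tend to be present or absent together.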

25 Phylogenetic Profiles - Limitations  Many processes may be common across lineages  Too many false positives  Database of genomes may be biased  All organisms are treated equally  Improvement: Use trees instead of profiles  Proteins are assumed to be conserved as a whole  It is domains that interact  Improvement: Use domain profiles  Figure: Yeast nucleoli and ribosomal proteins; axis label: Organisms

26 Phylogenetic Tree Based Prediction  Phylogenetic trees of Ntr-family two-component sensor histidine kinases and their corresponding regulators 7. Protein Interaction Networks 26

27 Mirror Tree Method  Need to have sufficient number of genomes that contain homologs of both proteins 7. Protein Interaction Networks 27

28 Matrix Method  Start with families of proteins that are suspected to interact  Identify specific pairs of proteins that interact by aligning the phylogenetic trees that underlie the two families  Assumption: Identical number of proteins in each family

29 Correlated Mutations  Co-evolution of interacting proteins can be followed more closely by quantifying the degree of co-variation between pairs of residues from these proteins  Correlated mutations may correspond to compensatory mutations that stabilize the mutations in one protein with changes in the other  Figure: Distribution of distances between amino acid positions on a folded protein

30 In silico Two-Hybrid  The correlation of mutations between two positions (may be on different proteins) can be estimated from pairwise assessment of aligned multiple sequences  Position pairs with high correlation are potential contact points  Interaction index  For a protein pair, compute the aggregate correlation (of mutations) across all positions 7. Protein Interaction Networks 30

31 In silico Two-Hybrid 7. Protein Interaction Networks 31

32 Performance of I2H  I2H predicts physical, rather than functional association  It requires complete genomes & sufficient number of homologs 7. Protein Interaction Networks 32

33 Co-citation Based PPI Prediction  Functionally associated proteins are likely to be cited in the same research article  We can assess the statistical significance of co-citation based on hypergeometric model  Algorithmic problem: How to recognize & match protein names?  Train algorithm using annotated abstracts via conditional random fields (CRF) 7. Protein Interaction Networks 33

34 Performance of Co-citation  The method is robust to choice of parameters for name recognition  Statistical significance is quite relevant until it saturates 34

35 Integrating PPI Networks  Interaction data coming from multiple sources  Different sources refer to different levels of interaction  Can integration handle noise, making interaction data more reliable?  Superpose interactions based on their reliability 7. Protein Interaction Networks 35

36 Bayesian Integration  For each prediction method, compute log-likelihood score  Let P(L|E) be the number of interactions predicted by method E, such that functional association between corresponding proteins is known  Let ~P(L|E) be the number of false positives  Let P(L) and ~P(L) be the corresponding priors  Assign weights to methods based on their log-likelihood scores 7. Protein Interaction Networks 36
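The score itself is not written out on the slide. One common form, assumed here, compares the odds of a true association among a method's predictions with the prior odds: LLS = ln[(P(L|E) / ~P(L|E)) / (P(L) / ~P(L))]. The sketch below uses that form; combining methods by summing their scores is a naive-Bayes-style assumption (independent evidence), not something the slide specifies, and the counts are invented.

```python
from math import log


def log_likelihood_score(true_pos, false_pos, prior_pos, prior_neg):
    """ln of (odds among the method's predictions) / (prior odds)."""
    return log((true_pos / false_pos) / (prior_pos / prior_neg))


def integrated_score(pair, predictions, scores):
    """Sum the scores of every method that predicts this pair (assumes independence)."""
    return sum(scores[m] for m, pairs in predictions.items() if pair in pairs)


# Invented benchmark counts: (true positives, false positives) per method.
benchmark = {"gene_fusion": (80, 20), "co_citation": (30, 30)}
prior_pos, prior_neg = 1000, 9000  # known associated vs. unassociated pairs

scores = {m: log_likelihood_score(tp, fp, prior_pos, prior_neg)
          for m, (tp, fp) in benchmark.items()}
predictions = {"gene_fusion": {("A", "B")}, "co_citation": {("A", "B"), ("A", "C")}}
print(scores, integrated_score(("A", "B"), predictions, scores))
```

A score of zero means the method does no better than chance; higher scores give a method more weight in the integrated network.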

37 Comparison of Prediction Methods  Integrated network captures functional association better  Note that the integrated network is “trained” using available data on functional association 7. Protein Interaction Networks 37

38 Classification Based Integration  Points: Proteins, Space: Expression, Conservation, Labels: Function  Points: Protein Pairs, Space: Co-expression, Co-evolution, etc., Labels: Existence of Interaction 7. Protein Interaction Networks 38

39 Performance of Domain Co-evolution 7. Protein Interaction Networks 39

40 Co-Evolutionary Matrix 7. Protein Interaction Networks 40

41 Domain Identification 7. Protein Interaction Networks 41

42 Difference between Predicted PPIs 7. Protein Interaction Networks 42

43 Pattern Discovery in Signaling Networks

44 Reconstruction of Cellular Signaling  Network reconstruction includes  chemically accurate representation of all biochemical events occurring within a defined signaling network and incorporates  interconnectivity  functional relationships that are inferred from experimental data  Cellular signaling networks operate across several orders of magnitude in spatio-temporal scales  Quick responses (<10^-1 s), e.g., protein modifications  Slow responses (minutes to hours), e.g., transcriptional regulation

45 Cellular Signaling  Who are the actors?  Receptors reside inside or on the surface of the cell and bind to specific chemicals with high specificity and affinity.  Protein kinases catalyze reactions involving the transfer of phosphate, from high-energy donor molecules, such as ATP, which results in activation of proteins  Protein phosphatases dephosphorylate active proteins  Transcription factors 45

46 Combinatorics of Cellular Signaling  What is the scope of these actors?  In how many different ways can a signal be transmitted?  In how many different states can a cell be?  Number of receptors, kinases, phosphatases, transcription factors, and the number of possible interactions between these  Alternative splicing  In eukaryotes, introns are spliced out before translation  Different combinations of introns can be spliced out, resulting in different products of the same gene  One more level of combinatorial complexity  If a gene has k exons, then splicing of alternative exons can generate up to 2^k isoforms

47 Scope of Human Signaling Network 47

48 Combinatorial Effects  Genes that code for signaling proteins compose 75% of all alternatively spliced genes  This implies that cells use alternative splicing extensively to achieve the extraordinary specificity that is required in signaling systems  After post-transcriptional modification, number of mRNA transcripts  3858 for receptors, 1295 for kinases, 375 for phosphatases  After post-translational modification (phosphorylation, acetylation, methylation), number of distinct protein states  30864 for receptors, 10360 for kinases, 3000 for phosphatases  20-fold increase in number of protein states over genes 48

49 Links and Connectivity  Interactions allow for an even greater degree of combinatorial control  Homo- and heterodimerization of 224 proteins can provide sufficient specificity to control the expression of 25000 genes in the human genome (n(n-1)/2)  If receptors assume only ligand-bound and unbound states, then k receptors can recognize 2^k different ligand combinations  If 1% of the estimated 1543 receptors in the human genome can be independently expressed, then the cell could potentially respond to 32768 different ligand combinations
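The figures on this slide can be checked with a few lines of arithmetic. The variable names are illustrative only. Note that n(n-1)/2 counts heterodimer pairs; adding the n possible homodimers gives a slightly larger number, and both are close to 25000.

```python
from math import comb

n_proteins = 224
heterodimers = comb(n_proteins, 2)           # n(n-1)/2 unordered pairs -> 24976
with_homodimers = heterodimers + n_proteins  # plus n self-pairs -> 25200

expressed_receptors = round(0.01 * 1543)     # 1% of 1543 is about 15
ligand_combinations = 2 ** expressed_receptors  # bound/unbound per receptor -> 32768

print(heterodimers, with_homodimers, expressed_receptors, ligand_combinations)
```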

50 Signal Reception  Based on the average surface area of a cell and average area of a receptor, it is estimated that there can be as many as a few million receptors on the surface of the typical somatic cell at a given time  ~30000 distinct receptor types  ~130 receptors of each receptor type  A few receptors (~10-40) in high numbers (~10^5 per cell) for highly differentiated and specialized cells  Many receptors (~2000-3000) in small numbers (~10^2 per cell) for stem cells or undifferentiated cells

51 Reconstructing Signaling Networks 51

52 Focusing on Parts of the Network  Nodes  Who does a single protein interact with? In what contexts?  Modules  Group of related interactions, e.g., a protein complex  Pathways  Chain of interactions that connect a signaling input to output 52

53 Protein Complexes in PPI Networks  Spoke vs matrix model  Recall that PCP methods like TAP identify a group of proteins that bind to each other using a single protein as bait  How to encode this into a network of pairwise interactions?  Figure panels: Actual Complex, Spoke Model, Matrix Model
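The two encodings differ only in which pairs become edges. A minimal sketch, with a made-up purification in which bait "A" pulls down "B", "C" and "D" (function names are illustrative):

```python
from itertools import combinations


def spoke_edges(bait, preys):
    """Spoke model: the bait is linked to every prey (star network)."""
    return {frozenset((bait, p)) for p in preys if p != bait}


def matrix_edges(bait, preys):
    """Matrix model: every pair in the purification is linked (clique)."""
    members = sorted(set(preys) | {bait})
    return {frozenset(pair) for pair in combinations(members, 2)}


spoke = spoke_edges("A", ["B", "C", "D"])
matrix = matrix_edges("A", ["B", "C", "D"])
print(len(spoke), len(matrix))  # 3 edges vs. 6 edges
```

The spoke model tends to miss prey-prey interactions, while the matrix model tends to add false ones, since not every pair in a complex is in direct contact.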

54 Protein Complexes in Matrix Model 54

55 Modules and Quotients  Define a module as a group of proteins such that the interactions of the proteins with those outside the module are identical  Quotient: Replace proteins in a module with a single node  The edges of the representative node will represent the interactions of all proteins in the module 55

56 Types of Modules  Parallel module  No interaction between proteins in the module  These are likely to correspond to proteins that are functionally related, but do not interact with each other  Series module  Proteins in the module form a clique among themselves  All proteins in the module perform some function together (single complex or multiple related complexes)  Prime module  All other topologies  This is probably what you will observe most of the time 56
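The module definition and the three module types can be checked mechanically on a small network. The sketch below follows the slide's wording literally (identical interactions with proteins outside the group; no internal edges, a clique, or anything else). The adjacency-set representation and the toy network are assumptions for illustration.

```python
from itertools import combinations


def is_module(adj, group):
    """True if every protein in the group has the same neighbors outside the group."""
    group = set(group)
    outside = [adj[v] - group for v in group]
    return all(o == outside[0] for o in outside)


def module_type(adj, group):
    """Classify a module by the interactions among its own members."""
    pairs = list(combinations(sorted(group), 2))
    linked = sum(1 for u, v in pairs if v in adj[u])
    if linked == 0:
        return "parallel"
    if linked == len(pairs):
        return "series"
    return "prime"


# Toy network: A and B share partners X, Y; C and D interact and share partner X.
adj = {
    "A": {"X", "Y"}, "B": {"X", "Y"},
    "C": {"D", "X"}, "D": {"C", "X"},
    "X": {"A", "B", "C", "D"}, "Y": {"A", "B"},
}
print(is_module(adj, {"A", "B"}), module_type(adj, {"A", "B"}))
print(is_module(adj, {"C", "D"}), module_type(adj, {"C", "D"}))
```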

57 Hierarchical Decomposition  Recursively identify and contract modules  This results in a tree representation of the network  Each node is a quotient graph  Leaves are proteins  Root is entire network 57

58 Decomposition of Yeast PPI Network 58

59 Identification of Modules  Graph clustering  Find groups of nodes with high interconnectivity (and relatively low connectivity with outside)  Issues  Definition of clustering metrics  Density  Has to be normalized by subgraph size  Distance-based metrics  A module has low diameter  Normalizing intra-cluster connectivity with outer connectivity 59

60 Algorithms  The problems are generalizations of maximum clique  Maximum clique itself is NP-hard (enumeration of cliques in early PPI networks was possible, though, and these were used as seed subgraphs for dense clusters)  Heuristic approaches  Graph clustering is very well studied  Recall that, while clustering vectors in metric spaces (e.g., gene expression data), it is common to generate similarity graphs  Bottom-up heuristics  Start with a single node, grow subgraph until “density” is lost  Top-down heuristics  Recursively partition the entire network until subgraph is dense enough 60

61 MCODE Algorithm  Three stages  Vertex weighting  Complex prediction  Post-processing for finding overlapping clusters  Vertex weighting  How “clustered” is a node’s neighborhood?  Use core clustering coefficient instead of clustering coefficient  N: subgraph induced by neighbors of v  K: k-core subgraph of N that maximizes k  d: density of K  weight(v) = k × d

62 MCODE Algorithm (cont’d)  Complex prediction  Seed a complex with the node with highest weight  Each time a node is added, check its neighbors; add any neighbor whose weight is above a given threshold relative to that of the seed vertex  Repeat until no node can be added  Once a complex is identified, remove those nodes and find other complexes  Post-processing  Filter out complexes that do not contain at least a 2-core  Add nodes to allow overlaps to a given threshold  Complex score: density × size
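The vertex-weighting stage can be sketched as follows. This follows the slide's description (N is induced by the neighbors of v, K is the highest k-core of N, weight is k times the density of K) and is not the published MCODE code. Whether v itself belongs to N and how density treats self-loops are choices made here for simplicity.

```python
def k_core(adj, k):
    """Nodes that survive repeatedly removing nodes with fewer than k neighbors."""
    nodes = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(nodes):
            if len(adj[v] & nodes) < k:
                nodes.discard(v)
                changed = True
    return nodes


def highest_k_core(adj):
    """Return (k, nodes) for the largest k with a non-empty k-core."""
    k, core = 0, set(adj)
    while True:
        nxt = k_core(adj, k + 1)
        if not nxt:
            return k, core
        k, core = k + 1, nxt


def density(adj, nodes):
    """Fraction of possible edges present among the given nodes."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(len(adj[v] & nodes) for v in nodes) / 2
    return edges / (n * (n - 1) / 2)


def mcode_weight(adj, v):
    neighbors = adj[v]
    induced = {u: adj[u] & neighbors for u in neighbors}  # subgraph N
    k, core = highest_k_core(induced)                     # subgraph K
    return k * density(induced, core)


# Toy network: a 4-clique A, B, C, D with a pendant node E attached to A.
adj = {
    "A": {"B", "C", "D", "E"}, "B": {"A", "C", "D"},
    "C": {"A", "B", "D"}, "D": {"A", "B", "C"}, "E": {"A"},
}
weights = {v: mcode_weight(adj, v) for v in adj}
print(weights)
```

In the toy network the clique members all get the same weight, because the pendant node E is pruned when the k-core of A's neighborhood is taken; this is the reason for using the core clustering coefficient instead of the plain one.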

63 Scoring Subgraphs  Observe the trade-off between size and density  A single interaction has density one  What is a good cut-off for density?  Statistical significance  What is the expected size of the largest dense subgraph?  Implicitly trades off density and size  If we can analytically characterize the distribution of the largest dense subgraph, then we can use statistical significance as a score function (stopping criterion)  This also implicitly handles correction for multiple hypothesis testing 63

64 G(n,p) Model  Let random variable $R_\rho$ be the size of the largest subgraph with density $\rho$  The typical value of $R_\rho$ is given by $r_0 = \frac{\log n - \log\log n + \log H_p(\rho)}{H_p(\rho)}$, where $H_p(\rho) = \rho \log\frac{\rho}{p} + (1-\rho)\log\frac{1-\rho}{1-p}$ denotes divergence  The p-value of a larger dense subgraph is given by $P(R_\rho \ge r_0) \le O\left(\frac{\log n}{n^{1/H_p(\rho)}}\right)$
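To get a feel for the typical size r0, the two expressions on this slide can be evaluated numerically. This is a minimal sketch: natural logarithms are assumed (the slide does not state the base), the function names are illustrative, and the parameter values are invented.

```python
from math import log


def divergence(rho, p):
    """H_p(rho) = rho*log(rho/p) + (1-rho)*log((1-rho)/(1-p)), for 0 < p < 1."""
    first = rho * log(rho / p) if rho > 0 else 0.0
    second = (1 - rho) * log((1 - rho) / (1 - p)) if rho < 1 else 0.0
    return first + second


def typical_size(n, p, rho):
    """r0 = (log n - log log n + log H_p(rho)) / H_p(rho), meaningful for rho > p."""
    h = divergence(rho, p)
    return (log(n) - log(log(n)) + log(h)) / h


n, p = 5000, 0.001  # invented network size and edge probability
for rho in (0.5, 0.7, 0.9):
    print(rho, round(divergence(rho, p), 3), round(typical_size(n, p, rho), 3))
```

Denser subgraphs have larger divergence from the background edge probability, so the size expected by chance shrinks as the density threshold rises.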

65 Piecewise G(n,p) Model  Two protein groups: hubs ($V_h$) and regulars ($V_l$)  There is an edge between u and v with probability  $p_h$ if $u, v \in V_h$  $p_b$ if $u \in V_h$, $v \in V_l$, or vice versa  $p_l$ if $u, v \in V_l$  $p_h > p_b > p_l$, $|V_h| < |V_l|$  If $|V_h| \ll |V_l|$, it contributes an additive factor: $r_1 = \frac{\log n + 2|V_h| \log B - \log\log n + \log H_p(\rho)}{H_p(\rho)}$, where $B = p_b(1-p)/p + 1 - p_b$ and $p = p_l$

66 SIDES Algorithm  Recursive minimum-cut partitioning  Partition nodes into two parts such that the number of edges in between is minimized, then recurse  Figure annotation: p << 1

67 MCODE vs SIDES  Figure axes: -log(p-value), Specificity (%), Sensitivity (%), Cluster Size  Correlation: SIDES 0.76, MCODE 0.43

68 MCODE vs SIDES  Figure axes: Module Size, Specificity (%), Sensitivity (%)  Correlation: SIDES 0.22, MCODE -0.02  Correlation: SIDES 0.27, MCODE 0.36

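69 Fiedler Vector  For network G, the Laplacian L is defined as follows: $L_{ii} = \sum_{j \ne i} w(u_i, u_j)$ and $L_{ij} = -w(u_i, u_j)$ for $i \ne j$. Here, $w(u_i, u_j)$ denotes the weight of edge $u_i u_j$ (zero if there is no edge).  It can be shown that  Matrix L is positive semi-definite, with exactly one zero eigenvalue for each connected component  The eigenvector x that corresponds to the smallest non-zero eigenvalue minimizes $x^T L x = \sum_{i<j} w(u_i, u_j)(x_i - x_j)^2$ over unit vectors orthogonal to the all-ones vector (for connected G). This vector is known as the Fiedler vector of network G.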

70 Spectral Graph Clustering  Fiedler vector provides the optimal mapping of the nodes of the network onto one-dimensional Euclidean space, in the mean squares sense  This also generalizes to optimal k-dimensional mapping  Once a one-dimensional mapping is obtained, clustering algorithms can be used on this one-dimensional space  Find cut points in one-dimensional space  Top-down: Partition one-dimensional space by finding two cut points and recurse on each part  Bottom-up: Merge two closest nodes, recurse
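A small worked example may help: build the unnormalized Laplacian (diagonal = weighted degree, off-diagonal = negative edge weight), approximate the Fiedler vector, and cut the one-dimensional embedding at zero. The sketch uses shifted power iteration in pure Python so that it needs no libraries; that numerical method and the toy graph are choices made here, not something the slides prescribe. It assumes a connected graph with at least two nodes.

```python
from math import sqrt


def laplacian(n, weighted_edges):
    """Unnormalized Laplacian of an undirected graph on nodes 0..n-1."""
    lap = [[0.0] * n for _ in range(n)]
    for u, v, w in weighted_edges:
        lap[u][u] += w
        lap[v][v] += w
        lap[u][v] -= w
        lap[v][u] -= w
    return lap


def fiedler_vector(lap, iterations=500):
    """Approximate the eigenvector of the smallest non-zero eigenvalue.

    Power iteration on (shift*I - L), restricted to vectors orthogonal to the
    all-ones vector, converges to the Fiedler vector of a connected graph.
    """
    n = len(lap)
    shift = 2 * max(lap[i][i] for i in range(n))  # upper bound on the largest eigenvalue
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the all-ones component
        y = [shift * x[i] - sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return x


# Toy network: two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 1.0)]
embedding = fiedler_vector(laplacian(6, edges))
side = [value > 0 for value in embedding]
print(side)
```

Cutting at zero separates the two triangles. A top-down scheme would repeat the same step on each side, and other cut points (for example, the largest gap in the sorted embedding) are equally valid choices.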

71 Identification of Signaling Pathways  We would like to identify simple paths (chains of interactions) in the PPI networks, which might correspond to, for example, signaling cascades highlighting the group of proteins and interactions that are responsible for the transduction of a specific signal  What can we do based solely on interaction data?  In the PPI network, there may be plenty of paths connecting each pair of nodes  Which ones are interesting?  How long can a pathway be?  How about identifying “most reliable” paths?

72 Formulating Pathway Identification  Assume that the edges are scored, such that p(u,v) denotes the likelihood that proteins u and v interact  Then the multiplication of edge scores along the path quantifies the likelihood that the path exists  Let w(u,v) = -log p(u,v) denote the weight of edge  Then, if we define the weight of a path as the summation of the weights of the edges on the path, paths with less weight will be more reliable paths  For a given set I of proteins, find all minimum-weight paths of length k from I to each protein in the network  I might be the set of receptor proteins 72

73 Enumerating Pathways  Dynamic programming  For $v \in S \subseteq V$, let W(v, S) be the minimum weight of a simple path that starts from a protein in I, visits all proteins in S, and ends in v  This function can be tabulated using the following recursion: $W(v, S) = \min_{u \in S \setminus \{v\}} \left[ W(u, S \setminus \{v\}) + w(u, v) \right]$ for $|S| > 1$ (with $w(u, v) = \infty$ if u and v do not interact), where $W(v, \{v\}) = 0$ if $v \in I$, and $\infty$ otherwise  For given v, the minimum path from I to v is given by the minimum W(v, S) over all S of size k that contain v  The running time of this algorithm is $O(k n^k)$  Not feasible for k larger than a few

74 Color Coding  Color each protein randomly using a set of k colors  Search for paths that contain one protein from each color => No vertex will be repeated on the path  The dynamic programming algorithm can be modified to solve this problem  The running time of this algorithm is $O(2^k k m)$  However, this algorithm misses an optimal path if two proteins on the path happen to be colored identically  For each path, the algorithm succeeds with probability $k!/k^k > e^{-k}$  Repeat $e^k \ln(n/\epsilon)$ times to make sure that the probability that the algorithm will fail for at least one protein is less than $\epsilon$
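The color-coding idea can be sketched compactly with bitmasks standing in for color sets. This is an illustrative implementation under stated assumptions, not the original authors' code: edge weights are w = -log p, a path "of length k" is taken to mean k proteins, the network and its reliabilities are invented, and the number of trials is simply chosen large enough for the toy example.

```python
import random
from math import exp, inf, log


def colorful_paths(adj, sources, colors, k):
    """Minimum weight of a path with k distinctly colored proteins, per end protein."""
    layer = {(v, 1 << colors[v]): 0.0 for v in sources}  # (end node, colors used)
    for _ in range(k - 1):
        nxt = {}
        for (u, mask), weight in layer.items():
            for v, w in adj.get(u, ()):
                bit = 1 << colors[v]
                if mask & bit:  # color already on the path, so skip
                    continue
                key = (v, mask | bit)
                if weight + w < nxt.get(key, inf):
                    nxt[key] = weight + w
        layer = nxt
    best = {}
    for (v, _), weight in layer.items():
        best[v] = min(best.get(v, inf), weight)
    return best


def most_reliable_paths(edges, sources, k, trials, seed=0):
    """Repeat random colorings and keep the best weight found for each end protein."""
    rng = random.Random(seed)
    adj = {}
    for (u, v), p in edges.items():
        w = -log(p)  # product of reliabilities becomes a sum of weights
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    nodes = sorted(set(adj) | set(sources))
    best = {}
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in nodes}
        for v, weight in colorful_paths(adj, sources, colors, k).items():
            best[v] = min(best.get(v, inf), weight)
    return best


# Invented network: receptor R, target T, edge values are interaction reliabilities.
edges = {("R", "A"): 0.9, ("A", "B"): 0.8, ("B", "T"): 0.9,
         ("R", "C"): 0.5, ("C", "T"): 0.5}
best = most_reliable_paths(edges, {"R"}, k=4, trials=200)
print({v: round(exp(-w), 3) for v, w in best.items()})
```

Each trial finds a given k-protein path only if its proteins happen to receive k distinct colors, which has probability k!/k^k, so the number of trials must grow roughly like e^k to keep the failure probability small.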

75 Hunting Biologically Meaningful Paths  Constraining the set of proteins  If a protein is required to be in the path, assign a unique color to the target protein  If a family is required, assign a color to the family  Constraining order of occurrence  Signal transduction often progresses in inward order, from membrane proteins to nuclear proteins and transcription factors  Segmented pathways: Assign labels to proteins, where labels represent cellular component, and require paths to be monotonic with respect to labels  Labels can also be generalized to intervals (consistent segments)

