9. Lecture WS 2008/09, Bioinformatics III. V9: Reliability of Protein Interaction Networks. Jansen et al. Science 302, 449 (2003)

Presentation transcript:

Slide 1 (9. Lecture, WS 2008/09, Bioinformatics III): V9: Reliability of Protein Interaction Networks. Jansen et al. Science 302, 449 (2003). One would like to integrate evidence from many different sources to better separate true from false protein-protein interaction predictions. → Use a Bayesian approach for integrating interaction information that allows for the probabilistic combination of multiple data sets; apply it to yeast. Input: the approach can be used for combining noisy genomic interaction data sets. Normalization: each source of evidence for interactions is compared against samples of known positives and negatives ("gold standard"). Output: predict, for every possible protein pair, the likelihood of interaction. Verification: test on experimental interaction data not included in the gold standard, plus new TAP (tandem affinity purification) experiments.

Slide 2: Integration of various information sources. Three different types of data are used: (i) Interaction data from high-throughput experiments. These comprise large-scale two-hybrid screens (Y2H) and in vivo pull-down experiments. (ii) Other genomic features: expression data, biological function of proteins (from Gene Ontology biological process and the MIPS functional catalog), and data about whether proteins are essential. (iii) Gold standards of known interactions and non-interacting protein pairs.

Slide 3: Combination of data sets into probabilistic interactomes. The 4 interaction data sets from high-throughput (HT) experiments were combined into one probabilistic interactome experimental (PIE). The PIE represents a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighted according to the likelihood that it exists in a complex. Because the 4 experimental interaction data sets contain correlated evidence, a fully connected Bayesian network is used for the PIE. A "naïve" Bayesian network is used to model the data of the probabilistic interactome predicted (PIP). These information sets hardly overlap.

Slide 4: Bayesian Networks. Bayesian networks are probabilistic models that graphically encode probabilistic dependencies between random variables. [Figure: network with a parent node Y and child nodes E1, E2, E3.] A directed arc between variables Y and E1 denotes conditional dependency of E1 on Y, as determined by the direction of the arc. Bayesian networks also include a quantitative measure of dependency. For each variable and its parents, this measure is defined using a conditional probability function or a table. Here, one such measure is the probability Pr(E1|Y).

Slide 5: Bayesian Networks. Together, the graphical structure and the conditional probability functions/tables completely specify a Bayesian network probabilistic model. This model, in turn, specifies a particular factorization of the joint probability distribution function over the variables in the network. Here, $\Pr(Y, E_1, E_2, E_3) = \Pr(E_1|Y)\,\Pr(E_2|Y)\,\Pr(E_3|Y)\,\Pr(Y)$.
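The factorization on this slide can be illustrated with a minimal Python sketch. The conditional probability tables and the prior `P_Y` below are made-up values chosen only for illustration; they do not come from the lecture or from Jansen et al.

```python
from itertools import product
from typing import Dict, Sequence


def naive_bayes_joint(y: bool, evidence: Sequence[bool], p_y: float,
                      p_e_given_y: Sequence[Dict[bool, float]]) -> float:
    """Pr(Y=y, E1..En) = Pr(Y=y) * prod_i Pr(Ei | Y=y) for binary variables."""
    joint = p_y if y else 1.0 - p_y
    for e_i, table in zip(evidence, p_e_given_y):
        p_true = table[y]  # Pr(Ei = True | Y = y)
        joint *= p_true if e_i else 1.0 - p_true
    return joint


# Hypothetical numbers: Pr(Y = True) and Pr(Ei = True | Y) for three evidence nodes.
P_Y = 0.1
CPTS = [
    {True: 0.8, False: 0.2},
    {True: 0.6, False: 0.1},
    {True: 0.7, False: 0.3},
]

if __name__ == "__main__":
    # The joint distribution over all 2 * 2**3 assignments must sum to one.
    total = sum(naive_bayes_joint(y, e, P_Y, CPTS)
                for y in (True, False)
                for e in product((True, False), repeat=3))
    print(total)
```

Because the structure has only arcs from Y to each E, a table of three small conditional probabilities per value of Y is enough to define the whole joint distribution.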

Slide 6: Gold standard. The gold standard should be (i) independent from the data sources serving as evidence, (ii) sufficiently large for reliable statistics, and (iii) free of systematic bias (e.g. towards certain types of interactions). Positives: use the MIPS (Munich Information Center for Protein Sequences, HW Mewes) complexes catalog, a hand-curated list of complexes from the biomedical literature (8250 protein pairs that are within the same complex). Negatives: these are harder to define, but essential for successful training. Assume that proteins in different compartments do not interact, and synthesize "negatives" from lists of proteins in separate subcellular compartments.

Slide 7: Measure of reliability: likelihood ratio. Consider a genomic feature f expressed in binary terms (i.e. "absent" or "present"). The likelihood ratio L(f) is defined as the fraction of gold-standard positives having feature f divided by the fraction of gold-standard negatives having feature f: $L(f) = \Pr(f \mid \text{pos}) / \Pr(f \mid \text{neg})$. L(f) = 1 means that the feature has no predictive power: the same fraction of positives and negatives have feature f. The larger L(f), the better its predictive power.
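As a small worked example, the likelihood ratio of a binary feature can be computed from four counts. This is only a sketch: the function name and the counts in the usage lines are hypothetical, and the formula follows the description of L as a ratio of the two conditional probabilities P(feature value|pos) and P(feature value|neg).

```python
def likelihood_ratio(pos_with_f: int, pos_total: int,
                     neg_with_f: int, neg_total: int) -> float:
    """L(f) = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    if neg_with_f == 0:
        raise ValueError("no gold-standard negative has the feature; L(f) is undefined")
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg


if __name__ == "__main__":
    # Hypothetical counts: half of the positives but only 1% of the negatives show f.
    print(likelihood_ratio(500, 1000, 100, 10000))
    # Equal fractions in both sets give L(f) = 1, i.e. an uninformative feature.
    print(likelihood_ratio(100, 1000, 1000, 10000))
```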

Slide 8: Combination of features. For two features $f_1$ and $f_2$ with uncorrelated evidence, the likelihood ratio of the combined evidence is simply the product: $L(f_1, f_2) = L(f_1) \cdot L(f_2)$. For correlated evidence, $L(f_1, f_2)$ cannot be factorized in this way. Bayesian networks are a formal representation of such relationships between features. The combined likelihood ratio is proportional to the estimated odds that two proteins are in the same complex, given multiple sources of information.

Slide 9: Prior and posterior odds. A "positive" is a pair of proteins that are in the same complex. Given the number of positives among the total number of protein pairs, the "prior" odds of finding a positive are $O_\text{prior} = \Pr(\text{pos}) / \Pr(\text{neg}) = \Pr(\text{pos}) / (1 - \Pr(\text{pos}))$. The "posterior" odds are the odds of finding a positive after considering N datasets with values $f_1 \ldots f_N$: $O_\text{post} = \Pr(\text{pos} \mid f_1 \ldots f_N) / \Pr(\text{neg} \mid f_1 \ldots f_N)$. The terms "prior" and "posterior" refer to the situation before and after knowing the information in the N datasets.

Slide 10: Static naive Bayesian Networks. In the case of protein-protein interaction data, the posterior odds describe the odds of having a protein-protein interaction given the information from the N experiments, whereas the prior odds are related to the chance of randomly finding a protein-protein interaction when no experimental data is known. If $O_\text{post} > 1$, the chances of having an interaction are higher than those of having no interaction.

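Slide 11: Static naive Bayesian Networks. The likelihood ratio L, defined as $L(f_1 \ldots f_N) = \Pr(f_1 \ldots f_N \mid \text{pos}) / \Pr(f_1 \ldots f_N \mid \text{neg})$, relates prior and posterior odds according to Bayes' rule: $O_\text{post} = L(f_1 \ldots f_N) \cdot O_\text{prior}$. In the special case that the N features are conditionally independent (i.e. they provide uncorrelated evidence), the Bayesian network is a so-called "naïve" network, and L can be simplified to $L(f_1 \ldots f_N) = \prod_{i=1}^{N} L(f_i) = \prod_{i=1}^{N} \Pr(f_i \mid \text{pos}) / \Pr(f_i \mid \text{neg})$.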
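The naïve combination can be sketched in a few lines of Python: the posterior odds are the prior odds multiplied by the product of the per-feature likelihood ratios. The function names and the three likelihood ratios in the usage example are illustrative assumptions, not values from the paper.

```python
import math
from typing import Iterable


def posterior_odds(prior_odds: float, likelihood_ratios: Iterable[float]) -> float:
    """O_post = O_prior * prod_i L(f_i), valid for conditionally independent features."""
    return prior_odds * math.prod(likelihood_ratios)


def odds_to_probability(odds: float) -> float:
    """Convert odds o = p / (1 - p) back to a probability p."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical example: three weak features whose ratios multiply to 600.
    o_post = posterior_odds(1.0 / 600.0, [10.0, 20.0, 3.0])
    print(o_post, odds_to_probability(o_post))
```

With prior odds of 1/600, a combined likelihood ratio of 600 brings the posterior odds to about 1, which corresponds to a probability of about 50%.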

Slide 12: Computation of prior and posterior odds. L can be computed from contingency tables relating positive and negative examples with the N features (by binning the feature values $f_1 \ldots f_N$ into discrete intervals). Determining the prior odds $O_\text{prior}$ is somewhat arbitrary. It requires an assumption about the number of positives. Here, 30,000 is taken as a conservative lower bound for the number of positives (i.e. pairs of proteins that are in the same complex). Considering that there are ca. 18 million = 0.5 · N (N − 1) possible protein pairs in total (with N = 6000 for yeast), $O_\text{prior} \approx 30{,}000 / 18{,}000{,}000 = 1/600$. Hence $O_\text{post} > 1$ can be achieved with L > 600.
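The arithmetic behind the cutoff of roughly 600 can be checked with a short sketch. The function names are hypothetical; the inputs (6000 proteins, 30,000 assumed positives) are the figures quoted on this slide. Computing the odds exactly as positives over non-positives gives a threshold slightly below 600, which the slide rounds.

```python
def n_protein_pairs(n_proteins: int) -> int:
    """Number of unordered protein pairs, 0.5 * N * (N - 1)."""
    return n_proteins * (n_proteins - 1) // 2


def prior_odds(n_proteins: int, n_positives: int) -> float:
    """Odds of picking a positive pair at random: positives / non-positives."""
    pairs = n_protein_pairs(n_proteins)
    return n_positives / (pairs - n_positives)


if __name__ == "__main__":
    o_prior = prior_odds(6000, 30_000)
    print(n_protein_pairs(6000))  # 17997000
    # Smallest likelihood ratio that lifts the posterior odds above 1.
    print(1.0 / o_prior)
```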

Slide 13: Essentiality (PIP). Consider whether proteins are essential or non-essential, i.e. does a deletion mutant where this protein is knocked out from the genome have the same phenotype? Deletion mutants of either one of two proteins in a complex should impair the function of the same complex. It should therefore be more likely that both of the 2 proteins in a complex are essential, or both non-essential, rather than a mixture of these two attributes.

Slide 14: Parameters of the naïve Bayesian networks (PIP). Column 1 describes the genomic feature. In the "essentiality data", protein pairs can take on 3 discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not). Column 2 gives the number of protein pairs with a particular feature value (e.g. "EE") drawn from the whole yeast interactome (~18M pairs). Columns "pos" and "neg" give the overlap of these pairs with the 8,250 gold-standard positives and the 2,708,746 gold-standard negatives. Columns "sum(pos)" and "sum(neg)" show how many gold-standard positives (negatives) are among the protein pairs with likelihood ratio ≥ L, computed by summing up the values in the "pos" (or "neg") column. P(feature value|pos) and P(feature value|neg) give the conditional probabilities of the feature values, and L is the ratio of these two conditional probabilities.

Slide 15: mRNA expression data. Proteins in the same complex tend to have correlated expression profiles. Although large differences can exist between mRNA and protein abundance, protein abundance can be indirectly and quite crudely measured by the presence or absence of the corresponding mRNA transcript. Experimental data sources: (1) the time course of expression fluctuations during the yeast cell cycle; (2) the Rosetta compendium: expression profiles of 300 deletion mutants and cells under chemical treatments. Problem: both data sets are strongly correlated. Compute the first principal component of the vector of the 2 correlations and use this as an independent source of evidence for the protein-protein (P-P) interaction prediction. The first principal component is a stronger predictor of P-P interactions than either of the 2 expression correlation datasets by themselves.
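One way to picture this step is a two-dimensional principal component analysis in plain Python. This is a sketch under stated assumptions: each protein pair contributes one point (correlation in data set 1, correlation in data set 2), the data are mean-centered, and the closed-form angle of the major axis of a 2×2 covariance matrix is used. The slides do not say how the component was actually computed, and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def first_principal_component(xs: Sequence[float], ys: Sequence[float]
                              ) -> Tuple[Tuple[float, float], List[float]]:
    """Return the unit direction of the first PC and the score of every point."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Entries of the 2x2 (population) covariance matrix.
    var_x = sum(d * d for d in dx) / n
    var_y = sum(d * d for d in dy) / n
    cov_xy = sum(a * b for a, b in zip(dx, dy)) / n
    # Angle of the major axis, i.e. the eigenvector with the largest eigenvalue.
    theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    direction = (math.cos(theta), math.sin(theta))
    scores = [a * direction[0] + b * direction[1] for a, b in zip(dx, dy)]
    return direction, scores


if __name__ == "__main__":
    # Hypothetical correlations of four protein pairs in the two expression data sets.
    cell_cycle = [0.1, 0.4, 0.6, 0.9]
    rosetta = [0.2, 0.3, 0.7, 0.8]
    direction, scores = first_principal_component(cell_cycle, rosetta)
    print(direction, scores)
```

The scores along the first component form a single combined coexpression feature per protein pair, which can then be binned like any other feature.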

Slide 16: mRNA expression data. The values for mRNA expression correlation (first principal component) range on a continuous scale from -1.0 to +1.0 (fully anticorrelated to fully correlated). This range was binned into 19 intervals.

Slide 17: PIP: functional similarity. Quantify the functional similarity between two proteins as follows. First, consider which set of functional classes the two proteins share, given either the MIPS or the Gene Ontology (GO) classification system. Then count how many of the ~18 million protein pairs in yeast share the exact same functional classes as well (yielding integer counts between 1 and ~18 million). This count was binned into 5 intervals. In general, the smaller this count, the more similar and specific is the functional description of the two proteins.

Slide 18: PIP: functional similarity. Observation: low counts correlate with a higher chance of two proteins being in the same complex. But the signal (L) is quite weak.

Slide 19: Calculation of the fully connected Bayesian network (PIE). The 4 binary experimental interaction datasets can be combined in at most $2^4 = 16$ different ways (subsets). For each of these 16 subsets, one can compute a likelihood ratio from the overlap with the gold-standard positives ("pos") and negatives ("neg").
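A minimal sketch of this calculation: every protein pair is described by a tuple of four 0/1 flags (one per data set), and a likelihood ratio is estimated for each observed combination directly from the gold-standard counts, without assuming independence. The function name and the tiny example data are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Pattern = Tuple[int, ...]


def joint_likelihood_ratios(pos_patterns: Sequence[Pattern],
                            neg_patterns: Sequence[Pattern]) -> Dict[Pattern, float]:
    """L(pattern) = P(pattern | pos) / P(pattern | neg) for every observed pattern."""
    pos_counts = Counter(pos_patterns)
    neg_counts = Counter(neg_patterns)
    n_pos, n_neg = len(pos_patterns), len(neg_patterns)
    ratios: Dict[Pattern, float] = {}
    for pattern in pos_counts.keys() | neg_counts.keys():
        p_pos = pos_counts[pattern] / n_pos
        p_neg = neg_counts[pattern] / n_neg
        # A pattern never seen among the negatives gets an infinite ratio here;
        # a real analysis would need enough data (or smoothing) to avoid this.
        ratios[pattern] = p_pos / p_neg if p_neg > 0 else float("inf")
    return ratios


if __name__ == "__main__":
    # Hypothetical gold-standard pairs; each tuple holds the flags of 4 data sets.
    positives = [(1, 1, 0, 0)] * 3 + [(0, 0, 0, 0)] * 1
    negatives = [(1, 1, 0, 0)] * 1 + [(0, 0, 0, 0)] * 9
    for pattern, ratio in sorted(joint_likelihood_ratios(positives, negatives).items()):
        print(pattern, ratio)
```

Because a separate ratio is stored for each of the at most 16 combinations, correlations between the data sets are captured automatically, at the cost of needing enough gold-standard pairs per combination.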

Slide 20: Distribution of likelihood ratios. [Figure: number of protein pairs in the individual datasets and the probabilistic interactomes as a function of the likelihood ratio.] There are many more protein pairs with high likelihood ratios in the probabilistic interactomes (PIE) than in the individual datasets G, H, U, I (Gavin, Ho, Uetz, Ito). Protein pairs with high likelihood ratios provide leads for further experimental investigation of proteins that potentially form complexes.

Slide 21: PIP vs. the information sources. The ratio of true to false positives (TP/FP) increases monotonically with $L_\text{cut}$. → L is an appropriate measure of the odds of a real interaction. The ratio is computed as: [formula not preserved in the transcript]. Protein pairs with $L_\text{cut} > 600$ have a > 50% chance of being in the same complex.
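The formula on this slide did not survive extraction. One plausible reading, based on the "sum(pos)" and "sum(neg)" columns described for the parameter table, is that TP and FP are the numbers of gold-standard positives and negatives with a likelihood ratio at or above the cutoff. The sketch below implements that assumed reading; the function name and the binned counts are hypothetical.

```python
from typing import Sequence, Tuple

# Each bin: (likelihood ratio L, gold-standard positives, gold-standard negatives).
Bin = Tuple[float, int, int]


def tp_fp_ratio(bins: Sequence[Bin], l_cut: float) -> float:
    """TP/FP among all bins with L >= l_cut (assumed definition, see lead-in)."""
    tp = sum(pos for ratio, pos, _neg in bins if ratio >= l_cut)
    fp = sum(neg for ratio, _pos, neg in bins if ratio >= l_cut)
    return tp / fp if fp else float("inf")


if __name__ == "__main__":
    # Hypothetical counts for three likelihood-ratio bins.
    bins = [(1000.0, 50, 2), (100.0, 100, 48), (1.0, 200, 5000)]
    for cut in (1.0, 100.0, 1000.0):
        print(cut, tp_fp_ratio(bins, cut))
```

In this toy table the ratio rises as the cutoff is raised, which is the monotonic behaviour the slide describes.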

Slide 22: PIE vs. the information sources. 9897 interactions are predicted from PIP and 163 from PIE. In contrast, likelihood ratios derived from single genomic factors (e.g. mRNA coexpression) or from individual interaction experiments (e.g. the Ho data set) did not exceed the cutoff when used alone. This demonstrates that information sources that, taken alone, are only weak predictors of interactions can yield reliable predictions when combined.

Slide 23: Parts of the PIP graph. To test whether the thresholded PIP is biased toward certain complexes, compare the distribution of predictions among the gold-standard positives. (A) The complete set of gold-standard positives and their overlap with the PIP. The PIP (green) covers 27% of the gold-standard positives (yellow). The predictions are roughly equally apportioned among the different complexes → no bias.

Slide 24: Parts of the PIP graph. Graph of the largest complexes in PIP, i.e. only those proteins having ≥ 20 links. (Left) Overlapping gold-standard positives are shown in green, PIE links in blue, and overlaps with both PIE and gold-standard positives in black. (Right) Overlapping gold-standard negatives are shown in red. Regions with many red links indicate potential false-positive predictions.

Slide 25: Experimental verification. TAP-tagging experiments (→ Cellzome) were conducted for 98 proteins. These produced 424 experimental interactions overlapping with the PIP thresholded at $L_\text{cut} = 300$. Of these, 185 overlapped with gold-standard positives and 16 with negatives.

Slide 26: Concentrate on large complexes. So far, all interactions were treated as independent. However, the joint distribution of interactions in the probabilistic interactomes (PIs) can help identify large complexes: an ideal complex should be a fully connected "clique" in an interaction graph. In practice, this rarely happens because of incorrect or missing links. Yet large complexes tend to have many interconnections between their members, whereas false-positive links to outside proteins tend to occur randomly, without a coherent pattern.

Slide 27: Improve the ratio TP/FP. [Figure: TP/FP for subsets of the thresholded PIP that only include proteins with a minimum number of links.] Requiring a minimum number of links isolates large complexes in the thresholded PIP graph (Fig. 3B). Observation: increasing the minimum number of links raises TP/FP by preserving the interactions among proteins in large complexes, while filtering out false-positive interactions with heterogeneous groups of proteins outside the complexes.
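The filtering idea can be sketched as a simple degree threshold on the interaction graph. This is an illustrative assumption about the procedure: degrees are counted once on the full graph and only links between two sufficiently connected proteins are kept. The function name and the protein labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def filter_by_min_links(edges: Sequence[Edge], min_links: int) -> List[Edge]:
    """Keep only links whose two proteins each have at least min_links links."""
    degree: Counter = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    keep = {protein for protein, d in degree.items() if d >= min_links}
    return [(a, b) for a, b in edges if a in keep and b in keep]


if __name__ == "__main__":
    # A fully connected 4-protein complex plus one stray link to an outside protein X.
    clique = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
    edges = clique + [("A", "X")]
    print(filter_by_min_links(edges, 3))
```

In the toy graph, the stray link to X disappears while all links inside the densely connected complex survive, mirroring the effect described in the observation.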

Slide 28: Summary. The Bayesian approach allows reliable predictions of protein-protein interactions by combining weakly predictive genomic features. The de novo prediction of complexes replicated interactions found in the gold-standard positives and PIE. Also, several predictions were confirmed by new TAP experiments. The accuracy of the PIP was comparable to that of the PIE while simultaneously achieving greater coverage. In a similar manner, the approach could have been extended to a number of other features related to interactions (e.g. phylogenetic co-occurrence, gene fusions, gene neighborhood). As a word of caution: Bayesian approaches don't work everywhere.

Slide 29: Dynamic Simulation of Protein Complex Formation. Most cellular functions are conducted or regulated by protein complexes of varying size. Organization into complexes may contribute substantially to an organism's complexity. E.g. the different proteins of yeast may form $18 \times 10^6$ different pairs of interacting proteins, but already many more different complexes of size 3 (the exact figures are not preserved in the transcript). → This is a mechanism by which evolution could significantly increase the regulatory and metabolic complexity of organisms without substantially increasing the genome size. Only a very small subset of the many possible complexes is actually realized. Beyer, Wilhelm, Bioinformatics
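The combinatorial growth can be made concrete with binomial coefficients. The sketch assumes 6000 distinct yeast proteins (the figure used for the pair count earlier in the lecture) and treats a potential complex as an unordered set of distinct proteins; the missing numbers on the slide itself are not recoverable from the transcript, so the output is only an illustration of the scaling.

```python
import math


def n_possible_complexes(n_proteins: int, size: int) -> int:
    """Number of unordered sets of `size` distinct proteins out of n_proteins."""
    return math.comb(n_proteins, size)


if __name__ == "__main__":
    print(n_possible_complexes(6000, 2))  # 17997000, i.e. about 18 * 10**6
    print(n_possible_complexes(6000, 3))  # 35982002000, i.e. about 3.6 * 10**10
```

Going from pairs to triples multiplies the number of conceivable complexes by roughly 2000, without adding a single gene.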

Slide 30: Experimental reference data. 229 biologically meaningful "TAP complexes" from yeast with sizes ranging from 2 to 88 different proteins per complex. "Cumulative" means that there are 229 complexes of size 2 that may also be parts of larger complexes. → The size-frequency distribution of complexes has a common characteristic: the number of complexes of a given size decreases exponentially with complex size. Does the shape of this distribution reflect the nature of the underlying cellular dynamics that creates the protein complexes? → Test with a simulation model.

Slide 31: Dynamic Complex Formation Model. 3 variants of the protein complex association-dissociation model (PAD model) are tested with the following features: (i) In all 3 versions, the composition of the proteome does not change with time. Degradation of proteins is always balanced by an equal production of the same kind of proteins. (ii) The cell consists of either one (PAD A and B) or several (PAD C) compartments in which proteins and protein complexes can freely interact with each other. Thus, all proteins can potentially bind to all other proteins in their compartment (risky assumption!). (iii) Association and dissociation rate constants are the same for all proteins. In PAD models A and C, association and dissociation are independent of complex size and complex structure.

Slide 32: Dynamic Complex Formation Model. (iv) At each time step, a set of complexes is randomly selected to undergo association and dissociation. Association is simulated as the creation of new complexes by the binding of two smaller complexes. Dissociation is simulated as the reverse process, i.e. the decay of a complex into two smaller complexes. The numbers of associations and dissociations per time step are $k_a \cdot N_C^2$ and $k_d \cdot N_C$, respectively, where $N_C$ is the total number of complexes in the cell, $k_a$ [1/(#complexes · time)] is the association rate constant, and $k_d$ [1/time] is the dissociation rate constant. $k_a$ and $k_d$ correspond to the biochemical rates of a reversible reaction.
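A single time step of such a simulation might look like the sketch below. It is not the authors' code. It tracks only complex sizes (monomers count as complexes of size 1), rounds $k_a N_C^2$ and $k_d N_C$ to whole event counts, picks partners uniformly at random as in PAD A, and splits a complex by assigning each protein to one of two fragments with probability 1/2, which makes every possible split equally likely. All of these choices, and the name `pad_step`, are assumptions made for illustration.

```python
import random
from typing import List


def pad_step(sizes: List[int], k_a: float, k_d: float, rng: random.Random) -> List[int]:
    """One association/dissociation time step on a list of complex sizes."""
    sizes = list(sizes)
    n_c = len(sizes)
    n_assoc = int(round(k_a * n_c ** 2))   # associations per time step
    n_dissoc = int(round(k_d * n_c))       # dissociations per time step

    for _ in range(n_assoc):
        if len(sizes) < 2:
            break
        i, j = rng.sample(range(len(sizes)), 2)    # two distinct complexes
        merged = sizes[i] + sizes[j]
        for idx in sorted((i, j), reverse=True):   # remove both partners
            sizes.pop(idx)
        sizes.append(merged)

    for _ in range(n_dissoc):
        candidates = [idx for idx, s in enumerate(sizes) if s >= 2]
        if not candidates:
            break
        size = sizes.pop(rng.choice(candidates))
        part = 0
        while part == 0 or part == size:           # reject empty fragments
            part = sum(rng.random() < 0.5 for _ in range(size))
        sizes.extend([part, size - part])
    return sizes


if __name__ == "__main__":
    rng = random.Random(0)
    state = [1] * 100                               # start from 100 free proteins
    for _ in range(50):
        state = pad_step(state, k_a=0.001, k_d=0.05, rng=rng)
    print(len(state), max(state), sum(state))
```

The total number of proteins is conserved by construction, which matches the assumption that the proteome composition does not change over time.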

Slide 33: Protein Association/Dissociation Models. PAD A: the simplest model, where all proteins can interact with each other (no partitioning) and association and dissociation are assumed to be independent of complex size. PAD B: equivalent to PAD A, but larger complexes are assumed to be more likely to bind (preferential attachment). Here, the binding probability is assumed to be proportional to i·j, where i and j are the sizes of two potentially interacting complexes. PAD C: extends PAD A by assuming that proteins can interact only within groups of proteins (with partitioning). The sizes of these protein groups are based on the sizes of first-level functional modules according to the yeast database. PAD C assumes 16 modules, each containing between 100 and 1000 different ORFs. → The protein groups do not represent physical compartments, but rather resemble functional modules of interacting proteins.
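The preferential attachment rule of PAD B can be sketched by weighting every candidate pair of complexes with the product of their sizes. Enumerating all pairs is quadratic in the number of complexes, so this is only practical for small illustrations; the function names are hypothetical and the slides do not describe how the selection was implemented.

```python
import itertools
import random
from typing import Dict, Sequence, Tuple

Pair = Tuple[int, int]


def pair_weights(sizes: Sequence[int]) -> Dict[Pair, int]:
    """Weight i*j for every unordered pair of complexes (indices into sizes)."""
    return {(a, b): sizes[a] * sizes[b]
            for a, b in itertools.combinations(range(len(sizes)), 2)}


def pick_pair_preferential(sizes: Sequence[int], rng: random.Random) -> Pair:
    """Draw one pair of complexes with probability proportional to i*j."""
    weights = pair_weights(sizes)
    pairs = list(weights)
    return rng.choices(pairs, weights=[weights[p] for p in pairs], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = [1, 1, 10, 10]
    draws = [pick_pair_preferential(sizes, rng) for _ in range(2000)]
    # The two large complexes (indices 2 and 3) should be chosen most often.
    print(draws.count((2, 3)) / len(draws))
```

With sizes 1, 1, 10 and 10, the pair of large complexes carries 100 of the 141 total weight units, so it is selected in roughly 70% of the draws.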

Slide 34: Mathematical Description. Explicit simulation of an entire cell (50 million protein molecules were simulated) is too time-consuming for many applications of the model. Therefore, use a simplified mathematical description of the PAD model to quickly assess different scenarios and parameter combinations. The change of the number of complexes of size i, $\Delta x_i$, during one time step $\Delta t$ can be described as $\Delta x_i / \Delta t = G_i^a + G_i^d - L_i^a - L_i^d$ (1), where $G_i^a$ and $G_i^d$ are the gains due to association and dissociation, and $L_i^a$ and $L_i^d$ are the losses due to association and dissociation.

Slide 35: Mathematical Description. Given a total number of $N_C$ complexes, the total numbers of associations and dissociations per time step are $k_a \cdot N_C^2$ and $k_d \cdot N_C$, respectively. We assume throughout that we can calculate the mean numbers of associating and dissociating complexes of size i per time step as $2 \cdot k_a \cdot x_i \cdot N_C$ and $k_d \cdot x_i$. The probability that complexes of size j and i−j get selected for one association is: [formula not preserved in the transcript]. → Deduce the number of complexes of size i that get created during each time step via association of smaller complexes simply by summing over all complex sizes that potentially create a complex of size i: [formula not preserved in the transcript].

Slide 36: Mathematical Description. When j is equal to i/2 (which is possible only for even i), both interaction partners have the same size. The size of the pool $x_{i-j}$ is therefore reduced by 1 after the first interaction partner has been selected, which yields a small reduction of the probability of selecting a second complex from that pool. Account for this effect with a correction term (its symbol is lost in the transcript), which only applies to even i: [formula not preserved in the transcript]. This correction is usually very small. The loss of complexes of size i due to association is simply proportional to the probability of selecting them for association, i.e. [formula not preserved in the transcript].

Slide 37: Mathematical Description. Complexes of size i get created by dissociation of larger complexes. A complex of size j has [formula not preserved in the transcript] possible ways of dissociation, and the number of possible fragments of size i is [formula not preserved in the transcript]. The probability that a dissociating complex of size j > i creates a fragment of size i is hence [formula not preserved in the transcript]. The number of new complexes follows by summing over all possible parent sizes: [formula not preserved in the transcript]. The respective loss term becomes [formula not preserved in the transcript].

Slide 38: Number of complexes formed. The figure shows a comparison of a numerical solution of equation (1) with a stochastic simulation of the association-dissociation process.

Slide 39: Steady state. After a transient period, a steady state is reached. We are mainly interested in this steady-state distribution of frequencies $x_i$. → Find a set of $x_i$ solving $\Delta x_i / \Delta t = 0$. The solution of this non-linear equation system is obtained by numerically minimizing all $\Delta x_i / \Delta t$. By dividing equation (1) by $k_d$, it can be seen that the steady-state distribution is independent of the absolute values of $k_a$ and $k_d$ and depends only on the ratio of the two parameters, $R_{ad} = k_a / k_d$. Hence, only two parameters affect the $x_i$ at steady state: the total number of proteins $N_P$ (which indirectly determines $N_C$) and the ratio of the two rate constants $R_{ad}$.

Slide 40: Association in model C. For PAD model B, the dissociation terms remain unchanged, whereas the association terms have to be modified. Assume that association is proportional to the product of the sizes of the participating complexes. This assumption changes equation (2) to [formula not preserved in the transcript], where n is the maximum complex size and [formula not preserved in the transcript]. In the case of PAD C, we calculated weighted averages of results obtained with PAD A.

Slide 41: Computation of a Dissociation Constant $K_D$. Mathematically, our model describes a reversible (bio-)chemical reaction. → Calculate an equilibrium dissociation constant $K_D$, which quantifies the fraction of free subcomplexes A and B compared to the bound complex AB. (A and B can be ensembles of several proteins.) This equilibrium depends on complex size, because a large complex AB is less likely than a small complex to randomly dissociate exactly into the two specific subunits A and B. For any given complex of size i we get the following $K_D$: $K_D(i) = [A][B] / [AB] = (R_{ad} \cdot N_i \cdot V)^{-1}$ (4), where $N_i$ is the number of possible fragments of a complex of size i and V is the cell volume. Cell-wide averages of $K_D$ values are estimated by computing a weighted average [formula not preserved in the transcript], with $N_C$ being the total number of complexes and $x_i$ being the number of complexes of size i.
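Equation (4) is simple enough to evaluate directly. In the sketch below the number of possible fragments $N_i$ is passed in as an argument, because its formula is not preserved in this transcript; the function name, the values of $R_{ad}$ and V, and the fragment counts in the usage loop are placeholders, not model parameters from the paper.

```python
def dissociation_constant(r_ad: float, n_fragments: float, volume: float) -> float:
    """K_D(i) = 1 / (R_ad * N_i * V), following equation (4)."""
    return 1.0 / (r_ad * n_fragments * volume)


if __name__ == "__main__":
    r_ad, volume = 2.0, 0.1          # hypothetical values in arbitrary units
    for n_fragments in (1, 5, 50):   # hypothetical fragment counts N_i
        print(n_fragments, dissociation_constant(r_ad, n_fragments, volume))
```

Since $K_D(i)$ is inversely proportional to $N_i$, complexes with more possible fragments get a smaller apparent $K_D$ for any one specific pair of subunits.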

Slide 42: Results. (1) Dynamically simulate the association and dissociation of 6200 different protein types, yielding a set of about 50 million protein molecules. (2) Analyze the resulting steady-state size distribution of protein complexes. This steady state is thought to reflect the growth conditions under which the yeast cells were held when the protein complexes were measured by TAP. (3) Calculate a protein complex size distribution from the experimental data, to which the simulation results can be compared (Figure 1).

Slide 43: Results. TAP measurements do not provide concentrations of the measured complexes; they only demonstrate the presence of a certain protein complex in yeast cells. Also, the number of proteins of a certain type inside such a complex could not be measured. → The complex size from Figure 1 does not represent real complex sizes (i.e. the total number of proteins in the complex), but refers to the number of different proteins in a complex. The measured data reflect the characteristics of only 229 different protein complexes of size ≥ 2, which is just a small subset of the "complexosome". These peculiarities have to be taken into account when comparing simulation results to the observed complex size distribution. Here, the "measurable complex size" is taken as the number of distinct proteins in a protein complex (Figure 2). When comparing our simulation results to the measurements, we always select a random subset of 229 different complexes from the simulated pool of complexes. This results in a complex size distribution comparable to the measured distribution from Figure 1 ("bait distribution").

Effect of preferential attachment
Both simulations were performed with the best-fit parameters for PAD A. With preferential attachment, the best regression result (solid line) is obtained with a power law, while the simulation without preferential attachment is best fitted by an exponentially decreasing curve. The original, measurable and bait distributions are always close to exponential in the case of PAD A and power-law-like in the case of PAD B, independent of the parameters chosen.
→ The PAD B model gives a power-law distribution, which is not in agreement with the experimental observation.
Figure: Cumulative number of distinct protein complexes versus their size, resulting from simulations without (diamonds) and with (squares) preferential attachment to larger complexes.
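
One simple way to ask whether a size distribution looks more exponential or more power-law-like is to compare straight-line fits in semi-log and log-log space. The sketch below does this with ordinary least squares. It is a stand-in for illustration, not necessarily the regression used for the figure, and the function names and synthetic data are hypothetical.

```python
import math


def linear_fit_r2(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def compare_fits(sizes, counts):
    """R^2 of an exponential fit (log y vs x) and a power-law fit (log y vs log x)."""
    log_counts = [math.log(c) for c in counts]
    log_sizes = [math.log(s) for s in sizes]
    _, r2_exponential = linear_fit_r2(sizes, log_counts)
    _, r2_power_law = linear_fit_r2(log_sizes, log_counts)
    return {"exponential": r2_exponential, "power_law": r2_power_law}


# Synthetic check: one exponentially decaying and one power-law series.
sizes = list(range(2, 11))
exp_fits = compare_fits(sizes, [1000.0 * math.exp(-0.8 * s) for s in sizes])
pow_fits = compare_fits(sizes, [1000.0 * s ** -2.5 for s in sizes])
```

Counts must be strictly positive for the logarithms to exist, so empty size classes would have to be dropped before fitting.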

Conclusions
A very simple, dynamic model can reproduce the observed complex size distribution. Given the small number of input parameters, the very good fit to the observed data is astonishing (and may be fortuitous).
Preferential attachment does not take place in yeast cells under the investigated conditions. This is biologically plausible: specific and strong binding can be just as important for small protein complexes as for large complexes.
→ The dissociation should on average be independent of the complex size.
Interpreting the simulated association and dissociation in terms of K_D values suggests that larger complexes bind more strongly than smaller complexes. However, the size dependence of K_D is compensated by the higher number of possible dissociations in larger complexes.
Here, we assumed that all possible dissociations happen with the same probability. In reality, large complexes may break into specific subcomplexes, which subsequently can be re-used for a different purpose.
→ Improved versions of the model should account for specificity of association and for specific dissociation.

Conclusions
Conclusion 2: The number of complexes that were missed during the TAP measurements is potentially large. Simulations give an upper limit of the number of different complexes in cells.
At first glance, the number of different complexes in PAD A (> 3.5 million) and PAD C (~ 2 million) may appear to be far too large. Even PAD C may overestimate the true number of different complexes, because association within the groups is unrestricted. However, the PAD models do not only simulate functional, mature complexes; they also consider all intermediate steps. Each of these steps is counted as a different protein complex.
The large difference between the number of measured complexes and the (potential) number of existing complexes may partly explain the very small overlap that has been observed between different large-scale measurements of protein complexes.
A correct interpretation of the kinetic parameters is important:
- k_a and k_d cannot be compared to real-world values, because the model does not define a length of the time steps for interpreting k_a and k_d as actual rate constants.
- The association-to-dissociation ratio R_ad is not identical to a physical K_D value obtained by in vitro measurements of protein binding in aqueous solution.

Discussion
Factors complicating this simple interpretation:
(i) In vivo diffusion rates are below those in water (e.g. 5- to 20-fold) due to the high concentration of proteins and other large molecules in the cytosol.
(ii) Most proteins either are synthesized where they are needed or are transported directly to the site where the complex is assembled. → Transport to the site of action is on average faster than random diffusion.
(iii) Protein concentrations are often above the cell average due to the compartmentalization of the cell.
All these processes (protein production, transport, and degradation) are not explicitly described in the PAD model, but they are lumped into the assumptions. R_ad must therefore be interpreted as an operationally defined property. It characterizes the overall, cell-averaged complex assembly process, which includes all steps necessary to synthesize a protein complex.

Discussion
However, even the model-derived K_D values allow for some conclusions regarding complex formation. We calculated weighted averages (K_D) of the size-dependent K_D values by using the steady-state complex size distribution of the best fit. This yields average K_D values of 4.7 nM and 0.18 nM for the best fits of PAD A and PAD C, respectively.
First, the fact that the K_D for PAD C is below that of PAD A underlines the notion that more specific binding is reflected by smaller K_D values.
Second, typical in vitro K_D values are > 1 nM. Thus the average K_D of PAD C is quite low. The model confirms that protein complex formation in vivo is accelerated by directed protein transport and by the compartmentalization of eukaryotes.

Discussion
The simulated complex size distribution is almost independent of the assumed protein abundance distribution.
P_P is a valuable summarizing property that can be used to characterize proteomes of different species. A decreasing P_P increases the number of different large complexes (the slope in Table 1 becomes shallower), because it is less likely that a large complex contains the same protein twice. Thus, P_P is a measure of complexity that relates not only to the diversity of the proteome but also to the composition of protein complexes.
Probably the most severe simplification in our model is the assumption that all proteins can potentially interact with each other. PAD model C is a first step towards more biological realism. By restricting the number of potential interaction partners, it more closely maps functional modules and cell compartments, which both restrict the interaction among proteins.

Further improvements
The partitioning in PAD C means that proteins within one group exhibit very strong binding, whereas binding between protein groups is set to zero. This again is a simplification, since cross-talk between different modules or compartments is possible. Future extensions of the model could incorporate more and more detailed information about the binding specificity of proteins.
Assuming even more specific binding will further reduce the number of different complexes, whereas the frequency of the complexes will increase. High binding specificity potentially lowers the complex sizes, so R_ad has to be increased in order to fit the experimentally observed protein complex size distribution. On the other hand, cross-talk gives rise to larger complexes. Taking both counteracting refinements into account, it is impossible to generally predict the best-fit R_ad, since it depends on the quantitative details.

Further improvements
- A refinement of PAD C could account for the observed clustering of protein interaction networks.
- One could simulate protein associations and dissociations according to predefined binary protein interactions.
- A detailed model could additionally account for individual association/dissociation rates between individual proteins.
Such extensions will yield more realistic figures for the number of different protein complexes created in yeast cells.

Additional slides (not used)

Overview
PIP and PIE are separately tested against the gold standard.
Jansen et al. Science 302, 449 (2003)

Possible Limitations
In order to get a correct picture of the protein complex size distribution it is necessary to have an unbiased, random subset of all complexes in the cells. TAP data are biased, e.g. they contain too few membrane proteins. However, if compared to other data sets such as MIPS complexes, the TAP complexes constitute a fairly random selection of all protein complexes in yeast. Uncertainties in the TAP data do not affect our conclusions as long as they are not strongly biased with respect to the resulting complex size distribution.
Since Gavin et al. (2002) have measured long-term interactions, our results apply to permanent complexes. Yet the model is applicable to future protein complex data that take account of transient binding.

Protein Abundance Data
Figure: Abundance of 6200 yeast proteins.
Beyer et al. (2004) compiled a protein abundance data set for yeast under standard conditions in YPD medium. Based on this data set we derived a distribution of protein abundances that resembles the characteristics of the measured data in the upper range (Figure S2). For approximately 2000 proteins no abundance values are available. We assume that the undetected proteins primarily belong to the low-abundance classes, which gives rise to the hypothetical distribution.

Biochemical Interpretation of the Rate Constants
The process of forming a protein complex AB from the two subcomplexes A and B, and its dissociation, can be described as a reversible reaction:
A + B ⇌ AB
with constants k_on [L/(mol s)] and k_off [1/s] quantifying the forward and backward reactions:
d[AB]/dt = k_on · [A] · [B] - k_off · [AB]
In our model the concentration [A] can be calculated as
[A] = f_A · N_C / V
with f_A being the fraction of species A among all N_C complexes in the system and V being the cell volume.

Biochemical Interpretation of the Rate Constants
The number of associations of two complex species A and B per time step becomes approximately
k_a · n_A · n_B
since we assume k_a · N_C^2 associations per time step. Here, n_A and n_B are the numbers of complexes of the respective species. Division by the cell volume V yields units of 'concentration per time'. Thus, k_on in a biochemical reaction approximately equals k_a · V, since the total number of complexes N_C is very large in all scenarios that we have simulated.

Biochemical Interpretation of the Rate Constants
When looking for an equivalent expression for k_off, we have to quantify the specific dissociation of a complex AB into the subcomplexes A and B. The unspecific dissociation of AB is simply k_d · [AB], where k_d is the dissociation rate constant. Since AB may consist of more than 2 proteins, it can also be split into subcomplexes other than A and B. For the specific dissociation rate, one has to know how often AB actually dissociates into the subcomplexes A and B.
The total number of dissociations per time step is k_d · N_C. The probability that a complex AB of size i breaks into the specific subcomplexes A and B is 1/N_i, where N_i is the number of possible fragments of a complex of size i. This holds under the assumption that all proteins in AB are distinct, which is approximately true for the simulations conducted here.
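
The bookkeeping described on these slides (about k_a · N_C^2 associations and k_d · N_C dissociations per time step, with every possible split of a complex equally likely) can be sketched as a toy Python simulation step. This is a heavily simplified illustration, not the authors' implementation: how partners are chosen, the restriction of dissociation to complexes with at least two proteins, the rounding of event counts, and all function names are assumptions made here, and there is no preferential attachment or partitioning.

```python
import random


def split_uniformly(members, rng):
    """Split a complex into two non-empty parts; every split is equally likely."""
    while True:
        left, right = [], []
        for protein in members:
            (left if rng.random() < 0.5 else right).append(protein)
        if left and right:  # reject the two splits that leave one side empty
            return left, right


def pad_step(complexes, k_a, k_d, rng):
    """One time step on a list of complexes (each a list of protein ids)."""
    n_c = len(complexes)
    # Associations: merge two randomly chosen complexes, about k_a * N_C**2 times.
    for _ in range(int(k_a * n_c * n_c)):
        if len(complexes) < 2:
            break
        i, j = rng.sample(range(len(complexes)), 2)
        merged = complexes[i] + complexes[j]
        for index in sorted((i, j), reverse=True):
            complexes.pop(index)
        complexes.append(merged)
    # Dissociations: split a randomly chosen complex, about k_d * N_C times.
    for _ in range(int(k_d * n_c)):
        candidates = [idx for idx, c in enumerate(complexes) if len(c) >= 2]
        if not candidates:
            break
        left, right = split_uniformly(complexes.pop(rng.choice(candidates)), rng)
        complexes.extend([left, right])
    return complexes


# Hypothetical toy run: 100 monomers, 50 time steps.
rng = random.Random(1)
state = [[protein] for protein in range(100)]
for _ in range(50):
    state = pad_step(state, k_a=0.001, k_d=0.05, rng=rng)
```

The step only moves proteins between complexes, so the total number of protein molecules is conserved, which is a useful sanity check for any variant of this sketch.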

Biochemical Interpretation of the Rate Constants
With n_AB / N_C being the fraction of complexes AB among all complexes, the size-specific dissociation rate N_AB^dissoc(i) is
N_AB^dissoc(i) = k_d · N_C · (n_AB / N_C) · (1 / N_i) = k_d · n_AB / N_i
from which the complex-size-dependent rate constant k_off(i) = k_d / N_i results. Taking into account that certain proteins may be in the complex more than once, we get k_off = k_d / N_i.
One can calculate an apparent equilibrium constant K_D, which describes the equilibrium between the independent species A and B and the bound species AB:
K_D(i) = k_off(i) / k_on = k_d / (N_i · k_a · V) = (R_ad · N_i · V)^(-1)
where i is the size of the complex AB. Since N_i increases exponentially with i, K_D decreases exponentially with complex size.

Measurable Size Distribution and Bait Selection
Based on the distribution resulting from equation (1) at steady state, we derive two further distributions: (i) the 'measurable size distribution' and (ii) the 'bait distribution'. The former is defined as the frequency distribution of the measurable complex sizes. The measurable complex size is the number of different proteins in a protein complex (as opposed to the total number of proteins). For the measurable size distribution we only count the number of complexes with distinct protein compositions.
Figure: Measurable versus 'actual' complex size distribution. Diamonds show frequencies of actual complex sizes and triangles are frequencies of measurable complexes. Filled diamonds and triangles reflect simulation without partitioning (PAD A); open diamonds and triangles are simulation results assuming binding only within certain modules (PAD C).
The difference between the original and the measurable complex size distribution is comparatively small, because most of the simulated complexes are unique. However, in the case of PAD C smaller complexes occur at higher copy numbers, and larger complexes are often counted as smaller measurable complexes because they contain some proteins more than once.

Reliability of Protein Interaction Networks
Direct comparison of different data sets

High-throughput methods for detecting protein interactions
Yeast two-hybrid assay. Pairs of proteins to be tested for interaction are expressed as fusion proteins ('hybrids') in yeast: one protein is fused to a DNA-binding domain, the other to a transcriptional activator domain. Any interaction between them is detected by the formation of a functional transcription factor.
Benefits: it is an in vivo technique; transient and unstable interactions can be detected; it is independent of endogenous protein expression; and it has fine resolution, enabling interaction mapping within proteins.
Drawbacks: only two proteins are tested at a time (no cooperative binding); it takes place in the nucleus, so many proteins are not in their native compartment; and it predicts possible interactions, but is unrelated to the physiological setting.
Mass spectrometry of purified complexes. Individual proteins are tagged and used as 'hooks' to biochemically purify whole protein complexes. These are then separated and their components identified by mass spectrometry. Two protocols exist: tandem affinity purification (TAP), and high-throughput mass-spectrometric protein complex identification (HMS-PCI).
Benefits: several members of a complex can be tagged, giving an internal check for consistency; and it detects real complexes in physiological settings.
Drawbacks: it might miss some complexes that are not present under the given conditions; tagging may disturb complex formation; and loosely associated components may be washed off during purification.
Correlated mRNA expression (synexpression). mRNA levels are systematically measured under a variety of different cellular conditions, and genes are grouped if they show a similar transcriptional response to these conditions. These groups are enriched in genes encoding physically interacting proteins.
Benefits: it is an in vivo technique, albeit an indirect one; and it has much broader coverage of cellular conditions than other methods.
Drawbacks: it is a powerful method for discriminating cell states or disease outcomes, but is a relatively inaccurate predictor of direct physical interaction; and it is very sensitive to parameter choices and clustering methods during analysis.
Von Mering et al. Nature 417, 399 (2002)

High-throughput methods for detecting protein interactions
Genetic interactions (synthetic lethality). Two nonessential genes that cause lethality when mutated at the same time form a synthetic lethal interaction. Such genes are often functionally associated and their encoded proteins may also interact physically. This type of genetic interaction is currently being studied in an all-versus-all approach in yeast.
Benefits: it is an in vivo technique, albeit an indirect one; and it is amenable to unbiased genome-wide screens.
In silico predictions through genome analysis. Whole genomes can be screened for three types of interaction evidence: (1) in prokaryotic genomes, interacting proteins are often encoded by conserved operons; (2) interacting proteins have a tendency to be either present or absent together from fully sequenced genomes, that is, to have a similar 'phylogenetic profile'; and (3) seemingly unrelated proteins are sometimes found fused into one polypeptide chain. This is an indication of a physical interaction.
Benefits: fast and inexpensive in silico techniques; and coverage expands as more genomes are sequenced.
Drawbacks: it requires a framework for assigning orthology between proteins, failing where orthology relationships are not clear; and so far it has focused mainly on prokaryotes.
Von Mering et al. Nature 417, 399 (2002)

Data set
Experiment:
- Uetz et al.: 957 interactions
- Ito et al interactions
- HMS-PCI: 33014 interactions
In silico:
- Conserved gene neighborhood: 6387 interactions
- Gene fusions: 358 interactions
- Co-occurrence of genes: 997 interactions
Von Mering et al. Nature 417, 399 (2002)

Counting interactions
Various high-throughput methods give differing results on the same complex.
> interactions available for yeast. Only are supported by more than 1 method.
Possible explanations:
- Methods may not have reached saturation.
- Many of the methods produce a significant fraction of false positives.
- Some methods may have difficulties with certain types of interactions.
Von Mering et al. Nature 417, 399 (2002)

Protein interactions between functional categories
Each technique produces a unique distribution of interactions with respect to functional categories.
→ The methods have specific strengths and weaknesses.
E.g. TAP and HMS-PCI predict few interactions for proteins involved in transport and sensing, because these categories are enriched with membrane proteins.
E.g. Y2H detects few proteins involved in translation.
Von Mering et al. Nature 417, 399 (2002)

Complementarity between data sets
Glycine decarboxylase:
- Multienzyme complex needed when Gly is used as a 1-carbon source.
- Its key components GCV1, GCV2 and GCV3 are only induced when there is excess glycine and folate levels are low. This may explain why the complex is not detected in experiments.
However, the 3 components can be detected by several independent in silico methods:
- Gene neighborhood of all 3 components in 7 diverged species.
- The genes show a very similar phylogenetic distribution.
- Microarrays: the genes are closely co-regulated.
Opposite example: PPH3 protein. The complex is found in 4 independent purifications, but no in silico method predicts the interaction.
Von Mering et al. Nature 417, 399 (2002)

Quantitative comparison of interaction data sets
The various data sets are benchmarked against a reference set of 10,907 trusted interactions, which are derived from protein complexes annotated manually at the MIPS and YPD databases. Coverage and accuracy are lower limits owing to the incompleteness of the reference set. Each dot in the graph represents an entire interaction data set. For the combined evidence, consider only interactions supported by an agreement of two (or three) of any of the methods shown.
Von Mering et al. Nature 417, 399 (2002)
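
Benchmarking against a reference set and combining evidence from several methods can be sketched as simple set operations. The definitions used below are assumptions, since the slide does not spell them out: coverage is taken as the fraction of reference interactions recovered by a data set, and accuracy as the fraction of a data set's interactions that are in the reference set. All names and the toy data are hypothetical.

```python
from collections import Counter


def normalise(pair):
    """Treat A-B and B-A as the same undirected interaction."""
    return tuple(sorted(pair))


def coverage_and_accuracy(predicted, reference):
    """Assumed definitions: hits/|reference| and hits/|predicted|."""
    predicted = {normalise(p) for p in predicted}
    reference = {normalise(p) for p in reference}
    hits = predicted & reference
    return len(hits) / len(reference), len(hits) / len(predicted)


def supported_by_at_least(datasets, min_methods=2):
    """Interactions reported by at least `min_methods` different methods."""
    votes = Counter()
    for pairs in datasets.values():
        votes.update({normalise(p) for p in pairs})  # one vote per method
    return {pair for pair, n in votes.items() if n >= min_methods}


# Hypothetical toy data, not taken from the benchmark.
reference = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
y2h = [("B", "A"), ("X", "Y")]
tap = [("A", "B"), ("C", "D"), ("X", "Z")]
coverage, accuracy = coverage_and_accuracy(y2h, reference)
combined = supported_by_at_least({"y2h": y2h, "tap": tap})
```

Because the reference set is incomplete, a predicted pair that is missing from it may still be a true interaction, which is why both numbers are lower limits.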

Biases in interaction coverage
Experiment:
- Uetz et al.: 957 interactions
- Ito et al interactions
- HMS-PCI: 33014 interactions
In silico:
- Conserved gene neighborhood: 6387 interactions
- Gene fusions: 358 interactions
- Co-occurrence of genes: 997 interactions
None of the methods covers more than 60% of the proteins in the yeast genome. Are there common biases as to which proteins are covered?
Von Mering et al. Nature 417, 399 (2002)

Bias 1: towards proteins of high abundance
mRNA abundance is a rough measure of protein abundance. Here, the yeast genome is divided into 10 mRNA abundance classes (bins) of equal size. For each data set and abundance class, the number of interactions having at least one protein in that class is recorded. Each interaction (A-B) is counted twice: once under the abundance class of partner A, and once under the abundance class of partner B.
→ Most data sets are heavily biased towards proteins of high abundance, except for the genetic techniques (Y2H and synthetic lethality).
Von Mering et al. Nature 417, 399 (2002)
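
The counting scheme on this slide can be written down in a few lines. The sketch assumes that the classes are formed by ranking genes by mRNA level and cutting the ranking into equally sized bins, and that an interaction whose two partners fall into the same class contributes two counts to that class. The function names and the toy data are hypothetical.

```python
def abundance_classes(mrna_levels, n_classes=10):
    """Rank genes by mRNA level and cut the ranking into equally sized classes.

    Class 0 holds the least abundant genes, class n_classes - 1 the most abundant.
    """
    ranked = sorted(mrna_levels, key=mrna_levels.get)
    return {gene: rank * n_classes // len(ranked)
            for rank, gene in enumerate(ranked)}


def interactions_per_class(interactions, gene_class, n_classes=10):
    """Count each interaction once under the class of each of its partners."""
    counts = [0] * n_classes
    for partner_a, partner_b in interactions:
        counts[gene_class[partner_a]] += 1
        counts[gene_class[partner_b]] += 1
    return counts


# Hypothetical toy genome of 20 genes with increasing mRNA levels.
levels = {f"g{i}": float(i) for i in range(20)}
classes = abundance_classes(levels)
counts = interactions_per_class([("g0", "g19"), ("g18", "g19")], classes)
```

A data set without abundance bias would give a roughly flat profile across the classes, while a biased one piles up counts in the upper classes.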

Bias 2: towards cellular localization
Protein localization and interaction coverage. Protein localizations are derived from the MIPS and TRIPLES databases.
Panel a: the distribution of protein localization among the proteins covered by a data set.
E.g. in silico predictions overestimate mitochondrial interactions.
Von Mering et al. Nature 417, 399 (2002)

Bias 2: towards cellular localization
Independent quality measure: do proteins that interact belong to the same compartment?
The Y2H method gives relatively poor results here.
Von Mering et al. Nature 417, 399 (2002)

Bias 3: in interaction coverage
The yeast genome is separated into 4 classes according to the conservation of the genes in other species. The presence of a gene in any of these species was concluded from bi-directional best hits in Smith-Waterman searches, using 0.01 as cutoff.
This bias is related to the degree of evolutionary novelty of proteins. Proteins restricted to yeast are less well covered than ancient, evolutionarily conserved proteins.
Von Mering et al. Nature 417, 399 (2002)
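
The bi-directional best hit criterion mentioned on this slide can be illustrated as follows. The sketch assumes that the best hit of every gene in the other genome has already been computed by a sequence search, and that the 0.01 cutoff refers to an E-value that both directions must satisfy. Both points are assumptions, as are the function name and the toy gene identifiers.

```python
def bidirectional_best_hits(best_in_b, best_in_a, max_evalue=0.01):
    """Pairs (gene_a, gene_b) that are each other's best hit.

    best_in_b maps a gene of genome A to (best hit in genome B, E-value);
    best_in_a maps a gene of genome B to (best hit in genome A, E-value).
    """
    pairs = set()
    for gene_a, (gene_b, evalue_ab) in best_in_b.items():
        back = best_in_a.get(gene_b)
        if back is None:
            continue
        gene_back, evalue_ba = back
        reciprocal = gene_back == gene_a
        significant = evalue_ab <= max_evalue and evalue_ba <= max_evalue
        if reciprocal and significant:
            pairs.add((gene_a, gene_b))
    return pairs


# Hypothetical search results for three yeast genes against another genome.
yeast_to_other = {"Y1": ("O1", 1e-30), "Y2": ("O1", 1e-5), "Y3": ("O3", 0.5)}
other_to_yeast = {"O1": ("Y1", 1e-28), "O3": ("Y3", 0.4)}
conserved = bidirectional_best_hits(yeast_to_other, other_to_yeast)
```

In the toy data Y2 is rejected because O1's best hit is Y1, and Y3 is rejected because its hits are reciprocal but not significant.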

Outlook
How many protein-protein interactions can be expected in yeast?
The overlap of high-throughput data is 20 times larger than expected by chance.
→ Good signal-to-noise ratio.
Also, for interactions discovered ≥ 2 times, usually both partners have the same functional category and cellular localization.
→ The overlap mainly consists of 'true positives'.
Less than 1/3 of the new interactions in the overlap set were previously known.
Given currently known interactions predict > protein interactions in yeast (lower boundary).
Von Mering et al. Nature 417, 399 (2002)

Problems
Unfortunately, interaction data sets are often incomplete and contradictory (von Mering et al. 2002). In the context of genome-wide analyses, these inaccuracies are greatly magnified because the protein pairs that do not interact (negatives) by far outnumber those that do interact (positives).
E.g. in yeast, the ~6000 proteins allow for N (N-1) / 2 ~ 18 million potential interactions. But the estimated number of actual interactions is < Therefore, even reliable techniques can generate many false positives when applied genome-wide.
Think of a diagnostic with a 1% false-positive rate for a rare disease occurring in 0.1% of the population. This would roughly produce 1 true positive for every 10 false ones.
Jansen et al. Science 302, 449 (2003)
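
The two numbers on this slide are easy to reproduce. The sketch below counts the candidate pairs for about 6000 proteins and works through the diagnostic example. The function names are hypothetical, and the test is assumed to detect every true case (sensitivity 1.0), which the slide does not state.

```python
def candidate_pairs(n_proteins):
    """Number of unordered protein pairs, N * (N - 1) / 2."""
    return n_proteins * (n_proteins - 1) // 2


def false_per_true_positive(prevalence, false_positive_rate, sensitivity=1.0):
    """Expected number of false positives for each true positive."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return false_positives / true_positives


pairs = candidate_pairs(6000)                       # about 18 million
ratio = false_per_true_positive(prevalence=0.001,   # disease in 0.1% of people
                                false_positive_rate=0.01)
```

With these inputs the ratio comes out just under 10, matching the slide's rough figure of 1 true positive for every 10 false ones. The same arithmetic applies to interaction screens, where true interactions play the role of the rare disease.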