JCKBSE2010, Kaunas
Predicting Combinatorial Protein-Protein Interactions from Protein Expression Data Based on Correlation Coefficient
Sho Murakami, Takuya Yoshihiro, Etsuko Inoue and Masaru Nakagawa
Faculty of Systems Engineering, Wakayama University
Agenda
1. Background
2. Combinatorial Protein-Protein Interactions
3. The Proposed Data Mining Method
4. Evaluation
5. Conclusion
Background
Finding interactions among genes and proteins is important.
Many data-mining algorithms for discovering gene-gene (or protein-protein) interactions have been proposed so far.
One of the main data sources is gene or protein expression data.
(Figure: a microarray image for gene expression, where color strength indicates the expression level, and a 2D electrophoresis image for protein expression, where the size of a spot indicates the expression level.)
Related Work for Interaction Discovery
Bayesian networks discover interactions from expression data based on conditional probabilities among events.
Example: to discover protein-protein interactions among proteins A, B and C,
1. Define events A, B and C (for example, the event "C is expressed").
2. Compute the conditional probabilities relating A, B and C over the samples.
If a conditional probability is high, an interaction is predicted.
(Figure: diagram of the events A, B and C over the set of samples.)
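The two steps of the example can be sketched in a few lines of Python. This is only an illustration: the function name `conditional_probability`, the threshold that turns an expression level into the event "is expressed", and the sample values are all assumptions, not taken from the slides.

```python
def conditional_probability(samples, given, target, threshold=0.5):
    """Estimate P(target expressed | every protein in `given` expressed).

    A protein counts as "expressed" when its level exceeds `threshold`
    (a hypothetical rule chosen for this sketch).
    """
    matching = [s for s in samples if all(s[g] > threshold for g in given)]
    if not matching:
        return None  # no sample satisfies the condition
    hits = sum(1 for s in matching if s[target] > threshold)
    return hits / len(matching)


# Hypothetical expression levels for four samples.
samples = [
    {"A": 0.9, "B": 0.8, "C": 0.7},
    {"A": 0.9, "B": 0.1, "C": 0.2},
    {"A": 0.2, "B": 0.9, "C": 0.1},
    {"A": 0.8, "B": 0.7, "C": 0.9},
]

p_c_given_ab = conditional_probability(samples, given=("A", "B"), target="C")
p_c_given_a = conditional_probability(samples, given=("A",), target="C")
```

Only the samples that satisfy the condition contribute to the estimate, so each added condition shrinks the usable sample set.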
Problems of Bayesian Networks
Bayesian networks require a large number of samples.
For genes: microarrays provide cheap, high-speed experiments.
For proteins: 2D electrophoresis is time-consuming and expensive.
Many samples are necessary to obtain statistically reliable results: are there sufficient samples in the region of interest?
Example: to discover protein-protein interactions among proteins A, B and C,
1. Define events A, B and C.
2. Compute the conditional probabilities relating A, B and C.
(Figure: diagram of the events A, B and C, highlighting the region where the events overlap.)
The Objective of Our Study
Finding combinatorial protein-protein interactions from small protein expression data sets.
Expression Data
2D electrophoresis is performed for each sample, which yields the expression level of each protein.
Expression levels are obtained by measuring the size of the spot areas.
Normalization is applied as pre-processing.
(Figure: 2D electrophoresis images for sample1, sample2, sample3, ... Each black area indicates a protein, and the size of the area represents its expression level.)
Model of Protein-Protein Interaction
Model considered: two proteins A and B affect the expression level of another protein C only when both A and B are expressed.
Usually, only the sole effects of A and of B on C are considered.
Only if both A and B exist (for example, as a complex of A and B) does the combinatorial effect work on C.
We want to estimate this combinatorial effect.
(Figure: A and B each affecting C alone, and a complex of A and B affecting the expression level of C.)
Predicting Interactions by Correlation Coefficient
We compute the correlation coefficient between (A, B) and C.
A correlation coefficient requires fewer samples.
The amount of the complex (A, B) in a sample is estimated by min(A, B); this amount would affect C.
We therefore compute the correlation between min(A, B) and C.
The total effect on C is high if the correlation is high.
(Figure: expression levels of A and B in one sample, with the smaller of the two marked as the estimated amount of the complex of A and B.)
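As a minimal sketch of this idea, the correlation between min(A, B) and C can be computed as follows. The helper names `pearson` and `complex_correlation` and the sample values are made up for illustration; the correlation is the ordinary Pearson coefficient, which is assumed here to be the coefficient meant by the slide.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sxx = sum((p - mean_x) ** 2 for p in x)
    syy = sum((q - mean_y) ** 2 for q in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def complex_correlation(a, b, c):
    """Correlate the estimated amount of complex, min(A, B), with C."""
    complex_ab = [min(x, y) for x, y in zip(a, b)]
    return pearson(complex_ab, c)


# Hypothetical expression levels over five samples.
a = [1.0, 4.0, 2.0, 5.0, 3.0]
b = [3.0, 2.0, 5.0, 1.0, 4.0]
c = [2.0, 4.0, 4.0, 2.0, 6.0]  # constructed as 2 * min(a, b)

r = complex_correlation(a, b, c)  # 1 up to floating-point rounding
```

In this constructed example neither A nor B alone tracks C well, but their minimum does, which is the pattern the model looks for.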
The Problem of Scale Difference
The expression level that corresponds to one molecule differs among proteins, so equal expression levels of A and B do not always combine in equal amounts.
Therefore, taking the minimum of the raw expression levels cannot express the correct amount of complex.
Solution: correct the scale of A, so that taking the minimum leads to the correct amount of complex.
(Figure: the scaling problem and its solution. Left: expression levels of proteins A and B on their original scales, where the estimated amount of complex AB is not correct. Right: the same proteins after the scale of A is corrected, where taking the minimum gives the correct amount of complex. The marked level is the expression level required for a complex.)
How to Determine the Correct Scale?
We compute Score S: the total effect of (A, B) on C.
Rescale A by several candidate factors (k1·A, k2·A, k3·A, ...) and compute the correlation coefficient of min(A, B) and C for each scale.
Select the scale that leads to the maximum correlation coefficient of min(A, B) and C; this maximum is Score S.
If an interaction of our model exists, a high correlation value should appear.
(Figure: example correlations of 0.1, 0.2, 0.3 and 0.7 for different scales of A; Score S is the largest of them.)
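The scale search can be sketched as below. The slides do not say which candidate scales are tried, so the grid `scales` is a hypothetical choice, as are the function names and the data. Only A is rescaled: because a correlation coefficient does not change when one variable is multiplied by a positive constant, rescaling B by a factor is equivalent to rescaling A by its reciprocal.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sxx = sum((p - mean_x) ** 2 for p in x)
    syy = sum((q - mean_y) ** 2 for q in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def score_s(a, b, c, scales):
    """Return (Score S, best scale): the maximum correlation of min(k*A, B) with C."""
    best_r, best_k = max(
        (pearson([min(k * x, y) for x, y in zip(a, b)], c), k) for k in scales
    )
    return best_r, best_k


# Hypothetical data: C equals min(2*A, B), so the "correct" scale of A is 2.
a = [0.5, 2.0, 1.0, 2.5, 1.5]
b = [3.0, 2.0, 5.0, 1.0, 4.0]
c = [1.0, 2.0, 2.0, 1.0, 3.0]
scales = [0.25, 0.5, 1.0, 2.0, 4.0]  # hypothetical candidate grid

s, k = score_s(a, b, c, scales)  # best scale is 2.0, correlation 1 up to rounding
```

A finer or wider grid makes it more likely that the true scale is covered, at the cost of more correlation computations per combination.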
Estimating the Combinatorial Effect from Score S
Score S consists of the "sole effect" and the "combinatorial effect".
We compute Score S': the Score S obtained when assuming no combinatorial effect.
The difference between S and S' is the level of the combinatorial effect.
Score S' is obtained by computing its statistical distribution.
(Figure: Score S, which includes the effects of A alone, B alone and the complex of A and B on C, compared with Score S', which assumes only the sole effects of A and B on C.)
How to Compute the Distribution of Score S'?
Assume that the expression levels of proteins A, B and C follow normal distributions.
Computer simulation yields the distribution of Score S'.
1. Randomly create distributions of A, B and C in which the correlation coefficient of A-B is α and that of B-C is β.
2. Repeat the computation of Score S to obtain the distribution of Score S' (for example, Score S' for α = 0.5, β = 0.3, or for α = 0.5, β = 0.4).
3. Create a table of the average and standard deviation of S' for each α and β (upper value: average; lower value: standard deviation).
In this way we obtain the distribution for each α and β.
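The three steps can be sketched with the standard library as follows. Several details are assumptions made for this sketch and are not stated on the slides: A and C are each generated as a linear function of B plus independent normal noise (which gives correlations α for A-B and β for B-C and adds no combinatorial term), all three proteins share a hypothetical mean of 10 and standard deviation of 1, and the grid of candidate scales is invented.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sxx = sum((p - mean_x) ** 2 for p in x)
    syy = sum((q - mean_y) ** 2 for q in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def score_s(a, b, c, scales):
    """Maximum correlation of min(k*A, B) with C over the candidate scales."""
    return max(pearson([min(k * x, y) for x, y in zip(a, b)], c) for k in scales)


def simulate_s_prime(alpha, beta, n_samples, n_trials, scales, seed=0):
    """Return (average, stddev) of Score S' for population correlations alpha, beta."""
    rng = random.Random(seed)
    mean, sd = 10.0, 1.0  # hypothetical expression-level distribution
    noise_a = math.sqrt(1 - alpha ** 2)
    noise_c = math.sqrt(1 - beta ** 2)
    scores = []
    for _ in range(n_trials):
        a, b, c = [], [], []
        for _ in range(n_samples):
            zb = rng.gauss(0.0, 1.0)
            za = alpha * zb + noise_a * rng.gauss(0.0, 1.0)  # corr(A, B) = alpha
            zc = beta * zb + noise_c * rng.gauss(0.0, 1.0)   # corr(B, C) = beta
            a.append(mean + sd * za)
            b.append(mean + sd * zb)
            c.append(mean + sd * zc)
        scores.append(score_s(a, b, c, scales))
    return statistics.mean(scores), statistics.stdev(scores)


SCALES = [0.8, 0.9, 1.0, 1.1, 1.25]  # hypothetical candidate grid

# Step 3: table of (average, stddev) of S' for each alpha and beta.
table = {
    (alpha, beta): simulate_s_prime(alpha, beta, n_samples=50, n_trials=100, scales=SCALES)
    for alpha in (0.3, 0.5)
    for beta in (0.3, 0.4)
}
```

The number of simulated samples per trial should match the number of real samples, since the spread of a correlation coefficient depends strongly on sample size.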
Computing the Combinatorial Effect as a Z-score
Compute Score S and place it in the distribution of the corresponding Score S'.
Z-score: the difference between Score S and the average of S', measured as a count of standard deviations.
Z-score = (Score S - avg(S')) / stddev(S')
The z-score is the level of the combinatorial effect: the higher the z-score, the stronger the combinatorial effect.
(Figure: the distribution of Score S' with its average, the position of Score S, and the z-score as the distance between them.)
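The formula translates directly into code. In the sketch below the function name is illustrative, and the average and standard deviation of S' are hypothetical values chosen so that a Score S of 0.8 yields a z-score of about 5.5.

```python
def z_score(score_s, s_prime_mean, s_prime_stddev):
    """Distance of Score S from the average of S', in standard deviations of S'."""
    return (score_s - s_prime_mean) / s_prime_stddev


# Hypothetical S' statistics for the matching alpha and beta.
z = z_score(0.8, s_prime_mean=0.25, s_prime_stddev=0.1)  # about 5.5
```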
Summary of the Proposed Algorithm
1. Try all combinations of A, B and C in the expression data: (A,B)→C, (A,B)→D, (A,B)→E, (A,B)→F, ..., (A,C)→B, (A,C)→D, ..., (B,C)→A, (B,C)→D, ...
2. For each combination, compute the maximum correlation coefficient over all scales of A and B; this is Score S (for example, correlations of 0.3, 0.8 and 0.5 give Score S = 0.8).
3. Compute the z-score of each combination from the distribution of S' (for example, z-score = 5.5).
4. Rank the combinations by z-score.

Rank  Combination  Z-score
1     (A,C)→B      5.5
2     (B,C)→E      4.9
3     (A,B)→F      4.7
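The four steps can be combined into one short sketch. Everything named here is illustrative: the data, the candidate scales, and in particular `null_stats`, a stand-in for the lookup in the simulated table of the average and standard deviation of S' (it returns a fixed hypothetical pair so that the example stays self-contained).

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sxx = sum((p - mean_x) ** 2 for p in x)
    syy = sum((q - mean_y) ** 2 for q in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def score_s(a, b, c, scales):
    """Maximum correlation of min(k*A, B) with C over the candidate scales."""
    return max(pearson([min(k * x, y) for x, y in zip(a, b)], c) for k in scales)


def rank_combinations(data, scales, null_stats):
    """Return [((A, B, C), Score S, z-score), ...] sorted by descending z-score."""
    results = []
    for a_name, b_name in itertools.combinations(data, 2):  # step 1: every pair
        a, b = data[a_name], data[b_name]
        for c_name, c in data.items():                      # ... and every target
            if c_name in (a_name, b_name):
                continue
            s = score_s(a, b, c, scales)                    # step 2: Score S
            mean, sd = null_stats(pearson(a, b), pearson(b, c))
            results.append(((a_name, b_name, c_name), s, (s - mean) / sd))  # step 3
    results.sort(key=lambda row: row[2], reverse=True)      # step 4: ranking
    return results


# Hypothetical data: protein C is constructed as min(A, B).
data = {
    "A": [1.0, 4.0, 2.0, 5.0, 3.0],
    "B": [3.0, 2.0, 5.0, 1.0, 4.0],
    "C": [1.0, 2.0, 2.0, 1.0, 3.0],
    "D": [2.0, 1.0, 3.0, 3.0, 1.0],
}


def null_stats(alpha, beta):
    """Placeholder for the simulated S' table lookup (hypothetical fixed values)."""
    return 0.3, 0.1


ranking = rank_combinations(data, [0.25, 0.5, 1.0, 2.0, 4.0], null_stats)
```

With n proteins there are C(n, 2) pairs and n - 2 targets per pair, so the cost grows roughly with the cube of the number of proteins times the number of candidate scales.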
Evaluation
We applied our method to real expression data.
Data: protein expression data of black cattle, with 195 samples and 879 proteins.
Task: finding combinatorial protein-protein interactions using our method.
The Expression Data Follow a Normal Distribution
Using the Jarque-Bera test at a confidence level of 95%, we tested whether the expression data of each protein follow a normal distribution.
Result: 454 of the 879 proteins follow a normal distribution.
Thus, we use these 454 proteins for the evaluation.
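A per-protein normality filter of this kind can be sketched with the standard library alone. The Jarque-Bera statistic is JB = n/6 · (skewness² + (kurtosis - 3)²/4), which is asymptotically chi-squared with 2 degrees of freedom, so its p-value is exp(-JB/2). The function names and example data are illustrative, and the asymptotic p-value is only a rough approximation for small samples.

```python
import math


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # biased central moments
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)  # chi-squared survival function with 2 d.o.f.
    return jb, p_value


def looks_normal(values, significance=0.05):
    """True if normality is not rejected at the given significance level."""
    return jarque_bera(values)[1] >= significance


# Hypothetical expression levels of two proteins.
symmetric = [float(v) for v in range(1, 21)]
skewed = [1.0] * 19 + [100.0]

keep_symmetric = looks_normal(symmetric)  # not rejected
keep_skewed = looks_normal(skewed)        # rejected
```

Passing the test means only that normality is not rejected; it does not prove that the data are normally distributed.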
Results
We found many combinations of proteins that would have a combinatorial effect.
The maximum z-score is 11.0.
Combinations whose z-score is more than about 5.5 (p-value less than 0.05 / C(454, 3)) would have a combinatorial effect at a confidence level of 95%.
(Figure: histogram of z-scores. Horizontal axis: z-score. Vertical axis: number of combinations.)
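The corrected threshold can be reproduced approximately as follows. The sketch divides 0.05 by the number of protein triples and converts the resulting p-value into a z-value, assuming a one-sided tail of the standard normal distribution; the slides do not state which tail convention was used, so the computed value may differ somewhat from the rounded figure quoted above.

```python
import math
from statistics import NormalDist

n_proteins = 454
n_tests = math.comb(n_proteins, 3)  # number of protein triples
p_threshold = 0.05 / n_tests        # corrected significance level per test

# One-sided z threshold (assumption): the z-value whose upper tail equals p_threshold.
z_threshold = -NormalDist().inv_cdf(p_threshold)
```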
Comparing Z-scores with the Normal Distribution
We compare the histogram of the real data with the histogram expected without combinatorial effect.
The latter is created by scaling the normal distribution by the number of trials, C(454, 3).
It is inferred that the data include a considerable amount of combinatorial effect.
(Figure, left: histogram of real data, the estimated distribution of z-scores obtained from the real data. Figure, right: histogram without combinatorial effect, the distribution of z-scores under the assumption of no combinatorial effect. Horizontal axes: z-score. Vertical axes: number of combinations.)
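One way to build such a reference histogram is to multiply the number of trials by the probability mass of the standard normal distribution in each bin. This is a sketch under that assumption; the function name and the bin edges are hypothetical.

```python
import math
from statistics import NormalDist


def expected_null_histogram(n_trials, edges):
    """Expected number of z-scores in each bin if z were standard normal."""
    normal = NormalDist()
    return [
        n_trials * (normal.cdf(hi) - normal.cdf(lo))
        for lo, hi in zip(edges, edges[1:])
    ]


n_trials = math.comb(454, 3)
edges = list(range(-8, 9))  # unit-width bins from -8 to 8 (hypothetical)
expected = expected_null_histogram(n_trials, edges)

# Expected number of combinations with a z-score between 5 and 6 by chance alone.
expected_5_to_6 = expected[edges.index(5)]
```

Bins in which the real data contain far more combinations than this expectation are the ones that suggest a genuine combinatorial effect.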
The Ranking Based on Z-score
The ranking table shows that:
Combinations with a low Score S are retrieved.
The same protein tends to appear many times.
(Table: the ranking of z-scores obtained from the real data, with columns Rank, A Protein Num, B Protein Num, C Protein Num, Correlation of A-C, Correlation of B-C, Score S and Z-score.)
Conclusion
Summary
We proposed a method to estimate the combinatorial effect among three proteins from protein expression data.
Applying the method to real data, we found many combinations that would have a combinatorial effect.
Future work
To confirm the reliability of the method, we plan to study whether the combinations found include well-known protein-protein interactions.