JCKBSE2010 Kaunas Predicting Combinatorial Protein-Protein Interactions from Protein Expression Data Based on Correlation Coefficient Sho Murakami, Takuya.

Presentation transcript:

JCKBSE2010 Kaunas Predicting Combinatorial Protein-Protein Interactions from Protein Expression Data Based on Correlation Coefficient Sho Murakami, Takuya Yoshihiro, Etsuko Inoue and Masaru Nakagawa Faculty of Systems Engineering, Wakayama University

Agenda
Background
Combinatorial Protein-Protein Interactions
The Proposed Data Mining Method
Evaluation
Conclusion

Background
Finding interactions among genes and proteins is important.
Many data-mining algorithms have been proposed so far to discover gene-gene (or protein-protein) interactions.
One of the main data sources is gene or protein expression data.
Figure: a microarray (for gene expression), where color strength indicates the expression level, and 2D electrophoresis (for protein expression), where the size of a spot indicates the expression level.

Related Work for Interaction Discovery: Bayesian Networks
Bayesian networks discover interactions from expression data based on conditional probabilities among events.
Example: to discover protein-protein interactions among proteins A, B and C,
1. Define events A, B and C (for example, the event "C is expressed").
2. Compute the conditional probabilities relating A, B and C over the samples.
If the conditional probability is high, an interaction is predicted.

Problems of Bayesian Networks
Bayesian networks require a large number of samples.
Many samples are necessary to obtain statistically reliable results: the question is whether each area of the event space defined by A, B and C contains sufficient samples.
For genes, microarrays make experiments cheap and fast.
For proteins, 2D electrophoresis is time-consuming and expensive.

The Objective of Our Study
Finding combinatorial protein-protein interactions from small-size protein expression data.

Expression Data
2D electrophoresis is performed for each sample, and the result includes the expression level of each protein.
Expression levels are obtained by measuring the size of the spot areas.
Normalization is applied as pre-processing.
Figure: gel images for sample1, sample2, sample3 and so on. Each black area indicates a protein, and the size of the area represents its expression level.

Model of Protein-Protein Interaction
Considered model: two proteins A and B affect the expression level of another protein C only when both A and B are expressed, for example through a complex of A and B.
Usually, only the sole effects of A and of B on C are considered.
The combinatorial effect works on C only if both A and B exist. This combinatorial effect is what we want to estimate.

Predicting Interactions by Correlation Coefficient
We compute the correlation coefficient between the pair (A,B) and C.
A correlation coefficient requires fewer samples.
The amount of the complex (A,B) in a sample is estimated by min(A,B), taken over the expression levels of A and B in that sample.
This amount would affect C, so we compute the correlation between min(A,B) and C.
The total effect on C is considered high if the correlation is high.
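The min-based estimate above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `pearson` and `complex_correlation` and the toy expression values are invented here and do not come from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # convention chosen for this sketch when an input is constant
    return sxy / math.sqrt(sxx * syy)

def complex_correlation(a, b, c):
    """Correlation between the estimated complex amount min(A, B) and C."""
    complex_amount = [min(ai, bi) for ai, bi in zip(a, b)]
    return pearson(complex_amount, c)

# Toy data (invented): C is a linear function of min(A, B) in every sample.
a = [1.0, 4.0, 2.0, 5.0, 3.0]
b = [3.0, 1.0, 2.5, 4.0, 0.5]
c = [2 * min(x, y) + 1 for x, y in zip(a, b)]
r = complex_correlation(a, b, c)
print(r)
```

Because C is built as a linear function of min(A,B), the correlation in this toy case is essentially 1; real expression data would give lower values.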

The Problem of Scale Difference
The expression level measured for one molecule differs among proteins, so equal expression levels of A and B do not always combine in equal amounts.
Therefore, taking the min of the raw levels cannot express the correct amount of the complex.
Solution: correct the scale of A so that it reflects the expression level required for a complex. After this correction, taking the min leads to the correct amount of the complex.
Figure: expression levels of proteins A and B and the estimated amount of the complex AB, before scaling (the amount of complex is not correct) and after scaling (the amount of complex is correct).

How to Determine the Correct Scale?
We try several scales of A (k1·A, k2·A, k3·A, ...) and, for each scale k, compute the correlation coefficient between min(k·A, B) and C.
We select the scale that leads to the maximum correlation coefficient.
Score S, the total effect of (A,B) on C, is this maximum correlation.
If an interaction following our model exists, a high correlation value must appear.
Figure: an example in which different scales give correlations of 0.1, 0.2, 0.3 and 0.7.
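The scale search can be sketched as a loop over candidate factors. The slides do not say which scales are tried, so the grid `SCALES` below is purely an assumption, as are the function names and the toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # convention chosen for this sketch
    return sxy / math.sqrt(sxx * syy)

def score_s(a, b, c, scales):
    """Return (best correlation, best scale k) over min(k*A, B) versus C."""
    best_r, best_k = -math.inf, None
    for k in scales:
        complex_amount = [min(k * ai, bi) for ai, bi in zip(a, b)]
        r = pearson(complex_amount, c)
        if r > best_r:
            best_r, best_k = r, k
    return best_r, best_k

# Hypothetical grid of scales; the actual grid is not given in the slides.
SCALES = [0.25, 0.5, 1.0, 2.0, 4.0]

# Toy data (invented): C follows min(2*A, B), so k = 2 should win.
a = [1.0, 4.0, 2.0, 5.0, 3.0, 0.5]
b = [3.0, 1.0, 2.5, 4.0, 0.5, 6.0]
c = [min(2.0 * x, y) for x, y in zip(a, b)]
best_r, best_k = score_s(a, b, c, SCALES)
print(best_k, best_r)
```

Since the correlation coefficient is unchanged when its input is multiplied by a positive constant, scaling only A (with factors both below and above 1) covers the relative scale between A and B.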

Estimating Combinatorial Effect from Score S
Score S consists of the "sole effect" and the "combinatorial effect".
We compute score S', which is score S under the assumption of no combinatorial effect, as a statistical distribution.
The difference between score S and score S' is the level of the combinatorial effect.

How to Compute the Distribution of Score S'?
We assume that the expression levels of proteins A, B and C follow normal distributions, and obtain the distribution of score S' by computer simulation.
① Randomly create distributions of A, B and C where the correlation coefficient of A-B is α and that of B-C is β.
② Repeat the computation of score S to obtain the distribution of score S'.
③ Create a table of the average and standard deviation of S' for each α and β.
In this way we obtain the distribution for each pair of α and β (for example, α=0.5, β=0.3 and α=0.5, β=0.4). In each table cell, the upper value is the average and the lower value is the standard deviation.
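A minimal Monte Carlo sketch of this simulation follows. Several details are assumptions made here, not taken from the slides: A and C are generated as linear functions of B plus independent Gaussian noise (which fixes corr(A,B)=α and corr(B,C)=β), the levels are shifted by a hypothetical `mean_level` to keep them positive, and the scale grid, sample count and trial count are arbitrary.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def score_s(a, b, c, scales):
    """Maximum correlation of min(k*A, B) with C over the candidate scales."""
    return max(
        pearson([min(k * ai, bi) for ai, bi in zip(a, b)], c) for k in scales
    )

def simulate_null(alpha, beta, n_samples, n_trials, scales,
                  mean_level=10.0, seed=0):
    """Mean and standard deviation of score S' with no combinatorial effect."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        zb = [rng.gauss(0, 1) for _ in range(n_samples)]
        # A and C depend on B only linearly, so C carries no min(A, B) term.
        za = [alpha * z + math.sqrt(1 - alpha ** 2) * rng.gauss(0, 1) for z in zb]
        zc = [beta * z + math.sqrt(1 - beta ** 2) * rng.gauss(0, 1) for z in zb]
        a = [mean_level + z for z in za]
        b = [mean_level + z for z in zb]
        c = [mean_level + z for z in zc]
        scores.append(score_s(a, b, c, scales))
    mean = sum(scores) / n_trials
    var = sum((s - mean) ** 2 for s in scores) / (n_trials - 1)
    return mean, math.sqrt(var)

SCALES = [0.5, 1.0, 2.0]  # hypothetical grid
# Table of (average, stddev) of S' indexed by (alpha, beta).
table = {
    (al, be): simulate_null(al, be, n_samples=30, n_trials=100, scales=SCALES)
    for al in (0.3, 0.5)
    for be in (0.3, 0.5)
}
print(table[(0.5, 0.3)])
```

A real table would presumably use the actual sample count and many more trials per cell; the small numbers here only keep the sketch fast.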

Computing Combinatorial Effect as Z-score
Place score S in the corresponding distribution of S'.
The z-score measures the difference between score S and the average of S' as a count of standard deviations:
Z-score = (score S - avg(S')) / stddev(S')
The z-score represents the level of the combinatorial effect: the higher the z-score, the stronger the combinatorial effect.
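As a small worked example of the formula, the sketch below looks up the average and standard deviation of S' for a given (α, β) and converts a score S into a z-score. The table entry and the score are hypothetical numbers chosen for illustration only.

```python
def z_score(score_s, null_mean, null_sd):
    """Distance of score S from the average of S', in standard deviations."""
    if null_sd <= 0:
        raise ValueError("standard deviation of S' must be positive")
    return (score_s - null_mean) / null_sd

# Hypothetical table entry: (average, stddev) of S' for alpha=0.5, beta=0.3.
null_table = {(0.5, 0.3): (0.40, 0.10)}

mean, sd = null_table[(0.5, 0.3)]
z = z_score(0.70, mean, sd)  # hypothetical score S = 0.70
print(z)
```

With these made-up values, S lies three standard deviations above the average of S'.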

Summary of the Proposed Algorithm
1. Try all combinations of A, B and C from the expression data: (A,B)→C, (A,B)→D, (A,B)→E, (A,B)→F, ..., (A,C)→B, (A,C)→D, (A,C)→E, (A,C)→F, ..., (B,C)→A, (B,C)→D, (B,C)→E, (B,C)→F, ...
2. For each combination, compute the maximum correlation coefficient among all scales of A and B to obtain score S. Example: scales giving correlations of 0.3, 0.8 and 0.5 lead to score S = 0.8.
3. Compute the z-score from the distribution of S'. Example: z-score = 5.5.
4. Create a ranking of the combinations by z-score.

Rank  Combination  Z-score
1     (A,C)→B      5.5
2     (B,C)→E      4.9
3     (A,B)→F      4.7
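The four steps can be put together in a short sketch. Everything named here is an assumption for illustration: `rank_triples`, the scale grid, the toy data, and especially `null_stats`, which stands in for the simulated table of S' and is replaced by a constant placeholder.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def score_s(a, b, c, scales):
    return max(
        pearson([min(k * ai, bi) for ai, bi in zip(a, b)], c) for k in scales
    )

def rank_triples(data, scales, null_stats):
    """Score every (A,B)->C combination and sort by z-score, highest first."""
    names = sorted(data)
    rows = []
    # Unordered pairs suffice: scaling A by k < 1 or k > 1 covers both directions.
    for a_name, b_name in combinations(names, 2):
        for c_name in names:
            if c_name in (a_name, b_name):
                continue
            a, b, c = data[a_name], data[b_name], data[c_name]
            s = score_s(a, b, c, scales)
            mean, sd = null_stats(pearson(a, b), pearson(b, c))
            rows.append(((a_name, b_name, c_name), s, (s - mean) / sd))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows

# Toy expression data (invented): D equals min(A, B) in every sample.
data = {
    "A": [1.0, 4.0, 2.0, 5.0, 3.0, 6.0, 2.5, 4.5],
    "B": [3.0, 1.0, 2.5, 4.0, 0.5, 2.0, 5.0, 3.5],
    "C": [2.0, 7.0, 1.0, 3.0, 8.0, 4.0, 6.0, 5.0],
}
data["D"] = [min(x, y) for x, y in zip(data["A"], data["B"])]

# Placeholder for the simulated S' table: a constant (average, stddev).
null_stats = lambda alpha, beta: (0.3, 0.1)

ranking = rank_triples(data, [0.5, 1.0, 2.0], null_stats)
for triple, s, z in ranking[:3]:
    print(triple, round(s, 3), round(z, 2))
```

The number of combinations grows cubically with the number of proteins, so a pure-Python loop like this is only practical for small inputs.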

Evaluation
We applied our method to real expression data: protein expression data of black cattle.
The number of samples is 195 and the number of proteins is 879.
We searched for combinatorial protein-protein interactions using our method.

The Expression Data Follows Normal Distribution
Using the Jarque-Bera test at a confidence level of 95%, we tested whether the expression data follow a normal distribution.
Result: 454 of the 879 proteins follow a normal distribution.
Thus, we use these 454 proteins for the evaluation.
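For readers unfamiliar with the Jarque-Bera test, here is a standard-library sketch of the textbook statistic, JB = n/6 · (S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis. The p-value uses the asymptotic chi-square distribution with 2 degrees of freedom. The function names and test data are invented, and the slides do not say which implementation was used.

```python
import math
from statistics import NormalDist

def jarque_bera(xs):
    """Return (JB statistic, asymptotic p-value) for a sample."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of the chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

def looks_normal(xs, alpha=0.05):
    """True if normality is not rejected at significance level alpha."""
    return jarque_bera(xs)[1] >= alpha

# Invented test samples: evenly spaced normal quantiles, and a skewed sample.
near_normal = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
skewed = [0.0] * 95 + [100.0] * 5

print(looks_normal(near_normal), looks_normal(skewed))
```

Filtering proteins would then amount to keeping those whose expression column passes `looks_normal`. Note that the chi-square approximation is only asymptotic and can be rough for small samples.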

Results
We found many combinations of proteins that would have a combinatorial effect.
The maximum z-score is 11.0.
Combinations whose z-score is more than about 5.5 (p-value less than 0.05/C(454,3)) would have a combinatorial effect at a confidence level of 95%.
Figure: the histogram of z-scores (horizontal axis: z-score; vertical axis: number of combinations).
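The multiple-testing threshold can be recomputed with the standard library. This sketch assumes a Bonferroni-style correction over C(454,3) trials and a one-sided test on a standard normal z-score; the slides state neither detail explicitly, so the value it prints need not match the rounded figure quoted above exactly.

```python
from math import comb
from statistics import NormalDist

n_proteins = 454
alpha = 0.05

# Number of trials used for the correction, as written on the slide: C(454, 3).
n_trials = comb(n_proteins, 3)

# Per-test p-value threshold after dividing alpha by the number of trials.
p_threshold = alpha / n_trials

# One-sided z-score whose upper-tail probability equals p_threshold
# (assumes z-scores follow a standard normal under no combinatorial effect).
z_threshold = -NormalDist().inv_cdf(p_threshold)

print(n_trials, p_threshold, z_threshold)
```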

Comparing Z-scores with Normal Distribution
We compare the histogram obtained from the real data with the one expected without combinatorial effect.
The latter is created by scaling the normal distribution by the number of trials (C(454,3)).
It is inferred that this data includes a considerable amount of combinatorial effect.
Figure: histogram of z-scores from real data and histogram without combinatorial effect (horizontal axis: z-score; vertical axis: number of combinations).

The Ranking Based on Z-score
The ranking table shows that combinations with a low score S are retrieved, and that the same protein tends to appear many times.
Table: the ranking of z-scores obtained from real data. Columns include Rank, A protein number, B protein number, C protein number, Correlation of A-C, Correlation of B-C, Score S and Z-score.

Conclusion
Summary: We proposed a method to estimate the combinatorial effect of three proteins from protein expression data. Applying the method to real data, we found many combinations that would have a combinatorial effect.
Future work: To confirm the reliability, we plan to study whether the found combinations include well-known protein-protein interactions.