Presentation is loading. Please wait.

Presentation is loading. Please wait.

(c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

Similar presentations


Presentation on theme: "(c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University."— Presentation transcript:

1 (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,02.27 14:30-15:45)

2 (c) M Gerstein '06, gerstein.info/talks 2 Predicting Networks via Bayesian Integration: Problem Motivation

3 (c) M Gerstein '06, gerstein.info/talks 3 Combining networks forms an ideal way of integrating diverse information Metabolic pathway Transcriptional regulatory network Physical protein- protein Interaction Co-expression Relationship Part of the TCA cycle Genetic interaction (synthetic lethal) Signaling pathways

4 (c) M Gerstein '06, gerstein.info/talks 4 Binding Experiments on Subunit Pairs 1111010 Pull-down 2 1101010110111001010000000000 1101010 Pull-down 1 1101010110111101010000010000 0011111101 Cross-linking 1111111111 1 Pull-down 3 11000110 Far Western 3 100100 Far Western 2 111111111100000100000000000000 11111001 Far Western 1 0000000 Interaction experiments before structure was known 11111111123566666888899910 12 23568910111235689101112568910111268910111289101112910111210111211 12 222222233333355555 Subunits

5 (c) M Gerstein '06, gerstein.info/talks 5 RNA polymerase II: Structure Which subunits interact? Based on Binding experiments Source: Cramer et al., 2000, Science, 288:632-633 Compare with Gold Std. Structure 5 6 8 9 1011 12 1 2 3 Source: Edwards et al., 2002, Trends in Genetics

6 (c) M Gerstein '06, gerstein.info/talks 6 Gold-Standard Positives 5 6 8 9 1011 12 1 2 3 Gold-Standard Positive (GSTD+): 13 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits

7 (c) M Gerstein '06, gerstein.info/talks 7 Gold-Standard Negatives Gold-Standard Negative (GSTD-): 32 5 6 8 9 1011 12 1 2 3 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits

8 (c) M Gerstein '06, gerstein.info/talks 8 RNA Polymerase II: Gold-Standards Gold-Standard Negative (GSTD-): 32 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits Gold-Standard Positive (GSTD+): 13 5 6 8 9 1011 12 1 2 3

9 (c) M Gerstein '06, gerstein.info/talks 9 Assess Quality and Coverage of PPints 1111010 Pull-down 2 1101010110111001010000000000 1101010 Pull-down 1 1101010110111101010000010000 0011111101 Cross-linking 1111111111 1 Pull-down 3 11000110 Far Western 3 100100 Far Western 2 111111111100000100000000000000 11111001 Far Western 1 0000000 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits True False GSTD+ GSTD- Likelihood Ratio for Feature f: “Quality Score”

10 (c) M Gerstein '06, gerstein.info/talks 10 Data integration: RNA polymerase II

11 (c) M Gerstein '06, gerstein.info/talks 11 Integrate using naive Bayes classifier Data integration: RNA polymerase II

12 (c) M Gerstein '06, gerstein.info/talks 12 Data integration: RNA polymerase II Integrate using naive Bayes classifier

13 (c) M Gerstein '06, gerstein.info/talks 13 Data integration: RNA polymerase II Integrate using naive Bayes classifier (Cross validate)

14 (c) M Gerstein '06, gerstein.info/talks 14 Weighted Voting: the Likelihood Ratio Maj. Vote: 0 = round(avg( 0 + 0 + 0 + 1 + 1 + 0 + 0 ) With weights: likelihood ratio L = L 1 + L 2 + L 3 …

15 (c) M Gerstein '06, gerstein.info/talks 15 Predicting Networks via Bayesian Integration: Intuition & Formalism

16 (c) M Gerstein '06, gerstein.info/talks 16 Classification by Voting (N1) Derived from "perceptron model" in earlier class R = + b

17 (c) M Gerstein '06, gerstein.info/talks 17 Classification by Voting (N2)

18 (c) M Gerstein '06, gerstein.info/talks 18 Bayes Rule Which is shorthand for: [From Mitchell, Machine Learning]

19 (c) M Gerstein '06, gerstein.info/talks 19 More Bayes Rule

20 (c) M Gerstein '06, gerstein.info/talks 20 More Bayes Rule

21 (c) M Gerstein '06, gerstein.info/talks 21 Feature Correlation and Fully Connected Bayes

22 (c) M Gerstein '06, gerstein.info/talks 22 Estimating Probabilities We have so far estimated P(X=x | Y=y) by the fraction n x|y /n y, where n y is the number of instances for which Y=y and n x|y is the number of these for which X=x This is a problem when n x is small  E.g., assume P(X=x | Y=y)=0.05 and the training set is s.t. that n y =5. Then it is highly probable that n x|y =0  The fraction is thus an underestimate of the actual probability  It will dominate the Bayes classifier for all new queries with X=x

23 (c) M Gerstein '06, gerstein.info/talks 23 m-estimate Replace n x|y /n y by: Where p is our prior estimate of the probability we wish to determine and m is a constant  Typically, p = 1/k (where k is the number of possible values of X)  m acts as a weight (similar to adding m virtual instances distributed according to p)

24 (c) M Gerstein '06, gerstein.info/talks 24 Training and Testing Set Cross Validation: Leave one out, seven-fold Credits: Munson, 1995; Garnier et al., 1996 4-fold Cross Validation

25 (c) M Gerstein '06, gerstein.info/talks 25 Intuition N2.7 : ROC Curve & the prior

26 (c) M Gerstein '06, gerstein.info/talks 26 Predicting Networks via Bayesian Integration: Worked Examples

27 (c) M Gerstein '06, gerstein.info/talks 27 Network Gold-Standards Prediction of protein interactions: Bayesian integration [Jansen, Yu, et al., Science; Yu, et al., Genome Res.] Feature 2, e.g. Y2H Feature 1, e.g. co-expression Gold-standard + Gold-standard –

28 (c) M Gerstein '06, gerstein.info/talks 28 Network Gold-Standards [Jansen, Yu, et al., Science; Yu, et al., Genome Res.] Feature 2, e.g. Y2H Feature 1, e.g. co-expression Gold-standard + Gold-standard – Prediction of protein interactions: Bayesian integration Frac. of Gold-Std Negatives with Feature Frac. of Gold-Std Positives with Feature Likelihood Ratio for Feature i: = "Quality Score" =

29 (c) M Gerstein '06, gerstein.info/talks 29 Feature 2, e.g. Y2H Feature 1, e.g. co-expression Gold-standard + Gold-standard – Network Gold-Standards L 1 = (4/ 4 )/(3/6) =2 [Jansen, Yu, et al., Science; Yu, et al., Genome Res.] Likelihood Ratio for Feature i: Prediction of protein interactions: Bayesian integration

30 (c) M Gerstein '06, gerstein.info/talks 30 Feature 2, e.g. Y2H Feature 1, e.g. co-expression Gold-standard + Gold-standard – Network Gold-Standards L 1 = ( 4 /4)/(3/6) =2 [Jansen, Yu, et al., Science; Yu, et al., Genome Res.] Likelihood Ratio for Feature i: Prediction of protein interactions: Bayesian integration

31 (c) M Gerstein '06, gerstein.info/talks 31 Feature 2, e.g. Y2H Feature 1, e.g. co-expression Gold-standard + Gold-standard – Network Gold-Standards L 1 = (4/4)/(3/ 6 ) =2 [Jansen, Yu, et al., Science; Yu, et al., Genome Res.] Likelihood Ratio for Feature i: Prediction of protein interactions: Bayesian integration

32 (c) M Gerstein '06, gerstein.info/talks 32 Feature 2, e.g. Y2H Feature 1, e.g. co-expression Gold-standard + Gold-standard – Network Gold-Standards L 1 = (4/4)/( 3 /6) =2 [Jansen, Yu, et al., Science; Yu, et al., Genome Res.] Likelihood Ratio for Feature i: Prediction of protein interactions: Bayesian integration

33 (c) M Gerstein '06, gerstein.info/talks 33 Feature 2, e.g. Y2H Feature 1, e.g. co-expression Gold-standard + Gold-standard – Network Gold-Standards [Jansen, Yu, et al., Science; Yu, et al., Genome Res.] Likelihood Ratio for Feature i: Prediction of protein interactions: Bayesian integration L 1 = (4/4)/(3/6) =2

34 (c) M Gerstein '06, gerstein.info/talks 34 Feature 2, e.g. Y2H Feature 1, e.g. co-expression Gold-standard + Gold-standard – Network Gold-Standards L 1 = (4/4)/(3/6) =2 L2 = (3/4)/(3/6) =1.5 For each protein pair: LR = L1  L2 log(LR) = log(L 1 ) + log(L 2 ) [Jansen, Yu, et al., Science; Yu, et al., Genome Res.] Likelihood Ratio for Feature i: Prediction of protein interactions: Bayesian integration Weighted Voting (Assuming uncorrelated features and Naïve Bayes)

35 (c) M Gerstein '06, gerstein.info/talks 35 Feature 2, e.g. Y2H Feature 1, e.g. co-expression Gold-standard + Gold-standard – Network Gold-Standards L 1 = (4/4)/(3/6) =2 L2 = (3/4)/(3/6) =1.5 For each protein pair: LR = L1  L2 log(LR) = log(L 1 ) + log(L 2 ) [Jansen, Yu, et al., Science; Yu, et al., Genome Res.] Likelihood Ratio for Feature i: Prediction of protein interactions: Bayesian integration

36 (c) M Gerstein '06, gerstein.info/talks 36 4 4 3 0 TPLR Cutoffs 4 3 2 1 Feature 1; L 1 = 2 Feature 2; L 2 = 1.5 Gold-standard + Gold-standard – Predicted Positive Predicted Negative LR cutoffs 1 No Prediction made Protein Bayesian Integration and ROC Curves 2 <3 22 1 3< 3 <2 3/2< 4  3/2

37 (c) M Gerstein '06, gerstein.info/talks 37 6 3 0 0 FPLR Cutoffs 4 3 2 1 Feature 1; L 1 = 2 Feature 2; L 2 = 1.5 Gold-standard + Gold-standard – Predicted Positive Predicted Negative LR cutoffs 1 No Prediction made Protein Bayesian Integration and ROC Curves 2 <3 22 1 3< 3 <2 3/2< 4  3/2

38 (c) M Gerstein '06, gerstein.info/talks 38 64 34 03 00 FPTPLR Cutoffs 4 3 2 1 Feature 1; L 1 = 2 Feature 2; L 2 = 1.5 Gold-standard + Gold-standard – Predicted Positive Predicted Negative LR cutoffs 1 No Prediction made Protein 4 3 2 1 FPR=FP/N "Error Rate" TPR=TP/P "Coverage" ROC Curve 1/21 1 Bayesian Integration and ROC Curves 2 <3 22 1 3< 3 <2 3/2< 4  3/2

39 (c) M Gerstein '06, gerstein.info/talks 39 Likelihood Ratios 1101010 Pull-down 1 1101010110111101010000010000 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits True False GSTD+ GSTD- Likelihood Ratio for Feature f:

40 (c) M Gerstein '06, gerstein.info/talks 40 Calculating Likelihood Ratios 10100 Pull-down 1 10111 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits True False GSTD+ GSTD-

41 (c) M Gerstein '06, gerstein.info/talks 41 Calculating Likelihood Ratios 11 Pull-down 1 10101101101010000010000 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits True False GSTD+ GSTD- =1.34 1101010 Pull-down 1 1101010110111101010000010000 =0.70

42 (c) M Gerstein '06, gerstein.info/talks 42 Calculating Likelihood Ratios 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits True False GSTD+ GSTD- Pull-down 1 L1 = (6/13) / (11/32) = 1.34L0 = (4/13) / (14/32) = 0.70 Pull-down 2 L1 = (7/13) / (9/32) = 1.91L0 = (2/13) / (16/32) = 0.31 Pull-down 3 L1 = (2/13) / (3/32) = 1.64L0 = (2/13) / (2/32) = 2.46 Cross-linking L1 = (10/13) / (7/32) = 3.52L0 = (0/13) / (3/32) = 0 Far Western 1 L1 = (2/13) / (4/32) = 1.23L0 = (3/13) / (6/32) = 1.23 Far Western 2 L1 = (6/13) / (5/32) = 2.95L0 = (2/13) / (17/32) = 0.29 Far Western 3 L1 = (1/13) / (1/32) = 2.46L0 = (2/13) / (2/32) = 2.46 1111010 Pull-down 2 1101010110111001010000000000 1101010 Pull-down 1 1101010110111101010000010000 0011111101 Cross-linking 1111111111 1 Pull-down 3 11000110 Far Western 3 100100 Far Western 2 111111111100000100000000000000 11111001 Far Western 1 0000000

43 (c) M Gerstein '06, gerstein.info/talks 43 Data Integration: ROC-Curve 001011100 Combined (Bayes) 111110111000010000000000000000000000 1111010 Pull-down 2 1101010110111001010000000000 1101010 Pull-down 1 1101010110111101010000010000 0011111101 Cross-linking 1111111111 1 Pull-down 3 11000110 Far Western 3 100100 Far Western 2 111111111100000100000000000000 11111001 Far Western 1 0000000 3.5218.211.113.926.60.220 0.290.060.120.220.290.06 0.22 0.360.080.910.2200.522.1613219.50.533.685.5212.710.42.2526.60.2202.2511.118.210.42.250.060.29 0 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits True False GSTD+ GSTD- “Weighted Voting”

44 (c) M Gerstein '06, gerstein.info/talks 44 Data Integration: ROC Curve 001011100 Combined (Bayes) 111110111000010000000000000000000000 3.5218.211.113.926.60.220 0.290.060.120.220.290.06 0.22 0.360.080.910.2200.522.1613219.50.533.685.5212.710.42.2526.60.2202.2511.118.210.42.250.060.29 0 111010101 Majority 111110111000010000000000000000000000 001010000 Intersection 111100111000010000000000000000000000 111111111 Union 111111111100011100100100000100000000 0.2 0.4 0.6 0.8 1.0 0.20.40.60.81.0 FPR=FP/N=1-Specificity TPR=TP/P=Sensitivity ROC Curves of RNA Polymerase II 11111111123566666888899910 12222222233333355555 Subunits 23568910111235689101112568910111268910111289101112910111210111211 12 Subunits True False GSTD+ GSTD- Bayesian Integration Majority Intersection Union

45 (c) M Gerstein '06, gerstein.info/talks 45 Predicting Networks via Bayesian Integration: Features Correlation

46 (c) M Gerstein '06, gerstein.info/talks 46 Correlations between similar features

47 (c) M Gerstein '06, gerstein.info/talks 47 Intuition N3 : Feature Correlation

48 (c) M Gerstein '06, gerstein.info/talks 48 Naive Bayes

49 (c) M Gerstein '06, gerstein.info/talks 49 A ‘correct’ factorisation

50 (c) M Gerstein '06, gerstein.info/talks 50 A Typical BBN


Download ppt "(c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University."

Similar presentations


Ads by Google