Presentation is loading. Please wait.

Presentation is loading. Please wait.

(c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #2 - Application Mark Gerstein, Yale.

Similar presentations


Presentation on theme: "(c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #2 - Application Mark Gerstein, Yale."— Presentation transcript:

1 (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.01 14:30-15:45)

2 (c) M Gerstein '06, gerstein.info/talks 2 Predicting Networks via Bayesian Integration: Real Thing but with a few features

3 (c) M Gerstein '06, gerstein.info/talks 3 Papers on Predicting Protein Interactions  A Enright et al. ( 1999 ) "Protein interaction maps for complete genomes based on gene fusion events." Nature. 402(6757):86-90.  E Marcotte et al. (1999) "A Combined Algorithm for Genome-Wide Prediction of Protein Function." Nature 402, 83-86 (1999).  E Marcotte et al. (1999) "Detecting Protein Function & Protein-Protein Interactions from Genome Sequences." Science 285, 751-753  M Pellegrini et al. (1999) "Assigning protein functions by comparative genome analysis: protein phylogenetic profiles." Proc.Natl. Acad. Sci. 96, 4285-4288.  R Jansen et al. (2003). "A Bayesian networks approach for predicting protein-protein interactions from genomic data." Science 302: 449-53.  I Lee et al. (2004) "A Probabilistic Functional Network of Yeast Genes". Science 206: 1555-1558  H Yu et al. (2004) "Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs." Genome Res 14: 1107-18.  L Lu et al. (2005) "Assessing the limits of genomic data integration for predicting protein networks." Genome Res 15: 945-53.  A Ramani et al. (2005) "Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome." Genome Biology 6:r40.  Xia et al. ( 2006 ) "Integrated prediction of the helical membrane protein interactome in yeast." J Mol Biol. 357:339-49

4 (c) M Gerstein '06, gerstein.info/talks 4 Bayesian Formalism  R Jansen et al. (2003). "A Bayesian networks approach for predicting protein-protein interactions from genomic data." Science 302: 449-53.

5 (c) M Gerstein '06, gerstein.info/talks 5 Overview of information integrated: PIE & PIP Noisy high-throughput protein-protein interaction data Features that are not explicit interactions but which are suggestive

6 (c) M Gerstein '06, gerstein.info/talks 6 co-essentiality & gene-expression correlations if A and B interact  both are more likely be essential or non-essential  less likely that A is essential and B is non-essen (and the converse) Gene Expression Correlation Discussed earlier

7 (c) M Gerstein '06, gerstein.info/talks 7 Subcellular localization as a gold-standard negative: If A and B interact they have to be in the same cellular compartment Nucleus Membrane Extra- cellular [secreted] ER Cytoplasm Mitochondria Golgi

8 (c) M Gerstein '06, gerstein.info/talks 8 Predicting Networks via Bayesian Integration: Relating Function to Interactions

9 (c) M Gerstein '06, gerstein.info/talks 9 GO DAG

10 (c) M Gerstein '06, gerstein.info/talks 10 Proteins classified by GO wikipedia Natural Science ChemistryBiology Biochemistry Glycolysis Also: DMOZ, Y! Define a continuous measure of similarity on a DAG

11 (c) M Gerstein '06, gerstein.info/talks 11 Simple Measures Shortest Distance Lowest Common Ancestor

12 (c) M Gerstein '06, gerstein.info/talks 12 Resnik's Measure

13 (c) M Gerstein '06, gerstein.info/talks 13 New Measure N/N AB

14 (c) M Gerstein '06, gerstein.info/talks 14 Differences between Methods

15 (c) M Gerstein '06, gerstein.info/talks 15 Relationship of Functional Similarity to Protein Interactions (Exclude Interactions in Evidence Codes)

16 (c) M Gerstein '06, gerstein.info/talks 16 Relating Functions to Interactions and vice versa Functions associated with nodes, Interactions with edges Interaction Prediction  Two nodes with consistent labeling predicts interactions. Turn it around: function prediction  Two connected nodes should be consistently labeled. Use this to predict function from interactions.  Constraint satisfaction problem.

17 (c) M Gerstein '06, gerstein.info/talks 17 Predicting Networks via Bayesian Integration: Practical Issues in Assessing Results

18 (c) M Gerstein '06, gerstein.info/talks 18 Combination of All Features Genetic Interactions GO MIPS mRNA co-expr. Individual Features and their Integration for Yeast Membrane Protein Interaction Prediction Figure 4: ROC curve for logistic regression classifiers based on all features combined, as well as each individual feature alone. (a) all features combined, (b) MIPS functional similarity, (c) GO functional similarity, (d) genetic interaction, (e) membrane co-localization, (f) total mRNA expression, (g) transcriptional co-regulation, (h) total protein abundance, (i) co-essentiality, (j) comparative genomic evidence, (k) mRNA expression correlation, (l) total marginal essentiality. Arrow represents a cost of k=100 used in our final integrated predictions.

19 (c) M Gerstein '06, gerstein.info/talks 19 Integration of Features in PIP Gives Much Higher Likelihood Ratios than Any Individual Feature Jansen et al. Science 302:449 "Weighted Vote"

20 (c) M Gerstein '06, gerstein.info/talks 20 t-SNAREs v-SNAREs H + -transporting ATPase (vacuolar) lipid biosynthesis cytochrome c oxidase cytochrome bc1 complex oligosaccharyltransferase transport COPII carbohydrate transport protein targeting amino acid glycosylation Figure 6: A map of known and a subset of predicted interactions among helical membrane proteins. Nodes represent helical membrane proteins, and edges represent interactions among them. Red edges represent known interactions that are also predicted to interact, blue edges represent other known interactions, and green edges represent ~700 top interaction predictions (ranked by descending logistic regression score) out of a total of 4,145. Purple nodes represent helical membrane proteins that show up in the known interactions, and green nodes represent new helical membrane proteins. Map of Known and Predicted Membrane Protein Interactome in Yeast New KnownKnown  Xia et al. (2006) "Integrated prediction of the helical membrane protein interactome in yeast." J Mol Biol. 357:339-49

21 (c) M Gerstein '06, gerstein.info/talks 21 How do we assess topology prediction? Validate against a gold-standard of known interactions Choice of gold-standard is of paramount importance  A negative set is equally important and often overlooked. Reliably measured negative sets are not available, sets based on exclusion based on genomic annotations (i.e. GO cellular component) are often used. Measure false positive (FP), true positive (TP), false negative (FN) and true negative (TN) rate Often plot results as a ROC-curve (sensitivity vs. specificity)  Other metrics useful (e.g. PPV)  Balance of positives and negatives

22 (c) M Gerstein '06, gerstein.info/talks 22 Comparison of Predictions against a Positive and Negative Gold Standard Threshold "predictions" at different levels and compare to + and - gold standards ROC plot (cross validated) "Error Rate" "Coverage"

23 (c) M Gerstein '06, gerstein.info/talks 23 Effect on Predictions of Large Number of Negatives

24 (c) M Gerstein '06, gerstein.info/talks 24 Importance of Balanced Positive and Negative Examples

25 (c) M Gerstein '06, gerstein.info/talks 25 Predicting Networks via Bayesian Integration: Adding in Even More Features

26 (c) M Gerstein '06, gerstein.info/talks 26 Adding in More Features [Eisenberg, Bork, Boone, Skolnick, Vidal]

27 (c) M Gerstein '06, gerstein.info/talks 27 Evidence Integration + Graph Clustering  I Lee et al. (2004) "A Probabilistic Functional Network of Yeast Genes". Science 206: 1555-1558

28 (c) M Gerstein '06, gerstein.info/talks 28 Integrated Network Showing Context Inferred Linkages  I Lee et al. (2004) "A Probabilistic Functional Network of Yeast Genes". Science 206: 1555-1558

29 (c) M Gerstein '06, gerstein.info/talks 29 ROC-like Plot Showing Power of Integration  I Lee et al. (2004) "A Probabilistic Functional Network of Yeast Genes". Science 206: 1555-1558

30 (c) M Gerstein '06, gerstein.info/talks 30 Predicting Networks via Bayesian Integration: Feature List

31 (c) M Gerstein '06, gerstein.info/talks 31 Phylogenetic Profiles  M Pellegrini et al. (1999) "Assigning protein functions by comparative genome analysis: protein phylogenetic profiles." Proc.Natl. Acad. Sci. 96, 4285-4288.

32 (c) M Gerstein '06, gerstein.info/talks 32 Gene Fusion Method  A Enright et al. (1999) "Protein interaction maps for complete genomes based on gene fusion events." Nature. 402(6757):86-90.

33 (c) M Gerstein '06, gerstein.info/talks 33 Co-citation feature P(chance L or more abstracts co-cite two proteins | n,m,N) [from Ramani et al., 2005]  A Ramani et al. (2005) "Consolidating the set of known human protein- protein interactions in preparation for large- scale mapping of the human interactome." Genome Biology 6:r40.

34 (c) M Gerstein '06, gerstein.info/talks 34 Intuition on Hyper- geometric Consider picking m articles citing protein #1 from N total articles, n of which cite #2 (numb. ways of picking k citing #1 from those citing #2) * (numb. ways of picking m-k citing #2 from overall population N that do not cite #1) / (numb. of ways of picking m articles citing #1 from overall population) [from http://mathworld.wolfram.com/HypergeometricDistribution.html]

35 (c) M Gerstein '06, gerstein.info/talks 35 Abs. Prot. Abund. & mRNA abund If A and B interact, abundance (A) - abundance (B) is less than just two random pairs  (might be described proportionally)

36 (c) M Gerstein '06, gerstein.info/talks 36 Interologs & Regulogs  H Yu et al. (2004) "Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs." Genome Res 14: 1107-18.

37 (c) M Gerstein '06, gerstein.info/talks 37 LR cutoff ~ 1000 More similar Pairs of Sequences Higher Weighted Vote, More likely to interact 75 [Yu et al, Genome Res 14: 1107-18] High likelihood ratios from interolog mapping (for very similar sequences)

38 (c) M Gerstein '06, gerstein.info/talks 38 Synthetic Lethality A B aa B X A bb Viable Lethal aa bb Wild-type Viable X Synthetic Lethality Identifies Functional Relationships Large-Scale Synthetic Lethality Analysis Should Generate a Global Map of Functional Relationships between Genes and Pathways Gene Conservation = Genetic Network Conservation

39 (c) M Gerstein '06, gerstein.info/talks 39 A B C Essential Product X Y Z A B C Essential Product X Y Z A B C Essential Product X Y Z A B C Essential Product X Y Z X Dead X X X X X Genetic Interaction Network Similar Patterns of Genetic Interactions Identify Pathways or Complexes

40 (c) M Gerstein '06, gerstein.info/talks 40 Prevalence of interactions for co-regulated proteins Occurrence of motif relative to random expectation (log) TF-target Regulation Target interaction

41 (c) M Gerstein '06, gerstein.info/talks 41 Predicting protein interactions by completing defective cliques High-throughput experiments are prone to missing interactions P Q K If proteins P and Q interact with a clique K of proteins which all interact with each other, then P and Q are more likely to interact with each other P, Q, and K form a defective clique

42 (c) M Gerstein '06, gerstein.info/talks 42 Predicting protein interactions by completing defective cliques (cont’d) A defective clique can be viewed as the union of two overlapping complete cliques P Q K To find all defective cliques, we look for partial overlap of pairs of maximal cliques In test cases shows good prediction likelihood ratio

43 (c) M Gerstein '06, gerstein.info/talks 43 Examples Yu et al, 2006, Bioinformatics

44 (c) M Gerstein '06, gerstein.info/talks 44 Predicting Networks via Bayesian Integration: Practical Analysis of Feature Correlation

45 (c) M Gerstein '06, gerstein.info/talks 45 Does adding in more features help?

46 (c) M Gerstein '06, gerstein.info/talks 46 Integration of Three Best Additional Features with Four Original Features 4 Orig. Features Total 7 Features "Coverage" "Error Rate"  L Lu et al. (2005) "Assessing the limits of genomic data integration for predicting protein networks." Genome Res 15: 945-53.

47 (c) M Gerstein '06, gerstein.info/talks 47 Feature Correlation  L Lu et al. (2005) "Assessing the limits of genomic data integration for predicting protein networks." Genome Res 15: 945-53.

48 (c) M Gerstein '06, gerstein.info/talks 48 What is Mutual Information?

49 (c) M Gerstein '06, gerstein.info/talks 49 Entropy H(a) Information (or uncertainly) about event a is log 2 1/f(p,a) Entropy of an ensemble is probability weighted information content H(a) = -  a=1 to 20 f(a) log 2 1/f(a), where f(a) = frequency of character a Say string only has one character (AAAAA): H(a) = 1 log 2 1 + 0 log 2 0 + 0 log 2 0 + … = 0 + 0 + 0 + … = 0 Say string is random with all 20 chars equiprobable (ACD..ACD..ACD..): H rand (a) = -.05 log 2.05 -.05 log 2.05 - … =.22 +.22 + … = 4.3 bits H for a bionomial variable is 2*log 2 (1/.5) = 1 bit Shannon derived a measure of information content called the self-information or "surprisal" of a message: log 1/p

50 (c) M Gerstein '06, gerstein.info/talks 50 Worked Mutual Information Example

51 (c) M Gerstein '06, gerstein.info/talks 51 Feature Correlation  L Lu et al. (2005) "Assessing the limits of genomic data integration for predicting protein networks." Genome Res 15: 945-53.

52 (c) M Gerstein '06, gerstein.info/talks 52 Could lack of improvement have to do with feature correlation? NO! See effect of using Boosting to overcome correlation Boosted Naïve Bayes Naïve Bayes "Coverage" "Error Rate"  L Lu et al. (2005) "Assessing the limits of genomic data integration for predicting protein networks." Genome Res 15: 945-53.

53 (c) M Gerstein '06, gerstein.info/talks 53 Control: Observe how boosting improves integration when features are definitely correlated Using Simple Naïve Bayes Using Boosting "Coverage" "Error Rate"


Download ppt "(c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #2 - Application Mark Gerstein, Yale."

Similar presentations


Ads by Google