
1 Do not reproduce without permission 1 Gerstein.info/talks (c) 2003 1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu Permissions Statement This Presentation is copyright Mark Gerstein, Yale University, 2003, Feel free to use images in it with PROPER acknowledgement.

2 Computational Proteomics: Predicting protein function on a genome scale. Mark B Gerstein, Yale U. H Hegyi, J Lin, J Qian, N Luscombe, T Johnson, A Drawid, R Jansen, V Alexandrov, M Snyder, A Kumar, H Zhu, D Greenbaum, N Lan, P Harrison, N Echols, S Balasubramanian, P Bertone, Z Zhang, R Das, Y Liu, Y Kluger, H Yu, A Edwards, J Greenblatt, B Kus, P Miller, K Cheung, S Weissman, J Chang, R Basri, J Tsai. Talk at GCB’03, 2003.10.12

3 Understanding proteins, through analysis of populations rather than individuals

4 The central post-genomic term. Greenbaum et al., Genome Res. 11:1463

5 The central post-genomic term. Greenbaum et al., Genome Res. 11:1463

6 The central post-genomic term: Proteome (PubMed hits). Greenbaum et al., Genome Res. 11:1463

7 The popularity of Proteomics for the non-scientist

8 Many Manifestations of Proteins & Research Topics in Proteomics. Topics: analyzing protein fossils (pseudogenes) in genomes; predicting protein function on a genomic scale; comparing folds & families between proteomes; analyzing protein flexibility in terms of packing. Manifestations: structures, sequences, arrays, gels.

9 Understanding Protein Function on a Genomic Scale. Originally, 250 of ~650 known on chr. 22 [Dunham et al.]. >>30K+ proteins in entire human genome (with alt. splicing).

10 Issues in defining protein function on a genomic scale. Multi-functionality: 2 functions/protein (also 2 proteins/function). Role conflation: molecular, cellular, phenotypic. Fun terms… but do they scale?

11 Names in Biology: Systematic? Yippee: named for the reaction of a graduate student upon cloning the protein; when she had a good result, she would write "yippee" in the margin of her notebook. vulcan & klingon. stranded at second: mutant dies during development, usually in 2nd larval stage. sarah: affects female fertility (biblical ref.). Sonic & kryptonite. Darkener of apricot & suppressor of white apricot. ROP vs ROM: "Regulator of Copy Number" or RNA-I-II-complex-binding-protein. Barentsz: named for Dutch explorer who froze to death near the North Pole; the mutant blocks the movement of a key mRNA, causing it to get stuck in the wrong place. Agoraphobic: mutant for which the larvae look normal but never crawl out of the egg. single-minded. Redtape: series of designations given to genes which, when mutated, block transport along axons. Lush & cheapdate: the former wants alcohol, the latter makes susceptible. [Adapted from conversations + Am. Sci.]

12 Issues in defining protein function on a genomic scale. Fun terms… but do they scale? Starry night (P Adler, ’94). For now, definable aspects of function: interactions, location, enzymatic rxn. [Babbit]

13 Toward Systematic Ontologies for Function. Networks [Eisenberg et al.]; Hierarchies & DAGs [Enzyme, Bairoch; GO, Ashburner; MIPS, Mewes, Frishman]; Interaction Vectors [Lan et al., IEEE 90:1848]

14 How to predict functions for 1000s of proteins? 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering a microarray experiment 4) Data integration. Other approaches: domain fusions, correlated gene neighbors, phylogenetic profiles, motifs & key sites [Koonin, Eisenberg, Bork, Ouzounis, Sternberg, Thornton, Rose]

15 How to predict functions for 1000s of proteins? 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering a microarray experiment 4) Data integration. Compare uncharacterized genome sequences against known sequences in DBs, transferring func. annotation for similar sequences. Issue: threshold is the major parameter & limitation.

16 1000s of structurally based alignments of structurally and functionally characterized sequences. Sequence: (Human) 90% (Chick); 45% (E coli) (B ster.); 20% (E coli) (Yeast). EC function: Same Exact: 5.3.1.1 (TP Isomerase), 5.3.1.1 (TP Isomerase). Both Class 5 (isom.): 5.3.1.1 (TP Isomerase), 5.3.1.24 (PRA Isomerase), 5.3.1.15 (Xylose Isom.). Different Classes: 4.1.3.3 (Aldolase), 4.2.1.11 (Enolase).

17 Relationship of Similarity in Sequence to that in Function. %ID: sequence similarity of pairs of proteins. % Same Function: percentage of pairs that have the same precise function as defined by Enzyme & FlyBase functional classifications. Wilson et al. JMB 297: 233

18 Relationship of Similarity in Sequence to that in Function (%ID vs. % Same Function). Wilson et al. JMB 297: 233

19 Relationship of Similarity in Sequence to that in Function (%ID vs. % Same Function). Can transfer both fold & functional annotation. Wilson et al. JMB 297: 233

20 Relationship of Similarity in Sequence to that in Function (%ID vs. % Same Function). Three regimes: cannot transfer fold or functional annotation ("Twilight Zone"); can transfer annotation related to fold but not function; can transfer both fold & functional annotation. Wilson et al. JMB 297: 233

21 Caveats: Sequence Divergence of Multidomain Proteins Implies a Practical Threshold of >40%. Single-domain sequences vs. multidomain sequences: (Human) (Chick) (E coli) (B ster.) (E coli) (Yeast) (Rat). Hegyi & Gerstein, Genome Res. 11: 1632
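
The threshold-based annotation transfer described on the preceding slides can be sketched in a few lines of Python. This is only an illustration: the `Hit` record, the `transfer_annotation` function, and the example hits are hypothetical names and data, and the 40% cut-off is the practical threshold quoted on the slide, not a universal constant.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One database hit for a query sequence (hypothetical record layout)."""
    query: str
    subject: str
    percent_identity: float
    subject_function: str  # e.g. an EC number


def transfer_annotation(hits, min_identity=40.0):
    """Copy the function of each query's best hit, if it exceeds the threshold."""
    best = {}
    for h in hits:
        if h.percent_identity <= min_identity:
            continue  # too diverged: do not transfer function
        current = best.get(h.query)
        if current is None or h.percent_identity > current.percent_identity:
            best[h.query] = h
    return {q: h.subject_function for q, h in best.items()}


# Made-up hits, loosely patterned on the isomerase example in the talk.
hits = [
    Hit("orfA", "tim_1", 90.0, "5.3.1.1"),
    Hit("orfA", "tim_2", 45.0, "5.3.1.1"),
    Hit("orfB", "prai_1", 20.0, "5.3.1.24"),
]
annotations = transfer_annotation(hits)
```

Here `orfA` inherits 5.3.1.1 from its closest hit, while `orfB` stays unannotated because its only hit falls below the threshold. Raising `min_identity` trades coverage for reliability, which is the limitation the slides point to.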

22 Multi-domain proteins have greater divergence in function with sequence. Axes: (very close) sequence similarity [-log(e-value)] vs. % Same Function. Hegyi & Gerstein, Genome Res. 11: 1632

23 How to predict functions for 1000s of proteins? 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering a microarray experiment 4) Data integration

24 Structure Suggesting Function. Solve structures of ORFs with no homologs, using fold & site similarity to determine function. (Rationale for Structure Prediction) Issue: To what degree does fold determine function, globally? E.g. CspA: OB fold suggests DNA binding [Montelione]. Structures shown: Asp tRNA Synthetase, Staphylococcal Nuclease, CspA.

25 Fold-Function Combinations. Many functions on same scaffold (TIM-barrel). Different folds with same function (Carbonic Anhydrases, 4.2.1.1).

26 Global View of Fold-Function Combinations. 229 Folds; Non-Enz; 91 Enzymatic Functions.

27 Global View of Fold-Function Combinations. 229 Folds; Non-Enz; 91 Enzymatic Functions. Sort.

28 To what degree is fold associated with function? Folds with multiple functions: number of functions associated with a fold vs. frequency in database of 229 folds. Hegyi & Gerstein, JMB 288: 147 [Similar results by Thornton]

29 Predicting Protein Function on a Genome Scale. "Traditional" sequence patterns. Via fold similarity (structural genomics). Clustering microarray experiments: local clustering (to identify time-shifted and inverted relationships); relating clustering to known regulatory relationships; spectral biclustering (to identify marker genes associated with particular phenotypes). Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); predicting interactions in yeast de novo from non-interaction data sources (with verification).

30 Microarray experiments: Expression Arrays [Brown]; Proteome Chips [Snyder]

31 Format of Gene Expression Data. Gene labels (shown twice in the figure): MCM3, MCM6, CDC47, MCM2, CDC46, CDC54, DPB3, CDC45, DPB2, CDC2, CDC7, POL2, HYS2, POL32, DBF4, ORC2, ORC6, ORC5, ORC4, ORC3, ORC1

32 Clustering the yeast cell cycle to uncover interacting proteins. Microarray timecourse of 1 ribosomal protein. Axes: mRNA expression level (ratio) vs. time. [Brown, Davis]

33 Clustering the yeast cell cycle to uncover interacting proteins. Random relationship from 18M.

34 Clustering the yeast cell cycle to uncover interacting proteins. Close relationship from 18M (2 interacting ribosomal proteins). Axes: mRNA expression level (ratio) vs. time. [Botstein; Church]

35 Clustering the yeast cell cycle to uncover interacting proteins. Predict functional interaction of unknown member of cluster. Axes: mRNA expression level (ratio) vs. time.

36 Local Clustering algorithm identifies further (reasonable) types of expression relationships. Panels: Simultaneous, Traditional Global Correlation, Inverted, Time-Shifted. [Church] Qian et al. JMB 314:1053

37 Mapped problem onto a simple adaptation of SW (Smith-Waterman) local sequence alignment. Panels: Simultaneous, Traditional Global Correlation, Inverted, Time-Shifted. Qian et al. JMB 314:1053
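
A minimal sketch of how a Smith-Waterman-style recurrence can be adapted to expression profiles is given below. It is written from the general idea named on the slide, not from the published code: the function names, the z-score normalization, and the two accumulator matrices `E` (positively matched segments) and `D` (inverted segments) are assumptions made for illustration, and gaps are not modelled.

```python
import numpy as np


def zscore(profile):
    """Normalize a (non-constant) profile to mean 0 and standard deviation 1."""
    p = np.asarray(profile, dtype=float)
    return (p - p.mean()) / p.std()


def local_clustering(x, y):
    """Return (kind, score, lag) for the best locally matched segment.

    A positive lag means the matched segment occurs later in y than in x.
    """
    x, y = zscore(x), zscore(y)
    n, m = len(x), len(y)
    M = np.outer(x, y)            # M[i, j] = x_i * y_j, the per-cell match score
    E = np.zeros((n + 1, m + 1))  # running sums along diagonals, reset at 0
    D = np.zeros((n + 1, m + 1))  # same, with the sign flipped (inverted match)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i, j] = max(E[i - 1, j - 1] + M[i - 1, j - 1], 0.0)
            D[i, j] = max(D[i - 1, j - 1] - M[i - 1, j - 1], 0.0)
    if E.max() >= D.max():
        i, j = np.unravel_index(E.argmax(), E.shape)
        kind = "simultaneous" if i == j else "time-shifted"
        return kind, float(E.max()), int(j - i)
    i, j = np.unravel_index(D.argmax(), D.shape)
    return "inverted", float(D.max()), int(j - i)


# Toy profiles (made up): a pulse, a negated copy, and a copy delayed by 2 points.
wave = [0, 1, 3, 1, 0, -1, -3, -1]
pulse = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
delayed = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]

same = local_clustering(wave, wave)
flipped = local_clustering(wave, [-v for v in wave])
shifted = local_clustering(pulse, delayed)
```

Resetting each running sum at zero is what makes the match local, as in Smith-Waterman; a best cell on the main diagonal corresponds to a simultaneous relationship, and an off-diagonal one to a time shift.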

38 Examples of inverted relationships. Documented: YME1 (mito. protease involved in cplx. assembly) and YNT20 (known suppressor of YME1). Suggestive: PUT2 (involved in Pro degradation) and SER3 (involved in Ser synthesis). Axes: expr. ratio vs. time.

39 Examples: Example Shifted Relationship (SIM)

40 Examples: Example Shifted Relationship (MIM)

41 Global Network of 3 Different Types of Relationships. ~313K significant relationships from ~18M possible.

42 Global Network of 3 Different Types of Relationships. Simultaneous: 188K; inverted: 63K; shifted: 67K. ~313K significant relationships from ~18M possible.

43 Predicting Protein Function on a Genome Scale. "Traditional" sequence patterns. Via fold similarity (structural genomics). Clustering microarray experiments: local clustering (to identify time-shifted and inverted relationships); relating clustering to known regulatory relationships; spectral biclustering (to identify marker genes associated with particular phenotypes). Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); predicting interactions in yeast de novo from non-interaction data sources (with verification).

44 Relationship between Transcription and Expression. chIP-chip experiments provide large-scale known regulatory relationships. Iyer et al., Nature 409:533; Lee et al., Science 298:799; Horak et al., Genes & Development 16:3017

45 Expression Relationships between Regulators & Targets: Prevalence of shifted & inverted relationships. Yu et al. TIG 19:422

46 Expression Relationships between Regulators & Targets: Inhibitors v Activators. Yu et al. TIG 19:422

47 Expression Relationships between Co-regulated Targets. Yu et al. TIG 19:422

48 Next Step: Can you predict regulatory networks from expression data? [Siggia, Bussemaker, Gifford & Young]

49 Trained and tested standard SVM: reasonable results against Transfac. FP rate: 2%; coverage: 36%. Qian et al. Bioinformatics (in press)

50 However, poor overlap with chIP-chip. Qian et al. Bioinformatics (in press)

51 Predicting Protein Function on a Genome Scale. "Traditional" sequence patterns. Via fold similarity (structural genomics). Clustering microarray experiments: local clustering (to identify time-shifted and inverted relationships); relating clustering to known regulatory relationships; spectral biclustering (to identify marker genes associated with particular phenotypes). Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); predicting interactions in yeast de novo from non-interaction data sources (with verification).

52 Biclustering to associate particular genes with certain phenotypes. The matrix of raw data (genes × conditions) is rearranged into a shuffled matrix (containing checkerboard “biclusters” of conditions with marker genes): reordered genes and reordered conditions, each sorted according to a classification vector. Kluger et al. Genome Res. 13:703

53 Identify checkerboard matrices by their action on classification vectors: formulation as “eigenproblem”. For a checkerboard matrix $A$ (genes × conditions), with condition classification vector $x$ and gene classification vector $y$: $A^{T} A x = x'$ and $A A^{T} y = y'$. Kluger et al. Genome Res. 13:703

54 SVD to Solve Eigenproblem [Botstein] [Altman, Kim]

55 Matrix Normalization: Rescaling Rows & Columns to Same Mean. $A x = y$ (noise); $R A x = y$ (noise). Kluger et al. Genome Res. 13:703
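
The eigenproblem, SVD, and row/column rescaling steps named on these slides can be illustrated with NumPy. The sketch below is a simplified reading, not the published procedure: `bistochastize`, `spectral_bicluster_order`, the iteration limit, and the toy checkerboard matrix are all assumptions, and the matrix is assumed to have strictly positive entries.

```python
import numpy as np


def bistochastize(A, n_iter=100, tol=1e-9):
    """Alternately rescale rows and columns until all row sums are equal
    (column sums are equal by construction after each pass)."""
    B = np.asarray(A, dtype=float).copy()
    for _ in range(n_iter):
        B = B / B.sum(axis=1, keepdims=True)  # every row sums to 1
        B = B / B.sum(axis=0, keepdims=True)  # every column sums to 1
        row_sums = B.sum(axis=1)
        if np.allclose(row_sums, row_sums.mean(), rtol=tol, atol=0.0):
            break
    return B


def spectral_bicluster_order(A):
    """Sort genes (rows) and conditions (columns) by the second singular vectors."""
    B = bistochastize(A)
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # After normalization the leading singular pair is roughly constant,
    # so the block structure is read from the second pair.
    u2, v2 = U[:, 1], Vt[1, :]
    return np.argsort(u2), np.argsort(v2), u2, v2


# Toy checkerboard (made up): genes and conditions each fall into two groups,
# and a gene is highly expressed only in conditions of its own group.
gene_group = np.array([0, 1, 0, 1, 0, 1])
cond_group = np.array([0, 1, 1, 0])
A = np.where(gene_group[:, None] == cond_group[None, :], 5.0, 1.0)

row_order, col_order, u2, v2 = spectral_bicluster_order(A)
```

Reordering with `A[np.ix_(row_order, col_order)]` brings the scrambled rows and columns back into contiguous blocks. On real data the singular vectors are only approximately piecewise constant, so a further clustering step on `u2` and `v2` would usually be needed.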

56 Representative Cancer Data Set. Lymphoma data from Dalla-Favera et al. at Columbia; informatics from Stolovitzky & Califano at IBM. Supervised learning identified some characteristic genes associated with different types of lymphoma. Kluger et al. Genome Res. 13:703

57 Results on Representative Cancer Data Set. Patients (samples) sorted according to projection onto blocky classification eigenvector (u2); genes sorted according to projection onto blocky classification eigenvector (v2); matrix values represent outer products of two blocky classification eigenvectors. Kluger et al. Genome Res. 13:703

58 Actual Data with Normalization and Sorting. Kluger et al. Genome Res. 13:703

59 Actual Data just with Sorting (no normalization). Kluger et al. Genome Res. 13:703

60 Actual Data (no normalization or sorting). Kluger et al. Genome Res. 13:703

61 Actual Data just with Sorting (no normalization). Kluger et al. Genome Res. 13:703

62 Actual Data with Normalization and Sorting. Kluger et al. Genome Res. 13:703

63 Just signal from top classification eigenvectors. Patients (samples) sorted according to projection onto blocky classification eigenvector (u2); genes sorted according to projection onto blocky classification eigenvector (v2); matrix values represent outer products of two blocky classification eigenvectors. Kluger et al. Genome Res. 13:703

64 Actual Values of Projections onto Classification Eigenvectors. Patients (samples) sorted according to projection onto blocky classification eigenvector (u2); genes sorted according to projection onto blocky classification eigenvector (v2). Kluger et al. Genome Res. 13:703

65 Classification of Cancers Based on Projection onto Two Top Classification Eigenvectors: Better with Normalization. Panels: normalized (“bistochastization”) vs. straight SVD. Four types of cancer in Dalla-Favera dataset (labels shown: CLL, DLCL, FL, DLCL). Kluger et al. Genome Res. 13:703

66 Golub, TR et al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999, 286. Panels: biclustering, bistochastization, SVD, bi-normalization, normalized cuts. Classes: ALL (B), ALL (T), AML.

67 Predicting Protein Function on a Genome Scale. "Traditional" sequence patterns. Via fold similarity (structural genomics). Clustering microarray experiments: local clustering (to identify time-shifted and inverted relationships); relating clustering to known regulatory relationships; spectral biclustering (to identify marker genes associated with particular phenotypes). Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); predicting interactions in yeast de novo from non-interaction data sources (with verification).

68 Integration for Interactomes. Diverse sources of interaction information: databases (BIND, DIP, MIPS etc.), individual expts. in literature; high-throughput datasets: in vivo pull down (Ho, Gavin), yeast two-hybrid (Uetz, Ito); genomic data: expression, phenotypes, localization, functional. Noisy: high-throughput data is less reliable than smaller scale experiments [Grigorev, Bork]. Combining data increases accuracy & coverage [Church]. How to do quantitatively? How to weight different data sources? General classification problem (machine learning). Bayesian approaches…. Science 295:284

69 Example of data integration: RNA polymerase II. Which subunits interact? Based on protein-protein interaction experiments [Kornberg]. Compare with gold std. structure. Edwards, Kus, et al. TIG 18:529

70 Data integration: RNA polymerase II

71 Data integration: RNA polymerase II

72 Data integration: RNA polymerase II. Interaction experiments before structure was known.

73 Data integration: RNA polymerase II

74 Data integration: RNA polymerase II. Integrate using naive Bayes classifier.

75 Data integration: RNA polymerase II. Integrate using naive Bayes classifier.

76 Data integration: RNA polymerase II. Integrate using naive Bayes classifier.

77 Weighted Voting: the Likelihood Ratio. Vote: +2 = 1 + 1 + -1 + -1 + 1 + 1. With weights: likelihood ratio $L = L_1 + L_2 + L_3 + \dots$

78 Correlations between similar features

79 Relative quality of different expts. $L = (TP/FP)\,(N/P)$ [for uncorrelated features]
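
As a worked illustration of the formula on this slide, the sketch below computes a likelihood ratio from TP and FP counts and combines several ratios. The function names and the TP/FP counts are hypothetical; the gold-standard sizes (8250 positives, about 2.7 million negatives) are the ones quoted later in the talk. For conditionally independent features a naive Bayes classifier multiplies the individual ratios, which is the same as adding their logarithms, so the additive vote on the voting slide is read here as a sum of log ratios.

```python
import math


def likelihood_ratio(tp, fp, n_pos, n_neg):
    """L = (TP / FP) * (N / P) = P(feature | positive) / P(feature | negative)."""
    return (tp / fp) * (n_neg / n_pos)


def combine(ratios):
    """Naive Bayes combination of independent features: add the log ratios."""
    return math.exp(sum(math.log(r) for r in ratios))


# Gold-standard sizes quoted in the talk; the TP/FP counts are made up.
P, N = 8250, 2_700_000
L_example = likelihood_ratio(tp=100, fp=50, n_pos=P, n_neg=N)

# Three hypothetical features: two supporting an interaction, one against.
L_combined = combine([2.0, 3.0, 0.5])
```

A ratio above 1 counts as a weighted vote for an interaction and a ratio below 1 as a vote against. The independence assumption is the caveat flagged on the slides: correlated features would be double-counted by this simple product.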

80 Disagreement in high-throughput protein interaction datasets [Eisenberg, Fields & Bork]. Venn diagram of the Gavin, Uetz and Ho datasets, with TP / FP counts per region: 90/5567, 11/135, 1357/6226, 6/6, 353/212, 18/6, 15/1. Jansen et al. JSFG 2:71

81 Predicting Protein Function on a Genome Scale. "Traditional" sequence patterns. Via fold similarity (structural genomics). Clustering microarray experiments: local clustering (to identify time-shifted and inverted relationships); relating clustering to known regulatory relationships; spectral biclustering (to identify marker genes associated with particular phenotypes). Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); predicting interactions in yeast de novo from non-interaction data sources (with verification).

82 Comparison of network against gold standard complex interactions. Positives: 8250 known interactions in MIPS complexes [Mewes]. Negatives: ~2.7 M pairs in diff. subcellular compartments. Diagram labels: TP, FP, set of predicted “interactions”. [Related data in BIND, DIP]

83 Overview of information integrated and Bayesian Formalism. Data suggestive of interactions (co-expression, co-localization, similar essentiality); noisy high-throughput experiments (Gavin et al., Uetz et al. &c); gold-standard complexes (MIPS, Mewes, Frishman et al.). Jansen et al. Science (in press)

84 Overview of information integrated and Bayesian Formalism. Cross-validated training and testing; thresholding L at various values; tabulation of observed TP and FP at various thresholds. Jansen et al. Science (in press)
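
The thresholding and tabulation step listed on this slide can be sketched as follows. The function name, the protein pairs, and their scores are hypothetical, and the cross-validation split is omitted for brevity, so this shows only how TP and FP are counted once every candidate pair has a likelihood ratio L.

```python
def tabulate(scored_pairs, positives, negatives, thresholds):
    """For each cut-off, count predicted pairs found in each gold-standard set."""
    rows = []
    for cut in thresholds:
        predicted = {pair for pair, L in scored_pairs.items() if L >= cut}
        tp = len(predicted & positives)  # predicted and in the positive set
        fp = len(predicted & negatives)  # predicted but in the negative set
        rows.append((cut, tp, fp))
    return rows


# Made-up scores and gold-standard sets for five candidate pairs.
scored_pairs = {
    ("a", "b"): 900.0,
    ("a", "c"): 50.0,
    ("b", "c"): 700.0,
    ("c", "d"): 5.0,
    ("a", "d"): 650.0,
}
positives = {("a", "b"), ("b", "c")}
negatives = {("a", "c"), ("c", "d"), ("a", "d")}

table = tabulate(scored_pairs, positives, negatives, thresholds=[1, 100, 800])
```

Each row of `table` is `(threshold, TP, FP)`. Comparing the observed TP/FP ratio in each row with the threshold that produced it is the kind of check that would indicate where to place a cut-off.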

85 Observed TP/FP Ratio Tracks L, Suggesting a Threshold. Jansen et al. Science (in press)

86 Observed TP/FP Ratio Tracks L, Suggesting a Threshold. Jansen et al. Science (in press)

87 Observed TP/FP Ratio Tracks L, Suggesting a Threshold. Jansen et al. Science (in press)

88 Integration of Features Gives Much Higher Likelihood Ratios than Any Individual Feature. Jansen et al. Science (in press)

89 Comparison of Predictions with Known Complexes. Jansen et al. Science (in press)

90 Predicted Network. Jansen et al. Science (in press)

91 Example prediction: Mito. Ribosome. Jansen et al. Science (in press)

92 Comparison with new experiments (J Greenblatt). RFA cplx. Jansen et al. Science (in press)

93 Integration of Features Gives Much Higher Likelihood Ratios than Any Individual Feature. Jansen et al. Science (in press)

94 Same is True for Combining High-throughput Interaction Data (Integration of Features Gives Much Higher Likelihood Ratios than Any Individual Feature). Jansen et al. Science (in press)

95 Comparison of Strength of Purely Predicted Features (PIP) vs. Integrated High-throughput Data (PIE). Jansen et al. Science (in press)

96 Predicting Protein Function on a Genome Scale. "Traditional" sequence patterns. Via fold similarity (structural genomics). Clustering microarray experiments: local clustering (to identify time-shifted and inverted relationships); relating clustering to known regulatory relationships; spectral biclustering (to identify marker genes associated with particular phenotypes). Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); predicting interactions in yeast de novo from non-interaction data sources (with verification).

97 Acknowledgements. Protein Function Prediction (GeneCensus.org): J Qian, R Jansen, A Drawid, C Wilson, H Yu, D Greenbaum, J Lin, N Luscombe, H Hegyi, Y Kluger. Pseudogenes (Pseudogene.org): P Harrison, Z Zhang, Y Liu, S Balasubramanian, P Bertone, T Johnson, J Karro. Macromolecular Motions (MolMovDB.org): J Junker, H Yu, N Echols, V Alexandrov, W Krebs, D Milburn, U Lehnert. Structural Proteomics (PartsList.org): C Goh, N Lan, H Hegyi, R Das, S Douglas, B Stenger. Collaborators: J Chang, R Basri, J Greenblatt (N Krogan). Yale CEGS: M Snyder (A Kumar, H Zhu, M Bilgin …), S Weissman, P Miller (K Cheung). NESG.org: G Montelione, A Edwards (B Kus). NIH, NSF.

