1
Do not reproduce without permission 1 Gerstein.info/talks (c) 2003 1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu Permissions Statement This Presentation is copyright Mark Gerstein, Yale University, 2003, Feel free to use images in it with PROPER acknowledgement.
2
Computational Proteomics: Predicting protein function on a genome scale. Mark B Gerstein, Yale U. H Hegyi, J Lin, J Qian, N Luscombe, T Johnson, A Drawid, R Jansen, V Alexandrov, M Snyder, A Kumar, H Zhu, D Greenbaum, N Lan, P Harrison, N Echols, S Balasubramanian, P Bertone, Z Zhang, R Das, Y Liu, Y Kluger, H Yu, D Greenbaum, A Edwards, J Greenblatt, B Kus, P Miller, K Cheung, S Weissman, J Chang, R Basri, J Tsai. Talk at GCB’03, 2003.10.12
3
Understanding proteins, through analysis of populations rather than individuals
4
The central post-genomic term. Greenbaum et al., Genome Res. 11:1463
5
The central post-genomic term. Greenbaum et al., Genome Res. 11:1463
6
The central post-genomic term: Proteome (PubMed hits). Greenbaum et al., Genome Res. 11:1463
7
The popularity of Proteomics for the non-scientist
8
Many Manifestations of Proteins & Research Topics in Proteomics. Manifestations: Structures, Sequences, Arrays, Gels. Research topics: Analyzing protein fossils (pseudogenes) in genomes; Predicting protein function on a genomic scale; Comparing folds & families between proteomes; Analyzing protein flexibility in terms of packing
9
Understanding Protein Function on a Genomic Scale: Originally, 250 of ~650 known on chr. 22 [Dunham et al.]; >>30K+ proteins in entire human genome (with alt. splicing)
10
Issues in defining protein function on a genomic scale. Multi-functionality: 2 functions/protein (also 2 proteins/function). Role Conflation: molecular, cellular, phenotypic. Fun terms… but do they scale?
11
Names in Biology: Systematic?
- Yippee: named for the reaction of a graduate student upon cloning the protein. If she had a good result, she would write "yippee" in the margin of her notebook
- vulcan & klingon
- stranded at second: mutant dies during development, usually in 2nd larval stage
- sarah: affects female fertility (biblical ref.)
- Sonic & kryptonite
- Darkener of apricot & suppressor of white apricot
- ROP vs ROM: "Regulator of Copy Number" or RNA-I-II-complex-binding-protein
- Barentsz: named for Dutch explorer who froze to death near the North Pole. The mutant blocks the movement of a key mRNA, causing it to get stuck in the wrong place
- Agoraphobic: mutant for which the larvae look normal but never crawl out of the egg
- single-minded
- Redtape: series of designations given to genes which, when mutated, block transport along axons
- Lush & cheapdate: former wants alcohol, latter makes susceptible
[Adapted from conversations + Am. Sci.]
12
Issues in defining protein function on a genomic scale. Fun terms… but do they scale? Starry night (P Adler, ’94). For now, definable aspects of function: interactions, location, enzymatic rxn. [Babbit]
13
Toward Systematic Ontologies for Function: Networks [Eisenberg et al.]; Hierarchies & DAGs [Enzyme, Bairoch; GO, Ashburner; MIPS, Mewes, Frishman]; Interaction Vectors [Lan et al, IEEE 90:1848]
14
How to predict functions for 1000s of proteins? 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering a microarray experiment 4) Data integration. Other approaches: domain fusions, correlated gene neighbors, phylogenetic profiles, motifs & key sites [Koonin, Eisenberg, Bork, Ouzounis, Sternberg, Thornton, Rose]
15
How to predict functions for 1000s of proteins? 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering a microarray experiment 4) Data integration. For 1): Compare uncharacterized genome sequences against known sequences in DBs, transferring func. annotation for similar sequences. Issue: Threshold is major parameter & limitation
16
1000s of structurally based alignments of structurally and functionally characterized sequences. Sequence: (Human)/(Chick) 90%; (E coli)/(B ster.) 45%; (E coli)/(Yeast) 20%. EC Function: Same Exact: 5.3.1.1 (TP Isomerase), 5.3.1.1 (TP Isomerase); Both Class 5 (isom.): 5.3.1.1 (TP Isomerase), 5.3.1.24 (PRA Isomerase), 5.3.1.15 (Xylose Isom.); Different Classes: 4.1.3.3 (Aldolase), 4.2.1.11 (Enolase)
17
Relationship of Similarity in Sequence to that in Function. %ID: sequence similarity of pairs of proteins. % Same Function: percentage of pairs that have same precise function as defined by Enzyme & FlyBase functional classifications. Wilson et al. JMB 297: 233
18
Relationship of Similarity in Sequence to that in Function (%ID vs. % Same Function). Wilson et al. JMB 297: 233
19
Relationship of Similarity in Sequence to that in Function (%ID vs. % Same Function). Region: can transfer both fold & functional annotation. Wilson et al. JMB 297: 233
20
Relationship of Similarity in Sequence to that in Function (%ID vs. % Same Function). Regions: cannot transfer fold or functional annotation ("Twilight Zone"); can transfer annotation related to fold but not function; can transfer both fold & functional annotation. Wilson et al. JMB 297: 233
21
Caveats: Sequence Divergence of Multidomain Proteins Implies a Practical Threshold is >40%. Figure labels: (Human), (Chick), (E coli), (B ster.), (E coli), (Yeast), (Rat); Single Domain Sequences; Multidomain Sequences. Hegyi & Gerstein, Genome Res. 11: 1632
22
Multi-domain proteins have greater divergence in function with sequence. Axes: Sequence Similarity [ -log(e-value) ] (Very Close) vs. % Same Function. Hegyi & Gerstein, Genome Res. 11: 1632
23
How to predict functions for 1000s of proteins? 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering a microarray experiment 4) Data integration
24
Structure Suggesting Function: Solve structures of ORFs with no homologs, using fold & site similarity to determine function. (Rationale for Structure Prediction) E.g. CspA: OB fold suggests DNA binding [Montelione]. Figure labels: Asp tRNA Synthetase, Staphylococcal Nuclease, CspA. Issue: To what degree does fold determine function, globally?
25
Fold-Function Combinations: Many Functions on Same Scaffold (TIM-barrel); Different Folds with Same Function (Carbonic Anhydrases, 4.2.1.1)
26
Global View of Fold-Function Combinations: 229 Folds vs. 91 Enzymatic Functions (plus Non-Enz)
27
Global View of Fold-Function Combinations: 229 Folds vs. 91 Enzymatic Functions (plus Non-Enz), after Sort
28
To what degree is fold associated with function? Folds with multiple functions. Axes: Number of functions associated with a fold vs. Frequency in database of 229 folds. Hegyi & Gerstein, JMB 288: 147 [Similar results by Thornton]
29
Predicting Protein Function on a Genome Scale: 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering microarray experiments: Local Clustering (to identify time-shifted and inverted relationships); Relating Clustering to Known Regulatory Relationships; Spectral Biclustering (to identify marker genes associated with particular phenotypes) 4) Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); Predicting interactions in yeast de novo from non-interaction data sources (with verification)
30
Microarray experiments: Expression Arrays [Brown]; Proteome Chips [Snyder]
31
Format of Gene Expression Data. Row and column labels: MCM3 MCM6 CDC47 MCM2 CDC46 CDC54 DPB3 CDC45 DPB2 CDC2 CDC7 POL2 HYS2 POL32 DBF4 ORC2 ORC6 ORC5 ORC4 ORC3 ORC1
32
Clustering the yeast cell cycle to uncover interacting proteins: Microarray timecourse of 1 ribosomal protein. Axes: mRNA expression level (ratio) vs. Time. [Brown, Davis]
33
Clustering the yeast cell cycle to uncover interacting proteins: Random relationship from 18M
34
Clustering the yeast cell cycle to uncover interacting proteins: Close relationship from 18M (2 Interacting Ribosomal Proteins). Axes: mRNA expression level (ratio) vs. Time. [Botstein; Church]
35
Clustering the yeast cell cycle to uncover interacting proteins: Predict Functional Interaction of Unknown Member of Cluster. Axes: mRNA expression level (ratio) vs. Time
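The "close" versus "random" relationships on these slides can be illustrated with a plain Pearson correlation between two expression time courses. The sketch below is only an assumption about how such a comparison might be scored; the profiles are made-up numbers, not data from the talk, and the 6,000-gene count used to reproduce the "18M" pair total is an approximate figure for yeast.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)

def n_pairs(n_genes):
    """Number of distinct gene pairs among n_genes genes."""
    return n_genes * (n_genes - 1) // 2

# Hypothetical expression ratios over six time points (illustrative only).
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0]
gene_b = [2.1, 4.0, 6.2, 3.9, 2.0, 4.1]  # roughly tracks gene_a
gene_c = [3.0, 1.0, 2.0, 3.0, 1.0, 2.0]  # no linear relation to gene_a

if __name__ == "__main__":
    print("close pair :", round(pearson(gene_a, gene_b), 3))
    print("random pair:", round(pearson(gene_a, gene_c), 3))
    print("pairs among ~6000 genes:", n_pairs(6000))
```

An uncharacterized gene whose profile correlates strongly with a cluster of known genes would then be a candidate for sharing their function, which is the inference the slide describes.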
36
Local Clustering algorithm identifies further (reasonable) types of expression relationships: Simultaneous (Traditional Global Correlation), Inverted, Time-Shifted. [Church] Qian et al. JMB 314:1053
37
Mapped problem onto a simple adaptation of SW local sequence alignment. Relationship types: Simultaneous (Traditional Global Correlation), Inverted, Time-Shifted. Qian et al. JMB 314:1053
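A minimal sketch of how a Smith-Waterman-style recurrence could be adapted to expression profiles is given below. It is an illustration under stated assumptions, not the published algorithm: the z-score normalization, the product match score, the two accumulation matrices (E for co-varying runs, D for inverted runs), and the names `zscore` and `local_cluster` are all choices made here for the example.

```python
import math

def zscore(profile):
    """Normalize a profile to mean 0 and (population) standard deviation 1."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / n)
    if sd == 0:
        raise ValueError("constant profile cannot be normalized")
    return [(v - mean) / sd for v in profile]

def local_cluster(x, y):
    """Return (score, kind, lag) for the best local run between two profiles.

    E accumulates positive products along a diagonal (co-varying runs);
    D accumulates negated products (inverted runs). As in local sequence
    alignment, a running sum is reset to 0 whenever it would go negative.
    lag = j - i, so a positive lag means y trails x.
    """
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    zx, zy = zscore(x), zscore(y)
    E = [[0.0] * (n + 1) for _ in range(n + 1)]
    D = [[0.0] * (n + 1) for _ in range(n + 1)]
    best = (0.0, "none", 0)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            m = zx[i - 1] * zy[j - 1]
            E[i][j] = max(E[i - 1][j - 1] + m, 0.0)
            D[i][j] = max(D[i - 1][j - 1] - m, 0.0)
            if E[i][j] > best[0]:
                kind = "simultaneous" if i == j else "time-shifted"
                best = (E[i][j], kind, j - i)
            if D[i][j] > best[0]:
                kind = "inverted" if i == j else "inverted time-shifted"
                best = (D[i][j], kind, j - i)
    return best

if __name__ == "__main__":
    pulse = [0, 0, 1, 3, 1, 0, 0, 0]    # hypothetical profile
    delayed = [0, 0, 0, 0, 1, 3, 1, 0]  # same pulse, two time points later
    print(local_cluster(pulse, delayed))
```

The design mirrors local alignment in that only the best-scoring diagonal run is reported, so a delayed or inverted relationship can be picked up even when the global correlation between the two profiles is weak.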
38
Examples of inverted relationships (Expr. Ratio vs. Time). Documented: YME1: mito. protease involved in cplx. assembly; YNT20: known suppressor of YME1. Suggestive: PUT2: involved in Pro degradation; SER3: involved in Ser synthesis
39
Examples: Example Shifted Relationship (SIM)
40
Examples: Example Shifted Relationship (MIM)
41
Global Network of 3 Different Types of Relationships: ~313K significant relationships from ~18M possible
42
Global Network of 3 Different Types of Relationships: Simultaneous 188K; Inverted 63K; Shifted 67K. ~313K significant relationships from ~18M possible
43
Predicting Protein Function on a Genome Scale: 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering microarray experiments: Local Clustering (to identify time-shifted and inverted relationships); Relating Clustering to Known Regulatory Relationships; Spectral Biclustering (to identify marker genes associated with particular phenotypes) 4) Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); Predicting interactions in yeast de novo from non-interaction data sources (with verification)
44
Relationship between Transcription and Expression: chIP-chip experiments provide large-scale known regulatory relationships. Iyer et al, Nature, 409:533; Lee et al., Science 298:799; Horak et al, Genes & Development, 16:3017
45
Expression Relationships between Regulators & Targets: Prevalence of shifted & inverted relationships. Yu et al. TIG 19:422
46
Expression Relationships between Regulators & Targets: Inhibitors v Activators. Yu et al. TIG 19:422
47
Expression Relationships between Co-regulated Targets. Yu et al. TIG 19:422
48
Next Step: Can you predict regulatory networks from expression data? [Siggia, Bussemaker, Gifford & Young]
49
Trained and tested standard SVM: Reasonable results against Transfac (2% FP rate, 36% coverage). Qian et al. Bioinformatics (in press)
50
However, poor overlap with chIP-chip. Qian et al. Bioinformatics (in press)
51
Predicting Protein Function on a Genome Scale: 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering microarray experiments: Local Clustering (to identify time-shifted and inverted relationships); Relating Clustering to Known Regulatory Relationships; Spectral Biclustering (to identify marker genes associated with particular phenotypes) 4) Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); Predicting interactions in yeast de novo from non-interaction data sources (with verification)
52
Biclustering to associate particular genes with certain phenotypes. Input: matrix of raw data (Genes by Conditions). Output: shuffled matrix (containing checkerboard “biclusters” of conditions with marker genes), with reordered genes and reordered conditions, each sorted according to a classification vector. Kluger et al. Genome Res. 13:703
53
Identify checkerboard matrices by their action on classification vectors: Formulation as “eigenproblem”. Checkerboard matrix $A$ (Genes by Conditions); condition classification vector $x$; gene classification vector $y$. $A^{T} A x = x'$ and $A A^{T} y = y'$. Kluger et al. Genome Res. 13:703
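To make the "action on classification vectors" concrete, the following sketch applies a small checkerboard matrix to a step-like condition vector and checks that the result is again a step-like vector proportional to the input. The matrix values and helper names (`matvec`, `transpose`) are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]

# Hypothetical checkerboard: 6 genes (rows) by 4 conditions (columns).
# Genes 1-3 are high in conditions 1-2; genes 4-6 are high in conditions 3-4.
A = [[5, 5, 1, 1] for _ in range(3)] + [[1, 1, 5, 5] for _ in range(3)]

# Step-like condition classification vector: two classes of conditions.
x = [1, 1, -1, -1]

y = matvec(A, x)                 # step-like vector over genes
x_new = matvec(transpose(A), y)  # A^T A x, back in condition space

if __name__ == "__main__":
    print("A x     =", y)
    print("A^T A x =", x_new)
    print("ratio   =", [xn / xi for xn, xi in zip(x_new, x)])
```

Here A x separates the two gene blocks and A^T A x equals 96 times x, so x behaves as an eigenvector of A^T A for this toy matrix; on real data the structure is only approximate and is recovered numerically.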
54
SVD to Solve Eigenproblem [Botstein] [Altman, Kim]
55
Matrix Normalization: Rescaling Rows & Columns to Same Mean. Figure labels: A x = y (noise); R A x = y (noise). Kluger et al. Genome Res. 13:703
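One simple way to bring every row and every column to the same mean is to alternate row and column rescaling until the matrix stops changing. The sketch below assumes a strictly positive matrix and a fixed number of sweeps; it illustrates the idea only and may differ from the normalization actually used in Kluger et al. The function name `rescale_rows_and_columns` and the sample matrix are hypothetical.

```python
def rescale_rows_and_columns(matrix, sweeps=200):
    """Alternately divide rows and columns by their means.

    For a strictly positive matrix this iteration converges to a matrix
    whose row means and column means are all 1 (assumption: all entries > 0).
    """
    m = [[float(v) for v in row] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(sweeps):
        # Rescale each row to mean 1.
        for i in range(n_rows):
            mean = sum(m[i]) / n_cols
            m[i] = [v / mean for v in m[i]]
        # Rescale each column to mean 1.
        for j in range(n_cols):
            mean = sum(m[i][j] for i in range(n_rows)) / n_rows
            for i in range(n_rows):
                m[i][j] /= mean
    return m

def row_means(m):
    return [sum(row) / len(row) for row in m]

def col_means(m):
    return [sum(col) / len(col) for col in zip(*m)]

if __name__ == "__main__":
    raw = [[1, 2, 3, 4], [2, 1, 8, 1], [5, 5, 1, 9]]  # hypothetical data
    norm = rescale_rows_and_columns(raw)
    print("row means:", [round(v, 6) for v in row_means(norm)])
    print("col means:", [round(v, 6) for v in col_means(norm)])
```

After rescaling, differences in overall expression level between genes or between conditions no longer dominate, which is the stated purpose of normalizing before looking for checkerboard structure.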
56
Representative Cancer Data set: Lymphoma Data from Dalla-Favera et al. at Columbia; Informatics from Stolovitzky & Califano at IBM. Supervised learning some identified characteristic genes associated with different types of lymphoma. Kluger et al. Genome Res. 13:703
57
Results on Representative Cancer Data set: Patients (samples) sorted according to projection onto blocky classification eigenvector (u2). Genes sorted according to projection onto blocky classification eigenvector (v2). Matrix values represent outer products of two blocky classification eigenvectors. Kluger et al. Genome Res. 13:703
58
Actual Data with Normalization and Sorting. Kluger et al. Genome Res. 13:703
59
Actual Data just with Sorting (no normalization). Kluger et al. Genome Res. 13:703
60
Actual Data (no normalization or sorting). Kluger et al. Genome Res. 13:703
61
Actual Data just with Sorting (no normalization). Kluger et al. Genome Res. 13:703
62
Actual Data with Normalization and Sorting. Kluger et al. Genome Res. 13:703
63
Just signal from top classification eigenvectors: Patients (samples) sorted according to projection onto blocky classification eigenvector (u2). Genes sorted according to projection onto blocky classification eigenvector (v2). Matrix values represent outer products of two blocky classification eigenvectors. Kluger et al. Genome Res. 13:703
64
Actual Values of Projections onto Classification Eigenvectors: Patients (samples) sorted according to projection onto blocky classification eigenvector (u2). Genes sorted according to projection onto blocky classification eigenvector (v2). Kluger et al. Genome Res. 13:703
65
Classification of Cancers Based on Projection onto two top classification eigenvectors: Better with Normalization. Panels: Normalized (“bistochastization”); Straight SVD. Labels: CLL, DLCL, FL, DLCL. Four types of Cancer in Dalla-Favera dataset. Kluger et al. Genome Res. 13:703
66
Golub, TR et al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 (1999). Panels: biclustering; bistochastization; SVD; bi-normalization; Normalized cuts. Classes: ALL (B), ALL (T), AML
67
Predicting Protein Function on a Genome Scale: 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering microarray experiments: Local Clustering (to identify time-shifted and inverted relationships); Relating Clustering to Known Regulatory Relationships; Spectral Biclustering (to identify marker genes associated with particular phenotypes) 4) Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); Predicting interactions in yeast de novo from non-interaction data sources (with verification)
68
Integration for Interactomes. Diverse sources of interaction information: Databases (BIND, DIP, MIPS etc.); Individual expts. in literature; High-throughput datasets: in vivo pull down (Ho, Gavin), yeast two-hybrid (Uetz, Ito); Genomic data: Expression, Phenotypes, Localization, Functional. Noisy: High-throughput data is less reliable than smaller scale experiments [Grigorev, Bork]. Combining data increases accuracy & coverage [Church]. How to do quantitatively? How to weight different data sources? General classification problem (machine learning): Bayesian Approaches… Science 295:284
69
Example of data integration: RNA polymerase II. Which subunits interact? Based on protein-protein interaction experiments. Compare with Gold Std. structure [Kornberg]. Edwards, Kus, et al. TIG 18:529
70
Data integration: RNA polymerase II
71
Data integration: RNA polymerase II
72
Data integration: RNA polymerase II. Interaction experiments before structure was known
73
Data integration: RNA polymerase II
74
Data integration: RNA polymerase II. Integrate using naive Bayes classifier
75
Data integration: RNA polymerase II. Integrate using naive Bayes classifier
76
Data integration: RNA polymerase II. Integrate using naive Bayes classifier
77
Weighted Voting: the Likelihood Ratio. Vote: +2 = 1 + 1 + -1 + -1 + 1 + 1. With weights: likelihood ratio L = L1 + L2 + L3 …
78
Correlations between similar features
79
Relative quality of different expts.: L = (TP/FP) (N/P) [for uncorrelated features]
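The formula L = (TP/FP)(N/P) is the ratio of the true-positive rate TP/P to the false-positive rate FP/N, where P and N are the sizes of the positive and negative gold-standard sets. The sketch below computes it from counts and shows the standard naive Bayes step of multiplying likelihood ratios from conditionally independent features (equivalent to adding their logarithms, which is what makes it behave like a weighted vote). All counts are hypothetical, and the function names are introduced here only for illustration.

```python
def likelihood_ratio(tp, fp, positives, negatives):
    """L = (TP/FP) * (N/P), i.e. (TP/P) / (FP/N)."""
    if fp == 0 or positives == 0:
        raise ValueError("FP and P must be non-zero for a finite ratio")
    return (tp / fp) * (negatives / positives)

def combine_independent(ratios):
    """Naive Bayes combination: multiply ratios of independent features."""
    total = 1.0
    for r in ratios:
        total *= r
    return total

def posterior_odds(prior_odds, ratio):
    """Bayes rule in odds form: posterior odds = prior odds * L."""
    return prior_odds * ratio

if __name__ == "__main__":
    # Hypothetical experiment: 90 true and 10 false positives, scored
    # against 1,000 gold-standard positives and 100,000 negatives.
    L_expt = likelihood_ratio(90, 10, 1000, 100000)
    print("L for one experiment:", L_expt)

    # Hypothetical ratios from three independent features.
    L_all = combine_independent([2.0, 5.0, 0.5])
    print("combined L:", L_all)

    # Prior odds P/N for the same hypothetical gold standard.
    print("posterior odds:", posterior_odds(1000 / 100000, L_expt))
```

A pair is favored as a true interaction when its posterior odds exceed 1, so with rare positives a feature needs a large L before it is convincing on its own.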
80
Disagreement in high-throughput protein interaction datasets [Eisenberg, Fields & Bork]. Datasets: Gavin, Uetz, Ho. TP / FP: 90/556711/135 1357/6226 6/6 353/212 18/6 15/1. Jansen et al. JSFG 2:71
81
Predicting Protein Function on a Genome Scale: 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering microarray experiments: Local Clustering (to identify time-shifted and inverted relationships); Relating Clustering to Known Regulatory Relationships; Spectral Biclustering (to identify marker genes associated with particular phenotypes) 4) Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); Predicting interactions in yeast de novo from non-interaction data sources (with verification)
82
Comparison of network against gold standard complex interactions. Positives: 8250 known interactions in MIPS complexes [Mewes]. Negatives: ~2.7 M pairs in diff. subcellular compartments. Set of predicted “interactions” scored as TP and FP [Related Data in BIND, DIP]
83
Overview of information integrated and Bayesian Formalism: Data suggestive of interactions (co-expression, co-localization, similar essentiality); Noisy high-throughput experiments (Gavin et al., Uetz et al. &c); Gold-standard complexes (MIPS, Mewes, Frishman et al.). Jansen et al. Science (in press)
84
Overview of information integrated and Bayesian Formalism: Cross-validated training and testing; Thresholding L at various values; Tabulation of observed TP and FP at various thresholds. Jansen et al. Science (in press)
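The thresholding and tabulation step can be sketched as follows. The example assumes each protein pair in a held-out test fold already has a likelihood ratio L and a gold-standard label; the scores, labels, and the name `tabulate_tp_fp` are hypothetical, and the cross-validation split itself is not shown.

```python
def tabulate_tp_fp(scored_pairs, thresholds):
    """Count TP and FP among pairs with L >= threshold, for each threshold.

    scored_pairs: iterable of (likelihood_ratio, is_gold_positive) tuples.
    Returns a list of (threshold, tp, fp) rows.
    """
    pairs = list(scored_pairs)
    rows = []
    for t in thresholds:
        tp = sum(1 for score, is_pos in pairs if score >= t and is_pos)
        fp = sum(1 for score, is_pos in pairs if score >= t and not is_pos)
        rows.append((t, tp, fp))
    return rows

# Hypothetical held-out pairs: (likelihood ratio, gold-standard label).
test_fold = [
    (900.0, True), (300.0, True), (300.0, False), (40.0, True),
    (40.0, False), (5.0, False), (0.5, False), (0.5, True),
]

if __name__ == "__main__":
    for threshold, tp, fp in tabulate_tp_fp(test_fold, [1, 100, 600]):
        print(f"L >= {threshold}: TP={tp} FP={fp}")
```

Raising the threshold trades coverage for precision: fewer pairs pass, but a larger share of those that do are gold-standard positives.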
85
Observed TP/FP Ratio Tracks L, Suggesting a Threshold. Jansen et al. Science (in press)
86
Observed TP/FP Ratio Tracks L, Suggesting a Threshold. Jansen et al. Science (in press)
87
Observed TP/FP Ratio Tracks L, Suggesting a Threshold. Jansen et al. Science (in press)
88
Integration of Features Gives Much Higher Likelihood Ratios than Any Individual Feature. Jansen et al. Science (in press)
89
Comparison of Predictions with Known Complexes. Jansen et al. Science (in press)
90
Predicted Network. Jansen et al. Science (in press)
91
Example prediction: Mito. Ribosome. Jansen et al. Science (in press)
92
Comparison with new experiments (J Greenblatt): RFA cplx. Jansen et al. Science (in press)
93
Integration of Features Gives Much Higher Likelihood Ratios than Any Individual Feature. Jansen et al. Science (in press)
94
Same is True for Combining High-throughput Interaction Data (Integration of Features Gives Much Higher Likelihood Ratios than Any Individual Feature). Jansen et al. Science (in press)
95
Comparison of Strength of Purely Predicted Features (PIP) vs. Integrated High-throughput Data (PIE). Jansen et al. Science (in press)
96
Predicting Protein Function on a Genome Scale: 1) "Traditional" sequence patterns 2) Via fold similarity (structural genomics) 3) Clustering microarray experiments: Local Clustering (to identify time-shifted and inverted relationships); Relating Clustering to Known Regulatory Relationships; Spectral Biclustering (to identify marker genes associated with particular phenotypes) 4) Data integration: Bayesian methods to uniformly & optimally combine evidence (in application to integration of protein interaction data); Predicting interactions in yeast de novo from non-interaction data sources (with verification)
97
Acknowledgements
- Protein Function Prediction (GeneCensus.org): J Qian, R Jansen, A Drawid, C Wilson, H Yu, D Greenbaum, J Lin, N Luscombe, H Hegyi, Y Kluger
- Pseudogenes (Pseudogene.org): P Harrison, Z Zhang, Y Liu, S Balasubramanian, P Bertone, T Johnson, J Karro
- Macromolecular Motions (MolMovDB.org): J Junker, H Yu, N Echols, V Alexandrov, W Krebs, D Milburn, U Lehnert
- Structural Proteomics (PartsList.org): C Goh, N Lan, H Hegyi, R Das, S Douglas, B Stenger
- Collaborators: J Chang, R Basri, J Greenblatt (N Krogan)
- Yale CEGS: M Snyder (A Kumar, H Zhu, M Bilgin …), S Weissman, P Miller (K Cheung)
- NESG.org: G Montelione, A Edwards (B Kus)
- NIH, NSF