Do not reproduce without permission. Gerstein.info/talks (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Permissions Statement
This presentation is copyright Mark Gerstein, Yale University. Feel free to use images in it with PROPER acknowledgement.
Computational Proteomics of Protein Complexes
Mark B Gerstein
Yale U
Talk at NIH
The Interactome: the Next ‘omic Step
Interactome, Proteome, Transcriptome, Genome
The popularity of interactome information
Computational Proteomics of Complexes
1. Interactions provide a systematic way of defining protein function on a genomic scale
2. Known complexes provide a benchmark to validate and integrate genome-wide interaction experiments, providing a more accurate interactome
3. Known complexes provide a focus for the integration of (non-interaction) genomic information, e.g. expression data
4. Extrapolating from known complexes, one can predict protein complexes on a genome-scale via integrating experimental interactions and non-interaction information (combining #1 and #2)
Circumscribing Protein Function in terms of Interactions
Understanding Protein Function on a Genomic Scale
250 of 650 known on chr. 22 [Dunham et al.]
>>30K+ proteins in entire human genome (alt. splicing)
Issues in defining protein function on a genomic scale
Multi-functionality: 2 functions/protein (also 2 proteins/function)
Role conflation: molecular, cellular, phenotypic
Fun terms… but do they scale?
  Starry night
  Sarah (affects female fertility)
  Sonic
  Darkener of apricot & suppressor of white apricot
  Redtape, gridlock, roadblock (when mutated, block transport along axons)
  ROP vs ROM ("Regulator of Copy Number" or RNA-I-II-complex-binding-protein)
For now, definable aspects of function: interactions, location, enzymatic rxn. [Babbit]
Ontologies for function: Networks, Hierarchies, DAGs
Ontologies for function: Interaction vectors
Lan et al. IEEE (2002) & COSB (2003)
Validating and Integrating Genomic Protein-Protein Interaction Datasets with Known Complexes
Protein interaction data
Databases (BIND, DIP, MIPS etc.)
  literature
High-throughput datasets
  in vivo pull down
  yeast two-hybrid
Computational predictions
Tangential genomic data
  Expression data
  Phenotypic data
  Localization data
Combining interaction data
High-throughput data is less reliable than more careful, smaller-scale experiments
Orthogonal datasets
Combining data increases accuracy and coverage
How to do this in a quantitative way? How to weight the different data sources?
General classification problem (machine learning)
Bayesian networks: probabilistic
Example of data integration: RNA polymerase II
Which subunits interact? -> protein-protein interaction experiments
Compare with Gold Std. structure: Kornberg et al., 2001
Edwards, Kus, Jansen, Greenbaum, Greenblatt, Gerstein, TIG (2002)
Data integration: RNA polymerase II
Data integration: RNA polymerase II
Interaction experiments before structure was known
Data integration: RNA polymerase II
Integrate using naive Bayes classifier
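As a rough illustration of what integrating interaction experiments with a naive Bayes classifier can look like, the sketch below multiplies per-data-set likelihood ratios into posterior odds for one protein pair. It is a minimal sketch under stated assumptions, not the procedure used in the talk: the data-set names, the counts, the gold-standard sizes, and the choice of prior odds are all hypothetical.

```python
import math

def evidence_ratio(observed, tp, fp, n_pos, n_neg):
    """Likelihood ratio contributed by one data set for one protein pair.

    tp / n_pos estimates P(reported | true interaction);
    fp / n_neg estimates P(reported | non-interaction).
    """
    p_pos = tp / n_pos
    p_neg = fp / n_neg
    if observed:
        return p_pos / p_neg
    return (1.0 - p_pos) / (1.0 - p_neg)

def posterior_odds(prior_odds, ratios):
    # Naive Bayes: data sets are treated as conditionally independent,
    # so their likelihood ratios simply multiply.
    return prior_odds * math.prod(ratios)

# Hypothetical gold-standard sizes and per-data-set counts (not from the talk).
N_POS, N_NEG = 100, 1000
DATASETS = {"expt_A": (40, 20), "expt_B": (10, 50)}  # name -> (TP, FP)

def score_pair(observations):
    """observations maps data-set name -> True/False (pair reported or not)."""
    ratios = [
        evidence_ratio(observations[name], tp, fp, N_POS, N_NEG)
        for name, (tp, fp) in DATASETS.items()
    ]
    # Assumption: prior odds are taken from the gold-standard set sizes.
    return posterior_odds(N_POS / N_NEG, ratios)

odds = score_pair({"expt_A": True, "expt_B": False})
probability = odds / (1.0 + odds)
print(f"posterior odds = {odds:.3f}, probability = {probability:.3f}")
```

A data set with a high TP rate and a low FP rate pulls the odds up strongly when it reports a pair, which is one way to answer the question of how to weight the different data sources.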
Data integration: RNA polymerase II
Comparison of interaction data sets
[Table columns: Data set, Method]
Comparison of experimental data with gold standards
Positives: 8250 interactions in MIPS complexes
Negatives: ~2.7 M pairs in different subcellular compartments
Set of experimental “interactions”: TP (overlap with positives), FP (overlap with negatives)
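The comparison against gold standards can be sketched as a set intersection: pairs that fall in the positive set count as TP, pairs that fall in the negative set count as FP, and pairs in neither set are left unscored. This is only an illustrative sketch; the protein names and the toy gold-standard sets below are hypothetical.

```python
def canonical(pair):
    """Order-independent key for an undirected protein pair."""
    return tuple(sorted(pair))

def score_against_gold(predicted, positives, negatives):
    """Return (TP, FP) for a set of experimental 'interactions'."""
    pred = {canonical(p) for p in predicted}
    tp = len(pred & positives)
    fp = len(pred & negatives)
    return tp, fp

# Hypothetical toy gold standards (stand-ins for same-complex pairs and
# pairs in different subcellular compartments).
positives = {canonical(p) for p in [("A", "B"), ("B", "C"), ("A", "C")]}
negatives = {canonical(p) for p in [("A", "X"), ("B", "Y"), ("C", "Z")]}

# Hypothetical experimental data set; ("Q", "R") is in neither gold standard.
experiment = [("B", "A"), ("C", "B"), ("X", "A"), ("Q", "R")]

tp, fp = score_against_gold(experiment, positives, negatives)
print(f"TP / FP = {tp} / {fp}")
```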
Combining experimental data
[Figure: TP / FP counts for the Gavin, Uetz, and Ho data sets and their overlaps]
Jansen et al. JSFG 2002
Integrating Structural Complexes with Non-interaction Genomic Information: Using them to Interpret Gene Expression data
Format of Gene Expression Data
[Figure: gene-by-gene matrix; row and column labels: MCM3 MCM6 CDC47 MCM2 CDC46 CDC54 DPB3 CDC45 DPB2 CDC2 CDC7 POL2 HYS2 POL32 DBF4 ORC2 ORC6 ORC5 ORC4 ORC3 ORC1]
Expression Correlations Segment Replication Complex into Component Parts
[Figure: gene-by-gene matrix; row and column labels: MCM3 MCM6 CDC47 MCM2 CDC46 CDC54 DPB3 CDC45 DPB2 CDC2 CDC7 POL2 HYS2 POL32 DBF4 ORC2 ORC6 ORC5 ORC4 ORC3 ORC1; blocks labeled MCMs, ORC, and Polym. & prots.]
Range of Expression Correlations within Complexes
Replication Cplx: Overall .05; ORC .19; MCMs .75; Pol. .45, .75
Ribosome: Overall .80; Large .80; Small .81
Proteasome: Overall .43; 20S .50; 19S .51
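One plausible way to obtain a single correlation figure per complex is to average the Pearson correlation over all pairs of member genes. The sketch below does that on made-up profiles; the gene names and values are hypothetical, and the averaging scheme is an assumption rather than the documented method behind the numbers above.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # flat profile: treat as uncorrelated
    return cov / (sx * sy)

def mean_pairwise_correlation(profiles, members):
    """Average correlation over all gene pairs within one complex."""
    pairs = list(combinations(members, 2))
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

# Hypothetical expression timecourses (not real data).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # falls while geneA rises
}

tight = mean_pairwise_correlation(profiles, ["geneA", "geneB"])
mixed = mean_pairwise_correlation(profiles, ["geneA", "geneB", "geneC"])
print(f"tight subcomplex: {tight:.2f}, whole set: {mixed:.2f}")
```

The toy numbers mirror the pattern on the slide: a coherent subgroup can score high on its own while the overall average for the larger assembly is much lower.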
Protein-Protein Interactions & Expression
[Figure: expression correlation between selected expression timecourses (Cell Cycle CDC28 expt., Davis) for sets of interactions: all pairs (control); complexes from MIPS (strong interactions in permanent complexes, clearly diff.); pairwise interactions (Uetz et al.)]
Significance of correlations (complexes)
[Figure panels: Permanent; Transient/other]
Jansen et al., Genome Research, 2002
Permanent v. Transient Complexes
Jansen et al., Genome Research, 2002
Transient complexes
Permanent complexes show strong co-expression, whereas transient complexes have weaker co-expression
Example: replication complex
Subparticles behave like permanent complexes
Jansen et al., Genome Research, 2002
Genome-wide prediction of protein complexes based on both high-throughput interaction data and non-interaction, genomic information
Global Network of 3 Different Types of Relationships
~313K significant relationships from ~18M possible
Global Network of 3 Different Types of Relationships
Simultaneous: 188K
Inverted: 63K
Shifted: 67K
~313K significant relationships from ~18M possible
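To make the three relationship types concrete, the sketch below labels a pair of expression profiles as simultaneous, inverted, or shifted using plain and time-lagged Pearson correlation. This is a deliberately simplified stand-in, not the clustering method that produced the counts above; the function name, the `max_shift` and `threshold` parameters, and the toy profiles are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is flat."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def classify_relationship(x, y, max_shift=2, threshold=0.9):
    """Label a profile pair as simultaneous, inverted, shifted, or none."""
    r0 = pearson(x, y)
    if r0 >= threshold:
        return "simultaneous"
    if r0 <= -threshold:
        return "inverted"
    # Try delaying one profile relative to the other, in both directions.
    for lag in range(1, max_shift + 1):
        if pearson(x[lag:], y[:-lag]) >= threshold:
            return "shifted"
        if pearson(x[:-lag], y[lag:]) >= threshold:
            return "shifted"
    return "none"

# Hypothetical periodic profiles (period of 4 time points).
x = [0, 1, 0, -1, 0, 1, 0, -1]
same = [2 * v for v in x]              # same shape, larger amplitude
flipped = [-v for v in x]              # mirror image
delayed = [-1, 0, 1, 0, -1, 0, 1, 0]   # x delayed by one time point
flat = [1] * 8                         # no variation at all

for name, y in [("same", same), ("flipped", flipped),
                ("delayed", delayed), ("flat", flat)]:
    print(name, "->", classify_relationship(x, y))
```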
Globally, how well do expression relationships predict known interactions?
Coverage of the 8250 known interactions in complexes [MIPS], and enrichment compared to randomized expression relationships:
Random: ~2% coverage (313K/18M), 1x
CC: 42% coverage, 24x enrichment
CC: 313K relationships from ~18M possible, from clustering cell-cycle expt.
Combining Expression Data Sets Increases Coverage & Decreases Noise
Coverage of the 8250 known interactions in complexes [MIPS], and enrichment compared to randomized expression relationships:
KO: 34% coverage, 22x enrichment
KO: 278K relationships from clustering knock-out profiles [Rosetta]
Combining Expression Data Sets Increases Coverage & Decreases Noise
Coverage of the 8250 known interactions in complexes [MIPS], and enrichment compared to randomized expression relationships:
CC: 42% coverage, 24x enrichment
KO: 34% coverage, 22x enrichment
KO v CC: 55% coverage, 111x enrichment
KO ^ CC: 21% coverage, 254x enrichment
CC: 313K relationships from ~18M possible, from clustering cell-cycle expt.
KO: 278K relationships from clustering knock-out profiles [Rosetta]
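The enrichment figures for the individual data sets follow from dividing the observed coverage by the coverage a random set of the same size would be expected to achieve. The sketch below reproduces that arithmetic for CC and KO from the numbers on the slide; the function name is hypothetical, and the assumption is that the random baseline is simply the fraction of all ~18M possible pairs that were selected.

```python
def enrichment(coverage, n_relationships, n_possible):
    """Observed coverage divided by the coverage expected at random."""
    random_coverage = n_relationships / n_possible
    return coverage / random_coverage

N_POSSIBLE = 18_000_000  # ~18M possible pairs

# Random baseline for a set the size of CC: 313K / 18M, i.e. roughly 2%.
random_cc = 313_000 / N_POSSIBLE

cc = enrichment(0.42, 313_000, N_POSSIBLE)  # cell-cycle relationships
ko = enrichment(0.34, 278_000, N_POSSIBLE)  # knock-out relationships

print(f"random baseline: {random_cc:.1%}")
print(f"CC enrichment: {cc:.1f}x, KO enrichment: {ko:.1f}x")
```

With these inputs the ratios come out at about 24x for CC and 22x for KO, matching the slide.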
Computational Proteomics of Complexes
1. Interactions provide a systematic way of defining protein function on a genomic scale
2. Known complexes provide a benchmark to validate and integrate genome-wide interaction experiments, providing a more accurate interactome
3. Known complexes provide a focus for the integration of (non-interaction) genomic information, e.g. expression data
4. Extrapolating from known complexes, one can predict protein complexes on a genome-scale via integrating experimental interactions and non-interaction information (combining #1 and #2)
For the Future
Developing an accurate interactome for the cell, from prediction and through integration of high-throughput information
Development of statistical approaches to combine and integrate information
Development of database technologies to store heterogeneous and noisy genome-wide interaction datasets
A moderate number of structural complexes are very useful as gold standard data
Protein complexes & Structural Genomics
A computational challenge following from the solution of the parts list
Given many monomeric structures produced by structural genomics, predict (or rationalize) the interactome through docking
Maybe many structures will only be solved as complexes…
Association between Protein Sequence Features and Experimental Progress
Bottlenecks in analysis of all of TargetDB (Interologs)
Acknowledgements
J Qian, R Jansen, A Drawid, C Wilson, D Greenbaum, C Goh, N Lan, H Hegyi, R Das, S Douglas, B Stenger, J Lin, Y Kluger
Collaborators: M Snyder (A Kumar, H Zhu, …), A Edwards, B Kus, J Greenblatt
NIH
GeneCensus.org