1
Do not reproduce without permission 1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu Permissions Statement This Presentation is copyright Mark Gerstein, Yale University, 2002. Feel free to use images in it with PROPER acknowledgement.
2
Computational Proteomics: Genome-scale studies of protein function, structure, and evolution
Mark B Gerstein, Yale U
H Hegyi, J Lin, J Qian, N Luscombe, T Johnson, A Drawid, R Jansen, V Alexandrov, M Snyder, A Kumar, H Zhu, D Greenbaum, N Lan, P Harrison, N Echols, S Balasubramanian, P Bertone, Z Zhang, R Das, Y Liu, Y Kluger, H Yu, P Miller, K Cheung, S Weissman
Talk at Harvard 02.02.11
3
Understand Proteins, through analyzing populations
Structures, Functions, Evolution
Structures, Sequences, Microarrays
Integration of Information (motions, packing, folds)
4
The central post-genomic term
6
The central post-genomic term
[Figure: PubMed hits for "Proteome"]
7
Proteins are central to the 2 major post-genomic challenges
(Initial Step: genome sequence & genes)
1. Understanding genes in detail: Predicting protein function on a genomic scale
2. Understanding what's between genes: Analyzing protein fossils
8
Proteins are central to the 2 major post-genomic challenges
Predicting protein function on a genomic scale
9
How to predict function for 1000s of proteins?
250 of ~650 known on chr. 22 [Dunham et al.]
>>30K+ Proteins in Entire Human Genome (alt. splicing)
10
How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
11
How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
  Compare uncharacterized genome sequences against known sequences in DBs, transferring functional annotation between similar sequences
  Issue: the threshold is the major parameter & limitation
  Also, look for motifs & sites [Sternberg, Thornton, Rose, Koonin]
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
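A minimal sketch of threshold-based annotation transfer as described on this slide. The `Hit` record, its field names, and the default cutoff of 40% (the practical threshold quoted later in the talk) are illustrative assumptions, not the author's pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One database match for a query sequence (hypothetical record layout)."""
    subject_id: str
    percent_identity: float  # 0-100, over the aligned region
    annotation: str          # functional label of the known sequence


def transfer_annotation(hits: List[Hit], threshold: float = 40.0) -> Optional[str]:
    """Return the annotation of the best hit at or above the threshold, else None."""
    passing = [h for h in hits if h.percent_identity >= threshold]
    if not passing:
        return None  # below threshold: do not transfer function
    best = max(passing, key=lambda h: h.percent_identity)
    return best.annotation


# Toy hits, invented for illustration.
hits = [
    Hit("known_A", 90.0, "5.3.1.1 (TP isomerase)"),
    Hit("known_B", 20.0, "4.2.1.11 (enolase)"),
]
print(transfer_annotation(hits))                  # annotation of known_A
print(transfer_annotation(hits, threshold=95.0))  # None: nothing passes
```

Raising or lowering `threshold` trades coverage against the risk of transferring a wrong function, which is the limitation the slide points to.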
12
1000s of structurally based alignments of structurally and functionally characterized sequences
Sequence: (Human) vs (Chick): 90%; (E coli) vs (B ster.): 45%; (E coli) vs (Yeast): 20%
Function:
  Same Exact: 5.3.1.1 (TP Isomerase), 5.3.1.1 (TP Isomerase)
  Both Class 5 (isom.): 5.3.1.1 (TP Isomerase), 5.3.1.24 (PRA Isomerase), 5.3.1.15 (Xylose Isom.)
  Different Classes: 4.1.3.3 (Aldolase), 4.2.1.11 (Enolase)
13
Relationship of Similarity in Sequence to that in Function
[Plot x-axis: %ID, sequence similarity of pairs of proteins]
[Plot y-axis: % Same Function, percentage of pairs that have the same precise function as defined by the Enzyme & FlyBase functional classifications]
14
Relationship of Similarity in Sequence to that in Function
[Plot: % Same Function vs. %ID]
15
Relationship of Similarity in Sequence to that in Function
Can transfer both Fold & Functional Annotation
[Plot: % Same Function vs. %ID]
16
Relationship of Similarity in Sequence to that in Function
Cannot transfer Fold or Functional Annotation ("Twilight Zone")
Can transfer Annotation related to Fold but not Function
Can transfer both Fold & Functional Annotation
[Plot: % Same Function vs. %ID]
17
Relationship of Similarity in Sequence to that in Function
Cannot transfer Fold or Functional Annotation ("Twilight Zone")
Can transfer Annotation related to Fold but not Function
Can transfer both Fold & Functional Annotation
Broad v Narrow Similarity
[Plot: % Same Function vs. %ID]
18
Caveats: Sequence Divergence of Multidomain Proteins Implies a Practical Threshold of >40%
[Figure labels: (Human), (Chick), (E coli), (B ster.), (E coli), (Yeast), (Rat); Single Domain Sequences; Multidomain Sequences]
19
Multi-domain proteins have greater divergence in function with sequence
[Plot: % Same Function vs. Sequence Similarity, -log(e-value); label: (Very Close)]
20
How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
  Structures of ORFs with unknown function; use Fold & Site Similarity to Determine Function
  Rationale for Structure Prediction
  Issue: To what degree does fold determine function?
  [Kim, Edwards & Arrowsmith, Montelione, Burley, Eisenberg]
3) Clustering a microarray experiment
4) Data integration
21
Fold-Function Combinations
Many Functions on Same Fold (TIM-barrel)
Different Folds with Same Function (Carbonic Anhydrases, 4.2.1.1)
22
Global View of Fold-Function Combinations
[Figure: matrix of 229 Folds vs. 91 Enzymatic Functions plus Non-Enz]
23
Correlation with Structural Features
[Figure: matrix of 229 Folds vs. 91 Enzymatic Functions plus Non-Enz; folds grouped by Architectural Class: all-α, all-β, small]
24
Correlation with Structural Features
Slight Overpopulation
[Figure: matrix of 229 Folds vs. 91 Enzymatic Functions plus Non-Enz; axes grouped by Architectural Class (all-α, all-β, small) and Enzyme Class]
25
Global View of Fold-Function Combinations
Sort
[Figure: matrix of 229 Folds vs. 91 Enzymatic Functions plus Non-Enz]
26
To what degree is fold associated with function?
Folds with multiple functions [Similar results by Thornton]
[Plot: Frequency in database of 229 folds vs. Number of functions associated with a fold]
27
How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
28
Microarray experiments
Expression Arrays [Brown]
Proteome Chips [Snyder]
29
Clustering the yeast cell cycle to uncover interacting proteins
Microarray timecourse of 1 ribosomal protein [Brown, Davis]
[Plot: mRNA expression level (ratio) vs. Time]
30
Clustering the yeast cell cycle to uncover interacting proteins
Random relationship from ~18M
[Plot: mRNA expression level (ratio) vs. Time]
31
Clustering the yeast cell cycle to uncover interacting proteins
Close relationship from 18M (2 Interacting Ribosomal Proteins) [Botstein; Church, Vidal]
[Plot: mRNA expression level (ratio) vs. Time]
32
Clustering the yeast cell cycle to uncover interacting proteins
Predict Functional Interaction of Unknown Member of Cluster
[Plot: mRNA expression level (ratio) vs. Time]
33
Local Clustering algorithm identifies further (reasonable) types of expression relationships
Simultaneous (Traditional Global Correlation)
Inverted
Time-Shifted
[Church]
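The three relationship types can be illustrated with plain Pearson correlation over a pair of expression profiles. This is a simplified stand-in and not the local clustering algorithm itself: it scores the whole profile (simultaneous), its negative (inverted), and lagged overlaps (time-shifted). The function names, the maximum lag, and the toy sine profiles are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sx * sy)


def classify_relationship(x, y, max_lag=3):
    """Return (label, score) for the strongest of the three relationship types."""
    r0 = pearson(x, y)
    candidates = [("simultaneous", r0), ("inverted", -r0)]
    for lag in range(1, max_lag + 1):
        # y follows x by `lag` time points
        candidates.append(("time-shifted", pearson(x[:-lag], y[lag:])))
        # x follows y by `lag` time points
        candidates.append(("time-shifted", pearson(x[lag:], y[:-lag])))
    return max(candidates, key=lambda c: c[1])


# Toy profiles: a sine wave, its mirror image, and a copy delayed by 2 points.
x = [math.sin(2 * math.pi * t / 8) for t in range(16)]
y_inverted = [-v for v in x]
y_shifted = [math.sin(2 * math.pi * (t - 2) / 8) for t in range(16)]

for name, y in [("same", x), ("inverted", y_inverted), ("shifted", y_shifted)]:
    label, score = classify_relationship(x, y)
    print(name, label, round(score, 3))
```

A global correlation alone would score the shifted pair near zero, which is why the lagged comparisons are needed to see that relationship.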
34
Examples of inverted relationships
Documented
  YME1: mito. protease involved in cplx. assembly
  YNT20: known suppressor of YME1
Suggestive
  PUT2: involved in Pro degradation
  SER3: involved in Ser synthesis
[Plot: Expr. Ratio vs. Time]
35
Examples of time-shifted relationships
Suggestive
  ARP3: in actin remodelling cplx.
  ARC35: in same cplx. (required late in cell cycle)
Predicted
  J0544: unknown function
  MRPL19: mito. ribosome
[Plot: Expr. Ratio vs. Time]
37
Expression Correlations Segment Large Replication Complex into Component Parts
[Figure: correlation matrix; rows and columns in the order MCM3, MCM6, CDC47, MCM2, CDC46, CDC54, DPB3, CDC45, DPB2, CDC2, CDC7, POL2, HYS2, POL32, DBF4, ORC2, ORC6, ORC5, ORC4, ORC3, ORC1]
38
Expression Correlations Segment Large Replication Complex into Component Parts
[Figure: correlation matrix; rows and columns in the order MCM3, MCM6, CDC47, MCM2, CDC46, CDC54, DPB3, CDC45, DPB2, CDC2, CDC7, POL2, HYS2, POL32, DBF4, ORC2, ORC6, ORC5, ORC4, ORC3, ORC1]
[Group labels: MCMs prots., ORC, Polym. &]
39
Range of Expression Correlations within Complexes
Replication Cplx: Overall .05; ORC .19; MCMs .75; Pol. .45, .75
Ribosome: Overall .80; Large .80; Small .81
Proteasome: Overall .43; 20S .50; 19S .51
40
Permanent v. Transient Complexes
41
Global Network of 3 Different Types of Relationships
~313K significant relationships from ~18M possible
42
Global Network of 3 Different Types of Relationships
Simultaneous: 188K
Inverted: 63K
Shifted: 67K
~313K significant relationships from ~18M possible
43
Globally, how well do expression relationships predict known interactions?
Coverage of the 8250 Known Interactions in Complexes Found, with Enrichment Compared to Randomized Expression Relationships:
  Random: ~2%, 1x (313K/18M)
  CC: 42%, 24x
CC: 313K relationships from ~18M possible from clustering cell-cycle expt.
44
Combining Expression Data Sets Increases Coverage & Decreases Noise
Coverage of the 8250 Known Interactions in Complexes Found, with Enrichment Compared to Randomized Expression Relationships:
  KO: 34%, 22x
KO: 278K relationships from clustering knock-out profiles [Rosetta]
45
Combining Expression Data Sets Increases Coverage & Decreases Noise
Coverage of the 8250 Known Interactions in Complexes Found, with Enrichment Compared to Randomized Expression Relationships:
  CC: 42%, 24x
  KO: 34%, 22x
  KO v CC: 55%, 111x
  KO ^ CC: 21%, 254x
CC: 313K relationships from ~18M possible from clustering cell-cycle expt.
KO: 278K relationships from clustering knock-out profiles [Rosetta]
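One plausible reading of the coverage and enrichment figures is sketched below: coverage is the fraction of known interactions recovered, and enrichment divides that by the fraction of all possible pairs that were predicted (the "1x (313K/18M)" baseline). The function name, the frozenset representation of pairs, and the toy gene pairs are assumptions. The last lines recheck the CC and KO figures from the slide's rounded numbers; the combined KO v CC and KO ^ CC figures are not rechecked because the sizes of those sets are not given.

```python
def coverage_and_enrichment(predicted, known, n_possible):
    """Coverage: fraction of known pairs that were predicted.
    Enrichment: coverage divided by the fraction of all possible pairs predicted,
    i.e. relative to choosing the same number of pairs at random."""
    coverage = len(predicted & known) / len(known)
    random_expectation = len(predicted) / n_possible
    return coverage, coverage / random_expectation


def pair(a, b):
    """Unordered gene pair."""
    return frozenset((a, b))


# Toy data, invented for illustration: 8 genes give 8*7/2 = 28 possible pairs.
known = {pair("A", "B"), pair("A", "C"), pair("B", "C"), pair("D", "E")}
cc = {pair("A", "B"), pair("A", "C"), pair("F", "G")}
ko = {pair("A", "B"), pair("D", "E"), pair("F", "H")}
n_possible = 28

for label, preds in [("CC", cc), ("KO", ko), ("KO v CC", cc | ko), ("KO ^ CC", cc & ko)]:
    cov, enr = coverage_and_enrichment(preds, known, n_possible)
    print(f"{label}: coverage={cov:.2f} enrichment={enr:.1f}x")

# Recheck with the slide's rounded numbers.
print(round(0.42 / (313_000 / 18_000_000), 1))  # CC: close to the quoted 24x
print(round(0.34 / (278_000 / 18_000_000), 1))  # KO: close to the quoted 22x
```

In the toy data the union raises coverage while the intersection raises enrichment, the same qualitative trade-off the slide reports.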
46
How to predict function for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
  Obviously, integration of orthogonal info. is good, but how to achieve it? And what are the issues?
  An example for subcellular localization in yeast
47
Subcellular Localization, a standardized aspect of function
Compartments: Nucleus, Membrane, Extracellular [secreted], ER, Cytoplasm, Mitochondria, Golgi
48
"Traditionally" subcellular localization is "predicted" by sequence patterns
Sequence patterns: NLS, TM-helix, Sig. Seq., HDEL, Import Sig.
Compartments: Nucleus, Membrane, Extracellular [secreted], ER, Cytoplasm, Mitochondria, Golgi
49
Subcellular localization is associated with the level of gene expression
Compartments: Nucleus, Membrane, Extracellular [secreted], ER, Cytoplasm, Mitochondria, Golgi
[Expression Level in Copies/Cell]
50
Combine Expression Information & Sequence Patterns to Predict Localization
Sequence patterns: NLS, TM-helix, Sig. Seq., HDEL, Import Sig.
Compartments: Nucleus, Membrane, Extracellular [secreted], ER, Cytoplasm, Mitochondria, Golgi
[Expression Level in Copies/Cell]
51
Issues in Combining Many Features
Total of 30 diverse features (also including essentiality, coiled-coils, expression fluc., & obscure seq. patterns)
How to standardize features? How to weight them?
Sequence patterns: NLS, TM-helix, Sig. Seq., HDEL, Import Sig.
Compartments: Nucleus, Membrane, Extracellular [secreted], ER, Mitochondria, Golgi
52
Bayesian System for Localizing Proteins
Everything expressed in standardized probabilistic terms (Features as freq. in training set)
Sequentially apply features to refine an assumed prior estimate using Bayes' Rule (Feature x Prior / Normalization):
  Prior
  Feature 1: NLS (# NLS) -> New Estimate
  Feature 2: High Expr. -> Better Estimate
  Feature 3: Is Essential? -> Final Estimate
A final estimate that naturally weights the features comes out
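The sequential update on this slide can be sketched as repeated application of Bayes' rule, treating features as conditionally independent given the compartment. The compartments, prior, and likelihood values below are invented for illustration and are not taken from the talk's training set.

```python
def bayes_update(prior, likelihood):
    """One Bayes step: posterior[c] is proportional to likelihood[c] * prior[c]."""
    unnormalized = {c: likelihood[c] * p for c, p in prior.items()}
    total = sum(unnormalized.values())  # the normalization term
    return {c: v / total for c, v in unnormalized.items()}


# Hypothetical prior over compartments.
prior = {"nucleus": 0.3, "cytoplasm": 0.5, "mitochondria": 0.2}

# Hypothetical P(feature observed | compartment), as frequencies in a training set.
observed_features = [
    ("has NLS",         {"nucleus": 0.6, "cytoplasm": 0.1, "mitochondria": 0.1}),
    ("high expression", {"nucleus": 0.2, "cytoplasm": 0.5, "mitochondria": 0.3}),
    ("is essential",    {"nucleus": 0.4, "cytoplasm": 0.2, "mitochondria": 0.2}),
]

estimate = prior
for name, likelihood in observed_features:
    estimate = bayes_update(estimate, likelihood)
    print(name, {c: round(p, 3) for c, p in estimate.items()})

print("most probable compartment:", max(estimate, key=estimate.get))
```

Because every feature enters as a likelihood, no separate weighting step is needed: an uninformative feature has nearly equal likelihoods across compartments and barely moves the estimate.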
53
Results on Testing Data
7-fold cross-validation training & test sets
Overall compartment population: 96% accuracy
[Figure: compartments Nuc., Cyt., ER, TM, Mito.]
54
Extrapolation to Compartment Populations of Whole Yeast Genome
~3300 Known Localizations from Existing DB
+ Expt. Localizations by Transposon Tagging & Direct Overexpression [Snyder]
+ Predictions (from Bayesian System)
55
How to predict function for 1000s of proteins?
1) "Traditional" sequence patterns
  o Limitation: tight threshold
  o Developed 40% threshold for sequence comparison
2) Via fold similarity (structural genomics)
  o Limitation: multifunctionality on a fold, so weak relationship
  o Measured extent of this in current DB
3) Clustering a microarray experiment
  o Limitation: suggestive relationships but not yet predictive
  o Local clustering found ~130K relationships beyond ~180K simultaneous ones
4) Data integration
  o Limitation: power is obvious but complicated to achieve
  o Increased power. Works for localization of all proteins in yeast.
56
Proteins are central to the 2 major post-genomic challenges
(Initial Step: genome sequence & genes)
1. Understanding genes in detail: Predicting protein function on a genomic scale
2. Understanding what's between genes: Analyzing protein fossils
57
Proteins are central to the 2 major post-genomic challenges
Analyzing protein fossils
58
Pseudogenes (ΨG) as Disabled Homologies
S/T Protein Phosphatase PP1 (C-term):
…SRILCMHGGLSPHLQTLDQLRQLPRPQDPPNPSIGIDLLWADPDQWVKGWQAN TRGVSYVFGQDVVADVCSRLDIDLVARAHQVVQDGYEFFASKKMVTIFSAPHYC GQFDNSAATMKVDENMVCTFVMYKPTPKSMRRG*
Worm Genome (chromosomes I, II, III, IV, V, X), Pseudogenic fragment:
TKRTSNGFGQDVVVDLFSILDSGLVARAHX VLQDIFEFFASKKMVTIFS # APHSPHSAPH YCAQFDNSAATVKV
Most Multiply Disabled #
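A small sketch of how disablements might be tallied in a translated fragment. It assumes, purely for illustration, that `*` marks a stop codon and `#` marks a frameshift; the symbol conventions, the function names, and the toy fragment are hypothetical and may differ from the notation used in the alignment on this slide.

```python
def count_disablements(translated, stop_symbols="*", frameshift_symbols="#"):
    """Count premature stops and frameshifts marked in a translated fragment.
    A stop symbol at the very end is treated as the normal terminator."""
    body = translated.rstrip()
    if body and body[-1] in stop_symbols:
        body = body[:-1]  # drop the ordinary terminating stop
    stops = sum(body.count(s) for s in stop_symbols)
    shifts = sum(body.count(s) for s in frameshift_symbols)
    return {"stops": stops, "frameshifts": shifts, "total": stops + shifts}


def is_candidate_pseudogene(translated, min_disablements=1):
    """Flag a homologous fragment that carries at least one disablement."""
    return count_disablements(translated)["total"] >= min_disablements


# Toy fragments, invented for illustration.
print(count_disablements("MKV*LT#AGH*"))   # one premature stop, one frameshift
print(is_candidate_pseudogene("MKVLT*"))   # intact reading frame
```

Counting disablements per fragment also gives a simple way to rank fragments from singly to multiply disabled.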
59
Types of ΨG: Duplicated & Processed
Duplicated: Original Gene, Duplicated Gene, Duplicated ΨG
Processed: Spliced mRNA (AAAAAA>), Processed ΨG, Processed ΨG with disablements [Heidmann]
60
Large-scale Assignment of Pseudogenes (*Chr 21+22 only)
61
ΨG Calculations
Large Basic Calculation
  "blasting" ~1M fragments against protein DB
  50 CPU days for worm
  Parallel computing + large DBs
  [Figure labels: Protein DB, chromosome, ΨG]
Integrating heterogeneous, dynamically changing annotation
  Changing sequences, gene predictions, repeats
  [Figure labels: Sequence 1, Sequence 2, Sequence 3; Genes A, Genes B; Repeats 1, Repeats 2]
62
ΨG: Questions
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?
63
Worm ΨG: Lots! Good Stats
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?
64
Amino-acid composition of Pseudogenes is midway between Genes and translated Intergenic DNA
[Plot: Worm; Frequency vs. Amino Acid (sorted)]
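The composition comparison can be sketched as follows. The distance measure (sum of absolute frequency differences) and the toy sequences are assumptions chosen to make the "midway" idea concrete; the slide does not say how the comparison was scored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Fraction of each of the 20 standard amino acids across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def distance(comp_a, comp_b):
    """Sum of absolute frequency differences (a simple, assumed measure)."""
    return sum(abs(comp_a[aa] - comp_b[aa]) for aa in AMINO_ACIDS)


# Toy sequences, invented so that the pseudogene set sits exactly midway.
genes = composition(["AAAA"])
intergenic = composition(["LLLL"])
pseudogenes = composition(["AALL"])

print(distance(genes, pseudogenes))       # 1.0
print(distance(pseudogenes, intergenic))  # 1.0
print(distance(genes, intergenic))        # 2.0
```

The same function applied to codon counts instead of residues would give the analogous check for codon usage.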
65
Amino-acid composition of Pseudogenes is midway between Genes and translated Intergenic DNA in many genomes
[Plot panels: Worm, Yeast, Fly, Human]
66
Midway composition also applies to codons
67
Pseudogene distribution on worm chromosomes: On Ends
Pseudogenes elevated at ends of worm chr I
Genes elevated in middle
68
Pseudogene distribution on worm chromosomes: On Ends
~50% G in terminal 3Mb vs ~30% G
G -- G 16% (min) G -- G 29% (max)
69
Default: # ΨG #genes, in a family
[Plot: # pseudogenes in family vs. # genes in family; labels: RT 28 59]
70
0 Completely Dead Families in Worm Genome
Extinction?
71
0 Completely Dead Families in Worm Genome
Extinction? Horz. Transfer?
72
0 Completely Dead Families in Worm Genome
Extinction? Horz. Transfer? Contamination?
73
Worm ΨG families: chemoreceptors & transposon functions
[Figure label: Environmental Response Family]
74
Worm ΨG families: Unique or highly expanded relative to fly
[Figure label: Environmental Response Family]
75
Common worm Pseudofolds [Scop, Murzin]
[Table columns: ΨG rank, Genes rank, Genes rank]
77
Fly ΨG: Broken Fossils
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?
78
Fly fossils are "broken up" by deletions
The explanation for the fly having only 5% of the ΨG of the worm is a high rate of small genomic deletions (~10 - 50 bp) [Petrov, Hartl]
pseudomotifs
79
Same density of pseudomotifs in fly and worm
Analyze occurrence of "pseudomotifs" (ancient, broken-up fossils) in intergenic regions relative to statistical expectation
1329 Prosite Motifs (e.g. ZnF C -x(2,4)- C -x(3)- -x(8)- H -x(3,5) - H & Tubulin)
TM-helices
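Scanning translated intergenic sequence for Prosite-style motifs can be sketched by converting a pattern to a regular expression. The converter below handles only the basic syntax, and its function name is hypothetical. The example pattern fills the residue set missing from the slide text with `[LIVMFYWC]`, which is recalled from the PROSITE C2H2 zinc-finger entry and should be treated as an assumption; the test sequence is invented.

```python
import re

_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression.
    Handles '-' separators, x (any residue), [..] allowed sets, {..} forbidden
    sets, and (n) or (n,m) repeat counts. Anchors '<' and '>' are not handled."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


# Assumed form of the C2H2 zinc-finger pattern (see lead-in).
znf = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(znf)

# Invented sequence containing one match.
toy = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print([m.start() for m in re.finditer(znf, toy)])
```

Comparing the number of such hits in intergenic regions with the number expected from composition alone gives the "relative to statistical expectation" measure the slide mentions.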
80
Human ΨG: Duplicated v. Processed
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?
(*Chr 21+22 only)
Expansion of genome papers [Venter et al., Dunham et al., Lander et al.]
81
Top Functional Families in Genes and ΨG on Human Chr. 21 and 22
Pseudogenes: Ribosomal protein 43; Transcription factor 12; Other DNA binding 8; Receptor 8; Kinase 5
Ig* (Environmental Response Family): 70
82
Top Functional Families in Genes and ΨG on Human Chr. 21 and 22
Genes: Other DNA binding 32; Nucleotide binding 28; Transcription Factor 28; Other Nucleic-acid binding 16; Kinase 14
Pseudogenes: Ribosomal protein 43; Transcription factor 12; Other DNA binding 8; Receptor 8; Kinase 5
Ig* (Environmental Response Family): Genes 69; Pseudogenes 70
83
Top Functional Families in Genes and ΨG on Human Chr. 21 and 22
Genes: Other DNA binding 32; Nucleotide binding 28; Transcription Factor 28; Other Nucleic-acid binding 16; Kinase 14
Pseudogenes: Ribosomal protein 43; Transcription factor 12; Other DNA binding 8; Receptor 8; Kinase 5
  Duplicated: Transcription factor 6; Nucleotide binding 5; Kinase 4; Transferase 4; Receptor 4
  Processed: Ribosomal protein 42; Transcription factor 7; Other DNA binding 7; Receptor 4; Other Nucleic-acid binding 4
Ig* (Environmental Response Family): 70, 69, 70
84
Distribution of processed ΨG roughly matches expectation of random insertions
Constant density between human & worm
Occurrence related to expression: ribosomal proteins
Among ribosomal proteins: proportional selection between subunits
Uniform insertion across chr. 21 & 22
85
Yeast ΨG: Simple Story & Mechanism
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?
86
Yeast ΨG concentrated near telomeres
[Figure: chromosomal positions of Pseudogenes and Genes]
87
Environmental response functions of yeast ΨG
5 most common families in ΨG
Not same as the most common families in genes
[Figure label: Environmental Response Family]
88
Yeast ΨG come from yeast-specific families
Fraction having a non-yeast homolog: ΨG 40%; genes 80%
89
Resurrecting pseudogenes: is it possible?
Idea of "untranslatable intermediates" in protein evolution has been around for a while [Nei, '70; Koch, '72] [Walsh]
Hypothetical example of a flocculin
  Functioning FLO8 causes filamentous growth in most strains [Fink]
  FLO8 disabled in lab strain (S288C)
90
A speculative mechanism for resurrecting yeast ΨG, via [PSI+], perhaps in environmental response
[PSI+] [Lindquist]
  Prion of Sup35p, translation-termination protein
  Causes read-through of stops
  Causes phenotypic diversity, through the expression of new or altered proteins
We find suggestive evidence that PSI resurrects ΨG:
  35 ΨG easily resurrectable with only 1 stop
  Microarrays show some of these are expressed [M Snyder]
  Many involved in environmental response
  Perhaps testable with selection experiments
91
Evolutionary Implications of a Reservoir of Resurrectable ΨG for Creating New Folds
92
Not all folds shared between phylogenetic groups
Evolution of new folds
[Venn diagrams: Eubacteria, Eukaryotes: 4673156; Animals, Plants: 90 20 104]
93
Evolutionary Implications of a Reservoir of Resurrectable ΨG for Creating New Folds
Paradox: going between folds A & Z with all intermediates functional [Koch '72]
Pseudogenes free of constraints of being transcribed & translated
94
ΨG: Summary
1) What are they composed of?
  Intermediate composition between genes & translated intergenic DNA
2) Where are they?
  On chromosomal ends. In all organisms, though reduced to pseudomotifs in the fly.
3) Which type of proteins are they? Why?
  Environmental response proteins. ΨG may be extra parts that can be resurrected. Potential mechanism suggested for yeast involving [PSI+].
4) Do processed & duplicated varieties differ?
  YES. (Duplicated ΨG described above.) Processed ΨG appear to be just randomly inserted from mRNA pool. Hence, they show obvious relationship to mRNA level & intergenic region size.
95
Practical Backdrop to Integrated Gene Annotation & Interpretation of DNA Arrays
137 potential new yeast genes
  Integrated approach: homology search + transposons + microarrays
  Small ORFs & anti-sense to existing ORFs [Snyder]
  [Figure: yeast chromosomes I, II, III, IV, V, VI, VII, VIII]
Human DNA Arrays (ongoing)
  All of chr22 on a chip in ~1kb chunks, probe for expression, TF binding
  Need to have mapped landscape (genes, ΨGs, repeats, SNPs, &c) to design chip & interpret results
96
GeneCensus.org
ORF Query, Alignment Server, Alignment Database, PDB Query, Detailed Tables, Ranks, Trees
PartsList.org
97
Acknowledgements
Predicting Protein Function on a Genomic Scale: Jiang Qian, Ronald Jansen, Amar Drawid, Cyrus Wilson, Hedi Hegyi, Dov Greenbaum, Haiyuan Yu, Jimmy Lin, Ning Lan, Yuval Kluger
Analysis of Pseudogenes: Paul Harrison, Zhaolei Zhang, Nathaniel Echols, Suganthi Balasubramanian, Nicholas Luscombe, Paul Bertone, Ted Johnson, Patrick McGarvey
Other Projects: Yang Liu, Jochen Junker, Vadim Alexandrov, Rajdeep Das, Werner Krebs, Brad Stenger
Collaborators: M Snyder (A Kumar, H Zhu, M Bilgin, C Horack …), S Weissman (Z Lian, S Yamaga…), P Miller (K Cheung), M Schultz, G Montelione et al.
PartsList.org, GeneCensus.org, nesg.org
98
Assessing Function Globally
Known associated pairs have same "cellular role" (according to MIPS, GO, &c)
99
Results on Function Prediction
100
Based on Distributions, Correlation of Established Functional Categories, Computer Clusterings
Correlation:
  Always Significant
  Sometimes Significant (depends on expt.)
  Never Significant
101
Results on Interaction Prediction
102
Protein-Protein Interactions & Expression
[Figure: correlation between selected expression timecourses; Cell Cycle CDC28 expt. (Davis)]
Sets of interactions (from MIPS): strong interactions in permanent complexes, clearly diff.
Pairwise interactions (Uetz et al.)
(all pairs, control)
103
Protein-Protein Interactions & Expression
[Figure: correlation between selected expression timecourses; Cell Cycle CDC28 expt. (Davis)]
Sets of interactions (from MIPS): strong interactions in permanent complexes, clearly diff.
Pairwise interactions (Uetz et al.)
(all pairs, control)