Do not reproduce without permission 1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu Permissions Statement This Presentation is copyright Mark Gerstein, Yale University, Feel free to use images in it with PROPER acknowledgement.

Slide 2: Computational Proteomics: Genome-scale studies of protein function, structure, and evolution
Mark B Gerstein, Yale U
H Hegyi, J Lin, J Qian, N Luscombe, T Johnson, A Drawid, R Jansen, V Alexandrov, M Snyder, A Kumar, H Zhu, D Greenbaum, N Lan, P Harrison, N Echols, S Balasubramanian, P Bertone, Z Zhang, R Das, Y Liu, Y Kluger, H Yu, P Miller, K Cheung, S Weissman
Talk at Harvard

Slide 3: Understand proteins through analyzing populations
[Diagram labels: Structures, Functions, Evolution; Structures, Sequences, Microarrays; Integration of Information (motions, packing, folds)]

Slide 4: The central post-genomic term

Slide 6: The central post-genomic term: Proteome
[Plot: PubMed hits for "Proteome"]

Slide 7: Proteins are central to the 2 major post-genomic challenges
(Initial step: genome sequence & genes)
1. Understanding genes in detail → Predicting protein function on a genomic scale
2. Understanding what's between genes → Analyzing protein fossils

Slide 8: Proteins are central to the 2 major post-genomic challenges: Predicting protein function on a genomic scale

Slide 9: How to predict function for 1000s of proteins?
250 of ~650 known on chr. 22 [Dunham et al.]
>>30K+ proteins in entire human genome (alt. splicing)

Slide 10: How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration

Slide 11: How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
For 1): Compare uncharacterized genome sequences against known sequences in DBs, transferring functional annotation for similar sequences. Issue: the threshold is the major parameter & limitation. Also, look for motifs & sites. [Sternberg, Thornton, Rose, Koonin]
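The annotation-transfer idea on this slide can be sketched in a few lines. This is only an illustration, not the talk's pipeline: it assumes the two sequences are already aligned (gaps written as "-"), uses percent identity as the similarity score, and takes the 40% cutoff that the talk quotes later as its practical threshold. The function names `percent_identity` and `transfer_annotation` are hypothetical.

```python
from typing import Optional


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences carry a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def transfer_annotation(aligned_query: str, aligned_hit: str, hit_function: str,
                        threshold: float = 40.0) -> Optional[str]:
    """Return the hit's function only if identity exceeds the threshold (assumed 40%)."""
    pid = percent_identity(aligned_query, aligned_hit)
    return hit_function if pid > threshold else None
```

With this sketch, a query that is 75% identical to a characterized isomerase inherits the label, while a 25% identical pair returns `None`, which corresponds to the "threshold is the major parameter" issue raised on the slide.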

Slide 12: 1000s of structurally based alignments of structurally and functionally characterized sequences
[Figure labels, Sequence: (Human), 90%, (Chick), 45%, (E coli), (B ster.), 20%, (E coli), (Yeast)]
[Figure labels, Function: Same Exact: (TP Isomerase), (TP Isomerase); Both Class 5 (isom.): (TP Isomerase), (PRA Isomerase), (Xylose Isom.); Different Classes: (Aldolase), (Enolase)]

Slide 13: Relationship of Similarity in Sequence to that in Function
[Plot: x-axis %ID, sequence similarity of pairs of proteins; y-axis % Same Function, percentage of pairs that have the same precise function as defined by Enzyme & FlyBase functional classifications]

Slide 14: Relationship of Similarity in Sequence to that in Function
[Plot: % Same Function vs. %ID]

Slide 15: Relationship of Similarity in Sequence to that in Function
[Plot: % Same Function vs. %ID; region label: Can transfer both Fold & Functional Annotation]

Slide 16: Relationship of Similarity in Sequence to that in Function
[Plot: % Same Function vs. %ID; region labels: Cannot transfer Fold or Functional Annotation ("Twilight Zone"); Can transfer Annotation related to Fold but not Function; Can transfer both Fold & Functional Annotation]

Slide 17: Relationship of Similarity in Sequence to that in Function
[Plot: % Same Function vs. %ID; region labels: Cannot transfer Fold or Functional Annotation ("Twilight Zone"); Can transfer Annotation related to Fold but not Function; Can transfer both Fold & Functional Annotation; Broad v Narrow Similarity]

Slide 18: Caveats: Sequence Divergence of Multidomain Proteins Implies a Practical Threshold of >40%
[Figure labels: (Human), (Chick), (E coli), (B ster.), (E coli), (Yeast), (Rat); Single Domain Sequences; Multidomain Sequences]

Slide 19: Multi-domain proteins have greater divergence in function with sequence
[Plot: % Same Function vs. Sequence Similarity, -log(e-value); label: (Very Close)]

Slide 20: How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
For 2): Structures of ORFs with unknown function; use fold & site similarity to determine function. Rationale for structure prediction. Issue: To what degree does fold determine function? [Kim, Edwards & Arrowsmith, Montelione, Burley, Eisenberg]

Slide 21: Fold-Function Combinations
Many functions on same fold (TIM-barrel)
Different folds with same function (Carbonic Anhydrases, …)

Slide 22: Global View of Fold-Function Combinations
[Matrix figure: 229 Folds vs. 91 Enzymatic Functions plus Non-Enz]

Slide 23: Correlation with Structural Features
[Matrix figure: 229 Folds vs. 91 Enzymatic Functions plus Non-Enz; Architectural Class labels: all-α, all-β, small]

Slide 24: Correlation with Structural Features
[Matrix figure: 229 Folds vs. 91 Enzymatic Functions plus Non-Enz; Architectural Class labels: all-α, all-β, small; Enzyme Class; annotation: Slight Overpopulation]

Slide 25: Global View of Fold-Function Combinations
[Matrix figure: 229 Folds vs. 91 Enzymatic Functions plus Non-Enz; annotation: Sort]

Slide 26: To what degree is fold associated with function?
Folds with multiple functions [Similar results by Thornton]
[Plot: frequency in database of 229 folds vs. number of functions associated with a fold]

Slide 27: How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration

Slide 28: Microarray experiments
Expression Arrays [Brown]
Proteome Chips [Snyder]

Slide 29: Clustering the yeast cell cycle to uncover interacting proteins
Microarray timecourse of 1 ribosomal protein [Brown, Davis]
[Plot: mRNA expression level (ratio) vs. time]

Slide 30: Clustering the yeast cell cycle to uncover interacting proteins
Random relationship from ~18M
[Plot: mRNA expression level (ratio) vs. time]

Slide 31: Clustering the yeast cell cycle to uncover interacting proteins
Close relationship from 18M (2 Interacting Ribosomal Proteins) [Botstein; Church, Vidal]
[Plot: mRNA expression level (ratio) vs. time]

Slide 32: Clustering the yeast cell cycle to uncover interacting proteins
Predict Functional Interaction of Unknown Member of Cluster
[Plot: mRNA expression level (ratio) vs. time]

Slide 33: Local Clustering algorithm identifies further (reasonable) types of expression relationships [Church]
Simultaneous (Traditional Global Correlation)
Inverted
Time-Shifted
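The three relationship types can be illustrated with a small sketch. To be clear about the assumptions: this is not the local clustering algorithm named on the slide. It is a simplified stand-in that scores two equal-length expression profiles by plain Pearson correlation, once as given, once with the sign flipped, and once at a few whole-timepoint lags. The names `pearson` and `classify_relationship`, the maximum shift, and the 0.9 cutoff are all hypothetical choices.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def classify_relationship(x, y, max_shift=2, cutoff=0.9):
    """Label a pair of profiles as simultaneous, inverted, time-shifted, or none."""
    r0 = pearson(x, y)
    label, score = "simultaneous", r0
    if -r0 > score:  # strong negative correlation counts as an inverted relationship
        label, score = "inverted", -r0
    for s in range(1, max_shift + 1):
        # Try y lagging x and x lagging y by s timepoints.
        for a, b in ((x[s:], y[:-s]), (x[:-s], y[s:])):
            r = pearson(a, b)
            if r > score:
                label, score = "time-shifted", r
    if score < cutoff:
        return "none", score
    return label, score
```

For a profile `x = [0, 2, 5, 3, 1, 0, 0, 1]`, the sketch labels a rescaled copy as simultaneous, the negated copy as inverted, and a copy delayed by one timepoint as time-shifted.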

Slide 34: Examples of inverted relationships
Documented:
YME1: mito. protease involved in cplx. assembly
YNT20: known suppressor of YME1
Suggestive:
PUT2: involved in Pro degradation
SER3: involved in Ser synthesis
[Plot: Expr. Ratio vs. Time]

Slide 35: Examples of time-shifted relationships
Suggestive:
ARP3: in actin remodelling cplx.
ARC35: in same cplx. (required late in cell cycle)
Predicted:
J0544: unknown function
MRPL19: mito. ribosome
[Plot: Expr. Ratio vs. Time]

Slide 37: Expression Correlations Segment Large Replication Complex into Component Parts
[Correlation matrix; rows and columns: MCM3 MCM6 CDC47 MCM2 CDC46 CDC54 DPB3 CDC45 DPB2 CDC2 CDC7 POL2 HYS2 POL32 DBF4 ORC2 ORC6 ORC5 ORC4 ORC3 ORC1]

Slide 38: Expression Correlations Segment Large Replication Complex into Component Parts
[Correlation matrix; rows and columns: MCM3 MCM6 CDC47 MCM2 CDC46 CDC54 DPB3 CDC45 DPB2 CDC2 CDC7 POL2 HYS2 POL32 DBF4 ORC2 ORC6 ORC5 ORC4 ORC3 ORC1; block labels: MCMs prots., ORC, Polym. ε & δ]

Slide 39: Range of Expression Correlations within Complexes
Replication Cplx: Overall .05; ORC .19; MCMs .75; Pol. ε .45; Pol. δ .75
Ribosome: Overall .80; Large .80; Small .81
Proteasome: Overall .43; 20S .50; 19S .51

Slide 40: Permanent v. Transient Complexes

Slide 41: Global Network of 3 Different Types of Relationships
~313K significant relationships from ~18M possible

Slide 42: Global Network of 3 Different Types of Relationships
Simultaneous: 188K
Inverted: 63K
Shifted: 67K
~313K significant relationships from ~18M possible

Slide 43: Globally, how well do expression relationships predict known interactions?
Coverage of the 8250 known interactions in complexes found, and enrichment compared to randomized expression relationships:
Random: ~2% (313K/18M), 1x
CC: 42%, 24x
CC: 313K relationships from ~18M possible, from clustering cell-cycle expt.

Slide 44: Combining Expression Data Sets Increases Coverage & Decreases Noise
Coverage of the 8250 known interactions in complexes found, and enrichment compared to randomized expression relationships:
KO: 34%, 22x
KO: 278K relationships from clustering knock-out profiles [Rosetta]

Slide 45: Combining Expression Data Sets Increases Coverage & Decreases Noise
Coverage of the 8250 known interactions in complexes found, and enrichment compared to randomized expression relationships:
CC: 42%, 24x
KO: 34%, 22x
KO v CC: 55%, 111x
KO ^ CC: 21%, 254x
CC: 313K relationships from ~18M possible, from clustering cell-cycle expt.
KO: 278K relationships from clustering knock-out profiles [Rosetta]
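The enrichment figures for the single data sets follow from the counts on these slides. The sketch below assumes, as the "(313K/18M)" note for the random case suggests, that a random set of the same size would cover known interactions in proportion to the fraction of all possible pairs it contains. The function name `enrichment` is hypothetical, and only the CC and KO rows are recomputed, because relationship counts for the combined sets are not given.

```python
def enrichment(coverage: float, n_predicted: float, n_possible: float) -> float:
    """Fold enrichment: observed coverage divided by the coverage expected at random."""
    random_coverage = n_predicted / n_possible  # e.g. 313K / 18M, roughly 2%
    return coverage / random_coverage


# Counts and coverages as quoted on the slides.
cc = enrichment(0.42, 313_000, 18_000_000)  # cell-cycle clustering
ko = enrichment(0.34, 278_000, 18_000_000)  # knock-out profile clustering
print(f"CC: {cc:.1f}x, KO: {ko:.1f}x")
```

Both values round to the enrichments quoted for CC and KO.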

Slide 46: How to predict function for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
For 4): Obviously, integration of orthogonal info. is good, but how to achieve it? And what are the issues? An example: subcellular localization in yeast.

Slide 47: Subcellular Localization, a standardized aspect of function
[Cell diagram compartments: Nucleus, Membrane, Extracellular (secreted), ER, Cytoplasm, Mitochondria, Golgi]

Slide 48: "Traditionally" subcellular localization is "predicted" by sequence patterns
[Cell diagram compartments: Nucleus, Membrane, Extracellular (secreted), ER, Cytoplasm, Mitochondria, Golgi; sequence-pattern labels: NLS, TM-helix, Sig. Seq., HDEL, Import Sig.]

Slide 49: Subcellular localization is associated with the level of gene expression
[Cell diagram compartments: Nucleus, Membrane, Extracellular (secreted), ER, Cytoplasm, Mitochondria, Golgi; annotated with expression level in copies/cell]

Slide 50: Combine Expression Information & Sequence Patterns to Predict Localization
[Cell diagram compartments: Nucleus, Membrane, Extracellular (secreted), ER, Cytoplasm, Mitochondria, Golgi; sequence-pattern labels: NLS, TM-helix, Sig. Seq., HDEL, Import Sig.; annotated with expression level in copies/cell]

Slide 51: Issues in Combining Many Features
Total of 30 diverse features (also including essentiality, coiled-coils, expression fluc., & obscure seq. patterns)
How to standardize features? How to weight them?
[Cell diagram compartments: Nucleus, Membrane, Extracellular (secreted), ER, Mitochondria, Golgi; sequence-pattern labels: NLS, TM-helix, Sig. Seq., HDEL, Import Sig.]

Slide 52: Bayesian System for Localizing Proteins
Everything expressed in standardized probabilistic terms (features as freq. in training set)
Sequentially apply features to refine the assumed prior estimate using Bayes' Rule (Feature x Prior / Normalization)
Prior → Feature 1: NLS (# NLS) → New Estimate → Feature 2: High Expr. → Better Estimate → Feature 3: Is Essential? → Final Estimate
A final estimate that naturally weights the features comes out
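The sequential update on this slide (Feature x Prior / Normalization) can be sketched as follows. All numbers are invented for illustration: the three compartments, the prior, and the per-compartment feature frequencies are assumptions, not values from the talk, and `bayes_update` and `localize` are hypothetical names. The sketch also treats the features as independent given the compartment.

```python
def bayes_update(prior, likelihood):
    """One Bayes step: posterior[c] = likelihood[c] * prior[c] / normalization."""
    unnormalized = {c: likelihood[c] * p for c, p in prior.items()}
    z = sum(unnormalized.values())
    if z == 0:
        raise ValueError("feature has zero probability in every compartment")
    return {c: v / z for c, v in unnormalized.items()}


def localize(prior, feature_likelihoods):
    """Apply the features one after another, each refining the current estimate."""
    estimate = dict(prior)
    for likelihood in feature_likelihoods:
        estimate = bayes_update(estimate, likelihood)
    return estimate


# Hypothetical prior over compartments (e.g. overall compartment populations).
prior = {"nucleus": 0.3, "cytoplasm": 0.5, "mitochondria": 0.2}
# Hypothetical frequencies of each observed feature per compartment in a training set.
has_nls = {"nucleus": 0.6, "cytoplasm": 0.1, "mitochondria": 0.1}
high_expr = {"nucleus": 0.2, "cytoplasm": 0.5, "mitochondria": 0.3}

posterior = localize(prior, [has_nls, high_expr])
```

In this toy case the NLS feature moves most of the probability to the nucleus, and the expression feature pulls some of it back toward the cytoplasm, so the weighting of features falls out of the frequencies rather than being set by hand.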

Slide 53: Results on Testing Data
7-fold cross-validation → training & test sets
Overall compartment population → 96% accuracy
[Chart compartments: Nuc., Cyt., ER, TM, Mito.]

Slide 54: Extrapolation to Compartment Populations of Whole Yeast Genome
~3300 Known Localizations from Existing DB
+ Expt. Localizations by Transposon Tagging & Direct Overexpression [Snyder]
+ Predictions (from Bayesian System)

Slide 55: How to predict function for 1000s of proteins?
1) "Traditional" sequence patterns
o Limitation: tight threshold
o Developed 40% threshold for sequence comparison
2) Via fold similarity (structural genomics)
o Limitation: multifunctionality on a fold, so weak relationship
o Measured extent of this in current DB
3) Clustering a microarray experiment
o Limitation: suggestive relationships but not yet predictive
o Local clustering found ~130K relationships beyond ~180K simultaneous ones
4) Data integration
o Limitation: power is obvious but complicated to achieve
o Increased power. Works for localization of all proteins in yeast.

Slide 56: Proteins are central to the 2 major post-genomic challenges
(Initial step: genome sequence & genes)
1. Understanding genes in detail → Predicting protein function on a genomic scale
2. Understanding what's between genes → Analyzing protein fossils

Slide 57: Proteins are central to the 2 major post-genomic challenges: Analyzing protein fossils

Slide 58: Pseudogenes (ΨG) as Disabled Homologies
S/T Protein Phosphatase PP1 (C-term): …SRILCMHGGLSPHLQTLDQLRQLPRPQDPPNPSIGIDLLWADPDQWVKGWQAN TRGVSYVFGQDVVADVCSRLDIDLVARAHQVVQDGYEFFASKKMVTIFSAPHYC GQFDNSAATMKVDENMVCTFVMYKPTPKSMRRG*
Pseudogenic fragment: TKRTSNGFGQDVVVDLFSILDSGLVARAHX VLQDIFEFFASKKMVTIFS # APHSPHSAPH YCAQFDNSAATVKV
[Figure labels: Worm Genome, chromosomes I, II, III, IV, V, X; Most Multiply Disabled; #]

Slide 59: Types of ΨG: Duplicated & Processed [Heidmann]
[Diagram labels: Original Gene; Duplicated Gene; Duplicated ΨG; Spliced mRNA (AAAAAA>); Processed ΨG; Processed ΨG with disablements]

Slide 60: Large-scale Assignment of Pseudogenes (*Chr only)

Slide 61: ΨG Calculations
Large basic calculation
o "Blasting" ~1M fragments against protein DB
o 50 CPU days for worm
o Parallel computing + large DBs
Integrating heterogeneous, dynamically changing annotation
o Changing sequences, gene predictions, repeats
[Diagram labels: Protein DB, chromosome, ΨG; Sequence 1, Sequence 2, Sequence 3; Genes A, Repeats 1; Genes B, Repeats 2]

Slide 62: ΨG: Questions
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?

Slide 63: Worm ΨG: Lots! Good Stats
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?

Slide 64: Amino-acid composition of Pseudogenes is midway between Genes and translated Intergenic DNA
[Plot, Worm: frequency vs. amino acid (sorted)]

Slide 65: Amino-acid composition of Pseudogenes is midway between Genes and translated Intergenic DNA in many genomes
[Plot panels: Worm, Yeast, Fly, Human]

Slide 66: Midway composition also applies to codons

Slide 67: Pseudogene distribution on worm chromosomes: On Ends
Pseudogenes elevated at ends of worm chr I
Genes elevated in middle

Slide 68: Pseudogene distribution on worm chromosomes: On Ends
~50% ΨG in terminal 3Mb vs ~30% G
[Figure labels: ΨG -- G 16% (min); ΨG -- G 29% (max)]

Slide 69: Default: # ΨG ∝ # genes, in a family
[Plot: # pseudogenes in family vs. # genes in family; point label: RT]

Slide 70: Completely Dead Families in Worm Genome
Extinction?
[Plot label: 0]

Slide 71: Completely Dead Families in Worm Genome
Extinction? Horz. Transfer?
[Plot label: 0]

Slide 72: Completely Dead Families in Worm Genome
Extinction? Horz. Transfer? Contamination?
[Plot label: 0]

Slide 73: Worm ΨG families: chemoreceptors & transposon functions
[Chart label: Environmental Response Family]

Slide 74: Worm ΨG families: Unique or highly expanded relative to fly
[Chart label: Environmental Response Family]

Slide 75: Common worm Pseudofolds [Scop, Murzin]
[Table columns: ΨG rank, Genes rank, Genes rank]

Slide 77: Fly ΨG: Broken Fossils
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?

Slide 78: Fly fossils are "broken up" by deletions
Explanation for having only 5% of ΨG of worm is high rate of small genomic deletions (~ bp) [Petrov, Hartl]
→ pseudomotifs

Slide 79: Same density of pseudomotifs in fly and worm
Analyze occurrence of "pseudomotifs" (ancient, broken-up fossils) in intergenic regions relative to statistical expectation
o 1329 Prosite Motifs (e.g. ZnF C-x(2,4)-C-x(3)-Φ-x(8)-H-x(3,5)-H & Tubulin)
o TM-helices
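Scanning translated intergenic sequence for a Prosite-style motif such as the zinc-finger example can be sketched with a regular expression. Several things here are assumptions: the slide's hydrophobic position (written Φ here) is expanded to the residue set `LIVMFYWC`, the converter handles only the small subset of Prosite syntax used in this one pattern, and the names `prosite_to_regex` and `find_motif` are hypothetical. The statistical-expectation step mentioned on the slide is not modeled.

```python
import re

# Assumed expansion of the hydrophobic placeholder; not stated on the slide.
HYDROPHOBIC = "LIVMFYWC"


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple Prosite-style pattern (residues, x, x(n), x(n,m)) to a regex."""
    tokens = []
    for element in pattern.split("-"):
        element = element.strip()
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element {element!r}")
        residue, low, high = m.groups()
        if residue == "x":
            token = "."                     # any residue
        elif residue == "Φ":
            token = f"[{HYDROPHOBIC}]"      # assumed hydrophobic set
        else:
            token = re.escape(residue)      # literal residue such as C or H
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        tokens.append(token)
    return "".join(tokens)


def find_motif(pattern: str, protein: str):
    """Start positions (0-based) of non-overlapping matches in a protein string."""
    return [m.start() for m in re.finditer(prosite_to_regex(pattern), protein)]


ZNF = "C-x(2,4)-C-x(3)-Φ-x(8)-H-x(3,5)-H"
hits = find_motif(ZNF, "MKCAACAAALAAAAAAAAHAAAHGG")
```

The toy sequence is constructed to contain exactly one match, starting at the first cysteine.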

Slide 80: Human ΨG: Duplicated v. Processed
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?
(*Chr only) Expansion of genome papers [Venter et al., Dunham et al., Lander et al.]

Slide 81: Top Functional Families in Genes and ΨG on Human Chr. 21 and 22
Pseudogenes: Ribosomal protein 43; Transcription factor 12; Other DNA binding 8; Receptor 8; Kinase 5
Environmental Response Family: Ig* 70

Slide 82: Top Functional Families in Genes and ΨG on Human Chr. 21 and 22
Genes: Other DNA binding 32; Nucleotide binding 28; Transcription factor 28; Other nucleic-acid binding 16; Kinase 14
Pseudogenes: Ribosomal protein 43; Transcription factor 12; Other DNA binding 8; Receptor 8; Kinase 5
Environmental Response Family: Ig* 69, Ig* 70

Slide 83: Top Functional Families in Genes and ΨG on Human Chr. 21 and 22
Genes: Other DNA binding 32; Nucleotide binding 28; Transcription factor 28; Other nucleic-acid binding 16; Kinase 14
Pseudogenes: Ribosomal protein 43; Transcription factor 12; Other DNA binding 8; Receptor 8; Kinase 5
Duplicated: Transcription factor 6; Nucleotide binding 5; Kinase 4; Transferase 4; Receptor 4
Processed: Ribosomal protein 42; Transcription factor 7; Other DNA binding 7; Receptor 4; Other nucleic-acid binding 4
Environmental Response Family: Ig* 70, Ig* 69, Ig* 70

Slide 84: Distribution of processed ΨG roughly matches expectation of random insertions
Constant density between human & worm
Occurrence related to expression → ribosomal proteins
Among ribosomal proteins:
o Proportional selection between subunits
o Uniform insertion across 21 &

Slide 85: Yeast ΨG: Simple Story & Mechanism
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?

Slide 86: Yeast ΨG concentrated near telomeres
[Chromosome diagram tracks: Pseudogenes, Genes]

Slide 87: Environmental response functions of yeast ΨG
5 most common families in ΨG
Not same as the most common families in genes
[Chart label: Environmental Response Family]

Yeast ΨG come from yeast-specific families
Fraction having a non-yeast homolog: ΨG 40%; genes 80%

Resurrecting pseudogenes: is it possible?
Hypothetical example of a flocculin
Idea of "untranslatable intermediates" in protein evolution has been around for a while [Nei, '70; Koch, '72] [Walsh]
Functioning FLO8 causes filamentous growth in most strains [Fink]
FLO8 disabled in lab strain (S288C)

A speculative mechanism for resurrecting yeast ΨG, via [PSI+], perhaps in environmental response
[PSI+] [Lindquist]
Prion of Sup35p, translation-termination protein
Causes read-through of stops
Causes phenotypic diversity, through the expression of new or altered proteins
We find suggestive evidence that PSI resurrects ΨG:
35 ΨG easily resurrectable with only 1 stop
Microarrays show some of these are expressed [M Snyder]
Many involved in environmental response
Perhaps testable with selection experiments
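The "only 1 stop" criterion can be sketched in code. This is a minimal illustration, assuming a pseudogene is given as a DNA string already in the reading frame of its parent gene; the function names are hypothetical, and frameshifts (another common disablement) are not handled. It is not the pipeline used for the slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(dna):
    """Return codon indices of in-frame stop codons, ignoring the final codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3          # drop any trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


def single_stop_candidate(dna):
    """True if exactly one premature in-frame stop disrupts the sequence."""
    return len(internal_stop_positions(dna)) == 1


# Toy sequences for illustration only.
print(internal_stop_positions("ATGAAATAGGGGTAA"))  # one premature TAG at codon 2
print(single_stop_candidate("ATGAAAGGGTAA"))       # intact reading frame
```

A sequence flagged by `single_stop_candidate` is one where read-through of a single stop, as [PSI+] is described as causing, would in principle yield a full-length product.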

Evolutionary Implications of a Reservoir of Resurrectable ΨG for Creating New Folds

Not all folds shared between phylogenetic groups → Evolution of new folds
[figure labels: Eubacteria, Eukaryotes, Animals, Plants]

Evolutionary Implications of a Reservoir of Resurrectable ΨG for Creating New Folds
Paradox: going between folds A & Z with all intermediates functional [Koch '72]
Pseudogenes free of constraints of being transcribed & translated

ΨG: Summary
1) What are they composed of? Intermediate composition between genes & translated intergenic DNA.
2) Where are they? On chromosomal ends. In all organisms, though reduced to pseudomotifs in the fly.
3) Which type of proteins are they? Why? Environmental response proteins. ΨG may be extra parts that can be resurrected. Potential mechanism suggested for yeast involving [PSI+].
4) Do processed & duplicated varieties differ? YES. (Duplicated ΨG described above.) Processed ΨG appear to be just randomly inserted from mRNA pool. Hence, they show obvious relationship to mRNA level & intergenic region size.

Practical Backdrop to Integrated Gene Annotation & Interpretation of DNA Arrays
137 potential new yeast genes
Integrated approach: homology search + transposons + microarrays
Small ORFs & anti-sense to existing ORFs [Snyder]
Human DNA Arrays (ongoing)
All of chr22 on a chip in ~1kb chunks, probe for expression, TF binding
Need to have mapped landscape (genes, ΨGs, repeats, SNPs, &c) to design chip & interpret results
[figure labels: I, II, III, IV, V, VI, VII, VIII]

GeneCensus.org; PartsList.org
[figure labels: ORF Query, Alignment Server, Alignment Database, PDB Query, Detailed Tables, Ranks, Trees]

Acknowledgements
Predicting Protein Function on a Genomic Scale: Jiang Qian, Ronald Jansen, Amar Drawid, Cyrus Wilson, Hedi Hegyi, Dov Greenbaum, Haiyuan Yu, Jimmy Lin, Ning Lan, Yuval Kluger
Analysis of Pseudogenes: Paul Harrison, Zhaolei Zhang, Nathaniel Echols, Suganthi Balasubramanian, Nicholas Luscombe, Paul Bertone, Ted Johnson, Patrick McGarvey
Other Projects: Yang Liu, Jochen Junker, Vadim Alexandrov, Rajdeep Das, Werner Krebs, Brad Stenger
Collaborators: M Snyder (A Kumar, H Zhu, M Bilgin, C Horack …); S Weissmann (Z Lian, S Yamaga…); P Miller (K Cheung), M Schultz; G Montelione et al.
PartsList.org, GeneCensus.org, nesg.org

Assessing Function Globally
Known associated pairs have same "cellular role" (according to MIPS, GO, &c)
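One way to make this assessment concrete is to score what fraction of predicted associated pairs share at least one annotated role. The sketch below is only an illustration under that assumption: the function name, the gene identifiers and the role labels are hypothetical, and pairs with an unannotated member are simply skipped.

```python
def same_role_fraction(pairs, roles):
    """Fraction of annotated pairs whose members share at least one role.

    `pairs` is an iterable of (gene_a, gene_b); `roles` maps a gene to a set
    of role labels. Pairs with an unannotated member are left out of the score.
    """
    scored = [(a, b) for a, b in pairs if a in roles and b in roles]
    if not scored:
        return None                      # nothing could be assessed
    hits = sum(1 for a, b in scored if roles[a] & roles[b])
    return hits / len(scored)


# Toy annotation for illustration only (hypothetical genes and roles).
roles = {
    "g1": {"cell cycle"},
    "g2": {"cell cycle", "transport"},
    "g3": {"metabolism"},
}
pairs = [("g1", "g2"), ("g1", "g3"), ("g2", "g4")]   # g4 is unannotated
print(same_role_fraction(pairs, roles))
```

Comparing this fraction against the same score for randomly chosen pairs gives a baseline for how informative the predicted associations are.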

Results on Function Prediction

Based on Distributions, Correlation of Established Functional Categories, Computer Clusterings
Correlation: Always Significant; Sometimes Significant (depends on expt.); Never Significant

Results on Interaction Prediction

Protein-Protein Interactions & Expression
[figure: between selected expression timecourses; Cell Cycle CDC28 expt. (Davis)]
(all pairs, control)
Pairwise interactions (Uetz et al.)
Sets of interactions (from MIPS) (strong interactions in permanent complexes, clearly diff.)
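The comparison on this slide can be sketched as follows, assuming the similarity between two expression timecourses is measured with the Pearson correlation coefficient (the slide does not name the measure, so this is an assumption). Protein names and profile values are hypothetical toy data, and constant profiles, which would give a zero denominator, are not handled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression timecourses."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def mean_pair_correlation(profiles, pairs):
    """Average correlation over the given protein pairs."""
    values = [pearson(profiles[a], profiles[b]) for a, b in pairs]
    return sum(values) / len(values)


# Toy timecourses for illustration only (NOT data from the slides).
profiles = {"p1": [1, 2, 3, 4], "p2": [2, 4, 6, 8], "p3": [4, 3, 2, 1]}
interacting = [("p1", "p2")]                          # hypothetical known interaction
all_pairs = list(combinations(sorted(profiles), 2))   # control: every pair

interacting_mean = mean_pair_correlation(profiles, interacting)
control_mean = mean_pair_correlation(profiles, all_pairs)
print(interacting_mean, control_mean)
```

In practice one would compare the full distributions for interacting pairs and for the all-pairs control, not just the means.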
