P-POD: The Princeton Protein Orthology Database. Literature Discussion, Tim Hulsen, 2008-05-08.


P-POD The Princeton Protein Orthology Database Literature Discussion Tim Hulsen

6/3/2015 |P-POD |2
P-POD - Manuscript
The Princeton Protein Orthology Database (P-POD): a comparative genomics analysis tool for biologists
Heinicke S 1,*, Livstone MS 1,*, Lu C 1,*, Oughtred R 1,*, Kang F 1, Angiuoli SV 2,3, White O 2, Botstein D 1, Dolinski K 1
PLoS ONE Aug 22; 2(1): e766 PubMed ID
1 Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
2 The Institute for Genomic Research, Rockville, Maryland, United States of America
3 Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
* These authors contributed equally to this work

P-POD - Introduction
Existing: many biological databases provide comparative genomics information and tools
None of these combines results from multiple comparative genomics methods with manually curated information from the literature
-> P-POD, the Princeton Protein Orthology Database:
  – Visualizes phylogenetic relationships among predicted orthologs
  – Shows the orthologs in a wider evolutionary context
  – Contains experimental results manually collected from the literature, which can be compared to the computational analyses
  – Links to relevant human disease and gene information via OMIM, model organism databases, and sequence databases

P-POD – Ortholog methods
Orthology is determined using OrthoMCL:
  – Can be run on multiple species at once
  – One of the better-performing algorithms in terms of sensitivity and specificity (Alexeyenko et al., 2006 and Chen et al., 2007)
Evolutionary context is determined using Jaccard:
  – Clustering algorithm to find related proteins
  – Larger groups than just orthologs
  – Manuscript in preparation
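The Jaccard method used by P-POD is described only as "manuscript in preparation", so the following is a sketch of the general idea of Jaccard-coefficient clustering and not the P-POD implementation. It assumes each protein has a set of similarity hits (for example from BLASTP), links two proteins when the Jaccard coefficient of their hit sets reaches a threshold, and reports the connected components as families. The protein names, hit sets, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(hits, threshold=0.5):
    """Single-linkage grouping of proteins whose hit sets overlap strongly."""
    parent = {p: p for p in hits}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in combinations(hits, 2):
        if jaccard(hits[p], hits[q]) >= threshold:
            parent[find(p)] = find(q)

    clusters = {}
    for p in hits:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical hit sets; each protein is included in its own set.
hits = {
    "tubA_yeast": {"tubA_yeast", "tubA_human", "tubB_yeast"},
    "tubA_human": {"tubA_yeast", "tubA_human", "tubB_yeast"},
    "tubB_yeast": {"tubA_yeast", "tubA_human", "tubB_yeast", "tubB_human"},
    "tubB_human": {"tubB_yeast", "tubB_human"},
    "actin_yeast": {"actin_yeast"},
}

families = jaccard_clusters(hits)
print(families)
```

Because linkage is transitive, a protein can end up in a family with members it shares few hits with directly, which is one way such a method could yield larger groups than strict orthology.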

P-POD – Covered species
P-POD contains 8 species:
  – Plasmodium falciparum
  – Homo sapiens
  – Drosophila melanogaster
  – Mus musculus
  – Arabidopsis thaliana
  – Caenorhabditis elegans
  – Danio rerio
  – Saccharomyces cerevisiae
-> Most widely studied organisms, from a wide evolutionary range

P-POD – Source Species Databases

P-POD – Supported identifiers
Organism | Source database | Valid gene/protein identifier(s) | Examples
P. falciparum | PlasmoDB | PlasmoDB ID | PF08_0034
H. sapiens | ENSEMBL | ENSEMBL peptide ID, peptide name | ENSP…, CDK2
D. melanogaster | FlyBase | FlyBase ID | CG17520-PA, CkIIalpha-PA
M. musculus | ENSEMBL | ENSEMBL peptide ID | ENSMUSP…
A. thaliana | TAIR | TAIR identifier or gene name | AT1G…, PAB4
C. elegans | WormBase | WormBase identifier or gene name | C09G4.1, dbr-1
D. rerio | ENSEMBL | ENSEMBL peptide ID, ZFIN ID | ENSDARP…, ZDB-GENE…
S. cerevisiae | SGD | ORF name or gene name | YNL098C, DPM1
+ OMIM IDs
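As an illustration of how a search box might route these identifier types, the sketch below guesses the source database from the shape of an identifier. It is not P-POD code. The regular expressions are assumptions inferred from the example identifiers on this slide (several of which are truncated in this transcript, so digit counts are left open), and plain gene names such as CDK2 or DPM1 cannot be classified this way.

```python
import re

# Hypothetical patterns, inferred only from the example identifiers on the slide.
PATTERNS = [
    ("PlasmoDB", re.compile(r"^PF\d{2}_\d{4}$")),
    ("ENSEMBL (H. sapiens)", re.compile(r"^ENSP\d+$")),
    ("ENSEMBL (M. musculus)", re.compile(r"^ENSMUSP\d+$")),
    ("ENSEMBL (D. rerio)", re.compile(r"^ENSDARP\d+$")),
    ("ZFIN", re.compile(r"^ZDB-GENE-\d+-\d+$")),
    ("FlyBase", re.compile(r"^CG\d+-P[A-Z]$")),
    ("TAIR", re.compile(r"^AT[1-5CM]G\d+$")),
    ("SGD", re.compile(r"^Y[A-P][LR]\d{3}[WC]$")),
    ("WormBase", re.compile(r"^[A-Z0-9]+\.\d+$")),
]


def guess_source(identifier):
    """Return the first database whose pattern matches, or None."""
    for source, pattern in PATTERNS:
        if pattern.match(identifier):
            return source
    return None


for example in ["PF08_0034", "CG17520-PA", "C09G4.1", "YNL098C", "CDK2"]:
    print(example, "->", guess_source(example))
```

A real lookup would more likely check identifiers against the database itself; pattern matching is only a cheap first guess.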

P-POD – Orthology and clustering numbers
25,271 OrthoMCL families
15,050 Jaccard Clustering families
165,970 proteins (154,736 OrthoMCL and 152,799 Jaccard)
984 families containing proteins in all species ('omnipresent')
112 families with exactly one protein in each of the 8 species, involved in core biological processes such as:
  – Translation
  – Transport
  – Cell cycle regulation
  – Cytoskeleton organization
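The two family categories on this slide can be made concrete with a small sketch. It assumes a family is represented as a list of (species, protein) pairs; the species abbreviations and protein names are placeholders, not P-POD data.

```python
from collections import Counter

# The eight species covered by P-POD (abbreviations are this sketch's own).
SPECIES = ("Pfal", "Hsap", "Dmel", "Mmus", "Atha", "Cele", "Drer", "Scer")


def classify_family(members, species=SPECIES):
    """Return (omnipresent, one_to_one) for a list of (species, protein) pairs."""
    counts = Counter(sp for sp, _ in members)
    # Omnipresent: at least one protein from every species.
    omnipresent = all(counts[sp] >= 1 for sp in species)
    # One-to-one: exactly one protein from every species and nothing else.
    one_to_one = all(counts[sp] == 1 for sp in species) and len(members) == len(species)
    return omnipresent, one_to_one


# Hypothetical families with placeholder protein names.
core = [(sp, f"{sp}_p1") for sp in SPECIES]
expanded = core + [("Hsap", "Hsap_p2")]
partial = [("Hsap", "Hsap_p3"), ("Mmus", "Mmus_p3")]

for name, family in [("core", core), ("expanded", expanded), ("partial", partial)]:
    print(name, classify_family(family))
```

Every one-to-one family is also omnipresent, which is consistent with the 112 families being a subset of the 984.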

P-POD – Proteins in families, and orphans
Relatively low percentages of orphans (<=13%, except for S. cerevisiae and P. falciparum)
These numbers confirm the high conservation of proteins across eukaryotes, with the notable exception of the Plasmodium outlier
Yeast: the complete protein set was used, including 800 ORFs flagged as "Dubious" by SGD. If these are excluded, the percentage of orphans drops to 20%
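The effect of excluding dubious ORFs can be sketched as follows, taking an orphan to be a protein that belongs to no family. The function name and the toy protein sets are hypothetical and do not reproduce the yeast figures.

```python
def orphan_percentage(proteins, in_family, excluded=frozenset()):
    """Percentage of proteins, after exclusions, that belong to no family."""
    kept = [p for p in proteins if p not in excluded]
    orphans = [p for p in kept if p not in in_family]
    return 100.0 * len(orphans) / len(kept)


# Toy data: ten proteins, six in families, two of the four orphans dubious.
proteins = [f"p{i}" for i in range(1, 11)]
in_family = {f"p{i}" for i in range(1, 7)}
dubious = {"p9", "p10"}

print(orphan_percentage(proteins, in_family))
print(orphan_percentage(proteins, in_family, excluded=dubious))
```

Removing orphans that are unlikely to be real proteins shrinks both the numerator and the denominator, so the percentage falls whenever the excluded set is mostly orphans.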

P-POD – Compared to other orthology databases

P-POD - Pipeline

P-POD – Pipeline Components
[4] Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189
[5] Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680
[29] Samuel Lattimore B, van Dongen S, Crabbe MJ (2005) GeneMCL in microarray analysis. Comput Biol Chem 29: 354–359

P-POD – The Database
P-POD is built on the Generic Model Organism Database (GMOD) package, running on PostgreSQL
GMOD is a collection of open source software tools for creating and managing genome-scale biological databases. It can be used to create a small laboratory database of genome annotations or a large web-accessible community database
GMOD tools are in use at many large and small community databases
Other popular GMOD tools:
  – Apollo (genome annotation editor)
  – GBrowse (genome annotation viewer)
  – CMap (comparative map viewer)
  – Sybil (comparative genome viewer)
  – Chado (biological database schema)
  – BioMart (data mining system)

P-POD - Web Interface (1)
The web interface allows users to search and browse the data in several ways
Results can be queried by various peptide identifiers or gene names
Searches generate result pages that contain:
  – a hyperlinked phylogenetic tree of predicted orthologs generated by OrthoMCL, or of more distantly related proteins generated by Jaccard clustering
  – a list of diseases and genes associated with the human ortholog(s), as documented in OMIM
  – a manually curated list of papers with cross-complementation experiments involving the yeast ortholog(s), from SGD
  – a downloadable ClustalW alignment of family members
Web address:
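A downloaded alignment could be read with a few lines of code. The sketch below is a minimal parser for the standard interleaved CLUSTAL text format, assuming that is what the download contains; the sequence names and residues in the example are invented.

```python
def parse_clustal(text):
    """Parse interleaved CLUSTAL text into a {name: aligned_sequence} dict."""
    sequences = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and conservation lines (leading spaces).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        name, chunk = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


# Invented two-sequence alignment in two blocks.
example = """CLUSTAL W (1.83) multiple sequence alignment

protA           MQGN-EYK
protB           MENFQ-VE
                *    .

protA           LVVV
protB           KIGE
"""

alignment = parse_clustal(example)
for name, seq in alignment.items():
    print(name, seq)
```

For anything beyond a quick look, an established parser such as the one in Biopython would be a safer choice than a hand-written one.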

P-POD – Web Interface (2)
Screenshot labels: OrthoMCL, OMIM, CLUSTALW, SGD Lit., INPUT

P-POD – Web Interface (3)
Screenshot labels: SGD Lit., CLUSTALW, JACCARD

P-POD – Comparison of methods
The orthology/clustering methods OrthoMCL and Jaccard can be compared using P-POD
Jaccard is far more inclusive than OrthoMCL
Shown at the right: the OrthoMCL family of the alpha tubulins. It contains only the alpha tubulins, while the Jaccard family contains the alpha, beta, and gamma tubulins
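The inclusiveness relationship in the tubulin example can be expressed as a subset check. In this sketch the family names and memberships are illustrative (using yeast and human tubulin gene names) and are not taken from the P-POD database.

```python
def containing_families(ortho_family, jaccard_families):
    """Names of the Jaccard families that fully contain an OrthoMCL family."""
    return [
        name
        for name, members in jaccard_families.items()
        if ortho_family <= members
    ]


# Illustrative memberships only.
ortho_alpha = {"TUB1", "TUB3", "TUBA1A"}  # alpha tubulins
jaccard_families = {
    "tubulins": {"TUB1", "TUB3", "TUBA1A", "TUB2", "TUBB", "TUB4", "TUBG1"},
    "actins": {"ACT1", "ACTB"},
}

print(containing_families(ortho_alpha, jaccard_families))
extra = jaccard_families["tubulins"] - ortho_alpha
print(sorted(extra))
```

The second printout lists the members that appear only in the broader family, here the beta and gamma tubulins.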

P-POD – Discussion (1)
P-POD shows direct orthology (OrthoMCL) and broader evolutionary clustering (Jaccard)
P-POD uses a generic, modular database schema (GMOD) in combination with a freely available database system (PostgreSQL)
P-POD provides experimental evidence of conservation curated from the primary literature
Three sets of users:
  – Molecular biologists, who query the database over the web to browse orthology data for their favorite proteins
  – Model organism database developers, who can quickly provide comparative genomics tools for their species of interest by implementing the P-POD system
  – Computational biologists developing novel comparative genomics algorithms, who can use the curated information and the computational data from other methods to assess their approach

P-POD – Discussion (2)
P-POD can be downloaded in its entirety for installation on one's own system
Software developers can use the P-POD database infrastructure when developing their own comparative genomics resources and database tools

P-POD – Future plans
Provide regular updates to the data contained within the database
Add new features to the web interface
Expand the amount of data stored within the database
Provide curated literature describing experimental confirmation of orthology
Include literature from species other than S. cerevisiae
As more refined methods for automatic detection of orthology are developed, they can be incorporated into the P-POD tool, taking advantage of its modular design