California Institute of Technology


Automated generation of human-readable gene summaries using structured data
Ranjana Kishore, WormBase, California Institute of Technology, Pasadena, California
Biocuration Conference 2017, Stanford University

Gene summaries - the big picture
A gene summary condenses the knowledge about a gene into key semantic categories:
- Orthology
- Molecular function
- Biological process and pathway
- Tissue and sub-cellular expression
- Genetic and physical interactions

The need for automation
The problem:
- Several years to write ~6000 gene summaries
- Difficult to maintain
- Many new genes
- Hundreds of papers
Could we automate the writing of gene summaries?

Looking within: A wealth of data in WormBase
Curated data are the building blocks for a gene summary:
- Gene names
- Orthology to genes from human and other species
- Gene Ontology (GO) annotations
- Tissue expression
GO project, www.geneontology.org

Templates for sentence construction

Semantic category         Template for sentence construction
Orthology                 <Gene> is an ortholog of <human gene 1>
Biological process        is involved in <process term 1>, <process term 2> and <process term 3>
Molecular function        exhibits (or: is predicted to have) <molecular function term 1>
Tissue expression         is expressed in <anatomy term 1> and <anatomy term 2>
Sub-cellular expression   is localized to <cellular component term 1> and <cellular component term 2>

Description for the gene npp-19 built using the above templates:
npp-19 is an ortholog of human NUP35 (nucleoporin); npp-19 is involved in embryo development, nuclear import and nucleus organization; npp-19 is localized to the nuclear envelope.
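The template table above can be sketched in a few lines of Python. This is only an illustration, not the WormBase software (which the slides say is written in Perl): the names `TEMPLATES`, `join_terms` and `build_summary`, the dictionary keys, and the choice to join clauses with semicolons are all assumptions made for this example. Articles such as "the" are carried in the input terms to keep the sketch short.

```python
# Hypothetical template table: (data key, sentence template).
# The wording follows the slide; the keys are invented for this sketch.
TEMPLATES = [
    ("orthologs", "is an ortholog of {}"),
    ("processes", "is involved in {}"),
    ("functions", "exhibits {}"),
    ("tissues", "is expressed in {}"),
    ("components", "is localized to {}"),
]


def join_terms(terms):
    """Join terms as 'a', 'a and b' or 'a, b and c'."""
    terms = list(terms)
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def build_summary(gene, data):
    """Fill one clause per semantic category that has data, then join them."""
    clauses = []
    for key, template in TEMPLATES:
        terms = data.get(key)
        if terms:  # categories without data are skipped
            clauses.append(f"{gene} " + template.format(join_terms(terms)))
    if not clauses:
        return ""
    return "; ".join(clauses) + "."


npp19 = {
    "orthologs": ["human NUP35"],
    "processes": ["embryo development", "nuclear import", "nucleus organization"],
    "components": ["the nuclear envelope"],
}
print(build_summary("npp-19", npp19))
```

Categories with no data simply produce no clause, which is why the npp-19 example has no molecular function or tissue expression sentence.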

Automated gene summaries are reader-friendly

pfn-3 is involved in muscle thin filament assembly; pfn-3 is localized to the striated muscle dense body.

tbc-8 is an ortholog of human SGSM2 (small G protein signaling modulator 2) and SGSM1 (small G protein signaling modulator 1); tbc-8 is involved in dense core granule maturation; tbc-8 exhibits Rab GTPase binding activity; tbc-8 is expressed in the nervous system; tbc-8 is localized to the Golgi medial cisterna, the Golgi trans cisterna, the cytosol and the early endosome.

Cbr-twk-18 is an ortholog of C. elegans twk-18, which is involved in potassium ion transport, muscle contraction and locomotion; based on protein domain information, Cbr-twk-18 is involved in potassium ion transmembrane transport, is predicted to have potassium channel activity and is localized to the membrane.

Enhancing the readability of gene summaries
Strategies used when there was too much data:
- For ortholog data: grouped orthologs using gene classes; mentioned orthologs based on number of publications (popularity score)
- For expression data: grouped cells into cell groups
Strategy used when there was not enough data:
- Borrowed from the summary of the well-studied species

Enhancing readability when there is too much data

Example 1
CBG00317 is an ortholog of C. elegans fbxc-16, fbxc-15, fbxc-18, sdz-4, fbxc-28, fbxc-19 and fbxc-12.
Grouped orthologs using C. elegans gene class and gene popularity score (from Textpresso*). The description becomes more readable:
CBG00317 is an ortholog of C. elegans sdz-4 and members of the fbxc gene class including fbxc-28, fbxc-15 and fbxc-18.

Example 2
hrp-1 is an ortholog of human HNRNPA1L2 (heterogeneous nuclear ribonucleoprotein A1-like 2), HNRNPA1 (heterogeneous nuclear ribonucleoprotein A1), HNRNPA3 (heterogeneous nuclear ribonucleoprotein A3) and HNRNPA2B1 (heterogeneous nuclear ribonucleoprotein A2/B1).
Grouped human genes using HGNC** human gene families. The description becomes more readable:
hrp-1 is an ortholog of members of the human RBM (RNA binding motif containing) family including HNRNPA1L2 (heterogeneous nuclear ribonucleoprotein A1-like 2).

*Textpresso project, www.textpresso.org
**HGNC, HUGO Gene Nomenclature Committee, www.genenames.org
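The grouping in Example 1 can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the gene class is taken to be the name prefix before the final hyphen, the popularity scores are invented placeholders (the real scores come from Textpresso publication counts), and the function names and the `max_listed=3` cutoff are hypothetical.

```python
from collections import defaultdict


def join_terms(terms):
    """Join terms as 'a', 'a and b' or 'a, b and c'."""
    terms = list(terms)
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def group_orthologs(orthologs, popularity, max_listed=3):
    """Collapse orthologs that share a gene class, naming only the most popular."""
    by_class = defaultdict(list)
    for gene in orthologs:
        gene_class = gene.rsplit("-", 1)[0]  # assumption: 'fbxc-16' -> 'fbxc'
        by_class[gene_class].append(gene)

    singles, grouped = [], []
    for gene_class, members in by_class.items():
        if len(members) == 1:
            singles.append(members[0])  # lone genes are named directly
        else:
            top = sorted(members, key=lambda g: popularity.get(g, 0), reverse=True)
            listed = join_terms(top[:max_listed])
            grouped.append(f"members of the {gene_class} gene class including {listed}")
    return join_terms(singles + grouped)


orthologs = ["fbxc-16", "fbxc-15", "fbxc-18", "sdz-4", "fbxc-28", "fbxc-19", "fbxc-12"]
# Placeholder scores chosen only so the ranking matches the slide's output.
popularity = {"fbxc-28": 5, "fbxc-15": 4, "fbxc-18": 3, "sdz-4": 2}
print("CBG00317 is an ortholog of C. elegans " + group_orthologs(orthologs, popularity) + ".")
```

Genes missing from the score table default to a score of 0, so they are the first to be dropped from the named list.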

Enhancing readability when data content is poor

Example 1
CBG02064 is an ortholog of C. elegans immt-1.
Added information about immt-1 from the C. elegans summary:
CBG02064 is an ortholog of C. elegans immt-1; in C. elegans, immt-1 is involved in response to reactive oxygen species, growth, cristae formation and mitochondrion morphogenesis.

Example 2
PPA00338 is an ortholog of C. elegans cst-2 and cst-1.
Added information about cst-2 and cst-1:
PPA00338 is an ortholog of C. elegans cst-2 and cst-1; in C. elegans, cst-2 and cst-1 are involved in determination of adult lifespan and locomotion.
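The borrowing step in these two examples might look like the sketch below. The function name `borrow_summary`, the `reference` lookup table, and the way process terms from several orthologs are merged (in order, without duplicates) are assumptions for illustration; the slides do not say how the actual software combines terms.

```python
def join_terms(terms):
    """Join terms as 'a', 'a and b' or 'a, b and c'."""
    terms = list(terms)
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def borrow_summary(gene, orthologs, reference, species="C. elegans"):
    """Extend a bare orthology sentence with process terms of the orthologs."""
    names = join_terms(orthologs)
    sentence = f"{gene} is an ortholog of {species} {names}"

    # Assumption: merge the orthologs' terms in order, dropping duplicates.
    processes = []
    for ortholog in orthologs:
        for term in reference.get(ortholog, []):
            if term not in processes:
                processes.append(term)

    if not processes:
        return sentence + "."  # nothing to borrow
    verb = "is" if len(orthologs) == 1 else "are"
    return f"{sentence}; in {species}, {names} {verb} involved in {join_terms(processes)}."


# Process terms copied from the slide's examples.
reference = {
    "immt-1": ["response to reactive oxygen species", "growth",
               "cristae formation", "mitochondrion morphogenesis"],
    "cst-2": ["determination of adult lifespan", "locomotion"],
    "cst-1": ["determination of adult lifespan", "locomotion"],
}
print(borrow_summary("CBG02064", ["immt-1"], reference))
print(borrow_summary("PPA00338", ["cst-2", "cst-1"], reference))
```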

Automated gene summaries pipeline
[Pipeline diagram. Labels as shown on the slide: input data files; rule-based processing of data; summaries with too much data; summaries with too little data; additional rule-based processing of data (grouping of data, popularity scores, borrowing of data); gene summaries with enhanced readability; database build.]

Automated gene summaries are displayed in WormBase

Automated gene summaries filled a large data gap

Species                  Before automation   After automation   Current numbers
                         (WS245, Oct 2014)   (WS252, Jan 2016)  (WS257, March 2017)
C. elegans               6,680               13,819             18,103
C. brenneri              -                   22,449             22,439
C. briggsae              <10                 17,022             17,346
C. japonica              -                   18,905             18,902
C. remanei               -                   23,184             23,226
Pristionchus pacificus   18                  12,586             12,527
Brugia malayi            -                   8,676              9,608
Strongyloides ratti      -                   9,119              9,150
Onchocerca volvulus      -                   9,407              9,432

This project has generated thousands of gene summaries for nine species, including the parasitic species Brugia malayi, Strongyloides ratti and Onchocerca volvulus. We have nearly tripled the number of gene summaries for C. elegans and written thousands of summaries for species where none existed.

Automated gene summaries: the benefits
- Labor and time efficient
- Tells us what's missing
- Leverages the time and effort spent on other annotation projects
- Scales: from 6,704 gene summaries for 1 species to over 140,000 for 9 species
- Applicable to other data types, e.g., allele/variation summaries
- Stays current: refreshed with new data at every database build
- Provides a draft for community participation

Software and Data Availability
Software (written in Perl) at: textpresso.org/automatedgenesummary/software/
Data at: textpresso.org/automatedgenesummary/release/WS257/

Automated gene summaries as drafts for community participation
[Screenshot captions: "From the WormBase homepage"; "Description can be edited in this field"]

Acknowledgements
- Juancarlos Chan (Curation tools)
- James Done (Automated summaries software)
- Yuling Lee (Software)
- Kevin Howe (Orthology data)
- Hans Michael Muller & Yuling Li (Textpresso software)
- Paul Sternberg
- WormBase Consortium