California Institute of Technology


1 California Institute of Technology
Automated generation of human-readable gene summaries using structured data
Ranjana Kishore
WormBase, California Institute of Technology, Pasadena, California
Biocuration Conference 2017, Stanford University

2 Gene summaries - the big picture
A gene summary condenses the knowledge about a gene into key semantic categories:
- Orthology
- Molecular function
- Biological process and pathway
- Tissue and sub-cellular expression
- Genetic and physical interactions

3 The need for automation
The problem:
- It took several years to write ~6000 gene summaries
- The summaries are difficult to maintain
- Many new genes
- Hundreds of papers
Could we automate the writing of gene summaries?

4 Looking within: A wealth of data in WormBase
Curated data are the building blocks for a gene summary:
- Gene names
- Orthology to genes from human and other species
- Gene Ontology (GO) annotations
- Tissue expression
GO project,

5 Templates for sentence construction
Semantic category          Template for sentence construction
Orthology                  <Gene> is an ortholog of <human gene 1>
Biological process         is involved in <process term 1>, <process term 2> and <process term 3>
Molecular function         exhibits / is predicted to have <molecular function term 1>
Tissue expression          is expressed in <anatomy term 1> and <anatomy term 2>
Sub-cellular expression    is localized to <cellular component term 1> and <cellular component term 2>

Description for the gene npp-19 built using the above templates:
npp-19 is an ortholog of human NUP35 (nucleoporin ); npp-19 is involved in embryo development, nuclear import and nucleus organization. npp-19 is localized to the nuclear envelope.
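The template approach on this slide can be illustrated with a short Python sketch. This is not the WormBase software (which the talk says is written in Perl); the function names, the category keys and the dictionary layout are hypothetical, and only the clause wording is taken from the slide. For simplicity it always uses "exhibits" for molecular function and joins every clause with a semicolon.

```python
def join_terms(terms):
    """Join terms as 'a', 'a and b' or 'a, b and c' (no serial comma, as on the slides)."""
    terms = list(terms)
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


# Hypothetical template table: (category key, clause template).
# The clause wording follows the slide; the keys are made up for this sketch.
TEMPLATES = [
    ("orthology", "is an ortholog of {}"),
    ("biological_process", "is involved in {}"),
    ("molecular_function", "exhibits {}"),
    ("tissue_expression", "is expressed in {}"),
    ("subcellular_localization", "is localized to {}"),
]


def build_summary(gene, data):
    """Build one clause per category that has data and join the clauses with '; '."""
    clauses = []
    for category, template in TEMPLATES:
        terms = data.get(category)
        if terms:  # skip categories with no curated data
            clauses.append(f"{gene} " + template.format(join_terms(terms)))
    return "; ".join(clauses) + "." if clauses else ""


summary = build_summary(
    "pfn-3",
    {
        "biological_process": ["muscle thin filament assembly"],
        "subcellular_localization": ["the striated muscle dense body"],
    },
)
print(summary)
```

Categories without data are simply skipped, which is why a sparsely annotated gene such as pfn-3 yields a two-clause summary.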

6 Automated gene summaries are reader-friendly
pfn-3 is involved in muscle thin filament assembly; pfn-3 is localized to the striated muscle dense body.

tbc-8 is an ortholog of human SGSM2 (small G protein signaling modulator 2) and SGSM1 (small G protein signaling modulator 1); tbc-8 is involved in dense core granule maturation; tbc-8 exhibits Rab GTPase binding activity; tbc-8 is expressed in the nervous system; tbc-8 is localized to the Golgi medial cisterna, the Golgi trans cisterna, the cytosol and the early endosome.

Cbr-twk-18 is an ortholog of C. elegans twk-18, which is involved in potassium ion transport, muscle contraction and locomotion; based on protein domain information, Cbr-twk-18 is involved in potassium ion transmembrane transport, is predicted to have potassium channel activity and is localized to the membrane.

7 Enhancing the readability of gene summaries
Strategies used when there was too much data:
- For ortholog data: grouped orthologs using gene classes; chose which orthologs to mention based on number of publications (popularity score)
- For expression data: grouped cells into cell groups
Strategy used when there was not enough data:
- Borrowed from the summary of the well-studied species

8 Enhancing readability when there is too much data
Example 1
CBG00317 is an ortholog of C. elegans fbxc-16, fbxc-15, fbxc-18, sdz-4, fbxc-28, fbxc-19 and fbxc-12.
Grouped orthologs using C. elegans gene class and gene popularity score (from Textpresso*). The description becomes more readable:
CBG00317 is an ortholog of C. elegans sdz-4 and members of the fbxc gene class including fbxc-28, fbxc-15 and fbxc-18.

Example 2
hrp-1 is an ortholog of human HNRNPA1L2 (heterogeneous nuclear ribonucleoprotein A1-like 2), HNRNPA1 (heterogeneous nuclear ribonucleoprotein A1), HNRNPA3 (heterogeneous nuclear ribonucleoprotein A3) and HNRNPA2B1 (heterogeneous nuclear ribonucleoprotein A2/B1).
Grouped human genes using HGNC** human gene families. The description becomes more readable:
hrp-1 is an ortholog of members of the human RBM (RNA binding motif containing) family including HNRNPA1L2 (heterogeneous nuclear ribonucleoprotein A1-like 2).

*Textpresso project
**HGNC, HUGO Gene Nomenclature Committee
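The grouping in Example 1 can be sketched in Python as follows. This is an illustration only, not the talk's Perl software: the rule that a gene class is the part of the name before the last hyphen, the cutoff of three listed members, and all popularity scores below are assumptions made up for this sketch (the real scores come from Textpresso publication counts and are not given on the slide).

```python
from collections import defaultdict


def join_terms(terms):
    """Join terms as 'a', 'a and b' or 'a, b and c'."""
    terms = list(terms)
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def describe_orthologs(gene, orthologs, popularity, max_listed=3):
    """Group orthologs by gene class and list only the most popular class members."""
    by_class = defaultdict(list)
    for name in orthologs:
        gene_class = name.rsplit("-", 1)[0]  # assumption: 'fbxc-16' belongs to class 'fbxc'
        by_class[gene_class].append(name)

    singles, groups = [], []
    for gene_class, members in by_class.items():
        if len(members) == 1:
            singles.append(members[0])  # a lone gene is mentioned by name
        else:
            top = sorted(members, key=lambda n: popularity.get(n, 0), reverse=True)
            groups.append(
                f"members of the {gene_class} gene class including "
                + join_terms(top[:max_listed])
            )
    return f"{gene} is an ortholog of C. elegans {join_terms(singles + groups)}."


# Placeholder popularity scores, invented purely so the example runs.
scores = {"fbxc-28": 5, "fbxc-15": 4, "fbxc-18": 3,
          "fbxc-16": 1, "fbxc-19": 1, "fbxc-12": 1, "sdz-4": 2}
orthologs = ["fbxc-16", "fbxc-15", "fbxc-18", "sdz-4", "fbxc-28", "fbxc-19", "fbxc-12"]
sentence = describe_orthologs("CBG00317", orthologs, scores)
print(sentence)
```

With these placeholder scores the sketch reproduces the shortened sentence from Example 1; different scores would change which fbxc members are named.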

9 Enhancing readability when data content is poor
Example 1
CBG02064 is an ortholog of C. elegans immt-1.
Added information about immt-1 from the C. elegans summary:
CBG02064 is an ortholog of C. elegans immt-1; in C. elegans, immt-1 is involved in response to reactive oxygen species, growth, cristae formation and mitochondrion morphogenesis.

Example 2
PPA00338 is an ortholog of C. elegans cst-2 and cst-1.
Added information about cst-2 and cst-1:
PPA00338 is an ortholog of C. elegans cst-2 and cst-1; in C. elegans, cst-2 and cst-1 are involved in determination of adult lifespan and locomotion.
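The borrowing step in these examples might look like the Python sketch below. It is a hedged illustration, not the talk's Perl code: the function name and the lookup dictionary are hypothetical, and merging the process terms of several orthologs by taking their union in order of appearance is an assumption, since the slide does not say how terms from multiple orthologs are combined.

```python
def join_terms(terms):
    """Join terms as 'a', 'a and b' or 'a, b and c'."""
    terms = list(terms)
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def borrow_from_elegans(gene, orthologs, elegans_processes):
    """Extend a sparse orthology sentence with process terms of the C. elegans orthologs."""
    base = f"{gene} is an ortholog of C. elegans {join_terms(orthologs)}"
    borrowed = []
    for ortholog in orthologs:
        for term in elegans_processes.get(ortholog, []):
            if term not in borrowed:  # assumption: union of terms, first occurrence wins
                borrowed.append(term)
    if not borrowed:
        return base + "."
    verb = "is" if len(orthologs) == 1 else "are"
    return (
        f"{base}; in C. elegans, {join_terms(orthologs)} {verb} "
        f"involved in {join_terms(borrowed)}."
    )


# Process terms copied from the slide's Example 1; the dictionary layout is hypothetical.
processes = {
    "immt-1": [
        "response to reactive oxygen species",
        "growth",
        "cristae formation",
        "mitochondrion morphogenesis",
    ],
}
sentence = borrow_from_elegans("CBG02064", ["immt-1"], processes)
print(sentence)
```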

10 Automated gene summaries pipeline
Pipeline stages (labels from the slide's flow diagram):
- Input data files
- Rule-based processing of data
- Summaries with too much data; summaries with too little data
- Additional rule-based processing of data: grouping of data, popularity scores, borrowing of data
- Gene summaries with enhanced readability
- Database build
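One way to picture the routing between the first pass and the additional rule-based processing is the small Python sketch below. Everything in it is hypothetical: the record layout, the field names and especially the threshold of three orthologs are assumptions for illustration, since the slides do not state how "too much" or "too little" data is decided.

```python
def route(record, max_orthologs=3):
    """Decide which additional processing step a gene record needs.

    Returns 'group' (too much data), 'borrow' (too little data) or 'none'.
    The threshold is a placeholder, not a value from the talk.
    """
    orthologs = record.get("orthologs", [])
    other_fields = ("processes", "functions", "expression", "localization")
    has_other_data = any(record.get(field) for field in other_fields)

    if len(orthologs) > max_orthologs:
        return "group"   # long ortholog list: group by gene class or family
    if orthologs and not has_other_data:
        return "borrow"  # only orthology known: borrow from the ortholog's summary
    return "none"


records = {
    "CBG00317": {"orthologs": ["fbxc-16", "fbxc-15", "fbxc-18", "sdz-4",
                               "fbxc-28", "fbxc-19", "fbxc-12"]},
    "CBG02064": {"orthologs": ["immt-1"]},
    "pfn-3": {"processes": ["muscle thin filament assembly"]},
}
decisions = {gene: route(record) for gene, record in records.items()}
print(decisions)
```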

11 Automated gene summaries are displayed in WormBase

12 Automated gene summaries filled a large data gap
Species                  Before automation    After automation     Current numbers
                         (WS245, Oct 2014)    (WS252, Jan 2016)    (WS257, March 2017)
C. elegans               6,680                13,819               18,103
C. brenneri                                   22,449               22,439
C. briggsae              <10                  17,022               17,346
C. japonica                                   18,905               18,902
C. remanei                                    23,184               23,226
Pristionchus pacificus   18                   12,586               12,527
Brugia malayi                                 8,676                9,608
Strongyloides ratti                           9,119                9,150
Onchocerca volvulus                           9,407                9,432

This project has generated thousands of gene summaries for nine species, including the parasitic species Brugia malayi, Strongyloides ratti and Onchocerca volvulus. We have roughly tripled the number of C. elegans genes with summaries and written thousands of summaries for species where none existed.
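As a quick arithmetic check on the numbers above, the Python snippet below sums the WS257 column and computes the C. elegans fold increase. The figures are copied from the slide; only the variable names are introduced here.

```python
# Current numbers (WS257, March 2017), copied from the slide.
current = {
    "C. elegans": 18103,
    "C. brenneri": 22439,
    "C. briggsae": 17346,
    "C. japonica": 18902,
    "C. remanei": 23226,
    "Pristionchus pacificus": 12527,
    "Brugia malayi": 9608,
    "Strongyloides ratti": 9150,
    "Onchocerca volvulus": 9432,
}

total = sum(current.values())
elegans_fold = current["C. elegans"] / 6680  # 6,680 summaries before automation (WS245)

print(total)                   # 140733
print(round(elegans_fold, 2))  # 2.71
```

The total of 140,733 summaries across nine species corresponds to the "over 140,000" figure quoted in the talk, and the C. elegans count grew about 2.7-fold.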

13 Automated gene summaries: the benefits
- Labor and time efficient
- Tells us what's missing
- Leverages the time and effort spent on other annotation projects
- Scales: from 6,704 gene summaries for 1 species to over 140,000 for 9 species
- Applicable to other data types, e.g., allele/variation summaries
- Stays current: refreshed with new data at every database build
- Provides a draft for community participation

Software and Data Availability
Software (written in Perl) at: textpresso.org/automatedgenesummary/software/
Data at: textpresso.org/automatedgenesummary/release/WS257/

14 Automated gene summaries as drafts for community participation
From the WormBase homepage
Description can be edited in this field

15 Acknowledgements
Juancarlos Chan (Curation tools)
James Done (Automated summaries software)
Yuling Lee (Software)
Kevin Howe (Orthology data)
Hans Michael Muller & Yuling Li (Textpresso software)
Paul Sternberg
WormBase Consortium

