A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, January.

Presentation transcript:

A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes Andrew Su, January 16, 2014 GMOD 2014 OK

Why am I giving this keynote?

Harnessing the crowd…

… to organize information

My simplified history of MODs

GMOD is widely used (!) organizations listed as GMOD users

Does the current model scale?

[Figure: number of sequenced genomes by year]

Does the current model scale?

The Long Tail of genomic data is being lost
Identified 517 operons and 103 small regulatory RNAs...

At least you can download structured data…

Centralized Model Organism Database concept
[Diagram label: CMOD]

GMOD as a Service (GaaS)

Few genes are well annotated…
[Figure: GO annotation counts per gene, with genes sorted by decreasing counts; 20,473 protein-coding genes. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Percentage labels: [value missing]%, 65%. Data: NCBI, February]

… because the literature is sparsely curated?
[Figure: number of articles read by a typical scientist]

311,696 articles (1.5% of PubMed) have been cited by GO annotations
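As a rough check on the scale implied by this slide, the two figures can be combined to estimate how large PubMed was at the time. This is only a sketch: the variable names are illustrative, and the 1.5% is the slide's rounded value, so the derived totals are approximate.

```python
# Figures taken from the slide. The percentage is rounded, so the
# derived totals are estimates, not official PubMed counts.
cited_articles = 311_696
cited_fraction = 0.015

implied_pubmed_total = cited_articles / cited_fraction
uncited_articles = implied_pubmed_total - cited_articles

print(f"Implied PubMed size: ~{implied_pubmed_total / 1e6:.1f} million articles")
print(f"Not cited by any GO annotation: ~{uncited_articles / 1e6:.1f} million")
```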

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

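The Long Tail is a prolific source of content
[Figure axes: content produced vs. contributors (sorted); regions labeled Short Head and Long Tail]
News: Newspapers (Short Head), Blogs (Long Tail)
Video: TV/Hollywood (Short Head), YouTube (Long Tail)
Product reviews: Consumer reports (Short Head), Amazon reviews (Long Tail)
Food reviews: Food critics (Short Head), Yelp (Long Tail)
Talent judging: Olympics (Short Head), American Idol (Long Tail)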

Wikipedia is reasonably accurate

Wikipedia has breadth and depth
[Figure: articles and words (millions), Wikipedia vs. Britannica Online, July 2008]

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Filtering, extracting, and summarizing PubMed
[Diagram labels: Documents, Concepts, Review article]

Wiki success depends on a positive feedback loop
[Diagram labels: Gene Wiki page utility, number of users, number of contributors]

10,000 gene “stubs” within Wikipedia
[Figure callouts: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references]
Huss, PLoS Biol, 2008

Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011

Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
[Figure axes: editor count (Editors), edit count (Edits)]
Good, NAR, 2011
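The word counts on this slide imply an average length for a "full-length article" and a rate at which the Gene Wiki grows in article-equivalents. The sketch below works that out; the variable names are illustrative, and the monthly figure is the slide's approximate "~10,000", so the results are rough.

```python
# Figures from the slide; the monthly growth is approximate ("~10,000").
total_words = 1_420_000
equivalent_articles = 230
monthly_growth_words = 10_000

# Average length implied for one "full-length article".
words_per_article = total_words / equivalent_articles

# How long the community takes to add one article's worth of text.
months_per_article = words_per_article / monthly_growth_words

print(f"Implied article length: ~{words_per_article:,.0f} words")
print(f"One article-equivalent added roughly every {months_per_article:.2f} months")
```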

A review article for every gene is powerful
[Figure callouts: references to the literature; hyperlinks to related concepts]
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002

Making the Gene Wiki more computable
[Diagram labels: Structured annotations, Free text]

Filling the gaps in gene annotation
[Diagram labels: Wikilink, GO exact match, Gene Wiki mapping, NCBI Entrez Gene: 334, Candidate assertion, GO: ]
[count missing] novel GO annotations
2147 novel DO annotations
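The diagram labels on this slide suggest a pipeline in which wikilinks on a Gene Wiki page are matched exactly against GO term names, and each match not already annotated to the page's gene becomes a candidate assertion. The sketch below is one way that idea could look, not the authors' implementation. The function name, the gene identifier `example-gene`, the sample wikitext, and the two-entry GO name table are all hypothetical illustration data.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]],
# capturing only the link target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_annotations(wikitext, gene_id, go_names, known):
    """Return GO ids linked from the page text but not yet annotated to the gene."""
    candidates = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        go_id = go_names.get(target)  # exact, case-insensitive name match
        if go_id and (gene_id, go_id) not in known and go_id not in candidates:
            candidates.append(go_id)
    return candidates


# Hypothetical inputs for illustration only.
go_names = {"axon guidance": "GO:0007411", "muscle contraction": "GO:0006936"}
known = {("example-gene", "GO:0006936")}
wikitext = (
    "The protein is involved in [[axon guidance]] and "
    "[[Muscle contraction|contraction of muscle]], and binds [[heparin]]."
)

result = candidate_annotations(wikitext, "example-gene", go_names, known)
print(result)
```

Here the link to muscle contraction is skipped because it is already in the known set, and the heparin link is skipped because it has no entry in the GO name table.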

Gene Wiki content improves enrichment analysis
[Workflow labels: PubMed abstracts, concept recognition, GO term, gene list, enrichment analysis]
Example: axon guidance (GO: ), 264 genes
Linked genes through PubMed: P = 1.55 E[exponent missing], [count missing] articles
[Contingency table, Yes/No by Yes/No: only the value 132 survives]

Gene Wiki content improves enrichment analysis
[Workflow labels: PubMed abstracts + Gene Wiki, concept recognition, GO term, gene list, enrichment analysis]
Example: muscle contraction (GO: ), 87 genes
Linked genes through PubMed: P = 1.0, [count missing] articles
Linked genes through PubMed + Gene Wiki: P = 1.22 E[exponent missing], 87 articles

Gene Wiki content improves enrichment analysis
[Figure: p-value (PubMed only) vs. p-value (PubMed + GW); regions labeled "More significant PubMed + GW" and "More significant PubMed only"; muscle contraction highlighted]
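The slides do not say which statistical test produced the enrichment p-values. A common choice for this kind of gene-list analysis is the one-sided hypergeometric test, sketched below on that assumption. The function name and the toy numbers are illustrative and are not the values from the slides.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric p-value, P(X >= k).

    N: genes in the background
    K: background genes annotated with the GO term
    n: genes in the list being tested
    k: genes in the list that carry the annotation
    """
    total = comb(N, n)
    upper = min(n, K)
    # comb() returns 0 for impossible draws, so no extra bounds check is needed.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy example: 4 background genes, 2 annotated, and a 2-gene list
# in which both genes are annotated.
p = enrichment_p_value(k=2, n=2, K=2, N=4)
print(p)
```

Adding Gene Wiki text to the PubMed abstracts would change the inputs `k` and `K` (more gene-to-term links are recognized), which is how extra text can move a term from non-significant to significant.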

The Long Tail of scientists is a valuable source of information on gene function

Can we skip text mining?

Wikidata
"Provide a database of the world’s knowledge that anyone can edit" - Denny Vrandečić

Wikidata understands scale

Wikidata understands scale
[count missing] million Wikidata items…
…13 million total genes in Entrez Gene

Wikidata understands scale
[count missing] million Wikidata statements…
…150k total GO annotations

Wikidata for biology
Reelin (Q[identifier missing]):
is a (Property:P31): Protein (Q8054), Glycoprotein
regulates (Property:P128): Neural development
interacts with (Property:P129): VLDL receptor, Amyloid precursor protein

Wikidata for biology
[Same graph shown with identifiers only: Property:P31, Property:P128, Property:P129, Q8054; the remaining Q identifiers are missing]
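The Reelin example on these slides can be read as a set of subject, property, value statements. The sketch below models it that way in plain Python. It is an illustration of the data shape, not the Wikidata API. The `Statement` class and `values_for` helper are hypothetical names. The pairing of each property ID with its label follows the order on the slide, and text labels stand in for item IDs because most Q identifiers are missing from the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str      # item label (the Q ids are mostly missing in the transcript)
    property_id: str  # Wikidata property id as shown on the slide
    value: str        # label of the target item


# Property labels as they appear on the slide.
PROPERTY_LABELS = {"P31": "is a", "P128": "regulates", "P129": "interacts with"}

statements = [
    Statement("Reelin", "P31", "Protein"),
    Statement("Reelin", "P31", "Glycoprotein"),
    Statement("Reelin", "P128", "Neural development"),
    Statement("Reelin", "P129", "VLDL receptor"),
    Statement("Reelin", "P129", "Amyloid precursor protein"),
]


def values_for(subject, property_id):
    """Return all values stated for a subject under one property."""
    return [
        s.value
        for s in statements
        if s.subject == subject and s.property_id == property_id
    ]


for pid, label in PROPERTY_LABELS.items():
    print(f"Reelin {label}: {', '.join(values_for('Reelin', pid))}")
```

Because each statement is a structured triple rather than free text, queries like "what does Reelin interact with?" need no text mining, which is the point the preceding slides build toward.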

Increasing biological data in Wikidata

Loading genomic data into Wikidata
Sources: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq

Wikidata gene model
Added ~1000 human genes so far…

Wikidata as CMOD?
[Diagram label: CMOD]

Wikidata as CMOD?
[Diagram labels: CMOD; Powered by: CMOD]

The Long Tail of bioinformaticians can collaboratively build a Centralized Model Organism Database (CMOD).

Gene Wiki Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim; many Wikipedia editors; WP:MCB Project
Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Tobias Meissner, Max Nanis, Chunlei Wu
Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco
Funding and Support: BioGPS: GM83924, Gene Wiki: GM089820
Contact: +Andrew Su
Recruiting for student, postdoc, outreach, and/or staff positions!