A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes
Andrew Su, January 16, 2014
GMOD 2014
Why am I giving this keynote?
Harnessing the crowd…
… to organize information
My simplified history of MODs
GMOD is widely used (!) organizations listed as GMOD users
Does the current model scale?
[Figure: number of sequenced genomes by year]
The Long Tail of genomic data is being lost
Identified 517 operons and 103 small regulatory RNAs...
At least you can download structured data…
Centralized Model Organism Database (CMOD) concept
GMOD as a Service (GaaS)
Few genes are well annotated…
[Figure: GO annotation counts per gene, with genes sorted by decreasing counts; 20,473 protein-coding genes; 65% marked on the plot. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Data: NCBI, February]
… because the literature is sparsely curated?
Number of articles read by a typical scientist
311,696 articles (1.5% of PubMed) have been cited by GO annotations
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
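The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), divided into Short Head and Long Tail]
News: Newspapers (Short Head), Blogs (Long Tail)
Video: TV/Hollywood (Short Head), YouTube (Long Tail)
Product reviews: Consumer reports (Short Head), Amazon reviews (Long Tail)
Food reviews: Food critics (Short Head), Yelp (Long Tail)
Talent judging: Olympics (Short Head), American Idol (Long Tail)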
Wikipedia is reasonably accurate
Wikipedia has breadth and depth
[Figure: articles and words (millions), Wikipedia vs. Britannica Online, July 2008]
We can harness the Long Tail of scientists to directly participate in the gene annotation process.
Filtering, extracting, and summarizing PubMed
Documents, Concepts, Review article
Wiki success depends on a positive feedback loop
Gene wiki page utility, number of users, number of contributors
10,000 gene “stubs” within Wikipedia (Huss, PLoS Biol, 2008)
Each stub includes: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month (Huss, PLoS Biol, 2008; Good, NAR, 2011)
Gene Wiki has a critical mass of editors (Good, NAR, 2011)
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words, approximately equal to 230 full-length articles
[Figure: editor count and edit count]
A review article for every gene is powerful
References to the literature; hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
Making the Gene Wiki more computable
Structured annotations; free text
Filling the gaps in gene annotation
Wikilink; GO exact match; Gene Wiki mapping; NCBI Entrez Gene: 334; candidate assertion; GO:
Novel GO annotations; 2147 novel DO annotations
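The steps named on this slide (wikilink, exact match to a GO term, mapping of the wiki page to a gene, candidate assertion) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the Gene Wiki pipeline itself: the function name `candidate_annotations`, the tiny `GO_LABELS` lookup, the `EXAMPLE_GENE` identifier, and the sample wikitext are all hypothetical.

```python
import re

# Toy lookup from GO term label to GO ID (illustrative subset only,
# standing in for a full Gene Ontology release).
GO_LABELS = {
    "axon guidance": "GO:0007411",
    "apoptotic process": "GO:0006915",
}

# Matches [[Target]], [[Target|shown text]] and [[Target#Section]];
# group 1 is the link target (the page being linked to).
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_annotations(gene_id, wikitext, known):
    """Return (gene_id, go_id) pairs whose wikilink target exactly matches
    a GO label and that are not already in the set of known annotations."""
    candidates = []
    for target in WIKILINK.findall(wikitext):
        go_id = GO_LABELS.get(target.strip().lower())
        pair = (gene_id, go_id)
        if go_id and pair not in known and pair not in candidates:
            candidates.append(pair)
    return candidates


# Hypothetical gene page text and pre-existing annotations.
page = (
    "ExampleGene is involved in [[axon guidance]] and "
    "[[Apoptotic process|cell death]]; see also [[Neuron]]."
)
known = {("EXAMPLE_GENE", "GO:0006915")}
result = candidate_annotations("EXAMPLE_GENE", page, known)
print(result)
```

Only the link that matches a GO label and is not already annotated comes back as a candidate; a real pipeline would still need curation of each candidate before it counts as an annotation.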
Gene Wiki content improves enrichment analysis
GO term, gene list; PubMed abstracts, concept recognition; enrichment analysis
Example: axon guidance (GO: ), 264 genes; linked genes through PubMed; P = 1.55 E; articles; Yes/No contingency table (one cell: 132)
Gene Wiki content improves enrichment analysis
GO term, gene list; PubMed abstracts + Gene Wiki, concept recognition; enrichment analysis
Example: muscle contraction (GO: ), 87 genes; linked genes through PubMed: P = 1.0; linked genes through PubMed + Gene Wiki: P = 1.22 E; articles; 87 articles
Gene Wiki content improves enrichment analysis
[Figure: p-value (PubMed only) vs. p-value (PubMed + GW); regions labeled "More significant PubMed + GW" and "More significant PubMed only"; muscle contraction highlighted]
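The enrichment p-values on these slides come from comparing a GO term's gene list with the genes linked to that term through text. The slides do not say which test was used; the sketch below assumes a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test on the Yes/No table), and the function name and all counts in the usage lines are made up for illustration.

```python
from math import comb


def hypergeom_enrichment_p(overlap, list_size, linked_size, universe):
    """Upper-tail hypergeometric p-value.

    universe     total number of genes considered
    linked_size  genes linked to the concept through text
    list_size    genes in the GO term's gene list
    overlap      genes that are in both sets

    Returns P(X >= overlap) if list_size genes were drawn at random.
    """
    max_k = min(list_size, linked_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(linked_size, k) * comb(universe - linked_size, list_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total


# Toy numbers only: 20 genes in total, 5 linked through text,
# 5 in the GO gene list, 3 shared.
p_toy = hypergeom_enrichment_p(overlap=3, list_size=5, linked_size=5, universe=20)
# With no required overlap the tail covers every outcome.
p_all = hypergeom_enrichment_p(overlap=0, list_size=5, linked_size=5, universe=20)
print(p_toy, p_all)
```

Adding Gene Wiki text to PubMed abstracts changes `linked_size` and `overlap`, which is how a term can move from non-significant to significant under this kind of test.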
The Long Tail of scientists is a valuable source of information on gene function
Can we skip text mining?
Wikidata
"Provide a database of the world’s knowledge that anyone can edit" - Denny Vrandečić
Wikidata understands scale
… million Wikidata items… …13 million total genes in Entrez Gene
… million Wikidata statements… …150k total GO annotations
Wikidata for biology
Item: Reelin
is a (Property:P31): Protein, Glycoprotein
regulates (Property:P128): Neural development
Interacts with (Property:P129): VLDL receptor, Amyloid precursor protein
Item IDs: Q8054, Q, Q, Q, Q, Q
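The Reelin example amounts to a set of item, property, value statements. A minimal sketch of that data model follows, assuming the pairing of values to properties read off the slide layout; the `Statement` class and `values_for` helper are hypothetical names, and item IDs other than Q8054 (Protein) are left out because they did not survive on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    subject: str                    # label of the item being described
    prop: str                       # Wikidata property ID, e.g. "P31"
    prop_label: str                 # human-readable property label
    value: str                      # label of the value item
    value_id: Optional[str] = None  # Q number of the value item, if known


# Statements as read from the slide (value-to-property pairing is assumed).
STATEMENTS = [
    Statement("Reelin", "P31", "is a", "Protein", "Q8054"),
    Statement("Reelin", "P31", "is a", "Glycoprotein"),
    Statement("Reelin", "P128", "regulates", "Neural development"),
    Statement("Reelin", "P129", "interacts with", "VLDL receptor"),
    Statement("Reelin", "P129", "interacts with", "Amyloid precursor protein"),
]


def values_for(subject, prop, statements=STATEMENTS):
    """Return the value labels of all statements for one item and property."""
    return [s.value for s in statements if s.subject == subject and s.prop == prop]


for prop in ("P31", "P128", "P129"):
    print(prop, values_for("Reelin", prop))
```

Because every fact is a separate statement keyed by property ID, queries such as "everything Reelin interacts with" need no text mining.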
Increasing biological data in Wikidata
Loading genomic data into Wikidata: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq
Wikidata gene model: added ~1000 human genes so far…
Wikidata as CMOD?
The Long Tail of bioinformaticians can collaboratively build a Centralized Model Organism Database (CMOD).
Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim; Many Wikipedia editors; WP:MCB Project Gene Wiki
Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Tobias Meissner, Max Nanis, Chunlei Wu
Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco
Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
Contact: +Andrew Su
Recruiting for student, postdoc, outreach, and/or staff positions!