Published by Sabrina Fletcher. Modified over 9 years ago.
1
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes. Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). January 16, 2014, GMOD 2014.
2
Why am I giving this keynote?
3
Harnessing the crowd… (Image: http://www.flickr.com/photos/portland_mike/6140660504/)
4
… to organize information (Image: http://www.flickr.com/photos/45697441@N00/6629580443)
5
My simplified history of MODs
6
7
GMOD is widely used: 199 (!) organizations listed as GMOD users
8
Does the current model scale?
9
10
[Figure: number of sequenced genomes by year]
11
Does the current model scale?
12
The Long Tail of genomic data is being lost. Identified 517 operons and 103 small regulatory RNAs...
13
The Long Tail of genomic data is being lost. Identified 517 operons and 103 small regulatory RNAs...
14
At least you can download structured data…
15
Centralized Model Organism Database (CMOD) concept
16
GMOD as a Service (GaaS) (Image: http://www.flickr.com/photos/aigle_dore/5626312363/)
17
(Image: http://www.flickr.com/photos/shannonmary/187131727/)
18
Few genes are well annotated… [Figure: GO annotation counts per gene for 20,473 protein-coding genes, sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Callouts: 41%, 65%. Data: NCBI, February 2013.]
19
… because the literature is sparsely curated?
20
… because the literature is sparsely curated? Number of articles read by a typical scientist.
21
311,696 articles (1.5% of PubMed) have been cited by GO annotations
22
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
23
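The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), divided into Short Head and Long Tail]
News: Newspapers (Short Head), Blogs (Long Tail)
Video: TV/Hollywood (Short Head), YouTube (Long Tail)
Product reviews: Consumer reports (Short Head), Amazon reviews (Long Tail)
Food reviews: Food critics (Short Head), Yelp (Long Tail)
Talent judging: Olympics (Short Head), American Idol (Long Tail)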
24
Wikipedia is reasonably accurate
25
Wikipedia has breadth and depth [Figure: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]
26
We can harness the Long Tail of scientists to directly participate in the gene annotation process.
27
Filtering, extracting, and summarizing [Diagram labels: PubMed, Documents, Concepts, Review article]
28
Filtering, extracting, and summarizing [Diagram labels: PubMed, Documents, Concepts]
29
Wiki success depends on a positive feedback loop [Diagram labels: Gene Wiki page utility, number of users, number of contributors. Example values shown: 100, 1, 200, 2]
30
10,000 gene “stubs” within Wikipedia. Page elements: protein structure, symbols and identifiers, tissue expression pattern, Gene Ontology annotations, links to structured databases, gene summary, protein interactions, linked references. (Huss, PLoS Biol, 2008)
31
Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011)
32
Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. (Good, NAR, 2011) [Figure: editor count and edit count]
33
A review article for every gene is powerful: references to the literature, hyperlinks to related concepts. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002.
34
Making the Gene Wiki more computable [Diagram labels: Structured annotations, Free text]
35
Filling the gaps in gene annotation [Diagram: Wikilink (GO exact match); Gene Wiki mapping (NCBI Entrez Gene: 334); Candidate assertion (GO:0006897)]. Result: 6319 novel GO annotations, 2147 novel DO annotations.
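The mining step on this slide can be sketched roughly as follows: scan a Gene Wiki page for wikilinks, keep those whose target exactly matches a GO term label, and pair the matching GO ID with the Entrez Gene ID the page maps to. This is a minimal sketch, not the pipeline used in the talk. The function name, the regular expression, the toy lookup table, and the sample wikitext are all hypothetical; only the IDs 334 and GO:0006897 come from the slide, and the label paired with GO:0006897 below is an assumption for illustration.

```python
import re

# Hypothetical toy lookup from GO term label to GO ID. Only the ID appears on
# the slide; the label paired with it here is an assumption for illustration.
GO_LABEL_TO_ID = {"endocytosis": "GO:0006897"}

# Matches [[Target]], [[Target|display text]] and [[Target#Section]] wikilinks,
# capturing only the link target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(entrez_gene_id, wikitext, label_to_id):
    """Return sorted (gene ID, GO ID) pairs for wikilinks whose target
    exactly matches a GO term label (case-insensitive)."""
    found = set()
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        if target in label_to_id:
            found.add((entrez_gene_id, label_to_id[target]))
    return sorted(found)


# Hypothetical page text for the gene that maps to Entrez Gene 334.
sample = "The protein is involved in [[endocytosis]] and binds [[Heparin|heparin]]."
print(candidate_assertions(334, sample, GO_LABEL_TO_ID))
```

Exact matching on link targets keeps the candidates conservative; anything looser (synonyms, free-text mentions) would need the concept recognition step described on the following slides.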
36
Gene Wiki content improves enrichment analysis [Diagram labels: GO term, gene list, concept recognition, PubMed abstracts, enrichment analysis]. Example: axon guidance (GO:0007411), 264 genes; linked genes through PubMed, 811 articles; P = 1.55E-20. [Contingency table, Yes/No by Yes/No; cell values run together in the source: 132 / 25112033]
37
Gene Wiki content improves enrichment analysis [Diagram labels: GO term, gene list, concept recognition, PubMed abstracts, Gene Wiki +, enrichment analysis]. Example: muscle contraction (GO:0006936), 87 genes. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22E-09. Article counts shown: 251 articles, 87 articles.
38
Gene Wiki content improves enrichment analysis [Figure: p-value (PubMed only) vs. p-value (PubMed + GW), with regions marked "More significant PubMed + GW" and "More significant PubMed only"; muscle contraction highlighted]
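The slides report enrichment P values without naming the statistical test. As an assumption, the sketch below uses a one-sided hypergeometric test, a common choice for asking whether a gene list overlaps a set of text-linked genes more than chance would predict. The function name and the toy counts are hypothetical and are not taken from the slides.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, overlap):
    """One-sided hypergeometric P value: the probability of seeing at least
    `overlap` annotated genes when `sample` genes are drawn without
    replacement from `population` genes, `annotated` of which carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20 genes in total, 5 annotated to the GO term,
# 5 genes linked to the term through text, all 5 of them annotated.
print(hypergeom_enrichment_p(20, 5, 5, 5))
```

Adding Gene Wiki text to the PubMed abstracts changes the set of linked genes and therefore the overlap count, which is how the P value for a term can move between the two conditions.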
39
The Long Tail of scientists is a valuable source of information on gene function
40
Can we skip text mining? (Source: http://fiehnlab.ucdavis.edu/projects/rice_metabolome/)
41
Wikidata: “Provide a database of the world’s knowledge that anyone can edit” (Denny Vrandečić)
42
Wikidata understands scale
43
Wikidata understands scale: 14 million Wikidata items… …13 million total genes in Entrez Gene
44
Wikidata understands scale: 27 million Wikidata statements… …150k total GO annotations
45
Wikidata for biology. Reelin (Q414043, http://www.wikidata.org/wiki/Q414043): is a (Property:P31) Protein, Glycoprotein; regulates (Property:P128) Neural development; interacts with (Property:P129) VLDL receptor, Amyloid precursor protein. Item IDs shown: Q8054, Q187126, Q1345738, Q1979313, Q423510.
46
Wikidata for biology. Properties: P31, P128, P129. Items: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043. API: http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
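A minimal sketch of working with the `wbgetentities` call shown on this slide: one helper builds the request URL and another pulls item-valued statements out of the JSON response. Several things here are assumptions rather than content from the talk: the helper names, the `https://www.wikidata.org` host (the slide uses `http://wikidata.org`), the added `format=json` parameter, and the hand-written, heavily trimmed sample response, which stands in for real API output so the example runs without network access.

```python
from urllib.parse import urlencode

# Assumed endpoint; the slide's URL uses http://wikidata.org/w/api.php.
API = "https://www.wikidata.org/w/api.php"


def entity_url(qid, language="en"):
    """Build a wbgetentities URL for one item. format=json is added here as
    an assumption so the response can be parsed as JSON."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "languages": language,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def item_values(response, qid, prop):
    """Return the item IDs (such as 'Q8054') that `qid` points to through
    property `prop`, skipping statements without an item value."""
    claims = response["entities"][qid].get("claims", {})
    values = []
    for statement in claims.get(prop, []):
        snak = statement["mainsnak"]
        if snak.get("snaktype") != "value":
            continue
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "numeric-id" in value:
            values.append("Q%d" % value["numeric-id"])
    return values


# Hand-written stand-in for a response, trimmed to a single statement.
sample_response = {
    "entities": {
        "Q414043": {
            "labels": {"en": {"language": "en", "value": "Reelin"}},
            "claims": {
                "P31": [
                    {
                        "mainsnak": {
                            "snaktype": "value",
                            "property": "P31",
                            "datavalue": {
                                "type": "wikibase-entityid",
                                "value": {"entity-type": "item", "numeric-id": 8054},
                            },
                        }
                    }
                ]
            },
        }
    }
}

print(entity_url("Q414043"))
print(item_values(sample_response, "Q414043", "P31"))
```

To try this against the live service, the URL from `entity_url` could be fetched with any HTTP client and the decoded JSON passed to `item_values` in place of `sample_response`.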
47
Increasing biological data in Wikidata: http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
48
Loading genomic data into Wikidata. Sources: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq
49
Wikidata gene model. Added ~1000 human genes so far…
50
Wikidata as CMOD?
51
Wikidata as CMOD? [Diagram labels: CMOD, Powered by:]
52
The Long Tail of bioinformaticians can collaboratively build a Centralized Model Organism Database (CMOD).
53
Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim; Many Wikipedia editors; WP:MCB Project Gene Wiki
Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Tobias Meissner, Max Nanis, Chunlei Wu
Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco
Funding and Support: BioGPS: GM83924, Gene Wiki: GM089820
Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su
Recruiting for student, postdoc, outreach, and/or staff positions!