A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes. Andrew Su, January 16, 2014.

1 A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes. Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). January 16, 2014, GMOD 2014.

2 Why am I giving this keynote?

3 Harnessing the crowd… (image: http://www.flickr.com/photos/portland_mike/6140660504/)

4 … to organize information (image: http://www.flickr.com/photos/45697441@N00/6629580443)

5 My simplified history of MODs

6 [no slide text]

7 GMOD is widely used: 199 (!) organizations listed as GMOD users

8 Does the current model scale?

9 [no slide text]

10 [Chart: number of sequenced genomes by year]

11 Does the current model scale?

12 The Long Tail of genomic data is being lost: Identified 517 operons and 103 small regulatory RNAs...

13 The Long Tail of genomic data is being lost: Identified 517 operons and 103 small regulatory RNAs...

14 At least you can download structured data…

15 Centralized Model Organism Database concept: CMOD

16 GMOD as a Service (GaaS) (image: http://www.flickr.com/photos/aigle_dore/5626312363/)

17 (image: http://www.flickr.com/photos/shannonmary/187131727/)

18 Few genes are well annotated… [Chart: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF; callouts: 41%, 65%. Data: NCBI, February 2013]

19 … because the literature is sparsely curated?

20 … because the literature is sparsely curated? [Chart annotation: number of articles read by typical scientist]

21 311,696 articles (1.5% of PubMed) have been cited by GO annotations

22 Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

23 The Long Tail is a prolific source of content [Chart: content produced vs. contributors (sorted)]
Category: Short Head / Long Tail
News: Newspapers / Blogs
Video: TV/Hollywood / YouTube
Product reviews: Consumer reports / Amazon reviews
Food reviews: Food critics / Yelp
Talent judging: Olympics / American Idol

24 Wikipedia is reasonably accurate

25 Wikipedia has breadth and depth [Chart: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]

26 We can harness the Long Tail of scientists to directly participate in the gene annotation process.

27 Filtering, extracting, and summarizing PubMed: Documents, Concepts, Review article

28 Filtering, extracting, and summarizing PubMed: Documents, Concepts

29 Wiki success depends on a positive feedback loop: Gene Wiki page utility, number of users, number of contributors [Diagram values: 100, 1, 200, 2]

30 10,000 gene “stubs” within Wikipedia: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references (Huss, PLoS Biol, 2008) [Feedback loop: Utility, Users, Contributors]

31 Gene Wiki has a critical mass of readers. Total: 4.0 million views / month (Huss, PLoS Biol, 2008; Good, NAR, 2011) [Feedback loop: Utility, Users, Contributors]

32 Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles (Good, NAR, 2011) [Feedback loop: Utility, Users, Contributors] [Chart: editor count and edit count]

33 A review article for every gene is powerful: references to the literature; hyperlinks to related concepts. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002.

34 Making the Gene Wiki more computable: Structured annotations; Free text

35 Filling the gaps in gene annotation. Wikilink: GO exact match. Gene Wiki mapping: NCBI Entrez Gene: 334. Candidate assertion: GO:0006897. Result: 6319 novel GO annotations, 2147 novel DO annotations.
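The mapping on slide 35 (wikilink, then GO exact match, then candidate assertion) could be sketched roughly as below. This is only an illustration under assumptions: the slides do not describe the actual Gene Wiki mining pipeline, so the sample wikitext, the two-entry GO name lookup, and the function names are all hypothetical. The Entrez Gene ID 334 and GO:0006897 come from the slide; the term name "endocytosis" for GO:0006897 is taken from the Gene Ontology.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the link target is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_wikilinks(wikitext):
    """Return the set of wikilink targets in the wikitext, lower-cased."""
    return {m.group(1).strip().lower() for m in WIKILINK.finditer(wikitext)}


def candidate_assertions(entrez_gene_id, wikitext, go_names):
    """Pair the gene with each GO term whose name exactly matches a wikilink target."""
    links = extract_wikilinks(wikitext)
    return sorted((entrez_gene_id, go_names[name]) for name in links if name in go_names)


# Hypothetical inputs for illustration only (not real Gene Wiki page content).
GO_NAMES = {"endocytosis": "GO:0006897", "axon guidance": "GO:0007411"}
SAMPLE_WIKITEXT = "The protein takes part in [[endocytosis]] and binds [[Heparin|heparin]]."

print(candidate_assertions(334, SAMPLE_WIKITEXT, GO_NAMES))
```

Exact matching keeps the sketch simple but misses synonyms and redirects; a candidate assertion produced this way would still need review before being treated as an annotation.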

36 Gene Wiki content improves enrichment analysis. Pipeline: GO term gives gene list; PubMed abstracts plus concept recognition give linked genes; both feed enrichment analysis. Example: axon guidance (GO:0007411), 264 genes, 811 articles, P = 1.55 E-20.
Linked through PubMed / In GO gene list: Yes | No
Yes: 13 | 2
No: 251 | 12033

37 Gene Wiki content improves enrichment analysis. Pipeline: GO term gives gene list; PubMed abstracts + Gene Wiki plus concept recognition give linked genes; both feed enrichment analysis. Example: muscle contraction (GO:0006936), 87 genes. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22 E-09. 251 articles; 87 articles.

38 Gene Wiki content improves enrichment analysis [Scatter plot: p-value (PubMed only) vs. p-value (PubMed + GW); regions: more significant with PubMed + GW, more significant with PubMed only; highlighted point: muscle contraction]
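As a rough check on the enrichment numbers, the sketch below computes a one-sided hypergeometric p-value for the axon guidance example on slide 36. Two assumptions are worth flagging: the slides do not name the statistical test, and the 2×2 counts (13, 2, 251, 12033) are a reading of the garbled table text, chosen because 13 + 251 equals the 264 genes quoted for the GO term. The function and variable names are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(k, K, n, N):
    """P(X >= k): chance of at least k annotated genes among n linked genes,
    when K of the N genes in the background carry the GO term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Counts as read from slide 36 (an assumed reconstruction of the 2x2 table).
linked_annotated = 13        # linked through PubMed, in the GO gene list
linked_other = 2             # linked through PubMed, not in the GO gene list
unlinked_annotated = 251     # not linked, in the GO gene list
unlinked_other = 12033       # not linked, not in the GO gene list

K = linked_annotated + unlinked_annotated   # genes annotated with the GO term
n = linked_annotated + linked_other         # genes linked through PubMed
N = K + linked_other + unlinked_other       # background size

p_value = hypergeom_enrichment_p(linked_annotated, K, n, N)
print(f"P = {p_value:.3g}")
```

Under these assumptions the result lands close to the P = 1.55 E-20 shown on the slide, which suggests that the table reading and the choice of test are at least plausible.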

39 The Long Tail of scientists is a valuable source of information on gene function

40 Can we skip text mining? (image: http://fiehnlab.ucdavis.edu/projects/rice_metabolome/)

41 Wikidata: Provide a database of the world’s knowledge that anyone can edit (Denny Vrandečić)

42 Wikidata understands scale

43 Wikidata understands scale: 14 million Wikidata items… …13 million total genes in Entrez Gene

44 Wikidata understands scale: 27 million Wikidata statements… …150k total GO annotations

45 Wikidata for biology. Reelin (Q414043, http://www.wikidata.org/wiki/Q414043):
is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)
regulates (Property:P128): Neural development (Q1345738)
interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)

46 Wikidata for biology. Properties: Property:P31, Property:P128, Property:P129. Items: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043. API: http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
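A minimal sketch of how the API call on slide 46 might be used from a script, assuming Python and only the standard library. The `action`, `ids`, and `languages` parameters come from the slide's URL; `format=json` is an added assumption so that the response is machine-readable. The helper names are hypothetical, and `SAMPLE_RESPONSE` is a hand-written, heavily trimmed payload in the shape of a `wbgetentities` response, not real output from Wikidata.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "http://wikidata.org/w/api.php"  # endpoint as written on the slide


def build_entity_url(entity_id, language="en"):
    """Build the wbgetentities URL; format=json is an assumption added here."""
    params = {"action": "wbgetentities", "ids": entity_id,
              "languages": language, "format": "json"}
    return API_URL + "?" + urlencode(params)


def fetch_entity(entity_id):
    """Fetch and decode one entity (needs network access; not called below)."""
    with urlopen(build_entity_url(entity_id)) as resp:
        return json.load(resp)


def item_statements(response, entity_id):
    """Return (subject, property, object) triples for item-valued claims."""
    claims = response["entities"][entity_id].get("claims", {})
    triples = []
    for prop, statements in claims.items():
        for statement in statements:
            snak = statement["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # skip "novalue" / "somevalue" snaks
            value = snak["datavalue"]["value"]
            if isinstance(value, dict) and "numeric-id" in value:
                triples.append((entity_id, prop, "Q%d" % value["numeric-id"]))
    return triples


def _item_claim(prop, numeric_id):
    # Helper that builds one illustrative item-valued claim.
    return {"mainsnak": {"snaktype": "value", "property": prop,
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item",
                                                 "numeric-id": numeric_id}}}}


# Illustrative payload only; the property/item pairs follow slide 45.
SAMPLE_RESPONSE = {"entities": {"Q414043": {
    "labels": {"en": {"language": "en", "value": "Reelin"}},
    "claims": {"P31": [_item_claim("P31", 8054)],
               "P128": [_item_claim("P128", 1345738)]}}}}

print(build_entity_url("Q414043"))
print(item_statements(SAMPLE_RESPONSE, "Q414043"))
```

With network access, `item_statements(fetch_entity("Q414043"), "Q414043")` would be the live equivalent. The endpoint may redirect or expect a descriptive User-Agent header, which this sketch does not set.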

47 Increasing biological data in Wikidata: http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force

48 Loading genomic data into Wikidata: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq

49 Wikidata gene model: added ~1000 human genes so far…

50 Wikidata as CMOD? [Diagram: CMOD]

51 Wikidata as CMOD? [Diagram: CMOD; Powered by: CMOD]

52 The Long Tail of bioinformaticians can collaboratively build a Centralized Model Organism Database (CMOD).

53 Gene Wiki collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim; many Wikipedia editors; WP:MCB Project
Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Tobias Meissner, Max Nanis, Chunlei Wu
Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco
Funding and Support: BioGPS: GM83924, Gene Wiki: GM089820
Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su
Recruiting for student, postdoc, outreach, and/or staff positions!

