Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation SAB 2008 WormBase Literature Curators Textpresso.


1 Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation SAB 2008 WormBase Literature Curators Textpresso

2 How does data get into WormBase?
• User submission (email, web forms)
• First-pass curation
Example user submission:
Institution: Sanger Institute
SUBMITTED FROM PAGE: http://www.wormbase.org/db/seq/gbrowse/elegans/
COMMENT TEXT: Dear WormBase, I think that WormBase may be missing a gene between Y50E8A.6 and Y50E8A.7......

3 Current first-pass curation pipeline
Publication; Flagging/Triage; Curation

4 User submissions: first-pass flagging/triage
• Growing desire among biocurators for user submissions
• The authors are the first people to know what data are in a paper
• TAIR partnered with Plant Physiology on a web interface for data submission (February 2008); submission is voluntary, and a link is included in the acceptance letter
  Submitter email, paper identifier, locus name, term/descriptor, method

5 User-submitted first-pass flags - WormBase

6 User data-submission forms: Expression Pattern

7 Data extraction: Textpresso
Full-text searching by keywords and/or categories
Müller, Kenny, and Sternberg. PLoS Biology, November 2004.

8 Textpresso: What data types?
• Paper-entity association: pattern matching
  Transgenes (Wen): WBPaper00031242: gqIs3, gqIs35, oxIs12
• Fact extraction: specialized categories
  Genetic interactions (Andrei): eor-2(op166) suppresses HSN death in the strong tra-1(e1099) background, but not noticeably in the weaker tra-1(e1076) background.
  GO cellular component curation (Kimberly): ...positions of these neurons are indicated with circles and localizations of GAR-3::YFP on the cell membranes are denoted by arrows.
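
The paper-entity association on this slide relies on pattern matching over full text. A minimal sketch of that idea follows, assuming transgene names consist of a short lowercase laboratory prefix, then `Is` or `Ex`, then a number (as in gqIs3 and oxIs12 above). The regular expression, the function name `find_transgenes`, and the sample sentence are illustrative assumptions, not Textpresso's actual implementation.

```python
import re

# Assumed naming pattern: 1-3 lowercase letters, then "Is" or "Ex", then digits.
# This is an illustration only, not the pattern Textpresso uses.
TRANSGENE_RE = re.compile(r"\b[a-z]{1,3}(?:Is|Ex)\d+\b")


def find_transgenes(paper_id, text):
    """Return sorted, de-duplicated (paper_id, transgene) pairs found in text."""
    names = sorted(set(TRANSGENE_RE.findall(text)))
    return [(paper_id, name) for name in names]


# Hypothetical sentence; only the paper ID and transgene names come from the slide.
sample = "Strains carrying gqIs3 or gqIs35 were crossed to oxIs12; gqIs3 was outcrossed."
pairs = find_transgenes("WBPaper00031242", sample)
for pair in pairs:
    print(pair)
```

Each pair is a candidate paper-transgene connection that a curator could still check by hand before it enters the database.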

9 Textpresso-mediated GO cellular component (CC) curation: from sentences to annotations

10 Textpresso: How much data?
Transgenes:
• 1,100 new paper-transgene connections
• 250 new transgenes checked manually – 95% accuracy
• Ultimately, connections will go directly into the database
Genetic Interactions:
• 1,875 (1/2007 – 5/2008)
• ~5,600 total interactions
• Keeping current with new papers
GO Cellular Component Annotations:
• 515 (1/2007 – 5/2008)
• 2-3X the rate prior to categories
• Nearly complete
• Keeping up with new data (1-2 hours/week)

11 Textpresso: Other data types
How else can we use Textpresso? Other data types: Molecular Function Assays, Gene Product Interactions
Pilot: GO molecular function annotations for protein kinase activity
• keyword: phosphorylate
• category: C. elegans proteins
• 13 new GO annotations/hour
• Extension of this: protein modifications (not yet captured in WB)
Pilot: Gene product interactions for WB and BIND
• keywords: physically interact
• category: C. elegans proteins
• 310 matches in 237 documents
• 22 physical interactions (top 15 papers)
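
Both pilots combine a keyword with a category. A rough sketch of that kind of query is below: it keeps sentences that contain the keyword and at least one term from a category lexicon. The naive sentence splitter, the function names, and the lexicon are assumptions for illustration. EXA-1 and EXB-2 are made-up protein names, and the code is not Textpresso's query engine.

```python
import re

# Placeholder category: a real "C. elegans proteins" category would be a
# curated lexicon. EXA-1 and EXB-2 are invented names; GAR-3 is from slide 8.
PROTEIN_CATEGORY = {"EXA-1", "EXB-2", "GAR-3"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def match_sentences(text, keyword, category):
    """Return (sentence, proteins) for sentences with the keyword and a category term."""
    hits = []
    for sentence in split_sentences(text):
        if keyword.lower() not in sentence.lower():
            continue
        # Candidate protein names look like LETTERS-NUMBER in this sketch.
        tokens = set(re.findall(r"[A-Za-z]+-\d+", sentence))
        proteins = sorted(tokens & category)
        if proteins:
            hits.append((sentence, proteins))
    return hits


# Hypothetical text, written only to exercise the matcher.
doc = (
    "EXA-1 was found to phosphorylate EXB-2 in vitro. "
    "GAR-3 localizes to the cell membrane. "
    "Kinases phosphorylate many substrates."
)
hits = match_sentences(doc, "phosphorylate", PROTEIN_CATEGORY)
for sentence, proteins in hits:
    print(proteins, "|", sentence)
```

Only the first sentence is returned: the second lacks the keyword and the third lacks a category term, which is the filtering that makes the matched sentences worth a curator's time.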

12 Textpresso for triage: Classifying text based on content
• Multiple strategies (using existing first-pass papers as training set):
  - Organismal triage: C. elegans, Drosophila
  - Identify, prioritize information-rich papers
  - Flag for specific data types
• Multiple levels:
  - Machine learning: SVM (Support Vector Machine), word frequency analysis
  - Hand-crafted categories
  - Combine SVM and categories
  - Supplement with word weighting, contextual analyses
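
The word frequency and word weighting ideas on this slide can be sketched with a simple log-odds scorer trained on papers that were already flagged during first-pass curation. This is not an SVM and not the WormBase pipeline. The function names, the add-one smoothing, and the toy training sentences are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokens, keeping purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def train_word_weights(positive_docs, negative_docs):
    """Log-odds weight per word, with add-one smoothing over the shared vocabulary."""
    pos = Counter(w for d in positive_docs for w in tokenize(d))
    neg = Counter(w for d in negative_docs for w in tokenize(d))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(text, weights):
    """Sum of word weights; a positive score suggests flagging the paper."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))


# Toy training set standing in for first-pass-curated papers (hypothetical text).
flagged = ["mutant suppresses lethality in double mutant", "allele enhances mutant phenotype"]
unflagged = ["protein structure solved by crystallography", "structure of the protein complex"]
weights = train_word_weights(flagged, unflagged)

new_paper_score = score("the double mutant phenotype", weights)
other_score = score("crystallography of the protein", weights)
print(new_paper_score > 0, other_score > 0)
```

A scorer like this could supply the word-weighting supplement, while an SVM or hand-crafted categories make the main call for each data type.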

13 Keeping better track of curation statistics...

14 ...and making curation statistics more transparent to users.
• Users could search for curation status of any paper
• Users could search for curation status of a given data type
• Each database release would report newly curated papers
• Each database release would document increases in data-type curation
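
The lookups proposed on this slide could rest on one flat record per curation event. The sketch below assumes a hypothetical record layout of paper, data type, and release. The release labels and the second paper ID are invented, and nothing here reflects the actual WormBase schema.

```python
from collections import defaultdict

# Hypothetical records: (paper_id, data_type, release in which it was curated).
# Only WBPaper00031242 and its transgene link come from the slides.
RECORDS = [
    ("WBPaper00031242", "transgene", "R1"),
    ("WBPaper_EXAMPLE", "genetic_interaction", "R1"),
    ("WBPaper_EXAMPLE", "go_cellular_component", "R2"),
]


def status_for_paper(records, paper_id):
    """Data types already curated for one paper."""
    return sorted({dtype for pid, dtype, _ in records if pid == paper_id})


def papers_for_data_type(records, data_type):
    """Papers curated for one data type."""
    return sorted({pid for pid, dtype, _ in records if dtype == data_type})


def release_report(records, release):
    """Papers with new curation in a release, plus per-data-type counts."""
    papers = sorted({pid for pid, _, rel in records if rel == release})
    counts = defaultdict(int)
    for _, dtype, rel in records:
        if rel == release:
            counts[dtype] += 1
    return papers, dict(counts)


print(status_for_paper(RECORDS, "WBPaper_EXAMPLE"))
print(papers_for_data_type(RECORDS, "transgene"))
print(release_report(RECORDS, "R2"))
```

The three functions map onto the four bullets: search by paper, search by data type, and a per-release report that covers both newly curated papers and data-type increases.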

15 WormBase Literature Curation
Gene Symbols, Alleles, Sequence Features, Mapping Data: Mary Ann Tuli, Sanger
Gene Function: Concise Descriptions, Gene Ontology: Ranjana Kishore, Caltech; Erich Schwarz, Caltech; Kimberly Van Auken, Caltech
Mutant Phenotypes (RNAi and Alleles): Igor Antoshechkin, Caltech; Jolene Fernandez, Caltech; Raymond Lee, Caltech; Gary Shindelman, Caltech; Karen Yook, Caltech
First Pass, Genetic Interactions: Andrei Petcherski, Caltech
Gene Regulation, PWMs: Xiaodong Wang, Caltech; Erich Schwarz, Caltech
Expression Patterns, Antibodies, Transgenes: Wen Chen, Caltech
Anatomy Ontology, Cell Function: Raymond Lee, Caltech
Microarrays, SAGE: Igor Antoshechkin, Caltech
Sequence, Gene Structures: Sanger, Wash U
Authors, Papers: Cecilia Nakamura, Daniel Wang
Curation Tools, Database: Juancarlos Chan, Caltech

