Published by Caitlin Smith. Modified over 9 years ago.
1
Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation
SAB 2008
WormBase Literature Curators
Textpresso
2
How does data get into WormBase?
  User submission (email, web forms)
  First-pass curation
  Example user submission:
    Institution: Sanger Institute
    SUBMITTED FROM PAGE: http://www.wormbase.org/db/seq/gbrowse/elegans/
    COMMENT TEXT: Dear WormBase, I think that WormBase may be missing a gene between Y50E8A.6 and Y50E8A.7......
3
Current first-pass curation pipeline
  Publication
  Flagging/Triage
  Curation
4
User submissions: first-pass flagging/triage
  Growing desire among biocurators for user submissions
  The authors are the first people to know what data are in a paper
  TAIR: partnered with Plant Physiology
    web interface for data submission (February 2008)
    voluntary, link included in acceptance letter
    form fields:
      Submitter email
      Paper identifier
      Locus name
      Term/descriptor, method
5
User-submitted first-pass flags - WormBase
6
User data-submission forms: Expression Pattern
7
Data extraction: Textpresso
  Full-text searching
  Keywords and/or categories
  Müller, Kenny, and Sternberg. PLoS Biology, November, 2004.
8
Textpresso: What data types?
  Paper – entity association: pattern matching
    Transgenes (Wen): WBPaper00031242 – gqIs3, gqIs35, oxIs12
  Fact extraction: specialized categories
    Genetic interactions (Andrei): eor-2(op166) suppresses HSN death in the strong tra-1(e1099) background, but not noticeably in the weaker tra-1(e1076) background.
    GO cellular component curation (Kimberly): ...positions of these neurons are indicated with circles and localizations of GAR-3::YFP on the cell membranes are denoted by arrows.
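The paper-entity association by pattern matching can be sketched as follows. This is a minimal illustration, not Textpresso's actual transgene category: the regular expression (a short lowercase lab prefix, then "Is" or "Ex", then a number), the function name find_transgenes, and the sample sentence are all assumptions made for the example. Only the paper ID and the three transgene names come from the slide.

```python
import re

# Assumed pattern: 1-3 lowercase letters (lab prefix), then "Is" (integrated)
# or "Ex" (extrachromosomal array), then a number. Illustrative only.
TRANSGENE_RE = re.compile(r"\b[a-z]{1,3}(?:Is|Ex)\d+\b")


def find_transgenes(paper_id, text):
    """Return sorted, de-duplicated (paper_id, transgene) pairs found in text."""
    names = sorted(set(TRANSGENE_RE.findall(text)))
    return [(paper_id, name) for name in names]


# Hypothetical sentence; the paper ID and transgene names are from the slide.
sample = "Strains carrying gqIs3, gqIs35 or oxIs12 were imaged; gqIs3 was outcrossed."
pairs = find_transgenes("WBPaper00031242", sample)
for paper, transgene in pairs:
    print(paper, transgene)
```

A real pipeline would also need to handle names that merely resemble transgenes, which is presumably why the matches were checked manually.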
9
Textpresso-mediated CC curation: from sentences to annotations
10
Textpresso: How much data?
  Transgenes:
    1,100 new paper-transgene connections
    250 new transgenes
    checked manually – 95% accuracy
    ultimately, connections will go directly into database
  Genetic Interactions:
    1,875 (1/2007 – 5/2008)
    ~5,600 total interactions
    keeping current with new papers
  GO Cellular Component Annotations:
    515 (1/2007 – 5/2008)
    2-3X rate prior to categories
    nearly complete
    keeping up with new data (1-2 hours/week)
11
Textpresso: Other data types
How else can we use Textpresso?
  Other data types: Molecular Function Assays, Gene Product Interactions
  Pilot: GO molecular function annotations for protein kinase activity
    keyword: phosphorylate
    category: C. elegans proteins
    13 new GO annotations/hour
    Extension of this: protein modifications – not yet captured in WB
  Pilot: Gene product interactions for WB and BIND
    keywords: physically interact
    category: C. elegans proteins
    310 matches in 237 documents
    22 physical interactions – top 15 papers
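A keyword-plus-category query of the kind used in these pilots can be approximated as a sentence-level co-occurrence search. The sketch below is an assumption-laden stand-in, not Textpresso's query engine: the "C. elegans proteins" category is replaced by a simple regular expression for upper-case names such as GAR-3, the sentence splitter is naive, and the protein names ABC-1 and XYZ-2 in the sample text are placeholders.

```python
import re

# Assumption: a stand-in for the "C. elegans proteins" category that matches
# upper-case protein-style names such as GAR-3.
PROTEIN_RE = re.compile(r"\b[A-Z]{2,4}-\d+\b")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def search(text, keyword):
    """Return (sentence, proteins) for sentences containing the keyword
    (case-insensitive substring) and at least one category match."""
    hits = []
    for sentence in split_sentences(text):
        proteins = PROTEIN_RE.findall(sentence)
        if keyword.lower() in sentence.lower() and proteins:
            hits.append((sentence, proteins))
    return hits


# Hypothetical text with placeholder protein names.
text = ("ABC-1 phosphorylates XYZ-2 in vitro. "
        "Phosphorylation was measured by autoradiography. "
        "ABC-1 is expressed in neurons.")
hits = search(text, "phosphorylate")
for sentence, proteins in hits:
    print(proteins, "|", sentence)
```

Only the first sentence is returned: it is the only one that contains both the keyword and a category match. A curator would then review each hit before making an annotation.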
12
Textpresso for triage: Classifying text based on content
  Multiple strategies (using existing first-pass papers as training set):
    Organismal triage – C. elegans, Drosophila
    Identify, prioritize information-rich papers
    Flag for specific data types
  Multiple levels:
    Machine learning – SVM (Support Vector Machine)
    Word frequency analysis
    Hand-crafted categories
    Combine SVM and categories
    Supplement with word weighting, contextual analyses
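Of the levels listed above, word frequency analysis is the simplest to illustrate. The sketch below weights each word by a smoothed log-ratio of its frequency in flagged versus unflagged training papers, then scores new text by summing the weights. It is not an SVM and not the curators' actual method: the function names, the add-one smoothing, and the toy training sentences are all assumptions for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_word_weights(positive_docs, negative_docs):
    """Weight each word by the add-one-smoothed log-ratio of its relative
    frequency in flagged (positive) vs. unflagged (negative) documents."""
    pos = Counter(w for d in positive_docs for w in tokenize(d))
    neg = Counter(w for d in negative_docs for w in tokenize(d))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(text, weights):
    """Sum the word weights; words unseen in training contribute 0."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))


# Hypothetical training set: text from papers flagged (or not) for
# genetic interactions during first-pass curation.
flagged = ["the mutation suppresses the phenotype",
           "double mutant enhances the defect"]
unflagged = ["the protein localizes to the membrane",
             "expression was observed in neurons"]
weights = train_word_weights(flagged, unflagged)

print(score("this allele suppresses the defect", weights) > 0)
print(score("expression in the membrane", weights) < 0)
```

A positive score suggests flagging the paper for the data type. In practice the threshold would be tuned on held-out first-pass papers, and the scores could be combined with category matches as the slide suggests.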
13
Keeping better track of curation statistics...
14
...and making curation statistics more transparent to users.
  Users could search for curation status of any paper
  Users could search for curation status of a given data type
  Each database release would report newly curated papers
  Each database release would document increases in data-type curation
15
WormBase Literature Curation
  Gene Symbols, Alleles, Sequence Features, Mapping Data:
    Mary Ann Tuli, Sanger
  Gene Function: Concise Descriptions, Gene Ontology:
    Ranjana Kishore, Caltech
    Erich Schwarz, Caltech
    Kimberly Van Auken, Caltech
  Mutant Phenotypes (RNAi and Alleles):
    Igor Antoshechkin, Caltech
    Jolene Fernandez, Caltech
    Raymond Lee, Caltech
    Gary Shindelman, Caltech
    Karen Yook, Caltech
  First Pass, Genetic Interactions:
    Andrei Petcherski, Caltech
  Gene Regulation, PWMs:
    Xiaodong Wang, Caltech
    Erich Schwarz, Caltech
  Expression Patterns, Antibodies, Transgenes:
    Wen Chen, Caltech
  Anatomy Ontology, Cell Function:
    Raymond Lee, Caltech
  Microarrays, SAGE:
    Igor Antoshechkin, Caltech
  Sequence, Gene Structures:
    Sanger, Wash U
  Authors, Papers:
    Cecilia Nakamura, Daniel Wang
  Curation Tools, Database:
    Juancarlos Chan, Caltech