
Text Mining Applications for Literature Curation Kimberly Van Auken WormBase Consortium Textpresso Gene Ontology Consortium.


1 Text Mining Applications for Literature Curation Kimberly Van Auken WormBase Consortium Textpresso Gene Ontology Consortium

2 WormBase: A Database for C. elegans and Other Nematodes www.wormbase.org

3 Curating Diverse Data Types Which worms aggregate with other worms and what contributes to that behavior? Aggregation Behavior Bendesky et al., 2012, PLoS Genetics

4 Curating Diverse Data Types Which worms (Strain) aggregate with other worms and what contributes to that behavior? Aggregation Behavior Bendesky et al., 2012, PLoS Genetics

5 Curating Diverse Data Types Which worms (Strain) aggregate with other worms and what contributes to that behavior? Aggregation Behavior Bendesky et al., 2012, PLoS Genetics Strain information: August 1, 1972 Pineapple field in Hawaii

6 Curating Diverse Data Types Which worms aggregate with other worms (Phenotype) and what contributes to the behavior? Aggregation Behavior Bendesky et al., 2012, PLoS Genetics

7 Curating Diverse Data Types Which worms aggregate with other worms (Phenotype) and what contributes to that behavior? Aggregation Behavior Bendesky et al., 2012, PLoS Genetics Worm Phenotype Ontology (WPO): Bordering (WBPhenotype:0001820) Life stage ontology, e.g., L3 larval stage Assay, e.g., food source

8 Curating Diverse Data Types Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)? Aggregation Behavior Bendesky et al., 2012, PLoS Genetics

9 Curating Diverse Data Types Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)? Aggregation Behavior Bendesky et al., 2012, PLoS Genetics Gene: npr-1 Variation: ad609 (T(83)->I and T(144)->A) Gene Ontology for npr-1: Biological Process: feeding behavior Molecular Function: neuropeptide receptor activity Cellular Component: integral to plasma membrane

10 Literature Curation Workflow PubMed keyword search – ‘elegans’ Full text paper acquisition Data type flagging and entity recognition Detailed curation/Fact extraction

11 Finding Papers: Daily, automated PubMed searches using keyword ‘elegans’ PMID, Title, Authors, Abstract, Article type, Journal, Curator actions Download citation XML
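A minimal sketch of how such a daily keyword search could be expressed as a request URL. The slides do not say which interface the pipeline uses; the NCBI E-utilities `esearch` endpoint is assumed here, and the function name and default values are hypothetical. The sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: the search goes through NCBI E-utilities (not stated in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_daily_search_url(keyword="elegans", days_back=1, max_results=200):
    """Build an esearch URL for PubMed records added in the last `days_back` days."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": keyword,        # the keyword from the slide
        "reldate": days_back,   # restrict to the most recent N days
        "datetype": "edat",     # ...measured by Entrez date
        "retmax": max_results,  # hypothetical cap on returned PMIDs
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_daily_search_url()
print(url)
```

The returned PMIDs would then be used to download citation XML for the fields listed on the slide; that step is omitted here.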

12 Literature Curation Workflow – Full Text Acquisition Fully manual step Done for all papers we select Electronic copies stored in curation database

13 Data Type Flagging/Triage: General classification of papers What types of experiments are in a paper? e.g., RNAi phenotypes, Variation phenotypes, Expression patterns, Physical interactions

14 Data Type Flagging Methods Main pipeline: Support Vector Machines (SVMs) Other methods: Textpresso category searches, hidden Markov models, pattern matching scripts

15 Support Vector Machines: Document Classification Machine learning models Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments) Positives: 100s, Negatives: 1000s Resulting model classifies all new papers as negative or positive (high, medium, low confidence)
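The training and classification steps on this slide can be sketched as a toy linear SVM. This is an illustration only, not the WormBase implementation: it assumes bag-of-words features, trains by subgradient descent on the L2-regularized hinge loss, omits the bias term for brevity, and uses a four-document training set and confidence thresholds that are entirely hypothetical (the real gold standard sets contain hundreds of positives and thousands of negatives).

```python
from collections import Counter


def featurize(text):
    """Bag-of-words counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def score(weights, features):
    """Linear decision value w . x (no bias term, to keep the sketch short)."""
    return sum(weights.get(tok, 0.0) * n for tok, n in features.items())


def train_linear_svm(examples, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss.

    examples: list of (text, label) pairs with label +1 or -1.
    """
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            x = featurize(text)
            margin = label * score(weights, x)
            # Shrink existing weights (gradient step on the L2 penalty).
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # The hinge loss contributes only when the margin is below 1.
            if margin < 1.0:
                for tok, n in x.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * label * n
    return weights


def classify(weights, text, high=1.0, medium=0.5):
    """Return (label, confidence); the thresholds are hypothetical."""
    s = score(weights, featurize(text))
    label = "positive" if s > 0 else "negative"
    if abs(s) >= high:
        confidence = "high"
    elif abs(s) >= medium:
        confidence = "medium"
    else:
        confidence = "low"
    return label, confidence


# Hypothetical gold standard: +1 = paper reports RNAi experiments, -1 = it does not.
TRAINING = [
    ("rnai knockdown phenotype", +1),
    ("rnai feeding caused lethality", +1),
    ("crystal structure of protein", -1),
    ("genome sequence assembly", -1),
]

model = train_linear_svm(TRAINING)
print(classify(model, "rnai phenotype observed"))
print(classify(model, "protein structure"))
```

Binning the distance from the decision boundary is one simple way to obtain high, medium and low confidence calls; the slides do not say how the production classifiers derive theirs.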

16 Data Type Flagging – Support Vector Machines SVMs trained for ten different data types: Antibody, Genetic Interactions, Physical Interactions, Gene Expression, Regulation of Gene Expression, Variation Phenotypes, Overexpression Phenotypes, RNAi Phenotypes, Variation Sequence Change, Gene Structure Correction See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16

17 Curation from Support Vector Machine Results SVM results lead directly to manual curation: e.g. RNAi Phenotypes Results from SVMs are processed further e.g. Variation Sequence Change Pattern Matching Script – regular expressions New variations (entity recognition) e.g. mg366, ju43, e1360
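A pattern matching step of this kind might look like the sketch below. The actual regular expressions used by the WormBase scripts are not given in the slides; the pattern here assumes that variation names consist of one to three lowercase letters followed by digits, as in the examples on the slide, and the list of already-known variations is hypothetical.

```python
import re

# Assumed shape of a variation name: 1-3 lowercase letters, then digits.
VARIATION_PATTERN = re.compile(r"\b[a-z]{1,3}\d+\b")


def find_new_variations(text, known_variations):
    """Return candidate variation names not yet in the database, in order of first mention."""
    seen = set()
    new = []
    for name in VARIATION_PATTERN.findall(text):
        if name not in known_variations and name not in seen:
            seen.add(name)
            new.append(name)
    return new


# Hypothetical input: a sentence from a paper flagged by the SVM.
text = ("We isolated mg366 and ju43, and compared them with e1360 "
        "and the known allele ad609.")
known = {"ad609"}  # hypothetical set of variations already curated

print(find_new_variations(text, known))
```

A pattern this loose will also match unrelated tokens such as figure labels or protein names, which is one reason to run it only on papers the classifier has already flagged and to leave the final decision to a curator.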

18 Data Type Flagging – Textpresso www.textpresso.org C. elegans, Mouse, D. melanogaster, Neuroscience, Arabidopsis, Dicty, Wnt Pathway, HIV, Nematodes, S. cerevisiae, RegulonDB, ... many others Full text of articles Terms, phrases, entities – semantically tagged Keyword or category search Match within sentence or entire paper

19 Textpresso Categories Pre-existing dictionaries, vocabularies: Gene names, ChEBI (Chemical Entities of Biological Interest), PATO, Sequence Ontology (SO) Manually constructed by curators using language from published literature: Sequence similarity – orthologous, conserved Localization assays – GFP, antibody, fluorescence Experimental verbs – required, regulates, exhibits

20 Data Type Flagging - Textpresso Category Searches Data Type: C. elegans Human Disease Homologs Three-category Textpresso search: (1) C. elegans gene; (2) 'Ortholog', 'Homolog', 'Similar', 'Model'; (3) Human disease. Example match: "We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase."
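The idea behind a three-category, sentence-level search can be sketched as follows. This is not Textpresso's implementation or query syntax: the category dictionaries are tiny hypothetical stand-ins for its gene, similarity and disease categories, and the tokenizer is deliberately naive.

```python
import re

# Hypothetical miniature category dictionaries (lowercase terms).
CATEGORIES = {
    "c_elegans_gene": {"scd-2", "npr-1"},
    "similarity": {"ortholog", "homolog", "similar", "model"},
    "human_disease": {"lymphoma", "diabetes"},
}


def tokenize(sentence):
    """Lowercase tokens made of letters, digits and hyphens."""
    return set(re.findall(r"[a-z0-9-]+", sentence.lower()))


def matches_all_categories(sentence, categories=CATEGORIES):
    """True if the sentence contains at least one term from every category."""
    tokens = tokenize(sentence)
    return all(tokens & terms for terms in categories.values())


sentence = ("We map this defect in dauer response to a mutation in the scd-2 gene, "
            "which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) "
            "homolog, a proto-oncogene receptor tyrosine kinase.")

print(matches_all_categories(sentence))
print(matches_all_categories("scd-2 mutants fail to form dauers."))
```

Requiring all three categories within one sentence, rather than anywhere in the paper, corresponds to the "match within sentence" option and trades some recall for precision.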

21 Literature Curation Workflow PubMed keyword search – ‘elegans’ Full text paper acquisition Data type flagging and entity recognition Detailed curation/Fact extraction

22 Textpresso: Semi-Automated Fact Extraction Genetic Interactions Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its ability to dominantly (but weakly) suppress sep-1 (e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1). Physical Interactions – after SVM document classifier Remarkably, only AIN-1 coimmunoprecipitated HA-tagged Ce PAB-1 (Figure 3A and B, lane 7). Gene Ontology – Cellular Component Curation During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N).

23 Textpresso: Semi-Automated GO Cellular Component Curation Textpresso Search Results Suggested GO Annotations Gene Products Textpresso Component See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.
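A rough sketch of how search results might be turned into suggested annotations for curator review. This is an assumption-laden illustration, not the published method: the gene product list and the term-to-GO mapping are small hypothetical dictionaries, and matching is plain co-occurrence within a sentence.

```python
import re

# Hypothetical dictionaries standing in for the gene product and component categories.
GENE_PRODUCTS = {"PAN-1"}
COMPONENT_TERMS = {
    "cytoplasm": "GO:0005737",
    "nucleus": "GO:0005634",
}


def suggest_annotations(sentence):
    """Return (gene product, component term, GO ID) triples that co-occur in a sentence."""
    suggestions = []
    lowered = sentence.lower()
    for product in sorted(GENE_PRODUCTS):
        # Protein names are matched case-sensitively (PAN-1 the protein, not pan-1 the gene).
        if not re.search(rf"\b{re.escape(product)}\b", sentence):
            continue
        for term, go_id in COMPONENT_TERMS.items():
            if re.search(rf"\b{re.escape(term)}\b", lowered):
                suggestions.append((product, term, go_id))
    return suggestions


sentence = ("During embryogenesis, PAN-1 protein is uniformly distributed throughout "
            "the cytoplasm of the germline and somatic blastomeres.")

print(suggest_annotations(sentence))
```

Co-occurrence alone cannot tell a positive statement from a negated one (for example, "no obvious concentration of PAN-1 in the P granules"), so each suggestion is a prompt for a curator rather than a finished annotation.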

24 Future Directions Textpresso, other methods (HMMs) applied to additional data types e.g. GO Biological Process curation (Phenotypes) Focusing triage and fact extraction on novel findings How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results? e.g. Commonly used molecular markers

25 Literature Annotation Tool – Tracking Evidence WB, GO Common Annotation Framework, BioCreative

26 Summary Text Mining Applications for Literature Curation: Paper approval and full text acquisition Data type flagging and entity recognition Fact extraction – record evidence All steps of our pipeline incorporate some form of semi- or fully automated approaches: Scripts for downloads, pattern matching Support Vector Machines for document classification Textpresso for flagging and fact extraction (Hidden Markov Models for flagging, fact extraction)

27 The WormBase Consortium, Textpresso WormBase - Caltech Paul Sternberg Juancarlos Chan Wen Chen Chris Grove Ranjana Kishore Raymond Lee Cecilia Nakamura Daniela Raciti Gary Schindelman Kimberly Van Auken Daniel Wang Xiaodong Wang Karen Yook Former member: Ruihua Fang Textpresso - Caltech Hans-Michael Muller Yuling Li James Done Former member: Arun Rangarajan WormBase – OICR, Toronto Lincoln Stein Abigail Cabunoc Todd Harris JD Wong WormBase – Washington University John Spieth Tamberlyn Bieri Phil Ozersky WormBase – EBI, Sanger, Hinxton, UK Richard Durbin Paul Kersey Matt Berriman Paul Davis Michael Paulini Kevin Howe Mary Ann Tuli Gary Williams CGC – Oxford University, Oxford, UK Jonathan Hodgkin

28 Hidden Markov Models: Semi-Automated GO Molecular Function Curation For each sentence, HMM yields: True positive score False positive score For each sentence, curator assigns: Fully curatable (entity + indication of enzymatic activity) Positive (experiment was performed, result but no entity) False Positive (not about enzymatic activity at all)

