Text Mining Applications for Literature Curation. Kimberly Van Auken: WormBase Consortium, Textpresso, Gene Ontology Consortium.

WormBase: A Database for C. elegans and Other Nematodes

Curating Diverse Data Types. Which worms aggregate with other worms, and what contributes to that behavior? Aggregation Behavior (Bendesky et al., 2012, PLoS Genetics).

Curating Diverse Data Types. Which worms (Strain) aggregate with other worms, and what contributes to that behavior? Aggregation Behavior (Bendesky et al., 2012, PLoS Genetics).

Curating Diverse Data Types. Which worms (Strain) aggregate with other worms, and what contributes to that behavior? Aggregation Behavior (Bendesky et al., 2012, PLoS Genetics). Strain information: August 1, 1972, pineapple field in Hawaii.

Curating Diverse Data Types. Which worms aggregate with other worms (Phenotype), and what contributes to that behavior? Aggregation Behavior (Bendesky et al., 2012, PLoS Genetics).

Curating Diverse Data Types. Which worms aggregate with other worms (Phenotype), and what contributes to that behavior? Aggregation Behavior (Bendesky et al., 2012, PLoS Genetics). Worm Phenotype Ontology (WPO): Bordering (WBPhenotype: ). Life stage ontology, e.g., L3 larval stage. Assay, e.g., food source.

Curating Diverse Data Types. Which worms (Strain) aggregate with other worms (Phenotype), and what contributes to that behavior (Molecular Basis)? Aggregation Behavior (Bendesky et al., 2012, PLoS Genetics).

Curating Diverse Data Types. Which worms (Strain) aggregate with other worms (Phenotype), and what contributes to that behavior (Molecular Basis)? Aggregation Behavior (Bendesky et al., 2012, PLoS Genetics). Gene: npr-1. Variation: ad609 (T(83)->I and T(144)->A). Gene Ontology for npr-1: Biological Process: feeding behavior; Molecular Function: neuropeptide receptor activity; Cellular Component: integral to plasma membrane.

Literature Curation Workflow: (1) PubMed keyword search ('elegans'); (2) Full text paper acquisition; (3) Data type flagging and entity recognition; (4) Detailed curation/Fact extraction.

Finding Papers: Daily, automated PubMed searches using keyword 'elegans'. PMID, Title, Authors, Abstract, Article type, Journal, Curator actions. Download citation XML.
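The daily keyword search can be sketched against NCBI's public E-utilities `esearch` endpoint. This is a minimal illustration, not the WormBase download script: the function names, the one-day window, the `retmax` value, and the sample response with placeholder PMIDs are all assumptions. The network call itself is left out so the sketch runs offline.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_daily_query(keyword="elegans", days=1, retmax=500):
    """Build an esearch URL for PubMed records added in the last `days` days."""
    params = {
        "db": "pubmed",
        "term": keyword,
        "reldate": days,      # look back this many days
        "datetype": "edat",   # filter on the Entrez date
        "retmax": retmax,
        "retmode": "xml",
    }
    return ESEARCH + "?" + urlencode(params)


def parse_pmids(xml_text):
    """Extract the PMIDs from an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


# Hypothetical response with placeholder PMIDs, used instead of a live request.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>11111111</Id><Id>22222222</Id></IdList>
</eSearchResult>"""

url = build_daily_query()
pmids = parse_pmids(SAMPLE_RESPONSE)
print(url)
print(pmids)
```

In a real pipeline the URL would be fetched on a schedule and each PMID passed on for curator review; that scheduling and review step is not shown here.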

Literature Curation Workflow – Full Text Acquisition: Fully manual step. Done for all papers we select. Electronic copies stored in curation database.

Data Type Flagging/Triage: General classification of papers. What types of experiments are in a paper? e.g., RNAi phenotypes, Variation phenotypes, Expression patterns, Physical interactions.

Data Type Flagging Methods. Main pipeline: Support Vector Machines (SVMs). Other methods: Textpresso category searches, hidden Markov models, pattern matching scripts.

Support Vector Machines: Document Classification. Machine learning models. Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments). Positives: 100s; Negatives: 1000s. Resulting model classifies all new papers as negative or positive (high, medium, or low confidence).
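A minimal sketch of this kind of document classifier, assuming a bag-of-words linear SVM trained by sub-gradient descent on the hinge loss. The toy training sentences, the hyperparameters, and the score thresholds used for the high/medium/low confidence bins are illustrative assumptions, not the WormBase models or their settings.

```python
def tokenize(text):
    return text.lower().split()


def vectorize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            counts[vocab[tok]] += 1.0
    return counts


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, epochs=200, lr=0.05, lam=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss (labels are +1/-1)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]   # regularization shrink
            if margin < 1.0:                           # inside margin: hinge update
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def confidence_label(score, low=0.5, high=1.0):
    """Bin a decision score; the thresholds are hypothetical."""
    if score <= 0:
        return "negative"
    if score < low:
        return "positive (low)"
    if score < high:
        return "positive (medium)"
    return "positive (high)"


# Toy gold-standard sets (invented): papers with / without RNAi experiments.
POSITIVES = ["rnai knockdown caused embryonic lethality",
             "animals fed rnai bacteria were sterile"]
NEGATIVES = ["crystal structure of the receptor",
             "genome assembly and annotation pipeline",
             "gfp reporter expression in neurons"]

docs = POSITIVES + NEGATIVES
labels = [1] * len(POSITIVES) + [-1] * len(NEGATIVES)
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in tokenize(d)}))}
w, b = train_linear_svm([vectorize(d, vocab) for d in docs], labels)


def score(text):
    return dot(w, vectorize(text, vocab)) + b


for d in docs:
    print(confidence_label(score(d)), "|", d)
```

Real gold-standard sets hold hundreds of positives and thousands of negatives, so a production classifier would also need to handle that class imbalance, which this sketch ignores.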

Data Type Flagging – Support Vector Machines. SVMs trained for ten different data types: Antibody, Genetic Interactions, Physical Interactions, Gene Expression, Regulation of Gene Expression, Variation Phenotypes, Overexpression Phenotypes, RNAi Phenotypes, Variation Sequence Change, Gene Structure Correction. See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16.

Curation from Support Vector Machine Results. SVM results lead directly to manual curation, e.g., RNAi Phenotypes. Results from SVMs are processed further, e.g., Variation Sequence Change: Pattern Matching Script (regular expressions); New variations (entity recognition), e.g., mg366, ju43, e1360.
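A pattern-matching step of this kind can be sketched with one regular expression. The pattern below assumes that a variation name is one to three lowercase letters followed by digits, which fits the examples mg366, ju43, and e1360. The function name and the `known` set are hypothetical, and the actual WormBase script is not shown in the slides.

```python
import re

# Assumed shape of a variation name: 1-3 lowercase letters, then digits.
VARIATION = re.compile(r"\b[a-z]{1,3}\d+\b")


def find_candidate_variations(text, known=frozenset()):
    """Return variation-like tokens not already in `known`, in order of appearance."""
    seen, candidates = set(), []
    for name in VARIATION.findall(text):
        if name not in known and name not in seen:
            seen.add(name)
            candidates.append(name)
    return candidates


sentence = "Mutants mg366, ju43 and e1360 were sequenced; mg366 carries a deletion."
print(find_candidate_variations(sentence, known={"e1360"}))
```

A pattern this loose will also flag look-alikes such as p53, and it misses names with suffixes such as e2406ts, so its output is a list of candidates for curator review rather than confirmed variations.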

Data Type Flagging – Textpresso. C. elegans, Mouse, D. melanogaster, Neuroscience, Arabidopsis, Dicty, Wnt Pathway, HIV, Nematodes, S. cerevisiae, RegulonDB, ... many others. Full text of articles. Terms, phrases, entities: semantically tagged. Keyword or category search. Match within sentence or entire paper.

Textpresso Categories. Pre-existing dictionaries, vocabularies: Gene names, ChEBI (Chemical Entities of Biological Interest), PATO, Sequence Ontology (SO). Manually constructed by curators using language from published literature: Sequence similarity (orthologous, conserved); Localization assays (GFP, antibody, fluorescence); Experimental verbs (required, regulates, exhibits).

Data Type Flagging - Textpresso Category Searches. Data Type: C. elegans Human Disease Homologs. Three-category Textpresso search: (1) C. elegans gene; (2) 'Ortholog', 'Homolog', 'Similar', 'Model'; (3) Human disease. Example match: "We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase."
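The three-category, within-sentence search can be approximated as below. This is a sketch of the idea only, not the Textpresso implementation or its API: the gene-name pattern, the short disease term list, and the function names are assumptions, whereas Textpresso uses curated category dictionaries.

```python
import re

# Category 1: C. elegans gene names, assumed to look like "scd-2" or "npr-1".
GENE = re.compile(r"\b[a-z]{3,4}-\d+\b")
# Category 2: the homology terms named on the slide, allowing suffixes.
HOMOLOGY = re.compile(r"\b(?:ortholog|homolog|similar|model)\w*", re.IGNORECASE)
# Category 3: a tiny, hypothetical stand-in for a human disease dictionary.
DISEASE_TERMS = ("lymphoma", "cancer", "diabetes")


def categories_in(sentence):
    """Return the set of categories with at least one hit in the sentence."""
    found = set()
    if GENE.search(sentence):
        found.add("gene")
    if HOMOLOGY.search(sentence):
        found.add("homology")
    lowered = sentence.lower()
    if any(term in lowered for term in DISEASE_TERMS):
        found.add("disease")
    return found


def is_disease_homolog_sentence(sentence):
    """True only when all three categories co-occur in the same sentence."""
    return categories_in(sentence) == {"gene", "homology", "disease"}


example = ("We map this defect in dauer response to a mutation in the scd-2 gene, "
           "which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) "
           "homolog, a proto-oncogene receptor tyrosine kinase.")
print(is_disease_homolog_sentence(example))
print(is_disease_homolog_sentence("The scd-2 gene is expressed in neurons."))
```

Requiring all three categories in one sentence trades recall for precision; matching across the entire paper, the other scope Textpresso offers, would flag more papers and more false positives.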

Literature Curation Workflow: (1) PubMed keyword search ('elegans'); (2) Full text paper acquisition; (3) Data type flagging and entity recognition; (4) Detailed curation/Fact extraction.

Textpresso: Semi-Automated Fact Extraction. Genetic Interactions: "Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its ability to dominantly (but weakly) suppress sep-1 (e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1)." Physical Interactions (after SVM document classifier): "Remarkably, only AIN-1 coimmunoprecipitated HA-tagged Ce PAB-1 (Figure 3A and B, lane 7)." Gene Ontology, Cellular Component Curation: "During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N)."

Textpresso: Semi-Automated GO Cellular Component Curation. [Screenshot panels: Textpresso Search Results; Suggested GO Annotations; Gene Products; Textpresso Component.] See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.
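How a matched sentence might be turned into suggested GO annotations can be sketched as a lookup from component words to GO cellular component terms. The three-entry mapping, the protein-name pattern, and the function name are assumptions made for illustration; the published tool uses its own category lists and curator interface.

```python
import re

# Small illustrative mapping from component words to GO cellular component IDs.
COMPONENT_TO_GO = {
    "cytoplasm": "GO:0005737",
    "nucleus": "GO:0005634",
    "plasma membrane": "GO:0005886",
}
# Assumed shape of a C. elegans protein name, e.g. PAN-1 or AIN-1.
PROTEIN = re.compile(r"\b[A-Z]{2,4}-\d+\b")


def suggest_annotations(sentence):
    """Pair every protein mentioned with every component term found in the sentence."""
    proteins = sorted(set(PROTEIN.findall(sentence)))
    lowered = sentence.lower()
    return [(protein, go_id, term)
            for protein in proteins
            for term, go_id in COMPONENT_TO_GO.items()
            if term in lowered]


sentence = ("During embryogenesis, PAN-1 protein is uniformly distributed throughout "
            "the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 "
            "mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules "
            "(Fig. 2K, N).")
print(suggest_annotations(sentence))
```

The output is only a suggestion. The same sentence also states that PAN-1 is not concentrated in P granules, and simple co-occurrence cannot distinguish such negated statements, so a curator still has to accept or reject each pair.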

Future Directions. Textpresso and other methods (HMMs) applied to additional data types, e.g., GO Biological Process curation (Phenotypes). Focusing triage and fact extraction on novel findings: How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results? e.g., commonly used molecular markers.

Literature Annotation Tool – Tracking Evidence: WB, GO Common Annotation Framework, BioCreative.

Summary. Text Mining Applications for Literature Curation: Paper approval and full text acquisition; Data type flagging and entity recognition; Fact extraction (record evidence). All steps of our pipeline incorporate some form of semi- or fully automated approach: Scripts for downloads, pattern matching; Support Vector Machines for document classification; Textpresso for flagging and fact extraction; (Hidden Markov Models for flagging, fact extraction).

The WormBase Consortium, Textpresso.
WormBase - Caltech: Paul Sternberg, Juancarlos Chan, Wen Chen, Chris Grove, Ranjana Kishore, Raymond Lee, Cecilia Nakamura, Daniela Raciti, Gary Schindelman, Kimberly Van Auken, Daniel Wang, Xiaodong Wang, Karen Yook. Former member: Ruihua Fang.
Textpresso - Caltech: Hans-Michael Muller, Yuling Li, James Done. Former member: Arun Rangarajan.
WormBase – OICR, Toronto: Lincoln Stein, Abigail Cabunoc, Todd Harris, JD Wong.
WormBase – Washington University: John Spieth, Tamberlyn Bieri, Phil Ozersky.
WormBase – EBI, Sanger, Hinxton, UK: Richard Durbin, Paul Kersey, Matt Berriman, Paul Davis, Michael Paulini, Kevin Howe, Mary Ann Tuli, Gary Williams.
CGC – Oxford University, Oxford, UK: Jonathan Hodgkin.

Hidden Markov Models: Semi-Automated GO Molecular Function Curation. For each sentence, HMM yields: True positive score; False positive score. For each sentence, curator assigns: Fully curatable (entity + indication of enzymatic activity); Positive (experiment was performed, result but no entity); False Positive (not about enzymatic activity at all).
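The slide does not say how the two scores are computed. One plausible reading, sketched here purely as an assumption, is that each sentence is scored by its likelihood under two small HMMs, one trained on true-positive sentences and one on false-positive sentences, using the forward algorithm. The token classes, word list, and all probabilities below are invented for illustration.

```python
import re

ENZYME_WORDS = {"kinase", "phosphorylates", "activity", "catalyzes"}  # hypothetical


def to_symbols(sentence):
    """Map tokens to coarse classes: ENT (protein-like name), ENZ, or OTHER."""
    symbols = []
    for tok in re.findall(r"[A-Za-z0-9-]+", sentence):
        if tok.lower() in ENZYME_WORDS:
            symbols.append("ENZ")
        elif re.fullmatch(r"[A-Z]{2,4}-\d+", tok):
            symbols.append("ENT")
        else:
            symbols.append("OTHER")
    return symbols


def forward_likelihood(obs, start, trans, emit):
    """P(obs | model), computed with the forward algorithm."""
    states = list(start)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())


# Two toy models sharing structure; only the emission tables differ.
START = {"bg": 0.6, "fact": 0.4}
TRANS = {"bg": {"bg": 0.7, "fact": 0.3}, "fact": {"bg": 0.2, "fact": 0.8}}
TP_MODEL = dict(start=START, trans=TRANS, emit={
    "bg": {"OTHER": 0.8, "ENT": 0.1, "ENZ": 0.1},
    "fact": {"OTHER": 0.2, "ENT": 0.4, "ENZ": 0.4}})
FP_MODEL = dict(start=START, trans=TRANS, emit={
    "bg": {"OTHER": 0.9, "ENT": 0.05, "ENZ": 0.05},
    "fact": {"OTHER": 0.8, "ENT": 0.1, "ENZ": 0.1}})


def score_sentence(sentence):
    """Return (true-positive score, false-positive score) for one sentence."""
    obs = to_symbols(sentence)
    return forward_likelihood(obs, **TP_MODEL), forward_likelihood(obs, **FP_MODEL)


for text in ("AIN-1 has kinase activity", "We thank the members of the lab"):
    tp, fp = score_sentence(text)
    print("TP-like" if tp > fp else "FP-like", "|", text)
```

Because the FP model's emission rows here deliberately do not sum to one, its scores are not true probabilities; the sketch only shows how two per-sentence scores could be produced and compared before a curator assigns one of the three labels.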