www.textpresso.org: An Information Retrieval and Extraction System for C. elegans Literature

Is full text important?

Case studies:
- 35% of protein-protein interactions were not mentioned in the abstract (Blaschke and Valencia, 2001)
- 7 out of 19 unique interactions were present in the abstract (Friedman et al., 2001)

Full text contains redundancies!

System Specifications

Queries:
- article classification
- keyword searches
- semi-semantic queries
- batch retrieval of facts

Return:
- citation
- abstract
- full text
- paper sections

Target users:
- researchers
- curators
- bioinformaticians/NLP

Biological Entities:
gene, transgene, allele, nucleic acid, organism, clone, strain, sex, entity feature, life stage, phenotype, drugs and small molecules, molecular function, cell and cell group, cellular component, mutant, method

Actions, Facts or Circumstances that Relate Two Entities:
consort, effect, purpose, pathway, regulation, action, physical association, comparison, spatial/time relation, localization, involvement, characterization, biological process

Semantic:
descriptor, bracket, determiner, conjunction, auxiliary, conjecture, negation, pronoun, preposition, punctuation

“Plugin Dictionaries”, “Common Sense”
Specific, Partially Generic, Generic

Example of a marked-up sentence:

// activation of let-7 RNA expression downregulates LIN-41 to relieve inhibition of lin-29. //

Categories marked in the sentence: Biological Process, Regulation, Gene, Molecular Function, Biological Process
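Category markup of this kind can be sketched as a lexicon lookup. The following is a minimal sketch, not the Textpresso implementation: the three-entry `LEXICON`, the assignment of each word to a category, and the function name `tag_sentence` are all assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicon: category -> regular expression over its terms.
# These entries are invented for one example sentence; they are not the
# Textpresso ontology, and the word-to-category assignment is assumed.
LEXICON = {
    "Gene": r"\b(?:let-7|LIN-41|lin-29)\b",
    "Regulation": r"\b(?:downregulates|inhibition)\b",
    "Biological Process": r"\b(?:activation|expression)\b",
}

def tag_sentence(sentence):
    """Return (start offset, matched term, category) tuples in text order."""
    hits = []
    for category, pattern in LEXICON.items():
        for match in re.finditer(pattern, sentence):
            hits.append((match.start(), match.group(0), category))
    return sorted(hits)

sentence = ("activation of let-7 RNA expression downregulates LIN-41 "
            "to relieve inhibition of lin-29.")
for _, term, category in tag_sentence(sentence):
    print(f"{term}: {category}")
```

Sorting by offset keeps the tags in reading order, so a later step could rebuild the sentence with inline markup.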

What genes does let-7 regulate?

- Keyword: “let-7”
- Category: “Regulation”
- Category: “Gene”
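A query that combines a keyword with categories can be read as a filter over annotated sentences. The sketch below assumes each sentence has already been annotated with a set of category names; the corpus, the function name `search`, and the plain substring test for the keyword are illustrative assumptions, not the Textpresso query engine.

```python
def search(corpus, keyword, required_categories):
    """Return sentences that contain the keyword and every required category.

    corpus: list of (sentence, set of category names) pairs (hypothetical layout).
    The keyword test is a plain substring match, so "let-7" would also match
    "let-70"; a real system would match on term boundaries.
    """
    required = set(required_categories)
    return [
        sentence
        for sentence, categories in corpus
        if keyword in sentence and required <= categories
    ]

# Invented example sentences with invented annotations.
corpus = [
    ("let-7 downregulates LIN-41.", {"Gene", "Regulation"}),
    ("let-7 encodes a 21 nucleotide RNA.", {"Gene"}),
    ("lin-4 downregulates lin-14.", {"Gene", "Regulation"}),
]

for hit in search(corpus, "let-7", ["Regulation", "Gene"]):
    print(hit)
```

Requiring all conditions within one sentence is what turns a document search into a search for candidate facts.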

Facts returned from journal articles!

[Screenshot of search results, labeled “Keyword” and “Categories”]

[Diagram: Textpresso system components]
- Document stages: Electronic PDF, Text, Formatted Text, Annotated Text
- Processing steps: PDF2text, preprocessor, text2XML, Index Maker, Link Maker
- Bibliographic data: Abstracts, Titles, Citations (Year, Author), Keywords
- Resources: Textpresso Ontology, Textpresso Database, WormBase Database, journal web-site, PubMed

Progress since April...
- Installed Textpresso on a new server
- Expanded Textpresso corpus (~2,700 full text)
- Preparing PDF2text for release

PDF2text
- Written in Perl and Python by Robert Caltech
- Relies on journal-specific templates (Daniel Wang)
- Software to convert electronic journal article PDFs to correctly flowing ASCII text
- Utilizes .pos output of generic pdf2text (xpdf)

Two-column PDF journal format:

Left column:
Null mutations in the C. elegans heterochronic gene
lin-41 cause precocious expression of adult fate at //

Right column:
21 nucleotide regulatory RNA. A lin-41::GFP fusion
gene is downregulated in tissues affected in late lar- //

Typical conversion to ASCII text:
// Null mutations in the C. elegans heterochronic gene 21 nucleotide regulatory RNA. A lin-41::GFP fusion
lin-41 cause precocious expression of adult fate at gene is downregulated in tissues affected in late lar- //

PDF2text output:
// Null mutations in the C. elegans heterochronic gene lin-41 cause precocious expression of adult fate at //
21 nucleotide regulatory RNA. A lin-41::GFP fusion gene is downregulated in tissues affected in late lar-
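The difference between the two conversions comes down to reading order. Here is a minimal sketch of column-aware ordering, assuming each line fragment comes with a page position; the `(x, y, text)` tuple layout, the coordinates, and the fixed `column_split` are hypothetical stand-ins and do not reproduce the xpdf .pos format or the journal templates PDF2text uses.

```python
from typing import List, Tuple

# (x, y, text): left edge, vertical position growing down the page, text.
# Hypothetical layout, chosen only to illustrate the ordering problem.
Fragment = Tuple[float, float, str]

def reflow_two_columns(fragments: List[Fragment], column_split: float) -> List[str]:
    """Emit the whole left column top to bottom, then the whole right column."""
    left = [f for f in fragments if f[0] < column_split]
    right = [f for f in fragments if f[0] >= column_split]
    ordered = sorted(left, key=lambda f: f[1]) + sorted(right, key=lambda f: f[1])
    return [text for _, _, text in ordered]

# Fragments in the row-by-row order a naive converter would emit them.
fragments = [
    (50.0, 100.0, "Null mutations in the C. elegans heterochronic gene"),
    (320.0, 100.0, "21 nucleotide regulatory RNA. A lin-41::GFP fusion"),
    (50.0, 112.0, "lin-41 cause precocious expression of adult fate at"),
    (320.0, 112.0, "gene is downregulated in tissues affected in late lar-"),
]

for line in reflow_two_columns(fragments, column_split=300.0):
    print(line)
```

In this sketch the column boundary is a single number per page; a per-journal template would be one place such a value could come from.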

Limitations
- Doesn’t work so well on older PDFs
- Relies on uniformity of article format within a journal
- Requires the development of templates

Progress since April...
- Installed Textpresso on a new server
- Expanded Textpresso corpus (~2,750 full text)
- Preparing PDF2text for release
- Textpresso paper ... in progress
- Begun Fact Extraction using Textpresso ...

Extract C. elegans alleles from full text, e.g. vba-1(e2)

Text extraction pattern:

Template:
Locus: $1
Allele: $3
Evidence: $paperref

Result:

Gene      Allele    Evidence
age-1     hx546     cgc3008
dpy-5     e61       cgc666
daf-16    mg51a     cgc5034
lon-2     e678      wbg14.1
unc-32    e189      wm97ab55
osm-3     p802      cgc2033
lin-29    n333      pmid31222
unc-5     e53       euwm2000
daf-2     e1370     cgc3012

Sentence:
...age-1(hx546)...
...expressed in... osm-3(p802) was found to be

Accept y/n?
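The template's `$1` and `$3` suggest a pattern with at least three capture groups, in which the first is the locus and the third is the allele. The sketch below is one way such a pattern could look; the regular expression itself, the choice of what group 2 captures, and the function name `extract_alleles` are assumptions, since the original pattern is not shown.

```python
import re

# Assumed pattern, not the one from the slide:
#   group 1 = locus, e.g. "age-1"
#   group 2 = optional whitespace plus the opening parenthesis
#   group 3 = allele, e.g. "hx546"
ALLELE_PATTERN = re.compile(r"\b([a-z]{3,4}-\d+)(\s*\()([a-z]{1,3}\d+[a-z]?)\)")

def extract_alleles(text, paperref):
    """Fill the Locus / Allele / Evidence template for every match in text."""
    return [
        {"Locus": m.group(1), "Allele": m.group(3), "Evidence": paperref}
        for m in ALLELE_PATTERN.finditer(text)
    ]

# Invented sentence built from two gene(allele) pairs shown in the result table.
text = "age-1(hx546) animals were scored and osm-3(p802) was found to be"
for record in extract_alleles(text, "cgc3008"):
    print(record)
```

A pattern this loose will also pick up typos and synonyms, which is why a manual accept step and later filtering still matter.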

Allele: te21    Gene: oma-1    Reference: [cgc5198]
Allele: s1733   Gene: let-653  Reference: [wbg11.1p21]
Allele: s1733   Gene: let-653  Reference: [cgc3721]
Allele: te51    Gene: oma-2    Reference: [cgc5198]
Allele: s1748   Gene: let-655  Reference: [cgc3120]
Allele: tm291   Gene: pip-1    Reference: [wm2001p213]
Allele: gm85    Gene: fam-1    Reference: [cgc2795]
Allele: gm85    Gene: fam-1    Reference: [cgc2978]

Total papers: ~2,000

- gene ↔ allele ↔ reference: ~14,000
- gene ↔ allele: ~3,200 (~1,100)
- allele ↔ reference: ~3,200 (~1,500)
- gene ↔ reference: ~1,400

~14,000 → FILTER → ~99% uploaded to WormBase

~300 required manual resolution
- ~80 synonyms
- typos, e.g.
  rol-2(e678): 160 hits
  bli-2(e768): 17 hits
  rol-2(e768): 2 hits
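The hit counts in the typo example hint at how suspect pairs might be surfaced for manual resolution: rol-2(e768) has far fewer hits than bli-2(e768), the better-supported gene for the same allele. The sketch below is an illustrative heuristic only; the function name `flag_suspect_pairs` and the `ratio` threshold are assumptions, and nothing here describes the filter actually used.

```python
from collections import defaultdict

def flag_suspect_pairs(pair_hits, ratio=0.2):
    """Flag (gene, allele) pairs with weak support relative to a rival gene.

    pair_hits: {(gene, allele): number of hits}.
    A pair is flagged when its hit count is below ratio times the count of
    the best-supported gene for the same allele. The threshold is arbitrary.
    """
    by_allele = defaultdict(dict)
    for (gene, allele), hits in pair_hits.items():
        by_allele[allele][gene] = hits

    suspects = []
    for allele, genes in by_allele.items():
        best = max(genes.values())
        for gene, hits in genes.items():
            if hits < ratio * best:
                suspects.append((gene, allele))
    return sorted(suspects)

# Hit counts taken from the example above.
pair_hits = {
    ("rol-2", "e678"): 160,
    ("bli-2", "e768"): 17,
    ("rol-2", "e768"): 2,
}
print(flag_suspect_pairs(pair_hits))
```

A flagged pair is only a candidate for review; a curator still has to decide whether it is a typo, a synonym, or a real second association.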

Lots of work to do...
- Increasing recall
  – Anaphora resolution (5%-8%)
  – Synonym recognition
- Develop Textpresso Ontology
  – Integrating open source ontologies (MeSH, UMLS)
  – Pilot study of other MODs
- Package and release software
- Develop Fact Extraction