Implementing Dictionary-Based NER Solutions for Mining Biomedical Literature Karen Dowell, Monica McAndrews-Hill, David Hill, Harold Drabkin, Judith Blake.

Presentation transcript:

Implementing Dictionary-Based NER Solutions for Mining Biomedical Literature
Karen Dowell, Monica McAndrews-Hill, David Hill, Harold Drabkin, Judith Blake
7th Fraunhofer Symposium on Text Mining, October 6, 2009

ProMiner at Mouse Genome Informatics (MGI)
- Background on MGI and our biocuration process
- Applying Named Entity Recognition (NER) applications to improve MGI curator efficiency and minimize bottlenecks
- Our implementation and results to date using ProMiner to annotate full-text scientific journal articles in HTML and PDF format

Mouse Genome Informatics (MGI):
- A comprehensive, integrated public information resource for mouse genetics, genomics, and biology
- Facilitates use of the laboratory mouse as a model for human biology
- Provides extensively curated mouse data

The MGI website presents information on mouse biology in a publicly accessible, content-rich, continually updated online database.

MGI content ranges from DNA sequence to disease phenotype.

MGI integrates information on mouse genes and experimental data through a combination of manual curation, computational curation, and collaboration with other online resources.

Primary Triage Secondary Triage Master Bibliography Indexing Expert Curation  For literature curation we  Review more than 160 scientific journals each month  Screen more than 12,000 articles a year

Primary Triage Secondary Triage Master Bibliography Indexing Expert Curation  Curators pick papers based on  Expression  Mapping  Homology  New Genes  Gene Ontology (GO)  Alleles & Phenotypes  Sequences  Inbred Strain  Tumor  Nomenclature  General Interest Screen for references to mouse, mice, murine

Selected articles are assigned reference numbers and entered into a master bibliography.

In 2009: 10,097 articles added, ~1,122 per month (as of September 29, 2009)

Indexing is our internal process of associating article reference numbers with at least one entity within the MGI database. For gene indexing, that entity is a gene.

Primary Triage Secondary Triage Master Bibliography Indexing Expert Curation  Curators read each paper and enter information into MGI database using controlled vocabularies  Articles annotated based on  Expression  Mapping  Homology  New Genes  Sequences  Inbred Strains  Tumors  Alleles & Phenotypes

Papers Added
Master Bibliography    12,979          13,231          14,190
Phenotype Papers       9,681 (75%)     10,322 (78%)    10,689 (75%)
GO Papers              8,364 (64%)     7,716 (58%)     9,913 (70%)
Selected for Both      5,974 (46%)     6,688 (51%)     7,231 (51%)

- Many areas could benefit from text mining (as tools, not replacements for human curators)
- Selected gene indexing as a prototype project to minimize a bottleneck within our curation workflow:
  - Articles selected for GO each month (a percentage of those added to the pipeline): 770
  - Articles gene indexed each month: 200
  - More than 2,000 articles in the gene indexing pipeline

A dictionary-based named entity recognition (NER) system that:
- Complements our existing biocuration processes and workflow
- Processes full-text PDF files in batch
- Uses MGI or comparable dictionaries of mouse symbols, synonyms, and human orthologs
- Produces meaningful reports that aid curators
- Provides visualization tools
- Achieves high F-scores in published evaluations
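To make the idea of dictionary-based NER concrete, here is a minimal sketch of matching gene symbols and synonyms in text and mapping each hit back to its official symbol. It is not ProMiner's algorithm, which the slides describe only as rule-based. The miniature dictionary and the names `GENE_DICTIONARY`, `build_matcher`, and `annotate` are assumptions made for illustration; a real dictionary would be exported from MGI.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical miniature dictionary: official symbol -> synonyms.
# Illustrative entries only; a production dictionary would come from MGI.
GENE_DICTIONARY: Dict[str, List[str]] = {
    "Tlr4": ["toll-like receptor 4", "Lps"],
    "Ly96": ["MD-2", "lymphocyte antigen 96"],
}


def build_matcher(dictionary: Dict[str, List[str]]):
    """Map every lowercased name to its official symbol and compile one regex."""
    lookup: Dict[str, str] = {}
    for symbol, synonyms in dictionary.items():
        for name in [symbol, *synonyms]:
            lookup[name.lower()] = symbol
    # Try longer names first so multi-word synonyms win over shorter overlaps.
    names = sorted(lookup, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Lookarounds stop a symbol from matching inside a longer alphanumeric token.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + alternatives + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return lookup, pattern


def annotate(
    text: str, dictionary: Dict[str, List[str]] = GENE_DICTIONARY
) -> List[Tuple[int, str, str]]:
    """Return (offset, matched text, official symbol) for each dictionary hit."""
    lookup, pattern = build_matcher(dictionary)
    return [
        (m.start(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

Case-insensitive matching is a simplification: it would also tag "LPS" (lipopolysaccharide) as Tlr4. Ambiguities of this kind are one reason tool output is reviewed by curators rather than accepted automatically.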

- Of all the dictionary-based NER tools we evaluated, ProMiner most closely fit our needs:
  - Rule-based protein and entity recognition using pre-processed dictionaries (Entrez Gene, SwissProt, ATCC, and ECACC)
  - Batch processing of PDF files (beta release)
  - Standard and custom reports
  - Customizable annotation projects and dictionaries/term lists
- Initiated a collaborative pilot project between SCAI and MGI

System requirements:
- Runs on Linux systems, Sun-Ultra, and other UNIX-based systems
- Requires a minimum of 1 GB RAM, 500 MB disk space, Java (v1.5 or higher), and Perl (v5.8 or higher)
- Uses GeneDB to retrieve data (requires 1 GB to store index files)
- Includes an HTML-based (CGI) viewer
- One processor can update ~1,000 articles per project
- On a cluster of 16 processors, ProMiner can search the entire MEDLINE literature base with 1 dictionary in ~2 hours

MGI operating environment:
- Dedicated Sun Fire X4100 server with two dual-core AMD Opteron processors, 2.8 GHz, 64-bit
- Solaris 10 V. 508 operating system, Java5 built-in
- Adobe Acrobat Pro Version 9.1

SCAI delivered:
- Installation scripts, ProMiner scripts, and dictionaries
- Documentation and demos
- MGI project definition files for annotation using human and mouse dictionaries

- HTML Version 6.4 implemented in March
- PDF Version 7.1 delivered in August

This paper was indexed to the mouse genes Tlr4 and Ly96.

- 1 part-time curator working 5.5 hours a day, processing batches of 10 articles at a time
- 8 of 10 PDFs processed correctly, without errors
- Some PDF format (PDF/A) and color labeling errors
- We provide feedback to SCAI to enhance dictionaries and PDF formatting

Manual Indexing           Indexing with ProMiner
30 minutes per article    18-24 minutes per article
50 articles per week      60-70 articles per week

F-score performance measurements are in progress.
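For reference, an F-score for gene indexing could be computed per article by comparing the tool's gene set with the curator's. The sketch below uses the standard precision, recall, and F-measure definitions; it is not the evaluation protocol used in the pilot, which the slides do not describe, and the function names are hypothetical.

```python
from typing import Set


def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_article(predicted: Set[str], curated: Set[str]) -> float:
    """F1 of the tool's gene set against the curator's gene set for one article."""
    tp = len(predicted & curated)   # genes found by both
    fp = len(predicted - curated)   # genes only the tool reported
    fn = len(curated - predicted)   # genes only the curator indexed
    return f_score(tp, fp, fn)
```

As a hypothetical example, if a curator indexed an article to Tlr4 and Ly96 and the tool reported those two genes plus one extra, precision would be 2/3, recall 1, and F1 0.8.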

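ProMiner 7.1 annotates 75 full-text articles in PDF format in less than 20 minutes on our server.

[Figure: processing time versus number of articles, with a fitted trend line and R² value; the coefficients are not recoverable from the transcript.]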

- Complete performance testing and evaluate status of pilot project with SCAI
- Consider extending pilot to continue testing ProMiner 7.1
- Explore future collaborations:
  - Gene Ontology terms
  - Protein-protein interactions
  - Other curation functions at MGI

 MGI  Judith Blake  Nancy Butler  Harold Drabkin  Alex Diehl  David Hill  Monica McAndrews-Hill  Sue McClatchy  David Shaw  Dmitry Sitnikov MGI System Administration  Matt Baya  Mike McCrossin  Iry Witham  Fraunhofer SCAI  Juliane Fluck  Heinz-Theodor Mevissen  Symposium Organizers  MITRE Corporation  Lynette Hirschman  Journal of Immunology