NLP Tools for Biology Literature Mining
Qiaozhu Mei, Jing Jiang, ChengXiang Zhai
Nov 3, 2004

What do we have?
Biology literature (huge amount of text)
E.g. "Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. …" (from Biological Abstracts)

What do we want?
Named entities: gene names, protein names, drugs, etc.
Interaction events between entities: transcription, translation, post-translational modification, etc.
Relationships between basic events: caused by, inhibited by, etc.
(from Hirschman et al. 02)

Preliminary System Structure
Collections of raw textual data
Text Pre-processing (NLP):
  POS Tagger: nouns, verbs, etc.
  Parser: NPs, VPs, relations …
  Entity Extractor: genes, proteins, other entities
  …
Pre-processed data ready to mine
Text Mining Modules (TM)
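The structure above can be sketched as a chain of interchangeable stages, each adding annotations to a shared document object. This is only an illustration of the plug-in idea: the names Document, run_pipeline, toy_pos_tagger and toy_entity_extractor are hypothetical, and the two toy stages are stand-ins for real tools, not the behavior of any tagger or extractor discussed in this talk.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Raw text plus the annotations each stage adds to it."""
    text: str
    annotations: Dict[str, list] = field(default_factory=dict)


# A stage takes a document and returns it with more annotations attached.
Stage = Callable[[Document], Document]


def toy_pos_tagger(doc: Document) -> Document:
    # Stand-in for a real tagger: closed-class lookup, everything else NN.
    lookup = {"the": "DT", "a": "DT", "is": "VBZ",
              "and": "CC", "from": "IN", "by": "IN"}
    doc.annotations["pos"] = [(tok, lookup.get(tok.lower(), "NN"))
                              for tok in doc.text.split()]
    return doc


def toy_entity_extractor(doc: Document) -> Document:
    # Stand-in for a real extractor: flag tokens that end in "RNA".
    doc.annotations["entities"] = [tok for tok, _ in doc.annotations["pos"]
                                   if re.search(r"RNA$", tok)]
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Document:
    """Run the stages in order; swapping a tool means swapping one list entry."""
    doc = Document(text)
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline("the 12S ribosomal RNA subunit is inverted",
                   [toy_pos_tagger, toy_entity_extractor])
print(doc.annotations["entities"])
```

Because every stage shares one signature, a better tagger or extractor could replace a toy stage without touching the text mining code that reads the annotations.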

POS Taggers
Tree Tagger
Brill Tagger
SNoW Tagger
LT Chunk
Stanford Tagger

Results of POS Tagging
Raw text: Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. … (from Biological Abstracts)

Results of POS Tagging (cont.)
Taggers: Tree, Brill, SNoW, LT, Stanford
the: DT
12S: JJ, CD, NN, JJ, CD
ribosomal: JJ
RNA: NP, NNP
subunit: NN
is: VBZ
inverted: VVN, JJ, VBN, JJ
and: CC
separated: VVN, VBN, VBD, VBN, JJ
from: IN
the: DT

Results of POS Tagging (cont.)
Taggers: Tree, Brill, SNoW, LT, Stanford
16S: JJ, CD, NN, JJ, CD
rRNA: NN, NNP
by: IN
a: DT
novel: NN, JJ, NN
non-coding: VVG, JJ, NN, JJ
region: NN
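Some of the disagreement in these results is only a difference in tagsets: the Tree Tagger output uses tags such as VVN, VVG and NP where the other taggers use Penn Treebank tags such as VBN and NNP. A minimal sketch of normalizing the Tree Tagger tags before comparing, assuming its English tagset marks lexical verbs as VV*, forms of "have" as VH*, and proper nouns as NP/NPS; the function name to_penn is hypothetical.

```python
from collections import Counter


def to_penn(tag: str) -> str:
    """Map a Tree Tagger English tag to the closest Penn Treebank tag."""
    if tag == "NP":
        return "NNP"
    if tag == "NPS":
        return "NNPS"
    # VV* (lexical verbs) and VH* (forms of "have") collapse to Penn VB*.
    if tag[:2] in ("VV", "VH"):
        return "VB" + tag[2:]
    return tag


# Tags for "separated" from the slide, in the order Tree, Brill, SNoW, LT, Stanford.
separated = ["VVN", "VBN", "VBD", "VBN", "JJ"]

# Only the first entry comes from Tree Tagger, so only it needs mapping.
normalized = [to_penn(separated[0])] + separated[1:]
majority_tag, votes = Counter(normalized).most_common(1)[0]
print(majority_tag, votes)  # VBN 3
```

After normalization, three of the five taggers agree on VBN for "separated", which would otherwise be counted as only two.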

Comparison of POS Taggers
Taggers: Tree | Brill | SNoW | LT | Stanford
Src.: Stuttgart | Eric Brill | UIUC | Edinburgh | Stanford
Alg.: decision tree | transformation based | network of linear functions | HMM disambiguation | maximum entropy
Speed: 1min/5M | < 1min/5M | ~8mins/5M | 40mins/5M | 80mins/5M
Adapt.: yes | source included | high | low | source & API included
Other: punctuation sensitive | commonly used | help available | 96 – 98% precision

Conclusions
Existing general-purpose POS taggers work fine for our task.
  Most nouns and verbs are correctly identified.
There is still room to improve existing POS taggers for biology data.
  E.g. to identify gene and protein names.
Speed and adaptability are important.

A Little Bit More on SNoW
SNoW has a POS tagger and a shallow parser.
Speed is reasonable.
Software is adaptable as help is available from CCG.
The network model can be trained if we have training data.

Result of SNoW Shallow Parser
[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP inverted] and [VP separated] [PP from] [NP the 16 S rRNA] [PP by] [NP a novel non-coding region] (from online demo)
Problems:
  Currently the package is not available for download from the new CCG page.
  There are still problems running the old package on our machine (compilation, path setting, etc.).

Parsers
SNoW (already covered)
LT-Chunk
MiniPar
Collins
Stanford

Result of LT-Chunk
[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]] (( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN [[ the_DT 16S_JJ rRNA_NNP ]] by_IN [[ a_DT novel_JJ non-coding_JJ region_NN ]]
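Output in this bracketed form is easy to load into a text mining module. The sketch below reads the LT-Chunk line shown above, assuming that [[ ]] encloses noun groups, (( )) encloses verb groups, and every token is written as word_TAG; the labels NG, VG and O and the function name parse_chunks are illustrative choices, not part of LT-Chunk.

```python
from typing import List, Tuple

Token = Tuple[str, str]          # (word, POS tag)
Chunk = Tuple[str, List[Token]]  # (label, tokens in the chunk)

OPEN = {"[[": "NG", "((": "VG"}  # assumed meaning of the two bracket types
CLOSE = {"]]", "))"}


def parse_chunks(line: str) -> List[Chunk]:
    """Split one line of bracketed chunker output into labelled chunks."""
    chunks: List[Chunk] = []
    label, current = None, []
    for item in line.split():
        if item in OPEN:
            label, current = OPEN[item], []
        elif item in CLOSE:
            chunks.append((label, current))
            label, current = None, []
        else:
            # Split on the last underscore so words containing "_" survive.
            word, _, tag = item.rpartition("_")
            if label is None:
                chunks.append(("O", [(word, tag)]))  # token outside any chunk
            else:
                current.append((word, tag))
    return chunks


line = ("[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]] "
        "(( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN "
        "[[ the_DT 16S_JJ rRNA_NNP ]] by_IN "
        "[[ a_DT novel_JJ non-coding_JJ region_NN ]]")

chunks = parse_chunks(line)
noun_groups = [" ".join(word for word, _ in tokens)
               for label, tokens in chunks if label == "NG"]
print(noun_groups)
```

The noun groups recovered this way are the candidate phrases an entity extractor or relation miner would look at first.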

Result of MiniPar
16 (the ~ Det 20 det (gov subunit))
17 (12S ~ N 20 nn (gov subunit))
18 (ribosomal ~ A 20 mod (gov subunit))
19 (RNA ~ N 20 nn (gov subunit))
20 (subunit ~ N 22 s (gov invert))
21 (is be be 22 be (gov invert))
22 (inverted invert V E0 i (gov fin))
E4 (() subunit N 22 obj (gov invert)
23 (and ~ U 22 lex-mod (gov invert))
24 (separated separate V 22 lex-dep (gov invert))
25 (from ~ Prep 22 mod (gov invert))
26 (the ~ Det 28 det (gov rRNA))
27 (16S ~ N 28 nn (gov rRNA))
28 (rRNA ~ N 25 pcomp-n (gov from))

Results of Collins Parser
(S~is~2~2 (NPB~subunit~5~5 the/DT 12S/CD ribosomal/JJ RNA/NNP subunit/NN ) (VP~is~2~1 is/VBZ (UCP~inverted~3~1 (ADJP~inverted~1~1 inverted/JJ ) and/CC (VP~separated~3~1 separated/VBN (PP~from~2~1 from/IN (NPB~rRNA~3~3 the/DT 16S/CD rRNA/NN ) ) (PP~by~2~1 by/IN (NP~region~2~1 (NPB~region~4~4 a/DT novel/JJ non-coding/JJ region/NN,/PUNC, )

Comparison of Parsers
Parsers: LT | MiniPar | Collins | Stanford
Src.: Edinburgh | U Alberta | M. Collins | Stanford
Prec.: Part of LT-POS | Slightly over 88% | ~ 85%
Speed: 40min/5M | 14min/5M | > 3 hrs/5M | very slow …
Adapt.: Low, training not allowed | High, provides API | Source included | Source & API included
Other: LT-Chunk is a part of LT-POS; readable output | Complex output of dependency and governing info. | Well-known. Tagged input needed. | Java based.

Conclusion on Parsers
MiniPar has advantages so far:
  Fast
  Outputs dependency & governing info. and useful relations
  Provides API
If SNoW is tuned for the task, we can easily plug it into the module.

Entity Extractors
Abner: extracts protein, DNA, RNA, cell line, and cell type names.
Yagi: a sibling of Abner that extracts only gene names.
LingPipe: named entity extraction that can be trained for different domains.

Result of Abner
Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12 S ribosomal RNA subunit is inverted and separated from the 16 S rRNA by a novel non-coding region, …

Result of LingPipe
Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, …

Comparison of Entity Extractors
Extractors: Abner | Yagi | LingPipe
Src.: U Wisconsin (Abner and Yagi) | Alias-i, Inc. (LingPipe)
Alg.: CRF Model (Abner and Yagi) | B-CUBED alg. (LingPipe)
Prec.: 89.3%/69.9% (seen/unseen) data, 72% for protein | 75% on unseen data | Exact Match: 64.9 %
Recall: 65% | Exact: ~ 70%
Speed: 40mins/5M | 3mins/5M | 5mins/5M (model1), 3hrs/5M (model2)
Adapt.: Java based, pre-trained | Java based, pre-trained with BioCreative | Two trained models, training allowed
Other: Graphic Interface; files <= 500k | Should split into small files <= 1M, can take directory as input | Command line & demo. Also does co-referencing.

Conclusion on Entity Extractors
Still a lot of room to improve.
However, with existing extractors we can begin high-level text mining work.
Performance on honeybee data needs to be evaluated.
As soon as a better extractor is constructed, we can plug it in easily.

Summary
Some existing NLP tools for supporting biology literature mining are evaluated: POS taggers, parsers and entity extractors.
Observations along two lines:
  There is still considerable room for improvement beyond the existing NLP tools, especially in customizing them for special domains.
  We can begin exploring higher-level text mining research with the support of these toolkits.
Text preprocessing modules are independent and easy to plug and play.

References
Hirschman, L. et al. Accomplishments and challenges in literature data mining for biology. Bioinformatics, 2002.
Lin, D. Dependency-based evaluation of MiniPar. In Workshop on the Evaluation of Parsing Systems, 1998.

End of Talk Thank you!