Detecting Multiword Verbs (MWVs) in MEDLINE Abstracts Chun Xiao and Dietmar Rösner Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität.

Presentation transcript:

Detecting Multiword Verbs (MWVs) in MEDLINE Abstracts Chun Xiao and Dietmar Rösner Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität Magdeburg

2 Outline
1. MWVs in MEDLINE abstracts
2. Collecting MWV candidates
3. Case study of overgeneration
4. Ranking proper MWV candidates
5. Evaluation
6. Summary

3 MEDLINE Abstracts
MEDLINE abstracts:
  Access: PubMed®
  Domain: clinical medicine, biomedicine, biological and physical sciences.
  Source: articles from over 4,600 journals published throughout the world.
  Coverage: abstracts are included for about 52% of the articles; over 10 million abstracts.
GENIA corpus:
  MEDLINE abstracts collected using the keywords human, blood cell, transcription factor (1800 in test).
  A POS-tag-annotated version.
  An NE-annotated version.

4 Information Extraction Beyond NER
We show that in this work ….
Example sentence: High levels of UDG expression in a transient transfection assay result in the down-regulation of transcriptional activity through elements specific for E2F-mediated transcription.
  Named entities (NEs): high levels of UDG expression; down-regulation of transcriptional activity
  Domain-specific verb: result in
  The named entities and the domain-specific verb together carry the relational information.

5 MWVs in MEDLINE Abstracts
Examples of MWVs: be able to, shed light on, take place, result in
Possible ambiguities without MWV detection:
  shed light on: light is not an NE
  take place: place is not an NE
  result in/from: in or from should not form a prepositional phrase, as it would in the general case
Appropriate handling of MWVs simplifies the processing.
Reliable detection of MWVs (such as interact with) contributes to relational information extraction.

6 Collecting MWV Candidates from Corpus
Definition of an automaton T = {S, I, O, F, G, START}, where
  S is the set of automaton states, S = {nextOut, stop, nextIn, head, halt};
  I is the input set, namely the chunks, as both a POS tag sequence and a lexical sequence;
  O is the output set, namely the MWV candidates, O = {o_i | o_i is a successful MWV candidate};
  F is the set of output controlling functions;
  G is the set of automaton state transition functions;
  START is the beginning state for MWV collecting, START = head.

7 Working Mechanism: Non-contiguous Model
[State diagram: states nextOut, head, nextIn, stop and halt; transitions labeled a, a^c, b, b^x, b/a, (b/a)^c, EVP/IVP, SEPR and SEPR/Ø.]
Legend:
  SEPR: sentence separator.
  EVP: verb phrase.
  IVP: infinitive verb phrase.
  a: legal chunks. Limited by chunk type, i.e., not in a stoplist {EVP, COMMA, SEPR, PUNC, …}; limited by tokens in chunks, i.e., containing only one token.
  b: illegal chunks, i.e., chunks that are not legal.
  c >= 0.
  x is limited by the given right-side window size s, i.e., x <= s-1.
  b/a does not include SEPR, EVP and IVP.

8 Working Mechanism: Contiguous Model
[State diagram: states nextOut, head, nextIn, stop and halt; transitions labeled a, a^c, b, b/a, (b/a)^c, EVP/IVP, SEPR and SEPR/Ø.]
Legend:
  SEPR: sentence separator.
  EVP: verb phrase.
  IVP: infinitive verb phrase.
  a: legal chunks. Limited by chunk type, i.e., not in a stoplist {EVP, COMMA, SEPR, PUNC, …}; limited by tokens in chunks, i.e., containing only one token.
  b: illegal chunks, i.e., chunks that are not legal.
  c >= 0.
  b/a does not include SEPR, EVP and IVP.
  The stop state will trigger an output operation.

9 Fragment of the Automaton for MWV Collecting
[State diagram fragment: states head, nextIn and stop; transitions labeled EVP/IVP, a, a^c and b.]
Legend:
  SEPR: sentence separator.
  EVP: verb phrase.
  IVP: infinitive verb phrase.
  a: legal chunks. Limited by chunk type, i.e., not in a stoplist {EVP, COMMA, SEPR, PUNC, …}; limited by tokens in chunks, i.e., containing only one token.
  b: illegal chunks, i.e., chunks that are not legal.
  c: c >= 0.
  stop: the stop state will trigger an output operation.

10 An Example
Chunks of a sentence | Chunk tag | State transition | Output
Initialization | | nextOut | Ø
The 3'NF-E2/AP1 motif | ENP | nextOut | Ø
is | EVP | head | O_i = "be"
able | ADJP | nextIn | O_i = "be able"
to exert | IVP | stop | O_i = "be able to" (success)
 | | head | O_i+1 = "exert"
both positive and negative regulatory effects | ENPS | stop, nextOut | O_i+1 = "exert" (failure)
on | IN | nextOut | Ø
the zeta 2-globin promoter activity | ENP | nextOut | Ø
in | IN | nextOut | Ø
K562 cells | ENP | nextOut | Ø
 | SEPR | halt | Ø
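The contiguous model and the worked example above can be sketched in Python. This is only an illustrative reading of the slides, not the authors' implementation: the input representation (pairs of lemmatized tokens and a chunk tag), the function names, the rule that the last token of a verb chunk is its lemma, and the flushing of a pending candidate at a sentence separator are all assumptions.

```python
VERB_TAGS = {"EVP", "IVP"}
STOPLIST = {"EVP", "IVP", "COMMA", "SEPR", "PUNC"}  # chunk types that are never legal


def is_legal(tokens, tag):
    """Legal chunk 'a': type not in the stoplist and exactly one token."""
    return tag not in STOPLIST and len(tokens) == 1


def collect_contiguous(chunks):
    """Collect contiguous MWV candidates from (lemmatized tokens, chunk tag) pairs."""
    candidates = []
    state, current = "nextOut", []

    def emit():
        # The stop state triggers output; a bare head counts as a failure.
        if len(current) > 1:
            candidates.append(" ".join(current))

    for tokens, tag in chunks:
        if tag == "SEPR":
            if state == "nextIn":
                emit()  # assumption: flush a pending candidate at sentence end
            state, current = "halt", []
        elif tag in VERB_TAGS:
            if state in ("head", "nextIn"):
                if tag == "IVP" and tokens[0] == "to":
                    current.append("to")  # "be able" + "to exert" -> "be able to"
                emit()
            current = [tokens[-1]]  # assumption: last token is the verb lemma
            state = "head"
        elif state in ("head", "nextIn"):
            if is_legal(tokens, tag):
                current.append(tokens[0])
                state = "nextIn"
            else:
                emit()  # illegal chunk 'b' closes the candidate
                state, current = "nextOut", []
        # In nextOut or halt, non-verb chunks are skipped.
    return candidates


# The sentence from the example slide, with "is" already lemmatized to "be".
sentence = [
    (["the", "3'NF-E2/AP1", "motif"], "ENP"),
    (["be"], "EVP"),
    (["able"], "ADJP"),
    (["to", "exert"], "IVP"),
    (["both", "positive", "and", "negative", "regulatory", "effects"], "ENPS"),
    (["on"], "IN"),
    (["the", "zeta", "2-globin", "promoter", "activity"], "ENP"),
    (["in"], "IN"),
    (["K562", "cells"], "ENP"),
    (["."], "SEPR"),
]
print(collect_contiguous(sentence))
```

As in the table, "exert" is opened as a head but discarded because the next chunk has more than one token; the later single-token chunk "on" is ignored because the automaton is already back in nextOut.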

11 Case Study of Overgeneration
Case 1: take place, take place in, take place at
  POS tag pattern: Verb Noun (Preposition)
Case 2: be able to, be important for
  83% of the occurrences of able are in be able to; 8.4% of the occurrences of important are in be important for
Case 3: take place, bind DNA
  DNA is a named entity, but place is not
Case 4: be able to, be unaffected
  acceptable boundary words should not be adjectives
Case 5: associate with, be associated with, associated with
  be associated with and associated with: no difference
  use to, be used to, used to: sometimes be used to differs from used to
  Example: "He used to smoke a pipe."

12 Ranking Proper MWV Candidates
Assumption: the following aspects can be important for ranking a proper MWV candidate c.
  aspect 1: f(c), the absolute frequency of c.
  aspect 2: f(c) / f(head(c)), the proportion of f(c) to the frequency of the MWV head of c, i.e., head(c).
  aspect 3: F(c) / f(head(c)), the proportion of F(c), the sum of all occurrences of candidates that share the same MWV head with c, to the frequency of the MWV head of c.
head(c): the head word is selected with priority from the left to the right side of a candidate, but a word that is one of the most frequent verbs (be, have, do, …) or a preposition is excluded.
  Examples: result in, be able to, …
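A minimal sketch of head selection and the three ranking aspects, assuming candidates are space-separated lemma strings. The preposition list is a small hypothetical sample, and the head frequencies in the usage example are invented for illustration; only the candidate counts for result in, result from and take place are taken from the result list in this talk.

```python
from collections import defaultdict

FREQUENT_VERBS = {"be", "have", "do"}
# Hypothetical, deliberately small preposition list for illustration only.
PREPOSITIONS = {"in", "on", "to", "of", "with", "from", "for", "at", "upon"}


def head(candidate):
    """Leftmost word that is neither a most-frequent verb nor a preposition."""
    for word in candidate.split():
        if word not in FREQUENT_VERBS and word not in PREPOSITIONS:
            return word
    return None


def aspects(cand_freq, word_freq):
    """Map each candidate c to (f(c), f(c)/f(head(c)), F(c)/f(head(c)))."""
    family_total = defaultdict(int)  # F(c): occurrences of all candidates per head
    for cand, freq in cand_freq.items():
        family_total[head(cand)] += freq
    result = {}
    for cand, freq in cand_freq.items():
        h = head(cand)
        f_head = word_freq[h]
        result[cand] = (freq, freq / f_head, family_total[h] / f_head)
    return result


cand_freq = {"result in": 325, "result from": 31, "take place": 7}
word_freq = {"result": 400, "take": 50}  # invented head frequencies
for cand, values in aspects(cand_freq, word_freq).items():
    print(cand, values)
```

With these numbers, result in and result from share the head result, so they receive the same aspect 3 value while differing strongly in aspects 1 and 2.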

13 Flowchart of Ranking MWVs
[Flowchart. Boxes: contiguous MWV candidates; non-contiguous MWV candidates; domain-specific terms; verb chunker for common verb lemmata; controls of overgeneration; ranking of reliability; decision "if R(c) > t"; output: reliable MWV candidate c.]
Steps named in the flowchart:
  Examine the compatibility of long and short candidates
  Examine candidates of passive and active forms
  Filter candidates with open boundaries
Symbols:
  head(c), the MWV head of a candidate c;
  f(c), the frequency of c;
  f(head(c)), the frequency of head(c);
  F(c), the sum of all occurrences of candidates that share the same MWV head with c;
  c1, c2 and c3 are coefficients;
  R(c), the value of reliability evaluation;
  t, threshold.
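The transcript names the coefficients c1, c2, c3 and the test R(c) > t but does not preserve the formula for R(c). The sketch below therefore assumes, purely for illustration, that R(c) is a weighted sum of the three ranking aspects; the coefficient values, the threshold and the head frequencies are invented.

```python
def reliability(f_c, f_head, family_total, c1=1.0, c2=1.0, c3=1.0):
    """Assumed form: R(c) = c1*f(c) + c2*f(c)/f(head(c)) + c3*F(c)/f(head(c))."""
    return c1 * f_c + c2 * (f_c / f_head) + c3 * (family_total / f_head)


def select_reliable(candidates, t, c1=1.0, c2=1.0, c3=1.0):
    """Keep the candidates whose reliability exceeds the threshold t."""
    return [
        name
        for name, (f_c, f_head, family_total) in candidates.items()
        if reliability(f_c, f_head, family_total, c1, c2, c3) > t
    ]


# candidate -> (f(c), f(head(c)), F(c)); head frequencies are invented
candidates = {
    "result in": (325, 400, 356),
    "take place": (7, 50, 7),
}
print(select_reliable(candidates, t=1.0, c1=0.001))
```

A small c1 keeps the raw frequency from swamping the two ratio terms, which lie between 0 and 1; how the authors actually balanced the terms is not stated in the transcript.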

14 Selection of Sample Set for Result Evaluation
Selection of candidates for result evaluation:
  Most frequent: 33 candidates (f(c) >= 60)
  Moderate frequencies: 31 candidates (19 >= f(c) >= 14)
  Low frequencies: 95 candidates (f(c) = 6 or 7)
Evaluation according to:
  Oxford Advanced Learner's Dictionary of Current English (encyclopedic edition)
  LEO German-English online dictionary (

15 Baseline Performance
t = a * min R(c) + b (baseline: a = 1, b = 0)
Note: in this case, all candidates in the sample set are given a positive ranking value.
[Table with columns c1, c2, c3, t, Precision, Recall, F-measure; the values are not preserved in the transcript.]

16 Evaluation
Baseline: a = 1, b = 0, then P = 0.4565, R = 1, F =
Let a = 2.3, b = 0.1 (or 0.2), then P = 0.6863, R = 0.8333, F =
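The threshold rule t = a * min R(c) + b and the evaluation measures can be sketched as follows. The reliability scores and the gold set here are invented toy data, not the sample set of the talk, and the balanced F-measure 2PR/(P+R) is an assumption, since the transcript does not say which F-measure was used.

```python
def threshold(scores, a=1.0, b=0.0):
    """t = a * min R(c) + b over all candidates in the sample set."""
    return a * min(scores.values()) + b


def evaluate(scores, gold, t):
    """Precision, recall and balanced F-measure of the candidates with R(c) > t."""
    selected = {cand for cand, r in scores.items() if r > t}
    true_positives = len(selected & gold)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Invented reliability scores and gold-standard MWVs for illustration.
scores = {"result in": 0.9, "take place": 0.6, "be unaffected": 0.2, "bind DNA": 0.1}
gold = {"result in", "take place"}

print(evaluate(scores, gold, threshold(scores)))                # baseline: a=1, b=0
print(evaluate(scores, gold, threshold(scores, a=2.3, b=0.1)))  # stricter threshold
```

Raising a and b moves the threshold above the weakest candidates, which is the mechanism by which the slide trades recall for precision.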

17 Result: A List of MWVs in Ranking Order
No. | Occurrences | MWV candidate
1 | 8 | be subject to
2 | 5 | subject* to
3 | 7 | give rise to
4 | 7 | take place
5 | 325 | result in
6 | 271 | lead to
7 | 293 | associate* with
8 | 89 | fail to
9 | 7 | culminate in
10 | 5 | challenge* with
11 | 5 | coincide with
12 | 5 | submit* to
13 | 5 | synergizes with
14 | 35 | base* on
15 | 42 | interfere with
16 | 91 | derive* from
17 | 62 | consist of
18 | 19 | belong to
19 | 111 | contribute to
20 | 17 | attribute* to
21 | 41 | compose* of
22 | 31 | result from
23 | 56 | be present in
24 | 5 | base* upon
Note: * = dominated by the passive form.

18 Summary
Our results present a sound balance between the low- and high-frequency MWV candidates in the sublanguage corpus.
  We find MWVs that share the same head with different accessories (base on and base upon) or with different perspectives (result in vs. result from);
  POS tag errors affect the ranking process (related JJ to);
  Some specific entries are difficult to evaluate (synergize, pretreat, etc.);
  The most frequent verbs/auxiliaries (be, have, do) were not considered in this experiment.
Ongoing and future work:
  UMLS (Unified Medical Language System) SPECIALIST Lexicon instead of WordNet for verb stemming;
  Recognition of derivational forms of specific verbs;
  Combination with domain-specific analysis.

Thank you for your attention! Questions?