Using UMLS CUIs for WSD in the Biomedical Domain

Slides:



Advertisements
Similar presentations
Determining the Syntactic Structure of Medical Terms in Clinical Notes Bridget T. McInnes¹ Ted Pedersen² and Serguei V. Pakhomov¹ University of Minnesota¹.
Advertisements

WWW 2014 Seoul, April 8 th SNOW 2014 Data Challenge Two-level message clustering for topic detection in Twitter Georgios Petkos, Symeon Papadopoulos, Yiannis.
Distant Supervision for Emotion Classification in Twitter posts 1/17.
© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide.
U. S. National Library of Medicine NLM Indexing Initiative Tools for NLP: MetaMap and the Medical Text Indexer Natural Language Processing: State of the.
11 Chapter 20 Computational Lexical Semantics. Supervised Word-Sense Disambiguation (WSD) Methods that learn a classifier from manually sense-tagged text.
Word sense disambiguation and information retrieval Chapter 17 Jurafsky, D. & Martin J. H. SPEECH and LANGUAGE PROCESSING Jarmo Ritola -
Codifying Semantic Information in Medical Questions Using Lexical Sources Paul E. Pancoast Arthur B. Smith Chi-Ren Shyu.
K NOWLEDGE - BASED M ETHOD FOR D ETERMINING THE M EANING OF A MBIGUOUS B IOMEDICAL T ERMS U SING I NFORMATION C ONTENT M EASURES OF S IMILARITY Bridget.
CS 4705 Lecture 19 Word Sense Disambiguation. Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised.
1 Representing Meaning in Unsupervised Word Sense Disambiguation Bridget T. McInnes 5 September 2008 University of Minnesota Twin Cities.
AMIA A Comparative Study of Supervised Learning as Applied to Acronym Expansion in Clinical Reports Mahesh Joshi, Serguei Pakhomov, Ted Pedersen,
1 Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation Masters Thesis : Saif Mohammad Advisor : Dr. Ted Pedersen University.
An Unsupervised Approach to Biomedical Term Disambiguation: Integrating UMLS and Medline Bridget T McInnes University of Minnesota Twin Cities Background.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
1 Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation Saif Mohammad Ted Pedersen University of Toronto University of Minnesota.
Development of a Pediatric Text-Corpus for Part-of-Speech Tagging John Pestian 1, Lukasz Itert 1,2, and Włodzisław Duch 2,3 1 BioMedical Informatics, 3333.
CS 4705 Word Sense Disambiguation. Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised –Dictionary-based.
1 Complementarity of Lexical and Simple Syntactic Features: The SyntaLex Approach to S ENSEVAL -3 Saif Mohammad Ted Pedersen University of Toronto, Toronto.
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
Distributed Representations of Sentences and Documents
1 Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation Masters Thesis : Saif Mohammad Advisor : Dr. Ted Pedersen University.
Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora 12th International Conference on Web Information System Engineering.
Automated Classification of Medical Questions Using Semantic Parsing Techniques Paul E. Pancoast, MD Arthur B. Smith, MS Chi-Ren Shyu, PhD University of.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
1 Introduction to Modeling Languages Striving for Engineering Precision in Information Systems Jim Carpenter Bureau of Labor Statistics, and President,
A Fully Unsupervised Word Sense Disambiguation Method Using Dependency Knowledge Ping Chen University of Houston-Downtown Wei Ding University of Massachusetts-Boston.
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning Author: Chaitanya Chemudugunta America Holloway Padhraic Smyth.
Jiuling Zhang  Why perform query expansion?  WordNet based Word Sense Disambiguation WordNet Word Sense Disambiguation  Conceptual Query.
Session II: Scientific Publishing and Semantic Web W3C Semantic Web for Life Sciences Workshop October 27, 2004 Moderator: Alan R. Aronson.
2007. Software Engineering Laboratory, School of Computer Science S E Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying.
Annual reports and feedback from UMLS licensees Kin Wah Fung MD, MSc, MA The UMLS Team National Library of Medicine Workshop on the Future of the UMLS.
NL Question-Answering using Naïve Bayes and LSA By Kaushik Krishnasamy.
1 Statistical NLP: Lecture 9 Word Sense Disambiguation.
11 Chapter 20 Computational Lexical Semantics. Supervised Word-Sense Disambiguation (WSD) Methods that learn a classifier from manually sense-tagged text.
W ORD S ENSE D ISAMBIGUATION By Mahmood Soltani Tehran University 2009/12/24 1.
Combining terminology resources and statistical methods for entity recognition: an evaluation Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo.
SYMPOSIUM ON SEMANTICS IN SYSTEMS FOR TEXT PROCESSING September 22-24, Venice, Italy Combining Knowledge-based Methods and Supervised Learning for.
2014 EMNLP Xinxiong Chen, Zhiyuan Liu, Maosong Sun State Key Laboratory of Intelligent Technology and Systems Tsinghua National Laboratory for Information.
Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute.
CS 4705 Lecture 19 Word Sense Disambiguation. Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised.
Knowledge-Based Semantic Interpretation for Summarizing Biomedical Text Thomas C. Rindflesch, Ph.D. Marcelo Fiszman, M.D., Ph.D. Halil Kilicoglu, M.S.
Disambiguation of Biomedical Text Mark Stevenson Natural Language Processing Group University of Sheffield, UK Joint work.
Intelligent Database Systems Lab Presenter : Kung, Chien-Hao Authors : Yoong Keok Lee and Hwee Tou Ng 2002,EMNLP An Empirical Evaluation of Knowledge Sources.
CSKGOI'08 Commonsense Knowledge and Goal Oriented Interfaces.
1/21 Automatic Discovery of Intentions in Text and its Application to Question Answering (ACL 2005 Student Research Workshop )
CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 24 (14/04/06) Prof. Pushpak Bhattacharyya IIT Bombay Word Sense Disambiguation.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
Sentiment Analysis Using Common- Sense and Context Information Basant Agarwal 1,2, Namita Mittal 2, Pooja Bansal 2, and Sonal Garg 2 1 Department of Computer.
That's What She Said: Double Entendre Identification Kiddon & Brun 2011.
MetaCoDe A GATE PLUGIN FOR TAGGING MEDICAL CORPORA IN FRENCH WITH CONTROLED TERMINOLOGIES Thierry Delbecque Pierre.
Intro to NLP - J. Eisner1 Splitting Words a.k.a. “Word Sense Disambiguation”
Medical Semantic Similarity with a Neural Language Model Dongfang Xu School of Information Using Skip-gram Model for word embedding.
A Simple Approach for Author Profiling in MapReduce
UNIFIED MEDICAL LANGUAGE SYSTEMS (UMLS)
Aspect-Based Sentiment Analysis Using Lexico-Semantic Patterns
System for Semi-automatic ontology construction
Wei Wei, PhD, Zhanglong Ji, PhD, Lucila Ohno-Machado, MD, PhD
Bridget McInnes Ted Pedersen Serguei Pakhomov
Category-Based Pseudowords
Statistical NLP: Lecture 9
WordNet WordNet, WSD.
Computational Lexical Semantics
Introduction Task: extracting relational facts from text
A method for WSD on Unrestricted Text
Giannis Varelas Epimenidis Voutsakis Paraskevi Raftopoulou
Extracting Why Text Segment from Web Based on Grammar-gram
Statistical NLP : Lecture 9 Word Sense Disambiguation
Presentation transcript:

Using UMLS CUIs for WSD in the Biomedical Domain Bridget T. McInnes¹ Ted Pedersen² and John Carlis¹ University of Minnesota Twin Cities¹ and University of Minnesota Duluth²

The culture count doubled. What is WSD? The culture count doubled. Culture Anthropological Culture Laboratory Culture Sense Inventory

Sense Inventory: UMLS Unified Medical Language System contains a list of Concept Unique Identifiers (CUIs) which are concepts (senses) associated with a word or term Culture Anthropological Culture (C0010453)‏ Laboratory Culture (C0430400)‏ Sense Inventory: UMLS

UMLS: Semantic Network framework encoded with different semantic and syntactic structures Anthropological Culture (C0010453)‏ Laboratory Culture (C0430400)‏ Semantic Type(s): Idea or Concept Semantic Type(s): Laboratory Procedure Semantic Type: Mental Process semantic relation: assesses_effect_of semantic relation: result_of

MetaMap Concept mapping system maps text to concepts in the UMLS provides a wealth of information for all words in a document phrasal information Part of speech (POS) of a word CUI of a word Semantic types of a word

Example The culture count doubled count doubled CUI: Count (C0750480)‏ semantic type: Idea or Concept (idcn)‏ pos: noun doubled CUI: Duplicate (C0205173)‏ semantic type: Functional Concept (ftcn)‏ pos: verb

Supervised Approaches Leroy and Rindflesch 2005 Semantic types, semantic relations, part-of-speech, and head information (from MetaMap) Joshi, Pedersen and Maclin 2005 unigrams in the same sentence as the ambiguous word in the same abstract as the ambiguous word Liu, Teller and Friedman 2004 unigrams, direction and orientation of unigrams and collocations

Questions

Questions Would UMLS CUIs be an improvement over semantic types?

Questions Would UMLS CUIs be an improvement over semantic types? Would the biomedical specific feature CUIs be an improvement over the more general feature unigrams?

Questions Would UMLS CUIs be an improvement over semantic types? Would the biomedical specific feature CUIs be an improvement over the more general feature unigrams? Would increasing the context window in which surrounding CUIs are found improve the results?

Our supervised approach Algorithm: Naïve Bayes from WEKA datamining package using 10 fold cross validation Features: UMLS CUIs obtained from MetaMap that occur in the same sentence as the ambiguous word more than one time (s-1-cui) that occur in the same abstract as the ambiguous word more than one time (a-1-cui)‏

Example ... The culture count doubled. The cells multiplied by twice the expected rate ... Abstract: Sentence: C0750480 Count (2)‏ C0205173 Duplicate (1)‏ ... C0750480 Count (2)‏ C0205173 Duplicate (3)‏ C0007634 Cells (4)‏ C1517001 Expected (1)‏ C1521828 Rate (3)‏ ...

Algorithm Example Instances Extract Relevant CUIs Test Data Training Data Naïve Bayes Algorithm Sense Tagged Test Data Naïve Bayes Algorithm Naïve Bayes Algorithm Naïve Bayes Algorithm Sense Tagged Test Data Sense Tagged Test Data Sense Tagged Test Data

Dataset National Library of Medicine's Word Sense Disambiguation (NLM-WSD) Dataset 50 words from the 1998 MEDLINE abstracts 100 instances for each of the 50 words Each instance has been tagged by MetaMap The target word was manually assigned a UMLS concept or None Average number of concepts per ambiguous word is 2.26 (not including None)‏

Data subsets Liu subset Leroy subset Joshi subset Liu, Teller and Friedman 2004 22 out of the 50 words in NLM-WSD Leroy subset Leroy and Rindflesch 2005 15 out of the 50 words in NLM-WSD Joshi subset Joshi, Pedersen and Maclin 2005 28 out of the 50 words in NLM-WSD (union of Leroy and Liu subsets)

Results

Would CUIs be an improvement over semantic types? Results for Question 1 Would CUIs be an improvement over semantic types?

Comparative results with Leroy and Rindflesch 2005 74.5% 71% 65.6%

Significance of Differences Pairwise t-test s-1-cui (71%) and s-0-Leroy (65.6%)‏ p <= 0.001 a-1-cui (74.5%) and s-0-Leroy (65.6%)‏ p <= .00005

Results for Question 2 Would the biomedical specific feature CUIs be an improvement over the more general feature unigrams?

Comparative results with Joshi, Pedersen and Maclin 2005 82.5% 80% 77.7% 79.3%

Significance of Results Pairwise t-test s-1-cui (77.7%) and s-4-Joshi (79.3%)‏ p < 0.135 a-1-cui (80.0%) and a-4-Joshi (82.5%)‏ p < 0.003

Results for Question 3 Would increasing the size of the context window in which surrounding CUIs are found improve the results, as seen by Joshi, Pedersen and Maclin using unigrams?

Comparative results between size of context window 83.3% 85.6%

Significance of Results Pairwise t-test s-1-cui (83.3%) and a-1-cui (85.6%)‏ p < 0.0006

Comparative results with Liu, Teller and Friedman 2004 85.5% 81.9%

Significance of Results Pairwise t-test a-1-cui (81.9%) and s-1-Liu (85.5%)‏ p < 0.001

Conclusions CUIs result in more accurate disambiguation than semantic types and are comparable to unigrams Incorporating more surrounding context improves the results MetaMap generates useful information that can used as features for supervised disambiguation

Future Work Combination approach Exploring additional UMLS features Unsupervised approach using information from the UMLS

Software and Data CuiTools version 0.05 NLM-WSD Dataset http://cuitools.sourceforge.net NLM-WSD Dataset http://wsd.nlm.nih.gov Pairwise t-test http://www.quantitativeskills.com/sisa/statistics/