Category-Based Pseudowords

Presentation transcript:

Category-Based Pseudowords
Preslav Nakov & Marti Hearst
University of California at Berkeley, EECS & SIMS
Supported by Genentech and ARDA Aquaint
11/19/2018 HLT/NAACL'03

Word sense disambiguation
- WSD task: determine the sense of a particular instance of a multi-sense word, given its context
- classic ambiguous example: bank
  - homography: river bank vs. financial institution
  - polysemy: financial institution vs. building

Evaluation
- Ideally: using a sense-tagged corpus
  - general purpose, e.g. the SENSEVAL corpus
  - specific domain, e.g. biomedical: the National Library of Medicine test collection contains instances of 50 highly frequent ambiguous concepts from the UMLS Metathesaurus
- Moving to a new domain
  - a sense-tagged corpus may be unavailable
  - even when available, it may be unsuitable
    - What if we use a different sense distinction, e.g. MeSH instead of the UMLS Metathesaurus?
    - What if we are also interested in less frequent words, e.g. we need to evaluate an all-words system?

Pseudowords
- building a sense-tagged corpus is very expensive, so create an artificial one
- pseudoword: a composite of two or more words, chosen at random (Gale et al.’92), (Schuetze’92)
  - e.g. banana and door → banana_door
- results on pseudowords are accepted as an upper bound on the system’s true accuracy
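The standard construction can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_pseudoword_corpus` and the toy sentences are assumptions made for the example. Each occurrence of either constituent is replaced by the pseudoword, and the hidden original word serves as the gold "sense".

```python
import re

def make_pseudoword_corpus(sentences, word_a, word_b):
    """Replace occurrences of word_a or word_b with one pseudoword.

    Returns one (ambiguous_sentence, original_word) example per occurrence.
    Hypothetical helper written for illustration only.
    """
    pseudoword = f"{word_a}_{word_b}"
    pattern = re.compile(rf"\b({re.escape(word_a)}|{re.escape(word_b)})\b")
    tagged = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            # The original word is hidden and kept as the gold label.
            ambiguous = sentence[:match.start()] + pseudoword + sentence[match.end():]
            tagged.append((ambiguous, match.group(1)))
    return tagged

corpus = ["she peeled a banana", "he closed the door", "the cat slept"]
examples = make_pseudoword_corpus(corpus, "banana", "door")
for text, gold in examples:
    print(f"{text!r} -> {gold}")
```

A disambiguation system is then scored on how often it recovers the hidden original word from the context.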

Problems
Chosen entirely at random, and thus:
- difficult to characterize in terms of the type of ambiguity being modeled
- optimistic in their estimations (Gaustad’01)
- highly likely to combine semantically distinct words
  - real ambiguous words have senses that are similar in meaning and difficult to distinguish

The solution
- Use lexical category membership

MeSH and Medline
- we use MeSH (Medical Subject Headings)
- example: Eye has the following codes
  - A01.456.505.420 (child of Face)
  - A09.371 (child of Sense Organs)
- average number of senses: 2.12
- we cut after the first period to allow generalization (e.g. A01 and A09)
  - 71.18%: single class, 22.14%: two classes
  - the ambiguity drops to 1.39
- Medline abstracts: 180,226 (training: 120,150; testing: 60,076)
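Cutting a tree code after the first period can be sketched as below. The function name `top_level_categories` is an assumption for illustration. The second call uses made-up codes to show how two codes under the same top-level category collapse into one class, which is why the average ambiguity drops.

```python
def top_level_categories(tree_codes):
    """Cut each MeSH tree code after the first period and deduplicate."""
    return sorted({code.split(".", 1)[0] for code in tree_codes})

# Codes for Eye, as given on the slide.
eye_categories = top_level_categories(["A01.456.505.420", "A09.371"])
print(eye_categories)  # ['A01', 'A09']

# Illustrative (made-up) codes sharing one top-level category.
collapsed = top_level_categories(["C10.228", "C10.597.617"])
print(collapsed)  # ['C10']
```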

Pseudowords generation (1)
- Build a list C of the category couples and their frequencies in the training corpus
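A minimal sketch of building C, under the assumption that the training corpus has already been reduced to one set of truncated MeSH categories per term occurrence, and that a category couple is counted whenever a term maps to exactly two categories. The slide does not spell out this representation; the name `count_category_pairs` and the toy data are illustrative.

```python
from collections import Counter

def count_category_pairs(occurrences):
    """Count category couples over term occurrences.

    occurrences: iterable of sets of truncated category codes,
    one set per term occurrence (an assumed input format).
    """
    pair_counts = Counter()
    for categories in occurrences:
        if len(categories) == 2:
            # Sort so that (A01, A09) and (A09, A01) are the same couple.
            pair_counts[tuple(sorted(categories))] += 1
    return pair_counts

training_occurrences = [{"A01", "A09"}, {"A01"}, {"A09", "A01"}, {"D12", "G05"}]
C = count_category_pairs(training_occurrences)
print(C.most_common())
```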

Pseudowords generation (2)
Generate pseudowords with the following characteristics:
- each represents a real ambiguity class pair (one seen in the training corpus)
- the number of pseudowords drawn from a particular class pair is proportional to the pair’s frequency
- only unambiguous words are used as pseudoword constituents
- multi-word concepts are allowed as elements, e.g. general systems theory + glutathione s-transferase

Pseudowords generation (3)
Pseudowords for the lower bound
- in real texts, the more frequent sense of a two-sense distinction occurs around 92% of the time (Sanderson & van Rijsbergen’99)
- evenly distributed senses are harder to disambiguate
- so we build a balanced list W of pairs: we calculate the mean corpus word frequency E and then keep the words with frequency in [E/2; 3E/2]
- in this particular experiment: E = 45.21, which gave a list of 64,596 pairs
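The frequency band can be sketched as follows. The function name `balanced_words` and the toy frequencies are assumptions, and the band endpoints are treated as inclusive, which the slide does not specify.

```python
def balanced_words(word_freq):
    """Keep words whose corpus frequency lies in [E/2, 3E/2],
    where E is the mean word frequency."""
    mean_freq = sum(word_freq.values()) / len(word_freq)
    low, high = mean_freq / 2, 3 * mean_freq / 2
    kept = sorted(w for w, f in word_freq.items() if low <= f <= high)
    return mean_freq, kept

toy_freq = {"w1": 10, "w2": 40, "w3": 50, "w4": 100}
E, kept = balanced_words(toy_freq)
print(E, kept)  # 50.0 ['w2', 'w3']
```

Pairing words drawn from this band keeps the two constituents of a pseudoword close in frequency, so neither "sense" dominates.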

Pseudowords generation (4)
importance sampling
1) Select a category pair c1,c2 from C by sampling from a multinomial distribution with parameters proportional to the frequencies of the elements of C.
2) Sample uniformly to draw two random distinct words w1 and w2 whose classes correspond to the classes selected in step 1).
3) If the word pair w1,w2 has been sampled already, go to step 1) and try again.
- we sampled 1,000 pseudowords (88,758 instances) out of the possible 64,596
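The three steps can be sketched roughly as below. This is an illustration under assumptions, not the authors' implementation: `sample_pseudowords`, the `max_tries` guard, the fixed seed, and the toy categories and words are all made up for the example.

```python
import random

def sample_pseudowords(pair_counts, words_by_category, n, seed=0, max_tries=10_000):
    """Draw n distinct word pairs following steps 1) to 3).

    pair_counts: {(c1, c2): frequency}, i.e. the list C.
    words_by_category: {category: [unambiguous words]}.
    """
    rng = random.Random(seed)
    pairs = list(pair_counts)
    weights = [pair_counts[p] for p in pairs]
    chosen = set()
    tries = 0
    while len(chosen) < n and tries < max_tries:
        tries += 1
        # Step 1: category pair drawn proportionally to its frequency.
        c1, c2 = rng.choices(pairs, weights=weights, k=1)[0]
        # Step 2: one word per category, drawn uniformly.
        w1 = rng.choice(words_by_category[c1])
        w2 = rng.choice(words_by_category[c2])
        # Step 3: reject identical words and already-sampled pairs.
        if w1 == w2 or (w1, w2) in chosen:
            continue
        chosen.add((w1, w2))
    return sorted(chosen)

toy_pairs = {("A01", "A09"): 3, ("D12", "G05"): 1}
toy_words = {
    "A01": ["face", "hand"],
    "A09": ["retina", "cochlea"],
    "D12": ["insulin"],
    "G05": ["mutation"],
}
pseudowords = sample_pseudowords(toy_pairs, toy_words, n=3)
print(pseudowords)
```

Because rejected draws return to step 1, frequent category pairs still contribute more pseudowords, as long as they have unused word pairs left.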

Sample pseudowords
- the more unusual pairs come from less frequent categories

Classifier
- Naïve Bayes classifier: simple, commonly used for WSD, and among the best performing
- we used a symmetric context window: 10, 20, 40 and 300 words on each side
- category name as a proxy for the sense
  - ambiguous MeSH categories as target
  - unambiguous MeSH categories as features (we use a class-based model, not a word-based one)
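A rough sketch of a class-based Naïve Bayes setup with a symmetric window follows. The slides do not give implementation details, so the add-one smoothing, the class and function names, and the toy category codes are all assumptions.

```python
import math
from collections import Counter, defaultdict

def context_window(tokens, position, size):
    """Return up to `size` tokens on each side of the target position."""
    left = tokens[max(0, position - size):position]
    right = tokens[position + 1:position + 1 + size]
    return left + right

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed choice)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        # examples: iterable of (feature list, target category).
        for features, label in examples:
            self.class_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            score = math.log(count / total)
            for feature in features:
                score += math.log((self.feature_counts[label][feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Features are category codes of unambiguous neighbours; targets are categories.
training = [
    (["A01", "A01", "C10"], "A09"),
    (["D12", "G05"], "D08"),
    (["A01", "C10"], "A09"),
    (["G05", "D12", "D12"], "D08"),
]
model = NaiveBayes()
model.train(training)
print(model.predict(["A01"]), model.predict(["D12", "G05"]))
```

In the slides' setup, the feature list for an instance would hold the categories of the unambiguous terms inside the window, which `context_window` extracts once the text has been mapped to category codes.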

Abbreviations
- we have no real disambiguated corpus
- use abbreviations, as suggested in (Liu et al.’02)
  - they represent real ambiguous words, but the ambiguity may be accidental
  - intermediate position between entirely random pseudowords and real ambiguous words
- we generated 98,841 abbreviations (332,020 instances in total) such that:
  - their expansions are fully and unambiguously mapped to MeSH
  - they represent exactly two distinct categories
- used an abbreviation extraction tool described in (Schwartz&Hearst’03)

Sample abbreviations

Evaluation
- baseline: choose the more frequent class (shown for abbreviations)
- Category based
  - pessimistic: evenly distributed constituents
  - realistic: random constituents (frequency at least 5)
  - abbreviations
- Non-category based
  - optimistic: completely random (the standard way to generate)
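The baseline can be sketched as a majority-class predictor. The function name and toy labels are illustrative assumptions; the slides do not say how ties or unseen classes are handled.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return correct / len(test_labels)

accuracy = majority_baseline_accuracy(["A", "A", "B"], ["A", "B", "A", "A"])
print(accuracy)  # 0.75
```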

Conclusions
We introduced category-based pseudowords, based on distributions from lexical category co-occurrence. They:
- give a more accurate lower bound
- allow detailed study (many samples) of a particular sense ambiguity
- represent a better motivated word grouping in pseudowords

Thank you! Your questions?