Category-Based Pseudowords
Preslav Nakov & Marti Hearst
University of California at Berkeley, EECS & SIMS
Supported by Genentech and ARDA Aquaint
11/19/2018 HLT/NAACL'03
Word sense disambiguation
WSD task: determine the sense of a particular instance of a multi-sense word, given its context
Classic ambiguous example: bank
  homography: river bank vs. financial institution
  polysemy: financial institution vs. building
Evaluation
Ideally: using a sense-tagged corpus
  general purpose, e.g. the SENSEVAL corpus
  specific domain, e.g. biomedical: the National Library of Medicine test collection contains instances of 50 highly frequent ambiguous concepts from the UMLS Metathesaurus
Moving to a new domain
  a sense-tagged corpus may be unavailable
  even when available, it may be unsuitable
    What if we use a different sense distinction, e.g. MeSH instead of the UMLS Metathesaurus?
    What if we are also interested in less frequent words, e.g. we need to evaluate an all-words system?
Pseudowords
Building a sense-tagged corpus is very expensive, so create an artificial one
Pseudoword: a composite of two or more words, chosen at random (Gale et al.’92), (Schuetze’92)
  e.g. banana and door become banana_door: every occurrence of either word is replaced by the pseudoword, and the original word serves as the correct sense
Pseudoword accuracy is accepted as an upper bound on the system’s true accuracy
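A minimal sketch of how such an artificial corpus can be built, assuming plain-text sentences and single-word constituents. The function name make_pseudoword_corpus and the sample sentences are hypothetical and are not taken from the talk.

```python
import re


def make_pseudoword_corpus(sentences, word_a, word_b):
    """Replace occurrences of word_a or word_b with the pseudoword word_a_word_b.

    Returns (rewritten_sentence, original_word) pairs; the original word
    plays the role of the gold-standard sense.
    """
    pseudoword = f"{word_a}_{word_b}"
    pattern = re.compile(rf"\b({re.escape(word_a)}|{re.escape(word_b)})\b")
    labeled = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            # Replace only this occurrence, so each instance gets its own label.
            rewritten = sentence[:match.start()] + pseudoword + sentence[match.end():]
            labeled.append((rewritten, match.group(1)))
    return labeled


if __name__ == "__main__":
    demo = ["she ate a banana", "close the door", "nothing to replace here"]
    for rewritten, sense in make_pseudoword_corpus(demo, "banana", "door"):
        print(sense, "|", rewritten)
```

A disambiguation system is then asked to recover the original word from the context of each banana_door instance.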
Problems
Pseudoword constituents are chosen entirely at random, and thus pseudowords are:
  difficult to characterize in terms of the type of ambiguity being modeled
  optimistic in their estimations (Gaustad’01)
  highly likely to combine semantically distinct words
    real ambiguous words have senses that are similar in meaning and difficult to distinguish
The solution
Use lexical category membership
MeSH and Medline
We use MeSH (Medical Subject Headings)
  example: Eye has the following codes
    A01.456.505.420 (child of Face)
    A09.371 (child of Sense Organs)
  average number of senses: 2.12
We cut after the first period to allow generalization (e.g. A01 and A09)
  71.18%: single class, 22.14%: two classes
  the ambiguity drops to 1.39
Medline abstracts: 180,226
  training: 120,150
  testing: 60,076
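The truncation step can be illustrated with a short sketch. The helper names top_level_categories and mean_ambiguity are assumptions made for illustration, and the second term in the usage example uses made-up codes.

```python
def top_level_categories(tree_numbers):
    """Cut each MeSH tree number after its first period; return the distinct categories, sorted."""
    return sorted({code.split(".", 1)[0] for code in tree_numbers})


def mean_ambiguity(term_codes):
    """Average number of distinct top-level categories per term."""
    total = sum(len(top_level_categories(codes)) for codes in term_codes.values())
    return total / len(term_codes)


if __name__ == "__main__":
    eye_codes = ["A01.456.505.420", "A09.371"]  # the codes listed for Eye
    print(top_level_categories(eye_codes))  # ['A01', 'A09']

    # Hypothetical term whose two codes collapse into a single category.
    lexicon = {"eye": eye_codes, "made_up_term": ["A01.111", "A01.222"]}
    print(mean_ambiguity(lexicon))
```

Terms whose codes all share one top-level category become unambiguous after the cut, which is why the average ambiguity drops.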
Pseudoword generation (1)
Build a list C of the category pairs and their frequencies in the training corpus
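One way to build such a list, as a sketch only. It assumes that a pair's frequency is the number of training-corpus occurrences of terms that map to exactly those two truncated categories; the talk does not spell out the counting rule. The names build_category_pair_list, term_categories and corpus_terms are hypothetical.

```python
from collections import Counter


def build_category_pair_list(term_categories, corpus_terms):
    """Count, for each category pair, the corpus occurrences of terms that map to exactly that pair.

    term_categories: dict mapping a term to its set of truncated MeSH categories.
    corpus_terms: iterable of term occurrences found in the training corpus.
    Returns a list of ((category_1, category_2), frequency), most frequent first.
    """
    pair_counts = Counter()
    for term in corpus_terms:
        categories = set(term_categories.get(term, ()))
        if len(categories) == 2:
            # Sort so that (A01, A09) and (A09, A01) are counted as the same pair.
            pair_counts[tuple(sorted(categories))] += 1
    return pair_counts.most_common()


if __name__ == "__main__":
    lexicon = {"eye": {"A01", "A09"}, "face": {"A01"}}
    occurrences = ["eye", "face", "eye", "unknown"]
    print(build_category_pair_list(lexicon, occurrences))
```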
Pseudoword generation (2)
Generate pseudowords with the following characteristics:
  each represents a real ambiguity class pair (observed in the training corpus)
  the number of pseudowords drawn from a particular class pair is proportional to the pair’s frequency
  only unambiguous words are used as pseudoword constituents
  multi-word concepts are allowed as elements, e.g. general systems theory + glutathione s-transferase
Pseudoword generation (3)
Pseudowords for the lower bound
  in real texts, the more frequent sense of a two-sense distinction occurs around 92% of the time (Sanderson & van Rijsbergen’99)
  evenly distributed senses are harder to disambiguate
  so we build a balanced list W of pairs: we calculate the mean corpus word frequency E and then keep the words with frequency in [E/2, 3E/2]
  in this particular experiment: E = 45.21, which gave a list of 64,596 pairs
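The frequency filter can be sketched as follows. With E = 45.21 the interval [E/2, 3E/2] is roughly [22.6, 67.8]. The function name balanced_words and the toy frequencies are hypothetical, and the sketch covers only the word filter, not the subsequent pairing of words by category.

```python
def balanced_words(word_freq):
    """Return the mean frequency E and the words whose frequency lies in [E/2, 3E/2]."""
    mean_freq = sum(word_freq.values()) / len(word_freq)
    low, high = mean_freq / 2, 3 * mean_freq / 2
    kept = {word for word, freq in word_freq.items() if low <= freq <= high}
    return mean_freq, kept


if __name__ == "__main__":
    # Toy frequencies, made up for illustration.
    toy = {"w1": 10, "w2": 40, "w3": 50, "w4": 100}
    mean_freq, kept = balanced_words(toy)
    print(mean_freq, sorted(kept))
```

Because both constituents of a pair come from the same frequency band, neither sense dominates the other, which is what makes these pseudowords harder than typical real cases.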
Pseudoword generation (4)
Importance sampling
  1) Select a category pair (c1, c2) from C by sampling from a multinomial distribution with parameters proportional to the frequencies of the elements of C.
  2) Sample uniformly to draw two distinct random words w1 and w2 whose classes correspond to the classes selected in step 1).
  3) If the word pair (w1, w2) has already been sampled, go to step 1) and try again.
We sampled 1,000 pseudowords (88,758 instances) out of the possible 64,596
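The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: pair_counts stands for the list C as a dict, words_by_category maps each category to its unambiguous words, and the seed and max_tries parameters are added here only to make the sketch reproducible and guaranteed to terminate.

```python
import random


def sample_pseudowords(pair_counts, words_by_category, n, seed=0, max_tries=100_000):
    """Draw up to n distinct word pairs, choosing category pairs in proportion to their frequency."""
    rng = random.Random(seed)
    pairs = list(pair_counts)
    weights = [pair_counts[pair] for pair in pairs]
    sampled = set()
    tries = 0
    while len(sampled) < n and tries < max_tries:
        tries += 1
        # Step 1: draw a category pair with probability proportional to its frequency.
        c1, c2 = rng.choices(pairs, weights=weights, k=1)[0]
        # Step 2: draw one word uniformly from each of the two classes.
        w1 = rng.choice(words_by_category[c1])
        w2 = rng.choice(words_by_category[c2])
        if w1 == w2:
            continue
        # Step 3: if this word pair was already sampled, go back to step 1.
        if (w1, w2) in sampled:
            continue
        sampled.add((w1, w2))
    return sorted(sampled)


if __name__ == "__main__":
    # Toy inputs, made up for illustration.
    toy_pairs = {("A01", "A09"): 3, ("A01", "B01"): 1}
    toy_words = {"A01": ["arm", "leg"], "A09": ["ear"], "B01": ["cat", "dog"]}
    print(sample_pseudowords(toy_pairs, toy_words, n=4))
```

Frequent category pairs contribute more pseudowords, so the sample mirrors the ambiguity distribution observed in the training corpus.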
Sample pseudowords
The more unusual pairs come from less frequent categories
Classifier
Naïve Bayes classifier: simple, commonly used for WSD, and among the best performing
We used a symmetric context window: 10, 20, 40 and 300 words on each side
Category name as a proxy for the sense
  ambiguous MeSH categories as target
  unambiguous MeSH categories as features (we use a class-based model, not a word-based one)
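A minimal sketch of a class-based Naïve Bayes setup along these lines. It assumes a multinomial model with add-one smoothing and a simple token list as input; neither detail is stated in the talk. The names context_features, NaiveBayes and term_category are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, position, window, term_category):
    """Categories of unambiguous terms within `window` tokens on each side of `position`."""
    start = max(0, position - window)
    neighbors = tokens[start:position] + tokens[position + 1:position + 1 + window]
    return [term_category[t] for t in neighbors if t in term_category]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumption of this sketch)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        for features, label in examples:
            self.class_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for feature in features:
                if feature in self.vocabulary:  # ignore features never seen in training
                    score += math.log((self.feature_counts[label][feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy mapping from unambiguous terms to categories, made up for illustration.
    term_category = {"retina": "A09", "cheek": "A01"}
    tokens = ["the", "retina", "of", "the", "eye", "and", "cheek"]
    print(context_features(tokens, 4, 3, term_category))
```

Using category names rather than words as features keeps the feature space small and lets evidence be shared across all words of a category.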
Abbreviations
We have no real disambiguated corpus
  use abbreviations, as suggested in (Liu et al.,’02)
  they represent real ambiguous words, but the ambiguity may be accidental
  intermediate position between entirely random pseudowords and real ambiguous words
We generated 98,841 abbreviations (332,020 instances in total) such that:
  their expansions are fully and unambiguously mapped to MeSH
  they represent exactly two distinct categories
We used the abbreviation extraction tool described in (Schwartz&Hearst’03)
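The two selection criteria can be expressed as a simple filter. This sketch assumes the abbreviations and their expansions have already been extracted; the function name two_category_abbreviations and all sample data are hypothetical placeholders, not real MeSH mappings.

```python
def two_category_abbreviations(abbrev_expansions, expansion_categories):
    """Keep abbreviations whose expansions each map to one category and together cover exactly two."""
    kept = {}
    for abbrev, expansions in abbrev_expansions.items():
        categories = set()
        for expansion in expansions:
            mapped = set(expansion_categories.get(expansion, ()))
            if len(mapped) != 1:
                # Unmapped or ambiguous expansion: reject the whole abbreviation.
                break
            categories |= mapped
        else:
            if len(categories) == 2:
                kept[abbrev] = sorted(categories)
    return kept


if __name__ == "__main__":
    # Placeholder data, made up for illustration.
    abbrevs = {
        "AB": ["alpha expansion", "beta expansion"],
        "CD": ["gamma expansion"],
        "EF": ["alpha expansion", "unmapped expansion"],
    }
    categories = {
        "alpha expansion": {"A01"},
        "beta expansion": {"C04"},
        "gamma expansion": {"E01"},
    }
    print(two_category_abbreviations(abbrevs, categories))
```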
Sample abbreviations
Evaluation
Category based
  baseline: choose the more frequent class (shown for abbreviations)
  pessimistic: evenly distributed constituents
  realistic: random constituents (frequency at least 5)
  abbreviations
Non-category based
  optimistic: completely random (the standard way to generate pseudowords)
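The baseline can be made concrete with a small sketch: always predict the class that was more frequent in training and measure accuracy on the test instances. The function name majority_baseline_accuracy and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return correct / len(test_labels)


if __name__ == "__main__":
    # Toy labels, made up for illustration.
    train = ["A01", "A01", "A09"]
    test = ["A01", "A09", "A01", "A01"]
    print(majority_baseline_accuracy(train, test))  # 0.75
```

For evenly distributed constituents this baseline sits near 50%, which is why the pessimistic setting is the hardest for a classifier to beat by a wide margin.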
Conclusions
We introduced category-based pseudowords, built from the distributions of lexical category co-occurrence. They:
  give a more accurate lower bound
  allow detailed study (many samples) of a particular sense ambiguity
  represent a better motivated word grouping in pseudowords
Thank you! Your questions?