CPSC 503 Computational Linguistics Word-Sense Disambiguation Information Retrieval Lecture 18 Giuseppe Carenini 9/25/2019 CPSC503 Spring 2004
Semantics Summary
- What meaning is and how to represent it
- How to map sentences into their meaning
- How the meaning of a sentence depends on the meaning of its constituent phrases and words (compositional semantics)
- Meaning of individual words
- Tasks: Information Extraction, Word Sense Disambiguation, Information Retrieval
Today 24/3
- Word-Sense Disambiguation: Machine Learning Approaches (stand-alone, with minimal assumptions on what information will be provided by other processes)
- Information Retrieval (ad hoc)
Supervised ML Approaches to WSD
Training Data: ((word + context1) sense1) …… ((word + contextn) sensen) -> Machine Learning -> Classifier
Classifier: (word + context) -> sense
- That’s too hard… try something empirical
- In supervised machine learning approaches, a training corpus of words tagged in context with their senses is used to train a classifier that can tag words in new text (provided the new text resembles the training text)
Training Data Example
((word + context) sense)i
- Example: “..after the soup she had bass with a big salad…”
- Context: the portion of text in which the target word is embedded
- Sense: one of the 8 possible senses for “bass” in WordNet, or one of the 2 key distinct senses for “bass” in WordNet
WordNet Bass: music vs. fish
The noun “bass” has 8 senses in WordNet:
- bass - (the lowest part of the musical range)
- bass, bass part - (the lowest part in polyphonic music)
- bass, basso - (an adult male singer with …)
- sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
- freshwater bass, bass - (any of various North American lean-fleshed ………)
- bass, bass voice, basso - (the lowest adult male singing voice)
- bass - (the member with the lowest range of a family of musical instruments)
- bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
Representations for Context
GOAL: Informative characterization of the window of text surrounding the target word
TASK: Select relevant linguistic information and encode it as a feature vector
- Most supervised ML approaches require a very simple representation for the input training data: vectors of feature/value pairs (i.e., files of comma-separated values).
- So our first task is to extract training data from a corpus with respect to a particular instance of a target word. This typically consists of a characterization of the window of text surrounding the target.
- This is where ML and NLP intersect. If you stick to trivial surface features that are easy to extract from a text, then most of the work is in the ML system. If you decide to use features that require more analysis (say, parse trees), then the ML part may be doing relatively less work, provided these features are truly informative.
Relevant Linguistic Information (1)
Collocational: information about the words that appear in specific positions to the right and left of the target word
- Often limited to the words themselves and their part of speech
- Feature vector: [word in position -n, part-of-speech in position -n, …, word in position +n, part-of-speech in position +n]
- Example text (WSJ): An electric guitar and bass player stand off to one side not really part of the scene, …
- Assuming a window of +/- 2 from the target: [guitar, NN, and, CJC, player, NN, stand, VVB]
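The collocational encoding on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `collocational_features`, the `<PAD>` filler for positions that fall outside the sentence, and the tags given to "An", "electric", "bass" and "off" are assumptions made here; only the four (word, tag) pairs in the slide's vector come from the source.

```python
def collocational_features(tagged_tokens, target_index, window=2):
    """Return [word, POS, word, POS, ...] for positions -window..+window,
    skipping the target word itself."""
    features = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the target word is not part of its own context
        position = target_index + offset
        if 0 <= position < len(tagged_tokens):
            word, pos = tagged_tokens[position]
        else:
            word, pos = "<PAD>", "<PAD>"  # assumed filler at sentence edges
        features.extend([word, pos])
    return features


# Tags for guitar/and/player/stand follow the slide; the others are hypothetical.
tagged = [
    ("An", "AT0"), ("electric", "AJ0"), ("guitar", "NN"), ("and", "CJC"),
    ("bass", "NN"), ("player", "NN"), ("stand", "VVB"), ("off", "AVP"),
]
features = collocational_features(tagged, target_index=4)
print(features)
```

With a window of 2 around "bass", the result is the same eight-element vector shown on the slide.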
Relevant Linguistic Information (2)
Co-occurrence: information about the words that occur anywhere in the window, regardless of position
- First derive a set of terms to place in the vector: find the k content words that most frequently co-occur with the target in the corpus (for bass: fishing, big, sound, player, fly, …, guitar, band)
- Then note how often each of those terms occurs in a given window. Vector for one case: [c(fishing), c(big), c(sound), c(player), c(fly), …, c(guitar), c(band)]
- Typically limited to frequency counts (bag of words assumption)
- Example text (WSJ): An electric guitar and bass player stand off to one side not really part of the scene, …
- Resulting vector: [0,0,0,1,0,0,0,0,0,0,1,0]
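A minimal sketch of the co-occurrence encoding follows. The slide elides part of the term list, so this example uses only the seven terms that are actually named (the resulting vector therefore has 7 entries rather than the slide's 12). The function name `cooccurrence_vector` and the simple lowercase tokenization are assumptions for illustration.

```python
import re
from collections import Counter

# Only the terms named on the slide; the elided ones are deliberately left out.
VOCABULARY = ["fishing", "big", "sound", "player", "fly", "guitar", "band"]


def cooccurrence_vector(window_text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the window (bag of words)."""
    tokens = re.findall(r"[a-z]+", window_text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


sentence = ("An electric guitar and bass player stand off to one side "
            "not really part of the scene")
vector = cooccurrence_vector(sentence)
print(vector)
```

Only the positions for "player" and "guitar" are non-zero, which matches the two 1s in the slide's vector.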
ML for Classifiers
Training Data (co-occurrence, collocational) -> Machine Learning -> Classifier
Techniques: Naïve Bayes, decision lists, decision trees, neural nets, support vector machines, nearest neighbor methods…
- Once we cast the WSD problem as a classification problem, all sorts of techniques are possible.
- The choice of technique depends, in part, on the set of features that have been used.
- Some techniques work better or worse with features that have numerical values.
- Some techniques work better or worse with features that have large numbers of possible values. For example, the feature "the word to the left" has a fairly large number of possible values.
Naïve Bayes
Rewriting with Bayes and assuming independence of the features:
- P(s): just the prior of that sense. Just as with part-of-speech tagging, not all senses will occur with equal frequency.
- P(vj|s): the conditional probability of some particular feature/value combination given a particular sense.
- You can get both of these from a tagged corpus with the features encoded.
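To make the two quantities on this slide concrete, here is a small sketch that estimates P(s) and P(vj|s) from counts and picks the sense maximizing P(s) times the product of the P(vj|s). The toy training examples, the add-one smoothing, and the function names are assumptions introduced for this illustration; the slide itself does not specify a smoothing method.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (feature_values, sense); feature_values is a list."""
    sense_counts = Counter(sense for _, sense in examples)
    feature_counts = defaultdict(Counter)  # sense -> counts of (index, value)
    values_per_feature = defaultdict(set)  # index -> values seen in training
    for features, sense in examples:
        for j, value in enumerate(features):
            feature_counts[sense][(j, value)] += 1
            values_per_feature[j].add(value)
    return sense_counts, feature_counts, values_per_feature


def classify(features, model):
    """Return argmax_s of log P(s) + sum_j log P(v_j | s)."""
    sense_counts, feature_counts, values_per_feature = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # prior P(s)
        for j, value in enumerate(features):
            # Add-one smoothing (an assumption); the extra +1 covers unseen values.
            numerator = feature_counts[sense][(j, value)] + 1
            denominator = count + len(values_per_feature[j]) + 1
            score += math.log(numerator / denominator)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical training data: [word to the left, word to the right] of "bass".
examples = [
    (["play", "guitar"], "music"),
    (["play", "loudly"], "music"),
    (["caught", "fish"], "fish"),
    (["grilled", "fish"], "fish"),
]
model = train_naive_bayes(examples)
print(classify(["play", "guitar"], model))
print(classify(["grilled", "fish"], model))
```

Working in log space avoids underflow when many features are multiplied together.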
Naïve Bayes: Evaluation
- Experiment comparing different classifiers [Mooney 96]
- Naïve Bayes and Neural Network achieved the highest performance: 73% in assigning one of six senses to "line"
- Good? What was the baseline?
Bootstrapping
What if you don’t have enough data to train a system…
Diagram: Small Training Data (seeds) -> Machine Learning -> Classifier -> More Classified Data -> (bootstrap) More Data
- Start with a small set of labeled instances
- Use the little training data you have to train an inadequate system
- Use that system to tag new data
- Use that larger set of training data to train a new system
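The bootstrapping loop can be sketched as follows. The classifier used here is a deliberately crude word-vote model, chosen only to keep the example short; it is an assumption of this sketch, not the method from the lecture. The seed and unlabeled instances, the `min_margin` confidence threshold, and all function names are likewise hypothetical.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Count how often each context word appears with each sense."""
    word_sense = defaultdict(Counter)
    for words, sense in labeled:
        for word in set(words):
            word_sense[word][sense] += 1
    return word_sense


def predict(words, model):
    """Return (best sense, vote margin over the runner-up), or (None, 0)."""
    votes = Counter()
    for word in set(words):
        votes.update(model.get(word, {}))
    if not votes:
        return None, 0
    ranked = votes.most_common(2)
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return ranked[0][0], ranked[0][1] - runner_up


def bootstrap(seeds, unlabeled, rounds=5, min_margin=1):
    labeled, pool = list(seeds), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)            # train on what we have so far
        remaining = []
        for words in pool:                # tag new data with that system
            sense, margin = predict(words, model)
            if sense is not None and margin >= min_margin:
                labeled.append((words, sense))
            else:
                remaining.append(words)
        if len(remaining) == len(pool):   # nothing new was labeled: stop
            break
        pool = remaining
    return train(labeled), labeled


seeds = [(["play", "bass", "guitar"], "music"),
         (["caught", "bass", "fish"], "fish")]
unlabeled = [["guitar", "bass", "solo"], ["solo", "bass", "band"],
             ["fish", "bass", "river"], ["lake", "boat"]]
final_model, labeled = bootstrap(seeds, unlabeled)
for words, sense in labeled:
    print(sense, words)
```

In this toy run the instance containing "solo" and "band" cannot be labeled in the first round, but it can in the second, once "solo" has been seen with the music sense. That is the effect the slide describes: each round's larger training set trains a better system.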
Bootstrapping: how to pick the seeds
- Hand-labeling: likely correct, likely to be prototypical
- One sense per collocation: search for words or phrases strongly associated with the target senses, then label automatically
  - Pick a word that you as an analyst think will co-occur with your target word in a particular sense
  - Grep through your corpus for your target word and the hypothesized word
  - Assume that the target tag is the right one
- E.g., for bass, play is strongly associated with the music sense, whereas fish is strongly associated with the fish sense
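The "one sense per collocation" seeding step amounts to a filtered search over the corpus. The sketch below assumes exact lowercase token matching (so "plays" or "player" would not match "play"), and it skips sentences where seed words for different senses both appear; the example sentences and the function name `label_seeds` are hypothetical.

```python
# Seed collocations from the slide: play -> music sense, fish -> fish sense.
SEED_COLLOCATIONS = {"play": "music", "fish": "fish"}


def label_seeds(sentences, target="bass", seeds=SEED_COLLOCATIONS):
    """Label sentences that contain the target plus exactly one seed sense."""
    labeled = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if target not in tokens:
            continue
        senses = {seeds[token] for token in tokens if token in seeds}
        if len(senses) == 1:  # ambiguous or seedless sentences stay unlabeled
            labeled.append((sentence, senses.pop()))
    return labeled


corpus = [
    "They play bass in a jazz trio",
    "We caught a bass and other fish",
    "The bass was very large",
    "Fish that play near the bass",
]
labeled_seeds = label_seeds(corpus)
for sentence, sense in labeled_seeds:
    print(sense, "<-", sentence)
```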
Unsupervised Methods [Schütze ’98]
Training Data: (word + vector)1 …… (word + vector)n -> Machine Learning -> K clusters ci
Hand-labeling: (c1 sense1) ……
Classification: (word + vector) -> sense, by vector/cluster similarity
PROBLEMS
- # of senses may not be known
- Clusters can be heterogeneous with respect to the sense
- # of clusters is almost always different from the number of senses
Tested on a small sample of words
Agglomerative Clustering
Assign each instance to its own cluster
Repeat
  Merge the two clusters that are most similar
Until (specified # of clusters is reached)
- A large number of standard algorithms can be applied to inputs structured as vectors of numerical values
- If there are too many training instances -> random sampling
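The pseudocode above translates almost directly into Python. The slide does not say how cluster similarity is measured, so this sketch assumes Euclidean distance between cluster centroids (smaller distance meaning more similar); the function names and the four toy vectors are also assumptions. It requires Python 3.8 or later for `math.dist`.

```python
import math


def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    dims = len(cluster[0])
    return [sum(vec[d] for vec in cluster) / len(cluster) for d in range(dims)]


def agglomerative(vectors, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[vec] for vec in vectors]  # each instance in its own cluster
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


vectors = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]]
clusters = agglomerative(vectors, k=2)
print(clusters)
```

Each pass compares every pair of clusters, which is why the slide suggests random sampling when there are too many training instances.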
Problems
- Given these general ML approaches, how many classifiers do I need to perform WSD robustly? One for each ambiguous word in the language.
- How do you decide what set of tags/labels/senses to use for a given word? Depends on the application.
Recent Work on WSD
Word Sense Disambiguation: Recent Successes and Future Directions. A SIGLEX/SENSEVAL Workshop at ACL 2002, University of Pennsylvania
Today 24/3
- Word-Sense Disambiguation: Machine Learning Approaches (stand-alone, with minimal assumptions on what information will be provided by other processes)
- Information Retrieval (ad hoc)
Information Retrieval
Def.: Retrieving information (relevant documents) from document repositories
Sub-areas:
- Ad hoc retrieval (Query -> List of documents): an untrained user poses a query to a system and is presented with an ordered list of documents that are thought to be relevant to the query. (Start your own search engine company…)
- Text Categorization (Document -> Category), e.g., BusinessNews (OIL, ACQ, …)
- Filtering (special case of TC, with 2 categories: relevant/non-relevant)
Information Retrieval
Bag of words assumption: the basic assumption of modern IR is that the meanings of documents can be captured by analyzing (counting) the words that occur in them. This is known as the bag of words approach.
- Efficiency
- Works in practice
Reference: Tobias Scheffer and Stefan Wrobel. Text classification beyond the bag-of-words representation. In Proceedings of the ICML-Workshop on Text Learning, 2002. http://www-ai.ijs.si/DunjaMladenic/TextML02/
IR Terminology (Preliminaries)
- Document: any contiguous bunch of text (e.g., news article, Web page, paragraph)
- Collection: a bunch of documents
- Terms: words that occur in a collection (may also include common phrases, e.g., car insurance)
- Query: terms that express an information need
Terms Selection and Creation
- Stop list? A list of frequent, largely content-free words (function words) that are not considered (of, the, a, to, etc.)
- Stemming? Are terms stems or words? E.g., are dog and dogs separate terms, or are they collapsed to dog?
- Phrases? Include the most frequent bigrams as phrases
Ad hoc Ranked Retrieval
query -> documents in collection (d1 d2 … dM) ranked by relevance
- Given a query, we want to know how relevant all the documents in the collection are to that query
- Vector space model: both documents and queries are represented as vectors of numbers, derived from the words that occur in the collection
- N = number of term types in the collection
- What should a t (one component of the vector) express? Whether and to what extent that term should contribute to the meaning of the query/document
First approximation: bit vector
- ti = 1 if the corresponding word type occurs in the document (ti = 0 otherwise)
- Similarity: we can compare two docs, or a query and a doc, by counting the terms they have in common
Is this a satisfying solution? NO
- It treats all the terms that occur in the query and the document as equally important. It's better to give the more important terms greater weight
- Longer documents have an advantage
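A short sketch of the bit-vector representation and the "count the terms in common" similarity. The five-term vocabulary, the example texts, and the function names are made up for illustration.

```python
def bit_vector(text, vocabulary):
    """t_i = 1 if the i-th vocabulary term occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


def overlap(u, v):
    """Number of terms that two bit vectors have in common."""
    return sum(a & b for a, b in zip(u, v))


vocabulary = ["bass", "fishing", "guitar", "player", "river"]  # hypothetical
doc_bits = bit_vector("the bass player tuned the bass guitar", vocabulary)
query_bits = bit_vector("bass guitar", vocabulary)
print(doc_bits, query_bits, overlap(doc_bits, query_bits))
```

Note that "bass" occurring twice in the document makes no difference to the vector, and a longer document can only match more terms, which are the two weaknesses the slide points out.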
Better: Term Weighting
- Local weight: how important is this term to the meaning of this document? Usually based on the frequency of the term in the document
- Global weight: how well does this term discriminate among the documents in the collection? The more documents a term occurs in, the less important it is (the fewer the better). The standard technique is known as inverse document frequency
SOLUTION: combine local and global. To get the weight for a term in a document, multiply the term’s frequency-derived weight by its inverse document frequency
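One common way to combine the local and global weights is sketched below. The slide does not give exact formulas, so this example assumes the raw term count as the local weight and log(N / df) as the inverse document frequency, where N is the number of documents and df is the number of documents containing the term. The three toy documents and the function name are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return (vocabulary, idf, vectors) using weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # local weight: raw frequency in this document
        vectors.append([tf[term] * idf[term] for term in vocabulary])
    return vocabulary, idf, vectors


documents = ["bass guitar player", "bass fishing trip", "bass bass river"]
vocabulary, idf, vectors = tf_idf_vectors(documents)
for term in vocabulary:
    print(f"{term:8s} idf={idf[term]:.3f}")
```

Because "bass" occurs in every document, its idf is log(3/3) = 0, so it contributes nothing to any vector even where it appears twice: a term that does not discriminate among documents gets no weight.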
New Similarity: the cosine measure
[Formula not recovered: cosine between the document vector d and the query vector q, normalized]
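For reference, the standard definition of the cosine measure between a query vector and a document vector is given below. The component names q_i and d_i are notation assumed here, with N the number of term types in the collection as defined on the ranked-retrieval slide; the slide's own formula may have been laid out differently.

```latex
\mathrm{sim}(q, d) \;=\; \cos(q, d)
  \;=\; \frac{\sum_{i=1}^{N} q_i \, d_i}
             {\sqrt{\sum_{i=1}^{N} q_i^{2}} \; \sqrt{\sum_{i=1}^{N} d_i^{2}}}
```

Dividing by the two vector lengths is the normalization step: it removes the advantage that longer documents have under a raw count of shared terms.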
Ad Hoc Retrieval: Summary
Given a user’s query:
- Find all the documents that contain any of the terms in the query (Why only those documents?)
- Convert the query to a vector, using the same weighting scheme that was used to represent the documents
- Compute the cosine between the query vector and all the candidate documents, and sort
Result: documents in collection (d1 d2 … dM) ranked by relevance
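The steps in this summary can be put together in one small sketch. As elsewhere, the weighting scheme (raw count times log(N / df)), the whitespace tokenizer, the four toy documents, and all function names are assumptions made for illustration. Vectors are stored sparsely as dictionaries from term to weight.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def weight(tokens, idf):
    """Sparse tf-idf vector; terms unknown to the collection are dropped."""
    tf = Counter(tokens)
    return {term: tf[term] * idf[term] for term in tf if term in idf}


def build_index(documents):
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(tokenized) / df) for term, df in doc_freq.items()}
    return idf, [weight(tokens, idf) for tokens in tokenized]


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, documents, idf, vectors):
    query_terms = set(tokenize(query))
    # Step 1: candidates are documents sharing at least one query term.
    candidates = [i for i, doc in enumerate(documents)
                  if query_terms & set(tokenize(doc))]
    # Step 2: weight the query with the same scheme as the documents.
    query_vector = weight(tokenize(query), idf)
    # Step 3: score by cosine and sort, best first (ties broken by index).
    scored = [(cosine(query_vector, vectors[i]), i) for i in candidates]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


documents = ["bass guitar player", "bass fishing trip on the river",
             "guitar lessons for beginners", "stock market news"]
idf, vectors = build_index(documents)
ranking = rank("bass guitar", documents, idf, vectors)
print([documents[i] for i in ranking])
```

The last document shares no term with the query, so its cosine would be zero anyway. That is one answer to the slide's question of why only documents containing a query term need to be scored.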
IR Evaluation (1)
- What do we want? We want documents relevant to the query to be near the top of the ranked list (d1 d2 … dM)
- Use a test collection where you have:
  - A set of documents
  - A set of queries
  - A set of relevance judgments that tell you which documents are relevant to each query
IR Evaluation (2)
Can we use Precision and Recall?
- Precision: # relevant docs returned / # docs returned. The percentage of things you find (return) that are right
- Recall: # relevant docs returned / # relevant docs total. The percentage of right things out there that you found (returned)
Not directly... we need cut-off points. Example rankings of the same ten documents:
- d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
- d2 d4 d6 d1 d7 d3 d5 d8 d9 d10
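The need for cut-off points can be seen by computing precision and recall for the two rankings on this slide. Which documents are actually relevant is not recoverable from the text, so the set {d2, d4, d6} below is a hypothetical choice made only for this illustration; the function name is also an assumption.

```python
def precision_recall_at(ranking, relevant, cutoff):
    """Precision and recall over the top `cutoff` documents of a ranking."""
    returned = ranking[:cutoff]
    hits = sum(1 for doc in returned if doc in relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


ranking_a = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
ranking_b = ["d2", "d4", "d6", "d1", "d7", "d3", "d5", "d8", "d9", "d10"]
relevant = {"d2", "d4", "d6"}  # hypothetical relevance judgments

for cutoff in (3, 10):
    print(cutoff,
          precision_recall_at(ranking_a, relevant, cutoff),
          precision_recall_at(ranking_b, relevant, cutoff))
```

Over the full list of ten, both rankings return the same set of documents and so get identical precision and recall, even though the second ranking is clearly better. Only at a cut-off (here, the top 3) does the difference in ordering show up.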
Precision and Recall Plots
[Plot: precision (vertical axis, up to 1) against recall (horizontal axis, up to 1); annotation: higher cut-off]
IR Current Research
TREC (Text Retrieval Conference)
- Large document sets for testing (documents/queries/relevance-judgments)
- Uniform scoring systems
- Different tracks: Interactive Track (studying user interaction with text retrieval systems), Question Answering Track, Web Track, Terabyte Track, …
Next Time
Discourse and Dialog (Chp. 18 and 19)