CPSC 503 Computational Linguistics

CPSC 503 Computational Linguistics, Spring 2004
Lecture 18: Word-Sense Disambiguation and Information Retrieval
Giuseppe Carenini

Semantics Summary
- What meaning is and how to represent it
- How to map sentences into their meaning: how the meaning of a sentence depends on the meaning of its constituent phrases and words (compositional semantics)
- Meaning of individual words
- Tasks: Information Extraction, Word Sense Disambiguation, Information Retrieval

Today 24/3
- Word-Sense Disambiguation: machine learning approaches
- Information Retrieval (ad hoc): stand-alone, with minimal assumptions about what information will be provided by other processes

Supervised ML Approaches to WSD
Rule-based disambiguation is too hard... try something empirical instead. In supervised machine learning approaches, a training corpus of words tagged in context with their sense is used to train a classifier that can then tag words in new text (text that resembles the training text).
- Training data: ((word + context)_1 -> sense_1), ..., ((word + context)_n -> sense_n)
- Machine learning produces a classifier that maps (word + context) -> sense

Training Data Example
((word + context) -> sense)_i
- "...after the soup she had bass with a big salad..."
- Context: the portion of text in which the target word is embedded
- Sense: one of the 8 possible senses for "bass" in WordNet, or one of the 2 key distinct senses (music vs. fish)

WordNet Bass: music vs. fish
The noun "bass" has 8 senses in WordNet:
1. bass - (the lowest part of the musical range)
2. bass, bass part - (the lowest part in polyphonic music)
3. bass, basso - (an adult male singer with ...)
4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
5. freshwater bass, bass - (any of various North American lean-fleshed ...)
6. bass, bass voice, basso - (the lowest adult male singing voice)
7. bass - (the member with the lowest range of a family of musical instruments)
8. bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)

Representations for Context
GOAL: an informative characterization of the window of text surrounding the target word.
TASK: select the relevant linguistic information and encode it as a feature vector.
- Most supervised ML approaches require a very simple representation of the training data: vectors of feature/value pairs (i.e., files of comma-separated values). So the first task is to extract training data from a corpus for each instance of the target word, characterizing the window of text surrounding it.
- This is where ML and NLP intersect: if you stick to trivial surface features that are easy to extract from the text, most of the work is in the ML system; if you use features that require deeper analysis (say, parse trees), the ML part may be doing relatively less work, provided those features are truly informative.

Relevant Linguistic Information (1)
Collocational: information about the words that appear in specific positions to the right and left of the target word.
- Typically the words themselves as well as their part of speech: [word in position -n, POS in position -n, ..., word in position +n, POS in position +n]
- Assume a window of +/- 2 around the target
- Example text (WSJ): "An electric guitar and bass player stand off to one side, not really part of the scene, ..."
- Encoded vector: [guitar, NN, and, CJC, player, NN, stand, VVB]
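
A minimal sketch of this encoding in Python (the padding token and the tags for words not shown on the slide are illustrative, not from a specific tagger):

```python
# Collocational feature extraction with a window of +/- 2 around the target.
# Input is assumed to be pre-tokenized and POS-tagged.

def collocational_features(tagged_tokens, target_index, window=2):
    """Return [word-2, pos-2, word-1, pos-1, word+1, pos+1, word+2, pos+2]."""
    features = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        i = target_index + offset
        if 0 <= i < len(tagged_tokens):
            word, pos = tagged_tokens[i]
            features.extend([word, pos])
        else:
            features.extend(["<PAD>", "<PAD>"])  # off the edge of the text
    return features

# "An electric guitar and bass player stand off ..." (tags for 'an',
# 'electric', 'off' are illustrative guesses)
sent = [("an", "AT"), ("electric", "AJ"), ("guitar", "NN"), ("and", "CJC"),
        ("bass", "NN"), ("player", "NN"), ("stand", "VVB"), ("off", "AVP")]
print(collocational_features(sent, target_index=4))
# -> ['guitar', 'NN', 'and', 'CJC', 'player', 'NN', 'stand', 'VVB']
```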

Relevant Linguistic Information (2)
Co-occurrence: information about the words that occur anywhere in the window, regardless of position; typically limited to frequency counts (bag-of-words assumption).
- First derive a set of terms to place in the vector: the k content words that most frequently co-occur with the target in the corpus (for bass: fishing, big, sound, player, fly, ..., guitar, band)
- Then note how often each of those terms occurs in a given window: [c(fishing), c(big), c(sound), c(player), c(fly), ..., c(guitar), c(band)]
- Example text (WSJ): "An electric guitar and bass player stand off to one side, not really part of the scene, ..." -> [0,0,0,1,0,0,0,0,0,0,1,0]
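
A sketch of the co-occurrence encoding; the middle of the 12-word vocabulary is elided on the slide, so the words filled in here are illustrative:

```python
from collections import Counter

# Co-occurrence features: a fixed vocabulary of k content words, with one
# count per word for the current window (bag of words). The middle entries
# of this vocabulary are illustrative stand-ins for the slide's "...".
CO_OCC_VOCAB = ["fishing", "big", "sound", "player", "fly", "rod",
                "pound", "double", "runs", "playing", "guitar", "band"]

def cooccurrence_vector(window_tokens, vocab=CO_OCC_VOCAB):
    counts = Counter(window_tokens)
    return [counts[w] for w in vocab]

window = "an electric guitar and bass player stand off to one side".split()
print(cooccurrence_vector(window))
# -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]  (player and guitar occur once)
```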

ML for Classifiers
Once we cast the WSD problem as a classification problem (training data with collocational and co-occurrence features -> machine learning -> classifier), all sorts of techniques are possible:
- Naïve Bayes
- Decision lists
- Decision trees
- Neural nets
- Support vector machines
- Nearest-neighbor methods, ...
The choice of technique depends in part on the set of features used: some techniques work better or worse with numeric feature values, and some with features that have large numbers of possible values (e.g., "the word to the left" has a fairly large number of possible values).

Naïve Bayes
Rewriting with Bayes' rule and assuming independence of the features:

  s^ = argmax_{s in S} P(s | V) = argmax_{s in S} P(V | s) P(s) / P(V) ~ argmax_{s in S} P(s) * prod_j P(v_j | s)

- P(s): just the prior of that sense. As with part-of-speech tagging, not all senses occur with equal frequency.
- P(v_j | s): the conditional probability of a particular feature/value combination given a particular sense.
You can estimate both from a sense-tagged corpus with the features encoded.
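
A minimal Naïve Bayes WSD sketch over bag-of-words features with add-one smoothing; the tiny training set is invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (window_tokens, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(tokens)
        vocab.update(tokens)
    return sense_counts, word_counts, vocab

def classify_nb(tokens, sense_counts, word_counts, vocab):
    total = sum(sense_counts.values())
    best, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log P(s)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in tokens:
            # add-one smoothed log P(v_j | s)
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

train = [("caught a huge bass with a new rod".split(), "fish"),
         ("the bass player tuned his guitar".split(), "music"),
         ("grilled bass with salad".split(), "fish"),
         ("sang in a deep bass voice".split(), "music")]
model = train_nb(train)
print(classify_nb("played bass in the band".split(), *model))  # -> 'music'
```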

Naïve Bayes: Evaluation
- Experiment comparing different classifiers [Mooney 96]
- Naïve Bayes and a neural network achieved the highest performance: 73% in assigning one of six senses to "line"
- Good? What was the baseline?

Bootstrapping
What if you don't have enough data to train a system?
- Start with a small set of labeled instances (seeds)
- Use the little training data you have to train an (inadequate) classifier
- Use that classifier to tag new data
- Use the resulting larger set of training data to train a new classifier, and bootstrap more data
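
A sketch of this loop, under two assumptions: a hypothetical train(examples) that returns a classifier, and a hypothetical classify_with_confidence(tokens, model) that returns a (sense, confidence) pair; the Naïve Bayes sketch above could play this role once it exposes a normalized score.

```python
# Bootstrapping (self-training): grow the labeled set from confident
# predictions. `train` and `classify_with_confidence` are assumed hooks,
# not defined here.

def bootstrap(seed_examples, unlabeled, rounds=5, threshold=0.9):
    labeled = list(seed_examples)       # start from the hand-picked seeds
    for _ in range(rounds):
        model = train(labeled)          # train on everything labeled so far
        confident, rest = [], []
        for tokens in unlabeled:
            sense, conf = classify_with_confidence(tokens, model)
            (confident if conf >= threshold else rest).append((tokens, sense))
        if not confident:
            break                       # nothing confident enough: stop growing
        labeled += confident            # accept the confident auto-labels
        unlabeled = [tokens for tokens, _ in rest]
    return train(labeled)
```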

Bootstrapping: How to Pick the Seeds
- Hand-labeling: likely correct, and likely to be prototypical
- One sense per collocation: search for words or phrases strongly associated with the target senses, then label automatically:
  - Pick a word that you, as an analyst, think will co-occur with your target word in a particular sense
  - Grep through your corpus for your target word and the hypothesized word
  - Assume that the target tag is the right one
- E.g., for bass, "play" is strongly associated with the music sense, whereas "fish" is strongly associated with the fish sense

Unsupervised Methods [Schütze '98]
- Training data: (word + vector)_1, ..., (word + vector)_n -> machine learning -> K clusters c_i
- Hand-label the clusters: (c_1 -> sense_1), ...; then classify a new (word + vector) by vector/cluster similarity
PROBLEMS:
- The number of senses may not be known
- Clusters can be heterogeneous with respect to sense
- The number of clusters is almost always different from the number of senses
- Tested only on a small sample of words

Agglomerative Clustering
- Assign each instance to its own cluster
- Repeat: merge the two clusters that are most similar
- Until the specified number of clusters is reached
A large number of standard algorithms can be applied to inputs structured as vectors of numerical values. If there are too many training instances -> random sampling.
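
A minimal single-link agglomerative sketch following the pseudocode above; the 2-D points and target cluster count are illustrative (real context vectors would be high-dimensional):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def agglomerative(vectors, n_clusters):
    clusters = [[i] for i in range(len(vectors))]  # each instance on its own
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance between the closest members
                d = min(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (6, 5), (10, 0)]
print(agglomerative(points, 3))  # -> [[0, 1], [2, 3], [4]]
```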

Problems
- Given these general ML approaches, how many classifiers do we need to perform WSD robustly? One for each ambiguous word in the language.
- How do you decide what set of tags/labels/senses to use for a given word? It depends on the application.

Recent Work on WSD
"Word Sense Disambiguation: Recent Successes and Future Directions", a SIGLEX/SENSEVAL Workshop at ACL 2002, University of Pennsylvania.

Today 24/3
- Word-Sense Disambiguation: machine learning approaches
- Information Retrieval (ad hoc): stand-alone, with minimal assumptions about what information will be provided by other processes

Information Retrieval
IR definition: retrieving information (relevant documents) from document repositories.
Sub-areas:
- Ad hoc retrieval (query -> list of documents): an untrained user poses a query to a system and is presented with an ordered list of documents thought to be relevant to the query
- Text categorization (document -> category), e.g., business news -> (OIL, ACQ, ...)
- Filtering: a special case of text categorization with two categories, relevant/non-relevant
Start your own search engine company...

Information Retrieval
Bag-of-words assumption: the basic assumption of modern IR is that the meaning of documents can be captured by analyzing (counting) the words that occur in them.
- Efficient
- Works in practice
For work beyond this assumption, see: Tobias Scheffer and Stefan Wrobel. "Text Classification Beyond the Bag-of-Words Representation." In Proceedings of the ICML Workshop on Text Learning, 2002. http://www-ai.ijs.si/DunjaMladenic/TextML02/

IR Terminology
- Document: any contiguous stretch of text (e.g., a news article, a web page, a paragraph)
- Collection: a bunch of documents
- Terms: words that occur in a collection (may also include common phrases, e.g., "car insurance")
- Query: terms that express an information need

Term Selection and Creation
- Stop list? A list of frequent, largely content-free words (function words) that are not considered: of, the, a, to, etc.
- Stemming? Are terms stems or words? E.g., are "dog" and "dogs" separate terms, or are they collapsed to "dog"?
- Phrases? Include the most frequent bigrams as phrases.
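
A sketch of the first two steps, assuming NLTK is available for the Porter stemmer; the stop list here is a tiny illustrative one:

```python
# Term normalization: drop stop words, stem the rest.
from nltk.stem import PorterStemmer

STOP = {"of", "the", "a", "to", "and", "in", "for"}
stem = PorterStemmer().stem

def terms(tokens):
    return [stem(t) for t in tokens if t not in STOP]

print(terms("the dogs barked at a dog in the park".split()))
# -> ['dog', 'bark', 'at', 'dog', 'park']
```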

Ad hoc Ranked Retrieval
Given a query, we want to know how relevant every document in the collection is to that query: query -> (d_1, d_2, ..., d_M), the documents in the collection ranked by relevance.
Vector space model: both documents and queries are represented as vectors of numbers, d = (t_1, t_2, ..., t_N), where N is the number of term types in the collection.
- The numbers are derived from the words that occur in the collection
- What should t_i express? Whether, and to what extent, that term contributes to the meaning of the query/document

First Approximation: Bit Vectors
- t_i = 1 if the corresponding word type occurs in the document (t_i = 0 otherwise)
- Similarity: we can compare two documents, or a query and a document, by counting the terms they have in common
Is this a satisfying solution? No:
- It treats all terms as equally important; it is better to give the more important terms greater weight
- Longer documents have an advantage
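
A sketch of this first approximation, where a set intersection stands in for the bit-vector dot product; the documents and query are invented for illustration:

```python
# Bit-vector similarity: count the terms a query and a document share.
def shared_terms(query_terms, doc_terms):
    return len(set(query_terms) & set(doc_terms))

doc1 = "the bass player joined the band".split()
doc2 = "fishing for bass in the lake".split()
query = "bass band".split()
print(shared_terms(query, doc1), shared_terms(query, doc2))  # -> 2 1
```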

Better: Term Weighting
- Local weight: how important is this term to the meaning of this document? Usually based on the frequency of the term in the document.
- Global weight: how well does this term discriminate among the documents in the collection? The more documents a term occurs in, the less important it is; the fewer, the better. The standard technique is known as inverse document frequency.
SOLUTION: combine local and global. To get the weight for a term in a document, multiply the term's frequency-derived weight by its inverse document frequency.
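
A sketch of this local-times-global weighting, i.e., the standard tf-idf scheme w(t, d) = tf(t, d) * log(N / df(t)); the toy documents are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    N = len(docs)
    df = Counter()                 # number of documents each term occurs in
    for doc in docs:
        df.update(set(doc))        # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)          # local weight: raw term frequency
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors

docs = ["the bass player joined the band".split(),
        "fishing for bass in the lake".split(),
        "the band played all night".split()]
for vec in tfidf_vectors(docs):
    print({t: round(w, 2) for t, w in vec.items()})
# "the" occurs in all three documents, so its idf (and weight) is 0
```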

New Similarity: The Cosine Measure
Compare a (normalized) document vector d and query vector q by the cosine of the angle between them:

  sim(q, d) = (q . d) / (|q| |d|)
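
A sketch of the cosine over the sparse term-weight dictionaries produced by the tf-idf sketch above; the example weights are invented for illustration:

```python
import math

def cosine(u, v):
    """u, v: sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                 # empty vector: no similarity
    return dot / (norm_u * norm_v)

q = {"bass": 1.0, "band": 1.0}
d = {"bass": 0.4, "player": 1.1, "band": 0.4}
print(round(cosine(q, d), 3))      # -> roughly 0.457
```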

Ad Hoc Retrieval: Summary
Given a user's query:
- Find all the documents that contain any of the terms in the query (why only those documents?)
- Convert the query to a vector, using the same weighting scheme that was used to represent the documents
- Compute the cosine between the query vector and all the candidate documents, and sort
query -> (d_1, d_2, ..., d_M), the documents in the collection ranked by relevance

IR Evaluation (1)
What do we want? Documents relevant to the query should be near the top of the ranked list.
Use a test collection where you have:
- A set of documents
- A set of queries
- A set of relevance judgments that tell you which documents are relevant to each query

IR Evaluation (2)
Can we use precision and recall? Not directly.
- Precision: # relevant docs returned / # docs returned. The percentage of things you return that are right.
- Recall: # relevant docs returned / # relevant docs in total. The percentage of right things out there that you found.
A ranked list (e.g., d2, d4, d6, d1, d7, d3, d5, d8, d9, d10) has no fixed "returned set", so we need cut-off points.
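
A sketch of precision and recall at a cut-off k, using the ranked list above; the relevance judgments here are invented for illustration:

```python
def precision_recall_at_k(ranked, relevant, k):
    """Take the top-k of the ranked list and compare against the judgments."""
    top_k = ranked[:k]
    hits = sum(1 for d in top_k if d in relevant)
    return hits / k, hits / len(relevant)

ranked = ["d2", "d4", "d6", "d1", "d7", "d3", "d5", "d8", "d9", "d10"]
relevant = {"d1", "d2", "d3"}      # hypothetical relevance judgments
for k in (1, 3, 5, 10):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
# higher cut-offs raise recall (eventually to 1.0) while precision drops
```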

Precision and Recall Plots
[Figure: precision vs. recall plot, both axes from 0 to 1; each cut-off yields one (recall, precision) point, and an arrow marks the direction of higher cut-offs.]

IR Current Research
TREC (Text Retrieval Conference):
- Large document sets for testing (documents/queries/relevance judgments)
- Uniform scoring systems
- Different tracks, e.g.: Interactive Track (studying user interaction with text retrieval systems), Question Answering Track, Web Track, Terabyte Track, ...

Next Time
Discourse and Dialogue: Ch. 18 and 19