Lexical Semantics CSCI-GA.2590 – Lecture 7A

Lexical Semantics CSCI-GA.2590 – Lecture 7A
Ralph Grishman, NYU

Words and Senses
Until now we have manipulated structures based on words. But if we are really interested in the meaning of sentences, we must consider the senses of words:
- most words have several senses
- frequently several words share a common sense
Both are important for information extraction.

Terminology
- polysemy: multiple senses of a word (homonymy for totally unrelated senses, e.g. "bank")
- metonymy: certain types of regular, productive polysemy ("the White House", "Washington")
- zeugma: a conjunction combining distinct senses, used as a test for polysemy ("serve")
- synonymy: two words mean (more or less) the same thing
- hyponymy: X is the hyponym of Y if X denotes a more specific subclass of Y (X is the hyponym, Y is the hypernym)

WordNet
- large-scale database of lexical relations
- organized as a graph whose nodes are synsets (synonym sets); each synset consists of one or more word senses considered synonymous
- fine-grained senses
- primary relation: hyponym / hypernym
- sense-annotated corpus SEMCOR, a subset of the Brown corpus
- available on the Web, along with foreign-language WordNets
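As an illustration (not part of the original slides), WordNet can be queried through NLTK; a minimal sketch, assuming NLTK and its wordnet corpus are installed:

```python
# Minimal sketch of querying WordNet via NLTK
# (assumes: pip install nltk; nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# All synsets (senses) of "bass": fish senses, musical senses, ...
for synset in wn.synsets('bass'):
    print(synset.name(), '-', synset.definition())

# The hypernym chain of one sense illustrates the hyponym/hypernym relation.
first_sense = wn.synsets('bass')[0]
print(first_sense.hypernyms())
```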

Two basic tasks
Two basic tasks we will consider today:
- given an inventory of senses for a word, decide which sense is used in a given context
- given a word, identify other words that have a similar meaning [in a given context]

Word Sense Disambiguation
- the process of identifying the sense of a word in context
- WSD evaluation: either using WordNet senses or coarser senses (e.g., the main senses from a dictionary)
- local cues (Weaver): train a classifier using nearby words as features
  - either treat the words at specific positions relative to the target word as separate features
  - or treat all words within a given window (e.g., 10 words wide) as a 'bag of words'
- simple demo for 'interest'

Simple supervised WSD algorithm: naive Bayes
- select sense s' = argmax_s P(s | F) = argmax_s P(s) Π_i P(f_i | s), where F = {f_1, f_2, …} is the set of context features, typically specific words in the immediate context
- maximum likelihood estimates for P(s) and P(f_i | s) can easily be obtained by counting; some smoothing (e.g., add-one smoothing) is needed
- works quite well at selecting the best sense (not at estimating probabilities)
- but needs substantial annotated training data for each word
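A minimal sketch of this classifier (illustrative only; the toy training data, bag-of-words features, and smoothing constant are assumptions, not from the slides):

```python
import math
from collections import Counter, defaultdict

def train_nb_wsd(labeled_examples, alpha=1.0):
    """labeled_examples: list of (context_words, sense) pairs.
    Returns counts needed for add-alpha-smoothed P(s) and P(f | s)."""
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in labeled_examples:
        sense_counts[sense] += 1
        for f in context:
            feature_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feature_counts, vocab, alpha

def disambiguate(context, model):
    sense_counts, feature_counts, vocab, alpha = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float('-inf')
    for sense, count in sense_counts.items():
        # log P(s) + sum_i log P(f_i | s), with add-alpha smoothing
        score = math.log(count / total)
        denom = sum(feature_counts[sense].values()) + alpha * len(vocab)
        for f in context:
            score += math.log((feature_counts[sense][f] + alpha) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy usage (hypothetical data):
model = train_nb_wsd([(['catch', 'river'], 'fish'),
                      (['play', 'guitar'], 'music')])
print(disambiguate(['play', 'the'], model))   # -> 'music'
```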

Sources of training data for supervised methods
- SEMCOR and other hand-made WSD corpora
- dictionaries: the Lesk algorithm uses the overlap of the definition and the context
- bitexts (parallel bilingual data)
- crowdsourcing
- Wikipedia links: treat the different articles linked from the same word as alternative senses (Mihalcea, NAACL 2007); the articles provide lots of information for use by a classifier
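A sketch of the simplified Lesk idea mentioned above: choose the sense whose dictionary gloss overlaps most with the context. The sense inventory here is a hypothetical dict, not a particular dictionary API:

```python
def simplified_lesk(context_words, sense_glosses, stopwords=frozenset()):
    """sense_glosses: dict mapping sense id -> gloss text.
    Returns the sense whose gloss shares the most words with the context."""
    context = {w.lower() for w in context_words} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        gloss_words = {g.lower() for g in gloss.split()} - stopwords
        overlap = len(context & gloss_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for "bass":
glosses = {'fish':  'the lean flesh of a saltwater fish of the family Serranidae',
           'music': 'the lowest part of the musical range'}
print(simplified_lesk('he plays the bass part in the band'.split(), glosses))  # -> 'music'
```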

Wikification and Grounding
- We can extend the notion of disambiguating individual words to cover multi-word terms and names.
- Wikipedia comes closest to providing an inventory of such concepts: people, places, classes of objects, ....
- This has led to the process of Wikification: linking the phrases in a text to Wikipedia articles.
- Wikification demo (UIUC)
- annual evaluation (for names) as part of the NIST Text Analysis Conference

Local vs. Global Disambiguation
- local disambiguation: each mention (word, name, term) in an article is disambiguated separately based on context (the other words in the article)
- global disambiguation: take into account the coherence of the disambiguations across the document
  - optimize the sum of the local disambiguation scores plus a term representing the coherence of the referents
  - coherence is reflected in links between Wikipedia entries
- the relative importance of prominence, local features, and global coherence varies greatly

Using Coherence
[Figure, built up over three slides: the document text "… the Texas Rangers defeated the New York Yankees …" shown beside candidate Wikipedia entries (Texas Rangers (lawmen), Texas Rangers (baseball team), NY Yankees (baseball team), Major League Baseball); links in Wikipedia connect the baseball entries, so the coherent pair of referents is preferred.]

Supervised vs. Semi-supervised
- problem: training some classifiers (such as WSD) needs lots of labeled data
- supervised learners: all data labeled
- alternative: semi-supervised learners: some labeled data ("seed") + lots of unlabeled data

Bootstrapping: a semi-supervised learner
Basic idea of bootstrapping:
- start with a small set of labeled seeds L and a large set of unlabeled examples U
- repeat:
  - train classifier C on L
  - apply C to U
  - identify the examples with the most confident labels; remove them from U and add them (with labels) to L
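A minimal sketch of this loop (the classifier interface, the fixed batch size, and the iteration limit are assumptions for illustration):

```python
def bootstrap(labeled, unlabeled, train, n_iterations=10, batch_size=20):
    """labeled: list of (example, label) seeds; unlabeled: list of examples.
    train(labeled) must return a classifier whose predict_with_confidence(x)
    returns (label, confidence).  Returns the final labeled set and classifier."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    classifier = train(labeled)
    for _ in range(n_iterations):
        if not unlabeled:
            break
        # Score every unlabeled example and take the most confident ones.
        scored = [(classifier.predict_with_confidence(x), x) for x in unlabeled]
        scored.sort(key=lambda item: item[0][1], reverse=True)
        chosen = scored[:batch_size]
        labeled.extend((x, label) for (label, _conf), x in chosen)
        chosen_ids = {id(x) for (_label, _conf), x in chosen}
        unlabeled = [x for x in unlabeled if id(x) not in chosen_ids]
        classifier = train(labeled)   # retrain on the enlarged labeled set
    return labeled, classifier
```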

Bootstrapping WSD
Premises:
- one sense per discourse (document)
- one sense per collocation

Example: "bass" as fish or musical term
[Figure, built up over several slides: occurrences of "bass" in several documents, some in the collocation "catch bass", others in "play bass".]
- label the initial seed examples ("catch bass" → fish, "play bass" → music)
- label other instances in the same document (one sense per discourse)
- learn collocations: catch … → fish; play … → music
- label other instances of those collocations (one sense per collocation)

Identifying semantically similar words
- using WordNet (or similar ontologies)
- using distributional analysis of corpora

Using WordNet
Simplest measure of semantic similarity based on WordNet: path length (longer path → less similar).
[Figure: a fragment of the hypernym tree, with mammals above felines and apes, felines above cats and tigers, and apes above gorillas and humans.]

Using WordNet
But path length ignores differences in the degree of generalization in different hyponym relations.
[Figure: a tree with mammals directly above both cats and people — "a cat's view of the world" (cats and people appear similar).]

Information Content
P(c) = the probability that a word in a corpus is an instance of the concept c (i.e., matches the synset c or one of its hyponyms).
Information content of a concept: IC(c) = -log P(c)
If LCS(c1, c2) is the lowest common subsumer of c1 and c2, the JC (Jiang-Conrath) distance between c1 and c2 is
  dist_JC(c1, c2) = IC(c1) + IC(c2) - 2 IC(LCS(c1, c2))
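A small sketch of these formulas over a toy concept table (the counts, the hypernym map, and the LCS helper are assumptions for illustration, not from any real corpus):

```python
import math

# Hypothetical corpus counts: FREQ[c] includes occurrences of c and all its hyponyms.
FREQ = {'entity': 1000, 'mammal': 200, 'feline': 60, 'cat': 40, 'person': 100}
HYPERNYM = {'mammal': 'entity', 'feline': 'mammal', 'cat': 'feline', 'person': 'mammal'}
TOTAL = FREQ['entity']

def ic(c):
    """Information content IC(c) = -log P(c)."""
    return -math.log(FREQ[c] / TOTAL)

def ancestors(c):
    chain = [c]
    while c in HYPERNYM:
        c = HYPERNYM[c]
        chain.append(c)
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: first shared ancestor found walking up from c1."""
    a2 = set(ancestors(c2))
    for a in ancestors(c1):
        if a in a2:
            return a

def jc_distance(c1, c2):
    return ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))

print(jc_distance('cat', 'person'))   # LCS is 'mammal'
```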

Similarity metric from corpora
Basic idea: characterize words by their contexts; words sharing more contexts are more similar.
Contexts can be defined either in terms of adjacency or in terms of dependency (syntactic relations).
Given a word w and a context feature f, define the pointwise mutual information PMI:
  PMI(w, f) = log ( P(w, f) / ( P(w) P(f) ) )
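A minimal sketch of computing PMI from (word, context-feature) co-occurrence counts; the counts below are hypothetical:

```python
import math
from collections import Counter

# Hypothetical (word, context-feature) co-occurrence counts.
pairs = Counter({('cat', 'purr'): 8, ('cat', 'eat'): 10,
                 ('dog', 'bark'): 9, ('dog', 'eat'): 12})
total = sum(pairs.values())
word_counts, feat_counts = Counter(), Counter()
for (w, f), n in pairs.items():
    word_counts[w] += n
    feat_counts[f] += n

def pmi(w, f):
    """PMI(w, f) = log P(w, f) / (P(w) P(f)); -inf if the pair never occurs."""
    p_wf = pairs[(w, f)] / total
    if p_wf == 0:
        return float('-inf')
    return math.log(p_wf / ((word_counts[w] / total) * (feat_counts[f] / total)))

print(pmi('cat', 'purr'), pmi('cat', 'bark'))
```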

Given a list of contexts (words to the left and right) we can compute a context vector for each word. The similarity of two vectors v and w (representing two words) can be computed in many ways; a standard way is the cosine (normalized dot product):
  sim_cosine(v, w) = Σ_i v_i × w_i / ( |v| × |w| )
See the Thesaurus demo by Patrick Pantel.
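A small sketch of the cosine computation over sparse context vectors (representing the vectors as dicts is an assumption made here for brevity):

```python
import math

def cosine(v, w):
    """Cosine similarity of two sparse vectors represented as dicts."""
    dot = sum(v[k] * w[k] for k in v if k in w)
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)

# Context vectors could hold raw counts or PMI weights (as on the previous slide).
v_cat = {'purr': 2.1, 'eat': 0.4}
v_dog = {'bark': 2.0, 'eat': 0.5}
print(cosine(v_cat, v_dog))
```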

Clusters
By applying clustering methods we have an unsupervised way of creating semantic word classes.

Word Embeddings
In many NLP applications which look for specific words, we would prefer a soft match (between 0 and 1, reflecting semantic similarity) to a hard match (0 or 1).
Can we use context vectors?

Can we use context vectors?
In principle, yes, but:
- very large (> 10^5 words, > 10^10 entries)
- sparse: the matrix representation is not convenient for neural networks
- sparse context vectors rarely overlap
We want to reduce the dimensionality.

Word embeddings
- a low-dimensional, real-valued, distributed representation of a word, computed from its distribution in a corpus
- the NLP analysis pipeline will operate on these vectors

Producing word embeddings
- dimensionality reduction methods can be applied to the full co-occurrence matrix
- neural network models can produce word embeddings
- great strides in the efficiency of word embedding generators in the last few years
- skip-grams are now widely used

Skip-Grams
- given the current word, build a model to predict a few immediately preceding and following words
- captures local context
- use log-linear models (for efficient training); train with gradient descent
- can build word embeddings from a 6 GW corpus in < 1 day on a cluster (about 100 cores)
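As an illustration (not from the slides), the gensim library provides a skip-gram word2vec implementation; a minimal sketch, assuming gensim 4.x and using a toy corpus in place of real training data:

```python
# Minimal sketch of training skip-gram embeddings with gensim (sg=1 selects skip-gram).
# A real run would use a large tokenized corpus, not this toy one.
from gensim.models import Word2Vec

sentences = [['he', 'plays', 'the', 'bass', 'guitar'],
             ['she', 'caught', 'a', 'bass', 'in', 'the', 'river']]

model = Word2Vec(sentences,
                 vector_size=100,   # dimensionality of the embeddings
                 window=5,          # context words on each side
                 sg=1,              # 1 = skip-gram, 0 = CBOW
                 min_count=1,
                 epochs=50)

vector = model.wv['bass']                      # the embedding for "bass"
print(model.wv.similarity('bass', 'guitar'))   # cosine similarity between embeddings
```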

Word similarity
- distributional similarity is effectively captured in a compact representation: a 100-200 element d.p. vector (< 2 KB / word)
- the cosine metric between vectors provides a good measure of similarity

Features from Embeddings
How can we use information from word embeddings in a feature-based system?
- directly: use the components of the vector as features
- via clusters: cluster words based on the similarity of their embeddings; use cluster membership as features
- via prototypes: select prototypical terms t_i for the task;
  feature_i(w) = sim(w, t_i) > τ
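A small sketch of the prototype-based features (the prototype terms, the threshold τ, and the embedding lookup table are assumptions for illustration):

```python
import numpy as np

def cosine(v, w):
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

def prototype_features(word, embeddings, prototypes, tau=0.5):
    """Binary features: feature_i(w) = 1 if sim(w, t_i) > tau else 0.
    embeddings: dict word -> np.ndarray; prototypes: list of prototype terms t_i."""
    v = embeddings[word]
    return [1 if cosine(v, embeddings[t]) > tau else 0 for t in prototypes]

# Toy usage with made-up 3-d embeddings and prototypes for a "musical term" feature.
emb = {'bass':   np.array([0.9, 0.1, 0.2]),
       'guitar': np.array([0.8, 0.2, 0.1]),
       'trout':  np.array([0.1, 0.9, 0.3])}
print(prototype_features('bass', emb, ['guitar', 'trout']))   # -> [1, 0]
```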