Published by Blanche Jenkins. Modified over 9 years ago.
1
Word sense induction using continuous vector space models
Mikael Kågebäck, Fredrik Johansson, Richard Johansson*, Devdatt Dubhashi
LAB, Chalmers University of Technology
*Språkbanken, University of Gothenburg
2
Word Sense Induction (WSI)
Automatic discovery of word senses. Given a corpus, discover the senses of a given word, e.g. rock.
3
Applications of WSI
Novel sense detection
Temporal/geographical word sense drift
Localized word sense lexicons
Machine translation
Text understanding
more…
4
Context clustering
Compute embeddings for word instances in a corpus, based on their context. Cluster the space. Let the centroids represent the senses. Pioneered by Hinrich Schütze (1998). Assumption: the distributional hypothesis is valid.
5
Instance-context Embeddings (ICE)
Based on word embeddings computed using the skip-gram model. Skip-gram can be viewed as a low-rank approximate factorization of a normalized co-occurrence matrix C, with context word embeddings in V and word embeddings in U.
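The factorization view can be written out as follows. This is only a sketch: the vocabulary size |W| and embedding dimension d are symbols introduced here for illustration, and the slide does not say how C is normalized.

```latex
C \approx U V^{\top}, \qquad
U \in \mathbb{R}^{|W| \times d}, \quad
V \in \mathbb{R}^{|W| \times d}, \quad
d \ll |W|, \qquad
C_{ij} \approx u_i^{\top} v_j
```

Here u_i is the row of U for word i and v_j is the row of V for context word j.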
6
Instance-context Embeddings (ICE)
Let the mean skip-gram vector representing the context form the instance vector, but: Apply a triangular window function. Weight each context word using [formula not preserved in this transcript]. This naturally removes stop words. Related to PMI, Goldberg et al. (2014).
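A minimal sketch of how such an instance vector might be computed. The weighting formula is not preserved in this transcript, so the relevance weight used here, a sigmoid of the dot product between the target word embedding and the context embedding, is an assumption, as are the function names, the window shape parameters, and the choice to average the word embeddings of the context words.

```python
import math

def triangular_window(offset, half_width):
    """Weight that decays linearly with distance from the target word."""
    return max(0.0, 1.0 - abs(offset) / (half_width + 1))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ice_vector(tokens, position, U, V, half_width=2):
    """Weighted mean of the embeddings of the words around tokens[position].

    U maps word -> word embedding, V maps word -> context embedding.
    ASSUMPTION: relevance weight = sigmoid(u_target . v_context); the
    formula on the original slide is not available.
    """
    target = tokens[position]
    total = [0.0] * len(U[target])
    weight_sum = 0.0
    for offset in range(-half_width, half_width + 1):
        j = position + offset
        if offset == 0 or j < 0 or j >= len(tokens):
            continue
        ctx = tokens[j]
        if ctx not in U or ctx not in V:
            continue  # skip out-of-vocabulary words
        w = triangular_window(offset, half_width) * sigmoid(dot(U[target], V[ctx]))
        total = [t + w * x for t, x in zip(total, U[ctx])]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no usable context: zero vector
    return [t / weight_sum for t in total]
```

Under this assumed weighting, a context word whose context embedding has a strongly negative dot product with the target contributes almost nothing, which is one way stop words could end up suppressed.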
7
Plotted instances for ‘paper’
Panels: ICE; mean vector. Plotted using t-SNE.
8
Proposed algorithm
Train a skip-gram model on the corpus. Compute instance representations using ICE, one for each instance of a word in the corpus. Cluster using (nonparametric) k-means, with cluster evaluation from Pham et al. (2005). (Evaluation) Disambiguate test data using the obtained cluster centroids.
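The clustering and disambiguation steps can be sketched as below. This is a simplification under stated assumptions: it uses plain k-means with a fixed k, whereas the slides describe a nonparametric variant that selects the number of clusters with the evaluation from Pham et al. (2005), which is not reproduced here. The function names are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point (the induced sense label)."""
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means with fixed k; ASSUMPTION: stands in for the
    nonparametric variant mentioned on the slide."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[idx] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

# Instance vectors for one target word (toy 2-d values for illustration).
instances = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
             [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
senses = kmeans(instances, k=2)

# Disambiguation: label a test instance with its nearest centroid.
test_label = nearest([0.2, 0.1], senses)
```

Each centroid plays the role of one induced sense, and a test instance is assigned the sense of the closest centroid.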
9
SemEval 2013 task 13
WSI: Identify senses in ukWaC. WSD: Disambiguate test words to one of the induced senses. Evaluation: Compare to the annotated WordNet labels.
10
Detailed results: SemEval 2013 – task 13
11
Detailed results: SemEval 2013 – task 13
12
Detailed results: SemEval 2013 – task 13
13
Conclusions
Using skip-gram word embeddings clearly boosts the performance of WSI. They provide a semantic representation for each word and tell which context words are most important.
14
ICE profile
15
Evaluation
SemEval 2013 - task 13, ukWaC. 50 lemmas and 100 instances per lemma. Annotated with WordNet senses.