Word sense induction using continuous vector space models Mikael Kågebäck, Fredrik Johansson, Richard Johansson*, Devdatt Dubhashi LAB, Chalmers University of Technology *Språkbanken, University of Gothenburg
Word Sense Induction (WSI) Automatic discovery of word senses. Given a corpus, discover the senses of a given word, e.g. rock.
Applications of WSI Novel sense detection. Temporal/geographical word sense drift. Localized word sense lexicons. Machine translation. Text understanding. And more.
Context clustering Compute embeddings for word instances in a corpus, based on their contexts. Cluster the space and let the centroids represent the senses. Pioneered by Hinrich Schütze (1998). Assumption: the distributional hypothesis holds.
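A minimal sketch of context clustering, assuming toy 2-d embeddings and a hand-rolled k-means; the vocabulary, embedding values, and contexts below are hypothetical illustrations, not data from the paper:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means with deterministic farthest-first initialization."""
    idx = [0]
    for _ in range(k - 1):  # pick the point farthest from the chosen seeds
        d = np.min(np.linalg.norm(X[:, None] - X[idx][None], axis=2), axis=1)
        idx.append(int(d.argmax()))
    centroids = X[idx].astype(float)
    for _ in range(iters):
        # assign each instance to its nearest centroid, then re-estimate
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def context_vectors(instances, emb):
    """One vector per word instance: the mean embedding of its context words."""
    return np.array([np.mean([emb[w] for w in ctx if w in emb], axis=0)
                     for ctx in instances])

# hypothetical 2-d embeddings; contexts of 'rock' from two senses
emb = {'guitar': np.array([1.0, 0.0]), 'band': np.array([0.9, 0.1]),
       'stone': np.array([0.0, 1.0]), 'cliff': np.array([0.1, 0.9])}
instances = [['guitar', 'band'], ['band', 'guitar'],
             ['stone', 'cliff'], ['cliff', 'stone']]
centroids, labels = kmeans(context_vectors(instances, emb), k=2)
```

Each centroid then stands for one induced sense of the word.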
Instance-Context Embeddings (ICE) Based on word embeddings computed using the skip-gram model, a low-rank approximate factorization of a normalized co-occurrence matrix C, with context-word embeddings in V and word embeddings in U.
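The factorization view can be illustrated with a truncated SVD standing in for skip-gram's actual SGD training; the matrix below is toy data, and skip-gram never forms C explicitly, so this is only a sketch of the C ≈ U Vᵀ relationship:

```python
import numpy as np

# toy normalized co-occurrence matrix C (rows: words, cols: context words)
C = np.array([[2.0, 0.1, 0.0],
              [1.8, 0.2, 0.1],
              [0.0, 0.1, 1.9],
              [0.1, 0.0, 2.1]])

# rank-2 approximate factorization C ~ U V^T via truncated SVD
r = 2
u, s, vt = np.linalg.svd(C, full_matrices=False)
U = u[:, :r] * np.sqrt(s[:r])       # word embeddings (one row per word)
V = vt[:r, :].T * np.sqrt(s[:r])    # context-word embeddings

approx = U @ V.T
err = np.linalg.norm(C - approx) / np.linalg.norm(C)
```

Words with similar co-occurrence rows (here rows 0 and 1, and rows 2 and 3) end up with similar rows in U, which is what the clustering relies on.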
Instance-Context Embeddings (ICE) Let the mean skip-gram vector representing the context form the instance vector, but: apply a triangular window function, and weight each context word, which naturally removes stop words. Related to PMI (Levy and Goldberg, 2014).
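A sketch of the ICE instance vector under two stated assumptions: the window function is a symmetric triangular window, and the per-word weights are a hypothetical dictionary standing in for the skip-gram-derived weighting (a near-zero weight plays the role of a stop word):

```python
import numpy as np

def triangular_window(size):
    """Symmetric triangular weights over the context window, peaking at the center."""
    half = size // 2
    return np.array([1.0 - abs(i - half) / (half + 1) for i in range(size)])

def ice_vector(context_words, V, weights):
    """Weighted mean of context-word embeddings:
    triangular window times per-word weight (low weight ~ stop word)."""
    win = triangular_window(len(context_words))
    w = np.array([weights[c] for c in context_words]) * win
    vecs = np.array([V[c] for c in context_words])
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

# hypothetical embeddings and word weights; 'the' gets near-zero weight
V = {'the': np.array([0.5, 0.5]), 'guitar': np.array([1.0, 0.0]),
     'solo': np.array([0.8, 0.2])}
weights = {'the': 0.01, 'guitar': 1.0, 'solo': 0.9}
v = ice_vector(['the', 'guitar', 'solo'], V, weights)
```

The stop word contributes almost nothing, so the instance vector stays close to the content words' direction.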
Figure: instances of 'paper' plotted using t-SNE, comparing ICE with the mean vector.
Proposed algorithm Train a skip-gram model on the corpus. Compute instance representations using ICE, one for each instance of the word in the corpus. Cluster using (nonparametric) k-means, with the cluster evaluation criterion of Pham et al. (2005). (Evaluation) Disambiguate test data using the obtained cluster centroids.
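The nonparametric step can be sketched as follows: the f(K) statistic follows Pham et al. (2005), while the k-means initialization and the toy instance vectors are assumptions of this sketch, not details from the paper:

```python
import numpy as np

def kmeans_sse(X, k, iters=100):
    """k-means with deterministic farthest-first init; returns total within-cluster SSE."""
    idx = [0]
    for _ in range(k - 1):  # farthest-first initialization
        d = np.min(np.linalg.norm(X[:, None] - X[idx][None], axis=2), axis=1)
        idx.append(int(d.argmax()))
    centroids = X[idx].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return ((X - centroids[labels]) ** 2).sum()

def f_stat(X, k_max):
    """f(K) of Pham et al. (2005); small values indicate a good number of clusters."""
    d = X.shape[1]
    S = [kmeans_sse(X, k) for k in range(1, k_max + 1)]  # S[k-1] = SSE with k clusters
    f, alpha = [1.0], None
    for k in range(2, k_max + 1):
        alpha = 1 - 3 / (4 * d) if k == 2 else alpha + (1 - alpha) / 6
        f.append(1.0 if S[k - 2] == 0 else S[k - 1] / (alpha * S[k - 2]))
    return f

# two well-separated toy clusters of hypothetical instance vectors
A = np.array([[0, 0], [0.5, 0], [0, 0.5], [0.5, 0.5], [0.25, 0.25]], dtype=float)
X = np.vstack([A, A + 5.0])
f = f_stat(X, k_max=5)
best_k = int(np.argmin(f)) + 1   # K with the smallest f(K)
```

On this toy data the statistic correctly selects two senses, so the number of clusters need not be fixed in advance.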
SemEval 2013 task 13 WSI: identify senses in ukWaC. WSD: disambiguate test words, assigning each to one of the induced senses. Evaluation: compare to the annotated WordNet labels.
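The WSD step reduces to nearest-centroid assignment; a sketch using cosine similarity, with hypothetical centroids and instance vectors:

```python
import numpy as np

def disambiguate(instance_vec, centroids):
    """Assign an instance to the induced sense whose centroid is most
    cosine-similar to the instance vector."""
    sims = centroids @ instance_vec / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(instance_vec))
    return int(sims.argmax())

# hypothetical sense centroids for 'rock': index 0 = music, index 1 = geology
centroids = np.array([[1.0, 0.1], [0.1, 1.0]])
sense_a = disambiguate(np.array([0.9, 0.2]), centroids)
sense_b = disambiguate(np.array([0.1, 0.8]), centroids)
```

The predicted sense indices are then mapped to the gold annotations for scoring.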
Detailed results SemEval 2013 – task 13
Conclusions Using skip-gram word embeddings clearly boosts the performance of WSI. The embeddings provide a semantic representation for each word and tell which context words are most important.
Figure: ICE profile.
Evaluation SemEval 2013 task 13, on ukWaC: 50 lemmas with 100 instances per lemma, annotated with WordNet senses.