Vector space word representations Rani Nelken, PhD Director of Research, Outbrain @RaniNelken
https://www.flickr.com/photos/hyku/295930906/in/photolist-EbXgJ-ajDBs8-9hevWb-s9HX1-5hZqnb-a1Jk8H-a1Mcx7-7QiUWL-6AFs53-9TRtkz-bqt2GQ-F574u-F56EA-3imqK7/
Words = atoms?
That would be crazy for numbers https://www.flickr.com/photos/proimos/4199675334/
The distributional hypothesis: What is a word? Wittgenstein (1953): “The meaning of a word is its use in the language.” Firth (1957): “You shall know a word by the company it keeps.”
From atomic symbols to vectors: map words to dense numerical vectors “representing” their contexts, so that words with similar contexts map to vectors with a small angle between them.
History: Hard clustering: Brown clustering. Soft clustering: LSA, random projections, LDA. Neural nets.
Feedforward Neural Net Language Model
Training: the input is the one-hot vector of each context word (0…0,1,0…0). We are trying to learn a vector for each word (the “projection”), such that the output is close to the one-hot vector of the target word w(t).
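A minimal numpy sketch of that projection step (the vocabulary size, dimension, and example word indices are illustrative assumptions): multiplying a one-hot vector by the projection matrix just selects a row.

    import numpy as np

    V, D = 10000, 100                 # assumed vocabulary size and embedding dimension
    P = 0.01 * np.random.randn(V, D)  # projection matrix: one D-dim vector per word

    def project(context_ids):
        # Multiplying a one-hot vector by P selects one row,
        # so "projecting" a context word is just a table lookup.
        return P[context_ids]         # shape: (len(context_ids), D)

    hidden = project([17, 42, 256]).mean(axis=0)  # combine the context vectors
    # A softmax output layer (not shown) is trained so that the predicted
    # distribution concentrates on the one-hot vector of the target word w(t).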
Simpler model: Word2Vec
What can we do with these representations? Plug them into your existing classifier, or, better, into further neural nets. This improves accuracy on many NLP tasks: named entity recognition, POS tagging, sentiment analysis, semantic role labeling.
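One common way to plug them into an existing classifier is to average a text's word vectors into a fixed-length feature vector; a sketch assuming a dict word_vec of pre-learned embeddings, scikit-learn, and hypothetical docs/labels data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def doc_features(tokens, word_vec, dim=100):
        # Average the vectors of the words we have embeddings for.
        vecs = [word_vec[t] for t in tokens if t in word_vec]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    # word_vec: {word: np.ndarray} of pre-learned embeddings (assumed available)
    # docs, labels: your existing labeled data
    # X = np.array([doc_features(d, word_vec) for d in docs])
    # clf = LogisticRegression().fit(X, labels)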
Back to cheese… cos(crumbled, cheese) = 0.042 cos(crumpled, cheese) = 0.203
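The cosine here is just the normalized dot product between two word vectors; a minimal version (the crumbled/crumpled/cheese vectors are assumed to come from a trained model):

    import numpy as np

    def cos(u, v):
        # Cosine of the angle between two word vectors.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # e.g. cos(vec['crumbled'], vec['cheese']) vs. cos(vec['crumpled'], vec['cheese'])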
And now for the magic http://en.wikipedia.org/wiki/Penn_%26_Teller#mediaviewer/File:Penn_and_Teller_(1988).jpg
“Magical” property: [Paris] - [France] + [Italy] ≈ [Rome]; [king] - [man] + [woman] ≈ [queen]. We can use it to solve word analogy problems: Boston : Red_Sox = New_York : ? Demo
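The demo can be reproduced with gensim's most_similar, which performs exactly this vector arithmetic; the pre-trained Google News file name below is an assumption (any word2vec-format file works):

    from gensim.models import KeyedVectors

    # Pre-trained embeddings (file name is an assumption).
    wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
    # 'queen' should rank at or near the top
    print(wv.most_similar(positive=['Red_Sox', 'New_York'], negative=['Boston'], topn=1))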
Why does it work? [king] - [man] + [woman] ≈ [queen]. For unit-length vectors, cos(x, [king] - [man] + [woman]) is proportional to cos(x, [king]) - cos(x, [man]) + cos(x, [woman]), so we want a word x that is close to [king] and [woman] but far from [man]; [queen] is a good candidate.
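Written out, the analogy answer is the word that maximizes this additive objective (3CosAdd in the Levy and Goldberg paper listed in the references); a LaTeX sketch, assuming all vectors are unit-normalized:

    x^{*} = \arg\max_{x \in V \setminus \{\mathrm{king},\, \mathrm{man},\, \mathrm{woman}\}}
            \left( \cos(x, \mathrm{king}) - \cos(x, \mathrm{man}) + \cos(x, \mathrm{woman}) \right)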
It doesn’t always work: London : England = Baghdad : ? We expect Iraq, but get Mosul. We’re looking for a word that is close to Baghdad and to England, but not to London.
Why did it fail? London : England = Baghdad : ? cos(Mosul, Baghdad) >> cos(Iraq, London): one very large term dominates the sum. Instead of adding the cosines, multiply them; this improves accuracy.
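The multiplicative combination is what that paper calls 3CosMul, and gensim exposes it as most_similar_cosmul; a sketch assuming the same pre-trained vectors as above:

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    # Additive combination (3CosAdd) vs. multiplicative combination (3CosMul).
    print(wv.most_similar(positive=['England', 'Baghdad'], negative=['London'], topn=3))
    print(wv.most_similar_cosmul(positive=['England', 'Baghdad'], negative=['London'], topn=3))
    # 3CosMul keeps one very large cosine from dominating, so Iraq tends to rank higher.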
Word2Vec: open source C implementation from Google; comes with pre-learned embeddings. Gensim: fast Python implementation.
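A minimal gensim training sketch (the toy corpus and parameters are assumptions; in gensim versions before 4.0 the vector_size parameter is called size):

    from gensim.models import Word2Vec

    # Any iterable of tokenized sentences works; this toy corpus is an assumption.
    sentences = [['the', 'cheese', 'was', 'crumbled'],
                 ['the', 'paper', 'was', 'crumpled']]

    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    print(model.wv.most_similar('cheese', topn=3))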
Active field of research Bilingual embeddings Joint word and image embeddings Embeddings for sentiment Phrase and document embeddings
Bigger picture: how can we make NLP less fragile? 90’s: Linguistic engineering 00’s: Feature engineering 10’s: Unsupervised preprocessing
References https://code.google.com/p/word2vec/ http://www.cs.bgu.ac.il/~yoavg/publications/conll2014analogies.pdf http://radimrehurek.com/2014/02/word2vec-tutorial/
Thanks @RaniNelken We’re hiring for NLP positions