1 Word2Vec CS246 Junghoo “John” Cho

2 Neural Language Model of [Mikolov 2010, 2011a, 2011b]
Follow-up work to [Bengio 2003] on neural language models
Used simple Recurrent Neural Network (RNN)
Recurrent structure allows looking at “longer” context in principle

3 f(w_1, …, w_n) of [Mikolov 2010, 2011a, 2011b]
v(t) = W w(t)
h(t) = sigmoid(v(t) + U h(t−1))
y(t) = V h(t)
f(t) = softmax(y(t))
where sigmoid(x) = 1 / (1 + e^(−x)) and softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
(Figure: w(t) and y(t) are [V]-dimensional, i.e. vocabulary-sized; v(t) and h(t) are [m]-dimensional hidden vectors; W, U, V are the weight matrices.)
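
As a minimal sketch (not the original implementation), the forward pass of this simple RNN language model could look like the following in numpy, with hypothetical weight matrices W, U, and V_out standing in for the slide's W, U, and V:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / e.sum()

# Hypothetical shapes: W is (m, V), U is (m, m), V_out is (V, m);
# w_t is the one-hot vector of the current word, h_prev the previous hidden state.
def rnn_lm_step(w_t, h_prev, W, U, V_out):
    v_t = W @ w_t                      # v(t) = W w(t): embed the one-hot input word
    h_t = sigmoid(v_t + U @ h_prev)    # h(t) = sigmoid(v(t) + U h(t-1)): recurrent hidden state
    y_t = V_out @ h_t                  # y(t) = V h(t): scores over the vocabulary
    f_t = softmax(y_t)                 # f(t): predicted distribution over the next word
    return f_t, h_t
```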

4 Result of [Mikolov 2010, 2011a, 2011b]
Great results on speech recognition tasks
But more importantly, in his next paper, Mikolov wondered: does the vector v mean something more?
Does it in any way capture some “semantic meaning” of words?
What does the “distance” between two word vectors represent, for example?

5 Observation in [Mikolov 2013a]
The difference between v_1 and v_2 captures the syntactic/semantic “relationship” between the two words!

6 Vector Difference Captures Relationship
v(king) − v(man) + v(woman) = v(queen)!!!
Experimental result:
Trained RNN with 640-dimensional word vectors on 320M words of Broadcast News data
35% accuracy on the syntactic relationship test (good:better – bad:worse)
8% accuracy on the semantic relationship test (London:England – Beijing:China)
First result showing that word vectors represent much more than what was expected
(Figure: man–woman and king–queen vector offsets)
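
For illustration, a minimal sketch of how such an analogy query can be evaluated with cosine similarity, assuming a hypothetical `vectors` dictionary mapping words to numpy arrays:

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to v(b) - v(a) + v(c),
    e.g. analogy('man', 'king', 'woman', vectors) should give 'queen'."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):          # exclude the query words themselves
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```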

7 Follow-up Work of [Mikolov 2013b]
Questions:
Can we do better? Would we get better results if a much larger dataset were used?
[Mikolov 2013b]: significantly simplified neural network models
Continuous Bag Of Words (CBOW) model
Continuous Skip-Gram model
Significant reduction in learning complexity, allowing much larger datasets to be used

8 CBOW Model: P(w | w_1, …, w_n)
v_i = W_1 w_i
h = Σ_{i=1..n} v_i
y = W_2^T h
f = softmax(y)
Much simpler model:
No recurrent structure
Addition of all context words
No internal nonlinearity
Softmax only on the final output
(Figure: the w_i and y are [V]-dimensional; the v_i and h are [m]-dimensional.)
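
A minimal numpy sketch of this CBOW forward pass, with hypothetical matrices W1 (m×V) and W2 (m×V) and one-hot context vectors; this illustrates the equations above, not Mikolov's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# context: list of one-hot vectors of size V for the context words w_1, ..., w_n
def cbow_forward(context, W1, W2):
    h = sum(W1 @ w_i for w_i in context)   # h = sum_i v_i, where v_i = W1 w_i
    y = W2.T @ h                            # y = W2^T h: scores over the vocabulary
    return softmax(y)                       # f = softmax(y): P(w | w_1, ..., w_n)
```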

9 Skip-Gram Model: P(w_1, …, w_n | w)
v = W_1 w
y_i = W_2^T v
f_i = softmax(y_i)
Again a much simpler model:
No recurrent structure
Shared W_2^T for all context words
No internal nonlinearity
Weighted sampling from context words to reduce complexity
(Figure: w and the outputs are [V]-dimensional; v is [m]-dimensional.)
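
A corresponding sketch of the Skip-Gram forward pass under the same hypothetical shapes; since W_2^T is shared, the same output distribution is scored against every context word:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# w: one-hot vector of the center word; W1 and W2 are (m, V) matrices
def skipgram_forward(w, W1, W2):
    v = W1 @ w                  # v = W1 w: vector of the center word
    y = W2.T @ v                # y = W2^T v: shared scores for every context position
    return softmax(y)           # f_i = softmax(y): used for each context word w_i
```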

10 Result of [Mikolov 2013b] (1)
Trained 640-dimensional word vectors on 320M words of input data with an 82K vocabulary
60% accuracy on the syntactic test and 55% accuracy on the semantic test!

Model       Semantic Accuracy   Syntactic Accuracy
RNN         9%                  36%
NN          23%                 53%
CBOW        24%                 64%
Skip-Gram   55%                 59%

11 Result of [Mikolov 2013b] (2)
Trained 1000-dimensional word vectors on 6B words of input data for ~2 days on ~150 CPUs
Higher-dimensional vector representations and training on a larger dataset improve accuracy significantly

Model       Semantic Accuracy   Syntactic Accuracy
CBOW        57%                 69%
Skip-Gram   66%                 65%

12 Example Results from [Mikolov 2013b]

13 Follow-up Work of [Mikolov 2013c]
Further optimizations of the training process:
Hierarchical Softmax
Negative Sampling
Frequent Word Subsampling

14 Hierarchical Softmax
Use a binary tree to represent all words in the vocabulary
A weight vector is associated with each internal node of the tree
Avoids updating weights for all V vectors for each training instance
Updates weights for only log(V) vectors as opposed to V vectors
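
A minimal sketch of the idea, assuming a hypothetical `path` encoding (root-to-leaf internal nodes plus branch codes for the target word) and one weight vector per internal node; only the log(V) vectors on this path are touched per training instance:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# path: list of (internal_node_id, branch) pairs from the root to the word's
# leaf, with branch in {0, 1}. node_vecs holds one weight vector per internal
# node; h is the hidden/context vector.
def hierarchical_softmax_prob(path, node_vecs, h):
    prob = 1.0
    for node_id, branch in path:
        s = sigmoid(np.dot(node_vecs[node_id], h))
        prob *= s if branch == 1 else (1.0 - s)  # probability of taking this branch
    return prob
```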

15 Negative Sampling
An alternative way to avoid updating all V vector weights for each training instance
Sample just a few words in the output vector!
Sampling method:
Keep the “positive” output words
Sample only a few (between 5 and 15) “negative” output words
Sampling distribution: P(w_i) ∝ f(w_i)^(3/4)
The log-likelihood function for training was slightly revised to make the optimization process work
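
A minimal sketch of drawing negatives from the f(w)^(3/4) distribution, with a hypothetical `counts` dictionary of corpus word frequencies:

```python
import numpy as np

# counts: dict mapping each vocabulary word to its corpus count
def negative_sampling_table(counts, power=0.75):
    words = list(counts)
    weights = np.array([counts[w] for w in words], dtype=float) ** power
    return words, weights / weights.sum()   # P(w_i) proportional to f(w_i)^(3/4)

def sample_negatives(words, probs, k=5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # draw k "negative" words; a full implementation would also re-draw any
    # sample that collides with the observed "positive" word
    return list(rng.choice(words, size=k, p=probs))
```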

16 Frequent Word Subsampling
During training, sample fewer instances from frequent words
Sampling frequency: P(w_i) = √(t / f(w_i))  (t = 10^(−5) in experiments)
Reduces both (1) training time and (2) the influence of frequent words on the final vector representation
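
A minimal sketch of this subsampling step, assuming the keep-probability form √(t / f(w)) with threshold t = 10⁻⁵ and a hypothetical `freqs` dictionary of relative word frequencies:

```python
import numpy as np

# freqs: dict mapping each word to its relative frequency f(w) in the corpus
def keep_probability(word, freqs, t=1e-5):
    # frequent words (f(w) >> t) are rarely kept; rare words (f(w) <= t) are always kept
    return min(1.0, np.sqrt(t / freqs[word]))

def subsample(tokens, freqs, t=1e-5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    return [w for w in tokens if rng.random() < keep_probability(w, freqs, t)]
```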

17 Result of [Mikolov 2013c]
Trained a 300-dimensional Skip-gram model on 1B words of input data with a 692K vocabulary

Without subsampling:
Method      Semantic Accuracy   Syntactic Accuracy   Time
NEG-5       54%                 63%                  38 min
NEG-15      58%                 –                    97 min
H-Softmax   40%                 53%                  41 min

With t = 10^(−5) subsampling:
Method      Semantic Accuracy   Syntactic Accuracy   Time
NEG-5       58%                 61%                  14 min
NEG-15      –                   –                    36 min
H-Softmax   59%                 52%                  21 min

18 Summary
Amazing results. Why does it work so well?
Does our brain represent words as vectors in a high-dimensional space?
I do not have first-hand experience of how well they “really” work
Are neural-model vectors really better than vectors from other models, such as LSI and LDA? I don’t know
word2vec code and trained vector representations of words are all publicly available
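
For example, a minimal sketch of training and querying such vectors with the publicly available gensim implementation (gensim 4.x parameter names; the toy corpus below is only for illustration, real results need billions of words):

```python
from gensim.models import Word2Vec

# toy corpus: an iterable of tokenized sentences
sentences = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
    ["a", "man", "and", "a", "woman"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the word vectors
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative sampling with 5 noise words
    sample=1e-5,      # frequent-word subsampling threshold t
    window=5,
    min_count=1,      # keep every word in this toy corpus
)

# analogy query: king - man + woman ~ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```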

19 References
[Mikolov 2010] Recurrent neural network based language model
[Mikolov 2011a] Strategies for training large scale neural network language models
[Mikolov 2011b] Extensions of recurrent neural network language model
[Mikolov 2013a] Linguistic regularities in continuous space word representations
[Mikolov 2013b] Efficient estimation of word representations in vector space
[Mikolov 2013c] Distributed representations of words and phrases and their compositionality

