1 Word2Vec CS246 Junghoo “John” Cho

2 Neural Language Model of [Mikolov 2010, 2011a, 2011b]
Follow-up work to [Bengio 2003] on neural language models
Used simple Recurrent Neural Network (RNN)
Recurrent structure allows looking at “longer” context in principle

3 f(w_1, …, w_n) of [Mikolov 2010, 2011a, 2011b]
v(t) = W w(t)
h(t) = sigmoid(v(t) + U h(t−1))
y(t) = V h(t)
f(t) = softmax(y(t))
where sigmoid(x) = 1 / (1 + e^(−x)) and softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
(Figure: w(t) and y(t) are [V]-dimensional, i.e. vocabulary-sized; v(t) and h(t) are [m]-dimensional hidden vectors; W, U, V are the weight matrices.)
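
As a minimal sketch (not the original implementation), the forward pass of this simple RNN language model could look like the following in numpy, with hypothetical weight matrices W, U, and V_out standing in for the slide's W, U, and V:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / e.sum()

# Hypothetical shapes: W is (m, V), U is (m, m), V_out is (V, m);
# w_t is the one-hot vector of the current word, h_prev the previous hidden state.
def rnn_lm_step(w_t, h_prev, W, U, V_out):
    v_t = W @ w_t                      # v(t) = W w(t): embed the one-hot input word
    h_t = sigmoid(v_t + U @ h_prev)    # h(t) = sigmoid(v(t) + U h(t-1)): recurrent hidden state
    y_t = V_out @ h_t                  # y(t) = V h(t): scores over the vocabulary
    f_t = softmax(y_t)                 # f(t): predicted distribution over the next word
    return f_t, h_t
```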

4 Result of [Mikolov 2010, 2011a, 2011b]
Great results on speech recognition tasks
But more importantly, in his next paper, Mikolov wondered: does the vector v mean something more?
Does it in any way capture some “semantic meaning” of words?
What does the “distance” between two word vectors represent, for example?

5 Observation in [Mikolov 2013a]
The difference between v_1 and v_2 captures the syntactic/semantic “relationship” between the two words!

6 Vector Difference Captures Relationship
v(king) − v(man) + v(woman) = v(queen)!!!
Experimental result:
Trained RNN with 640-dimensional word vectors on 320M words of Broadcast News data
35% accuracy on the syntactic relationship test (good:better – bad:worse)
8% accuracy on the semantic relationship test (London:England – Beijing:China)
First result showing that word vectors represent much more than what was expected
(Figure: man–woman and king–queen vector offsets)
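
For illustration, a minimal sketch of how such an analogy query can be evaluated with cosine similarity, assuming a hypothetical `vectors` dictionary mapping words to numpy arrays:

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to v(b) - v(a) + v(c),
    e.g. analogy('man', 'king', 'woman', vectors) should give 'queen'."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):          # exclude the query words themselves
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```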

7 Follow-up Work of [Mikolov 2013b]
Questions:
Can we do better? Would we get better results if a much larger dataset were used?
[Mikolov 2013b]: significantly simplified neural network models
Continuous Bag Of Words (CBOW) model
Continuous Skip-Gram model
Significant reduction in learning complexity, allowing much larger datasets to be used

8 CBOW Model: P(w | w_1, …, w_n)
v_i = W_1 w_i
h = Σ_{i=1..n} v_i
y = W_2^T h
f = softmax(y)
Much simpler model:
No recurrent structure
Addition of all context words
No internal nonlinearity
Softmax only on the final output
(Figure: the w_i and y are [V]-dimensional; the v_i and h are [m]-dimensional.)
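
A minimal numpy sketch of this CBOW forward pass, with hypothetical matrices W1 (m×V) and W2 (m×V) and one-hot context vectors; this illustrates the equations above, not Mikolov's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# context: list of one-hot vectors of size V for the context words w_1, ..., w_n
def cbow_forward(context, W1, W2):
    h = sum(W1 @ w_i for w_i in context)   # h = sum_i v_i, where v_i = W1 w_i
    y = W2.T @ h                            # y = W2^T h: scores over the vocabulary
    return softmax(y)                       # f = softmax(y): P(w | w_1, ..., w_n)
```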

9 Skip-Gram Model: P(w_1, …, w_n | w)
v = W_1 w
y_i = W_2^T v
f_i = softmax(y_i)
Again a much simpler model:
No recurrent structure
Shared W_2^T for all context words
No internal nonlinearity
Weighted sampling from context words to reduce complexity
(Figure: w and the outputs are [V]-dimensional; v is [m]-dimensional.)
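
A corresponding sketch of the Skip-Gram forward pass under the same hypothetical shapes; since W_2^T is shared, the same output distribution is scored against every context word:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# w: one-hot vector of the center word; W1 and W2 are (m, V) matrices
def skipgram_forward(w, W1, W2):
    v = W1 @ w                  # v = W1 w: vector of the center word
    y = W2.T @ v                # y = W2^T v: shared scores for every context position
    return softmax(y)           # f_i = softmax(y): used for each context word w_i
```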

10 Result of [Mikolov 2013b] (1)
Trained 640-dimensional word vectors on 320M words of input data with an 82K vocabulary
60% accuracy on the syntactic test and 55% accuracy on the semantic test!

Model       Semantic Accuracy   Syntactic Accuracy
RNN         9%                  36%
NN          23%                 53%
CBOW        24%                 64%
Skip-Gram   55%                 59%

11 Result of [Mikolov 2013b] (2)
Trained 1000-dimensional word vectors on 6B words of input data for ~2 days on ~150 CPUs
Higher-dimensional vector representations and training on a larger dataset improve accuracy significantly

Model       Semantic Accuracy   Syntactic Accuracy
CBOW        57%                 69%
Skip-Gram   66%                 65%

12 Example Results from [Mikolov 2013b]

13 Follow-up Work of [Mikolov 2013c]
Further optimizations of the training process:
Hierarchical Softmax
Negative Sampling
Frequent Word Subsampling

14 Hierarchical Softmax
Use a binary tree to represent all words in the vocabulary
A weight vector is associated with each internal node of the tree
Avoids updating weights for all V vectors for each training instance
Updates weights for only log(V) vectors as opposed to V vectors
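
A minimal sketch of the idea, assuming a hypothetical `path` encoding (root-to-leaf internal nodes plus branch codes for the target word) and one weight vector per internal node; only the log(V) vectors on this path are touched per training instance:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# path: list of (internal_node_id, branch) pairs from the root to the word's
# leaf, with branch in {0, 1}. node_vecs holds one weight vector per internal
# node; h is the hidden/context vector.
def hierarchical_softmax_prob(path, node_vecs, h):
    prob = 1.0
    for node_id, branch in path:
        s = sigmoid(np.dot(node_vecs[node_id], h))
        prob *= s if branch == 1 else (1.0 - s)  # probability of taking this branch
    return prob
```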

15 Negative Sampling
An alternative way to avoid updating all V vector weights for each training instance
Sample just a few words in the output vector!
Sampling method:
Keep the “positive” output words
Sample only a few (between 5 and 15) “negative” output words
Sampling distribution: P(w_i) ∝ f(w_i)^(3/4)
The log-likelihood function for training was slightly revised to make the optimization process work
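
A minimal sketch of drawing negatives from the f(w)^(3/4) distribution, with a hypothetical `counts` dictionary of corpus word frequencies:

```python
import numpy as np

# counts: dict mapping each vocabulary word to its corpus count
def negative_sampling_table(counts, power=0.75):
    words = list(counts)
    weights = np.array([counts[w] for w in words], dtype=float) ** power
    return words, weights / weights.sum()   # P(w_i) proportional to f(w_i)^(3/4)

def sample_negatives(words, probs, k=5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # draw k "negative" words; a full implementation would also re-draw any
    # sample that collides with the observed "positive" word
    return list(rng.choice(words, size=k, p=probs))
```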

16 Frequent Word Subsampling
During training, sample fewer instances from frequent words
Sampling frequency: P(w_i) = √(t / f(w_i))  (t = 10^(−5) in experiments)
Reduces both (1) training time and (2) the influence of frequent words on the final vector representation
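
A minimal sketch of this subsampling step, assuming the keep-probability form √(t / f(w)) with threshold t = 10⁻⁵ and a hypothetical `freqs` dictionary of relative word frequencies:

```python
import numpy as np

# freqs: dict mapping each word to its relative frequency f(w) in the corpus
def keep_probability(word, freqs, t=1e-5):
    # frequent words (f(w) >> t) are rarely kept; rare words (f(w) <= t) are always kept
    return min(1.0, np.sqrt(t / freqs[word]))

def subsample(tokens, freqs, t=1e-5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    return [w for w in tokens if rng.random() < keep_probability(w, freqs, t)]
```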

17 Result of [Mikolov 2013c]
Trained a 300-dimensional Skip-gram model on 1B words of input data with a 692K vocabulary

Without subsampling:
Method      Semantic Accuracy   Syntactic Accuracy   Time
NEG-5       54%                 63%                  38 min
NEG-15      58%                 –                    97 min
H-Softmax   40%                 53%                  41 min

With t = 10^(−5) subsampling:
Method      Semantic Accuracy   Syntactic Accuracy   Time
NEG-5       58%                 61%                  14 min
NEG-15      –                   –                    36 min
H-Softmax   59%                 52%                  21 min

18 Summary
Amazing results. Why does it work so well?
Does our brain represent words as vectors in a high-dimensional space?
I do not have first-hand experience of how well they “really” work
Are neural-model vectors really better than vectors from other models, such as LSI and LDA? I don’t know
word2vec code and trained vector representations of words are all publicly available
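
For example, a minimal sketch of training and querying such vectors with the publicly available gensim implementation (gensim 4.x parameter names; the toy corpus below is only for illustration, real results need billions of words):

```python
from gensim.models import Word2Vec

# toy corpus: an iterable of tokenized sentences
sentences = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
    ["a", "man", "and", "a", "woman"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the word vectors
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative sampling with 5 noise words
    sample=1e-5,      # frequent-word subsampling threshold t
    window=5,
    min_count=1,      # keep every word in this toy corpus
)

# analogy query: king - man + woman ~ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```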

19 References
[Mikolov 2010] Recurrent neural network based language model
[Mikolov 2011a] Strategies for training large scale neural network language models
[Mikolov 2011b] Extensions of recurrent neural network language model
[Mikolov 2013a] Linguistic regularities in continuous space word representations
[Mikolov 2013b] Efficient estimation of word representations in vector space
[Mikolov 2013c] Distributed representations of words and phrases and their compositionality

