1
Word2Vec
Sivan Biham & Adam Yaari
2
What we'll do today
First Part: Word embedding introduction; N-Gram; Skip-Gram; SoftMax; Hierarchical SoftMax.
Second Part: Negative Sampling; Subsampling of frequent words; Evaluation; Additive compositionality; SGNS and PMI; Learning phrases.
3
What we'll do today
4
Motivation Natural Language Processing (NLP) applications and products:
5
What is NLP?
A way for computers to analyze, understand, and derive meaning from human language.
Applications: text mining, machine translation, automated question answering, automatic text summarization, and many more…
6
Word embeddings
For a computer to be able to "understand" text, a vector representation of each word is required.
Examples of early approaches: one-hot vectors, joint distributions.
7
Curse of Dimensionality
Mock example: vocabulary size of 100,000 unique words; window size of 10 context words (unidirectional).
One-hot vector: 100,000 free parameters, and no knowledge of words' semantic or syntactic relations.
Joint distribution: 100,000^10 - 1 = 10^50 - 1 free parameters.
A Neural Probabilistic Language Model. Bengio et al., 2003
8
Curse of Dimensionality
Similar words should be close to each other in the high-dimensional space; dissimilar words should be far apart.
One-hot vector example: Apple = [1, 0, 0], Orange = [0, 1, 0], Plane = [0, 0, 1].
[Figure: one-hot vectors vs. the wanted behavior, where Apple and Orange end up close together and Plane far away]
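To make the contrast concrete, here is a minimal sketch (the dense Apple and Orange values reuse the toy vectors from the SoftMax table later in the deck; the dense Plane vector is made up for illustration). It shows that one-hot vectors give zero cosine similarity for every pair, while dense vectors can place Apple and Orange close together.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot vectors from the slide: every pair is orthogonal,
# so their cosine similarity is 0 regardless of meaning.
apple, orange, plane = np.eye(3)
print(cosine(apple, orange), cosine(apple, plane))   # 0.0  0.0

# Dense vectors (toy values): similar words can now be close.
apple_d  = np.array([0.9, 0.5, 0.8])
orange_d = np.array([0.9, 0.4, 0.9])
plane_d  = np.array([0.1, 0.9, 0.2])                 # hypothetical
print(cosine(apple_d, orange_d))                     # high (~0.99)
print(cosine(apple_d, plane_d))                      # noticeably lower
```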
9
Word Distributed Representation
Many-to-many representation of words: all vector cells participate in representing each word.
Words are represented by real-valued dense vectors of significantly smaller dimension (e.g. 100-1000).
Intuition: consider each vector cell as a representative of some feature.
The amazing power of word vectors, The Morning Paper blog, Adrian Colyer
10
What we'll do today
11
N-Gram Model
A good statistical model for NLP is the conditional probability of the next word w given its previous ones, i.e.:
$$p(w_1 w_2 \dots w_n) = p(w_1^n) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1^2)\cdots p(w_n \mid w_1^{n-1}) = \prod_{t=1}^{n} p(w_t \mid w_1^{t-1})$$
This allows us to take advantage of both word order and the fact that temporally closer words have a stronger dependency.
A Neural Probabilistic Language Model. Bengio et al., 2003
12
N-Gram Model
But, by the Markov assumption (that we can predict the probability of some future unit without looking too far into the past), this can be simplified to:
$$p(w_1^n) \approx \prod_{t=1}^{n} p(w_t \mid w_{t-N+1}^{t-1})$$
This yields the N-gram model, which predicts the word w from the N - 1 words preceding it (a.k.a. the context of the word).
A Neural Probabilistic Language Model. Bengio et al., 2003
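As a small illustration of the N-gram idea (not part of the original slides), the maximum-likelihood estimate of $p(w_t \mid w_{t-N+1}^{t-1})$ is just a ratio of counts. The corpus and function names below are invented for the sketch.

```python
from collections import Counter

def ngram_model(tokens, N=2):
    """Maximum-likelihood N-gram model:
    p(w | history) = count(history + w) / count(history)."""
    grams = Counter(tuple(tokens[i:i + N]) for i in range(len(tokens) - N + 1))
    hists = Counter(tuple(tokens[i:i + N - 1]) for i in range(len(tokens) - N + 2))

    def prob(word, history):
        history = tuple(history[-(N - 1):])
        return grams[history + (word,)] / hists[history] if hists[history] else 0.0

    return prob

corpus = "the dog barked at the dog in the street".split()
p = ngram_model(corpus, N=2)       # bigram model
print(p("dog", ["the"]))           # count('the dog') / count('the') = 2/3
```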
13
Back to Word Distributed Representation
Continuous Bag-of-Words (CBOW): as in N-Gram, predicts a word given its context (bidirectional).
Skip-Gram: the opposite of N-Gram, predicts the context given a word (bidirectional).
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013
14
What we'll do today
15
Skip-Gram Model
Trained with a two-layer neural network and stochastic gradient ascent.
Projects a word into a continuous space. Word vector size: 100-1000 dimensions (typically).
Words that appear together have a high dot product.
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013
16
Skip-Gram Model
Skip-Gram maximizes the log-likelihood:
$$\frac{1}{T}\sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$
Where: T - # of words in the corpus; c - unidirectional window size of the context.
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013
17
What we'll do today
18
SoftMax
The basic Skip-Gram utilizes the SoftMax function:
$$p(c \mid w) = \frac{\exp\left({v'_c}^{\top} v_w\right)}{\sum_{i=1}^{T} \exp\left({v'_i}^{\top} v_w\right)}$$
Where: T - # of words in the corpus; $v_w$ - input vector of w; $v'_w$ - output vector of w.

Word | Input | Output
King | [0.2, 0.9, 0.1] | [0.5, 0.4, 0.5]
Queen | [0.2, 0.8, 0.2] | [0.4, 0.5, 0.5]
Apple | [0.9, 0.5, 0.8] | [0.3, 0.9, 0.1]
Orange | [0.9, 0.4, 0.9] | [0.1, 0.7, 0.2]

Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013
19
Skip-Gram & SoftMax
20
Skip-Gram & SoftMax
For example, consider the following sentence:
"If a dog chews shoes, whose shoes does he choose?"
Input word: shoes; window size: 2.
Window: dog chews [shoes] whose shoes
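A minimal sketch of how such (input, context) training pairs could be generated; the function name and whitespace tokenization are assumptions, but the window around the first "shoes" reproduces the slide's example.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (input word, context word) pairs with a symmetric window."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

sentence = "if a dog chews shoes whose shoes does he choose".split()
# Pairs whose input word is "shoes"; the first four come from the slide's window:
print([p for p in skipgram_pairs(sentence) if p[0] == "shoes"][:4])
# [('shoes', 'dog'), ('shoes', 'chews'), ('shoes', 'whose'), ('shoes', 'shoes')]
```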
21
SoftMax
The basic Skip-Gram utilizes the SoftMax function:
$$p(c \mid w) = \frac{\exp\left({v'_c}^{\top} v_w\right)}{\sum_{i=1}^{W} \exp\left({v'_i}^{\top} v_w\right)}$$
Where: W - number of words in the vocabulary; $v_w$ - input vector of w; $v'_w$ - output vector of w.

Word | Input | Output
King | [0.2, 0.9, 0.1] | [0.5, 0.4, 0.5]
Queen | [0.2, 0.8, 0.2] | [0.4, 0.5, 0.5]
Apple | [0.9, 0.5, 0.8] | [0.3, 0.9, 0.1]
Orange | [0.9, 0.4, 0.9] | [0.1, 0.7, 0.2]

Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013
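A toy sketch of this SoftMax using the vectors from the table above (the 3-dimensional vectors are illustrative only; real models use 100-1000 dimensions). It shows that p(c | w) is a proper distribution over the whole vocabulary.

```python
import numpy as np

# Input / output vectors copied from the slide's toy table.
v_in = {"King":  [0.2, 0.9, 0.1], "Queen":  [0.2, 0.8, 0.2],
        "Apple": [0.9, 0.5, 0.8], "Orange": [0.9, 0.4, 0.9]}
v_out = {"King":  [0.5, 0.4, 0.5], "Queen":  [0.4, 0.5, 0.5],
         "Apple": [0.3, 0.9, 0.1], "Orange": [0.1, 0.7, 0.2]}

def p_context_given_word(c, w):
    """SoftMax: p(c | w) = exp(v'_c . v_w) / sum over the vocabulary of exp(v'_i . v_w)."""
    scores = {u: np.dot(v_out[u], v_in[w]) for u in v_out}
    exps = {u: np.exp(s) for u, s in scores.items()}
    return exps[c] / sum(exps.values())

print(p_context_given_word("Queen", "King"))
print(sum(p_context_given_word(c, "King") for c in v_out))   # sums to 1.0
```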
22
What we'll do today
23
Hierarchical SoftMax
SoftMax is impractical because the cost of computing $\nabla \log p(c \mid w)$ is proportional to the vocabulary size W, which is often large ($10^5$-$10^7$ terms).
Hierarchical SoftMax can speed up word prediction tasks by a factor of 50 and more.
It computes the normalized probability of a context word as a path in a binary tree, instead of normalizing over the whole vocabulary.
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013
24
Hierarchical SoftMax
The W vocabulary words are represented as leaves and, together with W - 1 internal nodes, form a full binary tree.
Each word w has a single vector representation $v_w$, and each internal node n is represented by a vector $v_n$.
The path from the root to a leaf is used to estimate the probability of the word represented by that leaf.
It can be shown that the probabilities of all leaves sum up to 1.
word2vec Parameter Learning Explained, Xin Rong, 2016
25
Hierarchical SoftMax
The probability of a word being the context word is defined as:
$$p(c \mid w) = \prod_{j=1}^{L(c)-1} \sigma\!\left( [\![\, n(c, j+1) = \mathrm{ch}(n(c, j)) \,]\!] \cdot v_{n(c,j)}^{\top} v_w \right)$$
Where:
$n(w, j)$ - the j-th node on the path from the root to w.
$L(w)$ - the length of the above-mentioned path.
$\mathrm{ch}(n)$ - the left child of node n.
$[\![ x ]\!] = 1$ if x is true, $-1$ otherwise.
$\sigma(x) = \frac{1}{1 + e^{-x}}$
$n(w, 1) = \mathrm{root}$ and $n(w, L(w)) = w$.
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013
26
Hierarchical SoftMax
Suppose we want to compute the probability of $w_2$ being the output word.
The probabilities of going left/right at an internal node n are:
$$p(n, \mathrm{left}) = \sigma\!\left(v_n^{\top} v_w\right), \qquad p(n, \mathrm{right}) = 1 - \sigma\!\left(v_n^{\top} v_w\right) = \sigma\!\left(-v_n^{\top} v_w\right)$$
Following the path (left, left, right) to $w_2$:
$$p(w_2 = c) = p\big(n(w_2,1), \mathrm{left}\big)\, p\big(n(w_2,2), \mathrm{left}\big)\, p\big(n(w_2,3), \mathrm{right}\big) = \sigma\!\left(v_{n(w_2,1)}^{\top} v_w\right)\,\sigma\!\left(v_{n(w_2,2)}^{\top} v_w\right)\,\sigma\!\left(-v_{n(w_2,3)}^{\top} v_w\right)$$
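A small sketch of this computation (the node vectors are invented; only the left/left/right path structure comes from the slide): the leaf probability is a product of sigmoids, one per internal node on the path.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_path_probability(node_vectors, go_left, v_w):
    """Hierarchical SoftMax: multiply sigma(+v_n . v_w) for a left turn
    and sigma(-v_n . v_w) for a right turn along the root-to-leaf path."""
    p = 1.0
    for v_n, left in zip(node_vectors, go_left):
        sign = 1.0 if left else -1.0
        p *= sigmoid(sign * np.dot(v_n, v_w))
    return p

# Hypothetical 2-d vectors; the path to w2 in the example is left, left, right.
v_w = np.array([0.4, 0.1])
nodes = [np.array([0.3, 0.9]), np.array([-0.2, 0.5]), np.array([0.7, -0.1])]
print(hs_path_probability(nodes, go_left=[True, True, False], v_w=v_w))
```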
27
Huffman Trees
Just by using a binary tree, the performance of Skip-Gram improves from O(W) to O(log W) on average (and even in the worst case, if the tree is balanced).
But it can be improved even further using a Huffman tree:
Originally designed to compress the binary code of a given text.
A full binary tree that guarantees a minimal average weighted path length when some words are used much more frequently than others.
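For reference, a minimal sketch of Huffman-code construction (the standard algorithm, not taken from the slides): the two least frequent nodes are merged repeatedly, so frequent words end up with short root-to-leaf paths.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Return a binary code (root-to-leaf path) per word; frequent words get short codes."""
    ties = count()                                   # tie-breaker for equal frequencies
    heap = [(f, next(ties), [w]) for w, f in freqs.items()]
    heapq.heapify(heap)
    codes = {w: "" for w in freqs}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)            # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        for w in left:                               # left branch prepends '0'
            codes[w] = "0" + codes[w]
        for w in right:                              # right branch prepends '1'
            codes[w] = "1" + codes[w]
        heapq.heappush(heap, (f1 + f2, next(ties), left + right))
    return codes

print(huffman_codes({"the": 50, "a": 30, "dog": 8, "barked": 3, "chews": 2}))
# "the" and "a" get the shortest codes, i.e. the shortest paths in the tree.
```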
28
Hierarchical SoftMax + Huffman Trees
Empirical performance improved even further (by roughly 30%) when using Huffman trees: common words, such as "the" or "a", get very short paths, so their probability computation time is close to O(1).
[Figure: Huffman-coded tree with per-node branching probabilities, from word2vec Parameter Learning Explained, Xin Rong, 2016]
29
Break Time!!!
30
What we'll do today
31
How do we choose negative samples?
For each positive example we draw K negative examples.
The negative examples are drawn according to the unigram distribution of the data: #(c), the number of times the context word c appears in D, divided by |D|, gives the probability of sampling c as a negative example.
Very frequent words are, with high probability, conjunctions. These words appear in specific places in the sentence, so pairing them with an arbitrary word can create a plausible negative sample.
Number of negative samples K: 2-5 for large training sets, 5-20 for smaller ones.
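A sketch of drawing negatives from the plain unigram distribution described above (note that the released word2vec code actually raises unigram counts to the 3/4 power; the sketch follows the slide). The corpus and K are illustrative.

```python
import random
from collections import Counter

def make_negative_sampler(corpus_tokens, seed=0):
    """Sample negative context words with probability #(c) / |D| (unigram distribution)."""
    counts = Counter(corpus_tokens)
    words = list(counts)
    weights = [counts[w] for w in words]      # proportional to corpus frequency
    rng = random.Random(seed)
    return lambda K: rng.choices(words, weights=weights, k=K)

corpus = "the dog barked at the cat and the cat ran".split()
draw = make_negative_sampler(corpus)
print(draw(5))    # frequent words like "the" are drawn more often
```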
32
Negative Sampling
We would like to know whether $(w, c) \in D$.
$p(D = 1 \mid w, c)$ is the probability that $(w, c) \in D$.
$p(D = 0 \mid w, c) = 1 - p(D = 1 \mid w, c)$ is the probability that $(w, c) \notin D$.
For a randomly drawn negative pair, $p(D = 1 \mid w, c)$ must be low, so $p(D = 0 \mid w, c)$ will be high.
word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method. Yoav Goldberg and Omer Levy
33
Objective
$$\sum_{(w,c)\in D} \log \sigma\!\left(v_c^{\top} v_w\right) + \sum_{(w,c)\in D'} \log \sigma\!\left(-v_c^{\top} v_w\right)$$
where D' is the set of randomly drawn negative pairs. And for one sample:
$$\log \sigma\!\left(v_c^{\top} v_w\right) + \sum_{i=1}^{K} \log \sigma\!\left(-v_{c_i}^{\top} v_w\right)$$
which costs only O(K) to compute.
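A sketch of evaluating that per-sample objective (the vector values are hypothetical); in training it would be maximized by stochastic gradient ascent over $v_w$, the positive $v_c$ and the K negative vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_sample_objective(v_w, v_c_pos, v_c_negs):
    """log sigma(v_c . v_w) + sum over K negatives of log sigma(-v_ci . v_w): O(K) terms."""
    obj = np.log(sigmoid(np.dot(v_c_pos, v_w)))
    for v_c in v_c_negs:
        obj += np.log(sigmoid(-np.dot(v_c, v_w)))
    return obj

v_w  = np.array([0.2, 0.9, 0.1])                                    # hypothetical input vector
pos  = np.array([0.4, 0.5, 0.5])                                    # observed context
negs = [np.array([0.3, -0.9, 0.1]), np.array([-0.1, 0.7, -0.2])]    # K = 2 negatives
print(sgns_sample_objective(v_w, pos, negs))
```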
34
What we'll do today: Subsampling of frequent words
35
Subsampling of frequent words
Very frequent words usually provide less information value than rare words.
The subsampling method presented in Mikolov et al. (2013) randomly removes words that are more frequent than some threshold t. Each occurrence of word $w_i$ is discarded with probability
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
where $f(w_i)$ is the word's corpus frequency.
If $f(w_i) = t$ the probability is 0; as f gets bigger the square root gets smaller and the probability higher. Meaning: the more frequent the word, the higher the probability of subsampling it.
Another (important) implementation detail of subsampling in word2vec is that the random removal of tokens is done before the corpus is processed into word-context pairs (i.e. before word2vec is actually run). This essentially enlarges the context window for many tokens, because they can now reach words that were not in their original windows.
The threshold t is recommended to be $10^{-5}$, as we will see in the results. When using subsampling they report improvements in both training speed and accuracy.
Distributed representations of words and phrases and their compositionality. Mikolov et al.
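A sketch of this subsampling step (the names and toy corpus are invented; on a real corpus t would be around 1e-5 as recommended above).

```python
import math
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Drop each occurrence of word w with probability 1 - sqrt(t / f(w)),
    where f(w) is w's relative corpus frequency. Done before pair extraction."""
    freqs = Counter(tokens)
    total = len(tokens)
    rng = random.Random(seed)
    kept = []
    for w in tokens:
        f = freqs[w] / total
        p_drop = max(0.0, 1.0 - math.sqrt(t / f))
        if rng.random() >= p_drop:
            kept.append(w)
    return kept

corpus = ("the cat sat on the mat because the mat was warm " * 1000).split()
thinned = subsample(corpus, t=0.1)      # toy corpus, so t is much larger than 1e-5
print(len(corpus), len(thinned), thinned.count("the") / corpus.count("the"))
```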
36
What we'll do today
37
How to evaluate the representation?
Analogical reasoning task:
"The Montreal Canadiens are a professional ice hockey team based in Montreal."
"The Toronto Maple Leafs are a professional ice hockey team based in Toronto."
38
How to evaluate the representation?
A correct answer: "Madrid" : "Spain" is like "Paris" : "France".
vec("Madrid") - vec("Spain") + vec("France") = X, and the answer is correct if X ≈ vec("Paris").
The distance measure used is cosine distance.
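A minimal sketch of this evaluation; the 2-d embeddings are hypothetical (chosen so the analogy works out), while real word2vec vectors have hundreds of dimensions.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

vectors = {                                   # hypothetical toy embeddings
    "Madrid": np.array([0.9, 0.1]), "Spain":  np.array([0.5, 0.1]),
    "Paris":  np.array([0.9, 0.9]), "France": np.array([0.5, 0.9]),
}
print(solve_analogy("Madrid", "Spain", "France", vectors))   # -> "Paris"
```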
39
Results
40
What we'll do today: Additive compositionality
41
Intuition
It is possible to combine words by an element-wise addition of their vector representations.
Word vectors are trained to predict their surrounding words in the sentence, so a word's vector represents the distribution of the contexts in which it appears.
Distributed representations of words and phrases and their compositionality. Mikolov et al.
42
Example
vec(Madrid) - vec(Spain) + vec(France) ≈ vec(Paris)
vec(Madrid) - vec(Spain) ≈ vec(Paris) - vec(France)
43
Example
vec(Madrid) - vec(Spain) ≈ vec(Paris) - vec(France)
44
Results
Distributed representations of words and phrases and their compositionality. Mikolov et al.
45
What we'll do today
46
Pointwise Mutual Information
A symmetric association measure between a pair of variables.
PMI between words measures the association between a word w and a context c by calculating the log of the ratio between their joint probability (the frequency with which they occur together) and their marginal probabilities (the frequencies with which they occur independently):
$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$$
PMI can be estimated empirically by considering the actual number of observations in a corpus:
$$\mathrm{PMI}(w, c) = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}$$
47
Example
"Drink it" is more common than "drink wine", but "wine" is more of a "drinkable" thing than "it", and indeed PMI(drink, it) = 0.11 while PMI(drink, wine) = 0.23.
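A sketch of the empirical PMI estimate from (word, context) counts; the pair counts below are invented, but they reproduce the qualitative point of the example: the more frequent pair can still have the lower PMI.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Empirical PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) ) over observed pairs D."""
    D = len(pairs)
    wc = Counter(pairs)
    w_counts = Counter(w for w, _ in pairs)
    c_counts = Counter(c for _, c in pairs)
    return {(w, c): math.log(n * D / (w_counts[w] * c_counts[c]))
            for (w, c), n in wc.items()}

pairs = ([("drink", "it")] * 6 + [("drink", "wine")] * 3 +
         [("see", "it")] * 8 + [("make", "wine")] * 2)    # hypothetical counts
table = pmi_table(pairs)
print(table[("drink", "it")], table[("drink", "wine")])   # the rarer pair scores higher
```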
48
PMI matrix
The PMI matrix is a word-context association matrix.
Such matrices are very common in the NLP word-similarity literature.
49
SGNS as implicit matrix factorization
After training the model we get word and context matrices W and C; SGNS implicitly factorizes a word-context matrix $M = W \cdot C^{\top}$.
The rows of matrix W are typically used in NLP tasks, while matrix C is ignored. M is used for word-similarity tasks.
Neural word embedding as implicit matrix factorization. Yoav Goldberg and Omer Levy
50
SGNS as implicit matrix factorization
What about $M_{ij}$? Each cell $M_{ij} = W_i \cdot C_j$ reflects the strength of association between that particular word-context pair.
51
SGNS as implicit matrix factorization
Our objective, for a single (w, c) pair with k negative samples:
$$\ell(w, c) = \#(w, c)\,\log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \#(w)\,\frac{\#(c)}{|D|}\,\log \sigma(-\vec{w} \cdot \vec{c})$$
Optimizing it, we get:
$$\vec{w} \cdot \vec{c} = M_{ij} = \mathrm{PMI}(w, c) - \log k$$
#(w, c) - the number of times the word w appeared with this specific context c.
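A sketch of building the matrix that SGNS implicitly factorizes, M[w, c] = PMI(w, c) - log k, from raw pair counts (unobserved cells are simply left at 0 here, although their PMI is formally negative infinity; the toy pairs are invented).

```python
import math
import numpy as np
from collections import Counter

def shifted_pmi_matrix(pairs, k=5):
    """M[i, j] = PMI(w_i, c_j) - log k for observed (word, context) pairs; 0 otherwise."""
    D = len(pairs)
    wc = Counter(pairs)
    wcount = Counter(w for w, _ in pairs)
    ccount = Counter(c for _, c in pairs)
    words, ctxs = sorted(wcount), sorted(ccount)
    M = np.zeros((len(words), len(ctxs)))
    for (w, c), n in wc.items():
        i, j = words.index(w), ctxs.index(c)
        M[i, j] = math.log(n * D / (wcount[w] * ccount[c])) - math.log(k)
    return words, ctxs, M

pairs = [("dog", "barks"), ("dog", "the"), ("cat", "the"), ("cat", "purrs")] * 10
print(shifted_pmi_matrix(pairs, k=5)[2])
```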
52
What we'll do today
53
How can we learn phrases?
Many phrases have a meaning that is not a simple composition of the meanings of their individual words.
Find words that appear frequently together, and infrequently in other contexts.
Replace each such word pair with a unique token.
Distributed representations of words and phrases and their compositionality. Mikolov et al.
54
How do we choose phrases?
Bigrams are scored as
$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$
where δ is a discounting coefficient that prevents forming phrases out of very infrequent words.
Bigrams whose score is above a chosen threshold are then used as phrases: word pairs that consistently co-occur, relative to their individual frequencies, get a high score and become a phrase.
55
Example
"The dog barked in New York city"
"A pretty dog walked in the street"
Score(dog, barked) =
Score(New, York) =
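A sketch of the scoring and merging steps (the toy corpus below is invented but echoes the slide's example; δ and the threshold are illustrative choices).

```python
from collections import Counter

def phrase_scores(tokens, delta=5):
    """score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {bg: (n - delta) / (unigrams[bg[0]] * unigrams[bg[1]])
            for bg, n in bigrams.items()}

def merge_phrases(tokens, threshold, delta=5):
    """Replace bigrams scoring above the threshold with a single 'wi_wj' token."""
    scores = phrase_scores(tokens, delta)
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and scores.get((tokens[i], tokens[i + 1]), 0) > threshold:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

corpus = ("the dog barked in new york today and a pretty dog walked in the street "
          "while my dog sleeps near new york and the dog likes new york " * 10).split()
scores = phrase_scores(corpus)
print(scores[("new", "york")], scores[("dog", "barked")])   # "new york" scores higher
print(merge_phrases(corpus, threshold=0.02)[:6])            # contains the token 'new_york'
```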
56
Evaluation
"The Montreal Canadiens are a professional ice hockey team based in Montreal."
"The Toronto Maple Leafs are a professional ice hockey team based in Toronto."
57
Phrases - Some results
Using 33 billion words they got an accuracy of 72%. When they reduced the data set to 6 billion words, accuracy dropped to 66%, which suggests that a large amount of training data is important.
58
Phrases - Some results
Vasco de Gama - Portuguese explorer and the first European to reach India by sea.
Lingasugur - a municipal town in Raichur district in the Indian state of Karnataka.
Lake Baikal - rift lake in Russia, located in southern Siberia.
Aral Sea - endorheic lake lying between Kazakhstan in the north and Uzbekistan.
Alan Bean - fourth person to walk on the Moon.
Rebecca Naomi - American actress and singer.
Ionian Islands - part of Greece, lying off the country's west coast in the Ionian Sea.
Rügen - a German island in the Baltic Sea.