
1 Word2Vec
Sivan Biham & Adam Yaari

2 What we'll do today
First Part: Introduction, Word embedding, N-Gram, Skip-Gram, SoftMax, Hierarchical SoftMax.
Second Part: Negative Sampling, Subsampling of frequent words, Evaluation, Additive compositionality, SGNS and PMI, Learning phrases.

3 What we'll do today (the same agenda slide, repeated)

4 Motivation Natural Language Processing (NLP) applications and products:

5 What is NLP? A way for computers to analyze, understand, and derive meaning from human language.
Examples: text mining, machine translation, automated question answering, automatic text summarization, and many more…

6 Word embeddings
For a computer to be able to "understand" text, a vector representation of each word is required.
Examples of early approaches: one-hot vectors, joint distributions.

7 Curse of Dimensionality
Mock example: vocabulary size of 100,000 unique words, window size of 10 context words (unidirectional).
One-hot vectors: 100,000 free parameters, and no knowledge of the words' semantic or syntactic relations.
Joint distribution: 100,000^10 − 1 = 10^50 − 1 free parameters (since (10^5)^10 = 10^50).
A Neural Probabilistic Language Model. Bengio et al., 2003

8 Curse of Dimensionality
Similar words should be close to each other in the high-dimensional space; non-similar words should be far apart.
One-hot vector example: Apple = [1, 0, 0], Orange = [0, 1, 0], Plane = [0, 0, 1].
(Diagram on the slide contrasts the one-hot vectors with the wanted behavior: Apple and Orange close together, Plane far away.)
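To make the contrast concrete, here is a small sketch (not from the slides; the dense vectors are made up): with one-hot vectors every pair of distinct words has cosine similarity 0, while dense vectors can place Apple near Orange and far from Plane.

```python
# Minimal sketch: cosine similarity with one-hot vs. made-up dense vectors.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

one_hot = {"apple": np.array([1., 0., 0.]),
           "orange": np.array([0., 1., 0.]),
           "plane":  np.array([0., 0., 1.])}
dense = {"apple":  np.array([0.9, 0.5, 0.1]),   # illustrative values only
         "orange": np.array([0.8, 0.6, 0.2]),
         "plane":  np.array([0.1, 0.2, 0.9])}

print(cosine(one_hot["apple"], one_hot["orange"]))  # 0.0 -- every one-hot pair looks equally unrelated
print(cosine(dense["apple"], dense["orange"]))      # close to 1 -- similar words end up close
print(cosine(dense["apple"], dense["plane"]))       # noticeably smaller
```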

9 Word Distributed Representation
A many-to-many representation of words: all vector cells participate in representing each word.
Words are represented by real-valued dense vectors of significantly smaller dimension (e.g. 100–1000).
Intuition: consider each vector cell as a representative of some feature.
The amazing power of word vectors, The Morning Paper blog, Adrian Colyer

10 What we'll do today (agenda repeated; next section: N-Gram)

11 N-Gram Model
A good statistical model for NLP is the conditional probability of the next word w given its previous ones:
$p(w_1 w_2 \dots w_n) = p(w_1^n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1^2) \cdots p(w_n \mid w_1^{n-1}) = \prod_{t=1}^{n} p(w_t \mid w_1^{t-1})$
This allows us to take advantage of both word order and the fact that temporally closer words have a stronger dependency.
A Neural Probabilistic Language Model. Bengio et al., 2003

12 N-Gram Model
But, by the Markov assumption (we can predict the probability of a word without looking too far into the past), this can be simplified to:
$p(w_1^n) \approx \prod_{t=1}^{n} p(w_t \mid w_{t-N+1}^{t-1})$
This gives the N-gram model, which predicts the word w from the N − 1 words preceding it (a.k.a. the context of the word).
A Neural Probabilistic Language Model. Bengio et al., 2003
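As a concrete (if toy) illustration of this Markov simplification, a count-based bigram model (N = 2) can be estimated directly from corpus counts. This is a sketch over a made-up corpus, not the neural model of Bengio et al.:

```python
# Maximum-likelihood bigram model: p(w_t | w_{t-1}) = #(w_{t-1}, w_t) / #(w_{t-1}).
from collections import Counter

corpus = "the dog barked in new york city the dog walked in the street".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("dog", "the"))   # "the dog" occurs 2 times, "the" occurs 3 times -> 2/3
```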

13 Back to Word Distributed Representation
Continuous Bag-of-Words (CBOW): as in N-Gram, predicts a word given its context (but bidirectional).
Skip-Gram: the opposite, predicts the context given a word (bidirectional).
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013

14 What we'll do today (agenda repeated; next section: Skip-Gram)

15 Skip-Gram Model
Trained with a two-layer neural network and stochastic gradient ascent.
Projects each word into a continuous space; word vectors typically have 100–1000 dimensions.
Words that appear together get a high dot product between their vectors.
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013
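In practice the model is rarely implemented from scratch. Below is a hedged sketch of training a Skip-Gram model with the gensim library (assuming gensim >= 4.0, where the dimensionality argument is named vector_size; the sentences and hyperparameters are illustrative, not the ones used in the paper):

```python
from gensim.models import Word2Vec

sentences = [["if", "a", "dog", "chews", "shoes", "whose", "shoes", "does", "he", "choose"],
             ["the", "dog", "barked", "in", "new", "york", "city"]]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # dimensionality of the word vectors (typically 100-1000)
    window=5,          # unidirectional context window size c
    sg=1,              # 1 = Skip-Gram (0 = CBOW)
    min_count=1,       # keep every word of this toy corpus in the vocabulary
)

vector = model.wv["dog"]                      # learned (input) vector of "dog"
print(model.wv.most_similar("dog", topn=3))   # nearest words by cosine similarity
```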

16 Skip-Gram Model
Skip-Gram maximizes the average log-likelihood:
$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$
Where: T is the number of words in the corpus, and c is the unidirectional window size of the context.
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013

17 What we'll do today (agenda repeated; next section: SoftMax)

18 SoftMax
The basic Skip-Gram utilizes the SoftMax function:
$p(c \mid w) = \frac{\exp({v'_c}^{\top} v_w)}{\sum_{i=1}^{T} \exp({v'_i}^{\top} v_w)}$
Where: T is the number of words in the corpus, $v_w$ is the input vector of w, and $v'_w$ is the output vector of w.
Word / Input / Output
King / [0.2, 0.9, 0.1] / [0.5, 0.4, 0.5]
Queen / [0.2, 0.8, 0.2] / [0.4, 0.5, 0.5]
Apple / [0.9, 0.5, 0.8] / [0.3, 0.9, 0.1]
Orange / [0.9, 0.4, 0.9] / [0.1, 0.7, 0.2]
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013

19 Skip – Gram & SoftMax

20 Skip-Gram & SoftMax
For example, consider the following sentence: "If a dog chews shoes, whose shoes does he choose?"
Input word: shoes (first occurrence). Window size: 2.
Context words: dog, chews, whose, shoes.
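A minimal sketch of how the (input word, context word) training pairs are generated for this sentence with a window of 2 (punctuation is ignored for simplicity):

```python
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs += [(word, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

sentence = "if a dog chews shoes whose shoes does he choose".split()
# Pairs whose input word is the first "shoes" (index 4):
print([p for p in skipgram_pairs(sentence) if p[0] == "shoes"][:4])
# [('shoes', 'dog'), ('shoes', 'chews'), ('shoes', 'whose'), ('shoes', 'shoes')]
```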

21 SoftMax
The basic Skip-Gram utilizes the SoftMax function:
$p(c \mid w) = \frac{\exp({v'_c}^{\top} v_w)}{\sum_{i=1}^{W} \exp({v'_i}^{\top} v_w)}$
Where: W is the number of words in the vocabulary, $v_w$ is the input vector of w, and $v'_w$ is the output vector of w.
(The example table of input/output vectors is the same as on slide 18.)
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013
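A small numpy sketch of this SoftMax, using the example input/output vectors from the table (so W = 4 here, just for illustration):

```python
import numpy as np

vocab = ["king", "queen", "apple", "orange"]
V_in  = np.array([[0.2, 0.9, 0.1], [0.2, 0.8, 0.2], [0.9, 0.5, 0.8], [0.9, 0.4, 0.9]])  # rows = v_w
V_out = np.array([[0.5, 0.4, 0.5], [0.4, 0.5, 0.5], [0.3, 0.9, 0.1], [0.1, 0.7, 0.2]])  # rows = v'_w

def p_context_given_word(word):
    scores = V_out @ V_in[vocab.index(word)]   # v'_i . v_w for every word i in the vocabulary
    exp = np.exp(scores - scores.max())        # subtract the max for numerical stability
    return exp / exp.sum()                     # the normalization over W terms is the expensive part

print(dict(zip(vocab, p_context_given_word("king").round(3))))
```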

22 What we'll do today (agenda repeated; next section: Hierarchical SoftMax)

23 Hierarchical SoftMax
The full SoftMax is impractical because the cost of computing $\nabla \log p(c \mid w)$ is proportional to the vocabulary size W, which is often large (10^5–10^7 terms).
Hierarchical SoftMax can speed up word prediction by 50x or more.
It computes the normalized probability of a context word as a path in a binary tree instead of summing over the whole vocabulary.
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013

24 Hierarchical SoftMax
The W vocabulary words are represented as leaves and, together with W − 1 internal nodes, form a full binary tree.
Each word has a single vector representation $v_w$, and each internal node n is represented by a vector $v_n$.
The path from the root to a leaf is used to estimate the probability of the word represented by that leaf.
It can be shown that the probabilities of all leaves sum up to 1.
word2vec Parameter Learning Explained, Xin Rong, 2016

25 Hierarchical SoftMax
The probability of a word c being the context word is defined as:
$p(c \mid w) = \prod_{j=1}^{L(c)-1} \sigma\big( [\![\, n(c, j+1) = ch(n(c, j)) \,]\!] \cdot v_{n(c,j)}^{\top} v_w \big)$
Where:
$n(c, j)$ is the j-th node on the path from the root to c, so $n(c, 1) = \text{root}$ and $n(c, L(c)) = c$.
$L(c)$ is the length of that path.
$ch(n)$ is the left child of node n.
$[\![\, x \,]\!] = 1$ if x is true, and −1 otherwise.
$\sigma(x) = \frac{1}{1 + e^{-x}}$
Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., 2013

26 Hierarchical SoftMax
Suppose we want to compute the probability of w2 being the output word.
The probabilities of going left/right at a node n are:
$p(n, \text{left}) = \sigma(v_n^{\top} v_w)$
$p(n, \text{right}) = 1 - \sigma(v_n^{\top} v_w) = \sigma(-v_n^{\top} v_w)$
$p(w_2 = c) = p(n(w_2,1), \text{left}) \cdot p(n(w_2,2), \text{left}) \cdot p(n(w_2,3), \text{right}) = \sigma(v_{n(w_2,1)}^{\top} v_w) \cdot \sigma(v_{n(w_2,2)}^{\top} v_w) \cdot \sigma(-v_{n(w_2,3)}^{\top} v_w)$
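A minimal sketch of this computation; the node vectors and the left/left/right path are made up, the point being that only a handful of sigmoids are needed instead of a sum over the whole vocabulary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_w = np.array([0.2, 0.9, 0.1])                # input vector of the current word w
path_nodes = [np.array([0.1, 0.4, 0.3]),       # v_{n(w2,1)} (root)
              np.array([0.5, 0.2, 0.1]),       # v_{n(w2,2)}
              np.array([0.3, 0.7, 0.2])]       # v_{n(w2,3)}
directions = [+1, +1, -1]                      # +1 = go left, -1 = go right

p_w2 = 1.0
for v_n, d in zip(path_nodes, directions):
    p_w2 *= sigmoid(d * (v_n @ v_w))           # sigma(+v.v) for left, sigma(-v.v) for right

print(p_w2)                                    # O(log W) sigmoids instead of O(W) terms
```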

27 Huffman Trees
Using a binary tree alone already improves the per-word cost of Skip-Gram from O(W) to O(log W) on average (and in the worst case, if the tree is balanced).
It can be improved even further by using a Huffman tree:
Originally designed to build minimal binary codes for compressing a given text.
A full binary tree that guarantees a minimal average weighted path length when some words are used much more frequently than others.
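For illustration, a hedged sketch of building a Huffman tree over made-up word frequencies with Python's heapq; the actual word2vec C code builds its tree differently in detail, but the idea is the same: frequent words end up near the root and get short codes.

```python
import heapq
from itertools import count

freqs = {"the": 50, "a": 30, "dog": 8, "barked": 3, "york": 2}   # illustrative counts

tiebreak = count()                       # unique tiebreaker so heapq never compares subtrees
heap = [(f, next(tiebreak), w) for w, f in freqs.items()]
heapq.heapify(heap)
while len(heap) > 1:                     # repeatedly merge the two least frequent subtrees
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
root = heap[0][2]

def codes(node, prefix=""):
    """Walk the tree and emit each word's binary code (its root-to-leaf path)."""
    if isinstance(node, str):
        yield node, prefix
    else:
        yield from codes(node[0], prefix + "0")
        yield from codes(node[1], prefix + "1")

print(dict(codes(root)))                 # "the" and "a" get the shortest codes
```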

28 Hierarchical SoftMax + Huffman Trees
Empirical performance improved even further (≈ 30%) by using Huffman trees: very common words such as "the" or "a" sit near the root, so their probability computation takes close to O(1) time.
(The slide walks through a small tree example, multiplying the per-node probabilities along the path to a leaf.)
word2vec Parameter Learning Explained, Xin Rong, 2016

29 Break Time!!!

30 What we'll do today (agenda repeated; next section: Negative Sampling)

31 How do we choose negative samples?
For each positive example we draw K negative examples.
The negative examples are drawn according to the unigram distribution of the data: the probability of sampling a context c is #(c) / |D|, where #(c) is the number of times c appears in the corpus D.
Very frequent words (e.g. conjunctions) appear in specific places in a sentence, so pairing them with an arbitrary word makes a plausible negative sample.
Recommended K: 5–20 for smaller training sets, 2–5 for large ones.
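A hedged sketch of drawing the K negative contexts (counts are made up; note that the word2vec paper actually raises the unigram counts to the 3/4 power, so the exponent is exposed as a parameter here):

```python
import numpy as np

context_counts = {"the": 50, "dog": 8, "barked": 3, "york": 2}   # #(c) over the corpus D
words = list(context_counts)

def negative_samples(k=5, power=0.75, rng=np.random.default_rng(0)):
    weights = np.array([context_counts[w] for w in words], dtype=float) ** power
    probs = weights / weights.sum()            # P_n(c) proportional to #(c)^power
    return list(rng.choice(words, size=k, p=probs))

print(negative_samples(k=5))                   # K = 5-20 for small corpora, 2-5 for large ones
```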

32 Negative Sampling
We would like to know whether $(w, c) \in D$.
$p(D = 1 \mid w, c)$ is the probability that $(w, c) \in D$.
$p(D = 0 \mid w, c) = 1 - p(D = 1 \mid w, c)$ is the probability that $(w, c) \notin D$.
For a negative sample, $p(D = 1 \mid w, c)$ should be low, i.e. $p(D = 0 \mid w, c)$ should be high.
word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method. Yoav Goldberg and Omer Levy

33 Objective
For one positive sample (w, c) with K sampled negatives $c_1, \dots, c_K$, the negative-sampling objective is:
$\log \sigma({v'_c}^{\top} v_w) + \sum_{i=1}^{K} \log \sigma(-{v'_{c_i}}^{\top} v_w)$
which costs O(K) per positive pair.
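A minimal numpy sketch of this per-pair objective (the vectors are made up; the form of the objective is the one stated above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_w = np.array([0.2, 0.9, 0.1])                # input vector of the word
v_c = np.array([0.4, 0.5, 0.5])                # output vector of the observed (positive) context
v_negs = np.array([[0.3, 0.9, 0.1],            # output vectors of K = 2 sampled negatives
                   [0.1, 0.7, 0.2]])

objective = np.log(sigmoid(v_c @ v_w)) + np.sum(np.log(sigmoid(-(v_negs @ v_w))))
print(objective)   # gradient ascent pushes v_c . v_w up and every v_neg . v_w down; cost is O(K)
```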

34 Subsampling of frequent words (agenda repeated)

35 Subsampling of frequent words
Very frequent words usually provide less information value than rare words.
The subsampling method presented in Mikolov et al. (2013) randomly removes words that are more frequent than some threshold t: each occurrence of word $w_i$ is discarded with probability
$P(w_i) = 1 - \sqrt{t / f(w_i)}$
where $f(w_i)$ is the corpus frequency of $w_i$ and t is the threshold (a typical value is 10^-5).
If $f(w_i) = t$ the removal probability is 0; as f grows, the square root shrinks and the removal probability rises, so the more frequent the word, the more likely it is to be subsampled.
Another (important) implementation detail in word2vec: the random removal of tokens is done before the corpus is processed into word-context pairs (i.e. before word2vec is actually run). This effectively enlarges the context window for many tokens, because they can now reach words that were outside their original window.
Using subsampling, the authors report improvements in both training speed and accuracy.
Distributed representations of words and phrases and their compositionality. Mikolov et al.
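A minimal sketch of the subsampling rule, with made-up word frequencies:

```python
import math
import random

t = 1e-5                                          # recommended threshold
freq = {"the": 0.05, "in": 0.02, "baikal": 2e-6}  # f(w): corpus frequency (illustrative)

def keep(word, rng=random.Random(0)):
    p_drop = max(0.0, 1.0 - math.sqrt(t / freq[word]))   # 0 for words rarer than t
    return rng.random() > p_drop

print([w for w in ["the", "in", "baikal", "the"] if keep(w)])   # frequent tokens are usually dropped
```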

36 What we'll do today (agenda repeated; next section: Evaluation)

37 How to evaluate the representation?
Analogical reasoning task:
"The Montreal Canadiens are a professional ice hockey team based in Montreal."
"The Toronto Maple Leafs are a professional ice hockey team based in Toronto."

38 How to evaluate the representation?
A correct answer: "Madrid" : "Spain" is like "Paris" : "France".
vec("Madrid") − vec("Spain") + vec("France") = X, and the answer is correct if X ≈ vec("Paris").
Distance is measured using cosine distance.
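With gensim, this analogy test is a one-liner over trained vectors; a hedged sketch (assuming either the model trained earlier or a pretrained set of KeyedVectors):

```python
from gensim.models import KeyedVectors

# e.g. vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
def solve_analogy(vectors: KeyedVectors):
    # vec("Madrid") - vec("Spain") + vec("France"), nearest neighbour by cosine similarity
    return vectors.most_similar(positive=["Madrid", "France"], negative=["Spain"], topn=1)

# With good vectors the top answer should be "Paris".
```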

39 Results

40 Additive compositionality (agenda repeated)

41 Intuition
It is possible to combine words by an element-wise addition of their vector representations.
Word vectors are trained to predict their surrounding words in the sentence, so the vectors represent the distribution of the contexts in which a word appears.
Distributed representations of words and phrases and their compositionality. Mikolov et al.

42 Example
vec(Madrid) − vec(Spain) + vec(France) ≈ vec(Paris)
Equivalently: vec(Madrid) − vec(Spain) ≈ vec(Paris) − vec(France)

43 Example
vec(Madrid) − vec(Spain) ≈ vec(Paris) − vec(France)

44 Results
Distributed representations of words and phrases and their compositionality. Mikolov et al.

45 What we'll do today (agenda repeated; next section: SGNS and PMI)

46 Pointwise Mutual Information
PMI is a symmetric association measure between a pair of variables.
PMI between words measures the association between a word w and a context c as the log of the ratio between their joint probability (the frequency with which they occur together) and the product of their marginal probabilities (the frequencies with which they occur independently):
$PMI(w, c) = \log \frac{P(w, c)}{P(w)\, P(c)}$
PMI can be estimated empirically from the actual number of observations in a corpus:
$\widehat{PMI}(w, c) = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}$
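A minimal sketch of the empirical estimate over a made-up list of (word, context) pairs. The numbers will not match the slide's example, which came from a real corpus, but the same effect shows up: "wine" gets a higher PMI with "drink" than the much more frequent "it":

```python
import math
from collections import Counter

pairs = [("drink", "it"), ("drink", "wine"), ("drink", "it"),
         ("eat", "it"), ("see", "it"), ("take", "it"), ("pour", "wine")]
D = len(pairs)
wc = Counter(pairs)                       # #(w, c)
w_counts = Counter(w for w, _ in pairs)   # #(w)
c_counts = Counter(c for _, c in pairs)   # #(c)

def pmi(w, c):
    return math.log(wc[(w, c)] * D / (w_counts[w] * c_counts[c]))

print(round(pmi("drink", "it"), 2), round(pmi("drink", "wine"), 2))   # "wine" scores higher
```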

47 Example
"Drink it" is more common than "drink wine", but "wine" is more of a "drinkable" thing than "it".
PMI(drink, it) = 0.11
PMI(drink, wine) = 0.23

48 PMI matrix
The PMI matrix is a word-context association matrix.
Such matrices are very common in the NLP word-similarity literature.

49 SGNS as implicit matrix factorization
After training the model we get a word matrix W and a context matrix C.
The rows of W are typically used in NLP tasks, while C is ignored.
The product $M = W C^{\top}$ is what is used for word-similarity tasks.
Neural word embedding as implicit matrix factorization. Yoav Goldberg and Omer Levy

50 SGNS as implicit matrix factorization
What about $M_{ij}$? Each cell $M_{ij} = \vec{w}_i \cdot \vec{c}_j$ reflects the strength of association between that particular word-context pair.

51 SGNS as implicit matrix factorization
Plugging in our objective and optimizing, we get:
$\vec{w} \cdot \vec{c} = PMI(w, c) - \log k$
i.e. SGNS implicitly factorizes a word-context matrix whose cells are the PMI shifted by $\log k$, where k is the number of negative samples.
#(w, c) is the number of times the word w appears with the specific context c.
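A hedged sketch of the idea as Levy and Goldberg later used it: build the shifted positive PMI matrix from counts and factorize it with SVD instead of running SGNS (clipping at 0 keeps the matrix finite; the counts are made up):

```python
import numpy as np

words = ["drink", "eat"]
contexts = ["it", "wine", "bread"]
counts = np.array([[2., 1., 0.],          # #(w, c) co-occurrence counts
                   [3., 0., 1.]])
k = 5                                     # number of negative samples used by SGNS

D = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / D
pc = counts.sum(axis=0, keepdims=True) / D
with np.errstate(divide="ignore"):        # log(0) -> -inf for unseen pairs
    pmi = np.log((counts / D) / (pw * pc))
sppmi = np.maximum(pmi - np.log(k), 0.0)  # shifted positive PMI: max(PMI - log k, 0)

U, S, Vt = np.linalg.svd(sppmi, full_matrices=False)
W_embed = U * np.sqrt(S)                  # one common way to split the factorization into word vectors
print(W_embed)
```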

52 What we'll do today (agenda repeated; next section: Learning phrases)

53 How can we learn phrases?
Many phrases have a meaning that is not a simple composition of the meanings of their individual words.
Find words that appear frequently together, and infrequently in other contexts, and replace them with a unique token.
Distributed representations of words and phrases and their compositionality. Mikolov et al.

54 How do we choose phrases?
Bigrams are scored with:
$score(w_i, w_j) = \frac{count(w_i w_j) - \delta}{count(w_i) \cdot count(w_j)}$
where $\delta$ is a discounting coefficient that prevents forming too many phrases from very infrequent words.
Bigrams whose score is above a chosen threshold are then used as phrases.
Word pairs that consistently appear together get a high score and become a phrase.

55 Example
"The dog barked in New York city"
"A pretty dog walked in the street"
Score(dog, barked) =
Score(New, York) =
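A hedged sketch of the score applied to these two sentences; delta and the threshold are made-up values, so the numbers will not match the slide:

```python
from collections import Counter

sentences = ["the dog barked in new york city",
             "a pretty dog walked in the street"]
tokens = [s.lower().split() for s in sentences]
unigrams = Counter(w for s in tokens for w in s)
bigrams = Counter(b for s in tokens for b in zip(s, s[1:]))

def score(w1, w2, delta=0.5):
    return (bigrams[(w1, w2)] - delta) / (unigrams[w1] * unigrams[w2])

print(score("dog", "barked"))   # (1 - 0.5) / (2 * 1) = 0.25
print(score("new", "york"))     # (1 - 0.5) / (1 * 1) = 0.5  -> more likely to become the token "new_york"
```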

56 Evaluation
"The Montreal Canadiens are a professional ice hockey team based in Montreal."
"The Toronto Maple Leafs are a professional ice hockey team based in Toronto."

57 Phrases – Some results
Using 33 billion words they got an accuracy of 72%; reducing the data set to 6 billion words lowered the accuracy to 66%, which suggests that a large amount of training data is important.

58 Phrases - Some results
Vasco de Gama - Portuguese explorer and the first European to reach India by sea.
Lingasugur - a municipal town in Raichur district in the Indian state of Karnataka.
Lake Baikal - rift lake in Russia, located in southern Siberia.
Aral Sea - endorheic lake lying between Kazakhstan in the north and Uzbekistan.
Alan Bean - fourth person to walk on the Moon.
Rebecca Naomi - American actress and singer.
Ionian Islands - part of Greece, lying off the country's west coast in the Ionian Sea.
Rügen - a German island in the Baltic Sea.

59

