Language Models
Professor Junghoo “John” Cho, UCLA
Today's Topics
Higher-level vector embedding: Quoc V. Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents"
Building better language models: Joshua Goodman, "A Bit of Progress in Language Modeling," MSR Technical Report
Paragraph Embedding [Le & Mikolov 2014]
Question: Can we create a vector representation for a phrase, sentence, paragraph, document, etc.?
More formally: convert a sequence of words $w_1 w_2 \dots w_n$ into a $k$-dimensional vector, where "similar sequences" are mapped to similar vectors.
Q: How?
Paragraph Embedding [Le & Mikolov 2014]
Q: How did we do it for words?
A: For any word pair $w_i, w_j$, find their vector embeddings $v_i, v_j$ such that $P(w_j \mid w_i) = \mathrm{softmax}(v_j^T v_i)$. A bias term $b_j$ may be added inside the softmax.
Q: Can we formulate the paragraph embedding problem similarly?
PV-DBOW (Distributed Bag of Words version of Paragraph Vector)
Given $p_i = w_1 w_2 \dots w_n$, find vector embeddings $q_i, v_1, v_2, \dots, v_n$ such that $P(w_j \mid p_i) = \mathrm{softmax}(v_j^T q_i)$. Again, a bias term $b_j$ may be added.
Train the $q_i$ and $v_j$ vectors on a large text corpus to maximize the log likelihood $\sum_{(i,j) \in D} \log P(w_j \mid p_i)$ over the observed (paragraph, word) pairs $D$.
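As a concrete illustration (not the paper's implementation), here is a minimal NumPy sketch of the PV-DBOW scoring step: a paragraph vector $q_i$ is scored against every output word vector $v_j$, and the softmax gives $P(w_j \mid p_i)$. The toy dimensions, random initialization, and the pair set D are assumptions made up for the example.

```python
import numpy as np

np.random.seed(0)
V, P, k = 6, 2, 4                  # vocabulary size, number of paragraphs, embedding dimension
q = np.random.randn(P, k) * 0.1    # paragraph vectors q_i
v = np.random.randn(V, k) * 0.1    # output word vectors v_j
b = np.zeros(V)                    # optional bias terms b_j

def p_word_given_paragraph(i):
    """P(. | p_i) = softmax(v q_i + b) over the whole vocabulary."""
    scores = v @ q[i] + b
    scores -= scores.max()         # numerical stability
    e = np.exp(scores)
    return e / e.sum()

# log likelihood of observed (paragraph, word) pairs D
D = [(0, 1), (0, 3), (1, 2)]
log_lik = sum(np.log(p_word_given_paragraph(i)[j]) for i, j in D)
print(log_lik)                     # training would adjust q, v, b to increase this
```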
PV-DM (Distributed Memory for Paragraph Vector)
Given $p_i = w_1 w_2 \dots w_n$, find vector embeddings $q_i, v_1, \dots, v_n, v'_1, \dots, v'_n$ such that, for example, $P(w_4 \mid p_i, w_1 w_2 w_3) = \mathrm{softmax}\!\left(v'^{\,T}_4 [q_i; v_1; v_2; v_3]\right)$, where the paragraph vector $q_i$ and the context word vectors $v_1, v_2, v_3$ are concatenated (or averaged). A bias term $b_4$ may be added.
Train the $q_i$, $v_j$, $v'_j$ vectors on a large text corpus to maximize the log likelihood $\sum_{(i,j) \in D} \log P(w_j \mid p_i, \text{context})$.
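A matching sketch of the PV-DM scoring step under the same toy setup: the paragraph vector is concatenated with three context word vectors and scored by a separate set of output vectors (the $v'_j$ above). All sizes, names, and data are illustrative assumptions, not the authors' code.

```python
import numpy as np

np.random.seed(0)
V, P, k = 6, 2, 4
q = np.random.randn(P, k) * 0.1          # paragraph vectors q_i
v_in = np.random.randn(V, k) * 0.1       # input (context) word vectors v_j
v_out = np.random.randn(V, 4 * k) * 0.1  # output vectors v'_j over the concatenated context
b = np.zeros(V)

def p_next_word(i, w1, w2, w3):
    """P(. | p_i, w1 w2 w3) = softmax(v_out h + b), where h = [q_i; v_w1; v_w2; v_w3]."""
    h = np.concatenate([q[i], v_in[w1], v_in[w2], v_in[w3]])
    scores = v_out @ h + b
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

print(p_next_word(0, 1, 2, 3))   # distribution over the vocabulary for w4
```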
Results of [Le & Mikolov 2014]
PV-DM works better than PV-DBOW; a slight further improvement when the two are used together.
12.2% error rate on the Stanford sentiment analysis task using PV (20% improvement over the prior state of the art).
7.42% error rate on the IMDB review sentiment analysis task (15% improvement over the prior state of the art).
3.82% error rate on the paragraph similarity task (32% improvement over the prior state of the art).
Vector embedding works well at the sentence/paragraph level!
Vector Embedding
Today, word embedding is used as the first step in almost all NLP tasks: Word2Vec, GloVe, ELMo, BERT, …
In general, "vector embedding" is an extremely hot research topic for many different types of datasets: graph embedding, user embedding, time-series data embedding, …
Any Questions? Next topic: language models
Language Model: A Brief Recap
Given a sequence of words $w_1 w_2 \dots w_n$, assign a probability $P(w_1 w_2 \dots w_n)$.
Q: How can we compute $P(w_1 w_2 \dots w_n)$?
A: For each word sequence $w_1 w_2 \dots w_n$, count its frequency in the corpus, $\#(w_1 w_2 \dots w_n)$, and divide it by the corpus size $N$ (divide by $N - n + 1$ to be precise).
Challenge: data sparsity. For any reasonably large $n$, $\#(w_1 w_2 \dots w_n)$ is likely to be zero.
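A minimal sketch of this counting estimate on a made-up toy corpus; it also shows how quickly the count of a longer sequence drops to zero (the corpus and query sequences are assumptions for illustration):

```python
corpus = "the cat sat on the mat and the cat slept on the mat".split()
N = len(corpus)

def count_seq(seq):
    """#(w1 ... wn): number of times the exact sequence appears in the corpus."""
    n = len(seq)
    return sum(corpus[i:i + n] == seq for i in range(N - n + 1))

def p_mle(seq):
    """P(w1 ... wn) estimated as #(w1 ... wn) / (N - n + 1)."""
    return count_seq(seq) / (N - len(seq) + 1)

print(p_mle(["the", "cat"]))         # non-zero: the bigram appears twice
print(p_mle(["the", "mat", "sat"]))  # 0.0: data sparsity even in this tiny corpus
```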
Chain Rule and N-gram Language Model
$P(w_1 w_2 \dots w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \dots w_{n-1})$
For a 3-gram model, approximate $P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-2} w_{i-1})$.
Then $P(w_1 w_2 \dots w_n) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_{n-2} w_{n-1})$.
Q: A good first-order approximation. How can we make it better?
Q: When does it work well? When does it fail? How can we avoid the failure cases?
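A sketch of the resulting 3-gram model with plain maximum-likelihood counts, $P(w_i \mid w_{i-2} w_{i-1}) = \#(w_{i-2} w_{i-1} w_i)/\#(w_{i-2} w_{i-1})$, on the same kind of toy corpus (all names and data are illustrative):

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat slept on the mat".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
N = len(corpus)

def p_trigram(w3, w1, w2):
    """MLE estimate of P(w3 | w1 w2); zero counts show why smoothing is needed."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

def p_sentence(words):
    """P(w1 ... wn) ~= P(w1) P(w2|w1) * product of trigram terms."""
    p = unigrams[words[0]] / N
    if len(words) > 1:
        p *= bigrams[(words[0], words[1])] / unigrams[words[0]]
    for i in range(2, len(words)):
        p *= p_trigram(words[i], words[i - 2], words[i - 1])
    return p

print(p_sentence("the cat sat on the mat".split()))
```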
Ideas for n-Gram Model Improvement
Use longer n-grams: higher-order n-gram models.
Estimates from low-frequency n-grams are inherently inaccurate:
"Back off" to a lower-order estimate on a zero count.
Smoothing: Laplace, Jelinek-Mercer, absolute discount, Katz, Kneser-Ney.
Skip a missing word in the sequence: skipping.
Use a context-specific language model:
People tend to reuse words they have used before: caching.
Build a separate language model for each sentence/document type: sentence-mixture model.
Use word classes (clustering): if "on Tuesday" is frequent, "on Wednesday" is also likely to be frequent.
Dealing with Zero-Frequency Grams
Q: Some 3-grams, say $abc$, do not appear in our corpus. Should we set $P(c \mid ab) = 0$?
"Back-off policy": use the estimate from a lower-order n-gram.
$P(c \mid ab) = \begin{cases} P(c \mid ab) & \text{if } \#(abc) > 0 \\ P(c \mid b) & \text{if } \#(bc) > 0 \\ P(c) & \text{if } \#(c) > 0 \\ 1/N & \text{otherwise} \end{cases}$
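A sketch of this back-off policy in code, using simple count tables; the final $1/N$ fallback follows the slide's formulation, and note that this simple form does not renormalize the distribution (the toy corpus is an assumption):

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat slept on the mat".split()
N = len(corpus)
c3 = Counter(zip(corpus, corpus[1:], corpus[2:]))
c2 = Counter(zip(corpus, corpus[1:]))
c1 = Counter(corpus)

def p_backoff(c, a, b):
    """P(c | ab): fall back to lower-order estimates when counts are zero."""
    if c3[(a, b, c)] > 0:
        return c3[(a, b, c)] / c2[(a, b)]
    if c2[(b, c)] > 0:
        return c2[(b, c)] / c1[b]
    if c1[c] > 0:
        return c1[c] / N
    return 1 / N                     # unseen word: uniform-style fallback, as on the slide

print(p_backoff("mat", "on", "the"))     # trigram estimate
print(p_backoff("slept", "mat", "and"))  # backs off to P(slept) since "and slept" is unseen
```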
Improving Low-Frequency Estimates
Estimates from low-frequency grams are inherently inaccurate; a zero-count estimate is an extreme example of this inaccuracy. A word/gram cannot appear 1.3 times!
Smoothing: a class of techniques that try to improve low-frequency estimates.
"Smooth out" low-frequency estimates while "preserving" high-frequency estimates.
Q: How?
Laplace Smoothing
Add 1 to all gram frequency counts.
$P_{\text{Laplace}}(c \mid ab) = \frac{\#(abc) + 1}{N + V}$, where $\#(abc)$ is the gram frequency, $N$ the corpus size, and $V$ the vocabulary size.
Assume every gram has been seen once even before we see any data.
Q: Why did people think it was a good idea?
A: It assigns non-zero probability to zero-frequency grams, while a high-frequency gram's estimate stays close to the original estimate.
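A sketch of the add-one count from the slide, with $N$ the corpus size and $V$ the vocabulary size; a common variant instead uses $\#(ab) + V$ in the denominator so that the conditional distribution normalizes. The toy corpus is an assumption.

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat slept on the mat".split()
N = len(corpus)
V = len(set(corpus))
c3 = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_laplace(c, a, b):
    """Slide's add-one estimate: (#(abc) + 1) / (N + V).
    A common alternative divides by (#(ab) + V) so that sum_c P(c|ab) = 1."""
    return (c3[(a, b, c)] + 1) / (N + V)

print(p_laplace("mat", "on", "the"))    # seen trigram
print(p_laplace("sofa", "on", "the"))   # unseen trigram still gets non-zero probability
```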
Jelinek-Mercer Smoothing (a.k.a. Simple Interpolation)
Back-off uses the lower-order estimate only for zero-frequency grams. Let us use it more!
"Mix" the higher-order n-gram estimate with the lower-order n-gram estimates:
$P_{JM}(c \mid ab) = \lambda_1 P(c \mid ab) + \lambda_2 P(c \mid b) + \lambda_3 P(c) + \lambda_4 / V = \lambda_1 \frac{\#(abc)}{\#(ab)} + \lambda_2 \frac{\#(bc)}{\#(b)} + \lambda_3 \frac{\#(c)}{N} + \lambda_4 \frac{1}{V}$, with $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$.
A "smoothed version" of back-off.
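A sketch of the interpolation with illustrative λ values; in practice the λ's are tuned on held-out data (e.g., by EM). The toy corpus and weights are assumptions.

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat slept on the mat".split()
N = len(corpus)
V = len(set(corpus))
c3 = Counter(zip(corpus, corpus[1:], corpus[2:]))
c2 = Counter(zip(corpus, corpus[1:]))
c1 = Counter(corpus)

def p_jm(c, a, b, lambdas=(0.5, 0.3, 0.15, 0.05)):
    """P_JM(c|ab) = l1 P(c|ab) + l2 P(c|b) + l3 P(c) + l4 / V, with the l's summing to 1."""
    l1, l2, l3, l4 = lambdas
    p3 = c3[(a, b, c)] / c2[(a, b)] if c2[(a, b)] else 0.0
    p2 = c2[(b, c)] / c1[b] if c1[b] else 0.0
    p1 = c1[c] / N
    return l1 * p3 + l2 * p2 + l3 * p1 + l4 / V

print(p_jm("mat", "on", "the"))
print(p_jm("sofa", "on", "the"))  # an unseen word still gets the l4/V floor
```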
Absolute Discount Smoothing
"Subtract" a constant $d$ from each frequency and "mix in" the lower-order estimate:
$P_{AD}(c \mid ab) = \frac{\max(\#(abc) - d,\, 0)}{\#(ab)} + \alpha \, P_{AD}(c \mid b)$
$\alpha$ is chosen so that $\sum_w P_{AD}(w \mid ab) = 1$.
Q: Why absolute discounting? Why did people think it would be a good idea?
A: When $\#(abc)$ is high, $P_{AD}$ is close to the unsmoothed estimate; when $\#(abc)$ is low, the lower-order estimate becomes important.
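A sketch of absolute discounting for the trigram case, backing off to an unsmoothed bigram estimate for brevity; the discount $d$ and the corpus are illustrative, and $\alpha$ is set to the discounted (leftover) mass so the distribution still sums to one.

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat slept on the mat".split()
c3 = Counter(zip(corpus, corpus[1:], corpus[2:]))
c2 = Counter(zip(corpus, corpus[1:]))
c1 = Counter(corpus)
d = 0.5   # illustrative discount, typically tuned between 0 and 1

def p_ad(c, a, b):
    """P_AD(c|ab) = max(#(abc)-d, 0)/#(ab) + alpha(ab) * P(c|b),
    where alpha(ab) redistributes the discounted mass to the lower-order estimate."""
    ctx = c2[(a, b)]
    if ctx == 0:                      # unseen context: just use the bigram estimate
        return c2[(b, c)] / c1[b] if c1[b] else 0.0
    n_types = sum(1 for (x, y, _) in c3 if (x, y) == (a, b))   # distinct continuations of "a b"
    alpha = d * n_types / ctx
    p_lower = c2[(b, c)] / c1[b] if c1[b] else 0.0
    return max(c3[(a, b, c)] - d, 0) / ctx + alpha * p_lower

print(p_ad("mat", "on", "the"))
print(p_ad("cat", "on", "the"))   # unseen trigram, supported only through the bigram "the cat"
```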
Kneser-Ney Smoothing
The problem with absolute discounting:
Suppose the phrase "San Francisco" shows up many times in the training corpus. We might have seen "San Francisco" a lot more often than "glasses", so #(Francisco) > #(glasses).
Let's predict: "I can't see without my reading ___."
$P_{AD}(\text{Francisco} \mid \text{reading}) = \frac{\max(\#(\text{reading Francisco}) - d,\, 0)}{\#(\text{reading})} + \alpha \max(\#(\text{Francisco}) - d,\, 0)$
Very likely $\#(\text{reading Francisco}) < d$, so the second term, driven by $\#(\text{Francisco})$, will dominate. We might end up predicting the blank to be "Francisco"!
Problem: "Francisco" appears almost exclusively after "San", while "glasses" may appear after many different words.
Q: How can we incorporate this information?
Kneser-Ney Smoothing
Assign a higher probability if a word appears after many different words: make the lower-order term proportional to the number of distinct preceding context words.
$P_{KN}(w_i \mid w_{i-2} w_{i-1}) = \frac{\max(\#(w_{i-2} w_{i-1} w_i) - d,\, 0)}{\#(w_{i-2} w_{i-1})} + \alpha \, \frac{|\{v : \#(v\, w_i) > 0\}|}{\sum_w |\{v : \#(v\, w) > 0\}|}$
Modified Kneser-Ney smoothing: use a different discount $d$ for each gram count.
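A sketch of the Kneser-Ney lower-order "continuation" term (how many distinct words a word appears after), plugged into the discounted trigram estimate above; $d$ and the corpus are again illustrative assumptions.

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat slept on the mat".split()
c3 = Counter(zip(corpus, corpus[1:], corpus[2:]))
c2 = Counter(zip(corpus, corpus[1:]))
d = 0.5

# continuation counts: how many distinct words v precede each word w (i.e. #(v w) > 0)
left_contexts = Counter(w for (_, w) in c2)   # counts distinct (v, w) bigram types per w
total_bigram_types = len(c2)

def p_continuation(w):
    """|{v : #(v w) > 0}| / sum_w' |{v : #(v w') > 0}| -- the KN lower-order term."""
    return left_contexts[w] / total_bigram_types

def p_kn(c, a, b):
    ctx = c2[(a, b)]
    if ctx == 0:
        return p_continuation(c)
    n_types = sum(1 for (x, y, _) in c3 if (x, y) == (a, b))
    alpha = d * n_types / ctx
    return max(c3[(a, b, c)] - d, 0) / ctx + alpha * p_continuation(c)

# continuation probability depends on how many distinct words precede w, not on raw frequency
print(p_continuation("the"), p_continuation("mat"))
print(p_kn("mat", "on", "the"))
```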
Katz Smoothing
The estimates for grams that appear only once in the corpus are likely to be higher than their true values. A gram cannot appear 0.2 times! We need to "discount" their estimates appropriately.
Q: By how much?
"Good-Turing frequency estimator": any n-gram that occurs $r$ times should be discounted to $disc(r) = \frac{(r+1)\, n_{r+1}}{n_r}$, where $n_r$ is the number of n-grams that occur exactly $r$ times.
Katz smoothing, intuition: $P_{Katz}(c \mid ab) = \frac{disc(\#(abc))}{\#(ab)}$
Katz Smoothing
$P_{Katz}(c \mid ab) = \begin{cases} \frac{disc(\#(abc))}{\#(ab)} & \text{if } \#(abc) > 0 \\ \alpha \, P_{Katz}(c \mid b) & \text{otherwise} \end{cases}$
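A sketch of Good-Turing discounting and the Katz back-off rule; the corpus and the count cutoff are illustrative, and in a real implementation the lower-order model would itself be Katz-smoothed rather than a plain MLE bigram.

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat slept on the mat".split()
vocab = sorted(set(corpus))
c3 = Counter(zip(corpus, corpus[1:], corpus[2:]))
c2 = Counter(zip(corpus, corpus[1:]))
c1 = Counter(corpus)
n_r = Counter(c3.values())           # n_r: number of trigram types occurring exactly r times

def disc(r, cutoff=5):
    """Good-Turing adjusted count (r+1) * n_{r+1} / n_r; keep r itself when undefined or large."""
    if r == 0 or r > cutoff or n_r[r] == 0 or n_r[r + 1] == 0:
        return r
    return (r + 1) * n_r[r + 1] / n_r[r]

def p_bigram(c, b):
    return c2[(b, c)] / c1[b] if c1[b] else 0.0

def p_katz(c, a, b):
    """disc(#(abc))/#(ab) if the trigram was seen, else alpha(ab) * P(c|b)."""
    ctx = c2[(a, b)]
    if ctx and c3[(a, b, c)] > 0:
        return disc(c3[(a, b, c)]) / ctx
    if ctx == 0:
        return p_bigram(c, b)
    seen = [w for w in vocab if c3[(a, b, w)] > 0]
    leftover = 1.0 - sum(disc(c3[(a, b, w)]) / ctx for w in seen)
    unseen_mass = sum(p_bigram(w, b) for w in vocab if w not in seen)
    alpha = leftover / unseen_mass if unseen_mass else 0.0
    return alpha * p_bigram(c, b)

print(p_katz("sat", "the", "cat"))   # discounted below the MLE 0.5, freeing mass for unseen grams
print(p_katz("mat", "on", "the"))    # high-count grams are left (nearly) undiscounted
```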
Skipping
Basic idea: why do we have to drop the farthest word when the frequency is low? What about "told Jelinek news"? "Jelinek news" is still infrequent because "Jelinek" is an infrequent word!
"Skip" or "ignore" middle words instead:
$P_{skip}(d \mid abc) = \lambda_1 P(d \mid abc) + \lambda_2 P(d \mid bc) + \lambda_3 P(d \mid ac) + \lambda_4 P(d \mid ab)$, with $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$.
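A sketch of the skipping interpolation for a four-word window, reading $P(d \mid ac)$ and $P(d \mid ab)$ as "skip the missing context word" estimates; the λ weights and corpus are illustrative assumptions.

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat slept on the mat".split()
c4 = Counter(zip(corpus, corpus[1:], corpus[2:], corpus[3:]))
c3 = Counter(zip(corpus, corpus[1:], corpus[2:]))
c2 = Counter(zip(corpus, corpus[1:]))

def cond(num, den):
    return num / den if den else 0.0

def p_skip(d, a, b, c, lambdas=(0.4, 0.3, 0.2, 0.1)):
    """l1 P(d|abc) + l2 P(d|bc) + l3 P(d|a_c) + l4 P(d|ab_), skipping one context word per term."""
    l1, l2, l3, l4 = lambdas
    p_abc = cond(c4[(a, b, c, d)], c3[(a, b, c)])
    p_bc = cond(c3[(b, c, d)], c2[(b, c)])
    # "skipped" contexts: ignore the middle or the last context word
    p_a_c = cond(sum(v for (x, _, z, w), v in c4.items() if (x, z, w) == (a, c, d)),
                 sum(v for (x, _, z), v in c3.items() if (x, z) == (a, c)))
    p_ab_ = cond(sum(v for (x, y, _, w), v in c4.items() if (x, y, w) == (a, b, d)),
                 c2[(a, b)])
    return l1 * p_abc + l2 * p_bc + l3 * p_a_c + l4 * p_ab_

print(p_skip("mat", "sat", "on", "the"))
```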
Caching: User-specific Language Model
A person tends to reuse the words that she has used before.
Build a "user-specific" language model and mix it with the general language model:
$P_{cache}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P_{general}(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P_{user}(w_i \mid w_{i-2} w_{i-1})$
The "user" can also be the "recent past", e.g., a "user model" built from the past 1000 words.
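A sketch of mixing a general model with a cache built from the user's recent words; for brevity the cache here is a unigram model over the last 1000 words, and the λ values, corpora, and "recent" history are illustrative assumptions.

```python
from collections import Counter, deque

general_corpus = "the cat sat on the mat and the cat slept on the mat".split()
gen_counts = Counter(general_corpus)
gen_total = len(general_corpus)

recent = deque(maxlen=1000)          # "user" model: the last 1000 words seen
for w in "my cat likes my other cat".split():
    recent.append(w)
cache_counts = Counter(recent)

def p_cache(w, lambdas=(0.8, 0.2)):
    """l1 * P_general(w) + l2 * P_user(w); a full cache model would condition on w_{i-2} w_{i-1} too."""
    l1, l2 = lambdas
    p_gen = gen_counts[w] / gen_total
    p_user = cache_counts[w] / len(recent) if recent else 0.0
    return l1 * p_gen + l2 * p_user

print(p_cache("cat"))   # boosted because the user has used "cat" recently
print(p_cache("mat"))
```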
Sentence-Mixture Model
A health-news language model should be very different from a Twitter language model.
Build a separate language model for each type of corpus: use the appropriate model if the source is known, and mix the models if the source is unknown:
$P_{mix}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P_{\text{Health}}(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P_{\text{Twitter}}(w_i \mid w_{i-2} w_{i-1})$
"Different sources" can also be different sentence types: questions, exclamations, statements, …
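A sketch of the mixture, with two toy source-specific unigram models standing in for the trigram models on the slide; the corpora and λ values are made up for illustration.

```python
from collections import Counter

health_corpus = "new study links sleep to heart health study says".split()
twitter_corpus = "lol this new song slaps so hard lol".split()

def unigram_model(corpus):
    counts, total = Counter(corpus), len(corpus)
    return lambda w: counts[w] / total

p_health = unigram_model(health_corpus)
p_twitter = unigram_model(twitter_corpus)

def p_mix(w, lambdas=(0.5, 0.5)):
    """Mixture used when the source is unknown; if the source is known, use that model directly."""
    l1, l2 = lambdas
    return l1 * p_health(w) + l2 * p_twitter(w)

print(p_mix("study"), p_mix("lol"))
```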
Summary: Language Model
Caching provides the most significant accuracy improvement, but collecting accurate user-specific statistics is often difficult, and the complexity of the model increases significantly.
Among smoothing techniques, modified Kneser-Ney performs best in practice.
The sentence-mixture model also works well in practice, but it requires significantly more training data and increases the model complexity.
Combining all three techniques does work.
Announcement: start preparing your presentation.
Your presentations will start next week!
Decide on project ideas for your group. The project proposal is due by the 5th week, with project-idea presentations during the 5th week. Working on a Kaggle competition can be an option.
If you get stuck, I am available to help; leverage my office hours to your advantage.