Language Models
Professor Junghoo “John” Cho, UCLA

Today’s Topics
Higher-level vector embedding: Quoc V. Le and Tomas Mikolov, “Distributed Representations of Sentences and Documents”
Building better language models: Joshua Goodman, “A Bit of Progress in Language Modeling”, MSR Technical Report

Paragraph Embedding [Le & Mikolov 2014] Question: Can we create a vector representation for a phrase, sentence, paragraph, document, etc? More formally Convert a sequence of words 𝑤 1 𝑤 2 … 𝑤 𝑛 , into a 𝑘-dimensional vector where “similar sequences” are mapped into similar vector Q: How?

Paragraph Embedding [Le & Mikolov 2014] Q: How did we do it for words? A: For any word pair 𝑤 𝑖 , 𝑤 𝑗 find their vector embedding 𝑣 𝑖 , 𝑣 𝑗 such that 𝑃 𝑤 𝑗 𝑤 𝑖 =softmax 𝑣 𝑗 𝑇 𝑣 𝑖 A bias term 𝑏 𝑗 may be added inside softmax Q: Can we formulate the paragraph embedding problem similarly?

PV-DBOW (Distributed Bag of Words version of Paragraph Vector)
Given $p_i = w_1 w_2 \ldots w_n$, find vector embeddings $q_i, v_1, v_2, \ldots, v_n$ such that $P(w_j \mid p_i) = \mathrm{softmax}(v_j^T q_i)$
Again, a bias term $b_j$ may be added
Train the $q_i, v_j$ vectors on a large text corpus to maximize the log likelihood $\sum_{(i,j) \in D} \log P(w_j \mid p_i)$

PV-DM (Distributed Memory for Paragraph Vector)
Given $p_i = w_1 w_2 \ldots w_n$, find vector embeddings $q_i, v_1, \ldots, v_n, v'_1, \ldots, v'_n$ such that $P(w_4 \mid p_i, w_1 w_2 w_3) = \mathrm{softmax}(v_4^T [q_i; v'_1; v'_2; v'_3])$, where the paragraph vector $q_i$ and the context word vectors $v'_1, v'_2, v'_3$ are concatenated or averaged
A bias term $b_4$ may be added
Train the $q_i, v_j, v'_j$ vectors on a large text corpus to maximize the log likelihood $\sum_{(i,j) \in D} \log P(w_j \mid p_i, \text{context})$
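
Not part of the original slides, but as a concrete point of reference: the gensim library implements both of these models in its Doc2Vec class (dm=1 selects PV-DM, dm=0 selects PV-DBOW). A minimal sketch, where the corpus and all hyperparameter values are made-up placeholders:

```python
# Hypothetical sketch: train PV-DM and PV-DBOW paragraph vectors with gensim's Doc2Vec.
# The corpus and all hyperparameter values below are illustrative placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the movie was wonderful and moving",
    "the film was dull and far too long",
]

# Each paragraph p_i gets a tag so the model can learn a paragraph vector q_i for it.
documents = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

pv_dm = Doc2Vec(documents, vector_size=50, window=3, min_count=1, epochs=40, dm=1)    # PV-DM
pv_dbow = Doc2Vec(documents, vector_size=50, window=3, min_count=1, epochs=40, dm=0)  # PV-DBOW

# Infer a vector for an unseen paragraph: gradient steps on a new q with word vectors frozen.
q_new = pv_dm.infer_vector("a wonderful film".split())
print(q_new[:5])
```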

Results of [Le & Mikolov 2014]
PV-DM works better than PV-DBOW; slight improvement when used together
12.2% error rate on the Stanford sentiment analysis task using PV (20% improvement over state-of-the-art)
7.42% error rate on the IMDB review sentiment analysis task (15% improvement over state-of-the-art)
3.82% error rate on the paragraph similarity task (32% improvement over state-of-the-art)
Vector embedding works well at the sentence/paragraph level!

Vector Embedding
Today, word embedding is used as the first step in almost all NLP tasks: Word2Vec, GloVe, ELMo, BERT, …
In general, “vector embedding” is an extremely hot research topic for many different types of datasets: graph embedding, user embedding, time-series data embedding, …

Any Questions? Next topic: language models

Language Model: A Brief Recap
Given a sequence of words $w_1 w_2 \ldots w_n$, assign probability $P(w_1 w_2 \ldots w_n)$
Q: How can we compute $P(w_1 w_2 \ldots w_n)$?
A: For each word sequence $w_1 w_2 \ldots w_n$, count its frequency in the corpus, $\#(w_1 w_2 \ldots w_n)$, and divide it by the corpus size $N$ (divide by $(N - n + 1)$ to be precise)
Challenge: data sparsity. For any reasonably large $n$, $\#(w_1 w_2 \ldots w_n)$ is likely to be zero
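
As a sketch of the counting idea above (my own toy example, not from the slides): the naive estimate just counts occurrences of the sequence and divides by the number of length-$n$ windows.

```python
# Naive (unsmoothed) estimate of P(w_1 ... w_n): count occurrences of the sequence
# and divide by the number of length-n windows, (N - n + 1). Toy corpus for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

def sequence_probability(words, corpus):
    n = len(words)
    windows = [tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1)]
    return Counter(windows)[tuple(words)] / len(windows)

print(sequence_probability(["the", "cat"], corpus))              # 2 of 8 bigram windows
print(sequence_probability(["the", "mat", "sat", "on"], corpus)) # 0.0: data sparsity in action
```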

Chain Rule and N-gram Language Model
$P(w_1 w_2 \ldots w_n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1})$
For a 3-gram model, approximate $P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-2} w_{i-1})$
Then $P(w_1 w_2 \ldots w_n) \approx P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_{n-2} w_{n-1})$
Q: Good first-order approximation. How can we make it better?
Q: When does it work well? When does it fail? How can we avoid failure cases?
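
A small sketch of the 3-gram approximation (the toy corpus and the `<s>`/`</s>` padding convention are my own choices): estimate each conditional from trigram and bigram counts, then multiply along the chain.

```python
# Trigram approximation: P(w_1 ... w_n) ~= product of P(w_i | w_{i-2} w_{i-1}),
# each conditional estimated as #(w_{i-2} w_{i-1} w_i) / #(w_{i-2} w_{i-1}).
from collections import Counter

corpus = "<s> <s> the cat sat on the mat </s> <s> <s> the cat slept </s>".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def trigram_prob(sentence):
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for i in range(2, len(tokens)):
        context = (tokens[i - 2], tokens[i - 1])
        if trigrams[context + (tokens[i],)] == 0:
            return 0.0  # one unseen trigram zeroes out the whole product
        prob *= trigrams[context + (tokens[i],)] / bigrams[context]
    return prob

print(trigram_prob("the cat sat on the mat"))  # non-zero: every trigram was observed
print(trigram_prob("the dog sat on the mat"))  # 0.0: the failure case that smoothing addresses
```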

Ideas for n-Gram Model Improvement
Use longer n-grams: higher-order n-gram models
Estimates from low-frequency n-grams are inherently inaccurate
“Back off” to a lower-order estimate on a zero count
Smoothing: Laplace, Jelinek-Mercer, absolute discount, Katz, Kneser-Ney
Skip a missing word in the sequence: skipping
Use a context-specific language model
People tend to reuse words that they used before: caching
Build a separate language model for each sentence/document type: sentence-mixture model
Use word classes
Clustering: if “on Tuesday” is frequent, “on Wednesday” is also likely to be frequent

Dealing with Zero-Frequency Grams
Q: Some 3-grams, say $abc$, do not appear in our corpus. Should we set $P(c \mid ab) = 0$?
“Back-off policy”: use the estimate from a lower-order n-gram
$P(c \mid ab) = \begin{cases} P(c \mid ab) & \text{if } \#(abc) > 0 \\ P(c \mid b) & \text{if } \#(bc) > 0 \\ P(c) & \text{if } \#(c) > 0 \\ 1/N & \text{otherwise} \end{cases}$
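
A direct sketch of this back-off policy (toy counts of my own; note that this simple fall-through is not properly normalized the way full Katz back-off is):

```python
# Back-off policy from the slide: use the trigram estimate if its count is non-zero,
# otherwise fall back to bigram, then unigram, then 1/N. Toy corpus for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat slept on the sofa".split()
N = len(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def backoff_prob(c, a, b):
    if trigrams[(a, b, c)] > 0:
        return trigrams[(a, b, c)] / bigrams[(a, b)]
    if bigrams[(b, c)] > 0:
        return bigrams[(b, c)] / unigrams[b]
    if unigrams[c] > 0:
        return unigrams[c] / N
    return 1.0 / N  # word never seen at all

print(backoff_prob("sat", "the", "cat"))    # trigram "the cat sat" was observed
print(backoff_prob("slept", "the", "dog"))  # backs off all the way to the unigram estimate of "slept"
```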

Improving Low-Frequency Estimates
Estimates from low-frequency grams are inherently inaccurate
A zero-count estimate is an extreme example of this inaccuracy: a word/gram cannot appear 1.3 times!
Smoothing: a class of techniques that try to improve low-frequency estimates
“Smooth out” low-frequency estimates while “preserving” high-frequency estimates
Q: How?

Laplace Smoothing
Add 1 to all gram frequency counts:
$P_{Laplace}(c \mid ab) = \frac{\#(abc) + 1}{\#(ab) + V}$
$\#(abc)$: trigram frequency, $\#(ab)$: context frequency, $V$: vocabulary size
Assume every gram has been seen once even before we see any data
Q: Why did people think it was a good idea?
A: It assigns non-zero probability to zero-frequency grams, while a high-frequency gram’s estimate stays close to the original estimate
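
A minimal sketch of add-one smoothing for the trigram conditional above (toy counts; the corpus and names are my own):

```python
# Laplace (add-one) smoothing: pretend every possible continuation of the context
# has been seen once more than it actually was. Toy corpus for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
V = len(set(corpus))  # vocabulary size
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def laplace_prob(c, a, b):
    return (trigrams[(a, b, c)] + 1) / (bigrams[(a, b)] + V)

print(laplace_prob("sat", "the", "cat"))     # observed: raw 1/2 becomes (1+1)/(2+6)
print(laplace_prob("jumped", "the", "cat"))  # unseen: small but non-zero, 1/(2+6)
```

On a corpus this small the smoothed estimate moves a long way from the raw one; with large counts the +1 barely matters, which is the property the slide highlights.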

Jelinek-Mercer Smoothing (a.k.a. Simple Interpolation)
Back-off uses the lower-order estimate only on zero frequency. Let us use it more!
“Mix” the higher-order n-gram estimate with lower-order n-gram estimates:
$P_{JM}(c \mid ab) = \lambda_1 P(c \mid ab) + \lambda_2 P(c \mid b) + \lambda_3 P(c) + \lambda_4 / V = \lambda_1 \frac{\#(abc)}{\#(ab)} + \lambda_2 \frac{\#(bc)}{\#(b)} + \lambda_3 \frac{\#(c)}{N} + \lambda_4 \frac{1}{V}$, with $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$
A “smoothed version” of back-off
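
A sketch of the interpolation (the $\lambda$ values here are arbitrary placeholders; in practice they are tuned on held-out data):

```python
# Jelinek-Mercer smoothing: a fixed-weight mixture of trigram, bigram, unigram,
# and uniform estimates. Toy corpus; lambda weights are illustrative only.
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
N, V = len(corpus), len(set(corpus))
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def jm_prob(c, a, b, lambdas=(0.5, 0.3, 0.15, 0.05)):
    l1, l2, l3, l4 = lambdas
    p3 = trigrams[(a, b, c)] / bigrams[(a, b)] if bigrams[(a, b)] else 0.0
    p2 = bigrams[(b, c)] / unigrams[b] if unigrams[b] else 0.0
    p1 = unigrams[c] / N
    return l1 * p3 + l2 * p2 + l3 * p1 + l4 / V

print(jm_prob("sat", "the", "cat"))
print(jm_prob("jumped", "the", "cat"))  # never zero, thanks to the uniform 1/V term
```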

Absolute Discount Smoothing
“Subtract” a constant $d$ from each frequency and “mix in” the lower-order estimate:
$P_{AD}(c \mid ab) = \frac{\max(\#(abc) - d,\, 0)}{\#(ab)} + \alpha\, P_{AD}(c \mid b)$
$\alpha$ is chosen so that $\sum_w P_{AD}(w \mid ab) = 1$
Q: Why absolute discounting? Why did people think it would be a good idea?
A: When $\#(abc)$ is high, $P_{AD}$ is close to the non-smoothed estimate; when $\#(abc)$ is low, the lower-order estimate becomes important
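
A sketch of absolute discounting (the discount $d$ and the toy counts are placeholders; using a plain unigram model as the back-off distribution is a simplification of the recursive definition above):

```python
# Absolute discounting: subtract d from every observed trigram count and redistribute
# the removed mass (alpha) over a lower-order model, here a plain unigram distribution.
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
N = len(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
d = 0.5  # illustrative discount

def abs_discount_prob(c, a, b):
    context_count = bigrams[(a, b)]
    if context_count == 0:
        return unigrams[c] / N  # no trigram evidence for this context at all
    discounted = max(trigrams[(a, b, c)] - d, 0) / context_count
    # alpha: total mass freed by discounting, so that probabilities still sum to 1
    seen_continuations = sum(1 for (x, y, _) in trigrams if (x, y) == (a, b))
    alpha = d * seen_continuations / context_count
    return discounted + alpha * (unigrams[c] / N)

print(abs_discount_prob("sat", "the", "cat"))
print(abs_discount_prob("jumped", "the", "cat"))  # gets only the redistributed mass
```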

Kneser-Ney Smoothing
Problem with absolute discounting
Suppose the phrase “San Francisco” shows up many times in the training corpus: we might have seen “San Francisco” a lot more than “glasses”, so #(Francisco) > #(glasses)
Let’s predict: I can’t see without my reading ___.
$P_{AD}(\text{Francisco} \mid \text{reading}) = \frac{\max(\#(\text{reading Francisco}) - d,\, 0)}{\#(\text{reading})} + \alpha\, P_{AD}(\text{Francisco})$
Very likely $\#(\text{reading Francisco}) < d$, so the second term, which grows with #(Francisco), will dominate. We might end up predicting the blank to be “Francisco”!
Problem: “Francisco” appears mostly after “San”, but not after other words, whereas “glasses” may appear after many different words
Q: How can we incorporate this information?

Kneser-Ney Smoothing
Assign a higher probability if a word appears after many different words: make the back-off probability proportional to the number of distinct previous context words
$P_{KN}(w_i \mid w_{i-2} w_{i-1}) = \frac{\max(\#(w_{i-2} w_{i-1} w_i) - d,\, 0)}{\#(w_{i-2} w_{i-1})} + \alpha\, \frac{|\{v : \#(v\, w_i) > 0\}|}{\sum_w |\{v : \#(v\, w) > 0\}|}$
Modified Kneser-Ney smoothing: use a different discount $d$ for each gram count
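
A bigram-level sketch of the idea (the trigram formula above works the same way; the corpus, discount, and names are illustrative): the back-off term uses the number of distinct left contexts a word appears after, not its raw count.

```python
# Kneser-Ney (bigram version): discounted bigram estimate plus a back-off term that is
# proportional to the word's continuation count |{v : #(v w) > 0}|. Toy corpus only.
from collections import Counter

corpus = "i live in san francisco i read without my glasses he lost his glasses".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
d = 0.75  # illustrative discount

# continuation contexts: the set of distinct words v seen immediately before each word w
left_contexts = {}
for v, w in bigrams:
    left_contexts.setdefault(w, set()).add(v)
total_continuations = sum(len(s) for s in left_contexts.values())

def kn_prob(w, prev):
    discounted = max(bigrams[(prev, w)] - d, 0) / unigrams[prev]
    types_after_prev = sum(1 for (v, _) in bigrams if v == prev)
    alpha = d * types_after_prev / unigrams[prev]
    p_continuation = len(left_contexts.get(w, ())) / total_continuations
    return discounted + alpha * p_continuation

# "francisco" only ever follows "san", while "glasses" follows two different words,
# so under an unseen context the back-off favors "glasses" despite similar raw counts.
print(kn_prob("francisco", "without"))
print(kn_prob("glasses", "without"))
```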

Katz Smoothing
The estimates for grams that appear only once in the corpus are likely to be higher than their true values: a gram cannot appear only 0.2 times!
We need to “discount” their estimates appropriately. Q: By how much?
“Good-Turing frequency estimator”: any n-gram that occurs $r$ times should be discounted to $disc(r) = (r+1)\frac{n_{r+1}}{n_r}$, where $n_r$ is the number of n-grams that occur exactly $r$ times
Katz smoothing, intuition: $P_{Katz}(c \mid ab) = \frac{disc(\#(abc))}{\#(ab)}$

Katz Smoothing
$P_{Katz}(c \mid ab) = \begin{cases} \frac{disc(\#(abc))}{\#(ab)} & \text{if } \#(abc) > 0 \\ \alpha\, P_{Katz}(c \mid b) & \text{otherwise} \end{cases}$
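
A sketch combining the Good-Turing discount with the back-off above (toy corpus; the back-off weight alpha is left as a fixed placeholder rather than computed so the distribution sums exactly to 1, and the discount falls back to the raw count when $n_{r+1} = 0$, which happens constantly on data this small):

```python
# Katz smoothing sketch: Good-Turing-discounted trigram estimate when the trigram was
# seen, alpha times a (here unsmoothed) bigram estimate otherwise. Toy counts only.
from collections import Counter

corpus = "the cat sat on the mat the cat sat on the sofa the cat slept".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

n = Counter(trigrams.values())  # n[r]: number of distinct trigrams occurring exactly r times

def good_turing_disc(r):
    if n[r] == 0 or n[r + 1] == 0:
        return r  # estimator unusable here; keep the raw count (a real implementation caps r)
    return (r + 1) * n[r + 1] / n[r]

def katz_prob(c, a, b, alpha=0.4):  # alpha is an illustrative placeholder
    if trigrams[(a, b, c)] > 0:
        return good_turing_disc(trigrams[(a, b, c)]) / bigrams[(a, b)]
    return alpha * (bigrams[(b, c)] / unigrams[b] if unigrams[b] else 0.0)

print(katz_prob("slept", "the", "cat"))  # seen once: its count is discounted below 1
print(katz_prob("slept", "a", "cat"))    # unseen trigram: backs off to the bigram estimate
```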

Skipping
Basic idea: why do we have to drop the farthest word when frequency is low?
What about “told Jelinek news”? “Jelinek news” is still infrequent because “Jelinek” is an infrequent word!
“Skip” or “ignore” middle words:
$P_{skip}(d \mid abc) = \lambda_1 P(d \mid abc) + \lambda_2 P(d \mid bc) + \lambda_3 P(d \mid ac) + \lambda_4 P(d \mid ab)$, with $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$
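
A sketch of skipping with 4-grams (toy corpus and arbitrary $\lambda$ weights; `None` marks the skipped position in the context):

```python
# Skipping: interpolate conditionals that ignore different context positions,
# including the middle ones, so rare middle words don't force a full back-off.
from collections import Counter

corpus = "he told jelinek the news she told me the news".split()
fourgrams = Counter(zip(corpus, corpus[1:], corpus[2:], corpus[3:]))

def cond(target, pattern):
    """P(target | pattern): pattern is (a, b, c) where None means that position is skipped."""
    num = den = 0
    for (a, b, c, d), cnt in fourgrams.items():
        if all(p is None or p == w for p, w in zip(pattern, (a, b, c))):
            den += cnt
            if d == target:
                num += cnt
    return num / den if den else 0.0

def skip_prob(d, a, b, c, lambdas=(0.4, 0.3, 0.2, 0.1)):
    l1, l2, l3, l4 = lambdas
    return (l1 * cond(d, (a, b, c)) +     # P(d | a b c)
            l2 * cond(d, (None, b, c)) +  # P(d | b c)
            l3 * cond(d, (a, None, c)) +  # P(d | a c): skip the middle word
            l4 * cond(d, (a, b, None)))   # P(d | a b)

print(skip_prob("news", "told", "jelinek", "the"))
print(skip_prob("news", "told", "chomsky", "the"))  # unseen middle word, but P(news | told _ the) still helps
```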

Caching: User-Specific Language Model
A person tends to use the same words that she used before
Build a “user-specific” language model and mix it with the general language model:
$P_{cache}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P_{general}(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P_{user}(w_i \mid w_{i-2} w_{i-1})$
The “user” can be the “recent past”, e.g., a “user model” built from the past 1,000 words
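
A sketch of the cache mixture (both sub-models, the toy texts, and the weights are placeholders):

```python
# Caching: interpolate a general trigram model with a "user" trigram model built
# from the user's recent text (e.g. their past 1000 words). Toy texts for illustration.
from collections import Counter

def trigram_model(tokens):
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    def prob(w, a, b):
        return tri[(a, b, w)] / bi[(a, b)] if bi[(a, b)] else 0.0
    return prob

general_text = "the meeting is at noon the report is due tomorrow".split()
user_recent_text = "the demo is at ten the demo is ready".split()  # stands in for the user's recent words

p_general = trigram_model(general_text)
p_user = trigram_model(user_recent_text)

def p_cache(w, a, b, lam=(0.7, 0.3)):
    return lam[0] * p_general(w, a, b) + lam[1] * p_user(w, a, b)

print(p_cache("noon", "is", "at"))  # predicted by the general model
print(p_cache("ten", "is", "at"))   # zero under the general model, but the cache remembers the user's habit
```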

Sentence-Mixture Model
A health-news language model should be very different from a Twitter language model
Build a separate language model for each type of corpus: use the appropriate model if the source is known, and mix the models if the source is unknown
$P_{mix}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P_{\text{Health}}(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P_{\text{Twitter}}(w_i \mid w_{i-2} w_{i-1})$
“Different sources” can also be different sentence types: questions, exclamations, statements, …
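
A sketch of the source-dependent choice versus mixture (the corpora, labels, and weights are placeholders; unigram models keep the example short, though the slide's formula uses trigrams):

```python
# Sentence-mixture sketch: one model per source type; pick the matching model when the
# source is known, mix the models with weights lambda when it is unknown.
from collections import Counter

health_text = "new study links exercise to lower blood pressure".split()
twitter_text = "lol that game was insane gg".split()

def unigram_model(tokens):
    counts, total = Counter(tokens), len(tokens)
    return lambda w: counts[w] / total

models = {"health": unigram_model(health_text), "twitter": unigram_model(twitter_text)}

def p_mix(w, source=None, lam=(0.5, 0.5)):
    if source in models:  # source known: use that model directly
        return models[source](w)
    return lam[0] * models["health"](w) + lam[1] * models["twitter"](w)  # source unknown: mix

print(p_mix("study", source="health"))
print(p_mix("study"))  # unknown source: the estimate is diluted by the Twitter model
```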

Summary: Language Model
Caching provides the most significant accuracy improvement, but collecting accurate user-specific statistics is often difficult and the complexity of the model increases significantly
Among smoothing techniques, modified Kneser-Ney performs best in practice
The sentence-mixture model also works well in practice, but it requires significantly more training data and increases the model complexity
Combining all three techniques does work

Announcement
Start preparing your presentation: your presentations will start next week!
Decide project ideas for your group
The project proposal is due by the 5th week; project idea presentations are during the 5th week
Working on a Kaggle competition can be an option
If you get stuck, I am available to help: leverage my office hours to your advantage