Neural Language Model CS246 Junghoo “John” Cho

Statistical Language Model
The probability of a word sequence $w_1, \ldots, w_n$: $P(w_1, \ldots, w_n)$
Q: Why is it useful?
A: Many applications, e.g., word prediction, spell correction, …
$P(w \mid w_1, \ldots, w_n) = \dfrac{P(w_1, \ldots, w_n, w)}{P(w_1, \ldots, w_n)}$
Q: How can we “learn” the language model $P(w_1, \ldots, w_n)$?
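For instance, the word-prediction probability above is just a ratio of two sequence probabilities. A minimal sketch, assuming hypothetical (made-up) sequence probabilities:

```python
# Word prediction with a language model, using made-up sequence probabilities:
# P(w | w_1, ..., w_n) = P(w_1, ..., w_n, w) / P(w_1, ..., w_n)
seq_prob = {
    ("I", "love"): 0.0010,          # P("I love")
    ("I", "love", "cats"): 0.0004,  # P("I love cats")
    ("I", "love", "cars"): 0.0003,  # P("I love cars")
}

def next_word_prob(context, word):
    """Conditional probability of `word` following `context`."""
    return seq_prob[context + (word,)] / seq_prob[context]

print(next_word_prob(("I", "love"), "cats"))  # ≈ 0.4
print(next_word_prob(("I", "love"), "cars"))  # ≈ 0.3
```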

Learning Language Model
Input corpus: a 1-billion-word sequence, 10,000-word vocabulary
Q: How many times will each word $w_i$ (unigram) be seen on average? Will the estimated $P(w_i)$ be reliable?
Q: What about $w_i w_j$ (bigram)?
Q: What about $w_i w_j w_k$ (trigram)?
The quality of the learned language model depends heavily on the “quality” and “quantity” of the input corpus
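A back-of-the-envelope calculation for these questions (it assumes n-grams are spread uniformly, which real text is not, so the per-n-gram counts are optimistic):

```python
# Average number of occurrences per n-gram in a 1-billion-word corpus
# with a 10,000-word vocabulary, under a uniform-spread assumption.
corpus_tokens = 1_000_000_000   # 1 billion words
vocab = 10_000                  # 10,000-word vocabulary

for n in (1, 2, 3):
    distinct = vocab ** n              # number of possible n-grams
    avg = corpus_tokens / distinct     # average count per n-gram
    print(f"{n}-gram: {distinct:.0e} possible, ~{avg:g} occurrences each")
# 1-gram: 1e+04 possible, ~100000 occurrences each
# 2-gram: 1e+08 possible, ~10 occurrences each
# 3-gram: 1e+12 possible, ~0.001 occurrences each
```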

Generalization
Q: How can we “estimate” $P(w_1, \ldots, w_n)$ for longer $w_1, \ldots, w_n$?
A: Chain rule and approximation by n-gram
$P(w_1, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$
Unigram model: $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i)$
Bigram model: $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$
N-gram model: $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$
In practice, N is at most 3, because the number of possible n-grams grows as $V^N$
This was the state of the art until the early 2000’s [Goodman 2001]
Q: Can we estimate $P(w_1, \ldots, w_n)$ for long $w_1, \ldots, w_n$ better?
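A minimal sketch of a count-based bigram model on a toy corpus (maximum-likelihood counts, no smoothing), following the chain-rule and bigram approximation above:

```python
# Bigram language model estimated by counting (toy corpus, no smoothing).
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    """P(w | prev) estimated as count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sequence(words):
    """P(w_1, ..., w_n) under the bigram approximation."""
    p = unigrams[words[0]] / len(corpus)          # P(w_1)
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)                    # P(w_i | w_{i-1})
    return p

print(p_sequence("the cat sat".split()))          # ≈ 0.111
```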

Neural Probabilistic Language Model [Bengio 2003]
Learn the probability function $P(w \mid w_1, \ldots, w_n)$ using a neural network!
Represent each word $w_i$ as an m-dimensional vector $v_i$
Express $P(w \mid w_1, \ldots, w_n)$ as a function of the $v_i$’s: $f(w, v_1, \ldots, v_n)$, with $\sum_i f(w_i, v_1, \ldots, v_n) = 1$
Learn the function $f(w, v_1, \ldots, v_n)$ and the vector representations $v_i$ of the words $w_i$ simultaneously by training a neural network

Example
4 words, 2-dimensional vector representation:
$v_1 = (0.1, 0.8)$, $v_2 = (0.3, 0.2)$, $v_3 = (0.7, 0.1)$, $v_4 = (0.6, 0.4)$
$P(w_i \mid w_3 w_2 w_4) = f(w_i, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4))$
$f(w_1, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.6$
$f(w_2, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.2$
$f(w_3, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.1$
$f(w_4, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.1$
$f()$ is a “vector function” that maps a sequence of input word vectors to an output probability vector of dimension V
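A toy sketch of the interface of such a vector function: it takes the context word vectors and returns a V-dimensional probability vector. The weights here are random, so the outputs will not match the slide’s illustrative numbers (0.6, 0.2, 0.1, 0.1):

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n = 4, 2, 3                      # vocabulary size, vector dim, context size

vectors = np.array([[0.1, 0.8],        # v_1
                    [0.3, 0.2],        # v_2
                    [0.7, 0.1],        # v_3
                    [0.6, 0.4]])       # v_4

W = rng.normal(size=(V, n * m))        # stand-in for the network's parameters

def f(context_ids):
    """Map the context word vectors to a probability over all V words."""
    x = vectors[context_ids].reshape(-1)       # concatenate the context vectors
    scores = W @ x
    return np.exp(scores) / np.exp(scores).sum()   # softmax

p = f([2, 1, 3])                       # context w_3 w_2 w_4 (0-indexed)
print(p, p.sum())                      # four probabilities summing to 1
```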

Probability Function $f(v_1, \ldots, v_n)$
A function from $n$ m-dimensional vectors to a V-dimensional probability vector
Each dimension of the output vector represents the probability of one word $w_i$
Q: How can we represent $f(v_1, \ldots, v_n)$?
[Diagram: $f$ takes $v_1, v_2, \ldots, v_n$ as input and outputs the vector $(w_1: 0.03,\ w_2: 0.01,\ \ldots,\ w_V: 0.05)$]

Word to Vector Mapping
Maps the V words to m-dimensional vectors
Can be represented as an $m \times V$ matrix, where each column is the vector representation of one word
Can be interpreted as a mapping from the 1-hot encoding of a word to its m-dimensional vector encoding:
$\begin{pmatrix} | & | & & | \\ v_1 & v_2 & \cdots & v_V \\ | & | & & | \end{pmatrix} \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} | \\ v_i \\ | \end{pmatrix}$
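A minimal numpy sketch of this mapping: multiplying the $m \times V$ matrix by a 1-hot vector selects one column, i.e. that word’s vector (toy numbers reused from the example above):

```python
import numpy as np

m, V = 2, 4
W = np.array([[0.1, 0.3, 0.7, 0.6],    # each column is one word's vector
              [0.8, 0.2, 0.1, 0.4]])

i = 2                                   # look up word w_3 (0-indexed column 2)
one_hot = np.zeros(V)
one_hot[i] = 1.0

print(W @ one_hot)       # [0.7 0.1]
print(W[:, i])           # same column; in practice the lookup is done directly
```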

Neural Network of $f(v_1, \ldots, v_n)$ in [Bengio 2003]
$v_i = W w_i$   ($w_i$: 1-hot encoding of the $i$-th context word; $W$: $m \times V$ word-to-vector matrix; output dimension $[m]$)
$\vec{h} = H \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}$   ($H$: $h \times nm$ hidden-layer matrix; output dimension $[h]$)
$\vec{y} = U \tanh(\vec{h})$   ($U$: $V \times h$ output matrix; output dimension $[V]$)
$\vec{f} = \mathrm{softmax}(\vec{y})$   (V-dimensional probability vector)
where $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$ and $\mathrm{softmax}(x)_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$
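A minimal numpy sketch of this forward pass (random weights, small dimensions chosen only for illustration; the full model in the paper also includes bias terms and optional direct input-to-output connections, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n, h = 20, 5, 3, 8            # vocab size, vector dim, context size, hidden size

W = rng.normal(scale=0.1, size=(m, V))      # word-to-vector mapping
H = rng.normal(scale=0.1, size=(h, n * m))  # hidden layer
U = rng.normal(scale=0.1, size=(V, h))      # output layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(context_ids):
    """P(w | w_1, ..., w_n) for every word w in the vocabulary."""
    v = np.concatenate([W[:, i] for i in context_ids])  # stack v_1, ..., v_n
    hidden = H @ v                                       # h = H [v_1; ...; v_n]
    y = U @ np.tanh(hidden)                              # y = U tanh(h)
    return softmax(y)                                    # f = softmax(y)

p = forward([3, 7, 12])
print(p.shape, p.sum())             # (20,) ~1.0
```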

Learning $f$ (and $v$) [Bengio 2003]
Use the back-propagation algorithm to maximize the log-likelihood of the training data
For every training instance, the entire matrices $U$ (all $h \times V$ entries) and $H$ (all $n \times m \times h$ entries) are updated, but only $n \times m$ entries of $W$
Q: V = 20,000, n = 10, m = 50, h = 100. How many parameter updates in each layer?
Updating $U$ is computationally the most expensive
Parallelized $U$-update algorithm for fast learning
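Working out the counts asked in the question above shows why the $U$ update dominates:

```python
# Parameter updates per training instance for V = 20,000, n = 10, m = 50, h = 100.
V, n, m, h = 20_000, 10, 50, 100

updates_U = h * V        # output layer:  100 * 20,000 = 2,000,000 entries
updates_H = h * n * m    # hidden layer:  100 * 10 * 50 = 50,000 entries
updates_W = n * m        # only the n context-word columns of W: 10 * 50 = 500

print(updates_U, updates_H, updates_W)   # 2000000 50000 500
```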

Result of [Bengio 2003]
Obtained a state-of-the-art result, in terms of perplexity, on a 15-million-word input dataset
10-20% better than a smoothed trigram model
Generated a lot of excitement: yet another significant improvement from a “neural network”
Much follow-up work was done after this paper