Neural Language Model CS246 Junghoo “John” Cho

Statistical Language Model
The probability of a word sequence $w_1, \ldots, w_n$: $P(w_1, \ldots, w_n)$
Q: Why is it useful?
A: Many applications, e.g., word prediction, spell correction, …
$P(w \mid w_1, \ldots, w_n) = \dfrac{P(w_1, \ldots, w_n, w)}{P(w_1, \ldots, w_n)}$
Q: How can we “learn” the language model $P(w_1, \ldots, w_n)$?
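For instance, the word-prediction probability above is just a ratio of two sequence probabilities. A minimal sketch, assuming hypothetical (made-up) sequence probabilities:

```python
# Word prediction with a language model, using made-up sequence probabilities:
# P(w | w_1, ..., w_n) = P(w_1, ..., w_n, w) / P(w_1, ..., w_n)
seq_prob = {
    ("I", "love"): 0.0010,          # P("I love")
    ("I", "love", "cats"): 0.0004,  # P("I love cats")
    ("I", "love", "cars"): 0.0003,  # P("I love cars")
}

def next_word_prob(context, word):
    """Conditional probability of `word` following `context`."""
    return seq_prob[context + (word,)] / seq_prob[context]

print(next_word_prob(("I", "love"), "cats"))  # ≈ 0.4
print(next_word_prob(("I", "love"), "cars"))  # ≈ 0.3
```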

Learning Language Model
Input corpus: a 1-billion-word sequence, 10,000-word vocabulary
Q: How many times will each word $w_i$ (unigram) be seen on average? Will the estimated $P(w_i)$ be reliable?
Q: What about $w_i w_j$ (bigram)?
Q: What about $w_i w_j w_k$ (trigram)?
The quality of the learned language model depends heavily on the “quality” and “quantity” of the input corpus
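A back-of-the-envelope calculation for these questions (it assumes n-grams are spread uniformly, which real text is not, so the per-n-gram counts are optimistic):

```python
# Average number of occurrences per n-gram in a 1-billion-word corpus
# with a 10,000-word vocabulary, under a uniform-spread assumption.
corpus_tokens = 1_000_000_000   # 1 billion words
vocab = 10_000                  # 10,000-word vocabulary

for n in (1, 2, 3):
    distinct = vocab ** n              # number of possible n-grams
    avg = corpus_tokens / distinct     # average count per n-gram
    print(f"{n}-gram: {distinct:.0e} possible, ~{avg:g} occurrences each")
# 1-gram: 1e+04 possible, ~100000 occurrences each
# 2-gram: 1e+08 possible, ~10 occurrences each
# 3-gram: 1e+12 possible, ~0.001 occurrences each
```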

Generalization
Q: How can we “estimate” $P(w_1, \ldots, w_n)$ for longer $w_1, \ldots, w_n$?
A: Chain rule and approximation by n-gram
$P(w_1, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$
Unigram model: $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i)$
Bigram model: $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$
N-gram model: $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$
In practice, N is at most 3, because the number of possible n-grams grows as $V^N$
This was the state of the art until the early 2000’s [Goodman 2001]
Q: Can we estimate $P(w_1, \ldots, w_n)$ for long $w_1, \ldots, w_n$ better?
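A minimal sketch of a count-based bigram model on a toy corpus (maximum-likelihood counts, no smoothing), following the chain-rule and bigram approximation above:

```python
# Bigram language model estimated by counting (toy corpus, no smoothing).
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    """P(w | prev) estimated as count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sequence(words):
    """P(w_1, ..., w_n) under the bigram approximation."""
    p = unigrams[words[0]] / len(corpus)          # P(w_1)
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)                    # P(w_i | w_{i-1})
    return p

print(p_sequence("the cat sat".split()))          # ≈ 0.111
```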

Neural Probabilistic Language Model [Bengio 2003]
Learn the probability function $P(w \mid w_1, \ldots, w_n)$ using a neural network!
Represent each word $w_i$ as an m-dimensional vector $v_i$
Express $P(w \mid w_1, \ldots, w_n)$ as a function of the $v_i$’s: $f(w, v_1, \ldots, v_n)$, with $\sum_i f(w_i, v_1, \ldots, v_n) = 1$
Learn the function $f(w, v_1, \ldots, v_n)$ and the vector representations $v_i$ of the words $w_i$ simultaneously by training a neural network

Example
4 words, 2-dimensional vector representation:
$v_1 = (0.1, 0.8)$, $v_2 = (0.3, 0.2)$, $v_3 = (0.7, 0.1)$, $v_4 = (0.6, 0.4)$
$P(w_i \mid w_3 w_2 w_4) = f(w_i, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4))$
$f(w_1, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.6$
$f(w_2, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.2$
$f(w_3, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.1$
$f(w_4, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.1$
$f()$ is a “vector function” that maps a sequence of input word vectors to an output probability vector of dimension V
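A toy sketch of the interface of such a vector function: it takes the context word vectors and returns a V-dimensional probability vector. The weights here are random, so the outputs will not match the slide’s illustrative numbers (0.6, 0.2, 0.1, 0.1):

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n = 4, 2, 3                      # vocabulary size, vector dim, context size

vectors = np.array([[0.1, 0.8],        # v_1
                    [0.3, 0.2],        # v_2
                    [0.7, 0.1],        # v_3
                    [0.6, 0.4]])       # v_4

W = rng.normal(size=(V, n * m))        # stand-in for the network's parameters

def f(context_ids):
    """Map the context word vectors to a probability over all V words."""
    x = vectors[context_ids].reshape(-1)       # concatenate the context vectors
    scores = W @ x
    return np.exp(scores) / np.exp(scores).sum()   # softmax

p = f([2, 1, 3])                       # context w_3 w_2 w_4 (0-indexed)
print(p, p.sum())                      # four probabilities summing to 1
```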

Probability Function $f(v_1, \ldots, v_n)$
A function from $n$ m-dimensional vectors to a V-dimensional probability vector
Each dimension of the output vector represents the probability of one word $w_i$
Q: How can we represent $f(v_1, \ldots, v_n)$?
[Diagram: $f$ takes $v_1, v_2, \ldots, v_n$ as input and outputs the vector $(w_1: 0.03,\ w_2: 0.01,\ \ldots,\ w_V: 0.05)$]

Word to Vector Mapping
Maps the V words to m-dimensional vectors
Can be represented as an $m \times V$ matrix, where each column is the vector representation of one word
Can be interpreted as a mapping from the 1-hot encoding of a word to its m-dimensional vector encoding:
$\begin{pmatrix} | & | & & | \\ v_1 & v_2 & \cdots & v_V \\ | & | & & | \end{pmatrix} \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} | \\ v_i \\ | \end{pmatrix}$
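A minimal numpy sketch of this mapping: multiplying the $m \times V$ matrix by a 1-hot vector selects one column, i.e. that word’s vector (toy numbers reused from the example above):

```python
import numpy as np

m, V = 2, 4
W = np.array([[0.1, 0.3, 0.7, 0.6],    # each column is one word's vector
              [0.8, 0.2, 0.1, 0.4]])

i = 2                                   # look up word w_3 (0-indexed column 2)
one_hot = np.zeros(V)
one_hot[i] = 1.0

print(W @ one_hot)       # [0.7 0.1]
print(W[:, i])           # same column; in practice the lookup is done directly
```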

Neural Network of $f(v_1, \ldots, v_n)$ in [Bengio 2003]
$v_i = W w_i$   ($w_i$: 1-hot encoding of the $i$-th context word; $W$: $m \times V$ word-to-vector matrix; output dimension $[m]$)
$\vec{h} = H \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}$   ($H$: $h \times nm$ hidden-layer matrix; output dimension $[h]$)
$\vec{y} = U \tanh(\vec{h})$   ($U$: $V \times h$ output matrix; output dimension $[V]$)
$\vec{f} = \mathrm{softmax}(\vec{y})$   (V-dimensional probability vector)
where $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$ and $\mathrm{softmax}(x)_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$
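A minimal numpy sketch of this forward pass (random weights, small dimensions chosen only for illustration; the full model in the paper also includes bias terms and optional direct input-to-output connections, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n, h = 20, 5, 3, 8            # vocab size, vector dim, context size, hidden size

W = rng.normal(scale=0.1, size=(m, V))      # word-to-vector mapping
H = rng.normal(scale=0.1, size=(h, n * m))  # hidden layer
U = rng.normal(scale=0.1, size=(V, h))      # output layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(context_ids):
    """P(w | w_1, ..., w_n) for every word w in the vocabulary."""
    v = np.concatenate([W[:, i] for i in context_ids])  # stack v_1, ..., v_n
    hidden = H @ v                                       # h = H [v_1; ...; v_n]
    y = U @ np.tanh(hidden)                              # y = U tanh(h)
    return softmax(y)                                    # f = softmax(y)

p = forward([3, 7, 12])
print(p.shape, p.sum())             # (20,) ~1.0
```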

Learning $f$ (and $v$) [Bengio 2003]
Use the back-propagation algorithm to maximize the log-likelihood of the training data
For every training instance, the entire matrices $U$ (all $h \times V$ entries) and $H$ (all $n \times m \times h$ entries) are updated, but only $n \times m$ entries of $W$
Q: V = 20,000, n = 10, m = 50, h = 100. How many parameter updates in each layer?
Updating $U$ is computationally the most expensive
Parallelized $U$-update algorithm for fast learning
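Working out the counts asked in the question above shows why the $U$ update dominates:

```python
# Parameter updates per training instance for V = 20,000, n = 10, m = 50, h = 100.
V, n, m, h = 20_000, 10, 50, 100

updates_U = h * V        # output layer:  100 * 20,000 = 2,000,000 entries
updates_H = h * n * m    # hidden layer:  100 * 10 * 50 = 50,000 entries
updates_W = n * m        # only the n context-word columns of W: 10 * 50 = 500

print(updates_U, updates_H, updates_W)   # 2000000 50000 500
```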

Result of [Bengio 2003]
Obtained a state-of-the-art result, in terms of perplexity, on a 15-million-word input dataset
10-20% better than a smoothed trigram model
Generated a lot of excitement: yet another significant improvement from a “neural network”
Much follow-up work was done after this paper