1 Neural Language Model CS246 Junghoo “John” Cho

2 Statistical Language Model
The probability of a word sequence $w_1, \ldots, w_n$: $P(w_1, \ldots, w_n)$
Q: Why is it useful?
A: Many applications, e.g., word prediction, spell correction, ...
$P(w \mid w_1, \ldots, w_n) = \dfrac{P(w_1, \ldots, w_n, w)}{P(w_1, \ldots, w_n)}$
Q: How can we "learn" the language model $P(w_1, \ldots, w_n)$?
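A minimal count-based sketch of this estimator (the toy corpus and helper functions below are illustrative assumptions, not part of the slides):

```python
# Sketch: P(w | w_1, ..., w_n) = count(w_1, ..., w_n, w) / count(w_1, ..., w_n)
corpus = "the cat sat on the mat the cat ate".split()

def seq_count(tokens, seq):
    """Number of times the subsequence `seq` occurs in `tokens`."""
    k = len(seq)
    return sum(1 for i in range(len(tokens) - k + 1) if tuple(tokens[i:i + k]) == seq)

def p_next(tokens, context, w):
    """Estimate P(w | context) from raw counts."""
    denom = seq_count(tokens, tuple(context))
    return seq_count(tokens, tuple(context) + (w,)) / denom if denom else 0.0

print(p_next(corpus, ("the", "cat"), "sat"))  # 0.5: "the cat" occurs twice, "the cat sat" once
```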

3 Learning Language Model
Input corpus: a 1-billion-word sequence over a 10,000-word vocabulary.
Q: How many times will each word $w_i$ (unigram) be seen on average? Will the estimated $P(w_i)$ be reliable?
Q: What about $w_i w_j$ (bigram)?
Q: What about $w_i w_j w_k$ (trigram)? (See the back-of-the-envelope calculation below.)
The quality of the learned language model depends heavily on the "quality" and "quantity" of the input corpus.
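A quick sketch that just works out the averages implied by the numbers on this slide, showing why unigram counts are reliable while trigram counts are not:

```python
# Average number of observations per n-gram type for a 1-billion-word corpus
# and a 10,000-word vocabulary.
corpus_size = 1_000_000_000
vocab = 10_000

for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    possible = vocab ** n         # number of distinct n-grams
    avg = corpus_size / possible  # average count per n-gram
    print(f"{name}: {possible:.0e} possible, ~{avg:g} occurrences each on average")

# unigram: 1e+04 possible, ~100000 occurrences each on average
# bigram:  1e+08 possible, ~10 occurrences each on average
# trigram: 1e+12 possible, ~0.001 occurrences each on average
```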

4 Generalization
Q: How can we "estimate" $P(w_1, \ldots, w_n)$ for longer $w_1, \ldots, w_n$?
A: Chain rule and approximation by n-grams (a minimal bigram sketch follows below):
$P(w_1, \ldots, w_n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$
Unigram model: $P(w_i \mid w_j, \ldots, w_k) \approx P(w_i)$
Bigram model: $P(w_i \mid w_j, \ldots, w_k) \approx P(w_i \mid w_k)$
N-gram model: $P(w_i \mid w_j, \ldots, w_k) \approx P(w_i \mid w_{k-n+2}, \ldots, w_k)$
In practice, N is at most 3, because the number of possible n-grams grows exponentially with N
This was the state of the art until the early 2000s [Goodman 2001]
Q: Can we estimate $P(w_1, \ldots, w_n)$ for long $w_1, \ldots, w_n$ better?
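To make the chain-rule/n-gram idea concrete, here is a minimal bigram-model sketch (the toy corpus and function names are illustrative assumptions):

```python
from collections import Counter

# Bigram approximation: P(w_1, ..., w_n) ~= P(w_1) * prod_i P(w_i | w_{i-1}),
# with each factor estimated from raw counts.
tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(sentence):
    words = sentence.split()
    p = unigrams[words[0]] / len(tokens)            # P(w_1)
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # P(w_i | w_{i-1})
    return p

print(bigram_prob("the cat sat"))  # (2/6) * (1/2) * (1/1) = 1/6
```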

5 Neural Probabilistic Language Model [Bengio 2003]
Learn the probability function $P(w \mid w_1, \ldots, w_n)$ using a neural network!
Represent each word $w_i$ as an m-dimensional vector $v_i$
Express $P(w \mid w_1, \ldots, w_n)$ as a function of the $v_i$'s: $f(w, v_1, \ldots, v_n)$, with $\sum_i f(w_i, v_1, \ldots, v_n) = 1$
Learn the function $f(w, v_1, \ldots, v_n)$ and the vector representations $v_i$ of the words $w_i$ simultaneously by training a neural network

6 Example: 4 words, 2-dimensional vector representation
Word vectors: $w_1 = (0.1, \ldots)$, $w_2 = (0.3, 0.2)$, $w_3 = (0.7, 0.1)$, $w_4 = (0.6, 0.4)$
$P(w_i \mid w_3 w_2 w_4) = f(w_i, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4))$
$f(w_1, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.6$
$f(w_2, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.2$
$f(w_3, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.1$
$f(w_4, (0.7, 0.1), (0.3, 0.2), (0.6, 0.4)) = 0.1$
$f()$ is a "vector function" that maps a sequence of input word vectors to a V-dimensional output probability vector

7 Probability Function $f(v_1, \ldots, v_n)$
A function from $n$ m-dimensional vectors to a V-dimensional probability vector
Each dimension of the output vector represents the probability of one word $w_i$
Q: How can we represent $f(v_1, \ldots, v_n)$?
[Diagram: $f$ takes $v_1, v_2, \ldots, v_n$ as input and outputs a probability for each word, e.g., $w_1: \ldots$, $w_2: 0.01$, $\ldots$, $w_V: 0.05$]

8 Word-to-Vector Mapping: V words to m-dimensional vectors
Can be represented as an $m \times V$ matrix, where each column is the vector representation of one word
Can be interpreted as a mapping from the one-hot encoding of a word to its m-dimensional vector encoding:
$\begin{pmatrix} | & | & & | \\ v_1 & v_2 & \cdots & v_V \\ | & | & & | \end{pmatrix} \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix} = v_i$
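A small numpy sketch of this interpretation (toy sizes and values): multiplying the $m \times V$ matrix by a one-hot vector simply selects the corresponding column.

```python
import numpy as np

# Word-to-vector mapping as an m x V matrix times a one-hot vector.
m, V = 2, 4                                  # toy sizes
W = np.array([[0.1, 0.3, 0.7, 0.6],          # each column is one word's vector
              [0.2, 0.2, 0.1, 0.4]])

i = 2                                        # look up the third word
one_hot = np.zeros(V)
one_hot[i] = 1.0

print(W @ one_hot)   # [0.7 0.1] -- same as selecting column i
print(W[:, i])       # equivalent column lookup, as done in practice
```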

9 Neural Network of $f(v_1, \ldots, v_n)$ in [Bengio 2003]
Input words $w_1, \ldots, w_n$ are mapped by $W$ to m-dimensional vectors $v_1, \ldots, v_n$, passed through an h-dimensional hidden layer $H$ with tanh, and projected by $U$ to a V-dimensional softmax output:
$v_i = W w_i$
$\vec{h} = H \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}$
$y = U \tanh(\vec{h})$
$f = \mathrm{softmax}(y)$
where $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$ and $\mathrm{softmax}(x)_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$
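A minimal numpy sketch of this forward pass (random toy weights and sizes; it follows the equations above and is not the authors' actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, m, h = 20, 3, 5, 8                  # toy vocabulary and layer sizes

W = rng.normal(size=(m, V))               # word-vector matrix (one column per word)
H = rng.normal(size=(h, n * m))           # hidden layer weights
U = rng.normal(size=(V, h))               # output layer weights

def softmax(y):
    e = np.exp(y - y.max())               # subtract max for numerical stability
    return e / e.sum()

def forward(context_ids):
    """P(w | w_1, ..., w_n) for every word w, given n context word ids."""
    v = np.concatenate([W[:, i] for i in context_ids])   # v_i = W w_i, stacked
    hidden = H @ v                                        # h = H [v_1; ...; v_n]
    y = U @ np.tanh(hidden)                               # y = U tanh(h)
    return softmax(y)                                     # f = softmax(y)

probs = forward([3, 7, 11])
print(probs.shape, probs.sum())           # (20,) and probabilities sum to ~1.0
```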

10 Learning $f$ (and $v$) [Bengio 2003]
Use the back-propagation algorithm to maximize the log-likelihood of the training data
For every training instance, the entire matrices U (all $h \times V$ entries) and H (all $n \cdot m \cdot h$ entries) are updated, but only $n \cdot m$ entries of W (the vectors of the n context words)
Q: V = 20,000, n = 10, m = 50, h = 100. How many parameter updates in each layer? (Worked out below.)
Updating U is computationally the most expensive
A parallelized U-update algorithm is used for fast learning
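A quick worked answer to the question above (just the arithmetic):

```python
# Parameter updates per training instance for V=20,000, n=10, m=50, h=100.
V, n, m, h = 20_000, 10, 50, 100

updates_U = h * V        # output layer: every entry of U         -> 2,000,000
updates_H = n * m * h    # hidden layer: every entry of H         ->    50,000
updates_W = n * m        # word vectors: only the n context words ->       500

print(updates_U, updates_H, updates_W)   # 2000000 50000 500
```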

11 Result of [Bengio 2003]
Obtained state-of-the-art results, in terms of perplexity, on a 15-million-word input corpus
10-20% better than a smoothed trigram model
Generated a lot of excitement: another significant improvement coming from a "neural network"
Much follow-up work was done after this paper

