1
A Neural Probabilistic Language Model 2014-12-16 Keren Ye
2
CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)
3
n-gram models – Construct tables of conditional probabilities for the next word, given combinations of the last n-1 words
4
n-gram models – e.g. “I like playing basketball” – Unigram (1-gram) – Bigram (2-gram) – Trigram (3-gram)
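To make the conditional-probability tables concrete, here is a minimal Python sketch that estimates bigram probabilities by counting adjacent word pairs; the tiny corpus and the printed values are illustrative, not taken from the slides.

```python
from collections import defaultdict

# Toy corpus (illustrative): estimate P(w_t | w_{t-1}) by counting word pairs.
corpus = "i like playing basketball i like playing chess".split()

bigram_counts = defaultdict(lambda: defaultdict(int))
context_counts = defaultdict(int)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1
    context_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate P(curr | prev) = count(prev, curr) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][curr] / context_counts[prev]

print(bigram_prob("like", "playing"))        # 1.0 -- "like" is always followed by "playing"
print(bigram_prob("playing", "basketball"))  # 0.5 -- "playing" is followed by "basketball" or "chess"
```

A trigram model conditions on the last two words instead, which is exactly where the table size (and the curse of dimensionality discussed below) starts to bite.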
5
n-gram models – Disadvantages – They do not take into account contexts farther than 1 or 2 words – They do not take into account the similarity between words, e.g. “The cat is walking in the bedroom” (seen in the training corpus) vs. “A dog was running in a room” (?)
6
n-gram models Disadvantages – Curse of Dimensionality
7
CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)
8
Fighting the Curse of Dimensionality – Associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in $\mathbb{R}^m$) – Express the joint probability function of word sequences in terms of the feature vectors of the words in the sequence – Learn simultaneously the word feature vectors and the parameters of that probability function
9
Fighting the Curse of Dimensionality – Word feature vectors – Each word is associated with a point in a vector space – The number of features (e.g. m = 30, 60 or 100 in the experiments) is much smaller than the size of the vocabulary (e.g. 200,000)
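As a rough illustration of the scale involved, here is a sketch of the word feature matrix C as a lookup table; the vocabulary size, feature dimension and word-to-index mapping below are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: |V| = 200,000 words, m = 60 features per word, so C is |V| x m.
rng = np.random.default_rng(0)
vocab_size, m = 200_000, 60
C = rng.normal(scale=0.1, size=(vocab_size, m))   # one m-dimensional point per word

word_to_index = {"cat": 17, "dog": 42}            # hypothetical word -> row mapping
cat_vector = C[word_to_index["cat"]]              # feature vector for "cat"
print(cat_vector.shape)                           # (60,) -- 60 numbers instead of a 200,000-way one-hot
```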
10
Fighting the Curse of Dimensionality – Probability function – A multi-layer neural network predicts the next word given the previous ones (this is the form used in the experiments) – This function has parameters that can be iteratively tuned to maximize the log-likelihood of the training data
11
Fighting the Curse of Dimensionality – Why does it work? – If we knew that “dog” and “cat” play similar roles (semantically and syntactically), and similarly for (the, a), (bedroom, room), (is, was), (running, walking), we could naturally generalize from “The cat is walking in the bedroom” to “A dog was running in a room”, and likewise to “The cat is running in a room”, “A dog is walking in a bedroom”, …
12
Fighting the Curse of Dimensionality NNLM – Neural Network Language Model
13
CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)
14
A Neural Probabilistic Language Model – Notation
– The training set is a sequence $w_1, \dots, w_T$ of words $w_t \in V$, where the vocabulary $V$ is a large but finite set
– The objective is to learn a good model $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$, in the sense that it gives high out-of-sample likelihood
– The only constraint on the model is that, for any choice of $w_1^{t-1}$, $\sum_{i \in V} f(i, w_{t-1}, \dots, w_{t-n+1}) = 1$, with $f \geq 0$
15
A Neural Probabilistic Language Model – Objective function – Training is achieved by looking for the parameters $\theta$ that maximize the penalized log-likelihood of the training corpus, $L = \frac{1}{T} \sum_t \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta)$, where $R(\theta)$ is a regularization term
16
A Neural Probabilistic Language Model – Model
– We decompose the function $f(i, w_{t-1}, \dots, w_{t-n+1}) = g(i, C(w_{t-1}), \dots, C(w_{t-n+1}))$ in two parts:
– A mapping $C$ from any element $i$ of $V$ to a real vector $C(i) \in \mathbb{R}^m$; it represents the distributed feature vectors associated with each word in the vocabulary
– The probability function over words, expressed with $C$: a function $g$ maps an input sequence of feature vectors for words in context, $(C(w_{t-n+1}), \dots, C(w_{t-1}))$, to a conditional probability distribution over words in $V$ for the next word $w_t$; the output of $g$ is a vector whose $i$-th element estimates the probability $\hat{P}(w_t = i \mid w_1^{t-1})$
17
A Neural Probabilistic Language Model
18
Model details (two hidden layers) – The shared word features layer C, which has no non-linearity (it would not add anything useful) – The ordinary hyperbolic tangent hidden layer
19
A Neural Probabilistic Language Model – Model details (formal description) – The neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1: $\hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \dfrac{e^{y_{w_t}}}{\sum_i e^{y_i}}$
20
A Neural Probabilistic Language Model – Model details (formal description)
– The $y_i$ are the unnormalized log-probabilities for each output word $i$, computed as follows, with parameters $b$, $W$, $U$, $d$ and $H$: $y = b + Wx + U \tanh(d + Hx)$
– The hyperbolic tangent $\tanh$ is applied element by element, and $W$ is optionally zero (no direct connections)
– $x$ is the word features layer activation vector, the concatenation of the input word feature vectors from the matrix $C$: $x = \bigl(C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1})\bigr)$
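Putting the softmax and the score equation together, here is a minimal NumPy sketch of the forward pass; the layer sizes and context word indices are illustrative assumptions, and W is set to zero (no direct connections).

```python
import numpy as np

# Illustrative sizes: vocabulary V, feature dim m, hidden units h, n-gram order n.
rng = np.random.default_rng(0)
V, m, h, n = 1000, 30, 50, 5
C = rng.normal(scale=0.1, size=(V, m))              # word features, |V| x m
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))    # hidden-layer weights
d = np.zeros(h)                                     # hidden-layer biases
U = rng.normal(scale=0.1, size=(V, h))              # hidden-to-output weights
b = np.zeros(V)                                     # output biases
W = np.zeros((V, (n - 1) * m))                      # direct connections (zero = none)

def next_word_distribution(context):
    """P(w_t = i | context) for every i in V; context = indices of the n-1 previous words."""
    x = np.concatenate([C[j] for j in context])     # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    y = b + W @ x + U @ np.tanh(d + H @ x)          # unnormalized log-probabilities
    e = np.exp(y - y.max())                         # softmax, shifted for numerical stability
    return e / e.sum()

p = next_word_distribution([3, 17, 42, 7])          # hypothetical indices of the 4 previous words
print(p.shape, p.sum())                             # (1000,) ~1.0
```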
21
A Neural Probabilistic Language Model
22
Parameter – Brief – Dimensions
b – output biases – $|V|$
d – hidden-layer biases – $h$
W – word features to output weights (direct connections; set to zero here, i.e. no direct connections) – $0$ (otherwise a $|V| \times (n-1)m$ matrix)
U – hidden-to-output weights – $|V| \times h$ matrix
H – hidden-layer weights (word features to hidden units) – $h \times (n-1)m$ matrix
C – word features – $|V| \times m$ matrix
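Summing the dimensions in the table (and counting the direct-connection matrix W at its full $|V| \times (n-1)m$ size when it is used) gives the total number of free parameters:

$$|V| + h + |V|(n-1)m + |V|h + h(n-1)m + |V|m \;=\; |V|\bigl(1 + nm + h\bigr) + h\bigl(1 + (n-1)m\bigr),$$

so the parameter count grows only linearly with the vocabulary size $|V|$ and with the order $n$, the dominant factor being $|V|(nm + h)$.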
23
A Neural Probabilistic Language Model
24
Stochastic gradient ascent – After presenting the $t$-th word of the training corpus, perform the update $\theta \leftarrow \theta + \varepsilon \, \frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1})}{\partial \theta}$, where $\varepsilon$ is the learning rate – Note that a large fraction of the parameters need not be updated or visited after each example: the word features $C(j)$ of all words $j$ that do not occur in the input window
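Below is a self-contained toy sketch of one such update, written out to make the sparsity visible: besides the dense output-layer and hidden-layer updates, the only rows of C that change are those of the n-1 words in the input window. The layer sizes, word indices and learning rate are illustrative assumptions, and the direct connections W are omitted.

```python
import numpy as np

# Illustrative sizes and learning rate.
rng = np.random.default_rng(0)
V, m, h, n, epsilon = 1000, 30, 50, 5, 0.01
C = rng.normal(scale=0.1, size=(V, m))            # word features
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))  # hidden-layer weights
d = np.zeros(h)                                   # hidden-layer biases
U = rng.normal(scale=0.1, size=(V, h))            # hidden-to-output weights
b = np.zeros(V)                                   # output biases

def sgd_step(context, target):
    """One step of theta <- theta + epsilon * d log P(target | context) / d theta."""
    # Forward pass (no direct connections, i.e. W = 0).
    x = np.concatenate([C[j] for j in context])
    a = np.tanh(d + H @ x)
    y = b + U @ a
    p = np.exp(y - y.max())
    p /= p.sum()
    # Backward pass: gradients of log p[target].
    dy = -p
    dy[target] += 1.0
    da = (U.T @ dy) * (1.0 - a ** 2)
    dx = H.T @ da
    # Gradient ascent updates.
    b[:] += epsilon * dy
    U[:] += epsilon * np.outer(dy, a)
    d[:] += epsilon * da
    H[:] += epsilon * np.outer(da, x)
    for k, j in enumerate(context):                # sparse part: only C(j) for j in the window
        C[j] += epsilon * dx[k * m:(k + 1) * m]    # every other row of C is left untouched

sgd_step(context=[3, 17, 42, 7], target=5)         # hypothetical context and next-word indices
```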
25
A Neural Probabilistic Language Model – Parallel implementation
– Data-parallel processing: relying on synchronization commands was slow; with no locks, the noise introduced appears to be very small and did not apparently slow down training
– Parameter-parallel processing: parallelize across the parameters, in particular those of the output units
26
A Neural Probabilistic Language Model
28
Continuous Bag of Words (Word2vec) – Bag of words: the traditional solution to the curse-of-dimensionality problem
29
CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)
30
Continuous Bag of Words
31
Continuous Bag of Words (Word2vec) – Differences from the NNLM – Projection layer: context vectors are summed rather than concatenated, so the order of the words is lost – Hidden layer: the tanh hidden layer is removed – Output layer: hierarchical softmax
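A minimal sketch of the resulting CBOW forward pass: the context vectors are summed (so word order is lost) and fed, with no tanh hidden layer, straight to the output layer. A full softmax is used here for brevity where word2vec uses hierarchical softmax; sizes and indices are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: vocabulary V, vector dimension m.
rng = np.random.default_rng(0)
V, m = 1000, 100
C_in = rng.normal(scale=0.1, size=(V, m))     # input (projection) vectors
C_out = rng.normal(scale=0.1, size=(V, m))    # output vectors

def cbow_distribution(context):
    """P(center word = i | context) for every i in V."""
    hidden = C_in[context].sum(axis=0)        # sum of context vectors: order does not matter
    y = C_out @ hidden                        # one score per word, no tanh hidden layer
    e = np.exp(y - y.max())                   # plain softmax stands in for hierarchical softmax
    return e / e.sum()

p = cbow_distribution([3, 17, 42, 7])         # hypothetical context word indices
print(p.shape, p.sum())                       # (1000,) ~1.0
```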
32
Thanks Q&A