1
A Neural Probabilistic Language Model 2014-12-16 Keren Ye
2
CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)
3
n-gram models – Construct tables of conditional probabilities for the next word, given combinations of the last n-1 words
4
n-gram models – e.g. “I like playing basketball” – Unigram (1-gram) – Bigram (2-gram) – Trigram (3-gram)
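To make the conditional-probability tables concrete, here is a minimal Python sketch that estimates bigram probabilities by counting adjacent word pairs; the tiny corpus and the printed values are illustrative, not taken from the slides.

```python
from collections import defaultdict

# Toy corpus (illustrative): estimate P(w_t | w_{t-1}) by counting word pairs.
corpus = "i like playing basketball i like playing chess".split()

bigram_counts = defaultdict(lambda: defaultdict(int))
context_counts = defaultdict(int)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1
    context_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate P(curr | prev) = count(prev, curr) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][curr] / context_counts[prev]

print(bigram_prob("like", "playing"))        # 1.0 -- "like" is always followed by "playing"
print(bigram_prob("playing", "basketball"))  # 0.5 -- "playing" is followed by "basketball" or "chess"
```

A trigram model conditions on the last two words instead, which is exactly where the table size (and the curse of dimensionality discussed below) starts to bite.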
5
n-gram models – Disadvantages – They do not take into account contexts farther than 1 or 2 words – They do not take into account the similarity between words, e.g. “The cat is walking in the bedroom” (seen in the training corpus) vs. “A dog was running in a room” (?)
6
n-gram models Disadvantages – Curse of Dimensionality
7
CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)
8
Fighting the Curse of Dimensionality – Associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in $\mathbb{R}^m$) – Express the joint probability function of word sequences in terms of the feature vectors of the words in the sequence – Learn simultaneously the word feature vectors and the parameters of that probability function
9
Fighting the Curse of Dimensionality – Word feature vectors – Each word is associated with a point in a vector space – The number of features (e.g. m = 30, 60 or 100 in the experiments) is much smaller than the size of the vocabulary (e.g. 200,000)
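As a rough illustration of the scale involved, here is a sketch of the word feature matrix C as a lookup table; the vocabulary size, feature dimension and word-to-index mapping below are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: |V| = 200,000 words, m = 60 features per word, so C is |V| x m.
rng = np.random.default_rng(0)
vocab_size, m = 200_000, 60
C = rng.normal(scale=0.1, size=(vocab_size, m))   # one m-dimensional point per word

word_to_index = {"cat": 17, "dog": 42}            # hypothetical word -> row mapping
cat_vector = C[word_to_index["cat"]]              # feature vector for "cat"
print(cat_vector.shape)                           # (60,) -- 60 numbers instead of a 200,000-way one-hot
```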
10
Fighting the Curse of Dimensionality – Probability function – A multi-layer neural network predicts the next word given the previous ones (this is the form used in the experiments) – This function has parameters that can be iteratively tuned to maximize the log-likelihood of the training data
11
Fighting the Curse of Dimensionality – Why does it work? – If we knew that “dog” and “cat” play similar roles (semantically and syntactically), and similarly for (the, a), (bedroom, room), (is, was), (running, walking), we could naturally generalize from “The cat is walking in the bedroom” to “A dog was running in a room”, and likewise to “The cat is running in a room”, “A dog is walking in a bedroom”, …
12
Fighting the Curse of Dimensionality NNLM – Neural Network Language Model
13
CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)
14
A Neural Probabilistic Language Model – Notation
– The training set is a sequence $w_1, \dots, w_T$ of words $w_t \in V$, where the vocabulary $V$ is a large but finite set
– The objective is to learn a good model $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$, in the sense that it gives high out-of-sample likelihood
– The only constraint on the model is that, for any choice of $w_1^{t-1}$, $\sum_{i \in V} f(i, w_{t-1}, \dots, w_{t-n+1}) = 1$, with $f \geq 0$
15
A Neural Probabilistic Language Model – Objective function – Training is achieved by looking for the parameters $\theta$ that maximize the penalized log-likelihood of the training corpus, $L = \frac{1}{T} \sum_t \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta)$, where $R(\theta)$ is a regularization term
16
A Neural Probabilistic Language Model – Model
– We decompose the function $f(i, w_{t-1}, \dots, w_{t-n+1}) = g(i, C(w_{t-1}), \dots, C(w_{t-n+1}))$ in two parts:
– A mapping $C$ from any element $i$ of $V$ to a real vector $C(i) \in \mathbb{R}^m$; it represents the distributed feature vectors associated with each word in the vocabulary
– The probability function over words, expressed with $C$: a function $g$ maps an input sequence of feature vectors for words in context, $(C(w_{t-n+1}), \dots, C(w_{t-1}))$, to a conditional probability distribution over words in $V$ for the next word $w_t$; the output of $g$ is a vector whose $i$-th element estimates the probability $\hat{P}(w_t = i \mid w_1^{t-1})$
17
A Neural Probabilistic Language Model
18
Model details (two hidden layers) – The shared word features layer C, which has no non-linearity (it would not add anything useful) – The ordinary hyperbolic tangent hidden layer
19
A Neural Probabilistic Language Model – Model details (formal description) – The neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1: $\hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \dfrac{e^{y_{w_t}}}{\sum_i e^{y_i}}$
20
A Neural Probabilistic Language Model – Model details (formal description)
– The $y_i$ are the unnormalized log-probabilities for each output word $i$, computed as follows, with parameters $b$, $W$, $U$, $d$ and $H$: $y = b + Wx + U \tanh(d + Hx)$
– The hyperbolic tangent $\tanh$ is applied element by element, and $W$ is optionally zero (no direct connections)
– $x$ is the word features layer activation vector, the concatenation of the input word feature vectors from the matrix $C$: $x = \bigl(C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1})\bigr)$
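Putting the softmax and the score equation together, here is a minimal NumPy sketch of the forward pass; the layer sizes and context word indices are illustrative assumptions, and W is set to zero (no direct connections).

```python
import numpy as np

# Illustrative sizes: vocabulary V, feature dim m, hidden units h, n-gram order n.
rng = np.random.default_rng(0)
V, m, h, n = 1000, 30, 50, 5
C = rng.normal(scale=0.1, size=(V, m))              # word features, |V| x m
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))    # hidden-layer weights
d = np.zeros(h)                                     # hidden-layer biases
U = rng.normal(scale=0.1, size=(V, h))              # hidden-to-output weights
b = np.zeros(V)                                     # output biases
W = np.zeros((V, (n - 1) * m))                      # direct connections (zero = none)

def next_word_distribution(context):
    """P(w_t = i | context) for every i in V; context = indices of the n-1 previous words."""
    x = np.concatenate([C[j] for j in context])     # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    y = b + W @ x + U @ np.tanh(d + H @ x)          # unnormalized log-probabilities
    e = np.exp(y - y.max())                         # softmax, shifted for numerical stability
    return e / e.sum()

p = next_word_distribution([3, 17, 42, 7])          # hypothetical indices of the 4 previous words
print(p.shape, p.sum())                             # (1000,) ~1.0
```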
21
A Neural Probabilistic Language Model
22
Parameter – Brief – Dimensions
b – output biases – $|V|$
d – hidden-layer biases – $h$
W – word features to output weights (direct connections; set to zero here, i.e. no direct connections) – $0$ (otherwise a $|V| \times (n-1)m$ matrix)
U – hidden-to-output weights – $|V| \times h$ matrix
H – hidden-layer weights (word features to hidden units) – $h \times (n-1)m$ matrix
C – word features – $|V| \times m$ matrix
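Summing the dimensions in the table (and counting the direct-connection matrix W at its full $|V| \times (n-1)m$ size when it is used) gives the total number of free parameters:

$$|V| + h + |V|(n-1)m + |V|h + h(n-1)m + |V|m \;=\; |V|\bigl(1 + nm + h\bigr) + h\bigl(1 + (n-1)m\bigr),$$

so the parameter count grows only linearly with the vocabulary size $|V|$ and with the order $n$, the dominant factor being $|V|(nm + h)$.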
23
A Neural Probabilistic Language Model
24
Stochastic gradient ascent – After presenting the $t$-th word of the training corpus, perform the update $\theta \leftarrow \theta + \varepsilon \, \frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1})}{\partial \theta}$, where $\varepsilon$ is the learning rate – Note that a large fraction of the parameters need not be updated or visited after each example: the word features $C(j)$ of all words $j$ that do not occur in the input window
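Below is a self-contained toy sketch of one such update, written out to make the sparsity visible: besides the dense output-layer and hidden-layer updates, the only rows of C that change are those of the n-1 words in the input window. The layer sizes, word indices and learning rate are illustrative assumptions, and the direct connections W are omitted.

```python
import numpy as np

# Illustrative sizes and learning rate.
rng = np.random.default_rng(0)
V, m, h, n, epsilon = 1000, 30, 50, 5, 0.01
C = rng.normal(scale=0.1, size=(V, m))            # word features
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))  # hidden-layer weights
d = np.zeros(h)                                   # hidden-layer biases
U = rng.normal(scale=0.1, size=(V, h))            # hidden-to-output weights
b = np.zeros(V)                                   # output biases

def sgd_step(context, target):
    """One step of theta <- theta + epsilon * d log P(target | context) / d theta."""
    # Forward pass (no direct connections, i.e. W = 0).
    x = np.concatenate([C[j] for j in context])
    a = np.tanh(d + H @ x)
    y = b + U @ a
    p = np.exp(y - y.max())
    p /= p.sum()
    # Backward pass: gradients of log p[target].
    dy = -p
    dy[target] += 1.0
    da = (U.T @ dy) * (1.0 - a ** 2)
    dx = H.T @ da
    # Gradient ascent updates.
    b[:] += epsilon * dy
    U[:] += epsilon * np.outer(dy, a)
    d[:] += epsilon * da
    H[:] += epsilon * np.outer(da, x)
    for k, j in enumerate(context):                # sparse part: only C(j) for j in the window
        C[j] += epsilon * dx[k * m:(k + 1) * m]    # every other row of C is left untouched

sgd_step(context=[3, 17, 42, 7], target=5)         # hypothetical context and next-word indices
```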
25
A Neural Probabilistic Language Model – Parallel implementation
– Data-parallel processing: relying on synchronization commands was slow; with no locks, the noise introduced appears to be very small and did not apparently slow down training
– Parameter-parallel processing: parallelize across the parameters, in particular those of the output units
26
A Neural Probabilistic Language Model
28
Continuous Bag of Words (Word2vec) – Bag of words: the traditional solution to the curse-of-dimensionality problem
29
CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)
30
Continuous Bag of Words
31
Continuous Bag of Words (Word2vec) – Differences from the NNLM – Projection layer: context vectors are summed rather than concatenated, so the order of the words is lost – Hidden layer: the tanh hidden layer is removed – Output layer: hierarchical softmax
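A minimal sketch of the resulting CBOW forward pass: the context vectors are summed (so word order is lost) and fed, with no tanh hidden layer, straight to the output layer. A full softmax is used here for brevity where word2vec uses hierarchical softmax; sizes and indices are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: vocabulary V, vector dimension m.
rng = np.random.default_rng(0)
V, m = 1000, 100
C_in = rng.normal(scale=0.1, size=(V, m))     # input (projection) vectors
C_out = rng.normal(scale=0.1, size=(V, m))    # output vectors

def cbow_distribution(context):
    """P(center word = i | context) for every i in V."""
    hidden = C_in[context].sum(axis=0)        # sum of context vectors: order does not matter
    y = C_out @ hidden                        # one score per word, no tanh hidden layer
    e = np.exp(y - y.max())                   # plain softmax stands in for hierarchical softmax
    return e / e.sum()

p = cbow_distribution([3, 17, 42, 7])         # hypothetical context word indices
print(p.shape, p.sum())                       # (1000,) ~1.0
```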
32
Thanks Q&A