A Neural Probabilistic Language Model 2014-12-16 Keren Ye.


A Neural Probabilistic Language Model Keren Ye

CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)

n-gram models – Construct tables of conditional probabilities for the next word, given combinations of the last n-1 words

n-gram models – e.g. “I like playing basketball” – Unigram (1-gram) – Bigram (2-gram) – Trigram (3-gram)
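To make the counting concrete, here is a minimal sketch (not from the slides; the toy corpus and the trigram_prob helper are invented for illustration) that estimates trigram probabilities by maximum likelihood from raw counts:

    from collections import Counter

    corpus = "i like playing basketball . i like watching basketball .".split()

    # Count trigrams and their (n-1)-word histories.
    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigrams = Counter(zip(corpus, corpus[1:]))

    def trigram_prob(w1, w2, w3):
        """MLE estimate of P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    print(trigram_prob("i", "like", "playing"))  # 0.5 on this toy corpus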

n-gram models Disadvantages – It does not take into account contexts farther than 1 or 2 words back – It does not take into account the similarity between words, e.g. “The cat is walking in the bedroom” (training corpus) vs. “A dog was running in a room” (?)

n-gram models Disadvantages – Curse of Dimensionality
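To put numbers on the dimensionality problem (the sizes below are chosen for illustration, not taken from the slides): with a 100,000-word vocabulary, a trigram model must in principle assign probabilities over |V|^3 = 10^15 possible word triples, so almost all of them are never seen in training.

    V, n = 100_000, 3
    print(V ** (n - 1))  # 10_000_000_000 distinct (n-1)-word histories
    print(V ** n)        # 1_000_000_000_000_000 possible trigrams to model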

CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)

Fighting the Curse of Dimensionality Associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in ℝ^m) Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence Learn simultaneously the word feature vectors and the parameters of that probability function

Fighting the Curse of Dimensionality Word feature vectors – Each word is associated with a point in a vector space – The number of features (e.g. m = 30, 60 or 100 in the experiments) is much smaller than the size of the vocabulary (e.g. 200,000)
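A minimal sketch of such a lookup table, using sizes in the spirit of the slide (m = 30 features, a 200,000-word vocabulary); the numpy code is illustrative, not from the paper:

    import numpy as np

    vocab_size, m = 200_000, 30                # |V| words, m features per word
    C = 0.01 * np.random.randn(vocab_size, m)  # the |V| x m feature matrix C

    word_index = 42                            # hypothetical index of some word in V
    feature_vector = C[word_index]             # its m-dimensional representation
    print(feature_vector.shape)                # (30,)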

Fighting the Curse of Dimensionality Probability function – A multi-layer neural network is used to predict the next word given the previous ones (this is the form used in the experiments) – This function has parameters that can be iteratively tuned in order to maximize the log-likelihood of the training data

Fighting the Curse of Dimensionality Why does it work? – If we knew that “dog” and “cat” played similar roles (semantically and syntactically), and similarly for (the, a), (bedroom, room), (is, was), (running, walking), we could naturally generalize from “The cat is walking in the bedroom” to “A dog was running in a room”, and likewise to “The cat is running in a room”, “A dog is walking in a bedroom”, …

Fighting the Curse of Dimensionality NNLM – Neural Network Language Model

CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)

A Neural Probabilistic Language Model Denotations – The training set is a sequence w_1 … w_T of words w_t ∈ V, where the vocabulary V is a large but finite set – The objective is to learn a good model f(w_t, …, w_{t-n+1}) = P̂(w_t | w_1^{t-1}), in the sense that it gives high out-of-sample likelihood – The only constraint on the model is that, for any choice of w_1^{t-1}, the sum Σ_i f(i, w_{t-1}, …, w_{t-n+1}) over all words i in V equals 1 (with f ≥ 0)

A Neural Probabilistic Language Model Objective function – Training is achieved by looking for parameters θ that maximize the training corpus penalized log-likelihood L = (1/T) Σ_t log f(w_t, w_{t-1}, …, w_{t-n+1}; θ) + R(θ), where R(θ) is a regularization term

A Neural Probabilistic Language Model Model – We decompose the function f in two parts: A mapping C from any element i of V to a real vector C(i) ∈ ℝ^m. It represents the distributed feature vectors associated with each word in the vocabulary. The probability function over words, expressed with C: a function g maps an input sequence of feature vectors for words in context, (C(w_{t-n+1}), …, C(w_{t-1})), to a conditional probability distribution over words in V for the next word w_t. The output of g is a vector whose i-th element estimates the probability P̂(w_t = i | w_1^{t-1})

A Neural Probabilistic Language Model

Model details (two hidden layers) – The shared word features layer C, which has no non-linearity (it would not add anything useful) – The ordinary hyperbolic tangent hidden layer

A Neural Probabilistic Language Model Model details (formal description) – The neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1: P̂(w_t | w_{t-1}, …, w_{t-n+1}) = e^{y_{w_t}} / Σ_i e^{y_i}

A Neural Probabilistic Language Model Model details (formal description) – The y_i are the unnormalized log-probabilities for each output word i, computed as follows, with parameters b, W, U, d and H: y = b + W x + U tanh(d + H x), where the hyperbolic tangent tanh is applied element by element, W is optionally zero (no direct connections), and x is the word features layer activation vector, which is the concatenation of the input word features from the matrix C: x = (C(w_{t-1}), C(w_{t-2}), …, C(w_{t-n+1}))
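A minimal numpy sketch of this forward pass, y = b + W x + U tanh(d + H x) followed by a softmax; the toy sizes and random initialization are placeholders, not values from the paper:

    import numpy as np

    V, m, h, n = 1000, 30, 50, 4                 # toy |V|, features, hidden units, order

    C = 0.01 * np.random.randn(V, m)             # word features, |V| x m
    H = 0.01 * np.random.randn(h, (n - 1) * m)   # word features to hidden layer weights
    d = np.zeros(h)                              # hidden layer biases
    U = 0.01 * np.random.randn(V, h)             # hidden-to-output weights
    W = np.zeros((V, (n - 1) * m))               # direct connections (zero = none)
    b = np.zeros(V)                              # output biases

    def predict_next(context_ids):
        """P(w_t | w_{t-1}, ..., w_{t-n+1}) for n-1 context word indices."""
        x = np.concatenate([C[i] for i in context_ids])  # (n-1)m activation vector
        y = b + W @ x + U @ np.tanh(d + H @ x)           # unnormalized log-probabilities
        e = np.exp(y - y.max())                          # numerically stable softmax
        return e / e.sum()

    p = predict_next([3, 17, 29])                # an arbitrary 3-word context
    print(p.shape, round(p.sum(), 6))            # (1000,) 1.0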

A Neural Probabilistic Language Model

Parameter – Brief – Dimensions
b – Output biases – |V|
d – Hidden layer biases – h
W – Word features to output weights (direct connections; optionally zero) – |V| × (n-1)m matrix
U – Hidden-to-output weights – |V| × h matrix
H – Word features to hidden layer weights – h × (n-1)m matrix
C – Word features – |V| × m matrix
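Counting the entries above (and including the optional direct-connection matrix W), the number of free parameters is |V|(1 + nm + h) + h(1 + (n-1)m), i.e. it grows only linearly in |V| and in n. A quick check with illustrative sizes (chosen here, not quoted from the slides):

    V, m, h, n = 20_000, 60, 100, 6
    params = V * (1 + n * m + h) + h * (1 + (n - 1) * m)
    print(params)  # 9_250_100 -- roughly 9.25 million free parameters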

A Neural Probabilistic Language Model

Stochastic gradient ascent – After presenting the t-th word of the corpus, perform the update θ ← θ + ε ∂ log P̂(w_t | w_{t-1}, …, w_{t-n+1}) / ∂θ, where ε is the learning rate – Note that a large fraction of the parameters need not be updated or visited after each example: the word features C(j) of all words j that do not occur in the input window
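A sketch of the sparse part of that update; the sizes, learning rate and the grad_x placeholder (which would come from backpropagation through the network above) are assumptions for illustration:

    import numpy as np

    V, m, n = 1000, 30, 4                    # toy sizes, as in the earlier sketch
    C = 0.01 * np.random.randn(V, m)         # word feature matrix
    epsilon = 0.001                          # learning rate (placeholder value)

    context_ids = [3, 17, 29]                # the n-1 words in the current input window
    grad_x = np.zeros((n - 1, m))            # placeholder: gradient of the log-likelihood
                                             # w.r.t. x, split into one piece per context word

    for k, j in enumerate(context_ids):
        C[j] += epsilon * grad_x[k]          # ascent step touches only these n-1 rows of C
    # C(j) for every word j outside the window is neither updated nor visited.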

A Neural Probabilistic Language Model Parallel Implementation – Data-parallel processing: relying on synchronization commands was slow; running without locks worked better (the resulting noise seems to be very small and did not apparently slow down training) – Parameter-parallel processing: parallelize across the parameters

A Neural Probabilistic Language Model

Continuous Bag of Words (Word2vec) Bag of words – The traditional approach to the problem of the Curse of Dimensionality
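For contrast, a plain bag-of-words representation just counts word occurrences and discards order; a minimal sketch with an invented vocabulary and sentence:

    from collections import Counter

    vocab = ["the", "cat", "dog", "is", "was", "walking", "running", "in", "a", "bedroom", "room"]
    sentence = "the cat is walking in the bedroom".split()

    counts = Counter(sentence)
    bow = [counts[w] for w in vocab]   # one dimension per vocabulary word; order is lost
    print(bow)                         # [2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]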

CONTENTS – N-gram Models – Fighting the Curse of Dimensionality – A Neural Probabilistic Language Model – Continuous Bag of Words (Word2vec)

Continuous Bag of Words

Continuous Bag of Words (Word2vec) Differences from the NNLM – Projection layer: context vectors are summed rather than concatenated (the order of the words is lost) – Hidden layer: none (no tanh) – Output layer: hierarchical softmax
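A minimal numpy sketch of the CBOW prediction step under these simplifications: the projection layer combines the context vectors by summing/averaging instead of concatenating, there is no tanh hidden layer, and a plain softmax stands in here for the hierarchical softmax. Sizes and initialization are placeholders:

    import numpy as np

    V, m = 1000, 100                         # toy vocabulary and embedding sizes
    W_in = 0.01 * np.random.randn(V, m)      # input (context) word vectors
    W_out = 0.01 * np.random.randn(V, m)     # output word vectors

    def cbow_predict(context_ids):
        """P(center word | context), with the context collapsed into one vector."""
        h = W_in[context_ids].mean(axis=0)   # average of context vectors (a sum differs only by a constant factor)
        scores = W_out @ h                   # no non-linear hidden layer
        e = np.exp(scores - scores.max())
        return e / e.sum()                   # plain softmax in place of hierarchical softmax

    print(round(cbow_predict([5, 9, 512, 77]).sum(), 6))  # 1.0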

Thanks Q&A