Efficient Estimation of Word Representations in Vector Space. By Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Google Inc., Mountain View, CA. Published 2013.

Presentation transcript:

Efficient Estimation of Word Representations in Vector Space. By Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Google Inc., Mountain View, CA. Published in September 2013. (CS:6501 paper presentation)

Goal: To introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. None of the previously proposed architectures had been successfully trained on more than a few hundred million words, with word-vector dimensionality between 50 and 100.

Introduction: One-hot encoding. For a vocabulary of size D = 10, the one-hot vector of word ID w = 4 is e(w) = [0 0 0 0 1 0 0 0 0 0] (a 1 at index w, counting from 0, and zeros elsewhere). This encoding makes no assumption about word similarity:
||e(w) - e(w')||^2 = 0 if w = w'
||e(w) - e(w')||^2 = 2 if w != w'
Such simple representations have several points in their favour: simplicity, robustness, and the ability to be trained on huge amounts of data. However, performance is highly dependent on the size and quality of the data. In recent years it has become possible to train more complex models on much larger data sets, and distributed representations of words learned by neural network based language models have been found to significantly outperform N-gram models.
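A minimal sketch (not from the paper or the slides) of one-hot encoding and why it carries no notion of word similarity; the vocabulary size and word IDs are illustrative assumptions:

import numpy as np

D = 10  # vocabulary size

def one_hot(w, D):
    # Return the one-hot vector e(w) for word ID w (0-indexed position)
    e = np.zeros(D)
    e[w] = 1.0
    return e

e4, e7 = one_hot(4, D), one_hot(7, D)
print(np.sum((e4 - e4) ** 2))  # 0.0 -- identical words
print(np.sum((e4 - e7) ** 2))  # 2.0 -- any two distinct words, however related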

Problems with one-hot encoding: The dimensionality of e(w) is the size of the vocabulary, and a typical vocabulary size is ~100,000. A window of 10 words would therefore correspond to an input vector of at least 1,000,000 dimensions. This has two consequences: 1) vulnerability to overfitting, and 2) high computational cost.

Continuous word representation: The idea is to replace the one-hot encoding, which does not take word similarity into account, with a learned vector representation. These vectors were found to capture syntactic and semantic similarities between words.
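As an illustration of the idea (made-up toy numbers, not trained vectors): with learned dense vectors, related words can end up close together in the vector space, something one-hot vectors can never express.

import numpy as np

# Made-up 3-dimensional vectors purely for illustration
vec = {
    "king":  np.array([0.80, 0.30, 0.10]),
    "queen": np.array([0.75, 0.35, 0.20]),
    "apple": np.array([0.05, 0.90, 0.70]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec["king"], vec["queen"]))  # high similarity for related words
print(cosine(vec["king"], vec["apple"]))  # lower similarity for unrelated words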

Continuous word representation (figure from Blitzer et al.)

NNLM: Model architecture for the neural network language model (NNLM), where a feedforward neural network with a linear projection layer and a non-linear hidden layer is used to jointly learn the word vector representation and a statistical language model.
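A minimal NumPy sketch of the NNLM forward pass described above; the sizes (V, D, N, H) and the random weights are illustrative assumptions, and the real model is trained with backpropagation:

import numpy as np

V, D, N, H = 1000, 50, 4, 100          # vocab size, embedding dim, context length, hidden units
rng = np.random.default_rng(0)

C = rng.normal(0, 0.1, (V, D))         # shared projection (embedding) matrix
W_h = rng.normal(0, 0.1, (N * D, H))   # projection layer -> non-linear hidden layer
W_o = rng.normal(0, 0.1, (H, V))       # hidden layer -> softmax over the vocabulary

def nnlm_forward(context_ids):
    # Predict a distribution over the next word from the N previous word IDs
    x = C[context_ids].reshape(-1)     # concatenate the N projected word vectors
    h = np.tanh(x @ W_h)               # non-linear hidden layer
    logits = h @ W_o
    p = np.exp(logits - logits.max())
    return p / p.sum()

probs = nnlm_forward([12, 7, 256, 3])  # arbitrary context word IDs
print(probs.shape, probs.sum())        # (1000,) ~1.0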

New log-linear models: The main observation from the previous section is that most of the complexity is caused by the non-linear hidden layer in the model. While this is what makes neural networks so attractive, the authors propose simpler models that may not represent the data as precisely as neural networks, but can be trained on much more data far more efficiently. The two proposed architectures are: a) the Continuous Bag-of-Words model, and b) the Continuous Skip-gram model.

Continuous Bag-of-Words model: This model is similar to the feedforward NNLM, but the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words are projected into the same position and their vectors are averaged. The weight matrix between the input and the projection layer is shared for all word positions, in the same way as in the NNLM.
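A sketch of the CBOW forward pass under the same toy assumptions as the NNLM sketch above: the context word vectors are averaged (no hidden layer), then a softmax over the vocabulary scores the centre word.

import numpy as np

V, D = 1000, 50
rng = np.random.default_rng(1)
W_in = rng.normal(0, 0.1, (V, D))       # input (projection) vectors, shared for all positions
W_out = rng.normal(0, 0.1, (D, V))      # output weights

def cbow_forward(context_ids):
    # Average the context vectors and score every vocabulary word as the centre word
    h = W_in[context_ids].mean(axis=0)  # averaging discards word order, hence "bag of words"
    logits = h @ W_out
    p = np.exp(logits - logits.max())
    return p / p.sum()

print(cbow_forward([3, 17, 42, 99]).argmax())  # ID of the most likely centre word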

Continuous Skip-gram model: Here each word is used as input to a log-linear classifier with a continuous projection layer, which predicts words within a certain range before and after the current word. Increasing the range improves the quality of the resulting word vectors, but it also increases the computational complexity.
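A matching sketch of the skip-gram objective: the current word predicts each word within a window of C positions around it. The sizes, weights and toy sentence below are illustrative assumptions.

import numpy as np

V, D, C = 1000, 50, 2
rng = np.random.default_rng(2)
W_in = rng.normal(0, 0.1, (V, D))
W_out = rng.normal(0, 0.1, (D, V))

def skipgram_pairs(sentence_ids, C):
    # Yield (centre, context) training pairs within a window of C words on each side
    for i, centre in enumerate(sentence_ids):
        for j in range(max(0, i - C), min(len(sentence_ids), i + C + 1)):
            if j != i:
                yield centre, sentence_ids[j]

def log_prob(centre, context):
    # log p(context | centre) under a full softmax
    logits = W_in[centre] @ W_out
    logits = logits - logits.max()
    return logits[context] - np.log(np.exp(logits).sum())

sentence = [5, 81, 12, 907, 44]        # a toy sentence of word IDs
loss = -sum(log_prob(c, o) for c, o in skipgram_pairs(sentence, C))
print(loss)                            # training minimises this with stochastic gradient descent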

Results. Table 1: Examples of five types of semantic and nine types of syntactic questions in the Semantic-Syntactic Word Relationship test set.
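The test set is scored with simple vector arithmetic: for a question "a is to b as c is to ?", compute vec(b) - vec(a) + vec(c) and return the word whose vector is closest by cosine similarity. A sketch with a made-up toy vocabulary (not real trained vectors):

import numpy as np

vec = {
    "man":   np.array([0.9, 0.1, 0.0]),
    "woman": np.array([0.9, 0.1, 1.0]),
    "king":  np.array([0.2, 0.8, 0.0]),
    "queen": np.array([0.2, 0.8, 1.0]),
    "apple": np.array([0.5, 0.0, 0.0]),
}

def answer(a, b, c):
    # Nearest word (excluding the question words) to vec(b) - vec(a) + vec(c)
    target = vec[b] - vec[a] + vec[c]
    candidates = {w: v for w, v in vec.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: candidates[w] @ target /
               (np.linalg.norm(candidates[w]) * np.linalg.norm(target)))

print(answer("man", "woman", "king"))  # expected: queen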

Semantic and syntactic similarities