Distributed Representations for Natural Language Processing

Distributed Representations for Natural Language Processing
Tomas Mikolov, Facebook
ML Prague 2016

Structure of this talk
- Motivation
- Word2vec
  - Architecture
  - Evaluation
  - Examples
- Discussion

Motivation
The representation of text is very important for the performance of many real-world applications: search, ad recommendation, ranking, spam filtering, …
- Local representations: N-grams, 1-of-N coding, bag-of-words
- Continuous representations: Latent Semantic Analysis, Latent Dirichlet Allocation, distributed representations
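
For contrast with the continuous representations, here is a minimal scikit-learn sketch (mine, not from the talk) of the local representations listed above, bag-of-words and n-grams, where every word or n-gram gets its own dimension and no similarity between dimensions is encoded:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["prague is a capital city",
        "tokyo is a capital city",
        "win a free prize now"]

# Bag-of-words with unigrams and bigrams: one dimension per n-gram type.
# These local representations carry no notion of similarity between dimensions:
# "prague" and "tokyo" are as unrelated as "prague" and "prize".
bow = CountVectorizer(ngram_range=(1, 2))
X = bow.fit_transform(docs)
print(X.shape)                      # (3 documents, number of distinct n-grams)
print(sorted(bow.vocabulary_)[:6])  # a few of the n-gram features
```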

Motivation: example
Suppose you want to quickly build a classifier:
- Input = keyword, or user query
- Output = is the user interested in X? (where X can be a service, ad, …)
Toy classifier: is X a capital city?
Getting training examples can be difficult, costly, and time-consuming. With local representations of the input (1-of-N), one will need many training examples for decent performance.

Motivation: example
Suppose we have a few training examples:
(Rome, 1), (Turkey, 0), (Prague, 1), (Australia, 0), …
Can we build a good classifier without much effort?

Motivation: example
Suppose we have a few training examples:
(Rome, 1), (Turkey, 0), (Prague, 1), (Australia, 0), …
Can we build a good classifier without much effort?
YES, if we use good pre-trained features.

Motivation: example
Pre-trained features let us leverage vast amounts of unannotated text data.
Local features:
- Prague = (0, 1, 0, 0, ..)
- Tokyo = (0, 0, 1, 0, ..)
- Italy = (1, 0, 0, 0, ..)
Distributed features:
- Prague = (0.2, 0.4, 0.1, ..)
- Tokyo = (0.2, 0.4, 0.3, ..)
- Italy = (0.5, 0.8, 0.2, ..)
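
A small NumPy sketch (using the illustrative numbers from this slide, not real trained embeddings) of why the distributed features help the toy capital-city classifier: with 1-of-N coding all distinct words are equally far apart, while with distributed features the label learned for Prague transfers to nearby words such as Tokyo.

```python
import numpy as np

# Toy distributed features copied from the slide; real ones would be learned
# from large amounts of unannotated text by word2vec or a similar model.
emb = {
    "Prague": np.array([0.2, 0.4, 0.1]),
    "Tokyo":  np.array([0.2, 0.4, 0.3]),
    "Italy":  np.array([0.5, 0.8, 0.2]),
}

def dist(a, b):
    """Euclidean distance between two word vectors."""
    return np.linalg.norm(a - b)

# With 1-of-N coding, every pair of distinct words is equally far apart
# (distance sqrt(2)), so the example (Prague, 1) tells the classifier nothing
# about Tokyo. With distributed features, capitals sit close together:
print(dist(emb["Prague"], emb["Tokyo"]))   # 0.20  -> the "capital" label transfers
print(dist(emb["Prague"], emb["Italy"]))   # ~0.51 -> Italy is farther away
```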

Distributed representations
We hope to learn representations such that Prague, Rome, Berlin, Paris, etc. will be close to each other.
We do not want just to cluster words: we seek representations that capture multiple degrees of similarity: Prague is similar to Berlin in one way, and to Czech Republic in another.
Can this even be done without manually created databases like WordNet or knowledge graphs?

Word2vec
Simple neural nets can be used to obtain distributed representations of words (Hinton et al., 1986; Elman, 1991; …).
The resulting representations have interesting structure, and the vectors can be obtained using a shallow network (Mikolov, 2007).

Word2vec
Deep learning for NLP (Collobert & Weston, 2008): let's use deep neural networks! It works great!
Back to shallow nets: the word2vec toolkit (Mikolov et al., 2013) is much more efficient than deep networks for this task.

Word2vec
Two basic architectures:
- Skip-gram
- CBOW
Two training objectives:
- Hierarchical softmax
- Negative sampling
Plus a bunch of tricks: weighting of distant words, down-sampling of frequent words.
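
As a sketch of how these choices look in practice, the gensim implementation of word2vec exposes each of them as a parameter (parameter names follow recent gensim 4.x; this toy call is mine, not from the talk):

```python
from gensim.models import Word2Vec

# Any iterable of tokenized sentences works as training data.
sentences = [["prague", "is", "the", "capital", "of", "the", "czech", "republic"],
             ["tokyo", "is", "the", "capital", "of", "japan"]]

model = Word2Vec(
    sentences,
    vector_size=100,    # dimensionality of the word vectors
    sg=1,               # 1 = skip-gram, 0 = CBOW
    hs=0, negative=5,   # negative sampling with 5 noise words (hs=1 -> hierarchical softmax)
    window=5,           # context size; distant words are effectively down-weighted
    sample=1e-3,        # down-sampling of frequent words
    min_count=1,        # keep every word in this toy corpus
)

print(model.wv["prague"][:5])   # the learned distributed representation
```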

Skip-gram Architecture
Predicts the surrounding words given the current word.

Continuous Bag-of-words Architecture
Predicts the current word given the context.
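
A minimal NumPy sketch of one skip-gram update with negative sampling (a simplification with hypothetical variable names, not code from the word2vec toolkit): the current word's vector is pulled towards the output vector of one surrounding word and pushed away from a few sampled noise words. After training, the rows of the input matrix are the word feature vectors.

```python
import numpy as np

def sgns_step(W_in, W_out, center, context, neg_ids, lr=0.025):
    """One skip-gram update with negative sampling (hypothetical helper).

    W_in, W_out : (V, d) input and output embedding matrices.
    center      : index of the current word.
    context     : index of one surrounding word (positive example).
    neg_ids     : indices of k sampled noise words (negative examples).
    """
    v = W_in[center]                          # vector of the current word, shape (d,)
    ids = np.concatenate(([context], neg_ids))
    labels = np.zeros(len(ids)); labels[0] = 1.0
    u = W_out[ids]                            # output vectors, shape (k+1, d)
    p = 1.0 / (1.0 + np.exp(-(u @ v)))        # sigmoid of the dot products
    g = p - labels                            # gradients of the logistic losses
    W_out[ids] -= lr * np.outer(g, v)         # push context up, noise words down
    W_in[center] -= lr * (g @ u)              # move the current word's vector

# Tiny usage example with random data.
V, d, k = 1000, 50, 5
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(scale=0.1, size=(V, d)), np.zeros((V, d))
sgns_step(W_in, W_out, center=42, context=7, neg_ids=rng.integers(0, V, size=k))
```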

Word2vec: Linguistic Regularities
After training is finished, the weight matrix between the input and hidden layers represents the word feature vectors.
The word vector space implicitly encodes many regularities among words:

Linguistic Regularities in Word Vector Space
The resulting distributed representations of words contain a surprising amount of syntactic and semantic information.
There are multiple degrees of similarity among words:
- KING is similar to QUEEN as MAN is similar to WOMAN
- KING is similar to KINGS as MAN is similar to MEN
Simple vector operations on the word vectors give very intuitive results (King - Man + Woman ≈ Queen).
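
With a trained model (for example the gensim model sketched earlier, or any pre-trained word vectors), this vector arithmetic can be reproduced with a standard similarity query; the exact neighbours depend on the training data:

```python
# Assuming `model` is a trained gensim Word2Vec model (or KeyedVectors loaded
# from pre-trained vectors).  3CosAdd: nearest words to  king - man + woman.
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)   # with good vectors, "queen" should rank near the top
```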

Linguistic Regularities - Evaluation
The regularity of the learned word vector space was evaluated using a test set with about 20K analogy questions.
The test set contains both syntactic and semantic questions.
Comparison to the previous state of the art (pre-2013).
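
Recent gensim versions ship the roughly 20K-question analogy test set from the word2vec paper together with a helper to score a model on it; a hedged sketch, assuming a trained `model` as above:

```python
from gensim.test.utils import datapath

# questions-words.txt is the analogy test set from the word2vec paper; a copy
# ships with gensim's test data (any local copy of the file works as well).
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"overall analogy accuracy: {score:.3f}")
for sec in sections[:3]:
    total = len(sec["correct"]) + len(sec["incorrect"])
    print(sec["section"], f"{len(sec['correct'])}/{total} correct")
```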

Linguistic Regularities - Evaluation

Linguistic Regularities - Examples

Visualization using PCA
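
A possible way to reproduce this kind of plot, assuming a trained `model` as above, is to project a handful of word vectors to two dimensions with scikit-learn's PCA and label the points with matplotlib (a sketch, not the original figure's code):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["king", "queen", "man", "woman", "prague", "tokyo", "paris"]
words = [w for w in words if w in model.wv]        # keep only in-vocabulary words
xy = PCA(n_components=2).fit_transform([model.wv[w] for w in words])

plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))                        # label each projected vector
plt.show()
```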

Summary and discussion
- Word2vec: much faster and far more accurate than previous neural-net-based solutions; the training speed-up compared to the prior state of the art is more than 10,000 times (literally from weeks to seconds)
- Features derived from word2vec are now used across all big IT companies in plenty of applications (search, ads, ..)
- Also very popular in the research community: a simple way to boost performance in many NLP tasks
- Main reasons for its success: very fast, open source, and the resulting features are easy to use to boost many applications (even non-NLP ones)

Follow-up work
Baroni, Dinu, Kruszewski (2014): Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
- It turns out neural-based approaches are very close to traditional distributional semantics models
- Luckily, word2vec significantly outperformed the best previous models across many tasks

Follow-up work
Pennington, Socher, Manning (2014): GloVe: Global Vectors for Word Representation
- A word2vec variant from Stanford: almost identical, but with a new name
- In some sense a step back: word2vec counts co-occurrences and does the dimensionality reduction together, while GloVe is a two-pass algorithm

Follow-up work
Levy, Goldberg, Dagan (2015): Improving distributional similarity with lessons learned from word embeddings
- Hyper-parameter tuning is important: debunks the claims of GloVe's superiority
- Compares models trained on the same data (unlike the GloVe paper…); word2vec is faster, its vectors are better, and it uses much less memory
- Many others ended up with similar conclusions (Radim Rehurek, …)

Final notes
- Word2vec is successful because it is simple, but it cannot be applied everywhere
- For modeling sequences of words, consider recurrent networks
- Do not sum word vectors to obtain representations of sentences; it will not work well
- Be careful about the hype, as always … the most-cited papers often contain non-reproducible results

References
- Mikolov (2007): Language Modeling for Speech Recognition in Czech
- Collobert, Weston (2008): A unified architecture for natural language processing: Deep neural networks with multitask learning
- Mikolov, Karafiat, Burget, Cernocky, Khudanpur (2010): Recurrent neural network based language model
- Mikolov (2012): Statistical Language Models Based on Neural Networks
- Mikolov, Yih, Zweig (2013): Linguistic Regularities in Continuous Space Word Representations
- Mikolov, Chen, Corrado, Dean (2013): Efficient estimation of word representations in vector space
- Mikolov, Sutskever, Chen, Corrado, Dean (2013): Distributed representations of words and phrases and their compositionality
- Baroni, Dinu, Kruszewski (2014): Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
- Pennington, Socher, Manning (2014): GloVe: Global Vectors for Word Representation
- Levy, Goldberg, Dagan (2015): Improving distributional similarity with lessons learned from word embeddings