Word2Vec CS246 Junghoo “John” Cho

Neural Language Model of [Mikolov 2010, 2011a, 2011b]
- Follow-up work to [Bengio 2003] on neural language models
- Used a simple Recurrent Neural Network (RNN)
- The recurrent structure allows looking at a "longer" context, in principle

f(w_1, …, w_n) of [Mikolov 2010, 2011a, 2011b]
- Input: one-hot word vector w(t) of size V; hidden state h(t) of size m
- v(t) = W w(t)
- h(t) = sigmoid(v(t) + U h(t−1))
- y(t) = V h(t)
- f(t) = softmax(y(t))
where sigmoid(x) = 1 / (1 + e^(−x)) and softmax(x)_i = e^(x_i) / Σ_j e^(x_j)
[Slide shows the corresponding network diagram: w(t) → W → v(t), recurrent connection U on h(t−1), output matrix V → y(t) → softmax]
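To make the forward computation concrete, here is a minimal numpy sketch of one step of this simple RNN language model. The shapes and the helper name rnn_lm_step are assumptions for illustration, not Mikolov's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

def rnn_lm_step(w_t, h_prev, W, U, V):
    """One step of the simple RNN LM: returns P(next word) and the new hidden state.
    w_t    : one-hot input word, shape (vocab,)
    h_prev : previous hidden state, shape (m,)
    W      : input embedding matrix, shape (m, vocab)
    U      : recurrent matrix, shape (m, m)
    V      : output matrix, shape (vocab, m)
    """
    v_t = W @ w_t                      # v(t) = W w(t)
    h_t = sigmoid(v_t + U @ h_prev)    # h(t) = sigmoid(v(t) + U h(t-1))
    y_t = V @ h_t                      # y(t) = V h(t)
    f_t = softmax(y_t)                 # f(t): distribution over the next word
    return f_t, h_t
```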

Result of [Mikolov 2010, 2011a, 2011b]
- Great results on speech-recognition tasks
- But more importantly, in his follow-up work Mikolov wondered: does the vector v mean something more? Does it in any way capture some "semantic meaning" of words? What does the "distance" between two word vectors represent, for example?

Observation in [Mikolov 2013a]
The difference between v1 and v2 captures the syntactic/semantic "relationship" between the two words!

Vector Difference Captures Relationship
v(king) − v(man) + v(woman) = v(queen)!!!
Experimental result:
- Trained an RNN with 640-dimensional word vectors on 320M words of Broadcast News data
- 35% accuracy on the syntactic relationship test (good:better – bad:worse)
- 8% accuracy on the semantic relationship test (London:England – Beijing:China)
- First result showing that word vectors represent much more than what was expected
[Slide shows a 2D diagram of the man/woman/king/queen vectors]
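As an illustration of the analogy test above, a small sketch that answers "man is to king as woman is to ?" by nearest cosine similarity; `vecs` is a hypothetical dict mapping words to numpy vectors, not part of the original slides.

```python
import numpy as np

def analogy(a, b, c, vecs, topn=1):
    """Return the word(s) whose vector is closest to vec(b) - vec(a) + vec(c).
    Example: analogy('man', 'king', 'woman', vecs) should ideally return ['queen']."""
    target = vecs[b] - vecs[a] + vecs[c]

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # rank all other words by cosine similarity to the target vector
    candidates = [(w, cos(v, target)) for w, v in vecs.items() if w not in (a, b, c)]
    candidates.sort(key=lambda x: x[1], reverse=True)
    return [w for w, _ in candidates[:topn]]
```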

Follow-up Work of [Mikolov 2013b]
Questions:
- Can we do better?
- Would we get better results if a much larger dataset were used?
[Mikolov 2013b]: significantly simplified neural-network models
- Continuous Bag-Of-Words (CBOW) model
- Continuous Skip-Gram model
- Significant reduction in learning complexity, allowing much larger datasets to be used

CBOW Model: P(w | w_1, …, w_n)
- v_i = W1 w_i  (embedding of each context word w_i)
- h = Σ_{i=1..n} v_i
- y = W2^T h
- f = softmax(y)
Much simpler model:
- No recurrent structure
- Simple addition of all context-word vectors
- No internal nonlinearity
- Softmax only on the final output
[Slide shows the corresponding network diagram: context words w_1 … w_n → W1 → sum h → W2^T → softmax]
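A minimal numpy sketch of the CBOW forward pass described above; the matrix layout (rows of W1 as word embeddings) and the function name are illustrative assumptions.

```python
import numpy as np

def cbow_forward(context_ids, W1, W2):
    """CBOW forward pass: predict the center word from its context words.
    context_ids : indices of the n context words
    W1          : input embeddings, shape (vocab, m)  (row i is v_i)
    W2          : output embeddings, shape (vocab, m)
    Returns a probability distribution over the vocabulary.
    """
    h = W1[context_ids].sum(axis=0)    # h = sum of context word vectors, no nonlinearity
    y = W2 @ h                         # y = W2^T h in the slide's notation
    e = np.exp(y - y.max())
    return e / e.sum()                 # f = softmax(y)
```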

Skip-Gram Model: P(w_1, …, w_n | w)
- v = W1 w  (embedding of the center word w)
- y_i = W2^T v
- f_i = softmax(y_i)
Again a much simpler model:
- No recurrent structure
- Shared W2^T for all context words
- No internal nonlinearity
- Weighted sampling from context words to reduce complexity
[Slide shows the corresponding network diagram: w → W1 → v → W2^T → softmax for each context word]
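And the corresponding sketch for the Skip-gram forward pass; again, shapes and names are assumptions for illustration.

```python
import numpy as np

def skipgram_forward(center_id, W1, W2):
    """Skip-gram forward pass: predict each context word from the center word.
    Because W2 is shared across context positions, a single softmax output
    serves as the predicted distribution for every context slot.
    """
    v = W1[center_id]                  # v = W1 w (embedding of the center word)
    y = W2 @ v                         # y = W2^T v, shared for all context words
    e = np.exp(y - y.max())
    return e / e.sum()                 # f_i = softmax(y) for each context position i
```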

Result of [Mikolov 2013b] (1)
- Trained 640-dimensional word vectors on 320M words of input data with an 82K vocabulary
- 60% accuracy on the syntactic test and 55% accuracy on the semantic test!
Model       Semantic Accuracy   Syntactic Accuracy
RNN                9%                 36%
NN                23%                 53%
CBOW              24%                 64%
Skip-Gram         55%                 59%

Result of [Mikolov 2013b] (2)
- Trained 1000-dimensional word vectors on 6B words of input data for ~2 days on ~150 CPUs
- Higher-dimensional vector representations and training on a larger dataset improve accuracy significantly
Model       Semantic Accuracy   Syntactic Accuracy
CBOW              57%                 69%
Skip-Gram         66%                 65%

Example Results from [Mikolov 2013b]

Follow-up Work of [Mikolov 2013c]
Further optimizations of the training process:
- Hierarchical Softmax
- Negative Sampling
- Frequent-Word Subsampling

Hierarchical Softmax
- Use a binary tree to represent all words in the vocabulary
- A weight vector is associated with each internal node of the tree
- Avoids updating the weights of all V output vectors for each training instance
- Updates the weights of only ~log(V) vectors, as opposed to V vectors
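A hedged sketch of how a word's probability is computed under hierarchical softmax: only the internal-node vectors on the word's root-to-leaf path are involved, roughly log2(V) of them. The data layout (path_nodes, path_signs) is an assumption for illustration, not the word2vec code's own representation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(h, path_nodes, path_signs, node_vecs):
    """P(word | h) under hierarchical softmax.
    h          : hidden/context vector, shape (m,)
    path_nodes : indices of the internal tree nodes on the root-to-word path
    path_signs : +1 or -1 depending on whether the path turns left or right at each node
    node_vecs  : weight vectors of the internal nodes, shape (num_nodes, m)
    Only the ~log2(V) node vectors on this path are touched per training example.
    """
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * (node_vecs[node] @ h))
    return p
```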

Negative Sampling
- An alternative way to avoid updating all V output-vector weights for each training instance
- Sample just a few words in the output vector!
- Sampling method: keep the "positive" output words; sample only a few (between 5 and 15) "negative" output words
- Sampling distribution: P(w_i) ∝ f(w_i)^(3/4)
- The log-likelihood function for training is slightly revised to make the optimization work
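A sketch, under the assumptions above, of the negative-sampling objective for one (center, context) pair and of a sampler that draws negatives from the unigram distribution raised to the 3/4 power; names like build_negative_sampler are illustrative, not taken from the word2vec code.

```python
import numpy as np

def negative_sampling_loss(v_center, u_pos, u_negs):
    """Skip-gram negative-sampling objective for one (center, context) pair:
    maximize log sigmoid(u_pos . v) + sum_k log sigmoid(-u_neg_k . v).
    Returns the negative log-likelihood to be minimized."""
    def log_sigmoid(x):
        return -np.logaddexp(0.0, -x)   # log(1 / (1 + e^-x)), numerically stable
    loss = log_sigmoid(u_pos @ v_center)
    loss += sum(log_sigmoid(-u_neg @ v_center) for u_neg in u_negs)
    return -loss

def build_negative_sampler(word_counts):
    """Sampler with P(w_i) proportional to f(w_i)^(3/4)."""
    probs = np.array(word_counts, dtype=float) ** 0.75
    probs /= probs.sum()
    return lambda k: np.random.choice(len(probs), size=k, p=probs)
```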

Frequent-Word Subsampling
- During training, sample fewer instances of frequent words
- Sampling frequency: P(w_i) = √(t / f(w_i)), where f(w_i) is the word's corpus frequency (t = 10^-5 in the experiments)
- Reduces both (1) training time and (2) the influence of frequent words on the final vector representation
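A small sketch of this subsampling rule, assuming each occurrence of a word is kept with probability sqrt(t / f(w)) as in [Mikolov 2013c]; `freq` is a hypothetical map from word to relative corpus frequency.

```python
import random

def subsample(words, freq, t=1e-5):
    """Randomly drop occurrences of frequent words.
    A word w is kept with probability min(1, sqrt(t / f(w))),
    so frequent words are aggressively thinned out while rare words are kept."""
    kept = []
    for w in words:
        keep_prob = min(1.0, (t / freq[w]) ** 0.5)
        if random.random() < keep_prob:
            kept.append(w)
    return kept
```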

Result of [Mikolov 2013c]
Trained a 300-dimensional Skip-gram model on 1B words of input data with a 692K vocabulary

Without subsampling:
Model       Semantic Accuracy   Syntactic Accuracy   Time
NEG-5             54%                 63%             38 min
NEG-15            58%                  –              97 min
H-Softmax         40%                 53%             41 min

With t = 10^-5 subsampling:
Model       Semantic Accuracy   Syntactic Accuracy   Time
NEG-5             58%                 61%             14 min
NEG-15             –                   –              36 min
H-Softmax         59%                 52%             21 min

Summary
- Amazing results. Why does it work so well? Does our brain represent words as vectors in a high-dimensional space?
- I do not have first-hand experience of how well they "really" work. Are neural-model vectors really better than vectors from other models, such as LSI and LDA? I don't know.
- The word2vec code and trained vector representations of words are all publicly available: https://code.google.com/archive/p/word2vec/
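For anyone who wants to try this directly, a hedged usage sketch with the gensim library (assuming gensim 4.x, where the dimensionality parameter is called vector_size); results on a toy corpus like this will of course be meaningless.

```python
from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [["the", "king", "rules", "the", "country"],
             ["the", "queen", "rules", "the", "country"]]

# sg=1 selects the skip-gram model, negative=5 enables negative sampling
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, negative=5, epochs=50)

# analogy query: king - man + woman = ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```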

References
[Mikolov 2010] Recurrent neural network based language model
[Mikolov 2011a] Strategies for training large scale neural network language models
[Mikolov 2011b] Extensions of recurrent neural network language model
[Mikolov 2013a] Linguistic regularities in continuous space word representations
[Mikolov 2013b] Efficient estimation of word representations in vector space
[Mikolov 2013c] Distributed representations of words and phrases and their compositionality