Vector Representation of Text
Vagelis Hristidis
Prepared with the help of Nhat Le
Many slides are from Richard Socher, Stanford CS224d: Deep Learning for NLP
To compare pieces of text, we need effective representations of words, sentences, and longer text.
Approach 1: Use existing thesauri or ontologies such as WordNet or SNOMED CT (for medical text). Drawbacks: built manually; not context specific.
Approach 2: Use co-occurrence counts for word similarity. Drawbacks: requires space quadratic in the vocabulary size; the relative position and order of words are not considered.
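A minimal sketch of Approach 2, assuming a toy two-sentence corpus and a symmetric window of size 1 (both illustrative choices, not from the slides). It counts how often each pair of words co-occurs, which also shows why the representation needs one row and one column per vocabulary word.

```python
from collections import defaultdict

# Toy corpus and window size (illustrative values, not from the slides)
corpus = [["the", "cat", "sat", "on", "the", "floor"],
          ["the", "dog", "sat", "on", "the", "mat"]]
window = 1

# Build a V x V co-occurrence count "matrix" as a nested dict of counts
cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[word][sentence[j]] += 1

print(dict(cooc["sat"]))   # {'cat': 1, 'on': 2, 'dog': 1}
```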
Approach 3: Low-dimensional vectors. Store only the "important" information in a fixed, low-dimensional vector. Apply Singular Value Decomposition (SVD) to the co-occurrence matrix X (m = n = size of the vocabulary): X = U S Vᵀ. Let S_k be the same matrix as S except that it keeps only the top k singular values; then X_k = U_k S_k V_kᵀ is the best rank-k approximation to X in the least-squares sense, and the rows of U_k S_k give k-dimensional word vectors. Example: motel = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271].
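A hedged NumPy sketch of this step, with a random matrix standing in for a real co-occurrence matrix and illustrative sizes V and k; the rows of U_k S_k are taken as the word vectors.

```python
import numpy as np

# X: toy V x V co-occurrence matrix (random here, just to show the shapes)
V, k = 6, 2                       # vocabulary size and target rank (illustrative)
X = np.random.rand(V, V)

U, S, Vt = np.linalg.svd(X)       # full SVD: X = U @ diag(S) @ Vt
word_vectors = U[:, :k] * S[:k]   # keep only the top-k singular directions

# Row i of word_vectors is the k-dimensional embedding of word i,
# and word_vectors @ Vt[:k, :] is the best rank-k approximation of X.
approx = word_vectors @ Vt[:k, :]
```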
Approach 3: low-dimensional vectors. [Figure from "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence", Rohde et al., 2005.]
Problems with SVD: computational cost scales badly, O(mn²) flops for an n x m matrix (when n < m); it is hard to incorporate new words or documents; and it does not consider the order of words.
word2vec approach to representing the meaning of a word: represent each word with a low-dimensional vector, so that word similarity = vector similarity. Key idea: predict the surrounding words of every word. It is faster than SVD and can easily incorporate a new sentence/document or add a word to the vocabulary.
Represent the meaning of a word – word2vec. Two basic neural network models: Continuous Bag of Words (CBOW), which uses a window of context words to predict the middle word, and Skip-gram (SG), which uses a word to predict the surrounding words in its window.
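A minimal sketch of both models using gensim (assuming gensim 4.x, where the embedding size parameter is vector_size) on a toy corpus; sg=0 selects CBOW and sg=1 selects skip-gram.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative data)
sentences = [["the", "cat", "sat", "on", "the", "floor"],
             ["the", "dog", "sat", "on", "the", "mat"]]

# sg=0 -> CBOW: the context words predict the center word
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-gram: the center word predicts each context word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)            # (50,) -- the learned vector for "cat"
print(skipgram.wv.most_similar("cat")) # nearest neighbours by cosine similarity
```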
Word2vec – Continuous Bag of Words. Example: "The cat sat on floor", window size = 2. For the center word "sat", the context words are "the", "cat", "on", and "floor".
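A small sketch of how a window of size 2 turns the example sentence into (context, center) training pairs for CBOW; the tokenization is an assumption.

```python
sentence = ["the", "cat", "sat", "on", "floor"]
window = 2

# For CBOW, each training example is (context words, center word)
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    print(context, "->", center)
# e.g. ['the', 'cat', 'on', 'floor'] -> sat
```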
[Figure: CBOW network for this example. Input layer: one V-dimensional one-hot vector per context word (e.g., "cat" and "on"), with a 1 at that word's index in the vocabulary. Hidden layer in the middle. Output layer: the center word "sat".]
The input-to-hidden weight matrix W (V x N) is shared by all context words, and the hidden-to-output weight matrix is W' (N x V). We must learn W and W'. V is the vocabulary size (the input and output layers are V-dimensional); N, the size of the hidden layer, will be the size of the word vectors.
Multiplying the transpose of W by a one-hot vector simply selects the corresponding row of W: Wᵀ x_cat = v_cat and Wᵀ x_on = v_on, where x_cat and x_on are the V-dimensional one-hot vectors for "cat" and "on", and v_cat, v_on are their N-dimensional word vectors (rows of W).
The hidden layer is the average of the context word vectors: v̂ = (v_cat + v_on) / 2.
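A NumPy sketch of this step under toy sizes V and N (the word indices chosen for "cat" and "on" are hypothetical): multiplying Wᵀ by a one-hot vector just picks out a row of W, and the hidden layer is the average of the selected rows.

```python
import numpy as np

V, N = 5, 3                          # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W = rng.random((V, N))               # input-to-hidden weights (V x N)

def one_hot(index, size):
    x = np.zeros(size)
    x[index] = 1.0
    return x

x_cat, x_on = one_hot(1, V), one_hot(3, V)   # hypothetical indices of "cat" and "on"

v_cat = W.T @ x_cat                  # equals W[1], the row for "cat"
v_on = W.T @ x_on                    # equals W[3], the row for "on"
v_hat = (v_cat + v_on) / 2           # hidden layer: average of the context vectors

assert np.allclose(v_cat, W[1]) and np.allclose(v_on, W[3])
```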
From the hidden layer to the output layer: z = W'ᵀ v̂ gives one score per vocabulary word (W' is N x V), and ŷ = softmax(z) turns the scores into a probability distribution over the V words. Again, N is the size of the word vectors.
Training objective: we would prefer the predicted distribution ŷ = softmax(z) to be close to y_sat, the one-hot vector of the true center word "sat" (e.g., ŷ = [0.01, 0.02, 0.00, 0.7, …], with most of the probability mass on "sat").
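Continuing the same toy setup (the sizes, the stand-in hidden vector, and the index of "sat" are all illustrative), a sketch of the output side: the scores z, the softmax ŷ, and the cross-entropy loss that pushes ŷ toward the one-hot target y_sat.

```python
import numpy as np

N, V = 3, 5                           # same toy sizes as the previous sketch
rng = np.random.default_rng(1)
W_prime = rng.random((N, V))          # hidden-to-output weights (N x V)
v_hat = rng.random(N)                 # stands in for the averaged context vector

z = v_hat @ W_prime                   # one score per vocabulary word, shape (V,)
y_hat = np.exp(z) / np.exp(z).sum()   # softmax over the vocabulary

sat_index = 2                         # hypothetical index of "sat"
loss = -np.log(y_hat[sat_index])      # cross-entropy: want y_hat[sat_index] near 1
```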
Both weight matrices contain word vectors: W contains the input word vectors (one N-dimensional row per word) and W' contains the output word vectors. We can consider either W or W' as the word's representation, or even take the average of the two.
Some interesting results
Word analogies
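A hedged example of the classic analogy king - man + woman ≈ queen, assuming the gensim downloader and its pretrained "glove-wiki-gigaword-50" vectors are available (GloVe rather than word2vec, but the vector arithmetic and similarity queries work the same way).

```python
import gensim.downloader as api

# Pretrained 50-dimensional GloVe vectors distributed via gensim-data
wv = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" should land near "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plain nearest-neighbour queries use the same API
print(wv.most_similar("frog", topn=3))
```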
Represent the meaning of a sentence/text. Simple approach: take the average of the word2vec vectors of its words. Another approach: the Paragraph Vector (Le and Mikolov, 2014), which extends word2vec to the text level; it also comes in two models and adds a paragraph vector as an extra input.
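A minimal sketch of the simple averaging approach, assuming wv is a gensim KeyedVectors object such as model.wv from the earlier example (the paragraph-vector approach would correspond to gensim's Doc2Vec, not shown here).

```python
import numpy as np

def sentence_vector(tokens, wv):
    """Average the vectors of the tokens that are in the vocabulary."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

# 'wv' could be cbow.wv from the gensim example above, or pretrained KeyedVectors
s1 = sentence_vector(["the", "cat", "sat", "on", "the", "floor"], wv)
s2 = sentence_vector(["a", "dog", "sat", "on", "a", "mat"], wv)

# Cosine similarity between the two sentence vectors
cos = s1 @ s2 / (np.linalg.norm(s1) * np.linalg.norm(s2))
```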
Applications: search (e.g., query expansion), sentiment analysis, classification, clustering.
Resources:
Stanford CS224d: Deep Learning for NLP – http://cs224d.stanford.edu/index.html
"word2vec Parameter Learning Explained" by Xin Rong (the best explanation of the models), with the interactive demo at https://ronxin.github.io/wevi/
"Word2Vec Tutorial – The Skip-Gram Model" – http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Improvements and pre-trained models for word2vec: https://nlp.stanford.edu/projects/glove/ and https://fasttext.cc/ (by Facebook)