Word/Doc2Vec for Sentiment Analysis Michael Czerny DC Natural Language Processing 4/8/2015
Who am I? MSc Cosmology and Particle Physics Data Scientist at L-3 Data Tactics Interested in applying forefront research in NLP and ML to industry problems Email: michael.czerny@l-3com.com @m0_z
Outline: What is sentiment analysis? Previous (i.e. pre-W/D2V) approaches to SA Word2Vec explained How it can be used for SA Example/App(?) Doc2Vec explained “ “ Conclusions
What is sentiment analysis?
What is sentiment analysis? In a nutshell: extracting attitudes toward something from human language SA aims to map qualitative data to a quantitative output(s) => Positive (?) => Negative (?) (Or something else entirely?)
What is sentiment analysis? Easy (relatively) for humans1, hard for machines! How do we convert human language to a machine-readable form? 1mashable.com/2010/04/19/sentiment-analysis/
Previous approaches to SA
Previous approaches to SA Keyword lookup: Assign sentiment score to words (“hate”: -1, “love”: +1) Aggregate scores of all words in text Overall + / - determines sentiment http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
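A minimal sketch of the keyword-lookup approach in Python; the tiny lexicon here is made up for illustration (real systems use large hand-labeled lexicons like the one linked above):

```python
# Keyword-lookup sentiment: sum per-word scores, sign of the total is the prediction.
# LEXICON is a toy stand-in for a real sentiment lexicon.
LEXICON = {"love": 1, "great": 1, "good": 1, "hate": -1, "awful": -1, "bad": -1}

def keyword_sentiment(text):
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(keyword_sentiment("I love this movie"))  # positive
print(keyword_sentiment("Not good at all"))    # "good" alone makes it positive: the negation problem
```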
Previous approaches to SA Drawbacks: Need to label words Can’t implicitly capture negation (“Not good” = 0 ??) Ignores word context
Previous approaches to SA Slap a classifier on it! Encode each text as a vector of word counts, one column per vocabulary word (“bag of words”) “John likes to watch movies. Mary likes movies too.” => [1 2 1 1 2 0 0 0 1 1] “John also likes to watch football games.” => [1 1 1 1 0 1 1 1 0 0] Use these vectors as input features to some classifier with labeled data
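A bag-of-words sketch using scikit-learn's CountVectorizer (column order depends on the learned vocabulary, so the vectors may be permuted relative to the example above; `get_feature_names_out` is the scikit-learn ≥ 1.0 spelling):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)       # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # the vocabulary (one column per word)
print(X.toarray())                         # one count vector per document

# These count vectors can be fed directly to any scikit-learn classifier.
```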
Previous approaches to SA Drawbacks: Feature space grows linearly with vocab size Ignores word context Input features contain no information on words themselves (“bad” is just as similar to “good” as “great” is)
Previous approaches to SA Sentiment Treebank (actually came after W2V) Fine-grained sentiment labels for 215,154 phrases of 11,855 sentences Train a recursive neural network “bottom-up”, using child-node vectors to predict parent-phrase sentiment Does very well (85% test-set accuracy) on Treebank sentences2 Good at finding negation 2 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, R. Socher et al., 2013. http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf
Previous approaches to SA Drawbacks: Probably does not generalize well to all tasks (phrase score for “#YOLO swag” = ??) Good in theory, hard in practice (good luck implementing it!)
What can: Give continuous vector rep.’s of words? Capture context? Require minimal feature creation?
Answer:
Or… Word2Vec!3 Maps words to continuous vector representations (i.e. points in an N-dimensional space) Learns vectors from training data (generalizable!) Minimal feature creation! 3 Efficient Estimation of Word Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, 2013. http://arxiv.org/pdf/1301.3781.pdf
Word2Vec How does it work? Two methods: Skip-gram and CBOW 3 Efficient Estimation of Word Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, 2013. http://arxiv.org/pdf/1301.3781.pdf
CBOW Randomly initialize input/output weight matrices of sizes VxN and NxV where V: vocab size, N: vector size (parameter) Predict target word (one-hot encoded) from input of context word vectors (average) using single-layer NN Update weight matrices using SGD, backprop. and cross entropy over corpus Hidden layer size corresponds to word vector dim. 4 word2vec Parameter Learning Explained, X. Rong, 2014 http://arxiv.org/pdf/1411.2738v1.pdf
Skip-gram Method very similar, except now we predict window of words given single word vector Boils down to maximizing dot-product similarity of context words and target word5 Skip-gram typically outperforms CBOW on semantic and syntactic accuracy (Mikolov et al.) 4 word2vec Parameter Learning Explained, X. Rong, 2014 http://arxiv.org/pdf/1411.2738v1.pdf 5 word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method, Y. Goldberg & O. Levy, 2014
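A minimal gensim sketch of both training modes on a toy corpus (recent gensim uses vector_size=; older versions spelled it size=):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus for illustration; in practice you'd stream a large corpus.
sentences = [
    ["i", "love", "this", "movie"],
    ["i", "hate", "this", "movie"],
    ["what", "a", "great", "film"],
]

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)      # sg=0: CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1: skip-gram

print(skipgram.wv["movie"].shape)   # (100,) -- the learned word vector for "movie"
```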
What does Word2Vec give us? Vectors! More importantly, stuff like: vector(“king”) – vector(“man”) + vector(“woman”) ≈ vector(“queen”)
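The analogy query as a gensim sketch; the pretrained-vectors path is a placeholder for whatever trained model you have on disk (a toy corpus like the one above won't produce meaningful analogies):

```python
from gensim.models import KeyedVectors

# Substitute your own path to pretrained vectors, e.g. the Google News release.
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman: the top hit should be "queen" with well-trained vectors.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```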
Simple vector operations give us interesting relationships: 3 Efficient Estimation of Word Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, 2013. http://arxiv.org/pdf/1301.3781.pdf
Word2Vec for Sentiment Analysis
Word2Vec for SA Learned W2V features => sentiment classifier Bonus: Word2Vec has implementations in Python (gensim), Java, C++, and Spark MLlib
Example: Tweets Methodology: Collect tweets using positive and negative emoticons as fuzzy sentiment labels (can quickly & easily collect many this way!) Preprocess tweets Split into train-test Train word2vec on train set Average word vectors for each tweet as input to classifier Validate model All using python!
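A sketch of the feature step on toy data: train Word2Vec on the tokenized tweets, then average each tweet's word vectors. The toy tweets and labels below stand in for the large emoticon-labeled collection:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy labeled tweets (1 = positive, 0 = negative); real runs use ~400k emoticon-labeled tweets.
tweets = [
    ["love", "this", "so", "much"], ["such", "a", "great", "day"],
    ["this", "movie", "is", "awesome"], ["feeling", "happy", "today"],
    ["hate", "this", "so", "much"], ["such", "a", "terrible", "day"],
    ["this", "movie", "is", "awful"], ["feeling", "sad", "today"],
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

w2v = Word2Vec(tweets, vector_size=100, min_count=1, sg=1)

def tweet_vector(tokens, model, dim=100):
    """Average the vectors of in-vocabulary words; zero vector if none are known."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.array([tweet_vector(t, w2v) for t in tweets])   # one 100-d feature vector per tweet
```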
Example: Tweets Tutorial at: http://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis Word2Vec trained on ~400,000 tweets gives us 73% classification accuracy Gensim word2vec implementation scikit-learn logistic regression trained with SGD Improves to 77% using ANN classifier (ROC curve shown)
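A sketch of the classifier step, roughly following the tutorial: logistic regression trained with SGD on the averaged tweet vectors from the previous sketch (recent scikit-learn spells the loss "log_loss"; older versions used "log"):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# X, labels come from the averaging sketch above.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0)

clf = SGDClassifier(loss="log_loss", max_iter=1000)   # logistic regression via SGD
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
print("ROC AUC :", roc_auc_score(y_test, clf.decision_function(X_test)))
```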
Example: Tweets Negative tweets: Positive tweets:
Example: Tweets Extend with neutral class (“#news” is our fuzzy label) ~83% test accuracy with ANN classifier Seems to do impossibly well for neutral…
Example: Tweets Neutral tweets:
Example: Tweets Why does averaging the word vectors in a tweet work?
Example: Tweets Words in 2D space (2-D projections of the learned word vectors, shown over several slides)
Example: Convolutional Nets Window of word vecs => convolve => classify 6 Convolutional Neural Networks for Sentence Classification, Y. Kim, 2014. http://arxiv.org/pdf/1408.5882v2.pdf
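A minimal Keras sketch of "window of word vecs => convolve => classify"; Kim (2014) actually uses several filter widths in parallel plus dropout, so this single-width model only shows the general shape of the idea:

```python
from tensorflow import keras
from tensorflow.keras import layers

maxlen, dim = 50, 300          # padded sentence length, word-vector size (assumed values)

model = keras.Sequential([
    keras.Input(shape=(maxlen, dim)),          # rows = words, columns = word2vec dimensions
    layers.Conv1D(100, 5, activation="relu"),  # 100 filters sliding over 5-word windows
    layers.GlobalMaxPooling1D(),               # max-over-time pooling
    layers.Dense(1, activation="sigmoid"),     # positive / negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```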
(CNN architecture figure) 6 Convolutional Neural Networks for Sentence Classification, Y. Kim, 2014. http://arxiv.org/pdf/1408.5882v2.pdf
Drawbacks: Quality depends on input data, number of samples, and size of vectors (possibly long computation time!) But Google released 3 million word vecs trained on 100 billion words! Averaging vec’s does not work well (in my experience) on large text (> tweet level) W2V cannot provide fixed-length feature vectors for variable-length text (pretty much everything!) 3 Efficient Estimation of Word Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, 2013. http://arxiv.org/pdf/1301.3781.pdf
Doc2Vec
Doc2Vec7 Generalizes W2V to whole documents (phrases, sentences, etc.) Provides fixed-length vector Distributed Memory (DM) and Distributed Bag of Words (DBOW) 7Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014 http://arxiv.org/abs/1405.4053
Distributed Memory (DM) Assign and randomly initialize paragraph vector for each doc Predict next word using context words + paragraph vec Slide context window across doc but keep paragraph vec fixed (hence distributed memory) Updating done via SGD and backprop. 7Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014 http://arxiv.org/abs/1405.4053
Distributed Bag of Words (DBOW) ONLY use paragraph vec (no word vecs!) Take window of words in paragraph and randomly sample which one to predict using paragraph vec (ignores word ordering) Simpler, more memory efficient DM typically outperforms DBOW (but DM + DBOW is even better!) 7Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014 http://arxiv.org/abs/1405.4053
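A sketch of both training modes with gensim's Doc2Vec (dm=1 selects Distributed Memory, dm=0 selects DBOW); the toy corpus stands in for real tokenized documents:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus for illustration; in practice these would be tokenized reviews or tweets.
corpus = [
    ["great", "movie", "loved", "every", "minute"],
    ["terrible", "film", "hated", "it"],
    ["what", "a", "great", "film"],
]
docs = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]

dm_model   = Doc2Vec(docs, vector_size=400, window=8, min_count=1, dm=1, epochs=20)
dbow_model = Doc2Vec(docs, vector_size=400, window=8, min_count=1, dm=0, epochs=20)

print(dm_model.dv[0].shape)    # (400,) -- the learned paragraph vector for document 0
```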
How does it perform? Outperforms the Sentiment Treebank RNN (and everything else) on its own dataset, on both coarse and fine-grained sentiment classification Paragraph vec + 7 words to predict 8th word Concatenates 400 dim. DBOW and DM vecs as input Predicts test-set paragraph vec’s from frozen train-set word vec’s 7Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014 http://arxiv.org/abs/1405.4053
How does it perform? Outperforms everything on the Stanford IMDB movie review data set Paragraph vec + 9 words to predict 10th word Concatenates 400 dim. DBOW and DM vecs as input Predicts test-set paragraph vec’s from frozen train-set word vec’s 7Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014 http://arxiv.org/abs/1405.4053
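A sketch of the paper's recipe of concatenating the DBOW and DM paragraph vectors (400 + 400 dims) before classification, reusing the two models from the previous sketch; the toy labels are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stack DBOW and DM paragraph vectors side by side: an (n_docs, 800) feature matrix.
X = np.hstack([
    np.vstack([dbow_model.dv[i] for i in range(len(docs))]),
    np.vstack([dm_model.dv[i] for i in range(len(docs))]),
])

labels = [1, 0, 1]   # toy sentiment labels matching the toy corpus above
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```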
Doc2Vec on Wikipedia8 8Document Embedding with Paragraph Vectors, A. Dai et al., 2014
Doc2Vec on Wikipedia LDA vs. Doc2Vec for nearest neighbors to “Machine learning” (bold = unrelated to ML) 8Document Embedding with Paragraph Vectors, A. Dai et al., 2014
Doc2Vec on Wikipedia 8Document Embedding with Paragraph Vectors, A. Dai et al., 2014
Using Doc2Vec Gensim has an implementation already! Let’s try it on the Stanford IMDB set…
Using Doc2Vec Only see ~13% test error (compared to reported 7.42%) See my blog post for full details: http://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis Others have had similar issues (can get to 10% error) Code used in paper coming in the near future! (?) Gensim cannot infer new doc vecs, but that is also coming!
Conclusion
Will Word/Doc2Vec solve all my problems?! No, but maybe!
“No Free Lunch Theorem9” Applying machine learning is an art! Test many tools and pick the right one. 9The Lack of A Priori Distinctions Between Learning Algorithms, D.H. Wolpert, 1996
W/D2V find contextual-based continuous vector representations of text Many applications! Information retrieval Document classification Recommendation algorithms …
Thank you!