Distributed Representations of Sentences and Documents

Presentation transcript:

Distributed Representations of Sentences and Documents. Quoc Le, Tomas Mikolov. Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043. Presented on 2014/12/11 by 陳思澄.

Outline
Introduction
Algorithm: Learning Vector Representation of Words; Paragraph Vector: A Distributed Memory Model; Paragraph Vector without Word Ordering: Distributed Bag of Words
Experiments
Conclusion

Introduction In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.

Learning Vector Representation of Words The task is to predict a word given the other words in a context. Given a sequence of training words w_1, w_2, w_3, ..., w_T, the objective of the word vector model is to maximize the average log probability

\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k}),

where the prediction is made via softmax,

p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}, \qquad y = b + U\,h(w_{t-k}, \ldots, w_{t+k}; W),

where U, b are the softmax parameters and h is constructed by a concatenation or average of word vectors extracted from W.
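As a concrete illustration of this objective, here is a minimal NumPy sketch; the function names and shapes are my own assumptions, and the paper itself uses hierarchical softmax for efficiency rather than the full softmax shown here.

```python
import numpy as np

def log_prob_target(context_ids, target_id, W, U, b):
    """log p(w_t | w_{t-k}, ..., w_{t+k}) for one position t.

    W: (vocab_size, dim) word-embedding matrix.
    U: (vocab_size, dim) softmax weights, b: (vocab_size,) softmax bias.
    h is the average of the context word vectors (concatenation also works).
    """
    h = W[context_ids].mean(axis=0)       # h(w_{t-k}, ..., w_{t+k}; W)
    y = b + U @ h                         # un-normalized log-probabilities
    y = y - y.max()                       # numerical stability
    return y[target_id] - np.log(np.exp(y).sum())   # log softmax at the target word

def average_log_prob(windows, W, U, b):
    """The training objective: (1/T) * sum_t log p(w_t | context_t)."""
    return np.mean([log_prob_target(ctx, tgt, W, U, b) for ctx, tgt in windows])
```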

Paragraph Vector: A distributed memory model (PV-DM) In our Paragraph Vector framework, every paragraph is mapped to a unique vector, represented by a column in matrix D, and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context.

Paragraph Vector: A distributed memory model (PV-DM) In this model, the concatenation or average of this vector with a context of three words is used to predict the fourth word. The paragraph vector represents the missing information from the current context and can act as a memory of the topic of the paragraph.
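For reference, a minimal PV-DM training sketch using gensim's Doc2Vec, which implements the models from this paper; the toy corpus, tags, and hyperparameter values below are illustrative assumptions, not the paper's settings.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; in practice each document is a tokenized paragraph or review.
corpus = [
    TaggedDocument(words="this movie was powerful and moving".split(), tags=[0]),
    TaggedDocument(words="a weak plot and flat characters".split(), tags=[1]),
]

# dm=1 selects the distributed-memory model (PV-DM): the paragraph vector is
# combined with the context word vectors to predict the next word.
# dm_concat=1 would concatenate instead of averaging, as described in the paper.
model = Doc2Vec(vector_size=100, window=5, min_count=1, dm=1, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

print(model.dv[0])   # learned paragraph vector of the first document (gensim 4.x API)
```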

The paragraph vectors and word vectors are trained using stochastic gradient descent, and the gradient is obtained via backpropagation. In summary, the algorithm itself has two key stages: (1) training to get word vectors W, softmax weights U, b, and paragraph vectors D on already seen paragraphs; (2) "the inference stage", to get paragraph vectors D for new paragraphs (never seen before) by adding more columns to D and gradient-descending on D while holding W, U, b fixed. We then use D to make a prediction about some particular labels using a standard classifier.
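A sketch of the inference stage with gensim, assuming a PV-DM model like the one above has been trained and saved (the file name and tokens are hypothetical); infer_vector gradient-descends on a new paragraph vector while the learned word vectors and softmax weights stay fixed.

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("pv_dm.model")        # hypothetical saved PV-DM model

new_paragraph = "an unseen review with new wording".split()
# Conceptually adds a new column to D and optimizes it with SGD,
# holding W, U, b fixed, as described above.
vec = model.infer_vector(new_paragraph, epochs=50)
print(vec.shape)                           # (vector_size,)
```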

Advantages of paragraph vectors: An important advantage of paragraph vectors is that they are learned from unlabeled data and thus can work well for tasks that do not have enough labeled data. Paragraph vectors also address some of the key weaknesses of bag-of-words models. First, they inherit an important property of the word vectors: the semantics of the words. In this space, “powerful” is closer to “strong” than to “Paris.” The second advantage of the paragraph vectors is that they take into consideration the word order.

Paragraph Vector without Word Ordering: Distributed Bag of Words (PV-DBOW) Another way is to ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output. At each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window and form a classification task given the Paragraph Vector. In this version, the paragraph vector is trained to predict the words in a small window.
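In gensim, the same Doc2Vec class gives PV-DBOW by setting dm=0; again a minimal sketch with an illustrative corpus and hyperparameters.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words="sampled words from this paragraph".split(), tags=[0]),
    TaggedDocument(words="another short example paragraph".split(), tags=[1]),
]

# dm=0 selects PV-DBOW: the paragraph vector alone is trained to predict
# words sampled from the paragraph, ignoring word order in the input.
dbow = Doc2Vec(vector_size=100, window=5, min_count=1, dm=0, epochs=20)
dbow.build_vocab(corpus)
dbow.train(corpus, total_examples=dbow.corpus_count, epochs=dbow.epochs)
```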

Experiments In our experiments, each paragraph vector is a combination of two vectors: one learned by the standard Paragraph Vector with distributed memory (PV-DM) and one learned by the Paragraph Vector with distributed bag of words (PV-DBOW). We perform experiments to better understand the behavior of the paragraph vectors. To achieve this, we benchmark Paragraph Vector on a text understanding problem that requires fixed-length vector representations of paragraphs: sentiment analysis.

Experiments Sentiment analysis: we use two datasets: the Stanford Sentiment Treebank (11,855 sentences taken from the movie review site Rotten Tomatoes) and the IMDB dataset (100,000 movie reviews; one key aspect of this dataset is that each review consists of several sentences). Tasks and baselines: there are two ways of benchmarking: a 5-way fine-grained classification task where the labels are {Very Negative, Negative, Neutral, Positive, Very Positive}, and a 2-way coarse-grained classification task where the labels are {Negative, Positive}.

Experimental protocols: To make use of the available labeled data, in our model, each subphrase is treated as an independent sentence and we learn the representations for all the subphrases in the training set. After learning the vector representations for training sentences and their subphrases, we feed them to a logistic regression to learn a predictor of the movie rating. The vector presented to the classifier is a concatenation of two vectors, one from PV-DBOW and one from PV-DM.
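A sketch of this setup, assuming PV-DM and PV-DBOW models trained and saved as in the earlier sketches (file names, texts, and labels are placeholders): the two inferred vectors are concatenated and fed to scikit-learn's logistic regression.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec
from sklearn.linear_model import LogisticRegression

pv_dm = Doc2Vec.load("pv_dm.model")        # hypothetical saved models
pv_dbow = Doc2Vec.load("pv_dbow.model")

def features(tokens):
    # Concatenate the PV-DM and PV-DBOW representations of one sentence/phrase.
    return np.concatenate([pv_dm.infer_vector(tokens), pv_dbow.infer_vector(tokens)])

train_texts = [["a", "moving", "film"], ["dull", "and", "predictable"]]   # placeholder
train_labels = [1, 0]                                                     # placeholder ratings

X = np.stack([features(t) for t in train_texts])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
print(clf.predict(np.stack([features(["surprisingly", "powerful"])])))
```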

We report the error rates of different methods in Table 1.

Experimental protocols: We learn the word vectors and paragraph vectors using 75,000 training documents (25,000 labeled and 50,000 unlabeled instances). The paragraph vectors for the 25,000 labeled instances are then fed through a neural network with one hidden layer with 50 units and a logistic classifier to learn to predict the sentiment. The results of Paragraph Vector and other baselines are reported in Table 2.
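A rough stand-in for that classifier using scikit-learn, not the paper's own implementation: one hidden layer with 50 units followed by a logistic output, applied to a placeholder matrix of paragraph vectors X and sentiment labels y.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 400))     # placeholder: paragraph vectors of labeled reviews
y = rng.integers(0, 2, size=1000)    # placeholder: positive/negative sentiment labels

# One hidden layer of 50 units; the output layer acts as a logistic classifier.
clf = MLPClassifier(hidden_layer_sizes=(50,), activation="logistic", max_iter=200)
clf.fit(X, y)
print("training error rate:", 1 - clf.score(X, y))
```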

Conclusion We described Paragraph Vector, an unsupervised learning algorithm that learns vector representations for variable-length pieces of text such as sentences and documents. The vector representations are learned to predict the surrounding words in contexts sampled from the paragraph. Our experiments on several text classification tasks, such as the Stanford Treebank and IMDB sentiment analysis datasets, show that the method is competitive with state-of-the-art methods. The good performance demonstrates the merits of Paragraph Vector in capturing the semantics of paragraphs. In fact, paragraph vectors have the potential to overcome many weaknesses of bag-of-words models.