
Doc2Sent2Vec: A Novel Two-Phase Approach for Learning Document Representation
Ganesh J1, Manish Gupta1,2 and Vasudeva Varma1
1IIIT Hyderabad, India   2Microsoft, India

Problem
Learn low-dimensional, dense representations (or embeddings) for documents. Document embeddings can be used off-the-shelf in many IR and data mining applications, such as:
- Document classification
- Document retrieval
- Document ranking

Power of Neural Networks
Bag-of-words (BOW) or bag-of-n-grams [Harris et al.]:
- Data sparsity
- High dimensionality
- Does not capture word order
Latent Dirichlet Allocation (LDA) [Blei et al.]:
- Computationally inefficient for larger datasets
Paragraph2Vec [Le et al.]:
- Dense representation
- Compact representation
- Captures word order
- Efficient to estimate

Paragraph2Vec
Learn a document embedding by predicting the next word in the document, using the context of the word and the ('unknown') document vector as features. The resulting vector captures the topic of the document. Two variants:
- Update the document vectors, but not the word vectors [Le et al.]
- Update the document vectors along with the word vectors [Dai et al.], which improves accuracy on document similarity tasks
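As an illustration of this family of models, here is a minimal sketch (not the authors' code) using gensim's Doc2Vec in PV-DM mode, which updates word vectors jointly with the document vectors, roughly the [Dai et al.] setting; the toy corpus, hyperparameters and the gensim 4.x API are assumptions of the example.

```python
# Minimal sketch (assumption: gensim 4.x API; toy corpus and hyperparameters
# are illustrative, not the paper's setup). PV-DM mode (dm=1) updates word
# vectors jointly with document vectors, roughly the [Dai et al.] variant.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "doc2sent2vec learns document representations in two phases",
    "latent dirichlet allocation is a generative topic model",
]
docs = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

model = Doc2Vec(vector_size=100, window=5, min_count=1, dm=1, epochs=40)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

doc_vec = model.dv[0]             # learned document embedding
word_vec = model.wv["document"]   # jointly trained word embedding
```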

Doc2Sent2Vec
Idea: being granular helps.
- Should we learn the document embedding from the word context directly?
- Can we learn the document embedding from the sentence context instead?
Doc2Sent2Vec explicitly exploits sentence-level and word-level coherence to learn document and sentence embeddings, respectively.

Notation
Document set: D = {d1, d2, …, dM}, with M documents.
Document: dm = {s(m,1), s(m,2), …, s(m,Tm)}, with Tm sentences.
Sentence: s(m,n) = {w(n,1), w(n,2), …, w(n,Tn)}, with Tn words.
Word: w(n,t).
Doc2Sent2Vec's goal is to learn low-dimensional representations of words, sentences and documents as continuous feature vectors of dimensionality Dw, Ds and Dd, respectively.
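As a concrete illustration of this nesting (a toy example, not taken from the paper), a corpus can be stored as a list of documents, each a list of sentences, each a list of words:

```python
# Toy illustration of the notation: D is a list of M documents d_m,
# each a list of T_m sentences s_(m,n), each a list of words w_(n,t).
raw_docs = [
    "Doc2Sent2Vec learns document embeddings. It works in two phases.",
    "Paragraph2Vec predicts words from contexts. The document vector is an extra feature.",
]

D = [[sentence.split() for sentence in doc.split(". ") if sentence]
     for doc in raw_docs]

print(len(D))      # M = 2 documents
print(len(D[0]))   # T_1 = 2 sentences in the first document
print(D[0][0])     # word sequence of the first sentence
```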

Architecture Diagram

Phase 1: Learn Sentence Embeddings
Idea: learn the sentence representation from the word sequence within the sentence.
Input features:
- Context words for target word w(n,t): w(n,t-cw), …, w(n,t-1), w(n,t+1), …, w(n,t+cw), where cw is the word context size
- Target sentence: s(m,n), where m is the document id
Output: w(n,t)
Task: predict the target word using the concatenation of the word vectors of the context words along with the sentence vector as features. Maximize the word likelihood:
Lword = P(w(n,t) | w(n,t-cw), …, w(n,t-1), w(n,t+1), …, w(n,t+cw), s(m,n))
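Summed over the corpus and written out with a plain softmax output layer (an illustrative assumption; the paper uses hierarchical softmax, see the Training slide), the phase-1 objective can be stated as:

```latex
\mathcal{L}_{\mathrm{word}}
  = \sum_{m=1}^{M} \sum_{n=1}^{T_m} \sum_{t}
    \log P\big(w_{(n,t)} \mid w_{(n,t-c_w)},\dots,w_{(n,t+c_w)},\, s_{(m,n)}\big),
\qquad
P(w \mid h) = \frac{\exp(u_w^{\top} h)}{\sum_{w'} \exp(u_{w'}^{\top} h)}
```

where h is the concatenation of the context word vectors with the sentence vector s(m,n), and u_w is the output vector of word w.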

Phase 2: Learn Document Embeddings
Idea: learn the document representation from the sentence sequence within the document.
Input features:
- Context sentences for target sentence s(m,t): s(m,t-cs), …, s(m,t-1), s(m,t+1), …, s(m,t+cs), where cs is the sentence context size
- Target document: d(m)
Output: s(m,t)
Novel task: predict the target sentence using the concatenation of the sentence vectors of the context sentences along with the document vector as features. Maximize the sentence likelihood:
Lsent = P(s(m,t) | s(m,t-cs), …, s(m,t-1), s(m,t+1), …, s(m,t+cs), d(m))
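Phase 2 mirrors phase 1 with sentences in place of words: the softmax now ranges over sentence ids rather than vocabulary words (again written with a plain softmax purely for illustration):

```latex
\mathcal{L}_{\mathrm{sent}}
  = \sum_{m=1}^{M} \sum_{t}
    \log P\big(s_{(m,t)} \mid s_{(m,t-c_s)},\dots,s_{(m,t+c_s)},\, d_m\big),
\qquad
P(s \mid h) = \frac{\exp(v_s^{\top} h)}{\sum_{s'} \exp(v_{s'}^{\top} h)}
```

where h concatenates the context sentence vectors with the document vector d(m), and v_s is the output vector of sentence s.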

Training
Overall objective function: L = Lword + Lsent
Use stochastic gradient descent (SGD) to learn the parameters.
Use hierarchical softmax [Mikolov et al.] to speed up training.
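The sketch below (PyTorch, not the authors' code) shows one joint SGD update on L = Lword + Lsent. It substitutes a plain softmax (cross-entropy) for the paper's hierarchical softmax, and all sizes and ids are toy assumptions.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions): vocabulary V, number of sentences S, documents M.
V, S, M = 1000, 200, 50
Dw, Ds, Dd = 64, 64, 64          # embedding dimensionalities
cw, cs = 2, 1                    # word / sentence context sizes

word_emb = nn.Embedding(V, Dw)
sent_emb = nn.Embedding(S, Ds)
doc_emb = nn.Embedding(M, Dd)

# Output layers: predict a word id (phase 1) or a sentence id (phase 2).
word_out = nn.Linear(2 * cw * Dw + Ds, V)
sent_out = nn.Linear(2 * cs * Ds + Dd, S)

params = (list(word_emb.parameters()) + list(sent_emb.parameters()) +
          list(doc_emb.parameters()) + list(word_out.parameters()) +
          list(sent_out.parameters()))
opt = torch.optim.SGD(params, lr=0.05)
loss_fn = nn.CrossEntropyLoss()

def train_step(ctx_words, sent_id, target_word, ctx_sents, doc_id, target_sent):
    # Phase 1: context word vectors + sentence vector -> target word (L_word)
    h1 = torch.cat([word_emb(ctx_words).flatten(1), sent_emb(sent_id)], dim=1)
    l_word = loss_fn(word_out(h1), target_word)
    # Phase 2: context sentence vectors + document vector -> target sentence (L_sent)
    h2 = torch.cat([sent_emb(ctx_sents).flatten(1), doc_emb(doc_id)], dim=1)
    l_sent = loss_fn(sent_out(h2), target_sent)
    loss = l_word + l_sent          # L = L_word + L_sent
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One toy update on a batch of 4 randomly drawn training examples.
B = 4
loss = train_step(
    torch.randint(V, (B, 2 * cw)), torch.randint(S, (B,)), torch.randint(V, (B,)),
    torch.randint(S, (B, 2 * cs)), torch.randint(M, (B,)), torch.randint(S, (B,)),
)
```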

Evaluation
Datasets:
- Citation Network Dataset (CND): a sample of 8,000 research papers, each belonging to one of 8 computer science fields [Chakraborty et al.]
- Wiki10+: 19,740 Wikipedia pages, each belonging to one or more of 25 social tags [Zubiaga et al.]
Models (including baselines):
- Paragraph2Vec w/o WT [Le et al.]: Paragraph2Vec without word training
- Paragraph2Vec [Dai et al.]
- Doc2Sent2Vec w/o WT: our approach without word training
- Doc2Sent2Vec

Scientific Article Classification

Model                   F1
Paragraph2Vec w/o WT    0.1275
Paragraph2Vec           0.135
Doc2Sent2Vec w/o WT     0.1288
Doc2Sent2Vec            0.1513

Learning word vectors too is beneficial: ~6% and ~18% relative F1 increases for Paragraph2Vec and Doc2Sent2Vec, respectively. Doc2Sent2Vec beats the best baseline by ~12%.
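The quoted percentages are relative F1 improvements computed from the table: the first two are the word-training gains for Paragraph2Vec and Doc2Sent2Vec, the third is Doc2Sent2Vec over the best baseline (Paragraph2Vec):

```latex
\frac{0.135 - 0.1275}{0.1275} \approx 5.9\%, \qquad
\frac{0.1513 - 0.1288}{0.1288} \approx 17.5\%, \qquad
\frac{0.1513 - 0.135}{0.135} \approx 12.1\%
```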

Wikipedia Page Classification

Model                   F1
Paragraph2Vec w/o WT    0.0476
Paragraph2Vec           0.0445
Doc2Sent2Vec w/o WT     0.0401
Doc2Sent2Vec            0.0509

For Paragraph2Vec, learning word vectors too is not beneficial here, as shown by the ~7% decline in F1 (problem: semantically accurate word vectors get distorted during training). Doc2Sent2Vec is robust to this problem, as shown by the ~27% increase in F1. Doc2Sent2Vec beats the best baseline by ~7%.

Conclusion and Future Work
Proposed Doc2Sent2Vec, a novel approach to learn document embeddings in an unsupervised fashion; it beats the best baseline in two classification tasks.
Future work:
- Extend to a general multi-phase approach where every phase corresponds to a logical sub-division of a document, such as words, sentences, paragraphs, subsections, sections and the document itself.
- Consider the document sequence in a stream, such as news click-through streams [Djuric et al.].

References
Harris, Z.: Distributional Structure. Word, 10(2-3). (1954) 146-162
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. In: JMLR. (2003)
Le, Q., Mikolov, T.: Distributed Representations of Sentences and Documents. In: ICML. (2014) 1188-1196
Dai, A.M., Olah, C., Le, Q.V., Corrado, G.S.: Document Embedding with Paragraph Vectors. In: NIPS Deep Learning Workshop. (2014)
Chakraborty, T., Sikdar, S., Tammana, V., Ganguly, N., Mukherjee, A.: Computer Science Fields as Ground-truth Communities: Their Impact, Rise and Fall. In: ASONAM. (2013) 426-433
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. In: ICLR Workshop. (2013) vol. abs/1301.3781
Djuric, N., Wu, H., Radosavljevic, V., Grbovic, M., Bhamidipati, N.: Hierarchical Neural Language Models for Joint Representation of Streaming Documents and their Content. In: WWW. (2015) 248-255

References
Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A Neural Probabilistic Language Model. In: JMLR. (2003) 1137-1155
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural Language Processing (Almost) from Scratch. In: JMLR. (2011) 2493-2537
Morin, F., Bengio, Y.: Hierarchical Probabilistic Neural Network Language Model. In: AISTATS. (2005) 246-252
Pennington, J., Socher, R., Manning, C.D.: GloVe: Global Vectors for Word Representation. In: EMNLP. (2014) 1532-1543
Rumelhart, D., Hinton, G., Williams, R.: Learning Representations by Back-propagating Errors. In: Nature. (1986) 533-536
Dos Santos, C.N., Gatti, M.: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. In: COLING. (2014) 69-78
Zubiaga, A.: Enhancing Navigation on Wikipedia with Social Tags. In: arXiv. (2012) vol. abs/1202.5469