Doc2Sent2Vec: A Novel Two-Phase Approach for Learning Document Representation
Ganesh J¹, Manish Gupta¹,², Vasudeva Varma¹
¹IIIT Hyderabad, India; ²Microsoft, India

Problem: Learn low-dimensional, dense representations (embeddings) for documents. Document embeddings can be used off-the-shelf in many IR and DM applications, such as:
- Document classification
- Document retrieval
- Document ranking

Related Work:
Bag-of-words (BOW) or bag-of-n-grams:
- Data sparsity
- High dimensionality
- Does not capture word order
Latent Dirichlet Allocation (LDA):
- Computationally inefficient for larger datasets
Paragraph2Vec [1]:
- Dense, compact representation
- Captures word order
- Efficient to estimate

Main Idea: Should we learn the document embedding directly from word contexts, or can we learn it from sentence contexts instead? Doc2Sent2Vec explicitly exploits word-level and sentence-level coherence to learn sentence and document embeddings.

Notation:
- Document set: D = {d_1, d_2, …, d_M}, with M documents.
- Document: d_m = {s(m,1), s(m,2), …, s(m,T_m)}, with T_m sentences.
- Sentence: s(m,n) = {w(n,1), w(n,2), …, w(n,T_n)}, with T_n words.
- Word: w(n,t).

Phase 1: Learn Sentence Embeddings
Idea: Learn the sentence representation from the word sequence within the sentence.
Input features: the context words of the target word w(n,t), i.e. w(n,t-c_w), …, w(n,t-1), w(n,t+1), …, w(n,t+c_w), where c_w is the word context size, together with the target sentence s(m,n), where m is the document id.
Output: w(n,t).
Task: Predict the target word from the concatenation of the word vectors of the context words and the sentence vector.
Maximize the word likelihood:
L_word = P(w(n,t) | w(n,t-c_w), …, w(n,t-1), w(n,t+1), …, w(n,t+c_w), s(m,n))

Phase 2: Learn Document Embeddings
Idea: Learn the document representation from the sentence sequence within the document.
Input features: the context sentences of the target sentence s(m,t), i.e. s(m,t-c_s), …, s(m,t-1), s(m,t+1), …, s(m,t+c_s), where c_s is the sentence context size, together with the target document d_m.
Output: s(m,t).
Task: Predict the target sentence from the concatenation of the sentence vectors of the context sentences and the document vector.
Maximize the sentence likelihood:
L_sent = P(s(m,t) | s(m,t-c_s), …, s(m,t-1), s(m,t+1), …, s(m,t+c_s), d_m)
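The two phases above amount to two softmax prediction tasks that share the sentence vectors. The following is a minimal, illustrative PyTorch sketch of those two objectives, not the authors' implementation: the class and method names, the layer sizes, the use of a plain softmax (rather than any sampling-based speed-up), and the joint training of the two losses are all assumptions made for readability.

```python
# Minimal sketch of the two Doc2Sent2Vec objectives (not the authors' code).
# Assumptions: word/sentence/document ids are precomputed integers, context
# windows are fixed-size, and a plain softmax replaces any sampling speed-up.
import torch
import torch.nn as nn

class Doc2Sent2Vec(nn.Module):
    def __init__(self, n_words, n_sents, n_docs, dim, c_w, c_s):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)   # word vectors
        self.sent_emb = nn.Embedding(n_sents, dim)   # sentence vectors, shared by both phases
        self.doc_emb  = nn.Embedding(n_docs, dim)    # document vectors
        # Phase 1: concat(2*c_w word vectors, sentence vector) -> target word
        self.word_out = nn.Linear((2 * c_w + 1) * dim, n_words)
        # Phase 2: concat(2*c_s sentence vectors, document vector) -> target sentence
        self.sent_out = nn.Linear((2 * c_s + 1) * dim, n_sents)

    def phase1_loss(self, ctx_words, sent_id, target_word):
        # L_word: predict w(n,t) from its word context and the enclosing sentence s(m,n)
        ctx = self.word_emb(ctx_words).flatten(start_dim=1)      # (B, 2*c_w*dim)
        feat = torch.cat([ctx, self.sent_emb(sent_id)], dim=1)   # append sentence vector
        return nn.functional.cross_entropy(self.word_out(feat), target_word)

    def phase2_loss(self, ctx_sents, doc_id, target_sent):
        # L_sent: predict s(m,t) from its sentence context and the enclosing document d_m
        ctx = self.sent_emb(ctx_sents).flatten(start_dim=1)      # (B, 2*c_s*dim)
        feat = torch.cat([ctx, self.doc_emb(doc_id)], dim=1)     # append document vector
        return nn.functional.cross_entropy(self.sent_out(feat), target_sent)

# Toy usage with random ids: sum the two losses and update all embeddings jointly.
model = Doc2Sent2Vec(n_words=5000, n_sents=800, n_docs=100, dim=64, c_w=2, c_s=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ctx_w = torch.randint(0, 5000, (32, 4))    # 2*c_w context words per example
sent  = torch.randint(0, 800, (32,))
tgt_w = torch.randint(0, 5000, (32,))
ctx_s = torch.randint(0, 800, (32, 2))     # 2*c_s context sentences per example
doc   = torch.randint(0, 100, (32,))
tgt_s = torch.randint(0, 800, (32,))
loss = model.phase1_loss(ctx_w, sent, tgt_w) + model.phase2_loss(ctx_s, doc, tgt_s)
opt.zero_grad(); loss.backward(); opt.step()
```

Concatenating the context vectors (rather than averaging them) mirrors the "concatenation of word vectors along with the sentence vector" wording above; whether the two phases are trained jointly or alternately is a design choice not fixed by this sketch.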
[Figure: Doc2Sent2Vec architecture diagram.]

Evaluation: Tasks and Datasets
- Scientific article classification: Citation Network Dataset (CND) [3]
- Wikipedia page classification: Wiki10+ [4]
Document classification performance is measured as F1 on both datasets for four models: Paragraph2Vec w/o WT [1], Paragraph2Vec [2], Doc2Sent2Vec w/o WT, and Doc2Sent2Vec.

Analysis:
- Learning word vectors is beneficial on CND.
- Learning word vectors with Paragraph2Vec is not beneficial on Wiki10+ (distortion of word vectors); Doc2Sent2Vec is robust to this problem.
- Doc2Sent2Vec beats the best baseline on both tasks.

Summary: Doc2Sent2Vec shows that learning the document representation in multiple phases leads to high-quality document embeddings. Future work: consider additional levels such as paragraphs, subsections, and sections, and consider document sequences in a stream, such as news click-through streams.

References:
1. Le, Q., Mikolov, T.: Distributed Representations of Sentences and Documents. In: ICML (2014).
2. Dai, A.M., Olah, C., Le, Q.V., Corrado, G.S.: Document Embedding with Paragraph Vectors. In: NIPS Deep Learning Workshop (2014).
3. Chakraborty, T., Sikdar, S., Tammana, V., Ganguly, N., Mukherjee, A.: Computer Science Fields as Ground-truth Communities: Their Impact, Rise and Fall. In: ASONAM (2013).
4. Zubiaga, A.: Enhancing Navigation on Wikipedia with Social Tags. arXiv (2012).

Acknowledgements: This work is supported by the SIGIR Donald B. Crouch Grant. Thanks to NVIDIA for donating a Tesla K40 graphics card.