Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment. Xinyang Jiang, Fei Wu, Xi Li, Zhou Zhao, Weiming Lu, Siliang Tang, Yueting Zhuang. Zhejiang University.

Table of Contents: Introduction, Algorithm, Experiments, Conclusion

Introduction - Cross-modal Learning to Rank
Learn a model to sort the retrieved documents according to their relevance to a given query, where the documents and the queries come from different modalities.
Three stages of cross-modal learning to rank:
1. Learn a common representation for multi-modal data.
2. Evaluate the relevance between documents and queries in the learned common space.
3. Devise a ranking function that generates a ranking list preserving the order of relevance among the multi-modal data.
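As an illustration only (not the paper's actual pipeline or API), a minimal Python sketch of how these three stages fit together, assuming stage 1 has already produced common-space embeddings:

```python
import numpy as np

def cosine(u, v):
    # Stage 2: relevance as cosine similarity in the learned common space
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def rank_documents(query_emb, doc_embs):
    # Stage 3: return document indices sorted by decreasing relevance to the query
    scores = [cosine(query_emb, d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)

# Hypothetical usage: a query embedding scored against three document embeddings
# (stage 1, learning these embeddings, is what C2MLR itself addresses)
query = np.random.rand(128)
docs = [np.random.rand(128) for _ in range(3)]
print(rank_documents(query, docs))
```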

Introduction - Two Types of Representations
It is more attractive to exploit both the local structure and the global structure in pairs of images and text, rather than either alone, to learn a better common space for multi-modal data. This paper proposes to learn two types of common spaces that embed the collaboratively grounded semantics of the local structure and the global structure, respectively.
Local Alignment: the semantics of visual objects and textual words in the local structure are embedded into a local common space, where visual objects and their relevant textual words are aligned with each other.
Global Alignment: the image-level and text-level compositional semantics are embedded into a global common space, where each image and its relevant text are aligned with each other.

Introduction - The Proposed Method (C2MLR)
Common Representations:
1. Jointly uses both local and global alignment.
2. Generates the compositional semantics embeddings of an image and a text from the isolated semantics embeddings of the visual objects in the image and the textual words in the text.
3. Predicts the ranking list by evaluating the relevance between an image and a text based on both local and global alignment.
Evaluate Relevance: the relevance between an image and a text is evaluated by computing their embedding similarity in both the local and the global common spaces.
Ranking Function: the two types of common representations are jointly learned under a large-margin pair-wise learning-to-rank framework.

Introduction - The Proposed Method (C2MLR)

Algorithm - Local Alignment for Objects and Words
Given a feature vector r extracted from an object region, the visual object inside the region is mapped into a local common space by a non-linear projection, where r is a d_r-dimensional feature vector, W_I is a d × d_r matrix that maps the visual object into the d-dimensional local common space, and b_I is the bias vector.
For a given grammatical class p of words (e.g., a part of speech), a mapping matrix is learned to embed all words belonging to class p into the local common space, where W_p is a d × d_w matrix that maps each d_w-dimensional word vector w into a d-dimensional vector, and b_p is the bias vector.
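The slide's equation images are not reproduced in this transcript; a plausible reconstruction from the definitions above (the specific non-linearity f is an assumption) is:

    v_r = f(W_I r + b_I),        s_w = f(W_p w + b_p)   for each word w of grammatical class p,

where v_r and s_w denote the d-dimensional local-common-space embeddings of a visual object and a textual word.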

Algorithm - Global Alignment for Image and Text
The compositional semantics of the image I is encoded by a compositional semantics embedding matrix W_c, where W_c is a d_c × d matrix that maps the image I into the d_c-dimensional global common space, and the operator |·| denotes the cardinality of a set. The compositional semantics of the text S is encoded in the same way.
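The corresponding equations are likewise missing from the transcript; a hedged reconstruction, assuming the compositional embedding averages the local embeddings of the objects in I (and of the words in S) before applying the compositional matrix, and assuming the text side has its own matrix W_c' (the slide does not say whether W_c is shared), is:

    x_I = f( W_c \cdot \frac{1}{|R_I|} \sum_{r \in R_I} v_r ),        x_S = f( W_c' \cdot \frac{1}{|S|} \sum_{w \in S} s_w ),

where R_I is the set of object regions in image I; the |·| operator mentioned above corresponds to these averaging factors.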

Algorithm - Relevance Scores
The relevance score based on local alignment: given a pair of an image I and a text S, the relevance score in terms of their local structure is obtained by (1) aligning each visual object r in I with the textual word w whose semantic embedding has the highest cosine similarity with r, and (2) summing the cosine similarities of all aligned object-word pairs as the local relevance score.
The relevance score based on global alignment: the embedding similarity between I and S in the global common space.
Overall relevance: the local and global relevance scores are combined into the final relevance used for ranking.
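A sketch of the scores just described, with the caveat that the exact way the local and global scores are combined (plain sum versus weighted sum) is an assumption:

    S_{local}(I,S) = \sum_{r \in R_I} \max_{w \in S} \cos(v_r, s_w),    S_{global}(I,S) = \cos(x_I, x_S),    S(I,S) = S_{local}(I,S) + S_{global}(I,S)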

Algorithm - Parameter Estimation
The parameters of the model are learned under a max-margin learning-to-rank framework, where W denotes the two sets of model parameters for the local alignment and the global alignment.
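A hedged sketch of the objective, assuming a standard pairwise hinge loss over triplets of a query image I with a relevant text S^+ and an irrelevant text S^- (the margin \Delta and the regularization term are assumptions not stated on the slide):

    \min_{W} \sum_{(I, S^+, S^-)} \max( 0, \Delta - S(I, S^+) + S(I, S^-) ) + \lambda \|W\|_F^2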

Experiments - Datasets
Tags in Pascal07 are more likely to express high-level concepts, rather than directly describing specific visual objects in the images as the sentences in Flickr8K do. To adapt to the different natures of the two datasets, an algorithm that uses both local and global alignment is needed.

Experiments - Performance Comparison

Experiments - Observations
Observations on Pascal07: C2MLR outperforms the other methods in both search directions in terms of most of the performance metrics. The methods utilizing global alignment (C2MLR, DeViSE) achieve better performance on Pascal07, which verifies that global alignment is better at finding the relevance between an image and a text based on high-level concepts.
Observations on Flickr8K: C2MLR outperforms the other methods in both search directions in terms of all the performance metrics. The ranking methods adopting local alignment (DeepFE) achieve good performance on Flickr8K, which verifies that local alignment is better at finding the relevance between an image and a text based on an explicit relevance between objects and words.
Observations on both datasets: by combining global and local alignment, C2MLR achieves good performance on both datasets, which validates that local alignment or global alignment alone does not suffice. C2MLR- outperforms the other global-alignment ranking methods (PAMIR and DeepRank) most of the time, which verifies that a compositional semantics embedding built from visual objects and textual words achieves better performance than a traditional embedding.

Experiments
Examples of ranking textual documents with an image query using C2MLR, as well as the discovered relevant object-word pairs.

Experiments
Examples of two types of object-word alignments (nouns and adjectives) discovered by C2MLR.

Conclusions
This paper proposes a new method for cross-modal ranking called C2MLR. The proposed method uses local alignment to embed visual objects and textual words into a local common space, and employs global alignment to map images and text into a global common space. The global common space is learned by obtaining the image-level and sentence-level compositional semantics embeddings of the multi-modal data from the visual objects and textual words. The local common space, the compositional semantics, and the global common space are learned jointly in a max-margin learning-to-rank manner. Experiments show that using both local and global alignment for cross-modal ranking boosts ranking performance.

Thank you! Q & A