Delving Deep into Personal Photo and Video Search

Presentation transcript:

Delving Deep into Personal Photo and Video Search Lu Jiang, Yannis Kalantidis, Liangliang Cao, Sachin Farfade, Jiliang Tang, Alexander G. Hauptmann Yahoo, CMU and MSU

Outline Introduction Flickr Data Statistical Characteristics Deep Query Understanding Experiments Conclusions This is the outline of today’s talk.

Outline Introduction Flickr Data Statistical Characteristics Deep Query Understanding Experiments Conclusions First, let’s look at the introduction.

Introduction Personal photo and video data are being accumulated at an unprecedented speed. Because of the sentimental attachment, these photos and videos are probably more valuable to us than popular videos on YouTube. However, 80% of the photos and videos have no textual metadata, so it often takes many swipes to find a particular photo or video on a cell phone. 13.9 petabytes of personal media were uploaded to Google Photos by 200M users in just one year. More than 80% of personal media have no user tags.

Introduction Personal media: personal photos and videos. Personal media search is a novel and challenging problem: what are the differences when users search their own photos and videos versus the ones on the web? Can we use our findings to improve personal media search? We conduct our research on large-scale real-world Flickr search logs. We use the term personal media to refer to personal photos and videos. Personal media search is a novel problem, and in this talk we seek answers to two questions: what are the differences when users search their own photos and videos versus the ones on the web? Can we use our findings to improve personal media search? We will try to answer these questions based on our experiments on large-scale Flickr search log data.

Outline Introduction Flickr Data Statistical Characteristics Deep Query Understanding Experiments Conclusions Let’s look at our experimental dataset.

Data Large-scale real-world search log data on Flickr. Three types of search: personal (queries searching users’ own photos), social (searching their friends’ photos), and web (searching anyone’s photos on the entire public Flickr). We sampled about 4 million search logs from Flickr in 2015. There are three types of searches in the logs: personal, searching users’ own photos; social, searching their friends’ photos; and web, searching anyone’s photos on the entire public Flickr. We are interested in the personal category and include the other categories for comparison.

Data Concepts are automatically detected objects, people, scenes, actions, etc. from images or videos. Example concepts: hands, lettuce, kitchen, notebook (a false positive). For each photo and video, we extract about 5,000 concepts. Worth mentioning that the concepts are automatically detected, so their accuracy is limited. For example, in this photo we correctly detected the concepts hands and lettuce, but we also have a false-positive concept, notebook. The precision and recall of detected concepts are limited. As most photos have no metadata, the concepts are one of the few options available for searching personal media.
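As a rough illustration of this concept-extraction step, here is a minimal sketch, assuming a multi-label classifier `model` whose outputs align with a list of ~5,000 concept names `vocab` (both are hypothetical stand-ins; the paper does not specify its detectors):

```python
# Hedged sketch of per-photo concept extraction. `model` and `vocab` are
# assumed inputs: a trained multi-label classifier and its concept names.
import torch
import torchvision.transforms as T
from PIL import Image

def extract_concepts(model, vocab, image_path, threshold=0.5):
    """Return (concept, score) pairs whose sigmoid score exceeds `threshold`."""
    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        scores = torch.sigmoid(model(x)).squeeze(0)      # shape: [len(vocab)]
    keep = (scores > threshold).nonzero(as_tuple=True)[0].tolist()
    return sorted(((vocab[i], scores[i].item()) for i in keep),
                  key=lambda pair: -pair[1])
```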

Outline Introduction Flickr Data Statistical Characteristics Deep Query Understanding Experiments Conclusions Let’s look at our discovered characteristics about personal media on the Flickr dataset.

Observation I Personal queries are more “visual”: 85.3% of personal queries are visual, versus 60.9% of social queries and 70.4% of web queries. Visual: snow, flower, lake. Non-visual: 2014, NYC, social media. First, personal media queries contain more visual words. Visual words are words that can be detected from the media content. For example, snow, flower, and lake are visual words, but 2014, NYC, and social media are not. To detect a visual word, we first map a query word to its synset in WordNet, and then check whether the synset is in one of the largest visual concept vocabularies, such as ImageNet and LabelMe.
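A minimal sketch of this visual-word test, assuming a file of visual WordNet IDs (e.g. an ImageNet synset list, one ID such as "n02084071" per line) is available locally; the file name and the noun-only restriction are illustrative assumptions:

```python
# Sketch of the visual-word test: map a query word to its WordNet synsets and
# check membership in a visual vocabulary (e.g. the ImageNet synset list).
from nltk.corpus import wordnet as wn

def load_visual_vocab(path="imagenet_synsets.txt"):
    # Assumed input file: one WordNet ID per line, e.g. "n02084071".
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def is_visual(word, visual_vocab):
    """A word counts as 'visual' if any of its synsets is in the vocabulary."""
    for s in wn.synsets(word, pos=wn.NOUN):
        wnid = f"{s.pos()}{s.offset():08d}"   # ImageNet-style WordNet ID
        if wnid in visual_vocab:
            return True
    return False

# e.g. is_visual("snow", vocab) would return True, is_visual("2014", vocab) False.
```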

Observation II The majority of personal media have no user tags, and the percentage with tags is decreasing. Second, about 80% of personal media do not have tags. In this graph, the solid green curve represents the percentage. As we can see, only 15% of personal media uploaded in 2015 have tags. The percentage of personal media with tags is also decreasing over time. Interestingly, the percentage of personal media with GPS is increasing steadily.

Observation III Users are interested in “4W queries” in their personal media: what (object, thing, action, plant, etc.), who (person, animal), where (scene, country, city, GPS), and when (year, month, holiday, date, etc.). Third, personal queries are mainly about the “4W”: what, who, where, and when. The figure shows the distribution of the 4W in different types of queries; the left-most bars represent the 4W in personal queries.

Observation IV Personal search sessions are shorter. The median clicked position is 2, so getting the top-2 personal photos correct is very important. Fourth, personal search sessions are shorter. The graph compares personal and web search sessions in terms of their median clicked positions. The take-home message is that personal sessions are much shorter and the key is to get the top-2 results correct.

Observation V There is a big gap between personal queries and automatically detected concepts. Users might search millions of topics, but the system can detect only a few thousand concepts. Last but not least, we found a big gap between personal queries and automatically detected concepts, in other words, a gap between what users intend to search and what can be detected by the system. For example, a user would like to find “making a sandwich” photos, but we do not have a “making a sandwich” concept. The gap significantly impairs the quality of personal media search and is one of the must-solve problems in personal media search.

Outline Introduction Flickr Data Statistical Characteristics Deep Query Understanding Experiments Conclusions In order to reduce the gap, we propose deep query understanding methods.

Problem To reduce the gap, we can either increase the number of concepts → non-trivial (requires more labeled data), or better understand the query → the focus of this paper. Query understanding: how do we map out-of-vocabulary query words to the concepts in our vocabulary? Generally, we can reduce the gap by increasing the number of concepts so that we can detect more things, or by understanding the query better so that we can map queries to our existing concepts more accurately. This talk focuses on the second direction. For example, suppose we do not have a “making a sandwich” detector; we can still map the query to concepts such as food, bread, and cheese, all of which exist in our concept vocabulary. Example from the slide: user query “making a sandwich” → generated query “food, bread, cheese, kitchen, cooking, room, lunch, dinner” (all concepts are in our vocabulary).

Query Understanding Example: user query “making a sandwich” → generated query “food, bread, cheese, kitchen, cooking, room, lunch, dinner”. Query understanding is a challenging problem. Existing methods include: exact word matching; WordNet similarity [Miller, 1995], based on structural depths in the WordNet taxonomy; and word embedding mapping [Mikolov et al., 2013], based on word distances in an embedding space learned from Wikipedia by the skip-gram model (word2vec). Query understanding boils down to measuring the similarity between query words and concepts; current methods measure this similarity with WordNet similarity or word2vec embedding similarity. G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
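The two baseline similarities above can be sketched with off-the-shelf tools. This is only an illustration, assuming the GoogleNews word2vec vectors are available as a local file and using NLTK's path similarity as the WordNet measure:

```python
# Baseline query-word-to-concept similarities: WordNet path similarity (NLTK)
# and word2vec cosine similarity (gensim). The vectors file is an assumed input.
from nltk.corpus import wordnet as wn
from gensim.models import KeyedVectors

def wordnet_similarity(query_word, concept):
    """Max path similarity over all synset pairs of the two words (0 if none)."""
    pairs = [(q, c) for q in wn.synsets(query_word) for c in wn.synsets(concept)]
    scores = [q.path_similarity(c) for q, c in pairs]
    return max((s for s in scores if s is not None), default=0.0)

word2vec = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # assumed local file

def embedding_similarity(query_word, concept):
    """Cosine similarity in the pre-trained word2vec space (0 if out of vocabulary)."""
    if query_word in word2vec and concept in word2vec:
        return float(word2vec.similarity(query_word, concept))
    return 0.0
```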

Deep Visual Query Embedding Our solution: learning a deep visual query embedding mined from the Flickr search logs. Example: user query “making a sandwich” → concepts in the clicked photos: bread, vegetable, kitchen, cooking, ham. In this paper, we propose a simple approach to directly map a query to relevant concepts using data mined from the personal search logs. The idea is simple: when a user issues a query to Flickr, the system retrieves several photos, and the user might click some of them; we assume the clicked photos have some relevance to the query. We then extract the concepts from the clicked photos and train a model to map a query to the clicked concepts. For example, we map the query “making a sandwich” to clicked concepts such as bread, vegetable, kitchen, and cooking. The table shows more examples; for instance, the query bluebell is mapped to “purple” and “flower”, as bluebells are purplish flowers. Training data: mined from Flickr search logs.
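Here is a hedged sketch of how the (query → clicked-photo concepts) training pairs might be assembled from the logs; `search_log` and `photo_concepts` are hypothetical data structures, not the paper's actual schema:

```python
# Assemble query -> clicked-concept training pairs from click logs (sketch).
from collections import defaultdict

def build_training_pairs(search_log, photo_concepts):
    """search_log: iterable of (query, clicked_photo_id) pairs.
    photo_concepts: dict photo_id -> set of detected concept names.
    Returns dict query -> set of concepts seen in that query's clicked photos."""
    pairs = defaultdict(set)
    for query, photo_id in search_log:
        pairs[query.lower()].update(photo_concepts.get(photo_id, set()))
    return pairs

# e.g. pairs["making a sandwich"] might contain
# {"bread", "vegetable", "kitchen", "cooking", "ham"}
```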

Max-Pooled MLP To learn the query-to-concept mapping, we propose two deep embedding models. The first is the max-pooled MLP. The input is the query words along with their part-of-speech (POS) tags, and the model outputs the relevant concepts. For example, the query “birthday party” is mapped to relevant concepts such as cake, presents, kids, and candles. The query words are first fed into an embedding layer, then into a max-pooling layer, followed by a number of fully connected layers.
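A minimal PyTorch sketch of a max-pooled MLP of this kind is shown below; the layer sizes are illustrative assumptions, and the POS-tag input is omitted for brevity:

```python
# Sketch of a max-pooled MLP: embed query words, max-pool over the words,
# then fully connected layers producing one logit per concept.
import torch
import torch.nn as nn

class MaxPooledMLP(nn.Module):
    def __init__(self, vocab_size, num_concepts, emb_dim=300, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_concepts),     # one logit per concept
        )

    def forward(self, token_ids):                # token_ids: [batch, query_len]
        emb = self.embed(token_ids)              # [batch, query_len, emb_dim]
        pooled, _ = emb.max(dim=1)               # max-pool over query words
        return self.mlp(pooled)                  # concept logits

# Usage: logits = MaxPooledMLP(50000, 5000)(torch.tensor([[12, 85, 0, 0]]))
```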

Two-channel RNN The max-pooled MLP does not consider the query word sequence, so “birthday party” is treated the same as “party birthday”. In our second model, we employ two LSTM layers to capture the query word sequence. The query words and their POS tags are sequentially fed into the model through two channels. The hidden LSTM states are pooled together and then sent to a number of fully connected layers.
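A corresponding sketch of the two-channel idea, with one LSTM over the query words and one over their POS tags; again, the dimensions are illustrative assumptions:

```python
# Sketch of a two-channel RNN: separate LSTMs over word and POS sequences,
# hidden states max-pooled, concatenated, then fully connected layers.
import torch
import torch.nn as nn

class TwoChannelRNN(nn.Module):
    def __init__(self, word_vocab, pos_vocab, num_concepts,
                 emb_dim=300, pos_dim=32, hidden=256):
        super().__init__()
        self.word_embed = nn.Embedding(word_vocab, emb_dim, padding_idx=0)
        self.pos_embed = nn.Embedding(pos_vocab, pos_dim, padding_idx=0)
        self.word_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.pos_lstm = nn.LSTM(pos_dim, hidden, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_concepts),
        )

    def forward(self, word_ids, pos_ids):        # both: [batch, query_len]
        w_out, _ = self.word_lstm(self.word_embed(word_ids))
        p_out, _ = self.pos_lstm(self.pos_embed(pos_ids))
        pooled = torch.cat([w_out.max(dim=1).values,
                            p_out.max(dim=1).values], dim=-1)
        return self.fc(pooled)                   # concept logits
```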

Outline Introduction Flickr Data Statistical Characteristics Deep Query Understanding Experiments Conclusions Let’s look at our experimental results.

Experimental Setups Task: search 148,000 personal photos by concepts. We discard all textual metadata. Goal: rank the user-clicked photos closer to the top. Training: 20,600 queries from 3,978 users. Test: 2,443 queries from 1,620 users. Evaluated by the mean average precision (mAP) and the concept recall at k (CR@k). mAP measures how well the clicked photos are ranked in the results; CR@k measures how accurate the top-k predicted concepts are. Our task is to search 148,000 personal photos by concepts. Since in real-world scenarios most personal media have no tags, we discard all textual metadata in our data. We extract the clicked photos from the search logs, and our goal is to rank the user-clicked photos closer to the top. We train our models on about 20K queries from about 4K users and test on about 2.4K queries from 1.6K users. We evaluate the results by mean average precision (mAP) and concept recall at k: mAP captures how well the clicked photos are ranked, and CR@k captures how accurate the mapped concepts are.
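Hedged sketches of the two metrics as described above (the paper's exact definitions may differ in detail): average precision over a ranked photo list with the clicked photos as the relevant set, and concept recall at k over the predicted concept list.

```python
# Average precision for one query: clicked photos are treated as relevant.
def average_precision(ranked_photo_ids, clicked_ids):
    hits, precision_sum = 0, 0.0
    for rank, pid in enumerate(ranked_photo_ids, start=1):
        if pid in clicked_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(clicked_ids), 1)

# Concept recall at k: fraction of relevant concepts recovered in the top-k predictions.
def concept_recall_at_k(predicted_concepts, relevant_concepts, k):
    top_k = set(predicted_concepts[:k])
    return len(top_k & set(relevant_concepts)) / max(len(relevant_concepts), 1)

# mAP and CR@k are then the means of these values over all test queries.
```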

Results mAP is pretty low → a challenging problem. We compare the proposed methods with a number of baselines, including the skip-gram model using word2vec embeddings trained on GoogleNews. The proposed methods are VQE (MaxMLP) and VQE (RNN). As we can see, the proposed VQE (MaxMLP) significantly outperforms the other methods (a 45% relative improvement), which suggests that learning a deep embedding over the search logs helps. However, the best mAP is still only 0.39, suggesting the problem is very challenging. Comparing the RNN and MaxMLP models, we found that considering the word sequence might not help: the RNN converges much more slowly, and when we stopped training after a day it still yielded much worse results than the MaxMLP model. Take-aways: mAP is pretty low → a challenging problem; learning the deep embedding over the search logs helps; considering the word sequence might not help; the RNN model converges much more slowly and thus gets worse performance.

Examples of top 2 results The figures show two representative queries: paintball and Key West (an island in Florida). The ranked list on the left is produced by our model and the one on the right by the baseline word2vec model. Photos with a green border were clicked by the user. As we can see, in both cases our method finds more meaningful results.

Empirical Observations Deeper models are better. Max pooling is better than average pooling. The softmax loss yields the best results, probably because the concepts in the clicked photos are sparse. We conducted more experiments to understand the behavior of our models. In summary: deeper models lead to more accurate results even with fewer parameters; max pooling works better than average pooling in our best model; and among the three losses we compared (softmax cross-entropy, cross-entropy, and cosine distance), the softmax loss yields the best results, probably because the concepts in the clicked photos are sparse.
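For concreteness, here is one possible reading of the three losses in PyTorch, for concept logits `z` and a float 0/1 target vector `y` over the concept vocabulary; interpreting the plain "cross-entropy" as per-concept binary cross-entropy is an assumption, and the paper's exact formulations may differ:

```python
# Three candidate training losses over concept logits z and 0/1 targets y (sketch).
import torch
import torch.nn.functional as F

def softmax_cross_entropy(z, y):
    """Softmax over concepts, with the normalized clicked concepts as the target."""
    target = y / y.sum(dim=-1, keepdim=True).clamp(min=1)
    return -(target * F.log_softmax(z, dim=-1)).sum(dim=-1).mean()

def binary_cross_entropy(z, y):
    """Independent per-concept binary cross-entropy (one reading of 'cross-entropy')."""
    return F.binary_cross_entropy_with_logits(z, y)

def cosine_distance(z, y):
    """1 - cosine similarity between predicted concept scores and the target vector."""
    return (1 - F.cosine_similarity(torch.sigmoid(z), y, dim=-1)).mean()
```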

Outline Introduction Flickr Data Statistical Characteristics Deep Query Understanding Experiments Conclusions Here are the conclusions.

Take home messages Personal media search: personal query sessions are shorter, and queries are more “visual”. Users are interested in “4W queries”. 80% of personal media have no textual metadata, and the percentage with metadata is decreasing. Utilizing deep learning for query understanding is promising for improving personal media search. We believe personal media search is a novel and challenging problem which needs further research. There are three take-home messages from today’s talk. First, we found several characteristics of personal media search: personal query sessions are shorter, and queries are more “visual”; users are interested in “4W queries”, i.e. what, who, where, and when; 80% of personal media have no textual metadata, and the percentage with metadata is decreasing. Second, we showed that utilizing deep learning for query understanding is promising for improving personal media search. Finally, we believe personal media search is a novel and challenging problem which needs further research.

Thank You. Questions to: lujiang@cs.cmu.edu. Check out our application: search MemoryQA on YouTube.