Delving Deep into Personal Photo and Video Search



Presentation transcript:

1 Delving Deep into Personal Photo and Video Search
Lu Jiang, Yannis Kalantidis, Liangliang Cao, Sachin Farfade, Jiliang Tang, Alexander G. Hauptmann (Yahoo, CMU, and MSU)

2 Outline: Introduction, Flickr Data, Statistical Characteristics, Deep Query Understanding, Experiments, Conclusions. This is the outline of today's talk.

3 Outline: Introduction, Flickr Data, Statistical Characteristics, Deep Query Understanding, Experiments, Conclusions. First, let's look at the introduction.

4 Introduction (slide figure: reality vs. "zootopia")
Personal photo and video data are being accumulated at an unprecedented speed. Because of their sentimental value, these photos and videos are probably more valuable to us than some popular videos on YouTube. However, more than 80% of personal photos and videos have no textual metadata, so it often takes many swipes to find a particular photo or video on our phones. 13.9 petabytes of personal media were uploaded to Google Photos by 200M users in just one year, and more than 80% of personal media have no user tags.

5 Introduction. We use the term personal media to refer to personal photos and videos.
Personal media search is a novel and challenging problem, and in this talk we seek answers to two questions: what are the differences when users search their own photos and videos versus the ones on the web, and can we use our findings to improve personal media search? We will try to answer these questions based on our experiments on large-scale, real-world Flickr search logs.

6 Outline: Introduction, Flickr Data, Statistical Characteristics, Deep Query Understanding, Experiments, Conclusions. Let's look at our experimental data set.

7 Data. Large-scale real-world search log data on Flickr, with three types of search:
personal: queries searching users' own photos
social: queries searching their friends' photos
web: queries searching anyone's photos on the entire public Flickr
We sampled about 4 million search logs from Flickr in 2015. We are primarily interested in the personal category and include the other two categories for comparison.

8 Data. Concepts are automatically detected objects, people, scenes, actions, etc. from the content of images and videos. For each photo and video, we extract about 5,000 concepts. Because the concepts are detected automatically, their accuracy is limited: in the example photo, we correctly detect the concepts hands, lettuce, and kitchen, but we also get a false-positive concept, notebook. Since most photos have no metadata, these detected concepts are one of the few signals available for searching personal media, even though their precision and recall are limited.
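The paper's 5,000-concept detectors are not spelled out here, so as a rough illustration of what "automatically detected concepts" means, the sketch below tags a photo with the top predictions of an off-the-shelf ImageNet classifier. The classifier and the `extract_concepts` helper are stand-ins for illustration only, not the system used in the paper.

```python
# Minimal sketch of automatic concept tagging with an off-the-shelf classifier.
# The paper's ~5000-concept detectors are not public; ResNet-50 with ImageNet
# labels stands in here purely for illustration.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()
labels = weights.meta["categories"]

def extract_concepts(image_path, top_k=10):
    """Return the top-k predicted concept labels with their probabilities."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(image), dim=1).squeeze(0)
    scores, indices = probs.topk(top_k)
    return [(labels[i], float(s)) for i, s in zip(indices, scores)]

# e.g. extract_concepts("kitchen_photo.jpg") might return labels like
# "notebook" alongside correct ones -- the false positives the slide mentions.
```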

9 Outline: Introduction, Flickr Data, Statistical Characteristics, Deep Query Understanding, Experiments, Conclusions. Let's look at the characteristics of personal media search that we discovered on the Flickr dataset.

10 Observation I: personal queries are more "visual".
Fraction of visual queries: Personal 85.3%, Social 60.9%, Web 70.4%. Visual: snow, flower, lake. Non-visual: 2014, NYC, social media. First, personal media queries contain more visual words. Visual words are words that can be detected from the media content: for example, snow, flower, and lake are visual words, but 2014, NYC, and social media are not. To detect a visual word, we first map a query word to its synsets in WordNet and then check whether any synset appears in the largest visual concept vocabularies, such as ImageNet and LabelMe. (Slide diagram: query words → WordNet synsets → visual concept vocabularies.)
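A minimal sketch of this visual-word test, assuming a tiny placeholder set of visual synset names (the real vocabularies are built from ImageNet and LabelMe synsets) and NLTK's WordNet interface:

```python
# Sketch of the visual-word test: a query word counts as "visual" if one of its
# WordNet synsets appears in a visual concept vocabulary.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# In the paper this set comes from large visual vocabularies; here it is only
# a tiny illustrative sample of synset names.
VISUAL_SYNSETS = {"snow.n.01", "flower.n.01", "lake.n.01"}

def is_visual(word):
    """True if any WordNet synset of `word` is in the visual vocabulary."""
    return any(s.name() in VISUAL_SYNSETS for s in wn.synsets(word))

for w in ["snow", "flower", "2014", "NYC"]:
    print(w, "->", "visual" if is_visual(w) else "non-visual")
```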

11 Observation II: the majority of personal media have no user tags, and the percentage with tags is decreasing. Second, about 80% of personal media have no tags. In this graph, the solid green curve represents the percentage of personal media with tags: only 15% of personal media uploaded in 2015 have tags, and the percentage keeps decreasing over time. Interestingly, the percentage of personal media with GPS information is increasing steadily.

12 Observation III: users are interested in "4W queries" over their personal media: what (object, thing, action, plant, etc.), who (person, animal), where (scene, country, city, GPS), and when (year, month, holiday, date, etc.). Third, personal queries are mainly about the 4W: what, who, where, and when. The figure shows the distribution of the 4W across the different types of queries; the left-most bars represent the 4W in personal queries.

13 Observation IV: personal search sessions are shorter. The median clicked position is 2, so getting the top 2 personal photos right is very important. Fourth, personal search sessions are shorter. The graph compares personal and web search sessions in terms of their median clicked positions. The take-home message is that personal sessions are much shorter, and the key is to get the top 2 results correct.

14 Observation V: there is a big gap between personal queries and automatically detected concepts. Users might search for millions of topics, but the system can detect only a few thousand concepts. Last but not least, we found a big gap between personal queries and automatically detected concepts; in other words, a gap between what users intend to search for (the user's information need) and what can be detected by the system. For example, a user would like to find "making a sandwich" photos, but we do not have a "making a sandwich" concept. The gap significantly impairs the quality of personal media search and is one of the must-solve problems in this area.

15 Outline: Introduction, Flickr Data, Statistical Characteristics, Deep Query Understanding, Experiments, Conclusions. To reduce this gap, we propose deep query understanding methods.

16 Problem. To reduce the gap, we can either:
increase the number of concepts → non-trivial (requires more labeled data), or
better understand the query → the focus of this paper.
Query understanding: how do we map out-of-vocabulary query words to the concepts in our vocabulary? This talk focuses on the second direction. For example, suppose we do not have a "making a sandwich" detector; we can still map the user query "Making a sandwich" to a generated query of concepts such as food, bread, cheese, kitchen, cooking, room, lunch, and dinner, all of which exist in our concept vocabulary.

17 Query Understanding. Query understanding is a challenging problem. Existing methods include:
Exact word matching.
WordNet similarity [Miller, 1995]: structural depths in the WordNet taxonomy.
Word embedding mapping [Mikolov et al., 2013]: word distance in an embedding space learned from Wikipedia by the skip-gram model (word2vec).
Query understanding boils down to measuring the similarity between query words and concepts; current methods measure this similarity via WordNet similarity or word2vec embedding similarity.
G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41, 1995.
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. NIPS 2013.
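For concreteness, here is a rough sketch of the word-embedding baseline: rank a concept list by cosine similarity to the averaged query vector in a pretrained word2vec space. The vectors file path and the small concept list are illustrative assumptions, not the paper's setup.

```python
# Sketch of the word-embedding baseline: rank vocabulary concepts by cosine
# similarity to the query in a pretrained skip-gram (word2vec) space.
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path to pretrained GoogleNews vectors.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Toy concept vocabulary for illustration; the real one has ~5000 concepts.
CONCEPTS = ["food", "bread", "cheese", "kitchen", "cooking", "beach", "dog"]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def map_query_to_concepts(query, top_k=5):
    """Average the word vectors of the query, then rank concepts by cosine similarity."""
    words = [w for w in query.lower().split() if w in kv]
    if not words:
        return []
    q = np.mean([kv[w] for w in words], axis=0)
    scored = [(c, cosine(q, kv[c])) for c in CONCEPTS if c in kv]
    return sorted(scored, key=lambda x: -x[1])[:top_k]

print(map_query_to_concepts("making a sandwich"))
```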

18 Deep Visual Query Embedding
Our solution: learn a deep visual query embedding from data mined from the Flickr search logs. In this paper, we propose a simple approach that directly maps a query to relevant concepts using data mined from the personal search logs. The idea is simple: when a user issues a query to Flickr, the system retrieves several photos; the user might click some of them, and we assume the clicked photos have some relevance to the query. We then extract the concepts from the clicked photos and train a model to map the query to the clicked concepts. For example, we map the query "making a sandwich" to clicked concepts such as bread, vegetable, kitchen, cooking, and ham. The table shows more examples: the query bluebell, for instance, is mapped to "purple" and "flower", since bluebells are purplish flowers. Training data: mined from Flickr search logs.
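A small sketch of how such (query → clicked-photo concepts) training pairs could be assembled from a click log. The record fields (`query`, `clicked_photo_ids`) and the photo-to-concepts lookup are assumed for illustration and may not match the real Flickr log schema.

```python
# Sketch of building (query -> clicked concepts) training pairs from a search log.
from collections import defaultdict

def build_training_pairs(log_records, photo_concepts):
    """Aggregate, per query, the concepts detected in its clicked photos."""
    pairs = defaultdict(set)
    for record in log_records:
        for photo_id in record["clicked_photo_ids"]:
            pairs[record["query"]].update(photo_concepts.get(photo_id, []))
    return {query: sorted(concepts) for query, concepts in pairs.items()}

log = [{"query": "making a sandwich", "clicked_photo_ids": ["p1", "p2"]}]
concepts = {"p1": ["bread", "kitchen"], "p2": ["vegetable", "cooking", "ham"]}
print(build_training_pairs(log, concepts))
# {'making a sandwich': ['bread', 'cooking', 'ham', 'kitchen', 'vegetable']}
```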

19 Max-Pooled MLP. To learn the query-to-concept mapping, we propose two deep embedding models. The first is the Max-Pooled MLP. The input is the query words along with their part-of-speech (POS) tags, and the model outputs the relevant concepts. For example, the query "birthday party" is mapped to relevant concepts such as cake, presents, kids, and candles. The query words are first fed into embedding layers, then into a max-pooling layer, followed by a number of fully connected layers.
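A PyTorch sketch of a max-pooled MLP along these lines: word and POS-tag embeddings, max pooling over query positions, then fully connected layers producing concept scores. Vocabulary sizes, embedding dimensions, and depth are illustrative choices, not the paper's exact architecture.

```python
# Sketch of a max-pooled MLP for query-to-concept mapping, following the slide's
# description (word + POS embeddings -> max pooling -> fully connected layers).
import torch
import torch.nn as nn

class MaxPooledMLP(nn.Module):
    def __init__(self, n_words, n_pos, n_concepts, emb_dim=256, hidden=512):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim, padding_idx=0)
        self.pos_emb = nn.Embedding(n_pos, emb_dim, padding_idx=0)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_concepts),  # one score per concept
        )

    def forward(self, word_ids, pos_ids):
        # (batch, query_len, 2 * emb_dim) after concatenating the two channels
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)
        pooled, _ = x.max(dim=1)            # max-pool over query positions
        return self.mlp(pooled)             # logits over the concept vocabulary

model = MaxPooledMLP(n_words=50000, n_pos=50, n_concepts=5000)
logits = model(torch.randint(1, 50000, (2, 4)), torch.randint(1, 50, (2, 4)))
print(logits.shape)  # torch.Size([2, 5000])
```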

20 Two-Channel RNN. The Max-Pooled MLP does not consider the query word order, so "birthday party" is treated the same as "party birthday". In our second model, we employ two LSTM layers to capture the query word sequence. The query words and their POS tags are fed into the model sequentially through two channels. The hidden LSTM states are pooled together and then sent to a number of fully connected layers.
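A sketch of the two-channel idea in PyTorch, again with illustrative hyperparameters; how exactly the paper pools the LSTM hidden states may differ from the simple concatenation of final states used here.

```python
# Sketch of the two-channel RNN: words and POS tags go through separate LSTMs,
# their final hidden states are pooled and fed to fully connected layers.
import torch
import torch.nn as nn

class TwoChannelRNN(nn.Module):
    def __init__(self, n_words, n_pos, n_concepts, emb_dim=256, hidden=512):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim, padding_idx=0)
        self.pos_emb = nn.Embedding(n_pos, emb_dim, padding_idx=0)
        self.word_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.pos_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_concepts),
        )

    def forward(self, word_ids, pos_ids):
        _, (h_word, _) = self.word_lstm(self.word_emb(word_ids))
        _, (h_pos, _) = self.pos_lstm(self.pos_emb(pos_ids))
        pooled = torch.cat([h_word[-1], h_pos[-1]], dim=-1)  # pool the two channels
        return self.fc(pooled)  # logits over the concept vocabulary

model = TwoChannelRNN(n_words=50000, n_pos=50, n_concepts=5000)
print(model(torch.randint(1, 50000, (2, 4)), torch.randint(1, 50, (2, 4))).shape)
```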

21 Outline: Introduction, Flickr Data, Statistical Characteristics, Deep Query Understanding, Experiments, Conclusions. Let's look at our experimental results.

22 Experimental Setup
Task: search 148,000 personal photos by concepts; all textual metadata is discarded.
Goal: rank the user-clicked photos closer to the top.
Training: 20,600 queries from 3,978 users. Test: 2,443 queries from 1,620 users.
Evaluation: mean average precision (mAP) and concept recall at k. mAP measures how well the clicked photos are ranked in the results; concept recall@k measures how accurate the top-k predicted concepts are.
Since most personal media in real-world scenarios have no tags, we discard all textual metadata and use the clicked photos extracted from the search logs as relevance labels.
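A small sketch of the two metrics as described on this slide (clicked photos as the relevant items for average precision, set overlap for concept recall@k); the paper's exact definitions may differ slightly.

```python
# Sketch of the evaluation metrics: average precision over a ranked photo list
# with clicked photos as relevant, and concept recall@k for predicted concepts.
import numpy as np

def average_precision(ranked_photo_ids, clicked_ids):
    """AP of one query: clicked photos are the relevant items."""
    hits, precisions = 0, []
    for rank, photo_id in enumerate(ranked_photo_ids, start=1):
        if photo_id in clicked_ids:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def concept_recall_at_k(predicted_concepts, relevant_concepts, k):
    """Fraction of relevant concepts covered by the top-k predictions."""
    return len(set(predicted_concepts[:k]) & set(relevant_concepts)) / len(relevant_concepts)

print(average_precision(["p3", "p1", "p7"], {"p1", "p7"}))                        # ~0.583
print(concept_recall_at_k(["bread", "dog", "kitchen"], ["bread", "kitchen"], 2))  # 0.5
# mAP is then the mean of average_precision over all test queries.
```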

23 Results. We compare the proposed method with a number of baselines, including the skip-gram model (word2vec embeddings trained on GoogleNews). The proposed methods are VQE (MaxMLP) and VQE (RNN). VQE (MaxMLP) significantly outperforms the other methods, a 45% relative improvement, which suggests that learning a deep embedding over the search logs helps. However, the best mAP is still 0.39, so the problem remains very challenging. Comparing the RNN and MaxMLP models, we found that considering the word sequence might not help: the main reason the RNN is worse is that it converges much more slowly, and when we stop training after a day it still yields much worse results than the MaxMLP model.

24 Examples of top-2 results
The figures show two representative queries: paintball and key west (Key West is an island in Florida). The ranked list on the left is produced by our model and the one on the right by the baseline word2vec model. The photos with green borders were clicked by the user. In both cases our method finds more meaningful results.

25 Empirical Observations
Deeper models are better. Max pooling is better than average pooling. The softmax loss yields the best results, probably because the concepts in the clicked photos are sparse. We conducted more experiments to understand the behavior of our models. In summary: deeper models lead to more accurate results, even with fewer parameters; max pooling works better than average pooling in our best model; and among the three losses we compared (softmax cross-entropy, cross-entropy, and cosine distance), the softmax loss yields the best results, probably because the concepts in the clicked photos are sparse.
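For reference, the three losses compared here can be written as follows against a multi-label concept target. How the paper normalizes the sparse clicked-concept vector for the softmax loss is an assumption in this sketch (it is turned into a distribution over the clicked concepts).

```python
# The three losses compared on the slide, applied to a multi-label concept target.
import torch
import torch.nn.functional as F

logits = torch.randn(2, 5000)          # model scores for 5000 concepts
target = torch.zeros(2, 5000)
target[0, [3, 17, 42]] = 1.0           # sparse clicked-photo concepts
target[1, [7, 99]] = 1.0

# Softmax cross-entropy with the clicked concepts as a soft target distribution.
softmax_ce = -(target / target.sum(dim=1, keepdim=True)
               * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
# Per-concept (sigmoid) cross-entropy.
sigmoid_ce = F.binary_cross_entropy_with_logits(logits, target)
# Cosine distance between predicted scores and the target vector.
cosine_loss = (1 - F.cosine_similarity(torch.sigmoid(logits), target, dim=1)).mean()

print(softmax_ce.item(), sigmoid_ce.item(), cosine_loss.item())
```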

26 Outline: Introduction, Flickr Data, Statistical Characteristics, Deep Query Understanding, Experiments, Conclusions. Here are the conclusions.

27 Take-home messages. There are three take-home messages from today's talk. First, we found several characteristics of personal media search: personal query sessions are shorter and queries are more "visual"; users are interested in "4W queries", i.e., what, who, where, and when; and 80% of personal media have no textual metadata, with the percentage that has tags still decreasing. Second, utilizing deep learning for query understanding is promising for improving personal media search. Finally, we believe personal media search is a novel and challenging problem that needs further research.

28 Thank you. Questions to: lujiang@cs.cmu.edu. Check out our application: search for MemoryQA on YouTube.

