OPTIMOL: automatic Object Picture collecTion via Incremental MOdel Learning
a chicken and egg problem…
…among users, researchers, and data
Users of web search engines would like better image results. Developers of these search engines would like more robust visual models to improve those results. Computer vision researchers are developing the visual models and algorithms for this purpose, but to do so they need large and diverse object image datasets for training and evaluation, which leads back to the same problem the users face. Currently there is no optimal solution to this problem: researchers have to select the desired images manually. Well-known datasets such as Caltech101, LabelMe, and LHI were collected this way.
Framework
[Diagram: dataset, category model, classification; keyword: accordion]
The intuition is this: the web is vast and open, so we want to use it as our resource in an iterative fashion that scales well while accumulating knowledge. Starting from a very small number of seed images of an object class (provided either by a human or automatically), our algorithm learns a category model that best describes the class. Serving as a classifier, the model is applied to the images downloaded from the web, and the images it judges to belong to the object class are appended to the dataset. In the next iteration, a subset of the newly incorporated images is used to update the category model. With the updated model, the algorithm goes back to the downloaded images and pulls more relevant ones. This iterative process continuously gathers a highly accurate image dataset while learning an increasingly robust object model.
Li, Wang & Fei-Fei, CVPR 2007
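The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `collect_incrementally`, `fit`, `score`, and `threshold` are hypothetical names, the "images" are plain numbers, and the model is refit on the whole dataset rather than on a subset of the new images as the talk describes.

```python
from typing import Callable, List, Tuple


def collect_incrementally(
    seeds: List[float],
    raw_pool: List[float],
    fit: Callable[[List[float]], float],
    score: Callable[[float, float], float],
    threshold: float,
    max_rounds: int = 10,
) -> Tuple[List[float], float]:
    """Grow a dataset from seeds by alternating classification and model updates."""
    dataset = list(seeds)
    remaining = list(raw_pool)
    model = fit(dataset)
    for _ in range(max_rounds):
        # Classify every image still in the raw pool with the current model.
        accepted = [x for x in remaining if score(model, x) > threshold]
        if not accepted:
            break  # nothing new was pulled in, so the model would not change
        remaining = [x for x in remaining if score(model, x) <= threshold]
        dataset.extend(accepted)
        # Simplification: refit on everything collected so far.
        model = fit(dataset)
    return dataset, model


def mean_model(images: List[float]) -> float:
    """Toy 'category model': the mean of the collected values."""
    return sum(images) / len(images)


def closeness(model: float, image: float) -> float:
    """Toy score: higher means closer to the model."""
    return -abs(image - model)


dataset, model = collect_incrementally(
    seeds=[10.0, 11.0],
    raw_pool=[12.0, 13.0, 30.0],
    fit=mean_model,
    score=closeness,
    threshold=-2.5,
)
print(dataset, model)
```

In this toy run the value 13.0 is rejected in the first round and accepted in the second, after the model has been updated with 12.0, which is the behaviour the iterative framework relies on.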
Framework
Now let's talk about the category model.
Image representation
Kadir & Brady interest point detector; compute SIFT descriptor [Lowe '99]; codeword representation
How do we obtain the "bag of words" visual representation? The Kadir & Brady detector is used …
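As a rough illustration of the codeword step, the sketch below assigns each local descriptor to its nearest codeword and counts the assignments. The function names and the tiny 2-D codebook are hypothetical; real SIFT descriptors are 128-dimensional, and the talk does not say how the visual dictionary is built.

```python
from typing import List, Sequence


def nearest_codeword(descriptor: Sequence[float],
                     codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codeword closest to the descriptor (squared Euclidean)."""
    best_index = 0
    best_distance = float("inf")
    for index, codeword in enumerate(codebook):
        distance = sum((a - b) ** 2 for a, b in zip(descriptor, codeword))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def bag_of_words(descriptors: Sequence[Sequence[float]],
                 codebook: Sequence[Sequence[float]]) -> List[int]:
    """Count how many patches of one image fall on each codeword."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    return counts


# Toy data: three codewords and four patch descriptors from one image.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
patches = [(0.1, 0.0), (0.9, 1.2), (4.0, 6.0), (1.1, 0.8)]
print(bag_of_words(patches, codebook))  # [1, 2, 1]
```

The resulting histogram discards patch positions, which is why only appearance, and no global layout, is available to the model.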
Nonparametric topic model: Hierarchical Dirichlet Process (HDP)
Latent topic models are widely used for unsupervised learning because they offer a natural clustering: topics discover subsets of the data with common attributes. There are many latent topic models, including pLSA, LDA, and HDP. In incremental learning we do not know the number of topics in advance, so we choose a nonparametric topic model, specifically the Hierarchical Dirichlet Process, as our visual model.
We first define the notation for the graphical model. Start from the inner rectangle (labeled "each patch"): a patch $x$ is the basic unit of an image, and each patch is represented by a codeword from the visual dictionary. Codewords with common attributes share the same topic; $z$ is the topic index of a particular patch. An image is a collection of $N$ patches, which form the inner rectangle, and $\pi$ is the mixture proportion of the image. The $M$ images in a corpus (the outer rectangle, labeled "each image") have mixture proportions governed by the same parameters $\gamma$ and $\alpha$. $\beta$ determines the average weight of the topics ($E[\pi_{jk}] = \beta_k$), while $\alpha$ controls the variability of topic weights across groups. $H$ is the prior distribution for the $\theta$'s, the parameters of the topics shared among different images, which therefore determine the distribution of the patches. We have one such model per category.
Teh et al. 2004; Sudderth et al. CVPR 2006; Wang, Zhang & Fei-Fei, CVPR 2006
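For reference, the notation above lines up with the standard HDP generative process of Teh et al., written here with $j$ indexing images and $i$ indexing patches. The multinomial likelihood over codewords is an assumption that fits a discrete visual dictionary; the slides do not state the likelihood explicitly.

```latex
\begin{align*}
\beta \mid \gamma &\sim \mathrm{GEM}(\gamma) \\
\pi_j \mid \alpha, \beta &\sim \mathrm{DP}(\alpha, \beta), && j = 1, \dots, M \\
\theta_k \mid H &\sim H, && k = 1, 2, \dots \\
z_{ji} \mid \pi_j &\sim \pi_j, && i = 1, \dots, N \\
x_{ji} \mid z_{ji}, \theta &\sim \mathrm{Mult}(\theta_{z_{ji}})
\end{align*}
```

Under this process $E[\pi_{jk}] = \beta_k$, and a larger $\alpha$ concentrates each $\pi_j$ more tightly around $\beta$.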
Classification
(Equations on slide: category likelihood for image I; likelihood ratio for the decision.)
During data collection, we perform binary classification using the likelihood ratio between a foreground object model and a background model learned from unrelated images. For a dataset-collecting approach, incorporating a bad image into the dataset (a false positive) is far worse than missing a good image (a false negative). Hence, a risk function is introduced to penalize false positives more heavily. Here R is shorthand for risk.
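The slide equations did not survive extraction, so the following is only a generic sketch of a risk-weighted likelihood-ratio test, not the rule from the paper. It uses the textbook minimum-expected-risk decision; `accept_image`, `cost_fp`, `cost_fn`, and `prior_fg` are hypothetical names and values.

```python
import math


def accept_image(log_lik_fg: float, log_lik_bg: float,
                 prior_fg: float = 0.5,
                 cost_fp: float = 5.0,
                 cost_fn: float = 1.0) -> bool:
    """Accept an image when accepting has lower expected risk than rejecting.

    Accepting costs cost_fp if the image is background; rejecting costs
    cost_fn if the image is foreground. Comparing the two expected risks
    gives the test
        p(I | fg) / p(I | bg) > cost_fp * P(bg) / (cost_fn * P(fg)),
    evaluated here in log space for numerical stability.
    """
    log_ratio = log_lik_fg - log_lik_bg
    log_threshold = (math.log(cost_fp * (1.0 - prior_fg))
                     - math.log(cost_fn * prior_fg))
    return log_ratio > log_threshold


# With equal priors and cost_fp = 5, the log-ratio must exceed ln 5 (about 1.61).
print(accept_image(-100.0, -103.0))               # True: log-ratio is 3.0
print(accept_image(-100.0, -101.0))               # False: log-ratio is 1.0
print(accept_image(-100.0, -101.0, cost_fp=1.0))  # True: threshold drops to 0
```

Raising `cost_fp` relative to `cost_fn` raises the threshold, trading missed good images for fewer bad images in the dataset.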
Annotation
With the object model, we can retrieve a large number of clean images with few mistakes. Furthermore, we can produce meaningful annotations by integrating out the topic parameters to find the most likely local patches given the object category.
Pitfall #1: model drift
However, model updating based on a few strong cues is likely to be biased. Suppose we always update the object model using very similar images, like these two. The collected images will then very likely be similar to these two, while images like these, with a lower likelihood ratio, will be missed. We would expect to get only very good images like this one. But we can also get images like this one, because we use local patches, and only appearance is considered in learning and classification, without any global information. The "kid in strawberry" image has local patches most similar to those of the two training images, so it has a high likelihood ratio.
Pitfall #2: model diversity
[Figure: object model with examples of good images and bad images]
The "cache set"
Hence, a cache set is designed for our approach. First, images with a high likelihood ratio are accepted. The accepted images are then measured by their entropy, $H(z \mid I) = -\sum_{z} p(z \mid I) \ln p(z \mid I)$. High-entropy images, which indicate new topics, go to the cache, while low-entropy images, which we are very sure about, are appended to the dataset. Incremental learning is conducted only on the cache set.
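The routing rule above can be sketched as follows. The entropy is the one given on the slide; the function names, the two thresholds, and the three string labels are hypothetical choices for illustration.

```python
import math
from typing import Sequence


def topic_entropy(topic_posterior: Sequence[float]) -> float:
    """H(z|I) = -sum_z p(z|I) ln p(z|I); zero-probability topics contribute nothing."""
    return -sum(p * math.log(p) for p in topic_posterior if p > 0.0)


def route_image(likelihood_ratio: float,
                topic_posterior: Sequence[float],
                ratio_threshold: float,
                entropy_threshold: float) -> str:
    """Decide where an image goes: 'reject', 'cache', or 'dataset'."""
    if likelihood_ratio <= ratio_threshold:
        return "reject"   # not confidently an instance of the object class
    if topic_entropy(topic_posterior) > entropy_threshold:
        return "cache"    # accepted but uncertain topics: used for incremental learning
    return "dataset"      # accepted and confident: appended to the dataset


print(route_image(2.0, [0.5, 0.5], ratio_threshold=1.0, entropy_threshold=0.5))    # cache
print(route_image(2.0, [0.99, 0.01], ratio_threshold=1.0, entropy_threshold=0.5))  # dataset
print(route_image(0.5, [0.5, 0.5], ratio_threshold=1.0, entropy_threshold=0.5))    # reject
```

An image whose topic posterior is spread evenly has entropy near the maximum, so it is the kind of image that brings new information to the model.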
[Diagram: raw image dataset, classification, enlarged dataset, cache, incremental learning, category model]
Result
Here we use accordion as an example to show the results of dataset collection and annotation. The four images on the left are annotation results: our approach can precisely locate the accordion even against a very cluttered background. On the right, the bar plot compares the number of images collected by OPTIMOL with existing datasets; the y-axis is the number of images. The blue bar is the LabelMe dataset, which has no accordion images, so the bar is invisible. The yellow bar is the images manually selected from the Caltech 101 raw dataset. The red bar is the images collected by OPTIMOL, also from the Caltech 101 raw dataset. In this case OPTIMOL is comparable to the human selection, with only a few mistakes, represented by the darker red part at the top. We emphasize that these two datasets are extracted from exactly the same raw dataset. The green bar is the number of clean images retrieved from our own raw web images. The figure shows that OPTIMOL can collect many more images than the existing datasets, with few mistakes.
As the previous slide showed, humans also make mistakes when collecting a dataset, so our result is reasonable.
OPTIMOL also learns good models
Team OPTIMOL (UIUC-Princeton): 1st Place in the Software League