A Thousand Words in a Scene P. Quelhas, F. Monay, J. Odobez, D. Gatica-Perez and T. Tuytelaars PAMI, Sept. 2006
Outline Introduction Image Representation –Bag-of-Visterms (BOV) Representation –Probabilistic Latent Semantic Analysis (PLSA) Scene Classification Experiments –Classification –Image Ranking Conclusion
Introduction Main work –Scene modeling and classification What’s new? –Combine text modeling methods and local invariant features to represent an image: a text-like bag-of-visterms representation (histogram of quantized local visual features) and Probabilistic Latent Semantic Analysis (PLSA) –Scene classification is based on the image representation –Scenes can be ranked via PLSA
Introduction Framework –Diagram blocks: an image, interest point detector, local descriptors, quantization, BOV, PLSA, classification (SVM), classification / ranking –Diagram stage labels: low-level feature extraction; approach to text-like representation; text-modeling methods; feature extraction
Image Representation Local invariant features –Interest point detection: extract characteristic points, and more generally regions, from the images. The detector is invariant to geometric and photometric transformations: given an image and its transformed versions, the same points are extracted. Employ the Difference of Gaussians (DOG) point detector: –Compare a point with its eight neighbors to find a minimum/maximum. –Invariant to translation, scale, rotation and illumination variations.
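The eight-neighbor comparison can be sketched as follows. This is a minimal illustration, not the detector used in the paper: it assumes the DOG response has already been computed as a 2D grid, and it checks a single response map only. The function name `local_extrema` and the toy grid are hypothetical.

```python
def local_extrema(response):
    """Return (row, col, kind) for points that are strict extrema
    with respect to their eight neighbors in a 2D response map."""
    rows, cols = len(response), len(response[0])
    found = []
    # Border points are skipped because they lack a full neighborhood.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = response[r][c]
            neighbors = [
                response[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            ]
            if center > max(neighbors):
                found.append((r, c, "max"))
            elif center < min(neighbors):
                found.append((r, c, "min"))
    return found


# Toy response map (assumed values): one peak and one valley.
grid = [
    [0, 0, 0, 0, 0],
    [0, 5, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, -3, 0],
    [0, 0, 0, 0, 0],
]
extrema = local_extrema(grid)
```

Strict inequalities are used so that flat regions do not produce interest points.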
Image Representation –Local descriptors: compute a descriptor on the region around each interest point, using the Scale Invariant Feature Transform (SIFT) feature as the local descriptor. –Low-level feature extraction example: each point has a 128-dimensional feature vector.
Image Representation Quantization –Quantize each local descriptor into a symbol (visterm) via K-means. Bag-of-visterms representation –Histogram of the visterms. –Cons: no spatial information between visterms.
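A minimal sketch of the quantization and histogram steps, assuming the K-means cluster centers have already been learned. The names `nearest_center` and `bov_histogram`, and the 2D toy descriptors (real SIFT descriptors are 128-dimensional), are illustrative only.

```python
def nearest_center(descriptor, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda k: sq_dist(descriptor, centers[k]))


def bov_histogram(descriptors, centers):
    """Count how many descriptors of one image fall on each visterm."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest_center(d, centers)] += 1
    return hist


# Assumed toy vocabulary of three visterms and four descriptors from one image.
centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
descriptors = [(1.0, 0.5), (9.0, 11.0), (0.2, 0.1), (1.0, 9.0)]
histogram = bov_histogram(descriptors, centers)
```

The histogram keeps only visterm counts, which is why the positions of the interest points, and hence any spatial relationship between visterms, are lost.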
Image Representation Probabilistic Latent Semantic Analysis (PLSA) –Introduce latent variables z_l, called aspects, and associate a z_l with each observation (visterm). –Build a joint probability model over images and visterms. –The model parameters are fit by maximizing the likelihood. –Image representation: each image is described by its mixture of aspects.
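A minimal sketch of fitting an aspect model with EM, assuming the standard PLSA factorization P(v | d) = sum over z of P(v | z) P(z | d), where d is an image and v a visterm. The names `plsa`, `log_likelihood`, and `counts`, the random initialization, and the toy count matrix are assumptions for illustration, not taken from the paper.

```python
import math
import random


def _normalize(row):
    """Scale a non-negative row to sum to 1 (uniform if it is all zeros)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, n_aspects, n_iter=50, seed=0):
    """Fit P(z | d) and P(v | z) to an image-by-visterm count matrix with EM."""
    rng = random.Random(seed)
    n_images, n_visterms = len(counts), len(counts[0])
    p_z_d = [_normalize([rng.random() + 1e-3 for _ in range(n_aspects)])
             for _ in range(n_images)]
    p_v_z = [_normalize([rng.random() + 1e-3 for _ in range(n_visterms)])
             for _ in range(n_aspects)]
    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_aspects for _ in range(n_images)]
        acc_v_z = [[0.0] * n_visterms for _ in range(n_aspects)]
        for d in range(n_images):
            for v in range(n_visterms):
                if counts[d][v] == 0:
                    continue
                # E-step: posterior over aspects for this (image, visterm) pair.
                joint = [p_z_d[d][z] * p_v_z[z][v] for z in range(n_aspects)]
                total = sum(joint)
                if total == 0:
                    continue
                # Accumulate expected counts for the M-step.
                for z in range(n_aspects):
                    weight = counts[d][v] * joint[z] / total
                    acc_z_d[d][z] += weight
                    acc_v_z[z][v] += weight
        # M-step: renormalize the expected counts into probabilities.
        p_z_d = [_normalize(row) for row in acc_z_d]
        p_v_z = [_normalize(row) for row in acc_v_z]
    return p_z_d, p_v_z


def log_likelihood(counts, p_z_d, p_v_z):
    """Sum of n(d, v) * log P(v | d) over observed pairs."""
    ll = 0.0
    for d, row in enumerate(counts):
        for v, n in enumerate(row):
            if n:
                mix = sum(p_z_d[d][z] * p_v_z[z][v] for z in range(len(p_v_z)))
                ll += n * math.log(mix)
    return ll


# Assumed toy data: 4 images, 4 visterms, 2 aspects.
counts = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 2], [0, 0, 3, 4]]
p_z_d, p_v_z = plsa(counts, n_aspects=2)
```

Each row of `p_z_d` is the aspect mixture of one image and can serve as a low-dimensional feature vector for a classifier. EM only finds a local optimum, so different seeds may yield different aspects.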
Image Representation Polysemy and synonymy with visterms –Polysemy: a single visterm may represent different scene content. –Synonymy: several visterms may characterize the same image content. –Example: samples from 3 randomly selected visterms from a vocabulary of size –Not all visterms have a clear semantic interpretation. –Pros of PLSA: introducing aspects captures visterm co-occurrence, and can thus handle the polysemy and synonymy issues.
Experiments Classification –BOV classification (three-class) Dataset: indoor, city, landscape. Training and testing: the whole dataset is split into 10 parts, one for training, the other 9 for testing. Baseline methods: histograms on low-level features.
Experiments –PLSA classification (three-class) PLSA-I: use the same part of the data to train the SVM and to learn the aspect models. PLSA-O: use an auxiliary dataset to learn the aspect models.
Experiments Aspect-based image ranking –Given an aspect z, images can be ranked according to the probability that the PLSA model associates with that aspect for each image. –Dataset: landscape/city
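A minimal sketch of aspect-based ranking. It assumes the ranking score is P(z | d), which orders images the same way as P(d | z) when all images are taken as equally likely; the exact score used in the paper is not stated here. The function name `rank_by_aspect` and the probability table are hypothetical.

```python
def rank_by_aspect(p_z_d, aspect):
    """Return image indices sorted by decreasing P(aspect | image)."""
    return sorted(range(len(p_z_d)), key=lambda d: p_z_d[d][aspect], reverse=True)


# Assumed aspect mixtures for three images and two aspects.
p_z_d = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
ranking_aspect_0 = rank_by_aspect(p_z_d, 0)
ranking_aspect_1 = rank_by_aspect(p_z_d, 1)
```

Inspecting the top-ranked images for each aspect is a simple way to check whether an aspect corresponds to a recognizable scene type.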
Conclusion The proposed scene modeling method is effective for scene classification. A visual scene is represented as a mixture of aspects in PLSA modeling.