Goggle Gist on the Google Phone
A content-based image retrieval system for the Google phone
Manu Viswanathan, Chin-Kai Chang, Ji Hyun Moon

MESA BRIDGES
Project Outline
A content-based image retrieval system on an Android phone
Finds similar images that match the image captured on the cell phone
Gist algorithm

Gist & Scene Categorization
Requirements
Accuracy: should retain enough information to be able to make broad categorizations
Speed: should be able to quickly perform gist transformation and exemplar matching
Source image: 160 x 120 pixels, 19,200 numbers (grayscale)
Gist vector: ~100 numbers
[Figure: the gist vector of some new scene is compared with category exemplars]

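The exemplar matching step above can be sketched as a nearest-neighbor lookup. This is a minimal illustration, not the project's implementation: the function name, the Euclidean distance, and the toy 4-number vectors (standing in for the ~100-number gist vectors) are all assumptions.

```python
import numpy as np

def nearest_exemplar_category(gist, exemplars, labels):
    """Return the label of the exemplar closest to `gist` (Euclidean distance).

    Hypothetical helper: the slides do not say which distance is used.
    """
    gist = np.asarray(gist, dtype=float)
    exemplars = np.asarray(exemplars, dtype=float)
    # One distance per exemplar row.
    distances = np.linalg.norm(exemplars - gist, axis=1)
    return labels[int(np.argmin(distances))]

# Toy data: 4-number vectors stand in for real gist vectors.
exemplars = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 1.0]]
labels = ["cup", "chair", "laptop"]
new_scene = [0.1, 0.9, 0.0, 0.1]
category = nearest_exemplar_category(new_scene, exemplars, labels)
print(category)
```

Because the comparison runs on ~100 numbers rather than 19,200 pixels, each match costs only a small fraction of a pixel-level comparison, which is the point of the speed requirement.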
Project Design
Client-Server application
Client: Camera, Image Recorder, Gist Estimator, HTTP Handler, User Interface
Server: Web Server, PHP handler, Perl Module, C++ SVM Classifier, Image Database
Communication: HTTP Request, HTTP Response

Lazebnik Algorithm
Compute SIFT grid
Feature Extraction
Spatial Pyramid
Spatial Histogram
Compute Gist Vector
SVM Classification

Lazebnik Algorithm
Feature Extraction
Weak features: edge points at 8 orientations and 2 scales. These channels are the vocabulary. Vocabulary size M = 16.
Strong features: SIFT on 16 x 16 pixel patches. Vocabulary from K-means on SIFT descriptors. Typically, M = 200 or 400.

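The strong-feature vocabulary can be sketched as follows. This is a minimal illustration under stated assumptions: random 128-D vectors stand in for real SIFT descriptors, the k-means is a plain Lloyd's loop rather than any particular library's, M = 16 is chosen only to keep the example fast, and all function names are hypothetical.

```python
import numpy as np

def assign_words(descriptors, centers):
    """Index of the nearest vocabulary center for every descriptor."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(descriptors, m, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns an (m, d) array of visual words."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=m, replace=False)
    centers = descriptors[start].copy()
    for _ in range(iterations):
        words = assign_words(descriptors, centers)
        for k in range(m):
            members = descriptors[words == k]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return centers

def word_histogram(descriptors, centers):
    """Normalized count of each visual word in one image."""
    words = assign_words(descriptors, centers)
    counts = np.bincount(words, minlength=len(centers)).astype(float)
    return counts / counts.sum()

rng = np.random.default_rng(1)
training_descriptors = rng.random((500, 128))  # stand-ins for SIFT descriptors
vocabulary = kmeans(training_descriptors, m=16)
image_descriptors = rng.random((60, 128))      # descriptors of one image
histogram = word_histogram(image_descriptors, vocabulary)
print(histogram.shape)
```

The vocabulary is built once from training descriptors. Each new image then needs only the nearest-center assignment, whose cost grows with M.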
Lazebnik Algorithm
Spatial Matching
The idea is to “contextualize” the visual words by performing a sort of geometric match.
X_m and Y_m are sets of 2D vectors representing the positions of visual word m in the input and training images.
For each word, we apply the pyramid match kernel K^L to the above position vectors.
Categorization is done with an SVM trained using the one-versus-all rule.

Lazebnik Algorithm
Pyramid Matching

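The pyramid match kernel for a single visual word can be sketched as below, following the weighting in Lazebnik et al.: K^L = I^0 / 2^L + sum over l = 1..L of I^l / 2^(L-l+1), where I^l is the histogram intersection on a 2^l x 2^l grid. The function names are hypothetical, and the sketch assumes point coordinates are already normalized to the unit square.

```python
import numpy as np

def level_histogram(points, level):
    """Count points in each cell of a 2^level x 2^level grid over [0, 1)^2."""
    cells = 2 ** level
    points = np.asarray(points, dtype=float).reshape(-1, 2)
    idx = np.clip((points * cells).astype(int), 0, cells - 1)
    hist = np.zeros((cells, cells))
    np.add.at(hist, (idx[:, 0], idx[:, 1]), 1)  # handles repeated cells
    return hist

def pyramid_match_kernel(x, y, levels=2):
    """K^L(X, Y) for the positions of one visual word in two images."""
    intersections = [
        np.minimum(level_histogram(x, l), level_histogram(y, l)).sum()
        for l in range(levels + 1)
    ]
    # The coarsest level gets weight 1/2^L; finer levels get larger weights.
    k = intersections[0] / 2 ** levels
    for l in range(1, levels + 1):
        k += intersections[l] / 2 ** (levels - l + 1)
    return k

x = [[0.1, 0.1], [0.9, 0.9]]
y = [[0.1, 0.1], [0.1, 0.9]]
k_same = pyramid_match_kernel(x, x)
k_diff = pyramid_match_kernel(x, y)
print(k_same, k_diff)
```

The level weights sum to 1, so matching a point set against itself returns the number of points. The full image kernel is the sum of K^L(X_m, Y_m) over all M visual words.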
Experimental Setup
Caltech %-0%, 75%-25%, 50%-50%
8 categories: Car Side, Cellphone, Chair, Cup, Faces, Laptop, Motorbikes, Pizza
Vocabulary size: 25, 50, 100, 200
Training is done on the server side

Testing Result
25% training, 75% testing. Vocabulary size 200.
57.3% overall classification accuracy
[Confusion matrix: ground-truth categories Car Side, Cellphone, Chair, Cup, Faces, Laptop, Motorbikes, Pizza against the same categories plus Unknown; cell values not recoverable]

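An overall accuracy figure like the one above can be derived from a confusion matrix as sketched here. The counts are invented toy numbers for two categories plus an Unknown column, not the project's results. The sketch also assumes that "overall accuracy" means correct predictions divided by all test images, which the slides do not state.

```python
import numpy as np

def overall_accuracy(confusion):
    """Correct predictions / all test images.

    Rows are ground-truth classes; columns are predicted classes, with an
    optional extra trailing "Unknown" column (assumed layout).
    """
    confusion = np.asarray(confusion, dtype=float)
    n_classes = confusion.shape[0]
    correct = np.trace(confusion[:, :n_classes])
    return correct / confusion.sum()

def per_class_accuracy(confusion):
    """Fraction of each ground-truth class that was labeled correctly."""
    confusion = np.asarray(confusion, dtype=float)
    n_classes = confusion.shape[0]
    return np.diag(confusion[:, :n_classes]) / confusion.sum(axis=1)

# Toy counts only: rows = 2 ground-truth classes, columns = 2 classes + Unknown.
toy_confusion = [[8, 1, 1],
                 [2, 6, 2]]
accuracy = overall_accuracy(toy_confusion)
per_class = per_class_accuracy(toy_confusion)
print(accuracy, per_class)
```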
Speed vs. Accuracy
[Figure: speed vs. accuracy plot with points labeled 1, 2, 3]

Result 3
Pyramid Matching

Discussion & Conclusions
The client-server design makes the application easy to port to different embedded systems.
Computing the gist vector is an expensive process on an embedded system.
Reducing the vocabulary size improves processing speed at the cost of some accuracy.

Lazebnik, S., Schmid, C., Ponce, J. "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," CVPR, 2006.
Gordon, I., Lowe, D. G. "Scene Modelling, Recognition and Tracking with Invariant Image Features," International Symposium on Mixed and Augmented Reality (ISMAR).