Building local part models for category-level recognition C. Schmid, INRIA Grenoble Joint work with G. Dorko, S. Lazebnik, J. Ponce
Introduction Invariant local descriptors => robust recognition of specific objects or scenes Recognition of textures and object classes => description of intra-class variation, selection of discriminant features, spatial relations Examples: texture recognition, car detection
Overview 1. Affine-invariant texture recognition (CVPR’03) 2. A two-layer architecture for texture segmentation and recognition (ICCV’03) 3. Feature selection for object class recognition (ICCV’03) 4. Building affine-invariant part models for recognition
Affine-invariant texture recognition Texture recognition under viewpoint changes and non-rigid transformations Use of affine-invariant regions –invariance to viewpoint changes –spatial selection => more compact representation, reduction of redundancy in texton dictionary [A sparse texture representation using affine-invariant regions, S. Lazebnik, C. Schmid and J. Ponce, CVPR 2003]
Spatial selection: clustering each pixel vs. clustering selected pixels
Overview of the approach
Region extraction: Harris detector, Laplace detector
Descriptors – Spin images
Signature and EMD Hierarchical clustering => Signature: $S = \{(m_1, w_1), \ldots, (m_k, w_k)\}$ Earth Mover's Distance: $D(S, S') = \left[\sum_{i,j} f_{ij}\, d(m_i, m'_j)\right] / \left[\sum_{i,j} f_{ij}\right]$ –robust distance, optimizes the flow between distributions –can match signatures of different size –not sensitive to the number of clusters
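The slide does not say how the flow f_ij is obtained. Assuming it follows the usual definition of the Earth Mover's Distance, the flow is the solution of a transportation problem between the two signatures S = {(m_i, w_i)} and S' = {(m'_j, w'_j)}, with d the ground distance between cluster centers. A sketch of that standard formulation:

```latex
\begin{aligned}
\min_{f}\quad & \sum_{i,j} f_{ij}\, d(m_i, m'_j) \\
\text{s.t.}\quad & f_{ij} \ge 0, \\
& \sum_{j} f_{ij} \le w_i, \qquad \sum_{i} f_{ij} \le w'_j, \\
& \sum_{i,j} f_{ij} = \min\Big(\sum_i w_i,\ \sum_j w'_j\Big).
\end{aligned}
```

Because the constraints only bound the flow leaving each cluster by its weight, the two signatures may have different numbers of clusters, which is consistent with the "different size" bullet.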
Database with viewpoint changes 20 samples of 10 different textures
Results Spin images Gabor-like filters
A two-layer architecture Texture recognition + segmentation Classification of individual regions + spatial layout [A generative architecture for semi-supervised texture recognition, S. Lazebnik, C. Schmid, J. Ponce, ICCV 2003]
A two-layer architecture Modeling: 1. Distribution of the local descriptors (affine invariants): Gaussian mixture model estimation with EM, allows incorporating unsegmented images 2. Co-occurrence statistics of sub-class labels over affinely adapted neighborhoods Segmentation + recognition: 1. Generative model for initial class probabilities 2. Co-occurrence statistics + relaxation to improve labels
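The relaxation step can be illustrated with a classic probabilistic relaxation update (in the style of Rosenfeld, Hummel and Zucker). This is a minimal sketch, not the authors' implementation: the function name `relax_step`, the uniform neighbor weighting, and the hand-written compatibility matrix are assumptions, whereas the slides derive compatibilities from co-occurrence statistics of sub-class labels.

```python
def relax_step(probs, neighbors, compat):
    """One relaxation update of per-region label probabilities.

    probs:     list of label-probability lists, one per region (each sums to 1)
    neighbors: list of neighbor-index lists, one per region
    compat:    compat[a][b] in [-1, 1]; support that label b on a neighbor
               lends to label a (hypothetical values in this sketch)
    """
    updated = []
    for i, p_i in enumerate(probs):
        n_labels = len(p_i)
        nbrs = neighbors[i]
        support = []
        for a in range(n_labels):
            if nbrs:
                # Average support for label a over all neighbors and their labels.
                q = sum(compat[a][b] * probs[j][b]
                        for j in nbrs for b in range(n_labels)) / len(nbrs)
            else:
                q = 0.0
            support.append(q)
        # Reinforce supported labels, suppress unsupported ones, renormalize.
        unnorm = [p * (1.0 + q) for p, q in zip(p_i, support)]
        total = sum(unnorm)
        updated.append([u / total for u in unnorm] if total > 0 else list(p_i))
    return updated


# Three regions in a chain; the middle one is initially mislabeled.
probs = [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]]
neighbors = [[1], [0, 2], [1]]
compat = [[1.0, -1.0], [-1.0, 1.0]]  # same label supports, different label opposes
new_probs = relax_step(probs, neighbors, compat)
```

In this toy chain the middle region is pulled toward the label of its two confident neighbors after one update, which is the qualitative effect shown in the before/after relaxation figures.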
Texture Dataset – Training Images: T1 (brick), T2 (carpet), T3 (chair), T4 (floor 1), T5 (floor 2), T6 (marble), T7 (wood)
Effect of relaxation + co-occurrence Original image Top: before relaxation (individual regions), bottom: after relaxation (co-occurrence)
Recognition + Segmentation Examples
Animal Dataset – Training Images no manual segmentation, weakly supervised 10 training images per animal (with background) no purely negative images
Recognition + Segmentation Examples
Object class detection/classification Description of intra-class variations of object parts [Selection of scale inv. regions for object class recognition, G. Dorko and C. Schmid, ICCV’03]
Object class detection/classification Description of intra-class variations of object parts Selection of discriminant features (weakly supervised)
Training the model Training phase 1 –Input: images of the object with background (positive images); no normalization or alignment of the image/object –Extraction of local descriptors: Harris-Laplace, Kadir-Brady, SIFT –Clustering: estimation of Gaussian mixture with EM
Training the model Training phase 2 (selection) –Input : verification set, positive and negative images –Rank each cluster with likelihood (or mutual information) –MAP classifier with the n top clusters
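The ranking step can be sketched as follows. The slides name the two criteria but not their exact formulas, so the definitions below are assumptions: the likelihood score is taken as a smoothed ratio P(cluster | positive) / P(cluster | negative), and mutual information is computed between the binary events "descriptor falls in this cluster" and "descriptor comes from a positive image". All function and variable names are hypothetical.

```python
import math


def likelihood_ratio(pos_hits, pos_total, neg_hits, neg_total, eps=1.0):
    """Smoothed P(cluster | positive) / P(cluster | negative). Assumed form."""
    p_pos = (pos_hits + eps) / (pos_total + 2 * eps)
    p_neg = (neg_hits + eps) / (neg_total + 2 * eps)
    return p_pos / p_neg


def mutual_information(pos_hits, pos_total, neg_hits, neg_total):
    """Mutual information in bits between cluster membership and image label."""
    n = pos_total + neg_total
    joint = {
        (1, 1): pos_hits / n,                # in cluster, positive image
        (1, 0): neg_hits / n,                # in cluster, negative image
        (0, 1): (pos_total - pos_hits) / n,  # not in cluster, positive image
        (0, 0): (neg_total - neg_hits) / n,  # not in cluster, negative image
    }
    p_c = {1: joint[(1, 1)] + joint[(1, 0)], 0: joint[(0, 1)] + joint[(0, 0)]}
    p_o = {1: pos_total / n, 0: neg_total / n}
    mi = 0.0
    for (c, o), p in joint.items():
        if p > 0:
            mi += p * math.log2(p / (p_c[c] * p_o[o]))
    return mi


def rank_clusters(counts, pos_total, neg_total, score):
    """counts: {cluster_id: (pos_hits, neg_hits)}; returns ids, best first."""
    return sorted(
        counts,
        key=lambda c: score(counts[c][0], pos_total, counts[c][1], neg_total),
        reverse=True,
    )


# Toy verification-set counts (made up for illustration).
counts = {"a": (40, 1), "b": (30, 30), "c": (5, 0)}
by_likelihood = rank_clusters(counts, 100, 100, likelihood_ratio)
by_mi = rank_clusters(counts, 100, 100, mutual_information)
```

In the toy counts, cluster "b" fires equally often on positive and negative descriptors, so both criteria rank it last; the n top clusters of either ranking would then feed the MAP classifier.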
Likelihood – mutual information (figure panels labeled 5 and 25, shown for likelihood and for mutual information) –likelihood: more discriminant but very specific –mutual information: discriminant but not too specific
Results for test images Columns: Detection (Harris-Laplace) | 25 Likelihood | 10 Mutual Information
First image: 354 points | 49 correct + 37 incorrect | 31 correct + 20 incorrect
Second image: 277 points | 43 correct + 36 incorrect | 26 correct + 20 incorrect
Relaxation – propagation of probabilities
Classification Assign each test descriptor to the most probable cluster (MAP). Each descriptor assigned to one of the top n clusters is positive. If the number of positive descriptors is above a threshold p, classify the image as positive.
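The decision rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: mixture components are given diagonal covariances for brevity (the slides do not specify the covariance form), and the names `map_cluster`, `classify_image`, and `selected` are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption)."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def map_cluster(x, components):
    """Index of the most probable mixture component for descriptor x.

    components: list of (weight, mean, var) tuples.
    """
    scores = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    return max(range(len(scores)), key=scores.__getitem__)


def classify_image(descriptors, components, selected, p):
    """Count descriptors whose MAP cluster is among the selected top-n clusters;
    the image is positive when that count is above the threshold p."""
    positives = sum(1 for x in descriptors if map_cluster(x, components) in selected)
    return positives > p, positives


# Toy 2-D example: component 1 stands in for a selected "object" cluster.
components = [
    (0.5, (0.0, 0.0), (1.0, 1.0)),
    (0.5, (5.0, 5.0), (1.0, 1.0)),
]
descriptors = [(5.1, 4.9), (4.8, 5.2), (0.1, 0.0), (5.0, 5.0)]
is_positive, n_positive = classify_image(descriptors, components, selected={1}, p=2)
```

Sweeping p over a range of values on a labeled set is what produces the equal-error-rate curves reported as a function of p.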
Classification experiments Image library split into training, verification and test sets. Classes: Airplanes, Motorbikes, Wild Cats. Training phase 1: #positive images. Training phase 2: #positive images, #negative images (450). Testing: #positive images, #negative images (450).
Results: Motorbikes Equal-Error-Rates as a function of p. Receiver-Operating-Characteristic p=6
Classification results: ROC equal error rates Columns: Best (p, %), Estimated p (p, %), p=6 (%), Fergus (%). Rows: Airplanes (Harris 897, Kadir), Motorbikes (Harris, Kadir), Wild Cats (Harris, Kadir).
Affine-invariant part models Matching collections of local affine-invariant regions that map with an affine transformation => part. Matching works for unsegmented images. Model = a collection of parts.
Matching: Faces (figure annotation: spurious match)
Matching: 3D Objects (figure annotation: closeup)
Matching: Finding Repeated Patterns
Matching: Finding Symmetry
Modeling for Recognition Match multiple pairs of training images to produce several candidate parts. Use additional validation images to evaluate repeatability of parts and individual patches. Retain a fixed number of parts having the best repeatability score as class model. No background model
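The selection step can be sketched as below. The slides do not define the repeatability score, so this sketch assumes it is the fraction of validation images in which a candidate part is detected; the function names and the toy part names are hypothetical.

```python
def repeatability(detections):
    """Fraction of validation images in which the part was detected.

    detections: list of booleans, one per validation image (assumed definition).
    """
    return sum(detections) / len(detections)


def select_parts(candidates, k):
    """Keep the k candidate parts with the highest repeatability.

    candidates: {part_id: [bool per validation image]}
    """
    ranked = sorted(candidates, key=lambda pid: repeatability(candidates[pid]),
                    reverse=True)
    return ranked[:k]


# Toy candidates evaluated on four validation images (made-up data).
candidates = {
    "wing": [True, True, True, False],
    "spot": [True, False, False, False],
    "edge": [True, True, False, False],
}
model = select_parts(candidates, k=2)
```

Since no background model is used, the score depends only on how consistently a part reappears within its own class.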
The Butterfly Dataset 16 training images (8 pairs) per class 10 validation images per class 437 test images 619 images total
Butterfly Models Top two rows: pairs of images used for modeling. Bottom two rows: closeup views of some of the parts making up the models of the seven butterfly classes.
Recognition Top 10 models per class used for recognition Multi-class classification results: total model size (smallest/largest)
Classification Rate vs. Number of Parts (x-axis: number of parts)
Successful Detection Examples Panels: model part; test image with all ellipses; test image with matched ellipses. Color code: yellow = detected in test image, blue = occluded in test image. Note: only one of the two training images is shown.
Successful Detection Examples (cont.)
Detection of Multiple Instances
Detection Failures
Future Work Spatial relation –non-rigid models –relations between clusters and affine-invariant parts Feature selection: dimensionality reduction Shape information: appropriate descriptors Rapid search: structuring of the data