Programme
2pm     Introduction – Andrew Zisserman, Chris Williams
2.10pm  Overview of the challenge and results – Mark Everingham (Oxford)
2.40pm  Session 1: The Classification Task
        – Frederic Jurie presenting work by Jianguo Zhang (INRIA), 20 mins
        – Frederic Jurie (INRIA), 20 mins
        – Thomas Deselaers (Aachen), 20 mins
        – Jason Farquhar (Southampton), 20 mins
pm      Coffee break
4.30pm  Session 2: The Detection Task
        – Stefan Duffner / Christophe Garcia (France Telecom), 30 mins
        – Mario Fritz (Darmstadt), 30 mins
5.30pm  Discussion – Lessons learnt, and future challenges
The PASCAL Visual Object Classes Challenge
Mark Everingham, Luc Van Gool, Chris Williams, Andrew Zisserman
Challenge
Four object classes:
– Motorbikes
– Bicycles
– People
– Cars
Classification: predict object present/absent
Detection: predict bounding boxes of objects
Competitions
Train on any (non-test) data:
– How well do state-of-the-art methods perform on these problems?
– Which methods perform best?
Train on supplied data:
– Which methods perform best given the specified training data?
Data sets
train, val, test1:
– Sampled from the same distribution of images
– Images taken from PASCAL image databases
– "Easier" challenge
test2:
– Freshly collected for the challenge (mostly from Google Images)
– "Harder" challenge
Training and first test set

train+val:
Class       Images  Objects
Motorbikes  –       –
Bicycles    –       –
People      84      152
Cars        –       –
Total       684

test1:
Class       Images  Objects
Motorbikes  –       –
Bicycles    –       –
People      84      149
Cars        –       –
Total       689
Example images
Second test set

test2:
Class       Images  Objects
Motorbikes  –       –
Bicycles    –       –
People      –       –
Cars        –       –
Total       1282
Example images
Annotation for training
– Object class present/absent
– Sub-class labels (partial): car side, car rear, etc.
– Bounding boxes
– Segmentation masks (partial)
Issues in ground truth
What objects should be considered detectable?
– Subjective judgement, based on size in the image, level of occlusion, and whether detection is possible without 'inference'
– Disagreements will cause noise in the evaluation, i.e. incorrectly-judged false positives
"Errors" in training data:
– Un-annotated objects: requires machine-learning algorithms robust to noise on the class labels
– Inaccurate bounding boxes: hard to specify for some instances, e.g. bicycles
– The detection threshold was therefore set "liberally"
Results: Classification
Participants (per-class entries on test1 and test2):
– Aachen
– Darmstadt
– Edinburgh
– FranceTelecom
– HUT
– INRIA: dalal
– INRIA: dorko
– INRIA: jurie
– INRIA: zhang
– METU
– MPITuebingen
– Southampton
Methods
Interest points (LoG/Harris) + patches/SIFT:
– Histogram of clustered descriptors
  – SVM: INRIA: Dalal, INRIA: Zhang
  – Log-linear model: Aachen
  – Logistic regression: Edinburgh
  – Other: METU
– No clustering step
  – SVM with other kernels: MPITuebingen, Southampton
– Additional features
  – Color: METU; moments: Southampton
Methods (continued)
Image segmentation and region features: HUT
– MPEG-7 color, shape, etc.
– Self-organizing map
Classification by detection: Darmstadt
– Generalized Hough transform + SVM verification
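The histogram-of-clustered-descriptors pipeline shared by most classification entries can be sketched in a few lines. A NumPy sketch, assuming the cluster centers (the "visual vocabulary") have already been learned, e.g. by k-means over training descriptors; all names are illustrative, not from any entrant's code:

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Assign each local descriptor (e.g. a SIFT vector) to its
    nearest cluster center ('visual word') and return the normalized
    histogram used as the image-level feature."""
    # squared Euclidean distance from every descriptor to every center
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

A classifier (SVM, log-linear model, or logistic regression, as listed above) is then trained on these per-image histograms.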
Evaluation
Receiver Operating Characteristic (ROC):
– Equal Error Rate (EER)
– Area Under Curve (AUC)
[ROC plots illustrating the EER operating point and the AUC]
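Both measures follow directly from ranked confidence scores. A minimal plain-Python sketch (function names are illustrative, not from the challenge toolkit); note that the result slides appear to quote the true-positive rate at the equal-error point, so higher is better, whereas `eer` below returns the error rate itself:

```python
def roc_curve(scores, labels):
    """Sweep the decision threshold over the sorted confidence scores,
    returning the (false-positive rate, true-positive rate) points."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    fprs, tprs = [0.0], [0.0]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        fprs.append(fp / neg)
        tprs.append(tp / pos)
    return fprs, tprs

def auc(fprs, tprs):
    """Area under the ROC curve by trapezoidal integration."""
    return sum((fprs[i + 1] - fprs[i]) * (tprs[i + 1] + tprs[i]) / 2
               for i in range(len(fprs) - 1))

def eer(fprs, tprs):
    """Equal Error Rate: the point where the false-positive rate
    equals the miss rate (1 - TPR)."""
    for f, t in zip(fprs, tprs):
        if f >= 1 - t:
            return (f + (1 - t)) / 2
    return 1.0
```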
Competition 1: train+val/test1 1.1: Motorbikes Max EER: (INRIA: Jurie)
Competition 1: train+val/test1 1.2: Bicycles Max EER: (INRIA: Jurie, INRIA: Zhang)
Competition 1: train+val/test1 1.3: People Max EER: (INRIA: Jurie, INRIA: Zhang)
Competition 1: train+val/test1 1.4: Cars Max EER: (INRIA: Jurie)
Competition 2: train+val/test2 2.1: Motorbikes Max EER: (INRIA: Zhang)
Competition 2: train+val/test2 2.2: Bicycles Max EER: (INRIA: Zhang)
Competition 2: train+val/test2 2.3: People Max EER: (INRIA: Zhang)
Competition 2: train+val/test2 2.4: Cars Max EER: (INRIA: Zhang)
Classes and test1 vs. test2 Mean EER of ‘best’ results across classes –test1 : 0.946, test2 : 0.741
Conclusions?
Interest points + SIFT + clustering (histogram) + SVM did 'best':
– Log-linear model (Aachen) a close second
– Results with an SVM (INRIA) significantly better than with logistic regression (Edinburgh)
Method using detection (Darmstadt) did not do so well:
– Cannot exploit the context (= unintended bias?) of the whole image
– Used only a subset of the training data, and is able to localize
Competitions 3 & 4
Classification, with any (non-test) training data to be used
– No entries submitted
Results: Detection
Participants (per-class entries on test1 and test2):
– Aachen
– Darmstadt
– Edinburgh
– FranceTelecom
– HUT
– INRIA: dalal
– INRIA: dorko
– INRIA: jurie
– INRIA: zhang
– METU
– MPITuebingen
– Southampton
Methods
Generalized Hough Transform:
– Interest points, clustered patches/descriptors, GHT
– Darmstadt: SVM verification stage; side views with segmentation masks used for training
– INRIA: Dorko: SIFT features, semi-supervised clustering, single detection per image
"Sliding window" classifiers:
– Exhaustive search over translation and scale
– FranceTelecom: convolutional neural network
– INRIA: Dalal: SVM with SIFT-based input representation
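The exhaustive translation/scale search behind the "sliding window" detectors can be sketched as follows. The window size, stride, scale set, scoring classifier, and the crude nearest-neighbour rescaling are all illustrative stand-ins for the real systems' components:

```python
import numpy as np

def sliding_window_detect(image, classifier, win=(64, 64), stride=8,
                          scales=(1.0, 0.5)):
    """Score a fixed-size window at every grid position of every
    rescaled image, returning (score, box) candidates with boxes
    mapped back to original-image coordinates."""
    detections = []
    h, w = win
    for s in scales:
        # toy nearest-neighbour rescale (a real system would resample properly)
        img = image[::int(round(1 / s)), ::int(round(1 / s))] if s < 1 else image
        for y in range(0, img.shape[0] - h + 1, stride):
            for x in range(0, img.shape[1] - w + 1, stride):
                score = classifier(img[y:y + h, x:x + w])
                box = (x / s, y / s, (x + w) / s, (y + h) / s)
                detections.append((score, box))
    return detections
```

In practice a non-maximum-suppression step would then prune overlapping candidates before evaluation.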
Methods (continued)
Baselines: Edinburgh
– Detection confidence:
  – Class prior probability
  – Whole-image classifier (SIFT + logistic regression)
– Bounding box:
  – Entire image
  – Scale-normalized mean bounding box from the training data
  – Bounding box of all interest points
  – Bounding box of interest points weighted by 'class purity'
Evaluation
Correct detection: ≥50% overlap between predicted and ground-truth bounding boxes
– Multiple detections of one object counted as one true positive plus false positives
Precision/Recall:
– Average Precision (AP) as defined by TREC
– Mean of precision interpolated at recall = 0, 0.1, …, 0.9, 1
[Plots: measured vs. interpolated precision/recall curves]
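Both criteria are short to state in code. A minimal plain-Python sketch, assuming axis-aligned boxes as (x1, y1, x2, y2) tuples; function names are illustrative:

```python
def overlap(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2);
    a detection counts as correct when this exceeds 0.5."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def average_precision(recalls, precisions):
    """TREC-style AP: mean of precision interpolated at recall
    0, 0.1, ..., 1, where interpolated precision at r is the maximum
    measured precision at any recall >= r."""
    ap = 0.0
    for r in [i / 10 for i in range(11)]:
        ps = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(ps) if ps else 0.0
    return ap / 11
```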
Competition 5: train+val/test1 5.1: Motorbikes Max AP: (Darmstadt)
Competition 5: train+val/test1 5.2: Bicycles Max AP: (Edinburgh)
Competition 5: train+val/test1 5.3: People Max AP: (INRIA: Dalal)
Competition 5: train+val/test1 5.4: Cars Max AP: (INRIA: Dalal)
Competition 6: train+val/test2 6.1: Motorbikes Max AP: (Darmstadt)
Competition 6: train+val/test2 6.2: Bicycles Max AP: (Edinburgh)
Competition 6: train+val/test2 6.3: People Max AP: (INRIA: Dalal)
Competition 6: train+val/test2 6.4: Cars Max AP: (INRIA: Dalal)
Classes and test1 vs. test2 Mean AP of ‘best’ results across classes –test1 : 0.408, test2 : 0.195
Conclusions?
GHT method (Darmstadt) did 'best' on the classes entered:
– SVM verification stage effective
– Limited to lower recall (by use of side views only)
SVM (INRIA: Dalal) comparable for cars, better on test2:
– Smaller objects? Higher recall
Performance on bicycles and people was 'poor':
– "Non-solid" objects, articulation?
Competition 7: any train / test1
One entry, 7.3: People (INRIA: Dalal)
– Use of own training data improved results dramatically over the supplied-data result (AP: 0.013)
Competition 8: any train / test2
One entry, 8.3: People (INRIA: Dalal)
– Use of own training data improved results dramatically over the supplied-data result (AP: 0.021)
Conclusions
Classification:
– Variety of methods, and variations on SIFT+SVM
– Encouraging performance on all object classes
Detection:
– Variety of methods, and variations on the GHT
– Encouraging performance on cars and motorbikes; people and bicycles more challenging
Use of own training data:
– Only one entry (people detection), with much better results than using the provided training data
– State-of-the-art performance of pre-built classifiers/detectors remains to be assessed