2
Training Discriminative Computer Vision Models with Weak Supervision
Boris Babenko PhD Defense University of California, San Diego
3
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
4
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
5
Computer Vision Problems
Want to detect, recognize/classify, and track objects in images and videos Examples: Face detection for point-and-shoot cameras Pedestrian detection for cars Animal tracking for behavioral science Landmark/place recognition for search-by-image A large chunk of computer vision research focuses on detecting, recognizing and tracking objects in images and videos.
6
Old School Hand tuned models per application Example: face detection
- Specialized to the application and domain [Yang et al. ‘94]
7
New School Adopt methods from machine learning
Train a generic* system by providing labeled examples (supervised learning) Labeling examples is intuitive Adapt to new domains/applications Learn subtle cues that would be impossible to model by hand * Hand tuning/design still often required :-/ In recent years, rather than starting from scratch for each new application, vision systems have adopted methods from machine learning.
8
Supervised Learning ( , face) ( , face) ( , non-face) ( , non-face)
Training data: pairs of inputs and labels Train classifier to predict label for novel input TRAINING RUN TIME ( , face) ( ) ( , face) ( , non-face) ( , non-face)
9
Supervised Learning Training data: pairs {(x_i, y_i)}, i = 1..n. Most common case:
Inputs/instances: x_i in R^d. Labels: y_i in {-1, +1}. Want to train a classifier h : R^d -> {-1, +1}. Typically a classifier also outputs a confidence score, in addition to the label; in the binary classification case, you typically take the confidence score and threshold it.
10
Discriminative vs Generative
Generative: model the distribution of the data Discriminative: directly minimize classification error, model the boundary E.g. SVM, AdaBoost, Perceptron Tends to outperform generative models We can split learning approaches into two broad categories
11
Training Discriminative Model
Objective (minimize training error): minimize over h the sum over i of ℓ(y_i, h(x_i)) plus a regularization term λ R(h). The loss function ℓ is typically a convex upper bound on the 0/1 loss; the regularization term can help avoid over-fitting.
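A minimal sketch of this kind of objective, using the logistic loss as the convex surrogate and an L2 regularizer (the variable names and hyper-parameters are illustrative, not from the talk):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_linear_classifier(X, y, lam=0.1, lr=0.1, iters=500):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i * w.x_i)) + lam * ||w||^2.
    X: (n, d) array of inputs; y: (n,) array of labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)                               # y_i * w.x_i
        # gradient of the logistic-loss term: -(1/n) * sum_i y_i * x_i * sigmoid(-margin_i)
        grad = -(X * (y * sigmoid(-margins))[:, None]).mean(axis=0)
        grad += 2 * lam * w                                 # gradient of the regularizer
        w -= lr * grad
    return w

# At run time: confidence score = x_new @ w, predicted label = sign of the score.
```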
12
Weak Supervision Slightly overloaded term…
Any form of learning where the training data is missing some labels (i.e. latent variables) Weak supervision is a super set What are the possible goals? Strong supervision somehow more expensive
13
Object Detection w/ Weak Supervision
Goal: train object detector Strong: Weak: only presence of object is known, not location ( , face) ( , face) ( , non-face) + -
14
Object Detection w/ Weak Supervision
Goal: train object detector Strong: Weak: only presence of object is known, not location <- latent ( , face) ( , face) ( , non-face) + -
15
Weak Supervision: Advantages
Reduce labor cost Deal with inherent ambiguity & human error Automatically discover latent information
16
Training w/ Latent Variables
Classifier now takes in the input x AND a latent input α To predict a label: f(x) = max over α of h(x, α) Objective: minimize over h the sum over i of ℓ(y_i, max_α h(x_i, α)), i.e. find a classifier h such that, for the best choice of α, the loss will be small for all examples x
17
Training w/ Latent Variables
Classifier now takes in the input x AND a latent input α To predict a label: f(x) = max over α of h(x, α) Objective: minimize over h the sum over i of ℓ(y_i, max_α h(x_i, α)) Not convex! Find a classifier h such that, for the best choice of α, the loss will be small for all examples x
18
Training w/ Latent Variables
Two ways of solving Method 1: Alternate between finding latent variables and training classifier Finding latent variables given a fixed classifier may require domain knowledge E.g. EM (Dempster et al.), Latent Structural SVM (Yu & Joachims) – based on CCCP (Yuille & Rangarajan), Latent SVM (Felzenszwalb et al.), MI-SVM (Andrews et al.) - Two categories of algorithms for training
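A sketch of Method 1 as plain alternation; the helpers train_supervised, score, and candidates are hypothetical stand-ins for a supervised learner, a confidence function, and the set of possible latent values:

```python
def train_with_latent(examples, candidates, train_supervised, score, iters=10):
    """examples: list of (x, y) pairs; candidates(x): possible latent values for x.
    train_supervised(data): trains a classifier from (x, alpha, y) triples.
    score(clf, x, alpha): classifier confidence for input x with latent value alpha."""
    # Initialize latent variables arbitrarily (e.g. the first candidate for each example).
    latent = [candidates(x)[0] for (x, y) in examples]
    clf = None
    for _ in range(iters):
        # Step 1: latent variables fixed -> train the classifier as usual.
        data = [(x, a, y) for (x, y), a in zip(examples, latent)]
        clf = train_supervised(data)
        # Step 2: classifier fixed -> pick the best latent value for each example.
        latent = [max(candidates(x), key=lambda a: score(clf, x, a))
                  for (x, y) in examples]
    return clf, latent
```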
19
Training w/ Latent Variables
Method 2: Replace the hard max with “soft” approximation, and then do gradient descent E.g. MILBoost (Viola et al.), MIL-Logistic Regression (Ray et al.)
20
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Detection, Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
21
Object Detection w/ Weak Supervision
Goal: train object detector Only presence of object is known, not location Can’t “just throw these into a learning alg.” – very difficult to design invariant features + + -
22
Multiple Instance Learning (MIL)
(set of inputs, label) pairs provided MIL lingo: set of inputs = bag of instances Learner does not see instance labels Bag labeled positive if at least one instance in bag is positive [Keeler et al. ‘90, Dietterich et al. ‘97]
23
Object Detection w/ MIL
{ … } + Instance: image patch Instance Label: is face? Bag: whole image Bag Label: contains face? + { … } { … } - [Andrews et al. ’02, Viola et al. ’05, Dollar et al. 08, Galleguillos et al. 08]
24
MIL Notation Training input: bags and bag labels. Bags: X_i = {x_i1, …, x_im} Bag Labels: y_i ∈ {-1, +1} Instance Labels: y_ij
(unknown during training)
25
MIL Positive bag contains at least one positive instance, i.e. y_i = max_j y_ij
Goal: learn an instance classifier h(x) Corresponding bag classifier: H(X_i) = max_j h(x_ij)
26
MIL Algorithms Many “standard” learning algorithms have been adapted to the MIL scenario: SVM (Andrews et al. ‘02), Boosting (Viola et al. ‘05), Logistic Regression (Ray et al. ‘05) Some specialized algorithms also exist DD (Maron et al. ’98), EM-DD (Zhang et al. ‘02)
27
MIL Algorithms Objective: minimize bag error on training data
MILBoost (Viola et al. ‘05) Replace the max with a differentiable approximation Use functional gradient descent (Mason et al. ’00, Friedman ’01) Bag label according to the bag classifier H(X_i) = max_j h(x_ij), i.e. a bag is labeled positive if any of its instance scores is positive
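A sketch of the resulting differentiable bag objective (a noisy-OR style soft max over instance probabilities; the function names are illustrative, not the paper's code):

```python
import numpy as np

def bag_negative_log_likelihood(instance_scores, bag_labels):
    """instance_scores: list of 1-D arrays, one per bag, holding h(x_ij) for each instance.
    bag_labels: array of bag labels in {0, 1}."""
    nll = 0.0
    for scores, y in zip(instance_scores, bag_labels):
        p_inst = 1.0 / (1.0 + np.exp(-scores))        # instance probabilities
        p_bag = 1.0 - np.prod(1.0 - p_inst)           # soft "max": noisy-OR over the bag
        p_bag = np.clip(p_bag, 1e-12, 1.0 - 1e-12)
        nll -= y * np.log(p_bag) + (1 - y) * np.log(1.0 - p_bag)
    return nll

# Boosting adds weak classifiers greedily by following the (functional) gradient
# of this objective with respect to the instance scores h(x_ij).
```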
28
{ …} { …} Object Detection
Have a learning framework (MIL), and an algorithm to train classifier (MILBoost) Question: how exactly do we form a bag? { …} Sliding Window { …} Segmentation
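A sketch of the sliding-window option (the window size and stride below are arbitrary choices):

```python
def sliding_window_bag(image, win=64, stride=32):
    """Return a bag of instances: every win x win patch of the image.
    image: 2-D (or H x W x 3) array."""
    H, W = image.shape[:2]
    bag = []
    for top in range(0, H - win + 1, stride):
        for left in range(0, W - win + 1, stride):
            bag.append(image[top:top + win, left:left + win])
    return bag

# Image labeled "contains a face"  -> positive bag (at least one patch is a face).
# Image labeled "no face"          -> negative bag (every patch is a non-face).
```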
29
Forming a bag via segmentation
Pro: get more precise localization Con: segmentation algorithms often fail; require prior knowledge (e.g. number of segments) If segmentation fails, we might not see “the” positive instance in a positive bag Only way to prevent this is to use ALL possible segments… not practical
30
Multiple Stable Segmentations (MSS)
Solution: Multiple Stable Segmentations (Rabinovich et al. ‘06) A heuristic for picking out a few “good” segments from the huge set of all possible segments End up with more segments, but a higher chance of getting the “right” segment
31
Multiple Instance Learning with Stable Segmentation (MILSS)
Localization and Recognition Features: BOF on SIFT Classifier: MILBoost one-vs-all (for multiclass) { …} Multiple Stable Segmentation BOF [ Work with Carolina Galleguillos, Andrew Rabinovich & Serge Belongie – ECCV ‘08]
32
Results: Landmarks
33
Results: Landmarks
34
More segments = better results
Our System NCuts w/ k=6 NCuts w/ k=4
35
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
36
Object Detection with Parts
Pedestrians are non-rigid Difficult to design features that are invariant Decision boundary very complex Object parts are rigid
37
Object Detection with Parts
Naïve sol’n: label parts and train detectors Labor intensive Sub-optimal (e.g. “space between the legs”) Better sol’n: Use rough location of objects Treat part locations as latent variables [Mohan et al. ’01, Mikolajczyk et al. ‘04]
38
Multiple Component Learning (MCL)
How to train a part detector from weakly labeled data? How to train many, diverse part detectors How to combine part detectors and incorporate spatial information [Work with Piotr Dollar, Pietro Perona, Zhuowen Tu & Serge Belongie ECCV ‘08]
39
{ … } { … } + + MCL: One Part Detector Fits perfectly into MIL
Which part does it learn? { … } + + { … }
40
MCL: Diverse Parts Pedestrian images are “roughly aligned”
Choose random sections of the images to feed into MIL
41
MCL: Top 5 Learned Detectors
42
MCL: Combining Part Detectors
Run part detectors, get response map Compute Haar features on top, plug into Boosting Confidence maps from each part detector
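A rough sketch of the combination step, with integral-image box sums standing in for the Haar-like features (the particular feature rectangles are illustrative):

```python
import numpy as np

def box_sum_features(response_map, boxes):
    """Haar-like features on one part detector's response map: sums of responses
    over rectangular regions, computed with an integral image.
    boxes: list of (top, left, height, width) rectangles."""
    ii = response_map.cumsum(axis=0).cumsum(axis=1)
    ii = np.pad(ii, ((1, 0), (1, 0)))                  # zero row/column so empty prefixes sum to 0
    feats = []
    for (t, l, h, w) in boxes:
        s = ii[t + h, l + w] - ii[t, l + w] - ii[t + h, l] + ii[t, l]
        feats.append(s)
    return np.array(feats)

# Per window: concatenate box-sum features from every part's response map,
# then feed the resulting vectors to a boosted classifier.
```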
43
MCL: Results INRIA Pedestrian dataset
44
MCL: Results
45
MCL: Related Work P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan. "Object Detection with Discriminatively Trained Part-Based Models" IEEE PAMI. Sept 2009. Very similar model, uses SVM instead of Boosting, and an explicit shape model L. Bourdev, S. Maji, T. Brox, J. Malik. “Detecting people using mutually consistent poselet activations” ECCV 2010.
46
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
47
Object Tracking Problem: given location of object in first frame, track object through video Tracking by Detection: alternate training detector and running it on each frame
48
Tracking by Detection First frame is labeled
49
Tracking by Detection First frame is labeled
Classifier Online classifier (e.g. Online AdaBoost)
50
Tracking by Detection Grab one positive patch and some negative patches, and train/update the model.
51
Tracking by Detection Get next frame
52
Tracking by Detection Evaluate classifier in some search window
53
Tracking by Detection Evaluate classifier in some search window around the old location
54
Tracking by Detection Find max response; this becomes the new object location
55
Tracking by Detection Repeat for each new frame…
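The loop in the slides above, as a sketch; update, score, and windows_around are hypothetical helpers for the online classifier, its confidence, and the search window, and crop is defined below:

```python
def track(frames, init_box, update, score, windows_around):
    """Tracking by detection: the first frame's box is given; the classifier is
    updated online and the object is re-detected in every subsequent frame."""
    clf, box = None, init_box
    trajectory = [box]
    for t, frame in enumerate(frames):
        if t > 0:
            # Evaluate the classifier in a search window around the old location,
            # and move to the location with the maximum response.
            box = max(windows_around(frame, box),
                      key=lambda b: score(clf, crop(frame, b)))
            trajectory.append(box)
        # Grab a positive patch at the current location and negatives nearby,
        # then train/update the online classifier (update handles clf=None by initializing).
        pos = [crop(frame, box)]
        neg = [crop(frame, b) for b in windows_around(frame, box) if b != box][:10]
        clf = update(clf, pos, neg)
    return trajectory

def crop(frame, box):
    top, left, h, w = box
    return frame[top:top + h, left:left + w]
```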
56
Problems What if classifier is a bit off?
Tracker starts to drift How to choose training examples? [Work with Ming-Hsuan Yang & Serge Belongie – CVPR ’09, PAMI ‘11]
57
How to Get Training Examples
- Trouble localizing: if the tracked location is slightly off, a single positive patch gives a wrong training example. MIL: instead, put several patches around the estimated location into one positive bag and let the learner resolve the ambiguity.
58
How to Get Training Examples
MIL: form a positive bag from the patches near the tracked location
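A sketch of this bag construction (the radius and border handling are illustrative): rather than one positive patch, all patches near the estimated location go into a single positive bag.

```python
def positive_bag(frame, center, patch_size, radius=4):
    """All patches whose top-left corner lies within `radius` pixels of `center`
    form one positive bag; the correctly aligned patch is in there somewhere."""
    cy, cx = center
    bag = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy * dy + dx * dx <= radius * radius:
                bag.append(frame[cy + dy:cy + dy + patch_size,
                                 cx + dx:cx + dx + patch_size])
    return bag
```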
59
Experiments Compare MILTrack to: All params were FIXED
OAB1 = Online AdaBoost w/ 1 pos. per frame OAB5 = Online AdaBoost w/ 45 pos. per frame SemiBoost = Online Semi-supervised Boosting FragTrack = Static appearance model All params were FIXED 9 videos, labeled every 5 frames by hand (available on the web) [Grabner ‘06, Adam ‘06, Grabner ’08]
60
Experiments
61
Experiments
62
Experiments
63
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
64
Weakly Labeled Categories
Discovering sub-categories Object Detection Discovering super-categories Image Categorization Animals Birds Cats Bluejay Warbler In the next two projects that I’ll talk about, the categories that are given during training are at the wrong level of a hierarchy, and we will want to automatically discover how to break categories into sub-categories or group categories into super-categories.
65
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
66
Start with binary problem
Examples from the “positive” category Difficult to design invariant features [Images from the Sheffield Face Database]
67
Learning sub-categories
Discover sub-categories Train a detector for each sub-category
68
Multiple Pose Learning (MPL)
Naïve sol’n: run k-means on data, and then train classifier on each cluster Our sol’n: Simultaneously group data and train classifiers; treat group membership as a latent variable Similar to MIL… [Work with Piotr Dollar, Serge Belongie & Zhuowen Tu – Faces in Real-Life Images, ECCV ‘08]
69
MPLBoost Objective for MIL: minimize the loss of the bag score max_j h(x_ij) over the training bags
Instead of having many instances per bag, we now want to train many classifiers: each example's score is max_k h_k(x_i), and we minimize the loss of that score over the training examples
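A sketch of the MPL prediction rule and a noisy-OR soft version of the objective, mirroring the MIL objective with the max now taken over the K classifiers (variable names are illustrative):

```python
import numpy as np

def mpl_predict(scores_per_classifier):
    """scores_per_classifier: (K,) array holding h_k(x) for one example.
    The example's overall score is the max over the K sub-category classifiers."""
    return np.max(scores_per_classifier)

def mpl_negative_log_likelihood(scores, labels):
    """scores: (n, K) array of h_k(x_i); labels: (n,) array in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-scores))             # per-classifier probabilities
    p_ex = 1.0 - np.prod(1.0 - p, axis=1)         # soft max over the K classifiers
    p_ex = np.clip(p_ex, 1e-12, 1.0 - 1e-12)
    return -np.sum(labels * np.log(p_ex) + (1 - labels) * np.log(1.0 - p_ex))
```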
70
MPL: Results Some preliminary “toy” results LFW MNIST
71
Other MPL Results Same algorithm was published in T-K. Kim and R. Cipolla, “MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features,” NIPS 2008, developed independently. Recently, MPL was evaluated for pedestrian detection: C. Wojek, S. Walk, B. Schiele, “Multi-Cue Onboard Pedestrian Detection,” CVPR 2009: “In general however, MPLBoost seemed to be the most robust classifier with respect to challenging lighting conditions while being computationally less expensive than SVMs.”
72
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
73
Learning Super-categories
Application: classifying images into multiple categories Idea: use training data to learn a similarity metric, then plug it into kNN. This idea has been getting popular in the last couple of years.
74
Multiple Similarity Learning (MuSL)
Learn a single global (“monolithic”) similarity metric between the query image and the labeled dataset. There are a couple of different paradigms for doing this; this is the most straightforward. [Jones et al. ‘03, Chopra et al. ‘05, Goldberger et al. ‘05, Shakhnarovich et al. ’05, Weinberger et al. ‘08, Torralba et al. ’08, McFee et al. ‘10]
75
Multiple Similarity Learning (MuSL)
Learn a similarity metric for each category (1-vs-all). Alternatively to the monolithic approach, we could train a category-specific similarity metric for each category; with 4 categories, for example, we would train 4 metrics. To compute the similarity between the query image and an image from the labeled dataset, we use the similarity metric that was trained for that image's category. An example of this type of approach is training a multiple kernel SVM in a 1-vs-all fashion, since each category gets its own cue weighting. [Varma et al. ‘07, Frome et al. ‘07, Weinberger et al. ‘08, Nilsback et al. ’08]
76
How many should we train?
Per category: More powerful Do we really need thousands of metrics? Have to train for new categories Global/Monolithic: Less powerful Can generalize to new categories A natural question to ask is which of these paradigms is better. Per category is more powerful because each category requires different cues or features to be recognized, but there is something unsatisfying about this approach: suppose we had 1000 categories; do we need that many metrics, and would many of them be redundant? The second issue is that if we decide to add a new category to our dataset, we of course have to do more training. Neither of these paradigms is quite right. As a quick side note, [Ramanan & Baker, 2010] does a much better job of organizing all the different metric-learning approaches, but for our purposes we will focus on just these two.
77
Multiple Similarity Learning (MuSL)
Would like to explore space between two extremes Idea: Group categories into super-categories Learn a few similarity metrics, one for each super-category The goal of this work is.. Our idea is very simple: [Work with Steve Branson & Serge Belongie, ICCV ‘09]
78
Multiple Similarity Learning (MuSL)
Learn a few good similarity metrics. Here's what that looks like: we have 4 categories, and instead of training just 1 (monolithic) or 4 (category-specific) metrics, we train 2. We group the two flower categories together because they probably share a lot of features.
79
Learning a Similarity Metric
Training data: pairs of images with a binary label, 1 if the two images share a category label and 0 otherwise. Can treat the problem as binary classification; the confidence/score from the classifier -> similarity.
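A sketch of the training-set construction, with pair_features as a hypothetical function describing a pair of images; the classifier trained on these examples is then used as the similarity.

```python
import itertools

def make_pair_examples(images, labels, pair_features):
    """Each pair of training images becomes one binary example:
    label 1 if the two images share a category, 0 otherwise."""
    X, y = [], []
    for (i, a), (j, b) in itertools.combinations(enumerate(images), 2):
        X.append(pair_features(a, b))
        y.append(1 if labels[i] == labels[j] else 0)
    return X, y

# similarity(query, x_i) = confidence of the trained classifier on pair_features(query, x_i)
```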
80
Multiple Similarity Learning (MuSL)
Goal: train K similarity metrics, where K is much smaller than the number of categories, and recover a mapping s from categories to metrics, which tells us which metric to use for each category. At runtime, to compute the similarity of a query image to a labeled image x_i, we look up which metric to use by plugging the category label of x_i into s.
81
Naïve Solution Run pre-processing to group categories (e.g. k-means), then train as usual Drawbacks: Hacky / not elegant Not optimal: pre-processing not informed by class confusions, etc. How can we train & group simultaneously?
82
MuSL Boosting Objective: a boosting loss in which each category is assigned to the classifier/metric that works best for it; the inner quantity measures how well classifier k works with category c.
This quantity tells us how well classifier k works with category c.
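A sketch of the same idea written as a plain alternation rather than the boosting formulation in the talk; quality(metric, c) is a hypothetical stand-in for the "how well classifier k works with category c" quantity, and train_metric trains one similarity metric from the categories assigned to it.

```python
def musl_alternate(categories, K, train_metric, quality, iters=10):
    """Group categories into K super-categories and train one metric per group."""
    assign = {c: i % K for i, c in enumerate(categories)}   # arbitrary initial grouping
    metrics = [None] * K
    for _ in range(iters):
        # Train each metric on the categories currently assigned to it.
        for k in range(K):
            group = [c for c in categories if assign[c] == k]
            metrics[k] = train_metric(group)
        # Reassign each category to the metric that currently works best for it.
        for c in categories:
            assign[c] = max(range(K), key=lambda k: quality(metrics[k], c))
    return metrics, assign
```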
83
MuSL Results Created dataset with hierarchical structure of categories
We wanted to run experiments on data with a clear hierarchical structure for the categories, so we took a few categories from 3 different sources, 20 categories total. The plot shows categorization accuracy versus the number of metrics, comparing two variations of our method to two other approaches: grouping the categories with k-means, and grouping them randomly. Two things to notice: our original hypothesis was correct, using more metrics is indeed advantageous, but using just a few metrics achieves almost as good performance as using all 20. Merged categories from: Caltech 101 [Griffin et al.] Oxford Flowers [Nilsback et al.] UIUC Textures [Lazebnik et al.]
84
Recovered Super-categories
MuSL k-means
85
Generalizing to New Categories
In this experiment we took 10 more categories from those same 3 sources. We then took the metrics trained on the original 20 categories and assigned one of them to each new category. Looking at performance as we sweep the value of K, we see something interesting: as we start to use more metrics, performance goes up; however, if we continue increasing the number of metrics, performance starts to degrade significantly. This is a type of overfitting behavior. (Curves: new categories only; both new and old categories.) Training more metrics overfits!
86
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
87
PAC Analysis Probably Approximately Correct (PAC)
Unknown but fixed distribution of data Given i.i.d. training examples to learner Learner returns classifier What can we say about generalization/test error of a classifier? The probably approximately correct style analysis was formulated by valiant. In this framework a learner receives iid samples from some fixed distribution of data (distribution over inputs and outputs) and the learner returns some classifier. We’d like to guarantee that w/ high probability over choice of training set, the classifier that the learner returns will have low generalization error. And a typical bound looks like this: [Valiant ‘84]
88
PAC Bound depends on number of training examples
Sample complexity: how many examples you need to guarantee a certain error rate. A typical bound: generalization (test) error ≤ empirical (train) error + complexity term, where the complexity term shrinks as the number of training examples grows. The tradeoff between empirical error and complexity is essentially the bias-variance trade-off.
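For reference, a generic VC-style bound of this shape (standard form, not the slide's exact expression): with probability at least 1 - δ over the draw of the n training examples,

```latex
\operatorname{err}(h) \;\le\; \widehat{\operatorname{err}}(h)
  \;+\; \sqrt{\frac{d\left(\ln\tfrac{2n}{d} + 1\right) + \ln\tfrac{4}{\delta}}{n}},
```

where d is the VC dimension of the hypothesis class; the square-root term is the complexity term, and it shrinks as n grows.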
89
PAC Analysis of MIL Bound bag generalization error in terms of empirical error Data model (bottom up) Draw instances and their labels from fixed distribution Create bag from instances, determine its label (max of instance labels) Return bag & bag label to learner Goal of this work is to do a PAC style analysis of MIL. More precisely we’d like to bound… this type of analysis has been done before and it’s typically formalized as follows.
90
Data Model Bag 1: positive (instance space) Negative instance
91
Data Model Bag 2: positive Negative instance Positive instance
92
Data Model Bag 3: negative Negative instance Positive instance
93
PAC Analysis of MIL
Blum & Kalai (1998) If: access to a noise-tolerant instance learner, and instances drawn independently Then: bag sample complexity linear in the bag size r. Sabato & Tishby (2009) If: can minimize empirical error on bags Then: bag sample complexity logarithmic in r. Disconnect between theory and applications in computer vision, etc. The main thing to notice is that the complexity grows with r, the bag size, i.e. the number of instances in each bag.
94
MIL Example: Face Detection (Images)
{ … } + Bag: whole image Instance: image patch + { … } { … } - [Andrews et al. ’02, Viola et al. ’05, Dollar et al. 08, Galleguillos et al. 08]
95
MIL Example: Phoneme Detection (Audio)
Detecting ‘sh’ phoneme + Bag: audio of word Instance: audio clip “machine” - Suppose we are given audio clips of words being spoken, and these are labeled at a word level. Now suppose we wanted to train a ‘sh’ phoneme detector. We can do this using MIL: we’ll take the first clip of the word “machine”, chop it up into many small pieces, and label this bag positive because one of those short audio pieces is the ‘sh’ phoneme. We’ll do the same for the second audio, but we’ll label that bag negative because none of the short audio pieces in that bag is a ‘sh’. “learning” [Mandel et al. ‘08]
96
MIL Example: Event Detection (Video)
+ Bag: video Instance: few frames event of interest - [Ali et al. ‘08, Buehler et al. ’09, Stikic et al. ‘09]
97
Observations for these applications
Top down process: draw entire bag from a bag distribution, then get instances Instances of a bag lie on a manifold [Work with Nakul Verma, Piotr Dollar and Serge Belongie, ICML 2011]
98
Manifold Bags Negative region Positive region
In this case the instances are patches of a fixed size so there are two degrees of freedom. The other applications I mentioned would look similar because they also involve a sliding window of some sort.
99
Manifold Bags For such problems:
Existing analysis is not appropriate because the number of instances is infinite Expect sample complexity to scale with manifold parameters (curvature, dimension, volume, etc.)
100
Manifold Bags: Formulation
A manifold bag is drawn from a bag distribution. Instance hypotheses: h ∈ H, functions over instances. Corresponding bag hypotheses: h̄(b) = max over instances x in bag b of h(x).
101
Typical Route: VC Dimension
VC Dimension: characterizes the power/complexity of the hypothesis class Error Bound: [Vapnik & Chervonenkis, ‘71]
102
Typical Route: VC Dimension
Error Bound: with high probability, generalization (test) error ≤ empirical (train) error + a complexity term that shrinks with the # of training bags
103
Typical Route: VC Dimension
Error Bound: VC Dimension of bag hypothesis class
104
Relating the complexity of the bag hypothesis class to that of the instance hypothesis class: we do have a handle on the latter
For finite sized bags, Sabato & Tishby bound the bag complexity in terms of the instance complexity (via Sauer's lemma). Question: can we assume manifold bags are smooth and use a covering argument?
105
Relating the complexity of the bag hypothesis class to that of the instance hypothesis class: we do have a handle on the latter
For finite sized bags, Sabato & Tishby bound the bag complexity in terms of the instance complexity (via Sauer's lemma). Question: can we assume manifold bags are smooth and use a covering argument? Answer: No! We can show the VC dimension of the bag hypothesis class is unbounded even if the instance hypothesis class is very simple and the bags are smooth.
106
Issue Bag hypothesis class too powerful
For a positive bag, the hypothesis need only classify 1 instance as positive A negligible part of the bag is responsible for the positive label Infinitely many instances -> too much flexibility for the bag hypothesis Would like to ensure that a non-negligible portion of a positive bag is labeled positive
107
Solution Switch to real-valued hypothesis class
Incorporate a notion of margin Intuition: if bag is classified with large margin, and hypotheses are smooth, then large portion of the bag is classified positive So now we are in a real-valued setting and we are interested in the margin, so how do we analyze it? VC analysis is for binary hypotheses, and it turns out there is a corresponding notion for real-valued hypotheses called FAT SHATTERING!
108
Fat-shattering Dimension
fat_H(γ) = “fat-shattering” dimension of a real-valued hypothesis class H at scale γ Analogous to the VC dimension Relates generalization error to empirical error at margin γ, i.e. not only does the binary label have to be correct, the margin has to be at least γ [Anthony & Bartlett ‘99]
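Concretely, the empirical error at margin γ counts an example as an error unless it is classified correctly by at least γ (standard definition, stated here for completeness):

```latex
\widehat{\operatorname{err}}_{\gamma}(h) \;=\; \frac{1}{n}\sum_{i=1}^{n}
  \mathbf{1}\!\left[\, y_i \, h(x_i) \,<\, \gamma \,\right]
```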
109
Fat-shattering of Manifold Bags
Error Bound:
110
Fat-shattering of Manifold Bags
Error Bound: generalization error # of training bags empirical error at margin
111
Fat-shattering of Manifold Bags
Error Bound: fat shattering of bag hypothesis class
112
Fat-shattering of Manifold Bags
Bound the fat-shattering dimension of the bag hypothesis class in terms of the fat-shattering dimension of the instance hypothesis class Use covering arguments – approximate each manifold with a finite number of points Analogous to Sabato & Tishby's analysis of finite size bags τ₀ depends on the smoothness (λ) and curvature of the manifolds
113
Error Bound With high probability: Log term hidden
114
Error Bound With high probability: generalization error ≤ empirical error at margin + complexity term
115
Error Bound With high probability:
fat shattering of instance hypothesis class
116
Error Bound With high probability: number of training bags
117
Error Bound With high probability: manifold dimension
118
Error Bound With high probability: manifold volume
119
Error Bound With high probability:
term depends on smoothness parameters (e.g. curvy manifold bags -> high complexity): τ₀ depends on the smoothness (λ) and curvature of the manifolds
120
Error Bound With high probability: Obvious strategy for learner:
Minimize empirical error & maximize margin This is what most MIL algorithms already do
121
Learning from Queried Instances
Previous result assumes the learner has access to the entire manifold bag In practice the learner will only access a small number of instances ( ) Not enough instances -> might not draw a pos. instance from a pos. bag { … }
122
Learning from Queried Instances
Bound holds with failure probability increased by if
123
Take-home Message Increasing the number of training bags reduces the complexity term
Increasing the number of instances queried per bag reduces the failure probability Seems to contradict previous results (smaller bag size is better) Important difference between the bag size and the number of queried instances! If too few instances are queried we may only get negative instances from a positive bag Increasing the number of bags requires extra labels, increasing the number of queried instances does not
124
Iterative Querying Heuristic (IQH)
Problem: want many instances/bag, but have computational limits Heuristic solution: Grab small number of instances/bag, run standard MIL algorithm Query more instances from each bag, only keep the ones that get high score from current classifier At each iteration, train with small # of instances
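A sketch of the heuristic; bag.sample, train_mil, and score are hypothetical helpers for querying instances from a bag, running the standard MIL algorithm, and scoring an instance with the current classifier.

```python
def iterative_querying(bags, bag_labels, n_query, n_keep, train_mil, score, iters=3):
    """Train with only a small number of instances per bag at each iteration."""
    kept = [bag.sample(n_query) for bag in bags]           # initial small query per bag
    clf = None
    for _ in range(iters):
        clf = train_mil(kept, bag_labels)                  # standard MIL on few instances
        new_kept = []
        for prev, bag in zip(kept, bags):
            candidates = prev + bag.sample(n_query)        # query more instances
            candidates.sort(key=lambda x: score(clf, x), reverse=True)
            new_kept.append(candidates[:n_keep])           # keep only high-scoring ones
        kept = new_kept
    return clf
```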
125
Experiments Synthetic Data (will skip in interest of time) Real Data
INRIA Heads (images) TIMIT Phonemes (audio) All of the bounds that we’ve seen so far are upper bounds, and PAC is a worst case analysis so the bounds are most likely loose. So the goal of the experiments is to essentially provide an empirical lower bound.
126
INRIA Heads pad=16 pad=32 [Dalal et al. ‘05]
127
TIMIT Phonemes + “machine” - “learning” [Garofolo et al., ‘93]
128
Padding (volume) INRIA Heads TIMIT Phonemes
129
Number of Instances ( ) INRIA Heads TIMIT Phonemes
130
Number of Iterations (heuristic)
INRIA Heads TIMIT Phonemes
131
Outline Overview Weakly Labeled Location Weakly Labeled Categories
Supervised Learning Weakly Supervised Learning Weakly Labeled Location Object Localization and Recognition Object Detection with Parts Object Tracking Weakly Labeled Categories Object Detection with Sub-categories Object Recognition with Super-categories Theoretical Analysis of Multiple Instance Learning Conclusions & Future Work
132
Future Work: Vision Strong supervision is becoming cheaper (thanks to Amazon Mechanical Turk, etc) Ambiguity issues still exist (e.g. which bird parts should you ask to be labeled?) Would be nice to combine strong and weak supervision, add active element Would be nice to (automatically) learn better low/mid level features
133
Future Work: Learning Manifold properties
Would be nice to develop algorithms that take more advantage of the manifold structure of the data
134
Thanks! All of my collaborators: My committee: Funding:
Piotr Dollar, Nakul Verma, Carolina Galleguillos, Andrew Rabinovich, Steve Branson, Zhuowen Tu, Pietro Perona, Ming-Hsuan Yang, Kai Wang, Catherine Wah, Peter Welinder, Serge Belongie My committee: Serge Belongie, David Kriegman, Lawrence Saul, Virginia de Sa & Gert Lanckriet Funding: NSF IGERT, Google