Learning and Vision: Discriminative Models


1 Learning and Vision: Discriminative Models
Paul Viola Microsoft Research

2 Overview
Perceptrons
Support Vector Machines: face and pedestrian detection
AdaBoost: faces
Building fast classifiers: trading off speed for accuracy… face and object detection
Memory-based learning: Simard; Moghaddam

3 History Lesson
1950’s: Perceptrons are cool
Very simple learning rule, can learn “complex” concepts
Generalized perceptrons are better -- too many weights
1960’s: Perceptrons stink (M+P)
Some simple concepts require exponential # of features
Can’t possibly learn that, right?
1980’s: MLP’s are cool (R+M / PDP)
Sort of simple learning rule, can learn anything (?)
Create just the features you need
MLP’s stink: hard to train (slow / local minima)
Perceptrons are cool

4 Why did we need multi-layer perceptrons?
Problems like this seem to require very complex non-linearities. Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.

5 Why an exponential number of features?
14th order??? 120 features. N=21, k=5 --> 65,000 features
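The counts above follow from counting monomials: a polynomial of degree at most k over N inputs has C(N+k, k) terms. A quick sketch of the combinatorics (reading "120 features" as 2 inputs at 14th order, and "65,000" as N=21, k=5, is an assumption about the slide's figures):

```python
# Count monomial features of degree <= k over N inputs: C(N + k, k).
# Illustrates how the feature count explodes with input dimension and order.
from math import comb

def num_poly_features(n_inputs: int, degree: int) -> int:
    """Number of monomials of total degree <= `degree` in `n_inputs` variables."""
    return comb(n_inputs + degree, degree)

print(num_poly_features(2, 14))   # 120 features: a 14th-order fit in 2-D
print(num_poly_features(21, 5))   # 65,780 features: N=21, k=5
```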

6 MLP’s vs. Perceptron
MLP’s are hard to train…
Takes a long time (unpredictably long)
Can converge to poor minima
MLP’s are hard to understand: what are they really doing?
Perceptrons are easy to train…
A type of linear programming; polynomial time
One minimum, which is global
Generalized perceptrons are easier to understand: polynomial functions

7 Perceptron Training is Linear Programming
Polynomial time in the number of variables and in the number of constraints. What about linearly inseparable data?
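Setting the LP formulation aside, the classic on-line perceptron rule (which, as a later slide notes, "works great too") fits in a few lines: add any misclassified example to the weight vector until every constraint y_i (w · x_i) > 0 holds. The toy data below is illustrative:

```python
# Minimal on-line perceptron training sketch (not the LP formulation):
# repeatedly add misclassified examples to the weight vector until all
# constraints y_i * (w . x_i) > 0 are satisfied. Toy data, for illustration.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_perceptron(X, y, epochs=100):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) <= 0:        # constraint violated
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:                   # all constraints satisfied
            break
    return w

# Linearly separable toy data (last component is a constant bias feature).
X = [(0, 0, 1), (1, 0, 1), (0, 1, 1), (2, 2, 1)]
y = [-1, -1, -1, 1]
w = train_perceptron(X, y)
assert all(yi * dot(w, xi) > 0 for xi, yi in zip(X, y))
```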

8 Rebirth of Perceptrons
How to train effectively? Linear programming (… later quadratic programming), though on-line works great too
How to get so many features inexpensively?!? The kernel trick
How to generalize with so many features? VC dimension (or is it regularization?)
Support Vector Machines

9 Lemma 1: Weight vectors are simple
The weight vector lives in a sub-space spanned by the examples… Dimensionality is determined by the number of examples, not the complexity of the space.

10 Lemma 2: Only need to compare examples
This is called the kernel matrix.

11 Simple Kernels yield Complex Features
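A minimal sketch of why simple kernels yield complex features: a degree-2 polynomial kernel computes exactly the inner product of an explicit quadratic feature map, so the "complex" feature space never has to be built. The vectors here are illustrative:

```python
# Sketch: the degree-2 polynomial kernel k(x, z) = (x . z)^2 equals the inner
# product of explicit quadratic features phi(x) = (x1*x1, x1*x2, x2*x1, x2*x2),
# so kernel evaluation stands in for work in the expanded feature space.
from itertools import product

def phi(x):
    """Explicit quadratic feature map: all ordered pairwise products."""
    return [a * b for a, b in product(x, repeat=2)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product in the expanded space
kernel = dot(x, z) ** 2          # same number, computed in the input space
assert abs(explicit - kernel) < 1e-9
```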

12 But Kernel Perceptrons Can Generalize Poorly

13 Perceptron Rebirth: Generalization
Too many features… Occam is unhappy
Perhaps we should encourage smoothness?
This is not convergent. (Figure: a smoother decision boundary.)

14 Linear Program is not unique
The linear program can return any multiple of the correct weight vector...
Slack variables & a weight prior force the solution toward zero

15 Definition of the Margin
Geometric margin: the gap between negatives and positives, measured perpendicular to the hyperplane
Classifier margin: this gap for a given classifier’s decision boundary
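The geometric margin of a labeled point is its signed perpendicular distance to the hyperplane, y (w · x + b) / ||w||, and the classifier margin is the smallest such distance over the data. A sketch with illustrative values:

```python
# Geometric margin of a labeled point under the hyperplane w . x + b = 0:
# the signed distance y * (w . x + b) / ||w||. Values below are illustrative.
from math import sqrt

def geometric_margin(w, b, x, y):
    norm = sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hyperplane x1 + x2 - 3 = 0; one positive at (3, 2), one negative at (1, 1).
margins = [geometric_margin([1, 1], -3, x, y)
           for x, y in [((3, 2), +1), ((1, 1), -1)]]
print(min(margins))   # the classifier margin is the smallest such distance
```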

16 Require non-zero margin
Requiring only y_i (w · x_i) > 0 allows solutions with zero margin
Requiring y_i (w · x_i) ≥ 1 enforces a non-zero margin between examples and the decision boundary

17 Constrained Optimization
Find the smoothest function that separates the data
Quadratic programming (similar to linear programming)
A single minimum, which is global
Polynomial-time algorithm
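As a stand-in for the QP (which this sketch does not implement), the same objective, a smoothness penalty ||w||^2 plus hinge penalties on the margin constraints, can be minimized by plain gradient descent. Data and hyperparameters below are illustrative:

```python
# Sketch of the soft-margin objective, lam/2 * ||w||^2 + mean hinge loss,
# minimized by plain gradient descent (standing in for the QP solver).
# Toy data and hyperparameters are illustrative, not from the talk.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_svm(X, y, lam=0.01, lr=0.05, epochs=2000):
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]             # gradient of the smoother
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) < 1:               # margin constraint violated
                for j in range(d):
                    grad[j] -= yi * xi[j] / n     # hinge subgradient
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

X = [(0, 0, 1), (1, 0, 1), (3, 3, 1), (4, 2, 1)]  # last entry = bias feature
y = [-1, -1, 1, 1]
w = train_svm(X, y)
assert all(yi * dot(w, xi) > 0 for xi, yi in zip(X, y))
```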

18 Constrained Optimization 2

19 SVM: examples

20 SVM: Key Ideas
Augment inputs with a very large feature set (polynomials, etc.)
Use the Kernel Trick(TM) to do this efficiently
Enforce/encourage smoothness with a weight penalty
Introduce the margin
Find the best solution using Quadratic Programming

21 SVM: Zip Code recognition
Data dimension: 256
Feature space: 4th order, roughly 100,000,000 dimensions

22 The Classical Face Detection Process
Scan the image across locations and scales (from the smallest scale to larger scales): roughly 50,000 locations/scales per image

23 Classifier is Learned from Labeled Data
Training data: 5000 faces (all frontal); 10^8 non-faces
Faces are normalized for scale and translation
Many variations: across individuals, illumination, pose (rotation both in plane and out)
This situation with negative examples is actually quite common… negative examples are free

24 Key Properties of Face Detection
Each image contains thousands of locations/scales; the classifier is evaluated about 50,000 times per image
Faces are quite rare: perhaps 1 or 2 faces per image, so roughly 1000 times as many non-faces as faces
Need an extremely small false-positive rate: 10^-6
A reasonable goal is to make the false positive rate less than the true positive rate...

25 Sung and Poggio

26 Rowley, Baluja & Kanade: First Fast System - Low Res to Hi Res

27 Osuna, Freund, and Girosi

28 Support Vectors

29 P, O, & G: First Pedestrian Work
Flaws: a very poor model for feature selection; little real motivation for the features at all. Why not use the original image?

30 On to AdaBoost
Given a set of weak classifiers, none much better than random
Iteratively combine classifiers to form a linear combination
Training error converges to 0 quickly
Test error is related to the training margin
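A compact AdaBoost sketch with 1-D threshold "stumps" as the weak classifiers (toy data, not the talk's features): each round reweights the examples, picks the stump with the lowest weighted error, and adds it to the linear combination:

```python
# AdaBoost sketch with threshold "stumps" as weak learners: reweight examples,
# pick the lowest-weighted-error stump each round, combine linearly. Toy 1-D data.
import math

def best_stump(xs, ys, w):
    """Enumerate thresholds and polarities; return (thresh, polarity, error)."""
    best = None
    for t in sorted(set(xs)):
        for p in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (p if xi >= t else -p) != yi)
            if best is None or err < best[2]:
                best = (t, p, err)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n
    H = []                                   # list of (alpha, thresh, polarity)
    for _ in range(rounds):
        t, p, err = best_stump(xs, ys, w)
        err = max(err, 1e-12)
        alpha = 0.5 * math.log((1 - err) / err)
        H.append((alpha, t, p))
        # increase weight on mistakes, decrease on correct, then renormalize
        w = [wi * math.exp(-alpha * yi * (p if xi >= t else -p))
             for xi, yi, wi in zip(xs, ys, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return H

def predict(H, x):
    score = sum(a * (p if x >= t else -p) for a, t, p in H)
    return 1 if score >= 0 else -1

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [-1, -1, 1, 1, 1, -1, 1, 1]             # not separable by any single stump
H = adaboost(xs, ys, rounds=3)
assert all(predict(H, x) == y for x, y in zip(xs, ys))
```

No single stump classifies this data perfectly, yet three boosted rounds drive the training error to zero, matching the slide's claim about fast convergence.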

31 AdaBoost (Freund & Schapire)
(Figure: weak classifier 1; weights increased on its mistakes; subsequent weak classifiers.)
The final classifier is a linear combination of weak classifiers

32 AdaBoost Properties

33 AdaBoost: Super Efficient Feature Selector
Features = weak classifiers
Each round selects the optimal feature given:
the previously selected features
the exponential loss

34 Boosted Face Detection: Image Features
“Rectangle filters”: similar to Haar wavelets (Papageorgiou, et al.)
For real problems, results are only as good as the features used... this is the main piece of ad-hoc (or domain) knowledge
Rather than the pixels, we have selected a very large set of simple functions, sensitive to edges and other critical features of the image, at multiple scales
Since the final classifier is a perceptron, it is important that the features be non-linear… otherwise the final classifier will be a simple perceptron
We introduce a threshold to yield unique binary features
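Rectangle filters are cheap because of the integral image: any rectangle sum takes four array references, and a two-rectangle filter only a handful. A sketch (the specific filter layout here is illustrative):

```python
# Sketch of "rectangle filters" via an integral image: each filter value is a
# few sums over rectangular regions, each needing only 4 array references.
def integral_image(img):
    """ii[r][c] = sum of img over rows < r and cols < c (padded by one)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ii[r+1][c+1] = img[r][c] + ii[r][c+1] + ii[r+1][c] - ii[r][c]
    return ii

def rect_sum(ii, top, left, height, width):
    return (ii[top+height][left+width] - ii[top][left+width]
            - ii[top+height][left] + ii[top][left])

def two_rect_feature(ii, top, left, height, width):
    """Left-half sum minus right-half sum: an edge-sensitive filter."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

img = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 0, 0]]
ii = integral_image(img)
assert rect_sum(ii, 0, 0, 4, 4) == 8
assert two_rect_feature(ii, 0, 0, 4, 4) == 8   # strong vertical-edge response
```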

35

36

37 Feature Selection
For each round of boosting:
Evaluate each rectangle filter on each example
Sort examples by filter values
Select the best threshold for each filter (min Z)
Select the best filter/threshold pair (= feature)
Reweight examples
With M filters, T thresholds, N examples, and learning time L:
Naïve wrapper method: O( MT L(MTN) )
AdaBoost feature selector: O( MN )
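The "sort examples by filter values" step is what makes the threshold search cheap: with running sums of positive and negative weight below each candidate threshold, the weighted error at every threshold falls out in a single pass. A sketch with illustrative data:

```python
# Sketch of the sort-then-scan threshold search used at each boosting round:
# after sorting examples by filter value, one pass with running weight sums
# yields the best threshold's weighted error.
def best_threshold(values, labels, weights):
    total_pos = sum(w for y, w in zip(labels, weights) if y == 1)
    total_neg = sum(w for y, w in zip(labels, weights) if y == -1)
    order = sorted(range(len(values)), key=lambda i: values[i])
    s_pos = s_neg = 0.0               # weight of pos/neg below the threshold
    best_err, best_thresh = min(total_pos, total_neg), float("-inf")
    for i in order:
        if labels[i] == 1:
            s_pos += weights[i]
        else:
            s_neg += weights[i]
        # error if everything below is called neg, or everything below pos
        err = min(s_pos + (total_neg - s_neg), s_neg + (total_pos - s_pos))
        if err < best_err:
            best_err, best_thresh = err, values[i]
    return best_thresh, best_err

vals = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]   # illustrative filter outputs
ys   = [-1,  -1,   -1,   1,   1,  -1]
ws   = [1 / 6] * 6
t, e = best_threshold(vals, ys, ws)
assert e == 0.0                           # a perfect split exists here
```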

38 Example Classifier for Face Detection
A classifier with 200 rectangle features was learned using AdaBoost
95% correct detection on the test set with 1 in 14084 false positives
Not quite competitive...
(Figure: ROC curve for the 200-feature classifier.)

39 Building Fast Classifiers
Given a nested set of classifier hypothesis classes: Computational Risk Minimization
(Figure: ROC curve, % detection vs. % false positives; false negatives traded against false positives.)
In general, simple classifiers, while more efficient, are also weaker
We could define a computational risk hierarchy (in analogy with structural risk minimization)… a nested set of classifier classes
The training process is reminiscent of boosting: previous classifiers reweight the examples used to train subsequent classifiers
The goal of the training process is different: instead of minimizing errors, minimize false positives
(Figure: an image sub-window passes through Classifier 1, 2, 3 in sequence; “F” at any stage exits to NON-FACE, “T” continues.)

40 Other Fast Classification Work
Simard
Rowley (faces)
Fleuret & Geman (faces)

41 Cascaded Classifier
(Diagram: IMAGE SUB-WINDOW → 1 Feature → 5 Features → 20 Features → FACE; each stage rejects to NON-FACE; pass rates 50%, 20%, 2%.)
A 1-feature classifier achieves a 100% detection rate and about a 50% false positive rate.
A 5-feature classifier achieves a 100% detection rate and a 40% false positive rate (20% cumulative), using data from the previous stage.
A 20-feature classifier achieves a 100% detection rate with a 10% false positive rate (2% cumulative).
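The cascade's evaluation logic is simply early rejection: a window must pass every stage to be declared a face, so the vast majority of windows are discarded after a few cheap features. A sketch with illustrative toy stages:

```python
# Sketch of cascade evaluation: each stage is a cheap classifier; a window is
# rejected as soon as any stage says "non-face", so most windows exit early.
# The stage functions and thresholds below are illustrative.
def cascade_classify(window, stages):
    for stage in stages:
        if not stage(window):       # early rejection: stop computing features
            return False
    return True                     # survived every stage -> "face"

# Toy stages on a scalar "window score", with increasingly strict thresholds.
stages = [lambda x: x > 0.1, lambda x: x > 0.5, lambda x: x > 0.9]
assert cascade_classify(0.95, stages) is True
assert cascade_classify(0.3, stages) is False   # rejected at the second stage
```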

42 Comparison to Other Systems
Detection rate (%) versus number of false detections (422, 167, 110, 95, 78, 65, 50, 31, 10):
Viola-Jones: 93.7, 91.8, 91.1, 90.8, 90.0, 88.8, 85.2, 78.3
Rowley-Baluja-Kanade: 89.9, 90.1, 89.2, 86.0, 83.2
Schneiderman-Kanade: 94.4
Roth-Yang-Ahuja: (94.8)

43 Output of Face Detector on Test Images

44 Solving other “Face” Tasks
Profile detection
Facial feature localization
Demographic analysis

45 Feature Localization
Surprising properties of our framework:
The cost of detection is not a function of image size, just the number of features
Learning automatically focuses attention on key regions
Conclusion: the “feature” detector can include a large contextual region around the feature

46 Feature Localization Features
Learned features reflect the task

47 Profile Detection

48 More Results

49 Profile Features

50 Features, Features, Features
In almost every case:
Good features beat good learning
Learning beats no learning
Critical classifier ratio: AdaBoost >> SVM
This is not to say that a…

