CS 1699: Intro to Computer Vision Bias-Variance Trade-off + Other Models and Problems Prof. Adriana Kovashka University of Pittsburgh November 3, 2015.

Outline
Support Vector Machines (review + other uses)
Bias-variance trade-off
Scene recognition: Spatial pyramid matching
Other classifiers
– Decision trees
– Hidden Markov models
Other problems
– Clustering: agglomerative clustering
– Dimensionality reduction

Lines in R2: a line is ax + cy + b = 0. Let w = (a, c) and x = (x, y), so the line is w · x + b = 0. The distance from a point x0 to the line is |w · x0 + b| / ||w||. Kristen Grauman

Support vector machines
Want the line that maximizes the margin. The decision boundary is w · x + b = 0; the support vectors are the points lying on the margin boundaries w · x + b = -1 and w · x + b = 1.
Distance between a point x and the line: |w · x + b| / ||w||
For support vectors this distance is 1 / ||w||, so the margin is 2 / ||w||.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin line
1. Maximize the margin 2/||w||
2. Correctly classify all training data points: y_i (w · x_i + b) ≥ 1
Quadratic optimization problem:
Minimize (1/2) ||w||^2
Subject to y_i (w · x_i + b) ≥ 1
One constraint for each training point. Note the sign trick: with labels y_i in {-1, +1}, a single inequality covers both classes.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin line
Solution: w = Σ_i α_i y_i x_i and b = y_i – w · x_i (for any support vector)
Classification function: f(x) = sign(Σ_i α_i y_i x_i · x + b)
Notice that it relies on an inner product between the test point x and the support vectors x_i. (Solving the optimization problem also involves computing the inner products x_i · x_j between all pairs of training points.)
If f(x) < 0, classify as negative; otherwise classify as positive.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Nonlinear SVMs
Datasets that are linearly separable work out great. But what if the dataset is just too hard? We can map it to a higher-dimensional space, e.g. x → (x, x^2), where it becomes linearly separable.
Andrew Moore

Examples of kernel functions
Linear: K(x_i, x_j) = x_i · x_j
Gaussian RBF: K(x_i, x_j) = exp(-||x_i – x_j||^2 / (2σ^2))
Histogram intersection: K(h_i, h_j) = Σ_k min(h_i(k), h_j(k))
Andrew Moore
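A minimal sketch of these three kernels as plain NumPy functions (the bandwidth sigma and the example vectors below are illustrative, not from the slides):

    import numpy as np

    def linear_kernel(x, y):
        # K(x, y) = x . y
        return np.dot(x, y)

    def gaussian_rbf_kernel(x, y, sigma=1.0):
        # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

    def histogram_intersection_kernel(h1, h2):
        # K(h1, h2) = sum_k min(h1[k], h2[k])
        return np.sum(np.minimum(h1, h2))

    x = np.array([1.0, 2.0, 0.0])
    y = np.array([0.5, 1.0, 1.0])
    print(linear_kernel(x, y), gaussian_rbf_kernel(x, y), histogram_intersection_kernel(x, y))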

Allowing misclassifications
Trade off maximizing the margin against minimizing misclassification. Introduce a slack variable ξ_i ≥ 0 for each of the N data samples, and find the w that minimizes
(1/2) ||w||^2 + C Σ_{i=1..N} ξ_i
subject to y_i (w · x_i + b) ≥ 1 – ξ_i, where C is the misclassification cost.

What about multi-class SVMs?
Unfortunately, there is no "definitive" multi-class SVM formulation. In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs.
One vs. others
– Training: learn an SVM for each class vs. the others
– Testing: apply each SVM to the test example, and assign it to the class of the SVM that returns the highest decision value
One vs. one
– Training: learn an SVM for each pair of classes
– Testing: each learned SVM "votes" for a class to assign to the test example
Svetlana Lazebnik
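A sketch of the one vs. others scheme with scikit-learn's linear SVM (the toy three-class data below are made up for illustration):

    import numpy as np
    from sklearn.svm import LinearSVC

    # toy 3-class data: 2-D points around three different centers
    rng = np.random.default_rng(0)
    X = np.vstack([rng.standard_normal((20, 2)) + c for c in [(0, 0), (4, 0), (0, 4)]])
    y = np.repeat([0, 1, 2], 20)

    # training: learn one binary SVM per class (that class vs. all the others)
    svms = [LinearSVC(C=1.0).fit(X, (y == c).astype(int)) for c in range(3)]

    # testing: assign the class whose SVM returns the highest decision value
    x_test = np.array([[3.5, 0.5]])
    scores = [clf.decision_function(x_test)[0] for clf in svms]
    print("predicted class:", int(np.argmax(scores)))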

Evaluating Classifiers
Accuracy
– # correctly classified / # all test examples
Precision / recall
– Precision = # predicted positives that are truly positive / # predicted positive
– Recall = # predicted positives that are truly positive / # truly positive
F-measure = 2PR / (P + R)
We want the evaluation metric to be in some fixed range, e.g. [0, 1]: 0 = worst possible classifier, 1 = best possible classifier.

Precision / Recall / F-measure
In the example figure there are 10 images: 4 contain people (true positives set) and 6 do not (true negatives set); 5 are predicted to contain people (predicted positives) and 5 are predicted not to (predicted negatives); 2 of the predicted positives actually contain people.
Precision = 2 / 5 = 0.4
Recall = 2 / 4 = 0.5
F-measure = 2 * 0.4 * 0.5 / (0.4 + 0.5) ≈ 0.44
Accuracy = 5 / 10 = 0.5
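The same numbers can be computed directly; a minimal sketch where the two label vectors reproduce the 10-image example above:

    import numpy as np

    # 10 images: 4 truly contain people; 5 are predicted to contain people, 2 of them correctly
    true_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
    predictions = np.array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0])

    tp = np.sum((predictions == 1) & (true_labels == 1))       # 2 true positives
    precision = tp / np.sum(predictions == 1)                  # 2 / 5 = 0.4
    recall = tp / np.sum(true_labels == 1)                     # 2 / 4 = 0.5
    f_measure = 2 * precision * recall / (precision + recall)  # about 0.44
    accuracy = np.mean(predictions == true_labels)             # 5 / 10 = 0.5
    print(precision, recall, f_measure, accuracy)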

Support Vector Regression

Regression Regression is like classification except the labels are real valued Example applications: ‣ Stock value prediction ‣ Income prediction ‣ CPU power consumption Subhransu Maji

Regularized Error Function for Regression
In linear regression, we minimize the error function (1/2) Σ_n (y_n – t_n)^2 + (λ/2) ||w||^2, where y_n is the predicted value and t_n the true value.
For support vector regression, use the ε-insensitive error function instead. An example of an ε-insensitive error function:
E_ε(y – t) = 0 if |y – t| < ε, and |y – t| – ε otherwise.
Adapted from Huan Liu
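A minimal sketch comparing the squared error with the ε-insensitive error (ε and the sample values are illustrative):

    import numpy as np

    def squared_error(y_pred, t):
        return (y_pred - t) ** 2

    def eps_insensitive_error(y_pred, t, eps=0.1):
        # zero inside the eps-tube, linear outside it
        return np.maximum(0.0, np.abs(y_pred - t) - eps)

    t = 1.0                      # true value
    for y in [0.95, 1.2, 2.0]:   # predicted values
        print(y, squared_error(y, t), eps_insensitive_error(y, t))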

Slack Variables for Regression
For a target point to lie inside the tube: y_n – ε ≤ t_n ≤ y_n + ε.
Introduce slack variables ξ_n ≥ 0 (point above the tube) and ξ'_n ≥ 0 (point below the tube) to allow points to lie outside the tube.
Minimize: C Σ_n (ξ_n + ξ'_n) + (1/2) ||w||^2
Subject to: t_n ≤ y_n + ε + ξ_n and t_n ≥ y_n – ε – ξ'_n
Adapted from Huan Liu

Next time: Support Vector Ranking

Outline
Support Vector Machines (review + other uses)
Bias-variance trade-off
Scene recognition: Spatial pyramid matching
Other classifiers
– Decision trees
– Hidden Markov models
Other problems
– Clustering: agglomerative clustering
– Dimensionality reduction

Generalization
How well does a learned model generalize from the data it was trained on to a new test set?
Training set (labels known); test set (labels unknown)
Slide credit: L. Lazebnik

Generalization
Components of generalization error
– Bias: how much the average model over all training sets differs from the true model. Error due to inaccurate assumptions/simplifications made by the model.
– Variance: how much models estimated from different training sets differ from each other.
Underfitting: model is too "simple" to represent all the relevant class characteristics
– High bias and low variance
– High training error and high test error
Overfitting: model is too "complex" and fits irrelevant characteristics (noise) in the data
– Low bias and high variance
– Low training error and high test error
Slide credit: L. Lazebnik

Bias-Variance Trade-off Models with too few parameters are inaccurate because of a large bias (not enough flexibility). Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample). Slide credit: D. Hoiem

Fitting a model Figures from Bishop Is this a good fit?
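Bishop's polynomial fitting figures are easy to reproduce numerically: fit polynomials of increasing degree to a few noisy samples of a sine curve and watch the training error (a sketch; the data, noise level, and degrees are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)   # noisy targets

    for degree in [1, 3, 9]:
        coeffs = np.polyfit(x, t, degree)                        # least-squares polynomial fit
        train_err = np.mean((np.polyval(coeffs, x) - t) ** 2)
        print("degree", degree, "training error", round(train_err, 4))
    # degree 9 drives the training error to ~0 but oscillates wildly between
    # the samples: low training error, high test error (overfitting)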

With more training data Figures from Bishop

Bias-variance tradeoff
[Figure: error vs. model complexity. Training error keeps decreasing as complexity grows; test error first decreases, then increases. Low complexity = high bias, low variance (underfitting); high complexity = low bias, high variance (overfitting).]
Slide credit: D. Hoiem

Bias-variance tradeoff
[Figure: test error vs. model complexity, plotted separately for many vs. few training examples (high bias / low variance at low complexity, low bias / high variance at high complexity).]
Slide credit: D. Hoiem

Choosing the trade-off
Need a validation set; the validation set is separate from the test set.
[Figure: training error and test error vs. model complexity (high bias / low variance on the left, low bias / high variance on the right).]
Slide credit: D. Hoiem
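A sketch of picking model complexity with a held-out validation set, using polynomial degree as the complexity knob (all data and the degree range are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    def make_data(n):
        x = rng.uniform(0, 1, n)
        return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

    x_train, t_train = make_data(30)
    x_val, t_val = make_data(30)            # validation set, kept separate from the test set

    best_degree, best_err = None, np.inf
    for degree in range(1, 10):
        coeffs = np.polyfit(x_train, t_train, degree)
        val_err = np.mean((np.polyval(coeffs, x_val) - t_val) ** 2)
        if val_err < best_err:
            best_degree, best_err = degree, val_err
    print("chosen complexity (degree):", best_degree)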

Effect of Training Size
[Figure: for a fixed prediction model, error vs. number of training examples. Training and testing error converge toward each other as the number of training examples grows; the gap between them is the generalization error.]
Adapted from D. Hoiem

How to reduce variance?
– Choose a simpler classifier
– Use fewer features
– Get more training data
– Regularize the parameters
Slide credit: D. Hoiem

Regularization Figures from Bishop

Characteristics of vision learning problems
Lots of continuous features
– A spatial pyramid may have ~15,000 features
Imbalanced classes
– Often limited positive examples, practically infinite negative examples
Difficult prediction tasks
Recently, massive training sets became available
– If we have a massive training set, we want classifiers with low bias (high variance is ok) and reasonably efficient training
Adapted from D. Hoiem

Remember…
– No free lunch: machine learning algorithms are tools
– Three kinds of error: inherent (unavoidable), bias (due to over-simplifications), variance (due to inability to perfectly estimate parameters from limited data)
– Try simple classifiers first
– Better to have smart features and simple classifiers than simple features and smart classifiers
– Use increasingly powerful classifiers with more training data (bias-variance tradeoff)
Adapted from D. Hoiem

Outline
Support Vector Machines (review + other uses)
Bias-variance trade-off
Scene recognition: Spatial pyramid matching
Other classifiers
– Decision trees
– Hidden Markov models
Other problems
– Clustering: agglomerative clustering
– Dimensionality reduction

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories CVPR 2006 Svetlana Lazebnik Beckman Institute, University of Illinois at Urbana-Champaign Cordelia Schmid INRIA Rhône-Alpes, France Jean Ponce Ecole Normale Supérieure, France

Bags of words Slide credit: L. Lazebnik

Bag-of-features steps
1. Extract local features
2. Learn "visual vocabulary" using clustering
3. Quantize local features using visual vocabulary
4. Represent images by frequencies of "visual words"
Slide credit: L. Lazebnik

… Local feature extraction Slide credit: Josef Sivic

Learning the visual vocabulary … Slide credit: Josef Sivic

Learning the visual vocabulary Clustering … Slide credit: Josef Sivic

Learning the visual vocabulary Clustering … Slide credit: Josef Sivic Visual vocabulary

Image categorization with bag of words
Training
1. Extract bag-of-words representation
2. Train classifier on labeled examples using histogram values as features
Testing
1. Extract keypoints/descriptors
2. Quantize into visual words using the clusters computed at training time
3. Compute visual word histogram
4. Compute label using classifier
Slide credit: D. Hoiem
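A sketch of this training/testing pipeline with k-means for the vocabulary and a linear SVM on the word histograms; the random 128-D descriptors stand in for real local features such as SIFT, and all sizes are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def bow_histogram(descriptors, kmeans):
        # quantize each local descriptor to its nearest visual word, then count
        words = kmeans.predict(descriptors)
        hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
        return hist / hist.sum()

    # placeholder "training images": each is a set of 128-D local descriptors
    rng = np.random.default_rng(0)
    train_descs = [rng.standard_normal((200, 128)) + label for label in [0, 0, 1, 1]]
    train_labels = [0, 0, 1, 1]

    # learn the visual vocabulary by clustering all training descriptors
    kmeans = KMeans(n_clusters=50, n_init=10).fit(np.vstack(train_descs))

    # represent each image as a visual-word histogram and train the classifier
    X_train = np.array([bow_histogram(d, kmeans) for d in train_descs])
    clf = LinearSVC().fit(X_train, train_labels)

    # testing: quantize a new image's descriptors and classify its histogram
    test_descs = rng.standard_normal((200, 128)) + 1
    print(clf.predict([bow_histogram(test_descs, kmeans)]))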

What about spatial layout? All of these images have the same color histogram Slide credit: D. Hoiem

Spatial pyramid Compute histogram in each spatial bin Slide credit: D. Hoiem

Spatial pyramid [Lazebnik et al., CVPR 2006] Slide credit: D. Hoiem

Pyramid matching (Indyk & Thaper 2003; Grauman & Darrell 2005)
Matching using pyramid and histogram intersection for some particular visual word: for two images x_i and x_j, compute feature histograms at levels 0, 1, 2, 3 of the spatial grid, intersect the histograms at each level, and sum the matches with weights that favor the finer levels. The total weight is the value of the pyramid match kernel K(x_i, x_j).
Adapted from L. Lazebnik
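A sketch of this computation for the features assigned to one visual word: grid histograms at each level, histogram intersection per level, and a weighted sum that counts matches found at finer levels more heavily (the point sets and number of levels are illustrative):

    import numpy as np

    def grid_histogram(points, level):
        # count points (in [0,1)^2) falling in each cell of a 2^level x 2^level grid
        n = 2 ** level
        cells = np.floor(points * n).clip(0, n - 1).astype(int)
        hist = np.zeros((n, n))
        for cx, cy in cells:
            hist[cx, cy] += 1
        return hist.ravel()

    def pyramid_match(points_i, points_j, num_levels=3):
        # histogram intersection I[l] at each level (level 0 = whole image, level L = finest grid)
        L = num_levels
        I = [np.minimum(grid_histogram(points_i, l), grid_histogram(points_j, l)).sum()
             for l in range(L + 1)]
        # matches at the finest level count fully; new matches found only at
        # coarser levels are discounted by 1 / 2^(L - l)
        return I[L] + sum((I[l] - I[l + 1]) / 2 ** (L - l) for l in range(L))

    rng = np.random.default_rng(0)
    xi, xj = rng.uniform(size=(40, 2)), rng.uniform(size=(40, 2))
    print(pyramid_match(xi, xj))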

Scene category dataset (Fei-Fei & Perona 2005; Oliva & Torralba 2001)
Multi-class classification results (100 training images per class); Fei-Fei & Perona's method: 65.2%
Slide credit: L. Lazebnik

Scene category confusions
Difficult indoor images: kitchen, living room, bedroom
Slide credit: L. Lazebnik

Caltech101 dataset Multi-class classification results (30 training images per class) Fei-Fei et al. (2004) Slide credit: L. Lazebnik

Outline
Support Vector Machines (review + other uses)
Bias-variance trade-off
Scene recognition: Spatial pyramid matching
Other classifiers
– Decision trees
– Hidden Markov models
Other problems
– Clustering: agglomerative clustering
– Dimensionality reduction

Decision tree classifier
Example problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Slide credit: L. Lazebnik
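The key step in building such a tree is picking which attribute to split on first, usually by information gain (the expected reduction in label entropy). A minimal sketch on a few made-up restaurant examples (the attribute values and outcomes below are invented for illustration):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(attribute_values, labels):
        # entropy before the split minus the weighted entropy of each branch
        attribute_values, labels = np.array(attribute_values), np.array(labels)
        gain = entropy(labels)
        for v in np.unique(attribute_values):
            mask = attribute_values == v
            gain -= mask.mean() * entropy(labels[mask])
        return gain

    # made-up examples: (Patrons, Hungry) -> WillWait
    patrons = ["Full", "Some", "None", "Some", "Full", "None"]
    hungry  = ["Yes",  "Yes",  "No",   "No",   "No",   "Yes"]
    wait    = ["No",   "Yes",  "No",   "Yes",  "No",   "No"]

    for name, attr in [("Patrons", patrons), ("Hungry", hungry)]:
        print(name, "gain:", round(information_gain(attr, wait), 3))
    # split on the attribute with the highest gain, then recurse on each branch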

Decision tree classifier Slide credit: L. Lazebnik

Decision tree classifier Slide credit: L. Lazebnik

Sequence Labeling Problem
Unlike most computer vision problems, many NLP problems can be viewed as sequence labeling. Each token in a sequence is assigned a label. Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors (not i.i.d.).
foo bar blam zonk zonk bar blam
Adapted from Ray Mooney

Markov Model / Markov Chain
A finite state machine with probabilistic state transitions. Makes the Markov assumption that the next state depends only on the current state and is independent of previous history.
Ray Mooney

Sample Markov Model for POS
[Figure: state-transition diagram over the states start, Det, Noun, PropNoun, Verb, and stop, with transition probabilities on the edges.]
Ray Mooney

Sample Markov Model for POS
"Mary ate the cake."
P(PropNoun Verb Det Noun) = 0.4 * 0.8 * 0.25 * 0.95 * 0.1 = 0.0076
[Figure: the same state-transition diagram, following the path start → PropNoun → Verb → Det → Noun → stop.]
Adapted from Ray Mooney

Hidden Markov Model Probabilistic generative model for sequences. Assume an underlying set of hidden (unobserved) states in which the model can be (e.g. parts of speech). Assume probabilistic transitions between states over time (e.g. transition from POS to another POS as sequence is generated). Assume a probabilistic generation of tokens from states (e.g. words generated for each POS). Ray Mooney

Sample HMM for POS
[Figure: HMM state diagram with hidden states Det, Noun, PropNoun, and Verb (plus start and stop). Each state emits words, e.g. PropNoun → {John, Mary, Alice, Jerry, Tom}, Noun → {cat, dog, car, pen, bed, apple}, Det → {a, the, that}, Verb → {bit, ate, saw, played, hit, gave}; transition probabilities label the edges.]
Ray Mooney
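A minimal sketch of how such an HMM scores a tagged sentence: multiply the transition probabilities between hidden POS states with the emission probability of each word. The transition values follow the Markov-model figure above; the emission probabilities are invented for illustration:

    # transition probabilities between hidden states (values from the figure above)
    trans = {("start", "PropNoun"): 0.4, ("PropNoun", "Verb"): 0.8,
             ("Verb", "Det"): 0.25, ("Det", "Noun"): 0.95, ("Noun", "stop"): 0.1}

    # emission probabilities P(word | state) -- illustrative values only
    emit = {("PropNoun", "Mary"): 0.2, ("Verb", "ate"): 0.15,
            ("Det", "the"): 0.5, ("Noun", "cake"): 0.1}

    def sequence_probability(words, states):
        path = ["start"] + states + ["stop"]
        p = 1.0
        for prev, cur in zip(path[:-1], path[1:]):   # all transitions, start -> ... -> stop
            p *= trans[(prev, cur)]
        for state, word in zip(states, words):       # generate each word from its hidden state
            p *= emit[(state, word)]
        return p

    print(sequence_probability(["Mary", "ate", "the", "cake"],
                               ["PropNoun", "Verb", "Det", "Noun"]))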

Outline
Support Vector Machines (review + other uses)
Bias-variance trade-off
Scene recognition: Spatial pyramid matching
Other classifiers
– Decision trees
– Hidden Markov models
Other problems
– Clustering: agglomerative clustering
– Dimensionality reduction

Slide credit: D. Hoiem

Clustering Strategies
K-means
– Iteratively re-assign points to the nearest cluster center
Mean-shift clustering
– Estimate modes
Graph cuts
– Split the nodes in a graph based on assigned links with similarity weights
Agglomerative clustering
– Start with each point as its own cluster and iteratively merge the closest clusters

Agglomerative clustering

How to define cluster similarity?
– Average distance between points, maximum distance, minimum distance
How many clusters?
– Clustering creates a dendrogram (a tree)
– Threshold based on max number of clusters or based on distance between merges
Adapted from J. Hays
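A sketch with SciPy's hierarchical clustering: build the dendrogram with average linkage (average distance between points; 'complete' and 'single' give maximum and minimum distance), then cut it either by a maximum number of clusters or by merge distance. The data and thresholds are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.standard_normal((10, 2)) + c for c in [(0, 0), (5, 5), (0, 5)]])

    Z = linkage(X, method="average")                           # build the dendrogram (tree)

    labels_by_count = fcluster(Z, t=3, criterion="maxclust")   # threshold on number of clusters
    labels_by_dist = fcluster(Z, t=2.5, criterion="distance")  # threshold on merge distance
    print(labels_by_count)
    print(labels_by_dist)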

Why do we cluster?
Summarizing data
– Look at large amounts of data
– Represent a large continuous vector with the cluster number
Counting
– Histograms of texture, color, SIFT vectors
Segmentation
– Separate the image into different regions
Prediction
– Images in the same cluster may have the same labels
Slide credit: J. Hays, D. Hoiem

Slide credit: D. Hoiem

Dimensionality Reduction Figure from Genevieve Patterson, IJCV 2014