Classification II Tamara Berg CS 560 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan 1

Announcements HW3 due tomorrow, 11:59pm. Midterm 2 next Wednesday, Nov 4 –Bring a simple calculator –You may bring one 3x5 notecard of notes (both sides) Monday, Nov 2 we will have in-class practice questions 2

Midterm Topic List Probability –Random variables –Axioms of probability –Joint, marginal, conditional probability distributions –Independence and conditional independence –Product rule, chain rule, Bayes rule Bayesian Networks General –Structure and parameters –Calculating joint and conditional probabilities –Independence in Bayes Nets (Bayes Ball) Bayesian Inference –Exact Inference (Inference by Enumeration, Variable Elimination) –Approximate Inference (Forward Sampling, Rejection Sampling, Likelihood Weighting) –Networks for which efficient inference is possible 3

Midterm Topic List Naïve Bayes –Parameter learning including Laplace smoothing –Likelihood, prior, posterior –Maximum likelihood (ML), maximum a posteriori (MAP) inference –Application to spam/ham classification and image classification HMMs –Markov Property –Markov Chains –Hidden Markov Model (initial distribution, transitions, emissions) –Filtering (forward algorithm) –Application to speech recognition and robot localization 4

Midterm Topic List Machine Learning –Unsupervised/supervised/semi-supervised learning –K Means clustering –Hierarchical clustering (agglomerative, divisive) –Training, tuning, testing, generalization –Nearest Neighbor –Decision Trees –Boosting –Application of algorithms to research problems (e.g. visual word discovery, pose estimation, im2gps, scene completion, face detection) 5

The basic classification framework: y = f(x), where x is the input, f is the classification function, and y is the output. Learning: given a training set of labeled examples {(x_1, y_1), …, (x_N, y_N)}, estimate the parameters of the prediction function f. Inference: apply f to a never-before-seen test example x and output the predicted value y = f(x).
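To make the learning/inference split concrete, here is a minimal sketch (my own illustration, not from the slides): a toy nearest-mean classifier whose fit step estimates the parameters of f from labeled examples and whose predict step applies f to unseen inputs. All names are illustrative.

```python
import numpy as np

class NearestMeanClassifier:
    """Toy classifier: learning estimates one mean vector per class,
    inference assigns a test point to the class with the closest mean."""

    def fit(self, X, y):
        # Learning: estimate the parameters of f from labeled examples (x_i, y_i).
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Inference: apply f to never-before-seen examples and output y = f(x).
        dists = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]

# Tiny usage example with made-up 2-D points.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
clf = NearestMeanClassifier().fit(X_train, y_train)
print(clf.predict(np.array([[0.1, 0.0], [1.0, 0.9]])))  # -> [0 1]
```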

Classification by Nearest Neighbor Word vector document classification – here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in? 7

Classification by Nearest Neighbor 8

Classify the test document as the class of the document “nearest” to the query document (use a vector distance or similarity measure, e.g. Euclidean distance, to find the most similar document) 9

Classification by kNN Classify the test document as the majority class of the k documents “nearest” to the query document. 10
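A minimal sketch of this rule (my own illustration, not from the slides): represent each document as a word-count vector, find the k training documents with the smallest Euclidean distance to the query, and return the majority label among them.

```python
import numpy as np
from collections import Counter

def knn_classify(query_vec, train_vecs, train_labels, k=3):
    """k-nearest-neighbor document classification with Euclidean distance."""
    dists = np.linalg.norm(train_vecs - query_vec, axis=1)   # distance to every training doc
    nearest = np.argsort(dists)[:k]                          # indices of the k closest docs
    votes = Counter(train_labels[i] for i in nearest)        # majority vote over their labels
    return votes.most_common(1)[0][0]

# Usage with tiny made-up word-count vectors (columns = vocabulary words).
train_vecs = np.array([[5, 0, 1], [4, 1, 0], [0, 6, 2], [1, 5, 3]], dtype=float)
train_labels = ["sports", "sports", "politics", "politics"]
print(knn_classify(np.array([1, 4, 2], dtype=float), train_vecs, train_labels, k=3))
# -> "politics"
```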

Decision tree classifier Example problem: decide whether to wait for a table at a restaurant, based on the following attributes: 1.Alternate: is there an alternative restaurant nearby? 2.Bar: is there a comfortable bar area to wait in? 3.Fri/Sat: is today Friday or Saturday? 4.Hungry: are we hungry? 5.Patrons: number of people in the restaurant (None, Some, Full) 6.Price: price range ($, $$, $$$) 7.Raining: is it raining outside? 8.Reservation: have we made a reservation? 9.Type: kind of restaurant (French, Italian, Thai, Burger) 10.WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60) 11

Decision tree classifier 12

Decision tree classifier 13

14 Shall I play tennis today?

15

16 How do we choose the best attribute? (Figure callouts: leaf nodes; choose the next attribute for splitting.)

17 Criterion for attribute selection Which is the best attribute? –The one which will result in the smallest tree –Heuristic: choose the attribute that produces the “purest” nodes Need a good measure of purity!

18 Information Gain Which test is more informative? Splitting on Humidity (<=75% vs. >75%) or splitting on Wind (<=20 vs. >20)?

19 Information Gain Impurity/Entropy (informal) –Measures the level of impurity in a group of examples

20 Impurity (Figure: three example groups, ranging from a very impure group, to a less impure group, to minimum impurity.)

21 Entropy: a common way to measure impurity Entropy = -Σ_i p_i log_2(p_i), where p_i is the probability of class i. Compute it as the proportion of class i in the set.

22 2-Class Cases: What is the entropy of a group in which all examples belong to the same class? entropy = -1 log_2(1) = 0 (minimum impurity). What is the entropy of a group with 50% in either class? entropy = -0.5 log_2(0.5) - 0.5 log_2(0.5) = 1 (maximum impurity).
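A small sketch of this computation (my own illustration, not from the slides), reproducing both 2-class cases above:

```python
import math

def entropy(class_counts):
    """Entropy = -sum_i p_i * log2(p_i), with p_i the proportion of class i in the set."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]     # empty classes contribute 0
    return sum(p * math.log2(1.0 / p) for p in probs)      # same as -sum_i p_i * log2(p_i)

print(entropy([20, 0]))    # all examples in one class -> 0.0 (minimum impurity)
print(entropy([10, 10]))   # 50/50 split               -> 1.0 (maximum impurity)
```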

23 Information Gain We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned. Information gain tells us how useful a given attribute of the feature vectors is. We can use it to decide the ordering of attributes in the nodes of a decision tree.

24 Calculating Information Gain. Information Gain = entropy(parent) - [weighted average entropy(children)]. Example from the figure: a parent population of 30 instances is split into children of 17 and 13 instances; the weighted average of the children's entropies is subtracted from the parent entropy, giving an information gain of 0.38.
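Continuing the sketch above, the same quantity could be computed in code as follows. The per-node class counts below are hypothetical placeholders (the actual counts appear only in the slide's figure); they are chosen so the result comes out near the 0.38 quoted above.

```python
import math

def entropy(counts):
    # -sum_i p_i * log2(p_i), written as sum_i p_i * log2(1/p_i) to skip empty classes cleanly
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Information gain = entropy(parent) - weighted average entropy(children)."""
    n_parent = sum(parent_counts)
    weighted = sum((sum(child) / n_parent) * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical class counts: a 30-instance parent split into children of 17 and 13 instances.
parent = [14, 16]
children = [[13, 4], [1, 12]]
print(round(information_gain(parent, children), 2))   # -> 0.38
```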

25 e.g. based on information gain

Model Ensembles

Random Forests 30 A variant of bagging proposed by Breiman. The classifier consists of a collection of decision-tree-structured classifiers. Each tree casts a vote for the class of the input x.
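For a concrete illustration (assuming scikit-learn is available; this is my own example, not part of the slides), a random forest can be trained and its trees' majority vote evaluated in a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Train a forest of decision trees; each tree sees a bootstrap sample of the data,
# and the final prediction is the majority vote over the trees.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the ensemble's majority vote
```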

A simple algorithm for learning robust classifiers –Freund & Schapire, 1995 –Friedman, Hastie, Tibshirani, 1998 Provides an efficient algorithm for sparse visual feature selection –Tieu & Viola, 2000 –Viola & Jones, 2003 Easy to implement, doesn’t require external optimization tools. Used for many real problems in AI. Boosting 31

Boosting Defines a classifier using an additive model: F(x) = α_1 f_1(x) + α_2 f_2(x) + α_3 f_3(x) + … (F: strong classifier, f_t: weak classifiers, α_t: weights, x: input feature vector) 32

Boosting Defines a classifier using an additive model: F(x) = α_1 f_1(x) + α_2 f_2(x) + α_3 f_3(x) + … (F: strong classifier, f_t: weak classifiers selected from a family of weak classifiers, α_t: weights, x: input feature vector). We need to define a family of weak classifiers. 33

Adaboost Input: training samples Initialize weights on samples For T iterations: Select best weak classifier based on weighted error Update sample weights Output: final strong classifier (combination of selected weak classifier predictions)
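As an illustrative sketch of this loop (my own code, not from the course), here is a minimal discrete AdaBoost that uses decision stumps as the weak classifiers; labels are assumed to be in {-1, +1}:

```python
import numpy as np

def train_adaboost(X, y, T=10):
    """Minimal discrete AdaBoost; X is (n, d), y is in {-1, +1}."""
    n = len(y)
    w = np.ones(n) / n                       # initialize weights on samples
    strong = []                              # list of (alpha, feature, threshold, polarity)
    for _ in range(T):
        best = None
        # Select the best weak classifier (a decision stump) by weighted error.
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for pol in (+1, -1):
                    pred = pol * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)          # update sample weights: w_t <- w_t * exp(-alpha * y_t * h(x_t))
        w /= w.sum()
        strong.append((alpha, j, thr, pol))
    return strong

def predict_adaboost(strong, X):
    # Strong classifier: sign of the weighted combination of weak classifier predictions.
    score = sum(a * p * np.where(X[:, j] > t, 1, -1) for a, j, t, p in strong)
    return np.sign(score)

# Tiny usage example with made-up 1-D data.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
model = train_adaboost(X, y, T=5)
print(predict_adaboost(model, X))   # reproduces the training labels
```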

Boosting It is a sequential procedure. Each data point x_t (t = 1, 2, …) has a class label y_t ∈ {+1, -1} and a weight w_t = 1. 35

Toy example Weak learners are from the family of lines; h => p(error) = 0.5, i.e. it is at chance. Each data point has a class label y_t ∈ {+1, -1} and a weight w_t = 1. 36

Toy example This one seems to be the best. Each data point has a class label y_t ∈ {+1, -1} and a weight w_t = 1. This is a ‘weak classifier’: it performs slightly better than chance. 37

Toy example We update the weights: w_t ← w_t exp{-y_t H_t}. Each data point has a class label y_t ∈ {+1, -1}. 38

Toy example We update the weights: w_t ← w_t exp{-y_t H_t}. Each data point has a class label y_t ∈ {+1, -1}. 39

Toy example We update the weights: w_t ← w_t exp{-y_t H_t}. Each data point has a class label y_t ∈ {+1, -1}. 40

Toy example We update the weights: w_t ← w_t exp{-y_t H_t}. Each data point has a class label y_t ∈ {+1, -1}. 41

Toy example The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, f_2, f_3, f_4. 42

Adaboost Input: training samples Initialize weights on samples For T iterations: Select best weak classifier based on weighted error Update sample weights Output: final strong classifier (combination of selected weak classifier predictions)

Boosting for Face Detection 44

Face detection We slide a window over the image, extract features x for each window, and classify each window as face (+1) or not face (-1) with y = F(x).
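A schematic sketch of this sliding-window loop (my own illustration; extract_features and classify are placeholders for the feature extractor and trained classifier described on the following slides):

```python
import numpy as np

def sliding_window_detect(image, extract_features, classify, win=24, step=4):
    """Slide a win x win window over the image; return windows classified as +1 (face)."""
    detections = []
    H, W = image.shape[:2]
    for top in range(0, H - win + 1, step):
        for left in range(0, W - win + 1, step):
            window = image[top:top + win, left:left + win]
            x = extract_features(window)          # feature vector for this window
            if classify(x) == +1:                 # y = F(x): face vs. not face
                detections.append((top, left, win, win))
    return detections
```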

What is a face? Eyes are dark (eyebrows + shadows). Cheeks and forehead are bright. Nose is bright. Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04

Basic feature extraction Information type: –intensity Sum over: –gray and white rectangles Output: (sum over gray rectangle) - (sum over white rectangle) Separate output value for –Each type –Each scale –Each position in the window FEX(im) = x = [x_1, x_2, …, x_n] Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04 (example features from the figure: x_120, x_357, x_629, x_834)
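One way to compute such a rectangle feature efficiently is with an integral image, as in Viola-Jones; the sketch below is my own illustration, with a simple two-rectangle layout (gray left half, white right half) chosen for concreteness:

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum costs only 4 lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, h, w):
    """Sum of pixel intensities inside the rectangle, via the integral image."""
    padded = np.pad(ii, ((1, 0), (1, 0)))          # pad so top/left-edge rectangles work
    return (padded[top + h, left + w] - padded[top, left + w]
            - padded[top + h, left] + padded[top, left])

def two_rectangle_feature(img, top, left, h, w):
    """Haar-like feature: (sum over 'gray' left half) - (sum over 'white' right half)."""
    ii = integral_image(img.astype(float))
    return rect_sum(ii, top, left, h, w // 2) - rect_sum(ii, top, left + w // 2, h, w // 2)

# Usage on a made-up 8x8 'image' patch.
patch = np.arange(64, dtype=float).reshape(8, 8)
print(two_rectangle_feature(patch, top=0, left=0, h=4, w=8))   # -> -64.0
```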

Decision trees Stump: –1 root –2 leaves. If x_i > a then positive, else negative. A very simple “weak classifier”. Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04

Summary: Face detection Use decision stumps as weak classifiers. Use boosting to build a strong classifier. Use a sliding window to detect faces. (Figure: an example stump tests x_234 > 1.3; Yes → +1 face, No → non-face.)

Discriminant Function It can be an arbitrary function of x, such as: Nearest Neighbor, Decision Tree, Linear Functions (e.g. f(x) = w·x + b) 50