Lecture 4 Linear machine


Lecture 4: Linear machine. Outline: linear discriminant functions; generalized linear discriminant functions; Fisher's linear discriminant; the perceptron; the optimal separating hyperplane.

Linear discriminant functions. A two-class linear discriminant has the form g(x) = w^T x + w_0. The decision surface is g(x) = 0, and the signed distance from a point x to that surface is r = g(x)/||w||; if w is a unit vector, g(x) itself is the signed distance. The class is decided by the sign: g(x) > 0 gives class 1, g(x) < 0 gives class 2.
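A minimal numerical sketch (the values of w, w_0 and x are made up for illustration) of evaluating g(x) and the signed distance r = g(x)/||w|| in Python:

```python
import numpy as np

w = np.array([3.0, 4.0])   # weight vector (deliberately not unit length)
w0 = -5.0                  # bias term

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

x = np.array([2.0, 1.0])
r = g(x) / np.linalg.norm(w)   # signed distance from x to the surface g(x) = 0
print(g(x), r)                 # 5.0, 1.0 -> positive side, decide class 1
```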

Linear discriminant functions. If x1 and x2 are both on the decision surface, then w^T x1 + w_0 = w^T x2 + w_0 = 0, so w^T (x1 - x2) = 0. From the discriminant-function point of view, w is therefore normal to every vector lying in the decision surface: the surface is a hyperplane with normal vector w.

Linear discriminant functions. With more than two classes (say c classes) there are two obvious reductions to the two-class case: dichotomize (one class versus the rest), which needs c linear discriminants, or pairwise comparison, which needs c(c-1)/2 linear discriminants. Both leave regions where the classification is ambiguous.

Linear discriminant functions. Remember what we did in the Bayes decision class: define c linear discriminant functions g_i(x) = w_i^T x + w_{i0}, i = 1, ..., c, and let the overall classifier maximize g_i(x) at every x: assign x to class i if g_i(x) > g_j(x) for all j ≠ i. The resulting classifier is a linear machine: the space is divided into c regions, and the boundary between neighboring regions R_i and R_j is linear, because it satisfies g_i(x) = g_j(x), i.e. (w_i - w_j)^T x + (w_{i0} - w_{j0}) = 0.
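As a small illustration (all weight values are hypothetical), a linear machine with c = 3 classes in two dimensions assigns x to the class whose discriminant is largest:

```python
import numpy as np

W = np.array([[ 1.0,  0.0],     # w_1
              [ 0.0,  1.0],     # w_2
              [-1.0, -1.0]])    # w_3
w0 = np.array([0.0, 0.0, 0.5])  # biases w_{i0}

def classify(x):
    """Return argmax_i g_i(x) with g_i(x) = w_i^T x + w_{i0}."""
    return int(np.argmax(W @ x + w0))

print(classify(np.array([2.0, 0.5])))   # 0, i.e. class 1 wins in this toy setup
```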

Generalized linear discriminant functions. Map x to a new feature vector y = (y_1(x), ..., y_d'(x))^T and use a discriminant g(x) = a^T y that is linear in y. When we transform x this way, a linear discriminant in y-space can produce a non-linear decision boundary in the original feature space.

Generalized linear discriminant functions. In the two-class case, g(x) = g1(x) - g2(x). Example with scalar x and y = (1, x, x^2)^T: take a = (-3, 2, 5)^T, so g(x) = -3 + 2x + 5x^2. Then g(x) = 0 at x = 3/5 and x = -1; g(x) > 0 when x > 3/5 or x < -1, so we decide R1 there; g(x) < 0 when -1 < x < 3/5, so we decide R2 there.
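The worked example can be checked numerically; this brief sketch evaluates g(x) = a^T y with y = (1, x, x^2)^T at a few points and shows the sign pattern around the roots x = -1 and x = 3/5:

```python
import numpy as np

a = np.array([-3.0, 2.0, 5.0])   # coefficients from the example above

def g(x):
    """g(x) = a^T y with the augmented feature vector y = (1, x, x^2)."""
    return a @ np.array([1.0, x, x**2])

for x in [-2.0, -1.0, 0.0, 0.6, 1.0]:
    print(x, g(x))   # positive, zero, negative, zero, positive
```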

Fisher's linear discriminant. The goal: project the data from d dimensions onto a line, y = w^T x, and find the direction that maximizes the class separation after projection. The magnitude of w is irrelevant, since it only rescales y; the direction of w is what matters. The projected mean of class i is simply w^T m_i, where m_i is the sample mean of class i in the original space.

Fisher's linear discriminant. The distance between the projected means is |w^T m_1 - w^T m_2| = |w^T (m_1 - m_2)|. Our goal is to make this distance large relative to a measure of the variation within each class. Define the scatter of the projected class i as s_i^2 = Σ_{y in class i} (y - w^T m_i)^2; (s_1^2 + s_2^2)/n is an estimate of the pooled variance. Fisher's linear discriminant maximizes the criterion J(w) = |w^T (m_1 - m_2)|^2 / (s_1^2 + s_2^2) over all w.

Fisher's linear discriminant. Let S_i = Σ_{x in class i} (x - m_i)(x - m_i)^T and S_W = S_1 + S_2; note this is the sample version of the pooled covariance, up to a scaling factor. S_W is the within-class scatter matrix. Then s_1^2 + s_2^2 = w^T S_W w. Let S_B = (m_1 - m_2)(m_1 - m_2)^T, the between-class scatter matrix. Then |w^T (m_1 - m_2)|^2 = w^T S_B w, and the criterion becomes J(w) = (w^T S_B w) / (w^T S_W w), a generalized Rayleigh quotient.

Fisher's linear discriminant. Maximizing J(w) leads to the generalized eigenvalue problem S_B w = λ S_W w. Because S_B w is always in the direction of m_1 - m_2 for any w, the solution can be written directly as w ∝ S_W^{-1} (m_1 - m_2). Notice this is the same direction obtained from the Bayes decision rule when the two class densities are normal with equal covariance matrices.
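A minimal NumPy sketch of the Fisher direction w ∝ S_W^{-1}(m_1 - m_2); the two-blob toy data and all names are illustrative assumptions:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher direction for two classes given as (n_i x d) sample matrices."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)         # class-1 scatter
    S2 = (X2 - m2).T @ (X2 - m2)         # class-2 scatter
    Sw = S1 + S2                         # within-class scatter matrix
    w = np.linalg.solve(Sw, m1 - m2)     # solves Sw w = m1 - m2
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # synthetic class 1
X2 = rng.normal([3.0, 1.0], 1.0, size=(50, 2))   # synthetic class 2
print(fisher_direction(X1, X2))
```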

Multiple discriminant analysis. Now there are c classes. The goal is to project onto a (c-1)-dimensional space and maximize the between-class scatter relative to the within-class scatter. Why c-1? Because we need c-1 discriminant functions. Within-class scatter: S_W = Σ_{i=1}^{c} S_i, where S_i = Σ_{x in class i} (x - m_i)(x - m_i)^T. Total mean: m = (1/n) Σ_x x = (1/n) Σ_{i=1}^{c} n_i m_i, with n_i the number of samples in class i.

Multiple discriminant analysis. Between-class scatter: S_B = Σ_{i=1}^{c} n_i (m_i - m)(m_i - m)^T. Total scatter: S_T = Σ_x (x - m)(x - m)^T = S_W + S_B. Take a d×(c-1) projection matrix W and project every sample as y = W^T x.

Multiple discriminant analysis. The goal is to maximize J(W) = |W^T S_B W| / |W^T S_W W|, a ratio of determinants. The solution: the columns of W are the generalized eigenvectors corresponding to the largest c-1 eigenvalues of S_B w = λ S_W w. Since the projected scatter is not class-specific, this is more a dimension-reduction procedure that captures as much class information as possible than a classifier in itself.
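A sketch of the multi-class projection under the assumption that S_W is nonsingular; the columns of the returned W are the generalized eigenvectors of S_B w = λ S_W w with the largest eigenvalues (function and variable names are my own):

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(X, labels, n_components=None):
    """Return a d x (c-1) projection matrix W for multiple discriminant analysis."""
    classes = np.unique(labels)
    d = X.shape[1]
    m = X.mean(axis=0)                               # total mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for k in classes:
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk)                # within-class scatter
        Sb += len(Xk) * np.outer(mk - m, mk - m)     # between-class scatter
    if n_components is None:
        n_components = len(classes) - 1
    # eigh(Sb, Sw) solves the symmetric generalized problem Sb w = lambda Sw w,
    # returning eigenvalues in ascending order; keep the top n_components vectors.
    _, vecs = eigh(Sb, Sw)
    return vecs[:, ::-1][:, :n_components]
```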

Multiple discriminant analysis. Example with eleven classes, projected onto the first two eigenvectors:

Multiple discriminant analysis. As the rank of the eigenvectors used for the projection increases, the separability of the classes decreases.

Separating hyperplane. Let's do some data augmentation to make things easier. If we have a two-class decision boundary g(x) = w^T x + w_0 = 0, let y = (1, x^T)^T and a = (w_0, w^T)^T; then g(x) = a^T y. What's the benefit? In the augmented y-space the hyperplane a^T y = 0 always goes through the origin.

Linearly separable case. Now we want to use the training samples to find a weight vector a that classifies all samples correctly; if such an a exists, the samples are linearly separable. We need a^T y_i > 0 for every y_i in class 1 and a^T y_i < 0 for every y_i in class 2. If every y_i in class 2 is replaced by its negative, we are simply trying to find a such that a^T y_i > 0 for every sample. Such an a is a "separating vector" or "solution vector". Each equation a^T y_i = 0 defines a hyperplane through the origin of weight space with y_i as a normal vector, so the overall solution must lie on the positive side of every such hyperplane, i.e. in the intersection of n half-spaces.
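A small helper (hypothetical name) that performs the augmentation and the sign-flip normalization described above, so that any separating vector a must satisfy a^T y_i > 0 for every row of the returned matrix:

```python
import numpy as np

def augment_and_normalize(X1, X2):
    """Augment with a leading 1 and negate the class-2 samples."""
    Y1 = np.hstack([np.ones((len(X1), 1)), X1])   # y = (1, x) for class 1
    Y2 = np.hstack([np.ones((len(X2), 1)), X2])   # y = (1, x) for class 2
    return np.vstack([Y1, -Y2])                   # class-2 rows are negated
```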

Linearly separable case. Every vector in the grey region is a solution vector; the region is called the "solution region". A vector in the middle of the region looks like a better choice, and we can impose additional conditions to select it.

Linearly separable case. One natural condition: maximize the minimum distance from the samples to the separating plane.

Gradient descent procedure. How do we find a solution vector? A general approach: define a criterion function J(a) that is minimized exactly when a is a solution vector. Start with an arbitrary vector a(1); compute the gradient ∇J(a(1)); move from a(k) in the direction of the negative gradient to obtain a(k+1) = a(k) - η(k) ∇J(a(k)); iterate, and stop when the change is smaller than a threshold.
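A generic sketch of the procedure; grad_J is assumed to be a callable returning ∇J(a), and the fixed step size and stopping threshold are illustrative defaults rather than values from the lecture:

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, tol=1e-6, max_iter=1000):
    """a(k+1) = a(k) - eta * grad J(a(k)); stop when the step becomes tiny."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(a)
        a = a - step
        if np.linalg.norm(step) < tol:
            break
    return a
```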

Perceptron. The perceptron criterion is J_p(a) = Σ_{y ∈ Y(a)} (-a^T y), where Y(a) is the set of samples misclassified by a; when Y(a) is empty, define J_p(a) = 0. Because a^T y ≤ 0 whenever y is misclassified, J_p(a) is non-negative. The gradient is simple: ∇J_p(a) = Σ_{y ∈ Y(a)} (-y). The update rule is therefore a(k+1) = a(k) + η(k) Σ_{y ∈ Y(a(k))} y, where η(k) is the learning rate.
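A minimal batch-perceptron sketch operating on normalized, augmented samples (for instance the output of the hypothetical augment_and_normalize helper above); it applies a ← a + η Σ_{y ∈ Y(a)} y until no sample is misclassified or an epoch limit is hit:

```python
import numpy as np

def perceptron(Y, eta=1.0, max_epochs=100):
    """Batch perceptron: Y holds one normalized, augmented sample per row."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        misclassified = Y[Y @ a <= 0]      # the set Y(a)
        if len(misclassified) == 0:        # J_p(a) = 0: a is a solution vector
            break
        a = a + eta * misclassified.sum(axis=0)
    return a
```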

Perceptron. The perceptron adjusts a only according to misclassified samples; correctly classified samples are ignored. The final a is a linear combination of the training points. To get good performance on test samples we need a large set of training samples, yet a large training set is almost certainly not linearly separable. In the linearly non-separable case the iteration never stops. We can let η(k) → 0 as k → ∞, but then how should the rate of decrease be chosen?
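One common (though not the only) choice is a single-sample update with η(k) = η_0/k, so the corrections shrink over time even if the data are not separable; the sketch below is an assumption-laden illustration, not a rule prescribed by the slides:

```python
import numpy as np

def perceptron_decaying(Y, eta0=1.0, max_iter=10000):
    """Single-sample perceptron with a decaying learning rate eta(k) = eta0 / k."""
    a = np.zeros(Y.shape[1])
    k = 0
    for _ in range(max_iter):
        wrong = np.where(Y @ a <= 0)[0]    # indices of misclassified samples
        if len(wrong) == 0:
            break
        k += 1
        a = a + (eta0 / k) * Y[wrong[0]]   # correct one misclassified sample
    return a
```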

Optimal separating hyperplane. The perceptron finds one separating plane out of infinitely many possibilities; how do we find the best among them? The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point. This yields a unique solution and better test-sample performance.

Optimal separating hyperplane. Notation change: here y_i denotes the class label (±1) of sample i, and x_i its augmented feature vector. The problem is: minimize ||a||^2 subject to y_i a^T x_i ≥ 1, i = 1, ..., N. We shall visit the support vector machine next time.
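A hedged sketch of this optimization, assuming labels y_i ∈ {-1, +1} and using SciPy's general-purpose SLSQP solver in place of a proper quadratic-programming routine (the SVM machinery of the next lecture is the real treatment):

```python
import numpy as np
from scipy.optimize import minimize

def optimal_hyperplane(X, y):
    """Minimize ||a||^2 subject to y_i * a^T x_i >= 1 on augmented samples."""
    y = np.asarray(y, dtype=float)                  # labels in {-1, +1}
    Xa = np.hstack([np.ones((len(X), 1)), X])       # augmented samples (1, x)
    cons = {'type': 'ineq', 'fun': lambda a: y * (Xa @ a) - 1.0}
    res = minimize(lambda a: a @ a, x0=np.zeros(Xa.shape[1]),
                   constraints=[cons], method='SLSQP')
    return res.x                                    # a = (w0, w)
```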