Linear Discriminant Functions

Linear Discriminant Functions
Speaker: 虞台文

Contents: Introduction, Linear Discriminant Functions and Decision Surface, Linear Separability, Learning, Gradient Descent Algorithm, Newton's Method.

Linear Discriminant Functions Introduction

Decision-Making Approaches
Probabilistic approaches: based on the underlying probability densities of the training sets. For example, the Bayesian decision rule assumes that the underlying probability densities are available.
Discriminating approaches: assume that the proper forms of the discriminant functions are known, and use the samples to estimate the values of the classifier's parameters.

Linear Discriminant Functions
Easy to compute, analyze, and learn. Linear classifiers are attractive candidates as initial, trial classifiers. Learning proceeds by minimizing a criterion function, e.g., the training error. Difficulty: a small training error does not guarantee a small test error.

Linear Discriminant Functions Linear Discriminant Functions and Decision Surface

Linear Discriminant Functions
The two-category classification: g(x) = wᵀx + w0, where w is the weight vector and w0 is the bias or threshold. Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
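
As a concrete illustration, here is a minimal NumPy sketch of this rule; the weight vector, bias, and sample are made-up values, not taken from the slides.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant function: g(x) = w^T x + w0."""
    return np.dot(w, x) + w0

# Hypothetical two-dimensional example (weights and sample are made up).
w = np.array([1.0, -2.0])   # weight vector
w0 = 0.5                    # bias or threshold
x = np.array([3.0, 1.0])

label = 1 if g(x, w, w0) > 0 else 2   # decide omega_1 if g(x) > 0, else omega_2
print(g(x, w, w0), label)             # 1.5  1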

Implementation
[Figure: a linear unit computing g(x) from inputs x1, x2, …, xd with weights w1, w2, … and bias w0.]

Implementation 1 1 w0 1 1 w0 w1 w2 wn x0=1 x1 x2 xd

Decision Surface 1 1 w0 w1 w2 wn x0=1 x1 x2 xd

Decision Surface
[Figure: the hyperplane g(x) = 0 in feature space, with a point x, the normal vector w, and offset w0/||w|| from the origin.]
1. A linear discriminant function divides the feature space by a hyperplane.
2. The orientation of the surface is determined by the normal vector w.
3. The location of the surface is determined by the bias w0.
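
A small sketch (illustrative values only) of how w and w0 determine the surface: the signed distance of a point x to the hyperplane g(x) = 0 is g(x)/||w||, and for the origin this reduces to w0/||w||.

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance from x to the hyperplane w^T x + w0 = 0 (the sign tells the side)."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

w = np.array([3.0, 4.0])    # normal vector: fixes the orientation of the surface
w0 = -5.0                   # bias: fixes the location of the surface
print(signed_distance(np.zeros(2), w, w0))   # distance of the origin: w0/||w|| = -1.0
```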

Augmented Space
In the d-dimensional feature space (illustrated in 2-D), the decision surface is g(x) = w1x1 + w2x2 + w0 = 0, which in general does not pass through the origin.
Let y = (1, x1, …, xd)ᵀ be the augmented feature vector and a = (w0, w1, …, wd)ᵀ the augmented weight vector, so that g(x) = aᵀy.
[Figure: the line g(x) = w1x1 + w2x2 + w0 = 0 in the (x1, x2) plane, and the corresponding plane aᵀy = 0 through the origin of the (y0, y1, y2) space.]

Augmented Space
Decision surface in feature space: g(x) = 0 passes through the origin only when w0 = 0.
Decision surface in augmented (d+1)-dimensional space: aᵀy = 0 always passes through the origin.
By using this mapping, the problem of finding the weight vector w and the threshold w0 is reduced to finding a single vector a.
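
A quick sketch (made-up numbers) of the augmentation, checking that aᵀy reproduces g(x) = wᵀx + w0:

```python
import numpy as np

w = np.array([1.0, -2.0])
w0 = 0.5
x = np.array([3.0, 1.0])

y = np.concatenate(([1.0], x))   # augmented feature vector  y = (1, x1, ..., xd)
a = np.concatenate(([w0], w))    # augmented weight vector   a = (w0, w1, ..., wd)

assert np.isclose(a @ y, w @ x + w0)   # a^T y == w^T x + w0
```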

Linear Discriminant Functions Linear Separability

The Two-Category Case 2 2 1 1 Not Linearly Separable

The Two-Category Case
Given a set of samples y1, y2, …, yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that
aᵀyi > 0 if yi is labeled ω1,
aᵀyi < 0 if yi is labeled ω2,
then the samples are said to be linearly separable. How to find a?

Normalization
Withdrawing all labels of the samples and replacing the ones labeled ω2 by their negatives, the problem is equivalent to finding a vector a such that aᵀyi > 0 for all samples. How to find a?
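
A sketch of this normalization on a small hand-made set of augmented samples; after negating the ω2 samples, a candidate a solves the problem exactly when aᵀyi > 0 for every i.

```python
import numpy as np

# Toy augmented samples (first component is the constant 1) and their labels.
Y = np.array([[1.0,  2.0,  1.0],    # labeled omega_1
              [1.0,  1.5,  2.0],    # labeled omega_1
              [1.0, -1.0, -1.0],    # labeled omega_2
              [1.0, -2.0, -0.5]])   # labeled omega_2
labels = np.array([1, 1, 2, 2])

# Normalization: replace the samples labeled omega_2 by their negatives.
Y_norm = np.where((labels == 2)[:, None], -Y, Y)

def separates(a, Y_norm):
    """After normalization, a is a solution iff a^T y_i > 0 for all samples."""
    return bool(np.all(Y_norm @ a > 0))

print(separates(np.array([0.0, 1.0, 1.0]), Y_norm))   # True for this toy set
```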

Solution Region in Feature Space
[Figure: the separating plane and the corresponding solution region in feature space, shown for the normalized case.]

Solution Region in Weight Space
[Figure: the separating planes and the solution region in weight space; the solution region can be shrunk by imposing a margin.]

Linear Discriminant Functions Learning

Criterion Function
To facilitate learning, we usually define a scalar criterion function. It represents the penalty or cost of a solution, and our goal is to minimize its value. Learning thus becomes a function-optimization problem.

Learning Algorithms
To design a learning algorithm, we face the following questions: Whether to stop, i.e., is the criterion satisfied? In what direction to proceed? How long a step to take? For example, steepest descent updates a ← a − η(k) ∇J(a), where η(k) is the learning rate.

Criterion Functions: The Two-Category Case
J(a) = the number of misclassified patterns.
[Figure: plot of J(a) with the solution state marked; away from it, it is unclear where to go.]
Is this criterion appropriate?

Criterion Functions: The Two-Category Case
Perceptron criterion function: Jp(a) = Σy∈Y (−aᵀy), where Y is the set of misclassified patterns.
[Figure: plot of Jp(a) with the solution state marked.]
Is this criterion better? What problem does it have?
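
A minimal sketch of this criterion in code, assuming the usual definition Jp(a) = Σy∈Y (−aᵀy) over the misclassified set Y; here Y_norm holds normalized samples as rows, e.g. the toy set from the normalization sketch above.

```python
import numpy as np

def perceptron_criterion(a, Y_norm):
    """J_p(a) = sum of (-a^T y) over the misclassified (normalized) samples."""
    misclassified = Y_norm[Y_norm @ a <= 0]   # the set Y of misclassified patterns
    return -np.sum(misclassified @ a)

def perceptron_gradient(a, Y_norm):
    """grad J_p(a) = -(sum of y over the misclassified samples)."""
    misclassified = Y_norm[Y_norm @ a <= 0]
    return -np.sum(misclassified, axis=0)
```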

Criterion Functions: The Two-Category Case
A relative of the perceptron criterion function, again summed over Y, the set of misclassified patterns.
[Figure: plot of the criterion with the solution state marked.]
Is this criterion much better? What problem does it have?

Criterion Functions: The Two-Category Case
Another criterion, again summed over Y, the set of misclassified patterns. What is the difference from the previous one?
[Figure: plot of the criterion with the solution state marked.]
Is this criterion good enough? Are there others?

Learning Gradient Descent Algorithm

Gradient Descent Algorithm
Our goal is to go downhill.
[Figure: the surface J(a) over (a1, a2) and its contour map.]

Gradient Descent Algorithm
Our goal is to go downhill. Define the gradient ∇aJ = (∂J/∂a1, ∂J/∂a2, …)ᵀ; ∇aJ is a vector.

Gradient Descent Algorithm
Our goal is to go downhill.

Gradient Descent Algorithm
Our goal is to go downhill.
[Figure: at the point (a1, a2), the negative gradient −∇aJ shows the way to go.]
How long a step shall we take?

Gradient Decent Algorithm (k): Learning Rate Gradient Decent Algorithm Initial Setting a, , k = 0 do k  k + 1 a  a  (k)a J(a) until |a J(a)| < 

Trajectory of Steepest Descent
If an improper learning rate η(k) is used, the convergence rate may be poor. Too small: slow convergence. Too large: overshooting.
[Figure: the trajectory from a0 toward the minimum a* within the basin of attraction, with J(a0), J(a1), … decreasing.]
Furthermore, the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.

Learning Rate
Paraboloid: J(x) = ½ xᵀQx − bᵀx, with Q symmetric and positive definite.

Learning Rate
Paraboloid: all smooth functions can be approximated by paraboloids in a sufficiently small neighborhood of any point.

Learning Rate
We will discuss the convergence properties using paraboloids. Global minimum x*: setting ∇J(x) = Qx − b = 0 gives x* = Q⁻¹b.

Learning Rate
Define the error E(x) = J(x) − J(x*) = ½ (x − x*)ᵀQ(x − x*). We want to minimize E(x). Clearly, E(x) ≥ 0, and E(x) = 0 if and only if x = x*.

Learning Rate
Suppose that we are at yk. Let gk = ∇J(yk). The steepest descent direction will be pk = −gk. Let the learning rate be ηk; that is, yk+1 = yk + ηk pk.

Learning Rate
We'll choose the most 'favorable' ηk, i.e., the one that minimizes J(yk + ηk pk) along pk. Setting dJ(yk + ηk pk)/dηk = 0, we get ηk = (pkᵀpk)/(pkᵀQpk).
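
A sketch of this exact line search on a paraboloid written, as above, as J(y) = ½ yᵀQy − bᵀy; since pk = −gk, the step can equivalently be computed as ηk = (gkᵀgk)/(gkᵀQgk). The matrix and starting point are made up.

```python
import numpy as np

def steepest_descent_step(Q, b, y):
    """One steepest-descent step on J(y) = 0.5 y^T Q y - b^T y with the optimal eta_k."""
    g = Q @ y - b                   # gradient at y (descent direction is -g)
    eta = (g @ g) / (g @ Q @ g)     # eta_k = (g^T g) / (g^T Q g)
    return y - eta * g, eta

Q = np.array([[2.0, 0.0],
              [0.0, 10.0]])         # symmetric, positive definite
b = np.zeros(2)
y1, eta1 = steepest_descent_step(Q, b, np.array([10.0, 1.0]))
print(y1, eta1)                     # the step does not point straight at the minimum
```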

Learning Rate
If Q = I, then ηk = 1 and the global minimum is reached in a single step.

Convergence Rate
With the optimal ηk, one can show that E(yk+1) = [1 − (gkᵀgk)² / ((gkᵀQgk)(gkᵀQ⁻¹gk))] E(yk).

Convergence Rate
[Kantorovich Inequality] Let Q be a positive definite, symmetric, n×n matrix. For any vector x ≠ 0 there holds
(xᵀx)² / ((xᵀQx)(xᵀQ⁻¹x)) ≥ 4λ1λn / (λ1 + λn)²,
where λ1 ≤ λ2 ≤ … ≤ λn are the eigenvalues of Q. It follows that
E(yk+1) ≤ ((λn − λ1)/(λn + λ1))² E(yk);
the smaller this factor, the better.

Convergence Rate
The factor is governed by the condition number κ = λn/λ1: a small condition number means faster convergence; a large condition number means slower convergence.
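
A tiny numeric sketch (hypothetical Q) of this factor, ((κ − 1)/(κ + 1))² with κ = λn/λ1, which approaches 0 for well-conditioned Q and 1 for ill-conditioned Q:

```python
import numpy as np

Q = np.array([[2.0, 0.0],
              [0.0, 10.0]])
lam = np.linalg.eigvalsh(Q)                  # eigenvalues in ascending order
kappa = lam[-1] / lam[0]                     # condition number
factor = ((kappa - 1) / (kappa + 1)) ** 2    # worst-case per-step error shrinkage
print(kappa, factor)                         # 5.0 0.444...
```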

Convergence Rate
Paraboloid: suppose we are now at xk. Updating rule: xk+1 = xk − ηk gk, with gk = Qxk − b and ηk = (gkᵀgk)/(gkᵀQgk).

Trajectory of Steepest Descent
In this case, the condition number of Q is moderately large. One then sees that the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.

Learning Newton’s Method

Global minimum of a Paraboloid
We can find the global minimum of a paraboloid by setting its gradient to zero: ∇J(x) = Qx − b = 0, giving x* = Q⁻¹b.

Function Approximation
Taylor series expansion: J(a) ≈ J(ak) + ∇J(ak)ᵀ(a − ak) + ½ (a − ak)ᵀH(a − ak). All smooth functions can be approximated by paraboloids in a sufficiently small neighborhood of any point.

Function Approximation
Hessian matrix: H = ∇²J(a), the matrix of second partial derivatives, Hij = ∂²J/∂ai∂aj.

Newton's Method
Minimize the quadratic approximation by setting its gradient to zero, which gives the update ak+1 = ak − H⁻¹∇J(ak).
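
A sketch of this update; on the quadratic J(a) = ½ aᵀQa − bᵀa used earlier, a single Newton step lands exactly on the minimum Q⁻¹b (the numbers are made up).

```python
import numpy as np

def newton_step(grad, hessian, a):
    """One Newton step: a <- a - H^{-1} grad J(a), solved without forming H^{-1}."""
    return a - np.linalg.solve(hessian(a), grad(a))

Q = np.array([[2.0, 0.0],
              [0.0, 10.0]])
b = np.array([4.0, 10.0])
a1 = newton_step(lambda a: Q @ a - b,   # grad J(a) = Q a - b
                 lambda a: Q,           # the Hessian of a quadratic is Q itself
                 np.array([10.0, 1.0]))
print(a1)                               # [2. 1.] == Q^{-1} b
```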

Comparison
Newton's method vs. gradient descent.

Comparison
Newton's method will usually give a greater improvement per step than the simple gradient descent algorithm, even with the optimal value of ηk. However, Newton's method is not applicable if the Hessian matrix Q is singular. Even when Q is nonsingular, computing Q⁻¹ is time-consuming: O(d³). It often takes less time to set ηk to a constant (smaller than necessary) than to compute the optimal ηk at each step.