Machine Learning – Lecture 4: Linear Discriminant Functions. 19.04.2012, Bastian Leibe.


Machine Learning – Lecture 4: Linear Discriminant Functions. Bastian Leibe, RWTH Aachen. Many slides adapted from B. Schiele.

Course Outline. Fundamentals (2 weeks) • Bayes Decision Theory • Probability Density Estimation. Discriminative Approaches (5 weeks) • Linear Discriminant Functions • Support Vector Machines • Ensemble Methods & Boosting • Randomized Trees, Forests & Ferns. Generative Models (4 weeks) • Bayesian Networks • Markov Random Fields.

Recap: Mixture of Gaussians (MoG). A “generative model”: the mixture density is p(x) = Σ_j π_j N(x | μ_j, Σ_j), where π_j is the “weight” of mixture component j and N(x | μ_j, Σ_j) is the j-th mixture component. Slide credit: Bernt Schiele

Recap: Estimating MoGs – Iterative Strategy. Assuming we knew the values of the hidden assignment variable j for each sample (assumed known), we could estimate each Gaussian separately by maximum likelihood: ML for Gaussian #1, ML for Gaussian #2. Slide credit: Bernt Schiele

Recap: Estimating MoGs – Iterative Strategy. Assuming we knew the mixture components (assumed known), we could assign each sample via the Bayes decision rule: decide j = 1 if p(j = 1 | x) > p(j = 2 | x). Slide credit: Bernt Schiele

Recap: K-Means Clustering. Iterative procedure: 1. Initialization: pick K arbitrary centroids (cluster means). 2. Assign each sample to the closest centroid. 3. Adjust the centroids to be the means of the samples assigned to them. 4. Go to step 2 (until no change). The algorithm is guaranteed to converge after a finite number of iterations, but only to a local optimum; the final result depends on the initialization. A minimal code sketch follows below. Slide credit: Bernt Schiele
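A minimal K-means sketch in Python/NumPy, assuming Euclidean distance and a simple convergence check; the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: X is (N, D); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]   # step 1: pick K samples as centroids
    for _ in range(n_iter):
        # step 2: assign each sample to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned samples
        new_centroids = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):          # step 4: stop when nothing changes
            break
        centroids = new_centroids
    return centroids, assign
```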

Recap: EM Algorithm. Expectation-Maximization (EM) Algorithm: • E-Step: softly assign samples to mixture components, i.e. compute responsibilities γ_j(x_n) = π_j N(x_n | μ_j, Σ_j) / Σ_k π_k N(x_n | μ_k, Σ_k). • M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments, using N_j = Σ_n γ_j(x_n), the “soft number of samples labeled j”. A code sketch is given below. Slide adapted from Bernt Schiele
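A compact EM sketch for a MoG in Python/NumPy, assuming full covariance matrices and a fixed number of iterations; names such as `em_mog` are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, K, n_iter=50, seed=0):
    """Minimal EM for a Gaussian mixture: X is (N, D); returns (pi, mu, Sigma)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                       # mixture weights
    mu = X[rng.choice(N, K, replace=False)]        # means initialized from data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, j] = p(component j | x_n)
        dens = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        Nj = gamma.sum(axis=0)                     # "soft number of samples labeled j"
        pi = Nj / N
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return pi, mu, Sigma
```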

Summary: Gaussian Mixture Models. Properties: • Very general, can represent any (continuous) distribution. • Once trained, very fast to evaluate. • Can be updated online. Problems / Caveats: • Some numerical issues in the implementation  need to apply regularization in order to avoid singularities. • EM for MoG is computationally expensive – especially for high-dimensional problems! – more computational overhead and slower convergence than k-Means – results very sensitive to initialization  run k-Means for some iterations as initialization! • Need to select the number of mixture components K  model selection problem (see Lecture 10).

EM – Technical Advice. When implementing EM, we need to take care to avoid singularities in the estimation! • Mixture components may collapse onto single data points. • Need to introduce regularization  enforce a minimum width for the Gaussians, e.g. by enforcing that none of the covariance eigenvalues gets too small: if λ_i < ε then set λ_i = ε (see the sketch below). Image source: C.M. Bishop, 2006
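A small sketch of this eigenvalue-floor regularization in Python/NumPy, assuming a symmetric covariance matrix; the threshold `eps` is an illustrative choice:

```python
import numpy as np

def regularize_covariance(Sigma, eps=1e-3):
    """Clamp small eigenvalues of a covariance matrix: if lambda_i < eps, set lambda_i = eps."""
    vals, vecs = np.linalg.eigh(Sigma)     # eigendecomposition of the symmetric covariance
    vals = np.maximum(vals, eps)           # enforce a minimum width in every direction
    return (vecs * vals) @ vecs.T          # rebuild Sigma = V diag(vals) V^T
```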

Applications. Mixture models are used in many practical applications – wherever distributions with complex or unknown shapes need to be represented. Popular application in Computer Vision: • Model distributions of pixel colors. • Each pixel is one data point in, e.g., RGB space. • Learn a MoG to represent the class-conditional densities. • Use the learned models to classify other pixels. Image source: C.M. Bishop, 2006

Application: Color-Based Skin Detection. Collect training samples for skin/non-skin pixels, estimate one MoG each to represent the skin and non-skin densities, then classify skin-color pixels in novel images (a brief sketch of this classification step follows below). M. Jones and J. Rehg, Statistical Color Models with Application to Skin Detection, IJCV 2002.
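A sketch of the classification step, assuming two mixtures have already been fitted to skin and non-skin training pixels (e.g. with the `em_mog` sketch above); the prior and function names are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_density(X, pi, mu, Sigma):
    """Evaluate a fitted mixture density p(x) = sum_j pi_j N(x | mu_j, Sigma_j)."""
    return sum(pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(len(pi)))

def classify_skin(pixels, skin_mog, nonskin_mog, prior_skin=0.5):
    """Label pixels (N, 3) as skin where the posterior for 'skin' exceeds that for 'non-skin'."""
    p_skin = mog_density(pixels, *skin_mog) * prior_skin
    p_nonskin = mog_density(pixels, *nonskin_mog) * (1.0 - prior_skin)
    return p_skin > p_nonskin   # boolean mask: True = skin
```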

Interested to Try It? Here’s how you can access a webcam in Matlab (uses the Image Acquisition Toolbox):

function out = webcam
% returns a configured videoinput object for the default webcam
adaptorName = 'winvideo';
vidFormat = 'I420_320x240';
vidObj1 = videoinput(adaptorName, 1, vidFormat);
set(vidObj1, 'ReturnedColorSpace', 'rgb');
set(vidObj1, 'FramesPerTrigger', 1);
out = vidObj1;

% usage: grab a single RGB frame
cam = webcam();
img = getsnapshot(cam);

Topics of This Lecture. Linear discriminant functions • Definition • Extension to multiple classes. Least-squares classification • Derivation • Shortcomings. Generalized linear models • Connection to neural networks • Generalized linear discriminants & gradient descent. Fisher’s linear discriminant (FLD) • Classification as dimensionality reduction • Linear discriminant analysis • Multiple discriminant analysis • Applications.


Discriminant Functions. Bayesian Decision Theory: • Model conditional probability densities p(x | C_k) and priors p(C_k). • Compute posteriors p(C_k | x) (using Bayes’ rule). • Minimize the probability of misclassification by maximizing p(C_k | x). New approach: • Directly encode the decision boundary. • Without explicit modeling of probability densities. • Minimize the misclassification probability directly. Slide credit: Bernt Schiele

Discriminant Functions. Formulate classification in terms of comparisons: • Discriminant functions y_1(x), …, y_K(x). • Classify x as class C_k if y_k(x) > y_j(x) ∀ j ≠ k. Examples (Bayes Decision Theory): y_k(x) = p(C_k | x), y_k(x) = p(x | C_k) p(C_k), or y_k(x) = log p(x | C_k) + log p(C_k). Slide credit: Bernt Schiele

Discriminant Functions. Example: 2 classes – use a single discriminant y(x) and decide C_1 if y(x) > 0, C_2 otherwise. Decision functions (from Bayes Decision Theory): e.g. y(x) = p(C_1 | x) − p(C_2 | x) or y(x) = ln [p(x | C_1) / p(x | C_2)] + ln [p(C_1) / p(C_2)]. Slide credit: Bernt Schiele

Learning Discriminant Functions. General classification problem: • Goal: take a new input x and assign it to one of K classes C_k. • Given: training set X = {x_1, …, x_N} with target values T = {t_1, …, t_N}. • Learn a discriminant function y(x) to perform the classification. 2-class problem  binary target values, e.g. t_n ∈ {0, 1}. K-class problem  1-of-K coding scheme, e.g. t_n = (0, 1, 0, 0)^T.

Linear Discriminant Functions. 2-class problem: y(x) > 0 – decide for class C_1, else for class C_2. In the following, we focus on linear discriminant functions y(x) = w^T x + w_0, where w is the weight vector and w_0 the “bias” (= threshold). If a data set can be perfectly classified by a linear discriminant, then we call it linearly separable. Slide credit: Bernt Schiele

Linear Discriminant Functions. The decision boundary y(x) = 0 defines a hyperplane • Normal vector: w. • Offset from the origin: −w_0 / ‖w‖. Slide credit: Bernt Schiele

Linear Discriminant Functions. Notation: with D the number of dimensions, y(x) = Σ_{i=1}^{D} w_i x_i + w_0 = Σ_{i=0}^{D} w_i x_i, with x_0 = 1 constant (the bias absorbed into the weight vector). Slide credit: Bernt Schiele

Extension to Multiple Classes. Two simple strategies: one-vs-all classifiers and one-vs-one classifiers. • How many classifiers do we need in both cases? • What difficulties do you see for those strategies? Image source: C.M. Bishop, 2006

Extension to Multiple Classes. Problem: • Both strategies result in regions for which the pure classification result (y_k > 0) is ambiguous. • In the one-vs-all case, it is still possible to classify those inputs based on the continuous classifier outputs, i.e. y_k > y_j ∀ j ≠ k. Solution: • We can avoid those difficulties by taking K linear functions of the form y_k(x) = w_k^T x + w_{k0} and defining the decision boundaries directly by deciding for C_k iff y_k > y_j ∀ j ≠ k. • This corresponds to a 1-of-K coding scheme. Image source: C.M. Bishop, 2006

Extension to Multiple Classes. K-class discriminant: • Combination of K linear functions y_k(x) = w_k^T x + w_{k0}. • Resulting decision hyperplanes: (w_k − w_j)^T x + (w_{k0} − w_{j0}) = 0. • It can be shown that the decision regions of such a discriminant are always singly connected and convex. • This makes linear discriminant models particularly suitable for problems for which the class-conditional densities p(x | C_k) are unimodal. Image source: C.M. Bishop, 2006

Topics of This Lecture. Linear discriminant functions • Definition • Extension to multiple classes. Least-squares classification • Derivation • Shortcomings. Generalized linear models • Connection to neural networks • Generalized linear discriminants & gradient descent. Fisher’s linear discriminant (FLD) • Classification as dimensionality reduction • Linear discriminant analysis • Multiple discriminant analysis • Applications.


General Classification Problem. Classification problem: • Let’s consider K classes described by linear models y_k(x) = w_k^T x + w_{k0}, k = 1, …, K. • We can group those together using vector notation y(x) = W̃^T x̃, where W̃ = [w̃_1, …, w̃_K] collects the augmented weight vectors w̃_k = (w_{k0}, w_k^T)^T and x̃ = (1, x^T)^T. • The output will again be in 1-of-K notation. • We can directly compare it to the target value.

General Classification Problem. Classification problem: • For the entire dataset, we can write Y(X̃) = X̃ W̃ and compare this to the target matrix T, where X̃ stacks the row vectors x̃_n^T and T stacks the targets t_n^T. • Result of the comparison: X̃ W̃ − T. Goal: choose W̃ such that this difference is minimal!

Least-Squares Classification. Simplest approach: directly try to minimize the sum-of-squares error E_D(W̃) = ½ Tr{(X̃ W̃ − T)^T (X̃ W̃ − T)}. • We could also write this as a double sum, E(W̃) = ½ Σ_n Σ_k (y_k(x_n) − t_{kn})², but let’s stick with the matrix notation for now…

Least-Squares Classification. Simplest approach: directly try to minimize the sum-of-squares error. Taking the derivative (chain rule, using ∂/∂A Tr{A^T A} = 2A) and setting it to zero yields ∂E_D/∂W̃ = X̃^T (X̃ W̃ − T) = 0.

Least-Squares Classification. Minimizing the sum-of-squares error, we then obtain W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃^† T, where X̃^† is the “pseudo-inverse” of X̃, and the discriminant function y(x) = W̃^T x̃. • Exact, closed-form solution for the discriminant function parameters (see the short sketch below).
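A minimal least-squares classifier in Python/NumPy following this closed-form solution; the 1-of-K encoding helper and function names are illustrative:

```python
import numpy as np

def fit_least_squares(X, labels, K):
    """Closed-form least-squares classifier: returns the augmented weight matrix W_tilde (D+1, K)."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])      # prepend the constant x_0 = 1
    T = np.eye(K)[labels]                          # 1-of-K target matrix (N, K)
    W_tilde = np.linalg.pinv(X_tilde) @ T          # W = pseudo-inverse(X) * T
    return W_tilde

def predict(X, W_tilde):
    """Decide for the class with the largest linear output y_k(x)."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return (X_tilde @ W_tilde).argmax(axis=1)
```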

Problems with Least Squares. Least-squares is very sensitive to outliers! • The error function penalizes predictions that are “too correct”. Image source: C.M. Bishop, 2006

Problems with Least Squares. Another example: • 3 classes (red, green, blue). • Linearly separable problem. • Least-squares solution: most green points are misclassified! Deeper reason for the failure: • Least-squares corresponds to Maximum Likelihood under the assumption of a Gaussian conditional distribution. • However, our binary target vectors have a distribution that is clearly non-Gaussian! • Least-squares is the wrong probabilistic tool in this case! Image source: C.M. Bishop, 2006

Topics of This Lecture. Linear discriminant functions • Definition • Extension to multiple classes. Least-squares classification • Derivation • Shortcomings. Generalized linear models • Connection to neural networks • Generalized linear discriminants & gradient descent. Fisher’s linear discriminant (FLD) • Classification as dimensionality reduction • Linear discriminant analysis • Multiple discriminant analysis • Applications.


Generalized Linear Models. Linear model: y(x) = w^T x + w_0. Generalized linear model: y(x) = g(w^T x + w_0). • g(·) is called an activation function and may be nonlinear. • The decision surfaces correspond to y(x) = const., i.e. w^T x + w_0 = const. • If g is monotonic (which is typically the case), the resulting decision boundaries are still linear functions of x.

Generalized Linear Models. Consider 2 classes: p(C_1 | x) = p(x | C_1) p(C_1) / (p(x | C_1) p(C_1) + p(x | C_2) p(C_2)) = 1 / (1 + exp(−a)) = g(a), with a = ln [p(x | C_1) p(C_1) / (p(x | C_2) p(C_2))]. Slide credit: Bernt Schiele

Logistic Sigmoid Activation Function: g(a) = 1 / (1 + exp(−a)). Example: for Normal distributions with identical covariance, the posterior p(C_1 | x) is exactly a logistic sigmoid applied to a linear function of x. Slide credit: Bernt Schiele

Normalized Exponential. General case of K > 2 classes: p(C_k | x) = exp(a_k) / Σ_j exp(a_j), with a_k = ln [p(x | C_k) p(C_k)]. • This is known as the normalized exponential or softmax function (a small numerical sketch follows below). • It can be regarded as a multiclass generalization of the logistic sigmoid. Slide credit: Bernt Schiele
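A short softmax sketch in Python/NumPy; subtracting the maximum activation is a standard numerical-stability trick added here, not something stated on the slide:

```python
import numpy as np

def softmax(a):
    """Normalized exponential: p_k = exp(a_k) / sum_j exp(a_j), computed stably."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())          # shifting by max(a) leaves the result unchanged
    return e / e.sum()

# Example: three activations -> posterior-like probabilities that sum to 1
print(softmax([2.0, 1.0, 0.1]))      # approximately [0.659, 0.242, 0.099]
```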

Relationship to Neural Networks. 2-class case: y(x) = g(Σ_{i=0}^{D} w_i x_i), with x_0 = 1 constant. This is exactly a neural network (“single-layer perceptron”). Slide credit: Bernt Schiele

Relationship to Neural Networks. Multi-class case: y_k(x) = g(Σ_{i=0}^{D} w_{ki} x_i), with x_0 = 1 constant – a multi-class perceptron. Slide credit: Bernt Schiele

Logistic Discrimination. If we use the logistic sigmoid activation function, then we can interpret the y(x) as posterior probabilities! Slide adapted from Bernt Schiele

Other Motivation for Nonlinearity. Recall least-squares classification: • One of the problems was that data points that are “too correct” have a strong influence on the decision surface under a squared-error criterion. • Reason: the output of y(x_n; w) = w^T x_n + w_0 can grow arbitrarily large for some x_n. • By choosing a suitable nonlinearity (e.g. a sigmoid), we can limit those influences.

Discussion: Generalized Linear Models. Advantages: • The nonlinearity gives us more flexibility. • It can be used to limit the effect of outliers. • The choice of a sigmoid leads to a nice probabilistic interpretation. Disadvantage: • Least-squares minimization in general no longer leads to a closed-form analytical solution. • Need to apply iterative methods  gradient descent.

Linear Separability. Up to now: restrictive assumption – we only consider linear decision boundaries. Classical counterexample: XOR. Slide credit: Bernt Schiele

Linear Separability. Even if the data is not linearly separable, a linear decision boundary may still be “optimal” in terms of generalization, e.g. in the case of normally distributed data (with equal covariance matrices). The choice of the right discriminant function is important and should be based on • Prior knowledge (of the general functional form). • Empirical comparison of alternative models. • Linear discriminants are often used as a benchmark. Slide credit: Bernt Schiele

Generalized Linear Discriminants. Generalization: transform the vector x with M nonlinear basis functions φ_j(x): y_k(x) = Σ_{j=1}^{M} w_{kj} φ_j(x) + w_{k0}. Purpose of the basis functions φ_j(x): • Allow non-linear decision boundaries. • By choosing the right φ_j, every continuous function can (in principle) be approximated with arbitrary accuracy. Notation: y_k(x) = Σ_{j=0}^{M} w_{kj} φ_j(x), with φ_0(x) = 1 (a small feature-expansion sketch follows below). Slide credit: Bernt Schiele
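A brief sketch of such a basis-function expansion in Python/NumPy, using Gaussian (RBF) basis functions as one illustrative choice of φ_j; the centers and width are assumptions, not values from the lecture:

```python
import numpy as np

def rbf_features(X, centers, sigma=1.0):
    """Map inputs X (N, D) to basis-function space: phi_0 = 1, phi_j = exp(-||x - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared distances (N, M)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])               # prepend phi_0(x) = 1

# A linear discriminant in phi-space, y(x) = w^T phi(x), then gives non-linear boundaries in x-space.
```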

Generalized Linear Discriminants. Model: K functions (outputs) y_k(x; w). Learning in Neural Networks: • Single-layer networks: the φ_j are fixed, only the weights w are learned. • Multi-layer networks: both the w and the φ_j are learned. • In the following, we will not go into details about neural networks in particular, but consider generalized linear discriminants in general… Slide credit: Bernt Schiele

Gradient Descent. Learning the weights w: • N training data points: X = {x_1, …, x_N}. • K outputs of decision functions: y_k(x_n; w). • Target vector for each data point: T = {t_1, …, t_N}. • Error function (least-squares error) of the linear model: E(w) = ½ Σ_n Σ_k (y_k(x_n; w) − t_{kn})² = ½ Σ_n Σ_k (Σ_j w_{kj} φ_j(x_n) − t_{kn})². Slide credit: Bernt Schiele

Gradient Descent. Problem: • The error function can in general no longer be minimized in closed form. Idea (Gradient Descent): • Iterative minimization. • Start with an initial guess for the parameter values. • Move towards a (local) minimum by following the gradient: w^{(τ+1)} = w^{(τ)} − η ∇E(w)|_{w^{(τ)}}, with learning rate η. • This simple scheme corresponds to a first-order Taylor expansion (there are more complex procedures available).

Gradient Descent – Basic Strategies. “Batch learning”: compute the gradient based on all training data: w_{kj}^{(τ+1)} = w_{kj}^{(τ)} − η ∂E(w)/∂w_{kj}, with learning rate η. Slide credit: Bernt Schiele

Gradient Descent – Basic Strategies. “Sequential updating”: compute the gradient based on a single data point at a time: with E(w) = Σ_n E_n(w), update w_{kj}^{(τ+1)} = w_{kj}^{(τ)} − η ∂E_n(w)/∂w_{kj}, with learning rate η (both strategies are sketched in code below). Slide credit: Bernt Schiele
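A compact sketch of both strategies for the squared-error linear model in Python/NumPy, where `Phi` is the design matrix of basis-function outputs; the learning rate and iteration counts are illustrative:

```python
import numpy as np

def batch_gradient_descent(Phi, T, eta=0.01, n_iter=1000):
    """Batch learning: one update per pass, gradient summed over all training data."""
    W = np.zeros((Phi.shape[1], T.shape[1]))          # weights (M+1, K)
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ W - T)                  # dE/dW = sum_n phi(x_n) (y(x_n) - t_n)^T
        W -= eta * grad
    return W

def sequential_gradient_descent(Phi, T, eta=0.01, n_epochs=50, seed=0):
    """Sequential (stochastic) updating: one update per data point (delta rule)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((Phi.shape[1], T.shape[1]))
    for _ in range(n_epochs):
        for n in rng.permutation(len(Phi)):
            err = Phi[n] @ W - T[n]                   # classification error y(x_n) - t_n
            W -= eta * np.outer(Phi[n], err)          # feed back the input weighted by the error
    return W
```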

Gradient Descent. Error function: E(w) = Σ_n E_n(w), with per-sample error E_n(w) = ½ Σ_k (y_k(x_n; w) − t_{kn})². Slide credit: Bernt Schiele

Gradient Descent. Delta rule (= LMS rule): Δw_{kj} = −η (y_k(x_n) − t_{kn}) φ_j(x_n) = −η δ_{kn} φ_j(x_n), where δ_{kn} = y_k(x_n) − t_{kn}. • Simply feed back the input data point, weighted by the classification error. Slide credit: Bernt Schiele

Gradient Descent. Case of a differentiable, non-linear activation function g: the gradient descent update acquires an extra factor from the chain rule, Δw_{kj} = −η (y_k(x_n) − t_{kn}) g′(a_k(x_n)) φ_j(x_n). Slide credit: Bernt Schiele

Summary: Generalized Linear Discriminants. Properties: • General class of decision functions. • The nonlinearity g(·) and basis functions φ_j make it possible to address linearly non-separable problems. • We have shown a simple sequential learning approach for parameter estimation using gradient descent. • Better second-order gradient descent approaches are available (e.g. Newton-Raphson). Limitations / Caveats: • The flexibility of the model is limited by the curse of dimensionality – g(·) and the φ_j often introduce additional parameters – models are either limited to a lower-dimensional input space or need to share parameters. • The linearly separable case often leads to overfitting – several possible parameter choices minimize the training error.

Topics of This Lecture. Linear discriminant functions • Definition • Extension to multiple classes. Least-squares classification • Derivation • Shortcomings. Generalized linear models • Connection to neural networks • Generalized linear discriminants & gradient descent. Fisher’s linear discriminant (FLD) • Classification as dimensionality reduction • Linear discriminant analysis • Multiple discriminant analysis • Applications.

Classification as Dimensionality Reduction. We can interpret the linear classification model as a projection onto a lower-dimensional space. • E.g. take the D-dimensional input vector x and project it down to one dimension by applying the function y = w^T x. • If we now place a threshold at y ≥ −w_0, we obtain our standard two-class linear classifier. • The classifier will have a lower error the better this projection separates the two classes. • New interpretation of the learning problem: try to find the projection vector w that maximizes the class separation.

Classification as Dimensionality Reduction. Two questions: • How to measure class separation? • How to find the best projection (with maximal class separation)? (Figure: bad separation vs. good separation.) Image source: C.M. Bishop, 2006

Classification as Dimensionality Reduction. Measuring class separation: • We could simply measure the separation of the projected class means, i.e. choose w so as to maximize m_2 − m_1 = w^T (m_2 − m_1). Problems with this approach: 1. This expression can be made arbitrarily large by increasing ‖w‖  we need to enforce an additional constraint, e.g. ‖w‖ = 1. 2. The criterion may result in bad separation if the clusters have elongated shapes. Image source: C.M. Bishop, 2006

Fisher’s Linear Discriminant Analysis (FLD). Better idea: • Find a projection that maximizes the ratio of the between-class variance to the within-class variance: J(w) = (m_2 − m_1)² / (s_1² + s_2²). • Usually, this is written as J(w) = (w^T S_B w) / (w^T S_W w), where S_B = (m_2 − m_1)(m_2 − m_1)^T is the between-class scatter matrix and S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T is the within-class scatter matrix.

Fisher’s Linear Discriminant Analysis (FLD). Maximize the distance between classes, minimize the distance within a class. Criterion: J(w) = (w^T S_B w) / (w^T S_W w), with S_B the between-class scatter matrix and S_W the within-class scatter matrix. The optimal solution for w can be obtained as w ∝ S_W^{−1} (m_2 − m_1). Classification function: y(x) = w^T x + w_0, decide class 1 if y(x) > 0 (a minimal sketch follows below). Slide credit: Ales Leonardis
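A minimal two-class FLD sketch in Python/NumPy following these formulas; placing the threshold midway between the projected class means is a common convention and an assumption here, not given on the slide:

```python
import numpy as np

def fisher_lda(X1, X2):
    """Two-class Fisher discriminant: returns (w, w0) with decision y(x) = w @ x + w0 > 0 for class 1."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter matrix S_W (sum of the per-class scatters)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m1 - m2)          # w proportional to S_W^{-1} (m1 - m2)
    w0 = -0.5 * w @ (m1 + m2)                  # threshold midway between projected means (assumption)
    return w, w0

def classify(x, w, w0):
    return 1 if w @ x + w0 > 0 else 2          # class label 1 or 2
```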

Multiple Discriminant Analysis. Generalization to K classes: J(W) = |W^T S_B W| / |W^T S_W W|, where S_W = Σ_k Σ_{n∈C_k} (x_n − m_k)(x_n − m_k)^T and S_B = Σ_k N_k (m_k − m)(m_k − m)^T, with m the mean of all samples.

Maximizing J(W). Generalized eigenvalue problem: • The columns of the optimal W are the eigenvectors corresponding to the largest eigenvalues of S_B w = λ S_W w, i.e. of S_W^{−1} S_B. • Defining v = S_W^{1/2} w, we get S_W^{−1/2} S_B S_W^{−1/2} v = λ v, which is a regular eigenvalue problem. • Solve it to get the eigenvectors v, and from those the w. For the K-class case we obtain (at most) K−1 projections (i.e. eigenvectors corresponding to non-zero eigenvalues). A small sketch follows below.
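A sketch of the multi-class projection in Python/SciPy, solving the generalized eigenvalue problem directly with `scipy.linalg.eigh`; it assumes S_W is positive definite (N > D), and the names are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(X, labels, K):
    """Return the (at most) K-1 discriminant directions maximizing |W^T S_B W| / |W^T S_W W|."""
    D = X.shape[1]
    m = X.mean(axis=0)
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in range(K):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)              # within-class scatter
        S_B += len(Xk) * np.outer(mk - m, mk - m)   # between-class scatter
    # generalized eigenvalue problem S_B w = lambda S_W w; eigh returns eigenvalues in ascending order
    vals, vecs = eigh(S_B, S_W)
    return vecs[:, ::-1][:, :K - 1]                 # top K-1 eigenvectors as columns
```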

What Does It Mean? What does it mean to apply a linear classifier? Classifier interpretation: • The weight vector has the same dimensionality as the input vector x. • Positive contributions where sign(x_i) = sign(w_i). • The weight vector identifies which input dimensions are important for positive or negative classification (large |w_i|) and which ones are irrelevant (near-zero w_i). • If the inputs x are normalized, we can interpret w as a “template” vector that the classifier tries to match.

Example Application: Fisherfaces. Visual discrimination task: • Training data: C_1: subjects with glasses, C_2: subjects without glasses. • Test: glasses or not? • Take each image as a vector of pixel values and apply FLD… Image source: Yale Face Database

Fisherfaces: Interpretability. Resulting weight vector for “Glasses/NoGlasses” [Belhumeur et al. 1997]. Slide credit: Peter Belhumeur

Summary: Fisher’s Linear Discriminant. Properties: • Simple method for dimensionality reduction that preserves class discriminability. • Can use parametric methods in the reduced-dimensional space that might not be feasible in the original higher-dimensional space. • Widely used in practical applications. Restrictions / Caveats: • Not possible to get more than K−1 projections. • FLD reduces the computation to class means and covariances. • Implicit assumption that the class distributions are unimodal and well-approximated by a Gaussian/hyperellipsoid. • Assumption that N > D (more training examples than dimensions) – this may not be given in some domains (e.g. vision) – solution: apply PCA first to get a problem with D = N−1 dimensions.

References and Further Reading. More information on Linear Discriminant Functions can be found in Chapter 4 of Bishop’s book (in particular Section 4.1): Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.