Machine Learning – Lecture 4: Linear Discriminant Functions
19.04.2012
Bastian Leibe, RWTH Aachen
http://www.mmp.rwth-aachen.de, leibe@umic.rwth-aachen.de
Many slides adapted from B. Schiele
Course Outline
Fundamentals (2 weeks): Bayes Decision Theory; Probability Density Estimation.
Discriminative Approaches (5 weeks): Linear Discriminant Functions; Support Vector Machines; Ensemble Methods & Boosting; Randomized Trees, Forests & Ferns.
Generative Models (4 weeks): Bayesian Networks; Markov Random Fields.
Recap: Mixture of Gaussians (MoG)
"Generative model": the mixture density is a weighted sum of mixture components,
p(x) = Σ_{j=1..M} π_j p(x | θ_j),
where π_j is the "weight" of mixture component j, p(x | θ_j) is the j-th mixture component, and p(x) is the mixture density.
(Slide credit: Bernt Schiele)
Recap: Estimating MoGs – Iterative Strategy
Assuming we knew the values of the hidden variable j (i.e., which component generated each sample, assumed known), we could compute the ML estimates for Gaussian #1 and Gaussian #2 separately from the samples assigned to each component.
(Slide credit: Bernt Schiele)
Recap: Estimating MoGs – Iterative Strategy
Conversely, assuming we knew the mixture components, we could assign each sample with the Bayes decision rule: decide j = 1 if p(j=1 | x_n) > p(j=2 | x_n).
(Slide credit: Bernt Schiele)
Recap: K-Means Clustering
Iterative procedure:
1. Initialization: pick K arbitrary centroids (cluster means).
2. Assign each sample to the closest centroid.
3. Adjust the centroids to be the means of the samples assigned to them.
4. Go to step 2 (until no change).
The algorithm is guaranteed to converge after a finite number of iterations, but only to a local optimum; the final result depends on the initialization. A minimal sketch is shown below.
(Slide credit: Bernt Schiele)
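A minimal MATLAB sketch of this procedure, for illustration only (the function name, the random initialization, and the convention of keeping empty clusters unchanged are my own assumptions, not from the lecture):

function [centroids, assignment] = kmeans_sketch(X, K)
% X: N x D data matrix, K: number of clusters
[N, D] = size(X);
centroids = X(randperm(N, K), :);            % step 1: pick K samples as initial centroids
assignment = zeros(N, 1);
changed = true;
while changed
    oldAssignment = assignment;
    % step 2: assign each sample to the closest centroid
    for n = 1:N
        dists = sum((centroids - repmat(X(n, :), K, 1)).^2, 2);
        [~, assignment(n)] = min(dists);
    end
    % step 3: move each centroid to the mean of its assigned samples
    for k = 1:K
        if any(assignment == k)
            centroids(k, :) = mean(X(assignment == k, :), 1);
        end
    end
    changed = any(assignment ~= oldAssignment);   % step 4: repeat until no change
end
end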
Recap: EM Algorithm
Expectation-Maximization (EM) algorithm:
E-step: softly assign samples to mixture components, i.e. compute the responsibilities γ_j(x_n) = posterior probability that sample x_n was generated by component j.
M-step: re-estimate the parameters (separately for each mixture component) based on the soft assignments, using N̂_j = Σ_n γ_j(x_n) = soft number of samples labeled j.
(Slide adapted from Bernt Schiele)
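One EM iteration for a MoG might look as follows in MATLAB; this is a hedged sketch, the function and variable names are my own, and mvnpdf requires the Statistics Toolbox:

function [pi_k, mu, Sigma] = em_step(X, pi_k, mu, Sigma)
% X: N x D data; pi_k: 1 x K weights; mu: K x D means; Sigma: D x D x K covariances
[N, D] = size(X);
K = numel(pi_k);
% E-step: responsibilities gamma(n,j) proportional to pi_j * N(x_n | mu_j, Sigma_j)
gamma = zeros(N, K);
for j = 1:K
    gamma(:, j) = pi_k(j) * mvnpdf(X, mu(j, :), Sigma(:, :, j));
end
gamma = gamma ./ repmat(sum(gamma, 2), 1, K);
% M-step: re-estimate each component from the soft assignments
for j = 1:K
    Nj = sum(gamma(:, j));                                   % soft number of samples labeled j
    mu(j, :) = gamma(:, j)' * X / Nj;
    Xc = X - repmat(mu(j, :), N, 1);
    Sigma(:, :, j) = (Xc' * (Xc .* repmat(gamma(:, j), 1, D))) / Nj;
    pi_k(j) = Nj / N;
end
end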
Summary: Gaussian Mixture Models
Properties: very general, can represent any (continuous) distribution; once trained, very fast to evaluate; can be updated online.
Problems / caveats: there are some numerical issues in the implementation, so regularization is needed to avoid singularities. EM for MoG is computationally expensive, especially for high-dimensional problems; it has more computational overhead and slower convergence than k-means, and the results are very sensitive to initialization (run k-means for a few iterations as initialization!). In addition, the number of mixture components K must be selected, which is a model selection problem (see Lecture 10).
EM – Technical Advice
When implementing EM, we need to take care to avoid singularities in the estimation: mixture components may collapse onto single data points. This requires regularization, e.g. enforcing a minimum width for the Gaussians by making sure that none of the covariance eigenvalues gets too small: if λ_i < ε, then set λ_i = ε. A small sketch follows below.
(Image source: C.M. Bishop, 2006)
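A possible way to enforce such a minimum eigenvalue in MATLAB (a sketch; Sigma is assumed to be a symmetric covariance matrix, and the threshold epsilon is a tuning parameter, not a value from the lecture):

function Sigma = regularize_covariance(Sigma, epsilon)
% Clamp all eigenvalues of Sigma to be at least epsilon.
[V, D] = eig((Sigma + Sigma') / 2);   % symmetrize for numerical safety
d = diag(D);
d(d < epsilon) = epsilon;             % if lambda_i < epsilon then lambda_i = epsilon
Sigma = V * diag(d) * V';
end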
Applications
Mixture models are used in many practical applications, wherever distributions with complex or unknown shapes need to be represented.
A popular application in computer vision is modeling distributions of pixel colors: each pixel is one data point in, e.g., RGB space; a MoG is learned to represent the class-conditional densities, and the learned models are then used to classify other pixels.
(Image source: C.M. Bishop, 2006)
Application: Color-Based Skin Detection
Collect training samples for skin and non-skin pixels, estimate a MoG to represent the skin and non-skin densities, and then classify skin-color pixels in novel images.
M. Jones and J. Rehg, Statistical Color Models with Application to Skin Detection, IJCV 2002.
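Classifying a new pixel then amounts to comparing the two learned densities, e.g. via a likelihood-ratio test. A hedged sketch in MATLAB (the mog_pdf helper, the model struct layout, and the threshold theta are assumptions for illustration, not details from the cited paper; mvnpdf is from the Statistics Toolbox):

% p_skin, p_nonskin: structs with fields pi_k, mu, Sigma of the two learned MoGs
% x: 1 x 3 RGB value of the pixel to classify, theta: decision threshold
function isSkin = classify_pixel(x, p_skin, p_nonskin, theta)
isSkin = mog_pdf(x, p_skin) / mog_pdf(x, p_nonskin) > theta;
end

function p = mog_pdf(x, model)
% Evaluate a Gaussian mixture density at x.
p = 0;
for j = 1:numel(model.pi_k)
    p = p + model.pi_k(j) * mvnpdf(x, model.mu(j, :), model.Sigma(:, :, j));
end
end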
Interested to Try It?
Here's how you can access a webcam in MATLAB (uses the Image Acquisition Toolbox):

function out = webcam
% uses the "Image Acquisition Toolbox"
adaptorName = 'winvideo';
vidFormat = 'I420_320x240';
vidObj1 = videoinput(adaptorName, 1, vidFormat);
set(vidObj1, 'ReturnedColorSpace', 'rgb');
set(vidObj1, 'FramesPerTrigger', 1);
out = vidObj1;

% Usage: grab a single RGB frame
cam = webcam();
img = getsnapshot(cam);
Topics of This Lecture
Linear discriminant functions: definition; extension to multiple classes.
Least-squares classification: derivation; shortcomings.
Generalized linear models: connection to neural networks; generalized linear discriminants & gradient descent.
Fisher's linear discriminant (FLD): classification as dimensionality reduction; linear discriminant analysis; multiple discriminant analysis; applications.
Discriminant Functions
Bayesian decision theory: model the class-conditional probability densities and priors, compute the posteriors (using Bayes' rule), and minimize the probability of misclassification by maximizing the posterior.
New approach: directly encode the decision boundary, without explicit modeling of the probability densities, and minimize the misclassification probability directly.
(Slide credit: Bernt Schiele)
Discriminant Functions
Formulate classification in terms of comparisons between discriminant functions y_1(x), ..., y_K(x): classify x as class C_k if y_k(x) > y_j(x) for all j ≠ k.
Examples (from Bayes decision theory): y_k(x) = p(C_k | x), or y_k(x) = p(x | C_k) p(C_k), or y_k(x) = log p(x | C_k) + log p(C_k).
(Slide credit: Bernt Schiele)
Discriminant Functions
Example with 2 classes: a single decision function suffices, e.g. (from Bayes decision theory) y(x) = p(C_1 | x) - p(C_2 | x), or y(x) = log( p(x | C_1) / p(x | C_2) ) + log( p(C_1) / p(C_2) ), deciding for C_1 if y(x) > 0.
(Slide credit: Bernt Schiele)
Learning Discriminant Functions
General classification problem. Goal: take a new input x and assign it to one of K classes C_k. Given: a training set X = {x_1, ..., x_N} with target values T = {t_1, ..., t_N}. Learn a discriminant function y(x) to perform the classification.
2-class problem: binary target values, e.g. t_n ∈ {0, 1}.
K-class problem: 1-of-K coding scheme, e.g. t_n = (0, 1, 0, 0)^T.
Linear Discriminant Functions
2-class problem: decide for class C_1 if y(x) > 0, else for class C_2.
In the following, we focus on linear discriminant functions
y(x) = w^T x + w_0,
where w is the weight vector and w_0 is the "bias" (= threshold).
If a data set can be perfectly classified by a linear discriminant, we call it linearly separable.
(Slide credit: Bernt Schiele)
Linear Discriminant Functions
The decision boundary y(x) = 0 defines a hyperplane with normal vector w and offset -w_0 / ||w|| from the origin.
(Slide credit: Bernt Schiele)
Linear Discriminant Functions
Notation (D: number of dimensions):
y(x) = w^T x + w_0 = Σ_{i=1..D} w_i x_i + w_0 = Σ_{i=0..D} w_i x_i, with the constant x_0 = 1.
A tiny illustration of this augmented notation follows below.
(Slide credit: Bernt Schiele)
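In code, this amounts to prepending a constant 1 to every input. A tiny MATLAB illustration (the numeric values are arbitrary placeholders, not from the lecture):

% Augmented notation: fold the bias w0 into the weight vector via a constant x0 = 1.
w0 = -1.0;                       % example bias (illustration value)
w  = [2.0; 0.5];                 % example weight vector
x  = [0.8; 0.3];                 % example input
w_tilde = [w0; w];               % augmented weight vector (w0, w)'
x_tilde = [1; x];                % augmented input (1, x)'
y = w_tilde' * x_tilde;          % identical to w' * x + w0
decideClass1 = (y > 0);          % 2-class decision rule: choose C1 if y(x) > 0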
Extension to Multiple Classes
Two simple strategies: one-vs-all classifiers and one-vs-one classifiers.
How many classifiers do we need in each case? What difficulties do you see with these strategies?
(Image source: C.M. Bishop, 2006)
Extension to Multiple Classes
Problem: both strategies result in regions for which the pure classification result (y_k > 0) is ambiguous. In the one-vs-all case it is still possible to classify those inputs based on the continuous classifier outputs, deciding for the class with y_k > y_j for all j ≠ k.
Solution: we can avoid those difficulties by taking K linear functions of the form y_k(x) = w_k^T x + w_{k0} and defining the decision boundaries directly by deciding for C_k iff y_k > y_j for all j ≠ k. This corresponds to a 1-of-K coding scheme.
(Image source: C.M. Bishop, 2006)
Extension to Multiple Classes
K-class discriminant: a combination of K linear functions y_k(x) = w_k^T x + w_{k0}. The resulting decision hyperplane between classes C_k and C_j is given by y_k(x) = y_j(x), i.e. (w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0.
It can be shown that the decision regions of such a discriminant are always singly connected and convex. This makes linear discriminant models particularly suitable for problems in which the class-conditional densities p(x | C_k) are unimodal. (A small decision-rule sketch follows below.)
(Image source: C.M. Bishop, 2006)
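The resulting decision rule is just an argmax over the K linear outputs. A minimal MATLAB sketch (the weight-matrix layout and function name are assumptions made for illustration):

% W: (D+1) x K matrix, column k holds the augmented weight vector of class k
% X: N x D data matrix; returns the predicted class index for each row
function labels = classify_linear(X, W)
N = size(X, 1);
X_tilde = [ones(N, 1), X];          % augment with constant x0 = 1
Y = X_tilde * W;                    % N x K matrix of y_k(x_n)
[~, labels] = max(Y, [], 2);        % decide C_k iff y_k > y_j for all j ~= k
end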
Topics of This Lecture (outline as above), continuing with least-squares classification.
General Classification Problem
Let's consider K classes described by linear models y_k(x) = w_k^T x + w_{k0}, k = 1, ..., K. We can group these together using vector notation: y(x) = W̃^T x̃, where the k-th column of W̃ is the augmented weight vector w̃_k = (w_{k0}, w_k^T)^T and x̃ = (1, x^T)^T.
The output will again be in 1-of-K notation, so we can directly compare it to the target value t.
General Classification Problem
For the entire dataset, we can write Y(X̃) = X̃W̃ and compare this to the target matrix T, where X̃ stacks the augmented inputs x̃_n^T as rows and T stacks the target vectors t_n^T as rows. The result of the comparison is the residual X̃W̃ - T.
Goal: choose W̃ such that this residual is minimal!
Least-Squares Classification
Simplest approach: directly try to minimize the sum-of-squares error
E_D(W̃) = ½ Tr{ (X̃W̃ - T)^T (X̃W̃ - T) }.
We could also write this as a double sum over data points and outputs, E(W̃) = ½ Σ_n Σ_k (w̃_k^T x̃_n - t_{nk})², but let's stick with the matrix notation for now.
Least-Squares Classification
Simplest approach: directly try to minimize the sum-of-squares error E_D(W̃) = ½ Tr{ (X̃W̃ - T)^T (X̃W̃ - T) }. Taking the derivative (using the chain rule) yields
∂E_D/∂W̃ = X̃^T (X̃W̃ - T).
Least-Squares Classification
Minimizing the sum-of-squares error: setting the derivative to zero gives X̃^T X̃ W̃ = X̃^T T, hence
W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T,
where X̃† is the "pseudo-inverse" of X̃. We then obtain the discriminant function as
y(x) = W̃^T x̃ = T^T (X̃†)^T x̃.
This is an exact, closed-form solution for the discriminant function parameters. (A short MATLAB sketch is given below.)
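A compact MATLAB sketch of least-squares classification under these definitions (the function and variable names are my own; pinv is MATLAB's pseudo-inverse):

% X: N x D training inputs, T: N x K targets in 1-of-K coding
function W_tilde = train_least_squares(X, T)
N = size(X, 1);
X_tilde = [ones(N, 1), X];              % augmented design matrix
W_tilde = pinv(X_tilde) * T;            % closed-form solution via the pseudo-inverse
end

% Usage: predict class labels for new inputs Xnew
% Y = [ones(size(Xnew, 1), 1), Xnew] * W_tilde;
% [~, labels] = max(Y, [], 2);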
Problems with Least Squares
Least-squares is very sensitive to outliers! The error function penalizes predictions that are "too correct", i.e. that lie a long way on the correct side of the decision boundary.
(Image source: C.M. Bishop, 2006)
Problems with Least-Squares
Another example: 3 classes (red, green, blue). Even though the problem is linearly separable, the least-squares solution misclassifies most of the green points!
Deeper reason for the failure: least-squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, but our binary target vectors have a distribution that is clearly non-Gaussian. Least-squares is the wrong probabilistic tool in this case!
(Image source: C.M. Bishop, 2006)
Topics of This Lecture (outline as above), continuing with generalized linear models.
Generalized Linear Models
Linear model: y(x) = w^T x + w_0.
Generalized linear model: y(x) = g(w^T x + w_0), where g(·) is called an activation function and may be nonlinear.
The decision surfaces correspond to y(x) = const., i.e. w^T x + w_0 = const. If g is monotonic (which is typically the case), the resulting decision boundaries are still linear functions of x.
Generalized Linear Models
Consider 2 classes:
p(C_1 | x) = p(x | C_1) p(C_1) / ( p(x | C_1) p(C_1) + p(x | C_2) p(C_2) ) = 1 / (1 + exp(-a)) = g(a),
with a = ln[ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ].
(Slide credit: Bernt Schiele)
Logistic Sigmoid Activation Function
g(a) = 1 / (1 + exp(-a)).
Example: for normal class-conditional distributions with identical covariance, the posterior p(C_1 | x) is exactly a logistic sigmoid applied to a linear function of x.
(Slide credit: Bernt Schiele)
Normalized Exponential
General case of K > 2 classes:
p(C_k | x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j) = exp(a_k) / Σ_j exp(a_j),
with a_k = ln( p(x | C_k) p(C_k) ).
This is known as the normalized exponential or softmax function and can be regarded as a multiclass generalization of the logistic sigmoid.
(Slide credit: Bernt Schiele)
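Both activation functions are a few lines of MATLAB (a sketch; each function would live in its own file or as a local function, and the max-subtraction in the softmax is a standard numerical-stability trick that is not discussed on the slide):

function g = sigmoid(a)
% Logistic sigmoid g(a) = 1 / (1 + exp(-a)), applied element-wise.
g = 1 ./ (1 + exp(-a));
end

function p = softmax(a)
% Normalized exponential over a vector of activations a_k.
a = a - max(a);                 % subtract the max for numerical stability
p = exp(a) ./ sum(exp(a));
end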
Relationship to Neural Networks
2-class case: y(x) = g( Σ_{i=0..D} w_i x_i ) with the constant x_0 = 1.
This is exactly the form of a neural network "single-layer perceptron": the inputs x_i are fed through weighted connections w_i into a single output unit with activation function g.
(Slide credit: Bernt Schiele)
Relationship to Neural Networks
Multi-class case: y_k(x) = g( Σ_{i=0..D} w_{ki} x_i ) with the constant x_0 = 1, i.e. a multi-class perceptron with one output unit per class.
(Slide credit: Bernt Schiele)
Logistic Discrimination
If we use the logistic sigmoid activation function, y(x) = g(w^T x + w_0), then we can interpret the outputs y(x) as posterior probabilities!
(Slide adapted from Bernt Schiele)
Other Motivation for Nonlinearity
Recall least-squares classification: one of the problems was that data points which are "too correct" have a strong influence on the decision surface under a squared-error criterion. The reason is that the output y(x_n; w) can grow arbitrarily large for some x_n. By choosing a suitable nonlinearity (e.g. a sigmoid), we can limit those influences.
Discussion: Generalized Linear Models
Advantages: the nonlinearity gives us more flexibility; it can be used to limit the effect of outliers; the choice of a sigmoid leads to a nice probabilistic interpretation.
Disadvantage: least-squares minimization in general no longer leads to a closed-form analytical solution, so we need to apply iterative methods such as gradient descent.
Linear Separability
Up to now we made a restrictive assumption: we only considered linear decision boundaries. The classical counterexample, for which no linear boundary can separate the classes, is XOR.
(Slide credit: Bernt Schiele)
Linear Separability
Even if the data are not linearly separable, a linear decision boundary may still be "optimal" in terms of generalization, e.g. for normally distributed data with equal covariance matrices.
The choice of the right discriminant function is important and should be based on prior knowledge (of the general functional form) and on empirical comparison of alternative models. Linear discriminants are often used as a benchmark.
(Slide credit: Bernt Schiele)
Generalized Linear Discriminants
Generalization: transform the vector x with M nonlinear basis functions φ_j(x):
y_k(x) = Σ_{j=1..M} w_{kj} φ_j(x) + w_{k0} = Σ_{j=0..M} w_{kj} φ_j(x), with the constant φ_0(x) = 1.
Purpose of the basis functions φ_j(x): they allow non-linear decision boundaries, and by choosing the right φ_j, every continuous function can (in principle) be approximated with arbitrary accuracy. A feature-mapping sketch follows below.
(Slide credit: Bernt Schiele)
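As an illustration, Gaussian radial basis functions are one common choice of φ_j; this is an assumption for the sketch, since the lecture does not prescribe a specific basis:

% Map N x D inputs onto M nonlinear basis functions phi_j (Gaussian RBFs),
% plus the constant phi_0 = 1, so that a linear discriminant in phi-space
% yields a nonlinear decision boundary in x-space.
function Phi = rbf_features(X, centers, sigma)
% X: N x D data, centers: M x D RBF centers, sigma: RBF width
N = size(X, 1);
M = size(centers, 1);
Phi = ones(N, M + 1);                          % column 1 is phi_0(x) = 1
for j = 1:M
    diff = X - repmat(centers(j, :), N, 1);
    Phi(:, j + 1) = exp(-sum(diff.^2, 2) / (2 * sigma^2));
end
end

The weights can then be trained exactly as before, e.g. W = pinv(Phi) * T for least squares, or with the gradient-descent rules that follow.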
Generalized Linear Discriminants
Model: K functions (outputs) y_k(x; w).
Learning in neural networks: in single-layer networks, the φ_j are fixed and only the weights w are learned; in multi-layer networks, both the w and the φ_j are learned.
In the following, we will not go into details about neural networks in particular, but consider generalized linear discriminants in general.
(Slide credit: Bernt Schiele)
Gradient Descent
Learning the weights w. Given N training data points X = {x_1, ..., x_N} with target vectors T = {t_1, ..., t_N} and K outputs of the decision functions y_k(x_n; w), the (least-squares) error function of the linear model is
E(w) = ½ Σ_n Σ_k ( y_k(x_n; w) - t_{kn} )².
(Slide credit: Bernt Schiele)
Gradient Descent
Problem: the error function can in general no longer be minimized in closed form.
Idea (gradient descent): iterative minimization. Start with an initial guess for the parameter values, then move towards a (local) minimum by following the gradient:
w_{kj}^{(τ+1)} = w_{kj}^{(τ)} - η ∂E(w)/∂w_{kj} evaluated at w^{(τ)},
where η is the learning rate. This simple scheme corresponds to a first-order Taylor expansion (there are more complex procedures available).
Gradient Descent – Basic Strategies
"Batch learning": compute the gradient based on all training data,
w_{kj}^{(τ+1)} = w_{kj}^{(τ)} - η ∂E(w)/∂w_{kj}, with E(w) = Σ_n E_n(w),
where η is the learning rate.
(Slide credit: Bernt Schiele)
Gradient Descent – Basic Strategies
"Sequential updating": compute the gradient based on a single data point at a time,
w_{kj}^{(τ+1)} = w_{kj}^{(τ)} - η ∂E_n(w)/∂w_{kj},
where η is the learning rate.
(Slide credit: Bernt Schiele)
Gradient Descent
Error function for a single data point (linear model, squared error):
E_n(w) = ½ Σ_k ( y_k(x_n; w) - t_{kn} )² = ½ Σ_k ( Σ_j w_{kj} x_{nj} - t_{kn} )²,
so the gradient is ∂E_n/∂w_{kj} = ( y_k(x_n; w) - t_{kn} ) x_{nj}.
(Slide credit: Bernt Schiele)
Gradient Descent
Delta rule (= LMS rule):
w_{kj}^{(τ+1)} = w_{kj}^{(τ)} - η ( y_k(x_n; w) - t_{kn} ) x_{nj} = w_{kj}^{(τ)} - η δ_{kn} x_{nj},
where δ_{kn} = y_k(x_n; w) - t_{kn} is the classification error. We simply feed back the input data point, weighted by the classification error. (A short sketch follows below.)
(Slide credit: Bernt Schiele)
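A sequential (per-sample) delta-rule update might look like this in MATLAB; this is a sketch in which the learning rate eta and the number of epochs are tuning choices, and the raw augmented inputs are used instead of basis functions:

% X: N x D inputs, T: N x K targets (1-of-K), eta: learning rate
function W = train_delta_rule(X, T, eta, numEpochs)
N = size(X, 1);
K = size(T, 2);
X_tilde = [ones(N, 1), X];                 % augmented inputs, x0 = 1
W = zeros(size(X_tilde, 2), K);            % (D+1) x K weight matrix
for epoch = 1:numEpochs
    for n = 1:N
        y = (X_tilde(n, :) * W)';          % K x 1 outputs y_k(x_n; w)
        delta = y - T(n, :)';              % classification error delta_kn
        W = W - eta * X_tilde(n, :)' * delta';   % w_kj <- w_kj - eta * delta_kn * x_nj
    end
end
end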
Gradient Descent
Case of a differentiable, non-linear activation function g: gradient descent yields
∂E_n/∂w_{kj} = ( y_k(x_n; w) - t_{kn} ) g'( Σ_j w_{kj} x_{nj} ) x_{nj},
i.e. the delta rule is simply multiplied by the derivative of the activation function.
(Slide credit: Bernt Schiele)
Summary: Generalized Linear Discriminants
Properties: a general class of decision functions; the nonlinearity g(·) and the basis functions φ_j allow us to address linearly non-separable problems. We have shown a simple sequential learning approach for parameter estimation using gradient descent; better second-order approaches are available (e.g. Newton-Raphson).
Limitations / caveats: the flexibility of the model is limited by the curse of dimensionality, since g(·) and the φ_j often introduce additional parameters, so models are either limited to lower-dimensional input spaces or need to share parameters. The linearly separable case often leads to overfitting, because several possible parameter choices minimize the training error.
Topics of This Lecture (outline as above), continuing with Fisher's linear discriminant (FLD).
Classification as Dimensionality Reduction
We can interpret the linear classification model as a projection onto a lower-dimensional space: take the D-dimensional input vector x and project it down to one dimension by applying y = w^T x. If we now place a threshold at y ≥ -w_0, we obtain our standard two-class linear classifier. The classifier will have a lower error the better this projection separates the two classes.
New interpretation of the learning problem: try to find the projection vector w that maximizes the class separation.
Classification as Dimensionality Reduction
Two questions: How do we measure class separation? How do we find the best projection (with maximal class separation)?
(Figure: a projection with bad separation vs. one with good separation. Image source: C.M. Bishop, 2006)
Classification as Dimensionality Reduction
Measuring class separation: we could simply measure the separation of the projected class means, i.e. choose w so as to maximize m̃_2 - m̃_1 = w^T (m_2 - m_1).
Problems with this approach: 1. the expression can be made arbitrarily large by increasing ||w||, so we need to enforce an additional constraint (e.g. ||w|| = 1); 2. the criterion may result in bad separation if the clusters have elongated shapes.
(Image source: C.M. Bishop, 2006)
Fisher's Linear Discriminant Analysis (FLD)
Better idea: find a projection that maximizes the ratio of the between-class variance to the within-class variance,
J(w) = (m̃_2 - m̃_1)² / (s̃_1² + s̃_2²).
Usually, this is written as
J(w) = (w^T S_B w) / (w^T S_W w),
where S_B = (m_2 - m_1)(m_2 - m_1)^T is the between-class scatter matrix and S_W = Σ_{n∈C_1} (x_n - m_1)(x_n - m_1)^T + Σ_{n∈C_2} (x_n - m_2)(x_n - m_2)^T is the within-class scatter matrix.
Fisher's Linear Discriminant Analysis (FLD)
Maximize the distance between the classes while minimizing the distance within each class. Criterion: J(w) = (w^T S_B w) / (w^T S_W w), with S_B the between-class scatter matrix and S_W the within-class scatter matrix.
The optimal solution for w can be obtained as w ∝ S_W^{-1} (m_2 - m_1).
Classification function: project a sample onto w and compare against a threshold, y(x) = w^T x + w_0. (A short sketch follows below.)
(Slide credit: Ales Leonardis)
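A minimal two-class FLD in MATLAB following these definitions (a sketch; the choice of the threshold as the projected midpoint between the class means is my own assumption, since the slide does not fix w_0):

% X1, X2: N1 x D and N2 x D samples of class 1 and class 2
function [w, w0] = fisher_lda(X1, X2)
m1 = mean(X1, 1)';                         % class means (D x 1)
m2 = mean(X2, 1)';
Xc1 = X1 - repmat(m1', size(X1, 1), 1);
Xc2 = X2 - repmat(m2', size(X2, 1), 1);
Sw = Xc1' * Xc1 + Xc2' * Xc2;              % within-class scatter matrix
w = Sw \ (m2 - m1);                        % optimal projection, w ~ Sw^-1 (m2 - m1)
w0 = -w' * (m1 + m2) / 2;                  % threshold at the projected midpoint (assumption)
end

% Classify a new sample x (D x 1): decide for class 2 if w' * x + w0 > 0, else class 1.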
Multiple Discriminant Analysis
Generalization to K classes: project onto a (K-1)-dimensional subspace y = W^T x and maximize the generalized Fisher criterion
J(W) = |W^T S_B W| / |W^T S_W W|,
where S_W = Σ_k Σ_{n∈C_k} (x_n - m_k)(x_n - m_k)^T is the within-class scatter matrix and S_B = Σ_k N_k (m_k - m)(m_k - m)^T is the between-class scatter matrix (m denotes the overall mean).
Maximizing J(W)
This leads to a generalized eigenvalue problem: the columns of the optimal W are the eigenvectors corresponding to the largest eigenvalues of
S_B w_i = λ_i S_W w_i.
Defining v = S_W^{1/2} w, we get S_W^{-1/2} S_B S_W^{-1/2} v = λ v, which is a regular eigenvalue problem. Solve it to get the eigenvectors v, then recover the w from them.
For the K-class case we obtain (at most) K-1 projections (i.e. eigenvectors corresponding to non-zero eigenvalues). A small solver sketch follows below.
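In MATLAB, the generalized eigenvalue problem can be solved directly with eig (a sketch; the cell-array input format is my own choice, and S_W is assumed to be non-singular):

% classes: 1 x K cell array, classes{k} is an N_k x D matrix of samples of class k
function W = multiple_discriminant_analysis(classes)
K = numel(classes);
allX = cat(1, classes{:});
m = mean(allX, 1)';                                % overall mean
D = size(allX, 2);
Sw = zeros(D); Sb = zeros(D);
for k = 1:K
    Xk = classes{k};
    mk = mean(Xk, 1)';
    Xc = Xk - repmat(mk', size(Xk, 1), 1);
    Sw = Sw + Xc' * Xc;                            % within-class scatter
    Sb = Sb + size(Xk, 1) * (mk - m) * (mk - m)';  % between-class scatter
end
[V, L] = eig(Sb, Sw);                              % generalized eigenvalue problem Sb*w = lambda*Sw*w
[~, order] = sort(real(diag(L)), 'descend');
W = V(:, order(1:K-1));                            % keep the K-1 largest-eigenvalue directions
end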
What Does It Mean?
What does it mean to apply a linear classifier y(x) = w^T x + w_0?
Classifier interpretation: the weight vector w has the same dimensionality as x; a dimension makes a positive contribution where sign(x_i) = sign(w_i). The weight vector thus identifies which input dimensions are important for a positive or negative classification (large |w_i|) and which ones are irrelevant (near-zero w_i). If the inputs x are normalized, we can interpret w as a "template" vector that the classifier tries to match.
Example Application: Fisherfaces
Visual discrimination task: decide whether a face image shows a subject with glasses.
Training data: C_1 = subjects with glasses, C_2 = subjects without glasses. Test: a new face image, glasses or not?
Take each image as a vector of pixel values and apply FLD.
(Image source: Yale Face Database)
Fisherfaces: Interpretability
The resulting weight vector for "Glasses/NoGlasses", visualized as an image. [Belhumeur et al. 1997]
(Slide credit: Peter Belhumeur)
Summary: Fisher's Linear Discriminant
Properties: a simple method for dimensionality reduction that preserves class discriminability; allows parametric methods in the reduced-dimensional space that might not be feasible in the original higher-dimensional space; widely used in practical applications.
Restrictions / caveats: it is not possible to get more than K-1 projections. FLD reduces the computation to class means and covariances, with the implicit assumption that the class distributions are unimodal and well-approximated by a Gaussian/hyperellipsoid. It also assumes N > D (more training examples than dimensions), which may not be given in some domains (e.g. vision); a common solution is to apply PCA first to obtain a problem with D = N-1 dimensions.
References and Further Reading
More information on linear discriminant functions can be found in Chapter 4 of Bishop's book (in particular Section 4.1):
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.