1 Computational Intelligence: Methods and Applications Lecture 23 Logistic discrimination and support vectors Włodzisław Duch Dept. of Informatics, UMK Google: W Duch

2 Logistic discrimination Basic assumption of the logistic model: the logarithm of the ratio of the class distributions is a linear function of X: ln[P(X|ω_1)/P(X|ω_2)] = W^T X + W_0. This is exact when the class distributions are normal (Gaussian) with equal covariance matrices, and for some discrete data distributions. Since these probabilities sum to 1, using the Bayes formula P(ω|X) = P(X|ω) P(ω)/P(X), the model is equivalent to: P(ω_1|X) = 1/(1 + exp(-(W^T X + W_0))) and P(ω_2|X) = 1 - P(ω_1|X), with the log-ratio of the priors absorbed into W_0.
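This equivalence is easy to check numerically. Below is a minimal sketch (not part of the lecture; the means, shared covariance and priors are made-up illustration values) that computes the linear coefficients from the Gaussian equal-covariance formulas and verifies that the Bayes posterior and the logistic model agree:

```python
# Numerical check: for two Gaussian classes with equal covariance, the Bayes
# posterior P(w1|X) is exactly a logistic (sigmoid) function of W^T X + W_0.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])   # illustrative means
cov = np.array([[1.0, 0.3], [0.3, 2.0]])                 # shared covariance
p1, p2 = 0.6, 0.4                                        # class priors P(w1), P(w2)

# Linear parameters implied by the equal-covariance Gaussian model:
cov_inv = np.linalg.inv(cov)
W = cov_inv @ (mu1 - mu2)
W0 = -0.5 * (mu1 + mu2) @ cov_inv @ (mu1 - mu2) + np.log(p1 / p2)

X = rng.normal(size=(5, 2))                              # a few test points

# Posterior from the Bayes formula P(w1|X) = P(X|w1)P(w1) / P(X)
px1 = multivariate_normal.pdf(X, mu1, cov) * p1
px2 = multivariate_normal.pdf(X, mu2, cov) * p2
posterior_bayes = px1 / (px1 + px2)

# Posterior from the logistic model
posterior_logistic = 1.0 / (1.0 + np.exp(-(X @ W + W0)))

print(np.allclose(posterior_bayes, posterior_logistic))  # True
```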

3 Logistic DA The classification rule is therefore: choose ω_1 if P(ω_1|X) > P(ω_2|X), i.e. if W^T X + W_0 > 0, otherwise choose ω_2. This time the probabilities (observations) are non-linear functions of the parameters W; usually iterative procedures based on maximization of the likelihood of generating the observed data are used, equivalent to minimization of the negative log-likelihood (cross-entropy) -Σ_i [Y_i ln P(ω_1|X^(i)) + (1 - Y_i) ln P(ω_2|X^(i))]. Using logistic functions for P(ω|X) and calculating gradients with respect to W leads to a non-linear optimization problem. This is implemented in WEKA/YALE, usually giving better results than LDA at some increase in computational cost.
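A minimal sketch of this iterative maximum-likelihood fit, using plain gradient descent on the negative log-likelihood (the toy data, learning rate and iteration count are illustrative assumptions, not the WEKA/YALE implementation):

```python
# Logistic discrimination trained by gradient descent on the cross-entropy loss.
import numpy as np

def fit_logistic(X, Y, lr=0.1, n_iter=2000):
    """X: (n, d) data, Y: (n,) labels in {0, 1}. Returns (W, W0)."""
    n, d = X.shape
    W, W0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ W + W0)))   # P(w1|X) under current W
        grad = p - Y                              # gradient of the neg. log-likelihood
        W -= lr * X.T @ grad / n
        W0 -= lr * grad.mean()
    return W, W0

# Toy data: two Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
Y = np.hstack([np.ones(50), np.zeros(50)])

W, W0 = fit_logistic(X, Y)
pred = (X @ W + W0 > 0).astype(int)               # classification rule g(X) > 0
print("training accuracy:", (pred == Y).mean())
```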

4 WEKA Logistic, voting data Similar results to LDA.

Whole data:
=== Confusion Matrix ===
   a   b   <-- classified as
 260   7 |   a = democrat
   5 163 |   b = republican

10xCV results:
   a   b   <-- classified as
 258   9 |   a = democrat
   9 159 |   b = republican

Decision trees give better results in this case; perhaps one hyperplane is not sufficient.

5 Maximization of margin 1 Among all discriminating hyperplanes there is one that is clearly better.

6 Maximization of margin 2 g(X) = W^T X + W_0 is the discriminant function, and g(X)/||W|| is the distance of X from the plane g(X) = 0. The best discriminating hyperplane should maximize the distance between the g(X) = 0 plane and the data samples that are nearest to it. [Figure: a margin of width 1/||W|| on each side of g(X) = 0, bounded by the planes g(X) = +1 and g(X) = -1; samples lie in the regions g(X) > +1 or g(X) < -1.]
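A small illustrative sketch of this distance computation (the weights and points below are arbitrary example values):

```python
# Signed distance of samples from the plane g(X) = 0, computed as g(X)/||W||.
import numpy as np

W = np.array([2.0, -1.0])
W0 = 0.5
X = np.array([[1.0, 1.0],
              [0.0, 0.5],
              [-1.0, 2.0]])

g = X @ W + W0                       # discriminant values g(X) = W^T X + W_0
dist = g / np.linalg.norm(W)         # signed distance from the plane g(X) = 0
print(dist)
```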

7 Maximization of margin 3 Maximize the distance g_W(X)/||W|| between the plane defined by W and the data samples, or equivalently maximize the value of the discriminant g_W(X) for ||W|| = 1. Find the vectors X^(i) that are close to the W hyperplane in d dimensions; for these vectors find the W giving maximum distance. Which vectors should be chosen as "support" for such a calculation? Let the target values for classification be Y(ω_1) = +1 and Y(ω_2) = -1, and let the margin b be the distance between the plane and these support vectors: Y^(i) g_W(X^(i)) ≥ b ||W||. This should be true for all vectors in the separable case.

8 Formulation of the problem Setting b||W|| = 1 (a particular choice of b), the separation conditions are: Y^(i) (W^T X^(i) + W_0) ≥ 1 for all i. The largest margin is obtained from minimization of ||W|| with g_W(X) fulfilling the separation conditions. This leads to a constrained minimization problem. These conditions define two canonical hyperplanes: H_1: W^T X + W_0 = +1 and H_2: W^T X + W_0 = -1. The distance of H_i from the separating plane H_0: g_W(X) = 0 is D(H_0, H_i) = 1/||W||. Support vectors are the vectors closest to the separating hyperplane, the most difficult to separate and the most informative. Minimize ||W||^2/2 subject to the constraints Y^(i) (W^T X^(i) + W_0) ≥ 1, i = 1, ..., n.
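A sketch of this constrained minimization on toy data, using a generic SLSQP solver from SciPy rather than a dedicated SVM solver (the data and the solver choice are assumptions for illustration only):

```python
# Minimize ||W||^2/2 subject to Y_i (W^T X_i + W_0) >= 1 for all i.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
Y = np.hstack([np.ones(20), -np.ones(20)])         # targets Y = +1 / -1

def objective(params):                             # params = [W_1, W_2, W_0]
    W = params[:2]
    return 0.5 * W @ W                             # ||W||^2 / 2

def margin_constraints(params):                    # Y_i (W^T X_i + W_0) - 1 >= 0
    W, W0 = params[:2], params[2]
    return Y * (X @ W + W0) - 1.0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
W, W0 = res.x[:2], res.x[2]
print("W =", W, "W0 =", W0, "margin =", 1.0 / np.linalg.norm(W))
```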

9 Scalar product form In the d-dimensional space, if n > d the weight vector may be expressed as a combination of the training vectors: W = Σ_i α_i X^(i). It should be enough to take only d independent training vectors, so most α_i = 0. Therefore the discriminant function is: g(X) = W^T X + W_0 = Σ_i α_i X^(i)T X + W_0. The kernel matrix K_ij = X^(i)T X^(j) of scalar products will play an important role soon...
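A short illustrative sketch of the kernel matrix of scalar products and the discriminant written through them (the coefficients α_i below are made-up values):

```python
# Kernel (Gram) matrix K_ij = X_i^T X_j and the discriminant as a sum of scalar products.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))                   # 5 training vectors in d = 3 dimensions
alpha = np.array([0.0, 0.7, 0.0, 1.2, 0.0])   # hypothetical coefficients, most are zero
W0 = -0.1

K = X @ X.T                                   # kernel matrix of all scalar products
print(K.shape)                                # (5, 5)

x_new = rng.normal(size=3)
g = alpha @ (X @ x_new) + W0                  # g(x) = sum_i alpha_i X_i^T x + W_0
print(g)
```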

10 Lagrange form and SV The Lagrange multiplier method is used to convert constrained minimization problems into a simpler optimization problem (here X includes X_0 = 1): L(W, α) = ||W||^2/2 - Σ_i α_i [Y^(i) g(X^(i)) - 1], where the α_i are Lagrange multipliers, free positive parameters, and the summation runs over all n training samples. Minimization of the Lagrangian function over W increases the margin. Suppose that X^(i) is misclassified; then the second term Y^(i) g(X^(i)) - 1 in the Lagrangian is negative, and a large α_i will create a large contribution to L(W, α); this will be decreased by changing W to remove the error. Therefore ||W|| should be minimized and α maximized, but only for vectors for which Y^(i) g(X^(i)) - 1 = 0, called Support Vectors (SV). This leads to a search for a saddle point, not a minimum; to simplify it, the W parameters are replaced by α.

11 Scalar product discriminant Differentiating with respect to W and W_0 gives: ∂L/∂W = 0 ⇒ W = Σ_i α_i Y^(i) X^(i), and ∂L/∂W_0 = 0 ⇒ Σ_i α_i Y^(i) = 0; for a support vector Y^(i) g(X^(i)) = 1, so W_0 = Y^(i) - W^T X^(i). Interesting! W is now a linear combination of input vectors! This makes sense, since a component W_Z of W = W_Z + W_X that does not belong to the space spanned by the X^(i) vectors has no influence on the discrimination process, because W_Z^T X = 0. Inserting W into the discriminant function: g(X) = Σ_i α_i Y^(i) X^(i)T X + W_0.
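This relation can be checked with a library SVM implementation. A sketch using scikit-learn's linear SVC (assuming scikit-learn is available; in its convention dual_coef_ stores the products α_i Y^(i) for the support vectors):

```python
# Verify that the fitted weight vector equals W = sum_i alpha_i Y_i X_i
# taken over the support vectors only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(2, 0.7, (30, 2)), rng.normal(-2, 0.7, (30, 2))])
Y = np.hstack([np.ones(30), -np.ones(30)])

svc = SVC(kernel="linear", C=1e3).fit(X, Y)        # large C ~ hard margin

# dual_coef_ holds alpha_i * Y_i for the support vectors
W_from_alphas = svc.dual_coef_ @ svc.support_vectors_
print(np.allclose(W_from_alphas, svc.coef_))       # True
print("number of support vectors:", len(svc.support_vectors_))
```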

12 Lagrangian in dual form Substituting W into the Lagrangian leads to maximization of the dual form (X here may be d+1- or d-dimensional, it does not matter): L_D(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j Y^(i) Y^(j) X^(i)T X^(j), maximized subject to the constraints α_i ≥ 0 and Σ_i α_i Y^(i) = 0. In this form the optimization criterion is expressed through inner products of the training vectors, and in the end only the support vectors contribute. Initially the number of parameters is equal to the number of patterns n, usually much bigger than the dimensionality d, but the final number of non-zero α_i may be small. This type of quadratic optimization problem has a unique solution! A popular approach: SMO, the Sequential Minimal Optimization algorithm for quadratic programming, fast and accurate.
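A sketch of solving this dual problem directly on toy data, using a generic SLSQP solver (SMO or a dedicated QP library would be the practical choice; the data and solver are illustrative assumptions):

```python
# Maximize L_D(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j Y_i Y_j X_i^T X_j
# subject to alpha_i >= 0 and sum_i alpha_i Y_i = 0 (written as minimizing -L_D).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(2, 0.5, (15, 2)), rng.normal(-2, 0.5, (15, 2))])
Y = np.hstack([np.ones(15), -np.ones(15)])
n = len(Y)

K = X @ X.T                                   # kernel matrix of scalar products
Q = (Y[:, None] * Y[None, :]) * K             # Q_ij = Y_i Y_j X_i^T X_j

def neg_dual(alpha):                          # negative of the dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ Y}])
alpha = res.x
support = alpha > 1e-6                        # only a few alphas end up non-zero
print("support vectors:", support.sum(), "out of", n)

W = (alpha * Y) @ X                           # recover W = sum_i alpha_i Y_i X_i
print("W =", W)
```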

