1
Kernel Technique: Based on Mercer's Condition (1909)
The value of a kernel function K(x, z) represents the inner product of two training points in feature space. Kernel functions merge two steps:
1. map the input data from input space to feature space (which might be infinite-dimensional)
2. compute the inner product in the feature space
2
A Simple Example of a Kernel
Polynomial kernel of degree 2: let x, z ∈ R² and the nonlinear map φ: R² → R³ be defined by φ(x) = (x₁², x₂², √2 x₁x₂). Then φ(x)ᵀφ(z) = (xᵀz)² = K(x, z). There are many other nonlinear maps ψ(x) that satisfy the same relation ψ(x)ᵀψ(z) = (xᵀz)².
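A quick numerical check of this identity (a minimal NumPy sketch; the map φ is exactly the one defined above, the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for x in R^2:
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The inner product in feature space ...
print(phi(x) @ phi(z))   # 1.0
# ... equals the kernel evaluated directly in input space
print((x @ z) ** 2)      # 1.0, since x'z = 3 - 2 = 1
```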
3
Power of the Kernel Technique
Consider a nonlinear map φ: Rⁿ → Rᵖ that consists of distinct features of all the monomials of degree d. Then p = C(n+d-1, d), which grows combinatorially. For example, n = 100 and d = 4 already give p = C(103, 4) = 4,421,275. Is it necessary to compute φ(x) explicitly? No: we only need to know the inner products φ(x)ᵀφ(z), and this can be achieved by evaluating the kernel directly in input space.
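To make the combinatorial explosion concrete, a small sketch (Python; the count uses the C(n+d-1, d) formula above, and the n = 100, d = 4 instance is just an illustration):

```python
from math import comb

def num_monomials(n, d):
    # Number of distinct monomials of degree d in n variables: C(n+d-1, d)
    return comb(n + d - 1, d)

# The explicit feature space blows up quickly ...
print(num_monomials(100, 4))   # 4,421,275 features
# ... while evaluating (x'z)^d always costs one inner product in R^n.
```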
4
More Examples of Kernels
Polynomial kernel: K(A, Bᵀ) = (ABᵀ + b)^d, where d is a positive integer (d = 1, b = 0 gives the linear kernel K(A, Bᵀ) = ABᵀ).
Gaussian (radial basis) kernel: K(A, Aᵀ)ᵢⱼ = exp(-μ‖Aᵢ - Aⱼ‖²).
The ij-entry of K(A, Aᵀ) represents the "similarity" of data points Aᵢ and Aⱼ.
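A minimal sketch of computing a Gaussian kernel matrix (NumPy; the parameter name mu follows the formula above, and the data are random placeholders):

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    # K(A, B')_{ij} = exp(-mu * ||A_i - B_j||^2)
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-mu * sq_dist)

A = np.random.randn(5, 3)         # 5 placeholder points in R^3
K = gaussian_kernel(A, A, mu=0.5)
print(np.diag(K))                 # all 1: each point is maximally similar to itself
```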
5
Nonlinear SVM Motivation
Linear SVM (linear separator: xᵀw = γ):
  min over (w, γ, ξ):  (ν/2)‖ξ‖² + (1/2) wᵀw
  s.t.  D(Aw - eγ) + ξ ≥ e    (QP)
By QP "duality", w = AᵀDα. Maximizing the margin in the "dual space" gives the dual SSVM with separator xᵀAᵀDα = γ:
  min over (α, γ, ξ):  (ν/2)‖ξ‖² + (1/2)(αᵀα + γ²)
  s.t.  D(AAᵀDα - eγ) + ξ ≥ e
6
Nonlinear Smooth SVM
Replace AAᵀ by a nonlinear kernel K(A, Aᵀ):
  min over (α, γ):  (ν/2)‖p(e - D(K(A, Aᵀ)Dα - eγ), β)‖² + (1/2)(αᵀα + γ²)
Use the Newton-Armijo algorithm to solve the problem; each iteration solves m+1 linear equations in m+1 variables. The nonlinear classifier depends on the entire dataset.
Nonlinear classifier: sign(K(xᵀ, Aᵀ)Dα - γ)
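For illustration, a sketch of this smooth objective, assuming the standard SSVM smoothing p(x, β) = x + (1/β) log(1 + exp(-βx)) for the plus function; the function names and the default β are illustrative:

```python
import numpy as np

def p(x, beta):
    # Smooth approximation of the plus function (x)_+ = max(x, 0):
    # p(x, beta) = x + (1/beta) * log(1 + exp(-beta * x))
    return x + np.logaddexp(0.0, -beta * x) / beta

def ssvm_objective(alpha, gamma, K, y, nu, beta=5.0):
    # (nu/2) * ||p(e - D(K D alpha - e gamma), beta)||^2
    #   + (1/2) * (alpha'alpha + gamma^2),  with D = diag(y)
    e = np.ones(len(y))
    r = p(e - y * (K @ (y * alpha) - gamma), beta)
    return 0.5 * nu * (r @ r) + 0.5 * (alpha @ alpha + gamma**2)
```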
7
Difficulties with Nonlinear SVM for Large Problems
- The nonlinear kernel K(A, Aᵀ) is fully dense
- Computational complexity depends on the number of examples m
- The separating surface depends on almost the entire dataset
- Complexity of nonlinear SVM:
  - Runs out of memory while storing the m × m kernel matrix
  - Long CPU time to compute the dense kernel matrix
  - Need to generate and store m² entries
  - Need to store the entire dataset even after solving the problem
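For a sense of scale, a back-of-the-envelope check (Python; m = 32,562 is the largest Adult training split quoted later in the deck):

```python
m = 32_562                    # largest Adult training split (see table below)
bytes_needed = m * m * 8      # one float64 per kernel matrix entry
print(f"{bytes_needed / 1e9:.1f} GB")   # ~8.5 GB for the full kernel matrix
```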
8
Solving the SVM with Massive Datasets
Memory constraints limit standard SVM solvers to datasets of a few thousand points, because standard optimization techniques require that the data be held in memory.
Solution I: SMO (Sequential Minimal Optimization)
- Solve the sub-optimization problem defined by a working set (size = 2)
- Increase the objective function iteratively
Solution II: RSVM (Reduced Support Vector Machine)
9
Reduced Support Vector Machine
(i) Choose a random subset matrix Ā ∈ R^(m̄×n) of the entire data matrix A ∈ R^(m×n), with m̄ ≪ m
(ii) Solve the following problem by Newton's method:
  min over (ᾱ, γ):  (ν/2)‖p(e - D(K(A, Āᵀ)D̄ᾱ - eγ), β)‖² + (1/2)(ᾱᵀᾱ + γ²)
(iii) The nonlinear classifier is defined by the optimal solution (ᾱ, γ) of step (ii):
  Nonlinear classifier: sign(K(xᵀ, Āᵀ)D̄ᾱ - γ)
Note: using only the small square kernel K(Ā, Āᵀ) gives lousy results!
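A compact sketch of the three RSVM steps (Python/SciPy; the Gaussian kernel and BFGS are stand-ins for the deck's kernel choice and Newton's method, and all parameter values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, mu=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-mu * sq)

def rsvm_fit(A, y, m_bar=50, nu=1.0, beta=5.0, seed=0):
    # (i) random subset A_bar of the full data matrix A
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(A), size=m_bar, replace=False)
    A_bar, y_bar = A[idx], y[idx]
    K = gaussian_kernel(A, A_bar)          # rectangular m x m_bar kernel

    def objective(v):
        alpha, gamma = v[:-1], v[-1]
        r = 1.0 - y * (K @ (y_bar * alpha) - gamma)
        r = r + np.logaddexp(0.0, -beta * r) / beta   # smooth plus function
        return 0.5 * nu * (r @ r) + 0.5 * (alpha @ alpha + gamma**2)

    # (ii) solve the smooth unconstrained problem (BFGS in place of Newton)
    v = minimize(objective, np.zeros(m_bar + 1), method="BFGS").x
    alpha, gamma = v[:-1], v[-1]

    # (iii) nonlinear classifier from the optimal (alpha, gamma)
    return lambda X: np.sign(gaussian_kernel(X, A_bar) @ (y_bar * alpha) - gamma)
```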
10
A Nonlinear Kernel Application
Checkerboard training set: 1,000 points in R². Separate 486 asterisks from 514 dots.
11
Conventional SVM Result on Checkerboard Using 50 Randomly Selected Points Out of 1000
12
RSVM Result on Checkerboard Using SAME 50 Random Points Out of 1000
13
RSVM on Moderate Sized Problems (Best Test Set Correctness %, CPU seconds)

Dataset (m x n, m̄)              RSVM K(A, Āᵀ)    Full SVM K(A, Aᵀ)     SVM on K(Ā, Āᵀ)
Cleveland Heart 297 x 13, 30     86.47    3.04    85.92     32.42       76.88    1.58
BUPA Liver      345 x 6,  35     74.86    2.68    73.62     32.61       68.95    2.04
Ionosphere      351 x 34, 35     95.19    5.02    94.35     59.88       88.70    2.13
Pima Indians    768 x 8,  50     78.64    5.72    76.59    328.3        57.32    4.64
Tic-Tac-Toe     958 x 9,  96     98.75   14.56    98.43   1033.5        88.24    8.87
Mushroom       8124 x 22, 215    89.04  466.20    N/A (out of memory)   83.90  221.50

(Each method column reports test set correctness % and CPU seconds: RSVM with the rectangular reduced kernel, the conventional SVM with the full kernel, and an SVM using only the small square kernel K(Ā, Āᵀ).)
14
RSVM on Large UCI Adult Dataset
Average correctness % and standard deviation over 50 runs:

(Train, Test)      RSVM K(A, Āᵀ)    SVM on K(Ā, Āᵀ)    m̄     m̄/m
(6414, 26148)      84.47 ± 0.001    77.03 ± 0.014      210    3.2%
(11221, 21341)     84.71 ± 0.001    75.96 ± 0.016      225    2.0%
(16101, 16461)     84.90 ± 0.001    75.45 ± 0.017      242    1.5%
(22697, 9865)      85.31 ± 0.001    76.73 ± 0.018      284    1.2%
(32562, 16282)     85.07 ± 0.001    76.95 ± 0.013      326    1.0%
15
Reduced Set: Plays the Most Important Role in RSVM
It is natural to raise two questions:
- Is there a way to choose the reduced set other than random selection, so that RSVM will have better performance?
- Is there a mechanism to determine the size of the reduced set automatically or dynamically?
16
Reduced Set Selection According to the Data Scatter in Input Space
Choose the reduced set randomly, but keep only the points that are more than a certain minimal distance apart. These points are expected to be a representative sample; a sketch of this filter follows below.
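A minimal sketch of this distance-based filter (NumPy; min_dist and the greedy random-order scan are illustrative choices):

```python
import numpy as np

def scattered_subset(A, m_bar, min_dist, seed=0):
    # Scan the data in random order; keep a point only if it lies at
    # least min_dist away from every point kept so far.
    rng = np.random.default_rng(seed)
    kept = []
    for i in rng.permutation(len(A)):
        if all(np.linalg.norm(A[i] - A[j]) >= min_dist for j in kept):
            kept.append(i)
        if len(kept) == m_bar:
            break
    return A[kept]
```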
17
Data Scatter in Input Space is NOT Good Enough
An example is given as follows: training data analogous to the XOR problem (figure: 12 labeled points in input space).
18
Mapping to Feature Space
Map the input data via the nonlinear mapping φ(x) = (x₁², x₂², √2 x₁x₂), which is equivalent to the polynomial kernel of degree 2: K(x, z) = (xᵀz)².
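Applying this map to four XOR-style corner points shows why the feature space helps (NumPy; the points and labels are the usual XOR toy example, not the deck's 12-point set):

```python
import numpy as np

def phi(x):
    # The degree-2 feature map from the earlier slide
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])   # opposite corners share a label (XOR)

print(np.array([phi(x) for x in X]))
# The third feature, sqrt(2)*x1*x2, is +sqrt(2) for class +1 and
# -sqrt(2) for class -1, so a single hyperplane separates the classes.
```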
19
Data Points in the Feature Space
(Figure: the 12 points after mapping; points 1 & 4, 2 & 5, and 3 & 6 coincide in feature space.)
20
The Polynomial Kernel Matrix
21
Experiment Result
(Figure: the resulting separating surface on the 12-point XOR-like training set.)
22
Express the Classifier as a Linear Combination of Kernel Functions
In SSVM, the nonlinear separating surface is K(xᵀ, Aᵀ)Dα = γ: a linear combination of m kernel functions, one per training point.
In RSVM, the nonlinear separating surface is K(xᵀ, Āᵀ)D̄ᾱ = γ: a linear combination of only m̄ kernel functions, one per reduced-set point.
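In code, evaluating such a classifier is just a weighted sum of kernel evaluations (a hypothetical helper; kernel, A_bar, alpha_bar, and gamma stand for whatever kernel, reduced set, and trained coefficients are in use):

```python
import numpy as np

def decision(x, A_bar, alpha_bar, gamma, kernel):
    # f(x) = sum_i alpha_bar[i] * K(x, A_bar[i]) - gamma:
    # a linear combination of m_bar kernel functions
    return sum(a * kernel(x, ai) for a, ai in zip(alpha_bar, A_bar)) - gamma

# e.g. with a Gaussian kernel:
kernel = lambda x, z: np.exp(-0.5 * np.sum((x - z)**2))
```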
23
Motivation of IRSVM: The Strength of Weak Ties
(Mark S. Granovetter, "The Strength of Weak Ties," The American Journal of Sociology, Vol. 78, No. 6, pp. 1360-1380, May 1973.)
If the kernel functions are very similar, the space spanned by these kernel functions will be very limited.
24
Incremental Reduced SVMs
Start with a very small reduced set, then add a new data point only when its kernel vector is dissimilar to the kernel functions of the current reduced set; such a point contributes the most extra information for generating the separating surface. Repeat until several successive points cannot be added.
25
How to Measure the Dissimilarity?
Add a point into the reduced set if the distance from its kernel vector to the column space of the current reduced kernel matrix is greater than a threshold.
26
Solving Least Squares Problems
This distance can be determined by solving a least squares problem: minimize ‖K(A, Āᵀ)β - k‖₂ over β, where k is the kernel vector of the candidate point. The LSP has a unique solution if and only if K(A, Āᵀ) has full column rank, and the distance is the norm of the optimal residual.
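This is a one-liner with a standard least squares solver (NumPy sketch; K_bar is the current reduced kernel matrix and k the candidate's kernel vector):

```python
import numpy as np

def distance_to_column_space(K_bar, k):
    # Solve min_beta ||K_bar @ beta - k||_2; the residual norm is the
    # distance from the kernel vector k to the column space of K_bar.
    beta, *_ = np.linalg.lstsq(K_bar, k, rcond=None)
    return np.linalg.norm(K_bar @ beta - k)
```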
27
IRSVM Algorithm pseudo-code (sequential version)
1 Randomly choose two data points from the training data as the initial reduced set
2 Compute the reduced kernel matrix
3 For each data point not in the reduced set:
4   Compute its kernel vector
5   Compute the distance from the kernel vector
6   to the column space of the current reduced kernel matrix
7   If its distance exceeds a certain threshold:
8     Add this point into the reduced set and form the new reduced kernel matrix
9 Until several successive failures happen in line 7
10 Solve the QP problem of nonlinear SVMs with the obtained reduced kernel
11 Classify a new data point by the separating surface
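A runnable sketch of lines 1-9 of this selection loop (NumPy; the Gaussian kernel, threshold, and failure limit are illustrative, and the final QP solve of line 10 is omitted):

```python
import numpy as np

def irsvm_select(A, threshold, max_failures=5, mu=0.5, seed=0):
    kernel = lambda z: np.exp(-mu * np.sum((A - z)**2, axis=1))
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(A))
    reduced = list(order[:2])                                  # line 1
    K_bar = np.column_stack([kernel(A[i]) for i in reduced])   # line 2
    failures = 0
    for i in order[2:]:                            # line 3
        k = kernel(A[i])                           # line 4: kernel vector
        beta, *_ = np.linalg.lstsq(K_bar, k, rcond=None)       # lines 5-6
        if np.linalg.norm(K_bar @ beta - k) > threshold:       # line 7
            reduced.append(i)                      # line 8
            K_bar = np.column_stack([K_bar, k])
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:           # line 9
                break
    return reduced, K_bar
```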
28
Speed up IRSVM
We have to solve the LSP many times, and solving it from scratch costs O(m m̄²). The main cost depends on the current reduced kernel matrix but not on the candidate kernel vector. Taking advantage of this, we examine a batch of data points at the same time.
29
IRSVM Algorithm pseudo-code (batch version)
1 Randomly choose two data points from the training data as the initial reduced set
2 Compute the reduced kernel matrix
3 For a batch of data points not in the reduced set:
4   Compute their kernel vectors
5   Compute the corresponding distances from these kernel vectors
6   to the column space of the current reduced kernel matrix
7   For those points whose distance exceeds a certain threshold:
8     Add those points into the reduced set and form the new reduced kernel matrix
9 Until no data points in a batch were added in lines 7-8
10 Solve the QP problem of nonlinear SVMs with the obtained reduced kernel
11 Classify a new data point by the separating surface
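The batch version amounts to solving one least squares problem with many right-hand sides (NumPy sketch; K_batch stacks the batch's kernel vectors as columns, so the shared factorization cost is paid once per batch):

```python
import numpy as np

def batch_distances(K_bar, K_batch):
    # One lstsq call projects every candidate kernel vector (one per
    # column of K_batch) onto the column space of K_bar at once.
    Beta, *_ = np.linalg.lstsq(K_bar, K_batch, rcond=None)
    return np.linalg.norm(K_bar @ Beta - K_batch, axis=0)
```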
30
IRSVM on Four Public Datasets
31
IRSVM on UCI Adult Datasets
32
Time Comparison on Adult Datasets
33
IRSVM: Average over 10 Runs on the 6,414-Point Adult Training Set