“Study on Parallel SVM Based on MapReduce”
Kuei-Ti Lu
03/12/2015
Support Vector Machine (SVM)
Used for
– Classification
– Regression
Applied in
– Network intrusion detection
– Image processing
– Text classification
– …
libSVM
– A library for support vector machines
– Integrates different types of SVMs
Types of SVMs Supported by libSVM
For support vector classification
– C-SVC
– Nu-SVC
For support vector regression
– Epsilon-SVR
– Nu-SVR
For distribution estimation
– One-class SVM
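As an illustration, all five types are also exposed by scikit-learn, which wraps libSVM internally; a minimal sketch on toy stand-in data:

    # The five libSVM formulations as exposed by scikit-learn (wraps libSVM).
    from sklearn.svm import SVC, NuSVC, SVR, NuSVR, OneClassSVM

    X = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # toy data
    y_cls = [0, 1, 1, 0]           # class labels (classification)
    y_reg = [0.1, 0.9, 0.8, 0.2]   # real-valued targets (regression)

    SVC(C=1.0).fit(X, y_cls)         # C-SVC
    NuSVC(nu=0.5).fit(X, y_cls)      # Nu-SVC
    SVR(epsilon=0.1).fit(X, y_reg)   # Epsilon-SVR
    NuSVR(nu=0.5).fit(X, y_reg)      # Nu-SVR
    OneClassSVM(nu=0.5).fit(X)       # One-class SVM (distribution estimation)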
C-SVC
– Goal: find the separating hyperplane that maximizes the margin
– Support vectors: the data points closest to the separating hyperplane
C-SVC
– Primal form
– Dual form (derived using Lagrange multipliers)
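For reference, the standard C-SVC primal and dual forms, with feature map \phi, kernel K(x_i, x_j) = \phi(x_i)^{\top}\phi(x_j), and penalty parameter C:

    % Primal form:
    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i}
    \quad \text{s.t.}\quad y_{i}\,(w^{\top}\phi(x_{i}) + b) \ge 1 - \xi_{i},\quad \xi_{i} \ge 0
    % Dual form (Lagrange multipliers \alpha_i):
    \max_{\alpha}\ \sum_{i=1}^{n}\alpha_{i} - \tfrac{1}{2}\sum_{i,j=1}^{n}\alpha_{i}\alpha_{j}\,y_{i}y_{j}\,K(x_{i},x_{j})
    \quad \text{s.t.}\quad 0 \le \alpha_{i} \le C,\quad \sum_{i=1}^{n}\alpha_{i}y_{i} = 0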
Speedup
– Computation and storage requirements grow rapidly as the number of training vectors (also called training samples or training points) increases
– Efficient algorithms and implementations are needed to apply SVMs to large-scale data mining => Parallel SVM
Parallel SVM Methods
– Message Passing Interface (MPI): efficient for computation-intensive problems, e.g., simulation
– MapReduce: can be used for data-intensive problems
– …
Other Speedup Techniques
– Chunking: iteratively optimize subsets of the training data until the global optimum is reached (see the sketch below)
– Ex. Sequential Minimal Optimization (SMO), which uses a chunk size of 2 vectors
– Eliminate non-support vectors early
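A minimal sketch of the chunking idea, not the paper's code: scikit-learn's SVC stands in for the subproblem solver, and a simple margin-violation test stands in for a full KKT check; labels are assumed to be in {-1, +1}:

    import numpy as np
    from sklearn.svm import SVC

    def chunking_svm(X, y, chunk_size=500, C=1.0, max_iter=20):
        """Decomposition sketch: solve the SVM on a working set, then add
        margin violators from the rest until none remain (y in {-1, +1})."""
        work = np.arange(min(chunk_size, len(X)))      # initial working set
        for _ in range(max_iter):
            clf = SVC(C=C, kernel="rbf").fit(X[work], y[work])
            margins = y * clf.decision_function(X)     # functional margin y_i f(x_i)
            new = np.setdiff1d(np.where(margins < 1.0)[0], work)
            if new.size == 0:                          # no violators left: done
                break
            keep = work[clf.support_]                  # keep current support vectors
            work = np.union1d(keep, new[:chunk_size])  # add a chunk of violators
        return clf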
This Paper’s Approach
1. Partition & distribute the data to the nodes
2. Map class: train each subSVM to find the support vectors for its subset of the data
3. Reduce class: combine the support vectors of every 2 subSVMs
4. If more than 1 SVM remains, go to 2
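A minimal single-machine sketch of this cascade scheme, assuming scikit-learn's SVC as each subSVM; the map and reduce classes are simulated with plain functions rather than Twister, and each partition is assumed to contain both classes:

    import numpy as np
    from sklearn.svm import SVC

    def map_train(Xp, yp):
        """Map: train a subSVM on one partition; emit its support vectors."""
        clf = SVC(C=1.0, kernel="rbf").fit(Xp, yp)
        return Xp[clf.support_], yp[clf.support_]

    def reduce_merge(a, b):
        """Reduce: combine the support vectors of two subSVMs."""
        return np.vstack([a[0], b[0]]), np.concatenate([a[1], b[1]])

    def cascade_svm(X, y, n_parts=8):
        # 1. Partition & distribute the data (here: a simple split)
        parts = list(zip(np.array_split(X, n_parts), np.array_split(y, n_parts)))
        # 2-4. Iterate map + pairwise reduce until a single SVM remains
        while len(parts) > 1:
            svs = [map_train(Xp, yp) for Xp, yp in parts]     # map stage
            parts = [reduce_merge(svs[i], svs[i + 1])         # reduce stage
                     for i in range(0, len(svs) - 1, 2)]
            if len(svs) % 2 == 1:                             # odd partition out
                parts.append(svs[-1])
        Xf, yf = parts[0]
        return SVC(C=1.0, kernel="rbf").fit(Xf, yf)           # final SVM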
Twister
– Supports iterative MapReduce
– More efficient than Hadoop or Dryad/DryadLINQ for iterative MapReduce
Computation Complexity
Evaluations
– Number of nodes
– Training time
– Accuracy = (# correctly predicted data / # total testing data) × 100%
Adult Data Analysis
– Binary classification
– The correlation between each attribute variable X and the class variable Y is used to select attributes
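A minimal sketch of such correlation-based attribute selection, assuming Pearson correlation and a hypothetical cutoff k (the paper's exact criterion may differ):

    import numpy as np

    def select_by_correlation(X, y, k=8):
        """Keep the k attributes with the largest |corr(X_j, Y)|."""
        corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(X.shape[1])])
        top = np.argsort(corrs)[::-1][:k]   # indices of the k best attributes
        return X[:, top], top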
Adult Data Analysis
– The computation cost is concentrated in training; the data transfer time cost is minor
– The last layer's computation time depends on α and β instead of the number of nodes (1 node only)
– Feature selection greatly reduces computation but reduces accuracy very little
Forest Cover Type Classification
– Multiclass classification
– Use k(k − 1)/2 binary SVMs as a k-class SVM: 1 binary SVM for each pair of classes
– Use maximum voting to determine the class
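A minimal sketch of this pairwise scheme with maximum voting (scikit-learn's SVC implements the same one-vs-one strategy internally; the explicit loop is shown for clarity, and class labels are assumed to be small non-negative integers):

    import numpy as np
    from itertools import combinations
    from sklearn.svm import SVC

    def ovo_train(X, y):
        """Train k(k-1)/2 binary SVMs, one for each pair of classes."""
        models = {}
        for a, b in combinations(np.unique(y), 2):
            mask = (y == a) | (y == b)        # only this pair's data
            models[(a, b)] = SVC(C=1.0).fit(X[mask], y[mask])
        return models

    def ovo_predict(models, X):
        """Maximum voting: each pairwise SVM votes; the most-voted class wins."""
        preds = np.stack([m.predict(X) for m in models.values()])  # (pairs, n)
        return np.array([np.bincount(col.astype(int)).argmax()
                         for col in preds.T])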
Forest Cover Type Classification
– Correlation between each attribute variable X and the class variable Y used to select attributes
– Attribute variables are normalized to [0, 1]
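The [0, 1] normalization is the usual min-max rescaling; a minimal sketch, assuming the attributes are the columns of X:

    import numpy as np

    def normalize01(X):
        """Rescale each attribute (column) linearly to [0, 1]."""
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard constant columns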
Forest Cover Type Classification
– The last layer's computation time depends on α and β instead of the number of nodes (1 node only)
– Feature selection greatly reduces computation but reduces accuracy very little
Heart Disease Classification
– Binary classification
– The data are replicated different numbers of times to compare results for different sample sizes
Heart Disease Classification
– When the sample size is too big, the data cannot be processed with 1 node because of the memory constraint
– Training time decreases little when the number of nodes > 8
Conclusion
– The classical SVM is impractical for large-scale data; a parallel SVM is needed
– This paper proposes a model based on iterative MapReduce
– Experiments show the model is efficient for data-intensive problems
References
[1] Z. Sun and G. Fox, “Study on Parallel SVM Based on MapReduce,” in Proc. PDPTA, Las Vegas, NV, 2012.
[2] C. Lin et al., “Anomaly Detection Using LibSVM Training Tools,” in Proc. ISA, Busan, Korea, 2008.
Q & A