1
“Study on Parallel SVM Based on MapReduce”
Kuei-Ti Lu
03/12/2015
2
Support Vector Machine (SVM)
Used for
– Classification
– Regression
Applied in
– Network intrusion detection
– Image processing
– Text classification
– …
3
libSVM
Library for support vector machines
Integrates different types of SVMs (a minimal usage sketch follows)
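Illustration only, not from the slides: scikit-learn's SVC is built on the libSVM solver, so a minimal C-SVC train/predict round trip can be sketched as below. The toy data and hyperparameters are invented for the example.

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC wraps the libSVM solver

# Toy binary classification data (invented for illustration)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='rbf', C=1.0)   # C-SVC with an RBF kernel
clf.fit(X, y)
print(clf.support_vectors_)      # training points closest to the boundary
print(clf.predict([[0.1, 0.0], [0.95, 0.9]]))
```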
4
Types of SVMs Supported by libSVM
For support vector classification
– C-SVC
– Nu-SVC
For support vector regression
– Epsilon-SVR
– Nu-SVR
For distribution estimation
– One-class SVM
(see the mapping sketch after this list)
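For orientation (not on the original slide): each of these five types corresponds to a libSVM `-s` option code and to a libSVM-backed scikit-learn estimator. The hyperparameter values below are placeholders.

```python
from sklearn.svm import SVC, NuSVC, SVR, NuSVR, OneClassSVM

# libSVM type -> -s code -> scikit-learn wrapper (all libSVM-backed)
c_svc     = SVC(C=1.0)            # 0: C-SVC
nu_svc    = NuSVC(nu=0.5)         # 1: Nu-SVC
one_class = OneClassSVM(nu=0.5)   # 2: one-class SVM (distribution estimation)
eps_svr   = SVR(epsilon=0.1)      # 3: epsilon-SVR
nu_svr    = NuSVR(nu=0.5)         # 4: Nu-SVR
```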
5
C-SVC
Goal: find the separating hyperplane that maximizes the margin
Support vectors: the data points closest to the separating hyperplane
6
C-SVC
Primal form
Dual form (derived using Lagrange multipliers)
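The equations on this slide did not survive extraction; for reference, the standard C-SVC primal and dual forms, as used in libSVM, are:

```latex
% C-SVC primal form
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\,\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0

% C-SVC dual form (via Lagrange multipliers \alpha_i)
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n}\alpha_i y_i = 0
```

Here K(x_i, x_j) = φ(x_i)ᵀφ(x_j) is the kernel; the support vectors are exactly the training points with α_i > 0.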
7
Speedup
Computation and storage requirements increase rapidly as the number of training vectors (also called training samples or training points) grows
Efficient algorithms and implementations are needed to apply SVMs to large-scale data mining => Parallel SVM
8
Parallel SVM Methods
Message Passing Interface (MPI)
– Efficient for computation-intensive problems, e.g. simulation
MapReduce
– Can be used for data-intensive problems
…
9
Other Speedup Techniques
Chunking: optimize subsets of the training data iteratively until the global optimum is reached
– Ex. Sequential Minimal Optimization (SMO): uses a chunk size of 2 vectors and eliminates non-support vectors early (see the sketch below)
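A hedged sketch of SMO's core step, not code from the paper: each iteration analytically optimizes one pair of Lagrange multipliers while all others stay fixed. The function name and the caller's pair-selection heuristic are assumptions for illustration.

```python
import numpy as np

def smo_pair_step(i, j, alpha, b, K, y, C, tol=1e-8):
    """One SMO step: analytically optimize the pair (alpha_i, alpha_j),
    holding all other multipliers fixed. K is the precomputed kernel
    matrix, y the +/-1 labels. Returns updated (alpha, b)."""
    if i == j:
        return alpha, b
    # Prediction errors E_k = f(x_k) - y_k under the current model
    f = K @ (alpha * y) + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]
    # Feasible box [L, H] for alpha_j implied by 0 <= alpha <= C
    # and the equality constraint sum(alpha * y) = 0
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2.0 * K[i, j] - K[i, i] - K[j, j]  # curvature along the pair
    if L >= H or eta >= -tol:
        return alpha, b                       # no progress possible
    a_i_old, a_j_old = alpha[i], alpha[j]
    alpha = alpha.copy()
    alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
    # Keep sum(alpha * y) constant
    alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])
    # Recompute the bias from whichever multiplier lies strictly in (0, C)
    b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] \
             - y[j] * (alpha[j] - a_j_old) * K[i, j]
    b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] \
             - y[j] * (alpha[j] - a_j_old) * K[j, j]
    if 0 < alpha[i] < C:
        b = b1
    elif 0 < alpha[j] < C:
        b = b2
    else:
        b = 0.5 * (b1 + b2)
    return alpha, b
```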
10
This Paper’s Approach
1. Partition and distribute the data to the nodes
2. Map class: train each subSVM to find the support vectors of its subset of the data
3. Reduce class: combine the support vectors of every 2 subSVMs
4. If more than 1 subSVM remains, go to step 2
(a single-process sketch of this cascade follows)
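A hedged, single-process simulation of the cascade; the paper's actual implementation runs on Twister, and the kernel, C value, and node count below are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def map_train(X, y):
    """Map: train a subSVM on one partition and keep only its support
    vectors. Assumes every partition contains both classes."""
    clf = SVC(kernel='rbf', C=1.0).fit(X, y)
    return X[clf.support_], y[clf.support_]

def reduce_merge(part_a, part_b):
    """Reduce: pool the support vectors of two subSVMs into one partition."""
    (Xa, ya), (Xb, yb) = part_a, part_b
    return np.vstack([Xa, Xb]), np.concatenate([ya, yb])

def cascade_svm(X, y, n_nodes=8):
    """Iterate map and reduce until a single SVM remains (the paper's loop)."""
    parts = list(zip(np.array_split(X, n_nodes), np.array_split(y, n_nodes)))
    while len(parts) > 1:
        parts = [map_train(Xp, yp) for Xp, yp in parts]       # map phase
        parts = [reduce_merge(parts[k], parts[k + 1])         # pairwise reduce
                 for k in range(0, len(parts) - 1, 2)] + \
                ([parts[-1]] if len(parts) % 2 else [])
    Xf, yf = parts[0]
    return SVC(kernel='rbf', C=1.0).fit(Xf, yf)               # final SVM
```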
11
Twister
Supports iterative MapReduce
More efficient than Hadoop or Dryad/DryadLINQ for iterative MapReduce applications
12
Computation Complexity
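The complexity formulas on this slide did not survive extraction. As a hedged reconstruction consistent with the α and β referenced on later slides, a common cost model treats sequential SVM training on n samples as roughly

```latex
T_{\text{train}}(n) \approx \alpha\, n^{\beta}, \qquad 2 \le \beta \le 3
```

so splitting the data across p nodes cuts the first-layer per-node cost to roughly α(n/p)^β, while the last layer runs on a single node and its time depends on α, β, and the number of surviving support vectors rather than on p.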
13
Evaluations
Number of nodes
Training time
Accuracy = (# correctly predicted test samples / # total test samples) × 100%
14
Adult Data Analysis
Binary classification
Correlation between attribute variable X and class variable Y is used to select attributes (see the sketch below)
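A hedged sketch of correlation-based attribute selection; the paper's exact ranking rule and cutoff are not given on the slides, and `k` below is a placeholder.

```python
import numpy as np

def select_by_correlation(X, y, k):
    """Rank attributes by |Pearson correlation| with the class label
    and keep the top k columns of X."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    top = np.argsort(-np.abs(corr))[:k]   # indices of the k best attributes
    return X[:, top], top
```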
15
Adult Data Analysis
Computation cost is concentrated in training
Data-transfer time cost is minor
Last-layer computation time depends on α and β rather than on the number of nodes (the last layer uses 1 node only)
Feature selection reduces computation greatly but reduces accuracy very little
16
Forest Cover Type Classification
Multiclass classification
– Use k(k − 1)/2 binary SVMs as a k-class SVM
– 1 binary SVM for each pair of classes
– Use maximum voting to determine the class (see the sketch after this list)
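A hedged sketch of one-vs-one training with maximum voting; scikit-learn's SVC implements this scheme internally, so the explicit version below is purely for illustration, with placeholder kernel settings.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def train_ovo(X, y):
    """Train k(k-1)/2 binary SVMs, one per pair of classes."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel='rbf').fit(X[mask], y[mask])
    return models

def predict_ovo(models, X, classes):
    """Each binary SVM votes for one class; the class with the most
    votes wins (maximum voting)."""
    votes = np.zeros((len(X), len(classes)), dtype=int)
    index = {c: i for i, c in enumerate(classes)}
    for (a, b), m in models.items():
        for row, pred in enumerate(m.predict(X)):
            votes[row, index[pred]] += 1
    return np.asarray(classes)[votes.argmax(axis=1)]
```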
17
Forest Cover Type Classification
Correlation between attribute variable X and class variable Y is used to select attributes
Attribute variables are normalized to [0, 1] (a normalization sketch follows)
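Scaling attributes to [0, 1] is typically min-max normalization; a minimal sketch. Reusing the training set's min/max for the test set is an assumption, not stated on the slides.

```python
import numpy as np

def minmax_normalize(X_train, X_test):
    """Scale each attribute to [0, 1] using the training data's min/max,
    applying the same transform to the test data."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant columns
    return (X_train - lo) / span, (X_test - lo) / span
```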
18
Forest Cover Type Classification
Last-layer computation time depends on α and β rather than on the number of nodes (the last layer uses 1 node only)
Feature selection reduces computation greatly but reduces accuracy very little
19
Heart Disease Classification
Binary classification
The data are replicated different numbers of times to compare results for different sample sizes
20
Heart Disease Classification
When the sample size is too large, the data cannot be processed on 1 node because of the memory constraint
Training time decreases only slightly once the number of nodes exceeds 8
21
Conclusion
The classical SVM is impractical for large-scale data; a parallel SVM is needed
This paper proposes a model based on iterative MapReduce
The results show the model is efficient for data-intensive problems
22
References
[1] Z. Sun and G. Fox, “Study on Parallel SVM Based on MapReduce,” in Proc. PDPTA, Las Vegas, NV, 2012, pp.
[2] C. Lin et al., “Anomaly Detection Using LibSVM Training Tools,” in Proc. ISA, Busan, Korea, 2008, pp. 166-171.
23
Q & A