1
“Study on Parallel SVM Based on MapReduce”
Kuei-Ti Lu
03/12/2015
2
Support Vector Machine (SVM)
Used for
– Classification
– Regression
Applied in
– Network intrusion detection
– Image processing
– Text classification
– …
3
libSVM
Library for support vector machines
Integrates different types of SVMs (a minimal usage sketch follows)
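Illustration only, not from the slides: scikit-learn's SVC is built on the libSVM solver, so a minimal C-SVC train/predict round trip can be sketched as below. The toy data and hyperparameters are invented for the example.

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC wraps the libSVM solver

# Toy binary classification data (invented for illustration)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='rbf', C=1.0)   # C-SVC with an RBF kernel
clf.fit(X, y)
print(clf.support_vectors_)      # training points closest to the boundary
print(clf.predict([[0.1, 0.0], [0.95, 0.9]]))
```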
4
Types of SVMs Supported by libSVM
For support vector classification
– C-SVC
– Nu-SVC
For support vector regression
– Epsilon-SVR
– Nu-SVR
For distribution estimation
– One-class SVM
(see the mapping sketch after this list)
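For orientation (not on the original slide): each of these five types corresponds to a libSVM `-s` option code and to a libSVM-backed scikit-learn estimator. The hyperparameter values below are placeholders.

```python
from sklearn.svm import SVC, NuSVC, SVR, NuSVR, OneClassSVM

# libSVM type -> -s code -> scikit-learn wrapper (all libSVM-backed)
c_svc     = SVC(C=1.0)            # 0: C-SVC
nu_svc    = NuSVC(nu=0.5)         # 1: Nu-SVC
one_class = OneClassSVM(nu=0.5)   # 2: one-class SVM (distribution estimation)
eps_svr   = SVR(epsilon=0.1)      # 3: epsilon-SVR
nu_svr    = NuSVR(nu=0.5)         # 4: Nu-SVR
```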
5
C-SVC
Goal: find the separating hyperplane that maximizes the margin
Support vectors: the data points closest to the separating hyperplane
6
C-SVC
Primal form
Dual form (derived using Lagrange multipliers)
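The equations on this slide did not survive extraction; for reference, the standard C-SVC primal and dual forms, as used in libSVM, are:

```latex
% C-SVC primal form
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\,\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0

% C-SVC dual form (via Lagrange multipliers \alpha_i)
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n}\alpha_i y_i = 0
```

Here K(x_i, x_j) = φ(x_i)ᵀφ(x_j) is the kernel; the support vectors are exactly the training points with α_i > 0.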
7
Speedup
Computation and storage requirements increase rapidly as the number of training vectors (also called training samples or training points) grows
Efficient algorithms and implementations are needed to apply SVMs to large-scale data mining => Parallel SVM
8
Parallel SVM Methods
Message Passing Interface (MPI)
– Efficient for computation-intensive problems, e.g. simulation
MapReduce
– Can be used for data-intensive problems
…
9
Other Speedup Techniques
Chunking: optimize subsets of the training data iteratively until the global optimum is reached
– Ex. Sequential Minimal Optimization (SMO): uses a chunk size of 2 vectors and eliminates non-support vectors early (see the sketch below)
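A hedged sketch of SMO's core step, not code from the paper: each iteration analytically optimizes one pair of Lagrange multipliers while all others stay fixed. The function name and the caller's pair-selection heuristic are assumptions for illustration.

```python
import numpy as np

def smo_pair_step(i, j, alpha, b, K, y, C, tol=1e-8):
    """One SMO step: analytically optimize the pair (alpha_i, alpha_j),
    holding all other multipliers fixed. K is the precomputed kernel
    matrix, y the +/-1 labels. Returns updated (alpha, b)."""
    if i == j:
        return alpha, b
    # Prediction errors E_k = f(x_k) - y_k under the current model
    f = K @ (alpha * y) + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]
    # Feasible box [L, H] for alpha_j implied by 0 <= alpha <= C
    # and the equality constraint sum(alpha * y) = 0
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2.0 * K[i, j] - K[i, i] - K[j, j]  # curvature along the pair
    if L >= H or eta >= -tol:
        return alpha, b                       # no progress possible
    a_i_old, a_j_old = alpha[i], alpha[j]
    alpha = alpha.copy()
    alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
    # Keep sum(alpha * y) constant
    alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])
    # Recompute the bias from whichever multiplier lies strictly in (0, C)
    b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] \
             - y[j] * (alpha[j] - a_j_old) * K[i, j]
    b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] \
             - y[j] * (alpha[j] - a_j_old) * K[j, j]
    if 0 < alpha[i] < C:
        b = b1
    elif 0 < alpha[j] < C:
        b = b2
    else:
        b = 0.5 * (b1 + b2)
    return alpha, b
```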
10
This Paper’s Approach
1. Partition and distribute the data to the nodes
2. Map class: train each subSVM to find the support vectors of its subset of the data
3. Reduce class: combine the support vectors of every 2 subSVMs
4. If more than 1 subSVM remains, go to step 2
(a single-process sketch of this cascade follows)
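A hedged, single-process simulation of the cascade; the paper's actual implementation runs on Twister, and the kernel, C value, and node count below are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def map_train(X, y):
    """Map: train a subSVM on one partition and keep only its support
    vectors. Assumes every partition contains both classes."""
    clf = SVC(kernel='rbf', C=1.0).fit(X, y)
    return X[clf.support_], y[clf.support_]

def reduce_merge(part_a, part_b):
    """Reduce: pool the support vectors of two subSVMs into one partition."""
    (Xa, ya), (Xb, yb) = part_a, part_b
    return np.vstack([Xa, Xb]), np.concatenate([ya, yb])

def cascade_svm(X, y, n_nodes=8):
    """Iterate map and reduce until a single SVM remains (the paper's loop)."""
    parts = list(zip(np.array_split(X, n_nodes), np.array_split(y, n_nodes)))
    while len(parts) > 1:
        parts = [map_train(Xp, yp) for Xp, yp in parts]       # map phase
        parts = [reduce_merge(parts[k], parts[k + 1])         # pairwise reduce
                 for k in range(0, len(parts) - 1, 2)] + \
                ([parts[-1]] if len(parts) % 2 else [])
    Xf, yf = parts[0]
    return SVC(kernel='rbf', C=1.0).fit(Xf, yf)               # final SVM
```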
11
Twister
Supports iterative MapReduce
More efficient than Hadoop or Dryad/DryadLINQ for iterative MapReduce applications
12
Computation Complexity
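The complexity formulas on this slide did not survive extraction. As a hedged reconstruction consistent with the α and β referenced on later slides, a common cost model treats sequential SVM training on n samples as roughly

```latex
T_{\text{train}}(n) \approx \alpha\, n^{\beta}, \qquad 2 \le \beta \le 3
```

so splitting the data across p nodes cuts the first-layer per-node cost to roughly α(n/p)^β, while the last layer runs on a single node and its time depends on α, β, and the number of surviving support vectors rather than on p.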
13
Evaluations
Number of nodes
Training time
Accuracy = (# correctly predicted test samples / # total test samples) × 100%
14
Adult Data Analysis
Binary classification
Correlation between attribute variable X and class variable Y is used to select attributes (see the sketch below)
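A hedged sketch of correlation-based attribute selection; the paper's exact ranking rule and cutoff are not given on the slides, and `k` below is a placeholder.

```python
import numpy as np

def select_by_correlation(X, y, k):
    """Rank attributes by |Pearson correlation| with the class label
    and keep the top k columns of X."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    top = np.argsort(-np.abs(corr))[:k]   # indices of the k best attributes
    return X[:, top], top
```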
15
Adult Data Analysis
Computation cost is concentrated in training
Data-transfer time cost is minor
Last-layer computation time depends on α and β rather than on the number of nodes (the last layer uses 1 node only)
Feature selection reduces computation greatly but reduces accuracy very little
16
Forest Cover Type Classification
Multiclass classification
– Use k(k − 1)/2 binary SVMs as a k-class SVM
– 1 binary SVM for each pair of classes
– Use maximum voting to determine the class (see the sketch after this list)
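A hedged sketch of one-vs-one training with maximum voting; scikit-learn's SVC implements this scheme internally, so the explicit version below is purely for illustration, with placeholder kernel settings.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def train_ovo(X, y):
    """Train k(k-1)/2 binary SVMs, one per pair of classes."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel='rbf').fit(X[mask], y[mask])
    return models

def predict_ovo(models, X, classes):
    """Each binary SVM votes for one class; the class with the most
    votes wins (maximum voting)."""
    votes = np.zeros((len(X), len(classes)), dtype=int)
    index = {c: i for i, c in enumerate(classes)}
    for (a, b), m in models.items():
        for row, pred in enumerate(m.predict(X)):
            votes[row, index[pred]] += 1
    return np.asarray(classes)[votes.argmax(axis=1)]
```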
17
Forest Cover Type Classification
Correlation between attribute variable X and class variable Y is used to select attributes
Attribute variables are normalized to [0, 1] (a normalization sketch follows)
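Scaling attributes to [0, 1] is typically min-max normalization; a minimal sketch. Reusing the training set's min/max for the test set is an assumption, not stated on the slides.

```python
import numpy as np

def minmax_normalize(X_train, X_test):
    """Scale each attribute to [0, 1] using the training data's min/max,
    applying the same transform to the test data."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant columns
    return (X_train - lo) / span, (X_test - lo) / span
```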
18
Forest Cover Type Classification
Last-layer computation time depends on α and β rather than on the number of nodes (the last layer uses 1 node only)
Feature selection reduces computation greatly but reduces accuracy very little
19
Heart Disease Classification
Binary classification
The data are replicated different numbers of times to compare results for different sample sizes
20
Heart Disease Classification
When the sample size is too large, the data cannot be processed on 1 node because of the memory constraint
Training time decreases only slightly once the number of nodes exceeds 8
21
Conclusion
The classical SVM is impractical for large-scale data; a parallel SVM is needed
This paper proposes a model based on iterative MapReduce
The results show the model is efficient for data-intensive problems
22
References
[1] Z. Sun and G. Fox, “Study on Parallel SVM Based on MapReduce,” in Proc. PDPTA, Las Vegas, NV, 2012, pp.
[2] C. Lin et al., “Anomaly Detection Using LibSVM Training Tools,” in Proc. ISA, Busan, Korea, 2008, pp. 166-171.
23
Q & A