1
Large Scale Machine Learning based on MapReduce & GPU
Lanbo Zhang
2
Motivation
Massive data challenges
– More and more data to process (Google: 20,000 terabytes per day)
– Data arrives faster and faster
Solutions
– Invent faster ML algorithms: online algorithms, e.g., stochastic gradient descent vs. batch gradient descent
– Parallelize learning processes: MapReduce, GPU, etc.
3
MapReduce
A programming model invented by Google
– Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), San Francisco, CA, December 2004.
The objective
– To support distributed computing on large data sets across clusters of computers
Features
– Automatic parallelization and distribution
– Fault tolerance
– I/O scheduling
– Status and monitoring
4
User Interface
Users need to implement two functions:
– map (in_key, in_value) -> list(out_key, intermediate_value)
– reduce (out_key, list(intermediate_value)) -> list(out_value)
Example: count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
5
MapReduce Usage Statistics at Google
6
MapReduce for Machine Learning
C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. "Map-Reduce for Machine Learning on Multicore." NIPS 2006.
Algorithms whose computations can be expressed in summation form can be parallelized in the MapReduce framework (a worked example follows this list):
– Locally weighted linear regression (LWLR)
– Logistic regression (LR): Newton-Raphson method
– Naive Bayes (NB)
7
– PCA
– Linear SVM
– EM: mixture of Gaussians (M-step)
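For example, following the locally weighted linear regression case in Chu et al., the fit θ solves Aθ = b, where both A and b are plain sums over the training points. Each mapper computes the partial sums over its data subset S_j, and the reducer adds the partials and solves the small linear system:

A = \sum_i w_i x_i x_i^{\top} = \sum_j \sum_{i \in S_j} w_i x_i x_i^{\top}, \qquad
b = \sum_i w_i x_i y_i = \sum_j \sum_{i \in S_j} w_i x_i y_i, \qquad
\theta = A^{-1} b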
8
Time Complexity
P: number of cores
P': speedup of matrix inversion and eigen-decomposition on multicore
9
Experimental Results: speedup from 1 to 16 cores over all datasets
10
Apache Hadoop
http://hadoop.apache.org/
An open-source implementation of MapReduce
An excellent tutorial
– http://hadoop.apache.org/common/docs/current/mapred_tutorial.html (with the famous WordCount example)
– Very helpful if you need to quickly develop a simple Hadoop program
A comprehensive book
– Tom White. Hadoop: The Definitive Guide. O'Reilly Media, May 2009. http://oreilly.com/catalog/9780596521981
– Topics: the Hadoop Distributed File System, Hadoop I/O, how to set up a Hadoop cluster, how to develop a Hadoop application, administration, etc.
– Helpful if you want to become a Hadoop expert
11
Key User Interfaces of Hadoop
Class Mapper
– Implement the map function to define your map routine
Class Reducer
– Implement the reduce function to define your reduce routine
Class JobConf
– The primary interface for configuring job parameters, including but not limited to:
  Input and output paths (on the Hadoop Distributed File System)
  Number of mappers and reducers
  Job name
  …
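To make these interfaces concrete, here is a minimal WordCount sketch against the classic org.apache.hadoop.mapred API (the one that uses JobConf), following the pattern of the Hadoop tutorial cited earlier; the input/output paths are illustrative and other job details are left at their defaults:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Mapper: emit (word, 1) for every word in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);  // JobConf: configure the job
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS input path
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // HDFS output path
    JobClient.runJob(conf);
  }
}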
12
Apache Mahout
http://lucene.apache.org/mahout/
A library of parallelized machine learning algorithms implemented on top of Hadoop
Applications
– Clustering
– Classification
– Batch-based collaborative filtering
– Frequent itemset mining
– …
13
Mahout in Progress
Algorithms already implemented
– K-Means, Fuzzy K-Means, Naive Bayes, Canopy clustering, Mean Shift, Dirichlet process clustering, Latent Dirichlet Allocation, Random Forests, etc.
Algorithms to be implemented
– Stochastic gradient descent, SVM, neural networks, PCA, ICA, GDA, EM, etc.
14
GPU for Large-Scale ML
A Graphics Processing Unit (GPU) is a specialized processor that offloads 3D and 2D graphics rendering from the CPU.
GPUs' highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms.
15
NVIDIA GeForce 8800 GTX Specification
Number of Streaming Multiprocessors: 16
Multiprocessor Width: 8
Local Store Size: 16 KB
Total Number of Stream Processors: 128
Peak SP Floating-Point Rate: 346 GFLOPS
Clock: 1.35 GHz
Device Memory: 768 MB
Peak Memory Bandwidth: 86.4 GB/s
Connection to Host CPU: PCI Express
CPU -> GPU Bandwidth: 2.2 GB/s*
GPU -> CPU Bandwidth: 1.7 GB/s*
16
Logical Organization to Programmers
– Each block can have up to 512 threads, which can synchronize with one another
– Millions of blocks can be issued
17
Programming Environment: CUDA
Compute Unified Device Architecture (CUDA)
– A parallel computing architecture developed by NVIDIA
– The computing engine in GPUs is made accessible to software developers through industry-standard programming languages (C with CUDA extensions)
18
SVM on GPUs
Catanzaro, B., Sundaram, N., and Keutzer, K. "Fast Support Vector Machine Training and Classification on Graphics Processors." Proceedings of the 25th International Conference on Machine Learning (ICML '08), Helsinki, Finland, July 5–9, 2008.
19
SVM Training
– The quadratic program (dual formulation)
– The SMO algorithm
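For reference, the standard dual quadratic program solved during SVM training is:

\max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0

SMO solves this by repeatedly selecting a pair of coefficients and optimizing them analytically while keeping the constraints satisfied.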
20
SVM Training on GPU Each thread computes the following variable for each point:
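In the standard SMO formulation that this paper parallelizes, a natural reading is that each thread maintains the optimality indicator f_i for its point and updates it after every working-pair step, roughly as:

f_i \leftarrow f_i + (\alpha_{high}^{new} - \alpha_{high}) \, y_{high} \, K(x_{high}, x_i) + (\alpha_{low}^{new} - \alpha_{low}) \, y_{low} \, K(x_{low}, x_i)

This per-point update dominates each SMO iteration, which is why it maps so well onto GPU threads.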
21
Result: SVM training on GPU (Speedup over LibSVM)
22
SVM Classification
– The SVM classification task involves finding which side of the hyperplane a point lies on
– Each thread evaluates the kernel function for one point
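Concretely, classifying a query point z means evaluating the standard decision function below (shown as a sketch, not the paper's exact notation); on the GPU each thread can compute one kernel term K(x_i, z), and the weighted terms are then combined by a parallel reduction:

f(z) = \sum_{i \in SV} \alpha_i y_i K(x_i, z) + b, \qquad \text{label}(z) = \operatorname{sign}(f(z))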
23
Result: SVM classification on GPU (Speedup over LibSVM)
24
GPUMiner: Parallel Data Mining on Graphics Processors
Wenbin Fang et al. "Parallel Data Mining on Graphics Processors." Technical Report HKUST-CS08-07, October 2008.
Three components
– A CPU-based storage and buffer manager to handle I/O and data transfer between the CPU and the GPU
– A GPU-CPU co-processing parallel mining module
– A GPU-based mining visualization module
Two mining algorithms implemented
– K-Means clustering
– Apriori (a frequent-pattern mining algorithm)
25
GPUMiner: System Architecture
26
The Bitmap Technique
– Use a bitmap to represent the association between data objects and clusters (for K-Means), and between items and transactions (for Apriori)
– Supports efficient row-wise and column-wise operations that exploit the thread parallelism of the GPU
– Use a summary vector that stores the number of ones per row/column to accelerate counting (see the sketch below)
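A minimal CPU-side Java sketch of the idea (illustrative only: GPUMiner implements these operations as GPU kernels, and all class and method names here are assumptions, not the paper's code):

// Bitmap with a per-row summary vector, as described above.
// Rows could be items (Apriori) or clusters (K-Means); columns are transactions/objects.
public class Bitmap {
    private final long[][] rows;   // each row packed into 64-bit words
    private final int[] summary;   // summary vector: number of ones per row

    public Bitmap(int numRows, int numCols) {
        rows = new long[numRows][(numCols + 63) / 64];
        summary = new int[numRows];
    }

    // Mark column 'col' as associated with row 'row'.
    public void set(int row, int col) {
        long mask = 1L << (col & 63);
        if ((rows[row][col >>> 6] & mask) == 0) {
            rows[row][col >>> 6] |= mask;
            summary[row]++;        // keep the summary vector in sync
        }
    }

    // Support of a single row: read directly from the summary vector.
    public int rowCount(int row) {
        return summary[row];
    }

    // Support of the intersection of two rows (e.g., a candidate 2-itemset):
    // AND the packed words and count the surviving ones.
    public int intersectionCount(int rowA, int rowB) {
        int count = 0;
        for (int w = 0; w < rows[rowA].length; w++) {
            count += Long.bitCount(rows[rowA][w] & rows[rowB][w]);
        }
        return count;
    }
}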
27
K-Means
Three functions executed on the GPU in parallel
– makeBitmap_kernel
– computeCentriod_kernel
– findCluster_kernel
28
Apriori
Goal: find the frequent itemsets among a large number of transactions
The trie-based implementation
– Uses a trie to store candidate itemsets and their supports
– Uses a bitmap to store the item-transaction matrix
– Obtains item supports by counting the 1s in the bitmap
– The 1-bit counting and intersection operations are implemented as GPU programs
29
Experiments
Settings
– GPU: NVIDIA GTX 280, 30 multiprocessors × 8 stream processors
– CPU: Intel Core 2 Quad
30
Result: K-Means
Baseline: UVirginia implementation
– 35x faster than the four-threaded CPU-based counterpart
31
Result: Apriori
Baselines
– A CPU-based Apriori implementation
– The best implementation from FIMI'03
32
Conclusion
– Both MapReduce and GPUs are feasible strategies for large-scale parallelized machine learning
– MapReduce aims at parallelization over clusters of computers
– The hardware architecture of GPUs makes them a natural choice for parallelized ML