Large Scale Machine Learning based on MapReduce & GPU
Lanbo Zhang

Motivation
Massive data challenges
– More and more data to process (Google: 20,000 terabytes per day)
– Data arrives faster and faster
Solutions
– Invent faster ML algorithms, e.g., online algorithms such as stochastic gradient descent instead of batch gradient descent (see the update rules below)
– Parallelize the learning process: MapReduce, GPU, etc.
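As a quick illustration of why online methods help (notation mine, not from the slides): for a loss \ell(\theta; x_i, y_i) over m training examples and learning rate \alpha,

Batch gradient descent:       \theta \leftarrow \theta - \alpha \sum_{i=1}^{m} \nabla_\theta \, \ell(\theta; x_i, y_i)
Stochastic gradient descent:  \theta \leftarrow \theta - \alpha \, \nabla_\theta \, \ell(\theta; x_i, y_i)   (one example i per update)

A batch step touches all m examples, while a stochastic step touches only one, so SGD can make progress long before a full pass over a massive dataset completes.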

MapReduce
A programming model invented by Google
– Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), San Francisco, CA, December 2004
The objective
– To support distributed computing on large data sets across clusters of computers
Features
– Automatic parallelization and distribution
– Fault tolerance
– I/O scheduling
– Status reporting and monitoring

User Interface
Users need to implement two functions:
– map (in_key, in_value) -> list(out_key, intermediate_value)
– reduce (out_key, list(intermediate_value)) -> list(out_value)

Example: counting word occurrences

Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

MapReduce Usage Statistics at Google

MapReduce for Machine Learning
C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," in NIPS 2006
Algorithms whose training statistics can be expressed in summation form can be parallelized in the MapReduce framework (see the example after this list):
– Locally weighted linear regression (LWLR)
– Logistic regression (LR): Newton-Raphson method
– Naive Bayes (NB)

– PCA
– Linear SVM
– EM: mixture of Gaussians (M-step)
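As a concrete example of the summation form (following the paper's treatment of LWLR; notation mine): with per-example weights w_i, the solution \theta = A^{-1} b depends on the data only through the sums

A = \sum_i w_i \, x_i x_i^{T}, \qquad b = \sum_i w_i \, y_i x_i.

Each mapper computes the partial sums of A and b over its split of the data; the reducer adds the partial sums and solves the small n-by-n linear system. The same pattern covers the other algorithms in this list.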

Time Complexity
– P: number of cores
– P′: speedup of matrix inversion and eigen-decomposition on multicore
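To make the notation concrete (a rough back-of-the-envelope estimate, not a figure quoted from the paper): for a regression with m examples and n features, forming the summation statistics costs O(mn^2) and the final matrix inversion costs O(n^3); splitting the sums over P cores and using a parallel inversion with speedup P′ gives roughly

O\!\left(\frac{mn^2}{P} + \frac{n^3}{P'}\right).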

Experimental Results
– Speedup from 1 to 16 cores over all datasets

Apache Hadoop
An open-source implementation of MapReduce
An excellent tutorial (with the famous WordCount example)
– Very helpful if you need to quickly develop a simple Hadoop program
A comprehensive book
– Tom White, Hadoop: The Definitive Guide, O'Reilly Media
– Topics: the Hadoop Distributed File System, Hadoop I/O, how to set up a Hadoop cluster, how to develop a Hadoop application, administration, etc.
– Helpful if you want to become a Hadoop expert

Key User Interfaces of Hadoop
Class Mapper
– Implement the map function to define your map routine
Class Reducer
– Implement the reduce function to define your reduce routine
Class JobConf
– The primary interface for configuring job parameters, including but not limited to:
  input and output paths (on the Hadoop Distributed File System), the number of mappers and reducers, the job name, …

Apache Mahout
A library of parallelized machine learning algorithms implemented on top of Hadoop
Applications
– Clustering
– Classification
– Batch-based collaborative filtering
– Frequent itemset mining
– …

Mahout in Progress
Algorithms already implemented
– K-Means, Fuzzy K-Means, Naive Bayes, Canopy clustering, Mean Shift, Dirichlet process clustering, Latent Dirichlet Allocation, Random Forests, etc.
Algorithms to be implemented
– Stochastic gradient descent, SVM, NN, PCA, ICA, GDA, EM, etc.

GPU for Large-Scale ML
A Graphics Processing Unit (GPU) is a specialized processor that offloads 2D/3D graphics rendering from the CPU
The highly parallel structure of GPUs makes them more effective than general-purpose CPUs for algorithms that process large blocks of data in parallel

NVIDIA GeForce 8800 GTX Specification
– Number of streaming multiprocessors: 16
– Multiprocessor width: 8
– Local store size: 16 KB
– Total number of stream processors: 128
– Peak single-precision floating-point rate: 346 Gflops
– Clock: 1.35 GHz
– Device memory: 768 MB
– Peak memory bandwidth: 86.4 GB/s
– Connection to host CPU: PCI Express
– CPU -> GPU bandwidth: 2.2 GB/s*
– GPU -> CPU bandwidth: 1.7 GB/s*

Logical Organization to Programmers
– Each block can contain up to 512 threads, which can synchronize with one another
– Millions of blocks can be issued

Programming Environment: CUDA
– Compute Unified Device Architecture (CUDA)
– A parallel computing architecture developed by NVIDIA
– The computing engine in GPUs is made accessible to software developers through an industry-standard programming language (C/C++ with extensions); a minimal example follows
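A minimal CUDA sketch (illustrative only, not taken from the slides): each block runs a few hundred threads, and enough blocks are launched to cover all n elements of a vector being scaled on the GPU.

#include <cuda_runtime.h>

// Kernel: each thread scales one element of the vector.
__global__ void scale(float *x, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        x[i] *= alpha;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    // ... copy input data to d_x with cudaMemcpy ...
    int threadsPerBlock = 256;                              // <= 512 per the previous slide
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_x, 0.5f, n);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}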

SVM on GPUs
B. Catanzaro, N. Sundaram, and K. Keutzer, "Fast Support Vector Machine Training and Classification on Graphics Processors," in Proceedings of the 25th International Conference on Machine Learning (ICML '08), Helsinki, Finland, July 2008

SVM Training
– The quadratic program (stated below)
– The SMO (Sequential Minimal Optimization) algorithm
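For reference, the quadratic program is the standard soft-margin SVM dual (textbook notation, not reproduced from the slides):

\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{m} y_i \alpha_i = 0.

SMO repeatedly optimizes two coefficients at a time while holding the rest fixed.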

SVM Training on GPU
– Each thread computes and updates the following variable for each point (see the sketch below)
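In Catanzaro et al.'s SMO formulation, the per-point quantity each thread maintains is the optimality indicator f_i = \sum_j \alpha_j y_j K(x_j, x_i) - y_i; after the two working-set coefficients \alpha_{high} and \alpha_{low} change, every thread updates its own f_i using the two corresponding kernel rows. A hypothetical CUDA sketch of that update (names and layout are mine; the two kernel rows are assumed to be precomputed):

// One thread per training point: update the SMO optimality indicator f[i]
// after alpha_high and alpha_low have changed by delta_alpha_high/low.
__global__ void updateF(float *f,
                        const float *k_high,      // K(x_high, x_i) for all i
                        const float *k_low,       // K(x_low,  x_i) for all i
                        float delta_alpha_high, float y_high,
                        float delta_alpha_low,  float y_low,
                        int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        f[i] += delta_alpha_high * y_high * k_high[i]
              + delta_alpha_low  * y_low  * k_low[i];
    }
}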

Result: SVM training on GPU (Speedup over LibSVM)

SVM Classification
– The classification task is to determine which side of the separating hyperplane a point lies on (see the decision function below)
– Each thread evaluates the kernel function for one point
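Concretely, a test point z is labeled by the sign of the standard SVM decision function (again textbook notation, not copied from the slides):

f(z) = \sum_{i \in SV} \alpha_i y_i K(x_i, z) + b,

so the bulk of the work is evaluating K(x_i, z) for every support vector x_i, which is exactly what the GPU threads parallelize.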

Result: SVM classification on GPU (Speedup over LibSVM)

GPUMiner: Parallel Data Mining on Graphics Processors
Wenbin Fang et al., "Parallel Data Mining on Graphics Processors," Technical Report HKUST-CS08-07, Oct. 2008
Three components
– A CPU-based storage and buffer manager that handles I/O and data transfer between CPU and GPU
– A GPU-CPU co-processing parallel mining module
– A GPU-based mining visualization module
Two mining algorithms implemented
– K-Means clustering
– Apriori (a frequent-pattern mining algorithm)

GPUMiner: System Architecture

The Bitmap Technique
– A bitmap represents the association between data objects and clusters (for K-Means) and between items and transactions (for Apriori)
– It supports efficient row-wise and column-wise operations that exploit thread parallelism on the GPU (see the counting sketch below)
– A summary vector stores the number of ones in each row/column to accelerate counting them
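A hypothetical sketch of the row-wise counting idea (my own illustration, not GPUMiner's code), assuming the bitmap is stored as 32-bit words in row-major order with wordsPerRow words per row; one thread block counts the set bits of one row, which for Apriori corresponds to an item's support:

#include <cuda_runtime.h>

__global__ void rowSupport(const unsigned int *bitmap, int wordsPerRow,
                           int *support)
{
    extern __shared__ int partial[];      // one slot per thread
    int row = blockIdx.x;                 // one block per bitmap row
    int count = 0;
    for (int w = threadIdx.x; w < wordsPerRow; w += blockDim.x)
        count += __popc(bitmap[row * wordsPerRow + w]);   // count 1-bits per word
    partial[threadIdx.x] = count;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {        // tree reduction in shared memory
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        support[row] = partial[0];
}

A launch such as rowSupport<<<numRows, 128, 128 * sizeof(int)>>>(bitmap, wordsPerRow, support) would compute all row counts in parallel.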

K-Means
Three functions executed on the GPU in parallel (a sketch of the last one follows)
– makeBitmap_kernel
– computeCentriod_kernel
– findCluster_kernel
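A hypothetical sketch of what a findCluster-style kernel does (illustrative only, not GPUMiner's actual implementation): one thread per data point finds the nearest centroid by squared Euclidean distance.

#include <cfloat>   // FLT_MAX

__global__ void findCluster(const float *points, const float *centroids,
                            int *assignment, int n, int k, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per point
    if (i >= n) return;
    int best = 0;
    float bestDist = FLT_MAX;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float diff = points[i * dim + d] - centroids[c * dim + d];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }
    assignment[i] = best;   // consumed by the bitmap/centroid kernels
}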

Apriori
– Goal: find the frequent itemsets among a large number of transactions
The trie-based implementation
– Uses a trie to store candidate itemsets and their supports
– Uses a bitmap to store the item-transaction matrix
– Obtains the item supports by counting 1s in the bitmap
– The 1-bit counting and intersection operations are implemented as GPU programs

Experimental Settings
– GPU: NVIDIA GTX 280, 30 multiprocessors × 8 stream processors each
– CPU: Intel Core 2 Quad

Result: K-Means
– Baseline: the UVirginia implementation
– 35x faster than the four-threaded CPU-based counterpart

Result: Apriori
Baselines
– A CPU-based Apriori implementation
– The best implementation from FIMI '03

Conclusion
– Both MapReduce and GPUs are feasible strategies for large-scale parallelized machine learning
– MapReduce aims at parallelization over clusters of computers
– The highly parallel hardware architecture of GPUs makes them a natural choice for parallelized ML