Exploiting Computing Power of GPU for Data Mining Application Wenjing Ma, Leonid Glimcher, Gagan Agrawal

Outline of contents
Background of GPU computing
Parallel data mining
Challenges of data mining on GPU
GPU implementation: k-means, EM, kNN, Apriori
Experiment results: results of k-means and EM
Features of applications that are suitable for GPU computing
Related and future work

Background of GPU computing
Multi-core architectures are becoming more popular in high-performance computing.
GPUs are inexpensive and fast.
CUDA is a high-level language that supports programming on the GPU.

CUDA functions
Host function: called by the host and executed on the host.
Global function: called by the host and executed on the device.
Device function: called by the device and executed on the device.
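A minimal CUDA sketch of the three function types; the bodies are illustrative placeholders, not code from the talk:

    #include <cuda_runtime.h>

    __host__ void onHost() { }                // called by the host, runs on the host

    __device__ int scale(int x) {             // called from device code, runs on the device
        return 2 * x;
    }

    __global__ void kernel(int *out) {        // launched by the host, runs on the device
        out[threadIdx.x] = scale(threadIdx.x);
    }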

Architecture of GeForce 8800 GPU (1 multiprocessor)

Parallel data mining
Common structure of data mining applications (adopted from FREERIDE):
{ * Outer Sequential Loop * }
While () {
    { * Reduction Loop * }
    Foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;
    }
}
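As a concrete, hypothetical instance of this structure, a histogram maps each element to a bucket index and combines with addition; all names below are illustrative, not from FREERIDE:

    // Generalized reduction instantiated for a histogram (plain host code).
    void histogram(const float *data, int n, int *reduc, int nBuckets,
                   float lo, float hi) {
        for (int b = 0; b < nBuckets; b++) reduc[b] = 0;
        // for iterative algorithms (e.g., k-means) this loop body would sit
        // inside the outer sequential While() loop; a histogram needs one pass
        for (int e = 0; e < n; e++) {                              // reduction loop
            int i = (int)((data[e] - lo) / (hi - lo) * nBuckets);  // (i, val) = process(e)
            if (i >= 0 && i < nBuckets)
                reduc[i] += 1;                                     // Reduc(i) = Reduc(i) op val
        }
    }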

Challenges of data mining on GPU
SIMD shared-memory programming.
Three steps involved in the main loop:
Data read
Computing update
Writing update

Computing update
copy common variables from device memory to shared memory
nBlocks = blockSize / thread_number
For i = 1 to nBlocks {
    each thread processes 1 data element
}
Global reduction
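A hedged CUDA sketch of this per-block main loop, shown for a plain sum reduction; blockWork (the slide's blockSize) and all other names are assumptions:

    // Launch as: reduceKernel<<<nBlocks, nThreads, nThreads * sizeof(float)>>>(...)
    __global__ void reduceKernel(const float *data, float *blockSums, int blockWork) {
        extern __shared__ float perThread[];   // one reduction copy per thread
        perThread[threadIdx.x] = 0.0f;

        int base = blockIdx.x * blockWork;
        int nIters = blockWork / blockDim.x;   // assumes blockWork % blockDim.x == 0
        for (int i = 0; i < nIters; i++) {
            float val = data[base + i * blockDim.x + threadIdx.x];  // data read
            perThread[threadIdx.x] += val;                          // computing update
        }
        __syncthreads();

        if (threadIdx.x == 0) {                // writing update: combine per-thread copies
            float sum = 0.0f;
            for (int t = 0; t < blockDim.x; t++) sum += perThread[t];
            blockSums[blockIdx.x] = sum;
        }
    }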

GPU implementation: k-means
Data are points (say, 3-dimensional).
Start with k clusters.
Find the nearest cluster for each point.
Determine the k centroids from the points assigned to the corresponding center.
Repeat until the assignments of points don't change.

GPU version of k-means
Device function:
    shared_memory center
    nBlocks = blockSize / thread_number
    tid = thread_ID
    For i = 1 to nBlocks
        min = MAX
        For j = 1 to k
            dis = distance(data[i][tid], center[j])
            if (dis < min)
                min = dis
                min_index = j
        update[tid][min_index] += (data[i][tid], dis)
    Thread 0 combines all copies of update
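A compilable CUDA sketch of this assignment step, with the centroids cached in shared memory; the row-major layout, DIM = 3, and all names are assumptions rather than the authors' code:

    #include <float.h>
    #define DIM 3   // dimensionality of the points (an assumption)

    __global__ void assignPoints(const float *points,    // n x DIM
                                 const float *centers,   // k x DIM
                                 int *labels, int n, int k) {
        extern __shared__ float sCenters[];              // k * DIM floats
        for (int i = threadIdx.x; i < k * DIM; i += blockDim.x)
            sCenters[i] = centers[i];                    // cooperative copy to shared memory
        __syncthreads();

        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= n) return;

        float best = FLT_MAX;                            // note: start at +inf, not 0
        int bestIdx = 0;
        for (int j = 0; j < k; j++) {
            float dist = 0.0f;
            for (int d = 0; d < DIM; d++) {
                float diff = points[p * DIM + d] - sCenters[j * DIM + d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; bestIdx = j; }
        }
        labels[p] = bestIdx;                             // per-point write, no contention
    }

It would be launched as assignPoints<<<blocks, threads, k * DIM * sizeof(float)>>>(...); the host then recomputes the centroids and repeats until no assignment changes.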

Other applications
EM: E step and M step, with different amounts of computation.
Apriori: tree-structured reduction objects; a large number of updates.
kNN

Experiment results
k-means and EM have the best performance when using 512 threads/block and 16 or 32 thread blocks.
kNN and Apriori hardly get good speedup on the GPU.

k-means (10MB points)

k-means (continued) (20MB points)

EM (512K points)

EM (continued) (1M points)

Features of applications that are suitable for GPU computing
The time spent on processing the data must dominate the I/O cost.
The size of the reduction object needs to be small enough to have a replica for each thread in device memory.
Use the shared memory to store frequently accessed data.

The time spent on processing the data must dominate the I/O cost
[Figure: breakdown of time spent on I/O vs. computing]

The size of the reduction object needs to be small enough to have a replica for each thread in device memory
There is no locking mechanism on the GPU, and accesses to the reduction objects are unpredictable.

Using the shared memory to store frequently accessed data
Accessing device memory is very time-consuming.
Shared memory serves as a high-speed cache.
For non-read-only data elements in shared memory, we also need a replica for each thread.
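A minimal sketch combining both points: each thread keeps a private replica of a small reduction object in shared memory, so no locking or atomics are needed; K and all names are illustrative assumptions:

    #define K 8   // reduction object size, e.g., per-cluster counts

    // Launch as: countKernel<<<1, nThreads, nThreads * K * sizeof(int)>>>(...)
    __global__ void countKernel(const int *labels, int *out, int n) {
        extern __shared__ int replica[];           // blockDim.x * K ints
        int *mine = &replica[threadIdx.x * K];     // this thread's private copy
        for (int j = 0; j < K; j++) mine[j] = 0;

        for (int e = threadIdx.x; e < n; e += blockDim.x)
            mine[labels[e]]++;                     // private update, no locking

        __syncthreads();
        if (threadIdx.x == 0) {                    // thread 0 combines all copies
            for (int j = 0; j < K; j++) {
                int sum = 0;
                for (int t = 0; t < blockDim.x; t++) sum += replica[t * K + j];
                out[j] = sum;
            }
        }
    }

With 256 threads and K = 8, the replicas take 8 KB, which fits in the 16 KB shared memory of a GeForce 8800 multiprocessor; this is why the reduction object must stay small.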

Related work
FREERIDE
Other GPU computing languages
The use of GPU computation in scientific computing

Future work
Middleware for data mining on GPU
Provide a compilation mechanism for data mining applications written in MATLAB
Enable tuning of parameters that can optimize GPU computing

Thank you! Questions?