1
Euro-Par, 2006
A Translation System for Enabling Data Mining Applications on GPUs
Wenjing Ma, Gagan Agrawal
The Ohio State University
ICS 2009
2
Motivation and Overview
Two popular trends:
– Data-intensive computing
– GPU programming
These seem like a good match. Can we ease the use of GPGPUs?
– A domain-specific programming tool
– Can exploit common programming structure
– Enables good speedups
3
Context
Many years of work on compiler and runtime support for data-intensive applications:
– Clusters, SMPs, clusters of SMPs
– FREERIDE and its language front-ends
Similar to MapReduce, but:
– Predates it and performs better
– Recent work on (clusters of) multi-cores, incorporating RSTM
GPUs:
– C and MATLAB front-ends
– Clusters of GPUs, combined multi-core and GPU
4
Outline
Background
– GPU computing
– Parallel data mining
Challenges of data mining on GPUs
Architecture of the system
– Sequential code analysis
– Generation of CUDA programs
– Optimization techniques
Experimental results
– k-means, EM, PCA
Related and future work
5
Background: GPU Computing
Many-core architectures and accelerators are becoming more popular.
GPUs are inexpensive and fast.
CUDA is a high-level language for GPU programming.
6
CUDA Programming
A significant improvement over the use of graphics libraries, but the programmer still must:
– Have detailed knowledge of the GPU architecture and learn a new language
– Specify the grid configuration
– Deal with memory allocation and data movement
– Explicitly manage the memory hierarchy
7
Parallel Data Mining
Common structure of data mining applications (FREERIDE):

/* outer sequential loop */
while (...) {
    /* reduction loop */
    foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;
    }
}
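The generalized reduction structure above can be sketched in plain C++. This is a hypothetical example, not FREERIDE's actual API: `process` is stood in for by a simple bucketing function, and `op` is addition.

```cpp
#include <vector>

// Sketch of the FREERIDE generalized-reduction structure: each element is
// processed independently, and its contribution is folded into a reduction
// object indexed by i. The bucketing "process" step here is illustrative.
std::vector<long> reduction_loop(const std::vector<int>& data, int num_buckets) {
    std::vector<long> reduc(num_buckets, 0);   // Reduc(i)
    for (int e : data) {                       // foreach (element e)
        int i = e % num_buckets;               // (i, val) = process(e)
        long val = e;
        reduc[i] = reduc[i] + val;             // Reduc(i) = Reduc(i) op val
    }
    return reduc;
}
```

Because `op` is associative and commutative, the foreach iterations can be split across threads, which is exactly the property the translation system exploits.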
8
Porting to GPUs
High-level parallelization is straightforward; the difficulties lie in:
– Details of data movement
– Impact of thread count on reduction time
– Use of shared memory
9
Architecture of the System
(System diagram.) User input consists of variable information, reduction functions, and optional functions. The code analyzer (in LLVM) and the variable analyzer produce variable access patterns and combination operations, which drive the code generator. The generator emits the host program (grid configuration and kernel invocation) and the kernel functions, which are compiled into the executable.
10
User Input
– A sequential reduction function
– Optional functions (initialization function, combination function, …)
– Values of each variable, or the size of each array
– Variables to be used in the reduction function
11
Analysis of Sequential Code
– Obtain the access features of each variable
– Determine the data to be replicated
– Obtain the operator for global combination
– Identify variables for shared memory
12
Memory Allocation and Copy
Arrays updated in the reduction need one copy per thread (T0, T1, …, T63); the updates are copied back to host memory after the kernel reduction function returns. (Diagram: arrays A, B, and C replicated across threads.)
13
Extracting Variable Access Information
The variable analyzer takes the IR from LLVM, the argument list, and the user input, and extracts:
– Variables to be written
– Read-only variables
– Temporary variables
14
Generating CUDA and C/C++ Code
– Invoking the kernel function
– Memory allocation and copy
– Thread grid configuration (block number and thread number)
– Global function
– Kernel reduction function
– Global combination
15
Global Combination
Assume all per-thread updates are combined by summation or multiplication. A global combination function is generated automatically and invoked by a single thread.
16
Kernel Reduction Function
– Generated from the original sequential code
– Divides the main loop among block_number × thread_number workers
– Replaces the access offsets with appropriate thread-relative indices
17
Optimizations
– Using shared memory
– Allowing user-specified initialization and combination functions
– Specifying variables that are allocated only once
18
Dealing with Shared Memory
Size = length * sizeof(type) * thread_info
– length: size of the array
– type: char, int, or float
– thread_info: the number of copies, i.e. whether the array is replicated per thread
Mark each array as shared until the total size would exceed the shared-memory limit.
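The sizing rule above can be sketched as follows. The struct fields and the 16 KB limit in the test are assumptions for illustration (16 KB was the per-block shared-memory size on the GPUs of that era), not the tool's actual data structures:

```cpp
#include <string>
#include <vector>

// Size = length * sizeof(type) * thread_info, where `copies` is 1 for an
// array shared as-is and thread_number for one replicated per thread.
struct ArrayInfo { std::string name; size_t length, elem_size, copies; };

// Mark arrays as shared, in order, until the next one would exceed the limit.
std::vector<std::string> mark_shared(const std::vector<ArrayInfo>& arrays,
                                     size_t shared_limit) {
    std::vector<std::string> shared;
    size_t used = 0;
    for (const auto& a : arrays) {
        size_t size = a.length * a.elem_size * a.copies;
        if (used + size > shared_limit) break;   // limit reached: stop marking
        used += size;
        shared.push_back(a.name);
    }
    return shared;
}
```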
19
Shared Memory Layout Strategies
– No sorting
– Greedy sorting
– Write-first sorting
20
No Sorting
(Diagram: arrays B, A, C, and D are placed in shared memory in their original order.)
21
Greedy Sorting
(Diagram: arrays B, A, C, and D are reordered by the greedy strategy before being packed into shared memory.)
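A sketch of a greedy layout pass: unlike "no sorting," candidates are sorted by a priority before being packed into shared memory. The priority used here, smallest size first so that more arrays fit, is an assumption for illustration; the paper's write-first variant would instead prioritize arrays that are written.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Candidate { std::string name; size_t bytes; };

// Greedily pack candidates into shared memory after sorting them by size.
std::vector<std::string> greedy_layout(std::vector<Candidate> cands,
                                       size_t shared_limit) {
    std::sort(cands.begin(), cands.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.bytes < b.bytes;   // smallest first (assumed criterion)
              });
    std::vector<std::string> placed;
    size_t used = 0;
    for (const auto& c : cands) {
        if (used + c.bytes <= shared_limit) {  // skip arrays that no longer fit
            used += c.bytes;
            placed.push_back(c.name);
        }
    }
    return placed;
}
```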
22
Other Optimizations
Reducing memory allocation and copy overhead:
– Arrays shared by multiple iterations can be allocated and copied only once
User-defined combination functions
23
Applications
– k-means clustering
– EM clustering
– PCA
24
Experimental Results
(Chart: speedup of k-means.)
25
(Chart: speedup of k-means on GeForce 9800X2.)
26
(Chart: speedup of EM.)
27
(Chart: speedup of PCA.)
28
Related Work
– OpenMP to CUDA (Purdue)
– Domain-specific operators to CUDA (NEC)
– CUDA-lite and related efforts (Illinois)
– Various application studies
29
Conclusions
– Automatic CUDA code generation and optimization is feasible
– Restricting to a domain and communication style helps
– Interesting new compiler optimizations emerge