AUTO-GC: Automatic Translation of Data Mining Applications to GPU Clusters
Wenjing Ma, Gagan Agrawal
The Ohio State University
Outline of Contents
Background of GPU and GPU cluster computing
System design
Implementation
–System API
–Code analysis
–Generation of FREERIDE code
–Generation of CUDA programs
Applications on GPU clusters
–K-means, PCA
Future work
Background
Multi-core and many-core architectures are becoming more popular
GPUs are inexpensive and fast
CUDA is a high-level language that supports programming on GPUs
It is common for clusters to have powerful GPGPUs
–Tianhe-1, #5 in the Top 500 list
Other systems with heterogeneous cores
–RoadRunner, #2 in the Top 500 list
Complications of Programming on GPUs
The user must have thorough knowledge of the GPU architecture and the CUDA programming model
Must handle memory allocation and copying
Must know which data should be copied into shared memory
…
Programming GPU Clusters
Need to combine CUDA with a distributed-memory programming model
Hybrid programming models are not well developed today
Many decisions for application development
–Partitioning data and computation between nodes
–CUDA memory hierarchy
Needs and Approach
Need high-level approaches for programming heterogeneous clusters
Very challenging compilation / programming model problem
Domain-specific approaches can be feasible
–Support a restricted class of applications
Target Domain: Data-Intensive Applications
A lot of interest in large-scale data analysis
Large clusters widely used
Many data-intensive applications can be scaled using GPUs
Common structure can be exploited
Parallel Data Mining
Common structure of parallel data mining implementations
Based on our earlier work on the FREERIDE middleware

{ * Outer Sequential Loop * }
While () {
    { * Reduction Loop * }
    Foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;
    }
    Finalize();
}
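As a concrete illustration of this template, the assignment step of k-means fits the structure: process(e) finds the nearest center, and the reduction accumulates per-cluster sums and counts. This is a minimal sequential sketch with assumed names (process_point, reduc_sum, reduc_cnt), not code taken from the system:

    #include <cfloat>

    const int K = 4, DIM = 2;            // cluster count and dimensionality
    float centers[K][DIM];               // current cluster centers
    float reduc_sum[K][DIM] = {};        // reduction object: per-cluster sums
    int   reduc_cnt[K] = {};             // reduction object: per-cluster counts

    // One iteration of the Foreach loop: (i, val) = process(e), then
    // Reduc(i) = Reduc(i) op val, with op = + here.
    void process_point(const float* e) {
        int i = 0;
        float best = FLT_MAX;
        for (int c = 0; c < K; c++) {    // find the nearest center
            float d = 0;
            for (int j = 0; j < DIM; j++) {
                float diff = e[j] - centers[c][j];
                d += diff * diff;
            }
            if (d < best) { best = d; i = c; }
        }
        for (int j = 0; j < DIM; j++)    // accumulate into the reduction object
            reduc_sum[i][j] += e[j];
        reduc_cnt[i] += 1;
    }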
A Data-Intensive Computing Middleware: FREERIDE
The reduction object represents the intermediate state of the execution
The reduce function is commutative and associative
Sorting and grouping overheads are eliminated by the reduction function/object
Architecture of Our Code Generation System
Architecture (Summary)
User input
Code analyzer
–Analysis of variables (variable type and size)
–Analysis of reduction functions (sequential code from the user)
Code generator (generating FREERIDE and CUDA code)
Generalized Reductions on GPUs
SIMD shared-memory programming
Three steps involved in the main loop (see the kernel sketch below)
–Data read
–Computing update
–Writing update
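A minimal sketch of how these three steps could map onto a CUDA kernel; the bin computation stands in for the user's process(e), and the per-thread layout of update is an assumption, not the exact generated code:

    __global__ void reduction_kernel(const float* data, int n,
                                     float* update, int num_bins) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        float* my_update = update + tid * num_bins;  // this thread's private copy

        for (int e = tid; e < n; e += nthreads) {
            float val = data[e];          // data read
            int i = e % num_bins;         // computing update (stand-in for process(e))
            my_update[i] += val;          // writing update, no synchronization needed
        }
    }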
Parallelization on a Cluster
Use FREERIDE-based processing
Local reductions are further accelerated with GPUs

{ * Outer Sequential Loop * }
While () {
    { * Reduction Loop * }
    Foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;    // maps to FREERIDE Reduction()
    }
    Finalize();                        // maps to FREERIDE Finalize()
}
User Input
A sequential reduction function
Variables to be used in the reduction function
Finalize function
Optional functions (initialization function, combination function, …)
Analysis of Sequential Code
Obtaining variable access features
Extracting the reduction objects and their combination operations
Analysis for parallelization
–Find the parallel loop
–Determine the data to be replicated
–Calculate the size of shared memory to use and which data to copy to shared memory
Getting the Operator for Combining Reduction Objects
All output variables are denoted as reduction objects
Combination of reduction objects
–Local combination at the end of the CUDA function
–Global combination by MPI, done within ADR
Also done by the code analyzer, using LLVM
Currently supports + and * (see the MPI sketch below)
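For the + operator, the global combination across nodes amounts to a standard MPI reduction. The call below is a sketch of what the middleware performs internally, not the actual ADR code (MPI_PROD would cover the * case):

    #include <mpi.h>

    // Sum each node's local reduction object into a globally combined
    // copy that every node receives.
    void global_combine(const float* local_obj, float* global_obj, int len) {
        MPI_Allreduce(local_obj, global_obj, len,
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    }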
Generating FREERIDE Code
Three functions
–Initialization(): at the beginning of the execution
–Reduction(): invoked for every data block
–Finalize(): at the end of every iteration
Generating FREERIDE Code
Initialization()
–Allocate memory for variables
–Need to consider the copies for GPUs
Reduction()

    void Reduction(void* data) {
        float* point = (float*) data;
        reduc(data, update, ……);         // invoke CUDA function
        for (int i = 0; i < k*5; i++)
            reduction(0, i, update[i]);  // updating reduction objects
    }

Finalize()
–Update local variables with the combined value of reduction objects
–Check for the finishing condition
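As an example of what a generated Finalize() might do for k-means, the sketch below folds the combined reduction object back into the centers and tests a finishing condition; the memory layout (DIM sums followed by a count per cluster) and the threshold are assumptions for illustration:

    #include <cmath>

    // combined holds, for each cluster, dim coordinate sums followed by a count.
    bool Finalize(const float* combined, float* centers, int k, int dim) {
        float change = 0;
        for (int c = 0; c < k; c++) {
            float cnt = combined[c * (dim + 1) + dim];
            for (int j = 0; j < dim; j++) {
                float mean = combined[c * (dim + 1) + j] / (cnt > 0 ? cnt : 1);
                change += fabsf(mean - centers[c * dim + j]);
                centers[c * dim + j] = mean;   // update local variables
            }
        }
        return change < 1e-4f;                 // check for finishing condition
    }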
Generating CUDA Code
Memory allocation and copy
Global function
Kernel reduction function
Local combination
Memory Allocation and Copy
[Figure: each of the 64 threads (T0 … T63) gets its own copy of the reduction object in device memory]
Copy the updates back to host memory after the kernel reduction function returns
Need a copy for each thread
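The host-side allocation and copy-back implied by this scheme might look like the sketch below, reusing the reduction_kernel sketch from earlier; buffer names and sizes are illustrative:

    #include <cuda_runtime.h>

    __global__ void reduction_kernel(const float*, int, float*, int);  // sketched earlier

    void run_reduction(const float* h_data, int n, float* h_update,
                       int num_bins, int blocks, int threads) {
        int nthreads = blocks * threads;
        float *d_data, *d_update;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMalloc(&d_update, nthreads * num_bins * sizeof(float));  // one copy per thread
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemset(d_update, 0, nthreads * num_bins * sizeof(float));

        reduction_kernel<<<blocks, threads>>>(d_data, n, d_update, num_bins);

        // Copy the updates back to host memory after the kernel returns.
        cudaMemcpy(h_update, d_update, nthreads * num_bins * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        cudaFree(d_update);
    }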
Generated CUDA Functions
Global function
–Invokes the kernel
Kernel reduction function
–Generated from the original sequential code
–Divides the main loop by block_number and thread_number
–Replaces the access offsets with appropriate indices
–…
Local combination
–Assumes all updates from each thread are combined by sum or product
–An automatically generated combination function, invoked by 1 thread
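The single-thread combination step could be expressed as in this sketch, which folds every thread-private copy of the reduction object into copy 0; this illustrates the idea rather than reproducing the generated code:

    __global__ void local_combine(float* update, int num_bins, int nthreads) {
        // Only one thread performs the combination, as on the slide.
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            for (int t = 1; t < nthreads; t++)
                for (int i = 0; i < num_bins; i++)
                    update[i] += update[t * num_bins + i];   // op = + assumed
        }
    }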
Optimizations
Using shared memory (see the sketch below)
Providing user-specified initialization functions and combination functions
Specifying variables that are allocated only once
…
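The shared-memory optimization amounts to staging each thread's private copy in on-chip memory and writing to global memory only once at the end. A hedged sketch, assuming the kernel is launched with threads * num_bins * sizeof(float) bytes of dynamic shared memory per block:

    __global__ void reduction_kernel_shared(const float* data, int n,
                                            float* update, int num_bins) {
        extern __shared__ float s_update[];          // blockDim.x * num_bins floats
        float* mine = s_update + threadIdx.x * num_bins;
        for (int i = 0; i < num_bins; i++)
            mine[i] = 0;

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        for (int e = tid; e < n; e += nthreads)
            mine[e % num_bins] += data[e];           // updates stay on-chip

        for (int i = 0; i < num_bins; i++)           // single spill to global memory
            update[tid * num_bins + i] = mine[i];
    }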
Applications
K-means clustering
–16 blocks and 256 threads per block
PCA
–16 blocks and 128 threads per block
Experiment Results
K-means (1.5 GB data)
Experiment Results
PCA (64M rows, 3 principal components)
Experiment Results
PCA (2M rows, 64 principal components)
Summary
High-level abstractions are feasible if we target specific processing structures
Promising initial results
–Many challenges remain in obtaining performance from GPU clusters
Future Work
Analysis of sequential code to support more operations for combining reduction objects
Improve the shared-memory allocation strategy
Support automatic grid configuration
Support multi-core CPUs and GPUs on the same compute node