1
Exploiting Computing Power of GPU for Data Mining Application Wenjing Ma, Leonid Glimcher, Gagan Agrawal
2
Outline of contents
Background of GPU computing
Parallel data mining
Challenges of data mining on GPU
GPU implementation: k-means, EM, kNN, Apriori
Experiment results: results of k-means and EM
Features of applications that are suitable for GPU computing
Related and future work
3
Background of GPU computing
Multi-core architectures are becoming more popular in high performance computing
GPUs are inexpensive and fast
CUDA is a high-level language that supports programming on the GPU
4
CUDA functions
Host function: called by host and executed on host
Global function: called by host and executed on device
Device function: called by device and executed on device
5
Architecture of GeForce 8800 GPU (1 multiprocessor)
6
Parallel data mining
Common structure of data mining applications (adopted from FreeRide):

{ * Outer Sequential Loop * }
While () {
    { * Reduction Loop * }
    Foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;
    }
}
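The generalized reduction structure above can be sketched on the CPU in plain Python. The `process` function, the `op`, and the parity-counting example are illustrative stand-ins, not part of the original FreeRide code:

```python
def reduction_pass(data, process, op, reduc):
    """One pass of the reduction loop: each element e yields a slot index i
    and a value val, which is folded into the reduction object at slot i."""
    for e in data:
        i, val = process(e)
        reduc[i] = op(reduc[i], val)
    return reduc

# Example: count elements per parity bucket (slot 0 = even, slot 1 = odd).
counts = reduction_pass([1, 2, 3, 4, 5],
                        process=lambda e: (e % 2, 1),
                        op=lambda a, b: a + b,
                        reduc=[0, 0])
# counts == [2, 3]
```

In the full structure this pass sits inside the outer sequential loop, which repeats until the application's convergence test is met.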
7
Challenges of data mining on GPU
SIMD shared-memory programming
3 steps involved in the main loop:
Data read
Computing update
Writing update
8
Computing update
Copy common variables from device memory to shared memory
nBlocks = blockSize / thread number
For i = 1 to nBlocks {
    each thread processes 1 data element
}
Global reduction
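A minimal CPU sketch of the computing-update step above: the data block is split into nBlocks passes, and in pass i the thread with index tid handles element i * n_threads + tid, accumulating into its own private slot before a final global reduction. The names (`compute_update`, `n_threads`) and the summation workload are illustrative assumptions:

```python
def compute_update(block, n_threads):
    n_blocks = len(block) // n_threads        # nBlocks = blockSize / thread number
    updates = [0] * n_threads                 # one private update slot per thread
    for i in range(n_blocks):
        for tid in range(n_threads):          # the "threads" advance in lockstep (SIMD)
            updates[tid] += block[i * n_threads + tid]
    return sum(updates)                       # global reduction over the thread copies
```

For example, `compute_update(list(range(8)), 4)` makes two passes over 8 elements with 4 threads and returns 28, the total sum.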
9
GPU Implementation
k-means
Data are points (say, 3-dimensional)
Start with k clusters
Find the nearest cluster for each point
Determine the k centroids from the points assigned to the corresponding center
Repeat until the assignments of points don't change
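The k-means loop described above can be sketched sequentially in Python; the GPU version parallelizes the assignment step across threads. The helper names (`dist2`, `mean`) and the convergence check on centers are illustrative choices, not taken from the slides:

```python
def kmeans(points, centers, max_iters=100):
    for _ in range(max_iters):
        # Assignment step: nearest center for each point.
        assign = [min(range(len(centers)),
                      key=lambda j: dist2(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == j]
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:   # centers stable, so assignments are too
            return new_centers
        centers = new_centers
    return centers

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))
```

On two well-separated pairs of 3-D points, e.g. starting centers (0,0,0) and (10,10,10), the loop converges to the pair midpoints after one update.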
10
GPU version of k-means
Device function:
    shared_memory center
    nBlocks = blockSize / thread_number
    tid = thread_ID
    For i = 1 to nBlocks {
        min = MAX
        For j = 1 to k {
            dis = distance(data[tid], center[j])
            if (dis < min) {
                min = dis
                min_index = j
            }
        }
        update[tid][min_index] = (data[tid], dis)
    }
    Thread 0 combines all copies of update
11
Other applications
EM: E step and M step, different amounts of computation
Apriori: tree-structured reduction objects, large number of updates
kNN
12
Experiment results
k-means and EM have the best performance when using 512 threads/block and 16 or 32 thread blocks
kNN and Apriori hardly get good speedup on the GPU
13
k-means (10MB points)
14
k-means (continued) (20MB points)
15
EM (512K points)
16
EM (continued) (1M points)
17
Features of applications that are suitable for GPU computing the time spent on processing the data must dominate the I/O cost the size of the reduction object needs to be small enough to have a replica for each thread in device memory using the shared memory to store frequently accessed data
18
The time spent on processing the data must dominate the I/O cost
(Figure: breakdown of running time into I/O and computing)
19
The size of the reduction object needs to be small enough to have a replica for each thread in device memory
No locking mechanism on the GPU
The accesses to the reduction objects are unpredictable
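A CPU sketch of why per-thread replicas are needed: with no locking mechanism, threads updating one shared reduction object would race on the unpredictable slot accesses, so each thread updates only its private copy and the copies are merged afterwards. The function name, the round-robin thread assignment, and the counting workload are illustrative assumptions:

```python
def replicated_reduce(data, n_threads, n_slots, key):
    # One private reduction object per thread: no two "threads" ever
    # touch the same slot of the same copy, so no locks are needed.
    replicas = [[0] * n_slots for _ in range(n_threads)]
    for idx, e in enumerate(data):
        tid = idx % n_threads          # which thread handles this element
        replicas[tid][key(e)] += 1     # update only that thread's copy
    # Merge phase (thread 0 combines all copies in the GPU version).
    return [sum(rep[s] for rep in replicas) for s in range(n_slots)]
```

For example, counting six integers by parity with two threads gives the same totals as a single shared counter would, without any concurrent writes to one object.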
20
Using the shared memory to store frequently accessed data
Accessing device memory is very time consuming
Shared memory serves as a high-speed cache
For non-read-only data elements in shared memory, we also need a replica for each thread
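The caching pattern above can be mimicked on the CPU: frequently read data (here, the k-means centers) is copied once into a fast local structure standing in for shared memory, instead of re-reading the slow "device memory" on every distance computation. The function and its 1-D distance are illustrative, not from the slides:

```python
def nearest_centers(points, device_centers):
    # One copy from slow "device memory" into the fast local "shared memory";
    # every later access in the hot loop hits only the cached copy.
    shared_centers = list(device_centers)
    out = []
    for p in points:
        j = min(range(len(shared_centers)),
                key=lambda j: abs(p - shared_centers[j]))
        out.append(j)
    return out
```

With centers [1, 10], the points [0, 9, 5] map to center indices [0, 1, 0].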
21
Related work
FreeRide
Other GPU computing languages
The usage of GPU computation in scientific computing
22
Future work
Middleware for data mining on GPU
Provide a compilation mechanism for data mining applications written in MATLAB
Enable tuning of parameters that can optimize GPU computing
23
Thank you! Questions?