Published by Carol Palmer. Modified over 9 years ago.
1
By Xinggao Xia and Jong Chul Lee
3
Table 1: Computational Complexity of Various Solving Techniques

Technique              Additions, Multiplications/Divisions
Gauss-Jordan           n^3/2
Gaussian Elimination   n^3/3
Cramer's Rule          n^4/3

n: the dimension of the matrix (number of equations)

Comments:
Computational complexity increases tremendously as the dimension of the matrix increases.
The Gaussian Elimination solver has a clear complexity advantage as the matrix size increases.
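To get a feel for how fast these counts grow, the leading-order formulas from Table 1 can be evaluated directly. This is only an illustrative sketch: the function name and dictionary keys are made up here, and n is assumed to be the matrix dimension.

```python
def leading_op_count(technique, n):
    """Leading-order operation count from Table 1 for an n x n system.

    The technique keys are hypothetical labels chosen for this sketch.
    """
    formulas = {
        "gauss_jordan": lambda k: k**3 / 2,
        "gaussian_elimination": lambda k: k**3 / 3,
        "cramers_rule": lambda k: k**4 / 3,
    }
    return formulas[technique](n)


if __name__ == "__main__":
    for n in (512, 1024, 2048, 4096):
        ge = leading_op_count("gaussian_elimination", n)
        gj = leading_op_count("gauss_jordan", n)
        cr = leading_op_count("cramers_rule", n)
        print(f"n={n}: GE={ge:.3e}  Gauss-Jordan={gj:.3e}  Cramer={cr:.3e}")
```

Under these formulas Gauss-Jordan costs 1.5 times as much as Gaussian Elimination at every size, while Cramer's Rule grows by an extra factor of n.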
4
[Figure: matrix states during Gaussian Elimination: Normalization, Iteration No.1, Iteration No.2, Iteration No.3, ..., Iteration No.N]
5
Inter-iteration parallelism
[Figure: matrix at iteration i with multiplier column entries m0i, m1i, m2i, m3i]
For iteration i: A[j][] = A[j][] - m[j][i] * (pivot row)
The multiplier array m must be determined before each iteration.
This fits the CUDA architecture well.
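A serial sketch of one iteration may help show why the multiplier array has to be fixed first: once m and the pivot row are known, each row update is independent of the others. The function names are hypothetical, and updating every row other than the pivot row by default is an assumption made for this sketch.

```python
def multiplier_column(A, i):
    """m[j] = A[j][i] / A[i][i] for every row j; the pivot row itself gets 0."""
    pivot = A[i][i]
    return [0.0 if j == i else A[j][i] / pivot for j in range(len(A))]


def eliminate_iteration(A, i, rows=None):
    """Apply A[j][] = A[j][] - m[j] * (pivot row) in place for the chosen rows."""
    m = multiplier_column(A, i)      # determined before any row is modified
    pivot_row = list(A[i])           # snapshot of the pivot row
    if rows is None:                 # assumption: all rows except the pivot row
        rows = [j for j in range(len(A)) if j != i]
    for j in rows:                   # each j is independent: one thread per row/element
        A[j] = [a - m[j] * p for a, p in zip(A[j], pivot_row)]
    return m
```

For the augmented system [[2, 1, 5], [4, 4, 14]], iteration 0 uses m = [0, 2] and turns the second row into [0, 2, 4].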
6
Modified Gaussian Elimination is considered for the CUDA linear equations solver:
  More parallelism
  No back substitution
  Partial pivoting guarantees accuracy of the solution
7
[Figure: Traditional Gaussian Elimination (Initial state, Iteration No.1, Iteration No.2) compared with Modified Gaussian Elimination (Initial state, Iteration No.1, Iteration No.2, Iteration No.3)]
8
For the ith iteration: Row j = Row j - mj * Row i
[Figure: matrix with Row i and Column i highlighted; one region is labeled "Traditional Gaussian Elimination" and another "Added elimination in modified Gaussian Elimination"]
9
Traditional Gaussian Linear Solver: Gaussian Elimination (Iteration No.1, Iteration No.2, ..., Iteration No.N-1) followed by Back Substitution
Modified Gaussian Linear Solver: Modified Gaussian Elimination (Iteration No.1, Iteration No.2, ..., Iteration No.N)
[Figure: matrix states after each iteration for both solvers]
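For contrast with the modified solver, the traditional path (N-1 elimination iterations, then back substitution) can be sketched serially. This is a minimal reference, not the authors' code: the function name is hypothetical, and pivoting is left out on the assumption that every pivot is nonzero.

```python
def solve_traditional(A, b):
    """Solve A x = b by forward elimination and back substitution (no pivoting)."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]

    # Gaussian Elimination: iterations 1 .. N-1 clear entries below the pivot.
    for i in range(n - 1):
        for j in range(i + 1, n):
            m = M[j][i] / M[i][i]
            for k in range(i, n + 1):
                M[j][k] -= m * M[i][k]

    # Back substitution: inherently sequential, from the last row upward.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x
```

The back substitution loop is the part the modified solver avoids: each x[i] depends on all later unknowns, so it offers little parallelism.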
10
for (i = 0; i < N; i++) {
    Partial pivoting {
        Transfer the ith column back to host;
        Search the maximum of this column and return the index;   (Host)
        Switch rows if necessary;                                  (Device)
    }
    Determine the multiplier column;                               (Device)
    Modified Gaussian elimination;                                 (Device)
}
Normalize the solution;                                            (Device)
Transfer solution back to host;

Threads Architecture
Matrix handling, such as the modified Gaussian Elimination kernel: for the ith iteration, each thread handles one operation A[j][k] = A[j][k] - m[j] * A[i][k]. A two-dimensional grid and block are used, for a total of N*N threads in the kernel.
Row or column handling, such as partial pivoting and others: each thread handles one element of the row or column. A one-dimensional grid and block are used, for a total of N threads in the kernel.
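The loop above can be mirrored by a serial reference that is handy for checking a GPU result. This is a sketch under stated assumptions, not the authors' kernels: the function name is hypothetical, the pivot search looks at rows i..N-1 and compares absolute values, and everything runs on the host.

```python
def solve_modified(A, b):
    """Modified Gaussian elimination with partial pivoting; no back substitution."""
    n = len(A)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]

    for i in range(n):
        # Partial pivoting: find the largest-magnitude entry in column i
        # (assumption: only rows i..n-1 are searched).
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        if M[p][i] == 0.0:
            raise ValueError("matrix is singular")
        if p != i:
            M[i], M[p] = M[p], M[i]          # switch rows if necessary

        # Determine the multiplier column before touching any row.
        m = [0.0 if j == i else M[j][i] / M[i][i] for j in range(n)]

        # Modified elimination: clear column i in every row except the pivot row.
        pivot_row = M[i]
        for j in range(n):
            if j != i:
                M[j] = [a - m[j] * pv for a, pv in zip(M[j], pivot_row)]

    # Normalize: the matrix is now diagonal, so x[i] = rhs[i] / diagonal[i].
    return [M[i][n] / M[i][i] for i in range(n)]
```

For example, `solve_modified([[0, 1], [1, 1]], [2, 3])` needs a row switch in the first iteration because the leading entry is zero.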
11
Minimizing device-host transfers: rows are switched by a kernel.
[Figure: partial pivoting data flow for a column with entries a, b, c. Labels: Kernel1, d_temp, cudaMemcpy (device to host), h_temp, "Host: search maximum is c", Kernel2, Kernel3, Kernel4]
12
Iteration i data partitioning
For the ith iteration, each thread handles: A[j][k] = A[j][k] - m[j] * A[i][k]
[Figure: the N x N matrix partitioned into blocks B(0,0), B(0,1), B(1,1), ..., B(i,j), ..., B(0,N-1), B(N-1,1); each block contains threads T(0,0), T(0,1), ..., T(0,M-1); labels N and BLOCK_SIZE; Row i and Column i are highlighted]
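The partitioning can be illustrated by emulating how a two-dimensional block and thread index would map to a matrix element. The mapping below (row from the y indices, column from the x indices) is the conventional CUDA layout and is an assumption here; the slides do not spell it out, and both function names are hypothetical.

```python
def element_for_thread(block_idx, thread_idx, block_size):
    """Map (block, thread) indices to the (row, col) element that thread handles."""
    bx, by = block_idx
    tx, ty = thread_idx
    row = by * block_size + ty
    col = bx * block_size + tx
    return row, col


def covered_elements(n, block_size):
    """List every in-range element touched by a grid that covers an n x n matrix."""
    blocks = (n + block_size - 1) // block_size   # ceiling division
    seen = []
    for by in range(blocks):
        for bx in range(blocks):
            for ty in range(block_size):
                for tx in range(block_size):
                    row, col = element_for_thread((bx, by), (tx, ty), block_size)
                    if row < n and col < n:       # guard for partial edge blocks
                        seen.append((row, col))
    return seen
```

With this mapping each of the N*N elements is handled by exactly one thread, which is what lets all updates of an iteration run at once.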
13
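Shared Memory
For the ith iteration: A[j][k] = A[j][k] - m[j] * A[i][k]
[Figure: the N x N matrix partitioned into blocks B(0,0), B(0,1), B(1,1), ..., B(i,j), ..., B(0,N-1), B(N-1,1); Multiplier Column m and Row i are marked as the data held in shared memory]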
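The idea behind the shared memory version can be emulated serially: each block needs only its own slice of the multiplier column and of row i, and every thread in the block reuses those slices. This is a sketch of that access pattern only; the function name and the tile layout are assumptions, and plain Python lists stand in for shared memory.

```python
def tiled_update(A, i, block_size):
    """One elimination iteration computed block by block with per-block slices."""
    n, ncols = len(A), len(A[0])
    m = [0.0 if j == i else A[j][i] / A[i][i] for j in range(n)]
    pivot_row = list(A[i])
    out = [list(row) for row in A]

    for r0 in range(0, n, block_size):            # block row
        for c0 in range(0, ncols, block_size):    # block column
            # Stand-ins for shared memory: loaded once per block, reused by its threads.
            m_tile = m[r0:r0 + block_size]
            row_tile = pivot_row[c0:c0 + block_size]
            for ty, mj in enumerate(m_tile):
                for tx, pk in enumerate(row_tile):
                    out[r0 + ty][c0 + tx] = A[r0 + ty][c0 + tx] - mj * pk
    return out
```

A block of BLOCK_SIZE x BLOCK_SIZE threads reads 2 * BLOCK_SIZE shared values instead of each thread fetching its own m[j] and A[i][k] from global memory.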
14
Platform Configuration:
GPU: GeForce 8400 GS, 1 SM, 8 cores, clock rate 1.40 GHz
CPU: Intel Core2 Quad Q6600, clock rate 2.39 GHz

Time (ms) by matrix size:

Implementation                                 512     1024     2048     4096
Serial Traditional Gaussian Linear Solver       47      403     5214    46098
Serial Modified Gaussian Linear Solver          71      564     8412    69949
Global Memory (1SM)                           1718    13488   108916   862580
Shared Memory (1SM)                            662     4806    38923   312787
Global Memory (scaled by 16)                   107      843     6807    53911
Shared Memory (scaled by 16)                    41      300     2433    19549

Comments:
The GPU implementation on 1 SM (global or shared memory) is much slower than the CPU implementation.
To mimic a Tesla (16 SM), the GPU times are divided by 16.
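The scaled rows follow from the 1 SM measurements by dividing by 16 and rounding, which can be checked with a few lines. The constant and function names are hypothetical; the numbers are the ones in the table, and the scaling assumes ideal speedup across 16 SMs, as the slide does.

```python
SIZES = (512, 1024, 2048, 4096)
SERIAL_TRADITIONAL_MS = (47, 403, 5214, 46098)
GLOBAL_1SM_MS = (1718, 13488, 108916, 862580)
SHARED_1SM_MS = (662, 4806, 38923, 312787)


def scale_by_sm(times_ms, sm_count=16):
    """Estimate multi-SM times by assuming perfect scaling over sm_count SMs."""
    return [round(t / sm_count) for t in times_ms]


def speedups(reference_ms, candidate_ms):
    """Ratio reference / candidate for each matrix size."""
    return [ref / cand for ref, cand in zip(reference_ms, candidate_ms)]


if __name__ == "__main__":
    shared_16 = scale_by_sm(SHARED_1SM_MS)
    for n, s in zip(SIZES, speedups(SERIAL_TRADITIONAL_MS, shared_16)):
        print(f"n={n}: projected speedup over serial traditional = {s:.2f}x")
```

Under this projection the shared memory version passes a 2x speedup over the serial traditional solver from size 2048 upward.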
15
Comments:
On the CPU, the traditional GE solver is faster than the modified GE solver.
The GPU shared memory implementation is always 2-3 times faster than the global memory implementation.
The GPU (16 SM) shared memory implementation achieves around a 2 times speedup over the traditional GE solver.
[Figure: plots of Time (ms) versus Matrix size]
16
Profiler results:

Implementation   Method      #Calls   GPU (usec)   %GPU time
Global           GE_kernel   1024     1.3e+07      99.11
Shared           GE_kernel   1024     4.8e+06      97.6

Implementation   gld uncoalesced   gld coalesced   %uncoalesced rate
Global           1048576           131072          89
Shared           61440             73728           45

For the 1024 case (1 SM), the global memory implementation takes 13488 ms and the shared memory implementation takes 4806 ms.
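The %uncoalesced rate column appears to be the share of uncoalesced loads among all global loads. That definition is an inference (the slide does not state it), but it reproduces both reported values; the function name is hypothetical.

```python
def uncoalesced_rate(uncoalesced, coalesced):
    """Percentage of global loads that were uncoalesced (assumed definition)."""
    return 100.0 * uncoalesced / (uncoalesced + coalesced)


if __name__ == "__main__":
    print(f"Global: {uncoalesced_rate(1048576, 131072):.1f}%")
    print(f"Shared: {uncoalesced_rate(61440, 73728):.1f}%")
```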
17
Conclusion:
A linear equations solver based on modified Gaussian Elimination is implemented on CUDA.
The shared memory implementation is about 3 times faster than the global memory implementation.
On a 16 SM GPU, the shared memory implementation is expected to be about 3 times faster than the serial traditional Gaussian Elimination solver.
Partial pivoting guarantees stability and accuracy (error less than 0.001 compared to the serial code).
Problem found: the increase in uncoalesced global memory accesses offsets the advantage gained from the additional parallelism in modified Gaussian Elimination.