1
Sparse Matrix Dense Vector Multiplication by Pedro A. Escallon Parallel Processing Class Florida Institute of Technology April 2002
2
The Problem
Improve the speed of sparse matrix - dense vector multiplication using MPI on a Beowulf parallel computer.
3
What To Improve
Current algorithms use excessive indirect addressing.
Current optimizations depend on the structure of the matrix (the distribution of its nonzero elements).
4
Sparse Matrix Representations
Coordinate format (COO)
Compressed Sparse Row (CSR)
Compressed Sparse Column (CSC)
Modified Sparse Row (MSR)
5
Compressed Sparse Row (CSR)

        | 0    A01  A02  0   |
    A = | 0    A11  0    A13 |
        | A20  0    0    0   |

    rS  = [0, 2, 4, 5]
    ndx = [1, 2, 1, 3, 0]
    val = [A01, A02, A11, A13, A20]
6
CSR Code

    void sparseMul(int m, double *val, int *ndx, int *rS,
                   double *x, double *y)
    {
        int i, j;
        for (i = 0; i < m; i++) {
            /* y is assumed zero-initialized by the caller */
            for (j = rS[i]; j < rS[i+1]; j++) {
                y[i] += (*val++) * x[*ndx++];
            }
        }
    }
7
Goals
Eliminate indirect addressing
Remove the dependency on the distribution of the nonzero elements
Further compress the matrix storage
Most of all, speed up the operation
8
Proposed Solution

        | 0    A01  A02  0   |
    A = | 0    A11  0    A13 |
        | A20  0    0    0   |

    {0,0} {1,A01} {2,A02} {-1,0} {1,A11} {3,A13} {-2,A20}

A rCol of -r marks the start of row r (0 marks row 0); its paired value is that row's column-0 entry, stored even when it is zero. Every other pair holds a nonzero value and its positive column index.
9
Data Structure

    typedef struct {
        int    rCol;
        double val;
    } dSparS_t;

Each matrix entry is stored as a {rCol, val} pair.
10
Process

[Diagram: the hdr.size pairs of A are divided among processes 0..p-1, each receiving local_size elements.]

    local_size = hdr.size / p
    residual   = hdr.size % p    (residual < p)
11
Scatter

[Diagram: A is scattered in local_size blocks from the root to the local_A buffers on processes 0..p-1.]
12
Multiplication Code

    if ((index = local_A[0].rCol) > 0)
        local_Y[0].val = local_A[0].val * X[index];
    else
        local_Y[0].val = local_A[0].val * X[0];
    local_Y[0].rCol = -1;

    k = 1; h = 0;
    while (k < local_size) {
        while ((k < local_size) && (0 < (index = local_A[k].rCol)))
            local_Y[h].val += local_A[k++].val * X[index];
        if (k < local_size) {
            local_Y[h++].rCol = -index - 1;
            local_Y[h].val = local_A[k++].val * X[0];
        }
    }
    local_Y[h].rCol = local_Y[h-1].rCol + 1;
    h++;
    while (h < stride)
        local_Y[h++].rCol = -1;
13
Multiplication

[Diagram: local_Y = local_A * X, with local_A of local_size elements as the domain and local_Y of stride elements as the range.]
14
Algorithm

[Table: step-by-step trace of the pair stream local_A against X, showing how Y.val accumulates the products v * X[c] for each pair {c,v} within a row, and how Y.rCol records a row marker when a pair with negative rCol starts a new row.]
15
Gather

[Diagram: the local_Y arrays (stride elements each) from processes 0..p-1 are gathered into gatherBuffer; a row that straddles a process boundary appears as a split element, and the final block carries the residual.]
16
Consolidation of Split Rows

[Diagram: the partial sums of split rows in gatherBuffer are added (+=) into the final result vector Y.]
17
Results (vavasis3)

vavasis3.rua - Total non-zero values: 1,683,902 - p = 10

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.103930         2.380285       0.096051      0.012123
    P1   0.107588         0.457140       0.012000      0.011504
    P2   0.107667         0.706087       0.012022      0.011642
    P3   0.103155         0.951814       0.011971      0.011560
    P4   0.107644         1.206376       0.012210      0.011536
    P5   0.109243         1.452563       0.012032      0.011506
    P6   0.108477         1.702571       0.012044      0.011506
    P7   0.109446         1.948481       0.012004      0.011658
    P8   0.055822         2.208924       0.012079      0.011540
    P9   0.059023         2.459900       0.012009      0.011438
18
Results (vavasis3)

vavasis3.rua - Total non-zero values: 1,683,902 - p = 8

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.089478         2.264316       0.121741      0.014860
    P1   0.093083         0.569091       1.711789      0.014105
    P2   0.093217         0.866460       1.429352      0.014227
    P3   0.091012         1.160591       1.146954      0.014457
    P4   0.081719         1.462335       0.865520      0.014365
    P5   0.085375         1.756941       0.582353      0.014341
    P6   0.085418         2.055651       0.299847      0.014362
    P7   0.089087         2.350998       0.017813      0.014728

vavasis3.rua - Total non-zero values: 1,683,902 - p = 1

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.000002         1.412774       0.033015      0.112132
19
Results (vavasis3)

vavasis3.rua - Total non-zero values: 1,683,902 - p = 4

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.051980         3.026846       0.217574      0.028587
    P1   0.055605         1.725272       1.027928      0.028258
    P2   0.055703         2.319343       0.451021      0.028141
    P3   0.056422         3.212518       0.018073      0.027988

vavasis3.rua - Total non-zero values: 1,683,902 - p = 2

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.233200         5.810814       0.426097      0.056334
    P1   0.236864         6.521328       0.032125      0.055866
20
Results (vavasis3)

vavasis3.rua - Calculated Results

    P    Computation   Speedup    E_p        Gather     C_p
    1    0.112132      -          -          0.033015   1.294430
    2    0.056334      1.990485   0.995243   0.426097   8.563763
    4    0.028587      3.922482   0.980621   1.027928   36.957883
    8    0.014860      7.545895   0.943237   1.711789   116.194415
    10   0.012123      9.249526   0.924953   0.096051   8.923039
21
Results (bayer02)

bayer02.rua - Total non-zero values: 63,679 - p = 10

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.046136         0.093143       0.011733      0.000926
    P1   0.048824         0.018207       0.001567      0.000423
    P2   0.048627         0.027146       0.002054      0.000456
    P3   0.044416         0.034386       0.002440      0.000445
    P4   0.048214         0.046365       0.002457      0.000397
    P5   0.048481         0.053511       0.001978      0.000425
    P6   0.045666         0.063204       0.002015      0.000467
    P7   0.048173         0.070167       0.002440      0.000419
    P8   0.033947         0.088532       0.002323      0.000395
    P9   0.032110         0.097866       0.001959      0.000479
22
Results (bayer02)

bayer02.rua - Total non-zero values: 63,679 - p = 8

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.040159         0.103422       0.011810      0.001020
    P1   0.042743         0.023353       0.001728      0.000549
    P2   0.042709         0.035670       0.001777      0.000607
    P3   0.039322         0.047141       0.001738      0.000599
    P4   0.041584         0.064024       0.001724      0.000702
    P5   0.039229         0.075528       0.001725      0.000568
    P6   0.037206         0.089757       0.001733      0.000565
    P7   0.039912         0.101267       0.002111      0.000541

bayer02.rua - Total non-zero values: 63,679 - p = 1

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.000003         0.063824       0.010975      0.006090
23
Results (bayer02)

bayer02.rua - Total non-zero values: 63,679 - p = 4

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.049680         0.096930       0.018308      0.001888
    P1   0.052379         0.048924       0.003765      0.001555
    P2   0.051944         0.076405       0.003609      0.001561
    P3   0.046413         0.101871       0.003636      0.001528

bayer02.rua - Total non-zero values: 63,679 - p = 2

         Broadcast Time   Scatter Time   Gather Time   Computation Time
    P0   0.025494         0.520611       0.008192      0.003445
    P1   0.028157         0.504081       0.032848      0.003121
24
Results (bayer02)

bayer02.rua - Calculated Results

    P    Computation   Speedup    E_p        Gather     C_p
    1    0.006090      -          -          0.010975   2.802135
    2    0.003445      1.767779   0.883890   0.032848   10.534978
    4    0.001888      3.225636   0.806409   0.018308   10.697034
    8    0.001020      5.970588   0.746324   0.011810   12.578431
    10   0.000926      6.576674   0.657667   0.011733   13.670626
25
Conclusions
The proposed representation speeds up the matrix calculation.
The handling of split rows before the gather should be improved.
There seems to be a communication penalty for moving structured data.
26
Bibliography
Eun-Jin Im, "Optimizing the Performance of Sparse Matrix-Vector Multiplication" (dissertation).
Yousef Saad, "Iterative Methods for Sparse Linear Systems."
Iain S. Duff, "Users' Guide for the Harwell-Boeing Sparse Matrix Collection."