Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Tareq Malas Advisors: Prof. David Keyes, Dr. Aron Ahmadia.

1 Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Tareq Malas Advisors: Prof. David Keyes, Dr. Aron Ahmadia Collaborators: Jed Brown, Dr. John Gunnels King Abdullah University of Science and Technology November 2011

2 Motivation
PowerPC 450: representative of exascale architectures
- Increased parallelism: vectorization and multi-issue pipeline
- Silicon and power savings: in-order execution
Streaming numerical kernels:
- At the heart of many scientific applications
- Bottleneck in scientific codes
[Figures: 7-point stencil operator; 27-point stencil operator]

3 Why is tuning computation on the BG/P PowerPC 450 difficult?
Utilizes features to improve efficiency
- SIMDized fused floating point units
Example loop: for (i = 0; i < N; i++) A[i] = B[i] + B[i+1]
[Figure: arrays B and A, elements 0..N, labeled "Not Aligned"]
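To make the alignment problem concrete, here is a minimal Python sketch. It assumes a two-wide SIMD unit whose loads must start at an even element index; the helper names `load_pair` and `shifted_add` are hypothetical and are not taken from the slides. The operand pair (B[i+1], B[i+2]) straddles two aligned pairs, so it has to be assembled from two aligned loads.

```python
def load_pair(mem, idx):
    """Emulate an aligned two-element SIMD load (assumption: idx must be even)."""
    if idx % 2 != 0:
        raise ValueError("unaligned pair load at index %d" % idx)
    return (mem[idx], mem[idx + 1])


def shifted_add(B, n):
    """Compute A[i] = B[i] + B[i+1] for i in range(n), two elements at a time.

    Assumes n is even and len(B) >= n + 2 so the last aligned load is valid.
    """
    A = [0] * n
    cur = load_pair(B, 0)
    for i in range(0, n, 2):
        nxt = load_pair(B, i + 2)
        # (B[i+1], B[i+2]) is not an aligned pair: build it from two loads.
        shifted = (cur[1], nxt[0])
        A[i] = cur[0] + shifted[0]
        A[i + 1] = cur[1] + shifted[1]
        cur = nxt  # reuse the second load in the next iteration
    return A
```

Each iteration issues one new aligned load and reuses the previous one, which is the kind of register shuffling the scalar loop hides.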

4 Why is tuning computation on the BG/P PowerPC 450 difficult?
Utilizes features to improve efficiency
- SIMDized fused floating point units
- Superscalar processor with in-order execution at the core level

Original instruction order: 1 load A, 2 add B, 3 load C, 4 load D, 5 add D, 6 add E, 7 add F

Cycle  Load unit  FP unit
1      load A     add B
2      load C     -
3      load D     -
4      -          add D
5      -          add E
6      -          add F

Reordered instruction order: 1 load A, 2 add B, 3 load C, 6 add E, 4 load D, 7 add F, 5 add D

Cycle  Load unit  FP unit
1      load A     add B
2      load C     add E
3      load D     add F
4      -          add D
5
6
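The two cycle tables can be reproduced with a small in-order, dual-issue model. This is a sketch under stated assumptions, not the simulator from the talk: one load and one floating-point instruction may issue per cycle, instructions issue strictly in program order, and a loaded value is usable one cycle after its load issues. The function name `simulate_in_order` is hypothetical.

```python
def simulate_in_order(program, load_latency=1):
    """Return one dict per cycle mapping unit ("load" or "fp") to the issued instruction.

    program is a list of (opcode, operand) pairs such as ("load", "A") or ("add", "B").
    Operands that are never loaded are assumed to be in registers already.
    """
    ready_at = {}   # operand -> first cycle in which the loaded value is usable
    schedule = []
    pc = 0          # index of the next instruction in program order
    cycle = 0
    while pc < len(program):
        cycle += 1
        issued = {}
        while pc < len(program):
            opcode, operand = program[pc]
            unit = "load" if opcode == "load" else "fp"
            if unit in issued:
                break  # that unit is already busy this cycle
            if unit == "fp" and ready_at.get(operand, 0) > cycle:
                break  # operand still in flight; in-order issue blocks here
            issued[unit] = opcode + " " + operand
            if opcode == "load":
                ready_at[operand] = cycle + load_latency
            pc += 1
        schedule.append(issued)
    return schedule


original = [("load", "A"), ("add", "B"), ("load", "C"), ("load", "D"),
            ("add", "D"), ("add", "E"), ("add", "F")]
reordered = [("load", "A"), ("add", "B"), ("load", "C"), ("add", "E"),
             ("load", "D"), ("add", "F"), ("add", "D")]
```

Under these assumptions the original order takes 6 cycles and the reordered one takes 4, matching the tables on the slide.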

5 Engineering tactics
Divide and conquer: 3-point stencil
- Optimize, then replicate into larger stencils
Design focus: computer architecture
- Fully utilize SIMD capabilities
- Reduce pipeline stalls: unroll-and-jam and instruction interleaving (reordering)
Technique: assembly synthesis in Python
- Accelerates prototyping
- Simplifies source

6 3-point stencil SIMDization
Utilizing the SIMD-like unit features:
r3 = a2*W0 + a3*W1 + a4*W2
[Figure: input row A(i,j) and result row R(i,j), elements 0..6 along k, with weights W; operands W, A, and R shown in Primary | Secondary register halves for the Regular SIMD, Cross, and Copy-primary instruction forms, and more …]
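As a scalar reference for the formula r3 = a2*W0 + a3*W1 + a4*W2, the following Python sketch applies the same three weights at every interior point. The function name `stencil3` and the choice to leave boundary entries at zero are assumptions made for illustration; the slides do not specify boundary handling.

```python
def stencil3(a, w):
    """Apply a 3-point stencil r[k] = a[k-1]*w0 + a[k]*w1 + a[k+1]*w2."""
    w0, w1, w2 = w
    r = [0] * len(a)  # boundary entries stay zero (an assumption)
    for k in range(1, len(a) - 1):
        r[k] = a[k - 1] * w0 + a[k] * w1 + a[k + 1] * w2
    return r
```

With a = [0, 1, 2, 3, 4, 5, 6] and w = (1, 2, 3), the entry r[3] is 2*1 + 3*2 + 4*3 = 20. A SIMDized version computes two neighboring results per instruction, which is where the primary and secondary register halves come in.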

7 Mutate-mutate vs. load-copy

Kernel         Operations     Cycles        Registers       Utilization %
               ld-st  FPU     ld/st  FPU    Input  Output   ld/st  FPU
Mutate-mutate  2-1    3       6      3      1      1        100    50
Load-copy      1-1    4       4      4      2      1           100

Mutate-mutate
- Fully utilizes the FPU
- Requires fewer registers
Load-copy
- Requires fewer load cycles

8 Unroll-and-jam reduces data hazards

Original loop (2 sources, 1 destination):
  for (i = 0; i < 4; i++)
    for (j = 0; j < 5; j++)
      A[i] += q*B[i][j] + p*B[i][j+1]
Instruction stream:
  A[0] += q*B[0]
  stall
  A[0] += p*B[1]
  stall
  A[0] += q*B[1]
  stall
  A[0] += p*B[2]
  ...

Unroll-and-jam (2 sources, 2 destinations):
  for (i = 0; i < 4; i += 2)
    for (j = 0; j < 5; j++) {
      A[i]   += q*B[i][j]   + p*B[i][j+1]
      A[i+1] += q*B[i+1][j] + p*B[i+1][j+1]
    }
Instruction stream:
  A[0] += q*B[0]
  A[1] += q*B[6]
  A[0] += p*B[1]
  A[1] += p*B[7]
  ...

[Figure: vector A (elements 0..3) accumulating from the rows of B (4 x 6, elements 0..23, column index j)]
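The transformation can be checked for equivalence with a short Python sketch. The function names are hypothetical, and the loop bounds (4 rows, 5 inner iterations, 6 columns in B) follow the slide. Each A[i] receives its updates in the same order in both versions, so the results are identical; only the interleaving of independent accumulations changes.

```python
def original_loop(A, B, q, p):
    """Row-by-row accumulation: consecutive updates hit the same A[i]."""
    for i in range(4):
        for j in range(5):
            A[i] += q * B[i][j] + p * B[i][j + 1]
    return A


def unroll_and_jam(A, B, q, p):
    """Outer loop unrolled by 2 and jammed into the inner loop.

    Updates to A[i] and A[i+1] are independent, so they can be
    interleaved to hide the latency of each accumulation.
    """
    for i in range(0, 4, 2):
        for j in range(5):
            A[i] += q * B[i][j] + p * B[i][j + 1]
            A[i + 1] += q * B[i + 1][j] + p * B[i + 1][j + 1]
    return A


# B holds 0..23 in 4 rows of 6, as in the slide's figure.
B = [[6 * row + col for col in range(6)] for row in range(4)]
```

In Python the interleaving has no performance effect; the point of the sketch is only that the reordering preserves the result.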

9 Unroll-and-jam data reuse
[Figure: grid over i and j (1..4) showing results R(i,j), R(i,j+1), R(i+1,j), and R(i+1,j+1), with the weights w1..w9 applied in Jam 1 through Jam 4]

10 Pythonic code synthesis overview
[Diagram components: Python code; instructions (list of objects); register allocation; instruction scheduler and simulator; PowerPC 450 simulator (GPR, FPR, memory); simulation log and debugging information; C code generator; documented C code template]

11 Pythonic code synthesis: instruction scheduling
Goal:
- Run load/store and FMA instructions each cycle
- Reduce read-after-write (RAW) data dependency hazards
Technique (greedy), per cycle:
- Create a list of instructions with no RAW hazards
- Execute the instruction(s) that will require the minimal stall
- Repeat until all instructions are executed
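The greedy per-cycle procedure can be sketched in a few lines of Python. This is an illustrative simplification, not the scheduler from the talk: the `Instr` class, the unit names "ls" and "fpu", and the latency table passed in by the caller are all assumptions, and each instruction lists its RAW dependencies explicitly by name rather than deriving them from registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str          # "ls" (load/store) or "fpu"
    deps: tuple = ()   # names of instructions whose results this one reads


def greedy_schedule(instrs, latency):
    """Issue at most one instruction per unit per cycle, preferring minimal stall.

    Assumes the dependency graph is acyclic and every dep names a listed instruction.
    """
    names = {i.name for i in instrs}
    if any(d not in names for i in instrs for d in i.deps):
        raise ValueError("dependency on an unknown instruction")
    usable_at = {}     # name -> first cycle in which its result can be read
    pending = list(instrs)
    schedule = []
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for unit in ("ls", "fpu"):
            # Candidates: every producer has already been issued.
            candidates = [i for i in pending
                          if i.unit == unit and all(d in usable_at for d in i.deps)]
            if not candidates:
                continue

            def stall(i):
                return max((usable_at[d] for d in i.deps), default=0) - cycle

            best = min(candidates, key=stall)
            if stall(best) <= 0:   # operands are ready now: no RAW hazard
                issued.append(best)
        for i in issued:
            usable_at[i.name] = cycle + latency[i.unit]
            pending.remove(i)
        schedule.append([i.name for i in issued])
    return schedule
```

For example, with two loads L0 and L1, two dependent FPU instructions F0 (reads L0) and F1 (reads L1), and assumed latencies {"ls": 2, "fpu": 1}, the schedule is L0, L1, F0, F1 over four cycles, because each unit issues one instruction per cycle and F0 must wait until cycle 3 for L0.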

12 Unroll-and-jam effects: 27-point stencil

13 Kernel and L2 effects: 7-point stencil

14 Unroll-and-jam effects: 3-point stencil

15 Instruction scheduling optimization formulation

16 Conclusion
SIMDizing the computations of streaming numerical kernels is challenging
Assembly programming is important for "peak" hardware utilization
We introduced a code synthesis and simulation framework that facilitates:
- A faster development-testing loop
- Instruction reordering for improved efficiency
- Cycle-accurate performance modeling

