
1 Parallel Computing Laboratory (Berkeley Par Lab), EECS Electrical Engineering and Computer Sciences
Tuning Stencils
Kaushik Datta
Microsoft Site Visit, April 29, 2008

2 Stencil Code Overview
- For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including the point itself)
- A stencil code updates every point in a regular grid with a constant-weighted subset of its neighbors ("applying a stencil")
[Figures: 2D stencil, 3D stencil]

3 Stencil Applications
- Stencils are critical to many scientific applications: diffusion, electromagnetics, computational fluid dynamics
- Both uniform and adaptive block-structured meshes
- Many types of stencils:
  - 1D, 2D, 3D meshes
  - Number of neighbors (5-point, 7-point, 9-point, 27-point, ...)
  - Gauss-Seidel (update in place) vs. Jacobi iterations (two meshes)
  - Varying boundary conditions (constant vs. periodic)

4 Naïve Stencil Code

void stencil3d(double A[], double B[], int nx, int ny, int nz) {
    /* S0, S1 are the stencil weights; loops skip the boundary layer */
    for (int k = 1; k < nz - 1; k++) {            /* z-dim */
        for (int j = 1; j < ny - 1; j++) {        /* y-dim */
            for (int i = 1; i < nx - 1; i++) {    /* x-dim */
                int c = i + nx * (j + ny * k);    /* center index */
                B[c] = S0 * A[c]
                     + S1 * (A[c + nx * ny] + A[c - nx * ny]   /* top, bottom */
                           + A[c - 1]       + A[c + 1]         /* left, right */
                           + A[c - nx]      + A[c + nx]);      /* front, back */
            }
        }
    }
}

5 Our Stencil Code
- Executes a 3D, 7-point Jacobi iteration on a 256^3 grid
- Performs 8 flops (6 adds, 2 multiplies) per point
- Parallelization performed with pthreads
- Thread affinity: multithreading first, then multicore, then multisocket
- Flop:byte ratio:
  - 0.33 on write-allocate architectures
  - 0.5 (ideal)

6 Cache-Based Architectures
[Diagrams: Intel Clovertown, Sun Victoria Falls, AMD Barcelona]

7 Autotuning
- Provides a portable and effective method for tuning
- Limiting the search space:
  - Searching the entire space is intractable
  - Instead, we ordered the optimizations appropriately for a given platform
  - To find the best parameters for a given optimization, we performed an exhaustive search
  - Each optimization was applied on top of all previous optimizations
- In general, heuristics/models can also be used to prune the search space

8 Naïve Code
- The naïve code is a simple, threaded stencil kernel
- Domain partitioning is performed only in the least contiguous dimension
- No optimizations or tuning performed
[Diagram: grid axes x, y, z (unit-stride)]

9 Naïve Performance
[Charts: Intel Clovertown, AMD Barcelona, Sun Victoria Falls; series: Naive]

10 NUMA-Aware
- Exploited the "first-touch" page mapping policy on NUMA architectures
- Due to our affinity policy, the benefit is only seen when using both sockets
[Diagrams: Intel Clovertown, Sun Victoria Falls, AMD Barcelona]

11 NUMA-Aware Performance
[Charts: Intel Clovertown, AMD Barcelona, Sun Victoria Falls; series: Naive, +NUMA-Aware]

12 Loop Unrolling/Reordering
- Allows for better use of registers and functional units
- The best inner loop was chosen by iterating many times over a grid size that fits into L1 cache (x86 machines) or L2 cache (Victoria Falls); this should eliminate any effects from the memory subsystem
- This optimization is independent of the later memory optimizations

13 Loop Unrolling/Reordering Performance
[Charts: Intel Clovertown, AMD Barcelona, Sun Victoria Falls; series: Naive, +NUMA-Aware, +Loop Unrolling/Reordering]

14 Padding
- Used to reduce conflict misses and DRAM bank conflicts
- Drawback: larger memory footprint
- Performed a search to determine the best padding amount
- Only padded in the unit-stride dimension
[Diagram: grid axes x, y, z (unit-stride), with the padding amount shown]

15 Padding Performance
[Charts: Intel Clovertown, AMD Barcelona, Sun Victoria Falls; series: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding]

16 Thread/Cache Blocking
[Diagram: grid axes x, y, z (unit-stride); thread blocks in x: 4, in y: 2, in z: 2; cache blocks in y: 2]
- Performed an exhaustive search over all possible power-of-two parameter values
- Every thread block is the same size and shape, which preserves load balancing
- Did NOT cut in the contiguous dimension on the x86 machines, which avoids interrupting the hardware prefetchers
- Only performed cache blocking in one dimension; this is sufficient to fit three read planes and one write plane into cache

17 Thread/Cache Blocking Performance
[Charts: Intel Clovertown, AMD Barcelona, Sun Victoria Falls; series: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking]

18 Software Prefetching
- Allows us to hide memory latency
- Searched over varying prefetch distances and granularities (e.g., prefetch every register block, plane, or pencil)

19 Software Prefetching Performance
[Charts: Intel Clovertown, AMD Barcelona, Sun Victoria Falls; series: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching]

20 SIMDization
- Requires a complete code rewrite to utilize the 128-bit SSE registers
- Allows a single instruction to add/multiply two doubles
- Only possible on the x86 machines
- Padding performed to achieve proper data alignment (not to avoid conflicts)
- Searched over register block sizes and prefetch distances simultaneously

21 SIMDization Performance
[Charts: Intel Clovertown, AMD Barcelona, Sun Victoria Falls; series: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching, +SIMDization]

22 Cache Bypass
- Writes data directly to the write-back buffer; no data load on a write miss
- Changes the stencil kernel's flop:byte ratio from 1/3 to 1/2, reducing memory traffic by 33%
- Still requires the SIMDized code from the previous optimization
- Searched over register block sizes and prefetch distances simultaneously

23 Cache Bypass Performance
[Charts: Intel Clovertown, AMD Barcelona, Sun Victoria Falls; series: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching, +SIMDization, +Cache Bypass]

24 Collaborative Threading
[Diagrams: "No Collaboration" grid (axes x, y, z (unit-stride); thread blocks in x: 4, in y: 2, in z: 2; cache blocks in y: 2) vs. "With Collaboration" grid (axes y, z (unit-stride); large collaborative thread blocks in y: 4, in z: 2; small collaborative thread blocks in y: 2, in z: 4; threads t0-t7 share each block)]
- Requires another complete code rewrite
- Collaborative threading allows for better L1 cache utilization when switching threads
- Only effective on Victoria Falls due to:
  - the very small L1 cache (8 KB) shared by 8 hardware threads
  - the lack of hardware prefetchers (which allows us to cut in the contiguous dimension)
- Drawback: the parameter space becomes very large

25 Collaborative Threading Performance
[Charts: Intel Clovertown, AMD Barcelona, Sun Victoria Falls; series: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching, +SIMDization, +Cache Bypass, +Collaborative Threading]

26 Autotuning Results
[Charts: Intel Clovertown (1.9x better), AMD Barcelona (5.4x better), Sun Victoria Falls (10.4x better); series: Naive through +Collaborative Threading]

27 Architecture Comparison
[Charts: single-precision and double-precision performance and power efficiency]

28 Conclusions
- Compilers alone fail to fully utilize system resources; programmers may not even know the system is being underutilized
- Autotuning provides a portable and effective solution, producing up to a 10.4x improvement over the compiler alone
- To make autotuning tractable:
  - Choose the order of optimizations appropriately for the platform
  - Prune the search space intelligently for large searches
- Power efficiency has become a valuable metric
- Local-store-based architectures (e.g., Cell and G80) are usually more efficient than cache-based machines

29 Acknowledgements
- Sam Williams, for writing the Cell stencil code and for guiding my work by autotuning SpMV and LBMHD
- Vasily Volkov, for writing the G80 CUDA code
- Kathy Yelick and Jim Demmel, for general advice and feedback
