Slide 1: Tuning Stencils
Parallel Computing Laboratory, EECS (Electrical Engineering and Computer Sciences), Berkeley Par Lab
Kaushik Datta
Microsoft Site Visit, April 29, 2008
Slide 2: Stencil Code Overview
- For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including the point itself).
- A stencil code updates every point in a regular grid with a constant-weighted subset of its neighbors ("applying the stencil").
[Figure: example 2D stencil and 3D stencil]
Slide 3: Stencil Applications
- Stencils are critical to many scientific applications: diffusion, electromagnetics, computational fluid dynamics.
- Used on both uniform and adaptive block-structured meshes.
- Many types of stencils:
  - 1D, 2D, and 3D meshes
  - different numbers of neighbors (5-point, 7-point, 9-point, 27-point, ...)
  - Gauss-Seidel (update in place) vs. Jacobi iterations (two meshes)
  - varying boundary conditions (constant vs. periodic)
Slide 4: Naïve Stencil Code

void stencil3d(double A[], double B[], int nx, int ny, int nz) {
    /* Jacobi sweep: read A, write B; boundary points are left untouched.
       x is the unit-stride dimension; S0 and S1 are the stencil weights. */
    for (int z = 1; z < nz - 1; z++) {
        for (int y = 1; y < ny - 1; y++) {
            for (int x = 1; x < nx - 1; x++) {
                int center = x + nx * (y + ny * z);
                B[center] = S0 * A[center]
                          + S1 * (A[center - 1]       + A[center + 1]         /* left/right */
                                + A[center - nx]      + A[center + nx]        /* front/back */
                                + A[center - nx * ny] + A[center + nx * ny]); /* bottom/top */
            }
        }
    }
}
Slide 5: Our Stencil Code
- Executes a 3D, 7-point Jacobi iteration on a 256^3 grid.
- Performs 8 flops (6 adds, 2 multiplies) per point.
- Parallelization performed with pthreads.
- Thread affinity: multithreading first, then multicore, then multisocket.
- Flop:byte ratio: 0.33 on write-allocate architectures, 0.5 ideal.
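The two flop:byte figures follow from simple per-point traffic accounting. A minimal sketch of that arithmetic (the function names are illustrative; the counts assume 8-byte doubles, one streaming read of A and one write of B per point, with cached reuse of the six neighbor reads):

```c
/* Per-point DRAM traffic for the 7-point Jacobi kernel, in bytes.
 * The streaming read of A costs one 8-byte double per point and the write
 * of B another; on a write-allocate cache the write miss also reads in
 * B's line, adding a third 8-byte access per point. */
int bytes_per_point(int write_allocate) {
    return write_allocate ? 3 * 8 : 2 * 8;
}

/* 8 flops per point (6 adds + 2 multiplies) over the traffic above:
 * 8/24 = 1/3 with write allocation, 8/16 = 1/2 in the ideal case. */
double flop_byte_ratio(int write_allocate) {
    return 8.0 / bytes_per_point(write_allocate);
}
```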
Slide 6: Cache-Based Architectures
[Figure: block diagrams of Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Slide 7: Autotuning
- Provides a portable and effective method for tuning.
- Limiting the search space:
  - Searching the entire space is intractable.
  - Instead, we ordered the optimizations appropriately for a given platform.
  - To find the best parameters for a given optimization, we performed an exhaustive search.
  - Each optimization was applied on top of all previous optimizations.
  - In general, heuristics/models can also be used to prune the search space.
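The per-optimization exhaustive search can be sketched as a loop over candidate parameter sets that keeps the fastest. Everything here is a hypothetical stand-in: `Params`, the candidate list, and the toy `benchmark` cost model replace the real timed kernel runs:

```c
#include <stddef.h>

/* Hypothetical tuning parameters for one optimization stage. */
typedef struct { int unroll; int pad; } Params;

/* Toy cost model standing in for a timed kernel run; here we simply
 * pretend unroll=4, pad=16 is the optimum. */
static double benchmark(Params p) {
    double du = p.unroll - 4, dp = p.pad - 16;
    return du * du + dp * dp;
}

/* Exhaustive search: try every candidate, keep the fastest. */
Params exhaustive_search(const Params *cand, size_t n) {
    Params best = cand[0];
    double best_t = benchmark(best);
    for (size_t i = 1; i < n; i++) {
        double t = benchmark(cand[i]);
        if (t < best_t) { best_t = t; best = cand[i]; }
    }
    return best;
}
```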
Slide 8: Naïve Code
- The naïve code is a simple, threaded stencil kernel.
- Domain partitioning is performed only in the least contiguous dimension.
- No optimizations or tuning were performed.
[Figure: 3D grid with x, y, z axes; unit-stride dimension labeled]
Slide 9: Naïve Performance
[Figure: naive stencil performance on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Slide 10: NUMA-Aware
- Exploited the "first-touch" page-mapping policy on NUMA architectures.
- Due to our affinity policy, the benefit is only seen when using both sockets.
[Figure: socket diagrams of Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
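The first-touch exploit amounts to initializing the grid in parallel with the same thread-to-data mapping the stencil sweeps will use, so each page is mapped to the socket that will later read and write it. A minimal pthreads sketch, assuming a simple 1D partition (the helper names and partitioning are illustrative, not the deck's actual code):

```c
#include <pthread.h>
#include <stddef.h>

typedef struct { double *grid; size_t begin, end; } InitArg;

/* Each thread writes its own sub-range first: under the first-touch
 * policy, that first write maps the page to this thread's local socket. */
static void *first_touch(void *p) {
    InitArg *a = p;
    for (size_t i = a->begin; i < a->end; i++)
        a->grid[i] = 0.0;
    return NULL;
}

void numa_aware_init(double *grid, size_t n, int nthreads) {
    pthread_t tid[nthreads];
    InitArg arg[nthreads];
    size_t chunk = n / nthreads;
    for (int t = 0; t < nthreads; t++) {
        arg[t].grid  = grid;
        arg[t].begin = (size_t)t * chunk;
        arg[t].end   = (t == nthreads - 1) ? n : (size_t)(t + 1) * chunk;
        pthread_create(&tid[t], NULL, first_touch, &arg[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```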
Slide 11: NUMA-Aware Performance
[Figure: performance on the three platforms; bars: Naive, +NUMA-Aware]
Slide 12: Loop Unrolling/Reordering
- Allows better use of registers and functional units.
- The best inner loop was chosen by iterating many times over a grid size that fits into the L1 cache (x86 machines) or L2 cache (Victoria Falls); this should eliminate any effects from the memory subsystem.
- This optimization is independent of the later memory optimizations.
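An unrolled inner loop for one x-pencil of the 7-point stencil might look like the following sketch (unroll factor 2 and the `idx` layout helper are illustrative; the tuner searched many unroll/reorder variants):

```c
#include <stddef.h>

/* Layout helper: x is unit-stride, then y, then z. */
static inline size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny) {
    return x + nx * (y + ny * z);
}

/* One interior x-pencil of the 7-point Jacobi stencil, unrolled by 2
 * so two independent updates can overlap in the functional units. */
void pencil_unrolled(const double *a, double *b,
                     size_t nx, size_t ny, size_t y, size_t z,
                     double s0, double s1) {
    size_t sy = nx, sz = nx * ny;   /* neighbor strides in y and z */
    size_t x;
    for (x = 1; x + 1 < nx - 1; x += 2) {   /* two points per iteration */
        size_t c0 = idx(x, y, z, nx, ny), c1 = c0 + 1;
        b[c0] = s0 * a[c0] + s1 * (a[c0-1] + a[c0+1] + a[c0-sy] + a[c0+sy]
                                 + a[c0-sz] + a[c0+sz]);
        b[c1] = s0 * a[c1] + s1 * (a[c1-1] + a[c1+1] + a[c1-sy] + a[c1+sy]
                                 + a[c1-sz] + a[c1+sz]);
    }
    for (; x < nx - 1; x++) {               /* scalar remainder */
        size_t c = idx(x, y, z, nx, ny);
        b[c] = s0 * a[c] + s1 * (a[c-1] + a[c+1] + a[c-sy] + a[c+sy]
                               + a[c-sz] + a[c+sz]);
    }
}
```

On a constant grid of ones with s0 = 2 and s1 = 1, every interior point becomes 2 + 6 = 8, which makes the kernel easy to sanity-check.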
Slide 13: Loop Unrolling/Reordering Performance
[Figure: performance on the three platforms; bars: Naive, +NUMA-Aware, +Loop Unrolling/Reordering]
Slide 14: Padding
- Used to reduce conflict misses and DRAM bank conflicts.
- Drawback: larger memory footprint.
- Performed a search to determine the best padding amount.
- Only padded in the unit-stride dimension.
[Figure: grid showing the padding amount added to the unit-stride dimension]
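A sketch of how such padding could be applied, with the pad restricted to the unit-stride dimension (the helper names are illustrative; the pad amount is exactly the quantity the search varies):

```c
#include <stdlib.h>
#include <stddef.h>

/* Allocate an nx*ny*nz grid whose unit-stride (x) dimension is padded by
 * `pad` extra doubles, so that successive pencils do not land on the same
 * cache sets / DRAM banks. The padded leading dimension is returned via
 * stride_out; the pad elements are skipped, never touched. */
double *alloc_padded(size_t nx, size_t ny, size_t nz,
                     size_t pad, size_t *stride_out) {
    size_t stride = nx + pad;
    *stride_out = stride;
    return calloc(stride * ny * nz, sizeof(double));
}

/* Index into the padded grid using the padded leading dimension. */
size_t padded_idx(size_t x, size_t y, size_t z, size_t stride, size_t ny) {
    return x + stride * (y + ny * z);
}
```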
Slide 15: Padding Performance
[Figure: performance on the three platforms; bars: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding]
Slide 16: Thread/Cache Blocking
- Performed an exhaustive search over all possible power-of-two parameter values.
- Every thread block has the same size and shape, which preserves load balancing.
- Did NOT cut in the contiguous dimension on the x86 machines; this avoids interrupting the hardware prefetchers.
- Only performed cache blocking in one dimension; this is sufficient to fit three read planes and one write plane into cache.
[Figure: decomposition with 4 thread blocks in x, 2 in y, 2 in z, and 2 cache blocks in y; x is unit-stride]
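The uniform power-of-two decomposition can be sketched as a function mapping a thread ID to its block bounds (an illustrative helper; the 4x2x2 thread-block grid in the figure is one point in the search space):

```c
#include <stddef.h>

/* Half-open index bounds of one thread block: [x0,x1) x [y0,y1) x [z0,z1). */
typedef struct { size_t x0, x1, y0, y1, z0, z1; } Block;

/* Split an nx*ny*nz grid into tbx*tby*tbz equal blocks (dimensions assumed
 * divisible, as with the power-of-two sizes searched) and return the block
 * owned by thread `tid`. Every block has the same size and shape, which
 * preserves load balance. */
Block thread_block(size_t tid, size_t nx, size_t ny, size_t nz,
                   size_t tbx, size_t tby, size_t tbz) {
    size_t bx = tid % tbx;
    size_t by = (tid / tbx) % tby;
    size_t bz = tid / (tbx * tby);
    Block b;
    b.x0 = bx * (nx / tbx); b.x1 = b.x0 + nx / tbx;
    b.y0 = by * (ny / tby); b.y1 = b.y0 + ny / tby;
    b.z0 = bz * (nz / tbz); b.z1 = b.z0 + nz / tbz;
    return b;
}
```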
Slide 17: Thread/Cache Blocking Performance
[Figure: performance on the three platforms; bars: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking]
Slide 18: Software Prefetching
- Allows us to hide memory latency.
- Searched over varying prefetch distances and granularities (e.g., prefetch every register block, plane, or pencil).
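A sketch of prefetching one pencil ahead while the current pencil is computed, using GCC's `__builtin_prefetch` as a stand-in for the platform's prefetch instruction (the one-pencil distance is just one of the distances/granularities the search explored; the pointer convention is illustrative):

```c
#include <stddef.h>

/* Update one interior x-pencil of the 7-point stencil. `a` and `b` point
 * at element (0, y, z) of the source and destination grids; stride_y and
 * stride_z are the neighbor strides. While each point is computed, a
 * prefetch is issued for the corresponding element one pencil ahead. */
void pencil_with_prefetch(const double *a, double *b, size_t nx,
                          size_t stride_y, size_t stride_z,
                          double s0, double s1) {
    for (size_t x = 1; x < nx - 1; x++) {
        /* read-only prefetch, low temporal locality; never faults */
        __builtin_prefetch(&a[x + stride_y], 0, 0);
        b[x] = s0 * a[x] + s1 * (a[x-1] + a[x+1]
             + a[x-stride_y] + a[x+stride_y]
             + a[x-stride_z] + a[x+stride_z]);
    }
}
```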
Slide 19: Software Prefetching Performance
[Figure: performance on the three platforms; bars: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching]
Slide 20: SIMDization
- Requires a complete code rewrite to utilize the 128-bit SSE registers.
- Allows a single instruction to add/multiply two doubles.
- Only possible on the x86 machines.
- Padding performed to achieve proper data alignment (not to avoid conflicts).
- Searched over register block sizes and prefetch distances simultaneously.
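A sketch of the SIMDized inner loop using SSE2 intrinsics, two doubles per 128-bit register (illustrative: unaligned loads are used here for simplicity, whereas the tuned kernel relies on the alignment padding to use aligned loads, and it also register-blocks):

```c
#include <emmintrin.h>
#include <stddef.h>

/* One interior x-pencil of the 7-point stencil, two points per SSE2
 * operation. `a` and `b` point at element (0, y, z) of their grids. */
void pencil_sse2(const double *a, double *b, size_t nx,
                 size_t stride_y, size_t stride_z, double s0, double s1) {
    __m128d vs0 = _mm_set1_pd(s0), vs1 = _mm_set1_pd(s1);
    size_t x;
    for (x = 1; x + 1 < nx - 1; x += 2) {       /* two doubles at a time */
        __m128d c = _mm_loadu_pd(&a[x]);        /* center pair */
        __m128d n = _mm_add_pd(_mm_loadu_pd(&a[x - 1]),
                               _mm_loadu_pd(&a[x + 1]));
        n = _mm_add_pd(n, _mm_add_pd(_mm_loadu_pd(&a[x - stride_y]),
                                     _mm_loadu_pd(&a[x + stride_y])));
        n = _mm_add_pd(n, _mm_add_pd(_mm_loadu_pd(&a[x - stride_z]),
                                     _mm_loadu_pd(&a[x + stride_z])));
        _mm_storeu_pd(&b[x], _mm_add_pd(_mm_mul_pd(vs0, c),
                                        _mm_mul_pd(vs1, n)));
    }
    for (; x < nx - 1; x++)                     /* scalar remainder */
        b[x] = s0 * a[x] + s1 * (a[x-1] + a[x+1] + a[x-stride_y]
             + a[x+stride_y] + a[x-stride_z] + a[x+stride_z]);
}
```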
Slide 21: SIMDization Performance
[Figure: performance on the three platforms; bars: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching, +SIMDization]
Slide 22: Cache Bypass
- Writes data directly to the write-back buffer; no data is loaded on a write miss.
- Changes the stencil kernel's flop:byte ratio from 1/3 to 1/2, reducing memory traffic by 33%.
- Still requires the SIMDized code from the previous optimization.
- Searched over register block sizes and prefetch distances simultaneously.
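On x86 the bypass is SSE2's non-temporal store, `_mm_stream_pd`. A minimal illustration on a scaled copy rather than the full stencil (`stream_scaled_copy` is a made-up demo; in the real kernel the streamed store simply replaces the normal store of the SIMDized result):

```c
#include <emmintrin.h>
#include <stdlib.h>
#include <stddef.h>

/* b[i] = s * a[i] for n doubles, writing b with non-temporal stores.
 * _mm_stream_pd goes straight through the write-combining buffers, so the
 * write miss never allocates (and therefore never reads) b's cache line.
 * b must be 16-byte aligned and n even. */
void stream_scaled_copy(const double *a, double *b, size_t n, double s) {
    __m128d vs = _mm_set1_pd(s);
    for (size_t i = 0; i + 1 < n; i += 2)
        _mm_stream_pd(&b[i], _mm_mul_pd(vs, _mm_loadu_pd(&a[i])));
    _mm_sfence();   /* order the non-temporal stores before later reads */
}
```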
Slide 23: Cache Bypass Performance
[Figure: performance on the three platforms; bars: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching, +SIMDization, +Cache Bypass]
Slide 24: Collaborative Threading
- Requires another complete code rewrite.
- Collaborative threading (CT) allows better L1 cache utilization when switching threads.
- Only effective on Victoria Falls due to:
  - a very small L1 cache (8 KB) shared by 8 hardware threads
  - the lack of hardware prefetchers (which allows us to cut in the contiguous dimension)
- Drawback: the parameter space becomes very large.
[Figure: decompositions without collaboration (4x2x2 thread blocks, 2 cache blocks in y) and with collaboration (large collaborative thread blocks: 4 in y, 2 in z; small collaborative thread blocks: 2 in y, 4 in z; threads t0-t7 share each collaborative block); x is unit-stride]
Slide 25: Collaborative Threading Performance
[Figure: performance on the three platforms; bars: Naive through +Cache Bypass, plus +Collaborative Threading]
Slide 26: Autotuning Results
[Figure: cumulative performance of all optimizations (Naive through +Collaborative Threading) on the three platforms: 1.9x better on Intel Clovertown, 5.4x better on AMD Barcelona, and 10.4x better on Sun Victoria Falls]
Slide 27: Architecture Comparison
[Figure: single- and double-precision performance and power efficiency compared across architectures]
Slide 28: Conclusions
- Compilers alone fail to fully utilize system resources; programmers may not even know the system is being underutilized.
- Autotuning provides a portable and effective solution, producing up to a 10.4x improvement over the compiler alone.
- To make autotuning tractable:
  - choose the order of optimizations appropriately for the platform
  - prune the search space intelligently for large searches
- Power efficiency has become a valuable metric.
- Local-store-based architectures (e.g., Cell and G80) are usually more efficient than cache-based machines.
Slide 29: Acknowledgements
- Sam Williams, for writing the Cell stencil code and for guiding my work by autotuning SpMV and LBMHD
- Vasily Volkov, for writing the G80 CUDA code
- Kathy Yelick and Jim Demmel, for general advice and feedback