Profiling and Optimization Outline

Presentation on theme: "Profiling and Optimization Outline"— Presentation transcript:

1 Profiling and Optimization Outline
Target Architecture
Anatomy of the Code Optimization Process:
- Compiling
- Baseline Assessment
- Identify Optimization Potential
- Consider Scale of Modifications (from compiler pragmas to algorithmic changes)
- Experiment, Assess, and Iterate
Potential Avenues for Improvement

2 Target Architecture
Stampede2 hosts 4,200 KNL compute nodes with 68 cores per node; each core has 4 hardware threads. The code was run on KNLs configured in cache-quadrant mode, in which the entire MCDRAM is managed by the system as a level-3 cache and the tiles (2 cores each) are arranged in quadrants, with addresses hashed to a directory in the same quadrant as the memory.
Table 1. Stampede2 KNL Compute Node Specifications
Model: Intel Xeon Phi 7250 ("Knights Landing")
Total cores per KNL node: 68 cores on a single socket
Hardware threads per core: 4
Hardware threads per node: 68 x 4 = 272
Clock rate: 1.4 GHz
RAM: 96 GB DDR4 plus 16 GB high-speed MCDRAM
Cache: 32 KB L1 data cache per core; 1 MB L2 per two-core tile. In the default configuration, MCDRAM operates as a 16 GB direct-mapped L3.
Local storage: All but 504 KNL nodes have a 132 GB /tmp partition on a 200 GB solid-state drive (SSD). The 504 KNLs originally installed as the Stampede1 KNL sub-system each have a 58 GB /tmp partition on 112 GB SSDs; the latter nodes currently make up the development, flat-quadrant, and flat-snc4 queues.

3 KNL Tile: 2 cores, each with 2 VPUs and 4 threads/core; 1 MB L2 cache shared between the two cores.
Silvermont microarchitecture, modified for HPC.
2 VPUs: two AVX-512 units per core; 32 SP or 16 DP operations per unit. Also supports x87, SSE, AVX1, AVX2, and EMU.
L2: 1 MB, 16-way; one line read and 1/2 line write per cycle; coherent across tiles.
CHA: Caching/Home Agent; a distributed tag directory keeps the L2 caches coherent (MESIF protocol).
2D mesh connections between tiles.
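The figures above (1.4 GHz clock, 2 AVX-512 VPUs per core, 16 DP operations per unit) let us sketch the per-core peaks that reappear later as roofline ceilings. This is a back-of-envelope estimate from the specs quoted here, not a vendor-published figure:

```python
# Back-of-envelope per-core peak FLOP/s for a KNL core, from the slide's specs.
clock_ghz = 1.4          # clock rate from the node specifications
vpus = 2                 # two AVX-512 vector units per core
dp_lanes = 512 // 64     # 8 double-precision lanes per 512-bit register

# Vector-add peak: one flop per lane per unit per cycle.
add_peak = clock_ghz * vpus * dp_lanes   # GFLOP/s
# FMA peak: a fused multiply-add counts as two flops per lane.
fma_peak = add_peak * 2                  # GFLOP/s

print(f"DP vector-add peak: {add_peak:.1f} GFLOPS")   # 22.4
print(f"DP FMA peak:        {fma_peak:.1f} GFLOPS")   # 44.8
```

These roughly match the ~20 GFLOPS DP vector-add and ~40 GFLOPS DP FMA roofs cited on the roofline plots later in the presentation.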

4 Profiling for Optimization
Compiling: in order to take full advantage of the KNL 512-bit vector instruction set, we use the compile flags:
COMPFLAGS = -O3 -xMIC-AVX512 -fma -align array64byte -finline-functions -ip -ipo
Profiling tools and procedures:
Hotspot identification
Roofline plots
Intra-node scaling study
Inter-node scaling study
Application Performance Snapshot
Memory access analysis
Vectorization Advisor
Loop analytics
Vectorization report

5 Hotspot Identification
Intel VTune Amplifier: mpiexec -n 1 amplxe-cl -collect hotspots -r ./hotspots_results ./main.out
Using Intel's VTune Amplifier, we can collect a list of functions and loops ordered by percentage of total runtime. This hotspot analysis shows that the largest share of the runtime within the time-stepping loop is spent in the Legendre transform function tra_qst2rtp (19%). The next two most costly functions (~24% of the time-stepping loop) both call tra_qst2rtp, so any improvement in this function is spread across several functions within the time-stepping loop.

6 Loop Survey Intel Vector Advisor
mpiexec -n 1 advixe-cl -collect survey --project-dir vectorization_profile ./main.out
The function/loop survey provides greater detail than the hotspot analysis. Loop metrics include:
Contribution to runtime
Vectorization efficiency
Performance (GFLOPS)
Arithmetic intensity (FLOP/byte)
Clues to performance issues

7 Roofline Analysis
# collect survey
mpiexec -n 1 advixe-cl --collect survey --project-dir ./roofline_analysis ./main.out
# collect flops and trip counts
mpiexec -n 1 advixe-cl --collect tripcounts -flops-and-masks --project-dir ./roofline_analysis ./main.out
By running an additional pass to collect trip counts and FLOPs, we can create a roofline plot, a graphical representation of the data in the loop survey. Each loop is represented by a dot, with size and color indicating relative execution time and call count. The position of each dot is determined by its performance (GFLOPS) on the y axis and its arithmetic intensity (FLOP/byte) on the x axis; both axes are log-scaled. The rooflines on this plot indicate the levels at which no further performance improvement is possible, due either to the bandwidth bounds of MCDRAM, L2 cache, or L1 cache, or to the peak computational capability of the architecture (e.g., the scalar, SP, or DP vector add or FMA peaks).
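The roofline model itself is easy to state in code: attainable performance is the lesser of the compute ceiling and bandwidth times arithmetic intensity. A minimal sketch, with illustrative peak and bandwidth numbers rather than measured Stampede2 values:

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gb_s):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity).

    ai             -- arithmetic intensity in FLOP/byte
    peak_gflops    -- compute ceiling (e.g. a DP FMA peak)
    bandwidth_gb_s -- bandwidth of the limiting memory level
    """
    return min(peak_gflops, bandwidth_gb_s * ai)

# Illustrative numbers only (not measured Stampede2 roofs).
peak = 40.0    # GFLOPS, FMA-like compute ceiling
l2_bw = 60.0   # GB/s, hypothetical L2 bandwidth roof

for ai in (0.25, 1.0, 4.0):
    bound = attainable_gflops(ai, peak, l2_bw)
    kind = "memory-bound" if bound < peak else "compute-bound"
    print(f"AI={ai:>4}: bound {bound:5.1f} GFLOPS ({kind})")
```

A dot sitting under a sloped roof on the plot corresponds to the memory-bound branch here; raising a loop's arithmetic intensity moves it rightward until the flat compute roof becomes the binding limit.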

8 Baseline Scaling Study
While our hotspot and vector analyses tell us a lot about how the code behaves on a single MPI rank, this code is intended to be distributed among many ranks, so a scaling test is needed. The following three plots illustrate the behavior of the code on Stampede2, keeping the problem size fixed while varying the number of nodes and threads per node. The problem size is a fixed spherical grid with discretization along the radial (N=160), theta (M=80), and phi (L=80) dimensions. This problem size was chosen as an exemplar of the target case the PIs are interested in running.

9 Intra-node Scaling
Weak scaling comparison (1 rank per thread):
KNL: 68 cores at 1.4 GHz, two 512-bit vector units per core
Sandy Bridge Xeon: 16 cores at 2.6 GHz, two 256-bit vector units per core
Scaling plateaus beyond the number of physical cores.

10 Inter-node and Intra-node Scaling, Arranged by Constant Number of Nodes
Distribute ranks across nodes and within nodes.
Max performance (sec/iter) at 128 ranks: Nr=16 x Nth=8, i.e., 4 nodes with 32 ranks per node.
Scaling efficiency degrades beyond 8 nodes (at >= 64 ranks).

11 Inter-node and Intra-node Scaling, Arranged by Constant Number of Processes per Node
Plot lines are grouped into collections of constant processes per node (ppn).
Optimal performance is found at 128 total ranks: 4 or 8 nodes with 32 or 16 ppn.
The scaling profile shows that core/thread capacity is underutilized.
Possible scaling barriers: MPI communication overhead; core/thread resource contention.

12 Application Performance Snapshot
APS is available in the Intel 2018 Parallel Studio suite of tools.
mpiexec -n 128 aps -r ./aps_results_128 ./main.out
(Snapshots are compared at 32, 64, and 128 ranks.)

13 Application Performance Snapshot Interpretation
MPI bound? APS reports the application is MPI bound: the percentage of time in MPI operations is 17%, 20%, and 16% at 32, 64, and 128 ranks respectively, and does not increase with scaling.
MPI imbalance? Imbalance is responsible for between 2% and 4% of elapsed time.
Back-end stalls: the percentage of empty pipeline slots increases from 38% to 48% to 71% at 32, 64, and 128 ranks respectively.
If the code is indeed MPI bound, that effect is dwarfed by memory pipeline bottlenecks.

14 Back End Stalls
Superscalar processors can be conceptually divided into the front-end, where instructions are fetched and decoded into the operations that constitute them, and the back-end, where the required computation is performed. During each cycle, the front-end generates up to two of these operations, places them into pipeline slots, and moves them through the back-end. The actual number of retired pipeline slots containing useful work rarely equals this maximum, often because the back-end was not prepared to accept more operations of a certain kind (back-end bound execution).
L2 Hit Bound: a significant proportion of pipeline slots remain empty because operations take too long in the back-end. Such operations introduce bubbles in the pipeline that cause fewer pipeline slots containing useful work to retire per cycle than the machine can support; this opportunity cost results in slower execution. Long-latency operations such as division and memory accesses can cause this, as can too many operations directed to a single execution port.
L2 Miss Bound: the percentage of CPU cycles spent waiting for L2 load misses to be serviced. Any memory request missing here must be serviced by local or remote DRAM or MCDRAM, with significant latency.

15 Memory Access Analysis
To get an idea of where our back-end stall bottleneck occurs, we can run a memory access analysis using Intel VTune Amplifier:
mpiexec -n 128 amplxe-cl -collect memory-access -r ./vtune_memory_analysis ./main.out
L2 cache miss count: 141 million.
L2 hit bound: 16% of the execution time for this function.
L2 miss bound: almost 20%.
Proposed remedy: increase prefetch aggressiveness. Better prediction of L2 cache loads should reduce the cache misses that require loading from MCDRAM, though it may decrease pipeline efficiency due to mispredictions.

16 Reducing L2 Miss Bounds by Increasing Prefetching Aggressiveness
COMPFLAGS = -qopt-prefetch=3 (the default is 2)
Increasing the compiler prefetch level to 3 and rerunning the memory access analysis, we see a dramatic improvement in the performance of function tra_qst2rtp:
L2 miss counts drop from 141 million to 3 million.
L2 hit bound drops from 16% to 0.7%.
L2 miss bound drops from 19% to 0.4%.

17 Loop Survey post prefetch=3
The loop at line 325 of tra_qst2rtp improves from 3.8 GFLOPS to 5.83 GFLOPS.
All bottleneck loops in tra_qst2rtp show improved performance.

18 Roofline post prefetch=3
All loops have moved closer to the L2 cache roofline: an across-the-board performance increase. However, the loops still sit under the L2 roof, so we do not yet know whether we have resolved all L2 bandwidth barriers. The next step is to increase loop performance and see whether loops can rise above the L2 bandwidth roofline.

19 Vector Efficiency Optimization
Vector efficiency for the bottleneck loops is 60%-70%. Picking one loop in particular (line 325 in function tra_qst2rtp), we can examine its statistics using Loop Analytics in Intel Advisor. This line of Fortran is written using array syntax, denoted by (:) in place of an index. We can think of line 325 as a scalar loop over the leading index of the arrays involved; vectorization is applied to this loop over the first dimension of the arrays:
a3c_Y(:,m,n,2) = a3c_Y(:,m,n,2) + c1 * leg_llP_(:,nh) + c2 * leg_llsin1mP(:,nh)
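The whole-array statement is exactly the implicit loop the compiler vectorizes. A sketch of the same update in NumPy, where the array names and shapes are illustrative stand-ins for the Fortran variables:

```python
import numpy as np

# Stand-ins for slices of the Fortran arrays; the size is illustrative only.
n_th = 16                                     # leading (vectorized) dimension
a3c_y = np.zeros(n_th, dtype=np.complex64)    # plays a3c_Y(:,m,n,2)
leg_llp = np.ones(n_th)                       # plays leg_llP_(:,nh)
leg_llsin1mp = np.full(n_th, 2.0)             # plays leg_llsin1mP(:,nh)
c1, c2 = 1.0 + 0.0j, 0.5 + 0.0j

# The array statement is an implicit loop over the leading index:
#   do i = 1, n_th
#     a3c_Y(i,m,n,2) = a3c_Y(i,m,n,2) + c1*leg_llP_(i,nh) + c2*leg_llsin1mP(i,nh)
#   end do
a3c_y += c1 * leg_llp + c2 * leg_llsin1mp

print(a3c_y[0])   # (2+0j): 1*1 + 0.5*2
```

The compiler maps contiguous chunks of this leading index onto 512-bit registers, which is why the alignment and stride of the first dimension dominate the loop's vector efficiency.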

20 Vectorization Metrics
Gain is a measure of the reduction in cycles from a scalar loop to a vectorized loop:
Gain = scalar loop cost (cycles) / vector loop cost (cycles)
Because the expression in line 325 involves Float32 complex values, the effective vector length is 512/(2x32) = 8, so the maximum theoretical vectorization gain for this computation is 8x.
Vector efficiency is the ratio of actual gain to ideal gain:
Vector efficiency = Gain / vector length (for the data type)
Vector efficiency can be reduced by the overhead sometimes needed to convert scalar operations to vector operations.

21 Vectorization Report
To produce vectorization reports, use the following compiler flag: COMPFLAGS = -qopt-report=5
LOOP BEGIN at ./modules/transform.fftw3.f90(325,13)
remark #15389: vectorization support: reference a3c_y(:,m,n,2) has unaligned access
remark #15389: vectorization support: reference leg_llp_(:,nh) has unaligned access [ ./modules/transform.fftw3.f90(325,47) ]
remark #15389: vectorization support: reference leg_llsin1mp(:,nh) has unaligned access [ ./modules/transform.fftw3.f90(326,43) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 8
remark #15309: vectorization support: normalized vectorization overhead 0.640
remark #15300: LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 3
remark #15451: unmasked unaligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 19
remark #15477: vector cost: 3.120
remark #15478: estimated potential speedup: 5.830
remark #15488: --- end vector cost summary ---
remark #25018: Total number of lines prefetched=4
remark #25019: Number of spatial prefetches=4, dist=2
remark #25135: Using outer-loop strategy for prefetching memory reference, outer_dist=1 [ ./modules/transform.fftw3.f90(325,47) ]
remark #25135: Using outer-loop strategy for prefetching memory reference, outer_dist=1 [ ./modules/transform.fftw3.f90(326,43) ]
remark #25015: Estimate of max trip count of loop=15
LOOP END

22 Aligning Variables
We can inform the compiler that variables should be allocated such that the base pointer is aligned on specific byte boundaries. This allows vectorization to load memory into registers without gaps at the head of the register, which in turn eliminates some of the operations needed to perform the vector operation. We had previously used the compiler flag -align array64byte to request that all of our memory allocations align with 64-byte boundaries:
COMPFLAGS = -O3 -xMIC-AVX512 -fma -align array64byte -finline-functions -ip -ipo
This compiler flag does not ensure alignment of static common blocks or of fields within derived datatypes. The geodynamo code uses many derived datatypes and static allocations, so we have to include directives to enforce proper base-pointer alignment.
In transform.fftw3.f90:
double complex, private :: a3c_Y(i_Th, 0:2*_pHc-1,i_pN,3)
!DIR$ ATTRIBUTES ALIGN: 64 :: a3c_Y
In legendre.f90:
double precision :: leg_llP_ (i_Th, 0:i_H1)
!DIR$ ATTRIBUTES ALIGN: 64 :: leg_llP_
double precision :: leg_llsin1mP(i_Th, 0:i_H1)
!DIR$ ATTRIBUTES ALIGN: 64 :: leg_llsin1mP
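What the directive requests can be verified at run time by inspecting an array's base address. A sketch of the same idea in Python, allocating a 64-byte-aligned array by hand (NumPy does not expose an alignment parameter directly, so we over-allocate and slice):

```python
import numpy as np

def aligned_zeros(n, dtype, alignment=64):
    """Allocate a zeroed 1-D array whose base pointer lies on a 64-byte
    boundary, mimicking what !DIR$ ATTRIBUTES ALIGN: 64 requests in Fortran."""
    itemsize = np.dtype(dtype).itemsize
    # Over-allocate raw bytes, then slice off the misaligned prefix.
    buf = np.zeros(n * itemsize + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + n * itemsize].view(dtype)

a = aligned_zeros(160, np.complex128)
print(a.ctypes.data % 64)   # 0: base pointer is 64-byte aligned
```

With an aligned base pointer and unit stride, the first vector load of the loop starts on a register boundary, which is exactly the condition the post-alignment vectorization report confirms.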

23 Vectorization Report post align
LOOP BEGIN at ./modules/transform.fftw3.f90(325,13)
remark #15388: vectorization support: reference a3c_y(:,m,n,2) has aligned access
remark #15388: vectorization support: reference leg_llp_(:,nh) has aligned access [ ./modules/transform.fftw3.f90(325,47) ]
remark #15388: vectorization support: reference leg_llsin1mp(:,nh) has aligned access [ ./modules/transform.fftw3.f90(326,43) ]
remark #15305: vectorization support: vector length 8
remark #15309: vectorization support: normalized vectorization overhead 0.200
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 3
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 19
remark #15477: vector cost: 2.500
remark #15478: estimated potential speedup: 7.490
remark #15488: --- end vector cost summary ---
remark #25018: Total number of lines prefetched=4
remark #25019: Number of spatial prefetches=4, dist=2
remark #25135: Using outer-loop strategy for prefetching memory reference, outer_dist=1 [ ./modules/transform.fftw3.f90(325,47) ]
remark #25135: Using outer-loop strategy for prefetching memory reference, outer_dist=1 [ ./modules/transform.fftw3.f90(326,43) ]
remark #25015: Estimate of max trip count of loop=15
LOOP END
Estimated speedup has improved from 5.83 to 7.49.

24 Optimized Loop Analytics
Alignment of the selected variables improves vectorization across many bottleneck loops. Applying explicit data alignment throughout the code should yield further improvements across many loops and functions.

25 Optimized Loop Roofline Plot
Several loops are now above the L2 cache bandwidth roofline; no longer bound by L2 cache bandwidth, they may benefit from parallelization among additional threads. The primary bottleneck loop remains under the roof, and it is unclear whether the optimized function would benefit from thread parallelization of its outer loop. Selective loop parallelization is one possible avenue for increasing intra-node scalability.

26 Scaling Comparison of Optimized Code
Optimized scaling profiles are plotted with dotted lines and un-optimized ones with solid lines, scaled relative to the seconds/iter time of the un-optimized code on a single KNL rank. The optimized curves match the trend of the original but exhibit better performance at each discretization level: we have improved computational performance but have not removed any of the barriers to scaling efficiency. Peak performance is again at 128 ranks (4 nodes with 32 ranks per node), where the optimized seconds/iter is an 18% speedup over the un-optimized code.

27 Potential Avenues for Improvement
As the roofline plot for the optimized version of the code shows, we are still mostly under the L2 bandwidth roof, so there may be some room for memory-access improvement. To approach the double-precision vector add peak (~20 GFLOPS), and ultimately the double-precision FMA peak (~40 GFLOPS), we will need to increase the arithmetic intensity (the ratio of floating-point operations to memory accesses). Until this can be done, any further improvement in vectorization or shared-memory parallelism (multithreading) will be limited by cache bandwidth.

28 Batch Matrix-Matrix Multiplication
The parallelization scheme is well optimized to split up the matrix-matrix multiplication term of the Legendre transform and evenly distribute it as a loop over vector-matrix multiplications:
AI = 2n^2 flops / n^2 memory accesses = 2
If the algorithm could be restructured to bundle these operations into a batch matrix-matrix operation, it would dramatically improve the arithmetic intensity of the operation:
AI = 2n^3 flops / 4n^2 memory accesses = n/2
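The arithmetic-intensity argument can be made concrete with the grid size from the scaling study (N=160 radial points). A sketch of the two estimates, counting one read or write per array element as in the formulas above:

```python
def ai_vector_matrix(n):
    """AI of one n-element vector times n x n matrix multiply:
    2n^2 flops over ~n^2 element accesses (matrix traffic dominates)."""
    flops = 2 * n * n
    accesses = n * n
    return flops / accesses            # = 2, independent of n

def ai_matrix_matrix(n):
    """AI of an n x n matrix-matrix multiply: 2n^3 flops over ~4n^2
    element accesses (read A and B, read and write C)."""
    flops = 2 * n ** 3
    accesses = 4 * n * n
    return flops / accesses            # = n/2, grows with n

print(ai_vector_matrix(160))   # 2.0
print(ai_matrix_matrix(160))   # 80.0
```

At n=160 the batched formulation raises the arithmetic intensity from 2 to 80 FLOP/access, carrying the operation well past the ridge point of the roofline and making it compute-bound rather than bandwidth-bound.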

