Compiler and Runtime Support for Parallelizing Irregular Reductions on a Multithreaded Architecture
Gary M. Zoppetti, Gagan Agrawal, Rishi Kumar
In this talk I will describe my thesis work…
Motivation: Irregular Reductions
Frequently arise in scientific computations
Widely studied in the context of distributed memory machines, shared memory machines, distributed shared memory machines, and uniprocessor cache optimization
Main difficulty: traditional compile-time optimizations cannot be applied
Runtime optimizations: trade-off between runtime costs and efficiency of execution
Motivation: Multithreaded Architectures
Multiprocessors based upon multithreading
Support multiple threads of execution on each processor
Support low-overhead context switching and thread initiation
Low-cost point-to-point communication and synchronization
Problem Addressed
Can we use multiprocessors based upon multithreading for irregular reductions?
What kind of runtime and compiler support is required?
What level of performance and scalability is achieved?
Outline
Irregular Reductions
Execution Strategy
Runtime Support
Compiler Analysis
Experimental Results
Related Work
Summary
Irregular Reductions: Example

for (tstep = 0; tstep < num_steps; tstep++) {
  for (i = 0; i < num_edges; i++) {
    node1 = nodeptr1[i];
    node2 = nodeptr2[i];
    force = h(node1, node2);
    reduc1[node1] += force;
    reduc1[node2] += -force;
  }
}

An unstructured mesh is used to model an irregular geometry (an airplane wing, particle interactions, which are inherently sparse)
The time-step loop iterates until convergence
Point out the indirection arrays (nodeptr1, nodeptr2) and the reduction array (reduc1)
The reduction uses an associative, commutative operator
Irregular Reductions
Irregular reduction loops:
Elements of LHS arrays may be incremented in multiple iterations, but only using commutative & associative operators
No loop-carried dependences other than those on elements of the reduction arrays
One or more arrays are accessed using indirection arrays
Codes from many scientific & engineering disciplines contain them (simulations involving irregular meshes, molecular dynamics, sparse codes)
Irregular reductions are well studied for DM, DSM, and cache optimization
Compute-intensive
Execution Strategy Overview
Partition edges (interactions) among processors
Challenge: updating the reduction arrays
Divide reduction arrays into NUM_PROCS portions – revolving ownership
Execute NUM_PROCS phases on each processor
--each processor eventually will own every reduction array portion
Execution Strategy
To exploit multithreading, use (k*NUM_PROCS) phases and reduction portions
[Figure: table of reduction-portion ownership per phase for processors P0–P3 across phases 0–7]
4 processors, k = 2, so 8 phases
Ownership of reduction portions is offset by a factor of k
k provides an opportunity to overlap computation with communication by way of intervening phases
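The rotation can be pictured with a small helper. This is a sketch of one possible scheme, not the system's actual code: the formula is an assumption chosen to be consistent with the offset-by-k ownership and the Receive-from-PROC_ID+1 / Send-to-PROC_ID-1 pattern on the next slide.

/* Hypothetical ownership function: which of the k*NUM_PROCS reduction
 * portions processor proc_id updates during a given phase.  In phase 0,
 * processor p is assumed to own portion p*k; each phase, ownership advances
 * by one portion, so the portion a processor finishes is not needed by its
 * left neighbor until k phases later – the intervening phases hide the
 * communication. */
int owned_portion(int proc_id, int phase, int k, int num_procs)
{
    int num_portions = k * num_procs;
    return (proc_id * k + phase) % num_portions;
}

Under this assumed formula, with 4 processors and k = 2, processor P1 would own portions 2, 3, 4, … in phases 0, 1, 2, …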
Execution Strategy (Example)

for (phase = 0; phase < k * NUM_PROCS; phase++) {
  Receive (reduc1_array_portion) from processor PROC_ID + 1;
  // main calculation loop
  for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
    node1 = nodeptr1[i];
    node2 = nodeptr2[i];
    force = h(node1, node2);
    reduc1[node1] += force;
    reduc1[node2] += -force;
  }
  : : :
  Send (reduc1_array_portion) to processor PROC_ID - 1;
}

--send & receive are asynchronous
--usually 2 indirection arrays that represent an edge or interaction
--iterate over the edges local to the current phase
Execution Strategy
Make communication independent of data distribution and of the values of the indirection arrays
Exploit the MTA's ability to overlap communication & computation
Challenge: partition iterations into phases (each iteration updates 2 or more reduction array elements)
--these are the 2 goals of the strategy; mention the inspector/executor approach
--total communication volume = NUM_PROCS * REDUCTION_ARRAY_SIZE
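As a quick check of that communication-volume figure, under the assumption that each processor receives each reduction portion exactly once per time step:
Each portion has REDUCTION_ARRAY_SIZE / (k * NUM_PROCS) elements
A processor receives one portion per phase, so over k * NUM_PROCS phases it receives REDUCTION_ARRAY_SIZE elements per time step
Summed over all processors this gives NUM_PROCS * REDUCTION_ARRAY_SIZE, independent of the indirection array values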
Execution Strategy (Updating Reduction Arrays)

// main calculation loop
for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
  node1 = nodeptr1[i];
  node2 = nodeptr2[i];
  force = h(node1, node2);
  reduc1[node1] += force;
  reduc1[node2] += -force;
}
// update from buffer loop
for (i = loop2_pt[phase]; i < loop2_pt[phase + 1]; i++) {
  local_node = lbuffer_out[i];
  buffered_node = rbuffer_out[i];
  reduc1[local_node] += reduc1[buffered_node];
}

Suppose we assign each iteration to the lesser of its two phases
The compiler creates a second loop to apply the buffered updates
Runtime Processing
Responsibilities:
Divide iterations on each processor into phases
Manage buffer space for reduction arrays
Set up the second loop

// runtime preprocessing on each processor
LightInspector (. . .);
for (phase = 0; phase < k * NUM_PROCS; phase++) {
  Receive (. . .);
  // main calculation loop
  // second loop to update from buffer
  Send (. . .);
}

To make the execution strategy possible, the runtime processing is responsible for…
Call it LightInspector because it is significantly lighter weight than a traditional inspector – no inter-processor communication is required
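For concreteness, a minimal sketch of what a LightInspector might look like in C. This is an illustration only, not the EARTH implementation: phase_of(), the buffer layout (slots assumed to be appended after the regular reduc1 elements), and every name not on the slides are assumptions.

#include <stdlib.h>

/* Assumption for illustration: portion p is processed in phase p. */
static int phase_of(int node, int num_nodes, int num_phases)
{
    int portion_size = (num_nodes + num_phases - 1) / num_phases;
    return node / portion_size;
}

void LightInspector(int num_local_edges, int num_nodes, int k, int num_procs,
                    const int *nodeptr1, const int *nodeptr2,
                    int *nodeptr1_out, int *nodeptr2_out, int *loop1_pt,
                    int *lbuffer_out, int *rbuffer_out, int *loop2_pt)
{
    int num_phases = k * num_procs;
    int *cur1 = calloc(num_phases, sizeof(int));
    int *cur2 = calloc(num_phases, sizeof(int));
    int i, ph;

    /* Pass 1: count, per phase, the edges (each edge is assigned to the
     * lesser of its two nodes' phases) and the buffered updates (charged
     * to the greater phase). */
    for (i = 0; i < num_local_edges; i++) {
        int p1 = phase_of(nodeptr1[i], num_nodes, num_phases);
        int p2 = phase_of(nodeptr2[i], num_nodes, num_phases);
        cur1[p1 < p2 ? p1 : p2]++;
        if (p1 != p2)
            cur2[p1 > p2 ? p1 : p2]++;
    }

    /* Prefix sums give the loop bounds used by the executor. */
    loop1_pt[0] = loop2_pt[0] = 0;
    for (ph = 0; ph < num_phases; ph++) {
        loop1_pt[ph + 1] = loop1_pt[ph] + cur1[ph];
        loop2_pt[ph + 1] = loop2_pt[ph] + cur2[ph];
        cur1[ph] = loop1_pt[ph];   /* reuse the counters as insertion cursors */
        cur2[ph] = loop2_pt[ph];
    }

    /* Pass 2: reorder edges into phase order; if an edge touches a node owned
     * in a later phase, redirect that update into a buffer slot and record the
     * pairing so the second loop can fold it back in during the later phase. */
    for (i = 0; i < num_local_edges; i++) {
        int p1 = phase_of(nodeptr1[i], num_nodes, num_phases);
        int p2 = phase_of(nodeptr2[i], num_nodes, num_phases);
        int pos = cur1[p1 < p2 ? p1 : p2]++;
        nodeptr1_out[pos] = nodeptr1[i];
        nodeptr2_out[pos] = nodeptr2[i];
        if (p1 != p2) {
            int bpos = cur2[p1 > p2 ? p1 : p2]++;
            int buf  = num_nodes + bpos;              /* assumed buffer index */
            if (p1 > p2) nodeptr1_out[pos] = buf;
            else         nodeptr2_out[pos] = buf;
            lbuffer_out[bpos] = (p1 > p2) ? nodeptr1[i] : nodeptr2[i];
            rbuffer_out[bpos] = buf;
        }
    }
    free(cur1);
    free(cur2);
}

Note that all of this is local preprocessing: no inter-processor communication is needed, which is why it is much cheaper than a traditional inspector.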
[Figure: LightInspector example – input indirection arrays nodeptr1 and nodeptr2; output arrays nodeptr1_out, nodeptr2_out, copy1_out, and copy2_out organized by phase; buffer slots appended to reduc1 as a remote area]
K = 2, 2 processors, thus 4 phases
Number of nodes (vertices) = 8
Compiler Analysis
Identify reduction array sections updated through an associative, commutative operator
Identify indirection array (IA) sections
Form reference groups of reduction array sections accessed through the same IA sections
Each reference group can use the same LightInspector (see the sketch below)
EARTH-C compiler infrastructure
Now I'll present the compiler analysis that utilizes the execution strategy and runtime processing previously described
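A small illustration (not taken from the talk) of the reference-group idea: both reduction arrays below are updated through the same indirection-array sections over the same iteration space, so they form one reference group and a single LightInspector can drive both.

/* Illustrative fragment: reduc1 and reduc2 share the IA sections of
 * nodeptr1 and nodeptr2, so one LightInspector serves both
 * (f and g are hypothetical edge functions). */
for (i = 0; i < num_edges; i++) {
    node1 = nodeptr1[i];
    node2 = nodeptr2[i];
    reduc1[node1] += f(node1, node2);
    reduc1[node2] += f(node1, node2);
    reduc2[node1] += g(node1, node2);
    reduc2[node2] += g(node1, node2);
}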
Experimental Results
Three scientific kernels:
Euler: 2k and 10k mesh
Moldyn: 2k and 10k dataset
Sparse MVM: class W (7k), A (14k), & B (75k) matrices
Distribution of edges (interactions): block, cyclic, block-cyclic (in thesis)
Three values of k (1, 2, & 4)
EARTH-MANNA (SEMi)
MVM is a kernel from the NAS Conjugate Gradient benchmark – different classes of matrices W, A, B
Recall that edges denote interactions and can be reordered – how do we partition them onto processors? (see the sketch below)
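To make the block vs. cyclic comparison concrete, here is a hedged sketch of how an edge index could be assigned to a processor under each distribution; the actual partitioning code in the system may differ.

/* Hypothetical helpers showing the two edge distributions compared in the
 * experiments. */
int owner_block(int edge, int num_edges, int num_procs)
{
    int chunk = (num_edges + num_procs - 1) / num_procs;  /* contiguous blocks */
    return edge / chunk;
}

int owner_cyclic(int edge, int num_procs)
{
    return edge % num_procs;   /* round-robin over processors */
}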
Experimental Results (Euler 10k)
Do not report k=1,4 block because the block distribution typically resulted in load imbalance and k=2 outperformed k=1,4
Best absolute speedup (k=2, block): 1.16
Best relative speedup (k=2, cyclic, 32 processors): 10.35
Relative speedups on 32 processors – k=1 cyclic: 7.62; k=2 cyclic: 10.35; k=4 cyclic: 9.93; k=2 block: 6.94
Experimental Results (Moldyn 2k)
Best absolute speedup (k=1, cyclic): 1.31
Best relative speedup (k=2, cyclic, 32 processors): 9.68
Relative speedups on 32 processors – k=1 cyclic: 7.50; k=2 cyclic: 9.68; k=4 cyclic: 8.65; k=2 block: 6.47
k = 1: fewer phases, therefore better locality; also less threading overhead
(Moldyn 10k results in thesis)
Experimental Results (MVM Class A)
Didn't experiment with other distributions because the block distribution achieves near-linear speedups
Still an irregular code – the indirection array is used to access the vector, not a reduction array
Best absolute speedups: 1.95, 4.04, 8.51, 16.98, 30.65
On 32 processors – k=1: 28.41; k=2: 30.65; k=4: 30.21
Summary and Conclusions
Execution strategy: frequency and volume of communication are independent of the contents of the indirection arrays
No mesh partitioning or communication optimizations required
Initially incur overheads (locality), but achieve high relative speedups