Gary M. Zoppetti Gagan Agrawal


1 Compiler and Runtime Support for Adaptive Irregular Applications on a Multithreaded Architecture
Gary M. Zoppetti, Gagan Agrawal
Notes: In this talk I will describe my thesis work…

2 Multithreaded Architectures (MTAs)
MTA characteristics:
- multiple threads of execution in hardware; each processor maintains several loci of control
- fast thread switching
- efficient communication & synchronization mechanisms
- special hardware and/or software runtime system (RTS) support
MTA capabilities:
- masking communication and synchronization latencies
- dynamic load balancing
Well-suited for irregular applications.
Notes: MTAs overcome the limitations of conventional architectures by maintaining several contexts; the first two characteristics are achieved through RTS support. Unstructured communication typically results in high latency, so the ability to mask latency is important. Dynamic control flow typically results in load imbalances, so the ability of the architecture to adapt to a changing workload is vital.

3 Unstructured Mesh Processing

    for (tstep = 0; tstep < num_steps; tstep++) {
      for (i = 0; i < num_edges; i++) {
        node1 = nodeptr1[i];
        node2 = nodeptr2[i];
        force = h(node1, node2);
        reduc1[node1] += force;
        reduc1[node2] += -force;
      }
    }

Notes: An unstructured mesh models an irregular geometry (an airplane wing, or particle interactions, which are inherently sparse). The time-step loop iterates until convergence. Note the indirection arrays (nodeptr1, nodeptr2) and the reduction array (reduc1), updated with an associative, commutative operator.

4 Irregular Reductions
Irregular reduction loops:
- One or more arrays are accessed using indirection arrays
- Elements of LHS arrays may be incremented in multiple iterations, but only using commutative & associative operators
- No loop-carried dependences other than those on elements of the reduction arrays
- Compute-intensive; codes from many scientific & engineering disciplines contain them (simulations involving irregular meshes, molecular dynamics, sparse codes)
- Irregular reductions are well studied for distributed memory (DM), distributed shared memory (DSM), and cache optimization

5 Execution Strategy Overview
- Partition edges (interactions) among processors
- Challenge: updating the reduction arrays
- Divide the reduction arrays into NUM_PROCS portions, with revolving ownership
- Execute NUM_PROCS phases on each processor; each processor eventually owns every reduction array portion

6 Execution Strategy
To exploit multithreading, use (k * NUM_PROCS) phases and reduction portions.
[Figure: ownership table for 4 processors (P0-P3) with k = 2, giving 8 phases and 8 reduction portions.]
Notes: Ownership of the reduction portions is offset by a factor of k. The factor k provides the opportunity to overlap computation with communication by way of intervening phases.

7 Execution Strategy (Example)

    for (phase = 0; phase < k * NUM_PROCS; phase++) {
      Receive (reduc1_array_portion) from processor PROC_ID + 1;
      // main calculation loop
      for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
        node1 = nodeptr1[i];
        node2 = nodeptr2[i];
        force = h(node1, node2);
        reduc1[node1] += force;
        reduc1[node2] += -force;
      }
      . . .
      Send (reduc1_array_portion) to processor PROC_ID - 1;
    }

Notes: The send & receive are asynchronous. There are usually two indirection arrays, which together represent an edge or interaction. The main loop iterates over the edges local to the current phase.

8 Execution Strategy
- Make communication independent of the data distribution and of the values of the indirection arrays
- Exploit the MTA's ability to overlap communication & computation
- Challenge: partition iterations into phases (each iteration updates 2 or more reduction array elements)
Notes: The first two items are the goals of the strategy; compare with the traditional inspector/executor approach. Total communication volume = NUM_PROCS * REDUCTION_ARRAY_SIZE.

9 Execution Strategy (Updating Reduction Arrays)
Edge (4, 0) → Phases 2, 0 → assigned to Phase 0: buffer node 4's value during Phase 0 and update from the buffer during Phase 2.

    // main calculation loop
    for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
      node1 = nodeptr1[i];
      node2 = nodeptr2[i];
      force = h(node1, node2);
      reduc1[node1] += force;
      reduc1[node2] += -force;
    }
    // update from buffer loop
    for (i = loop2_pt[phase]; i < loop2_pt[phase + 1]; i++) {
      local_node = lbuffer_out[i];
      buffered_node = rbuffer_out[i];
      reduc1[local_node] += reduc1[buffered_node];
    }

Notes: Suppose we assign each edge to the lesser of its two phases. The compiler creates the second loop.

10 Runtime Processing Responsibilities
- Divide the iterations on each processor into phases
- Manage buffer space for the reduction arrays
- Set up the second loop

    // runtime preprocessing on each processor
    LightInspector (. . .);
    for (phase = 0; phase < k * NUM_PROCS; phase++) {
      Receive (. . .);
      // main calculation loop
      // second loop to update from buffer
      Send (. . .);
    }

Notes: The runtime processing makes the execution strategy possible. We call it LightInspector because it is significantly lighter weight than a traditional inspector: no inter-processor communication is required.

11 Runtime Processing (Example)
[Figure: worked example with k = 2 and 2 processors, thus 4 phases, and 8 nodes (vertices). Input: indirection arrays nodeptr1/nodeptr2 and the reduc1 array with its remote (buffer) area. Output: per-phase arrays nodeptr1_out/nodeptr2_out and copy1_out/copy2_out.]
Notes: Simple example to convey the idea.

12 Compiler Analysis
- Identify reduction array sections updated through an associative, commutative operator
- Identify indirection array (IA) sections
- Form reference groups of reduction array sections accessed through the same IA sections; each reference group can use the same LightInspector
- Built on the EARTH-C compiler infrastructure
Notes: This compiler analysis utilizes the execution strategy and runtime processing previously described.

13 Adaptive Codes

    for (tstep = 0; tstep < num_steps; tstep++) {
      if (tstep % update_freq == 0)
        update (nodeptr1, nodeptr2);  // update IAs
      for (i = 0; i < num_edges; i++) {
        node1 = nodeptr1[i];
        node2 = nodeptr2[i];
        force = h(node1, node2);
        reduc1[node1] += force;
        reduc1[node2] += -force;
      }
    }

Notes: One option is simply to re-run the inspector. Instead, we developed an incremental inspector and a pre-incremental inspector.

14 Runtime Processing (Adaptive)
2 4 6 8 reduc1 remote area Input: 7 Extra space for phase insertions nodeptr1 4 nodeptr2 2 Output: 9 9 2 Phase # 1 2 3 nodeptr1_out nodeptr2_out Simple example to convey idea Phase # 1 2 3 copy1_out 4 copy2_out 9

15 Runtime Processing (Adaptive)
Incremental inspector:
- One iteration over the edges, comparing new and old indirection array values
- Edges (iterations in the 1st loop) move to new phases if necessary (28 cases)
- Update edges (iterations in the 2nd loop) are modified if necessary
- Buffer locations are reused when possible
Pre-incremental inspector:
- Similar to the non-adaptive inspector
- Extra space is allocated for edge movement (mappings are maintained for efficiency)
- Saves values for subsequent runs of the incremental inspector

16 Experimental Results (Euler 10k, p = 0.02 and p = 0.10)
Notes: We introduce two parameters: p is the extent of adaptivity (the probability an edge changes); iters is the rate of adaptivity (the number of iterations before the IAs are modified). Euler is a key kernel; the whole benchmark would allow better amortization of overhead. iters = 5 is a little unrealistic; the realistic minimum is around 10.
- p = 0.02: iters = 5: absolute speedup 0.92, relative speedup (32 processors) 11.97; iters = 20: absolute 1.14, relative (32) 10.35
- p = 0.10: iters = 5: absolute 0.49, relative (32) 15.03; iters = 20: absolute 0.92, relative (32) 9.63

17 Experimental Results (Moldyn 2k, p = 0.02 and p = 0.10)
- p = 0.02: iters = 5: absolute speedup 1.10, relative 10.61; iters = 20: absolute 1.22, relative 10.01
- p = 0.10: iters = 5: absolute 0.80, relative 11.84; iters = 20: absolute 1.07, relative 9.84

18 Summary and Conclusions
For Class II codes:
- Frequency and volume of communication are independent of the contents of the indirection arrays
- No mesh partitioning or communication optimizations are required
- Overheads (locality) are incurred initially, but relative speedups are high
- Inspector times scale nearly linearly with the number of processors and with the extent of adaptivity
