Compiler and Runtime Support for Adaptive Irregular Applications on a Multithreaded Architecture
Gary M. Zoppetti, Gagan Agrawal

Presentation transcript:

Compiler and Runtime Support for Adaptive Irregular Applications on a Multithreaded Architecture
Gary M. Zoppetti, Gagan Agrawal
-- in this talk I will describe my thesis work…

Multithreaded Architectures (MTAs)
MTA characteristics:
- multiple threads of execution in hardware; each processor maintains several loci of control
- fast thread switching
- efficient communication & synchronization mechanisms
- special hardware and/or software runtime system (RTS) support
MTA capabilities:
- masking communication and synchronization latencies
- dynamic load balancing
Well suited for irregular applications.
-- MTAs overcome conventional architecture limitations by maintaining several contexts; the first two are achieved through RTS support
-- unstructured communication typically results in high latency, so the ability to mask latency is important
-- dynamic control flow typically results in load imbalances, so the ability of the architecture to adapt to a changing workload is vital

Unstructured Mesh Processing

for (tstep = 0; tstep < num_steps; tstep++) {
  for (i = 0; i < num_edges; i++) {
    node1 = nodeptr1[i];
    node2 = nodeptr2[i];
    force = h(node1, node2);
    reduc1[node1] += force;
    reduc1[node2] += -force;
  }
}

-- an unstructured mesh is used to model an irregular geometry (an airplane wing, or particle interactions, which are inherently sparse)
-- the time-step loop iterates until convergence
-- nodeptr1 and nodeptr2 are indirection arrays; reduc1 is a reduction array, updated only through an associative, commutative operator
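A self-contained toy version of this pattern, with a hypothetical 4-node, 5-edge mesh and a dummy h() standing in for the real interaction function (the data is purely illustrative), shows how the indirection arrays drive the reduction:

#include <stdio.h>

#define NUM_NODES 4
#define NUM_EDGES 5

/* Dummy interaction function standing in for h(); the real code would
   compute a physical force from the two node states. */
static double h(int n1, int n2) { return (double)(n1 - n2); }

int main(void) {
    /* Indirection arrays: edge i connects nodeptr1[i] and nodeptr2[i]. */
    int nodeptr1[NUM_EDGES] = {0, 0, 1, 2, 1};
    int nodeptr2[NUM_EDGES] = {1, 2, 2, 3, 3};
    double reduc1[NUM_NODES] = {0.0};

    for (int i = 0; i < NUM_EDGES; i++) {
        int node1 = nodeptr1[i];
        int node2 = nodeptr2[i];
        double force = h(node1, node2);
        reduc1[node1] += force;    /* accumulate with +, an associative, */
        reduc1[node2] += -force;   /* commutative operator               */
    }

    for (int n = 0; n < NUM_NODES; n++)
        printf("node %d: %f\n", n, reduc1[n]);
    return 0;
}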

Irregular Reductions
Irregular reduction loops:
- elements of LHS arrays may be incremented in multiple iterations, but only using commutative & associative operators
- no loop-carried dependences other than those on elements of the reduction arrays
- one or more arrays are accessed using indirection arrays
Codes from many scientific & engineering disciplines contain them (simulations involving irregular meshes, molecular dynamics, sparse codes).
Irregular reductions are compute-intensive and well studied for distributed-memory (DM), distributed-shared-memory (DSM), and cache optimization.

Execution Strategy (Overview)
- Partition the edges (interactions) among processors
- Challenge: updating the reduction arrays
- Divide the reduction arrays into NUM_PROCS portions, with revolving ownership
- Execute NUM_PROCS phases on each processor
-- each processor eventually will own every reduction array portion

Execution Strategy
To exploit multithreading, use k * NUM_PROCS phases and reduction portions.
[Figure: table of reduction-portion ownership for P0–P3 across phases 0–7, i.e., 4 processors and k = 2, so 8 phases]
-- ownership of reduction portions is offset by a factor of k
-- k provides an opportunity to overlap computation with communication by way of intervening phases (one possible concrete ownership schedule is sketched below)
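One plausible concrete form of the revolving ownership, reconstructed from the description above rather than taken from the paper: processor q starts k portions ahead of processor q+1 and advances by one portion per phase, so each portion passes from processor q to processor q-1 after k phases. owned_portion() and the constants are illustrative.

#include <stdio.h>

#define NUM_PROCS  4
#define K          2
#define NUM_PHASES (K * NUM_PROCS)

/* Portion owned (updated locally) by processor `proc` during `phase`,
   assuming ownership is offset by a factor of K across processors and
   rotates by one portion per phase. */
static int owned_portion(int proc, int phase) {
    return (proc * K + phase) % NUM_PHASES;
}

int main(void) {
    for (int phase = 0; phase < NUM_PHASES; phase++) {
        printf("phase %d:", phase);
        for (int proc = 0; proc < NUM_PROCS; proc++)
            printf("  P%d->portion %d", proc, owned_portion(proc, phase));
        printf("\n");
    }
    return 0;
}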

Execution Strategy (Example)

for (phase = 0; phase < k * NUM_PROCS; phase++) {
  Receive (reduc1_array_portion) from processor PROC_ID + 1;

  // main calculation loop
  for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
    node1 = nodeptr1[i];
    node2 = nodeptr2[i];
    force = h(node1, node2);
    reduc1[node1] += force;
    reduc1[node2] += -force;
  }
  . . .
  Send (reduc1_array_portion) to processor PROC_ID - 1;
}

-- send & receive are asynchronous (a sketch of the overlap with nonblocking operations follows)
-- there are usually 2 indirection arrays that represent an edge or interaction
-- each phase iterates over the edges local to the current phase
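To make the asynchrony concrete, here is a minimal sketch of the phase pipeline using nonblocking MPI as a stand-in for the EARTH runtime's split-phase communication; the paper targets EARTH, not MPI. PORTION_SIZE, compute_phase(), and update_from_buffer() are placeholders, and the receive is posted at the top of each phase so the transfer overlaps that phase's computation.

#include <mpi.h>
#include <stdlib.h>

#define PORTION_SIZE 1024
#define K            2

/* Placeholder stubs for the two loops shown on the previous slide. */
static void compute_phase(double *portion, int phase)      { (void)portion; (void)phase; }
static void update_from_buffer(double *portion, int phase) { (void)portion; (void)phase; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int proc_id, num_procs;
    MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    double *current  = calloc(PORTION_SIZE, sizeof(double));
    double *incoming = calloc(PORTION_SIZE, sizeof(double));
    int from = (proc_id + 1) % num_procs;             /* receive from PROC_ID + 1 */
    int to   = (proc_id - 1 + num_procs) % num_procs; /* send to PROC_ID - 1      */

    for (int phase = 0; phase < K * num_procs; phase++) {
        MPI_Request recv_req, send_req;
        /* Post the receive early so the incoming portion overlaps this
           phase's computation on the portion we currently own. */
        MPI_Irecv(incoming, PORTION_SIZE, MPI_DOUBLE, from, 0,
                  MPI_COMM_WORLD, &recv_req);

        compute_phase(current, phase);       /* main calculation loop    */
        update_from_buffer(current, phase);  /* replay buffered updates  */

        /* Pass the finished portion to the left neighbor, then adopt the
           portion that arrived from the right neighbor. */
        MPI_Isend(current, PORTION_SIZE, MPI_DOUBLE, to, 0,
                  MPI_COMM_WORLD, &send_req);
        MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
        MPI_Wait(&send_req, MPI_STATUS_IGNORE);

        double *tmp = current; current = incoming; incoming = tmp;
    }

    free(current);
    free(incoming);
    MPI_Finalize();
    return 0;
}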

Execution Strategy
- Make communication independent of the data distribution and of the values of the indirection arrays
- Exploit the MTA's ability to overlap communication & computation
- Challenge: partition the iterations into phases (each iteration updates 2 or more reduction array elements)
-- the first two items are the two goals of the strategy; mention the inspector/executor approach
-- total communication volume = NUM_PROCS * REDUCTION_ARRAY_SIZE
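The communication volume in the note follows from the phase structure; this is a back-of-the-envelope check, assuming each portion holds REDUCTION_ARRAY_SIZE / (k * NUM_PROCS) elements and each processor sends one portion per phase:

    NUM_PROCS processors
      * (k * NUM_PROCS) phases
      * REDUCTION_ARRAY_SIZE / (k * NUM_PROCS) elements per portion
      = NUM_PROCS * REDUCTION_ARRAY_SIZE elements per time step,

independent of the contents of the indirection arrays.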

Execution Strategy (Updating Reduction Arrays)
Example: edge (4, 0) maps to phases 2 and 0, so it is assigned to phase 0; node 4's value is buffered during phase 0 and the update is applied from the buffer during phase 2.

// main calculation loop
for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
  node1 = nodeptr1[i];
  node2 = nodeptr2[i];
  force = h(node1, node2);
  reduc1[node1] += force;
  reduc1[node2] += -force;
}

// update from buffer loop
for (i = loop2_pt[phase]; i < loop2_pt[phase + 1]; i++) {
  local_node = lbuffer_out[i];
  buffered_node = rbuffer_out[i];
  reduc1[local_node] += reduc1[buffered_node];
}

-- suppose we assign each edge to the lesser of its two phases
-- the compiler creates this second loop

Runtime Processing
Responsibilities:
- divide the iterations on each processor into phases
- manage buffer space for the reduction arrays
- set up the second loop

// runtime preprocessing on each processor
LightInspector (. . .);

for (phase = 0; phase < k * NUM_PROCS; phase++) {
  Receive (. . .);
  // main calculation loop
  // second loop to update from buffer
  Send (. . .);
}

-- to make the execution strategy possible, the runtime processing is responsible for the items above
-- we call it LightInspector because it is significantly lighter weight than a traditional inspector: no inter-processor communication is required (a simplified sketch follows)
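A minimal sketch of what such a LightInspector pass might look like, assuming the "lesser of two phases" rule from the previous slide. phase_of(), edge_phase, and the count arrays are illustrative names, not the actual runtime's data structures, and phase_of() is only a stand-in for the revolving-ownership mapping.

#include <string.h>

#define NUM_PHASES 8            /* k * NUM_PROCS */
#define MAX_EDGES  100000

static int edge_phase[MAX_EDGES];    /* phase in which each edge executes     */
static int loop1_count[NUM_PHASES];  /* edges per phase (main loop)           */
static int loop2_count[NUM_PHASES];  /* buffered updates per phase (2nd loop) */

/* Stand-in for the revolving-ownership mapping: the phase in which `node`'s
   reduction-array portion is local to this processor. */
static int phase_of(int node) { return node % NUM_PHASES; }

/* One purely local pass over the edges; no inter-processor communication. */
void light_inspector(const int *nodeptr1, const int *nodeptr2, int num_edges)
{
    memset(loop1_count, 0, sizeof loop1_count);
    memset(loop2_count, 0, sizeof loop2_count);

    for (int i = 0; i < num_edges; i++) {
        int p1 = phase_of(nodeptr1[i]);
        int p2 = phase_of(nodeptr2[i]);
        int early = p1 < p2 ? p1 : p2;   /* execute the edge in the lesser phase */
        int late  = p1 < p2 ? p2 : p1;

        edge_phase[i] = early;
        loop1_count[early]++;
        if (early != late)
            loop2_count[late]++;  /* the other node's contribution is buffered
                                     and replayed when its portion is local   */
    }
    /* A second pass (omitted) would turn the counts into loop1_pt / loop2_pt
       offsets and fill the reordered indirection and buffer arrays. */
}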

Runtime Processing (Example)
[Figure: worked inspector example showing the input indirection arrays nodeptr1 and nodeptr2, the reduc1 array with its remote/buffer area, and the per-phase outputs nodeptr1_out, nodeptr2_out, copy1_out, and copy2_out]
-- k = 2 and 2 processors, thus 4 phases
-- number of nodes (vertices) = 8

Compiler Analysis
- Identify reduction array sections updated through an associative, commutative operator
- Identify indirection array (IA) sections
- Form reference groups of reduction array sections accessed through the same IA sections (example below)
- Each reference group can use the same LightInspector
- EARTH-C compiler infrastructure
-- now I'll present the compiler analysis that utilizes the execution strategy and runtime processing previously described
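For instance, in the illustrative fragment below (not taken from the benchmarks; g() is a hypothetical second interaction kernel), reduc1 and reduc2 are both updated through the same nodeptr1/nodeptr2 sections, so they would fall into one reference group and share a single LightInspector:

static double h(int a, int b) { return (double)(a - b); }
static double g(int a, int b) { return (double)(a + b); }

static void interactions(int num_edges,
                         const int *nodeptr1, const int *nodeptr2,
                         double *reduc1, double *reduc2)
{
    for (int i = 0; i < num_edges; i++) {
        int node1 = nodeptr1[i];
        int node2 = nodeptr2[i];
        double force  = h(node1, node2);
        double energy = g(node1, node2);
        reduc1[node1] += force;     /* same indirection-array sections ... */
        reduc1[node2] += -force;
        reduc2[node1] += energy;    /* ... so reduc2 joins reduc1's        */
        reduc2[node2] += energy;    /*     reference group                 */
    }
}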

Adaptive Codes

for (tstep = 0; tstep < num_steps; tstep++) {
  if (tstep % update_freq == 0)
    update (nodeptr1, nodeptr2);  // update IAs
  for (i = 0; i < num_edges; i++) {
    node1 = nodeptr1[i];
    node2 = nodeptr2[i];
    force = h(node1, node2);
    reduc1[node1] += force;
    reduc1[node2] += -force;
  }
}

-- we could just re-run the inspector whenever the indirection arrays change
-- instead, we developed an incremental inspector and a pre-incremental inspector

Runtime Processing (Adaptive)
[Figure: adaptive version of the previous inspector example, with extra space reserved in each phase for edge insertions; inputs nodeptr1, nodeptr2, and reduc1 with its remote area, and per-phase outputs nodeptr1_out, nodeptr2_out, copy1_out, and copy2_out]
-- a simple example to convey the idea

Runtime Processing (Adaptive)
Incremental inspector:
- one iteration over the edges, comparing new and old indirection array values
- edges (iterations in the 1st loop) move to new phases if necessary (28 cases)
- update edges (iterations in the 2nd loop) are modified if necessary
- buffer locations are reused when possible
Pre-incremental inspector:
- similar to the non-adaptive inspector
- extra space is allocated for edge movement (mappings are maintained for efficiency)
- saves values for subsequent runs of the incremental inspector
(a simplified sketch of the incremental pass follows)
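A heavily simplified sketch of the incremental idea, reusing the illustrative names from the earlier inspector sketch; the real implementation distinguishes 28 cases and also patches the update-from-buffer loop and the buffer mappings, none of which is shown here:

#define NUM_PHASES 8
#define MAX_EDGES  100000

static int edge_phase[MAX_EDGES];    /* current phase of each edge */
static int loop1_count[NUM_PHASES];  /* edges per phase            */

/* Stand-in for the revolving-ownership mapping (see the earlier sketch). */
static int phase_of(int node) { return node % NUM_PHASES; }

/* Placeholder: the real runtime also relocates the edge's entries in the
   reordered indirection arrays and reuses buffer slots when possible. */
static void move_edge(int edge, int from_phase, int to_phase)
{
    (void)edge;
    loop1_count[from_phase]--;
    loop1_count[to_phase]++;
}

/* One pass comparing old and new indirection-array values; only edges whose
   endpoints changed are reconsidered. */
void incremental_inspector(int num_edges,
                           const int *old_nodeptr1, const int *old_nodeptr2,
                           const int *new_nodeptr1, const int *new_nodeptr2)
{
    for (int i = 0; i < num_edges; i++) {
        if (new_nodeptr1[i] == old_nodeptr1[i] &&
            new_nodeptr2[i] == old_nodeptr2[i])
            continue;                                /* edge unchanged */

        int p1 = phase_of(new_nodeptr1[i]);
        int p2 = phase_of(new_nodeptr2[i]);
        int new_phase = p1 < p2 ? p1 : p2;           /* lesser-phase rule */

        if (new_phase != edge_phase[i]) {
            move_edge(i, edge_phase[i], new_phase);
            edge_phase[i] = new_phase;
        }
    }
}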

Experimental Results (Euler, 10k mesh)
-- we've introduced two parameters: p is the extent of adaptivity (the probability that an edge changes); Iters is the rate of adaptivity (the number of iterations before the IAs are modified)
-- this is the key kernel; the whole benchmark would allow better amortization of overhead
-- Iters = 5 is a little unrealistic: the minimum is around 10
Euler 10k, p = 0.02:
- Iters = 5: absolute speedup 0.92, relative speedup (32 processors) 11.97
- Iters = 20: absolute speedup 1.14, relative speedup (32 processors) 10.35
Euler 10k, p = 0.10:
- Iters = 5: absolute speedup 0.49, relative speedup (32 processors) 15.03
- Iters = 20: absolute speedup 0.92, relative speedup (32 processors) 9.63

Experimental Results (Moldyn, 2k)
Moldyn 2k, p = 0.02:
- Iters = 5: absolute speedup 1.10, relative speedup 10.61
- Iters = 20: absolute speedup 1.22, relative speedup 10.01
Moldyn 2k, p = 0.10:
- Iters = 5: absolute speedup 0.80, relative speedup 11.84
- Iters = 20: absolute speedup 1.07, relative speedup 9.84

Summary and Conclusions
Class II:
- frequency and volume of communication are independent of the contents of the indirection arrays
- no mesh partitioning or communication optimizations are required
- initially incur overheads (locality), but achieve high relative speedups
- near-linear scaling of inspector times with respect to the number of processors and the extent of adaptivity