Compiler and Runtime Support for Parallelizing Irregular Reductions on a Multithreaded Architecture
Gary M. Zoppetti, Gagan Agrawal, Rishi Kumar
--In this talk I will describe my thesis work…

Motivation: Irregular Reductions
- Frequently arise in scientific computations
- Widely studied in the context of distributed memory machines, shared memory machines, distributed shared memory machines, and uniprocessor caches
- Main difficulty: traditional compile-time optimizations cannot be applied
- Runtime optimizations: trade-off between runtime costs and efficiency of execution

Motivation: Multithreaded Architectures
- Multiprocessors based upon multithreading
- Support multiple threads of execution on each processor
- Support low-overhead context switching and thread initiation
- Low-cost point-to-point communication and synchronization

Problem Addressed
- Can we use multiprocessors based upon multithreading for irregular reductions?
- What kind of runtime and compiler support is required?
- What level of performance and scalability is achieved?

Outline
- Irregular Reductions
- Execution Strategy
- Runtime Support
- Compiler Analysis
- Experimental Results
- Related Work
- Summary

Irregular Reductions: Example

    for (tstep = 0; tstep < num_steps; tstep++) {
      for (i = 0; i < num_edges; i++) {
        node1 = nodeptr1[i];
        node2 = nodeptr2[i];
        force = h(node1, node2);
        reduc1[node1] += force;
        reduc1[node2] += -force;
      }
    }

- An unstructured mesh is used to model an irregular geometry (an airplane wing, or particle interactions, which are inherently sparse)
- The time-step loop iterates until convergence
--Point out the indirection arrays (nodeptr1, nodeptr2) and the reduction array (reduc1); the update operator is associative and commutative

Irregular Reductions
Irregular reduction loops:
- Elements of LHS arrays may be incremented in multiple iterations, but only using commutative & associative operators
- No loop-carried dependences other than those on elements of the reduction arrays
- One or more arrays are accessed using indirection arrays
- Codes from many scientific & engineering disciplines contain them (simulations involving irregular meshes, molecular dynamics, sparse codes)
- Irregular reductions are well studied for distributed memory, distributed shared memory, and cache optimization
- Compute-intensive

Execution Strategy: Overview
- Partition edges (interactions) among processors
- Challenge: updating the reduction arrays
- Divide the reduction arrays into NUM_PROCS portions with revolving ownership
- Execute NUM_PROCS phases on each processor
--Each processor will eventually own every reduction array portion

Execution Strategy
- To exploit multithreading, use k * NUM_PROCS phases and reduction portions
- [Diagram: phases vs. reduction portions for processors P0-P3; with 4 processors and k = 2, there are 8 phases]
- Ownership of reduction portions is offset by a factor of k (one possible schedule is sketched below)
- k provides an opportunity to overlap computation with communication by way of intervening phases
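The slide gives the rotation rule only pictorially. Below is a minimal sketch of one plausible ownership schedule in C; the function name portion_of, the modular formula, and the printed table are my assumptions for illustration, not taken from the paper.

    #include <stdio.h>

    /* Hypothetical sketch: which reduction portion processor `proc` owns
     * (and therefore updates locally) during `phase`, assuming
     * k * NUM_PROCS portions whose ownership rotates by one portion per
     * phase, with consecutive processors offset by a factor of k. */
    #define NUM_PROCS 4
    #define K 2
    #define NUM_PHASES (K * NUM_PROCS)

    static int portion_of(int proc, int phase) {
        return (proc * K + phase) % NUM_PHASES;
    }

    int main(void) {
        /* With 4 processors and k = 2 this prints an 8-phase schedule in
         * which every processor eventually owns every portion. */
        for (int phase = 0; phase < NUM_PHASES; phase++) {
            printf("phase %d:", phase);
            for (int proc = 0; proc < NUM_PROCS; proc++)
                printf("  P%d -> portion %d", proc, portion_of(proc, phase));
            printf("\n");
        }
        return 0;
    }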

Execution Strategy (Example)

    for (phase = 0; phase < k * NUM_PROCS; phase++) {
      Receive (reduc1_array_portion) from processor PROC_ID + 1;
      // main calculation loop
      for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
        node1 = nodeptr1[i];
        node2 = nodeptr2[i];
        force = h(node1, node2);
        reduc1[node1] += force;
        reduc1[node2] += -force;
      }
      . . .
      Send (reduc1_array_portion) to processor PROC_ID - 1;
    }

--Send & receive are asynchronous
--There are usually 2 indirection arrays, which together represent an edge or interaction
--Iterate over the edges local to the current phase

Execution Strategy
- Make communication independent of the data distribution and the values of the indirection arrays
- Exploit the MTA's ability to overlap communication & computation
- Challenge: partition the iterations into phases (each iteration updates 2 or more reduction array elements)
--These are the 2 goals of the execution strategy; mention the inspector/executor approach
--Total communication volume = NUM_PROCS * REDUCTION_ARRAY_SIZE (the arithmetic is checked below)
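A quick check of the volume quoted in the note above (my arithmetic, under the assumption that each of the k * NUM_PROCS phases sends exactly one reduction portion):

    \text{volume per processor per time step}
      = \underbrace{(k \cdot \mathrm{NUM\_PROCS})}_{\text{phases}}
        \cdot \underbrace{\frac{\mathrm{REDUCTION\_ARRAY\_SIZE}}{k \cdot \mathrm{NUM\_PROCS}}}_{\text{portion size}}
      = \mathrm{REDUCTION\_ARRAY\_SIZE}

Summed over the NUM_PROCS processors, this gives NUM_PROCS * REDUCTION_ARRAY_SIZE per time step, matching the note, and the amount does not depend on the indirection arrays.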

Execution Strategy (Updating Reduction Arrays)

    // main calculation loop
    for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
      node1 = nodeptr1[i];
      node2 = nodeptr2[i];
      force = h(node1, node2);
      reduc1[node1] += force;
      reduc1[node2] += -force;
    }
    // update from buffer loop
    for (i = loop2_pt[phase]; i < loop2_pt[phase + 1]; i++) {
      local_node = lbuffer_out[i];
      buffered_node = rbuffer_out[i];
      reduc1[local_node] += reduc1[buffered_node];
    }

--Suppose we assign each iteration to the lesser of its two phases
--The compiler creates a second loop to apply the buffered updates

Runtime Processing
Responsibilities:
- Divide the iterations on each processor into phases
- Manage buffer space for the reduction arrays
- Set up the second loop

    // runtime preprocessing on each processor
    LightInspector(. . .);

    for (phase = 0; phase < k * NUM_PROCS; phase++) {
      Receive(. . .);
      // main calculation loop
      // second loop to update from buffer
      Send(. . .);
    }

--To make the execution strategy possible, the runtime processing is responsible for the items above
--We call it LightInspector because it is significantly lighter weight than a traditional inspector: no inter-processor communication is required

[Diagram: LightInspector example with k = 2 and 2 processors, hence 4 phases; number of nodes (vertices) = 8. Inputs: indirection arrays nodeptr1 and nodeptr2. Outputs: per-phase arrays nodeptr1_out, nodeptr2_out, copy1_out, and copy2_out, plus the reduc1 array with its remote (buffer) area.]
--A hypothetical sketch of such an inspector pass follows.
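To make the inspector's job concrete, here is a minimal C sketch of a LightInspector-style pass based on the "lesser of the two phases" rule above; the function name, parameters, and the simplifying assumption that portion p is locally owned during phase p are mine, for illustration, not the authors' implementation.

    /* Hypothetical LightInspector sketch (illustration only).  It assigns
     * each local edge to the lesser of the phases owning its two endpoint
     * nodes and counts the edges per phase; a real inspector would also
     * reorder the indirection arrays into nodeptr1_out/nodeptr2_out and
     * set up the buffer-update (second) loop.  No inter-processor
     * communication is needed, which is why it is "light". */
    void light_inspector(int num_local_edges,
                         const int *nodeptr1, const int *nodeptr2,
                         int num_phases, int nodes_per_portion,
                         int *phase_of_edge, int *edges_in_phase)
    {
        for (int p = 0; p < num_phases; p++)
            edges_in_phase[p] = 0;

        for (int i = 0; i < num_local_edges; i++) {
            /* Simplifying assumption: portion p is owned locally in phase p. */
            int phase1 = nodeptr1[i] / nodes_per_portion;
            int phase2 = nodeptr2[i] / nodes_per_portion;
            int phase  = (phase1 < phase2) ? phase1 : phase2;  /* lesser of the two */
            phase_of_edge[i] = phase;
            edges_in_phase[phase]++;
        }
    }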

Compiler Analysis
- Identify reduction array sections updated through an associative, commutative operator
- Identify indirection array (IA) sections
- Form reference groups of reduction array sections accessed through the same IA sections
- Each reference group can use the same LightInspector (a small illustration follows below)
- EARTH-C compiler infrastructure
--Now I'll present the compiler analysis that uses the execution strategy and runtime processing previously described
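A small, hypothetical illustration of the reference-group idea (the arrays, functions, and signature below are made up): both reduction arrays are updated through the same indirection-array sections with an associative, commutative operator, so they fall into one reference group and a single LightInspector can serve both.

    /* Hypothetical example: reduc1 and reduc2 are accessed through the same
     * sections of nodeptr1/nodeptr2 with associative, commutative updates
     * (+=), so the compiler places them in one reference group. */
    void reference_group_example(int num_edges,
                                 const int *nodeptr1, const int *nodeptr2,
                                 double *reduc1, double *reduc2,
                                 double (*f)(int, int), double (*g)(int, int))
    {
        for (int i = 0; i < num_edges; i++) {
            int node1 = nodeptr1[i];
            int node2 = nodeptr2[i];
            reduc1[node1] += f(node1, node2);   /* same IA sections ...    */
            reduc2[node2] += g(node1, node2);   /* ... one reference group */
        }
    }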

Experimental Results
- Three scientific kernels:
  Euler: 2k and 10k meshes
  Moldyn: 2k and 10k datasets
  Sparse MVM: class W (7k), A (14k), & B (75k) matrices
- Distribution of edges (interactions): block, cyclic, block-cyclic (in thesis); simple owner functions are sketched below
- Three values of k (1, 2, & 4)
- EARTH-MANNA (SEMi) platform
--MVM is a kernel from the NAS Conjugate Gradient benchmark, run on different classes of matrices (W, A, B)
--Recall that edges denote interactions and can be reordered: how do we partition them onto processors?
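For reference, a minimal sketch of the block and cyclic edge distributions mentioned above (block-cyclic combines the two); the helper names and signatures are mine, for illustration only.

    /* Hypothetical owner functions mapping an edge index to a processor. */
    int owner_block(int edge, int num_edges, int num_procs) {
        int chunk = (num_edges + num_procs - 1) / num_procs;  /* ceiling division */
        return edge / chunk;
    }

    int owner_cyclic(int edge, int num_procs) {
        return edge % num_procs;
    }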

Experimental Results (Euler 10k)
--We do not report k = 1 and k = 4 with the block distribution because the block distribution typically resulted in load imbalance and k = 2 outperformed k = 1 and k = 4
--Best absolute speedup, 2b (k = 2, block): 1.16; best relative speedup, 2c (k = 2, cyclic), on 32 processors: 10.35
--On 32 processors (out of 16): 1c: 7.62; 2c: 10.35; 4c: 9.93; 2b: 6.94

Experimental Results (Moldyn 2k)
--Best absolute speedup, 1c (k = 1, cyclic): 1.31; best relative speedup, 2c (k = 2, cyclic), on 32 processors: 9.68
--On 32 processors: 1c: 7.50; 2c: 9.68; 4c: 8.65; 2b: 6.47
--k = 1 means fewer phases, therefore better locality, and also less threading overhead
--(Moldyn 10k results are in the thesis)

Experimental Results (MVM Class A)
--We did not experiment with other distributions because the block distribution achieves near-linear speedups
--This is still irregular code: the indirection array is used to access the vector, not the reduction array
--Best absolute speedups: 1.95, 4.04, 8.51, 16.98, 30.65
--On 32 processors: k = 1: 28.41; k = 2: 30.65; k = 4: 30.21

Summary and Conclusions
- Execution strategy: the frequency and volume of communication are independent of the contents of the indirection arrays
- No mesh partitioning or communication optimizations required
- Initially incurs overheads (locality), but achieves high relative speedups