Presentation transcript:

MPI and OpenMP Paradigms on Cluster of SMP Architectures: The Vacancy Tracking Algorithm for Multi-Dimensional Array Transposition
Yun (Helen) He and Chris Ding
Lawrence Berkeley National Laboratory
SC2002, 11/19/2002

Outline
- Introduction
  - Two-array transpose method
  - In-place vacancy tracking method
  - Performance on single CPU
- Parallelization of the vacancy tracking method
  - Pure OpenMP
  - Pure MPI
  - Hybrid MPI/OpenMP
- Performance
  - Scheduling for pure OpenMP
  - Pure MPI and pure OpenMP within one node
  - Pure MPI and hybrid MPI/OpenMP across nodes
- Conclusions

Two-Array Transpose Method
- Reshuffle phase: B[k1,k3,k2] ← A[k1,k2,k3], using auxiliary array B.
- Copy-back phase: A ← B.
- Combined effect: A'[k1,k3,k2] ← A[k1,k2,k3].
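A minimal Fortran sketch of this two-array approach, for illustration only (the subroutine name and the flat view of A are assumptions, not the authors' code): the triple loop performs the reshuffle into the auxiliary array, and the copy-back overwrites A with the A(N1,N3,N2) layout.

  ! Illustrative two-array transpose: A(N1,N2,N3) -> A(N1,N3,N2).
  ! A is passed as a flat array so the same storage can hold both shapes.
  subroutine two_array_transpose(a, n1, n2, n3)
    implicit none
    integer, intent(in) :: n1, n2, n3
    real, intent(inout) :: a(n1*n2*n3)
    real, allocatable   :: b(:,:,:)          ! auxiliary array B(N1,N3,N2)
    integer :: k1, k2, k3

    allocate(b(n1, n3, n2))

    ! Reshuffle phase: B(k1,k3,k2) <- A(k1,k2,k3)
    do k3 = 1, n3
      do k2 = 1, n2
        do k1 = 1, n1
          b(k1, k3, k2) = a(k1 + (k2-1)*n1 + (k3-1)*n1*n2)
        end do
      end do
    end do

    ! Copy-back phase: A <- B, with A now holding the A(N1,N3,N2) layout
    a = reshape(b, (/ n1*n2*n3 /))
    deallocate(b)
  end subroutine two_array_transpose

Both the auxiliary storage and the copy-back traffic are what the vacancy tracking method on the following slides eliminates.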

Vacancy Tracking Method
- A(3,2) → A(2,3), tracking cycle: 1 – 3 – 4 – 2 – 1
- A(2,3,4) → A(3,4,2), tracking cycles: 1 – 2 – 4 – 8 – 16 – 9 – 18 – 13 – 3 – 6 – 12 – 1 and 5 – 10 – 20 – 17 – 11 – 22 – 21 – 19 – 15 – 7 – 14 – 5
- Cycles are closed and non-overlapping.

Algorithm to Generate Tracking Cycles

  ! For a 2D array A, viewed as A(N1,N2) at input and as A(N2,N1) at output.
  ! Starting with (i1,i2), find the vacancy tracking cycle:
  ioffset_start = index_to_offset(N1, N2, i1, i2)
  ioffset_next  = -1
  tmp           = A(ioffset_start)
  ioffset       = ioffset_start
  do while (ioffset_next .ne. ioffset_start)                      ! (C.1)
     call offset_to_index(ioffset, N2, N1, j1, j2)    ! N1, N2 exchanged
     ioffset_next = index_to_offset(N1, N2, j2, j1)   ! j1, j2 exchanged
     if (ioffset .ne. ioffset_next) then
        A(ioffset) = A(ioffset_next)
        ioffset    = ioffset_next
     end if
  end do
  A(ioffset_next) = tmp
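The pseudocode assumes two index helpers that are not shown on the slide. A minimal sketch of what they could look like, assuming 0-based offsets and Fortran column-major order (the definitions in the authors' actual code may differ):

  ! Offset of element (i1,i2) in an N1 x N2 array (0-based, column-major).
  ! n2 is unused but kept to mirror the call sites in (C.1).
  integer function index_to_offset(n1, n2, i1, i2)
    implicit none
    integer, intent(in) :: n1, n2, i1, i2
    index_to_offset = i1 + i2*n1
  end function index_to_offset

  ! Inverse mapping: recover (i1,i2) from a 0-based offset in an N1 x N2 array.
  subroutine offset_to_index(ioffset, n1, n2, i1, i2)
    implicit none
    integer, intent(in)  :: ioffset, n1, n2
    integer, intent(out) :: i1, i2
    i1 = mod(ioffset, n1)
    i2 = ioffset / n1
  end subroutine offset_to_index

With these definitions, running (C.1) from offset 1 on a 3x2 array reproduces the cycle 1 – 3 – 4 – 2 – 1 shown on the previous slide.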

In-Place vs. Two-Array

Memory Access Volume and Pattern
- Eliminates the auxiliary array and the copy-back phase, cutting memory access volume in half.
- Accesses even less memory because length-1 cycles are never touched.
- Has a more irregular memory access pattern than the traditional method, but the gap shrinks once the size of each move exceeds the cache-line size.
- Same as the two-array method: memory access is inefficient due to the large stride.

Multi-Threaded Parallelism

Key: independence of the tracking cycles.

  !$OMP PARALLEL DO DEFAULT (PRIVATE)
  !$OMP& SHARED (N_cycles, info_table, Array)                     (C.2)
  !$OMP& SCHEDULE (AFFINITY)
  do k = 1, N_cycles
     ! an inner loop of memory exchange for each cycle, using info_table
  enddo
  !$OMP END PARALLEL DO
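Putting the pieces together, here is a hypothetical expansion of the loop body, assuming info_table(1,k) stores the starting offset of cycle k and reusing the index helpers sketched earlier. SCHEDULE(AFFINITY) is an IBM extension, so this sketch uses the standard DYNAMIC schedule; it illustrates the idea rather than reproducing the authors' routine.

  subroutine threaded_in_place_transpose(Array, N1, N2, info_table, N_cycles)
    implicit none
    integer, intent(in)    :: N1, N2, N_cycles
    integer, intent(in)    :: info_table(1, N_cycles)   ! assumed: row 1 = cycle start offsets
    real,    intent(inout) :: Array(0:N1*N2-1)          ! 0-based offsets, as in (C.1)
    integer :: k, j1, j2, ioffset, ioffset_start, ioffset_next
    real    :: tmp
    integer, external :: index_to_offset

  !$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(N_cycles, info_table, Array, N1, N2) &
  !$OMP& SCHEDULE(DYNAMIC)
    do k = 1, N_cycles
       ioffset_start = info_table(1, k)
       ioffset_next  = -1
       tmp     = Array(ioffset_start)
       ioffset = ioffset_start
       do while (ioffset_next .ne. ioffset_start)       ! same walk as (C.1)
          call offset_to_index(ioffset, N2, N1, j1, j2)
          ioffset_next = index_to_offset(N1, N2, j2, j1)
          if (ioffset .ne. ioffset_next) then
             Array(ioffset) = Array(ioffset_next)
             ioffset        = ioffset_next
          end if
       end do
       Array(ioffset_next) = tmp
    end do
  !$OMP END PARALLEL DO
  end subroutine threaded_in_place_transpose

Because each cycle touches a disjoint set of offsets, the threads never write to the same location and no synchronization is needed inside the loop.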

Pure MPI
A(N1,N2,N3) → A(N1,N3,N2) on P processors:
(G1) Do a local transpose on the local array: A(N1,N2,N3/P) → A(N1,N3/P,N2).
(G2) Do a global all-to-all exchange of data blocks, each of size N1(N3/P)(N2/P).
(G3) Do a local transpose on the local array A(N1,N3/P,N2), viewed as A(N1N3/P,N2/P,P) → A(N1N3/P,P,N2/P), viewed as A(N1,N3,N2/P).

Global All-to-All Exchange

  ! All processors simultaneously do the following:
  do q = 1, P - 1
     send a message to destination processor destID
     receive a message from source processor srcID
  end do
  ! where destID = srcID = (myID XOR q)
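A minimal MPI sketch of this pairwise exchange pattern, assuming each rank keeps its outgoing and incoming blocks in 2D buffers indexed by partner rank (the buffer layout and block size blk are assumptions made for illustration):

  subroutine pairwise_alltoall(sendbuf, recvbuf, blk, P, myID)
    use mpi
    implicit none
    integer, intent(in)  :: blk, P, myID
    real,    intent(in)  :: sendbuf(blk, 0:P-1)    ! block to send to each partner
    real,    intent(out) :: recvbuf(blk, 0:P-1)    ! block received from each partner
    integer :: q, partner, ierr, status(MPI_STATUS_SIZE)

    ! Local block (partner = myID) needs no communication; copy it directly.
    recvbuf(:, myID) = sendbuf(:, myID)

    ! Note: the XOR pairing assumes P is a power of two.
    do q = 1, P - 1
       partner = ieor(myID, q)                     ! destID = srcID = myID XOR q
       call MPI_Sendrecv(sendbuf(1, partner), blk, MPI_REAL, partner, 0, &
                         recvbuf(1, partner), blk, MPI_REAL, partner, 0, &
                         MPI_COMM_WORLD, status, ierr)
    end do
  end subroutine pairwise_alltoall

The XOR schedule pairs ranks symmetrically in every step, which only covers all partners when P is a power of two; MPI_Alltoall could be used instead and lets the MPI library choose the exchange schedule.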

Total Transpose Time (Pure MPI)
Using a "latency + message-size / bandwidth" model:

  T_P = 2*M*N1*N2*N3/P + 2*L*(P-1) + [2*N1*N3*N2/(B*P)] * [(P-1)/P]

where
  P --- total number of CPUs
  M --- average memory access time per element
  L --- communication latency
  B --- communication bandwidth

Total Transpose Time (Hybrid MPI/OpenMP)
Parallelize the local transposes (G1) and (G3) with OpenMP, with N_CPU = N_MPI * N_threads:

  T = 2*M*N1*N2*N3/N_CPU + 2*L*(N_MPI - 1) + [2*N1*N3*N2/(B*N_MPI)] * [(N_MPI - 1)/N_MPI]

where
  N_CPU --- total number of CPUs
  N_MPI --- number of MPI tasks
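For a feel of what the two models predict, a small sketch that evaluates both formulas; the machine parameters M, L, and B below are placeholders chosen for illustration, not measured values for any particular machine.

  program transpose_time_model
    implicit none
    real    :: M, L, B, t_pure, t_hybrid
    integer :: N1, N2, N3, P, n_mpi

    N1 = 64; N2 = 512; N3 = 128   ! one of the array sizes used in the talk
    P  = 64                       ! total CPUs (N_CPU)
    n_mpi = 16                    ! MPI tasks in the hybrid run (4 threads each)
    M = 1.0e-8                    ! assumed memory access time per element (s)
    L = 2.0e-5                    ! assumed latency (s)
    B = 4.0e8                     ! assumed bandwidth (elements/s)

    ! Pure MPI model: T_P = 2MN1N2N3/P + 2L(P-1) + [2N1N3N2/(BP)][(P-1)/P]
    t_pure = 2.0*M*N1*N2*N3/P + 2.0*L*(P-1) &
           + (2.0*real(N1)*N3*N2/(B*P)) * (real(P-1)/P)

    ! Hybrid model: same compute term (N_CPU = P), but the latency and
    ! bandwidth terms scale with the number of MPI tasks only.
    t_hybrid = 2.0*M*N1*N2*N3/P + 2.0*L*(n_mpi-1) &
             + (2.0*real(N1)*N3*N2/(B*n_mpi)) * (real(n_mpi-1)/n_mpi)

    print *, 'pure MPI model (s): ', t_pure
    print *, 'hybrid model (s):   ', t_hybrid
  end program transpose_time_model

The model makes the trade-off visible: the hybrid run pays latency for fewer partners (N_MPI - 1 instead of P - 1) but sends proportionally larger messages through each MPI task.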

Scheduling for OpenMP
- Static: loops are divided into n_thrds partitions, each containing ceiling(n_iters/n_thrds) iterations.
- Affinity: loops are divided into n_thrds partitions, each containing ceiling(n_iters/n_thrds) iterations; each partition is then subdivided into chunks containing ceiling(n_left_iters_in_partition/2) iterations.
- Guided: loops are divided into progressively smaller chunks until the chunk size is 1. The first chunk contains ceiling(n_iters/n_thrds) iterations; each subsequent chunk contains ceiling(n_left_iters/n_thrds) iterations.
- Dynamic, n: loops are divided into chunks each containing n iterations. We test different chunk sizes.
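For reference, a sketch of how these policies map onto the SCHEDULE clause of the cycle loop (the loop body is a placeholder; AFFINITY is an IBM extension rather than part of the OpenMP standard):

  ! Example: chunks of 10 iterations handed out on demand. Replacing the
  ! clause with SCHEDULE(STATIC), SCHEDULE(GUIDED), or, on IBM systems,
  ! SCHEDULE(AFFINITY) selects the other policies compared above.
  !$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(N_cycles, info_table, Array) &
  !$OMP& SCHEDULE(DYNAMIC, 10)
  do k = 1, N_cycles
     ! memory exchange for cycle k, as in (C.2)
  enddo
  !$OMP END PARALLEL DO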

Scheduling for OpenMP within One Node
- 64x512x128: N_cycles = 4114, cycle_lengths = 16
- 16x1024x256: N_cycles = 29140, cycle_lengths = 9, 3

Scheduling for OpenMP within One Node (cont'd)
- 8x1000x500: N_cycles = 132, cycle_lengths = 8890, 1778, 70, 14, 5
- 32x100x25: N_cycles = 42, cycle_lengths = 168, 24, 21, 8, 3

Pure MPI and Pure OpenMP within One Node
OpenMP vs. MPI on 16 CPUs:
- 64x512x128: OpenMP is 2.76 times faster
- 16x1024x256: OpenMP is 1.99 times faster

Pure MPI and Hybrid MPI/OpenMP across Nodes
With 128 CPUs, the hybrid MPI/OpenMP run with n_thrds = 4 is faster than the n_thrds = 16 hybrid run by a factor of 1.59, and faster than pure MPI by a factor of 4.44.

Conclusions
- The in-place vacancy tracking method outperforms the two-array method; this can be explained by the elimination of the copy-back phase and by the differences in memory access volume and pattern.
- The independence and non-overlap of the tracking cycles allow multi-threaded parallelization.
- The SMP schedule AFFINITY optimizes performance for a large number of cycles with small cycle lengths; schedule DYNAMIC is better for a small number of cycles with large or uneven cycle lengths.
- The algorithm can also be parallelized with pure MPI by combining local vacancy tracking with a global exchange.

Conclusions (cont'd)
- Pure OpenMP runs more than twice as fast as pure MPI within one node, which makes it worthwhile to develop a hybrid MPI/OpenMP algorithm.
- The hybrid approach parallelizes the local transposes with OpenMP, while MPI is still used for the global exchange across nodes.
- For a given total number of CPUs, the numbers of MPI tasks and OpenMP threads must be chosen carefully for optimal performance; in our test runs, a factor of 4 speedup is gained over pure MPI.
- This paper gives a positive experience in developing hybrid MPI/OpenMP parallel paradigms.