MPI and OpenMP Paradigms on Cluster of SMP Architectures: The Vacancy Tracking Algorithm for Multi-Dimensional Array Transposition
Yun (Helen) He and Chris Ding
Lawrence Berkeley National Laboratory
SC2002, 11/19/2002

Outline
- Introduction
  - Background
  - 2-array transpose method
  - In-place vacancy tracking method
  - Performance on single CPU
- Parallelization of Vacancy Tracking Method
  - Pure OpenMP
  - Pure MPI
  - Hybrid MPI/OpenMP
- Performance
  - Scheduling for pure OpenMP
  - Pure MPI and pure OpenMP within one node
  - Pure MPI and Hybrid MPI/OpenMP across nodes
- Conclusions

Background
- Mixed MPI/OpenMP is the software trend for SMP architectures; it is elegant in concept and architecture.
- Negative experiences (NAS, CG, PS) indicate that pure MPI often outperforms mixed MPI/OpenMP.
- An array transpose on distributed-memory architectures is equivalent to remapping the problem subdomains; it is used in many scientific and engineering applications.
- Climate models: remapping between longitude-local and height-local decompositions.

Two-Array Transpose Method
- Reshuffle phase: B[k1,k3,k2] ← A[k1,k2,k3], using an auxiliary array B.
- Copy-back phase: A ← B.
- Combined effect: A'[k1,k3,k2] ← A[k1,k2,k3].
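To make the two phases concrete, here is a minimal, self-contained Fortran sketch (array sizes and names are illustrative, not taken from the paper):

  program two_array_demo
    implicit none
    integer, parameter :: n1 = 4, n2 = 3, n3 = 2
    real :: a(n1, n2, n3)     ! A(k1,k2,k3); after copy-back its storage holds A'(k1,k3,k2)
    real :: b(n1, n3, n2)     ! auxiliary array B
    integer :: k1, k2, k3, i

    a = reshape([(real(i), i = 1, size(a))], shape(a))   ! fill A with 1..N for the demo

    do k3 = 1, n3             ! reshuffle phase: B(k1,k3,k2) <- A(k1,k2,k3)
       do k2 = 1, n2
          do k1 = 1, n1
             b(k1, k3, k2) = a(k1, k2, k3)
          end do
       end do
    end do

    a = reshape(b, shape(a))  ! copy-back phase: A's storage <- B
    print *, a                ! memory is now laid out as A'(k1,k3,k2)
  end program two_array_demo

The two-array method touches every element twice (reshuffle plus copy back), which is exactly the overhead the vacancy tracking method removes.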

Vacancy Tracking Method

A(3,2) → A(2,3)
  Tracking cycle (0-based offsets): 1 - 3 - 4 - 2 - 1

A(2,3,4) → A(3,4,2)
  Tracking cycles: 1 - 2 - 4 - 8 - 16 - 9 - 18 - 13 - 3 - 6 - 12 - 1
                   5 - 10 - 20 - 17 - 11 - 22 - 21 - 19 - 15 - 7 - 14 - 5

Cycles are closed and non-overlapping.

Algorithm to Generate Tracking Cycles

  ! For a 2D array A, viewed as A(N1,N2) at input and as A(N2,N1) at output.
  ! Starting with (i1,i2), find the vacancy tracking cycle:
  ioffset_start = index_to_offset(N1, N2, i1, i2)
  ioffset_next  = -1
  tmp     = A(ioffset_start)
  ioffset = ioffset_start
  do while (ioffset_next /= ioffset_start)                  ! (C.1)
     call offset_to_index(ioffset, N2, N1, j1, j2)          ! N1,N2 exchanged
     ioffset_next = index_to_offset(N1, N2, j2, j1)         ! j1,j2 exchanged
     if (ioffset_next /= ioffset_start) then
        A(ioffset) = A(ioffset_next)    ! move the element that belongs at the current vacancy
        ioffset    = ioffset_next       ! the vacancy moves to where that element came from
     end if
  end do
  A(ioffset) = tmp                      ! the saved first element fills the last vacancy
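The code above relies on two index/offset helpers that the slides do not define. A plausible minimal sketch, assuming 0-based offsets and Fortran column-major storage (these are assumptions, not the paper's code):

  ! Offset of element (i1,i2) in an (n1,n2) column-major array, counted from 0.
  integer function index_to_offset(n1, n2, i1, i2)
    implicit none
    integer, intent(in) :: n1, n2, i1, i2
    index_to_offset = (i1 - 1) + n1 * (i2 - 1)
  end function index_to_offset

  ! Inverse mapping: recover (i1,i2) from a 0-based offset in an (n1,n2) array.
  ! n2 is unused; it is kept only to mirror index_to_offset's argument list.
  subroutine offset_to_index(ioffset, n1, n2, i1, i2)
    implicit none
    integer, intent(in)  :: ioffset, n1, n2
    integer, intent(out) :: i1, i2
    i1 = mod(ioffset, n1) + 1     ! fastest-varying index
    i2 = ioffset / n1 + 1         ! integer division gives the slower index
  end subroutine offset_to_index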

In-Place vs. Two-Array
(figure: single-CPU timing comparison of the in-place and two-array methods)

Memory Access Volume and Pattern
- Eliminates the auxiliary array and the copy-back phase, cutting memory access roughly in half.
- Accesses even less memory because length-1 cycles are never touched.
- Has a more irregular memory access pattern than the traditional method, but the gap narrows once the size of each move exceeds the cache-line size.
- Shares one drawback with the 2-array method: inefficient memory access due to large strides.

Outline (revisited): Parallelization of Vacancy Tracking Method (Pure OpenMP, Pure MPI, Hybrid MPI/OpenMP)

Multi-Threaded Parallelism
Key: independence of the tracking cycles.

  !$OMP PARALLEL DO DEFAULT (PRIVATE)
  !$OMP& SHARED (N_cycles, info_table, Array)               (C.2)
  !$OMP& SCHEDULE (AFFINITY)
  do k = 1, N_cycles
     ! an inner loop of memory exchange for each cycle, using info_table
  enddo
  !$OMP END PARALLEL DO
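The inner loop is left abstract on the slide. A minimal sketch of what loop (C.2) could look like is below; it assumes a hypothetical info_table(k) that stores the starting offset of cycle k, reuses the index/offset helpers sketched earlier, and substitutes the portable SCHEDULE(DYNAMIC) for the IBM-specific SCHEDULE(AFFINITY):

  subroutine transpose_cycles_omp(Array, N1, N2, N_cycles, info_table)
    implicit none
    integer, intent(in) :: N1, N2, N_cycles
    integer, intent(in) :: info_table(N_cycles)   ! assumed: start offset of each cycle
    real, intent(inout) :: Array(0:N1*N2-1)       ! array addressed by 0-based offset
    integer :: k, ioffset, ioffset_start, ioffset_next, j1, j2
    integer, external :: index_to_offset
    real :: tmp

  !$OMP PARALLEL DO DEFAULT(PRIVATE) &
  !$OMP SHARED(Array, info_table, N_cycles, N1, N2) SCHEDULE(DYNAMIC)
    do k = 1, N_cycles                            ! cycles are independent
       ioffset_start = info_table(k)
       tmp     = Array(ioffset_start)
       ioffset = ioffset_start
       ioffset_next = -1
       do while (ioffset_next /= ioffset_start)   ! follow one tracking cycle
          call offset_to_index(ioffset, N2, N1, j1, j2)
          ioffset_next = index_to_offset(N1, N2, j2, j1)
          if (ioffset_next /= ioffset_start) then
             Array(ioffset) = Array(ioffset_next)
             ioffset = ioffset_next
          end if
       end do
       Array(ioffset) = tmp
    end do
  !$OMP END PARALLEL DO
  end subroutine transpose_cycles_omp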

Pure MPI
A(N1,N2,N3) → A(N1,N3,N2) on P processors:
(G1) Do a local transpose on the local array: A(N1,N2,N3/P) → A(N1,N3/P,N2).
(G2) Do a global all-to-all exchange of data blocks, each of size N1(N3/P)(N2/P).
(G3) Do a local transpose on the local array: A(N1,N3/P,N2), viewed as A(N1*N3/P, N2/P, P) → A(N1*N3/P, P, N2/P), viewed as A(N1,N3,N2/P).
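A hedged driver sketch of steps (G1)-(G3); the local-transpose routines are hypothetical placeholders, MPI_ALLTOALL stands in for the hand-coded exchange shown on the next slide, and N2 and N3 are assumed divisible by P:

  subroutine transpose_mpi(A, work, N1, N2, N3, comm)
    use mpi
    implicit none
    integer, intent(in) :: N1, N2, N3, comm
    real, intent(inout) :: A(*)          ! local slab of N1*N2*(N3/P) elements
    real, intent(inout) :: work(*)       ! receive buffer of the same size
    integer :: P, blk, ierr

    call MPI_Comm_size(comm, P, ierr)
    blk = N1 * (N3/P) * (N2/P)           ! block exchanged between each pair of tasks

    call local_transpose_g1(A, N1, N2, N3/P)                              ! (G1)
    call MPI_Alltoall(A, blk, MPI_REAL, work, blk, MPI_REAL, comm, ierr)  ! (G2)
    call local_transpose_g3(work, A, N1*N3/P, N2/P, P)                    ! (G3)
  end subroutine transpose_mpi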

Global All-to-All Exchange

  ! All processors simultaneously do the following:
  do q = 1, P - 1
     send a message to destination processor destID
     receive a message from source processor srcID
  end do
  ! where destID = srcID = (myID XOR q)
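A minimal MPI sketch of this XOR-pairwise exchange (the block layout in sendbuf/recvbuf is an assumption for illustration, and the pairing requires P to be a power of two):

  subroutine xor_alltoall(sendbuf, recvbuf, blk, comm)
    use mpi
    implicit none
    integer, intent(in) :: blk, comm
    real, intent(in)    :: sendbuf(*)    ! P blocks of size blk, ordered by destination rank
    real, intent(out)   :: recvbuf(*)    ! P blocks of size blk, ordered by source rank
    integer :: P, myID, q, partner, ierr, status(MPI_STATUS_SIZE)

    call MPI_Comm_size(comm, P, ierr)
    call MPI_Comm_rank(comm, myID, ierr)

    ! keep our own block locally
    recvbuf(myID*blk+1 : myID*blk+blk) = sendbuf(myID*blk+1 : myID*blk+blk)

    do q = 1, P - 1
       partner = ieor(myID, q)           ! destID = srcID = (myID XOR q)
       call MPI_Sendrecv(sendbuf(partner*blk+1), blk, MPI_REAL, partner, 0, &
                         recvbuf(partner*blk+1), blk, MPI_REAL, partner, 0, &
                         comm, status, ierr)
    end do
  end subroutine xor_alltoall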

Total Transpose Time (Pure MPI)
Using the "latency + message-size/bandwidth" model:

  T_P = 2*M*N1*N2*N3/P + 2*L*(P-1) + [2*N1*N3*N2/(B*P)] * [(P-1)/P]

where
  P --- total number of CPUs
  M --- average memory access time per element
  L --- communication latency
  B --- communication bandwidth
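For reference, a small helper (an illustration with an assumed name, not the paper's code) that simply evaluates this model; M, L, and B must be supplied in consistent units, e.g. seconds per element, seconds, and elements per second:

  real function t_pure_mpi(n1, n2, n3, p, m, l, b)
    implicit none
    integer, intent(in) :: n1, n2, n3, p
    real, intent(in)    :: m, l, b
    real :: nelem
    nelem = real(n1) * real(n2) * real(n3)
    t_pure_mpi = 2.0*m*nelem/real(p)                       &  ! local transpose work
               + 2.0*l*real(p-1)                           &  ! latency of 2(P-1) messages
               + (2.0*nelem/(b*real(p))) * real(p-1)/real(p)  ! volume/bandwidth term
  end function t_pure_mpi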

Total Transpose Time (Hybrid MPI/OpenMP)
Parallelize the local transposes (G1) and (G3) with OpenMP, with N_CPU = N_MPI * N_threads:

  T = 2*M*N1*N2*N3/N_CPU + 2*L*(N_MPI - 1) + [2*N1*N3*N2/(B*N_MPI)] * [(N_MPI - 1)/N_MPI]

where
  N_CPU --- total number of CPUs
  N_MPI --- number of MPI tasks

Outline (revisited): Performance (Scheduling for pure OpenMP; Pure MPI and pure OpenMP within one node; Pure MPI and Hybrid MPI/OpenMP across nodes)

Scheduling for OpenMP
- Static: the loop is divided into n_thrds partitions, each containing ceiling(n_iters/n_thrds) iterations.
- Affinity: the loop is divided into n_thrds partitions, each containing ceiling(n_iters/n_thrds) iterations; each partition is then subdivided into chunks containing ceiling(n_left_iters_in_partition/2) iterations.
- Guided: the loop is divided into progressively smaller chunks until the chunk size is 1; the first chunk contains ceiling(n_iters/n_thrds) iterations, and each subsequent chunk contains ceiling(n_left_iters/n_thrds) iterations.
- Dynamic, n: the loop is divided into chunks each containing n iterations; we experiment with different chunk sizes.
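These schedules are selected, one at a time, with the SCHEDULE clause on loop (C.2); a brief illustration of the clause spellings (the chunk size is illustrative, and AFFINITY is an IBM XL Fortran extension rather than standard OpenMP):

  !$OMP PARALLEL DO SCHEDULE(STATIC)        ! static partitions
  !$OMP PARALLEL DO SCHEDULE(AFFINITY)      ! affinity (IBM extension)
  !$OMP PARALLEL DO SCHEDULE(GUIDED)        ! progressively smaller chunks
  !$OMP PARALLEL DO SCHEDULE(DYNAMIC, 16)   ! chunks of 16 iterations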

Scheduling for OpenMP within One Node
- 64x512x128: N_cycles = 4114, cycle lengths = 16
- 16x1024x256: N_cycles = 29140, cycle lengths = 9, 3

Scheduling for OpenMP within One Node (cont'd)
- 8x1000x500: N_cycles = 132, cycle lengths = 8890, 1778, 70, 14, 5
- 32x100x25: N_cycles = 42, cycle lengths = 168, 24, 21, 8, 3

Pure MPI and Pure OpenMP within One Node
OpenMP vs. MPI (16 CPUs):
- 64x512x128: pure OpenMP is 2.76 times faster than pure MPI
- 16x1024x256: pure OpenMP is 1.99 times faster than pure MPI

Pure MPI and Hybrid MPI/OpenMP across Nodes
With 128 CPUs, the hybrid MPI/OpenMP run with n_thrds = 4 performs faster than the n_thrds = 16 hybrid run by a factor of 1.59, and faster than pure MPI by a factor of 4.44.

Conclusions
- The in-place vacancy tracking method outperforms the 2-array method; this is explained by the elimination of the copy-back phase and by the reduced memory access volume and pattern.
- The independence and non-overlapping of the tracking cycles allow multi-threaded parallelization.
- SMP schedule affinity gives the best performance for a large number of cycles with small cycle lengths; schedule dynamic is better for a small number of cycles with large or uneven cycle lengths.
- The algorithm can also be parallelized with pure MPI by combining local vacancy tracking with a global exchange.

Conclusions (cont'd)
- Pure OpenMP runs more than twice as fast as pure MPI within one node, so it makes sense to develop a hybrid MPI/OpenMP algorithm.
- The hybrid approach parallelizes the local transposes with OpenMP, while MPI is still used for the global exchange across nodes.
- For a given total number of CPUs, the numbers of MPI tasks and OpenMP threads must be chosen carefully for optimal performance; in our test runs, a speedup of a factor of 4 is gained over pure MPI.
- This paper reports a positive experience with developing hybrid MPI/OpenMP parallel paradigms.