
Slide 1: Localization and Scheduling Techniques for Optimizing Communications on Heterogeneous Cluster Grid
Ching-Hsien Hsu (許慶賢)
Department of Computer Science and Information Engineering, Chung Hua University, http://www.csie.chu.edu.tw

Slide 2: Outline
- Introduction
  - Regular / irregular data distribution and redistribution
  - Category of runtime redistribution problems
- Processor Mapping Technique for Communication Localization
  - The processor mapping technique
  - Localization on multi-cluster grid systems
- Scheduling Contention-Free Communications for Irregular Problems
  - The Two-Phase Degree Reduction method (TPDR)
  - Extended TPDR (E-TPDR)
- Conclusions

Slide 3 (Introduction): Regular Parallel Data Distribution
- Data-parallel programming languages, e.g. HPF (High Performance Fortran) and Fortran D
- Distribution patterns: BLOCK, CYCLIC, BLOCK-CYCLIC(c)
- Example: an 18-element array distributed over 3 logical processors
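
The three regular patterns can be stated as owner functions. The following is a minimal sketch that assumes 0-based element and processor indices (HPF itself counts from 1). The function names are hypothetical. BLOCK uses the HPF block size ceil(N/P).

```python
def block_owner(i, n, p):
    """BLOCK: contiguous chunks of ceil(n / p) elements."""
    size = -(-n // p)  # ceiling division
    return i // size

def cyclic_owner(i, p):
    """CYCLIC: element i goes to processor i mod p."""
    return i % p

def block_cyclic_owner(i, p, c):
    """BLOCK-CYCLIC(c): blocks of c elements dealt out round-robin."""
    return (i // c) % p

N, P = 18, 3  # the slide's 18-element array on 3 logical processors
print("BLOCK          ", [block_owner(i, N, P) for i in range(N)])
print("CYCLIC         ", [cyclic_owner(i, P) for i in range(N)])
print("BLOCK-CYCLIC(2)", [block_cyclic_owner(i, P, 2) for i in range(N)])
```

With these definitions, BLOCK gives each processor 6 consecutive elements. CYCLIC deals the elements one at a time. BLOCK-CYCLIC(2) deals them two at a time, so it sits between the other two patterns.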

Slide 4 (Introduction, cont.): Data Distribution of Two-Dimensional Matrices

Slide 5 (Introduction, cont.): Data Redistribution

Slide 6 (Introduction, cont.): Data Redistribution

    REAL, DIMENSION(18, 24) :: A
    !HPF$ PROCESSORS P(2, 3)
    !HPF$ DYNAMIC A
    !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO P
    ! ... computation ...
    !HPF$ REDISTRIBUTE A(CYCLIC, CYCLIC(2)) ONTO P
    ! ... computation ...
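
To see what the REDISTRIBUTE directive costs, we can count how many elements each processor of the 2 x 3 grid must send to every other processor. The sketch below simulates the (BLOCK, BLOCK) to (CYCLIC, CYCLIC(2)) change for the 18 x 24 array. It assumes 0-based indices and uses hypothetical helper names. It does not use any HPF runtime.

```python
from collections import Counter

ROWS, COLS = 18, 24  # A(18, 24), 0-based indices in this sketch
PR, PC = 2, 3        # processor grid P(2, 3)

def block(i, n, p):
    """BLOCK: chunks of ceil(n / p) elements."""
    return i // -(-n // p)

def cyclic(i, p, c=1):
    """CYCLIC(c); c = 1 gives plain CYCLIC."""
    return (i // c) % p

traffic = Counter()  # (source proc, destination proc) -> element count
for i in range(ROWS):
    for j in range(COLS):
        src = (block(i, ROWS, PR), block(j, COLS, PC))  # (BLOCK, BLOCK)
        dst = (cyclic(i, PR), cyclic(j, PC, 2))         # (CYCLIC, CYCLIC(2))
        traffic[src, dst] += 1

local = sum(n for (s, d), n in traffic.items() if s == d)
messages = sum(1 for (s, d) in traffic if s != d)
print(f"{local} of {ROWS * COLS} elements stay put; {messages} messages")
```

Each dimension is redistributed independently, so an element stays local only if both its row owner and its column owner are unchanged. That holds for 10 of the 18 rows and 12 of the 24 columns, which leaves 120 of the 432 elements in place.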

Slide 7 (Introduction, cont.): Irregular Redistribution

    INTEGER, PARAMETER :: S(6) = (/ 7, 16, 11, 10, 7, 49 /)
    INTEGER :: new(6)
    REAL :: A(100)
    !HPF$ PROCESSORS P(6)
    !HPF$ DYNAMIC A
    !HPF$ DISTRIBUTE A(GEN_BLOCK(S)) ONTO P
    new = (/ 15, 16, 10, 16, 15, 28 /)
    !HPF$ REDISTRIBUTE A(GEN_BLOCK(new)) ONTO P
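
Under GEN_BLOCK, processor p owns a contiguous range whose length is the p-th entry of the size array. Finding an element's owner therefore reduces to a search over prefix sums. The sketch assumes 0-based element and processor indices (HPF counts from 1). The helper names `gen_block_bounds` and `owner` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def gen_block_bounds(sizes):
    """Prefix sums: processor p owns elements bounds[p] .. bounds[p+1]-1."""
    return [0, *accumulate(sizes)]

def owner(i, bounds):
    """Owner of element i; bisect_right skips empty (size-0) blocks."""
    if not 0 <= i < bounds[-1]:
        raise IndexError(f"element {i} outside 0..{bounds[-1] - 1}")
    return bisect_right(bounds, i) - 1

S = [7, 16, 11, 10, 7, 49]      # GEN_BLOCK(S)
NEW = [15, 16, 10, 16, 15, 28]  # GEN_BLOCK(new)
old_b, new_b = gen_block_bounds(S), gen_block_bounds(NEW)

# Elements whose owner changes must be sent during REDISTRIBUTE.
moved = [i for i in range(100) if owner(i, old_b) != owner(i, new_b)]
print(f"{len(moved)} of 100 elements change owner")
```

For example, element 50 (0-based) belongs to processor 4 under S and to processor 3 under new, so it must be transferred.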

Slide 8 (Introduction, cont.): Irregular Data Distribution (GEN_BLOCK)
An application on heterogeneous processors runs Algorithm P and Algorithm Q, and each algorithm uses its own distribution of the 100-element array:
- Data distribution for Algorithm P: 7, 16, 11, 10, 7, 49
- Data distribution for Algorithm Q: 15, 16, 10, 16, 15, 28

Slide 9 (Introduction, cont.): Problem Category
- Benefits of runtime redistribution: achieve data locality; reduce communication cost at runtime
- Objectives:
  - Indexing-set generation
  - Data packing and unpacking techniques
  - Communication optimizations: multi-stage redistribution method, processor mapping technique, communication scheduling

Slide 10: Outline
- Introduction
  - Regular / irregular data distribution and redistribution
  - Category of runtime redistribution problems
- Processor Mapping Technique for Communication Localization
  - The processor mapping technique
  - Multi-cluster grid systems
- Contention-Free Communication Scheduling for Irregular Problems
  - The Two-Phase Degree Reduction method (TPDR)
  - Extended TPDR (E-TPDR)
- Conclusions

Slide 11 (Processor Mapping Technique): The Original Processor Mapping Technique (Prof. Lionel M. Ni)
- A mapping function generates a new sequence of logical processor IDs
- Increases data hits
- Minimizes the amount of data exchanged

Slide 12 (Processor Mapping Technique, cont.): An Optimal Processor Mapping Technique (Hsu '05)
Example: BC 86 over 11, comparing:
- Traditional method
- Size-oriented greedy matching
- Maximum matching (optimal)
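
Choosing the mapping that maximizes data hits is an assignment problem. Each logical target block is given to one physical processor so that the total amount of data already in place is as large as possible. The brute-force sketch below enumerates all permutations, which is practical only for a handful of processors; a Hungarian-algorithm solver would be the usual choice at scale. The weight matrix is a made-up 4-processor example, not the slide's "BC 86 over 11" case, and the function names are hypothetical.

```python
from itertools import permutations

# W[p][x]: elements already on physical processor p that belong to
# logical target block x after redistribution (hypothetical numbers).
W = [[5, 0, 3, 0],
     [0, 2, 0, 6],
     [4, 0, 0, 1],
     [0, 7, 2, 0]]

def data_hits(w, assign):
    """assign[x] = physical processor that hosts logical target block x."""
    return sum(w[assign[x]][x] for x in range(len(w)))

def optimal_mapping(w):
    """Exhaustive maximum-weight matching; O(n!), so only for small n."""
    return max(permutations(range(len(w))), key=lambda a: data_hits(w, a))

identity = tuple(range(len(W)))  # the "traditional" mapping: block x on processor x
best = optimal_mapping(W)
print("identity:", data_hits(W, identity), "hits")
print("optimal: ", best, data_hits(W, best), "hits")
```

Here the identity mapping keeps 7 elements in place. Moving block 0 to processor 2, block 1 to processor 3, block 2 to processor 0 and block 3 to processor 1 keeps 20 in place, so less data has to cross the network.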

Slide 13 (Processor Mapping Technique, cont.): Localizing Communications on a Cluster Grid
- Interior communication: between processors in the same cluster
- External communication: between processors in different clusters

Slide 14 (Processor Mapping Technique, cont.): Motivating Example

Slide 15 (Processor Mapping Technique, cont.): Communication Table Before Processor Mapping
|I| = 9 interior communications, |E| = 18 external communications

Slide 16 (Processor Mapping Technique, cont.): Communication Links Before Processor Mapping

Slide 17 (Processor Mapping Technique, cont.): Communication Table After Processor Mapping
|I| = 27 interior communications, |E| = 0 external communications

Slide 18 (Processor Mapping Technique, cont.): Communication Links After Processor Mapping

Slide 19 (Processor Mapping Technique, cont.): Processor Reordering Flow Diagram
Diagram elements: Source Data, Partitioning, Data Alignment/Dispatch, Reordering Agent, Master Node; SCA(x), Generate new P id, Reordering SD(P_x'), DCA(x), Determine Target Cluster, Designate Target Node, SD(P_x), DD(P_y).
Mapping function: $F(X) = X' = \lfloor X/C \rfloor + (X \bmod C) \times K$, where C is the number of clusters and K is the number of processors per cluster.
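
The effect of the reordering can be checked by counting interior and external messages before and after relabeling. This is a sketch under stated assumptions. It assumes the mapping function reads F(X) = floor(X / C) + (X mod C) * K, with C clusters of K processors each, and that physical processor q belongs to cluster floor(q / K). The CYCLIC(1) to CYCLIC(3) redistribution on 3 clusters of 3 nodes is a hypothetical example, and all function names are illustrative.

```python
def f_map(x, c, k):
    """Assumed mapping: logical target x runs on physical processor F(x)."""
    return x // c + (x % c) * k

def bc1_to_bc3_pairs(p):
    """(source, logical target) pairs for a hypothetical CYCLIC(1) -> CYCLIC(3)."""
    period = 3 * p  # the ownership pattern repeats every 3p elements
    return {(i % p, (i // 3) % p) for i in range(period)}

def interior_exterior(pairs, k, mapping=lambda x: x):
    """Count messages whose endpoints fall in the same cluster of k nodes."""
    interior = sum(1 for s, t in pairs if s // k == mapping(t) // k)
    return interior, len(pairs) - interior

C, K = 3, 3
pairs = bc1_to_bc3_pairs(C * K)
print("before:", interior_exterior(pairs, K))
print("after: ", interior_exterior(pairs, K, lambda x: f_map(x, C, K)))
```

Under this mapping, logical targets with the same X mod C land in the same cluster. In this setup, each source sends only to targets congruent to its own cluster index, so every message becomes interior: 9 interior / 18 external before, 27 / 0 after. These counts match the communication-table slides, although the transcript does not state which redistribution those slides used.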

Slide 20 (Processor Mapping Technique, cont.): Identical Cluster Grid vs. Non-identical Cluster Grid

Slide 21 (Processor Mapping Technique, cont.): Processor Replacement Algorithm for Non-identical Cluster Grids

Slide 22 (Processor Mapping Technique, cont.): Theoretical Analysis
The number of interior communications when C = 3.

Slide 23 (Processor Mapping Technique, cont.): Theoretical Analysis

Slide 24 (Processor Mapping Technique, cont.): Theoretical Analysis

Slide 25 (Processor Mapping Technique, cont.): Simulation Setting
- Taiwan UniGrid: 8 campus clusters
- SPMD programs written in C with MPI

Slide 26 (Processor Mapping Technique, cont.): Topology
Cities: Taipei, Hsinchu, Taichung, Tainan, Hualien
Sites: Academia Sinica; National Tsing Hua University (1); National Tsing Hua University (2); Chung Hua University; National Center for High-performance Computing; Hsing Kuo University; Providence University; Tunghai University; National Dong Hwa University

Slide 27 (Processor Mapping Technique, cont.): Hardware Infrastructure (sites connected over the Internet)

    Site     CPU                     Memory
    HKU      Intel P3 1.0 GHz        256 MB
    THU      Dual AMD 1.6 GHz        1 GB
    CHU      Intel P4 2.8 GHz        256 MB
    SINICA   Dual Intel P3 1.0 GHz   1 GB
    NCHC     Dual AMD 2000+          512 MB
    PU       AMD 2400+               1 GB
    NTHU     Dual Xeon 2.8 GHz       1 GB
    NDHU     AMD Athlon              256 MB

Slide 28 (Processor Mapping Technique, cont.): System Monitoring Webpage

Slide 29 (Processor Mapping Technique, cont.): Experimental Results

Slide 30 (Processor Mapping Technique, cont.): Experimental Results

Slide 31 (Processor Mapping Technique, cont.): Experimental Results

Slide 32: Outline
- Introduction
  - Regular / irregular data distribution and redistribution
  - Category of runtime redistribution problems
- Processor Mapping Technique for Communication Localization
  - The processor mapping technique
  - Multi-cluster grid systems
- Scheduling Contention-Free Communications for Irregular Problems
  - The Two-Phase Degree Reduction method (TPDR)
  - Extended TPDR (E-TPDR)
- Conclusions

Slide 33 (Scheduling Irregular Redistributions): Example of GEN_BLOCK Distributions
GEN_BLOCK distributions enhance load balancing in heterogeneous environments. An application runs Algorithm P and Algorithm Q:
- Data distribution for Algorithm P: 7, 16, 11, 10, 7, 49
- Data distribution for Algorithm Q: 15, 16, 10, 16, 15, 28

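Slide 34 (Scheduling Irregular Redistributions, cont.): Example of GEN_BLOCK Redistribution
Observation: there are no crossing communications. Because both distributions assign contiguous blocks in processor order, if source SP_i sends to target TP_j and a later source SP_k (k > i) sends to TP_l, then l >= j.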

Slide 35 (Scheduling Irregular Redistributions, cont.): Convex Bipartite Graph
Nodes: source processors SP1 to SP3 and target processors TP1 to TP3; edges: data communications.
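
Because the messages never cross, the message set of a GEN_BLOCK redistribution can be derived from block-interval intersections. The largest node degree in the resulting convex bipartite graph is a lower bound on the number of contention-free steps, assuming each processor sends at most one message and receives at most one per step. The sketch applies this to the S to new layouts from the irregular-redistribution slide (0-based processor ids). The name `gen_block_messages` is hypothetical.

```python
from collections import Counter
from itertools import accumulate

def bounds(sizes):
    """Prefix sums: block p covers elements bounds[p] .. bounds[p+1]-1."""
    return [0, *accumulate(sizes)]

def gen_block_messages(src_sizes, dst_sizes):
    """(source, target, size) for every non-empty block intersection."""
    if sum(src_sizes) != sum(dst_sizes):
        raise ValueError("both layouts must cover the same array")
    sb, db = bounds(src_sizes), bounds(dst_sizes)
    msgs = []
    for s in range(len(src_sizes)):
        for t in range(len(dst_sizes)):
            size = min(sb[s + 1], db[t + 1]) - max(sb[s], db[t])
            if size > 0:
                msgs.append((s, t, size))
    return msgs

S = [7, 16, 11, 10, 7, 49]
NEW = [15, 16, 10, 16, 15, 28]
msgs = gen_block_messages(S, NEW)

# Non-crossing: listed by source, the targets never go backwards.
targets = [t for _, t, _ in msgs]
assert targets == sorted(targets)

# Node degrees; the maximum bounds the number of steps from below.
degree = Counter()
for s, t, _ in msgs:
    degree["SP", s] += 1
    degree["TP", t] += 1
print(len(msgs), "messages:", msgs)
print("minimum number of steps >=", max(degree.values()))
```

The double loop is kept for readability. A two-pointer sweep over the two boundary lists would produce the same 11 messages in linear time.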

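Slide 36 (Scheduling Irregular Redistributions, cont.): Example of GEN_BLOCK Redistribution
A simple scheduling result. Goals:
- Minimize the number of communication steps
- Minimize the total communication cost, i.e. the sum over all steps of the largest message in each step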

Slide 37: Related Implementations: Coloring

Slide 38: Related Implementations: LIST

Slide 39: Related Implementations: DC1 and DC2 (panels: (a) DC1, (b) DC2)

Slide 40 (Scheduling Irregular Redistributions, cont.): The Two-Phase Degree Reduction Method
- The first phase (for nodes with degree > 2) reduces the degree of the maximum-degree nodes by one in each reduction iteration.
- The second phase (for nodes with degree 1 or 2) schedules the messages between nodes of degree 1 and 2 using an adjustable coloring mechanism.
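
The transcript does not describe TPDR in enough detail to reproduce it, so the sketch below shows only a simpler baseline and is not TPDR. It walks the messages in array order and puts each one into the earliest step in which neither its source nor its target is already busy. A node's messages are contiguous in that order, so the earlier conflicts of any message all come from a single node, and the baseline never needs more than the maximum degree of steps. The message list is the S to new example (0-based ids), and all names are hypothetical.

```python
# (source, target, size) in array order, 0-based ids (hypothetical input).
MSGS = [(0, 0, 7), (1, 0, 8), (1, 1, 8), (2, 1, 8), (2, 2, 3), (3, 2, 7),
        (3, 3, 3), (4, 3, 7), (5, 3, 6), (5, 4, 15), (5, 5, 28)]

def greedy_schedule(msgs):
    """Baseline (not TPDR): first step where source and target are both free."""
    src_steps, dst_steps = {}, {}  # node -> steps in which it is busy
    steps = []                     # steps[k] = messages sent in step k
    for s, t, size in msgs:
        busy = src_steps.setdefault(s, set()) | dst_steps.setdefault(t, set())
        k = min(set(range(len(steps) + 1)) - busy)
        src_steps[s].add(k)
        dst_steps[t].add(k)
        if k == len(steps):
            steps.append([])
        steps[k].append((s, t, size))
    return steps

def contention_free(steps):
    """No processor sends twice or receives twice within one step."""
    return all(len({s for s, _, _ in st}) == len(st) and
               len({t for _, t, _ in st}) == len(st) for st in steps)

def total_cost(steps):
    """Each step costs as much as its largest message."""
    return sum(max(size for _, _, size in st) for st in steps)

steps = greedy_schedule(MSGS)
for k, st in enumerate(steps, 1):
    print(f"S{k}:", [size for _, _, size in st])
print("steps:", len(steps), "total cost:", total_cost(steps))
```

Source 5 (0-based) sends messages of sizes 6, 15 and 28, which must occupy three different steps. No schedule can therefore cost less than 6 + 15 + 28 = 49. The baseline happens to reach that bound on this instance, but it makes no attempt to minimize cost in general.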

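Slide 41 (Scheduling Irregular Redistributions, cont.): The Two-Phase Degree Reduction Method, First Phase
Result: S3: m11(6), m5(3); cost 6
(mk(s) denotes message k carrying s elements; a step costs as much as its largest message.)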

Slide 42 (Scheduling Irregular Redistributions, cont.): The Two-Phase Degree Reduction Method, Second Phase
- S1: m1(7), m3(7), m6(15), m8(4), m10(8), m13(18); cost 18
- S2: m2(3), m4(4), m7(3), m9(10), m12(12); cost 12
- S3: m11(6), m5(3); cost 6

Slide 43 (Scheduling Irregular Redistributions, cont.): Extended TPDR (E-TPDR) Compared with TPDR
TPDR:
- S1: m1(7), m3(7), m6(15), m8(4), m10(8), m13(18); cost 18
- S2: m2(3), m4(4), m7(3), m9(10), m12(12); cost 12
- S3: m11(6), m5(3); cost 6
E-TPDR (m2(3) and m8(4) move into S3; m9 and m10 swap between S1 and S2):
- S1: m1(7), m3(7), m6(15), m9(10), m13(18); cost 18
- S2: m4(4), m7(3), m10(8), m12(12); cost 12
- S3: m11(6), m5(3), m2(3), m8(4); cost 6
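
The per-step costs on the TPDR and E-TPDR slides can be recomputed from the listed message sizes. The cost model is the one shown on the slides: each step costs as much as its largest message. This sketch only re-derives those figures and checks that each schedule sends every message exactly once. The dictionary and list names are hypothetical.

```python
# Message sizes m1..m13 as listed on the TPDR / E-TPDR slides.
SIZE = {1: 7, 2: 3, 3: 7, 4: 4, 5: 3, 6: 15, 7: 3, 8: 4, 9: 10,
        10: 8, 11: 6, 12: 12, 13: 18}

TPDR = [[1, 3, 6, 8, 10, 13], [2, 4, 7, 9, 12], [11, 5]]
E_TPDR = [[1, 3, 6, 9, 13], [4, 7, 10, 12], [11, 5, 2, 8]]

def step_costs(schedule):
    """Each step costs as much as its largest message."""
    return [max(SIZE[m] for m in step) for step in schedule]

for name, sched in (("TPDR", TPDR), ("E-TPDR", E_TPDR)):
    # every message must be scheduled exactly once
    assert sorted(m for step in sched for m in step) == list(SIZE)
    costs = step_costs(sched)
    print(f"{name}: step costs {costs}, total {sum(costs)}")
```

Under this cost model, both schedules come out at 18 + 12 + 6 = 36, and the 13 message sizes add up to 100 elements.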

Slide 44: Performance Evaluation
Simulation of the TPDR and E-TPDR algorithms on uneven cases.

Slide 45: Performance Evaluation (cont.)
Simulation A examines the performance of the TPDR and E-TPDR algorithms on uneven cases.

Slide 46: Performance Evaluation (cont.)
Simulation B examines the performance of the TPDR and E-TPDR algorithms on even cases.

Slide 47: Performance Evaluation (cont.)
Simulation B examines the performance of the TPDR and E-TPDR algorithms on even cases.

Slide 48: Summary
TPDR and E-TPDR for scheduling irregular GEN_BLOCK redistributions:
- Contention free
- Optimal number of communication steps
- Outperform the D&C (divide-and-conquer) algorithm
- TPDR (uneven) performs better than TPDR (even)

Slide 49: Performance Evaluation (cont.): 1000 test cases

Slide 50: Performance Evaluation (cont.): Average

Slide 51: Conclusions
- Runtime data redistribution is commonly used to enhance algorithm performance in data-parallel applications.
- The processor mapping technique minimizes data transmission cost and achieves better communication localization on multi-cluster grid systems.
- TPDR and E-TPDR schedule irregular GEN_BLOCK redistributions contention free, with good performance.
Future work:
- Incorporate localization techniques on Data Grids, taking heterogeneous external communication overheads into account.
- Incorporate the ratio between local memory access and remote message passing (on different architectures) into the E-TPDR scheduling policy.
- …

Slide 52: Thank you

Slide 53: Our Implementation of Communication Scheduling (E-TPDR)
After processes 1 and 2, m7 is scheduled in step 3 and Degree_max becomes 2. After process 3, messages m1 and m11 are also scheduled in step 3. Edges colored blue and red are scheduled in steps 1 and 2, respectively.

