

1 Localization and Scheduling Techniques for Optimizing Communications on Heterogeneous Cluster Grid
Ching-Hsien Hsu (許慶賢), Department of Computer Science and Information Engineering, Chung Hua University

2 Outline
- Introduction
  - Regular / Irregular Data Distribution, Redistribution
  - Category of Runtime Redistribution Problems
- Processor Mapping Technique for Communication Localization
  - The Processor Mapping Technique
  - Localization on Multi-Cluster Grid System
- Scheduling Contention Free Communications for Irregular Problems
  - The Two-Phase Degree Reduction Method (TPDR)
  - Extended TPDR (E-TPDR)
- Conclusions

3 Regular Parallel Data Distribution (Introduction)
Data-parallel programming languages, e.g. HPF (High Performance Fortran) and Fortran D, support the BLOCK, CYCLIC, and BLOCK-CYCLIC(c) distributions.
Example: an 18-element array distributed over 3 logical processors.
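The three regular distributions can be sketched as owner functions. The following minimal Python illustration is my own (not from the slides), using the deck's 18-element, 3-processor example:

```python
# Hypothetical sketch: which processor owns array element i under the three
# regular HPF-style distributions, for n elements on p logical processors.

def block_owner(i, n, p):
    """BLOCK: contiguous chunks of ceil(n/p) elements per processor."""
    chunk = -(-n // p)          # ceiling division
    return i // chunk

def cyclic_owner(i, p):
    """CYCLIC: elements dealt out round-robin, one at a time."""
    return i % p

def block_cyclic_owner(i, c, p):
    """BLOCK-CYCLIC(c): blocks of c elements dealt out round-robin."""
    return (i // c) % p

# The slides' example: an 18-element array on 3 logical processors.
n, p = 18, 3
print([block_owner(i, n, p) for i in range(n)])       # six 0s, six 1s, six 2s
print([cyclic_owner(i, p) for i in range(n)])         # 0,1,2 repeating
print([block_cyclic_owner(i, 2, p) for i in range(n)])
```

BLOCK-CYCLIC(c) generalizes both of the others: c = ceil(n/p) gives BLOCK and c = 1 gives CYCLIC.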

4 Data Distribution for Two-Dimensional Matrices (Introduction, cont.)
[figure: distribution patterns for 2-D matrices]

5 Data Redistribution Introduction (cont.)

6 Data Redistribution (Introduction, cont.)
REAL, DIMENSION(18, 24) :: A
!HPF$ PROCESSORS P(2, 3)
!HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO P
... (computation)
!HPF$ REDISTRIBUTE A(CYCLIC, CYCLIC(2)) ONTO P
... (computation)

7 Irregular Redistribution (Introduction, cont.)
PARAMETER (S = /7, 16, 11, 10, 7, 49/)
!HPF$ PROCESSORS P(6)
REAL A(100), new(6)
!HPF$ DISTRIBUTE A(GEN_BLOCK(S)) ONTO P
!HPF$ DYNAMIC
new = /15, 16, 10, 16, 15, 28/
!HPF$ REDISTRIBUTE A(GEN_BLOCK(new))
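Because GEN_BLOCK gives each processor one contiguous interval of the array, the messages a redistribution must exchange follow from intersecting old and new intervals. A hedged Python sketch (the helper name is mine, not the paper's), using the sizes from this slide:

```python
# Illustrative sketch: derive the communication table of a GEN_BLOCK
# redistribution by intersecting old and new ownership intervals.

from itertools import accumulate

def gen_block_messages(old_sizes, new_sizes):
    """Return {(src, dst): size} for every message the redistribution needs."""
    old = [0] + list(accumulate(old_sizes))   # prefix sums = interval bounds
    new = [0] + list(accumulate(new_sizes))
    msgs = {}
    for s in range(len(old_sizes)):
        for d in range(len(new_sizes)):
            # Overlap of source interval [old[s], old[s+1]) and
            # destination interval [new[d], new[d+1]).
            size = min(old[s + 1], new[d + 1]) - max(old[s], new[d])
            if size > 0:
                msgs[(s, d)] = size
    return msgs

# The slide's example: S = /7,16,11,10,7,49/ redistributed to /15,16,10,16,15,28/.
msgs = gen_block_messages([7, 16, 11, 10, 7, 49], [15, 16, 10, 16, 15, 28])
print(msgs)
```

Note that every message (s, d) with s' >= s has d' >= d, which is exactly the "no cross communications" observation the later slides build on.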

8 Irregular Data Distribution (GEN_BLOCK) (Introduction, cont.)
[figure: an application runs algorithm P then algorithm Q on heterogeneous processors, each algorithm with its own data distribution]

9 Problem Category (Introduction, cont.)
Benefits of runtime redistribution:
- Achieves data locality
- Reduces communication cost at runtime
Objectives:
- Indexing-set generation
- Data packing and unpacking techniques
- Communication optimizations: multi-stage redistribution method, processor mapping technique, communication scheduling

10 Outline
- Introduction
  - Regular / Irregular Data Distribution, Redistribution
  - Category of Runtime Redistribution Problems
- Processor Mapping Technique for Communication Localization
  - The Processor Mapping Technique
  - Multi-Cluster Grid System
- Contention Free Communication Scheduling for Irregular Problems
  - The Two-Phase Degree Reduction Method (TPDR)
  - Extended TPDR (E-TPDR)
- Conclusions

11 Processor Mapping Technique
The original processor mapping technique (Prof. Lionel M. Ni):
- A mapping function generates a new sequence of logical processor ids
- Increases data hits
- Minimizes the amount of data exchanged

12 An Optimal Processor Mapping Technique (Hsu '05) (Processor Mapping Technique, cont.)
Example: BC 86 over 11, comparing the traditional method, size-oriented greedy matching, and maximum matching (optimal).

13 Localize Communications on a Cluster Grid (Processor Mapping Technique, cont.)
- Interior communication: between nodes within the same cluster
- External communication: between nodes in different clusters
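The interior/external split above can be sketched in a few lines. This is an illustrative simplification (not the paper's code): it assumes clusters of k nodes each, with logical processor x placed on cluster x // k.

```python
# Hypothetical sketch: classify each message of a redistribution as interior
# (source and destination in the same cluster) or external, assuming logical
# processor x lives on cluster x // k.

def classify(messages, k):
    """messages: list of (src, dst) logical-processor pairs."""
    interior = sum(1 for s, d in messages if s // k == d // k)
    return interior, len(messages) - interior

# Toy example: 2 clusters of 3 nodes; messages crossing the cluster boundary
# at processor 3 are external.
msgs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
inside, outside = classify(msgs, k=3)
print(inside, outside)
```

The mapping technique's goal, in these terms, is to reorder logical processor ids so that `outside` is driven as close to zero as possible, as the |I|/|E| tables on the following slides show.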

14 Motivating Example Processor Mapping Technique (cont.)

15 Communication Table before Processor Mapping (Processor Mapping Technique, cont.)
[table: |I| = 9 interior, |E| = 18 external communications]

16 Communication links Before Processor Mapping Processor Mapping Technique (cont.)

17 Communication Table after Processor Mapping (Processor Mapping Technique, cont.)
[table: |I| = 27 interior, |E| = 0 external communications]

18 Communication links after Processor Mapping Processor Mapping Technique (cont.)

19 Processor Reordering Flow Diagram (Processor Mapping Technique, cont.)
[diagram: the master node partitions and aligns/dispatches the source data; a reordering agent takes SCA(x) and SD(P_x), applies the mapping function to generate new processor ids, reorders to SD(P_x'), then determines the target cluster DCA(x) and designates the target node DD(P_y)]
Mapping function: F(X) = X' = ... + (X mod C) * K

20 Identical Cluster Grid vs. Non-identical Cluster Grid Processor Mapping Technique (cont.)

21 Processor Replacement Algorithm for Non-identical Cluster Grid Processor Mapping Technique (cont.)

22 Theoretical Analysis Processor Mapping Technique (cont.) The number of interior communications when C=3.

23 Theoretical Analysis Processor Mapping Technique (cont.)

24 Theoretical Analysis Processor Mapping Technique (cont.)

25 Simulation Setting (Processor Mapping Technique, cont.)
- Taiwan UniGrid: 8 campus clusters
- SPMD programs, C + MPI codes

26 Topology (Processor Mapping Technique, cont.)
[map: clusters spread over Taipei, Hsinchu, Taichung, Tainan, and Hualien: Academia Sinica, Hsing Kuo University, Chung Hua University, National Tsing Hua University 1 and 2, National Center for High-performance Computing, Providence University, Tunghai University, National Dong Hwa University]

27 Hardware Infrastructure (Processor Mapping Technique, cont.)
- HKU: Intel P3 1.0 GHz, 256 MB
- THU: dual AMD 1.6 GHz, 1 GB
- CHU: Intel P4 2.8 GHz, 256 MB
- SINICA: dual Intel P3 1.0 GHz, 1 GB
- NCHC: dual AMD 2000+, 512 MB
- PU: AMD 2400+, 1 GB
- NTHU: dual Xeon 2.8 GHz, 1 GB
- NDHU: AMD Athlon, 256 MB
All clusters connected via the Internet.

28 System Monitoring Webpage Processor Mapping Technique (cont.)

29 Experimental Results Processor Mapping Technique (cont.)

30 Experimental Results Processor Mapping Technique (cont.)

31 Experimental Results Processor Mapping Technique (cont.)

32 Outline
- Introduction
  - Regular / Irregular Data Distribution, Redistribution
  - Category of Runtime Redistribution Problems
- Processor Mapping Technique for Communication Localization
  - The Processor Mapping Technique
  - Multi-Cluster Grid System
- Scheduling Contention Free Communications for Irregular Problems
  - The Two-Phase Degree Reduction Method (TPDR)
  - Extended TPDR (E-TPDR)
- Conclusions

33 Example of GEN_BLOCK Distributions (Scheduling Irregular Redistributions)
GEN_BLOCK enhances load balancing in a heterogeneous environment.
[figure: an application switches between the data distributions for algorithm P and algorithm Q]

34 Example of GEN_BLOCK Redistribution (Scheduling Irregular Redistributions, cont.)
Observation: there are no cross communications; messages between source and target processors never cross each other.

35 Convex Bipartite Graph (Scheduling Irregular Redistributions, cont.)
[figure: source processors SP1-SP3 and target processors TP1-TP3 as nodes; edges represent data communications]

36 Example of GEN_BLOCK Redistribution (Scheduling Irregular Redistributions, cont.)
A simple scheduling result. Goals: minimize the number of communication steps and minimize the total message size over all steps.

37 Related Implementations Coloring

38 Related Implementations LIST

39 Related Implementations: DC1 & DC2
[figures: (a) DC1, (b) DC2]

40 The Two-Phase Degree Reduction Method (Scheduling Irregular Redistributions, cont.)
- First phase (nodes with degree > 2): each reduction iteration reduces the degree of the maximum-degree nodes by one.
- Second phase (nodes with degree 1 or 2): schedules the remaining messages between those nodes using an adjustable coloring mechanism.
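The lower bound that drives TPDR can be sketched directly: in a contention-free schedule every source sends at most one message per step and every target receives at most one, so the step count is at least the maximum node degree of the bipartite communication graph. A small Python illustration (an assumed simplification, not the authors' implementation; the message pairs are those derived from the slide-7 GEN_BLOCK example):

```python
# Illustrative sketch: the minimum number of contention-free communication
# steps equals the maximum node degree of the bipartite graph. TPDR's first
# phase peels one message off every maximum-degree node per iteration until
# all degrees are at most 2.

from collections import Counter

def degrees(messages):
    """messages: list of (src, dst) pairs; returns (send_deg, recv_deg)."""
    return Counter(s for s, _ in messages), Counter(d for _, d in messages)

def min_steps(messages):
    send, recv = degrees(messages)
    return max(max(send.values()), max(recv.values()))

# Message pairs of the slide-7 redistribution (by interval intersection).
msgs = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3),
        (4, 3), (5, 3), (5, 4), (5, 5)]
print(min_steps(msgs))
```

Here SP5 sends three messages and TP3 receives three, so no contention-free schedule can use fewer than three steps, matching the three steps S1-S3 on the following slides.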

41 The First Phase (Scheduling Irregular Redistributions, cont.)
The Two-Phase Degree Reduction Method
S3: m11(6), m5(3) --- 6

42 The Second Phase (Scheduling Irregular Redistributions, cont.)
The Two-Phase Degree Reduction Method
S1: m1(7), m3(7), m6(15), m8(4), m10(8), m13(18) --- 18
S2: m2(3), m4(4), m7(3), m9(10), m12(12)
S3: m11(6), m5(3) --- 6

43 Extended TPDR (Scheduling Irregular Redistributions, cont.)
E-TPDR:
S1: m1(7), m3(7), m6(15), m9(10), m13(18)
S2: m4(4), m7(3), m10(8), m12(12)
S3: m11(6), m5(3), m2(3), m8(4)
TPDR:
S1: m1(7), m3(7), m6(15), m8(4), m10(8), m13(18) --- 18
S2: m2(3), m4(4), m7(3), m9(10), m12(12)
S3: m11(6), m5(3)
[figure: m2(3) and m8(4) are moved into the lightly loaded step S3]
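The E-TPDR refinement shown above can be sketched as a greedy move: a message may migrate into a step whose cost (its largest message) it does not increase, provided neither of its endpoints is already busy in that step. The following is an assumed simplification with illustrative names, not the paper's algorithm:

```python
# Hypothetical sketch of the E-TPDR idea: try to place a message into an
# existing step whose cost (largest message size) it does not raise, and in
# which both its endpoints are still free.

def try_move(schedule, msg):
    """schedule: list of steps, each a list of (src, dst, size) messages."""
    s, d, size = msg
    for step in schedule:
        cost = max(sz for _, _, sz in step) if step else 0
        busy = {p for src, dst, _ in step for p in (src, dst)}
        if size <= cost and s not in busy and d not in busy:
            step.append(msg)
            return True
    return False

# Toy schedule with step costs 18 and 6: the size-3 message is cost-free in
# either step, but a sender conflict forces it into the second step.
steps = [[(0, 0, 18), (1, 1, 7)], [(2, 2, 6)]]
moved = try_move(steps, (1, 0, 3))
print(moved, steps)
```

This mirrors how m2(3) and m8(4) end up in S3 above: S3's cost of 6 already dominates them, so moving them there shortens S1 and S2 without lengthening S3.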

44 Performance Evaluation Simulation of TPDR and E-TPDR algorithms on uneven cases.

45 Performance Evaluation (cont.) Simulation A is carried out to examine the performance of TPDR and E-TPDR algorithms on uneven cases.

46 Performance Evaluation (cont.) Simulation B is carried out to examine the performance of TPDR and E-TPDR algorithms on even cases.

47 Performance Evaluation (cont.) Simulation B is carried out to examine the performance of TPDR and E-TPDR algorithms on even cases.

48 Summary
TPDR & E-TPDR for scheduling irregular GEN_BLOCK redistributions:
- Contention free
- Optimal number of communication steps
- Outperforms the D&C algorithm
- TPDR (uneven cases) performs better than TPDR (even cases)

49 Test Cases (Performance Evaluation, cont.)

50 Average Performance Evaluation (cont.)

51 Conclusions
- Runtime data redistribution is commonly used to enhance algorithm performance in data-parallel applications.
- The processor mapping technique minimizes data transmission cost and achieves better communication localization on multi-cluster grid systems.
- TPDR & E-TPDR for scheduling irregular GEN_BLOCK redistributions: contention free, good performance.
Future work:
- Incorporate localization techniques on the Data Grid, considering heterogeneous external communication overheads.
- Incorporate the ratio between local memory access and remote message passing (on different architectures) into the E-TPDR scheduling policy.
- ...

52 Thank you

53 Our Implementation of Communication Scheduling: E-TPDR
After processes 1 and 2, m7 is scheduled in step 3 and the maximum degree becomes 2. After process 3, m1 and m11 are scheduled in step 3. Edges are colored blue and red for steps 1 and 2, respectively.