Adaptive Cyclic Scheduling of Nested Loops
Florina M. Ciorba, Theodore Andronikos and George Papakonstantinou
National Technical University of Athens, Computing Systems Laboratory
HERCMA'05, September 24, 2005

Outline
- Introduction
- Definitions and notations
- Adaptive cyclic scheduling
- ACS for homogeneous systems
- ACS for heterogeneous systems
- Conclusions
- Future work

Introduction
Motivation: a lot of work has been done on parallelizing loops with dependencies, but very little work exists on explicitly minimizing the communication incurred by certain dependence vectors.

Introduction
Contribution:
- Enhancing the data locality for loops with dependencies
- Reducing the communication cost by mapping iterations tied by certain dependence vectors to the same processor
- Applicability to both homogeneous and heterogeneous systems, regardless of their interconnection network

Definitions and notations
Algorithmic model:
  FOR (i1=l1; i1<=u1; i1++)
    FOR (i2=l2; i2<=u2; i2++)
      ...
      FOR (in=ln; in<=un; in++)
        Loop Body
      ENDFOR
      ...
    ENDFOR
  ENDFOR
- Perfectly nested loops
- Constant flow data dependencies
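As a concrete illustration of this model, here is a minimal C sketch of a 2D perfectly nested loop with constant flow dependencies. The dependence vectors (1,3), (2,2) and (4,1) are borrowed from the ACS example later in the talk; the array, bounds and loop body are placeholders, not taken from the paper.

    /* Minimal C instance of the algorithmic model: a perfectly nested 2D loop
     * with constant flow dependencies.  The dependence vectors are (1,3),
     * (2,2) and (4,1); the loop body itself is only a placeholder. */
    #include <stdio.h>

    #define U1 100   /* upper bound of i1 (illustrative) */
    #define U2 100   /* upper bound of i2 (illustrative) */

    static double A[U1 + 1][U2 + 1];   /* pre-computed boundary values sit at the low indices */

    int main(void)
    {
        for (int i1 = 4; i1 <= U1; i1++)          /* l1 = 4 so every reference stays in bounds */
            for (int i2 = 3; i2 <= U2; i2++) {    /* l2 = 3 for the same reason */
                /* each write depends on iterations (i1,i2) - d for d in {(1,3),(2,2),(4,1)} */
                A[i1][i2] = A[i1 - 1][i2 - 3] + A[i1 - 2][i2 - 2] + A[i1 - 4][i2 - 1] + 1.0;
            }
        printf("A[%d][%d] = %f\n", U1, U2, A[U1][U2]);
        return 0;
    }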

Definitions and notations
- J – the index space of an n-dimensional loop
- ECT – earliest computation time of an iteration point
- R_k – the set of points (called a region) of J with ECT k
- R_0 – contains the boundary (pre-computed) points
- Con(d_1, ..., d_q) – a cone, the convex subspace formed by q of the m+1 dependence vectors of the problem
- Trivial cones – cones defined by dependence vectors and at least one unitary axis vector
- Non-trivial cones – cones defined exclusively by dependence vectors
- Cone vectors – the dependence vectors d_i (i <= q) that define the hyperplane in a cone
- Chain of computations – a sequence of iterations executed by the same processor

Definitions and notations
Index space of a loop with d_1 = (1,7), d_2 = (2,4), d_3 = (3,2), d_4 = (4,4) and d_5 = (6,1).
The cone vectors are d_1, d_2, d_3 and d_5.
The first three regions and a few chains of computations are shown.
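The regions R_k of this figure can be obtained mechanically: the ECT of an iteration point is one plus the maximum ECT of its predecessors along the dependence vectors, with the pre-computed boundary points taken as ECT 0. Below is a small C sketch of that sweep for the five dependence vectors above; the index-space size is an illustrative assumption.

    /* Sketch: compute the earliest computation time (ECT) of every iteration
     * point, and hence the regions R_k, for the 2D example with
     * d1=(1,7), d2=(2,4), d3=(3,2), d4=(4,4), d5=(6,1). */
    #include <stdio.h>

    #define N1 20
    #define N2 20

    int main(void)
    {
        const int d[5][2] = { {1,7}, {2,4}, {3,2}, {4,4}, {6,1} };
        int ect[N1][N2] = {0};

        for (int i = 0; i < N1; i++)
            for (int j = 0; j < N2; j++) {
                int maxp = 0;   /* predecessors outside J are pre-computed boundary points (ECT 0) */
                for (int v = 0; v < 5; v++) {
                    int pi = i - d[v][0], pj = j - d[v][1];
                    if (pi >= 0 && pj >= 0 && ect[pi][pj] > maxp)
                        maxp = ect[pi][pj];
                }
                ect[i][j] = maxp + 1;   /* point (i,j) belongs to region R_(maxp+1) */
            }

        printf("ECT of (%d,%d) is %d\n", N1 - 1, N2 - 1, ect[N1 - 1][N2 - 1]);
        return 0;
    }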

Definitions and notations
- d_c – the communication vector (one of the cone vectors)
- j = p + λ·d_c – the family of lines of J formed by d_c
- C_r – a chain formed by d_c (the set of points j = p + λ·d_c for a given p)
- |C_r| – the number of iteration points of C_r
- r – a natural number indicating the relative offset between chain C_(0,0) and chain C_r
- C – the set of C_r chains; |C| is the number of C_r chains
- |C_M| – the cardinality of the maximal chain
- D_r^in – the volume of "incoming" data for C_r
- D_r^out – the volume of "outgoing" data for C_r
- D_r^in + D_r^out – the total communication associated with C_r
- #P – the number of available processors
- m – the number of dependence vectors other than d_c

Definitions and notations
The communication vector is d_c = d_3 = (3,2).
Chains are formed along d_c.
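A hedged sketch of how the chains along d_c = (3,2) can be enumerated: each iteration point is traced backwards along d_c until it leaves the index space; the points that cannot be traced back are the chain origins, so counting them gives |C|. The index-space size is an illustrative assumption.

    /* Sketch: group iteration points into chains along d_c = (3,2) and count |C|. */
    #include <stdio.h>

    #define N1 12
    #define N2 12

    int main(void)
    {
        const int dc[2] = {3, 2};              /* communication vector d_c */
        int count = 0;

        for (int i = 0; i < N1; i++)
            for (int j = 0; j < N2; j++) {
                int oi = i, oj = j;
                while (oi - dc[0] >= 0 && oj - dc[1] >= 0) {   /* walk back along d_c */
                    oi -= dc[0];
                    oj -= dc[1];
                }
                if (oi == i && oj == j)
                    count++;                   /* (i,j) is a chain origin, i.e. it starts a new C_r */
            }

        printf("number of chains |C| along d_c=(%d,%d): %d\n", dc[0], dc[1], count);
        return 0;
    }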

Adaptive cyclic scheduling (ACS)
Assumptions:
- All points of a chain C_r are mapped to the same processor
- Each chain is mapped to a different processor:
  a) homogeneous case: in round-robin fashion, load balanced
  b) heterogeneous case: according to the available computational power of every processor
- #P is arbitrarily chosen but fixed
- The master-slave model is used in both cases

Adaptive cyclic scheduling (ACS) – The ACS Algorithm
INPUT: an n-dimensional nested loop with terminal point U.
Master:
(1) Determine the cone vectors.
(2) Compute the cones.
(3) Use QuickHull to find the optimal hyperplane.
(4) Choose d_c.
(5) Form and count the chains.
(6) Compute the relative offsets between C_(0,0) and the m dependence vectors.
(7) Divide #P so as to best cover the relative offsets below as well as above d_c. If no dependence vector exists below (or above) d_c, choose the offset closest to #P above (or below) d_c, and use the remaining processors below (or above) d_c.
(8) Assign chains to slaves:
    a) (homogeneous systems) in cyclic fashion;
    b) (heterogeneous systems) according to their available computational power (i.e. longer/more chains are mapped to faster processors, whereas shorter/fewer chains go to slower processors).
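A minimal C sketch of steps (7) and (8a) for the homogeneous case, assuming (as in the example two slides ahead) 5 slaves split into a group of 3 for the chains below d_c and a group of 2 for the chains above it; the total number of chains and the below/above split point are made-up values.

    /* Sketch of steps (7)-(8a): split #P slaves into two groups and assign
     * chains cyclically inside each group. */
    #include <stdio.h>

    #define NP 5            /* #P: number of slaves                       */
    #define NCHAINS 24      /* |C|: total number of chains (illustrative) */

    int main(void)
    {
        int owner[NCHAINS];
        int p_below = 3, p_above = NP - 3;   /* e.g. S1..S3 below d_c, S4..S5 above  */
        int n_below = 14;                    /* chains 0..13 lie below d_c (assumed) */

        for (int c = 0; c < NCHAINS; c++) {
            if (c < n_below)
                owner[c] = c % p_below;                         /* cyclic over S1..S3 */
            else
                owner[c] = p_below + (c - n_below) % p_above;   /* cyclic over S4..S5 */
            printf("chain %2d -> slave S%d\n", c, owner[c] + 1);
        }
        return 0;
    }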

Adaptive cyclic scheduling (ACS) – The ACS Algorithm (continued)
Slave:
(1) Send a request for work to the master (and communicate the available computational power, if in a heterogeneous system).
(2) Wait for the reply; store all chains and sort the points by the region they belong to.
(3) Compute the points region by region, and along the optimal hyperplane. Communicate only when needed points are not locally computed.
OUTPUT: (Slave) When no more points remain in memory, notify the master and terminate. (Master) When all slaves have sent their notification, collect the results and terminate.
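To illustrate the slave's side, here is a tiny C sketch of steps (2) and (3): the assigned points are sorted by region and processed region by region; a real slave would additionally check whether each predecessor is locally owned and communicate otherwise. The point list and its regions are made up.

    /* Sketch of the slave's execution order (steps (2)-(3) above). */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int i, j, region; } Point;

    static int by_region(const void *a, const void *b)
    {
        return ((const Point *)a)->region - ((const Point *)b)->region;
    }

    int main(void)
    {
        Point pts[] = { {6,4,3}, {3,2,2}, {0,0,1}, {9,6,4} };   /* points of one chain (made up) */
        int n = (int)(sizeof pts / sizeof pts[0]);

        qsort(pts, n, sizeof(Point), by_region);   /* step (2): sort by region */

        for (int k = 0; k < n; k++) {              /* step (3): compute region by region */
            /* a real slave would check here whether the predecessors (pts[k] - d)
             * are locally computed and receive them from their owner otherwise */
            printf("computing point (%d,%d) in region R_%d\n", pts[k].i, pts[k].j, pts[k].region);
        }
        return 0;
    }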

Adaptive cyclic scheduling (ACS) – ACS for homogeneous systems
- d_1 = (1,3), d_2 = (2,2), d_3 = (4,1); the communication vector is d_c = d_2
- C_(0,0) communicates with C_(0,0) + d_1 = C_(0,2) and C_(0,0) + d_3 = C_(3,0)
- The relative offsets are 2 and 3 (the chains labelled r = 2 and r = 3 in the figure)
- 5 slaves, 1 master
- S_1, S_2, S_3 are cyclically assigned the chains below d_c
- S_4, S_5 are cyclically assigned the chains above d_c

Adaptive cyclic scheduling (ACS) – ACS for heterogeneous systems
Assumption:
- Every process running on a heterogeneous computer takes an equal share of its computing resources
Notations:
- ACP_i – the available computational power of slave i
- VCP_i – the virtual computational power of slave i
- Q_i – the number of running processes in the queue of slave i
- ACP – the total available computational power of the heterogeneous system
ACP_i = VCP_i / Q_i and ACP = Σ ACP_i
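A short sketch of how the master could turn these quantities into a chain allocation: each slave's ACP_i = VCP_i / Q_i is computed, summed into ACP, and the chains are split proportionally to ACP_i / ACP. The VCP and Q values below are made up; with these particular numbers, and assuming 24 chains in total, the split reproduces the 4 + 5 + 5 + 5 + 5 distribution of the example on the next slide.

    /* Sketch: proportional chain allocation from ACP_i = VCP_i / Q_i. */
    #include <stdio.h>

    #define NP 5
    #define NCHAINS 24

    int main(void)
    {
        double vcp[NP] = {1.0, 1.0, 0.8, 1.0, 1.0};   /* S3 is the slowest machine (assumed)  */
        int    q[NP]   = {1, 1, 1, 1, 1};             /* running processes per slave (assumed) */
        double acp[NP], total = 0.0;

        for (int i = 0; i < NP; i++) {
            acp[i] = vcp[i] / q[i];                   /* ACP_i = VCP_i / Q_i */
            total += acp[i];                          /* ACP = sum of ACP_i  */
        }

        int assigned = 0;
        for (int i = 0; i < NP; i++) {
            /* proportional share; the last slave absorbs the rounding remainder */
            int share = (i == NP - 1) ? NCHAINS - assigned
                                      : (int)(NCHAINS * acp[i] / total + 0.5);
            assigned += share;
            printf("slave S%d: ACP=%.2f -> %d chains\n", i + 1, acp[i], share);
        }
        return 0;
    }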

Adaptive cyclic scheduling (ACS) – ACS for heterogeneous systems
- d_1 = (1,3), d_2 = (2,2), d_3 = (4,1); the communication vector is d_c = d_2
- C_(0,0) communicates with C_(0,0) + d_1 = C_(0,2) and C_(0,0) + d_3 = C_(3,0)
- 5 slaves, 1 master
- S_3 has the lowest ACP
- S_3 is assigned 4 chains; S_1, S_2, S_4, S_5 are assigned 5 chains each
- (An oversimplified example; r = 2 and r = 3 mark the relative offsets in the figure.)

Adaptive cyclic scheduling (ACS) – Advantages
- It zeroes the communication cost imposed by as many dependence vectors as possible
- #P is divided into two groups of processors, used in the areas above and below d_c respectively, so that chains above d_c are cyclically mapped to one group and chains below d_c to the other
- This way the communication cost is additionally zeroed along one dependence vector in the area above d_c, and along another dependence vector in the area below d_c
- Suitable for:
  - homogeneous systems (an arbitrary chain is mapped to an arbitrary processor)
  - heterogeneous systems (longer/more chains are mapped to faster processors, whereas shorter/fewer chains go to slower processors)

Conclusions
- The total communication cost can be significantly reduced if the communication incurred by certain dependence vectors is eliminated
- Preliminary simulations show that the adaptive cyclic mapping outperforms other mapping schemes (e.g. cyclic mapping) by enhancing the data locality

Future work
- Simulate the algorithm on various architectures (such as shared-memory systems, SMPs and MPP systems) and for real-life test cases

Thank you
Questions?