Reducing the Communication Cost via Chain Pattern Scheduling
Florina M. Ciorba, Theodore Andronikos, Ioannis Drositis, George Papakonstantinou and Panayotis Tsanakas
National Technical University of Athens, Computing Systems Laboratory
NCA'05, July 29, 2005

Outline: Introduction; Definitions and notations; Chain pattern scheduling (unbounded #P – high communication, fixed #P – moderate communication); Performance results; Conclusions; Future work

Introduction
Motivation: a lot of work has been done on parallelizing loops with dependencies, but very little work exists on explicitly minimizing the communication incurred by certain dependence vectors.

Introduction
Contribution:
- Enhancing data locality for loops with dependencies
- Reducing the communication cost by mapping iterations tied by certain dependence vectors to the same processor
- Applicability to various multiprocessor architectures

Outline: Introduction; Definitions and notations; Chain pattern scheduling (unbounded #P – high communication, fixed #P – moderate communication); Performance results; Conclusions; Future work

Definitions and notations
Algorithmic model:
    FOR (i_1 = l_1; i_1 <= u_1; i_1++)
      FOR (i_2 = l_2; i_2 <= u_2; i_2++)
        ...
          FOR (i_n = l_n; i_n <= u_n; i_n++)
            Loop Body
          ENDFOR
        ...
      ENDFOR
    ENDFOR
- Perfectly nested loops
- Constant flow data dependencies
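For concreteness, a 2-dimensional instance of this model with the dependence vectors used in the running example of the slides (d1 = (1,3), d2 = (2,2), d3 = (4,1), d4 = (4,3)) could look as follows; this is only a sketch, and the array name A, its size and the loop bounds are illustrative rather than taken from the slides:

    // Minimal sketch: a 2D uniform-dependence loop matching the slides'
    // dependence vectors d1=(1,3), d2=(2,2), d3=(4,1), d4=(4,3).
    // The array A, its size and the lower bounds are illustrative only.
    #include <vector>

    int main() {
        const int N = 100;                        // index space 4..N-1 x 4..N-1
        std::vector<std::vector<double>> A(N, std::vector<double>(N, 1.0));
        for (int i1 = 4; i1 < N; ++i1)            // perfectly nested loop
            for (int i2 = 4; i2 < N; ++i2)        // constant flow data dependencies:
                A[i1][i2] = A[i1-1][i2-3]         //   d1 = (1,3)
                          + A[i1-2][i2-2]         //   d2 = (2,2)
                          + A[i1-4][i2-1]         //   d3 = (4,1)
                          + A[i1-4][i2-3];        //   d4 = (4,3)
        return 0;
    }

Each iteration (i1, i2) reads values produced by the iterations (i1, i2) - d_k, which is exactly the constant-flow-dependence assumption of the model.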

Definitions and notations
- J – the index space of an n-dimensional uniform dependence loop
- ECT – the earliest computation time of an iteration (time-patterns)
- Pat_k – the set of points of J (called a pattern) with ECT k
- Pat_0 – contains the boundary (pre-computed) points
- Pat_1 – the initial pattern
- pat_k – the pattern outline (upper boundary) of Pat_k
- Pattern points – the points that define the polygon shape of a pattern
- Pattern vectors – the dependence vectors d_i whose end points are the pattern points of Pat_1
- Chain of computations – a sequence of iterations executed by the same processor (space-patterns)
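These time-patterns can be made concrete with a small sketch that computes the ECT of every iteration of a 2D uniform-dependence loop and groups iterations with equal ECT into Pat_k; this is one common way to obtain such patterns, and the index-space size and variable names are assumptions:

    // Illustrative sketch (names and the 20x20 index space are assumptions):
    // compute the earliest computation time (ECT) of every iteration and group
    // iterations with the same ECT into Pat_k; iterations with no in-bounds
    // predecessor keep ECT 0 and form the boundary pattern Pat_0.
    #include <algorithm>
    #include <map>
    #include <utility>
    #include <vector>

    using Point = std::pair<int,int>;

    int main() {
        const int N = 20;                                      // index space 0..N-1 x 0..N-1
        std::vector<Point> deps = {{1,3},{2,2},{4,1},{4,3}};   // dependence vectors d1..d4
        std::vector<std::vector<int>> ect(N, std::vector<int>(N, 0));
        std::map<int, std::vector<Point>> pat;                 // ECT k -> points of Pat_k
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                int k = 0;
                for (auto [di, dj] : deps)                     // ECT = 1 + max ECT of the
                    if (i - di >= 0 && j - dj >= 0)            // iterations it depends on
                        k = std::max(k, ect[i - di][j - dj] + 1);
                ect[i][j] = k;
                pat[k].push_back({i, j});                      // k = 0: boundary point (Pat_0)
            }
        return 0;
    }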

Definitions and notations (example figure)
- Index space of a loop with d1 = (1,3), d2 = (2,2), d3 = (4,1), d4 = (4,3)
- The pattern vectors are d1, d2 and d3
- Pat_1, Pat_2, Pat_3 and the outlines pat_1, pat_2, pat_3 are shown
- A few chains of computations are shown

Definitions and notations
- d_c – the communication vector (one of the pattern vectors)
- j = p + λ·d_c – the family of lines of J formed by d_c
- C_r = {r + λ·d_c ∈ J, λ ≥ 0} – the chain formed by d_c that starts at point r
- |C_r| – the number of iteration points of C_r
- r – the starting point of a chain
- C – the set of chains C_r; |C| is the number of chains C_r
- |C_M| – the cardinality of the maximal chain
- D_r^in – the volume of "incoming" data for C_r
- D_r^out – the volume of "outgoing" data for C_r
- D_r^in + D_r^out – the total communication associated with C_r
- #P – the number of available processors
- m – the number of dependence vectors, excluding d_c

Definitions and notations (example figure)
- The communication vector is d_c = d_2 = (2,2)
- C_(0,0) communicates with C_(0,2), C_(1,0) and C_(3,0)
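The chain decomposition behind this example can be sketched as follows: every iteration point is assigned to the chain whose starting point is reached by stepping backwards along d_c until the index space is left (the 10×10 size is an assumption for illustration):

    // Illustrative sketch: decomposing a 2D index space into the chains C_r
    // formed by the communication vector d_c = d2 = (2,2). Every point is put
    // into the chain whose starting point r is reached by stepping backwards
    // along d_c while staying inside the index space.
    #include <map>
    #include <utility>
    #include <vector>

    using Point = std::pair<int,int>;

    int main() {
        const int N = 10;                              // assumed index space size
        const Point dc = {2, 2};                       // communication vector d_c
        std::map<Point, std::vector<Point>> chains;    // chain start r -> points of C_r
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                int ri = i, rj = j;
                while (ri - dc.first >= 0 && rj - dc.second >= 0) {
                    ri -= dc.first;                    // step back along d_c
                    rj -= dc.second;
                }
                chains[{ri, rj}].push_back({i, j});    // (i,j) belongs to C_(ri,rj)
            }
        // chains.size() is |C|; the largest entry gives the maximal chain C_M.
        return 0;
    }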

Outline: Introduction; Definitions and notations; Chain pattern scheduling (unbounded #P – high communication, fixed #P – moderate communication); Performance results; Conclusions; Future work

Chain pattern scheduling
Scenario 1: unbounded #P – high communication
- All points of a chain C_r are mapped to the same processor
- #P is assumed to be unbounded
- Each chain is mapped to a different processor
Disadvantages:
- Unrealistic, because for large index spaces the number of chains formed, and hence of processors needed, is prohibitive
- Provides limited data locality (only for points tied by d_c)
- The total communication volume is V = (D_r^in + D_r^out)·|C| ≈ 2·m·|C_M|·|C|

Chain pattern scheduling – Scenario 1: unbounded #P, high communication (figure)
Each chain is mapped to a different processor; 24 chains are formed.

Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
- All points of a chain C_r are mapped to the same processor
- #P is arbitrarily chosen and fixed
Mapping I: cyclic mapping [8]
- Each chain from the pool of unassigned chains is mapped to a processor in a cyclic fashion
Disadvantages:
- Provides limited data locality
- The total communication volume is a function of #P and r_1, …, r_m
- Because one cannot predict for which dependence vector, and in which case, the communication is eliminated, the total communication volume is bounded above by V ≈ 2·m·|C_M|·|C|
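A minimal sketch of Mapping I under the assumptions above; the chain starting points are assumed to come from a decomposition such as the one sketched earlier, and the names are illustrative:

    // Minimal sketch of Mapping I (cyclic mapping): chains are taken from the
    // pool of unassigned chains and dealt out to the #P processors round-robin.
    #include <map>
    #include <utility>
    #include <vector>

    using Point = std::pair<int,int>;

    std::map<Point, int> cyclicMapping(const std::vector<Point>& chainStarts, int P) {
        std::map<Point, int> owner;             // chain start r -> processor id
        int next = 0;
        for (const Point& r : chainStarts) {
            owner[r] = next;                    // chain C_r goes to processor 'next'
            next = (next + 1) % P;              // cyclic assignment over 0..P-1
        }
        return owner;
    }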

Chain pattern scheduling – Scenario 2: fixed #P, moderate communication. Mapping I: cyclic mapping (figure).

Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
- Zeroes the communication cost imposed by as many dependence vectors as possible
- #P is divided into a group of n_a processors used in the area above d_c, and another group of n_b processors used in the area below d_c
- Chains above d_c are cyclically mapped to the n_a processors
- Chains below d_c are cyclically mapped to the n_b processors
- In this way the communication cost is additionally zeroed along one dependence vector in the area above d_c, and along another dependence vector in the area below d_c
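A sketch of Mapping II under these assumptions; the slides do not spell out how a chain is classified as lying above or below d_c, so the sign of the cross product between d_c and the chain's starting point is used here purely as an illustrative criterion:

    // Illustrative sketch of Mapping II (chain pattern mapping). Chains above
    // d_c are cycled over the n_a processors 0..n_a-1, chains below d_c over
    // the n_b processors n_a..n_a+n_b-1. The above/below test is an assumption.
    #include <map>
    #include <utility>
    #include <vector>

    using Point = std::pair<int,int>;

    std::map<Point, int> chainPatternMapping(const std::vector<Point>& chainStarts,
                                             Point dc, int na, int nb) {
        std::map<Point, int> owner;                 // chain start r -> processor id
        int nextAbove = 0, nextBelow = 0;
        for (const Point& r : chainStarts) {
            long cross = static_cast<long>(dc.first) * r.second
                       - static_cast<long>(dc.second) * r.first;
            if (cross >= 0) {                       // on or above the line along d_c
                owner[r] = nextAbove;
                nextAbove = (nextAbove + 1) % na;
            } else {                                // below the line along d_c
                owner[r] = na + nextBelow;
                nextBelow = (nextBelow + 1) % nb;
            }
        }
        return owner;
    }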

Chain pattern scheduling – Scenario 2: fixed #P, moderate communication. Mapping II: chain pattern mapping with n_a = 2 and n_b = 3 (figure).

Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
- The total communication volume in this case is bounded above by V ≈ 2·(m - 2)·|C_M|·|C|
Differences from the cyclic mapping:
- Processors do not span the entire index space, but only a part of it
- A different cycle size is chosen to map different areas of the index space

Chain pattern scheduling
Scenario 2: fixed #P – moderate communication
Mapping II: chain pattern mapping
Advantages:
- Provides better data locality than the cyclic mapping
- Uses a more realistic #P than the cyclic mapping
- Suitable for:
  - Distributed memory systems (a chain is mapped to a single processor)
  - Symmetric multiprocessor systems (a chain is mapped to a single node, which may contain more than one processor)
  - Heterogeneous systems (longer chains are mapped to faster processors, shorter chains to slower processors)
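The heterogeneous case is only stated as a policy on the slide; one way to realize it, purely as an illustration, is to hand out chains longest-first to the processor whose speed-normalized load is currently smallest (the speed values and the greedy rule are assumptions, not the authors' method):

    // Illustrative sketch only: "longer chains to faster processors" realized
    // as a greedy, speed-normalized least-loaded assignment.
    #include <algorithm>
    #include <vector>

    struct Chain { int id; int length; };       // id in 0..n-1, length = |C_r|

    std::vector<int> heterogeneousMapping(std::vector<Chain> chains,
                                          const std::vector<double>& speed) {
        std::sort(chains.begin(), chains.end(),
                  [](const Chain& a, const Chain& b) { return a.length > b.length; });
        std::vector<double> load(speed.size(), 0.0);   // normalized work per processor
        std::vector<int> owner(chains.size(), 0);      // chain id -> processor id
        for (const Chain& c : chains) {
            std::size_t best = 0;
            for (std::size_t p = 1; p < speed.size(); ++p)
                if (load[p] < load[best]) best = p;    // least-loaded processor so far
            owner[c.id] = static_cast<int>(best);
            load[best] += c.length / speed[best];      // faster processors absorb more work
        }
        return owner;
    }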

Outline: Introduction; Definitions and notations; Chain pattern scheduling (unbounded #P – high communication, fixed #P – moderate communication); Performance results; Conclusions; Future work

Performance results
Simulation setup:
- Simulation program written in C++
- The distributed memory system was emulated
- Index spaces range from 10×10 to 1000×1000 iterations
- Dependence vectors: d1 = (1,3), d_c = d2 = (2,2), d3 = (4,1), d4 = (4,3)
- #P ranges from 5 to 8
- Comparison with the cyclic mapping
- The communication reduction achieved ranges from 15% to 35%
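The quantity being compared can be measured as sketched below; this is an illustrative emulation of the communication count, not the authors' simulator. For every iteration point, each dependence predecessor owned by a different processor contributes one unit of communication.

    // Illustrative sketch: communication volume of a given mapping.
    // 'owner[i][j]' is the processor of point (i,j), e.g. produced by one of
    // the mappings sketched above; 'deps' holds the dependence vectors.
    #include <utility>
    #include <vector>

    using Point = std::pair<int,int>;

    long communicationVolume(const std::vector<std::vector<int>>& owner,
                             const std::vector<Point>& deps) {
        long volume = 0;
        const int N1 = static_cast<int>(owner.size());
        const int N2 = owner.empty() ? 0 : static_cast<int>(owner[0].size());
        for (int i = 0; i < N1; ++i)
            for (int j = 0; j < N2; ++j)
                for (const Point& d : deps) {
                    const int pi = i - d.first, pj = j - d.second;  // predecessor along d
                    if (pi >= 0 && pj >= 0 && owner[pi][pj] != owner[i][j])
                        ++volume;                 // the value has to cross processors
                }
        return volume;
    }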

Performance results (figures)

Outline: Introduction; Definitions and notations; Chain pattern scheduling (unbounded #P – high communication, fixed #P – moderate communication); Performance results; Conclusions; Future work

Conclusions
- The total communication cost can be significantly reduced if the communication incurred by certain dependence vectors is eliminated
- The chain pattern mapping outperforms other mapping schemes (e.g. the cyclic mapping) by enhancing data locality

Outline: Introduction; Definitions and notations; Chain pattern scheduling (unbounded #P – high communication, fixed #P – moderate communication); Performance results; Conclusions; Future work

Future work
- Simulate other architectures (such as shared memory systems, SMPs and heterogeneous systems)
- Experiment also with the centralized (i.e. master-slave) version of the chain pattern scheduling scheme

Thank you. Questions?