Conception of parallel algorithms

Conception of parallel algorithms
F. Desprez - UE Parallel alg. and prog. 2016-2017

Some references
- Parallel Programming - For Multicore and Cluster Systems, T. Rauber, G. Rünger
- Decomposing the Potentially Parallel, a one-day course, Elspeth Minty, Robert Davey, Alan Simpson, David Henty, Edinburgh Parallel Computing Centre, The University of Edinburgh

Algorithm Parallelization
- The target architecture and the programming model influence the way a parallel algorithm is designed (and written)
- The parallel algorithm should be designed depending on the characteristics of the sequential algorithm
  - Type of data manipulated (arrays, graphs, data structures, ...)
  - Data accesses (regular, random, ...)
  - Granularity
  - Dependences (control, data)
  - ...
- Goals
  - Obtain the same results as the sequential program!
  - Gain time
  - Reduce storage consumption (memory, disk)

Parallelization Steps
(figure not included in the transcript)

Computation decomposition
- The computations of a sequential program are decomposed into tasks (with dependencies between them)
- Each task is the smallest unit of parallelism
- Various levels of execution
  - Instruction parallelism
  - Data parallelism
  - Functional parallelism
- The decomposition can be static (before execution) or dynamic (at runtime)
- At any time of the execution, the number of "executable" tasks gives an upper bound on the degree of parallelism of the program
- The decomposition into tasks should generate enough parallelism to keep the cores / processors / machines maximally occupied

Computation decomposition, contd.
- Pay attention to granularity!
- The execution time of a task must be large compared with the time it takes to schedule it, place it, start it, acquire its data, and return the computed data
  - Coarse-grained tasks: large tasks with many computations
  - Fine-grained tasks: few computations per task
- A compromise must be found according to the observed performance (computation, data accesses, system overhead, communications, disk accesses, ...); the sketch below illustrates the overhead side of this trade-off
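A toy cost model, a minimal sketch only: the total work W, the per-task overhead o, the worker count P and the assumption that all tasks are of equal size are mine, not from the slides. It shows the overhead side of the compromise: splitting a fixed amount of work into more and more tasks eventually lets the per-task overhead dominate the parallel time.

```c
#include <stdio.h>

int main(void) {
    double W = 1e6;     /* total computation (time units), illustrative value */
    double o = 50.0;    /* fixed scheduling/startup overhead per task          */
    int    P = 8;       /* number of workers                                   */

    int counts[] = {8, 64, 1024, 65536};           /* numbers of tasks tried   */
    for (int k = 0; k < 4; k++) {
        int T = counts[k];
        double per_task = W / T + o;               /* work + overhead per task */
        int rounds = (T + P - 1) / P;              /* equal tasks run in rounds */
        printf("T = %6d tasks  ->  parallel time ~ %.0f\n", T, per_task * rounds);
    }
    return 0;
}
```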

Assignment to processes or threads
- A process or thread represents a flow of control executed by a processor or a core
  - Performs its tasks one after the other
  - Not necessarily the same number as the number of processors (or cores)
- Assign tasks so as to ensure proper balancing (i.e. all processors or cores should have the same volume of computation to run)
- Attention should be paid to memory accesses (for shared memory) and message exchanges (for distributed memory)
- Re-use data if possible
- Also called scheduling
  - If performed during the initialization phase of the program: static scheduling
  - If performed during program execution: dynamic scheduling
- Main goal: equal use of the processors while keeping the amount of communication between processors as small as possible (see the OpenMP sketch below)
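A hedged OpenMP sketch of the static vs. dynamic distinction; the cost function work() and all constants are illustrative assumptions. With schedule(static) the iteration-to-thread assignment is fixed before the loop runs; with schedule(dynamic) idle threads grab chunks at runtime, which balances uneven iterations at the price of some scheduling overhead.

```c
#include <stdio.h>
#include <omp.h>

/* Illustrative workload: the cost of iteration i depends on i, so the
   iterations are deliberately unbalanced. */
static double work(int i) {
    double s = 0.0;
    for (long k = 0; k < 10000L * (i % 16); k++)
        s += k * 1e-9;
    return s;
}

int main(void) {
    double sum = 0.0;

    /* Static scheduling: blocks of consecutive iterations are assigned to
       the threads before the loop starts. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < 1024; i++) sum += work(i);

    /* Dynamic scheduling: idle threads fetch the next chunk of 8 iterations
       at runtime, rebalancing the uneven work. */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:sum)
    for (int i = 0; i < 1024; i++) sum += work(i);

    printf("sum = %f\n", sum);
    return 0;
}
```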

Placing processes on processors
- Generally, each process (or thread) is placed on a separate processor (or core)
- If there are more processes than processors, several processes are placed on the same processor
- Performed by the system and/or program directives
- Goal: balance the load and reduce communications

Scheduling algorithm
- A method for efficiently executing a set of tasks of known durations on a set of specified execution units
  - Satisfying the dependencies between tasks (precedence constraints)
  - Satisfying capacity constraints (the number of execution units is limited)
  - Sequential or parallel tasks
- Goal
  - Set the start time and the target execution unit for each task
  - Minimize the execution time (makespan) = time between the start of the first task and the end of the last task (a greedy sketch follows)
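A minimal greedy list-scheduling sketch, under two assumptions of mine: the tasks are independent (no precedence constraints) and their costs are known estimates. Each task is assigned to the execution unit that becomes free first, and the makespan is the largest completion time over the units.

```c
#include <stdio.h>

#define NTASKS 8
#define NUNITS 3

int main(void) {
    double cost[NTASKS]  = {4, 2, 8, 3, 1, 6, 5, 2};  /* estimated task durations */
    double ready[NUNITS] = {0};                       /* time each unit becomes free */

    for (int t = 0; t < NTASKS; t++) {
        int best = 0;                                 /* unit that is free earliest */
        for (int u = 1; u < NUNITS; u++)
            if (ready[u] < ready[best]) best = u;
        printf("task %d -> unit %d, start time %.1f\n", t, best, ready[best]);
        ready[best] += cost[t];                       /* unit busy until completion */
    }

    double makespan = 0;
    for (int u = 0; u < NUNITS; u++)
        if (ready[u] > makespan) makespan = ready[u];
    printf("makespan = %.1f\n", makespan);
    return 0;
}
```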

Parallelism levels
- The computations executed by a program provide opportunities for parallel execution at different levels
  - Instruction
  - Loop
  - Function
- Different levels of granularity (fine, medium, coarse)
- Elements can be grouped together to increase the granularity
- Different algorithms according to the different granularities

Parallelism at the instruction level
- The instructions of a program can run in parallel if they are independent
- There are several types of dependencies
  - Flow dependency (or true dependency): there is a flow dependency from I1 to I2 if I1 computes a result value, in a register or a variable, that I2 uses as an operand
  - Anti-dependency: there is an anti-dependency between I1 and I2 if I1 uses as an operand a register or a variable that I2 later uses to store a result
  - Output dependency: there is an output dependency between I1 and I2 if I1 and I2 use the same register or the same variable to store the result of a computation
- A small example of each is sketched below
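A small C snippet illustrating the three dependence types; the variables stand in for registers and the instruction sequence is my own example, not taken from the slides.

```c
#include <stdio.h>

int a = 0, b = 1, c = 2, d = 0;

void dependence_example(void) {
    a = b + c;    /* I1: writes a, reads b and c                                   */
    d = a * 2;    /* I2: flow (true) dependence on I1 -- reads the a written by I1 */
    b = d - 1;    /* I3: anti-dependence on I1 (I1 reads b, I3 writes it later),
                     plus a flow dependence on I2 (reads d)                        */
    d = c + 5;    /* I4: output dependence on I2 -- both write d
                     (and an anti-dependence on I3, which reads d before)          */
}

int main(void) {
    dependence_example();
    printf("a=%d b=%d c=%d d=%d\n", a, b, c, d);
    return 0;
}
```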

Dependence Graph
(figure not included in the transcript)

Using instruction parallelism
- Used by superscalar processors
  - Dynamic scheduling of instructions at the hardware level, with automatic extraction of the independent instructions
  - The instructions are then assigned to independent functional units
- On VLIW processors, static scheduling is performed by the compiler, which packs instructions into long instruction words and assigns them to the units
- The input is a sequential program with no explicit parallelism

Data parallelism
- The same operation is performed on different elements of a larger data structure (an array, for example)
- If these operations are independent, they can be carried out in parallel
- The elements of the data structure are distributed in a balanced manner over the processors
- Each processor executes the operation on the elements assigned to it
- Extensions of programming languages (data-parallel languages)
  - A single flow of control
  - Special constructs to express data-parallel operations on arrays (SIMD model)
  - C*, data-parallel C, PC++, DINO, High Performance Fortran, Vienna Fortran
  - Owner-computes rule
- A block-distribution sketch is given below
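A minimal sketch of a balanced block distribution and of the owner-computes rule; the array size, the number of "processors" P and the operation are illustrative assumptions, and the P processors are simulated by an ordinary sequential loop.

```c
#include <stdio.h>

#define N 16
#define P 4

int main(void) {
    double x[N];
    for (int i = 0; i < N; i++) x[i] = i;

    /* Each processor p owns one contiguous block of x and performs the
       operation only on the elements it owns (owner-computes rule).
       The P iterations of the outer loop would run concurrently. */
    for (int p = 0; p < P; p++) {
        int chunk = (N + P - 1) / P;              /* block size */
        int lo = p * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;
        for (int i = lo; i < hi; i++)
            x[i] = 2.0 * x[i] + 1.0;              /* same operation everywhere */
    }

    for (int i = 0; i < N; i++) printf("%g ", x[i]);
    printf("\n");
    return 0;
}
```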

Loop parallelism
- Many algorithms traverse a large data structure (data array, matrix, image, ...)
  - Expressed by a loop in imperative languages
- If there are no dependencies between the different iterations of a loop, we have a parallel loop
- Several kinds of parallel loops
  - FORALL
  - DOPAR
- Possibility of having
  - Multiple statements in a loop
  - Nested loops
- Loop transformations can be applied to expose parallelism
- The sketch below contrasts a loop that cannot be parallelized with one that can
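A short illustration with loop bodies of my own choosing: the first loop carries a dependence between iterations and is therefore not a parallel loop; the second has independent iterations and can be executed as a parallel loop (annotated here with OpenMP).

```c
#include <stdio.h>

void loops(int n, double *a, double *b) {
    /* Not a parallel loop: iteration i reads a[i-1], written by iteration i-1. */
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];

    /* Parallel loop: the iterations touch disjoint elements, so they may run
       in any order or concurrently. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
}

int main(void) {
    double a[8] = {1};                          /* a[0]=1, rest 0 */
    double b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    loops(8, a, b);
    for (int i = 0; i < 8; i++) printf("%g ", b[i]);
    printf("\n");
    return 0;
}
```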

FORALL loop
- Contains one or more assignments to array elements
- If there is only one assignment, it is equivalent to an array assignment
  - The computations on the right-hand side of the assignment are executed in any order
  - Then the results are assigned to the array elements (in any order)
- If there is more than one assignment
  - They are executed one after the other, as array assignments
  - An assignment is completed before the next one starts

DOPAR loop
- Can contain one or more assignments to arrays, other statements, or other loops
- Executed by multiple processors in parallel (in any order)
- The statements of each iteration are executed sequentially in program order, using the values of the variables in the initial state (before the execution of the DOPAR loop)
- The updates of the variables made by one iteration are not visible to the other iterations
- Once all iterations have been executed, their updates are combined and a new global state is computed

Execution Comparison
- In the sequential for loop, the computation of b(i) uses the value of a(i-1) that has been computed in the preceding iteration and the value of a(i+1) valid before the loop.
- The two statements in the forall loop are treated as separate array assignments. Thus, the computation of b(i) uses the new value computed by the first statement for both a(i-1) and a(i+1).
- In the dopar loop, updates made in one iteration are not visible to the other iterations. Since the computation of b(i) does not use the value of a(i) that is computed in the same iteration, the old values are used for a(i-1) and a(i+1).
- The sketch below makes the three semantics concrete on a small example.
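The loop body used here is reconstructed as an assumption from the description above (the original code is not in the transcript): a(i) = a(i) + 1 followed by b(i) = a(i-1) + a(i+1). The three variants emulate the for, forall and dopar semantics sequentially and print three different b arrays.

```c
#include <stdio.h>
#include <string.h>

#define N 6

int main(void) {
    double init[N] = {0, 10, 20, 30, 40, 50};
    double a[N], old[N], b[N] = {0};

    /* Sequential for: b(i) sees a(i-1) already updated by the previous
       iteration and a(i+1) still holding its pre-loop value. */
    memcpy(a, init, sizeof a);
    for (int i = 1; i < N - 1; i++) {
        a[i] = a[i] + 1;
        b[i] = a[i - 1] + a[i + 1];
    }
    printf("for:    ");
    for (int i = 1; i < N - 1; i++) printf("%g ", b[i]);   /* 20 41 61 81 */
    printf("\n");

    /* forall: each statement is a whole-array assignment, so all of a is
       updated before any b(i) is computed. */
    memcpy(a, init, sizeof a);
    for (int i = 1; i < N - 1; i++) a[i] = a[i] + 1;
    for (int i = 1; i < N - 1; i++) b[i] = a[i - 1] + a[i + 1];
    printf("forall: ");
    for (int i = 1; i < N - 1; i++) printf("%g ", b[i]);   /* 21 42 62 81 */
    printf("\n");

    /* dopar: every iteration reads the global state as it was before the
       loop; its updates become visible only after all iterations finish. */
    memcpy(old, init, sizeof old);
    memcpy(a, init, sizeof a);
    for (int i = 1; i < N - 1; i++) {
        a[i] = old[i] + 1;
        b[i] = old[i - 1] + old[i + 1];
    }
    printf("dopar:  ");
    for (int i = 1; i < N - 1; i++) printf("%g ", b[i]);   /* 20 40 60 80 */
    printf("\n");
    return 0;
}
```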

Functional parallelism
- Many parallel programs have independent parts that can be executed in parallel
  - Simple statements, blocks, loops, or function calls
- Task parallelism appears when the different parts of a program are considered as tasks
- Also called functional parallelism or task parallelism
- To use task parallelism, the tasks and their dependencies are represented as a graph whose nodes are the tasks and whose edges are the dependencies between tasks
- Sequential tasks or parallel tasks (mixed parallelism)
- To schedule the tasks, a start time must be assigned to each task so that the dependencies are satisfied (a task cannot start until all the tasks it depends on have completed)
- Minimize the overall execution time (makespan)
- Static and dynamic algorithms
- An OpenMP task sketch is given below
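A hedged OpenMP task sketch; the functions f1..f4 and the diamond-shaped graph f1 -> {f2, f3} -> f4 are illustrative assumptions. The depend clauses encode the edges of the task graph, so f2 and f3 may run in parallel once f1 has completed, and f4 starts only after both.

```c
#include <stdio.h>
#include <omp.h>

double x, y, z, r;

double f1(void)               { return 1.0; }
double f2(double a)           { return a + 2.0; }
double f3(double a)           { return a * 3.0; }
double f4(double a, double b) { return a + b; }

int main(void) {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)
        x = f1();
        #pragma omp task depend(in: x) depend(out: y)
        y = f2(x);                      /* may run in parallel with f3 */
        #pragma omp task depend(in: x) depend(out: z)
        z = f3(x);
        #pragma omp task depend(in: y, z)
        r = f4(y, z);
        #pragma omp taskwait            /* wait for the whole task graph */
    }
    printf("r = %g\n", r);              /* (1 + 2) + (1 * 3) = 6 */
    return 0;
}
```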

Functional parallelism
- Limitations
  - Start-up costs: at the beginning (and at the end of a pipeline) there are few active tasks
  - Load balancing: if some tasks have much higher costs than others, it is difficult to balance the work between processors (consider re-splitting tasks if possible)
  - Scalability: limitation on the number of processors that can be used

Functional parallelism
- Static scheduling algorithm
  - Determines the assignment of tasks deterministically, at the start of the program or at compile time
  - Based on an estimate of the execution time of the tasks (measurements or models)
- Dynamic scheduling algorithm
  - Determines the assignment of tasks at runtime
  - Allows the placement to adapt to runtime conditions
  - Uses a task pool that contains the tasks that are ready to run
  - When a task completes, all the tasks that depend on it (and whose dependencies are now satisfied) are put into the pool
  - Often used on shared-memory machines (pool in main memory)
- Languages that allow functional parallelism
  - TwoL, PCN, P3L, Kaapi, OpenMP
- Used in multi-threaded systems, workflow systems
- A minimal task-pool sketch follows
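A minimal shared task-pool sketch with POSIX threads; the assumption here, not made by the slides, is that the tasks are independent, so the pool reduces to a counter protected by a mutex and no dependence tracking is needed. Each worker repeatedly grabs the next ready task until the pool is empty, which is the essence of dynamic scheduling.

```c
#include <stdio.h>
#include <pthread.h>

#define NTASKS   32
#define NWORKERS 4

static int next_task = 0;                    /* next ready task in the pool */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void run_task(int id, int wid) {      /* placeholder task body */
    printf("task %d executed by worker %d\n", id, wid);
}

static void *worker(void *arg) {
    int wid = *(int *)arg;
    for (;;) {
        pthread_mutex_lock(&pool_lock);      /* take the next ready task, if any */
        int id = (next_task < NTASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&pool_lock);
        if (id < 0) break;                   /* pool empty: worker terminates */
        run_task(id, wid);                   /* placement decided at runtime */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NWORKERS];
    int wid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++) {
        wid[i] = i;
        pthread_create(&t[i], NULL, worker, &wid[i]);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```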

Task farming
- Source
  - Divides the initial work into tasks, distributes them among the workers, and ensures load balancing
- Worker
  - Receives a task from the source, performs the job, and passes the result on
- Sink
  - Receives the completed tasks from the workers and collects the partial results
- An MPI-style sketch is given below
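A hedged MPI sketch of the farm; the assumptions are mine, not from the slides: rank 0 plays both source and sink, a task is an integer i and its result is i*i. Workers are served on demand (a new task is sent as soon as a result comes back), which gives the load balancing mentioned above.

```c
#include <stdio.h>
#include <mpi.h>

#define NTASKS   20
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* source + sink */
        int sent = 0, done = 0, res;
        MPI_Status st;
        for (int w = 1; w < size; w++) {           /* prime every worker */
            if (sent < NTASKS) { MPI_Send(&sent, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD); sent++; }
            else               { MPI_Send(&sent, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD); }
        }
        while (done < sent) {                      /* collect results, hand out more work */
            MPI_Recv(&res, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            done++;
            if (sent < NTASKS) { MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD); sent++; }
            else               { MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD); }
        }
        printf("sink collected %d results\n", done);
    } else {                                       /* worker */
        int task, res;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            res = task * task;                     /* the real job goes here */
            MPI_Send(&res, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```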

Task farming, contd.
- Limitations of the task-farming model
  - Large number of communications
  - Limited to certain types of applications
  - Scheduling management is difficult if the task execution times are not known

Conclusion
- Not one single solution but several solutions
- Combination of models
  - Task parallelism and data parallelism (mixed parallelism)
- Implicit or explicit parallelism
- Automatic or semi-automatic discovery of parallelism
- Scheduling and task load-balancing algorithms
- Importance of execution models!