Chapter 9. Concepts in Parallelisation: An Introduction

Parallel Concepts: An Introduction
Origin Optimisation and Parallelisation Training, March 2000

The Goal of Parallelization

The goal is a reduction in the elapsed time of a program, and with it a reduction in the turnaround time of jobs. Parallelization comes at the price of overhead, a total increase in CPU time, caused by:
- communication
- synchronization
- additional work in the algorithm
- the non-parallel part of the program (one processor works while the others spin idle)

The trade-off between overhead and elapsed time is better expressed in terms of speedup and efficiency.

[Figure: elapsed time, CPU time and communication overhead on 1, 2, 4 and 8 processors; the elapsed time between start and finish shrinks when going from 1 processor to 4 processors.]
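As a minimal sketch of this distinction (the loop is an arbitrary workload chosen only so there is something to time), the wall-clock time can be measured with omp_get_wtime() and the process CPU time, which on most systems includes all threads and therefore the overhead, with clock():

    #include <stdio.h>
    #include <time.h>
    #include <omp.h>

    int main(void) {
        const long n = 200000000L;
        double sum = 0.0;

        double  t0 = omp_get_wtime();   /* wall-clock (elapsed) time start */
        clock_t c0 = clock();           /* CPU-time start (all threads on most systems) */

        #pragma omp parallel for reduction(+:sum)
        for (long i = 1; i <= n; i++)
            sum += 1.0 / (double)i;

        double elapsed = omp_get_wtime() - t0;
        double cpu     = (double)(clock() - c0) / CLOCKS_PER_SEC;

        printf("sum=%.6f  elapsed=%.2fs  cpu=%.2fs  threads=%d\n",
               sum, elapsed, cpu, omp_get_max_threads());
        return 0;
    }

Run with different OMP_NUM_THREADS settings, the elapsed time should fall while the total CPU time stays roughly constant or grows slightly with the overhead.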

Speedup and Efficiency

Both measure the parallelization properties of a program. Let T(p) be the elapsed time on p processors. The speedup S(p) and the efficiency E(p) are defined as:

    S(p) = T(1)/T(p)
    E(p) = S(p)/p

For ideal parallel speedup we get:

    T(p) = T(1)/p
    S(p) = T(1)/T(p) = p
    E(p) = S(p)/p = 1 (or 100%)

Scalable programs remain efficient for large numbers of processors.

[Figure: speedup versus number of processors, showing the ideal (linear) curve and the super-linear, saturation and disaster regimes; efficiency equals 1 along the ideal curve.]
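A minimal sketch of the two definitions in code; the timings below are invented for illustration only:

    #include <stdio.h>

    /* S(p) = T(1)/T(p),  E(p) = S(p)/p */
    static double speedup(double t1, double tp)           { return t1 / tp; }
    static double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

    int main(void) {
        double t1 = 100.0, t4 = 30.0;   /* hypothetical elapsed times in seconds */
        printf("S(4) = %.2f, E(4) = %.0f%%\n",
               speedup(t1, t4), 100.0 * efficiency(t1, t4, 4));
        return 0;
    }

With these numbers the program reaches a speedup of 3.33 on 4 processors, i.e. an efficiency of about 83%.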

Amdahl's Law

Amdahl's law states the following for parallel programs: the non-parallel (serial) fraction s of the program, which includes the communication and synchronization overhead, imposes the upper limit on the scalability of the code. The maximum parallel speedup S(p) for a program with parallel fraction f follows from:

    (1) 1 = s + f                                 ! the program has serial and parallel fractions
    (2) T(1) = T(parallel) + T(serial) = T(1)*(f + s) = T(1)*(f + (1-f))
    (3) T(p) = T(1)*(f/p + (1-f))
    (4) S(p) = T(1)/T(p) = 1/(f/p + (1-f))
    (5) S(p) < 1/(1-f)                            ! limit for p -> infinity
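A small worked sketch, with parallel fractions chosen purely for illustration, makes the bound concrete by tabulating S(p) = 1/(f/p + (1-f)):

    #include <stdio.h>

    /* Amdahl speedup for parallel fraction f on p processors. */
    static double amdahl(double f, int p) { return 1.0 / (f / p + (1.0 - f)); }

    int main(void) {
        double fractions[] = { 0.50, 0.90, 0.99 };   /* illustrative parallel fractions */
        int    procs[]     = { 2, 8, 64, 1024 };
        for (int i = 0; i < 3; i++) {
            printf("f = %.2f:", fractions[i]);
            for (int j = 0; j < 4; j++)
                printf("  S(%d) = %6.2f", procs[j], amdahl(fractions[i], procs[j]));
            printf("   limit 1/(1-f) = %.0f\n", 1.0 / (1.0 - fractions[i]));
        }
        return 0;
    }

Even with 99% of the work parallelized, the speedup can never exceed 100, no matter how many processors are thrown at the problem.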

Amdahl's Law: Time to Solution

    T(p) = T(1)/S(p)
    S(p) = 1/(f/p + (1-f))

[Figure: hypothetical program run time as a function of the number of processors for several parallel fractions f; note the log-log scale.]

The Practicality of Parallel Computing

[Figure: speedup versus percentage of parallel code for P = 2, 4 and 8; the best hand-tuned codes of the 1970s, 1980s and 1990s lie in the ~99% parallel range. After David J. Kuck, High Performance Computing, Oxford University Press, 1996.]

In practice, making programs parallel is not as difficult as it may seem from Amdahl's law, but it is clear that a program has to spend a significant portion (most) of its run time in the parallel region.

Fine-Grained vs Coarse-Grained

Fine-grain parallelism (typically at loop level):
- can be done incrementally, one loop at a time
- does not require deep knowledge of the code
- a lot of loops have to be parallel for decent speedup
- creates potentially many synchronization points (one at the end of each parallel loop)

Coarse-grain parallelism:
- makes larger loops parallel, at a higher level in the call tree, potentially enclosing many small loops
- more code is parallel at once
- fewer synchronization points, reducing overhead
- requires deeper knowledge of the code

[Figure: a call tree rooted at MAIN; coarse-grained parallelism is applied to routines near the top of the tree, fine-grained parallelism to the small loops at the leaves.]
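A hedged OpenMP sketch of the difference (the arrays and loop bodies are placeholders): the fine-grained version opens a new parallel region per loop, each with its own barrier, while the coarse-grained version encloses both loops in a single region so the thread team is created only once.

    #include <stdio.h>
    #define N 1000000

    double a[N], b[N];

    /* Fine-grained: each loop is its own parallel region, with an implicit
       barrier (synchronization point) at the end of every loop. */
    void fine_grained(void) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) a[i] = i * 0.5;
        #pragma omp parallel for
        for (int i = 0; i < N; i++) b[i] = a[i] * a[i];
    }

    /* Coarse-grained: one parallel region encloses both worksharing loops;
       the thread team is created once for the whole region. */
    void coarse_grained(void) {
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < N; i++) a[i] = i * 0.5;
            #pragma omp for
            for (int i = 0; i < N; i++) b[i] = a[i] * a[i];
        }
    }

    int main(void) {
        fine_grained();
        coarse_grained();
        printf("b[N-1] = %f\n", b[N-1]);
        return 0;
    }

In the coarse-grained version the implicit barrier at the end of the first worksharing loop still guarantees that a[] is complete before b[] is computed.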

Other Impediments to Scalability

[Figure: elapsed time of threads p0-p3 between start and finish; the slowest thread determines when the parallel section completes.]

Load imbalance:
- the time to complete a parallel execution of a code segment is determined by the longest-running thread
- an unequal work load distribution leads to some processors being idle while others work too much
- with coarse-grain parallelization, more opportunities for load imbalance exist

Too many synchronization points:
- the compiler will put synchronization points at the start and exit of each parallel region
- if too many small loops have been made parallel, synchronization overhead will compromise scalability
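One common mitigation for load imbalance, sketched here with a deliberately uneven, made-up workload, is dynamic loop scheduling in OpenMP: threads grab new chunks of iterations as they finish instead of receiving fixed blocks up front.

    #include <stdio.h>

    /* Iterations get progressively more expensive, so a static block
       distribution would leave the first threads idle while the last works. */
    static double uneven_work(int i) {
        double s = 0.0;
        for (int k = 0; k < i * 10; k++) s += 1.0 / (k + 1.0);
        return s;
    }

    int main(void) {
        const int n = 10000;
        double total = 0.0;
        /* schedule(dynamic, 32): threads fetch chunks of 32 iterations on
           demand, evening out the load at the cost of some bookkeeping. */
        #pragma omp parallel for schedule(dynamic, 32) reduction(+:total)
        for (int i = 0; i < n; i++)
            total += uneven_work(i);
        printf("total = %f\n", total);
        return 0;
    }

The chunk size (32 here) trades scheduling overhead against balance: very small chunks balance best but cost more bookkeeping.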

Parallel Programming Models

Programming models can be classified along five axes:
- Control flow: the number of explicit threads of execution
- Address space: access to global data from multiple threads
- Communication: whether data transfer is part of the language or of a library
- Synchronization: the mechanism used to regulate access to data
- Data allocation: control of the distribution of data to execution threads

Computing pi with DPL

Notes:
- essentially sequential form
- automatic detection of parallelism
- automatic work sharing
- all variables shared by default
- number of processors specified outside of the code
- compile with: f90 -apo -O3 -mips4 -mplist (the -mplist switch shows the intermediate representation)

    pi = integral from 0 to 1 of 4/(1+x^2) dx
       ~ sum over 0 <= i < N of 4/(N*(1+((i+0.5)/N)^2))

    PROGRAM PIPROG
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000000
      INTEGER            :: I
      REAL (KIND=8)      :: PI, W
      W  = 1.0D0/N
      PI = SUM( (/ (4.0D0*W/(1.0D0+((I+0.5D0)*W)**2), I = 0, N-1) /) )
      PRINT *, PI
    END PROGRAM PIPROG

Computing pi with Shared Memory

Notes:
- essentially sequential form
- automatic work sharing
- all variables shared by default
- directives to request parallel work distribution
- number of processors specified outside of the code

    pi = integral from 0 to 1 of 4/(1+x^2) dx
       ~ sum over 0 <= i < N of 4/(N*(1+((i+0.5)/N)^2))

    #include <stdio.h>
    #define n 1000000

    int main(void)
    {
        double pi, l, ls = 0.0, w = 1.0/n;
        int i;

        #pragma omp parallel private(i, l) reduction(+:ls)
        {
            #pragma omp for
            for (i = 0; i < n; i++) {
                l   = (i + 0.5) * w;
                ls += 4.0 / (1.0 + l*l);
            }
        }                               /* reduction of ls completes here */

        pi = ls * w;
        printf("pi is %f\n", pi);
        return 0;
    }
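For reference (the original course used the SGI MIPSpro compilers; the command below assumes GCC instead, and the file name is illustrative): the code is compiled with OpenMP enabled, e.g. gcc -O2 -fopenmp pi_omp.c -o pi_omp, and the number of threads is chosen at run time through the environment, e.g. OMP_NUM_THREADS=4 ./pi_omp, matching the note that the number of processors is specified outside of the code.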

Computing pi with Message Passing

Notes:
- process (rank) identification comes first
- explicit work sharing
- all variables are private
- explicit data exchange (reduce)
- all code is parallel
- number of processors is specified outside of the code

    pi = integral from 0 to 1 of 4/(1+x^2) dx
       ~ sum over 0 <= i < N of 4/(N*(1+((i+0.5)/N)^2))

    #include <stdio.h>
    #include <mpi.h>
    #define N 1000000

    int main(int argc, char **argv)
    {
        double pi, l, ls = 0.0, w = 1.0/N;
        int i, mid, nth;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mid);
        MPI_Comm_size(MPI_COMM_WORLD, &nth);

        for (i = mid; i < N; i += nth) {
            l   = (i + 0.5) * w;
            ls += 4.0 / (1.0 + l*l);
        }

        MPI_Reduce(&ls, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (mid == 0) printf("pi is %f\n", pi*w);

        MPI_Finalize();
        return 0;
    }
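For reference (assuming a standard MPI installation; the file name is illustrative): the program is built with the wrapper compiler and launched with a process manager, e.g. mpicc pi_mpi.c -o pi_mpi followed by mpirun -np 4 ./pi_mpi; again the number of processes is specified outside of the code, at launch time.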

Computing pi with POSIX Threads
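A minimal sketch of the same computation with POSIX threads, assuming a fixed thread count and a mutex-protected accumulation (all names here are illustrative, not taken from the original slide):

    #include <stdio.h>
    #include <pthread.h>

    #define N        1000000
    #define NTHREADS 4          /* assumed fixed thread count for the sketch */

    static double w  = 1.0 / N;
    static double pi = 0.0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread sums a strided subset of the intervals, then adds its
       partial sum to the global result under a mutex. */
    static void *partial_pi(void *arg) {
        long tid = (long)arg;
        double l, ls = 0.0;
        for (long i = tid; i < N; i += NTHREADS) {
            l   = (i + 0.5) * w;
            ls += 4.0 / (1.0 + l * l);
        }
        pthread_mutex_lock(&lock);
        pi += ls * w;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, partial_pi, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);
        printf("pi is %f\n", pi);
        return 0;
    }

As with MPI, the work sharing is explicit: each thread picks a strided subset of the intervals, and the partial results are combined by hand, here under a mutex rather than with a reduction.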

Comparing Parallel Paradigms

- Automatic parallelization combined with explicit shared-memory programming (compiler directives) is used on machines with global memory: Symmetric Multi-Processors, CC-NUMA, PVP. These methods are collectively known as Shared Memory Programming (SMP).
- The SMP programming model works at loop level and at coarse-level parallelism:
  - coarse-level parallelism has to be specified explicitly
  - loop-level parallelism can be found by the compiler (implicitly)
- Explicit message-passing methods are necessary on machines that have no global memory addressability: clusters of all sorts, NOW and COW.
- Message-passing methods require coarse-level parallelism to be scalable.
- Choosing a programming model is largely a matter of the application, personal preference and the target machine; it has nothing to do with scalability.
- Scalability limitations: communication overhead and process synchronization. Scalability is mainly a function of the hardware and of (your) implementation of the parallelism.

Summary

- The serial part of the code, or its communication overhead, limits the scalability of the code (Amdahl's law).
- Programs have to be >99% parallel to use large (>30 processor) machines efficiently.
- Several programming models are in use today:
  - Shared Memory Programming (SMP), with automatic compiler parallelization, data-parallel and explicit shared-memory models
  - the Message Passing model
- Choosing a programming model is largely a matter of the application, personal choice and target machine; it has nothing to do with scalability.
- Don't confuse the algorithm with its implementation.
- Machines with a global address space can run applications based on both the SMP and the Message Passing programming models.