Parallel Programming Styles and Hybrids (Parallel Computing 2)


Objectives
- Discuss major classes of parallel programming models
- Hybrid programming
- Examples

Introduction
- Parallel computer architectures have evolved, and so have the programming styles needed to use these architectures effectively
- Two styles have become de facto standards: the MPI library for message passing and OpenMP compiler directives for multithreading
- Both are widely used and available on virtually every parallel system
- On current parallel systems it often makes sense to mix multithreading and message passing to maximize performance

HPC Architectures
- Classified by memory distribution
- Shared – all processors have equal access to one or more banks of memory
  - Cray Y-MP, SGI Challenge, dual- and quad-processor workstations
- Distributed – each processor has its own memory, which may or may not be visible to other processors
  - IBM SP2 and clusters of uniprocessor machines

- Distributed shared memory – NUMA (non-uniform memory access)
  - SGI Origin 3000, HP Superdome
- Clusters of SMPs (shared-memory systems)
  - IBM SP, Beowulf clusters

Parallel Programming Styles
- Explicit threading
  - Not commonly used on distributed systems
  - Uses locks, semaphores, and mutexes
  - Synchronization and parallelization handled by the programmer
  - POSIX threads (pthreads library)

Message passing interface (MPI)
- An application consists of several processes
- Processes communicate by passing data to one another (send/receive, broadcast/gather)
- Synchronization is still required of the programmer; however, locking is not, since nothing is shared
- A common approach is domain decomposition, where each task is assigned a subdomain and communicates its edge values to neighbouring subdomains
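As an illustration of the edge-value exchange just described, a minimal sketch of a 1-D decomposition with ghost cells might look like the following; the size N, the array layout, and the single-value halos are assumptions for illustration, not taken from the slides.

    /* Sketch: 1-D domain decomposition with ghost cells and halo exchange.
       N and the array layout are illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1000                 /* local elements owned by each process */

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* local array with two extra "ghost" entries for neighbour edge values */
        double *u = calloc(N + 2, sizeof(double));
        int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* exchange edge values with neighbouring subdomains */
        MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, up,   0,
                     &u[N+1], 1, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[N],   1, MPI_DOUBLE, down, 1,
                     &u[0],   1, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... compute on u[1..N] using the received ghost values ... */

        free(u);
        MPI_Finalize();
        return 0;
    }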

Compiler directives (OpenMP)
- Special comments (directives) are added to parallelizable regions of a serial program
- Requires a compiler that understands these directives
- Locking and synchronization are handled by the compiler unless overridden by directives (implicit and explicit)
- Decomposition is done primarily by the programmer
- Scalability is more limited than that of MPI applications, due to the lesser amount of control the programmer has over how the code is parallelized
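As a minimal sketch of what such a directive looks like in C (the loop, array names, and size here are illustrative, not from the slides):

    /* The pragma is the "special comment"; the compiler generates the
       multithreaded code. Requires OpenMP support (e.g. -fopenmp). */
    #define N 1000000

    double a[N], b[N];

    void scale(double factor)
    {
        #pragma omp parallel for   /* iterations are split across threads */
        for (int i = 0; i < N; i++)
            a[i] = factor * b[i];
    }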

Hybrid
- A mixture of MPI and OpenMP
- Used on distributed shared-memory systems
- Applications usually consist of computationally expensive loops punctuated by calls to MPI; in many cases these loops can be further parallelized by adding OpenMP directives
- Not a solution for all parallel programs, but quite suitable for certain algorithms

Why Hybrid?
- Performance considerations
- Scalability: for a fixed problem size, hybrid code will scale to higher processor counts before being overwhelmed by communication overhead
  - A good example is the Laplace equation
- May not be effective where performance is limited by the speed of the interconnect rather than the processors

Computer architecture
- Some architectural limitations force the use of hybrid computing (e.g. limits on the number of MPI processes per node or cluster block)
- Some algorithms, notably FFTs, run better on machines where the local bandwidth is much greater than that of the network, due to the O(N²) growth of the required bandwidth. With a hybrid approach the number of MPI processes can be lowered while retaining the same number of processors

Algorithms
- Some algorithms, such as those in computational fluid dynamics, benefit greatly from a hybrid approach. The solution space is separated into interconnected zones: the interaction between zones is handled by MPI, while the fine-grained computations inside a zone are handled by OpenMP

Considerations on MPI, OpenMP, and Hybrid Styles
- General considerations
- Amdahl's law: the speedup from parallelization is limited by the portion of the code that cannot be parallelized. In a hybrid program, if the fraction of each MPI process that is parallelized by OpenMP is not high, the overall speedup is likewise limited
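In the usual statement of the law (not spelled out on the slide), if p is the fraction of the work that can be parallelized and N is the number of processors (or, per MPI process, the number of OpenMP threads), then

    S(N) = 1 / ((1 - p) + p/N),   which approaches 1/(1 - p) as N grows

so even a modest serial or un-threaded fraction caps the achievable speedup.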

Communication patterns
- How do the program's communication needs match the underlying hardware?
- A hybrid approach might increase performance where pure MPI code leads to rapid growth in communication traffic

Machine balance
- How do the memory, CPU, and interconnect affect the performance of a program?
- If the processors are fast, communication might become the bottleneck; or the primary cache may be accessed differently across the cluster (e.g. older machines in a Beowulf cluster)

Memory access patterns
- Cache memory (primary, secondary, tertiary) has to be used effectively in order to achieve good performance on clusters

Advantages and Disadvantages of OpenMP
Advantages
- Comparatively easy to implement; in particular, it is easy to retrofit an existing serial code for parallel execution
- The same source code can be used for both parallel and serial versions
- More natural for shared-memory architectures
- Dynamic scheduling (load balancing is easier than with MPI)
- Useful for both fine- and coarse-grained problems

Disadvantages
- Can only run on shared-memory systems, which limits the number of processors that can be used
- Data placement and locality may become serious issues, especially on SGI NUMA architectures where the cost of remote memory access may be high
- Thread creation overhead can be significant unless enough work is performed in each parallel loop
- Implementing coarse-grained solutions in OpenMP is usually about as involved as constructing the analogous MPI application
- Explicit synchronization is required

General characteristics
- Most effective for problems with fine-grained parallelism (i.e. loop-level)
- Can also be used for coarse-grained parallelism
- Overall intra-node memory bandwidth may limit the number of processors that can effectively be used
- Each thread sees the same global memory, but has its own private memory
- Implicit messaging
- High level of abstraction (higher than MPI)

Advantages and Disadvantages of MPI
Advantages
- Any parallel algorithm can be expressed in terms of the MPI paradigm
- Runs on both distributed- and shared-memory systems; performance is generally good in either environment
- Allows explicit control over communication, leading to high efficiency by overlapping communication and computation
- Allows for static task handling
- Data placement problems are rarely observed
- For suitable problems MPI scales well to very large numbers of processors
- MPI is portable, and current implementations are efficient and optimized

Disadvantages
- Application development is difficult: retrofitting existing serial code with MPI is often a major undertaking, requiring extensive restructuring of the serial code
- Less useful for fine-grained problems, where communication costs may dominate
- For all-to-all operations, the effective number of point-to-point interactions increases as the square of the number of processors, resulting in rapidly increasing communication costs
- Dynamic load balancing is difficult to implement
- Variations exist in different manufacturers' implementations of the MPI library; some may not implement all the calls, while others offer extensions

General characteristics
- MPI is most effective for problems with coarse-grained parallelism, for which the problem decomposes into quasi-independent pieces and communication needs are minimized

The Best of Both Worlds
Use hybrid programming when
- The code exhibits limited scaling with MPI
- The code could make use of dynamic load balancing
- The code exhibits fine-grained parallelism, or a combination of both fine-grained and coarse-grained parallelism
- The application makes use of replicated data

Problems When Mixing Modes
- Environment variables may not be passed correctly to the remote MPI processes. This has negative implications for hybrid jobs, because each MPI process needs to read the MP_SET_NUMTHREADS environment variable in order to start the proper number of OpenMP threads. It can be worked around by always setting the number of OpenMP threads within the code (a sketch follows)
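A minimal sketch of setting the thread count in the code rather than relying on a forwarded environment variable (the count of 4 is only an example):

    /* Set the OpenMP thread count programmatically so each MPI process
       starts the intended number of threads even if the environment
       variable is not forwarded. The value 4 is illustrative. */
    #include <omp.h>

    void configure_threads(void)
    {
        omp_set_num_threads(4);   /* overrides OMP_NUM_THREADS / MP_SET_NUMTHREADS */
    }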

Calling MPI communication functions within OpenMP parallel regions
- Hybrid programming works by having OpenMP threads spawned from MPI processes; it does not work the other way around, and attempting it will result in a runtime error
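If MPI calls do need to coexist with OpenMP threads, one common pattern (an illustration, not part of the original course code) is to request a thread-support level at initialization and keep all MPI calls on the master thread:

    /* Sketch: request MPI thread support and keep MPI calls on the master
       thread only. Illustrative pattern, not from the slides. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided;
        /* MPI_THREAD_FUNNELED: only the thread that called MPI_Init_thread
           (the master) will make MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED)
            fprintf(stderr, "Warning: requested thread support not available\n");

        /* ... OpenMP parallel regions go here; MPI calls stay outside them
           or on the master thread only ... */

        MPI_Finalize();
        return 0;
    }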

Laplace Example
Outline
- Serial
- MPI
- OpenMP
- Hybrid

Outline
- The Laplace equation in two dimensions is also known as the potential equation and is usually one of the first PDEs (partial differential equations) encountered:

    c²(∂²u/∂x² + ∂²u/∂y²) = 0

- It is the governing equation for electrostatics, heat diffusion, and fluid flow. By adding a function f(x,y) we get Poisson's equation; by adding a first derivative in time we get the diffusion equation; and by adding a second derivative in time we get the wave equation
- A numerical solution to this PDE can be computed using a finite-difference approach

Using an iterative method to solve the equation, we get the following update:

    du^(n+1)_(i,j) = ( u^n_(i-1,j) + u^n_(i+1,j) + u^n_(i,j-1) + u^n_(i,j+1) ) / 4  -  u^n_(i,j)
    u^(n+1)_(i,j) = u^n_(i,j) + du^(n+1)_(i,j)

Note: n represents the iteration number, not an exponent

Serial
- A cache-friendly approach incrementally computes the du values, compares each with the current maximum, and then updates all u values
- This can usually be done without any additional memory operations – good for clusters
- See code (a straightforward sketch of the sweep follows)
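The referenced code is not included in the transcript; as a stand-in, a straightforward serial sweep implementing the update above might look like the following (grid size, array names, and the convergence test are assumptions, and this plain two-array version is not the cache-optimized variant the slide mentions):

    /* Minimal serial sketch of one sweep of the 2-D Laplace update.
       NX, NY, u, and unew are illustrative names only. */
    #include <math.h>

    #define NX 512
    #define NY 512

    double u[NX][NY], unew[NX][NY];

    double sweep(void)
    {
        double dumax = 0.0;
        for (int i = 1; i < NX - 1; i++) {
            for (int j = 1; j < NY - 1; j++) {
                /* du = average of the four neighbours minus the current value */
                double du = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.0
                            - u[i][j];
                if (fabs(du) > dumax) dumax = fabs(du);
                unew[i][j] = u[i][j] + du;
            }
        }
        /* copy the updated interior back into u for the next iteration */
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                u[i][j] = unew[i][j];
        return dumax;   /* caller iterates until dumax falls below a tolerance */
    }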

MPI
- Currently the most widely used method for distributed-memory systems
- Note that processes are not the same as processors: an MPI process can be thought of as a thread of execution, and multiple processes can run on a single processor. The system is responsible for mapping MPI processes to physical processors
- Each process is an exact copy of the program, with the exception that each copy has its own unique id (rank)

Hello World

Fortran:

    PROGRAM hello
    INCLUDE 'mpif.h'
    INTEGER ierror, rank, size
    CALL MPI_INIT(ierror)
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
    IF (rank .EQ. 2) PRINT *, 'P:', rank, ' Hello World'
    PRINT *, 'I have rank ', rank, ' out of ', size
    CALL MPI_FINALIZE(ierror)
    END

C:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 2)
            printf("P:%d Hello World\n", rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("I am %d out of %d.\n", rank, size);
        MPI_Finalize();
        return 0;
    }

OpenMP
- OpenMP is a tool for writing multithreaded applications in a shared-memory environment. It consists of a set of compiler directives and library routines; the compiler generates multithreaded code based on the specified directives
- OpenMP is essentially a standardization of the last 18 years or so of SMP (Symmetric Multi-Processor) development and practice
- See code (a sketch of the parallelized sweep follows)
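The referenced code is likewise not in the transcript; as an illustration, the serial sweep sketched earlier could be parallelized with a single directive and a reduction on the convergence measure (names and sizes are again assumptions; the max reduction requires OpenMP 3.1 or later):

    /* Sketch of the OpenMP version of the Laplace sweep: one directive on the
       main loop, with a max reduction for the convergence measure. */
    #include <math.h>
    #include <omp.h>

    #define NX 512
    #define NY 512

    double u[NX][NY], unew[NX][NY];

    double sweep_omp(void)
    {
        double dumax = 0.0;
        #pragma omp parallel for reduction(max:dumax)
        for (int i = 1; i < NX - 1; i++) {
            for (int j = 1; j < NY - 1; j++) {
                double du = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.0
                            - u[i][j];
                if (fabs(du) > dumax) dumax = fabs(du);
                unew[i][j] = u[i][j] + du;
            }
        }
        #pragma omp parallel for
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                u[i][j] = unew[i][j];
        return dumax;
    }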

Hybrid
- Remember you are running both MPI and OpenMP, so compile and link with both, e.g. "f90 -O3 -mp file.f90 -lmpi"
- See code (a sketch of the combined structure follows)
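As a closing illustration (not the original course code), the overall hybrid structure typically looks like this: MPI handles the decomposition and halo exchange between processes, while OpenMP parallelizes the sweep within each process. halo_exchange() and sweep_omp() stand for the routines sketched earlier; the tolerance and iteration limit are illustrative.

    /* Sketch of the hybrid structure: MPI between processes, OpenMP within. */
    #include <mpi.h>

    double sweep_omp(void);      /* OpenMP-parallel local sweep (earlier sketch) */
    void   halo_exchange(void);  /* MPI edge-value exchange (earlier sketch)     */

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        double dumax = 1.0;
        for (int iter = 0; iter < 10000 && dumax > 1e-6; iter++) {
            halo_exchange();                    /* MPI: refresh ghost cells      */
            double local_max = sweep_omp();     /* OpenMP: threaded local update */
            /* combine each process's convergence measure into a global one */
            MPI_Allreduce(&local_max, &dumax, 1, MPI_DOUBLE, MPI_MAX,
                          MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }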