Parallel Programming pt.1

Parallel Programming pt.1 DaMiri Young, UNT High Performance Computing, October 5, 2017

Overview
Goal: develop a working knowledge of parallel programming concepts.
Outline: Concepts, Models, Examples, Challenges, Q&A/Conclusion

What is Computing?
Computing, or scientific computing, is the use of advanced computing capabilities to understand and solve complex problems. At its core, it is the development of models and simulations to understand systems in a tractable way.
Common computing paradigms: serial and parallel.

What is Serial Computing?
Most research software is written for serial computation:
The problem is decomposed into a discrete set of instructions.
Instructions are then executed sequentially, one after another, on a single processor.
Only one instruction is in execution at any moment in time.

What is Parallel Computing?
Parallel computing is the concurrent use of multiple compute resources to solve a computational problem:
The problem is decomposed into discrete units that can be solved concurrently.
Each unit is then further reduced to a set of instructions.
Instructions are executed simultaneously on multiple compute units/processors.
An overall control/coordination mechanism is employed.
Caveats:
The problem must be decomposable into discrete parts that can be solved concurrently.
It must be possible to execute multiple program instructions at any moment in time.
The problem must be solvable in less time with multiple compute resources than with a single resource.

Parallel Computing…
Consider a robot moving blocks from location A to location B.
It takes x minutes per block and there are 25 blocks, so the total time required is 25x.
How do we speed up moving the 25 blocks from A to B?

Parallel Computing…
Get more robots moving blocks from location A to location B.
For example, 5 robots give a 5x speedup, so the total time drops from 25x to 5x.

Why Use Parallel Computing?
Multiple processors working in parallel generally solve problems faster than a single processor working in serial!
A contrived example: suppose it takes 30 minutes to multiply n x n matrices on 1 processor, and the code is parallelized to run on 16 processors (cores). Naively we might expect it to take about 30/16 ≈ 1.9 minutes.

Why Use Parallel Computing?
More efficient use of the underlying hardware.
Provide concurrency.
Save time and money.
Solve larger, more complex problems.
Make use of remote resources.

Parallel Computing Architecture…
Closely follows the von Neumann architecture, a.k.a. the stored-program computer, with 4 main components:
Memory
Control Unit
Arithmetic Unit
Input/Output

Parallel Computing Architecture…
Random Access Memory (RAM) – used to store both program instructions and data. Program instructions are coded data that instruct the computer; data is information to be used by the program.
Control Unit – fetches instructions and data from memory, decodes the instructions, and sequentially coordinates operations.
Arithmetic Unit – performs basic arithmetic operations.
Input/Output – the basic interface to the human operator.

Parallel Computing Architecture…
Flynn's Taxonomy – classifies multi-processor computer architectures along two independent dimensions, the instruction stream and the data stream. Each dimension can have only one of two possible states: single or multiple.
Single Instruction, Single Data (SISD)
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Single Data (MISD)
Multiple Instruction, Multiple Data (MIMD)

Parallel Programming Models
Parallel programming models are an abstraction layer above the hardware and memory architecture. Common models:
Shared Memory (without threads)
Threads
Distributed Memory / Message Passing
Data Parallel
Hybrid
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)

Parallel Programming Models… Shared Memory (without threads)
This is perhaps the simplest parallel programming model: processes/tasks share a common address space, which they read from and write to asynchronously.
Mechanisms such as locks and semaphores are used to control access to the shared memory, resolve contention, and prevent race conditions and deadlocks.
An important performance disadvantage is that it becomes more difficult to understand and manage data locality: keeping data local to the process that works on it conserves the memory accesses, cache refreshes, and bus traffic that occur when multiple processes use the same data. Unfortunately, controlling data locality is hard to understand and may be beyond the control of the average user.

Parallel Programming Models… Shared Memory (without threads) Implementation(s):
On stand-alone shared memory machines, native operating systems, compilers, and/or hardware provide support for shared memory programming. For example, the POSIX standard provides an API for shared memory, and UNIX provides shared memory segments (shmget, shmat, shmctl, etc.).
On distributed memory machines, memory is physically distributed across a network of machines but made global through specialized hardware and software. A variety of SHMEM implementations are available: http://en.wikipedia.org/wiki/SHMEM.
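
A minimal sketch (not from the slides) of this model using the System V shared memory calls named above (shmget, shmat, shmctl): a parent process writes a value into a shared segment and a forked child reads it. The sleep-based synchronization and the value 42 are illustrative only.

/* Hedged sketch: two processes sharing a System V shared memory segment.
 * Error handling is kept minimal to stay short. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Create a private shared memory segment big enough for one int. */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    /* Attach the segment into this process's address space. */
    int *shared = (int *)shmat(shmid, NULL, 0);

    pid_t pid = fork();
    if (pid == 0) {              /* child: reads what the parent wrote */
        sleep(1);                /* crude synchronization, fine for a sketch */
        printf("child sees %d\n", *shared);
        shmdt(shared);
        return 0;
    }

    *shared = 42;                /* parent: writes into the shared segment */
    waitpid(pid, NULL, 0);
    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL);   /* remove the segment */
    return 0;
}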

Parallel Programming Models… Threads
This programming model is a type of shared memory programming. In the threaded model, a single "heavyweight" process can have multiple "lightweight", concurrent execution paths. The programmer is responsible for determining the parallelism (although compilers can sometimes help).

Parallel Programming Models… Thread Implementation(s)
POSIX Threads (Pthreads): C language only; part of Unix/Linux operating systems; library based. Very explicit parallelism; requires significant programmer attention to detail.
OpenMP: compiler-directive based; portable/multi-platform, including Unix and Windows; available in C/C++ and Fortran implementations. Can be very easy and simple to use; provides for "incremental parallelism", so you can begin with serial code.
Others: Microsoft threads; Java and Python threads; CUDA threads for GPUs.
More info: POSIX Threads tutorial: computing.llnl.gov/tutorials/pthreads; OpenMP tutorial: computing.llnl.gov/tutorials/openMP
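
A minimal sketch (not from the slides) of the threaded model using OpenMP in C. The array size and the per-element work are illustrative; compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.

/* Hedged OpenMP sketch: the loop iterations are independent, so a single
 * directive is enough to divide them among the available threads. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];

    /* Distribute the loop iterations across threads sharing the array. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;          /* independent work per element */
    }

    printf("ran with up to %d threads, a[10] = %f\n",
           omp_get_max_threads(), a[10]);
    return 0;
}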

Parallel Programming Models… Distributed Memory / Message Passing
A set of tasks use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
Tasks exchange data by sending and receiving messages. Data transfer usually requires cooperative operations to be performed by each process; for example, a send operation must have a matching receive operation.

Parallel Programming Models… Distributed Memory / Message Passing Implementation(s):
Message Passing Interface (MPI): MPI-1 was released in 1994, MPI-2 in 1997, and MPI-3 in 2012.
More info: all MPI specifications are available on the web at http://www.mpi-forum.org/docs/. MPI tutorial: computing.llnl.gov/tutorials/mpi
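
A minimal sketch (not from the slides) of the message-passing model in C with MPI: rank 0 sends an integer to rank 1 through a matched send/receive pair. Assuming an MPI installation is available, build with mpicc and run with mpirun -np 2.

/* Hedged MPI sketch: cooperative point-to-point communication. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        /* The send must be matched by a receive on rank 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}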

Parallel Programming Mindset
Understand BOTH the problem and the program – analyze the problem you are trying to solve, including any existing code. Be the program.
Investigate other algorithms if possible. This may be the single most important consideration when designing a parallel application.
Determine whether or not the problem can be parallelized, and at what cost.
Example: calculate the potential energy for each of several thousand independent conformations of a molecule. This can be parallelized: each molecular conformation is independently determinable.
Example: the Fibonacci series F(n) = F(n-1) + F(n-2) = (0, 1, 1, 2, 3, 5, 8, 13, 21, …). This is not easy to parallelize: the value of F(n) depends on both F(n-1) and F(n-2), which must be computed first. This is a data dependence issue. (A sketch contrasting the two cases follows.)
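
A small hedged sketch in C contrasting the two cases above; energy_of() is a made-up stand-in for the real per-conformation computation, and the OpenMP pragma is only one possible way to exploit the independence.

/* Hedged sketch: independent iterations vs. a data-dependent recurrence. */
#include <stdio.h>

#define N 4000

static double energy_of(int conformation) {
    return 0.5 * conformation;          /* placeholder workload */
}

int main(void) {
    static double energy[N];

    /* Parallelizable: every iteration is independent of the others. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        energy[i] = energy_of(i);
    }

    /* Not parallelizable as written: each term needs the previous two. */
    long fib[50];
    fib[0] = 0; fib[1] = 1;
    for (int i = 2; i < 50; i++) {
        fib[i] = fib[i - 1] + fib[i - 2];   /* data dependence */
    }

    printf("energy[100] = %f, fib[20] = %ld\n", energy[100], fib[20]);
    return 0;
}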

Parallel Programming Mindset
Identify the program's hotspots (be the program): know where most of the real work is being done. Most scientific and technical programs accomplish most of their work in a few places. Focus on parallelizing the hotspots and ignore the sections that account for little CPU usage. Profilers and performance analysis tools can help.
Identify bottlenecks in the program: are there areas that are disproportionately slow, or that cause parallelizable work to halt or be deferred? For example, I/O usually slows a program down. It may be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessarily slow areas.
Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.
Use optimized third-party parallel software and highly optimized math libraries available from leading vendors.

Parallel Programming Mindset
Partitioning – breaking the problem into discrete "chunks" of work that can be distributed to multiple tasks; also known as decomposition. Two common types (a sketch of domain decomposition follows):
Domain decomposition: the data associated with the problem is decomposed, and each parallel task works on a portion of the data.
Functional decomposition: the focus is on the computation to be performed rather than on the data it manipulates. The problem is decomposed according to the work that must be done, and each task performs a portion of the overall work.
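
A hedged sketch (not from the slides) of domain decomposition with MPI in C: a range of N values is split into contiguous chunks, each rank sums its own chunk, and the partial sums are combined with a reduction. N and the per-element work are illustrative.

/* Hedged domain-decomposition sketch: each rank owns one slice of the data. */
#include <stdio.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank works on its own contiguous chunk of indices [lo, hi). */
    int chunk = N / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? N : lo + chunk;   /* last rank takes the remainder */

    double local_sum = 0.0;
    for (int i = lo; i < hi; i++) {
        local_sum += (double)i;        /* placeholder per-element work */
    }

    /* Combine the partial results on rank 0. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("sum of 0..%d = %.0f\n", N - 1, global_sum);
    }

    MPI_Finalize();
    return 0;
}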

Parallel Programming Examples
Ex1 – Serial/Parallel (non-shared memory)
Ex2 – Threaded (shared memory)
Ex3 – Distributed

Parallel Programming Challenges
Cost – programmer time: coding, design, debugging, tuning, etc.
Efficiency – performance!
Hidden serialization – intrinsic/internal serializations, I/O hotspots
Parallel overhead – communication, latency
Load balancing – even distribution of the workload
Synchronization issues – deadlocks

Parallel Programming Challenges
Speedup – the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on NProc processors, where NProc is the number of parallel processes:
Speedup(NProc) = time used by best serial program / time used by parallel program

Parallel Programming Challenges
Efficiency – the ratio of speedup to the number of processors; it measures the fraction of time for which a processor is usefully utilized:
Efficiency(NProc) = Speedup(NProc) / NProc
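
A tiny hedged example in C computing speedup and efficiency from measured runtimes; the timings and processor count are made up for illustration.

/* Hedged sketch: speedup and efficiency from two measured runtimes. */
#include <stdio.h>

int main(void) {
    double t_serial   = 30.0;   /* minutes, best serial program (illustrative) */
    double t_parallel = 2.5;    /* minutes, parallel program on nproc processors */
    int    nproc      = 16;

    double speedup    = t_serial / t_parallel;   /* 12x here */
    double efficiency = speedup / nproc;         /* 0.75, i.e. 75% */

    printf("speedup = %.1fx, efficiency = %.0f%%\n", speedup, efficiency * 100.0);
    return 0;
}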

Parallel Programming Challenges
Amdahl's Law – states that potential program speedup is defined by the fraction of code (P) that can be parallelized: Speedup = 1 / (1 - P).
If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup).
If all of the code is parallelized, P = 1 and the speedup is infinite (in theory).
If 50% of the code can be parallelized, the maximum speedup = 2, meaning the code will run at most twice as fast.
Introducing the number of processors performing the parallel fraction of work, the relationship becomes:
Speedup = 1 / (P/N + S)
where P = parallel fraction, N = number of processors, and S = serial fraction (S = 1 - P).
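
A short hedged sketch in C evaluating Amdahl's Law for a few processor counts; it also illustrates the closing quote below: with P = 0.95 the speedup can never exceed 1 / (1 - 0.95) = 20x, no matter how many processors are used.

/* Hedged sketch: Amdahl's Law, speedup = 1 / (P/N + S) with S = 1 - P. */
#include <stdio.h>

static double amdahl(double p, int n) {
    double s = 1.0 - p;                 /* serial fraction */
    return 1.0 / (p / n + s);
}

int main(void) {
    double p = 0.95;                    /* 95% of the code is parallel */
    int procs[] = {16, 256, 4096, 65536};

    for (int i = 0; i < 4; i++) {
        printf("N = %6d  ->  speedup = %.2fx\n", procs[i], amdahl(p, procs[i]));
    }
    /* As N grows, the speedup approaches 1 / (1 - P) = 20x and no more. */
    return 0;
}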

Q & A ?

… "You can spend a lifetime getting 95% of your code to be parallel, and never achieve better than 20x speedup no matter how many processors you throw at it!" – Unknown