Rechen- und Kommunikationszentrum (RZ)
Parallelization at a Glance
Christian Terboven
20.11.2012 / Aachen, Germany
As of: 19.11.2012, Version 2.3


Slide 2: Agenda
- Basic Concepts
- Parallelization Strategies
- Possible Issues
- Efficient Parallelization

Slide 3: Basic Concepts

Slide 4: Processes and Threads
- Process: an instance of a computer program
- Thread: a sequence of instructions (with its own program counter)
- Parallel / concurrent execution:
  - multiple processes
  - multiple threads
- Per process: address space, files, environment, …
- Per thread: program counter, stack, …
[Figure: a process running on a computer, containing threads, each with its own program counter]
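A minimal sketch (not from the slides, assuming C with POSIX threads) of these concepts: the global variable lives in the address space shared by all threads of the process, while each thread has its own stack and program counter.

    /* compile e.g. with: gcc -std=c99 -pthread threads.c */
    #include <pthread.h>
    #include <stdio.h>

    int shared_value = 0;                  /* per process: one copy, visible to all threads */

    void *worker(void *arg)
    {
        int id = *(int *)arg;              /* per thread: lives on this thread's own stack  */
        printf("thread %d sees shared_value = %d\n", id, shared_value);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;
        shared_value = 42;
        pthread_create(&t1, NULL, worker, &id1);
        pthread_create(&t2, NULL, worker, &id2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }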

Slide 5: Process and Thread Scheduling by the OS
- Even on a multi-socket multi-core system you should not make any assumptions about which process / thread is executed when and where!
- Two threads on one core: the OS switches between them over time.
- Two threads on two cores: the threads may migrate between cores, unless they are "pinned" to specific cores.
[Figure: scheduling of Thread 1, Thread 2 and a system thread; thread migration vs. "pinned" threads]
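A hedged sketch (assumption: Linux with glibc and an OpenMP compiler; not part of the slide) that makes the scheduling visible: each OpenMP thread prints the core it currently runs on. Pinning can then be requested from outside, e.g. by setting OMP_PROC_BIND=true, without changing the code.

    /* compile e.g. with: gcc -fopenmp -o where where.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* sched_getcpu() reports the core this thread happens to run on right now */
            printf("thread %d runs on core %d\n",
                   omp_get_thread_num(), sched_getcpu());
        }
        return 0;
    }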

Slide 6: Shared Memory Parallelization
- Memory can be accessed by several threads running on different cores in a multi-socket multi-core system:
[Figure: a thread on CPU 1 writes a = 4; a thread on CPU 2 reads the shared variable a and computes c = 3 + a]
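The picture on this slide can be reproduced with a small OpenMP program; this is only a sketch under the assumption that a barrier is used to order the write and the read of the shared variable.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int a = 0, c = 0;
        #pragma omp parallel num_threads(2) shared(a, c)
        {
            if (omp_get_thread_num() == 0)
                a = 4;                    /* one core writes the shared variable        */
            #pragma omp barrier           /* makes the write visible to all threads     */
            if (omp_get_thread_num() == 1)
                c = 3 + a;                /* another core reads it                      */
        }
        printf("c = %d\n", c);            /* prints c = 7 (if two threads were available) */
        return 0;
    }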

Slide 7: Parallelization Strategies

Slide 8: Finding Concurrency
- Chances for concurrent execution:
  - Look for tasks that can be executed simultaneously (task parallelism)
  - Decompose data into distinct chunks to be processed independently (data parallelism)
- Organize by task: task parallelism, divide and conquer
- Organize by data decomposition: geometric decomposition, recursive data
- Organize by flow of data: pipeline, event-based coordination
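A minimal sketch (not from the slides, assuming OpenMP and two hypothetical arrays a and b) contrasting the two strategies: a parallel loop decomposes the data into chunks, while sections let independent pieces of work run concurrently.

    #include <omp.h>

    #define N 1000
    double a[N], b[N];

    void fill(double *x)
    {
        for (int i = 0; i < N; ++i)
            x[i] = (double)i;
    }

    int main(void)
    {
        /* data parallelism: the iteration space is split into chunks */
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] = 2.0 * i;

        /* task parallelism: two independent pieces of work execute concurrently */
        #pragma omp parallel sections
        {
            #pragma omp section
            fill(a);
            #pragma omp section
            fill(b);
        }
        return 0;
    }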

Slide 9: Divide and conquer
[Diagram: the problem is split into subproblems, each subproblem is solved, and the subsolutions are merged into the overall solution]

Slide 10: Geometric decomposition
- Example: ventricular assist device (VAD)

Slide 11: Pipeline
- Analogy: assembly line
- Assign different stages to different PEs (processing elements)
[Diagram: work items C1 to C5 passing through stages 1 to 4 over time]

Slide 12: Further parallel design patterns
- Recursive data pattern
  - Parallelize operations on recursive data structures
  - See example later on: Fibonacci (a sketch follows right after this slide)
- Event-based coordination pattern
  - Decomposition into semi-independent tasks interacting in an irregular fashion
  - See example later on: while-loop with tasks
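The Fibonacci example is only announced here; as a hedged sketch, this is how the recursive pattern typically looks with OpenMP tasks (the actual example used later in the course may differ).

    #include <stdio.h>
    #include <omp.h>

    long fib(int n)
    {
        long x, y;
        if (n < 2) return n;              /* in practice a larger serial cutoff avoids task overhead */
        #pragma omp task shared(x)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait              /* wait for both child tasks before combining */
        return x + y;
    }

    int main(void)
    {
        long result;
        #pragma omp parallel
        #pragma omp single                /* one thread creates the initial tasks */
        result = fib(30);
        printf("fib(30) = %ld\n", result);
        return 0;
    }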

Slide 13: Possible Issues

Slide 14: Data Races / Race Conditions
- Data race: concurrent access to the same memory location by multiple threads without proper synchronization
- Example: let the variable x have the initial value 1; one thread executes x = 5; while another executes printf("%d", x);
- Depending on which thread is faster, you will see either 1 or 5
- The result is nondeterministic (i.e. it depends on the OS scheduling)
- Data races (and how to detect and avoid them) will be covered in more detail later!
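A minimal sketch (not from the slides, assuming OpenMP) of a data race on a shared counter and one possible fix: without synchronization the concurrent increments interleave unpredictably, whereas with an atomic update every increment is indivisible.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int racy = 0, safe = 0;

        #pragma omp parallel for
        for (int i = 0; i < 100000; ++i) {
            racy = racy + 1;              /* data race: unsynchronized read-modify-write */

            #pragma omp atomic
            safe = safe + 1;              /* correct: the update is performed atomically */
        }

        printf("racy = %d (nondeterministic), safe = %d\n", racy, safe);
        return 0;
    }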

Slide 15: Serialization
- Serialization: when threads / processes spend "too much" time waiting for each other
- Limited scalability, if the program scales at all
- Simple (and stupid) example: two processes repeatedly exchange data; each calculation only starts after the transfer has completed, so one side always waits while the other calculates
[Diagram: repeated send / receive / data transfer phases, with one process calculating while the other waits]
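A hedged sketch of such serialization, assuming the send/receive picture refers to blocking message passing with MPI (the slide itself does not name a library): rank 1 sits idle in MPI_Recv until rank 0 has finished calculating.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double data = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (long i = 0; i < 100000000L; ++i)   /* long calculation on rank 0 ...   */
                data += 1.0;
            MPI_Send(&data, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ... while rank 1 just waits here: the two processes are serialized */
            MPI_Recv(&data, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", data);
        }

        MPI_Finalize();
        return 0;
    }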

Slide 16: Efficient Parallelization

Slide 17: Parallelization Overhead
- Overhead introduced by the parallelization:
  - time to start / end / manage threads
  - time to send / exchange data
  - time spent in synchronization of threads / processes
- With parallelization:
  - the total CPU time increases,
  - the wall time decreases,
  - the system time stays the same.
- Efficient parallelization is about minimizing the overhead introduced by the parallelization itself!
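A minimal sketch (not from the slides, assuming OpenMP) of how such overhead can be observed: the same loop is timed once serially and once in parallel with omp_get_wtime(); for very small loops the thread management overhead can outweigh the saved work.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000
    static double a[N];

    int main(void)
    {
        double t0 = omp_get_wtime();
        for (int i = 0; i < N; ++i)       /* serial reference                             */
            a[i] = 2.0 * i;
        double t1 = omp_get_wtime();

        #pragma omp parallel for          /* adds thread start-up and scheduling overhead */
        for (int i = 0; i < N; ++i)
            a[i] = 2.0 * i;
        double t2 = omp_get_wtime();

        printf("serial: %f s   parallel: %f s\n", t1 - t0, t2 - t1);
        return 0;
    }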

Slide 18: Load Balancing
- Perfect load balancing: all threads / processes finish at the same time
- Load imbalance: some threads / processes take longer than others
- But: all threads / processes have to wait for the slowest thread / process, which thus limits the scalability
[Diagram: execution time per thread with perfect load balancing vs. with load imbalance]
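A minimal sketch (not from the slides, assuming OpenMP) of one way to improve the load balance: when iterations have very different cost, a dynamic schedule hands out chunks on demand instead of assigning each thread a fixed block up front.

    /* compile e.g. with: gcc -fopenmp -o lb lb.c -lm */
    #include <stdio.h>
    #include <math.h>
    #include <omp.h>

    #define N 1000

    double work(int i)                    /* cost grows strongly with i */
    {
        double s = 0.0;
        for (int k = 0; k < i * i; ++k)
            s += sin((double)k);
        return s;
    }

    int main(void)
    {
        double sum = 0.0;
        #pragma omp parallel for schedule(dynamic, 8) reduction(+ : sum)
        for (int i = 0; i < N; ++i)
            sum += work(i);
        printf("sum = %f\n", sum);
        return 0;
    }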

Slide 19: Speedup and Efficiency
- Time using 1 CPU: T(1)
- Time using p CPUs: T(p)
- Speedup S: S(p) = T(1) / T(p)
  - measures how much faster the parallel computation is!
- Efficiency E: E(p) = S(p) / p
- Example: T(1) = 6s, T(2) = 4s → S(2) = 1.5, E(2) = 0.75
- Ideal case: T(p) = T(1)/p → S(p) = p, E(p) = 1.0
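A tiny sketch (not from the slides) that just evaluates these definitions for the example numbers on the slide:

    #include <stdio.h>

    int main(void)
    {
        double t1 = 6.0, tp = 4.0;        /* T(1) = 6s, T(2) = 4s     */
        int    p  = 2;
        double speedup    = t1 / tp;      /* S(p) = T(1)/T(p) = 1.5   */
        double efficiency = speedup / p;  /* E(p) = S(p)/p    = 0.75  */
        printf("S(%d) = %.2f, E(%d) = %.2f\n", p, speedup, p, efficiency);
        return 0;
    }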

Slide 20: Amdahl's Law
- Describes the influence of the serial part on the scalability (without taking any overhead into account):
  S(p) = T(1) / T(p) = T(1) / (f * T(1) + (1-f) * T(1)/p) = 1 / (f + (1-f)/p)
- f: serial part (0 ≤ f ≤ 1)
- T(1): time using 1 CPU
- T(p): time using p CPUs
- S(p): speedup, S(p) = T(1)/T(p)
- E(p): efficiency, E(p) = S(p)/p
- It is rather easy to scale to a small number of cores, but any parallelization is limited by the serial part of the program!
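A minimal sketch (not from the slides) that evaluates the formula above for a serial fraction of 20%, reproducing the numbers used on the next slide; no matter how many processors are used, the speedup stays below 1/f.

    #include <stdio.h>

    double amdahl_speedup(double f, int p)
    {
        return 1.0 / (f + (1.0 - f) / p); /* S(p) = 1 / (f + (1-f)/p) */
    }

    int main(void)
    {
        double f = 0.2;                               /* 20% serial part        */
        for (int p = 1; p <= 16; p *= 2)
            printf("p = %2d   S(p) = %.2f\n", p, amdahl_speedup(f, p));
        printf("p -> inf  S(p) -> %.2f\n", 1.0 / f);  /* upper bound 1/f = 5    */
        return 0;
    }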

Slide 21: Amdahl's Law illustrated
- If 80% of your work (measured in program runtime) can be parallelized and "just" 20% still runs sequentially, then your speedup will be:
  - 1 processor: time 100%, speedup 1
  - 2 processors: time 60%, speedup 1.67
  - 4 processors: time 40%, speedup 2.5
  - infinitely many processors: time 20%, speedup 5

Slide 22: Speedup in Practice
- After the initial parallelization of a program, you will typically see speedup curves like this:
[Diagram: speedup plotted over the number of processors p, showing the ideal speedup S(p) = p, the speedup according to Amdahl's law below it, and a realistic speedup curve below both]