Experiencing Cluster Computing Class 1. Introduction to Parallelism.

Presentation transcript:

Experiencing Cluster Computing Class 1

Introduction to Parallelism

Outline
– Why Parallelism
– Types of Parallelism
– Drawbacks
– Concepts
– Starting Parallelization
– Simple Example

Why Parallelism

Why Parallelism – Reactively
Suppose you are already using the most efficient algorithm with an optimal implementation, and the program still takes too long or does not even fit on your machine. Parallelization is then the last resort.

Why Parallelism – Proactively
– Faster: finish the same work in a shorter time
– Do more: complete more work in the same time
Most importantly, you may need to predict a result before the event occurs (for example, a weather forecast is only useful if it finishes before the weather arrives).

Examples
Many scientific and engineering problems require enormous computational power. A few fields, among others:
– Quantum chemistry, statistical mechanics, and relativistic physics
– Cosmology and astrophysics
– Computational fluid dynamics and turbulence
– Material design and superconductivity
– Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
– Medicine, and modeling of human organs and bones
– Global weather and environmental modeling
– Machine vision

Parallelism
The computing power obtainable from a single processor is bounded by the fastest processor available at any given time. That bound can be raised dramatically by integrating a set of processors and letting them work on one problem together. Synchronization and exchange of partial results among the processors are then unavoidable.

Computer Architecture
Flynn's taxonomy defines 4 categories:
– SISD: Single Instruction, Single Data
– SIMD: Single Instruction, Multiple Data
– MISD: Multiple Instruction, Single Data
– MIMD: Multiple Instruction, Multiple Data

Computer Architecture – Processor Organizations
– SISD: uniprocessor (single-processor computer)
– SIMD: vector processor, array processor
– MISD
– MIMD: shared memory (multiprocessors) – SMP, NUMA; distributed memory (multicomputers) – cluster

Parallel Computer Architecture
[Diagram: multiprocessing – shared memory (symmetric multiprocessors, SMP): n processing units (CU/PU) connected to a single shared memory, exchanging instruction and data streams (IS/DS)]
[Diagram: clustering – distributed memory (cluster): n CPUs, each with its own local memory (LM), connected by an interconnecting network]

Types of Parallelism

Parallel Programming Paradigms
– Multithreading: OpenMP (shared memory only)
– Message passing: MPI (Message Passing Interface), PVM (Parallel Virtual Machine) (shared memory or distributed memory)

Threads
In computer programming, a thread is the bookkeeping information associated with a single use of a program that can handle multiple concurrent users. From the program's point of view, a thread is the information needed to serve one individual user or one particular service request. If multiple users are using the program, or concurrent requests from other programs occur, a thread is created and maintained for each of them. The thread allows the program to know which user is being served as the program is alternately re-entered on behalf of different users.

Threads
Programmer's view:
– Single CPU
– Single block of memory
– Several threads of action
Parallelization: done by the compiler
[Diagram: fork-join model – the master thread forks a team of parallel threads at a parallel region and joins them at its end]

Shared Memory
Programmer's view:
– Several CPUs
– Single block of memory
– Several threads of action
Parallelization: done by the compiler
Example: OpenMP
[Diagram: a single-threaded process vs. a multi-threaded process (P1, P2, P3) over time; threads exchange data via the shared memory]

Multithreaded Parallelization
[Diagram: the master thread forks a team of parallel threads at Parallel Region 1 and again at Parallel Region 2; in Fortran OpenMP each region is delimited by !$OMP PARALLEL ... !$OMP END PARALLEL]
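For readers more familiar with C than Fortran, a minimal fork-join sketch in C/OpenMP might look as follows (an illustrative addition, not part of the original slides); build with an OpenMP-capable compiler such as gcc -fopenmp.

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("serial part: master thread only\n");

    #pragma omp parallel                 /* FORK: a team of threads starts here */
    {
        printf("parallel region: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                    /* JOIN: the team ends, the master continues */

    printf("serial part: master thread again\n");
    return 0;
}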

Distributed Memory
Programmer's view:
– Several CPUs
– Several blocks of memory
– Several threads of action
Parallelization: done by hand
Example: MPI
[Diagram: a serial process vs. processes 0, 1, 2 (P1, P2, P3) over time; processes exchange data via message passing over the interconnection network]
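A minimal message-passing sketch in C/MPI (an illustrative addition, not from the original slides): each process learns its rank and the process count, and rank 1 sends one integer to rank 0 over the interconnect. Build with mpicc and run with mpirun.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &np);     /* how many processes are there? */

    if (rank == 1) {
        int msg = 42;                        /* illustrative payload */
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0 && np > 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %d from rank 1\n", msg);
    }

    MPI_Finalize();
    return 0;
}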

Drawbacks

Drawbacks of Parallelism
– Traps: deadlocks, process synchronization
– Programming effort: few tools support automated parallelization and debugging
– Task distribution (load balancing)

Deadlock
The earliest computer operating systems ran only one program at a time, and all of the resources of the system were available to that one program. Later, operating systems ran multiple programs at once, interleaving them; programs were required to specify in advance what resources they needed so that they could avoid conflicts with other programs running at the same time. Eventually some operating systems offered dynamic allocation of resources, so programs could request further allocations after they had begun running. This led to the problem of deadlock.

Deadlock
Parallel tasks require resources to accomplish their work; if a resource is not available, the work cannot be finished. Each resource can be locked (controlled) by exactly one task at any given point in time. Consider the situation:
– Two tasks both need the same two resources.
– Each task manages to gain control over one resource, but not the other.
– Neither task releases the resource it already holds.
This is called deadlock, and the program will not terminate.
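The two-task scenario above can be reproduced in C with two POSIX mutexes locked in opposite order; this is an illustrative sketch (pthreads are not part of the original slides), and whether a given run actually deadlocks depends on thread scheduling.

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t resource_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t resource_b = PTHREAD_MUTEX_INITIALIZER;

void *task1(void *arg) {
    pthread_mutex_lock(&resource_a);     /* task 1 holds A ...            */
    pthread_mutex_lock(&resource_b);     /* ... and waits for B           */
    pthread_mutex_unlock(&resource_b);
    pthread_mutex_unlock(&resource_a);
    return NULL;
}

void *task2(void *arg) {
    pthread_mutex_lock(&resource_b);     /* task 2 holds B ...            */
    pthread_mutex_lock(&resource_a);     /* ... and waits for A: deadlock */
    pthread_mutex_unlock(&resource_a);
    pthread_mutex_unlock(&resource_b);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task1, NULL);
    pthread_create(&t2, NULL, task2, NULL);
    pthread_join(t1, NULL);              /* may never return if both tasks   */
    pthread_join(t2, NULL);              /* acquired their first mutex       */
    return 0;
}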

Deadlock
[Diagram: a resource-allocation graph – each process holds one resource while requesting the resource held by the other, forming a cycle]

Dining Philosophers
Each philosopher either thinks or eats. In order to eat, he requires two forks. Each philosopher tries to pick up the right fork first; if successful, he then waits for the left fork to become available → deadlock.

Dining Philosophers Demo
Problem – mp/deadlock/Diners.htm
Solution – mp/deadlock/FixedDiners.htm

Concepts

Speedup
Given a fixed problem size:
– T_S: sequential wall-clock execution time (in seconds)
– T_N: parallel wall-clock execution time using N processors (in seconds)
Speedup S_N = T_S / T_N. Ideally S_N = N (linear speedup).

Speedup
– Absolute speedup = sequential time on 1 processor / parallel time on N processors
– Relative speedup = parallel time on 1 processor / parallel time on N processors
These differ because the parallel code on 1 processor carries unnecessary MPI overhead; it may be slower than the sequential code on 1 processor.

Parallel Efficiency
Efficiency is a measure of processor utilization in a parallel program, relative to the serial program. Parallel efficiency is the speedup per processor: E_N = S_N / N = T_S / (N · T_N). Ideally, E_N = 1. For example, a speedup of 6 on 8 processors gives an efficiency of 6/8 = 0.75.

Amdahl’s Law
Amdahl’s Law states that the potential program speedup is determined by the fraction of code (f) that can be parallelized: speedup = 1 / (1 − f).
– If none of the code can be parallelized, f = 0 and the speedup = 1 (no speedup).
– If all of the code is parallelized, f = 1 and the speedup is infinite (in theory).

Amdahl’s Law
Introducing the number of processors performing the parallel fraction of the work, the relationship can be modeled by the equation
speedup = 1 / (S + P/N)
where P is the parallel fraction, S = 1 − P is the serial fraction, and N is the number of processors.

Amdahl’s Law
As N → ∞, speedup → 1/S.
Interpretation: no matter how many processors are used, the upper bound for the speedup is determined by the sequential section.

Amdahl’s Law – Example
If the sequential section of a program amounts to 5% of the run time, then S = 0.05 and hence the speedup can never exceed 1/S = 1/0.05 = 20, no matter how many processors are used.
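A small C sketch (an illustrative addition, not from the original slides) that evaluates speedup = 1 / (S + P/N) for this example and shows the speedup approaching the 1/S = 20 bound:

#include <stdio.h>

int main(void) {
    const double S = 0.05, P = 0.95;            /* serial and parallel fractions */
    const int procs[] = {1, 10, 100, 1000, 10000};
    for (int i = 0; i < 5; i++) {
        int N = procs[i];
        /* Amdahl's law: the speedup is bounded by 1/S = 20 as N grows */
        printf("N = %5d   speedup = %6.2f\n", N, 1.0 / (S + P / N));
    }
    return 0;
}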

Behind Amdahl’s Law
1. How much faster can a given problem be solved?
2. What problem size can be solved on a parallel machine in the same time as on a sequential one? (scalability)

Starting Parallelization

Parallelization – Option 1
Starting from an existing, sequential program:
– Easy on shared-memory architectures (OpenMP)
– Potentially adequate for a small number of processes (moderate speed-up)
– Does not scale to a large number of processes
– Restricted to trivially parallel problems on distributed-memory machines

Parallelization – Option 2
Starting from scratch:
– Not popular, but often inevitable
– Needs a new program design
– Increased complexity (data distribution)
– Widely applicable
– Often the best choice for large-scale problems

Goals for Parallelization
– Avoid or reduce synchronization and communication
– Try to maximize the computationally intensive sections

Simple Example

Summation
Given an N-dimensional vector of integers:

// Initialization
for (int i = 0; i < len; i++)
    vec[i] = i*i;

// Sum calculation
for (int i = 0; i < len; i++)
    sum += vec[i];

Parallel Algorithm
1. Divide the vector into parts, one per CPU.
2. Each CPU initializes its own part.
3. Use a global reduction to calculate the sum of the vector.

OpenMP
Compiler directives (#pragma omp) are inserted to tell the compiler to perform the parallelization; the compiler is then responsible for automatically parallelizing the annotated loops.

#pragma omp parallel for
for (int i = 0; i < len; i++)
    vec[i] = i*i;

#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < len; i++)
    sum += vec[i];
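For reference, a complete and compilable version of this sketch might look as follows; the vector length is an arbitrary illustrative value, and the program is a minimal sketch rather than part of the original course material. Build with an OpenMP-capable compiler, e.g. gcc -fopenmp.

#include <omp.h>
#include <stdio.h>

#define LEN 1000

int main(void) {
    int vec[LEN];
    long sum = 0;

    /* initialize the vector in parallel */
    #pragma omp parallel for
    for (int i = 0; i < LEN; i++)
        vec[i] = i * i;

    /* parallel sum: each thread accumulates a private partial sum,
       which the reduction clause combines at the end of the loop */
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < LEN; i++)
        sum += vec[i];

    printf("sum = %ld\n", sum);
    return 0;
}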

MPI
// In each process, initialize the elements this rank owns
for (int i = rank; i < len; i += np)
    vec[i] = i*i;

// Calculate the local sum
for (int i = rank; i < len; i += np)
    localsum += vec[i];

// Perform the global reduction
MPI_Reduce(&localsum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

[Diagram: with np = 3 processes, each rank computes a localsum over its own elements of vec; MPI_Reduce combines the local sums into sum on rank 0]
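A complete, compilable version of the MPI sketch might look as follows (again with an arbitrary illustrative vector length; this is a sketch, not part of the original slides). Build with mpicc and run with, e.g., mpirun -np 3.

#include <mpi.h>
#include <stdio.h>

#define LEN 1000

int main(int argc, char **argv) {
    int rank, np, vec[LEN];
    int localsum = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &np);     /* total number of processes */

    /* each process initializes and sums only the elements it owns
       (cyclic distribution: rank, rank+np, rank+2*np, ...) */
    for (int i = rank; i < LEN; i += np)
        vec[i] = i * i;
    for (int i = rank; i < LEN; i += np)
        localsum += vec[i];

    /* combine the partial sums into sum on rank 0 */
    MPI_Reduce(&localsum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %d\n", sum);

    MPI_Finalize();
    return 0;
}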

END