Lecture 3: Performance of Parallel Programs. Courtesy: MIT Prof. Amarasinghe and Dr. Rabbah’s course notes.

Creating a Parallel Program: Decomposition, Assignment, Orchestration/Mapping

Decomposition: Break up the computation into tasks to be divided among processes. Identify concurrency and decide the level at which to exploit it.

Assignment: Assign tasks to threads. Balance the workload and reduce communication and management costs. Together with decomposition, this is also called partitioning. It can be performed statically or dynamically. Goals: balanced workload, reduced communication costs.

Orchestration: Structure communication and synchronization. Organize data structures in memory and schedule tasks temporally. Goals: reduce the cost of communication and synchronization as seen by processors; preserve locality of data reference (including data structure organization).

Mapping: Map threads to execution units (CPU cores). A parallel application tries to use the entire machine. Mapping is usually a job for the OS. Mapping decision: place related (cooperating) threads on the same processor to maximize locality and data sharing and to minimize the costs of communication and synchronization.

Performance of Parallel Programs: What factors affect performance? Decomposition: coverage of parallelism in the algorithm. Assignment: granularity of partitioning among processors. Orchestration/Mapping: locality of computation and communication.

Coverage (Amdahl’s Law): Potential program speedup is defined by the fraction of code that can be parallelized.

Amdahl’s Law: Speedup = old running time / new running time. Example: 100 sec / 60 sec = 1.67 (the parallel version is 1.67 times faster).

Amdahl’s Law: Speedup = 1 / ((1 - p) + p/n), where p = the fraction of work that can be parallelized and n = the number of processors.

Implications of Amdahl’s Law: Speedup tends to 1/(1-p) as the number of processors tends to infinity. Parallel programming is worthwhile when programs have a lot of work that is parallel in nature.
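Amdahl's Law is commonly written as speedup = 1 / ((1 - p) + p/n), which has the limit 1/(1-p) stated above. A minimal Python sketch of that formula follows; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and not taken from the slides.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup predicted by Amdahl's Law.

    p: fraction of the work that can be parallelized (0 <= p <= 1)
    n: number of processors (n >= 1)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The serial part (1 - p) is unchanged; the parallel part p is divided by n.
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Upper bound on speedup as n tends to infinity: 1 / (1 - p)."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Illustrative value p = 0.9 (an assumption, not a figure from the slides).
    for n in (1, 2, 4, 16, 256):
        print(n, round(amdahl_speedup(0.9, n), 2))
    print("limit:", round(amdahl_limit(0.9), 2))
```

Even with 90% of the work parallelizable, the speedup can never exceed 10, however many processors are added. This is why coverage matters.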

Performance Scalability: Scalability is the capability of a system to increase total throughput under an increased load when resources (typically hardware) are added.

Granularity: Granularity is a qualitative measure of the ratio of computation to communication. Computation stages are typically separated from periods of communication by synchronization events.

Granularity (from Wikipedia): Granularity is the extent to which a system is broken down into small parts. Coarse-grained systems consist of fewer, larger components than fine-grained systems. A coarse-grained description of a system regards large subcomponents; a fine-grained description regards the smaller components of which the larger ones are composed.

Fine vs. Coarse Granularity: Fine-grain parallelism: low computation-to-communication ratio; small amounts of computational work between communication stages; less opportunity for performance enhancement; high communication overhead. Coarse-grain parallelism: high computation-to-communication ratio; large amounts of computational work between communication events; more opportunity for performance increase; harder to load balance efficiently.

General Load Balancing Problem: The whole work should be completed as fast as possible. As workers are very expensive, they should be kept busy. The work should be distributed fairly: about the same amount of work should be assigned to every worker. There are precedence constraints between different tasks (we can start building the roof only after finishing the walls), so we also have to find a clever processing order for the different jobs.

Load Balancing Problem: Processors that finish early have to wait for the processor with the largest amount of work to complete. This leads to idle time and lowers utilization.

Static Load Balancing: The programmer makes the decisions and assigns a fixed amount of work to each processing core a priori. Low runtime overhead. Works well for homogeneous multicores: all cores are the same and each core has an equal amount of work. Works less well for heterogeneous multicores: some cores may be faster than others, so the work distribution is uneven.

Dynamic Load Balancing: When one core finishes its allocated work, it takes work from a work queue or from the core with the heaviest workload. Partitioning is adapted at run time to balance the load. High runtime overhead. Ideal for codes where the work is uneven or unpredictable, and for heterogeneous multicores.
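The difference between the two strategies can be sketched with a small simulation in Python. The model is a deliberate simplification: task costs are known numbers, cores are identical, and the runtime overhead of the shared work queue is ignored (an assumption that favors the dynamic scheme). The function names are illustrative.

```python
import heapq


def static_makespan(task_costs, num_cores):
    """Static balancing: give each core an equal-sized block of tasks up front.

    Returns the time at which the slowest core finishes.
    """
    loads = [0.0] * num_cores
    chunk = -(-len(task_costs) // num_cores)  # ceiling division
    for i, cost in enumerate(task_costs):
        loads[i // chunk] += cost
    return max(loads)


def dynamic_makespan(task_costs, num_cores):
    """Dynamic balancing: an idle core takes the next task from a work queue.

    Assumption: taking a task from the queue costs nothing.
    """
    free_at = [0.0] * num_cores  # min-heap of times at which each core is idle
    heapq.heapify(free_at)
    for cost in task_costs:
        earliest = heapq.heappop(free_at)  # the core that becomes idle first
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


if __name__ == "__main__":
    uneven = [8, 7, 6, 5, 1, 1, 1, 1]  # hypothetical task costs
    print("static :", static_makespan(uneven, 2))
    print("dynamic:", dynamic_makespan(uneven, 2))
```

With uneven task costs the static block assignment leaves one core idle for most of the run, while the work queue keeps both cores busy. With uniform costs the two schemes finish at the same time, and the static one avoids the queue overhead that this model leaves out.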

Granularity and Performance Tradeoffs: Load balancing: how well is work distributed among cores? Synchronization/communication: how large is the communication overhead?

Communication: With message passing, the programmer has to understand the computation and orchestrate the communication accordingly. Communication patterns: point-to-point; broadcast (one to all) and reduce (all to one); all-to-all (each processor sends its data to all others); scatter (one to several) and gather (several to one).
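The collective patterns listed above can be modelled in plain Python, with position r in a list standing for the data held by processor r. This is only a sketch of who ends up with which data; it performs no real message passing, and the function names are illustrative rather than any library's API.

```python
def broadcast(value, num_ranks):
    """One to all: every processor receives a copy of the root's value."""
    return [value for _ in range(num_ranks)]


def scatter(data, num_ranks):
    """One to several: the root splits its data into contiguous chunks."""
    chunk = -(-len(data) // num_ranks)  # ceiling division
    return [data[r * chunk:(r + 1) * chunk] for r in range(num_ranks)]


def gather(per_rank_chunks):
    """Several to one: the root concatenates the chunks in rank order."""
    return [item for chunk in per_rank_chunks for item in chunk]


def all_to_all(per_rank_values):
    """Each processor sends its data to all others, so every processor
    ends up holding every processor's value."""
    everything = list(per_rank_values)
    return [list(everything) for _ in per_rank_values]


if __name__ == "__main__":
    chunks = scatter(list(range(10)), 4)
    print(chunks)
    print(gather(chunks))
    print(broadcast("go", 3))
    print(all_to_all([1, 2, 3]))
```

Scatter followed by gather returns the original data, which is the usual shape of a data-parallel computation: distribute, compute locally, collect.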

MPI (Message Passing Interface): A portable specification, not a language or compiler specification, and not a specific implementation or product. SPMD model (single program, multiple data). For parallel computers, clusters, heterogeneous networks, and multicores. Multiple communication modes allow precise buffer management. Extensive collective operations for scalable global communication.

Point-to-Point: The basic method of communication between two processors. The originating processor "sends" a message to the destination processor, and the destination processor then "receives" the message. The message commonly includes: data or other information; the length of the message; the destination address and possibly a tag.

Synchronous vs. Asynchronous Messages

Blocking vs. Non-Blocking Messages

Broadcast

Reduction: Example: every processor starts with a value and needs to know the sum of the values stored on all processors. A reduction (MPI_REDUCE) combines data from all processors and returns the result to a single process. It can apply any associative operation to the gathered data: ADD, OR, AND, MAX, MIN, etc. No processor can finish the reduction before every processor has contributed a value. BCAST/REDUCE can reduce programming complexity and may be more efficient in some programs.
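One common way to carry out a reduction is a pairwise (tree) combination, sketched here in Python. This illustrates the idea only and makes no claim about how any MPI implementation performs MPI_REDUCE; `tree_reduce` is a hypothetical name, and list position r stands for the value held by processor r.

```python
from operator import add


def tree_reduce(values, op):
    """Combine one value per processor with an associative operation.

    Values are combined pairwise in rounds, so p values need about
    log2(p) rounds instead of p - 1 sequential steps.
    Returns (result held by processor 0, number of rounds).
    """
    if not values:
        raise ValueError("need at least one value")
    vals = list(values)
    rounds = 0
    stride = 1
    while stride < len(vals):
        # In each round, processor i receives from processor i + stride.
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
        rounds += 1
    return vals[0], rounds


if __name__ == "__main__":
    print(tree_reduce([1, 2, 3, 4, 5], add))
    print(tree_reduce([3, 9, 2, 7], max))
```

Operands are always combined left to right, so the operation only has to be associative, as the slide requires, and need not be commutative.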

Example : Parallel Numerical Integration

Computing the Integration (MPI)
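A minimal sketch of how a numerical integration can be partitioned among workers, written in Python with threads standing in for MPI processes. The integrand 4/(1+x^2) on [0, 1] (whose integral is pi), the midpoint rule, and the cyclic distribution of steps are all illustrative assumptions, not taken from the slides.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def f(x):
    """Assumed integrand: the integral of 4/(1+x^2) over [0, 1] equals pi."""
    return 4.0 / (1.0 + x * x)


def partial_sum(rank, num_workers, num_steps):
    """Midpoint-rule contribution of one worker (cyclic distribution)."""
    h = 1.0 / num_steps
    total = 0.0
    for i in range(rank, num_steps, num_workers):
        total += f((i + 0.5) * h)
    return total * h


def integrate(num_workers=4, num_steps=100_000):
    """Each worker computes a partial sum; the partial sums are then
    added together, which plays the role of the reduction step."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = pool.map(
            lambda rank: partial_sum(rank, num_workers, num_steps),
            range(num_workers),
        )
        return sum(parts)


if __name__ == "__main__":
    approx = integrate()
    print(approx, abs(approx - math.pi))
```

In CPython, threads show the structure of the decomposition but give no real speedup for this kind of arithmetic. With MPI, each rank would compute its partial sum in a separate process and a reduce operation would combine the results.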

Locality: Large memories are slow; fast memories are small. Storage hierarchies are large and fast on average. Parallel processors, collectively, have a large, fast cache; the slow accesses to “remote” data are what we call “communication”. An algorithm should do most of its work on local data and needs to exploit spatial and temporal locality. [Figure: conventional storage hierarchy (Proc, Cache, L2 Cache, L3 Cache, Memory), replicated per processor, with potential interconnects between the hierarchies]

Locality of memory access (shared memory)

Memory Access Latency in Shared Memory Architectures: Uniform Memory Access (UMA): centrally located memory; all processors are equidistant from it (equal access times). Non-Uniform Memory Access (NUMA): memory is physically partitioned but accessible by all; processors have the same address space; placement of data affects performance. CC-NUMA (Cache-Coherent NUMA).

Shared Memory Architecture: All processors can access all memory as a global address space (UMA, NUMA). Advantages: the global address space provides a user-friendly programming perspective on memory; data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs. Disadvantages: the primary disadvantage is the lack of scalability between memory and CPUs; the programmer is responsible for synchronization; expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.

Example of Parallel Program

Ray Tracing: Shoot a ray into the scene through every pixel in the image plane and follow its path. Rays bounce around as they strike objects and generate new rays, giving a ray tree per input ray. The result is a color and opacity for that pixel. Parallelism is across rays.
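Because each pixel's ray is independent of every other, the per-pixel work can simply be mapped over a pool of workers. The Python sketch below shows that structure with a heavily simplified tracer: one hypothetical sphere, primary rays only (no bounces, so no ray tree), and a hit/miss flag in place of color and opacity.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scene: camera at the origin looking along +z, one sphere.
WIDTH, HEIGHT = 16, 16
SPHERE_CENTER = (0.0, 0.0, 3.0)
SPHERE_RADIUS = 1.0


def trace_pixel(index):
    """Shoot one primary ray through pixel `index`; return 1 on a hit, else 0."""
    px, py = index % WIDTH, index // WIDTH
    # Ray direction through the pixel centre on an image plane at z = 1.
    dx = (px + 0.5) / WIDTH * 2.0 - 1.0
    dy = (py + 0.5) / HEIGHT * 2.0 - 1.0
    dz = 1.0
    cx, cy, cz = SPHERE_CENTER
    # Ray-sphere test: |t*d - c|^2 = r^2 is a quadratic in t;
    # a non-negative discriminant means the ray meets the sphere.
    a = dx * dx + dy * dy + dz * dz
    b = -2.0 * (dx * cx + dy * cy + dz * cz)
    c = cx * cx + cy * cy + cz * cz - SPHERE_RADIUS ** 2
    return 1 if b * b - 4.0 * a * c >= 0.0 else 0


def render(num_workers=4):
    """Parallelism across rays: pixels are traced independently, in any order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(trace_pixel, range(WIDTH * HEIGHT)))


if __name__ == "__main__":
    image = render()
    for row in range(HEIGHT):
        print("".join("#" if v else "." for v in image[row * WIDTH:(row + 1) * WIDTH]))
```

In a full ray tracer the cost per pixel varies with how many secondary rays are spawned, so the work per ray is uneven. That makes ray tracing a natural candidate for dynamic load balancing.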