An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu
The Multicore Revolution is Here! More instruction-level parallelism hard to find Very complex designs needed for small gain Thread-level parallelism appears alive and well Clock frequency scaling is slowing drastically Too much power and heat when pushing the envelope Cannot communicate across chip fast enough Better to design small local units with short paths Effective use of billions of transistors Easier to reuse a basic unit many times Potential for very easy scaling Just keep adding processors/cores for higher (peak) performance
Vocabulary in the Multi Era AMP, Asymmetric MP: Each processor has local memory, tasks statically allocated to one processor SMP, Shared-Memory MP: Processors share memory, tasks dynamically scheduled to any processor
Vocabulary in the Multi Era Heterogeneous: Specialization among processors. Often different instruction sets. Usually AMP design. Homogeneous: all processors have the same instruction set, can run any task, usually SMP design.
Future Embedded Systems
The First Software Crisis 60’s and 70’s: PROBLEM: Assembly Language Programming Need to get abstraction and portability without losing performance SOLUTION: High-level Languages (Fortran and C) Provided “common machine language” for uniprocessors
The Second Software Crisis 80’s and 90’s: PROBLEM: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers Need for composability, malleability and maintainability SOLUTION: Object-Oriented Programming (C++ and Java) Better tools and software engineering methodology (design patterns, specification, testing)
The Third Software Crisis Today: PROBLEM: Solid boundary between hardware and software High-level languages abstract away the hardware Sequential performance is left behind by Moore’s Law SOLUTION: What’s under the hood? Language features for architectural awareness
The Software becomes the Problem, AGAIN Parallelism required to gain performance Parallel hardware is “easy” to design Parallel software is (very) hard to write Fundamentally hard to grasp true concurrency Especially in complex software environments Existing software assumes single-processor Might break in new and interesting ways Multitasking is no guarantee that it will run correctly on a multiprocessor
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
Coverage Can we use more, less powerful (and less power-hungry) cores to achieve the same performance?
Coverage Amdahl's Law: The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. Speedup = old running time / new running time = 100 seconds / 60 seconds = 1.67
Amdahl’s Law p = fraction of work that can be parallelized n = the number of processors Speedup = 1 / ((1 - p) + p/n)
Implications of Amdahl’s Law Speedup tends to 1/(1-p) as the number of processors tends to infinity Parallel programming is worthwhile when programs have a lot of work that is parallel in nature
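The following is a small sketch, not part of the original slides, that simply evaluates the formula above for a growing number of processors; the fraction p = 0.9 is an arbitrary example value. It shows the speedup flattening toward the 1/(1-p) limit.

    #include <stdio.h>

    /* Evaluate Amdahl's Law: speedup of a program whose fraction p can be
       parallelized, when run on n processors. */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        const double p = 0.9;   /* example value: 90% of the work is parallel */
        for (int n = 1; n <= 1024; n *= 2)
            printf("n = %4d   speedup = %6.2f\n", n, amdahl_speedup(p, n));
        printf("limit for n -> infinity = %6.2f\n", 1.0 / (1.0 - p));
        return 0;
    }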
Overhead of Parallelism Given enough parallel work, this is the biggest barrier to getting desired speedup Parallelism overheads include: cost of starting a thread or process cost of communicating shared data cost of synchronizing extra (redundant) computation Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
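As a rough illustration of these fixed costs, here is a hedged POSIX-threads sketch (pthreads is an assumption, not something the slide prescribes; compile with -pthread) that times creating and joining threads that do almost no work. The per-thread cost it prints is pure overhead that real computation must amortize.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 100

    static void *tiny_task(void *arg)
    {
        /* Far less work than the cost of creating the thread itself */
        *(int *)arg += 1;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int data[NTHREADS];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++) {
            data[i] = i;
            pthread_create(&tid[i], NULL, tiny_task, &data[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("thread create/join overhead: %.2f us per thread\n", us / NTHREADS);
        return 0;
    }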
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
Communication/Synchronization Only a few programs are “embarrassingly” parallel Programs have sequential parts and parallel parts Need to orchestrate parallel execution among processors Synchronize threads to make sure dependencies in the program are preserved Communicate results among threads to ensure a consistent view of data being processed
Communication/Synchronization Shared Memory Communication is implicit. One copy of data shared among many threads Atomicity, locking and synchronization essential for correctness Synchronization is typically in the form of a global barrier Distributed Memory Communication is explicit through messages Cores access local memory Data distribution and communication orchestration is essential for performance Synchronization is implicit in messages
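A minimal shared-memory sketch, again assuming POSIX threads (not named in the slides): one copy of the data is shared by every thread, a mutex provides the atomicity required for correctness, and a global barrier keeps all threads from proceeding until every update is done.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long shared_sum = 0;      /* one copy of data, shared among many threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        long id = (long)arg;

        /* Communication is implicit: all threads write the same variable,
           so locking is essential for correctness. */
        pthread_mutex_lock(&lock);
        shared_sum += id;
        pthread_mutex_unlock(&lock);

        /* Global barrier: nobody continues until all updates are visible. */
        pthread_barrier_wait(&barrier);

        if (id == 0)
            printf("sum seen after the barrier: %ld\n", shared_sum);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }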
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
Granularity Granularity is a qualitative measure of the ratio of computation to communication Computation stages are typically separated from periods of communication by synchronization events
Granularity Fine-grain Parallelism Low computation to communication ratio Small amounts of computational work between communication stages Less opportunity for performance enhancement High communication overhead Coarse-grain Parallelism High computation to communication ratio Large amounts of computational work between communication events More opportunity for performance increase Harder to load balance efficiently
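To make the ratio concrete, here is a hedged pthreads sketch (not from the slides) of the same reduction written both ways: the fine-grain version pays one lock/unlock per element, while the coarse-grain version does a large block of private computation and synchronizes only once per thread.

    #include <pthread.h>
    #include <stdio.h>

    #define N        (1 << 20)
    #define NTHREADS 4

    static int data[N];
    static long total = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Fine grain: one synchronization (lock/unlock) per element of work. */
    static void *fine_grain(void *arg)
    {
        for (int i = (long)arg; i < N; i += NTHREADS) {
            pthread_mutex_lock(&lock);
            total += data[i];
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* Coarse grain: a large private computation, one synchronization at the end. */
    static void *coarse_grain(void *arg)
    {
        long local = 0;
        for (int i = (long)arg; i < N; i += NTHREADS)
            local += data[i];
        pthread_mutex_lock(&lock);
        total += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static long run(void *(*fn)(void *))
    {
        pthread_t tid[NTHREADS];
        total = 0;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, fn, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return total;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            data[i] = 1;
        printf("fine-grain   sum = %ld\n", run(fine_grain));
        printf("coarse-grain sum = %ld (same result, far fewer synchronizations)\n",
               run(coarse_grain));
        return 0;
    }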
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
The Load Balancing Problem Processors that finish early have to wait for the processor with the largest amount of work to complete Leads to idle time, lowers utilization Particularly urgent with barrier synchronization With unbalanced workloads, the slowest core dictates overall execution time
Static Load Balancing Programmer makes decisions and assigns a fixed amount of work to each processing core a priori Works well for homogeneous multicores All cores are the same Each core has an equal amount of work Not so well for heterogeneous multicores Some cores may be faster than others Work distribution is uneven
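A hedged sketch of the static scheme under the stated assumptions (homogeneous cores, equal work per item); pthreads, the work() function and the divisibility of N by the number of threads are illustrative choices, not taken from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define N        16
    #define NTHREADS 4

    static int result[N];

    static int work(int i) { return i * i; }   /* placeholder for the real per-item job */

    /* Static load balancing: each core gets a fixed, contiguous block decided a priori. */
    static void *worker(void *arg)
    {
        long id    = (long)arg;
        int  chunk = N / NTHREADS;             /* assumes N is divisible by NTHREADS */
        for (int i = id * chunk; i < (id + 1) * chunk; i++)
            result[i] = work(i);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        for (int i = 0; i < N; i++)
            printf("%d ", result[i]);
        printf("\n");
        return 0;
    }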
Dynamic Load Balancing Workload is partitioned into small tasks. Available tasks for processing are pushed into a work-queue When one core finishes its allocated task, it takes on further work from the queue. The process continues until all tasks are assigned to some core for processing. Ideal for codes where work is uneven, and for heterogeneous multicores
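And a matching sketch of the dynamic scheme: here the work-queue is reduced to a shared next-task index protected by a mutex, and a core that finishes early simply pops more tasks. Task contents and counts are illustrative assumptions, not from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   32
    #define NTHREADS 4

    static int next_task = 0;                  /* index of the next task in the "queue" */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static int done_by[NTHREADS];              /* how many tasks each core processed */

    static void process(int task) { (void)task; /* placeholder for real work */ }

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (;;) {
            /* Pop the next available task; a core that finishes early takes more. */
            pthread_mutex_lock(&qlock);
            int task = next_task < NTASKS ? next_task++ : -1;
            pthread_mutex_unlock(&qlock);
            if (task < 0)
                break;
            process(task);
            done_by[id]++;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        for (int i = 0; i < NTHREADS; i++)
            printf("core %d processed %d tasks\n", i, done_by[i]);
        return 0;
    }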
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
Memory Access Latency Uniform Memory Access (UMA) – Shared Memory Centrally located shared memory All processors are equidistant (access times) Non-Uniform Memory Access (NUMA) Shared memory – Processors have the same address space Data is directly accessible by all, but cost depends on the distance Placement of data affects performance Distributed memory – Processors have private address spaces Data access is local, but cost of messages depends on the distance Communication must be efficiently architected
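On a NUMA Linux machine one concrete way to control placement is libnuma, which is an assumption of this sketch (the slides do not name any API; link with -lnuma): the calling thread is pinned to a node and its data is allocated on that same node, so accesses stay local instead of crossing the interconnect.

    #include <numa.h>      /* libnuma; link with -lnuma */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this machine\n");
            return 1;
        }

        int node = 0;                    /* place everything on node 0 as an example */
        size_t bytes = 1 << 20;

        numa_run_on_node(node);          /* run this thread on the CPUs of that node */
        double *local = numa_alloc_onnode(bytes, node);  /* keep its data on the same node */
        if (!local)
            return 1;

        /* Accesses now stay local; allocating on a remote node would add
           interconnect latency to every miss. */
        for (size_t i = 0; i < bytes / sizeof(double); i++)
            local[i] = (double)i;

        numa_free(local, bytes);
        return 0;
    }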
Locality of Memory Accesses (UMA Shared Memory) Parallel computation is serialized due to memory contention and lack of bandwidth
Locality of Memory Accesses (UMA Shared Memory) Distribute data to relieve contention and increase effective bandwidth
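One hedged sketch of what distributing the data can mean even inside a single shared memory: instead of all threads updating counters packed into the same cache line (which serializes them on memory and coherence traffic), each thread gets its own cache-line-sized slot. The 64-byte line size and pthreads are assumptions, not taken from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    10000000

    /* Each thread's counter is padded to its own cache line, so updates do not
       contend for the same memory location/line (64 bytes is a common line size,
       but this is an assumption). */
    struct padded { long count; char pad[64 - sizeof(long)]; };
    static struct padded counter[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counter[id].count++;           /* private, uncontended location */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        long total = 0;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += counter[i].count;
        }
        printf("total = %ld\n", total);
        return 0;
    }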
Locality of Memory Accesses (NUMA Shared Memory) [Figure: each CPU has a private SPM, all connected through an interconnect to a larger shared memory] Once parallel tasks have been assigned to different processors, the physical placement of the data they touch (arrays A and B) can have a great impact on performance!
int main() {
  /* Task 1 */
  for (i = 0; i < n; i++)
    A[i][rand()] = foo ();
  /* Task 2 */
  for (j = 0; j < n; j++)
    B[j] = goo ();
}
Locality in Communication (Message Passing)
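The slide's figure is not reproduced here; as a hedged illustration of explicit, distance-aware communication, this MPI sketch (MPI is an assumption, not named on the slide) has each rank exchange a value only with its nearest neighbor in a ring, the kind of pattern that keeps messages local on the interconnect.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;          /* nearest neighbor in a ring */
        int left  = (rank - 1 + size) % size;
        int send  = rank, recv = -1;

        /* Communication is explicit: data moves only when a message is sent.
           Keeping partners adjacent keeps traffic local on the interconnect. */
        MPI_Sendrecv(&send, 1, MPI_INT, right, 0,
                     &recv, 1, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d received %d from its left neighbor\n", rank, recv);
        MPI_Finalize();
        return 0;
    }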