Chapter 01: Introduction
What we'll be doing
Modern CPUs contain many cores, and an even higher degree of parallelism is available on systems with multiple CPUs, on clusters, and on supercomputers. The ability to program these types of systems efficiently and effectively is therefore an essential skill.
You will learn how to write explicitly parallel programs in C++ using the Message-Passing Interface (MPI), multithreading, OpenMP, and vectorization.
Learning Outcomes Today
- Understand and analyze a parallel algorithm for adding numbers
- Learn about the terms speedup, efficiency, scalability, and compute-to-communication ratio
- Know the difference between shared-memory and distributed-memory architectures and some corresponding parallel programming languages
- Understand important considerations for parallel program design: partitioning, communication, synchronization, load balancing
- Learn about important HPC trends and rankings: TOP500, Green500, Graph500
Motivational Example and its Analysis
Input: an array A of n numbers
Output: the sum $\sum_{i=0}^{n-1} A[i]$
Task: parallelize this problem efficiently using an array of processing elements (PEs)
Assumptions:
- Computation: each PE can add two numbers stored in its local memory in 1 sec
- Communication: a PE can send data from its local memory to the local memory of any other PE in 3 sec (independent of the size of the data)
- Input and output: at the beginning of the program the whole input array A is stored in PE #0; at the end the result should be gathered in PE #0
- Synchronization: all PEs operate in lock-step manner, i.e. they can either compute, communicate, or be idle
Motivational Example and its Analysis
Notation: $T(p, n)$ denotes the runtime using p processors on a problem of size n.
Establish the sequential runtime as a baseline (p = 1). For our example: $T(1, n) = (n - 1)$ sec, since adding n numbers requires n − 1 additions.
What are we interested in?
- Speedup: how much faster can we get with p > 1 processors?
- Efficiency: is our parallel program efficient?
- Scalability: how does our parallel program behave for a varying number of processors (for fixed/varying problem size)?
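These quantities can be stated concisely. The following are the standard definitions used in the analysis below (cost is not defined on the original slide but appears in the review questions; stated here for reference):

    \begin{align*}
    S(p, n) &= \frac{T(1, n)}{T(p, n)} && \text{(speedup)} \\
    C(p, n) &= p \cdot T(p, n) && \text{(cost)} \\
    E(p, n) &= \frac{S(p, n)}{p} = \frac{T(1, n)}{p \cdot T(p, n)} && \text{(efficiency)}
    \end{align*}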
Motivational Example and its Analysis
Figure: PE #0 initially holds A[0..1023] and sends A[512..1023] to PE #1. Each PE computes the sum of its 512 local elements; PE #1 sends its partial sum back to PE #0, which adds the two partial sums to obtain $\sum_{i=0}^{1023} A[i]$.
Runtime for 2 PEs (p = 2) and 1024 numbers (n = 1024):
T(2, 1024) = 3 + 511 + 3 + 1 = 518 sec
Speedup: T(1, 1024) / T(2, 1024) = 1023 / 518 = 1.975
Efficiency: 1.975 / 2 = 98.75%
Motivational Example and its Analysis
Figure: the data is distributed in two communication steps: PE #0 sends A[512..1023] to PE #1; then PE #0 sends A[256..511] to PE #2 while PE #1 sends A[768..1023] to PE #3. Each PE sums its 256 local elements (255 additions); the partial sums are collected and added along the same tree in reverse until PE #0 holds $\sum_{i=0}^{1023} A[i]$.
T(4, 1024) = 3·2 + 255 + 3·2 + 2 = 269 sec
Speedup: T(1, 1024) / T(4, 1024) = 1023 / 269 = 3.803
Efficiency: 3.803 / 4 = 95.07%
Motivational Example and its Analysis
Figure: with 8 PEs the data is distributed along a binary tree in three communication steps until each PE holds 128 elements (e.g. PE #0: A[0..127], PE #4: A[128..255], ..., PE #7: A[896..1023]). Each PE sums its local elements (127 additions); the partial sums are collected and added along the tree in reverse.
T(8, 1024) = 3·3 + 127 + 3·3 + 3 = 148 sec
Speedup: T(1, 1024) / T(8, 1024) = 1023 / 148 = 6.91
Efficiency: 6.91 / 8 = 86.4%
Motivational Example and its Analysis
Timing analysis using $p = 2^q$ PEs and $n = 2^k$ input numbers:
- Data distribution: $3q$
- Computing local sums: $n/p - 1 = 2^{k-q} - 1$
- Collecting partial results: $3q$
- Adding partial results: $q$
$T(p, n) = T(2^q, 2^k) = 3q + (2^{k-q} - 1) + 3q + q = 2^{k-q} - 1 + 7q$
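As a quick sanity check (my own addition, not on the original slide), plugging $k = 10$ and $q = 1, 2, 3$ into this formula reproduces the runtimes computed above:

    T(2, 1024) = 2^9 - 1 + 7·1 = 511 +  7 = 518 sec
    T(4, 1024) = 2^8 - 1 + 7·2 = 255 + 14 = 269 sec
    T(8, 1024) = 2^7 - 1 + 7·3 = 127 + 21 = 148 sec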
Strong Scalability Analysis for n = 1024 (fixed problem size, increasing number of PEs)
Weak Scalability Analysis for n = 1024p (problem size grows proportionally with the number of PEs); a small program evaluating both scenarios follows below
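A minimal C++ sketch (my own illustration, not from the slides) that evaluates $T(2^q, 2^k) = 2^{k-q} - 1 + 7q$ and prints speedup and efficiency for both the strong-scaling and the weak-scaling scenario; the loop bounds and output format are arbitrary choices:

    #include <cmath>
    #include <cstdio>

    // Runtime model from the analysis above: p = 2^q PEs, n = 2^k numbers,
    // T(2^q, 2^k) = 2^(k-q) - 1 + 7q   (in seconds)
    double T(int q, int k) { return std::pow(2.0, k - q) - 1.0 + 7.0 * q; }

    int main() {
        std::printf("Strong scaling (n = 1024 fixed):\n");
        for (int q = 0; q <= 6; ++q) {
            const double t1 = T(0, 10);            // sequential baseline: 1023 s
            const double tp = T(q, 10);
            const double s  = t1 / tp;
            std::printf("p = %2d  T = %7.0f s  S = %6.3f  E = %5.1f%%\n",
                        1 << q, tp, s, 100.0 * s / (1 << q));
        }

        std::printf("\nWeak scaling (n = 1024 * p):\n");
        for (int q = 0; q <= 6; ++q) {
            const int k     = 10 + q;              // n = 1024 * 2^q  =>  k = 10 + q
            const double t1 = T(0, k);             // sequential time for the scaled problem
            const double tp = T(q, k);
            const double s  = t1 / tp;
            std::printf("p = %2d  n = %6d  T = %7.0f s  S = %6.3f  E = %5.1f%%\n",
                        1 << q, 1 << k, tp, s, 100.0 * s / (1 << q));
        }
        return 0;
    }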
The General Case and the Compute-to-Communication Ratio
Let $\alpha$ denote the time needed to perform a single addition and $\beta$ the time needed to communicate a stack of numbers; their ratio $\gamma = \alpha / \beta$ is the compute-to-communication ratio.
$T_{\alpha,\beta}(2^q, 2^k) = \beta q + \alpha (2^{k-q} - 1) + \beta q + \alpha q = 2\beta q + \alpha (2^{k-q} - 1 + q)$
$S(2^q, 2^k) = \dfrac{\gamma (2^k - 1)}{2q + \gamma (2^{k-q} - 1 + q)}$
Figure: the function $F(\gamma, q) := S(2^q, 2^{10})$ plotted for varying values of $\gamma$ and $q$.
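As a consistency check (my own addition): dividing numerator and denominator of $T_{\alpha,\beta}(1, 2^k) / T_{\alpha,\beta}(2^q, 2^k)$ by $\beta$ yields the speedup formula above, and substituting the original values $\alpha = 1$, $\beta = 3$ (i.e. $\gamma = 1/3$), $k = 10$, $q = 1$ recovers the earlier result:

    \begin{align*}
    S(2^q, 2^k) &= \frac{\alpha (2^k - 1)}{2\beta q + \alpha (2^{k-q} - 1 + q)}
                 = \frac{\gamma (2^k - 1)}{2q + \gamma (2^{k-q} - 1 + q)}, \qquad \gamma = \frac{\alpha}{\beta} \\
    S(2, 2^{10}) &= \frac{\tfrac{1}{3} \cdot 1023}{2 + \tfrac{1}{3}(511 + 1)}
                  = \frac{341}{2 + \tfrac{512}{3}} = \frac{1023}{518} \approx 1.975
    \end{align*}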
Distributed Memory Systems
- Each node has its own private memory. Processors communicate explicitly by sending messages across a network.
- Most popular programming model: MPI (e.g. MPI_Send, MPI_Recv, MPI_Bcast, MPI_Reduce); a minimal example follows below.
- Alternative: PGAS languages (e.g. UPC++)
- Example: compute clusters, i.e. collections of commodity systems (nodes) connected by an interconnection network (e.g. InfiniBand)
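A minimal MPI sketch (my own illustration, not from the slides) of the array-sum example: rank 0 scatters equal-sized chunks of A, every rank computes a local sum, and MPI_Reduce combines the partial sums on rank 0. For brevity it assumes n is divisible by the number of ranks and fills A with dummy data:

    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1024;                 // assumes n is divisible by the number of ranks
        const int chunk = n / size;

        std::vector<double> A;
        if (rank == 0) {                    // the whole input initially resides on rank 0
            A.assign(n, 1.0);               // dummy data
        }

        // distribute equal-sized chunks of A to all ranks
        std::vector<double> local(chunk);
        MPI_Scatter(A.data(), chunk, MPI_DOUBLE,
                    local.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // each rank adds up its local chunk
        double local_sum = 0.0;
        for (double x : local) local_sum += x;

        // combine the partial sums on rank 0
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) std::printf("sum = %f\n", global_sum);
        MPI_Finalize();
        return 0;
    }

Typically compiled with an MPI compiler wrapper (e.g. mpicxx) and launched with mpirun.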
Shared Memory Systems
- All cores can access a common memory space through a shared bus or crossbar switch, e.g. multi-core CPU-based workstations in which all cores share the main memory.
- In addition to the shared main memory, each core also contains a smaller local memory (e.g. an L1 cache) in order to reduce expensive accesses to main memory (von Neumann bottleneck).
- Modern multi-core CPU systems support cache coherence; ccNUMA: cache-coherent non-uniform memory access architectures.
- Popular languages/APIs: C++11 multithreading, OpenMP, CUDA (an OpenMP example follows below).
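A minimal OpenMP sketch (my own illustration) of the same array sum on a shared-memory system; the reduction clause gives each thread a private partial sum that is combined at the end of the parallel region:

    #include <omp.h>
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 1024;
        std::vector<double> A(n, 1.0);      // dummy data

        double sum = 0.0;
        // each thread sums a part of A into a private copy of 'sum';
        // the private copies are combined when the parallel loop ends
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; ++i)
            sum += A[i];

        std::printf("sum = %f (using up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }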
Shared Memory Systems
Figure: a shared array A[] (2.1, 3.4, 0.1, 2.2, ?, -1.4, 2.0, -1.9, 0.4); Thread 0 executes a[4] = 3.1 while Thread 1 executes a[4] = 2.9, so the final value of A[4] is nondeterministic.
- Parallelism is created by starting threads that run concurrently on the system.
- Exchange of data is implemented by threads reading from and writing to shared memory locations.
- Race condition: occurs when two threads access a shared variable simultaneously (and at least one access is a write).
- Corresponding programming techniques: mutexes, condition variables, atomics (see the sketch below).
- Thread creation is more lightweight and faster than process creation: CreateProcess(): 12.76 ms, CreateThread(): 0.038 ms.
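A minimal C++11 sketch (my own illustration) showing how a mutex serializes the two conflicting writes from the figure; without the lock the two stores to A[4] would constitute a data race:

    #include <thread>
    #include <mutex>
    #include <vector>
    #include <cstdio>

    int main() {
        std::vector<double> A = {2.1, 3.4, 0.1, 2.2, 0.0, -1.4, 2.0, -1.9, 0.4};
        std::mutex m;

        // each writer locks the mutex before touching the shared element,
        // so the two stores can no longer overlap (the final value is still
        // whichever write happens to run last)
        auto writer = [&](double value) {
            std::lock_guard<std::mutex> lock(m);
            A[4] = value;
        };

        std::thread t0(writer, 3.1);   // Thread 0
        std::thread t1(writer, 2.9);   // Thread 1
        t0.join();
        t1.join();

        std::printf("A[4] = %.1f\n", A[4]);
        return 0;
    }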
Parallel Program Design
- Partitioning: the given problem needs to be decomposed into pieces, e.g. data parallelism, task parallelism, model parallelism.
- Communication: the chosen partitioning scheme determines the amount and types of required communication.
- Synchronization: in order to cooperate in an appropriate way, threads or processes may need to be synchronized.
- Load balancing: work needs to be equally divided among threads or processes in order to balance the load and minimize idle times.
- Foster's design methodology (see Chapter 2)
Discovering Parallelism (Prefix Sum)
Sequential code: for (int i = 1; i < n; i++) A[i] = A[i] + A[i-1];
Each iteration depends on the previous one, yet the computation can be parallelized in three steps (see the sketch below):
- Step 1: local summation within each processor
- Step 2: prefix-sum computation using only the rightmost value of each local array
- Step 3: addition of the value computed in Step 2 from the left neighbor to each local array element
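A shared-memory sketch of the three steps (my own illustration using OpenMP; the block partitioning and variable names are arbitrary choices):

    #include <omp.h>
    #include <vector>
    #include <cstdio>

    // Block-wise parallel prefix sum following the three steps described above.
    void parallel_prefix_sum(std::vector<double>& A) {
        const int n = static_cast<int>(A.size());
        const int p = omp_get_max_threads();
        std::vector<double> offset(p + 1, 0.0);  // offset[t] = sum of all blocks left of block t

        #pragma omp parallel num_threads(p)
        {
            const int t  = omp_get_thread_num();
            const int lo = static_cast<int>(static_cast<long long>(n) * t / p);
            const int hi = static_cast<int>(static_cast<long long>(n) * (t + 1) / p);

            // Step 1: local prefix sum within each block
            for (int i = lo + 1; i < hi; ++i) A[i] += A[i - 1];
            offset[t + 1] = (hi > lo) ? A[hi - 1] : 0.0;   // rightmost value of the block
            #pragma omp barrier

            // Step 2: prefix sum over the rightmost block values (one thread)
            #pragma omp single
            for (int b = 1; b <= p; ++b) offset[b] += offset[b - 1];

            // Step 3: add the offset contributed by all blocks to the left
            for (int i = lo; i < hi; ++i) A[i] += offset[t];
        }
    }

    int main() {
        std::vector<double> A(16, 1.0);
        parallel_prefix_sum(A);
        for (double x : A) std::printf("%.0f ", x);   // expected: 1 2 3 ... 16
        std::printf("\n");
        return 0;
    }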
Partitioning Strategies
Task parallelism:
- Each binary classifier is assigned to a different process (P0, P1, P2).
- Every process classifies each image using its assigned classifier.
- The binary classification results for each image are sent to P0 and merged.
- Drawbacks: limited parallelism and possible load imbalance.
Data parallelism:
- The input images could be divided into a number of batches.
- Once a process has completed the classification of its assigned batch, a scheduler dynamically assigns it a new batch (see the sketch below).
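A minimal shared-memory sketch of such dynamic batch assignment (my own illustration; the batch count, worker count, and the classify_batch() stub are placeholders): an atomic counter plays the role of the scheduler, and each worker grabs the next unprocessed batch as soon as it finishes its current one:

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        const int num_batches = 32;
        const int num_workers = 4;
        std::atomic<int> next_batch{0};    // the "scheduler": index of the next unassigned batch

        // placeholder for the actual classification of one batch of images
        auto classify_batch = [](int batch) {
            std::printf("batch %d classified\n", batch);
        };

        auto worker = [&]() {
            while (true) {
                const int b = next_batch.fetch_add(1);   // atomically claim the next batch
                if (b >= num_batches) break;             // no work left
                classify_batch(b);
            }
        };

        std::vector<std::thread> pool;
        for (int w = 0; w < num_workers; ++w) pool.emplace_back(worker);
        for (auto& t : pool) t.join();
        return 0;
    }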
Model Parallelism for Deep Learning
Figure: a deep learning model partitioned across four accelerators (GPU 0, GPU 1, GPU 2, GPU 3).
TOP500 Trends
- Supercomputers are ranked according to their maximally achieved LINPACK performance (top500.org).
- LINPACK measures the performance for solving a dense system of linear equations (A·x = b) in terms of Flop/s.
- Top-ranked system in 2017: Sunway TaihuLight contains over 10 million cores (hybrid system) and achieves 93 PFlop/s.
- Green500: ranks systems by Flop/s per Watt of achieved LINPACK performance.
Heterogeneous HPC Systems and Associated Programming Languages
Figure: a cluster of nodes (Node 0, Node 1, Node 2, ..., Node k-1) connected by a cluster interconnection network; communication between nodes is programmed with MPI or UPC. Each node contains several CPU sockets with multiple cores each, plus accelerator boards; within a node, OpenMP, Pthreads, or CUDA are used.
Class Exercise 1
Another well-known HPC ranking is the Graph500 project. Study the website www.graph500.org and answer the following questions:
- What is the utilized benchmark and how is performance measured?
- What is the measured performance and the specification of the top-ranked system?
- What are some of the advantages and disadvantages compared to the TOP500?
Review Questions
- How does our parallel algorithm for adding numbers work?
- Can you define the terms speedup, cost, and efficiency?
- State the difference between weak and strong scalability.
- What is the difference between a shared-memory and a distributed-memory parallel computer?
- Name and describe a few parallel programming languages.
- Describe typical considerations when designing a parallel program.