Concurrent and Distributed Programming. Lecture 2: Parallel Architectures and Performance of Parallel Programs. References: based on Mark Silberstein, 236370, Winter 2010; Chapters 2 and 5 of "Introduction to Parallel Computing" by Grama et al.; Lecture 2 of Stanford CS149 by Aiken & Olukotun.
What you will learn today: different levels of parallelism (instruction-level parallelism, SIMD, multicores), the cache coherence problem, and parallel performance criteria.
Simple CPU. (Diagram: CPU with an instruction stream, a cache, and DRAM memory.) A CPU is a digital circuit; time is discretized into cycles, 1 cycle = 1/F(requency). Assume 1 instruction is executed per cycle. Does this mean the execution time is Te = #executed instructions / F?
No. An instruction must first be understood (decoded) by the CPU, which takes ~5 cycles, and memory is slow, resulting in stalls of 100-500 cycles.
Pipelining: instruction handling is split into stages, so one instruction is being executed while another is being decoded. Stages: IF (instruction fetch), ID (instruction decode), OF (operand fetch), E (execute), WB (write back).
Outcome: throughput = 4 instructions per 4 cycles (if the pipeline is full!). (Pipeline diagram source: Wikipedia.)
Can we improve with more parallelism? Yes: instruction-level parallelism (ILP). Consider the following code – can it be parallelized?
Independent instructions: instruction 1 is independent of 2, and 3 is independent of 4 (see the sketch below). The processor can detect that! We need: hardware to detect the independence, hardware to issue multiple instructions, and hardware to execute them (possibly out of order). The result is a superscalar processor.
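A hypothetical four-instruction sketch (the actual code from the slide is not reproduced here; the variable names are illustrative) showing the dependence pattern a superscalar core can exploit – a two-way superscalar processor could issue (1) with (2), and then (3) with (4):

    /* Hypothetical values; names are illustrative. */
    double ilp_example(double a, double b, double c, double d) {
        double e = a + b;   /* (1) */
        double f = c + d;   /* (2) independent of (1): can issue in the same cycle */
        double g = e * f;   /* (3) depends on (1) and (2) */
        double h = e - f;   /* (4) independent of (3): can issue together with (3) */
        return g + h;
    }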
How does it work? A two-way superscalar pipeline issues two instructions per cycle. Food for thought – how to cope with loops (branches)? Hint: guess (the processor speculates on the branch outcome).
Bad news: ILP is limited
Vectorization and VLIW: why not let the compiler do the job? Vectorization: processing multiple data items with the same instruction (SIMD), e.g. for(int i=0;i<4;i++){ c[i]=a[i]+b[i]; } becomes C_vec = A_vec + B_vec. Very Long Instruction Word (VLIW, a MIMD-style bundle): multiple instructions are grouped into one long instruction word.
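A minimal vectorization sketch, assuming an x86 CPU with SSE; the intrinsics come from <xmmintrin.h> and are not part of the original slides. Modern compilers will often auto-vectorize the scalar loop above on their own.

    #include <xmmintrin.h>

    /* c[0..3] = a[0..3] + b[0..3] with a single SIMD add. */
    void add4(const float *a, const float *b, float *c) {
        __m128 va = _mm_loadu_ps(a);           /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(b);           /* load 4 floats from b */
        _mm_storeu_ps(c, _mm_add_ps(va, vb));  /* one instruction adds all 4 lanes */
    }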
Memory performance characteristics. Example (analogy): which drop will reach the other side first?
Memory performance characteristics. (Diagram: CPU connected to memory over a data bus.) Latency – the time between a data request and its receipt. Bandwidth – the rate at which data is moved.
Why latency is a problem: a 1 GHz processor, memory latency = 100 ns, memory bandwidth = infinity. Assume a floating-point ADD takes 1 cycle (1 ns); raw ALU performance: 4 GFLOP/s. What is the observed performance for adding two vectors?
Why latency is a problem: a 1 GHz processor, memory latency = 100 ns, memory bandwidth = infinity. Assume a floating-point ADD takes 1 cycle (1 ns); raw ALU performance: 4 GFLOP/s. Observed performance for adding two vectors? We need 3 memory accesses per addition = 300 ns per addition => about 300 times slower than the 1 ns ALU time.
Why bandwidth is a problem - 2: a 1 GHz processor, memory latency = 0 ns, memory bandwidth = 1 GB/s. Assume a floating-point ADD takes 1 cycle (1 ns); raw ALU performance: 4 GFLOP/s. What is the observed performance for adding two vectors?
Why bandwidth is a problem - 2: a 1 GHz processor, memory latency = 0 ns, memory bandwidth = 1 GB/s. Assume a floating-point ADD takes 1 cycle (1 ns); raw ALU performance: 4 GFLOP/s. Observed performance for adding two vectors? We need 12 bytes per addition, i.e. 12 GB/s at one addition per cycle, but we can deliver only 1 GB/s => 12 times slower (at the raw 4 GFLOP/s rate we would need 48 GB/s => 48 times slower).
Coping with memory limitations: cache ($). Observation 1: most applications reuse data – spatial locality (most accesses are close in space) and temporal locality (most accesses to the same data are close in time). Observation 2: small is fast – a cache can be placed ON chip, and it has low latency and high bandwidth. (Diagram: CPU – $ – DRAM.)
Cache performance. If the data is in the cache (cache hit): latency = 1 cycle, bandwidth = 100s of GB/s. Otherwise we pay the miss penalty. Hit rate = #hits / #total accesses. Miss rate? Example: hit rate = 90%, 20% of instructions access memory, miss penalty = 100 cycles. Performance? What if the hit rate is 99%? Cache hierarchies follow the same principles. (A worked estimate follows.)
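A worked estimate for the example above, assuming a base cost of 1 cycle per instruction and that a hit adds no extra cycles: average cycles per instruction = 1 + (fraction of memory instructions) × (miss rate) × (miss penalty) = 1 + 0.2 × 0.1 × 100 = 3, i.e. roughly 3 times slower than the ideal. With a 99% hit rate: 1 + 0.2 × 0.01 × 100 = 1.2, only 20% slower.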
Question: Does cache help for adding two vectors?
NO! There is no data reuse: the cache is populated upon first access (a compulsory miss) – slow. If the data is not reused, there is no benefit from the cache. (Almost true: programmed prefetching helps, and memory bandwidth is better utilized – see the sketch below.)
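A minimal sketch of programmed prefetching, assuming GCC or Clang where the __builtin_prefetch hint is available; the look-ahead distance of 16 elements is an illustrative guess that would need tuning per machine.

    /* Stream through two arrays while hinting the hardware to fetch ahead.
       The prefetch is only a hint and does not fault on addresses past the
       end of the arrays. */
    void add_vectors(const float *a, const float *b, float *c, long n) {
        for (long i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 16]);
            __builtin_prefetch(&b[i + 16]);
            c[i] = a[i] + b[i];
        }
    }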
Question: Does cache help for matrix product?
Yes! O(n³) computations on O(n²) memory: we read each datum O(n) times. Read more in Chapter 2 of the Grama textbook. (A small sketch follows.)
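A plain triple-loop matrix product as an illustrative sketch (not taken from the slides): each element of A and B is read n times over the whole computation, so keeping recently used data in the cache pays off, and cache-blocked variants exploit the reuse even further.

    /* C = A * B for n x n matrices stored row-major.
       Every A[i][k] and B[k][j] is touched n times across the computation. */
    void matmul(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }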
Checkpoint: pipelines improve CPU performance by hiding instruction handling delay; superscalar out-of-order processors can "automatically" improve performance by exploiting instruction-level parallelism; caches reduce latency by assuming spatial and temporal locality. Caches are VERY important – lots of research has been done on cache-efficient algorithms.
Adding more processors: Shared Memory Processors (SMPs) – a single physical memory space. (Diagram: two CPUs, each with its own cache ($), sharing one memory.)
SMPs vs. multicores: SMP – different chips; multicores (CMP) – multiple processing units on the same chip.
Cache coherence problem. Thread 1: A=0; print(A); Thread 2: A=1; (Diagram: two CPUs with private caches sharing one memory.) Which value will be printed?
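A minimal compilable sketch of the slide's scenario, assuming POSIX threads; names are illustrative. The race is intentional: the printed value (0 or 1) depends on timing and on the coherence protocol keeping both caches' copies of A consistent.

    #include <pthread.h>
    #include <stdio.h>

    volatile int A = 0;                 /* shared, may be cached by both CPUs */

    void *writer(void *arg) { A = 1; return NULL; }
    void *reader(void *arg) { printf("%d\n", A); return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer, NULL);
        pthread_create(&t2, NULL, reader, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }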
Cache coherence protocol: must ensure serializability – there must be an equivalent global order of reads and writes that respects (is consistent with) the local order of each of the threads. Example histories: T1: W a,#1; R b,#2; W b,#3. T2: W b,#2; R b,#3; R a,#1.
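For instance, one global order that serializes the two histories above while respecting each thread's local order is: T2: W b,#2 → T1: W a,#1 → T1: R b,#2 → T1: W b,#3 → T2: R b,#3 → T2: R a,#1. Every read returns the value of the most recent preceding write to that location, so the execution is serializable.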
Cache coherence: snoopy cache coherence – all caches listen for other caches' reads and writes. A write invalidates the copies in other caches; a read gets the latest version from the other cache.
Cache coherence-2
False sharing: cache coherence is managed at cache-line granularity (e.g. a line of 4 words), so different variables on the same cache line will be invalidated together. This may lead to unexpected coherence traffic (see the sketch below).
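A minimal sketch of avoiding false sharing, assuming C11 and a typical 64-byte cache line (both are assumptions, not from the slides): two per-thread counters that share a line cause coherence traffic on every update, while padding each counter onto its own line removes it.

    #include <stdalign.h>

    /* Both counters likely land on the same cache line: every increment by one
       thread invalidates the line in the other thread's cache. */
    struct counters_bad {
        long a;   /* updated by thread 1 */
        long b;   /* updated by thread 2 */
    };

    /* Each counter gets its own 64-byte-aligned slot, so updates do not
       invalidate each other's cache lines. */
    struct counters_good {
        alignas(64) long a;
        alignas(64) long b;
    };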
Checkpoint: snoopy cache coherence requires broadcasts (there are other approaches); the broadcasts delay memory access and the traffic grows quickly with #processors – not scalable. False sharing can generate unexpected load.
Performance of parallel programs (more details in Chapter 5 of the Grama book). A parallel system is a combination of a parallel algorithm and an underlying platform. Intuitive measures: wall-clock time; speedup = (serial time)/(parallel time); GFLOP/s = how well the hardware is exploited. We need more: scalability – speedup as a function of #CPUs; efficiency – speedup/#CPUs. (A small worked example follows.)
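A worked example with hypothetical numbers: if the serial program takes 100 s and the parallel program takes 20 s on 8 CPUs, then speedup = 100/20 = 5 and efficiency = 5/8 ≈ 0.63, i.e. each CPU contributes only about 63% of its potential to the speedup.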
Question: If I use 2 processors, will the program run twice as fast?
Sources of inefficiency – total parallel overhead: (Σp Tp) − Tserial, summed over all processors p. Causes: parallelization overhead (the parallel program may need to do more work), communication overhead, and idling due to load imbalance or a serial part (insufficient parallelism).
Can we estimate an upper bound on the speedup? Amdahl's law: the sequential component limits the speedup. Split the serial time Tserial = 1 into an ideally parallelizable fraction A (ideal load balancing, identical CPU speeds, no overheads) and a fraction 1−A that cannot be parallelized. Parallel time: Tparallel = A/#CPUs + (1−A). Scalability = Speedup(#CPUs) = 1/(A/#CPUs + (1−A)).
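A worked example: with A = 0.95, 8 CPUs give a speedup of 1/(0.95/8 + 0.05) ≈ 5.9, while 1000 CPUs give only 1/(0.95/1000 + 0.05) ≈ 19.6; the speedup can never exceed 1/(1−A) = 20 no matter how many CPUs are added.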
Bad news. (Figure source: Wikipedia.) So why do we need machines with 1000s of CPUs?
Living with Amdahl's law: Gustafson's law (scaled speedup). Split the parallel time: Tparallel = Tparallel·(A + (1−A)), where A is now the fraction of the parallel run spent in the parallelizable part. Then Tbestserial ≤ #CPUs·Tparallel·A + Tparallel·(1−A) [by simulation, a bound on the best serial program], so Speedup ≤ #CPUs·A + (1−A) = #CPUs − (1−A)·(#CPUs−1).
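A worked example: with A = 0.95 measured on the parallel run and 1000 CPUs, the scaled speedup bound is 1000 − 0.05 × 999 ≈ 950, in contrast to the limit of 20 given by Amdahl's law for a fixed problem size.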
Amdahl vs. Gustafson: both are right. Amdahl's law – strong scaling: solve the same problem faster; the problem size is fixed while #CPUs grows, so if A is fixed the granularity (work per CPU) diminishes. When the granularity diminishes, the simulation departs from the best serial program, and Gustafson's upper bound departs from the actual speedup value. Gustafson's law – weak scaling: the problem size grows, so we solve larger problems; keep the amount of work per CPU constant when adding more CPUs, so that the granularity stays fixed.
Question: Can we achieve speedups higher than #CPUs?
Fake superlinear speedup: the serial algorithm does more work. Example: the parallel algorithm effectively performs BFS, while the serial baseline uses DFS, which is inefficient for this input – but BFS can also be simulated by a serial algorithm, so the fair baseline is faster.
Always use the best serial algorithm as a baseline. Another example: we want to sort an array. An efficient parallel bubble sort takes 40 s; the serial version takes 150 s. Speedup = 150/40? NO: serial quicksort runs in 30 s, so the speedup is 30/40 = 0.75!
True superlinear speedup: there is more to a parallel machine than just CPUs. Example: caches. Parallelization results in a smaller problem size per CPU; if it fits the cache, we get a non-linear performance boost that cannot be efficiently simulated on a serial machine. (See more examples in the Grama book.)
Summary: parallel performance is a subtle matter. You need to use a good serial program as a baseline, and you need to understand the hardware. Amdahl's law – understand the assumptions! Too pessimistic: it assumes the parallel fraction is independent of #CPUs and that the problem size is fixed. Too optimistic: it assumes perfect load balancing, no idling, and equal CPU speeds. See more material in refs/ on the website and in the book.