1 Concurrent and Distributed Programming  Lecture 2: Parallel architectures; Performance of parallel programs  References: based on Mark Silberstein, 236370, Winter 2010; Chapters 2 and 5 in “Introduction to Parallel Computing” by Grama et al.; Lecture 2 of CS149 (Stanford) by Aiken & Olukotun

2 What you will learn today  Different levels of parallelism  Instruction level parallelism  SIMD  Multicores  Cache coherence

3 Simple CPU  (Diagram: CPU – cache – DRAM memory, with an instruction stream)  A CPU is a digital circuit – time is discretized into cycles; 1 cycle = 1/F, where F is the clock frequency.  Assume 1 instruction is executed per cycle.  Does it follow that T_exec = #executed instructions / F ?
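A worked instance of that naive formula (the numbers are illustrative, not from the slides): with 10^9 executed instructions and F = 1 GHz, T_exec = 10^9 / (10^9 Hz) = 1 second. The next slide explains why a real CPU falls short of this bound.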

4 NO ● An instruction must first be decoded and understood by the CPU ● This takes ~5 cycles ● Memory is slow – results in stalls of ~100 cycles

5 Pipelining  Instruction handling is split into stages  One instruction is being executed while another is being decoded  Stages: IF (Instruction Fetch), ID (Instruction Decode), OF (Operand Fetch), E (Execute), WB (Write Back)

6 (Pipeline diagram; source: Wikipedia) ● Outcome: throughput = 4 instructions per 4 cycles ● (if the pipeline is full!)

7 Can we improve with more parallelism?  Yes: instruction-level parallelism (ILP)  Consider the code shown on the slide (not reproduced in this transcript)  Can it be parallelized?

8 Independent instructions  Instruction 1 is independent of instruction 2  Instruction 3 is independent of instruction 4  The processor can detect that!  We need:  hardware to detect the independence  hardware to issue multiple instructions  hardware to execute them (possibly out of order)  The result is a superscalar processor (a hypothetical example is sketched below)
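The slide's code is not in the transcript; a hypothetical C fragment with the same dependence pattern (1 independent of 2, 3 independent of 4) might look like this:

    /* Hypothetical example; the code on the original slide is not reproduced here. */
    double ilp_demo(double x, double y, double p, double q) {
        double a = x * y;    /* (1)                                            */
        double b = p + q;    /* (2) independent of (1): can issue together     */
        double c = a - b;    /* (3) depends on (1) and (2)                     */
        double d = p * 3.0;  /* (4) independent of (3): can issue alongside it */
        return c + d;
    }

A two-way superscalar core can issue (1) and (2) in the same cycle, and later (3) and (4) together.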

9 How does it work?  (Diagram: a two-way superscalar pipeline)  Food for thought – how does the processor cope with loops?  Hint – it guesses (branch prediction).

10 Bad news: ILP is limited

11 Vectorization and VLIW  Why not let the compiler do the job?  Vectorization: processing multiple data elements with the same instruction (SIMD): for(int i = 0; i < 4; i++) { c[i] = a[i] + b[i]; } becomes C_vec = A_vec + B_vec  Very Long Instruction Word (VLIW, MIMD): multiple instructions are grouped into one long instruction word
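A concrete SIMD sketch of the loop above using x86 SSE intrinsics (the target ISA and the function name are my assumptions – the slides do not name one):

    #include <immintrin.h>

    /* Adds four floats at once with a single SSE instruction (_mm_add_ps). */
    void vec_add4(const float *a, const float *b, float *c) {
        __m128 va = _mm_loadu_ps(a);      /* load a[0..3]       */
        __m128 vb = _mm_loadu_ps(b);      /* load b[0..3]       */
        __m128 vc = _mm_add_ps(va, vb);   /* c[i] = a[i] + b[i] */
        _mm_storeu_ps(c, vc);             /* store c[0..3]      */
    }

In practice compilers often auto-vectorize such loops; intrinsics only make the SIMD form explicit.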

12 Memory: performance characteristics  (Illustration: which drop will reach the other side first?)

13 Memory performance characteristics  Latency – the time between a data request and the arrival of the data  Bandwidth – the rate at which data is moved  (Diagram: CPU – data bus – memory)

14 Why latency is a problem  1 GHz processor  Memory latency = 100 ns  Memory bandwidth = infinity  Assume 1 floating-point ADD = 1 cycle (1 ns)  Raw ALU performance: 1 GFLOPs/s  Observed performance for adding two vectors?

15 Why latency is a problem  1 GHz processor  Memory latency = 100 ns  Memory bandwidth = infinity  Assume 1 floating-point ADD = 1 cycle (1 ns)  Raw ALU performance: 1 GFLOPs/s  Observed performance for adding two vectors?  We need 3 memory accesses per addition (load a[i], load b[i], store c[i]) = 300 ns per addition => 300 times slower

16 Why bandwidth is a problem - 2  1 GHz processor  Memory latency = 0 ns  Memory bandwidth = 1 GB/s  Assume 1 floating-point ADD = 1 cycle (1 ns)  Raw ALU performance: 1 GFLOPs/s  Observed performance for adding two vectors?

17 Why bandwidth is a problem - 2  1 GHz processor  Memory latency = 0 ns  Memory bandwidth = 1 GB/s  Assume 1 floating-point ADD = 1 cycle (1 ns)  Raw ALU performance: 1 GFLOPs/s  Observed performance for adding two vectors?  We need 12 bytes per cycle (two 4-byte loads and one 4-byte store per 1 ns): 12 GB/s. But the memory can deliver only 1 GB/s => 12 times slower

18 Coping with memory limitations Cache ($) CPU $ DRAM  Observation 1: most applications reuse data  Spatial locality: most accesses are close in space  Temporal locality: most accesses to the same data are close in time  Observation 2: small is fast  Cache can be placed ON chip  Cache has low latency  Cache has high bandwidth

19 Cache performance  If the data is in the cache (cache hit): latency = 1 cycle, bandwidth = 100s of GB/s  Otherwise – pay the miss penalty  Hit rate = #hits / #accesses; miss rate = 1 – hit rate  Example: hit rate = 90%, 20% of instructions access memory, miss penalty 100 cycles.  Performance? What if the hit rate is 99%? (a worked answer is sketched below)  Cache hierarchies follow the same principles
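A worked answer to the slide's question, assuming a base CPI of 1 and that cache hits are already covered by that cycle (these assumptions are mine – the slide does not state them):
Memory stall cycles per instruction = 0.20 * (1 – 0.90) * 100 = 2, so CPI = 1 + 2 = 3, i.e. roughly 3 times slower than the ideal 1-cycle-per-instruction CPU.
With a 99% hit rate: 0.20 * 0.01 * 100 = 0.2, so CPI = 1.2 – only about 20% slower.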

20 Question: Does cache help for adding two vectors?

21 NO!  There is no data reuse:  the cache is populated upon first access – a compulsory miss – which is slow  If the data is not reused – no benefit from the cache  Almost true – programmed prefetching helps, and memory bandwidth is better utilized

22 Question: Does cache help for matrix product?

23 Yes!  O(n^3) computations  O(n^2) memory  We read each datum O(n) times  Read more: Chapter 2 of the Grama textbook
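A minimal sketch of the computation in question – a naive n-by-n matrix product in C (loop order and naming are mine, not from the slides):

    /* Naive matrix product C = A * B, all n-by-n, row-major.             */
    /* Each element of A and B is read n times, so a cache holding part   */
    /* of the O(n^2) data can serve many of the O(n^3) accesses.          */
    void matmul(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }

Blocked (tiled) variants improve the reuse further; the Grama textbook discusses them.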

24 Checkpoint  Pipelines improve CPU performance by hiding instruction handling delay  Super-scalar out-of-order processors can “automatically” improve the performance by exploiting Instruction Level Parallelism  Caches reduce latency by assuming spatial and temporal locality  Caches are VERY important – lots of research done on cache-efficient algorithms

25 Adding more processors ● Shared-memory processors (SMPs) ● Single physical memory space ● (Diagram: two CPUs, each with a private cache ($), connected to one memory)

26 SMPs vs. multicores  SMP – separate chips  Multicore (CMP) – same chip, multiple processing units

27 Cache coherence problem  (Diagram: two CPUs with private caches ($) sharing one memory)  Initially A=0.  Thread 1: A=1;  Thread 2: print(A);  Which value will be printed?
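A runnable C sketch of the same situation using POSIX threads (the scaffolding is mine – the slide only shows the two statements):

    /* Two threads share A: one writes 1, the other prints whatever it sees. */
    /* Compile with: cc race.c -pthread                                      */
    #include <pthread.h>
    #include <stdio.h>

    int A = 0;                       /* shared, initially 0 */

    void *writer(void *arg) { A = 1; return NULL; }
    void *reader(void *arg) { printf("%d\n", A); return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer, NULL);
        pthread_create(&t2, NULL, reader, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Either 0 or 1 may be printed: cache coherence makes the caches eventually agree on A's value, but it does not order the two threads (and, strictly speaking, the unsynchronized access is a data race in C).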

28 Cache coherence protocol  Must ensure serializability:  there must be an equivalent global order of reads and writes that respects (is consistent with) the local order of each of the threads  Example:  T1: R a,#2; W a,#3  T2: W a,#2; R a,#3

29 Cache coherence  Snoopy cache coherence  All caches listen for the reads and writes of the other caches  A write invalidates the value held in other caches  A read gets the latest version from another cache

30 Cache coherence – 2 (diagram)

31 Cache coherence – continued (diagram; fixed by Amit Markel)

32 False sharing  Cache coherence is managed at cache-line granularity (e.g. 4 words)  Variables that share a cache line are invalidated together  May lead to unexpected coherence traffic (see the sketch below)
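A minimal C sketch of false sharing (the names, the 64-byte line size, and the padding fix are my assumptions, not from the slides):

    /* Two threads update logically independent counters.  Without the pad  */
    /* member the counters would share one cache line, so every increment   */
    /* invalidates the other core's copy; padding each counter to a full    */
    /* (assumed 64-byte) line avoids that.                                  */
    #include <pthread.h>
    #include <stdio.h>

    struct { long count; char pad[64 - sizeof(long)]; } counters[2];

    void *worker(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < 100000000L; i++)
            counters[id].count++;    /* each thread touches only its own slot */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("%ld %ld\n", counters[0].count, counters[1].count);
        return 0;
    }

Removing the pad member typically makes this run noticeably slower on a multicore machine, because both counters then live on the same cache line and the coherence protocol keeps invalidating it.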

33 Summary  Snoopy cache coherence requires broadcasts  There are other approaches  False sharing can generate unexpected load  Coherence traffic delays memory accesses and grows quickly with #processors – not scalable

34 Performance of parallel programs  More details in Chapter 5 of the Grama book  A parallel system is a combination of a parallel algorithm and an underlying platform  Intuitive measures:  Wall-clock time  Speedup = (serial time) / (parallel time)  GFLOPs/s = how well the hardware is exploited  Need more:  Scalability: speedup as a function of #CPUs  Efficiency: speedup / #CPUs

35 Can we estimate an upper bound on speedup? Amdahl's law  The sequential component limits the speedup  Split the program's serial time T_serial = 1 into:  ideally parallelizable: A (the parallel fraction), assuming ideal load balancing, identical CPU speeds, no overheads  cannot be parallelized: 1 – A  Parallel time T_parallel = A/#CPUs + (1 – A)  Speedup(#CPUs) = T_serial / T_parallel = 1 / ( A/#CPUs + (1 – A) )
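A worked instance (the numbers A = 0.95 and 100 CPUs are illustrative, not from the slides): Speedup(100) = 1 / (0.95/100 + 0.05) = 1 / 0.0595 ≈ 16.8, and even with infinitely many CPUs the speedup is bounded by 1 / (1 – A) = 20.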

36 Bad news  (Plot of Amdahl's-law speedup vs. #CPUs; source: Wikipedia)  So – why do we need machines with thousands of CPUs?

37 Living with Amdahl's law: Gustafson's law  T_parallel = T_parallel * (A + (1 – A))  T_bestserial <= #CPUs * T_parallel * A + T_parallel * (1 – A)   [by simulation: a bound on the best serial program]  Speedup = T_bestserial / T_parallel <= #CPUs * A + (1 – A) = #CPUs – (1 – A) * (#CPUs – 1)  It all depends on how good the simulation is  The simulation usually improves with problem size
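A worked instance with the same illustrative numbers as above (A = 0.95 of the parallel run, 100 CPUs): scaled speedup <= 100 * 0.95 + 0.05 = 95.05, i.e. 100 – 0.05 * 99 = 95.05 – far better than the fixed-size Amdahl bound of 20.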

38 Amdahl vs. Gustafson – both right  It is all about granularity  When the problem size is fixed, granularity diminishes as #CPUs grows  When granularity diminishes, the simulation departs from the best serial program, and Gustafson's upper bound departs from the actual speedup  Amdahl's law: strong scaling  Solve the same problem faster: the problem size is fixed while #CPUs grows  If A is fixed, granularity diminishes  Gustafson's law: weak scaling  The problem size grows: solve larger problems  Keep the amount of work per CPU fixed when adding CPUs, so the granularity stays fixed

39 Question  Can we achieve speedups higher than #CPUs?  “Superlinear”

40 Fake superlinear speedup: the serial algorithm does more work  Example: the parallel algorithm is BFS, while the serial baseline is DFS, which is inefficient for this input  BFS can also be simulated by a serial algorithm – the fair baseline

41 Always use the best serial algorithm as the baseline  Example: sorting an array  An efficient parallel bubble sort takes 40 s, the serial one 150 s. Speedup = 150/40?  NO. Serial quicksort runs in 30 s, so the speedup is 30/40 = 0.75!

42 True superlinear speedup  Example: cache effects  Parallelization results in a smaller problem size per CPU => if the per-CPU data fits in the cache => a non-linear performance boost!  This cannot be efficiently simulated on a serial machine  See more examples in the Grama book

43 Summary  Parallel performance is a subtle matter  Need to use a good serial program as the baseline  Need to understand the hardware  Amdahl's law: understand the assumptions!!!  Too pessimistic:  assumes the parallel fraction is independent of #CPUs  assumes a fixed problem size  Too optimistic:  assumes perfect load balancing, no idling, and equal CPU speeds  See more material: refs/ on the course website and in the book