
Parallel Processing Chapter 9

Problem:
–Branches, cache misses, and dependencies limit the available ILP (instruction-level parallelism)
Solution:
–Divide the program into parts
–Run each part on a separate CPU of a larger machine

Motivations

Desktops are incredibly cheap
–Instead of one custom high-performance uniprocessor, hook up 100 desktops
Squeezing out more ILP is difficult
–More complexity/power required each time
–Would require a change in cooling technology

Challenges

Parallelizing code is not easy
–Languages, software engineering, and software verification issues – beyond the scope of this class
Communication can be costly
–Performance analysis often ignores caches – these costs are much higher
Requires HW support
–Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder memory operations

Performance – Speedup

Amdahl’s Law!!!!!!
70% of the program is parallelizable.
What is the highest speedup possible?
–1 / (0.30 + 0.70/∞) = 1 / 0.30 = 3.33
What is the speedup with 100 processors?
–1 / (0.30 + 0.70/100) = 1 / 0.307 ≈ 3.26

Taxonomy

SISD – single instruction, single data
–uniprocessor
SIMD – single instruction, multiple data
–vector machines, MMX extensions, graphics cards
MISD – multiple instruction, single data
–never built – pipeline architectures?!? streaming apps?
MIMD – multiple instruction, multiple data
–most multiprocessors
–cheap, flexible

SIMD

A controller fetches instructions; all processors execute the same instruction, each on its own data. Conditional instructions are the only way for behavior to vary across processors.

Example

Sum the elements in A[] and place the result in sum:

int sum = 0;
int i;
for (i = 0; i < n; i++)
    sum = sum + A[i];

Parallel version – shared memory

int A[NUM];
int numProcs;
int sum;
int sumArray[numProcs];

myFunction( (input arguments) )
{
    int myNum = …….;    /* this processor's id */
    int mySum = 0;
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;
    barrier();
    if (myNum == 0) {
        for (i = 0; i < numProcs; i++)
            sum += sumArray[i];
    }
}

Why Synchronization?

Why can't you figure out when proc x will finish its work?
–Cache misses
–Different control flow
–Context switches

Supporting Parallel Programs
–Synchronization
–Cache Coherence
–False Sharing

Synchronization

Sum += A[i];
Two processors: one with i = 0, one with i = 50
Before the action:
–Sum = 5
–A[0] = 10
–A[50] = 33
What is the proper result?

Synchronization

Sum = Sum + A[i];
Assembly for this statement, assuming:
–A[i] is already in $t0
–&Sum is already in $s0

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

Synchronization – Ordering #1

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

P1 inst   Effect        P2 inst   Effect
Given     $t0 = 10      Given     $t0 = 33
lw        $t1 = 5
add       $t1 = 15
sw        Sum = 15
                        lw        $t1 = 15
                        add       $t1 = 48
                        sw        Sum = 48

Synchronization – Ordering #2

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

P1 inst   Effect        P2 inst   Effect
Given     $t0 = 10      Given     $t0 = 33
lw        $t1 = 5       lw        $t1 = 5
add       $t1 = 15      add       $t1 = 38
sw        Sum = 15      sw        Sum = 38

Final: Sum = 38 – P1's update is lost.

Synchronization Problem

Reading and then writing memory is not an atomic operation
–You cannot read and write a memory location in a single operation
We need hardware primitives that allow us to read and write without interruption

Solution

Software:
–"lock" – function that allows one processor to proceed while all others loop
–"unlock" – releases one looping processor (or resets the lock to allow the next arriving proc to proceed)
Hardware:
–Provide primitives that read & write atomically, in order to implement lock and unlock

Software: using lock and unlock

lock(&balancelock)
Sum += A[i]
unlock(&balancelock)

Hardware: implementing lock & unlock

swap $1, 100($2)
–Swaps the contents of $1 and M[$2+100]

Hardware: implementing lock & unlock with swap

Lock:   li   $t0, 1
Loop:   swap $t0, 0($a0)
        bne  $t0, $0, Loop

Unlock: sw   $0, 0($a0)

If the lock holds 0, it is free; if it holds 1, it is held.

Outline
–Synchronization
–Cache Coherence
–False Sharing

Cache Coherence

P1, P2 are write-back caches; location a starts in DRAM with value 7 (* = not cached).

Current a value in:    P1$   P2$   DRAM
(start)                 *     *     7
1. P2: Rd a             *     7     7
2. P2: Wr a, 5          *     5     7
3. P1: Rd a             5     5     5    (P2 writes back; P1 reads 5)
4. P2: Wr a, 3          5     3     5
5. P1: Rd a             5     3     5    (P1 hits on its stale copy)

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!
What will P1 receive from its load? 5
What should P1 receive from its load? 3

Whatever are we to do?

Write-Invalidate
–Invalidate that value in all other caches
–Set the valid bit to 0
Write-Update
–Update the value in all other caches

Write Invalidate

P1, P2 are write-back caches; a starts in DRAM with value 7.

Current a value in:    P1$   P2$   DRAM
(start)                 *     *     7
1. P2: Rd a             *     7     7
2. P2: Wr a, 5          *     5     7
3. P1: Rd a             5     5     5
4. P2: Wr a, 3          *     3     5    (P1's copy invalidated)
5. P1: Rd a             3     3     3    (P1 misses; P2 writes back 3)

Write Update

P1, P2 are write-back caches; a starts in DRAM with value 7.

Current a value in:    P1$   P2$   DRAM
(start)                 *     *     7
1. P2: Rd a             *     7     7
2. P2: Wr a, 5          *     5     7
3. P1: Rd a             5     5     5
4. P2: Wr a, 3          3     3     3    (all copies updated)
5. P1: Rd a             3     3     3    (P1 hits on the updated copy)

Outline
–Synchronization
–Cache Coherence
–False Sharing

Cache Coherence – False Sharing w/ Invalidate

P1, P2 cache line size: 4 words

Current contents in:   P1$      P2$
(start)                 *        *
1. P2: Rd A[0]          *        A[0-3]
2. P1: Rd A[1]          A[0-3]   A[0-3]
3. P2: Wr A[0], 5       *        A[0-3]   (P1's line invalidated)
4. P1: Wr A[1], 3       A[0-3]   *        (P2's line invalidated)

Look closely at the example: P1 and P2 never access the same element. But A[0] and A[1] are in the same cache block, so whenever the block is in a cache, both elements are there together – and each write invalidates the other processor's whole line.

False Sharing

Different processors access different items in the same cache block.
–Leads to coherence cache misses

Cache Performance

// Pn = my processor number (rank)
// NumProcs = total active processors
// N = total number of elements
// NElem = N / NumProcs

for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);    // interleaved: stride NumProcs

vs.

for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);                  // blocked: contiguous chunk

Why is the second better?

Both access the same number of elements, and no two processors access the same elements.
–Better spatial locality
–Less false sharing