Parallel Processing Chapter 9
Problem:
– Branches, cache misses, and dependencies limit the instruction-level parallelism (ILP) available
Solution:
– Divide the program into parts
– Run each part on a separate CPU of a larger machine
Motivations
– Desktops are incredibly cheap
  – Rather than one custom high-performance uniprocessor, hook up 100 desktops
– Squeezing out more ILP is difficult
  – More complexity/power required each time
  – Would require a change in cooling technology
Challenges
– Parallelizing code is not easy
  – Languages, software engineering, and software verification issues – beyond the scope of this class
– Communication can be costly
  – Our performance analysis ignores caches; the real costs are much higher
– Requires HW support
  – Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder operations
Speedup – Amdahl's Law
70% of the program is parallelizable.
What is the highest speedup possible?
– 1 / (0.30 + 0.70/∞) = 1 / 0.30 = 3.33
What is the speedup with 100 processors?
– 1 / (0.30 + 0.70/100) = 1 / 0.307 = 3.26
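As a sanity check on these numbers, a minimal C sketch of Amdahl's Law (the function name and constants are illustrative, not from the slides):

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p/n), where p is the
 * parallelizable fraction and n is the number of processors. */
double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    printf("100 processors: %.2f\n", amdahl_speedup(0.70, 100.0)); /* 3.26 */
    printf("n -> infinity:  %.2f\n", amdahl_speedup(0.70, 1e15));  /* 3.33 */
    return 0;
}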
Taxonomy
SISD – single instruction, single data
– uniprocessor
SIMD – single instruction, multiple data
– vector machines, MMX extensions, graphics cards
MISD – multiple instruction, single data
– never built – pipeline architectures? streaming apps?
MIMD – multiple instruction, multiple data
– most multiprocessors
– cheap, flexible

SIMD
[diagram: a single controller driving an array of processor/data (P/D) pairs]
The controller fetches instructions.
All processors execute the same instruction.
Conditional instructions are the only way to get variation.
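For a modern taste of the SIMD idea (the MMX/vector entry above), a short sketch using x86 SSE2 intrinsics – one instruction operating on four data elements at once; this example is ours, compile with -msse2:

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    /* A single paddd instruction adds four 32-bit integers at once. */
    __m128i a = _mm_set_epi32(4, 3, 2, 1);
    __m128i b = _mm_set_epi32(40, 30, 20, 10);
    __m128i c = _mm_add_epi32(a, b);

    int out[4];
    _mm_storeu_si128((__m128i *)out, c);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
    return 0;
}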
Example
Sum the elements in A[] and place the result in sum:

int sum = 0;
int i;
for (i = 0; i < n; i++)
    sum = sum + A[i];
Parallel version – Shared Memory
int A[NUM];
int numProcs;
int sum;
int sumArray[numProcs];

myFunction( (input arguments) )
{
    int myNum = …;   /* this processor's index, 0..numProcs-1 */
    int mySum = 0;
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;   /* private slot: no lock needed */
    barrier();                 /* wait until every partial sum is written */
    if (myNum == 0) {
        for (i = 0; i < numProcs; i++)
            sum += sumArray[i];
    }
}
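For reference, a self-contained pthreads rendering of the same pattern – pthread_barrier_wait plays the role of barrier(); the constants and thread-spawning scaffolding are ours, not the slides'. Compile with -pthread:

#include <pthread.h>
#include <stdio.h>

#define NUM        1000
#define NUM_PROCS  4

int A[NUM];
int sum = 0;
int sumArray[NUM_PROCS];
pthread_barrier_t bar;

void *myFunction(void *arg) {
    int myNum = *(int *)arg;            /* this thread's index */
    int mySum = 0;
    for (int i = (NUM/NUM_PROCS)*myNum; i < (NUM/NUM_PROCS)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;            /* private partial sum */
    pthread_barrier_wait(&bar);         /* wait for all partial sums */
    if (myNum == 0)                     /* one thread does the final add */
        for (int i = 0; i < NUM_PROCS; i++)
            sum += sumArray[i];
    return NULL;
}

int main(void) {
    pthread_t t[NUM_PROCS];
    int id[NUM_PROCS];
    for (int i = 0; i < NUM; i++) A[i] = 1;
    pthread_barrier_init(&bar, NULL, NUM_PROCS);
    for (int i = 0; i < NUM_PROCS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, myFunction, &id[i]);
    }
    for (int i = 0; i < NUM_PROCS; i++)
        pthread_join(t[i], NULL);
    printf("sum = %d\n", sum);          /* expect 1000 */
    return 0;
}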
Why Synchronization?
Why can't you figure out when processor x will finish its work?
– Cache misses
– Different control flow
– Context switches
Supporting Parallel Programs
– Synchronization
– Cache Coherence
– False Sharing
Synchronization
Sum += A[i];
Two processors execute this: one with i = 0, the other with i = 50.
Before the action:
– Sum = 5
– A[0] = 10
– A[50] = 33
What is the proper result? (5 + 10 + 33 = 48)
Synchronization
Sum = Sum + A[i];
Assembly for this statement, assuming A[i] is already in $t0 and &Sum is already in $s0:

lw   $t1, 0($s0)     # load Sum
add  $t1, $t1, $t0   # add A[i]
sw   $t1, 0($s0)     # store Sum
Synchronization – Ordering #1

P1 inst   Effect       P2 inst   Effect
Given     $t0 = 10     Given     $t0 = 33
lw        $t1 = 5
                       lw        $t1 = 5
add       $t1 = 15
                       add       $t1 = 38
sw        Sum = 15
                       sw        Sum = 38

Final Sum = 38: P1's update is lost.

Synchronization – Ordering #2

P1 inst   Effect       P2 inst   Effect
Given     $t0 = 10     Given     $t0 = 33
                       lw        $t1 = 5
lw        $t1 = 5
                       add       $t1 = 38
add       $t1 = 15
                       sw        Sum = 38
sw        Sum = 15

Final Sum = 15: P2's update is lost. Neither ordering produces the correct 48.
Synchronization Problem
A read-modify-write of memory is not atomic: you cannot read and then write a memory location as one uninterruptible operation.
We need hardware primitives that allow us to read and write without interruption.
Solution
Software:
– "lock" – function that allows one processor through; all others loop
– "unlock" – releases the next looping processor (or resets the lock so the next arriving processor may pass)
Hardware:
– Provide primitives that read & write atomically, from which lock and unlock can be built
Software: using lock and unlock

lock(&balancelock);
Sum += A[i];
unlock(&balancelock);
Hardware: implementing lock & unlock
swap $1, 100($2)
– Atomically swaps the contents of $1 and M[$2+100]

Implementing lock & unlock with swap:

lock:   li   $t0, 1
loop:   swap $t0, 0($a0)    # atomically exchange $t0 with the lock word
        bne  $t0, $0, loop  # old value 1 means lock was held: retry

unlock: sw   $0, 0($a0)     # store 0 to release the lock

If the lock word holds 0, the lock is free; if it holds 1, the lock is held.
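The same spinlock can be sketched in C with the GCC/Clang atomic builtins, where __atomic_exchange_n plays the role of the swap instruction (this rendering is ours, not the slides'):

/* Minimal spinlock sketch: 0 = free, 1 = held. */
typedef volatile int spinlock_t;

void lock(spinlock_t *l) {
    /* Atomically write 1 and fetch the old value;
     * an old value of 1 means the lock was already held. */
    while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) != 0)
        ;  /* spin */
}

void unlock(spinlock_t *l) {
    __atomic_store_n(l, 0, __ATOMIC_RELEASE);  /* release the lock */
}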
Outline
– Synchronization
– Cache Coherence
– False Sharing
Cache Coherence
P1 and P2 each have a write-back cache ($) in front of a shared DRAM.

Current value of a in:   P1$   P2$   DRAM
Initially                 *     *     7
1. P2: Rd a               *     7     7
2. P2: Wr a, 5            *     5     7
3. P1: Rd a               5     5     5    (P2 writes its dirty copy back)
4. P2: Wr a, 3            5     3     5
5. P1: Rd a               5     3     5    (P1 hits its own stale copy)

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!
What will P1 receive from its load? 5
What should P1 receive from its load? 3
Whatever are we to do?
Write-Invalidate
– Invalidate that value in all other caches (set their valid bits to 0)
Write-Update
– Update the value in all other caches
Write Invalidate (P1, P2 are write-back caches)

Current value of a in:   P1$   P2$   DRAM
Initially                 *     *     7
1. P2: Rd a               *     7     7
2. P2: Wr a, 5            *     5     7
3. P1: Rd a               5     5     5
4. P2: Wr a, 3            *     3     5    (P1's copy is invalidated)
5. P1: Rd a               3     3     3    (P1 misses and re-reads the new value)
Write Update (P1, P2 are write-back caches)

Current value of a in:   P1$   P2$   DRAM
Initially                 *     *     7
1. P2: Rd a               *     7     7
2. P2: Wr a, 5            *     5     7
3. P1: Rd a               5     5     5
4. P2: Wr a, 3            3     3     3    (every copy is updated in place)
5. P1: Rd a               3     3     3
Outline
– Synchronization
– Cache Coherence
– False Sharing
Cache Coherence – False Sharing w/ Invalidate
P1, P2 cache line size: 4 words.

Current contents of:   P1$      P2$
Initially               *        *
1. P2: Rd A[0]          *        A[0-3]
2. P1: Rd A[1]          A[0-3]   A[0-3]
3. P2: Wr A[0], 5       *        A[0-3]
4. P1: Wr A[1], 3       A[0-3]   *

Look closely at the example: P1 and P2 never access the same element. But A[0] and A[1] sit in the same cache block, so if the block is in one cache it is in the other as well – and each write invalidates the other processor's entire line.
False Sharing
Different processors access different items in the same cache block.
Leads to coherence misses.
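A common remedy is to give each processor's data its own cache line, so one processor's writes never invalidate another's copy. A sketch assuming 64-byte lines (the line size, names, and use of C11 alignas are our assumptions):

#include <stdalign.h>

#define CACHE_LINE  64   /* assumed cache line size in bytes */
#define NUM_PROCS   4

/* Each partial sum is padded out to a full cache line, so the
 * per-processor slots never share a line. */
struct padded_sum {
    alignas(CACHE_LINE) int value;
};

struct padded_sum sumArray[NUM_PROCS];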
Cache Performance

// Pn       = my processor number (rank)
// NumProcs = total active processors
// N        = total number of elements
// NElem    = N / NumProcs

// Interleaved: each processor touches every NumProcs-th element
for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);

vs.

// Blocked: each processor touches one contiguous chunk
for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);
Which is better?
Both loops access the same number of elements, and no two processors access the same elements as each other.
Why is the second (blocked) version better?
– Better spatial locality
– Less false sharing
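To see the difference on real hardware, one can time the two partitionings; below is a rough pthreads benchmark sketch (sizes, names, and the stand-in f() are ours, and actual timings vary by machine):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define N          (1 << 24)
#define NUM_PROCS  4

static int A[N];
static int use_interleaved;            /* selects the partitioning */

static int f(int i) { return 3 * i; }  /* stand-in for real work */

static void *worker(void *arg) {
    int pn = *(int *)arg;
    int nelem = N / NUM_PROCS;
    if (use_interleaved) {
        for (int i = 0; i < nelem; i++)
            A[NUM_PROCS*i + pn] = f(i);    /* every line shared by all */
    } else {
        for (int i = pn*nelem; i < (pn+1)*nelem; i++)
            A[i] = f(i);                   /* contiguous, private lines */
    }
    return NULL;
}

static double run(int interleaved) {
    pthread_t t[NUM_PROCS];
    int id[NUM_PROCS];
    struct timespec t0, t1;
    use_interleaved = interleaved;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_PROCS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NUM_PROCS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("interleaved: %.3f s\n", run(1));
    printf("blocked:     %.3f s\n", run(0));
    return 0;
}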