Presentation on theme: "Parallel Processing Chapter 9. Problem: –Branches, cache misses, dependencies limit the (Instruction Level Parallelism) ILP available Solution:"— Presentation transcript:

1 Parallel Processing Chapter 9

3 Problem:
 –Branches, cache misses, and dependencies limit the available instruction-level parallelism (ILP)
Solution:
 –Divide the program into parts
 –Run each part on a separate CPU of a larger machine

7 Motivations
Desktops are incredibly cheap
 –Custom high-performance uniprocessor
 –Hook up 100 desktops
Squeezing out more ILP is difficult
 –More complexity/power required each time
More software is multi-threaded
 –Chicken-and-egg problem!

8 Challenge Parallelizing code is not easy Communication can be costly

12 Speedup
Amdahl's Law!
70% of the program is parallelizable.
What is the highest speedup possible?
 –1 / (0.30 + 0.70/∞) = 1 / 0.30 ≈ 3.33
What is the speedup with 100 processors?
 –1 / (0.30 + 0.70/100) = 1 / 0.307 ≈ 3.26

13 Example
Sum the elements of A[] and place the result in sum:

int sum = 0;
int i;
for (i = 0; i < n; i++)
    sum = sum + A[i];

14 Parallel version: Shared Memory

15
int A[NUM];
int numProcs;
int sum;
int sumArray[numProcs];        /* one partial-sum slot per processor */

myFunction( /* input arguments */ )
{
    int myNum = …;             /* this processor's rank */
    int mySum = 0;
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;
    barrier();                 /* wait until every partial sum is written */
    if (myNum == 0) {
        for (i = 0; i < numProcs; i++)
            sum += sumArray[i];
    }
}

17 Why Synchronization?
Why can't you figure out when proc x will finish its work?
 –Cache misses
 –Different control flow
 –Context switches

22 Programming Model
Shared Memory
 –Communicate through shared variables
 –Synchronize with locks/barriers
Message Passing
 –Communicate through send/receive
 –Synchronize through send/receive or barriers

23 Parallel Problems Chapter 9.3-9.4

24 Outline Cache Coherence False Sharing Synchronization

35 Cache Coherence
P1 and P2 have write-back caches. Value of a in:

                 P1$   P2$   DRAM
Initially         *     *     7
1. P2: Rd a       *     7     7
2. P2: Wr a, 5    *     5     7
3. P1: Rd a       5     5     5
4. P2: Wr a, 3    5     3     5
5. P1: Rd a       5     3     5

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!
What will P1 receive from its load? 5
What should P1 receive from its load? 3

36 Whatever are we to do?
Write-Invalidate
 –Invalidate that value in all other caches (set their valid bit to 0)
Write-Update
 –Update the value in all other caches

39 Write Invalidate
P1 and P2 have write-back caches. Value of a in:

                 P1$   P2$   DRAM
Initially         *     *     7
1. P2: Rd a       *     7     7
2. P2: Wr a, 5    *     5     7
3. P1: Rd a       5     5     5
4. P2: Wr a, 3    *     3     5
5. P1: Rd a       3     3     3

42 Write Update
P1 and P2 have write-back caches. Value of a in:

                 P1$   P2$   DRAM
Initially         *     *     7
1. P2: Rd a       *     7     7
2. P2: Wr a, 5    *     5     7
3. P1: Rd a       5     5     5
4. P2: Wr a, 3    3     3     3
5. P1: Rd a       3     3     3

43 Outline Cache Coherence False Sharing Synchronization

45 Look closely at the example
P1 and P2 do not access the same element.
A[0] and A[1] are in the same cache block, so whenever one of them is in a cache, the other is in that cache too.

49 Cache Coherence: False Sharing with Write-Invalidate
P1 and P2 have a cache-line size of 4 words. Contents in:

                    P1$      P2$
Initially            *        *
1. P2: Rd A[0]       *      A[0-3]
2. P1: Rd A[1]     A[0-3]   A[0-3]
3. P2: Wr A[0], 5    *      A[0-3]
4. P1: Wr A[1], 3  A[0-3]     *

52 False Sharing
Different processors access different items in the same cache block.
This leads to coherence misses.

53 Cache Performance

// Pn = my processor number (rank)
// NumProcs = total active processors
// N = total number of elements
// NElem = N / NumProcs

for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);

vs.

for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);

56 Why is the second better?
Both versions access the same number of elements, and no two processors access the same elements.
 –Better spatial locality
 –Less false sharing

57 Outline Cache Coherence False Sharing Synchronization

58 Sum += A[i];
Two processors: one with i = 0, one with i = 50.
Before the action:
 –Sum = 5
 –A[0] = 10
 –A[50] = 33
What is the proper result?

59 Synchronization
Sum = Sum + A[i];
Assembly for this statement, assuming A[i] is already in $t0 and &Sum is already in $s0:

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

60 Synchronization Ordering #1
P1 runs all three instructions before P2 starts:

P1 inst   Effect          P2 inst   Effect
Given     $t0 = 10        Given     $t0 = 33
lw        $t1 = 5
add       $t1 = 15
sw        Sum = 15
                          lw        $t1 = 15
                          add       $t1 = 48
                          sw        Sum = 48

Final Sum = 48.

61 Synchronization Ordering #2
The same instructions interleave:

P1 inst   Effect          P2 inst   Effect
Given     $t0 = 10        Given     $t0 = 33
lw        $t1 = 5
                          lw        $t1 = 5
add       $t1 = 15
                          add       $t1 = 38
sw        Sum = 15
                          sw        Sum = 38

Final Sum = 38; P1's update is lost.

62 Does Cache Coherence solve it?
Did the load bring in an old value?
 –No – this was not a coherence problem.
Sum += A[i] is non-atomic.
 –Atomic: the operation occurs as one unit, and nothing may interrupt it.

63 Synchronization Problem
The read-modify-write of a memory location is non-atomic.
 –You cannot read and write a memory location in a single operation.
We need hardware primitives that allow us to read and write without interruption.

64 Solution
Software
 –lock: a function that allows one processor to leave the loop while all others keep looping
 –unlock: releases the next looping processor (or resets the lock to allow the next arriving processor to leave)
Hardware
 –Provide primitives that read and write atomically, in order to implement lock and unlock

65 Software: Using lock and unlock

lock(&balancelock);
Sum += A[i];
unlock(&balancelock);

66 Hardware: Implementing lock and unlock

swap $1, 100($2)
 –Atomically swaps the contents of $1 and M[$2 + 100]

68 Hardware: Implementing lock and unlock with swap

lock:   addi $t0, $0, 1     # $t0 = 1
loop:   swap $t0, 0($a0)    # atomically exchange $t0 with the lock word
        bne  $t0, $0, loop  # got 1 back: lock was held, try again

unlock: sw   $0, 0($a0)     # store 0 to release the lock

If the lock holds 0, it is free; if it holds 1, it is held.

69 Hardware: Efficient lock and unlock with swap

lock:     addi $t0, $0, 1        # $t0 = 1
loop:     swap $t0, 0($a0)       # atomically exchange $t0 with the lock word
          beq  $t0, $0, exit     # got 0 back: lock acquired
busywait: lw   $t0, 0($a0)       # spin with ordinary loads, not swaps
          bne  $t0, $0, busywait
          j    lock              # lock looked free: reload 1 and retry the swap
exit:

unlock:   sw   $0, 0($a0)

If the lock holds 0, it is free; if it holds 1, it is held.

70 Summary Cache coherence must be implemented for shared memory to work False sharing causes bad cache performance Hardware primitives are necessary for synchronizing shared data

