Parallel Processing Problems
– Cache Coherence
– False Sharing
– Synchronization

Cache Coherence
Figure: P1 and P2 each have a private cache ($) in front of a shared DRAM.
P1, P2 are write-back caches.

Current value of a in:
Step             P1$   P2$   DRAM
(initial)        *     *     7
1. P2: Rd a
2. P2: Wr a, 5
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a
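
The five steps on this slide can be replayed in a small model to see what goes wrong when nothing keeps the two caches consistent. This is only a sketch under stated assumptions: the `Line` and `System` types and the functions `read_a`, `write_a`, and `run_example` are hypothetical names, the caches hold a single one-word line for `a`, and index 0 stands for P1 while index 1 stands for P2.

```c
#include <assert.h>
#include <stdbool.h>

/* One cached copy of the single variable a. */
typedef struct { bool valid; int value; } Line;

/* Hypothetical model: two private write-back caches plus DRAM,
   with no coherence protocol at all. */
typedef struct { Line cache[2]; int dram; } System;

/* Read a on processor p: hit in the private cache, else fill from DRAM. */
int read_a(System *s, int p) {
    assert(p == 0 || p == 1);
    if (!s->cache[p].valid) {
        s->cache[p].valid = true;
        s->cache[p].value = s->dram;
    }
    return s->cache[p].value;
}

/* Write a on processor p: write-back, so only the private copy changes. */
void write_a(System *s, int p, int v) {
    assert(p == 0 || p == 1);
    s->cache[p].valid = true;
    s->cache[p].value = v;
}

/* Replay the slide's steps and return what P1 sees at step 5. */
int run_example(void) {
    System s = { { {false, 0}, {false, 0} }, 7 };
    read_a(&s, 1);        /* 1. P2: Rd a     P2$ = 7             */
    write_a(&s, 1, 5);    /* 2. P2: Wr a, 5  P2$ = 5, DRAM = 7   */
    read_a(&s, 0);        /* 3. P1: Rd a     P1$ = 7 (stale)     */
    write_a(&s, 1, 3);    /* 4. P2: Wr a, 3  P2$ = 3, DRAM = 7   */
    return read_a(&s, 0); /* 5. P1: Rd a     hit on the stale 7  */
}
```

In this model `run_example()` returns 7 even though P2 last wrote 3, which is the stale read that a coherence protocol has to prevent.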
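
Whatever are we to do?
– Write-Invalidate: a write invalidates the copies held in other caches
– Write-Update: a write sends the new value to the copies held in other caches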

Write Invalidate
Figure: P1 and P2 each have a private cache ($) in front of a shared DRAM.
P1, P2 are write-back caches.

Current value of a in:
Step             P1$   P2$   DRAM
(initial)        *     *     7
1. P2: Rd a
2. P2: Wr a, 5
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a
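
One way to fill in the write-invalidate table is to replay the steps in a toy model. The sketch below is not a full protocol: it assumes a single one-word line, a bus that sees every miss, and a dirty copy that is written back to DRAM whenever the other processor misses. The names `Line`, `System`, `read_a`, `write_a`, `run_invalidate`, and the `bus_ops` counter are hypothetical; index 0 is P1 and index 1 is P2.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool valid; bool dirty; int value; } Line;
typedef struct { Line cache[2]; int dram; int bus_ops; } System;

/* Read a on processor p. A miss goes on the bus; a dirty copy in the
   other cache is written back to DRAM before the fill. */
int read_a(System *s, int p) {
    assert(p == 0 || p == 1);
    Line *mine = &s->cache[p], *other = &s->cache[1 - p];
    if (!mine->valid) {
        s->bus_ops++;
        if (other->valid && other->dirty) {
            s->dram = other->value;
            other->dirty = false;
        }
        mine->valid = true;
        mine->dirty = false;
        mine->value = s->dram;
    }
    return mine->value;
}

/* Write a on processor p. If the writer lacks the line, or the other
   cache holds a copy, one bus operation invalidates that copy first. */
void write_a(System *s, int p, int v) {
    assert(p == 0 || p == 1);
    Line *mine = &s->cache[p], *other = &s->cache[1 - p];
    if (!mine->valid || other->valid) {
        s->bus_ops++;
        if (other->valid && other->dirty)
            s->dram = other->value;  /* keep the old owner's data */
        other->valid = false;
        other->dirty = false;
    }
    mine->valid = true;
    mine->dirty = true;              /* the writer now holds the only copy */
    mine->value = v;
}

/* Replay the slide's steps; returns what P1 reads at step 5. */
int run_invalidate(System *s) {
    *s = (System){ { {false, false, 0}, {false, false, 0} }, 7, 0 };
    read_a(s, 1);        /* 1. P2: Rd a     miss, P2$ = 7               */
    write_a(s, 1, 5);    /* 2. P2: Wr a, 5  local, P2$ = 5              */
    read_a(s, 0);        /* 3. P1: Rd a     miss, DRAM = 5, P1$ = 5     */
    write_a(s, 1, 3);    /* 4. P2: Wr a, 3  invalidates P1$, P2$ = 3    */
    return read_a(s, 0); /* 5. P1: Rd a     miss again, DRAM = 3        */
}
```

Under these assumptions P1 reads 3 at step 5, but only after a second miss: the model counts four bus operations (steps 1, 3, 4, and 5).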

Write Update
Figure: P1 and P2 each have a private cache ($) in front of a shared DRAM.
P1, P2 are write-back caches.

Current value of a in:
Step             P1$   P2$   DRAM
(initial)        *     *     7
1. P2: Rd a
2. P2: Wr a, 5
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a
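
The same steps can be replayed for write-update. This sketch makes several assumptions that real protocols may not share: a single one-word line, a write to a shared line that broadcasts the new value to the other cache and also to DRAM, and a dirty copy that is written back when the other processor misses. `Line`, `System`, `read_a`, `write_a`, `run_update`, and `bus_ops` are hypothetical names; index 0 is P1 and index 1 is P2.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool valid; bool dirty; int value; } Line;
typedef struct { Line cache[2]; int dram; int bus_ops; } System;

/* Read a on processor p. A miss goes on the bus; a dirty copy in the
   other cache is written back to DRAM before the fill. */
int read_a(System *s, int p) {
    assert(p == 0 || p == 1);
    Line *mine = &s->cache[p], *other = &s->cache[1 - p];
    if (!mine->valid) {
        s->bus_ops++;
        if (other->valid && other->dirty) {
            s->dram = other->value;
            other->dirty = false;
        }
        mine->valid = true;
        mine->dirty = false;
        mine->value = s->dram;
    }
    return mine->value;
}

/* Write a on processor p. Once the other cache holds a copy, every
   write goes on the bus and refreshes that copy (and DRAM, by assumption). */
void write_a(System *s, int p, int v) {
    assert(p == 0 || p == 1);
    Line *mine = &s->cache[p], *other = &s->cache[1 - p];
    if (!mine->valid || other->valid)
        s->bus_ops++;
    mine->valid = true;
    mine->value = v;
    if (other->valid) {
        other->value = v;    /* the sharer stays valid and up to date */
        s->dram = v;
        mine->dirty = false;
        other->dirty = false;
    } else {
        mine->dirty = true;  /* private line: ordinary write-back */
    }
}

/* Replay the slide's steps; returns what P1 reads at step 5. */
int run_update(System *s) {
    *s = (System){ { {false, false, 0}, {false, false, 0} }, 7, 0 };
    read_a(s, 1);        /* 1. P2: Rd a     miss, P2$ = 7               */
    write_a(s, 1, 5);    /* 2. P2: Wr a, 5  local, P2$ = 5              */
    read_a(s, 0);        /* 3. P1: Rd a     miss, DRAM = 5, P1$ = 5     */
    write_a(s, 1, 3);    /* 4. P2: Wr a, 3  on bus, P1$ = P2$ = 3       */
    return read_a(s, 0); /* 5. P1: Rd a     hit, no bus traffic         */
}
```

Here P1 reads 3 at step 5 as a hit, and the model counts three bus operations (steps 1, 3, and 4). The price is that every later write to the shared line would also go on the bus.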

Performance Considerations
Invalidate                        Update
Writing makes data exclusive      Once shared, always shared
Receiving changed data slower     Once shared, writes always on bus
                                  Get changed data very quickly

Cache Coherence: False Sharing
Figure: P1 and P2 each have a private cache ($) in front of a shared DRAM.
P1, P2 cache line size: 4 words

Current contents in:
Step                  P1$   P2$
(initial)             *     *
1. P2: Rd A[0]
2. P1: Rd A[1]
3. P2: Wr A[0], 5
4. P1: Wr A[1], 3

Look closely at the example
P1 and P2 do not access the same element.
A[0] and A[1] are in the same cache block, so a cache that holds one of them also holds the other.

False Sharing
Different processors access different items in the same cache block
Leads to coherence misses
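
Whether two elements can be falsely shared comes down to index arithmetic. The sketch below assumes the 4-word cache line from the example, one array element per word, and an array that starts on a line boundary; `WORDS_PER_LINE`, `line_of`, and `falsely_shared` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS_PER_LINE 4  /* cache line size from the example: 4 words */

/* Index of the cache line holding A[i], assuming A starts on a line
   boundary and each element is one word. */
size_t line_of(size_t i) {
    return i / WORDS_PER_LINE;
}

/* Nonzero when two different elements land in the same cache line, so
   writes to them by different processors contend for that line even
   though no data is actually shared. */
int falsely_shared(size_t i, size_t j) {
    assert(i != j);
    return line_of(i) == line_of(j);
}
```

With these definitions `falsely_shared(0, 1)` is true, matching the A[0] and A[1] example, while `falsely_shared(0, 4)` is false because A[4] starts the next line.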
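
Cache Performance
// Pn = my processor number (rank)
// NumProcs = total active processors
// N = total number of elements
// NElem = N / NumProcs

// Interleaved: each processor writes every NumProcs-th element
for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);

vs.

// Blocked: each processor writes one contiguous chunk
for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);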

Which is worse?
Both access the same number of elements
No processors access the same elements as each other
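
One way to compare the two loops is to count how many cache lines end up written by more than one processor. The sketch below assumes 4-word lines, one element per word, an array aligned to a line boundary, and N divisible by NumProcs; `shared_lines`, `MAX_LINES`, and `MAX_PROCS` are hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS_PER_LINE 4
#define MAX_LINES 64
#define MAX_PROCS 32

/* Count cache lines written by more than one processor.
   interleaved != 0: processor p writes A[num_procs*i + p]
   interleaved == 0: processor p writes A[p*n_elem] .. A[(p+1)*n_elem - 1] */
size_t shared_lines(size_t n, size_t num_procs, int interleaved) {
    unsigned writers[MAX_LINES] = {0};  /* bitmask of writers per line */
    assert(num_procs > 0 && num_procs <= MAX_PROCS);
    assert(n % num_procs == 0 && n <= MAX_LINES * WORDS_PER_LINE);
    size_t n_elem = n / num_procs;

    for (size_t p = 0; p < num_procs; p++) {
        for (size_t i = 0; i < n_elem; i++) {
            size_t idx = interleaved ? num_procs * i + p : p * n_elem + i;
            writers[idx / WORDS_PER_LINE] |= 1u << p;
        }
    }

    size_t n_lines = (n + WORDS_PER_LINE - 1) / WORDS_PER_LINE;
    size_t shared = 0;
    for (size_t l = 0; l < n_lines; l++) {
        if (writers[l] & (writers[l] - 1))  /* more than one bit set */
            shared++;
    }
    return shared;
}
```

For N = 16 and NumProcs = 4, `shared_lines(16, 4, 1)` gives 4 (every line is written by all four processors) and `shared_lines(16, 4, 0)` gives 0, which suggests the interleaved loop is the one exposed to false sharing.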

Synchronization
Sum += A[i];
Two processors, i = 0 and i = 50
Before the action:
– Sum = 5
– A[0] = 10
– A[50] = 33
What is the proper result?

Synchronization
Sum = Sum + A[i];
Assembly for this statement, assuming:
– A[i] is already in $t0
– &Sum is already in $s0

Synchronization Ordering #1
P1 inst   Effect       P2 inst   Effect
Given     $t0 = 10     Given     $t0 = 33
lw        $t1 =        lw        $t1 =
add       $t1 =        add       $t1 =
sw        Sum =        sw        Sum =

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

Synchronization Ordering #2
P1 inst   Effect       P2 inst   Effect
Given     $t0 = 10     Given     $t0 = 33
lw        $t1 =        lw        $t1 =
add       $t1 =        add       $t1 =
sw        Sum =        sw        Sum =

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)
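
The effect of different orderings can be checked by stepping the two three-instruction sequences by hand, or with a small deterministic simulation. The sketch below does not use real threads; it assumes each processor runs lw, add, sw exactly once and that a schedule lists which processor executes next. `Proc`, `step`, and `run_schedule` are hypothetical names.

```c
#include <assert.h>

/* Per-processor state for the sequence lw / add / sw. */
typedef struct { int t0; int t1; int pc; } Proc;

/* Execute the next instruction of p against the shared variable sum. */
void step(Proc *p, int *sum) {
    assert(p->pc < 3);
    switch (p->pc++) {
    case 0: p->t1 = *sum;          break;  /* lw  $t1, 0($s0)   */
    case 1: p->t1 = p->t1 + p->t0; break;  /* add $t1, $t1, $t0 */
    case 2: *sum = p->t1;          break;  /* sw  $t1, 0($s0)   */
    }
}

/* Run one ordering: order[k] is 1 or 2, naming the processor that
   executes its next instruction. Sum starts at 5, P1 holds A[0] = 10
   and P2 holds A[50] = 33, as on the slides. Returns the final Sum. */
int run_schedule(const int order[6]) {
    int sum = 5;
    Proc p1 = {10, 0, 0};
    Proc p2 = {33, 0, 0};
    for (int k = 0; k < 6; k++) {
        assert(order[k] == 1 || order[k] == 2);
        step(order[k] == 1 ? &p1 : &p2, &sum);
    }
    return sum;
}
```

Running P1 to completion before P2 (order 1,1,1,2,2,2) yields 48. Strictly alternating the two (order 1,2,1,2,1,2) yields 38, because both processors load Sum = 5 and P2's store overwrites P1's; starting the alternation with P2 instead yields 15.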

Does Cache Coherence solve it?
Did the load bring in an old value?
Sum += A[i] is not atomic
– Atomic: an operation occurs as one unit, and nothing may interrupt it.

Synchronization Problem
Reading and writing memory is a non-atomic operation
– You cannot read and write a memory location in a single operation
We need hardware primitives that allow us to read and write without interruption

Solution
Software
– “lock”: wait until the lock is free, then take it
– “unlock”: release the lock
Hardware
– Provide primitives that read & write without interruption, in order to implement lock and unlock

Software
Using lock and unlock
lock();
Sum += A[i];
unlock();

Hardware
Implementing lock & unlock
swap $1, 100($2)
– Swap the contents of $1 and M[$2+100]
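
Hardware: Implementing lock & unlock with swap
If lock has 0, it is free
If lock has 1, it is held

lock:   li   $t0, 1           # value to store: 1 = held
loop:   swap $t0, 0($a0)      # exchange $t0 with the lock word
        bne  $t0, $0, loop    # old value was not 0: lock was held, retry

unlock: sw   $0, 0($a0)       # store 0: the lock is free again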
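
The same swap-based lock can be sketched in C, using the C11 `atomic_exchange` operation as a stand-in for the slide's swap instruction. This is an illustration rather than a production lock: the names `SwapLock`, `swap_lock`, `swap_try_lock`, `swap_unlock`, and `add_to_sum` are hypothetical, and the lock word follows the convention above (0 = free, 1 = held).

```c
#include <assert.h>
#include <stdatomic.h>

/* Lock word: 0 = free, 1 = held. */
typedef atomic_int SwapLock;

/* Spin until the swap returns 0, meaning this caller changed the lock
   from free to held. */
void swap_lock(SwapLock *l) {
    while (atomic_exchange(l, 1) != 0) {
        /* old value was 1: the lock is held by someone else, retry */
    }
}

/* A single attempt: returns 1 if the lock was acquired, 0 if it was
   already held. */
int swap_try_lock(SwapLock *l) {
    return atomic_exchange(l, 1) == 0;
}

/* Release the lock by storing 0. Only the holder should call this. */
void swap_unlock(SwapLock *l) {
    assert(atomic_load(l) == 1);
    atomic_store(l, 0);
}

/* Sum += A[i] as a critical section protected by the lock. */
void add_to_sum(SwapLock *l, int *sum, int value) {
    swap_lock(l);
    *sum += value;
    swap_unlock(l);
}
```

If each processor performs its update through `add_to_sum`, the load, add, and store of one processor can no longer interleave with those of the other, so starting from Sum = 5 the two updates of 10 and 33 give 48 in either order.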

Summary
– Cache coherence must be implemented for shared memory to work
– False sharing causes bad cache performance
– Hardware primitives are necessary for synchronizing shared data