Parallel Processing Problems: Cache Coherence, False Sharing, Synchronization

Cache Coherence
[Diagram: P1 and P2, each with a cache ($), connected to DRAM]
P1, P2 are write-back caches

Current a value in:   P1$   P2$   DRAM
(initially)           *     *     7
1. P2: Rd a
2. P2: Wr a, 5
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a
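The five steps above can be replayed in a toy model to see what goes wrong when nothing keeps the caches coherent. This is a minimal sketch, not the deck's own material: the `System` struct and the `rd`, `wr`, and `run_without_coherence` names are hypothetical, each cache holds only the single variable `a`, and index 0 stands for P1 and index 1 for P2.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical names): one shared variable a, two private
   write-back caches, and NO coherence protocol. */
typedef struct {
    bool valid[2];   /* does cache p hold a copy of a? */
    int  value[2];   /* the copy held by cache p */
    int  dram;       /* value of a in main memory */
} System;

/* Read: a hit returns the cached copy, a miss fetches from DRAM. */
static int rd(System *s, int p) {
    assert(p == 0 || p == 1);
    if (!s->valid[p]) {
        s->value[p] = s->dram;
        s->valid[p] = true;
    }
    return s->value[p];
}

/* Write-back write: only the local copy changes; DRAM is not touched. */
static void wr(System *s, int p, int v) {
    assert(p == 0 || p == 1);
    s->value[p] = v;
    s->valid[p] = true;
}

/* Replays the slide's sequence; p1_reads receives what P1 sees
   in steps 3 and 5. */
static void run_without_coherence(int p1_reads[2]) {
    System s = { {false, false}, {0, 0}, 7 };
    rd(&s, 1);               /* 1. P2: Rd a    */
    wr(&s, 1, 5);            /* 2. P2: Wr a, 5 */
    p1_reads[0] = rd(&s, 0); /* 3. P1: Rd a    */
    wr(&s, 1, 3);            /* 4. P2: Wr a, 3 */
    p1_reads[1] = rd(&s, 0); /* 5. P1: Rd a    */
}
```

In this model both of P1's reads return the stale value 7, because P2's writes never leave P2's cache.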

Whatever are we to do?
– Write-Invalidate
– Write-Update

Write Invalidate
[Diagram: P1 and P2, each with a cache ($), connected to DRAM]
P1, P2 are write-back caches

Current a value in:   P1$   P2$   DRAM
(initially)           *     *     7
1. P2: Rd a           *
2. P2: Wr a, 5        *
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a
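A write-invalidate protocol can be sketched in the same toy style. The code below is an illustration under stated assumptions rather than any particular real protocol: the names (`System`, `rd`, `wr`, `run_write_invalidate`) are hypothetical, a dirty copy is treated as exclusively owned, and a dirty owner writes its copy back to DRAM whenever the other cache misses.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy write-invalidate model (hypothetical names); 0 = P1, 1 = P2. */
typedef struct {
    bool valid[2];
    bool dirty[2];   /* dirty copy = this cache owns a exclusively */
    int  value[2];
    int  dram;
} System;

/* If the other cache holds a dirty copy, write it back to DRAM. */
static void flush_other(System *s, int other) {
    if (s->valid[other] && s->dirty[other]) {
        s->dram = s->value[other];
        s->dirty[other] = false;
    }
}

static int rd(System *s, int p) {
    assert(p == 0 || p == 1);
    if (!s->valid[p]) {              /* read miss goes on the bus */
        flush_other(s, 1 - p);
        s->value[p] = s->dram;
        s->valid[p] = true;
        s->dirty[p] = false;
    }
    return s->value[p];
}

static void wr(System *s, int p, int v) {
    assert(p == 0 || p == 1);
    if (!s->valid[p] || !s->dirty[p]) {  /* not yet exclusive */
        flush_other(s, 1 - p);
        s->valid[1 - p] = false;         /* invalidate the other copy */
    }
    s->value[p] = v;
    s->valid[p] = true;
    s->dirty[p] = true;
}

/* Replays the slide's five steps; returns the final state. */
static System run_write_invalidate(int p1_reads[2]) {
    System s = { {false, false}, {false, false}, {0, 0}, 7 };
    rd(&s, 1);               /* 1. P2: Rd a    */
    wr(&s, 1, 5);            /* 2. P2: Wr a, 5 */
    p1_reads[0] = rd(&s, 0); /* 3. P1: Rd a    */
    wr(&s, 1, 3);            /* 4. P2: Wr a, 3 */
    p1_reads[1] = rd(&s, 0); /* 5. P1: Rd a    */
    return s;
}
```

In this sketch P1 reads 5 in step 3 and 3 in step 5. Step 5 is a miss, because the write in step 4 invalidated P1's copy.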

Write Update
[Diagram: P1 and P2, each with a cache ($), connected to DRAM; arrows labeled with step numbers]
P1, P2 are write-back caches

Current a value in:   P1$   P2$   DRAM
(initially)           *     *     7
1. P2: Rd a           *
2. P2: Wr a, 5        *
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a
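For comparison, here is a minimal write-update sketch of the same five steps. The names are hypothetical, and two modeling choices are assumptions rather than something the slide states: a read miss is served by the other cache when it holds a copy, and a broadcast update changes the other cache's copy but not DRAM.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy write-update model (hypothetical names); 0 = P1, 1 = P2. */
typedef struct {
    bool valid[2];
    int  value[2];
    int  dram;
    int  bus_msgs;   /* number of bus transactions so far */
} System;

static int rd(System *s, int p) {
    int other = 1 - p;
    assert(p == 0 || p == 1);
    if (!s->valid[p]) {              /* read miss goes on the bus */
        s->bus_msgs++;
        /* Assumption: the other cache supplies the data if it has it. */
        s->value[p] = s->valid[other] ? s->value[other] : s->dram;
        s->valid[p] = true;
    }
    return s->value[p];
}

static void wr(System *s, int p, int v) {
    int other = 1 - p;
    assert(p == 0 || p == 1);
    if (!s->valid[p]) {              /* write miss: allocate the block */
        s->bus_msgs++;
        s->valid[p] = true;
    }
    s->value[p] = v;
    if (s->valid[other]) {           /* shared: broadcast the new value */
        s->bus_msgs++;
        s->value[other] = v;
    }
}

/* Replays the slide's five steps; returns the final state. */
static System run_write_update(int p1_reads[2]) {
    System s = { {false, false}, {0, 0}, 7, 0 };
    rd(&s, 1);               /* 1. P2: Rd a    */
    wr(&s, 1, 5);            /* 2. P2: Wr a, 5 */
    p1_reads[0] = rd(&s, 0); /* 3. P1: Rd a    */
    wr(&s, 1, 3);            /* 4. P2: Wr a, 3 */
    p1_reads[1] = rd(&s, 0); /* 5. P1: Rd a    */
    return s;
}
```

Here P1 again reads 5 and then 3, but step 5 hits in P1's cache because step 4 already pushed the new value to it. The cost is that every later write to the shared block goes on the bus.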

Performance Considerations
Invalidate
– Writing makes data exclusive
– Receiving changed data slower
Update
– Once shared, always shared
– Once shared, writes always on bus
– Get changed data very quickly

Cache Coherence: False Sharing
[Diagram: P1 and P2, each with a cache ($), connected to DRAM]
P1, P2 cache line size: 4 words

Current contents in:   P1$   P2$
Initially: *
1. P2: Rd A[0]
2. P1: Rd A[1]
3. P2: Wr A[0], 5
4. P1: Wr A[1], 3

Look closely at example
– P1 and P2 do not access the same element
– A[0] and A[1] are in the same cache block, so whenever one of them is in a cache, the other is in that cache too

False Sharing
– Different/same processors access different/same items in different/same cache block
– Leads to ___________ misses
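For the A[0]/A[1] example, whether two elements can collide this way comes down to integer division by the line size. A minimal sketch, assuming A[0] sits at the start of a cache line and each element is one word; `line_of` and `falsely_shared` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

enum { WORDS_PER_LINE = 4 };  /* cache line size from the example: 4 words */

/* Index of the cache line holding A[i], assuming A[0] is line-aligned
   and each element occupies one word. */
static size_t line_of(size_t i) {
    return i / WORDS_PER_LINE;
}

/* Nonzero when two processors touch different elements that nevertheless
   live in the same cache line. */
static int falsely_shared(size_t i, size_t j) {
    return i != j && line_of(i) == line_of(j);
}
```

With these assumptions, `falsely_shared(0, 1)` is nonzero, while `falsely_shared(0, 4)` is zero because A[4] starts the next line.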

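Cache Performance
// Pn = my processor number (rank)
// NumProcs = total active processors
// N = total number of elements
// NElem = N / NumProcs

// Interleaved: processor Pn writes every NumProcs-th element
for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);

vs.

// Blocked: processor Pn writes one contiguous chunk of NElem elements
for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);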

Which is worse?
– Both access the same number of elements
– No processors access the same elements as each other
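One way to compare the two loops is to count how many cache lines end up being written by more than one processor. The sketch below does that bookkeeping under stated assumptions: 4-word lines as in the earlier false-sharing example, a line-aligned array of one-word elements, and N divisible by NumProcs. `shared_lines` is a hypothetical helper, not part of the slides.

```c
#include <assert.h>
#include <stdlib.h>

enum { WORDS_PER_LINE = 4 };  /* assumed line size: 4 one-word elements */

/* Counts cache lines written by more than one processor.
   interleaved != 0: processor pn writes A[num_procs*i + pn]
   interleaved == 0: processor pn writes A[pn*n_elem + i]
   for i = 0 .. n_elem-1, where n_elem = n / num_procs. */
static int shared_lines(int n, int num_procs, int interleaved) {
    int n_elem = n / num_procs;
    int n_lines = (n + WORDS_PER_LINE - 1) / WORDS_PER_LINE;
    int *first_writer = malloc((size_t)n_lines * sizeof *first_writer);
    char *is_shared = calloc((size_t)n_lines, 1);
    int count = 0;

    assert(first_writer != NULL && is_shared != NULL);
    for (int l = 0; l < n_lines; l++)
        first_writer[l] = -1;            /* -1 = nobody wrote this line yet */

    for (int pn = 0; pn < num_procs; pn++) {
        for (int i = 0; i < n_elem; i++) {
            int idx = interleaved ? num_procs * i + pn : pn * n_elem + i;
            int line = idx / WORDS_PER_LINE;
            if (first_writer[line] == -1) {
                first_writer[line] = pn;
            } else if (first_writer[line] != pn && !is_shared[line]) {
                is_shared[line] = 1;     /* a second processor writes here */
                count++;
            }
        }
    }
    free(first_writer);
    free(is_shared);
    return count;
}
```

For N = 16 and four processors, the interleaved loop makes all four lines shared, while the blocked loop shares none. Blocked chunks can still share a boundary line when NElem is not a multiple of the line size, for example N = 12 with two processors.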

Synchronization
Sum += A[i];
Two processors, i = 0, i = 50
Before the action:
– Sum = 5
– A[0] = 10
– A[50] = 33
What is the proper result?

Synchronization
Sum = Sum + A[i];
Assembly for this statement, assuming
– A[i] is already in $t0
– &Sum is already in $s0

Synchronization Ordering #1

P1 inst   Effect      P2 inst   Effect
Given     $t0 = 10    Given     $t0 = 33
lw        $t1 =       lw        $t1 =
add       $t1 =       add       $t1 =
sw        Sum =       sw        Sum =

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

Synchronization Ordering #2

P1 inst   Effect      P2 inst   Effect
Given     $t0 = 10    Given     $t0 = 33
lw        $t1 =       lw        $t1 =
add       $t1 =       add       $t1 =
sw        Sum =       sw        Sum =

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)
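The effect of different orderings can be explored by replaying the three instructions of each processor in a chosen interleaving. This is a single-threaded sketch with hypothetical names; the schedules shown in the note after the code are illustrations and are not claimed to be the slides' Ordering #1 and #2.

```c
#include <assert.h>

/* Replays Sum = Sum + A[i] on two processors as three steps each
   (lw, add, sw). schedule has six entries, each 0 (P1) or 1 (P2),
   naming which processor executes its next instruction. */
static int run_schedule(const int schedule[6]) {
    int sum = 5;             /* Sum before the action */
    int t0[2] = {10, 33};    /* $t0 on P1 = A[0], on P2 = A[50] */
    int t1[2] = {0, 0};      /* $t1 on each processor */
    int pc[2] = {0, 0};      /* next instruction of each processor */

    for (int k = 0; k < 6; k++) {
        int p = schedule[k];
        assert(p == 0 || p == 1);
        assert(pc[p] < 3);   /* each processor runs exactly 3 instructions */
        switch (pc[p]++) {
        case 0: t1[p] = sum;     break;  /* lw  $t1, 0($s0)   */
        case 1: t1[p] += t0[p];  break;  /* add $t1, $t1, $t0 */
        case 2: sum = t1[p];     break;  /* sw  $t1, 0($s0)   */
        }
    }
    return sum;
}
```

Running P1 to completion before P2 (schedule 0,0,0,1,1,1) yields 48. Strictly alternating the processors (0,1,0,1,0,1) yields 38, and letting P2 finish between P1's load and P1's add (0,1,1,1,0,0) yields 15: in both cases one update is lost.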

Does Cache Coherence solve it?
– Did load bring in an old value?
– Sum += A[i] is ___________
    – Atomic: operation occurs in one unit, and nothing may interrupt it.

Synchronization Problem
Reading and writing memory is a non-atomic operation
– You cannot read and write a memory location in a single operation
We need __________________ that allow us to read and write without interruption

Solution
Software Solution
– “lock” –
– “unlock” –
Hardware
– Provide primitives that read & write in order to implement lock and unlock

Software: Using lock and unlock
Sum += A[i]

Hardware: Implementing lock & unlock
swap $1, 100($2)
– Swap the contents of $1 and M[$2+100]

Hardware: Implementing lock & unlock with swap

Lock:   li   $t0, 1
loop:   swap $t0, 0($a0)
        bne  $t0, $0, loop

Unlock: sw   $0, 0($a0)

– If lock has 0, it is free
– If lock has 1, it is held
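The same swap-based lock can be written in portable C, with C11's `atomic_exchange` standing in for the `swap` instruction. This is a sketch under that assumption; the `spinlock` type and the `try_lock`, `lock`, `unlock`, and `locked_add` names are hypothetical, and `locked_add` shows the lock wrapped around the `Sum += A[i]` update.

```c
#include <assert.h>
#include <stdatomic.h>

/* 0 = free, 1 = held, matching the convention on the slide. */
typedef atomic_int spinlock;

/* One swap attempt: write 1 and look at the old value.
   Returns nonzero if the lock was free and now belongs to the caller. */
static int try_lock(spinlock *l) {
    return atomic_exchange(l, 1) == 0;   /* swap $t0, 0($a0) */
}

/* Spin until a swap returns 0. */
static void lock(spinlock *l) {
    while (!try_lock(l)) {
        /* bne $t0, $0, loop */
    }
}

/* Release the lock by storing 0. */
static void unlock(spinlock *l) {
    assert(atomic_load(l) == 1);         /* only a held lock is released */
    atomic_store(l, 0);                  /* sw $0, 0($a0) */
}

/* Critical section from the slides: the read-modify-write of Sum
   happens entirely while the lock is held. */
static void locked_add(spinlock *l, int *sum, int x) {
    lock(l);
    *sum += x;
    unlock(l);
}
```

Starting from Sum = 5, calling `locked_add` with 10 and then 33 leaves Sum at 48. When every thread updates Sum through `locked_add` with the same lock, no update can be lost, because the load and store of Sum cannot interleave with another holder's.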

Summary
– Cache coherence must be implemented for shared memory to work
– False sharing causes bad cache performance
– Hardware primitives are necessary for synchronizing shared data