1 Lecture 9: Directory Protocol, TM Topics: corner cases in directory protocols, coherence vs. message-passing, TM intro.

Slides:



Advertisements
Similar presentations
1 Lecture 18: Transactional Memories II Papers: LogTM: Log-Based Transactional Memory, HPCA’06, Wisconsin LogTM-SE: Decoupling Hardware Transactional Memory.
Advertisements

1 Lecture 6: Directory Protocols Topics: directory-based cache coherence implementations (wrap-up of SGI Origin and Sequent NUMA case study)
Concurrent Data Structures in Architectures with Limited Shared Memory Support Ivan Walulya Yiannis Nikolakopoulos Marina Papatriantafilou Philippas Tsigas.
Transactional Memory (TM) Evan Jolley EE 6633 December 7, 2012.
1 Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory.
1 Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
1 Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, “lazy” implementation.
1 Lecture 4: Directory-Based Coherence Details of memory-based (SGI Origin) and cache-based (Sequent NUMA-Q) directory protocols.
1 Lecture 23: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
1 Lecture 8: Eager Transactional Memory Topics: implementation details of eager TM, various TM pathologies.
1 Lecture 8: Transactional Memory – TCC Topics: “lazy” implementation (TCC)
1 Lecture 24: Transactional Memory Topics: transactional memory implementations.
1 Lecture 6: TM – Eager Implementations Topics: Eager conflict detection (LogTM), TM pathologies.
1 Lecture 6: Lazy Transactional Memory Topics: TM semantics and implementation details of “lazy” TM.
1 Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations.
1 Lecture 15: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.
1 Lecture 5: TM – Lazy Implementations Topics: TM design (TCC) with lazy conflict detection and lazy versioning, intro to eager conflict detection.
1 Lecture 4: Directory Protocols and TM Topics: corner cases in directory protocols, lazy TM.
1 Lecture 9: TM Implementations Topics: wrap-up of “lazy” implementation (TCC), eager implementation (LogTM)
1 Lecture 7: Lazy & Eager Transactional Memory Topics: details of “lazy” TM, scalable lazy TM, implementation details of eager TM.
1 Lecture 3: Directory-Based Coherence Basic operations, memory-based and cache-based directories.
1 Lecture 10: TM Implementations Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation.
1 Lecture 20: Protocols and Synchronization Topics: distributed shared-memory multiprocessors, synchronization (Sections )
1 Lecture 3: Directory Protocol Implementations Topics: coherence vs. msg-passing, corner cases in directory protocols.
Multiprocessor Cache Coherency
1 Computer Architecture Research Overview Focus on: Transactional Memory Rajeev Balasubramonian School of Computing, University of Utah
1 Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory.
Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood Presented by: Eduardo Cuervo.
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
1 Lecture 10: Transactional Memory Topics: lazy and eager TM implementations, TM pathologies.
The University of Adelaide, School of Computer Science
Lecture 8: Snooping and Directory Protocols
Lecture 20: Consistency Models, TM
Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun
The University of Adelaide, School of Computer Science
Lecture 18: Coherence and Synchronization
CMSC 611: Advanced Computer Architecture
The University of Adelaide, School of Computer Science
Lecture 9: Directory-Based Examples II
Lecture 11: Transactional Memory
Lecture: Consistency Models, TM
Lecture 6: Transactions
Lecture 17: Transactional Memories I
Lecture 21: Transactional Memory
Yiannis Nikolakopoulos
Lecture 22: Consistency Models, TM
Lecture: Consistency Models, TM
Lecture 9: Directory Protocol Implementations
Lecture 25: Multiprocessors
Lecture 9: Directory-Based Examples
Lecture 8: Directory-Based Examples
Lecture 25: Multiprocessors
Lecture 26: Multiprocessors
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 24: Multiprocessors
Lecture: Coherence, Synchronization
Lecture: Coherence Topics: wrap-up of snooping-based coherence,
Lecture 10: Coherence, Transactional Memory
Lecture 19: Coherence Protocols
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 23: Transactional Memory
Lecture 21: Transactional Memory
Lecture: Coherence and Synchronization
Lecture: Consistency Models, TM
Lecture 19: Coherence and Synchronization
Lecture: Transactional Memory
Lecture 18: Coherence and Synchronization
Lecture 10: Directory-Based Examples II
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

1 Lecture 9: Directory Protocol, TM Topics: corner cases in directory protocols, coherence vs. message-passing, TM intro

2 Handling Write Requests The home node must invalidate all sharers and all invalidations must be acked (to the requestor), the requestor is informed of the number of invalidates to expect Actions taken for each state:  shared: invalidates are sent, state is changed to excl, data and num-sharers are sent to requestor, the requestor cannot continue until it receives all acks (Note: the directory does not maintain busy state, subsequent requests will be fwded to new owner and they must be buffered until the previous write has completed)

3 Handling Writes II Actions taken for each state:  unowned: if the request was an upgrade and not a read-exclusive, is there a problem?  exclusive: is there a problem if the request was an upgrade? In case of a read-exclusive: directory is set to busy, speculative reply is sent to requestor, invalidate is sent to owner, owner sends data to requestor (if dirty), and a “transfer of ownership” message (no data) to home to change out of busy  busy: the request is NACKed and the requestor must try again

4 Handling Write-Back When a dirty block is replaced, a writeback is generated and the home sends back an ack Can the directory state be shared when a writeback is received by the directory? Actions taken for each directory state:  exclusive: change directory state to unowned and send an ack  busy: a request and the writeback have crossed paths: the writeback changes directory state to shared or excl (depending on the busy state), memory is updated, and home sends data to requestor, the intervention request is dropped

5 Writeback Cases P1P2 D3 E: P1 Wback This is the “normal” case D3 sends back an Ack Ack

6 Writeback Cases P1P2 D3 E: P1  busy Wback If someone else has the block in exclusive, D3 moves to busy If Wback is received, D3 serves the requester If we didn’t use busy state when transitioning from E:P1 to E:P2, D3 may not have known who to service (since ownership may have been passed on to P3 and P4…) (although, this problem can be solved by NACKing the Wback and having P1 buffer its “strange” intervention requests… this could lead to other corner cases… ) Fwd Rd or Wr

7 Writeback Cases P1P2 D3 E: P1  busy Transfer ownership If Wback is from new requester, D3 sends back a NACK Floating unresolved messages are a problem Alternatively, can accept the Wback and put D3 in some new busy state Conclusion: could have got rid of busy state between E:P1  E:P2, but with Wback ACK/NACK and other buffering could have kept the busy state between E:P1  E:P2, could have got rid of ACK/NACK, but need one new busy state Fwd Wback Data

8 Future Scalable Designs Intel’s Single Cloud Computer (SCC): an example prototype No support for hardware cache coherence Programmer can write shared-memory apps by marking pages as uncacheable or L1-cacheable, but forcing memory flushes to propagate results Primarily intended for message-passing apps Each core runs a version of Linux

9 Scalable Cache Coherence Will future many-core chips forego hardware cache coherence in favor of message-passing or sw-managed cache coherence? It’s the classic programmer-effort vs. hw-effort trade-off … traditionally, hardware has won (e.g. ILP extraction) Two questions worth answering: will motivated programmers prefer message-passing?, is scalable hw cache coherence do-able?

10 Message Passing Message passing can be faster and more energy-efficient Only required data is communicated: good for energy and reduces network contention Data can be sent before it is required (push semantics; cache coherence is pull semantics and frequently requires indirection to get data) Downsides: more software stack layers and more memory hierarchy layers must be traversed, and.. more programming effort

11 Scalable Directory Coherence Note that the protocol itself need not be changed If an application randomly accesses data with zero locality:  long latencies for data communication  also true for message-passing apps If there is locality and page coloring is employed, the directory and data-sharers will often be in close proximity Does hardware overhead increase? See examples in last class… the overhead is ~2-10% and sharing can be tracked at coarse granularity… hierarchy can also be employed, with snooping-based coherence among a group of nodes

12 Transactions Access to shared variables is encapsulated within transactions – the system gives the illusion that the transaction executes atomically – hence, the programmer need not reason about other threads that may be running in parallel with the transaction Conventional model: TM model: … lock(L1); trans_begin(); access shared vars access shared vars unlock(L1); trans_end(); …

13 Transactions Transactional semantics:  when a transaction executes, it is as if the rest of the system is suspended and the transaction is in isolation  the reads and writes of a transaction happen as if they are all a single atomic operation  if the above conditions are not met, the transaction fails to commit (abort) and tries again transaction begin read shared variables arithmetic write shared variables transaction end

14 Why are Transactions Better? High performance with little programming effort  Transactions proceed in parallel most of the time if the probability of conflict is low (programmers need not precisely identify such conflicts and find work-arounds with say fine-grained locks)  No resources being acquired on transaction start; lesser fear of deadlocks in code  Composability

15 Example Producer-consumer relationships – producers place tasks at the tail of a work-queue and consumers pull tasks out of the head Enqueue Dequeue transaction begin transaction begin if (tail == NULL) if (head->next == NULL) update head and tail update head and tail else else update tail update head transaction end transaction end With locks, neither thread can proceed in parallel since head/tail may be updated – with transactions, enqueue and dequeue can proceed in parallel – transactions will be aborted only if the queue is nearly empty

16 Example Is it possible to have a transactional program that deadlocks, but the program does not deadlock when using locks? flagA = flagB = false; thr-1 thr-2 lock(L1) lock(L2) while (!flagA) {}; flagA = true; flagB = true; while (!flagB) {}; * * unlock(L1) unlock(L2) Somewhat contrived The code implements a barrier before getting to * Note that we are using different lock variables

17 Atomicity Blindly replacing locks-unlocks with tr-begin-end may occasionally result in unexpected behavior The primary difference is that:  transactions provide atomicity with every other transaction  locks provide atomicity with every other code segment that locks the same variable Hence, transactions provide a “stronger” notion of atomicity – not necessarily worse for performance or correctness, but certainly better for programming ease

18 Other Constructs Retry: abandon transaction and start again OrElse: Execute the other transaction if one aborts Weak isolation: transactional semantics enforced only between transactions Strong isolation: transactional semantics enforced beween transactions and non-transactional code

19 Useful Rules of Thumb Transactions are often short – more than 95% of them will fit in cache Transactions often commit successfully – less than 10% are aborted 99.9% of transactions don’t perform I/O Transaction nesting is not common Amdahl’s Law again: optimize the common case!  fast commits, can have slightly slow aborts, can have slightly slow overflow mechanisms

20 Design Space Data Versioning  Eager: based on an undo log  Lazy: based on a write buffer Typically, versioning is done in cache; The above two are variants that handle overflow Conflict Detection  Optimistic detection: check for conflicts at commit time (proceed optimistically thru transaction)  Pessimistic detection: every read/write checks for conflicts

21 Title Bullet