1 CS252 Graduate Computer Architecture
Lecture 21, April 13th, 2011: Distributed Shared Memory (con't), Synchronization
Prof John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs252

2 Recall: Sequential Consistency of Directory Protocols
How to get an exclusion zone for a directory protocol? Clearly need to make sure that invalidations really invalidate copies:
- Keep state to handle reordering at the client (previous slide's problem)
- While acknowledgements are outstanding, cannot handle read requests:
  - NAK read requests, or
  - queue read requests
Example for an invalidation-based scheme:
- Block owner (home node) provides the appearance of atomicity by waiting for all invalidations to be ack'd before allowing access to the new value
- As a result, the write commit point becomes the point at which WData leaves the home node (after the last ack is received)
- Much harder in update schemes!
(figure: home node NAKs or queues incoming reader requests while invalidation acks are outstanding, then releases WData)

3 Recall: Deadlock Issues with Protocols
Consider the dual graph of message dependencies:
- Nodes: networks; Arcs: protocol actions
- Number of networks = length of longest dependency chain
- Must always make sure the response (end of chain) can be absorbed!
Three variants (L = local requestor, H = home, R = remote owner):
- 1:req, 2:reply, 3:intervention, 4a:revise, 4b:response -- 2 networks sufficient to avoid deadlock
- 1:req, 2:intervention, 3:response, 4:reply -- need 4 networks to avoid deadlock
- 1:req, 2:intervention, 3a:revise, 3b:response -- need 3 networks to avoid deadlock

4 Recall: Mechanisms for reducing depth
Original: 1:req, 2:intervention, 3a:revise, 3b:response -- need 3 networks to avoid deadlock
Optional NACK when blocked: NACK the request rather than holding it -- need 2 networks to avoid deadlock
Transform to request/response: 1:req, 2:SendInt-to-R (reply to L), 2':intervention, 3a:revise, 3b:response -- need 2 networks to avoid deadlock

5 A Popular Middle Ground
Two-level "hierarchy":
- Individual nodes are multiprocessors, connected non-hierarchically (e.g. a mesh of SMPs)
- Coherence across nodes is directory-based
  - directory keeps track of nodes, not individual processors
- Coherence within nodes is snooping or directory
  - orthogonal choice, but needs a good interface of functionality
Examples:
- Convex Exemplar: directory-directory
- Sequent, Data General, HAL: directory-snoopy
- SMP on a chip?

6 Example Two-level Hierarchies
(figure: example two-level hierarchies)

7 Advantages of Multiprocessor Nodes
Potential for cost and performance advantages:
- amortization of node fixed costs over multiple processors
  - applies even if processors are simply packaged together but not coherent
- can use commodity SMPs
- fewer nodes for the directory to keep track of
- much communication may be contained within a node (cheaper)
- nodes prefetch data for each other (fewer "remote" misses)
- combining of requests (like hierarchical, only two-level)
- can even share caches (overlapping of working sets)
Benefits depend on sharing pattern (and mapping):
- good for widely read-shared data: e.g. tree data in Barnes-Hut
- good for nearest-neighbor, if properly mapped
- not so good for all-to-all communication

8 Disadvantages of Coherent MP Nodes
- Bus bandwidth is shared within the node
  - all-to-all example; applies whether or not the node is coherent
- Bus increases latency to local memory
- With coherence, typically must wait for local snoop results before sending remote requests
- Snoopy bus at the remote node increases delays there too, increasing latency and reducing bandwidth
- May hurt performance if sharing patterns don't comply

9 Insight into Directory Requirements
If most misses involve O(P) transactions, might as well broadcast!
=> Study inherent program characteristics:
- frequency of write misses?
- how many sharers on a write miss?
- how do these scale?
Also provides insight into how to organize and store directory information

10 Cache Invalidation Patterns
(figure: measured invalidation patterns)

11 Cache Invalidation Patterns
(figure: measured invalidation patterns, continued)

12 Sharing Patterns Summary
Generally, few sharers at a write; scales slowly with P
- Code and read-only objects (e.g., scene data in Raytrace): no problems, as rarely written
- Migratory objects (e.g., cost array cells in LocusRoute): even as # of PEs scales, only 1-2 invalidations
- Mostly-read objects (e.g., root of tree in Barnes): invalidations are large but infrequent, so little impact on performance
- Frequently read/written objects (e.g., task queues): invalidations usually remain small, though frequent
- Synchronization objects:
  - low-contention locks result in small invalidations
  - high-contention locks need special support (SW trees, queueing locks)
Implies directories are very useful in containing traffic
- if organized properly, traffic and latency shouldn't scale too badly
Suggests techniques to reduce storage overhead

13 Organizing Directories
Taxonomy of directory schemes:
- Centralized vs. distributed
- How to find the source of directory information: flat vs. hierarchical
- How to locate copies (within flat schemes): memory-based vs. cache-based

14 How to Find Directory Information
Centralized memory and directory: easy, go to it
- but not scalable
Distributed memory and directory:
- flat schemes: directory distributed with memory, at the home location
  - based on address (hashing): network transaction sent directly to home
- hierarchical schemes: ??

15 How Hierarchical Directories Work
Directory is a hierarchical data structure:
- leaves are processing nodes; internal nodes are just directories
- logical hierarchy, not necessarily physical (can be embedded in a general network)

16 Find Directory Info (cont)
Distributed memory and directory:
- flat schemes: hash
- hierarchical schemes:
  - a node's directory entry for a block says whether each subtree caches the block
  - to find directory info, send a "search" message up to the parent
    - it routes itself through directory lookups
  - like hierarchical snooping, but with point-to-point messages between children and parents

17 How Is Location of Copies Stored?
Hierarchical schemes:
- through the hierarchy
- each directory has presence bits for its child subtrees, plus a dirty bit
Flat schemes vary a lot:
- different storage overheads and performance characteristics
- Memory-based schemes: info about copies stored all at the home, with the memory block
  - Dash, Alewife, SGI Origin, Flash
- Cache-based schemes: info about copies distributed among the copies themselves
  - each copy points to the next
  - Scalable Coherent Interface (SCI: IEEE standard)

18 Flat, Memory-based Schemes
Info about copies co-located with the block at the home
- just like the centralized scheme, except distributed
Performance scaling:
- traffic on a write: proportional to the number of sharers
- latency on a write: can issue invalidations to sharers in parallel
Storage overhead:
- simplest representation: full bit vector ("full-mapped directory"), i.e. one presence bit per node
- storage overhead doesn't scale well with P; for a 64-byte line:
  - 64 nodes: 12.7% overhead
  - 256 nodes: 50% overhead
  - 1024 nodes: 200% overhead
- for M memory blocks in memory, storage overhead is proportional to P*M
  - assuming each node has memory Mlocal = M/P, overhead ~ P^2 * Mlocal
  - this is why people talk about full-mapped directories as scaling with the square of the number of processors
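As a quick sanity check on those percentages (my arithmetic, assuming one presence bit per node plus a single dirty/state bit per entry):

    overhead = (P + 1) bits / (64 B x 8 = 512 bits per line)
    P = 64:   65/512   = 12.7%
    P = 256:  257/512  = 50.2%  (~50%)
    P = 1024: 1025/512 = 200.2% (~200%)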

19 Reducing Storage Overhead
Optimizations for full-bit-vector schemes:
- increase cache block size (reduces storage overhead proportionally)
- use multiprocessor nodes (bit per MP node, not per processor)
- still scales as P*M (= P^2 * Mlocal), but reasonable for all but very large machines
  - e.g. 256 processors, 4 per cluster, 128-byte line: 6.25% overhead
Going further:
- reducing "width": addressing the P term
- reducing "height": addressing the M term

20 Storage Reductions
Width observation:
- most blocks are cached by only a few nodes
- don't have a bit per node; instead, the entry contains a few pointers to sharing nodes: "limited directory protocols"
  - P = 1024 => 10-bit pointers; can use 100 pointers and still save space
- sharing patterns indicate a few pointers should suffice (five or so)
- need an overflow strategy when there are more sharers
Height observation:
- number of memory blocks >> number of cache blocks, so most directory entries are useless at any given time
- could allocate directory entries from a pool instead
  - if a memory line doesn't have a directory entry, no one has a copy
  - what to do on overflow? evict a directory entry by invalidating its sharers
- i.e. organize the directory as a cache, rather than having one entry per memory block

21 Case Study: Alewife Architecture
Cost-effective mesh network
- Pro: scales in terms of hardware
- Pro: exploits locality
Directory distributed along with main memory
- bandwidth scales with the number of processors
Con: non-uniform latencies of communication
- have to manage the mapping of processes/threads onto processors
- Alewife employs techniques for latency minimization and latency tolerance so the programmer does not have to
Context switch in 11 cycles between processes on a remote memory request
- rather than stalling for the full communication network latency
Single-chip cache controller holds cache tags and implements the cache coherence protocol by synthesizing messages to other nodes

22 LimitLESS Protocol (Alewife)
Limited directory that is Locally Extended through Software Support
- handle the common case (small worker set) in hardware and the exceptional case (overflow) in software
Requirements:
- processor with rapid trap handling (executes trap code within 4 cycles of initiation)
- processor needs complete access to coherence-related controller state in the hardware directories
- directory controller can invoke processor trap handlers
- machine needs an interface to the network that allows the processor to launch and intercept coherence protocol packets

23 The Protocol
Alewife: p = 5-entry limited directory with software extension (LimitLESS)
Read-only directory transaction:
- incoming RREQ with n <= p => hardware memory controller responds
- if n > p: send the RREQ to the processor for handling

24 Transition to Software
Trap routine can either discard the packet or store it to memory
- store-back capability permits message-passing and block transfers
Potential deadlock scenario: processor stalled, waiting for a remote cache-fill, cannot service the network input queue
- solution: synchronous trap (packets stored to local memory) to empty the input queue

25 Transition to Software (Con’t)
Overflow trap scenario:
- first instance: full-map bit vector allocated in local memory, hardware pointers transferred into it, and the vector entered into a hash table
- otherwise: transfer hardware pointers into the existing bit vector
- meta-state set to "Trap-On-Write"
- while emptying hardware pointers, meta-state is "Trans-In-Progress"
Incoming write request scenario:
- empty hardware pointers to memory
- set AckCtr to the number of bits set in the bit vector
- send invalidations to all caches (except possibly the requesting one)
- free the vector in memory
- when all invalidate acknowledgements have arrived (AckCtr == 0), send write permission and set memory state to "Read-Write"

26 Flat, Cache-based Schemes
How they work:
- home only holds a pointer to the rest of the directory info
- a distributed linked list of copies weaves through the caches
  - cache tag has a pointer that points to the next cache with a copy
- on a read, add yourself to the head of the list (communication needed)
- on a write, propagate a chain of invalidations down the list
Scalable Coherent Interface (SCI) IEEE standard: doubly linked list

27 Scaling Properties (Cache-based)
Traffic on write: proportional to the number of sharers
Latency on write: proportional to the number of sharers!
- don't know the identity of the next sharer until you reach the current one
- also assist processing at each node along the way
- even reads involve more than one other assist: home and first sharer on the list
Storage overhead: quite good scaling along both axes
- only one head pointer per memory block
- the rest is all proportional to cache size
Very complex!!!

28 Summary of Directory Organizations
Flat schemes:
- Issue (a): finding the source of directory data
  - go to home, based on address
- Issue (b): finding out where the copies are
  - memory-based: all info is in the directory at home
  - cache-based: home has a pointer to the first element of a distributed linked list
- Issue (c): communicating with those copies
  - memory-based: point-to-point messages (perhaps coarser on overflow); can be multicast or overlapped
  - cache-based: part of the point-to-point linked-list traversal to find them; serialized
Hierarchical schemes:
- all three issues handled by sending messages up and down the tree
- no single explicit list of sharers
- only direct communication is between parents and children

29 Summary of Directory Approaches
Directories offer scalable coherence on general networks
- no need for broadcast media
Many possibilities for organizing directories and managing protocols
Hierarchical directories not used much
- high latency, many network transactions, and bandwidth bottleneck at the root
Both memory-based and cache-based flat schemes are alive
- for memory-based, a full bit vector suffices for moderate scale
  - measured in nodes visible to the directory protocol, not processors
- will examine case studies of each

30 Role of Synchronization
Types of synchronization:
- Mutual exclusion
- Event synchronization
  - point-to-point
  - group
  - global (barriers)
How much hardware support?
- high-level operations?
- atomic instructions?
- specialized interconnect?

31 Components of a Synchronization Event
Acquire method
- acquire the right to the synch (enter critical section, go past event)
Waiting algorithm
- wait for the synch to become available when it isn't
- busy-waiting, blocking, or hybrid
Release method
- enable other processors to acquire the right to the synch
The waiting algorithm is independent of the type of synchronization
- makes no sense to put it in hardware

32 Strawman Lock
Busy-wait lock:

    lock:   ld   register, location  /* copy location to register */
            cmp  register, #0        /* compare with 0 */
            bnz  lock                /* if not 0, try again */
            st   location, #1        /* store 1 to mark it locked */
            ret                      /* return control to caller */

    unlock: st   location, #0        /* write 0 to location */

Why doesn't the acquire method work? Release method?

33 What to do if only load and store?
Here is a possible two-thread solution:

    Thread A             Thread B
    Set A=1;             Set B=1;
    while (B) { //X      if (!A) { //Y
      do nothing;          Critical Section;
    }                    }
    Critical Section;    Set B=0;
    Set A=0;

Does this work? Yes. Both can guarantee that only one will enter the critical section at a time:
- At X: if B == 0, it is safe for A to perform the critical section; otherwise, wait to find out what will happen
- At Y: if A == 0, it is safe for B to perform the critical section; otherwise, A is in the critical section or waiting for B to quit
But:
- really messy
- generalization gets worse
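For concreteness, the same protocol can be written in C11 (a minimal sketch: the function names are mine, and seq_cst atomics stand in for the sequentially consistent memory the argument assumes):

    #include <stdatomic.h>

    atomic_int A = 0, B = 0;            /* shared "interest" flags */

    void thread_A(void) {
        atomic_store(&A, 1);            /* announce intent */
        while (atomic_load(&B))         /* X: wait while B is interested */
            ;                           /* do nothing */
        /* ... critical section ... */
        atomic_store(&A, 0);
    }

    void thread_B(void) {
        atomic_store(&B, 1);            /* announce intent */
        if (!atomic_load(&A)) {         /* Y: enter only if A is not interested */
            /* ... critical section ... */
        }
        atomic_store(&B, 0);            /* if A won, B gets no entry this time */
    }

Note the asymmetry: thread B can fail to enter and must retry from the top, which is part of why this approach is messy and generalizes poorly.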

34 Atomic Instructions
An atomic instruction specifies a location, a register, and an atomic operation:
- value in the location is read into the register
- another value (possibly a function of the value read) is stored into the location
Many variants:
- varying degrees of flexibility in the second part
Simple example: test&set
- value in the location is read into a specified register
- constant 1 is stored into the location
- successful if the value loaded into the register is 0
- other constants could be used instead of 1 and 0
How to implement test&set in a distributed cache-coherent machine?
- wait until you have write privileges, then perform the operation without allowing any intervening operations (either locally or remotely)
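As one concrete rendering, C11's atomic_flag provides exactly the test&set semantics described above: the old value comes back, and 1 (true) is stored. A hypothetical spinlock sketch:

    #include <stdatomic.h>

    atomic_flag lock_loc = ATOMIC_FLAG_INIT;   /* the "location", initially 0 */

    void acquire(void) {
        while (atomic_flag_test_and_set(&lock_loc))
            ;                          /* old value was 1: lock held, try again */
    }

    void release(void) {
        atomic_flag_clear(&lock_loc);  /* store 0 to release */
    }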

35 T&S Lock Microbenchmark: SGI Chal.
Microbenchmark: lock; delay(c); unlock;

    lock:   t&s  register, location
            bnz  lock           /* if not 0, try again */
            ret                 /* return control to caller */
    unlock: st   location, #0   /* write 0 to location */

(figure: time per acquire vs. number of processors (2-16) on the SGI Challenge; curves for test&set with c = 0, test&set with exponential backoff and c = 3.64 us, and ideal)

36 Zoo of hardware primitives
    test&set (&address) {           /* most architectures */
        result = M[address];
        M[address] = 1;
        return result;
    }

    swap (&address, register) {     /* x86 */
        temp = M[address];
        M[address] = register;
        register = temp;
    }

    compare&swap (&address, reg1, reg2) { /* */
        if (reg1 == M[address]) {
            M[address] = reg2;
            return success;
        } else {
            return failure;
        }
    }

    load-linked&store-conditional(&address) { /* R4000, alpha */
      loop:
        ll   r1, M[address];
        movi r2, 1;                 /* can do arbitrary computation */
        sc   r2, M[address];
        beqz r2, loop;
    }

37 Mini-Instruction Set debate
Atomic read-modify-write instructions:
- IBM 370: included atomic compare&swap for multiprogramming
- x86: any instruction can be prefixed with a lock modifier
- High-level language advocates want hardware locks/barriers
  - but it goes against the "RISC" flow, and has other problems
- SPARC: atomic register-memory ops (swap, compare&swap)
- MIPS, IBM Power: no atomic operations, but a pair of instructions
  - load-locked, store-conditional
  - later used by PowerPC and DEC Alpha too
- 68000: CCS: compare and compare and swap
  - no one does this any more
Rich set of tradeoffs

38 Other forms of hardware support
- Separate lock lines on the bus
- Lock locations in memory
- Lock registers (Cray X-MP, Intel single-chip CC)
- Hardware full/empty bits (Tera, Alewife)
- QOLB (machines supporting the SCI protocol)
- Bus support for interrupt dispatch

39 Enhancements to Simple Lock
Reduce the frequency of issuing test&sets while waiting:
- Test&set lock with backoff
  - don't back off too much, or you will still be backed off when the lock becomes free
  - exponential backoff works quite well empirically: i-th backoff time = k*c^i
- Busy-wait with read operations rather than test&set: the test-and-test&set lock
  - keep testing with an ordinary load; the cached lock variable will be invalidated when a release occurs
  - when the value changes (to 0), try to obtain the lock with test&set
  - only one attemptor will succeed; others will fail and start testing again
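A sketch of the test-and-test&set lock with exponential backoff in C11 (the names and backoff constants are illustrative, not from the lecture):

    #include <stdatomic.h>

    atomic_int lck = 0;

    void acquire(void) {
        unsigned delay = 1;                    /* i-th backoff = k*c^i; here k = 1, c = 2 */
        for (;;) {
            while (atomic_load(&lck) != 0)
                ;                              /* test: spin on the cached copy */
            if (atomic_exchange(&lck, 1) == 0)
                return;                        /* test&set won the race */
            for (volatile unsigned i = 0; i < delay; i++)
                ;                              /* lost: back off before retesting */
            if (delay < (1u << 16))
                delay *= 2;                    /* exponential, capped */
        }
    }

    void release(void) { atomic_store(&lck, 0); }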

40 Busy-wait vs Blocking
Busy-wait (i.e. spin lock): keep trying to acquire the lock until you succeed
- very low latency/processor overhead!
- very high system overhead!
  - causes stress on the network while spinning
  - processor is not doing anything else useful
Blocking: if you can't acquire the lock, deschedule the process (i.e. unload its state)
- higher latency/processor overhead (1000s of cycles?)
  - takes time to unload/restart the task
  - notification mechanism needed
- low system overhead
  - no stress on the network
  - processor does something useful
Hybrid: spin for a while, then block
- 2-competitive: spin until you have waited as long as the blocking time
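A sketch of the hybrid strategy; SPIN_BUDGET is an assumed stand-in for the blocking cost (per the 2-competitive rule), and sched_yield() is a cheap stand-in for truly descheduling the process:

    #include <stdatomic.h>
    #include <sched.h>

    #define SPIN_BUDGET 1000          /* ~ cost of unload/restart, in spin attempts */

    void hybrid_acquire(atomic_int *l) {
        unsigned tries = 0;
        while (atomic_exchange(l, 1) != 0) {
            if (++tries < SPIN_BUDGET)
                continue;             /* spin: wins if the lock frees soon */
            sched_yield();            /* waited as long as blocking costs: give up the CPU */
        }
    }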

41 Improved Hardware Primitives: LL-SC
Goals:
- test with reads
- failed read-modify-write attempts don't generate invalidations
- nice if a single primitive can implement a range of r-m-w operations
Load-Locked (or -linked), Store-Conditional:
- LL reads the variable into a register
- follow with arbitrary instructions to manipulate its value
- SC tries to store back to the location
  - succeeds if and only if no other write to the variable since this processor's LL
  - indicated by condition codes
If the SC succeeds, all three steps happened atomically
If it fails, it doesn't write or generate invalidations
- must retry the acquire

42 Simple Lock with LL-SC

    lock:   ll   reg1, location  /* LL location to reg1 */
            bnz  reg1, lock      /* if already locked, try again */
            sc   location, reg2  /* SC reg2 (holding 1) into location */
            beqz reg2, lock      /* if SC failed, start again */
            ret
    unlock: st   location, #0    /* write 0 to location */

Can do more fancy atomic ops by changing what's between the LL & SC
- but keep it small so the SC is likely to succeed
- don't include instructions that would need to be undone (e.g. stores)
SC can fail (without putting a transaction on the bus) if:
- it detects an intervening write even before trying to get the bus
- it tries to get the bus but another processor's SC gets the bus first
LL, SC are not lock, unlock respectively
- they only guarantee there was no conflicting write to the lock variable between them
- but they can be used directly to implement simple operations on shared variables

43 Ticket Lock
Only one r-m-w per acquire
Two counters per lock (next_ticket, now_serving)
- Acquire: fetch&inc next_ticket; wait until now_serving == my ticket
  - atomic op when you arrive at the lock, not when it's free (so less contention)
- Release: increment now_serving
Performance:
- low latency for low contention, if fetch&inc is cacheable
- O(p) read misses at release, since all spin on the same variable
- FIFO order, like the simple LL-SC lock, but no invalidation when the SC succeeds, and fair
Backoff?
- wouldn't it be nice to poll different locations...
- can be difficult to find a good amount to delay on backoff
  - exponential backoff is not a good idea due to the FIFO order
  - backoff proportional to (now_serving - next_ticket) may work well
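With C11 atomics, the ticket lock is only a few lines (a sketch; the field names mirror the slide's counters):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* fetch&inc'd once per acquire */
        atomic_uint now_serving;   /* advanced by the holder on release */
    } ticket_lock;

    void acquire(ticket_lock *l) {
        unsigned me = atomic_fetch_add(&l->next_ticket, 1);  /* the one r-m-w, at arrival */
        while (atomic_load(&l->now_serving) != me)
            ;                                                /* spin with ordinary reads */
    }

    void release(ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1);  /* FIFO: exactly one waiter proceeds */
    }

Since every waiter spins on the same now_serving word, a release still invalidates all spinners; that is the O(p) read misses noted above, and the motivation for the array-based locks on the next slide.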

44 Array-based Queuing Locks
Waiting processes poll on different locations in an array of size p
- Acquire: fetch&inc to obtain the address on which to spin (next array element)
  - ensure that these addresses are in different cache lines or memories
- Release: set the next location in the array, thus waking up the process spinning on it
- O(1) traffic per acquire with coherent caches
- FIFO ordering, as in the ticket lock, but O(p) space per lock
- not so great for non-cache-coherent machines with distributed memory
  - the array location I spin on is not necessarily in my local memory
- example: MCS lock (Mellor-Crummey and Scott)
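A sketch of such a lock in C11 (Anderson-style; the padding and the MAXPROCS bound are my assumptions, and it is only correct while at most MAXPROCS processes contend):

    #include <stdatomic.h>

    #define MAXPROCS 64                 /* p: gives the O(p) space per lock */

    typedef struct {
        struct {
            atomic_int must_wait;
            char pad[60];               /* pad each slot to an (assumed) 64-byte line */
        } slot[MAXPROCS];
        atomic_uint next;               /* fetch&inc hands out spin positions */
    } array_lock;

    void init(array_lock *l) {
        for (int i = 1; i < MAXPROCS; i++)
            atomic_store(&l->slot[i].must_wait, 1);
        atomic_store(&l->slot[0].must_wait, 0);   /* first arrival enters at once */
        atomic_store(&l->next, 0);
    }

    unsigned acquire(array_lock *l) {
        unsigned me = atomic_fetch_add(&l->next, 1) % MAXPROCS;
        while (atomic_load(&l->slot[me].must_wait))
            ;                                     /* spin only on my own line */
        return me;
    }

    void release(array_lock *l, unsigned me) {
        atomic_store(&l->slot[me].must_wait, 1);  /* re-arm my slot for later reuse */
        atomic_store(&l->slot[(me + 1) % MAXPROCS].must_wait, 0);  /* wake successor: O(1) */
    }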

45 Lock Performance on SGI Challenge
Microbenchmark loop: lock; delay(c); unlock; delay(d);

(figure: time per acquire vs. number of processors on the SGI Challenge for array-based, LL-SC, LL-SC with exponential backoff, and ticket locks, under three cases: (a) Null (c = 0, d = 0), (b) Critical-section (c = 3.64 us, d = 0), (c) Delay (c = 3.64 us, d = 1.29 us))

46 Point to Point Event Synchronization
Software methods:
- Interrupts
- Busy-waiting: use ordinary variables as flags
- Blocking: use semaphores
Full hardware support: a full/empty bit with each word in memory
- set when the word is "full" with newly produced data (i.e. when written)
- unset when the word is "empty" due to being consumed (i.e. when read)
- natural for word-level producer-consumer synchronization
  - producer: write if empty, set to full; consumer: read if full, set to empty
- hardware preserves atomicity of the bit manipulation with the read or write
- problem: flexibility
  - multiple consumers, or multiple writes before the consumer reads?
  - needs language support to specify when to use
  - composite data structures?
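In software, the word-level producer-consumer pattern reduces to an ordinary flag, as in this single-producer/single-consumer C11 sketch (the atomic flag plays the role the full/empty bit plays in hardware):

    #include <stdatomic.h>

    int word;                     /* the data word being handed off */
    atomic_int full = 0;          /* software stand-in for the full/empty bit */

    void produce(int v) {
        while (atomic_load(&full))
            ;                     /* write only when "empty" */
        word = v;
        atomic_store(&full, 1);   /* mark full: word now visible to the consumer */
    }

    int consume(void) {
        while (!atomic_load(&full))
            ;                     /* read only when "full" */
        int v = word;
        atomic_store(&full, 0);   /* mark empty: slot handed back to the producer */
        return v;
    }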

47 Barriers
Software algorithms: implemented using locks, flags, counters
Hardware barriers:
- wired-AND line separate from the address/data bus
  - set input high when you arrive, wait for the output to be high to leave
- in practice, multiple wires to allow reuse
- useful when barriers are global and very frequent
- difficult to support an arbitrary subset of processors
  - even harder with multiple processes per processor
- difficult to dynamically change the number and identity of participants
  - e.g. the latter due to process migration
- not common today on bus-based machines

48 A Simple Centralized Barrier
Shared counter maintains the number of processes that have arrived
- increment when you arrive (lock), check until it reaches numprocs
Problem?

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;             /* reset flag if first to reach */
        mycount = ++bar_name.counter;      /* mycount is private; pre-increment so the last arriver sees p */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                /* last to arrive */
            bar_name.counter = 0;          /* reset for next barrier */
            bar_name.flag = 1;             /* release waiters */
        } else
            while (bar_name.flag == 0) {}; /* busy wait for release */
    }

49 A Working Centralized Barrier
Consecutively entering the same barrier doesn't work
- must prevent a process from entering until all have left the previous instance
- could use another counter, but that increases latency and contention
Sense reversal: wait for the flag to take a different value in consecutive barriers
- toggle this value only when all processes have reached the barrier

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);       /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;       /* mycount is private */
        if (bar_name.counter == p) {        /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;           /* reset for next barrier */
            bar_name.flag = local_sense;    /* release waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};
        }
    }

50 Centralized Barrier Performance
Latency:
- centralized has critical path length at least proportional to p
Traffic:
- about 3p bus transactions
Storage cost:
- very low: centralized counter and flag
Fairness:
- the same processor should not always be last to exit the barrier
- no such bias in the centralized barrier
Key problems for the centralized barrier are latency and traffic
- especially with distributed memory, where all traffic goes to the same node

51 Improved Barrier Algorithms for a Bus
Software combining tree:
- only k processors access the same location, where k is the degree of the tree
- separate arrival and exit trees, and use sense reversal
- valuable in a distributed network: communication happens along different paths
- on a bus, however, all traffic goes on the same bus, and there is no less total traffic
  - higher latency: log p steps of work, and O(p) serialized bus transactions
  - the advantage on a bus is the use of ordinary reads/writes instead of locks

52 Barrier Performance on SGI Challenge
Centralized does quite well
- will discuss fancier barrier algorithms for distributed machines
Helpful hardware support: piggybacking of read misses on the bus
- also useful for spinning on highly contended locks

53 Lock-Free Synchronization
What happens if a process grabs a lock, then goes to sleep???
- page fault
- processor scheduling
- etc.
Lock-free synchronization:
- operations do not require mutual exclusion over multiple instructions
Nonblocking:
- some process will complete in a finite amount of time even if other processors halt
Wait-free (Herlihy):
- every (nonfaulting) process will complete in a finite amount of time
Systems based on LL&SC can implement these

54 Using Compare&Swap for Queues

    compare&swap (&address, reg1, reg2) { /* */
        if (reg1 == M[address]) {
            M[address] = reg2;
            return success;
        } else {
            return failure;
        }
    }

Here is an atomic add to a linked list:

    addToQueue(&object) {
        do {                         /* repeat until no conflict */
            ld r1, M[root]           /* get ptr to current head */
            st r1, M[object]         /* save link in new object */
        } until (compare&swap(&root, r1, object));
    }

(figure: the new object's next pointer is set to the current head before root is swung to the new object)
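The same head-insert in C11, as a sketch (atomic_compare_exchange_weak plays the role of compare&swap and conveniently reloads the head on failure; note that, like the slide's version, this ignores the classic ABA problem):

    #include <stdatomic.h>
    #include <stddef.h>

    struct node { struct node *next; /* payload omitted */ };
    _Atomic(struct node *) root = NULL;

    void addToQueue(struct node *object) {
        struct node *head = atomic_load(&root);  /* get ptr to current head */
        do {
            object->next = head;                 /* save link in new object */
        } while (!atomic_compare_exchange_weak(&root, &head, object));
    }                                            /* on failure, head holds the new root; retry */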

55 Transactional Memory
Transaction-based model of memory
Interface:
- start_transaction();
- read/write data
- commit_transaction(): if conflicts are detected, the commit will abort and must be retried
What is a conflict?
- values you read are written by others before you commit
Hardware support for transactions:
- typically uses the cache coherence protocol to help with the process

56 Brief discussion of Transactional Memory
LogTM: Log-based Transactional Memory
- Kevin Moore, Jayaram Bobba, Michelle Moravan, Mark Hill & David Wood
- uses the cache coherence protocol to detect transaction conflicts
Transactional interface:
- begin_transaction(): requests that subsequent statements form a transaction
- commit_transaction(): ends the successful transaction begun by the matching begin_transaction(); discards any transaction state saved for potential abort
- abort_transaction(): transfers control to a previously registered conflict handler, which should undo and discard all work since the last begin_transaction()
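A sketch of how client code might use this interface; the bank-transfer example and account type are mine, not from the paper:

    extern void begin_transaction(void);
    extern void commit_transaction(void);

    typedef struct { long balance; } account;

    void transfer(account *from, account *to, long amt) {
        begin_transaction();        /* subsequent reads/writes form one transaction */
        from->balance -= amt;       /* a conflicting access by another thread       */
        to->balance   += amt;       /* aborts the transaction, undoing both writes  */
        commit_transaction();       /* success: saved undo state is discarded       */
    }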

57 Specific Logging Mechanism
(figure: logging mechanism details)

58 Summary
Performance enhancements:
- reduce the number of hops
- reduce occupancy of transactions in the memory controller
Deadlock issues with protocols:
- many protocols are not simply request-response
- consider the dual graph of message dependencies
  - nodes: networks; arcs: protocol actions
  - the maximum depth of the graph gives the number of networks needed
Distributed directory structure:
- flat: each address has a "home node"
- hierarchical: directory spread along a tree
Mechanism for locating copies of data:
- memory-based schemes: info about copies stored all at the home, with the memory block
- cache-based schemes: info about copies distributed among the copies themselves

59 Synchronization Summary
Rich interaction of hardware-software tradeoffs
- must evaluate hardware primitives and software algorithms together
  - primitives determine which algorithms perform well
Evaluation methodology is challenging
- use of delays, microbenchmarks
- should use both microbenchmarks and real workloads
Simple software algorithms with common hardware primitives do well on a bus
- will see more sophisticated techniques for distributed machines
Hardware support is still a subject of debate
- theoretical research argues for swap or compare&swap, not fetch&op
  - algorithms that ensure constant-time access, but complex

