1 Global and high-contention operations: Barriers, reductions, and highly-contended locks. Katie Coons, April 6, 2006.


2 Synchronization Operations
–Locks
–Point-to-point event synchronization
–Barriers
–Global event notification
–Dynamic work distribution

3 Locks - Desirable Characteristics and Potential Tradeoffs
–Low latency to acquire the lock
–High bandwidth
–Minimal traffic at all stages
–Low storage cost
–Fairness - FIFO lock granting
–Performs well with distributed memory

4 Test&set Lock
–Acquire method: test&set returns 0, sets the location to 1
–Waiting algorithm: spin on test&set until it returns 0
–Release algorithm: set the location to 0
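
A minimal C11 sketch of this lock, with atomic_exchange standing in for the test&set primitive (the type and function names are mine, not from the slides):

    #include <stdatomic.h>

    typedef atomic_int ts_lock;        /* 0 = free, 1 = held */

    void ts_acquire(ts_lock *l) {
        /* spin on test&set until it returns 0 (the lock was free) */
        while (atomic_exchange(l, 1) != 0)
            ;
    }

    void ts_release(ts_lock *l) {
        atomic_store(l, 0);            /* release: set back to 0 */
    }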

5 Disadvantages
–Excessive traffic
–Unfair
–Separate primitives needed for different operations
–Exponential backoff only helps somewhat

6 Test-and-test&set
–Spin waiting protocol: spins on the read only
–Generates less bus traffic, but still O(p²)
–Failed attempts generate invalidations
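
The same lock with the test-and-test&set waiting protocol, again a hedged C11 sketch (ts_lock as in the previous sketch):

    #include <stdatomic.h>

    typedef atomic_int ts_lock;        /* 0 = free, 1 = held */

    void tts_acquire(ts_lock *l) {
        for (;;) {
            while (atomic_load(l) != 0)
                ;                          /* read-only spin: hits in cache */
            if (atomic_exchange(l, 1) == 0)
                return;                    /* test&set won the race */
            /* failed attempt (causes invalidations): back to the read spin */
        }
    }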

7 Contended test&set spin locks
–(a) P1 holds the lock; P2 and P3 spin on the same variable
–P1 releases the lock; P2 and P3 read miss

8 Contended test&set spin locks
–P2 and P3 try to reread the lock - the lock is temporarily unlocked
–P2 and P3 attempt to test&set the lock to gain exclusive access

9 Contended test&set spin locks
–The test&set attempts cause additional invalidations and cache interference
–Return to state (a), but now P2 holds the lock

10 Load-linked, Store-conditional
–LL: loads the variable into a register
–SC: writes the register to memory only if no intervening writes to that location occurred
–Together, they implement an atomic read-modify-write (r-m-w)
Goals:
–Test with reads only
–No invalidations on failure
–Single primitive for a variety of r-m-w operations

11 LL, SC Lock Implementation
SC fails if it 1) detects another write before its bus request, or 2) loses bus arbitration.

    lock:   ll   r1, location   ; read the value
            bnz  r1, lock       ; loop if not free
            sc   location, #1   ; try to store
            beqz lock           ; start over if unsuccessful
    unlock: st   location, #0   ; release the lock
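
There is no portable LL/SC in C, but C11's compare_exchange_weak is allowed to fail spuriously in much the same way SC can, so the retry structure carries over. A rough analogue of the assembly above (function names are mine):

    #include <stdatomic.h>

    void llsc_acquire(atomic_int *location) {
        for (;;) {
            int v = atomic_load(location);   /* "ll": read the value */
            if (v != 0)
                continue;                    /* loop if not free */
            if (atomic_compare_exchange_weak(location, &v, 1))
                return;                      /* "sc" succeeded */
            /* "sc" failed: intervening write or spurious failure; retry */
        }
    }

    void llsc_release(atomic_int *location) {
        atomic_store(location, 0);           /* release the lock */
    }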

12 Load-linked, Store-conditional
Advantages:
–No bus traffic while spinning
–Generates no invalidations on store failure
–Primitive for various operations (test&set, fetch&op, compare&swap)
–Improved traffic for lock acquisition - O(p)

13 Load-linked, Store-conditional
Disadvantages:
–Heavy traffic when the lock is released
–Invalidates caches of all waiting processors
–O(p) traffic per lock acquisition (could do better)
–Not fair

14 Contended Locks
Problem: a release frees all waiting processors, but only one will get the lock!
Solution: notify only one processor.

15 Ticket Lock
Two counters: next-ticket and now-serving
Algorithm:
–Acquire method: atomic fetch&increment on next-ticket provides a unique my-ticket
–Waiting algorithm: check now-serving until it equals my-ticket
–Release method: increment now-serving
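
A minimal C11 sketch of the algorithm just described (type and function names are mine):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* fetch&increment hands out tickets */
        atomic_uint now_serving;   /* ticket currently holding the lock */
    } ticket_lock;

    void ticket_acquire(ticket_lock *l) {
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load(&l->now_serving) != my_ticket)
            ;                      /* check now-serving until it matches */
    }

    void ticket_release(ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1);  /* admit the next waiter */
    }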

16 Ticket Lock
Advantages:
–Decreased traffic on lock release
–Constant, small storage
–Fair
–Low latency with cacheable fetch&increment
Drawbacks:
–Traffic still not O(1) on release

17 Array-Based Lock
–Acquire method: atomic fetch&increment provides a unique location (address)
–Waiting algorithm: check the location for ready; if not ready, keep checking until a read miss occurs
–Release method: write to the next location in the array
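
A sketch of this array-based (Anderson-style) lock in C11, assuming at most MAXPROCS concurrent waiters; slots are padded so each flag sits in its own cache line, and all names are mine:

    #include <stdatomic.h>

    #define MAXPROCS 64

    typedef struct {
        struct { atomic_int ready; char pad[60]; } slot[MAXPROCS];
        atomic_uint next;   /* fetch&increment hands out slots */
    } array_lock;

    /* Initialization: slot[0].ready = 1 (lock starts free), all others 0. */

    unsigned array_acquire(array_lock *l) {
        unsigned me = atomic_fetch_add(&l->next, 1) % MAXPROCS;
        while (!atomic_load(&l->slot[me].ready))
            ;                                  /* spin on our own location */
        atomic_store(&l->slot[me].ready, 0);   /* reset the slot for reuse */
        return me;                             /* pass to array_release */
    }

    void array_release(array_lock *l, unsigned me) {
        /* write to the next location: invalidates exactly one waiter */
        atomic_store(&l->slot[(me + 1) % MAXPROCS].ready, 1);
    }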

18 Array-Based Lock
Advantages:
–Only one invalidate on a release
–Fair
–Similar uncontended latency
–No backoff needed
Disadvantages:
–O(p) rather than O(1) space
–Complications with distributed memory

19 Synchronization with Distributed Memory
Interconnect not centralized:
–Disjoint processors coordinate in parallel
–Complicates synchronization primitives
Physically distributed memory:
–Synchronization variable allocation is important
–Varies with the cache implementation

20 Synchronization with Distributed Memory
Memory bandwidth:
–Limits scalability
–Hot-spot references are the most severe cause
Memory latency:
–Limits performance
–Requires good cache and memory locality

21 Array-Based Locks and Distributed Memory
Problems:
–O(p) storage
–Impossible to always spin on local memory
Spinning on remote locations is undesirable:
–Increases traffic
–Increases contention

22 Software Queuing Lock
Goals:
–Reduce space requirements
–Always spin on locally allocated variables
Distributed linked list of waiters:
–Each node points to the following node
–Tail pointer to the last waiter

23 Software Queuing Lock

24 Software Queuing Lock
Atomic changes to the tail pointer: atomic fetch&store
–Returns the current value of its 1st operand
–Sets it to the second operand
–Returns only on success
–Determines FIFO ordering for acquisition

25 Software Queuing Lock
Atomic check for the last processor: atomic compare&swap
–Compares the first two operands
–If equal, sets the first to the third and returns true
–If not equal, returns false
–Difficult to implement (3 operands) - use LL, SC
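
Slides 22-25 together describe the MCS-style queuing lock; a hedged C11 sketch follows, with atomic_exchange as fetch&store and atomic_compare_exchange_strong as compare&swap (all names are mine):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool             locked;  /* true while this waiter spins */
    } qnode;

    typedef struct {
        _Atomic(qnode *) tail;           /* last waiter; NULL if lock free */
    } mcs_lock;

    void mcs_acquire(mcs_lock *lk, qnode *me) {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        /* fetch&store: swap ourselves in as the new tail (FIFO order) */
        qnode *pred = atomic_exchange(&lk->tail, me);
        if (pred != NULL) {              /* someone is ahead of us */
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))
                ;                        /* spin on our own, local node */
        }
    }

    void mcs_release(mcs_lock *lk, qnode *me) {
        qnode *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* compare&swap: if we are still the tail, no one is waiting */
            qnode *expected = me;
            if (atomic_compare_exchange_strong(&lk->tail, &expected, NULL))
                return;
            /* a new waiter swapped itself in; wait for it to link */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->locked, false);  /* hand lock to successor */
    }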

26 Software Queuing Lock
Advantages:
–Space proportional to the number of waiting processes
–FIFO granting order
–Processes spin on local variables
Preferred lock for a shared address space on distributed memory with no coherent caching

27 Queue on Lock Bit (QOLB)
–Hardware primitive
–Incorporated in the IEEE SCI protocol
–List of waiters kept in the cache tags of the spinning processors
–DASH: directory pointers approximate the QOLB waiting list

28 Atomic Counter Increment Performance
(Chart: time per increment, in microseconds)

29 Atomic Counter Increment Network Usage
(Chart: estimated network messages per increment)

30 Point-to-Point Event Synchronization
Producer-consumer synchronization. Software algorithms use flags - P1 tells P2 that a value is ready for P2 to use:

    P1:                     P2:
    a = f(x);  // set a     while (flag == 0)
    flag = 1;                   ;  // do nothing
                            b = g(a);  // use a
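
The flag code above assumes writes become visible in order. A minimal C11 version of the same pattern, assuming a single producer and consumer (f, g, and x are stand-ins for the slide's computation), makes that ordering explicit with release/acquire atomics:

    #include <stdatomic.h>

    int x = 41;
    int a, b;
    atomic_int flag = 0;

    int f(int v) { return v + 1; }   /* stand-in for the real work */
    int g(int v) { return v * 2; }

    void producer(void) {            /* P1 */
        a = f(x);                    /* set a */
        /* release: the write to a is visible before the flag is */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    void consumer(void) {            /* P2 */
        /* acquire: after seeing flag == 1, the write to a is visible */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                        /* do nothing */
        b = g(a);                    /* use a */
    }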

31 Full-Empty Bits Word-level, producer-consumer synchronization: –A full-empty bit is associated with each word in memory –Producer writes only if the full-empty bit is empty, and leaves it set to full –Consumer reads only if the full-empty bit is set to full, and leaves it set to empty
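
Real full-empty bits are a hardware tag on each memory word; a hedged software emulation for a single producer and single consumer (all names are mine) illustrates the semantics:

    #include <stdatomic.h>

    typedef struct {
        atomic_int full;   /* 0 = empty, 1 = full */
        int        value;
    } fe_word;

    void fe_write(fe_word *w, int v) {   /* producer: word must be empty */
        while (atomic_load_explicit(&w->full, memory_order_acquire) != 0)
            ;
        w->value = v;
        /* leave it set to full */
        atomic_store_explicit(&w->full, 1, memory_order_release);
    }

    int fe_read(fe_word *w) {            /* consumer: word must be full */
        while (atomic_load_explicit(&w->full, memory_order_acquire) == 0)
            ;
        int v = w->value;
        /* leave it set to empty */
        atomic_store_explicit(&w->full, 0, memory_order_release);
        return v;
    }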

32 Full-Empty Bits
Advantages:
–Full-empty bits preserve atomicity
–Hardware support for fine-grained producer-consumer synchronization
Disadvantages:
–Inflexible
–Imposes synchronization on all accesses
–Hardware cost (J-machine? M-machine?)

33 Global (Barrier) Event Synchronization
No process can go beyond the barrier until all processes have reached it.
Phases: arrival, wait for release, release.

34 Centralized Barrier
Single shared counter and flag:
–Counter: number of arrived processes; increment on arrival to get my-number
–p = total number of processes
–If my-number = p, set the release flag
–Otherwise, busy-wait on the release flag

35 Centralized Barrier
Inefficient:
–Counter: incremented atomically by each arriving processor
–Flag: all arrived processors busy-wait on the same flag
Correctness problem: consecutively entering the same barrier (use sense reversal)
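
A sketch of the centralized sense-reversing barrier in C11 (names are mine; count starts at p, sense at false, and each thread's local_sense at true):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int  count;   /* arrivals still expected this episode */
        atomic_bool sense;   /* global release flag, flips each episode */
        int         p;       /* total number of processes */
    } barrier_t;

    void barrier_wait(barrier_t *b, bool *local_sense) {
        bool my_sense = *local_sense;
        if (atomic_fetch_sub(&b->count, 1) == 1) {
            /* last arriver: reset the counter, then release everyone */
            atomic_store(&b->count, b->p);
            atomic_store(&b->sense, my_sense);
        } else {
            while (atomic_load(&b->sense) != my_sense)
                ;   /* busy-wait on the shared release flag */
        }
        *local_sense = !my_sense;  /* sense reversal for the next barrier */
    }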

36 Centralized Barrier
–Latency: critical path proportional to p
–Traffic: about 3p bus transactions
–Storage: low cost (1 counter, 1 flag)
–Fairness: the same processor may always be last to exit the barrier (unfair)
Key problems: latency and traffic, especially with distributed memory!

37 Barriers and Distributed Memory
Why do we need better barrier algorithms for distributed memory?
–Traffic, contention
–Even bigger problem without cache coherence
–Parallelization of communication now possible
–Fine-grained parallelism often means frequent communication and synchronization

38 Barriers and Distributed Memory
Is special hardware support needed?
–CM-5: special "control" network for barriers, reductions, broadcasts
–CRAY T3D, M-machine
–Potentially significant overhead in a large system
Are sophisticated software algorithms enough?

39 Software Combining Trees
(Figure: a flat structure, where all processors update one location, suffers heavy contention; a tree-structured combining scheme sees little contention.)

40 Software Combining Trees
–Same process for release
–Critical path length is O(log_k p)
–versus O(p) for a centralized barrier
–and O(p) for any barrier on a centralized bus
Disadvantages:
–Remote spinning problem
–Heavy network traffic while spinning
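
A hedged sketch of a combining-tree barrier built from sense-reversing nodes with fan-in k (all names - cnode, combine_arrive, tree_barrier - are mine). Note that waiters still spin on a per-node flag that may be remote, which is exactly the remote-spinning problem noted above:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct cnode {
        atomic_int    count;    /* arrivals still expected this episode */
        int           k;        /* fan-in of this node */
        struct cnode *parent;   /* NULL at the root */
        atomic_bool   sense;    /* release flag, flips each episode */
    } cnode;

    static void combine_arrive(cnode *n, bool my_sense) {
        if (atomic_fetch_sub(&n->count, 1) == 1) {
            /* last arriver at this node: carry the arrival upward */
            if (n->parent != NULL)
                combine_arrive(n->parent, my_sense);
            atomic_store(&n->count, n->k);      /* reset before release */
            atomic_store(&n->sense, my_sense);  /* release local waiters */
        } else {
            while (atomic_load(&n->sense) != my_sense)
                ;   /* spin on this node's flag */
        }
    }

    /* Each processor arrives at its assigned leaf. The critical path
       is the tree height, O(log_k p), rather than O(p). */
    void tree_barrier(cnode *my_leaf, bool *local_sense) {
        combine_arrive(my_leaf, *local_sense);
        *local_sense = !*local_sense;
    }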

41 Tree Barriers with Local Spinning
"Tournament barrier":
–Predetermine which processor moves up
–The other processor spins on a local variable
P-node tree:
–A leaf writes to its parent's arrival array
–A parent waits for all arrivals, then writes to its parent's arrival array
–Separate arrival and release trees are OK

42 Tree Barriers with Local Spinning
Separate arrival and release branching factors:
–Larger branching factor => more contention
–Smaller branching factor => more network transactions
Suited to scalable machines without coherent caching

43 Global Event Notification
Example uses:
–Producer-consumer synchronization
–Communicating global data to consumers (a new global min/max, for example)
Invalidation-based coherence - sufficient for low-frequency writes
Update protocol - reduces communication latency, prevents remote read misses for consuming processors

44 Update-writes
The consumer doesn't fetch data from the producer's cache.
Used for:
–Small data items (coherence messages per word, not per cache line)
–Items the consumer already has cached
Well-suited to implementing barrier release

45 Barrier Synchronization with Update Write and Fetch&Op

46 Barrier Synchronization Without Fetch&Op

47 Dynamic Work Distribution
–Allocate work to load-balance the system, often using task queues
–Mutual exclusion => multiple remote memory accesses per update
–Instead, support Fetch&Op
–Fetch&Op operations can often be parallelized (combining tree)
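
A small sketch of fetch&op-based work distribution: a single atomic fetch&increment claims the next task, replacing a lock/update/unlock sequence (task_t and run_task are hypothetical stand-ins):

    #include <stdatomic.h>

    typedef struct { int id; } task_t;            /* hypothetical task */
    static void run_task(task_t *t) { (void)t; }  /* do the actual work */

    atomic_int next_task = 0;                     /* shared counter */

    void worker(task_t *tasks, int ntasks) {
        for (;;) {
            /* fetch&increment: one atomic op claims the next task */
            int i = atomic_fetch_add(&next_task, 1);
            if (i >= ntasks)
                break;
            run_task(&tasks[i]);
        }
    }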

48 Parallel Prefix
–Synchronize by combining information
–Distribute a result based on that combination
–The carry-lookahead operator is an example
–Can calculate any associative function (sum, maximum, concatenation) in O(log n) time

49 Parallel Prefix - Upward Sweep
Each node saves the value from its rightmost child and passes a combined result to its parent.

50 Parallel Prefix - Downward Sweep
–Combine values and send them to the left child
–Pass data directly to the right child
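
Formulations differ in which child keeps which value; a common variant (Blelloch's exclusive scan) makes the two sweeps concrete. This sketch is sequential, assumes the array length n is a power of two, and each inner loop's iterations are independent, so they would run in parallel, giving the O(log n) depth from slide 48:

    /* Exclusive prefix sum: up-sweep then down-sweep (Blelloch scan). */
    void prefix_sum_exclusive(int *a, int n) {
        /* upward sweep: each node combines its two children */
        for (int d = 1; d < n; d *= 2)
            for (int i = 2*d - 1; i < n; i += 2*d)
                a[i] += a[i - d];
        /* downward sweep: push partial results back down the tree */
        a[n - 1] = 0;
        for (int d = n / 2; d >= 1; d /= 2)
            for (int i = 2*d - 1; i < n; i += 2*d) {
                int t = a[i - d];
                a[i - d] = a[i];   /* parent's value goes to one child */
                a[i] += t;         /* combined value goes to the other */
            }
    }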

51 Synchronization and Fine-Grained Parallelism
–How do these techniques apply to transactional memory?
–How do they differ for message-passing vs. shared memory?
–What mechanisms are worth implementing in hardware to support fine-grained parallelism?

52 Questions?