Caltech CS184 Spring 2005 -- DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 22: May 20, 2005 Synchronization.

Caltech CS184 Spring 2005 -- DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 22: May 20, 2005 Synchronization

Caltech CS184 Spring 2005 -- DeHon 2 Today Requirement Implementation –Uniprocessor –Bus-Based –Distributed Granularity

Caltech CS184 Spring 2005 -- DeHon 3 Synchronization

Caltech CS184 Spring 2005 -- DeHon 4 Problem If correctness requires an ordering between threads, –we have to enforce it This was not a problem in the single-thread case –it does occur with multiple threads on a single processor

Caltech CS184 Spring 2005 -- DeHon 5 Desired Guarantees Precedence –barrier synchronization Everything before the barrier completes before anything after it begins –producer-consumer Consumer reads the value produced by the producer Atomic Operation Set Mutual exclusion

Caltech CS184 Spring 2005 -- DeHon 6 Read/Write Locks? Try to implement a lock with ordinary reads and writes:
    if (~A.lock)
      A.lock = true
      do stuff
      A.lock = false

Caltech CS184 Spring 2005 -- DeHon 7 Problem with R/W locks? Consider a context switch between the test (is ~A.lock true?) and the assignment (A.lock = true):
    if (~A.lock)
      A.lock = true
      do stuff
      A.lock = false
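
A minimal C rendering of this broken lock, purely to mark where it breaks; the type and function names (broken_lock_t, broken_acquire) are illustrative, not from the slides:

    /* Broken lock built from plain reads and writes (illustrative only). */
    typedef struct { volatile int lock; } broken_lock_t;   /* 0 = free, 1 = held */

    void broken_acquire(broken_lock_t *a) {
        while (a->lock)     /* test: observe the lock free                      */
            ;
                            /* <-- a context switch (or another processor) can
                                   let a second thread pass the same test here  */
        a->lock = 1;        /* set: both threads now believe they hold the lock */
    }

    void broken_release(broken_lock_t *a) {
        a->lock = 0;
    }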

Caltech CS184 Spring 2005 -- DeHon 8 Primitive Need Need an indivisible primitive to enable atomic operations

Caltech CS184 Spring 2005 -- DeHon 9 Original Examples Test-and-set –combines the test of A.lock and the set into a single atomic operation –once we have the lock, we can guarantee mutual exclusion at a higher level Read-Modify-Write –atomic read…write sequence Exchange

Caltech CS184 Spring 2005 -- DeHon 10 Examples (cont.) Exchange –exchange true with A.lock –if the value retrieved was false, this process got the lock –if the value retrieved was true, it was already locked (the exchange didn't change the value); keep trying –key is that only a single exchanger gets the false value
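
This exchange-based acquire can be sketched with C11 atomics; the names (xchg_lock_t, xchg_acquire) are illustrative, not from the slides:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool lock; } xchg_lock_t;   /* false = free */

    void xchg_acquire(xchg_lock_t *a) {
        /* Atomically swap in true; whoever reads back false owns the lock. */
        while (atomic_exchange(&a->lock, true))
            ;                                /* someone else holds it: retry */
    }

    void xchg_release(xchg_lock_t *a) {
        atomic_store(&a->lock, false);
    }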

Caltech CS184 Spring 2005 -- DeHon 11 Implementation

Caltech CS184 Spring 2005 -- DeHon 12 Implementing... What is required to implement it? –Uniprocessor –Bus-based

Caltech CS184 Spring 2005 -- DeHon 13 Implement: Uniprocessor Prevent interrupt/context switch Primitives use a single address –so any page fault happens at the beginning –then it is ok to prevent interrupts (defer faults…) SMT?

Caltech CS184 Spring 2005 -- DeHon 14 Implement: Snoop Bus Need to reserve for the write –write-through: hold the bus between the read and the write Guarantees no operation can intervene –write-back: need an exclusive read and a way to defer other writes until the write completes

Caltech CS184 Spring 2005 -- DeHon 15 Performance Concerns? Locking resources reduces parallelism Bus (network) traffic Processor utilization Latency of operation

Caltech CS184 Spring 2005 -- DeHon 16 Basic Synch. Components Acquisition Waiting Release

Caltech CS184 Spring 2005 -- DeHon 17 Spin Wait
    while (Exchange(A.lock, True) == True)
        ;
    Access A;
    Exchange(A.lock, False);

Caltech CS184 Spring 2005 -- DeHon 18 Possible Problems Spin wait generates considerable memory traffic Release traffic Bottleneck on resources Invalidation –can’t cache locally… Fairness

Caltech CS184 Spring 2005 -- DeHon 19 Test-and-Set
    Try: t&s R1, A.lock
         bnz R1, Try
         return
This simple algorithm generates considerable traffic with p contenders –p try first, 1 wins –for O(1) time, p-1 spin –…then p-2… –c*(p + (p-1) + (p-2) + …) –O(p^2)

Caltech CS184 Spring 2005 -- DeHon 20 Backoff Instead of immediately retrying –wait some time before retrying –reduces contention –may increase latency (what if I’m the only contender and the lock is about to be released?)
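
A sketch of exponential backoff layered on the exchange lock above; the delay loop and the 1024 cap are illustrative constants, not from the slides (reuses xchg_lock_t from the earlier sketch):

    void backoff_acquire(xchg_lock_t *a) {
        unsigned delay = 1;
        while (atomic_exchange(&a->lock, true)) {
            /* Back off before retrying: reduces contention on the lock line,
               at the cost of extra latency if the lock frees during the wait. */
            for (volatile unsigned i = 0; i < delay; i++)
                ;
            if (delay < 1024)
                delay *= 2;                    /* exponential growth, capped */
        }
    }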

Caltech CS184 Spring 2005 -- DeHon 21 Primitive Bus Performance [Culler/Singh/Gupta 5.29]

Caltech CS184 Spring 2005 -- DeHon 22 Bad Effects Performance decreases with the number of users –from the growing traffic already noted

Caltech CS184 Spring 2005 -- DeHon 23 Test-and-Test-and-Set
    Try: ld  R1, A.lock
         bnz R1, Try
         t&s R1, A.lock
         bnz R1, Try
         return
The read can be satisfied from the local cache –does not generate bus traffic –generates less contention traffic
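
The same test-and-test-and-set idea in C11, again reusing the illustrative xchg_lock_t from above:

    void ttas_acquire(xchg_lock_t *a) {
        for (;;) {
            /* Spin on a plain read first: it hits in the local cache and
               generates no bus traffic while the lock stays held. */
            while (atomic_load(&a->lock))
                ;
            /* The lock looked free: make one atomic attempt. */
            if (!atomic_exchange(&a->lock, true))
                return;
        }
    }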

Caltech CS184 Spring 2005 -- DeHon 24 Detecting atomicity sufficient It is fine to merely detect whether the operation would appear atomic Pair of instructions –ll (load locked): load the value and mark the cache line as locked –sc (store conditional): stores the value iff there was no intervening write to the address e.g. the cache line was never invalidated by a write

Caltech CS184 Spring 2005 -- DeHon 25 LL/SC operation
    Try: LL   R1, A.lock
         BNZ  R1, Try
         SC   R2, A.lock
         BEQZ R2, Try
         return from lock
(R2 holds the "locked" value going in; in MIPS-style LL/SC, SC writes its success/failure flag back into its source register, here R2)

Caltech CS184 Spring 2005 -- DeHon 26 LL/SC The pair doesn’t really lock the value It just detects whether the result would appear atomic Ok to have arbitrary interleaving between LL and SC Ok to have a capacity eviction between LL and SC –the SC will just fail and we retry
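
Where LL/SC is not directly exposed to software, the same "detect rather than lock" acquire is commonly written with compare-and-swap; a C11 sketch, illustrative only and reusing the earlier xchg_lock_t:

    #include <stdbool.h>

    bool try_acquire_cas(xchg_lock_t *a) {
        _Bool expected = false;
        /* Succeeds only if the lock was still false when we stored true:
           the same "no intervening write" condition the SC checks. */
        return atomic_compare_exchange_weak(&a->lock, &expected, true);
    }

    void cas_acquire(xchg_lock_t *a) {
        while (!try_acquire_cas(a))
            ;                              /* failed: just retry, like SC */
    }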

Caltech CS184 Spring 2005 -- DeHon 27 LL/SC and MP Traffic The address can be cached Spinning on LL does not generate global traffic (everyone has their own copy) After a write (e.g. unlock) –everyone misses -- O(p) message traffic –O(p^2) total traffic for p contenders No need to lock down the bus during the operation

Caltech CS184 Spring 2005 -- DeHon 28 Synchronization in DSM

Caltech CS184 Spring 2005 -- DeHon 29 Implement: Distributed Can’t lock down the bus Exchange at the memory controller? –invalidate copies (force writeback) –after it settles, return the old value and write the new one –don’t service writes until the exchange completes

Caltech CS184 Spring 2005 -- DeHon 30 Ticket Synchronization Separate counters for place in line and current owner Use ll/sc to implement fetch-and-increment on the position in line Simply read the current owner until your own number comes up Increment the current owner when done Provides FIFO service (fairness) O(p) reads on each change, like ll/sc Chance to back off based on the expected wait time
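
A ticket lock sketch in C11 (fetch-and-increment written as atomic_fetch_add; the names are illustrative, not from the slides):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* place in line */
        atomic_uint now_serving;   /* current owner */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1);   /* take a number */
        while (atomic_load(&l->now_serving) != my)
            ;     /* spin until our number comes up; the gap (my - now_serving)
                     gives an estimate of how long to back off */
    }

    void ticket_release(ticket_lock_t *l) {
        atomic_fetch_add(&l->now_serving, 1);   /* hand off to the next in line */
    }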

Caltech CS184 Spring 2005 -- DeHon 31 Array Based Assign numbers like the ticket lock But use the number as an index into an array of synchronization bits Each queued user spins on a different location Set the next location when done Now only O(1) traffic per invalidation
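
An array-based queue lock sketch; MAX_PROCS, the padding, and the slot-reset protocol are illustrative choices, not from the slides, and at most MAX_PROCS simultaneous waiters are assumed:

    #include <stdatomic.h>

    #define MAX_PROCS 64

    typedef struct {
        /* One flag per waiter, padded so each spinner sits on its own
           cache line; initialize slot[0].flag = 1, all others = 0. */
        struct { atomic_uint flag; char pad[60]; } slot[MAX_PROCS];
        atomic_uint next;                        /* ticket counter */
    } array_lock_t;

    unsigned array_acquire(array_lock_t *l) {
        unsigned my = atomic_fetch_add(&l->next, 1) % MAX_PROCS;
        while (!atomic_load(&l->slot[my].flag))
            ;                                    /* spin on our own location */
        atomic_store(&l->slot[my].flag, 0);      /* consume the grant for reuse */
        return my;                               /* caller passes this to release */
    }

    void array_release(array_lock_t *l, unsigned my) {
        atomic_store(&l->slot[(my + 1) % MAX_PROCS].flag, 1);   /* wake successor */
    }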

Caltech CS184 Spring 2005 -- DeHon 32 Performance Bus [Culler/Singh/Gupta 5.30]

Caltech CS184 Spring 2005 -- DeHon 33 Queuing Like the array lock, but use a queue Atomically splice your own synchronization variable onto the end of the queue It can be allocated local to the process Spin until the predecessor is done –and has spliced itself out
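
A sketch of an MCS-style queue lock (Mellor-Crummey and Scott) matching this description; the C11 rendering and names are illustrative:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        struct mcs_node *_Atomic next;
        atomic_bool locked;
    } mcs_node_t;

    typedef struct { mcs_node_t *_Atomic tail; } mcs_lock_t;

    void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {   /* me: caller-local node */
        atomic_store(&me->next, (mcs_node_t *)NULL);
        atomic_store(&me->locked, true);
        /* Atomically splice our node onto the end of the queue. */
        mcs_node_t *pred = atomic_exchange(&l->tail, me);
        if (pred != NULL) {
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))
                ;                            /* spin on our own, local node */
        }
    }

    void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* No known successor: try to swing the tail back to empty. */
            mcs_node_t *expected = me;
            if (atomic_compare_exchange_strong(&l->tail, &expected, (mcs_node_t *)NULL))
                return;
            /* A successor is arriving; wait for it to link itself in. */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->locked, false);  /* hand the lock to the successor */
    }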

Caltech CS184 Spring 2005 -- DeHon 34 Performance Distributed [Culler/Singh/Gupta 8.34]

Caltech CS184 Spring 2005 -- DeHon 35 Barrier Synchronization Guarantee all processes rendezvous at a point before continuing Separates phases of computation

Caltech CS184 Spring 2005 -- DeHon 36 Simple Barrier Fetch-and-decrement a value Spin until it reaches zero If we reuse the same synchronization variable –will have to take care in the reset –one option: invert the sense each barrier
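
A sketch of the "invert the sense each barrier" option as a sense-reversing centralized barrier in C11; the struct layout and the caller-held sense flag are illustrative (initialize count = total, sense = 0, and each thread's my_sense = 0):

    #include <stdatomic.h>

    typedef struct {
        atomic_int count;    /* arrivals still outstanding this episode */
        atomic_int sense;    /* flips each barrier episode */
        int        total;    /* number of participants */
    } barrier_t;

    void barrier_wait(barrier_t *b, int *my_sense) {
        *my_sense = !*my_sense;                     /* invert local sense */
        if (atomic_fetch_sub(&b->count, 1) == 1) {
            /* Last arriver: reset the count, then release everyone. */
            atomic_store(&b->count, b->total);
            atomic_store(&b->sense, *my_sense);
        } else {
            while (atomic_load(&b->sense) != *my_sense)
                ;                                   /* spin until released */
        }
    }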

Caltech CS184 Spring 2005 -- DeHon 37 Simple Barrier Performance Bottleneck on the synchronization variable O(p^2) traffic from spinning Each decrement invalidates the cached copies

Caltech CS184 Spring 2005 -- DeHon 38 Combining Trees Avoid the bottleneck by building a tree –fan-in and fan-out A small (constant) number of nodes rendezvous on each location The last arriver synchronizes up to the next level Exploit disjoint paths in a scalable network Spin on local variables Predetermine who passes up –“tournament” barrier

Caltech CS184 Spring 2005 -- DeHon 39 Simple Bus Barrier [Culler/Singh/Gupta 5.31]

Caltech CS184 Spring 2005 -- DeHon 40 Synchronization Grain Size

Caltech CS184 Spring 2005 -- DeHon 41 Full/Empty Bits One bit associated with each data word Bit set if data is present (location full) Bit not set if data is not present (empty) A read of full data –completes like a normal read A read of empty data –stalls until the data is produced Like a register scoreboard in the memory system
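
Hardware keeps this bit per memory word; a rough software emulation of the same read/write semantics in C11, spin-based and purely illustrative (the names are invented here):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_bool full;    /* the full/empty bit */
        int         value;   /* the data word */
    } fe_word_t;

    /* Producer: write the value, then mark the location full. */
    void fe_write(fe_word_t *w, int v) {
        w->value = v;
        atomic_store_explicit(&w->full, true, memory_order_release);
    }

    /* Consumer: a read of an empty location "stalls" until it is filled. */
    int fe_read(fe_word_t *w) {
        while (!atomic_load_explicit(&w->full, memory_order_acquire))
            ;
        return w->value;
    }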

Caltech CS184 Spring 2005 -- DeHon 42 F/E Uses Like a thunk in Scheme Allows you to allocate a data structure (addresses) and pass it around before the data is computed Non-strict operations need not block on the data being created –copying the address, returning it … are non-strict E.g. cons Only strict operations block –add (needs the value)

Caltech CS184 Spring 2005 -- DeHon 43 F/E Use: Example Consider a relaxation calculation on an entire grid (or a cellular automaton) Want each element to read values from the appropriate epoch Could barrier synchronize With F/E bits –can allow some processes to start on the next iteration –…they will block on data from the previous iteration that has not yet been produced…

Caltech CS184 Spring 2005 -- DeHon 44 Coarse vs. Fine-Grained... Barriers are coarse-grained synchronization –all processes rendezvous at a point Full-empty bits are fine-grained –synchronize each consumer with its producer –expose more parallelism –less false waiting –many more synchronization events and variables

Caltech CS184 Spring 2005 -- DeHon 45 Alewife / Full Empty Experiment to see the impact of synchronization granularity Conjugate Gradient computation Barriers vs. full-empty bits [Yeung and Agarwal PPoPP’93]

Caltech CS184 Spring 2005 -- DeHon 46 Overall Impact

Caltech CS184 Spring 2005 -- DeHon 47 Alewife provides Ability to express fine-grained synchronization with J-structures Efficient data storage (in the directory) Hardware handling of data presence –like a normal memory op in the common case that the data is available

Caltech CS184 Spring 2005 -- DeHon 48 Breakdown of benefit How much of the benefit comes from each? –Expressiveness –Memory efficiency –Hardware support

Caltech CS184 Spring 2005 -- DeHon 49 Impact of Compact Memory

Caltech CS184 Spring 2005 -- DeHon 50 Overall Contribution
    II:  expression only
    III: + memory full-empty
    IV:  + fast bit ops

Caltech CS184 Spring 2005 -- DeHon 51 Synch. Granularity Big benefit from expression Hardware can make it better –but that is not the dominant effect

Caltech CS184 Spring 2005 -- DeHon 52 Big Ideas Simple primitives –Must have primitives to support atomic operations –don’t have to implement them atomically, just detect non-atomicity Make the fast case common –optimize for locality –minimize contention

Caltech CS184 Spring 2005 -- DeHon 53 Big Ideas Expose parallelism –fine-grained expressibility exposes the most –the cost can be manageable