

Synchron. CSE 471 Aut 01

Some Recent Medium-scale NUMA Multiprocessors (research machines)

DASH (Stanford) multiprocessor
–"Cluster" = 4 processors on a shared bus with a shared L2
–Directory cache coherence on a cluster basis
–Clusters (up to 16) connected through two 2D meshes (one for sending messages, one for acks)
Alewife (MIT)
–Dynamic pointer allocation directory (5 pointers)
–On "overflow" of a directory entry, software takes over
–Multithreaded: (fast) context switch on a cache miss to a remote node
FLASH (Stanford)
–Uses a programmable protocol processor; can implement different protocols (including message passing) depending on the application

Some Recent Medium-scale NUMA Multiprocessors (commercial machines)

SGI Origin (follow-up on DASH)
–2 processors/cluster
–Full directory
–Hypercube topology up to 32 processors (16 nodes)
–Then "fat hypercube" with a metarouter (up to 256 processors): vertices of the hypercubes are connected to switches in the metarouter
Sequent NUMA-Q
–SMP clusters of 4 processors + a shared "remote" cache (caches only data not homed in the cluster)
–Clusters connected in a ring
–SCI cache coherence via the remote caches

Extending the Range of SMPs – Sun's Starfire

Uses snooping buses (4 of them) for transmitting requests and addresses
–One bus for each quarter of the physical address space
Up to 16 clusters of 4 processor/memory modules each
Data is transmitted via a 16 x 16 crossbar between clusters
"Analysis" shows that up to 12 clusters, performance is limited by the data network; beyond that, by the snooping buses

Multiprogramming and Multiprocessing Imply Synchronization

Locking
–Critical sections
–Mutual exclusion
–Used for exclusive access to a shared resource or shared data for some period of time
–Efficient update of a shared (work) queue
Barriers
–Process synchronization: all processes must reach the barrier before any one can proceed (e.g., at the end of a parallel loop)

Locking

Typical use of a lock:

  while (!acquire(lock))
    ; /* spin */
  /* some computation on shared data */
  release(lock);

Acquire is based on a read-modify-write primitive:
–Basic principle: "atomic exchange"
–Test-and-set
–Fetch-and-add

Test-and-set

The lock is stored in a memory location that contains 0 or 1
Test-and-set (the attempt to acquire) writes a 1 and returns the old value in memory
–If that value was 0, the process gets the lock; if it was 1, another process holds the lock
To release, just clear (set to 0) the memory location
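As a concrete sketch, the slide's operations map directly onto C11's `atomic_flag`; the function names here are illustrative, not from the lecture:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* The lock is one memory location: clear = 0 (free), set = 1 (held). */
atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* Test-and-set: atomically write 1 and return the OLD value.
   Old value 0 means the lock was free and the caller now holds it. */
bool tas_try_acquire(void) {
    return !atomic_flag_test_and_set(&lock_flag);
}

/* Release: just clear the memory location, as the slide says. */
void tas_release(void) {
    atomic_flag_clear(&lock_flag);
}
```

A spinning acquire is then simply `while (!tas_try_acquire()) ;`.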

Atomic Exchanges

Test-and-set is one form of atomic exchange
Atomic swap is a generalization of test-and-set that allows values besides 0 and 1
Compare-and-swap is a further generalization: the value in memory is not changed unless it is equal to the test value supplied
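Both generalizations exist as C11 atomics; this sketch (with illustrative names) shows the semantics the slide describes:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Compare-and-swap: write new_val only if *loc equals test_val.
   Returns whether the swap happened. */
bool compare_and_swap(atomic_int *loc, int test_val, int new_val) {
    /* On failure, atomic_compare_exchange_strong writes the observed
       value into test_val; we discard that copy here. */
    return atomic_compare_exchange_strong(loc, &test_val, new_val);
}

/* Atomic swap: unconditional exchange returning the old value
   (test-and-set generalized to arbitrary values). */
int atomic_swap(atomic_int *loc, int new_val) {
    return atomic_exchange(loc, new_val);
}
```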

Fetch-and-Θ

Generic name for fetch-and-add, fetch-and-store, etc.
Can be used as a test-and-set (since it is an atomic exchange) but is more general; will be used for barriers (see later)
Introduced by the designers of the NYU Ultracomputer, whose interconnection network allowed combining
–If two fetch-and-adds have the same destination, they can be combined in the network; the combined result then has to be forked (split) on the return path
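The key property, sketched below with C11 atomics, is that fetch-and-add returns the old value, so concurrent callers each get a distinct number:

```c
#include <stdatomic.h>

/* Fetch-and-add: atomically add inc to *loc and return the OLD value.
   With inc = 1, each caller receives a unique sequence number, which
   is exactly what the barrier and queue-lock slides later rely on. */
int fetch_and_add(atomic_int *loc, int inc) {
    return atomic_fetch_add(loc, inc);
}
```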

Full/Empty Bits

Based on the producer-consumer paradigm
Each memory location has a synchronization bit associated with it
–Bit = 0 indicates the value has not been produced (empty)
–Bit = 1 indicates the value has been produced (full)
A write stalls until the bit is empty (0); after the write, the bit is set to full (1)
A read stalls until the bit is full (1) and then empties it
Not all load/store instructions need to test the bit; only those used for synchronization do (special opcodes)
First implemented in the HEP and later in the Tera
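A software sketch of the idea (hypothetical names; in HEP/Tera the bit and the stalling are in hardware, so the `try_` functions below return failure where the hardware would stall):

```c
#include <stdbool.h>

/* Each cell pairs a value with its full/empty bit:
   false = empty (not yet produced), true = full (produced). */
typedef struct {
    int  value;
    bool full;
} fe_cell;

/* Synchronizing write: only proceeds on an empty cell, then sets full. */
bool fe_try_write(fe_cell *c, int v) {
    if (c->full) return false;   /* hardware would stall here */
    c->value = v;
    c->full  = true;
    return true;
}

/* Synchronizing read: only proceeds on a full cell, then empties it. */
bool fe_try_read(fe_cell *c, int *out) {
    if (!c->full) return false;  /* hardware would stall here */
    *out = c->value;
    c->full = false;
    return true;
}
```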

Faking Atomicity

Instead of an atomic exchange, have an instruction pair that can be shown to have operated in an atomic fashion
Load locked (ll) + store conditional (sc) (Alpha)
–sc detects whether the value of the memory location loaded by ll has been modified; if so it returns 0 (locking fails), otherwise 1 (locking succeeds)
–Similar to an atomic exchange but does not require a read-modify-write
Implementation
–Use a special register (the link register) to store the address of the memory location accessed by ll. On a context switch, an interrupt, or an invalidation of the block corresponding to that address (by another sc), the register is cleared. If on sc the addresses match, the sc succeeds
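Portable C has no ll/sc pair, but the retry structure of an ll/sc lock acquire can be sketched with a compare-exchange standing in for sc (it succeeds only if the location is unchanged since the load, mirroring sc's 0/1 outcome; names are illustrative):

```c
#include <stdatomic.h>

atomic_int llsc_lock = 0;   /* 0 = free, 1 = held */

void llsc_acquire(void) {
    for (;;) {
        int observed = atomic_load(&llsc_lock);   /* plays the role of ll */
        if (observed != 0)
            continue;                             /* lock held: retry */
        /* Plays the role of sc: the store happens only if nobody has
           written the location since our load; otherwise we loop. */
        if (atomic_compare_exchange_weak(&llsc_lock, &observed, 1))
            return;                               /* "sc returned 1" */
    }
}

void llsc_release(void) {
    atomic_store(&llsc_lock, 0);
}
```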

Spin Locks

Repeatedly try to acquire the lock
Test-and-set in a cache-coherent environment (invalidation-based):
–The bus is utilized during the whole read-modify-write cycle
–Since test-and-set writes a memory location, it must send an invalidate (even if the lock is not acquired)
–In general the loop testing the lock is short, so there is lots of bus contention
–Possibility of "exponential back-off" (as in the Ethernet protocol, to avoid too many collisions)
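The back-off variant can be sketched as follows (a minimal, illustrative version: the delay loop and cap are arbitrary choices, not from the slides):

```c
#include <stdatomic.h>

atomic_int eb_lock = 0;   /* 0 = free, 1 = held */

/* Test-and-set with exponential back-off: after each failed attempt,
   wait roughly twice as long before retrying (bounded), which thins
   out invalidation traffic under contention, much as Ethernet's
   back-off thins out collisions. */
void eb_acquire(void) {
    unsigned delay = 1;
    while (atomic_exchange(&eb_lock, 1) != 0) {   /* test-and-set */
        for (volatile unsigned i = 0; i < delay; i++)
            ; /* waste time off the bus */
        if (delay < 1024)
            delay *= 2;   /* exponential growth, capped */
    }
}

void eb_release(void) {
    atomic_store(&eb_lock, 0);
}
```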

Test and Test-and-Set

Replace "test-and-set" with "test and test-and-set"
–Keep the test (read) local to the cache
–First test in the cache (not atomic); if the lock cannot be acquired, repeatedly test in the cache (no bus transaction)
–On a lock release (writing 0 to the memory location), all other cached copies of the lock are invalidated
–There is still a race for acquiring a lock that has just been released (O(n^2) bus transactions for n contending processes)
Can use ll + sc, but there is still a race when the lock is released
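In code, the scheme is a plain read spinning in the local cache, with the expensive atomic attempted only when the lock looks free (a sketch with illustrative names):

```c
#include <stdatomic.h>

atomic_int ttas_lock = 0;   /* 0 = free, 1 = held */

void ttas_acquire(void) {
    for (;;) {
        /* First "test": a plain read that hits in the local cache,
           generating no bus traffic while the lock is held. */
        while (atomic_load(&ttas_lock) == 1)
            ; /* spin locally */
        /* Lock looks free: now try the expensive atomic exchange. */
        if (atomic_exchange(&ttas_lock, 1) == 0)
            return;       /* old value was 0: we got it */
        /* Lost the race after the release; go back to local spinning. */
    }
}

void ttas_release(void) {
    atomic_store(&ttas_lock, 0);  /* invalidates the spinners' copies */
}
```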

Queuing Locks

Basic idea: a queue of waiting processors is maintained in shared memory for each lock (best for bus-based machines)
–Each processor performs an atomic operation to obtain a memory location (an element of an array) on which to spin
–Upon a release, the lock can be directly handed off to the next waiting processor

Software Implementation

lock struct { int Queue[P]; int Queuelast; }  /* for P processors; initially Queue[0] = 0, all other entries 1, Queuelast = 0 */

ACQUIRE:
  myplace := fetch-and-add(lock->Queuelast);
  while (lock->Queue[myplace mod P] == 1)
    ; /* spin */
  lock->Queue[myplace mod P] := 1;  /* re-arm the slot for its next use */

RELEASE:
  lock->Queue[(myplace + 1) mod P] := 0;

–The release invalidates the cached value in the next waiting processor, which can then fetch the new value stored in the array
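A runnable version of the pseudocode with C11 atomics (the type and function names are illustrative; P = 4 is an assumed processor count):

```c
#include <stdatomic.h>

#define P 4   /* assumed number of processors */

typedef struct {
    atomic_int queue[P];    /* 0 = "your turn", 1 = "wait" */
    atomic_int queuelast;   /* next slot to hand out */
} qlock_t;

void qlock_init(qlock_t *l) {
    atomic_store(&l->queue[0], 0);            /* first arrival proceeds */
    for (int i = 1; i < P; i++)
        atomic_store(&l->queue[i], 1);
    atomic_store(&l->queuelast, 0);
}

/* Returns myplace, which the caller must pass back to release. */
int qlock_acquire(qlock_t *l) {
    int myplace = atomic_fetch_add(&l->queuelast, 1);
    while (atomic_load(&l->queue[myplace % P]) == 1)
        ; /* spin on a private slot: no contention among waiters */
    atomic_store(&l->queue[myplace % P], 1);  /* re-arm slot for reuse */
    return myplace;
}

void qlock_release(qlock_t *l, int myplace) {
    /* Hand the lock directly to the next waiter in line. */
    atomic_store(&l->queue[(myplace + 1) % P], 0);
}
```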

Queuing Locks (hardware implementation)

Can be done several ways via directory controllers
Associate a syncbit (aka full/empty bit) with each block in memory (a single lock will be in that block)
–Test-and-set the syncbit to acquire the lock
–Unset it to release
–Special operation (QOLB): a non-blocking operation that enqueues the processor for that lock if it is not already in the queue. Can be done in advance, like a prefetch operation
Have to be careful if a process is context-switched (possibility of deadlock)

Barriers

All processes have to wait at a synchronization point
–E.g., at the end of parallel do loops
Processes don't progress until they all reach the barrier
Low-performance implementation: use a counter initialized to the number of processes
–When a process reaches the barrier, it decrements the counter atomically (fetch-and-add(-1)) and busy-waits
–When the counter reaches zero, all processes are allowed to progress (broadcast)
Lots of possible optimizations (tree, butterfly, etc.)
–Are they important? Barriers do not occur that often (Amdahl's law...)
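The counter scheme can be sketched as below (illustrative names; the busy-wait itself is left out so the arrival logic can run in a single thread: the last arrival, which sees the counter hit zero, is the one that would broadcast the release):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Centralized counter barrier: count starts at the number of processes. */
typedef struct {
    atomic_int count;
} barrier_t;

void barrier_init(barrier_t *b, int nprocs) {
    atomic_store(&b->count, nprocs);
}

/* Each arrival decrements atomically (fetch-and-add of -1).
   Returns true for the last arrival (old value 1, i.e. counter now 0),
   which would then release the busy-waiting others. */
bool barrier_arrive(barrier_t *b) {
    return atomic_fetch_add(&b->count, -1) == 1;
}
```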