Jeremy Denham April 7, 2008

 Motivation  Background / Previous work  Experimentation  Results  Questions

• Modern processor design trends are primarily concerned with the multi-core design paradigm.
• Still figuring out what to do with them
• Different way of thinking about “shared-memory multiprocessors”
• Distributed apps?
• Synchronization will be important.

• Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, Mellor-Crummey & Scott
• Scalable, busy-wait synchronization algorithms
• No memory or interconnect contention
• O(1) remote references per use of each mechanism
• Spin locks and barriers

• “Spin” on the lock by busy-waiting until it becomes available.
• Typically involves “fetch-and-Φ” operations
• Must be atomic!

• “Test-and-set”
  ▪ Needs processor support to make it atomic
  ▪ “fetch-and-store”: xchg on x86
• Loop until the lock is acquired
• Expensive!
  ▪ Frequently accessed, too
  ▪ Interconnect (network) contention issues
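A minimal test-and-set spin-lock sketch (an editorial illustration, not code from the presentation), using C11 <stdatomic.h>; atomic_exchange stands in for the atomic xchg the slide mentions:

```c
#include <stdatomic.h>

typedef _Atomic int tas_lock_t;      /* 0 = free, 1 = held */

void tas_acquire(tas_lock_t *l) {
    /* fetch-and-store: atomic_exchange compiles to xchg on x86 */
    while (atomic_exchange(l, 1) == 1)
        ;                            /* loop until we observed the lock free */
}

void tas_release(tas_lock_t *l) {
    atomic_store(l, 0);
}
```

Every waiting processor keeps writing the same shared location, which is exactly the traffic problem the slide calls out.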

• Can reduce fetch-and-Φ ops to one per lock acquisition
• FIFO service guarantee
• Two counters:
  ▪ Requests
  ▪ Releases
• fetch_and_increment the request counter
• Wait until the release counter reflects your turn
• Still problematic…
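A ticket-lock sketch with the two counters described above (again an editorial C11 illustration; the names are invented):

```c
#include <stdatomic.h>

typedef struct {
    _Atomic unsigned request;   /* next ticket to hand out */
    _Atomic unsigned release;   /* ticket currently being served */
} ticket_lock_t;                /* initialize both counters to 0 */

void ticket_acquire(ticket_lock_t *l) {
    /* exactly one fetch-and-increment per acquisition */
    unsigned my_turn = atomic_fetch_add(&l->request, 1);
    while (atomic_load(&l->release) != my_turn)
        ;                       /* spin until the release counter reflects our turn */
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->release, 1);   /* admit the next ticket holder, FIFO */
}
```

The “still problematic” part is that every waiter spins on the same release counter, so contention on that one location remains.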

• T. E. Anderson’s array-based queueing lock
• Incoming processes put themselves in the queue
• The lock holder hands off the lock to the next in the queue
• Faster than the ticket lock, but uses more space
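A sketch of an array-based queueing lock in this spirit (an editorial illustration; MAX_THREADS and the names are assumptions, and a real version pads each slot to its own cache line):

```c
#include <stdatomic.h>

#define MAX_THREADS 64                 /* upper bound on concurrent waiters (assumption) */

typedef struct {
    _Atomic int slots[MAX_THREADS];    /* slot i == 1 means "holder of ticket i may go" */
    _Atomic unsigned next_slot;
} array_lock_t;

void array_lock_init(array_lock_t *l) {
    for (int i = 0; i < MAX_THREADS; i++)
        atomic_store(&l->slots[i], 0);
    atomic_store(&l->slots[0], 1);     /* the first arrival may proceed immediately */
    atomic_store(&l->next_slot, 0);
}

unsigned array_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next_slot, 1) % MAX_THREADS;
    while (!atomic_load(&l->slots[me]))
        ;                              /* each waiter spins on its own slot */
    atomic_store(&l->slots[me], 0);    /* consume the grant so the slot can be reused */
    return me;                         /* caller passes this to array_release */
}

void array_release(array_lock_t *l, unsigned me) {
    atomic_store(&l->slots[(me + 1) % MAX_THREADS], 1);   /* hand off to the next in queue */
}
```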

• FIFO guarantee
• Local spinning!
• Small constant amount of space
• Cache coherence a non-issue

• Each processor allocates a record:
  ▪ next link
  ▪ boolean flag
• Adds itself to the queue
• Spins locally
• The owner passes the lock to the next user in the queue as necessary
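A compact MCS acquire/release sketch in C11 (an editorial rendering of the mechanism described above, not the authors’ code):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;        /* next link */
    _Atomic int locked;                   /* boolean flag, spun on locally */
} mcs_node_t;

typedef mcs_node_t *_Atomic mcs_lock_t;   /* tail of the queue, NULL when free */

void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, 1);
    mcs_node_t *pred = atomic_exchange(lock, me);   /* add our record to the queue tail */
    if (pred != NULL) {                             /* lock was held: link behind predecessor */
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))
            ;                                       /* spin only on our own flag */
    }
}

void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;                                 /* nobody waiting: lock is now free */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                       /* successor is mid-enqueue: wait for the link */
    }
    atomic_store(&succ->locked, 0);                 /* pass the lock to the next user */
}
```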

• Mechanism for “phase separation”
• Blocks processes from proceeding until all others have reached a checkpoint
• Designed for repetitive use

• “Local” and “global” sense
• As each processor arrives:
  ▪ Reverse its local sense
  ▪ Signal its arrival
  ▪ If it is the last to arrive, reverse the global sense
  ▪ Else spin
• Lots of spinning…
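A centralized sense-reversing barrier sketch of the steps listed above (an editorial C11 illustration):

```c
#include <stdatomic.h>

typedef struct {
    _Atomic int count;     /* processors still to arrive in this episode */
    _Atomic int sense;     /* global sense */
    int total;             /* number of participating processors */
} central_barrier_t;       /* initialize count = total, sense = 0 */

/* Each thread keeps its own local_sense, initially 0. */
void central_barrier(central_barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;                  /* reverse local sense */
    if (atomic_fetch_sub(&b->count, 1) == 1) {     /* signal arrival; the last one in... */
        atomic_store(&b->count, b->total);         /* ...re-arms the barrier */
        atomic_store(&b->sense, *local_sense);     /* and reverses the global sense */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                      /* everyone else spins on the global sense */
    }
}
```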

• Barrier information is “disseminated” algorithmically
• In each synchronization round k, processor i signals processor (i + 2^k) mod P, where P is the number of processors
• Similarly, processor i continues when it is signaled by processor (i - 2^k) mod P
• log(P) operations on the critical path, P log(P) remote operations in total
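A single-use dissemination-barrier sketch showing this signaling pattern (an editorial illustration; P and LOG_P are assumptions, and the paper’s version adds parity/sense bits so the flags can be reused across barrier episodes):

```c
#include <stdatomic.h>

#define P      8                       /* number of processors (assumption) */
#define LOG_P  3                       /* ceil(log2(P)) */

/* flag[k][i] is set once processor i has been signaled in round k. */
static _Atomic int flag[LOG_P][P];

void dissemination_barrier_once(int i) {
    for (int k = 0; k < LOG_P; k++) {
        int partner = (i + (1 << k)) % P;          /* processor (i + 2^k) mod P */
        atomic_store(&flag[k][partner], 1);        /* signal it */
        while (!atomic_load(&flag[k][i]))
            ;                                      /* wait for the signal from (i - 2^k) mod P */
    }
}
```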

• Tree-based approach
• Outcome statically determined
• “Roles” for each round:
  ▪ The “loser” notifies the “winner,” then drops out
  ▪ The “winner” waits to be notified, then participates in the next round
  ▪ The “champion” sets a global flag when the barrier is over
• log(P) rounds
• Heavy interconnect traffic…
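A simplified tournament-barrier sketch along these lines (an editorial illustration: P is assumed to be a power of two, and dropped-out processors spin on the single champion flag, as the slide describes, rather than on per-processor wakeup flags):

```c
#include <stdatomic.h>

#define P      8                        /* power-of-two thread count (assumption) */
#define LOG_P  3

static _Atomic int arrive[LOG_P][P];    /* arrive[k][w]: the round-k loser has notified winner w */
static _Atomic int champion_sense;      /* flipped by processor 0 when the barrier completes */

void tournament_barrier(int i, int *local_sense) {  /* local_sense starts at 0 per thread */
    *local_sense = !*local_sense;
    for (int k = 0; k < LOG_P; k++) {
        if (i % (1 << (k + 1)) == 0) {
            /* statically determined "winner": wait for this round's loser, then advance */
            while (atomic_load(&arrive[k][i]) != *local_sense)
                ;
        } else {
            /* statically determined "loser": notify the winner, then drop out */
            atomic_store(&arrive[k][i - (1 << k)], *local_sense);
            break;
        }
    }
    if (i == 0)
        atomic_store(&champion_sense, *local_sense);   /* "champion" sets the global flag */
    else
        while (atomic_load(&champion_sense) != *local_sense)
            ;                                          /* wait for the champion */
}
```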

• Also tree-based
• Local spinning
• O(P) space for P processors
• (2P – 2) network transactions
• O(log P) network transactions on the critical path

• Uses two P-node trees
• A “child-not-ready” flag for each child is kept in the parent
• When all of its children have signaled arrival, a parent signals its own parent
• When the root detects that all children have arrived, it signals the group that it can proceed to the next barrier.
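A much-simplified sketch of the arrival-tree idea (an editorial illustration with one binary arrival tree and a single global release flag; the paper’s construction uses two trees and per-child “child-not-ready” flags so that every spin is on a processor-local location):

```c
#include <stdatomic.h>

#define P 8                            /* number of threads; node i's parent is (i-1)/2 (assumption) */

typedef struct {
    _Atomic int not_ready;             /* children that have not yet signaled arrival */
    int num_children;
} tree_node_t;

static tree_node_t nodes[P];           /* node i has children 2i+1 and 2i+2 (when < P) */
static _Atomic int release_sense;      /* flipped by the root to release the group */

void tree_barrier_init(void) {
    for (int i = 0; i < P; i++) {
        int kids = (2 * i + 1 < P) + (2 * i + 2 < P);
        nodes[i].num_children = kids;
        atomic_store(&nodes[i].not_ready, kids);
    }
    atomic_store(&release_sense, 0);
}

void tree_barrier(int i, int *local_sense) {        /* local_sense starts at 0 per thread */
    *local_sense = !*local_sense;
    while (atomic_load(&nodes[i].not_ready) != 0)
        ;                                           /* wait for our children to arrive */
    atomic_store(&nodes[i].not_ready, nodes[i].num_children);   /* re-arm for the next episode */
    if (i == 0) {
        atomic_store(&release_sense, *local_sense);              /* root: release the group */
    } else {
        atomic_fetch_sub(&nodes[(i - 1) / 2].not_ready, 1);      /* signal our parent */
        while (atomic_load(&release_sense) != *local_sense)
            ;                                                    /* wait for the root's release */
    }
}
```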

• Experiments were done on BBN Butterfly 1 and Sequent Symmetry Model B machines
• BBN:
  ▪ Supports up to 256 processor nodes
  ▪ 8 MHz MC68000
• Sequent:
  ▪ Supports up to 30 processor nodes
  ▪ 16 MHz Intel 80386
• Most concerned with the Sequent

• Want to extend the work to multi-core machines
• Scalability of limited usefulness (not that many cores)
• Shared resources
• Core load

• Intel Centrino Duo T5200 processor
  ▪ Two cores
  ▪ 1.60 GHz per core
  ▪ 2 MB L2 cache
• Windows Vista
• 2 GB DDR2 memory

• Evaluate the basic and MCS approaches
• Simple and complex evaluations
• Core pinning
• Load ramping

• Code porting
  ▪ Lots of Linux-specific code
• Win32 thread API
  ▪ Esoteric…
  ▪ How to pin a thread to a core?
• Timing
  ▪ Win32 μsec-granularity measurement
  ▪ Surprisingly archaic C code
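For the two Win32 details mentioned above, something along these lines is the usual approach (a sketch under editorial assumptions about the harness, not the presenter’s code): SetThreadAffinityMask pins the calling thread to one core, and QueryPerformanceCounter provides microsecond-granularity timestamps.

```c
#include <windows.h>

/* Pin the calling thread to a single core (bit n of the mask = logical processor n). */
void pin_to_core(int core) {
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << core);
}

/* Microsecond-granularity interval measurement with the performance counter. */
double elapsed_usec(LARGE_INTEGER start, LARGE_INTEGER end) {
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);              /* counter ticks per second */
    return (double)(end.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart;
}

/* Usage: QueryPerformanceCounter(&start); ...timed region...; QueryPerformanceCounter(&end); */
```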

• Spin lock base code ported
• Barriers nearly done
• Simple experiments for spin locks done
• More complex ones on the way

• Simple spin lock tests
• The simple lock outperforms MCS on:
  ▪ Empty critical section
  ▪ Simple FP critical section
  ▪ Single core
  ▪ Dual core
• More procedural overhead for MCS at small scale
• Next steps:
  ▪ More threads!
  ▪ More critical section complexity