Synchronization (Barriers) Parallel Processing (CS453)

Need for Barriers
In iterative computations, the inner loops (e.g. over data) can often be parallelized, whereas an outer loop (e.g. over time steps) cannot:
– There are true dependencies between computation steps (stages).
Why barriers are needed:
– The next computation stage needs results computed by several processes in the previous stage(s), so it can start only after all processes complete the previous stage.
– Barrier synchronization among processes is therefore required at the end of each computation stage.

Barrier Synchronization
A barrier is a point that all processes must reach before any may proceed.
Barrier designs:
– Centralized: counter barrier, coordinator barrier
– Decentralized: combining tree barrier, symmetric barriers
Barriers can be implemented in software using:
– Locks, flags, and counters – busy-waiting at the barrier
– Locks and condition variables – blocking at the barrier

Counter Barrier
Uses a shared counter to count the processes that have arrived at the barrier.
A lock or an atomic Fetch&Increment can be used to increment the counter atomically.
This works as a single-use barrier (see the sketch below), but things get more complex if, as is more likely, we need the barrier to be reusable. (Why?)
We must be sure that all processes have moved past the barrier before reusing it.
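A minimal sketch of the single-use version in C11 atomics; the names (barrier_t, barrier_wait, NTHREADS) are illustrative, not from the slides:

#include <stdatomic.h>

#define NTHREADS 8                      /* assumed fixed thread count */

typedef struct {
    atomic_int count;                   /* number of threads arrived */
    atomic_int go;                      /* release flag, set by the last arriver */
} barrier_t;

void barrier_wait(barrier_t *b) {
    /* atomic fetch-and-increment records this thread's arrival */
    if (atomic_fetch_add(&b->count, 1) == NTHREADS - 1)
        atomic_store(&b->go, 1);        /* last thread releases everyone */
    else
        while (!atomic_load(&b->go))    /* busy-wait on the go flag */
            ;
}

This barrier cannot be reused as-is: a fast thread could re-enter it and increment the counter again while slow threads are still leaving, which is exactly the reuse problem the next slide addresses.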

Reusable Counter Barrier with Sense Reversal
Each process waits for a binary flag to take a different value on consecutive barrier episodes.
The last arriving process toggles this value only when all processes have reached this instance of the barrier (see the sketch below).
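A sketch of how sense reversal makes the counter barrier reusable, again in C11 atomics with illustrative names; local_sense is assumed to be per-thread state (e.g. a thread-local variable):

#include <stdatomic.h>

#define NTHREADS 8

typedef struct {
    atomic_int count;                   /* arrivals in the current episode */
    atomic_int sense;                   /* toggled by the last arriver */
} sr_barrier_t;

void sr_barrier_wait(sr_barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;       /* flip my private sense first */
    if (atomic_fetch_add(&b->count, 1) == NTHREADS - 1) {
        atomic_store(&b->count, 0);     /* reset the counter before release */
        atomic_store(&b->sense, *local_sense);  /* release everyone */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                           /* spin until the sense flips */
    }
}

Because each episode waits for a different sense value, threads from episode k cannot be confused with threads from episode k+1, and the counter is safely reset by the last arriver before anyone is released.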

Evaluation of the Counter Barrier
Storage cost is low: one counter (with a lock) and one flag.
The counter barrier is efficient if the machine has:
– An atomic fetch&op instruction
– Coherent caches (in a multiprocessor), so processes can spin on the go flag
The main problem is memory contention:
– All n processes increment the shared counter.
– Even with coherent caches, where the n-1 waiting processes spin on a cached copy of the go flag, all n-1 must re-read the flag when it is written by the very last process to arrive at the barrier.

Point-to-Point Flag Synchronization
Flag synchronization delays a process until a binary flag is set (or cleared) by another process.
– It is a point-to-point signaling mechanism used for condition synchronization (to wait for a condition, or to signal it).
To reuse a synchronization flag safely:
– The process that waits for a flag to be set is the one that should clear that flag.
– The flag should not be set again until it is known to be cleared.
Typical usage (see the sketch below):
– Proc 1 sets the flag ("sends a signal"): while (flag) continue; flag = 1;
– Proc 2 clears the flag ("receives a signal"): while (!flag) continue; flag = 0;
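The two reuse rules can be written as a pair of C functions; a hedged sketch with an assumed shared flag and illustrative function names:

#include <stdatomic.h>

atomic_int flag;                        /* 0 = clear, 1 = set */

/* Proc 1: "sends a signal" by setting the flag */
void send_signal(void) {
    while (atomic_load(&flag)) ;        /* rule 2: don't set until cleared */
    atomic_store(&flag, 1);
}

/* Proc 2: "receives a signal" and clears the flag */
void receive_signal(void) {
    while (!atomic_load(&flag)) ;       /* wait for the condition */
    atomic_store(&flag, 0);             /* rule 1: the waiter clears the flag */
}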

Use of Flags for Barriers
Idea:
– To signal arrival, use an array of arrival flags (one flag per process) instead of the shared counter; each process sets its arrival flag when it arrives at the barrier.
– To release, use an array of release flags (one flag per process) instead of a single release flag; each process waits for its release flag to be set before exiting the barrier.
– To avoid false sharing, store the flags in different memory blocks (e.g. different cache lines).
– A coordinator process can be used to control the barrier.
This approach avoids contention for the centralized counter and the single release flag.

Coordinator Barrier
The coordinator process waits for all arrival flags to be set, then sets the continue flags to release the processes waiting at the barrier (sketched below).
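A sketch of the coordinator barrier under the same flag-reuse rules; arrive[] and go[] are assumed per-worker flags (padded to separate cache lines in real code, per the false-sharing note above):

#include <stdatomic.h>

#define NWORKERS 8

atomic_int arrive[NWORKERS];            /* worker i sets arrive[i] */
atomic_int go[NWORKERS];                /* coordinator sets go[i] */

void worker_barrier(int i) {
    atomic_store(&arrive[i], 1);        /* announce my arrival */
    while (!atomic_load(&go[i])) ;      /* wait for my release flag */
    atomic_store(&go[i], 0);            /* I waited on it, so I clear it */
}

void coordinator_barrier(void) {
    for (int i = 0; i < NWORKERS; i++) {
        while (!atomic_load(&arrive[i])) ;  /* wait for each arrival */
        atomic_store(&arrive[i], 0);        /* coordinator waited: it clears */
    }
    for (int i = 0; i < NWORKERS; i++)
        atomic_store(&go[i], 1);            /* release all workers */
}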

Shortcomings of the Coordinator Barrier
Needs an extra process (ideally on its own processor) and 2n flags.
Poor scalability: waiting and releasing times are proportional to the number of processes in the barrier, i.e. O(n).
Enhancement: eliminate the coordinator and distribute its work among the processes (workers):
– Combining tree barrier
– Symmetric barriers

Combining Tree Barrier
A combining tree is a tree of processes in which each process combines the results of its children (if any) and passes them to its parent (if any).
In a combining tree barrier, arrival signals (flags) are sent up the tree (from the leaves to the root), and continue signals are sent down the tree (from the root to all nodes).
A worker node at the barrier:
– Waits for signals from its children (if any),
– Signals its parent (if any) that it too has arrived,
– Waits for a continue signal from its parent (if any),
– Signals its children (if any) to continue,
– Exits the barrier.

Code of Combining (Binary) Tree Barrier
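A sketch of a binary combining-tree barrier, assuming processes are numbered 1..NTHREADS in heap order (the children of node i are 2i and 2i+1); all names are illustrative:

#include <stdatomic.h>

#define NTHREADS 8

atomic_int arrive[NTHREADS + 1];        /* arrival flag per node */
atomic_int cont[NTHREADS + 1];          /* continue flag per node */

static void wait_and_clear(atomic_int *f) {
    while (!atomic_load(f)) ;
    atomic_store(f, 0);                 /* the waiter clears the flag */
}

void tree_barrier(int i) {
    int l = 2 * i, r = 2 * i + 1;
    if (l <= NTHREADS) wait_and_clear(&arrive[l]);  /* children arrive */
    if (r <= NTHREADS) wait_and_clear(&arrive[r]);
    if (i > 1) {                        /* non-root: tell parent, then wait */
        atomic_store(&arrive[i], 1);
        wait_and_clear(&cont[i]);
    }
    if (l <= NTHREADS) atomic_store(&cont[l], 1);   /* release children */
    if (r <= NTHREADS) atomic_store(&cont[r], 1);
}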

Evaluation of the Combining Tree Barrier
Uses 2n flags – the same number of flags as the coordinator barrier.
Latency is O(log n).
Shortcomings:
– Too many signals – not a good fit for a bus-based SMP
– Asymmetric and harder to program
Design options:
– Use one continue flag for all waiters, set by the worker at the root of the tree; alternate the sense of the flag for reuse.
– Or use two continue flags and alternate between them for reuse.

Symmetric Barriers
Idea:
– Construct an n-process barrier from 2-process barriers executed in several rounds.
– Change the pairing of the 2-process barriers in each round so that after log n rounds all processes are synchronized, i.e. all have passed the barrier.
– In an n-process barrier, each process synchronizes with log n others in total.
Symmetric barriers differ in their pairing schemes:
– Butterfly barrier
– Dissemination barrier
– others

A Two-Process Barrier
A process signals its arrival by setting a flag, and then waits for the other flag to be set (i.e. for the other process to arrive).
The first await ensures that the process's own flag has been seen and cleared in the previous round.
The last line clears the second flag for the next round (see the sketch below).
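A sketch of one common formulation of this two-process barrier, assuming arrive[] holds one arrival flag per process:

#include <stdatomic.h>

#define NPROCS 2

atomic_int arrive[NPROCS];              /* one arrival flag per process */

void barrier2(int i, int j) {           /* proc i synchronizes with proc j */
    while (atomic_load(&arrive[i])) ;   /* my flag from the previous round
                                           must have been cleared already */
    atomic_store(&arrive[i], 1);        /* announce my arrival */
    while (!atomic_load(&arrive[j])) ;  /* wait for my partner */
    atomic_store(&arrive[j], 0);        /* clear partner's flag for the
                                           next round */
}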

Butterfly Barrier
Suitable when the number of processes n is a power of 2.
log2 n rounds of pair-wise barriers:
– In round k, process i synchronizes with the process at distance 2^(k-1) away.
– In total, each process "barriers" with log2 n others.
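Since the round-k partner differs from i in bit k-1 of the process index, the schedule is a short loop; this sketch reuses the hypothetical barrier2 above, with arrive[] sized for n processes:

/* butterfly schedule: n must be a power of two */
void butterfly_barrier(int i, int n) {
    for (int dist = 1; dist < n; dist <<= 1) {  /* dist = 2^(k-1) */
        int partner = i ^ dist;                 /* flip bit k-1 of i */
        barrier2(i, partner);
    }
}

Note that reusing the same flags across rounds is precisely what causes the race conditions discussed two slides below.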

Dissemination Barrier
Suitable for any number of processes n.
ceil(log2 n) rounds:
– In round k, process i signals the process at distance 2^(k-1) away, i.e. process (i + 2^(k-1)) mod n.
– A process disseminates notice of its arrival: it sets the flag of a process to its right and waits for its own flag to be set, then clears it.
– In total, each process signals log2 n others (sketched below).
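A sketch of one dissemination episode for process i of n, reusing one flag per process; as with the butterfly, reusing flags across rounds is subject to the races discussed on the next slide:

#include <stdatomic.h>

extern atomic_int arrive[];                     /* one flag per process */

void dissemination_barrier(int i, int n) {
    for (int dist = 1; dist < n; dist <<= 1) {  /* ceil(log2 n) rounds */
        int partner = (i + dist) % n;           /* process to my "right" */
        while (atomic_load(&arrive[partner])) ; /* partner's flag clear yet? */
        atomic_store(&arrive[partner], 1);      /* notify my partner */
        while (!atomic_load(&arrive[i])) ;      /* wait for my own flag */
        atomic_store(&arrive[i], 0);            /* clear it for next round */
    }
}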

Race Conditions in Symmetric Barriers
The race conditions result from reusing the same flags across multiple instances of the basic two-process barrier, and they may cause deadlock.
For example, consider the butterfly barrier:
– Suppose proc 3 has synchronized with proc 4 in round 1 and proceeds to synchronize with proc 1 in round 2.
– Proc 1 is still in round 1: it sets ar[1] (for proc 2, which is late) and waits for ar[2] to be set.
– Proc 3 clears ar[1], which proc 1 had set for proc 2 in the 1st round, and proceeds to the next round with proc 7.

Avoiding Race Conditions
Use a different set of flags in each round of the n-process barrier: log n sets of n flags.
Better yet, use counters instead of binary flags, where each counter records the number of the current round the process has executed:
– Use increment (decrement) instead of set (clear).
– Wait for the value of ar[j] to be at least as large as ar[i] (see the sketch below).
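A sketch of the counter-based two-process step, assuming ar[] holds one round counter per process (initialized to zero); because the counters only grow, a stale value from an earlier round can never be mistaken for a fresh arrival:

#include <stdatomic.h>

extern atomic_int ar[];                 /* round counter per process */

void barrier2_counting(int i, int j) {
    int my = atomic_fetch_add(&ar[i], 1) + 1;   /* enter my next round */
    while (atomic_load(&ar[j]) < my) ;          /* wait until j has reached
                                                   a round >= mine */
}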