Shared Memory Systems Miodrag Bolic

Overview Next 4 to 5 lectures

Outline
- Characteristics of shared memory systems
- Programming shared-memory multiprocessors
- Hardware implementation
  - Architectures
  - Memory access
  - Caches
  - Synchronization

Characteristics of shared memory systems [3]
- Any processor can directly reference any memory location.
- Communication occurs implicitly as a result of loads and stores.
- The location of data in memory is transparent to the programmer.
- Inherently provided on a wide range of platforms (standard processors today include specific extra hardware for shared-memory systems).
- Memory may be physically distributed among processors.

Requirements [3]
- Support for memory coherency: the machine must ensure that all processing nodes see an accurate, up-to-date picture of memory.
- Support for atomic operations on data: the machine must allow only one processor at a time to change a given piece of data.
  - Nonatomic operation: one processor requests data and, before the request is answered, another processor changes that data.
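To make the nonatomic case concrete, here is a small illustrative C program (ours, not from the slides; it uses POSIX threads rather than the multiprocessor primitives introduced below). Two threads increment a shared counter with no atomicity, so increments are routinely lost:

    /* Illustrative only: the final value is usually less than 200000
       because "counter++" is a non-atomic read-modify-write.
       'volatile' keeps each increment a separate load/add/store. */
    #include <pthread.h>
    #include <stdio.h>

    volatile int counter = 0;             /* shared, unprotected */

    void *worker(void *arg) {
        for (int i = 0; i < 100000; i++)
            counter++;                    /* read-modify-write, not atomic */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %d (expected 200000)\n", counter);
        return 0;
    }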

Shared Memory Program [3]
Sum all the elements of an array of size n.

    INITIALIZE;                       // assign proc_num and num_procs
    read_array(array_to_sum, size);   // read the array and its size from file
    if (proc_num == 0) {              // initialize the sum
        LOCK(global_sum);
        global_sum = 0;
        UNLOCK(global_sum);
    }
    BARRIER(num_procs);               // ensure the sum is initialized before
                                      // any processor adds to it
    local_sum = 0;
    size_to_sum = size / num_procs;
    lower_ind = size_to_sum * proc_num;
    upper_ind = size_to_sum * (proc_num + 1);
    for (i = lower_ind; i < upper_ind; i++)
        local_sum += array_to_sum[i];
    // if size = 100 and num_procs = 4: proc 0 sums elements 0 to 24,
    // proc 1 sums 25 to 49, etc.
    LOCK(global_sum);                 // lock the sum so only this process can change it
    global_sum += local_sum;
    UNLOCK(global_sum);               // release the lock so other procs can add
    BARRIER(num_procs);               // wait for all num_procs processes to reach this point
    if (proc_num == 0)
        printf("sum is %d", global_sum);
    END;

Multiprocessor Software Functions – Example [3]
- INITIALIZE – assigns each processor in the system a number (proc_num) and sets the total number of processors (num_procs).
- LOCK(data) – allows a processor to "check out" a certain piece of shared data. While one processor holds the lock, no other processor can obtain it. The lock is blocking: once a LOCK is encountered, execution cannot proceed until the lock is obtained.
- UNLOCK(data) – releases a lock so that other processors can obtain it.
- BARRIER(n_procs) – when a BARRIER is encountered, a processor waits at it until n_procs processors have reached it; only then can execution proceed.
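For reference, one plausible way to realize these primitives on a conventional OS is POSIX threads. This mapping is our assumption for illustration, not something the slides prescribe; the helper names (init_primitives, lock_sum, barrier) are hypothetical:

    #include <pthread.h>

    pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;  /* backs LOCK/UNLOCK(data) */
    pthread_barrier_t bar;                                 /* backs BARRIER(n_procs) */

    void init_primitives(int num_procs) {     /* part of INITIALIZE */
        pthread_barrier_init(&bar, NULL, num_procs);
    }

    void lock_sum(void)   { pthread_mutex_lock(&sum_lock); }
    void unlock_sum(void) { pthread_mutex_unlock(&sum_lock); }
    void barrier(void)    { pthread_barrier_wait(&bar); }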

Architecture [3]
- An example of a shared-bus architecture with 4 processors.
- Both static and dynamic networks can be used to connect processors and shared memory.

Natural Extensions of Memory System [4]

Memory Organization [1]

Caches and Cache Coherence [4]
Caches play a key role in all cases:
- They reduce average data access time.
- They reduce the bandwidth demands placed on the shared interconnect.
But private processor caches create a problem:
- Copies of a variable can be present in multiple caches.
- A write by one processor may not become visible to others, which keep accessing the stale value in their caches.
- This is the cache coherence problem: actions must be taken to ensure that writes become visible.

Inconsistency in Data Sharing
Suppose two processors each read a data item X from shared memory. Each processor's cache then holds a copy of X that is consistent with the shared-memory copy. Now suppose one processor modifies X (to X'). That processor's cache is now inconsistent with the other processor's cache and with shared memory.
- With a write-through cache, the shared-memory copy is made consistent immediately, but the other processor still holds the stale value X.
- With a write-back cache, the shared-memory copy is updated only eventually, when the block containing X (now X') is replaced or invalidated.
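A concrete trace may help. Assuming a two-processor bus-based machine with write-back caches and X initially 5 (values illustrative):

    Step  Action            P1 cache    P2 cache    Memory
    1     P1 reads X        5           --          5
    2     P2 reads X        5           5           5
    3     P1 writes X = 7   7           5 (stale)   5 (stale until write-back)

With a write-through cache, step 3 would update memory to 7 immediately, but P2's cached copy would still read 5.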

Cache coherence

Mutual Exclusion [4]
- Provided by LOCK-UNLOCK around a critical section: a set of operations we want to execute atomically.
- The implementation of LOCK/UNLOCK must guarantee mutual exclusion.
- Can lead to significant serialization if contended, especially since non-local accesses are expected inside the critical section.
- "Mutex" stands for "mutual exclusion".

Simple Software Lock [4]

    lock:    ld   register, location   /* copy location to register */
             cmp  register, #0         /* compare with 0 */
             bnz  lock                 /* if not 0, try again */
             st   location, #1         /* store 1 to mark it locked */
             ret                       /* return control to caller */

    unlock:  st   location, #0         /* write 0 to location */
             ret                       /* return control to caller */

Problem: the lock needs atomicity in its own implementation. The read (test) and write (set) of the lock variable by a process are not atomic, so two processes can both see 0 and both "acquire" the lock.
Solution: atomic read-modify-write (exchange) instructions, which atomically test the value of a location, set it to another value, and somehow return success or failure.

Atomic Exchange Instruction [4]
Specifies a location and a register. In one atomic operation:
- the value in the location is read into the register, and
- another value (possibly a function of the value read) is stored into the location.
Many variants exist. A simple example is test&set:
- the value in the location is read into a specified register;
- the constant 1 is stored into the location;
- the operation succeeded if the value loaded into the register is 0.
Other constants could be used instead of 1 and 0. Test&set can be used to build locks.
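As a modern software analogue, C11's atomic_flag_test_and_set atomically reads a flag and sets it, returning the old value — exactly the test&set behavior described above. A minimal spinlock sketch (our illustration, not the slides' assembly):

    #include <stdatomic.h>

    atomic_flag lock_var = ATOMIC_FLAG_INIT;   /* clear == unlocked */

    void lock(void) {
        while (atomic_flag_test_and_set(&lock_var))
            ;                                  /* old value was set: spin */
    }

    void unlock(void) {
        atomic_flag_clear(&lock_var);          /* write "0" to release */
    }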

Simple Test&Set Lock [4]

    lock:    t&s  register, location   /* atomically read location and set it to 1 */
             bnz  lock                 /* if it was not 0, try again */
             ret                       /* return control to caller */

    unlock:  st   location, #0         /* write 0 to location */

Other read-modify-write primitives, such as swap, can be used too.

Mutual Exclusion: Altera [1]
- A mutex allows cooperating processors to agree that one of them should have mutually exclusive access to a hardware resource in the system.
- The mutex is a component that is added in the SOPC Builder.
- Shared memory can also be accessed without using mutexes.

Altera Mutex [2]

Example: Opening and locking a mutex [2]

    #include <altera_avalon_mutex.h>

    /* get the mutex device handle */
    alt_mutex_dev* mutex = altera_avalon_mutex_open("/dev/mutex");

    /* acquire the mutex, setting the value to one */
    altera_avalon_mutex_lock(mutex, 1);

    /*
     * Access a shared resource here.
     */

    /* release the lock */
    altera_avalon_mutex_unlock(mutex);
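Note that unlike the software spinlocks above, which rely on an atomic instruction operating on shared memory, the Altera mutex core is a dedicated hardware peripheral on the system interconnect, so it can arbitrate between Nios II cores even when those cores have no hardware cache coherence.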

Performance Criteria (T&S Lock) [4]
- Uncontended latency: very low if repeatedly accessed by the same processor; independent of p.
- Traffic: lots if many processors compete; poor scaling with p. Each t&s generates invalidations, and all processors rush out again to t&s.
- Storage: very small (a single variable); independent of p.
- Fairness: poor; can cause starvation.
Test&set with backoff is similar but generates less traffic. Luckily, better hardware primitives as well as better algorithms exist.
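To illustrate the backoff idea, a sketch in C11 atomics; the delay loop and the cap of 1024 are arbitrary assumptions of ours. Backing off keeps losers from flooding the bus with t&s traffic:

    #include <stdatomic.h>

    atomic_flag lock_var = ATOMIC_FLAG_INIT;

    static void delay(int n) {
        for (volatile int i = 0; i < n; i++)
            ;                                  /* crude spin delay */
    }

    void lock_with_backoff(void) {
        int backoff = 1;
        while (atomic_flag_test_and_set(&lock_var)) {
            delay(backoff);
            if (backoff < 1024)
                backoff *= 2;                  /* double the wait, capped */
        }
    }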

Improved Hardware Primitives: LL-SC [4]
Goals:
- Test with reads; failed read-modify-write attempts should not generate invalidations.
- Nice if a single primitive can implement a range of read-modify-write operations.
Load-Locked (or Load-Linked), Store-Conditional:
- LL reads the variable into a register.
- Arbitrary instructions may follow to manipulate its value.
- SC tries to store back to the location if and only if no one else has written to the variable since this processor's LL.
- If the SC succeeds, all three steps happened atomically.
- If it fails, it doesn't write or generate invalidations (the processor must retry from the LL).
- Success is indicated by condition codes.

Simple Lock with LL-SC [4]

    lock:    ll   reg1, location   /* LL location to reg1 */
             bnz  reg1, lock       /* if location is locked, try again */
             sc   location, reg2   /* SC reg2 (holding 1) into location */
             beqz reg2, lock       /* if SC failed, start again */
             ret                   /* return control to the caller of lock */

    unlock:  st   location, #0     /* write 0 to location */
             ret

- SC can fail (without putting a transaction on the bus) if it tries to get the bus but another processor's SC gets the bus first.
- LL and SC are not lock and unlock respectively: they only guarantee that no conflicting write to the lock variable occurred between them.
- But they can be used directly to implement simple atomic operations on shared variables.
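Portable C does not expose LL-SC directly, but compare-and-swap plays the same role: a single primitive from which a range of read-modify-write operations can be built. A sketch (ours) of an atomic add via a CAS retry loop, analogous to the LL ... SC loop above:

    #include <stdatomic.h>

    void atomic_add(atomic_int *loc, int amount) {
        int old = atomic_load(loc);            /* like LL: read current value */
        /* like SC: the store succeeds only if nobody wrote in between;
           on failure, 'old' is refreshed with the current value and we retry */
        while (!atomic_compare_exchange_weak(loc, &old, old + amount))
            ;
    }

In practice C11 provides atomic_fetch_add directly; the loop just shows how CAS (or LL-SC) generalizes to arbitrary read-modify-write operations.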

A Simple Centralized Barrier [4]
A shared counter maintains the number of processes that have arrived: increment on arrival (under a lock), then check until it reaches numprocs.

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER(bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;               /* reset flag if first to reach */
        mycount = ++bar_name.counter;        /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                  /* last to arrive */
            bar_name.counter = 0;            /* reset for next barrier */
            bar_name.flag = 1;               /* release waiters */
        }
        else
            while (bar_name.flag == 0) {};   /* busy-wait for release */
    }
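One subtlety in this barrier, discussed in [4], is that resetting the flag at the start of the next barrier episode can race with a slow process still spinning from the previous episode. The standard refinement is a sense-reversing barrier: each process toggles a private sense and waits for the flag to match it. A minimal sketch in the same pseudocode style, using the LOCK/UNLOCK primitives above:

    int local_sense = 0;                     /* private to each process */

    BARRIER(bar_name, p) {
        local_sense = 1 - local_sense;       /* toggle this episode's sense */
        LOCK(bar_name.lock);
        mycount = ++bar_name.counter;        /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                  /* last to arrive */
            bar_name.counter = 0;            /* reset for next episode */
            bar_name.flag = local_sense;     /* release waiters */
        }
        else
            while (bar_name.flag != local_sense) {};  /* spin on this episode's sense */
    }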

References
[1] Altera Corp., Creating Multiprocessor Nios II System Tutorial, 2005.
[2] Altera Corp., Altera Embedded Peripherals Handbook, 2005.
[3] J. Kowalczyk, "Multiprocessor Systems," Xilinx, 2003.
[4] D. Culler, J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.