CS 252 Graduate Computer Architecture Lecture 10: Multiprocessors


CS 252 Graduate Computer Architecture
Lecture 10: Multiprocessors

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs252

Recap: Virtual Memory
- Virtual memory support is standard on general-purpose processors
- Gives each user (or program) the illusion of a separate large protected memory
- Programs can be written independent of machine memory configuration
- Hierarchical page tables exploit the sparseness of virtual address usage to reduce the size of mapping information
- TLB caches translation/protection information to make VM practical
  - Would not be acceptable to have multiple additional memory references for each instruction
- Interaction between TLB lookup and cache tag lookup
  - Want to avoid inconsistencies from virtual address aliases

Uniprocessor Performance (SPECint)
[Chart: SPECint uniprocessor performance vs. year, annotated "3X"; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to present

Déjà vu all over again?
- "... today's processors ... are nearing an impasse as technologies approach the speed of light..."
  - David Mitchell, The Transputer: The Time Is Now (1989)
- Transputer had bad timing (uniprocessor performance kept improving)
  - Procrastination rewarded: 2X sequential performance / 1.5 years
- "We are dedicating all of our future product development to multicore designs. ... This is a sea change in computing"
  - Paul Otellini, President, Intel (2005)
- All microprocessor companies switch to MP (2X CPUs / 2 yrs)
  - Procrastination penalized: 2X sequential performance / 5 yrs
[Table: 2007 multicore chips from AMD, Intel, IBM, and Sun, comparing processors/chip, threads/processor, and threads/chip, with up to 8 processors and 64 threads per chip]

Other Factors → Multiprocessors
- Growth in data-intensive applications
  - Databases, file servers, ...
- Growing interest in servers and server performance
- Increasing desktop performance less important
  - Outside of graphics
- Improved understanding of how to use multiprocessors effectively
  - Especially servers, where there is significant natural TLP
- Advantage of leveraging design investment by replication
  - Rather than a unique design

Flynn's Taxonomy
- Flynn classified machines by their data and control streams in 1966
  - M. J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966
- Single Instruction, Single Data (SISD): conventional uniprocessor
- Single Instruction, Multiple Data (SIMD): single PC; e.g., vector machines, CM-2
- Multiple Instruction, Single Data (MISD): rare; no clear commercial examples
- Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers
- SIMD → Data-Level Parallelism
- MIMD → Thread-Level Parallelism
- MIMD popular because
  - Flexible: N programs or 1 multithreaded program
  - Cost-effective: same MPU in desktop and MIMD machine

Back to Basics
- "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
- Parallel Architecture = Computer Architecture + Communication Architecture

Two Models for Communication and Memory Architecture
- Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (aka multicomputers)
  - Modern cluster systems contain multiple stand-alone computers communicating via messages
- Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either
  - UMA (Uniform Memory Access time): shared address space, centralized memory
  - NUMA (Non-Uniform Memory Access time): shared address space, distributed memory
- In the past, there was confusion over whether "sharing" means sharing physical memory (Symmetric MP) or sharing an address space

Centralized vs. Distributed Memory
[Figure: both organizations are built from n processor+cache nodes on an interconnection network; with centralized memory a single shared memory sits across the network from all processors, while with distributed memory each node has its own local memory module]

Centralized Memory Multiprocessor
- Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors
- Large caches → a single memory can satisfy the memory demands of a small number of processors
- Can scale to a few dozen processors by using a switch and by using many memory banks
- Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing centralized memory increases

Distributed Memory Multiprocessor
- Pro: Cost-effective way to scale memory bandwidth
  - If most accesses are to local memory
- Pro: Reduces latency of local memory accesses
- Con: Communicating data between processors is more complex
- Con: Software must be aware of data placement to take advantage of increased memory bandwidth

Challenges of Parallel Processing
- Big challenge is the fraction of the program that is inherently sequential
  - What does it mean to be inherently sequential?
- Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential? (See the worked calculation below.)
  - 10%
  - 5%
  - 1%
  - <1%
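
A quick Amdahl's Law check (not on the original slide), writing s for the inherently sequential fraction of the program:

\mathrm{Speedup} = \frac{1}{\,s + \frac{1-s}{100}\,} = 80
\;\Longrightarrow\;
s + \frac{1-s}{100} = \frac{1}{80} = 0.0125
\;\Longrightarrow\;
99\,s = 0.25
\;\Longrightarrow\;
s \approx 0.0025

So only about 0.25% of the original program can be sequential, i.e., the answer is "<1%".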

Communication and Synchronization
- Parallel processes must co-operate to complete a single task faster
- Requires distributed communication and synchronization
  - Communication is for data values, or "what"
  - Synchronization is for control, or "when"
- Communication and synchronization are often inter-related
  - i.e., "what" depends on "when"
- Message passing bundles data and control
  - Message arrival encodes "what" and "when"
- In shared-memory machines, communication is usually via coherent caches and synchronization via atomic memory operations
- Due to the advent of single-chip multiprocessors, it is likely cache-coherent shared memory systems will be the dominant form of multiprocessor
- Today's lecture focuses on the synchronization problem

CS252 Administrivia
- Haven't received many project website URLs, please forward to both Rose and me
  - We will use this for 2nd project meetings, week of October 22
- Midterm #1, Thursday, in class, 9:40AM-11:00AM
  - Closed book, no calculators/computers/iPhones...
  - Based on material and assigned readings from lectures 1-9
  - Practice problems and solutions on website
- Meet in La Vals for pizza/drinks at 7pm after midterm on Thursday
  - Show of hands for RSVP
- Slight change in course calendar
  - All final project presentations on Thursday December 6th (no class on December 4th)
  - Gives all groups the same amount of time before the final presentation
- Reminder: final report due Monday December 10th, no extensions

Symmetric Multiprocessors
[Figure: processors and memory on a shared CPU-Memory bus, with a bridge to an I/O bus carrying an I/O controller, graphics output, and network interfaces]
Symmetric:
- All memory is equally far away from all processors
- Any processor can do any I/O (set up a DMA transfer)

Synchronization
- The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system)
- Forks and Joins: in parallel programming, a parallel process may want to wait until several events have occurred
- Producer-Consumer: a consumer process must wait until the producer process has produced data
- Exclusive use of a resource: the operating system has to ensure that only one process uses a resource at a given time
[Figure: fork/join of processes P1 and P2, and a producer feeding a consumer]

A Producer-Consumer Example
[Figure: a FIFO queue in memory with head and tail pointers; the producer appends at the tail (register Rtail), the consumer removes at the head (registers Rhead and R)]

Producer posting item x:
  Load Rtail, (tail)
  Store (Rtail), x
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
       Load Rhead, (head)
spin:  Load Rtail, (tail)
       if Rhead == Rtail goto spin
       Load R, (Rhead)
       Rhead = Rhead + 1
       Store (head), Rhead
       process(R)

The program is written assuming instructions are executed in order. Problems?
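
For readers who prefer C to register-transfer pseudo-assembly, here is a minimal sketch of the same single-producer/single-consumer queue; the names (buffer, head, tail, QSIZE) are illustrative and not from the slides. Like the slide's code, it silently assumes that loads and stores become visible in program order, which is exactly the assumption questioned below.

/* Minimal C sketch of the slide's single-producer/single-consumer queue.
 * Correct only if memory operations are seen in program order. */
#define QSIZE 16

static int buffer[QSIZE];
static volatile unsigned head = 0;   /* next slot the consumer reads  */
static volatile unsigned tail = 0;   /* next slot the producer writes */

void produce(int x)
{
    buffer[tail % QSIZE] = x;   /* store the item ...               */
    tail = tail + 1;            /* ... then publish it via the tail */
}

int consume(void)
{
    while (head == tail)        /* spin until an item is available */
        ;
    int r = buffer[head % QSIZE];
    head = head + 1;
    return r;
}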

A Producer-Consumer Example, continued

Producer posting item x:
       Load Rtail, (tail)
  (1)  Store (Rtail), x
       Rtail = Rtail + 1
  (2)  Store (tail), Rtail

Consumer:
       Load Rhead, (head)
spin:  (3)  Load Rtail, (tail)
       if Rhead == Rtail goto spin
  (4)  Load R, (Rhead)
       Rhead = Rhead + 1
       Store (head), Rhead
       process(R)

- Can the tail pointer get updated before the item x is stored?
- The programmer assumes that if 3 happens after 2, then 4 happens after 1
- Problem sequences are:
  - 2, 3, 4, 1
  - 4, 1, 2, 3

Sequential Consistency: A Memory Model
- "A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program" — Leslie Lamport
- Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

Sequential Consistency
- Sequential concurrent tasks: T1, T2
- Shared variables: X, Y (initially X = 0, Y = 10)

T1:                          T2:
Store (X), 1    (X = 1)      Load R1, (Y)
Store (Y), 11   (Y = 11)     Store (Y'), R1   (Y' = Y)
                             Load R2, (X)
                             Store (X'), R2   (X' = X)

- What are the legitimate answers for X' and Y'?
- (X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}?
- If Y' is 11 then X' cannot be 0, so (0,11) is not a legitimate outcome under sequential consistency

Sequential Consistency
- Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies
- What are these in our example?

T1:                          T2:
Store (X), 1    (X = 1)      Load R1, (Y)
Store (Y), 11   (Y = 11)     Store (Y'), R1   (Y' = Y)
                             Load R2, (X)
                             Store (X'), R2   (X' = X)

- Additional SC requirements: T1's two stores must become visible in program order, and T2's load of Y must be ordered before its load of X, even though the accesses are to different addresses
- Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? (more on this later)
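
As a concrete illustration (not from the slides), the same litmus test can be written with C11 atomics; with the default memory_order_seq_cst ordering the outcome (X', Y') = (0, 11) can never be observed, whereas relaxing the orderings would make it legal on weakly ordered hardware.

/* Message-passing litmus test with sequentially consistent atomics. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int X = 0, Y = 10;
int Xp, Yp;                       /* X' and Y' from the slide */

void *t1(void *arg)
{
    atomic_store(&X, 1);          /* X = 1  */
    atomic_store(&Y, 11);         /* Y = 11 */
    return NULL;
}

void *t2(void *arg)
{
    Yp = atomic_load(&Y);         /* Y' = Y */
    Xp = atomic_load(&X);         /* X' = X */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("(X', Y') = (%d, %d)\n", Xp, Yp);  /* never (0, 11) with seq_cst */
    return 0;
}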

Multiple Consumer Example
[Figure: the FIFO queue with one producer (Rtail) and two consumers, each with its own Rhead and R]

Producer posting item x:
  Load Rtail, (tail)
  Store (Rtail), x
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
       Load Rhead, (head)
spin:  Load Rtail, (tail)
       if Rhead == Rtail goto spin
       Load R, (Rhead)
       Rhead = Rhead + 1
       Store (head), Rhead
       process(R)

- What is wrong with this code?
- The consumer's dequeue sequence is a critical section: it needs to be executed atomically by one consumer at a time → locks

Locks or Semaphores (E. W. Dijkstra, 1965)
- A semaphore is a non-negative integer, with the following operations:
  - P(s): if s > 0, decrement s by 1, otherwise wait
  - V(s): increment s by 1 and wake up one of the waiting processes
- P's and V's must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors

Process i:
  P(s)
  <critical section>
  V(s)

- The initial value of s determines the maximum number of processes in the critical section
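
A small C sketch using POSIX semaphores (not part of the slides); sem_wait corresponds to P, sem_post to V, and the initial value passed to sem_init bounds how many threads can be in the critical section at once.

/* POSIX-semaphore version of the P/V idiom on the slide. */
#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

static sem_t s;
static int shared_counter = 0;

static void *worker(void *arg)
{
    sem_wait(&s);               /* P(s): wait until s > 0, then decrement */
    shared_counter++;           /* <critical section>                     */
    sem_post(&s);               /* V(s): increment s, wake a waiter       */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    sem_init(&s, 0, 1);         /* initial value 1 => mutual exclusion */
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %d\n", shared_counter);
    sem_destroy(&s);
    return 0;
}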

Implementation of Semaphores
- Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion are difficult to design...
- Simpler solution: atomic read-modify-write instructions
- Examples (m is a memory location, R is a register):

Test&Set (m), R:
  R ← M[m];
  if R == 0 then M[m] ← 1;

Fetch&Add (m), RV, R:
  R ← M[m];
  M[m] ← R + RV;

Swap (m), R:
  Rt ← M[m];
  M[m] ← R;
  R ← Rt;
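
For comparison (not from the slides), C11's atomic_flag provides exactly a Test&Set-style primitive, so a spin-lock can be sketched as follows.

/* Test&Set spin-lock built on the C11 atomic_flag type. */
#include <stdatomic.h>

static atomic_flag mutex = ATOMIC_FLAG_INIT;

static void lock(void)
{
    /* Spin until the old value was clear, i.e., we acquired the lock. */
    while (atomic_flag_test_and_set(&mutex))
        ;                                   /* busy-wait */
}

static void unlock(void)
{
    atomic_flag_clear(&mutex);              /* release the lock */
}

/* Usage:
 *   lock();
 *   ... critical section ...
 *   unlock();
 */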

Multiple Consumers Example Using the Test&Set Instruction
P:     Test&Set (mutex), Rtemp
       if (Rtemp != 0) goto P
       Load Rhead, (head)            |
spin:  Load Rtail, (tail)            |
       if Rhead == Rtail goto spin   |  critical
       Load R, (Rhead)               |  section
       Rhead = Rhead + 1             |
       Store (head), Rhead           |
V:     Store (mutex), 0
       process(R)

- Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's
- What if the process stops or is swapped out while in the critical section?

Nonblocking Synchronization
Compare&Swap (m), Rt, Rs:
  if (Rt == M[m])
    then  M[m] ← Rs;
          Rs ← Rt;
          status ← success;
    else  status ← fail;
(status is an implicit argument)

try:   Load Rhead, (head)
spin:  Load Rtail, (tail)
       if Rhead == Rtail goto spin
       Load R, (Rhead)
       Rnewhead = Rhead + 1
       Compare&Swap (head), Rhead, Rnewhead
       if (status == fail) goto try
       process(R)
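
A hedged C11 sketch of the same non-blocking dequeue, using atomic_compare_exchange_weak in place of the Compare&Swap instruction; the queue names are illustrative and the matching producer is as in the earlier sketch.

/* Lock-free dequeue: claim the slot at 'head' only if no other consumer
 * advanced 'head' in the meantime; otherwise retry. */
#include <stdatomic.h>

#define QSIZE 16
static int buffer[QSIZE];
static atomic_uint head = 0, tail = 0;

int dequeue(void)
{
    for (;;) {                                   /* try: */
        unsigned h = atomic_load(&head);
        while (h == atomic_load(&tail))          /* spin: queue empty */
            ;
        int r = buffer[h % QSIZE];
        /* Compare&Swap: advance head from h to h+1 only if it is still h. */
        if (atomic_compare_exchange_weak(&head, &h, h + 1))
            return r;                            /* status == success */
        /* status == fail: another consumer got there first; retry.   */
    }
}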

Load-reserve & Store-conditional
- Special register(s) hold a reservation flag and address, and the outcome of the store-conditional

Load-reserve R, (m):
  <flag, adr> ← <1, m>;
  R ← M[m];

Store-conditional (m), R:
  if <flag, adr> == <1, m>
    then  cancel other procs' reservation on m;
          M[m] ← R;
          status ← succeed;
    else  status ← fail;

try:   Load-reserve Rhead, (head)
spin:  Load Rtail, (tail)
       if Rhead == Rtail goto spin
       Load R, (Rhead)
       Rhead = Rhead + 1
       Store-conditional (head), Rhead
       if (status == fail) goto try
       process(R)

Performance of Locks
- Blocking atomic read-modify-write instructions
  - e.g., Test&Set, Fetch&Add, Swap
- vs. non-blocking atomic read-modify-write instructions
  - e.g., Compare&Swap, Load-reserve/Store-conditional
- vs. protocols based on ordinary Loads and Stores
- Performance depends on several interacting factors:
  - degree of contention, caches, out-of-order execution of Loads and Stores
- later ...

Issues in Implementing Sequential Consistency
- Implementation of SC is complicated by two issues
- Out-of-order execution capability (can a uniprocessor reorder the pair without violating its own program dependencies?)
  - Load(a); Load(b)     yes
  - Load(a); Store(b)    yes, if a ≠ b
  - Store(a); Load(b)    yes, if a ≠ b
  - Store(a); Store(b)   yes, if a ≠ b
- Caches
  - Caches can prevent the effect of a store from being seen by other processors

Memory Fences: Instructions to Sequentialize Memory Accesses
- Processors with relaxed or weak memory models (i.e., that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses
- Examples of processors with relaxed memory models:
  - Sparc V8 (TSO, PSO): Membar
  - Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
  - PowerPC (WO): Sync, EIEIO
- Memory fences are expensive operations; however, one pays the cost of serialization only when it is required

Using Memory Fences
[Figure: the producer-consumer FIFO queue again, with head/tail pointers and the producer and consumer registers]

Producer posting item x:
  Load Rtail, (tail)
  Store (Rtail), x
  MembarSS            <- ensures that the tail pointer is not updated before x has been stored
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
       Load Rhead, (head)
spin:  Load Rtail, (tail)
       if Rhead == Rtail goto spin
       MembarLL       <- ensures that R is not loaded before x has been stored
       Load R, (Rhead)
       Rhead = Rhead + 1
       Store (head), Rhead
       process(R)
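
In C11, the same fences can be expressed with atomic_thread_fence; in this sketch (not from the slides) the release fence plays the role of MembarSS and the acquire fence the role of MembarLL. Names are illustrative.

/* Fenced single-producer/single-consumer queue with C11 atomics. */
#include <stdatomic.h>

#define QSIZE 16
static int buffer[QSIZE];
static atomic_uint head = 0, tail = 0;

void produce(int x)
{
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    buffer[t % QSIZE] = x;                      /* store the item          */
    atomic_thread_fence(memory_order_release);  /* "MembarSS"              */
    atomic_store_explicit(&tail, t + 1, memory_order_relaxed);
}

int consume(void)
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    while (h == atomic_load_explicit(&tail, memory_order_relaxed))
        ;                                       /* spin: queue empty       */
    atomic_thread_fence(memory_order_acquire);  /* "MembarLL"              */
    int r = buffer[h % QSIZE];
    atomic_store_explicit(&head, h + 1, memory_order_relaxed);
    return r;
}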

Data-Race Free Programs, a.k.a. Properly Synchronized Programs

Process 1                    Process 2
...                          ...
Acquire(mutex);              Acquire(mutex);
<critical section>           <critical section>
Release(mutex);              Release(mutex);

- Synchronization variables (e.g., mutex) are disjoint from data variables
- Accesses to writable shared data variables are protected in critical regions
  - no data races except for locks (a formal definition is elusive)
- In general, it cannot be proven whether a program is data-race free

Fences in Data-Race Free Programs

Process 1                    Process 2
...                          ...
Acquire(mutex);              Acquire(mutex);
membar;                      membar;
<critical section>           <critical section>
Release(mutex);              Release(mutex);

- A relaxed memory model allows reordering of instructions by the compiler or the processor as long as the reordering is not done across a fence
- The processor also should not speculate or prefetch across fences
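
A data-race-free version of a shared-counter program, sketched (not from the slides) with POSIX threads; pthread_mutex_lock/unlock play the role of Acquire/Release, and the needed fences live inside the lock implementation, so the program itself contains none.

/* Properly synchronized (data-race-free) shared counter. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long shared_data = 0;          /* only touched inside the lock */

static void *process(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&mutex);   /* Acquire(mutex)      */
        shared_data++;                /* <critical section>  */
        pthread_mutex_unlock(&mutex); /* Release(mutex)      */
    }
    return NULL;
}

int main(void)
{
    pthread_t p1, p2;
    pthread_create(&p1, NULL, process, NULL);
    pthread_create(&p2, NULL, process, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("shared_data = %ld\n", shared_data);  /* always 200000 */
    return 0;
}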

Mutual Exclusion Using Load/Store
- A protocol based on two shared variables, c1 and c2. Initially, both c1 and c2 are 0 (not busy)

Process 1                          Process 2
...                                ...
c1 = 1;                            c2 = 1;
L: if c2 == 1 then goto L          L: if c1 == 1 then goto L
<critical section>                 <critical section>
c1 = 0;                            c2 = 0;

- What is wrong? Deadlock! (If both processes set their flags before either checks the other's, both spin forever.)

Mutual Exclusion: Second Attempt
- To avoid deadlock, let a process give up its reservation (i.e., Process 1 sets c1 to 0) while waiting

Process 1                                Process 2
...                                      ...
L: c1 = 1;                               L: c2 = 1;
   if c2 == 1 then { c1 = 0; goto L }       if c1 == 1 then { c2 = 0; goto L }
<critical section>                       <critical section>
c1 = 0;                                  c2 = 0;

- Deadlock is not possible, but with low probability a livelock may occur
- An unlucky process may never get to enter the critical section → starvation

A Protocol for Mutual Exclusion (T. Dekker, 1966)
- A protocol based on 3 shared variables: c1, c2, and turn. Initially, both c1 and c2 are 0 (not busy)

Process 1                                  Process 2
...                                        ...
c1 = 1;                                    c2 = 1;
turn = 1;                                  turn = 2;
L: if c2 == 1 & turn == 1 then goto L      L: if c1 == 1 & turn == 2 then goto L
<critical section>                         <critical section>
c1 = 0;                                    c2 = 0;

- turn == i ensures that only process i can wait
- Variables c1 and c2 ensure mutual exclusion
- The solution for n processes was given by Dijkstra and is quite tricky! (A C sketch of the two-process protocol follows.)
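
A C11 sketch of the two-process protocol above (not from the slides); sequentially consistent atomics stand in for the SC memory model the protocol assumes, since with ordinary reorderable loads and stores it is not correct on a relaxed-memory machine.

/* Two-process mutual exclusion with flags and a turn variable. */
#include <stdatomic.h>

static atomic_int c1 = 0, c2 = 0, turn = 1;

void enter_p1(void)
{
    atomic_store(&c1, 1);
    atomic_store(&turn, 1);
    /* L: wait while the other process is interested and it is our turn to yield */
    while (atomic_load(&c2) == 1 && atomic_load(&turn) == 1)
        ;
}

void exit_p1(void)  { atomic_store(&c1, 0); }

void enter_p2(void)
{
    atomic_store(&c2, 1);
    atomic_store(&turn, 2);
    while (atomic_load(&c1) == 1 && atomic_load(&turn) == 2)
        ;
}

void exit_p2(void)  { atomic_store(&c2, 0); }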

Analysis of Dekker's Algorithm
[Figure: two execution scenarios (Scenario 1 and Scenario 2) stepping through different interleavings of the two processes' statements — c1 = 1; turn = 1; L: if c2 == 1 & turn == 1 then goto L; <critical section>; c1 = 0 for Process 1, and the symmetric code for Process 2]

N-process Mutual Exclusion: Lamport's Bakery Algorithm
- Initially num[j] = 0, for all j

Process i:

  /* Entry code */
  choosing[i] = 1;
  num[i] = max(num[0], ..., num[N-1]) + 1;
  choosing[i] = 0;

  for (j = 0; j < N; j++) {
    while (choosing[j]);            /* wait if process j is currently choosing       */
    while (num[j] &&
           ((num[j] < num[i]) ||
            (num[j] == num[i] && j < i)));
                                    /* wait if process j has a number ahead of ours  */
  }

  /* <critical section> */

  /* Exit code */
  num[i] = 0;
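
And a C11 sketch of the bakery algorithm itself (not from the slides), again relying on sequentially consistent atomics in place of the SC memory model the pseudo-code assumes; N and the thread ids 0..N-1 are supplied by the caller.

/* Lamport's bakery algorithm for N threads. */
#include <stdatomic.h>

#define N 4

static atomic_int choosing[N];
static atomic_int num[N];              /* initially all 0 */

static int max_ticket(void)
{
    int m = 0;
    for (int j = 0; j < N; j++) {
        int v = atomic_load(&num[j]);
        if (v > m) m = v;
    }
    return m;
}

void bakery_lock(int i)                /* entry code for process i */
{
    atomic_store(&choosing[i], 1);
    atomic_store(&num[i], max_ticket() + 1);
    atomic_store(&choosing[i], 0);

    for (int j = 0; j < N; j++) {
        while (atomic_load(&choosing[j]))            /* j is currently choosing */
            ;
        while (atomic_load(&num[j]) &&
               (atomic_load(&num[j]) < atomic_load(&num[i]) ||
                (atomic_load(&num[j]) == atomic_load(&num[i]) && j < i)))
            ;                                        /* j comes ahead of us     */
    }
}

void bakery_unlock(int i)              /* exit code for process i */
{
    atomic_store(&num[i], 0);
}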