Lecture 15 Multi-core Chips Peng Liu liupeng@zju.edu.cn
Introduction to Multi-core Processors
Motivation: Single Processor Performance Scaling
Multi-core Chips (aka亦称 Chip Multi-Processors or CMPs)
Sample of Multi-core Options
Sample of Multi-core Options
And There is Much More…
Supercomputers Definition of a supercomputer: Fastest machine in world at given task A device to turn a compute-bound problem into an I/O bound problem Any machine costing $30M+ Any machine designed by Seymour Cray CDC6600 (Cray, 1964) regarded as first supercomputer 8
CDC 6600 Seymour Cray, 1963 A fast pipelined machine with 60-bit words 128 Kword main memory capacity, 32 banks Ten functional units (parallel, unpipelined) Floating Point: adder, 2 multipliers, divider Integer: adder, 2 incrementers, ... Hardwired control (no microcoding) Scoreboard for dynamic scheduling of instructions Ten Peripheral Processors for Input/Output a fast multi-threaded 12-bit integer ALU Very fast clock, 10 MHz (FP add in 4 clocks) >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling Fastest machine in world for 5 years (until 7600) over 100 sold ($7-10M each) 3/10/2009 CS252 S05 9
Supercomputer Applications Typical application areas Military research (nuclear weapons, cryptography) Scientific research Weather forecasting Oil exploration Industrial design (car crash simulation) Bioinformatics Cryptography All involve huge computations on large data sets In 70s-80s, Supercomputer Vector Machine 10
BlueGene/Q Compute chip System-on-a-Chip design : integrates processors, memory and networking logic into a single chip 360 mm² Cu-45 technology (SOI) ~ 1.47 B transistors 16 user + 1 service processors plus 1 redundant processor all processors are symmetric each 4-way multi-threaded 64 bits PowerISA™ 1.6 GHz L1 I/D cache = 16kB/16kB L1 prefetch engines each processor has Quad FPU (4-wide double precision, SIMD) peak performance 204.8 GFLOPS@55W Central shared L2 cache: 32 MB eDRAM multiversioned cache will support transactional memory, speculative execution. supports atomic ops Dual memory controller 16 GB external DDR3 memory 1.33 Gb/s 2 * 16 byte-wide interface (+ECC) Chip-to-chip networking Router logic integrated into BQC chip. External IO PCIe Gen2 interface
Blue Gene/Q packaging hierarchy 4. Node Card 32 Compute Cards, Optical Modules, Link Chips, Torus 3. Compute Card One single chip module, 16 GB DDR3 Memory 2. Module Single Chip 1. Chip 16 cores 5b. I/O Drawer 8 I/O Cards 8 PCIe Gen2 slots 6. Rack 2 Midplanes 1, 2 or 4 I/O Drawers 7. System 20PF/s 5a. Midplane 16 Node Cards 5-D Topology: 16x16x16x12x2 A Q32 card is 2x2x2x2x2 and a midplane is 4x4x4x4x2. Ref: SC2010
GPU
Graphics Processing Units (GPUs) Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-late 1990s) including high-performance floating-point units Provide workstation-like graphics for PCs User could configure graphics pipeline, but not really program it Over time, more programmability added (2001-2005) E.g., New language Cg for writing small programs run on each vertex or each pixel, also Windows DirectX variants Massively parallel (millions of vertices or pixels per frame) but very constrained programming model Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading computations Incredibly difficult programming model as had to use graphics pipeline model for general computation
General-Purpose GPUs (GP-GPUs) In 2006, Nvidia introduced GeForce 8800 GPU supporting a new programming language: CUDA “Compute Unified Device Architecture” Subsequently, broader industry pushing for OpenCL, a vendor-neutral version of same ideas. Idea: Take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing Attached processor model: Host CPU issues data-parallel kernels to GP-GPU for execution This lecture has a simplified version of Nvidia CUDA-style model and only considers GPU execution for computational kernels, not graphics Would probably need another course to describe graphics processing
“Single Instruction, Multiple Thread” GPUs use a SIMT model, where individual scalar instruction streams for each CUDA thread are grouped together for SIMD execution on hardware (Nvidia groups 32 CUDA threads into a warp) µT0 µT1 µT2 µT3 µT4 µT5 µT6 µT7 ld x Scalar instruction stream mul a ld y add st y SIMD execution across warp
Nvidia Fermi GF100 GPU [Nvidia, 2010]
GPU Future High-end desktops have separate GPU chip, but trend towards integrating GPU on same die as CPU (already in laptops, tablets and smartphones) Advantage is shared memory with CPU, no need to transfer data Disadvantage is reduced memory bandwidth compared to dedicated smaller-capacity specialized memory system Graphics DRAM (GDDR) versus regular DRAM (DDR3) Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant? On same die, CPU and GPU should have same memory bandwidth GPU might have more FLOPS as needed for graphics anyway
Synchronization
Symmetric Multiprocessors Memory I/O controller Graphics output CPU-Memory bus bridge Processor I/O bus Networks symmetric All memory is equally far away from all processors Any processor can do any I/O (set up a DMA transfer) 22
Synchronization The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system) Producer-Consumer: A consumer process must wait until the producer process has produced data Mutual Exclusion: Ensure that only one process uses a resource at a given time producer consumer Shared Resource P1 P2 23
A Producer-Consumer Example tail head Rtail Rhead R Consumer: Load Rhead, (head) spin: Load Rtail, (tail) if Rhead==Rtail goto spin Load R, (Rhead) Rhead=Rhead+1 Store (head), Rhead process(R) Producer posting Item x: Load Rtail, (tail) Store (Rtail), x Rtail=Rtail+1 Store (tail), Rtail The program is written assuming instructions are executed in order. Problems? 24
Sequential Consistency A Memory Model P “ A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program” Leslie Lamport Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs 26
Sequential Consistency Sequential concurrent tasks: T1, T2 Shared variables: X, Y (initially X = 0, Y = 10) T1: T2: Store (X), 1 (X = 1) Load R1, (Y) Store (Y), 11 (Y = 11) Store (Y’), R1(Y’= Y) Load R2, (X) Store (X’), R2(X’= X) what are the legitimate answers for X’ and Y’ ? (X’,Y’) {(1,11), (0,10), (1,10), (0,11)} ? If y is 11 then x cannot be 0 27
Sequential Consistency Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies ( ) What are these in our example ? T1: T2: Store (X), 1 (X = 1) Load R1, (Y) Store (Y), 11 (Y = 11) Store (Y’), R1(Y’= Y) Load R2, (X) Store (X’), R2(X’= X) additional SC requirements Does (can) a system with caches or out-of-order execution capability provide a sequentiallyconsistent view of the memory ? more on this later 28
Multiple Consumer Example tail head Producer Rtail Consumer 1 R Rhead 2 Consumer: Load Rhead, (head) spin: Load Rtail, (tail) if Rhead==Rtail goto spin Load R, (Rhead) Rhead=Rhead+1 Store (head), Rhead process(R) Producer posting Item x: Load Rtail, (tail) Store (Rtail), x Rtail=Rtail+1 Store (tail), Rtail Critical section: Needs to be executed atomically by one consumer locks What is wrong with this code? 29
Locks or Semaphores E. W. Dijkstra, 1965 A semaphore is a non-negative integer, with the following operations: P(s): if s>0, decrement s by 1, otherwise wait V(s): increment s by 1 and wake up one of the waiting processes P’s and V’s must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors Process i P(s) <critical section> V(s) initial value of s determines the maximum no. of processes in the critical section 30
Implementation of Semaphores Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion are difficult to design... Simpler solution: atomic read-modify-write instructions Examples: m is a memory location, R is a register Test&Set (m), R: R M[m]; if R==0 then M[m] 1; Fetch&Add (m), RV, R: R M[m]; M[m] R + RV; Swap (m), R: Rt M[m]; M[m] R; R Rt; 31
Multiple Consumers Example using the Test&Set Instruction P: Test&Set (mutex),Rtemp if (Rtemp!=0) goto P Load Rhead, (head) spin: Load Rtail, (tail) if Rhead==Rtail goto spin Load R, (Rhead) Rhead=Rhead+1 Store (head), Rhead V: Store (mutex),0 process(R) Critical Section Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P’s and V’s What if the process stops or is swapped out while in the critical section? 32
Nonblocking Synchronization Compare&Swap(m), Rt, Rs: if (Rt==M[m]) then M[m]=Rs; Rs=Rt ; status success; else status fail; statusis an implicit argument try: Load Rhead, (head) spin: Load Rtail, (tail) if Rhead==Rtail goto spin Load R, (Rhead) Rnewhead = Rhead+1 Compare&Swap(head), Rhead, Rnewhead if (status==fail) goto try process(R) 33
Load-reserve & Store-conditional Special register(s) to hold reservation flag and address, and the outcome of store-conditional Load-reserve R, (m): <flag, adr><1, m>; R M[m]; Store-conditional (m), R: if<flag, adr> == <1, m> then cancel other procs’ reservation on m; M[m] R; status succeed; else status fail; try: Load-reserve Rhead, (head) spin: Load Rtail, (tail) if Rhead==Rtailgoto spin Load R, (Rhead) Rhead = Rhead + 1 Store-conditional (head), Rhead if (status==fail) goto try process(R) 34
Performance of Locks Blocking atomic read-modify-write instructions e.g., Test&Set, Fetch&Add, Swap vs Non-blocking atomic read-modify-write instructions e.g., Compare&Swap, Load-reserve/Store-conditional Protocols based on ordinary Loads and Stores Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of Loads and Stores later ... 35
Issues in Implementing Sequential Consistency Implementation of SC is complicated by two issues Out-of-order execution capability Load(a); Load(b) yes Load(a); Store(b) yes if a b Store(a); Load(b) yes if a b Store(a); Store(b) yes if a b Caches Caches can prevent the effect of a store from being seen by other processors No common commercial architecture has a sequentially consistent memory model! 36
Memory Fences Instructions to sequentialize memory accesses Processors with relaxed or weak memory models (i.e., permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses Examples of processors with relaxed memory models: Sparc V8 (TSO,PSO): Membar Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore Membar #StoreLoad, Membar #StoreStore PowerPC (WO): Sync, EIEIO Memory fences are expensive operations, however, one pays the cost of serialization only when it is required 37
Using Memory Fences tail head Consumer: Load Rhead, (head) Producer Consumer tail head Rtail Rhead R Consumer: Load Rhead, (head) spin: Load Rtail, (tail) if Rhead==Rtail goto spin MembarLL Load R, (Rhead) Rhead=Rhead+1 Store (head), Rhead process(R) Producer posting Item x: Load Rtail, (tail) Store (Rtail), x MembarSS Rtail=Rtail+1 Store (tail), Rtail ensures that tail ptr is not updated before x has been stored ensures that R is not loaded before x has been stored 38
Mutual Exclusion Using Load/Store A protocol based on two shared variables c1 and c2. Initially, both c1 and c2 are 0 (not busy) Process 1 ... c1=1; L: if c2=1 then go to L < critical section> c1=0; Process 2 c2=1; L: if c1=1 then go to L c2=0; What is wrong? Deadlock! 39
Mutual Exclusion: second attempt To avoid deadlock, let a process give up the reservation (i.e. Process 1 sets c1 to 0) while waiting. Process 1 ... L: c1=1; if c2=1 then { c1=0; go to L} < critical section> c1=0 Process 2 L: c2=1; if c1=1 then { c2=0; go to L} c2=0 Deadlock is not possible but with a low probability a livelock may occur. An unlucky process may never get to enter the critical section starvation 40
A Protocol for Mutual Exclusion T. Dekker, 1966 A protocol based on 3 shared variables c1, c2 and turn. Initially, both c1 and c2 are 0 (not busy) Process 1 ... c1=1; turn = 1; L: if c2=1 & turn=1 then go to L < critical section> c1=0; Process 2 ... c2=1; turn = 2; L: if c1=1 & turn=2 then go to L < critical section> c2=0; turn = i ensures that only process i can wait variables c1 and c2 ensure mutual exclusion Solution for n processes was given by Dijkstra and is quite tricky! 41
N-process Mutual Exclusion Lamport’s Bakery Algorithm Process i choosing[i] = 1; num[i] = max(num[0], …, num[N-1]) + 1; choosing[i] = 0; for(j = 0; j < N; j++) { while( choosing[j] ); while( num[j] && ( ( num[j] < num[i] ) || ( num[j] == num[i] && j < i ) ) ); } num[i] = 0; Initiallynum[j] = 0, for all j Entry Code Wait if the process is currently choosing Wait if the process has a number and comes ahead of us. Exit Code 43
Relaxed Memory Model needs Fences Producer Consumer tail head Rtail Rhead R Consumer: Load Rhead, (head) spin: Load Rtail, (tail) if Rhead==Rtail goto spin MembarLL Load R, (Rhead) Rhead=Rhead+1 Store (head), Rhead process(R) Producer posting Item x: Load Rtail, (tail) Store (Rtail), x MembarSS Rtail=Rtail+1 Store (tail), Rtail ensures that tail ptr is not updated before x has been stored ensures that R is not loaded before x has been stored 44
Memory Coherence in SMPs cache-1 A 100 CPU-Memory bus CPU-1 CPU-2 cache-2 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale values write-through: cache-2 has a stale value Do these stale values matter? What is the view of shared memory for programming? 45
Write-back Caches & SC T1 is executed cache-1 writes backY T2 executed prog T1 ST X, 1 ST Y,11 cache-2 cache-1 memory prog T2 LD Y, R1 ST Y’, R1 LD X, R2 ST X’,R2 X = 0 Y =10 X’= Y’= X= 1 Y=11 Y = X = T1 is executed cache-1 writes backY X = 0 Y =11 X’= Y’= X= 1 Y=11 Y = X = X = 0 Y =11 X’= Y’= X= 1 Y=11 Y = 11 Y’= 11 X’= 0 T2 executed X = 1 Y =11 X’= Y’= X= 1 Y=11 Y = 11 Y’= 11 X = 0 X’= 0 cache-1 writes backX inconsistent X = 1 Y =11 X’= 0 Y’=11 X= 1 Y=11 X = 0 cache-2 writes back X’&Y’ 46
Write-through Caches & SC prog T2 LD Y, R1 ST Y’, R1 LD X, R2 ST X’,R2 prog T1 ST X, 1 ST Y,11 cache-2 Y = Y’= X = 0 X’= memory Y =10 cache-1 X= 0 Y=10 Y = Y’= X = 0 X’= X = 1 Y =11 X= 1 Y=11 T1 executed Y = 11 Y’= 11 X = 0 X’= 0 X = 1 Y =11 Y’=11 X= 1 Y=11 T2 executed Write-through caches don’t preserve sequential consistency either 47
Cache Coherence vs. Memory Consistency A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address i.e., updates are not lost A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses Equivalently, what values can be seen by a load A cache coherence protocol is not enough to ensure sequential consistency But if sequentially consistent, then caches must be coherent Combination of cache coherence protocol plus processor memory reorder buffer implements a given machine’s memory consistency model
Maintaining Cache Coherence Hardware support is required such that only one processor at a time has write permission for a location no processor can load a stale copy of the location after a write cache coherence protocols 49
Warmup: Parallel I/O Physical Memory Proc. Cache Bus Address (A) Proc. Cache Data (D) R/W Page transfers occur while the Processor is running A Either Cache or DMA can be the Bus Master and effect transfers D DMA DISK R/W (DMA stands for “Direct Memory Access”, means the I/O device can read/write memory autonomous from the CPU) 50
Problems with Parallel I/O DISK DMA Physical Memory Proc. Cache Bus Cached portions of page DMA transfers Memory Disk: Physical memory may be stale if cache copy is dirty Disk Memory: Cache may hold stale data and not see memory writes 51
Snoopy CacheGoodman 1983 Idea: Have cache watch (or snoop upon) DMA transfers, and then “do the right thing” Snoopy cache tags are dual-ported Proc. Cache Snoopy read port attached to Memory Bus Data (lines) Tags and State A D R/W Used to drive Memory Bus when Cache is Bus Master 52
Shared Memory Multiprocessor Bus M1 Snoopy Cache Physical Memory M2 Snoopy Cache Snoopy Cache DMA M3 DISKS Use snoopy mechanism to keep all processors’ view of memory coherent 53
Snoopy Cache Coherence Protocols write miss: the address is invalidated in all other caches before the write is performed read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read Update protocols, or write broadcast. Latency between writing a word in one processor and reading it in another is usually smaller in a write update scheme. But since bandwidth is more precious, most multiprocessors use a write invalidate scheme. 54
Cache State Transition Diagram The MSI protocol M: Modified S: Shared I: Invalid Each cache line has state bits Address tag state bits Write miss (P1 gets line from memory) P1 reads or writes M Other processor reads (P1writes back) P1 intent to write Other processor intent to write (P1 writes back) Read miss (P1 gets line from memory) S I Read by any processor Other processor intent to write Cache state in processor P1 55
Two Processor Example (Reading and writing the same cache line) P1 reads or writes P1 reads P2 reads, P1 writes back M P1 writes P2 reads Write miss P2 writes P1 intent to write P2 intent to write P1 reads P1 writes Read miss P2 writes S I P2 intent to write P1 writes P2 P2 reads or writes P1 reads, P2 writes back M Write miss P2 intent to write P1 intent to write Read miss S I P1 intent to write 56
Observation M S I Write miss Other processor intent to write Read miss P1 intent to write Read by any processor P1 reads or writes Other processor reads P1 writes back If a line is in the M state then no other cache can have a copy of the line! Memory stays coherent, multiple differing copies cannot exist 57
MESI: An Enhanced MSI protocol increased performance for private data M: Modified Exclusive E: Exclusive but unmodified S: Shared I: Invalid Each cache line has a tag Address tag state bits Write miss P1 read P1 write P1 write or read M E Read miss, not shared Other processor reads Other processor intent to write, P1 writes back P1 intent to write Other processor reads P1writes back Other processor intent to write Read miss, shared S I Read by any processor Other processor intent to write Cache state in processor P1 58
Optimized Snoop with Level-2 Caches CPU CPU CPU CPU L1 $ L1 $ L1 $ L1 $ L2 $ L2 $ L2 $ L2 $ Snooper Snooper Snooper Snooper Processors often have two-level caches small L1, large L2 (usually both on chip now) Inclusion property: entries in L1 must be in L2 invalidation in L2 invalidation in L1 Snooping on L2 does not affect CPU-L1 bandwidth What problem could occur? Interlocks are required when both CPU-L1 and L2-Bus interactions involve the same address. 59
Intervention When a read-miss for A occurs in cache-2, CPU-1 CPU-2 A 200 cache-1 cache-2 CPU-Memory bus A 100 memory (stale data) When a read-miss for A occurs in cache-2, a read request for A is placed on the bus Cache-1 needs to supply & change its state to shared The memory may respond to the request also! Does memory know it has stale data? Cache-1 needs to intervene through memory controller to supply correct data to cache-2 60
False Sharing A cache block contains more than one word state blk addr data0 data1 ... dataN A cache block contains more than one word Cache-coherence is done at the block-level and not word-level Suppose M1 writes wordiand M2 writes wordk and both words have the same block address. What can happen? The block may be invalidated many times unnecessarily because the addresses share a common block. 61 61
Synchronization and Caches: Performance Issues Processor 1 R 1 L: swap (mutex), R; if<R>then goto L; <critical section> M[mutex] 0; Processor 2 R 1 L: swap (mutex), R; if<R>then goto L; <critical section> M[mutex] 0; Processor 3 R 1 L: swap (mutex), R; if<R>then goto L; <critical section> M[mutex] 0; mutex=1 cache cache cache CPU-Memory Bus Cache-coherence protocols will cause mutex to ping-pong between P1’s and P2’s caches. Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero. 62
Load-reserve & Store-conditional Special register(s) to hold reservation flag and address, and the outcome of store-conditional Load-reserve R, (a): <flag, adr><1, a>; R M[a]; Store-conditional (a), R: if<flag, adr> == <1, a> then cancel other procs’ reservation on a; M[a] <R>; status succeed; else status fail; If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0 Several processors may reserve ‘a’ simultaneously These instructions are like ordinary loads and stores with respect to the bus traffic Can implement reservation by using cache hit/miss, no additional hardware required (problems?) 63
Performance of Symmetric Shared-Memory Multiprocessors Cache performance is combination of: Uniprocessor cache miss traffic Traffic caused by communication Results in invalidations and subsequent cache misses Adds 4th C: coherence miss Joins Compulsory, Capacity, Conflict (Sometimes called a Communication miss) 64
Coherency Misses True sharing misses arise from the communication of data through the cache coherence mechanism Invalidates due to 1st write to shared block Reads by another CPU of modified block in different cache Miss would still occur if block size were 1 word False sharing misses when a block is invalidated because some word in the block, other than the one being read, is written into Invalidation does not cause a new value to be communicated, but only causes an extra cache miss Block is shared, but no word in block is actually shared miss would not occur if block size were 1 word 65
Home Work Readings: Read Chapter 7.1-7.14;
Acknowledgements These slides contain material from courses: UCB CS152 Stanford EE108B