Lecture 15 Multi-core Chips
Peng Liu
Introduction to Multi-core Processors
Motivation: Single Processor Performance Scaling
Multi-core Chips (also known as Chip Multi-Processors, or CMPs)
Sample of Multi-core Options
Sample of Multi-core Options (continued)
And There is Much More…
Supercomputers
Definition of a supercomputer:
- Fastest machine in the world at a given task
- A device to turn a compute-bound problem into an I/O-bound problem
- Any machine costing $30M+
- Any machine designed by Seymour Cray
The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.
CDC 6600 (Seymour Cray, 1963)
A fast pipelined machine with 60-bit words:
- 128 Kword main memory capacity, 32 banks
- Ten functional units (parallel, unpipelined)
  - Floating point: adder, 2 multipliers, divider
  - Integer: adder, 2 incrementers, ...
- Hardwired control (no microcoding)
- Scoreboard for dynamic scheduling of instructions
- Ten Peripheral Processors for input/output
  - a fast multi-threaded 12-bit integer ALU
- Very fast clock, 10 MHz (FP add in 4 clocks)
- >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based cooling technology
- Fastest machine in the world for 5 years (until the 7600); over 100 sold ($7-10M each)
Supercomputer Applications
Typical application areas:
- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)
- Bioinformatics
- Cryptography
All involve huge computations on large data sets.
In the 70s-80s, "supercomputer" meant a vector machine.
BlueGene/Q Compute Chip
System-on-a-Chip design: integrates processors, memory and networking logic into a single chip.
- 360 mm², Cu-45 technology (SOI), ~1.47 B transistors
- 16 user + 1 service processors, plus 1 redundant processor
  - all processors are symmetric
  - each 4-way multi-threaded
  - 64-bit PowerISA™
  - 1.6 GHz
  - L1 I/D cache = 16 kB / 16 kB, L1 prefetch engines
  - each processor has a Quad FPU (4-wide double precision, SIMD) peak performance
- Central shared L2 cache: 32 MB eDRAM
  - multiversioned cache; will support transactional memory and speculative execution
  - supports atomic ops
- Dual memory controller
  - 16 GB external DDR3 memory, 1.33 Gb/s
  - 2 x 16-byte-wide interface (+ECC)
- Chip-to-chip networking: router logic integrated into the BQC chip
- External I/O: PCIe Gen2 interface
Blue Gene/Q Packaging Hierarchy
1. Chip: 16 cores
2. Module: single chip
3. Compute Card: one single-chip module, 16 GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
7. System: 20 PF/s
5-D topology: 16x16x16x12x2. A Q32 card is 2x2x2x2x2 and a midplane is 4x4x4x4x2. Ref: SC2010
GPU
Graphics Processing Units (GPUs)
- Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-late 1990s), including high-performance floating-point units
  - Provide workstation-like graphics for PCs
  - User could configure the graphics pipeline, but not really program it
- Over time, more programmability was added
  - E.g., the new language Cg for writing small programs run on each vertex or each pixel, also Windows DirectX variants
  - Massively parallel (millions of vertices or pixels per frame) but very constrained programming model
- Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading computations
  - Incredibly difficult programming model, as it had to use the graphics pipeline model for general computation
General-Purpose GPUs (GP-GPUs)
- In 2006, Nvidia introduced the GeForce 8800 GPU supporting a new programming language: CUDA, "Compute Unified Device Architecture"
  - Subsequently, the broader industry has pushed for OpenCL, a vendor-neutral version of the same ideas
- Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
- Attached processor model: the host CPU issues data-parallel kernels to the GP-GPU for execution
- This lecture uses a simplified version of the Nvidia CUDA-style model and only considers GPU execution of computational kernels, not graphics
  - Would probably need another course to describe graphics processing
"Single Instruction, Multiple Thread"
GPUs use a SIMT model, where the individual scalar instruction streams for each CUDA thread are grouped together for SIMD execution on hardware (Nvidia groups 32 CUDA threads into a warp).
[Figure: a scalar instruction stream (ld x, mul a, ld y, add, st y) executed in SIMD fashion across microthreads µT0-µT7 of a warp]
Nvidia Fermi GF100 GPU [Nvidia, 2010]
GPU Future
- High-end desktops have a separate GPU chip, but the trend is towards integrating the GPU on the same die as the CPU (already done in laptops, tablets and smartphones)
  - Advantage: shared memory with the CPU, no need to transfer data
  - Disadvantage: reduced memory bandwidth compared to a dedicated, smaller-capacity, specialized memory system
    - Graphics DRAM (GDDR) versus regular DRAM (DDR3)
- Will GP-GPUs survive? Or will improvements in CPU DLP make GP-GPUs redundant?
  - On the same die, CPU and GPU should have the same memory bandwidth
  - GPU might have more FLOPS, as needed for graphics anyway
Synchronization
Symmetric Multiprocessors
[Figure: processors and an I/O controller (graphics output, networks) connected through a bus bridge to a shared CPU-Memory bus and memory]
Symmetric:
- All memory is equally far away from all processors
- Any processor can do any I/O (set up a DMA transfer)
Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system):
- Producer-Consumer: a consumer process must wait until the producer process has produced data
- Mutual Exclusion: ensure that only one process uses a resource at a given time
[Figure: a producer feeding a consumer; processes P1 and P2 contending for a shared resource]
A Producer-Consumer Example
[Figure: a queue in memory with head and tail pointers; registers Rhead, Rtail, R]

Producer posting item x:
  Load Rtail, (tail)
  Store (Rtail), x
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

The program is written assuming instructions are executed in order. Problems?
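A minimal C sketch of the same single-producer / single-consumer queue, assuming instructions execute in program order (the same assumption the slide makes); the names queue, head, tail and QSIZE are illustrative, and there is no full-queue check, mirroring the pseudocode. The later slides show why this breaks once caches and out-of-order execution enter the picture.

    #define QSIZE 16

    static int queue[QSIZE];
    static volatile unsigned head = 0;   /* next slot the consumer will read  */
    static volatile unsigned tail = 0;   /* next slot the producer will write */

    /* Producer: store the item first, then publish it by advancing tail. */
    void produce(int x) {
        queue[tail % QSIZE] = x;
        tail = tail + 1;
    }

    /* Consumer: spin until an item is available, then consume it. */
    int consume(void) {
        while (head == tail)
            ;                            /* spin: queue is empty */
        int r = queue[head % QSIZE];
        head = head + 1;
        return r;
    }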
Sequential Consistency: A Memory Model
"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program." - Leslie Lamport
Sequential consistency = arbitrary order-preserving interleaving of memory references of sequential programs
Sequential Consistency
Two concurrent sequential tasks T1, T2; shared variables X, Y (initially X = 0, Y = 10)

  T1:                        T2:
  Store (X), 1   (X = 1)     Load R1, (Y)
  Store (Y), 11  (Y = 11)    Store (Y'), R1   (Y' = Y)
                             Load R2, (X)
                             Store (X'), R2   (X' = X)

What are the legitimate answers for X' and Y'?
(X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}?
If Y' is 11 then X' cannot be 0: seeing Y = 11 means T1's earlier store to X has already happened, so T2's later load of X must see X = 1.
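As a concrete illustration (not from the slides), the same two tasks can be written with C11 atomics; with the default memory_order_seq_cst ordering, the execution must be sequentially consistent, so the outcome (X', Y') = (0, 11) can never be observed. Thread creation uses POSIX threads; all names are illustrative.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int X = 0, Y = 10;
    int Xp, Yp;                       /* X', Y' as observed by T2 */

    void *t1(void *arg) {
        atomic_store(&X, 1);          /* Store (X), 1  */
        atomic_store(&Y, 11);         /* Store (Y), 11 */
        return NULL;
    }

    void *t2(void *arg) {
        Yp = atomic_load(&Y);         /* Load R1, (Y); Store (Y'), R1 */
        Xp = atomic_load(&X);         /* Load R2, (X); Store (X'), R2 */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* With seq_cst ordering, (X', Y') = (0, 11) is impossible. */
        printf("X' = %d, Y' = %d\n", Xp, Yp);
        return 0;
    }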
Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these additional SC requirements in our example?

  T1:                        T2:
  Store (X), 1   (X = 1)     Load R1, (Y)
  Store (Y), 11  (Y = 11)    Store (Y'), R1   (Y' = Y)
                             Load R2, (X)
                             Store (X'), R2   (X' = X)

Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? (more on this later)
Multiple Consumer Example
[Figure: one producer and two consumers sharing the queue, each consumer with its own Rhead, Rtail, R registers]

Producer posting item x:
  Load Rtail, (tail)
  Store (Rtail), x
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

Critical section: the consumer's dequeue needs to be executed atomically by one consumer => locks.
What is wrong with this code?
Locks or Semaphores (E. W. Dijkstra, 1965)
A semaphore is a non-negative integer, with the following operations:
- P(s): if s > 0, decrement s by 1, otherwise wait
- V(s): increment s by 1 and wake up one of the waiting processes
P's and V's must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors.

Process i:
  P(s)
  <critical section>
  V(s)

The initial value of s determines the maximum number of processes in the critical section.
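For reference, a minimal sketch of P and V using the POSIX semaphore API (sem_wait corresponds to P, sem_post to V); initializing s to 1 makes the semaphore behave as a mutex, so at most one process is in the critical section at a time.

    #include <semaphore.h>

    sem_t s;

    void init(void) {
        sem_init(&s, /*pshared=*/0, /*initial value=*/1);
    }

    void worker(void) {
        sem_wait(&s);              /* P(s): wait until s > 0, then decrement */
        /* <critical section> */
        sem_post(&s);              /* V(s): increment s, wake one waiter     */
    }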
Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion are difficult to design...

Simpler solution: atomic read-modify-write instructions. Examples (m is a memory location, R is a register):
- Test&Set (m), R:      R <- M[m]; if R == 0 then M[m] <- 1;
- Fetch&Add (m), RV, R: R <- M[m]; M[m] <- R + RV;
- Swap (m), R:          Rt <- M[m]; M[m] <- R; R <- Rt;
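A sketch of how a Test&Set-style spinlock can look with C11 atomics: atomic_flag_test_and_set atomically reads the old value and sets the flag, which is enough to build the lock even though, unlike the slide's Test&Set, it writes unconditionally. Names are illustrative.

    #include <stdatomic.h>

    atomic_flag mutex = ATOMIC_FLAG_INIT;    /* clear == unlocked (0) */

    void lock(void) {
        /* Test&Set in a loop: returns the old value and leaves the flag set. */
        while (atomic_flag_test_and_set(&mutex))
            ;                                /* spin while someone else holds it */
    }

    void unlock(void) {
        atomic_flag_clear(&mutex);           /* Store (mutex), 0 */
    }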
Multiple Consumers Example Using the Test&Set Instruction
     P: Test&Set (mutex), Rtemp
        if (Rtemp != 0) goto P
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
     V: Store (mutex), 0
        process(R)

The code between P (lock acquire) and V (lock release) is the critical section. Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's.
What if the process stops or is swapped out while in the critical section?
Nonblocking Synchronization
Compare&Swap (m), Rt, Rs:
  if (Rt == M[m]) then M[m] <- Rs; Rs <- Rt; status <- success;
  else status <- fail;
(status is an implicit argument)

   try: Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rnewhead = Rhead + 1
        Compare&Swap (head), Rhead, Rnewhead
        if (status == fail) goto try
        process(R)
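A sketch of the same nonblocking consumer using C11 compare-and-swap; the queue, head and tail names reuse the earlier illustrative sketch, and atomic_compare_exchange_strong plays the role of Compare&Swap, with its boolean result standing in for status.

    #include <stdatomic.h>

    #define QSIZE 16
    extern int queue[QSIZE];            /* shared queue (see earlier sketch)   */
    extern atomic_uint head, tail;      /* indices published by the producer   */

    void process(int r);

    void consume_nonblocking(void) {
        for (;;) {                                   /* try:                   */
            unsigned rhead = atomic_load(&head);
            if (rhead == atomic_load(&tail))         /* spin: queue is empty   */
                continue;
            int r = queue[rhead % QSIZE];
            unsigned rnewhead = rhead + 1;
            /* Compare&Swap: claim the element only if head is still rhead.    */
            if (atomic_compare_exchange_strong(&head, &rhead, rnewhead)) {
                process(r);
                return;
            }
            /* status == fail: another consumer advanced head first - retry.   */
        }
    }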
Load-reserve & Store-conditional
Special register(s) hold a reservation flag and address, and the outcome of the store-conditional:

  Load-reserve R, (m):
    <flag, adr> <- <1, m>;
    R <- M[m];

  Store-conditional (m), R:
    if <flag, adr> == <1, m>
    then cancel other procs' reservation on m;
         M[m] <- R;
         status <- succeed;
    else status <- fail;

   try: Load-reserve Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store-conditional (head), Rhead
        if (status == fail) goto try
        process(R)
Performance of Locks
- Blocking atomic read-modify-write instructions, e.g., Test&Set, Fetch&Add, Swap
- vs. non-blocking atomic read-modify-write instructions, e.g., Compare&Swap, Load-reserve/Store-conditional
- vs. protocols based on ordinary Loads and Stores
Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of Loads and Stores (later ...)
Issues in Implementing Sequential Consistency
Implementation of SC is complicated by two issues:
- Out-of-order execution capability
  - Load(a); Load(b)    yes
  - Load(a); Store(b)   yes if a ≠ b
  - Store(a); Load(b)   yes if a ≠ b
  - Store(a); Store(b)  yes if a ≠ b
- Caches
  - Caches can prevent the effect of a store from being seen by other processors
No common commercial architecture has a sequentially consistent memory model!
Memory Fences: Instructions to Sequentialize Memory Accesses
Processors with relaxed or weak memory models (i.e., that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.
Examples of processors with relaxed memory models:
- Sparc V8 (TSO, PSO): Membar
- Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
- PowerPC (WO): Sync, EIEIO
Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.
Using Memory Fences
[Figure: producer and consumer sharing the queue with head and tail pointers]

Producer posting item x:
  Load Rtail, (tail)
  Store (Rtail), x
  MembarSS                <- ensures that the tail pointer is not updated before x has been stored
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        MembarLL          <- ensures that R is not loaded before x has been stored
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)
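A sketch of the same fence idea in C11, reusing the illustrative queue/head/tail names from earlier: a release fence before the store to tail plays the role of MembarSS, and an acquire fence after the load of tail plays the role of MembarLL.

    #include <stdatomic.h>

    #define QSIZE 16
    extern int queue[QSIZE];
    extern atomic_uint head, tail;

    void produce(int x) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        queue[t % QSIZE] = x;
        /* Like MembarSS: the store of x must be visible before tail is updated. */
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&tail, t + 1, memory_order_relaxed);
    }

    int consume(void) {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        while (h == atomic_load_explicit(&tail, memory_order_relaxed))
            ;                                /* spin: queue is empty */
        /* Like MembarLL: the load of tail must complete before x is loaded. */
        atomic_thread_fence(memory_order_acquire);
        int r = queue[h % QSIZE];
        atomic_store_explicit(&head, h + 1, memory_order_relaxed);
        return r;
    }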
Mutual Exclusion Using Load/Store
A protocol based on two shared variables, c1 and c2. Initially, both c1 and c2 are 0 (not busy).

  Process 1:
    ...
    c1 = 1;
    L: if c2 == 1 then go to L
    <critical section>
    c1 = 0;

  Process 2:
    ...
    c2 = 1;
    L: if c1 == 1 then go to L
    <critical section>
    c2 = 0;

What is wrong? Deadlock! (Both processes can set their flags and then wait for each other forever.)
Mutual Exclusion: Second Attempt
To avoid deadlock, let a process give up its reservation (i.e., Process 1 sets c1 to 0) while waiting.

  Process 1:
    ...
    L: c1 = 1;
       if c2 == 1 then { c1 = 0; go to L }
    <critical section>
    c1 = 0;

  Process 2:
    ...
    L: c2 = 1;
       if c1 == 1 then { c2 = 0; go to L }
    <critical section>
    c2 = 0;

Deadlock is not possible, but with low probability a livelock may occur. An unlucky process may never get to enter the critical section (starvation).
A Protocol for Mutual Exclusion (T. Dekker, 1966)
A protocol based on 3 shared variables: c1, c2 and turn. Initially, both c1 and c2 are 0 (not busy).

  Process 1:
    ...
    c1 = 1;
    turn = 1;
    L: if c2 == 1 & turn == 1 then go to L
    <critical section>
    c1 = 0;

  Process 2:
    ...
    c2 = 1;
    turn = 2;
    L: if c1 == 1 & turn == 2 then go to L
    <critical section>
    c2 = 0;

- turn = i ensures that only process i can wait
- variables c1 and c2 ensure mutual exclusion
A solution for n processes was given by Dijkstra and is quite tricky!
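A sketch of the slide's two-flag-plus-turn protocol in C11; the default sequentially consistent atomics matter here, since the protocol relies on loads and stores not being reordered. Process ids 0 and 1 and the array names are illustrative.

    #include <stdatomic.h>

    atomic_int c[2] = {0, 0};   /* c[i] == 1: process i wants the critical section */
    atomic_int turn = 0;        /* turn == i: process i is the one that must wait   */

    /* 'me' is 0 or 1. The default seq_cst ordering is essential - with relaxed
       ordering the load/store-only protocol breaks.                              */
    void lock(int me) {
        int other = 1 - me;
        atomic_store(&c[me], 1);
        atomic_store(&turn, me);
        while (atomic_load(&c[other]) == 1 && atomic_load(&turn) == me)
            ;                   /* spin while the other is busy and it is my turn to wait */
    }

    void unlock(int me) {
        atomic_store(&c[me], 0);
    }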
N-process Mutual Exclusion: Lamport's Bakery Algorithm
Initially num[j] = 0, for all j.

  Process i:

  Entry code:
    choosing[i] = 1;
    num[i] = max(num[0], ..., num[N-1]) + 1;
    choosing[i] = 0;
    for (j = 0; j < N; j++) {
      while (choosing[j]);              /* wait if process j is currently choosing         */
      while (num[j] &&                  /* wait if process j has a number and comes ahead  */
             ((num[j] < num[i]) ||
              (num[j] == num[i] && j < i)));
    }

  Exit code:
    num[i] = 0;
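A fairly direct transcription of the bakery algorithm into C11 atomics, again relying on sequentially consistent ordering; N and the array names are illustrative.

    #include <stdatomic.h>

    #define N 4                          /* number of participating threads (illustrative) */

    atomic_int choosing[N];
    atomic_int num[N];

    static int max_ticket(void) {
        int m = 0;
        for (int j = 0; j < N; j++) {
            int t = atomic_load(&num[j]);
            if (t > m) m = t;
        }
        return m;
    }

    void bakery_lock(int i) {
        atomic_store(&choosing[i], 1);
        atomic_store(&num[i], max_ticket() + 1);
        atomic_store(&choosing[i], 0);
        for (int j = 0; j < N; j++) {
            while (atomic_load(&choosing[j]))
                ;                        /* wait while thread j is picking a ticket      */
            while (atomic_load(&num[j]) &&
                   (atomic_load(&num[j]) < atomic_load(&num[i]) ||
                    (atomic_load(&num[j]) == atomic_load(&num[i]) && j < i)))
                ;                        /* wait while thread j holds an earlier ticket  */
        }
    }

    void bakery_unlock(int i) {
        atomic_store(&num[i], 0);
    }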
Relaxed Memory Model Needs Fences
(Same fence-annotated producer-consumer example as on the "Using Memory Fences" slide: MembarSS ensures that the tail pointer is not updated before x has been stored; MembarLL ensures that R is not loaded before x has been stored.)
Memory Coherence in SMPs
[Figure: CPU-1 and CPU-2 with private caches cache-1 and cache-2 on a shared CPU-Memory bus; location A initially holds 100]
Suppose CPU-1 updates A to 200:
- write-back cache: memory and cache-2 have stale values
- write-through cache: cache-2 has a stale value
Do these stale values matter? What is the view of shared memory for programming?
Write-back Caches & SC
  prog T1: ST X, 1 ; ST Y, 11
  prog T2: LD Y, R1 ; ST Y', R1 ; LD X, R2 ; ST X', R2
[Figure: contents of cache-1, memory, and cache-2 after each step - T1 is executed, cache-1 writes back Y, T2 is executed, cache-1 writes back X, cache-2 writes back X' and Y'. With write-back caches, T2 observes Y = 11 (so Y' = 11) while still reading the stale X = 0 (so X' = 0) - an outcome that sequential consistency forbids; the result is inconsistent.]
Write-through Caches & SC
  prog T1: ST X, 1 ; ST Y, 11
  prog T2: LD Y, R1 ; ST Y', R1 ; LD X, R2 ; ST X', R2
[Figure: contents of cache-1, memory, and cache-2 after T1 executes and then T2 executes. Because cache-2 can still hold a stale copy of X, T2 can observe Y = 11 with X = 0.]
Write-through caches don't preserve sequential consistency either.
Cache Coherence vs. Memory Consistency
- A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address
  - i.e., updates are not lost
- A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses
  - Equivalently, what values can be seen by a load
- A cache coherence protocol is not enough to ensure sequential consistency
  - But if the system is sequentially consistent, then caches must be coherent
- The combination of the cache coherence protocol and the processor's memory reorder buffer implements a given machine's memory consistency model
Maintaining Cache Coherence
Hardware support is required such that:
- only one processor at a time has write permission for a location
- no processor can load a stale copy of the location after a write
=> cache coherence protocols
Warmup: Parallel I/O
[Figure: processor with cache, and a DMA engine attached to a disk, both connected to physical memory over a bus carrying address (A), data (D), and R/W signals]
- Page transfers occur while the processor is running
- Either the cache or the DMA engine can be the bus master and effect transfers
(DMA stands for "Direct Memory Access": the I/O device can read/write memory autonomously from the CPU)
Problems with Parallel I/O
[Figure: cached portions of a page in the processor's cache while DMA transfers move the page between memory and disk]
- Memory -> Disk: physical memory may be stale if the cache copy is dirty
- Disk -> Memory: the cache may hold stale data and not see the memory writes
Snoopy Cache (Goodman 1983)
Idea: have the cache watch (or snoop upon) DMA transfers, and then "do the right thing".
Snoopy cache tags are dual-ported:
- a processor-side port (address A, data lines D, R/W), also used to drive the memory bus when the cache is bus master
- a snoopy read port, attached to the memory bus, for looking up tags and state
Shared Memory Multiprocessor
[Figure: processors M1, M2, M3, each with a snoopy cache, plus a DMA engine and disks, all sharing a bus to physical memory]
Use the snoopy mechanism to keep all processors' view of memory coherent.
Snoopy Cache Coherence Protocols
Write-invalidate protocols:
- write miss: the address is invalidated in all other caches before the write is performed
- read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read
Update protocols, or write broadcast:
- the latency between writing a word in one processor and reading it in another is usually smaller in a write-update scheme
- but since bandwidth is more precious, most multiprocessors use a write-invalidate scheme
Cache State Transition Diagram: the MSI Protocol
Each cache line has state bits (address tag + state). States, as seen by processor P1:
- M: Modified
- S: Shared
- I: Invalid
Transitions for the line in P1's cache:
- I -> M: write miss (P1 gets the line from memory)
- I -> S: read miss (P1 gets the line from memory)
- M -> M: P1 reads or writes
- M -> S: other processor reads (P1 writes back)
- M -> I: other processor intent to write (P1 writes back)
- S -> S: read by any processor
- S -> M: P1 intent to write
- S -> I: other processor intent to write
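A toy sketch (not from the slides) that encodes these transitions for a single line in P1's cache as a C function; the event names are illustrative and bus-request details are omitted.

    typedef enum { I, S, M } msi_state_t;

    typedef enum {
        P1_READ, P1_WRITE,           /* actions by the local processor            */
        OTHER_READ, OTHER_WRITE      /* bus events snooped from other processors  */
    } event_t;

    /* Next state of one cache line in P1's cache; *writeback is set when the
       protocol requires P1 to write the dirty line back.                         */
    msi_state_t msi_next(msi_state_t s, event_t e, int *writeback) {
        *writeback = 0;
        switch (s) {
        case I:
            if (e == P1_WRITE) return M;          /* write miss */
            if (e == P1_READ)  return S;          /* read miss  */
            return I;
        case S:
            if (e == P1_WRITE)    return M;       /* intent to write */
            if (e == OTHER_WRITE) return I;
            return S;                             /* reads keep it shared */
        case M:
            if (e == OTHER_READ)  { *writeback = 1; return S; }
            if (e == OTHER_WRITE) { *writeback = 1; return I; }
            return M;                             /* P1 reads or writes */
        }
        return s;
    }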
Two Processor Example (Reading and writing the same cache line)
[Figure: MSI state transition diagrams for the same line in P1's cache and in P2's cache, annotated with a scenario in which P1 and P2 alternately read and write the line - each write by one processor (an "intent to write" on the bus) invalidates the other's copy, and each read of a modified line forces the owner to write back and move to S.]
Observation
[Figure: the MSI state diagram again]
If a line is in the M state, then no other cache can have a copy of the line! Memory stays coherent; multiple differing copies cannot exist.
MESI: An Enhanced MSI Protocol (increased performance for private data)
Each cache line has an address tag and state bits. States, as seen by processor P1:
- M: Modified Exclusive
- E: Exclusive but unmodified
- S: Shared
- I: Invalid
Transitions for the line in P1's cache:
- I -> M: write miss
- I -> E: read miss, not shared
- I -> S: read miss, shared
- E -> E: P1 read
- E -> M: P1 write
- E -> S: other processor reads
- E -> I: other processor intent to write
- M -> M: P1 write or read
- M -> S: other processor reads (P1 writes back)
- M -> I: other processor intent to write (P1 writes back)
- S -> S: read by any processor
- S -> M: P1 intent to write
- S -> I: other processor intent to write
Optimized Snoop with Level-2 Caches
[Figure: four CPUs, each with an L1 and an L2 cache; the snooper sits at the L2/bus interface]
- Processors often have two-level caches: small L1, large L2 (usually both on chip now)
- Inclusion property: entries in L1 must also be in L2; invalidation in L2 => invalidation in L1
- Snooping on L2 does not affect CPU-L1 bandwidth
- What problem could occur? Interlocks are required when both CPU-L1 and L2-bus interactions involve the same address.
Intervention
[Figure: CPU-1's cache holds A = 200 (modified); memory still holds the stale value A = 100; CPU-2's cache misses on A]
When a read miss for A occurs in cache-2:
- a read request for A is placed on the bus
- cache-1 needs to supply the data and change its state to shared
- the memory may respond to the request as well! Does memory know it has stale data?
- cache-1 needs to intervene through the memory controller to supply the correct data to cache-2
False Sharing
[Figure: a cache block with state, block address, and words data0 ... dataN]
- A cache block contains more than one word
- Cache coherence is done at the block level, not the word level
- Suppose M1 writes word_i and M2 writes word_k, and both words have the same block address. What can happen?
- The block may be invalidated many times unnecessarily, because the addresses share a common block.
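A small illustration (not from the slides) of false sharing in C: two threads each increment their own counter, but if the counters share a cache block, every increment invalidates the other processor's copy. Padding each counter to its own block (a 64-byte block size is assumed here) removes the ping-ponging; thread creation is omitted.

    #include <stdalign.h>

    /* Falsely shared: both counters live in the same cache block, so writes by
       the two threads keep invalidating each other's copy of the block.         */
    struct shared_bad {
        long a;
        long b;
    };

    /* Padded so each counter occupies its own block (assuming 64-byte blocks).  */
    struct shared_good {
        alignas(64) long a;
        alignas(64) long b;
    };

    struct shared_good counters;

    void *bump_a(void *arg) {            /* run by thread 1 */
        for (long i = 0; i < 100000000; i++)
            counters.a++;                 /* only this thread touches a */
        return arg;
    }

    void *bump_b(void *arg) {            /* run by thread 2 */
        for (long i = 0; i < 100000000; i++)
            counters.b++;                 /* only this thread touches b */
        return arg;
    }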
Synchronization and Caches: Performance Issues
[Figure: three processors, each with a cache, spinning on the same mutex over the CPU-Memory bus]

Each processor runs:
  R <- 1
  L: swap (mutex), R
     if <R> then goto L
  <critical section>
  M[mutex] <- 0

Cache-coherence protocols will cause the mutex to ping-pong between P1's and P2's caches.
Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero.
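The read-before-swap idea is often called test-and-test-and-set; a sketch with C11 atomics might look as follows (names illustrative). Spinning on the plain load keeps the mutex line in the Shared state in every waiter's cache, so no bus traffic is generated until the lock is released.

    #include <stdatomic.h>

    atomic_int mutex = 0;        /* 0 = free, 1 = held */

    void lock(void) {
        for (;;) {
            /* Spin on an ordinary read first: the line stays Shared in every
               waiter's cache, so waiting generates no invalidation traffic.   */
            while (atomic_load(&mutex) != 0)
                ;
            /* Only now try the atomic swap; it may still fail if another
               waiter grabbed the lock between our read and our swap.          */
            if (atomic_exchange(&mutex, 1) == 0)
                return;
        }
    }

    void unlock(void) {
        atomic_store(&mutex, 0);
    }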
Load-reserve & Store-conditional
Special register(s) hold a reservation flag and address, and the outcome of the store-conditional:

  Load-reserve R, (a):
    <flag, adr> <- <1, a>;
    R <- M[a];

  Store-conditional (a), R:
    if <flag, adr> == <1, a>
    then cancel other procs' reservations on a;
         M[a] <- <R>;
         status <- succeed;
    else status <- fail;

- If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0
- Several processors may reserve 'a' simultaneously
- These instructions behave like ordinary loads and stores with respect to the bus traffic
- The reservation can be implemented using cache hit/miss, with no additional hardware required (problems?)
Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of:
- Uniprocessor cache miss traffic
- Traffic caused by communication
  - Results in invalidations and subsequent cache misses
  - Adds a 4th C: coherence misses
    - Joins Compulsory, Capacity, Conflict
    - (Sometimes called communication misses)
Coherency Misses
- True sharing misses arise from the communication of data through the cache coherence mechanism
  - Invalidates due to the first write to a shared block
  - Reads by another CPU of a modified block held in a different cache
  - The miss would still occur if the block size were 1 word
- False sharing misses arise when a block is invalidated because some word in the block, other than the one being read, is written into
  - The invalidation does not cause a new value to be communicated, but only causes an extra cache miss
  - The block is shared, but no word in the block is actually shared
  - The miss would not occur if the block size were 1 word
Home Work
Readings: Read Chapter ;
Acknowledgements
These slides contain material from courses: UCB CS152 and Stanford EE108B.