
1 CMPE 382 / ECE 510 Computer Organization & Architecture. Chapter 4 – Multiprocessors and Multithreading. Based on the text: Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 4th edition, 2006. Many lecture slides are courtesy of or based on the work of Drs. Asanovic, Patterson, Culler and Amaral.

2 Uniprocessor Performance (SPECint)
Chart of SPECint uniprocessor performance over time (from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006): VAX: 25%/year from 1978 to 1986; RISC + x86: 52%/year from 1986 to 2002; RISC + x86: ??%/year from 2002 to present.

3 Déjà vu all over again? "… today's processors … are nearing an impasse as technologies approach the speed of light …" – David Mitchell, The Transputer: The Time Is Now (1989). The Transputer had bad timing (uniprocessor performance kept climbing) → procrastination rewarded: 2X sequential performance every 1.5 years. "We are dedicating all of our future product development to multicore designs. … This is a sea change in computing." – Paul Otellini, President, Intel (2005). All microprocessor companies switch to MP (2X CPUs / 2 yrs) → procrastination penalized: 2X sequential performance every 5 years.
Multicore chips circa 2007:
Manufacturer/Year    AMD/'07   Intel/'07   IBM/'07   Sun/'07
Processors/chip         4          ?           2         8
Threads/Processor       1          ?           ?         ?
Threads/chip            ?          ?           ?        64

4 Other Factors → Multiprocessors
Growth in data-intensive applications: databases, file servers, … Growing interest in servers and server performance. Increasing desktop performance is less important (outside of graphics). Improved understanding of how to use multiprocessors effectively, especially in servers, where there is significant natural TLP. Advantage of leveraging design investment by replication rather than by unique design.

5 Flynn’s Taxonomy Flynn classified by data and control streams in 1966
M. J. Flynn, "Very High-Speed Computers", Proc. of the IEEE, vol. 54, Dec. 1966. SIMD → Data-Level Parallelism; MIMD → Thread-Level Parallelism. MIMD is popular because it is flexible (N programs or 1 multithreaded program) and cost-effective (same MPU in the desktop and the MIMD machine). The four classes: Single Instruction, Single Data (SISD) – uniprocessor; Single Instruction, Multiple Data (SIMD) – single PC: vector machines, CM-2; Multiple Instruction, Single Data (MISD) – (????); Multiple Instruction, Multiple Data (MIMD) – clusters, SMP servers.

6 Back to Basics "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." Parallel Architecture = Computer Architecture + Communication Architecture

7 Two Models for Communication and Memory Architecture
Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (aka multicomputers). Modern cluster systems contain multiple stand-alone computers communicating via messages. Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either UMA (Uniform Memory Access time) for shared-address, centralized-memory MPs, or NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MPs. In the past there was confusion over whether "sharing" means sharing physical memory (Symmetric MP) or sharing the address space.

8 Centralized vs. Distributed Memory
Figure: processors P1 … Pn, each with a cache ($). In the centralized-memory organization they share a single memory through an interconnection network; in the distributed-memory organization each processor has its own local memory module and the nodes communicate over the interconnection network. The distributed organization is what the design scales toward.

9 Centralized Memory Multiprocessor
Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors. Large caches → a single memory can satisfy the memory demands of a small number of processors. Can scale to a few dozen processors by using a switch and many memory banks. Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases.

10 Distributed Memory Multiprocessor
Pro: Cost-effective way to scale memory bandwidth, if most accesses are to local memory. Pro: Reduces latency of local memory accesses. Con: Communicating data between processors is more complex. Con: Software must be aware of data placement to take advantage of the increased memory bandwidth.

11 Challenges of Parallel Processing
The big challenge is the percentage of a program that is inherently sequential. What does it mean to be inherently sequential? Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential? 10%? 5%? 1%? <1%?
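A quick Amdahl's Law check (worked out here; the slide poses it only as a question) shows that the answer is the last one:

```latex
\text{Speedup} = \frac{1}{f_{\text{seq}} + \frac{1 - f_{\text{seq}}}{N}}
\quad\Rightarrow\quad
80 = \frac{1}{f_{\text{seq}} + \frac{1 - f_{\text{seq}}}{100}}
\quad\Rightarrow\quad
99\, f_{\text{seq}} = \frac{100}{80} - 1 = 0.25
\quad\Rightarrow\quad
f_{\text{seq}} \approx 0.25\%
```

So less than 1% of the original program can be sequential.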

12 Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system). Forks and joins: in parallel programming, a parallel process may want to wait until several events have occurred. Producer-consumer: a consumer process must wait until the producer process has produced data. Exclusive use of a resource: the operating system has to ensure that only one process uses a resource at a given time.
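As a concrete illustration of the producer-consumer case (a hypothetical sketch, not from the slides; the names produce/consume and the one-slot buffer are invented for the example), a consumer blocks until the producer signals that data is ready:

```c
#include <pthread.h>
#include <stdbool.h>

/* One-slot buffer shared by a producer thread and a consumer thread. */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static bool has_data = false;
static int  slot;

void produce(int value) {
    pthread_mutex_lock(&lock);
    slot = value;                    /* produce the data            */
    has_data = true;
    pthread_cond_signal(&ready);     /* wake a waiting consumer     */
    pthread_mutex_unlock(&lock);
}

int consume(void) {
    pthread_mutex_lock(&lock);
    while (!has_data)                /* wait until producer is done */
        pthread_cond_wait(&ready, &lock);
    has_data = false;
    int value = slot;
    pthread_mutex_unlock(&lock);
    return value;
}
```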

13 Sequential Consistency A Memory Model
"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program" – Leslie Lamport. Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs.
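A standard litmus test (added illustration, not part of the slide) makes the definition concrete: with x = y = 0 initially, sequential consistency forbids the outcome r1 == 0 and r2 == 0, because whichever store comes first in the global interleaving is visible to the other thread's load.

```c
/* Store-buffering litmus test; thread1 and thread2 run concurrently on
 * different processors. Under sequential consistency the result
 * r1 == 0 && r2 == 0 is impossible; on hardware with a relaxed memory
 * model (e.g. store buffers) it can and does occur.                    */
int x = 0, y = 0;

void thread1(int *r1) { x = 1; *r1 = y; }
void thread2(int *r2) { y = 1; *r2 = x; }
```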

14 Locks or Semaphores E. W. Dijkstra, 1965
A semaphore is a non-negative integer, with the following operations: P(s): if s > 0, decrement s by 1, otherwise wait; V(s): increment s by 1 and wake up one of the waiting processes. P's and V's must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors. Each process i executes: P(s); <critical section>; V(s). The initial value of s determines the maximum number of processes in the critical section.
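A minimal sketch of how P and V map onto a real API (POSIX semaphores are assumed here; initializing the semaphore to 1 makes it a mutual-exclusion lock, allowing one process in the critical section at a time):

```c
#include <semaphore.h>

sem_t s;                          /* semaphore shared by the processes/threads */

void init(void)      { sem_init(&s, /*pshared=*/0, /*initial value=*/1); }

void process_i(void) {
    sem_wait(&s);                 /* P(s): wait until s > 0, then decrement    */
    /* <critical section> */
    sem_post(&s);                 /* V(s): increment s and wake one waiter     */
}
```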

15 Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion are difficult to design... A simpler solution: atomic read-modify-write instructions. Examples (m is a memory location, R is a register): Test&Set (m), R: R ← M[m]; if R == 0 then M[m] ← 1. Fetch&Add (m), RV, R: R ← M[m]; M[m] ← R + RV. Swap (m), R: Rt ← M[m]; M[m] ← R; R ← Rt.
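Test&Set is available almost directly in C11 as atomic_flag_test_and_set; a sketch (an assumed mapping, not code from the slides) of a spinlock built on it:

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;      /* clear == free */

void acquire(void) {
    /* Test&Set: atomically return the old value and set the flag.
     * Keep spinning while the old value was already set, i.e. while
     * some other processor holds the lock.                          */
    while (atomic_flag_test_and_set(&lock))
        ;                                        /* busy-wait       */
}

void release(void) {
    atomic_flag_clear(&lock);                    /* free the lock   */
}
```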

16 Performance of Locks Blocking atomic read-modify-write instructions
e.g., Test&Set, Fetch&Add, Swap; vs. non-blocking atomic read-modify-write instructions, e.g., Compare&Swap, Load-reserve/Store-conditional; vs. protocols based on ordinary Loads and Stores. Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of Loads and Stores. More on this later ...
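As an example of the non-blocking style, Fetch&Add can be built from Compare&Swap with a retry loop (a sketch using C11 atomics; this is the usual lock-free pattern, not code from the slides):

```c
#include <stdatomic.h>

/* Non-blocking Fetch&Add from Compare&Swap: retry until no other thread
 * modified the location between our read and our update.               */
int fetch_and_add(atomic_int *m, int rv) {
    int old = atomic_load(m);
    while (!atomic_compare_exchange_weak(m, &old, old + rv))
        ;          /* on failure, 'old' is refreshed with the current value */
    return old;    /* value seen just before our addition                   */
}
```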

17 Issues in Implementing Sequential Consistency
Implementation of SC is complicated by two issues. Out-of-order execution capability: can the second access be performed before the first without violating uniprocessor semantics? Load(a); Load(b): yes. Load(a); Store(b): yes if a ≠ b. Store(a); Load(b): yes if a ≠ b. Store(a); Store(b): yes if a ≠ b. Caches: caches can prevent the effect of a store from being seen by other processors.

18 Memory Fences Instructions to sequentialize memory accesses
Processors with relaxed or weak memory models (i.e., which permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses. Examples of processors with relaxed memory models: Sparc V8 (TSO, PSO): Membar (memory barrier); Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore; PowerPC (WO): Sync, EIEIO. Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.
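A rough portable analogue of these ISA-level fences (an assumption about how the same idea appears in C11, not material from the slide): atomic_thread_fence keeps the data store ahead of the flag store, much like a Membar #StoreStore.

```c
#include <stdatomic.h>

int data;                  /* ordinary shared data        */
atomic_int flag;           /* 0 = not ready, 1 = ready    */

void producer(void) {
    data = 42;                                       /* plain store              */
    atomic_thread_fence(memory_order_release);       /* data store cannot move
                                                        below this fence         */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

void consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                            /* spin until flag is set   */
    atomic_thread_fence(memory_order_acquire);       /* data load cannot move
                                                        above this fence         */
    int v = data;                                    /* guaranteed to observe 42 */
    (void)v;
}
```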

19 Data-Race Free Programs a.k.a. Properly Synchronized Programs
Process 1: ... Acquire(mutex); <critical section>; Release(mutex). Process 2: ... Acquire(mutex); <critical section>; Release(mutex). Synchronization variables (e.g., mutex) are disjoint from data variables. Accesses to writable shared data variables are protected in critical regions → no data races except for the locks themselves. (A formal definition is elusive.) In general, it cannot be proven whether a program is data-race free.

20 Fences in Data-Race Free Programs
Process 1: ... Acquire(mutex); membar; <critical section>; Release(mutex). Process 2: ... Acquire(mutex); membar; <critical section>; Release(mutex). A relaxed memory model allows reordering of instructions by the compiler or the processor as long as the reordering is not done across a fence. The processor also should not speculate or prefetch across fences.

21 Mutual Exclusion Using Load/Store
A protocol based on two shared variables, c1 and c2. Initially, both c1 and c2 are 0 (not busy). Process 1: ... c1 = 1; L: if c2 == 1 then go to L; <critical section>; c1 = 0. Process 2: ... c2 = 1; L: if c1 == 1 then go to L; <critical section>; c2 = 0. What is wrong? Deadlock! If both processes set their flags before either checks the other, both spin forever.

22 Mutual Exclusion: second attempt
To avoid deadlock, let a process give up its reservation (i.e., Process 1 sets c1 to 0) while waiting. Process 1: ... L: c1 = 1; if c2 == 1 then { c1 = 0; go to L }; <critical section>; c1 = 0. Process 2: ... L: c2 = 1; if c1 == 1 then { c2 = 0; go to L }; <critical section>; c2 = 0. Deadlock is not possible, but with low probability a livelock may occur. An unlucky process may never get to enter the critical section → starvation.
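A direct C rendering of this second attempt (a sketch; the flags are made sequentially consistent atomics, since with a relaxed memory model the reorderings discussed on the memory-fence slide would break the protocol; the function names are invented):

```c
#include <stdatomic.h>

atomic_int c1, c2;                   /* 0 = not busy; both start at 0 */

void process1_enter(void) {
    for (;;) {
        atomic_store(&c1, 1);        /* announce interest (seq_cst by default) */
        if (atomic_load(&c2) == 0)
            return;                  /* other process not competing: enter     */
        atomic_store(&c1, 0);        /* give up the reservation to avoid
                                        deadlock; both processes may keep
                                        retrying in lock-step -> livelock      */
    }
}

void process1_exit(void) { atomic_store(&c1, 0); }

/* process2_enter / process2_exit are symmetric, with c1 and c2 swapped. */
```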

23 Memory Consistency in SMPs
Figure: CPU-1 and CPU-2, each with its own cache, share a CPU-memory bus to memory; location A (value 100) is cached by both CPUs. Suppose CPU-1 updates A to 200. With a write-back cache, memory and cache-2 have stale values; with a write-through cache, cache-2 has a stale value. Do these stale values matter? What is the view of shared memory for programming?

24 Maintaining Sequential Consistency
SC is sufficient for correct producer-consumer and mutual exclusion code (e.g., Dekker's algorithm). Multiple copies of a location in various caches can cause SC to break down. Hardware support is required such that only one processor at a time has write permission for a location, and no processor can load a stale copy of the location after a write → cache coherence protocols.

25 Cache Coherence Protocols for SC
Write request: the address is invalidated (updated) in all other caches before (after) the write is performed. Read request: if a dirty copy is found in some cache, a write-back is performed before the memory is read. We will focus on invalidation protocols as opposed to update protocols (also called write broadcast). The latency between writing a word in one processor and reading it in another is usually smaller in a write-update scheme, but since bandwidth is more precious, most multiprocessors use a write-invalidate scheme.

26 Problems with Parallel I/O
Figure: a processor with a cache on a bus, physical memory, and a disk connected through DMA; cached portions of a page interact with DMA transfers. Memory → Disk: physical memory may be stale if the cache copy is dirty. Disk → Memory: the cache may hold stale data and not see the memory writes.

27 Snoopy Cache (Goodman, 1983). Idea: have the cache watch (or snoop upon) DMA transfers, and then "do the right thing." Snoopy cache tags are dual-ported: one port serves the processor side (data lines, tags and state, address/data, R/W) and is used to drive the memory bus when the cache is bus master; the other is a snoopy read port attached to the memory bus.

28 Snoopy Cache Actions for DMA
Observed Bus Cycle      Cache State          Cache Action
DMA Read                Address not cached   No action
(Memory → Disk)         Cached, unmodified   No action
                        Cached, modified     Cache intervenes
DMA Write               Address not cached   No action
(Disk → Memory)         Cached, unmodified   Cache purges its copy
                        Cached, modified     ???

29 Shared Memory Multiprocessor
Figure: processors M1, M2, and M3, each with a snoopy cache, share a bus with physical memory and a DMA engine connected to disks. Use the snoopy mechanism to keep all processors' view of memory coherent.

30 Cache State Transition Diagram The MSI protocol
M: Modified, S: Shared, I: Invalid. Each cache line has an address tag and state bits. State transition diagram for the cache state in processor P1: in I, a read miss moves the line to S and a write miss (intent to write) moves it to M; in S, reads by any processor keep it in S, P1's intent to write moves it to M, and another processor's intent to write moves it to I; in M, P1 reads or writes keep it in M, another processor's read forces P1 to write back and moves the line to S, and another processor's intent to write invalidates the line (P1 writes back the modified data).
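One compact way to read the diagram is as a transition function. The sketch below (an added illustration, not from the slides) covers the P1-side events named above; bus actions such as the write-back on leaving M are noted in comments.

```c
/* MSI cache-line states and transitions, from the point of view of the
 * cache in processor P1.                                               */
typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

typedef enum {
    P1_READ, P1_WRITE,                 /* processor-side events             */
    BUS_READ, BUS_INTENT_TO_WRITE      /* snooped events from other caches  */
} msi_event_t;

msi_state_t msi_next(msi_state_t s, msi_event_t e) {
    switch (s) {
    case INVALID:
        if (e == P1_READ)  return SHARED;     /* read miss                    */
        if (e == P1_WRITE) return MODIFIED;   /* write miss / intent to write */
        return INVALID;
    case SHARED:
        if (e == P1_WRITE)            return MODIFIED;  /* intent to write    */
        if (e == BUS_INTENT_TO_WRITE) return INVALID;   /* other writer       */
        return SHARED;                 /* reads by any processor stay in S    */
    case MODIFIED:
        if (e == BUS_READ)            return SHARED;    /* P1 writes back     */
        if (e == BUS_INTENT_TO_WRITE) return INVALID;   /* P1 writes back     */
        return MODIFIED;               /* P1 reads or writes stay in M        */
    }
    return s;
}
```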

31 Two Processor Example (Reading and writing the same cache line)
Figure: two copies of the MSI state transition diagram, one for the cache state in P1 and one for the cache state in P2, showing how reads and writes by P1 and P2 to the same cache line move each cache among M, S, and I (for example, P1 writes → P1's copy goes to M while P2's goes to I; P2 then reads → P1 writes back and both copies end in S).

32 Observation: if a line is in the M state, then no other cache can have a copy of the line! Memory stays coherent; multiple differing copies cannot exist.

33 MESI: An Enhanced MSI Protocol (increased performance for private data)
M: Modified (Exclusive), E: Exclusive but unmodified, S: Shared, I: Invalid. Each cache line has an address tag and state bits. State transition diagram for the cache state in processor P1: a read miss when no other cache holds the line fills it in E, while a read miss to a shared line fills it in S; from E, a P1 write moves the line to M without a bus transaction; the remaining transitions (P1 intent to write, write miss, other processor reads → P1 writes back, other processor intent to write → invalidate) are as in MSI.

34 Optimized Snoop with Level-2 Caches
Figure: four CPUs, each with a small L1 cache backed by a large L2 cache whose snooper sits on the shared bus. Processors often have two-level caches: small L1, large L2 (usually both on chip now). Inclusion property: entries in L1 must also be in L2, so an invalidation in L2 → invalidation in L1. Snooping on L2 does not affect CPU-L1 bandwidth. What problem could occur? Interlocks are required when both CPU-L1 and L2-bus interactions involve the same address.

35 Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of uniprocessor cache miss traffic and traffic caused by communication, which results in invalidations and subsequent cache misses. This adds a 4th C, coherence misses, joining the classic Compulsory, Capacity, and Conflict misses.

36 Coherency Misses. True sharing misses arise from the communication of data through the cache coherence mechanism: invalidates due to the first write to a shared block, and reads by another CPU of a modified block held in a different cache. A true sharing miss would still occur if the block size were 1 word. False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written into. The invalidation does not cause a new value to be communicated, but only causes an extra cache miss: the block is shared, but no word in the block is actually shared → the miss would not occur if the block size were 1 word.
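A small illustration of false sharing (a hypothetical example, not from the slides): each thread updates only its own counter, yet the two counters sit in the same cache block, so every write invalidates the block in the other core's cache.

```c
#include <pthread.h>

/* No word is truly shared between the two threads, but the counters share a
 * cache block, so each increment causes a coherence miss in the other core
 * (false sharing). Padding each counter to its own block (64 bytes is a
 * typical line size) removes the misses without changing the result.       */
struct { long count; /* char pad[56];  <- uncomment to avoid false sharing */ }
    counters[2];

void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < 100000000L; i++)
        counters[id].count++;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (long id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    return 0;
}
```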

37 Multithreading

38 Pipeline Hazards
Pipeline diagram (stages F D X M W over cycles t0–t14) for the dependent sequence: LW r1, 0(r2); LW r5, 12(r1); ADDI r5, r5, #12; SW 12(r1), r5. Each instruction may depend on the one before it. What can be done to cope with this?

39 Multithreading. How can we guarantee no dependencies between instructions in a pipeline? One way is to interleave the execution of instructions from different program threads on the same pipeline. Pipeline diagram (F D X M W over cycles t0–t9) interleaving 4 threads, T1–T4, on a non-bypassed 5-stage pipe: T1: LW r1, 0(r2); T2: ADD r7, r1, r4; T3: XORI r5, r4, #12; T4: SW 0(r7), r5; T1: LW r5, 12(r1). The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.

40 CDC 6600 Peripheral Processors (Cray, 1964)
The first multithreaded hardware: 10 "virtual" I/O processors with fixed interleave on a simple pipeline. The pipeline has a 100 ns cycle time, so each virtual processor executes one instruction every 1000 ns. Accumulator-based instruction set to reduce processor state. The objective was to cope with long I/O latencies.

41 Tera MTA (1990-97) Up to 256 processors
Up to 128 active threads per processor. Processors and memory modules populate a sparse 3D torus interconnection fabric. Flat, shared main memory; no data cache. Sustains one main memory access per cycle per processor. GaAs logic in the prototype (260 MHz); CMOS version, MTA-2, 50 W/processor.

42 IBM PowerPC RS64-IV (2000) Commercial coarse-grain multithreading CPU
Based on PowerPC with a quad-issue, in-order, five-stage pipeline. Each physical CPU supports two virtual CPUs. On an L2 cache miss, the pipeline is flushed and execution switches to the second thread. The short pipeline minimizes the flush penalty (4 cycles), which is small compared to the memory access latency. Flushing the pipeline also simplifies exception handling.

43 Changes in Power 5 to support SMT
Increased the associativity of the L1 instruction cache and the instruction address translation buffers. Added per-thread load and store queues. Increased the size of the L2 (1.92 vs. 1.44 MB) and L3 caches. Added separate instruction prefetch and buffering per thread. Increased the number of virtual registers from 152 to 240. Increased the size of several issue queues. The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.

44 Pentium-4 Hyperthreading (2002)
The first commercial SMT design (2-way SMT); hyperthreading == SMT. The logical processors share nearly all resources of the physical processor: caches, execution units, branch predictors. The die area overhead of hyperthreading is ~5%. When one logical processor is stalled, the other can make progress; no logical processor can use all entries in the queues when two threads are active. A processor running only one active software thread runs at approximately the same speed with or without hyperthreading; the load-store buffer in the L1 cache does not behave this way, however, and hence causes a 15% slowdown.

45 Pentium-4 Hyperthreading Front End
Figure: front-end pipeline diagram showing which resources are divided between the logical CPUs and which are shared between them. [Intel Technology Journal, Q1 2002]

46 Pentium-4 Hyperthreading Execution Pipeline
Figure: execution pipeline diagram showing resource sharing between the two logical CPUs. [Intel Technology Journal, Q1 2002]

47 Summary: Multithreaded Categories
Figure: issue-slot utilization over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; slots are filled by Thread 1 through Thread 5 or left as idle slots.

