The University of Adelaide, School of Computer Science Computer Architecture A Quantitative Approach, Fifth Edition The University of Adelaide, School of Computer Science 16 April 2017 Chapter 5 Multiprocessors and Thread-Level Parallelism Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Introduction Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model Targeted for tightly-coupled shared-memory multiprocessors For n processors, need at least n threads Amount of computation assigned to each thread = grain size Threads can be used for data-level parallelism, but the overhead may outweigh the benefit Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Types Introduction Symmetric multiprocessors (SMP) Small number of cores Share single memory with uniform memory access/latency (UMA) Distributed shared memory (DSM) Memory distributed among processors Non-uniform memory access/latency (NUMA) Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Cache Coherence Processors may see different values through their caches: Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Cache Coherence Coherence All reads by any processor must return the most recently written value Writes to the same location by any two processors are seen in the same order by all processors Protocols Distributed Each core tracks sharing status of each block Centralized Sharing status of each block kept in one location Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Coherence Protocols Where is the most up-to-date version of the data? Depends on cache organization: Write buffers? May be in a write buffer Write-through or write-back caches? Write-through: in memory or in a write buffer Write-back: could be in any cache or write buffer How to efficiently find this version of the data? Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Snoopy Coherence Protocols The University of Adelaide, School of Computer Science 16 April 2017 Snoopy Coherence Protocols Write invalidate On write, invalidate all other copies Use bus itself to serialize Write cannot complete until bus access is obtained Write update On write, update all copies Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Snoopy Coherence Protocols The University of Adelaide, School of Computer Science 16 April 2017 Snoopy Coherence Protocols Caches “snoop” for updates Can use the valid bit already associated with each cache block Can also use dirty bit, assuming write-back caches MSI protocol (write invalidate) Every cache block is in one of three states Modified – dirty (and exclusive) – read/write access Shared – valid and clean – read-only access Invalid – not valid – empty Write hits to modified blocks do not require invalidations Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Snoopy Coherence Protocols The University of Adelaide, School of Computer Science 16 April 2017 Snoopy Coherence Protocols Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Snoopy Coherence Protocols The University of Adelaide, School of Computer Science 16 April 2017 Snoopy Coherence Protocols Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Snoopy Coherence Protocols Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Example Assumes A1 and A2 map to same cache block, initial cache state is invalid Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Example Assumes A1 and A2 map to same cache block Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Example Assumes A1 and A2 map to same cache block Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Example Assumes A1 and A2 map to same cache block Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Example Assumes A1 and A2 map to same cache block Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Example Why write miss first? Because in general, only write a piece of block, may need to read it first so that can have a full vblock; therefore, need to get Write back is low priority event. Assumes A1 and A2 map to same cache block Copyright © 2012, Elsevier Inc. All rights reserved. CS258 S99
Coherence Protocols: Extensions The University of Adelaide, School of Computer Science 16 April 2017 Coherence Protocols: Extensions Complications for the basic MSI protocol: Operations are not atomic E.g. detect miss, acquire bus, send invalidation Creates possibility of deadlock and races Protocols need to enforce atomic transactions Extensions: Add exclusive state to indicate clean block in only one cache (MESI protocol) Avoid needing to send invalidation when writing to an exclusive block Add owned state to indicate the newest copy of data (MOESI protocol) Like a modified block, but can coexist with shared blocks Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Coherence Protocols: Extensions The University of Adelaide, School of Computer Science 16 April 2017 Coherence Protocols: Extensions Shared memory bus and snooping bandwidth is bottleneck for scaling symmetric multiprocessors Duplicating tags Place directory in outermost cache Use crossbars or point-to-point networks with banked memory Centralized Shared-Memory Architectures Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Performance A new type of cache miss: coherence (4th C) Two types True sharing misses Write to a word of a shared block Another processor reads the word from the invalidated block (read miss) False sharing misses Another processor reads an unmodified word in an invalidated block Can’t exist if block size is one word Performance of Symmetric Shared-Memory Multiprocessors Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Performance Study: Commercial Workload The University of Adelaide, School of Computer Science 16 April 2017 Performance Study: Commercial Workload Performance of Symmetric Shared-Memory Multiprocessors Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Performance Study: Commercial Workload The University of Adelaide, School of Computer Science 16 April 2017 Performance Study: Commercial Workload Performance of Symmetric Shared-Memory Multiprocessors Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Performance Study: Commercial Workload The University of Adelaide, School of Computer Science 16 April 2017 Performance Study: Commercial Workload Performance of Symmetric Shared-Memory Multiprocessors Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Copyright © 2012, Elsevier Inc. All rights reserved. Directory Protocols Can reduce snoopy-protocol traffic by adding a directory to L3 Add a bit string of length n for n cores to each block Keeps track of which core has the entry in a private cache Allows point-to-point communication between relevant cores Can not scale this approach to multiple processors because each processor has a distinct L3 cache Must instead add a directory to main memory Copyright © 2012, Elsevier Inc. All rights reserved.
The University of Adelaide, School of Computer Science 16 April 2017 Directory Protocols Directory is implemented in main memory Tracks all blocks in main memory Which nodes (processors) have each block Sharing state of each block Implement in a distributed fashion: Statically divide shared memory between processors to afford easy block location Distributed Shared Memory and Directory-Based Coherence Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Directory Protocols For each block, possible states: Uncached No nodes have the block cached Shared One or more nodes have the block cached, value in memory is up-to-date Set of node IDs Modified Exactly one node has a copy of the cache block, value in memory is out-of-date Owner node ID Caches also still maintain state as in snoopy protocol Distributed Shared Memory and Directory-Based Coherence Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Copyright © 2012, Elsevier Inc. All rights reserved. Directory Protocols Directory maintains block states and sends state messages Each message is either sent or received by the Local node – node which originated the request Either from or to one of the following nodes Home node – node which controls the requested address Remote node – any node which needs to know about a change in state Copyright © 2012, Elsevier Inc. All rights reserved.
The University of Adelaide, School of Computer Science 16 April 2017 Directory Protocols For uncached block: Read miss Requesting node is sent the requested data and is made the only sharing node, block is now shared Write miss The requesting node is sent the requested data and becomes the sharing node, block is now exclusive For shared block: The requesting node is sent the requested data from memory, node is added to sharing set The requesting node is sent the value, all nodes in the sharing set are sent invalidate messages, sharing set only contains requesting node, block is now exclusive Distributed Shared Memory and Directory-Based Coherence Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Directory Protocols For exclusive block: Read miss The owner is sent a data fetch message, block becomes shared, owner sends data to the directory, data written back to memory, sharers set contains old owner and requestor Write miss Message is sent to old owner to invalidate and send the value to the directory, requestor becomes new owner, block remains exclusive Distributed Shared Memory and Directory-Based Coherence Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Messages Distributed Shared Memory and Directory-Based Coherence Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
State Transitions (cache) The University of Adelaide, School of Computer Science 16 April 2017 State Transitions (cache) Distributed Shared Memory and Directory-Based Coherence Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
State Transitions (directory) The University of Adelaide, School of Computer Science 16 April 2017 State Transitions (directory) Distributed Shared Memory and Directory-Based Coherence Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Example Configuration Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Example Processor 1 Processor 2 Interconnect Directory Memory A1 and A2 map to the same cache block Copyright © 2012, Elsevier Inc. All rights reserved. CS258 S99
Copyright © 2012, Elsevier Inc. All rights reserved. Example Processor 1 Processor 2 Interconnect Directory Memory A1 and A2 map to the same cache block Copyright © 2012, Elsevier Inc. All rights reserved. CS258 S99
Copyright © 2012, Elsevier Inc. All rights reserved. Example Processor 1 Processor 2 Interconnect Directory Memory A1 and A2 map to the same cache block Copyright © 2012, Elsevier Inc. All rights reserved. CS258 S99
Copyright © 2012, Elsevier Inc. All rights reserved. Example Processor 1 Processor 2 Interconnect Directory Memory A1 and A2 map to the same cache block Copyright © 2012, Elsevier Inc. All rights reserved. CS258 S99
Copyright © 2012, Elsevier Inc. All rights reserved. Example Processor 1 Processor 2 Interconnect Directory Memory A1 and A2 map to the same cache block Copyright © 2012, Elsevier Inc. All rights reserved. CS258 S99
Copyright © 2012, Elsevier Inc. All rights reserved. Example Processor 1 Processor 2 Interconnect Directory Memory A1 and A2 map to the same cache block Copyright © 2012, Elsevier Inc. All rights reserved. CS258 S99
Copyright © 2012, Elsevier Inc. All rights reserved. AMD Opteron 8439 Transistors: 904 M Power consumption: 137 W Max cores/chip: 6 Max threads/core: 1 Max nodes: 8 Instruction issue/clock: 3 Clock rate: 2.8 GHz Multicore coherence protocol: MOESI (snooping) Multichip coherence protocol: Snoopy or directory Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. IBM Power 7 Transistors: 1200 M Power consumption: 140 W Max cores/chip: 8 Max threads/core: 6 Max nodes: 32 Instruction issue/clock: 6 Clock rate: 4.1 GHz Multicore coherence protocol: MESI (snooping/directory) Multichip coherence protocol: Directory Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Intel Xeon 7560 Transistors: 2300 M Power consumption: 130 W Max cores/chip: 8 Max threads/core: 2 Max nodes: 8 Instruction issue/clock: 4 Clock rate: 2.7 GHz Multicore coherence protocol: MESIF (snooping/directory) Multichip coherence protocol: Directory Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Sun UltraSPARC T2 Transistors: 500 M Power consumption: 95 W Max cores/chip: 8 Max threads/core: 8 Max nodes: 4 Instruction issue/clock: 2 Clock rate: 1.6 GHz Multicore coherence protocol: MOESI (snooping/directory) Multichip coherence protocol: Snooping Copyright © 2012, Elsevier Inc. All rights reserved.
Performance Comparison Copyright © 2012, Elsevier Inc. All rights reserved.
The University of Adelaide, School of Computer Science 16 April 2017 Synchronization Synchronization Need to control access to resources between threads Basic building blocks: Need to be able to read from and write to memory in an uninterruptable instruction Atomic exchange Swaps register with memory location Test-and-set Sets under condition Fetch-and-increment Reads original value from memory and increments it in memory load linked/store conditional If the contents of the memory location specified by the load linked are changed before the store conditional to the same address, the store conditional fails Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Implementing Locks Synchronization Spin lock If no coherence: DADDUI R2,R0,#1 lockit: EXCH R2,0(R1) ;atomic exchange BNEZ R2,lockit ;already locked? If coherence: lockit: LD R2,0(R1) ;load of lock BNEZ R2,lockit ;not available-spin DADDUI R2,R0,#1 ;load locked value EXCH R2,0(R1) ;swap BNEZ R2,lockit ;branch if lock wasn’t 0 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
The University of Adelaide, School of Computer Science 16 April 2017 Implementing Locks Synchronization Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
Copyright © 2012, Elsevier Inc. All rights reserved. Fallacies Amdahl’s law doesn’t apply to parallel computers Linear speedups are needed to make multiprocessors cost effective Copyright © 2012, Elsevier Inc. All rights reserved.
Copyright © 2012, Elsevier Inc. All rights reserved. Pitfalls Measuring performance of multiprocessors by linear speedup versus execution time Not developing the software to take advantage of, or optimize for, a multiprocessor architecture Copyright © 2012, Elsevier Inc. All rights reserved.