1
Multiprocessors - Flynn’s taxonomy (1966)
- Single Instruction stream, Single Data stream (SISD): the conventional uniprocessor
- Single Instruction stream, Multiple Data stream (SIMD): popular for some applications, like image processing; vector processors can be construed as SIMD; the MMX extensions to the ISA reflect the SIMD philosophy
- Multiple Instruction stream, Single Data stream (MISD): no well-known examples
- Multiple Instruction stream, Multiple Data stream (MIMD): the most general
12/2/2018 CSE 471 Multiprocessors
2
MIMD: Shared-memory vs. message-passing
- Shared memory = single shared address space (an extension of the uniprocessor model)
  - UMA: uniform memory access (centralized memory; shared bus)
  - NUMA: non-uniform memory access (decentralized "distributed shared memory"; interconnection network)
    - NUMA-CC: cache coherent (directory-based protocols)
    - NUMA without cache coherence (coherence enforced by software)
- Message passing = processors communicate by messages (send, receive)
  - Multicomputers
  - Synchronous (e.g., RPC: the sender waits for a reply) vs. asynchronous (the sender does not wait for a reply)
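The synchronous/asynchronous distinction can be sketched with two threads standing in for processors, each owning a mailbox. This is an illustrative model only: the mailbox layout, the doubling "service," and all names are assumptions, not anything from the slides.

```python
import queue
import threading

# Each "processor" owns a mailbox (a queue). An asynchronous send just
# enqueues and continues; a synchronous (RPC-style) exchange blocks on a
# receive until the reply arrives.
mailbox = {"P0": queue.Queue(), "P1": queue.Queue()}

def server():
    # P1 services one request and sends a reply back to the requester.
    sender, payload = mailbox["P1"].get()
    mailbox[sender].put(("P1", payload * 2))

t = threading.Thread(target=server)
t.start()

# Asynchronous send from P0: returns immediately, sender keeps going.
mailbox["P1"].put(("P0", 21))

# Synchronous receive: P0 now blocks until P1's reply arrives (one RPC
# round trip when paired with the send above).
_, result = mailbox["P0"].get()
t.join()
print(result)  # 42
```

The same mailbox primitive thus supports both styles; only whether the sender immediately waits on its own mailbox differs.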
3
Shared-memory vs. message-passing
- An old debate that matters much less nowadays: many systems are built to support a mixture of both paradigms
  - Shared virtual memory (e.g., a network of workstations with coherence at the page level, also called "distributed shared memory")
- Shared-memory pros: ease of programming; good for communicating small items; less O.S. overhead; hardware-based cache coherence
- Message-passing pros: simpler hardware (more scalable); explicit communication (both good and bad); easier for long messages; use of message-passing libraries
4
Caveat about parallel processing
Multiprocessors are used to:
- Speed up computations
- Solve larger problems
Recall Amdahl's law:
- If a fraction s of your program is sequential, speedup is bounded by 1/s
- At best linear speedup (if there is no sequential section)
- Superlinear speedup can occur only because of the larger aggregate amount of caches and memories
Speedup is limited by the communication/computation ratio and by synchronization
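Amdahl's bound is easy to check numerically. A minimal sketch (the function name and sample numbers are my own, not from the slides):

```python
def amdahl_speedup(seq_fraction, n_processors):
    """Upper bound on speedup when a fraction seq_fraction of the work
    is sequential and the rest parallelizes perfectly."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n_processors)

# With 10% sequential code, speedup is capped at 1/0.10 = 10x no matter
# how many processors are added:
print(amdahl_speedup(0.10, 4))     # ~3.08 with 4 processors
print(amdahl_speedup(0.10, 1000))  # ~9.91, approaching the 10x bound
```

Note how quickly the bound bites: even 4 processors already fall well short of a 4x speedup once 10% of the program is sequential.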
5
SMP (Symmetric MultiProcessors) single shared-bus systems
(Figure: processors with caches attached to a shared bus, along with an I/O adapter and interleaved memory.)
6
Cache coherence (controllers snoop on bus transactions)
P2 reads A; P3 reads A. (Figure: P1-P4 with caches on a shared bus to memory; P2 and P3 now hold copies of A.)
7
Cache coherence (cont’d)
Now P4 writes A. Two choices:
- Write through the new value of A, which is snooped by the caches of P2 and P3: write-update (or write-broadcast) protocols
- Invalidate the copies in the caches of P2 and P3: write-invalidate protocols
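The two choices can be contrasted in a toy snoopy-bus model. Everything here is an assumed simplification (caches as dicts mapping address to a (value, valid) pair; no bus arbitration or memory traffic), just to show what each policy does to the other caches' copies.

```python
# Toy model: each cache maps address -> (value, valid). On a write, the
# other caches holding a valid copy either receive the new value
# (write-update / write-broadcast) or have their copy invalidated
# (write-invalidate).
def write(caches, writer, addr, value, policy):
    caches[writer][addr] = (value, True)
    for proc, cache in caches.items():
        if proc != writer and addr in cache and cache[addr][1]:
            if policy == "update":
                cache[addr] = (value, True)            # broadcast new value
            else:
                cache[addr] = (cache[addr][0], False)  # invalidate stale copy

# P2 and P3 have read A (value 5); now P4 writes A = 7.
caches = {"P2": {"A": (5, True)}, "P3": {"A": (5, True)}, "P4": {}}
write(caches, "P4", "A", 7, "invalidate")
print(caches["P2"]["A"])  # (5, False): stale copy marked invalid
```

Under `policy="update"` the same call would instead leave P2 and P3 holding `(7, True)`.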
8
Write-update. (Figure: P4 writes A; the new value is broadcast to the copies in P2 and P3.)
9
Write-invalidate. (Figure: P4 writes A; the copies in P2 and P3 become invalid lines.)
10
Snoopy cache coherence protocols
Associate a state with each cache block (minimally 3):
- Invalid
- Clean (one or more copies are up to date)
- Dirty (modified; exists in only one cache)
A fourth state (and sometimes more) is added for performance purposes
11
State transitions for a given cache block
- Transitions incurred as answers to the processor associated with the cache: read miss, write miss, write on a clean block
- Transitions incurred by snooping on the bus as a result of other processors' actions, e.g.:
  - A read miss by Q might make P's block transition from dirty to clean
  - A write miss by Q might make P's block transition from dirty/clean to invalid (write-invalidate protocol)
12
An example of write-invalidate protocol: the Illinois protocol
States:
- Invalid
- Valid-Exclusive (clean, only copy)
- Shared (clean, possibly other copies)
- Dirty (modified, only copy)
13
Illinois protocol: design decisions
- The Valid-Exclusive state exists to enhance performance: on a write to a block in V-E state, there is no need to send an invalidation message (this occurs often for private variables)
- On a read miss when no cache has the block in the dirty state:
  - Who sends the data, memory or a cache (if any)? Answer: a cache, in this particular protocol; other protocols might use memory
  - If more than one cache has the block, which one sends it? Answer: the first to grab the bus (tri-state devices)
14
Illinois protocol: state diagram
(State diagram with states Inv., V.E., Sh., and Dirty. Processor-induced transitions: a read miss serviced from memory goes to V.E.; a read miss serviced from another cache goes to Sh.; a write hit moves V.E. or Sh. to Dirty; a write miss moves Inv. to Dirty; read hits leave the state unchanged. Bus-induced transitions: a bus write miss moves any state to Inv.; a bus read miss moves V.E. or Dirty to Sh.)
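The per-block transitions of an Illinois-style (MESI-like) protocol can be sketched as a small state function. This is a simplified model under stated assumptions: the event names and the `others_have_copy` flag are mine, and bus arbitration, write-backs, and who supplies the data are all omitted.

```python
# States: "I" invalid, "VE" valid-exclusive, "S" shared, "D" dirty.
def next_state(state, event, others_have_copy=False):
    if event == "proc_read":
        if state == "I":
            # Read miss: V-E if no other cache has the block, else Shared.
            return "S" if others_have_copy else "VE"
        return state                # read hit: no change
    if event == "proc_write":
        # V-E -> Dirty needs no bus message; from S or I the other
        # copies must first be invalidated on the bus.
        return "D"
    if event == "bus_read_miss":
        # Another processor reads: exclusive copies degrade to Shared.
        return "S" if state in ("VE", "D") else state
    if event == "bus_write_miss":
        return "I"                  # another processor writes: invalidate
    raise ValueError(f"unknown event: {event}")

s = next_state("I", "proc_read")    # -> "VE": only copy, clean
s = next_state(s, "proc_write")     # -> "D": silent upgrade, no bus traffic
s = next_state(s, "bus_read_miss")  # -> "S": another processor read the block
print(s)  # S
```

The `I -> VE -> D` path is exactly the private-variable case the slide highlights: the write hit in V-E reaches Dirty without any invalidation message.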
15
An example of sophisticated write-update protocol
- The Dragon protocol: 4 states plus a "shared" bus control line
- No invalid state: either memory or some cache block is in the correct state
- States:
  - Valid-Exclusive (only copy in the caches, not modified)
  - Shared-Clean (several clean copies in the caches)
  - Shared-Dirty (write-back required at replacement)
  - Dirty (single modified copy in the caches)
- On a write, the data is sent to the caches that have a valid copy of the block; they must raise the "shared" control line to indicate they are alive. If none does, go to state Dirty.
- On a miss, the shared line is raised by any cache that has a copy
16
Dragon protocol (writes are updates)
(State diagram with states V-E, S-C, S-D, and Dirty. Transitions are driven by read/write hits and misses combined with the level of the "shared" line: e.g., a write with the shared line low leads to Dirty, while a write with the shared line high leads to S-D; bus read and write misses move the exclusive states toward the shared ones.)
17
Performance of snoopy protocols
- Performance depends on the length of a write run
- Write run: a sequence of write references by one processor to a shared address (or shared block), uninterrupted by either an access from another processor or a replacement
  - Long write runs: better to have write-invalidate
  - Short write runs: better to have write-update
- There have been proposals to choose between the protocols at run time (competitive algorithms)
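A back-of-the-envelope bus-traffic model shows why run length matters. The cost model here is an assumption (one bus transaction per update or invalidation, and an assumed `miss_cost` for the block transfer when the other processor re-reads after the run ends), not figures from the slides.

```python
def update_cost(k):
    # Write-update: every one of the k writes in the run is broadcast,
    # but the other processor's copy stays valid, so it re-reads for free.
    return k

def invalidate_cost(k, miss_cost=4):
    # Write-invalidate: one invalidation, then k-1 local writes to the
    # now-private dirty copy, then the other processor's read misses and
    # transfers the whole block. Independent of k.
    return 1 + miss_cost

for k in (1, 2, 8):
    print(k, update_cost(k), invalidate_cost(k))
```

With these assumed costs, write-update wins for short runs (k = 1 costs 1 vs. 5) and write-invalidate wins once the run exceeds `1 + miss_cost` writes, which is the intuition behind choosing between them, possibly at run time, with competitive algorithms.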
18
What about cache hierarchies?
- Implement the snoopy protocol at the L2 (board-level) cache
- Impose the multilevel inclusion property
- Encode in L2 whether the block (or part of it) is also in L1
- Disturb L1 on bus transactions from other processors only if the data is there
- Total inclusion might be expensive; use partial inclusion instead (i.e., accept the possibility of slightly over-invalidating L1)