CSE 586 Computer Architecture Lecture 8


CSE 586 Computer Architecture Lecture 8 Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp CSE 586 Spring 00

Highlights from last week: hardware-software interactions for paging systems; TLBs; page faults (detection and termination); choice of a page size (or several); virtually addressed caches and synonyms; protection; I/O and caches (software and hardware solutions).

Highlights from last week (c'ed), I/O: I/O architecture (CPU-memory and I/O buses); disks (access time components); buses (arbitration, transactions, split-transactions); the I/O hardware-software interface; DMA; disk arrays.

Multiprocessors - Flynn's Taxonomy (1966). Single Instruction stream, Single Data stream (SISD): the conventional uniprocessor; although ILP is exploited, a single program counter means a single instruction stream, and the data is not "streaming". Single Instruction stream, Multiple Data stream (SIMD): popular for some applications like image processing; one can construe vector processors to be of the SIMD type; MMX extensions to the ISA reflect the SIMD philosophy, also apparent in "multimedia" processors (Equator Map-1000); the "data parallel" programming paradigm.

Flynn's Taxonomy (c'ed). Multiple Instruction stream, Single Data stream (MISD): don't know of any. Multiple Instruction stream, Multiple Data stream (MIMD): the most general; covers shared-memory multiprocessors, message-passing multicomputers (including networks of workstations cooperating on the same problem), and fine-grained multithreaded processors (several PCs)?

Shared-memory Multiprocessors. Shared memory = a single shared address space (an extension of the uniprocessor model; communication via load/store). Uniform Memory Access (UMA): today, almost exclusively shared-bus systems; the basis for SMPs (Symmetric MultiProcessing); cache coherence enforced by "snoopy" protocols; they form the basis for clusters (but in clusters, access to the memory of other clusters is not UMA).

Shared-memory Multiprocessors (c'ed). Non-Uniform Memory Access (NUMA): NUMA-CC, cache coherent (directory-based protocols or SCI); NUMA without cache coherence (enforced by software); COMA (Cache-Only Memory Architecture); clusters. Distributed Shared Memory (DSM): most often a network of workstations; the shared address space is the virtual address space; the O.S. enforces coherence on a page-per-page basis.

Message-passing Systems. Processors communicate by messages; the primitives are of the form "send" and "receive"; the user (programmer) has to insert the messages; message-passing libraries help (e.g., MPI, PVM). Communication can be synchronous (the sender must wait for an ack from the receiver, e.g., in RPC) or asynchronous (the sender does not wait for a reply to continue).
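The send/receive style above can be sketched with Python threads standing in for processors and a queue standing in for the interconnect; the `send`/`receive` helpers and the message format are illustrative, not part of any real message-passing library:

```python
from queue import Queue
from threading import Thread

# One "channel" between two nodes; each node runs in its own thread and
# communicates only through explicit messages.
channel = Queue()

def send(q, msg):    # asynchronous send: does not wait for a reply
    q.put(msg)

def receive(q):      # blocking receive
    return q.get()

def producer():
    send(channel, {"tag": "work", "payload": [1, 2, 3]})

def consumer(out):
    msg = receive(channel)
    out.append(sum(msg["payload"]))

result = []
t1 = Thread(target=producer)
t2 = Thread(target=consumer, args=(result,))
t2.start(); t1.start()
t1.join(); t2.join()
print(result[0])   # 6
```

Note that the consumer blocks in `receive` until the producer's message arrives, which is exactly the synchronization that shared-memory programs get implicitly from loads and stores.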

Shared-memory vs. Message-passing. An old debate that is no longer as important. Many systems are built to support a mixture of both paradigms: "send/receive" can be supported by the O.S. in shared-memory systems, and "load/store" in a virtual address space can be used in a message-passing system (the message-passing library can use "small" messages to that effect, e.g., passing a pointer to a memory area in another computer). Does a network of workstations with a page as the unit of coherence follow the shared-memory paradigm or the message-passing paradigm?

The Pros and Cons. Shared-memory pros: ease of programming (SPMD: Single Program Multiple Data paradigm); good for communication of small items; less O.S. overhead; hardware-based cache coherence. Message-passing pros: simpler hardware (more scalable); explicit communication (both good and bad; some programming languages have primitives for it), easier for long messages; use of message-passing libraries.

Caveat about Parallel Processing. Multiprocessors are used to speed up computations and to solve larger problems. Speedup = time to execute on 1 processor / time to execute on N processors. Speedup is limited by the communication/computation ratio and by synchronization. Efficiency = speedup / number of processors.
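The two definitions above are plain ratios; a minimal helper (the function names and the sample measurements are mine, not from the lecture):

```python
def speedup(t1, tn):
    """Time on 1 processor divided by time on N processors."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Speedup divided by the number of processors."""
    return speedup(t1, tn) / n

# Hypothetical measurements: 120 s sequential, 20 s on 8 processors.
print(speedup(120, 20))         # 6.0
print(efficiency(120, 20, 8))   # 0.75
```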

Amdahl's Law for Parallel Processing. Recall Amdahl's law: if a fraction s of your program is sequential, speedup is bounded by 1/s. At best linear speedup (if no sequential section). What about superlinear speedup? Theoretically impossible; it "occurs" because adding a processor might mean adding more overall memory and caching (e.g., fewer page faults!). One has to be careful about the sequential fraction s: it might become lower if the data set increases. Speedup and efficiency should have the number of processors and the size of the input set as parameters.
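Writing the bound out: with sequential fraction s and N processors, speedup = 1 / (s + (1 - s)/N), which approaches 1/s as N grows. A small sketch:

```python
def amdahl_speedup(seq_fraction, n):
    """Speedup with a fraction `seq_fraction` of sequential work on n processors."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n)

# 5% sequential code: speedup on 16 processors, and the asymptotic cap of 1/0.05 = 20.
print(round(amdahl_speedup(0.05, 16), 2))     # 9.14
print(round(amdahl_speedup(0.05, 10**9), 2))  # 20.0
```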

SMP (Symmetric MultiProcessors, aka Multis): single shared-bus systems. [Figure: processors with private caches connected by a shared bus to an I/O adapter and interleaved memory.]

Cache Coherence (controllers snoop on bus transactions). Initial state: P2 reads A; P3 reads A. [Figure: P1-P4 and memory on a shared bus; copies of A now reside in P2's cache, P3's cache, and memory.]

Cache coherence (cont'd). Now P2 wants to write A. Two choices: broadcast the new value of A on the bus, where it is snooped by the cache of P3: a write-update (or write-broadcast) protocol (resembles write-through). Or broadcast an invalidation message with the address of A; the address is snooped by the cache of P3, which invalidates its copy of A: a write-invalidate protocol. Note that in this case the copy in memory is no longer up to date (resembles write-back). If instead of P2 wanting to write A we had a write miss in P4 for A, the same two choices of protocol apply.

Write-update. [Figure: after the update, all valid copies of A (in the caches and in memory) hold the new value A'.]

Write-invalidate. [Figure: after P2's write, P2's cache holds the only valid copy A'; the other cached copies of A are marked invalid, and memory still holds the stale value A.]

Coherence and Consistency. A memory system is coherent if: (1) P writes A (the value becomes Ap), no other writes occur, P reads A: P gets Ap; (2) Q writes A (it becomes Aq), no other writes occur, P reads A (with some delay): P gets Aq; (3) P writes A (Ap) and then (or at the same time) Q writes A (Aq); if R reads A as Aq it should never see A as Ap from then on (write serialization), or in another formulation, if R, S, and T read A after Q writes A, they should all see the same value of A (either all see Ap or all see Aq). A memory system can have different models of memory consistency, i.e., of the time when a value written by some P is seen by Q. Sequential consistency is the model seen by the programmer (as if the programs running on the multiprocessor were running, interleaved, on a single processor). There are many variations of "relaxed" consistency (think, e.g., about write buffers).
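Sequential consistency can be made concrete with the classic two-processor litmus test. This sketch (my own construction, not from the lecture) enumerates every legal interleaving of P1: x=1; r1=y and P2: y=1; r2=x, starting from x = y = 0, and shows that (r1, r2) = (0, 0) never occurs under SC:

```python
from itertools import permutations

P1 = [("write", "x"), ("read", "y", "r1")]
P2 = [("write", "y"), ("read", "x", "r2")]

def outcomes(p1, p2):
    results = set()
    # Enumerate all interleavings that preserve each program's own order:
    # a tuple like (0, 1, 0, 1) means P1-op, P2-op, P1-op, P2-op.
    for order in set(permutations([0, 0, 1, 1])):
        mem = {"x": 0, "y": 0}
        regs = {}
        idx = [0, 0]
        for who in order:
            op = (p1, p2)[who][idx[who]]
            idx[who] += 1
            if op[0] == "write":
                mem[op[1]] = 1
            else:
                regs[op[2]] = mem[op[1]]
        results.add((regs["r1"], regs["r2"]))
    return results

print(sorted(outcomes(P1, P2)))   # [(0, 1), (1, 0), (1, 1)], never (0, 0)
```

A relaxed model with write buffers can delay the stores past the loads, which is exactly what makes the forbidden (0, 0) outcome observable on real machines.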

Snoopy Cache Coherence Protocols. Associate states with each cache block, for example: Invalid; Clean (one or more copies are up to date); Dirty (modified; exists in only one cache). A fourth state (and sometimes more) can be added for performance purposes.

State Transitions for a Given Cache Block. Those incurred in response to the processor associated with the cache: read miss, write miss, write on a clean block. Those incurred by snooping on the bus as a result of other processors' actions, e.g.: a read miss by Q might make P's block transition from dirty to clean; a write miss by Q might make P's block transition from dirty/clean to invalid (write-invalidate protocol).

Basic Write-invalidate Protocol (write-through, write-no-allocate caches). Needs only two states: Valid and Invalid. When a processor writes a block in the Valid state, it sends an invalidation message to all other processors. Not interesting in practice, because most L2 caches are write-back, write-allocate to alleviate bus contention.

Basic Write-invalidate Protocol (write-back, write-allocate caches). Needs 3 states associated with each cache block: Invalid; Clean (read only, can be shared), also called Shared; Dirty (only valid copy in the system), also called Modified. Need to decompose the state transitions into those induced by the processor attached to the cache and those induced by snooping on the bus.

Basic 3 State Protocol: Processor Actions. On a read miss or a write miss, the data might come from memory or from another cache. Read and write misses send corresponding transactions on the bus, and a write hit on a Clean block also sends a transaction on the bus. Transitions from the Invalid state won't be shown in forthcoming figures. [State diagram: Inv. to Clean on a read miss; Inv. to Dirty on a write miss; Clean to Dirty on a write hit; a read hit leaves the block Clean; read/write hits leave it Dirty.]

Basic 3 State Protocol: Transitions from Bus Snooping. [State diagram: Clean to Inv. on a bus write; Dirty to Inv. on a bus write; Dirty to Clean on a bus read.]
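The two halves of the protocol can be folded into a toy simulation of one shared block, with one state letter per cache ('I' Invalid, 'C' Clean, 'D' Dirty); the transitions follow the two slides above, but the code layout itself is illustrative:

```python
def access(states, who, op):
    """Apply a processor read/write plus the matching bus-induced transitions."""
    if op == "read":
        if states[who] == "I":                  # read miss: bus read transaction
            for other in states:
                if other != who and states[other] == "D":
                    states[other] = "C"         # dirty copy supplies data, goes Clean
            states[who] = "C"
        # read hit: no state change
    else:                                       # write (hit on Clean or miss): bus write
        for other in states:
            if other != who:
                states[other] = "I"             # invalidate all other copies
        states[who] = "D"

caches = {"P1": "I", "P2": "I", "P3": "I"}
access(caches, "P2", "read")
access(caches, "P3", "read")
print(caches)    # {'P1': 'I', 'P2': 'C', 'P3': 'C'}
access(caches, "P2", "write")
print(caches)    # {'P1': 'I', 'P2': 'D', 'P3': 'I'}
```

The trace reproduces the running example: P2 and P3 read A (both Clean), then P2's write invalidates P3's copy.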

An Example of Write-invalidate Protocol: the Illinois Protocol. States: Invalid (aka Invalid); Valid-Exclusive (clean, only copy, aka Exclusive); Shared (clean, possibly other copies, aka Shared); Dirty (modified, only copy, aka Modified). In the MOESI notation, this is a MESI protocol (the O stands for Ownership).

Illinois Protocol: Design Decisions. The Valid-Exclusive state is there to enhance performance: on a write to a block in V-E state, there is no need to send an invalidation message (this occurs often for private variables). On a read miss with no cache having the block in the dirty state, who sends the data: memory or a cache (if any)? Answer: a cache, for this particular protocol; other protocols might use the memory. If more than one cache has a copy, which one sends it? Answer: the first to grab the bus (tri-state devices).
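The point of Valid-Exclusive shows up in a toy MESI sketch (states 'I', 'E', 'S', 'M'): the returned `bus_traffic` flag marks which accesses generate bus transactions, and the E-to-M upgrade generates none. The code structure is my own illustration of the states named above:

```python
def access(states, who, op):
    """One access to a single shared block; returns True if it used the bus."""
    bus_traffic = False
    if op == "read":
        if states[who] == "I":                 # read miss: bus read
            bus_traffic = True
            others = [p for p in states if p != who and states[p] != "I"]
            for p in others:
                states[p] = "S"                # any E or M copy downgrades to Shared
            states[who] = "E" if not others else "S"
        # read hit: no change
    else:                                      # write
        if states[who] == "M":
            pass                               # write hit in M: silent
        elif states[who] == "E":
            states[who] = "M"                  # silent upgrade: no invalidation needed
        else:                                  # S or I: must invalidate other copies
            bus_traffic = True
            for p in states:
                if p != who:
                    states[p] = "I"
            states[who] = "M"
    return bus_traffic

c = {"P2": "I", "P3": "I"}
access(c, "P2", "read")
print(c)                           # {'P2': 'E', 'P3': 'I'}  (only copy: V-E)
print(access(c, "P2", "write"))    # False: the E -> M write needs no bus transaction
print(c)                           # {'P2': 'M', 'P3': 'I'}
```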

Illinois Protocol: State Diagram. [State diagram over Inv., V.E., Sh., and Dirty, separating processor-induced transitions (read miss served from memory to V.E.; read miss served from another cache to Sh.; write hit and write miss to Dirty; read hits leave the state unchanged) from bus-induced transitions (a bus write miss sends any valid state to Inv.; a bus read miss sends V.E. and Dirty to Sh.).]

Example: P2 reads A (A only in memory). No other cache has a copy, so P2 loads A in state V.E. (on the state diagram above).

Example: P3 reads A (A comes from P2). Both P2 and P3 will have A in state Sh.

Example: P4 writes A (A comes from P2). P2 and P3 will have A in state Inv.; P4 will be in state Dirty.

A Sophisticated Write-update Protocol. The Dragon protocol: 4 states plus a "shared" bus control line. There is no Invalid state: either memory or a cache block is in the correct state. Valid-Exclusive (only copy in a cache, but not modified); Shared-Dirty (write-back required at replacement; a single block can be in that state); Shared-Clean (several copies in caches, coherent with each other but possibly modified with respect to memory); Dirty (single modified copy in the caches). On a write, the data is sent to the caches that have a valid copy of the block; they must raise the "shared" control line to indicate they are alive. If none are alive, go to state Dirty. On a miss, the shared line is raised by any cache that has a copy.

Dragon Protocol (writes are updates). [State diagram over V-E, S-C, S-D, and Dirty: read and write misses consult the shared line (low: no other copy, go to the exclusive states V-E or Dirty; high: other copies exist, go to the shared states S-C or S-D); write hits in the shared states update the other copies and move between S-C, S-D, and Dirty according to the shared line; bus read and write misses induce the remaining transitions.]

Example: P2 reads A (A only in memory). No cache raises the shared line, so P2 loads A in state V-E.

Example: P3 reads A (A comes from P2). P2 raises the shared line and supplies the data; both P2 and P3 end with A in state S-C.

Example: P4 writes A (A comes from P2). P2 will transmit the data, raise the shared line, and stay in S-C; P3 will also stay in S-C; P4 ends in S-D and has the responsibility to write A back at replacement.

Cache Parameters for Multiprocessors. In addition to the 3 C's types of misses, add a 4th C: coherence misses. As cache sizes increase, the misses due to the 3 C's decrease but coherence misses increase. Shared data has been shown to have less spatial locality than private data; hence large block sizes can be detrimental. Large block sizes also induce more false sharing: P1 writes the first part of line A and P2 writes the second part; from the coherence protocol's viewpoint, both look like "write A".
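A hypothetical trace-analysis helper makes the definition concrete: a block counts as falsely shared when several processors write into it but never write the same word. The function, the trace, and the classification rule are all illustrative:

```python
def false_sharing(writes, block_size):
    """writes: list of (processor, byte_address); returns falsely shared block ids."""
    by_block = {}
    for proc, addr in writes:
        by_block.setdefault(addr // block_size, {}) \
                .setdefault(addr, set()).add(proc)
    falsely = []
    for block, words in by_block.items():
        procs = set().union(*words.values())
        truly = any(len(ps) > 1 for ps in words.values())   # same word, two procs
        if len(procs) > 1 and not truly:
            falsely.append(block)
    return sorted(falsely)

# P1 writes bytes 0-3 of a 64-byte line 0, P2 writes byte 32 of the same line
# (false sharing); both write the same word of line 2 (true sharing).
trace = [("P1", 0), ("P2", 32), ("P1", 128), ("P2", 128)]
print(false_sharing(trace, 64))   # [0]
```

Doubling the block size in such an analysis typically turns more distinct-word write pairs into falsely shared lines, which is the effect the slide warns about.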

Performance of Snoopy Protocols. Protocol performance depends on the length of a write run. Write run: a sequence of write references by one processor to a shared address (or shared block), uninterrupted by either an access by another processor or a replacement. For long write runs it is better to have write-invalidate; for short write runs it is better to have write-update. There have been proposals to make the choice between protocols at run time (competitive algorithms).
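Write-run lengths can be computed directly from a reference trace under the slide's definition (a run ends on any access by another processor; replacements are ignored here). The helper and the sample trace are my own sketch:

```python
def write_runs(trace):
    """trace: list of (processor, 'read'|'write') to one shared block.
    Returns the lengths of the write runs, in order."""
    runs, owner, length = [], None, 0
    for proc, op in trace:
        if op == "write":
            if proc == owner:
                length += 1                       # run continues
            else:
                if length:
                    runs.append(length)           # another processor's write ends it
                owner, length = proc, 1
        elif proc != owner and length:            # another processor's read ends it
            runs.append(length)
            owner, length = None, 0
    if length:
        runs.append(length)
    return runs

trace = [("P1", "write"), ("P1", "write"), ("P2", "read"),
         ("P1", "write"), ("P2", "write"), ("P2", "write"), ("P2", "write")]
print(write_runs(trace))   # [2, 1, 3]
```

With mostly long runs, invalidating once and writing locally beats broadcasting every update; with runs of length 1, update protocols avoid the ping-ponging invalidation misses.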

What About Cache Hierarchies? Implement the snoopy protocol at the L2 (board-level) cache and impose the multilevel inclusion property. Encode in L2 whether the block (or part of it, if blocks in L2 are longer than blocks in L1) is in L1 (1 bit per block or subblock); disrupt L1 on bus transactions from other processors only if the data is there, i.e., L2 shields L1 from unnecessary checks. Total inclusion might be expensive (need for large associativity) if several L1's share a common L2 (as in clusters); instead, use partial inclusion (i.e., accept the possibility of slightly over-invalidating L1).

Interconnection Networks for Multiprocessors. Buses have limitations for scalability: physical (the number of devices that can be attached) and performance (contention on a shared resource: the bus). Instead, use interconnection networks to form: tightly coupled systems, where the nodes (processor and memory elements) will most likely be homogeneous, operated as a whole under the same operating system, and physically close to each other (a few meters); Local Area Networks (LANs), building-size networks of workstations (in fact the interconnect could be a bus: Ethernet); Wide Area Networks (WANs, long-haul networks), which connect computers and LANs distributed around the world.

Switches in the Interconnection Network. Centralized (multistage) switch: all nodes connected to the central switch, or all nodes share the same medium (bus); there is a single path from one node to another (although some redundant paths could be added for fault tolerance). Distributed switch: one switch associated with each node. And of course, hierarchical combinations.

Multiprocessor with Centralized Switch. [Figure: several processors and memory modules, all connected through a single interconnection network.]

Multiprocessor with Decentralized Switches. [Figure: each processor-memory node has its own switch; the switches are connected to one another.]

Multistage Switch Topology (centralized). Shared bus (simple, one stage, but not scalable). Hierarchy of buses (often proposed, never commercially implemented). Crossbar (full connection): gives the most parallelism; cost (number of switches) grows as the square of the number of processors. Multistage interconnection networks: based on the perfect shuffle; cost grows as O(n log n). Fat tree.
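The cost claims can be checked numerically, assuming the usual cost models: n*n crosspoints for a crossbar and (n/2)*log2(n) 2x2 switches for an Omega network:

```python
import math

def crossbar_cost(n):
    return n * n                            # n x n crosspoint switches

def omega_cost(n):
    return (n // 2) * int(math.log2(n))     # log2(n) stages of n/2 2x2 switches

for n in (8, 64, 1024):
    print(n, crossbar_cost(n), omega_cost(n))
# 8 64 12
# 64 4096 192
# 1024 1048576 5120
```

At 1024 nodes the crossbar needs over a million crosspoints against about five thousand small switches, which is why multistage networks were the scalable choice.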

Crossbar. PE: processing element = Proc + cache + memory. [Figure: 8 PEs (0-7) interconnected through an 8x8 grid of switches: n² switches, complete concurrency.]

Perfect Shuffle and Omega Network. The perfect shuffle is one stage of the interconnection network. With a power-of-2 number of processors (i.e., an n-bit id), the shuffle permutes the nodes the way a perfect riffle shuffle interleaves the two halves of a deck of cards; equivalently, node i is connected to the node whose id is the left rotation of i's binary representation. Put a switch that can either go straight through or exchange between each pair of adjacent nodes; any node can then reach any node after log2 n trips through the shuffle. The Omega network (and butterfly networks) for n nodes uses log n stages of n/2 2x2 switches. The setting of the switches is done by looking at the destination address. Not all permutations can be done in one pass through the network (this was important for SIMD, less so for MIMD).

Omega Network for n = 8 (k = 3). To go from node i to node j, follow the binary representation of j: at stage k, check the kth bit of j; go up if the current bit = 0 and go down if the bit = 1. Example path: node 3 to node 6 (110). [Figure: 8 nodes connected through 3 stages of 2x2 switches.]
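The destination-tag rule above is a one-liner: scan the destination's bits MSB-first and pick the upper or lower switch output accordingly (the 'up'/'down' naming is mine):

```python
def omega_route(dst, stages=3):
    """Switch settings, one per stage, to reach node dst in a 2**stages-node network."""
    return ["up" if (dst >> (stages - 1 - s)) & 1 == 0 else "down"
            for s in range(stages)]

# The slide's example: any source (here node 3) reaching node 6 = 110.
print(omega_route(6))   # ['down', 'down', 'up']
```

Note that the settings depend only on the destination, not the source, which is what makes the routing oblivious and cheap to implement in the switches.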

Butterfly Network for n = 8 (k = 3). [Figure: 8 nodes connected through 3 stages of 2x2 switches wired in the FFT pattern.]

Multistage Interconnection Networks. Omega networks (and equivalent): possibility of blocking (two paths want to go through the same switch); possibility of combining (two messages pass by the same switch, for the same destination, at the same time); buffering in the switches (cf. the routing slide later on); possibility of adding extra stages for fault tolerance. One can also make the switches bigger, e.g., 4x4, 4x8, etc.

Fat Tree (used in the CM-5 and IBM SP-2). Increase bandwidth when closer to the root. To construct a fat tree, take a butterfly network, connect it to itself back to back, and fold it along the highest dimension; links are now bidirectional. Allow more than one path (e.g., each switch has 4 connections backwards and 2 upwards; cf. H&P p. 585). Cf. PMP class on parallel processing?

Decentralized Switch. Rings (and hierarchies of rings): used in the KSR. Bus + ring: Sequent CC-NUMA. 2D and 3D meshes and tori: Intel Paragon, 2D (message co-processor); Cray T3D and T3E, 3D torus (shared memory without cache coherence); Tera, 3D torus (shared memory, no cache). Hypercubes: CM-2 (12-cube; each node had 16 1-bit processors; SIMD); Intel iPSC (7-cube in the maximum configuration; message passing).

Topologies. [Figure: a ring, a 2-D mesh, and a hypercube (d = 3).]

Performance Metrics. A message consists of: a header (routing info and control); a payload (the contents of the message); a trailer (checksum). Bandwidth: the maximum rate at which the network can propagate information once the message enters the network. Bisection bandwidth: divide the network roughly into 2 equal parts and sum the bandwidth of the lines that cross the imaginary dividing line.

Performance Metrics (c'ed). Transmission time (no contention): the time for the message to pass through the network = size of message / bandwidth. Time of flight: the time for the first bit to arrive at the receiver. Transport latency = transmission time + time of flight. Sender overhead: the time for the processor to inject the message. Receiver overhead: the time for the receiver to pull the message. Total latency = sender overhead + transport latency + receiver overhead.
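Putting the components together (all numbers below are made up for illustration):

```python
def total_latency(msg_bytes, bandwidth_bps, time_of_flight_s,
                  sender_ovh_s, receiver_ovh_s):
    """Total latency = sender overhead + transport latency + receiver overhead,
    where transport latency = transmission time + time of flight."""
    transmission = msg_bytes * 8 / bandwidth_bps     # size of message / bandwidth
    transport = transmission + time_of_flight_s
    return sender_ovh_s + transport + receiver_ovh_s

# 1 KB message, 1 Gb/s link, 1 us time of flight, 2 us send + 3 us receive overhead:
t = total_latency(1024, 1e9, 1e-6, 2e-6, 3e-6)
print(round(t * 1e6, 3))   # 14.192 (microseconds)
```

Note that for small messages the fixed overheads dominate, which is one reason the shared-memory side of the earlier pros-and-cons slide wins on small items.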

Routing (in interconnection networks). Destination-based routing, oblivious: always follows the same path (deterministic); for example, follow the highest dimension of the hypercube first, then the next one, etc. Destination-based routing, adaptive: adapts to congestion in the network. It can be minimal, i.e., allow only paths of (topologically) minimal path length, or non-minimal (e.g., using random path selection or "hot potato" routing, while other routers might choose paths based on the address).

Flow Control. Entire messages vs. packets. Circuit-switched: the entire path is reserved. Packet-switched, or store-and-forward: links (hops) are acquired and released dynamically. Wormhole routing (circuit-switched with virtual channels): the head of the message "reserves" the path, and data are transmitted in flits, i.e., the amount of data that can be transmitted over a single channel; virtual channels add buffering to allow priorities, etc. Virtual cut-through: store-and-forward, but the whole packet does not need to be buffered before it proceeds.
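A first-order latency comparison of the two extremes, under the usual simplifying assumptions (uniform link bandwidth, no contention; this cost model is a textbook approximation, not from the lecture): store-and-forward pays the full packet time at every hop, while wormhole/cut-through pipelines the packet and pays the per-hop cost only on the header flit.

```python
def store_and_forward(packet, header, bandwidth, hops):
    # whole packet buffered at each hop (header included in packet size)
    return hops * (packet / bandwidth)

def cut_through(packet, header, bandwidth, hops):
    # only the header pays the per-hop cost; the body is pipelined behind it
    return hops * (header / bandwidth) + packet / bandwidth

# 1024-byte packet, 8-byte header flit, 1 byte/cycle links, 10 hops:
print(store_and_forward(1024, 8, 1, 10))   # 10240.0 cycles
print(cut_through(1024, 8, 1, 10))         # 1104.0 cycles
```

The gap grows with the hop count, which is why wormhole routing became standard in multi-hop meshes and tori.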