CS162 Computer Architecture Lecture 16: Multiprocessor 2: Directory Protocol, Interconnection Networks (CS252/Patterson slides, 2/28/01)

Larger MPs

Separate memory per processor; local or remote access via the memory controller
One cache-coherency solution: non-cached (uncacheable) pages
Alternative: a directory that tracks the state of every block in every cache
– which caches have copies of the block, dirty vs. clean, ...
Info per memory block vs. per cache block?
– PLUS: in memory => simpler protocol (centralized/one location)
– MINUS: in memory => directory size is f(memory size) vs. f(cache size)
How to prevent the directory from becoming a bottleneck? Distribute the directory entries with the memory, each node keeping track of which processors have copies of its blocks (one possible mapping is sketched below)
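As a minimal sketch (not from the original slides) of how distributing the directory with memory avoids a single bottleneck, one simple scheme interleaves blocks across nodes; BLOCK_SIZE and NUM_NODES are illustrative parameters:

```python
# Hypothetical parameters, chosen for illustration only.
BLOCK_SIZE = 64    # cache-block size in bytes (assumed)
NUM_NODES = 64     # number of processor/memory nodes (assumed)

def home_node(address: int) -> int:
    """Map a physical address to the node that holds its memory and
    its directory entry, using simple block interleaving."""
    block_number = address // BLOCK_SIZE
    return block_number % NUM_NODES

# Every request for `address` goes to home_node(address); each node
# serves only its own slice of the directory, so no single directory
# becomes a bottleneck.
```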

Distributed Directory MPs

Network Examples

Bidirectional ring – e.g., HP V-Class
2-D mesh and hypercube – SGI Origin and Cray T3E
Crossbar and Omega network – SMPs, IBM SP3, and IP routers
Clusters using Ethernet, Gigabit Ethernet, Myrinet, etc.
Properties of the various networks are discussed later

CC-NUMA Multiprocessor: Directory Protocol

What is Cache-Coherent Non-Uniform Memory Access (CC-NUMA)?
Similar to the snoopy protocol: three states
– Shared: ≥ 1 processors have the data; memory is up-to-date
– Uncached: no processor has it; not valid in any cache
– Exclusive: 1 processor (the owner) has the data; memory is out-of-date
In addition to the cache state, must track which processors have the data when in the Shared state (usually a bit vector, with bit p set to 1 if processor p has a copy; see the sketch below)
Directory size: big => limited-directory schemes (not discussed here)
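A minimal sketch of one directory entry, assuming the full bit-vector representation described above (names and layout are illustrative, not any particular machine's):

```python
from enum import Enum

class DirState(Enum):
    UNCACHED = 0     # no cache holds the block; memory is current
    SHARED = 1       # >= 1 caches hold clean copies; memory is current
    EXCLUSIVE = 2    # exactly one owner holds a (possibly dirty) copy

class DirectoryEntry:
    """Block state plus one presence bit per processor."""
    def __init__(self, num_procs: int):
        self.num_procs = num_procs
        self.state = DirState.UNCACHED
        self.sharers = 0                 # bit p set => processor p has a copy

    def add_sharer(self, p: int) -> None:
        self.sharers |= 1 << p

    def sharer_list(self) -> list:
        return [p for p in range(self.num_procs) if (self.sharers >> p) & 1]
```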

Directory Protocol

No bus, and we don't want to broadcast:
– the interconnect is no longer a single arbitration point
– all messages have explicit responses
Terms: typically three processors are involved
– Local node: where the request originates
– Home node: where the memory location of the address resides
– Remote node: has a copy of the cache block, whether exclusive or shared
Example messages (P = processor number, A = address) are sketched below
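The message vocabulary that the next two slides walk through can be sketched as follows; the names are illustrative (patterned on the Hennessy & Patterson-style directory protocol, not a specific implementation):

```python
from dataclasses import dataclass
from enum import Enum, auto

class Msg(Enum):
    READ_MISS = auto()         # local -> home: processor P wants to read A
    WRITE_MISS = auto()        # local -> home: processor P wants to write A
    INVALIDATE = auto()        # home -> remote: invalidate your copy of A
    FETCH = auto()             # home -> owner: send A home, go to Shared
    FETCH_INVALIDATE = auto()  # home -> owner: send A home, then invalidate
    DATA_VALUE_REPLY = auto()  # home -> local: here is the data for A
    DATA_WRITE_BACK = auto()   # remote -> home: writing back a dirty A

@dataclass
class Message:
    kind: Msg
    p: int    # processor number
    a: int    # block address
```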

Example Directory Protocol

A message sent to the directory causes two actions:
– update the directory
– send further messages to satisfy the request
Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are:
– Read miss: the requesting processor is sent the data from memory and becomes the only sharing node; the state of the block is made Shared.
– Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
Block is Shared => the memory value is up-to-date:
– Read miss: the requesting processor is sent the data from memory and is added to the sharing set.
– Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.
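As a hedged sketch of the two cases above (reusing the DirectoryEntry and Message types from the earlier sketches; the actual data movement is elided and only noted in comments):

```python
def handle_uncached_or_shared(entry: DirectoryEntry, msg: Message) -> list:
    """Directory transitions for the Uncached and Shared cases."""
    out = []
    if entry.state == DirState.UNCACHED:
        # Memory is current: reply with data; requester is sole sharer/owner.
        out.append(Message(Msg.DATA_VALUE_REPLY, msg.p, msg.a))
        entry.sharers = 1 << msg.p
        entry.state = (DirState.SHARED if msg.kind == Msg.READ_MISS
                       else DirState.EXCLUSIVE)
    elif entry.state == DirState.SHARED:
        if msg.kind == Msg.READ_MISS:
            out.append(Message(Msg.DATA_VALUE_REPLY, msg.p, msg.a))
            entry.add_sharer(msg.p)           # join the sharing set
        elif msg.kind == Msg.WRITE_MISS:
            for p in entry.sharer_list():     # invalidate all other sharers
                if p != msg.p:
                    out.append(Message(Msg.INVALIDATE, p, msg.a))
            out.append(Message(Msg.DATA_VALUE_REPLY, msg.p, msg.a))
            entry.sharers = 1 << msg.p
            entry.state = DirState.EXCLUSIVE
    return out
```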

Example Directory Protocol

Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
– Read miss: the owner is sent a data-fetch message, causing the block in the owner's cache to transition to Shared and the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the former owner (since it still has a readable copy). The state is Shared.
– Data write-back: the owner is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached and the Sharers set is empty.
– Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block remains Exclusive.
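Continuing the sketch for the Exclusive case (again illustrative; the forwarded data passes through the directory as the slide describes, which the comments note rather than model):

```python
def handle_exclusive(entry: DirectoryEntry, msg: Message) -> list:
    """Directory transitions for the three Exclusive-state requests above."""
    out = []
    owner = entry.sharer_list()[0]            # exactly one bit is set
    if msg.kind == Msg.READ_MISS:
        # Owner sends data home; it is written to memory and forwarded on.
        out.append(Message(Msg.FETCH, owner, msg.a))
        out.append(Message(Msg.DATA_VALUE_REPLY, msg.p, msg.a))
        entry.add_sharer(msg.p)               # old owner keeps a readable copy
        entry.state = DirState.SHARED
    elif msg.kind == Msg.DATA_WRITE_BACK:
        # Owner evicts the dirty block; memory becomes the owner again.
        entry.sharers = 0
        entry.state = DirState.UNCACHED
    elif msg.kind == Msg.WRITE_MISS:
        # Old owner sends data home; the requester becomes the new owner.
        out.append(Message(Msg.FETCH_INVALIDATE, owner, msg.a))
        out.append(Message(Msg.DATA_VALUE_REPLY, msg.p, msg.a))
        entry.sharers = 1 << msg.p            # state stays Exclusive
    return out
```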

Interconnection Network Routing, Topology Design Trade-offs

Interconnection Topologies

Class of networks scaling with N
Logical properties:
– distance, degree
Physical properties:
– length, width
Static vs. dynamic networks
Fully connected network:
– diameter = 1
– degree = N
– cost?
» bus => O(N), but BW is O(1) – actually worse
» crossbar => O(N²) for BW of O(N)
VLSI technology determines the switch degree

What characterizes a network?

Topology (what):
– the physical interconnection structure of the network graph
– direct: a node is connected to every switch
– indirect: nodes are connected to a specific subset of switches
Routing algorithm (which):
– restricts the set of paths that messages may follow
– many algorithms with different properties
» deadlock avoidance?
Switching strategy (how):
– how the data in a message traverses a route
– circuit switching vs. packet switching
Flow-control mechanism (when):
– when a message, or portions of it, traverses a route
– what happens when traffic is encountered?

Flow Control

What do you do when push comes to shove?
– Ethernet: collision detection and retry after a delay
– FDDI, token ring: arbitration token
– TCP/WAN: buffer, drop, adjust rate
– any solution must adjust to the output rate
Link-level flow control

Topological Properties

Routing distance: the number of links on a route
Diameter: the maximum routing distance between any two nodes in the network
Average distance: the sum of the distances between node pairs / the number of pairs
Degree of a node: the number of links connected to it => cost is high if the degree is high
A network is partitioned by a set of links if their removal disconnects the graph
Fault tolerance: the number of alternate paths between two nodes in the network
(These properties are computed in the sketch below.)
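These properties can be computed directly for any topology given as an adjacency list; a small sketch (BFS from every node, fine for modest N):

```python
from collections import deque

def bfs_distances(adj: dict, src) -> dict:
    """Hop counts from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter_and_avg(adj: dict):
    """Diameter and average distance over all ordered node pairs."""
    n = len(adj)
    all_d = [d for s in adj for d in bfs_distances(adj, s).values()]
    return max(all_d), sum(all_d) / (n * (n - 1))   # self-distances add 0

# 4-node ring: diameter 2, average distance 4/3
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(diameter_and_avg(ring))   # (2, 1.333...)
```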

Review: Performance Metrics

(Figure: a sender-to-receiver timeline showing sender overhead, time of flight, transmission time (size ÷ bandwidth), and receiver overhead; the overheads keep the processor busy, and transport latency spans flight plus transmission.)

Total latency = sender overhead + time of flight + message size ÷ BW + receiver overhead
Does the BW calculation include the header/trailer?
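The latency equation above translates directly into code; the units are the caller's choice (the example uses seconds and bytes/second with illustrative numbers):

```python
def total_latency(sender_overhead: float, time_of_flight: float,
                  message_size: float, bandwidth: float,
                  receiver_overhead: float) -> float:
    """Total latency = sender overhead + time of flight
       + message size / BW + receiver overhead."""
    return (sender_overhead + time_of_flight
            + message_size / bandwidth + receiver_overhead)

# Example: 1 us overhead at each end, 50 ns of flight, and a 256-byte
# message on a 1 GB/s link => about 2.31 us total.
print(total_latency(1e-6, 50e-9, 256, 1e9, 1e-6))
```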

Example Static Network: 2-D Mesh Architecture

More Static Networks: Linear Arrays and Rings

Linear array:
– diameter?
– average distance?
– bisection bandwidth?
– route A -> B given by the relative address R = B - A
Torus?
Examples: FDDI, SCI, Fibre Channel Arbitrated Loop, KSR1
(The questions above are answered in the sketch below.)
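A sketch answering those questions for an N-node linear array and ring, including relative-address routing on the ring; the average-distance formulas are standard results (the ring value is approximate for large N):

```python
def linear_array_metrics(n: int) -> dict:
    return {"diameter": n - 1,
            "avg_distance": (n + 1) / 3,   # exact over all ordered pairs
            "bisection_links": 1}

def ring_metrics(n: int) -> dict:
    return {"diameter": n // 2,
            "avg_distance": n / 4,         # approximate for large n
            "bisection_links": 2}

def ring_route(a: int, b: int, n: int):
    """Relative address R = B - A (mod N); go whichever way is shorter."""
    r = (b - a) % n
    return ("+", r) if r <= n - r else ("-", n - r)

print(linear_array_metrics(8))   # diameter 7, avg 3.0, bisection 1
print(ring_route(1, 6, 8))       # ('-', 3): three hops the short way
```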

Multidimensional Meshes and Tori

d-dimensional array:
– N = k_{d-1} × ... × k_0 nodes
– described by a d-vector of coordinates (i_{d-1}, ..., i_0)
d-dimensional k-ary mesh: N = k^d
– k = d-th root of N
– described by a d-vector of radix-k coordinates
d-dimensional k-ary torus (or k-ary d-cube)?
Examples: Intel Paragon (2-D mesh), SGI Origin (hypercube), Cray T3E (3-D mesh)
(Figures: a 2-D grid and a 3-D cube.)
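A sketch of the coordinate view, assuming equal radix k in every dimension: node numbers decompose into radix-k digits, and the diameters follow from walking corner to corner (mesh) or at most halfway around each dimension (torus):

```python
def coords(node: int, k: int, d: int) -> list:
    """d-vector of radix-k coordinates (i_{d-1}, ..., i_0) for a node."""
    cs = []
    for _ in range(d):
        cs.append(node % k)
        node //= k
    return cs[::-1]

def mesh_diameter(k: int, d: int) -> int:
    return d * (k - 1)      # corner to opposite corner

def torus_diameter(k: int, d: int) -> int:
    return d * (k // 2)     # wraparound halves each dimension

print(coords(11, k=4, d=2))                        # [2, 3] in a 4x4 grid
print(mesh_diameter(4, 2), torus_diameter(4, 2))   # 6 4
```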

Hypercubes

Also called binary n-cubes; number of nodes N = 2^n
O(log N) hops; good bisection BW
Complexity:
– out-degree is n = log N
Routing: correct the differing dimensions in order (sketched below)
– with random communication, 2 ports per processor
(Figure: 0-D through 5-D hypercubes.)
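"Correct dimensions in order" is dimension-order (e-cube) routing; a small sketch on node numbers, where each hop flips one differing address bit:

```python
def ecube_route(a: int, b: int, n: int) -> list:
    """Nodes visited routing a -> b on a binary n-cube, lowest dimension
    first; the hop count equals the number of 1 bits in a xor b."""
    path = [a]
    r = a ^ b                     # relative address: dimensions to correct
    for dim in range(n):
        if (r >> dim) & 1:
            a ^= 1 << dim         # one hop per differing dimension
            path.append(a)
    return path

print(ecube_route(0b000, 0b101, 3))   # [0, 1, 5]: two hops on a 3-cube
```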

Origin Network

Each router has six pairs of 1.56 GB/s unidirectional links
– two to nodes, four to other routers
– latency: 41 ns pin-to-pin across a router
Flexible cables up to 3 ft long
Four "virtual channels": request, reply, and two more for priority or I/O

Case Study: Cray T3D

Request information is built up in a 'shell' of circuitry around the processor
Remote memory operations are encoded in the address

Trees

Diameter and average distance are logarithmic
– k-ary tree, height d = log_k N
– address specified as a d-vector of radix-k coordinates describing the path down from the root
Fixed degree
Route up to the common ancestor and back down (sketched below):
– R = B xor A
– let i be the position of the most significant 1 in R; route up i + 1 levels
– then down in the direction given by the low i + 1 bits of B
H-tree space is O(N), with O(√N)-long wires
Bisection BW?
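A sketch of that route computation for a binary tree (k = 2) with leaves numbered 0..N-1; the helper returns how many levels to climb and the branch directions on the way back down:

```python
def tree_route(a: int, b: int):
    """Route leaf a -> leaf b: up past the most significant differing
    bit, then down following the corresponding low bits of b."""
    r = a ^ b
    i = r.bit_length() - 1                 # position of most significant 1
    up_levels = i + 1                      # climb to the common ancestor
    down = [(b >> j) & 1 for j in range(i, -1, -1)]  # low i+1 bits, top-down
    return up_levels, down

print(tree_route(0b0110, 0b0011))   # (3, [0, 1, 1])
```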

Real Machines

Wide links, smaller routing delay; tremendous variation.

Machine         Topology    Cycle Time (ns)   Channel Width (bits)   Routing Delay (cycles)   Flit (data bits)
nCUBE/2         Hypercube   25                1                      40                       32
TMC CM-5        Fat tree    25                4                      10                       4
IBM SP-2        Banyan      25                8                      5                        16
Intel Paragon   2-D mesh    11.5              16                     2                        16
Meiko CS-2      Fat tree    20                8                      7                        8
Cray T3D        3-D torus   6.67              16                     2                        16
DASH            Torus       30                16                     2                        16
J-Machine       3-D mesh    31                8                      2                        8
Monsoon         Butterfly   20                16                     2                        16
SGI Origin      Hypercube   2.5               20                     16                       160
Myricom         Arbitrary   6.25              16                     50                       16