CS252 Graduate Computer Architecture
Lecture 20: Multiprocessor Networks
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252

Review: Flynn's Classification (1966)
Broad classification of parallel computing systems:
- SISD: Single Instruction, Single Data
  - conventional uniprocessor
- SIMD: Single Instruction, Multiple Data
  - one instruction stream, multiple data paths
  - distributed-memory SIMD (MPP, DAP, CM-1&2, Maspar)
  - shared-memory SIMD (STARAN, vector computers)
- MIMD: Multiple Instruction, Multiple Data
  - message-passing machines (Transputers, nCube, CM-5)
  - non-cache-coherent shared-memory machines (BBN Butterfly, T3D)
  - cache-coherent shared-memory machines (Sequent, Sun Starfire, SGI Origin)
- MISD: Multiple Instruction, Single Data
  - not a practical configuration

Review: Parallel Programming Models
A programming model is made up of the languages and libraries that create an abstract view of the machine:
- Control
  - How is parallelism created?
  - What orderings exist between operations?
  - How do different threads of control synchronize?
- Data
  - What data is private vs. shared?
  - How is logically shared data accessed or communicated?
- Synchronization
  - What operations can be used to coordinate parallelism?
  - What are the atomic (indivisible) operations?
- Cost
  - How do we account for the cost of each of the above?

Paper Discussion: "Future of Wires"
"Future of Wires," Ron Ho, Kenneth Mai, Mark Horowitz
- Fanout-of-4 (FO4) delay metric
  - FO4 delay is roughly constant across technologies
  - treats 8 FO4 as the absolute minimum cycle time (really says 16 is more reasonable)
- Wire delay
  - unbuffered delay scales with (length)^2
  - buffered delay (with repeaters) scales closer to linearly with length
- Sources of wire noise
  - capacitive coupling with other wires: close wires
  - inductive coupling with other wires: can be far wires

"Future of Wires" continued
- Signals cannot reach across the chip in one clock cycle, and the problem grows as technology scales
  - multi-cycle long wires!
- Not really a wire problem; more of a CAD problem?
  - how to manage the increased complexity is the issue
- Seems to favor manycore chip design?

Formalism
- A network is a graph: V = {switches and nodes} connected by communication channels C ⊆ V × V
- A channel has width w and signaling rate f = 1/τ
  - channel bandwidth b = w·f
  - phit (physical unit): data transferred per cycle
  - flit: basic unit of flow control
- The number of input (output) channels is the switch degree
- The sequence of switches and links followed by a message is a route
- Think streets and intersections

What characterizes a network?
- Topology (what)
  - physical interconnection structure of the network graph
  - direct: node connected to every switch
  - indirect: nodes connected to a specific subset of switches
- Routing algorithm (which)
  - restricts the set of paths that messages may follow
  - many algorithms with different properties, e.g. deadlock avoidance
- Switching strategy (how)
  - how the data in a message traverses a route
  - circuit switching vs. packet switching
- Flow control mechanism (when)
  - when a message, or portions of it, traverse a route
  - what happens when traffic is encountered?

Topological Properties
- Routing distance: number of links on a route
- Diameter: maximum routing distance over all node pairs
- Average distance
- A network is partitioned by a set of links if their removal disconnects the graph

Interconnection Topologies
- Class of networks scaling with N
- Logical properties: distance, degree
- Physical properties: length, width
- Fully connected network: diameter = 1, degree = N
- Cost?
  - bus => O(N), but BW is O(1) (actually worse)
  - crossbar => O(N^2) for BW O(N)
- VLSI technology determines switch degree

Example: Linear Arrays and Rings
- Diameter? Average distance? Bisection bandwidth?
- Route A -> B given by relative address R = B - A
- Torus?
- Examples: FDDI, SCI, Fibre Channel Arbitrated Loop, KSR1
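To make those three questions concrete, here is a minimal sketch (not from the lecture) that brute-forces the diameter and average routing distance of a linear array versus a ring; the function names are made up for this example.

```python
# Brute-force topological properties of an N-node linear array vs. ring.

def linear_dist(a, b, n):
    return abs(a - b)              # only one path along the line

def ring_dist(a, b, n):
    d = abs(a - b)
    return min(d, n - d)           # can go either way around the ring

def stats(dist, n):
    """Return (diameter, average distance) over all distinct node pairs."""
    ds = [dist(a, b, n) for a in range(n) for b in range(n) if a != b]
    return max(ds), sum(ds) / len(ds)

n = 16
print("linear array:", stats(linear_dist, n))   # diameter n-1 = 15
print("ring:        ", stats(ring_dist, n))     # diameter n//2 = 8
```

The bisection width follows by inspection: cutting a linear array in half removes 1 link, while a ring requires removing 2.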

Example: Multidimensional Meshes and Tori
(figures: 2D grid, 2D torus, 3D cube)
- n-dimensional array: N = k_{n-1} x ... x k_0 nodes
  - each node described by an n-vector of coordinates (i_{n-1}, ..., i_0)
- n-dimensional k-ary mesh: N = k^n, so k = N^{1/n}
  - each node described by an n-vector of radix-k coordinates
- n-dimensional k-ary torus (or k-ary n-cube)?
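The radix-k coordinate view makes distances easy to compute. Below is a hedged sketch (helper names are invented for illustration) that maps node ids to coordinate vectors and measures mesh distance:

```python
# Coordinate arithmetic for a k-ary n-dimensional mesh, node ids 0..k**n - 1.

def to_coords(node, k, n):
    """Return the n-vector (i_{n-1}, ..., i_0) of radix-k digits for a node id."""
    coords = []
    for _ in range(n):
        coords.append(node % k)
        node //= k
    return list(reversed(coords))

def mesh_distance(a, b, k, n):
    """Hops between two nodes in a k-ary n-mesh (no wraparound links)."""
    ca, cb = to_coords(a, k, n), to_coords(b, k, n)
    return sum(abs(x - y) for x, y in zip(ca, cb))

# 4x4 2D grid: corner node 0 to the opposite corner node 15 is 3 + 3 = 6 hops.
print(mesh_distance(0, 15, k=4, n=2))   # -> 6
```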

On Chip: Embeddings in Two Dimensions
(figure: a 6 x 3 x 2 array embedded in two dimensions)
- Embed multiple logical dimensions in one physical dimension using long wires
- When embedding a higher-dimensional network in a lower-dimensional one, either some wires are longer than others, or all wires are long

Trees
- Diameter and average distance are logarithmic
  - k-ary tree, height n = log_k N
  - address specified by an n-vector of radix-k coordinates describing the path down from the root
- Fixed degree
- Route up to the common ancestor and back down
  - R = B xor A
  - let i be the position of the most significant 1 in R; route up i+1 levels
  - then down, in the direction given by the low i+1 bits of B
- H-tree layout: space is O(N), with O(√N) long wires
- Bisection BW?
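A minimal sketch of that routing rule for a binary tree (so radix k = 2 and addresses are bit strings); the function and its return encoding are invented for illustration:

```python
# Tree routing via the common ancestor: XOR the leaf addresses, find the
# highest differing bit, climb that many levels, then descend following B.

def tree_route(a, b):
    """Return (levels_up, downward_turns) for routing leaf a -> leaf b."""
    r = a ^ b
    if r == 0:
        return 0, []                      # already at the destination
    i = r.bit_length() - 1                # position of most significant 1 in R
    up = i + 1                            # climb to the common ancestor
    # Low i+1 bits of b, consumed most-significant first on the way down.
    down = [(b >> j) & 1 for j in range(i, -1, -1)]
    return up, down

print(tree_route(0b0010, 0b0111))   # -> (3, [1, 1, 1]): up 3 levels, then right 3 times
```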

Fat-Trees
- Fatter links (really, more of them) as you go up, so bisection BW scales with N

Butterflies
(figures: building block and a 16-node butterfly)
- Tree with lots of roots!
- N log N switches (actually (N/2) x log N)
- Exactly one route from any source to any destination
  - R = A xor B; at level i use the 'straight' edge if r_i = 0, otherwise the cross edge
- Bisection: N/2, vs. N^{(n-1)/n} for the n-cube
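A hedged sketch of that unique-path property (which end of the address a given level consumes is a convention; here level 0 decides the high-order bit, and the function name is invented):

```python
# Butterfly routing: with N = 2**n terminals there is exactly one path.
# XOR source and destination; each level inspects one bit of the result.

def butterfly_path(a, b, n):
    """Return the straight/cross choice made at each of the n levels."""
    r = a ^ b
    return ["cross" if (r >> (n - 1 - i)) & 1 else "straight"
            for i in range(n)]

# 16-node (n = 4) butterfly, node 3 -> node 9: r = 0b1010.
print(butterfly_path(3, 9, n=4))   # ['cross', 'straight', 'cross', 'straight']
```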

k-ary n-cubes vs. k-ary n-flies
- degree n vs. degree k
- N switches vs. N log N switches
- diminishing BW per node vs. constant BW per node
- requires locality vs. little benefit to locality
- Can you route all permutations?

Benes Network and Fat-Tree
- A back-to-back butterfly can route all permutations
- What if you just pick a random midpoint?

Hypercubes
- Also called binary n-cubes; number of nodes N = 2^n
- O(log N) hops, good bisection BW
- Complexity
  - out-degree is n = log N
  - correct dimensions in order
  - with random communication, 2 ports per processor
(figures: 0-D through 5-D hypercubes)
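"Correct dimensions in order" is dimension-order (e-cube) routing: flip the differing address bits one at a time. A minimal sketch, with the function name invented for this example:

```python
# Dimension-order ("e-cube") routing on a binary n-cube: visit the
# dimensions in a fixed order, flipping each address bit that differs.

def ecube_route(src, dst, n):
    """Return the sequence of node ids visited from src to dst."""
    path, cur = [src], src
    for dim in range(n):
        if (cur ^ dst) & (1 << dim):   # this dimension still differs
            cur ^= (1 << dim)          # flipping one bit = one hop
            path.append(cur)
    return path

# 3-cube, node 0 -> node 5: hop count equals the Hamming distance (2).
print(ecube_route(0b000, 0b101, n=3))  # -> [0, 1, 5]
```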

Relationship of Butterflies to Hypercubes
- The wiring is isomorphic
- Except that the butterfly always takes n = log N steps

Real Machines
- Wide links, smaller routing delay
- Tremendous variation

Some Properties
- Routing
  - relative distance: R = (b_{n-1} - a_{n-1}, ..., b_0 - a_0)
  - traverse r_i = b_i - a_i hops in each dimension
  - dimension-order routing? Adaptive routing?
- Average distance? Wire length?
  - n x 2k/3 for mesh
  - nk/2 for cube
- Degree?
- Bisection bandwidth? Partitioning?
  - k^{n-1} bidirectional links
- Physical layout?
  - 2D in O(N) space, short wires
  - higher dimension?
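As an illustration of the relative-address rule, here is a hedged sketch (not the lecture's code; the hop encoding is invented) of dimension-order routing on a k-ary n-cube, where the wraparound links let each dimension pick the shorter direction:

```python
# Dimension-order routing in a k-ary n-cube (torus): compute the relative
# address per dimension, choose the shorter wraparound direction, then walk
# the dimensions in a fixed order.

def torus_route(a, b, k):
    """a, b: coordinate tuples of equal length; return a list of unit hops."""
    hops = []
    for dim, (ai, bi) in enumerate(zip(a, b)):
        d = (bi - ai) % k
        if d > k // 2:
            d -= k                        # going the other way is shorter
        step = 1 if d > 0 else -1
        hops += [(dim, step)] * abs(d)    # |d| unit hops in this dimension
    return hops

# 4-ary 2-cube: wrap backward one hop in dim 0, forward one hop in dim 1.
print(torus_route((0, 0), (3, 1), k=4))   # -> [(0, -1), (1, 1)]
```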

Typical Packet Format
- Two basic mechanisms for abstraction
  - encapsulation
  - fragmentation
- Unfragmented packet size: n = n_data + n_encapsulation

Communication Performance: Latency per Hop
- Time(n)_{s->d} = overhead + routing delay + channel occupancy + contention delay
- Channel occupancy = n/b = (n_data + n_encapsulation)/b
- Routing delay? Contention?

Store-and-Forward vs. Cut-Through Routing
- Time: h(n/b + Δ) vs. n/b + h·Δ
- Or, in cycles: h(n/w + Δ) vs. n/w + h·Δ
- What if the message is fragmented?
- Wormhole vs. virtual cut-through
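A minimal sketch evaluating the two formulas (the parameter values are illustrative, not from the lecture): h hops, an n-unit message on w-wide channels, and per-hop routing delay Δ.

```python
# Compare the per-message latency, in cycles, of the two switching strategies.

def store_and_forward(n, w, h, delta):
    return h * (n / w + delta)      # the whole message is buffered at every hop

def cut_through(n, w, h, delta):
    return n / w + h * delta        # the header streams ahead; body pipelines behind

n, w, h, delta = 128, 4, 6, 2       # illustrative values
print(store_and_forward(n, w, h, delta))  # 6 * (32 + 2) = 204 cycles
print(cut_through(n, w, h, delta))        # 32 + 6 * 2   =  44 cycles
```

Cut-through pays the serialization time n/w once rather than at every hop, which is why it dominates for long routes.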

Contention
- Two packets trying to use the same link at the same time
  - limited buffering
  - drop?
- Most parallel machine networks block in place
  - link-level flow control
  - tree saturation
- Closed system: offered load depends on delivered load
- Source squelching

Bandwidth
- What affects local bandwidth?
  - packet density: b x n_data/n
  - routing delay: b x n_data/(n + w·Δ)
  - contention: at the endpoints, within the network
- Aggregate bandwidth
  - bisection bandwidth: sum of the bandwidth of the smallest set of links that partition the network
  - total bandwidth of all the channels: C·b
- Suppose N hosts issue a packet every M cycles with average distance h
  - each message occupies h channels for l = n/w cycles each
  - C/N channels available per node
  - link utilization for store-and-forward: ρ = (h·l/M channel cycles per node)/(C/N) = N·h·l/(M·C) < 1!
  - link utilization for wormhole routing?
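A hedged sketch of that utilization bound, solved for the injection interval; the channel count C below (an 8x8 torus with each bidirectional link counted as two channels) is an illustrative assumption, not a figure from the lecture:

```python
# Store-and-forward link-utilization bound: N hosts inject a packet every
# M cycles, each occupying h channels for l = n/w cycles, out of C channels
# total. The system saturates unless rho = N*h*l / (M*C) < 1.

def min_injection_interval(N, h, n, w, C):
    """Smallest M (cycles between packets per host) keeping rho below 1."""
    l = n / w                      # cycles a message occupies each channel
    return N * h * l / C

# Illustrative: 64 hosts, avg distance 4 hops, 128-byte packets on 4-byte
# channels, C = 2 dims * 64 nodes * 2 directions = 256 channels.
print(min_injection_interval(N=64, h=4, n=128, w=4, C=256))   # -> 32.0 cycles
```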

Saturation
(figure: delivered bandwidth vs. offered load, not reproduced)

How Many Dimensions?
- n = 2 or n = 3
  - short wires, easy to build
  - many hops, low bisection bandwidth
  - requires traffic locality
- n >= 4
  - harder to build: more wires, longer average wire length
  - fewer hops, better bisection bandwidth
  - can handle non-local traffic
- k-ary d-cubes provide a consistent framework for comparison: N = k^d
  - scale the dimension (d) or the nodes per dimension (k)
  - assume cut-through routing

Traditional Scaling: Latency Scaling with N
- Assumes equal channel width, independent of node count or dimension
- Latency is dominated by average distance

Average Distance
- ave dist = d(k-1)/2
- But equal channel width is not equal cost!
  - higher dimension => more channels
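A minimal sketch of the trade-off that formula exposes: holding N = k^d fixed and raising the dimension d shrinks the average hop count, at the cost of more (and, once embedded, longer) channels.

```python
# Average distance d*(k-1)/2 for N = k**d nodes, across feasible (d, k) pairs.

N = 4096
for d in (2, 3, 4, 6, 12):
    k = round(N ** (1 / d))        # nodes per dimension for this d
    print(f"d={d:2d}  k={k:3d}  ave dist = {d * (k - 1) / 2:6.1f}")
# d= 2, k=64 -> 63.0 hops on average; d=12, k=2 (hypercube) -> 6.0 hops
```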