Computer architecture II


Computer Architecture II: Introduction

Recap: programming for performance. Amdahl's law; partitioning for performance, addressing decomposition and assignment; orchestration for performance; case studies: Ocean, Barnes-Hut, Raytrace.

Plan for today: programming for performance, case studies (Ocean, Barnes-Hut, Raytrace); scalable interconnection networks: basic concepts and definitions, topologies, switching, routing, performance.

Case study 1: Simulating Ocean Currents. (Figure: (a) cross sections; (b) spatial discretization of a cross section.) The ocean is modeled as several two-dimensional grids, static and regular. Steps: set up the movement equations, solve them, update the grid values. Multigrid method: sweep the n x n grid first (the finest level), then go coarser (n/2 x n/2, n/4 x n/4) or finer depending on the error in the current sweep, as in the sketch below.
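A minimal sequential sketch of that control flow in Python (illustrative only, not the course's Ocean code; the transfer of values between levels, restriction and prolongation, is omitted, and `sweep` is a plain nearest-neighbor relaxation):

```python
import numpy as np

def sweep(grid):
    """One nearest-neighbor relaxation sweep; returns the largest change."""
    err = 0.0
    for i in range(1, grid.shape[0] - 1):
        for j in range(1, grid.shape[1] - 1):
            new = 0.25 * (grid[i-1, j] + grid[i+1, j] +
                          grid[i, j-1] + grid[i, j+1])
            err = max(err, abs(new - grid[i, j]))
            grid[i, j] = new
    return err

def multigrid(levels, tol=1e-4, max_sweeps=10_000):
    """levels[0] is the finest n x n grid, levels[1] is n/2 x n/2, etc.
    Sweep the current level; while the error stays large, move coarser,
    and as it shrinks, move back toward the finest level.
    (Restriction/prolongation of values between levels is omitted.)"""
    k = 0
    for _ in range(max_sweeps):
        err = sweep(levels[k])
        if err < tol:
            if k == 0:
                return              # converged on the finest grid
            k -= 1                  # error small: go finer
        elif k + 1 < len(levels):
            k += 1                  # error large: go coarser
```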

Case Study 1: Ocean (figure).

Partitioning. Function parallelism: identify independent computations, which reduces synchronization. Data parallelism: static partitioning within a grid, with issues similar to the kernel solver. Block versus strip partitioning: inherent communication favors blocks; artifactual communication favors strips (better spatial locality along whole rows). Load imbalance is due to grid border elements: in the block case, internal blocks have no border elements. A comparison of the communication volumes is sketched below.
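A back-of-the-envelope sketch of the inherent communication per sweep (the grid size, process counts, and nearest-neighbor stencil are illustrative assumptions):

```python
from math import isqrt

def comm_elements(n, p, scheme):
    """Boundary grid points a process must fetch per sweep
    (nearest-neighbor stencil on an n x n grid, p processes)."""
    if scheme == "strip":            # n/p contiguous rows: two boundary rows
        return 2 * n
    if scheme == "block":            # (n/sqrt(p))^2 block: four boundary sides
        return 4 * (n // isqrt(p))
    raise ValueError(scheme)

for p in (4, 16, 64):                # blocks win for p > 4: 4n/sqrt(p) < 2n
    print(p, comm_elements(1024, p, "strip"), comm_elements(1024, p, "block"))
```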

Orchestration: spatial locality, similar to the equation solver; 4D versus 2D arrays. With block partitioning, a 2D array gives poor spatial locality across rows and good locality across columns; moreover, with lots of grids, cache conflicts arise across grids. Nonlocal accesses have good spatial locality on row-oriented boundaries and poor spatial locality on column-oriented boundaries.
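The 4D-array layout can be illustrated with NumPy (a sketch; the shapes are illustrative assumptions): in the 2D layout a process's block is a strided view whose rows lie far apart in memory, while in the 4D layout each block is one contiguous chunk, so it maps cleanly onto pages and cache lines.

```python
import numpy as np

n, p_rows, p_cols = 1024, 4, 8            # 32 processes in a 4 x 8 grid
b_r, b_c = n // p_rows, n // p_cols       # 256 x 128 block per process

# 2D layout: one n x n array; a process's block is a strided view whose
# rows lie n floats apart in memory (poor page/cache-line behavior).
grid2d = np.zeros((n, n))
block_2d = grid2d[2 * b_r:3 * b_r, 3 * b_c:4 * b_c]

# 4D layout: grid4d[pi, pj] is one process's block, contiguous in memory.
grid4d = np.zeros((p_rows, p_cols, b_r, b_c))
block_4d = grid4d[2, 3]

print(block_2d.flags["C_CONTIGUOUS"], block_4d.flags["C_CONTIGUOUS"])  # False True
```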

Orchestration: temporal locality. Complex working-set hierarchy (six working sets, three of them important): a few points for near-neighbor reuse, three sub-rows, a partition of one grid. Synchronization: barriers between phases and solver sweeps, locks for global variables; there is lots of work between synchronization events.

Execution time breakdown: 4D arrays versus 2D arrays, 1026 x 1026 grid with block partitioning on a 32-processor Origin2000 (4 MB second-level cache). The 4D grids do much better than the 2D ones: smaller access time (better locality) and less time waiting at barriers.

Case Study 2: Barnes-Hut. Simulate the interactions of many stars evolving over time. Computing the forces is expensive: the brute-force approach is O(n^2). Barnes-Hut is a hierarchical method that takes advantage of the force law F = G m1 m2 / r^2.

Case Study 2: Barnes-Hut. Space is subdivided into cells until each cell contains at most one body. Sequential algorithm: for each body (n times), traverse the tree top-down and compute the total force acting on that body; if a cell is far enough away, compute the force against the cell as a whole. Expected tree height: log n. A traversal sketch follows.
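A minimal sequential sketch of the traversal in Python (illustrative, not the SPLASH-2 code; `Node`, `pair_force`, and the opening parameter `theta` are assumptions of this sketch, and a scalar force magnitude stands in for the vector accumulation a real code performs):

```python
from dataclasses import dataclass, field
from math import sqrt

G = 6.674e-11

@dataclass
class Node:
    mass: float
    pos: tuple                     # body position or cell center of mass
    size: float = 0.0              # cell side length (0 for a single body)
    children: list = field(default_factory=list)

def pair_force(a, b):
    """Scalar G*m1*m2/r^2 between a body and a body/cell."""
    r2 = (a.pos[0] - b.pos[0]) ** 2 + (a.pos[1] - b.pos[1]) ** 2
    return G * a.mass * b.mass / r2 if r2 > 0 else 0.0

def force_on(body, node, theta=0.5):
    """Top-down traversal: a cell that appears smaller than `theta`
    from the body is treated as a single point mass."""
    if not node.children:                          # a single body
        return 0.0 if node is body else pair_force(body, node)
    d = sqrt((node.pos[0] - body.pos[0]) ** 2 + (node.pos[1] - body.pos[1]) ** 2)
    if d > 0 and node.size / d < theta:            # far enough: approximate
        return pair_force(body, node)
    return sum(force_on(body, c, theta) for c in node.children)

a = Node(1e30, (0.0, 0.0))
b = Node(1e30, (1.0e11, 0.0))
root = Node(2e30, (0.5e11, 0.0), size=2.0e11, children=[a, b])
print(force_on(a, root))                           # ~6.7e27
```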

Application structure. Main data structures: arrays of bodies, of cells, and of pointers to them. Each body/cell has several fields: mass, position, pointers to others. Contiguous chunks of the pointer arrays to bodies and cells are assigned to processes.

Partitioning. Decomposition: bodies in most phases, cells when computing moments. Challenges for assignment: the body distribution is non-uniform, so work and communication are non-uniform and assignment cannot be done by inspection; the distribution changes dynamically across time-steps, so assignment cannot be static; information needs fall off with distance from a body, so partitions should be spatially contiguous for locality; different phases have different work distributions across bodies, so no single assignment is ideal for all. Communication is fine-grained and irregular.

Load balancing. Particles are not equal: the number and the mass of the bodies acting upon each one differ. Solution: assign costs to particles based on their work. The work is unknown beforehand and changes between time-steps, but the system evolves slowly, so the work per particle in the current phase can be used as an estimate of its cost for the next time-step.

Load balancing: Orthogonal Recursive Bisection (ORB). Recursively bisect space into subspaces with equal work, where work is associated with bodies as measured in the previous phase; continue until there is one partition per processor. Costly. A sketch follows.
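A two-dimensional ORB sketch (the `orb` helper is hypothetical; it assumes each body carries a cost from the previous phase and that the processor count is a power of two):

```python
def orb(bodies, nproc, axis=0):
    """Orthogonal recursive bisection. `bodies` is a list of
    ((x, y), cost) pairs; cost is the work measured last phase.
    Returns nproc lists with roughly equal total cost."""
    if nproc == 1:
        return [bodies]
    bodies = sorted(bodies, key=lambda b: b[0][axis])   # split along one axis
    target, acc, split = sum(c for _, c in bodies) / 2, 0.0, len(bodies)
    for i, (_, c) in enumerate(bodies):
        acc += c
        if acc >= target:
            split = i + 1
            break
    nxt = (axis + 1) % 2                                # alternate x and y
    return (orb(bodies[:split], nproc // 2, nxt) +
            orb(bodies[split:], nproc // 2, nxt))

parts = orb([((x * 1.0, (x * 7) % 5 * 1.0), 1 + x % 3) for x in range(16)], 4)
print([sum(c for _, c in p) for p in parts])            # roughly equal work
```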

Another approach: cost-zones. Insight: the quad-tree already contains an encoding of spatial locality, so cost-zones is low-overhead and very easy to program. Store a cost in each node of the tree; compute the total work of the system (e.g., 1000 ops) and divide it by the number of processors (e.g., 1000 ops / 10 proc = 100 ops/proc); each processor then traverses the tree and picks its range (0-100, 100-200, ...), as sketched below.
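A sketch of the cost-zones walk (assumptions: a quad-tree whose nodes cache the total cost of their subtree, and children stored in a fixed spatial order; `TNode` is a hypothetical stand-in for the tree node type):

```python
def costzones(root, nproc):
    """Assign bodies to processors by a depth-first walk of the tree.
    Processor p gets the bodies whose running cost falls into the zone
    [p * total/nproc, (p+1) * total/nproc)."""
    total = root.cost                       # each node caches its subtree cost
    share = total / nproc
    zones = [[] for _ in range(nproc)]
    acc = 0.0

    def walk(node):
        nonlocal acc
        if not node.children:               # a leaf body
            zones[min(int(acc // share), nproc - 1)].append(node)
            acc += node.cost
        else:
            for child in node.children:     # fixed spatial order of children
                walk(child)

    walk(root)
    return zones

class TNode:
    def __init__(self, cost, children=()):
        self.cost, self.children = cost, list(children)

leaves = [TNode(c) for c in (3, 1, 2, 2, 4, 3, 1, 2)]
root = TNode(sum(l.cost for l in leaves), leaves)
print([[n.cost for n in z] for z in costzones(root, 4)])
```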

Orchestration and mapping. Spatial locality: data distribution is much more difficult than in Ocean; data are redistributed across time-steps; the logical granularity (body/cell) is much smaller than a page; partitions contiguous in physical space are not contiguous in the arrays where the bodies are stored. Temporal locality and working sets: the first working set is body-to-body interaction; the second (computing the forces on a body) has good temporal locality because the system evolves slowly. Synchronization: barriers between phases; no synchronization within the force calculation, since the data written differ from the data read; locks in tree-building, point-to-point event synchronization in the center-of-mass phase. Mapping: ORB maps well to a hypercube, cost-zones to a linear array.

Execution time breakdown: 512K bodies on a 32-processor Origin2000, static assignment of bodies versus cost-zones. Load balance is good in both, but the static assignment shows slow access due to its lack of locality.

Raytrace. Map a 3D scene onto a 2D display pixel by pixel. Rays shot through the pixels of the image are called primary rays; they reflect and refract when they hit objects, and color and opacity are computed along the way. This recursive process generates a ray tree per primary ray. A hierarchical spatial data structure keeps track of the primitives in the scene (similar to the Barnes-Hut tree): nodes are space cells, leaves hold linked lists of primitives. There are trade-offs between execution time and image quality.

Partitioning. Scene-oriented approach: partition the scene cells; a process handles rays while they are in one of its cells. Ray-oriented approach: partition the primary rays (pixels) and access scene data as needed; simpler, and used here. Static assignment gives bad load balance because ray bounces are unpredictable. Dynamic assignment: use contiguous blocks to exploit the spatial coherence among neighboring rays, plus tiles for task stealing; a block is the unit of assignment, a tile the unit of decomposition and stealing. All tiles are inserted into a queue and idle processes steal one tile at a time, as in the sketch below.
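A sequential simulation of the tile queues with stealing (illustrative assumptions: per-process deques stand in for lock-protected task queues, and popping a tile stands in for tracing its rays):

```python
from collections import deque
import random

def render(nproc, tiles_per_proc):
    """Each process starts with its own deque of tiles; when it runs dry
    it steals one tile at a time from a random non-empty victim."""
    queues = [deque(range(p * tiles_per_proc, (p + 1) * tiles_per_proc))
              for p in range(nproc)]
    done = [0] * nproc
    while any(queues):
        for p in range(nproc):
            if not queues[p]:                        # idle: steal one tile
                victims = [v for v in range(nproc) if queues[v]]
                if victims:
                    queues[p].append(queues[random.choice(victims)].pop())
            if queues[p]:
                queues[p].popleft()                  # "trace" this tile
                done[p] += 1
    return done

print(render(4, 8))                                  # tiles processed per process
```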

Orchestration and mapping. Spatial locality: a proper data distribution for the ray-oriented approach is very difficult (dynamically changing, unpredictable, fine-grained accesses); the memory pages are distributed round-robin to avoid contention, giving poor spatial locality. Temporal locality: working sets are large and ill-defined due to the unpredictability; replication would do well, but capacity is limited. Synchronization: one barrier at the end, locks on the task queues. Mapping: a 2D mesh is natural for the image, but likely not important.

Execution time breakdown (scene: balls arranged in a bunch). Task stealing is clearly very important for load balance.

Scalable Interconnection Networks

Outline: basic concepts and definitions; topologies; switching; routing; performance.

Formalism. A network is a graph G = (V, E): V contains the switches and nodes, E the communication channels (edges), E ⊆ V × V. A route (v0, ..., vk) is a path of length k between nodes v0 and vk in which (vi, vi+1) ∈ E. Routing distance: the number of links on a route. Diameter: the maximal routing distance between any two nodes. Average distance: the mean routing distance over all node pairs. Degree: the number of input (output) channels of a node. Bisection width: the minimal number of links that must be cut to partition the network into two equal halves. These quantities can be computed mechanically, as in the sketch below.
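A small BFS sketch (graph as an adjacency dict; the 4-node ring is an illustrative example) that computes routing distances, the diameter, and the average distance:

```python
from collections import deque

def distances_from(graph, src):
    """BFS hop counts from src; graph is {node: [neighbors]}."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter_and_average(graph):
    pairs = [d for s in graph
             for n, d in distances_from(graph, s).items() if n != s]
    return max(pairs), sum(pairs) / len(pairs)

ring4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(diameter_and_average(ring4))   # (2, 1.33...): diameter n/2 for a ring
```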

What characterizes a network? Bandwidth (offered bandwidth): b = w * f, with width w (in bytes) and signaling rate f = 1/τ (in Hz). Latency: the time a message travels between two nodes. Throughput (delivered bandwidth): how much of the offered bandwidth is effectively used.

What characterizes a network? Topology: the physical interconnection structure of the network graph. Routing algorithm: restricts the set of paths that messages may follow; there are many algorithms with different properties. Switching strategy: how the data in a message traverse the route (circuit switching versus packet switching). Flow-control mechanism: when a message or portions of it traverse the route, what happens when traffic is encountered?

Goals. Latency as small as possible. High throughput: as many concurrent transfers as possible (the bisection width gives the potential number of parallel connections). Cost as low as possible.

Bus (e.g., Ethernet). (Figure: five nodes on a shared bus.) Degree = 1, diameter = 1, bisection width = 1; no routing necessary; CSMA/CD protocol; limited bus length. The simplest and cheapest dynamic network. Notes (translated): degree 1, since each node has only one incoming/outgoing line; diameter 1, since every node reaches every other node directly, with no intermediate hop; connectivity 1, since detaching a single node (e.g., unplugging it from the Ethernet) splits the network into two subnetworks (one of them with a single element), so there is no fault tolerance and no detour when the bus is busy; bisection width 1, since a single message from one half of the nodes to the other suffices to saturate the network.

Complete graph. Degree = n-1: too expensive for big networks. Diameter = 1. Bisection width = ⌊n/2⌋ * ⌈n/2⌉. A static network with a connection between each pair of nodes: when the network is cut into two halves, each node has connections to n/2 nodes in the other half, and there are n/2 such nodes. Notes (translated): no routing/addressing is needed, since every node is directly connected to every other node, at the price of a high degree; diameter 1, since the direct connection reaches the target of a message in one step; to detach a node from the network, all n-1 links to the other nodes must be cut.

Ring. Degree = 2, diameter = ⌊n/2⌋, bisection width = 2; slow for big networks. A static network: node i is linked with nodes i+1 and i-1 modulo n. Examples: FDDI, SCI, FibreChannel Arbitrated Loop, KSR1.

d-dimensional grid (Cray T3D and T3E). (Figure: 3x3 grid with nodes (1,1) ... (3,3).) For d dimensions with n nodes: degree = 2d for interior nodes, diameter = d (n^(1/d) - 1), bisection width = (n^(1/d))^(d-1). A static network. Notes (translated): degree: consider node (2,2), which has 2 neighbors in each dimension; bisection width: a one-dimensional grid is a chain, and cutting a single edge splits it into two equal halves; for a two-dimensional grid, picture 16 nodes in a 4x4 grid, which can be split into two 2x4 halves by cutting 4 edges (sqrt(16) = 4).

Crossbar. (Figure: 3x3 crossbar of switches.) Fast and expensive (n^2 switches); mostly used between processors and memories. Degree = 1, diameter = 2, bisection width = n/2. Examples: 4x4, 8x8, 16x16. A dynamic network. Notes (translated): connectivity 1, since a single node can be detached (the circle and the dashed circle in the figure are the same node); half of the processors can communicate with the other half simultaneously, so n/2 messages can be under way in the network at once, and the bisection width is therefore optimal (n/2).

Hypercube (1). Hamming distance: the number of bits in which the binary representations of two numbers differ. Two nodes are connected if their Hamming distance is 1. Routing from x to y proceeds by decreasing the Hamming distance. (Figure: 2-, 3- and 4-dimensional hypercubes with binary node labels 0000 ... 0111.) A static network.

Hypercube (2). k dimensions, n = 2^k nodes; degree = k, diameter = k, bisection width = n/2. Two (k-1)-hypercubes are linked through n/2 edges to form a k-hypercube. Examples: Intel iPSC/860, SGI Origin 2000. A routing sketch follows.
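The Hamming-distance routing rule in a few lines (a sketch; resolving the differing bits lowest dimension first is one arbitrary but common choice, the classic e-cube order):

```python
def hypercube_route(src, dst, k):
    """E-cube routing: fix differing address bits one dimension at a time,
    lowest dimension first; each hop reduces the Hamming distance by 1."""
    path, cur = [src], src
    for bit in range(k):
        if (cur ^ dst) & (1 << bit):
            cur ^= 1 << bit          # cross the link in this dimension
            path.append(cur)
    return path

print([format(x, "04b") for x in hypercube_route(0b0000, 0b0110, 4)])
# ['0000', '0010', '0110'] -- at most k hops, hence diameter k
```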

Omega network (1). Building block: a 2x2 shuffle switch. Perfect shuffle: the target address is the cyclic left shift of the source address (000 -> 000, 001 -> 010, 010 -> 100, 011 -> 110, 100 -> 001, 101 -> 011, 110 -> 101, 111 -> 111).

Omega network (2). log2 n levels of 2x2 shuffle building blocks; a dynamic network. At level i the switch looks at bit i of the destination address: if it is 0 the packet goes to the upper output, if it is 1 to the lower output (see the example of 100 sending to 110). A sketch of this rule follows.
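A sketch of the self-routing rule (assuming, consistent with the 100 -> 110 example, that stage 0 examines the most significant destination bit):

```python
def omega_route(dst, k):
    """Output port taken at each of the k stages of a 2^k-node Omega
    network: stage i examines bit i of the destination, counted from
    the most significant bit; 0 = upper output, 1 = lower output."""
    bits = [(dst >> (k - 1 - i)) & 1 for i in range(k)]
    return ["up" if b == 0 else "down" for b in bits]

print(omega_route(0b110, 3))   # ['down', 'down', 'up'], as in 100 -> 110
```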

Omega network (3). n nodes, (n/2) log2 n building blocks; degree = 2 for nodes, 4 for building blocks; diameter = log2 n; bisection width = n/2: for a random permutation, n/2 messages are expected to cross the network in parallel. Extremes: if all nodes want to send to node 0, only one message proceeds in parallel; if each node sends a message to itself, n messages proceed in parallel.

Fat tree / Clos network (1). The nodes are the leaves of a tree. A tree has diameter 2 log2 n (from farthest left over the root to farthest right). A simple tree has bisection width 1: a bottleneck. Fat tree: edges at level i have double the capacity of edges at level i-1; at level i this requires expensive switches with 2^i inputs and 2^i outputs. Also known as Clos networks.

Fat tree / Clos network (2). Routing: take the direct way over the lowest common parent; when alternatives exist, choose randomly. Tolerant to node failures. Diameter 2 log2 n; bisection width n (counted in units of leaf-link capacity). Example: CM-5. Note (translated): with 16 nodes, each inner node of the tree has 4 children; to halve the network, half of the child edges of each root node (those leading to the other half) must be cut, which for 4 root nodes makes 8 physical edges in total.

Switching: how a message traverses the network from one node to the other. Circuit switching: one path from source to destination is established and all packets take that way, like the telephone system. Packet switching: a message is broken into a sequence of packets which can be sent across different routes; better utilization of the network resources.

Packet routing. There are two basic approaches to routing packets, based on what a switch does when a packet begins arriving: store-and-forward and cut-through, the latter in two variants, virtual cut-through and wormhole.

Packet routing: store-and-forward. A packet is stored completely at a switch before being forwarded, so a packet always occupies at least two nodes. Problem: the switches need lots of memory for storing the incoming packets. Switching takes place step by step, so the danger of blocking is small.

Packet routing: cut-through. A packet may have entered the switch only partially, leaving its tail on other nodes, so it may reside on more than two switches; the decision to forward it can be taken right away. What to do with the rest of the packet if the head blocks? Virtual cut-through: gather the tail where the head is; under high contention this degenerates into store-and-forward. Wormhole: if the head blocks, the whole "worm" blocks.

Store-and-forward versus cut-through routing: h (n/b + Δ) versus n/b + h Δ, where h is the number of hops, n the message size, b the bandwidth, and Δ the routing delay per hop. The sketch below compares the two.
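The comparison with illustrative numbers (all parameter values below are assumptions for the example):

```python
def store_and_forward(n, b, h, delta):
    """The whole packet is received before being forwarded at each of h hops."""
    return h * (n / b + delta)

def cut_through(n, b, h, delta):
    """The header pipelines through; the transmission time is paid only once."""
    return n / b + h * delta

# e.g. 1 KB message, 1 GB/s links, 5 hops, 50 ns routing delay per hop
args = (1024, 1e9, 5, 50e-9)
print(store_and_forward(*args), cut_through(*args))   # ~5.4 us vs ~1.3 us
```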

Routing algorithm: how do I know where a packet should go? The topology does NOT determine the routing. Routing algorithms: arithmetic; source-based; table lookup; adaptive (the route is based on network state, e.g., contention).

(1) Arithmetic routing. For a regular topology, simple arithmetic determines the route, e.g., xy-routing in a 3D torus. (Figure: 2x2x2 cube with node labels (0,0,0) ... (1,1,1).) The packet header contains a signed offset to the destination (per dimension); at each hop the switch adds or subtracts so as to reduce the offset in one dimension; when x == 0 and y == 0 the packet is at the correct processor. Drawbacks: requires an ALU in the switch, and the CRC must be recomputed at each hop. A sketch follows.
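A sketch of the per-hop decision (dimension order and the (dimension, step) encoding of hops are assumptions of this sketch):

```python
def arithmetic_route(offsets):
    """Each switch moves the packet +/-1 in one dimension so as to drive
    that dimension's signed offset toward zero; dimension order means
    x is finished before y, and so on. Returns the hops taken."""
    hops = []
    offsets = list(offsets)
    for dim, _ in enumerate(offsets):
        while offsets[dim] != 0:
            step = 1 if offsets[dim] > 0 else -1
            offsets[dim] -= step
            hops.append((dim, step))
    return hops

print(arithmetic_route([2, -1]))   # [(0, 1), (0, 1), (1, -1)]
```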

(2) Source-based and (3) table-lookup routing. Source-based: the source specifies the output port for each switch on the route; very simple switches with no control state, which just strip the output port off the header; Myrinet uses this; it cannot be made adaptive. Table lookup: very small header containing a field that is an index into a table of output ports; but the tables are big and must be kept up to date.

Deterministic versus adaptive routing. Deterministic: follows a pre-specified route. K-ary d-cube: dimension-order routing from (x1, y1) to (x2, y2): first traverse Δx = x2 - x1, then Δy = y2 - y1. Tree: route via the common ancestor. Adaptive: the route is determined by contention for the output ports. (Figure: 3-cube with binary node labels.)

(4) Adaptive routing. Essential for fault tolerance: at least multipath. Can improve the utilization of the network: simple deterministic algorithms easily run into bad permutations.

Contention: two packets trying to use the same link at the same time. Buffering is limited, so drop packets? Most parallel-machine networks block in place instead; traffic may then back up toward the source (tree saturation: the backup spreads all the way back through the network). The alternative is to discard packets and inform the source.

Communication performance: latency. Time(n)_{s-d} = overhead + routing delay + channel occupancy + contention delay. Overhead: the time necessary for initiating the sending and the reception of a message. Occupancy = (n + n_e) / b, where n is the data (payload) size and n_e the packet envelope size. Plus the routing delay and the contention delay.

Bandwidth. What affects local bandwidth? Packet density: b * n / (n + n_e); routing delay: b * n / (n + n_e + w Δ), where Δ is the number of cycles spent waiting for a routing decision and w is the width of the channel; contention, at the endpoints and within the network. Aggregate bandwidth: the bisection bandwidth, i.e., the sum of the bandwidths of the smallest set of links that partition the network (a limit that hurts when the communication distribution is not uniform), and the total bandwidth of all the channels. A small calculation follows.
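The delivered share of the raw link bandwidth, per the formula above (the parameter values are illustrative assumptions):

```python
def effective_bandwidth(n, n_env, b, w=0, delta=0):
    """Delivered share of raw link bandwidth b for payload n bytes,
    envelope n_env bytes, channel width w, routing delay delta cycles."""
    return b * n / (n + n_env + w * delta)

for n in (64, 256, 1024):                  # small packets waste bandwidth
    print(n, effective_bandwidth(n, n_env=16, b=1e9, w=8, delta=4))
```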

Interconnects:

Name           Latency      Bandwidth    Topology          Comments
Gigabit        100-150 us   1 Gb/s       Star or fat tree  Cheap for small systems
Infiniband 4x  3.5-7 us     10-20 Gb/s   Fat tree          Not as mature as Myrinet; smaller switches (128 ports); cost ~$500/card + port
Myrinet        -            2-8 Gb/s     Clos              Mature, de facto standard; 256+256-port switches; cost ~$500/card + port
NUMAlink4      1-2 us       8-16 Gb/s    -                 SGI proprietary; special microprocessor for I/O; shmem
Quadrics       -            9 Gb/s       -                 Expensive; used in turn-key machines
SCI/Dolphin    -            4 Gb/s       2D/3D torus       Cabling nightmare! Costs more than Myrinet

Myrinet. Offered bandwidth 2+2 Gbit/s, full duplex; 5-7 µs latency. Arbitrary topology; a fat-tree/Clos network is preferable. Routing: wormhole, source routing. Cable (8+1 bits in parallel) or fiber optics; flow control on each link. Programmable adaptor: 333 MHz RISC processor with 2 MB SRAM; PCI/PCI-X connection, up to 133 MHz, 64 bits, 8 Gb/s unidirectional over the PCI-X bus.

Myrinet fat tree (128 nodes), built from 16x16 crossbars. Note (translated): only 8 lines are shown here; each is laid out twice, because the links are duplex.

Myrinet PCI bus adaptor. (Figure: cable connector, network interface, network DMA, 2 MB SRAM, host DMA, PCI bridge, LANai CPU.) PCI(-X) bridge, 64 bits, 66-133 MHz; LANai RISC at 333 MHz; 2 fiber-optic connectors, both duplex.

Myrinet 16x16 crossbar: 8 computers are connected on the front side (2 channels each); on the back side there are 8 outputs (2 channels each) toward the next level of the Clos network (32x32, two ...).

128-node Clos network, built from the building block shown earlier.

Myrinet 256+256 Clos network: a routing network with bisection width 256; the front side connects 256 computers, the back side provides 256 connections to the next level of routing units.

Clos network with full bisection width: 64 nodes and 32 nodes.