Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks. John Kim, William J. Dally & Dennis Abts. Presented.

Presentation transcript:

Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks. John Kim, William J. Dally & Dennis Abts. Presented by: Evan Su

Outline: basic metrics; basic topologies; why high-radix; router microarchitecture; high-radix topologies

Interconnection networks are used to connect processors and memories in multiprocessors, as switching fabrics for high-end routers and switches, and to connect I/O devices. Topology: determines the arrangement of channels and nodes in the network (the road map). Choosing it is often the first step in network design.

Performance Metrics: average hop count, average latency, throughput, bisection bandwidth

Hop Count: the number of links traversed between source and destination

Latency: the time it takes for a packet to traverse the network. Latency = header latency + serialization latency. Header latency: time until the head of the packet arrives at the input port. Serialization latency: time for the rest of the packet to catch up.

Throughput: the data rate (bits/sec) that the network accepts per input port. Offered load: the percentage of capacity that the network accepts.

Bisection Bandwidth: split the N nodes into two groups of N/2 nodes such that the bandwidth between the two groups is minimum. Why it is relevant: if traffic is completely random, the probability of a message crossing between the two halves is ½, so the bisection bandwidth tells how much traffic a network can support (½ of the total traffic bandwidth).
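To make the definition concrete, here is a minimal sketch that counts the bidirectional channels cut by a minimum bisection for a few standard topologies. The function name and topology labels are hypothetical, and the closed-form counts are the textbook values for these regular topologies, not figures taken from the slides.

```python
import math


def bisection_channels(topology: str, n_nodes: int) -> int:
    """Bidirectional channels crossing a minimum bisection (textbook formulas)."""
    if n_nodes < 2 or n_nodes % 2:
        raise ValueError("n_nodes must be an even number >= 2")
    if topology == "ring":
        # Cutting a ring into two halves always severs exactly two links.
        return 2
    if topology == "2d_torus":
        side = math.isqrt(n_nodes)
        if side * side != n_nodes or side % 2:
            raise ValueError("2D torus needs an even, square node count")
        # Each of the `side` rows is a ring, and each ring is cut twice.
        return 2 * side
    if topology == "hypercube":
        if n_nodes & (n_nodes - 1):
            raise ValueError("hypercube needs a power-of-two node count")
        # Splitting on one address bit cuts one link per node pair.
        return n_nodes // 2
    if topology == "fully_connected":
        # Every node in one half links to every node in the other half.
        return (n_nodes // 2) ** 2
    raise ValueError(f"unknown topology: {topology}")


if __name__ == "__main__":
    for name in ("ring", "2d_torus", "hypercube", "fully_connected"):
        print(name, bisection_channels(name, 64))
```

Multiplying the channel count by the per-channel bandwidth gives the bisection bandwidth; the sketch deliberately leaves the channel bandwidth out because the slides do not state one.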

Topology Examples: grid, hypercube, torus. [Table: bus, ring, 2D torus, 6-cube, and fully connected networks of 64 nodes, compared on performance and bisection bandwidth; the values are not preserved in the transcript.]

Why High Radix? Radix: the number of inputs/outputs of each router. For the past 20 years, networks used low-radix k-ary n-cubes (tori) because routers did not have enough bandwidth to support a high radix. Network routers have a growth curve that obeys Moore’s law: bandwidth has increased, packet length has stayed the same, latency has gone down.

Why High Radix? Router bandwidth increases by approximately an order of magnitude every 5 years. The bandwidth growth is the result of: an increase in signaling rate; an increase in the number of signals.

High-Radix Routers

High-Radix vs. Low-Radix: cost, power dissipation, latency

Cost: increasing the radix of the routers monotonically reduces overall cost. Network cost is proportional to total router bandwidth (router pins, connectors). For a fixed bisection bandwidth, cost is proportional to hop count; high radix => lower hop count.

Cost

Power: power dissipated decreases with increasing radix. Power is proportional to the number of router nodes. As radix increases, hop count decreases and the number of router nodes decreases as well (independent of the individual router node). Router power is due to I/O circuits and switch bandwidth. Arbitration logic becomes more complex with higher radix but is a negligible fraction of total power.

Latency: the total router bandwidth B is divided among 2k input and output channels, so b = B/(2k). Symbols: H = number of hops, t_r = delay in a router, L = packet length, b = channel bandwidth, B = total bandwidth, k = radix.
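The latency equation that these symbols belong to did not survive in the transcript. A hedged reconstruction follows, assuming the usual zero-load form (per-hop router delay plus serialization latency) and a worst-case hop count of H = 2 log_k N. Both the hop-count expression and the symbol N (number of network nodes) are assumptions introduced here, not text from the slide.

```latex
T \;=\; H\,t_r + \frac{L}{b}
  \;=\; 2\,t_r \log_k N + \frac{2kL}{B},
\qquad \text{using } b = \frac{B}{2k},\; H = 2\log_k N .
```

The first term shrinks as the radix k grows (fewer hops), while the second grows (narrower channels), which is why an optimal radix exists.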

Aspect Ratio: differentiate the latency T with respect to k (dT/dk) and set the result equal to zero. The expression on the right-hand side of the resulting equation determines the router radix that minimizes network latency.
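The step above can be sketched numerically. Assuming the latency model T = 2 t_r log_k N + 2kL/B (an assumption: the slide's own equation is not preserved, and N, the node count, is not defined on it), setting dT/dk = 0 gives k ln²k = B t_r ln N / L, whose right-hand side plays the role of the aspect ratio. The function names and the parameter values in the demo are purely illustrative.

```python
import math


def latency(k: float, B: float, t_r: float, N: int, L: float) -> float:
    """Assumed model: T = 2*t_r*log_k(N) + 2*k*L/B."""
    return 2.0 * t_r * math.log(N) / math.log(k) + 2.0 * k * L / B


def aspect_ratio(B: float, t_r: float, N: int, L: float) -> float:
    """Right-hand side of k * ln(k)**2 = B * t_r * ln(N) / L."""
    return B * t_r * math.log(N) / L


def optimal_radix(B: float, t_r: float, N: int, L: float) -> float:
    """Solve k * ln(k)**2 = aspect_ratio by bisection (left side is increasing for k >= 1)."""
    target = aspect_ratio(B, t_r, N, L)
    if target <= 0:
        raise ValueError("aspect ratio must be positive")
    lo, hi = 1.0, 2.0
    while hi * math.log(hi) ** 2 < target:  # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.log(mid) ** 2 < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Illustrative values only: 1 Tb/s router, 20 ns router delay, 1024 nodes, 128-bit packets.
    params = dict(B=1e12, t_r=20e-9, N=1024, L=128)
    k_opt = optimal_radix(**params)
    print(f"aspect ratio = {aspect_ratio(**params):.1f}, optimal radix ~ {k_opt:.1f}")
```

A real design would round the result to a buildable radix; the continuous solution only indicates where the latency curve bottoms out under this model.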

Optimal Latency

Router Microarchitecture (VC): Route computation (RC): based on information stored in the header, select the output port. Virtual-channel allocation (VA): the packet must gain exclusive access to a virtual channel of the output port. Switch allocation (SA): if there is a free buffer in the channel, the flit can vie for access to the crossbar. Switch traversal (ST): transfers the flit from the input buffer to the output buffer.

Router Microarchitecture (VC)

Microarchitecture for High-radix: Routing computation: linear function of bandwidth. VC allocator: quadratic function of the number of input/output ports, because it takes bids from all ports. Switch allocator: quadratic function of the number of ports.

Baseline Performance: degradation is due to head-of-line blocking. Previously, the switch was overprovisioned because doing so was low cost.

Fully buffered crossbar: separates the queuing. Previously a flit had to compete for both the input and the output of the switch. Crosspoint buffers decouple the two allocations, so flits always make forward progress.

Fully buffered crossbar: gains performance at the expense of cost. Crosspoint buffering dominates chip area (it grows quadratically with the radix).

Hierarchical crossbar: using subswitches, buffer area grows as O(vk^2/p). Decouples allocation and reduces HoL blocking. v = number of virtual channels, k = radix, p = subswitch size (ports per subswitch).

Hierarchical crossbar: worst-case performance; uniform random traffic

Okay! Back to Topologies: Butterfly, Clos, Flattened Butterfly

Butterfly Network: k-ary n-fly: k^n network nodes. Example: 2-ary 3-fly. Routing from 000 to 010: the destination address is used to directly route the packet; bit n is used to select the output port at stage n.
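The destination-based routing described above can be sketched as follows. This is a minimal illustration, assuming that stages consume the destination digits from most significant to least significant; the function name is hypothetical and the stage ordering is an assumption, since the slide figure is not preserved.

```python
def destination_tag_route(dest: int, k: int, n: int) -> list[int]:
    """Output port selected at each of the n stages of a k-ary n-fly.

    The source address plays no role: each stage simply reads one
    base-k digit of the destination address.
    """
    if k < 2 or n < 1:
        raise ValueError("need k >= 2 and n >= 1")
    if not 0 <= dest < k ** n:
        raise ValueError("destination out of range for a k-ary n-fly")
    digits = []
    for _ in range(n):
        digits.append(dest % k)  # peel off the least significant digit
        dest //= k
    return digits[::-1]  # assumed order: most significant digit at the first stage


if __name__ == "__main__":
    # 2-ary 3-fly, destination 010: ports 0, 1, 0 at the three stages.
    print(destination_tag_route(0b010, k=2, n=3))
```

Because the port sequence depends only on the destination, every source/destination pair has exactly one path, which is the lack of path diversity that motivates the later topologies.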

Butterfly Network: Pros: low hop count, H = log_k N. Cons: deterministic routing with no path diversity; does not exploit traffic locality.

Clos Network: n×m input switches, r×r middle switches, m×n output switches

Clos Network: a butterfly folded back on itself. Pros: path diversity (good performance on both benign and adversarial traffic). Cons: double the cost of a butterfly; H = 2 log_k N.

Folded Clos Network (Fat Tree): similar to Clos; exploits locality

Flattened Butterfly Network: routers in each row are combined. Examples: 4-ary 2-fly, 2-ary 4-fly

Flattened Butterfly Network Routers in each row are combined
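A small sketch of what combining the routers in each row implies for router count and radix. It assumes the standard construction in which a k-ary n-fly has k^(n-1) rows, each row collapses into one router that keeps its k terminals, and each router links to k-1 peers in each of the n-1 dimensions. The function name and dictionary keys are hypothetical, and the formulas follow from that assumed construction rather than from text on the slides.

```python
def flattened_butterfly_params(k: int, n: int) -> dict:
    """Size of the flattened butterfly derived from a k-ary n-fly (assumed construction)."""
    if k < 2 or n < 1:
        raise ValueError("need k >= 2 and n >= 1")
    routers = k ** (n - 1)                # one router per row of the butterfly
    terminals_per_router = k              # each row served k terminals
    inter_router_links = (n - 1) * (k - 1)  # k-1 peers in each of n-1 dimensions
    return {
        "nodes": k ** n,
        "routers": routers,
        "dimensions": n - 1,
        "radix": terminals_per_router + inter_router_links,  # equals n*(k-1) + 1
    }


if __name__ == "__main__":
    for k, n in ((4, 2), (2, 4)):
        print(f"{k}-ary {n}-fly ->", flattened_butterfly_params(k, n))
```

The two calls correspond to the 4-ary 2-fly and 2-ary 4-fly examples named on the slide; under this construction the radix grows with both k and n, which is why the topology relies on high-radix routers.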

Flattened Butterfly Network: On benign traffic: approaches the performance/cost of a butterfly; ½ the cost of a Clos network; eliminates redundant hops when there is no need for load balancing. On adversarial traffic: matches the cost/performance of a folded Clos; order of magnitude better performance than a butterfly; uses non-minimal global adaptive routing.

Routing: in the figure, there are two minimal routes between node 0 and node 10. In general, if two nodes a and b have addresses that differ in j digits, then there are j! minimal routes between a and b. This path diversity derives from the fact that a packet routed in a flattened butterfly can traverse the dimensions in any order.
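The j! count can be checked by enumerating the routes directly. The sketch below assumes addresses are given as tuples of digits, one per dimension, and that a hop corrects exactly one differing digit; the function name and the tuple representation are illustrative choices, not notation from the slides.

```python
from itertools import permutations


def minimal_routes(src: tuple, dst: tuple) -> list:
    """All minimal routes from src to dst, each as the list of addresses visited."""
    if len(src) != len(dst):
        raise ValueError("addresses must have the same number of digits")
    # Dimensions in which the two addresses disagree; each needs exactly one hop.
    differing = [i for i, (a, b) in enumerate(zip(src, dst)) if a != b]
    routes = []
    for order in permutations(differing):  # any order of correcting the digits is minimal
        current = list(src)
        route = [tuple(current)]
        for dim in order:
            current[dim] = dst[dim]  # one hop fixes one digit
            route.append(tuple(current))
        routes.append(route)
    return routes


if __name__ == "__main__":
    # Two addresses differing in two digits: 2! = 2 minimal routes.
    for route in minimal_routes((0, 0, 0), (1, 0, 1)):
        print(route)
```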

Routing Algorithms: results under uniform random traffic and under the worst-case traffic pattern. VAL = Valiant’s non-minimal oblivious algorithm; MIN = minimal adaptive; UGAL = non-minimal adaptive algorithm; UGAL-S = UGAL using sequential allocation; CLOS AD = non-minimal adaptive routing in a flattened Clos.

Routing Algorithms: Valiant: picks a random middle node b and routes minimally from s to b and then from b to d; achieves only ½ of network capacity, regardless of traffic. Minimal adaptive: chooses a minimal route. Adaptive Clos: minimal routing in benign traffic, folded-Clos routing in adversarial traffic.
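A minimal sketch of the Valiant step described above, using hop counts only. It assumes router addresses are tuples of base-k digits and that the minimal hop count between two routers is the number of differing digits; the function names and the address representation are hypothetical. The adaptive algorithms are not modeled here.

```python
import random


def minimal_hops(a: tuple, b: tuple) -> int:
    """Minimal hop count: one hop per differing address digit (assumed model)."""
    return sum(x != y for x, y in zip(a, b))


def valiant_route(src: tuple, dst: tuple, k: int, rng: random.Random):
    """Pick a uniformly random intermediate router b, then route s -> b -> d minimally.

    Returns the intermediate address and the total hop count of the two phases.
    """
    mid = tuple(rng.randrange(k) for _ in src)
    return mid, minimal_hops(src, mid) + minimal_hops(mid, dst)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    s, d = (0, 0, 0), (1, 0, 1)
    mid, hops = valiant_route(s, d, k=4, rng=rng)
    print("minimal hops:", minimal_hops(s, d), "| via", mid, "->", hops, "hops")
```

The detour never shortens a route and usually lengthens it, which is consistent with the slide's point that Valiant routing gives up capacity on benign traffic in exchange for load balance.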

Performance Comparison: to compare performance, a 1024-node network is constructed with each of the following topologies while maintaining a constant bisection bandwidth.

Performance Comparison: uniform random traffic; worst-case traffic

Cost Comparison

Conclusion: use high-radix routers to take advantage of increased router bandwidth. The flattened butterfly exploits high-radix routers and global adaptive routing to give a cost-effective network. The flattened butterfly has a lower hop count than a folded Clos and better path diversity than a conventional butterfly. On adversarial traffic, it exploits global adaptive routing to match the performance of a folded Clos at ½ the cost.