High-Performance Networks for Dataflow Architectures Pravin Bhat Andrew Putnam.

Presentation transcript:


Overview
- Motivation & Design Constraints
- Network design
- Performance
- Adaptive Routing
- Conclusion


Motivation
- Signal delay on wires is more important than transistor switching speed
- Seriously decreased reliability in future processes
- Factory testing will not be possible
- Expect 20% of transistors to be DOA
- Expect 10% more to die over several months
- Dataflow is an answer, but the network is currently a bottleneck

Dataflow Characteristics
- Unpredictable traffic
- Cannot pre-allocate resources
- Highly bursty traffic
- Quick delivery of bursts is critical
- Nodes are not guaranteed to consume messages
- Potential for livelock & deadlock


Network Requirements
- High performance during bursts
- Area efficient
- Guarantee message delivery
- Deadlock & livelock free
- Fault tolerant
- Regular 2-D physical structure

Topology
- On-chip - must be implementable in 2-D
- Regular tiled structure suggests:
  - Grid
  - Torus
  - Hypercube
  - Fat Tree
- Hypercube is difficult to route and to scale
- Fat Tree has a single point of failure
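To make the grid versus torus comparison concrete, here is a minimal sketch of the minimal hop count between two tiles on each topology. The function names and the `(x, y)` tile coordinates are assumptions made for illustration; they do not come from the slides.

```python
def grid_hops(src, dst):
    """Minimal hop count on a 2-D grid: the Manhattan distance."""
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)


def torus_hops(src, dst, width, height):
    """Minimal hop count on a 2-D torus.

    Each dimension wraps around, so a packet may travel in whichever
    direction is shorter.
    """
    (sx, sy), (dx, dy) = src, dst
    x = abs(sx - dx)
    y = abs(sy - dy)
    return min(x, width - x) + min(y, height - y)


if __name__ == "__main__":
    # Corner-to-corner traffic on an 8x8 array of tiles.
    print(grid_hops((0, 0), (7, 7)))
    print(torus_hops((0, 0), (7, 7), 8, 8))
```

The wrap-around links shorten worst-case paths, but they are physically longer wires, so hop count alone does not decide between the two.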

Routing
- Static routing does not provide essential fault tolerance
- Use a modified Virtual Channel algorithm
- VC guarantees deadlock freedom if nodes consume messages
- Dynamically adaptive to handle transient faults & congestion
- Initial studies used static routing

Flow Control
- Resource reservation not possible
- Long-latency wires prohibit handshakes
- Send messages assuming accept
- Buffer just enough to allow receiver to send reject signal on subsequent clock cycle
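The "send assuming accept" scheme can be sketched as a small cycle-by-cycle simulation. This is only an illustration under stated assumptions: the sender keeps a copy of the message it sent in the previous cycle, the receiver raises a reject signal that the sender sees one cycle later, and the receiver drains one message every `drain_every` cycles. The function `simulate` and all of its parameters are hypothetical and are not taken from the WaveScalar design.

```python
from collections import deque


def simulate(messages, capacity, drain_every, max_cycles=1000):
    """Optimistic send with a one-cycle-late reject signal.

    Returns (delivered messages in order, number of resends).
    """
    to_send = deque(messages)
    rx_queue = deque()      # receiver-side buffer (assumed size: capacity)
    delivered = []
    in_flight = None        # copy of the message sent last cycle
    rejected = False        # reject signal raised for that message
    resends = 0

    for cycle in range(max_cycles):
        # The sender sees last cycle's reject signal and requeues the copy.
        if in_flight is not None and rejected:
            to_send.appendleft(in_flight)
            resends += 1
        in_flight, rejected = None, False

        # The receiver consumes one message every drain_every cycles.
        if rx_queue and cycle % drain_every == 0:
            delivered.append(rx_queue.popleft())

        # The sender transmits without waiting for a handshake.
        if to_send:
            in_flight = to_send.popleft()
            if len(rx_queue) < capacity:
                rx_queue.append(in_flight)
            else:
                rejected = True  # visible to the sender next cycle

        if not to_send and not rx_queue and not rejected:
            break

    return delivered, resends


if __name__ == "__main__":
    print(simulate([1, 2, 3, 4], capacity=1, drain_every=2))
```

With a slow receiver the sender pays for its optimism in resends, but no message is lost or reordered, and no round-trip handshake is needed when the receiver has room.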

Deadlock-Free Operation
- Nodes cannot always consume messages
- Add a dedicated channel to and from memory
- Adds 8% area overhead
- Rotate stalled operands out of PEs to ensure forward progress
- Send first operand back at a faster rate to avoid livelock


Performance
- Ran network-centric simulations
- 20 billion instructions
- Spec2000, Splash2, and Dataflow benchmarks
- Goal is to find optimum balance of:
  - Number of Virtual Channels
  - Queue Length
  - Link Bandwidth
  - Packets per message

ASIC Model
- Performance must be balanced with area
- Developed RTL model of WaveScalar network architecture
- 90 nm process ASIC standard cell library
- Timing per link:
  - Grid links: 2.76 ns
  - Torus links: 6.16 ns
- Network switch is 11.6% of chip area
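As a rough worked example, the link timings can be turned into an upper bound on link clock frequency. This assumes one link traversal must fit in a single clock cycle with no pipelining, which the slides do not state; `max_link_mhz` is a hypothetical helper.

```python
def max_link_mhz(link_delay_ns):
    """Highest clock (in MHz) at which one traversal fits in one cycle."""
    return 1000.0 / link_delay_ns


GRID_LINK_NS = 2.76   # from the ASIC model slide
TORUS_LINK_NS = 6.16  # from the ASIC model slide

if __name__ == "__main__":
    print(f"grid:  {max_link_mhz(GRID_LINK_NS):.1f} MHz")
    print(f"torus: {max_link_mhz(TORUS_LINK_NS):.1f} MHz")
    print(f"torus/grid delay ratio: {TORUS_LINK_NS / GRID_LINK_NS:.2f}")
```

Under that assumption a torus link is a little over twice as slow as a grid link, which is the cost to weigh against the torus's shorter hop counts.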


Virtual Channels Flow Control
- In hardware, only the head of the queue can be dequeued in one clock cycle
- If the first message in a queue is blocked then every message behind it is blocked
- The network utilization suffers due to idle links

Virtual Channels Flow Control
- Virtual Channels: several small queues instead of one long queue
- Decouples buffer resources from link resources
- Increases network throughput by increasing link usage
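The head-of-line blocking problem and the virtual channel remedy can be illustrated with a toy model. It assumes a link sends at most one message per cycle, taken from the head of the first queue whose head is not blocked; the `forward` function, the message tuples, and the way messages are assigned to virtual channels are all illustrative assumptions.

```python
from collections import deque


def forward(queues, blocked_outputs, cycles):
    """Return the names of messages sent over the link in `cycles` cycles.

    Each message is a (name, output_port) tuple. Only the head of each
    queue is eligible to be dequeued in a given cycle.
    """
    sent = []
    for _ in range(cycles):
        for q in queues:
            if q and q[0][1] not in blocked_outputs:
                sent.append(q.popleft()[0])
                break  # the physical link carries one message per cycle
    return sent


if __name__ == "__main__":
    msgs = [("a", "east"), ("b", "north"), ("c", "north")]
    blocked = {"east"}

    # One long queue: "a" is blocked at the head, so nothing moves.
    print(forward([deque(msgs)], blocked, cycles=3))

    # Two virtual channels sharing the same link: "b" and "c" get through.
    vcs = [deque(msgs[:1]), deque(msgs[1:])]
    print(forward(vcs, blocked, cycles=3))
```

The total buffer space is the same in both cases; splitting it into independent queues is what keeps the link busy.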

Dimension Order Routing
- Old WaveScalar Routing Protocol
- Network topology is a static grid
- Packets first travel to the correct x-coordinate and then to the correct y-coordinate
- Low network utilization from not using all available paths
- Not fault tolerant
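Dimension order routing on a grid is simple enough to sketch in a few lines. The names `dor_next_hop` and `dor_path` and the `(x, y)` coordinate convention are assumptions for illustration only.

```python
def dor_next_hop(cur, dst):
    """Next tile on a dimension-order (x first, then y) route.

    Returns None when the packet is already at its destination.
    """
    (x, y), (dx, dy) = cur, dst
    if x != dx:
        return (x + (1 if dx > x else -1), y)
    if y != dy:
        return (x, y + (1 if dy > y else -1))
    return None


def dor_path(src, dst):
    """Full list of tiles visited from src to dst, inclusive."""
    path = [src]
    while path[-1] != dst:
        path.append(dor_next_hop(path[-1], dst))
    return path


if __name__ == "__main__":
    print(dor_path((0, 0), (2, 1)))
```

Every source and destination pair has exactly one path, which is why a single faulty or congested link on that path cannot be avoided.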

Adaptive Routing
- Progressively chooses longer routes instead of waiting for an unavailable resource
- High network utilization
- Fault tolerant
- Can cause deadlock

Deadlock Free Adaptive Routing
- Some Virtual Channels are reserved for Dimension Order Routing; the rest are used for Adaptive Routing
- Every time a packet is routed in the wrong direction, its Dimension Reversal count is incremented
- No packet is allowed to wait in a virtual channel with a packet that has a lower Dimension Reversal count
- Mathematically proven to be deadlock free
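The waiting rule can be written down as a small predicate. This is a sketch of the rule as stated on the slide, not the authors' implementation: the `Packet` class, the function names, and in particular the fallback to the reserved dimension-order channels when waiting is not allowed are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    dr: int = 0  # Dimension Reversal count


def record_wrong_direction(packet):
    """Increment the count each time the packet is routed the wrong way."""
    packet.dr += 1


def may_wait(packet, occupants):
    """A packet may not wait behind any packet with a lower DR count."""
    return all(other.dr >= packet.dr for other in occupants)


def choose_channel_class(packet, occupants):
    """Assumed fallback: use the reserved dimension-order channels
    whenever waiting on the adaptive channel is not allowed."""
    return "adaptive" if may_wait(packet, occupants) else "dimension-order"


if __name__ == "__main__":
    p = Packet("p")
    record_wrong_direction(p)                         # p.dr is now 1
    print(choose_channel_class(p, [Packet("q", 0)]))  # q has a lower count
    print(choose_channel_class(p, [Packet("r", 2)]))
```

Because a packet only ever waits behind packets whose counts are at least as high as its own, the counts impose an ordering on waiting, and that ordering is what prevents cycles of packets waiting on one another.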

Conclusion
- Best performance per area with:
  - 2 Virtual Channels
  - 2 Links
  - 2-4 entries per queue
  - Torus Topology
  - Adaptive Routing
- Dataflow chip networks can be high-performance at reasonable area