Packet-Switched vs. Time-Multiplexed FPGA Overlay Networks
Kapre et al.
RC Reading Group – 3/29/2006
Presenter: Ilya Tabakh

Agenda
Introduction
Background
Topology
Packet Switched
Time Multiplexed
Application
Methodology
Results
Conclusions
Wrap-up
Questions

Introduction
Dedicated spatial interconnect links on a configured FPGA network can be inefficient for sparse communication patterns
Overlaying virtual networks on top of the physical networks can help address this issue

Time-Multiplexed
Pros
– Can take advantage of global route information
Cons
– Offline computation can be compute intensive
– Must allocate resources for communication schedule and all possible communication between operators

Packet-Switched
Pros
– No offline setup and no resources needed for storing a communication schedule
– Routes are made only for operators that are actually communicating
Cons
– Switches are more complex
– Routes can be less efficient

Novel Contributions of the Work
Demonstration of efficient and scalable static and dynamic FPGA overlay networks
Quantification of the difference between offline scheduling and online routing
Quantification of the performance impact of balancing interconnect and compute
Characterization of area and performance tradeoffs between time-multiplexed and packet-switched networks
Quantification of the performance difference between time-multiplexed and packet-switched networks under varying application communication loads

NoC
Early days – on-chip buses
Later it became necessary to investigate scalable, high-performance, low-overhead on-chip networks
Networks are required since buses scale poorly
As the number of PEs increases, communication increases and more bandwidth is needed

Communication Patterns
The communication pattern must be known in order to choose which network to use
Configured switching is inefficient for apps that underutilize links
Circuit switching is efficient for larger messages on shorter networks
Application characteristics must be known in order to make an appropriate choice

Packet Switched
How they improve on past work in FPGA-based overlay networks
– Allow arbitrary topologies
– Use real applications and realistic PE architectures to generate traffic payloads
– Network speed is much faster, running at 166 MHz as compared to most running at MHz

Time Multiplexed
Use a greedy router similar to the one used in the Virtual Wires project
Virtual Wires overcame pin limitations by time-sharing each physical wire among logical wires and pipelining
This paper attempts to explore the entire design space as opposed to one system size or configuration

Performance Analysis
Several important quantities of the network have to be defined
PE Input Serialization: a bound on the cycle count for PE input
PE Output Serialization: a bound on the cycle count for PE output
Network Bisection: maximum number of messages that can cross the network on a given cycle
Network Latency: number of cycles required to cross the network
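
The four quantities above can each be read as a floor on routing time, so one plausible way to combine them is to take the largest. The slides do not say how the paper combines them, so the sketch below is an illustration only: the Workload fields, the one-message-per-cycle PE assumption, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of one communication workload (illustrative names)."""
    max_msgs_into_one_pe: int      # most messages any single PE must receive
    max_msgs_out_of_one_pe: int    # most messages any single PE must send
    msgs_crossing_bisection: int   # messages that must cross the bisection
    bisection_width: int           # messages that can cross it per cycle
    network_latency: int           # cycles for one message to cross the network


def cycle_lower_bound(w: Workload) -> int:
    """Return a simple lower bound on routing cycles.

    Assumption: each PE accepts and emits one message per cycle, so the
    serialization bounds equal the per-PE message counts.
    """
    input_serialization = w.max_msgs_into_one_pe
    output_serialization = w.max_msgs_out_of_one_pe
    # Ceiling division: a partially filled cycle still costs a whole cycle.
    bisection_cycles = -(-w.msgs_crossing_bisection // w.bisection_width)
    return max(input_serialization, output_serialization,
               bisection_cycles, w.network_latency)
```

Whichever term dominates indicates whether the PEs or the network limit performance, which is the kind of balance the later area results discuss.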

Butterfly Fat Trees
Most FPGA NoCs have focused on meshes
BFTs achieve higher performance at equivalent chip size
Routing functions programmed in the split primitives determine the path
A single address bit is used to make a routing decision at each switch
Time-multiplexed merge contains a context memory which stores the computed routing
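
To make the single-address-bit decision concrete, here is a minimal sketch of the downward half of a route through a binary tree of split primitives. It is a simplification, not the paper's switch: it ignores the upward path of a real BFT, and the bit ordering (most significant bit consumed at the top) and the function name are assumptions.

```python
def downward_route(dest: int, levels: int) -> list[int]:
    """Return the output port (0 or 1) chosen at each level on the way down.

    Assumption: the switch at the top consumes the most significant
    address bit and each lower level consumes the next bit.
    """
    if not 0 <= dest < (1 << levels):
        raise ValueError("destination does not fit in the address width")
    return [(dest >> bit) & 1 for bit in range(levels - 1, -1, -1)]
```

For example, downward_route(5, 3) yields [1, 0, 1], one port choice per level, each taken from one bit of the destination address 0b101.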

Packet Switched
Primitives have input queues
Split primitives compute the routing decision in a single cycle based on the destination address
Arbitration is done by selecting packets based on input queue occupancies
A network with floorplanned and pipelined primitives can operate as high as 180 MHz
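
The slide says only that arbitration looks at input queue occupancies. One plausible reading, sketched below, is that a merge primitive serves the fullest input queue each cycle. The policy, the tie-break toward the lowest index, and the function names are assumptions for illustration.

```python
from collections import deque
from typing import Optional


def arbitrate(queues: list[deque]) -> Optional[int]:
    """Pick the index of the input queue to serve this cycle.

    Assumed policy: serve the queue holding the most packets; ties go to
    the lowest index. Returns None when every queue is empty.
    """
    best = None
    for i, q in enumerate(queues):
        if q and (best is None or len(q) > len(queues[best])):
            best = i
    return best


def merge_step(queues: list[deque]):
    """Forward one packet from the winning queue, or None if all are empty."""
    winner = arbitrate(queues)
    return None if winner is None else queues[winner].popleft()
```

Serving the fullest queue first tends to keep any one input from backing up, at the cost of comparison logic that a statically scheduled switch does not need.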

Time Multiplexed
Statically scheduled prior to runtime
Switching primitives contain context memory
Context memory requires 1 bit of storage per cycle
Network capable of operating at 166 MHz
Greedy routing algorithm used
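
A rough sketch of what greedy offline scheduling can look like follows. It is not the paper's router or the Virtual Wires router, whose details are not in these slides. It assumes fixed paths, one message per link per cycle, one cycle per hop, and messages taken in the given order with no backtracking; all names are hypothetical.

```python
def greedy_schedule(routes: dict[str, list[str]]) -> dict[str, int]:
    """Assign each message the earliest start cycle with no link conflict.

    routes maps a message name to the ordered list of links it crosses.
    Assumptions: each link carries one message per cycle, each hop takes
    one cycle, and earlier decisions are never revisited (greedy).
    """
    busy: set[tuple[str, int]] = set()   # (link, cycle) pairs already taken
    start_cycle: dict[str, int] = {}
    for msg, links in routes.items():
        t = 0
        # Slide the start time forward until every hop finds its link free.
        while any((link, t + hop) in busy for hop, link in enumerate(links)):
            t += 1
        for hop, link in enumerate(links):
            busy.add((link, t + hop))
        start_cycle[msg] = t
    return start_cycle
```

The resulting table of (link, cycle) reservations is the kind of information a context memory would replay at runtime, one entry per cycle of the schedule.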

Area and Latency of Switching

Application
A real-life application was mapped onto the networks
ConceptNet – common-sense reasoning knowledge base represented as a graph
Start with an initial set of nodes, send activation from each node to its neighbors along weighted edges
Time-multiplexed run at 100% activity; packet-switched run at activity levels between 1% and 100%
Limitations
– Nodes limited to 128 edges of fanout or fanin
– Can only process a single edge per cycle
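
The activation step described above can be sketched in a few lines. This is a software illustration of spreading activation in general, not the ConceptNet PE logic. The summing of incoming contributions, the dropping of a node's prior activation, and the example node names are assumptions.

```python
def spread_once(activation: dict[str, float],
                edges: dict[str, list[tuple[str, float]]]) -> dict[str, float]:
    """Run one spreading-activation step.

    Each active node sends activation * edge_weight to every neighbor;
    a node's new value is the sum of what it receives. Assumption:
    incoming contributions are summed and prior activation is not kept.
    """
    received: dict[str, float] = {}
    for node, value in activation.items():
        for neighbor, weight in edges.get(node, []):
            received[neighbor] = received.get(neighbor, 0.0) + value * weight
    return received
```

Every (node, neighbor) pair visited in the inner loop corresponds to one message on the network, which is why only the active fraction of the graph generates traffic in the packet-switched case.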

Methodology
Java-based infrastructure
– simulates the packet-switched network
– computes schedules for the time-multiplexed network
Used smallest set of ConceptNet predicates
Java infrastructure generates VHDL netlist
Hand-coded VHDL for ConceptNet PEs
Created custom multipliers instead of using onboard ones for speed

Methodology (cont.)
Synthesis and place-and-route using Synplicity Compiler v8.0 and Xilinx ISE v8.1i to obtain operating frequency and slice count
Long wires that constrain performance are further pipelined based on post-place-and-route timing analysis
Lots of intervention to prepare the system

Results
Three quantitative comparisons are provided to characterize the tradeoffs between packet-switched and time-multiplexed networks
– Routing of identical topologies
– Impact of area with identical area constraints
– Examine performance while varying activity level (Activity Factors)

Routing identical topologies
Small numbers of PEs induce a light communication load
As the number of PEs increases, communication increases and offline routing starts to outperform online routing
Online routing requires up to 63% more cycles than offline routing for larger networks

Impact of Area
A couple of things to consider when talking about area
– PE vs. Interconnect Tradeoff
– Area-Time Tradeoff

PE vs. Interconnect Tradeoff
Sometimes the network performs better with fewer PEs but more capacity in the network.

Area-Time Tradeoff
Packet-switched and time-multiplexed networks may use significantly different amounts of area due to differences in switch sizes
At smaller areas, time multiplexing requires more cycles
At higher cycle counts, time multiplexing requires more area for context
Performance is limited by the 128-edge fanin or fanout limit
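
The link between cycle count and context area follows from the earlier statement that context memory needs 1 bit of storage per cycle. A back-of-the-envelope sketch, assuming every switching primitive stores the full schedule, is shown below. The primitive count is a hypothetical input, not a figure from the paper.

```python
def context_bits(num_primitives: int, schedule_cycles: int,
                 bits_per_cycle: int = 1) -> int:
    """Total context storage across all switching primitives, in bits.

    Assumption: every primitive stores bits_per_cycle bits for each
    cycle of the schedule, so storage grows linearly with schedule length.
    """
    return num_primitives * schedule_cycles * bits_per_cycle
```

Under this model, doubling the schedule length doubles the context storage, while a packet-switched network's switch area stays fixed regardless of how many cycles routing takes.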

Activity Factors
Packet-switching takes 8x as many cycles to route
At some activity factors less than 100%, packet-switching should be able to outperform time-multiplexing for the same area

Conclusions
Demonstrated implementations of packet-switched and time-multiplexed FPGA overlay networks operating at 166 MHz
Offline scheduling offers up to a 63% performance increase over online scheduling for equivalent topologies
Packet-switching is up to 2x faster for small areas
Time-multiplexing is up to 8x faster for large areas

Conclusions (cont.)
Packet switching offers better performance for activity factors below 30% at 32K slices and below 5% at 100K slices

Future Work
Mapping larger communication graphs with smaller fanout limitations to fully test networks
Compress context memory for time-multiplexing
Improve efficiency of packet switching
Extend work to multiple-chip networks

Wrap-up
The paper looks at the trade-offs involved in FPGA overlay networks
It gives a good account of the design decisions and offers practical guidance to the designer
It describes an interesting alternative to mesh networks (BFTs)