SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular Reconfigurable Systems
Abelardo Jara-Berrocal, Ann Gordon-Ross
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Department of Electrical and Computer Engineering, University of Florida
2 of 16 Introduction – Parallel Computation
To leverage parallel computation speedups, a high-performance application can be decomposed into smaller tasks that run on separate modules and communicate in parallel.
[Figure: design flow. 1. System formulation: the application is decomposed into a task graph (Source, FIR, FFT, Matrix, IFFT, Angle, Sink) whose edges indicate communication volume. 3. Task allocation / system placement: tasks are mapped onto modules (uProc, MEM, DSP1, ASIC, DSP2).]
Task profiles shown on the slide:
- Profile 1: DSP: 0.5 ms, uProc: 2.2 ms
- Profile 2: ASIC: 0.5 ms, DSP: 2.5 ms
Problem: Speedup can be limited by inefficient communication!
How do designers provide efficient module communication?
3 of 16 Communication Architectures
[Figure: a) Bus and b) Network-on-Chip (NoC), each connecting uProc, MEM, DSP1, ASIC, and DSP2; in b) every module attaches to a NoC node]
Bus
- Advantages:
  - Very well known
  - Smaller hardware overhead
  - SoC standards: CoreConnect®, AMBA®, Wishbone
- Disadvantages:
  - Does not scale well as the number of modules increases
  - High power consumption due to long wires
  - Cross-talk issues
Network-on-Chip (NoC)
- Advantages:
  - Scalable
  - Very high bandwidth
  - Wires are broken into smaller segments
  - Multiple simultaneous parallel communications
- Disadvantages:
  - Significant area overhead, exacerbated by store-and-forward routers
  - Interfaces between modules and nodes are not standard: each design uses its own signals and handshaking protocols
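4 of 16 General NoC architecture
[Figure: modules (MEM, uProc, DSP1, ASIC, DSP2, I/O, Slave) attached through NoC interfaces to NoC nodes, which are connected by NoC links]
- NoC Interface: connects a module to a NoC node
- NoC Link: connects NoC nodes
- NoC Node:
  - Routers (packet switching)
  - Switches (circuit switching)
- NoC Topology:
  - Varies across designs
  - Commonly 2D mesh or torus [1]
[1] Salminen et al. Survey of Network-on-Chip Proposals. White Paper. OCP-IP, March 2008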
5 of 16 Motivation
Relevant NoC metrics: throughput, latency, area, power
2D Mesh NoC:
- High throughput
- Low latency
- High communication parallelism
Due to these advantages, some commercial 2D NoCs for ASICs have appeared: Arteris®
How about NoC implementations in FPGAs?
- FPGAs are increasingly used in digital designs
  - Reconfigurable
  - Lower cost than ASICs
- NoC area overhead becomes a problem
  - A 3x3 2D mesh NoC consumed 28.72% of a Xilinx V2P30 [2] (for a maximum throughput of 9.5 Gbps for the complete 3x3 2D NoC)
- The problem is exacerbated on low-capacity, low-cost FPGA devices
[Figure: 3x3 2D mesh NoC with nodes N1 to N9, each node attached to a module; Arteris NoC]
[2] B. Sethuraman, P. Bhattacharya, J. Khan, Ranga Vemuri: LiPaR: A light-weight parallel router for FPGA-based networks-on-chip. ACM Great Lakes Symposium on VLSI 2005:
6 of 16 SCORES - Contributions
SCORES = Scalable Communication Architecture for Reconfigurable Embedded Systems
Main contributions:
- High throughput / bandwidth
  - Circuit switching scheme
- Low area overhead
  - Linear topology
- Multiple clock domains
- Scalability
  - VHDL model with numerous architectural parameters
  - Allows customization for the communication needs of different SoCs
Implemented on a Xilinx Virtex-4 VLX25 FPGA
[Figure: reconfigurable device (FPGA) with Module 1, Module 2, and Module 3 attached to SCORES through interfaces; the modules run on clk1, clk2, and clk3 and SCORES on scores-clk, i.e. different clock domains]
7 of 16 SCORES – Top Level Design
[Figure: reconfigurable device (FPGA) with Module 1, Module 2, and Module 3 (clk1, clk2, clk3), each attached through an interface to a SCORES switch clocked by clk]
SCORES main components:
- Switches: communication nodes inside SCORES
- Interfaces: communication between modules and SCORES
- Channels: communication links between switches and other switches or interfaces
Modules access interfaces through local input ports and local output ports
8 of 16 SCORES – Parametric Architecture
[Figure: Module 1 to Module 4, each attached through a SCORES interface to a switch]
Switch and interface parameters:
- kl: number of left switch channels
- kr: number of right switch channels
- ko: number of local output ports from the interface
- ki: number of local input ports to the interface
Additional parameters:
- N: number of modules
- W: width of a channel in bits
Parameters enable SCORES to conform to custom communication requirements
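The parameter set on this slide can be captured in a small configuration record. The sketch below is only an illustration in Python, not the authors' VHDL model: the class name `ScoresParams`, its field names, the default values, the validation rules, and the `channels_per_switch` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoresParams:
    """Hypothetical record for the SCORES parameters named on the slide."""
    n_modules: int   # N: number of modules
    width_bits: int  # W: width of a channel in bits
    kl: int = 1      # number of left switch channels
    kr: int = 1      # number of right switch channels
    ki: int = 1      # number of local input ports to the interface
    ko: int = 1      # number of local output ports from the interface

    def __post_init__(self):
        # Assumed sanity checks; the slides do not state legal ranges.
        if self.n_modules < 1 or self.width_bits < 1:
            raise ValueError("N and W must be at least 1")
        if min(self.kl, self.kr, self.ki, self.ko) < 0:
            raise ValueError("channel and port counts cannot be negative")

    def channels_per_switch(self) -> int:
        # Total channels and local ports that one switch has to serve.
        return self.kl + self.kr + self.ki + self.ko


# Configuration matching the switch reported in the results slides:
# 32-bit channels, 2 left and 2 right channels, 1 local input, 1 local output.
params = ScoresParams(n_modules=4, width_bits=32, kl=2, kr=2, ki=1, ko=1)
total = params.channels_per_switch()
```

A frozen record mirrors the fact that these values would be fixed at synthesis time rather than changed at run time.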
9 of 16 SCORES – Terminology
[Figure: Module 1 to Module 4 with interfaces; a Streaming Data Channel (SDC) links a producer to a consumer]
- Producer: module that transmits data
- Consumer: module that receives data
- Streaming Data Channel (SDC): dedicated path between a producer and a consumer
  - Dynamically created and destroyed inside SCORES
  - Bidirectional path:
    - Data flows from producer to consumer
    - Control synchronization signals flow from consumer to producer
10 of 16 SCORES – Communication Phases
[Figure: Module 1 to Module 4 with interfaces; a Streaming Data Channel (SDC) links a producer to a consumer]
Three communication phases
Phase I: Channel establishment
- Producer requests a path to the consumer
- The path is created iteratively, switch by switch, between the producer and the consumer
- If a switch has no available channels:
  - It sends a DENY signal to the producer
  - The producer can drop or maintain the request
- If successful, the Streaming Data Channel (SDC) is created between the producer and the consumer
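The channel establishment phase can be sketched as a small behavioral model. This is not the SCORES switch FSM: the class `LinearScores`, its methods, and the return strings are hypothetical, and the model assumes that the kr channels carry data toward higher-numbered switches and the kl channels toward lower-numbered ones, which the slides do not state.

```python
class LinearScores:
    """Toy model of path setup on a linear chain of switches (hypothetical)."""

    def __init__(self, n_switches, kr=1, kl=1):
        # free[+1][i]: free rightward channels on the link between switch i and i+1
        # free[-1][i]: free leftward channels on the same link
        self.free = {+1: [kr] * (n_switches - 1), -1: [kl] * (n_switches - 1)}

    def _route(self, producer, consumer):
        if producer == consumer:
            raise ValueError("producer and consumer must differ")
        if producer < consumer:
            return +1, list(range(producer, consumer))
        # Leftward path: links are visited starting from the producer side.
        return -1, list(range(producer - 1, consumer - 1, -1))

    def establish(self, producer, consumer):
        direction, links = self._route(producer, consumer)
        taken = []
        for link in links:  # the path grows one switch at a time
            if self.free[direction][link] == 0:
                for t in taken:  # undo the partial path before denying
                    self.free[direction][t] += 1
                return "DENY"
            self.free[direction][link] -= 1
            taken.append(link)
        return "SDC"

    def release(self, producer, consumer):
        # Channel release: give back every link of the path.
        direction, links = self._route(producer, consumer)
        for link in links:
            self.free[direction][link] += 1


net = LinearScores(n_switches=4, kr=1, kl=1)
first = net.establish(0, 2)    # uses the rightward channel on links 0 and 1
second = net.establish(1, 3)   # needs link 1 rightward, which is busy
third = net.establish(3, 1)    # leftward channels are still free
net.release(0, 2)
retry = net.establish(1, 3)    # the producer maintained its request
```

In this model a denied request leaves no channels reserved, so the producer can simply retry later, which corresponds to maintaining the request.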
11 of 16 SCORES – Communication Phases
[Figure: Module 1 to Module 4 with interfaces and registers along the Streaming Data Channel (SDC) between producer and consumer]
Phase II: Streaming transmission
- Pipelined operation
- If the consumer buffer is full:
  - The consumer asserts “Full” to tell the producer to pause transmission
- Interfaces are built around asynchronous FIFOs
  - Eases crossing between different clock domains
Phase III: Channel release
- Producer deasserts its request
- The path between the producer and the consumer is iteratively destroyed
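The "Full" backpressure of the streaming phase can be illustrated with a cycle-by-cycle toy model. The function `stream` and its parameters are hypothetical, everything runs in a single clock domain (so the asynchronous FIFO and clock-domain crossing are not modeled), and the consumer is assumed to read one word every `consumer_period` cycles.

```python
from collections import deque


def stream(words, fifo_depth, consumer_period):
    """Send words through a bounded FIFO; return (received, stalls, cycles)."""
    fifo = deque()
    pending = deque(words)
    received = []
    stalled_cycles = 0
    cycle = 0
    while pending or fifo:
        # "Full" is sampled at the start of the cycle, like a registered flag.
        full = len(fifo) >= fifo_depth
        # Consumer side: read one word every consumer_period cycles.
        if fifo and cycle % consumer_period == 0:
            received.append(fifo.popleft())
        # Producer side: send one word per cycle unless Full was asserted.
        if pending:
            if full:
                stalled_cycles += 1
            else:
                fifo.append(pending.popleft())
        cycle += 1
    return received, stalled_cycles, cycle


data = list(range(5))
fast_rx, fast_stalls, _ = stream(data, fifo_depth=2, consumer_period=1)
slow_rx, slow_stalls, _ = stream(data, fifo_depth=2, consumer_period=3)
```

With a consumer that keeps up, the producer never pauses; with a slower consumer, the producer stalls whenever the buffer fills, but the data still arrives complete and in order.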
12 of 16 SCORES – Simultaneous Data Transfers
[Figure: Switch 1 to Switch 4, each with an interface, input registers, and MUXes; free channels are marked]
- A set of FSM controllers runs at each switch
- This allows SCORES to establish and operate multiple SDCs in parallel
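How many SDCs can be active at once depends on how many channels each link between neighboring switches offers. The sketch below estimates the channel counts a given set of simultaneous SDCs would need on a linear topology. The function `required_channels` is hypothetical, and it assumes that kr channels carry data toward higher-numbered switches and kl channels toward lower-numbered ones, which the slides do not state.

```python
from collections import Counter


def required_channels(sdcs):
    """Return (kr, kl) needed to carry all (producer, consumer) pairs at once."""
    right, left = Counter(), Counter()
    for producer, consumer in sdcs:
        if producer < consumer:
            # Link i joins switch i and switch i + 1.
            for link in range(producer, consumer):
                right[link] += 1
        else:
            for link in range(consumer, producer):
                left[link] += 1
    # The busiest link in each direction sets the requirement.
    return max(right.values(), default=0), max(left.values(), default=0)


# Three SDCs on four switches: 0 -> 2 and 1 -> 3 overlap on link 1.
demand = [(0, 2), (1, 3), (3, 0)]
kr_needed, kl_needed = required_channels(demand)
```

For this demand the two rightward SDCs share one link, so two right channels are needed, while the single leftward SDC needs one left channel.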
13 of 16 Results – Clock Frequency
[Figure: three plots of maximum frequency (MHz) versus (1) number of right switch channels (Kr), with 1 left switch channel; (2) number of left and right switch channels (Kr, Kl), with 1 local input and 1 local output port per switch; (3) number of local input and output ports (Ki, Ko) per switch, with 1 left and 1 right switch channel]
- The maximum clock frequency achieved by SCORES determines its maximum throughput
- A customized SCORES switch with 32-bit channels, 2 left and 2 right switch channels, and 1 local input and 1 local output port operates at 254 MHz (throughput = 8.0 Gbps, post place-and-route timing report)
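The link between clock frequency and throughput can be made concrete with a short calculation. It assumes that a channel transfers one W-bit word per clock cycle, which is consistent with pipelined streaming but is not stated on the slide; the function name is hypothetical.

```python
def channel_throughput_gbps(freq_mhz, width_bits):
    """Peak throughput of one channel, assuming one word per clock cycle."""
    return freq_mhz * 1e6 * width_bits / 1e9


# Reported switch: 254 MHz with 32-bit channels.
peak = channel_throughput_gbps(254, 32)
print(f"{peak:.3f} Gbps")
```

Under this assumption the product comes to about 8.13 Gbps, close to the 8.0 Gbps on the slide; the slide does not say how its figure was derived or rounded.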
14 of 16 Results - Area
[Figure: three plots of area (slices) versus (1) number of right switch channels (Kr), with 1 left switch channel; (2) number of left and right switch channels (Kr, Kl), with 1 local input and 1 local output port per switch; (3) number of local input and output ports (Ki, Ko) per switch, with 1 left and 1 right switch channel]
- A customized SCORES switch with 32-bit channels, 2 left and 2 right switch channels, and 1 local input and 1 local output port consumes 315 slices (1.41% of a Virtex-4 VLX25)
15 of 16 Conclusions
We developed SCORES (Scalable Communication Architecture for Reconfigurable Embedded Systems), a highly parametric communication architecture
SCORES contributions:
- Low area overhead (315 slices for a 32-bit switch with multiple ports)
- Modules can run at different and independent clock frequencies
- Highly parametric design, which enables architecture optimization
Future work:
- Optimization of switch FSM controllers
- Development of algorithms for module placement inside SCORES
- Tools for automatic determination of SCORES parameter values
16 of 16 Questions