A Hierarchical C2RTL Framework for FIFO-Connected Stream Applications Shuangchen Li, Yongpan Liu, Daming Zhang, Xinyu He, Pei Zhang, Huazhong Yang

Slides:

Advertisements

Similar presentations

Field Programmable Gate Array

Advertisements

FPGA (Field Programmable Gate Array)

SOC Design: From System to Transistor

Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu.

Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow

ECE-777 System Level Design and Automation Hardware/Software Co-design

ECOE 560 Design Methodologies and Tools for Software/Hardware Systems Spring 2004 Serdar Taşıran.

A reconfigurable system featuring dynamically extensible embedded microprocessor, FPGA, and customizable I/O Borgatti, M. Lertora, F. Foret, B. Cali, L.

1 Advancing Supercomputer Performance Through Interconnection Topology Synthesis Yi Zhu, Michael Taylor, Scott B. Baden and Chung-Kuan Cheng Department.

Graduate Computer Architecture I Lecture 16: FPGA Design.

Graduate Computer Architecture I Lecture 15: Intro to Reconfigurable Devices.

Submission May, 2000 Doc: IEEE / 086 Steven Gray, Nokia Slide Brief Overview of Information Theory and Channel Coding Steven D. Gray 1.

Addressing the System-on-a-Chip Interconnect Woes Through Communication-Based Design N. Vinay Krishnan EE249 Class Presentation.

Zheming CSCE715.  A wireless sensor network (WSN) ◦ Spatially distributed sensors to monitor physical or environmental conditions, and to cooperatively.

11 1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer.

ENGIN112 L38: Programmable Logic December 5, 2003 ENGIN 112 Intro to Electrical and Computer Engineering Lecture 38 Programmable Logic.

Spring 08, Jan 15 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2007 Introduction Vishwani D. Agrawal James J. Danaher.

Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.

Performance and Energy Bounds for Multimedia Applications on Dual-processor Power-aware SoC Platforms Weng-Fai WONG 黄荣辉 Dept. of Computer Science National.

Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.

Tejas Bhatt and Dennis McCain Hardware Prototype Group, NRC/Dallas Matlab as a Development Environment for FPGA Design Tejas Bhatt June 16, 2005.

Trend towards Embedded Multiprocessors Popular Examples –Network processors (Intel, Motorola, etc.) –Graphics (NVIDIA) –Gaming (IBM, Sony, and Toshiba)

Implementation of DSP Algorithm on SoC. Mid-Semester Presentation Student : Einat Tevel Supervisor : Isaschar Walter Accompaning engineer : Emilia Burlak.

CS 151 Digital Systems Design Lecture 38 Programmable Logic.

(1) Introduction © Sudhakar Yalamanchili, Georgia Institute of Technology, 2006.

Reconfigurable Hardware in Wearable Computing Nodes Christian Plessl 1 Rolf Enzler 2 Herbert Walder 1 Jan Beutel 1 Marco Platzner 1 Lothar Thiele 1 1 Computer.

Trigger design engineering tools. Data flow analysis Data flow analysis through the entire Trigger Processor allow us to refine the optimal architecture.

Chap. 1 Overview of Digital Design with Verilog. 2 Overview of Digital Design with Verilog HDL Evolution of computer aided digital circuit design Emergence.

COE4OI5 Engineering Design. Copyright S. Shirani 2 Course Outline Design process, design of digital hardware Programmable logic technology Altera’s UP2.

COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.

Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:

REXAPP Bilal Saqib. REXAPP  Radio EXperimentation And Prototyping Platform Based on NOC  REXAPP Compiler.

Abhishek Pandey Reconfigurable Computing ECE 506.

Robust Low Power VLSI ECE 7502 S2015 Analog and Mixed Signal Test ECE 7502 Class Discussion Christopher Lukas 5 th March 2015.

Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.

CSE 494: Electronic Design Automation Lecture 2 VLSI Design, Physical Design Automation, Design Styles.

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

Hardware/Software Co-design Design of Hardware/Software Systems A Class Presentation for VLSI Course by : Akbar Sharifi Based on the work presented in.

StreamX10: A Stream Programming Framework on X10 Haitao Wei School of Computer Science at Huazhong University of Sci&Tech.

ELEC692/04 course_des 1 ELEC 692 Special Topic VLSI Signal Processing Architecture Fall 2004 Chi-ying Tsui Department of Electrical and Electronic Engineering.

RICE UNIVERSITY “Joint” architecture & algorithm designs for baseband signal processing Sridhar Rajagopal and Joseph R. Cavallaro Rice Center for Multimedia.

ISSS 2001, Montréal1 ISSS’01 S.Derrien, S.Rajopadhye, S.Sur-Kolay* IRISA France *ISI calcutta Combined Instruction and Loop Level Parallelism for Regular.

Dual-Pipeline Heterogeneous ASIP Design Swarnalatha Radhakrishnan, Hui Guo, Sri Parameswaran School of Computer Science & Engineering University of New.

COARSE GRAINED RECONFIGURABLE ARCHITECTURES 04/18/2014 Aditi Sharma Dhiraj Chaudhary Pruthvi Gowda Rachana Raj Sunku DAY

Development of Programmable Architecture for Base-Band Processing S. Leung, A. Postula, Univ. of Queensland, Australia A. Hemani, Royal Institute of Tech.,

1 Presenter: Min Yu,Lo 2015/12/21 Kumar, S.; Jantsch, A.; Soininen, J.-P.; Forsell, M.; Millberg, M.; Oberg, J.; Tiensyrja, K.; Hemani, A. VLSI, 2002.

Teaching The Principles Of System Design, Platform Development and Hardware Acceleration Tim Kranich

TEMPLATE DESIGN © A Comparison-Free Sorting Algorithm Saleh Abdel-hafeez 1 and Ann Gordon-Ross 2 1 Jordan University of.

Approximate Computing on FPGA using Neural Acceleration Presented By: Mikkel Nielsen, Nirvedh Meshram, Shashank Gupta, Kenneth Siu.

A Design Flow for Optimal Circuit Design Using Resource and Timing Estimation Farnaz Gharibian and Kenneth B. Kent {f.gharibian, unb.ca Faculty.

A New Class of High Performance FFTs Dr. J. Greg Nash Centar ( High Performance Embedded Computing (HPEC) Workshop.

HIGH LEVEL SYNTHESIS WITH AREA CONSTRAINTS FOR FPGA DESIGNS: AN EVOLUTIONARY APPROACH Tesi di Laurea di: Christian Pilato Matr.n Relatore: Prof.

Los Alamos National Laboratory Streams-C Maya Gokhale Los Alamos National Laboratory September, 1999.

1 Hierarchical Parallelization of an H.264/AVC Video Encoder A. Rodriguez, A. Gonzalez, and M.P. Malumbres IEEE PARELEC 2006.

Introduction to Intrusion Detection Systems. All incoming packets are filtered for specific characteristics or content Databases have thousands of patterns.

CoDeveloper Overview Updated February 19, Introducing CoDeveloper™  Targeting hardware/software programmable platforms  Target platforms feature.

Optimizing Interconnection Complexity for Realizing Fixed Permutation in Data and Signal Processing Algorithms Ren Chen, Viktor K. Prasanna Ming Hsieh.

SUBJECT : DIGITAL ELECTRONICS CLASS : SEM 3(B) TOPIC : INTRODUCTION OF VHDL.

Marilyn Wolf1 With contributions from:

System-on-Chip Design

Programmable Hardware: Hardware or Software?

Dynamo: A Runtime Codesign Environment

ELEC 7770 Advanced VLSI Design Spring 2016 Introduction

FPGAs in AWS and First Use Cases, Kees Vissers

C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs Shuo Wang1, Zhe Li2, Caiwen Ding2, Bo Yuan3, Qinru Qiu2, Yanzhi Wang2,

Anne Pratoomtong ECE734, Spring2002

Matlab as a Development Environment for FPGA Design

ELEC 7770 Advanced VLSI Design Spring 2012 Introduction

ELEC 7770 Advanced VLSI Design Spring 2010 Introduction

HIGH LEVEL SYNTHESIS.

Presentation transcript:

A Hierarchical C2RTL Framework for FIFO-Connected Stream Applications Shuangchen Li, Yongpan Liu, Daming Zhang, Xinyu He, Pei Zhang, Huazhong Yang 2012-Jan-30

Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 2

Background A rapid rising demand for the high quality C language to RTL (C2RTL) tools [1] –Design gap between design productivity and transistor resources –Huge silicon capacity –Extensive use of embedded processors –Reuse of behavior IPs –Extensive adoption of accelerators –More time-to-market pressure 3

Background Designers have successfully developed various applications using C2RTL tools with much shorter design time.[2]-[4] –F–Focusing on Stream applications GAUT[5], ROCCC[6], Impulse C[7], ASC[8] –F–Focusing on general purpose applications Catapult C[9] Little control on the architecture leads to suboptimal results.[10] Hierarchical C2RTL framework Leaving the user to determine the FIFO capacity between modules. Algorithm for FIFO-connected blocks 4

To design a stream application, researchers also had investigated on FIFO sizing.[11]-[14] All of them adopts a more complicated behavior model for PE streams, which is not necessary in the hierarchical C2RTL framework. Our Hierarchical C2RTL Framework for FIFO-Connected Stream Applications: –The hierarchical implementation provides even10.43 times speedup compared to the flatten design. –We develop an algorithm to find the optimal FIFO capacity in a multiple-module design. –We demonstrate the proposed method in seven real applications. 5

Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 6

Motivation Hierarchical vs Flatten Approach –The synthesized quality for larger algorithms is generally not so good as small ones. –The translating time is unacceptable. –10.43 times performance speedup can be made by JPEG encoder case. 7 Performance with Different FIFO Capacity –FIFO capacity affect both speed and area. –Designers need to decide the FIFO size in an iterative way which is time consuming.

Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 8

Hierarchical C2RTL framework Step 1: We partition C codes into appropriate-size functions. Step 2: We use C2RTL tools to translate each function into a hardware process element (PE), which has a FIFO interface. Step 3: We connect those PEs with proper sized FIFOs. 9

Hierarchical C2RTL framework Step 1: −The C code partition has a great impact on the hardware performance. − A example of GSM case is shown in the figure. −Currently, we use a manual partition strategy. −An integer linear programming based partition approach is presented in [16]. 10

Hierarchical C2RTL framework Step 2: −We use the C2RTL tool of eXCite from Y Exploration, Inc[15] Step 3: −We define the FIFO interface and Modeling the FIFO −Define the PE interface Modeling PEs. 11

Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 12

Given a design consisting of N PEs, we need to determine the depth D (i-1)i of each FIFO, which maximizes the throughput TH all and minimizes the area A all. That is: Without losing generality, we set TH ref =TH best and A ref =∞. Algorithm for FIFO-connected blocks 13 FIFO Interconnection Formulation:

Algorithm for FIFO-connected blocks Optimal FIFO Capacity for Two PEs: –For PE 1 and PE 2, we assume that FIFO 01 never empty and FIFO 23 never full, and then we calculate FIFO 12 ’s capacity. 14

Algorithm for FIFO-connected blocks Step 1: –Find out the bottleneck –Talk about later Step 2: –FIFO sizing for two block –Optimal FIFO Capacity for Two PEs theory Step 3: –Merging two PEs into a new one –Talk about later 15 FIFO Capacity for Multiple Blocks:

Step 1(Find out bottleneck): −We set TH n as the throughput of the system consisting of PE 1 to PE n. We have a recursive equation As TH all =TH N, we can express TH all : Step 3(Merging two PEs): −We give out the equations to compute the interface parameters for the merged PE Algorithm for FIFO-connected blocks 16 FIFO Capacity for Multiple Blocks:

Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 17

Experiments Experimental Configurations –eXCite + ModelSim + Quartus –Seven benchmarks from real applications. They are: JPEG encode/decode AES encryption/decryption GSM ADPCM Filter Group The benefit of optimize FIFO capacity algorithm: –Area saving –Timing efficiency –The average time cost by the entire exploration time is N *log2(p) *C, using a binary searching algorithm 18

Experiments Flatten vs Hierarchical Approach: 19

Experiments Optimal FIFO Capacity: 20 −Compared our results with exhaustive simulations under random inputs. −Lists both the analytical results and the experimental ones on FIFO sizing in seven real cases.

Future works The automatically C code partition Using our algorithm in more complex architectures with feedback and branches 21

Reference [1] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, “High-level synthesis for fpgas: From prototyping to deployment,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans-actions on, vol. 30, no. 4, pp. 473–491, [2] B. Schafer, A. Trambadia, and K. Wakabayashi, “Design of complex image processing systems in esl,” in Asia and South Pacific Design Automation Conference. IEEE Press, 2010, pp. 809–814. [3] Y. Guo and D. McCain, “Rapid prototyping and vlsi exploration for 3g/4g mimo wireless systems using integrated catapult-c methodology,” in Wireless Communications and Networking Conference. IEEE, 2006, pp. 958–963. [4] M. Rossler, H. Wang, U. Heinkel, N. Engin, and W. Drescher, “Rapid prototyping of a dvb-sh turbo decoder using high-level-synthesis,” in Specification & Design Languages. IEEE Forum on, 2009, pp. 1–6. [5] G. Lhairech-Lebreton, P. Coussy, and E. Martin, “Hierarchical and multiple-clock domain high-level synthesis for low-power design on fpga,” in International Conference on Field Programmable Logic and Applications. IEEE, 2010, pp. 464–468. [6] J. Villarreal, A. Park, W. Najjar, and R. Halstead, “Designing mod-ular hardware accelerators in c with roccc 2.0,” in IEEE Annual International Symposium on Field- Programmable Custom Computing Machines. IEEE, 2010, pp. 127–

Reference [7] M. Gokhale, J. Stone, J. Arnold, and M. Kalinowski, “Stream-oriented fpga computing in the streams-c high level language,” in Field-Programmable Custom Computing Machines. IEEE Symposium, 2000, pp. 49–56. [8] O. Mencer, “Asc: a stream compiler for computing with fpgas,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans-actions on, vol. 25, no. 9, pp. 1603–1617, [9] “ [10] A. Agarwal, “Comparison of high level design methodologies for algorithmic ips: Bluespec and c-based synthesis,” Ph.D. dissertation, Massachusetts Institute of Technology, [11] K. Lahiri, A. Raghunathan, and S. Dey, “System-level performance anal-ysis for designing on-chip communication architectures,” in Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 20, no. 6, 2001, pp. 768– 783. [12] R. Cruz, “Quality of service guarantees in virtual circuit switched networks,” in Selected Areas in Communications, IEEE Journal on, vol. 13, no. 6, 1995, pp. 1048– [13] A. Maxiaguine, S. K ¨unzli, S. Chakraborty, and L. Thiele, “Rate analysis for streaming applications with on-chip buffer constraints,” in Asia and South Pacific Design Automation Conference. IEEE Press, 2004, pp. 131–

Reference [14] Y. Liu, S. Chakraborty, and R. Marculescu, “Generalized rate analysis for media- processing platforms,” in International Conference on Em-bedded and Real-Time Computing Systems and Applications, RTCSA. IEEE, 2006, pp. 305–314. [15] “ [16] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, “Partitioning of behavioral descriptions with exploiting function-level parallelism,” in Fundamentals,IEICE Trans on, vol. E93-A, 2010, pp. 488–499. [17] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, “Chstone: A benchmark program suite for practical c-based high-level synthesis,” in Circuits and Systems. IEEE International Symposium on, 2008, pp. 1192–

Thank you !