Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

Presentation transcript:

Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University, Fukuoka, Japan *Amirkabir University of Technology

2 Outline
- Introduction
- Problem Definition and Basic Concepts
- Hybrid DSE Approach for Designing RAC
- Case Study: Designing RAC for an Extensible Processor
- Conclusion


4 Designing Embedded Systems
- Embedded Microprocessors
- Application-Specific Integrated Circuits (ASICs)
- Application-Specific Instruction set Processors (ASIPs)
- Extensible Processors
[Figure: CPU block diagram with an instruction dispatcher, a register file, functional units (+, &, x), a load/store unit (LD/ST) and two custom functional units (CFU1, CFU2)]

5 Extensible Processors
- Goals
  - Improving performance and energy efficiency
  - Maintaining compatibility and flexibility
- Approach: hardware is added to the base processor to accelerate frequently executed portions of applications
- Accelerator implementations
  - Custom hardware (such as ASIPs or extensible processors)
  - Reconfigurable fine-grain or coarse-grain hardware
[Figure: CPU block diagram with an instruction dispatcher, a register file, functional units (+, &, x), a load/store unit (LD/ST) and two custom functional units (CFU1, CFU2)]

6 Custom Instructions
- Instruction set customization -> hardware/software partitioning (identifying critical segments in applications)
- Custom Instructions (CIs) are extracted from critical segments of an application and executed on a Custom Functional Unit (CFU)
- Critical segments -> the most frequently executed portions of the applications
- A CI can be represented as a data flow graph (DFG)
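Since later slides classify DFGs by width and height, a minimal sketch of how a CI's DFG could be stored and measured may help. The slides do not define width and height precisely; the sketch assumes ASAP levelling, with height as the number of levels and width as the largest level. The function name `dfg_shape` and the four-operation example are hypothetical.

```python
from collections import defaultdict


def dfg_shape(nodes, edges):
    """Return (width, height) of a DFG under ASAP levelling.

    Assumption (not from the slides): height is the number of levels on the
    longest dependency chain, width is the largest number of nodes in a level.
    """
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    level = {}

    def visit(node):
        # A node sits one level below its deepest predecessor; sources are level 0.
        if node not in level:
            level[node] = 1 + max((visit(p) for p in preds[node]), default=-1)
        return level[node]

    per_level = defaultdict(int)
    for node in nodes:
        per_level[visit(node)] += 1

    return max(per_level.values()), max(per_level) + 1


# Hypothetical four-operation CI: one shift feeds two operations whose
# results are added together.
nodes = ["sll", "sra", "srl", "addu"]
edges = [("sll", "sra"), ("sll", "srl"), ("sra", "addu"), ("srl", "addu")]
width, height = dfg_shape(nodes, edges)
```

Under these assumptions the example DFG has three levels, with two operations in the middle one.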

7 Reconfigurable Processors
- Adding and generating custom instructions after fabrication
- Using a reconfigurable functional unit (RFU) instead of a custom functional unit
[Figure: CPU block diagram with an instruction dispatcher, a register file, functional units (+, &, x), a load/store unit (LD/ST), custom functional units (CFU1, CFU2), and an RFU with its configuration memory]

8 How a Reconfigurable Processor Works
- GPP: General Purpose Processor
- RAC: Reconfigurable Accelerator
[Figure: a MIPS assembly listing (subiu, lbu, sll, sra, addiu, srl, addu, lw, xori, bgez) in which a hot basic block is marked, shown alongside the reconfigurable processor]

9 Outline
- Introduction
- Problem Definition and Basic Concepts
- Hybrid DSE Approach for Designing RAC
- Case Study: Designing RAC for an Extensible Processor
- Conclusion

10 Designing a Reconfigurable Accelerator (RAC)
- Multitude of design parameters, e.g. the number of functional units, the number of input/output ports, and the types of functional units
- Many design parameters -> high complexity of the RAC design -> need for a methodical approach
- A major challenge: finding the right balance between the different quality requirements (e.g. speedup, area, energy consumption)

11 Traditional Design Process
- Describing a reference model
- Verifying the model's functional correctness
- Obtaining rough estimates of performance
- Manual or semi-automatic generation of several alternative designs
- Choosing the most suitable design based on various performance metrics

12 Design Space Exploration
- Design Space Exploration (DSE): the process of analyzing several "functionally equivalent" implementation alternatives to identify an optimal solution
- Can become too computationally expensive
- Example: mapping four tasks onto an architecture with three processing modules, each with four possible configurations, results in 500 design alternatives
- Our approach: a hybrid (analytical and quantitative) approach that drastically reduces design time and effort

13 Assumptions
- RAC
  - A matrix of FUs with width w and height h
  - Is basically combinational logic
  - FUs in the RAC are fully connected, except that there are no connections from lower rows to upper rows
- Basic elements
  - Functional units (logic resources)
  - Multiplexers (routing resources)
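The connectivity assumption above determines how large the routing multiplexers become. The following is only a sketch under extra assumptions that the slides do not state: each FU input in row r can select any of the RAC's primary inputs or the output of any FU in rows above r, and there are no connections within a row. The function name and the parameter `n_primary_inputs` are hypothetical.

```python
def mux_inputs_per_row(width, height, n_primary_inputs):
    """Number of selectable sources for one FU input in each row.

    Assumption: row r (0 = top) sees the primary inputs plus the outputs of
    all FUs in rows 0..r-1; nothing flows from lower rows to upper rows.
    """
    return [n_primary_inputs + width * r for r in range(height)]


# A 3-wide, 3-high RAC with 4 primary inputs (illustrative numbers only).
sizes = mux_inputs_per_row(3, 3, 4)
```

Under this reading, mux size grows with both the RAC's width and the row's depth, which is why routing resources matter as much as logic resources when sizing the matrix.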

14 Assumptions CIs (DFGs) are mapped onto the RAC and executed at runtime

15 Outline
- Introduction
- Problem Definition and Basic Concepts
- Hybrid DSE Approach for Designing RAC
- Case Study: Designing a RAC for an Extensible Processor
- Conclusion

16 The Problem
- Determining a RAC specification while optimizing speedup and area
- Design parameters
  - The RAC's width and height
  - The number of FUs
  - The number of input and output ports
- Speedup

17 Design Methodology: Tool Chain
- Our approach: a hybrid (analytical and quantitative) DSE approach

18 Problem Formulation
- The fraction of all DFGs with width i and height j, denoted DFG(i,j). Example: the DFGs with width 4 and height 3 account for 7% of the execution time
- The average number of instructions in all DFG(i,j)s
- The number of clock cycles (CCs) required to execute DFG(i,j) on the base processor

19 Problem Formulation
- When one or both dimensions of a DFG are greater than the RAC's dimensions:
  - Temporal partitioning: divides a DFG into smaller, time-exclusive DFGs
  - Reconfiguration overhead: the time for loading subsequent partitions of a DFG from the configuration memory onto the RAC
- The number of clock cycles for executing DFG(i,j) on a RAC(w,h)
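The transcript does not preserve the formula for the clock-cycle count, so the following is only an illustrative sketch, not the authors' model. It assumes that an oversized DFG is cut into a grid of RAC-sized tiles, that each partition executes in a fixed number of cycles, and that every partition after the first pays the reconfiguration overhead. All function and parameter names are hypothetical.

```python
import math


def num_partitions(i, j, w, h):
    """Temporal partitions needed for a DFG of width i and height j on RAC(w, h).

    Assumption: the DFG is tiled into ceil(i/w) x ceil(j/h) pieces; real
    temporal partitioning must also respect data dependencies.
    """
    return math.ceil(i / w) * math.ceil(j / h)


def rac_cycles(i, j, w, h, cycles_per_partition=1, reconfig_cycles=1):
    """Clock cycles to run DFG(i, j): execution plus reconfiguration overhead."""
    p = num_partitions(i, j, w, h)
    return p * cycles_per_partition + (p - 1) * reconfig_cycles


fits = rac_cycles(4, 3, 6, 5)       # DFG fits: one partition, no overhead
too_wide = rac_cycles(8, 3, 6, 5)   # two partitions, one reconfiguration
```

A DFG that fits entirely inside the RAC pays no reconfiguration cost, while each additional partition adds both an execution step and a reload.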

20 Problem Formulation

21 Delay of RAC
- = delay of mux( to 1)
- Delay of FU(k,i)
- Delay of MUX(k,i)
- Delay of RAC(w,h)

22 Optimization Problem Objective: Maximize( )

23 Area of RAC
- = delay of mux( to 1)
- Area is a secondary optimization parameter
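With speedup as the primary objective and area as the secondary one, the exploration itself can be sketched as an exhaustive search over candidate (w, h) pairs. The transcript does not preserve the speedup and area equations, so the two model functions below are toy placeholders, not the presentation's formulas, and all names are hypothetical.

```python
from itertools import product


def explore(widths, heights, speedup, area):
    """Exhaustively search (w, h): maximise speedup, break ties by smaller area."""
    best_key, best_point = None, None
    for w, h in product(widths, heights):
        # Negating speedup lets one tuple comparison rank both objectives.
        key = (-speedup(w, h), area(w, h))
        if best_key is None or key < best_key:
            best_key, best_point = key, (w, h)
    return best_point


# Toy placeholder models, NOT the presentation's equations.
def toy_speedup(w, h):
    return min(w, 6)  # saturates once extra width stops helping


def toy_area(w, h):
    return w * h  # one FU per matrix cell


best = explore(range(1, 9), range(1, 5), toy_speedup, toy_area)
```

Because the models are closed-form functions of (w, h), each candidate is evaluated without synthesizing or simulating a new design, which is the kind of saving an analytical step can offer.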

24 Outline
- Introduction
- Problem Definition and Basic Concepts
- Hybrid DSE Approach for Designing RAC
- Case Study: Designing a RAC for an Extensible Processor
- Conclusion

25 Designing RAC for a Reconfigurable Processor
- AMBER: a reconfigurable processor targeted at embedded systems
- Main components
  - A base processor (general RISC processor)
  - A sequencer
  - A coarse-grained reconfigurable functional unit (RFU)
[Figure: AMBER's architecture]

26 Quantitative Approach
- 22 MiBench applications were used
- The applications were executed on SimpleScalar and profiled
- Hot segments and DFGs were extracted from the applications
- No limitations were placed on the RFU's initial architecture
- DFGs were mapped onto the initial RFU architecture
[Table: specification of the designed architecture for AMBER's RFU (quantitative approach)]

27 Hybrid DSE Approach
- An FU and multiplexers of various sizes were synthesized using a Hitachi 0.18 μm process to measure delay and area
- DFGs are analyzed to extract the required information quantitatively
- Reconfiguration penalty time: 1 clock cycle
- Base processor clock frequency: 166 MHz

28 Speedup Evaluation
- Increasing the RAC's width -> more parallelism
- For widths larger than 6: the growing number and size of the muxes has a negative effect, so no more speedup is achievable
- Increasing the RAC's height -> longer delay; speedup declines and area increases

29 Effect of the Base Processor Clock Frequency
- Increasing the clock frequency -> reduction in the maximum achievable speedup
- No more speedup at clock frequencies above 450 MHz
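One plausible reading of this trend, offered as an assumption rather than something the slides state, is that the RAC is combinational, so its delay in nanoseconds is fixed while the base processor's clock period shrinks. The same DFG then occupies more clock cycles on the RAC at a higher frequency. The 10 ns delay below is a hypothetical value, not a measured one.

```python
import math


def cycles_for_delay(rac_delay_ns, clock_mhz):
    """Clock cycles needed to cover a fixed combinational delay."""
    period_ns = 1000.0 / clock_mhz  # e.g. about 6.02 ns at 166 MHz
    return math.ceil(rac_delay_ns / period_ns)


# Hypothetical 10 ns RAC delay at the two frequencies mentioned in the slides.
at_166 = cycles_for_delay(10.0, 166)
at_450 = cycles_for_delay(10.0, 450)
```

With these illustrative numbers the RAC needs 2 cycles at 166 MHz but 5 cycles at 450 MHz, while the base processor's own cycle count is unchanged, so the ratio between the two shrinks.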

30 Effect of the Reconfiguration Overhead Time
- As the reconfiguration overhead time increases:
  - The maximum achievable speedup degrades
  - The height of the RAC grows, so longer DFGs can be mapped

31 Comparison
                | Design Method            | Design Time & Effort | Basic Design Parameters           | Flexibility*
Clark et al.    | Quantitative & Synthesis | High                 | Mapping rate                      | Low
Ours (previous) | Quantitative             | High                 | Mapping rate                      | Low
Yehia et al.    | Synthesis & Simulation   | Very High            | No. of operations, inputs/outputs | Low
Ours (current)  | Hybrid                   | Low                  | Speedup & Area                    | High

32 Outline
- Introduction
- Problem Definition and Basic Concepts
- Hybrid DSE Approach for Designing RAC
- Case Study: Designing a RAC for an Extensible Processor
- Conclusion

33 Conclusion
- Hybrid DSE approach
  - Uses realistic data from the profiled applications
  - Substantially reduces design time and designer effort
  - Can be used for shrinking a large design space
  - Easily extended to cover new design parameters
  - Better suited to cases where new applications are added

Thanks for your attention!