High Performance, Low Power Reconfigurable Processor for Embedded Systems Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami Kyushu University,


High Performance, Low Power Reconfigurable Processor for Embedded Systems Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

2/38 Oct 16, 2007 Kyushu University, ISOCC Korea
Outline
- Introduction
- Motivations for supporting Control Instructions
- Basic Requirements for Supporting Conditional Execution
- Algorithms for CDFG Temporal Partitioning
- Case study: Extending an Extensible Processor to Support Conditional Execution
- Experimental Results
- Conclusion

4/38 Introduction
Designing Embedded Systems
- Embedded Microprocessors
- Application Specific Integrated Circuits (ASICs)
- Application Specific Instruction set Processors (ASIPs)
- Extensible Processors
[Figure: CPU block diagram with instruction dispatcher, register file, functional units (+, &, x), LD/ST, CFU1 and CFU2]
LD/ST: Load / Store; CFU: Custom Functional Unit

5/38 Goal
Improving the performance and energy efficiency of embedded processors while maintaining compatibility and flexibility.

6/38 Extensible Processors
- Enhancing the performance of a processor in embedded systems
  - Using an accelerator to speed up frequently executed portions of applications
- Accelerator implementations
  - Reconfigurable fine/coarse grain hardware
  - Custom hardware (such as ASIP or Extensible Processors)
[Figure: AND and OR operations merged into a single AND2_OR operation; CPU block diagram with instruction dispatcher, register file, functional units (+, &, x), LD/ST, CFU1 and CFU2]
LD/ST: Load / Store; CFU: Custom Functional Unit

7/38 Custom Instructions
- Critical segments: the most frequently executed portions of the applications
- Instruction set customization: hardware/software partitioning
- Custom Instructions (CIs) are extracted from critical segments of an application and executed on a Custom Functional Unit (CFU)
- A CI is represented as a Data Flow Graph (DFG)
- CI or DFG generation levels: high level or binary level (DFG nodes are instruction-level operations)

8/38 Adaptive Extensible Processors
- Issues of Extensible Processors
  - High NRE (Non-Recurring Engineering) and manufacturing costs
  - Long time-to-market
- Adaptive Extensible Processor
  - Adding and generating custom instructions after fabrication
  - Using a reconfigurable functional unit (RFU) instead of a custom functional unit
[Figure: CPU block diagram with instruction dispatcher, register file, functional units (+, &, x), LD/ST, CFU1, CFU2, and an RFU with its configuration memory]
CFU: Custom Functional Unit; RFU: Reconfigurable Functional Unit

9/38 General Overview of the Proposed Architecture
[Figure: GPP datapath (register file, ID/EXE register, ALU, MUX, EXE/MEM register) augmented with an RFU and a configuration memory, shown alongside a hot basic block of instructions (subiu, lbu, sll, sra, addiu, srl, addu, lw, xori, bgez)]
GPP: General Purpose Processor; RFU: Reconfigurable Functional Unit

11/38 Control Data Flow Graph: Definition
- CDFG: a DFG containing control instructions (e.g. branch instructions)
- The sequence of execution changes according to the result of a branch instruction
- Types of CDFGs:
  - CDFGs containing at most one branch instruction (as the last instruction): the accelerator does not need to support conditional execution
  - CDFGs containing more than one branch instruction: the accelerator should support conditional execution
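The two CDFG types above can be told apart mechanically. The following is a minimal sketch, not the authors' tool: the `Instr` type, the `BRANCH_OPCODES` set (taken from the opcodes that appear in the example slides) and the function name are hypothetical. The slide does not say how to treat a single branch that is not the last instruction; the sketch conservatively assumes it also needs conditional execution.

```python
from dataclasses import dataclass

# Hypothetical set of branch opcodes, limited to those shown in the slides' examples.
BRANCH_OPCODES = {"beq", "bne", "bgez"}


@dataclass
class Instr:
    opcode: str


def needs_conditional_execution(instrs):
    """Return True if the instruction sequence needs conditional-execution support.

    A sequence with no branch, or with exactly one branch as its last
    instruction, can run on an accelerator without conditional execution.
    Assumption (not from the slides): a lone branch in any other position
    is treated as requiring support.
    """
    branch_positions = [i for i, ins in enumerate(instrs)
                        if ins.opcode in BRANCH_OPCODES]
    if not branch_positions:
        return False
    if len(branch_positions) == 1 and branch_positions[0] == len(instrs) - 1:
        return False
    return True


if __name__ == "__main__":
    plain = [Instr("addu"), Instr("sll"), Instr("beq")]
    multi = [Instr("addu"), Instr("beq"), Instr("xori"), Instr("bne")]
    print(needs_conditional_execution(plain))  # False
    print(needs_conditional_execution(multi))  # True
```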

12/38 Why Support Control Instructions? Motivations (1/2)
- Quantitative analysis approach using applications from MiBench
- DFG extraction process
- Short-distance control instructions lead to small-size DFGs (SSDFGs)
- SSDFGs do not offer noticeable speedup, so they have to be run on the base processor

13/38 Why Support Control Instructions? Motivations (2/2)
- Analysis of 17 MiBench applications
- bitcount
  - Almost 92% of the application is hot
  - 32% out of the 92% of hot portions are not worth accelerating because of SSDFGs
- fft, fft(inv) and sha
  - Include few branch instructions
  - Supporting conditional execution results in no considerable speedup

15/38 Basic Requirements
- DFG nodes receive their input from a single source
- CDFG nodes can have multiple sources
- The correct source is selected at run time according to the results of branches

16/38 Basic Requirements
- Each node must be able to receive its inputs selectively from both the accelerator's primary inputs and the outputs of other instructions (FUs)
- The valid outputs must be selectable from the several outputs generated by the accelerator, according to the conditions produced by branch instructions
- The accelerator should be equipped with a control path that provides the correct selection of inputs and outputs for each FU and for the entire accelerator
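To make the run-time source selection concrete, here is a small behavioural sketch. It is only an illustration under assumptions: the function names and the tiny example program are hypothetical, and both paths are computed before a one-bit branch result picks the valid value, which is one common way to model such selection.

```python
def mux(select, if_true, if_false):
    """Two-input multiplexer driven by a one-bit condition."""
    return if_true if select else if_false


def accelerate(a, b):
    """Behavioural model of a tiny CDFG (hypothetical example):

        if a == b: r = a << 1     # taken path
        else:      r = a + b      # not-taken path
        out = r ^ 1

    The xor node has two possible sources for r; the branch result
    selects the valid one at run time.
    """
    cond = (a == b)              # FU executing the branch yields one bit
    r_taken = a << 1             # candidate source on the taken path
    r_not_taken = a + b          # candidate source on the not-taken path
    r = mux(cond, r_taken, r_not_taken)
    return r ^ 1


if __name__ == "__main__":
    print(accelerate(2, 2))  # 5
    print(accelerate(2, 3))  # 4
```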

17/38 CDFG Generation: Example (1/3)
[Figure: basic blocks BB1 to BB4 containing numbered instructions, including the branches beq, bgez and bne and a load]

18/38 CDFG Generation: Example (2/3)
[Figure: basic blocks BB1 to BB4 next to the corresponding CDFG, with branch nodes beq, bgez and bne and exit points exit1 to exit4]

19/38 CDFG Generation: Example (3/3)
[Figure: CDFG with branch nodes beq, bgez and bne and exit points exit1 to exit4]

21/38 CDFG Temporal Partitioning Algorithms
- CDFGs extracted from various applications have different sizes
- Some CDFGs cannot be mapped onto the accelerator as a whole because of the accelerator's resource limitations
- Resource constraints: the number of inputs and outputs, and the logic and routing resources

22/38 CDFG Temporal Partitioning Algorithms
- Temporal partitioning
  - Temporally divides a DFG into a number of smaller partitions
  - Each partition can fit into the target hardware
  - Dependencies among the nodes are not violated
- Temporal partitioning algorithms for CDFGs
  - Not-Taken Path Traversing algorithm (NTPT)
  - Frequency-based TP algorithm
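The two properties of temporal partitioning (each partition fits the hardware, dependencies are not violated) can be illustrated with a deliberately simple sketch. It is not one of the algorithms from the talk: it handles a plain DFG, and the only resource constraint modelled is a hypothetical `max_nodes` limit per partition, whereas the slides also list inputs, outputs and routing.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def temporal_partition(deps, max_nodes):
    """Split a DFG into partitions of at most max_nodes nodes.

    deps maps each node to the set of nodes it depends on.
    Nodes are taken in topological order and cut into consecutive chunks,
    so every predecessor lands in the same or an earlier partition.
    """
    order = list(TopologicalSorter(deps).static_order())
    return [order[i:i + max_nodes] for i in range(0, len(order), max_nodes)]


if __name__ == "__main__":
    # Hypothetical five-node DFG: c needs a and b; d and e need c.
    dfg = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": {"c"}}
    for index, partition in enumerate(temporal_partition(dfg, 2)):
        print(index, partition)
```

Chunking a topological order is the simplest way to keep dependencies intact; a real partitioner would also try to minimise the number of partitions and the data passed between them.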

23/38 TP Based on Not-Taken Paths (NTPT)
- Adds instructions from the not-taken path of a control instruction to a partition until the target hardware's architectural constraints would be violated or a terminator control instruction is reached
- A new partition starts from the branch instructions for which at least one of the taken or not-taken instructions has not been placed in the current partition
- Terminator instruction: an instruction that changes the execution direction of the program; it is an exit point for a CDFG
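The traversal described above can be sketched roughly as follows. This is a simplified reading, not the authors' implementation: the CDFG encoding (a dictionary with hypothetical `next`, `taken` and `terminator` fields), the single `max_nodes` capacity limit, and the choice to start later partitions at the unplaced successor rather than at the branch itself are all assumptions made for illustration.

```python
from collections import deque


def ntpt_partition(cdfg, entry, max_nodes):
    """Simplified not-taken-path traversal.

    cdfg maps an instruction id to a dict with optional keys:
      "next"       - fall-through (not-taken) successor
      "taken"      - branch target, for branch instructions
      "terminator" - True if the instruction is an exit point
    """
    placed = set()
    partitions = []
    pending = deque([entry])          # start points of future partitions
    while pending:
        node = pending.popleft()
        part = []
        # Follow the not-taken path until capacity or a terminator is hit.
        while node is not None and node not in placed and len(part) < max_nodes:
            info = cdfg[node]
            part.append(node)
            placed.add(node)
            if info.get("taken") is not None:
                pending.append(info["taken"])   # taken path is visited later
            node = None if info.get("terminator") else info.get("next")
        if node is not None and node not in placed:
            pending.appendleft(node)  # capacity reached: resume this path next
        if part:
            partitions.append(part)
    return partitions


if __name__ == "__main__":
    # Hypothetical CDFG: instruction 2 is a branch, 4 and 6 are exit points.
    example = {
        1: {"next": 2},
        2: {"next": 3, "taken": 5},
        3: {"next": 4},
        4: {"terminator": True},
        5: {"next": 6},
        6: {"terminator": True},
    }
    print(ntpt_partition(example, 1, 3))  # [[1, 2, 3], [4], [5, 6]]
    print(ntpt_partition(example, 1, 4))  # [[1, 2, 3, 4], [5, 6]]
```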

24/38 TP Based on Not-Taken Paths (NTPT)
[Figure: example CDFG with branch nodes beq, bgez and bne, and the partitions produced by the NTPT algorithm]

25/38 Frequency-Based TP Algorithm
- NTPT algorithm: instructions are selected only from the not-taken paths of branches
- Frequency-based algorithm
  - The execution frequency of the taken and not-taken paths is a factor in deciding whether to add them to the current partition
  - A frequency threshold determines whether an instruction is critical
  - Either the taken path, the not-taken path, or both can be critical
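The threshold test at the heart of the frequency-based algorithm can be sketched in a few lines. The function name, the use of integer percentages and the example threshold values are assumptions for illustration; the slides do not state the threshold the authors used.

```python
def critical_paths(taken_pct, threshold_pct):
    """Return which successor paths of a branch count as critical.

    taken_pct is the percentage of executions in which the branch is taken;
    the not-taken percentage is the remainder. A path is critical when its
    frequency reaches threshold_pct, so zero, one or both paths may qualify.
    """
    not_taken_pct = 100 - taken_pct
    paths = []
    if taken_pct >= threshold_pct:
        paths.append("taken")
    if not_taken_pct >= threshold_pct:
        paths.append("not_taken")
    return paths


if __name__ == "__main__":
    print(critical_paths(90, 30))  # ['taken']
    print(critical_paths(20, 30))  # ['not_taken']
    print(critical_paths(50, 40))  # ['taken', 'not_taken']
```

Unlike NTPT, which always follows the not-taken path, a partitioner built on this test would follow whichever paths are critical when extending the current partition.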

26/38 TP Based on Not-Taken Paths (NTPT)
[Figure: example CDFG with branch nodes beq, bgez and bne, annotated with path execution frequencies (90%/10% and 80%/20%)]

27/38 Evaluating Proposed Algorithms
[Figure: charts of average partition number and efficiency]

29/38 Case study: Extending an Extensible Processor to Support Conditional Execution
- AMBER: an extensible processor targeted at embedded systems
- Goal: accelerating application execution
- Base processor: a general RISC processor
- Includes a profiler, a sequencer and a coarse-grain RFU
- RFU: an accelerator without conditional execution support

30/38 Extending AMBER's RFU to Support Conditional Execution (CRFU)
- The selectors of the muxes, which choose the data for FU inputs, are controlled by the configuration bits
- The outputs of FUs are applied only to the Selector-Muxes in the lower-level rows

31/38 CRFU
- Inputs of the Selector-Mux (one-bit width) originate from the FUs executing branch instructions
[Figure labels: Data; Selector-Mux; selection using branch results; selection using configuration bits]
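One way to read the two selection modes is as a two-level multiplexer: a data mux picks the FU input, and its select signal comes either directly from configuration bits or from a Selector-Mux that forwards a one-bit branch result. The sketch below follows that reading; the configuration dictionary layout and all names are hypothetical and are not taken from the AMBER design.

```python
def selector_mux(branch_bits, branch_index):
    """Forward the one-bit result of the chosen branch FU."""
    return branch_bits[branch_index]


def fu_input(candidates, branch_bits, config):
    """Pick the data value that feeds an FU input.

    candidates  - two candidate data values, indexed by the select bit
    branch_bits - one-bit results (0 or 1) from FUs executing branches
    config      - hypothetical configuration word:
                  {"mode": "static", "select": 0 or 1}   select fixed by config bits
                  {"mode": "branch", "branch_index": i}  select driven by branch i
    """
    if config["mode"] == "branch":
        select = selector_mux(branch_bits, config["branch_index"])
    else:
        select = config["select"]
    return candidates[select]


if __name__ == "__main__":
    data = (10, 20)
    branches = [0, 1]   # results of two branch FUs
    print(fu_input(data, branches, {"mode": "static", "select": 0}))        # 10
    print(fu_input(data, branches, {"mode": "branch", "branch_index": 1}))  # 20
```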

33/38 Experimental Results
- Synthesis result of the extended RFU
  - Synopsys tools and Hitachi 0.18 μm
  - Area: 2.1 mm^2
- CDFG configuration bit-stream: 615 bits
- Control signals: 375 bits
- Executing applications on SimpleScalar as the ISS: profiling and extracting hot instruction sequences
- NTPT temporal partitioning algorithm for generating mappable CDFGs

34/38 Experimental Results: Speedup
Average speedup: 2.1 (CDFG) versus 1.1 (DFG)

35/38 Experimental Results: Energy Saving
Average energy saving: 43% (CDFG) versus 21% (DFG)

36/38 Experimental Results: Comparison

38/38 Conclusion
- Handling branch instructions and extending DFGs to CDFGs
- Basic requirements for an accelerator featuring conditional execution
- CDFG temporal partitioning algorithms
  - NTPT: traverses the not-taken paths of the branch instructions
  - Frequency-based temporal partitioning algorithm
- Extending AMBER's RFU to support conditional execution
  - Increasing speedup
  - Reducing energy consumption

Thank You!