Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.


Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University Speaker: 陳雋中

Outline
Introduction
System Architecture
Tool Overview
Experiments
Conclusion

Introduction (1/3)
Drawbacks of current dynamic optimizations
  – Currently limited to software optimizations
Alternatively, we could perform hw/sw partitioning
  – Achieve large speedups (2x to 10x common)
  – However, presently dynamic optimization not possible
[Figure labels: Sw; Hw; Profiler; Critical Regions; Processor; ASIC/FPGA]

Introduction (2/3)
Ideally, we would perform hardware/software partitioning dynamically
  – Most partitioning approaches have complex tool flows
  – Achieves better results than software optimizations
      >2x speedup, energy savings
Appropriate architecture required
  – Requires a processor and configurable logic

Introduction (3/3)
Binary-level hw/sw partitioning
  – Binary is profiled and hardware candidates are determined
  – Binary is updated to use hardware
Many advantages over source-level partitioning
  – Supports any language or software compiler
      No change in tools
  – Better performance estimation at binary level
Enables dynamic hw/sw partitioning
[Figure labels: Binary; Profiling; Hw Exploration; Decompilation; Behavioral Synthesis; Binary Updater; Netlist; Updated Binary; Processor; FPGA]

Dynamic Hw/Sw Partitioning
[Animated figure, seven build steps. Components: Microprocessor; Memory; Dynamic Partitioning Module; Configurable Logic. Step labels, in order: SW add; SW beq; Dynamic Partitioning Module add; Dynamic Partitioning Module beq; Frequent Loops (SW); Frequent Loops (HW); Frequent Loops on Configurable Logic]

Dynamic Partitioning Module (1/2)
Dynamic partitioning module executes partitioning tools on chip
  – Profiler, partitioning compiler, synthesis, place & route
[Figure labels: SW Source; SW Binary; HW; Profiler; Partitioning Compiler; Synthesis; Place & Route; Memory; Dynamic Partitioning Module; Configurable Logic; Microprocessor]

Dynamic Partitioning Module (2/2)
Dynamically detects frequent loops and then reimplements the loops in hardware running on the configurable logic
Architectural components
  – Profiler
  – Additional processor and memory
      But SoCs may have dozens anyway
      Alternatively, we could share the main processor
[Figure labels: Memory; Profiler; Partitioning Co-Processor]

Configurable Logic
Greatly simplified in order to create lean place & route tools
DMA used to access memory
Two registers
  – R0_Input stores data from memory
  – R1_InOut stores temporary data & data to write back to memory
Fabric
  – Supports combinational logic
  – Implies loops must have body implemented in single cycle (temporary restriction)
[Figure labels: DMA; R0_Input; Configurable Logic Fabric; R1_InOut]
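
One way to picture the single-cycle restriction is a small software model, given here as a sketch under assumptions rather than the actual hardware: on each iteration the DMA places one memory word in R0_Input, and a purely combinational function computes the next value of R1_InOut from the two registers. The function name `run_fabric`, its parameters, and the one-word-per-iteration behaviour are illustrative assumptions not stated on the slide.

```python
def run_fabric(memory, base, count, body, r1_init=0):
    """Toy model of the two-register fabric (hypothetical helper, not from the slides).

    memory  : list of words standing in for main memory
    base    : first address the DMA reads (assumed incrementing by 1)
    count   : number of loop iterations handled by the DMA
    body    : combinational function (r0_input, r1_inout) -> new r1_inout
    """
    r1_inout = r1_init
    for i in range(count):
        r0_input = memory[base + i]          # DMA loads one word into R0_Input
        r1_inout = body(r0_input, r1_inout)  # whole loop body evaluated in one step
    return r1_inout                          # value available for write-back


# Example: a summation loop body expressed as combinational logic.
total = run_fabric(list(range(8)), base=0, count=8, body=lambda r0, r1: r0 + r1)
```

Because `body` sees only the two registers and returns a single value, any loop whose body needs intermediate state across several cycles does not fit this model, which matches the "temporary restriction" noted on the slide.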

Tool Overview
Tool flow slightly different from standard partitioning flow
  – Decompilation
  – Binary modification
[Figure labels: Binary; Loop Profiling; Small, Frequent Loops; Decompilation; DMA Configuration; RT and Logic Synthesis; Tech. Mapping; Place & Route; Bitfile Creation; Binary Modification; HW; Updated Binary]

Loop Profiling
Non-intrusive profiler
  – Monitors instruction bus
Very little overhead
  – Small cache (~16 entries) and 2,300 logic gates
Less than 1% power overhead
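
A minimal sketch of how such a profiler could be modelled in software. It assumes the profiler counts taken backward branches observed on the instruction bus in a small table and evicts the least-counted entry when the table is full. The class name, the eviction policy, and the `(pc, target)` trace format are illustrative assumptions; the slide only states that the profiler monitors the instruction bus and uses a cache of about 16 entries.

```python
class LoopProfiler:
    """Toy model of a non-intrusive loop profiler (illustrative assumptions only)."""

    def __init__(self, entries=16):
        self.entries = entries  # table size; the slide mentions ~16 entries
        self.counts = {}        # loop start address -> times its back-edge was seen

    def observe(self, pc, target):
        """Record one taken branch seen on the instruction bus."""
        if target >= pc:
            return  # forward branch: not a loop back-edge, ignore it
        if target not in self.counts and len(self.counts) >= self.entries:
            # Table full: evict the least frequently seen loop (assumed policy).
            coldest = min(self.counts, key=self.counts.get)
            del self.counts[coldest]
        self.counts[target] = self.counts.get(target, 0) + 1

    def frequent_loops(self, n=1):
        """Return the n loop start addresses with the highest counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return [addr for addr, _ in ranked[:n]]
```

Feeding the model a branch trace and calling `frequent_loops()` yields the candidate loops that the later tool stages would consider for hardware.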

Decompilation
Decompilation recovers high-level information

DMA Configuration
Maps memory accesses to our DMA architecture
  – Reads/writes
  – Increment/decrement address updates
  – Single/block request modes
Optimizes DFG for DMA
  – Removes address calculations
  – Removes loop counters/exit conditions
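
The idea of replacing software address arithmetic with a DMA descriptor can be sketched as follows. This is a hedged illustration, not the tool's actual data structure: the class `DmaChannel` and all of its field names are hypothetical, chosen only to mirror the three properties listed on the slide (read/write, increment/decrement, single/block).

```python
from dataclasses import dataclass


@dataclass
class DmaChannel:
    """Hypothetical DMA descriptor; field names are assumptions for illustration."""
    direction: str       # "read" or "write"
    base: int            # first address accessed
    step: int            # positive for incrementing, negative for decrementing updates
    count: int           # number of transfers; stands in for the loop counter
    block: bool = False  # False = single-request mode, True = block-request mode

    def addresses(self):
        """Address stream generated by the DMA instead of by loop instructions."""
        return [self.base + i * self.step for i in range(self.count)]


# A loop that reads 8 consecutive words starting at address 0 needs no
# address-update or exit-test logic once it is described this way.
reads = DmaChannel("read", base=0, step=1, count=8)
```

Once the access pattern lives in the descriptor, the address increment and the loop exit comparison can be dropped from the data-flow graph, leaving only the computation itself to be synthesized.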

Bitfile Creation
Combines place & routed hardware description with DMA configuration into bitfile
  – Used to initialize the configurable logic
[Figure labels: HW Netlist; DMA Configuration; Bitfile Creation; Bitfile; DMA; R0_Input; Configurable Logic Fabric; R1_InOut]

Binary Modification
Updates the application binary in order to utilize the new hardware
  – Loop replaced with jump to hw initialization code

Original binary:
    loop:
        Load r2, 0(r1)
        Add r1, r1, 1
        Add r3, r3, r2
        Blt r1, 8, loop
    after_loop:
        …..

Updated binary:
    loop:
        Jump hw_init
        ..
    after_loop:
        …..

    hw_init:
        1. Initialize HW registers
        2. Enable HW
        3. Shutdown processor (woken up by HW interrupt)
        4. Store any results
        5. Jump to after_loop
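
The patching step can be sketched on a symbolic program representation. This is a minimal illustration under assumptions: the program is modelled as a list of `(label, instruction)` pairs, only the first instruction of the loop is overwritten, and the stub is appended at the end. The function name `patch_loop` and this representation are hypothetical; the real tool operates on the binary itself.

```python
def patch_loop(program, loop_label="loop", stub_label="hw_init", exit_label="after_loop"):
    """Return a patched copy of `program` (hypothetical helper, not the actual tool).

    `program` is a list of (label_or_None, instruction_text) pairs.
    """
    patched = []
    for label, instr in program:
        if label == loop_label:
            # Overwrite only the loop's first instruction with the jump.
            patched.append((label, f"Jump {stub_label}"))
        else:
            patched.append((label, instr))
    # Append the initialization stub; steps are kept as pseudo-instructions.
    patched += [
        (stub_label, "Initialize HW registers"),
        (None, "Enable HW"),
        (None, "Shutdown processor (woken up by HW interrupt)"),
        (None, "Store any results"),
        (None, f"Jump {exit_label}"),
    ]
    return patched


program = [
    ("loop", "Load r2, 0(r1)"),
    (None, "Add r1, r1, 1"),
    (None, "Add r3, r3, r2"),
    (None, "Blt r1, 8, loop"),
    ("after_loop", "..."),
]
patched = patch_loop(program)
```

Overwriting a single instruction in place, rather than deleting the loop, keeps every other instruction at its original position in this model, so no other branch targets need to be rewritten.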

Tool Statistics
Executed on SimpleScalar
  – Similar to a MIPS instruction set
  – Used 60 MHz clock
Statistics
  – Total run time of only 1.09 seconds
  – Requires less than ½ megabyte of RAM
  – Code size much smaller than standard synthesis tools

Experiments
Benchmark Information
  – Powerstone (Brev, g3fax1&2)
  – NetBench (url)
  – Logic minimization kernel (logmin)
Statistics
  – 55% of total time spent in loops that are moved to hardware
  – Ideal speedup of 2.8
  – These loops were only 2.4% of the size of the original application

Experiments
Results
  – Achieved average speedup of 2.6, close to ideal 2.8
  – Hardware loops were 20X faster than software loops
Even with simple architecture and tools, large speedups were achieved
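
The relationship between time spent in loops, loop speedup, and overall speedup follows Amdahl's law, sketched below. The function names are illustrative. Note that applying the formula directly to the averaged figures (55% of time, 20X loop speedup) gives roughly 2.1, with an upper bound of roughly 2.2, which is lower than the reported 2.6 and 2.8. The slides do not say how the averages were formed; since 1/(1 − f) is convex in f, an average of per-benchmark speedups can exceed the speedup computed from the average fraction. The two fractions in the last line are invented purely to illustrate that effect and are not measured data.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of run time is accelerated by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def ideal_speedup(fraction):
    """Upper bound: the accelerated fraction is assumed to take zero time."""
    return 1.0 / (1.0 - fraction)


# Formula applied to the averaged figures from the slides.
overall = amdahl_speedup(0.55, 20)
bound = ideal_speedup(0.55)

# Hypothetical per-benchmark fractions (NOT from the slides) whose mean is 0.55:
# the mean of their ideal speedups is larger than ideal_speedup(0.55).
mean_of_ideals = (ideal_speedup(0.30) + ideal_speedup(0.80)) / 2
```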

Conclusion
Dynamic hardware/software partitioning has advantages over other partitioning approaches
Achieved average speedup of 2.6
  – Very close to ideal speedup of 2.8
Future work
  – More complex configurable logic fabric
      Sequential logic and increased inputs/outputs
      Support larger hardware regions, not just simple loops
      Improved algorithms (especially place and route)
  – Handle more complex memory access patterns