Dynamic Hardware/Software Partitioning: A First Approach

Presentation transcript:

Dynamic Hardware/Software Partitioning: A First Approach
Greg Stitt, Roman Lysecky, Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

Introduction
- Dynamic optimizations are an increasing trend
- Examples
  - Dynamo: dynamic software optimizations
  - Transmeta Crusoe: dynamic code morphing
  - Just-in-time compilation: interpreted languages
- Advantages
  - Transparent optimizations
  - No designer effort
  - No tool restrictions
  - Adapts to actual usage

Introduction
- Drawbacks of current dynamic optimizations
  - Currently limited to software optimizations
  - Limited speedup (1.1x to 1.3x common)
- Alternatively, we could perform hw/sw partitioning
  - Achieves large speedups (2x to 10x common)
  - However, it cannot presently be performed as a dynamic optimization
[Figure: a profiler identifies critical regions of the software; those regions move to hardware on an ASIC/FPGA while the remaining software runs on the processor.]

Introduction
- Ideally, we would perform hardware/software partitioning dynamically
- Transparent partitioning
  - Supports all sw languages/tools
  - Most partitioning approaches have complex tool flows
- Achieves better results than software optimizations
  - >2x speedup, energy savings
- Adapts to actual usage
- Appropriate architecture required
  - Requires a processor and configurable logic

Introduction
- Microprocessor/FPGA single-chip platforms make partitioning more attractive
  - More efficient communication, smaller size
  - Higher performance, low power
- Examples
  - Xilinx Virtex II Pro, Triscend E5/A7, Altera Excalibur, Atmel FPSLIC
- Makes dynamic hw/sw partitioning more feasible
  - However, partitioning must be performed at the binary level
[Figure: processor and FPGA as separate devices in the 1990s; processor and FPGA on a single chip in 2003.]

Introduction
- Binary-level hw/sw partitioning
  - Binary is profiled and hardware candidates are determined
  - Regions to be partitioned are decompiled into a CDFG
  - CDFG is synthesized to hardware
  - Binary is updated to use hardware
- Many advantages over source-level partitioning
  - Supports any language or software compiler
  - No change in tools
  - Better software size and performance estimation at binary level
  - Enables dynamic hw/sw partitioning
[Figure: tool flow with stages Profiling, Hw Exploration, Decompilation, Behavioral Synthesis, and Binary Updater; the input is a binary, and the outputs are a netlist for the FPGA and an updated binary for the processor.]

Dynamic Hw/Sw Partitioning
[Animation, shown as seven builds of one diagram containing Memory, Microprocessor, Dynamic Partitioning Module, and Configurable Logic: the microprocessor executes the software (SW) instruction stream (add, beq, ...) from memory; the dynamic partitioning module observes that stream and identifies the frequent loops; hardware (HW) is generated for those loops; the frequent loops then run in the configurable logic.]

Dynamic Partitioning Module
- Dynamic partitioning module executes partitioning tools on chip
  - Profiler, partitioning compiler, synthesis, place & route
[Figure: conventional flow from SW source through profiler, partitioning compiler, synthesis, and place & route to SW binary and HW, shown alongside the on-chip system of memory, microprocessor, dynamic partitioning module, and configurable logic.]

Dynamic Partitioning Module
- Synthesis and place & route tools are all moved on-chip
  - These tools typically execute on powerful workstations
  - Most people will cringe at the idea of moving these tools on-chip
- However, dynamic partitioning deals with small regions of code
  - Typically, small innermost loops
- Therefore, we can develop lean tools that work specifically for these small loops
  - Lean tools make on-chip execution possible
  - Area overhead is becoming less critical due to Moore's Law

System Architecture
- Microprocessors
  - MIPS (may be many)
- On-chip memory
- Configurable logic
- Dynamic partitioning module
[Figure: memory, microprocessor, dynamic partitioning module, and configurable logic.]

Dynamic Partitioning Module
- Dynamically detects frequent loops and then reimplements the loops in hardware running on the configurable logic
- Architectural components
  - Profiler
  - Additional processor and memory
    - But SoCs may have dozens anyway
    - Alternatively, we could share the main processor
[Figure: profiler, partitioning co-processor, and memory.]

Configurable Logic Fabric
- Greatly simplified in order to create lean place & route tools
- DMA used to access memory
- Two registers
  - R0_Input stores data from memory
  - R1_InOut stores temporary data and data to write back to memory
- Fabric
  - Supports combinational logic only
  - Implies each loop body must be implemented in a single cycle (temporary restriction)
[Figure: DMA feeding R0_Input and R1_InOut, which connect to the configurable logic fabric.]

Configurable Logic Fabric
- 3-input, 2-output LUTs surrounded by switch matrices
- Switch matrix
  - Connects a wire to the same channel on a different side
- LUT
  - 3-input (8-word), 2-output SRAM
[Figure: the fabric as an array of LUTs and switch matrices (SM); each LUT is an 8x2 SRAM with three inputs and two outputs.]
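
To make the LUT description concrete, here is a minimal Python sketch of a 3-input, 2-output LUT modeled as an 8-word by 2-bit SRAM. The class name, the input-to-address bit order, and the choice of a full adder as the configured function are assumptions made for illustration; the slides do not specify them.

```python
class Lut3x2:
    """3-input, 2-output LUT modeled as an 8-word x 2-bit SRAM."""

    def __init__(self, words):
        if len(words) != 8 or any(not 0 <= w <= 3 for w in words):
            raise ValueError("expected 8 words of 2 bits each")
        self.words = list(words)

    def read(self, a, b, c):
        """Return (out1, out0) for input bits a, b, c."""
        # Assumed address layout: a is the most significant address bit.
        address = (a << 2) | (b << 1) | c
        word = self.words[address]
        return (word >> 1) & 1, word & 1


def full_adder_lut():
    """Configure one LUT as a full adder: out1 = carry-out, out0 = sum."""
    words = []
    for address in range(8):
        a, b, c = (address >> 2) & 1, (address >> 1) & 1, address & 1
        # a + b + c lies in 0..3, which is exactly (carry << 1) | sum.
        words.append(a + b + c)
    return Lut3x2(words)
```

A full adder has three inputs and two outputs, so one bit of an adder fits in a single LUT of this shape under these assumptions.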

Tool Overview
- Tool flow slightly different from standard partitioning flow
  - Decompilation
  - Binary modification
[Figure: tool flow from the binary to hardware and an updated binary. Its stages are covered on the following slides in this order: Loop Profiling (yielding small, frequent loops), Decompilation, DMA Configuration, RT and Logic Synthesis, Tech. Mapping, Place & Route, Bitfile Creation, and Binary Modification.]

Loop Profiling
- Non-intrusive profiler
  - Monitors instruction bus
- Very little overhead
  - Small cache (~16 entries) and 2,300 logic gates
  - Less than 1% power overhead
[Figure: a frequent loop cache controller sits on the bus between the microprocessor and L1 memory, observing the rd/wr, addr, data, and sbb signals; it reads and writes a frequent loop cache whose entries are incremented (++) with saturation.]
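
The profiler can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes "sbb" to mean a taken short backward branch, keys the cache by branch target address, and uses 8-bit saturating counters with a least-frequent eviction policy. None of those details come from the slides, which give only the cache size (~16 entries) and the use of saturation.

```python
class FrequentLoopCache:
    """Small cache of saturating counters, one per observed loop."""

    def __init__(self, entries=16, counter_bits=8):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.counts = {}  # branch target address -> saturating counter

    def observe_sbb(self, target):
        """Record one taken short backward branch to `target`."""
        if target in self.counts:
            # Saturating increment: never wrap around to zero.
            self.counts[target] = min(self.counts[target] + 1, self.max_count)
        elif len(self.counts) < self.entries:
            self.counts[target] = 1
        else:
            # Assumed replacement policy: evict the least frequent entry.
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
            self.counts[target] = 1

    def hottest(self, n=1):
        """Return the n most frequently executed loop addresses."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]
```

In use, a simulator would call `observe_sbb` once per taken backward branch seen on the instruction bus and then pass `hottest()` to the rest of the tool flow.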

Decompilation
- Decompilation recovers high-level information
  - Creates optimized CDFG
  - All instruction-set inefficiencies are removed
- Binary partitioning has been shown to achieve similar results to source-level partitioning for many applications [Greg Stitt, Frank Vahid, ICCAD 2002]

DMA Configuration
- Maps memory accesses to our DMA architecture
  - Reads/writes
  - Increment/decrement address updates
  - Single/block request modes
- Optimizes DFG for DMA
  - Removes address calculations
  - Removes loop counters/exit conditions
[Figure: example DFG before and after DMA mapping; the address increment (r1 + 1) and the memory read are replaced by a DMA read configured as Memory Read, Increment Address, Block Request.]

Register Transfer Synthesis
- Maps DFG operations to hw library components
  - Adders, comparators, multiplexors, shifters
- Creates a Boolean expression for each output bit in the dataflow graph by replacing hw components with corresponding expressions
[Figure: example DFG with r4 = r1 + r2 mapped to a 32-bit adder and r5 = r3 < 8 mapped to a 32-bit comparator.]
  r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
  r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = ...
  ...
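
The adder expressions above follow the usual ripple-carry pattern. The Python sketch below evaluates that pattern bit by bit and can be checked against ordinary integer addition. The helper names are hypothetical, and the carry equation for bits above 0 is the standard full-adder carry, which the slide elides.

```python
def to_bits(value, width):
    """Little-endian list of bits: index 0 is the least significant bit."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_add_bits(a_bits, b_bits):
    """Add two equal-width bit lists using only per-bit xor/and/or."""
    result, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        # Sum bit: (a xor b) xor carry-in, as on the slide.
        result.append((a ^ b) ^ carry)
        # Standard full-adder carry-out (assumed; elided on the slide).
        carry = (a & b) | ((a ^ b) & carry)
    return result
```

With a carry-in of 0, bit 0 reduces to `r1[0] xor r2[0]` with carry `r1[0] and r2[0]`, matching the first line of the slide's example.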

Logic Synthesis
- Optimizes Boolean equations from RT synthesis
- Large opportunity for logic minimization due to use of immediate values in the binary
- Simple on-chip 2-level logic minimization method
  - Lysecky/Vahid DAC'03, session 20.4 (9:45 Wed)
Example: r2 = r1 + 4
Before minimization:
  r2[0] = r1[0] xor 0 xor 0
  r2[1] = r1[1] xor 0 xor carry[0]
  r2[2] = r1[2] xor 1 xor carry[1]
  r2[3] = r1[3] xor 0 xor carry[2]
  ...
After minimization:
  r2[0] = r1[0]
  r2[1] = r1[1] xor carry[0]
  r2[2] = r1[2]' xor carry[1]
  r2[3] = r1[3] xor carry[2]
  ...
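
The simplification in the r1 + 4 example can be reproduced with plain constant folding over xor terms, sketched below in Python. This is not the on-chip 2-level minimization method the slide cites; it only illustrates why immediate values create easy minimization opportunities. The function names are hypothetical, and the carry terms are left symbolic, as they are on the slide.

```python
def simplify_xor(terms):
    """Fold constant 0/1 terms of an xor; return (variables, inverted)."""
    inverted = False
    variables = []
    for term in terms:
        if term == 0:
            continue                  # x xor 0 = x
        elif term == 1:
            inverted = not inverted   # x xor 1 = x'
        else:
            variables.append(term)
    return variables, inverted


def format_xor(terms):
    """Render a folded xor, writing inversion as a trailing apostrophe."""
    variables, inverted = simplify_xor(terms)
    if not variables:
        return "1" if inverted else "0"
    head = variables[0] + ("'" if inverted else "")
    return " xor ".join([head] + variables[1:])


def add_immediate_equations(width, immediate):
    """Sum-bit equations for r2 = r1 + immediate, with symbolic carries."""
    equations = []
    for i in range(width):
        imm_bit = (immediate >> i) & 1
        carry_in = 0 if i == 0 else f"carry[{i - 1}]"
        terms = [f"r1[{i}]", imm_bit, carry_in]
        equations.append(f"r2[{i}] = " + format_xor(terms))
    return equations
```

Calling `add_immediate_equations(4, 4)` yields the four simplified equations shown in the example. A fuller minimizer would also fold the carries themselves, which this sketch does not attempt.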

Technology Mapping
- Maps logic operations to 3-input, 2-output LUTs
- Traverse the logic network and combine nodes to determine single-output LUTs
- Combine nodes to form two-output LUTs

Placement
- Nodes along the critical path are placed in a single horizontal row
- Build dependencies between remaining nodes and placed nodes
- Use dependencies to place remaining nodes
  - Either above or below placed nodes
[Figure: a row of LUTs.]

Routing
- Greedy algorithm
  - At each switch matrix, choose a direction to route
  - Continue to route until reaching a switch matrix that is already in use
  - Backtrack to the previous switch matrix, and try another direction
- Place and route is the most complex task; currently working on improvements
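
A minimal Python sketch of greedy routing with backtracking follows. It simplifies the fabric heavily: switch matrices sit on a square grid, a matrix is either wholly free or wholly in use, and the greedy choice is "move toward the goal first". All of these are assumptions for illustration, and the function name is hypothetical.

```python
def route(start, goal, size, used):
    """Route from start to goal on a size x size grid of switch matrices.

    `used` is a set of (row, col) matrices already taken by earlier nets.
    Returns the path as a list of cells, or None if no route exists.
    """
    path, visited = [start], {start}

    def step(cell):
        if cell == goal:
            return True
        r, c = cell
        neighbors = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
        # Greedy choice: try the direction that ends closest to the goal.
        neighbors.sort(key=lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1]))
        for nxt in neighbors:
            nr, nc = nxt
            if not (0 <= nr < size and 0 <= nc < size):
                continue  # off the fabric
            if nxt in used or nxt in visited:
                continue  # switch matrix already in use
            visited.add(nxt)
            path.append(nxt)
            if step(nxt):
                return True
            path.pop()  # backtrack and try another direction
        return False

    return path if step(start) else None
```

A real fabric has several channels per switch matrix, so "in use" would be tracked per channel rather than per matrix; the control structure of try, hit an obstacle, back up, and try again stays the same.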

Bitfile Creation
- Combines the placed and routed hardware description with the DMA configuration into a bitfile
- Used to initialize the configurable logic
[Figure: the HW netlist and DMA configuration enter bitfile creation; the resulting bitfile configures the DMA, R0_Input, R1_InOut, and the configurable logic fabric.]

Binary Modification
- Updates the application binary in order to utilize the new hardware
  - Loop replaced with jump to hw initialization code
- Wisconsin Architectural Research Tool Set (WARTS)
  - EEL (Executable Editing Library)
- We assume memory is RAM or programmable ROM
Original binary:
  loop:
    Load r2, 0(r1)
    Add r1, r1, 1
    Add r3, r3, r2
    Blt r1, 8, loop
  after_loop:
    .....
Updated binary:
  loop:
    Jump hw_init
    ..
  after_loop:
    .....
  hw_init:
    Initialize HW registers
    Enable HW
    Shutdown processor
    Woken up by HW interrupt
    Store any results
    Jump to after_loop

Tool Statistics
- Executed on SimpleScalar
  - Similar to a MIPS instruction set
  - Used 60 MHz clock (like Triscend A7 device)
- Statistics
  - Total run time of only 1.09 seconds
  - Requires less than ½ megabyte of RAM
  - Code size much smaller than standard synthesis tools

Experiments
- Benchmark information
  - Powerstone (Brev, g3fax1&2)
  - NetBench (url)
  - Logic minimization kernel (logmin)
- Statistics
  - 55% of total time spent in loops that are moved to hardware
    - Ideal speedup of 2.8
  - These loops were only 2.4% of the size of the original application

Experiments
- Results
  - Achieved average speedup of 2.6, close to ideal 2.8
  - Hardware loops were 20X faster than software loops
- Even with simple architecture and tools, large speedups were achieved
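
The relationship between time spent in loops, loop speedup, and overall speedup is Amdahl's law, sketched below in Python. The function names are hypothetical. Plugging in the averaged figures (55% of time, 20X loop speedup) gives about 2.2 ideal and 2.1 overall, not the reported 2.8 and 2.6. The reported values are presumably averages of per-benchmark speedups, which come out higher than the speedup computed from averaged fractions; that reading is an assumption, and the per-benchmark data is not in these slides.

```python
def ideal_speedup(fraction_in_hw):
    """Overall speedup if the hardware part took zero time."""
    return 1.0 / (1.0 - fraction_in_hw)


def overall_speedup(fraction_in_hw, hw_speedup):
    """Amdahl's law: only `fraction_in_hw` of the run time is accelerated."""
    return 1.0 / ((1.0 - fraction_in_hw) + fraction_in_hw / hw_speedup)


if __name__ == "__main__":
    # Averaged figures from the slides; per-benchmark values would differ.
    print(round(ideal_speedup(0.55), 2))
    print(round(overall_speedup(0.55, 20.0), 2))
```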

Conclusion
- Dynamic hardware/software partitioning has advantages over other partitioning approaches
  - Completely transparent
  - Designers get performance/energy benefits of hw/sw partitioning by simply writing software
  - Quality likely not as good as desktop CAD for some applications, so most suitable when transparency is critical (very often!)
- Achieved average speedup of 2.6
  - Very close to ideal speedup of 2.8
- Future work
  - More complex configurable logic fabric
    - Designed in close conjunction with on-chip CAD tools
    - Sequential logic and increased inputs/outputs
  - Support larger hardware regions, not just simple loops
  - Improved algorithms (especially place and route)
  - Handle more complex memory access patterns