Dynamic and Autonomous Software-to-Hardware Translation for High-Performance and Low-Power Embedded Computing
Roman Lysecky, Department of Electrical and Computer Engineering, University of Arizona

Introduction
Past & Present: Standard Software Binaries
- Software binaries of the past:
  - Binary directly tied to the processor's ISA
  - Limited portability
  - Instructions were executed as specified and in order
- Current software binaries:
  - Specify application functionality in a manner not specific to the underlying processor architecture
  - New architectures can be developed for existing applications
(Diagram: SW Application → Compiler → SW Binary → Processor Architecture)
Roman Lysecky, University of Arizona

Introduction
Past & Present: Standard Software Binaries
- The standard SW binary:
  - Enabled an ecosystem of applications, compilers, and architectures
  - Provided separation of concerns
- Applications:
  - Developers can focus on the application
  - Choose the appropriate programming language to capture functionality
- Architectures:
  - Focus on improving and developing new architectures that execute the SW binary better
- Compilers:
  - Focus on optimizing the application for a specific architecture
(Diagram: SW Application → Compiler → SW Binary → Processor Architecture)

Introduction
Past & Present: Standard Software Binaries
- Processor architectures:
  - Many alternative architectures can exist for a given standard binary (e.g., VLIW, superscalar)
- Current software binary:
  - Specifies application functionality in a well-defined manner but is not specific to the underlying processor architecture
  - New architectures can be developed for existing applications
(Diagram: SW Application → Compiler → SW Binary → Processor Architecture)

Reconfigurable Computing
Past & Present: FPGAs
- Field Programmable Gate Arrays (FPGAs):
  - Reconfigurable devices that can implement any circuit simply by downloading bits
  - Basic logic elements are the N-input look-up table (LUT) and flip-flops
    - A LUT is a small N-address memory that stores the truth table of any N-input logic function
  - Arrays of LUTs and routing elements called switch matrices (SMs)
(Diagram: FPGA fabric of configurable logic blocks (CLBs) interconnected by switch matrices (SM); LUT detail)

Reconfigurable Computing
Past & Present: FPGAs
- FPGAs are sometimes better than microprocessors:
  - Provide concurrency from the bit level to the task level

C code for bit reversal:

x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Compiled SW binary (excerpt), requiring between 32 and 128 cycles on the processor:

sll $v1, $v0, 0x10
srl $v0, $v0, 0x10
or  $v0, $v1, $v0
srl $v1, $v0, 0x8
and $v1, $v1, $t5
sll $v0, $v0, 0x8
and $v0, $v0, $t4
or  $v0, $v1, $v0
...

Synthesized circuit for bit reversal on the FPGA: the original x value is simply rewired to the bit-reversed x value, requiring only 1 cycle (speedup of 32x to 128x).
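The slide's shift-and-mask sequence can be checked directly in C. Below is the same five-step bit reversal wrapped in a function; the name reverse32 is ours for illustration, not from the slide.

```c
#include <stdint.h>
#include <assert.h>

/* Reverse the 32 bits of x using the five shift-and-mask steps from
   the slide: swap halves, then bytes, nibbles, bit pairs, and bits. */
static uint32_t reverse32(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}
```

On a processor each line compiles to several shift/mask/or instructions, while on an FPGA the whole function is just wiring, which is the source of the 32x to 128x figure above.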

Reconfigurable Computing
Past & Present: FPGAs
- An FPGA can implement "circuits" simply by downloading "software" bits
(Diagram: software side: SW Application → Compiler → SW Binary → Processor; hardware side: HW Circuit → Synthesis → Bitstream → FPGA)

Reconfigurable Computing
Past & Present: FPGAs
- FPGAs can be combined with microprocessors
- Benefits of HW/SW partitioning:
  - Speedup of 2X to 10X
    - 1000X possible for some highly parallelizable applications
  - Energy reduction of 25% to 95%
(Diagram: Software Application (C/C++) → Application Profiling → Critical Kernels → Partitioning into SW and HW; µP with I$ and D$ coupled to an FPGA coprocessor)

Reconfigurable Computing
Past & Present: FPGAs
- Why aren't FPGAs common?
  - Bitstreams are not standardized
  - Programmability
- Solution: hide the FPGA from the application developer
  - Just as the underlying processor architecture is hidden

Warp Processing
Dynamic Software to Hardware Translation
1. Application initially executes on the microprocessor
2. Profiler dynamically detects the application's kernels
3. On-chip CAD maps the kernels onto the FPGA
4. FPGA is configured and the application binary is updated
5. Warped execution is 2-100X faster, or consumes 75% less power
(Diagram: µP with I$ and D$, profiler, on-chip CAD, and W-FPGA)

Warp Processing
Dynamic Software to Hardware Translation
SW Binary → Decompilation → Partitioning → RT Synthesis → JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing) → Bitstream → Circuit
Binary Update → Updated SW Binary

Warp Processing
Dynamic Software to Hardware Translation
- Challenges of dynamic SW-to-HW translation:
  - Existing FPGAs require extremely complex CAD tools
    - Designed to handle large arbitrary circuits, ASIC prototyping, etc.
    - Require long execution times and very large memory usage (logic synthesis and technology mapping on the order of a minute each, placement 1-2 minutes, routing 2-30 minutes, with tens of MB of memory per stage)
    - Not suitable for dynamic on-chip execution

Warp Processing
Dynamic Software to Hardware Translation
- Solution: develop a custom warp-oriented FPGA (i.e., a CAD-oriented FPGA)
  - Careful simultaneous design of the FPGA and CAD
    - FPGA features evaluated for their impact on CAD
    - Architecture features added for SW kernels
  - W-FPGA validated in collaboration with the Intel Research Shuttle
- Enables development of fast, lean JIT FPGA compilation tools
  - 1.4 s/kernel on a 75 MHz ARM7

Warp Processing
Performance-driven Warp Processing
- Warp processing: adaptive computing
  - Embeds compiler/synthesis within the architecture
  - Autonomously adapts/optimizes the software binary at runtime to improve performance or reduce power consumption
- Performance-driven warp processing (low end):
  - Goal: maximize application performance over software execution
  - Target: low- to mid-range embedded processors (e.g., a modest-clock-rate ARM processor)
  - Average speedup of 7.4X across several embedded benchmark applications

Warp Processing
Performance-driven Warp Processing
- Performance-driven warp processing (high end):
  - Target: high-end 624 MHz XScale processor
  - Average speedup of 2.5X compared to the 624 MHz XScale processor alone
(Chart: per-benchmark speedups; max speedup 6X, average speedup 2.5X)

Warp Processing
Bad Analogy Time!
(Analogy: the Nürburgring. Control-dominant code is an excellent candidate for the µP; data-dominant code is an excellent candidate for the FPGA.)

Warp Processing
Performance-driven Warp Processing
(Analogy continued: Nürburgring lap times of 9:31 and 7:22.1 for the individual vehicles; lap time of 6:21 for the combination??)

Warp Processing
Bad Analogy Time!
- If no dependencies exist, what performance gains can we get from parallel execution of the FPGA and µP?
(Analogy: the Nürburgring. Control-dominant code is an excellent candidate for the µP; data-dominant code for the FPGA.)

Warp Processing
Performance-driven Warp Processing
- What if you cared more about power (i.e., fuel consumption)?
(Analogy: fuel economy of 26/34 MPG vs. 13/22 MPG; what does the combination achieve?)

Warp Processing
Low-power Warp Processing
- Low-power warp processing:
  - Goal: reduce overall power consumption without any degradation in performance
  - Leverages dynamic voltage/frequency scaling of the processor and FPGA
(Diagram: profile SW execution on the µP; on-chip CAD produces warped HW/SW execution on µP/FPGA; µP voltage/frequency scaling and FPGA frequency scaling then reduce power below the SW-only level with no performance decrease)
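The voltage/frequency-scaling argument can be made concrete with the standard dynamic CMOS power relation, P ≈ C·V²·f: once the warped kernels free up slack, halving both voltage and frequency cuts dynamic power by roughly 8x. The helper below is our numerical illustration, not a model from the talk.

```c
#include <assert.h>

/* Relative dynamic power of CMOS logic, P ~ C * V^2 * f.
   Returns the ratio of scaled dynamic power to baseline power
   for the given voltage and frequency scale factors. */
static double dynamic_power_ratio(double v_scale, double f_scale)
{
    return v_scale * v_scale * f_scale;
}
```

For example, running at half voltage and half frequency gives a ratio of 0.125, i.e., an 87.5% dynamic-power reduction, which is in the same ballpark as the reductions reported on the following slide.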

Warp Processing
Low-power Warp Processing
- Low-power warp processing:
  - Goal: reduce overall power consumption without any degradation in performance
  - Leverages dynamic voltage/frequency scaling of the processor and FPGA
  - Average reduction in power consumption of 74%
(Chart: per-benchmark power reductions; max reduction 97%, average reduction 74%)

Warp Processing
Benefits
- Warp processing:
  - Maintains the ecosystem supported by the standard software binary
    - Optimizes software binary execution without developer effort, or even knowledge thereof
  - Builds upon the software binary concept to leverage the benefits of FPGAs
    - For those applications where FPGAs are beneficial
    - Optimized at runtime, where additional information may be known
  - Reduces designer effort
    - No hardware design expertise needed!

Warp Processing
Challenges
- Warp processing and HW/SW partitioning challenges:
  - Limited support for pointers
  - Limited support for dynamic memory allocation
  - Limited support for function recursion
  - Very limited support for floating-point operations
(Diagram: Software Application (C/C++) → Application Profiling → Critical Kernels → Partitioning into SW and HW; µP with I$ and D$ coupled to a HW coprocessor (ASIC/FPGA))

Introduction
Floating Point Software Applications
- Floating-point representation
- Pros:
  - IEEE standard 754
  - Convenience: supported within most programming languages (C, C++, Java, etc.)
- Cons:
  - Partitioning floating-point kernels directly to hardware requires:
    - Large area resources
    - Multi-cycle latencies
- Alternatively, a fixed-point representation can be used to support real numbers

Single-precision floating point: S (1 bit), E (8 bits), M (23 bits)

void Reference_IDCT(short* block) {
  int i, j, k, v;
  float part_prod, tmp[64];
  for (i=0; i<8; i++)
    for (j=0; j<8; j++) {
      part_prod = 0.0;
      for (k=0; k<8; k++) {
        part_prod += c[k][j] * block[8*i+k];
      }
      tmp[8*i+j] = part_prod;
    }
  ...
}

Introduction
Fixed Point Software Applications
- Fixed-point representation
- Pros:
  - Simple and fast hardware implementation
  - Mostly equivalent to integer operations
- Cons:
  - No direct support within most programming languages
  - Requires the application to be converted to a fixed-point representation

Fixed point (12.20): I (12 bits), F (20 bits)

Floating-point version:

void Reference_IDCT(short* block) {
  int i, j, k, v;
  float part_prod, tmp[64];
  for (i=0; i<8; i++)
    for (j=0; j<8; j++) {
      part_prod = 0.0;
      for (k=0; k<8; k++) {
        part_prod += c[k][j] * block[8*i+k];
      }
      tmp[8*i+j] = part_prod;
    }
  ...
}

Fixed-point version:

typedef long fixed;
#define PRECISION_AMOUNT 16

void Reference_IDCT(short* block) {
  int i, j, k, v;
  fixed part_prod, tmp[64];
  long long prod;
  for (i=0; i<8; i++)
    for (j=0; j<8; j++) {
      part_prod = 0;
      for (k=0; k<8; k++) {
        prod = c[k][j] * (((fixed)block[8*i+k]) << PRECISION_AMOUNT);
        part_prod += prod >> (PRECISION_AMOUNT * 2);
      }
      tmp[8*i+j] = part_prod;
    }
  ...
}
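The core fixed-point mechanics behind this slide (a value is an integer with an implied radix point, and a multiply must be renormalized by shifting) can be sketched in a few lines of C. The 16-bit radix and the helper names below are ours for illustration; the slide's IDCT uses its own PRECISION_AMOUNT convention.

```c
#include <stdint.h>
#include <assert.h>

#define RADIX 16  /* number of fractional bits (illustrative choice) */

typedef int32_t fixed;  /* Q15.16-style signed fixed-point value */

/* Convert integer <-> fixed by shifting across the radix point. */
static fixed int_to_fixed(int32_t i) { return (fixed)(i << RADIX); }
static int32_t fixed_to_int(fixed f) { return f >> RADIX; }

/* Fixed-point multiply: the 64-bit product carries 2*RADIX
   fractional bits, so shift right by RADIX to restore the format. */
static fixed fixed_mul(fixed a, fixed b)
{
    return (fixed)(((int64_t)a * b) >> RADIX);
}
```

The addition in `fixed_mul`-style code really is a plain integer add, which is why the slide says fixed point is "mostly equivalent to integer operations" and cheap to implement in hardware.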

Introduction
Converting Floating Point to Fixed Point
- Converting floating-point SW to fixed-point SW:
  - Manually or automatically convert the software to use a fixed-point representation
  - Need to determine the appropriate fixed-point representation
(Diagram: Software Application (Float) → Float-to-Fixed Conversion → Software Application (Fixed) → Application Profiling → Critical Kernels → Partitioning into SW and HW)

Introduction
Converting Floating Point to Fixed Point
- Automated tools for converting floating point to fixed point:
  - fixify - Belanovic, Rupp [RSP 2005]
    - Statistical optimization approach to minimize the signal-to-quantization-noise ratio (SQNR) of the fixed-point code
  - FRIDGE - Keding et al. [DATE 1998]
    - Designer-specified annotations on key fixed-point values can be interpolated to the remaining code
  - Cmar et al. [DATE 1999]
    - Annotate fixed-point values with range requirements
    - Iterative designer-guided simulation framework to optimize the implementation
  - Menard et al. [CASES 2002], Kum et al. [ICASSP 1999]
    - Conversion for fixed-point DSP processors

Introduction
Converting Floating Point to Fixed Point
- Converting floating-point SW to fixed-point HW:
  - Convert the floating-point hardware resulting from partitioning to a fixed-point implementation
    - Shi, Brodersen [DAC 2004]
    - Cmar et al. [DATE 1999]
  - The software must still be converted to a fixed-point representation
(Diagram: SW (C/Matlab) → Application Profiling → Critical Kernels (Float) → Partitioning; the HW side passes through Float-to-Fixed Conversion to produce HW (Fixed), alongside SW (Float) and SW (Fixed))

Partitioning Floating Point SW to Fixed Point HW
Separate Floating Point and Fixed Point Domains
- Proposed partitioning of floating-point SW to fixed-point HW:
  - Separate computation into floating-point and fixed-point domains
- Floating-point domain:
  - Processor (SW), caches, and memory
  - All values in memory use a floating-point representation
- Fixed-point domain:
  - HW coprocessors
- Float-to-Fixed and Fixed-to-Float converters at the boundary between SW/memory and HW perform the conversion
(Diagram: µP, I$, D$ in the floating-point domain; Float-to-Fixed and Fixed-to-Float converters at the boundary; HW coprocessors (ASIC/FPGA) in the fixed-point domain)

Partitioning Floating Point SW to Fixed Point HW
Separate Floating Point and Fixed Point Domains
- Potential benefits:
  - No need to rewrite the initial floating-point software
    - The final software can continue to use floating point
  - Efficient fixed-point implementation
    - Floating-point values can be treated as integers during partitioning
- Still requires determining the appropriate fixed-point representation
  - Can be accomplished using existing methods or directly specified by the designer
(Diagram: Software Application (C/C++) → Application Profiling → Critical Kernels → Partitioning → HW (Integer); optional Floating Point Profiling and a designer-supplied Fixed Point Representation feed Fixed Point Conversion, producing HW (Fixed) and SW (Float))

Partitioning Floating Point SW to Fixed Point HW
Float-to-Fixed and Fixed-to-Float Converters
- Float-to-Fixed and Fixed-to-Float converters:
  - Implemented as configurable Verilog modules
  - Configurable floating-point options: FloatSize, MantissaBits, ExponentBits
  - Configurable fixed-point options: FixedSize, RadixPointSize, RadixPoint
  - RadixPoint can be implemented as an input or as a parameter
(Block diagram: converter datapath with shift calculation and shifter for normal cases, special-case handling for zero, and overflow detection raising an OverflowException output)

Partitioning Floating Point SW to Fixed Point HW
Coprocessor Interface
- Hardware coprocessor interface:
  - Integrates the Float-to-Fixed and Fixed-to-Float converters with the memory interface
  - All values read from memory pass through the Float-to-Fixed converter
    - Integer: IntDataIn
    - Fixed: FixedDataIn
  - Separate outputs for integer and fixed data
    - Integer: WrInt, IntDataOut
    - Fixed: WrFixed, FixedDataOut
(Diagram: HW coprocessor with Addr, BE, Rd, Wr, DataIn, DataOut on the memory side; Float-to-Fixed on the input path and Fixed-to-Float on the output path)

Partitioning Floating Point SW to Fixed Point HW
Partitioning Tool Flow
- HW/SW partitioning of floating-point SW to fixed-point HW:
  - Kernels are initially partitioned as an integer implementation
  - Synthesis annotations identify the floating-point values

module Coprocessor (Clk, Rst, Addr, BE, Rd, Wr, DataOut, DataIn);
  input Clk, Rst;
  output [31:0] Addr;
  output BE, Rd, Wr;
  output signed [31:0] DataOut;
  input signed [31:0] DataIn;

  // syn_fixed_point (p:SP)
  reg signed [31:0] p;
  reg signed [31:0] c1;

  always @(posedge Clk) begin
    // syn_fixed_point (p:SP, DataIn:SP)
    p <= p * DataIn + c1;
  end
endmodule

Partitioning Floating Point SW to Fixed Point HW
Partitioning Tool Flow
- HW/SW partitioning of floating-point SW to fixed-point HW:
  - Fixed-point registers, computations, and memory accesses are converted to the specified representation

module Coprocessor (Clk, Rst, Addr, BE, Rd, WrInt, WrFixed,
                    IntDataOut, FixedDataOut, IntDataIn, FixedDataIn);
  ...
  // Fixed point register
  reg signed [FixedSize-1:0] p;
  // Integer register
  reg signed [31:0] c1;

  always @(posedge Clk) begin
    // Fixed point multiplication and addition
    // with conversion from integer to fixed point
    p <= ((p * FixedDataIn) >> RadixPoint) + (c1 << RadixPoint);
  end
endmodule

Partitioning Floating Point SW to Fixed Point HW
Experimental Results
- Experimental setup:
  - 250 MHz MIPS processor with floating-point support
  - Xilinx Virtex-5 FPGA
    - HW coprocessors execute at the maximum frequency achieved by Xilinx ISE 9.2
- Benchmarks:
  - MPEG2 encode/decode (MediaBench)
  - Epic (MediaBench)
  - FFT/IFFT (MiBench)
  - All applications require significant floating-point operations
  - Both integer and floating-point kernels are partitioned

Partitioning Floating Point SW to Fixed Point HW
Experimental Results
- Floating-point and fixed-point representations:
  - Fixed-point representations were chosen to produce results identical to the software floating-point implementation
  - MPEG2 encode/decode (MediaBench):
    - Float: integer (memory), single precision (computation)
    - Fixed: 32-bit, radix of 20 (12.20)
  - Epic (MediaBench):
    - Float: single precision (memory), double precision (computation)
    - Fixed: 64-bit, radix of 47 (17.47)
  - FFT/IFFT (MiBench):
    - Float: single precision (memory), double precision (computation)
    - Fixed: 51-bit, radix of 30 (21.30)

Partitioning Floating Point SW to Fixed Point HW
Experimental Results - Float-to-Fixed and Fixed-to-Float Converters
- Converter performance (RadixPoint parameter vs. input):
  - Float-to-Fixed (RadixPoint parameter): 9% faster and 10% fewer LUTs than the input version
  - Fixed-to-Float (RadixPoint parameter): 25% faster but requires 30% more LUTs than the input version

Partitioning Floating Point SW to Fixed Point HW
Experimental Results - Application Speedup
- Application speedup:
  - RadixPoint parameter implementation: average speedup of 4.4X, maximum of 6.8X (fft/ifft)
  - RadixPoint input implementation: average speedup of 4.0X, maximum of 6.2X (fft/ifft)

Warp Processing
Dynamically Adaptable Fixed-Point Coprocessing
- Warp processing: dynamically adaptable fixed-point coprocessors
  - The Float-to-Fixed and Fixed-to-Float converters open the door to dynamically adapting the fixed-point representation at runtime
- RadixGen component:
  - Responds to various overflows and dynamically adjusts RadixPoint:
    - Float-to-fixed conversion overflow
    - Integer-to-fixed conversion overflow
    - Arithmetic overflow
  - Performance speedups similar to the RadixPoint input implementation are achievable
(Diagram: coprocessor in the fixed-point domain with a RadixGen component monitoring arithmetic, conversion, and integer overflow signals)
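One way to picture the RadixGen idea: each reported overflow gives up a fractional bit (moves the radix point down), trading precision for integer range so that subsequent values fit. The struct and policy below are our simplified software model of the behavior described on the slide, not the actual hardware component.

```c
#include <assert.h>

/* Simplified RadixGen model: track the current radix point and
   reduce it (fewer fractional bits, more integer range) whenever a
   conversion or arithmetic overflow is reported. */
typedef struct {
    int radix_point;      /* current number of fractional bits */
    int min_radix_point;  /* lower bound so precision never hits zero */
} radix_gen;

static void radix_gen_report_overflow(radix_gen *rg)
{
    if (rg->radix_point > rg->min_radix_point)
        rg->radix_point--;  /* trade one bit of precision for range */
}
```

A hardware RadixGen would apply the same policy combinationally from the converter and ALU overflow flags, then feed the new RadixPoint value back to the converters and coprocessor datapath.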


Application Profiling
HW/SW Partitioning
- Hardware/software partitioning:
  - Profiling is a critical step within hardware/software partitioning
  - Often used to determine critical software regions
    - Frequently executed loops or functions
  - Critical kernels can be re-implemented in hardware
    - Speedup of 2X to 10X; speedups of 1000X possible
    - Energy reduction of 25% to 95%
(Diagram: Software Application (C/C++) → Application Profiling → Critical Kernels → Partitioning into SW and HW; µP with I$ and D$ coupled to a HW coprocessor (ASIC/FPGA))

Introduction
Application Profiling - Warp Processing
- Warp processing: dynamic hardware/software partitioning
  - Dynamically re-implements critical kernels as HW within the W-FPGA
  - Requires non-intrusive profiling to determine critical kernels at runtime
  - Incorporated the Frequent Loop Detection Profiler [Gordon-Ross, Vahid - TC 2005]
    - Monitors short backwards branches
    - Maintains a small list of branch execution frequencies
    - May lead to sub-optimal partitioning, as it does not provide detailed loop execution statistics

Introduction
Application Profiling - HW/SW Partitioning
- Loop iteration count alone may not provide sufficient information for accurate performance estimation
- Example: assume we want to partition only one of the following two loops to HW:

  Kernel | Total Iterations | % Exec Time
  A      | 10,000           | 33%
  B      | 12,000           | 45%

- With profile data from the Frequent Loop Detection Profiler, kernel B appears to be the better candidate
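The point of this example, that iteration counts alone can mislead, can be made concrete with hypothetical numbers (ours, not from the talk). If kernel B's 12,000 iterations are spread over many short invocations, a fixed per-invocation communication cost dominates and kernel A wins despite fewer total iterations.

```c
#include <assert.h>

/* Estimated hardware execution time for a kernel: each invocation
   pays a fixed communication cost, plus a per-iteration cost in HW.
   All values passed in below are hypothetical, for illustration. */
static long hw_time(long execs, long iters_per_exec,
                    long comm_cycles, long iter_cycles)
{
    return execs * (comm_cycles + iters_per_exec * iter_cycles);
}
```

With 100 cycles of communication per invocation and 1 cycle per iteration: kernel A as 10 invocations of 1,000 iterations costs 11,000 cycles, while kernel B as 1,200 invocations of 10 iterations costs 132,000 cycles, matching the table's totals of 10,000 and 12,000 iterations but reversing the apparent ranking.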

Introduction
Application Profiling - Warp Processing
- However, communication requirements can significantly impact overall performance
  - Kernel A may in fact be the better choice

  Kernel | Total Iterations | % Exec Time | Avg Iters/Exec | Execs
  A      | 10,000           | 33%         |                |
  B      | 12,000           | 45%         |                |

Introduction
Application Profiling - Goal: Non-Intrusive Profiling
- Non-intrusive application profiling:
  - Goal: profile the application at runtime to determine detailed loop execution statistics with no impact on application execution
  - Runtime overhead cannot be tolerated by many applications
    - E.g., real-time and embedded systems
    - May lead to missed deadlines and potentially system failure

Introduction
Application Profiling - Existing Profiling Methods
- Software-based profiling:
  - Instrumenting: insert code directly within the software
    - E.g., monitor branches, basic blocks, functions, etc.
    - Intrusive: increases code size and introduces runtime overhead
  - Statistical sampling:
    - Periodically interrupt the processor, or execute an additional software task, to monitor the program counter
    - Statistically determine the application profile
    - Very good accuracy with reduced overhead compared to instrumentation
    - Intrusive: introduces runtime overhead

47. Introduction: Application Profiling – Existing Profiling Methods

- Hardware-based profiling
  - Processor support – event counters
    - Many processors include event counters that can be used to profile an application
    - Intrusive: requires additional software support to process the event counters into an application profile
  - JTAG – Joint Test Action Group
    - Standard interface for reading registers within hardware devices
    - Intrusive: requires the processor to be halted to read the values

48. Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

- Dynamic Application Profiler (DAProf)
  - Non-intrusively monitors both loop executions and iterations
  - Monitors the processor's instruction bus and branch execution behavior to build the application profile
  - Requires a short backwards branch (sbb) signal from the microprocessor

[Figure: DAProf architecture – the µP (with I$/D$) on the FPGA/ASIC feeds iAddr and sbb into a Profiler FIFO, which drives the Profiler Controller and Profile Cache. Cache fields: Tag (30), Offset (8), CurrIter (10), AvgIter (13), Execs (16), InLoop (1), Freshness (3); cache outputs: found, foundIndex, replaceIndex]

49. Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

- Profiler FIFO
  - Small FIFO that stores the instruction address (iAddr) and instruction offset (iOffset) of all executed sbb's
  - Synchronizes between the processor's execution frequency and the slower internal profiler frequency

50. Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

- Profile Cache
  - Tag: address of the short backwards branch
  - Offset: negative branch offset
    - Corresponds to the size of the loop
    - Currently supports loops with fewer than 256 instructions

51. Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

- Profile Cache
  - CurrIter: number of iterations for the current loop execution
  - AvgIter: average iterations per execution of the loop
    - 13-bit fixed-point representation with 10 integer bits and 3 fractional bits
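The ratio-based average update used later in the controller pseudocode, AvgIter = (AvgIter*7 + CurrIter)/8, can be sketched in this fixed-point encoding. The 13-bit width and the update rule come from the slides; the Python scaling details and overflow handling are assumptions.

```python
FRAC_BITS = 3   # slide: AvgIter is 13 bits, 10 integer + 3 fractional

def update_avg(avg_fp, curr_iter):
    """One ratio-based update, AvgIter = (AvgIter*7 + CurrIter)/8, carried
    out on the fixed-point encoding (stored value = real value * 2**FRAC_BITS).
    A sketch of the arithmetic only, not the real 13-bit datapath."""
    return (avg_fp * 7 + (curr_iter << FRAC_BITS)) >> 3

avg = 0                       # AvgIter starts at 0 for a new cache entry
for _ in range(20):           # twenty executions of 10 iterations each
    avg = update_avg(avg, 10)
# avg / 2**FRAC_BITS approaches the true average of 10; truncation in the
# 3 fractional bits leaves a small downward bias.
```

The 7/8 weighting lets the hardware compute the average with only shifts and adds, at the cost of slow convergence and a slight truncation bias.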

52. Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

- Profile Cache
  - InLoop: flag indicating the loop is currently executing
    - Used to distinguish between loop iterations and loop executions
  - Freshness: indicates how recently a loop has been executed
    - Used to ensure newly identified loops are not immediately evicted from the profile cache

53. Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

- Profile Cache outputs
  - found: indicates whether the current loop (identified by iAddr) is present in the profile cache
  - foundIndex: location of the loop within the profile cache, if found
  - replaceIndex: entry that will be replaced when a new loop is detected
    - The loop that is not marked fresh and has the fewest total iterations
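One plausible reading of the replacement policy can be sketched as follows. The slide does not specify the exact hardware priority or tie-breaking, so the interpretation of "total iterations" as execs × avg_iter and the fallback when every entry is fresh are assumptions.

```python
def pick_replace_index(entries):
    """Victim selection sketch: prefer entries whose freshness counter has
    decayed to zero ('not fresh'); among those, evict the entry with the
    least total iterations, read here as execs * avg_iter. The exact
    hardware priority and tie-breaking are assumptions."""
    stale = [i for i, e in enumerate(entries) if e["fresh"] == 0]
    candidates = stale if stale else list(range(len(entries)))
    return min(candidates, key=lambda i: entries[i]["execs"] * entries[i]["avg_iter"])
```

Keeping fresh entries out of the candidate set is what prevents a newly detected loop from being evicted before it has accumulated any profile history.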

54. Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

- Profiler Controller
  - If the loop is found within the cache:
    - If the InLoop flag is set → new iteration
      - Increment current iterations
    - Otherwise → new execution
      - Increment executions
      - Set current iterations to 1
      - Set the InLoop flag
      - Update freshness

  DAProf(iAddr, iOffset, found, foundIndex, replaceIndex):
    if ( found )
      if ( InLoop[foundIndex] )
        CurrIter[foundIndex] += 1
      else {
        for all i, Fresh[i] = Fresh[i] - 1
        Execs[foundIndex] = Execs[foundIndex] + 1
        CurrIter[foundIndex] = 1
        InLoop[foundIndex] = 1
        Fresh[foundIndex] = MaxFresh
        if ( Execs[foundIndex] == MaxExecs )
          for all i, Execs[i] = Execs[i] >> 1
      }
    else {
      for all i, Fresh[i] = Fresh[i] - 1
      Tag[replaceIndex] = iAddr
      Offset[replaceIndex] = iOffset
      CurrIter[replaceIndex] = 1
      AvgIter[replaceIndex] = 0
      Execs[replaceIndex] = 1
      InLoop[replaceIndex] = 1
      Fresh[replaceIndex] = MaxFresh
    }
    for all i,
      if !( InLoop[i] && iAddr <= Tag[i] && iAddr >= Tag[i] - Offset[i] ) {
        InLoop[i] = 0
        AvgIter[i] = (AvgIter[i]*7 + CurrIter[i]) / 8
      }

55. Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

- Profiler Controller
  - If the loop is not found within the cache:
    - Replace the profile cache entry at replaceIndex
    - Initialize executions and current iterations to 1
    - Set the InLoop flag
    - Update freshness

  (Pseudocode as on the previous slide)

56. Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

- Profiler Controller
  - If the current sbb (iAddr) falls outside a loop held in the profile cache AND that loop's InLoop flag is set:
    - Reset the InLoop flag
    - Update average iterations
  - Ratio-based average iteration calculation
    - Simple hardware requirements
    - Good accuracy for the applications considered

  (Pseudocode as on the previous slides)
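The controller pseudocode can be modeled in software roughly as follows. This is a behavioral sketch, not the hardware: it uses floating-point AvgIter instead of the 13-bit fixed point, an assumed replacement choice, and it guards the loop-closing update with InLoop so that a closed loop's average is not rewritten on every later sbb (the slide's pseudocode leaves that guard implicit).

```python
# Behavioral software model of the DAProf controller (sketch, assumptions
# noted above; CACHE_SIZE is illustrative, not from the slides).
MAX_FRESH = 7          # 3-bit freshness counter
MAX_EXECS = 2**16 - 1  # 16-bit executions counter
CACHE_SIZE = 4

def new_entry(iaddr, ioffset):
    return {"tag": iaddr, "offset": ioffset, "curr_iter": 1,
            "avg_iter": 0.0, "execs": 1, "in_loop": True, "fresh": MAX_FRESH}

class DAProfModel:
    def __init__(self):
        self.cache = []

    def sbb(self, iaddr, ioffset):
        """Process one short-backwards-branch event from the Profiler FIFO."""
        found = next((e for e in self.cache if e["tag"] == iaddr), None)
        if found and found["in_loop"]:
            found["curr_iter"] += 1                  # another iteration
        elif found:                                  # new execution of known loop
            for e in self.cache:
                e["fresh"] = max(0, e["fresh"] - 1)
            found["execs"] += 1
            found["curr_iter"] = 1
            found["in_loop"] = True
            found["fresh"] = MAX_FRESH
            if found["execs"] == MAX_EXECS:          # halve all on saturation
                for e in self.cache:
                    e["execs"] >>= 1
        else:                                        # newly detected loop
            for e in self.cache:
                e["fresh"] = max(0, e["fresh"] - 1)
            if len(self.cache) < CACHE_SIZE:
                self.cache.append(new_entry(iaddr, ioffset))
            else:                                    # evict stale, least-iterated
                victim = min(self.cache, key=lambda e:
                             (e["fresh"] > 0, e["execs"] * e["avg_iter"]))
                victim.update(new_entry(iaddr, ioffset))
        # An sbb outside a loop's body ends that loop's current execution
        # and folds CurrIter into the ratio-based running average.
        for e in self.cache:
            inside = e["tag"] - e["offset"] <= iaddr <= e["tag"]
            if e["in_loop"] and not inside:
                e["in_loop"] = False
                e["avg_iter"] = (e["avg_iter"] * 7 + e["curr_iter"]) / 8
```

Feeding five sbb's of a loop at one address and then an sbb elsewhere records one execution of five iterations and folds it into that loop's running average.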

57. Dynamic Application Profiler (DAProf): Hardware Implementation

- DAProf hardware
  - Implemented fully associative, 16-way associative, and 8-way associative profiler designs in Verilog
  - Synthesized using Synopsys Design Compiler targeting UMC 0.18 µm technology

    Design               Area (mm²)   Gates    % of ARM9   Max Frequency
    Fully Associative    …            …        …           415 MHz
    16-way Associative   1.22         74,…     …           438 MHz
    8-way Associative    0.96         59,…     …           495 MHz

58. Dynamic Application Profiler (DAProf): Profiling Accuracy

- DAProf profiling accuracy
  - Compared profiling accuracy of the top ten loops for several MiBench applications against detailed simulation-based profiling
  - Results presented for the 8-way DAProf design
    - All three associativities performed similarly well
  - 90% accuracy for average iterations
  - 97% accuracy for executions
  - 95% accuracy for % execution time

59. Dynamic Application Profiler (DAProf): Profiling Accuracy – Function Call Interference

- DAProf profiling accuracy
  - Some applications are affected by function call interference
  - Loops executing inside functions called from within a loop may cause the calling loop's InLoop flag to be incorrectly reset
    - Average iterations will then be incorrectly updated

60. Dynamic Application Profiler (DAProf): Function Call Support

- Extended the DAProf profiler with function call support
  - Monitors function calls and returns to avoid function call interference
  - InFunc: flag within the Profile Cache indicating whether a loop has called a function
    - Average iterations are not updated until the function call returns

[Figure: extended DAProf architecture – adds func and ret signals from the µP and an InFunc (1) field to the Profile Cache, alongside Tag, Offset, CurrIter, AvgIter, Execs, InLoop, and Freshness]
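The gating that InFunc provides might look like the following check. The field names and exact semantics are assumptions inferred from the slide: the flag is set when a loop's body issues a call and cleared on the matching return, and while it is set, sbb's executed inside the called function must not close the calling loop.

```python
def should_close(entry, iaddr):
    """True if this sbb ends the loop's current execution: it lies outside
    the loop body AND the loop is not waiting on a called function. Field
    names and semantics are assumptions inferred from the slide."""
    inside = entry["tag"] - entry["offset"] <= iaddr <= entry["tag"]
    return entry["in_loop"] and not entry["in_func"] and not inside

# A loop at address 100 (body of 8 instructions) that has called a function:
loop = {"tag": 100, "offset": 8, "in_loop": True, "in_func": True}
# While in_func is set, sbb's inside the called function (e.g., at address
# 500) do not close the loop; once ret clears in_func, an outside sbb does.
```

Without this gate, every sbb executed inside the callee would look like an address outside the loop body and prematurely fold CurrIter into the average, which is exactly the interference described on the previous slide.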

61. Dynamic Application Profiler (DAProf): Profiling Accuracy with Function Call Support

- DAProf profiling accuracy with function call support
  - Compared profiling accuracy of the top ten loops for several MiBench applications against detailed simulation-based profiling
  - Results presented for the 8-way DAProf design
    - All three associativities performed similarly well
  - 95% accuracy for average iterations, executions, and % execution time

62. Thank you very much