Published by Millicent Snow. Modified over 9 years ago.
1
Dynamic and Autonomous Software-to-Hardware Translation for High-Performance and Low-Power Embedded Computing
Roman Lysecky
Department of Electrical and Computer Engineering, University of Arizona
rlysecky@ece.arizona.edu
http://www.ece.arizona.edu/~embedded
2
Introduction: Past & Present: Standard Software Binaries
Software binaries of the past
  Binary directly tied to the processor's ISA
  Limited portability
  Instructions were executed as specified and in order
Current software binary
  Specifies application functionality but is not specific to the underlying processor architecture
  New architectures can be developed for existing applications
[Diagram: SW Application, Compiler, SW Binary, Processor Architecture]
3
Introduction: Past & Present: Standard Software Binaries
Standard SW binary
  Enabled an ecosystem of applications, compilers, and architectures
  Provided separation of concerns
Applications: developers can focus on the application and choose an appropriate programming language to capture its functionality
Architectures: focus on improving and developing new architectures that execute the SW binary better
Compilers: focus on optimizing the application for a specific architecture
[Diagram: SW Application, Compiler, SW Binary, Processor Architecture]
4
Introduction: Past & Present: Standard Software Binaries
Processor architectures
  Many alternative architectures can exist for a given standard binary
Current software binary
  Specifies application functionality in a well-defined manner but is not specific to the underlying processor architecture
  New architectures can be developed for existing applications
[Diagram: SW Application, Compiler, SW Binary, Processor Architecture; example architectures: VLIW, SuperScalar]
5
Reconfigurable Computing: Past & Present: FPGAs
Field Programmable Gate Arrays (FPGAs)
  Reconfigurable devices that can implement any circuit simply by downloading bits
  Basic logic elements are N-input look-up tables (LUTs) and flip-flops
    LUT: small memory with N address bits that stores the truth table for any N-input logic function
  Arrays of LUTs and routing elements, called switch matrices (SMs)
[Diagram: FPGA as an array of CLBs and SMs; LUT with inputs a, b, c, d, e, f and outputs o1, o2, o3, o4]
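The LUT description above can be made concrete with a small software model. This is only a sketch of the idea, assuming a 4-input LUT whose 16-entry truth table is packed into one 16-bit word; the names `lut4_program`, `lut4_eval`, `and4`, and `xor4` are hypothetical and do not come from the slides.

```c
#include <stdint.h>

/* Evaluate a 4-input LUT. Bit i of `table` holds the output for the
   input pattern whose binary value (a b c d) equals i. */
unsigned lut4_eval(uint16_t table, unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned address = ((a & 1u) << 3) | ((b & 1u) << 2) | ((c & 1u) << 1) | (d & 1u);
    return (table >> address) & 1u;
}

/* "Program" the LUT: enumerate all 16 input patterns of an arbitrary
   4-input function and store each result in the truth table. */
uint16_t lut4_program(unsigned (*fn)(unsigned, unsigned, unsigned, unsigned))
{
    unsigned table = 0;
    for (unsigned i = 0; i < 16; i++) {
        unsigned out = fn((i >> 3) & 1u, (i >> 2) & 1u, (i >> 1) & 1u, i & 1u) & 1u;
        table |= out << i;
    }
    return (uint16_t)table;
}

/* Two example functions to load into the LUT. */
unsigned and4(unsigned a, unsigned b, unsigned c, unsigned d) { return a & b & c & d; }
unsigned xor4(unsigned a, unsigned b, unsigned c, unsigned d) { return a ^ b ^ c ^ d; }
```

Changing the stored table changes the logic function without changing the hardware, which is what "implementing a circuit by downloading bits" amounts to at the level of a single logic element.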
6
Reconfigurable Computing: Past & Present: FPGAs
FPGAs are sometimes better than microprocessors
  Provide concurrency from the bit level to the task level
C code for bit reversal:
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Compiled SW binary (excerpt):
    sll $v1[3],$v0[2],0x10
    srl $v0[2],$v0[2],0x10
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x8
    and $v1[3],$v1[3],$t5[13]
    sll $v0[2],$v0[2],0x8
    and $v0[2],$v0[2],$t4[12]
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x4
    and $v1[3],$v1[3],$t3[11]
    sll $v0[2],$v0[2],0x4
    and $v0[2],$v0[2],$t2[10]
    ...
Processor: requires between 32 and 128 cycles
FPGA: the synthesized circuit for bit reversal (original X value in, bit-reversed X value out) requires only 1 cycle, a speedup of 32x to 128x
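For reference, the five statements of the bit-reversal example can be wrapped into a complete function. The sketch below assumes a 32-bit unsigned operand; `reverse_bits32_naive` is a hypothetical bit-by-bit reference added only to cross-check the swap-based version.

```c
#include <stdint.h>

/* Swap halves, then bytes, nibbles, bit pairs, and single bits. */
uint32_t reverse_bits32(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version: move one bit per loop iteration. */
uint32_t reverse_bits32_naive(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

Every shift and mask here is a fixed permutation of bit positions, which is why a circuit can implement the whole function as a single-cycle operation while a processor has to execute the instructions one after another.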
7
Reconfigurable Computing: Past & Present: FPGAs
An FPGA can implement "circuits" simply by downloading "software" bits
[Diagram: software path: SW Application, Compiler, SW Binary, Processor; hardware path: HW Circuit, Synthesis, Bitstream, FPGA]
8
Reconfigurable Computing: Past & Present: FPGAs
FPGAs can be combined with microprocessors
Benefits of HW/SW partitioning
  Speedup of 2X to 10X
  1000X possible for some highly parallelizable applications
  Energy reduction of 25% to 95%
[Diagram: HW/SW partitioning flow: Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning, HW, SW; µP with I$ and D$ plus FPGA coprocessor]
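How a kernel speedup translates into an application speedup can be estimated with Amdahl's law. The slides do not state this formula; it is a standard estimate used here as an illustration, and the function name and example fractions are hypothetical.

```c
/* Amdahl's law: overall speedup when a fraction of the software
   execution time (the critical kernels) runs kernel_speedup times
   faster in hardware and the rest stays in software. */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For example, if kernels covering 90% of the execution time run 1000 times faster, the estimate is still just under 10X overall, which is one reason application-level speedups tend to be far smaller than kernel-level ones unless the kernels dominate execution.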
9
Reconfigurable Computing: Past & Present: FPGAs
Why aren't FPGAs common?
  Bitstream not standardized
  Programmability
Solution: hide the FPGA from the application developer, just as the underlying processor architecture is hidden
[Diagram: HW/SW partitioning flow: Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning, HW, SW; µP with I$ and D$ plus FPGA coprocessor]
10
Warp Processing: Dynamic Software to Hardware Translation
1. Application initially executes on the microprocessor
2. Profiler dynamically detects the application's kernels
3. On-chip CAD maps kernels onto the FPGA
4. Configure the FPGA and update the application binary
5. Warped execution is 2-100X faster, or consumes 75% less power
[Diagram: µP, I$, D$, Profiler, On-chip CAD, W-FPGA]
11
Warp Processing: Dynamic Software to Hardware Translation
On-chip CAD stages: Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation, Binary Update
JIT FPGA Compilation stages: Logic Synthesis, Tech. Mapping/Packing, Placement, Routing
Input: SW Binary
Outputs: Bitstream (circuit), Updated SW Binary
12
Warp Processing: Dynamic Software to Hardware Translation
Challenges of dynamic SW to HW translation
  Existing FPGAs require extremely complex CAD tools
    Designed to handle large arbitrary circuits, ASIC prototyping, etc.
    Require long execution times and very large memory usage
    Not suitable for dynamic on-chip execution
Execution time per stage: Log. Syn. 1 min; Tech. Map 1 min; Place 1-2 mins; Route 2-30 mins
Memory usage labels in the figure: 50 MB, 60 MB, 10 MB, 10 MB
[Diagram: SW Binary, Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation, Binary Update, Bitstream (circuit), Updated SW Binary]
13
Warp Processing: Dynamic Software to Hardware Translation
Solution: develop a custom Warp-oriented FPGA (i.e., a CAD-oriented FPGA)
  Careful simultaneous design of FPGA and CAD
    FPGA features evaluated for impact on CAD
    Add architecture features for SW kernels
  W-FPGA validated in collaboration with Intel Research Shuttle
  Enables development of fast, lean JIT FPGA compilation tools
    1.4 s/kernel on a 75 MHz ARM7
[Diagram: SW Binary, Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation, Binary Update, Bitstream (circuit), Updated SW Binary]
14
Warp Processing: Performance-driven Warp Processing
Warp Processing: Adaptive Computing
  Embeds compiler/synthesis within the architecture
  Autonomously adapts/optimizes the software binary at runtime to improve performance or reduce power consumption
Performance-Driven Warp Processing (Low End)
  Goal: maximize application performance over software execution
  Target: low to mid-range embedded processors, e.g., a 100-200 MHz ARM processor
  Average speedup of 7.4X across several embedded benchmark applications
15
Warp Processing: Performance-driven Warp Processing
Performance-Driven Warp Processing (High End)
  Target: high-end 624 MHz XScale processor
  Average speedup of 2.5X compared to the 624 MHz XScale processor
  Maximum speedup: 6X
16
Warp Processing: Bad Analogy Time!
[Figure: Nürburgring track, with sections labeled Control Dominant, Data Dominant, Excellent Candidate for FPGA, Excellent Candidate for µP]
17
Warp Processing: Performance-driven Warp Processing
[Figure: Nürburgring lap times: 9:31, 7:22.1, and 6:21??]
18
Warp Processing: Bad Analogy Time!
[Figure: Nürburgring track, with sections labeled Control Dominant, Data Dominant, Excellent Candidate for FPGA, Excellent Candidate for µP]
If no dependencies exist, what performance gains can we get from parallel execution of the FPGA and the µP?
19
Warp Processing: Performance-driven Warp Processing
[Figure: Nürburgring, with fuel economy labels: 26/34 MPG, 13/22 MPG, ???]
What if you cared more about power (i.e., fuel consumption)?
20
Warp Processing: Low-power Warp Processing
Low-Power Warp Processing
  Goal: reduce overall power consumption without any degradation in performance
  Leverage dynamic voltage/frequency scaling of the processor and the FPGA
[Diagram: numbered steps 1 to 7 with labels: SW execution (µP), profile, on-chip CAD, warped HW/SW execution (µP/FPGA), µP voltage/frequency scaling, FPGA frequency scaling, low-power warped execution; power and performance plots comparing SW power and SW performance with the power reduction and no performance decrease]
21
Warp Processing: Low-power Warp Processing
Low-Power Warp Processing
  Goal: reduce overall power consumption without any degradation in performance
  Leverage dynamic voltage/frequency scaling of the processor and the FPGA
  Average reduction in power consumption of 74%
  Maximum reduction: 97%
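The intuition behind trading speedup for power can be sketched with the usual first-order CMOS model, in which dynamic power is proportional to C·V²·f. The slides do not give this model or any voltage values; the functions below are an illustrative assumption, and they further assume that runtime scales inversely with clock frequency.

```c
/* Ratio of dynamic power after scaling to dynamic power before
   scaling, using the first-order approximation P_dyn ~ C * V^2 * f. */
double dynamic_power_ratio(double voltage_ratio, double frequency_ratio)
{
    return voltage_ratio * voltage_ratio * frequency_ratio;
}

/* Lowest clock frequency that still matches the software-only runtime,
   given the speedup achieved at the original clock (assumes runtime is
   inversely proportional to frequency). */
double matching_frequency_mhz(double original_mhz, double warp_speedup)
{
    return original_mhz / warp_speedup;
}
```

Under these assumptions, a 4X speedup at 200 MHz would allow the clock to drop to 50 MHz with no loss relative to software, and any voltage reduction that the lower frequency permits reduces power quadratically.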
22
Warp Processing: Benefits
Warp Processing
  Maintains the ecosystem supported by the standard software binary
    Optimizes software binary execution without developer effort, or even knowledge thereof
  Builds upon the software binary concept to leverage the benefits of FPGAs
    For those applications where FPGAs are beneficial
    Optimized at runtime, where additional information may be known
  Reduces designer effort
    No hardware design expertise needed!
23
Warp Processing: Challenges
Warp Processing & HW/SW Partitioning Challenges
  Limited support for pointers
  Limited support for dynamic memory allocation
  Limited support for function recursion
  Very limited support for floating-point operations
[Diagram: HW/SW partitioning flow: Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning, HW, SW; µP with I$ and D$ plus HW coprocessor (ASIC/FPGA)]
24
Introduction: Floating Point Software Applications
Floating point representation
  Pros
    IEEE standard 754
    Convenience: supported within most programming languages (C, C++, Java, etc.)
  Cons
    Partitioning floating point kernels directly to hardware requires large area resources and multi-cycle latencies
  Alternatively, a fixed point representation can be used to support real numbers
Single precision floating point: S (1 bit) | E (8 bits) | M (23 bits)
    void Reference_IDCT(short* block) {
        int i, j, k, v;
        float part_prod, tmp[64];
        for (i=0; i<8; i++)
            for (j=0; j<8; j++) {
                part_prod = 0.0;
                for (k=0; k<8; k++) {
                    part_prod += c[k][j]*block[8*i+k];
                }
                tmp[8*i+j] = part_prod;
            }
        ...
    }
25
Introduction: Fixed Point Software Applications
Fixed point representation
  Pros
    Simple and fast hardware implementation
    Mostly equivalent to integer operations
  Cons
    No direct support within most programming languages
    Requires the application to be converted to a fixed point representation
Fixed point (32-bit, 12.20): I (12 bits) | F (20 bits)
Fixed point version of the floating point Reference_IDCT:
    typedef long fixed;
    #define PRECISION_AMOUNT 20
    void Reference_IDCT(short* block) {
        int i, j, k, v;
        fixed part_prod, tmp[64];
        long long prod;
        for (i=0; i<8; i++)
            for (j=0; j<8; j++) {
                part_prod = 0;
                for (k=0; k<8; k++) {
                    /* c[k][j] and the converted sample each carry PRECISION_AMOUNT fractional bits */
                    prod = (long long)c[k][j] * ((long long)block[8*i+k] << PRECISION_AMOUNT);
                    /* drop the extra fractional bits produced by the multiplication */
                    part_prod += (fixed)(prod >> PRECISION_AMOUNT);
                }
                tmp[8*i+j] = part_prod;
            }
        ...
    }
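The arithmetic behind the 12.20 format can be isolated in a few helper functions. This is a minimal sketch rather than the code used in the presentation: the names `fixed_t`, `FRAC_BITS`, and the helpers are hypothetical, and division is used instead of a right shift so that negative values behave the same on every compiler.

```c
#include <stdint.h>

typedef int32_t fixed_t;   /* 12 integer bits, 20 fractional bits */
#define FRAC_BITS 20

/* Conversions; the caller must keep values inside the 12.20 range. */
fixed_t fixed_from_int(int32_t v)    { return (fixed_t)(v * (1 << FRAC_BITS)); }
fixed_t fixed_from_double(double v)  { return (fixed_t)(v * (double)(1 << FRAC_BITS)); }
double  fixed_to_double(fixed_t v)   { return (double)v / (double)(1 << FRAC_BITS); }

/* Multiply two 12.20 values. The 64-bit product carries 40 fractional
   bits, so it is scaled back by 2^20 to return to 12.20. */
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (fixed_t)(wide / ((int64_t)1 << FRAC_BITS));
}
```

Addition and subtraction of two 12.20 values need no correction at all, which is why fixed point maps so directly onto integer hardware.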
26
Introduction: Converting Floating Point to Fixed Point
Converting floating point SW to fixed point SW
  Manually or automatically convert the software to utilize a fixed point representation
  Need to determine an appropriate fixed point representation
[Diagram: Software Application (Float), Float to Fixed Conversion, Software Application (Fixed), Application Profiling, Critical Kernels, Partitioning, HW, SW]
27
Introduction: Converting Floating Point to Fixed Point
Automated tools for converting floating point to fixed point
  fixify: Belanovic, Rupp [RSP 2005]
    Statistical optimization approach that targets the signal-to-quantization-noise ratio (SQNR) of the fixed point code
  FRIDGE: Keding et al. [DATE 1998]
    Designer-specified annotations on key fixed point values can be interpolated to the remaining code
  Cmar et al. [DATE 1999]
    Annotate fixed point values with range requirements
    Iterative, designer-guided simulation framework to optimize the implementation
  Menard et al. [CASES 2002], Kum et al. [ICASSP 1999]
    Conversion for fixed-point DSP processors
[Diagram: Software Application (Float), Float to Fixed Conversion, Software Application (Fixed), Application Profiling, Critical Kernels, Partitioning, HW, SW]
28
Introduction: Converting Floating Point to Fixed Point
Converting floating point SW to fixed point HW
  Convert the floating point hardware that results from partitioning to utilize a fixed point representation
    Shi, Brodersen [DAC 2004]
    Cmar et al. [DATE 1999]
  Must still convert the software to a fixed point representation
[Diagram: SW (C/Matlab), Application Profiling, Critical Kernels (Float), Partitioning, SW (Float), HW (Fixed), Float to Fixed Conversion, SW (Fixed)]
29
Partitioning Floating Point SW to Fixed Point HW: Separate Floating Point and Fixed Point Domains
Proposed partitioning for floating point SW to fixed point HW
  Separate computation into floating point and fixed point domains
  Floating point domain
    Processor (SW), caches, and memory
    All values in memory utilize the floating point representation
  Fixed point domain
    HW coprocessors
  Float-to-Fixed and Fixed-to-Float converters at the boundary between SW/memory and HW perform the conversion
[Diagram: floating point domain (µP, I$, D$) and fixed point domain (HW coprocessors, ASIC/FPGA), joined by Float-to-Fixed and Fixed-to-Float converters]
30
Partitioning Floating Point SW to Fixed Point HW: Separate Floating Point and Fixed Point Domains
Potential benefits
  No need to rewrite the initial floating point software
  Final software can utilize floating point
  Efficient fixed point implementation
  Can treat floating point values as integers during partitioning
Still requires determining the appropriate fixed point representation
  Can be accomplished using existing methods or specified directly by the designer
[Diagram: Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning, HW (Integer), Fixed Point Conversion, HW (Fixed), SW (Float); optional Floating Point Profiling producing the Fixed Point Representation]
31
Partitioning Floating Point SW to Fixed Point HW: Float-to-Fixed and Fixed-to-Float Converters
Float-to-Fixed and Fixed-to-Float converters
  Implemented as configurable Verilog modules
  Configurable floating point options: FloatSize, MantissaBits, ExponentBits
  Configurable fixed point options: FixedSize, RadixPointSize, RadixPoint
  RadixPoint can be implemented as an input or as a parameter
[Diagram: Float-to-Fixed converter. Inputs: Float (S, E, M; FloatSize bits) and RadixPoint (RadixPointSize bits). Blocks: Special Cases, Normal Cases, Shift Calc (Dir, Amount), Shifter, Overflow Calc. Outputs: Fixed (FixedSize bits), Zero, Normal, Overflow, Exception]
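What a Float-to-Fixed converter computes can be modeled in software. The sketch below is not the Verilog module from the slides: it fixes FloatSize = 32, MantissaBits = 23, ExponentBits = 8, and FixedSize = 32, flushes denormals to zero, truncates excess fractional bits, and reports NaN, infinity, and out-of-range values through a single error return. The function name and these policies are assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Convert an IEEE 754 single precision value to 32-bit fixed point with
   `radix_point` fractional bits. Returns 0 on success, 1 on overflow or
   on a special value (NaN, infinity). */
int float_to_fixed32(float value, unsigned radix_point, int32_t *out)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);            /* reinterpret the float's bits */

    uint32_t sign     = bits >> 31;
    int      exponent = (int)((bits >> 23) & 0xffu);
    uint32_t mantissa = bits & 0x7fffffu;

    if (exponent == 0xff) return 1;                /* special case: NaN or infinity */
    if (exponent == 0) { *out = 0; return 0; }     /* zero (denormals flushed to zero) */

    uint32_t significand = mantissa | 0x800000u;   /* restore the implicit leading 1 */

    /* The significand is scaled by 2^-23, the float by 2^(exponent-127),
       and the fixed point result by 2^radix_point. */
    int shift = exponent - 127 - 23 + (int)radix_point;

    uint32_t magnitude;
    if (shift >= 8) return 1;                      /* would reach the sign bit: overflow */
    if (shift >= 0)        magnitude = significand << shift;
    else if (-shift >= 24) magnitude = 0;          /* too small to represent */
    else                   magnitude = significand >> -shift;

    *out = sign ? -(int32_t)magnitude : (int32_t)magnitude;
    return 0;
}
```

The shift direction and amount correspond to the Dir and Amount signals in the converter diagram, and the early returns play the role of the special-case and overflow logic.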
32
Partitioning Floating Point SW to Fixed Point HW: Coprocessor Interface
Hardware coprocessor interface
  Integrates the Float-to-Fixed and Fixed-to-Float converters with the memory interface
  All values read from memory are converted through the Float-to-Fixed converter
    Integer: IntDataIn
    Fixed: FixedDataIn
  Separate outputs for integer and fixed data
    Integer: WrInt, IntDataOut
    Fixed: WrFixed, FixedDataOut
[Diagram: HW coprocessor with memory-side signals Addr, BE, Rd, Wr, DataIn, DataOut and coprocessor-side signals IntDataIn, FixedDataIn, WrInt, IntDataOut, WrFixed, FixedDataOut, connected through the Float-to-Fixed and Fixed-to-Float converters]
33
Partitioning Floating Point SW to Fixed Point HW: Partitioning Tool Flow
HW/SW partitioning of floating point SW to fixed point HW
  Kernels are initially partitioned as an integer implementation
  Synthesis annotations are used to identify floating point values
[Diagram: Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning, HW (Integer), Fixed Point Conversion, HW (Fixed), SW (Float); optional Floating Point Profiling producing the Fixed Point Representation]
    module Coprocessor (Clk, Rst, Addr, BE, Rd, Wr, DataOut, DataIn);
        input Clk, Rst;
        output [31:0] Addr;
        output BE, Rd, Wr;
        output signed [31:0] DataOut;
        input signed [31:0] DataIn;

        // syn_fixed_point (p:SP)
        reg signed [31:0] p;
        reg signed [31:0] c1;

        always @(posedge Clk) begin
            // syn_fixed_point (p:SP, DataIn:SP)
            p <= p * DataIn + c1;
        end
    endmodule
34
Partitioning Floating Point SW to Fixed Point HW: Partitioning Tool Flow
HW/SW partitioning of floating point SW to fixed point HW
  Fixed point registers, computations, and memory accesses are converted to the specified representation
[Diagram: Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning, HW (Integer), Fixed Point Conversion, HW (Fixed), SW (Float); optional Floating Point Profiling producing the Fixed Point Representation]
Annotated integer implementation (input to the conversion): the Coprocessor module with ports Clk, Rst, Addr, BE, Rd, Wr, DataOut, DataIn, syn_fixed_point annotations on p and DataIn, and the update p <= p * DataIn + c1.
Fixed point implementation (output of the conversion):
    module Coprocessor (Clk, Rst, Addr, BE, Rd, WrInt, WrFixed, IntDataOut, FixedDataOut, IntDataIn, FixedDataIn);
        ...
        // Fixed point register
        reg signed [FixedSize-1:0] p;
        // Integer register
        reg signed [31:0] c1;

        always @(posedge Clk) begin
            // Fixed point multiplication and addition
            // with conversion from integer to fixed point
            p <= ((p * FixedDataIn) >> RadixPoint) + (c1 << RadixPoint);
        end
    endmodule
35
Partitioning Floating Point SW to Fixed Point HW: Experimental Results
Experimental setup
  250 MHz MIPS processor with floating point support
  Xilinx Virtex-5 FPGA
  HW coprocessors execute at the maximum frequency achieved by Xilinx ISE 9.2
Benchmarks
  MPEG2 Encode/Decode (MediaBench)
  Epic (MediaBench)
  FFT/IFFT (MiBench)
  All applications require significant floating point operations
  Both integer and floating point kernels are partitioned
36
Partitioning Floating Point SW to Fixed Point HW: Experimental Results
Floating point and fixed point representations
  Utilized fixed point representations that provide results identical to the software floating point implementation
  MPEG2 Encode/Decode (MediaBench)
    Float: integer (memory), single precision (computation)
    Fixed: 32-bit, radix of 20 (12.20)
  Epic (MediaBench)
    Float: single precision (memory), double precision (computation)
    Fixed: 64-bit, radix of 47 (17.47)
  FFT/IFFT (MiBench)
    Float: single precision (memory), double precision (computation)
    Fixed: 51-bit, radix of 30 (21.30)
37
Partitioning Floating Point SW to Fixed Point HW: Experimental Results, Float-to-Fixed and Fixed-to-Float Converters
Converter performance (RadixPoint as parameter vs. RadixPoint as input)
  Float-to-Fixed (RadixPoint parameter): 9% faster and 10% fewer LUTs compared to the input version
  Fixed-to-Float (RadixPoint parameter): 25% faster but requires 30% more LUTs than the input version
38
Partitioning Floating Point SW to Fixed Point HW: Experimental Results, Application Speedup
Application speedup
  RadixPoint parameter implementation
    Average speedup of 4.4X
    Maximum speedup of 6.8X (fft/ifft)
  RadixPoint input implementation
    Average speedup of 4.0X
    Maximum speedup of 6.2X (fft/ifft)
39
Warp Processing: Dynamically Adaptable Fixed-Point Coprocessing
Warp Processing: dynamically adaptable fixed-point coprocessors
  Float-to-Fixed and Fixed-to-Float converters open the door to dynamically adapting the fixed point representation at runtime
  RadixGen component
    Responds to various overflows and dynamically adjusts RadixPoint
      Float-to-Fixed conversion overflow
      Integer-to-Fixed conversion overflow
      Arithmetic overflow
  Performance speedups similar to the RadixPoint input implementation are achievable
[Diagram: floating point domain (µP, I$, D$) and fixed point domain (coprocessor), joined by Float-to-Fixed and Fixed-to-Float converters; RadixGen receives arithmetic, conversion, and integer overflow signals]
41
Application Profiling: HW/SW Partitioning
Hardware/software partitioning
  Profiling is a critical step within hardware/software partitioning
    Often utilized to determine critical software regions
    Frequently executed loops or functions
  Critical kernels can be re-implemented in hardware
    Speedup of 2X to 10X
    Speedup of 1000X possible
    Energy reduction of 25% to 95%
[Diagram: HW/SW partitioning flow: Software Application (C/C++), Application Profiling, Critical Kernels, Partitioning, HW, SW; µP with I$ and D$ plus HW coprocessor (ASIC/FPGA)]
42
Introduction: Application Profiling, Warp Processing
Warp Processing: dynamic hardware/software partitioning
  Dynamically re-implements critical kernels as HW within the W-FPGA
  Requires non-intrusive profiling to determine critical kernels at runtime
  Incorporated the Frequent Loop Detection Profiler [Gordon-Ross, Vahid, TC 2005]
    Monitors short backwards branches
    Maintains a small list of branch execution frequencies
    May lead to sub-optimal partitioning because it does not provide detailed loop execution statistics
43
Introduction: Application Profiling, HW/SW Partitioning
Loop iteration count alone may not provide sufficient information for accurate performance estimation
Example
  Assume we want to partition only one of the following two loops to HW:
    Kernel   Total Iterations   % Exec Time
    A        10,000             33%
    B        12,000             45%
  With profile data from the Frequent Loop Detection Profiler, kernel B appears to be the better candidate
44
Introduction: Application Profiling, Warp Processing
However, communication requirements can significantly impact overall performance
    Kernel   Total Iterations   % Exec Time   Avg Iters/Exec   Execs
    A        10,000             33%           5000             2
    B        12,000             45%           2                6000
Kernel A may in fact be the better choice
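A rough cost model shows why the number of executions matters. Only the execution counts (2 and 6000) and the time shares (33% and 45%) come from the table; the total cycle count, the hardware speedup, and the communication cost per execution are invented for illustration, and all names are hypothetical.

```c
typedef struct {
    long sw_cycles;   /* cycles the kernel takes when run in software */
    long execs;       /* number of times the loop is entered */
} kernel_profile_t;

/* Estimated total application cycles after moving one kernel to
   hardware: the kernel itself speeds up, but every execution pays a
   fixed communication cost to start the coprocessor and move data. */
long partitioned_cycles(long total_sw_cycles, kernel_profile_t k,
                        long hw_speedup, long comm_cycles_per_exec)
{
    long hw_cycles = k.sw_cycles / hw_speedup + k.execs * comm_cycles_per_exec;
    return total_sw_cycles - k.sw_cycles + hw_cycles;
}
```

With an assumed 1,000,000 total cycles, a 10X hardware speedup, and 100 cycles of communication per execution, moving kernel A (330,000 cycles, 2 executions) gives 703,200 cycles, while moving kernel B (450,000 cycles, 6000 executions) gives 1,195,000 cycles, which is slower than running everything in software.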
45
Introduction: Application Profiling, Goal: Non-Intrusive Profiling
Non-intrusive application profiling
  Goal: profile the application at runtime to determine detailed loop execution statistics with no impact on application execution
  Many applications cannot tolerate profiling overhead at runtime
    E.g., real-time and embedded systems
    May lead to missed deadlines and potentially system failure
46
Introduction: Application Profiling, Existing Profiling Methods
Software based profiling
  Instrumenting: insert code directly within the software
    E.g., monitor branches, basic blocks, functions, etc.
    Intrusive: increases code size and introduces runtime overhead
  Statistical sampling
    Periodically interrupt the processor, or execute an additional software task, to monitor the program counter
    Statistically determine the application profile
    Very good accuracy with reduced overhead compared to instrumentation
    Intrusive: introduces runtime overhead
47
Introduction: Application Profiling, Existing Profiling Methods
Hardware based profiling
  Processor support: event counters
    Many processors include event counters that can be used to profile an application
    Intrusive: requires additional software support to process the event counters
  JTAG (Joint Test Action Group)
    Standard interface for reading registers within hardware devices
    Intrusive: requires the processor to be halted to read the values
48
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
Dynamic Application Profiler (DAProf)
  Non-intrusively monitors both loop executions and iterations
  Monitors the processor's instruction bus and branch execution behavior to build the application profile
  Requires a short backwards branch (sbb) signal from the microprocessor
[Diagram: µP with I$ and D$ sends iAddr and sbb to DAProf (FPGA/ASIC). DAProf consists of the Profiler FIFO (inputs sbb, iAddr, iOffset), the Profiler Controller, and the Profile Cache with fields Tag (30), Offset (8), CurrIter (10), AvgIter (13), Execs (16), InLoop (1), Freshness (3) and outputs found, foundIndex, replaceIndex]
49
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
Profiler FIFO
  Small FIFO that stores the instruction address (iAddr) and instruction offset (iOffset) of all executed sbb's
  Synchronizes between the processor's execution frequency and the slower internal profiler frequency
50
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
Profile Cache
  Tag: address of the short backwards branch
  Offset: negative branch offset
    Corresponds to the size of the loop
    Currently supports loops with fewer than 256 instructions
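The notion of a short backwards branch can be illustrated with a small classification function. The 256-instruction limit comes from the slide; the fixed 4-byte instruction size and the function name are assumptions made for this sketch, and in the actual design the sbb signal is provided by the microprocessor.

```c
#include <stdint.h>

#define MAX_LOOP_INSTRUCTIONS 256u   /* limit stated for the 8-bit Offset field */
#define INSTRUCTION_BYTES     4u     /* assumed fixed-width instruction set */

/* Returns 1 when a taken branch jumps backwards over fewer than
   MAX_LOOP_INSTRUCTIONS instructions, i.e., when it looks like the
   closing branch of a small loop. */
int is_short_backwards_branch(uint32_t branch_addr, uint32_t target_addr)
{
    if (target_addr > branch_addr) return 0;   /* forward branch */
    uint32_t distance = (branch_addr - target_addr) / INSTRUCTION_BYTES;
    return distance < MAX_LOOP_INSTRUCTIONS;
}
```

The branch address becomes the Tag of a profile cache entry and the backward distance becomes its Offset, so together they delimit the address range of the loop body.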
51
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
Profile Cache
  CurrIter: number of iterations for the current loop execution
  AvgIter: average iterations per execution of the loop
    13-bit fixed point representation with 10 integer bits and 3 fractional bits
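The AvgIter update used by the profiler, AvgIter = (AvgIter*7 + CurrIter)/8, can be sketched for the 10.3 format as follows. Scaling CurrIter by 2^3 before averaging is an assumption about how the integer count is aligned with the fixed point average; the function name is hypothetical.

```c
#include <stdint.h>

#define AVG_FRAC_BITS 3   /* AvgIter has 10 integer and 3 fractional bits */

/* Ratio-based running average: the new value is 7/8 of the old average
   plus 1/8 of the iteration count of the execution that just ended.
   Both operands fit their fields, so the result fits in 13 bits. */
uint16_t avg_iter_update(uint16_t avg_iter, uint16_t curr_iter)
{
    uint32_t curr_fixed = (uint32_t)curr_iter << AVG_FRAC_BITS;  /* integer to 10.3 */
    return (uint16_t)(((uint32_t)avg_iter * 7u + curr_fixed) / 8u);
}
```

Multiplying by 7 and dividing by 8 need only a shift, a subtraction, and an addition in hardware, and no history of past executions has to be stored. Because a new entry starts with AvgIter = 0, the average needs several executions to approach the true value.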
52
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
Profile Cache
  InLoop: flag indicating that the loop is currently executing
    Utilized to distinguish between loop iterations and loop executions
  Freshness: indicates how recently a loop has been executed
    Utilized to ensure that newly identified loops are not immediately replaced in the profile cache
53
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
Profile Cache outputs
  found: indicates whether the current loop (identified by iAddr) is found within the profile cache
  foundIndex: location of the loop within the profile cache, if found
  replaceIndex: loop that will be replaced upon a new loop execution
    The loop with the fewest total iterations among those not identified as fresh
54
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
Profiler Controller
  If the loop is found within the cache
    If the InLoop flag is set: new iteration
      Increment current iterations
    Otherwise: new execution
      Increment executions
      Set current iterations to 1
      Set the InLoop flag
      Update Freshness
    DAProf (iAddr, iOffset, found, foundIndex, replaceIndex):
      if ( found )
        if ( InLoop[foundIndex] )
          CurrIter[foundIndex] += 1
        else {
          for all i, Fresh[i] = Fresh[i] - 1
          Execs[foundIndex] = Execs[foundIndex] + 1
          CurrIter[foundIndex] = 1
          InLoop[foundIndex] = 1
          Fresh[foundIndex] = MaxFresh
          if ( Execs[foundIndex] == MaxExecs )
            for all i, Execs[i] = Execs[i] >> 1
        }
      else {
        for all i, Fresh[i] = Fresh[i] - 1
        Tag[replaceIndex] = iAddr
        Offset[replaceIndex] = iOffset
        CurrIter[replaceIndex] = 1
        AvgIter[replaceIndex] = 0
        Execs[replaceIndex] = 1
        InLoop[replaceIndex] = 1
        Fresh[replaceIndex] = MaxFresh
      }
      for all i,
        if ( InLoop[i] && !( iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i] ) ) {
          InLoop[i] = 0
          AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8
        }
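The controller pseudocode can be turned into a small software model for experimentation. This is a sketch, not the Verilog implementation: the 4-entry cache, the `valid` flag, the lookup and replacement helpers that stand in for the cache's found, foundIndex, and replaceIndex outputs, the saturation of Freshness at zero, and the scaling of CurrIter into the 10.3 AvgIter format are all assumptions. A loop is closed only when its InLoop flag is set and the branch lies outside its address range.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE 4        /* hypothetical size, kept small for the sketch */
#define MAX_FRESH  7u       /* largest value of the 3-bit Freshness field */
#define MAX_EXECS  0xffffu  /* largest value of the 16-bit Execs field */

typedef struct {
    uint32_t tag;        /* address of the short backwards branch */
    uint32_t offset;     /* magnitude of the branch offset (loop size) */
    uint16_t curr_iter;  /* iterations in the current execution */
    uint16_t avg_iter;   /* 10.3 fixed point */
    uint16_t execs;
    uint8_t  in_loop, fresh, valid;
} entry_t;

typedef struct { entry_t e[CACHE_SIZE]; } daprof_t;

void daprof_init(daprof_t *p) { memset(p, 0, sizeof *p); }

static void age_all(daprof_t *p)
{
    for (int i = 0; i < CACHE_SIZE; i++)
        if (p->e[i].fresh > 0) p->e[i].fresh--;
}

/* Stand-in for the cache's found/foundIndex outputs. */
static int find_index(const daprof_t *p, uint32_t i_addr)
{
    for (int i = 0; i < CACHE_SIZE; i++)
        if (p->e[i].valid && p->e[i].tag == i_addr) return i;
    return -1;
}

/* Stand-in for replaceIndex: an empty entry if any, else the non-fresh
   entry with the fewest total iterations, else entry 0 (assumed). */
static int replace_index(const daprof_t *p)
{
    int best = -1;
    uint32_t best_total = 0;
    for (int i = 0; i < CACHE_SIZE; i++) {
        const entry_t *e = &p->e[i];
        if (!e->valid) return i;
        if (e->fresh != 0) continue;
        uint32_t total = (uint32_t)e->execs * e->avg_iter;
        if (best < 0 || total < best_total) { best = i; best_total = total; }
    }
    return best < 0 ? 0 : best;
}

/* Process one executed short backwards branch. */
void daprof_sbb(daprof_t *p, uint32_t i_addr, uint32_t i_offset)
{
    int idx = find_index(p, i_addr);
    if (idx >= 0 && p->e[idx].in_loop) {
        p->e[idx].curr_iter++;                     /* new iteration */
    } else if (idx >= 0) {                         /* new execution */
        age_all(p);
        entry_t *e = &p->e[idx];
        e->execs++;
        e->curr_iter = 1;
        e->in_loop = 1;
        e->fresh = MAX_FRESH;
        if (e->execs == MAX_EXECS)                 /* rescale before overflow */
            for (int i = 0; i < CACHE_SIZE; i++) p->e[i].execs >>= 1;
    } else {                                       /* loop not yet cached */
        age_all(p);
        entry_t *e = &p->e[replace_index(p)];
        e->tag = i_addr;  e->offset = i_offset;
        e->curr_iter = 1; e->avg_iter = 0; e->execs = 1;
        e->in_loop = 1;   e->fresh = MAX_FRESH;    e->valid = 1;
    }
    /* Close every executing loop that does not contain this branch. */
    for (int i = 0; i < CACHE_SIZE; i++) {
        entry_t *e = &p->e[i];
        int inside = i_addr <= e->tag && e->tag - i_addr <= e->offset;
        if (e->valid && e->in_loop && !inside) {
            e->in_loop = 0;
            e->avg_iter = (uint16_t)((e->avg_iter * 7u + ((uint32_t)e->curr_iter << 3)) / 8u);
        }
    }
}
```

Feeding the model five branches at one address, one at a second address, and five more at the first yields two executions of the first loop with five iterations in progress, and one completed single-iteration execution of the second.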
55
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
Profiler Controller
  If the loop is not found within the cache
    Replace the profile cache entry
    Initialize executions and current iterations to 1
    Set the InLoop flag
    Update Freshness
56
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
Profiler Controller
  If the current sbb (iAddr) is detected outside a loop within the profile cache AND the loop's InLoop flag is set
    Reset the InLoop flag
    Update average iterations
  Ratio-based average iteration calculation
    Simple hardware requirements
    Good accuracy for the applications considered
57
Dynamic Application Profiler (DAProf): Hardware Implementation
DAProf hardware
  Implemented fully associative, 16-way associative, and 8-way associative profiler designs in Verilog
  Synthesized using Synopsys Design Compiler, targeting UMC 0.18 µm
    Design               Area (mm^2)   Gates     % of ARM9   Maximum Frequency
    Fully Associative    1.75          107,477   20.00%      415 MHz
    16-way Associative   1.22          74,744    14.00%      438 MHz
    8-way Associative    0.96          59,036    11.00%      495 MHz
58
Dynamic Application Profiler (DAProf): Profiling Accuracy
DAProf profiling accuracy
  Measured for the top ten loops of several MiBench applications, compared to detailed simulation-based profiling
  Results presented for the 8-way DAProf design
    All three associativities performed similarly well
  90% accuracy for average iterations
  97% accuracy for executions
  95% accuracy for % execution time
59
Dynamic Application Profiler (DAProf): Profiling Accuracy, Function Call Interference
DAProf profiling accuracy
  Some applications are affected by function call interference
    Loop execution within functions called from within a loop may lead to the InLoop flag being incorrectly reset for the calling loop
    Average iterations will then be incorrectly updated
60
Dynamic Application Profiler (DAProf): Function Call Support
Extended DAProf profiler with function call support
  Monitors function calls and returns to avoid function call interference
  InFunc: flag within the profile cache that indicates whether a loop has called a function
    Average iterations are not updated until the function call returns
[Diagram: extended DAProf. The Profiler FIFO additionally receives func and ret signals; Profile Cache fields are labeled Tag (30), Offset (30), CurrIter (10), AvgIter (13), Execs (16), InLoop (1), Freshness (3), InFunc (1)]
61
Dynamic Application Profiler (DAProf): Profiling Accuracy with Function Call Support
DAProf profiling accuracy with function call support
  Measured for the top ten loops of several MiBench applications, compared to detailed simulation-based profiling
  Results presented for the 8-way DAProf design
    All three associativities performed similarly well
  95% accuracy for average iterations, executions, and % execution time
62
Thank you very much (Arigatou gozaimasu)