Presentation is loading. Please wait.

Presentation is loading. Please wait.

Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization Ajay Nair, Roman Lysecky Department of Electrical and Computer.

Similar presentations


Presentation on theme: "Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization Ajay Nair, Roman Lysecky Department of Electrical and Computer."— Presentation transcript:

1 Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization Ajay Nair, Roman Lysecky Department of Electrical and Computer Engineering University of Arizona Tucson, AZ USA {ajaynair, rlysecky}@ece.arizona.edu

2 University of Arizona 2  Application profiling is useful for many purposes  Often used to identify frequently executed code regions  Allowing a designer to focus on optimizing those regions  Map frequently executed code and data regions to non- interfering cache regions  Used within binary translation approaches to store translation results  x86, Transmeta Crusoe  Can be used to create optimized SW or HW implementations selected at runtime  And many others…. Introduction Application Profiling

3 University of Arizona 3 Introduction Application Profiling – HW/SW Partitioning  Hardware/software Partitioning  Profiling is a critical step within hardware/software partitioning  Often utilized to determine critical software region  Frequently executed loops or functions  Critical kernels can be re- implemented in hardware  Speedup of 2X to 10X  Speedup of 1000X possible  Energy reduction of 25% to 95% Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)

4 University of Arizona 4 Introduction Application Profiling – Warp Processing Overview µPµP On-chip CAD I$ D$ Profiler W-FPGA A PPLICATION I NITIALLY E XECUTES ON M ICROPROCESSOR 1 P ROFILER D YNAMICALLY D ETECTS A PPLICATION’S K ERNELS 2 O N- C HIP CAD M APS K ERNELS ONTO FPGA 3 W ARPED E XECUTION IS 2-100X F ASTER – OR – C ONSUMES 75% L ESS P OWER 5 C ONFIGURE FPGA AND U PDATE A PPLICATION B INARY 4

5 University of Arizona 5 Introduction Application Profiling – Warp Processing  Warp Processing - Dynamic Hardware/Software Partitioning  Dynamically re-implements critical kernels as HW within W-FPGA  Requires non-intrusive profiling to determine critical kernels at runtime  Incorporated Frequent Loop Detection Profiler [Gordon-Ross, Vahid – TC 2005]  Monitors short backwards branches  Maintains a small list of branch executions frequency  May lead to sub-optimal partitioning as it does not provide detailed loop execution statistics

6 University of Arizona 6 Introduction Application Profiling – HW/SW Partitioning  Loop iteration count alone may not provide sufficient information for accurate performance estimation  Example  Assume we want to partition only one of the following two loops to HW:  With profile data from Frequent Loop Detection Profiler, kernel B appears to be the better candidate KernelTotal Iterations% Exec Time A10,00033% B12,00045% Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)

7 University of Arizona 7 Introduction Application Profiling – Warp Processing  However, communication requirements can significantly impact overall performance  Kernel A may in fact be the better choice KernelTotal Iterations% Exec Time A10,00033% B12,00045% Avg Iters/ExecExecs 50002 26000 Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)

8 University of Arizona 8 Introduction Application Profiling – Goal: Non-Intrusive Profiling  Non-intrusive Application Profiling  Goal: Profile application at runtime to determine detailed loop execution statistics with no impact on application execution  Runtime overhead cannot be tolerated by many applications at runtime  E.g. Real-time and embedded systems  May lead to missed deadlines and potentially system failure Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)

9 University of Arizona 9 Introduction Application Profiling – Existing Profiling Methods  Software Based Profiling  Instrumenting - insert code directly within software  E.g., monitor branches, basic blocks, functions, etc.  Intrusive: Increases code size and introduces runtime overhead  Statistical Sampling  Periodically interrupt processor – or execute additional software task – to monitor program counter  Statistically determine the application profile  Very good accuracy with reduced overhead compared to instrumentation  Intrusive: Introduces runtime overhead Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)

10 University of Arizona 10 Introduction Application Profiling – Existing Profiling Methods  Hardware Based Profiling  Processor Support – Event Counters  Many processors include event counters that can be used to profile an application  Intrusive: Requires additional software support to process event counters to profile application  JTAG – Joint Test Action Group  Standard interface for reading register within hardware devices  Intrusive: Requires the processor to be halted to read the values Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)

11 University of Arizona 11 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling  Dynamic Application Profiler (DAProf)  Non-intrusively monitors both loop executions and iterations  Monitors processor’s instruction bus and branch execution behavior to build application profile  Requires a short backwards branch (sbb) signal from microprocessor µPµP I$ D$ DAProf iAddr sbb FPGA/ASIC Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)

12 University of Arizona 12 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling  Profiler FIFO  Small FIFO that stores the instruction address (iAddr) and instruction offset (iOffset) of all executed sbb’s  Synchronizes between processor execution frequency and slower internal profiler frequency Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)

13 University of Arizona 13 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling  Profile Cache  Tag: Address of the short backwards branch  Offset: Negative branch offset  Corresponds to the size of the loop  Currently supports loops with less than 256 instructions Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)

14 University of Arizona 14 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling  Profile Cache  CurrIter: Number of iterations for the current loop execution  AvgIter: Average Iterations per execution of the loop  13-bit fixed point representation with 10 bits integer and 3 bits fractional Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)

15 University of Arizona 15 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling  Profile Cache  InLoop: Flag indicating loop is currently executing  Utilized to distinguish between loop iterations and loop executions  Freshness: Indicates how recently a loop has been executed  Utilized to ensure newly identified loops are not immediately replaced from the profile cache Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)

16 University of Arizona 16 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling  Profile Cache Outputs  found: Indicates if current loop (identified by iAddr) is found within the profile cache  foundIndex: Location of loop within profile cache, if found  replaceIndex: Loop that will be replaced upon new loop execution  Loop not identified as fresh with least total iterations Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)

17 University of Arizona 17 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling  Profiler Controller  If loop is found within cache  If InLoop flag is set  New iteration  Increment current iterations  Otherwise  New execution  Increment executions  Set current iterations to 1  Set InLoop flag  Update Freshness DAProf (iAddr, iOffset, found, foundIndex, replaceIndex): if ( found ) if ( InLoop[foundIndex] ) CurrIter[foundIndex] += 1 else { for all i, Fresh[i] = Fresh[i] – 1 Execs[foundIndex] = Execs[foundIndex] + 1 CurrIter[foundIndex] = 1 InLoop[foundIndex] = 1 Fresh[foundIndex] = MaxFresh if ( Execs[foundIndex] = MaxExecs ) for all i, Execs[i] = Execs[i] >> 1 } else { for all i, Fresh[i] = Fresh[i] – 1 Tag[replaceIndex] = iAddr Offset[replaceIndex] = iOffset CurrIter[replaceIndex] = 1 AvgIter[replaceIndex] = 0 Execs[replaceIndex] = 1 InLoop[replaceIndex] = 1 Fresh[replaceIndex] = MaxFresh } for all i, if !( inLoop[i] && iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i] ) { InLoop[i] = 0 AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8 }

18 University of Arizona 18 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling  Profiler Controller  If loop is not found within cache  Replace profile cache entry  Initialize execution and current iterations to 1  Set InLoop flag  Update Freshness DAProf (iAddr, iOffset, found, foundIndex, replaceIndex): if ( found ) if ( InLoop[foundIndex] ) CurrIter[foundIndex] += 1 else { for all i, Fresh[i] = Fresh[i] – 1 Execs[foundIndex] = Execs[foundIndex] + 1 CurrIter[foundIndex] = 1 InLoop[foundIndex] = 1 Fresh[foundIndex] = MaxFresh if ( Execs[foundIndex] = MaxExecs ) for all i, Execs[i] = Execs[i] >> 1 } else { for all i, Fresh[i] = Fresh[i] – 1 Tag[replaceIndex] = iAddr Offset[replaceIndex] = iOffset CurrIter[replaceIndex] = 1 AvgIter[replaceIndex] = 0 Execs[replaceIndex] = 1 InLoop[replaceIndex] = 1 Fresh[replaceIndex] = MaxFresh } for all i, if !( inLoop[i] && iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i] ) { InLoop[i] = 0 AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8 }

19 University of Arizona 19 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling  Profiler Controller  If current sbb (iAddr) is detected outside a loop within the profile cache  AND, the loop’s InLoop flag is set  Reset InLoop flag  Update average iterations  Ratio based average iteration calculation  Simple hardware requirements  Good accuracy for applications considered DAProf (iAddr, iOffset, found, foundIndex, replaceIndex): if ( found ) if ( InLoop[foundIndex] ) CurrIter[foundIndex] += 1 else { for all i, Fresh[i] = Fresh[i] – 1 Execs[foundIndex] = Execs[foundIndex] + 1 CurrIter[foundIndex] = 1 InLoop[foundIndex] = 1 Fresh[foundIndex] = MaxFresh if ( Execs[foundIndex] = MaxExecs ) for all i, Execs[i] = Execs[i] >> 1 } else { for all i, Fresh[i] = Fresh[i] – 1 Tag[replaceIndex] = iAddr Offset[replaceIndex] = iOffset CurrIter[replaceIndex] = 1 AvgIter[replaceIndex] = 0 Execs[replaceIndex] = 1 InLoop[replaceIndex] = 1 Fresh[replaceIndex] = MaxFresh } for all i, if !( inLoop[i] && iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i] ) { InLoop[i] = 0 AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8 }

20 University of Arizona 20 Dynamic Application Profiler (DAProf) Hardware Implementation  DAProf Hardware  Implemented fully associative, 16-way associative, and 8-way associative profiler design in Verilog  Synthesized using Synopsys Design Compiler targeted at UMC.18µm Design Area Maximum Frequency mm 2 Gates% of ARM 9 Fully Associative1.75107,47720.00%415 MHz 16-way Associative1.2274,74414.00%438 MHz 8-way Associative0.9659,03611.00%495 MHz Profiler FIFO Profiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)

21 University of Arizona 21 Dynamic Application Profiler (DAProf) Profiling Accuracy  DAProf Profiling Accuracy  Compared profiling accuracy of top tens loops for several MiBench applications – compared to detailed simulation based profiling  Results presented for 8-way DAProf design  All three associativity performed similarly well 90% accuracy for average iterations 97% accuracy for executions 95% accuracy for % execution time

22 University of Arizona 22 Dynamic Application Profiler (DAProf) Profiling Accuracy – Function Call Interference  DAProf Profiling Accuracy  Some applications are affected by function call interference  Loop execution within functions called from within a loop may lead to InLoop flag being incorrectly reset for calling loop  Average iterations will be incorrectly updated Function Call Interference

23 University of Arizona 23 Current Work – Dynamic Application Profiler Function Call Support  Extended DAProf Profiler with Function Call Support  Monitors function calls and returns to avoid function call interference  InFunc: Flag within Profile Cache to determine is a loop has called a function  Will not update average iterations until function call returns Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3) Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (30) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) F RESH- N ESS (3) FOUND I NDEX REPLACE I NDEX FOUND SBB FUNC RET I A DDR SBB FUNC RET I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) I N F UNC (1)

24 University of Arizona 24 Current Work – Dynamic Application Profiler Profiling Accuracy with Function Call Support  DAProf Profiling Accuracy with Function Support  Compared profiling accuracy of top tens loops for several MiBench applications – compared to detailed simulation based profiling  Results presented for 8-way DAProf design  All three associativity performed similarly well 95% accurate for average iterations, executions, and % execution time

25 University of Arizona 25 Conclusions  Conclusions  Developed a non-intrusive dynamic application profiler (DAProf)  Profiles an application at runtime providing detailed loop execution characteristics  Developed efficient methods for identifying loop executions from loop iterations  Developed Freshness based replacement policy to ensure newly executed loops are not immediately replaced  Developed efficient method for monitoring function call executions  Achieves excellent profiling accuracy  On average, better than 95% accurate for average iterations per executions, loops executions, and estimated percentage of total application execution time  Efficient Hardware Implementation  Area requirement as little as 11% of an ARM9 processor  Maximum operating frequency of 495 MHz

26 University of Arizona 26 Current/Future Work  Current/Future Work  Current DAProf performs excellently for profiling single threaded software applications  However, multitasked/multithreaded applications may lead to context switch interference  Similar implications as that of function call interferences  Need for task/thread aware profiling


Download ppt "Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization Ajay Nair, Roman Lysecky Department of Electrical and Computer."

Similar presentations


Ads by Google