1
Karthik Shankar, Roman Lysecky
Non-Intrusive Dynamic Application Profiling for Multitasked Applications
Karthik Shankar, Roman Lysecky
Department of Electrical and Computer Engineering
University of Arizona, Tucson, AZ
Speaker note: Don't mention "THEProf" here.
2
Introduction Application Profiling
Application profiling is useful for many purposes
- Often used to identify frequently executed code regions
  - Allows a designer to focus on optimizing those regions [Graham et al. – SCC 1982]
- Map frequently executed code and data regions to non-interfering cache regions
- Used within binary translation approaches to store translation results [Ebcioglu et al. – TC 2001]
  - x86, Transmeta Crusoe
- Can be used to create optimized SW or HW implementations selected at runtime [Lakshminarayana et al. – DAC 1999]
Speaker note: Spend a little more time on this slide.
3
Introduction Application Profiling – HW/SW Partitioning
Hardware/Software Partitioning
- Profiling is a critical step within hardware/software partitioning
- Often utilized to determine critical software regions
  - Frequently executed loops or functions
- Critical kernels can be re-implemented in hardware
  - Speedup of 2X to 10X
  - Speedup of 1000X possible
  - Energy reduction of 25% to 95%
[Figure: partitioning flow. Software Application (C/C++) -> Application Profiling -> Critical Kernels -> Partitioning -> SW on µP (I$, D$) and HW on HW coprocessor (ASIC/FPGA)]
Speaker note: Profiling is just one method that can be used in HW/SW partitioning. It is a good method, though.
4
Introduction Application Profiling – Warp Processing Overview
Warp Processing - Dynamic Hardware/Software Partitioning [Lysecky et al. – TODAES 2006][Lysecky – DATE 2007]
- Dynamically re-implements critical kernels as HW within W-FPGA
- Requires profiling to determine critical kernels at runtime
Steps shown in the figure:
1. Application initially executes on microprocessor
2. Profiler dynamically detects application's kernels
3. On-chip CAD maps kernels onto FPGA
4. Configure FPGA and update application binary
5. Warped execution is 2-100X faster, or consumes 75% less power
[Figure: warp processor. µP (I$, D$), Profiler, W-FPGA, On-chip CAD]
Speaker note: Run time profiling is important.
5
Introduction Application Profiling – Existing Profiling Methods
Software Based Profiling
- Instrumenting: insert code directly within software [Hazelwood & Klauser – CASES 2006]
  - Intrusive: increases code size and introduces runtime overhead
- Statistical Sampling [Dean et al. – MICRO 1997]
  - Periodically monitor program counter
  - Very good accuracy with reduced overhead
  - Intrusive: introduces runtime overhead
Need for Non-intrusive Profiling
- Runtime overhead cannot be tolerated by many applications
  - E.g. real-time and embedded systems
  - May lead to missed deadlines, potentially system failure, or inaccurate profiling results due to changes in execution behavior
Speaker note: Advantages of statistical profiling.
6
Introduction Application Profiling – Existing Profiling Methods
Hardware Based Profiling
- Processor Event Counters [Conte et al. – IJPP 1996]
  - Processor's event counters can be used to profile an application
  - Intrusive: requires additional software support to process event counters to profile application
- Frequent Loop Detection Profiler [Gordon-Ross & Vahid – TC 2005]
  - Non-intrusively monitors short backwards branches
  - Maintains a list of relative branch execution frequencies, i.e. loop iteration counts
  - May lead to sub-optimal partitioning as it does not provide detailed loop execution statistics
Speaker note: Event counters are limited in number and require area. They also need additional software to configure and maintain them.
7
Introduction Application Profiling – Existing Profiling Methods
Loop iteration count alone may not provide sufficient information for accurate performance estimation
Example
- Assume we want to partition only one of the following two loops to HW:

Kernel   Total Iterations   % Exec Time
A        10,000             33%
B        12,000             45%

- With profile data from the Frequent Loop Detection Profiler, kernel B appears to be the better candidate
8
Introduction Application Profiling – Existing Profiling Methods
However, communication requirements can significantly impact overall performance
- Kernel A may in fact be the better choice
- The breakdown between iterations and executions plays an important role in communication overhead

Kernel   Total Iterations   % Exec Time   Avg Iters/Exec   Execs
A        10,000             33%           5000             2
B        12,000             45%           2                6000
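The point of this example can be checked with a rough estimate. The sketch below is only an illustration under stated assumptions: the total cycle count, the hardware speedup, and the per-execution communication cost are hypothetical values chosen for the example, not figures from the presentation. Only the execution-time percentages and execution counts come from the table.

```python
def partitioned_time(total_cycles, exec_fraction, executions,
                     hw_speedup, comm_cycles_per_exec):
    """Estimate total cycles after moving one kernel to hardware.

    The kernel's software time is divided by the hardware speedup, and a
    fixed communication cost is paid once per loop execution.
    """
    kernel_sw = total_cycles * exec_fraction
    kernel_hw = kernel_sw / hw_speedup + executions * comm_cycles_per_exec
    return total_cycles - kernel_sw + kernel_hw

# Hypothetical parameters (assumptions, not from the slides).
TOTAL_CYCLES = 1_000_000   # software-only run time
HW_SPEEDUP = 10            # kernel speedup in hardware
COMM_CYCLES = 50           # communication cost per loop execution

# Percentages and execution counts taken from the table above.
time_a = partitioned_time(TOTAL_CYCLES, 0.33, 2, HW_SPEEDUP, COMM_CYCLES)
time_b = partitioned_time(TOTAL_CYCLES, 0.45, 6000, HW_SPEEDUP, COMM_CYCLES)

print(f"Kernel A in HW: {time_a:,.0f} cycles")
print(f"Kernel B in HW: {time_b:,.0f} cycles")
```

Under these assumed numbers, kernel A finishes in roughly 703,100 cycles and kernel B in roughly 895,000, because B pays the communication cost 6000 times. With a much smaller communication cost the ranking flips back to B, which is why the execution count matters.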
9
Introduction Application Profiling – Existing Profiling Methods
Dynamic Application Profiler [Nair & Lysecky – CASES 2008]
- Non-intrusively profiles an application at runtime to determine detailed loop execution statistics
- Provides breakdown of loop executions versus average iterations per execution
- Achieves profiling accuracy of 90% with as little as 11% area overhead
Limitations
- No support for function calls
  - Function call interference can lead to decreased profiling accuracy
  - E.g. results in 35% error in reported average iterations for mad
- No support for multitasked applications
10
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Extended Dynamic Application Profiler (DAProf)
- Non-intrusively monitors
  - Instruction bus
  - Short backwards branches
  - Function calls and returns
  - Context switches
- Provides detailed profile of loop executions and average iterations per execution
[Figure: DAProf block diagram. The µP (I$, D$) drives iAddr, iOffset, sbb, func, and ret into DAProf on the FPGA/ASIC. Blocks: Profiler Task Filter (outputs CS), Profiler FIFO, Profiler Controller, Profile Cache. Profile cache fields: Tag (30), Offset (8), CurrIter (14), AvgIter (17), Execs (16), InLoop (1), InFunc, InCS. Profile cache outputs: found, foundIndex, replaceIndex.]
Speaker note: Signals can be developed from the DAProf itself, just by monitoring the address/instruction bus.
11
Dynamic Application Profiler Non-intrusive Dynamic Application Profiling
Profile Task Filter
- Programmable component storing start and end address of each task (or region) to be profiled
- Monitors instruction bus to detect context switch
  - Asserts CS signal if iAddr falls outside of the current task's address range
- Can also be used to selectively profile tasks and/or specific code regions
  - E.g. selectively profile specific application tasks, operating system calls, or C library code
  - Increases profiling accuracy for profiled regions
Speaker note: The task filter maintains the current bounds of execution. When the PC goes out of these bounds, the CS signal is raised. The filter excludes any code region that is not in its table.
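The filter's behaviour can be sketched in software as a table of address ranges plus a record of the range currently executing. This is a minimal model, not the hardware design: the class name, the `observe` method, and the choice to raise CS on every change of region (including entering or leaving unprofiled code) are assumptions made for illustration.

```python
class TaskFilter:
    """Software model of a programmable task filter (illustrative only)."""

    def __init__(self, regions):
        # Each region is an inclusive (start, end) address pair.
        self.regions = list(regions)
        self.current = None  # region executing at the previous address

    def observe(self, iaddr):
        """Return (profiled, cs) for one instruction address.

        profiled: the address lies inside a region in the table.
        cs: the address left the previously active region.
        """
        region = next(
            (r for r in self.regions if r[0] <= iaddr <= r[1]), None
        )
        cs = region != self.current
        self.current = region
        return region is not None, cs


# Two hypothetical task address ranges.
task_filter = TaskFilter([(0x1000, 0x1FFF), (0x4000, 0x4FFF)])
trace = [0x1000, 0x1004, 0x4000, 0x8000, 0x8004]
signals = [task_filter.observe(addr) for addr in trace]
```

In this trace, CS is raised when execution first enters the first task, when it moves to the second task, and when it leaves for an address outside the table. Addresses outside every region are reported as not profiled.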
12
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profiler FIFO
- Stores iAddr and instruction offset (iOffset) of executed sbb's
- Stores iAddr for function calls/returns with Func and Ret signals
- Stores iAddr after context switch with CS signal
- Synchronizes between processor execution frequency and slower internal profiler frequency
Speaker note: The FIFO allows the internal profiler to operate at a lower frequency, which helps reduce power.
13
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profile Cache
- Tag: Address of the short backwards branch
- Offset: Negative branch offset
  - Corresponds to the size of the loop
  - Currently supports loops with less than 256 instructions
14
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profile Cache
- CurrIter: Number of iterations for the current loop execution
- AvgIter: Average iterations per execution of the loop
  - 17-bit fixed point representation with 14 integer bits and 3 fractional bits
- Execs: Number of times a loop executes
Speaker note: The 17-bit fixed point representation is accurate enough and saves a lot of area.
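The fixed-point format can be illustrated with the averaging rule used by the profiler controller, AvgIter = (AvgIter*7 + CurrIter)/8. The sketch below assumes that CurrIter is shifted left by 3 bits to align it with the 14.3 format and that the division by 8 truncates. The slides do not state the rounding behaviour, so both points are assumptions, as are the function names.

```python
FRAC_BITS = 3                  # 3 fractional bits
AVG_MASK = (1 << 17) - 1       # AvgIter is a 17-bit field


def update_avg_iter(avg_fx, curr_iter):
    """Apply AvgIter = (AvgIter*7 + CurrIter)/8 in 14.3 fixed point.

    avg_fx is the raw 17-bit value; curr_iter is a plain integer count.
    The right shift truncates (assumed rounding mode).
    """
    aligned = curr_iter << FRAC_BITS       # convert integer to 14.3
    return ((avg_fx * 7 + aligned) >> 3) & AVG_MASK


def to_float(avg_fx):
    """Convert a raw 14.3 value to a Python float for inspection."""
    return avg_fx / (1 << FRAC_BITS)


# Two executions of a loop that runs 100 iterations each time.
avg = 0
avg = update_avg_iter(avg, 100)   # raw 100, i.e. 12.5
first = to_float(avg)
avg = update_avg_iter(avg, 100)   # raw 187, i.e. 23.375
second = to_float(avg)
```

Because the average starts at 0 and each update weights the new sample by 1/8, the reported value approaches the true iteration count gradually over several executions.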
15
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profile Cache
- InLoop: Flag indicating loop is currently executing
  - Utilized to distinguish between loop iterations and loop executions
- InFunc: Flag indicating if a loop has called a function
  - Avoids erroneously resetting InLoop flag during function calls
- InCS: Flag indicating if a context switch occurred during a loop's execution
  - Avoids erroneously resetting InLoop flag in multitasked environment
Speaker note: The InFunc and InCS flags suspend loop profiling until control returns to the loop.
16
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profile Cache
- Freshness: Indicates how recently a loop has been executed
  - Utilized to ensure newly identified loops are not immediately replaced from the profile cache
17
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profile Cache Outputs
- found: Indicates if current loop (identified by iAddr) is found within the profile cache
- foundIndex: Location of loop within profile cache, if found
- replaceIndex: Loop that will be replaced upon new loop execution
  - Loop not identified as fresh with least total iterations
18
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profiler Controller
- If CS: Copy InLoop to InCS, then clear InCS for current task loops
- If Func: Copy InLoop to InFunc
- If Ret: Clear InFunc and InCS for current task

Pseudocode:

    DAProf (iAddr, iOffset, sbb, func, ret, cs, found, foundIndex, replaceIndex):
      if ( cs ) {
        for all i, InCS[i] = InLoop[i]
        for all i, if ( InLoop[i] && (iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i]) )
          InCS[i] = 0
      }
      if ( func ) {
        for all i, InFunc[i] = InLoop[i]
      }
      else if ( ret ) {
        for all i, if ( (InFunc[i] || InCS[i]) && (iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i]) ) {
          InFunc[i] = 0
        }
      }
      else if …
19
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profiler Controller
- If loop is found within cache
  - If InLoop flag is set: new iteration
    - Increment current iterations
  - Otherwise: new execution
    - Increment executions
    - Set current iterations to 1
    - Set InLoop flag
    - Update Freshness

Pseudocode:

    else if ( sbb ) {
      if ( found ) {
        if ( InLoop[foundIndex] )
          CurrIter[foundIndex] = CurrIter[foundIndex] + 1
        else {
          for all i, if ( !InCS[i] ) Fresh[i] = Fresh[i] - 1
          Execs[foundIndex] = Execs[foundIndex] + 1
          CurrIter[foundIndex] = 1
          InLoop[foundIndex] = 1
          Fresh[foundIndex] = MaxFresh
          if ( Execs[foundIndex] == MaxExecs )
            for all i, Execs[i] = Execs[i] >> 1
        }
      }
      else {
        Tag[replaceIndex] = iAddr
        Offset[replaceIndex] = iOffset
        CurrIter[replaceIndex] = 1
        AvgIter[replaceIndex] = 0
        Execs[replaceIndex] = 1
        InLoop[replaceIndex] = 1
        Fresh[replaceIndex] = MaxFresh
        InFunc[replaceIndex] = 0
        InCS[replaceIndex] = 0
      }
      for all i, if ( InLoop[i] && !InFunc[i] && !InCS[i] && !(iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i]) ) {
        InLoop[i] = 0
        AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8
      }
    }
20
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profiler Controller
- If loop is not found within cache
  - Replace profile cache entry
  - Initialize execution and current iterations to 1
  - Set InLoop flag
  - Update Freshness
(The pseudocode is the same sbb block shown on the previous slide. This slide covers the assignments to the replaceIndex entry.)
21
Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling
Profiler Controller
- If current sbb (iAddr) is detected outside a loop within the profile cache
  - AND the loop's InLoop flag is set
  - AND InFunc and InCS flags are not set
    - Reset InLoop flag
    - Update average iterations
(The pseudocode is the same sbb block shown on the previous two slides. This slide covers the final "for all i" check that resets InLoop and updates AvgIter.)
Speaker note: If execution goes outside the bounds of a particular loop, the InLoop flag is reset, provided InFunc and InCS are not set.
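The controller rules on the preceding slides can be modelled in software to see how the InCS flag protects a loop's iteration count across a context switch. The sketch below is a behavioural approximation, not the hardware design. The names `Entry`, `Profiler`, and `step` are hypothetical, `MAX_FRESH = 3` is an assumed value, AvgIter is kept as a float rather than 14.3 fixed point, the replacement fallback when every entry is fresh is an assumption, and on a return both InFunc and InCS are cleared as the slide bullet states (the slide pseudocode shows only InFunc).

```python
from dataclasses import dataclass

MAX_FRESH = 3        # assumed freshness ceiling; the slides give no value
MAX_EXECS = 0xFFFF   # Execs is a 16-bit field


@dataclass
class Entry:
    tag: int = -1          # address of the short backwards branch (sbb)
    offset: int = 0        # loop size (magnitude of the branch offset)
    curr_iter: int = 0
    avg_iter: float = 0.0  # float here; hardware uses 14.3 fixed point
    execs: int = 0
    fresh: int = 0
    in_loop: bool = False
    in_func: bool = False
    in_cs: bool = False

    def contains(self, iaddr):
        return self.tag - self.offset <= iaddr <= self.tag


class Profiler:
    def __init__(self, entries=4):
        self.cache = [Entry() for _ in range(entries)]

    def _replace_index(self):
        # Least total iterations among non-fresh entries; falling back
        # to all entries when every entry is fresh is an assumption.
        stale = [i for i, e in enumerate(self.cache) if e.fresh == 0]
        pool = stale or range(len(self.cache))
        return min(pool,
                   key=lambda i: self.cache[i].execs * self.cache[i].avg_iter)

    def step(self, iaddr, ioffset=0, sbb=False, func=False, ret=False, cs=False):
        c = self.cache
        if cs:  # suspend active loops, except those of the resumed task
            for e in c:
                e.in_cs = e.in_loop and not e.contains(iaddr)
        if func:
            for e in c:
                e.in_func = e.in_loop
        elif ret:  # iaddr is taken to be the address returned to
            for e in c:
                if (e.in_func or e.in_cs) and e.contains(iaddr):
                    e.in_func = e.in_cs = False
        elif sbb:
            found = next((e for e in c if e.tag == iaddr), None)
            if found is None:  # new loop: replace an entry
                c[self._replace_index()] = Entry(
                    tag=iaddr, offset=ioffset, curr_iter=1, execs=1,
                    fresh=MAX_FRESH, in_loop=True)
            elif found.in_loop:  # another iteration
                found.curr_iter += 1
            else:  # new execution of a known loop
                for e in c:
                    if not e.in_cs:
                        e.fresh = max(0, e.fresh - 1)
                found.execs += 1
                found.curr_iter = 1
                found.in_loop = True
                found.fresh = MAX_FRESH
                if found.execs == MAX_EXECS:
                    for e in c:
                        e.execs >>= 1
            for e in c:  # close loops that execution has left
                if (e.in_loop and not e.in_func and not e.in_cs
                        and not e.contains(iaddr)):
                    e.in_loop = False
                    e.avg_iter = (e.avg_iter * 7 + e.curr_iter) / 8


# Loop A (task 1) is interrupted by task 2 after three iterations.
p = Profiler()
for _ in range(3):
    p.step(0x100, 0x10, sbb=True)   # loop A iterates three times
p.step(0x900, cs=True)              # switch to task 2
for _ in range(2):
    p.step(0x920, 0x8, sbb=True)    # loop B iterates twice
p.step(0xF8, cs=True)               # switch back, inside loop A
p.step(0x100, 0x10, sbb=True)       # loop A, fourth iteration
p.step(0x200, 0x8, sbb=True)        # a different loop: A has ended
loop_a, loop_b = p.cache[0], p.cache[1]
```

At the end of this trace, loop A is recorded as one execution of four iterations, even though task 2 ran in the middle of it. Loop B remains marked as suspended (InLoop and InCS both set) with two iterations so far.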
22
Dynamic Application Profiler (DAProf) Experimental Results – Hardware Synthesis
Hardware Synthesis Results
- Synthesized using Synopsys Design Compiler targeting UMC 0.18 µm technology

Configuration        Max Frequency   Logic Gates   Area vs. ARM9 with 32KB cache
Fully associative    460 MHz         132,714       19%
16-way associative   529 MHz         93,194        13%
8-way associative    600 MHz         71,121        10%
23
Dynamic Application Profiler (DAProf) Experimental Results – Hardware Synthesis
Experimental Setup
- Considered 14 multitasked applications composed of individual tasks from MiBench benchmark suite
- Each multitasked application consists of 2 to 5 tasks
[Figure: mapping of tasks to multitasked applications.
Tasks: CJPEG, DJPEG, FFT, TIFF2BW, TIFF2RGBA, SUSAN, DIJKSTRA, BITCOUNT, STRINGSEARCH, QSORT, RAWCAUDIO, RAWDAUDIO
Applications: MT2.1, MT2.2, MT2.3, MT2.4, MT2.5, MT2.6, MT2.7, MT3.1, MT3.2, MT3.3, MT3.4, MT4.1, MT4.2, MT5.1]
24
Dynamic Application Profiler (DAProf) Experimental Results – Loop Identification
- Detects overall top 15 loops across all tasks
- Detects top 2 to 5 loops within each task
Speaker note: Results were compared against an accurate simulation/instrumentation-based profiling method.
25
Dynamic Application Profiler (DAProf) Experimental Results – Profiling Accuracy
Profiling Accuracy – Average Iterations
- Percentage error in average iterations of fully-associative, 16-way associative, and 8-way associative DAProf
  - 1.3%, 1.3%, and 1.5% error, respectively
  - Greater than 98.5% accuracy on average across all implementations
[Figure: per-application percentage error in average iterations. Annotations: worst case error of 6% for 8-way DAProf; average error of 1.5% for 8-way DAProf]
26
Dynamic Application Profiler (DAProf) Experimental Results – Profiling Accuracy
Profiling Accuracy – Loop Execution
- Percentage error in loop executions of fully-associative, 16-way associative, and 8-way associative DAProf
  - 0.5% error for all implementations
  - Greater than 99.5% accuracy on average across all implementations
[Figure: per-application percentage error in loop executions. Annotations: worst case error of 2.6%; average error of only 0.5%]
27
Conclusions & Future Work
Developed a non-intrusive dynamic application profiler (DAProf)
- Provides efficient methods for distinguishing loop executions from loop iterations
- Provides efficient method for monitoring function call executions and context switches
- Achieves excellent profiling accuracy
  - On average, greater than 95% accuracy for both average iterations and loop executions
- Efficient hardware implementation
  - Maximum operating frequency of 600 MHz for 8-way associative DAProf
  - As little as 10% area overhead compared to an ARM9 processor
Future Work
- Provide support for directly profiling function call execution
- Provide efficient methods for non-intrusive profiling within multicore systems
- Utilize non-intrusive profiling methods to detect soft errors within software application execution
28
THANK YOU! Questions?