Download presentation
Presentation is loading. Please wait.
Published byAmi Atkins Modified over 8 years ago
1
Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization Ajay Nair, Roman Lysecky Department of Electrical and Computer Engineering University of Arizona Tucson, AZ USA {ajaynair, rlysecky}@ece.arizona.edu
2
University of Arizona 2 Application profiling is useful for many purposes Often used to identify frequently executed code regions Allowing a designer to focus on optimizing those regions Map frequently executed code and data regions to non- interfering cache regions Used within binary translation approaches to store translation results x86, Transmeta Crusoe Can be used to create optimized SW or HW implementations selected at runtime And many others…. Introduction Application Profiling
3
University of Arizona 3 Introduction Application Profiling – HW/SW Partitioning Hardware/software Partitioning Profiling is a critical step within hardware/software partitioning Often utilized to determine critical software region Frequently executed loops or functions Critical kernels can be re- implemented in hardware Speedup of 2X to 10X Speedup of 1000X possible Energy reduction of 25% to 95% Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)
4
University of Arizona 4 Introduction Application Profiling – Warp Processing Overview µPµP On-chip CAD I$ D$ Profiler W-FPGA A PPLICATION I NITIALLY E XECUTES ON M ICROPROCESSOR 1 P ROFILER D YNAMICALLY D ETECTS A PPLICATION’S K ERNELS 2 O N- C HIP CAD M APS K ERNELS ONTO FPGA 3 W ARPED E XECUTION IS 2-100X F ASTER – OR – C ONSUMES 75% L ESS P OWER 5 C ONFIGURE FPGA AND U PDATE A PPLICATION B INARY 4
5
University of Arizona 5 Introduction Application Profiling – Warp Processing Warp Processing - Dynamic Hardware/Software Partitioning Dynamically re-implements critical kernels as HW within W-FPGA Requires non-intrusive profiling to determine critical kernels at runtime Incorporated Frequent Loop Detection Profiler [Gordon-Ross, Vahid – TC 2005] Monitors short backwards branches Maintains a small list of branch executions frequency May lead to sub-optimal partitioning as it does not provide detailed loop execution statistics
6
University of Arizona 6 Introduction Application Profiling – HW/SW Partitioning Loop iteration count alone may not provide sufficient information for accurate performance estimation Example Assume we want to partition only one of the following two loops to HW: With profile data from Frequent Loop Detection Profiler, kernel B appears to be the better candidate KernelTotal Iterations% Exec Time A10,00033% B12,00045% Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)
7
University of Arizona 7 Introduction Application Profiling – Warp Processing However, communication requirements can significantly impact overall performance Kernel A may in fact be the better choice KernelTotal Iterations% Exec Time A10,00033% B12,00045% Avg Iters/ExecExecs 50002 26000 Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)
8
University of Arizona 8 Introduction Application Profiling – Goal: Non-Intrusive Profiling Non-intrusive Application Profiling Goal: Profile application at runtime to determine detailed loop execution statistics with no impact on application execution Runtime overhead cannot be tolerated by many applications at runtime E.g. Real-time and embedded systems May lead to missed deadlines and potentially system failure Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)
9
University of Arizona 9 Introduction Application Profiling – Existing Profiling Methods Software Based Profiling Instrumenting - insert code directly within software E.g., monitor branches, basic blocks, functions, etc. Intrusive: Increases code size and introduces runtime overhead Statistical Sampling Periodically interrupt processor – or execute additional software task – to monitor program counter Statistically determine the application profile Very good accuracy with reduced overhead compared to instrumentation Intrusive: Introduces runtime overhead Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)
10
University of Arizona 10 Introduction Application Profiling – Existing Profiling Methods Hardware Based Profiling Processor Support – Event Counters Many processors include event counters that can be used to profile an application Intrusive: Requires additional software support to process event counters to profile application JTAG – Joint Test Action Group Standard interface for reading register within hardware devices Intrusive: Requires the processor to be halted to read the values Software Application (C/C++) Application Profiling Critical Kernels Partitioning HWSW µPµP I$ D$ HW COPROCESSOR (ASIC/FPGA)
11
University of Arizona 11 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling Dynamic Application Profiler (DAProf) Non-intrusively monitors both loop executions and iterations Monitors processor’s instruction bus and branch execution behavior to build application profile Requires a short backwards branch (sbb) signal from microprocessor µPµP I$ D$ DAProf iAddr sbb FPGA/ASIC Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)
12
University of Arizona 12 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling Profiler FIFO Small FIFO that stores the instruction address (iAddr) and instruction offset (iOffset) of all executed sbb’s Synchronizes between processor execution frequency and slower internal profiler frequency Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)
13
University of Arizona 13 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling Profile Cache Tag: Address of the short backwards branch Offset: Negative branch offset Corresponds to the size of the loop Currently supports loops with less than 256 instructions Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)
14
University of Arizona 14 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling Profile Cache CurrIter: Number of iterations for the current loop execution AvgIter: Average Iterations per execution of the loop 13-bit fixed point representation with 10 bits integer and 3 bits fractional Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)
15
University of Arizona 15 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling Profile Cache InLoop: Flag indicating loop is currently executing Utilized to distinguish between loop iterations and loop executions Freshness: Indicates how recently a loop has been executed Utilized to ensure newly identified loops are not immediately replaced from the profile cache Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)
16
University of Arizona 16 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling Profile Cache Outputs found: Indicates if current loop (identified by iAddr) is found within the profile cache foundIndex: Location of loop within profile cache, if found replaceIndex: Loop that will be replaced upon new loop execution Loop not identified as fresh with least total iterations Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)
17
University of Arizona 17 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling Profiler Controller If loop is found within cache If InLoop flag is set New iteration Increment current iterations Otherwise New execution Increment executions Set current iterations to 1 Set InLoop flag Update Freshness DAProf (iAddr, iOffset, found, foundIndex, replaceIndex): if ( found ) if ( InLoop[foundIndex] ) CurrIter[foundIndex] += 1 else { for all i, Fresh[i] = Fresh[i] – 1 Execs[foundIndex] = Execs[foundIndex] + 1 CurrIter[foundIndex] = 1 InLoop[foundIndex] = 1 Fresh[foundIndex] = MaxFresh if ( Execs[foundIndex] = MaxExecs ) for all i, Execs[i] = Execs[i] >> 1 } else { for all i, Fresh[i] = Fresh[i] – 1 Tag[replaceIndex] = iAddr Offset[replaceIndex] = iOffset CurrIter[replaceIndex] = 1 AvgIter[replaceIndex] = 0 Execs[replaceIndex] = 1 InLoop[replaceIndex] = 1 Fresh[replaceIndex] = MaxFresh } for all i, if !( inLoop[i] && iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i] ) { InLoop[i] = 0 AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8 }
18
University of Arizona 18 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling Profiler Controller If loop is not found within cache Replace profile cache entry Initialize execution and current iterations to 1 Set InLoop flag Update Freshness DAProf (iAddr, iOffset, found, foundIndex, replaceIndex): if ( found ) if ( InLoop[foundIndex] ) CurrIter[foundIndex] += 1 else { for all i, Fresh[i] = Fresh[i] – 1 Execs[foundIndex] = Execs[foundIndex] + 1 CurrIter[foundIndex] = 1 InLoop[foundIndex] = 1 Fresh[foundIndex] = MaxFresh if ( Execs[foundIndex] = MaxExecs ) for all i, Execs[i] = Execs[i] >> 1 } else { for all i, Fresh[i] = Fresh[i] – 1 Tag[replaceIndex] = iAddr Offset[replaceIndex] = iOffset CurrIter[replaceIndex] = 1 AvgIter[replaceIndex] = 0 Execs[replaceIndex] = 1 InLoop[replaceIndex] = 1 Fresh[replaceIndex] = MaxFresh } for all i, if !( inLoop[i] && iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i] ) { InLoop[i] = 0 AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8 }
19
University of Arizona 19 Dynamic Application Profiler (DAProf) Non-intrusive Dynamic Application Profiling Profiler Controller If current sbb (iAddr) is detected outside a loop within the profile cache AND, the loop’s InLoop flag is set Reset InLoop flag Update average iterations Ratio based average iteration calculation Simple hardware requirements Good accuracy for applications considered DAProf (iAddr, iOffset, found, foundIndex, replaceIndex): if ( found ) if ( InLoop[foundIndex] ) CurrIter[foundIndex] += 1 else { for all i, Fresh[i] = Fresh[i] – 1 Execs[foundIndex] = Execs[foundIndex] + 1 CurrIter[foundIndex] = 1 InLoop[foundIndex] = 1 Fresh[foundIndex] = MaxFresh if ( Execs[foundIndex] = MaxExecs ) for all i, Execs[i] = Execs[i] >> 1 } else { for all i, Fresh[i] = Fresh[i] – 1 Tag[replaceIndex] = iAddr Offset[replaceIndex] = iOffset CurrIter[replaceIndex] = 1 AvgIter[replaceIndex] = 0 Execs[replaceIndex] = 1 InLoop[replaceIndex] = 1 Fresh[replaceIndex] = MaxFresh } for all i, if !( inLoop[i] && iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i] ) { InLoop[i] = 0 AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8 }
20
University of Arizona 20 Dynamic Application Profiler (DAProf) Hardware Implementation DAProf Hardware Implemented fully associative, 16-way associative, and 8-way associative profiler design in Verilog Synthesized using Synopsys Design Compiler targeted at UMC.18µm Design Area Maximum Frequency mm 2 Gates% of ARM 9 Fully Associative1.75107,47720.00%415 MHz 16-way Associative1.2274,74414.00%438 MHz 8-way Associative0.9659,03611.00%495 MHz Profiler FIFO Profiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3)
21
University of Arizona 21 Dynamic Application Profiler (DAProf) Profiling Accuracy DAProf Profiling Accuracy Compared profiling accuracy of top tens loops for several MiBench applications – compared to detailed simulation based profiling Results presented for 8-way DAProf design All three associativity performed similarly well 90% accuracy for average iterations 97% accuracy for executions 95% accuracy for % execution time
22
University of Arizona 22 Dynamic Application Profiler (DAProf) Profiling Accuracy – Function Call Interference DAProf Profiling Accuracy Some applications are affected by function call interference Loop execution within functions called from within a loop may lead to InLoop flag being incorrectly reset for calling loop Average iterations will be incorrectly updated Function Call Interference
23
University of Arizona 23 Current Work – Dynamic Application Profiler Function Call Support Extended DAProf Profiler with Function Call Support Monitors function calls and returns to avoid function call interference InFunc: Flag within Profile Cache to determine is a loop has called a function Will not update average iterations until function call returns Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (8) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) FOUND I NDEX REPLACE I NDEX FOUND SBB I A DDR SBB I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) FRESH- NESS (3) Profiler FIFOProfiler Controller P ROFILE C ACHE T AG (30) O FFSET (30) C URR I TER (10) A VG I TER (13) E XECS (16) I N L OOP (1) F RESH- N ESS (3) FOUND I NDEX REPLACE I NDEX FOUND SBB FUNC RET I A DDR SBB FUNC RET I A DDR I O FFSET D YNAMIC A PPLICATION P ROFILER (DAP ROF ) I N F UNC (1)
24
University of Arizona 24 Current Work – Dynamic Application Profiler Profiling Accuracy with Function Call Support DAProf Profiling Accuracy with Function Support Compared profiling accuracy of top tens loops for several MiBench applications – compared to detailed simulation based profiling Results presented for 8-way DAProf design All three associativity performed similarly well 95% accurate for average iterations, executions, and % execution time
25
University of Arizona 25 Conclusions Conclusions Developed a non-intrusive dynamic application profiler (DAProf) Profiles an application at runtime providing detailed loop execution characteristics Developed efficient methods for identifying loop executions from loop iterations Developed Freshness based replacement policy to ensure newly executed loops are not immediately replaced Developed efficient method for monitoring function call executions Achieves excellent profiling accuracy On average, better than 95% accurate for average iterations per executions, loops executions, and estimated percentage of total application execution time Efficient Hardware Implementation Area requirement as little as 11% of an ARM9 processor Maximum operating frequency of 495 MHz
26
University of Arizona 26 Current/Future Work Current/Future Work Current DAProf performs excellently for profiling single threaded software applications However, multitasked/multithreaded applications may lead to context switch interference Similar implications as that of function call interferences Need for task/thread aware profiling
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.