rePLay: A Hardware Framework for Dynamic Optimization
Paper by: Sanjay J. Patel, Member, IEEE, and Steven S. Lumetta, Member, IEEE
Presentation by: Alex Rodionov
Outline
- Motivation, Introduction, Basic Concepts
- Frame Constructor
- Optimization Engine
- Frame Cache
- Frame Sequencer
- Simulation Results
- Conclusion
Motivation
We want to make programs run faster. One way: code optimization, traditionally done by the compiler (e.g., automatic loop unrolling, common subexpression elimination). Compiler optimizations are conservative:
- Optimized code must still be correct
- The compiler has no knowledge of dynamic runtime behavior
- Handling pointer aliasing is complicated
rePLay Framework
[Figure: instruction stream → "frame" → optimized frame]
rePLay Framework
Performs code optimization at runtime:
- In hardware
- With access to dynamic behavior
- Speculatively, enabling potentially unsafe optimizations
Consists of:
- A software-programmable optimization engine
- Hardware to identify, cache, and sequence blocks of program code for optimization
- A recovery mechanism to undo speculative execution
Integrates into an existing microarchitecture.
rePLay Framework
Frames
One or more consecutive basic blocks from the original program flow:
Frames
- Begin at branch targets
- End at erratically-behaving branches
- Include well-behaved branches, which:
  - Are kept inside the frame
  - Allow the frame to span multiple basic blocks
  - Are converted into assertion instructions
Assertions
Assertions
Assertions ensure that the frame executes completely. They:
- Evaluate the same condition as the branches they replace
- Force execution to restart at the beginning of the frame if the condition evaluates to false; re-execution uses the original code, not the frame
- Can also be inserted to verify speculations other than branches (e.g., data values)
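The rollback semantics above can be sketched in software. This is an illustrative model, not code from the paper; all names (`execute_frame`, `Instr`, the register names) are hypothetical:

```python
# Sketch of assertion semantics: an assertion re-checks the condition of the
# branch it replaced; if it fails, all speculative state is discarded and
# execution restarts in the original, unoptimized code.

class AssertionFired(Exception):
    pass

class Instr:
    def __init__(self, fn=None, assertion=None):
        self.fn = fn                # state -> None, for ordinary instructions
        self.assertion = assertion  # state -> bool, for assertion instructions

def execute_frame(instrs, state, fallback):
    """Run a frame speculatively; commit only if every assertion holds."""
    checkpoint = dict(state)        # snapshot architectural state
    try:
        for i in instrs:
            if i.assertion is not None:
                if not i.assertion(state):   # same condition as the old branch
                    raise AssertionFired()
            else:
                i.fn(state)
        return state                # frame ran to completion: commit as a unit
    except AssertionFired:
        state.clear()
        state.update(checkpoint)    # undo all speculative updates
        fallback(state)             # recovery: re-execute the original code
        return state

# A frame whose assertion speculates that r1 stays non-negative:
frame = [Instr(fn=lambda s: s.update(r1=s["r1"] - 1)),
         Instr(assertion=lambda s: s["r1"] >= 0),
         Instr(fn=lambda s: s.update(r2=s["r1"] * 2))]

state = {"r1": 5, "r2": 0}
execute_frame(frame, state, fallback=lambda s: None)
print(state["r2"])   # 8: frame completed, r2 = (5-1)*2
```

Running the same frame with `r1 = 0` fires the assertion and leaves the state exactly as checkpointed, which mirrors the paper's requirement that frames commit atomically or not at all.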
Frames - Summary
- Built from speculatively sequential basic blocks
- Form the scope/boundary for optimizations
- Include assertion instructions to verify speculations during execution
Frame Construction
Frame Construction
- Frames are built over time from already-executed instructions
- As conditional branches are promoted to assertions, the frame grows
- Fired assertions can be demoted back to branches
- Un-promoted control instructions terminate a frame
- Once a frame contains enough instructions (above a threshold), it is complete
Frame Construction
A branch bias table is used to promote branches to assertions:
- It counts the number of times a branch had the same outcome
- Two such tables are used:
  - One for conditional branches (taken vs. not-taken)
  - One for indirect branches (arbitrary target)
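A single bias-table entry can be sketched as a counter of consecutive identical outcomes. The threshold value and field names below are assumptions for illustration (the promotion threshold is one of the parameters the paper explores):

```python
# Hypothetical sketch of one branch bias table entry: count consecutive
# identical outcomes; at a threshold, promote the branch to an assertion.

PROMOTION_THRESHOLD = 32   # assumed value; a tunable parameter

class BiasEntry:
    def __init__(self):
        self.last_outcome = None   # taken/not-taken, or target PC for indirects
        self.count = 0             # consecutive repeats of last_outcome
        self.promoted = False      # True once the branch became an assertion

    def update(self, outcome):
        if outcome == self.last_outcome:
            self.count += 1
            if self.count >= PROMOTION_THRESHOLD:
                self.promoted = True          # branch becomes an assertion
        else:
            # Outcome changed: reset the bias; a fired assertion is demoted.
            self.last_outcome = outcome
            self.count = 1
            self.promoted = False

e = BiasEntry()
for _ in range(40):
    e.update(True)     # branch consistently taken
print(e.promoted)      # True: promoted after 32 consistent outcomes
```

For indirect branches the "outcome" is the target PC rather than a taken/not-taken bit, which is why the paper uses a separate (wider) table for them.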
Branch Promotion/Demotion
Results
Results
Branch bias tables: 64KB for conditional branches, 10KB for indirect branches
Frame Construction - Summary
We desire:
- Construction of large frames
- Promotion of consistently-behaving branches
Parameters to tune:
- Branch promotion threshold
- Minimum frame size
- Branch history length
- Size of the branch bias tables
Optimization Engine
Optimization Engine
Performs code optimization within frames:
- Software-programmable, with its own instruction set and local memory
- Optimizes frames in parallel with execution of the program
- Can make speculative and unsafe optimizations, as long as assertions are inserted
- The design is open; no implementation details are proposed
Possible Optimizations
- Value speculation
- Pointer-aliasing speculation
- Eliminating stack operations across function call boundaries
- Anything else a compiler does, plus what it is afraid to do
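As a concrete illustration of the first item, here is a minimal sketch of value speculation guarded by an assertion. The code, names, and the speculated value are invented for illustration; the paper does not give this example:

```python
# Value speculation inside a frame: a load that almost always returns the
# same value is replaced by that constant, guarded by an assertion. If
# memory ever holds a different value, the assertion fires, the frame is
# squashed, and the original code re-executes.

def original_frame(mem):
    x = mem["p"]                 # the load the optimizer wants to remove
    return x * 4 + 1

SPECULATED_X = 10                # profiled "usual" value (assumed)

def optimized_frame(mem):
    # Load hoisted away: use the constant, verified by an assertion.
    assert mem["p"] == SPECULATED_X, "assertion fired: squash frame"
    return SPECULATED_X * 4 + 1  # folds to a constant at optimization time

def run(mem):
    try:
        return optimized_frame(mem)
    except AssertionError:
        return original_frame(mem)   # recovery path: original code

print(run({"p": 10}))   # 41, via the optimized frame
print(run({"p": 7}))    # 29, assertion fired, original code ran
```

A compiler could not safely do this, since it cannot prove the load always yields 10; rePLay can, because the assertion plus recovery mechanism makes the wrong-value case merely slow rather than incorrect.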
Frame Cache
Frame Cache
Delivers optimized frames for execution:
- Can increase instruction delivery throughput even without optimization
- Does not replace the regular instruction cache
- Must hold all cache lines of a frame, which may lead to cache fragmentation
- A fired assertion causes the frame's eviction from the cache
Frame Cache Implementation
[Figure: frames B and C stored across consecutive cache lines]
Frame Cache Implementation
- Frames span multiple consecutive cache lines
- Frames are indexed by their starting PC, which maps to the first cache line of the frame
- The last cache line of a frame has a termination bit
- The cache is 4-way set associative
- Further implementation details are lacking
- The authors' model is a bit unrealistic: cache size is measured in number of frames, regardless of frame size
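The indexing scheme above can be sketched as follows. Since the paper omits implementation details, the structure here (a plain dictionary instead of 4-way set-associative storage, the line size, the method names) is assumed:

```python
# Sketch of a frame cache: frames occupy consecutive "cache lines", are
# indexed by starting PC, and the last line carries a termination bit.

LINE_SIZE = 4   # instructions per cache line (assumed)

class FrameCache:
    def __init__(self):
        # starting PC -> list of (line, termination_bit) pairs.
        # A real design would be 4-way set associative, not a dict.
        self.index = {}

    def insert(self, start_pc, instrs):
        lines = [instrs[i:i + LINE_SIZE]
                 for i in range(0, len(instrs), LINE_SIZE)]
        self.index[start_pc] = [(line, i == len(lines) - 1)
                                for i, line in enumerate(lines)]

    def fetch(self, start_pc):
        """Read lines until the termination bit; None on a miss
        (execution then falls back to the regular instruction cache)."""
        if start_pc not in self.index:
            return None
        out = []
        for line, term in self.index[start_pc]:
            out.extend(line)
            if term:
                break
        return out

    def evict(self, start_pc):
        self.index.pop(start_pc, None)   # e.g., after a fired assertion

fc = FrameCache()
fc.insert(0x400, list(range(10)))        # a 10-instruction frame: 3 lines
print(fc.fetch(0x400))                   # [0, 1, ..., 9]
```

The termination bit is what lets variable-length frames share one cache without per-frame length metadata, at the cost of the fragmentation noted above.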
Effect on Frame Size
A larger cache may hold larger frames
Effect on Frame Code Coverage
Cache misses mean no frame is fetched
Effect on Frame Completion
Frame Cache - Summary
Having a finite-sized frame cache does not severely affect:
- Code coverage by frames
- Instructions per frame
- Successful frame completion
Frame Sequencer
Frame Sequencer
- Augments a standard branch predictor with a frame predictor
- The frame predictor predicts which frame to fetch from the frame cache
- A selector chooses the final prediction: execute an optimized frame (frame predictor) or an unoptimized basic block (regular branch predictor)
- The selector is history-based or confidence-based
Frame Sequencer
Frame Predictor
Uses a table:
- Indexed by path history (the same history used in the frame constructor)
- Outputs a frame's starting PC
- Entries are added/removed as frames enter/leave the frame cache
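The table lookup can be sketched as below. The hash function, table organization, and method names are assumptions; only the 16K-entry size and path history length of 6 come from the presentation's configuration:

```python
# Sketch of the frame predictor: a table indexed by a hash of recent
# branch-target path history, returning a cached frame's starting PC.

HISTORY_LEN = 6     # path history length, per the configuration slide
TABLE_SIZE = 16384  # 16K entries

class FramePredictor:
    def __init__(self):
        self.table = [None] * TABLE_SIZE

    @staticmethod
    def _index(path_history):
        # Fold the last HISTORY_LEN branch targets into a table index
        # (illustrative hash; the real hardware hash is unspecified).
        h = 0
        for pc in path_history[-HISTORY_LEN:]:
            h = (h * 31 + pc) % TABLE_SIZE
        return h

    def train(self, path_history, frame_pc):
        """Called when a frame enters the frame cache; the entry is
        removed again when the frame leaves the cache."""
        self.table[self._index(path_history)] = frame_pc

    def predict(self, path_history):
        """Returns a frame starting PC, or None. A selector then chooses
        between this and the conventional branch predictor."""
        return self.table[self._index(path_history)]

fp = FramePredictor()
fp.train([0x10, 0x24, 0x38, 0x4c, 0x60, 0x74], frame_pc=0x400)
print(hex(fp.predict([0x10, 0x24, 0x38, 0x4c, 0x60, 0x74])))  # 0x400
```

A `None` prediction (or a selector vote against the frame) simply means the sequencer follows the regular branch predictor through unoptimized code.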
Predictor Accuracy Results
- 16K-entry frame predictor
- The selector mechanism is unspecified
- Low prediction percentages are compensated by a reduction in the total branch count
Frame Sequencer - Summary
- Even if frames complete without fired assertions once started, we need to know when to start a frame
- Choose a frame based on previous branch-target history
- Choose when to initiate this frame vs. following the conventional branch predictor
Putting it All Together
Putting it All Together
Configuration:
- Branch bias table: direct-mapped; 64KB for conditional branches, 10KB for indirect branches; path history length of 6 (?)
- Frame cache: 256 frames (of arbitrary size), 4-way set associative
- Frame predictor: 16K entries, path history length of 6
Putting it All Together
- 8 SPECint95 benchmarks
- Trace-driven simulator based on SimpleScalar
- Alpha AXP ISA
Putting it All Together
Results:
- Average frame size: 88 instructions
- Frame coverage: 68% of the instruction stream
- Frame completion rate: 97.81%
- Frame predictor accuracy: 81.26%
Conclusion
The rePLay framework provides a system to perform risky dynamic code optimizations in a speculative manner. Even with no optimizations, you still get:
- Increased fetch bandwidth
- A reduction in the number of branches to execute