1
From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance Without ILP or Speculation
Sami YEHIA and Olivier TEMAM
LRI, Paris South University, France
2
2/18 Scaling Up Processors
Larger pipelines, caches, instruction windows, and reservation stations
Aggressive speculation mechanisms: branch prediction, value prediction, data prefetching, ...
Rely on ILP exploitation
What about scaling with little ILP?
3
3/18 Concept
A sequence of instructions is a set of functions
Program:
  …
  addq r1,r2,r3
  subq r3,10,r4
  …
  sll r5,6,r6
  addq r5,r5,r4
Inputs: r1, r2, r3, …, rn
r6 = f1(r1, r2, …, rn)
r4 = f2(r1, r2, …, rn)
Logic circuit (combinatorial functions): input bits r1_63, r1_62, r1_61, …, r1_1, r1_0; output bits f1_63, f1_62, f1_61, …, f1_1, f1_0
2^(64*num_registers) inputs! (Theoretically)
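As a minimal sketch of this idea, the first two instructions of the program above (addq r1,r2,r3 followed by subq r3,10,r4) can be written either as a dependent sequence or as a single function of the input registers. The names run_sequence and collapsed_r4 are hypothetical, and the sketch assumes 64-bit wrap-around arithmetic and ignores the later instructions that also write r4.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit integer registers


def run_sequence(r1, r2):
    """Execute the two dependent instructions one after the other."""
    r3 = (r1 + r2) & MASK64   # addq r1,r2,r3
    r4 = (r3 - 10) & MASK64   # subq r3,10,r4 (must wait for r3)
    return r4


def collapsed_r4(r1, r2):
    """The same result expressed as one function of the inputs only."""
    return (r1 + r2 - 10) & MASK64


if __name__ == "__main__":
    for r1, r2 in [(3, 4), (100, 200), (MASK64, 1)]:
        assert run_sequence(r1, r2) == collapsed_r4(r1, r2)
    print(hex(collapsed_r4(100, 200)))
```

The collapsed form has no intermediate register, so nothing in it has to wait for r3 to be produced.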
4
4/18 Principles
An «independent» function for each output
f_r3(r9, r10) = r9 + r10 - 1
f_r4(r9, r10) = sign_extension((r9 + r10 - 1)_31:0)
f_r5(r9, r10) = ((r9 + r10 – 1) > 1
f_br(r9, r10) = (r9 + r10 – 1) ((r9 + r10 – 1) >1)
[Figure: DFG]
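Taking f_r3 and f_r4 as examples, the sketch below writes each output as its own function of the inputs r9 and r10, so that neither depends on the other's result. It assumes 64-bit registers and reads the 31:0 subscript as "keep the low 32 bits, then sign-extend to 64 bits"; the helper sign_extend_32 is a hypothetical name.

```python
MASK64 = (1 << 64) - 1      # assumption: 64-bit registers
MASK32 = (1 << 32) - 1


def sign_extend_32(x):
    """Keep bits 31:0 of x and sign-extend them to 64 bits."""
    low = x & MASK32
    if low & 0x80000000:               # bit 31 set: fill the upper 32 bits with ones
        return low | (MASK64 ^ MASK32)
    return low


def f_r3(r9, r10):
    return (r9 + r10 - 1) & MASK64


def f_r4(r9, r10):
    # Computed directly from r9 and r10, not from the value of r3.
    return sign_extend_32(r9 + r10 - 1)


if __name__ == "__main__":
    print(f_r3(5, 6), f_r4(5, 6))
    print(hex(f_r4(0x7FFFFFFF, 2)))
```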
5
5/18 Hardware Operator
r10 + r9 - 1 to hardware operators
[Figure: two cascaded adders; inputs a, b, c; intermediate result f1; output out]
f1_i = f'(a_i, b_i, cout1_{i-1})
cout1_i = f'_c(a_i, b_i, cout1_{i-1})
out_i = f''(f1_i, c_i, cout2_{i-1}) = f''(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
cout2_i = f''_c(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
Eliminate dependencies to calculate a+b+c
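The bit-slice equations above can be checked with a small simulation. This is only a sketch: it assumes that f' and f'' are ordinary full-adder sum functions and that f'_c and f''_c are the matching carry functions, which the slide does not spell out. The function name add3_bit_sliced is hypothetical.

```python
def add3_bit_sliced(a, b, c, width=64):
    """Compute (a + b + c) mod 2**width one bit position at a time.

    Bit i of the result uses only a_i, b_i, c_i and the two carries
    coming from position i-1, as in the slide's equations.
    """
    cout1 = 0  # carry of the first addition (a + b)
    cout2 = 0  # carry of the second addition (f1 + c)
    out = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        c_i = (c >> i) & 1
        f1_i = a_i ^ b_i ^ cout1                             # f'
        next_cout1 = (a_i & b_i) | (cout1 & (a_i ^ b_i))     # f'_c
        out_i = f1_i ^ c_i ^ cout2                           # f''
        next_cout2 = (f1_i & c_i) | (cout2 & (f1_i ^ c_i))   # f''_c
        out |= out_i << i
        cout1, cout2 = next_cout1, next_cout2
    return out


if __name__ == "__main__":
    minus_one = (1 << 64) - 1  # two's-complement encoding of -1
    print(add3_bit_sliced(5, 6, minus_one))  # r10 + r9 - 1 with r10=5, r9=6
```

The loop is sequential only because it is software. The point of the equations is that out_i does not need the complete result of a + b, only the bits at position i and two incoming carries.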
6
6/18 Complexity Effectiveness
Scalability of ILP vs. functions
[Figure: axes Complexity and Performance; curves for ILP exploitation and for functions]
7
7/18 Related Work
ASIC
General-purpose context:
3-1 Interlock Collapsing ALU [Y. Sazeides, S. Vassiliadis and J. Smith, MICRO-29, 1996]
Chimaera [Z. Ye et al., ISCA-27, 2000]
Grid Processors [R. Nagarajan et al., MICRO-34, 2001]
Cascade one or more hardware operators to execute specific functions
[Figure: AND, OR, XOR, and adder operators]
8
8/18 Building Functions
From traces of instructions to configuration macros
Compilation toolchain to study:
Potential of the approach
Performance analysis on a superscalar processor
[Figure label: Traces]
9
9/18 Potential of the Approach
Cuts: limits to DFG collapsing (height)
Number of inputs
Non-collapsible instructions
Load instructions (27.7%)
Carries from upper significant bits
Theoretical speedup
The lower the ILP, the higher the speedup
[Figure: DFG containing op nodes, a load (LD) with its address (@) and memory access (mem), and a cut separating functions F1 and F2]
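Two of the cuts listed above, load instructions and the number of inputs, can be illustrated with a rough sketch. It is not the toolchain used in the talk: it groups consecutive trace entries instead of walking a real DFG, the (opcode, dest, sources) tuple format is hypothetical, immediates are assumed to be left out of the source list, and the limit max_inputs=4 is an arbitrary assumption.

```python
def partition_trace(trace, max_inputs=4):
    """Split a trace into collapsible groups, cutting at loads and
    whenever a group would need more than max_inputs input registers."""
    groups = []
    current, produced, inputs = [], set(), set()

    def close():
        # Emit the group built so far (if any) and start a new one.
        nonlocal current, produced, inputs
        if current:
            groups.append(("function", current, sorted(inputs)))
        current, produced, inputs = [], set(), set()

    for opcode, dest, sources in trace:
        if opcode == "LD":
            close()  # a load cuts the function being built
            groups.append(("load", [(opcode, dest, sources)], sorted(sources)))
            continue
        # Registers read here that were not computed inside the group.
        new_inputs = {s for s in sources if s not in produced}
        if len(inputs | new_inputs) > max_inputs:
            close()  # too many inputs: cut and start a new group
            new_inputs = set(sources)
        current.append((opcode, dest, sources))
        inputs |= new_inputs
        produced.add(dest)
    close()
    return groups


if __name__ == "__main__":
    trace = [
        ("addq", "r3", ("r1", "r2")),
        ("subq", "r4", ("r3",)),
        ("LD", "r5", ("r4",)),
        ("sll", "r6", ("r5",)),
        ("addq", "r4", ("r5", "r6")),
    ]
    for kind, instrs, ins in partition_trace(trace):
        print(kind, [op for op, _, _ in instrs], "inputs:", ins)
```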
10
10/18 Theoretical Speedup
11
11/18 Number of Inputs
12
12/18 Non-Collapsible Instructions
13
13/18 Implementation
rePLay Framework
14
14/18 Performance Evaluation
15
15/18 rePLay Optimization Engine Delay
Function built “offline”
16
16/18 Latency of Function units
17
17/18 Future Work
Address prediction to overcome load cuts
Address Prediction & Cache Preloading
[Figure: two DFGs with op, LD, mem, and address (@) nodes and functions F1 and F2; the second DFG adds a predicted address (@')]
18
18/18 Q & A
19
Carries from Upper Significant Bits
20
Optimization Engine Delay
21
Latency of Function units