IA-64 Register Model: Stack & Rotation Dale Morris Architect Hewlett Packard Co.


1 IA-64 Register Model: Stack & Rotation
Dale Morris, Architect, Hewlett Packard Co.

2 Philosophy
Large register files
–Most processors have lots of registers
Explicit control over register renaming
–Most processors have register renaming
IA-64 makes the register names SW-visible & makes the renaming explicit

3 Outline
Register Stack
–Register Stack Engine
Register Rotation
–Loop Branches
–Modulo-Scheduling of Loops
Summary

4 Register Stack
Motivation:
–Automatic save/restore of GRs on procedure call/return
–Cache traffic reduction
–Latency hiding of register spill/fill

5 General Registers
[Figure: the general register file. GRs 0-31 are static; GRs 32-127 are stacked.]

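6 GR Stack Frame
[Figure: GRs 0-31 are static. The stacked GRs 32-127 hold the current frame, laid out as (inputs), locals, outputs; stacked registers beyond the frame are illegal to access. The Current Frame Marker (CFM) has two fields: size of frame (sof), spanning inputs, locals, and outputs, and size of locals (sol), spanning inputs and locals.]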

7 GR Stack Frame - Example
[Figure: CFM has sof=21, sol=14. Locals occupy r32-r45 and outputs occupy r46-r52.]

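8 GR Stack Frame - Call
[Figure: before the call, CFM has sof=21, sol=14 (locals r32-r45, outputs r46-r52) and PFM is undefined (xx). After the call, the caller's outputs appear as r32-r38 in the new frame; CFM has sof=7, sol=0 and PFM has sof=21, sol=14.]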

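9 GR Stack Frame - Allocate
[Figure: the caller's frame has sof=21, sol=14. After the call, CFM has sof=7, sol=0 (outputs r32-r38) and PFM has sof=21, sol=14. After alloc, CFM has sof=19, sol=16 and PFM is unchanged; inputs and locals occupy r32-r47 and outputs occupy r48-r50.]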

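10 GR Stack Frame - Return
[Figure: the caller's frame has sof=21, sol=14. After the call, CFM has sof=7, sol=0 and PFM has sof=21, sol=14. After alloc, CFM has sof=19, sol=16 (inputs and locals r32-r47, outputs r48-r50). After return, CFM is restored to sof=21, sol=14 (locals r32-r45, outputs r46-r52); PFM still reads sof=21, sol=14.]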

11 Instructions
br.call
–Copies CFM to PFM
–Creates new frame with only output regs
–Saves local regs from previous frame
alloc
–Resizes current frame
–Saves PFM to a GR

12 Instructions (cont.)
mov to PFS
–Restores PFM from a GR
br.ret
–Restores CFM from PFM
–Restores local regs for previous frame
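The frame-marker bookkeeping described for br.call, alloc, mov to PFS, and br.ret can be sketched as a small Python model. This is only an illustration under assumptions: the class and method names (`FrameMarker`, `RegisterStack`, `br_call`, and so on) are hypothetical, the `base` field is an illustrative stand-in for the renaming offset, and register contents and the RSE are not modeled. The numbers reproduce the sof/sol values from the stack frame example slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMarker:
    sof: int  # size of frame: stacked registers visible starting at r32
    sol: int  # size of locals (inputs + locals); outputs are sof - sol


class RegisterStack:
    """Toy model of CFM/PFM updates only (hypothetical names, no register data)."""

    def __init__(self, sof, sol):
        self.cfm = FrameMarker(sof, sol)
        self.pfm = None   # undefined ("xx") until the first call
        self.base = 0     # illustrative: offset of r32 within the stacked registers

    def br_call(self):
        # Copy CFM to PFM; the new frame holds only the caller's outputs.
        self.pfm = self.cfm
        self.base += self.cfm.sol          # caller's locals slide out of view
        self.cfm = FrameMarker(self.cfm.sof - self.cfm.sol, 0)

    def alloc(self, sof, sol):
        # Resize the current frame and hand the PFM back for saving in a GR.
        self.cfm = FrameMarker(sof, sol)
        return self.pfm

    def mov_to_pfs(self, saved):
        # Restore the PFM from the value saved by alloc.
        self.pfm = saved

    def br_ret(self):
        # Restore CFM from PFM; the caller's locals come back into view.
        self.cfm = self.pfm
        self.base -= self.cfm.sol


stack = RegisterStack(sof=21, sol=14)   # caller: locals r32-r45, outputs r46-r52
stack.br_call()                         # CFM becomes sof=7, sol=0
saved_pfm = stack.alloc(sof=19, sol=16) # callee: r32-r47 locals, r48-r50 outputs
stack.mov_to_pfs(saved_pfm)
stack.br_ret()                          # CFM back to sof=21, sol=14
```

Because a call advances `base` by the caller's sol, the caller's first output register (r46 in the example) is the callee's r32, which is how arguments are passed without copying.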

13 Leaf Procedure Optimization
No need to save/restore PFM
Can always use scratch static GRs
Can omit alloc if:
–Not many registers needed
–Register rotation not needed

14 Register Stack Engine
Automatically spills registers to memory and fills them from memory as needed
Registers saved on a Backing Store Stack
Spills/fills NaT bits as well

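15 Reg Stack & Backing Store
[Figure: A calls B calls C. The physical stacked registers hold procA (sol a), procB (sol b), and procC's current frame (sof c), plus two unallocated regions. The Backing Store holds procA's ancestors, procA, and procB. Calls and returns move the current frame through the physical stacked registers; RSE loads/stores move registers between the physical stacked registers and the Backing Store.]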

16 Register Stack: Summary
Exposes register renaming to SW
Avoids register spill when few needed
Hides register spill/fill
Programmable sizes
–only use as many registers as you need

17 Outline
Register Stack
–Register Stack Engine
Register Rotation
–Loop Branches
–Modulo-Scheduling of Loops
Summary

18 Register Rotation
Motivation:
–pipeline-schedule loops onto HW
–remove extraneous work from loop
–minimize start-up overhead
–small code footprint
–maximum computational throughput with few instructions

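19 GR Stack Frame w/ Rotation
[Figure: GRs 0-31 are static. The stacked GRs 32-127 hold the frame's locals and outputs (sized by sof and sol), with a rotating region of Size of Rotating (sor) registers overlaid on the frame. The Current Frame Marker (CFM) fields are sof, sol, sor, rrb.gr, rrb.fr, rrb.pr.]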

20 GR Rotation
Size of rotating region multiple of 8
Rotating region overlays current frame
–Starts at r32
–Overlay allows rotation & stack renaming in a single level of adders
–Must copy input registers before loop

21 FR Rotation
[Figure: FRs 0-31 are static; FRs 32-127 rotate.]
Upper 3/4 of register file rotates

22 Predicate Rotation
[Figure: PRs 0-15 are static; PRs 16-63 rotate.]
Upper 3/4 of register file rotates

23 Register Rotation & RRB
Separate Rotating Register Base for each: GRs, FRs, PRs
Loop branches decrement all register rotating bases (RRB)
Instructions contain a “virtual” register number
–RRB + virtual register number = physical register number
[Animation: memory holds the words "Palm", "Springs", "is", "Sunny". RRB=0. ld 1 loads "Palm" into R35.]

24 Register Rotation & RRB
[Animation: RRB=0. ld 2 loads "Springs" into R34; st 1 stores "Palm" from R35.]

25 Register Rotation & RRB
[Animation: RRB=-1; the register labels have shifted by one and wrap from 32 to 127. ld 3 loads "is" into R34; st 2 stores "Springs" from R35.]

26 Register Rotation & RRB
[Animation: RRB=-2. ld 4 loads "Sunny" into R34; st 3 stores "is" from R35.]

27 Register Rotation & RRB
[Animation: RRB=-3. st 4 stores "Sunny" from R35.]
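The virtual-to-physical mapping in these slides can be sketched in Python. This is a minimal illustration, not the hardware's definition: the function names are hypothetical, the rotating region is assumed to cover all 96 stacked registers (sor=96, which matches the labels wrapping from 32 to 127 in the animation), and an ordinary `if` stands in for predication. The loop loads each word into virtual R34 and stores from virtual R35 one pass later, with the RRB decremented on every pass.

```python
def physical_gr(virtual, rrb, sor=96):
    """Map a virtual GR number to a physical one.

    Assumes a rotating region of `sor` registers starting at r32; registers
    outside that region are not renamed by rotation.
    """
    if virtual < 32 or virtual >= 32 + sor:
        return virtual
    # RRB + virtual register number, wrapped within the rotating region.
    return 32 + (virtual - 32 + rrb) % sor


def rotating_copy(words, sor=96):
    """Copy `words` through rotating registers: ld -> R34, st <- R35."""
    regs = {}    # physical register number -> value
    stored = []
    rrb = 0
    for i in range(len(words) + 1):
        if i < len(words):
            regs[physical_gr(34, rrb, sor)] = words[i]      # ld into virtual R34
        if i > 0:
            stored.append(regs[physical_gr(35, rrb, sor)])  # st from virtual R35
        rrb -= 1                                            # loop branch rotates
    return stored


print(physical_gr(35, -1))   # the register written as R34 one pass earlier
print(rotating_copy(["Palm", "Springs", "is", "Sunny"]))
```

No register-to-register copy is needed to move a value from the load to the store: the rename performed by the loop branch does that job.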

28 Loop Branches
br.cloop uses LC for simple, non-pipelined loops
–decrements LC and loops until LC is 0
br.ctop uses LC and EC for pipelined counted loops
br.wtop uses branch predicate and EC for pipelined “while” loops
br.cexit, br.wexit used for unrolled, pipelined loops

29 br.ctop
Function (simplified):
–if (LC>0) {LC--; pr[63]=1; rrb--; loop;}
 else if (EC>1) {EC--; pr[63]=0; rrb--; loop;}
 else {EC--; pr[63]=0; rrb--; fall_through;}
LC counts main loop iterations
EC counts pipeline stages for drain
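The simplified br.ctop function above translates directly into Python. This sketch follows the slide's simplified definition only; the function names are hypothetical, and the RRB decrement (which happens on every path) is left implicit. It also counts how many times the loop body runs for given LC and EC values.

```python
def br_ctop(lc, ec):
    """One br.ctop per the simplified slide definition.

    Returns (lc, ec, value_written_to_pr63, branch_taken).
    The RRB is decremented in all three cases and is not modeled here.
    """
    if lc > 0:
        return lc - 1, ec, 1, True      # main iterations: start a new one
    if ec > 1:
        return lc, ec - 1, 0, True      # drain: no new iteration starts
    return lc, ec - 1, 0, False         # pipeline empty: fall through


def count_passes(lc, ec):
    """Number of times the loop body executes before br.ctop falls through."""
    passes = 0
    taken = True
    while taken:
        passes += 1                     # execute the loop body once
        lc, ec, _, taken = br_ctop(lc, ec)
    return passes


print(count_passes(3, 4))   # LC=3 (4 iterations), EC=4 (4 stages)
```

Under this definition the body runs (LC + 1) + (EC - 1) times: one pass per source iteration, plus the passes needed to drain the remaining stages.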

30 Software Pipelining
Overlapping execution of different loop iterations
[Figure: sequential execution of loop iterations vs. overlapped (pipelined) execution]
More iterations in same amount of time

31 Software Pipelining
Traditional architectures use loop unrolling
–High overhead: extra code for loop body, prologue, and epilogue
Synergistic use of IA-64 features:
–Full Predication
–Special branches
–Register rotation: removes loop copy overhead
–Predicate rotation: removes prologue & epilogue
Especially Useful for Integer Code With Small Number of Loop Iterations

32 Pipelined Loop Example
DAXPY inner loop
–dy[i] = dy[i] + (da * dx[i])
–2 loads, 1 fma, 1 store / iteration
Machine assumptions
–can do 2 loads, 1 store, 1 fma, 1 br / cycle
–load latency of 2 clocks
–fma latency of 1 clock

33 Example: Pipeline
Each column represents 1 source iteration
[Figure: pipeline diagram; each column contains the steps load dx,dy; tmp = dy + da * dx; store dy]

34 Example Code
        .rotf dx[3], dy[3], tmp[2]
        mov ar.lc = 3          // #iterations-1
        mov ar.ec = 4          // #stages
        mov pr.rot = 0x10000 ;;
looptop:
  (p16) ldfd dx[0] = [dxsp],8
  (p16) ldfd dy[0] = [dysp],8
  (p18) fma.d tmp[0] = da, dx[2], dy[2]
  (p19) stfd [dydp] = tmp[1],8
        br.ctop looptop ;;

35 Loop Execution: Initialization
Loop body: (p16) ld x, (p16) ld y, (p18) fma, (p19) st
RRB=0, LC=3, EC=4
Predicates (physical register:value): 63:0 16:1 17:0 18:0 19:0

36 Loop Execution: Branch 1
Before: RRB=0, LC=3, EC=4; 63:0 16:1 17:0 18:0 19:0
br.ctop writes 1 to p63: 63:1 16:1 17:0 18:0 19:0
After rotation: RRB=-1, LC=2, EC=4; 62:0 63:1 16:1 17:0 18:0

37 Loop Execution: Branch 2
Before: RRB=-1, LC=2, EC=4; 62:0 63:1 16:1 17:0 18:0
br.ctop writes 1 to p63: 62:1 63:1 16:1 17:0 18:0
After rotation: RRB=-2, LC=1, EC=4; 61:0 62:1 63:1 16:1 17:0

38 Loop Execution: Branch 3
Before: RRB=-2, LC=1, EC=4; 61:0 62:1 63:1 16:1 17:0
br.ctop writes 1 to p63: 61:1 62:1 63:1 16:1 17:0
After rotation: RRB=-3, LC=0, EC=4; 60:0 61:1 62:1 63:1 16:1

39 Loop Execution: Branch 4
Before: RRB=-3, LC=0, EC=4; 60:0 61:1 62:1 63:1 16:1
br.ctop writes 0 to p63
After rotation: RRB=-4, LC=0, EC=3; 59:0 60:0 61:1 62:1 63:1

40 Loop Execution: Branch 5
Before: RRB=-4, LC=0, EC=3; 59:0 60:0 61:1 62:1 63:1
br.ctop writes 0 to p63
After rotation: RRB=-5, LC=0, EC=2; 58:0 59:0 60:0 61:1 62:1

41 Loop Execution: Branch 6
Before: RRB=-5, LC=0, EC=2; 58:0 59:0 60:0 61:1 62:1
br.ctop writes 0 to p63
After rotation: RRB=-6, LC=0, EC=1; 57:0 58:0 59:0 60:0 61:1

42 Loop Execution: Branch 7
Before: RRB=-6, LC=0, EC=1; 57:0 58:0 59:0 60:0 61:1
br.ctop writes 0 to p63 and falls through
After rotation: RRB=-7, LC=0, EC=0; 56:0 57:0 58:0 59:0 60:0
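The seven-branch execution sequence can be reproduced with a small Python simulation of the DAXPY loop. This is a sketch under assumptions, not a model of real hardware timing: instructions within a pass run one after another, the rotating regions are taken to be p16-p63 and f32-f127, `.rotf dx[3], dy[3], tmp[2]` is assumed to assign dx, dy, tmp to consecutive registers starting at f32, and the function name and separate input/output lists are illustrative.

```python
def run_daxpy(da, dx_mem, dy_mem):
    """Simulate the pipelined loop; returns (new dy, passes, stage-predicate trace)."""
    n = len(dx_mem)              # must be at least 1, since LC = n - 1
    dy_out = list(dy_mem)
    pr = [0] * 64                # physical predicates; p16..p63 rotate
    fr = [0.0] * 128             # physical FP registers; f32..f127 rotate
    rrb = 0

    def p(v):                    # virtual rotating predicate -> physical index
        return 16 + (v - 16 + rrb) % 48

    def f(v):                    # virtual rotating FR -> physical index
        return 32 + (v - 32 + rrb) % 96

    DX, DY, TMP = 32, 35, 38     # assumed layout for .rotf dx[3], dy[3], tmp[2]
    lc, ec = n - 1, 4            # mov ar.lc = #iterations-1; mov ar.ec = #stages
    pr[p(16)] = 1                # mov pr.rot = 0x10000 sets p16 only
    ld = st = 0                  # post-incremented load and store indices
    passes, trace = 0, []
    while True:
        passes += 1
        trace.append(tuple(pr[p(v)] for v in (16, 17, 18, 19)))
        if pr[p(16)]:            # (p16) ldfd dx[0], dy[0]
            fr[f(DX)] = dx_mem[ld]
            fr[f(DY)] = dy_mem[ld]
            ld += 1
        if pr[p(18)]:            # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            fr[f(TMP)] = da * fr[f(DX + 2)] + fr[f(DY + 2)]
        if pr[p(19)]:            # (p19) stfd tmp[1]
            dy_out[st] = fr[f(TMP + 1)]
            st += 1
        # br.ctop, simplified form from the slides
        if lc > 0:
            lc, new_p63, taken = lc - 1, 1, True
        elif ec > 1:
            ec, new_p63, taken = ec - 1, 0, True
        else:
            ec, new_p63, taken = ec - 1, 0, False
        pr[p(63)] = new_p63      # written before rotation; read as p16 next pass
        rrb -= 1
        if not taken:
            return dy_out, passes, trace


result, passes, trace = run_daxpy(2.0, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
for stage_predicates in trace:
    print(stage_predicates)      # (p16, p17, p18, p19) at the start of each pass
```

The trace shows the pipeline filling during the first passes, running with all stages active, and then draining, all from a single copy of the loop body with no separate prologue or epilogue code.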

43 Pipelining & Latency
Suppose we change the latencies
–load latency of 6 clocks
–fma latency of 4 clocks

44 Example: New Pipeline
Each column represents 1 source iteration
[Figure: pipeline diagram; each column contains the steps load dx,dy; tmp = dy + da * dx; store dy]

45 Updated Loop
        .rotf dx[7], dy[7], tmp[5]
        mov ar.lc = 3          // #iterations-1
        mov ar.ec = 11         // #stages
        mov pr.rot = 0x10000 ;;
looptop:
  (p16) ldfd dx[0] = [dxsp],8
  (p16) ldfd dy[0] = [dysp],8
  (p22) fma.d tmp[0] = da, dx[6], dy[6]
  (p26) stfd [dydp] = tmp[4],8
        br.ctop looptop ;;
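Only the stage predicates, rotating-register counts, and EC differ between the two versions of the loop, and all of them follow from the latencies. The Python sketch below derives them, assuming one new iteration starts per cycle and each pipeline stage is one cycle; the function name and dictionary keys are hypothetical.

```python
def pipeline_parameters(load_latency, fma_latency):
    """Derive stage predicates and register counts for the DAXPY loop.

    Assumes the loads sit in stage 0 (predicate p16) and that one new
    iteration starts every cycle.
    """
    fma_stage = load_latency                # fma waits for the loads
    store_stage = fma_stage + fma_latency   # store waits for the fma
    return {
        "fma_predicate": 16 + fma_stage,
        "store_predicate": 16 + store_stage,
        "ec": store_stage + 1,              # number of pipeline stages
        "rot_dx_dy": load_latency + 1,      # dx[0] written, dx[load_latency] read
        "rot_tmp": fma_latency + 1,         # tmp[0] written, tmp[fma_latency] read
    }


print(pipeline_parameters(2, 1))   # original machine assumptions
print(pipeline_parameters(6, 4))   # higher-latency assumptions
```

With latencies of 2 and 1 this gives p18, p19, EC=4, and `.rotf dx[3], dy[3], tmp[2]`; with latencies of 6 and 4 it gives p22, p26, EC=11, and `.rotf dx[7], dy[7], tmp[5]`, matching the two code slides. The loop body itself stays at five instructions.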

46 Rotation: Summary
Loop pipelining maximizes performance; minimizes overhead
–Avoids code expansion of unrolling and code explosion of prologue and epilogue
–Smaller code means fewer cache misses
–Greater performance improvements in higher latency conditions
Reduced overhead allows S/W pipelining of small loops with unknown trip counts
–Typical of integer scalar codes

47 Outline
Register Stack
–Register Stack Engine
Register Rotation
–Loop Branches
–Modulo-Scheduling of Loops
Summary

48 Register Model Summary
GR Stack
–Overlap call/ret operations with real work
–RSE hides spills/fills
GR, FR, PR Rotation
–General acceleration for all types of loops
SW-visible resources
–Large named register files & renaming
HW simplicity and explicit control

49 IA-64 Register Model: Stack & Rotation Dale Morris Architect Hewlett Packard Co.

