1
University of Michigan Electrical Engineering and Computer Science
Increasing the Number of Effective Registers in a Low-Power Processor Using a Windowed Register File
Rajiv A. Ravindran, Robert M. Senger, Eric D. Marsman, Ganesh S. Dasika, Matthew R. Guthaus, Scott A. Mahlke, Richard B. Brown
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor
2
Architected Registers: More or Less?
Fewer registers:
– Smaller hardware structures: more power efficient
– Tighter instruction encoding, smaller memory footprint
– However, more loads/stores to memory: reduced performance, increased power
More registers:
– Larger hardware structures: less power efficient
– Increased code size, larger memory footprint
– However, more variables can be mapped from memory to registers, which reduces power and enables ILP optimizations
3
Objective of this Work
Provide a large number of architected registers
But maintain the instruction encoding, and thus code size
Use a windowed register file architecture
– But in an unconventional way
Traditional register window
– Reduce function save/restore overhead
Our approach
– Large register file partitioned into multiple windows
– Appearance of a large register file
4
Windowed Register File Architecture
[Figure labels: Machine Status Register (MSR) with window status bit; 3-bit operand field; 16-regs register file; 8-regs window; functional unit (FU)]
Example code:
    add r1, r2, r3
    iw-mov r9, r1
    win-swap #2
    sub r2, r1, r3
– 0010 r1: register 1 in register file
– 1010 r1: register 9 in register file
– win-swap: toggle active window
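The register selection on this slide can be sketched as follows. This is a minimal illustration, not the WIMS hardware description: it assumes two windows of 8 registers each, that windows are numbered #1 and #2 as in the win-swap operand, and that the active window simply offsets the 3-bit operand field. The names `REGS_PER_WINDOW` and `physical_register` are hypothetical.

```python
REGS_PER_WINDOW = 8   # assumption: a 3-bit operand field addresses 8 registers
NUM_WINDOWS = 2       # assumption: a 16-register file split into two windows


def physical_register(window: int, operand: int) -> int:
    """Map (active window, 3-bit operand field) to a register-file index.

    Windows are numbered from 1, following the `win-swap #2` notation.
    """
    if not 1 <= window <= NUM_WINDOWS:
        raise ValueError("window out of range")
    if not 0 <= operand < REGS_PER_WINDOW:
        raise ValueError("operand does not fit in the 3-bit field")
    return (window - 1) * REGS_PER_WINDOW + operand


# r1 with window 1 active is register 1; after win-swap #2 it is register 9.
print(physical_register(1, 1), physical_register(2, 1))
```

The instruction encoding only ever carries the 3-bit operand; which half of the register file it reaches depends on state set by the swap instruction.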
5
Wireless Integrated Microsystems (WIMS)
Developed at the University of Michigan (Robert Senger et al., DAC 2003)
6
Related Work
Traditional use of register windows
– Reduce save/restore, context switch overhead
– SPARC, IA-64, ADSP-219x, Tensilica
Procedure call overhead is small in the embedded domain
– Procedure inlining reduces call/return overhead
– ~2% increase in performance using infinite register windows in WIMS
Register connects [Kiyohara:93], register queues [Smelyanskiy:01]
– Fixed ISA, provide more registers than allowed
– Layer of indirection to access every operand
7
Motivating Example

1-window of 4-registers:
    loop:
        LOAD R1-1, [SP, #24]
        ADD R1-0, R1-3, R1-1
        LOAD R1-0, [R1-0]
        LOAD R1-1, [SP, #32]
        STORE [SP, #72], R1-0
        ADD R1-0, R1-3, R1-1
        LOAD R1-0, [R1-0]
        LOAD R1-1, [SP, #72]
        MPY R1-0, R1-1, R1-0
        STORE [SP, #40], R1-0
        LOAD R1-0, [SP, #16]
        ADD R1-1, R1-3, R1-0
        LOAD R1-0, [R1-1]
        LOAD R1-1, [SP, #40]
        ADD R1-0, R1-1, R1-0
        LOAD R1-1, [SP, #80]
        STORE [R1-3], R1-0
        ADD R1-0, R1-1, #1
        ADD R1-3, R1-3, #4
        CMP R1-0, #100
        BRCT loop

2-window of 4-registers each:
    loop:
        IW-MOV R1-0, R2-1
        WIN-SWAP #1
        ADD R1-3, R1-2, R1-0
        IW-MOV R1-0, R2-2
        LOAD R1-1, [R1-3]
        ADD R1-3, R1-2, R1-0
        LOAD R1-0, [R1-3]
        MPY R1-3, R1-1, R1-0
        IW-MOV R1-1, R2-3
        ADD R1-1, R1-2, R1-1
        LOAD R1-1, [R1-0]
        ADD R1-0, R1-3, R1-1
        STORE [R1-2], R1-0
        WIN-SWAP #2
        ADD R2-0, R2-0, #1
        WIN-SWAP #1
        ADD R1-2, R1-2, #4
        WIN-SWAP #2
        CMP R2-0, #100
        BRCT loop

1-window of 8-registers:
    loop:
        ADD R1-3, R1-0, R1-6
        LOAD R1-2, [R1-3]
        ADD R1-3, R1-0, R1-7
        LOAD R1-4, [R1-3]
        MPY R1-3, R1-2, R1-4
        ADD R1-2, R1-0, R1-5
        ADD R1-1, R1-1, #1
        LOAD R1-4, [R1-2]
        ADD R1-2, R1-3, R1-4
        STORE [R1-0], R1-2
        ADD R1-0, R1-0, #4
        CMP R1-1, #100
        BRCT loop
8
Tradeoffs for the Compiler
Register Utilization
– Move variables from memory to register
– Reduces spill code
– Distribute program variables and temporaries to all available registers in multiple windows
VS
Register Management
– Reduce overhead due to window management instructions
– Activate windows (swaps)
– Data transfer (iw-moves)
– Bundle accesses to same window
– Fewer transitions between windows
Balance these issues in an intelligent manner
9
Register Window Partitioning
[Figure: graph of virtual registers VR1 to VR6 divided between Partition-1 and Partition-2]
Weight Calculation
– Partition weight: over-commitment of register resources
– Edge weight: penalty of separating VRs
Partitioning algorithm
– Move nodes between partitions to minimize partition + edge weights
– Modified FM graph partitioning algorithm
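The quantity being minimized can be sketched as the sum of all partition weights plus the weights of edges whose endpoints land in different partitions. The code below is only an illustration under stated assumptions: the edge weights and the `toy_partition_weight` function (a flat penalty per VR beyond 3 registers) are made-up placeholders, not values or formulas from the slides.

```python
def total_cost(assignment, edge_weights, partition_weight):
    """Partition weights plus the weights of edges cut by the assignment."""
    cut = sum(
        weight
        for (a, b), weight in edge_weights.items()
        if assignment[a] != assignment[b]
    )
    spill = sum(
        partition_weight({vr for vr, p in assignment.items() if p == part})
        for part in set(assignment.values())
    )
    return spill + cut


def toy_partition_weight(vrs, regs_per_window=3, penalty=100):
    # Hypothetical stand-in for the spill-pressure estimate.
    return penalty * max(0, len(vrs) - regs_per_window)


# Illustrative edge weights only.
edges = {("VR1", "VR2"): 10, ("VR3", "VR4"): 5, ("VR5", "VR6"): 8}
split = {"VR1": 1, "VR2": 1, "VR3": 1, "VR4": 2, "VR5": 2, "VR6": 2}
moved = dict(split, VR4=1)  # move VR4 into partition 1

print(total_cost(split, edges, toy_partition_weight))  # one cut edge, no over-commitment
print(total_cost(moved, edges, toy_partition_weight))  # no cut edge, one VR over-committed
```

A move is only worthwhile when the edge penalty it removes outweighs the over-commitment it creates, which is the balance the partitioner searches for.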
10
Edge Weight Calculation: Move Cost
Example loop (profile counts on the figure: 3104 and 1; the loop body executes 3104 times):
    loop:
        ADD VR34, VR27, VR32
        LOAD VR6, [VR34]
        LOAD VR9, [VR27]
        MPY VR10, VR6, VR9
        ADD VR20, VR20, VR10
        ADD VR2, VR2, #1
        ADD VR27, VR27, #4
        CMP VR2, 32
        BRCT loop
Edge VR6-VR9: if the two VRs are placed in different windows, MPY VR10, VR6, VR9 needs an inter-window move:
    IW-MOVE VR100, VR9    (x 3104)
    MPY VR10, VR6, VR100
edge weight = move-cost + swap-cost
Computed once before partitioning
11
Edge Weight Calculation: Swap Cost
Same loop as on the move-cost slide (loop body executes 3104 times).
Instructions that reference VR6 or VR9:
    LOAD VR6, [VR34]
    LOAD VR9, [VR27]
    MPY VR10, VR6, VR9
With VR6 and VR9 in different windows, the active window has to be swapped (SWAP active window) around these accesses.
swap cost: 2 x 3104 = 6208
edge weight = move-cost + swap-cost
VR6-VR9 edge weight = 3104 + 6208 = 9312
Computed once before partitioning
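The arithmetic for the VR6-VR9 edge can be written out as a small sketch. It assumes, as the numbers on the slides suggest, one inter-window move and two window swaps per loop iteration, each weighted by the loop's execution count of 3104. The function name and parameters are hypothetical.

```python
def edge_weight(frequency, moves_per_iteration=1, swaps_per_iteration=2):
    """edge weight = move-cost + swap-cost, both scaled by execution count."""
    move_cost = moves_per_iteration * frequency
    swap_cost = swaps_per_iteration * frequency
    return move_cost + swap_cost


# VR6-VR9 edge in the example loop: 3104 + 2 * 3104
print(edge_weight(3104))
```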
12
Partition Weight Calculation
[Figure: the loop from the move-cost slide with profile counts 3104 and 1; VR columns VR10, VR20, VR2, VR34, VR6, VR9, VR27, VR32; graph nodes VR9, VR10, VR32, VR27, VR34, VR6, VR2, VR20]
Estimates the spill pressure using a crude register allocation
Partition weight = sum of the cost of all the spilled VRs
Computed dynamically during the node assignment process
13
Partition Weight Calculation: Example
Assume 3 registers per window/partition, and all VRs are assigned to one partition.
Same loop as on the move-cost slide (loop body executes 3104 times); VR columns: VR10, VR20, VR2, VR34, VR6, VR9, VR27, VR32.
Spill cost:
    VR32: 3104
    VRs 10, 6, 20, 34: 6208
    VR2: 9312
    VR27: 12416
At ADD VR34, VR27, VR32: VRs 32 and 20 are spilled.
Spilled VRs = {32, 20}
14
Partition Weight Calculation: Example (continued)
Assume 3 registers per window/partition, and all VRs are assigned to one partition.
Same loop as on the move-cost slide (loop body executes 3104 times); VR columns: VR10, VR20, VR2, VR34, VR6, VR9, VR27, VR32.
Spill cost:
    VR32: 3104
    VRs 10, 6, 20, 34: 6208
    VR2: 9312
    VR27: 12416
VRs 32 and 20 are already spilled. At LOAD VR6, [VR34]: VR6 is spilled as well.
Spilled VRs = {32, 20, 6}
Continuing further, partition weight = spill cost of VRs 32, 20, 6, 10 = 21728
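The spill costs listed on the slide are consistent with counting every definition and use of a VR in the loop and multiplying by the loop's execution count of 3104. That reading is an assumption inferred from the numbers, not a formula stated in the slides. The sketch below applies it to the example loop and then sums the costs of the spilled VRs; the `LOOP` encoding and function names are hypothetical.

```python
LOOP_FREQUENCY = 3104

# Each instruction as (VRs defined, VRs used), transcribed from the example loop.
LOOP = [
    (["VR34"], ["VR27", "VR32"]),  # ADD VR34, VR27, VR32
    (["VR6"], ["VR34"]),           # LOAD VR6, [VR34]
    (["VR9"], ["VR27"]),           # LOAD VR9, [VR27]
    (["VR10"], ["VR6", "VR9"]),    # MPY VR10, VR6, VR9
    (["VR20"], ["VR20", "VR10"]),  # ADD VR20, VR20, VR10
    (["VR2"], ["VR2"]),            # ADD VR2, VR2, #1
    (["VR27"], ["VR27"]),          # ADD VR27, VR27, #4
    ([], ["VR2"]),                 # CMP VR2, 32
    ([], []),                      # BRCT loop
]


def spill_costs(loop, frequency):
    """Assumed cost model: one memory access per definition or use."""
    costs = {}
    for defs, uses in loop:
        for vr in defs + uses:
            costs[vr] = costs.get(vr, 0) + frequency
    return costs


def partition_weight(spilled, costs):
    """Partition weight = sum of the spill costs of the spilled VRs."""
    return sum(costs[vr] for vr in spilled)


costs = spill_costs(LOOP, LOOP_FREQUENCY)
print(costs["VR32"], costs["VR2"], costs["VR27"])
print(partition_weight(["VR32", "VR20", "VR6", "VR10"], costs))
```

Which VRs end up spilled is decided by the crude register allocation described on the slides; the sketch takes that set as given.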
15
Node Partitioning: Example
[Figure: graph nodes VR9, VR10, VR32, VR27, VR34, VR6, VR2, VR20 divided between P1 and P2]
Partition weight of P1 = sum of spill cost of VRs 32, 20, 6, 10 = 21728
Partition weight of P2 = 0
Total Gain = Partition Weight + Edge Weight

    VR    Partition    Edge      Total gain
    2     6208         -2276     3932
    6     6208         -11669    -5461
    9     6208         -10723    -4515
    10    6208         -10675    -4467
    20    6208         -4234     1974
    27    6208         -13008    -6800
    32    3104         -7332     -4228
    34    0            -10436    -10436

Selected node: VR2
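Selecting the node to move can be sketched as taking the maximum total gain over all candidates. The partition and edge values below are copied from the table on this slide; picking the single highest-gain node is an assumption about one step of the move loop, and the names `GAINS`, `total_gain`, and `best_move` are hypothetical.

```python
# VR -> (partition gain, edge gain), from the slide's table.
GAINS = {
    "VR2": (6208, -2276),
    "VR6": (6208, -11669),
    "VR9": (6208, -10723),
    "VR10": (6208, -10675),
    "VR20": (6208, -4234),
    "VR27": (6208, -13008),
    "VR32": (3104, -7332),
    "VR34": (0, -10436),
}


def total_gain(partition_gain, edge_gain):
    """Total Gain = Partition Weight + Edge Weight."""
    return partition_gain + edge_gain


def best_move(gains):
    """Return the VR whose move yields the highest total gain."""
    return max(gains, key=lambda vr: total_gain(*gains[vr]))


vr = best_move(GAINS)
print(vr, total_gain(*GAINS[vr]))
```

Only VR2 and VR20 have a positive total gain here; for every other VR the edge penalty of leaving its neighbors exceeds the spill cost saved.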
16
Node Partitioning: Final Example
[Figure: graph nodes VR9, VR10, VR32, VR27, VR34, VR6, VR2, VR20 divided between P1 and P2]
Partition weight of P1 = spill cost of VR 32 = 3104
Partition weight of P2 = 0
Resulting code (operands written as VR:register; profile counts 3104 and 1):
    loop:
        WIN_SWAP #1
        LOAD 32:R1-0, [SP, #0]
        ADD 34:R1-3, 27:R1-1, 32:R1-0
        LOAD 9:R1-3, [27:R1-1]
        LOAD 6:R1-2, [34:R1-3]
        MPY 39:R1-0, 6:R1-2, 9:R1-3
        IW_MOV 10:R2-2, 39:R1-0
        WIN_SWAP #2
        ADD 20:R2-1, 20:R2-1, 10:R2-2
        ADD 2:R2-0, 2:R2-0, #1
        WIN_SWAP #1
        ADD 27:R1-1, 27:R1-1, #4
        WIN_SWAP #2
        CMP 2:R2-0, #32
        BRCT loop
Reduced from 6 spill operations to 1
Added 5 window management instructions
Performance remains the same, but power decreases
17
Performance of WIMS: 8 registers/window, 1-window vs 2 and 4 windows
[Figure: bar chart, y-axis % cycles (-50 to 50); series: Performance, Spill benefit, Swap and move overhead; benchmarks: fir, rawc, rawd, g721enc, g721dec, compress, sha, yacc, cjpeg, djpeg, gsmenc, gsmdec, unepic, mpg2dec, average]
18
Performance of VLIW: 8 registers/window, 1-window vs 2 and 4 windows
[Figure: bar chart, y-axis % cycles (-50 to 50); series: Performance, Spill benefit, Swap and move overhead; benchmarks: fir, rawc, rawd, g721enc, g721dec, compress, sha, yacc, cjpeg, djpeg, gsmenc, gsmdec, unepic, mpg2dec, average]
19
Power savings on the 8-register WIMS: 1-window vs 2 and 4-window machine
[Figure: bar chart, y-axis % power savings (0 to 20); series: 2-window, 4-window; benchmarks: fir, rawc, rawd, g721enc, g721dec, compress, sha, yacc, cjpeg, djpeg, gsmenc, gsmdec, unepic, mpeg2dec, average]
20
Conclusion
A novel graph-partitioning-based compiler algorithm to exploit windowed register files within a single procedure
Hardware/software solution for reducing code size while maintaining an effectively large number of registers
7% reduction in power for the 8-register case on WIMS
Average improvement in performance:

            w2.r4    w4.r4    w8.r4    w2.r16    w2.r8    w4.r8
    WIMS    2.96     12.716.18         2.55      7        10
    VLIW    10.58    26.38    33.83    5.03      18       22
21
Swap Cost Over-Counting
Same loop as on the move-cost slide (loop body executes 3104 times); graph nodes: vr9, vr27, vr10, vr6.
Between these two instructions:
    LOAD VR9, [VR27]
    MPY VR10, VR6, VR9
a swap is counted for each VR pair:
    SWAP - VR9, VR6
    SWAP - VR9, VR10
    SWAP - VR27, VR10
    SWAP - VR27, VR6
4 swaps! In reality, only 1 swap is required
Solution: normalize swap cost
Swap cost between VRs 6 and 9 = 1/4 of the cost of a single swap = 1/4 * 3104 = 776
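The normalization can be sketched as dividing the cost of one swap evenly among the VR pairs that would each be charged for it. How the pairs are enumerated is an assumption here: the sketch pairs every VR of the first instruction with every VR that appears only in the second, which reproduces the four pairs listed on the slide. The function names are hypothetical.

```python
from itertools import product


def swap_pairs(first_regs, second_regs):
    """VR pairs that may force a swap between two adjacent instructions."""
    first = sorted(set(first_regs))
    only_second = sorted(set(second_regs) - set(first_regs))
    return list(product(first, only_second))


def normalized_swap_cost(frequency, pairs):
    """Share the cost of a single swap equally among all candidate pairs."""
    return frequency / len(pairs)


# LOAD VR9, [VR27] followed by MPY VR10, VR6, VR9
pairs = swap_pairs(["VR9", "VR27"], ["VR10", "VR6", "VR9"])
print(len(pairs), normalized_swap_cost(3104, pairs))
```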
22
Swap Insertion & Optimization
Remove redundant swaps
Hoist swaps to less frequently executed region
Combine swaps with other instructions
BRL/RTS optimization
Example code:
    WIN_SWAP #1
    add r1, r2, r3
    sub r4, r1, r2
    load r1, [r4]
    IW_MOV r9, r1
    WIN_SWAP #2
    shl r3, r4, r5
    add r3, r9, #2
    load r2, [r3]
    brl _foo()
    WIN_SWAP #1
    load r4, [r5]
    add r4, r4, #4
    WIN_SWAP #1
    mov r1, #0
    load r4, [_a]
    mul r2, r3, r4
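Redundant swap removal on a straight-line sequence can be sketched by tracking the active window and dropping any swap that selects the window already active. This is a simplified illustration, not the compiler's actual pass: it ignores control flow, and it conservatively assumes a call (`brl`) leaves the active window unknown. The function name and the string-based instruction format are hypothetical.

```python
def remove_redundant_swaps(instructions):
    """Drop WIN_SWAPs that select the window that is already active."""
    active = None  # unknown on entry
    kept = []
    for ins in instructions:
        fields = ins.split()
        opcode = fields[0].lower()
        if opcode == "win_swap":
            target = fields[1]
            if target == active:
                continue  # already in this window: swap is redundant
            active = target
        elif opcode == "brl":
            active = None  # assumption: callee may change the window
        kept.append(ins)
    return kept


code = [
    "WIN_SWAP #1", "add r1, r2, r3", "sub r4, r1, r2", "load r1, [r4]",
    "IW_MOV r9, r1", "WIN_SWAP #2", "shl r3, r4, r5", "add r3, r9, #2",
    "load r2, [r3]", "brl _foo()", "WIN_SWAP #1", "load r4, [r5]",
    "add r4, r4, #4", "WIN_SWAP #1", "mov r1, #0", "load r4, [_a]",
    "mul r2, r3, r4",
]
optimized = remove_redundant_swaps(code)
print(len(code) - len(optimized), "swap(s) removed")
```

On the slide's example only the second consecutive `WIN_SWAP #1` is dropped; hoisting and combining swaps would need control-flow and profile information that this sketch does not model.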
23
Performance of WIMS: 2-window 8-register vs 1-window 16-register
24
Overall Compilation System
[Figure: compilation flow; boxes as labeled]
FRONTEND
PREPASS SCHEDULING
CODE GENERATION
REGISTER ALLOCATION
SWAP INSERTION
POSTPASS SCHEDULING
REGISTER PARTITIONING
– CALCULATE EDGE WEIGHTS
– CALCULATE PARTITION WEIGHTS
– MOVE NODES
SWAP INSERTION
– NAIVE SWAP INSERTION
– SWAP OPTIMIZATION