Power-Aware Operand Delivery
Erika Gunadi and Mikko H. Lipasti
University of Wisconsin - Madison

[Figure: chart comparing the four configurations (ARF+PL, ARF+noPL, PRF+PL, PRF+noPL) across the structures RAT, ARF, PRF, ROB-data, ROB-tag, PL, RS, and BP]

BACKGROUND & MOTIVATION
- Power as design constraint
  - More transistors on chip means more power dissipated as heat
  - Challenges in cooling and packaging design
  - Microprocessor technology shifts from performance-focused to power-effective design
- Register renaming is used to support operand delivery in OoO machines
  - Widely used designs: ARF and PRF
  - PRF used by MIPS R10K, Alpha 21264, IBM Power4, IBM Power5, Intel Pentium 4
  - ARF used by PowerPC 604, Intel P6 family, Intel Core family, AMD K8 family
- Industry trend
  - ARF for mobile and power-aware server designs; PRF for power-hungry designs
  - No quantitative analysis has been published

DESIGN SPACE
Based on where the speculative results are stored: ARF and PRF
- ARF
  - Keeps non-speculative values in a small ARF and speculative values in the ROB
  - Writes results to the ROB at the writeback stage
  - Copies results to the ARF as instructions retire
- PRF
  - Keeps both speculative and non-speculative values in the PRF
  - Writes results only once (to the PRF) at the writeback stage
  - Updates only the rename table (RAT) as instructions retire
Based on where the operand reads occur: Payload RAM and no Payload RAM
- Payload RAM (PL)
  - Reads operand values at the dispatch stage, before instructions are inserted into the reservation stations
  - Copies ready operands to the payload RAM and gets unready operands from the bypass logic
- No Payload RAM (noPL)
  - Only checks operand ready status at the dispatch stage
  - Reads operand values from the necessary structures (PRF, or ARF+ROB) upon issue
The two classifications are orthogonal to each other.

Pipeline activity per stage (wr = write, re = read)

Stage    ARF+PL                   ARF+noPL                 PRF+PL                  PRF+noPL
Rename   wr RAT, re RAT           wr RAT, re RAT           wr RAT, re RAT          wr RAT, re RAT
Read     re ARF, re ROB           -                        re PRF                  -
Queue    wr RS, wr PL, wr ROB     wr RS, wr PL, wr ROB     wr RS, wr PL, wr ROB    wr RS, wr PL, wr ROB
Sched    wr RS                    wr RS                    wr RS                   wr RS
Issue    re PL                    re PL                    re PL                   re PL
Read     -                        re ARF, re ROB           -                       re PRF
Exe      execute                  execute                  execute                 execute
WB       wr ROB, wr RAT, wr PL    wr ROB, wr RAT           wr PRF, wr RAT, wr PL   wr PRF, wr RAT
Retire   re ROB, wr ARF, wr RAT   re ROB, wr ARF, wr RAT   re ROB                  re ROB

[Table: delay and area; labels: Delay, Area, Rename, Queue, Read, WB, Retire. Values as extracted: 2.07 0.06 2.69 N/A 2.83 0.82 1.98 0.09 2.94 N/A 0.80 0.28 3.21 1.53 N/A 2.83 2.81 2.01 0.38 N/A 0.74 2.02 2.76 2.25 0.29 N/A 1.68 1.56 2.46 1.05 N/A 2.52 2.12 0.62 N/A 1.32 1.42 2.16 2.03 0.98 N/A 0.71 1.21 2.02 1.93 0.24 N/A 4.58]

MACHINE TRADEOFFS
- NoPL: adds one pipeline stage between schedule and execute
  - Increases the load misscheduling penalty and the branch resolution loop
  - Decreases IPC
- ARF: harder to do out-of-order branch misprediction resolution
  - On PRF, easily done by checkpointing the RAT: MIPS R10K, Alpha 21264, Power4, Power5
  - On ARF, a RAT entry can point to the ROB or to the ARF, so a simple checkpoint may hold a stale pointer
  - ARF machines use in-order branch resolution: Intel P6, AMD K8, Intel Core

DETAILS OF DESIGN
- RAT
  - Provides renaming to resolve RAW and to eliminate WAR and WAW
  - Contains destination mapping, ready bits, and retire bits (only for ARF)
  - Built using RAM and comparators
- ROB
  - ROB-tag stores information needed while instructions are in the window
  - ROB-data stores speculative results before they are copied to the ARF
- Register File
  - RAM-based structure to store execution results
- Payload RAM
  - Stores input operands before instructions are executed
  - Consists of a CAM structure for the tags and a RAM structure for the data
- Reservation Stations
  - Consist of CAM and select logic
- Bypass Networks
  - Two-level bypass network to bypass data for back-to-back execution
  - Consist of comparators and muxes

Existing machines by category

                        ARF                                   PRF
Payload RAM (PL)        Intel P6, Intel Core,                 None
                        Intel Pentium M,
                        AMD K8 (Athlon, Opteron)
No Payload RAM (noPL)   None                                  Intel Pentium 4, MIPS R10K,
                                                              Alpha 21264, IBM Power4, Power 5

CONCLUSION
- Microprocessors can be categorized as ARF- or PRF-based, with or without PL
- Energy impact
  - PRF+noPL consumes the least energy
  - On average, 20% less energy than an ARF+PL-style machine in the affected structures
  - Roughly 6-7% saving of total chip energy
- IPC impact
  - Out-of-order branch resolution adds 3% performance on average
  - Operand read between issue and execute (noPL) decreases IPC by 1-2%
- PRF machines also simplify the implementation of out-of-order branch resolution
  - Improved performance with comparable energy

METHODOLOGY
- Timing and Power
  - Implemented in Verilog, synthesized using Design Compiler, placed and routed using Astro
  - Used LSI Logic gflxp 0.11 micron CMOS standard cell library
- Microarchitectural
  - SimpleScalar/Alpha 3.0, speculative scheduling/squashing replay, load-store reordering
  - Used SMARTS sampling with reference input sets
  - 4-wide, 128 ROB, 32 ARF / 96 PRF, 32 LQ, 24 SQ, 32 sched, 64KB DL1, 16KB IL1, 2MB L2

[Figure: block diagrams (RAT, ARF or PRF, ROB, payload RAM, ALUs) of the four configurations: ARF with Payload RAM, ARF with no Payload RAM, PRF with Payload RAM, PRF with no Payload RAM]
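
The stale-pointer hazard noted under MACHINE TRADEOFFS can be illustrated with a toy model. The sketch below is not the authors' simulator: the class names `ArfMachine` and `PrfMachine`, the function `demo`, and the policy of freeing a physical register when the next writer of the same architectural register retires are all assumptions made for illustration. It shows a RAT snapshot taken at a branch, followed by the retirement of an older producer before the branch resolves.

```python
from dataclasses import dataclass, field


@dataclass
class ArfMachine:
    """Toy ARF-style renamer: a RAT entry points at the ARF or at a ROB slot."""
    rat: dict = field(default_factory=dict)      # arch reg -> ("ARF", reg) or ("ROB", slot)
    rob_live: set = field(default_factory=set)   # ROB slots still holding a result
    next_slot: int = 0

    def rename_dest(self, reg):
        slot = self.next_slot
        self.next_slot += 1
        self.rob_live.add(slot)
        self.rat[reg] = ("ROB", slot)            # speculative value lives in the ROB
        return slot

    def retire(self, reg, slot):
        self.rob_live.discard(slot)              # result copied to the ARF, slot freed
        if self.rat.get(reg) == ("ROB", slot):
            self.rat[reg] = ("ARF", reg)         # live RAT is redirected to the ARF

    def is_stale(self, mapping):
        kind, idx = mapping
        return kind == "ROB" and idx not in self.rob_live


@dataclass
class PrfMachine:
    """Toy PRF-style renamer: a RAT entry is always a physical register number."""
    rat: dict
    free_list: list

    def rename_dest(self, reg):
        new = self.free_list.pop(0)
        old = self.rat[reg]
        self.rat[reg] = new
        return new, old                          # old is freed when this writer retires

    def retire(self, old):
        self.free_list.append(old)


def demo():
    # ARF: producer of r1 is renamed, a branch snapshots the RAT, producer retires.
    arf = ArfMachine(rat={"r1": ("ARF", "r1")})
    slot = arf.rename_dest("r1")
    arf_checkpoint = dict(arf.rat)
    arf.retire("r1", slot)
    arf_stale = arf.is_stale(arf_checkpoint["r1"])

    # PRF: same sequence; retirement frees only the previous mapping of r1.
    prf = PrfMachine(rat={"r1": 0}, free_list=[1, 2, 3])
    _, old = prf.rename_dest("r1")
    prf_checkpoint = dict(prf.rat)
    prf.retire(old)
    prf_stale = prf_checkpoint["r1"] in prf.free_list

    return arf_stale, prf_stale


print(demo())
```

In this model the ARF checkpoint ends up naming a freed ROB slot, while the PRF checkpoint still names an allocated physical register, which is consistent with the poster's point that simple RAT checkpointing suits PRF machines and that ARF machines resolve branches in order.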