CSE477 L26 System Power.1 Irwin&Vijay, PSU, 2003
CSE477 VLSI Digital Circuits, Fall 2003
Lecture 26: Low Power Techniques in Microarchitectures and Memories
Mary Jane Irwin
[Adapted from Rabaey’s Digital Integrated Circuits, Second Edition, ©2003 J. Rabaey, A. Chandrakasan, B. Nikolic]
Review: CMOS Energy & Power Equations
$E = C_L V_{DD}^2 P_{0\to1} + t_{sc} V_{DD} I_{peak} P_{0/1\to1/0} + V_{DD} I_{leak}$
$f_{0\to1} = P_{0\to1} \cdot f_{clock}$
$P = C_L V_{DD}^2 f_{0\to1} + t_{sc} V_{DD} I_{peak} f_{0\to1} + V_{DD} I_{leak}$
The three terms, in order:
- Dynamic power (~90% today and decreasing relatively)
- Short-circuit power (~8% today and decreasing absolutely)
- Leakage power (~2% today and increasing)
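As a rough illustration of the power equation on this review slide, the sketch below evaluates the dynamic, short-circuit, and leakage components separately. The function name `cmos_power` and every numeric value passed to it are assumptions chosen only to make the arithmetic concrete; none of them come from the lecture.

```python
def cmos_power(c_load, v_dd, f_01, t_sc, i_peak, i_leak):
    """Return (dynamic, short_circuit, leakage) power in watts.

    Evaluates P = C_L*V_DD^2*f_01 + t_sc*V_DD*I_peak*f_01 + V_DD*I_leak,
    where f_01 = P_01 * f_clock is the rate of power-consuming transitions.
    """
    dynamic = c_load * v_dd ** 2 * f_01
    short_circuit = t_sc * v_dd * i_peak * f_01
    leakage = v_dd * i_leak
    return dynamic, short_circuit, leakage


if __name__ == "__main__":
    # Hypothetical operating point (illustrative values, not from the slides).
    f_clock = 500e6   # clock frequency in Hz
    p_01 = 0.1        # probability of a 0 -> 1 transition per clock cycle
    parts = cmos_power(c_load=1e-9, v_dd=1.5, f_01=p_01 * f_clock,
                       t_sc=50e-12, i_peak=0.2, i_leak=2e-3)
    total = sum(parts)
    for name, p in zip(("dynamic", "short-circuit", "leakage"), parts):
        print(f"{name:>13}: {p * 1e3:8.2f} mW ({100 * p / total:4.1f}%)")
```

Because the dynamic term depends on the square of V_DD, halving the supply in this model cuts that term to one quarter while the leakage term only halves.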
Power and Energy Design Space
Techniques by energy type and by when they apply (design time and non-active modules: constant throughput/latency; run time: variable throughput/latency)
Active (dynamic) energy
- Design time: logic design, reduced V_dd, TSizing, multi-V_dd
- Non-active modules: clock gating
- Run time: DFS, DVS (dynamic frequency, voltage scaling)
Leakage (standby) energy
- Design time: multi-V_T, stack effect, pin ordering
- Non-active modules: sleep transistors, multi-V_dd, variable V_T, input control
- Run time: variable V_T
Reducing Power and Energy of Interconnects
Buses are a significant source of power dissipation due to high switching activities and large capacitive loading
- 15% of total power in Alpha
- 30% of total power in Intel
Share long data buses with time multiplexing (S1 uses even cycles, S2 odd)
[Figure: sources S1, S2 and destinations D1, D2 connected by dedicated buses versus one shared bus]
But what if data samples are correlated (e.g., sign bits)?
Bus Multiplexing and Correlated Data Streams
[Figure: bit switching probabilities versus bit position, MSB to LSB]
For a shared (multiplexed) bus, the advantages of data correlation are lost (the bus carries samples from two uncorrelated data streams)
- Bus sharing should not be used for positively correlated data streams
- Bus sharing may prove advantageous in a negatively correlated data stream (where successive samples switch sign bits), since its switching is already more random
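The loss of correlation on a shared bus can be seen in a small counting experiment. The sketch below counts per-bit toggles for two hand-made two's-complement streams on dedicated buses and then on one time-multiplexed bus. The helper names (`bit_toggles`, `interleave`), the 8-bit width, and the sample values are all illustrative assumptions.

```python
def bit_toggles(samples, width=8):
    """Count toggles per bit position (index 0 = LSB) between successive words.

    Samples are interpreted as `width`-bit two's-complement values.
    """
    mask = (1 << width) - 1
    toggles = [0] * width
    for prev, cur in zip(samples, samples[1:]):
        diff = (prev ^ cur) & mask          # bits that changed on the bus
        for bit in range(width):
            toggles[bit] += (diff >> bit) & 1
    return toggles


def interleave(s1, s2):
    """Time-multiplex two streams: S1 on even cycles, S2 on odd cycles."""
    shared = []
    for a, b in zip(s1, s2):
        shared.extend((a, b))
    return shared


if __name__ == "__main__":
    s1 = [1, 2, 3, 4]        # small positive samples: sign bit stays 0
    s2 = [-1, -2, -3, -4]    # small negative samples: sign bit stays 1
    msb = 7
    print("MSB toggles, dedicated S1:", bit_toggles(s1)[msb])
    print("MSB toggles, dedicated S2:", bit_toggles(s2)[msb])
    print("MSB toggles, shared bus:  ", bit_toggles(interleave(s1, s2))[msb])
```

In this toy case each stream keeps its sign bit constant on its own bus, while the shared bus flips the sign bit on every cycle.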
Reducing Power and Energy of Memories
Active power in a memory of m columns and n rows
$P = V_{DD} I_{DD}$, where
$I_{DD} = I_{array} + I_{decode} + I_{periphery} = [m\,i_{act} + m(n-1)\,i_{hld}] + [(n+m)\,C_{DE} V_{int} f] + [C_{PT} V_{int} f + I_{DCP}]$
- As expected, it is proportional to the size of the memory and is typically dominated by the array
Partition the memory array into multiple smaller banks (see L23.11) so that only the addressed bank is activated
- improves speed and lowers power
  - word line and bit line capacitances are reduced
  - number of bit cells activated is reduced
- At some point the delay and power overhead associated with the bank decoding circuit dominates (2 to 8 banks typical)
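To make the structure of the I_DD expression easier to see, the sketch below evaluates its three bracketed terms separately. The function name and the parameter values in the usage example are assumptions for illustration; the comments give one reading of the symbols (i_act as per-cell active current, i_hld as per-cell holding current), which the slide itself does not spell out.

```python
def memory_active_current(m, n, i_act, i_hld, c_de, c_pt, v_int, f, i_dcp):
    """Return (I_array, I_decode, I_periphery) in amperes for an m x n memory.

    I_array     = m*i_act + m*(n-1)*i_hld   (active cells plus holding cells)
    I_decode    = (n+m)*C_DE*V_int*f        (row and column decoders)
    I_periphery = C_PT*V_int*f + I_DCP      (switched periphery plus static current)
    """
    i_array = m * i_act + m * (n - 1) * i_hld
    i_decode = (n + m) * c_de * v_int * f
    i_periphery = c_pt * v_int * f + i_dcp
    return i_array, i_decode, i_periphery


if __name__ == "__main__":
    # Hypothetical parameters (not from the lecture).
    parts = memory_active_current(m=256, n=256, i_act=20e-6, i_hld=10e-9,
                                  c_de=0.1e-12, c_pt=20e-12, v_int=1.5,
                                  f=100e6, i_dcp=1e-3)
    v_dd = 1.5
    for name, i in zip(("array", "decode", "periphery"), parts):
        print(f"{name:>9}: {i * 1e3:7.3f} mA -> {v_dd * i * 1e3:7.3f} mW")
```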
Divided Word Line
Divide the RAM cells in each row into blocks where the cells in each block are accessed by a local word line (LWL)
Only the memory cells in the activated block have their bit line pairs driven
- improves speed (by decreasing word line capacitance)
- lowers power dissipation (by decreasing the number of BL pairs activated)
[Figure: divided word line array with word lines WL_i and WL_i+1, local word lines LWL_i and LWL_i+1, local decoders (LD), block select line (BSL), RAM cells, bit lines BL_j to BL_j+m, and a row block]
Bit Line Segmentation
Divide the RAM cells in each column into blocks where each block has its own local bit line (LBL), so that only the memory cells in the activated block present a load on the bit line
- lowers power dissipation (by decreasing bit line capacitance)
  - e.g., from more than 1pF for a 16Kb DRAM to ~200fF for a 64Mbit DRAM
Row decoder logic also identifies the segment (SWL)
Has minimal effect on performance
[Figure: bit line BL_j with local bit lines LBL_i,j and LBL_i+n,j, segment word lines SWL_i,j and SWL_i+n,j, word line WL_i, and a switch to isolate each segment]
Glitch Reduction by Pipelining
Glitches depend on the logic depth of the circuit: gates deeper in the logic network are more prone to glitching
- arrival times of the gate inputs are more spread due to delay imbalances
- usually affected more by primary input switching
Reduce logic depth by adding pipeline registers
- additional energy used by the clock and pipeline registers
[Figure: five-stage pipeline (Fetch, Decode, Execute, Memory, WriteBack) with PC, Instruction, MAR, MDR, I$, D$, clk, and pipeline stage isolation registers]
Clock Gating
Most popular method for power reduction of clock signals and functional units
Gate off the clock to idle functional units
- e.g., floating point units
- need logic to generate the disable signal
  - increases complexity of control logic
  - consumes power
  - timing is critical to avoid clock glitches at the OR gate output
- additional gate delay on the clock signal
  - the gating OR gate can replace a buffer in the clock distribution tree
[Figure: functional unit between two registers, with the register clock produced by an OR of clock and disable]
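A very small waveform model can show what the gating OR gate does to the clock seen by the registers. In the sketch below, waveforms are lists sampled twice per clock cycle, and the function names and sample waveforms are assumptions made for illustration. The `disable` signal is deliberately changed only while `clk` is high, which is one way to keep the OR output free of extra edges.

```python
def gated_clock(clk, disable):
    """OR-gate clock gating: the output is held high while disable is 1."""
    return [c | d for c, d in zip(clk, disable)]


def rising_edges(wave):
    """Count 0 -> 1 transitions, i.e. the edges that clock a register."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)


if __name__ == "__main__":
    clk = [0, 1] * 6                                   # six clock cycles
    # disable rises and falls only at samples where clk is high
    disable = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
    gated = gated_clock(clk, disable)
    print("ungated edges:", rising_edges(clk))
    print("gated edges:  ", rising_edges(gated))
```

Every suppressed rising edge is a cycle in which the registers of the idle unit are not clocked, so their clock load does not switch.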
Clock Gating in a Pipelined Datapath
For idle units (e.g., floating point units in the Exec stage, the WB stage for instructions with no write back operation)
[Figure: five-stage pipeline (Fetch, Decode, Execute, Memory, WriteBack) with PC, Instruction, MAR, MDR, I$, D$, and clk gated by "No FP" and "No WB" signals]
Review: Dynamic Power as a Function of V_DD
Decreasing V_DD decreases dynamic energy consumption (quadratically)
But it increases gate delay (decreases performance)
[Figure: normalized t_p versus V_DD (V)]
So if multiple levels of V_DD are provided for use at run time, the clock frequency must also be adjusted.
Dynamic Frequency and Voltage Scaling
Always run at the lowest supply voltage that meets the timing constraints
- DFS (dynamic frequency scaling) saves only power (e.g., Intel’s SpeedStep)
- DVS (dynamic voltage scaling) + DFS saves both energy and power (e.g., Transmeta’s LongRun)
A DVS+DFS system requires the following
- A programmable clock generator (PLL)
  - PLL from 200MHz to 700MHz in increments of 33MHz
- A supply regulation loop that sets the minimum V_DD necessary for operation at the desired frequency
  - 32 levels of V_DD from 1.1V to 1.6V
- An operating system that sets the required frequency + supply voltage to meet the task completion deadlines
  - heavier load: ramp up V_DD, and when it is stable, speed up the clock
  - lighter load: slow down the clock, and when the PLL locks onto the new rate, ramp down V_DD
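The policy of running at the lowest setting that still meets the deadline can be sketched as below. The frequency and voltage ranges are the ones quoted on the slide, but the evenly spaced 16-entry frequency table, the linear frequency-to-V_DD mapping in `vdd_for`, and all function names are assumptions made for illustration; a real regulation loop would determine the minimum V_DD per frequency.

```python
F_MIN, F_MAX = 200e6, 700e6      # PLL range from the slide (Hz)
V_MIN, V_MAX = 1.1, 1.6          # V_DD range from the slide (V)
N_FREQ = 16                      # ASSUMPTION: evenly spaced settings (~33 MHz apart)
FREQS = [F_MIN + k * (F_MAX - F_MIN) / (N_FREQ - 1) for k in range(N_FREQ)]


def vdd_for(freq):
    """ASSUMPTION: the minimum usable V_DD rises linearly with frequency."""
    return V_MIN + (V_MAX - V_MIN) * (freq - F_MIN) / (F_MAX - F_MIN)


def pick_operating_point(cycles, deadline_s):
    """Lowest (frequency, V_DD) that finishes `cycles` within the deadline."""
    for freq in FREQS:
        if cycles / freq <= deadline_s:
            return freq, vdd_for(freq)
    return None                  # deadline cannot be met even at F_MAX


def relative_energy(v_dd):
    """Dynamic energy per cycle relative to V_MAX (E is proportional to V_DD^2)."""
    return (v_dd / V_MAX) ** 2


def transition_steps(f_old, f_new):
    """Order of operations from the slide when changing the operating point."""
    if f_new > f_old:
        return ["ramp up V_DD", "wait for V_DD to stabilize", "speed up clock"]
    if f_new < f_old:
        return ["slow down clock", "wait for PLL lock", "ramp down V_DD"]
    return []


if __name__ == "__main__":
    freq, v_dd = pick_operating_point(cycles=2.5e6, deadline_s=0.01)
    print(f"run at {freq / 1e6:.0f} MHz, {v_dd:.2f} V, "
          f"{100 * relative_energy(v_dd):.0f}% of the energy per cycle at V_MAX")
    print(transition_steps(F_MAX, freq))
```

Note that `relative_energy` does not depend on frequency, which is why DFS alone lowers power but not the energy of a fixed number of cycles in this model.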
Dynamic Thermal Management (DTM)
An example of DVS + DFS in action
Trigger mechanism: on-chip temperature sensors
- Based on differential voltage change across two diodes of different sizes
- Usually requires more than one sensor
- Hysteresis and delay are problems
When to begin responding?
- Trigger level set too high means higher packaging costs
- Trigger level set too low means frequent triggering and loss in performance
Choose the trigger level to exploit the difference between average and worst case power
DTM Initiation and Response Mechanisms
Operating system or micro-architectural initiation mechanism?
- Hardware support can reduce the performance penalty by 20-30%
Response mechanism: DVS+DFS
- Incurs some delay since an OS context switch is needed to set the new level of DVS + DFS
- Increasing the trigger level reduces the frequency of context switching to set DVS + DFS
The use of a thermal window (100Kcycles+) can help to “smooth” short thermal spikes
DTM Activation and Deactivation Cycle
[Figure: timeline with the events Trigger Reached, Turn Response On, Check Temp, Check Temp, and Turn Response Off, separated by the initiation, response, policy, and shutoff delays]
- Initiation Delay: OS interrupt/handler
- Response Delay: invocation time (adjust clock, V_DD)
- Policy Delay: number of cycles engaged
- Shutoff Delay: disabling time (re-adjust clock, V_DD)
[Figure: temperature over time against the DTM trigger level, with the cooling capacity required without DTM, the cooling capacity required with DTM, and the savings between them]
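The trigger, policy delay, and temperature check can be sketched as a small state machine. This is a simplified model under stated assumptions: temperatures arrive as a list of samples, the policy delay is counted in samples rather than cycles, and the initiation, response, and shutoff delays are ignored. The function name and the sample readings are hypothetical.

```python
def dtm_schedule(temps, trigger, policy_delay):
    """Return, per temperature sample, whether the DTM response is engaged.

    The response turns on when a sample reaches the trigger level, stays on
    for `policy_delay` samples, and then the temperature is checked again:
    still at or above the trigger -> stay engaged for another policy delay,
    otherwise turn the response off.
    """
    engaged = []
    on = False
    remaining = 0                      # samples left until the next check
    for t in temps:
        if on:
            remaining -= 1
            if remaining == 0:         # policy delay elapsed: check temperature
                if t >= trigger:
                    remaining = policy_delay
                else:
                    on = False
        elif t >= trigger:             # trigger reached
            on = True
            remaining = policy_delay
        engaged.append(on)
    return engaged


if __name__ == "__main__":
    readings = [80, 86, 84, 83, 82, 81]          # hypothetical sensor samples
    print(dtm_schedule(readings, trigger=85, policy_delay=3))
```

A longer policy delay means fewer checks and mode changes but keeps the slower clock and lower V_DD engaged for longer after the temperature has dropped.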
Speculated Power of a 15mm µP
Review: Variable V_T at Run Time
$V_T = V_{T0} + \gamma\left(\sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|}\right)$
where V_T0 is the threshold voltage at V_SB = 0, V_SB is the source-bulk (substrate) voltage, and γ is the body-effect coefficient
[Figure: V_T (V) versus V_SB (V)]
Reducing V_T increases the sub-threshold leakage current (exponentially)
But reducing V_T decreases gate delay (increases performance)
- For an n-channel device, the substrate is normally tied to ground (V_SB = 0)
- A negative bias on the substrate (bulk below source, so V_SB > 0 in the equation above) causes V_T to increase
- Adjusting the substrate bias at run time is called adaptive body-biasing (ABB) or dynamic threshold scaling (DTS)
  - Requires a triple well fab process
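The body-effect equation on this slide, V_T = V_T0 + γ(√|−2φ_F + V_SB| − √|−2φ_F|), can be evaluated directly. In the sketch below the function name and the default parameter values (V_T0 = 0.43 V, γ = 0.4 V^0.5, φ_F = −0.3 V) are assumptions chosen as plausible n-channel numbers for illustration; they are not stated in the lecture.

```python
import math


def threshold_voltage(v_sb, v_t0=0.43, gamma=0.4, phi_f=-0.3):
    """Threshold voltage (V) of an n-channel device with source-bulk bias v_sb.

    v_t0  : threshold voltage at V_SB = 0 (assumed value)
    gamma : body-effect coefficient in V^0.5 (assumed value)
    phi_f : Fermi potential in V (assumed value)
    """
    return v_t0 + gamma * (math.sqrt(abs(-2 * phi_f + v_sb))
                           - math.sqrt(abs(-2 * phi_f)))


if __name__ == "__main__":
    # Sweep the source-bulk voltage: pulling the bulk below the source
    # (V_SB > 0) raises V_T and so lowers sub-threshold leakage.
    for v_sb in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(f"V_SB = {v_sb:3.1f} V -> V_T = {threshold_voltage(v_sb):.3f} V")
```

The square-root dependence means each additional volt of bias buys less V_T shift than the previous one, and a smaller γ shrinks the whole range.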
DTS
DTS can accomplish a variety of goals
- Lower the leakage in standby mode by increasing V_T to its maximum value
- Compensate for threshold variations across the chip during normal operation
- Throttle the throughput (by increasing V_T) to lower both the active and leakage power based on performance requirements
Substrate biasing can be implemented on a complete chip, on a block-by-block basis, or on a cell-by-cell basis.
- Per-cell granularity of substrate biasing has an area cost
Unfortunately, the effectiveness of DTS is decreasing with technology scaling due to inherently lower body-effect factors
[Figure: circuit with substrate bias voltages V_SB,p and V_SB,n]
Reducing Power in Standby (Sleep) Mode
For idle components, all power dissipation is due to leakage
Can reduce leakage by DTS
Or can reduce leakage by gating the supply rails when the circuit is in sleep mode
- in normal mode, sleep = 1 and the sleep transistors must present as small a resistance as possible (via sizing)
- in sleep mode, sleep = 0 and the transistor stack effect reduces leakage by orders of magnitude
Or can eliminate leakage by switching off the power supply (but lose the memory state)
[Figure: logic block between a virtual V_DD rail and a virtual GND rail, connected to V_DD and ground through sleep transistors driven by !sleep and sleep]
Reducing Standby Power in Memories
Leakage in memory arrays is becoming a major issue
- the leakage increase from 0.18 µm to 0.13 µm is a factor of almost 7
[Figure: I_leakage (A) versus V_DD, 0.13 µm]
Techniques to control memory array leakage
- turn off unused banks by switching off the power supply
- apply DTS to non-active cells (maintains state)
  - memory cannot be accessed at speed when running on the lower V_T
- exploit transistor stacking (maintains state)
- lower the supply voltage (maintains state)
  - memory cannot be accessed when running on the lower supply
Leakage Controlled SRAM Cell Alternatives
- Asymmetric SRAM cell
- Gated-GND SRAM cell (gate control, virtual GND)
- Drowsy SRAM cell (V_DD (1V) or V_DD Low (0.3V), selected by !drowsy and drowsy)
Hardware versus software control of “mode”
[Figure: the three cell schematics, with labels for stored values 0 and 1, cell state preserved, cell leakage, and bit line leakage]
Leakage Controlled SRAM Savings and “Costs”
[Table: bits, 70 nm, 1 ns cycle]
Leakage Controlled Cache
Microarchitecture to prevent accessing drowsy lines
- Set: drowsy
- Reset: active
[Figure: drowsy cache line with row decoder, word line drivers, word line gate, word line, a Set/Reset latch (Q, !Q) driven by Global Set and Reset, a power line switched between 0.3V (drowsy) and 1V (active), and the SRAMs]
Hardware Controlled Drowsy Cache
Put cache lines into a low-power mode periodically, independent of the access history
- Periodic global set counter (~4000 cycles has a good E-D trade-off) asserts the drowsy signal
  - don’t need counters/predictor states for each line
Cache energy reduction
- standby energy by 71% to 76%
- total energy by 54% to 58%
Run time increase
- 0.41%
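The periodic policy can be sketched with a toy simulation: a single global counter sets every line drowsy once per window, and an access to a drowsy line resets it to active. The function name, the trace format (one line index or `None` per cycle), and the tiny window in the usage example are assumptions for illustration only; the wake-up latency that causes the run time increase is counted but not modeled in time.

```python
def simulate_drowsy(accesses, num_lines, window):
    """Simulate the periodic global-set drowsy policy.

    accesses : per-cycle trace, each entry a line index or None (no access)
    window   : period of the global set counter, in cycles
    Returns (fraction of line-cycles spent drowsy, number of wake-ups).
    """
    drowsy = [False] * num_lines
    drowsy_line_cycles = 0
    wakeups = 0
    for cycle, line in enumerate(accesses):
        if cycle > 0 and cycle % window == 0:
            drowsy = [True] * num_lines       # global set: all lines drowsy
        if line is not None and drowsy[line]:
            drowsy[line] = False              # reset: wake the accessed line
            wakeups += 1
        drowsy_line_cycles += sum(drowsy)
    return drowsy_line_cycles / (len(accesses) * num_lines), wakeups


if __name__ == "__main__":
    trace = [0, 0, 0, 0, 1, None, 0, None]    # hypothetical access trace
    fraction, wakeups = simulate_drowsy(trace, num_lines=2, window=4)
    print(f"drowsy fraction: {fraction:.3f}, wake-ups: {wakeups}")
```

Only one counter is needed for the whole cache, which matches the slide's point that no per-line counters or predictor state are required.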
Next Lecture and Reminders
Next lecture
- System level interconnect
  - Reading assignment: Rabaey, et al, Chapter 9
Reminders
- Project final reports due on-line by 5:00pm on Friday, December 5th
- Final grading negotiations/correction (except for the final exam) must be concluded by December 10th
- Final exam scheduled
  - Tuesday, December 16th from 10:10 to noon in 118 and 113 Thomas