Presentation is loading. Please wait.

Presentation is loading. Please wait.

CSE477 L26 System Power.1Irwin&Vijay, PSU, 2003 CSE477 VLSI Digital Circuits Fall 2003 Lecture 26: Low Power Techniques in Microarchitectures and Memories.

Similar presentations


Presentation on theme: "CSE477 L26 System Power.1Irwin&Vijay, PSU, 2003 CSE477 VLSI Digital Circuits Fall 2003 Lecture 26: Low Power Techniques in Microarchitectures and Memories."— Presentation transcript:

1 CSE477 L26 System Power.1Irwin&Vijay, PSU, 2003 CSE477 VLSI Digital Circuits Fall 2003 Lecture 26: Low Power Techniques in Microarchitectures and Memories Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg477www.cse.psu.edu/~mji www.cse.psu.edu/~cg477 [Adapted from Rabaey’s Digital Integrated Circuits, Second Edition, ©2003 J. Rabaey, A. Chandrakasan, B. Nikolic]

2 CSE477 L26 System Power.2Irwin&Vijay, PSU, 2003 Review: CMOS Energy & Power Equations E = C L V DD 2 P 0  1 + t sc V DD I peak P 0/1  1/0 + V DD I leak P = C L V DD 2 f + t sc V DD I peak f + V DD I leak f = P * f clock Dynamic power (~90% today and decreasing relatively) Short-circuit power (~8% today and decreasing absolutely) Leakage power (~2% today and increasing)

3 CSE477 L26 System Power.3Irwin&Vijay, PSU, 2003 Power and Energy Design Space Constant Throughput/Latency Variable Throughput/Latency EnergyDesign TimeNon-active ModulesRun Time Active (Dynamic) Logic design Reduced V dd TSizing Multi-V dd Clock Gating DFS, DVS (Dynamic Freq, Voltage Scaling) Leakage (Standby) Multi-V T Stack effect Pin ordering Sleep Transistors Multi-V dd Variable V T Input control Variable V T

4 CSE477 L26 System Power.4Irwin&Vijay, PSU, 2003 Reducing Power and Energy of Interconnects  Share long data buses with time multiplexing (S 1 uses even cycles, S 2 odd) S2S2 S1S1 D1D1 D2D2 S1S1 S2S2 D2D2 D1D1  Buses are a significant source of power dissipation due to high switching activities and large capacitive loading l 15% of total power in Alpha 21064 l 30% of total power in Intel 80386  But what if data samples are correlated (e.g., sign bits)?

5 CSE477 L26 System Power.5Irwin&Vijay, PSU, 2003 Bus Multiplexing and Correlated Data Streams Bit position MSB LSB Bit switching probabilities  For a shared (multiplexed) bus advantages of data correlation are lost (bus carries samples from two uncorrelated data streams) l Bus sharing should not be used for positively correlated data streams l Bus sharing may prove advantageous in a negatively correlated data stream (where successive samples switch sign bits) - more random switching

6 CSE477 L26 System Power.6Irwin&Vijay, PSU, 2003 Reducing Power and Energy of Memories  Active power in memory of m columns and n rows P = V DD I DD where I DD = I array + I decode + I periphery = [mi act + m(n-1)i hld ] + [(n+m)C DE V int f] + [C PT V int f + I DCP ] l As expected, it is proportional to the size of the memory and is typically dominated by the array  Partition the memory array into multiple smaller banks (see L23.11) so that only the addressed bank is activated l improves speed and lowers power -word line and bit line capacitances are reduced -number of bit cells activated reduced l At some point the delay and power overhead associated with the bank decoding circuit dominates (2 to 8 banks typical)

7 CSE477 L26 System Power.7Irwin&Vijay, PSU, 2003 Divided Word Line  Divide RAM cells in each row into blocks where the cells in each block are accessed by a local word line (LWL)  Only the memory cells in the activated block have their bit line pairs driven l improves speed (by decreasing word line capacitance) l lowers power dissipation (by decreasing the number of BL pairs activated) BSL LD WL i WL i+1 LWL i LWL i+1 Local decoder Block select line RAM cell BL j BL j+1 BL j+m Row block

8 CSE477 L26 System Power.8Irwin&Vijay, PSU, 2003 Bit Line Segmentation  Divide RAM cells in each column into blocks where each block has its own local bit line (LBL) - only the memory cells in the activated block present a load on the bit line l lowers power dissipation (by decreasing bit line capacitance) -e.g., from more than 1pF for a 16Kb DRAM to ~200fF for a 64Mbit DRAM Switch to isolate segment LBL i+n,j LBL i,j BL j WL i SWL i+n,j SWL i,j  Row decoder logic also identifies the segment (SWL)  Has minimal effect on performance

9 CSE477 L26 System Power.9Irwin&Vijay, PSU, 2003 Glitch Reduction by Pipelining  Glitches depend on the logic depth of the circuit - gates deeper in the logic network are more prone to glitching l arrival times of the gate inputs are more spread due to delay imbalances l usually affected more by primary input switching  Reduce logic depth by adding pipeline registers l additional energy used by the clock and pipeline registers PC FetchDecodeExecuteMemoryWriteBack Instruction MAR MDR I$D$ clk pipeline stage isolation register

10 CSE477 L26 System Power.10Irwin&Vijay, PSU, 2003 Power and Energy Design Space Constant Throughput/Latency Variable Throughput/Latency EnergyDesign TimeNon-active ModulesRun Time Active (Dynamic) Logic design Reduced V dd TSizing Multi-V dd Clock Gating DFS, DVS (Dynamic Freq, Voltage Scaling) Leakage (Standby) Multi-V T Stack effect Pin ordering Sleep Transistors Multi-V dd Variable V T Input control Variable V T

11 CSE477 L26 System Power.11Irwin&Vijay, PSU, 2003 Clock Gating  Gate off clock to idle functional units l e.g., floating point units l need logic to generate disable signal -increases complexity of control logic -consumes power -timing critical to avoid clock glitches at OR gate output l additional gate delay on clock signal -gating OR gate can replace a buffer in the clock distribution tree  Most popular method for power reduction of clock signals and functional units RegReg clock disable Functional unit

12 CSE477 L26 System Power.12Irwin&Vijay, PSU, 2003 Clock Gating in a Pipelined Datapath  For idle units (e.g., floating point units in Exec stage, WB stage for instructions with no write back operation) PC FetchDecodeExecuteMemoryWriteBack Instruction MAR MDR I$D$ clk No FPNo WB

13 CSE477 L26 System Power.13Irwin&Vijay, PSU, 2003 Power and Energy Design Space Constant Throughput/Latency Variable Throughput/Latency EnergyDesign TimeNon-active ModulesRun Time Active (Dynamic) Logic design Reduced V dd TSizing Multi-V dd Clock Gating DFS, DVS (Dynamic Freq, Voltage Scaling) Leakage (Standby) Multi-V T Stack effect Pin ordering Sleep Transistors Multi-V dd Variable V T Input control Variable V T

14 CSE477 L26 System Power.14Irwin&Vijay, PSU, 2003 Review: Dynamic Power as a Function of V DD  Decreasing the V DD decreases dynamic energy consumption (quadratically)  But, increases gate delay (decreases performance) V DD (V) t p(normalized)  So if multiple levels of V DD are provided for use at run time, the clock frequency must also be adjusted.

15 CSE477 L26 System Power.15Irwin&Vijay, PSU, 2003 Dynamic Frequency and Voltage Scaling  Always run at the lowest supply voltage that meets the timing constraints l DFS (dynamic frequency scaling) saves only power (e.g., Intel’s SpeedStep) l DVS (dynamic voltage scaling) + DFS saves both energy and power (e.g., Transmeta’s LongRun)  A DVS+DFS system requires the following l A programmable clock generator (PLL) -PLL from 200MHz  700MHz in increments of 33MHz l A supply regulation loop that sets the minimum V DD necessary for operation at the desired frequency -32 levels of V DD from 1.1V to 1.6V l An operating system that sets the required frequency + supply voltage to meet the task completion deadlines -heavier load  ramp up V DD, when stable speed up clock -lighter load  slow down clock, when PLL locks onto new rate, ramp down V DD

16 CSE477 L26 System Power.16Irwin&Vijay, PSU, 2003 Dynamic Thermal Management (DTM)  Trigger mechanism: on- chip temperature sensors l Based on differential voltage change across two diodes of different sizes l Usually requires more than one sensor l Hysteresis and delay are problems  When to begin responding? l Trigger level set too high means higher packaging costs l Trigger level set too low means frequent triggering and loss in performance  Choose trigger level to exploit difference between average and worst case power  An example of DVS + DFS in action

17 CSE477 L26 System Power.17Irwin&Vijay, PSU, 2003 DTM Initiation and Response Mechanisms  Operating system or micro-architectural initiation mechanism? l Hardware support can reduce the performance penalty by 20-30%  Response mechanism – DVS+DFS l Incurs some delay since there is a OS context switch needed to set the new level of DVS + DFS l Increasing the trigger level reduces the frequency of context switching to set DVS + DFS  The use of a thermal window (100Kcycles+) can help to “smooth” short thermal spikes

18 CSE477 L26 System Power.18Irwin&Vijay, PSU, 2003 DTM Activation and Deactivation Cycle Trigger Reached Turn Response On Initiation Delay  Initiation Delay – OS interrupt/handler  Response Delay – Invocation time (adjust clock, V DD ) Response Delay Policy Delay Check Temp  Policy Delay – Number of cycles engaged Check Temp Shutoff Delay Turn Response Off  Shutoff Delay – Disabling time (re-adjust clock, V DD ) temperature DTM trigger level Cooling capacity without DTM Cooling capacity with DTM savings

19 CSE477 L26 System Power.19Irwin&Vijay, PSU, 2003 Power and Energy Design Space Constant Throughput/Latency Variable Throughput/Latency EnergyDesign TimeNon-active ModulesRun Time Active (Dynamic) Logic design Reduced V dd TSizing Multi-V dd Clock Gating DFS, DVS (Dynamic Freq, Voltage Scaling) Leakage (Standby) Multi-V T Stack effect Pin ordering Sleep Transistors Multi-V dd Variable V T Input control Variable V T

20 CSE477 L26 System Power.20Irwin&Vijay, PSU, 2003 Speculated Power of a 15mm  P

21 CSE477 L26 System Power.21Irwin&Vijay, PSU, 2003 Review: Variable V T at Run Time  Reducing the V T increases the sub-threshold leakage current (exponentially) V T = V T0 +  (  |-2  F + V SB | -  |-2  F |) where V T0 is the threshold voltage at V SB = 0, V SB is the source- bulk (substrate) voltage,  is the body-effect coefficient V SB (V) V T (V)  But, reducing V T decreases gate delay (increases performance) l For an n-channel device, the substrate is normally tied to ground (V SB = 0) l A negative bias on V SB causes V T to increase l Adjusting the substrate bias at run time is called adaptive body- biasing (ABB) or dynamic threshold scaling (DTS) -Requires a triple well fab process

22 CSE477 L26 System Power.22Irwin&Vijay, PSU, 2003 DTS  DTS can accomplish a variety of goals l Lower the leakage in standby mode by increasing V T to its maximum value l Compensate for threshold variations across the chip during normal operation l Throttle the throughput (by increasing V T ) to lower both the active and leakage power based on performance requirements  Substrate biasing can be implemented on a complete chip, on a block-by-block basis, or on a cell-by-cell basis. l Per-cell granularity of substrate biasing has an area cost  Unfortunately, the effectiveness of DTS is decreasing with technology scaling due to inherently lower body- effect factors V SB,p V SB,n

23 CSE477 L26 System Power.23Irwin&Vijay, PSU, 2003 Power and Energy Design Space Constant Throughput/Latency Variable Throughput/Latency EnergyDesign TimeNon-active ModulesRun Time Active (Dynamic) Logic design Reduced V dd TSizing Multi-V dd Clock Gating DFS, DVS (Dynamic Freq, Voltage Scaling) Leakage (Standby) Multi-V T Stack effect Pin ordering Sleep Transistors Multi-V dd Variable V T Input control Variable V T

24 CSE477 L26 System Power.24Irwin&Vijay, PSU, 2003 Reducing Power in Standby (Sleep) Mode  For idle components, all power dissipation is due to leakage  Can reduce leakage by DTS  Or can reduce leakage by gating the supply rails when the circuit is in sleep mode l in normal mode, sleep = 1 and the sleep transistors must present as small a resistance as possible (via sizing) l in sleep mode, sleep = 0, the transistor stack effect reduces leakage by orders of magnitude Virtual V DD Virtual GND V DD !sleep sleep  Or can eliminate leakage by switching off the power supply (but lose the memory state)

25 CSE477 L26 System Power.25Irwin&Vijay, PSU, 2003 Reducing Standby Power in Memories  Leakage in memory arrays is becoming a major issue l leakage increase from 0.18  m to 0.13  m is a factor of almost 7  Techniques to control memory array leakage l turn off unused banks by switching off the power supply l apply DTS to non-active cells (maintains state) -memory cannot be accessed at speed when running on the lower V T l exploit transistor stacking (maintains state) V DD I leakage (A) 0.13  m l lower the supply voltage (maintains state) -memory cannot be access when running on the lower supply

26 CSE477 L26 System Power.26Irwin&Vijay, PSU, 2003 Leakage Controlled SRAM Cell Alternatives 0 1 Asymmetric SRAM Cell Gate control Virtual GND Gated-GND SRAM Cell 0 1  Cell state preserved  Hardware versus software control of “mode” V DD (1V) V DD Low (.3V) Drowsy SRAM Cell !drowsy drowsy Cell Leakage Bit line leakage

27 CSE477 L26 System Power.27Irwin&Vijay, PSU, 2003 Leakage Controlled SRAM Savings and “Costs” 134.09 256 bits, 70 nm, 1 ns cycle

28 CSE477 L26 System Power.28Irwin&Vijay, PSU, 2003 Leakage Controlled Cache Microarchitecture to prevent accessing drowsy lines word line word line drivers row decoder Reset Global Set !Q Q 0.3V (drowsy) 1V (active) word line power line SRAMs wordline gate Set: drowsy Reset: active

29 CSE477 L26 System Power.29Irwin&Vijay, PSU, 2003 Hardware Controlled Drowsy Cache  Cache energy reduction l standby energy by 71% to 76% l total energy by 54% to 58%  Run time increase l 0.41%  Put cache lines into a low-power mode periodically independent of the access history l Periodic global set counter (~4000 cycles has good E-D trade-off) asserts drowsy signal -don’t need counters/predictor states for each line

30 CSE477 L26 System Power.30Irwin&Vijay, PSU, 2003 Next Lecture and Reminders  Next lecture l System level interconnect -Reading assignment – Rabaey, et al, Chapter 9  Reminders l Project final reports due on-line by 5:00pm on Friday, December 5 th l Final grading negotiations/correction (except for the final exam) must be concluded by December 10 th l Final exam scheduled -Tuesday, December 16 th from 10:10 to noon in 118 and 113 Thomas


Download ppt "CSE477 L26 System Power.1Irwin&Vijay, PSU, 2003 CSE477 VLSI Digital Circuits Fall 2003 Lecture 26: Low Power Techniques in Microarchitectures and Memories."

Similar presentations


Ads by Google