Download presentation
Presentation is loading. Please wait.
1
C. H. Ziesler et al., 2003 Energy Recovery Design for Low-Power ASICs Conrad H. Ziesler 1 Joohee Kim 1 Suhwan Kim 2 Marios C. Papaefthymiou 1 1 Advanced Computer Architecture Laboratory University of Michigan, Ann Arbor 2 T. J. Watson Research Center IBM Research, Yorktown Heights
2
C. H. Ziesler et al., 2003
3
Tutorial Outline Introduction to energy recovery Application of energy recovery to SOC design ● Fine-grain dynamic pipelines ● Finite state machines ● Memory arrays ● Multi-GHz clocking
4
C. H. Ziesler et al., 2003 Introduction to Energy Recovery ● Power dissipation in static CMOS design ● Energy recovery operation ● Implementation issues ● Quick glance at history
5
C. H. Ziesler et al., 2003 Static CMOS Power ● Leakage Power ● Crowbar Power ● Active Power
6
C. H. Ziesler et al., 2003 Static CMOS Active Power ● Active power to transfer charge to/from output capacitor – Energy stored in capacitor when charged to voltage V is ½ CV 2 – Energy dissipated in R1 to charge capacitor is ½ CV 2 – To discharge capacitor, all energy stored in it is dissipated in R2 ● Note: Voltage supply level fixed
7
C. H. Ziesler et al., 2003 What if we had a time-varying power-supply? ● Turn on switch when time- varying voltage source is at same level as the voltage on output capacitor. ● Voltage on output capacitor slews up or down at same rate as voltage source. ● Most of the energy stored in capacitor is returned to time- varying power-supply.
8
C. H. Ziesler et al., 2003 Model Simplification ● Use same switch for charging and discharging currents (if it can conduct in both directions). ● Time-varying voltage source Vpc, called power-clock, provides energy and synchronization. ● Textbooks sometimes draw ideal energy recovery model using current source instead of voltage source. Principal of operation is the same.
9
C. H. Ziesler et al., 2003 Example: Linear Ramp time Vpc IRIR VRVR PRPR E R = V R I R Vdd VCVC
10
C. H. Ziesler et al., 2003 Energy Dissipation ● Integrate charging or discharging current through resistor R ● Dissipation for either charging or discharging is approximately (RC/T) CV 2 – T is rise time of Vpc (linear ramp) If T >> RC delay and recovered energy is efficiently recycled, power savings can be substantial.
11
C. H. Ziesler et al., 2003 Implementation Issues ● How is power-clock generated? ● What capacitances should one try to recover energy from? ● How does the timing work if power and clock are intermingled? ?????
12
C. H. Ziesler et al., 2003 Power-Clock Generation Challenges ● Need to drive a large mostly capacitive load with a controlled rise/fall time waveform. ● Capacitive load may change as different gates switch on or off. ● Must do so with much less than CV 2 dissipation, otherwise gains are lost.
13
C. H. Ziesler et al., 2003 Where to recover energy from? ● Power-clock is time-varying – Synchronize energy recovery circuits with power-clock. ● Energy recovery requires small RC delay with respect to rise/fall time T – Aim at small RC/T ratio. ● Energy recovery saves on active switching power – Does not necessarily make sense to apply it on low-activity loads.
14
C. H. Ziesler et al., 2003 Good Candidates ● Large capacitive loads that frequently switch – Clock networks – Memory bit lines – LCD row/column drivers ● High activity loads that are synchronous to power-clock – Dynamic pipelined logic
15
C. H. Ziesler et al., 2003 Quick Glance at History ● Physics (1970s) ● Logical reversibility of computation ● Connection to thermodynamics (“adiabatic” computing) ● No absolute minimum to energy dissipation, if computing is arbitrarily slow. (Does not have to be slow, however.) ● Engineering (1990s) ● Logic circuitry ● Energy recycling circuitry ● VLSI prototyping
16
C. H. Ziesler et al., 2003 ● The case for reversible computation P. Solomon and D. Frank – IWLPD'94 ● Asymptotically zero-energy split-level charge recovery logic S.G. Younis and T. F. Knight, Jr. – IWLPD’94 ● Clock-powered CMOS: A hybrid adiabatic logic style for energy-efficient computing N. Tzartzanis and W. C. Athas – ARVLSI'99 And many, many others.... Sample Design Points from’90s
17
C. H. Ziesler et al., 2003 Split-Level Charge-Recovery Logic input /P P output Initially, and / at Vdd/2, P at Gnd, and /P at Vdd. On valid input, the pass gate is turned on by gradually swinging P and /P. Rails and / “split”, gradually swinging to Vdd and Gnd. As soon as output is sampled, pass gate is turned off. Internal node is restored by gradually swinging and / back to Vdd/2. When is the gate output restored? P internal node
18
C. H. Ziesler et al., 2003 Reversible Pipelines Gate outputs are restored using a reverse pipeline whose elements perform the inverse function of the forward pipeline Multiple phases required Split-level charge recovery logic block input EFGH P1P1 P P2P2 E -1 F -1 G -1 H -1 /P 1 /P 2 P2P2 P1P1 output node set by E, restored by F -1...
19
C. H. Ziesler et al., 2003 The Remainder of this Tutorial ● Application of energy recovery to SOC Modules – Low switching activity state-machines or datapaths – Fine-grain pipelines with high switching activity – Multi-gigahertz class design – Memory and memory-like arrays – Interconnect and I/O Defining Characteristics ● Low overhead, fast operation, no reversibility ● Energy recovering chips that work and save power over conventional operation!
20
C. H. Ziesler et al., 2003 Targeting Energy Recovery to ASICs and SOC Modules Module characterization – Throughput requirements ● How much time is availible to compute? ● Is the application easily pipelined? – Expected switching activity ● Can it be easily reduced? – Memory, computation, or control? – Clocking requirements ● Fixed frequency? Range of frequencies?
21
C. H. Ziesler et al., 2003 Low Switching Activity Finite-State Machines or Datapaths
22
C. H. Ziesler et al., 2003 Low Switching Activity Modules ● Best place to focus efforts is on clock tree and flip-flops – Switching activity is low everywhere else – Every capacitance connected to the clock switches every cycle ● Consider applying resonant clocking system
23
C. H. Ziesler et al., 2003 Sample Dissipation Breakdown ● Focus on clock dissipation — 20—50% of total power ● Apply energy recovery — Single-phase clock node ● Target design automation — Automate energy recovery
24
C. H. Ziesler et al., 2003 Resonant Clock ASIC ● Compatible with ASIC flow ● Synthesized by Conrad Ziesler and Joohee Kim using in-house standard-cells library and commercial tools ● Energy recovering clock tree and SRAM word/bit lines ● Low-cost bulk CMOS process: TSMC 0.25 m, 108-pin PGA package, through MOSIS ● High frequency (300MHz) ● Low voltage (1.0-1.5V)
25
C. H. Ziesler et al., 2003 ASIC Statistics ● Discrete wavelet transform (DWT) ● 3,897 gates, 413 ffs ● 15,571 transistors ● 400um x 900um ● 13.6 pF, 21 nH ● 300 MHz, 1.5V ● 0.25um logic process Dual-mode DWT Clock generator
26
C. H. Ziesler et al., 2003 Chip Microphotograph
27
C. H. Ziesler et al., 2003 System Overview...
28
C. H. Ziesler et al., 2003 The Energy Recovering Flip-Flop ● Clock signal: Single-phase resonant sinusoid ● Probe activates state element only if next state differs from present state. ● Low voltage operation at high speeds ● Delay similar to conventional flip-flops ● Fully compatible with standard-cell ASIC flow 14 transistors 84 m 2 probestate element
29
C. H. Ziesler et al., 2003 Flip-Flop Power Characterization Order of magnitude difference between idle (D, Q constant) and active (D, Q changing) dissipation. frequency (MHz) Flip-Flip Energy per Cycle 1000 100 10 1 energy (fJ) 200 300 500
30
C. H. Ziesler et al., 2003 Resonant Clock Generator ● Resonate entire clock capacitance with small inductor ● Pump resonant system with NMOS switch at appropriate times ● NMOS switch only conducts incremental losses whenever “on” NMOS Switch Control Pre-driver Driver
31
C. H. Ziesler et al., 2003 Clock Generator Operation
32
C. H. Ziesler et al., 2003 Our Contributions ● Application of energy recovery to ASIC clock network – Fully synthesized ASIC – Resonant LC tank forms power-clock – On-chip power-clock generator w/ off-chip inductor ● Fabrication in 0.25 m standard CMOS process – Compare energy recovering clocking system with conventional clocking system ● Direct DC power measurements for complete system – Correct operation, 100—300MHz – 20—50% savings in total power consumption – 80—90% savings in clock power
33
C. H. Ziesler et al., 2003 Recovering vs. Conventional Hardware ● Synthesized dual-mode ASIC — Conventional — Energy recovery ● Dual-mode flip-flop cell — Conventional clock tree with conventional flip-flop — Resonant clock tree with energy- recovering flip-flop ● Direct comparison of dissipation at target throughput using identical hardware structures
34
C. H. Ziesler et al., 2003 Correct Function signature output (conventional) signature output (energy recovery) signature output (Verilog simulator)
35
C. H. Ziesler et al., 2003 Summary Energy recovery technologies for reducing clock dissipation ● Single-phase sinusoidal clock ● Efficient, LC resonant clock generator ● Low power sinusoidally clocked flip-flop Key attributes ● Compatible with ASIC design flow, low overhead ● High frequency (100—300MHz) ● Low voltage (1—1.5V) ● Real, working chips (in 0.25 m logic process)
36
C. H. Ziesler et al., 2003 Break
37
C. H. Ziesler et al., 2003 Fine-Grain Pipelines with High Switching Activity
38
C. H. Ziesler et al., 2003 High Switching Activity Pipelines Problem Parameters ● Can't reduce switching activity ● Need lots of throughput ● Can easily do fine-grained pipelining Solution ● Dynamic energy recovering logic – True single-phase source-coupled adiabatic logic (SCAL)
39
C. H. Ziesler et al., 2003 SCAL-D Logic Family ● Dynamic logic family that works with simple sinusoidal LC resonant clock ● Alternate PMOS and NMOS type gates like NMOS/PMOS ''zipper'' domino logic ● Test chip: 200MHz 8-bit multiplier implemented in 0.5 m bulk silicon process
40
C. H. Ziesler et al., 2003 SCAL-D Topology ● Sense amplifier ● Precharge diodes ● Current switches ● Evaluation tree ● Current tail ● Power-clock Vss bias at bt af bf of ot NMOS NAND gate Power clock ● Single-phase sinusoidal power-clock ● Minimum-size low-swing evaluation tee ● Built-in state element ● Dual-rail noise-tolerant design
41
C. H. Ziesler et al., 2003 SCAL-D Operation NMOS Precharge Phase ● rising edge of power-clock ● charge transferred to load
42
C. H. Ziesler et al., 2003 SCAL-D Operation NMOS Evaluate Phase ● peak of power-clock ● non-adiabatic evaluation current ● current purposefully limited
43
C. H. Ziesler et al., 2003 SCAL-D Operation NMOS Sense Phase ● falling edge of power clock ● sense amplifiers drive load
44
C. H. Ziesler et al., 2003 SCAL-D Operation NMOS Hold Phase ● negative peak of power clock ● next pipeline stage samples outputs
45
C. H. Ziesler et al., 2003 SCAL-D Implicit Pipelining ● Free pipelining : no flip-flops needed Example: Pipelined "and—full adder" cell from array multiplier SCAL-D: 85 transistors Static CMOS: 100 transistors
46
C. H. Ziesler et al., 2003 ● Minimalist approach ● Simple tools: magic and spice ● Low-cost standard CMOS process: HP 0.5um, 40-pin DIP package, through MOSIS. ● Operational chip demonstrates practicality of energy-recovering circuit design ● Non-trivial size (8-bit operands, on-chip clock, self-test) ● High throughput ● Low energy dissipation Multiplier Chip ● Suhwan Kim (while still at UM) and Conrad Ziesler received First Prize in VLSI Design Contest, DAC 2001.
47
C. H. Ziesler et al., 2003 Chip Microphotograph
48
C. H. Ziesler et al., 2003 Test Chip Overview ● Two multipliers with self-test per chip (minimum size die) ● Integrated power-clock generator ● Resonant LC oscillator
49
C. H. Ziesler et al., 2003 Input BILBO (self-test) Product array Multiplicand buffers Result summation Result buffers Self-test Control Output BILBO (self-test) Multiplier and Self-Test ● 9,048 devices in multiplier array, 2,806 devices in self-test circuitry ● Implemented entirely in energy-recovering dynamic logic family.
50
C. H. Ziesler et al., 2003 Energy Comparison 2.7V 1.9V 2.2V 1.9V 1.6V 2.9V 50 100200 Frequency (MHz) 100 200 300 400 500 Dissipation per Cycle (pJ) 0 140 Energy recovery 4 stage static CMOS 2.3V 3.0V 2.0V 8 stage static CMOS 2 stage static CMOS 4x
51
C. H. Ziesler et al., 2003 ● Zero-voltage switching ● LC- resonant clock generation ● External/bondwire inductor L ● Resistive/capacitive adiabatic load ● Compact: 170 x 115 um Single-Phase Power-Clock Generator Power Clock Pulse Gen Ring Osc Gate Drive Power Switches: S1, S2 25 tr. 19 tr. 10 tr. Ring Osc. Pulse Gen. Gate Drive _+_+ _+_+ load S2 S1 PC L Vdd Vss Vbp Vbn
52
C. H. Ziesler et al., 2003 Switch Timings ● Inductor current builds linearly when switches are on. ● Peak switch current less than peak inductor current. ● Switch S1 turned on at positive voltage peak. ● Switch S2 turned on at negative voltage peak. ● Fixed ''on-window'' controlled by pulse generator. Inductor current Output voltage
53
C. H. Ziesler et al., 2003 ● Single-phase sinusoidal waveform @140MHz ● ~60pF load, ~10nH external inductor ● One DC supply (Vdd, Vss), two DC biases (NMOS, PMOS) Power-Clock Waveform Vdd Vss PMOS Bias NMOS Bias Power-Clock
54
C. H. Ziesler et al., 2003 Multi-GHz Clocking
55
C. H. Ziesler et al., 2003 Multi-Gigahertz Designs ● Need speed more than anything ● Clock distribution and skew biggest problems ● Retiming desirable for performance ● Need multiple phases of clock ● Clock power dominates
56
C. H. Ziesler et al., 2003 Rotary Clock TM Principles ● Consider MultiGig’s Rotary-Clock network (related slides courtesy of John Wood, MultiGig Inc.) ● Multiple transmission line loops arranged in grid ● Each loop supports a square-wave oscillation in lock step with neighbors ● Small variations between loops average/cancel out ● Ultra-high frequency low-skew clocking
57
C. H. Ziesler et al., 2003 Multi-Ring Visualization Phase lock at junctions without PLL
58
C. H. Ziesler et al., 2003 Numerical Example – Large Chip Process: 0.18u CMOS 1.8v Size: 15 x 15 mm Global clock: 2.5 GHz FFs: 2 million, Total capacitance: 10,000pF Metal: Width 40u, Spacing 40u, Thickness 1.5u Copper x 2 Active Area: < 1% Grid X pitch: 1300u, Grid Y pitch: 1300u Power CV 2 F: 78 W Rotary Power: 6 W
59
C. H. Ziesler et al., 2003 Chip Microphotograph
60
C. H. Ziesler et al., 2003 Test Chip Measurements Power: 75% less than CV 2 F of clock capacitance
61
C. H. Ziesler et al., 2003 Test chip #2 Switched-capacitor tuning (100MHz +/- 35% measured) Test chip #3 – quad ring 3.5 GHz Tunable (varactor +/- 10%) Jitter – below measurable levels Later Test Chips
62
C. H. Ziesler et al., 2003 Benefits of Rotary Clock Architecture ● Scalable in size and frequency. ● Reduces dynamic clock power. ● Guaranteed near-zero skew. ● Precise skew scheduling possible. ● Negligible jitter. ● Inherently low noise ● Tolerant to process, temperature, and supply variation.
63
C. H. Ziesler et al., 2003 Memories and Array-Like Structures
64
C. H. Ziesler et al., 2003 Memory and Array Structures ● Many heavily loaded wires all switching ● Already optimized algorithms to reduce number of accesses ● Already using low-power sense-amplifiers ● Consider energy recovery on the bit wires
65
C. H. Ziesler et al., 2003 Breakdown of CPU Dissipation strongARM, JSSC 1996 Memory area: 60% of CPU
66
C. H. Ziesler et al., 2003 Memory Power Long bit/word lines with large capacitance high power consumption 256x256 array
67
C. H. Ziesler et al., 2003 Energy Recovery Driver Power source Q 0 - V DD C Sinusoidal power-clock Synchronization ? — Correctness — Efficiency E = ( RC / t ) CV DD 2 ER Driver
68
C. H. Ziesler et al., 2003 Outline Energy recovering driver — Synchronization — Feedback Energy recovering SRAM — Operation Simulation of full-custom ERSRAM Voltage scaling behavior
69
C. H. Ziesler et al., 2003 One Extreme: Fully Gradual Transition Maximum power efficiency Relatively slow operation ON PC driver output
70
C. H. Ziesler et al., 2003 The Other Extreme: Abrupt Transition Low power efficiency Fast operation ON PC driver output
71
C. H. Ziesler et al., 2003 Partially Gradual Transition Energy efficient Fast ON PC driver output
72
C. H. Ziesler et al., 2003 ER Driver Core Pull-up control Pull-down control PC ch dch Single-phase power-clock Transmission gate ( wide range of operation ) Synchronizing circuitry driver output
73
C. H. Ziesler et al., 2003 Pull-Up Control PC ch_out PC ch ch_out Transistors sized to achieve correct timing and pulse width
74
C. H. Ziesler et al., 2003 Pull-Down Control PC dch_out Transistors sized to achieve correct timing and pulse width PC dch dch_out
75
C. H. Ziesler et al., 2003 Tolerance to Control Timing Variations dch ch Minimum required Maximum possible Minimum required Maximum possible PC driver output
76
C. H. Ziesler et al., 2003 Dissipation During Consecutive Charging/Discharging PC output Consecutive discharging PC output Consecutive charging
77
C. H. Ziesler et al., 2003 Complete Structure with Feedback Pull-up control Pull-down control PC ch dch Feedback circuitry —Prevents redundant dissipation during consecutive charging/discharging driver output
78
C. H. Ziesler et al., 2003 ERSRAM Operation WL BLT BLF writeidleread WL: Explicit discharge after each access for low power BL: Precharge low (with modified sense amp) for single cycle read and write
79
C. H. Ziesler et al., 2003 ERSRAM Architecture 128 x 256 Cell array Bl driver Sense amp Wl driver 128 x 256 Cell array Bl driver Sense amp Wl driver 2 x 128 x 256 Only drivers and sense amplifiers are different from that of conventional SRAM
80
C. H. Ziesler et al., 2003 Simulation TSMC 0.35mm process. Full-custom 256x256 conventional and ER SRAM Hspice simulation
81
C. H. Ziesler et al., 2003 Power Breakdown
82
C. H. Ziesler et al., 2003 Wide Operation Range Tolerant to variations in operating conditions Memory failure due to mistiming in sense amp enable : Conv. SRAM O : ERSRAM
83
C. H. Ziesler et al., 2003 Voltage Scaling Functions correctly down to 0.7V, 1MHz
84
C. H. Ziesler et al., 2003 Summary Static RAM with novel energy recovering driver Single-phase power-clock Single-cycle read and write High speed, low complexity Operation range : 0.7V, 1MHz – 3.5V, 500MHz with 0.35 m process Energy efficiency: 2.6x at 3V, 300MHz, alternating read-write access
85
C. H. Ziesler et al., 2003 Previous Work on SRAMs Multi-phase power-clock Multi-cycle operation Relatively high complexity and low/moderate speeds Inefficient during consecutive charging Not voltage scalable Somasekhar, Ye and Roy, ISLPED 1995 Tzartzanis and Athas, ISLPED 1996 Moon and Jeong, JSSC 1998 Avery and Jabri, ISLPED 1998 Kwon, Lim and Chae, ISLPED 2000 Ng and Lau, JCSC 2000 Tzartzanis, Athas and Svensson, ESSCC 2000
86
C. H. Ziesler et al., 2003 Interconnect and I/O ● Highly capacitive parallel buses that take an entire cycle for data transfer ● Can't reduce switching activity or drive voltage ● Low slew-rates desirable ● Consider energy recovering driver – Techniques similar to memories
87
C. H. Ziesler et al., 2003 Tutorial Summary Introduction to energy recovery Application of energy recovery to SOC design – Finite state machines – Fine-grain dynamic pipelines – Multi-GHz clocking – Memory arrays and I/O Functional chips in bulk silicon demonstrating substantial energy savings and fast operation in practice
88
C. H. Ziesler et al., 2003 Reference Material Energy recovery group web site at U. Michigan http://www.eecs.umich.edu/acal/energyrecovery Extensive list of references in our tutorial paper in SOC’03 proceedings The Physics of Computation R. Feynman
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.