Reducing Power Consumption with Relaxed Quasi Delay-Insensitive Circuits. Christopher LaFrieda and Rajit Manohar, Computer Systems Laboratory, Cornell University.
Outline: Motivation / Background. Contributions: Relaxed Quasi Delay-Insensitive (RQDI), RQDI Voltage Scaling, RQDI Two Phase Circuits. Results. Summary.
Motivation: How Does Dynamic Power Scale? α – activity factor (1x). N – total number of transistors (2x). C_L – average load capacitance per transistor (0.7x). V_dd – doesn’t scale well anymore: scaled by 17-20% from 130nm to 65nm, by 10% at 45nm, and by 5.5% at 32nm.
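The per-generation factors on this slide can be combined into a rough estimate of how dynamic power moves at fixed frequency. The sketch below assumes the standard dynamic power expression, P proportional to α · N · C_L · V_dd² · f, which the slide text itself does not spell out. The function name and the reading of the quoted percentages as multiplicative V_dd factors are illustrative assumptions.

```python
def dynamic_power_scale(alpha=1.0, n=2.0, c_load=0.7, vdd=1.0, freq=1.0):
    """Relative dynamic power after one process generation.

    Assumes P is proportional to alpha * N * C_L * Vdd^2 * f, with every
    argument given as the ratio of the new value to the old value.
    Defaults are the per-generation factors quoted on the slide.
    """
    return alpha * n * c_load * vdd ** 2 * freq


# Vdd reduced by 10% (the 45nm figure), frequency held fixed
print(round(dynamic_power_scale(vdd=0.90), 3))   # 1.134
# Vdd reduced by 5.5% (the 32nm figure), frequency held fixed
print(round(dynamic_power_scale(vdd=0.945), 3))  # 1.25
# No Vdd scaling at all
print(round(dynamic_power_scale(vdd=1.0), 3))    # 1.4
```

Under these assumptions, dynamic power grows from one generation to the next even with the quoted supply reductions, which is the concern the slide raises.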
Motivation: Power Scaling With Fixed Frequency
Motivation: Process Variations Getting Worse Process variation in 65nm, FO4 delays across corners: FF is 70% faster than SS. Circuits need to be robust w.r.t. process variations. QDI is a logical place to start.

SS Corner    TT Corner    FF Corner
13.6 ps      18.2 ps      22.6 ps
Background: QDI – WCHB Buffer Simple buffer. Neutrality is checked in the pull-up stack of the c-element. Timing assumption?
RQDI: Staticizer Timing Assumption I Data is neutral and enable is high.
RQDI: Staticizer Timing Assumption II Data is neutral and enable is high. Data becomes valid which sets _R0 low. If R0 inverter is slow, R0 will remain low.
RQDI: Staticizer Timing Assumption III Data is neutral and enable is high. Data becomes valid which sets _R0 low. If R0 inverter is slow, R0 will remain low. Nothing is fighting the weak feedback, so _R0 can go high.
RQDI: Half Cycle Timing Assumption The half cycle timing assumption (HCTA): A small amount of combinational logic (1-2 transitions) will always switch within one half cycle of a process. With 18 transitions per cycle (t.p.c.), there is a 4.5x timing margin. With worst case corners, the margin is 2.7x in 65nm. Wire delays make the assumption even more conservative. QDI has an HCTA in staticizers. RQDI allows them everywhere.
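The two margin figures can be reproduced with simple arithmetic. The sketch below is one reading that matches the quoted numbers, since the slide does not state how they were derived. It assumes the margin is the half-cycle budget divided by the logic depth. It also assumes the worst case pairs the slowest listed FO4 delay for the logic with the fastest listed FO4 delay for the rest of the cycle. The function and variable names are illustrative.

```python
def hcta_margin(transitions_per_cycle, logic_transitions):
    """Ratio of the half-cycle budget to the combinational logic depth."""
    half_cycle = transitions_per_cycle / 2
    return half_cycle / logic_transitions


# 18 transitions per cycle, 2 transitions of combinational logic
nominal = hcta_margin(transitions_per_cycle=18, logic_transitions=2)

# Assumption: scale the nominal margin by the ratio of the fastest to the
# slowest FO4 delay listed for the 65nm corners (13.6 ps and 22.6 ps).
fastest_fo4_ps, slowest_fo4_ps = 13.6, 22.6
worst_case = nominal * fastest_fo4_ps / slowest_fo4_ps

print(nominal)               # 4.5
print(round(worst_case, 1))  # 2.7
```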
RQDI: HCHB Template N tracks neutrality. Check N+, but assume N- happens in the first half cycle. Two transition latency. 14 transition cycle time. Validity must be checked by pull-down.
RQDI Voltage Scaling: Scaling Scenarios Two possible scenarios for voltage scaling. Top: mismatched slack. Lower pipeline can run slower. Bottom: token limited loop. Latency through loop should be minimal, but cycle time can scale. In some applications these can’t be avoided.
RQDI Voltage Scaling: Slack Mismatch In An FPGA Logic blocks (LB) for logic. Switch boxes (SB) for routing. Limited routing resources. Imperfect slack matching. Can scale voltage on blue path.
RQDI Voltage Scaling: DVHB: Dual Voltage Template Data rails are full swing. Acknowledges are low swing. Latency remains constant through voltage scaling. Cycle time can be adjusted through voltage scaling.
RQDI Two Phase Circuits: Two Phase Buffer (HCFB2P) An HCTA exists on the right pair of XORs. Two transition latency. Seven transition cycle time. Twice the area of a WCHB. However, it can replace two stages.
RQDI Two Phase Circuits: Two Phase In An FPGA Replace routing (SB) with two phase logic. Logic (LB) remains four phase. Phase converters are placed around logic blocks. Routing makes up over half the area in an asynchronous FPGA, so power savings can be large. [Figure: width N switch]
RQDI Two Phase Circuits: Converters Need to convert between two phase (for routing) and four phase (for logic). The 4:2 converter is 3x larger than a WCHB. The 2:4 converter is 3.25x larger than a WCHB.
Experimental Setup Simulated in HSpice with a 65nm bulk technology. Circuits are sized to the drive strength of a 20/10 lambda inverter.

Name   Description    Inputs  Outputs  Implies Validity?
and2   And            2       1        No
or2    Or             2       1        No
xor2   Exclusive Or   2       1        Yes
fa     Full Adder     3       2        Yes
benc   Booth Encoder  3       2        No
Results: HCHB – Energy Per Cycle HCHB consumes 32% less energy than PCHB. HCHB consumes 36% less energy than PCEHB. Slight frequency improvement. Negligible latency penalty.
Results: HCHB – Total Transistor Area Despite the additional transistors to check validity, HCHB is smaller. HCHB is about 20% smaller than PCHB. HCHB is about 15% smaller than PCEHB.
Results: DVHB – Low voltage vs. Dual Voltage
Results: HCFB2P Switch – Energy Reduction vs. WCHB Wider switches mean larger MUXes and larger PCs. The associated caps switch half as much. Over 50% reduction in power, due to replacing two stages.
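The "switch half as much" observation follows from the handshake protocols themselves. As a minimal sketch, the code below counts wire transitions for a generic four-phase (return-to-zero) dual-rail handshake and a generic two-phase (transition signaling) handshake. The encodings, wire ordering, and function names are assumptions for illustration; the slides do not give the exact encoding used by HCFB2P, and internal circuit nodes are ignored.

```python
def count_toggles(trace):
    """Count wire changes between consecutive states of a trace."""
    return sum(
        a != b
        for prev, cur in zip(trace, trace[1:])
        for a, b in zip(prev, cur)
    )


def four_phase_trace(bits):
    """States (rail0, rail1, ack) for a return-to-zero dual-rail handshake."""
    trace = [(0, 0, 0)]
    for b in bits:
        rails = (1, 0) if b == 0 else (0, 1)
        trace.append(rails + (0,))  # data becomes valid
        trace.append(rails + (1,))  # receiver acknowledges
        trace.append((0, 0, 1))     # data returns to neutral
        trace.append((0, 0, 0))     # acknowledge is released
    return trace


def two_phase_trace(bits):
    """States (rail0, rail1, ack) for a transition-signaling handshake."""
    rails, ack = [0, 0], 0
    trace = [(0, 0, 0)]
    for b in bits:
        rails[b] ^= 1               # toggle the rail selected by the bit
        trace.append((rails[0], rails[1], ack))
        ack ^= 1                    # toggle the acknowledge once
        trace.append((rails[0], rails[1], ack))
    return trace


bits = [1, 0, 0, 1, 1]
print(count_toggles(four_phase_trace(bits)))  # 20
print(count_toggles(two_phase_trace(bits)))   # 10
```

In this simplified model each token costs four wire transitions in the four-phase protocol and two in the two-phase protocol, so long routing wires and their capacitance toggle half as often.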
RQDI Two Phase Circuits: Results – Area Overhead Typically, there are about 8 stages of 4-wide switches between logic blocks. Area overhead is 15%. With direct connections, there are about 10 stages with an overhead of 10%.
Summary RQDI allows half cycle timing assumptions outside of staticizers. With RQDI, we can simplify the PCHB logic template. The resulting template, HCHB, consumes 32% less energy. The dual voltage logic template can be used to adjust the dynamic slack of a stage. This allows us to save energy with a minimal throughput penalty in token limited loops. Replacing the routing in an FPGA with two phase logic can reduce energy consumption by 50%. Using the RQDI two phase buffer and converters will achieve this with a 10-15% area overhead.
Questions?