Deepa Soman, HyunSuk Nam, Rekha Srinivasaraghavan, Shashank Sivakumar. Optimization of Power Reduction in FPGA Interconnect by Charge Recycling.
Agenda. Day 1: Intro, Power Consumption Techniques, Power Reduction Techniques, Discussions. Day 2: Power Reduction Techniques (contd.), Charge Recycling, Our Project, Discussions.
Introduction. Motivation: logic flexibility and re-programmability. Achilles’ Heel: longer wires; power 7-14x higher than ASICs.
Power Consumption. Dynamic power: power consumed while the inputs are active. Static power: power consumed even when there is no circuit activity. Dynamic power consumption is affected by switching activity, transistor capacitance, supply voltage, and frequency of operation. Static power consumption: a thermal characteristic (leakage) that grows as transistor sizes shrink.
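The four factors listed for dynamic power are commonly combined in the first-order CMOS model P = alpha * C * Vdd^2 * f. The sketch below evaluates that model; the function name and all numeric values are illustrative assumptions, not figures from the slides.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS dynamic power in watts: alpha * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Illustrative operating points (assumed values, not measured data).
baseline = dynamic_power(activity=0.2, capacitance_f=1e-9, vdd_v=1.0, freq_hz=100e6)
scaled = dynamic_power(activity=0.2, capacitance_f=1e-9, vdd_v=0.8, freq_hz=100e6)

print(f"baseline: {baseline * 1e3:.1f} mW")
print(f"at 0.8 V: {scaled * 1e3:.1f} mW")
print(f"reduction from voltage alone: {100 * (1 - scaled / baseline):.0f}%")
```

Because the supply voltage enters quadratically, lowering Vdd is the strongest single lever in this model, which is why voltage scaling leads the list of hardware techniques.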
Why Panic about Power?
Why Static Power??
Low Power Opportunities
Hardware Techniques: Voltage Scaling, Dual Vdd, Frequency Scaling, Clock Gating.
Voltage Scaling. Select the core voltage based on performance requirements. How to choose? From timing analysis. Types: 1) Static Voltage Scaling, 2) Dynamic Voltage Scaling.
1. Static Voltage Scaling. A single core voltage is selected. Realized using an on-chip low-dropout regulator (LDO); the voltage is controlled by the configuration bitstream. 0.8 V: minimum dynamic and leakage power. 1.0 V: overall highest performance. [1] "A FPGA Prototype Design Emphasis on Low Power Technique", Xu, Jian
2. Dynamic Voltage Scaling. Provides different voltage levels. Realized using a voltage controlling unit, which can be a level shifter or a DC-DC converter. DVS implementation: a novel Logic Delay Measurement Circuit (LDMC) built from FPGA resources. To the first order, the reading produced by the LDMC tracks the critical path delay of the circuit to be operated under DVS. Experiments show that a closed-loop DVS system which keeps the LDMC reading above a threshold produces no errors. "Dynamic Voltage Scaling for Commercial FPGAs", C.T. Chow, L.S.M. Tsui, P.H.W.
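The closed-loop idea (keep the LDMC reading above a threshold while lowering the supply) can be sketched as a simple controller with a guard band. Everything here is an assumption for illustration: the linear `ldmc_reading` sensor model, the step size, the margin, and the voltage limits are hypothetical and do not come from the cited paper.

```python
def ldmc_reading(vdd):
    """Hypothetical sensor model: the reading rises linearly with Vdd."""
    return 100.0 * vdd


def dvs_step(vdd, threshold, margin=2.0, step=0.01, v_min=0.7, v_max=1.2):
    """One control iteration of an assumed closed-loop DVS policy."""
    reading = ldmc_reading(vdd)
    if reading <= threshold:
        return min(v_max, vdd + step)  # too slow: raise the supply
    if reading > threshold + margin:
        return max(v_min, vdd - step)  # comfortable slack: lower the supply
    return vdd  # inside the guard band: hold


vdd = 1.2
lowest_reading = ldmc_reading(vdd)
for _ in range(100):
    vdd = dvs_step(vdd, threshold=90.0)
    lowest_reading = min(lowest_reading, ldmc_reading(vdd))

print(f"settled near {vdd:.2f} V, lowest reading {lowest_reading:.1f}")
```

The margin is chosen larger than the change in reading caused by one voltage step, so a single downward step can never push the reading below the threshold.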
Dual Supply Voltage (Vdd). Separate voltage supplies for the configuration SRAM and the other elements. Purpose: to support sleep mode, shutting down most logic except the SRAM using an LDO. "A Dual-VDD Low Power FPGA Architecture", A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, and T. Tuan
Performance. Static voltage scaling leads to nearly 53% power reduction, dynamic voltage scaling up to 54%, and dual Vdd 14%. Merits: simple hardware (SVS), self-adaptive (DVS), no speed penalty (dual Vdd). Demerits: fixed voltage (SVS), design complexity (DVS), area overhead (dual Vdd). [1] "A FPGA Prototype Design Emphasis on Low Power Technique", Xu, Jian [2] "A 90-nm Low-Power FPGA for Battery-Powered Applications", Tuan, Das, Steve, Sean
Frequency Scaling. f: frequency of switching. Dynamic clock management implementations: (a) The simplest dynamic clock management circuit is an open-loop implementation with a clock divider inserted into the desired paths. (b) Skew can be compensated by introducing a phase-locked loop (PLL) into the circuitry; the simplest dynamically scaled structure is obtained by taking feedback from a point that does not change frequency. (c) This scheme can successfully apply dynamic clock division; for dynamic multiplication, the signal in the feedback path must be divided. After a large change in input frequency, the PLL output may take a long time to settle and regain lock on the input signal. Merits: the voltage can subsequently be reduced. Demerits: increased latency.
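The remark that multiplication needs a divider in the feedback path follows from how a locked PLL behaves: the loop adjusts its oscillator until the divided feedback matches the reference. A minimal sketch of that steady-state relation is shown below; the function and parameter names are assumptions, and lock time and oscillator range are ignored.

```python
def pll_output_hz(f_ref_hz, feedback_div=1, output_div=1):
    """Steady-state output of an ideal locked PLL.

    Lock forces f_vco / feedback_div == f_ref, so the feedback divider
    multiplies the reference and the output divider divides it again.
    """
    if feedback_div < 1 or output_div < 1:
        raise ValueError("dividers must be positive integers")
    f_vco = f_ref_hz * feedback_div
    return f_vco / output_div


f_ref = 50e6  # assumed 50 MHz reference clock
print(pll_output_hz(f_ref, feedback_div=4))  # multiplication via feedback divider
print(pll_output_hz(f_ref, output_div=2))    # division via output divider
```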
Benefits of Frequency Scaling. As frequency decreases, power consumption also decreases. "Dynamic Clock Management for Low Power Applications in FPGAs", Lan, Zilic
Clock Gating. Controls the clock flow. Purpose: to temporarily disable blocks. Can be realized in hardware using clock enable signals; minimizes power dissipation in clock circuits and the clock network. (a) A clock drives a number of flip-flops. The top two rows of flip-flops are connected to a clock enable signal, clkEnable, whereas the bottom row is not connected to any clock enable signal. The clock is driven by a global clock buffer. (b) A new global clock buffer, called BUFGCE, is introduced. Its input is also clk, but its clock enable is connected to the flip-flops’ enable signal clkEnable, and clkEnable is disconnected from the flip-flops it previously fed.
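A rough way to see why gating at the clock buffer helps is to count how many clock edges actually reach flip-flop clock pins. The sketch below is a toy activity count under assumed numbers (the enable trace and flip-flop counts are hypothetical); it is not a power model of any particular device.

```python
def clock_edges_delivered(enable_trace, gated_ffs, ungated_ffs):
    """Count clock edges reaching flip-flop clock pins over a cycle trace.

    Returns (without_gating, with_gating). With gating, flip-flops behind
    the gated buffer only see the clock in cycles where enable is high.
    """
    total_cycles = len(enable_trace)
    active_cycles = sum(1 for enable in enable_trace if enable)
    without_gating = total_cycles * (gated_ffs + ungated_ffs)
    with_gating = active_cycles * gated_ffs + total_cycles * ungated_ffs
    return without_gating, with_gating


# Assumed example: enable is high in 2 of 4 cycles; 8 gated and 4 ungated flip-flops.
before, after = clock_edges_delivered([1, 0, 0, 1], gated_ffs=8, ungated_ffs=4)
print(f"clock edges at FF pins: {before} ungated vs {after} gated")
```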
Clock Gating - Performance. Over 20% power reductions are observed for the DSP circuits. Eliminates unnecessary toggling on outputs, gates of FFs, and clock signals. industry-a, b, c, d are DSP circuits, while the remaining circuits are collected from customers and are of unknown function. Demerits: clock skew. "Clock Power Reduction for Virtex-5 FPGAs", Wang, Gupta, Anderson
Software Techniques. System level: algorithm modification. CAD tools: logic partitioning, mapping, clustering, placement and routing.
Low Power FFT Implementation. Architecture: matrix multiplication; a 1D array has lower power dissipation than a 2D array. Module disabling: clock gating to disable modules, e.g. twiddle factor calculation; dynamic memory activation. Multiple time-multiplexed pipelines; uP; parallel processing. Algorithm: block matrix multiplication. Time-multiplexers, instead of a routing network, are used for shuffling the intermediate data, reducing the interconnection power burden for large FFT problem sizes. As the number of pipeline stages increases, energy reduces: pipelining reduces the number of spurious glitches, which, in turn, reduces dynamic power. To reduce memory power, a method of dynamic memory activation is developed. Cache-based approach.
FFT Implementation Results. 17% to 26% power reduction. "High throughput energy efficient multi-FFT architecture on FPGAs", Chen, Park, Prasanna
Energy Reduction Contributions of CAD Stages. Clustering contributes the major share. "On the interaction between power aware FPGA CAD algorithms", Julien, Steven
Power Aware Clustering. Power-aware T-VPack. How? Modify the cost function to include power.
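One way a clustering cost function might be extended to include power is to blend a connectivity term with the switching activity of the nets a candidate block would share with the cluster, so that busy nets are absorbed inside clusters and kept off the global interconnect. The sketch below is a hypothetical attraction function in that spirit; the weighting scheme, names, and numbers are assumptions and not the formula used by any specific tool.

```python
def attraction(shared_nets, shared_activity, max_nets, max_activity, power_weight=0.5):
    """Hypothetical attraction of a candidate block to the current cluster.

    shared_nets: number of nets the block shares with the cluster.
    shared_activity: summed switching activity of those shared nets.
    power_weight: 0 gives pure connectivity, 1 gives pure activity.
    """
    connectivity = shared_nets / max_nets
    power = shared_activity / max_activity
    return (1 - power_weight) * connectivity + power_weight * power


# Assumed candidates: X shares 3 quiet nets, Y shares 2 busy nets.
candidates = {"X": (3, 0.3), "Y": (2, 1.2)}
max_nets = max(n for n, _ in candidates.values())
max_activity = max(a for _, a in candidates.values())

for weight in (0.0, 0.5):
    best = max(
        candidates,
        key=lambda name: attraction(*candidates[name], max_nets, max_activity, weight),
    )
    print(f"power_weight={weight}: pick {best}")
```

With the power term switched off the better-connected block wins; with it enabled the block that absorbs the high-activity nets is preferred.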
Results: Power Aware Clustering. "Netlength Based Routability Driven Power Aware Clustering", Akoglu, Easwaran
Power Aware Placement. Problem addressed: power analysis of configurable switches is usually performed during the routing and mapping stages and has been largely ignored during placement, because power estimation this early in the design process is inaccurate. Proposed idea ("A Power-Aware Algorithm for the Design of Reconfigurable Hardware during High Level Placement"): model the number of switches used in the circuit and employ a simulated annealing algorithm to reduce the overall routing power.
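The placement idea can be illustrated with a small simulated annealing loop. This is a sketch under stated assumptions: Manhattan distance between connected blocks stands in for the number of switches a route would use, the annealing schedule is arbitrary, and the four-block example is made up. It is not the cited paper's algorithm.

```python
import math
import random


def switch_cost(placement, nets):
    """Proxy for switches used: Manhattan distance summed over two-pin nets."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total


def anneal(placement, nets, temp=5.0, cooling=0.95, moves_per_temp=50, seed=0):
    """Swap-based simulated annealing; returns the best placement seen."""
    rng = random.Random(seed)
    current = dict(placement)
    cost = switch_cost(current, nets)
    best, best_cost = dict(current), cost
    blocks = list(current)
    while temp > 0.01:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)
            current[a], current[b] = current[b], current[a]  # propose a swap
            new_cost = switch_cost(current, nets)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost  # accept the move
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                current[a], current[b] = current[b], current[a]  # undo the swap
        temp *= cooling
    return best, best_cost


# Hypothetical example: four blocks on a 2x2 grid connected in a chain.
start = {"A": (0, 0), "B": (1, 1), "C": (0, 1), "D": (1, 0)}
nets = [("A", "B"), ("B", "C"), ("C", "D")]
best, best_cost = anneal(start, nets)
print(f"cost before: {switch_cost(start, nets)}, after: {best_cost}")
```

Accepting some cost-increasing swaps at high temperature is what lets the search escape poor local arrangements before the schedule cools.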
Results. "On the interaction between power aware FPGA CAD algorithms", Julien, Steven
Temperature Aware Routing. Leakage current increases exponentially with temperature. Switching capacitance. Needs knowledge of the spatial distribution of parameters.
Algorithm. Discourages the routing algorithm from forming connections that cross hotspot regions, via a modified cost function. Power savings range between 30% and 63%. "A Temperature-Aware Placement and Routing targeting 3D FPGAs", Kostas, Soudris
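Discouraging routes through hotspots can be sketched as a shortest-path search whose per-cell cost grows with temperature. The grid, the temperature values, and the `thermal_weight` parameter below are assumptions for illustration; the actual cost function of the cited work is not reproduced here.

```python
import heapq


def route(grid_temp, src, dst, thermal_weight=0.0):
    """Dijkstra over a grid; entering a cell costs 1 + thermal_weight * temperature."""
    rows, cols = len(grid_temp), len(grid_temp[0])
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + 1.0 + thermal_weight * grid_temp[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


temps = [
    [0, 0, 0],
    [0, 9, 0],  # assumed hotspot in the centre cell
    [0, 0, 0],
]
plain = route(temps, (1, 0), (1, 2))
aware = route(temps, (1, 0), (1, 2), thermal_weight=1.0)
print("temperature-blind route:", plain)
print("temperature-aware route:", aware)
```

The temperature-aware route is longer but bypasses the hot cell, which is the trade-off the slide describes.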
Power-Aware FPGA Design Flow. Step 1: power-based architectural design (high-level modelling, RTL): voltage scaling, dual Vdd, frequency scaling, clock gating. Step 2: power-aware CAD tools: packing or clustering, placement, routing.
Main/Baseline Paper. Problem addressed: power consumption in FPGAs is dominated by interconnect (62%). Proposed idea: charge recycling for power reduction in FPGA interconnect.
Charge Recycling (CR)
Charge Recycling in FPGAs. How? Unused routing resources act as reservoirs. Reduces the charge drawn from Vdd. 25% reduction in energy. (Figure labels: unused/reservoir wires; unused wires without friends.)
CR-Capable FPGA Interconnect Analysis. Four components. SRAM cell: produces the signals CR and TS, which control a switch's mode (normal, CR, tri-state). Delay line: transition between VIN and DLOUT. CR circuit: performs the charge sharing between the load and the reservoir. Input stage.
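The charge sharing between load and reservoir can be explored with an idealized two-capacitor model: on a falling edge the load dumps part of its charge into the reservoir before being grounded, and on a rising edge the reservoir gives part of it back before the supply tops up the rest. The sketch below assumes lossless switches, complete equalization, and arbitrary capacitance values; it does not try to reproduce the 25% figure quoted on the slides.

```python
def supply_charge_per_rise(c_load, c_res, vdd=1.0, cycles=200):
    """Charge drawn from Vdd per rising transition once the reservoir settles."""
    v_res = 0.0
    drawn = c_load * vdd
    for _ in range(cycles):
        # Falling edge: load at Vdd equalizes with the reservoir, then is grounded.
        v_res = (c_load * vdd + c_res * v_res) / (c_load + c_res)
        # Rising edge: load at 0 V equalizes with the reservoir first.
        v_share = c_res * v_res / (c_load + c_res)
        v_res = v_share
        # The supply only provides the remaining swing up to Vdd.
        drawn = c_load * (vdd - v_share)
    return drawn


c_load = 1.0  # assumed, arbitrary units
for c_res in (0.5, 1.0, 4.0):
    baseline = c_load * 1.0  # charge per rise without recycling
    saving = 1 - supply_charge_per_rise(c_load, c_res) / baseline
    print(f"C_res/C_load = {c_res}: ideal saving {100 * saving:.1f}%")
```

In this model the saving grows with the reservoir capacitance but stays below one half, which hints at why the number of unused wires available as reservoirs matters.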
Experiments/Methodology. VPR 6.0. Baseline: island-style, unidirectional, Wilton (K=6, N=4). Router: PathFinder with a modified cost function. Post-routing CR mode. The VPR place/route tool helps in finding the % increase in area.
VPR Cost Function. Cost function (PathFinder). Modified cost function.
Post-Routing. A mixed integer linear program tries to maximize the number of nodes put into CR mode. Constraint: critical delay of the circuit.
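The shape of the post-routing optimization (maximize the count of CR-mode nodes without exceeding the critical delay) can be shown on a toy instance. The sketch below swaps the mixed integer linear program for brute-force enumeration over a single path, with hypothetical per-node delay penalties and slack; a real formulation would cover every path in the circuit and use an MILP solver.

```python
from itertools import combinations


def max_cr_nodes(delay_penalty, slack):
    """Largest set of nodes on one path whose summed CR delay penalty fits the slack.

    delay_penalty: mapping from node name to the extra delay CR mode adds.
    slack: delay the path can absorb before it becomes critical.
    """
    nodes = list(delay_penalty)
    for size in range(len(nodes), 0, -1):  # try the largest sets first
        for subset in combinations(nodes, size):
            if sum(delay_penalty[n] for n in subset) <= slack:
                return set(subset)
    return set()


# Assumed penalties (arbitrary time units) for four routing nodes on one path.
penalties = {"n1": 3, "n2": 1, "n3": 2, "n4": 4}
chosen = max_cr_nodes(penalties, slack=6)
print(f"{len(chosen)} nodes in CR mode: {sorted(chosen)}")
```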
Results Dynamic power in the FPGA interconnect is reduced by up to ∼15-18.4%
Results Continued. The number of min-width transistors is used as the area metric. Reductions in power savings are not directly proportional to the reduction in CR-capable switches (area).
What do we propose that is new? Not all unused wires become friends. Unused wires are connected to a constant voltage. “URekha”: tri-state the unused wires for further power savings (~6% savings).
Thank you