1
Routing Track Duplication with Fine-Grained Power-Gating for FPGA Interconnect Power Reduction
Yan Lin, Fei Li and Lei He
EE Department, UCLA
Partially supported by NSF grant CCR
Address comments to
2
Outline
- Review and Motivation
- Interconnect Leakage Power Reduction using Power-gating
- Interconnect Dynamic Power Reduction using Dual-Vdd
- Conclusions and Ongoing Work
3
Power Limitation of FPGAs
- Existing FPGAs are highly power inefficient (>100X more than ASIC)
- Power is likely the largest limitation for FPGAs

E.g. [Kusse, ISLPED’98]:

Design example      Vdd     Energy
Xilinx XC4003A      5 V     4.2 mW/MHz
Static CMOS ASIC    3.3 V   5.5 uW/MHz

It is well known that FPGAs are highly power inefficient compared to ASICs. Previous work has shown that the same circuit consumes more than 100X the power when implemented on an FPGA than when implemented in static CMOS. Power is therefore likely the largest limitation for FPGAs.
4
FPGA Power Reduction
- Power-aware FPGA CAD algorithms for existing FPGA architectures
  - CAD algorithms to minimize power-delay product [Lamoureux et al, ICCAD’03]
  - Configuration inversion for leakage reduction [Anderson et al, FPGA’04]
- Power-efficient FPGA circuits and architectures
  - Dual-Vdd and Vdd-programmable FPGA logic blocks [Li et al, FPGA’04][Li et al, DAC’04]
  - Vdd-programmable FPGA interconnects [Li et al, ICCAD’04] [Anderson et al, ICCAD’04]

As far as FPGA power reduction is concerned, previous work has studied power-aware FPGA CAD algorithms that leave the current FPGA architectures unchanged. A suite of CAD algorithms was proposed to minimize power-delay product, and configuration inversion was proposed to reduce leakage. Other previous work designs power-efficient FPGA circuits and architectures: dual-Vdd and Vdd-programmable FPGAs have been proposed in the papers cited above. This paper mainly focuses on FPGA interconnect power reduction. In the next couple of slides, I will review the FPGA architecture background and the Vdd-programmable interconnects proposed in [Li et al, ICCAD’04].
5
Overall FPGA Structure
- Cluster-based island-style FPGA structure
  - Logic blocks are embedded into routing resources
  - Wire segment connectivity is programmable

Here we show the overall structure of a cluster-based island-style FPGA. The logic blocks are surrounded by horizontal and vertical routing channels, which consist of wire segments. The wire segments can be connected to each other via a switch block at each intersection of a vertical and a horizontal routing channel. The input/output pins of a logic block can be programmed to connect to wire segments in the routing channels.
6
FPGA Routing Structure
- Subset programmable switch block
  - An incoming track can be connected to different outgoing tracks with the same track number
- Programmable connection block

In the subset programmable switch block, each dashed line is a possible, bidirectional connection; "subset" means a track connects only to tracks with the same track number. The programmable connection block is multiplexer-based: it selects one wire segment to be connected to a logic block input pin. The buffer between a wire segment and the multiplexer is the connection switch.
7
Vdd-programmable Interconnects [Li et al, ICCAD’04]
Figure: conventional routing switch and Vdd-programmable switch (with power transistors).
- Vdd selection for used switch
- Power-gating unused switch
- Configurable Vdd-level conversion
  - Avoid excessive leakage when low Vdd switch drives high Vdd switches

Here we review the Vdd-programmable switch. Starting from a conventional routing switch implemented as a tri-state buffer, two power transistors are inserted between the dual power supply rails and the buffer. Turning off one of the power transistors performs Vdd selection for a used switch; turning off both performs power-gating for an unused switch. A configurable level converter is inserted in front of each interconnect switch to avoid excessive leakage when a VddL switch drives a VddH switch.
8
Limitation of Vdd-programmable Interconnects [Li et al, ICCAD’04]
- Fine-grained Vdd-level converter insertion
  - Area overhead: 54% area overhead for circuit s38584
  - Leakage overhead: 36% leakage overhead for circuit s38584
- SRAM cell overhead
  - 300% SRAM cell overhead for each switch
- Area/SRAM efficient low-power interconnects are needed

However, the fine-grained Vdd-level converter introduces large area and leakage overhead. Analysis shows that for circuit s38584, the area and leakage overhead is … respectively. Also, to achieve Vdd programmability and configurable Vdd-level conversion, the configuration SRAM cell overhead is 300%. This makes configuration signal routing harder, and SRAM is vulnerable to soft errors. Therefore, … is needed.
9
Outline
- Review and Motivation
- Interconnect Leakage Power Reduction using Power-gating
- Interconnect Dynamic Power Reduction using Dual-Vdd
- Conclusions and Ongoing Work
10
Low Utilization Rate of Interconnects
- 78.15% of total power is consumed by global interconnect [Li et al, DAC’04]
- 47% of global interconnect power is leakage
- Why? Extremely low utilization rate (~12% w/ minimum array)

Interconnect switch utilization:

Circuit    # total switches   # unused switches   Utilization rate
alu4       36478              31224               14.40%
apex4      43741              37703               13.80%
bigkey     63259              54017               9.87%
clma       653181             593343              9.16%
des        87877              79932               9.04%
diffeq     42746              36974               13.50%
dsip       75547              70138               7.16%
elliptic   140296             125800              10.33%
ex5p       45404              39288               13.47%
frisc                         216993              9.15%
Average                                           11.90%

Previous work shows that global interconnect power becomes more significant (around 78.15% of total power) after applying programmable dual-Vdd to logic blocks, and that 47% of global interconnect power is leakage. Given these two factors, we can power-gate unused interconnect switches to reduce FPGA power. The high leakage is due to the extremely low interconnect utilization rate. (animation) We customize the FPGA chip size for each application and use the minimum array that just fits the application. The utilization rate is then about 12%, and it will be even lower in the real world.
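As a rough check of the utilization column, the rate can be recomputed from the switch counts. This is a minimal sketch: the helper name `utilization_rate` and the dictionary layout are illustrative, and only rows whose two counts are both legible in the table are included.

```python
# (total switches, unused switches) per circuit, copied from the table above.
switch_counts = {
    "alu4": (36478, 31224),
    "apex4": (43741, 37703),
    "clma": (653181, 593343),
    "des": (87877, 79932),
    "diffeq": (42746, 36974),
    "dsip": (75547, 70138),
    "elliptic": (140296, 125800),
    "ex5p": (45404, 39288),
}


def utilization_rate(total, unused):
    """Fraction of interconnect switches that are actually used."""
    return (total - unused) / total


for name, (total, unused) in switch_counts.items():
    print(f"{name:10s} {100 * utilization_rate(total, unused):6.2f}%")
```

Every row stays well below 15%, which is the motivation for power-gating the unused switches.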
11
Interconnect Utilization Rate is Intrinsically Low
- Programmable switch block: no more than 25%
- Programmable connection block: only one is used (for 64 tracks)
- Power-gating unused interconnects is necessary

In fact, the low utilization comes from programmability itself and is not related to the application or the architecture. In a programmable switch block, using only one direction gives 50%, and using 3 out of the 6 dashed lines brings that down to 25%. In a programmable connection block, only one switch out of, for example, 64 can be used. The low utilization is therefore intrinsic, and power-gating unused interconnects is necessary.
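The two bounds above are simple products of fractions. The sketch below just restates that arithmetic; the function names and default arguments are illustrative and encode the figures quoted on the slide (one of two directions, 3 of 6 possible connections, 64 tracks).

```python
from fractions import Fraction


def switch_block_bound(directions_used=1, directions_total=2,
                       connections_used=3, connections_total=6):
    """Upper bound on switch-block utilization, per the slide's argument."""
    return (Fraction(directions_used, directions_total)
            * Fraction(connections_used, connections_total))


def connection_block_utilization(tracks):
    """A connection block drives one input pin from exactly one track."""
    return Fraction(1, tracks)


print(float(switch_block_bound()))              # switch block bound
print(float(connection_block_utilization(64)))  # connection block, 64 tracks
```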
12
Vdd-gateable Routing Switch
Figure: conventional routing switch and Vdd-gateable routing switch (with power transistor).
- Only two states for a routing switch
  - High Vdd
  - Power-gating
- Enable power-gating capability w/o extra SRAM cells

Here we show the circuit of a routing switch with power-gating capability, which we call the Vdd-gateable routing switch.
1. Based on the conventional routing switch, one power transistor is inserted (animation).
2. When the switch is used, turn on both the power transistor and the pass transistor.
3. When the switch is unused, turn off both.
4. Keep pass transistor M1 to prevent a sneak leakage path.
This enables power-gating w/o extra SRAM.
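The reason no extra SRAM cell is needed can be seen in a small behavioral model: the power transistor and the pass transistor are always on or off together, so one configuration bit can drive both. This is only a sketch; the class and attribute names are hypothetical, and it models logic state, not the circuit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VddGateableSwitch:
    """Behavioral model of a Vdd-gateable routing switch (illustrative)."""
    sram_bit: bool  # the single configuration cell of a conventional switch

    @property
    def pass_transistor_on(self):
        return self.sram_bit

    @property
    def power_transistor_on(self):
        # Shares the same configuration bit: no extra SRAM cell.
        return self.sram_bit

    @property
    def state(self):
        return "high-Vdd" if self.sram_bit else "power-gated"


for bit in (True, False):
    sw = VddGateableSwitch(bit)
    print(bit, sw.state, sw.pass_transistor_on, sw.power_transistor_on)
```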
13
Vdd-Gateable Connection Block
Figure: conventional connection block and Vdd-gateable connection block.
- Enable power-gating capability w/ only one extra SRAM cell for a connection block
  - Only n+1 SRAM cells for 2^n connection switches
- A low-leakage decoder is needed

Here we show the Vdd-gateable connection block.
1. For each connection switch, replace the buffer with a Vdd-gateable switch.
2. Replace the multiplexer with a decoder that selects one wire segment to connect to the logic block input and power-gates the other switches.
3. Add one extra SRAM cell to disable the decoder and power-gate all switches when the whole block is unused.
4. A low-leakage decoder is needed to avoid leakage overhead.
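The decoder arrangement can be sketched behaviorally as follows: n select bits pick one of 2^n connection switches, and one extra bit disables the whole block. The function names and the most-significant-bit-first ordering are assumptions made for illustration.

```python
def connection_block_enables(select_bits, block_enable):
    """Return one on/off flag per connection switch (illustrative model).

    select_bits  : n configuration bits, most significant first (assumed order)
    block_enable : the one extra SRAM bit; False power-gates every switch
    """
    num_switches = 1 << len(select_bits)
    if not block_enable:
        return [False] * num_switches
    index = 0
    for bit in select_bits:
        index = (index << 1) | int(bit)
    return [i == index for i in range(num_switches)]


def sram_cells_needed(n):
    """n decoder inputs plus one block-disable bit."""
    return n + 1


enables = connection_block_enables([1, 0, 1], block_enable=True)
print(enables.index(True), sum(enables), sram_cells_needed(6))
```

For a 64-track connection block (n = 6) this gives 7 SRAM cells for 64 switches, with at most one switch powered at a time.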
14
Power and Delay of Vdd-gateable Switch
- Vdd-gateable switch compared to conventional switch
  - Dynamic power is almost the same
  - >300X leakage power reduction
  - ~6% delay increase

Vdd     Routing switch delay (s)                Energy per switch (Joule)
        w/o power-gating   w/ power-gating
1.3v    5.90E-11           6.26E-11 (6%)        3.3E-14     3.25E-14
1.0v    6.99E-11           7.42E-11 (6.1%)      1.63E-14    1.65E-14

Here we show the power and delay of the Vdd-gateable switch. There is almost no dynamic power overhead, because the power transistor does not switch and there is no charging or discharging of its drain/source capacitance. The switch achieves a 300X leakage reduction at the cost of a 6% delay increase.
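The delay overhead quoted above follows directly from the two delay columns. A minimal sketch, using only the delay values from the table (the names `switch_delay` and `relative_increase` are illustrative):

```python
# Vdd label -> (delay w/o power-gating, delay w/ power-gating), from the table.
switch_delay = {
    "1.3v": (5.90e-11, 6.26e-11),
    "1.0v": (6.99e-11, 7.42e-11),
}


def relative_increase(baseline, value):
    """Fractional increase of value over baseline."""
    return (value - baseline) / baseline


delay_overhead = {vdd: relative_increase(*pair) for vdd, pair in switch_delay.items()}
for vdd, overhead in delay_overhead.items():
    print(f"{vdd}: delay increase {100 * overhead:.1f}%")
```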
15
Power Reduction by Power-gating Unused Interconnects
           Single-Vdd (baseline)                      Total power saving
Circuit    Interconnect power (W)   Total power (W)   [Li et al, ICCAD04]   Vdd-gateable interconnects
alu4       0.0657                   0.0769            25.13%                29.09%
apex4      0.0437                   0.0500            21.83%                30.70%
bigkey     0.1044                   0.1375            33.38%                24.89%
clma       0.4918                   0.5450            23.42%                45.69%
des        0.1688                   0.2136            36.71%                31.79%
diffeq     0.0292                   0.0360            17.50%                45.20%
dsip       0.1003                   0.1280            34.34%                43.66%
Avg.       -                        -                 25.19%                38.18%

Results use cycle-accurate simulation. Vdd-gateable interconnects reduce total power by 38% on average. In contrast, Vdd-programmable interconnects reduce it by 25%: despite their flexibility in Vdd selection, they pay the Vdd-level converter overhead.
16
Outline
- Review and Motivation
- Interconnect Leakage Power Reduction using Power-gating
- Interconnect Dynamic Power Reduction using Dual-Vdd
  - FPGA fabrics and algorithms
  - Design flow and quantitative evaluation
- Conclusions and Ongoing Work
17
Pre-Defined Dual-Vdd Routing Architecture
- Reduce dynamic power with dual-Vdd by making use of timing slack
- Partition routing channel into VddH and VddL regions
  - Vdd-gateable interconnect switch is used
  - Ratio of VddH/VddL tracks is an architectural parameter

The dual-Vdd technique makes use of the timing slack in the circuit to minimize power. High Vdd (VddH) is applied to devices on the critical paths to maintain performance, while low Vdd (VddL) is applied to devices on non-critical paths to reduce power. FPGA applications usually have a large amount of surplus timing slack, so we can apply the dual-Vdd technique to the FPGA interconnect fabric and use that slack to reduce interconnect dynamic power. Here we show the dual-Vdd routing structure. We partition the routing channel into two regions, use Vdd-gateable switches throughout, and give each region a different supply voltage. The ratio of VddH to VddL tracks is an architectural parameter. How do we determine it?
18
Ratio of VddH to VddL Track
- Determine the ratio using a dual-Vdd assignment profile, without considering the layout constraint
- Sensitivity-based dual-Vdd assignment
  - Assignment unit: a routing tree
  - Power sensitivity: ΔP/ΔVdd, the power difference for a routing tree between VddH and VddL
- Greedy algorithm (sensitivity based)
  - Initial: uniform VddH assignment
  - Procedure: assign VddL to the routing tree with the largest power sensitivity (but without increasing critical delay)

The assignment is sensitivity-based and greedy.
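The greedy procedure above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the data layout (`trees`, `paths`, and the `power_*` / `delay_*` fields) is hypothetical, and the timing model assumes a path delay is the sum of the delays of the routing trees on it. Because sensitivities are fixed here, visiting trees once in decreasing order is equivalent to repeatedly picking the largest remaining one.

```python
def greedy_dual_vdd_assignment(trees, paths, vdd_high=1.5, vdd_low=1.0):
    """Assign VddL to routing trees greedily without increasing critical delay.

    trees : name -> {"power_high", "power_low", "delay_high", "delay_low"}
    paths : list of timing paths, each a list of tree names (toy timing model)
    """
    assignment = {name: "VddH" for name in trees}  # initial: uniform VddH

    def tree_delay(name):
        key = "delay_low" if assignment[name] == "VddL" else "delay_high"
        return trees[name][key]

    def critical_delay():
        return max(sum(tree_delay(n) for n in path) for path in paths)

    def sensitivity(name):
        t = trees[name]
        return (t["power_high"] - t["power_low"]) / (vdd_high - vdd_low)

    delay_limit = critical_delay()
    for name in sorted(trees, key=sensitivity, reverse=True):
        assignment[name] = "VddL"
        if critical_delay() > delay_limit:
            assignment[name] = "VddH"  # would slow the critical path: revert
    return assignment


example_trees = {
    "a": {"power_high": 10.0, "power_low": 5.0, "delay_high": 2.0, "delay_low": 3.0},
    "b": {"power_high": 8.0, "power_low": 4.0, "delay_high": 2.0, "delay_low": 3.0},
    "c": {"power_high": 4.0, "power_low": 2.0, "delay_high": 3.0, "delay_low": 4.0},
}
example_paths = [["a", "b"], ["c"]]
print(greedy_dual_vdd_assignment(example_trees, example_paths))
```

In the toy example, trees `a` and `b` sit on the critical path and stay at VddH, while `c` has enough slack to move to VddL.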
19
Profile of Dual-Vdd Assignment
Assignment with no critical path delay increase (VddH:VddL = 1.5v:1.0v)

Circuit   # of routing trees   # of logic blocks   # of I/O blocks   VddL routing trees (%)   VddL logic blocks (%)
alu4      782                  162                 22                49.74                    82.10
apex4     849                  134                 28                35.45                    78.36
bigkey    1542                 294                 426               67.77                    85.03
clma      7995                 1358                144               69.74                    89.84
s38417    5426                 982                 135               64.17                    80.05
seq       1138                 274                 76                20.74                    61.62
spla      2091                 461                 122               54.52                    88.47
Avg.                                                                 54.54                    80.28

- Set the ratio of VddH/VddL tracks to 1:1

On average about 54% of the routing trees can use low Vdd, so we use a 1:1 ratio.
20
Level Converter is NOT Needed
- Wire segment can only be connected to another wire segment with the same track number via a subset switch block

In a subset switch block, a wire segment can only be connected to a segment with the same track number. Suppose we route from A to B: we use either high-Vdd track 0 or low-Vdd track 2 (animation). In other words, a wire segment can only be connected to a segment with the same Vdd level, so no level converter is needed (animation).
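The argument can be checked with a toy model: if Vdd regions are defined by track number and a subset switch block only joins equal track numbers, a VddL segment can never drive a VddH segment. The function names and the convention that low-numbered tracks are VddH are assumptions for illustration (chosen to match the track 0 / track 2 example).

```python
def track_vdd(track, channel_width, high_fraction=0.5):
    """Vdd region of a track; low-numbered tracks are assumed to be VddH."""
    num_high = int(channel_width * high_fraction)
    return "VddH" if track < num_high else "VddL"


def subset_switch_targets(track):
    """Subset switch block: track t connects only to outgoing track t."""
    return [track]


def needs_level_converter(src_track, dst_track, channel_width):
    """A converter is needed only when a VddL segment drives a VddH segment."""
    return (track_vdd(src_track, channel_width) == "VddL"
            and track_vdd(dst_track, channel_width) == "VddH")


width = 4
for t in range(width):
    for u in subset_switch_targets(t):
        print(t, "->", u, track_vdd(t, width), needs_level_converter(t, u, width))
```

A hypothetical non-subset connection, such as track 2 driving track 0, is the case that would require a level converter.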
21
Level Converter is NOT Needed
- Wire segment can only be connected to another wire segment with the same track number via a subset switch block
- No level converter is needed in switch block
22
Layout Constraint Due to Dual-Vdd
- Dual-Vdd introduces performance degradation due to layout constraint
  - Insufficient routing resources for Vdd-matched routing trees
  - May introduce detours
- Solutions
  - Vdd-programmable interconnects [Li et al, ICCAD’04]
  - Provide sufficient routing tracks for Vdd-matched routing trees
    - Control leakage by power-gating unused interconnects

Dual-Vdd introduces performance degradation: with insufficient routing resources, the router introduces detours to match the Vdd type. Previous work solved this with Vdd-programmable interconnects. In this paper, we provide sufficient routing resources by increasing the channel width and control leakage using Vdd-gateable switches.
23
Design Flow for Dual-Vdd Interconnects
Flow chart (boxes in order):
- Inputs: Tech Mapped Netlist (Single-Vdd), Arch Spec
- Timing Driven Layout (Single-Vdd)
- Dual-Vdd Assignment for Routing Trees
- Timing Driven Layout (Dual-Vdd), or Double Channel Width
- Power-gating Unused Switches
- Delay/Power Estimation, using Delay/Power Model (dual-Vdd)
- Outputs: Delay, Power

Here is the design flow considering dual-Vdd:
1. Single-Vdd placement and routing
2. Dual-Vdd assignment
3. Dual-Vdd routing guided by the assignment
4. Power and delay evaluation

There is another design path when the channel is duplicated: it achieves effective Vdd-programmability for each routing tree, so dual-Vdd routing is skipped.
24
Dual-Vdd Routing Algorithm
- Based on the maze routing algorithm in VPR
- Modify the cost function
  - TotalCost(n): the cost of routing tree T through wire segment n to the target sink j
  - PathCostDv(n): the cost of the path from the current partial routing tree to wire segment n
  - ExpectedDv(n,j): the estimated cost from wire segment n to the target sink j
  - Matched(T,n): boolean function describing Vdd-matching status

The dual-Vdd routing algorithm is based on the maze routing used in VPR. We modify the original cost function to incorporate dual-Vdd, adding one boolean function, Matched, to penalize a routing tree that uses a wire segment with an unmatched Vdd type.
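The exact cost formula is not recoverable from the slide text, so the sketch below only illustrates the idea of penalizing Vdd mismatches. It assumes an additive penalty on top of a path-cost-plus-estimate form; the weights `alpha` and `mismatch_penalty`, and all function names, are hypothetical rather than taken from VPR or the paper.

```python
def matched(tree_vdd, segment_vdd):
    """Vdd-matching status of routing tree T on wire segment n."""
    return tree_vdd == segment_vdd


def total_cost(path_cost, expected_cost, tree_vdd, segment_vdd,
               alpha=1.0, mismatch_penalty=10.0):
    """Illustrative cost: path cost + weighted estimate + mismatch penalty.

    The additive form and both weights are assumptions for this sketch.
    """
    cost = path_cost + alpha * expected_cost
    if not matched(tree_vdd, segment_vdd):
        cost += mismatch_penalty
    return cost


# Candidate segments for a VddL routing tree: (name, path cost, estimate, Vdd).
candidates = [
    ("short_but_VddH", 2.0, 3.0, "VddH"),
    ("longer_but_VddL", 3.0, 4.0, "VddL"),
]
best = min(candidates, key=lambda c: total_cost(c[1], c[2], "VddL", c[3]))
print(best[0])
```

With a large enough penalty, the router prefers a slightly longer Vdd-matched segment over a shorter mismatched one, which is the source of the detours discussed earlier.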
25
Outline
- Review and Motivation
- Interconnect Leakage Power Reduction using Power-gating
- Interconnect Dynamic Power Reduction using Dual-Vdd
  - FPGA fabrics and algorithms
  - Quantitative evaluation
- Conclusions and Ongoing Work
26
Comparison of Low Power Architectures
Figure: power (watt) versus clock frequency (MHz) for circuit s38584, with one curve each for arch-SV, arch-PV, arch-PV+PG and arch-DV+PG (1.5W). Points on each curve are labeled with their supply voltages (for example 1.5v/0.8v, 1.3v/1.0v, 1.0v/0.8v, 0.9v/0.8v).

- Dual-Vdd interconnects with fine-grained power gating
  - May have performance degradation due to layout constraint
  - Can reduce more power than purely power-gating unused switches
  - Achieve 9.78% interconnect dynamic power reduction, 38.68% total power saving with 1.5W channel width
- W is the nominal routing channel width in single-Vdd FPGA

These are the power-performance tradeoff curves:
1. arch-SV: single-Vdd scaling achieves power saving by scaling down Vdd at the cost of performance loss.
2. arch-PV: apply programmable Vdd to logic blocks (previous work).
3. arch-PV+PG: further power-gate unused interconnects.
4. arch-DV+PG: further introduce dual-Vdd by increasing the routing channel width by 50%.

Comparing the two low-power curves 3 and 4, we can see that at the maximum clock frequency there is a performance loss due to the layout constraint, and that dual-Vdd reduces more power than purely power-gating. The power reduction (the gap between curves 3 and 4) decreases at lower clock frequency, which indicates that timing slack is smaller at lower clock frequency.
27
Impact of Routing Channel Width
Figure: power saving (left axis) and normalized clock frequency (right axis) versus channel width, from 1.0W to 2.0W. Labeled points: power saving 34.86%, 38.68%, 45.00%; normalized clock frequency 0.743, 0.838, 0.955.

- We get the power reduction percentage at the maximum clock frequency achieved by dual-Vdd interconnects
- Channel width increases from 1.0W to 2.0W
  - Power saving increases from 34.86% to 45%
  - Normalized clock frequency increases from 0.743 to 0.955

Here we show the impact of routing channel width. The X-axis is channel width, the left Y-axis is power saving, and the right Y-axis is normalized clock frequency. The red curve is power saving and the blue curve is clock frequency. It is clear that increasing the channel width gives more power reduction and higher performance. Sufficient routing resources increase the rate of Vdd-matched routing trees, which yields more dynamic power reduction, while leakage stays similar because unused switches are power-gated.
28
Area Overhead of Vdd-gateable Interconnects
- Device area is dominant

Architecture                        Total FPGA area   Area overhead (%)
Single-Vdd (baseline)                                 -
Dual-Vdd w/ Power-gating (1.0W)                       57%
Dual-Vdd w/ Power-gating (1.5W)                       118%
Dual-Vdd w/ Power-gating (2.0W)                       186%
[Li et al, ICCAD’04]                                  220%

- Area overhead is mainly due to power transistors for power-gating capability
- Track duplication with power-gating vs Vdd-programmable interconnects [Li et al, ICCAD’04]
  - More power reduction (45% vs 25%) and less area overhead
  - Mainly due to Vdd-level converter removal
- High-Vdd interconnects with power gating are best considering area

However, a larger channel width means a larger area overhead. Device area is dominant compared to wiring area. The table shows the geometric mean over the MCNC benchmarks. Compared with Vdd-programmable interconnects, the duplicated channel width has less area overhead, since it avoids the power transistors for Vdd-programmability and the Vdd-level converters, and gives more power reduction, since there is no Vdd-level converter. Considering the area/power tradeoff, single-Vdd with Vdd-gateable interconnects is best.
29
Outline
- Review and Motivation
- Interconnect Leakage Power Reduction using Power-gating
- Interconnect Dynamic Power Reduction using Dual-Vdd
- Conclusions and Ongoing Work
30
Conclusions and Ongoing Work
- Developed power-gateable interconnects w/ virtually no extra SRAM cell
  - Achieved 38.18% total power reduction using Vdd-gateable interconnects
  - Achieved 24.78% interconnect dynamic power reduction, 45.00% total power reduction with duplicated (2W) channel width
- Ongoing work
  - Power-ground design to support dual-Vdd
  - Optimal mix of Vdd-programmable and Vdd-gateable interconnects
  - Architecture evaluation considering Vdd programmability [Lin et al, to appear in FPGA’05]