Architecture and Synthesis for Power-Efficient FPGAs Jason Cong University of California, Los Angeles Partially supported by NSF Grants.

Dan Lander Haru Yamamoto Shane Erickson (EE 201A Spring 2004)
Presentation transcript:

Architecture and Synthesis for Power-Efficient FPGAs
Jason Cong, University of California, Los Angeles (UCLA)
Partially supported by NSF Grants CCR , and CCR , and Altera under the California MICRO program
Joint work with Deming Chen, Lei He, Fei Li, Yan Lin

Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
- Low Power Synthesis
- Conclusions

Why? FPGA is Known to be Power Inefficient!
- FPGA consumes X more power
- Why do we care about power optimization for FPGAs?
Source: [Zuchowski, et al, ICCAD02]

ASICs Become Increasingly Expensive
- Traditional ASIC designs are facing a rapid increase of NRE and mask-set costs at 90nm and below
[Chart: total cost for mask set ($M) and cost per mask ($K) by process node, including 180nm, 130nm and 100nm]
Source: EETimes

FPGA Advantages
- Short TAT (total turnaround time)
- No or very low NRE

Our Research: Power Efficient FPGAs
- Circuit Design
- Fabric Design
- System Design
- Synthesis Tools

Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
- Low Power Synthesis
- Conclusions

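FPGA Architecture
[Figure: FPGA fabric with programmable IO, programmable logic and programmable routing. Each logic block has I inputs, N outputs and a clock, and contains N basic logic elements (BLE #1 to BLE #N). Each BLE has a K-input LUT and a D flip-flop (D FF) with clock and output.]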

Evaluation Framework: fpgaEva-LP
- fpgaEva flow [Cong, et al, ICCD’00]: BLIF, Logic Optimization (SIS), Tech-Mapping (RASP), Timing-Driven Packing (T-VPack), Placement & Routing (VPR)
- fpgaEva-LP [Li, et al, FPGA’03]: the same flow, extended with a BC-Netlist Generator and a Power Simulator that reports Power
[Figure: both flow diagrams; other labels in the figure are Arch Spec, SLIF, Delay and Area]

BC-Netlist Generator
[Figure labels: Mapped Netlist, Layout, Buffer Extraction, Netlist Generation for Logic Clusters, Capacitance Extraction, Delay Calculation, Back-annotation, BC-Netlist]

Mixed-level Power Model: Overview
- Dynamic power (related to signal transitions: functional switch, glitch)
  - Switching power
  - Short-circuit power
- Static power (depending on the input vector)
  - Sub-threshold leakage
  - Gate leakage
  - Reverse biased leakage
Model used per power source and component:
  Power source | Logic block | Interconnect & clock
  Dynamic      | Macro-model | Switch-level model
  Static       | Macro-model | Macro-model
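As a rough illustration of the dynamic/static split on this slide, the sketch below uses the textbook first-order switching-power expression (0.5 * C * Vdd^2 * f per transition) and adds leakage as a lump sum. It is not the macro-model or switch-level model used by fpgaEva-LP. The `Node` class, the function names and the idea of treating short-circuit power as a fixed fraction of switching power are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One switched capacitance in the circuit (hypothetical abstraction)."""
    cap_farads: float             # effective capacitance charged per transition
    transitions_per_cycle: float  # average transitions per clock cycle, glitches included


def switching_power(nodes, vdd, freq_hz):
    """First-order switching power: sum of 0.5 * C * Vdd^2 * f * transitions."""
    return sum(
        0.5 * n.cap_farads * vdd ** 2 * freq_hz * n.transitions_per_cycle
        for n in nodes
    )


def total_power(nodes, vdd, freq_hz, short_circuit_fraction, leakage_watts):
    """Dynamic power (switching plus short-circuit) plus static leakage.

    Assumption: short-circuit power is modeled as a fixed fraction of
    switching power, and leakage is supplied as a precomputed number.
    """
    dynamic = switching_power(nodes, vdd, freq_hz) * (1.0 + short_circuit_fraction)
    return dynamic + leakage_watts
```

Because switching power scales with the square of Vdd, halving the supply of a node cuts its switching power to a quarter, which is the lever the dual-Vdd slides rely on.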

Cycle-Accurate Power Simulator
[Figure: the simulator takes the mixed-level power model, post-layout extracted delay & capacitance, the BC-netlist and randomly generated vectors. Cycle-accurate power simulation with glitch analysis repeats while the check "All cycles finished?" is No; on Yes it outputs the power values.]
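The loop in this flow (generate a random vector, simulate one cycle, accumulate, repeat until all cycles are finished) can be sketched as follows. This is a zero-delay toy simulator, so it counts at most one transition per node per cycle and does not perform the glitch analysis named on the slide. The netlist encoding, the function name and the per-node capacitance table are hypothetical.

```python
import random


def simulate_power(gates, inputs, caps, vdd, freq_hz, cycles, seed=0):
    """Estimate average switching power from random input vectors.

    gates:  list of (output_name, function, input_names) in topological order
    inputs: names of primary inputs
    caps:   capacitance in farads for every primary input and gate output
    """
    rng = random.Random(seed)
    prev = None
    energy = 0.0
    for _ in range(cycles):                      # "All cycles finished?" loop
        values = {name: rng.randint(0, 1) for name in inputs}
        for out, fn, ins in gates:               # zero-delay evaluation
            values[out] = fn(*(values[i] for i in ins))
        if prev is not None:
            for name, v in values.items():
                if v != prev[name]:              # one transition on this node
                    energy += 0.5 * caps[name] * vdd ** 2
        prev = values
    # The first cycle only initialises state, so cycles - 1 intervals are measured.
    return energy * freq_hz / max(cycles - 1, 1)
```

A real flow would replace the zero-delay evaluation with the back-annotated delays so that intermediate glitches are counted as extra transitions.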

Power Breakdown
- Interconnect power is dominant
[Figure panels: Cluster Size = 12, LUT Size = 4; Cluster Size = 12, LUT Size = 6]

Power Breakdown (cont’d)
- Leakage power becomes increasingly important (100nm)
[Figure panels: Cluster Size = 12, LUT Size = 4; Cluster Size = 12, LUT Size = 6]

Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
  - Architecture Parameter Selection
  - Dual-Vdd/Dual-Vt FPGA Architecture
- Low Power Synthesis with Dual-Vdd
- Conclusion

Total Power as LUT Size and Cluster Size Change
- Routing architecture: segmented wires of length 4, with 50% tri-state buffers in routing switches

Routing Architecture Evaluation

Architecture of Low-power and High-performance Applications
Table columns: Application, Best FPGA architecture, Energy (E), Delay (t), E^3·t, E·t^3 (numeric entries missing)
- Low-power (E^3·t): cluster size 10, LUT size 4, wire segment length 4, 25% buffered routing switches
- High-performance (E·t^3): cluster size 12, LUT size 4, wire segment length 4, 100% buffered routing switches
Observations:
- Architecture parameter selection leads to a 10% power/delay trade-off
- Uniform FPGA fabrics provide a limited power-performance tradeoff
- Need to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics
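The two figures of merit on this slide weight energy and delay differently: E^3·t favors low energy, E·t^3 favors low delay. A minimal sketch of picking the best candidate under each metric is shown below. The function names and the candidate format (name mapped to an energy/delay pair) are assumptions, and any numbers passed in would be placeholders rather than results from the talk.

```python
def energy_delay_metric(energy, delay, e_exp, t_exp):
    """Generalised energy-delay product: E**e_exp * t**t_exp (lower is better)."""
    return energy ** e_exp * delay ** t_exp


def best_architecture(candidates, e_exp, t_exp):
    """Return the name of the candidate with the smallest metric.

    candidates: dict mapping an architecture name to (energy, delay).
    Use e_exp=3, t_exp=1 for the low-power metric E^3*t and
    e_exp=1, t_exp=3 for the high-performance metric E*t^3.
    """
    return min(
        candidates,
        key=lambda name: energy_delay_metric(*candidates[name], e_exp, t_exp),
    )
```

With two hypothetical candidates, one slightly lower in energy and one slightly faster, the two metrics can select different architectures, which mirrors the two rows of the table.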

Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
  - Architecture Parameter Selection
  - Dual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA’04]
- Low Power Synthesis with Dual-Vdd
- Conclusion

Dual-Vdd LUT Design
- Dual-Vdd technique makes use of the timing slack to reduce power
  - VddH devices on critical path: performance
  - VddL devices on non-critical paths: power
  - Assume uniform Vdd for one LUT
- Threshold voltage Vt should be adjusted carefully for different Vdd levels
  - To compensate delay increase
  - To avoid excessive leakage power increase

Vdd/Vt-Scaling for LUTs
- Three scaling schemes
  - Constant-Vt scaling
  - Fixed-Vdd/Vt-ratio scaling
  - Constant-leakage scaling
- Constant-leakage scaling obtains a good tradeoff
- Useful for both single-Vdd scaling and dual-Vdd design
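To make the three schemes concrete, the sketch below computes the Vt that each one would pick for a new Vdd. The leakage model is an assumption: leakage power is taken as Vdd * I0 * exp(-Vt / slope), a simplified sub-threshold expression that ignores DIBL, gate leakage and temperature. The slope constant and all function names are hypothetical, and the talk's own definition of constant-leakage scaling may differ.

```python
import math

# Assumed sub-threshold slope term n * kT/q in volts (n = 1.5, room temperature).
SLOPE_V = 1.5 * 0.026


def leakage_power(vdd, vt, i0=1.0, slope=SLOPE_V):
    """Simplified leakage power model (assumption): Vdd * I0 * exp(-Vt / slope)."""
    return vdd * i0 * math.exp(-vt / slope)


def constant_vt(vdd, vdd0, vt0):
    """Constant-Vt scaling: Vt stays at its nominal value."""
    return vt0


def fixed_ratio(vdd, vdd0, vt0):
    """Fixed-Vdd/Vt-ratio scaling: Vt scales in proportion to Vdd."""
    return vt0 * vdd / vdd0


def constant_leakage(vdd, vdd0, vt0, slope=SLOPE_V):
    """Constant-leakage scaling: choose Vt so leakage_power is unchanged."""
    return vt0 + slope * math.log(vdd / vdd0)
```

Under this model, lowering Vdd with constant-leakage scaling also lowers Vt slightly, which recovers some delay without letting leakage power grow.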

Dual-Vt LUT Design
- LUT is divided into two parts
  - Part I: configuration cells, high Vt
  - Part II: MUX tree and input buffers, normal Vt (decided by constant-leakage Vdd-scaling)
- Configuration SRAM cells
  - Content remains unchanged after configuration
  - Read/write delay is not related to FPGA performance
- Use high Vt, ~40% of Vdd
  - Maintain signal integrity
  - Reduce SRAM leakage by 15X and LUT leakage by 2.4X
  - Increase configuration time by 13%

Pre-Defined Dual-Vt Fabric
- FPGA fabric arch-SVDT
  - Dual-Vt inside a LUT
  - A homogeneous fabric at logic block level with much reduced leakage power
- Traditional design flow in VPR can be applied
- Power saving
  - 11.6% for combinational circuits
  - 14.6% for sequential circuits
Table 1: Combinational circuits
  Columns: circuit, arch-SVST (Single Vt) power (watt), arch-SVDT (Dual Vt) power (watt), power saving
  Circuits: alu, apex, apex, des, ex, ex5p, misex, pdc, seq, spla (per-circuit values missing)
  Avg. power saving: 11.6%
Table 2: Sequential circuits
  Columns: circuit, arch-SVST (Single Vt) power (watt), arch-SVDT (Dual Vt) power (watt), power saving
  Circuits: bigkey, clma, diffeq, dsip, elliptic, frisc, s, s, s, tseng (per-circuit values missing)
  Avg. power saving: 14.6%

Dual-Vdd FPGA Fabric
- Granularity: logic block (i.e., cluster of LUTs)
  - Smaller granularity => intuitively more power saving
  - But a larger implementation overhead
- Layout pattern: pre-defined dual-Vdd pattern
  - Row-based or interleaved pattern
  - Ratio of VddL/VddH blocks is 2:1 (benchmark profiling)
- Interconnect uses uniform VddH
[Figure legend: L-block: VddL; H-block: VddH]
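A pre-defined pattern with a 2:1 ratio of L-blocks to H-blocks can be generated as in the sketch below. The exact geometry of the row-based and interleaved patterns is not given in the transcript, so both layouts here are assumptions: "row" repeats whole rows in the order L, L, H, and "interleaved" shifts that period by one column per row. The function name and grid encoding are hypothetical.

```python
def dual_vdd_pattern(rows, cols, style="row", period=("L", "L", "H")):
    """Build a rows x cols grid of 'L' (VddL) and 'H' (VddH) logic blocks.

    style="row":         every block in a row shares one supply level
    style="interleaved": the period is shifted by one column on each row
    """
    grid = []
    for r in range(rows):
        if style == "row":
            grid.append([period[r % len(period)]] * cols)
        elif style == "interleaved":
            grid.append([period[(r + c) % len(period)] for c in range(cols)])
        else:
            raise ValueError(f"unknown pattern style: {style}")
    return grid
```

For grid dimensions that are multiples of the period length, both styles give exactly twice as many L-blocks as H-blocks.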

Simple Design Flow for Dual-Vdd Fabric
Based on the traditional design flow, but with new steps:
- Step I: LUT mapping (FlowMap) + P & R assuming uniform VddH (using VPR)
- Step II: Dual-Vdd assignment based on sensitivity
- Step III: Timing-driven P & R considering the pre-defined dual-Vdd pattern (modified VPR)
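Step II can be pictured with the greedy sketch below. The transcript does not define "sensitivity", so this is one plausible reading, stated as an assumption: rank blocks by power saved per unit of added delay, move each to VddL in that order, and keep the move only if the critical path does not get longer than the all-VddH baseline. Level converters and the pre-defined layout pattern are ignored, and every name here is hypothetical.

```python
def critical_path_delay(fanin, delay):
    """Longest path through a DAG whose keys are listed in topological order."""
    arrival = {}
    for block, preds in fanin.items():
        start = max((arrival[p] for p in preds), default=0.0)
        arrival[block] = start + delay[block]
    return max(arrival.values())


def assign_dual_vdd(fanin, delay_h, delay_l, power_h, power_l):
    """Greedy dual-Vdd assignment with no performance loss (illustrative only)."""
    assignment = {b: "H" for b in fanin}
    delay = dict(delay_h)
    target = critical_path_delay(fanin, delay)   # all-VddH critical path

    def sensitivity(b):
        saving = power_h[b] - power_l[b]
        penalty = max(delay_l[b] - delay_h[b], 1e-12)  # avoid division by zero
        return saving / penalty

    for b in sorted(fanin, key=sensitivity, reverse=True):
        delay[b] = delay_l[b]                    # tentatively move block to VddL
        if critical_path_delay(fanin, delay) <= target + 1e-12:
            assignment[b] = "L"                  # timing still met: keep it
        else:
            delay[b] = delay_h[b]                # violates timing: revert
    return assignment
```

Recomputing the full critical path after each tentative move is quadratic in the number of blocks; it keeps the sketch short, whereas a production tool would update slacks incrementally.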

Comparison Between Vdd-Scaling and Dual-Vdd
- For high clock frequency, dual-Vdd achieves ~6% total power saving (~18% logic power saving)
- For low clock frequency, single-Vdd scaling is better
- Still a large gap between ideal dual-Vdd and the real case
  - Ideal dual-Vdd is the result without the layout pattern constraint
[Figure: power (watt) vs. max. clock frequency (MHz) for circuit alu. Curves: arch-SVDT (Vdd scaling), arch-DVDT (ideal case), arch-DVDT (pre-defined Vdd). Data points are labeled with Vdd or VddH/VddL values between 0.8v and 1.5v.]

Vdd-Programmable Logic Block
- Power switches for Vdd selection and power gating
- One-bit control is needed for Vdd selection, but two-bit control is needed for power gating
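The difference between one-bit and two-bit control can be shown with a small decoder. The encoding is an assumption, not taken from the slide: it supposes two power switches, one to VddH and one to VddL, each enabled by its own configuration bit, so that turning both off power-gates the block. With a single bit, only the choice between the two supplies can be expressed. All names are hypothetical.

```python
from enum import Enum


class BlockSupply(Enum):
    VDD_H = "VddH"
    VDD_L = "VddL"
    GATED = "power-gated"


def decode_one_bit(select_high):
    """One configuration bit: choose VddH or VddL; the block is always powered."""
    return BlockSupply.VDD_H if select_high else BlockSupply.VDD_L


def decode_two_bit(enable_h, enable_l):
    """Two configuration bits, one per power switch (assumed encoding)."""
    if enable_h and enable_l:
        raise ValueError("both switches on would connect VddH to VddL")
    if enable_h:
        return BlockSupply.VDD_H
    if enable_l:
        return BlockSupply.VDD_L
    return BlockSupply.GATED   # both switches off: unused block is power-gated
```

The extra bit per block is the configuration overhead paid for being able to switch off unused logic blocks.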

Experimental Results with Vdd-Programmable Blocks
- Power vs. performance
[Figure: total power (watt) vs. clock frequency (MHz) for circuit alu. Curves: arch-SV (Vdd scaling), arch-DV (configurable Vdd), arch-DV (ideal case), arch-DV (pre-defined Vdd). Data points are labeled with Vdd or VddH/VddL values between 0.8v and 1.5v.]

Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
- Low Power Synthesis
- Conclusions

Low Power Synthesis for Dual Vdd FPGAs
- FPGA architecture with dual-Vdds adds new layout constraints for synthesis tools
- Novel synthesis tools are required to support the architecture
  - Technology mapping [Chen, et al, FPGA’04]
  - Circuit clustering [Chen, et al, ISLPED’04]

Conclusions
- FPGA power consumption
  - Majority on programmable interconnects
  - Leakage is significant
- FPGA architecture optimization for power
  - Architecture parameter tuning has a limited impact
  - Using high Vt for configuration SRAM cells is helpful
  - Using programmable dual Vdd for logic blocks is helpful
- Power-efficient FPGA architectures introduce interesting CAD problems
  - Dual-Vdd mapping
  - Dual-Vdd clustering
  - Up to 20% power saving reported using these algorithms