Literature Review
Measuring the Gap Between FPGAs and ASICs
Ian Kuon, Jonathan Rose
University of Toronto
IEEE TCAD/ICAS February 2007
Henry Chen
February 26, 2010

Introduction
Trade-offs between FPGAs and standard-cell ASICs
–Decreased NRE, design time
–Increased silicon area, power; decreased performance
FPGA inefficiencies known and accepted, but largely un-quantified

Previous Comparisons
Jones et al. (1986): MPGAs to standard cells
–1.5 ‒ 2.6x area, ~1.1x delay
–Estimates based on only 5 circuits
Brown et al. (1992): FPGAs to MPGAs
–8 ‒ 12x area, ~3x delay
–Optimistic FPGA gate counting?
–Anecdotal evidence
–Doesn’t consider “hard” macros (multipliers, memories)
Combine for FPGAs to standard cells
–12 ‒ 38x area, ~3.4x delay
–Dated; based on (questionable?) extractions

Previous Comparisons (2000s)
Zuchowski et al. (2002): LUT to ASIC gate (0.25μm ‒ 90nm)
–~1/45 gate density, 12 ‒ 14x delay, ~500x dynamic power
–Unexplained process-dependent density/power variation
–Dependent on gates implemented per LUT
Wilton et al. (2005): Partial programmable replacement
–88x area, 2x delay
–Single logic module
Compton & Hauck (2007): FPGA apps. to standard-cell
–Avg 7.2x area
–Scaled FPGA 0.15μm to 0.18μm standard-cell

Methodology
Implement in both FPGA and standard-cell
–Altera Stratix II FPGA: TSMC 90nm, multi-Vt, 1.2V
–Standard-cell: ST CMOS090 90nm, dual-Vt, 1.2V
Empirical results from 23 benchmarks
–Rejected if different synthesis tools resulted in >5% register count deviation
–Mix of logic, memory, DSP
Analyze gains from FPGA’s DSP and memory blocks
Exclude I/Os
Have device data from Altera
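
The benchmark rejection rule can be sketched as a simple relative-difference check. This is only an illustration: the function names are hypothetical, the register counts are made up, and measuring the deviation against the ASIC count is an assumption (the slide does not say which flow is the reference).

```python
def register_deviation(fpga_registers: int, asic_registers: int) -> float:
    """Relative register-count difference, measured against the ASIC count (an assumption)."""
    if asic_registers <= 0:
        raise ValueError("asic_registers must be positive")
    return abs(fpga_registers - asic_registers) / asic_registers


def keep_benchmark(fpga_registers: int, asic_registers: int, limit: float = 0.05) -> bool:
    """Keep a benchmark only if both flows agree within the limit (5% on the slide)."""
    return register_deviation(fpga_registers, asic_registers) <= limit


# Hypothetical register counts, for illustration only.
print(keep_benchmark(1030, 1000))  # 3% deviation
print(keep_benchmark(1100, 1000))  # 10% deviation
```

A large register-count mismatch suggests the two synthesis tools built materially different circuits, so the comparison would no longer be like-for-like.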

Implementations
FPGA
–Altera-provided CAD flow
–Speed/area balanced optimization; optimize critical paths for performance, otherwise optimize for area
–Automatic DSP, memory block inference
–Set to mimic effects of high resource utilization
ASIC
–Synopsys/Cadence synthesis/PAR flow
–Free to choose from high/standard-Vt cells
–Timing-driven placement; target 75 ‒ 85% utilization
–Emphasized performance in compiled memories

Area Comparison
ASIC
–Post-PAR core area
–Include memory macros
FPGA
–Count only silicon area for used resources
–Include surrounding routing resources
–Count full block area even if only partially used
–Area data from Altera
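
A minimal sketch of the FPGA area accounting described above, assuming per-block areas that already include the surrounding routing. All block names, areas, and counts are hypothetical placeholders; the actual area data came from Altera and is not given on the slides.

```python
import math

# Hypothetical per-block areas in arbitrary units; each value is assumed to
# already include the routing around the block.
BLOCK_AREA = {"logic_block": 1.0, "dsp_block": 12.0, "memory_block": 4.0}


def blocks_needed(elements_used: int, elements_per_block: int) -> int:
    """A partially used block still counts as one full block."""
    return math.ceil(elements_used / elements_per_block)


def fpga_area(blocks_used: dict[str, int]) -> float:
    """Sum the full area of every block the design touches; unused blocks cost nothing."""
    return sum(BLOCK_AREA[kind] * count for kind, count in blocks_used.items())


# Hypothetical design: 955 logic elements at 8 per block, 2 DSP blocks, 5 memories.
used = {
    "logic_block": blocks_needed(955, 8),
    "dsp_block": 2,
    "memory_block": 5,
}
asic_core_area = 8.0  # hypothetical post-PAR core area, same arbitrary units
print(used["logic_block"], fpga_area(used), fpga_area(used) / asic_core_area)
```

Rounding up to whole blocks is what makes the FPGA estimate pessimistic: the last, partly filled block is charged in full.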

Area Comparison Results
Logic only: 35x avg (17 ‒ 54x)
Logic + DSP: 25x avg (12 ‒ 58x)
Logic + Memory: 33x avg (19 ‒ 70x)
Logic + Memory + DSP: 18x avg (9.5 ‒ 26x)

Impact of Hard Macros on Area
Smaller area penalty for designs using hard macros
–Hard macro close to ASIC implementation (plus programmable interface & routing)

Area Comparison Caveats
Pessimistic FPGA area estimation; count full resource area even if only partially used (~5 ‒ 10% reduction)
ASIC density may decrease for larger designs, while FPGAs are designed to handle large designs

Delay Comparison
Altera Quartus II / Synopsys PrimeTime SI
Static timing analysis to extract max. clock frequency
Compare for different FPGA speed grades
–FPGAs are binned for performance
–ASICs tend to be designed for worst-case
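
The delay gap follows from the maximum clock frequencies reported by static timing analysis: the critical-path delay is the reciprocal of the maximum frequency, so the gap is the ratio of the two periods. A minimal sketch, with hypothetical function names and made-up frequencies:

```python
def critical_path_ns(fmax_mhz: float) -> float:
    """Critical-path delay in nanoseconds for a given maximum clock frequency."""
    if fmax_mhz <= 0:
        raise ValueError("fmax_mhz must be positive")
    return 1000.0 / fmax_mhz


def delay_gap(fpga_fmax_mhz: float, asic_fmax_mhz: float) -> float:
    """How many times longer the FPGA critical path is than the ASIC one."""
    return critical_path_ns(fpga_fmax_mhz) / critical_path_ns(asic_fmax_mhz)


# Hypothetical frequencies, for illustration only.
print(round(delay_gap(fpga_fmax_mhz=100.0, asic_fmax_mhz=340.0), 2))
```

The same design timed against a slower FPGA speed grade yields a lower FPGA frequency and therefore a larger gap, which is why results are reported per speed grade.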

Delay Comparison Results (Fastest Speed Grade)
Logic only: 3.4x avg (1.9 ‒ 5.0x)
Logic + DSP: 3.5x avg (2.4 ‒ 4.7x)
Logic + Memory: 3.5x avg (2.8 ‒ 4.3x)
Logic + Memory + DSP: 3.0x avg (2.6 ‒ 3.5x)

Delay Comparison Results (Slowest Speed Grade)
Logic only: 4.6x avg (2.5 ‒ 6.7x)
Logic + DSP: 4.6x avg (3.0 ‒ 6.3x)
Logic + Memory: 4.8x avg (3.8 ‒ 5.7x)
Logic + Memory + DSP: 4.1x avg (3.8 ‒ 4.7x)

Impact of Hard Macros on Delay
Almost no benefit, sometimes penalty!
–Fixed positions in FPGA; extra routing to use
–Fixed architecture; some apps. may not use efficiently

Power Comparison
Altera Quartus II Power Analyzer / Synopsys PrimePower
Compare power, not energy consumption
–FPGAs slower; need more time or parallelism
–Implement for highest speed possible
–Simulate at same operating frequency, voltage
Measure only core power
Assume constant toggle rates for all nets in design
–Meaningful test vectors not available for all designs
FPGA static power consumption scaled by used fraction
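
The static-power scaling in the last bullet can be sketched as below. It assumes "used fraction" means the fraction of the FPGA core area occupied by the design, which the slide does not spell out; the function name and all numbers are hypothetical.

```python
def scaled_static_power(device_static_w: float, used_area: float, core_area: float) -> float:
    """Charge the design only for the share of device static power matching its used area."""
    if core_area <= 0:
        raise ValueError("core_area must be positive")
    if not 0 <= used_area <= core_area:
        raise ValueError("used_area must lie between 0 and core_area")
    return device_static_w * (used_area / core_area)


# Hypothetical device: 0.5 W static power, design occupies a quarter of the core.
print(scaled_static_power(device_static_w=0.5, used_area=25.0, core_area=100.0))
```

Without this scaling, a small benchmark on a large device would be charged for leakage in logic it never uses.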

Power Comparison Results
Logic only: 14x avg (5.7 ‒ 52x)
Logic + DSP: 12x avg (7.5 ‒ 16x)
Logic + Memory: 14x avg (12 ‒ 16x)
Logic + Memory + DSP: 7.1x avg (5.3 ‒ 8.3x)
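
One way to see how application-dependent the gaps are is to divide the largest gap by the smallest gap in each category. The sketch below does this for the area, fastest-speed-grade delay, and power ranges copied from the results slides; the choice of these three result sets and the dictionary layout are assumptions made for illustration.

```python
# (smallest gap, largest gap) per benchmark category, copied from the results slides.
RANGES = {
    "area": {
        "logic": (17, 54), "logic+dsp": (12, 58),
        "logic+mem": (19, 70), "logic+mem+dsp": (9.5, 26),
    },
    "delay (fastest grade)": {
        "logic": (1.9, 5.0), "logic+dsp": (2.4, 4.7),
        "logic+mem": (2.8, 4.3), "logic+mem+dsp": (2.6, 3.5),
    },
    "power": {
        "logic": (5.7, 52), "logic+dsp": (7.5, 16),
        "logic+mem": (12, 16), "logic+mem+dsp": (5.3, 8.3),
    },
}

# Ratio of largest to smallest gap for every (metric, category) pair.
spreads = {
    (metric, mix): hi / lo
    for metric, mixes in RANGES.items()
    for mix, (lo, hi) in mixes.items()
}

for (metric, mix), value in spreads.items():
    print(f"{metric:22s} {mix:14s} {value:4.1f}x")

smallest = min(spreads.values())
largest = max(spreads.values())
mean = sum(spreads.values()) / len(spreads)
print(f"spread: min {smallest:.1f}x, max {largest:.1f}x, mean {mean:.1f}x")
```

With these inputs the spread runs from about 1.3x to about 9.1x and averages about 3x, which lines up with the 3x and 1.3 ‒ 9.1x figures quoted in the conclusions.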

Impact of Hard Macros on Power
Slight benefit, primarily from area savings?
–Less area and interconnect

Power Consumption Caveats
May be disproportionate power in FPGA clock network
–“Overdesigned” for tested circuits
–Could have small incremental power increase
ASIC clock network would have to grow with designs

Static Power Comparison
Unable to draw useful conclusions about static power
–87x for typical silicon, typical temp. (25°C)
–5.4x for worst-case silicon, worst-case temp. (85°C)
Had to scale worst-case silicon temp. characterization
Subthreshold leakage is process-dependent
–Little information on leakage estimate factors
–Different processes from different foundries
Some correlation between static power and area gap (correlation coefficient ~0.8)
–Hard macros likely reduced static power penalty

Conclusions
Disparity hard to quantify; very application dependent
–Avg. gap_max/gap_min 3x; gap_max/gap_min range 1.3 ‒ 9.1x
All-LUT designs avg. 35x area, 3.4 ‒ 4.6x delay, 14x power
–119x area, 47.6x power gap for equal performance (assuming ideal parallelization)
Hard macros reduce area and power, but have little performance benefit
–Avg. 18x area, 3 ‒ 4.1x delay, 7.1x power
–54x area, 21.3x power for equal performance
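
The equal-performance figures appear to be the average area and power gaps multiplied by the fastest-speed-grade delay gap: under ideal parallelization, an FPGA that is k times slower needs k parallel copies of the circuit to match ASIC throughput. A short check of that arithmetic, using only numbers from the slides (the function name is hypothetical):

```python
def equal_performance_gap(gap: float, delay_gap: float) -> float:
    """Area or power gap after replicating the FPGA design delay_gap times."""
    return gap * delay_gap


# (area gap, power gap, fastest-grade delay gap) from the results slides.
CASES = {
    "all-LUT": (35, 14, 3.4),
    "with hard macros": (18, 7.1, 3.0),
}

results = {}
for name, (area, power, delay) in CASES.items():
    results[name] = (
        round(equal_performance_gap(area, delay), 1),
        round(equal_performance_gap(power, delay), 1),
    )
    print(f"{name}: {results[name][0]}x area, {results[name][1]}x power")
```

The products reproduce the 119x / 47.6x and 54x / 21.3x values on the slide. Real designs rarely parallelize ideally, so these are best read as rough lower bounds on the cost of matching ASIC performance.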

References
Jones, Jr., H. S., Nagle, P. R., Nguyen, H. T., “A Comparison of Standard Cell and Gate Array Implementations in a Common CAD System,” Proc. IEEE CICC, 1986, pp. 228 ‒ 232
Brown, S. D., Francis, R., Rose, J., Vranesic, Z., Field-Programmable Gate Arrays, Norwell, MA: Kluwer, 1992
Zuchowski, P. S., Reynolds, C. B., Grupp, R. J., Davis, S. G., Cremen, B., Troxel, B., “A Hybrid ASIC and FPGA Architecture,” Proc. ICCAD, Nov. 2002, pp. 187 ‒ 194
Wilton, S. J., Kafafi, N., Wu, J. C. H., Bozman, K. A., Aken’Ova, V., Saleh, R., “Design Considerations for Soft Embedded Programmable Logic Cores,” IEEE JSSC, vol. 40, no. 2, pp. 485 ‒ 497, Feb. 2005
Compton, K., Hauck, S., “Automatic Design of Area-Efficient Configurable ASIC Cores,” IEEE Trans. Comp., vol. 56, no. 5, pp. 662 ‒ 672, May 2007