1 Comparing FPGA vs. Custom CMOS and the Impact on Processor Microarchitecture Henry Wong, Vaughn Betz, Jonathan Rose

2 Processor Microarchitecture ● Microarchitecture: How to arrange circuits to make a processor ● Depends on how efficient the circuits are ● Which depends on the substrate – Custom CMOS – Standard Cell – FPGA

3 Goals ● Make good microarchitecture design choices for bigger and faster FPGA soft processors ● Much existing literature on processor design for custom CMOS implementation – Comparisons of overall area/delay between substrates exist – But relative building block costs vary up to two orders of magnitude on FPGA vs. Custom CMOS ● This work: compares building blocks and infers microarchitectural conclusions – Also applicable to circuits other than processors

4 What we're measuring 1. Focus on processors as the complete circuit – FPGA vs. Custom: Synthesize RTL for FPGA 2. Compare building block circuits that are often used in processors – SRAM, CAM, Multiplier, Adder, … 3. Infer how existing microarchitectures should be modified for FPGA

5 Methodology ● FPGA circuits synthesized through Quartus II 10.0 – Largest, fast speed grade, 65 nm Stratix III (3SL340) – Area calculated from FPGA tile areas – A few results are from literature ● Custom CMOS design examples found in literature – High-performance circuit design and layout are difficult and time consuming – Normalize to 65 nm: Ideal area scaling and ring oscillator delay scaling
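
The normalization step on this slide can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes "ideal area scaling" means area shrinks with the square of the feature size, and that delay is scaled by the ratio of ring-oscillator stage delays at the two nodes. The function names and the ring-oscillator inputs are hypothetical; the slides give no ring-oscillator figures, so those must be supplied by the caller.

```python
def scale_area_to_65nm(area: float, node_nm: float) -> float:
    """Ideal area scaling (assumed): area is proportional to feature size squared."""
    if node_nm <= 0:
        raise ValueError("node_nm must be positive")
    return area * (65.0 / node_nm) ** 2


def scale_delay_to_65nm(delay: float, ring_osc_src: float, ring_osc_65nm: float) -> float:
    """Scale a delay by the ratio of ring-oscillator stage delays.

    ring_osc_src and ring_osc_65nm are caller-supplied measurements for the
    source process and the 65 nm process (hypothetical inputs, same units).
    """
    if ring_osc_src <= 0 or ring_osc_65nm <= 0:
        raise ValueError("ring-oscillator delays must be positive")
    return delay * (ring_osc_65nm / ring_osc_src)


# A block of 1.0 area units drawn at 130 nm occupies a quarter of that at 65 nm.
area_at_65 = scale_area_to_65nm(1.0, 130.0)
```

Keeping area and delay scaling as separate functions reflects that the slide names two different rules: a geometric one for area and a measured one for delay.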

6 Metrics ● Area – Still a key design constraint on FPGAs ● Delay ● Power or energy: Not considered here – Data not often published and testing conditions not standard. – FPGA users mostly spared responsibility for not melting the chip.

7 1. Processor Core Comparison ● Complete circuit serves as a reference point for sub-circuit measurements later

8 Processor Core Comparison ● SPARC T1 and T2, Intel Atom and Nehalem – Compare CMOS to FPGA implementations – Compare just one core, excluding large caches ● FPGA implementation used RTL optimized from the custom CMOS implementation – Atom and Nehalem results cited from literature

9 Processor Cores: Area ● Area ratio: FPGA/Custom area – 17-27x (Geomean 23x)

10 Processor Cores: Speed ● Speed ratio: Custom/FPGA fmax – 18-26x (Geomean 22x)
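
The area and speed ratios above are summarized with a geometric mean, which is the usual way to average ratios. A minimal sketch of that calculation is shown here; the per-core ratios are not given in the transcript, so the input in the example is arbitrary and not processor data.

```python
import math


def geomean(ratios):
    """Geometric mean of positive ratios, computed in the log domain."""
    ratios = list(ratios)
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Arbitrary illustration: the geometric mean of 2x and 8x is 4x,
# whereas the arithmetic mean would be 5x.
example = geomean([2.0, 8.0])
```

The geometric mean is preferred here because it treats a 2x and a 0.5x ratio symmetrically, which an arithmetic mean does not.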

11 2. Building Block Comparisons ● Compare area and delay ● Will go through one example on SRAMs

12 Single-Port SRAM ● Custom: A few design examples from literature and data from the CACTI area and delay models ● FPGA: Four ways to build memory on Stratix III – M144K (2k x 72-bit) – M9K (256 x 36-bit) – MLAB (32 x 20-bit) – Registers and muxes ● Used (n x 32-bit) memories in this section
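
To make the block-RAM options concrete, the sketch below estimates how many blocks of each type an (n x 32-bit) memory would need. It assumes each block is used only in the shape listed on the slide, with "2k" taken as 2048 words; real devices support other aspect ratios, so this may overestimate. The dictionary and function names are hypothetical.

```python
# Block shapes as listed on the slide: (depth in words, width in bits).
BLOCK_SHAPES = {
    "M144K": (2048, 72),  # assumes "2k" means 2048 words
    "M9K": (256, 36),
    "MLAB": (32, 20),
}


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return -(-a // b)


def blocks_needed(depth: int, width: int, block: str) -> int:
    """Blocks required when tiling a depth x width memory from fixed-shape blocks."""
    block_depth, block_width = BLOCK_SHAPES[block]
    return ceil_div(depth, block_depth) * ceil_div(width, block_width)


# A 1024 x 32-bit memory under this fixed-shape assumption.
for name in BLOCK_SHAPES:
    print(name, blocks_needed(1024, 32, name))
```

Under these assumptions a 1024 x 32-bit memory fits in one M144K, needs four M9K blocks stacked in depth, and needs 64 MLABs because both the depth and the width must be tiled.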

13 Single-Port SRAM Density ● Single-port density ratio: 2-5x (compare to 23x) – Partly due to FPGA's dual-ported memory blocks ● Hard SRAM blocks save area

14 Single-Port SRAM Fmax ● SRAMs: 7-10x ratio for < 256 kbit (compare to 22x) ● Big arrays: stitching small blocks adds more delay

15 Multiported SRAM Density (2kb) ● Density ratio: 7x for 2r1w, more write ports worse ● Fmax ratio is 9x-15x for 2r1w through 20r10w ● Figure annotations: 7x: Replicate RAM twice for 2r1w; 23x: Replicate RAM 8x for 4r2w; 143x: Registers and muxes for 4r2w
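
The "replicate RAM twice for 2r1w" approach can be modeled behaviorally as below. This is a sketch under the assumption that each copy is a simple one-read, one-write RAM and that every write is broadcast to all copies; the class name is hypothetical. It covers only the single-write-port case, since the slide does not say how the 8x replication for 4r2w is organized.

```python
class ReplicatedRam:
    """Behavioral model of an nR1W memory built from n copies of a 1R1W RAM."""

    def __init__(self, depth: int, read_ports: int = 2):
        if depth < 1 or read_ports < 1:
            raise ValueError("depth and read_ports must be at least 1")
        # One full copy of the data per read port.
        self.banks = [[0] * depth for _ in range(read_ports)]

    def write(self, addr: int, value: int) -> None:
        # The single write port updates every copy so they stay identical.
        for bank in self.banks:
            bank[addr] = value

    def read(self, port: int, addr: int) -> int:
        # Each read port has a private copy, so reads never conflict.
        return self.banks[port][addr]


rf = ReplicatedRam(depth=16, read_ports=2)
rf.write(3, 42)
values = (rf.read(0, 3), rf.read(1, 3))
```

The model makes the area cost visible: storage grows linearly with the number of read ports, which is consistent with multiported memories being expensive on FPGAs.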

16 Summary: Building Blocks ● Lower ratios are better for FPGA

17 Building Blocks ● Area dominates the differences between block types ● Multiplexers are slow ● SRAM bits are cheap – Multiported memories are expensive ● CAMs and muxes are expensive ● Hard adders/multipliers save area, but aren't fast ● Pipeline latches slightly faster ● These costs affect microarchitecture choices...

18 3. Processor Microarchitecture ● Figure labels: CAM, Multiported RAM, Multiplexers

19 SRAM Ports: Clustered RF ● Choose architecture to minimize register file ports – Clustered register files: One write port per cluster

20 Scheduler CAM: Intel P6 ● P6 to Nehalem ● Values stored in three places ● RS is a CAM that stores values

21 Scheduler CAM: AMD K7 ● AMD K7/K8/K10 ● Values stored in three places ● RS is a CAM that stores values

22 Physical Register File ● MIPS R10000, Intel P4, Sandy Bridge, AMD Bobcat ● Values stored in one place ● Scheduler CAM stores no operands ● PRF: Fewer multiported RAMs and smaller CAM
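
The idea that values live in one place while the scheduler holds only register tags can be sketched with a toy rename table. This is a generic illustration of physical-register-file renaming, not the design of any processor named on the slide; the class and method names are hypothetical.

```python
from collections import deque


class RenameTable:
    """Toy rename map: architectural registers point into one physical register file."""

    def __init__(self, num_arch: int, num_phys: int):
        if num_phys <= num_arch:
            raise ValueError("need more physical than architectural registers")
        self.map = list(range(num_arch))                # arch reg -> phys reg
        self.free = deque(range(num_arch, num_phys))    # unallocated phys regs
        self.prf = [0] * num_phys                       # the only copy of each value

    def lookup_src(self, arch_reg: int) -> int:
        """Return the physical tag a source operand should wait on and read."""
        return self.map[arch_reg]

    def rename_dest(self, arch_reg: int):
        """Allocate a new physical register; return (new_tag, previous_tag)."""
        new_tag = self.free.popleft()
        old_tag = self.map[arch_reg]
        self.map[arch_reg] = new_tag
        return new_tag, old_tag

    def release(self, phys_reg: int) -> None:
        """Return a physical register to the free list (for example, at commit)."""
        self.free.append(phys_reg)


rt = RenameTable(num_arch=4, num_phys=8)
new_tag, old_tag = rt.rename_dest(1)   # an instruction writes architectural r1
rt.prf[new_tag] = 99                   # execution writes the value once, into the PRF
```

Because a scheduler entry would carry only the small tags returned by `lookup_src` and `rename_dest`, it needs to match tags but never store operand values.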

23 Reducing Bypass Muxes ● Two sets of bypass muxes per operation ● Multiple issue makes bypass muxes even bigger
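
A rough way to see why multiple issue enlarges the bypass network is to count mux inputs. The model below is an assumption, not taken from the slides: each operation has two operand muxes, and each mux selects between the register file and one forwarded result per functional unit per bypass stage. Function names are hypothetical.

```python
def inputs_per_bypass_mux(issue_width: int, bypass_stages: int) -> int:
    """Inputs to one operand mux: the register file plus every forwarded result."""
    if issue_width < 1 or bypass_stages < 0:
        raise ValueError("issue_width >= 1 and bypass_stages >= 0 required")
    return 1 + issue_width * bypass_stages


def total_bypass_mux_inputs(issue_width: int, bypass_stages: int) -> int:
    """Two operand muxes per operation, one operation per issue slot."""
    return 2 * issue_width * inputs_per_bypass_mux(issue_width, bypass_stages)


# Under this model the total grows roughly with the square of the issue width.
for width in (1, 2, 4):
    print(width, total_bypass_mux_inputs(width, bypass_stages=2))
```

With two bypass stages, this model gives 6, 20, and 72 total mux inputs for issue widths of 1, 2, and 4, which illustrates the growth the slide describes.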

24 Fusing Operations ● Chaining dependent operations: 3 muxes/2 ops – Fused multiply-add works especially well because incremental cost of second operation is small ● Point-to-point saves one bypass mux
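
The "3 muxes/2 ops" figure follows from simple counting, sketched below. The assumption is that every unfused two-input operation needs two operand bypass muxes, while in a fused chain each operation after the first receives one operand over a point-to-point connection and so needs only one mux. The function name is hypothetical.

```python
def operand_muxes(chain_length: int, fused: bool) -> int:
    """Operand bypass muxes needed for a chain of dependent two-input operations."""
    if chain_length < 1:
        raise ValueError("chain_length must be at least 1")
    if not fused:
        return 2 * chain_length      # two muxes for every operation
    return chain_length + 1          # two for the first op, one for each later op


unfused = operand_muxes(2, fused=False)  # 4 muxes for 2 ops
fused = operand_muxes(2, fused=True)     # 3 muxes for 2 ops, as on the slide
```

For a two-operation chain the saving is exactly one mux, which matches the point-to-point note on this slide.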

25 Summary ● Need to measure cost of building block circuits to guide microarchitecture design choices – Relative area costs span 2 orders of magnitude ● Microarchitecture choices should reflect costs – Examples: Reduce RAM port count, CAM size, and multiplexers; Take advantage of cheaper ALUs – Use a clustered physical register file (no reservation stations); Explore fusing dependent operations together

26 Future Work ● Use these results to guide the design of a larger and higher-performance soft processor – Use existing microarchitecture literature as guidance, and adapt for FPGA substrate

27 Thank You!