Titan: Large and Complex Benchmarks in Academic CAD

Slides:



Advertisements
Similar presentations
FPGA (Field Programmable Gate Array)
Advertisements

ECE 506 Reconfigurable Computing ece. arizona
Spartan-3 FPGA HDL Coding Techniques
Simulation of Fracturable LUTs
Architecture-Specific Packing for Virtex-5 FPGAs
Commercial FPGAs: Altera Stratix Family Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.
Reconfigurable Computing (EN2911X, Fall07) Lecture 04: Programmable Logic Technology (2/3) Prof. Sherief Reda Division of Engineering, Brown University.
ECE 506 Reconfigurable Computing Lecture 6 Clustering Ali Akoglu.
A Survey of Logic Block Architectures For Digital Signal Processing Applications.
EELE 367 – Logic Design Module 2 – Modern Digital Design Flow Agenda 1.History of Digital Design Approach 2.HDLs 3.Design Abstraction 4.Modern Design Steps.
Floating-Point FPGA (FPFPGA) Architecture and Modeling (A paper review) Jason Luu ECE University of Toronto Oct 27, 2009.
Architecture Design Methodology. 2 The effects of architecture design on metrics:  Area (cost)  Performance  Power Target market:  A set of application.
A System-Level Stochastic Benchmark Circuit Generator for FPGA Architecture Research Cindy Mark Prof. Steve Wilton University of British Columbia Supported.
Lecture 3: Field Programmable Gate Arrays II September 10, 2013 ECE 636 Reconfigurable Computing Lecture 3 Field Programmable Gate Arrays II.
Programmable logic and FPGA
Lecture 3 1 ECE 412: Microcomputer Laboratory Lecture 3: Introduction to FPGAs.
Introduction to FPGA and DSPs Joe College, Chris Doyle, Ann Marie Rynning.
Octavo: An FPGA-Centric Processor Architecture Charles Eric LaForest J. Gregory Steffan ECE, University of Toronto FPGA 2012, February 24.
Yehdhih Ould Mohammed Moctar1 Nithin George2 Hadi Parandeh-Afshar2
EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)
Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.
ISE. Tatjana Petrovic 249/982/22 ISE software tools ISE is Xilinx software design tools that concentrate on delivering you the most productivity available.
Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient.
An automatic tool flow for the combined implementation of multi-mode circuits Brahim Al Farisi, Karel Bruneel, João Cardoso, Dirk Stroobandt.
Power Reduction for FPGA using Multiple Vdd/Vth
Coarse and Fine Grain Programmable Overlay Architectures for FPGAs
CAD for Physical Design of VLSI Circuits
Un/DoPack: Re-Clustering of Large System-on-Chip Designs with Interconnect Variation for Low-Cost FPGAs Marvin Tom* Xilinx Inc.
Channel Width Reduction Techniques for System-on-Chip Circuits in Field-Programmable Gate Arrays Marvin Tom University of British Columbia Department of.
Julien Lamoureux and Steven J.E Wilton ICCAD
J. Christiansen, CERN - EP/MIC
FPGA (Field Programmable Gate Array): CLBs, Slices, and LUTs Each configurable logic block (CLB) in Spartan-6 FPGAs consists of two slices, arranged side-by-side.
Tools - Implementation Options - Chapter15 slide 1 FPGA Tools Course Implementation Options.
Programmable Logic Devices
Power-Aware RAM Processing for FPGAs December 9, 2005 Power-aware RAM Processing for FPGA Embedded Memory Blocks Russell Tessier University of Massachusetts.
L11: Lower Power High Level Synthesis(2) 성균관대학교 조 준 동 교수
Design Space Exploration for Application Specific FPGAs in System-on-a-Chip Designs Mark Hammerquist, Roman Lysecky Department of Electrical and Computer.
EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)
Timing-Driven Routing for FPGAs Based on Lagrangian Relaxation
1 Synthesizing Datapath Circuits for FPGAs With Emphasis on Area Minimization Andy Ye, David Lewis, Jonathan Rose Department of Electrical and Computer.
© 2010 Altera Corporation - Public Lutiac – Small Soft Processors for Small Programs David Galloway and David Lewis November 18, 2010.
ESS | FPGA for Dummies | | Maurizio Donna FPGA for Dummies Basic FPGA architecture.
Survey of multicore architectures Marko Bertogna Scuola Superiore S.Anna, ReTiS Lab, Pisa, Italy.
An Improved “Soft” eFPGA Design and Implementation Strategy
FPGA CAD 10-MAR-2003.
In-Place Decomposition for Robustness in FPGA Ju-Yueh Lee, Zhe Feng, and Lei He Electrical Engineering Dept., UCLA Presented by Ju-Yueh Lee Address comments.
2/19/2016http://csg.csail.mit.edu/6.375L11-01 FPGAs K. Elliott Fleming Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.
1 Field-programmable Gate Array Architectures and Algorithms Optimized for Implementing Datapath Circuits Andy Gean Ye University of Toronto.
FPGA Logic Cluster Design Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.
1 WireMap FPGA Technology Mapping for Improved Routability Stephen Jang, Xilinx Inc. Billy Chan, Xilinx Inc. Kevin Chung, Xilinx Inc. Alan Mishchenko,
ECE 506 Reconfigurable Computing Lecture 5 Logic Block Architecture Ali Akoglu.
Resource Sharing in LegUp. Resource Sharing in High Level Synthesis Resource Sharing is a well-known technique in HLS to reduce circuit area by sharing.
1 Architecture of Datapath- oriented Coarse-grain Logic and Routing for FPGAs Andy Ye, Jonathan Rose, David Lewis Department of Electrical and Computer.
Oleg Petelin and Vaughn Betz FPL 2016
Runtime-Quality Tradeoff in Partitioning Based Multithreaded Packing
Placement study at ESA Filomena Decuzzi David Merodio Codinachs
Floating-Point FPGA (FPFPGA)
Altera Stratix II FPGA Architecture
Presentation on FPGA Technology of
HeAP: Heterogeneous Analytical Placement for FPGAs
FPGAs in AWS and First Use Cases, Kees Vissers
Exploiting Fast Carry Chains of FPGAs for Designing Compressor Trees
Andy Ye, Jonathan Rose, David Lewis
Embedded systems, Lab 1: notes
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
FPGA Glitch Power Analysis and Reduction
Off-path Leakage Power Aware Routing for SRAM-based FPGAs
A New Hybrid FPGA with Nanoscale Clusters and CMOS Routing Reza M. P
Measuring the Gap between FPGAs and ASICs
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

Titan: Large and Complex Benchmarks in Academic CAD Kevin E. Murray, Scott Whitty, Suya Liu, Jason Luu, Vaughn Betz

Outline Motivation Hybrid CAD Flow & Benchmarks VPR and Quartus II Comparison Conclusion and Future Work

Motivation

Evaluating FPGA Architectures and CAD Must quantitatively compare: FPGA Architectures FPGA CAD Algorithms Benchmarks often neglected Good benchmarks: Exploit device characteristics (i.e. hard blocks) Comparable to modern device sizes FPGA Architecture Benchmarks Modify CAD Flow Results Architecture comparison diagram (N4 vs N6?)

State of FPGA Benchmarks 14nm? 28nm MCNC20 (1991) < 1% of Stratix V No Hard Blocks VTR (2012) < 5% of Stratix V Few Hard Blocks Even smaller on future devices 20nm? MCNC (1991): * Comb Circuits * Some Sequential Circuits * No hard blocks VTR (2012): * Sequential Circuits * Few hard blocks

Why Don’t We Have Better Benchmarks? Academic tools cannot handle real designs Limited HDL support No IP Cores (Vendor, 3rd party) Vendor tools are too restrictive Limited to Vendor’s Architectures Cannot modify CAD algorithms

Options? Exploit vendor tool strengths? Upgrade academic tools Add support for wide range of HDLs Create an IP library A huge investment! Exploit vendor tool strengths? Hybrid CAD flow Vendor Academic High Level Diagram

Hybrid CAD Flow & Benchmarks

Building a Hybrid CAD Flow Quartus II Analysis & Elaboration Technology Mapping Packing Placement Routing Quartus II Post Technology Map Netlist: LUTs, Flip-Flops, Multipliers etc. VPR

Titan Flow Capabilities & Limitations Experiment Modification VTR Titan Titan Flow Method Device Floorplan Yes Architecture file Inter-cluster Routing Clustered Block Size / Configuration Intra-cluster Routing LUT size / Combinational Logic Element ABC re-synthesis New RAM Block Architecture file (up to 16K depth*) New DSP Block Architecture file (up to 36 bit width*) New Primitive Type No No method to pass black box through Quartus II Table from Paper * Maximum for Stratix IV

Titan 23 Benchmarks 44× 215× VTR MCNC 23 Benchmarks Wide range of application domains All make use of hard blocks (DSPs, RAMs) 90K to 1.9M netlist primitives 44× VTR 215× MCNC

Benchmark Details Logic Heavy RAM Heavy DSP Heavy

VPR and Quartus II Comparison

VPR and Quartus II Flows HDL Analysis & Elaboration Technology Mapping Quartus Map User Defined FPGA Arch. Packing Placement Routing Quartus Fit Packing Placement Routing VPR

Titan Compatible Architecture Primitive Description lcell_comb LUT and Full Adder dffeas Register mlab_cell LUT RAM mac_mult Multiplier mac_out Accumulator ram_block RAM Slice io_{i,o}buf I/O Buffer ddio_{in,out} DDR I/O pll Phase Locked Loop Architecture must use same primitives as logic synthesis Can be grouped into arbitrary blocks

Stratix IV Architecture Capture Floorplan: Based on EP4SE820 Fully Modeled Blocks: LAB DSP M9K M144K Routing Network: Mixture of long and short wires

ALM Internal Connectivity Architecture Details LAB Detailed internal connectivity Full instead of partial crossbars Extra carry chain connectivity M9K & M144K RAM Blocks All modes and sizes Approximated mixed-width modes DSP Blocks All Stratix IV multiplier/accumulator modes Extra routing flexibility for packing ALM Internal Connectivity

Benchmark Completion Quartus II 21/23 VPR 14/23 Tool Benchmarks Completed Quartus II 21/23 VPR 14/23

Tool Performance vs. Benchmark Size 36.5 Hours

Tool Memory vs. Benchmark Size

VPR Memory Breakdown

Normalized Performance 13.3× slower 50% faster 3.4×slower 2.7×slower 5.1×higher memory

Performance Breakdown 79% 55% 21%

Normalized Quality of Results 30% fewer 2.3× more 1.2× more 2.6× more

Impact of Clustering 1.8× WL reduction 23% area

Stratix IV & Academic LUT/FF Flexibility Additional flexibility in Stratix IV architecture allows for denser packing Can be detrimental to Wirelength Traditional Academic BLE Tight Packing, Higher Wirelength Stratix IV like Half-ALM Loose Packing, Lower Wirelength

Conclusion and Future Work

Conclusion Titan Flow Hybrid CAD Flow Enables academic tools to use large benchmarks Titan23 Benchmark Suite Significantly improves open-source FPGA benchmarks Comparison of VPR and Quartus II Stratix IV architecture capture VPR: 2.7x slower, 5.1x more memory, 2.6x more wire Identified packing density as an important factor in wirelength

VPR: Areas for Improvement Performance: Packer run-time Peak memory usage Quality/Modeling: Adjustable packing density More flexible routing network description

Future Work Timing Driven Comparison Goal: Correlate VPR timing model with micro-benchmarks Evaluate timing optimization results on large benchmarks Initial Results: Carry chains supported Wire and logic delays roughly correlated to Stratix IV Benchmark Size (ALUT/REG) VPR Critical Path (ps) QII Critical Path (ps) VPR/QII 8:1 Mux 2/12 932 1498 0.62 subtractor 11/15 1450 1381 1.05 32-bit Adder 32/96 1674 1718 0.97 diffeq1 1434/193 9935 11289 0.88 sha 772/893 6103 5416 1.13 ucsb_FIR 5410/12340 3084 2289 1.35 wb_conmax 11809/3326 5465 4066 1.34

Titan Flow & Titan 23 Benchmarks: Thanks! Questions? Email: kmurray@eecg.utoronto.ca Titan Flow & Titan 23 Benchmarks: http://uoft.me/titan

Detailed CAD Flow

Titan 23 Benchmarks

VPR Performance Relative to Quartus

VPR QoR Relative to Quartus