Comprehensive environment for benchmarking using FPGAs: ATHENa - Automated Tool for Hardware EvaluatioN

Presentation transcript:


Modern Benchmarking: Natural Progression of Tools
- Software: eBACS (D. Bernstein, T. Lange)
- ASICs: ?
- FPGAs: ?

ATHENa – Automated Tool for Hardware EvaluatioN
Set of scripts written in Perl aimed at an AUTOMATED generation of OPTIMIZED results for MULTIPLE hardware platforms.
Currently under development at George Mason University. Version

Why Athena?
"The Greek goddess Athena was frequently called upon to settle disputes between the gods or various mortals. Athena Goddess of Wisdom was known for her superb logic and intellect. Her decisions were usually well-considered, highly ethical, and seldom motivated by self-interest."
from "Athena, Greek Goddess of Wisdom and Craftsmanship"

Designers of ATHENa
- Venkata "Vinny", MS CpE student
- Ekawat "Ice", MS CpE student
- Marcin, PhD ECE student
- Rajesh, PhD ECE student
- Xin, PhD ECE student
- Michal, PhD exchange student from Slovakia

Basic Dataflow of ATHENa
[Diagram. Actors: Designer (HDL + FPGA tools), ATHENa Server, User. Labeled flows: interfaces + testbenches; download of scripts and configuration files; HDL + scripts + configuration files; FPGA synthesis and implementation; result summary + database entries; database entries; database query; ranking of designs.]

[Diagram. Inputs: synthesizable source files, configuration files, testbench, constraint files. Outputs: result summary (user-friendly), database entries (machine-friendly).]

[Diagram. Inputs: synthesizable source files, configuration files. Output: result summary (user-friendly).]

ATHENa Major Features (1)
- synthesis, implementation, and timing analysis in the batch mode
- support for devices and tools of multiple FPGA vendors
- generation of results for multiple families of FPGAs of a given vendor
- automated choice of a best-matching device within a given family

ATHENa Major Features (2)
- automated verification of the design through simulation in the batch mode
- exhaustive search for optimum options of tools
  OR
- heuristic adaptive optimization strategies aimed at maximizing selected performance measures (e.g., speed, area, speed/area ratio, power, cost, etc.)


Multi-Pass Place-and-Route Analysis
GMU SHA-512, Xilinx Virtex, runs for different placement starting points
[Plot: minimum clock for each run (the smaller the better); the best and worst runs differ by about 20%.]

Dependence of Results on Requested Clock Frequency

ATHENa Applications
- single_run:
  - one set of options
- placement_search:
  - one set of options
  - multiple starting points for placement
- exhaustive_search:
  - multiple sets of options
  - multiple starting points for placement
  - multiple requested clock frequencies
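
The difference between the three applications comes down to how many tool runs each one implies. The Python sketch below only enumerates those runs. It is not ATHENa's actual interface (ATHENa itself is a set of Perl scripts): the function name `plan_runs` and its parameters are hypothetical, and the option names, seeds, and frequencies in the usage lines are placeholders.

```python
from itertools import product


def plan_runs(mode, option_sets, placement_seeds, requested_freqs_mhz):
    """Return the (options, seed, frequency) combinations implied by a mode.

    Hypothetical helper for illustration; not part of ATHENa.
    """
    if mode == "single_run":
        # One set of options, one placement starting point, one frequency.
        return [(option_sets[0], placement_seeds[0], requested_freqs_mhz[0])]
    if mode == "placement_search":
        # One set of options, swept over several placement starting points.
        return [(option_sets[0], seed, requested_freqs_mhz[0])
                for seed in placement_seeds]
    if mode == "exhaustive_search":
        # Every combination of options, starting points, and frequencies.
        return list(product(option_sets, placement_seeds, requested_freqs_mhz))
    raise ValueError(f"unknown mode: {mode}")


# Placeholder inputs: two option sets, three starting points, two frequencies.
options = ["speed", "area"]
seeds = [1, 11, 21]
freqs = [100, 200]

single = plan_runs("single_run", options, seeds, freqs)
placement = plan_runs("placement_search", options, seeds, freqs)
exhaustive = plan_runs("exhaustive_search", options, seeds, freqs)
print(len(single), len(placement), len(exhaustive))
```

The run count grows multiplicatively in exhaustive_search, which is why the number of option sets, starting points, and requested frequencies is kept small in practice.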

SHA-1 Results
[Chart: throughput [Mbit/s] by architecture for Virtex 5, Virtex 4, and Spartan 3.]

ATHENa Results for SHA-1, SHA-256 & SHA

Ideas (1)
Select several representative FPGA platforms with significantly different properties, e.g., different
- vendor: Xilinx vs. Altera
- process: 90 nm vs. 65 nm
- LUT size: 4-input vs. 6-input
- optimization: low-cost vs. high-performance
Use ATHENa to characterize all SHA-3 candidates and SHA-2 using these platforms in terms of the target performance metrics (e.g., throughput/area ratio).

Ideas (2)
- Calculate the ratio of SHA-3 candidate performance to SHA-2 performance (for the same security level)
- Calculate the geometric mean of this ratio over multiple platforms
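
A minimal Python sketch of this normalization follows, assuming one performance number per platform (for example, a throughput/area ratio) for both the candidate and SHA-2. The function names and all numeric values are hypothetical placeholders, not measured results.

```python
import math


def geometric_mean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_score(candidate, baseline):
    """Geometric mean over platforms of candidate/baseline performance.

    Both arguments map platform name -> performance metric.
    """
    ratios = [candidate[platform] / baseline[platform] for platform in baseline]
    return geometric_mean(ratios)


# Placeholder numbers only, chosen to make the arithmetic easy to follow.
sha2 = {"platform_a": 1.0, "platform_b": 1.0}
candidate = {"platform_a": 2.0, "platform_b": 8.0}

score = normalized_score(candidate, sha2)
print(score)
```

The geometric mean is a natural choice for averaging ratios because the result does not depend on which design is used as the reference: swapping candidate and baseline simply inverts the score.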

Xilinx FPGA Devices

Technology    Low-cost     High-performance
120/150 nm                 Virtex 2, 2 Pro
90 nm         Spartan 3    Virtex 4
65 nm                      Virtex 5
45 nm         Spartan 6
40 nm                      Virtex 6

Xilinx FPGA Device Support by Tools

Version                Low-cost / High-performance
Xilinx ISE 10.1        All up to Virtex 5
Xilinx WebPACK 11.1    Smallest up to Virtex 5
Xilinx WebPACK 11.3    Smallest up to Virtex 5; smallest Spartan 6, Virtex 6

Altera FPGA Devices

Technology    Low-cost       Mid-range    High-performance
130 nm        Cyclone                     Stratix
90 nm         Cyclone II                  Stratix II
65 nm         Cyclone III    Arria I      Stratix III
40 nm         Cyclone IV     Arria II     Stratix IV

Altera FPGA Device Support by Tools

Version                     Low-cost                                Mid-range                             High-performance
Quartus 7.1                 Cyclone IV none, Cyclone III all        Arria GX all, Arria II GX none        Stratix II smallest, Stratix III none
Quartus 8.1                 Cyclone IV none, Cyclone III all        Arria GX all, Arria II GX none        Stratix I, II, III smallest
Quartus 9.0 sp2, Sep. 09    Cyclone IV none, Cyclone III all        Arria GX all, Arria II GX none        Stratix I, II, III smallest
Quartus 9.1, Nov. 09        Cyclone IV smallest, Cyclone III all    Arria GX all, Arria II GX smallest    Stratix I, II, III all; Stratix IV none

FPGA and ASIC Performance Measures

The common ground is vague
- Hardware performance: cycles per block, cycles per byte, latency (cycles), latency (ns), throughput for long messages, throughput for short messages, throughput at 100 kHz, clock frequency, clock period, critical path delay, Modexp/s, PointMul/s
- Hardware cost: slices, slices occupied, LUTs, 4-input LUTs, 6-input LUTs, FFs, gate equivalent (GE), size on ASIC, DSP blocks, BRAMs, number of cores, CLB, MUL, XOR, NOT, AND
- Hardware efficiency: hardware performance / hardware cost

Our Favorite Hardware Performance Metrics:
- Mbit/s for throughput
- ns for latency
These units allow for easy cross-comparison among implementations in software (microprocessors), FPGAs (various vendors), and ASICs (various libraries).

But how to define and measure throughput and latency for hash functions?

Time to hash N blocks of message:
\[ \mathrm{Htime}(N, T_{CLK}) = \text{Initialization Time}(T_{CLK}) + N \cdot \text{Block Processing Time}(T_{CLK}) + \text{Finalization Time}(T_{CLK}) \]

Latency = time to hash ONE block of message:
\[ \text{Latency} = \mathrm{Htime}(1, T_{CLK}) = \text{Initialization Time}(T_{CLK}) + \text{Block Processing Time}(T_{CLK}) + \text{Finalization Time}(T_{CLK}) \]

Throughput (for long messages):
\[ \text{Throughput} = \frac{\text{Block size}}{\mathrm{Htime}(N+1, T_{CLK}) - \mathrm{Htime}(N, T_{CLK})} = \frac{\text{Block size}}{\text{Block Processing Time}(T_{CLK})} \]

But how to define and measure throughput and latency for hash functions?

\[ \text{Initialization Time}(T_{CLK}) = \mathit{cycles}_I \cdot T_{CLK} \]
\[ \text{Block Processing Time}(T_{CLK}) = \mathit{cycles}_P \cdot T_{CLK} \]
\[ \text{Finalization Time}(T_{CLK}) = \mathit{cycles}_F \cdot T_{CLK} \]

Where the values come from:
- Block size: from specification
- cycles_I, cycles_P, cycles_F: from analysis of block diagram and/or functional simulation
- T_CLK: from place & route report (or experiment)
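
The definitions above can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the block size, cycle counts, and clock period in the usage lines are placeholder values chosen for easy arithmetic, not figures for any real design.

```python
def htime_ns(n_blocks, t_clk_ns, cycles_i, cycles_p, cycles_f):
    """Time in ns to hash n_blocks message blocks at clock period t_clk_ns."""
    return (cycles_i + n_blocks * cycles_p + cycles_f) * t_clk_ns


def latency_ns(t_clk_ns, cycles_i, cycles_p, cycles_f):
    """Latency = time to hash ONE block of message."""
    return htime_ns(1, t_clk_ns, cycles_i, cycles_p, cycles_f)


def throughput_mbit_s(block_size_bits, t_clk_ns, cycles_p):
    """Long-message throughput = block size / block processing time."""
    block_time_ns = cycles_p * t_clk_ns
    # bits per ns equals Gbit/s; multiply by 1000 to report Mbit/s.
    return block_size_bits / block_time_ns * 1000.0


# Placeholder design: 512-bit block, 10 ns clock period,
# 2 initialization cycles, 64 cycles per block, 1 finalization cycle.
lat = latency_ns(10.0, cycles_i=2, cycles_p=64, cycles_f=1)
thr = throughput_mbit_s(512, 10.0, cycles_p=64)
print(lat, thr)
```

Initialization and finalization cycles affect latency but cancel out of the long-message throughput, since they are paid once per message rather than once per block.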

How to compare hardware speed vs. software speed?

EBASH reports (
- In graphs: Time(n) = time in clock cycles vs. message size in bytes for n-byte messages, with n = 0, 1, 2, 3, … 2048, 4096
- In tables: performance in cycles/byte for n = 8, 64, 576, 1536, 4096, long msg

\[ \text{Performance for long message} = \frac{\mathrm{Time}(4096) - \mathrm{Time}(2048)}{2048} \]

How to compare hardware speed vs. software speed?

\[ \text{Throughput [Gbit/s]} = \frac{8\ \text{bits/byte} \cdot \text{clock frequency [GHz]}}{\text{Performance for long message [cycles/byte]}} \]
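
As a worked example of this conversion, the Python sketch below derives the long-message performance from two cycle counts and turns it into a throughput. The function names are hypothetical, and the cycle counts and clock frequency are placeholders, not values taken from any benchmark report.

```python
def long_message_cycles_per_byte(time_4096, time_2048):
    """Cycles/byte for long messages: (Time(4096) - Time(2048)) / 2048."""
    return (time_4096 - time_2048) / 2048


def software_throughput_gbit_s(cycles_per_byte, clock_ghz):
    """Throughput [Gbit/s] = 8 bits/byte * clock [GHz] / (cycles/byte)."""
    return 8.0 * clock_ghz / cycles_per_byte


# Placeholder cycle counts: the difference is 24576 cycles over 2048 bytes.
cpb = long_message_cycles_per_byte(time_4096=50000, time_2048=25424)
gbps = software_throughput_gbit_s(cpb, clock_ghz=3.0)
print(cpb, gbps)
```

Taking the difference between two long messages removes the fixed per-message overhead, so the resulting figure is directly comparable with the long-message hardware throughput expressed in the same units.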

How to measure hardware cost in FPGAs?

1. Stand-alone cryptographic core on FPGA
   Cost of the smallest FPGA that can fit the core.
   Unit: USD [FPGA vendors would need to publish the MSRP (manufacturer's suggested retail price) of their chips], which is not very likely, or size of the chip in mm^2, which is easy to obtain.
2. Part of an FPGA System-on-Chip
   Vector: (CLB slices, BRAMs, MULs, DSP units) for Xilinx; (LEs, memory bits, PLLs, MULs, DSP units) for Altera.
3. FPGA prototype of an ASIC implementation
   Force the implementation to use only reconfigurable logic (no DSPs or multipliers, distributed memory vs. BRAM): use CLB slices as a metric [LEs for Altera].

How to measure hardware cost in ASICs?

1. Stand-alone cryptographic core
   Cost = f(die area, pin count)
   Tables/formulas available from semiconductor foundries
2. Part of an ASIC System-on-Chip
   Cost ~ circuit area
   Units: μm^2 or GE (gate equivalent) = size of a NAND2 cell

Deliverables (1)
1. Detailed block diagram of the Datapath with names of all signals matching VHDL code [electronic version a bonus]
2. Interface with the division into the Datapath and the Controller [electronic version]
3. ASM charts of the Controller, and a block diagram of connections among FSMs (if more than one used) [electronic version a bonus]
4. RTL VHDL code of the Datapath, the Controller, and the Top-Level Circuit
5. Updated timing and area analysis, with formulas for timing confirmed through simulation

Deliverables (2)
6. Report on verification
   - highest level entity verified for functional correctness
     - Functional simulation
     - Post-synthesis simulation
     - Timing simulation [bonus]
   - verification of lower-level entities
     - Name of entity
     - Testbench used for verification
     - Result of verification, incorrect behavior, possible source of error

Deliverables (3)
7. Results of benchmarking using ATHENa
   - Entire core or the highest level entity verified for correct functionality
   - Xilinx Spartan 3, Virtex 4, Virtex 5
   - Three methods of testing
     - Single_run
     - Placement_search [cost_table = 1, 11, 21]
     - Exhaustive_search [cost_table = 31, 41, 51; speed or area; two sets of requested frequencies]
   - Results generated by ATHENa
   - Your own graphs and charts
   - Observations and conclusions

Bonus Deliverables (4)
8. Pseudo-code [but not C code]
9. Bugs and suspicious behavior of ATHENa
10. Additional results of benchmarking using ATHENa
    - Altera Cyclone II, Stratix II, Cyclone III, Arria I, Stratix III
    - Three methods of testing
      - Single_run
      - Placement_search [seed = 1, 1000, 2000]
      - Exhaustive_search [seed = 3000, 4000, 5000; speed or area; two sets of requested frequencies]
    - Results generated by ATHENa
    - Your own graphs and charts
    - Observations and conclusions

Bonus Deliverables (5)
11. Report from the meeting with students working on the same SHA core
    - Summary of major differences
    - Advantages and disadvantages of your design
12. Bugs found in the
    - Padding script
    - Testbench
    - Class examples
    - Slides
    - Documentation
    - SHA-3 Packages
    - Etc.

Bonus Deliverables (6)
13. Extending the design to cover all hash function variants
    - Hash value sizes: 512 [highest priority], 384, 224
    - Other variant/parameter support specific to a given hash function
    - Support through generics or constants
14. Padding in hardware
    Assuming that message size before padding is already a multiple of the
    - word size
    - byte size
    - a single bit

Composition of Students
- 14 local students (with 3 former BSCpE graduates)
- 14 international students
- 4 GWU PhD candidates

After Grading
1. Summary of results published on the course web page
2. Selected students invited to develop articles/reports to be posted on the
   - ATHENa web page
   - SHA-3 Zoo web page
3. Unification, generalization, and optimization of codes by Ice, myself, and other students
4. Presentation to NIST, conference submissions, presentation at the Second SHA-3 Conference in Santa Barbara in August