Floating-Point Reuse in an FPGA Implementation of a Ray-Triangle Intersection Algorithm
Craig Ulmer, June 27, 2006

Presentation transcript:

Floating-Point Reuse in an FPGA Implementation of a Ray-Triangle Intersection Algorithm
Adrian Javelo, UCLA
Craig Ulmer, Sandia National Laboratories/CA
June 27, 2006
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL.

Ray-Triangle Intersection Algorithm
Möller and Trumbore algorithm (1997)
– Returns the (t, u, v) intersection point
Modified to remove division
– 24 adds
– 26 multiplies
– 4 compares
– 15 delays
– 17 inputs
– 2 outputs (+4 bits)
Goal: build for a V2P20
– Catch: do it in 32-bit floating point
– Assume: 5 adds, 6 multiplies, 4 compares
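The division-free variant can be sketched in software. This is a hedged illustration of the standard Möller-Trumbore formulation with the final division by the determinant deferred (the helper names are ours, not from the slides): instead of t, u, v, it returns those values scaled by the determinant, plus the determinant itself, so the hardware only needs compare and sign logic rather than a divider.

```python
def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _dot(a, b): return sum(x * y for x, y in zip(a, b))
def _cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def ray_triangle(orig, d, v0, v1, v2):
    """Division-free Moller-Trumbore: returns (hit, t*det, u*det, v*det, det).
    The true barycentrics (u, v) and distance t are the scaled values / det,
    so the divide can be deferred or dropped entirely."""
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(d, e2)
    det = _dot(e1, p)          # near zero means the ray is parallel to the triangle
    tv = _sub(orig, v0)
    u = _dot(tv, p)
    q = _cross(tv, e1)
    v = _dot(d, q)
    t = _dot(e2, q)
    if det > 0:                # sign-aware compares replace the divide
        hit = 0 <= u and 0 <= v and u + v <= det and t >= 0
    else:
        hit = det < 0 and u <= 0 and v <= 0 and u + v >= det and t <= 0
    return hit, t, u, v, det
```

A ray fired straight down at the unit triangle, for example, hits at barycentric (0.25, 0.25) and distance 1, with det = 1 so the scaled outputs equal the true ones.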

Outline
Overview
– Reusing floating-point hardware
Adapting the Algorithm
– Operation scheduling
– Mapping operations to units
– Intermediate data values
Performance Observations
– Ongoing work: automation
Summary

Floating-Point and FPGAs
Floating point has historically been a weakness for FPGAs
Recent high-quality FP libraries
– SNL: Keith Underwood & K. Scott Hemmert
– USC, ENS Lyon, Nallatech, SRC, Xilinx
FP units are still challenging to work with
– Deeply pipelined
– Require sizable resources

Single-Precision Function    Stages    Max in V2P20
Add                          10        14
Multiply                     11        18
Multiply (no denormals)      6         22
Divide                       31        4

Implementing a Computational Kernel
Desirable approach: full pipeline
– One FP unit per operation
– Issue a new iteration every cycle
Problems
– Rapidly run out of chip space
– Input bandwidth
– Low utilization on “one-time” ops
Need to consider techniques for reusing units
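The chip-space problem is easy to quantify from the counts given on the other slides; a rough back-of-the-envelope check (our arithmetic, not from the deck):

```python
# A full pipeline needs one unit per operation in the kernel.
need_add, need_mul = 24, 26    # adds and multiplies in the modified kernel
cap_add, cap_mul = 14, 18      # max single-precision units in a V2P20
full_pipeline_fits = need_add <= cap_add and need_mul <= cap_mul
print(full_pipeline_fits)      # False: the adders alone exceed capacity ~1.7x

reuse_add, reuse_mul = 5, 6    # the unit counts the design actually assumes
ops_per_adder = need_add / reuse_add        # each adder must serve ~5 ops
ops_per_multiplier = need_mul / reuse_mul   # each multiplier ~4-5 ops
```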

Our Approach: Recycling Architecture
Build a wrapper around an array of FP units
– Apply traditional compiler techniques
– Customize the hardware data path
(Block diagram: Inputs, Input Selection, FP unit array, Intermediate Buffering, Control, Outputs)

Outline
Overview
– Reusing floating-point hardware
Adapting the Algorithm
– Operation scheduling
– Mapping operations to units
– Intermediate data values
Performance Observations
– Ongoing work: automation
Summary

Operation Scheduling
Sequence execution on the FP array
Extract the data flow graph (DFG)
– Wide and shallow
– Need more parallelism
Loop unrolling / strip mining
– Pad FP units out to latency P
– Work on P iterations at a time
– Sequentially issue a strip of P iterations
– Thus: ignore FP latency in scheduling
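The scheduling step can be illustrated with a small resource-constrained list scheduler (a generic sketch, not the authors' tool). Because a strip of P independent iterations flows through each unit back to back, a result is treated as available in the very next step, which is exactly the "ignore FP latency" simplification above:

```python
def schedule(ops, deps, capacity):
    """Greedy list scheduling onto a fixed pool of FP units.
    ops: {op_name: unit_type}; deps: {op_name: set of prerequisite ops};
    capacity: {unit_type: units available per step}.
    FP latency is ignored: anything scheduled in an earlier step is ready."""
    done, steps = set(), []
    remaining = set(ops)
    while remaining:
        used = {t: 0 for t in capacity}
        step = []
        for op in sorted(remaining):        # deterministic order
            t = ops[op]
            if deps.get(op, set()) <= done and used[t] < capacity[t]:
                step.append(op)
                used[t] += 1
        if not step:
            raise ValueError("unschedulable: cyclic dependency in DFG")
        remaining -= set(step)
        done |= set(step)                   # results usable next step
        steps.append(step)
    return steps
```

With one adder and two multipliers, an add that consumes two products waits one step while a spare multiply fills the slack, which is the utilization effect the next slides measure.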

Step-by-Step Scheduling: Single Strip

Utilization       Adds    Multiplies
Back-to-Back      40%     36%
One Strip         53%     48%

Step-by-Step Scheduling: Single vs. Double Strip

Utilization       Adds    Multiplies
Single strip
  Back-to-Back    40%     36%
  One Strip       53%     48%
Double strip
  Back-to-Back    64%     57%
  Double Strip    80%     72%

Outline
Overview
– Reusing floating-point hardware
Adapting the Algorithm
– Operation scheduling
– Mapping operations to units
– Intermediate data values
Performance Observations
– Ongoing work: automation
Summary

Mapping Operations to Units
Assign each operation in the schedule to a specific unit
– Assignments affect the input selection unit’s hardware
Two strategies: First-Come-First-Serve and a heuristic
(Diagram: input selection unit feeding adders and multipliers, with intermediate buffering routing results back)
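Why the binding choice matters can be shown with a toy model (our own illustration; the unit and source names are hypothetical). Each unit input port needs a multiplexer wide enough to select among every distinct source ever routed to it, so a binder that steers an operation toward a unit whose ports have already seen its sources keeps the muxes small:

```python
def bind(steps, op_inputs, units, greedy=True):
    """steps: list of scheduled steps (lists of op names);
    op_inputs: op -> (srcA, srcB); units: unit names (one type, for brevity).
    Returns (binding, total mux inputs summed over all unit ports)."""
    port_srcs = {u: [set(), set()] for u in units}   # sources seen per port
    binding = {}
    for step in steps:
        free = list(units)
        for op in step:
            srcs = op_inputs[op]
            # heuristic: pick the free unit needing the fewest new mux inputs
            def cost(u):
                return sum(s not in port_srcs[u][i] for i, s in enumerate(srcs))
            u = min(free, key=cost) if greedy else free[0]   # else: FCFS
            free.remove(u)
            for i, s in enumerate(srcs):
                port_srcs[u][i].add(s)
            binding[op] = u
    total = sum(len(p) for ports in port_srcs.values() for p in ports)
    return binding, total

# Two adders; the second step's ops arrive in the "wrong" order for FCFS.
steps = [['a', 'b'], ['d', 'c']]
op_inputs = {'a': ('x', 'y'), 'b': ('p', 'q'),
             'c': ('x', 'y'), 'd': ('p', 'q')}
_, fcfs = bind(steps, op_inputs, ['u0', 'u1'], greedy=False)
_, heur = bind(steps, op_inputs, ['u0', 'u1'], greedy=True)
print(fcfs, heur)   # 8 4 -- the heuristic halves the mux inputs
```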

Mapping Effects
(Chart: number of MUX3 through MUX7 multiplexers required for the add and multiply units under First-Come-First-Serve vs. Heuristic mapping)

Outline
Overview
– Reusing floating-point hardware
Adapting the Algorithm
– Operation scheduling
– Mapping operations to units
– Intermediate data values
Performance Observations
– Ongoing work: automation
Summary

Buffering Intermediate Values
Necessary for holding values between stages
– Input vs. output buffering
– Block RAM (BRAM) vs. registers
Focus on output buffering with registers
– A “delay pipe” houses a strip of P values
(Diagram: a dual-port BRAM alongside a delay pipe of P registers)
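Behaviorally, a register delay pipe is just a P-deep shift register. A quick software model (ours, for illustration) shows a value written on one clock reappearing P clocks later, which is what lets a strip of P in-flight iterations park their intermediates with no addressing logic at all:

```python
from collections import deque

class DelayPipe:
    """P-register delay pipe: shift one value in per clock; the value
    shifted in P clocks earlier falls out the other end."""
    def __init__(self, p):
        self._regs = deque([None] * p, maxlen=p)

    def clock(self, value):
        out = self._regs[0]          # oldest register's contents
        self._regs.append(value)     # bounded deque drops the oldest
        return out

pipe = DelayPipe(3)
print([pipe.clock(v) for v in [1, 2, 3, 4, 5]])   # [None, None, None, 1, 2]
```

Chaining two such pipes simply sums their delays, which is why the chained strategy on the next slide needs no write multiplexers.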

Two Strategies
Independently-writable delay blocks
– Minimize the number of buffers
– 40 memories, 40 MUXs
Chaining delay blocks
– Minimize control logic
– 81 memories, 0 MUXs
Chaining: 6% faster, 19% smaller, and 400% faster to build!

Outline
Overview
– Reusing floating-point hardware
Adapting the Algorithm
– Operation scheduling
– Mapping operations to units
– Intermediate data values
Performance Observations
– Ongoing work: automation
Summary

Performance
Implemented three designs:
– Single-strip
– Double-strip
– Full pipeline (V2P50)

                 Single-strip    Double-strip    Full Pipeline
V2P20 Area       70%             79%             199%
Clock Rate       155 MHz         148 MHz         142 MHz

(Charts: GFLOPS and input bandwidth in bytes/clock for the three designs)
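The GFLOPS values themselves did not survive the transcript, but peak throughput can be estimated from numbers that did (our arithmetic, under the assumptions that the full pipeline retires one iteration per clock and that a strip design at best keeps all 11 assumed units busy):

```python
flops_per_iter = 24 + 26                      # adds + multiplies in the kernel
full_pipeline = flops_per_iter * 142e6 / 1e9  # one iteration per 142 MHz clock
print(full_pipeline)                          # 7.1 GFLOPS peak

single_strip_peak = (5 + 6) * 155e6 / 1e9     # 11 units at 155 MHz, 100% busy
# achieved single-strip throughput scales this peak by the ~50% utilization
# reported on the scheduling slides
```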

Ongoing Work: Automation
Natural fit for automation
– Built our own tools
– DFG analysis tools
– VHDL generation
Experiment
– No strip mining
– Vary the number of FP units
– Vary the number of iterations
– Find the number of clocks for 128 iterations

Concluding Remarks
Reusing FP units enables FPGAs to process larger kernels
– Apply traditional scheduling tricks to increase utilization
– Algorithm shape affects performance
– Simplicity wins
Simple DFG tools go a long way
– Easy to adjust parameters and generate hardware
– Focus on kernels instead of complete systems