1
Floating-Point Reuse in an FPGA Implementation of a Ray-Triangle Intersection Algorithm
Craig Ulmer, cdulmer@sandia.gov
June 27, 2006
Adrian Javelo, UCLA
Craig Ulmer, Sandia National Laboratories/CA
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
2
Ray-Triangle Intersection Algorithm
Möller and Trumbore algorithm (1997)
–TUV intersection point
Modified to remove division
–24 Adds
–26 Multiplies
–4 Compares
–15 Delays
–17 Inputs
–2 Outputs (+4 bits)
Goal: Build for a V2P20
–Catch: Do in 32b Floating-Point
–Assume: 5 add units, 6 multiply units, 4 compare units
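The slide does not show how the division was removed. One common way, sketched here as an assumption rather than the authors' exact formulation, is to keep T, U and V scaled by the determinant and to run the hit tests against the determinant itself. The function and helper names are illustrative.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect_scaled(origin, direction, v0, v1, v2):
    """Moller-Trumbore with the division deferred.

    Returns (hit, t_s, u_s, v_s, det). The true values are t_s/det,
    u_s/det and v_s/det; the hit test itself needs no division.
    """
    e1 = sub(v1, v0)
    e2 = sub(v2, v0)
    pvec = cross(direction, e2)
    det = dot(e1, pvec)
    tvec = sub(origin, v0)
    u_s = dot(tvec, pvec)
    qvec = cross(tvec, e1)
    v_s = dot(direction, qvec)
    t_s = dot(e2, qvec)
    if det < 0:
        # Flip all signs so every comparison is against a positive det.
        det, u_s, v_s, t_s = -det, -u_s, -v_s, -t_s
    hit = det > 0 and u_s >= 0 and v_s >= 0 and u_s + v_s <= det and t_s >= 0
    return hit, t_s, u_s, v_s, det

tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(intersect_scaled((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri))
```

A consumer that needs the actual distance can divide `t_s` by `det` once, and only for rays that hit.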
3
Outline
Overview
–Reusing floating-point hardware
Adapting the Algorithm
–Operation Scheduling
–Mapping Operations to Units
–Intermediate Data Values
Performance Observations
–Ongoing Work: Automation
Summary
4
Floating-Point and FPGAs
Floating-point has been a weakness for FPGAs
Recent high-quality FP libraries
–SNL: Keith Underwood & K. Scott Hemmert
–USC, ENS Lyon, Nallatech, SRC, Xilinx
FP units still challenging to work with
–Deeply pipelined
–Require sizable resources

Single-Precision Function    Stages    Max in V2P20
Add                          10        14
Multiply                     11        18
Multiply (no denormals)      6         22
Divide                       31        4
5
Implementing a Computational Kernel
Desirable approach: full pipeline
–One FP unit per operation
–Issue new iteration every cycle
Problems
–Rapidly run out of chip space
–Input bandwidth
–Low utilization on “one-time” ops
Need to consider techniques for reusing units
6
Our Approach: Recycling Architecture
Build wrapper around an array of FP units
–Apply traditional compiler techniques
–Customize hardware data path
[Figure labels: Control, Intermediate Buffering, Input Selection, Inputs, Outputs]
8
Operation Scheduling
Sequence execution on FP array
Extract Data Flow Graph (DFG)
–Wide and shallow
–Need more parallelism
Loop unrolling / Strip Mining
–Pad FP units out to latency P
–Work on P iterations at a time
–Sequentially issue strip of P iterations
–Thus: ignore FP latency in scheduling
[Figure label: P Iterations]
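The scheduling idea above can be sketched as a resource-constrained list scheduler. This is a minimal illustration, not the authors' tool: the DFG, the unit counts and the names `schedule` and `issue_cycle` are hypothetical. Because every unit is padded to latency P and each step issues a strip of P iterations, a result produced in one step is assumed to be ready by the next step, so the scheduler never looks at latency.

```python
def schedule(ops, units):
    """Assign each operation to a step.

    ops:   {name: (kind, [names of ops it depends on])}
    units: {kind: number of FP units of that kind}
    Returns {name: step}.
    """
    step_of = {}
    step = 0
    while len(step_of) < len(ops):
        free = dict(units)
        issued = []
        for name, (kind, deps) in ops.items():
            if name in step_of:
                continue
            # Ready only if every dependency finished in an earlier step.
            ready = all(d in step_of for d in deps)
            if ready and free.get(kind, 0) > 0:
                free[kind] -= 1
                issued.append(name)
        if not issued:
            raise ValueError("cycle, unknown dependency, or missing unit kind")
        for name in issued:
            step_of[name] = step
        step += 1
    return step_of

def issue_cycle(step, iteration, p):
    """Clock on which iteration `iteration` of the strip issues in `step`."""
    return step * p + iteration

# Hypothetical six-operation DFG on 2 adders and 1 multiplier.
dfg = {
    "s0": ("add", []), "s1": ("add", []), "s2": ("add", []),
    "m0": ("mul", ["s0", "s1"]), "m1": ("mul", ["s1", "s2"]),
    "a0": ("add", ["m0", "m1"]),
}
print(schedule(dfg, {"add": 2, "mul": 1}))
```

Operations are considered in dictionary order, which amounts to a first-come-first-serve priority; a real scheduler would likely rank ready operations by critical path.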
9
Step-by-Step Scheduling
[Figure: schedule grid with columns #, Adds, Multiplies, <; rows numbered 0-8, then 0-2]

Single Strip utilization    Adds    Multiplies
One Strip:                  40%     36%
Back-to-Back:               53%     48%
10
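Step-by-Step Scheduling
[Figure: two schedule grids with columns #, Adds, Multiplies, <; Single Strip rows numbered 0-8, then 0-2; Double Strip rows numbered 0-11, then 0-2]

Single Strip utilization    Adds    Multiplies
One Strip:                  40%     36%
Back-to-Back:               53%     48%

Double Strip utilization    Adds    Multiplies
Double Strip:               64%     57%
Back-to-Back:               80%     72%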
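The utilization figures can be checked with a short calculation: operations issued divided by the issue slots available (units times steps). The operation and unit counts come from the algorithm slide. The step counts (12 and 9 for the single strip, 15 and 12 for the double strip) are an assumption inferred from the row numbering of the schedule grids, not stated on the slides.

```python
ADDS, MULTIPLIES = 24, 26        # operations per iteration
ADDERS, MULTIPLIERS = 5, 6       # FP units assumed on the V2P20

def utilization(op_count, unit_count, steps):
    """Fraction of issue slots that carry useful work."""
    return op_count / (unit_count * steps)

# (iterations in flight, schedule steps) -- step counts are inferred.
cases = {
    "single strip, run alone": (1, 12),
    "single strip, back-to-back": (1, 9),
    "double strip, run alone": (2, 15),
    "double strip, back-to-back": (2, 12),
}

for label, (iters, steps) in cases.items():
    adds = utilization(iters * ADDS, ADDERS, steps)
    muls = utilization(iters * MULTIPLIES, MULTIPLIERS, steps)
    print(f"{label}: adds {adds:.1%}, multiplies {muls:.1%}")
```

Under these assumed step counts the results land within one percentage point of every figure on the slide, which suggests the slide values are plain slot-occupancy ratios.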
12
Mapping Operations to Units
Assign operations in schedule to a specific unit
–Assignments affect input selection unit’s hardware
Two strategies: First-Come-First-Serve and a Heuristic
[Figure labels: Input, Input Selection Unit, FP units (+ + x x), Intermediate Buffering, Output]
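Why the assignment matters can be shown by counting the distinct sources that reach each unit input port, since that count sets the multiplexer width. The sketch below only measures the cost of a given mapping. The slides do not describe the heuristic, so the two mappings and the name `mux_sizes` are hypothetical.

```python
from collections import defaultdict

def mux_sizes(mapping):
    """mapping: list of (unit, [source feeding each input port]).

    Returns {(unit, port): number of distinct sources}; a count of 1
    means the port can be wired directly, with no multiplexer.
    """
    sources = defaultdict(set)
    for unit, inputs in mapping:
        for port, src in enumerate(inputs):
            sources[(unit, port)].add(src)
    return {key: len(srcs) for key, srcs in sources.items()}

# The same four additions placed on two adders in two different ways.
first_come = [
    ("add0", ["in0", "in1"]), ("add1", ["in2", "in3"]),
    ("add0", ["in2", "in5"]), ("add1", ["in0", "in4"]),
]
source_aware = [
    ("add0", ["in0", "in1"]), ("add1", ["in2", "in3"]),
    ("add0", ["in0", "in4"]), ("add1", ["in2", "in5"]),
]

print(sum(mux_sizes(first_come).values()))
print(sum(mux_sizes(source_aware).values()))
```

In this toy case, keeping operations that share a source on the same unit removes two of the four multiplexers.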
13
Mapping Effects
[Figure: schedule grids (columns #, Adds, Multiplies, <) for the First-Come-First-Serve and Heuristic mappings, each with a chart of Multiplexers Required by size, MUX3 through MUX7]
15
Buffering Intermediate Values
Necessary for holding values between stages
–Input vs. Output Buffering
–Block RAM (BRAM) vs Registers
Focus on output buffering w/ registers
–“Delay Pipe” houses a strip of P values
[Figure labels: BRAM with Port 0 and Port 1; P Registers; Register; Delay Pipe]
16
Two Strategies
Independently-Writable Delay Blocks
–Minimize number of buffers
–40 Memories, 40 MUXs
Chaining Delay Blocks
–Minimize control logic
–81 Memories, 0 MUXs
Chaining: 6% Faster, 19% Smaller, and 400% faster to build!
[Figure labels: Input, FP units (+ + x x), Z^-1 delay blocks]
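A behavioral model can make the chaining strategy concrete. The sketch assumes a delay pipe acts as a P-deep shift register, so a value written now comes out P shifts later, and that chaining k pipes holds a value for k steps with no write-enable or multiplexer logic. The class and function names are illustrative and do not come from the authors' VHDL.

```python
from collections import deque

class DelayPipe:
    """Shift register holding one strip of P values (P >= 1)."""

    def __init__(self, p, fill=None):
        self.regs = deque([fill] * p, maxlen=p)

    def shift(self, value):
        """Push one value in and return the value written P shifts ago."""
        out = self.regs[0]
        self.regs.append(value)  # maxlen silently drops the oldest entry
        return out

def chain(pipes, value):
    """Feed a value through pipes wired output-to-input."""
    for pipe in pipes:
        value = pipe.shift(value)
    return value

P = 3
pipes = [DelayPipe(P), DelayPipe(P)]       # holds a value for two steps
print([chain(pipes, i) for i in range(8)])
```

The trade-off on the slide follows from this structure: chaining spends more memories because every extra step of delay costs another pipe, but each pipe needs no control beyond the clock.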
18
Performance
Implemented:
–Single-strip
–Double-strip
–Full-Pipeline (V2P50)

                                 Single-strip    Double-strip    Full Pipeline
V2P20 Area                       70%             79%             199%
Clock Rate                       155 MHz         148 MHz         142 MHz
GFLOPS                           0.9             1.2             7.1
Input Bandwidth (Bytes/clock)    7.6             11.3            68
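The GFLOPS and bandwidth rows can be reproduced from the clock rates and the per-iteration counts given earlier. Two assumptions go beyond the slides: only adds and multiplies count as floating-point operations (50 per iteration), and the recycling designs complete one iteration every 9 clocks (single-strip) and 6 clocks (double-strip), as inferred from the back-to-back schedules.

```python
FLOPS_PER_ITER = 24 + 26         # adds + multiplies; compares not counted
BYTES_PER_ITER = 17 * 4          # 17 inputs, 32 bits each

# name: (clock rate in Hz, clocks per iteration -- inferred, not stated)
designs = {
    "single-strip": (155e6, 9),
    "double-strip": (148e6, 6),
    "full pipeline": (142e6, 1),
}

def gflops(clock_hz, clocks_per_iter):
    return FLOPS_PER_ITER * clock_hz / clocks_per_iter / 1e9

def bytes_per_clock(clocks_per_iter):
    return BYTES_PER_ITER / clocks_per_iter

for name, (clock_hz, cpi) in designs.items():
    print(f"{name}: {gflops(clock_hz, cpi):.1f} GFLOPS, "
          f"{bytes_per_clock(cpi):.1f} bytes/clock")
```

Under these assumptions the calculation agrees with the table to one decimal place. It also shows why input bandwidth is listed as a problem for the full pipeline, which consumes a complete 68-byte input set every clock.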
19
Ongoing Work: Automation
Natural fit for automation
–Built our own tools
–DFG analysis tools
–VHDL generation
Experiment
–No strip mining
–Change # of FP units
–Change # Iterations
–Find # clocks for 128 iterations
20
Concluding Remarks
Reusing FP units enables FPGAs to process larger kernels
–Apply traditional scheduling tricks to increase utilization
–Algorithm shape affects performance
–Simplicity wins
Simple DFG tools go a long way
–Easy to adjust parameters, generate hardware
–Focus on kernels instead of complete systems