Application-Specific Memory Interleaving Enables High Performance in FPGA-based Grid Computations

Tom VanCourt, Martin Herbordt
{tvancour, herbordt}@bu.edu
BOSTON UNIVERSITY
www.bu.edu/caadlab

Variations and extensions
  Dimensions: 1, 2, 3, …
    Adapts easily to dimensionality.
    Adapts easily to cluster size & shape.
  Can use non-power-of-2 memory arrays
    LSBs become X mod N_X: efficient implementations for modest N_X
    MSBs become X div N_X: efficient implementations using block multipliers
  Allows wide range of design tradeoffs
    Logic & multipliers vs. RAMs
    Latency vs. hardware
  Take advantage of dual-ported RAMs, when available
  Allocate less hardware to small grids
  Optimize de-interleaving multiplexers
    4x4x4 RAM array requires 64:1 output multiplexers
    Implement efficiently as three layers of 4:1 multiplexers
  Write port design choices
    Can use dual-ported RAM for non-interfering, concurrent read & write
    Write single words or clusters: need not be same shape as read cluster

Automation
  Java program: initial version available
    See http://www.bu.edu/caadlab/publications
    Source code and documentation
  Sample input for hex grid example
    name=HexGrid         Name of VHDL component
    axis=horiz           Symbol names for axis indices
    axis=vert
    databits=16          Width of individual word
    output=B1,0,1        Define the access cluster
    output=A2,1,0
    output=B2,1,1
    output=C2,1,2
    output=A3,2,0
    output=B3,2,1
    output=C3,2,2
    testSize=150,75      Grid size for test bench
  VHDL output
    HexGrid.vhdl              Synthesizable entity definition
    HexGrid_def.vhdl          Declaration package
    HexGrid_test_driver.vhdl  Test bench - confirms implementation

FPGAs: Technological opportunity
  Traditional memory interleaving for broad parallelism: in use since 1960s
    Generic: Designed to avoid application specifics
    Fixed bus: All applications use same memory interface
    Expensive: High hardware costs, accessible only for major processor designs
  FPGAs for memory interleaving: ideal technological match
    Customizable: Can adapt to arbitrary application characteristics
      Not just permitted, customization is inherent and compulsory
    Configurable: Unique interleaving structure for each application
      Multiple different structures for different parts of one application
    Free (almost): 10s to 100s of independently addressable RAM busses
      On-chip bus widths 100s to 1000s of bits
      Cheap, fast logic for address generation & de-interleaving networks
  FPGA-based computation is an emerging field
    Does not have software’s huge base of widely applicable techniques
    Needs to develop a “cookbook” of reusable computation structures

Grid computation: candidate for acceleration
  Many applications in molecular dynamics, physics, molecule docking, Perlin noise, image processing …
  Computation characteristics being addressed:
    Cluster of grid points needed at each step
    Grid cells accessed in irregular order
      Invalidates typical schemes for reusing data
    Working set fits into FPGA’s on-chip RAM

Implementing for FPGA computation
  Allows reconsideration of the whole algorithm
    Optimal FPGA algorithms are commonly very different from sequential implementations
  Developer has access to algorithm’s logical indexing scheme
    Extra design information in 2,3,... dimensional indexing, before flattening into RAM addresses
  FPGAs support massive, fine-grained parallelism in computation pipeline
    Often throttled by serial access to RAM operands
    Goal: Fetch enough operands to fill the width of the computation array

[Figure: Bilinear interpolation for computing off-grid points]

Implementation technique
  [Figure: grid cells labeled 1a-1d, 2a-2d, 3a-3d]
  1. Define application’s access cluster
     Convert to rectangular array
       3a 3b 3c
       2a 2b 2c
          1b
  2. Round up to power of 2 bounding box
     RAM banks indexed by {X, Y} mod 4
  3. Address generation: Map access cluster to grid
     Handle wraparound: {X, Y} / 4 vs {X, Y} / 4 + 1
  4. De-interleaving: Map RAM banks to outputs
  The general case, not just limited to 1D or 2D arrays
  [Figure: block diagram. X and Y index LSBs and MSBs feed address generation (with a conditional +1), the RAM array, and the de-interleave stage that drives outputs A-D; a further diagram is labeled X 0:1, Y 0:1, Z 0:1, Output.]
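The four steps of the implementation technique (bank selection by index LSBs, address generation with wraparound, de-interleaving to outputs) can be mimicked in software. The following is only a minimal sketch under stated assumptions, not the authors' generated VHDL: the function names `build_banks` and `read_cluster` are hypothetical, the bank count N = 4 follows the "{X, Y} mod 4" example, and the hex-cluster offsets are one plausible reading of the sample input as (row, column) pairs.

```python
N = 4  # banks per axis; assumed from the poster's "{X, Y} mod 4" example

# Hypothetical offsets for the 7-cell hex cluster (1b, 2a-2c, 3a-3c),
# read here as (row, column) pairs. That reading is an assumption.
HEX_CLUSTER = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]


def build_banks(grid, n=N):
    """Spread a 2-D grid over an n x n array of banks.

    Cell (x, y) is stored in bank (x mod n, y mod n)
    at in-bank address (x div n, y div n).
    """
    banks = {(bx, by): {} for bx in range(n) for by in range(n)}
    for x, row in enumerate(grid):
        for y, value in enumerate(row):
            banks[(x % n, y % n)][(x // n, y // n)] = value
    return banks


def read_cluster(banks, x0, y0, offsets, n=N):
    """Return grid cells (x0 + dx, y0 + dy) for each offset, 0 <= dx, dy < n.

    Offsets that fit in the n x n bounding box always land in distinct
    banks, so hardware could issue all of these reads in the same cycle.
    """
    used = set()
    out = []
    for dx, dy in offsets:
        # Bank select: the LSBs of each index (index mod n).
        bx, by = (x0 + dx) % n, (y0 + dy) % n
        assert (bx, by) not in used, "two cluster cells hit the same bank"
        used.add((bx, by))
        # Address generation: the MSBs (index div n), plus 1 when the
        # offset wraps past the edge of the bank array.
        ax = x0 // n + (1 if x0 % n + dx >= n else 0)
        ay = y0 // n + (1 if y0 % n + dy >= n else 0)
        # De-interleaving: results are emitted in cluster order,
        # whichever bank they came from.
        out.append(banks[(bx, by)][(ax, ay)])
    return out


# Example: a 12 x 12 grid whose cell values encode their own coordinates.
grid = [[100 * x + y for y in range(12)] for x in range(12)]
banks = build_banks(grid)
print(read_cluster(banks, 3, 6, HEX_CLUSTER))
```

The loop is sequential only because this is software. The point of the sketch is that every cell of the cluster resolves to a different bank, and that the in-bank address differs from the base address by at most one per axis, which is what keeps the address-generation logic small.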