K. Bazargan, R. Kastner, M. Sarrafzadeh
Physical Design for Reconfigurable Computing Systems using Firm Templates
Department of Electrical & Computer Engineering, Northwestern University
Sep 10, 99
Outline
FPGA: What and why?
What is a Reconfigurable Computing System (RCS)?
Application example
RCS: System components
Online placement: problem definition and our approach
Offline placement and scheduling
Flexible modules and firm templates
Conclusion and future work
The Architecture of a Reconfigurable System
[Figure: block diagram with CPU, RFU, Data Memory, and Instruction Memory (Program); labeled connections: Data, Control, CPU instructions, RFUOPs]
Execution of a Sample Program
Code:
    x = 3*a - b;
    …
    c = RFUOP1(x,5);
    y = 4*x - c;
    for (i=0;i<3;i++){
        x+=RFUOP2(y); ++y;
    }
    z = RFUOP1(x,3);
    a = z - y;
    b = RFUOP3(a,b);
    c = a - b;
    …
[Figure: the code, its DFG, and the resulting placement on the RFU (axes x, y, t). Annotations: (on CPU); (on RFU); (in parallel); "No room on RFU to run all in parallel ==> run in sequence"]
Application Example: Image Restoration
The value of the center pixel in the next iteration:
    $x_{k+1} = \beta\,y + x_k - \beta\,(d ** x_k)$
$\beta$: scalar coefficient (its symbol did not survive in the source; $\beta$ is assumed here for both occurrences)
$y$: the pixel value from the original degraded image
$x_k$: the pixel value from the previous iteration
$d ** x_k$: the weighted sum $r_1$ * (eight neighbor pixels) + $r_0$ * center pixel
Weight mask:
    r1 r1 r1
    r1 r0 r1
    r1 r1 r1
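The update rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two coefficients missing from the formula are taken to be the same scalar, called `beta` here (a hypothetical name), images are plain lists of lists, and border pixels are left unchanged because the slide does not say how they are handled.

```python
def restore_step(x, y, beta, r0, r1):
    """One iteration x_{k+1} = beta*y + x_k - beta*(d ** x_k) on interior pixels.

    x: current estimate, y: degraded image (both lists of rows of floats).
    beta, r0, r1 are assumed scalar parameters; their values are not given in the slides.
    """
    rows, cols = len(x), len(x[0])
    nxt = [row[:] for row in x]  # border pixels are copied unchanged (assumption)
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Sum of the eight neighbors of pixel (i, j).
            neighbors = sum(
                x[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)
            )
            weighted = r1 * neighbors + r0 * x[i][j]  # d ** x_k at (i, j)
            nxt[i][j] = beta * y[i][j] + x[i][j] - beta * weighted
    return nxt
```

Each output pixel depends only on a 3 x 3 window of the previous iterate, which is why the pixels of a segment can be updated in parallel in hardware.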
Image Restoration (cont.)
Incentive:
– Processing of large images using FPGAs with limited resources
Strategy:
– Segmentation of the image into smaller images that fit on the FPGA
– Segments of size m x n are surrounded by an overlap of width o.
[Figure: an m x n segment surrounded by an overlap border o]
Image Restoration: Data Flow Strategy
– Pixels of individual segments are restored in parallel by hardware.
– Restored segments are written back after the overlap is discarded.
[Figure: segments of size m x n with overlap o move between MEMORY and the RFU]
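The segmentation strategy can be illustrated with a small index generator. It is a sketch, not the authors' code: it assumes m counts rows and n counts columns, that the overlap is clipped at the image border, and the function name `segments` is made up for this example.

```python
def segments(height, width, m, n, o):
    """Yield (core, padded) boxes for an image of height x width pixels.

    Each box is (row0, row1, col0, col1) with exclusive end indices.
    `core` is the m x n segment that is written back; `padded` adds the
    overlap o on every side (clipped at the image border) and is what
    would be sent to the RFU.
    """
    for r in range(0, height, m):
        for c in range(0, width, n):
            core = (r, min(r + m, height), c, min(c + n, width))
            padded = (max(r - o, 0), min(r + m + o, height),
                      max(c - o, 0), min(c + n + o, width))
            yield core, padded
```

After a padded segment is restored, only the pixels inside `core` are kept, which corresponds to discarding the overlap before write-back.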
Image Restoration Example
[Figure: Degraded Image; Restored Image]
System Components
[Figure: block diagram. Components: CPU, RFU, Data Memory, Instruction Mem. (Prog.), Configuration Memory, Program Manager, RFU Manager, Placement Engine, Cache Manager, Prefetch/Branch Prediction Unit. Labeled connections: Data, Control, CPU instructions, RFUOPs, Config. Bits]
Online Placement: Problem Definition
Input:
– RFU dimensions: (W, H)
– List of RFUOP events: (w, h, arrival, departure)
Output:
– For each module, either
  – Rejected (not able to place) [penalty?]
  – Accepted: (x, y)
[Figure labels: arrival, departure, accepted, rejected]
Online Placement
When a new RFUOP arrives,
– Is there enough room?
– If yes, which location is best?
Previous work
– Bin-packing heuristics (1-D), O(n^2): First Fit, Best Fit, Shelf, Look ahead, …
– [Chazelle’83] The Bottom-Left heuristic, O(n^2)
– [Healy-Creavin’97] O(n^2 lg n)
[Figure: Current Placement + New module to be inserted = ?]
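To make the input and output of the online problem concrete, here is a brute-force sketch. It is not one of the cited algorithms: it simply scans integer positions bottom row first, then leftmost column, and checks overlap against the modules currently on the RFU. The names `Event` and `place_online` are assumptions made for this example, as is the rule that a module leaving at time t frees its area for a module arriving at t.

```python
from dataclasses import dataclass


@dataclass
class Event:
    w: int
    h: int
    arrival: int
    departure: int


def place_online(W, H, events):
    """Return, per event, the accepted (x, y) position or None if rejected."""
    result = [None] * len(events)
    active = []  # modules on the RFU: (departure, x, y, w, h)
    order = sorted(range(len(events)), key=lambda i: events[i].arrival)
    for i in order:
        ev = events[i]
        # Modules whose departure time has passed free their area.
        active = [a for a in active if a[0] > ev.arrival]
        # Scan bottom-to-top, then left-to-right, for a non-overlapping spot.
        spot = next(
            ((x, y)
             for y in range(H - ev.h + 1)
             for x in range(W - ev.w + 1)
             if all(x + ev.w <= ax or ax + aw <= x or
                    y + ev.h <= ay or ay + ah <= y
                    for _, ax, ay, aw, ah in active)),
            None,
        )
        if spot is not None:
            active.append((ev.departure, spot[0], spot[1], ev.w, ev.h))
        result[i] = spot
    return result
```

The scan costs O(W * H * n) per arrival, which is the kind of cost that motivates keeping the empty space in an explicit data structure instead.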
Our Online Placement
Our approach:
– Divide the empty space into explicit “empty rectangles” (ERs)
When a new RFUOP arrives
– Is there enough room? (any ER large enough?)
– If yes, which location is best? (which ER is best?)
Packing rule
– Best Fit, Bottom Left, First Fit
Heuristics for Choosing an Empty Rectangle
[Figure: the current placement with two candidate empty rectangles, A and B, points P1 and P2 marked on them, and the new module to be inserted]
– BF (Best Fit): places the new module in the empty rectangle that leaves the least wasted space. Area(A) < Area(B): choose A.
– FF (First Fit): either A or B may be chosen for placing the new module.
– BL (Bottom Left): chooses the empty rectangle that lies closest to the bottom left. y(P2) < y(P1): choose B.
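The three packing rules differ only in how they rank the empty rectangles that are large enough. A minimal sketch, assuming each empty rectangle is a tuple (x, y, width, height) with (x, y) its bottom-left corner; the function name `choose_rect` and the tie-breaking on x for Bottom Left are assumptions:

```python
def choose_rect(rects, w, h, rule):
    """Pick an empty rectangle for a w x h module, or return None if none fits."""
    fits = [r for r in rects if r[2] >= w and r[3] >= h]
    if not fits:
        return None
    if rule == "FF":   # First Fit: the first rectangle that is large enough
        return fits[0]
    if rule == "BF":   # Best Fit: least area left over after placing the module
        return min(fits, key=lambda r: r[2] * r[3] - w * h)
    if rule == "BL":   # Bottom Left: lowest corner first, then leftmost
        return min(fits, key=lambda r: (r[1], r[0]))
    raise ValueError(f"unknown rule: {rule}")
```

For example, with `A = (0, 5, 3, 3)` and `B = (4, 0, 4, 4)` and a 2 x 2 module, Best Fit picks A (smaller area) and Bottom Left picks B (lower corner), mirroring the two cases on the slide.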
Our Online Placement
Our approach:
– Divide the empty space into explicit “empty rectangles” (ERs)
When a new RFUOP arrives
– Is there enough room? (any ER large enough?)
– If yes, which location is best? (which ER is best?)
Managing the empty space
– Keep empty rectangles explicitly, use a “range tree” to store/access empty rects.
– Efficient use of RFU real estate
  – KAMER: Keep All O(n^2) Maximal Empty Rectangles
Keeping All Empty Rectangles
Our Online Placement
Our approach:
– Divide the empty space into explicit “empty rectangles” (ERs)
When a new RFUOP arrives
– Is there enough room? (any ER large enough?)
– If yes, which location is best? (which ER is best?)
Managing the empty space
– Keep empty rectangles explicitly, use a “range tree” to store/access empty rects.
– Efficient use of RFU real estate
  – KAMER: Keep All O(n^2) Maximal Empty Rectangles
– Fast but sub-optimal
  – Keep only O(n) empty rectangles
  – Shorter Seg. (SSEG), Square Empty Rects. (SQR), ...
Keeping O(n) Empty Rectangles - SSEG
Heuristics for Choosing a Segment
[Figure: the empty space can be cut along segment S1 or segment S2; the two candidate cuts produce the rectangle pairs (A, B) and (C, D)]
– SSEG (Shorter Seg): chooses the shorter of the two segments. Figure: S1 < S2.
– LSEG (Longer Seg): chooses the longer of the two segments.
– BER (Balanced Empty Rects): chooses the segment that creates the smaller area difference. Figure: Area(B) - Area(A) > Area(D) - Area(C).
– LER (Large Empty Rects): chooses the segment that creates the larger empty rectangle. Figure: Area(B) > Area(D).
– SQR (Square Rects): chooses the segment that creates empty rectangles closer to squares. Figure: Max{AR(A), AR(B)} < Max{AR(C), AR(D)}.
– LSQR (Larger Rect Square): chooses the segment that makes the larger rectangle closer to a square. Figure: AspectRatio(B) > AspectRatio(D).
AR = AspectRatio
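The segment choice can be sketched for the simplest case. The geometry here is an assumption, since the figure is not available: a w x h module is placed in the bottom-left corner of an empty rectangle, the leftover region is L-shaped, and it can be cut either by extending the module's right edge upward or by extending its top edge to the right. Only SSEG and LSEG are shown; `split_leftover` is a hypothetical name.

```python
def split_leftover(er, w, h, rule="SSEG"):
    """Split the L-shaped leftover of empty rectangle `er` into two rectangles.

    er is (x, y, W, H); the w x h module sits at its bottom-left corner.
    Returns the non-degenerate rectangles as (x, y, width, height).
    """
    x, y, W, H = er
    seg_v = H - h  # length of the vertical cut (module's right edge extended up)
    seg_h = W - w  # length of the horizontal cut (module's top edge extended right)
    vertical = [(x + w, y, W - w, H), (x, y + h, w, H - h)]
    horizontal = [(x, y + h, W, H - h), (x + w, y, W - w, h)]
    if rule == "SSEG":      # cut along the shorter segment
        pick = vertical if seg_v <= seg_h else horizontal
    elif rule == "LSEG":    # cut along the longer segment
        pick = vertical if seg_v >= seg_h else horizontal
    else:
        raise ValueError(f"unknown rule: {rule}")
    return [r for r in pick if r[2] > 0 and r[3] > 0]
```

Either cut preserves the total leftover area; the rules differ only in the shapes of the two rectangles that remain available for later modules.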
How Good is a Placement?
Acceptance rate
– Percentage of modules accepted (placed)
Volume penalty
– Area: complexity
– Time-span in the system: loop iterations
– Penalty of rejecting a module: penalty = volume = area * time
Input data
– Randomly generated dimensions
– Randomly generated enter/leave times
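Both quality measures follow directly from the definitions on this slide. A small sketch, assuming events are (w, h, arrival, departure) tuples and `accepted` is a parallel list of booleans; the function name `score` is made up:

```python
def score(events, accepted):
    """Return (acceptance rate in percent, total volume penalty of rejected modules)."""
    rate = 100.0 * sum(accepted) / len(events)
    # penalty = volume = area * time, summed over the rejected modules only
    penalty = sum(w * h * (dep - arr)
                  for (w, h, arr, dep), ok in zip(events, accepted)
                  if not ok)
    return rate, penalty
```

Weighting a rejection by volume means that rejecting one large, long-lived module can cost more than rejecting several small, short-lived ones, so the two measures do not always agree.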
Program snapshot
Online Placement Results
[Figure: percentage of accepted modules using different bin-packing and empty space partitioning rules]
Online Placement Results (cont.)
Online Placement Results (cont.)
3-D Floorplanning
[Figure: a DFG and its schedule on the CPU and RFU, drawn as a 3-D floorplan: x and y span the RFU area, t is time]
3-D Floorplanning
[Figure: DFG and schedule on the CPU and RFU; axes x, y, t. Annotation: "By deleting this RFUOP (CPU performs the operation)..."]
3-D Floorplanning
[Figure: DFG and schedule on the CPU and RFU; axes x, y, t. Annotation: "This RFUOP can be moved on the RFU"]
3-D Floorplanning
[Figure: DFG and schedule on the CPU and RFU; axes x, y, t. Annotation: "These RFUOPs can be performed earlier..."]
3-D Floorplanning
[Figure: DFG and schedule on the CPU and RFU; axes x, y, t]
Our Current 3-D Floorplanners
No change in the schedule
– Fixed insertions and deletions of RFUOPs
Annealing based
– Move set
  – Move operation from CPU set to RFU set
  – Move operation from RFU set to CPU set
  – Displace an already placed RFUOP on the RFU
– Cost function
  – Penalty in rejecting modules (sum of volumes of the RFUOPs in the CPU set)
– No overlap allowed during annealing
Greedy
– Sort the modules on decreasing volume, apply KAMER
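The annealing formulation above (three moves, volume cost, no overlap) can be sketched as follows. This is an illustrative toy, not the authors' floorplanner: the cooling schedule, the step count, the 50% chance of trying an eviction, and all function names are assumptions. Each RFUOP is a tuple (w, h, t0, t1) with a fixed time interval, matching "no change in the schedule".

```python
import math
import random


def volume(op):
    w, h, t0, t1 = op
    return w * h * (t1 - t0)


def overlaps(a, pa, b, pb):
    """True if box a at position pa and box b at pb intersect in x, y, and time."""
    (aw, ah, a0, a1), (ax, ay) = a, pa
    (bw, bh, b0, b1), (bx, by) = b, pb
    return (ax < bx + bw and bx < ax + aw and
            ay < by + bh and by < ay + ah and
            a0 < b1 and b0 < a1)


def fits(ops, placed, i, pos, W, H):
    """True if op i at pos lies inside the RFU and overlaps no other placed op."""
    w, h, _, _ = ops[i]
    x, y = pos
    if x < 0 or y < 0 or x + w > W or y + h > H:
        return False
    return not any(overlaps(ops[i], pos, ops[j], p)
                   for j, p in placed.items() if j != i)


def anneal(ops, W, H, steps=20000, t_start=50.0, t_end=0.1, seed=0):
    rng = random.Random(seed)
    placed = {}                            # RFU set: op index -> (x, y)
    cost = sum(volume(op) for op in ops)   # all ops start in the CPU set
    best, best_cost = dict(placed), cost
    for k in range(steps):
        temp = t_start * (t_end / t_start) ** (k / steps)  # geometric cooling
        i = rng.randrange(len(ops))
        w, h, _, _ = ops[i]
        if w > W or h > H:
            continue                       # can never fit on the RFU
        pos = (rng.randint(0, W - w), rng.randint(0, H - h))
        if i in placed and rng.random() < 0.5:
            # Move from RFU set to CPU set: cost rises, accept by Metropolis rule.
            delta = volume(ops[i])
            if rng.random() < math.exp(-delta / temp):
                del placed[i]
                cost += delta
        elif fits(ops, placed, i, pos, W, H):
            # Displace a placed RFUOP, or move one from the CPU set to the RFU set.
            if i not in placed:
                cost -= volume(ops[i])
            placed[i] = pos
        if cost < best_cost:
            best, best_cost = dict(placed), cost
    return best, best_cost
```

Because overlapping positions are rejected outright, every intermediate state is a legal placement, and the cost is always exactly the volume left in the CPU set.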
Our Current 3-D Floorplanners (cont.)
KAMER-BF-Decreasing (KAMER-BFD)
– Sort the modules on their volumes
– Use KAMER to find a fast placement of the modules
Low-temp. annealing (LTSA)
– Similar to KAMER-BFD, but use KAMER to place only the X% largest modules
– Use low-temp annealing to place the rest
Zero-temp. annealing (ZTSA), greedy
– Use KAMER to place as many modules as you can
– Use only the displace and move-from-CPU-to-RFU annealing moves.
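The sort-then-place structure of the greedy floorplanners can be sketched as below. Note the substitution: KAMER itself is not reproduced here, and a brute-force bottom-left scan over integer positions stands in for it. RFUOPs are assumed to be (w, h, t0, t1) tuples, and `greedy_decreasing` is a hypothetical name.

```python
def greedy_decreasing(ops, W, H):
    """Place ops in order of decreasing volume; return {index: (x, y)} for placed ops."""

    def vol(i):
        w, h, t0, t1 = ops[i]
        return w * h * (t1 - t0)

    def free(i, x, y, placed):
        # A position is free if op i there overlaps no placed op in x, y, and time.
        w, h, t0, t1 = ops[i]
        for j, (px, py) in placed.items():
            pw, ph, p0, p1 = ops[j]
            if (x < px + pw and px < x + w and
                    y < py + ph and py < y + h and
                    t0 < p1 and p0 < t1):
                return False
        return True

    placed = {}
    for i in sorted(range(len(ops)), key=vol, reverse=True):
        w, h, _, _ = ops[i]
        # Stand-in for KAMER: first free spot, bottom row first, then leftmost.
        spot = next(((x, y)
                     for y in range(H - h + 1)
                     for x in range(W - w + 1)
                     if free(i, x, y, placed)), None)
        if spot is not None:
            placed[i] = spot
    return placed
```

Ops that do not appear in the returned dictionary form the CPU set, so their volumes add up to the penalty used as the annealing cost.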
Our Current 3-D Floorplanners (cont.)
BFOP - Best Fit Online Placement
– Sort the RFUOPs on volume (decreasing)
– For each RFUOP, find candidate “corners”
– Choose the corner that results in minimum wasted area (similar to the well-studied 2-D bin packing problem)
[Figure: 3-D floorplan with axes x, y, t; a floor corresponding to time t_1 with its candidate corners marked]
Annealing-Based Offline vs. Online
[Figure: percentage of accepted modules and penalties using two offline parameters. The higher the RFU acceptance rate and the lower the penalty, the better the algorithm.]
Offline Placement Results - All
Flexible Modules
Library of soft templates
– Flexible shapes
  – Constant area, different width and height
  – Problem? Hard to build (PD should be done for each shape)
– Median
  – Use the same area, but a square shape
– Rotation
Placement method
– Use the best shape (min wasted area)
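Choosing the best shape of a flexible module can be sketched as a small search. The assumptions are mine: shapes are restricted to integer width and height pairs with exactly the given area, empty rectangles are (x, y, width, height) tuples, and waste is the empty rectangle's area minus the module's area. The names `shapes` and `best_shape` are hypothetical.

```python
def shapes(area):
    """All integer (width, height) pairs with width * height == area."""
    return [(w, area // w) for w in range(1, area + 1) if area % w == 0]


def best_shape(area, rects):
    """Return (waste, shape, rect) with minimum waste, or None if no shape fits."""
    options = [(rw * rh - area, (w, h), (x, y, rw, rh))
               for (w, h) in shapes(area)
               for (x, y, rw, rh) in rects
               if w <= rw and h <= rh]
    return min(options, default=None)
```

Because (h, w) appears in the list whenever (w, h) does, rotation is covered by the same enumeration.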
Using Flexible Modules in BFOP
[Figure note: Median uses a square module with the same area]
Flexible Modules (cont.)
“Firm” templates
– Slice the module into x horizontal or vertical strips
– If the module cannot be placed, try the 2-split, 3-split, … until it fits.
Problem?
– Routing!
– Only limited module types can be split (carry chains, etc., with minimal communication between stages)
[Figure: Vertical 3-split]
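The "increase the split count until it fits" loop can be sketched for vertical splits. Several simplifications are assumed: the empty rectangles are disjoint, strips have equal width ceil(w / k), strips may be packed side by side inside one rectangle, and routing between strips is ignored. The function name and the `max_k` limit are hypothetical.

```python
import math


def min_vertical_split(w, h, rects, max_k=6):
    """Smallest k such that k vertical strips of a w x h module fit, or None.

    rects: disjoint empty rectangles as (x, y, width, height).
    """
    for k in range(1, max_k + 1):
        strip_w = math.ceil(w / k)  # width of each strip (last may be narrower)
        # Each rectangle tall enough holds as many strips as fit side by side.
        capacity = sum(rw // strip_w for (_, _, rw, rh) in rects if rh >= h)
        if capacity >= k:
            return k
    return None
```

With k = 1 this reduces to the ordinary question of whether some empty rectangle holds the whole module, so the unsplit case is tried first, as on the slide.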
Quality Improvements Using Firm Templates
Conclusion
Which online algorithm?
– If speed is an issue, SSEG; otherwise KAMER
Online or offline?
– If you have the schedule => offline
Which offline algorithm?
– BFOP is the best (faster + better quality)
Median? Flexibility? Firm templates?
– Surprisingly, median gives little improvement
– If flexible shapes are available, they are better than splitting (no additional routing problem)
– How many splits?
  – no-split → 2-split: 23% improvement
  – 5-split → 6-split: 3% improvement