Berkeley: Sept 15,
Physical Design Challenges of Reconfigurable Computing Systems
Majid Sarrafzadeh, NuCAD, Department of ECE, Northwestern University
Ryan Kastner, Todd Haverkos, Kia Bazargan, Seda Ogrenci, Eli Bozorgzadeh, Candice McGrew
Sponsored by: DARPA, Motorola, AT&T, NSF
Faculty Position in VLSI Design & CAD (1-2 openings)
- VLSI Design & CAD: one of the six focused research areas in the department
- Assistant/Associate/Full Professor
- Northwestern rank: top 10; ECE: top 20 (top 10 in 5 years)
- Contact:
Field Programmable Gate Array: FPGA
FPGA (Xilinx)
[Figure: Degraded Image and Restored Image]
- System A: the image is stored in on-chip memory; the circuit that processes the image resides on the rest of the FPGA chip.
- System B: FPGA chip plus on-board memory, where the image is stored.
- System C: FPGA chip plus host processor; the image is stored on the host.
The Architecture of a Reconfigurable System
[Figure: block diagram with CPU, RFU, Data Memory and Instruction Memory (Program); the instruction stream carries CPU instructions and RFUOPs; Control and Data signals link the blocks.]
RFU
- Field Programmable Gate Array (FPGA): programmable logic and programmable connections
- SRAM cells used in configuration
  - Reconfigurable (at runtime)
  - Static vs. dynamic configuration
- Hardware functions implemented as rectangular areas on the FPGA
System Components
[Figure: block diagram with these labels: Configuration Memory, Config. Bits, RFUOPs, RFU Manager, Placement Engine, Cache Manager, Prefetch/Branch Prediction Unit, Control Program Manager, Instruction Mem. (Prog.), CPU instructions, CPU, RFU, Data Memory, Data.]
System Behavior
Two kinds of instructions:
- CPU instructions: always run on the CPU. Runtime is assumed known.
- RFUOPs: may be performed on the CPU if there is not enough room on the RFU. Runtime and reconfiguration time are assumed known.
Runtime profiles and RFU status are used to decide between CPU and RFU.
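The CPU-versus-RFU decision can be illustrated with a minimal sketch. The slide states only that runtime profiles and RFU status drive the choice; the comparison rule below, the `RfuOp` record, and the `choose_target` function are assumptions made for illustration, not the system's actual policy.

```python
from dataclasses import dataclass


@dataclass
class RfuOp:
    """Hypothetical profile of one RFUOP (all fields are assumed names)."""
    name: str
    area: int             # RFU area the operation needs
    cpu_time: float       # known runtime if executed on the CPU
    rfu_time: float       # known runtime if executed on the RFU
    reconfig_time: float  # known time to load its configuration


def choose_target(op: RfuOp, free_area: int, already_loaded: bool = False) -> str:
    """Return "RFU" or "CPU" for one RFUOP (assumed decision rule)."""
    if not already_loaded and op.area > free_area:
        return "CPU"  # not enough room on the RFU
    # Reconfiguration is only paid when the operation is not yet on the RFU.
    rfu_cost = op.rfu_time + (0.0 if already_loaded else op.reconfig_time)
    return "RFU" if rfu_cost < op.cpu_time else "CPU"
```

With these assumptions, an operation falls back to the CPU either when it does not fit or when reconfiguration overhead outweighs the hardware speed-up.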
PD Challenges
Problem: given the RFUOPs to be performed on the RFU and the DFG constraints, schedule them in time and assign them physical locations.
- Must be very fast (mtools achieve 1000 cells per minute).
- Existing tools and techniques are very slow.
- Quality is less important.
- New PD algorithms and paradigms are needed.
In this presentation:
- placement,
- routing,
- an application on reconfigurable systems.
Firm Macros
- Not hard (too rigid), not soft (takes too much time to utilize the flexibility).
- Each unit is 80%-100% pre-designed; the macros can be “broken” in limited ways.
- We have defined a network algebra for combining circuits, based on parameterization using VHDL generics. Example: combine a fast and a slow adder in multiple ways.
Execution of a Sample Program
Code:
    x = 3*a - b;
    …
    c = RFUOP1(x,5);
    y = 4*x - c;
    for (i=0;i<3;i++){
        x += RFUOP2(y);
        ++y;
    }
    z = RFUOP1(x,3);
    a = z - y;
    b = RFUOP3(a,b);
    c = a - b;
    …
[Figure: the code is turned into a DFG whose nodes run on the CPU or on the RFU, and the RFUOPs are placed in the RFU's (x, y, t) space. RFUOPs that fit together run in parallel; where there is no room on the RFU to run all in parallel, they run in sequence.]
Placement
- On-line placement: RFU calls need to be executed as the program proceeds.
- Off-line placement: a complete or partial profile of the operations is available.
Online Placement
When a new RFUOP arrives:
- Is there enough space to place the RFUOP?
- If yes, which location is best to place it?
Decision 1: managing the empty space
- Fast but sub-optimal: keep only $O(n)$ empty rectangles, using a rule such as Shorter Seg. (SSEG), Square Empty Rects. (SQR), ...
- Efficient use of RFU real estate: KAMER, keep all $O(n^2)$ maximal empty rectangles
Decision 2: packing rule
- Best Fit, Bottom Left, First Fit
Keeping All Empty Rectangles vs. Keeping O(n) Empty Rectangles (SSEG)
[Figure: the same free space partitioned both ways; one module is labeled "Cannot fit this".]
Heuristics for Choosing an Empty Rectangle
[Figure: the current placement with candidate empty rectangles A and B (corner points P1 and P2) and the new module to be inserted.]
- BF (Best Fit): places the new module in the empty rectangle which causes less wasted space. In the example, Area( ) < Area( ), so choose A.
- FF (First Fit): any of A or B could be chosen for placing the new module.
- BL (Bottom Left): places the new module in the rectangle with the lower bottom-left corner, breaking ties by picking the leftmost one. In the example, y(P2) < y(P1), so choose B.
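The three packing rules can be sketched as selection functions over a list of empty rectangles. This is an illustration under stated assumptions: the `(x, y, width, height)` tuple layout, the function names, and the definition of wasted space as leftover area are not taken from the slides.

```python
from typing import List, Optional, Tuple

# Assumed layout: (x, y, width, height), with (x, y) the bottom-left corner.
Rect = Tuple[int, int, int, int]


def fits(rect: Rect, w: int, h: int) -> bool:
    """True if a w-by-h module fits inside the empty rectangle."""
    return rect[2] >= w and rect[3] >= h


def first_fit(empty: List[Rect], w: int, h: int) -> Optional[Rect]:
    """FF: take the first empty rectangle that is large enough."""
    for rect in empty:
        if fits(rect, w, h):
            return rect
    return None


def best_fit(empty: List[Rect], w: int, h: int) -> Optional[Rect]:
    """BF: take the rectangle that leaves the least unused area (assumed metric)."""
    candidates = [r for r in empty if fits(r, w, h)]
    return min(candidates, key=lambda r: r[2] * r[3] - w * h, default=None)


def bottom_left(empty: List[Rect], w: int, h: int) -> Optional[Rect]:
    """BL: lowest bottom-left corner first, ties broken by the leftmost one."""
    candidates = [r for r in empty if fits(r, w, h)]
    return min(candidates, key=lambda r: (r[1], r[0]), default=None)
```

All three return `None` when no empty rectangle can hold the module, which corresponds to the case where the RFUOP is rejected or sent to the CPU.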
Heuristics for Choosing a Segment
[Figure: two candidate segments S1 and S2; the resulting empty rectangles are labeled A, B, C, D.] AR = AspectRatio.
- SSEG (Shorter Seg): chooses the shorter of the two segments. Example: S1 < S2.
- LSEG (Longer Seg): chooses the longer of the two segments.
- BER (Balanced Empty Rects): chooses the segment which creates less area difference. Example: Area(B) - Area(A) > Area(D) - Area(C).
- LER (Large Empty Rects): chooses the segment which creates the larger empty rectangle. Example: Area(B) > Area(D).
- SQR (Square Rects): chooses the segment which creates empty rectangles closer to squares. Example: Max{AR(A), AR(B)} < Max{AR(C), AR(D)}.
- LSQR (Larger Rect Square): chooses the segment which creates the larger rectangle closer to square. Example: AspectRatio(B) > AspectRatio(D).
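A minimal sketch of the two simplest rules, SSEG and LSEG, follows. It assumes the module is placed at the bottom-left corner of an empty rectangle and that the two candidate segments start at the module's top-right corner, one running up and one running right; that geometry, the tuple layout, and the `split_empty` name are assumptions for illustration.

```python
from typing import List, Tuple

# Assumed layout: (x, y, width, height), with (x, y) the bottom-left corner.
Rect = Tuple[int, int, int, int]


def split_empty(rect: Rect, w: int, h: int, rule: str = "SSEG") -> List[Rect]:
    """Place a w-by-h module at the bottom-left corner of `rect` and return
    the two empty rectangles left over, split along the chosen segment."""
    x, y, width, height = rect
    if w > width or h > height:
        raise ValueError("module does not fit in the empty rectangle")
    if rule not in ("SSEG", "LSEG"):
        raise ValueError("unknown rule: " + rule)

    vertical_len = height - h    # segment from the module's top-right corner upward
    horizontal_len = width - w   # segment from the module's top-right corner rightward

    if rule == "SSEG":
        use_vertical = vertical_len <= horizontal_len
    else:  # LSEG
        use_vertical = vertical_len > horizontal_len

    if use_vertical:
        parts = [(x, y + h, w, height - h), (x + w, y, width - w, height)]
    else:
        parts = [(x, y + h, width, height - h), (x + w, y, width - w, h)]
    # Drop degenerate rectangles when the module touches an edge.
    return [p for p in parts if p[2] > 0 and p[3] > 0]
```

Because each placement replaces one empty rectangle with at most two, the number of tracked rectangles stays linear in the number of placed modules, which is the trade-off against keeping all maximal empty rectangles.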
Online Placement Results
Table 1. Percentage of accepted modules using different bin-packing and empty space partitioning rules
[Figure: volume that does not fit; the best result is marked BEST.]
Off-line placement: 3-D Floorplanning
[Figure: a DFG and its schedule on CPU and RFU. Each RFUOP on the RFU is a box in a 3-D space: x and y span the RFU area and t is time.]
[Figure: the same floorplan after deleting one RFUOP from the RFU; the CPU performs that operation instead.]
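The 3-D view can be made concrete with a small legality check: each RFUOP occupies a box in (x, y, t), and a floorplan is legal when every box lies on the RFU and no two boxes intersect. The `Box` type, its field names, and the helper functions are assumptions for illustration.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical RFUOP footprint: position (x, y), start time t,
    width w, height h, and duration d."""
    x: int
    y: int
    t: int
    w: int
    h: int
    d: int


def volume(box: Box) -> int:
    """Area times duration."""
    return box.w * box.h * box.d


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes intersect in x, y and t simultaneously."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h and
            a.t < b.t + b.d and b.t < a.t + a.d)


def is_legal(boxes: List[Box], rfu_width: int, rfu_height: int) -> bool:
    """All boxes lie inside the RFU area and no two of them overlap."""
    for box in boxes:
        if box.x < 0 or box.y < 0:
            return False
        if box.x + box.w > rfu_width or box.y + box.h > rfu_height:
            return False
    return not any(overlaps(a, b)
                   for i, a in enumerate(boxes) for b in boxes[i + 1:])
```

Two RFUOPs may share the same area as long as their time intervals are disjoint, which is what distinguishes this from ordinary 2-D floorplanning.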
Our 3-D Floorplanner: No change in the schedule
Pure annealing
- Move set:
  - Move an operation from the CPU set to the RFU set
  - Move an operation from the RFU set to the CPU set
  - Displace an already placed RFUOP on the RFU
- Cost function: volume
- Very poor results
Start with an ASAP schedule, use on-line placement to get an initial solution, then apply low-temperature annealing.
Offline Placement Results
[Table: columns Algorithm, Data set, Offline Penalty, Online Penalty, Ratio; rows LTSA with X=100% and LTSA with X=20%, each on data sets T50, T100, S100, S200.]
Place X% of the largest-volume modules using on-line placement.
Flexibility of the Modules
- The library of modules has different implementations for each RFUOP.
- Experimental results with our online algorithms show about 60% reduction in penalty.
- 3-4 implementations are enough.
Faster Routing: mostly offline
VPR CAD flow:
- Inputs: technology-mapped netlist, architecture description file
- VPR: place the circuit or read in an existing placement, then perform either global or combined global/detailed routing
- Outputs: placement and routing output files
Routing Algorithm (VPR)
- Call VPR's router with an arbitrary channel width.
- The router is based on the PathFinder negotiated congestion algorithm.
- Step 1: each net is routed by the shortest path that can be found, regardless of any overuse of wiring segments.
- Step 2: every net in the circuit is sequentially ripped up and re-routed by the lowest-cost path found.
Fast Pattern Routing
- Maze-based routing gives good results but is very slow.
- So, speed up the router by partially using pattern routing.
- Rationale: if an arbitrary net were picked and routed differently, the result would not change significantly.
Independent subset of nets
Two geometrically independent sets of nets: Class 1 and Class 2
Routing Patterns
- 2-terminal net patterns
- Multi-terminal net patterns (MSTs & RSTs)
- Cost = L + const / Flexibility
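As an illustration of 2-terminal patterns, the sketch below enumerates the one-bend (L-shaped) routes between two pins and evaluates the cost expression from the slide. The choice of L-shapes as the pattern set, the function names, and the way `flexibility` is supplied are assumptions; the slide does not define how Flexibility is measured.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def l_patterns(src: Point, dst: Point) -> List[List[Point]]:
    """Return the candidate one-bend routes as lists of corner points."""
    (x1, y1), (x2, y2) = src, dst
    if x1 == x2 or y1 == y2:
        return [[src, dst]]          # pins are aligned: a straight wire
    return [
        [src, (x2, y1), dst],        # horizontal first, then vertical
        [src, (x1, y2), dst],        # vertical first, then horizontal
    ]


def wirelength(path: List[Point]) -> int:
    """Manhattan length of a rectilinear path."""
    return sum(abs(ax - bx) + abs(ay - by)
               for (ax, ay), (bx, by) in zip(path, path[1:]))


def pattern_cost(length: float, flexibility: float, const: float = 1.0) -> float:
    """Cost = L + const / Flexibility; `flexibility` must be positive and is
    passed in by the caller because its definition is not given here."""
    return length + const / flexibility
```

Both L-shapes have the same wirelength, so under this cost the second term is what separates otherwise equal candidates.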
Implementation of Algorithm
First choose the 2-terminal nets to route:
- More than 50% of the nets are 2-terminal nets.
- To get the maximum independent sets, sort the 2-terminal nets by their bounding boxes.
- Classify the 2-terminal nets into geometrically independent classes.
- Route the classes sequentially by pattern routing.
Next choose the multi-terminal nets (low fan-out):
- Route them in their corresponding RST patterns.
Finally, let the rest of the nets be routed by the traditional router.
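The classification step can be sketched as a greedy grouping. This sketch assumes that two nets are geometrically independent when their bounding boxes do not overlap, and that sorting is by bounding-box half-perimeter; both choices, and all names below, are assumptions rather than the presented implementation.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Net = Tuple[Point, Point]            # a 2-terminal net: (pin, pin)
BBox = Tuple[int, int, int, int]     # (xmin, ymin, xmax, ymax)


def bbox(net: Net) -> BBox:
    (x1, y1), (x2, y2) = net
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def half_perimeter(box: BBox) -> int:
    return (box[2] - box[0]) + (box[3] - box[1])


def boxes_overlap(a: BBox, b: BBox) -> bool:
    """Closed boxes: sharing an edge or a corner counts as overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def classify(nets: List[Net]) -> List[List[Net]]:
    """Greedily assign each net to the first class in which its bounding box
    overlaps no other net's box; open a new class if none qualifies."""
    classes: List[List[Net]] = []
    for net in sorted(nets, key=lambda n: half_perimeter(bbox(n))):
        for cls in classes:
            if not any(boxes_overlap(bbox(net), bbox(other)) for other in cls):
                cls.append(net)
                break
        else:
            classes.append([net])
    return classes
```

Nets within one class cannot compete for the same routing resources inside their bounding boxes, so the order in which they are pattern-routed does not matter.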
Experimental Results
Image Restoration
The value of the center pixel in the next iteration:
$x_{k+1} = \beta y + x_k - \beta (d ** x_k)$
- $y$: the pixel value from the original degraded image
- $x_k$: the pixel value from the previous iteration
- $\beta$: the scalar coefficient multiplying $y$ and the weighted sum
- $d ** x_k$: the weighted sum $r_1 \cdot$ (eight neighbor pixels) $+ r_0 \cdot$ (center pixel)
[Figure: 3x3 weight mask with $r_0$ at the center and $r_1$ at the eight neighbors.]
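One iteration of this update can be sketched in plain Python. The scalar that multiplies both the degraded pixel and the weighted sum is called `beta` here, which is an assumed name, and leaving border pixels unchanged is also an assumption; the slide does not say how borders are treated.

```python
from typing import List

Image = List[List[float]]


def restore_step(x: Image, y: Image, beta: float, r0: float, r1: float) -> Image:
    """Apply one iteration to every interior pixel:
    new = beta * y + x - beta * (r1 * sum_of_8_neighbours + r0 * centre).
    Border pixels are copied unchanged (assumed boundary handling)."""
    rows, cols = len(x), len(x[0])
    out = [row[:] for row in x]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Sum of the 3x3 window minus the centre gives the eight neighbours.
            window = sum(x[i + di][j + dj]
                         for di in (-1, 0, 1) for dj in (-1, 0, 1))
            neighbours = window - x[i][j]
            weighted = r1 * neighbours + r0 * x[i][j]
            out[i][j] = beta * y[i][j] + x[i][j] - beta * weighted
    return out
```

Each output pixel depends only on a 3x3 window of the previous iterate, which is why the computation maps well onto parallel hardware.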
Incentive: processing of large images using FPGAs with limited resources
1. Segmentation of the image into smaller images suitable for the FPGA. Segments of size m x n are surrounded by an overlap of o.
[Figure: an m x n segment surrounded by an overlap border of width o.]
Pixels of individual segments are restored in parallel by hardware. Restored segments are written back to memory after the overlap is discarded.
[Figure: an m x n segment with overlap o transferred between MEMORY and the RFU.]
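The segmentation scheme can be sketched as index arithmetic: each segment is read with its overlap border and only the core is written back. The sketch assumes m counts rows and n counts columns, clips the overlap at the image edge, and uses hypothetical names throughout.

```python
from typing import List, Tuple

# (top, bottom, left, right), with bottom and right exclusive.
Range = Tuple[int, int, int, int]


def segments(height: int, width: int, m: int, n: int, o: int) -> List[Tuple[Range, Range]]:
    """Return (core, padded) index ranges for every segment.
    `core` is the m-by-n region written back after restoration;
    `padded` is the core grown by the overlap o, clipped to the image."""
    tiles = []
    for top in range(0, height, m):
        for left in range(0, width, n):
            bottom = min(top + m, height)
            right = min(left + n, width)
            core = (top, bottom, left, right)
            padded = (max(top - o, 0), min(bottom + o, height),
                      max(left - o, 0), min(right + o, width))
            tiles.append((core, padded))
    return tiles
```

The cores tile the image exactly once, so discarding the overlap on write-back never overwrites a pixel restored by another segment.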
How bad is the segmentation?
Theorem: the error introduced is about $w^{o}$. Example: $(1/16)^2 = 1/256$.
Proof: by induction.
Running Times of the Application on Software and on Different Systems (ignoring reconfiguration)
[Table: columns Image, Software Running Time (sec), Running Time for System A (msec), Running Time for System C (msec); rows cameraman, moon, circle, animals, fish, barbara, yacht, soccer, announcer, bluegirl, cablecar, cornfield.]
Conclusions
- Need a radical departure (new algorithms, etc.) from traditional PD algorithms.
- Fast (and lower-quality) place & route tools.
- Do as much as possible (building complex libraries, hierarchical routing, …) before compilation.
- All of the above (and more) is needed to make reconfigurable computing a reality.