K. Bazargan, R. Kastner, M. Sarrafzadeh
Physical Design for Reconfigurable Computing Systems using Firm Templates
Department of Electrical & Computer Engineering, Northwestern University
Sep 10, 99

Outline
- FPGA: What and why?
- What is a Reconfigurable Computing System (RCS)?
- Application example
- RCS: System components
- Online placement: problem definition and our approach
- Offline placement and scheduling
- Flexible modules and firm templates
- Conclusion and future work

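The Architecture of a Reconfigurable System
[Figure: block diagram connecting the CPU, the RFU, data memory, and instruction memory (program). The program contains both CPU instructions and RFUOPs, and the links are labeled control and data.]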

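Execution of a Sample Program
The code mixes ordinary CPU statements with RFU operations (RFUOP1, RFUOP2, RFUOP3):

```c
/* RFUOP calls execute on the RFU; all other statements execute on the CPU. */
x = 3*a - b;
…
c = RFUOP1(x, 5);
y = 4*x - c;
for (i = 0; i < 3; i++) {
    x += RFUOP2(y);
    ++y;
}
z = RFUOP1(x, 3);
a = z - y;
b = RFUOP3(a, b);
c = a - b;
…
```

[Figure: the code's data-flow graph (DFG) and its execution in (x, y, t) space, with some operations on the CPU and the RFUOPs on the RFU. RFUOPs that could run in parallel run in sequence when there is no room on the RFU to hold them all.]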

Application Example: Image Restoration
The value of the center pixel in the next iteration is

$$x_{k+1} = \beta\,y + x_k - \beta\,(d \ast\ast x_k)$$

where
- y: the pixel value from the original degraded image
- x_k: the pixel value from the previous iteration
- β: a constant coefficient
- d ∗∗ x_k: the weighted sum r_1 · (sum of the eight neighbor pixels) + r_0 · (center pixel), i.e. x_k filtered with the 3 x 3 mask

$$d = \begin{pmatrix} r_1 & r_1 & r_1 \\ r_1 & r_0 & r_1 \\ r_1 & r_1 & r_1 \end{pmatrix}$$
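
The per-pixel update can be sketched in C. This is a minimal sketch, not the authors' hardware: `restore_step` and its parameters are hypothetical names. The 3 x 3 window of x_k is passed in row-major order with the center at index 4. `beta` stands for the update's constant coefficient and is assumed to multiply both y and d ** x_k.

```c
/* One restoration iteration for a single pixel (sketch).
   win: 3x3 window of x_k in row-major order, center pixel at index 4. */
double restore_step(const double win[9], double y, double r0, double r1, double beta)
{
    double neighbors = 0.0;
    for (int i = 0; i < 9; i++)
        if (i != 4) neighbors += win[i];          /* the eight neighbor pixels */
    double dx = r1 * neighbors + r0 * win[4];     /* (d ** x_k) at the center */
    return beta * y + win[4] - beta * dx;         /* x_{k+1} */
}
```

For example, with every window pixel equal to 1, y = 2, r0 = 0.5, r1 = 0.0625 and beta = 0.5, the weighted sum is 1 and the new center value is 1.5.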

Image Restoration (cont.)
[Figure: an m x n segment surrounded by an overlap of width o.]
- Incentive: processing large images on FPGAs with limited resources.
- Strategy: segment the image into smaller images that fit on the FPGA. Segments of size m x n are surrounded by an overlap of width o.
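
The segmentation arithmetic can be sketched as follows, assuming a W x H image, segments of m columns by n rows, and an overlap of width o on all four sides. The function names are hypothetical. The overlap has a clear purpose here. Each iteration of the 3 x 3 update reads one ring of neighbors, so an overlap of o pixels keeps the m x n interior exact for up to o iterations per segment.

```c
/* Number of m x n segments needed to cover a W x H image
   (m columns by n rows per segment; edge segments may be partial). */
int segment_count(int W, int H, int m, int n)
{
    int cols = (W + m - 1) / m;   /* ceil(W / m) */
    int rows = (H + n - 1) / n;   /* ceil(H / n) */
    return cols * rows;
}

/* Pixels transferred to the FPGA per segment: the m x n segment plus an
   overlap of width o on each of its four sides. */
int padded_pixels(int m, int n, int o)
{
    return (m + 2 * o) * (n + 2 * o);
}
```

For a 512 x 512 image with 64 x 64 segments and o = 1, this gives 64 segments of 66 x 66 = 4356 loaded pixels each.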

Image Restoration: Data Flow Strategy
[Figure: m x n segments with overlap o moving between memory and the RFU.]
- Pixels of individual segments are restored in parallel by hardware.
- Restored segments are written back after the overlap is discarded.
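
The write-back step, which discards the overlap, can be sketched as a copy of the block's interior into the output image. The layout is an assumption of the sketch: the restored block is stored row-major as (m + 2o) x (n + 2o), and (sx, sy) is the segment's top-left position in a row-major W x H image. `write_back` is a hypothetical name.

```c
/* Copy the m x n interior of a restored (m + 2o) x (n + 2o) block into the
   image at (sx, sy), discarding the overlap and clipping at the image edges. */
void write_back(const double *block, int m, int n, int o,
                double *image, int W, int H, int sx, int sy)
{
    int bw = m + 2 * o;                       /* row stride of the padded block */
    for (int r = 0; r < n; r++) {
        int iy = sy + r;
        if (iy >= H) break;                   /* partial segment at the bottom edge */
        for (int c = 0; c < m; c++) {
            int ix = sx + c;
            if (ix >= W) break;               /* partial segment at the right edge */
            image[iy * W + ix] = block[(r + o) * bw + (c + o)];
        }
    }
}
```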

Image Restoration Example
[Figure: the degraded image and the restored image.]

System Components
[Figure: block diagram with the CPU, the RFU, data memory, instruction memory (program) holding CPU instructions and RFUOPs, configuration memory holding configuration bits, a program manager, an RFU manager, a placement engine, a cache manager, and a prefetch/branch prediction unit. Links are labeled control and data.]

Online Placement: Problem Definition
Input:
- RFU dimensions: (W, H)
- List of RFUOP events: (w, h, arrival, departure)
Output, for each module, either:
- Rejected (could not be placed) [penalty?], or
- Accepted: (x, y)
[Figure: modules arriving and departing over time, some accepted and some rejected.]
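
A minimal C sketch makes the problem definition concrete. It shows the input records and the constraint that every accepted placement must satisfy. Three choices are assumptions made for illustration: the struct layout, the half-open residency interval [arrival, departure), and (x, y) as the module's bottom-left corner. `placement_is_valid` is a hypothetical checker and is not part of the authors' system.

```c
#include <stdbool.h>

/* An RFUOP event plus its placement outcome. */
typedef struct {
    int w, h;                 /* module dimensions */
    int arrival, departure;   /* resident during [arrival, departure) (assumed) */
    bool accepted;            /* false if rejected */
    int x, y;                 /* bottom-left corner; meaningful only if accepted */
} Rfuop;

static bool overlap1d(int a0, int a1, int b0, int b1) { return a0 < b1 && b0 < a1; }

/* True if every accepted module lies inside the W x H RFU and no two accepted
   modules share any cell while both are resident. */
bool placement_is_valid(const Rfuop *ops, int count, int W, int H)
{
    for (int i = 0; i < count; i++) {
        const Rfuop *a = &ops[i];
        if (!a->accepted) continue;
        if (a->x < 0 || a->y < 0 || a->x + a->w > W || a->y + a->h > H)
            return false;
        for (int j = i + 1; j < count; j++) {
            const Rfuop *b = &ops[j];
            if (!b->accepted) continue;
            if (overlap1d(a->arrival, a->departure, b->arrival, b->departure) &&
                overlap1d(a->x, a->x + a->w, b->x, b->x + b->w) &&
                overlap1d(a->y, a->y + a->h, b->y, b->y + b->h))
                return false;
        }
    }
    return true;
}
```

Viewed this way, each accepted RFUOP is a box in (x, y, t) space, and validity means no two boxes intersect.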

Online Placement
When a new RFUOP arrives:
- Is there enough room?
- If yes, which location is best?
Previous work:
- Bin-packing heuristics (1-D), O(n^2): First Fit, Best Fit, Shelf, Look-ahead, ...
- [Chazelle'83] The Bottom-Left heuristic, O(n^2)
- [Healy-Creavin'97] O(n^2 lg n)
[Figure: current placement + new module to be inserted = ?]

Our Online Placement
Our approach:
- Divide the empty space into explicit "empty rectangles" (ERs).
When a new RFUOP arrives:
- Is there enough room? (Is any ER large enough?)
- If yes, which location is best? (Which ER is best?)
- Packing rule: Best Fit, Bottom Left, First Fit

Heuristics for Choosing an Empty Rectangle
[Figure: current placement + new module to be inserted = ?, with two candidate empty rectangles A and B and points P1 and P2.]
- BF (Best Fit): places the new module in the empty rectangle that wastes less space. Here Area(A) < Area(B), so A is chosen.
- FF (First Fit): either A or B could be chosen for placing the new module.
- BL (Bottom Left): chooses the empty rectangle that lies further to the bottom left. Here y(P2) < y(P1), so B is chosen.
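
The three packing rules can be sketched as one selection function over a list of candidate ERs. The sketch makes two assumptions. The module goes in the chosen ER's bottom-left corner, and Bottom Left compares the ERs' bottom-left corners (lowest y, then lowest x). `Rect`, `PackRule`, and `choose_er` are hypothetical names.

```c
typedef struct { int x, y, w, h; } Rect;   /* (x, y): bottom-left corner */
typedef enum { FIRST_FIT, BEST_FIT, BOTTOM_LEFT } PackRule;

/* Index of the ER chosen for a w x h module, or -1 if no ER is large enough. */
int choose_er(const Rect *ers, int count, int w, int h, PackRule rule)
{
    int best = -1;
    for (int i = 0; i < count; i++) {
        const Rect *r = &ers[i];
        if (r->w < w || r->h < h) continue;       /* module does not fit */
        if (rule == FIRST_FIT) return i;          /* first ER in list order */
        if (best < 0) { best = i; continue; }
        const Rect *b = &ers[best];
        if (rule == BEST_FIT) {
            /* module area is fixed, so less waste means a smaller ER */
            if ((long)r->w * r->h < (long)b->w * b->h) best = i;
        } else {                                  /* BOTTOM_LEFT */
            if (r->y < b->y || (r->y == b->y && r->x < b->x)) best = i;
        }
    }
    return best;
}
```

With ERs {0,0,6,6}, {5,2,4,4}, {0,0,2,2} and a 3 x 3 module, First Fit and Bottom Left return ER 0. Best Fit returns the smaller ER 1.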

Our Online Placement (cont.)
Managing the empty space:
- Keep the empty rectangles explicitly; use a range tree to store and access them.
- Efficient use of RFU real estate: KAMER (Keep All Maximal Empty Rectangles) keeps all O(n^2) maximal empty rectangles.

Keeping All Empty Rectangles
[Figure]
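
A brute-force grid model shows what "all maximal empty rectangles" means. It treats the RFU as a grid of cells and tests every empty rectangle for maximality. This is a swapped-in illustration, not KAMER itself. The slides' approach keeps the rectangles explicitly and stores them in a range tree, whereas this sketch simply enumerates them, which is far too slow for real use. The grid limits and names are hypothetical.

```c
#include <stdbool.h>

#define GW 16   /* hypothetical maximum grid width  */
#define GH 16   /* hypothetical maximum grid height */

typedef struct { int x0, y0, x1, y1; } Box;   /* cells [x0,x1) x [y0,y1) */

/* occ[y][x] is true when cell (x, y) is covered by a placed module. */
static bool all_empty(bool occ[GH][GW], int x0, int y0, int x1, int y1)
{
    for (int y = y0; y < y1; y++)
        for (int x = x0; x < x1; x++)
            if (occ[y][x]) return false;
    return true;
}

/* Brute force: report every empty rectangle of the W x H grid that cannot grow
   by one row or column in any direction. Returns the count and writes at most
   max_out boxes to out. */
int maximal_empty_rects(bool occ[GH][GW], int W, int H, Box *out, int max_out)
{
    int count = 0;
    for (int y0 = 0; y0 < H; y0++)
        for (int y1 = y0 + 1; y1 <= H; y1++)
            for (int x0 = 0; x0 < W; x0++)
                for (int x1 = x0 + 1; x1 <= W; x1++) {
                    if (!all_empty(occ, x0, y0, x1, y1)) continue;
                    bool grows = (x0 > 0 && all_empty(occ, x0 - 1, y0, x0, y1))
                              || (x1 < W && all_empty(occ, x1, y0, x1 + 1, y1))
                              || (y0 > 0 && all_empty(occ, x0, y0 - 1, x1, y0))
                              || (y1 < H && all_empty(occ, x0, y1, x1, y1 + 1));
                    if (grows) continue;
                    if (count < max_out) out[count] = (Box){ x0, y0, x1, y1 };
                    count++;
                }
    return count;
}
```

Take a 4 x 4 grid whose bottom-left 2 x 2 cells are occupied. The sketch finds exactly two maximal empty rectangles: the right strip [2,4) x [0,4) and the top strip [0,4) x [2,4).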

Our Online Placement (cont.)
Managing the empty space, continued:
- Fast but sub-optimal: keep only O(n) empty rectangles, using rules such as Shorter Segment (SSEG), Square Empty Rectangles (SQR), ...

Keeping O(n) Empty Rectangles: SSEG
[Figure]

Heuristics for Choosing a Segment
[Figure: the empty space around a placed module can be divided by extending either segment S1 or segment S2. One choice yields empty rectangles A and B, the other yields C and D.]
- SSEG (Shorter Segment): chooses the shorter of the two segments (example: S1 < S2).
- LSEG (Longer Segment): chooses the longer of the two segments.
- BER (Balanced Empty Rectangles): chooses the segment that creates the smaller area difference (example: Area(B) - Area(A) > Area(D) - Area(C)).
- LER (Large Empty Rectangles): chooses the segment that creates the larger empty rectangle (example: Area(B) > Area(D)).
- LSQR (Larger Rectangle Square): chooses the segment whose larger rectangle is closer to a square (example: AspectRatio(B) > AspectRatio(D)).
- SQR (Square Rectangles): chooses the segment that creates empty rectangles closer to squares (example: max{AR(A), AR(B)} < max{AR(C), AR(D)}, where AR = aspect ratio).
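
The six segment rules can be sketched for the basic case of a w x h module placed in the bottom-left corner of a W x H empty rectangle. The leftover L-shaped space is split by extending either the horizontal or the vertical segment from the module's top-right corner. Several pieces are assumptions of this sketch: that geometry, the aspect-ratio definition (long side over short side, so smaller means closer to square), and breaking ties toward the horizontal cut. All names are hypothetical.

```c
#include <stdbool.h>

typedef struct { int w, h; } Size;
typedef enum { SSEG, LSEG, BER, LER, LSQR, SQR } SegRule;

static long area(Size r) { return (long)r.w * r.h; }

/* Long side / short side (>= 1); a degenerate rectangle counts as 1. */
static double aspect(Size r)
{
    if (r.w == 0 || r.h == 0) return 1.0;
    return r.w > r.h ? (double)r.w / r.h : (double)r.h / r.w;
}

static double max2(double a, double b) { return a > b ? a : b; }

/* Empty rectangles left after the cut; out[0] is the smaller, out[1] the larger.
   Horizontal: segment runs right to the ER edge. Vertical: segment runs up. */
static void split(int W, int H, int w, int h, bool horizontal, Size out[2])
{
    Size p = horizontal ? (Size){ W - w, h } : (Size){ w, H - h };
    Size q = horizontal ? (Size){ W, H - h } : (Size){ W - w, H };
    if (area(p) <= area(q)) { out[0] = p; out[1] = q; }
    else                    { out[0] = q; out[1] = p; }
}

/* True: cut along the horizontal segment (length W - w);
   false: cut along the vertical segment (length H - h). */
bool choose_horizontal(int W, int H, int w, int h, SegRule rule)
{
    Size a[2], b[2];
    split(W, H, w, h, true, a);
    split(W, H, w, h, false, b);
    switch (rule) {
    case SSEG: return W - w <= H - h;
    case LSEG: return W - w >= H - h;
    case BER:  return area(a[1]) - area(a[0]) <= area(b[1]) - area(b[0]);
    case LER:  return area(a[1]) >= area(b[1]);
    case LSQR: return aspect(a[1]) <= aspect(b[1]);
    case SQR:  return max2(aspect(a[0]), aspect(a[1])) <= max2(aspect(b[0]), aspect(b[1]));
    }
    return true;
}
```

Consider a 4 x 2 module in a 10 x 10 ER. The horizontal segment (length 6) is shorter than the vertical one. SSEG, LER, and LSQR pick the horizontal cut, while LSEG, BER, and SQR pick the vertical one.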

How Good Is a Placement?
- Acceptance rate: percentage of modules accepted (placed).
- Volume penalty:
  - Area reflects complexity.
  - Time span in the system reflects loop iterations.
  - Penalty of rejecting a module: penalty = volume = area * time.
- Input data:
  - Randomly generated dimensions
  - Randomly generated enter/leave times
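
Both metrics are simple to compute once each module's outcome is known. A minimal sketch follows. It assumes that a module's volume is w * h * (departure - arrival) and that the penalty is the plain sum over rejected modules. `Outcome` and `evaluate` are hypothetical names.

```c
typedef struct {
    int w, h;                 /* module dimensions */
    int arrival, departure;   /* time span in the system */
    int accepted;             /* 1 if placed, 0 if rejected */
} Outcome;

/* Volume of a module: area times time span. */
static long volume(const Outcome *m)
{
    return (long)m->w * m->h * (m->departure - m->arrival);
}

/* Acceptance rate in percent and total volume penalty of rejected modules. */
void evaluate(const Outcome *mods, int n, double *accept_pct, long *penalty)
{
    int accepted = 0;
    *penalty = 0;
    for (int i = 0; i < n; i++) {
        if (mods[i].accepted) accepted++;
        else *penalty += volume(&mods[i]);
    }
    *accept_pct = n > 0 ? 100.0 * accepted / n : 0.0;
}
```

Take four modules in which a 4 x 4 module living 2 time units and a 1 x 1 module living 10 time units are rejected. The result is a 50% acceptance rate and a penalty of 32 + 10 = 42.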

Program Snapshot
[Figure]

Online Placement Results
[Figure: percentage of accepted modules using different bin-packing and empty-space partitioning rules.]

Online Placement Results (cont.)
[Figures]

3-D Floorplanning
[Figure: a DFG and its schedule on the CPU and the RFU, drawn in (x, y, t) space, with RFU area on two axes and time on the third.]

3-D Floorplanning (cont.)
[Figure: DFG and schedule on the CPU and the RFU.] By deleting this RFUOP (so that the CPU performs the operation)...

3-D Floorplanning (cont.)
[Figure: DFG and schedule on the CPU and the RFU.] This RFUOP can be moved on the RFU.

3-D Floorplanning (cont.)
[Figure: DFG and schedule on the CPU and the RFU.] These RFUOPs can be performed earlier...

Our Current 3-D Floorplanners
- No change in the schedule: the insertion and deletion times of RFUOPs are fixed.
- Annealing-based:
  - Move set:
    - Move an operation from the CPU set to the RFU set
    - Move an operation from the RFU set to the CPU set
    - Displace an already placed RFUOP on the RFU
  - Cost function: penalty for rejecting modules (sum of the volumes of the RFUOPs in the CPU set)
  - No overlap allowed during annealing
- Greedy:
  - Sort the modules by decreasing volume, then apply KAMER
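
The cost function and the effect of each annealing move on it can be sketched as follows. The sketch rests on three assumptions. Volumes are precomputed. Overlap checking for a new position is left to a placement routine that is not shown. Only the zero-temperature acceptance rule appears. A standard annealer would also accept a cost increase Δ with probability exp(-Δ/T). All names are hypothetical.

```c
#include <stdbool.h>

typedef enum { CPU_TO_RFU, RFU_TO_CPU, DISPLACE } Move;

/* Cost of a solution: sum of the volumes of the RFUOPs in the CPU set. */
long cpu_set_cost(const long *volume, const bool *on_rfu, int n)
{
    long cost = 0;
    for (int i = 0; i < n; i++)
        if (!on_rfu[i]) cost += volume[i];
    return cost;
}

/* Change in cost if move m is applied to RFUOP i. Whether the RFUOP fits at its
   new position (no overlap) is assumed to be checked elsewhere. */
long move_delta(const long *volume, const bool *on_rfu, int i, Move m)
{
    switch (m) {
    case CPU_TO_RFU: return on_rfu[i] ? 0 : -volume[i];
    case RFU_TO_CPU: return on_rfu[i] ? volume[i] : 0;
    case DISPLACE:   return 0;   /* stays on the RFU; only its position changes */
    }
    return 0;
}

/* Zero-temperature (greedy) acceptance: keep a move only if the cost does not rise. */
bool accept_zero_temp(long delta) { return delta <= 0; }
```

A displace move never changes the cost directly. Its value is in making room so that a later CPU-to-RFU move becomes feasible.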

Our Current 3-D Floorplanners (cont.)
- KAMER-BF-Decreasing (KAMER-BFD):
  - Sort the modules by volume (decreasing).
  - Use KAMER with Best Fit to find a fast placement of the modules.
- Low-temperature annealing (LTSA):
  - Similar to KAMER-BFD, but use KAMER to place only the X% largest modules.
  - Use low-temperature annealing to place the rest.
- Zero-temperature annealing (ZTSA), i.e. greedy:
  - Use KAMER to place as many modules as possible.
  - Use only the "displace" and "move from CPU to RFU" annealing moves.
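
KAMER-BFD and LTSA share an ordering step: a sort on decreasing volume. The sketch below uses the C library's `qsort`. Rounding X% of the modules down to a whole count is an assumption, and `Mod` and `sort_and_split` are hypothetical names.

```c
#include <stdlib.h>

typedef struct { int id; long volume; } Mod;

/* qsort comparator: larger volume first. */
static int by_volume_desc(const void *pa, const void *pb)
{
    const Mod *a = pa, *b = pb;
    return (a->volume < b->volume) - (a->volume > b->volume);
}

/* Sort modules by decreasing volume and return how many of the largest ones
   make up the first pct percent (the share LTSA would place with KAMER). */
int sort_and_split(Mod *mods, int n, int pct)
{
    qsort(mods, (size_t)n, sizeof *mods, by_volume_desc);
    return (n * pct) / 100;
}
```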

Our Current 3-D Floorplanners (cont.)
BFOP (Best Fit Online Placement):
- Sort the RFUOPs by volume (decreasing).
- For each RFUOP, find candidate "corners".
- Choose the corner that results in the minimum wasted area (similar to the well-studied 2-D bin-packing problem).
[Figure: a floor of the (x, y, t) space corresponding to time t1, with candidate corners marked.]

Annealing-Based Offline vs. Online
[Figure: percentage of accepted modules and penalties using two offline parameters.]
The higher the RFU acceptance rate and the lower the penalty, the better the algorithm.

Offline Placement Results: All
[Figure]

Flexible Modules
- Library of soft templates:
  - Flexible shapes: constant area, different width and height.
  - Problem? Hard to build (physical design must be done for each shape).
  - Median: use the same area, but a square shape.
  - Rotation.
- Placement method:
  - Use the best shape (minimum wasted area).
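
Choosing the best shape from a library of soft templates can be sketched directly. The sketch tries every shape of the given area in every empty rectangle and keeps the pair that wastes the least space. For illustration, it assumes every integer w x h factorization of the area is an available shape. A real library would hold only the few shapes someone built, which is the slide's "hard to build" problem. Names are hypothetical.

```c
typedef struct { int w, h; } Size;
typedef struct { int er; int w, h; } Choice;   /* er == -1: nothing fits */

/* Try every integer w x h with w * h == area in every empty rectangle and return
   the fitting (ER, shape) pair that wastes the least ER area. */
Choice best_flexible_fit(const Size *ers, int n, int area)
{
    Choice best = { -1, 0, 0 };
    long best_waste = -1;
    for (int w = 1; w <= area; w++) {
        if (area % w != 0) continue;          /* not an exact factorization */
        int h = area / w;
        for (int i = 0; i < n; i++) {
            if (ers[i].w < w || ers[i].h < h) continue;
            long waste = (long)ers[i].w * ers[i].h - area;
            if (best_waste < 0 || waste < best_waste) {
                best_waste = waste;
                best = (Choice){ i, w, h };
            }
        }
    }
    return best;
}
```

Take ERs of 2 x 6 and 4 x 5 and a module of area 12. The sketch picks the 2 x 6 shape in the first ER with no waste. A rigid 3 x 4 module, even rotated, could only go in the 4 x 5 ER.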

Using Flexible Modules in BFOP
[Figure] Median uses a square module with the same area.

Flexible Modules (cont.)
"Firm" templates:
- Slice the module into x horizontal or vertical strips.
- If the module cannot be placed, use the 2-split, 3-split, ... until it fits.
Problem?
- Routing!
- Only limited module types can be split (such as carry chains, with minimal communication between stages).
[Figure: a vertical 3-split.]
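
The k-split search can be sketched as follows. The sketch makes three assumptions. Strips are vertical, of nearly equal width, and keep the module's full height. Each strip must go into its own free rectangle. The free rectangle is chosen greedily (first fit, widest strip first). The routing between strips, the slide's main concern, is ignored, and all names are hypothetical.

```c
#include <stdbool.h>

typedef struct { int w, h; } Size;

/* Width of strip s when width w is cut into k nearly equal vertical strips:
   the first w % k strips get one extra column. */
static int strip_width(int w, int k, int s) { return w / k + (s < w % k ? 1 : 0); }

/* Greedy first fit: give each strip (widest first) its own free rectangle. */
static bool strips_fit(const Size *free_rects, int n, int w, int h, int k)
{
    bool used[64] = { false };
    if (n > 64) n = 64;                       /* sketch limit on free rectangles */
    for (int s = 0; s < k; s++) {
        int sw = strip_width(w, k, s);
        int i = 0;
        while (i < n && (used[i] || free_rects[i].w < sw || free_rects[i].h < h)) i++;
        if (i == n) return false;
        used[i] = true;
    }
    return true;
}

/* Smallest k in 1..max_k whose k-split fits, or -1 if none does. */
int min_split(const Size *free_rects, int n, int w, int h, int max_k)
{
    for (int k = 1; k <= max_k && k <= w; k++)
        if (strips_fit(free_rects, n, w, h, k)) return k;
    return -1;
}
```

Take a 4 x 2 module and two free 2 x 2 rectangles. The unsplit module does not fit, but the 2-split does, so `min_split` returns 2.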

Quality Improvements Using Firm Templates
[Figure]

Conclusion
- Which online algorithm? If speed is an issue, SSEG; otherwise KAMER.
- Online or offline? If you have the schedule, offline.
- Which offline algorithm? BFOP is the best (faster and better quality).
- Median? Flexibility? Firm templates?
  - Surprisingly, median gives little improvement.
  - If flexible shapes are available, they are better than splitting (no additional routing problem).
  - How many splits? Going from no split to a 2-split gives a 23% improvement; going from a 5-split to a 6-split gives only a 3% improvement.