Data Partitioning for Reconfigurable Architectures with Distributed Block RAM. Wenrui Gong, Gang Wang, Ryan Kastner. Department of Electrical and Computer Engineering, University of California, Santa Barbara. {gong, wanggang, June 10, 2005
3/22/2005 GONG et al: Data Partitioning for Reconfigurable Architectures with Distributed Block RAM
What are we dealing with? Mapping high-level programs onto FPGA-based reconfigurable computing architectures with distributed block RAM modules. Objective: improve the utilization of available storage resources, optimize system performance, and meet design goals.
Outline: Target architectures; Data partitioning problem; Memory optimizations; Experimental results; Concluding remarks.
Target Architecture: FPGA-based fine-grained reconfigurable computing architecture with distributed block RAM modules.
Memory Access Latencies: Memory access latency comprises the block RAM access delay and the propagation delay. Propagation delays vary: one clock cycle to access nearby data, two or more to access data far from the CLB. Before physical synthesis it is difficult to tell which block RAM modules are near and which are remote, which makes this problem harder than traditional data partitioning in parallelizing compilers for NUMA machines.
Outline: Target architectures; Data partitioning problem (problem formulation, data partitioning algorithm); Memory optimizations; Experimental results; Concluding remarks.
Problem Formulation: Inputs: an l-level nested loop L; a set of n data arrays N; an architecture with m block RAM modules M. Assumptions: index expressions of array references are affine functions of the loop indices; there are no indirect array references or similar pointer operations; all data arrays are assigned to block RAM modules; no data is duplicated.
Problem Formulation (cont'd): Partitioning problem: partition the data arrays N into a set of data portions P, and seek an assignment from P to the block RAM modules M. Constraints: 1) the hardware resource constraint; 2) the capacity constraint of each block RAM module; 3) all data arrays are assigned to block RAM, and each data element is assigned to one and only one block RAM module. Objective: minimize the total execution time (or maximize the system throughput) under the above constraints.
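A minimal sketch of how constraints 2 and 3 could be checked for a candidate assignment. The representation is an assumption made for illustration: each array is flattened, each data portion is a half-open range `(array, start, stop, bram)`, and every block RAM module has the same capacity in elements. The function name `check_partition` is hypothetical, and the hardware resource constraint (constraint 1) is not modeled.

```python
from collections import defaultdict


def check_partition(array_sizes, portions, num_brams, bram_capacity):
    """Return a list of constraint violations (empty if the assignment is feasible).

    array_sizes:   {array name: number of elements}            (assumed layout)
    portions:      list of (array name, start, stop, bram id)  (half-open ranges)
    num_brams:     number of block RAM modules m
    bram_capacity: elements per module, assumed identical for all modules
    """
    problems = []
    load = [0] * num_brams
    ranges = defaultdict(list)

    for name, start, stop, bram in portions:
        if not 0 <= bram < num_brams:
            problems.append(f"{name}: portion mapped to unknown BRAM {bram}")
            continue
        load[bram] += stop - start
        ranges[name].append((start, stop))

    # Constraint 2: capacity of each block RAM module.
    for bram, used in enumerate(load):
        if used > bram_capacity:
            problems.append(f"BRAM {bram} over capacity: {used} > {bram_capacity}")

    # Constraint 3: every element is assigned to exactly one module,
    # i.e. the sorted ranges tile [0, size) with no gap and no overlap.
    for name, size in array_sizes.items():
        cursor, tiled = 0, True
        for start, stop in sorted(ranges[name]):
            if start != cursor:
                tiled = False
                break
            cursor = stop
        if not tiled or cursor != size:
            problems.append(f"{name}: elements not covered exactly once")

    return problems
```

For example, splitting an 8-element array into two 4-element portions on two modules of capacity 4 yields no violations, while overlapping or oversized portions are reported.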
Overview of Data Partitioning Algorithm: Code analysis determines the possible partitioning directions. Architectural-level synthesis (resource allocation, scheduling, and binding) discovers the design properties. Granularity adjustment uses an empirical cost function to estimate performance.
Code Analysis: Calculate the iteration space IS(L). Calculate the data space DS(Ni) of each array. Obtain the data access footprint F using the affine functions of the loop indices. Analyze F and IS(L) to obtain a set of possible partitioning directions.
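The footprint analysis can be illustrated with a small enumeration-based sketch. This is not the analysis used in the talk, which is not spelled out on the slide; it simply enumerates a small iteration space, applies the affine index functions, and measures how much the footprints of the two halves overlap when the iteration space is cut along each loop dimension. The function names and the encoding of references as `(coefficient rows, offsets)` are assumptions.

```python
from itertools import product


def footprint(bounds, refs):
    """Set of array elements touched by all references over an iteration space.

    bounds: list of half-open (lo, hi) ranges, one per loop index
    refs:   list of (rows, offsets); subscript k of a reference is
            sum(rows[k][j] * index[j]) + offsets[k]   (affine in the loop indices)
    """
    touched = set()
    for point in product(*(range(lo, hi) for lo, hi in bounds)):
        for rows, offsets in refs:
            touched.add(tuple(
                sum(c * i for c, i in zip(row, point)) + off
                for row, off in zip(rows, offsets)
            ))
    return touched


def overlap_by_direction(bounds, refs):
    """Cut the iteration space in half along each loop dimension and count
    the elements shared by the two halves; 0 means the cut needs no sharing."""
    result = {}
    for d, (lo, hi) in enumerate(bounds):
        mid = (lo + hi) // 2
        first, second = list(bounds), list(bounds)
        first[d], second[d] = (lo, mid), (mid, hi)
        result[d] = len(footprint(first, refs) & footprint(second, refs))
    return result


# Loop nest: for i in 0..3, for j in 0..3, reading A[i][j] and A[i+1][j].
identity = [[1, 0], [0, 1]]
refs = [(identity, [0, 0]), (identity, [1, 0])]
overlaps = overlap_by_direction([(0, 4), (0, 4)], refs)
```

Here `overlaps` is `{0: 4, 1: 0}`: cutting along `i` makes both halves touch row 2 of `A`, while cutting along `j` gives disjoint footprints, so `j` is the more attractive partitioning direction for this pair of references.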
Architectural-level Synthesis: Synthesize and pipeline the innermost loop body, and collect the execution time T, the initiation interval II, and the resource utilizations u_m, u_r, and u_a.
Granularity Adjustment: For each possible partitioning direction, evaluate different granularities to find the best performance. Calculate the finest and coarsest grain for a homogeneous partitioning. Finest: as few iterations as possible per block RAM module, using all block RAM modules. Coarsest: as few block RAM modules as possible. Estimate the global memory accesses m_r, the total memory accesses m_t, and their ratio. Use the cost function to estimate the execution time.
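A rough sketch of the granularity sweep for one partitioning direction, under assumptions that are not in the slides: the data is a 2D array split row-wise into equal blocks, every iteration reads its own row and its neighbors within `radius` rows, and an access is counted as global when the neighbor row lives in a different block. Accesses are counted per row rather than per element, which leaves the ratio m_r / m_t unchanged. The cost function itself is not reproduced here because the slides do not give its form.

```python
import math


def grain_bounds(rows, row_words, bram_words, num_brams):
    """Coarsest and finest block counts for a homogeneous row-wise split."""
    rows_per_bram = bram_words // row_words       # whole rows that fit in one module
    if rows_per_bram == 0:
        raise ValueError("a single row does not fit in one block RAM module")
    coarsest = math.ceil(rows / rows_per_bram)    # as few modules as possible
    finest = num_brams                            # spread over all modules
    if coarsest > finest:
        raise ValueError("data does not fit in the available block RAM")
    return coarsest, finest


def remote_ratio(rows, blocks, radius=1):
    """Ratio m_r / m_t of global to total row accesses for a stencil of given radius."""
    per_block = math.ceil(rows / blocks)
    total = remote = 0
    for r in range(radius, rows - radius):        # rows with a full neighborhood
        for dr in range(-radius, radius + 1):
            total += 1
            if (r + dr) // per_block != r // per_block:
                remote += 1                       # neighbor row is in another block
    return remote / total


# Assumed example sizes: 64 rows of 64 words, 1024-word modules, 16 modules.
coarsest, finest = grain_bounds(64, 64, 1024, 16)
ratios = {k: remote_ratio(64, k) for k in range(coarsest, finest + 1)}
```

With these assumed sizes the sweep runs from 4 to 16 blocks, and the global-access ratio grows as the grain gets finer, which is the trade-off against parallelism that a cost function would have to weigh.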
Cost Function: An empirical formulation based on our architectural-level synthesis results. It estimates the initiation interval of pipelined designs, favors memory accesses to nearby block RAM modules, and accounts for the effect of resource utilization and granularity on the initiation interval.
Outline: Target architectures; Data partitioning problem; Memory optimizations (scalar replacement, data prefetching); Experimental results; Concluding remarks.
Scalar Replacement: Scalar replacement increases data reuse and reduces memory accesses. When a memory location was already accessed in the previous iteration, use the contents already held in registers rather than accessing memory again.
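The effect of scalar replacement can be shown on a small example. The sketch below is illustrative only: `CountingMemory` is a hypothetical stand-in for a block RAM that counts reads, and the loop is a simple two-tap sum, not one of the benchmarks from the talk.

```python
class CountingMemory:
    """Hypothetical memory model that counts read accesses."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def read(self, index):
        self.reads += 1
        return self.data[index]


def two_tap_naive(mem, n):
    # Every iteration reads both a[i-1] and a[i] from memory.
    return [mem.read(i - 1) + mem.read(i) for i in range(1, n)]


def two_tap_scalar_replaced(mem, n):
    # a[i-1] was already read as a[i] in the previous iteration,
    # so it is carried in a local variable (a register in hardware).
    out = []
    prev = mem.read(0)
    for i in range(1, n):
        cur = mem.read(i)
        out.append(prev + cur)
        prev = cur
    return out


naive_mem = CountingMemory(range(8))
replaced_mem = CountingMemory(range(8))
naive_out = two_tap_naive(naive_mem, 8)
replaced_out = two_tap_scalar_replaced(replaced_mem, 8)
```

Both versions produce the same output, but for 8 elements the naive loop issues 14 reads while the scalar-replaced loop issues 8.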
Data Prefetching and Buffer Insertion: Buffer insertion shortens critical paths and improves the achievable clock frequency. Schedule each global memory access one cycle earlier to reduce the length of the critical path.
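A software analogy of the prefetching idea, assuming a loop that combines one value from a remote block RAM with one local value per iteration. The remote access for iteration i + 1 is issued during iteration i and held in a one-entry buffer, so the computation never waits on a remote access issued in the same step. The function names and the multiply operation are hypothetical; in hardware the buffer would be a register inserted on the long path.

```python
def process_direct(remote, local):
    # Each iteration fetches the remote value in the same step that uses it.
    return [remote[i] * local[i] for i in range(len(local))]


def process_prefetched(remote, local):
    out = []
    n = len(local)
    if n == 0:
        return out
    buffered = remote[0]              # prologue: first remote access issued ahead of the loop
    for i in range(n):
        current = buffered            # value is already waiting in the buffer
        if i + 1 < n:
            buffered = remote[i + 1]  # issue the next remote access one iteration early
        out.append(current * local[i])
    return out
```

The two functions compute the same result; only the point at which each remote access is issued changes.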
Experimental Setup: Target architecture: Xilinx Virtex II FPGA. Target frequency: 150 MHz. Benchmarks: image processing and DSP applications, namely SOBEL edge detection, bilinear filtering, 2D Gauss blurring, 1D Gauss filter, and SUSAN principle.
Results: Architectural Exploration: Correlation bank: different partitions of the array S deliver a wide variety of candidate solutions, with quite different overall performance after synthesis and physical design.
Results: Execution Time: The average speedup is 2.75 times; after further optimizations, the average speedup is 4.80 times.
Results: Achievable Clock Frequencies: The clock frequencies of the partitioned designs are about 10 percent lower than those of the original designs. After optimizations, they are about 7 percent higher than those of the partitioned designs.
Concluding Remarks: A data and iteration space partitioning approach for homogeneous block RAM modules, integrated with existing architectural-level synthesis techniques, that parallelizes input designs and dramatically improves system performance. Future work: irregular memory accesses; heterogeneous block RAM modules.
Thank You: Prof. Ryan Kastner and Gang Wang; reviewers; all audiences.
Questions