Slide 1: Storage Assignment during High-level Synthesis for Configurable Architectures
Wenrui Gong, Gang Wang, Ryan Kastner
Department of Electrical and Computer Engineering, University of California, Santa Barbara
{gong, wanggang, kastner}@ece.ucsb.edu
http://express.ece.ucsb.edu
November 7, 2005
11/7/2005, GONG et al: Storage Assignment

Slide 2: What are we dealing with?
- FPGA-based reconfigurable architectures with distributed block RAM modules
- Synthesizing high-level programs into designs for these architectures
- (Figure: block RAM modules and configurable logic blocks)
Slide 3: Options of Storage Assignment
- Given the same storage and logic resources, different storage assignments exist
- (Figure: two alternative designs, each with control logic, a datapath, and a MUX)
Slide 4: Objective
- Different storage arrangements achieve different performance
- Objective: achieve the best performance (throughput) under the resource constraints, improve resource utilization, and meet design goals (timing, clock frequency, etc.)
Slide 5: Outline
- Target architectures
- Data partitioning problem
- Memory optimizations
- Experimental results
- Concluding remarks
Slide 7: Target Architecture
- An FPGA-based, fine-grained reconfigurable computing architecture with distributed block RAM modules
Slide 8: Memory Access Latencies
- Memory access delay = BRAM access delay + interconnect delay
- The BRAM access time is fixed by the architecture
- Interconnect delays are variable: one clock cycle to access nearby data, but two or more cycles to access data far from the CLB
- This makes it difficult to estimate execution time precisely
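The latency formula on this slide can be sketched as a toy model. The cycle counts below are illustrative assumptions, not values measured on the target architecture:

```python
# Toy model of the slide's formula:
#   memory access delay = BRAM access delay + interconnect delay.
# Delay parameters are illustrative assumptions, not measured values.

def access_latency_cycles(distance, bram_delay=1, cycles_per_hop=1):
    """Estimate cycles to access a BRAM `distance` hops from the CLB.

    distance == 0 models a nearby BRAM; each additional hop adds an
    interconnect cycle, matching the slide's "one clock cycle to access
    near data, or two or even more" for far-away data.
    """
    return bram_delay + distance * cycles_per_hop

near = access_latency_cycles(0)   # nearby BRAM: 1 cycle
far = access_latency_cycles(2)    # distant BRAM: 3 cycles
```

The model also shows why estimation is hard: the interconnect term depends on placement, which is only known after layout.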
Slide 9: Outline
- Target architectures
- Data partitioning problem: problem formulation; data partitioning algorithm
- Memory optimizations
- Experimental results
- Concluding remarks
Slide 10: Problem Formulation
- Inputs: an l-level nested loop L, a set of n data arrays N, and an architecture with a set of block RAM modules M
- Partitioning problem: partition the data arrays N into a set of data portions P, and find an assignment from P to the block RAM modules M
- Objective: minimize latency
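The inputs and output of the formulation can be sketched as data structures. The names (`Array`, `assign_portions`) and the round-robin policy are illustrative stand-ins, not the paper's latency-optimizing assignment:

```python
# Minimal sketch of the partitioning problem's shape: arrays N are split
# into portions P, and P is mapped onto BRAM modules M. The greedy
# round-robin policy here is a placeholder for the real objective.

from dataclasses import dataclass

@dataclass
class Array:
    name: str
    size: int  # number of elements

def assign_portions(arrays, num_brams, bram_capacity):
    """Split each array into capacity-sized portions and map the portions
    round-robin onto BRAM modules."""
    assignment = {}  # (array name, portion index) -> BRAM id
    next_bram = 0
    for a in arrays:
        portions = max(1, -(-a.size // bram_capacity))  # ceiling division
        for p in range(portions):
            assignment[(a.name, p)] = next_bram % num_brams
            next_bram += 1
    return assignment

mapping = assign_portions([Array("A", 1024), Array("B", 512)],
                          num_brams=4, bram_capacity=512)
```

Here array A needs two portions (1024 elements, 512 per BRAM) and B needs one, so three of the four BRAM modules are used.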
Slide 11: Overview of the Data Partitioning Algorithm
- Code analysis: determine possible partitioning directions
- Architectural-level synthesis: resource allocation, scheduling, and binding; discover the design's properties
- Granularity adjustment: use an empirical cost function to estimate performance
Slide 12: Code Analysis
- The iteration space and the data spaces
- Index functions determine data access footprints
- (Figure: mapping from the iteration space to the data space of array S)
Slide 13: Iteration/Data Space Partitioning
- A partitioning of the iteration space derives a corresponding partitioning of the data spaces, using the index functions
- Communication-free partitioning: the induced data footprints do not overlap
- (Figure: iteration space mapped to the data space of array S)
Slide 14: Iteration/Data Space Partitioning (continued)
- Communication-efficient partitioning: the data access footprints overlap
- Overlapped footprints are the source of remote memory accesses when the shared data are not placed together
- (Figure: overlapping footprints in the data space of array S)
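The two cases on slides 13 and 14 can be illustrated with toy affine index functions (these functions are examples of mine, not from the paper):

```python
# A block partition of the iteration space induces data footprints via the
# index functions. Disjoint footprints => communication-free partitioning;
# overlapping footprints => remote accesses unless shared data are
# co-located. Index functions below are illustrative.

def footprint(iter_block, index_fn):
    """Set of data elements accessed by a block of iterations."""
    return {x for i in iter_block for x in index_fn(i)}

# Communication-free: iteration i reads only A[i]
point = lambda i: [i]
f0 = footprint(range(0, 8), point)
f1 = footprint(range(8, 16), point)
disjoint = f0.isdisjoint(f1)       # True: no shared data, no communication

# Communication-efficient: a stencil reading A[i] and A[i+1] overlaps
# at the partition boundary
stencil = lambda i: [i, i + 1]
g0 = footprint(range(0, 8), stencil)
g1 = footprint(range(8, 16), stencil)
overlap = g0 & g1                  # element 8 is shared by both blocks
```

The partitioner's job is to pick directions and placements that make such overlaps cheap, i.e. served by nearby BRAM modules.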
Slide 15: Architectural-level Synthesis
- Synthesize the innermost iteration body
- Pipeline the designs
- Collect performance results: execution time T, initiation interval II, and resource utilizations u_mul, u_BRAM, and u_CLB
Slide 16: Estimating the Execution Time
- Resource utilization determines the performance of the pipelined designs
- Execution time is linear in the initiation interval and the granularity
- When more resources are left unoccupied, more operations can be performed simultaneously
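The linear relationship on this slide matches the standard pipelined-loop estimate, sketched below. The formula and parameter values are the textbook model, assumed here rather than taken from the paper:

```python
# Standard pipelined-loop time estimate (assumed, not from the paper):
#   T ≈ (iterations - 1) * II + depth
# so T is linear in the initiation interval II and in the granularity
# (iterations assigned to each partitioned portion).

def pipelined_time(iterations, ii, depth):
    """Cycles for `iterations` loop iterations through a pipeline with
    initiation interval `ii` and pipeline depth `depth` (cycles)."""
    return (iterations - 1) * ii + depth

# Halving the granularity roughly halves the per-portion execution time:
t_full = pipelined_time(1024, ii=2, depth=10)   # one portion
t_half = pipelined_time(512, ii=2, depth=10)    # two portions in parallel
```

This is why finer partitions can win: if portions run on parallel datapaths, total time tracks the per-portion iteration count.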
Slide 17: Granularity Adjustment
- For each possible partitioning direction, check different granularities to find the best performance
- Coarsest: use as few block RAM modules as possible
- (Figure: control logic and datapath)
Slide 18: Granularity Adjustment (continued)
- Finest: distribute the data across all block RAM modules
- (Figure: control logic and datapath)
Slide 19: Cost Function
- An empirical formulation based on our architectural-level synthesis results
- Estimate the number of remote (global) memory accesses m_r and the total memory accesses m_t, and take their ratio
- The factor favors memory accesses to nearby block RAM modules
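The ratio-based cost can be sketched as below. The slide calls the real formulation empirical and does not give it, so the penalty weight here is an assumption of mine:

```python
# Sketch of a ratio-based cost: estimate remote (global) accesses m_r and
# total accesses m_t, and prefer partitionings with a small m_r / m_t.
# The remote_penalty weight is an illustrative assumption; the paper's
# actual formula is empirical and not reproduced here.

def access_cost(m_r, m_t, remote_penalty=2.0):
    """Lower is better: the remote-access fraction, scaled by an assumed
    penalty for accesses that must leave the nearby BRAM module."""
    assert 0 <= m_r <= m_t and m_t > 0
    return remote_penalty * (m_r / m_t)

# A partitioning that keeps accesses local scores lower (better):
local_heavy = access_cost(m_r=10, m_t=1000)
remote_heavy = access_cost(m_r=300, m_t=1000)
```

Any monotone function of m_r / m_t would rank candidates the same way; the weight only matters when trading the ratio against other terms.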
Slide 20: Outline
- Target architectures
- Data partitioning problem
- Memory optimizations: scalar replacement; data prefetching
- Experimental results
- Concluding remarks
Slide 21: Scalar Replacement
- Scalar replacement increases data reuse and reduces memory accesses
- Data already fetched in a previous iteration are kept in registers
- Reuse the contents already in registers rather than accessing memory again
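The transformation on this slide can be shown on a 3-point stencil; the array and function names are illustrative:

```python
# Scalar replacement on a 3-point averaging stencil: values read in the
# previous iteration are carried in scalars ("registers") instead of
# being re-read from memory each iteration.

def smooth_naive(a):
    # 3 memory reads per iteration
    return [(a[i - 1] + a[i] + a[i + 1]) / 3 for i in range(1, len(a) - 1)]

def smooth_scalar_replaced(a):
    out = []
    prev, cur = a[0], a[1]           # carried in scalars across iterations
    for i in range(1, len(a) - 1):
        nxt = a[i + 1]               # only 1 new memory read per iteration
        out.append((prev + cur + nxt) / 3)
        prev, cur = cur, nxt         # rotate registers instead of re-reading
    return out

data = [1.0, 2.0, 3.0, 4.0, 5.0]
# Both versions compute the same result with 3x fewer memory reads
# in the scalar-replaced loop.
```

On the target architecture this turns two of every three BRAM reads into register reads, which is exactly the reuse the slide describes.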
Slide 22: Data Prefetching and Buffer Insertion
- Buffer insertion shortens critical paths and improves clock frequencies
- Schedule each global memory access one cycle earlier (one, two, or more cycles, depending on the chip size and the number of BRAM modules)
- This reduces the length of the critical paths
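The prefetching idea can be sketched as a software analogue; in the actual designs the buffering happens as registers inserted in the netlist, and the names below are mine:

```python
# Toy analogue of the slide's prefetching: issue each global (slow)
# memory read one iteration early and hold the result in a buffer, so
# the compute step never waits on the long interconnect path.

def compute_with_prefetch(mem, n):
    """Run n iterations of `result = 2 * mem(i)` with a one-step prefetch.

    `mem` is a hypothetical memory-read function; each iteration computes
    on the value fetched during the previous iteration.
    """
    out = []
    buf = mem(0)                                  # prefetch for iteration 0
    for i in range(n):
        nxt = mem(i + 1) if i + 1 < n else None   # fetch next, one step early
        out.append(buf * 2)                       # compute on buffered value
        buf = nxt
    return out

result = compute_with_prefetch(lambda i: i + 1, 4)
```

In hardware terms, the buffer register splits the path "BRAM read + interconnect + compute" into two shorter paths, which is what raises the achievable clock frequency.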
Slide 23: Outline
- Target architectures
- Data partitioning problem
- Memory optimizations
- Experimental results
- Concluding remarks
Slide 24: Experimental Setup
- Target architecture: Xilinx Virtex II FPGA
- Target frequency: 150 MHz
- Benchmarks: image processing and DSP applications: SOBEL edge detection, bilinear filtering, 2D Gauss blurring, 1D Gauss filter, and the SUSAN principle
Slide 25: Collected Results
- Pre-layout and post-layout timing and area results were collected for three design variants:
- Original: one block RAM module holds the entire data array
- Partitioned: the iteration/data spaces are partitioned under the resource constraints
- Optimized: memory optimizations applied to the partitioned designs
Slide 26: Results: Execution Time
- Under the given resources, the designs were partitioned into 4 portions
- Average speedup from partitioning: 2.75x
- After further memory optimizations: 4.80x faster than the original designs
Slide 27: Results: Achievable Clock Frequencies
- The partitioned designs are about 10 percent slower than the original ones
- After the memory optimizations, clock frequencies are about 7 percent faster than those of the partitioned designs
Slide 28: Outline
- Target architectures
- Data partitioning problem
- Memory optimizations
- Experimental results
- Concluding remarks
Slide 29: Concluding Remarks
- A data and iteration space partitioning approach for homogeneous block RAM modules
- Integrated with existing architectural-level synthesis techniques
- Parallelizes input designs and dramatically improves system performance
Slide 30: Thank You
- Prof. Ryan Kastner and Gang Wang
- The reviewers
- The audience
Slide 31: Questions