Data Partitioning for Reconfigurable Architectures with Distributed Block RAM. Wenrui Gong, Gang Wang, Ryan Kastner. Department of Electrical and Computer Engineering, University of California, Santa Barbara.

Data Partitioning for Reconfigurable Architectures with Distributed Block RAM. Wenrui Gong, Gang Wang, Ryan Kastner. Department of Electrical and Computer Engineering, University of California, Santa Barbara. {gong, wanggang, June 10, 2005

3/22/2005 GONG et al: Data Partitioning for Reconfigurable Architectures with Distributed Block RAM

What are we dealing with?  Mapping high-level programs onto FPGA-based reconfigurable computing architectures with distributed block RAM modules  Objective: improve utilization of the available storage resources, optimize system performance, and meet design goals

Outline  Target architectures  Data partitioning problem  Memory optimizations  Experimental results  Concluding remarks

Outline  Target architectures  Data partitioning problem  Memory optimizations  Experimental results  Concluding remarks

Target Architecture  FPGA-based fine-grained reconfigurable computing architecture with distributed block RAM modules

Memory Access Latencies  Memory access latency consists of an access delay plus a propagation delay; the propagation delays are variable.  One clock cycle to access nearby data, two or even more cycles to access data far away from the CLB.  It is difficult to distinguish near accesses from remote ones before physical synthesis.  This makes the problem harder than traditional data partitioning in parallelizing compilation for NUMA machines

Outline  Target architectures  Data partitioning problem  Problem formulation  Data partitioning algorithm  Memory optimizations  Experimental results  Concluding remarks

Problem Formulation  Inputs:  An l-level nested loop L  A set of n data arrays N  An architecture with m block RAM modules M  Assumptions:  Index expressions of array references are affine functions of the loop indices  No indirect array references or other similar pointer operations  All data arrays are assigned to block RAM modules  No duplicated data

Problem Formulation (cont'd)  Partitioning problem: partition the data arrays N into a set of data portions P, and seek an assignment from P to the block RAM modules M.  Constraints:  1) hardware resource constraint  2) capacity constraint of each block RAM module  3) all data arrays are assigned to block RAM, and each data element is assigned to one and only one block RAM module  Objective: minimize the total execution time (or maximize the system throughput) under the above constraints
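As a minimal sketch of the capacity and uniqueness constraints above (the array partitions, sizes, and module capacities are hypothetical illustrative values, not the paper's benchmarks):

```python
# Sketch of the partition-feasibility check: every data portion is
# assigned to exactly one block RAM module, and no module's capacity
# is exceeded. All numbers below are made up for illustration.

def feasible(assignment, partition_size, bram_capacity):
    """assignment: dict mapping partition name -> BRAM module id."""
    used = {}
    for part, module in assignment.items():
        used[module] = used.get(module, 0) + partition_size[part]
    within_capacity = all(used.get(m, 0) <= cap
                          for m, cap in bram_capacity.items())
    modules_exist = all(m in bram_capacity for m in assignment.values())
    return within_capacity and modules_exist

sizes = {"A0": 512, "A1": 512, "B0": 1024}   # words per partition
caps = {0: 1024, 1: 1024, 2: 2048}           # words per BRAM module
ok = feasible({"A0": 0, "A1": 1, "B0": 2}, sizes, caps)
```

A search over assignments would then keep only the feasible ones and rank them with the cost function described later in the deck.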

Overview of Data Partitioning Algorithm  Code analysis to determine possible partitioning directions  Architectural-level synthesis to discover the design properties  Resource allocation, scheduling, and binding  Granularity adjustment  Use an empirical cost function to estimate performance

Code Analysis  Calculate the iteration space IS(L)  Calculate the data space DS(Ni) of each array  Obtain the data access footprint F using the affine functions of the loop indices  Analyze F and IS(L) to obtain a set of possible partitioning directions
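A minimal sketch of the footprint computation, assuming a hypothetical 2-level nest with one affine reference A[i][i+j] (the loop bounds and access function are illustrative, not taken from the paper):

```python
from itertools import product

# Iteration space IS(L) of a hypothetical 2-level nest:
# for i in 0..3: for j in 0..3: ... A[i][i+j] ...
iteration_space = list(product(range(4), range(4)))

# Affine access function f(i, j) = (i, i + j) for the reference A[i][i+j].
def access(i, j):
    return (i, i + j)

# Footprint F: the set of array elements touched by the loop nest.
footprint = {access(i, j) for (i, j) in iteration_space}

# The rows touched hint that a row-wise cut is a candidate
# partitioning direction for this reference.
rows_touched = {r for (r, c) in footprint}
```

Intersecting such footprints across all references then yields the set of partitioning directions that do not split any single iteration's working set.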

Architectural-level Synthesis  Synthesize and pipeline the innermost iteration body, and collect the execution time T, the initiation interval II, and the resource utilizations um, ur, and ua

Granularity Adjustment  For each possible partitioning direction, check different granularities to find the best performance  Calculate the finest and coarsest grain for a homogeneous partitioning  Finest: as few iterations as possible per block RAM module, using all block RAM modules  Coarsest: use as few block RAM modules as possible  Estimate the global memory accesses mr and the total memory accesses mt, and their ratio  Use the cost function to estimate the execution time

Cost Function  An empirical formulation based on our architectural-level synthesis results  Estimates the initiation intervals of pipelined designs  Favors memory accesses to nearby block RAM modules  Different resource utilizations and granularities affect the initiation intervals
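The granularity search driven by a cost estimate can be sketched as follows. The cost model here is a hypothetical stand-in for the paper's empirical function: a pipelined loop costs roughly iterations times II, plus a one-cycle penalty weighted by the fraction of remote accesses:

```python
# Enumerate homogeneous partition grains for a 1-D array and rank them
# with a simple, made-up cost model (not the paper's actual formula).

def candidate_grains(n_elems, n_brams, bram_capacity):
    """Grains from finest (array spread over all BRAM modules) to
    coarsest (largest chunk a single module can hold)."""
    finest = -(-n_elems // n_brams)        # ceiling division
    coarsest = bram_capacity
    return [g for g in range(finest, coarsest + 1) if g <= bram_capacity]

def estimated_time(n_iters, ii, remote_ratio, remote_penalty=1):
    """Pipelined loop: n_iters * II, plus a penalty per remote access."""
    return n_iters * (ii + remote_ratio * remote_penalty)

grains = candidate_grains(n_elems=1024, n_brams=8, bram_capacity=512)
# Toy assumption: coarser grains keep more accesses local.
best = min(grains,
           key=lambda g: estimated_time(1024, ii=2,
                                        remote_ratio=1 - g / 1024))
```

Under this toy model coarser is always better; the real trade-off in the paper also accounts for lost parallelism and resource utilization, which is why the empirical function is calibrated from synthesis results.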

Outline  Target architectures  Data partitioning problem  Memory optimizations  Scalar replacement  Data prefetching  Experimental results  Concluding remarks

Scalar Replacement  Scalar replacement increases data reuse and reduces memory accesses  Data already accessed in a previous iteration  is kept in registers and reused rather than fetched from memory again
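A minimal before/after sketch of scalar replacement on a 1-D smoothing loop (a hypothetical kernel, not one of the paper's benchmarks): after the transformation each iteration performs one memory read instead of three, reusing the two values loaded by the previous iteration:

```python
def smooth_naive(a):
    # Three array reads per iteration: a[i-1], a[i], a[i+1].
    return [a[i - 1] + a[i] + a[i + 1] for i in range(1, len(a) - 1)]

def smooth_scalar_replaced(a):
    # After scalar replacement: one new read per iteration; prev and
    # curr live in "registers" carried across iterations.
    out = []
    prev, curr = a[0], a[1]
    for i in range(1, len(a) - 1):
        nxt = a[i + 1]             # the only memory access this iteration
        out.append(prev + curr + nxt)
        prev, curr = curr, nxt     # rotate the reuse registers
    return out
```

In the FPGA setting each avoided read is one fewer block RAM port access, which is what relaxes the memory bottleneck on the initiation interval.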

Data Prefetching and Buffer Insertion  Buffer insertion shortens critical paths and improves achievable clock frequencies  Schedule each global memory access one cycle earlier  Reduces the length of the critical path
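A behavioral sketch of the prefetch idea, under the assumption that a remote block RAM read costs one extra cycle: issuing the read one iteration early lets it overlap with the current iteration's compute, as in this software-pipelined loop (the function and data are illustrative):

```python
def process_with_prefetch(data, f):
    """Software-pipelined loop: the value for iteration i+1 is fetched
    into a buffer while iteration i is being processed, mirroring the
    'schedule the global access one cycle earlier' idea."""
    out = []
    buf = data[0]                  # prologue: prefetch the first element
    for i in range(len(data)):
        nxt = data[i + 1] if i + 1 < len(data) else None  # prefetch ahead
        out.append(f(buf))         # compute on the previously fetched value
        buf = nxt                  # the prefetched value becomes current
    return out

doubled = process_with_prefetch([1, 2, 3], lambda x: 2 * x)
```

In hardware the prefetch buffer is the inserted register: it decouples the long interconnect delay of a remote read from the compute logic, shortening the critical path.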

Outline  Target architectures  Data partitioning problem  Memory optimizations  Experimental results  Concluding remarks

Experimental Setup  Target architecture: Xilinx Virtex II FPGA  Target frequency: 150 MHz  Benchmarks: image processing and DSP applications  SOBEL edge detection  Bilinear filtering  2D Gauss blurring  1D Gauss filter  SUSAN principle

Results: Architectural Exploration  Correlation bank:  Different partitions of the array S deliver a wide variety of candidate solutions  with quite different overall performance after synthesis and physical design

Results: Execution Time  The average speedup is 2.75 times; after further optimizations, the average speedup is 4.80 times

Results: Achievable Clock Frequencies  The partitioned designs are about 10 percent slower than the original ones; after optimizations, clock frequencies are about 7 percent faster than those of the unoptimized partitioned designs

Outline  Target architectures  Data partitioning problem  Memory optimizations  Experimental results  Concluding remarks

Concluding Remarks  A data and iteration space partitioning approach for homogeneous block RAM modules  integrates with existing architectural-level synthesis techniques  parallelizes input designs  dramatically improves system performance  Future work  Irregular memory accesses  Heterogeneous block RAM modules

Thank You  Prof. Ryan Kastner and Gang Wang  Reviewers  The audience

Questions