USC Search Space Properties for Pipelined FPGA Applications
University of Southern California, Information Sciences Institute
Heidi Ziegler, Mary Hall, Byoungro So
Oct 2, 2003

USC 2 Mapping Assignment

USC 3 Mapping an Application to Hardware: Compute Data Layout, Partition Chip Capacity, Manage Communication

USC 4 Build on Prior Work in DEFACTO
- Automatic design space exploration for individual loop nests (DAC03, PLDI02)
- Analyses and transformations to exploit ILP (PLDI02) and maximize memory bandwidth (LCPC02)
- Communication and pipeline analysis to exploit data and task parallelism (FCCM02, DAC03)

USC 5 This Research
- Integrates communication and pipelining analysis with the single-loop design space exploration
- Defines and illustrates search space properties for the global optimization problem
- Describes a search algorithm and presents a case study

USC 6 Sequential MVIS Kernel
[Figure: the sequential MVIS kernel divided into pipeline stages S1, S2, and S3, with RAW dependence edges on the Feature (F) and Distance (D) data; 2-D arrays are accessed in row-wise order, and the read/write execution order of arrays A and B is shown over time.]

USC 7 Reaching Definition Data Access Descriptor
A Reaching Definition Data Access Descriptor (RDAD) set describes basic data access information. Each descriptor records:
- s: the program point
- r, w: whether the array access is a read or a write
- the accessed array section, as integer linear inequalities
- the traversal order, a vector of dimensions from slowest to fastest varying
- a vector of the dominant induction variables, one per dimension
- the set of statements this tuple describes (def or use)
- the set of reaching definitions
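The descriptor above can be pictured as a record. The following C sketch is purely illustrative; the field names, types, and bounds are assumptions and not the DEFACTO implementation:

/* Hypothetical sketch of a Reaching Definition Data Access Descriptor (RDAD).
   All names and representations here are illustrative assumptions. */
#define MAX_DIMS 4

typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind_t;

typedef struct {
    int coeffs[MAX_DIMS];   /* coefficients of the loop induction variables */
    int constant;           /* constant term: coeffs . iv <= constant       */
} linear_inequality_t;      /* one inequality bounding the accessed section */

typedef struct rdad {
    int                 program_point;              /* s: statement of the access             */
    access_kind_t       kind;                       /* r or w                                 */
    linear_inequality_t section[2 * MAX_DIMS];      /* accessed array section                 */
    int                 num_inequalities;
    int                 traversal_order[MAX_DIMS];  /* dimensions, slowest to fastest varying */
    int                 dominant_iv[MAX_DIMS];      /* dominant induction variable per dim    */
    int                *statements;                 /* statements described (def or use)      */
    int                 num_statements;
    struct rdad       **reaching_defs;              /* set of reaching definitions            */
    int                 num_reaching_defs;
} rdad_t;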

USC 8 Communication Requirements
[Figure: array B is written in pipeline stage S1 and read in stage S2, giving a RAW communication edge between the stages.]
Solve directly for the communicated data, its granularity, and its placement.

USC 9 Task Graph
- Nodes are pipeline stages
- Communication edge descriptors (CEDs), computed from RDADs, record for each communication instance:
  - the array section
  - the send point
  - the receive point
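As a rough picture of how a task graph edge might be represented, the sketch below pairs an array section with its send and receive points. It reuses the hypothetical rdad_t above; all names and fields are assumptions:

/* Hypothetical communication edge descriptor (CED) and task graph node. */
typedef struct {
    const char *array_name;     /* communicated array                        */
    rdad_t     *section;        /* array section, per communication instance */
    int         send_point;     /* program point where the data is sent      */
    int         receive_point;  /* program point where the data is received  */
} ced_t;

typedef struct stage {
    int     id;                 /* pipeline stage (one loop nest)            */
    ced_t **out_edges;          /* CEDs to downstream (consumer) stages      */
    int     num_out_edges;
} stage_t;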

USC 10 Global Optimization Strategy
- Two criteria:
  - The design's execution time should be minimized
  - The design's space utilization, for a given level of performance, should be minimized
- Estimates:
  - Behavioral synthesis area (all loops)
  - Behavioral synthesis timing (all loops)
  - Communication rates
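The two criteria amount to a lexicographic comparison of candidate designs. The sketch below is only an illustration of that comparison; the struct and its fields are assumptions, with the values imagined to come from the behavioral synthesis and communication estimates:

/* Hypothetical candidate design with estimated cost metrics. */
typedef struct {
    double exec_time;   /* estimated execution time of the pipelined design */
    double area;        /* estimated space utilization on the chip          */
} design_t;

/* Prefer the faster design; for a given level of performance,
   prefer the one using less space. */
design_t prefer(design_t a, design_t b)
{
    if (a.exec_time != b.exec_time)
        return (a.exec_time < b.exec_time) ? a : b;
    return (a.area <= b.area) ? a : b;
}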

USC 11 Transformations
- Local:
  - Unroll and jam
  - Scalar replacement
  - Custom data layout
- Global:
  - Communication granularity and placement
  - Producer-consumer rate matching
  - Data reorganization on-chip
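To make the local transformations concrete, here is a small example on a simple 2-D loop, not taken from the MVIS kernel; the unroll factor of 2, the array names, and the sizes are arbitrary assumptions:

#define N 64   /* assumed even, for illustration */

/* Original loop: every iteration loads u[x][y] and u[x+1][y] from memory. */
void before(int u[N + 2][N], int p[N][N])
{
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++)
            p[x][y] = u[x][y] + u[x + 1][y];
}

/* Unroll-and-jam the x loop by 2, then scalar-replace the shared load of
   u[x+1][y]; the value is loaded once and reused, reducing memory accesses
   and exposing instruction-level parallelism. */
void after(int u[N + 2][N], int p[N][N])
{
    for (int x = 0; x < N; x += 2)
        for (int y = 0; y < N; y++) {
            int u0 = u[x][y];
            int u1 = u[x + 1][y];   /* loaded once, used by both statements */
            int u2 = u[x + 2][y];
            p[x][y]     = u0 + u1;
            p[x + 1][y] = u1 + u2;
        }
    /* An odd N would require a cleanup iteration, omitted here. */
}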

USC 12 High-Level Design Flow

USC 13 Observation 1: Non-increasing Memory Accesses
- Placing communication on-chip does not increase the number of memory accesses, so choose to place communication on-chip

USC 14 Observation 2: Non-increasing Unroll Factor
- The local (single-loop) solution is assumed to be a best-case performance, worst-case space estimate
- Starting from it, the global search only reduces unroll factors

USC 15 Observation 3: Matching Rates without Affecting Performance
- Avoid creating longer critical paths
- If rate_prod(d) < rate_cons(d), we can safely reduce the unroll factor for S3 until the rates match
[Figure: task graph edges CED(d) and CED(a), each labeled with its rate_prod and rate_cons.]
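A minimal sketch of the rate-matching idea, assuming (purely for illustration) that a consumer stage's consumption rate scales linearly with its unroll factor; the function name and the scaling model are assumptions, not the paper's formulation:

/* If the producer's rate is lower than the consumer's, reduce the consumer's
   unroll factor until the rates match; the producer was already the bottleneck,
   so the pipeline's critical path is not lengthened. */
int match_rates(double rate_prod, double cons_rate_per_unroll, int unroll_factor)
{
    while (unroll_factor > 1 &&
           rate_prod < cons_rate_per_unroll * unroll_factor)
        unroll_factor--;    /* a smaller unroll factor lowers the consumption rate */
    return unroll_factor;
}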

USC 16 Optimization Algorithm: Step 1 Apply Pipeline and Communication Analysis

for (x = 0; x < image-2; x++) {
  for (y = 0; y < image-2; y++) {
    uh1 = -3*u[x][y] - 3*u[x+1][y] ... ;
    uh2 = -3*u[x][y] + 3*u[x+1][y] ... ;
    peak[x][y] = uh1 + uh2;
  }
}
for (x = 0; x < image-2; x++) {
  for (y = 0; y < image-2; y++) {
    if (feature_x[x][y] != 0)
      ssd[x][y] = (u[x][y] - v[x][y+1])^2 ... ;
  }
}
for (x = 0; x < image-2; x++) {
  for (y = 0; y < image-2; y++) {
    if (peak[x][y] > threshold)
      feature_x[x][y] = x;
    else
      feature_x[x][y] = 0;
  }
}

USC 17 Optimization Algorithm: Step 2 Find Single Loop Solutions in Isolation
(The same MVIS kernel code as in Step 1; each of the three loop nests is explored in isolation to find its best local design.)

USC 18 Optimization Algorithm: Step 3 Match Producer and Consumer Rates
[Figure: task graph edges CED(peak) and CED(feature_x), each labeled with its rate_prod and rate_cons.]
Goal: rate_prod(peak) = rate_cons(peak) and rate_prod(feature_x) = rate_cons(feature_x)

USC 19 Optimization Algorithm: Step 4 Apply Greedy Strategy to Meet Chip Constraint
If the design does not fit within the chip capacity constraint, apply the greedy strategy and then repeat Steps 3 and 4; otherwise the current design is the final solution.
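Putting Steps 2 through 4 together, the search loop might look roughly like the sketch below. The data structures, the area model, and the choice of which stage to shrink are assumptions made only for illustration, not the published algorithm:

/* Hypothetical sketch of the global search (Steps 2-4) for a linear pipeline.
   All fields, estimates, and the shrink heuristic are illustrative assumptions. */
#define NUM_STAGES 3

typedef struct {
    int    unroll;      /* current unroll factor (starts at the local best)  */
    double area;        /* behavioral-synthesis area estimate for this stage */
    double rate_prod;   /* production rate of the data this stage produces   */
    double rate_cons;   /* consumption rate of the data this stage consumes  */
} stage_design_t;

extern void   reestimate(stage_design_t *s);   /* refresh synthesis/rate estimates */
extern double chip_capacity;

void global_search(stage_design_t stages[NUM_STAGES])
{
    for (;;) {
        /* Step 3: match producer and consumer rates (Observation 3). A consumer
           faster than its producer can be slowed down without hurting performance. */
        for (int i = 1; i < NUM_STAGES; i++)
            while (stages[i].unroll > 1 &&
                   stages[i].rate_cons > stages[i - 1].rate_prod) {
                stages[i].unroll--;              /* Observation 2: only decrease */
                reestimate(&stages[i]);
            }

        /* Step 4: check the chip capacity constraint. */
        double total_area = 0.0;
        for (int i = 0; i < NUM_STAGES; i++)
            total_area += stages[i].area;
        if (total_area <= chip_capacity)
            return;                              /* final solution */

        /* Greedy step (assumed heuristic): shrink the largest stage,
           then repeat Steps 3 and 4. */
        int biggest = 0;
        for (int i = 1; i < NUM_STAGES; i++)
            if (stages[i].area > stages[biggest].area)
                biggest = i;
        if (stages[biggest].unroll > 1) {
            stages[biggest].unroll--;
            reestimate(&stages[biggest]);
        } else {
            return;                              /* cannot shrink any further */
        }
    }
}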

USC 20 Related Work
- Synthesizing high-level constructs: Handel-C, RaPiD, PipeRench, Babb et al.
- Design space exploration: Derrien/Rajopadhye, Cameron, PICO
- Program analysis on arrays: Hall et al., Amarasinghe, Balasundaram & Kennedy
- Pipeline analysis: Splash 2, Weinhardt & Luk, Du et al., Goldstein et al.

USC 21 Conclusion
- A system-level compiler automatically derives a pipelined implementation with explicit communication, while partitioning the chip capacity among pipeline stages
- Global optimization strategy:
  - Built upon the local solution with communication
  - Constrains the search space: non-increasing memory accesses, non-increasing unroll factors

USC 22 Contact Information Project Web Site Authors’ addresses ziegler, mhall,