Area-Efficient Instruction Set Synthesis for Reconfigurable System on Chip Designs. Philip Brisk, Adam Kaplan, Majid Sarrafzadeh. Embedded and Reconfigurable Systems Lab.

Presentation transcript:

Area-Efficient Instruction Set Synthesis for Reconfigurable System on Chip Designs. Philip Brisk, Adam Kaplan, Majid Sarrafzadeh. Embedded and Reconfigurable Systems Lab, Computer Science Department, University of California, Los Angeles. DAC ’04, June 9, San Diego Convention Center, San Diego, CA.

Outline. Custom Instruction Generation and Selection; Resource Sharing Algorithm Description with Examples; Datapath Synthesis Techniques; Experimental Methodology and Results; Summary.

Custom Instruction Generation and Selection. Custom instruction generation: the compiler profiles the application code, extracts favorable IR patterns, and synthesizes the patterns as hardware datapaths. Custom instruction selection: area constraints limit on-chip functionality; selection is an NP-hard 0-1 knapsack problem, formulated as an integer linear program (ILP).

ILP Formulation for the Instruction Selection Problem. For each custom instruction i: Gain(i) is the estimated performance gain of i; Area(i) is the estimated area of i; Selected(i) is 1 if i is selected and 0 otherwise. Goal: maximize the gain of the selected instructions. Constraint: the total area of the selected instructions must be less than the FPGA area.
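Written out, this is the standard 0-1 knapsack ILP below; the symbol A_FPGA for the available FPGA area is our notation, not the paper's.

```latex
\begin{align*}
\text{maximize}   \quad & \sum_{i} \mathit{Gain}(i) \cdot \mathit{Selected}(i) \\
\text{subject to} \quad & \sum_{i} \mathit{Area}(i) \cdot \mathit{Selected}(i) < A_{\mathrm{FPGA}} \\
                        & \mathit{Selected}(i) \in \{0, 1\} \quad \text{for every candidate instruction } i
\end{align*}
```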

What About Resource Sharing? Two DFGs, with areas 17 and 25. A single datapath that shares resources between them has area 28, while the additive ILP area estimate is 42, a factor of 1.5 too large. (Figure: the two DFGs, their area costs, and the shared datapath.)

Analysis. The 0-1 knapsack formulation over-estimated the area: the additive estimate (42) is 150% of the actual shared-datapath area (28). ILP solvers do not consider resource sharing. How to remedy this: develop a resource sharing algorithm, and avoid additive area estimates based on per-instruction costs.

Resource Sharing for DFGs. Given: a set of DFGs G* = {G1, …, Gn}. Goal: construct a consolidation graph GC of minimal cost. Constraints: GC must be acyclic, and GC must be a supergraph of each Gi in G*. That's life: the problem is NP-hard.

Resource Sharing Overview. Decompose the patterns (DFGs G1, G2, G3, G4) into input-output paths: Path-Based Resource Sharing (PBRS). (Figure: the four DFGs and their extracted paths.)

Resource Sharing Overview. Use substring matching on the paths to share resources, then merge the DFGs along the matched nodes. (Figure: G1-G4 with matched nodes.)

Resource Sharing Overview. Synthesizing the consolidation graph GC requires less area than synthesizing G1, …, G4 separately. (Figure: the consolidation graph GC.)

Area Costs: Path-Based Resource Sharing. Two input-output paths, P1 and P2, are compared; their operator sequences and per-operator area costs are shown in the figure.

Maximum Area Common Substring (MACStr): computable in O(L) time, where L is the length of the string. For the example paths P1 and P2, the area of the MACStr is 26.
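As a rough illustration of MACStr, the dynamic program below finds the contiguous common substring of two operator paths with the largest total area. The operator names and area costs are invented for the example, and the slide's O(L) bound refers to a more sophisticated method than this O(L^2) sketch.

```python
# Sketch of a maximum-area common substring (MACStr) search between two
# operator paths. Operator names and area costs are invented for
# illustration; this simple dynamic program runs in O(L^2).

AREA = {"ADD": 4, "MUL": 9, "SHL": 1}   # hypothetical per-operator areas

def macstr(p1, p2):
    """Return (area, substring) of the maximum-area common substring."""
    best_area, best_sub = 0, []
    prev_area = [0] * (len(p2) + 1)     # area of the common suffix ending at (i-1, j)
    prev_len = [0] * (len(p2) + 1)
    for i in range(1, len(p1) + 1):
        cur_area = [0] * (len(p2) + 1)
        cur_len = [0] * (len(p2) + 1)
        for j in range(1, len(p2) + 1):
            if p1[i - 1] == p2[j - 1]:
                cur_area[j] = prev_area[j - 1] + AREA[p1[i - 1]]
                cur_len[j] = prev_len[j - 1] + 1
                if cur_area[j] > best_area:
                    best_area = cur_area[j]
                    best_sub = p1[i - cur_len[j]:i]
        prev_area, prev_len = cur_area, cur_len
    return best_area, best_sub

print(macstr(["MUL", "ADD", "SHL"], ["ADD", "MUL", "ADD"]))   # (13, ['MUL', 'ADD'])
```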

Maximum Area Common Subsequence (MACSeq): computable in O(L²/log L) time, where L is the length of the string. For the same paths, the area of the MACSeq is 43.
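MACSeq drops the contiguity requirement: it is a longest-common-subsequence dynamic program weighted by operator area. The sketch below uses the same invented area-table idea and runs in O(L^2) rather than the O(L²/log L) cited on the slide.

```python
# Maximum-area common subsequence (MACSeq): an LCS-style dynamic program
# where each matched operator contributes its area. Operator names and
# costs are illustrative, not taken from the paper.

def macseq(p1, p2, area):
    """Total area of the maximum-area common subsequence of p1 and p2."""
    dp = [[0] * (len(p2) + 1) for _ in range(len(p1) + 1)]
    for i in range(1, len(p1) + 1):
        for j in range(1, len(p2) + 1):
            if p1[i - 1] == p2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + area[p1[i - 1]]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

area = {"ADD": 4, "MUL": 9, "SHL": 1}                              # hypothetical costs
print(macseq(["MUL", "ADD", "SHL"], ["ADD", "MUL", "SHL"], area))  # -> 10 (MUL, SHL)
```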

Resource Sharing Algorithm. Global phase: determine which DFGs to merge and an initial path to merge. Local phase: aggressively apply PBRS to share resources between the DFGs selected by the global phase. Repeat until all DFGs are merged or no further resource sharing is possible.
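The control structure described on this slide can be sketched as the loop below. The three callables (pick_best_pair, merge_along_path, apply_pbrs_once) are hypothetical hooks standing in for the global-phase pair/path selection and the PBRS merge step; they are not functions from the paper.

```python
# Skeleton of the two-phase resource-sharing algorithm: a global phase
# that picks which graphs to merge, followed by a local phase that
# aggressively applies PBRS inside the merged graph.

def consolidate(dfgs, pick_best_pair, merge_along_path, apply_pbrs_once):
    """Merge DFGs until all are consolidated or no sharing remains.

    pick_best_pair(graphs)       -> (g_a, g_b, seed_path) or None
    merge_along_path(a, b, path) -> merged graph
    apply_pbrs_once(graph)       -> True if a path pair was shared, else False
    """
    worklist = list(dfgs)
    while len(worklist) > 1:
        # Global phase: decide which two graphs to merge and the initial path.
        choice = pick_best_pair(worklist)
        if choice is None:
            break                                  # no further sharing is possible
        g_a, g_b, seed_path = choice
        merged = merge_along_path(g_a, g_b, seed_path)
        # Local phase: keep applying PBRS until nothing more can be shared.
        while apply_pbrs_once(merged):
            pass
        worklist.remove(g_a)
        worklist.remove(g_b)
        worklist.append(merged)
    return worklist
```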

Resource Sharing Algorithm (worked example). The algorithm is run on the four example DFGs G1-G4 and their per-operator area costs.

Global Phase. MACSeq/MACStr matching over the candidate pairs selects G1 and G2, along with an initial path to merge.

Local Phase. PBRS is applied repeatedly between G1 and G2, matching and merging one pair of paths at a time until no further sharing is found; the two DFGs are consolidated into the combined graph G12. (Figures: successive merge steps on G1 and G2.)

Returning to the Global Phase. The remaining graphs are G12, G3, and G4, each with its area costs. MACSeq/MACStr matching selects G12 and G4 and an initial path, and the local phase is entered.

Local Phase. PBRS merges matched paths of G12 and G4. Along the way a local decision arises: of two candidate node merges, one would create a cycle in the consolidation graph and is therefore illegal, while the alternative keeps the graph acyclic and is chosen. The result is the combined graph G124. (Figures: merge steps, including the illegal and legal alternatives.)
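The legality test behind this step can be sketched as a reachability check on the current consolidation graph: two operator nodes may be fused only if neither can already reach the other. The adjacency-dict representation below is an assumption of this sketch, not the paper's data structure.

```python
# Sketch of the acyclicity test used when fusing two nodes of the
# consolidation graph. Graphs are adjacency dicts {node: [successors]};
# this is an illustrative check, not the paper's implementation.

def reachable(adj, src, dst):
    """True if dst can be reached from src by following edges."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for succ in adj.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return False

def fusion_is_legal(adj, u, v):
    """Fusing u and v keeps the graph acyclic only if neither node
    already reaches the other; otherwise the merge closes a cycle."""
    return not reachable(adj, u, v) and not reachable(adj, v, u)

g = {"a": ["b"], "b": ["c"], "c": [], "d": ["c"]}
print(fusion_is_legal(g, "a", "c"))   # False: a already reaches c
print(fusion_is_legal(g, "a", "d"))   # True: no path between a and d
```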

Returning to the Global Phase. Only G124 and G3 remain. MACSeq/MACStr matching selects an initial path between them, and the two graphs are merged into G1234.

Local Phase. PBRS is applied repeatedly within G1234 until no further resource sharing is possible. (Figures: successive merge steps on G124 and G3.)

We're Done. The consolidation graph G1234 covers all four DFGs. The individual DFG areas are 17, 25, 14, and 20, for a total area of 76, while the consolidated graph G1234 has area 30.

Experimental Procedure. The Machine-SUIF compiler performs custom instruction generation, producing a set of patterns; the consolidation graph construction algorithm merges them into a consolidation graph; pipeline synthesis and VLIW synthesis then build datapaths from the consolidation graph, and their area is estimated.

Pipelined Datapath Synthesis. Compiler view: loop bodies account for 80-90% of program execution time, and parallelism exists across multiple iterations, so a pipelined datapath yields maximal throughput. Data flow graph: insert registers and muxes.
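One common way to realize "insert registers & muxes" is to level the DFG and place a pipeline register on every edge that crosses a stage boundary. The sketch below assumes one cycle per operator and an adjacency-dict graph; both are assumptions of the example, not details from the paper.

```python
# Sketch of pipelining a consolidated DFG by ASAP levels: a pipeline
# register is placed on every edge for each stage boundary it crosses.

from collections import defaultdict

def asap_levels(adj):
    """Longest-path (ASAP) level of each node in a DAG {node: [succs]}."""
    indeg = defaultdict(int)
    for node, succs in adj.items():
        indeg.setdefault(node, 0)
        for succ in succs:
            indeg[succ] += 1
    level = {node: 0 for node in indeg}
    ready = [node for node, d in indeg.items() if d == 0]
    while ready:
        node = ready.pop()
        for succ in adj.get(node, ()):
            level[succ] = max(level[succ], level[node] + 1)
            indeg[succ] -= 1
            if indeg[succ] == 0:
                ready.append(succ)
    return level

def pipeline_registers(adj):
    """List of (src, dst, registers), with one register per stage crossed."""
    level = asap_levels(adj)
    return [(u, v, level[v] - level[u]) for u, succs in adj.items() for v in succs]

# 'const' feeds 'add' two stages later, so its edge needs 2 balancing registers.
print(pipeline_registers({"ld": ["mul"], "mul": ["add"], "const": ["add"], "add": []}))
```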

Pipelined Datapath Synthesis. (Figure: the consolidation graph GC pipelined, covering G1-G4.)

VLIW Datapath Synthesis. Compiler view: non-loop computations with instruction-level parallelism. The problem is similar to latency-constrained scheduling in high-level synthesis. (Figure: data flow graph.)
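For the non-loop (VLIW) case, the scheduling analogy can be illustrated with a tiny list scheduler: each cycle, issue up to the machine's width of operations whose predecessors have already completed. The issue width, unit latency, and graph format are assumptions of this sketch.

```python
# Sketch of a latency-constrained list scheduler for a DFG, in the spirit
# of high-level-synthesis scheduling mentioned on this slide.

from collections import defaultdict

def list_schedule(adj, width=2):
    """Schedule a DAG {node: [succs]} onto `width` issue slots per cycle."""
    preds = defaultdict(set)
    for node, succs in adj.items():
        preds.setdefault(node, set())
        for succ in succs:
            preds[succ].add(node)
    done, schedule, cycle = set(), [], 0
    while len(done) < len(preds):
        ready = [n for n in preds if n not in done and preds[n] <= done]
        if not ready:           # only possible if the graph had a cycle
            break
        issue = ready[:width]   # fill the available issue slots
        schedule.append((cycle, issue))
        done.update(issue)
        cycle += 1
    return schedule

print(list_schedule({"a": ["c"], "b": ["c"], "c": ["d"], "d": []}))
# -> [(0, ['a', 'b']), (1, ['c']), (2, ['d'])]
```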

Benchmark Suite: MediaBench. The original table lists, per experiment: benchmark, file/function, number of instructions, largest instruction (in operations), and average operations per instruction. Benchmarks and functions: Mesa (blend.c), PGP (idea.c), Rasta (mul_mdmd_md.c, FR4TR, Lqsolve.c), Epic (collapse_pyr), JPEG (jpeg_fdct_ifast, jpeg_idct_4x4, jpeg_idct_2x2), and MPEG2 (idct_col, idct_row); the numeric columns are not preserved in this transcript.

Experimental Results. (Charts: synthesized area on the Xilinx E-1000 FPGA.)

Summary. Area estimates based on resource sharing: the 0-1 knapsack problem formulation does not allow for resource-sharing-aware estimates. Resource sharing algorithm: PBRS applied to data flow graphs. Experimental results: the ILP overestimates area costs by as much as 374% and 582% for pipelined and VLIW datapaths, respectively.