Application Specific Instruction Generation for Configurable Processor Architectures
VLSI CAD Lab, Computer Science Department, UCLA
Led by Jason Cong
Yiping Fan, Guoling Han, Zhiru Zhang
Supported by NSF
Outline: Motivation / Related Work / Problem Statement / Proposed Solutions / Experimental Results / Conclusions
Motivation
Motivation (cont'd)
Flexibility is required to satisfy different requirements and to avoid potential design errors
Application Specific Instruction-set Processors (ASIPs) provide a solution to the tradeoff between efficiency and flexibility:
A general-purpose processor + application-specific hardware resources
Base instruction set + customized instructions
The specific hardware resources implement the customized instructions, either runtime-reconfigurable or pre-synthesized
Gaining popularity recently: IFX Carmel 20xx, ARM, Tensilica Xtensa, STM Lx, ARC Cores
Application Specific Instruction-set Processor
Program with basic instruction set I:
t1 = a * b;  t2 = b * 0xf0;  t3 = c * 0x12;
t4 = t1 + t2;  t5 = t2 + t3;  t6 = t5 + t4;
[Figure: DFG with three * nodes reading a, b, 0xf0, c, 0x12, feeding a tree of + nodes; custom logic]
Execution time: 9 clock cycles (*: 2 clock cycles, +: 1 clock cycle)
Application Specific Instruction-set Processor (cont'd)
Program with extended instructions:
t1 = extop1(a, b, 0xf0);  t2 = extop2(b, c, 0xf0, 0x12);  t3 = t1 + t2;
Extended instruction set: I ∪ {extop1, extop2}
Execution time: 5 clock cycles → speedup: 1.8 (extops: 2 clock cycles, +: 1 clock cycle)
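The cycle counts on the two slides above can be checked with a little arithmetic, using the latencies the slides give (*: 2 cycles, +: 1 cycle, extop: 2 cycles); this is only a sanity-check sketch of the example, not part of the flow:

```python
# Latencies from the slides: * = 2 cycles, + = 1 cycle, extop = 2 cycles.
t_basic = 3 * 2 + 3 * 1       # three multiplies and three adds -> 9 cycles
t_extended = 2 * 2 + 1 * 1    # two extops and one add -> 5 cycles
speedup = t_basic / t_extended
print(speedup)                # 1.8
```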
Related Work
[Kastner et al., TODAES'02]: template generation + covering
Limitations: the minimum number of templates may not lead to maximum speedup; architecture constraints are ignored
[Atasu et al., DAC'03]: branch and bound
Limitations: high complexity; instruction reuse is not considered
[Peymandoust et al., ICASAP'03]: instruction selection + instruction mapping
Limitation: minimizes the number of extended instructions rather than execution time
Preliminaries
Control data flow graph (CDFG): basic blocks (BBKs) + control edges; each BBK is a DAG, denoted G(V, E)
Cone: a subgraph consisting of node v and its predecessors such that any path connecting a node in the cone and v lies entirely in the cone
K-feasible cone: a cone with at most K inputs
Pattern: a single-output DAG, either trivial (a single operation) or nontrivial; associated with execution time, number of I/Os, and area
[Figure: example DFG with nodes n1..n6. Trivial pattern (*): execution time, 2-in 1-out. Nontrivial pattern with inputs {a, b, 0xf0}: SW execution time, HW execution time, 2-in 1-out, area 2]
Problem Statement
Given: G(V, E), the basic instruction set I, and pattern constraints:
I. Number of inputs |PI(p_i)| ≤ N_in, ∀i;
II. Number of outputs |PO(p_i)| = 1, ∀i;
III. Total area ≤ A
Objective: generate a pattern library P and map G to the extended instruction set I ∪ P, so that the total execution time is minimized.
Problem Decomposition Sub-problem 1. Pattern Enumeration: Generate all of the patterns S satisfying the constraints (i) and (ii) from G(V, E). Sub-problem 2. Instruction Set Selection: Select a subset P of S to maximize the potential speedup while satisfying the area constraint. Sub-problem 3. Application Mapping: Map G(V, E) to I P so that the total execution time of G is minimized.
Proposed ASIP Compilation Flow
[Flow diagram: C → SUIF / CDFG generator → CDFG → Pattern Generation / Pattern Selection (under ASIP constraints) → Pattern library → Application Mapping → Mapped CDFG → Instruction Implementation / ASIP synthesis → Implementation]
1. Pattern Enumeration
All possible application-specific instruction patterns should be enumerated; each pattern is a k-feasible cone
Cut enumeration is used to enumerate all the k-feasible cones [Cong et al., FPGA'99]: in topological order, merge the cuts of the fan-ins and discard those cuts that are not k-feasible
1. Pattern Enumeration (cont'd)
3-feasible cones of the example DFG:
n1: {a, b}
n2: {b, 0xf0}
n3: {c, 0x12}
n4: {n1, n2}, {n1, b, 0xf0}, {n2, a, b}, {a, b, 0xf0}
n5: {n2, n3}, {n2, c, 0x12}, {n3, b, 0xf0}, {b, 0xf0, c, 0x12}
n6: {n4, n5}, {n4, n2, n3}, {n5, n1, n2}
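The cut enumeration on this slide can be sketched as follows. The input format and function names are illustrative, not from the paper; following the earlier "2-in 1-out" pattern that also reads 0xf0, constants are assumed not to count toward the k-input bound (which is how {b, 0xf0, c, 0x12} can appear among the 3-feasible cones):

```python
from itertools import product

def enumerate_cuts(dag, k, constants=frozenset()):
    """Enumerate all k-feasible cuts (cone input sets) of a DAG.

    dag: dict node -> list of fan-in nodes ([] for primary inputs),
         with keys in topological order (hypothetical input format).
    constants: compile-time constants; assumed hardwired into the
         custom instruction, so excluded from the k-input bound.
    Returns dict node -> set of cuts, each cut a frozenset of nodes.
    """
    cuts = {}
    for node, fanins in dag.items():
        node_cuts = {frozenset([node])}   # trivial cut: the node itself
        if fanins:
            # merge one cut from each fan-in; discard merges over the bound
            for combo in product(*(cuts[f] for f in fanins)):
                merged = frozenset().union(*combo)
                if len(merged - constants) <= k:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts

# The example DFG: n1 = a*b, n2 = b*0xf0, n3 = c*0x12,
# n4 = n1+n2, n5 = n2+n3, n6 = n4+n5
dag = {"a": [], "b": [], "0xf0": [], "c": [], "0x12": [],
       "n1": ["a", "b"], "n2": ["b", "0xf0"], "n3": ["c", "0x12"],
       "n4": ["n1", "n2"], "n5": ["n2", "n3"], "n6": ["n4", "n5"]}
cuts = enumerate_cuts(dag, 3, constants={"0xf0", "0x12"})
```

Running this reproduces the cut sets listed on the slide (plus each node's singleton cut, which the recursion needs).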
2. Pattern Selection (1)
Resource cost and execution time can be obtained using a high-level estimation tool
The extended instructions must satisfy the area constraint
Using all the enumerated patterns, optimal code can be generated, but the mapping becomes unaffordable
→ heuristically select a subset of the patterns
2. Pattern Selection (2)
Basic idea: simultaneously consider speedup, occurrence frequency, and area
Speedup: Tsw(p) = software execution time of p (summed cycle counts over V(p)); Thw(p) = length of the critical path of scheduled p; Speedup(p) = Tsw(p) / Thw(p)
Occurrence: some pattern instances may be isomorphic; graph isomorphism test [Nauty package] — for small subgraphs the test is very fast
Gain(p) = Speedup(p) × Occurrence(p)
[Figure: the *+ pattern in the example DFG: Tsw = 3, Thw = 2, Speedup = 1.5]
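The gain metric can be sketched in a few lines; the per-operation software latencies (* = 2 cycles, + = 1 cycle) come from the earlier slides, and the function names are illustrative:

```python
SW_CYCLES = {"*": 2, "+": 1}   # software latencies from the slides

def speedup(ops, thw):
    """Speedup(p) = Tsw(p) / Thw(p); Tsw = summed SW latency of p's ops."""
    tsw = sum(SW_CYCLES[op] for op in ops)
    return tsw / thw

def gain(ops, thw, occurrence):
    """Gain(p) = Speedup(p) x Occurrence(p)."""
    return speedup(ops, thw) * occurrence

# The *+ pattern from the slide: Tsw = 3, Thw = 2 -> speedup 1.5
print(speedup(["*", "+"], 2))  # 1.5
```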
2. Pattern Selection (3): Selection under Area Constraint
Can be formulated as a 0-1 knapsack problem: given n items (patterns) and capacity W (area constraint A), where the i-th item (pattern) has value (gain) v_i and weight (area) w_i, select a subset of items to maximize the total value while the total weight does not exceed W
Optimally solvable by a dynamic programming algorithm
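The dynamic program for this knapsack formulation can be sketched as below; the (name, gain, area) tuple format is a hypothetical encoding of the slide's items:

```python
def select_patterns(patterns, area_limit):
    """0-1 knapsack: choose patterns maximizing total gain within an area budget.

    patterns: list of (name, gain, area) tuples (hypothetical format).
    Returns (best_total_gain, list_of_chosen_names).
    """
    # dp[a] = (best gain achievable with total area <= a, chosen patterns)
    dp = [(0, [])] * (area_limit + 1)
    for name, gain, area in patterns:
        # iterate areas downward so each pattern is selected at most once
        for a in range(area_limit, area - 1, -1):
            cand_gain = dp[a - area][0] + gain
            if cand_gain > dp[a][0]:
                dp[a] = (cand_gain, dp[a - area][1] + [name])
    return dp[area_limit]
```

For example, with patterns of gain/area (6, 2), (5, 2), (4, 1) and an area budget of 3, the optimum picks the first and third patterns for a total gain of 10.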
3. Application Mapping (1)
Application mapping covers each node in G(V, E) with the extended instruction set so as to minimize the execution time
The execution time of a mapped DAG is defined as the sum of the execution times of the patterns covering the DAG
3. Application Mapping (2)
Theorem: the application mapping problem is equivalent to the minimum-area technology mapping problem
Execution time ↔ area: total area = sum of the areas of the components; total execution time = sum of the execution times of the patterns
Minimum-area mapping is NP-hard → application mapping is NP-hard
Many existing minimum-area technology mapping algorithms can be applied
Minimum-Area Technology Mapping
[Keutzer, DAC'87]: tree decomposition + dynamic programming [Rudell]
[Liao, ICCAD'95]: min-cost binate covering
Given: a Boolean function f with variable set X, and a cost function mapping X to a nonnegative integer
Objective: find an assignment of the variables such that f = 1 and the sum of the costs is minimized
Binate Covering (1)
[Example DFG with nodes n1..n6]
Pattern | Function | Cost | Covers
p0  | +       | 1 | n6
p1  | +       | 1 | n5
p2  | +       | 1 | n4
p3  | *       | 2 | n3
p4  | *       | 2 | n2
p5  | *       | 2 | n1
p6  | *+      | 2 | n1, n4
p7  | *+      | 2 | n2, n4
p8  | *+      | 2 | n2, n5
p9  | *+      | 2 | n3, n5
p10 | (*)+(*) | 2 | n1, n2, n4
p11 | (*)+(*) | 2 | n2, n3, n5
Binate Covering (2)
[Same pattern table as before]
The fan-ins of the sink node must be covered by some pattern → covering clause: p0
Binate Covering (3)
[Same pattern table as before]
The nodes that generate inputs to a chosen pattern must be covered by some other pattern; e.g., n4 is covered by p2, p6, p7, or p10 → covering clause: p2 + p6 + p7 + p10
Binate Covering (4)
[Same pattern table as before]
If p2 is chosen, its inputs n1 and n2 must be produced by other patterns: p2 → p4 and p2 → p5, i.e., (¬p2 + p4)(¬p2 + p5)
Binate Covering (4)
[Same pattern table as before]
Similarly: (¬p6 + p4), (¬p7 + p5)
Binate Covering (5)
f = p0 (p2 + p6 + p7 + p10)(¬p2 + p4)(¬p2 + p5)(¬p6 + p4)(¬p7 + p5) (p1 + p8 + p9 + p11)(¬p1 + p3)(¬p1 + p4)(¬p8 + p3)(¬p9 + p4)
Min-cost cover: p0, p10, p11, with cost = 5
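The formula on this slide is small enough to verify by exhaustive search. The sketch below encodes the slide's clauses and costs directly; a production binate-covering solver would use branch and bound rather than enumerating all 2^12 assignments:

```python
from itertools import product

# Clauses of f from the slide; a literal is (pattern, polarity),
# polarity True meaning "pattern selected".
clauses = [
    [("p0", True)],
    [("p2", True), ("p6", True), ("p7", True), ("p10", True)],
    [("p2", False), ("p4", True)], [("p2", False), ("p5", True)],
    [("p6", False), ("p4", True)], [("p7", False), ("p5", True)],
    [("p1", True), ("p8", True), ("p9", True), ("p11", True)],
    [("p1", False), ("p3", True)], [("p1", False), ("p4", True)],
    [("p8", False), ("p3", True)], [("p9", False), ("p4", True)],
]
cost = {"p0": 1, "p1": 1, "p2": 1, "p3": 2, "p4": 2, "p5": 2,
        "p6": 2, "p7": 2, "p8": 2, "p9": 2, "p10": 2, "p11": 2}

def min_cost_cover(clauses, cost):
    """Exhaustive min-cost binate covering; fine for a dozen variables."""
    variables = sorted(cost)
    best_cost, best_sel = None, None
    for bits in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, bits))
        # an assignment satisfies f if every clause has a true literal
        if all(any(assign[v] == pol for v, pol in cl) for cl in clauses):
            c = sum(cost[v] for v in variables if assign[v])
            if best_cost is None or c < best_cost:
                best_cost = c
                best_sel = sorted(v for v in variables if assign[v])
    return best_cost, best_sel
```

Running it recovers the slide's answer: {p0, p10, p11} at cost 5.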
Experimental Results (1)
A commercial reconfigurable system, Altera's Nios, is used to implement the ASIPs: 5 extended instruction formats, up to 2048 instructions for each format
Several DSP applications are taken as benchmarks
Altera's Quartus II 3.0 is used to aid the synthesis and physical design of the extended instructions
Experimental Results (2) Pattern size vs. number of pattern instances (2-input patterns)
Experimental Results (3)
Speedup under different input size constraints (Speedup = T_basic / T_extended)
The gap from the ideal speedup is due to pipeline hazards and memory impact
Experimental Results (4) Speedup and resource overhead on Nios implementations
Conclusions
Proposed a set of algorithms for ASIP compilation
The actual performance metric is used as the optimization objective
Reduced the instruction mapping problem to an area-minimization logic covering problem; operation duplication is considered implicitly
Experiments show encouraging speedups
Thank You