Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions
Ajay K. Verma, Philip Brisk and Paolo Ienne
Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)
Ecole Polytechnique Fédérale de Lausanne (EPFL)

2 Custom ISE Identification
[Figure: processor datapath with a register file, ALU, MUL, LD/ST unit, data memory, and an AFU]
out1 = F(in1, in2, in3, in4)
out2 = G(in1, in2, in3, in4)
Limited number of I/O ports

3 Outline
Problem formulation
ISE selection
I/O serialisation
Related work
Non-optimality of earlier work
Integer Linear Programming (ILP) formulation
Results
Conclusions

4 Problem Formulation
Given: a dataflow graph and a set of forbidden nodes.
Find: a subgraph S which is convex, is free of forbidden nodes, and has the largest gain
M(S) = N_exec * (SW(S) - HW(S))
[Figure: example dataflow graph with node labels a, b, c, d, e, f, g, h, x1, x2, x3]

5 Convex Subgraph
[Figure: dataflow graph with nodes a, b, c, d]
In order to execute the AFU, we need the output of node b.
Computation of node b requires the output of the AFU.
A non-convex AFU cannot be scheduled without creating a deadlock.
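The convexity condition can be checked mechanically. The sketch below is only an illustration: the four-node graph, its edges, and the function names `reachable` and `is_convex` are assumptions, since the edges of the slide's figure are not recoverable from the transcript. A subgraph is treated as convex when no path leaves it and later re-enters it.

```python
def reachable(succ, start):
    """Return every node reachable from `start` along successor edges."""
    seen, stack = set(), list(succ.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return seen


def is_convex(succ, subgraph):
    """True if no path leaves `subgraph` and later re-enters it."""
    subgraph = set(subgraph)
    for u in subgraph:
        for outside in set(succ.get(u, ())) - subgraph:
            # A node outside the subgraph that feeds back into it is
            # exactly the deadlock situation described on the slide.
            if reachable(succ, outside) & subgraph:
                return False
    return True


# Hypothetical graph (not the slide's figure): a -> b -> d and a -> c -> d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

print(is_convex(succ, {"a", "c", "d"}))       # False: a -> b -> d re-enters
print(is_convex(succ, {"a", "b", "c", "d"}))  # True
```

In the first call, node b plays the role described on the slide: it needs an output of the candidate AFU and the AFU needs its output.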

6 I/O Serialisation
[Figure: subgraph with nodes b, c, d, e, f]
2 inputs, 4 outputs
Available I/O ports: (1, 2)

7 ISE Merit Estimation
M(S) = N_exec * (SW(S) - HW(S))
[Figure: example dataflow graph with node labels a, b, c, d, e, f, g, h, x1, x2, x3, and the selected subgraph with nodes b, c, d, e, f]
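As a small worked example of the merit formula, the sketch below evaluates M(S) for one candidate subgraph. The latency values, the execution count, and the function name `merit` are hypothetical. SW(S) is taken as the sum of per-node software latencies, as stated later in the presentation, and HW(S) is passed in as a given cycle count.

```python
def merit(n_exec, sw_latency, hw_cycles, subgraph):
    """M(S) = N_exec * (SW(S) - HW(S)) for one candidate subgraph S."""
    sw_cycles = sum(sw_latency[node] for node in subgraph)
    return n_exec * (sw_cycles - hw_cycles)


# Hypothetical software latencies, in cycles, for the nodes of a candidate.
sw_latency = {"b": 1, "c": 1, "d": 2, "e": 1, "f": 1}

# The candidate takes 6 cycles in software; assume it takes 2 as an AFU
# and that the enclosing basic block executes 1000 times.
print(merit(1000, sw_latency, 2, ["b", "c", "d", "e", "f"]))  # 4000
```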

8 Related Work
ISE identification under I/O constraints
  Search space pruning using I/O and convexity constraints [Atasu03, Clark03, Yu04, Pozzi06, Yu07, Chen07]
  ILP-based approach [Atasu05]
  Pseudo-polynomial time algorithm [Bonzini07]
ISE identification under relaxed I/O constraints
  Restricted search space exploration [Pozzi05]
  Generation of a semi-compact set of connected ISEs [Pothineni07]
I/O serialisation
  Exponential time algorithms [Pozzi05, Pothineni07]
  Algorithms for specific processor models: single-issue RISC processor model [Verma07]

9 Earlier Work
ISE selection: Atasu03, Yu07, Chen07, Bonzini07
  Optimal ISE selection under various I/O constraints
I/O serialisation: Pozzi05, Pothineni07
  Exponential time I/O serialisation algorithm

10 Non-Optimality of Earlier Work cycle saved:

11 Our Contributions
Optimal ILP formulation for a large class of processor models
  Earlier work considers the RISC processor model only
Single run
  In earlier work, ISE selection was done for various I/O constraints
ISE selection and I/O scheduling together
  Another source of non-optimality of earlier work

12 Integer Linear Programming Objective function Linear constraints
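The formulas on this slide were not captured in the transcript. For orientation only, a generic integer linear program in standard form is shown here; the symbols c, A, b, and x are textbook notation, not taken from the presentation.

```latex
\begin{aligned}
\text{maximize}\quad   & c^{\mathsf{T}} x \\
\text{subject to}\quad & A x \le b, \\
                       & x \in \mathbb{Z}^{n}
\end{aligned}
```

The objective function is the linear form c^T x, and each row of A x ≤ b is one linear constraint.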

13 ILP Formulation
Linear constraints
  No forbidden nodes
  Convexity constraints
  I/O serialisation based constraints
  I/O access per cycle based constraints
Objective function
  Saving in cycles should be maximum

14 ISE Selection Constraints (1 of 2)
Variable: for each node n_i, a Boolean variable x_i
  x_i is true iff node n_i is in the selected ISE
Constraint: no forbidden node should be in the ISE
  If n_i is a forbidden node, then x_i = 0
Variable: for each node n_i, two Boolean variables p_i and s_i
  p_i (s_i) is true iff at least one predecessor (successor) of n_i is in the selected ISE
Constraint: the subgraph corresponding to the selected ISE must be convex
  If p_i and s_i are both true, then x_i must be true (i.e., p_i + s_i - x_i ≤ 1)

15 ISE Selection Constraints (2 of 2)
Relationship between p_i, s_i and x_i
p_i = 0 if n_i has no direct predecessors; otherwise p_i = ∨ (x_j ∨ p_j), where the n_j are the direct predecessors of n_i
s_i = 0 if n_i has no direct successors; otherwise s_i = ∨ (x_j ∨ s_j), where the n_j are the direct successors of n_i
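The selection variables can be sanity-checked without a solver. The sketch below is a check under assumptions, not the ILP itself: it takes p_i to propagate from the direct predecessors of n_i and s_i from its direct successors, which is how p_i and s_i are defined on the previous slide. The graph and the helper names are hypothetical. It derives p and s from a given selection x and tests p_i + s_i - x_i ≤ 1 at every node.

```python
def topological_order(preds):
    """Order nodes so that each one follows all of its predecessors."""
    order, placed = [], set()

    def visit(node):
        if node in placed:
            return
        for parent in preds[node]:
            visit(parent)
        placed.add(node)
        order.append(node)

    for node in preds:
        visit(node)
    return order


def closure_flags(preds, x):
    """p[n] = 1 iff some ancestor of n is selected; s[n] likewise for descendants."""
    succs = {node: [] for node in preds}
    for node, parents in preds.items():
        for parent in parents:
            succs[parent].append(node)
    order = topological_order(preds)
    p, s = {}, {}
    for node in order:  # predecessors are already resolved
        p[node] = int(any(x[j] or p[j] for j in preds[node]))
    for node in reversed(order):  # successors are already resolved
        s[node] = int(any(x[j] or s[j] for j in succs[node]))
    return p, s


def satisfies_convexity(preds, x):
    """Check p_i + s_i - x_i <= 1 for every node."""
    p, s = closure_flags(preds, x)
    return all(p[n] + s[n] - x[n] <= 1 for n in preds)


# Hypothetical graph: a -> b -> d and a -> c -> d (each node lists its predecessors).
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

print(satisfies_convexity(preds, {"a": 1, "b": 0, "c": 1, "d": 1}))  # False
print(satisfies_convexity(preds, {"a": 1, "b": 1, "c": 1, "d": 1}))  # True
```

In the first selection, node b has a selected predecessor (a) and a selected successor (d) but is not selected itself, so p_b + s_b - x_b = 2 and the constraint is violated.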

16 I/O Serialisation Based Constraints (1 of 3)
[Figure: example graph with nodes n_1 to n_5]
Variable: an integer variable intDelay_i
  Denotes the cycle in which node n_i is executed, e.g., intDelay_1 = 0, intDelay_4 = 1, intDelay_5 = 2
Variable: a real variable fractionalDelay_i
  Denotes the smallest time after the start of cycle intDelay_i at which the output of n_i is available, e.g., fractionalDelay_3 = HW(n_3), fractionalDelay_4 = HW(n_3) + HW(n_4)
Variable: an integer variable ρ_ij
  Denotes the number of stages across the edge between nodes n_i and n_j, e.g., ρ_13 = 1, ρ_34 = 0, ρ_25 = 2

17 I/O Serialisation Based Constraints (2 of 3)
[Figure: example graph with nodes n_1 to n_7]
Constraint: the difference between the cycles of a predecessor and a successor node is the same as the number of latches on the edge connecting them, e.g., intDelay_4 = intDelay_3 + ρ_34, intDelay_5 = intDelay_2 + ρ_25
Constraint: the total number of stages is the same as the last cycle in which an output node is computed, e.g., R = intDelay_5 + ρ_57, R = intDelay_2 + ρ_26
Extra latches on output edges are created in order to realize an imaginary sink node.

18 I/O Serialisation Based Constraints (3 of 3)
[Figure: example graph with nodes n_1 to n_7]
Constraint: the fractionalDelay of a node depends on the fractionalDelay of its predecessor nodes, e.g.,
  Case 1: if the node is the first node in the cycle, fractionalDelay_3 = HW(n_3)
  Case 2: if the node is not the first node in the cycle, fractionalDelay_4 = fractionalDelay_3 + HW(n_4)
Constraint: the fractionalDelay of a node should never exceed the cycle time λ, e.g., fractionalDelay_3 ≤ λ, fractionalDelay_4 ≤ λ
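To make the roles of intDelay, fractionalDelay, and ρ concrete, the sketch below builds one assignment that satisfies the constraints above for a small chain. It uses a simple greedy as-soon-as-possible rule, which is a swapped-in technique for illustration and not the ILP of the presentation; it also ignores I/O port limits. The graph, the delays, and the function name `pipeline_schedule` are hypothetical, and taking the maximum over several same-cycle predecessors is an assumption.

```python
def pipeline_schedule(preds, hw, cycle_time):
    """Greedy ASAP assignment of intDelay, fractionalDelay and rho.

    `preds` maps each node to its predecessors and must list the nodes
    in topological order (dicts preserve insertion order).
    """
    int_delay, frac_delay = {}, {}
    for node, parents in preds.items():
        if hw[node] > cycle_time:
            raise ValueError(f"{node} does not fit in one cycle")
        # Earliest cycle in which all predecessors have executed.
        cycle = max((int_delay[p] for p in parents), default=0)
        # Latest output time among predecessors sharing that cycle (Case 2).
        arrival = max(
            (frac_delay[p] for p in parents if int_delay[p] == cycle),
            default=0.0,
        )
        if arrival + hw[node] > cycle_time:
            # Would exceed the cycle time: start a new stage (Case 1).
            cycle, arrival = cycle + 1, 0.0
        int_delay[node] = cycle
        frac_delay[node] = arrival + hw[node]
    # Number of latches on each edge: intDelay_j = intDelay_i + rho_ij.
    rho = {
        (p, node): int_delay[node] - int_delay[p]
        for node, parents in preds.items()
        for p in parents
    }
    return int_delay, frac_delay, rho


# Hypothetical chain n1 -> n2 -> n3 -> n4, each node taking half a cycle.
preds = {"n1": [], "n2": ["n1"], "n3": ["n2"], "n4": ["n3"]}
hw = {"n1": 0.5, "n2": 0.5, "n3": 0.5, "n4": 0.5}

int_delay, frac_delay, rho = pipeline_schedule(preds, hw, cycle_time=1.0)
print(int_delay)  # {'n1': 0, 'n2': 0, 'n3': 1, 'n4': 1}
print(rho)        # {('n1', 'n2'): 0, ('n2', 'n3'): 1, ('n3', 'n4'): 0}
```

Two nodes fit in each cycle, so a single latch is placed on the edge from n2 to n3.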

19 I/O Access Per Cycle Based Constraints
Variable: Boolean variables c_ik^IN and c_ik^OUT
  c_ik^IN is true iff n_i is an input of the ISE and is accessed in the k-th stage of execution (similarly for c_ik^OUT)
Constraint: in each stage no more than m inputs should be accessed, and no more than n outputs should be written back, i.e., for each k
  ∑_i c_ik^IN ≤ m
  ∑_i c_ik^OUT ≤ n
c_ik^IN and c_ik^OUT can be computed using the intDelay and fractionalDelay of the nodes and the ρ values of the incoming and outgoing edges of the AFU.
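The per-stage port limits amount to counting accesses per stage. The sketch below checks them for a given serialisation; the operand names, the stage assignments, and the function name `ports_respected` are hypothetical, and the stage of each access is supplied directly rather than derived from intDelay and ρ.

```python
from collections import Counter


def ports_respected(read_stage, write_stage, max_reads, max_writes):
    """Check sum_i c_ik^IN <= m and sum_i c_ik^OUT <= n for every stage k."""
    reads_per_stage = Counter(read_stage.values())
    writes_per_stage = Counter(write_stage.values())
    return (
        all(count <= max_reads for count in reads_per_stage.values())
        and all(count <= max_writes for count in writes_per_stage.values())
    )


# Hypothetical serialisation: four inputs read over two stages,
# two outputs written back in two different stages.
read_stage = {"in1": 0, "in2": 0, "in3": 1, "in4": 1}
write_stage = {"out1": 2, "out2": 3}

print(ports_respected(read_stage, write_stage, max_reads=2, max_writes=1))  # True
print(ports_respected(read_stage, write_stage, max_reads=1, max_writes=1))  # False
```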

20 Objective Function
Saving in cycles should be maximized: SW(S) - HW(S) should be maximum
  SW(S) = ∑_i x_i SW(n_i)
  HW(S) = R
Any processor model where SW(S) and HW(S) can be computed using linear inequalities can be handled using ILP.
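For very small graphs, the objective can be cross-checked by exhaustive enumeration. The sketch below is such a reference search and not the ILP: it tries every subset of allowed nodes, keeps the convex ones, and maximises SW(S) - HW(S). As a simplifying assumption, HW(S) is taken as the critical-path delay rounded up to whole cycles, with no I/O port limits; N_exec is omitted because it is a constant factor within one graph. The graph, the latencies, and the function names are hypothetical.

```python
from itertools import combinations
from math import ceil


def ancestors(preds, node):
    """All nodes from which `node` can be reached."""
    found, stack = set(), list(preds[node])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(preds[parent])
    return found


def best_ise(preds, sw, hw, forbidden):
    """Exhaustive search; `preds` must list nodes in topological order."""
    allowed = [n for n in preds if n not in forbidden]
    anc = {n: ancestors(preds, n) for n in preds}
    best, best_gain = frozenset(), 0
    for size in range(1, len(allowed) + 1):
        for subset in combinations(allowed, size):
            chosen = set(subset)
            # Non-convex if an outside node lies between two chosen nodes.
            if any(
                (anc[n] & chosen) and any(n in anc[m] for m in chosen)
                for n in preds
                if n not in chosen
            ):
                continue
            # Longest combinational path through the chosen nodes.
            depth = {}
            for n in preds:
                if n in chosen:
                    depth[n] = hw[n] + max(
                        (depth[p] for p in preds[n] if p in chosen), default=0
                    )
            gain = sum(sw[n] for n in chosen) - ceil(max(depth.values()))
            if gain > best_gain:
                best, best_gain = frozenset(chosen), gain
    return best, best_gain


# Hypothetical graph: a -> b -> d and a -> c -> d.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
sw = {"a": 1, "b": 1, "c": 1, "d": 2}              # software cycles
hw = {"a": 0.25, "b": 0.5, "c": 0.25, "d": 0.25}   # fraction of a cycle

print(best_ise(preds, sw, hw, forbidden={"b"}))  # best subset {c, d}, gain 2
print(best_ise(preds, sw, hw, forbidden=set()))  # all four nodes, gain 4
```

The search is exponential in the number of nodes, so it is only useful for checking a solver's answer on toy inputs.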

21 Experimental Setup
[Flow diagram: the input dataflow graph feeds three flows: ISE selection (Atasu03) followed by I/O serialisation (Pozzi05); ISE selection (Atasu03) with no serialisation; and the ILP method. Labels in the diagram: exp / subopt, exp / opt.]

22 Results (1 of 3)
[Chart: benchmarks viterbi, adpcmdecoder, adpcmcoder; series: No pipelining, Pozzi's algorithm, ILP method]

23 Results (2 of 3)
Benchmark: aes
Biggest dataflow graph: 703
Pozzi's algorithm takes several hours on this benchmark, and produces inferior results.
[Chart panels: After 3 minutes, After an hour]

24 Results (3 of 3) The best AFU with 22 inputs and 22 outputs

25 Conclusions
ISE selection: Atasu03, Yu07, Chen07, Bonzini07
I/O serialisation: Pozzi05, Pothineni07
Optimal, single run algorithm
The methodology can be generalized for a large class of processor models.