Exploring Custom Instruction Synthesis for Application-Specific Instruction Set Processors with Multiple Design Objectives. Lin, Hai; Fei, Yunsi. ACM/IEEE.

Presentation transcript:

1 Exploring Custom Instruction Synthesis for Application-Specific Instruction Set Processors with Multiple Design Objectives. Lin, Hai; Fei, Yunsi. ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), 2010. Date: 2010/05/20. 吳俊雄

2 OUTLINE  INTRODUCTION  MULTI-OBJECTIVE ASIP DESIGN  Two Algorithms for Custom Instruction Synthesis 1. Mixed Integer Linear Programming 2. Simulated Annealing Method  EXPERIMENTAL RESULTS

3 INTRODUCTION  Traditional custom instruction synthesis flows for ASIPs mainly target performance improvement.  We consider the existing custom instruction exploration algorithms: 1. Mixed Integer Linear Programming (MILP) 2. Simulated Annealing Method  And cost estimation methods for: 1. Performance improvement 2. Energy efficiency 3. Area overhead

4 INTRODUCTION  The work presented in this paper makes three major contributions: 1. We address the importance of energy and resource efficiency in ASIP design. 2. We discuss a set of key factors in custom instruction selection. 3. We show that traditional design space exploration algorithms are either infeasible or inefficient at estimating all the necessary factors.  Since the theoretical complexity of exploring the design space thoroughly is O(2^n), most practical techniques adopt heuristics to prune the design space during the search.  We present a holistic ASIP synthesis and simulation flow that allows the optimization goal to be adjusted between energy efficiency, area overhead and performance.

5 MULTI-OBJECTIVE ASIP DESIGN  There are two major energy factors: 1. Instruction fetch consumes a considerable portion of the total energy within a processor. 2. The data communication between operations is originally implemented through register file accesses within the base processor.  The dynamic energy consumption is affected by the reduction of the number of instructions and data register file accesses.

6 MULTI-OBJECTIVE ASIP DESIGN  Custom processor 1 with CFU1 achieves better performance improvement, because it utilizes operation parallelism in the DFG to reduce the total execution cycles.  Custom processor 2 with CFU2 achieves larger energy saving, because it realizes a sub-graph covering more operations and data transfer edges.

7 MULTI-OBJECTIVE ASIP DESIGN  We show that generating custom instructions from a DFG can be viewed as solving an operation scheduling problem.  The scheduling scheme should preserve data dependencies and ensure that the input/output edges of each software stage satisfy the I/O constraint set by the register file ports.  For a scheduling scheme, the number of software stages that contain operations represents the number of instructions for the customized processor. The edges that cross between different software stages represent register file accesses.
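The counting rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DFG, the operation names and the helper `count_instructions_and_rf_accesses` are hypothetical, and each cross-stage edge is counted as one register file access, which simplifies away the separate write and read.

```python
from typing import Dict, List, Tuple


def count_instructions_and_rf_accesses(
    stage_of: Dict[str, int], edges: List[Tuple[str, str]]
) -> Tuple[int, int]:
    """Return (number of instructions, register file accesses) for a schedule."""
    # Every software stage that holds at least one operation becomes one instruction.
    instructions = len(set(stage_of.values()))
    # An edge whose endpoints sit in different stages carries a value through
    # the register file (ASSUMPTION: counted once per edge).
    rf_accesses = sum(1 for src, dst in edges if stage_of[src] != stage_of[dst])
    return instructions, rf_accesses


# Hypothetical 4-operation DFG: a -> c, b -> c, c -> d
edges = [("a", "c"), ("b", "c"), ("c", "d")]
separate = {"a": 0, "b": 1, "c": 2, "d": 3}  # one operation per stage
merged = {"a": 0, "b": 0, "c": 0, "d": 1}    # a, b, c fused into one stage

print(count_instructions_and_rf_accesses(separate, edges))  # (4, 3)
print(count_instructions_and_rf_accesses(merged, edges))    # (2, 1)
```

Fusing a, b and c into one stage removes two instructions and two cross-stage edges, which is the effect the slide attributes to a custom instruction.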

8 Two Algorithms for Custom Instruction Synthesis  Mixed Integer Linear Programming (MILP)  Primary variable definition: S_{i,l}, where i is the index of the operations and l is the index of the software stages; for example, S_{3,4} = 1.  Parameter definition: hardware execution delay, where k is the index of operation types.

9 Two Algorithms for Custom Instruction Synthesis  Assistant variable definition: execution cycle delay; for example, Sd_6 = 0.8.  Constraints: 1. Data dependency constraint 2. I/O constraint
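The constraint formulas themselves are not part of this transcript. As a hedged illustration only, one common way to encode stage assignment and data dependency is shown below. It assumes a binary variable S_{i,l} that is 1 when operation i is placed in software stage l (consistent with the slide example S_{3,4} = 1), L available stages, and E as the set of DFG edges. The symbols L and E are introduced here and this is not necessarily the authors' exact formulation.

```latex
\begin{align}
\sum_{l=1}^{L} S_{i,l} &= 1 && \forall i
  && \text{(each operation sits in exactly one stage)} \\
\sum_{l=1}^{L} l \, S_{i,l} &\le \sum_{l=1}^{L} l \, S_{j,l} && \forall (i,j) \in E
  && \text{(a consumer is never scheduled before its producer)}
\end{align}
```

The second inequality is non-strict on the assumption that a producer and its consumer may share a stage, since a custom instruction can cover a chain of dependent operations.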

10 Two Algorithms for Custom Instruction Synthesis  SN: the number of instructions.  SE: the total number of data accesses.  For multi-issue, out-of-order processors, the execution cycle count equals the longest execution path delay of the DFG.  The largest number of operations of type k among different software stages.  The number of functional modules (operators) of type k needed in the final custom hardware extension.

11 Two Algorithms for Custom Instruction Synthesis  The unit hardware area of functional module type k.  Cost terms: energy consumption, area overhead, execution cycle.  The advantage of applying MILP to the scheduling problem is that, theoretically, it can find the optimum solution given sufficient search time.
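The area formula is not reproduced in the transcript, so the following is a minimal sketch that only combines the definitions listed on slides 10 and 11. It assumes that the number of modules of type k equals the largest number of type-k operations found in any single software stage, and that area overhead is the sum of module counts times unit areas. The function name, operation types and area values are hypothetical.

```python
from collections import Counter
from typing import Dict


def area_overhead(
    stage_of: Dict[str, int], type_of: Dict[str, str], unit_area: Dict[str, float]
) -> float:
    """Estimate area as sum over types of (modules needed) * (unit area)."""
    # Count the operations of each type inside each software stage.
    per_stage = Counter((stage, type_of[op]) for op, stage in stage_of.items())
    # ASSUMPTION: modules of a type = largest count of that type in any one
    # stage, because stages execute at different times and can share modules.
    modules: Dict[str, int] = {}
    for (_stage, kind), count in per_stage.items():
        modules[kind] = max(modules.get(kind, 0), count)
    return sum(unit_area[kind] * n for kind, n in modules.items())


# Hypothetical schedule: a, b, c share stage 0; d sits alone in stage 1.
stage_of = {"a": 0, "b": 0, "c": 0, "d": 1}
type_of = {"a": "add", "b": "add", "c": "mul", "d": "add"}
unit_area = {"add": 1.0, "mul": 4.0}

print(area_overhead(stage_of, type_of, unit_area))  # 2 adders + 1 multiplier = 6.0
```

The sketch charges area for every stage, including stages with a single operation that the base processor could execute without custom hardware; a fuller model would exclude those.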

12 Two Algorithms for Custom Instruction Synthesis  Simulated Annealing Method  Solution vector definition: OPv = {op1, op2, op3, ..., opn}  Solution variation mechanism: In each iteration, we randomly select n operations and move them to a different software stage to generate a new solution. n represents the maximum distance between the current solution and the one it evolves to. t is the current temperature, T is the starting temperature and N is the total number of operations.

13 Two Algorithms for Custom Instruction Synthesis  The allowable range within which a given operation can move is determined by the locations of its parent and child nodes.  In our algorithm, the actual moving range for an operation is further tightened by the current temperature: range = R * sqr(t/T). We randomly move the operation to a software stage within this range. R = 3 to 8.
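A minimal sketch of how such a move range could be computed follows. Two points are assumptions rather than facts from the slide: "sqr" is read here as a square root (passed in as the replaceable `shrink` argument), and an operation is allowed to share a stage with its parents or children. All names, including `candidate_stages`, are hypothetical.

```python
import math
from typing import Callable, Dict, List


def candidate_stages(
    op: str,
    stage_of: Dict[str, int],
    parents: Dict[str, List[str]],
    children: Dict[str, List[str]],
    num_stages: int,
    t: float,
    T: float,
    R: int = 5,
    shrink: Callable[[float], float] = math.sqrt,  # ASSUMPTION: "sqr" = square root
) -> List[int]:
    """Stages that `op` may move to at temperature t (its current stage included)."""
    # Dependency window: not before the latest parent, not after the earliest child.
    lo = max((stage_of[p] for p in parents.get(op, [])), default=0)
    hi = min((stage_of[c] for c in children.get(op, [])), default=num_stages - 1)
    # Temperature window: reach shrinks from R stages toward 0 as t cools.
    reach = int(R * shrink(t / T))
    cur = stage_of[op]
    return list(range(max(lo, cur - reach), min(hi, cur + reach) + 1))


# Hypothetical chain a -> b -> c scheduled in stages 0, 3 and 9 of 10.
stage_of = {"a": 0, "b": 3, "c": 9}
parents = {"b": ["a"]}
children = {"b": ["c"]}

print(candidate_stages("b", stage_of, parents, children, 10, t=100.0, T=100.0))
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(candidate_stages("b", stage_of, parents, children, 10, t=25.0, T=100.0))
# [1, 2, 3, 4, 5]
```

A caller would then pick a stage other than the current one at random, for example with `random.choice`, to generate the new solution.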

14 Two Algorithms for Custom Instruction Synthesis  Solution acceptance mechanism: A new solution is accepted when its cost is smaller than that of the current solution; when the new cost is larger, it can still be accepted with a probability p.  The Simulated Annealing algorithm balances the trade-off between solution quality and search time.
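The expression for p is not included in the transcript. The sketch below substitutes the standard Metropolis criterion, p = exp(-(new_cost - current_cost) / t), which is the usual choice in simulated annealing but is an assumption here, not the slide's formula. The function name `accept` is hypothetical.

```python
import math
import random


def accept(new_cost: float, cur_cost: float, t: float, rng=random) -> bool:
    """Decide whether to move to the new solution at temperature t (t > 0)."""
    if new_cost < cur_cost:
        return True  # improvements are always accepted
    # ASSUMPTION: Metropolis probability; worse moves become less likely
    # as the cost gap grows or as the temperature falls.
    p = math.exp(-(new_cost - cur_cost) / t)
    return rng.random() < p


print(accept(1.0, 2.0, t=1.0))  # True: the new cost is lower
```

At high temperature p is close to 1, so the search wanders freely. As t falls, p for a fixed cost increase tends to 0 and the search settles into a local optimum.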

15 Two Algorithms for Custom Instruction Synthesis

16 MULTI-OBJECTIVE ASIP SYNTHESIS FLOW

17 EXPERIMENTAL RESULTS  CPLEX is used to solve the MILP problem for design space exploration.  The baseline processor is an out-of-order MIPS-style processor.  The ratio between the weight variables g1 and g2 is set to 12.2 : 1.  The register file I/O constraint is set to 4/2.  We perform experiments for energy reduction by setting the variables å2 and å3 to zero, and for performance improvement by setting å1 and å2 to zero.

18 EXPERIMENTAL RESULTS  The average speedup is 1.42 for Binary Tree, 1.64 for MILP (p.) and 1.56 for MILP (e.).  The average energy consumption reductions are 18.1%, 22.7% and 29.8%, respectively.

19 EXPERIMENTAL RESULTS  The custom instruction templates presented in (b) and (c) are targeting performance and energy efficiency, respectively. There are more operations in the templates identified for energy efficiency, shown in (c), and they include longer critical paths than the sub-graphs shown in (b).

20 EXPERIMENTAL RESULTS  For different designs, the ratio between å1 and å2 can be varied to find the best trade-off between them.  Settings shown: å3 = 0 with å1 = 1, å2 = 0; and å3 = 0 with å1 = å2 = 0.5.
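How the three weights steer the choice can be sketched with a plain weighted sum. The exact cost function is not given in the transcript, so the linear form is an assumption. The mapping of the first, second and third weight to energy, area and execution cycles is a reading of slides 17 and 20, and the two candidate designs with their normalized costs are invented for illustration.

```python
from typing import Dict, Tuple


def weighted_cost(
    energy: float, area: float, cycles: float,
    w_energy: float, w_area: float, w_cycles: float,
) -> float:
    """ASSUMPTION: the objectives are combined as a simple weighted sum."""
    return w_energy * energy + w_area * area + w_cycles * cycles


def best_design(
    designs: Dict[str, Tuple[float, float, float]],
    weights: Tuple[float, float, float],
) -> str:
    """Name of the design with the lowest weighted cost."""
    return min(designs, key=lambda name: weighted_cost(*designs[name], *weights))


# Hypothetical normalized (energy, area, cycles) costs; lower is better.
designs = {
    "cfu_perf": (0.80, 0.30, 0.60),
    "cfu_energy": (0.70, 0.45, 0.65),
}

print(best_design(designs, (1.0, 0.0, 0.0)))  # cfu_energy: energy only
print(best_design(designs, (0.0, 0.0, 1.0)))  # cfu_perf: cycles only
print(best_design(designs, (0.5, 0.5, 0.0)))  # cfu_perf: energy and area balanced
```

Setting two weights to zero reduces the search to a single objective, while intermediate ratios trade one objective against another.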

21 EXPERIMENTAL RESULTS  The SA algorithm achieves an average of 1.46 performance speedup, which is a little lower than that achieved by the MILP algorithm (1.64).