
An Efficient Low-Power Instruction Scheduling Algorithm for Embedded Systems

Contents Introduction Bus Power Model Related Works Motivation Figure-of-Merit Algorithm Overview Random Scheduling Schedule Selection Experimental Result Conclusion

Introduction (1/3) The nomadic life-style is spreading rapidly these days thanks to the progress in microelectronics technology. Electronic equipment has become not only smaller but also smarter. Low-power electronics will play a key role in this nomadic age. The figure-of-merit for the nomadic age = Intelligence / (Size * Cost * Power). Electronic Equipment toward Smaller Size; Nomadic Tool

Introduction (2/3) ASIPs combine high programmability with an application-specific hardware structure. Because ASIPs have high configurability and productivity, they offer a time-to-market advantage. A retargetable compiler is an essential tool for application analysis and code generation in ASIP design. By equipping a retargetable compiler with an efficient scheduling algorithm, low-power code can be generated. Compiler-in-Loop Architecture Exploration

Introduction (3/3) The power consumption of the ASIP instruction memory was found to be 30% or more of the entire processor's power consumption. Minimizing power consumption on the instruction bus is therefore critical in low-power ASIP design. Power Distribution for ICORE

Bus Power Model (1/2) Bit transitions on bus lines are one of the major contributors to power consumption. The traditional power model uses only self-capacitance. With the development of nanometer technologies, coupling capacitance has become significant; as a result, handling crosstalk on buses has become an important issue. Self-capacitance model; self & coupling capacitance model
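As a rough illustration of the self- plus coupling-capacitance model, the sketch below scores a sequence of bus words: each toggled line costs one unit of self-capacitance, and each adjacent line pair costs coupling energy when exactly one line toggles, or double when the two lines toggle in opposite directions (the crosstalk worst case). The capacitance ratio and weighting are illustrative assumptions, not the paper's calibrated model.

```python
def bus_transition_energy(words, width, c_self=1.0, c_coupling=2.0):
    """Relative switching energy of a word sequence on a parallel bus.

    Self term: c_self per line that toggles between consecutive words.
    Coupling term per adjacent line pair: c_coupling if exactly one
    line toggles, 2 * c_coupling if both toggle in opposite directions,
    0 if they toggle together or stay put. Constants are illustrative.
    """
    energy = 0.0
    for prev, cur in zip(words, words[1:]):
        # self-capacitance: count toggled lines
        energy += c_self * bin(prev ^ cur).count("1")
        # coupling capacitance: inspect each adjacent line pair
        for i in range(width - 1):
            lo_toggles = ((prev >> i) & 1) != ((cur >> i) & 1)
            hi_toggles = ((prev >> (i + 1)) & 1) != ((cur >> (i + 1)) & 1)
            if lo_toggles and hi_toggles:
                # both toggled: opposite directions iff final values differ
                if ((cur >> i) & 1) != ((cur >> (i + 1)) & 1):
                    energy += 2 * c_coupling
            elif lo_toggles or hi_toggles:
                energy += c_coupling
    return energy
```

Note that same-direction switching on adjacent lines is free in this model, which is exactly what recoding and cold scheduling try to exploit.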

Bus Power Model (2/2) Crosstalk types; bus power model

Instruction Recoding Instruction recoding analyzes the performance pattern of the application program and reassigns the binary codes. Histogram graphs are used to analyze the application's performance pattern. Chattopadhyay et al. obtained an initial solution using MWP and then applied simulated annealing starting from that solution. Histogram Graph
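A histogram graph of this kind can be sketched as follows: nodes are opcodes, and an edge weight counts how often two opcodes execute back-to-back in the trace, with a pair (op, op) forming a self-loop. The function name and representation are illustrative, not the paper's data structure.

```python
from collections import Counter

def build_histogram_graph(trace):
    """Histogram graph of an instruction trace: edge weights count how
    often two opcodes appear in consecutive cycles. Recoding would then
    assign binary codes with small Hamming distance to heavy edges."""
    edges = Counter()
    for a, b in zip(trace, trace[1:]):
        edges[tuple(sorted((a, b)))] += 1  # undirected adjacency count
    return edges
```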

Cold Schedule C. Su et al. first proposed cold scheduling. Open questions: how should control dependency be reflected in the SCG? Why use MST and simulated annealing as post-processing? TSP is the better choice.

Cold Schedule K. Choi et al. formulated cold scheduling as a TSP problem, which is a reasonable approach. C. Lee et al. extended cold scheduling to VLIW architectures.
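The TSP view can be sketched with a greedy nearest-neighbor heuristic over instruction encodings: the cost of scheduling instruction b right after a is the Hamming distance of their binary codes, i.e. the number of bus transitions. Data dependences are ignored in this sketch; a real cold scheduler would only choose among ready instructions.

```python
def hamming(a, b):
    """Number of differing bits = bus transitions between two codes."""
    return bin(a ^ b).count("1")

def nearest_neighbor_schedule(encodings):
    """Greedy TSP heuristic for cold scheduling: starting from the
    first instruction, repeatedly emit the remaining instruction whose
    encoding is closest in Hamming distance to the last emitted one.
    Dependences are ignored here; this only illustrates the TSP view."""
    remaining = list(encodings)
    order = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda e: hamming(order[-1], e))
        remaining.remove(nxt)
        order.append(nxt)
    return order
```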

Motivation (1/2) Comparison between Recoding and Cold Scheduling:

                          Recoding                     Cold-Scheduling
Input                     Instruction sequence         Instruction binary format
Output                    Recoded instruction binary   Instruction order
Optimization scope        Global                       Local
Considered inst. fields   Partial fields               All fields

Motivation (2/2) (a) Different scheduling results; (b) constructed histogram graphs; (c) optimal recoding results

Figure-of-Merit Maximizing the variance of the transition edge weights increases the efficiency of recoding. The larger the sum of the self-loop edge weights, the greater the power-saving effect of a code sequence. Figure-of-merit
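The two criteria above can be combined into a simple score, sketched below: the variance of the transition edge weights (a skewed histogram gives recoding more to exploit) plus the total self-loop weight (repeated instructions cause no bus transitions). The alpha/beta linear combination is an assumed form for illustration, not the paper's exact formula.

```python
def figure_of_merit(transition_weights, self_loop_weights, alpha=1.0, beta=1.0):
    """Illustrative figure-of-merit for a histogram graph: variance of
    the transition edge weights plus the sum of self-loop weights.
    The weighting (alpha, beta) is an assumption."""
    n = len(transition_weights)
    mean = sum(transition_weights) / n
    variance = sum((w - mean) ** 2 for w in transition_weights) / n
    return alpha * variance + beta * sum(self_loop_weights)
```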

Algorithm Overview The presented FM is a global function, and global instruction scheduling is difficult to implement. We therefore solve the optimization problem in two phases: random schedule gathering followed by schedule selection. Schedule Selection

Random Scheduling

Make_Schedules_for_BBs (BB_SET[])
begin
  for each BB in BB_SET[] do
    list_schedule_solution = LIST_SCHEDULE (BB);
    latency_UB = LATENCY (list_schedule_solution);
    insert list_schedule_solution into Schedules_for_BBs[BB];
    for i = 0 until ITERATION_COUNT (BB) do
      new_schedule = RANDOM_SCHEDULE (BB);
      acceptable = False;
      if (LATENCY (new_schedule) <= latency_UB) then
        acceptable = True;
        for each schedule solution s in Schedules_for_BBs[BB] do
          if (LATENCY (s) == LATENCY (new_schedule)) then
            similarity_measure = COMPARE (s, new_schedule);
            if (similarity_measure > Threshold * LATENCY (new_schedule)) then
              acceptable = False;
              break;
            end
          end
        end
      end
      if (acceptable) then
        insert new_schedule into Schedules_for_BBs[BB];
      end
    end
  end
  return Schedules_for_BBs[];
end

Considerations: runtime performance; BB size and iteration count; differences (similarity) between random schedules
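A minimal executable sketch of this gathering step, under simplifying assumptions: random schedules are random topological orders of the dependence DAG, the latency bound comes from the first generated order rather than a list scheduler, and positional agreement between two orders stands in for the COMPARE similarity measure. All names are illustrative.

```python
import random

def gather_schedules(instrs, deps, latency, iterations=200, threshold=0.8, seed=0):
    """Collect diverse legal schedules for one basic block.

    instrs: instruction ids; deps: set of (a, b) pairs meaning a must
    precede b; latency: callable scoring a schedule. A candidate is
    kept if it meets the latency bound and is not too similar to an
    already-kept schedule of the same latency.
    """
    rng = random.Random(seed)

    def random_topo_order():
        # random topological sort of the dependence DAG
        pending = dict.fromkeys(instrs, 0)
        for _, b in deps:
            pending[b] += 1
        ready = [i for i in instrs if pending[i] == 0]
        order = []
        while ready:
            pick = ready.pop(rng.randrange(len(ready)))
            order.append(pick)
            for a, b in deps:
                if a == pick:
                    pending[b] -= 1
                    if pending[b] == 0:
                        ready.append(b)
        return order

    kept = [random_topo_order()]
    latency_ub = latency(kept[0])
    for _ in range(iterations):
        cand = random_topo_order()
        if latency(cand) > latency_ub:
            continue
        peers = [s for s in kept if latency(s) == latency(cand)]
        # positional agreement stands in for COMPARE
        similarity = max((sum(x == y for x, y in zip(cand, s)) for s in peers),
                         default=0)
        if similarity <= threshold * len(cand):
            kept.append(cand)
    return kept
```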

Schedule Selection (1/3) Problem formulation: the global histogram graph can be decomposed into local histograms, so a divide-and-conquer algorithm can be considered. Merge of Histogram Graphs

Schedule Selection (2/3) NP-hardness: to maximize the global variance, one must consider not only the sum of the local variances but also the covariance of every pair of local histograms.
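Why the selection problem does not decompose cleanly follows from the standard variance identity Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X, Y): the merged histogram's variance couples every pair of basic blocks through a covariance term. The toy check below (the edge weights are made up) verifies the identity numerically.

```python
def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# edge weights of two basic blocks' local histograms over the same edge set
bb1 = [4.0, 0.0, 2.0]
bb2 = [1.0, 3.0, 2.0]
merged = [a + b for a, b in zip(bb1, bb2)]

# global variance = sum of local variances + twice the cross covariance,
# so per-block schedule choices cannot be evaluated independently
assert abs(var(merged) - (var(bb1) + var(bb2) + 2 * cov(bb1, bb2))) < 1e-9
```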

Schedule Selection (3/3) We used dynamic programming to achieve local cost maximization via a bottom-up approach. For further optimization, we applied simulated annealing. Greedy Selection Algorithm
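A greedy stand-in for the selection step can be sketched as follows: for each basic block in turn, pick the candidate schedule (represented here just by its edge-weight histogram) that maximizes the merit of the running merged histogram. This is an illustrative simplification; the paper refines selection with dynamic programming and simulated annealing.

```python
def greedy_select(candidates_per_bb, merit):
    """Pick one candidate histogram per basic block, greedily
    maximizing merit(merged histogram so far). candidates_per_bb is a
    list (one entry per BB) of candidate edge-weight lists; merit is a
    callable on a histogram. Names are illustrative."""
    merged = None
    chosen = []
    for candidates in candidates_per_bb:
        def merged_merit(h):
            if merged is None:
                return merit(h)
            return merit([x + y for x, y in zip(merged, h)])
        best = max(candidates, key=merged_merit)
        chosen.append(best)
        merged = list(best) if merged is None else [x + y for x, y in zip(merged, best)]
    return chosen, merged
```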

Experimental Result We used the Pearson correlation coefficient (PCC) as our measure of performance. Comparison of PCC Values

Conclusion We presented a new instruction scheduling algorithm for low-power code synthesis. It is an exhaustive method for generating low-power code in an application-specific domain, but advances in computing power make it practical.