An Efficient Low-Power Instruction Scheduling Algorithm for Embedded Systems
Contents
Introduction
Bus Power Model
Related Works
Motivation
Figure-of-Merit
Algorithm Overview
Random Scheduling
Schedule Selection
Experimental Result
Conclusion
Introduction (1/3)
The nomadic lifestyle is spreading widely these days thanks to rapid progress in microelectronics technology. Electronic equipment has not only become smaller, it has also become smarter. Low-power electronics will play a key role in this nomadic age. The figure of merit for the nomadic age = (Intelligence) / (Size * Cost * Power).
[Figure: electronic equipment toward smaller size, the nomadic tool]
Introduction (2/3)
ASIPs combine high programmability with an application-specific hardware structure. Because ASIPs offer high configurability and design productivity, they have a time-to-market advantage. A retargetable compiler is an essential tool for application analysis and code generation in ASIP design. By equipping a retargetable compiler with an efficient scheduling algorithm, low-power code can be generated.
[Figure: compiler-in-loop architecture exploration]
Introduction (3/3)
The power consumption of the ASIP instruction memory was found to be 30% or more of the entire processor's power consumption. Minimizing power consumption on the instruction bus is therefore critical in low-power ASIP design.
[Figure: power distribution for ICORE]
Bus Power Model (1/2)
Bit transitions on bus lines are one of the major contributors to power consumption. The traditional power model uses only self capacitance. With the move to nanometer technologies, coupling capacitance has become significant, so handling crosstalk on buses has become an important issue.
[Figure: self capacitance model vs. self and coupling capacitance model]
Bus Power Model (2/2)
[Figure: crosstalk types and the corresponding bus power model]
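As an illustration (not taken from the slides), the standard coupled-bus energy model charges each bit flip on a single line with the self capacitance C_s and each opposing transition on adjacent lines with the coupling capacitance C_c; the exact weight assigned to each crosstalk type is what the slide's bus power model specifies. A minimal sketch of this formulation:

E_{bus} \approx \frac{1}{2} V_{dd}^{2} \left( C_{s} N_{s} + C_{c} N_{c} \right)

where N_s is the number of self transitions and N_c is the number of coupling transitions over the instruction stream.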
Instruction Recoding
Instruction recoding analyzes the execution pattern of the application program and reassigns the binary encodings of instructions. Histogram graphs are used to analyze the application's execution pattern. Chattopadhyay et al. obtained an initial solution using MWP and then applied simulated annealing starting from that solution.
[Figure: histogram graph]
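A minimal sketch of how such a histogram graph could be built from an instruction trace; the function name and the undirected-edge convention are assumptions, not taken from the paper:

from collections import defaultdict

def build_histogram_graph(instruction_sequence):
    # Count how often each (opcode, opcode) pair occurs on
    # consecutive cycles; equal opcodes form self-loop edges.
    edge_weights = defaultdict(int)
    for prev, curr in zip(instruction_sequence, instruction_sequence[1:]):
        # Fold a->b and b->a into one undirected edge, since both
        # directions toggle the same bus lines.
        edge = tuple(sorted((prev, curr)))
        edge_weights[edge] += 1
    return dict(edge_weights)

# Example: a short trace of opcodes
trace = ["add", "add", "mul", "add", "ld", "mul"]
print(build_histogram_graph(trace))
# {('add', 'add'): 1, ('add', 'mul'): 2, ('add', 'ld'): 1, ('ld', 'mul'): 1}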
Cold Scheduling
C. Su et al. first proposed cold scheduling. Open questions: How is control dependency reflected in the SCG? Why use MST and simulated annealing as a post-process? TSP is a better choice.
Cold Scheduling
K. Choi et al. formulated cold scheduling as a TSP problem, which is a reasonable approach. C. Lee et al. extended cold scheduling to VLIW architectures.
Motivation (1/2)
Comparison between Recoding and Cold Scheduling

                          Recoding                      Cold-Scheduling
  Input                   Instruction sequence          Instruction binary format
  Output                  Recoded instruction binary    Instruction order
  Optimization scope      Global                        Local
  Considered inst. field  Partial field                 All fields
Motivation (2/2)
[Figure: (a) different scheduling results, (b) constructed histogram graphs, (c) optimal recoding results]
Figure-of-Merit
Maximizing the variance of the transition edge weights increases the efficiency of recoding. The larger the sum of the self-loop edge weights, the greater the power-saving effect of a code sequence.
[Figure: figure-of-merit]
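As a minimal sketch, the two indicators named on the slide can be computed from a histogram graph built as above; the concrete combination into a single figure of merit (a weighted sum with alpha and beta) is an assumption, not taken from the slides:

from statistics import pvariance

def figure_of_merit(edge_weights, alpha=1.0, beta=1.0):
    # Two indicators from the slide: variance of transition-edge
    # weights and sum of self-loop weights. The weighted-sum
    # combination (alpha, beta) is illustrative only.
    self_loops = [w for (a, b), w in edge_weights.items() if a == b]
    transitions = [w for (a, b), w in edge_weights.items() if a != b]
    variance_term = pvariance(transitions) if len(transitions) > 1 else 0.0
    self_loop_term = sum(self_loops)
    return alpha * variance_term + beta * self_loop_term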
Algorithm Overview
The presented FM is a global function, and global instruction scheduling is difficult to implement directly. We therefore solve the optimization problem in two phases: random schedule gathering and schedule selection.
[Figure: schedule selection]
Random Scheduling
Make_Schedules_for_BBs (BB_SET[ ])
begin
  for each BB in BB_SET[ ] do
    list_schedule_solution = LIST_SCHEDULE (BB);
    latency_UB = LATENCY (list_schedule_solution);
    Insert list_schedule_solution to Schedules_for_BBs[BB];
    for i = 0 until ITERATION_COUNT (BB) do
      new_schedule = RANDOM_SCHEDULE (BB);
      acceptable = False;
      if (LATENCY (new_schedule) <= latency_UB) then
        acceptable = True;
        for each schedule solution s in Schedules_for_BBs[BB] do
          if (LATENCY (s) == LATENCY (new_schedule)) then
            similarity_measure = COMPARE (s, new_schedule);
            if (similarity_measure > Threshold * LATENCY (new_schedule)) then
              acceptable = False;
              break;
            end
          end
        end
      end
      if (acceptable) then
        Insert new_schedule to Schedules_for_BBs[BB];
      end
    end
  end
  return Schedules_for_BBs[ ];
end

Considerations:
- Runtime performance
- BB size and iteration count
- Differences (similarity) between random schedules
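COMPARE and Threshold are only named in the pseudocode. One plausible reading, assumed here rather than taken from the paper, is that COMPARE counts the cycle slots in which two equal-latency schedules issue the same instruction:

def compare(schedule_a, schedule_b):
    # Similarity between two schedules of equal latency: number of
    # cycle slots holding the same instruction.
    # (This concrete definition of COMPARE is an assumption.)
    return sum(1 for a, b in zip(schedule_a, schedule_b) if a == b)

def is_too_similar(schedule_a, schedule_b, threshold=0.8):
    # Mirrors the acceptance test in the pseudocode:
    # reject when similarity exceeds Threshold * latency.
    latency = len(schedule_b)
    return compare(schedule_a, schedule_b) > threshold * latency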
Schedule Selection (1/3)
Problem formulation: the global histogram graph can be decomposed into local histograms, so we can consider a divide-and-conquer algorithm.
[Figure: merge of histogram graphs]
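A minimal sketch of the merge step, assuming the global graph is obtained by summing the edge weights of the per-basic-block histogram graphs (execution-count weighting is omitted):

from collections import Counter

def merge_histogram_graphs(local_graphs):
    # Merge per-basic-block histogram graphs into a global one
    # by summing the weights of matching edges.
    global_graph = Counter()
    for graph in local_graphs:
        global_graph.update(graph)
    return dict(global_graph)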
Schedule Selection (2/3)
NP-hardness: to maximize the global variance, we must consider not only the sum of the local variances but also the covariance of every pair of local histograms.
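This observation follows from the standard variance decomposition; interpreting each basic block's contribution to an edge weight as a random variable X_i over the set of edges (an interpretation added here for clarity, not spelled out on the slide):

\operatorname{Var}\left(\sum_{i} X_{i}\right) = \sum_{i} \operatorname{Var}(X_{i}) + \sum_{i \neq j} \operatorname{Cov}(X_{i}, X_{j})

so the schedule chosen for one basic block cannot be evaluated independently of the schedules chosen for the others.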
Schedule Selection (3/3)
We used a dynamic programming method to achieve local cost maximization in a bottom-up fashion. For further optimization, we applied simulated annealing.
[Figure: greedy selection algorithm]
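Below is a greedy sketch of the selection idea only; it reuses the histogram-graph helpers sketched earlier and is not the authors' dynamic-programming formulation or the simulated-annealing refinement:

def greedy_select(candidates_per_bb):
    # candidates_per_bb: one list of candidate schedules per basic
    # block, each candidate given as its histogram graph
    # (edge-weight dict). Greedily pick, per BB, the candidate that
    # maximizes the figure of merit of the merged global graph so far.
    global_graph = {}
    selection = []
    for candidates in candidates_per_bb:
        best = max(
            candidates,
            key=lambda g: figure_of_merit(merge_histogram_graphs([global_graph, g])),
        )
        selection.append(best)
        global_graph = merge_histogram_graphs([global_graph, best])
    return selection, global_graph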
Experimental Result
We used the PCC as our measure of performance.
[Figure: comparison of PCC values]
Conclusion
We presented a new instruction scheduling algorithm for low-power code synthesis. It is an exhaustive method for generating low-power code in an application-specific domain, but advances in computing power make our method practical.