Decomposition of Instruction Decoder for Low Power Design. TingTing Hwang, Department of Computer Science, Tsing Hua University
Power Dissipation: static dissipation due to leakage current; short-circuit dissipation; charging and discharging of the output load capacitor. [Figure: circuit schematic labeled V_in, V_out, V_DD, GND]
Dynamic Power Dissipation Model, with P = power dissipation, C = load capacitance, E = average transition count of the gate per clock cycle, V_dd = supply voltage, T_cyc = clock period.
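The equation on this slide did not survive extraction. The symbols listed match the standard switching-power expression P = (1/2) * C * V_dd^2 * E / T_cyc, and the sketch below assumes that form. The function name and all numeric values are illustrative, not taken from the slides.

```python
def dynamic_power(c_load, v_dd, e_avg, t_cyc):
    """Average switching power of one gate.

    Assumes the standard form P = 0.5 * C * V_dd^2 * E / T_cyc
    (an assumption; the slide's own equation is not available).
    """
    return 0.5 * c_load * v_dd ** 2 * e_avg / t_cyc

# Illustrative values only: 10 fF load, 2 V supply, 10 ns clock period.
p_busy = dynamic_power(c_load=10e-15, v_dd=2.0, e_avg=0.5, t_cyc=10e-9)
p_idle = dynamic_power(c_load=10e-15, v_dd=2.0, e_avg=0.05, t_cyc=10e-9)
print(p_busy, p_idle)
```

Under this form power is linear in E, so a block whose inputs stop toggling contributes proportionally less switching power.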
Motivation: execution frequency of instructions is uneven. The MOV class, for example, contains three instructions but accounts for 22% of execution frequency (profiling from Powerstone).
Coupling Sub-decoders: partition an instruction decoder into two coupling sub-decoders. The smaller decoder decodes only a small number of instructions. When the smaller decoder is active, the larger decoder is turned off. The smaller decoder is active frequently.
Architecture of Coupling Sub-decoders: controls to turn sub-decoders on/off: Activate-Control; Input AND-OR; Output OR. [Figure: block diagram with instruction flip-flops FF1, FF2, FF3, ... FFn; I-Activate Control with I-Decoder0 and I-Decoder1 producing I-Control0, I-Control1, ...; S-Activate Control with S-Decoder0 and S-Decoder1 producing S-Control0, S-Control1, ...]
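A minimal behavioral sketch of the gating idea, assuming the activate control simply tests whether the instruction belongs to the smaller group. The instruction names, control words, and function names are hypothetical and do not come from the ARM7TDMI decoder.

```python
# Hypothetical groups: instruction type -> control word (illustrative values).
SMALL_GROUP = {"mov": 0b1101, "ldr": 0b0110}
LARGE_GROUP = {"mul": 0b1010, "cmp": 0b0011, "b": 0b1000}

def decode(instr):
    """Return (control_word, active_decoder) for one instruction."""
    use_small = instr in SMALL_GROUP  # activate control
    # The inactive sub-decoder sees gated (all-zero) inputs, so it outputs 0.
    out_small = SMALL_GROUP[instr] if use_small else 0
    out_large = 0 if use_small else LARGE_GROUP[instr]
    # The two outputs are OR-ed; only the active one is non-zero.
    return out_small | out_large, ("small" if use_small else "large")

def large_decoder_idle_fraction(trace):
    """Fraction of cycles in which the larger sub-decoder is turned off."""
    return sum(1 for i in trace if i in SMALL_GROUP) / len(trace)

print(decode("mov"), decode("mul"))
print(large_decoder_idle_fraction(["mov", "mul", "mov", "ldr"]))
```

The idle fraction is the quantity the grouping tries to raise while keeping the smaller group small.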
Instruction Grouping Problem: how to decompose the decoder so that the smaller sub-decoder is small; the smaller sub-decoder is executed frequently; the activate logic is small.
Weighted Graph Model of Execution Sequence: Node: instruction type. Edge (U,V): instruction U (V) executed after V (U). Weights on nodes and edges: execution frequency. [Figure: example graph over the instruction types mov, ldr, b, mul, cmp]
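The graph model above can be sketched as follows. The trace reuses the first ten labels from the slide's figure, read here as an execution sequence, which is an assumption; the function name is illustrative. Weights are raw counts and can be divided by their totals to obtain frequencies.

```python
from collections import Counter

def build_transition_graph(trace):
    """Build node and undirected edge weights from an instruction trace."""
    node_w = Counter(trace)  # how often each instruction type executes
    edge_w = Counter()
    for u, v in zip(trace, trace[1:]):
        # Undirected edge: U after V and V after U share one key.
        edge_w[frozenset((u, v))] += 1
    return node_w, edge_w

# Assumed example sequence (labels taken from the slide's figure).
trace = ["mov", "ldr", "mov", "b", "mul", "mov", "mul", "cmp", "mul", "b"]
node_w, edge_w = build_transition_graph(trace)
for edge, count in sorted(edge_w.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(edge), count)
```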
Power Model: SF_i: transition frequency from M_i to M_i. CF_ij: transition frequency between M_i and M_j. Power_i: power of M_i, estimated by Synopsys. [Figure: graph over mov, ldr, b, mul, cmp with groups M_i and M_j marked]
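The slide's formula combining these terms is not available, so the sketch below only computes SF_i and CF_ij for a given grouping and does not attempt to reconstruct the power expression. The grouping, the trace, and the function name are hypothetical.

```python
from collections import Counter

def group_transition_frequencies(trace, group_of):
    """Return (SF, CF): within-group and between-group transition frequencies."""
    sf, cf = Counter(), Counter()
    pairs = list(zip(trace, trace[1:]))
    for u, v in pairs:
        gi, gj = group_of[u], group_of[v]
        if gi == gj:
            sf[gi] += 1  # transition from M_i to M_i
        else:
            cf[tuple(sorted((gi, gj)))] += 1  # transition between M_i and M_j
    n = len(pairs)
    return ({g: c / n for g, c in sf.items()},
            {k: c / n for k, c in cf.items()})

# Hypothetical two-way grouping and an assumed example trace.
group_of = {"mov": 0, "ldr": 0, "b": 1, "mul": 1, "cmp": 1}
trace = ["mov", "ldr", "mov", "b", "mul", "mov", "mul", "cmp", "mul", "b"]
sf, cf = group_transition_frequencies(trace, group_of)
print(sf, cf)
```

With frequencies normalized by the number of transitions, all SF and CF values sum to 1.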
Instruction Grouping Problem: Graph Partitioning. Steps: generation of the transition graph; initial clustering by random walk; initial partition of clusters; iterative improvement by moving clusters among groups.
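A rough sketch of the iterative-improvement step only, under simplifying assumptions: each instruction type is treated as its own cluster, random-walk clustering is omitted, and the cost is a hypothetical proxy (expected number of instruction types handled by the active sub-decoder) rather than the slides' power model. It also ignores transition weights and the size of the activate logic. Frequencies are made up.

```python
def cost(small, freq):
    """Hypothetical proxy: expected size of the active sub-decoder."""
    f_small = sum(freq[i] for i in small)
    n_small = len(small)
    n_large = len(freq) - n_small
    return f_small * n_small + (1.0 - f_small) * n_large

def improve(freq):
    """Greedily move one instruction type at a time while the cost drops."""
    small = set()
    best = cost(small, freq)
    order = sorted(freq, key=lambda i: -freq[i])  # most frequent first
    improved = True
    while improved:
        improved = False
        for instr in order:
            trial = small ^ {instr}  # move instr to the other group
            c = cost(trial, freq)
            if c < best - 1e-12:
                small, best, improved = trial, c, True
    return small, best

# Made-up execution frequencies for illustration.
freq = {"mov": 0.40, "ldr": 0.25, "b": 0.15, "mul": 0.12, "cmp": 0.08}
small, best = improve(freq)
print(sorted(small), best)
```

Greedy moves of this kind stop at a local optimum, which is one reason a good initial partition matters.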
Experimental Process: ARM7TDMI; circuit described in Verilog; circuit synthesized by Synopsys Design Compiler; power estimated by PrimePower, with switching activities collected by simulating the Powerstone benchmark set.
Results on Two-way Decomposition
Power Consumption Comparisons: lower power consumption.
                       Power (W)
                       Orig.     Decomp.   Improve
Instruction Decoder    4.01E-4   2.81E     %
Control Unit           1.03E-3   8.35E     %
Critical Path and Area Comparisons: shorter critical path timing; area overhead.
                       Critical Path Timing (ns)      Area
                       Orig.   Decomp.   Improve      Orig.   Decomp.   Overhead
Instruction Decoder                      %                              %
Control Unit                             %                              %
Results on Multiple-way Decomposition
Power Consumption for Different Multi-way Grouping: two-way decomposition has the best power reduction; more groups, more overhead. [Figure: power (W), axis from 0 to 5.E-04, for Original, 2way, 3way, 4way; series: Decoder, Overhead]
Critical Path Timing for Different Multi-way Grouping: four-way decomposition has the best timing reduction. [Figure: timing (ns) for Original, 2way, 3way, 4way, 5way; series: Decoder, Overhead]
Area Comparisons: area for different multi-way grouping. [Figure: area for Original, 2way, 3way, 4way, 5way]
Conclusions: two-way partitioning has the best results for the 142-instruction set. Compared to the un-decomposed decoder: 30% reduction in power consumption; 13% improvement in critical path timing. Compared to the un-decomposed control unit: 19% reduction in power consumption; 12% improvement in critical path timing.
Thank You