CHAPTER 4: Optimizing Capacitance and Switching Activity to Reduce Dynamic Power
Sections 1-7
By Astha Chawla
Introduction
- C (capacitance) and A (switching activity) are intertwined: $P = V^2 \times f \times C_{effective}$, where $C_{effective} = A \times C$
- ILP + frequency increase => power problem!
- Factors affecting A:
  - Complexity of the processor
  - Exploitation of parallelism
  - Bit-width of its structures, etc.
  - Optimized at the architectural and microarchitectural level
  - Can be changed by run-time optimizations
- Factors affecting C:
  - Size of a processor's structures
  - Organization to exploit locality
  - Manipulated at the circuit and process technology level
  - Fixed at design time
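A minimal numeric sketch of the power relation above, assuming the standard reading that effective capacitance is the product of switching activity and physical capacitance. The function name and the operating-point numbers are illustrative only and do not come from the slides.

```python
def dynamic_power(voltage, frequency, activity, capacitance):
    """Dynamic power P = V^2 * f * C_effective, with C_effective = A * C."""
    c_effective = activity * capacitance
    return voltage ** 2 * frequency * c_effective


# Illustrative operating point (made-up numbers): 1.0 V, 2 GHz, A = 0.2, C = 1 nF.
base = dynamic_power(1.0, 2e9, 0.2, 1e-9)

# Halving the switching activity halves dynamic power.
half_activity = dynamic_power(1.0, 2e9, 0.1, 1e-9)

# Halving the supply voltage cuts dynamic power to a quarter.
half_voltage = dynamic_power(0.5, 2e9, 0.2, 1e-9)
```

Reducing A or C acts linearly on power, which is why the rest of the chapter looks for switching activity that can be removed without hurting performance.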
Excess Switching Activity
- Idle-unit switching activity: triggered by clock transitions in unused portions of hardware.
- Idle-width switching activity: mismatch between the implemented and the actually needed width of processor structures.
- Idle-capacity switching activity: arises when a program does not use the provided hardware structures in their entirety.
- Parallel switching activity: activity expended in parallel for performance.
- Cacheable switching activity: repetitive switching activity; computing activity can be converted to cache lookups.
- Speculative switching activity: speculatively executing incorrect instructions is wasted activity.
- Value-dependent switching activity: power consumed depends on the actual data values.
Capacitance
- Does not change dynamically
- Total capacitance = capacitance of transistors + capacitance of wires
- Burd and Brodersen: $C_L = C_W + C_{fixed}$
- Low-power architectural techniques require partitioning:
  - Wire partitioning
  - Bit-line segmentation
IDLE-UNIT SWITCHING ACTIVITY
- Static logic: to eliminate switching, it is enough to prevent the inputs from changing.
- Dynamic logic: power can be consumed even if the inputs to the circuit do not change.
- No effect on computation
- Clock gating
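The effect of clock gating on an idle unit can be illustrated with a toy cycle count. This is only a sketch: the function name and the busy/idle schedule are hypothetical, and real gating logic is implemented in hardware, not software.

```python
def clocked_cycles(busy_per_cycle, gated):
    """Count the cycles in which the unit's clock is actually delivered.

    Without gating, the clock toggles the unit every cycle.
    With gating, the clock is delivered only when the unit is busy.
    """
    count = 0
    for busy in busy_per_cycle:
        if busy or not gated:
            count += 1
    return count


# Hypothetical schedule: the unit does useful work in 3 of 8 cycles.
schedule = [True, False, False, True, False, False, False, True]

ungated = clocked_cycles(schedule, gated=False)
gated = clocked_cycles(schedule, gated=True)
```

In this schedule the gated unit sees the clock only in its three busy cycles, and the results computed in those cycles are unchanged.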
- Guarded evaluation: aims to shut down part of the original circuit.
- Precomputation: aims to derive a precomputation circuit for a logic block.
  - Multiplexed precomputation architecture: F(x=0), F(x=1)
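One way to read the multiplexed precomputation architecture is as a Shannon expansion on one input x: the block F is split into the cofactors F(x=0) and F(x=1), and x selects which cofactor is evaluated while the other stays quiet. The sketch below assumes that reading; the example function `f` and the evaluation counters are hypothetical.

```python
from itertools import product


def f(x, a, b):
    # Hypothetical logic block, chosen only for illustration.
    return (x and a) or ((not x) and (a != b))


def make_multiplexed(func):
    """Build F as a mux over the cofactors F(x=0) and F(x=1)."""
    evals = {0: 0, 1: 0}  # how often each cofactor block is exercised

    def f0(a, b):
        evals[0] += 1
        return func(False, a, b)

    def f1(a, b):
        evals[1] += 1
        return func(True, a, b)

    def muxed(x, a, b):
        # Only the selected cofactor sees new inputs; the other one is idle.
        return f1(a, b) if x else f0(a, b)

    return muxed, evals


muxed, evals = make_multiplexed(f)
matches = all(
    muxed(x, a, b) == f(x, a, b)
    for x, a, b in product([False, True], repeat=3)
)
```

Over the eight input combinations, each cofactor is exercised only four times, while the output always matches the original function.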
Deterministic clock gating
- Gates the clock to processor structures when they are known to be idle
- Saves power and improves EDP (energy-delay product) without performance loss
Clock gating examples:
- IBM's POWER5
  - Reduction in switching power > 25%
  - Implements fine-grain gating domains
- Intel's XScale processors
  - Implements three power-saving modes: Idle, Standby, Sleep
  - Cuts power consumption by 30%
Idle-Width Switching Activity: Core
- Arises from a mismatch between the designed bit-width of a processor and the actual bit-width needed in frequently occurring operations
- Narrow-width operands (16 bits wide or less) are detected dynamically
- Narrow-width operands are abundant in integer and multimedia applications
- Approaches:
  - Value gating: disabling the unused width
    - Disables switching in the unused parts of the ALU if both operands are narrow
    - Significant power savings
  - Operation packing: packs more than one narrow-width operation into the full width of the hardware
    - Improves performance without significant power overhead
    - Speculative operation packing
  - Significance compression: compresses non-significant bits
    - Byte-serial pipeline
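The narrow-width check behind value gating can be sketched as follows. The 16-bit threshold is from the slide; the 32-bit word size, the two's-complement interpretation (upper bits all zeros or all ones), and the function names are assumptions made for this example.

```python
WORD_BITS = 32    # assumed datapath width
NARROW_BITS = 16  # narrow-width threshold from the slide


def is_narrow(value):
    """True if a 32-bit two's-complement value fits in 16 bits.

    That is the case when bits 31..15 are all zeros or all ones,
    i.e. the upper half is pure sign extension.
    """
    value &= (1 << WORD_BITS) - 1
    upper = value >> (NARROW_BITS - 1)  # bits 31..15
    all_ones = (1 << (WORD_BITS - NARROW_BITS + 1)) - 1
    return upper == 0 or upper == all_ones


def active_alu_bits(a, b):
    """Value gating: keep only the low 16 bits active if both operands are narrow."""
    return NARROW_BITS if is_narrow(a) and is_narrow(b) else WORD_BITS
```

For example, `active_alu_bits(5, -3)` reports 16 active bits, while `active_alu_bits(5, 0x12345)` needs the full 32 because the second operand is wide.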
Idle-Width Switching Activity: Caches
- Dynamic zero compression: accesses only significant bits
  - Only compresses zero bytes, each marked by a zero indicator bit
- Frequent value compression: dictionary loaded with the frequent values of a program
  - Simple
  - Most efficient compression mechanism
- Frequent value cache: a cache line contains compressed and uncompressed words
  - First array: holds the 8 low-order bits
  - Second array: holds the remaining 24 high-order bits
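Dynamic zero compression can be sketched in software as one zero indicator bit per byte plus a payload that holds only the non-zero bytes. The function names and the byte-level granularity of the example data are assumptions for illustration; the hardware scheme gates accesses to the zero bytes and does not literally build lists.

```python
def dzc_compress(data: bytes):
    """Return (zero indicator bits, non-zero bytes) for a cache line."""
    zero_bits = [b == 0 for b in data]
    payload = bytes(b for b in data if b != 0)
    return zero_bits, payload


def dzc_decompress(zero_bits, payload):
    """Rebuild the original line from the indicator bits and the payload."""
    remaining = iter(payload)
    return bytes(0 if is_zero else next(remaining) for is_zero in zero_bits)


# Hypothetical 8-byte line in which 5 bytes are zero.
line = bytes([0x12, 0, 0, 0, 0x34, 0x56, 0, 0])
zero_bits, payload = dzc_compress(line)
restored = dzc_decompress(zero_bits, payload)
```

Only three of the eight bytes need to be accessed here; the other five are reconstructed from their indicator bits.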
Packing compressed cache lines
- Without packing, the space freed by compression remains empty
- Packing increases cache utilization: indirect power savings
- Packing techniques:
  - Variable packing: packs a variable number of cache lines into cache frames
    - Expensive
  - Fixed packing: a preset number of cache lines is packed
    - Reduced opportunities for compression
- Compression cache:
  - Uses frequent value compression
  - Does not attempt to pack cache lines into frames
  - A frame holds either two compressed lines or one uncompressed line
- Significance compression cache: lines are compressed using sign compression
- Instruction compression
IDLE-CAPACITY SWITCHING ACTIVITY
- Wasted activity related to out-of-order execution
- Processor resources are overprovisioned to support high instruction throughput
- Power inefficiency of out-of-order processors: energy per instruction grows with issue width (IW) as $E_i \sim (IW)^{\gamma}$
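Resource partitioning
- Cannot afford the latency of very long wires
- Partitioned by placing buffers
- Aimed at the size vs. speed trade-off
Wire partitioning
- Wire delay is proportional to $R \times C$
- Breaking a wire into $k$ buffered segments cuts each segment's delay by $k^2$, since R and C both scale with length; total wire delay improves by roughly $k$
- Total energy increases with $k$, since the inserted buffers add switching capacitance
- Replacing buffers with tristate devices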
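The delay argument for wire partitioning can be checked with a small model. This is a sketch under stated assumptions: resistance and capacitance both grow linearly with wire length, segment delay is simply R times C, and each of the k - 1 buffers adds a fixed delay (zero by default). The function and parameter names are hypothetical.

```python
def segmented_wire_delay(length, k, r_per_unit=1.0, c_per_unit=1.0, buffer_delay=0.0):
    """Return (per-segment delay, total delay) for a wire split into k segments."""
    segment_length = length / k
    # R and C both scale with length, so segment delay is quadratic in length.
    segment_delay = (r_per_unit * segment_length) * (c_per_unit * segment_length)
    total_delay = k * segment_delay + (k - 1) * buffer_delay
    return segment_delay, total_delay


unsplit_segment, unsplit_total = segmented_wire_delay(length=8.0, k=1)
split_segment, split_total = segmented_wire_delay(length=8.0, k=4)
```

With k = 4 and ideal buffers, each segment is 16 times faster than the unsplit wire and the end-to-end delay is 4 times shorter. A non-zero `buffer_delay` eats into that gain, which is one reason k cannot be increased indefinitely.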
IDLE-CAPACITY SWITCHING ACTIVITY: INSTRUCTION QUEUE
- Resizable IQ, mix of CAM and SRAM
- Readiness feedback control
  - Adjusts IQ size based on the activity of its entries
  - Decision-making scheme has a safety mechanism
- Occupancy feedback control
  - Applies to IQ, LSQ, ROB
  - Occupancy of a structure is the appropriate feedback control metric
- Logical resizing without partitioning
  - IQ organized as a circular FIFO buffer
  - Size is limited logically by limiting the part that can be allocated to new entries
- ILP-contribution feedback control
- Instruction queue collapsing
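Occupancy feedback combined with logical resizing might look like the sketch below. It tracks only an occupancy count and a logical limit, not the head and tail pointers of a real circular FIFO, and the policy (shrink when a whole step went unused on average, grow when the queue was seen full) is a hypothetical stand-in for the schemes named on the slide.

```python
class ResizableQueue:
    """Toy instruction queue whose usable size is limited logically."""

    def __init__(self, physical_size, step):
        self.physical_size = physical_size
        self.step = step             # resizing granularity (assumed)
        self.limit = physical_size   # logical size: entries open to allocation
        self.occupancy = 0
        self._samples = []
        self._seen_full = False

    def allocate(self):
        """Allocate one entry; refuse if the logical limit is reached."""
        if self.occupancy >= self.limit:
            return False
        self.occupancy += 1
        return True

    def retire(self):
        if self.occupancy > 0:
            self.occupancy -= 1

    def sample(self):
        """Record occupancy once per cycle (or per sampling point)."""
        self._samples.append(self.occupancy)
        if self.occupancy >= self.limit:
            self._seen_full = True

    def end_of_window(self):
        """Resize by one step based on the occupancy seen in this window."""
        average = sum(self._samples) / len(self._samples)
        if self._seen_full:
            self.limit = min(self.physical_size, self.limit + self.step)
        elif average <= self.limit - self.step and self.limit > self.step:
            self.limit -= self.step
        self._samples.clear()
        self._seen_full = False
```

With a 32-entry queue, a step of 8, and a steady occupancy of 10, the limit would shrink to 24 and then 16 over two windows and stay there; if the queue later fills up to its limit, the next window grows it back by one step.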
IDLE-CAPACITY SWITCHING ACTIVITY: CORE
- Dynamically changing the width of an 8-issue processor to 6- or 4-issue
  - 6-issue processor: half of a cluster is disabled
  - 4-issue processor: one whole cluster is disabled
- Appropriate functional units are clock gated
- Decisions are made at the end of the sampling window
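The width adaptation above can be sketched as a mapping from issue width to enabled hardware plus a decision taken once per sampling window. Several details are assumptions made for this example and are not stated on the slide: two 4-issue clusters that each split into two halves, IPC as the decision metric, and the threshold values.

```python
# Assumed organization: two 4-issue clusters, each made of two 2-issue halves.
CLUSTER_HALVES = ["cluster0.lo", "cluster0.hi", "cluster1.lo", "cluster1.hi"]
SLOTS_PER_HALF = 2


def enabled_halves(width):
    """Cluster halves left enabled at a given issue width; the rest are clock gated."""
    if width not in (8, 6, 4):
        raise ValueError("supported issue widths are 8, 6 and 4")
    return CLUSTER_HALVES[: width // SLOTS_PER_HALF]


def choose_width(committed_instructions, cycles):
    """Pick the issue width for the next window from this window's IPC.

    The thresholds are hypothetical placeholders.
    """
    ipc = committed_instructions / cycles
    if ipc > 4.5:
        return 8
    if ipc > 3.0:
        return 6
    return 4
```

At width 6 the list drops `cluster1.hi` (half a cluster), and at width 4 it drops both halves of `cluster1` (one whole cluster), matching the two cases on the slide.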
THANK YOU!