|Processors designed for low power |Architectural state is correct at basic block granularity rather than instruction granularity 2.

3 |Background |B-Processor mechanisms |Results |Conclusion

|Depending on when instructions read their source operands, two pipeline designs are possible: operand values are read before issue, or operand values are read after issue. Issue: the instruction is sent to a functional unit for execution. Dispatch: the instruction is inserted into the instruction scheduler. 4

|Pipeline has a Data-Capture (DC) Scheduler: DC Scheduler + ARF + ROB with Data (Intel Nehalem, Intel Core). [Figure: pipeline with Fetch, Decode and Dispatch; ARF; ROB/Rename Buffer; Data-Capture Scheduler; Execution Units; Read, Update, and Bypass and Wake up paths] 5

|Results produced by instructions are copied twice: first to the ROB on instruction completion, then to the ARF on instruction commit. |ROB + ARF consume a significant portion of the total core power: > 10% [Brooks et al. ISCA 2000] 6

|Design mechanism(s) to reduce the power consumption of the ROB + ARF: reduce the number of writes to these structures. 7

8 |Change the organization of these structures: ports, hierarchical organization, banking [MICRO’92, MICRO’94] |Reduce accesses to these structures: Register File Caches [Yung et al., ICCD ’95]. Reduce writes: target short-lived variables (mostly VLIW)

|Many instruction results within a basic block are not visible outside the basic block; we call such values BB-Internal values. |Values visible outside a basic block are called BB-External values. The last value written to a register within a basic block is a BB-External value. Example basic block: … ADD R1, R2, R3; SUB R4, R1, R6; … MUL R1, R1, R4; … JGZ R10 [Figure labels: Basic Block, Inst-M, Inst-N] 9

|Dependency Distance (Dep-Distance): an integer value defined for every instruction. For instructions producing BB-Internal value(s) only, it is the distance from the instruction to its last consumer. For instructions producing BB-External value(s), it is infinite. 10
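The definition above can be illustrated with a minimal Python sketch. The tuple representation `(opcode, dest, sources)`, the function name `dep_distances`, and the four-instruction block (the example instructions from the slides with the elided ones left out) are assumptions made for illustration. A value that is overwritten without ever being read is not covered by the slides; the sketch conservatively treats it as infinite.

```python
import math

def dep_distances(block):
    """Return the dep-distance of each instruction in one basic block.

    block: list of (opcode, dest, sources); dest is None when the
    instruction writes no register (hypothetical representation).
    """
    result = []
    for i, (_, dest, _) in enumerate(block):
        if dest is None:
            result.append(None)  # no register result, e.g. a branch
            continue
        last_consumer = None
        redefined = False
        for j in range(i + 1, len(block)):
            _, later_dest, later_srcs = block[j]
            # Sources are read before the destination is written, so an
            # instruction that reads and overwrites dest is still a consumer.
            if dest in later_srcs:
                last_consumer = j
            if later_dest == dest:
                redefined = True  # value cannot be visible outside the block
                break
        if redefined and last_consumer is not None:
            result.append(last_consumer - i)  # BB-Internal value
        else:
            # BB-External value, or (assumption) a value that is
            # overwritten without being read: treated as infinite.
            result.append(math.inf)
    return result

example_block = [
    ("ADD", "R1", ["R2", "R3"]),
    ("SUB", "R4", ["R1", "R6"]),
    ("MUL", "R1", ["R1", "R4"]),
    ("JGZ", None, ["R10"]),
]
print(dep_distances(example_block))  # [2, inf, inf, None]
```

Here the R1 written by ADD is BB-Internal (last read by MUL, two instructions later), while the R4 written by SUB and the R1 written by MUL are the last writes to those registers in the block and are therefore BB-External.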

|Many BB-Internal values become dead shortly after being produced, i.e., all consumers of a BB-Internal value are found within a short distance of the instruction producing it. More than 22% of all instructions produce BB-Internal values only, and those values are consumed within 4 instructions of being produced. 11

|Instruction results are broadcast over the bypass network. |If we can guarantee that the instructions dependent on the BB-Internal values produced by an instruction have received those values from the bypass network, then we can skip writing the BB-Internal values to the operand store(s). 12

|If the results of an instruction are not being written to the operand stores (Mechanism #1), then we can stop broadcasting the results beyond the first stage of bypass. 13

14 |Assistance of the Compiler |Changes to ISA |Changes to hardware

15 |Analyze the lifetime of variables and identify the dep-distance of instructions in basic blocks

|Add 2 bits to the instruction encoding. The compiler passes the dep-distance of instructions via this encoding. The bits can be encoded in several ways. Example encoding using powers of 2:

Encoding  Meaning
00        Dep-Distance is infinite
01        1 ≤ Dep-Distance < 2^1    [1]
10        2^1 ≤ Dep-Distance < 2^2  [2-3]
11        2^2 ≤ Dep-Distance < 2^3  [4-7]

16
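As a rough sketch of how a compiler pass might map a dep-distance onto the example 2-bit encoding, assuming the bracketed ranges [1], [2-3], [4-7] are the intended buckets: the function name `encode_dep_distance` is hypothetical, and mapping any distance above 7 to `00` is an assumption (the slides do not say how such values are handled; treating them as infinite is the conservative choice, since the result is then always written).

```python
import math

def encode_dep_distance(dd):
    """Map a dep-distance to the example 2-bit encoding (hypothetical helper)."""
    if dd is None or math.isinf(dd) or dd < 1:
        return 0b00  # infinite: always write the result
    if dd < 2:
        return 0b01  # [1]
    if dd < 4:
        return 0b10  # [2-3]
    if dd < 8:
        return 0b11  # [4-7]
    # Assumption: distances beyond the encodable range fall back to "infinite".
    return 0b00

for dd in (1, 3, 7, 8, math.inf):
    print(dd, format(encode_dep_distance(dd), "02b"))
```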

17 |Add a bit-mask (Presence Vector) to track the presence of instructions in the scheduler. The bit-mask is the same size as the ROB and has head and tail pointers. The first 0 (from the tail) in the mask is set when a new instruction is dispatched. The first 1 (from the head) in the mask is cleared when an instruction is retired.

18 |When an instruction is issued, check whether all dependent instructions have been dispatched. If the dep-distance is n, check whether the nth bit from the bit for this instruction is set. If it is set, then do not write to the ROB and ARF. [Figure: scheduler entries Ia, Ib, Ic, Id, ... alongside the Presence Vector (PV) bits 0 1 1 1 1 … 0; for DD = 3 the check hits]
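The Presence Vector check can be sketched in Python as a circular bit-mask. The class and method names are hypothetical, and the guard that keeps the lookup within the currently occupied window (so that a wrapped-around index cannot alias an older entry) is an assumption added for this sketch, not something the slides describe.

```python
import math

class PresenceVector:
    """Circular bit-mask with head and tail pointers (illustrative model)."""

    def __init__(self, size):
        self.size = size
        self.bits = [0] * size
        self.head = 0   # oldest instruction still in flight
        self.tail = 0   # next free slot
        self.count = 0

    def dispatch(self):
        """Set the first 0 at the tail; return the slot of the new instruction."""
        if self.count == self.size:
            raise RuntimeError("presence vector full")
        slot = self.tail
        self.bits[slot] = 1
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return slot

    def retire(self):
        """Clear the first 1 at the head."""
        if self.count == 0:
            raise RuntimeError("presence vector empty")
        self.bits[self.head] = 0
        self.head = (self.head + 1) % self.size
        self.count -= 1

    def can_skip_write(self, slot, dep_distance):
        """At issue: has the instruction dep_distance slots later been dispatched?"""
        if dep_distance is None or math.isinf(dep_distance):
            return False  # BB-External value: must be written
        age = (slot - self.head) % self.size
        if age + dep_distance >= self.count:
            return False  # assumption: guard against wrap-around aliasing
        return self.bits[(slot + dep_distance) % self.size] == 1

pv = PresenceVector(8)
ia = pv.dispatch()
pv.dispatch()
pv.dispatch()
print(pv.can_skip_write(ia, 3))  # False: the 3rd later instruction is not dispatched yet
pv.dispatch()
print(pv.can_skip_write(ia, 3))  # True: the check hits, the write can be skipped
```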

19 [Figure: bit pattern 011011, labeled Dep-Distance]

|Precise exceptions are not supported: many instructions will not update the architectural state as they are supposed to. But at the end of a basic block, the architectural state matches the state obtained with regular execution. Solution: checkpoint the RF at the end of each basic block; whenever there is an exception, roll back to the start of the basic block and execute in instruction-precise mode. Use a lightweight RF checkpointing mechanism. 20

|ARF  2 ARF + 1 Dirty Mask + Several State Masks Each bit mask is equal to size of ARF # of state masks is equal to the maximum number of basic blocks supported by pipeline + 1 ARF-1 ARF ARF-0 Dirty and State Masks2 copies of ARF 21 ARF

|Dirty mask: tracks which registers have been written by the current basic block. |State mask: holds the current mapping of registers, i.e., whether the latest value of a register is in ARF 0 or in ARF 1. |The first write to a register in a basic block flips the bit in the state mask, so the register value at the end of the last basic block is untouched; subsequent writes to the same register use the current mapping. 22
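A minimal Python sketch of the dirty-mask and state-mask behaviour described above. It models only one in-flight basic block with a single state mask, which is a simplification of the several state masks mentioned in the slides. The class name, the method names, and the rollback-by-flipping-dirty-bits step are assumptions for illustration; the slides only state that execution rolls back to the start of the basic block.

```python
class CheckpointedARF:
    """Two ARF copies plus a dirty mask and one state mask (illustrative model)."""

    def __init__(self, num_regs):
        self.banks = [[0] * num_regs, [0] * num_regs]  # ARF-0 and ARF-1
        self.state = [0] * num_regs       # which copy holds the latest value
        self.dirty = [False] * num_regs   # written by the current basic block?

    def read(self, reg):
        return self.banks[self.state[reg]][reg]

    def write(self, reg, value):
        if not self.dirty[reg]:
            # First write in this basic block: flip the mapping so the value
            # from the end of the last basic block stays untouched.
            self.state[reg] ^= 1
            self.dirty[reg] = True
        self.banks[self.state[reg]][reg] = value

    def end_basic_block(self):
        # The current mapping becomes the checkpoint for the next block.
        self.dirty = [False] * len(self.dirty)

    def rollback(self):
        # Assumption: undo the block by flipping back every dirty register.
        for reg, is_dirty in enumerate(self.dirty):
            if is_dirty:
                self.state[reg] ^= 1
        self.dirty = [False] * len(self.dirty)

rf = CheckpointedARF(4)
rf.write(1, 10)
rf.end_basic_block()
rf.write(1, 20)
rf.write(1, 30)
print(rf.read(1))  # 30
rf.rollback()
print(rf.read(1))  # 10
```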

23 |MacSim simulator with an integrated McPAT-based tool for modeling power |Nehalem-like core: 4-wide, 128-entry ROB, 36-entry scheduler, 16 IRegs, 32 FRegs, 22nm

24 |Power savings for ROB + ARF: 15% over baseline, 7% over RFC-32. FP benchmarks: B-Processor skips writing many results, and the RFC mechanism writes a lot of live values to the ROB.

25 |Power savings for Bypass Network: the baseline has two levels of bypass; 10% savings on average.

26 |ROB + ARF contribute a significant fraction of total power; proposed: a mechanism to reduce their power consumption. |For BB-Internal values, if all dependent instructions read the value off the bypass network, then skip writes to the ROB and ARF and skip broadcast beyond the first stage of bypass. |The mechanism results in correct architectural state at basic block granularity. |The mechanism reduces ROB + ARF power consumption by 15% and bypass power consumption by 10% relative to a conventional design.

27 Thank You!
