1 Reducing Datapath Energy Through the Isolation of Short-Lived Operands Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose Department of Computer.

Presentation transcript:

1 Reducing Datapath Energy Through the Isolation of Short-Lived Operands Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY

2 Outline – Introduction – Motivations – Contributions Basic idea: isolate short-lived operands in a small dedicated register file and avoid their writes to the ROB and the ARF Resources impacted: ROB, ARF Power savings: 21% with 32-entry additional RF – Results – Conclusions – Future work

3 A P6-like Superscalar Datapath
[Figure: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to function units FU1, FU2, ..., FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), result/status forwarding buses]

4 Out-of-Order Execution and In-Order Retirement
[Figure: datapath blocks F, R, D, Inst. Queue, Ex, ROB, ARF; regions labeled in-order front end, out-of-order core, in-order retirement]

5 Energy-dissipating Events
[Figure: the same datapath with the Write and Read accesses marked]

6 The Idea: Isolating Short-Lived Values
– Write short-lived values into a small dedicated RF (SRF)
[Figure: the same datapath with an SRF block added; Write and Read accesses marked]

7 Register Renaming
– Used to avoid false data dependencies.
– A new physical register is allocated for EVERY new result
– P6 style: ROB slots serve as physical registers
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, P2, 100; SUB P32, P31, P3; ADD P33, P32, P4

8 Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4

9 Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100

10 Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3

11 Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
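
The renaming steps walked through on slides 8 to 11 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `rename` function, the tuple encoding of instructions, and the dictionary standing in for the rename table are hypothetical names chosen here, not the authors' implementation. The starting ROB slot (31) is taken from the slides.

```python
def rename(program, first_free_slot):
    """Rename destinations to ROB slots, P6 style (illustrative sketch).

    program: list of (opcode, dest, src, src) tuples; registers are strings
    such as "R1", immediates are ints. Returns (renamed_text, rat).
    """
    rat = {}               # arch. reg -> ROB slot; a missing key means "in the ARF"
    slot = first_free_slot
    renamed = []
    for opcode, dest, *sources in program:
        # Sources are looked up BEFORE the destination mapping is updated.
        new_sources = [f"P{rat[s]}" if s in rat else str(s) for s in sources]
        rat[dest] = slot   # every new result gets a fresh ROB slot
        renamed.append(f"{opcode} P{slot}, " + ", ".join(new_sources))
        slot += 1
    return renamed, rat


program = [
    ("LOAD", "R1", "R2", 100),
    ("SUB", "R5", "R1", "R3"),
    ("ADD", "R1", "R5", "R4"),
]
renamed, rat = rename(program, first_free_slot=31)
for line in renamed:
    print(line)
```

Registers that have no entry in the table (R2, R3, R4 here) keep their architectural names, which matches the renamed code shown on slide 11.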

12 Short-Lived Values
– Our definition: a value is short-lived if its destination register has already been renamed by the time the result is generated.
– Identified one cycle before the result writeback
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
[Figure label: RENAMER]

13 The Good News: 80%+ of the Values are Short-Lived
– Configuration: 96-entry ROB, 4-way processor
– As rename-to-writeback latency increases in future datapaths, the percentage of short-lived values will also go up
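
To make the link between rename-to-writeback latency and the share of short-lived values concrete, here is a small trace-based sketch. The trace format, the function name `short_lived_fraction`, and all cycle numbers are made up for illustration; they are not measurements from the presentation.

```python
def short_lived_fraction(trace):
    """Fraction of results whose destination is renamed before they are produced.

    trace: program-ordered list of (dest_reg, rename_cycle, writeback_cycle).
    A value counts as short-lived if the NEXT writer of the same architectural
    register was renamed no later than this value's writeback cycle.
    """
    short = 0
    for i, (dest, _rename, writeback) in enumerate(trace):
        for later_dest, later_rename, _wb in trace[i + 1:]:
            if later_dest == dest:          # first later writer = the renamer
                if later_rename <= writeback:
                    short += 1
                break
    return short / len(trace)


# Hypothetical timings for LOAD R1 / SUB R5 / ADD R1.
fast = [("R1", 0, 1), ("R5", 0, 2), ("R1", 3, 4)]   # short rename-to-writeback gap
slow = [("R1", 0, 5), ("R5", 0, 6), ("R1", 3, 8)]   # longer gap, same rename times

print(short_lived_fraction(fast))
print(short_lived_fraction(slow))
```

With the longer latency, the LOAD result is still in flight when the ADD renames R1, so it becomes short-lived; with the short latency it does not.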

14 The Idea: Isolating Short-Lived Values
– Write short-lived values into a small dedicated RF (SRF)
Example code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
[Figure: datapath blocks F, R, D, Inst. Queue, Ex, ROB, ARF, SRF; Write and Read accesses marked; regions labeled in-order front end, out-of-order core, in-order retirement]

15 Why do we need the SRF?
– Need to hang on to the short-lived values to:
  – Recover from branch mispredictions
  – Reconstruct precise state
Example code: LOAD R1, R2, 100; BEQ R5, R1, #100; ADD R1, R5, R4

16 Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
[Figure: Renamed bit-vector]

17 Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
[Figure: Renamed bit-vector]

18 Identifying Short-Lived Values
– Renamed bit is checked one cycle before writeback
– Value produced by LOAD is short-lived because Renamed[31]=1
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
[Figure: Renamed bit-vector]
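
A minimal sketch of how the Renamed bit-vector could be maintained, assuming a 96-entry ROB as on slide 13. The function name `rename_and_mark` and the dictionary used as the rename table are hypothetical; only the idea of setting a bit for the previous producer when its destination is renamed comes from the slides.

```python
ROB_SIZE = 96  # ROB size quoted on slide 13


def rename_and_mark(dests, first_slot, rob_size=ROB_SIZE):
    """Allocate one ROB slot per destination and maintain the Renamed bits.

    dests: architectural destination registers in program order.
    Returns the Renamed bit-vector (one bool per ROB slot).
    """
    renamed = [False] * rob_size
    rat = {}                       # arch. reg -> ROB slot of its latest producer
    slot = first_slot
    for dest in dests:
        previous = rat.get(dest)
        if previous is not None:
            renamed[previous] = True   # the older value now has a renamer
        rat[dest] = slot
        renamed[slot] = False          # a fresh slot starts out "not renamed"
        slot = (slot + 1) % rob_size
    return renamed


# LOAD R1 (slot 31), SUB R5 (slot 32), ADD R1 (slot 33)
renamed_bits = rename_and_mark(["R1", "R5", "R1"], first_slot=31)

# One cycle before the LOAD writes back, the hardware would check this bit:
print(renamed_bits[31])  # the ADD renamed R1, so the LOAD value is short-lived
```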

19 Managing the SRF: the Issues
– When do we write short-lived values into the SRF?
– When and how are the short-lived values removed from the SRF?
– What happens on a branch misprediction?
– How do we reconstruct a precise state?

20 Format of an SRF entry
Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest. Arch. Reg.
– Branch Tag 1: Branch Identifier for Renamer, used to remove this entry if the renamer gets squashed
– Branch Tag 2: Branch Identifier for this instruction, used to remove this entry if this instruction gets squashed
– Branch Identifier of an instruction = id/tag of immediately preceding conditional branch

21 Writing to the SRF: the Conditions
– An instruction writes a short-lived result value into the SRF if:
  – A free entry exists in the SRF
  – No SRF entry keyed with the same ROB slot is already established
– Bit-vector Allocated_in_SRF is maintained
– One bit for each ROB entry
– Set at the time of writeback if value is written into the SRF
– Reset at the time of removing the value from the SRF
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest. reg
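
The write conditions above can be sketched as follows. The class and method names (`SRFEntry`, `SRF`, `try_write`) and the choice to pass the Renamed bit in as a `short_lived` flag are assumptions made for this illustration; the entry fields and the two conditions come from the slides.

```python
from dataclasses import dataclass


@dataclass
class SRFEntry:
    valid: bool = False
    rob_idx: int = 0
    data: int = 0
    branch_tag1: int = 0   # branch preceding the renamer
    branch_tag2: int = 0   # branch preceding the producing instruction
    dest: int = 0          # destination architectural register


class SRF:
    def __init__(self, size, rob_size):
        self.entries = [SRFEntry() for _ in range(size)]
        self.allocated_in_srf = [False] * rob_size   # one bit per ROB entry

    def try_write(self, rob_idx, data, dest, bt1, bt2, short_lived):
        """Return True if the result was written into the SRF at writeback."""
        if not short_lived:
            return False
        # Condition 2: no entry keyed with the same ROB slot already exists.
        if any(e.valid and e.rob_idx == rob_idx for e in self.entries):
            return False
        # Condition 1: a free entry exists.
        for e in self.entries:
            if not e.valid:
                e.valid, e.rob_idx, e.data = True, rob_idx, data
                e.branch_tag1, e.branch_tag2, e.dest = bt1, bt2, dest
                self.allocated_in_srf[rob_idx] = True   # set at writeback
                return True
        return False   # SRF full: the value takes the normal ROB path


srf = SRF(size=1, rob_size=96)
first = srf.try_write(rob_idx=31, data=7, dest=1, bt1=0, bt2=0, short_lived=True)
second = srf.try_write(rob_idx=32, data=9, dest=5, bt1=0, bt2=0, short_lived=True)
print(first, second)   # the second write fails because the 1-entry SRF is full
```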

22 Scenarios for Removing the Values from the SRF
– Scenario 1: Normal Commitment of Renamer
– Scenario 2: Renamer gets squashed
– Scenario 3: The instruction generating the short-lived value itself gets squashed

23 Removing the Values from the SRF: Scenario 1
– Values are removed by the Renamer
– 2-step process:
  – Mark the instruction whose value is to be removed from the SRF (done at the time of renaming)
  – Remove the marked value from the SRF IF NEED BE (done at the time of commitment)
– When ADD commits, it removes the value written by LOAD
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
[Figure label: Renamer]

24 Marking the Values for Removal
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
[Figure: ROB holding LOAD and SUB; slot numbers 31, 32, 33]

25 Marking the Values for Removal
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
[Figure: ROB holding LOAD, SUB, ADD; FS (Flush SRF) field of the ROB; value 31]

26 Removing the Values (B is the renamer for A)
– FS field of B must match the ROB index field of a SRF entry
– This SRF entry must belong to A
Example code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
[Figure: ROB holding LOAD, SUB, ADD with labels A and B and the value 31; SRF entry cells: 1 | 31 | 1 | load]

27 Another Example (LOAD could not write to SRF)
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
SRF was full!
[Figure: Renamed bit-vector, entry 31 = 1]

28 Another Example
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4; …; MUL R2, R3, R4; DIV R2, R2, R5
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
[Figure: two instructions marked Committed; Renamed bit-vector, entry 31 = 0]

29 Another Example
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4; …; MUL R2, R3, R4; DIV R2, R2, R5
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4; …; MUL P31, R3, R4
[Figure: two instructions marked Committed; Renamed bit-vector, entry 31 = 0]

30 Another Example
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4; …; MUL R2, R3, R4; DIV R2, R2, R5
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4; …; MUL P31, R3, R4; DIV P32, P31, R5
[Figure: instructions marked Committed; Renamed bit-vector]

31 Another Example (A’s ROB slot is assigned for C)
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
[Figure: ROB holding LOAD, SUB, ADD with labels A and B and the value 31; SRF entry with Valid = 0]

32 Another Example (A’s ROB slot is assigned for C)
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4; …; MUL P31, R3, R4; DIV P32, P31, R5
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
[Figure: ROB holding MUL, DIV, ADD with labels C, D and B and the value 31; SRF entry cells: 1 | 31 | 2 | mul]

33 Ensuring that the right values are removed
– Bit-vector Uncommitted_Write is maintained
  – One bit for each ROB entry
  – Set at the time of establishing SRF entry
  – Reset at the time of commitment
– Instruction B removes the value written by A (allocated to ROB slot i) if:
  – Allocated_in_SRF[i]=1, and
  – Uncommitted_Write[i]=0, meaning the SRF entry keyed by slot i was written by an instruction that has already committed, so it belongs to A and not to a later instruction that reused slot i
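
The removal check performed when a renamer commits can be sketched as below. The function name `try_remove_on_commit`, the dict-based SRF entries, and the way the FS field is passed in are illustrative assumptions; the two bit-vector conditions and the FS/ROB-index match come from the slides.

```python
def try_remove_on_commit(fs, srf, allocated_in_srf, uncommitted_write):
    """Called when renamer B commits; fs is B's FS field (ROB slot of A) or None.

    Returns True if an SRF entry was removed.
    """
    if fs is None:
        return False
    # Remove only if slot fs has a value in the SRF AND that value was written
    # by an instruction that has already committed (i.e. it belongs to A).
    if not allocated_in_srf[fs] or uncommitted_write[fs]:
        return False
    for entry in srf:
        if entry["valid"] and entry["rob_idx"] == fs:
            entry["valid"] = False
            allocated_in_srf[fs] = False   # reset when the value is removed
            return True
    return False


ROB_SIZE = 96

# Case 1: LOAD (slot 31) wrote to the SRF and has committed; ADD now commits.
srf_a = [{"valid": True, "rob_idx": 31, "data": 7}]
alloc_a = [False] * ROB_SIZE
uncommitted_a = [False] * ROB_SIZE
alloc_a[31] = True
removed_a = try_remove_on_commit(31, srf_a, alloc_a, uncommitted_a)

# Case 2: slot 31 was reused by an uncommitted MUL whose value sits in the SRF.
srf_b = [{"valid": True, "rob_idx": 31, "data": 12}]
alloc_b = [False] * ROB_SIZE
uncommitted_b = [False] * ROB_SIZE
alloc_b[31] = True
uncommitted_b[31] = True
removed_b = try_remove_on_commit(31, srf_b, alloc_b, uncommitted_b)

print(removed_a, removed_b)
```

In the second case the entry keyed by slot 31 survives, which is the situation illustrated by the "Another Example" slides.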

34 Avoiding Unnecessary Commitments
– When an instruction allocated to ROB slot i commits and Allocated_in_SRF[i]=1, the data is not copied to the ARF.
[Figure: datapath blocks F, R, D, Inst. Queue, Ex, ROB, ARF, SRF; Write and Read accesses marked; Dest. reg]

35 Handling Branch Mispredictions: Scenario 2
– Problem: Renamer can get squashed -> stale entries remain in the SRF if nothing is done
– Example:
[Figure: ROB holding LOAD (31), BR (32), SUB, ADD; SRF entry cells: 1 | 31 | 1 | load]

36 Handling Branch Mispredictions
– Problem: Renamer can get squashed -> stale entries remain in the SRF if nothing is done
– Example:
[Figure: ROB holding LOAD and BR (32); SRF entry cells: 1 | 31 | 1 | load]

37 Handling Branch Mispredictions
– Solution: Tag each entry in the SRF with the id of the branch preceding the renamer (BT1).
– When the renamer is squashed, the value is removed from the SRF and is written to either the ROB or the ARF (based on the value of the Uncommitted_Write bit)
– Multiplex the ports to reduce complexity
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest

38 Obtaining Branch Tag BT1
– Maintain the array Branch_Tags
– One entry for each ROB slot
[Table columns: Arch. Reg | Phys. Reg. | Location (0-ROB, 1-ARF)]
Original code: LOAD R1, R2, 100; BEQ R6, R7, 200; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, P2, 100; BEQ P6, P7, 200; SUB P33, P31, P3; ADD P34, P33, P4
[Figure: Branch_Tags array; values 31 and 7 shown]

39 Handling Branch Mispredictions: Scenario 3
– Problem: The instruction whose value was inserted into the SRF can itself be squashed
– Example:
[Figure: ROB holding BR (30), LOAD (31), SUB, ADD; SRF entry cells: 1 | 31 | 1 | load]

40 Handling Branch Mispredictions
– Problem: The instruction whose value was inserted into the SRF can itself be squashed
– Example:
[Figure: ROB holding BR (30); SRF entry cells: 1 | 31 | 1 | load]

41 Handling Branch Mispredictions
– Solution: Tag each entry in the SRF with the id of the branch preceding the instruction itself (BT2).
– Simply remove the value from the SRF if such a branch is mispredicted
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
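
The two branch-tag rules (BT1 for a squashed renamer, BT2 for a squashed producer) can be combined in one sketch. The function name `handle_misprediction`, the dict-based entries, and the idea of passing the set of squashed branch tags are assumptions for illustration. The sketch only reports which values must be spilled back; choosing their destination from the Uncommitted_Write bit is left out.

```python
def handle_misprediction(srf, squashed_tags):
    """Clean the SRF after a misprediction (illustrative sketch).

    squashed_tags: ids of the mispredicted branch and of any younger branches
    squashed with it. Returns (dropped, spilled):
      dropped - ROB indices whose producer was squashed (matched on BT2);
      spilled - (rob_idx, data) pairs whose renamer was squashed (matched on
                BT1), so the value must leave the SRF but is still needed.
    """
    dropped, spilled = [], []
    for entry in srf:
        if not entry["valid"]:
            continue
        if entry["bt2"] in squashed_tags:
            # Scenario 3: the producing instruction itself is on the wrong path.
            entry["valid"] = False
            dropped.append(entry["rob_idx"])
        elif entry["bt1"] in squashed_tags:
            # Scenario 2: only the renamer is on the wrong path.
            entry["valid"] = False
            spilled.append((entry["rob_idx"], entry["data"]))
    return dropped, spilled


srf = [
    {"valid": True, "rob_idx": 31, "data": 7, "bt1": 5, "bt2": 2},  # renamer after branch 5
    {"valid": True, "rob_idx": 40, "data": 9, "bt1": 6, "bt2": 5},  # producer after branch 5
    {"valid": True, "rob_idx": 12, "data": 3, "bt1": 1, "bt2": 0},  # unaffected
]
dropped, spilled = handle_misprediction(srf, squashed_tags={5, 6})
print(dropped, spilled)
```

BT2 is tested first because a value whose producer is squashed is simply discarded, regardless of what happened to its renamer.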

42 Supporting Precise Interrupts
– Allow all instructions preceding the faulting instruction to commit
– Squash all instructions following the faulting instruction
– Copy the values of ALL valid SRF entries to the ARF.
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
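
The last step of the precise-interrupt sequence, copying every valid SRF entry to the ARF, might look like the sketch below. The function name `drain_srf_to_arf` and the dict-based structures are hypothetical, and invalidating each entry after the copy is an assumption of this sketch that the slides do not state.

```python
def drain_srf_to_arf(srf, arf):
    """Copy all valid SRF values to their destination architectural registers.

    Assumes older instructions have already committed and younger ones have
    already been squashed, so the surviving entries hold architectural values.
    Returns the number of values copied.
    """
    copied = 0
    for entry in srf:
        if entry["valid"]:
            arf[entry["dest"]] = entry["data"]
            entry["valid"] = False   # assumption: the entry is freed after the copy
            copied += 1
    return copied


arf = {1: 0, 2: 0, 5: 0}
srf = [
    {"valid": True, "dest": 1, "data": 42},
    {"valid": False, "dest": 2, "data": 99},   # stale entry, must be ignored
    {"valid": True, "dest": 5, "data": 17},
]
copied = drain_srf_to_arf(srf, arf)
print(copied, arf)
```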

43 Experimental Setup
[Figure: simulation flow with blocks: Compiled SPEC benchmarks; Datapath specs; Microarchitectural Simulator; Performance stats; Transition counts, context information; Inter-thread buffers; Data analyzer / intra-stream analysis; Two separate threads; VLSI layout data; SPICE decks; SPICE; SPICE measures of energy per transition; Energy/Power Estimator; Power/energy stats]

44 Results: Percentage of Values Written into the SRF
[Figure: y-axis in %; values shown: 40.5%, 60.1%, 77.5%, 82.3%, 86.7%]

45 cycles Results: Average Time Spent by a Value in the SRF Average: cycles

46 Results: Percentage of Values not copied into the ARF
[Figure: y-axis in %; values shown: 42.2%, 61.9%, 79.3%, 84.1%, 86.7%]

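47 Results: Net Energy Reduction
[Figure: y-axis in pJ; legend: ROB + additional logic, ARF, SRF; values shown: 21%, 16%, 9%, 23%]

48 Results: Net Energy Reduction
[Figure: y-axis in pJ; legend: ROB + additional logic, ARF, SRF; values shown: 21%, 16%, 9%, 23%]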

49 Related Work
– Register Traffic Analysis (Franklin and Sohi, MICRO’92)
  – Studied the useful lifetime of register instances
  – Delaying the writes until 30 more instructions are dispatched can eliminate 80% of the writes (if perfect knowledge of the last use is available)
  – Buffering the 30 most recently generated results avoids 80% of writebacks
– Lozano and Gao (MICRO’95)
  – 90% of all result values are short-lived (consumed while in the ROB)
  – A mechanism to avoid commitment of these values, and also to avoid register allocation for them, is proposed
  – ROB slots are exposed to the compiler in the form of symbolic registers
– Lazy Retirement (Savransky, Ronen, Gonzalez, WCED’02)
  – Hardware-based scheme to avoid unnecessary commitments
  – Copying from the ROB to the ARF is delayed until the ROB slot is reused. In many cases, the register is invalidated by the newer instruction
  – An additional rename table is needed
  – About 75% of commits are avoided

50 Conclusions
– Significant power savings and negligible impact on performance
– Sources of power savings:
  – The majority of generated results are written into a small, lightly-ported SRF
  – Unnecessary commitments are avoided
  – The additional logic/storage needed to do this is simple
– For a 32-entry SRF, more than 77% of writebacks and more than 79% of commitments can be avoided
– This results in energy savings of 21% on the ROB and the ARF

51 THANK YOU!
This work was supported in part by DARPA through the PAC-C program and NSF
LOW POWER RESEARCH GROUP, Department of Computer Science, State University of New York, Binghamton, NY
Parallel Architectures and Compilation Techniques (PACT’03), October 1st 2003

52 Complexity of the Solution
– SRF
– Three bit vectors (same size as the ROB):
  – Renamed
  – Allocated_in_SRF
  – Uncommitted_Write
– 4-bit array Branch_Tags (same size as the ROB)