|Processors designed for low power
|Architectural state is correct at basic block granularity rather than instruction granularity



|Background
|B-Processor mechanisms
|Results
|Conclusion

|Depending on when instructions read their source operands, two pipeline designs are possible
  Operand values are read before issue
  Operand values are read after issue
  Issue → instruction sent to functional unit for execution
  Dispatch → instruction inserted into instruction scheduler

|Pipeline has a Data-Capture (DC) Scheduler
  DC Scheduler + ARF + ROB with Data: Intel Nehalem, Intel Core
[Figure: pipeline with Fetch, Decode and Dispatch; Data-Capture Scheduler; Execution Units; ARF; ROB/Rename Buffer; with Read, Update, and Bypass and Wake up paths]

|Results produced by instructions are copied twice
  First to ROB, on instruction completion
  Then to ARF, on instruction commit
|ROB + ARF consume a significant portion of the total core power
  > 10% [Brooks et al. ISCA 2000]

|Design mechanism(s) to reduce the power consumption of the ROB + ARF
  Reduce the number of writes to these structures

|Change the organization of these structures
  Ports, hierarchical organization, banking [MICRO'92, MICRO'94]
|Reduce accesses to these structures
  Register File Caches [Yung et al., ICCD '95]
  Reduce writes → target short-lived variables (mostly VLIW)

|Many instruction results within a basic block are not visible outside the basic block
  We call such values BB-Internal values
|Values visible outside a basic block are called BB-External values
  The last value written to a register within a basic block is a BB-External value

    …
    ADD R1, R2, R3
    SUB R4, R1, R6
    …
    MUL R1, R1, R4
    …
    JGZ R10

[Figure labels: Basic Block, Inst-M, Inst-N]
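The classification rule on this slide can be sketched in Python. This is only an illustration: the `Inst` tuple, the `classify_values` function, and the operand order (destination first, then sources) are assumptions, and the elided ("…") instructions of the slide's example are ignored.

```python
from typing import List, NamedTuple, Optional, Tuple


class Inst(NamedTuple):
    """Hypothetical instruction record: opcode, destination, sources."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def classify_values(block: List[Inst]) -> List[Optional[str]]:
    """Label each instruction's result as BB-Internal or BB-External.

    Following the slide, the last value written to a register within
    the basic block is treated as BB-External; any earlier write to
    the same register is BB-Internal.
    """
    last_writer = {}
    for idx, inst in enumerate(block):
        if inst.dest is not None:
            last_writer[inst.dest] = idx  # later writes replace earlier ones

    labels: List[Optional[str]] = []
    for idx, inst in enumerate(block):
        if inst.dest is None:
            labels.append(None)  # e.g. a branch: no register result
        elif last_writer[inst.dest] == idx:
            labels.append("BB-External")
        else:
            labels.append("BB-Internal")
    return labels


# The slide's example with the elided instructions left out.
example_block = [
    Inst("ADD", "R1", ("R2", "R3")),
    Inst("SUB", "R4", ("R1", "R6")),
    Inst("MUL", "R1", ("R1", "R4")),
    Inst("JGZ", None, ("R10",)),
]
example_labels = classify_values(example_block)
```

Under these assumptions, only the R1 value produced by ADD is BB-Internal, because MUL overwrites R1 before the block ends.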

|Dependency Distance (Dep-Distance): integer value defined for every instruction
  For instructions producing BB-Internal value(s) only → it is the distance from the instruction to its last consumer
  For instructions producing BB-External value(s) → it is infinite
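A minimal sketch of how a dep-distance could be computed for each instruction of a basic block follows. The `Inst` record and `dep_distances` function are hypothetical names. Distance is assumed to be counted in instructions within the block, instructions without a register result get `None`, and a BB-Internal value with no consumer gets distance 0; the slides do not specify those last two cases.

```python
import math
from typing import List, NamedTuple, Optional, Tuple, Union


class Inst(NamedTuple):
    """Hypothetical instruction record: opcode, destination, sources."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def dep_distances(block: List[Inst]) -> List[Optional[Union[int, float]]]:
    """Return the dep-distance of every instruction in a basic block."""
    distances: List[Optional[Union[int, float]]] = []
    for i, inst in enumerate(block):
        if inst.dest is None:
            distances.append(None)  # assumption: nothing to track
            continue
        last_consumer = None
        overwritten = False
        for j in range(i + 1, len(block)):
            later = block[j]
            if inst.dest in later.srcs:
                last_consumer = j  # reads the value before any overwrite
            if later.dest == inst.dest:
                overwritten = True  # value is dead from here on
                break
        if not overwritten:
            distances.append(math.inf)  # BB-External value
        elif last_consumer is None:
            distances.append(0)  # assumption: dead value, no consumer
        else:
            distances.append(last_consumer - i)
    return distances


# Example block from the BB-Internal slide, elided instructions omitted.
example_block = [
    Inst("ADD", "R1", ("R2", "R3")),
    Inst("SUB", "R4", ("R1", "R6")),
    Inst("MUL", "R1", ("R1", "R4")),
    Inst("JGZ", None, ("R10",)),
]
example_distances = dep_distances(example_block)
```

Here ADD gets distance 2 because its last consumer, MUL, is two instructions later; SUB and MUL produce BB-External values and get an infinite distance.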

|Many BB-Internal values become dead shortly after being produced
  i.e., all consumers of a BB-Internal value are found within a short distance of the instruction producing it
  >22% of all instructions produce BB-Internal values only, and those values are consumed within 4 instructions of being produced

|Instruction results are broadcast over the bypass network
|If we can guarantee that the instructions dependent on the BB-Internal values produced by an instruction have received those values from the bypass network, then we can skip writing the BB-Internal values to the operand store(s)

|If the results of an instruction are not being written to the operand stores (Mechanism #1), then we can stop the broadcast of results beyond the first stage of bypass

|Assistance of the Compiler
|Changes to ISA
|Changes to hardware

|Analyze the lifetime of variables and identify the dep-distance of instructions in basic blocks

|Add 2 bits to the instruction encoding
  Compiler passes the dep-distance of instructions via this encoding
  Bits can be encoded in several ways
  Example encoding using powers of 2

  Encoding  Meaning
  00        Dep-Distance is infinite
  01        2^0 ≤ Dep-Distance < 2^1  [1]
  10        2^1 ≤ Dep-Distance < 2^2  [2-3]
  11        2^2 ≤ Dep-Distance < 2^3  [4-7]
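The example encoding can be sketched as a pair of helper functions. The function names are hypothetical. Two choices are assumptions not stated on the slide: distances outside 1 to 7 fall back to `00` (always write), and the hardware is assumed to use the upper bound of each range when it checks for dependents.

```python
import math
from typing import Optional, Union


def encode_dep_distance(distance: Union[int, float]) -> int:
    """Map a dep-distance to the 2-bit field of the example encoding."""
    if distance == math.inf or distance < 1 or distance >= 8:
        # Assumption: anything the field cannot express is treated
        # conservatively, like an infinite distance.
        return 0b00
    # bit_length: 1 -> 1 (01), 2..3 -> 2 (10), 4..7 -> 3 (11)
    return int(distance).bit_length()


def decode_upper_bound(code: int) -> Optional[int]:
    """Largest dep-distance covered by a code, or None for infinite."""
    if code == 0b00:
        return None
    return (1 << code) - 1  # 01 -> 1, 10 -> 3, 11 -> 7


example_codes = [encode_dep_distance(d) for d in (1, 2, 3, 4, 7)]
```

Using the range's upper bound is the conservative choice: if the instruction at the upper bound has been dispatched, so has every instruction at a shorter distance.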

|Add a bit-mask (Presence Vector) to track the presence of instructions in the Scheduler
  Bit-mask is the same size as the ROB
    Bit-mask has head and tail pointers
    First 0 (from tail) in the mask is set when a new instruction is dispatched
    First 1 (from head) in the mask is cleared when an instruction is retired
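The Presence Vector behaves like a circular bit-mask, which the following sketch models in software. The class and method names are hypothetical, and the explicit occupancy counter is an assumption added so that the full and empty cases can be told apart.

```python
from typing import List


class PresenceVector:
    """Circular bit-mask with one bit per ROB entry."""

    def __init__(self, size: int) -> None:
        self.bits: List[int] = [0] * size
        self.head = 0   # oldest in-flight instruction
        self.tail = 0   # next free slot
        self.count = 0  # assumption: explicit occupancy counter

    def dispatch(self) -> int:
        """Set the first 0 from the tail and return its slot index."""
        if self.count == len(self.bits):
            raise RuntimeError("presence vector is full")
        slot = self.tail
        self.bits[slot] = 1
        self.tail = (self.tail + 1) % len(self.bits)
        self.count += 1
        return slot

    def retire(self) -> int:
        """Clear the first 1 from the head and return its slot index."""
        if self.count == 0:
            raise RuntimeError("no instruction to retire")
        slot = self.head
        self.bits[slot] = 0
        self.head = (self.head + 1) % len(self.bits)
        self.count -= 1
        return slot


pv = PresenceVector(4)
dispatched = [pv.dispatch() for _ in range(3)]  # fills slots 0, 1, 2
retired = pv.retire()                           # frees slot 0
```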

|When an instruction is issued, check if all dependent instructions have been dispatched
  If the dep-distance is n, check if the nth bit from the bit for this instruction is set
    If set, then do not write to ROB and ARF
[Figure: scheduler entries Ia, Ib, Ic, Id, ... next to the Presence Vector (PV); with DD = 3 the check hits]
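The issue-time check can be sketched as a single function over the Presence Vector bits. The function name and the plain-list representation of the vector are assumptions, as is the wrap-around indexing, which follows from treating the vector as a circular buffer.

```python
import math
from typing import List, Union


def can_skip_operand_store_write(
    presence_bits: List[int], slot: int, dep_distance: Union[int, float]
) -> bool:
    """Decide at issue whether the ROB and ARF writes can be skipped.

    presence_bits: one bit per ROB entry, 1 if an instruction is present
    slot:          presence-vector index of the issuing instruction
    dep_distance:  distance to the last consumer, math.inf if BB-External
    """
    if dep_distance == math.inf:
        return False  # BB-External values must always be written
    target = (slot + int(dep_distance)) % len(presence_bits)
    # A set bit means the instruction n slots younger has been dispatched.
    return presence_bits[target] == 1


# Four instructions (Ia..Id) dispatched into an 8-entry vector.
bits = [1, 1, 1, 1, 0, 0, 0, 0]
hit = can_skip_operand_store_write(bits, slot=0, dep_distance=3)
miss = can_skip_operand_store_write(bits, slot=2, dep_distance=3)
```

In the example, Ia in slot 0 with DD = 3 hits because Id is already present, while the instruction in slot 2 misses because slot 5 is still empty.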

[Figure: Dep-Distance]

|Precise exceptions are not supported
  Many instructions will not update the architectural state as they are supposed to
    But at the end of a basic block, the architectural state matches the state obtained with regular execution
  Solution: checkpoint the RF at the end of each basic block; whenever there is an exception, roll back to the start of the basic block and execute in instruction-precise mode
    Use a lightweight RF checkpointing mechanism

|ARF  2 ARF + 1 Dirty Mask + Several State Masks Each bit mask is equal to size of ARF # of state masks is equal to the maximum number of basic blocks supported by pipeline + 1 ARF-1 ARF ARF-0 Dirty and State Masks2 copies of ARF 21 ARF

|Dirty mask
  Tracks which registers have been written by the current basic block
|State mask
  Holds the current mapping of registers, i.e., whether the latest value of a register is in ARF 0 or in ARF 1
|First write to a register in a basic block flips the bit in the state mask
  Register value at the end of the last basic block is untouched
  Subsequent writes to the same register use the current mapping
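The dirty-mask and state-mask scheme can be modeled in a few lines. This sketch is a simplification under stated assumptions: it keeps a single saved state mask (one basic block in flight) rather than the several state masks the slides describe, and the class and method names are hypothetical.

```python
from typing import List


class CheckpointedARF:
    """Two ARF copies plus dirty and state masks (simplified model)."""

    def __init__(self, num_regs: int) -> None:
        self.copies: List[List[int]] = [[0] * num_regs, [0] * num_regs]
        self.state: List[int] = [0] * num_regs        # copy holding latest value
        self.dirty: List[bool] = [False] * num_regs   # written in this block?
        self.saved_state: List[int] = list(self.state)

    def read(self, reg: int) -> int:
        return self.copies[self.state[reg]][reg]

    def write(self, reg: int, value: int) -> None:
        if not self.dirty[reg]:
            # First write in this block: flip the mapping so the value
            # from the end of the last block stays untouched.
            self.state[reg] ^= 1
            self.dirty[reg] = True
        self.copies[self.state[reg]][reg] = value

    def end_basic_block(self) -> None:
        """Checkpoint: the current mapping becomes the rollback point."""
        self.saved_state = list(self.state)
        self.dirty = [False] * len(self.dirty)

    def rollback(self) -> None:
        """Restore the mapping from the start of the current block."""
        self.state = list(self.saved_state)
        self.dirty = [False] * len(self.dirty)


arf = CheckpointedARF(4)
arf.write(1, 10)
arf.end_basic_block()      # R1 == 10 is now the checkpointed value
arf.write(1, 20)
arf.write(1, 30)           # second write reuses the current mapping
before_rollback = arf.read(1)
arf.rollback()             # exception: return to the start of the block
after_rollback = arf.read(1)
```

Checkpointing and rollback only copy bit masks, never register values, which is what makes the scheme lightweight.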

|MacSim simulator with an integrated McPAT-based tool for modeling power
|Nehalem-like core
  4-wide, 128-entry ROB, 36-entry scheduler, 16 IRegs, 32 FRegs
  22nm

|Power savings for ROB + ARF
  15% over baseline, 7% over RFC-32
  FP benchmarks: B-Processor skips writing many results, while the RFC mechanism writes a lot of live values to the ROB

|Power savings for Bypass Network
  Baseline has two levels of bypass
  10% savings on average

|ROB + ARF contribute a significant fraction of total power
  We propose a mechanism to reduce their power consumption
|For BB-Internal values, if all dependent instructions read the value off the bypass network, then skip the writes to ROB and ARF and skip the broadcast beyond the first stage of bypass
|Mechanism results in correct architectural state at basic block granularity
|Mechanism reduces ROB + ARF power consumption by 15% and bypass power consumption by 10% relative to a conventional design

Thank You!