Power-Aware Operand Delivery

Presentation transcript:

Erika Gunadi and Mikko H. Lipasti
University of Wisconsin-Madison

BACKGROUND & MOTIVATION

Power as a design constraint
- More transistors on chip means more power dissipated as heat
- Challenges in cooling and packaging design
- Microprocessor technology is shifting from performance-focused to power-effective design

Register renaming is used to support operand delivery in OoO machines
- Widely used designs: ARF (architected register file) and PRF (physical register file)
- PRF used by MIPS R10K, Alpha 21264, IBM Power4, IBM Power5, Intel Pentium 4
- ARF used by PowerPC 604, Intel P6 family, Intel Core family, AMD K8 family

Industry trend
- ARF for mobile and power-aware server designs; PRF for power-hungry designs
- No quantitative analysis has been published

DESIGN SPACE

Based on where the speculative results are stored: ARF or PRF

ARF
- Keeps non-speculative values in a small ARF and speculative values in the ROB
- Writes results to the ROB at the writeback stage
- Copies results to the ARF as instructions retire

PRF
- Keeps both speculative and non-speculative values in the PRF
- Writes results only once (to the PRF) at the writeback stage
- Updates only the rename table (RAT) as instructions retire

Based on where the operand reads occur: Payload RAM or no Payload RAM

Payload RAM (PL)
- Reads operand values at the dispatch stage, before instructions are inserted into the reservation stations
- Copies ready operands to the payload RAM and gets unready operands from the bypass logic

No Payload RAM (noPL)
- Only checks operand ready status at the dispatch stage
- Reads operand values from the necessary structures (PRF, or ARF+ROB) upon issue

The two classifications are orthogonal to each other.

Existing machines by category:

                        ARF                                     PRF
Payload RAM (PL)        Intel P6, Intel Core, Intel Pentium M,  None
                        AMD K8 (Athlon, Opteron)
No Payload RAM (noPL)   None                                    Intel Pentium 4, MIPS R10K, Alpha 21264,
                                                                IBM Power4, Power5

Structure accesses per pipeline stage (re = read, wr = write):

Stage    ARF+PL                  ARF+noPL                PRF+PL                 PRF+noPL
Rename   wr RAT, re RAT          wr RAT, re RAT          wr RAT, re RAT         wr RAT, re RAT
Read     re ARF, re ROB          -                       re PRF                 -
Queue    wr RS, wr PL, wr ROB    wr RS, wr PL, wr ROB    wr RS, wr PL, wr ROB   wr RS, wr PL, wr ROB
Sched    wr RS                   wr RS                   wr RS                  wr RS
Issue    re PL                   re PL                   re PL                  re PL
Read     -                       re ARF, re ROB          -                      re PRF
Exe      execute                 execute                 execute                execute
WB       wr ROB, wr RAT, wr PL   wr ROB, wr RAT          wr PRF, wr RAT, wr PL  wr PRF, wr RAT
Retire   re ROB, wr ARF, wr RAT  re ROB, wr ARF, wr RAT  re ROB                 re ROB

Delay and area per stage (Rename, Queue, Read, WB, Retire); the row and column assignment of these values is not recoverable from the transcript:
2.07 0.06 2.69 N/A 2.83 0.82 1.98 0.09 2.94 N/A 0.80 0.28 3.21 1.53 N/A 2.83 2.81 2.01 0.38 N/A 0.74 2.02 2.76 2.25 0.29 N/A 1.68 1.56 2.46 1.05 N/A 2.52 2.12 0.62 N/A 1.32 1.42 2.16 2.03 0.98 N/A 0.71 1.21 2.02 1.93 0.24 N/A 4.58

[Figures: block diagrams of the four designs (ARF with Payload RAM, ARF with no Payload RAM, PRF with Payload RAM, PRF with no Payload RAM), with blocks labeled RAT, ARF, PRF, ROB-tag, ROB-data, PL RAM, RS, BP, and ALU.]

MACHINE TRADEOFFS

noPL: adds one pipeline stage between schedule and execute
- Increases the load mis-scheduling penalty and the branch resolution loop
- Decreases IPC

ARF: harder to do out-of-order branch misprediction resolution
- On PRF, easily done by checkpointing the RAT: MIPS R10K, Alpha 21264, Power4, Power5
- On ARF, a RAT entry can point to either the ROB or the ARF, so a simple checkpoint can leave a stale pointer
- ARF machines use in-order branch resolution: Intel P6, AMD K8, Intel Core

DETAILS OF DESIGN

RAT
- Provides renaming to resolve RAW and to eliminate WAR and WAW
- Contains destination mapping, ready bits, and retire bits (only for ARF)
- Built using RAM and comparators

ROB
- ROB-tag stores information needed while instructions are in the window
- ROB-data stores speculative results before they are copied to the ARF

Register File
- RAM-based structure to store execution results

Payload RAM
- Stores input operands before instructions are executed
- Consists of a CAM structure for the tags and a RAM structure for the data

Reservation Stations
- Consist of CAM and select logic

Bypass Networks
- Two-level bypass network to bypass data for back-to-back execution
- Consist of comparators and muxes

METHODOLOGY

Timing and Power
- Implemented in Verilog, synthesized using Design Compiler, placed and routed using Astro
- Used LSI Logic gflxp 0.11 micron CMOS standard cell library

Microarchitectural
- Simplescalar/Alpha 3.0, speculative scheduling/squashing replay, load-store reordering
- Used SMARTS sampling with reference input sets
- 4-wide, 128 ROB, 32 ARF / 96 PRF, 32 LQ, 24 SQ, 32 sched, 64KB DL1, 16KB IL1, 2MB L2

CONCLUSION

Microprocessors can be categorized as ARF or PRF, with or without PL

Energy impact
- PRF+noPL consumes the least energy
- On average, 20% less energy than an ARF+PL-style machine in the affected structures
- Roughly 6-7% savings in total chip energy

IPC impact
- Out-of-order branch resolution improves performance by 3% on average
- Operand read between issue and execute (noPL) decreases IPC by 1-2%

PRF machines also simplify the implementation of out-of-order branch resolution
- Improved performance with comparable energy
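
The per-stage structure accesses listed in the transcript can be tallied to see why PRF+noPL touches the fewest structures per instruction. The following is a minimal Python sketch, not the authors' methodology: the access lists are transcribed from the poster's stage table as best it can be read, and the helper names (`pipeline`, `result_value_writes`, `total_accesses`) and the definition of a "result-value write" are assumptions made for illustration. Access counts are only a rough proxy; the reported energy numbers come from synthesized Verilog, not from counting.

```python
# Structure accesses for one instruction, transcribed from the poster's
# stage table ("re" = read, "wr" = write). In the PL designs the operand
# Read stage precedes Queue; in the noPL designs it follows Issue.
ARF_READ = ["re ARF", "re ROB"]
PRF_READ = ["re PRF"]


def pipeline(read, wb, retire, read_before_queue):
    """Return an ordered list of (stage, accesses) for one design."""
    front = [("Rename", ["wr RAT", "re RAT"])]
    queue = [("Queue", ["wr RS", "wr PL", "wr ROB"])]
    issue = [("Sched", ["wr RS"]), ("Issue", ["re PL"])]
    back = [("Exe", []), ("WB", wb), ("Retire", retire)]
    if read_before_queue:
        return front + [("Read", read)] + queue + issue + back
    return front + queue + issue + [("Read", read)] + back


DESIGNS = {
    "ARF+PL": pipeline(ARF_READ, ["wr ROB", "wr RAT", "wr PL"],
                       ["re ROB", "wr ARF", "wr RAT"], True),
    "ARF+noPL": pipeline(ARF_READ, ["wr ROB", "wr RAT"],
                         ["re ROB", "wr ARF", "wr RAT"], False),
    "PRF+PL": pipeline(PRF_READ, ["wr PRF", "wr RAT", "wr PL"],
                       ["re ROB"], True),
    "PRF+noPL": pipeline(PRF_READ, ["wr PRF", "wr RAT"],
                         ["re ROB"], False),
}

# Assumption: a "result-value write" is a write at WB or Retire to a
# structure that holds data (ROB-data, ARF, PRF, payload RAM). RAT writes
# update mappings rather than values, so they are excluded.
VALUE_STRUCTURES = {"ROB", "ARF", "PRF", "PL"}


def result_value_writes(design):
    """Count how many times a result value is written after execution."""
    count = 0
    for stage, accesses in DESIGNS[design]:
        if stage not in ("WB", "Retire"):
            continue
        for access in accesses:
            op, structure = access.split()
            if op == "wr" and structure in VALUE_STRUCTURES:
                count += 1
    return count


def total_accesses(design):
    """Count every listed structure access across all stages."""
    return sum(len(accesses) for _, accesses in DESIGNS[design])


if __name__ == "__main__":
    for name in DESIGNS:
        print(f"{name:9s} accesses={total_accesses(name):2d} "
              f"result writes={result_value_writes(name)}")
```

Under these assumptions a result is written three times in ARF+PL (ROB, payload RAM, then ARF at retire) but only once in PRF+noPL, which matches the poster's point that the PRF writes results only once.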