
Graduate Seminar Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Houman Homayoun April 2005

Why Low Power? Embedded space: limited battery life, and battery energy capacity will not grow drastically in the near future. High-performance space: heat dissipation; cooling systems for power dissipation beyond 50 watts are very expensive, and failure mechanisms such as thermal runaway, gate-dielectric breakdown, and junction fatigue become significantly worse as temperature increases.

Ways To Reduce Processor Power: shutting down inactive elements; caching work that has already been done; smart reduction of some of the work.

Smart reduction of some of the work: past designs paid little attention to power and preferred simplicity, so information is moved and re-written redundantly. The goal is to avoid unnecessary information transfer.

Superscalar Architecture [pipeline diagram: Fetch, Decode, Rename, Dispatch to Instruction Queue / Reservation Station, Issue, Execute (F.U.), Write-Back; with Logical Register File, Physical Register File, ROB, and Load/Store Queue]

Power Consumption in a Superscalar Processor: Reservation Station: 27%; ROB: 25%; Rename Table: 14%; UL2: 12%.

Instruction Queue: Why a Major Power Consumer? Tasks involved in the instruction queue: set up an entry for each newly dispatched instruction; read an entry to issue an instruction to a functional unit; wake up instructions waiting in the IQ once a result is produced by a functional unit; select instructions for issue when more ready instructions are available than the issue width.

Instruction Queue: A Power-Hungry Structure [CAM wakeup diagram: each entry, from Instruction 0 to Instruction (IQsize-1), holds ready bits RdyL/RdyR and source tags TagL/TagR; the source tags are compared (==) against the broadcast result tags Tag0..Tag(IW-1), and the comparator outputs are ORed into the ready bits]

Wakeup: Major Power Consumer. Wakeup is the major power-consuming activity: long wires broadcast the result tags from the functional units to all instructions waiting in the instruction queue. Hardware cost: 2 * IW * IQsize * log(IQsize) comparators and 2 * IQsize OR gates; e.g. 2*8*128*log2(128) = 14,336 comparators and 2*128 = 256 OR gates.
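
As a quick check on those numbers, a minimal sketch of the wakeup-cost formula (the function and parameter names are mine, not from the slides):

```python
from math import log2

def wakeup_cost(issue_width: int, iq_size: int) -> tuple[int, int]:
    """Comparators and OR gates needed by tag-broadcast wakeup.

    Each of the iq_size entries holds 2 source tags, each tag is compared
    against all issue_width broadcast tags, and a tag is log2(iq_size) bits.
    """
    comparators = 2 * issue_width * iq_size * int(log2(iq_size))
    or_gates = 2 * iq_size
    return comparators, or_gates

# The slide's example: 8-wide issue, 128-entry instruction queue.
print(wakeup_cost(8, 128))  # (14336, 256)
```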

Low Power Instruction Queue Design: eliminate unnecessary wakeups. Many instructions wait in the instruction queue for long periods, and during that time the processor attempts to wake them up every cycle. Example: an instruction that encounters a cache miss.

Instruction Issue Delay and Its Share of Wakeup Activity: lazy instructions, despite their relatively low frequency, account for more than 85% of the total wakeup activity. [Charts: instruction issue delay distribution; wakeup activity distribution]

Identify Lazy Instruction [pipeline diagram: Fetch Unit, Decode, Register Renaming, Dispatch, Instruction Queue, Issue, F.U., Write-Back, Commit; Instruction Cache, Data Cache, Integer Registers, PC; a 64-entry PC-index table fed by the issue delay (IID)]. If IID >= 10, store the PC in the table; if IID < 11, remove the PC. Accuracy: 50%. Effectiveness: 30% (roughly one third of all lazy instructions are identified).
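
A minimal behavioral sketch of how such a PC-indexed lazy-instruction predictor could work; the class, the method names, and the set-based table are my simplifications (a real 64-entry table would be a small direct-mapped hardware structure), while the 10/11 thresholds follow the slide:

```python
class LazyPredictor:
    """Tracks PCs of instructions recently observed to issue lazily."""

    def __init__(self, entries: int = 64,
                 store_threshold: int = 10, remove_threshold: int = 11):
        self.entries = entries
        self.store_threshold = store_threshold
        self.remove_threshold = remove_threshold
        self.table: set[int] = set()  # PCs currently predicted lazy

    def update(self, pc: int, iid: int) -> None:
        """Train with the issue delay (IID) observed when the instruction issues."""
        if iid >= self.store_threshold and len(self.table) < self.entries:
            self.table.add(pc)
        elif iid < self.remove_threshold:
            self.table.discard(pc)

    def is_lazy(self, pc: int) -> bool:
        """Queried at dispatch: should this instruction be treated as lazy?"""
        return pc in self.table
```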

Optimizations to Reduce Wakeup Activity. Selective Instruction Wakeup: wake up a predicted lazy instruction every two cycles instead of every cycle. Selective Fetch Slowdown: if many lazy instructions are already waiting in the pipeline, avoid adding more instructions (a sketch of both policies follows).
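
A hedged sketch of the two policies layered on such a predictor; the gating functions and the lazy_limit knob are my own illustrative choices, not the exact hardware described in the talk:

```python
def should_broadcast_to(entry_is_lazy: bool, cycle: int) -> bool:
    """Selective Instruction Wakeup: entries predicted lazy listen to the
    result-tag broadcast only every other cycle, halving their CAM activity."""
    return (not entry_is_lazy) or (cycle % 2 == 0)

def should_fetch(lazy_in_flight: int, lazy_limit: int = 16) -> bool:
    """Selective Fetch Slowdown: stall fetch while too many predicted-lazy
    instructions already sit in the pipeline (lazy_limit is an assumed knob)."""
    return lazy_in_flight < lazy_limit
```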

Performance Degradation. The goal is a power-efficient design: save power with no, or only a small, performance cost.

Power Savings. Average power saving: 14%. Across most benchmarks the power saving is more than 10%.

Conclusion. Power is going to be the most critical issue in processor design. The instruction queue is one of the major power consumers. Selective Fetch Slowdown and Selective Wakeup reduce instruction queue power by up to 27% (average: 14%).

Thermal and Power dissipation costs

Why Low Power? High-performance microprocessors: the PowerPC704 consumes 85 watts, the Alpha consumes 100 watts. The growing demand for multimedia functionality requires ever more computing power.

Effectiveness and Accuracy. Statistics gathered after running a program: all instructions: 20; lazy instructions: 10. Effectiveness 30% → 3 lazy instructions are identified correctly. Accuracy 50% → 6 instructions are predicted to be lazy.
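
Reading the two metrics the way the numbers above imply (these definitions are inferred from the example, not stated on the slide): effectiveness = correctly identified lazy instructions / all lazy instructions, and accuracy = correctly identified lazy instructions / all instructions predicted lazy.

```python
lazy_total    = 10    # lazy instructions in the program
effectiveness = 0.30  # fraction of lazy instructions the predictor catches
accuracy      = 0.50  # fraction of lazy predictions that are correct

correct_lazy   = effectiveness * lazy_total   # 3 lazy instructions caught
predicted_lazy = correct_lazy / accuracy      # 6 instructions flagged lazy
print(correct_lazy, predicted_lazy)           # 3.0 6.0
```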

[Circuit diagram: each source operand tag feeds a comparator against result tags 1-4 from a broadcast buffer; a Vcc-connected MUX driven by a Clk/lazy controller gates the comparison]

Overhead: CAM. MUX: 2 transistors; comparator: 3 transistors. Added overhead: 128*2 + 128 = 128*3 = 384 transistors. Total number of comparator transistors: 3 * (number of comparators) = 3*128*2*8*log2(128) = 43,008.
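
The transistor arithmetic, assuming the 128*2 + 128 figure means one 2-transistor MUX plus one extra control transistor per entry; the final percentage is a derived comparison, not a number from the talk:

```python
IQ_SIZE = 128

# Added control: a 2-transistor MUX per entry plus one extra transistor per entry.
overhead_transistors = IQ_SIZE * 2 + IQ_SIZE     # 384

# Existing wakeup CAM: 3 transistors per comparator, 14,336 comparators
# (the count derived on the earlier wakeup slide).
comparator_transistors = 3 * 14336               # 43008

print(f"{100 * overhead_transistors / comparator_transistors:.2f}% added")  # ~0.89% added
```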

Overhead: 64-entry PC-Index Table. Branch prediction logic size: 8000*(4+1)*32 = 1,280,000; power consumption: 7% of total processor power. 64-entry PC-Index Table size: 64*32 + 64*2 = 2,176.
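
The storage arithmetic behind that comparison; the per-entry breakdown (a 32-bit PC plus 2 bits of state per entry) is an assumption rather than something stated explicitly on the slide:

```python
# Branch-prediction logic size, as given on the slide.
bp_size = 8000 * (4 + 1) * 32     # 1,280,000

# 64-entry PC-index table: assumed 32-bit PC per entry plus 2 state bits.
table_size = 64 * 32 + 64 * 2     # 2,176

print(f"table is {100 * table_size / bp_size:.2f}% of the BP structure")  # ~0.17%
```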

Lazy Threshold. Monitor performance loss and power savings; a lazy threshold of 10 gives negligible performance loss and significant power savings.

Future Work: fast instruction prediction; configuration-sensitivity analysis; ROB power savings; register renaming power savings; select logic power savings.