Dave Maze Edwin Olson Andrew Menard

Slides:

Advertisements

Similar presentations

Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Advertisements

CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.

1 A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky *Dept. of Electrical Engineering Dept. of Computer.

A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors Rakesh Kumar Keith Farkas (HP Labs) Norman Jouppi (HP Labs) Partha.

CENTRAL PROCESSING UNIT

Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.

Reconfigurable Microprocessors Lih Wen Koh 05s1 COMP4211 presentation 18 May 2005.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

CML CML Presented by: Aseem Gupta, UCI Deepa Kannan, Aviral Shrivastava, Sarvesh Bhardwaj, and Sarma Vrudhula Compiler and Microarchitecture Lab Department.

Adaptive Techniques for Leakage Power Management in L2 Cache Peripheral Circuits Houman Homayoun Alex Veidenbaum and Jean-Luc Gaudiot Dept. of Computer.

Single-Chip Multiprocessor Nirmal Andrews. Case for single chip multiprocessors Advances in the field of integrated chip processing. - Gate density (More.

Chuanjun Zhang, UC Riverside 1 Low Static-Power Frequent-Value Data Caches Chuanjun Zhang*, Jun Yang, and Frank Vahid** *Dept. of Electrical Engineering.

Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine.

Reducing the Complexity of the Register File in Dynamic Superscalar Processors Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi In Proceedings.

Micro-Architecture Techniques for Sensor Network Processors Amir Javidi EECS 598 Feb 25, 2010.

September 28 th 2004University of Utah1 A preliminary look Karthik Ramani Power and Temperature-Aware Microarchitecture.

Instruction Set Architecture (ISA) for Low Power Hillary Grimes III Department of Electrical and Computer Engineering Auburn University.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

Author: D. Brooks, V.Tiwari and M. Martonosi Reviewer: Junxia Ma

1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power ISCA workshops Sign up for class presentations.

1/25 HIPEAC 2008 TurboROB TurboROB A Low Cost Checkpoint/Restore Accelerator Patrick Akl and Andreas Moshovos AENAO Research Group Department of Electrical.

Slide 1 U.Va. Department of Computer Science LAVA Architecture-Level Power Modeling N. Kim, T. Austin, T. Mudge, and D. Grunwald. “Challenges for Architectural.

EE466: VLSI Design Power Dissipation. Outline Motivation to estimate power dissipation Sources of power dissipation Dynamic power dissipation Static power.

Low Power Techniques in Processor Design

Lecture 1: Performance EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2013, Dr. Rozier.

Hybrid-Scheduling: A Compile-Time Approach for Energy–Efficient Superscalar Processors Madhavi Valluri and Lizy John Laboratory for Computer Architecture.

1 EE 587 SoC Design & Test Partha Pande School of EECS Washington State University

A Centralized Cache Miss Driven Technique to Improve Processor Power Dissipation Houman Homayoun, Avesta Makhzan, Jean-Luc Gaudiot, Alex Veidenbaum University.

Energy-Effective Issue Logic Hasan Hüseyin Yılmaz.

Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk.

CS 211: Computer Architecture Lecture 6 Module 2 Exploiting Instruction Level Parallelism with Software Approaches Instructor: Morris Lancaster.

Bypass Aware Instruction Scheduling for Register File Power Reduction Sanghyun Park, Aviral Shrivastava Nikil Dutt, Alex Nicolau Yunheung Paek Eugene Earlie.

Issue Logic and Power/Performance Tradeoffs Edwin Olson Andrew Menard December 5, 2000.

Pipelining and Parallelism Mark Staveley

Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? –a thread is defined as a separate process with its own instructions.

Runtime Software Power Estimation and Minimization Tao Li.

1/25 June 28 th, 2006 BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control BranchTap Improving Performance With.

Hardware Architectures for Power and Energy Adaptation Phillip Stanley-Marbell.

Outline Models Design of experiments Current Scheduler Completely Fair Scheduler(CFS) ◦ Since Linux ◦ /kernel/sched.c ◦ Maintain balance (fairness)

Computer Science and Engineering Power-Performance Considerations of Parallel Computing on Chip Multiprocessors Jian Li and Jose F. Martinez ACM Transactions.

FPGA-Based System Design: Chapter 6 Copyright  2004 Prentice Hall PTR Topics n Low power design. n Pipelining.

Power and Performance Analysis of PDA Architectures Advanced VLSI Computer Architecture Fall 2000 Robert Lee Ripal Nathuji.

Application Domains for Fixed-Length Block Structured Architectures ACSAC-2001 Gold Coast, January 30, 2001 ACSAC-2001 Gold Coast, January 30, 2001.

1 Improved Policies for Drowsy Caches in Embedded Processors Junpei Zushi Gang Zeng Hiroyuki Tomiyama Hiroaki Takada (Nagoya University) Koji Inoue (Kyushu.

CS203 – Advanced Computer Architecture

Graduate Seminar Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Houman Homayoun April 2005.

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

CS203 – Advanced Computer Architecture

Temperature and Power Management

Jacob R. Lorch Microsoft Research

Green cloud computing 2 Cs 595 Lecture 15.

Stateless Combinational Logic and State Circuits

Microarchitectural Techniques for Power Gating of Execution Units

Power-Aware Operand Delivery

Fine-Grain CAM-Tag Cache Resizing Using Miss Tags

Superscalar Processors & VLIW Processors

Central Processing Unit

Douglas Lacy & Daniel LeCheminant CS 252 December 10, 2003

Ken Barr, Ken Conley and Serhii Zhak Checkpoint I: October 19, 2000

Alpha Microarchitecture

Dual Mode Logic An approach for high speed and energy efficient design

The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.

Sampoorani, Sivakumar and Joshua

* From AMD 1996 Publication #18522 Revision E

Chapter 6: Understanding and Assessing Hardware

Physical Implementation

Automatic Tuning of Two-Level Caches to Embedded Applications

Key Question: What is a simple machine?

Sizing Structures Fixed relations Empirical (simulation-based)

Presentation transcript:

Dave Maze Edwin Olson Andrew Menard Reconfigurable Issue Logic for Microprocessor Power/Performance Throttling Dave Maze Edwin Olson Andrew Menard

Observation Complex issue logic for out-of-order, speculative machines consumes a significant amount of power. Performance increase by complex issue logic is small.

Power/Performance Throttling Many applications have significant peak processing performance requirements but low average performance requirements. Very low power microprocessors can’t deliver best of breed performance. Fastest uPs consume way too much power. Examples: handheld device which might be in standby, waiting for user input, MP3 player mode, or MPEG4 video playback

What about Voltage/Frequency Scaling Reducing voltage (factor of α decreases performance by factor of α but decreases power consumption by α2.) Voltage scaling runs out of steam as Vdd’s approach a few Vt. Need other mechanisms for throttling power/performance.

Why is Issue Logic so power hungry? In issue queue, every instruction is checked every cycle to see if it can be dispatched. This involves broadcasts of data on long bitlines. In 21264, queues compaction accounts for additional energy. Alpha 21264 consumes 18-46% of total energy in issue logic [Gowan] [Gupta].

Our Three Approaches Add separate simple core to complex uP and switch between them Mode switches slow. Only 2 performance points. Only use a subset of the issue window Probably not as low-power. Provides continuum of performance points. Mode switches easy. Bypass issue logic completely Must flush issue window. Only 2 performance points.

Preliminary Results Issue Width X Window Size X In/Out of Order Using identical technologies 1 issue is very poorly modeled; IPC is probably too low and power is almost certainly too high. SpecInt95 li benchmark

Issue window throttling Issue width x window size

Tools Using Wattch (based on SimpleScalar) for high-level architectural modeling. Wattch gives us IPC and power data. Unable to measure critical path differences with Wattch. Open to suggestions… ??

Plan Build better models of in-order processors to provide fairer comparison. Custom tool? Try to get timing information (?) For checkpoint 2, paper mostly done except for some final data results.