Issue Logic and Power/Performance Tradeoffs Edwin Olson Andrew Menard December 5, 2000.

Slides:



Advertisements
Similar presentations
System Integration and Performance
Advertisements

Instruction Set Design
Performance, Energy and Thermal Considerations of SMT and CMP architectures Yingmin Li, David Brooks, Zhigang Hu, Kevin Skadron Dept. of Computer Science,
Practical Caches COMP25212 cache 3. Learning Objectives To understand: –Additional Control Bits in Cache Lines –Cache Line Size Tradeoffs –Separate I&D.
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.
University of Michigan Electrical Engineering and Computer Science 1 A Distributed Control Path Architecture for VLIW Processors Hongtao Zhong, Kevin Fan,
Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.
1 MemScale: Active Low-Power Modes for Main Memory Qingyuan Deng, David Meisner*, Luiz Ramos, Thomas F. Wenisch*, and Ricardo Bianchini Rutgers University.
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
Algorithms Today we will look at: what we mean by efficiency in programs why efficiency matters what causes programs to be inefficient? will one algorithm.
The Evolution of RISC A Three Party Rivalry By Jenny Mitchell CS147 Fall 2003 Dr. Lee.
Chapter 4: Threads. Overview Multithreading Models Threading Issues Pthreads Windows XP Threads.
Performance Counter Based Architecture Level Power Modeling ( ) MethodologyResults Motivation & Goals Processor power is increasing.
RISC By Don Nichols. Contents Introduction History Problems with CISC RISC Philosophy Early RISC Modern RISC.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine.
On-Line Adjustable Buffering for Runtime Power Reduction Andrew B. Kahng Ψ Sherief Reda † Puneet Sharma Ψ Ψ University of California, San Diego † Brown.
1  Caches load multiple bytes per block to take advantage of spatial locality  If cache block size = 2 n bytes, conceptually split memory into 2 n -byte.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Nov 9, 2005 Topic: Caches (contd.)
Skewed Compressed Cache
Code Generation CS 480. Can be complex To do a good job of teaching about code generation I could easily spend ten weeks But, don’t have ten weeks, so.
Author: D. Brooks, V.Tiwari and M. Martonosi Reviewer: Junxia Ma
ECE 510 Brendan Crowley Paper Review October 31, 2006.
A Low-Power Low-Memory Real-Time ASR System. Outline Overview of Automatic Speech Recognition (ASR) systems Sub-vector clustering and parameter quantization.
The Pentium: A CISC Architecture Shalvin Maharaj CS Umesh Maharaj:
Slide 1 U.Va. Department of Computer Science LAVA Architecture-Level Power Modeling N. Kim, T. Austin, T. Mudge, and D. Grunwald. “Challenges for Architectural.
RISC and CISC. Dec. 2008/Dec. and RISC versus CISC The world of microprocessors and CPUs can be divided into two parts:
6.893: Advanced VLSI Computer Architecture, September 28, 2000, Lecture 4, Slide 1. © Krste Asanovic Krste Asanovic
Folklore Confirmed: Compiling for Speed = Compiling for Energy Tomofumi Yuki INRIA, Rennes Sanjay Rajopadhye Colorado State University 1.
Computer Architecture ECE 4801 Berk Sunar Erkay Savas.
Low Power Techniques in Processor Design
1 The Performance Potential for Single Application Heterogeneous Systems Henry Wong* and Tor M. Aamodt § *University of Toronto § University of British.
An Introduction to 64-bit Computing. Introduction The current trend in the market towards 64-bit computing on desktops has sparked interest in the industry.
1 Time Analysis Analyzing an algorithm = estimating the resources it requires. Time How long will it take to execute? Impossible to find exact value Depends.
Feb. 19, 2008 Multicore Processor Technology and Managing Contention for Shared Resource Cong Zhao Yixing Li.
Multi-core Programming Introduction Topics. Topics General Ideas Moore’s Law Amdahl's Law Processes and Threads Concurrency vs. Parallelism.
1 Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy.
TAUCHI – Tampere Unit for Computer-Human Interaction 1 Statistical Models of Human Performance I. Scott MacKenzie.
QUILT A GUI-based Integrated Circuit Floorplanning Environment for Computer Architecture Research and Education David H. Albonesi Computer Systems Laboratory.
Low Power Cache Design M.Bilal Paracha Hisham Chowdhury Ali Raza.
Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.
A Decompression Architecture for Low Power Embedded Systems Lekatsas, H.; Henkel, J.; Wolf, W.; Computer Design, Proceedings International.
Computational Sprinting on a Real System: Preliminary Results Arun Raghavan *, Marios Papaefthymiou +, Kevin P. Pipe +#, Thomas F. Wenisch +, Milo M. K.
CPU Inside Maria Gabriela Yobal de Anda L#32 9B. CPU Called also the processor Performs the transformation of input into output Executes the instructions.
The End of Conventional Microprocessors Edwin Olson 9/21/2000.
 Introduction to SUN SPARC  What is CISC?  History: CISC  Advantages of CISC  Disadvantages of CISC  RISC vs CISC  Features of SUN SPARC  Architecture.
Authors – Jeahyuk huh, Doug Burger, and Stephen W.Keckler Presenter – Sushma Myneni Exploring the Design Space of Future CMPs.
Pipelining and Parallelism Mark Staveley
Runtime Software Power Estimation and Minimization Tao Li.
Lecture#15. Cache Function The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that.
Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.
Stage 1 Statistics students from Auckland university Sampling variability & the effect of sample size.
Computer Science and Engineering Power-Performance Considerations of Parallel Computing on Chip Multiprocessors Jian Li and Jose F. Martinez ACM Transactions.
Reduced Instruction Set Computing Ammi Blankrot April 26, 2011 (RISC)
Activity 1 Review the work from last lesson so that you can explain the following: -What is the purpose of a CPU. -What steps does the CPU take to process.
07/11/2005 Register File Design and Memory Design Presentation E CSE : Introduction to Computer Architecture Slides by Gojko Babić.
COMP SYSTEM ARCHITECTURE PRACTICAL CACHES Sergio Davies Feb/Mar 2014COMP25212 – Lecture 3.
Application Domains for Fixed-Length Block Structured Architectures ACSAC-2001 Gold Coast, January 30, 2001 ACSAC-2001 Gold Coast, January 30, 2001.
The CRISP Performance Model for Dynamic Voltage and Frequency Scaling in a GPGPU Rajib Nath, Dean Tullsen 1 Micro 2015.
1 Potential for Parallel Computation Chapter 2 – Part 2 Jordan & Alaghband.
Power-Optimal Pipelining in Deep Submicron Technology
Complex and Reduced Instruction Set Computers
Variable Word Width Computation for Low Power
Dave Maze Edwin Olson Andrew Menard
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
Douglas Lacy & Daniel LeCheminant CS 252 December 10, 2003
A High Performance SoC: PkunityTM
2.C Memory GCSE Computing Langley Park School for Boys.
Computer Architecture
Presentation transcript:

Issue Logic and Power/Performance Tradeoffs Edwin Olson Andrew Menard December 5, 2000

The need for low-power architectures Low performance - PIMs High performance – video decoding/MP3 playback And increasingly, both. – How do you design an architecture that can do both?

A couple alternatives High performance processor that can be lobotomized – Modify Issue Logic – Change structure sizes Two separate cores – A high performance/high-power core – A low performance/low-power core

Other power throttling mechanisms Voltage scaling – Huge power savings – There’s a limit & high performance designs are pushing towards low voltage– which doesn’t leave much room for throttling. Burn & Coast – Compute at full speed, and then go into a sleep mode. – Simple linear power/performance throttling.

Methodology SimpleScalar/Wattch – Widely used but little/no verification. Several power models available, but very large margins of error. – Still, the size of structures is correlated to power consumption. Industry survey – Look at real-world processors with the range of characteristics of interest. SpecInt95 – Substantially reduced input sets to make simulation feasible.

Issue Window Scaling Popular idea- it’s a highly active chip structure. Window responsible for 20% of non-clock power (Alpha & Wattch agree) Does it work? – Let’s look at RUU usage What’s an upper bound on the useful size? How do smaller sizes impact performance and power?

RUU size upper bounds Modified SimpleScalar, let RUU be arbitrarily big.

Effect of bounded RUU size The RUU’s occupancy “saturates” as one would expect.

Effect of Bounded RUU Size

Bounded RUU Impact on Performance Performance rapidly approaches maximum. 8-issue needs a slightly larger RUU, as expected.

Bounded RUU impact on Power Power consumption increased in RUU as size increases

Power/Performance There’s a minimum! And it’s pretty much where maximum performance is. Hmmm. Structure8x88x168x328x64 Energy/Inst (li) Energy/Inst (perl) Energy/inst (compress) Energy/inst (m88ksim)

Analysis Some groups have advocated a variable capacity RUU. Even if scaling is perfect, there’s little to be gained. A power-conscious architect is likely to be cornered into just one reasonable RUU size.

Adding a separate core If we can’t lobotomize, perhaps we can add a completely separate CPU. Sounds like a good idea – Intuition: a simple in-order processor should have lower energy/instruction than a complex out-of-order one. – Small area overhead, around 1mm^2. Opportunity for more energy savings – Smaller register file – No issue window – Separate low-power caches (though this increases area)

Methodology SimpleScalar/Wattch is all but useless – Availability of only one parameterizable power model (Wattch) and we don’t know what trade-offs the designer made. – Wattch doesn’t support sim-inorder – E.g., Cacti cache model uses 10x greater energy than Krste. Industry Survey

PowerPC Statistics PPC440 is 2-issue, out of order PPC405 is single issue, in-order Both use same technology The 440 is twice as fast, but uses only 1.66 times the power!

AM5x86 vs. K6 5x86 is in-order K6 is out-of-order, 6 issue, 24 entry window K6 has slightly better power/performance – But it’s on a newer process (0.25um rather than 0.35)

Crusoe’s Voltage Scaling & Coast and Burn

Big Proviso CPUs available today, even the “low power” ones, are still after speed. – Low power IA32 is just a slower, high-power IA32. If you designed your simple core for super-low power (without very little regard for speed), how might this change?

Conclusion Smaller issue windows are not a win on power; they lower the amount of ILP found by too much. Multiple cores are not a win on power; the faster core tends to be more energy efficient.