PPEP: Online Performance, Power, and Energy Prediction Framework

Presentation transcript:

PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration
Bo Su†, Junli Gu‡, Li Shen†, Wei Huang‡, Joseph L. Greathouse‡, Zhiying Wang†
†National University of Defense Technology, ‡AMD Research

Dynamic Voltage & Frequency Scaling
Challenge: how to predict performance and power across VF states.
Difficulties on modern processors: multiple clock domains, multiple power planes.

[Figure: AMD FX-8320 block diagram. Labels: Cores; North Bridge; Core Clock Domain & Power Plane; North Bridge Clock Domain & Power Plane; Scalable; Not scalable.]

Good morning, I’m Bo Su from the National University of Defense Technology. The title of our paper is “PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration”. Dynamic voltage and frequency scaling (DVFS) is now widely employed in modern processors to boost performance, lower power, and improve energy efficiency. Good DVFS decisions require accurate performance and power predictions across different VF states. However, these predictions are difficult because of the complexities of modern processors, such as multiple clock domains and multiple power planes.

Q1: Performance prediction

Core VF State    Performance    Power
VF5 (higher)     CPI(VF5)       Power(VF5)
VF4              CPI(VF4)       Power(VF4)
VF3              CPI(VF3)       Power(VF3)
VF2              CPI(VF2)       Power(VF2)
VF1 (lower)      CPI(VF1)       Power(VF1)

Slide annotations: "Running at:", "Performance prediction?"

LL-MAB: Miss Address Buffer (MAB) based Leading Loads (LL) CPI predictor. 3.2% error across a >2x frequency difference.

[Figure: labels: Core & L1$; L2$ with MAB0, MAB1, MAB2, …, MABn; L3$ & DRAM; Core Clock Domain; Other Clock Domains.]

In this work, we overcome these difficulties. Starting from measurements taken at one VF state, we can predict the performance and power of all other VF states on commercial AMD processors. For performance prediction, we implement a Cycles-Per-Instruction (CPI) predictor named LL-MAB. It uses the Miss Address Buffer (also known as the Miss Status Handling Register) of the L2 cache to count memory stall cycles according to the Leading Loads theory. The CPI prediction error is 3.2 percent across a more than 2x change in frequency.
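The frequency-scaling arithmetic behind a Leading Loads style prediction can be sketched as follows. This is only an illustration under stated assumptions, not the LL-MAB implementation: it assumes the memory stall cycles have already been counted, that memory stall time stays constant in wall-clock terms when the core frequency changes, and that the remaining core cycles per instruction do not change. The function name, its parameters, and the example frequencies are hypothetical.

```python
def predict_cpi(total_cycles, mem_stall_cycles, instructions, f_cur_ghz, f_new_ghz):
    """Predict CPI at f_new_ghz from counters measured at f_cur_ghz.

    Assumption (not taken from the talk): core cycles are frequency
    independent, while memory stall *time* is fixed, so the number of
    stall cycles scales with the ratio of the two core frequencies.
    """
    if instructions <= 0 or f_cur_ghz <= 0 or f_new_ghz <= 0:
        raise ValueError("instructions and frequencies must be positive")
    if not 0 <= mem_stall_cycles <= total_cycles:
        raise ValueError("memory stall cycles must lie within total cycles")

    # Cycles spent on work that scales with the core clock.
    core_cycles = total_cycles - mem_stall_cycles
    # Same stall time in seconds, expressed in cycles of the new clock.
    scaled_mem_cycles = mem_stall_cycles * (f_new_ghz / f_cur_ghz)
    return (core_cycles + scaled_mem_cycles) / instructions


# Hypothetical counters sampled at 4.0 GHz, predicted for 2.0 GHz.
measured_cpi = predict_cpi(2_000_000, 500_000, 1_000_000, 4.0, 4.0)
predicted_cpi = predict_cpi(2_000_000, 500_000, 1_000_000, 4.0, 2.0)
print(measured_cpi, predicted_cpi)
```

With these made-up counters, halving the core frequency halves the stall cycles, so the predicted CPI drops even though the wall-clock time per instruction rises.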

Q2: Power prediction

Wednesday 11:05AM

Core VF State    Performance    Power
VF5 (higher)     CPI(VF5)       Power(VF5)
VF4              CPI(VF4)       Power(VF4)
VF3              CPI(VF3)       Power(VF3)
VF2              CPI(VF2)       Power(VF2)
VF1 (lower)      CPI(VF1)       Power(VF1)

Slide annotations: "Running at:", "CPU Power prediction?"

Power model: CPU events + temperature.
Power prediction: LL-MAB + 2 observations of CPU events.
4.2% error across 5 VF states.

For power prediction, we first use CPU events and temperature to build a power model. Then, based on 2 observations, we implement a predictor that allows us to take measurements at one VF state and predict the chip’s power usage at all other VF states. The power prediction error is 4.2 percent. Finally, we combine the CPI predictor and the power predictor to form the PPEP framework. PPEP runs as a user-level software daemon and requires no hardware or operating system modifications. Our talk is at 11:05 on Wednesday morning, before the “Test-of-Time Awards”. Glad to see you then, thank you!
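Once a CPI and a power value have been predicted for every VF state, an energy estimate follows from basic arithmetic: time = CPI × instructions / frequency, and energy = power × time. The sketch below shows only that combination step. It is not PPEP's code; the class, the function names, and every number in the example are hypothetical placeholders rather than measurements from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VFPrediction:
    """Predicted behaviour at one VF state (all values illustrative)."""
    name: str
    freq_ghz: float
    cpi: float       # predicted cycles per instruction
    power_w: float   # predicted chip power in watts


def exec_time_s(pred, instructions):
    # cycles = CPI * instructions; time = cycles / frequency in Hz
    return pred.cpi * instructions / (pred.freq_ghz * 1e9)


def energy_j(pred, instructions):
    # energy = average power * execution time
    return pred.power_w * exec_time_s(pred, instructions)


def lowest_energy_state(preds, instructions):
    # Pick the VF state with the smallest predicted energy.
    return min(preds, key=lambda p: energy_j(p, instructions))


# Made-up predictions for two VF states and a 3.5-billion-instruction interval.
predictions = [
    VFPrediction("VF5", freq_ghz=3.5, cpi=2.0, power_w=100.0),
    VFPrediction("VF1", freq_ghz=1.4, cpi=1.4, power_w=40.0),
]
for p in predictions:
    print(p.name, exec_time_s(p, 3.5e9), energy_j(p, 3.5e9))
print(lowest_energy_state(predictions, 3.5e9).name)
```

A policy could just as well minimise energy-delay product or enforce a performance floor; minimum energy is chosen here only to keep the example short.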