Software-Hardware Cooperative Power Management Technique for Main Memory
Hai Huang, Kang G. Shin (University of Michigan)
Charles Lefurgy, Karthick Rajamani, Tom Keller, Eric Van Hensbergen, Freeman Rawson (IBM Austin Research Lab)
So, today I'm going to be talking about a software-hardware cooperative power management technique for main memory. This work was done at IBM Austin Research Lab during the summer.

Motivation
High power dissipation causes a lot of problems for many computing systems, especially for large servers: high electricity and cooling costs, unreliable electronic components, and low rack density. Intelligent management of system power is important to ensure these systems can continue to function.
The motivation of this work is that high power dissipation is causing a lot of heat-related problems for many computing systems, especially for large servers: for example, high electricity and cooling costs, unreliable electronic components, and lower rack density. To alleviate these problems, we need intelligent management of system power.

DRAM: A Power Hog
Main memory (DRAM) consumes a significant portion of the total power, which makes it a good candidate for power optimization. For example, in an IBM mid-range eServer system, around 40% of the total power is consumed by the main memory.
The main focus of this work is to reduce power for the main memory system because it can consume a very significant portion of the total system power. It has been reported that for an IBM mid-range eServer system, around 40% of the total power is dissipated by the main memory. Therefore, it is definitely a good candidate for power management.

Outline
Motivation, Background, Previous Work, A Cooperative Approach, Results, Conclusion
The outline for the rest of the talk is as follows. Next, I'm going to give a little bit of background information on DRAM, focusing specifically on its power management capabilities. Then I'm going to talk a little about previous work. Then I will propose a new cooperative power management technique, followed by some experimental results, and finally we conclude.

Outline
Motivation, Background, Previous Work, A Cooperative Approach, Results, Conclusion

Background
DRAM dissipates power continuously: self-refresh circuitry, row/column decoders, amplifiers, data queues, etc. DRAM has power management capabilities: multiple power states, with the memory controller implementing a simple interface to transition between these states. Transitions have non-negligible delays, so there are trade-offs between power and performance.
DRAMs are simple solid-state devices that consume power continuously. Some of the main energy-consuming components include the self-refresh circuitry, row/column decoders, amplifiers, and data queues. In order to reduce power, we need to power down some of these components. To make things easier for the system programmer, the memory controller implements a simple interface that we can use to transition memory devices into various low-power states, and the memory controller takes care of all the power-up and power-down operations and all the timing constraints. But because the transitional delays between power states can be non-negligible, we still need to play the usual game of energy-delay tradeoff.

Example: DDR
Registered 512MB DDR module with 8 devices per rank. Power states (from the transition diagram): Read/Write (779.1 mW), Standby (275.0 mW), Power-down (150 mW), Self-refresh (20.87 mW). Standby transitions to Read/Write automatically on I/O; exiting Power-down takes about 5 ns, while exiting Self-refresh takes about 1000 ns.
To make things more concrete, let's look at a real example using DDR devices. DDR devices have four major power states. The transition diagram shows the power dissipation of each state and the transitional delays between them; as we can see, the lower the power of a state, the higher its exit delay. Normally, the device is in Standby mode; it transitions to the Read/Write state automatically when I/O commands are issued and transitions back to Standby right after I/O completes. From the Standby state, we can also manually transition to two low-power states, where Self-refresh dissipates much less energy than Power-down, but with a much higher resynchronization delay. These are the power and performance characteristics; now let's look at some power management techniques that leverage them.
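To make the energy-delay tradeoff concrete, here is a minimal sketch (not part of the original talk) that plugs the slide's power numbers into a simple break-even calculation: how long a rank must stay idle before a low-power state saves energy despite its resynchronization delay. The charging model and helper names are my own simplifications.

```python
# Sketch: energy-delay tradeoff for the DDR module from the slide.
# Power numbers (mW) and exit latencies (ns) come from the slide; the
# break-even analysis itself is an illustrative simplification.

POWER_MW = {
    "read_write": 779.1,
    "standby": 275.0,
    "power_down": 150.0,
    "self_refresh": 20.87,
}
EXIT_LATENCY_NS = {
    "standby": 0.0,
    "power_down": 5.0,
    "self_refresh": 1000.0,
}

def idle_energy_nj(state: str, idle_ns: float) -> float:
    """Energy (nJ) spent over an idle gap if we park the rank in `state`.

    Charge the low-power rate for the idle gap plus Standby-level power for
    the resynchronization delay when the next access arrives.
    (mW * ns = picojoules, so divide by 1000 to get nanojoules.)
    """
    idle = POWER_MW[state] * idle_ns
    wakeup = POWER_MW["standby"] * EXIT_LATENCY_NS[state]
    return (idle + wakeup) / 1000.0

def break_even_ns(low_state: str) -> float:
    """Idle duration beyond which `low_state` beats staying in Standby."""
    saved_per_ns = POWER_MW["standby"] - POWER_MW[low_state]
    wakeup_cost = POWER_MW["standby"] * EXIT_LATENCY_NS[low_state]
    return wakeup_cost / saved_per_ns

if __name__ == "__main__":
    for state in ("power_down", "self_refresh"):
        print(f"{state}: break-even idle ~ {break_even_ns(state):.0f} ns")
    for state in ("standby", "power_down", "self_refresh"):
        print(f"{state}: {idle_energy_nj(state, 10_000):.1f} nJ over a 10 us idle gap")
```

Under these assumptions, Power-down pays off after roughly ten nanoseconds of idleness, while Self-refresh needs on the order of a microsecond, which is why predicting how long a rank will stay idle matters so much in the techniques that follow.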

Outline
Motivation, Background, Previous Work (Software Techniques, Hardware Techniques), A Cooperative Approach, Results, Conclusion
We first look at software techniques, where power management decisions are made by the operating system software, and then we look at hardware techniques, where these decisions are made at a much lower level, usually at the memory controller. We then analyze the advantages and disadvantages of the two approaches, propose a software-hardware cooperative technique, and show why it is superior.

Software Technique
Process i uses ranks 0 and 2; process j uses rank 3. The OS can track each process' virtual-to-physical memory mappings. (Timeline diagram: while process i is context-switched in, ranks 0 and 2 stay in Standby and ranks 1 and 3 go to Self-refresh; while process j is context-switched in, only rank 3 stays in Standby.)
In the software approach, the operating system is in total control of power. Because the operating system knows everything about a process, including its virtual-to-physical memory mapping, the OS knows exactly which memory regions are and are not used by each process. At each context switch, it turns off the memory regions unused by the scheduled process, which saves energy without affecting performance. Let's look at an example. Process i has mapped pages in ranks 0 and 2, so from the time process i is context-switched in until the time it is context-switched out, the OS can safely turn off ranks 1 and 3 to reduce power without suffering any performance penalty. Then, if the next process only uses rank 3, the OS can turn off ranks 0, 1, and 2, and so on. The advantage of software techniques is that they don't require complicated hardware modifications and have simple control. However, due to their coarse-grained control, many energy-saving opportunities are lost. For example, if process i uses the pages mapped to rank 2 very rarely, it is not energy efficient to keep that rank in Standby at all times while the process is executing. Now, let's look at hardware techniques to see how they manage memory power.
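Here is a minimal sketch of the software technique described above, assuming a hypothetical scheduler hook `on_context_switch` and a rank-control helper `set_rank_state`; neither name comes from the talk. It keeps each process's set of mapped ranks and parks every other rank in self-refresh when that process is scheduled.

```python
# Sketch of OS-driven, per-context-switch rank power management.
# `set_rank_state` stands in for whatever memory-controller interface the
# platform exposes; it is a placeholder, not a real API.

NUM_RANKS = 4

def set_rank_state(rank: int, state: str) -> None:
    print(f"rank {rank} -> {state}")   # placeholder for an MMIO write

class Process:
    def __init__(self, pid: int, mapped_ranks: set[int]):
        self.pid = pid
        # Derived from the process's virtual-to-physical page mappings,
        # which the OS already maintains.
        self.mapped_ranks = mapped_ranks

def on_context_switch(next_proc: Process) -> None:
    """Called by the scheduler just before `next_proc` starts running."""
    for rank in range(NUM_RANKS):
        if rank in next_proc.mapped_ranks:
            set_rank_state(rank, "standby")       # may be accessed soon
        else:
            set_rank_state(rank, "self_refresh")  # safe: no pages mapped here

if __name__ == "__main__":
    proc_i = Process(pid=1, mapped_ranks={0, 2})
    proc_j = Process(pid=2, mapped_ranks={3})
    on_context_switch(proc_i)   # ranks 1 and 3 go to self-refresh
    on_context_switch(proc_j)   # ranks 0, 1, 2 go to self-refresh
```

As the talk notes, this is safe but coarse-grained: rank 2 stays in Standby for the whole of process i's time slice even if it is touched only rarely.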

Hardware Technique
Allows for much finer-grained control of power: the controller monitors each memory access and predicts when to transition to lower power modes. (Diagram: after each access an idle timer runs; if idle time exceeds a threshold, the rank drops from Standby to Self-refresh; an access before the threshold restarts the timer.)
Hardware techniques allow much finer-grained control of power because they continuously monitor every memory access and, based on past observations, predict when to transition to lower power states. Again, we use an example to illustrate this fine-grained control. Each arrow in the diagram indicates a memory access. After each memory access completes, the memory controller starts a timer to keep track of idle time, and if this idle time exceeds a dynamically determined threshold, the memory rank is transitioned to a lower power state. If another memory access starts before the idle time exceeds the threshold, the timer is restarted. As we can see, such a fine-grained control mechanism can exploit a lot of idle time for energy savings. But it also has a major problem.
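Here is a minimal sketch of the threshold-based behavior just described, driven by a list of access timestamps. The adaptive-threshold update is a simple moving-average stand-in of my own; the talk does not specify the actual prediction algorithm.

```python
# Sketch: idle-time threshold policy at the memory controller.
# Given access timestamps (ns) for one rank, report when the rank would have
# been dropped into self-refresh. The threshold adaptation below is an
# illustrative guess (track a fraction of recent idle gaps), not the
# predictor used in the paper.

from collections import deque

class ThresholdPredictor:
    def __init__(self, init_threshold_ns: float = 2000.0, history: int = 16):
        self.threshold_ns = init_threshold_ns
        self.gaps = deque(maxlen=history)

    def observe_gap(self, gap_ns: float) -> None:
        self.gaps.append(gap_ns)
        # Heuristic: set the threshold to half the recent average gap,
        # so long-idle phases power down sooner.
        self.threshold_ns = 0.5 * sum(self.gaps) / len(self.gaps)

def simulate(access_times_ns, predictor):
    """Yield (t_powerdown, t_wakeup) intervals spent in self-refresh."""
    for prev, cur in zip(access_times_ns, access_times_ns[1:]):
        gap = cur - prev
        if gap > predictor.threshold_ns:
            # Rank idles past the threshold: drop to self-refresh until
            # the next access arrives.
            yield (prev + predictor.threshold_ns, cur)
        predictor.observe_gap(gap)

if __name__ == "__main__":
    trace = [0, 100, 5200, 5300, 5400, 30000, 30100]  # ns
    for start, end in simulate(trace, ThresholdPredictor()):
        print(f"self-refresh from {start:.0f} ns to {end:.0f} ns")
```

The catch, raised on the next slide, is that a single shared predictor like this one gets retrained every time a process with different behavior is scheduled.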

Hardware Technique: Problems
Hardware techniques can be easily confused by constant context switching: different processes have different memory access behavior, and it takes time for the memory controller to adapt, readapt, readapt... (Diagram: interleaved memory accesses from process i and process j over time. Imagine hundreds of parallel processes instead of 2, with a context-switching interval of about 1 msec.)
The problem is that the hardware technique monitors memory accesses and controls power at such a low hardware level that it doesn't understand what's going on at the software layer. Ironically, it is the OS and the user-level processes that drive all the memory accesses. By not knowing this, the hardware is likely to make wrong power management decisions that negatively affect not only power but also performance. The thing that causes the most problems for the hardware is the constant context switching in the software layer: different processes may have very different memory access behaviors, which means the hardware needs to adapt and readapt over and over again every time there is a context switch, making it very inefficient.
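To see the re-adaptation cost in numbers, here is a toy illustration (mine, not from the talk): two processes with very different idle-gap lengths share one adaptive threshold, so every time slice begins with a threshold still tuned for the other process.

```python
# Sketch: one shared adaptive threshold vs. the per-process view.
# Process A has ~100 ns idle gaps, process B ~50,000 ns gaps. With a single
# shared predictor, every context switch drags the threshold back toward the
# other process's regime, so the first accesses of each time slice are always
# managed with a stale threshold. Gap values and the update rule are
# illustrative only.

def adapt(threshold_ns: float, gap_ns: float, alpha: float = 0.25) -> float:
    # Exponentially weighted update toward half the observed gap.
    return (1 - alpha) * threshold_ns + alpha * (0.5 * gap_ns)

def run() -> None:
    shared = 2000.0
    gaps = {"A": 100.0, "B": 50_000.0}
    for time_slice in range(4):
        for proc in ("A", "B"):
            print(f"slice {time_slice}, process {proc}: "
                  f"shared threshold starts at {shared:7.0f} ns "
                  f"(steady per-process value would be ~{0.5 * gaps[proc]:.0f} ns)")
            for _ in range(8):          # 8 accesses per time slice
                shared = adapt(shared, gaps[proc])

if __name__ == "__main__":
    run()
```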

Outline
Motivation, Background, Previous Work, A Cooperative Approach, Results, Conclusion
Now, let's look at how we can improve upon these previous techniques with a software-hardware cooperative technique, where the software is used to assist the hardware in managing power better.

Cooperative Approach
Improve the hardware technique so we don't have to readapt, readapt, readapt... This needs system software cooperation to make the hardware understand the notion of processes. At each context switch, the OS sends a signal to the memory controller. Upon receiving this signal, the memory controller saves and restores its internal registers, which are used for keeping past memory access patterns. Essentially, we can now manage power for the current process based solely on this process' past memory accesses.
What we did was improve upon the hardware technique so it doesn't have to needlessly readapt over and over again every time we context switch. To do this, we need to make the hardware understand the notion of processes, which requires some collaboration from the system software. This is how it works: at each context switch, the operating system sends a context-switch signal to the memory controller. Upon receiving this signal, the memory controller saves its internal registers, which are used for keeping past memory access patterns, and associates this set of registers with the current process; it then restores the register contents that were previously saved in the same manner for the now-scheduled process. This is just like the way CPU registers are saved and restored at each context switch. Essentially, the memory controller can now manage power for the currently running process based solely on that process's past memory accesses.

Context-Aware Memory Controller
(Diagram: the memory controller's registers, including the threshold predictor, form an "MC context" alongside the CPU registers. At a context switch, the OS signals the memory controller, saves the current process' CPU context and MC context, and restores the scheduled process' CPU context and MC context.)
This is a graphical representation of what I just said.
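A minimal sketch of the save/restore mechanism, under the assumption (mine) that the MC context is just per-rank threshold-predictor state; the register layout, the saved-context table, and the hook names are hypothetical, not the hardware interface described in the talk.

```python
# Sketch: context-aware memory controller state, saved and restored at each
# context switch alongside the CPU registers. The "MC context" here is just
# per-rank threshold-predictor state; the real register contents are not
# specified in the talk.

from dataclasses import dataclass, field

NUM_RANKS = 4

@dataclass
class MCContext:
    # One adaptive threshold (ns) and last-access time per rank.
    thresholds_ns: list = field(default_factory=lambda: [2000.0] * NUM_RANKS)
    last_access_ns: list = field(default_factory=lambda: [0.0] * NUM_RANKS)

class MemoryController:
    def __init__(self):
        self.active = MCContext()   # registers currently in use
        self.saved = {}             # pid -> saved MCContext

    def on_context_switch(self, outgoing_pid: int, incoming_pid: int) -> None:
        """Invoked when the OS signals a context switch."""
        self.saved[outgoing_pid] = self.active
        # Restore the incoming process's history, or start fresh for a new one.
        self.active = self.saved.pop(incoming_pid, MCContext())

    def on_access(self, rank: int, now_ns: float) -> None:
        gap = now_ns - self.active.last_access_ns[rank]
        t = self.active.thresholds_ns[rank]
        self.active.thresholds_ns[rank] = 0.75 * t + 0.25 * (0.5 * gap)
        self.active.last_access_ns[rank] = now_ns

if __name__ == "__main__":
    mc = MemoryController()
    mc.on_access(rank=0, now_ns=100.0)
    mc.on_context_switch(outgoing_pid=1, incoming_pid=2)  # save pid 1, fresh pid 2
    mc.on_access(rank=3, now_ns=50_200.0)
    mc.on_context_switch(outgoing_pid=2, incoming_pid=1)  # pid 1's history returns
```

In this way each process's predictor picks up where it left off, just as the CPU register state does.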

Cooperative Technique: Per-Process
(Diagram: the same interleaved memory accesses from process i and process j as before, now managed with per-process state.)
We have seen this example before, and we have also seen why the pure hardware technique is inefficient. With the cooperative technique, the memory controller can quickly adapt its power management strategy depending on which process is currently running.

Outline
Motivation, Background, Previous Work, A Cooperative Approach, Results, Conclusion
Now, let's look at some experiments and compare the results.

Experimental Setup
Mambo: a full-machine simulator used to run various workloads and collect memory traces. Memsim: a trace-driven simulator that produces performance and power results for the main memory. Workloads: SPECjbb + bzip2 + crafty (low memory-intensive) and SPECjbb + art + mcf (high memory-intensive).
As the hardware is not available in any of today's systems, the next best thing we can do is implement our system in a machine simulator. We chose Mambo, which we used to run various workloads and collect memory traces. These traces are then fed into a main memory simulator called Memsim, which produces performance and power results. We used two different workloads for this work, which we call low memory-intensive and high memory-intensive. In the low memory-intensive workload, we used SPECjbb with two of the low memory-intensive benchmarks from SPECcpu (bzip2 and crafty). In the high memory-intensive workload, we used SPECjbb with two of the high memory-intensive benchmarks from SPECcpu (art and mcf).
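To make the trace-driven methodology concrete, here is a minimal sketch of the kind of loop such a memory power/performance simulator runs over a collected trace; the trace format, fixed threshold, and power numbers are illustrative assumptions, not Memsim's actual interface or model.

```python
# Sketch: skeleton of a trace-driven memory power/performance evaluation.
# Each trace record is (timestamp_ns, rank); the policy decides how each rank
# idles between accesses and charges energy plus resynchronization delay.

POWER_MW = {"standby": 275.0, "self_refresh": 20.87}
RESYNC_NS = {"self_refresh": 1000.0}
THRESHOLD_NS = 2000.0   # fixed threshold for simplicity

def simulate(trace):
    energy_nj = 0.0
    delay_ns = 0.0
    last_access = {}
    for t, rank in trace:
        gap = t - last_access.get(rank, t)
        if gap > THRESHOLD_NS:
            # Idle in standby up to the threshold, then self-refresh; pay the
            # resynchronization delay on the next access.
            energy_nj += (POWER_MW["standby"] * THRESHOLD_NS
                          + POWER_MW["self_refresh"] * (gap - THRESHOLD_NS)) / 1000.0
            delay_ns += RESYNC_NS["self_refresh"]
        else:
            energy_nj += POWER_MW["standby"] * gap / 1000.0
        last_access[rank] = t
    return energy_nj, delay_ns

if __name__ == "__main__":
    trace = [(0, 0), (500, 0), (40_000, 0), (40_100, 1), (90_000, 1)]
    energy, delay = simulate(trace)
    print(f"energy: {energy:.1f} nJ, added delay: {delay:.0f} ns")
```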

Results
(Charts: power and performance results for the low memory-intensive workload and the high memory-intensive workload.)
Here we show the results for our two workloads.

Conclusion
The cooperative technique uses 72–75% less power than when no power management is applied, with an 11–14% slowdown in average response time. It uses 14–17% less power than the hardware technique and 16–26% less power than the software technique, with performance comparable to both.
Future work: communicate hints directly from user processes to the hardware.