Improving Energy Efficiency by Making DRAM Less Randomly Accessed Hai Huang, Kang G. Shin, Charles Lefurgy, Tom Keller University of Michigan IBM Austin.

Presentation transcript:

Improving Energy Efficiency by Making DRAM Less Randomly Accessed
Hai Huang, Kang G. Shin, Charles Lefurgy, Tom Keller
University of Michigan, IBM Austin Research Lab

Overview
- The power budget allocated to main memory (i.e., DRAM) keeps increasing
  - E.g., in a mid-range IBM eServer system, 40% of the total system energy is consumed by its main memory subsystem
- Existing power management techniques only passively monitor memory traffic while managing power, so they do not fully exploit the deeper power-saving states
=> Actively shape memory traffic to enable existing techniques to save more energy

Passively Monitoring Memory Traffic
Why is passively monitoring memory traffic inefficient?
- Memory accesses are random: good for performance, bad for energy consumption!
- Idle time between consecutive memory accesses is often too short to use the deeper power-saving states
- Randomness is mostly due to the OS's arbitrary virtual-to-physical mapping

Example: Active vs. Passive
[Figure: timelines for Rank 0 and Rank 1 under active memory traffic management and under passive memory traffic management. Legend: high-power, low-power, ultra-low-power. Horizontal axis: time.]
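The contrast in this example can be sketched with a toy model. The sketch below assumes three power states with made-up entry thresholds and power levels, and it ignores transition costs; none of these numbers come from the talk. It only illustrates why concentrating accesses on one rank lets the other rank reach a deeper state.

```python
# Hypothetical power states, ordered from shallowest to deepest:
# (minimum idle gap needed to use the state, power drawn while in it).
# The numbers are illustrative assumptions, not values from the talk.
STATES = [
    (0, 1.0),     # high-power: usable for any gap
    (10, 0.3),    # low-power: usable only for gaps of at least 10 time units
    (100, 0.05),  # ultra-low-power: usable only for gaps of at least 100
]


def idle_energy(access_times, horizon):
    """Idle energy of one rank, using the deepest state each gap allows."""
    points = [0] + sorted(access_times) + [horizon]
    total = 0.0
    for start, end in zip(points, points[1:]):
        gap = end - start
        # STATES is ordered shallow to deep, so the last match is the deepest.
        power = [p for threshold, p in STATES if gap >= threshold][-1]
        total += power * gap
    return total


def total_idle_energy(ranks, horizon):
    """Sum idle energy over all ranks; `ranks` maps rank id -> access times."""
    return sum(idle_energy(times, horizon) for times in ranks.values())


HORIZON = 200
# Passive: the same nine accesses are spread across both ranks.
passive = {0: [20, 60, 100, 140, 180], 1: [40, 80, 120, 160]}
# Active: all accesses are steered to rank 0, so rank 1 stays idle throughout.
active = {0: [20, 40, 60, 80, 100, 120, 140, 160, 180], 1: []}

passive_energy = total_idle_energy(passive, HORIZON)
active_energy = total_idle_energy(active, HORIZON)
print(passive_energy, active_energy)
```

In this toy setup the fully idle rank qualifies for the ultra-low-power state for the whole interval, which is what drives the difference between the two totals.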

How to Shape Memory Traffic
Essentially, we need to artificially create disparity in access frequency among different memory ranks

Hot Ranks and Cold Ranks
Disparity in access frequency can be created by finding and migrating frequently-accessed pages to a subset of memory ranks
- Hot ranks: contain frequently-accessed pages
- Cold ranks: contain infrequently-accessed and unmapped pages
Page migration can be done by system software
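One way to picture the hot/cold split is the sketch below. It assumes per-page access counts are already available and simply fills ranks in descending order of access count, so the lowest-numbered ranks end up hot. The function name `assign_ranks` and its parameters are hypothetical and not taken from the talk.

```python
def assign_ranks(access_counts, num_ranks, pages_per_rank):
    """Map each page to a rank, placing the most-accessed pages on rank 0 first.

    access_counts: dict of page id -> number of accesses (assumed to be given).
    Returns a dict of page id -> rank index.
    """
    if len(access_counts) > num_ranks * pages_per_rank:
        raise ValueError("not enough memory for all pages")
    # Most frequently accessed pages come first.
    ordered = sorted(access_counts, key=access_counts.get, reverse=True)
    return {page: i // pages_per_rank for i, page in enumerate(ordered)}


counts = {"a": 100, "b": 2, "c": 50, "d": 1}
placement = assign_ranks(counts, num_ranks=2, pages_per_rank=2)
print(placement)
```

Here pages "a" and "c" land on rank 0 (hot) and pages "b" and "d" on rank 1 (cold). A real system would also have to weigh the cost of each move, which this sketch leaves out.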

Implementation
[Diagram labels: page counter; MC; Rank 0, Rank 1, Rank 2, Rank 3 (hot ranks, cold ranks); Operating System; migration thread; time triggers; Migrate (old_page, new_page); Process; first-level page table; second-level page table; Modify PT.]
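The Migrate (old_page, new_page) and Modify PT steps can be illustrated with a small simulation. This is a sketch under simplifying assumptions: a single flat page table stands in for the two-level table in the diagram, physical memory is a Python list, and the class and method names are hypothetical.

```python
class Memory:
    """Toy model of physical frames plus one flat page table (an assumption)."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames   # simulated frame contents
        self.free = set(range(num_frames))  # frames not holding any page
        self.page_table = {}                # virtual page -> physical frame

    def map_page(self, vpage, frame, data):
        """Back a virtual page with a free physical frame."""
        if frame not in self.free:
            raise ValueError("frame already in use")
        self.free.remove(frame)
        self.frames[frame] = data
        self.page_table[vpage] = frame

    def migrate(self, old_frame, new_frame):
        """Copy a page to a new frame and repoint every mapping to it."""
        if new_frame not in self.free:
            raise ValueError("destination frame already in use")
        if old_frame in self.free:
            raise ValueError("source frame holds no page")
        # Copy the contents, then release the old frame.
        self.frames[new_frame] = self.frames[old_frame]
        self.frames[old_frame] = None
        self.free.remove(new_frame)
        self.free.add(old_frame)
        # "Modify PT": update every entry that pointed at the old frame.
        for vpage, frame in list(self.page_table.items()):
            if frame == old_frame:
                self.page_table[vpage] = new_frame


mem = Memory(num_frames=4)
mem.map_page("v7", frame=3, data="payload")
mem.migrate(old_frame=3, new_frame=0)
print(mem.page_table, mem.frames)
```

The linear scan over the page table is only for brevity. The sketch also ignores concurrency and TLB invalidation, which a real operating system would need to handle.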

Issues with Page Migration
- There is a cost associated with each page migration
- Memory access frequency is often highly skewed!
  - 6% of pages cause 75% of accesses
  - 14% of pages cause 90% of accesses
- Not all pages need to be migrated
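Because accesses are skewed, only a small hot set is worth migrating. The sketch below picks the fewest pages that together cover a target percentage of all accesses. The function name `hot_set` and the sample counts are illustrative assumptions, not data from the talk.

```python
def hot_set(access_counts, target_percent):
    """Return the fewest pages whose accesses cover target_percent of the total.

    access_counts: dict of page id -> number of accesses (assumed to be given).
    target_percent: integer from 0 to 100.
    """
    total = sum(access_counts.values())
    covered = 0
    hot = []
    # Visit pages from most to least frequently accessed.
    for page in sorted(access_counts, key=access_counts.get, reverse=True):
        # Integer comparison avoids floating-point rounding at the boundary.
        if covered * 100 >= target_percent * total:
            break
        hot.append(page)
        covered += access_counts[page]
    return hot


# Made-up counts with a skewed distribution (1000 accesses in total).
counts = {"p0": 750, "p1": 100, "p2": 50, "p3": 40,
          "p4": 30, "p5": 20, "p6": 10}
print(hot_set(counts, 75))
print(hot_set(counts, 90))
```

With these sample counts, one page out of seven already covers 75% of accesses and three pages cover 90%, so the remaining pages could stay where they are.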

Evaluation
Simulators
- Mambo [IBM]: a full-machine simulator, cycle-accurate, supports PowerPC architecture
- Memsim [IBM]: detailed trace-driven main memory simulator, written in CSIM
Workloads
- Low memory-intensive workload: SPECjbb + bzip + crafty
- High memory-intensive workload: SPECjbb + art + mcf
SPECjbb: simulating 8 warehouses
SPEC2000 benchmarks: using Reference input set

Low Memory-Intensive Workload

High Memory-Intensive Workload

Summary of Results
Energy:
- Actively shaping memory traffic saves 35% more energy than passively monitoring
Performance:
- Low memory-intensive workload: small impact on performance
- High memory-intensive workload: significantly degrades performance due to more contention on hot ranks
Cost:
- Use hardware counters, or
- Software page faults

Conclusion
Actively shaping memory traffic allows existing power management techniques to save power more effectively
Highly-skewed page accesses are observed
Alternative main memory design:
- Use high-performance/highly-parallel ranks as hot ranks
- Use low-performance/low-power ranks as cold ranks
Allows frequently-accessed pages to be accessed faster
Allows memory ranks that hold infrequently-accessed and unmapped pages to consume less energy