Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement
Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, Al Davis

Presentation transcript:

Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement
Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, Al Davis
School of Computing, University of Utah
ASPLOS 2010

DRAM Memory Constraints
- Modern machines spend nearly 25%-40% of total system power on memory.
- Some commercial servers already have larger power budgets for memory than for the CPU.
- Main memory access is one of the largest performance bottlenecks.
- We address both the performance and the power costs of DRAM memory accesses.

DRAM Access Mechanism
[Figure: CPU and Memory Controller connected over a memory bus (channel) to a DIMM; each rank consists of DRAM chips (devices), each device containing banks of arrays. One device drives 1/8th of the row buffer and outputs one word of data.]
- The CPU makes a memory request, and the Memory Controller converts it into the appropriate DRAM commands (a sketch of this decoding step follows below).
- An access within a device begins by selecting a bank, then a row; the row is read into the row buffer.
- A few column bits then select data from the row buffer, and those bits are output from the device.
- Many bits are read from the DRAM cells to service a single CPU request!
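As a rough illustration of that command-generation step, the C sketch below decodes a physical address into DRAM coordinates. The address layout and field widths are invented for illustration (the slide does not specify them; real controllers vary), but they are chosen so that 128 cache lines of 64 B per row match the 8 KB row buffer discussed on the next slide.

```c
#include <stdint.h>

/* Invented layout: row | bank | rank | channel | column | line offset.
 * Field widths are illustrative only. */
typedef struct {
    uint32_t channel, rank, bank, row, column;
} dram_coord_t;

static dram_coord_t decode_paddr(uint64_t paddr)
{
    dram_coord_t c;
    paddr >>= 6;                               /* drop 64 B cache-line offset   */
    c.column  = paddr & 0x7F;   paddr >>= 7;   /* 128 cache lines per row       */
    c.channel = paddr & 0x1;    paddr >>= 1;   /* 2 channels                    */
    c.rank    = paddr & 0x1;    paddr >>= 1;   /* 2 ranks per channel           */
    c.bank    = paddr & 0x7;    paddr >>= 3;   /* 8 banks per rank              */
    c.row     = (uint32_t)paddr;               /* remaining bits select the row */
    return c;
}
```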

DRAM Access Inefficiencies - I
- Overfetch due to large row buffers: an 8 KB row is read into the row buffer to service a 64-byte cache line.
- Row-buffer utilization for a single request is under 1% (64 B / 8 KB is about 0.8%).
- Why are row buffers so large? Large arrays minimize cost-per-bit, and striping a cache line across multiple chips (arrays) improves data transfer bandwidth.

DRAM Access Inefficiencies - II
- Open-page policy: row buffers are kept open in the hope that subsequent requests will be row-buffer hits.
- FR-FCFS (First-Ready, First-Come-First-Served) request scheduling: the memory controller schedules requests to open row buffers first (sketched below).
- Locality diminishes in multi-cores.

                     Access Latency   Access Energy
  Row-buffer hit     ~75 cycles       ~18 nJ
  Row-buffer miss    ~225 cycles      ~38 nJ
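The following is a minimal C sketch of FR-FCFS selection, assuming a flat request queue and ignoring timing constraints and fairness; `open_row`, `request_t`, and the field names are illustrative, not from the paper.

```c
#include <stddef.h>
#include <stdbool.h>

#define NUM_BANKS 8

typedef struct {
    unsigned bank, row;       /* target DRAM coordinates */
    unsigned long arrival;    /* arrival timestamp       */
} request_t;

static unsigned open_row[NUM_BANKS];   /* row currently latched in each bank's row buffer */

/* FR-FCFS: prefer the oldest row-buffer hit; otherwise the oldest request. */
static request_t *fr_fcfs_pick(request_t *queue, size_t n)
{
    request_t *hit = NULL, *oldest = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!oldest || queue[i].arrival < oldest->arrival)
            oldest = &queue[i];
        bool row_hit = (queue[i].row == open_row[queue[i].bank]);
        if (row_hit && (!hit || queue[i].arrival < hit->arrival))
            hit = &queue[i];
    }
    return hit ? hit : oldest;   /* NULL only when the queue is empty */
}
```

This preference for open rows is exactly what loses its payoff as multi-core interleaving drives row-buffer hit rates down, as the next slide shows.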

DRAM Row-buffer Hit-rates
[Figure: row-buffer hit rates across workloads and core counts.] With increasing core counts, DRAM row-buffer hit rates fall.

Key Observation
Cache-block access pattern within OS pages: for heavily accessed pages in a given time interval, accesses usually touch only a few cache blocks (a toy computation of this skew follows below).
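To make the observation concrete, here is a toy C computation (illustrative only; the function name and 64-byte block granularity are assumptions, and the paper's measurement comes from simulation traces) of the share of a page's accesses captured by its k hottest cache blocks:

```c
#include <stdint.h>

#define BLOCKS_PER_PAGE 64   /* 4 KB page / 64 B cache blocks */

/* Share of a page's accesses captured by its k hottest blocks.
 * Repeated-max selection is fine for only 64 entries. */
double top_k_share(const uint32_t counts[BLOCKS_PER_PAGE], int k)
{
    uint32_t c[BLOCKS_PER_PAGE];
    uint64_t total = 0, top = 0;
    for (int i = 0; i < BLOCKS_PER_PAGE; i++) { c[i] = counts[i]; total += c[i]; }
    for (int j = 0; j < k && j < BLOCKS_PER_PAGE; j++) {
        int m = 0;
        for (int i = 1; i < BLOCKS_PER_PAGE; i++)
            if (c[i] > c[m]) m = i;        /* find the current hottest block */
        top += c[m];
        c[m] = 0;                          /* remove it from contention      */
    }
    return total ? (double)top / (double)total : 0.0;
}
```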

Outline
- DRAM basics
- Motivation
- Basic idea
- Software-only implementation (ROPS)
- Hardware implementation (HAM)
- Results

Basic Idea
Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row.
[Figure: 4 KB OS pages are split into 1 KB micro-pages; the hottest micro-pages are co-located in a reserved DRAM region, while the coldest micro-pages stay in ordinary DRAM memory.]

Basic Idea
- Identifying "hot" micro-pages: memory-controller counters plus an OS daemon (sketched below).
- Rows in DRAM are reserved for hot micro-pages, which simplifies book-keeping overheads; the capacity loss is 4 MB out of a 4 GB system (< 0.1%).
- Epoch-based schemes, with the epoch length exposed to the OS for flexibility.
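A minimal sketch of that counting mechanism in C, assuming direct-mapped per-micro-page counters in the memory controller and a fixed hotness threshold; the counter budget and threshold values here are invented, not the paper's.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MICRO_PAGE_SHIFT 10     /* 1 KB micro-pages                                  */
#define NUM_COUNTERS     4096   /* invented counter budget                           */
#define HOT_THRESHOLD    64     /* invented threshold; tuning is up to the OS daemon */

static uint32_t access_count[NUM_COUNTERS];

/* Memory-controller side: bump a counter on every DRAM access. */
void count_access(uint64_t paddr)
{
    access_count[(paddr >> MICRO_PAGE_SHIFT) % NUM_COUNTERS]++;
}

/* OS-daemon side, at each epoch boundary: collect hot micro-page
 * indices, then reset all counters for the next epoch. */
size_t end_epoch(uint32_t *hot, size_t max_hot)
{
    size_t n = 0;
    for (size_t i = 0; i < NUM_COUNTERS && n < max_hot; i++)
        if (access_count[i] >= HOT_THRESHOLD)
            hot[n++] = (uint32_t)i;
    memset(access_count, 0, sizeof access_count);
    return n;
}
```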

Software-Only Implementation (ROPS)
[Figure: baseline vs. ROPS address flow. Baseline: a CPU memory request for virtual address X is translated by the TLB to physical address Y in the 4 GB main memory. ROPS: the OS page size is shrunk to 1 KB; hot micro-pages translate to addresses (Z) inside a 4 MB reserved DRAM region, while cold micro-pages remain in ordinary memory.]
Every epoch:
1. Migrate hot micro-pages (TLB shoot-down and page-table update).
2. Promote cold micro-pages to a superpage (page table/TLB updated).

Software-Only Implementation (ROPS)
- Reduced OS Page Size (ROPS): throughout the system, the page size is reduced to 1 KB.
- Hot micro-pages are migrated via DRAM copy, so they live in the same row buffer in the reserved DRAM region.
- The reduction in TLB reach is mitigated by promoting cold micro-pages to 4 KB superpages.
- Superpage creation is facilitated by "reservation-based" page allocation: four 1 KB micro-pages are allocated to contiguous DRAM frames, so contiguous virtual addresses land in contiguous physical addresses, making superpage creation easy. A sketch of the epoch-end work appears below.
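A hedged C sketch of the ROPS epoch-end sequence; the helpers `dram_copy`, `remap_page`, `tlb_shootdown`, and `promote_superpage` are hypothetical stand-ins for kernel machinery, not real APIs.

```c
#include <stddef.h>

/* Hypothetical OS-side helpers: stand-ins for illustration only. */
typedef unsigned long pfn_t;
extern void dram_copy(pfn_t dst, pfn_t src);          /* DRAM-to-DRAM copy of one 1 KB frame */
extern void remap_page(void *vaddr, pfn_t new_pfn);   /* page-table update                   */
extern void tlb_shootdown(void *vaddr);               /* invalidate stale TLB entries        */
extern void promote_superpage(void *vaddr_4k);        /* fuse four contiguous 1 KB pages     */

void rops_epoch_end(void **hot_vaddr, pfn_t *old_pfn, pfn_t *reserved_pfn,
                    size_t n_hot, void **cold_4k, size_t n_cold)
{
    /* 1. Pull hot micro-pages into the reserved region. */
    for (size_t i = 0; i < n_hot; i++) {
        dram_copy(reserved_pfn[i], old_pfn[i]);
        remap_page(hot_vaddr[i], reserved_pfn[i]);
        tlb_shootdown(hot_vaddr[i]);
    }
    /* 2. Recover TLB reach: promote runs of four cold 1 KB pages
     *    (contiguous thanks to reservation-based allocation). */
    for (size_t i = 0; i < n_cold; i++)
        promote_superpage(cold_4k[i]);
}
```

The shoot-downs in step 1 are exactly the overhead that the hardware scheme on the next slides avoids.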

Hardware Implementation (HAM)
[Figure: baseline vs. Hardware-Assisted Migration. Baseline: a CPU memory request for physical address X in page A goes straight to the 4 GB main memory. HAM: a Mapping Table holds (old address X, new address Y) entries, redirecting migrated micro-pages into the 4 MB reserved DRAM region.]

Hardware Implementation (HAM)
- Hardware-Assisted Migration (HAM) adds a new level of address indirection: data can be placed anywhere in DRAM.
- A Mapping Table (MT) preserves the old physical addresses of migrated micro-pages.
- Hot micro-pages are DRAM-copied to the reserved rows.
- The MT is populated/updated every epoch. A sketch of the MT lookup appears below.
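A minimal C sketch of the MT lookup on the memory-access path. The table organization, `RESERVED_BASE`, and the linear scan are assumptions for clarity: real hardware would use a CAM or hash lookup, and the paper's MT layout may differ.

```c
#include <stdint.h>
#include <stdbool.h>

#define MT_ENTRIES    4096            /* 4 MB reserved region / 1 KB micro-pages     */
#define RESERVED_BASE 0xFFC00000ULL   /* assumed base address of the reserved region */

/* One MT entry: the old physical micro-page whose data now lives in reserved slot i. */
typedef struct {
    uint64_t old_mpage;   /* original physical address >> 10 */
    bool     valid;
} mt_entry_t;

static mt_entry_t mt[MT_ENTRIES];

/* Redirect an incoming physical address: if its micro-page has been
 * migrated, rewrite the address to point into the reserved region. */
uint64_t ham_translate(uint64_t paddr)
{
    uint64_t mpage  = paddr >> 10;    /* 1 KB micro-page number       */
    uint64_t offset = paddr & 0x3FF;  /* offset within the micro-page */
    for (uint32_t slot = 0; slot < MT_ENTRIES; slot++)
        if (mt[slot].valid && mt[slot].old_mpage == mpage)
            return RESERVED_BASE + ((uint64_t)slot << 10) + offset;
    return paddr;                     /* not migrated: original location */
}
```

Because the redirection happens below the page tables, no TLB shoot-downs or page-table updates are needed when micro-pages move.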

Results: Schemes Evaluated
- Baseline.
- Oracle/Profiled: a best-effort estimate of the benefit expected in the next epoch, based on a prior profiling run.
- Epoch-based ROPS and HAM: epoch lengths of 5M, 10M, 50M, and 100M cycles were evaluated; trends are similar, with the best performance at 5M and 10M.
Methodology: Simics simulation platform; DRAMSim-based DRAM timing; DRAM timing and energy figures from Micron datasheets. [Table of simulation parameters omitted.]

Results: Accesses to Micro-Pages in Reserved Rows in an Epoch
[Figure: per benchmark, the percentage of total accesses going to micro-pages in the reserved rows, and the total number of 4 KB pages touched in an epoch.]

Results: 5M-cycle epoch, ROPS, HAM, and Oracle
[Figure: percent change in performance per benchmark.]
- Applications with room for improvement show an average performance improvement of 9%.
- Hardware-assisted migration offers better returns due to lower TLB-management overheads.
- Beyond the 9% performance gain, our schemes also save energy at the same time!

Results: ROPS, HAM, and Oracle
[Figure: percent reduction in DRAM energy, i.e., energy consumption of the DRAM sub-system.]

Conclusions
On average, for applications with room for improvement, with our best-performing scheme:
- Average performance up 9% (max. 18%).
- Average memory energy consumption down 18% (max. 62%).
- Average row-buffer utilization up 38%.
Hardware-assisted migration offers better returns because it avoids the overheads of TLB shoot-downs and misses.
Future work: co-locate hot micro-pages that are accessed around the same time.

That's all for today… Questions?