1 Lecture 13: DRAM Innovations Today: energy efficiency, row buffer management, scheduling.

Presentation transcript:

1 Lecture 13: DRAM Innovations Today: energy efficiency, row buffer management, scheduling

2 Latency and Power Wall
Power wall: 25-40% of datacenter power can be attributed to the DRAM system
Latency and power can both be improved by employing smaller arrays; this incurs a penalty in density and cost
Latency and power can both be improved by increasing the row buffer hit rate; this requires intelligent mapping of data to rows, clever scheduling of requests, etc.
Power can be reduced by minimizing overfetch – either read fewer chips or read only parts of a row; this incurs penalties in area or bandwidth

3 Overfetch
Overfetch is caused by multiple factors:
• Each array is large (fewer peripherals → more density)
• Involving more chips per access → more pin bandwidth for data transfer
• More overfetch → more prefetch; helps apps with locality
• Involving more chips per access → less data loss when a chip fails → lower overhead for reliability
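
A back-of-the-envelope illustration of overfetch in Python (the chip count and row size are assumptions for illustration, not from the slides): a rank of eight x8 chips, each activating an 8 Kb row, fills 8 KB of row buffer to serve one 64 B cache line.

    # Illustrative overfetch arithmetic; all device parameters are assumptions.
    chips_per_rank = 8            # assumed x8 DDR3-style rank
    row_bits_per_chip = 8 * 1024  # assumed 8 Kb row per chip
    cache_line_bytes = 64

    row_buffer_bytes = chips_per_rank * row_bits_per_chip // 8
    overfetch_factor = row_buffer_bytes / cache_line_bytes
    print(f"row buffer: {row_buffer_bytes} B")    # 8192 B
    print(f"overfetch: {overfetch_factor:.0f}x")  # 128x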

4 Re-Designing Arrays
Udipi et al., ISCA'10

5 Selective Bitline Activation
Two papers in 2010: Udipi et al., ISCA'10, and Cooper-Balis and Jacob, IEEE Micro
Additional logic per array so that only the relevant bitlines are read out
Essentially results in finer-grain partitioning of the DRAM arrays

6 Rank Subsetting
Instead of using all chips in a rank to read out 64-bit words every cycle, form smaller parallel ranks
Increases data transfer time; reduces the size of the row buffer
But, lower energy per row read and compatible with modern DRAM chips
Increases the number of banks and hence promotes parallelism (reduces queuing delays)
Mini-Rank, MICRO'08; MC-DIMM, SC'09
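
A rough Python sketch of the rank-subsetting tradeoff (the chip count, burst length, and linear scaling are illustrative assumptions): splitting a rank into subsets lengthens transfers but shrinks row buffers, cuts activation energy, and multiplies the effective bank count.

    def rank_subset(chips_per_rank=8, subsets=2, burst_cycles=4):
        # Each subset uses fewer chips (and pins), so a cache line takes
        # proportionally more cycles to transfer; energy, row-buffer size,
        # and bank count scale in the opposite direction.
        return {
            "chips_per_access": chips_per_rank // subsets,
            "transfer_cycles": burst_cycles * subsets,
            "relative_activate_energy": 1.0 / subsets,
            "relative_row_buffer_size": 1.0 / subsets,
            "relative_bank_count": float(subsets),
        }

    print(rank_subset(subsets=1))  # baseline: whole rank per access
    print(rank_subset(subsets=2))  # mini-rank-style split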

7 Row Buffer Management
Open Page policy: maximizes row buffer hits, minimizes energy
Close Page policy: helps performance when there is limited locality
Hybrid policies: can close a row buffer after it has served its utility; lots of ways to predict utility: time, accesses, locality counters for a bank, etc.
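
A minimal Python sketch of one such hybrid policy (the hit cap and idle timeout are made-up knobs; real controllers use richer utility predictors): keep a row open until it has served a few hits or has sat idle too long, then precharge.

    class HybridRowPolicy:
        def __init__(self, max_hits=4, idle_timeout=100):
            self.max_hits = max_hits          # assumed hit budget per activation
            self.idle_timeout = idle_timeout  # assumed idle window, in cycles
            self.hits = 0
            self.last_access = 0

        def on_access(self, cycle, row_hit):
            self.hits = self.hits + 1 if row_hit else 0
            self.last_access = cycle

        def should_close(self, cycle):
            # Precharge once the open row has likely served its utility.
            return (self.hits >= self.max_hits or
                    cycle - self.last_access > self.idle_timeout)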

8 Micro-Pages
Sudan et al., ASPLOS'10
Organize data across banks to maximize locality in a row buffer
Key observation: most locality is restricted to a small portion of an OS page
Such hot micro-pages are identified with hardware counters and co-located on the same row
Requires hardware indirection to a page's new location
Works well only if most activity is confined to a few micro-pages
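
A Python sketch of the counter-based identification step (the page and micro-page sizes and the hotness threshold are illustrative assumptions): count touches per micro-page and flag the hottest ones as candidates for co-location in one DRAM row.

    from collections import Counter

    OS_PAGE = 4096     # assumed 4 KB OS page
    MICRO_PAGE = 1024  # assumed 1 KB micro-page

    touches = Counter()

    def record_access(phys_addr):
        page = phys_addr // OS_PAGE
        chunk = (phys_addr % OS_PAGE) // MICRO_PAGE
        touches[(page, chunk)] += 1

    def hot_micro_pages(threshold=32):
        # Hot micro-pages would be remapped onto one row via a hardware
        # indirection table (not modeled here).
        return [mp for mp, n in touches.items() if n >= threshold]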

9 Scheduling Policies
The memory controller must manage several timing constraints and issue a command only when all required resources are available
It must also maximize row buffer hit rates, fairness, and throughput
Reads are typically given priority over writes; the write buffer is drained when it is close to full; changing the direction of the bus incurs a 5-10 ns delay
Basic policies: FCFS, First-Ready FCFS (FR-FCFS, which prioritizes row buffer hits)
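
A minimal Python sketch of FR-FCFS arbitration (the request fields and the open_rows map are assumed for illustration): among requests whose timing constraints are met, prefer row buffer hits, then the oldest request.

    def fr_fcfs(queue, open_rows):
        # queue: requests in arrival order; each has .bank, .row, and
        # .ready (True once all timing constraints are satisfied).
        ready = [r for r in queue if r.ready]
        if not ready:
            return None
        hits = [r for r in ready if open_rows.get(r.bank) == r.row]
        return hits[0] if hits else ready[0]  # first-ready row hit, else oldest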

10 STFM
Mutlu and Moscibroda, MICRO'07
When multiple threads run together, threads with row buffer hits are favored by FR-FCFS
Each thread has a slowdown: S = T_shared / T_alone, where T is the number of cycles the ROB is stalled waiting for memory
Unfairness is estimated as S_max / S_min
If unfairness exceeds a threshold, thread priorities override the other priorities (Stall Time Fair Memory scheduling)
Estimating T_alone requires some book-keeping: does an access delay critical requests from other threads?
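
A Python sketch of the STFM decision rule (the threshold value and the per-thread stall estimates are assumptions; the paper's book-keeping for T_alone is not modeled): compute per-thread slowdowns and, if unfairness crosses the threshold, prioritize the most-slowed thread.

    def stfm_pick_priority(stall_shared, stall_alone_est, alpha=1.1):
        # stall_shared / stall_alone_est: stall cycles per thread id;
        # alpha is an illustrative unfairness threshold.
        slowdown = {t: stall_shared[t] / stall_alone_est[t]
                    for t in stall_shared}
        unfairness = max(slowdown.values()) / min(slowdown.values())
        if unfairness > alpha:
            return max(slowdown, key=slowdown.get)  # fairness overrides FR-FCFS
        return None  # baseline FR-FCFS priorities apply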

11 PAR-BS
Mutlu and Moscibroda, ISCA'08
A batch of requests (per bank) is formed: each thread can contribute at most R requests to this batch; batch requests have priority over non-batch requests
Within a batch, priority is first given to row buffer hits, then to threads with a higher "rank", then to older requests
Rank is computed based on the thread's memory intensity; low-intensity threads are given higher priority; this policy improves batch completion time
Because threads are ranked consistently across banks, requests from a thread are serviced in parallel; hence, parallelism-aware batch scheduling
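
A simplified Python sketch of batching and ranking (the marking cap R, the request fields, and the ranking heuristic are illustrative; the paper's actual ranking scheme is not reproduced): mark up to R requests per thread per bank, then pick by marked status, row hit, thread rank, and age.

    R = 5  # assumed marking cap per (thread, bank)

    def form_batch(queue):
        # Mark the R oldest requests from each thread to each bank.
        counts = {}
        for req in queue:  # arrival order
            k = (req.thread, req.bank)
            req.marked = counts.get(k, 0) < R
            counts[k] = counts.get(k, 0) + req.marked

    def pick(queue, open_rows, rank):
        # rank: thread -> rank; higher rank = lower memory intensity.
        def key(r):
            row_hit = open_rows.get(r.bank) == r.row
            return (r.marked, row_hit, rank[r.thread], -r.arrival)
        return max(queue, key=key)  # queue assumed non-empty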

12 TCM
Kim et al., MICRO 2010
Organize threads into latency-sensitive and bw-sensitive clusters based on memory intensity; the former gets higher priority
Within the bw-sensitive cluster, priority is based on rank
Rank is determined based on the "niceness" of a thread, and ranks are periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness)
Threads with low row buffer hit rates and high bank-level parallelism are considered "nice" to others
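
A Python sketch of the clustering and niceness ideas (the cluster split rule and the normalization are simplified assumptions; TCM actually sizes the latency-sensitive cluster by a bandwidth budget): group the lowest-intensity threads into the latency-sensitive cluster and rank the rest by niceness.

    def tcm_clusters(threads, mpki, latency_frac=0.2):
        # mpki: misses per kilo-instruction per thread (memory intensity).
        by_intensity = sorted(threads, key=lambda t: mpki[t])
        k = max(1, int(len(threads) * latency_frac))
        return by_intensity[:k], by_intensity[k:]  # (lat-sensitive, bw-sensitive)

    def niceness(blp, row_hit_rate):
        # Both metrics assumed normalized to [0, 1]: high bank-level
        # parallelism and low row-buffer locality disturb others the least.
        return blp - row_hit_rate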
