1 Lecture 4: Memory: HMC, Scheduling
Topics: BOOM, memory blades, HMC, scheduling policies

2 LR-DIMM
Figure from: Yoon et al., ISCA 2012

3 BOOM (Yoon et al., ISCA 2012)
Uses LPDDR chips at low frequency
Wide internal datapath
Longer burst length on the DBUS
No cache line buffering
Higher activate power
Same bandwidth, but lower power and higher latency

4 BOOM with Sub-Ranking

5 Disaggregated Memory (Lim et al., ISCA 2009)
To support high memory capacity, build a separate memory blade that is shared by all compute blades in a rack
For example, if average utilization is 2 GB, each compute blade is provisioned with only 2 GB of local memory, but it can also access (say) 2 TB of data in the memory blade
The hierarchy is exclusive and data is managed at page granularity
Remote memory access is via PCIe (120 ns latency and 1 GB/s in each direction)
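
A back-of-the-envelope sketch of what those link parameters imply for a remote page fetch; the 4 KB page size and the local-access comparison point are assumptions for illustration, not figures from the paper.

# Rough cost of pulling one page from the memory blade over PCIe,
# using the numbers above: 120 ns link latency, 1 GB/s per direction.
PAGE_BYTES = 4 * 1024              # assumed page granularity
LINK_LATENCY_NS = 120              # one-way latency from the slide
BYTES_PER_NS = 1.0                 # 1 GB/s is roughly 1 byte per ns

transfer_ns = PAGE_BYTES / BYTES_PER_NS
total_ns = LINK_LATENCY_NS + transfer_ns
print(f"remote 4 KB page fetch: about {total_ns:.0f} ns, "
      f"versus on the order of 100 ns for a local DRAM access")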

6 Micron HMC
3D-stacked device with memory + logic
High capacity, low power, high bandwidth
Can move functionalities to the memory package
Figure from: T. Pawlowski, HotChips 2011

7 HMC Details
32 banks per die x 8 dies = 256 banks per package
2 banks x 8 dies form 1 vertical slice (shared data bus)
High internal data bandwidth (TSVs): an entire cache line comes from a single array (2 banks) that is 256 bytes wide
Future generations: eight links that can connect to the processor or other HMCs; each link (40 GB/s) has 16 up and 16 down lanes (each lane has 2 differential wires)
1866 TSVs at 60 um pitch and 2 Gb/s (50 nm 1 Gb DRAMs)
3.7 pJ/bit for DRAM layers and 6.78 pJ/bit for logic layer (existing DDR3 modules are 65 pJ/bit)
Figure from: T. Pawlowski, HotChips 2011
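
The headline numbers above can be combined into a quick sanity check; this is a minimal sketch using only the figures quoted on the slide.

# Aggregate link bandwidth and energy-per-bit comparison for HMC.
LINKS = 8                      # future-generation link count from the slide
LINK_BW_GB_S = 40              # GB/s per link (16 up + 16 down lanes)
peak_bw = LINKS * LINK_BW_GB_S

hmc_pj_per_bit = 3.7 + 6.78    # DRAM layers + logic layer
ddr3_pj_per_bit = 65.0         # existing DDR3 modules
print(f"aggregate link bandwidth: {peak_bw} GB/s")
print(f"energy per bit: {hmc_pj_per_bit:.2f} pJ (HMC) vs {ddr3_pj_per_bit} pJ (DDR3), "
      f"about {ddr3_pj_per_bit / hmc_pj_per_bit:.1f}x lower")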

8 Scheduling Policy Basics
Must honor several timing constraints for each bank/rank
Commands: PRE, ACT, COL-RD, COL-WR, REF, Power-Up/Dn
Must handle reads and writes on the same DDR3 bus
Must issue refreshes on time
Must maximize row buffer hit rates and parallelism
Must maximize throughput and fairness
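
As a concrete illustration of the first two bullets, here is a minimal sketch of the per-bank timing checks a controller performs before issuing a command; the constraint values and the simple two-command window are assumptions, not real DDR3 parameters.

# Illustrative timing constraints in controller cycles (assumed values).
TIMING = {
    ("ACT", "COL-RD"): 10,   # row activate to column read (tRCD-like)
    ("ACT", "PRE"):    28,   # activate to precharge (tRAS-like)
    ("PRE", "ACT"):    10,   # precharge to next activate (tRP-like)
    ("COL-RD", "PRE"):  6,   # read to precharge (tRTP-like)
}

def can_issue(cmd, now, last_issued):
    """last_issued maps each previously issued command to its issue cycle."""
    for (prev, nxt), gap in TIMING.items():
        if nxt == cmd and prev in last_issued and now - last_issued[prev] < gap:
            return False
    return True

print(can_issue("COL-RD", now=12, last_issued={"ACT": 5}))   # False: row not yet ready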

9 Address Mapping Policies
Consecutive cache lines can be placed in the same row to boost row buffer hit rates
Consecutive cache lines can be placed in different ranks to boost parallelism
Example address mapping policies (most-significant bits first):
row:rank:bank:channel:column:blkoffset
row:column:rank:bank:channel:blkoffset
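
A minimal sketch of how a physical address would be carved into these fields; the field widths (4 channels, 2 ranks, 8 banks, 128 cache-line columns per row, 64 B lines) are assumptions chosen only to make the example concrete.

FIELD_BITS = {"blkoffset": 6, "column": 7, "channel": 2, "bank": 3, "rank": 1, "row": 14}

def split_address(addr, order_lsb_first):
    """Peel fields off the address, least-significant field first."""
    fields = {}
    for name in order_lsb_first:
        bits = FIELD_BITS[name]
        fields[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return fields

addr = 0x12345678
# row:rank:bank:channel:column:blkoffset -- consecutive lines share a row
print(split_address(addr, ["blkoffset", "column", "channel", "bank", "rank", "row"]))
# row:column:rank:bank:channel:blkoffset -- consecutive lines spread over channels
print(split_address(addr, ["blkoffset", "channel", "bank", "rank", "column", "row"]))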

10 Reads and Writes
A single bus is used for reads and writes
The bus direction must be reversed when switching between reads and writes; this takes time and leads to bus idling
Hence, writes are performed in bursts: a write buffer stores pending writes until a high water mark is reached
Writes are then drained until a low water mark is reached
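
A minimal sketch of the high/low water mark mechanism; the buffer size and thresholds are illustrative assumptions.

HIGH_WATER, LOW_WATER = 48, 16     # assumed thresholds for a 64-entry buffer

class WriteBuffer:
    def __init__(self):
        self.pending = []          # writes waiting to be drained
        self.draining = False      # True while the bus is turned around for writes

    def add(self, write):
        self.pending.append(write)
        if len(self.pending) >= HIGH_WATER:
            self.draining = True   # reverse the bus and start a write burst

    def next_command(self, read_queue):
        if self.draining:
            if len(self.pending) > LOW_WATER:
                return ("WRITE", self.pending.pop(0))
            self.draining = False  # turn the bus back toward reads
        return ("READ", read_queue.pop(0)) if read_queue else None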

11 Refresh
Example: tREFI (gap between refresh commands) = 7.8 us, tRFC (time to complete a refresh command) = 350 ns
JEDEC: must issue 8 refresh commands in an 8*tREFI time window
This allows some flexibility in when each refresh command is issued
Elastic refresh: issue a refresh command when there is a lull in activity; any unissued refreshes are handled at the end of the 8*tREFI window
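
A minimal sketch of the elastic refresh decision, assuming time is tracked in nanoseconds and that an empty request queue is a good enough proxy for a lull in activity.

T_REFI, T_RFC, WINDOW = 7800, 350, 8    # values from the slide

def should_refresh(now_ns, window_start_ns, refreshes_done, queue_empty):
    owed = WINDOW - refreshes_done
    if owed <= 0:
        return False
    window_end = window_start_ns + WINDOW * T_REFI
    # Forced: barely enough time left to fit the remaining refreshes.
    if now_ns >= window_end - owed * T_RFC:
        return True
    # Elastic: refresh early while the bank has nothing better to do.
    return queue_empty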

12 Maximizing Row Buffer Hit Rates
FCFS: issue the first read or write in the queue that is ready for issue
First-Ready FCFS (FR-FCFS): issue row buffer hits first if you can
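
A minimal sketch contrasting the two policies over one bank's request queue; each request is assumed to carry the row it targets, and open_row is the row currently held in the bank's row buffer.

def fcfs(queue, open_row):
    # Oldest ready request wins, regardless of the row buffer contents.
    return queue[0] if queue else None

def fr_fcfs(queue, open_row):
    # First Ready: prefer the oldest request that hits the open row...
    for req in queue:
        if req["row"] == open_row:
            return req
    # ...and fall back to oldest-first otherwise.
    return queue[0] if queue else None

q = [{"id": 0, "row": 7}, {"id": 1, "row": 3}, {"id": 2, "row": 3}]
print(fcfs(q, open_row=3))      # id 0, the oldest request
print(fr_fcfs(q, open_row=3))   # id 1, the oldest row buffer hit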

13 STFM (Mutlu and Moscibroda, MICRO'07)
When multiple threads run together, threads with row buffer hits are prioritized by FR-FCFS
Each thread has a slowdown S = T_shared / T_alone, where T is the number of cycles the ROB is stalled waiting for memory
Unfairness is estimated as S_max / S_min
If unfairness is higher than a threshold, fairness-based thread priorities override the other priorities (Stall-Time Fair Memory scheduling)
Estimating T_alone requires some book-keeping: does an access delay critical requests from other threads?
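
A minimal sketch of the fairness check, assuming T_alone has already been estimated by the book-keeping mentioned above; the threshold value is an illustrative assumption.

UNFAIRNESS_THRESHOLD = 1.1       # assumed; STFM treats this as a tunable knob

def stfm_priority_thread(threads):
    # threads: dicts with estimated T_alone and measured T_shared stall cycles
    for t in threads:
        t["slowdown"] = t["T_shared"] / t["T_alone"]
    slowdowns = [t["slowdown"] for t in threads]
    unfairness = max(slowdowns) / min(slowdowns)
    if unfairness > UNFAIRNESS_THRESHOLD:
        # Fairness mode: requests of the most slowed-down thread go first.
        return max(threads, key=lambda t: t["slowdown"])
    return None                  # otherwise keep the baseline FR-FCFS priorities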

14 PAR-BS (Mutlu and Moscibroda, ISCA'08)
A batch of requests is formed per bank: each thread can contribute at most R requests to this batch; batch requests have priority over non-batch requests
Within a batch, priority is first given to row buffer hits, then to threads with a higher “rank”, then to older requests
Rank is computed from each thread's memory intensity; low-intensity threads are given higher priority; this improves batch completion time and overall throughput
Because ranks are consistent across banks, a thread's requests are serviced in parallel; hence, parallelism-aware batch scheduling
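
A minimal sketch of batch formation and within-batch ordering; the cap R is an assumed value and the thread ranking is taken as an input (smaller number = higher rank) rather than computed from memory intensity.

R = 5    # assumed per-thread cap on requests marked into a batch

def form_batch(bank_queue):
    counts, batch = {}, []
    for req in bank_queue:                    # oldest first
        tid = req["thread"]
        if counts.get(tid, 0) < R:
            counts[tid] = counts.get(tid, 0) + 1
            batch.append(req)
    return batch

def batch_order(batch, open_row, rank):
    # Row buffer hits first, then higher-ranked threads, then older requests.
    return sorted(batch, key=lambda r: (r["row"] != open_row,
                                        rank[r["thread"]],
                                        r["age"]))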

15 TCM (Kim et al., MICRO 2010)
Organize threads into latency-sensitive and bandwidth-sensitive clusters based on memory intensity; the former gets higher priority
Within the bandwidth-sensitive cluster, priority is based on rank
Rank is determined by the “niceness” of a thread and is periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness)
Threads with low row buffer hit rates and high bank-level parallelism are considered “nice” to others
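
A minimal sketch of the two-cluster split; putting a fixed fraction of threads in the latency-sensitive cluster and scoring niceness as bank-level parallelism minus row-hit rate are simplifications of what the paper actually does.

def tcm_cluster(threads, latency_frac=0.25):
    # Least memory-intensive threads form the latency-sensitive cluster.
    by_intensity = sorted(threads, key=lambda t: t["mpki"])
    cut = int(len(threads) * latency_frac)
    latency_cluster = by_intensity[:cut]        # always highest priority
    bw_cluster = by_intensity[cut:]
    # Nicer threads (high bank-level parallelism, low row buffer locality)
    # get better rank within the bandwidth-sensitive cluster.
    bw_cluster.sort(key=lambda t: t["blp"] - t["row_hit_rate"], reverse=True)
    return latency_cluster, bw_cluster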

16 Minimalist Open-Page (Kaseridis et al., MICRO 2011)
Place 4 consecutive cache lines in one bank, then the next 4 in a different bank, and so on; this provides the best balance between row buffer locality and bank-level parallelism
There is less need to worry about fairness
Scheduling first takes priority into account, where priority is determined by wait time, prefetch distance, and MLP in the thread
A row is precharged after 50 ns, or immediately following a prefetch-dictated large burst
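
A minimal sketch of the interleaving described in the first bullet; the bank count is an assumption, and a real mapping also folds in channel and rank bits.

LINES_PER_RUN = 4     # consecutive cache lines kept in the same bank
NUM_BANKS = 16        # assumed total banks across the system

def map_cache_line(line_addr):
    run = line_addr // LINES_PER_RUN
    bank = run % NUM_BANKS
    row = run // NUM_BANKS
    return bank, row

# Lines 0-3 map to bank 0, lines 4-7 to bank 1, and so on: modest row buffer
# locality per bank plus bank-level parallelism across a larger access stream.
print([map_cache_line(l) for l in range(8)])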

17 Other Scheduling Ideas
Using reinforcement learning: Ipek et al., ISCA 2008
Coordinating across multiple memory controllers: Kim et al., HPCA 2010
Coordinating requests from the GPU and CPU: Ausavarungnirun et al., ISCA 2012
Several schedulers in the Memory Scheduling Championship at ISCA 2012
Predicting the number of row buffer hits: Awasthi et al., PACT 2011
