1 Lecture 3: Memory Buffers and Scheduling Topics: buffers (FB-DIMM, RDIMM, LRDIMM, BoB, BOOM), memory blades, scheduling policies.

Slides:



Advertisements
Similar presentations
Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi * Chang Joo Lee * + Onur Mutlu Yale N. Patt * * HPS Research Group The.
Advertisements

DRAM background Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling, Garnesh, HPCA'07 CS 8501, Mario D. Marino, 02/08.
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors Zhichun ZhuZhao Zhang ECE Department Univ. Illinois at ChicagoIowa State.
A Case for Refresh Pausing in DRAM Memory Systems
The Compact Memory Scheduling Maximizing Row Buffer Locality Young-Suk Moon, Yongkee Kwon, Hong-Sik Kim, Dong-gun Kim, Hyungdong Hayden Lee, and Kunwoo.
Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.
AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.
4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.
Aérgia: Exploiting Packet Latency Slack in On-Chip Networks
Understanding a Problem in Multicore and How to Solve It
Parallelism-Aware Batch Scheduling Enhancing both Performance and Fairness of Shared DRAM Systems Onur Mutlu and Thomas Moscibroda Computer Architecture.
OWL: Cooperative Thread Array (CTA) Aware Scheduling Techniques for Improving GPGPU Performance Adwait Jog, Onur Kayiran, Nachiappan CN, Asit Mishra, Mahmut.
1 Multi-Core Systems CORE 0CORE 1CORE 2CORE 3 L2 CACHE L2 CACHE L2 CACHE L2 CACHE DRAM MEMORY CONTROLLER DRAM Bank 0 DRAM Bank 1 DRAM Bank 2 DRAM Bank.
MICRO-47, December 15, 2014 FIRM: Fair and HIgh-PerfoRmance Memory Control for Persistent Memory Systems Jishen Zhao Onur Mutlu Yuan Xie.
1 Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, intro to multi-thread programming models.
1 Lecture 13: DRAM Innovations Today: energy efficiency, row buffer management, scheduling.
Lecture 12: DRAM Basics Today: DRAM terminology and basics, energy innovations.
1 Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections )
1 Lecture 15: DRAM Design Today: DRAM basics, DRAM innovations (Section 5.3)
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Nov. 13, 2002 Topic: Main Memory (DRAM) Organization.
Parallel Application Memory Scheduling Eiman Ebrahimi * Rustam Miftakhutdinov *, Chris Fallin ‡ Chang Joo Lee * +, Jose Joao * Onur Mutlu ‡, Yale N. Patt.
1 Lecture 14: DRAM, PCM Today: DRAM scheduling, reliability, PCM Class projects.
1 Lecture 1: Introduction and Memory Systems CS 7810 Course organization:  5 lectures on memory systems  5 lectures on cache coherence and consistency.
1 Towards Scalable and Energy-Efficient Memory System Architectures Rajeev Balasubramonian School of Computing University of Utah.
Scalable Many-Core Memory Systems Lecture 4, Topic 3: Memory Interference and QoS-Aware Memory Systems Prof. Onur Mutlu
1 Lecture 4: Memory: HMC, Scheduling Topics: BOOM, memory blades, HMC, scheduling policies.
Stall-Time Fair Memory Access Scheduling Onur Mutlu and Thomas Moscibroda Computer Architecture Group Microsoft Research.
Thread Cluster Memory Scheduling : Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter.
Buffer-On-Board Memory System 1 Name: Aurangozeb ISCA 2012.
Computer Architecture System Interface Units Iolanthe II approaches Coromandel Harbour.
Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis +, Jeffrey Stuecheli *+, and.
1 Lecture 14: DRAM Main Memory Systems Today: cache/TLB wrap-up, DRAM basics (Section 2.3)
Modern DRAM Memory Architectures Sam Miller Tam Chantem Jon Lucas CprE 585 Fall 2003.
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research.
Parallelism-Aware Batch Scheduling Enhancing both Performance and Fairness of Shared DRAM Systems Onur Mutlu and Thomas Moscibroda Computer Architecture.
1 Lecture 2: Memory Energy Topics: energy breakdowns, handling overfetch, LPDRAM, row buffer management, channel energy, refresh energy.
COMP541 Memories II: DRAMs
1 Adapted from UC Berkeley CS252 S01 Lecture 18: Reducing Cache Hit Time and Main Memory Design Virtucal Cache, pipelined cache, cache summary, main memory.
1 Lecture 5: Refresh, Chipkill Topics: refresh basics and innovations, error correction.
1 Lecture 5: Scheduling and Reliability Topics: scheduling policies, handling DRAM errors.
Achieving High Performance and Fairness at Low Cost Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, Onur Mutlu 1 The Blacklisting Memory.
1 Lecture: Memory Technology Innovations Topics: state-of-the-art and upcoming changes: buffer chips, 3D stacking, non-volatile cells, photonics Multiprocessor.
Simultaneous Multi-Layer Access Improving 3D-Stacked Memory Bandwidth at Low Cost Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, Onur Mutlu.
1 Lecture: Memory Technology Innovations Topics: memory schedulers, refresh, state-of-the-art and upcoming changes: buffer chips, 3D stacking, non-volatile.
1 Lecture 3: Memory Energy and Buffers Topics: Refresh, floorplan, buffers (SMB, FB-DIMM, BOOM), memory blades, HMC.
1 Lecture: DRAM Main Memory Topics: DRAM intro and basics (Section 2.3)
15-740/ Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University.
Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Lavanya Subramanian 1.
15-740/ Computer Architecture Lecture 25: Main Memory
Effect of Instruction Fetch and Memory Scheduling on GPU Performance Nagesh B Lakshminarayana, Hyesoon Kim.
1 Lecture 20: OOO, Memory Hierarchy Today’s topics:  Out-of-order execution  Cache basics.
1 Lecture 4: Memory Scheduling, Refresh Topics: scheduling policies, refresh basics.
1 Lecture 16: Main Memory Innovations Today: DRAM basics, innovations, trends HW5 due on Thursday; simulations can take a few hours Midterm: 32 scores.
1 Lecture: Memory Basics and Innovations Topics: memory organization basics, schedulers, refresh,
Reducing Memory Interference in Multicore Systems
Zhichun Zhu Zhao Zhang ECE Department ECE Department
Lecture 15: DRAM Main Memory Systems
Lecture: Memory, Multiprocessors
Staged-Reads : Mitigating the Impact of DRAM Writes on DRAM Reads
Lecture: DRAM Main Memory
Achieving High Performance and Fairness at Low Cost
Lecture: DRAM Main Memory
Lecture: DRAM Main Memory
Lecture: Memory Technology Innovations
Lecture 20: OOO, Memory Hierarchy
If a DRAM has 512 rows and its refresh time is 9ms, what should be the frequency of row refresh operation on the average?
15-740/ Computer Architecture Lecture 19: Main Memory
DRAM Hwansoo Han.
18-447: Computer Architecture Lecture 19: Main Memory
Presentation transcript:

1 Lecture 3: Memory Buffers and Scheduling Topics: buffers (FB-DIMM, RDIMM, LRDIMM, BoB, BOOM), memory blades, scheduling policies

2 HMC Details 32 banks per die x 8 dies = 256 banks per package 2 banks x 8 dies form 1 vertical slice (shared data bus) High internal data bandwidth (TSVs)  entire cache line from a single array (2 banks) that is 256 bytes wide Future generations: eight links that can connect to the processor or other HMCs – each link (40 GBps) has 16 up and 16 down lanes (each lane has 2 differential wires) 1866 TSVs at 60 um pitch and 2 Gb/s (50 nm 1Gb DRAMs) 3.7 pJ/bit for DRAM layers and 6.78 pJ/bit for logic layer (existing DDR3 modules are 65 pJ/bit) Figure from: T. Pawlowski, HotChips 2011

3 Modern Memory System PROC.. 4 DDR3 channels 64-bit data channels 800 MHz channels 1-2 DIMMs/channel 1-4 ranks/channel

4 Cutting-Edge Systems PROC SMB.. The link into the processor is narrow and high frequency The Scalable Memory Buffer chip is a “router” that connects to multiple DDR3 channels (wide and slow) Boosts processor pin bandwidth and memory capacity More expensive, high power

5 Buffer-on-Board Examples From: Cooper-Balis et al., ISCA 2012

6 FB-DIMM 100 W for a fully-populated FB-DIMM channel under high load. Processor FB-DIMM AMB DRAM : … … FB-DIMM AMB DRAM : … … FB-DIMM AMB DRAM : … … … FB-DIMM architecture; up to 8 FB-DIMMs can be daisy-chained through AMBs. / 10 bit 14 bit /

7 LR-DIMM Figure from: Yoon et al., ISCA 2012

8 BOOM Yoon et al., ISCA 2012 Using LPDDR chips at low frequency Wide internal datapath Longer burst length on DBUS No cache line buffering Higher activate power Same bandwidth, but lower power and higher latency

9 BOOM with Sub-Ranking

10 Disaggregated Memory Lim et al., ISCA 2009 To support high capacity memory, build a separate memory blade that is shared by all compute blades in a rack For example, if average utilization is 2 GB, each compute blade is only provisioned with 2 GB of memory; but the compute blade can also access (say) 2 TB of data in the memory blade The hierarchy is exclusive and data is managed at page granularity Remote memory access is via PCIe (120 ns latency and 1 GB/s in each direction)

11 Scheduling Policies Basics Must honor several timing constraints for each bank/rank Commands: PRE, ACT, COL-RD, COL-WR, REF, Power-Up/Dn Must handle reads and writes on the same DDR3 bus Must issue refreshes on time Must maximize row buffer hit rates and parallelism Must maximize throughput and fairness

12 Address Mapping Policies Consecutive cache lines can be placed in the same row to boost row buffer hit rates Consecutive cache lines can be placed in different ranks to boost parallelism Example address mapping policies: row:rank:bank:channel:column:blkoffset row:column:rank:bank:channel:blkoffset

13 Reads and Writes A single bus is used for reads and writes The bus direction must be reversed when switching between reads and writes; this takes time and leads to bus idling Hence, writes are performed in bursts; a write buffer stores pending writes until a high water mark is reached Writes are drained until a low water mark is reached

14 Refresh Example, tREFI (gap between refresh commands) = 7.8us, tRFC (time to complete refresh command) = 350 ns JEDEC: must issue 8 refresh commands in a 8*tREFI time window Allows for some flexibility in when each refresh command is issued Elastic refresh: issue a refresh command when there is lull in activity; any unissued refreshes are handled at the end of the 8*tREFI window

15 Maximizing Row Buffer Hit Rates FCFS: Issue the first read or write in the queue that is ready for issue (not necessarily the oldest in program order) First Ready - FCFS: First issue row buffer hits if you can

16 STFM Mutlu and Moscibroda, MICRO’07 When multiple threads run together, threads with row buffer hits are prioritized by FR-FCFS Each thread has a slowdown: S = T alone / T shared, where T is the number of cycles the ROB is stalled waiting for memory Unfairness is estimated as S max / S min If unfairness is higher than a threshold, thread priorities override other priorities (Stall Time Fair Memory scheduling) Estimation of T alone requires some book-keeping: does an access delay critical requests from other threads?

17 PAR-BS Mutlu and Moscibroda, ISCA’08 A batch of requests (per bank) is formed: each thread can only contribute R requests to this batch; batch requests have priority over non-batch requests Within a batch, priority is first given to row buffer hits, then to threads with a higher “rank”, then to older requests Rank is computed based on the thread’s memory intensity; low-intensity threads are given higher priority; this policy improves batch completion time and overall throughput By using rank, requests from a thread are serviced in parallel; hence, parallelism-aware batch scheduling

18 TCM Kim et al., MICRO 2010 Organize threads into latency-sensitive and bw-sensitive clusters based on memory intensity; former gets higher priority Within bw-sensitive cluster, priority is based on rank Rank is determined based on “niceness” of a thread and the rank is periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness) Threads with low row buffer hit rates and high bank level parallelism are considered “nice” to others

19 Minimalist Open-Page Kaseridis et al., MICRO 2011 Place 4 consecutive cache lines in one bank, then the next 4 in a different bank and so on – provides the best balance between row buffer locality and bank-level parallelism Don’t have to worry as much about fairness Scheduling first takes priority into account, where priority is determined by wait-time, prefetch distance, and MLP in thread A row is precharged after 50 ns, or immediately following a prefetch-dictated large burst

20 Other Scheduling Ideas Using reinforcement learning Ipek et al., ISCA 2008 Co-ordinating across multiple MCs Kim et al., HPCA 2010 Co-ordinating requests from GPU and CPU Ausavarungnirun et al., ISCA 2012 Several schedulers in the Memory Scheduling Championship at ISCA 2012 Predicting the number of row buffer hits Awasthi et al., PACT 2011

21 Title Bullet