The Compact Memory Scheduling Maximizing Row Buffer Locality
Young-Suk Moon, Yongkee Kwon, Hong-Sik Kim, Dong-gun Kim, Hyungdong Hayden Lee, and Kunwoo Park

Introduction
The cost of read-to-write switching is high
The timing overhead of a row conflict is much higher than that of read-to-write switching
The conventional draining policy does not exploit row-buffer locality
An effective draining policy that considers row locality is proposed

EX1 :: Conventional Drain Policy
[Figure: per-bank (B0–B3) write-queue (WQ) and read-queue (RQ) snapshots with HI_WM/LO_WM watermarks. Once the drain starts, writes are issued consecutively until the write queue falls to the low watermark, and only then does the controller switch back to reads; the row-hit chances waiting in the RQ are wasted during the consecutive write drain.]
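
As a reference point, here is a minimal C sketch of the conventional watermark mechanism the example illustrates: the bus direction depends only on write-queue occupancy, so row hits waiting in either queue never influence the decision. The watermark values, queue model, and function names are illustrative assumptions, not the paper's parameters.

#include <stdio.h>

#define HI_WM 8   /* assumed: start draining writes at this WQ level */
#define LO_WM 4   /* assumed: stop draining writes at this WQ level */

typedef enum { READ_MODE, WRITE_MODE } bus_mode_t;

/* Conventional policy: pick the bus direction from write-queue
 * occupancy alone; row-buffer locality is ignored. */
bus_mode_t conventional_drain(bus_mode_t mode, int wq_occupancy) {
    if (mode == READ_MODE && wq_occupancy >= HI_WM)
        return WRITE_MODE;   /* WQ nearly full: start the write drain */
    if (mode == WRITE_MODE && wq_occupancy <= LO_WM)
        return READ_MODE;    /* drained to the low watermark: back to reads */
    return mode;             /* otherwise keep the current direction */
}

int main(void) {
    bus_mode_t mode = READ_MODE;
    int occupancy[] = { 3, 8, 7, 6, 5, 4, 9 };  /* WQ level over time */
    for (int i = 0; i < 7; i++) {
        mode = conventional_drain(mode, occupancy[i]);
        printf("WQ=%d -> %s\n", occupancy[i],
               mode == WRITE_MODE ? "write drain" : "read");
    }
    return 0;
}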

EX1 :: Proposed Drain Policy
[Figure: the same WQ/RQ snapshots under the proposed policy. When a row hit exists in the RQ but not in the WQ, the controller switches to a read drain; when a row hit exists in the WQ but not in the RQ, it switches back to a write drain. Row locality is successfully utilized.]

EX2 :: Conventional Drain Policy
[Figure: WQ/RQ snapshots in which the write queue itself carries row locality (row R0 repeated across banks B0–B3). The drain stops at the low watermark and switches to a read drain, so the remaining row-hit write requests are not issued while their rows are open.]

EX2 :: Proposed Drain Policy
[Figure: under the proposed policy the write drain continues past the low watermark while row hits remain in the WQ but not in the RQ, and switches to a read drain only afterward. Row locality is successfully utilized.]

Key Idea
By referencing row locality, the controller can
– switch to reads even if the number of pending write requests in the write queue is still above the low watermark
– keep draining write requests even after the write-queue occupancy has reached the low watermark
Row locality can thus be fully utilized with RLDP (Row Locality based Drain Policy); the sketch after the flow chart shows the decision logic.

Flow Chart
[Figure: RLDP decision flow chart; not reproduced in the transcript.]
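
The decision in the flow chart can be summarized in code. Below is a minimal C sketch, assuming hypothetical predicates row_hit_in_rq and row_hit_in_wq (whether any pending request in that queue targets a currently open row) and illustrative watermark values; it is a sketch of the idea, not the paper's exact implementation.

#include <stdbool.h>
#include <stdio.h>

#define HI_WM 8
#define LO_WM 4

typedef enum { READ_MODE, WRITE_MODE } bus_mode_t;

bus_mode_t rldp_drain(bus_mode_t mode, int wq_occupancy,
                      bool row_hit_in_rq, bool row_hit_in_wq) {
    if (mode == READ_MODE)
        return wq_occupancy >= HI_WM ? WRITE_MODE : READ_MODE;

    /* Early switch: a row hit waits in the RQ but not in the WQ, so
     * reads resume even though the WQ is still above the low watermark. */
    if (wq_occupancy > LO_WM && row_hit_in_rq && !row_hit_in_wq)
        return READ_MODE;

    /* Extended drain: row hits remain in the WQ but not in the RQ, so
     * writes continue even though the low watermark has been reached. */
    if (wq_occupancy <= LO_WM && row_hit_in_wq && !row_hit_in_rq)
        return WRITE_MODE;

    /* Otherwise fall back to the conventional watermark rule. */
    return wq_occupancy <= LO_WM ? READ_MODE : WRITE_MODE;
}

int main(void) {
    /* EX1: RQ holds a row hit, WQ does not -> early switch to reads. */
    printf("%d\n", rldp_drain(WRITE_MODE, 6, true, false));   /* READ_MODE */
    /* EX2: WQ holds a row hit, RQ does not -> keep draining writes. */
    printf("%d\n", rldp_drain(WRITE_MODE, 4, false, true));   /* WRITE_MODE */
    return 0;
}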

Conventional Scheduling Algorithms
Delayed write drain [13] and a delayed close policy [17] are combined to increase performance and exploit row-buffer locality
Delayed write drain is applied adaptively based on historical request density
The per-bank delayed close policy is applied adaptively using history counters
– the read history counter is incremented when a read command is issued and decremented when an activate command is issued
– the write history counter operates in the same manner
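
A minimal C sketch of these per-bank history counters follows; the bank count, plain integer counters, and the zero keep-open threshold are assumptions made for illustration, not the paper's exact design.

#include <stdio.h>

#define NUM_BANKS 8   /* assumed bank count */

typedef struct {
    int read_hist;    /* ++ on each read command,  -- on each activate */
    int write_hist;   /* ++ on each write command, -- on each activate */
} bank_history_t;

static bank_history_t hist[NUM_BANKS];

void on_read_issued(int bank)     { hist[bank].read_hist++;  }
void on_write_issued(int bank)    { hist[bank].write_hist++; }
void on_activate_issued(int bank) {
    hist[bank].read_hist--;
    hist[bank].write_hist--;
}

/* Delay closing the row only while the bank has recently seen more
 * row hits than activates; a zero threshold is an assumption. */
int keep_row_open(int bank) {
    return hist[bank].read_hist > 0 || hist[bank].write_hist > 0;
}

int main(void) {
    on_activate_issued(0);   /* a row is opened in bank 0 */
    on_read_issued(0);       /* two row-hit reads follow */
    on_read_issued(0);
    printf("keep row open? %d\n", keep_row_open(0));   /* prints 1 */
    return 0;
}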

Total Execution Time
Compared to CLOSE+FR-FCFS, the total execution time is reduced by
– 2.35% with DELAYED-CLOSE
– 4.37% with RLDP
– 5.64% with PROPOSED (9.99% compared to FCFS)
* DELAYED-CLOSE shows a better result in the 1-channel configuration than in the 4-channel configuration (3.74% in 1CH, 0.90% in 4CH)
** RLDP improves performance in both configurations (4.51% in 1CH, 3.86% in 4CH)

Row Hit Rate of Write Requests
* Row hit rate = (#Write − #ActivateW) / #Write; for example, 1,000 writes issued with 300 write activates yield a 70% write row-hit rate
** RLDP shows a clear improvement in the row hit rate of write requests
Compared to CLOSE+FR-FCFS, the row hit rate of write requests is increased by
– 10.64% with DELAYED-CLOSE
– 30.05% with RLDP
– 34.28% with PROPOSED

Row Hit Rate of Read Requests
* DELAYED-CLOSE improves the row hit rate of read requests, while RLDP shows only a slight improvement
Compared to CLOSE+FR-FCFS, the row hit rate of read requests is increased by
– 2.20% with DELAYED-CLOSE
– 0.35% with RLDP
– 2.21% with PROPOSED

Row Hit Rate of Write Requests
* 4-channel configuration, MT-canneal workload
* The row hit rate of write requests is improved greatly throughout the whole simulation period

High Watermark Residence Time / Read-Write Switching Frequency
The high watermark residence time of the proposed algorithm is reduced by 69%
The read-write switching frequency of the proposed algorithm is increased by 33.78%
* Read-to-write switching occurs more frequently

Hardware Overhead
The total register overhead is 0.4 KB
Because RLDP only checks pending requests for row hits, the logic complexity is low

Comparison of Key Metrics
Compared to the close scheduling policy, the proposed algorithm reduces the system execution time by 6.86% (9.99% compared to FCFS)
PFP is improved by 11.4%, and EDP is improved by 12%

Conclusion
RLDP (Row Locality based Drain Policy) is proposed to utilize row-buffer locality
The proposed scheduling algorithm improves the row hit rate of both write and read requests
The number of activate commands is reduced, so the total execution time is improved by 6.86% compared to the CLOSE+FR-FCFS scheduling algorithm (9.99% compared to FCFS)

Q & A

References
[1] B. Jacob, S. W. Ng, and D. T. Wang. Memory Systems: Cache, DRAM, Disk. Elsevier, Chapter 7, 2008.
[2] JEDEC. JEDEC Standard: DDR/DDR2/DDR3 STANDARD (JESD 79-1,2,3).
[3] Chang Joo Lee. DRAM-Aware Prefetching and Cache Management. HPS Technical Report TR-HPS, The University of Texas at Austin, December 2010.
[4] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. In Proceedings of ISCA, 2000.
[5] M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, and A. Davis. Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers. In Proceedings of PACT, 2010.
[6] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In Proceedings of MICRO, 2010.
[7] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In Proceedings of MICRO, 2007.
[8] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems. In Proceedings of ISCA, 2008.
[9] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt. Prefetch-Aware DRAM Controllers. In Proceedings of MICRO, 2008.

References
[10] I. Hur and C. Lin. Adaptive History-Based Memory Schedulers. In Proceedings of MICRO, 2004.
[11] D. Kaseridis, J. Stuecheli, and L. John. Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era. In Proceedings of MICRO, 2011.
[12] J. Stuecheli, D. Kaseridis, D. Daly, H. Hunter, and L. John. The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies. In Proceedings of ISCA, 2010.
[13] C. Natarajan et al. A Study of Performance Impact of Memory Controller Features in Multi-Processor Server Environment. In Proceedings of WMPI, 2004.
[14] N. Chatterjee, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. Jouppi. Staged Reads: Mitigating the Impact of DRAM Writes on DRAM Reads. In Proceedings of HPCA, 2012.
[15] B. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting Phase Change Memory as a Scalable DRAM Alternative. In Proceedings of ISCA, 2009.
[16] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti. USIMM: the Utah SImulated Memory Module. Technical report, University of Utah, 2012.
[17] AnandTech. "Everything You Always Wanted to Know About SDRAM (Memory): But Were Afraid to Ask." http://…/memory-but-were-afraid-to-ask/6.
[18] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of PACT, 2008.

1-Rank Configuration
Compared to CLOSE+FR-FCFS, the total execution time is reduced by
– 2.87% with DELAYED-CLOSE
– 4.26% with RLDP
– 4.76% with PROPOSED (7.04% compared to FCFS)
* The read-to-write switching overhead is larger than in the 2-rank configuration