Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era. Dimitris Kaseridis+, Jeffrey Stuecheli*+, and Lizy Kurian John+.

Presentation transcript:

[Paper Review] Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era
Dimitris Kaseridis+, Jeffrey Stuecheli*+, and Lizy Kurian John+ (MICRO'11)
+ The University of Texas at Austin; * IBM Corp.
Presented by Jinil Chung, Korea University, VLSI Signal Processing Lab.

( 2 ) Abstract
DRAM design is a balance between performance, power, and storage density. To realize good performance, one must manage the structural and timing restrictions of the DRAM devices. Use of the "page-mode" feature can mitigate many DRAM constraints, but aggressive page-mode results in many conflicts (e.g. bank conflicts) when multiple workloads in many-core systems map to the same DRAM banks. In this paper, a Minimalist approach: "just enough" page-mode accesses to get the benefits while avoiding unfairness → proposed address hashing + data prefetch engine + per-request priority

( 3 ) 1. Introduction
Row buffer (or "page-mode") access: this paper proposes a combination of open-/closed-page policy, based on two observations:
1) Page-mode delivers its gain with only a small number of page accesses → propose a fair DRAM address mapping scheme: low RBL & high BLP
2) Page-mode hits come from spatial locality, which can be captured by prefetch engines (NOT temporal locality!) → propose an intuitive criticality-based memory request priority scheme

                                   Open-page policy                        Closed-page policy
  Page-mode gain                   Reduces row access latency              None (single column access per row activation)
  Multiple requests in a           Introduces priority inversion and       Avoids the complexities of row buffer
  many-core system                 fairness/starvation problems            management

RBL: Row-buffer Locality; BLP: Bank-level Parallelism

( 4 ) 2. Background
DRAM timing constraints result in "dead time" before and after a random access → the MC's (Memory Controller's) job is to reduce these performance-limiting gaps using parallelism
1) tRC (row cycle time, per bank): after the MC activates a page, the bank is busy for tRC; when multiple threads' requests collide on the same bank, each suffers this latency overhead (tRC delay)
2) tRP (row precharge time, per bank): in open-page policy, before the MC can activate another page in the same bank it must pay tRP (= close the current page before the new page is opened)
[Timing diagram: ACT → PRE → ACT, annotated with tRP (e.g. 12ns), tRC (e.g. 48ns), and tRAS]
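To make the bookkeeping concrete, here is a minimal sketch (not the paper's model; the parameter values are the example numbers from this slide) of how a memory controller can track per-bank timing so a new activation is only issued once tRC and tRP are satisfied:

```python
# Minimal per-bank DRAM timing sketch (illustrative; 12ns/48ns are the
# example values from this slide).
T_RC = 48  # ns: minimum time between two ACTs to the same bank
T_RP = 12  # ns: precharge time before a new row can be activated

class Bank:
    def __init__(self):
        self.last_act = -T_RC   # time of the last ACT command
        self.pre_done = 0       # time when the last precharge completes

    def can_activate(self, now):
        # A new row activation must respect both tRC and tRP.
        return now - self.last_act >= T_RC and now >= self.pre_done

    def activate(self, now):
        assert self.can_activate(now)
        self.last_act = now

    def precharge(self, now):
        # Closing the open row: the bank is unavailable for tRP.
        self.pre_done = now + T_RP
```

Spreading requests across banks lets the controller overlap these per-bank dead times, which is exactly the parallelism the slide refers to.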

( 5 ) 3. Motivation
Use of "page-mode":
1) Latency effects: due to tRC & tRP, overall latency increases with aggressive page-mode → only a small number of accesses per activation is needed
2) Power reduction: activate power drops sharply with the first few page hits → a small number of accesses is enough
3) Bank utilization: drops off quickly as accesses per activation increase → a small number of accesses is enough (if bank utilization is high, the probability that a new request conflicts with a busy bank is greater)
4) Other DRAM complexities: a small number of accesses suffices to soften restrictions
   ex) tFAW (Four-page Activate time Window; 30ns), cache block transfer delay = 6ns
   -. single access per ACT: peak utilization limited (6ns x 4 / 30ns = 80%)
   -. two or more accesses per ACT: peak utilization not limited (12ns x 4 / 30ns > 100%)
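The tFAW example above can be checked with a line of arithmetic (a worked example using the slide's numbers; 6ns is the transfer time of one cache line, consistent with the 6ns x 4 / 30ns figure):

```python
# Worked example of the tFAW utilization limit from this slide.
T_FAW = 30.0   # ns: at most 4 activations in any tFAW window
T_XFER = 6.0   # ns: data transfer time of one cache line

def peak_utilization(accesses_per_act):
    # 4 activations per tFAW window, each serving N column accesses.
    return min(4 * accesses_per_act * T_XFER / T_FAW, 1.0)

print(peak_utilization(1))  # 0.8 -> bus limited to 80% with 1 access/ACT
print(peak_utilization(2))  # 1.0 -> 48ns of transfers per 30ns window
```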

( 6 ) 3. Motivation
3.1 Row-buffer Locality in Modern Processors
: current workstation/server-class designs have large last-level caches (e.g. IBM POWER7)
-. Temporal locality turns into hits in the large last-level cache
-. Row buffers therefore exploit only spatial locality
-. Using prefetch engines, this spatial locality can be predicted
RBL: Row-buffer Locality

( 7 ) 3. Motivation
3.2 Bank and Row Buffer Locality Interplay with Address Mapping
-. A DRAM device address consists of row, column, and bank fields
-. Example: Workload A issues a long sequential access sequence; Workload B issues a single operation
-. Workload A higher priority → B0 is slowed (e.g. FR-FCFS; all DRAM column bits mapped to low-order real address bits)
-. Workload B higher priority → A4 is slowed (e.g. ATLAS, PAR-BS; all DRAM column bits mapped to low-order real address bits)
-. High BLP (Bank-level Parallelism) → B0 can be serviced w/o degrading traffic to workload A (e.g. Minimalist; DRAM column & bank bits mapped to low-order real address bits)

( 8 ) 4. Minimalist Open-page Mode
4.1 DRAM Address Mapping Scheme
[Figure: address bit layout; the 7 column bits are split into a 5-bit MSB portion and a 2-bit LSB portion]
For sequential access of 4 cache lines:
-. The basic difference is that the column access bits are split in two places: the 2 LSB column bits are located right after the block bits, and the 5 MSB column bits are located just before the row bits
-. (Not shown in the figure) higher-order address bits are XOR-ed with the bank bits to produce the actual bank selection bits → reducing row buffer conflicts [Zhang et al., MICRO'00]
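A sketch of the bit-slicing plus XOR hashing idea (field widths and positions are assumptions chosen to match the 5-bit/2-bit column split described above, not the exact hardware layout):

```python
# Illustrative Minimalist-style address decomposition. The 7 column bits
# are split: 2 LSBs sit just above the 64B block offset (so 4 sequential
# cache lines share one row activation) and 5 MSBs sit just below the row
# bits. Widths other than the column split are assumptions.
BLOCK_BITS, COL_LO_BITS, BANK_BITS = 6, 2, 3
RANK_ETC_BITS, COL_HI_BITS = 5, 5

def field(addr, shift, width):
    return (addr >> shift) & ((1 << width) - 1)

def decode(addr):
    s = BLOCK_BITS
    col_lo = field(addr, s, COL_LO_BITS); s += COL_LO_BITS
    bank   = field(addr, s, BANK_BITS);   s += BANK_BITS
    s += RANK_ETC_BITS                    # skip channel/rank bits (assumed)
    col_hi = field(addr, s, COL_HI_BITS); s += COL_HI_BITS
    row    = addr >> s
    # Permutation-based interleaving [Zhang et al., MICRO'00]: XOR low row
    # bits into the bank index to spread row-conflicting addresses.
    bank ^= row & ((1 << BANK_BITS) - 1)
    return row, bank, (col_hi << COL_LO_BITS) | col_lo
```

With this layout, a sequential stream touches the four col_lo values within one activation and then moves to a different bank, giving the low-RBL/high-BLP behavior the scheme targets.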

( 9 ) 4. Minimalist Open-page Mode
4.2 Data Prefetch Engine [IBM POWER6-style]
: predictable "page-mode" opportunities → need for an accurate prefetch engine
: each core includes a HW prefetcher w/ a prefetch depth/distance predictor
1) Multi-line Prefetch Requests
-. A multi-line prefetch operation is a single request that indicates a specific sequence of cache lines
-. Reduces command BW and queue resources
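One way to picture such a request (a hypothetical encoding; the slide only says one command names a specific sequence of cache lines):

```python
from dataclasses import dataclass

@dataclass
class MultiLinePrefetch:
    # One queue entry / one command covers several sequential lines,
    # saving command bandwidth versus issuing each line separately.
    base_addr: int  # address of the first cache line
    count: int      # prefetch depth predicted by the prefetch engine

    def lines(self, line_size=64):
        return [self.base_addr + i * line_size for i in range(self.count)]
```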

( 10 ) 4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme
: In OOO execution, the importance of each request can vary both between and within applications → need for a dynamic priority scheme
1) DRAM Memory Request Priority Calculation (a sketch follows below)
-. Different priority based on criticality to performance
-. Increase the priority of each request every 100ns time interval → time-based aging
-. 2 categories: read (normal) and prefetch → read requests get higher priority
-. MLP (Memory Level Parallelism) information from the MSHRs (Miss Status Holding Registers) in each core: many outstanding misses → each individual miss is less important
-. Distance information from the prefetch engine (4.2)
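A sketch of how such a priority could be computed (the weights and the log-scaled MLP term are illustrative assumptions; the paper's exact rules differ):

```python
import math
from types import SimpleNamespace

AGE_STEP_NS = 100  # bump priority every 100ns of queueing time (slide text)

def request_priority(now_ns, req):
    p = (now_ns - req.arrival_ns) // AGE_STEP_NS  # time-based aging
    if not req.is_prefetch:
        p += 8                               # reads outrank prefetches (assumed weight)
    p -= int(math.log2(max(req.mlp, 1)))     # many outstanding misses -> less critical
    p -= req.distance                        # deeper prefetches are less urgent (assumed)
    return p

# e.g. a demand read queued for 300ns from a core with 4 outstanding misses:
r = SimpleNamespace(arrival_ns=0, is_prefetch=False, mlp=4, distance=0)
print(request_priority(300, r))  # 3 + 8 - 2 - 0 = 9
```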

( 11 ) 4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme (cont.)
2) DRAM Page Closure (Precharge) Policy
-. Using auto-precharge on the last expected access → saves command BW (no explicit precharge command)
3) Overall Memory Request Scheduling Scheme (Priority Rules)
-. The same rules are used by all MCs → no need for communication among MCs
-. If an MC is servicing multiple transfers from a multi-line prefetch request, it can be interrupted by a higher-priority request → a very critical request can be serviced w/ the smallest latency
4) Handling Write Operations
-. The dynamic priority scheme does not apply to writes
-. Using a VWQ (Virtual Write Queue) → writes cause minimal interference
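Arbitration then reduces to picking the highest-priority ready request (oldest first on ties), plus a preemption check for in-flight multi-line prefetches; a minimal sketch reusing request_priority from the previous sketch:

```python
def pick_next(queue, now_ns):
    # Deterministic rules, identical in every memory controller, so no
    # inter-controller communication is needed.
    return max(queue, key=lambda r: (request_priority(now_ns, r),
                                     -r.arrival_ns))

def should_preempt(current, candidate, now_ns):
    # A multi-line prefetch being streamed out can be interrupted between
    # line transfers by a strictly higher-priority request.
    return (current.is_prefetch and
            request_priority(now_ns, candidate) >
            request_priority(now_ns, current))
```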

( 12 ) 5. Evaluation
-. 8-core CMP system using the Simics functional model extended w/ the GEMS toolset
-. Simulates DDR3-1333 DRAM, using the memory controller policy under test for each experiment
-. The Minimalist open-page scheme is compared against three open-page policies (Table 5):
1) PAR-BS (Parallelism-Aware Batch Scheduler)
2) ATLAS (Adaptive per-Thread Least-Attained-Service) memory scheduler
3) FR-FCFS (First-Ready, First-Come-First-Served): baseline

( 13 ) 5. Evaluation
5.1 Throughput
-. Overall, "Minimalist Hash+Priority" demonstrated the best throughput, achieving a 10% improvement over the baseline.
-. This compares with ATLAS and PAR-BS, which achieved 3.2% and 2.8% throughput improvements over the whole workload suite.

( 14 ) 5. Evaluation
5.2 Fairness
-. Minimalist improves fairness by up to 15%, with overall improvements of 7.5%, 3.4%, and 2.5% over FR-FCFS, PAR-BS, and ATLAS, respectively.

( 15 ) 5. Evaluation
5.3 Row Buffer Accesses per Activation
-. The observed page-access rate for the aggressive open-page policies falls significantly short → the high page hit rate is simply not possible given the interleaving of requests between the eight executing programs.
-. With the Minimalist scheme, the achieved page-access rate is close to 3.5, compared to the ideal rate of four.

( 16 ) 5. Evaluation
5.4 Target Page-hit Count Sensitivity
-. The Minimalist scheme requires a target number of page hits to be selected, indicating the maximum number of page hits the scheme attempts to achieve per row activation.
-. A target of 4 page hits provides the best results (though a different system configuration may shift the optimal page-mode hit count).

( 17 ) 5. Evaluation
5.5 DRAM Energy Consumption
-. To estimate power consumption, the Micron power calculator was used.
-. Energy is approximately the same as with FR-FCFS: PAR-BS, ATLAS, and "Minimalist Hash+Priority" provide a small decrease of approximately 5% in overall energy consumption.
-. The energy results are essentially a balance between the decrease in page-mode hits (resulting in higher DRAM activation power) and the increase in system performance (decreasing runtime).

( 18 ) Conclusions
Minimalist Open-page memory scheduling policy
-. Page-mode gain w/ a small number of page accesses for each page activation
-. Assigns per-request priority using request-stream information: MLP and the data prefetch engine
Improving throughput and fairness
-. Throughput increased by 10% on average (compared to FR-FCFS)
-. No need for thread-based priority information
-. No need for communication/coordination among multiple MCs or the OS

( 19 - 21 ) Appendix. Detailed simulation information (tables not captured in the transcript)

( 22 ) Thanks.