Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era
Dimitris Kaseridis +, Jeffrey Stuecheli *+, and Lizy Kurian John +
+ The University of Texas at Austin, * IBM Corp.
MICRO'11
[Paper Review] Jinil Chung (정진일), Korea University, VLSI Signal Processing Lab.
( 2 ) Abstract
-. DRAM: a balance between performance, power, and storage density [IEEE Spectrum]
-. To realize good performance, the system must manage the structural and timing restrictions of the DRAM devices
-. Use of the "page-mode" feature can mitigate many DRAM constraints
-. Aggressive page-mode use results in many conflicts (e.g. bank conflicts) when multiple workloads in many-core systems map to the same DRAM
-. In this paper: a Minimalist approach that uses "just enough" page-mode accesses to get the benefits while avoiding unfairness
-. Proposed: address hashing + data prefetch engine + per-request priority
( 3 ) 1. Introduction
Row buffer (or "page-mode") access
This paper proposes a combination of open- and closed-page policies based on:
1) Page-mode gain is obtained with only a small number of page accesses
   -> proposes a fair DRAM address mapping scheme: low RBL & high BLP
2) Page-mode hits come from spatial locality (NOT temporal locality!), which can be captured in prefetch engines
   -> proposes an intuitive criticality-based memory request priority scheme
Open-page policy
-. Page-mode gain: reduces row access latency
-. Multiple requests in a many-core system: introduces priority inversion and fairness/starvation problems
Closed-page policy
-. Page-mode gain: none (single col. access per row activation)
-. Multiple requests in a many-core system: avoids the complexities of row buffer management
RBL: Row-buffer Locality, BLP: Bank-level Parallelism
( 4 ) 2. Background
-. DRAM timing constraints result in "dead time" before and after each random access
-. The MC (Memory Controller)'s job is to reduce these performance-limiting gaps using parallelism
1) tRC (row cycle time; per bank (BK)): after the MC activates a page, it must wait tRC before activating another page in the same BK
   -> when multiple threads access different rows in the same BK, there is a latency overhead (the tRC delay)
2) tRP (row precharge time; per BK): in open-page policy, before the MC activates another page it must wait tRP in that BK (= the current page is closed before the new page is opened)
Figure: ACT, PRE, ACT command timeline for one bank, showing tRP (e.g. 12ns), tRC (e.g. 48ns), tRAS (e.g. ...)
( 5 ) 3. Motivation
Use of "page-mode" ...
1) Latency Effects: due to tRC & tRP, overall latency increases -> a small # of accesses?
2) Power Reduction: only the Activate power is reduced -> a small # of accesses is enough
3) Bank Utilization (B/U): drops off quickly as the # of accesses increases -> a small # of accesses is enough
   -. Closed-page policy: if B/U is high, the probability that a new request will conflict w/ a busy bank is greater (values shown in the figure: 16%, 62%)
4) Other DRAM complexities: a small # of accesses is needed to soften restrictions
   ex) tFAW (Four page Activate time Window; 30ns), cache block transfer delay = 6ns
   -. single access per ACT: peak utilization is limited (6ns*4/30ns = 80%)
   -. two or more accesses per ACT: peak utilization is not limited (12ns*4/30ns > 100%)
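The tFAW example above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name and its parameters are hypothetical, and it assumes at most four activations per tFAW window and a data bus that is otherwise free.

```python
def peak_utilization(t_faw_ns, transfer_ns, accesses_per_act, acts_per_window=4):
    """Upper bound on data-bus utilization imposed by the tFAW window.

    At most `acts_per_window` row activations fit in one tFAW window, and each
    activation serves `accesses_per_act` cache-block transfers of `transfer_ns`.
    """
    busy_ns = acts_per_window * accesses_per_act * transfer_ns
    return min(1.0, busy_ns / t_faw_ns)  # utilization cannot exceed 100%


# Values from the slide: tFAW = 30ns, 6ns per cache-block transfer.
single = peak_utilization(30.0, 6.0, accesses_per_act=1)
double = peak_utilization(30.0, 6.0, accesses_per_act=2)
print(f"1 access per ACT:   {single:.0%}")
print(f"2 accesses per ACT: {double:.0%}")
```

With one access per activation the bus can be busy for only 24ns of every 30ns; two accesses per activation already remove the tFAW limit, which is why a small number of page hits is enough.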
( 6 ) 3. Motivation
3.1 Row-buffer locality in Modern Processors
: current WS/server-class designs have a large last-level cache (e.g. IBM POWER7)
-. Temporal locality: results in hits to the large last-level cache
-. Row buffers exploit only spatial locality
-. Using prefetch engines, spatial locality can be predicted
RBL: Row-buffer Locality
( 7 ) 3. Motivation
3.2 Bank and Row Buffer Locality Interplay with Address Mapping
-. DRAM device address: row, column, and bank
-. Workload A: long sequential access seq.; Workload B: single operation
Figure: three scheduling / address mapping cases
a) e.g. FR-FCFS (DRAM all col. bits in low-order real addr.): Workload A has higher priority -> B0 is slow
b) e.g. ATLAS, PAR-BS (DRAM all col. bits in low-order real addr.): Workload B has higher priority -> A4 is slow
c) e.g. Minimalist (DRAM col. & bank bits in low-order real addr.): high BLP (Bank-level Parallelism) -> B0 can be serviced w/o degrading traffic to workload A
( 8 ) 4. Minimalist Open-page Mode
4.1 DRAM Address Mapping Scheme
For sequential access of 4 cache lines
-. The basic difference is that the Column access bits (7-bit) are split in two places
   +. 2 LSB bits are located right after the Block bits
   +. 5 MSB bits are located just before the Row bits
-. (Not shown in the figure) higher order address bits are XOR-ed with the bank bits to produce the actual bank selection bits -> reduces row buffer conflicts [Zhang et al./MICRO'00]
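The split-column mapping above can be sketched as a small address decoder. Only the 2-bit / 5-bit column split comes from the slide. The block offset width, the bank width, the omission of rank/channel bits, and the choice of the low row bits as the XOR source are all assumptions made for illustration, and `map_address` is a hypothetical name.

```python
# Field widths, from LSB to MSB. Only the column split is from the slide.
BLOCK_BITS = 6    # assumption: 64-byte cache block
COL_LO_BITS = 2   # from the slide: 2 column LSBs -> 4 sequential lines per row
BANK_BITS = 3     # assumption: 8 banks, rank/channel bits omitted
COL_HI_BITS = 5   # from the slide: 5 column MSBs, just below the row bits


def _take(addr, bits):
    """Split off the lowest `bits` bits; return (field, remaining address)."""
    return addr & ((1 << bits) - 1), addr >> bits


def map_address(addr):
    """Decode a physical address into (row, bank, column)."""
    _block, addr = _take(addr, BLOCK_BITS)
    col_lo, addr = _take(addr, COL_LO_BITS)
    bank_raw, addr = _take(addr, BANK_BITS)
    col_hi, row = _take(addr, COL_HI_BITS)
    # Assumption: the bank index is hashed with the low row bits (XOR).
    bank = bank_raw ^ (row & ((1 << BANK_BITS) - 1))
    column = (col_hi << COL_LO_BITS) | col_lo
    return row, bank, column


# Eight sequential 64-byte cache lines starting at address 0.
lines = [map_address(i * 64) for i in range(8)]
for i, (row, bank, column) in enumerate(lines):
    print(f"line {i}: row={row} bank={bank} column={column}")
```

Under these assumptions, four consecutive cache lines share one row of one bank (up to four page-mode accesses), and the fifth line moves to the next bank, which keeps row-buffer locality low and bank-level parallelism high.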
( 9 ) 4. Minimalist Open-page Mode
4.2 Data Prefetch Engine [IBM POWER6]
: predictable "page-mode" opportunities -> need for an accurate prefetch engine
: each core includes a HW prefetcher w/ a prefetch depth distance predictor
1) Multi-line Prefetch Requests
-. Multi-line prefetch operation: a single request that indicates a specific seq. of cache lines
-. Reduces command BW and queue resource usage
( 10 ) 4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme
: In OOO execution, the importance of each request can vary both between and within applications -> need for a dynamic priority scheme
1) DRAM Memory Requests Priority Calculation
-. Different priority based on criticality to performance
-. Time-based: the priority of each request is increased every 100ns time interval
-. 2 categories: read (normal) and prefetch -> a read request has higher priority
-. MLP information from the MSHR in each core: many outstanding misses -> each one is less important
-. Distance information from the prefetch engine (4.2)
MLP: Memory Level Parallelism, MSHR: Miss Status Holding Register
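The priority rules above can be sketched as a small scoring function. The slide gives only the inputs (request type, MLP, prefetch distance, age in 100ns steps), so every numeric weight below is a made-up placeholder and `request_priority` is a hypothetical name, not the paper's priority table.

```python
AGE_STEP_NS = 100  # from the slide: priority increases every 100ns


def request_priority(is_prefetch, mlp_outstanding, prefetch_distance, age_ns):
    """Return a priority score; larger means serviced sooner.

    All weights are illustrative assumptions:
    - reads start above prefetches,
    - a read from a core with many outstanding misses (high MLP) is less critical,
    - a prefetch far ahead of the demand stream (large distance) is less critical.
    """
    if is_prefetch:
        base = max(0, 3 - prefetch_distance // 4)
    else:
        base = 4 + max(0, 3 - mlp_outstanding // 4)
    # Time-based aging so that low-priority requests do not starve.
    return base + age_ns // AGE_STEP_NS


isolated_read = request_priority(False, mlp_outstanding=1, prefetch_distance=0, age_ns=0)
high_mlp_read = request_priority(False, mlp_outstanding=12, prefetch_distance=0, age_ns=0)
fresh_prefetch = request_priority(True, mlp_outstanding=0, prefetch_distance=2, age_ns=0)
aged_prefetch = request_priority(True, mlp_outstanding=0, prefetch_distance=2, age_ns=250)
print(isolated_read, high_mlp_read, fresh_prefetch, aged_prefetch)
```

Because the score depends only on per-request information, each memory controller can compute it locally without knowing which thread issued the request.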
( 11 ) 4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme (cont.)
2) DRAM Page Closure (Precharge) Policy
-. Uses autoprecharge, which frees command BW (no separate precharge command is issued)
3) Overall Memory Requests Scheduling Scheme (Priority Rules 1)
-. The same rules are used by all MCs -> no need for communication among MCs
-. If an MC is servicing the multiple transfers from a multi-line prefetch request, it can be interrupted by a higher priority request -> a very critical request can be serviced w/ the smallest latency
4) Handling write operations
-. The dynamic priority scheme does not apply to writes
-. Uses the VWQ (Virtual Write Queue) -> minimal interference from write instructions
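The interruption rule in 3) can be illustrated with a toy slot-by-slot scheduler. This is a sketch under simplifying assumptions: one transfer per slot, no DRAM timing constraints, and fixed priorities. The function `schedule` and its arguments are hypothetical.

```python
def schedule(prefetch_lines, prefetch_priority, arrivals):
    """Return the order in which transfers are serviced.

    prefetch_lines:    cache lines of one multi-line prefetch request
    prefetch_priority: priority shared by all lines of that request
    arrivals:          dict mapping slot index -> (name, priority) of a request
                       that arrives at the start of that slot
    """
    order = []
    pending = []                       # other requests waiting in the queue
    remaining = list(prefetch_lines)   # prefetch transfers not yet issued
    slot = 0
    while remaining or pending or any(s >= slot for s in arrivals):
        if slot in arrivals:
            pending.append(arrivals[slot])
        best = max(pending, key=lambda req: req[1], default=None)
        # A strictly higher priority request interrupts the prefetch burst.
        if best is not None and (not remaining or best[1] > prefetch_priority):
            pending.remove(best)
            order.append(best[0])
        elif remaining:
            order.append(remaining.pop(0))
        slot += 1
    return order


# A critical read "R" arrives while a 4-line prefetch is being transferred.
order = schedule(["P0", "P1", "P2", "P3"], prefetch_priority=2, arrivals={1: ("R", 7)})
print(order)
```

The read is serviced in the slot where it arrives, and the prefetch burst resumes afterwards, so the critical request sees only a one-transfer delay in this simplified model.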
( 12 ) 5. Evaluation
-. 8-core CMP system using the Simics functional model extended w/ the GEMS toolset
-. Simulates DDR3 1333MHz DRAM using the memory controller policy under test for each experiment
-. The Minimalist open-page scheme is compared against three open-page policies (Table 5):
1) PAR-BS (Parallelism-aware Batch Scheduler)
2) ATLAS (Adaptive per-Thread Least-Attained-Service) memory scheduler
3) FR-FCFS (First-Ready, First-Come-First-Served): baseline
( 13 ) 5. Evaluation
5.1 Throughput
-. Overall, "Minimalist Hash+Priority" demonstrated the best throughput improvement over the other schemes, achieving a 10% improvement.
-. This is compared against ATLAS and PAR-BS, which achieved 3.2% and 2.8% throughput improvements over the whole workload suite.
( 14 ) 5. Evaluation
5.2 Fairness
-. Minimalist improves fairness by up to 15%, with overall improvements of 7.5%, 3.4% and 2.5% over FR-FCFS, PAR-BS and ATLAS, respectively.
( 15 ) 5. Evaluation
5.3 Row Buffer Access per Activation
-. The observed page-access rate for the aggressive open-page policies falls significantly short of expectations: a high page-hit rate is simply not possible given the interleaving of requests between the eight executing programs.
-. With the Minimalist scheme, the achieved page-access rate is close to 3.5, compared to the ideal rate of four.
( 16 ) 5. Evaluation
5.4 Target Page-hit Count Sensitivity
-. The Minimalist system requires a target number of page hits to be selected, which indicates the maximum number of page hits the scheme attempts to achieve per row activation.
-. A target of 4 page hits provides the best results (note that a different system configuration may shift the optimal page-mode hit count).
( 17 ) 5. Evaluation
5.5 DRAM Energy Consumption
-. To estimate the power consumption, the Micron power calculator was used.
-. Energy is approximately the same as FR-FCFS: PAR-BS, ATLAS and "Minimalist Hash+Priority" provide a small decrease of approximately 5% in overall energy consumption.
-. The energy results are essentially a balance between the decrease in page-mode hits (resulting in higher DRAM activation power) and the increase in system performance (decreasing runtime).
( 18 ) Conclusions
Minimalist Open-page memory scheduling policy
-. Page-mode gain w/ a small number of page accesses for each page activation
-. Assigns per-request priority using request stream information from MLP and the data prefetch engine
Improving throughput and fairness
-. Throughput increased by 10% on average (compared to FR-FCFS)
-. No need for thread-based priority information
-. No need for communication/coordination among multiple MCs or the OS
( 19 ) Appendix. Detailed simulation information
( 22 ) Thanks,