1 [Paper Review] Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era
Dimitris Kaseridis +, Jeffrey Stuecheli *+, and Lizy Kurian John + (MICRO'11)
+ The University of Texas at Austin, * IBM Corp.
Jinil Chung (정진일) (jinil_chung@korea.ac.kr), Korea University, VLSI Signal Processing Lab.

2 Abstract
DRAM design is a balance between performance, power, and storage density. [IEEE Spectrum]
-. To realize good performance, the memory controller must manage the structural and timing restrictions of the DRAM devices.
-. Use of the "page-mode" feature can mitigate many DRAM constraints.
-. Aggressive page-mode use results in many conflicts (e.g. bank conflicts) when multiple workloads in a many-core system map to the same DRAM.
-. This paper takes a minimalist approach: "just enough" page-mode accesses to get the benefits while avoiding unfairness.
 -> Proposed: address hashing + data prefetch engine + per-request priority

3 1. Introduction
Row buffer (or "page-mode") access

Table: open-page vs. closed-page policy
                                      | Open-page policy                                                | Closed-page policy
Page-mode gain                        | Reducing row access latency                                     | None (single col. access per row activation)
Multiple requests in many-core system | Introducing priority inversion and fairness/starvation problems | Avoiding complexities of row buffer management

This paper proposes a combination of open/closed-page policy based on two observations:
1) Page-mode gain is available with only a small number of page accesses
 -> Propose a fair DRAM address mapping scheme: low RBL & high BLP
2) Page-mode hits come from spatial locality (NOT temporal locality!), which can be captured in prefetch engines
 -> Propose an intuitive criticality-based memory request priority scheme
RBL: Row-buffer Locality; BLP: Bank-level Parallelism

4 2. Background
DRAM timing constraints result in "dead time" before and after each random access
 -> the MC (Memory Controller)'s job is to reduce these performance-limiting gaps using parallelism
1) tRC (row cycle time; ACT-to-ACT @same BK)
 : after the MC activates a page, it must wait for tRC before the next activation @same BK
 : multiple threads accessing different rows @same BK -> latency overhead (tRC delay)
2) tRP (row precharge time; PRE-to-ACT @same BK)
 : in open-page policy, when the MC activates another page it pays the tRP penalty @same BK (= the current page is closed before the new page is opened)
[Figure: ACT -> PRE -> ACT timeline @same bank; tRAS (e.g. 36ns) from ACT to PRE, tRP (e.g. 12ns) from PRE to ACT, tRC (e.g. 48ns) from ACT to ACT]
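The same-bank timing rules on this slide can be sketched in a few lines of Python. This is only an illustration using the example values from the figure (tRAS = 36ns, tRP = 12ns, tRC = 48ns); the function name `earliest_next_activate` is hypothetical and not taken from the paper.

```python
# Example timing values from the slide's figure (nanoseconds).
T_RAS_NS = 36                   # ACT-to-PRE, same bank
T_RP_NS = 12                    # PRE-to-ACT, same bank
T_RC_NS = T_RAS_NS + T_RP_NS    # ACT-to-ACT, same bank (48 ns)


def earliest_next_activate(last_act_ns, precharge_ns):
    """Earliest time a new ACT may be issued to the same bank.

    Both constraints must hold: tRC since the previous ACT and
    tRP since the PRE that closed the previous row.
    """
    return max(last_act_ns + T_RC_NS, precharge_ns + T_RP_NS)


# Row closed as early as allowed (PRE at tRAS): tRC is the limit.
print(earliest_next_activate(last_act_ns=0, precharge_ns=36))
# Row held open longer (PRE at 50 ns): tRP after the late PRE is the limit.
print(earliest_next_activate(last_act_ns=0, precharge_ns=50))
```

The second call shows the open-page penalty described in item 2): a request to a different row has to wait for the precharge before its own activation can start.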

5 3. Motivation
Use of "page-mode" ...
1) Latency Effects: due to tRC & tRP, overall latency increases -> small # of accesses?
2) Power Reduction: only the activate power is reduced -> small # of accesses is enough
3) Bank Utilization: drops off quickly as the # of accesses per activation increases -> small # of accesses is enough
 : If bank utilization (B/U) is high, the probability that a new request will conflict w/ a busy bank is greater.
4) Other DRAM complexities: a small # of accesses is needed to soften restrictions
 ex) tFAW (Four Activate Window; 30ns), cache block transfer delay = 6ns
 -. single access per ACT: peak utilization is limited (6ns*4/30ns = 80%)
 -. two or more accesses per ACT: peak utilization is not limited (12ns*4/30ns > 100%)
[Figure: bank utilization; surviving labels: "Closed-page policy", "16%", "62%"]
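The tFAW arithmetic in item 4) can be reproduced with a small helper. This is a sketch only: it uses the 6ns per-access transfer time and the 30ns window that appear in the slide's own calculation, and the function name `peak_utilization` is made up for illustration.

```python
def peak_utilization(accesses_per_act, t_burst_ns=6.0, t_faw_ns=30.0,
                     acts_per_window=4):
    """Data-bus time demanded per tFAW window, as a fraction of the window.

    At most `acts_per_window` activations fit in one tFAW window, and each
    activation is followed by `accesses_per_act` column accesses that occupy
    the data bus for `t_burst_ns` each. A result below 1.0 means tFAW caps
    the achievable bus utilization; 1.0 or more means it does not.
    """
    return accesses_per_act * t_burst_ns * acts_per_window / t_faw_ns


for n in (1, 2, 4):
    ratio = peak_utilization(n)
    limited = "limited by tFAW" if ratio < 1.0 else "not limited by tFAW"
    print(f"{n} access(es) per ACT: {ratio:.0%} ({limited})")
```

With one access per activation the bus can be busy at most 80% of the time, which is why even two accesses per activation are enough to remove this restriction.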

6 3. Motivation
3.1 Row-buffer Locality in Modern Processors
: current workstation/server class designs have a large last-level cache (e.g. IBM POWER7)
-. Temporal locality results in hits to the large last-level cache.
-. Row buffers therefore exploit only spatial locality.
-. Prefetch engines can predict this spatial locality.
RBL: Row-buffer Locality

7 3. Motivation
3.2 Bank and Row Buffer Locality Interplay with Address Mapping
-. DRAM device address: row, column, and bank
-. Workload A: long sequential access sequence; Workload B: single operation
[Figure: three scheduling scenarios]
1) e.g. FR-FCFS (DRAM all col. -> low order real addr.): Workload A has higher priority -> slows B0
2) e.g. ATLAS, PAR-BS (DRAM all col. -> low order real addr.): Workload B has higher priority -> slows A4
3) e.g. Minimalist (DRAM col. & bank -> low order real addr.): high BLP (Bank-level Parallelism) -> B0 can be serviced w/o degrading traffic to workload A

8 4. Minimalist Open-page Mode
4.1 DRAM Address Mapping Scheme
[Figure: address bit layouts; field widths shown: 7-bit, 5-bit, 2-bit]
-. Designed for sequential access of 4 cache lines.
-. The basic difference is that the column access bits are split in two places:
 +. the 2 LSBs are located right after the block bits
 +. the 5 MSBs are located just before the row bits
-. (Not shown in the figure) Higher-order address bits are XOR-ed with the bank bits to produce the actual bank selection bits -> reduces row buffer conflicts [Zhang et al./MICRO'00]
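A minimal sketch of such a mapping is given below. Only the 2-bit/5-bit column split and the XOR with higher-order bits come from the slide; the 64-byte block size, the 3 bank bits, the omission of rank and channel bits, and the choice of the low row bits as the XOR source are assumptions made for illustration, and `map_address` is a hypothetical name.

```python
BLOCK_BITS = 6    # assumption: 64-byte cache lines
COL_LO_BITS = 2   # from the slide: 2 column LSBs right after the block bits
BANK_BITS = 3     # assumption: 8 banks, no rank/channel bits modeled
COL_HI_BITS = 5   # from the slide: 5 column MSBs just before the row bits


def map_address(addr):
    """Split a physical address into (row, bank, column), LSB to MSB:
    block offset | col_lo | bank | col_hi | row."""
    a = addr >> BLOCK_BITS
    col_lo = a & ((1 << COL_LO_BITS) - 1)
    a >>= COL_LO_BITS
    bank_raw = a & ((1 << BANK_BITS) - 1)
    a >>= BANK_BITS
    col_hi = a & ((1 << COL_HI_BITS) - 1)
    a >>= COL_HI_BITS
    row = a
    # XOR the bank bits with higher-order bits (here: the low row bits,
    # an assumption) to spread conflicting rows over different banks.
    bank = bank_raw ^ (row & ((1 << BANK_BITS) - 1))
    column = (col_hi << COL_LO_BITS) | col_lo
    return {"row": row, "bank": bank, "column": column}


# Five consecutive cache lines: the first four share one row of one bank,
# the fifth moves on to the next bank.
for line in range(5):
    print(line, map_address(line * 64))
```

With this layout a sequential stream gets at most four page-mode accesses per activation before it moves to another bank, which is the "low RBL & high BLP" behavior the scheme aims for.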

9 4. Minimalist Open-page Mode
4.2 Data Prefetch Engine [IBM POWER6]
: predictable "page-mode" opportunities -> need for an accurate prefetch engine
: each core includes a HW prefetcher w/ a prefetch depth distance predictor
1) Multi-line Prefetch Requests
-. Multi-line prefetch operation: a single request indicates a specific sequence of cache lines
-. Reduces command BW and queue resource usage
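The idea of a multi-line prefetch request can be illustrated with a small data structure. The slide does not describe the actual request encoding, so the class `MultiLinePrefetch` and its fields are hypothetical; the sketch only shows how one queue entry can stand for several sequential cache lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiLinePrefetch:
    """One request that names a run of sequential cache lines (hypothetical)."""
    base_line: int   # index of the first cache line
    num_lines: int   # how many sequential lines to fetch

    def lines(self):
        """Expand the request into the individual cache-line indices."""
        return [self.base_line + i for i in range(self.num_lines)]


request = MultiLinePrefetch(base_line=100, num_lines=4)
# One command and one queue entry cover four transfers.
print(len(request.lines()), "lines from a single request:", request.lines())
```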

10 4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme
: In OOO execution, the importance of each request can vary both between and within applications -> need for a dynamic priority scheme
1) DRAM Memory Requests Priority Calculation
-. Requests get different priorities based on their criticality to performance.
-. The priority of each request is increased every 100ns time interval -> time-based
-. 2 categories: read (normal) and prefetch -> read requests have higher priority
-. MLP information from the MSHR in each core: many misses -> each one is less important
-. Distance information from the prefetch engine (4.2)
MLP: Memory Level Parallelism; MSHR: Miss Status Holding Register
[Table: request priorities; only the header "Read request" survives]
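The rules above can be sketched as a small priority function. The slide gives the inputs (request type, MLP, prefetch distance, 100ns aging) but not the numeric encoding, so the scale below is invented for illustration. The assumption that prefetches farther ahead are less critical is also not stated on the slide, and `request_priority` is a hypothetical name.

```python
AGING_INTERVAL_NS = 100  # from the slide: priority rises every 100 ns


def request_priority(kind, age_ns, mlp=1, distance=0):
    """Return a priority value; larger means more urgent.

    kind     -- "read" or "prefetch"
    age_ns   -- time the request has waited in the queue
    mlp      -- outstanding misses in the issuing core's MSHR (reads)
    distance -- how far ahead the prefetch runs (prefetches)
    """
    if kind == "read":
        # Hypothetical scale 7..4: many concurrent misses -> less critical.
        base = 8 - min(mlp, 4)
    elif kind == "prefetch":
        # Hypothetical scale 3..0, always below a read of the same age.
        base = 3 - min(distance // 4, 3)
    else:
        raise ValueError(f"unknown request kind: {kind}")
    return base + age_ns // AGING_INTERVAL_NS


print(request_priority("read", age_ns=0, mlp=1))           # isolated miss
print(request_priority("read", age_ns=0, mlp=8))           # one of many misses
print(request_priority("prefetch", age_ns=0, distance=8))  # far-ahead prefetch
print(request_priority("prefetch", age_ns=500, distance=8))  # same, after aging
```

The aging term lets an old prefetch eventually overtake newer reads, which keeps low-priority requests from starving.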

11 4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme (cont.)
2) DRAM Page Closure (Precharge) Policy
-. Using autoprecharge -> frees up command BW (no explicit precharge command is needed)
3) Overall Memory Requests Scheduling Scheme (Priority Rules 1)
-. The same rules are used by all MCs -> no need for communication among MCs
-. If an MC is servicing the multiple transfers of a multi-line prefetch request, it can be interrupted by a higher priority request -> a very critical request can be serviced w/ the smallest latency
4) Handling write operations
-. The dynamic priority scheme does not apply to writes
-. Using the VWQ (Virtual Write Queue) -> writes cause minimal interference

12 5. Evaluation
-. 8-core CMP system using the Simics functional model extended w/ the GEMS toolset
-. Simulates DDR3 1333MHz DRAM, using the corresponding memory controller policy for each experiment
-. The Minimalist open-page scheme is compared against three open-page policies (Table 5):
1) PAR-BS (Parallelism-aware Batch Scheduler)
2) ATLAS (Adaptive per-Thread Least-Attained-Service) memory scheduler
3) FR-FCFS (First-Ready, First-Come-First-Served): baseline

13 5. Evaluation
5.1 Throughput
-. Overall, "Minimalist Hash+Priority" demonstrated the best throughput improvement over the other schemes, achieving a 10% improvement.
-. By comparison, ATLAS and PAR-BS achieved 3.2% and 2.8% throughput improvements over the whole workload suite.

14 5. Evaluation
5.2 Fairness
-. Minimalist improves fairness by up to 15%, with an overall improvement of 7.5%, 3.4% and 2.5% over FR-FCFS, PAR-BS and ATLAS, respectively.

15 5. Evaluation
5.3 Row Buffer Access per Activation
-. The observed page-access rate for the aggressive open-page policies falls significantly short -> a high page hit rate is simply not possible given the interleaving of requests between the eight executing programs.
-. With the Minimalist scheme, the achieved page-access rate is close to 3.5, compared to the ideal rate of four.

16 5. Evaluation
5.4 Target Page-hit Count Sensitivity
-. The Minimalist system requires a target number of page hits to be selected, which indicates the maximum number of page hits the scheme attempts to achieve per row activation.
-. A target of 4 page hits provides the best results. (Note that a different system configuration may shift the optimal page-mode hit count.)

17 5. Evaluation
5.5 DRAM Energy Consumption
-. To estimate the power consumption, the Micron power calculator was used.
-. Energy is approximately the same as FR-FCFS: PAR-BS, ATLAS and "Minimalist Hash+Priority" provide a small decrease of approximately 5% in overall energy consumption.
-. The energy results are essentially a balance between the decrease in page-mode hits (resulting in higher DRAM activation power) and the increase in system performance (decreasing runtime).

18 Conclusions
Minimalist Open-page memory scheduling policy
-. Page-mode gain w/ a small number of page accesses for each page activation
-. Assigns per-request priority using request stream information from MLP and the data prefetch engine
Improving throughput and fairness
-. Throughput increased by 10% on average (compared to FR-FCFS)
-. No need for thread-based priority information
-. No need for communication/coordination among multiple MCs or the OS

19 Appendix. Detailed simulation information

20 Appendix. Detailed simulation information

21 Appendix. Detailed simulation information

22 Thanks

