High Performance Computing & Systems Lab
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H. Loh, Onur Mutlu
Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA ’12), June 2012

Outline
• Motivation: CPU-GPU Systems
• Goal
• Observations
• Staged Memory Scheduling
  1) Batch Formation
  2) Batch Scheduler
  3) DRAM Command Scheduler
• Results
• Conclusion

Memory Scheduling for CPU-GPU Systems
• Current and future systems integrate a GPU along with multiple CPU cores
• The GPU shares main memory with the CPU cores
• The GPU is much more (4x-20x) memory-intensive than the CPU

Main Memory is a Bottleneck
• All cores contend for limited off-chip bandwidth
• Inter-application interference degrades system performance
• The memory scheduler can help mitigate the problem
[Figure: four cores issue requests into a shared memory request buffer; the memory scheduler selects requests to send to DRAM]

Introducing the GPU into the System
• The GPU occupies a significant portion of the request buffers
• This limits the memory controller's visibility of the CPU applications' differing memory behavior → can lead to poor scheduling decisions
[Figure: GPU requests crowd out CPU requests in the shared memory request buffer]

Naïve Solution: Large Monolithic Buffer
[Figure: a single large request buffer holding both CPU and GPU requests in front of the memory scheduler]

Problems with a Large Monolithic Buffer
• A large buffer requires more complicated logic to:
  – Analyze memory requests (e.g., determine row buffer hits)
  – Analyze application characteristics
  – Assign and enforce priorities
• This leads to high complexity, high power, and large die area

Goal
• Design a new memory scheduler that is:
  – Scalable to accommodate a large number of requests
  – Easy to implement
  – Application-aware
  – Able to provide high performance and fairness, especially in heterogeneous CPU-GPU systems

Outline
• Motivation: CPU-GPU Systems
• Goal
• Observations
• Staged Memory Scheduling
  1) Batch Formation
  2) Batch Scheduler
  3) DRAM Command Scheduler
• Results
• Conclusion

Key Functions of a Memory Controller
• A memory controller must consider three different things concurrently when choosing the next request:
  1) Maximize row buffer hits → maximize memory bandwidth
  2) Manage contention between applications → maximize system throughput and fairness
  3) Satisfy DRAM timing constraints
• Current systems use a centralized memory controller design to accomplish these functions
  – Complex, especially with large request buffers

Key Idea: Decouple Tasks into Stages
• Idea: decouple the functional tasks of the memory controller
  – Partition tasks across several simpler HW structures (stages)
• 1) Maximize row buffer hits → Stage 1: batch formation
  – Within each application, groups requests to the same row into batches
• 2) Manage contention between applications → Stage 2: batch scheduler
  – Schedules batches from different applications
• 3) Satisfy DRAM timing constraints → Stage 3: DRAM command scheduler
  – Issues requests from the already-scheduled order to each bank

High Performance Computing & Systems Lab Outline  Motivation: CPU-GPU Systems  Goal  Observations  Staged Memory Scheduling 1) Batch Formation 2) Batch Scheduler 3) DRAM Command Scheduler  Results  Conclusion 12

SMS: Staged Memory Scheduling
[Figure: the monolithic scheduler is replaced by three stages: per-source batch formation (Stage 1), a batch scheduler (Stage 2), and a per-bank DRAM command scheduler (Stage 3) feeding DRAM Banks 1-4]

SMS: Staged Memory Scheduling
[Figure: SMS overview, highlighting Stage 1 (batch formation)]

Stage 1: Batch Formation
• Goal: maximize row buffer hits
  – At each core, we want to batch requests that access the same row within a limited time window
• A batch is ready to be scheduled under two conditions:
  1) When the next request accesses a different row
  2) When the time window for batch formation expires
• Keep this stage simple by using per-core FIFOs
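To make the two batch-ready conditions concrete, here is a minimal Python sketch of one core's batch formation FIFO. It is an illustration under assumptions, not the paper's implementation: the request representation (a row id plus arrival cycle) and the 200-cycle window are invented for the example.

```python
from collections import deque

class BatchFormer:
    """One core's Stage 1 logic: group same-row requests into batches."""

    def __init__(self, window=200):
        self.window = window     # batch formation time window, in cycles (assumed value)
        self.forming = deque()   # (row, arrival_cycle) of the batch being formed
        self.ready = deque()     # closed batches waiting for Stage 2

    def _close(self):
        if self.forming:
            self.ready.append(list(self.forming))
            self.forming.clear()

    def insert(self, row, now):
        # Condition 1: the next request accesses a different row.
        if self.forming and row != self.forming[-1][0]:
            self._close()
        self.forming.append((row, now))

    def tick(self, now):
        # Condition 2: the time window for batch formation expires.
        if self.forming and now - self.forming[0][1] >= self.window:
            self._close()
```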

Stage 1: Batch Formation Example
[Figure: per-core FIFOs forming batches over rows A-F; a batch closes either when the next request goes to a different row or when the time window expires, and closed batches cross the batch boundary to Stage 2 (batch scheduling)]

SMS: Staged Memory Scheduling
[Figure: SMS overview, highlighting Stage 2 (batch scheduler)]

Stage 2: Batch Scheduler
• Goal: minimize interference between applications
  – Stage 1 forms batches within each application
  – Stage 2 schedules batches from different applications
    – Schedules the oldest batch from each application
• Question: which application's batch should be scheduled next?
• Goal: maximize system performance and fairness
  – To achieve this goal, the batch scheduler chooses between two different policies

Stage 2: Two Batch Scheduling Algorithms
• Shortest Job First (SJF)
  – Prioritizes the applications with the fewest outstanding memory requests because they make fast forward progress
  – Pro: good system performance and fairness
  – Con: GPU and memory-intensive applications get deprioritized
• Round-Robin (RR)
  – Prioritizes the applications in a round-robin manner to ensure that memory-intensive applications can make progress
  – Pro: GPU and memory-intensive applications are treated fairly
  – Con: GPU and memory-intensive applications significantly slow down others

Stage 2: Batch Scheduling Policy
• The importance of the GPU varies between systems and over time
  – The scheduling policy needs to adapt to this
• Solution: hybrid policy
  – At every cycle:
    – With probability p: Shortest Job First → benefits the CPU
    – With probability 1−p: Round-Robin → benefits the GPU
• System software can configure p based on the importance/weight of the GPU
  – Higher GPU importance → lower p value
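The per-cycle policy choice can be sketched as below. The probabilistic SJF/RR choice follows the slide; the `has_batch` predicate and the `outstanding` per-application request counts are assumed bookkeeping added for the example.

```python
import random

class BatchScheduler:
    """Stage 2 sketch: choose whose oldest batch goes next (SJF vs. RR)."""

    def __init__(self, apps, p=0.9):
        self.apps = list(apps)   # application ids: CPU cores plus the GPU
        self.p = p               # SJF probability, set by system software
        self.rr_ptr = 0          # round-robin position

    def pick(self, has_batch, outstanding):
        # Candidates are applications that currently have a ready batch.
        ready = [a for a in self.apps if has_batch(a)]
        if not ready:
            return None
        if random.random() < self.p:
            # Shortest Job First: fewest outstanding requests first,
            # which favors latency-sensitive (mostly CPU) applications.
            return min(ready, key=lambda a: outstanding[a])
        # Round-robin: guarantees forward progress for memory-intensive
        # applications and the GPU.
        while True:
            a = self.apps[self.rr_ptr]
            self.rr_ptr = (self.rr_ptr + 1) % len(self.apps)
            if a in ready:
                return a
```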

SMS: Staged Memory Scheduling
[Figure: SMS overview, highlighting Stage 3 (DRAM command scheduler)]

Stage 3: DRAM Command Scheduler
• High-level policy decisions have already been made:
  – Stage 1 maintains row buffer locality
  – Stage 2 minimizes inter-application interference
• Stage 3: no need for further scheduling
  – Only goal: service requests while satisfying DRAM timing constraints
• Implemented as simple per-bank FIFO queues
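A minimal sketch of this stage, assuming the DRAM timing checks are abstracted behind a `timing_ok` callback (a stand-in for the real tRCD/tRP/tRAS-style constraint logic, which the slide does not spell out):

```python
from collections import deque

class DRAMCommandScheduler:
    """Stage 3 sketch: strictly in-order, per-bank FIFO issue."""

    def __init__(self, num_banks=8):
        self.fifos = [deque() for _ in range(num_banks)]

    def enqueue(self, bank, request):
        # Stage 2 already fixed the order; just append per bank.
        self.fifos[bank].append(request)

    def issue(self, now, timing_ok):
        # Issue the head request of every bank whose DRAM timing
        # constraints are satisfied this cycle.
        issued = []
        for bank, fifo in enumerate(self.fifos):
            if fifo and timing_ok(bank, fifo[0], now):
                issued.append(fifo.popleft())
        return issued
```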

Putting Everything Together
[Figure: full SMS pipeline; per-source batch formation FIFOs (Cores 1-4 and the GPU) feed the Stage 2 batch scheduler, whose current batch scheduling policy alternates between SJF and RR, and which dispatches into the Stage 3 per-bank FIFOs (Banks 1-4)]
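As a usage sketch, the three stage classes from the previous slides can be wired into a single scheduling step. Everything here (the application list, the `bank_of` address-to-bank mapping, and the `outstanding`/`timing_ok` bookkeeping) is illustrative glue, not from the paper.

```python
apps = ["cpu0", "cpu1", "cpu2", "cpu3", "gpu"]
formers = {a: BatchFormer() for a in apps}
batch_sched = BatchScheduler(apps, p=0.9)   # CPU-focused setting
cmd_sched = DRAMCommandScheduler(num_banks=8)

def step(now, outstanding, bank_of, timing_ok):
    # Stage 1: close any batches whose formation window has expired.
    for f in formers.values():
        f.tick(now)
    # Stage 2: pick one application and move its oldest batch to Stage 3.
    winner = batch_sched.pick(lambda a: bool(formers[a].ready), outstanding)
    if winner is not None:
        for row, _ in formers[winner].ready.popleft():
            cmd_sched.enqueue(bank_of(row), (winner, row))
    # Stage 3: issue whatever satisfies DRAM timing this cycle.
    return cmd_sched.issue(now, timing_ok)
```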

Complexity
• Compared to a row-hit-first scheduler, SMS consumes*
  – 66% less area
  – 46% less static power
• The reduction comes from:
  – Monolithic scheduler → stages of simpler schedulers
  – Each stage has a simpler scheduler (considers fewer properties at a time to make the scheduling decision)
  – Each stage has simpler buffers (FIFO instead of out-of-order)
  – Each stage has a portion of the total buffer size (buffering is distributed across stages)
* Based on a Verilog model using a 180nm library

Outline
• Motivation: CPU-GPU Systems
• Goal
• Observations
• Staged Memory Scheduling
  1) Batch Formation
  2) Batch Scheduler
  3) DRAM Command Scheduler
• Results
• Conclusion

Methodology
• Simulation parameters:
  – 16 OoO CPU cores, 1 GPU modeling an AMD Radeon™ 5870
  – DDR DRAM: 4 channels, 1 rank/channel, 8 banks/channel
• Workloads:
  – CPU: SPEC CPU 2006
  – GPU: recent games and GPU benchmarks
  – 7 workload categories based on the memory intensity of the CPU applications
    – Low memory intensity (L)
    – Medium memory intensity (M)
    – High memory intensity (H)

Comparison to Previous Scheduling Algorithms
• FR-FCFS
  – Prioritizes row buffer hits → maximizes DRAM throughput
  – Low multi-core performance → application-unaware
• ATLAS
  – Prioritizes latency-sensitive applications → good multi-core performance
  – Low fairness → deprioritizes memory-intensive applications
• TCM
  – Clusters low- and high-intensity applications and treats each cluster separately → good multi-core performance and fairness
  – Not robust → misclassifies latency-sensitive applications

Evaluation Metrics
• CPU performance metric: weighted speedup
• GPU performance metric: frame rate speedup
• CPU-GPU system performance: CPU-GPU weighted speedup
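For reference, the weighted speedup over the CPU applications is the standard definition below. The combined CPU-GPU form is written here under the assumption that the GPU's frame rate speedup is folded in with the GPU's configured weight w (consistent with the weight parameter used in the scenarios that follow; the exact combination is not spelled out on the slide).

```latex
% Weighted speedup over the n CPU applications (standard definition):
WS_{\mathrm{CPU}} = \sum_{i=1}^{n} \frac{IPC_i^{\text{shared}}}{IPC_i^{\text{alone}}}

% Assumed combined system metric, with GPU weight w and the GPU's
% frame rate speedup measured the same way (shared vs. running alone):
WS_{\text{CPU-GPU}} = WS_{\mathrm{CPU}} + w \cdot \frac{FrameRate_{\text{shared}}}{FrameRate_{\text{alone}}}
```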

Evaluated System Scenarios
• CPU-focused system
• GPU-focused system

Evaluated System Scenario: CPU-Focused
• The GPU has low weight (weight = 1)
• Configure SMS so that p, the SJF probability, is set to 0.9
  – Mostly uses SJF batch scheduling → prioritizes latency-sensitive applications (mainly CPU)

Performance: CPU-Focused System
• The SJF batch scheduling policy allows latency-sensitive applications to get serviced as fast as possible
[Chart: system performance across workload categories, p = 0.9]

Evaluated System Scenario: GPU-Focused
• The GPU has high weight (weight = 1000)
• Configure SMS so that p, the SJF probability, is set to 0
  – Always uses round-robin batch scheduling → prioritizes memory-intensive applications (GPU)

Performance: GPU-Focused System
• The round-robin batch scheduling policy schedules GPU requests more frequently
[Chart: system performance across workload categories, p = 0]

Additional Results in the Paper
• Fairness evaluation
  – 47.6% improvement over the best previous algorithms
• Individual CPU and GPU performance breakdowns
• CPU-only scenarios
  – Competitive performance with previous algorithms
• Scalability results
  – SMS' performance and fairness scale better than previous algorithms as the number of cores and memory channels increases
• Analysis of SMS design parameters

Outline
• Motivation: CPU-GPU Systems
• Goal
• Observations
• Staged Memory Scheduling
  1) Batch Formation
  2) Batch Scheduler
  3) DRAM Command Scheduler
• Results
• Conclusion

Conclusion
• Observation: heterogeneous CPU-GPU systems require memory schedulers with large request buffers
• Problem: existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes
• Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages:
  1) Batch formation: maintains row buffer locality
  2) Batch scheduler: reduces interference between applications
  3) DRAM command scheduler: issues requests to DRAM
• Compared to state-of-the-art memory schedulers:
  – SMS is significantly simpler and more scalable
  – SMS provides higher performance and fairness

Thank you