1 Fair Queuing Memory Systems Kyle Nesbit, Nidhi Aggarwal, Jim Laudon *, and Jim Smith University of Wisconsin – Madison Department of Electrical and Computer Engineering.


1 Fair Queuing Memory Systems Kyle Nesbit, Nidhi Aggarwal, Jim Laudon *, and Jim Smith University of Wisconsin – Madison Department of Electrical and Computer Engineering Sun Microsystems*

2 Motivation: Multicore Systems
Significant memory bandwidth limitations
- Bandwidth-constrained operating points will occur more often in the future
Systems must perform well at bandwidth-constrained operating points
- Must respond in a predictable manner

3 Bandwidth Interference
Desktops
- Soft real-time constraints
Servers
- Fair sharing / billing
Decreases overall throughput
Figure: IPC of vpr when run alongside crafty vs. alongside art.

4 Solution
A memory scheduler based on
- First-Ready FCFS memory scheduling
- Network fair queuing (FQ)
System software allocates memory system bandwidth to individual threads
The proposed FQ memory scheduler
1. Offers threads their allocated bandwidth
2. Distributes excess bandwidth fairly

5 Background
- Memory Basics
- Memory Controllers
- First-Ready FCFS Memory Scheduling
- Network Fair Queuing

6 Background Memory Basics

7 Micron DDR2-800 timing constraints (measured in DRAM address bus cycles)
t_RCD   Activate to read                          5 cycles
t_CL    Read to data bus valid                    5 cycles
t_WL    Write to data bus valid                   4 cycles
t_CCD   CAS to CAS (CAS is a read or a write)     2 cycles
t_WTR   Write to read                             3 cycles
t_WR    Internal write to precharge               6 cycles
t_RTP   Internal read to precharge                3 cycles
t_RP    Precharge to activate                     5 cycles
t_RRD   Activate to activate (different banks)    3 cycles
t_RAS   Activate to precharge                     18 cycles
t_RC    Activate to activate (same bank)          22 cycles
BL/2    Burst length (cache line size / 64 bits)  4 cycles
t_RFC   Refresh to activate                       51 cycles
t_RFC   Max refresh to refresh                    28,000 cycles
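As a rough illustration of how a few of these constraints compose, here is a minimal Python sketch of the minimum read latency for a row hit versus a row miss; the names, the dictionary, and the structure are illustrative, not the paper's hardware.

```python
# Selected Micron DDR2-800 parameters from the table above,
# in DRAM address bus cycles (illustrative sketch).
T = {
    "tRCD": 5,  # activate to read
    "tCL": 5,   # read to data bus valid
    "tRP": 5,   # precharge to activate
    "BL2": 4,   # burst length / 2 (data transfer cycles)
}

def read_latency(row_hit: bool, row_open: bool = True) -> int:
    """Cycles from the first command until the last data beat.

    row_hit:  the target row is already open in the bank's row buffer.
    row_open: a different row is open, so a precharge is needed first
              (only consulted on a row miss).
    """
    if row_hit:
        return T["tCL"] + T["BL2"]                 # CAS only
    prep = T["tRP"] if row_open else 0             # close the open row
    return prep + T["tRCD"] + T["tCL"] + T["BL2"]  # precharge + RAS + CAS
```

A row hit costs 9 cycles here versus 19 for a row-buffer conflict, which is why memory schedulers work hard to exploit row locality.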

8 Background: Memory Controller
Diagram: a CMP with two processors (each with L1 caches and a private L2 cache) sharing an on-chip memory controller, which connects across the chip boundary to SDRAM.

9 Background: Memory Controller
Translates memory requests into SDRAM commands
- Activate, Read, Write, and Precharge
Tracks SDRAM timing constraints
- E.g., activate latency t_RCD and CAS latency t_CL
Buffers and reorders requests in order to improve memory system throughput

10 Background: Memory Scheduler

11 Background: FR-FCFS Memory Scheduler
First-Ready FCFS prioritizes commands in the following order:
1. Ready commands over commands that are not ready
2. CAS commands over RAS commands
3. Earliest arrival time
"Ready" is with respect to the SDRAM timing constraints
FR-FCFS is a good general-purpose scheduling policy [Rixner 2004]
- But it has multithreaded issues
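The three-level priority can be expressed as a sort key. This is an illustrative Python sketch, not the hardware's implementation; `Cmd` and its fields are assumed names:

```python
from dataclasses import dataclass

@dataclass
class Cmd:
    ready: bool    # satisfies all SDRAM timing constraints right now
    is_cas: bool   # column (read/write) command rather than row command
    arrival: int   # arrival time of the owning memory request

def fr_fcfs_key(cmd: Cmd):
    # Python compares tuples left to right, and False sorts before True,
    # so this key encodes: ready first, then CAS over RAS, then oldest.
    return (not cmd.ready, not cmd.is_cas, cmd.arrival)

def fr_fcfs_select(cmds):
    return min(cmds, key=fr_fcfs_key)
```

For example, a ready CAS command that arrived later is still chosen ahead of an older ready RAS command.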

12 Example: Two Threads
Diagram: Thread 1 exhibits bursty MLP and is bandwidth constrained, issuing misses a1–a4 and then a5–a8 in overlapping bursts. Thread 2 is latency sensitive, alternating computation with isolated misses a1–a5, so each memory latency directly delays its next computation phase.

13 First Come First Serve
Diagram: under first-come-first-serve, Thread 1's burst a1–a4 (then a5–a8) occupies the shared memory system ahead of Thread 2's isolated misses a1 and a2, delaying the latency-sensitive thread.

14 Background: Network Fair Queuing
Network fair queuing (FQ) provides QoS in communication networks
- Network flows are allocated bandwidth on each network link along the flow's path
Routers use FQ algorithms to offer flows their allocated bandwidth
- Minimum bandwidth bounds end-to-end communication delay through the network
We leverage FQ theory to provide QoS in memory systems

15 Background: Virtual Finish-Time Algorithm
The kth packet on flow i is denoted p_i^k
p_i^k virtual start-time:  S_i^k = max { a_i^k, F_i^(k-1) }
p_i^k virtual finish-time: F_i^k = S_i^k + L_i^k / φ_i
φ_i is flow i's share of the network link
A virtual clock determines the arrival time a_i^k
- The virtual clock algorithm determines the fairness policy
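The start/finish recurrence can be sketched for a single flow; this is illustrative Python, with `phi` standing in for the flow's share φ_i:

```python
def virtual_times(arrivals, lengths, phi):
    """Per-packet virtual times for one flow:
    S^k = max(a^k, F^(k-1)),  F^k = S^k + L^k / phi."""
    finish_prev = 0.0
    times = []
    for a, length in zip(arrivals, lengths):
        start = max(a, finish_prev)    # cannot start before the previous finish
        finish = start + length / phi  # transmission time dilated by 1/phi
        times.append((start, finish))
        finish_prev = finish
    return times
```

With phi = 0.5, back-to-back packets of length 2 arriving at times 0 and 1 get virtual finish-times 4 and 8: each transmission is dilated by 1/phi, as if the flow owned half the link.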

16 Quality of Service
Each thread is allocated a fraction φ_i of the memory system bandwidth
- Desktop: soft real-time applications
- Server: differentiated service and billing
The proposed FQ memory scheduler
1. Offers threads their allocated bandwidth, regardless of the load on the memory system
2. Distributes excess bandwidth according to the FQ memory scheduler's fairness policy

17 Quality of Service
Minimum bandwidth ⇒ QoS
- A thread allocated a fraction φ_i of the memory system bandwidth will perform as well as the same thread on a private memory system operating at φ_i of the frequency

18 Fair Queuing Memory Scheduler
A virtual time memory system (VTMS) is used to calculate memory request deadlines
- Request deadlines are virtual finish-times
The FQ scheduler selects
1. the first-ready pending request
2. with the earliest deadline first (EDF)
Diagram: per-thread request streams (Thread 1 through Thread m) feed a deadline / finish-time algorithm with one VTMS per thread; requests wait in a transaction buffer, from which the FQ scheduler issues to SDRAM.
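The selection rule can be sketched as first-ready EDF. This is illustrative Python; the real scheduler operates on SDRAM commands in a transaction buffer, and `Request` and its fields are assumed names:

```python
from dataclasses import dataclass

@dataclass
class Request:
    ready: bool      # next SDRAM command satisfies all timing constraints
    deadline: float  # virtual finish-time computed by the thread's VTMS

def fq_select(pending):
    """First-ready pending request with the earliest deadline (EDF)."""
    ready = [r for r in pending if r.ready]
    return min(ready, key=lambda r: r.deadline) if ready else None
```

Requests that are not yet ready are skipped entirely; among ready requests, the earliest virtual finish-time wins regardless of arrival order.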

19 Fair Queuing Memory Scheduler
Diagram: under FQ scheduling, deadlines are assigned in virtual time, with each thread's memory latencies dilated by the reciprocal of its share φ_i; Thread 2's isolated misses a1–a4 are interleaved with Thread 1's burst a1–a8 instead of waiting behind it.

20 Virtual Time Memory System
Each thread has its own VTMS to model its private memory system
VTMS consists of multiple resources
- Banks and channels
In hardware, a VTMS consists of one register for each memory bank and channel resource
- A VTMS register holds the virtual time at which the virtual resource will be ready to start the next request

21 Virtual Time Memory System
A request's deadline is its virtual finish-time
- The time the request would finish if the request's thread were running on a private memory system operating at φ_i of the frequency
A VTMS model captures fundamental SDRAM timing characteristics
- It abstracts away some details in order to apply network FQ theory

22 Priority Inversion
First-ready scheduling is required to improve bandwidth utilization
Low-priority ready commands can block higher-priority (earlier virtual finish-time) commands
Most priority inversion blocking occurs at active banks, e.g., during a sequence of row hits

23 Bounding Priority Inversion Blocking Time
1. While a bank is inactive, and for up to t_RAS cycles after it has been activated, prioritize requests first-ready, earliest-virtual-finish-time first (FR-VFTF)
2. After a bank has been active for t_RAS cycles, the FQ scheduler selects the command with the earliest virtual finish-time and waits for it to become ready
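A sketch of the two regimes, assuming a per-bank counter of cycles since activation; this is illustrative Python (`Cmd`, `vft`, and the counter are assumptions, not the paper's hardware):

```python
from dataclasses import dataclass

@dataclass
class Cmd:
    ready: bool
    vft: float  # virtual finish-time (deadline) of the owning request

def bank_select(cmds, active_cycles, t_RAS=18):
    """Regime 1 (bank inactive, or active for fewer than t_RAS cycles):
    first-ready, earliest-virtual-finish-time first (FR-VFTF).
    Regime 2 (active for t_RAS cycles or more): pick the earliest
    virtual finish-time overall and wait for it to become ready."""
    ready = [c for c in cmds if c.ready]
    if active_cycles < t_RAS and ready:
        return min(ready, key=lambda c: c.vft)  # tolerate inversion briefly
    return min(cmds, key=lambda c: c.vft)       # may stall until ready
```

The effect is to cap how long a stream of ready row hits can block an earlier-deadline command at roughly t_RAS cycles.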

24 Evaluation
Simulator originally developed at IBM Research
Structural model
- Adopts the ASIM modeling methodology
- Detailed model of finite memory system resources
We simulate 20 statistically representative 100M-instruction SPEC2000 traces

25 4GHz Processor – System Configuration
Issue Buffer         64 entries
Issue Width          8 units (2 FXU, 2 LSU, 2 FPU, 1 BRU, 1 CRU)
Reorder Buffer       128 entries
Load / Store Queues  32-entry load reorder queue, 32-entry store reorder queue
I-Cache              32KB private, 4-way, 64-byte lines, 2-cycle latency, 8 MSHRs
D-Cache              32KB private, 4-way, 64-byte lines, 2-cycle latency, 16 MSHRs
L2 Cache             512KB private, 64-byte lines, 8-way, 12-cycle latency, 16 store merge buffer entries, 32 transaction buffer entries
Memory Controller    16 transaction buffer entries per thread, 8 write buffer entries per thread, closed-page policy
SDRAM Channels       1 channel
SDRAM Ranks          1 rank
SDRAM Banks          8 banks

26 Evaluation
We use data bus utilization to roughly approximate a benchmark's "aggressiveness"
Figure: single-thread data bus utilization (0%–100%) for each benchmark: art, equake, mcf, facerec, lucas, gcc, swim, mgrid, apsi, wupwise, twolf, gap, ammp, bzip2, gzip, vpr, mesa, sixtrack, perlbmk, crafty.

27 Evaluation
We present results for two-thread workloads that stress the memory system
- Construct 19 workloads by combining each benchmark (the subject thread) with art, the most aggressive benchmark (the background thread)
- Static partitioning of memory bandwidth: φ_i = .5
IPC normalized to QoS IPC
- The benchmark's IPC on a private memory system at φ_i = .5 the frequency (.5 the bandwidth)
More results in the paper

28 Figure: normalized IPC of the subject thread, and normalized IPC of the background thread (art), under FR-FCFS and FQ, for each subject benchmark (equake, mcf, facerec, lucas, gcc, swim, mgrid, apsi, wupwise, twolf, gap, ammp, bzip2, gzip, vpr, mesa, sixtrack, perlbmk, crafty) and their harmonic mean (hmean).

29 Subject Thread of Two Thread Workload (Background Thread is art)
Figure: throughput, measured as the harmonic mean of normalized IPCs, under FR-FCFS and FQ, for each subject benchmark (equake through crafty) and hmean.


31 Summary and Conclusions
Existing techniques can lead to unfair sharing of memory bandwidth resources
⇒ Destructive interference
Fair queuing is a good technique for providing QoS in memory systems
Providing threads QoS eliminates destructive interference, which can significantly improve system throughput

32 Backup Slides

33 Generalized Processor Sharing
Ideal generalized processor sharing (GPS)
- Each flow i is allocated a share φ_i of the shared network link
- A GPS server services all backlogged flows simultaneously, in proportion to their allocated shares
Diagram: flows 1–4 with shares φ_1, φ_2, φ_3, φ_4 sharing one link.
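The fluid GPS model can be sketched over one short interval (illustrative Python; `backlogged` is the set of currently backlogged flow indices, and the function name is an assumption):

```python
def gps_service(shares, backlogged, capacity, dt):
    """Service received by each backlogged flow over an interval dt during
    which the backlogged set does not change: proportional to its share."""
    total = sum(shares[i] for i in backlogged)
    return {i: capacity * dt * shares[i] / total for i in backlogged}
```

With shares (0.5, 0.25, 0.25) and only flows 0 and 1 backlogged, flow 0 receives two-thirds of the link: an idle flow's share is redistributed among the backlogged flows in proportion to their shares.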

34 Background: Network Fair Queuing
Network FQ algorithms model each flow as if it were on a private link
- Flow i's private link has φ_i the bandwidth of the real link
They calculate packet deadlines
- A packet's deadline is the virtual time at which the packet finishes its transmission on its private link

35 Virtual Time Memory System Finish-Time Algorithm
Thread i's kth memory request is denoted m_i^k
m_i^k bank j virtual start-time:   B_j.S_i^k = max { a_i^k, B_j.F_i^(k-1)' }
m_i^k bank j virtual finish-time:  B_j.F_i^k = B_j.S_i^k + B_j.L_i^k / φ_i
m_i^k channel virtual start-time:  C.S_i^k = max { B_j.F_i^k, C.F_i^(k-1) }
m_i^k channel virtual finish-time: C.F_i^k = C.S_i^k + C.L_i^k / φ_i
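The bank-then-channel recurrence can be sketched for a single request (illustrative Python; the primed bank history and per-bank indexing are collapsed into explicit previous-finish-time arguments, which is an assumption of this sketch):

```python
def vtms_times(arrival, bank_F_prev, chan_F_prev, bank_L, chan_L, phi):
    """Bank, then channel, virtual times for one request of one thread:
    Bj.S = max(a, Bj.F_prev);   Bj.F = Bj.S + Bj.L / phi
    C.S  = max(Bj.F, C.F_prev); C.F  = C.S  + C.L  / phi
    The request's deadline is the channel virtual finish-time C.F."""
    bank_S = max(arrival, bank_F_prev)
    bank_F = bank_S + bank_L / phi
    chan_S = max(bank_F, chan_F_prev)  # channel use follows the bank access
    chan_F = chan_S + chan_L / phi
    return bank_F, chan_F
```

With phi = 0.5, both the bank occupancy and the channel transfer are dilated by 2x, so the deadline is the finish time the request would see on a half-speed private memory system.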

36 Fairness Policy
FQMS fairness policy: distribute excess bandwidth to the thread that has consumed the least excess bandwidth (relative to its service share) in the past
- This differs from the fairness policy commonly used in networks, because a memory system is an integral part of a closed system

37 Background: SDRAM Memory Systems
SDRAM 3D structure
- Banks, rows, and columns
SDRAM commands
- Activate a row
- Read or write columns
- Precharge the bank

38 Virtual Time Memory System Service Requirements
SDRAM Command   B_cmd.L                          C_cmd.L
Activate        t_RCD                            n/a
Read            t_CL                             BL/2
Write           t_WL                             BL/2
Precharge       t_RP + (t_RAS - t_RCD - t_CL)    n/a
The t_RAS timing constraint overlaps the read and write bank timing constraints; the precharge bank service requirement accounts for this overlap.