1 Fair Queuing Memory Systems. Kyle Nesbit, Nidhi Aggarwal, Jim Laudon*, and Jim Smith. University of Wisconsin – Madison, Department of Electrical and Computer Engineering; *Sun Microsystems
2 Motivation: Multicore Systems Significant memory bandwidth limitations Bandwidth constrained operating points will occur more often in the future Systems must perform well at bandwidth constrained operating points Must respond in a predictable manner
3 Bandwidth Interference. Desktops: soft real-time constraints. Servers: fair sharing / billing. Interference decreases overall throughput. [Figure: IPC of vpr when run with crafty and when run with art]
4 Solution: a memory scheduler based on (1) First-Ready FCFS memory scheduling and (2) network Fair Queuing (FQ). System software allocates memory system bandwidth to individual threads. The proposed FQ memory scheduler 1. offers threads their allocated bandwidth and 2. distributes excess bandwidth fairly.
5 Background: Memory Basics; Memory Controllers; First-Ready FCFS Memory Scheduling; Network Fair Queuing
6 Background Memory Basics
7 Micron DDR2-800 timing constraints (measured in DRAM address bus cycles)
  Parameter  Description                               Cycles
  t RCD      Activate to read                          5
  t CL       Read to data bus valid                    5
  t WL       Write to data bus valid                   4
  t CCD      CAS to CAS (CAS is a read or a write)     2
  t WTR      Write to read                             3
  t WR       Internal write to precharge               6
  t RTP      Internal read to precharge                3
  t RP       Precharge to activate                     5
  t RRD      Activate to activate (different banks)    3
  t RAS      Activate to precharge                     18
  t RC       Activate to activate (same bank)          22
  BL/2       Burst length (cache line size / 64 bits)  4
  t RFC      Refresh to activate                       51
  t RFC      Max refresh to refresh                    28,000
8 Background: Memory Controller [Figure: CMP block diagram. Labels: two processors, each with L1 caches and an L2 cache; memory controller; chip boundary; SDRAM]
9 Background: Memory Controller. Translates memory requests into SDRAM commands (Activate, Read, Write, and Precharge). Tracks SDRAM timing constraints, e.g., activate latency t RCD and CAS latency t CL. Buffers and reorders requests in order to improve memory system throughput.
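The translation step can be illustrated with a minimal sketch. It assumes a simplified view of a single bank in which the controller only tracks which row, if any, is currently open; the function name `commands_for` and the string labels are hypothetical and not taken from the slides.

```python
def commands_for(request_row, is_write, open_row):
    """Return the SDRAM command sequence for one request to one bank.

    open_row is the row currently held open in the bank, or None if the
    bank is precharged (inactive).
    """
    cas = "WRITE" if is_write else "READ"
    if open_row == request_row:
        # Row hit: the row is already active, only a column access is needed.
        return [cas]
    if open_row is None:
        # Bank is inactive: activate the row, then access the column.
        return ["ACTIVATE", cas]
    # Row conflict: close the open row first.
    return ["PRECHARGE", "ACTIVATE", cas]
```

Under a closed page policy the bank would be precharged after each access, so every request would take the `["ACTIVATE", cas]` path followed by a precharge.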
10 Background: Memory Scheduler
11 Background: FR-FCFS Memory Scheduler. A First-Ready FCFS scheduler prioritizes 1. ready commands, 2. CAS commands over RAS commands, 3. earliest arrival time. "Ready" is with respect to the SDRAM timing constraints. FR-FCFS is a good general-purpose scheduling policy [Rixner 2004]. Multithreaded issues.
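The three FR-FCFS priority rules can be read as a lexicographic sort key. The sketch below assumes a hypothetical `Command` record whose fields (`ready`, `is_cas`, `arrival`) are illustrative names, not part of the slides.

```python
from dataclasses import dataclass

@dataclass
class Command:
    name: str
    ready: bool    # satisfies all SDRAM timing constraints this cycle
    is_cas: bool   # True for a column command (read/write), False for a row command
    arrival: int   # arrival time of the owning request

def fr_fcfs_key(cmd):
    # Smaller tuples win: ready first, then CAS over RAS, then oldest arrival.
    return (not cmd.ready, not cmd.is_cas, cmd.arrival)

def fr_fcfs_select(pending):
    """Return the command FR-FCFS would issue, or None if nothing is ready."""
    ready = [c for c in pending if c.ready]
    return min(ready, key=fr_fcfs_key) if ready else None
```

Note that a ready CAS command from a newer request beats a ready RAS command from an older one, which is how a stream of row hits from one thread can delay another thread's requests.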
12 Example: Two Threads. Thread 1: bursty MLP, bandwidth constrained. Thread 2: isolated misses, latency sensitive. [Figure: per-thread timelines of computation and memory latency for requests a1, a2, ...]
13 First Come First Serve [Figure: requests a1, a2, ... from Thread 1 and Thread 2 served by the shared memory system]
14 Background: Network Fair Queuing Network Fair Queuing (FQ) provides QoS in communication networks Network flows are allocated bandwidth on each network link along the flow’s path Routers use FQ algorithms to offer flows their allocated bandwidth Minimum bandwidth bounds end-to-end communication delay through the network We leverage FQ theory to provide QoS in memory systems
15 Background: Virtual Finish-Time Algorithm
The k-th packet on flow i is denoted $p_i^k$, with arrival time $a_i^k$ and length $L_i^k$.
Virtual start-time of $p_i^k$: $S_i^k = \max\{a_i^k, F_i^{k-1}\}$
Virtual finish-time of $p_i^k$: $F_i^k = S_i^k + L_i^k / \phi_i$
$\phi_i$ is flow i's share of the network link. A virtual clock determines the arrival time $a_i^k$. The VC algorithm determines the fairness policy.
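The recurrence for virtual start and finish times can be sketched for a single flow as follows. This assumes the finish-time before the first packet is 0 and that arrival times are already expressed in virtual time; the function name `virtual_times` is hypothetical.

```python
def virtual_times(arrivals, lengths, share):
    """Return a list of (start, finish) virtual times for one flow's packets.

    arrivals: virtual arrival time of each packet, in order
    lengths:  length (service requirement) of each packet
    share:    the flow's share of the link, 0 < share <= 1
    """
    prev_finish = 0.0  # assumed initial finish-time before the first packet
    result = []
    for arrival, length in zip(arrivals, lengths):
        start = max(arrival, prev_finish)   # cannot start before the previous packet finishes
        finish = start + length / share     # service time is dilated by 1 / share
        result.append((start, finish))
        prev_finish = finish
    return result
```

For example, three unit-length packets arriving at times 0, 0 and 10 on a flow with a share of 0.5 receive finish-times 2, 4 and 12: the second packet queues behind the first, while the third arrives after the flow has gone idle.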
16 Quality of Service. Each thread is allocated a fraction $\phi_i$ of the memory system bandwidth. Desktop: soft real-time applications. Server: differentiated service, billing. The proposed FQ memory scheduler 1. offers threads their allocated bandwidth, regardless of the load on the memory system, and 2. distributes excess bandwidth according to the FQ memory scheduler's fairness policy.
17 Quality of Service. Minimum Bandwidth ⇒ QoS. A thread allocated a fraction $\phi_i$ of the memory system bandwidth will perform as well as the same thread on a private memory system operating at $\phi_i$ of the frequency.
18 Fair Queuing Memory Scheduler. A Virtual Time Memory System (VTMS) is used to calculate memory request deadlines. Request deadlines are virtual finish-times. The FQ scheduler selects 1. the first-ready pending request 2. with the earliest deadline first (EDF). [Figure: block diagram with Thread 1 ... Thread m requests, per-thread VTMS (deadline / finish-time algorithm), transaction buffer, FQ scheduler, SDRAM]
19 Fair Queuing Memory Scheduler [Figure: Thread 1 and Thread 2 requests on the shared memory system with their virtual time deadlines; memory latency is dilated by the reciprocal of $\phi_i$]
20 Virtual Time Memory System Each thread has its own VTMS to model its private memory system VTMS consists of multiple resources Banks and channels In hardware, a VTMS consists of one register for each memory bank and channel resource A VTMS register holds the virtual time the virtual resource will be ready to start the next request
21 Virtual Time Memory System. A request's deadline is its virtual finish-time: the time the request would finish if the request's thread were running on a private memory system operating at $\phi_i$ of the frequency. A VTMS model captures fundamental SDRAM timing characteristics. It abstracts away some details in order to apply network FQ theory.
22 Priority Inversion First-ready scheduling is required to improve bandwidth utilization Low priority ready commands can block higher priority (earlier virtual finish-time) commands Most priority inversion blocking occurs at active banks, e.g. a sequence of row hits
23 Bounding Priority Inversion Blocking Time. 1. When a bank is inactive, and until t RAS cycles after a bank has been activated, prioritize requests FR-VFTF (first-ready, then earliest virtual finish-time). 2. After a bank has been active for t RAS cycles, the FQ scheduler selects the command with the earliest virtual finish-time and waits for it to become ready.
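The two-phase rule can be sketched per bank as below. This is a simplified illustration, not the scheduler's actual implementation: it assumes each pending command is a dict with hypothetical keys `vft` (virtual finish-time) and `ready`, and it uses the t RAS value of 18 cycles from the DDR2-800 timing table.

```python
T_RAS = 18  # activate-to-precharge, in DRAM address bus cycles (DDR2-800 table)

def select_for_bank(pending, cycles_since_activate):
    """Pick the command to issue to one bank, or None to issue nothing.

    pending: list of dicts with keys "vft" and "ready"
    cycles_since_activate: None if the bank is inactive, else cycles since activation
    """
    if not pending:
        return None
    if cycles_since_activate is None or cycles_since_activate < T_RAS:
        # Phase 1: first-ready, then earliest virtual finish-time.
        ready = [c for c in pending if c["ready"]]
        return min(ready, key=lambda c: c["vft"]) if ready else None
    # Phase 2: commit to the earliest virtual finish-time and wait for it,
    # even if later-deadline commands (e.g. row hits) are ready now.
    earliest = min(pending, key=lambda c: c["vft"])
    return earliest if earliest["ready"] else None
```

In phase 2 a ready row hit with a later deadline is deliberately passed over, which is what bounds how long a sequence of row hits can block a higher-priority command.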
24 Evaluation Simulator originally developed at IBM Research Structural model Adopts the ASIM modeling methodology Detailed model of finite memory system resources Simulate 20 statistically representative 100M instruction SPEC2000 traces
25 4GHz Processor – System Configuration
  Issue Buffer         64 entries
  Issue Width          8 units (2 FXU, 2 LSU, 2 FPU, 1 BRU, 1 CRU)
  Reorder Buffer       128 entries
  Load / Store Queues  32 entry load reorder queue, 32 entry store reorder queue
  I-Cache              32KB private, 4-ways, 64 byte lines, 2 cycle latency, 8 MSHRs
  D-Cache              32KB private, 4-ways, 64 byte lines, 2 cycle latency, 16 MSHRs
  L2 Cache             512KB private cache, 64 byte lines, 8-ways, 12 cycle latency, 16 store merge buffer entries, 32 transaction buffer entries
  Memory Controller    16 transaction buffer entries per thread, 8 write buffer entries per thread, closed page policy
  SDRAM Channels       1 channel
  SDRAM Ranks          1 rank
  SDRAM Banks          8 banks
26 Evaluation. We use data bus utilization to roughly approximate “aggressiveness”. [Figure: Single Thread Data Bus Utilization, 0% to 100%, for art, equake, mcf, facerec, lucas, gcc, swim, mgrid, apsi, wupwise, twolf, gap, ammp, bzip2, gzip, vpr, mesa, sixtrack, perlbmk, crafty]
27 Evaluation. We present results for a two thread workload that stresses the memory system. Construct 19 workloads by combining each benchmark (subject thread) with art, the most aggressive benchmark (background thread). Static partitioning of memory bandwidth: $\phi_i = 0.5$. IPC is normalized to QoS IPC: the benchmark's IPC on a private memory system at $\phi_i = 0.5$ of the frequency (0.5 of the bandwidth). More results in the paper.
28 [Figure, two panels: Normalized IPC of Subject Thread; Normalized IPC of Background Thread (art). x-axis: equake, mcf, facerec, lucas, gcc, swim, mgrid, apsi, wupwise, twolf, gap, ammp, bzip2, gzip, vpr, mesa, sixtrack, perlbmk, crafty, hmean. y-axis: Normalized IPC. Series: FR-FCFS, FQ]
29 [Figure: Throughput – Harmonic Mean of Normalized IPCs. x-axis: Subject Thread of Two Thread Workload (Background Thread is art), equake through crafty, plus hmean. y-axis: Harmonic Mean of Normalized IPCs. Series: FR-FCFS, FQ]
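The throughput metric used in the evaluation (harmonic mean of each thread's IPC normalized to its QoS IPC) can be computed as in this short sketch. The function names are hypothetical, and any numbers passed in are up to the reader; none of the values in the usage note are results from the study.

```python
from statistics import harmonic_mean

def normalized_ipc(shared_ipc, qos_ipc):
    """IPC on the shared memory system relative to the thread's QoS IPC."""
    return shared_ipc / qos_ipc

def workload_throughput(shared_ipcs, qos_ipcs):
    """Harmonic mean of the normalized IPCs of all threads in a workload."""
    return harmonic_mean(
        [normalized_ipc(s, q) for s, q in zip(shared_ipcs, qos_ipcs)]
    )
```

As a purely illustrative case, a two thread workload in which one thread meets its QoS IPC (normalized IPC 1.0) and the other reaches half of it (0.5) has a harmonic mean of 2/3. The harmonic mean penalizes the slowed thread more than an arithmetic mean would.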
31 Summary and Conclusions Existing techniques can lead to unfair sharing of memory bandwidth resources ⇒ Destructive interference Fair queuing is a good technique to provide QoS in memory systems Providing threads QoS eliminates destructive interference which can significantly improve system throughput
32 Backup Slides
33 Generalized Processor Sharing. Ideal generalized processor sharing (GPS): each flow i is allocated a share $\phi_i$ of the shared network link. A GPS server services all backlogged flows simultaneously in proportion to their allocated shares. [Figure: Flows 1 to 4 with shares $\phi_1$, $\phi_2$, $\phi_3$, $\phi_4$]
34 Background: Network Fair Queuing. Network FQ algorithms model each flow as if it were on a private link. Flow i's private link has $\phi_i$ of the bandwidth of the real link. The algorithm calculates packet deadlines: a packet's deadline is the virtual time the packet finishes its transmission on its private link.
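Proportional service among backlogged flows can be illustrated with a small sketch. It assumes an idealized fluid server in which idle flows receive nothing and their capacity is redistributed to backlogged flows in proportion to their shares; `gps_rates` is a hypothetical name.

```python
def gps_rates(shares, backlogged, link_rate=1.0):
    """Instantaneous service rate of each flow under ideal GPS.

    shares:     dict mapping flow id to its allocated share
    backlogged: set of flow ids that currently have packets queued
    """
    total = sum(shares[i] for i in backlogged)
    return {
        # Idle flows get no service; backlogged flows split the whole link.
        i: (link_rate * shares[i] / total if i in backlogged else 0.0)
        for i in shares
    }
```

With four flows holding equal shares of 0.25 and only flows 1 and 2 backlogged, each of those two is served at half the link rate, so no flow ever receives less than its allocated share.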
35 Virtual Time Memory System Finish Time Algorithm
Thread i's k-th memory request is denoted $m_i^k$.
Bank j virtual start-time of $m_i^k$: $B_j.S_i^k = \max\{a_i^k, B_j.F_i^{(k-1)'}\}$
Bank j virtual finish-time of $m_i^k$: $B_j.F_i^k = B_j.S_i^k + B_j.L_i^k / \phi_i$
Channel virtual start-time of $m_i^k$: $C.S_i^k = \max\{B_j.F_i^{k-1}, C.F_i^{k-1}\}$
Channel virtual finish-time of $m_i^k$: $C.F_i^k = C.S_i^k + C.L_i^k / \phi_i$
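A register-based sketch of the finish-time updates follows, with one register per bank and one for the channel, per thread. Several points are assumptions rather than statements from the slides: the class and method names are hypothetical; all registers start at 0; the channel start-time is taken from the bank register after it has been updated for the same request (one reading of the bank term in the channel start-time formula, whose superscript on the slide reads k-1); and the request's deadline is taken to be the channel finish-time.

```python
class VTMS:
    """Per-thread virtual time memory system (illustrative sketch)."""

    def __init__(self, share, num_banks):
        self.share = share              # thread's bandwidth share (phi_i)
        self.bank = [0.0] * num_banks   # virtual time each bank is next free
        self.channel = 0.0              # virtual time the channel is next free

    def deadline(self, arrival, bank, bank_service, channel_service):
        """Update the registers for one request and return its virtual finish-time."""
        # Bank: start once the request has arrived and the virtual bank is free.
        bank_start = max(arrival, self.bank[bank])
        self.bank[bank] = bank_start + bank_service / self.share
        # Channel: assumed to start after this request's bank service completes.
        chan_start = max(self.bank[bank], self.channel)
        self.channel = chan_start + channel_service / self.share
        return self.channel
```

For a thread with a share of 0.5, a request needing 10 cycles of bank service and 4 cycles of channel service on an idle VTMS gets deadline 28; an identical request to a different bank arriving at the same time gets 36, because it contends only for the virtual channel.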
36 Fairness Policy. FQMS fairness policy: distribute excess bandwidth to the thread that has consumed the least excess bandwidth (relative to its service share) in the past. This differs from the fairness policy commonly used in networks because a memory system is an integral part of a closed system.
37 Background: SDRAM Memory Systems. SDRAM 3D structure: banks, rows, columns. SDRAM commands: activate row, read or write columns, precharge bank.
38 Virtual Time Memory System Service Requirements
  SDRAM Command  B cmd.L                        C cmd.L
  Activate       t RCD                          n/a
  Read           t CL                           BL/2
  Write          t WL                           BL/2
  Precharge      t RP + (t RAS - t RCD - t CL)  n/a
The t RAS timing constraint overlaps read and write bank timing constraints. The precharge bank service requirement accounts for the overlap.
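Plugging the DDR2-800 values from the timing table into these service requirements gives concrete cycle counts. The sketch below is illustrative: the dictionary and function names are hypothetical, and treating "n/a" as a channel service of 0 is an assumption made for convenience.

```python
# DDR2-800 values in DRAM address bus cycles, from the timing table.
T = {"tRCD": 5, "tCL": 5, "tWL": 4, "tRP": 5, "tRAS": 18, "BL/2": 4}

BANK_SERVICE = {
    "activate": T["tRCD"],
    "read": T["tCL"],
    "write": T["tWL"],
    # tRAS overlaps the activate and read bank service, so only the
    # remainder of tRAS is charged to the precharge.
    "precharge": T["tRP"] + (T["tRAS"] - T["tRCD"] - T["tCL"]),
}

# Only column commands occupy the data channel.
CHANNEL_SERVICE = {"read": T["BL/2"], "write": T["BL/2"]}

def service(cmd):
    """Return (bank service, channel service) in cycles for an SDRAM command."""
    return BANK_SERVICE[cmd], CHANNEL_SERVICE.get(cmd, 0)  # n/a assumed to be 0
```

With these values the precharge bank service requirement is 5 + (18 - 5 - 5) = 13 cycles.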