1 Lecture 13: DRAM Innovations
Today: energy efficiency, row buffer management, scheduling
2 Latency and Power Wall
Power wall: 25-40% of datacenter power can be attributed to the DRAM system
Smaller arrays improve both latency and power, but incur a penalty in density and cost
A higher row buffer hit rate improves both latency and power, but requires intelligent mapping of data to rows, clever scheduling of requests, etc.
Minimizing overfetch reduces power: either read fewer chips or read only part of a row; both incur penalties in area or bandwidth
3 Overfetch
Overfetch is caused by multiple factors:
  Each array is large (fewer peripherals → more density)
  Involving more chips per access → more data transfer pin bandwidth
  More overfetch → more prefetch; helps apps with locality
  Involving more chips per access → less data loss when a chip fails → lower overhead for reliability
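To put a number on overfetch, here is a rough arithmetic sketch. The configuration (8 chips per rank, an 8 Kb row activated in each chip, a 64-byte cache line) is an illustrative assumption and is not taken from the lecture; the function name is hypothetical.

```python
def overfetch_factor(chips_per_rank, row_bits_per_chip, cache_line_bytes):
    """Ratio of bits brought into the row buffers to bits actually requested."""
    activated_bits = chips_per_rank * row_bits_per_chip
    requested_bits = cache_line_bytes * 8
    return activated_bits / requested_bits


# Assumed example: 8 chips, 8 Kb row per chip, 64-byte cache line.
factor = overfetch_factor(chips_per_rank=8,
                          row_bits_per_chip=8 * 1024,
                          cache_line_bytes=64)
print(factor)  # 128.0
```

Under these assumed numbers, a single cache-line request activates 128 times more bits than it consumes, which is the energy cost that the following slides try to reduce.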
4 Re-Designing Arrays
Udipi et al., ISCA’10
5 Selective Bitline Activation
Two papers in 2010: Udipi et al., ISCA’10; Cooper-Balis and Jacob, IEEE Micro
Additional logic per array so that only the relevant bitlines are read out
Essentially results in finer-grain partitioning of the DRAM arrays
6 Rank Subsetting
Instead of using all chips in a rank to read out 64-bit words every cycle, form smaller parallel ranks
Increases data transfer time; reduces the size of the row buffer
But, lower energy per row read and compatible with modern DRAM chips
Increases the number of banks and hence promotes parallelism (reduces queuing delays)
Mini-Rank, MICRO’08; MC-DIMM, SC’09
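The trade-off in rank subsetting (longer data transfer, smaller row buffer) can be sketched with simple arithmetic. The chip width (x8), the 8 Kb row per chip, and the 64-byte cache line are assumptions chosen for illustration; `rank_access` is a hypothetical helper, not part of Mini-Rank or MC-DIMM.

```python
def rank_access(chips, chip_width_bits, row_bits_per_chip, line_bytes=64):
    """Return (bus beats to transfer one cache line, row buffer bits activated)."""
    line_bits = line_bytes * 8
    beats = line_bits // (chips * chip_width_bits)
    row_buffer_bits = chips * row_bits_per_chip
    return beats, row_buffer_bits


# Assumed x8 chips with an 8 Kb row each.
full_rank = rank_access(chips=8, chip_width_bits=8, row_bits_per_chip=8 * 1024)
mini_rank = rank_access(chips=2, chip_width_bits=8, row_bits_per_chip=8 * 1024)

print(full_rank)  # (8, 65536)
print(mini_rank)  # (32, 16384)
```

With these assumed parameters, a 2-chip subset takes four times as many beats to deliver a cache line but activates a quarter of the bits per row read.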
7 Row Buffer Management
Open Page policy: maximizes row buffer hits, minimizes energy
Close Page policy: helps performance when there is limited locality
Hybrid policies: can close a row buffer after it has served its utility; lots of ways to predict utility: time, accesses, locality counters for a bank, etc.
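A minimal single-bank sketch of why the two basic policies favor different access patterns. The timing values (15 cycles each for precharge, activate, and column access) are assumed for illustration, and the close-page model assumes the precharge after each access is fully hidden off the critical path.

```python
T_RP, T_RCD, T_CAS = 15, 15, 15  # assumed precharge / activate / column-read latencies


def total_latency(rows, policy):
    """Sum access latency for a sequence of row numbers sent to one bank."""
    open_row = None
    total = 0
    for row in rows:
        if policy == "open":
            if open_row == row:          # row buffer hit
                total += T_CAS
            elif open_row is None:       # bank idle: activate, then read
                total += T_RCD + T_CAS
            else:                        # conflict: precharge, activate, read
                total += T_RP + T_RCD + T_CAS
            open_row = row
        elif policy == "close":
            # Row is closed right after each access (assumed off the critical path).
            total += T_RCD + T_CAS
        else:
            raise ValueError(f"unknown policy: {policy}")
    return total


high_locality = [1, 1, 1, 1]
low_locality = [1, 2, 3, 4]
print(total_latency(high_locality, "open"), total_latency(high_locality, "close"))  # 75 120
print(total_latency(low_locality, "open"), total_latency(low_locality, "close"))    # 165 120
```

A hybrid policy would sit between these two branches, for example by closing the row once a predictor decides no further hits are likely.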
8 Micro-Pages
Sudan et al., ASPLOS’10
Organize data across banks to maximize locality in a row buffer
Key observation: most locality is restricted to a small portion of an OS page
Such hot micro-pages are identified with hardware counters and co-located on the same row
Requires hardware indirection to a page’s new location
Works well only if most activity is confined to a few micro-pages
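The counter-based identification of hot micro-pages can be sketched in software as follows. The 1 KB micro-page size and the `hot_micro_pages` helper are assumptions for illustration; the actual mechanism uses hardware counters and a remapping step that is not modeled here.

```python
from collections import Counter

MICRO_PAGE_BYTES = 1024  # assumed micro-page size


def hot_micro_pages(addresses, top_n):
    """Count accesses per micro-page and return the top_n most accessed ones."""
    counts = Counter(addr // MICRO_PAGE_BYTES for addr in addresses)
    return [micro_page for micro_page, _ in counts.most_common(top_n)]


# Assumed trace of byte addresses.
trace = [0, 64, 128, 4096, 4160, 9000]
print(hot_micro_pages(trace, top_n=2))  # [0, 4]
```

The micro-pages returned here are the candidates that would be co-located in the same DRAM row.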
9 Scheduling Policies
The memory controller must manage several timing constraints and issue a command when all resources are available
It must also maximize row buffer hit rates, fairness, and throughput
Reads are typically given priority over writes; the write buffer must be drained when it is close to full; changing the direction of the bus requires 5-10 ns delay
Basic policies: FCFS, First-Ready-FCFS (prioritize row buffer hits)
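The difference between the two basic policies can be shown with a small selection function. This is a simplified sketch: it ignores timing constraints, read/write priority, and bus turnaround, and the `Request` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int
    bank: int
    row: int


def fcfs(queue):
    """Pick the oldest request regardless of row buffer state."""
    return min(queue, key=lambda r: r.arrival)


def fr_fcfs(queue, open_rows):
    """Pick the oldest row buffer hit if any exists, otherwise the oldest request."""
    # The key sorts hits (False) before misses (True), then by age.
    return min(queue, key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))


queue = [Request(0, 0, 5), Request(1, 0, 7), Request(2, 1, 3)]
open_rows = {0: 7, 1: 9}  # bank -> currently open row

print(fcfs(queue))                # Request(arrival=0, bank=0, row=5)
print(fr_fcfs(queue, open_rows))  # Request(arrival=1, bank=0, row=7)
```

FR-FCFS skips the older request to row 5 because the request to row 7 hits in bank 0's open row.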
10 STFM
Mutlu and Moscibroda, MICRO’07
When multiple threads run together, FR-FCFS prioritizes threads with row buffer hits
Each thread has a slowdown: $S = T_{shared} / T_{alone}$, where $T$ is the number of cycles the ROB is stalled waiting for memory
Unfairness is estimated as $S_{max} / S_{min}$
If unfairness is higher than a threshold, thread priorities override other priorities (Stall Time Fair Memory scheduling)
Estimation of $T_{alone}$ requires some book-keeping: does an access delay critical requests from other threads?
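The unfairness check can be sketched as below. The sketch takes slowdown as shared stall time divided by stand-alone stall time, so values above 1 mean a thread is being slowed down, and it assumes that the thread favored in fairness mode is the one with the largest slowdown. The stall-time numbers, threshold values, and function names are illustrative assumptions; estimating the stand-alone stall time in hardware is not modeled.

```python
def slowdowns(t_shared, t_alone):
    """Per-thread slowdown: stall cycles when sharing / stall cycles when alone."""
    return {thread: t_shared[thread] / t_alone[thread] for thread in t_shared}


def unfairness(slowdown):
    """Ratio of the largest slowdown to the smallest."""
    return max(slowdown.values()) / min(slowdown.values())


def pick_policy(t_shared, t_alone, threshold):
    """Return ('fairness', thread_to_favor) or ('fr-fcfs', None)."""
    s = slowdowns(t_shared, t_alone)
    if unfairness(s) > threshold:
        return ("fairness", max(s, key=s.get))
    return ("fr-fcfs", None)


# Assumed stall-time estimates in cycles.
t_shared = {"A": 400, "B": 150}
t_alone = {"A": 100, "B": 100}

print(slowdowns(t_shared, t_alone))         # {'A': 4.0, 'B': 1.5}
print(pick_policy(t_shared, t_alone, 1.1))  # ('fairness', 'A')
print(pick_policy(t_shared, t_alone, 3.0))  # ('fr-fcfs', None)
```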
11 PAR-BS
Mutlu and Moscibroda, ISCA’08
A batch of requests (per bank) is formed: each thread can only contribute R requests to this batch; batch requests have priority over non-batch requests
Within a batch, priority is first given to row buffer hits, then to threads with a higher “rank”, then to older requests
Rank is computed based on the thread’s memory intensity; low-intensity threads are given higher priority; this policy improves batch completion time
By using rank, requests from a thread are serviced in parallel; hence, parallelism-aware batch scheduling
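The batching and prioritization rules can be sketched as follows. As a simplification, this sketch ranks threads by their total number of batched requests (fewer requests means higher rank), which is an assumed stand-in for the memory-intensity metric; the `Req` type, the function names, and the cap value are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Req:
    thread: str
    bank: int
    row: int
    arrival: int
    marked: bool = False  # True once the request belongs to the current batch


def form_batch(queue, cap):
    """Mark up to `cap` oldest requests per (thread, bank) as part of the batch."""
    count = defaultdict(int)
    for r in sorted(queue, key=lambda r: r.arrival):
        if count[(r.thread, r.bank)] < cap:
            r.marked = True
            count[(r.thread, r.bank)] += 1


def thread_ranks(queue):
    """Rank threads by batched request count; 0 is the highest rank (assumed metric)."""
    load = defaultdict(int)
    for r in queue:
        if r.marked:
            load[r.thread] += 1
    ordered = sorted(load, key=lambda t: (load[t], t))
    return {t: pos for pos, t in enumerate(ordered)}


def next_request(queue, open_rows, ranks):
    """Priority order: batched, then row hit, then higher rank, then older."""
    return min(queue, key=lambda r: (not r.marked,
                                     open_rows.get(r.bank) != r.row,
                                     ranks.get(r.thread, len(ranks)),
                                     r.arrival))


queue = [Req("A", 0, 1, 0), Req("A", 0, 1, 1), Req("A", 0, 1, 2), Req("B", 0, 2, 3)]
form_batch(queue, cap=2)
ranks = thread_ranks(queue)
print(ranks)                                       # {'B': 0, 'A': 1}
print(next_request(queue, {0: 9}, ranks).thread)   # B (no hits, so rank decides)
print(next_request(queue, {0: 1}, ranks).arrival)  # 0 (A's row hit wins)
```

Thread A's third request stays outside the batch because of the cap, so it cannot be served ahead of B's batched request even though it targets the same row as A's earlier requests.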
12 TCM
Kim et al., MICRO 2010
Organize threads into latency-sensitive and bandwidth-sensitive clusters based on memory intensity; the former gets higher priority
Within the bandwidth-sensitive cluster, priority is based on rank
Rank is determined based on the “niceness” of a thread, and the rank is periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness)
Threads with low row buffer hit rates and high bank-level parallelism are considered “nice” to others
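The clustering step can be sketched as below. The sketch assumes that threads are sorted by memory intensity (misses per kilo-instruction) and that the least intensive ones are placed in the latency-sensitive cluster until their combined bandwidth usage would exceed a fraction of the total; the threshold, the input numbers, and the function name are illustrative assumptions. Niceness and rank shuffling are not modeled.

```python
def cluster_threads(mpki, bw_usage, cluster_thresh):
    """Split threads into (latency_sensitive, bandwidth_sensitive) lists."""
    budget = cluster_thresh * sum(bw_usage.values())
    latency, bandwidth = [], []
    used = 0.0
    # Walk threads from least to most memory intensive.
    for thread in sorted(mpki, key=lambda t: (mpki[t], t)):
        # Once one thread spills over, all more intensive threads follow it.
        if not bandwidth and used + bw_usage[thread] <= budget:
            latency.append(thread)
            used += bw_usage[thread]
        else:
            bandwidth.append(thread)
    return latency, bandwidth


# Assumed per-thread measurements.
mpki = {"a": 0.5, "b": 1.0, "c": 20.0, "d": 35.0}
bw_usage = {"a": 2, "b": 3, "c": 40, "d": 55}

print(cluster_threads(mpki, bw_usage, cluster_thresh=0.1))  # (['a', 'b'], ['c', 'd'])
```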