Reducing Memory Interference in Multicore Systems
Lavanya Subramanian, Department of ECE
11/04/2011
Hello. My name is Lavanya Subramanian. Today, I am going to talk about Application-Aware Memory Channel Partitioning.
Main Memory is a Bottleneck
[Slide diagram: multiple cores accessing main memory over a shared channel.]
Main memory latency is long; stalling on it reduces core performance. In a multicore system, applications running on multiple cores share the main memory.
Problem of Inter-Application Interference
[Slide diagram: requests from multiple cores contending for the memory channel.]
Applications' requests interfere at the main memory, and this inter-application interference degrades system performance. The problem is further exacerbated by fast-growing core counts and limited off-chip pin bandwidth.
Talk Summary
Goal: Address the problem of inter-application interference at main memory, with the aim of improving performance.
Outline of this talk: Background/Motivation, Previous Approaches, Our Approach.
The goal of this talk is to motivate, describe, and address the problem of inter-application interference at main memory. I shall give some background on main memory organization and operation, describe previous approaches and their shortcomings, and explain why we need our approach: memory channel partitioning.
Background: Main Memory Organization
DRAM Main Memory Organization
[Slide diagram: a core connected to DRAM banks through a channel.]
The processor accesses the off-chip DRAM main memory through one or more channels; here, I show a single channel. The smallest accessible unit within a channel is a bank. There are other levels in the hierarchy, such as ranks and DIMMs, which I shall not go into in detail. Accesses to multiple banks can proceed in parallel, but only one bank can send data on the channel at the same time.
DRAM Organization: Bank Organization
[Slide diagram: a bank as a 2D cell array, with row decoder, row buffer, and column mux; each row is 4 KB, each column 8 bytes.]
Each bank is a 2D array of DRAM cells. The x dimension is a row; a row is organized as several columns.
DRAM Organization: Accessing Data
[Slide diagram: a row of columns A–F, with the required piece of data highlighted.]
Now, I want to access the highlighted piece of data.
DRAM Organization: The Row Buffer
[Slide diagram: the row A–F destructively read from the array into the row buffer, and the required data sent onto the channel.]
The entire row is read from the array into the row buffer; the read is destructive. Then the required piece of data is read from the row buffer and sent on the channel. The data of that row is now present in the row buffer.
DRAM Organization: Row Hit
[Slide diagram: a second column of the same row served directly from the row buffer.]
A subsequent access to another column of data from the same row is serviced from the row buffer; an array access is not required. This is called a row hit.
DRAM Organization: Row Miss
[Slide diagram: (1) the row-buffer contents written back to the array, (2) the new row destructively read into the row buffer.]
On the other hand, if a subsequent access is to data in another row, (i) the row-buffer contents have to be written back into the array, and (ii) the new row is read into the row buffer. This is called a row-buffer miss.
Row miss latency = 2 x row hit latency.
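The row-buffer behavior above can be sketched as a tiny model of a single bank. The latency constants are illustrative placeholders chosen only to preserve the 2x relationship between a miss and a hit:

```python
# Minimal sketch of row-buffer behavior in one DRAM bank.
# Latency values are placeholders preserving: row miss = 2 x row hit.
ROW_HIT_LATENCY = 1   # array access not required: serve from row buffer
ROW_MISS_LATENCY = 2  # write back old row + destructive read of new row

class Bank:
    def __init__(self):
        self.open_row = None  # row currently held in the row buffer

    def access(self, row):
        """Return the latency of accessing `row`, updating the row buffer."""
        if row == self.open_row:
            return ROW_HIT_LATENCY   # row hit: data already in row buffer
        self.open_row = row          # row miss: replace row-buffer contents
        return ROW_MISS_LATENCY

bank = Bank()
# Two accesses each to rows 0, then 1, then back to 0:
latencies = [bank.access(r) for r in [0, 0, 1, 1, 0]]  # [2, 1, 2, 1, 2]
```

Back-to-back accesses to the same row pay the miss latency only once; switching rows pays it again, which is exactly why schedulers try to exploit row hits.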
The Memory Controller
[Slide diagram: request buffer between the core and the DRAM banks.]
The memory controller is the medium between the core and the main memory. It buffers memory requests from the core in a request buffer, and re-orders and schedules requests to the main memory banks.
FR-FCFS (Rixner et al., ISCA'00)
[Slide diagram: service timelines contrasting FCFS with FR-FCFS on one bank.]
FR-FCFS exploits row hits to minimize overall DRAM access latency: it prioritizes row-hit requests over row-miss requests, and older requests over younger ones.
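The FR-FCFS selection rule can be sketched as follows, assuming a simplified request representation of (arrival order, row) for one bank:

```python
# Sketch of FR-FCFS request selection for a single bank.
# A request is a tuple (arrival_order, row) -- a simplification.
def fr_fcfs_pick(requests, open_row):
    """Pick the next request: row hits first, then oldest (FCFS)."""
    hits = [r for r in requests if r[1] == open_row]
    pool = hits if hits else requests   # prefer row-hit requests
    return min(pool, key=lambda r: r[0])  # oldest within the chosen pool

# An older request to row 5 loses to a younger row hit on open row 7:
queue = [(0, 5), (1, 7), (2, 7)]
fr_fcfs_pick(queue, open_row=7)  # -> (1, 7)
fr_fcfs_pick(queue, open_row=5)  # -> (0, 5)
```

This is the behavior the next slide exploits: a stream of row hits from one application can repeatedly win over another application's single request.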
Memory Scheduling in Multicore Systems
[Slide diagram: FR-FCFS service timeline for two applications sharing a bank.]
Application 2's single request starves behind three of Application 1's requests: the low memory-intensity application 2 starves behind application 1. Minimizing overall DRAM access latency != system performance.
Need for Application Awareness
The memory scheduler needs to be aware of application characteristics. Thread Cluster Memory (TCM) scheduling (Kim et al., MICRO'10) is the current best application-aware memory scheduling policy. TCM always prioritizes low memory-intensity applications and shuffles between high memory-intensity applications.
Strength: provides good system performance.
Shortcoming: high hardware complexity due to ranking and prioritization logic.
Modern Systems Have Multiple Channels
[Slide diagram: two memory controllers, each driving its own channel to memory.]
Allocation of data to channels is a new degree of freedom.
Interleaving Rows Across Channels
[Slide diagram: consecutive rows mapped to alternate channels.]
This enables parallelism in access to rows on different channels.
Interleaving Cache Lines Across Channels
[Slide diagram: consecutive cache lines mapped to alternate channels.]
This enables finer-grained parallelism at the cache-line granularity.
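The two interleaving schemes amount to two address-to-channel mappings. A sketch, assuming hypothetical sizes of 4 KB rows, 64-byte cache lines, and 2 channels:

```python
# Sketch of row- vs. cache-line interleaving across channels.
# Sizes below are assumptions for illustration, not from the talk.
NUM_CHANNELS = 2
ROW_BYTES = 4096   # 4 KB row (matches the bank-organization slide)
LINE_BYTES = 64    # typical cache-line size (assumed)

def row_interleaved_channel(addr):
    # Consecutive 4 KB rows go to alternating channels.
    return (addr // ROW_BYTES) % NUM_CHANNELS

def line_interleaved_channel(addr):
    # Consecutive 64-byte cache lines go to alternating channels.
    return (addr // LINE_BYTES) % NUM_CHANNELS

row_interleaved_channel(4096)    # -> 1 (second row, other channel)
line_interleaved_channel(64)     # -> 1 (second line, other channel)
```

Row interleaving keeps a whole row, and hence its row-buffer locality, on one channel; line interleaving spreads even a single row's lines across channels.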
Key Insight 1
High memory-intensity applications interfere with low memory-intensity applications in shared memory channels.
[Slide diagram: service timelines (time units 1–5) for App A on Core 0 and App B on Core 1, under conventional page mapping vs. channel partitioning.]
Solution: Map the data of low and high memory-intensity applications to different channels.
Key Insight 2
[Slide diagram: request buffer states and service orders (requests A–E, service slots 1–6) under conventional page mapping vs. channel partitioning.]
Mapping applications to separate channels also improves the service order: requests that would queue behind each other on a shared channel are serviced in parallel on different channels.
Memory Channel Partitioning (MCP)
1. Profile applications (hardware)
2. Classify applications into groups (system software)
3. Partition available channels between groups (system software)
4. Assign a preferred channel to each application (system software)
5. Allocate application pages to the preferred channel (system software)
Profile/Classify Applications
Profiling: collect last-level-cache Misses Per Kilo Instruction (MPKI) and row-buffer hit rate (RBH) of applications online.
Classification: if MPKI > MPKIt, the application is high intensity; otherwise, low intensity. Among high-intensity applications, if RBH > RBHt, the application has high row-buffer locality; otherwise, low row-buffer locality.
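The classification step can be sketched as a pair of threshold tests. The threshold values below are assumptions for illustration; MPKIt and RBHt are tunable parameters in the scheme:

```python
# Sketch of online application classification by MPKI and RBH.
# Threshold values are assumed for illustration, not from the talk.
MPKI_T = 10.0   # memory-intensity threshold (hypothetical value)
RBH_T = 0.5     # row-buffer hit-rate threshold (hypothetical value)

def classify(mpki, rbh):
    """Classify one application from its profiled MPKI and RBH."""
    if mpki <= MPKI_T:
        return "low-intensity"
    if rbh > RBH_T:
        return "high-intensity, high row-buffer locality"
    return "high-intensity, low row-buffer locality"

classify(0.5, 0.9)    # -> "low-intensity"
classify(20.0, 0.9)   # -> "high-intensity, high row-buffer locality"
```

Note that row-buffer locality only matters for the high-intensity group; low-intensity applications are not subdivided further.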
Partition Between Low- and High-Intensity Groups
[Slide diagram: channels 1–2 assigned to the low-intensity group, channels 3–4 to the high-intensity group.]
Channels are assigned proportional to the number of applications in each group.
Partition Between Low- and High-RBH Groups
[Slide diagram: channel 3 assigned to the high-intensity, low row-buffer-locality group; channel 4 to the high-intensity, high row-buffer-locality group.]
Within the high-intensity group, channels are assigned proportional to the bandwidth demand of each subgroup.
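Both partitioning steps are proportional splits with different weights: application count between the intensity groups, bandwidth demand within the high-intensity group. A sketch, with the minimum-one-channel-per-group guarantee as an assumed simplification:

```python
# Sketch of proportional channel partitioning between two groups.
# The "at least one channel per group" floor is an assumed simplification.
def split_channels(num_channels, weight_a, weight_b):
    """Assign channels to two groups proportionally to their weights."""
    a = round(num_channels * weight_a / (weight_a + weight_b))
    a = min(max(a, 1), num_channels - 1)  # each group gets >= 1 channel
    return a, num_channels - a

# Step 1 -- between intensity groups, weighted by application count:
split_channels(4, 2, 2)   # 2 low- and 2 high-intensity apps -> (2, 2)
# Step 2 -- within the high-intensity group, weighted by bandwidth demand
# (e.g., summed MPKI of the low-RBH vs. high-RBH subgroups):
split_channels(2, 1, 1)   # equal demand -> (1, 1)
```

The same helper serves both steps; only the meaning of the weights changes.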
Preferred Channel Assignment/Allocation
Load-balance each group's bandwidth demand across the group's allocated channels; each application now has a preferred channel. Pages are allocated to the preferred channel on first touch: the operating system assigns a page on the preferred channel if a free page is available; else, it uses a modified replacement policy to preferentially choose a replacement candidate from the preferred channel.
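The first-touch allocation policy can be sketched as follows. The per-channel free lists and the `evict_candidate` callback are hypothetical simplifications of the OS page allocator and the modified replacement policy:

```python
# Sketch of preferred-channel page allocation on first touch.
# Per-channel free lists and evict_candidate are assumed abstractions.
def allocate_page(preferred, free_lists, evict_candidate):
    """Return (channel, page frame): prefer a free page on `preferred`."""
    if free_lists[preferred]:
        return preferred, free_lists[preferred].pop()
    # No free page on the preferred channel: the modified replacement
    # policy preferentially picks a victim frame from that channel.
    return preferred, evict_candidate(preferred)

free = {0: [100, 101], 1: []}
allocate_page(0, free, evict_candidate=lambda ch: -1)  # free page on ch 0
allocate_page(1, free, evict_candidate=lambda ch: -1)  # must evict on ch 1
```

Because placement happens on first touch, no pages need to be migrated later; the partitioning takes effect as the application faults its pages in.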
Integrating Partitioning and Scheduling
[Slide diagram: memory scheduling and memory partitioning shown as two approaches to inter-application interference mitigation, combined into integrated memory partitioning and scheduling.]
Integrated Memory Partitioning and Scheduling (IMPS)
Applications with very low memory intensities (< 1 MPKI) do not need dedicated bandwidth; in fact, dedicating bandwidth to them results in wastage. These applications need short access latencies and interfere minimally with other applications.
Solution: always prioritize them in the scheduler, and handle the other applications via memory channel partitioning.
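The IMPS scheduling rule can be sketched on top of a plain oldest-first order; the request representation (arrival order, application id) is a simplification:

```python
# Sketch of the IMPS scheduler-side rule: requests from very-low-intensity
# applications (< 1 MPKI) always win; other applications are handled by
# MCP's channel partitioning and an ordinary scheduling order.
VERY_LOW_MPKI = 1.0

def imps_pick(requests, app_mpki):
    """requests: list of (arrival_order, app_id).
    Oldest very-low-intensity request wins; else oldest overall."""
    low = [r for r in requests if app_mpki[r[1]] < VERY_LOW_MPKI]
    pool = low if low else requests
    return min(pool, key=lambda r: r[0])

mpki = {"A": 0.2, "B": 25.0}          # A is very low intensity
queue = [(0, "B"), (1, "B"), (2, "A")]
imps_pick(queue, mpki)                 # -> (2, "A"): A jumps the queue
```

This keeps the hardware addition tiny (a single priority bit per request, conceptually), which is where IMPS's "minimal extra hardware complexity" claim comes from.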
Methodology
Core model: 4 GHz out-of-order processor, 128-entry instruction window, 512 KB cache/core.
Memory model: DDR2, 1 GB capacity, 4 channels, 4 banks/channel, row interleaved. Row hit: 200 cycles; row miss: 400 cycles.
Comparison to Previous Scheduling Policies
MCP performs 1% better than TCM (the best previous scheduler) at no extra hardware complexity. IMPS performs 5% better than TCM at minimal extra hardware complexity. Both perform consistently well across all intensity categories.
Comparison to AFT/DPM (Awasthi et al., PACT'11)
MCP and IMPS outperform AFT and DPM by 7% and 12.4%, respectively (across 40 workloads). Application-aware page allocation mitigates inter-application interference better.
Future Work
Further exploration of integrated memory partitioning and scheduling for system performance. Integrated partitioning and scheduling for fairness. Workload-aware memory scheduling.