1 Thesis Defense Lavanya Subramanian
Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management. Thesis Defense, Lavanya Subramanian. Committee: Onur Mutlu (Advisor), Greg Ganger, James Hoe, Ravi Iyer (Intel).

2 The Multicore Era Core Main Memory Cache

3 The Multicore Era Multiple applications execute in parallel
Shared Cache Main Memory Core Core Interconnect Multiple applications execute in parallel High throughput and efficiency

4 Challenge: Interference at Shared Resources
Core Core Shared Cache Main Memory Core Core Interconnect

5 Impact of Shared Resource Interference
gcc (core 1), mcf (core 1). This graph shows SPEC CPU 2006 applications sharing main memory on a two-core system: leslie3d runs on one core while gcc or mcf runs on the other. The y-axis shows each application's slowdown as compared to when the application is run standalone on the same system. Now, let's look at leslie3d. It slows down by 2x when run with gcc. Let's look at another case, when leslie3d is run with another application, mcf. leslie3d now slows down by 5.5x, while it slowed down by 2x when run with gcc. leslie3d experiences different slowdowns when run with different applications. More generally, an application's performance depends on which applications it is running with. Such performance unpredictability is undesirable. 1. High application slowdowns 2. Unpredictable application slowdowns

6 Why Predictable Performance?
There is a need for predictable performance when multiple applications share resources, especially if some applications require performance guarantees. Example 1: In server systems, different users' jobs are consolidated onto the same server, and there is a need to provide bounded slowdowns to critical jobs. Example 2: In mobile systems, interactive applications run with non-interactive applications, and there is a need to guarantee performance for interactive applications. …because there is a need for predictable performance when multiple applications share resources, and especially when some applications require guarantees on performance. For example, in mobile systems, interactive applications are run with non-interactive ones, and there is a need to provide guarantees on performance to the interactive applications. In server systems, different users' jobs are heavily consolidated onto the same server. In such a scenario, there is a need to achieve bounded slowdowns for critical jobs of users. These are a couple of examples of systems that need predictable performance. There are many other systems (such as …) in which performance predictability is necessary. Therefore, our goal in this work is to provide predictable performance in the presence of memory interference.

7 Thesis Statement Goals
High and predictable performance can be achieved in multicore systems through simple/implementable mechanisms to mitigate and quantify shared resource interference Goals Approaches

8 2. Predictable Performance Mitigate Interference Quantify Interference
Goals and Approaches Goals: 1. High Performance 2. Predictable Performance Approaches: Mitigate Interference Quantify Interference

9 Focus Shared Resources in This Thesis
Core Core Shared Cache Capacity Main Memory Core Core Interconnect Main Memory Bandwidth

10 Related Prior Work Mitigate Interference Quantify Interference
Cache Capacity: mitigation is much explored and not our focus (CQoS (ICS '04), UCP (MICRO '06), DIP (ISCA '07), DRRIP (ISCA '10), EAF (PACT '12)); PCASA (DATE '12) and Ubik (ASPLOS '14) aim to meet a resource allocation target, also not our focus; FST (ASPLOS '10) and PTCA (TACO '13) quantify interference, but their challenge is high inaccuracy. Memory Bandwidth: STFM (MICRO '07), PARBS (ISCA '08), ATLAS (HPCA '10), TCM (MICRO '10), and criticality-aware scheduling (ISCA '13) mitigate interference, but their challenge is high complexity; STFM (MICRO '07) quantifies interference, but its challenge is high inaccuracy.

11 Outline Mitigate Interference Quantify Interference
Cache capacity mitigation: much explored, not our focus. This thesis: the Blacklisting Memory Scheduler, the Memory Interference-induced Slowdown Estimation Model and its uses, and the Application Slowdown Model and its uses.

12 Outline Mitigate Interference Quantify Interference
Cache capacity mitigation: much explored, not our focus. This thesis: the Blacklisting Memory Scheduler, the Memory Interference-induced Slowdown Estimation Model and its uses, and the Application Slowdown Model and its uses.

13 Background: Main Memory
(Figure: main memory organized as banks (Bank 0-3), each with rows, columns, and a row buffer; accesses either hit or miss in the row buffer; the memory controller sends requests over a channel.) FR-FCFS Memory Scheduler [Zuravleff and Robinson, US Patent '97; Rixner et al., ISCA '00]: row-buffer hit first, then older request first. Unaware of inter-application interference.
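To make the policy concrete, here is a minimal Python sketch of FR-FCFS request selection (the Request fields and open-row bookkeeping are illustrative assumptions, not code from the talk):

```python
# Minimal sketch of FR-FCFS request selection (illustrative; not from the talk).
class Request:
    def __init__(self, arrival_time, bank, row):
        self.arrival_time = arrival_time  # older requests have smaller arrival_time
        self.bank = bank                  # bank this request targets
        self.row = row                    # row this request targets

def frfcfs_pick(request_buffer, open_row_per_bank):
    """Pick the next request: row-buffer hits first, then older requests first.
    Note: application-unaware; the policy never looks at which core issued a request."""
    def priority(req):
        row_hit = (open_row_per_bank.get(req.bank) == req.row)
        # Sort ascending: hits (0) before misses (1), then by age.
        return (0 if row_hit else 1, req.arrival_time)
    return min(request_buffer, key=priority) if request_buffer else None
```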

14 Tackling Inter-Application Interference: Application-aware Memory Scheduling
(Figure: request buffer with per-request application IDs (AIDs); the controller monitors applications, ranks them, and enforces the ranking by scheduling requests from the highest-ranked AID first.) The key principles behind application-aware memory request scheduling are to first monitor applications' memory access characteristics, then rank applications based on these characteristics. A memory access characteristic that several application-aware schedulers use is memory intensity (the rate at which the application generates accesses to memory); low memory intensity applications are ranked higher. Once a ranking is computed, requests are prioritized at the memory controller to obey this ranking. For instance, in this example, the blue application is the highest ranked, so its requests are given highest priority. In this work, we observe that this kind of ranking-based scheduler increases critical path latency and area significantly in order to improve performance and fairness among applications. Full ranking increases critical path latency and area significantly to improve performance and fairness.

15 Performance vs. Fairness vs. Simplicity
(Figure: three-axis plot of performance, fairness, and simplicity. The app-unaware FRFCFS scheduler is very simple but has low performance and fairness; app-aware ranking schedulers PARBS, ATLAS, and TCM are complex; our solution, Blacklisting (no ranking), aims for all three.) This chart shows performance, fairness and simplicity on three axes. The ideal scheduler should maximize all three of these quantities. Now, let's look at where different schedulers lie in this space. First, let's look at a state-of-the-art application-unaware scheduler. It is simple, but has low performance and fairness. Next, let's look at some application-aware schedulers. One scheduler optimizes for fairness and performance, but gives up simplicity. Another optimizes for performance, but gives up simplicity and fairness. A third optimizes for both performance and fairness, but gives up simplicity. One common attribute of these application-aware schedulers is that they trade off simplicity to achieve performance and/or fairness, since they all use a ranking-based scheme like the one I described in the previous slide. In this work, we ask the key question: is it essential to give up simplicity in order to optimize for performance and/or fairness? We propose a new memory scheduling algorithm that is simple, high performance and fair, by eliminating the idea of employing a full ordered ranking among all applications. Our solution achieves all three goals.

16 Problems with Previous Application-aware Memory Schedulers
1. Full ranking increases hardware complexity 2. Full ranking causes unfair slowdowns Our Goal: Design a memory scheduler with Low Complexity, High Performance, and Fairness Having described these two major problems with full ranking-based application-aware schedulers, our goal in this work is to overcome these problems and design a new memory scheduler that has low complexity, high performance and high fairness.

17 Key Observation 1: Group Rather Than Rank
Observation 1: Sufficient to separate applications into two groups, rather than do full ranking. (Figure: instead of monitoring and ranking all applications, applications are grouped into an interference-causing group and a vulnerable group.) We first observe that it is sufficient to separate applications into two groups, rather than perform a full ordered ranking across all applications. Like I described earlier, application-aware memory schedulers typically monitor applications' memory access characteristics and perform a full ordered ranking across all applications based on these characteristics. We observe that such a full ordered ranking is not necessary. Instead, applications can be classified into only two groups, one containing vulnerable applications and the other containing interference-causing applications. Requests from the vulnerable application group are prioritized over requests from the interference-causing application group. Apart from this single level of prioritization, there is no ranking between applications' requests within a group. One clear benefit of such a grouping-based scheduler is reduced complexity: there is no need to rank or sort all applications, and the enforcement of a single level of priority incurs much lower complexity than the enforcement of a full ordered ranking. Benefit 1: Low complexity compared to ranking. Benefit 2: Lower slowdowns than ranking.

18 Key Observation 1: Group Rather Than Rank
Observation 1: Sufficient to separate applications into two groups, rather than do full ranking. How to classify applications into groups?

19 Key Observation 2 Observation 2: Serving a large number of consecutive requests from an application causes interference. Basic Idea: Group applications with a large number of consecutive requests as interference-causing (blacklisting). Deprioritize blacklisted applications. Clear the blacklist periodically (1000s of cycles). Benefits: Lower complexity; finer-grained grouping decisions, hence lower unfairness. We observe that when a large number of consecutive requests are served from an application, it interferes with other applications since their requests are delayed. Based on this observation, we propose to group applications by monitoring the number of consecutive requests served from an application. Applications with a large number of consecutive requests served are classified as interference-causing; essentially, these applications are blacklisted. Requests from these blacklisted applications are deprioritized. This blacklist status is cleared periodically (at fine time granularities) in order to ensure that no applications are deprioritized for too long. This kind of grouping scheme has two main benefits. First, lower complexity compared to a scheme that monitors metrics like memory intensity on an interval basis: only a single bit needs to be maintained for each application to denote its blacklist status, rather than having to count and maintain memory intensity for every application. Second, this scheme enables fine-grained grouping decisions: applications are classified into groups based on monitoring a few consecutive requests, and the grouping is also cleared at fine time granularities. This prevents any application from being deprioritized for too long, unlike an interval-based scheme that monitors memory access characteristics over a long time and then classifies applications.

20 The Blacklisting Memory Scheduler (ICCD ‘14)
(Figure: the memory controller maintains a per-application blacklist bit vector and two counters, Last Req AID and # Consecutive Requests; the four steps are 1. Monitor, 2. Blacklist, 3. Prioritize, 4. Clear Periodically.) The Blacklisting memory scheduler, or BLISS for short, first monitors the number of consecutive requests from applications. At the memory controller, a bit vector is maintained to indicate the blacklist status of each application. Only two counters are required to track the number of consecutive requests: one that indicates the application ID (AID) of the last request and another that indicates the number of consecutive requests from the application that generated the last request. Here is an example request sequence; the first request is from application 2. When the number of consecutive requests hits a certain threshold, the application is blacklisted and the bit vector is updated to indicate this. The monitoring and blacklisting of applications are ongoing operations that happen throughout execution. The blacklist status of applications is used to determine request scheduling order: the blacklist status of each request is checked and a request from a non-blacklisted application is preferentially scheduled. This is just a one-bit comparison, as against an N-bit comparison when using a full ranking based application-aware scheduler. The blacklist information is cleared periodically in order to prevent any application from being blacklisted for a long time. Overall, the complexity of the different steps in the blacklisting memory scheduler is significantly lower than previous application-aware schedulers: only the number of consecutive requests served needs to be monitored, the blacklist information is just a bit vector, and the comparison logic to prioritize requests from non-blacklisted applications is simpler. Hence, this design is simple and more easily scalable to larger core counts.
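To make the bookkeeping concrete, here is a minimal Python sketch of the blacklisting logic described above. The threshold and clearing-interval values are illustrative placeholders (not the parameters evaluated in the talk), and requests are assumed to carry aid, bank, row, and arrival_time fields as in the earlier FR-FCFS sketch:

```python
# Minimal sketch of the Blacklisting (BLISS) bookkeeping; parameter values are illustrative.
class BlacklistingState:
    def __init__(self, num_apps, threshold=4, clear_interval=10000):
        self.blacklist = [False] * num_apps   # one bit per application
        self.last_req_aid = None              # AID of the last scheduled request
        self.consecutive = 0                  # # consecutive requests from that AID
        self.threshold = threshold
        self.clear_interval = clear_interval

    def on_request_scheduled(self, aid):
        # 1. Monitor consecutive requests; 2. Blacklist once the count crosses the threshold.
        if aid == self.last_req_aid:
            self.consecutive += 1
        else:
            self.last_req_aid, self.consecutive = aid, 1
        if self.consecutive >= self.threshold:
            self.blacklist[aid] = True

    def on_cycle(self, cycle):
        # 4. Clear the blacklist periodically so no application stays deprioritized too long.
        if cycle % self.clear_interval == 0:
            self.blacklist = [False] * len(self.blacklist)

def pick_request(requests, state, open_row_per_bank):
    # 3. Prioritize: non-blacklisted applications first, then row-buffer hits, then older requests.
    def priority(req):
        row_hit = (open_row_per_bank.get(req.bank) == req.row)
        return (1 if state.blacklist[req.aid] else 0,
                0 if row_hit else 1,
                req.arrival_time)
    return min(requests, key=priority) if requests else None
```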

21 Methodology Configuration of our simulated baseline system Workloads
24 cores; 4 channels, 8 banks/channel, DDR DRAM; 512 KB private cache/core. Workloads: SPEC CPU2006, TPC-C, Matlab, NAS; 80 multiprogrammed workloads. A brief overview of our methodology: our baseline simulated system has 24 cores and 4 memory channels. Our workloads are built from ….

22 Performance and Fairness
(Figure: pareto plot of unfairness (lower is better) vs. performance (higher is better); the annotated 5% and 21% are Blacklisting's gains over TCM in system performance and maximum slowdown, respectively.) First, let me present performance and fairness results. Here is a pareto plot showing performance on the x-axis and unfairness on the y-axis. The ideal memory scheduler would maximize performance while minimizing unfairness. Let's look at where different schedulers lie in this space. The FRFCFS scheduler optimizes for neither performance nor fairness. FRFCFS-Cap and PARBS reduce unfairness significantly while improving performance. Schedulers like ATLAS and TCM optimize for performance further, but at the cost of increasing unfairness. The blacklisting scheduler moves much closer to the ideal, achieving both high performance and fairness. We draw two major conclusions. First, the blacklisting memory scheduler achieves the highest performance among all previous schedulers we compare to. Second, the blacklisting memory scheduler achieves a good balance between performance and fairness, improving performance over the best performing previous schedulers, while approaching the fairness of the fairest previous schedulers. 1. Blacklisting achieves the highest performance. 2. Blacklisting balances performance and fairness.

23 Blacklisting reduces complexity significantly
(Figure: scheduler critical path latency vs. area; annotated reductions: 43% area and 70% critical path latency.) Next, I will present our complexity results. This plot shows critical path latency on the x-axis and scheduler area on the y-axis. The ideal scheduler would minimize both critical path latency and area. Let's look at where different schedulers lie in this space. The application-unaware FRFCFS scheduler has both low critical path latency and area, as we would expect. The FRFCFS-Cap scheduler, which is very similar to the FRFCFS scheduler in terms of implementation, has similar area and critical path latency. The application-aware schedulers PARBS, ATLAS and TCM increase both critical path latency and area significantly over the application-unaware schedulers. Our solution, the blacklisting scheduler, has significantly lower critical path latency and area than the application-aware schedulers and is much closer in complexity to the application-unaware schedulers. In fact, the blacklisting scheduler has 43% lower area and 70% lower critical path latency than previous application-aware schedulers. Blacklisting reduces complexity significantly.

24 Outline Mitigate Interference Quantify Interference
Cache capacity mitigation: much explored, not our focus. This thesis: the Blacklisting Memory Scheduler, the Memory Interference-induced Slowdown Estimation Model and its uses, and the Application Slowdown Model and its uses.

25 Impact of Interference on Performance
Alone (No interference) time Execution time Shared (With interference) time Execution time Impact of Interference

26 Slowdown: Definition Slowdown of an application is the ratio of its performance when it is run stand-alone on a system to its performance when it is sharing the system with other applications. An application's performance when it is sharing the system with other applications can be measured in a straightforward manner. However, measuring an application's alone performance, without actually running it alone, is harder. This is where our first observation comes in.
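Written out, this definition is:

Slowdown = PerformanceAlone / PerformanceShared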

27 Impact of Interference on Performance
Alone (No interference) time Execution time Shared (With interference) time Execution time Impact of Interference Previous Approach: Estimate impact of interference at a per-request granularity Difficult to estimate due to request overlap

28 Outline Mitigate Interference Quantify Interference
Cache capacity mitigation: much explored, not our focus. This thesis: the Blacklisting Memory Scheduler, the Memory Interference-induced Slowdown Estimation Model and its uses, and the Application Slowdown Model and its uses.

29 Observation: Request Service Rate is a Proxy for Performance
For a memory-bound application, performance is proportional to its memory request service rate. (Figure: measured on an Intel Core i7, 4 cores.) We observe that for a memory bound application, its performance is roughly proportional to the rate at which its memory requests are serviced. This plot shows the results of our experiment on an Intel 4-core system. We run omnetpp, a SPEC CPU 2006 benchmark, along with another memory hog whose memory intensity can be varied. This, in turn, lets us vary the rate at which the memory requests of omnetpp are serviced. The x-axis shows the request service rate of omnetpp, normalized to when it is run alone. The y-axis shows the performance of omnetpp, normalized to when it is run alone. As can be seen, there is a very strong, y=x like correlation between request service rate and performance. We observe the same trend for various memory bound applications; here are a couple more examples. Therefore, memory request service rate can be used as a proxy for performance. Now, going back to the slowdown equation: we can replace performance by request service rate, based on our first observation, enabling us to express slowdown as the ratio of alone to shared request service rates. Shared request service rate can be measured in a relatively straightforward manner when an application is run along with other applications (easy). Measuring alone request service rate, without actually running an application alone, is harder (difficult). This is where our second observation comes into play.

30 Observation: Highest Priority Enables Request Service Rate Alone Estimation
Request Service Rate Alone (RSRAlone) of an application can be estimated by giving the application highest priority at the memory controller. Highest priority leads to little interference (almost as if the application were run alone). We observe that an application's alone request service rate can be estimated by giving the application highest priority in accessing memory, because when an application has highest priority, it receives little interference from other applications, almost as though the application were run alone.

31 Observation: Highest Priority Enables Request Service Rate Alone Estimation
(Figure: request buffer state, service order, and time units for the three cases described below: 1. run alone, 2. run with another application, 3. run with another application but given highest priority.) Here is an example. An application, let's call it the red application, is run by itself on a system. Two of its requests are queued at the memory controller request buffers. Since these are the only two requests queued up for main memory at this point, the memory controller schedules them back to back and they are serviced in two time units. Now, the red application is run with another application, the blue application. The blue application's request makes its way between two of the red application's requests. A naive memory scheduler could potentially schedule the requests in the order in which they arrived, delaying the red application's second request by one time unit, as compared to when the application was run alone. Next, let's consider running the red and blue applications together, but giving the red application highest priority. In this case, the memory scheduler schedules the two requests of the red application back to back, servicing them in two time units. Therefore, when the red application is given highest priority, its requests are serviced as though the application were run alone.

32 Memory Interference-induced Slowdown Estimation (MISE) model for memory bound applications
Based on these two observations, our memory interference induced slowdown estimation (MISE) model for memory bound applications is as follows. Now, let’s analyze why this request service rate ratio based model holds for memory bound applications
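Following the two observations above, the model for a memory-bound application is the ratio of request service rates:

Slowdown = RSRAlone / RSRShared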

33 Observation: Memory Bound vs. Non-Memory Bound
Memory-bound application Compute Phase Memory Phase No interference Req Req Req time With interference Req Req Req A memory-bound application spends significant fraction of its execution time waiting for memory. Let’s first look at the case when a memory-bound application is not receiving any interference. The application does a small amount of compute at the core and then sends out a bunch of memory requests. The application cannot do any more compute until these requests are serviced. Now, when this application experiences interference, the length of its compute phase does not change. However, the time taken to service the memory requests increases. In essence, since the application spends most of its time in the memory phase, the memory induced slowdown of the application dominates the overall slowdown of the application. time Memory phase slowdown dominates overall slowdown

34 Observation: Memory Bound vs. Non-Memory Bound
Non-memory-bound application: compute phase and memory phase, with and without interference. Memory Interference-induced Slowdown Estimation (MISE) model for non-memory-bound applications. However, this is not the case for a non-memory-bound application. A non-memory-bound application spends a sizeable fraction of its time doing compute at the core and a smaller fraction of time waiting for memory. When the application experiences interference, the length of the compute phase does not change, whereas the memory phase slows down with interference. Specifically, if alpha is the fraction of time the application spends in the memory phase, then 1 - alpha is the fraction of time it spends in the compute phase. The 1 - alpha fraction does not change with memory interference, while the alpha fraction slows down as the ratio of request service rate alone to request service rate shared. We augment the MISE model to take this into account: slowdown for non-memory-bound applications can be expressed as the sum of two quantities, 1 - alpha, the compute fraction that does not change with interference, and the memory-induced slowdown. Only the memory fraction (alpha) slows down with interference.
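With alpha denoting the memory phase fraction, the augmented model described above is:

Slowdown = (1 - alpha) + alpha x (RSRAlone / RSRShared)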

35 Interval Based Operation
(Figure: execution time divided into intervals; in each interval, measure RSRShared and estimate RSRAlone, then estimate slowdown at the interval boundary.) MISE operates on an interval basis. Execution time is divided into intervals. During an interval, each application's shared request service rate and alpha are measured, and its alone request service rate is estimated. At the end of the interval, slowdown is estimated as a function of these three quantities. This repeats during every interval. Now, let us look at how each of these quantities is measured or estimated.

36 Previous Work on Slowdown Estimation
STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO '07], FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS '10], Per-thread Cycle Accounting [Du Bois et al., HiPEAC '13]. Basic Idea: estimate slowdown from stall times. The most relevant previous works on slowdown estimation are STFM, stall time fair memory scheduling, and FST, fairness via source throttling. FST employs similar mechanisms as STFM for memory-induced slowdown estimation, therefore I will focus on STFM for the rest of the evaluation. STFM estimates slowdown as the ratio of shared to alone stall times. Stall time shared is easy to measure, analogous to shared request service rate. Stall time alone is harder, and STFM estimates it by counting the number of cycles an application receives interference from other applications.
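In equation form (the alone-stall-time expression is a paraphrase of the counting mechanism described above, not a formula shown on this slide):

Slowdown ≈ StallTimeShared / StallTimeAlone, where StallTimeAlone is estimated as StallTimeShared minus the counted interference cycles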

37 Methodology Configuration of our simulated system Workloads 4 cores
1 channel, 8 banks/channel, DDR DRAM; 512 KB private cache/core. Workloads: SPEC CPU2006; 300 multiprogrammed workloads. Next, I will quantitatively compare MISE and STFM. Before that, let me present our methodology. We carry out our evaluations on a 4-core, 1-channel DDR DRAM system. Our workloads are composed of SPEC CPU 2006 applications; we build 300 multiprogrammed workloads from these applications.

38 Quantitative Comparison
SPEC CPU 2006 application leslie3d This graph shows the actual slowdown of leslie3d, a memory-bound application, with time (in million cycles). Here are slowdown estimates from STFM and here are slowdown estimates from MISE. As can be seen, MISE’s slowdown estimates are much closer to the actual slowdowns. This is because MISE estimates leslie’s alone performance while giving it highest priority, whereas STFM tries to estimate it while leslie receives interference from other applications.

39 Comparison to STFM Average error of MISE: 8.2%
Average error of STFM: 29.4% (across 300 workloads). (Figure: slowdown estimates over time for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray.) Here are several more applications. For most applications, MISE's slowdown estimates are more accurate. On average, MISE's error is 8.2%, while STFM's error is 29.4%, across our 300 workloads.

40 Possible Use Cases of the MISE Model
Bounding application slowdowns [HPCA ’13] Achieving high system fairness and performance [HPCA ’13] VM migration and admission control schemes [VEE ’15] Fair billing schemes in a commodity cloud

41 MISE-QoS: Providing “Soft” Slowdown Guarantees
Goal: 1. Ensure QoS-critical applications meet a prescribed slowdown bound. 2. Maximize system performance for other applications. Basic Idea: Allocate just enough bandwidth to the QoS-critical application; assign the remaining bandwidth to other applications. The goal of this mechanism is to ensure QoS-critical applications meet a prescribed slowdown bound while maximizing system performance for the other applications. The basic idea is to allocate just enough bandwidth to the QoS-critical application, while assigning the remaining bandwidth among the other applications.

42 Methodology Baseline memory scheduling mechanism
Each application (25 applications in total) is considered the QoS-critical application and run with 12 sets of co-runners of different memory intensities, for a total of 300 multiprogrammed workloads. Each workload is run with 10 slowdown bound values. Baseline memory scheduling mechanism: always prioritize the QoS-critical application [Iyer et al., SIGMETRICS 2007]; other applications' requests scheduled in FR-FCFS order [Zuravleff and Robinson, US Patent 1997; Rixner+, ISCA 2000].

43 A Look at One Workload MISE is effective in
(Figure: slowdowns of the QoS-critical and non-QoS-critical applications for slowdown bounds of 10, 3.33, and 2.) MISE is effective in (1) meeting the slowdown bound for the QoS-critical application and (2) improving performance of non-QoS-critical applications.

44 Effectiveness of MISE in Enforcing QoS
Across 3000 data points:

                      Predicted Met    Predicted Not Met
QoS Bound Met         78.8%            2.1%
QoS Bound Not Met     2.2%             16.9%

MISE-QoS meets the bound for 80.9% of workloads. MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads. AlwaysPrioritize meets the bound for 83% of workloads.

45 Performance of Non-QoS-Critical Applications
When slowdown bound is 10/3 MISE-QoS improves system performance by 10% Higher performance when bound is loose

46 Outline Mitigate Interference Quantify Interference
Cache capacity mitigation: much explored, not our focus. This thesis: the Blacklisting Memory Scheduler, the Memory Interference-induced Slowdown Estimation Model and its uses, and the Application Slowdown Model and its uses.

47 Shared Cache Capacity Contention
Core Core Shared Cache Capacity Main Memory Core Core

48 Cache Capacity Contention
Access Rate Core Shared Cache Main Memory Core Priority Applications evict each other’s blocks from the shared cache

49 Outline Mitigate Interference Quantify Interference
Cache capacity mitigation: much explored, not our focus. This thesis: the Blacklisting Memory Scheduler, the Memory Interference-induced Slowdown Estimation Model and its uses, and the Application Slowdown Model and its uses.

50 Estimating Cache and Memory Slowdowns
(Figure: a multicore system with cores sharing a cache and main memory, annotated with the cache service rate and the memory service rate.)

51 Service Rates vs. Access Rates
(Figure: the same system, annotated with the cache access rate and the cache service rate.) Request service and access rates are tightly coupled.

52 The Application Slowdown Model
(Figure: the multicore system annotated with the cache access rate, which the Application Slowdown Model uses in place of request service rates.)
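In the same form as the MISE equation, but with cache access rates (CAR) in place of request service rates, the model is:

Slowdown = CARAlone / CARShared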

53 Real System Studies: Cache Access Rate vs. Slowdown

54 Challenge How to estimate alone cache access rate? Shared Main Memory
(Figure: the application of interest is given priority at the shared cache and main memory, and an auxiliary tag store is added alongside the shared cache.)

55 Auxiliary tag store tracks such contention misses
(Figure: a block evicted due to another application's interference misses in the shared cache but is still present in the auxiliary tag store; such misses are contention misses.) The auxiliary tag store tracks such contention misses.

56 Accounting for Contention Misses
Revisiting alone memory request service rate Cycles serving contention misses should not count as high priority cycles

57 Alone Cache Access Rate Estimation
Cache Contention Cycles: Cycles spent serving contention misses From auxiliary tag store when given high priority Measured when given high priority
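Putting these pieces together, one way to write the estimate implied by this slide (a paraphrase, not a formula shown here) is:

CARAlone ≈ (# shared cache accesses during high-priority epochs) / (# high-priority cycles - # cache contention cycles)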

58 Application Slowdown Model (ASM)
(Figure: the multicore system annotated with the cache access rate used by the Application Slowdown Model.)

59 Previous Work on Slowdown Estimation
STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO '07], FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS '10], Per-thread Cycle Accounting [Du Bois et al., HiPEAC '13]. Basic Idea: estimate slowdown from stall times. The most relevant previous works on slowdown estimation are STFM, stall time fair memory scheduling, and FST, fairness via source throttling. FST employs similar mechanisms as STFM for memory-induced slowdown estimation, therefore I will focus on STFM for the rest of the evaluation. STFM estimates slowdown as the ratio of shared to alone stall times. Stall time shared is easy to measure, analogous to shared request service rate. Stall time alone is harder, and STFM estimates it by counting the interference experienced by each request.

60 Model Accuracy Results
Select applications Average error of ASM’s slowdown estimates: 10%

61 Leveraging ASM’s Slowdown Estimates
Slowdown-aware resource allocation for high performance and fairness Slowdown-aware resource allocation to bound application slowdowns VM migration and admission control schemes [VEE ’15] Fair billing schemes in a commodity cloud

62 Cache Capacity Partitioning
Access Rate Core Shared Cache Main Memory Core Goal: Partition the shared cache among applications to mitigate contention

63 Cache Capacity Partitioning
(Figure: the shared cache organized as sets (Set 0 .. Set N-1) and ways, partitioned by ways between cores.) Previous partitioning schemes optimize for miss count. Problem: Not aware of performance and slowdowns.

64 ASM-Cache: Slowdown-aware Cache Way Partitioning
Key Requirement: Slowdown estimates for all possible way partitions Extend ASM to estimate slowdown for all possible cache way allocations Key Idea: Allocate each way to the application whose slowdown reduces the most
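A minimal Python sketch of the greedy way-allocation idea described above; estimate_slowdown(app, ways) is a placeholder standing in for ASM's slowdown estimate as a function of the number of ways allocated, not part of the talk:

```python
# Greedy slowdown-aware way partitioning (sketch). estimate_slowdown(app, ways) is a
# placeholder for ASM's estimated slowdown of 'app' if it is given 'ways' cache ways.
def asm_cache_partition(apps, num_ways, estimate_slowdown):
    allocation = {app: 0 for app in apps}
    for _ in range(num_ways):
        # Give the next way to the application whose slowdown would drop the most.
        def benefit(app):
            current = estimate_slowdown(app, allocation[app])
            with_extra_way = estimate_slowdown(app, allocation[app] + 1)
            return current - with_extra_way
        best = max(apps, key=benefit)
        allocation[best] += 1
    return allocation
```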

65 Memory Bandwidth Partitioning
Cache Access Rate Core Shared Cache Main Memory Core Goal: Partition the main memory bandwidth among applications to mitigate contention

66 ASM-Mem: Slowdown-aware Memory Bandwidth Partitioning
Key Idea: Allocate high priority proportional to an application’s slowdown Application i’s requests given highest priority at the memory controller for its fraction
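In equation form, with the allocation proportional to the estimated slowdowns, application i's fraction of highest-priority time at the memory controller is:

Fraction_i = Slowdown_i / (sum over all applications j of Slowdown_j)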

67 Coordinated Resource Allocation Schemes
Cache capacity-aware bandwidth allocation. 1. Employ ASM-Cache to partition cache capacity. 2. Drive ASM-Mem with slowdowns from ASM-Cache.

68 Fairness and Performance Results
16-core system 100 workloads Significant fairness benefits across different channel counts

69 Outline Mitigate Interference Quantify Interference Much explored
Cache capacity mitigation: not our focus. This thesis: the Application Slowdown Model and its uses, the Memory Interference-induced Slowdown Estimation Model and its uses, and the Blacklisting Memory Scheduler, across cache capacity and memory bandwidth.

70 Thesis Contributions Principles behind our scheduler and models
Simple two-level prioritization sufficient to mitigate interference Request service rate a proxy for performance Simple and high-performance memory scheduler design Accurate slowdown estimation models Mechanisms that leverage our slowdown estimates

71 Summary Problem: Shared resource interference causes high and unpredictable application slowdowns Goals: High and predictable performance Approaches: Mitigate and quantify interference Thesis Contributions: Principles behind our scheduler and models Simple and high-performance memory scheduler Accurate slowdown estimation models Mechanisms that leverage our slowdown estimates

72 Future Work Leveraging slowdown estimates at the system and cluster level Interference estimation and performance predictability for multithreaded applications Performance predictability in heterogeneous systems Coordinating the management of main memory and storage

73 Research Summary Predictable performance in multicore systems
[HPCA ’13, SuperFri ’14, KIISE ’15] High and predictable performance in heterogeneous systems [ISCA ’12, SAFARI Tech Report ’15] Low-complexity memory scheduling [ICCD ’14] Memory channel partitioning [MICRO ’11] Architecture-aware cluster management [VEE ’15] Low-latency DRAM architectures [HPCA ’13]

74 Backup Slides

75 Blacklisting

76 Problems with Previous Application-aware Memory Schedulers
1. Full ranking increases hardware complexity 2. Full ranking causes unfair slowdowns First, the high hardware complexity of full ranking.

77 Ranking Increases Hardware Complexity
(Figure: the request buffer with per-request application IDs (AIDs); the controller monitors applications, ranks them, and enforces ranks by comparing each request's AID against the highest ranked AID, then the next highest ranked AID, and so on.) As we saw in the beginning of the talk, application-aware schedulers first monitor applications' memory access characteristics, rank applications based on these characteristics, and prioritize requests at the memory controller to obey this ranking. For instance, the highest ranked application is the blue application with ID 2. So, the application ID of requests in the request buffer is checked against 2 to see if there is a request from the highest ranked application with ID 2. Since there isn't, the requests' application IDs are compared against the next highest ranked application ID, and so on. The monitoring, ranking and enforcement of a full ordered ranking across all applications incurs high hardware complexity. Furthermore, this hardware complexity increases with application and core count. Hardware complexity increases with application/core count.

78 Ranking Increases Hardware Complexity
(Figure: critical path latency and area from synthesis of RTL implementations using a 32nm library; the application-aware scheduler has 8x the critical path latency and 1.8x the area of the application-unaware scheduler.) These two plots show the critical path latency and area of an application-unaware scheduler and a state-of-the-art full-ranking based application-aware scheduler, from synthesis of RTL implementations, for a 24-core system. The critical path latency of the application-aware scheduler is 8x that of the application-unaware scheduler, while its area is 1.8x that of the application-unaware scheduler. Clearly, ranking-based application-aware schedulers incur high hardware cost.

79 Problems with Previous Application-aware Memory Schedulers
1. Full ranking increases hardware complexity 2. Full ranking causes unfair slowdowns Next, the unfair slowdowns caused by full ranking.

80 Ranking Causes Unfair Slowdowns
GemsFDTD (high memory intensity), sjeng (low memory intensity). In order to understand this problem, we look at two applications, GemsFDTD and sjeng, from the SPEC CPU 2006 suite. These applications are run as part of a 24-application workload. These plots show execution time on the x-axis and the number of requests served during different execution time intervals on the y-axis, first for a state-of-the-art application-unaware scheduler and then for a state-of-the-art application-aware scheduler that employs ranking. With the ranking-based scheduler, there are long periods of time when no requests from GemsFDTD are served. This is because GemsFDTD has high memory intensity while sjeng has low memory intensity. The ranking-based application-aware scheduler prioritizes requests of low memory-intensity applications over requests of high memory-intensity applications. In fact, it employs a full ordered ranking of all applications, thereby prioritizing requests of some high memory intensity applications over requests of other high memory intensity applications. As a result, service to high memory-intensity applications like GemsFDTD is denied. Full ordered ranking of applications: GemsFDTD denied request service.

81 Ranking Causes Unfair Slowdowns
GemsFDTD (high memory intensity) sjeng (low memory intensity) The effect of this denial of service is high application slowdowns. GemsFDTD’s slowdown increases to close to 8x from 5x when a ranking-based application-aware scheduler is used, while sjeng’s slowdown reduces from 2x to ~1x. We see a lot of instances where high memory intensity applications are slowed down significantly and unfairly when a ranking-based application-aware scheduler is employed. Ranking-based application-aware schedulers cause unfair slowdowns

82 Key Observation 1: Group Rather Than Rank
GemsFDTD (high memory intensity) No unfairness due to denial of request service sjeng (low memory intensity) The other benefit of such a scheme is lower unfairness. If we go back to our earlier example with GemsFDTD and sjeng, GemsFDTD was denied request service for long periods of time when a ranking-based scheduler that ranked based on memory intensity was employed. However, when a scheduler that groups applications based on memory intensity is employed, such denial of service does not exist.

83 Key Observation 1: Group Rather Than Rank
GemsFDTD (high memory intensity), sjeng (low memory intensity). In fact, if we look at the slowdowns experienced by the two applications with a grouping-based scheduler, GemsFDTD's slowdown is significantly lower than with a ranking-based scheduler and actually slightly lower than even with an application-unaware scheduler. Furthermore, sjeng's slowdown with a grouping-based scheduler is on par with that of a ranking-based scheduler. In general, across a variety of applications and workloads, we see that a grouping-based scheduler has much lower slowdowns. Benefit 2: Lower slowdowns than ranking.

84 Previous Memory Schedulers
Application-unaware + Low complexity - Low performance and fairness FRFCFS [Zuravleff and Robinson, US Patent 1997, Rixner et al., ISCA 2000] Prioritizes row-buffer hits and older requests FRFCFS-Cap [Mutlu and Moscibroda, MICRO 2007] Caps number of consecutive row-buffer hits PARBS [Mutlu and Moscibroda, ISCA 2008] Batches oldest requests from each application; prioritizes batch Employs ranking within a batch ATLAS [Kim et al., HPCA 2010] Prioritizes applications with low memory-intensity TCM [Kim et al., MICRO 2010] Always prioritizes low memory-intensity applications Shuffles thread ranks of high memory-intensity applications Application-aware + High performance and fairness - High complexity We compare to a range of previous schedulers. FRFCFS… FRFCFS and FRFCFS-Cap are application-unaware. Hence, they are simple, but have low performance and fairness. PARBS, ATLAS and TCM are application-aware and improve performance and fairness. But, they have high complexity.

85 Performance and Fairness
5% 21% Ideal First, let me present performance and fairness results. Here is a pareto plot showing performance on the x-axis and unfairness on the y-axis. The ideal memory scheduler would maximize performance while minimizing unfairness. Let’s look at where different schedulers lie in this space. The FRFCFS scheduler optimizes for neither performance nor fairness. FRFCFS-Cap and PARBS reduce unfairness significantly while improving performance. Schedulers like ATLAS and TCM optimize for performance further, but at the cost of increasing unfairness. The blacklisting scheduler moves much closer to the ideal achieving both high performance and fairness. We draw two major conclusions. First, the blacklisting memory scheduler achieves the highest performance among all previous schedulers we compare to. Second, the blacklisting memory scheduler achieves a good balance between performance and fairness, improving performance over the best performing previous schedulers, while approaching the fairness of the fairest previous schedulers. 1. Blacklisting achieves the highest performance 2. Blacklisting balances performance and fairness

86 Performance vs. Fairness vs. Simplicity
Close to fairest Fairness FRFCFS FRFCFS - Cap Ideal PARBS ATLAS Highest performance TCM Blacklisting Performance Finally, let’s go back and look at the combined performance, fairness and simplicity space that I showed you at the beginning of the talk. The ideal scheduler has the highest performance and fairness, while being the simplest. The Blacklisting memory scheduler achieves the highest performance and close to the highest fairness, while approaching the simplicity of application-unaware schedulers. Hence, we conclude that the blacklisting scheduler is the closest to the ideal among all schedulers. Blacklisting is the closest scheduler to ideal Close to simplest Simplicity

87 Summary Applications’ requests interfere at main memory
Prevalent solution approach Application-aware memory request scheduling Key shortcoming of previous schedulers: Full ranking High hardware complexity Unfair application slowdowns Our Solution: Blacklisting memory scheduler Sufficient to group applications rather than rank Group by tracking number of consecutive requests Much simpler than application-aware schedulers at higher performance and fairness

88 Performance and Fairness
5% higher system performance and 21% lower maximum slowdown than TCM

89 Complexity Results
Blacklisting achieves 43% lower area than TCM. Blacklisting achieves 70% lower critical path latency than TCM.

90 Understanding Why Blacklisting Works
libquantum (High memory-intensity application) calculix (Low memory-intensity application) Blacklisting shifts the request distribution towards the right Blacklisting shifts the request distribution towards the left

91 Harmonic Speedup

92 Effect of Workload Memory Intensity

93 Combining FRFCFS-Cap and Blacklisting

94 Sensitivity to Blacklisting Threshold

95 Sensitivity to Clearing Interval

96 Sensitivity to Core Count

97 Sensitivity to Channel Count

98 Sensitivity to Cache Size

99 Performance and Fairness with Shared Cache

100 Breakdown of Benefits

101 BLISS vs. Criticality-aware Scheduling

102 Sub-row Interleaving

103 MISE

104 Measuring RSRShared and α
Request Service Rate Shared (RSRShared): per-core counter to track the number of requests serviced; measured at the end of each interval. Memory Phase Fraction (alpha): count the number of stall cycles at the core and compute the fraction of cycles stalled for memory. Shared request service rate can be measured using per-core counters to track the number of requests serviced during each interval. At the end of the interval, RSRShared can be computed as the number of requests serviced divided by the length of the interval. The memory phase fraction of each application can be computed by counting the number of cycles the core stalls waiting for memory and then computing what fraction of the interval length these cycles are. Measuring RSRShared and alpha is the straightforward part. Now, for the more interesting part: estimating the alone request service rate.
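In equation form, per interval:

RSRShared = (# requests serviced during the interval) / (interval length in cycles)
alpha = (# cycles stalled waiting for memory during the interval) / (interval length in cycles)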

105 Estimating Request Service Rate Alone (RSRAlone)
Goal: Estimate RSRAlone. How: Periodically give each application highest priority in accessing memory. Divide each interval into shorter epochs; at the beginning of each epoch, randomly pick an application as the highest priority application. At the end of an interval, for each application, estimate RSRAlone. Our goal is to estimate the alone request service rate. To do this, we periodically give each application the highest priority in accessing memory, based on observation 2. How exactly do we do this? First, each interval is divided into shorter epochs. At the beginning of each epoch, the memory controller picks one application at random and designates it the highest priority application. This way, each application is given highest priority at different epochs during the interval. At the end of the interval, RSRAlone is estimated for each application as the ratio of the number of requests of the application serviced during the high priority epochs to the number of cycles the application was given highest priority. Now, is this an accurate way of estimating RSRAlone? It turns out it is not; we have been ignoring a source of inaccuracy all along.
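As described above, the initial estimate is:

RSRAlone ≈ (# requests serviced during high-priority epochs) / (# cycles the application had highest priority)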

106 Inaccuracy in Estimating RSRAlone
When an application has highest priority, it still experiences some interference. (Figure: request buffer state and service order showing how a lower-priority application's in-flight request delays the highest-priority application's request; the extra wait is labeled Interference Cycles.) That is, even when an application has highest priority, it still experiences some interference from other applications. Here is an example showing why. The red application has highest priority and is run along with another application, the blue application. First, there is just one request in the request buffer and it is from the red application. Since this is the only request, it is scheduled by the memory controller and completes in one time unit. While this request is being serviced, there comes another request from the blue application. Our memory controller is work conserving and schedules another application's request if there is no request from the red application. So, it schedules the blue application's request next. However, just as the blue application's request is being scheduled comes another request from the red application. Ideally, we would like to preempt the blue application's request. However, because of the way DRAM operates, preempting the blue application's request incurs overhead. Therefore, the red application's request would have to wait until the blue application's request completes, and only then is it serviced. These extra cycles that the red application spent waiting for the blue application's request to complete are interference cycles that it would not have experienced had it been running alone.

107 Accounting for Interference in RSRAlone Estimation
Solution: Determine and remove interference cycles from RSRAlone calculation A cycle is an interference cycle if a request from the highest priority application is waiting in the request buffer and another application’s request was issued previously The solution to account for this interference is to determine and remove such interference cycles from the ARSR computation. How we do this is to subtract these cycles from the number of high priority epoch cycles. Now, the question is how do we determine which cycles are interference cycles. We deem a cycle an interference cycle if …. (read from slide)
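With that correction, the estimate becomes:

RSRAlone = (# requests serviced during high-priority epochs) / (# high-priority epoch cycles - # interference cycles)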

108 MISE Operation: Putting it All Together
Interval Interval time Measure RSRShared, Estimate RSRAlone Measure RSRShared, Estimate RSRAlone MISE operates on an interval basis. Execution time is divided into intervals. During an interval, each application’s shared request service rate and alpha are measured. And alone request service rate is estimated. At the end of the interval, slowdown is estimated as a function of these three quantities. This repeats during every interval. Now, let us look at how each of these quantities is measured/estimated. Estimate slowdown Estimate slowdown

109 MISE-QoS: Mechanism to Provide Soft QoS
Assign an initial bandwidth allocation to the QoS-critical application. Estimate the slowdown of the QoS-critical application using the MISE model. After every N intervals: if the estimated slowdown exceeds the bound B by more than a tolerance ε, increase the bandwidth allocation; if it is below the bound by more than ε, decrease the bandwidth allocation. When the slowdown bound is not met for N intervals, notify the OS so it can migrate/de-schedule jobs.
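A minimal Python sketch of this control loop, assuming hypothetical set_bandwidth and notify_os hooks and an illustrative step size; it is a sketch of the steps listed above, not the evaluated implementation:

```python
# Sketch of the MISE-QoS soft-guarantee loop; step size and hooks are illustrative.
# 'state' is expected to hold 'intervals', 'alloc', and 'violated' counters.
def mise_qos_step(qos_app, bound_B, epsilon, state, estimate_slowdown,
                  set_bandwidth, notify_os, N=10, step=0.05):
    slowdown = estimate_slowdown(qos_app)           # from the MISE model
    state['intervals'] += 1
    if state['intervals'] % N == 0:
        if slowdown > bound_B + epsilon:
            state['alloc'] = min(1.0, state['alloc'] + step)   # give the QoS app more bandwidth
        elif slowdown < bound_B - epsilon:
            state['alloc'] = max(0.0, state['alloc'] - step)   # return bandwidth to the others
        set_bandwidth(qos_app, state['alloc'])
    # Track how long the bound has been violated; escalate to the OS if it persists.
    state['violated'] = state['violated'] + 1 if slowdown > bound_B else 0
    if state['violated'] >= N:
        notify_os(qos_app)                          # OS can migrate or de-schedule jobs
```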

110 Performance of Non-QoS-Critical Applications
When slowdown bound is 10/3 MISE-QoS improves system performance by 10% Higher performance when bound is loose

111 Case Study with Two QoS-Critical Applications
Two comparison points Always prioritize both applications Prioritize each application 50% of time MISE-QoS provides much lower slowdowns for non-QoS-critical applications MISE-QoS can achieve a lower slowdown bound for both applications

112 Minimizing Maximum Slowdown
Goal Minimize the maximum slowdown experienced by any application Basic Idea Assign more memory bandwidth to the more slowed down application

113 Mechanism Memory controller tracks Different components of mechanism
Slowdown bound B Bandwidth allocation of all applications Different components of mechanism Bandwidth redistribution policy Modifying target bound Communicating target bound to OS periodically

114 Bandwidth Redistribution
At the end of each interval, Group applications into two clusters Cluster 1: applications that meet bound Cluster 2: applications that don’t meet bound Steal small amount of bandwidth from each application in cluster 1 and allocate to applications in cluster 2
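A minimal Python sketch of this redistribution step; the amount stolen per application is an illustrative parameter, not a value from the talk:

```python
# Sketch of per-interval bandwidth redistribution for minimizing maximum slowdown.
def redistribute(allocations, slowdowns, bound_B, steal=0.01):
    cluster1 = [a for a in allocations if slowdowns[a] <= bound_B]   # meet the bound
    cluster2 = [a for a in allocations if slowdowns[a] > bound_B]    # do not meet the bound
    if not cluster1 or not cluster2:
        return allocations
    pool = 0.0
    for a in cluster1:                    # take a small amount from each app that meets the bound
        taken = min(steal, allocations[a])
        allocations[a] -= taken
        pool += taken
    for a in cluster2:                    # spread it over the apps that miss the bound
        allocations[a] += pool / len(cluster2)
    return allocations
```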

115 Modifying Target Bound
If bound B is met for past N intervals Bound can be made more aggressive Set bound higher than the slowdown of most slowed down application If bound B not met for past N intervals by more than half the applications Bound should be more relaxed Set bound to slowdown of most slowed down application

116 Results: Harmonic Speedup

117 Results: Maximum Slowdown

118 Sensitivity to Memory Intensity (16 cores)

119 MISE: Per-Application Error
Benchmark         STFM   MISE    Benchmark         STFM   MISE
453.povray        56.3   0.1     473.astar         12.3   8.1
454.calculix      43.5   1.3     456.hmmer         17.9
400.perlbench     26.8   1.6     464.h264ref       13.7   8.3
447.dealII        37.5   2.4     401.bzip2         28.3   8.5
436.cactusADM     18.4   2.6     458.sjeng         21.3   8.8
450.soplex        29.8   3.5     433.milc          26.4   9.5
444.namd          43.6   3.7     481.wrf           33.6   11.1
437.leslie3d             4.3     429.mcf           83.74  11.5
403.gcc           25.4   4.5     445.gobmk         23.1   12.5
462.libquantum    48.9   5.3     483.xalancbmk     18     13.6
459.GemsFDTD      21.6   5.5     435.gromacs       31.4   15.6
470.lbm           6.9    6.3     482.sphinx3       21     16.8
                                 471.omnetpp       26.2   17.5
                                 465.tonto         32.7   19.5

120 Sensitivity to Epoch and Interval Lengths
Epoch Length    1 mil.   5 mil.   10 mil.   25 mil.   50 mil.   (Interval Length)
1000            65.1%    9.1%     11.5%     10.7%     8.2%
10000           64.1%    8.1%     9.6%      8.6%      8.5%
100000          64.3%    11.2%    8.9%      9%
                64.5%    31.3%    14.8%     14.9%     11.7%

121 Workload Mixes Mix No. Benchmark 1 Benchmark 2 Benchmark 3 1 sphinx3
leslie3d milc 2 sjeng gcc perlbench 3 tonto povray wrf 4 5 6 namd lbm 7 h264ref bzip2 libquantum 8 hmmer omnetpp 9 cactusADM 10 mcf 11 xalancbmk astar 12

122 STFM’s Effectiveness in Enforcing QoS
Across 3000 data points:

                      Predicted Met    Predicted Not Met
QoS Bound Met         63.7%            16%
QoS Bound Not Met     2.4%             17.9%

123 STFM vs. MISE’s System Performance

124 MISE’s Implementation Cost
1. Per-core counters worth 20 bytes: Request Service Rate Shared; Request Service Rate Alone (1 counter for number of high priority epoch requests, 1 counter for number of high priority epoch cycles, 1 counter for interference cycles); Memory phase fraction (alpha). 2. Register for current bandwidth allocation – 4 bytes. 3. Logic for prioritizing an application in each epoch.

125 MISE Accuracy w/o Interference Cycles
Average error – 23%

126 MISE Average Error by Workload Category
Workload Category (Number of memory-intensive applications)    Average Error
                                                                4.3%
1                                                               8.9%
2                                                               21.2%
3                                                               18.4%

127 ASM

128 Impact of Cache Capacity Contention
Shared Main Memory Shared Main Memory and Caches Cache capacity interference causes high application slowdowns

129 Error with Sampling

130 Error Distribution

131 Impact of Prefetching

132 Sensitivity to Epoch and Quantum Lengths

133 Sensitivity to Core Count

134 Sensitivity to Cache Capacity

135 Sensitivity to Auxiliary Tag Store Sampling

136 ASM-Cache: Fairness and Performance Results
Significant fairness benefits across different systems

137 ASM-Mem: Fairness and Performance Results
Significant fairness benefits across different systems

138 ASM-QoS: Meeting Slowdown Bounds

139 Previous Approach: Estimate Interference Experienced Per-Request
Shared (With interference) time Execution time Req A Req B Req C Request Overlap Makes Interference Estimation Per-Request Difficult

140 Estimating PerformanceAlone
Shared (With interference) Execution time Req A Req B Request Queued Req C Request Served Difficult to estimate impact of interference per-request due to request overlap

141 Impact of Interference on Performance
Alone (No interference) time Execution time Shared (With interference) time Execution time Impact of Interference Previous Approach: Estimate impact of interference at a per-request granularity Difficult to estimate due to request overlap

142 Application-aware Memory Channel Partitioning
Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling Our First Approach: Application-Aware Memory Channel Partitioning Our Second Approach: Integrated Memory Partitioning and Scheduling

143 Observation: Modern Systems Have Multiple Channels
Core Memory Red App Memory Controller Channel 0 Core Memory Blue App Memory Controller Channel 1 A new degree of freedom Mapping data across multiple channels

144 Data Mapping in Current Systems
Core Memory Red App Memory Controller Page Channel 0 Core Memory Blue App Memory Controller Channel 1 Causes interference between applications’ requests

145 Partitioning Channels Between Applications
Core Memory Red App Memory Controller Page Channel 0 Core Memory Blue App Memory Controller Channel 1 Eliminates interference between applications’ requests

146 Integrated Memory Partitioning and Scheduling
Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling Our First Approach: Application-Aware Memory Channel Partitioning Our Second Approach: Integrated Memory Partitioning and Scheduling

147 Slowdown/Interference Estimation in Existing Systems
(Figure: a multicore system with a shared cache and main memory.) How do we detect/mitigate the impact of interference on a real system using existing performance counters?

148 Our Approach: Mitigating Interference in a Cluster
Detect memory bandwidth contention at each host Estimate impact of moving each VM to a non-contended host (cost-benefit analysis) Execute the migrations that provide the most benefit

149 Architecture-aware DRM – ADRM (VEE 2015)

150 ADRM: Key Ideas and Results
Memory bandwidth captures impact of shared cache and memory bandwidth interference Model degradation in performance as linearly proportional to bandwidth increase/decrease Key Results: Average performance improvement of 9.67% on a 4-node cluster

151 QoS in Heterogeneous Systems
Staged memory scheduling, in collaboration with Rachata Ausavarungnirun, Kevin Chang and Gabriel Loh. Goal: High performance in CPU-GPU systems. Memory scheduling in heterogeneous systems, in collaboration with Hiroyuki Usui. Goal: Meet deadlines for accelerators while improving performance.

152 Performance Predictability in Heterogeneous Systems
(Figure: CPU cores and hardware accelerators sharing a cache and main memory.)

153 Goal of our Scheduler (SQUASH)
Goal: Design a memory scheduler that Meets accelerators’ deadlines and Achieves high CPU performance Basic Idea: Different CPU applications and hardware accelerators have different memory requirements Track progress of different agents and prioritize accordingly

154 Key Observation: Distribute Priority for Accelerators
Accelerators need priority to meet deadlines Worst case prioritization not always the best Prioritize accelerators when they are not on track to meet a deadline Distributing priority mitigates impact of accelerators on CPU cores’ requests

155 Key Observation: Not All Accelerators are Equal
Long-deadline accelerators are more likely to meet their deadlines Short-deadline accelerators are more likely to miss their deadlines Schedule short-deadline accelerators based on worst-case memory access time

156 Key Observation: Not All CPU cores are Equal
Memory-intensive cores are much less vulnerable to interference Memory non-intensive cores are much more vulnerable to interference Prioritize accelerators over memory-intensive cores to ensure accelerators do not become urgent

157 SQUASH: Key Ideas and Results
Distribute priority for HWAs Prioritize HWAs over memory-intensive CPU cores even when not urgent Prioritize short-deadline-period HWAs based on worst case estimates Improves CPU performance by 7-21% Meets 99.9% of deadlines for HWAs

