QoS-aware Flash Memory Controller Thank you for the introduction. Hello, I’m Bryan Kim from Seoul National University, and it is my pleasure to present our work on a QoS-aware flash memory controller. In this work, we address the performance variation and unpredictability in flash memory-based storage systems. We address these issues with a proposed design of the QoS-aware flash controller, and the design we present reduces not only the average response time, but also the 99.9th percentile response time by a large margin. Bryan S. Kim and Sang Lyul Min Seoul National University
Devices and applications Flash memory ubiquity NAND flash memory Flash-based storage Devices and applications Before we get into the details of the design, I’ll go through a short background on flash memory. Flash memory has become widely popular in recent years. It is used in a variety of storage devices, from USB sticks to enterprise-class SSDs, and these in turn are used in a wide range of devices and applications. They are used in mobile devices like our smartphones and laptops, but also in large-scale systems for applications such as databases, high-performance computing, and video streaming. * Images from various sources via google search Background Design Evaluation Conclusion
Flash memory eccentricities Hello World RTAS 2016 “Hello” “World” “RTAS” However, flash memory has eccentricities that need to be addressed in order for it to be used as storage. To list a few, it does not allow in-place updates. “2016” Background Design Evaluation Conclusion
Flash memory eccentricities Hello World RTAS 2017 *Update map* “Hello” “2017” “World” “RTAS” So if I want to update some data, the new data needs to be written to a new location, and a mapping between the logical address and the physical address must be maintained. The old location holding the outdated data is later cleaned, or erased, for new writes. However, erasing data isn’t that straightforward because the granularities for writing and erasing data are different. The program operation writes data to pages, while the erase operation wipes an entire block clean. “2016” *Invalidate* Background Design Evaluation Conclusion
Flash memory eccentricities Hello World RTAS 2017 “Hello” “2017” “World” “Hello” “RTAS” “World” This means that all the valid pages within the block must be copied out to another location before erasing it. “2016” “RTAS” *Copy valid data* Background Design Evaluation Conclusion
Flash memory eccentricities Hello World RTAS 2017 *Erase a block* “Hello” “2017” “World” “Hello” “RTAS” “World” This is known as garbage collection and it is responsible for reclaiming space for new writes. “2016” “RTAS” Background Design Evaluation Conclusion
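The update, invalidate, copy, and erase steps walked through on the last few slides can be sketched in a few lines of Python. This is a toy model for illustration only, not the FTL from the talk: the class name `TinyFlash`, the four-page block size, and the policy of moving on to the first empty block are all assumptions.

```python
from dataclasses import dataclass, field

PAGES_PER_BLOCK = 4  # hypothetical, kept tiny for illustration


@dataclass
class Block:
    pages: list = field(default_factory=list)  # (logical address, data) per programmed page
    valid: list = field(default_factory=list)  # validity flag per programmed page

    def is_full(self):
        return len(self.pages) == PAGES_PER_BLOCK


class TinyFlash:
    def __init__(self, num_blocks):
        self.blocks = [Block() for _ in range(num_blocks)]
        self.mapping = {}  # logical address -> (block index, page index)
        self.active = 0    # block currently receiving new writes

    def _program(self, lba, data):
        block = self.blocks[self.active]
        if block.is_full():
            # Move on to an erased (empty) block; StopIteration if none is left.
            self.active = next(i for i, b in enumerate(self.blocks) if not b.pages)
            block = self.blocks[self.active]
        block.pages.append((lba, data))
        block.valid.append(True)
        self.mapping[lba] = (self.active, len(block.pages) - 1)

    def write(self, lba, data):
        if lba in self.mapping:
            # No in-place update: mark the old page invalid instead.
            old_block, old_page = self.mapping[lba]
            self.blocks[old_block].valid[old_page] = False
        self._program(lba, data)  # write to a new location and update the map

    def read(self, lba):
        block, page = self.mapping[lba]
        return self.blocks[block].pages[page][1]

    def garbage_collect(self, victim):
        assert victim != self.active, "this sketch never collects the active block"
        block = self.blocks[victim]
        live = [entry for entry, ok in zip(block.pages, block.valid) if ok]
        for lba, data in live:
            self._program(lba, data)   # copy valid data out first
        self.blocks[victim] = Block()  # then erase the whole block
        return len(live)               # number of pages that had to be copied


if __name__ == "__main__":
    flash = TinyFlash(num_blocks=3)
    for lba, word in enumerate(["Hello", "World", "RTAS", "2016"]):
        flash.write(lba, word)
    flash.write(3, "2017")                    # update: new page, old one invalidated
    copied = flash.garbage_collect(victim=0)  # three valid pages are copied out
    print(copied, [flash.read(i) for i in range(4)])
```

The number returned by `garbage_collect` is the extra copy work that a write-heavy workload ends up paying for, which is why the choice of victim block matters.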
Flash translation layer Garbage collection Host request handling Read scrubbing Mapping table management FTL Error handling Sudden power off recovery These internal management tasks, along with other techniques such as fault handling, collectively make up the flash translation layer, often abbreviated as the FTL. These tasks all run together inside the flash storage in order to hide the eccentricities of flash memory. Bad block management Wear-leveling Background Design Evaluation Conclusion
Performance of flash storage So then, how does flash storage perform? This graph shows the throughput of eight flash storage devices, documented in SNIA’s performance test specification. The first thing to notice here is that the performance drops significantly over time, to as low as 5% of the initial throughput. Second, there is substantial jitter in throughput even in the steady state. So then what causes these performance drops and variations in these flash storage devices? * Graph from SNIA solid state storage performance test specification Background Design Evaluation Conclusion
Challenge 1: scheduling Host req #0: Prog(chip 0, block 4, page 18) Mapping table management Host req #1: Prog(chip 1, block 2, page 17) Host req #2: Read(chip 0, block 0, page 46) Host request handling Read scrubbing Flash memory subsystem Wear-leveling Error handling Garbage collection Bad block management First issue. Multiple tasks of the FTL contend for a shared resource, the flash memory subsystem. Tasks such as host request handling and garbage collection need to access flash memory, for reading and programming pages and erasing blocks. So there needs to be a scheduler that arbitrates the stream of requests from each FTL task. It needs to schedule in such a way that not only the bandwidth of the flash memory subsystem is utilized efficiently, but also response time perceived by the host is small. GC req #0: Read(chip 0, block 1, page 55) GC req #1: Read(chip 0, block 1, page 78) Sudden power off recovery Background Design Evaluation Conclusion
Challenge 2: changes in importance Host req #0: Prog(chip 0, block 4, page 18) Mapping table management Host req #1: Prog(chip 1, block 2, page 17) Host req #2: Read(chip 0, block 0, page 46) Host request handling Host req #3: Read(chip 1, block 6, page 03) Read scrubbing Flash memory subsystem Wear-leveling Error handling Garbage collection Bad block management So then, if our goal is to reduce the response time of host requests, shouldn’t we simply prioritize host handling over any other tasks? Well, not quite because the importance of tasks changes dynamically depending on the state of the system. For example, when there are plenty of reclaimed and unused free blocks, sure, there is no reason not to prioritize host handling over garbage collection. GC req #0: Read(chip 0, block 1, page 55) Sudden power off recovery Background Design Evaluation Conclusion
Challenge 2: changes in importance Host req #0: Prog(chip 0, block 4, page 18) Mapping table management Host req #1: Prog(chip 1, block 2, page 17) Host request handling Read scrubbing Flash memory subsystem Wear-leveling Error handling Garbage collection Bad block management But what if there are no more free blocks left? In that state, host data cannot be written until a free unused block is ready, so then reclaiming a block through garbage collection becomes very important. This means that the flash storage needs to make sure that free blocks do not run out by internally managing which task is more important at any given point in time. It may seem counter-intuitive, but giving priority to garbage collection under certain conditions actually reduces the response time for host requests. GC req #0: Read(chip 0, block 1, page 55) GC req #1: Read(chip 0, block 1, page 78) GC req #2: Read(chip 0, block 1, page 99) Sudden power off recovery GC req #3: Erase(chip 0, block 1) Background Design Evaluation Conclusion
Challenge 3: load balancing Flash memory subsystem Mapping table management Host request handling Prog Flash chip Read scrubbing Prog Read Read Wear-leveling Flash chip Error handling Garbage collection Read Read Read Read Read Read Read Flash chip Prog Bad block management Lastly, there is an element of load balancing. Even if the host access pattern were uniform across all chips, which it is not, the load on each chip is skewed mainly because of garbage collection. A garbage collection process will select a victim block, copy all the valid pages out, and then erase that block. This will cause the chip under garbage collection to be heavily loaded, and any host requests that need to be serviced from this chip will be penalized, adding to the variation. Flash chip Prog Read Sudden power off recovery Background Design Evaluation Conclusion
QoS-aware flash controller Fair share scheduler Dynamic share allocator Non-binding request handler In order to address these issues, we propose a QoS-aware flash controller. The fair share scheduler, dynamic share allocator, and the non-binding request handler are the three main components for reducing response time variation. Background Design Evaluation Conclusion
Fair share scheduler Fair share scheduler Keep track of the state of resources Select request to service based on share Interface with the low-level controller We schedule each request at the flash controller level, which has the best knowledge of the availability of the flash memory chips and the schedulability of flash memory operations. The fair share scheduler keeps track of the progress of each FTL task, and schedules requests fairly based on weighted fair queueing. Background Design Evaluation Conclusion
Fair share scheduler Read Fair share scheduler Keep track of the state of resources Select request to service based on share Interface with the low-level controller 40% share Host progress @ 0us 100us Flash chip Read For example, let’s say both host handling and garbage collection share a single flash memory chip. 40% is allocated for the host and 60% for garbage collection, and a host request arrives. The scheduler knows that the resource is available, and schedules the host request. GC progress 60% share Background Design Evaluation Conclusion
Fair share scheduler Read Fair share scheduler Keep track of the state of resources Select request to service based on share Interface with the low-level controller 0us+100us/40% = 250us 40% share Host progress @ 0us 100us Flash chip Read The progress for the host is updated to 250 based on the start time, duration of the request, and the share of 40%. GC progress 60% share Background Design Evaluation Conclusion
Fair share scheduler Read Fair share scheduler Keep track of the state of resources Select request to service based on share Interface with the low-level controller 0us+100us/40% = 250us 40% share Host progress @ 0us @ 150us 100us Flash chip Read Now at time 150, requests from both host and garbage collection arrive. Two requests having the exact same arrival time is highly unlikely, but let’s just say that it happened for the purpose of this example. @ 150us GC progress 60% share Background Design Evaluation Conclusion
Fair share scheduler Read GC program Fair share scheduler Keep track of the state of resources Select request to service based on share Interface with the low-level controller 0us+100us/40% = 250us 40% share Host progress @ 0us @ 150us 100us 240us Flash chip Read GC program In this case, the request from garbage collection is scheduled first because garbage collection has made less progress. @ 150us GC progress 60% share 150us+240us/60% = 550us Background Design Evaluation Conclusion
Fair share scheduler Read GC program Read Fair share scheduler Keep track of the state of resources Select request to service based on share Interface with the low-level controller 390us+100us/40% = 640us 40% share Host progress @ 0us @ 150us 100us 240us 100us Flash chip Read GC program Read The host request is scheduled once the ongoing program for garbage collection finishes, at 390 microseconds. @ 150us GC progress 60% share 150us+240us/60% = 550us Background Design Evaluation Conclusion
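The arithmetic on these slides can be replayed with a short Python sketch. It is a minimal model for a single chip, using only the update rule shown on the slides (progress = start time + duration / share); the class and method names are hypothetical, shares are given as integer percentages, and the actual controller may track more state than this.

```python
class FairShareScheduler:
    """Toy single-chip scheduler; all times are in microseconds."""

    def __init__(self, share_pct):
        self.share_pct = dict(share_pct)               # task -> share in percent
        self.progress = {task: 0.0 for task in share_pct}
        self.busy_until = 0.0                          # time at which the chip is free

    def pick(self, pending_tasks):
        # Serve the task that has made the least progress so far.
        return min(pending_tasks, key=lambda task: self.progress[task])

    def dispatch(self, task, now, duration_us):
        start = max(now, self.busy_until)              # wait for the chip if it is busy
        self.busy_until = start + duration_us
        # Update rule from the slides: start + duration / share.
        self.progress[task] = start + duration_us * 100 / self.share_pct[task]
        return start


if __name__ == "__main__":
    sched = FairShareScheduler({"host": 40, "gc": 60})

    sched.dispatch("host", now=0, duration_us=100)     # host progress: 250.0
    first = sched.pick(["host", "gc"])                 # at 150 us: "gc" (0.0 < 250.0)
    sched.dispatch(first, now=150, duration_us=240)    # gc progress: 550.0
    start = sched.dispatch("host", now=150, duration_us=100)
    print(first, start, sched.progress)                # host starts at 390, progress 640.0
```

A task with a smaller share accumulates progress faster per microsecond of service, so it is picked less often when both tasks are backlogged.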
Dynamic share allocator # of free blocks as representation of state Adjust share to control # of free blocks In the previous example, the share weights for the two streams were static. In our design, the share weights dynamically change to make sure that the system will not run out of free blocks. Background Design Evaluation Conclusion
Dynamic share allocator # of free blocks as representation of state Adjust share to control # of free blocks Target # of free blocks # of free blocks Dynamic Share Allocator Host share GC share Host requests Scheduler So, while the scheduler’s job is to order requests from multiple FTL tasks based on the share weights, the dynamic share allocator’s job is to decide those share weights. We do this by feeding back the current number of free blocks and using it to control the shares. If the current number of free blocks is far below the desired number, the share weight for garbage collection is bumped up to increase the number of free blocks produced. Scheduled requests GC requests Background Design Evaluation Conclusion
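One way to picture this feedback loop is a simple proportional controller, sketched below in Python. The talk does not state the control law, so everything here is an assumption made for illustration: the function name, the base share, the gain, and the clamping limits are all hypothetical.

```python
def allocate_shares(free_blocks, target_free_blocks,
                    base_gc_pct=20, gain_pct_per_block=2,
                    min_pct=5, max_pct=95):
    """Return (gc_share_pct, host_share_pct) from the free-block feedback.

    Assumed proportional rule: every block of shortfall below the target
    raises the GC share by `gain_pct_per_block` percentage points, and
    every block of surplus lowers it by the same amount.
    """
    shortfall = target_free_blocks - free_blocks
    gc_pct = base_gc_pct + gain_pct_per_block * shortfall
    gc_pct = max(min_pct, min(max_pct, gc_pct))  # neither task is ever starved
    return gc_pct, 100 - gc_pct


if __name__ == "__main__":
    for free in (60, 50, 40, 10):
        print(free, allocate_shares(free, target_free_blocks=50))
```

Clamping the result keeps a minimum share for both tasks, so host requests still make progress while garbage collection is catching up.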
Non-binding request handler Estimate queue delay for each chip Re-assign non-binding requests to a new chip Notify FTL task of the selection Lastly, we exploit the fact that writes can be serviced by any available chip because the mapping from logical to physical address happens anyway. The non-binding request handler takes into consideration all pending requests for each chip and scales them with the share weight to estimate the expected delay. The chip with the minimum expected delay is chosen as the target for the program, and the selection is reported back to the FTL through a response. Background Design Evaluation Conclusion
Non-binding request handler Estimate queue delay for each chip Re-assign non-binding requests to a new chip Notify FTL task of the selection 100us Host reqs Chip 0 Host: 80% 200us GC: 20% GC reqs Host program In this example, let’s say there are two chips, both queued with the same amount of total work left to do. Here, host requests are expected to be scheduled more and completed faster because of the bigger share weight. So, if a host program request arrives, it should be assigned to chip 0 as there is less host work there and the GC work there isn’t expected to make much progress anyway. More details on the internal operations can be found in the paper. 200us Host reqs Chip 1 100us GC reqs Background Design Evaluation Conclusion
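The two-chip example can be worked through with a small Python sketch. The exact estimation formula is in the paper and is not reproduced here; the estimate below is an assumption based on idealized fair sharing, and the function names are hypothetical. While both queues are backlogged, host work drains at the host's share of the chip, so the pending host work takes `host_work / host_share`; if the GC queue empties first, the host cannot take longer than the total work queued on the chip.

```python
def expected_host_delay_us(host_work_us, gc_work_us, host_share_pct):
    """Estimated time until the host work already queued on a chip drains."""
    shared = host_work_us * 100 / host_share_pct  # GC stays backlogged throughout
    return min(shared, host_work_us + gc_work_us)  # GC queue runs dry first


def choose_chip(queues, host_share_pct):
    """queues maps chip id -> (pending host work, pending GC work), in us."""
    return min(
        queues,
        key=lambda chip: expected_host_delay_us(*queues[chip], host_share_pct),
    )


if __name__ == "__main__":
    # Values from the slide: same total work per chip, host share of 80%.
    queues = {0: (100, 200), 1: (200, 100)}
    for chip, (host_work, gc_work) in queues.items():
        print(chip, expected_host_delay_us(host_work, gc_work, 80))
    print("target chip:", choose_chip(queues, 80))
```

Under this estimate chip 0 comes out at 125 microseconds and chip 1 at 250 microseconds, so the new host program goes to chip 0, matching the slide.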
Evaluation methodology Storage system configuration FTL tasks: host request handling & garbage collection Generate a stream of asynchronous flash memory requests Inter-arrival time for requests to model processing overhead for tasks FTL with 4KB mapping granularity Garbage collection with greedy policy (select block with minimum # of valid data) ~14% over-provisioning factor To verify the effectiveness of our techniques, we implemented the design on the SSD extension of the DiskSim simulator. In the simulator, both host request handling and garbage collection generate a stream of flash memory requests as asynchronous events that are queued at the controller. These requests are generated with some inter-arrival time to model processing overheads for these tasks. We use a 4KB mapping scheme and implemented a greedy policy for the garbage collector’s victim selection. One eighth of the storage capacity is reserved, giving an over-provisioning factor of about 14%. Background Design Evaluation Conclusion
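The greedy victim selection mentioned here (pick the block with the fewest valid pages) is small enough to sketch in Python. The function name, the dictionary layout, and the `exclude` parameter are hypothetical and not taken from the simulator.

```python
def select_victim(valid_page_counts, exclude=()):
    """Greedy policy: choose the block holding the fewest valid pages.

    valid_page_counts maps block id -> number of valid pages in that block.
    Blocks listed in `exclude` (for example, the block currently being
    written) are never chosen.
    """
    candidates = {
        block: count
        for block, count in valid_page_counts.items()
        if block not in exclude
    }
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    counts = {0: 12, 1: 3, 2: 7}
    print(select_victim(counts))               # block 1 has the least data to copy
    print(select_victim(counts, exclude={1}))  # next best choice is block 2
```

Fewer valid pages in the victim means fewer copy operations before the erase, so the greedy choice minimizes the immediate cost of reclaiming one block.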
Evaluation methodology Storage system configuration FTL tasks: host request handling & garbage collection Generate a stream of asynchronous flash memory requests Inter-arrival time for requests to model processing overhead for tasks FTL with 4KB mapping granularity Garbage collection with greedy policy (select block with minimum # of valid data) ~14% over-provisioning factor Workload configuration Issue rate: 5K IOPS Such that both host request handling & garbage collection run concurrently While not causing requests to queue up unboundedly Duration: 1 hour simulation time (up to 18M IOs) As for the workload, we use a synthetic IO generator to stress-test the storage system. We set the request issue rate high enough that host request handling and garbage collection run concurrently, but not so high that it would result in unbounded response times. The simulation ran for one hour of simulation time, resulting in up to 18M IO requests. With that said, let’s jump into the experimental results. Background Design Evaluation Conclusion
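Two of the figures on this slide can be checked by hand. The derivation below assumes the usual definition of the over-provisioning factor as reserved capacity divided by user-visible capacity, with $C$ standing for the total physical capacity; the symbol names are introduced here only for illustration.

```latex
\[
\text{OP} = \frac{C_{\text{reserved}}}{C_{\text{user}}}
          = \frac{\tfrac{1}{8}\,C}{\tfrac{7}{8}\,C}
          = \frac{1}{7} \approx 14.3\%
\]
\[
N_{\text{IO}} \le 5000\ \tfrac{\text{IO}}{\text{s}} \times 3600\ \text{s}
             = 1.8 \times 10^{7} = 18\text{M}
\]
```

The first line is why reserving one eighth of the capacity yields roughly 14% rather than 12.5%, and the second is where the upper bound of 18M IOs for a one-hour run at 5K IOPS comes from.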
Experiment 1: Establishing baseline QoS-unaware : schedule in order of arrival Throttling : limit bandwidth use QoS-aware (FSS) : 50:50 share First, let’s see how bad the performance variation can be without any provisions for QoS. Here we compare three schemes. First, a QoS-unaware scheme where requests from both host and garbage collection are processed at the flash controller in the order of arrival. Second is a throttling scheme where the host and GC have a limited amount of bandwidth that they can use given a period of time. Last is a simple QoS-aware scheme with just the fair share scheduler enabled at 50:50 share between the host and garbage collector. Background Design Evaluation Conclusion
Experiment 1: Establishing baseline QoS-unaware : schedule in order of arrival Throttling : limit bandwidth use QoS-aware (FSS) : 50:50 share 99.9%: <47ms @10ms: 56% Graph on your left shows the throughput over time for the QoS-unaware scheme under a random write workload. The average throughput for the entire duration is at 5K IOPS, but as you can tell, there is a lot of variation at each individual measurement of one second intervals. As the response time CDF on your right indicates, only about 56% of requests respond before 10 milliseconds. Background Design Evaluation Conclusion
Experiment 1: Establishing baseline QoS-unaware : schedule in order of arrival Throttling : limit bandwidth use QoS-aware (FSS) : 50:50 share 99.9%: <28ms @10ms: 74% @10ms: 56% The throttling scheme in black shows a tighter envelope for throughput and improves the latency curve, but only about 74% of requests respond before 10 milliseconds. Background Design Evaluation Conclusion
Experiment 1: Establishing baseline QoS-unaware : schedule in order of arrival Throttling : limit bandwidth use QoS-aware (FSS) : 50:50 share 99.9%: <3.2ms @10ms: 100% @10ms: 74% @10ms: 56% Lastly for the fair share scheduling scheme in blue, the CDF is at 100% for 10 milliseconds, and in fact, all of the requests complete before 4.3 milliseconds. 99.9% of the requests respond before 3.2 milliseconds. Even with a static share, scheduling at the controller in a more fine-grained manner reduces not only the response time, but also the tail latency by a large margin. Background Design Evaluation Conclusion
Experiment 2: Effects of share weight H80G20 : Static share of 80% for host, 20% for GC H20G80 : Static share of 20% for host, 80% for GC Next, let’s investigate the effects of different share weight on the performance. In this scenario, we compare a fair share scheduling with static weight of 80% for the host and 20% for the garbage collection against a static share of 20% for the host and 80% for the garbage collection. Background Design Evaluation Conclusion
Experiment 2: Effects of share weight H80G20 : Static share of 80% for host, 20% for GC H20G80 : Static share of 20% for host, 80% for GC Random read/write workload 99.9%: <1.3ms 99.9%: <15ms First graph on left shows the read response time under a random read and write workload. From this graph, it may seem obvious that the share weight of 80% for the host in light blue is better than the other one in dark dotted blue. Background Design Evaluation Conclusion
Experiment 2: Effects of share weight H80G20 : Static share of 80% for host, 20% for GC H20G80 : Static share of 20% for host, 80% for GC Random read/write workload Composite workload 99.99%: <19ms 99.9%: <1.3ms 99.99%: <32ms 99.9%: <15ms However, if you look at the graph on the right, it tells a different story. This graph shows the response time CDF under a composite workload where there are phases to the workload: it goes through sequential accesses and random accesses. Under this workload, the average response time for the 80% host share in light blue is better than the other in dark dotted blue. However, it quickly fizzles out and there is a crossing point at around 17 milliseconds, indicating that the static share of 80% for the host has a very long latency tail. To see why this happens, let’s take a look at the number of free blocks over time. Background Design Evaluation Conclusion
Experiment 2: Effects of share weight H80G20 : Static share of 80% for host, 20% for GC H20G80 : Static share of 20% for host, 80% for GC Sequential write Random read/write This graph shows the number of free blocks where the response time spike occurred. As you can see, the number of free blocks becomes critically low for the static share of 80% for the host in light blue towards the end of the write intensive phase. On the other hand, static share of 20% for host in dark dotted blue has enough free blocks to transition smoothly into the next phase, and thus is better at bounding the response time for high QoS requirements. This goes to show that there is no magic number and that the share must be dynamically adjusted depending on the system state. Background Design Evaluation Conclusion
Experiment 3: Dynamically adjusting weight H80G20 : Static share of 80% for host, 20% for GC H20G80 : Static share of 20% for host, 80% for GC FSS+DSA : Share dynamically allocated 99.9%: <4.2ms 99.9%: <15ms So then what if the share is dynamically adjusted? The graph on the left shows the number of free blocks over time at the previously problematic transition. Background Design Evaluation Conclusion
Experiment 3: Dynamically adjusting weight H80G20 : Static share of 80% for host, 20% for GC H20G80 : Static share of 20% for host, 80% for GC FSS+DSA : Share dynamically allocated 99.9%: <3.8ms 99.9%: <4.2ms 99.9%: <15ms With shares dynamically allocated in orange, there is no shortage of free blocks, and as you can guess it has better performance. Dynamic allocation achieves a 99.9th percentile response time of about 3.8ms, better than both static share schemes. It’s interesting to note that the static share of 80% for the host in light blue achieves better 99.8th percentile performance than the dynamic allocation. However, like before, it quickly stalls and loses the race thereafter. Background Design Evaluation Conclusion
Experiment 4: Putting it all together H80G20 : Static share of 80% for host, 20% for GC H20G80 : Static share of 20% for host, 80% for GC FSS+DSA : Share dynamically allocated FSS+DSA+NRH : All techniques applied Lastly, let’s see what happens if all three techniques are put together. With non-binding request handling, write requests are assigned to a chip with the least amount of expected delay, improving response time. Background Design Evaluation Conclusion
Experiment 4: Putting it all together H80G20 : Static share of 80% for host, 20% for GC H20G80 : Static share of 20% for host, 80% for GC FSS+DSA : Share dynamically allocated FSS+DSA+NRH : All techniques applied 99.9%: <4.2ms 99.9%: <15ms 99.9%: <3.8ms This graph shows the read response time for all the schemes so far… except that the curve for the QoS-unaware and throttling scheme doesn’t even fall into the range of the graph. Background Design Evaluation Conclusion
Experiment 4: Putting it all together H80G20 : Static share of 80% for host, 20% for GC H20G80 : Static share of 20% for host, 80% for GC FSS+DSA : Share dynamically allocated FSS+DSA+NRH : All techniques applied Throttle 99.9%: <26ms 99.9%: <1.5ms 99.9%: <4.2ms 99.9%: <15ms 99.9%: <3.8ms QoS-unaware 99.9%: <46ms The green line represents the latency curve when all three techniques are put together. The 99.9th percentile response time for this is 1.5ms. To put this into perspective, the 99.9th percentile response time for the QoS-unaware scheme is 46ms, about 30 times slower. If we reduce the workload intensity to 1K IOPS, this gap further increases to about 56 times. We also covered how our techniques would scale by performing sensitivity analysis on some key system parameters such as over-provisioning factor, page size, and number of chips. These results are detailed in the paper. Background Design Evaluation Conclusion
QoS-aware Flash Memory Controller Conclusion Design for reducing response time variation Fairly schedule requests at the controller-level according to share Dynamically adjust the share weight depending on the state of the system Balance the load across multiple chips QoS-aware Flash Memory Controller Improves average response time by 12× ~ 38× for reads by 1.4× ~ 6.9× for writes Improves 99.9% response time by 29× ~ 56× for reads by 2.0× ~ 8.5× for writes In this work, we looked into the issue of unpredictable performance and high response time variation in flash storage. To combat this challenge, we designed a QoS-aware flash memory controller that schedules requests from multiple FTL tasks according to the given share weight, dynamically adjusts the importance of tasks depending on the state of the storage, and balances the load across multiple chips in order to further reduce response time. Our results show a great improvement in not only the average response time, but also the 99.9th percentile response time. Background Design Evaluation Conclusion
Future directions Generalizing the scheduler Hardware prototyping Implement other FTL tasks Support any combination of tasks Hardware prototyping Implement on an FPGA development board Run real workloads in real-time And much more! Going forward, there are a number of directions that we can expand into. First direction is generalizing the scheduler and implementing other FTL tasks, extending beyond host request handling and garbage collection. Second is hardware prototyping on an FPGA development board and running it in real-time. I think there are a lot of possibilities in this area, and I hope this talk piqued your interest. With that said, I conclude my presentation and I would be happy to take questions. Thank you. Background Design Evaluation Conclusion