
1 Parallel-DFTL: A Flash Translation Layer that Exploits Internal Parallelism in Solid State Drives
Wei Xie¹, Yong Chen¹, and Philip C. Roth² (1. Texas Tech University; 2. Oak Ridge National Laboratory)
Presenter: Wei Xie
11th IEEE International Conference on Networking, Architecture and Storage (IEEE NAS 2016)

2 Outline
Background: flash-based SSD and DFTL
Motivation: interference of address translation and data access
Design: separate queues for address translation and data access
Implementation and evaluation
Summary and future work

3 Flash-based Solid State Drives
Flash-based SSDs offer 5-50x the bandwidth and up to 1000x the random IOPS of traditional hard disk drives, and are 3-30x more power efficient. Price per GB is falling quickly.
Use cases: fast checkpointing in HPC, accelerating enterprise applications.

Storage type                  7200RPM SATA HDD   SATA SSD   NVMe SSD
Seq. read bandwidth (MB/s)    76                 475        2,232
Random 4KB mixed (IOPS)       62                 50,646     238,143

Credit: SNIA; Kitguru.net

4 Background: Flash-based SSDs
Reads and writes occur at page granularity (4 KB); erases occur at block granularity (64 or 256 pages).
Multiple SSD channels allow concurrent read and write access.
The flash translation layer (FTL) manages the mapping from logical to physical addresses.
[Figure: page-level read/write vs. block-level erase]
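As a rough illustration of these granularities, here is a minimal Python sketch of a page-level mapping table. The constants and helper names (`translate`, `block_of`) are illustrative, not taken from the talk.

```python
# Illustrative constants matching the granularities on this slide.
PAGE_SIZE = 4096          # reads and writes happen one 4 KB page at a time
PAGES_PER_BLOCK = 64      # erases happen one block (64 or 256 pages) at a time

mapping = {}              # page-level FTL: logical page number -> physical page

def translate(lpn):
    """Look up the physical page for a logical page (None if unmapped)."""
    return mapping.get(lpn)

def block_of(ppn):
    """The physical block an erase covering this page would target."""
    return ppn // PAGES_PER_BLOCK
```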

5 Background: Suboptimal SSD Performance
Under-utilized parallelism: concurrent access is possible across channels, dies, and planes, but utilization stays low because scheduling is not parallelism-aware.
Inefficient flash translation layer (FTL): mapping granularity can be page-level, block-level, or hybrid; the choice trades off performance (page-level is best) against mapping-table size (page-level is largest).
DFTL is one of the most efficient FTL algorithms.

6 Background: DFTL, a Cache-based FTL
Main idea: use page-level mapping for performance, cache the mapping table in the embedded SRAM, and exploit locality.
Issue: a cache miss can be expensive, introducing an extra flash read (map-loading) and an extra flash write (write-back). We call these extra accesses translation accesses to differentiate them from user-issued data accesses.
[Figure: a cache hit serves from SRAM; a cache miss triggers a map-loading and possibly a write-back alongside the data access]
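The following Python sketch illustrates this lookup path. It is a simplification under assumed names: `flash.load_map_entry` and `flash.write_back` are hypothetical stand-ins for the translation-page read (map-loading) and write (write-back), and the tiny cache capacity is chosen only to make evictions visible.

```python
from collections import OrderedDict

CACHE_CAPACITY = 4                # tiny illustrative SRAM budget, in map entries
cached_map = OrderedDict()        # lpn -> ppn; insertion order doubles as LRU

def lookup(lpn, flash):
    """Serve a translation from SRAM on a hit; on a miss, pay for a
    map-loading read and, if the cache is full, a write-back."""
    if lpn in cached_map:                    # cache hit: no extra flash access
        cached_map.move_to_end(lpn)
        return cached_map[lpn]
    if len(cached_map) >= CACHE_CAPACITY:    # cache full: evict the LRU entry
        victim_lpn, victim_ppn = cached_map.popitem(last=False)
        flash.write_back(victim_lpn, victim_ppn)  # translation write (write-back)
    ppn = flash.load_map_entry(lpn)          # translation read (map-loading)
    cached_map[lpn] = ppn
    return ppn
```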

7 Outline
Background: flash-based SSD and DFTL
Motivation: interference of address translation and data access
Design: separate queues for address translation and data access
Implementation and evaluation
Summary and future work

8 Motivation: Interference of Translation Access and Data Access
Translation accesses not only incur extra reads and writes; they also hurt the utilization of internal parallelism.
Notation: Write1 means a write to logical page 1; Write-back5 means the write-back caused by accessing logical page 5.
In the example, Write1 through Write4 are interleaved with write-back operations, so pages 1-4 cannot be written concurrently.

9 A Solution
Separate translation accesses from data accesses.
In the example, Write1 through Write4 can then proceed concurrently. Write-back5 through Write-back8 could also be merged into one write-back; we explain why shortly.

10 Related Work
Hu et al. [1] identified the address translation overhead problem in DFTL and proposed using PCM to store mapping-table entries to avoid the overhead.
Several studies increase the utilization of internal parallelism by making I/O scheduling parallelism-aware: Sprinkler by Jung et al. [2], Park et al. [3], and PAQ by Jung et al. [4].
[1] Y. Hu, H. Jiang, D. Feng, L. Tian, H. Luo, and S. Zhang, "Performance Impact and Interplay of SSD Parallelism through Advanced Commands, Allocation Strategy and Data Granularity," in Proceedings of the International Conference on Supercomputing, 2011.
[2] M. Jung and M. Kandemir, "Sprinkler: Maximizing Resource Utilization in Many-chip Solid State Disks," in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[3] C. Park, E. Seo, J.-Y. Shin, S. Maeng, and J. Lee, "Exploiting Internal Parallelism of Flash-based SSDs," Computer Architecture Letters, vol. 9, 2010.
[4] M. Jung, E. H. Wilson III, and M. Kandemir, "Physically Addressed Queueing (PAQ): Improving Parallelism in Solid State Disks," in Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12), 2012.

11 Outline
Background: flash-based SSD and DFTL
Motivation: interference of address translation and data access
Design: separate queues for address translation and data access
Implementation and evaluation
Summary and future work

12 Design: Separated I/O Queues
Separate queues for write-back, map-loading, and data access (reads and writes are also separated). Issue order: write-back, then map-loading, then data access.
Two benefits: multiple translation operations can be merged into one page read or write, and interference with data accesses is avoided.

13 Parallel-DFTL
Notation: Data = data access operation, ML = map-loading operation, WB = write-back operation; request REQ1 splits into WB1, ML1, and Data1, and REQ2 into WB2, ML2, and Data2.
Write-back queue: once full, try to merge and issue.
Map-loading queue: wait for the write-backs to complete, then try to merge and issue.
Data queue: wait for the map-loadings to complete, then issue.
A minimal sketch of this queue structure follows.
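The sketch below assumes a hypothetical `issue_batch` callback standing in for the device's dispatch logic; the real Parallel-DFTL queues also track per-request dependencies, which this simplification omits.

```python
from collections import deque

# One queue per operation type, drained in dependency order: write-backs
# free cache entries, map-loadings fill them, data accesses then proceed.
write_back_q = deque()
map_load_q = deque()
data_q = deque()

def drain(issue_batch):
    """Issue each queue as a single batch so operations sharing a
    translation page or channel can be merged or run concurrently."""
    issue_batch(list(write_back_q)); write_back_q.clear()  # 1: write-backs
    issue_batch(list(map_load_q));   map_load_q.clear()    # 2: map-loadings
    issue_batch(list(data_q));       data_q.clear()        # 3: data accesses
```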

14 Translation Operations
In the example, accessing pages 5, 6, 7, and 8 incurs 4 cache misses. Entries that belong to the same translation page are merged: under the replacement policy, the evicted entries 1-4 become one write-back to the map tables in flash, and the missing entries 5-8 are loaded together (see the sketch below).
[Figure: cached map-table entries 1-4 (physical pages 100-103) merged and written back; entries 5-8 (physical pages 201-204) loaded and merged]
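The merging step can be sketched as grouping cache misses by the translation page that holds their entries. The figure of 512 entries per translation page is an assumption (a 4 KB page holding 8-byte entries), not a number from the talk.

```python
ENTRIES_PER_TRANSLATION_PAGE = 512   # assumed: 4 KB page / 8-byte map entry

def group_misses(missed_lpns):
    """Group missed logical pages by the translation page that stores
    their entries: each group costs one flash read, not one per miss."""
    groups = {}
    for lpn in missed_lpns:
        groups.setdefault(lpn // ENTRIES_PER_TRANSLATION_PAGE, []).append(lpn)
    return groups

# The slide's example: pages 5-8 share one translation page, so four
# cache misses collapse into a single map-loading read.
print(group_misses([5, 6, 7, 8]))   # {0: [5, 6, 7, 8]}
```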

15 Parallelize Translation Operations
With channel-level parallelism, the merged translation operations for entries 1-4 and 5-8 can target different channels, so the translation for pages 1-8 is handled by one single concurrent flash write (a channel-assignment sketch follows).
[Figure: two merged translation pages written to flash over two channels concurrently]
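A sketch of how merged translation pages could be spread over channels; the modulo interleaving is an assumed layout, chosen only to show why operations on distinct channels can be issued as one concurrent batch.

```python
NUM_CHANNELS = 2   # matches the evaluated configuration later in the talk

def by_channel(translation_pages):
    """Bucket translation pages by channel (assuming simple modulo
    interleaving); one flash op per channel can be issued concurrently."""
    buckets = {}
    for tp in translation_pages:
        buckets.setdefault(tp % NUM_CHANNELS, []).append(tp)
    return buckets

# Two merged translation pages landing on different channels behave like
# the slide's single concurrent flash write covering pages 1-8.
print(by_channel([0, 1]))   # {0: [0], 1: [1]}
```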

16 A Different Write-back Process
In DFTL, write-back happens one entry at a time: on a cache miss, check whether a free cache entry is available; if not, write back one entry.
In Parallel-DFTL, the write-backs needed for all requests in the queue are computed at once: count how many cache misses would occur (M), count how many free cache entries are available (F), and write back M - F entries if M > F. A sketch follows.
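A sketch of the batched computation, with `queued_lpns` as the logical pages of the queued requests and `cached` as the set of currently cached entries (both hypothetical names):

```python
def writebacks_needed(queued_lpns, cached, free_entries):
    """Count the cache misses M the queued requests would incur, compare
    with the free cache entries F, and write back M - F entries at once."""
    m = len({lpn for lpn in queued_lpns if lpn not in cached})   # misses M
    return max(m - free_entries, 0)        # entries to evict in one batch

# Four queued requests all miss while one cache slot is free, so three
# entries are written back together rather than one at a time.
print(writebacks_needed([5, 6, 7, 8], cached={1, 2, 3, 4}, free_entries=1))  # 3
```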

17 Cache Replacement Policy
To get n free cache entries:
1. Use LRU to select a victim entry e.
2. Merge any other entries m1 to mc that belong to the same translation page as e.
3. If c+1 < n, consider the next entry in LRU order; if it resides on a different channel from all entries selected so far, select it.
4. Repeat step 3 until n free entries are retrieved.
A sketch of this policy follows the list.
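The sketch below assumes helpers `tpage_of` and `channel_of` that map a cached entry to its translation page and channel. Unlike step 4 above, this simplified version can return fewer than n victims when no unused channel remains.

```python
def select_victims(lru_order, n, tpage_of, channel_of):
    """Take the LRU entry, fold in every cached entry on the same
    translation page (they merge into one write-back), then keep taking
    LRU entries on channels not used yet so evictions can run in parallel."""
    e = lru_order[0]                                                # step 1
    victims = [x for x in lru_order if tpage_of(x) == tpage_of(e)]  # step 2
    used = {channel_of(x) for x in victims}
    for x in lru_order:                                             # steps 3-4
        if len(victims) >= n:
            break
        if x not in victims and channel_of(x) not in used:
            victims.append(x)
            used.add(channel_of(x))
    return victims

# Toy example: entries 0 and 1 share a translation page (one write-back);
# entry 6 is added because it sits on a channel not used so far.
print(select_victims([0, 1, 6, 7], n=3,
                     tpage_of=lambda x: x // 2,    # 2 entries per page (toy)
                     channel_of=lambda x: x % 4))  # 4 channels (toy)
# -> [0, 1, 6]
```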

18 Outline
Background: flash-based SSD and DFTL
Motivation: interference of address translation and data access
Design: separate queues for address translation and data access
Implementation and evaluation
Summary and future work

19 Implementation
FlashSim: a flash-based SSD simulator built on DiskSim; trace-driven, with block-I/O traces available. DFTL is fully implemented by its original authors, but internal parallelism is not modeled.
Microsoft Research's SSD extension to DiskSim implements the internal parallelism of SSDs.
Our simulator: FlashSim plus parallelism, plus separate I/O queues for translation I/O and data-access I/O.

20 Real-world Test Workloads
Financial1: online transaction processing (OLTP). Websearch1: search-engine server. Exchange1: Microsoft Exchange mail server.

Workload trace   Average request size   Write ratio   Cache hit ratio (512 KB cache)
Financial1       4.5 KB                 91%           78.1%
Websearch1       15.14 KB               1%            64.6%
Exchange1        14.7 KB                69.2%         89.3%

21 SSD Configuration
Config: 16 GB SSD, parallelism level 8, 4 KB pages, 256 KB blocks (2 channels with 4 dies each; each die is 2 GB).
SRAM cache size is varied from 32 KB to 2 MB and is used entirely for cached mapping-table entries.
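A quick sanity check of this configuration's arithmetic, with all numbers taken from the table above:

```python
# All numbers come from the configuration table above.
channels, dies_per_channel, die_size_gb = 2, 4, 2

parallelism = channels * dies_per_channel      # 8 independent flash units
ssd_size_gb = parallelism * die_size_gb        # 8 dies * 2 GB = 16 GB
pages_per_block = (256 * 1024) // 4096         # 256 KB block / 4 KB page = 64

print(parallelism, ssd_size_gb, pages_per_block)   # 8 16 64
```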

22 Evaluation: Financial1
Small request size, write dominant. Moderate improvement with a small cache; very small improvement with caches larger than 512 KB. The average request size is small (roughly 4 KB, i.e., one page), which makes the accesses difficult to parallelize.

23 Evaluation: Websearch1
Moderate request size (15 KB), read dominant. The low cache hit ratio generates more map-loadings, while the relatively large request size allows translation requests to be merged and parallelized.

24 Evaluation: Exchange1
Moderate request size (14 KB), mixed reads and writes, high temporal locality. Request size appears to be a key factor in whether Parallel-DFTL is effective.

25 Translation Overhead Breakdown
The reduction in map-loading is generally more significant than the reduction in write-back, indicating that many write-back operations cannot be merged or parallelized. A better cache replacement algorithm could improve this.

26 Micro-benchmark: Request Size
Write a 1 GB region sequentially, then read it randomly. Bandwidth scales up with request size thanks to parallelism, and Parallel-DFTL comes very close to the ideal page map, which has almost no translation overhead.

27 Summary and Future Work
Address translation operations in DFTL were found to interfere with data accesses and reduce the utilization of internal parallelism.
The separate-queue design reduces the address translation overhead and fully leverages internal parallelism.
Both real-workload and synthetic benchmark tests suggest that workloads with relatively large request sizes benefit more from Parallel-DFTL.
Future work: investigate an optimal cache replacement algorithm that considers temporal locality and internal parallelism at the same time.

28 Acknowledgement
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research. This research is also supported by the National Science Foundation under grant CNS and CNS.
Wei Xie, Data Intensive Scalable Computing Laboratory, Texas Tech University; Future Tech Group, Oak Ridge National Laboratory.
Thanks! Questions?

29 Back-up Slide

30 Data-Intensive Computing Storage Demand
I/O performance and power consumption are critical problems for data-intensive applications: HPC checkpointing requires high I/O throughput, and data analytics applications are similarly demanding.
Power is a major constraint in data centers and HPC systems; the roughly 10 MW consumption of present-day HPC systems is becoming a bottleneck.
Exascale computing places very high I/O demands on storage systems and needs new architectures and technologies to help; power consumption is also critical to reaching exascale, with efforts aiming to keep exascale systems within 10 MW.

31 Micro-benchmark: Ill-mapped Flash Pages
Similar to the previous test, but the first 1 GB region is written randomly with 4 KB requests and then read sequentially. The random writes make the subsequent reads much slower, as expected. Parallel-DFTL offers only a small benefit here because logically subsequent pages may be located on the same channel, limiting parallelism.

