
1 Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas, Computer Architecture and Memory Systems Lab

2 Takeaway
Observations:
– Employing more and more flash chips is not a promising solution
– Unbalanced flash chip utilization and low parallelism
Challenges:
– The degree of parallelism and utilization depends highly on incoming I/O request patterns
Our approach:
– Sprinkles I/O requests based on the internal resource layout rather than the order imposed by a storage queue
– Commits more memory requests to a specific internal flash resource

3 Revisiting NAND Flash Performance
Memory cell performance (excluding data movement)
– READ: 20 us ~ 115 us → 70 ~ 200 MB/sec
– WRITE: 200 us ~ 5 ms → 1.6 ~ 20 MB/sec
Flash interface (ONFI 3.0)
– SDR: 50 MB/sec
– NV-DDR: 200 MB/sec
– NV-DDR2: 533 MB/sec
ONFI 4.0 → 800 MB/sec
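To make the latency-to-throughput step explicit, here is a rough conversion sketch: one page transferred per read/program latency. The page size is an assumption (the slide does not state one), so the results only roughly bracket the MB/sec ranges quoted above.

```python
# Back-of-the-envelope cell-level throughput: one page per latency.
# PAGE is an assumption; the slide's exact ranges likely assume a
# different page size for some modes.

def cell_throughput_mb_s(page_bytes: int, latency_s: float) -> float:
    """Sustained MB/s if one page completes every latency_s seconds."""
    return page_bytes / latency_s / 1e6

PAGE = 4 * 1024  # assumed 4 KB flash page

for name, lat in [("READ  (fast)", 20e-6), ("READ  (slow)", 115e-6),
                  ("WRITE (fast)", 200e-6), ("WRITE (slow)", 5e-3)]:
    print(f"{name}: {cell_throughput_mb_s(PAGE, lat):7.1f} MB/s")
```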

4 Revisiting NAND Flash Performance
ONFI 4.0 → 800 MB/sec
PCI Express (single lane)
– 2.x: 500 MB/sec
– 3.0: 985 MB/sec
– 4.0: 1969 MB/sec
PCIe 4.0 (16 lanes) → 31.51 GB/sec

5 Revisiting NAND Flash Performance
[Figure: bandwidth at each level — 200 MB/s, 800 MB/s, 31 GB/s]
Performance disparity (even under an ideal situation)
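One way to read this disparity is to ask how many ideal flash channels or chips it would take to keep the host link busy. The divisors below come directly from the preceding slides; the arithmetic is illustrative only.

```python
# Illustrative arithmetic using the figures from slides 3-5.
host_link_mb_s = 31.51 * 1000   # PCIe 4.0, 16 lanes
onfi4_channel_mb_s = 800        # ONFI 4.0 channel
cell_read_mb_s = 200            # upper-bound cell read throughput

print(host_link_mb_s / onfi4_channel_mb_s)  # ~39 channels at interface speed
print(host_link_mb_s / cell_read_mb_s)      # ~158 chips at cell read speed
```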

6 How can we reduce the performance disparity?

7 Internal Parallelism A Single Host-level I/O Request

8 Unfortunately, the performance of many-chip SSDs is not significantly improved as the amount of internal resources increases

9 Many-chip SSD Performance
[Figure: performance stagnates as the number of chips grows]

10 Utilization and Idleness
[Figure: utilization sharply goes down while idleness keeps growing]

11 I/O Service Routine in a Many-chip SSD
Memory requests: data size is the same as the atomic flash I/O unit size
Out-of-order scheduling; system- and flash-level parallelism
A flash transaction should be decided before entering the execution stage
Challenge: I/O access patterns and sizes are all determined by host-side kernel modules
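A minimal sketch of the vocabulary used on this slide; the names and fields below are my own illustration, not code from the paper.

```python
# Sketch of the terms on this slide: a memory request is one
# flash-page-sized piece of a host I/O; a flash transaction is the set
# of memory requests committed together to one chip, and it must be
# fully composed before it enters the execution stage.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemoryRequest:
    io_id: int       # which host-level I/O request it came from
    chip: int        # target chip (from the address mapping)
    die: int
    plane: int
    is_write: bool

@dataclass
class FlashTransaction:
    chip: int
    requests: List[MemoryRequest] = field(default_factory=list)

    def flp(self) -> int:
        # crude flash-level-parallelism indicator: distinct (die, plane)
        # pairs served by this single transaction
        return len({(r.die, r.plane) for r in self.requests})
```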

12 Challenge Examples Virtual Address Scheduler Physical Address Scheduler

13 Virtual Address Scheduler (VAS)
[Figure: queued requests 1–5 committed in queue order by physical offset across chips C0–C8; several chips sit idle while a tail collision forms at CHIP 3 (C3)]

14 Physical Address Scheduler (PAS)
[Figure: the same requests 1–5 scheduled by physical offset across chips C0–C8; pipelining improves, but tail collisions still occur at CHIP 3 (C3)]
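A toy contrast of the two baseline schedulers as slides 13–14 depict them; the round-robin per-chip service for PAS is my simplification, not the actual controller logic.

```python
# VAS commits memory requests strictly in arrival (queue) order, so an
# earlier request stuck on a busy chip stalls later ones that target
# idle chips. PAS keeps one queue per chip, so different chips can be
# pipelined, but requests piled on the same chip still collide at its
# tail. This is a sketch, not the paper's implementation.
from collections import defaultdict, deque

def vas_commit_order(queue):
    """Virtual Address Scheduler: pure FIFO over the storage queue."""
    return list(queue)

def pas_commit_order(queue, num_chips):
    """Physical Address Scheduler: round-robin over per-chip queues."""
    per_chip = defaultdict(deque)
    for req in queue:                      # req = (req_id, chip)
        per_chip[req[1]].append(req)
    order = []
    while any(per_chip.values()):
        for chip in range(num_chips):      # one request per chip per pass
            if per_chip[chip]:
                order.append(per_chip[chip].popleft())
    return order

queue = [(1, 3), (2, 3), (3, 3), (4, 0), (5, 6)]    # many requests hit chip 3
print(vas_commit_order(queue))      # chip-3 backlog delays chips 0 and 6
print(pas_commit_order(queue, 9))   # chips 0 and 6 get served early
```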

15 Observations
# of chips < # of memory requests
– The total number of chips is much smaller than the total number of memory requests coming from different I/O requests
There exist many requests heading to the same chip, but to different internal resources
– Multiple memory requests can be built into a high-FLP transaction if we could change the commit order

16 Insights
Stalled memory requests can be served immediately
– If the scheduler composes requests beyond the boundary of individual I/O requests and commits them regardless of their arrival order
The scheduler gains more flexibility in building a flash transaction with high FLP
– If it can commit requests targeting different internal flash resources

17 Sprinkler
Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout
Improving transactional locality
– Supply many memory requests to the underlying flash controllers

18 RIOS: Resource-driven I/O Scheduling
Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout
[Figure: memory requests 1–11 distributed across chips C0–C8]

19 RIOS: Resource-driven I/O Scheduling
RIOS
– Out-of-order scheduling
– Fine-granule out-of-order execution
– Maximizing utilization
[Figure: memory requests 1–11 spread across chips C0–C8]
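A minimal sketch of what resource-driven scheduling could look like, assuming the scheduler walks the idle chips and pulls any pending memory request for them regardless of queue order; names and structure are mine, not the paper's.

```python
# Resource-driven scheduling as slides 18-19 describe it: iterate over
# the internal resources (chips) rather than over the storage queue,
# and commit pending memory requests for each idle chip regardless of
# which host I/O request they belong to. Illustrative sketch only.
from collections import defaultdict

def rios_schedule(pending, idle_chips):
    """pending: list of (req_id, io_id, chip) tuples.
    Returns the requests to commit this cycle, grouped per idle chip."""
    by_chip = defaultdict(list)
    for req in pending:
        by_chip[req[2]].append(req)
    committed = {}
    for chip in idle_chips:                  # resource layout drives the walk
        if by_chip[chip]:
            committed[chip] = by_chip[chip]  # whole group goes to one chip
    return committed

pending = [(1, 0, 3), (2, 0, 3), (3, 1, 3), (4, 1, 0), (5, 2, 6)]
print(rios_schedule(pending, idle_chips=[0, 6]))
# chip-3 requests wait for chip 3; idle chips 0 and 6 are fed immediately
```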

20 FARO: FLP-Aware Request Over-commitment
High Flash-Level Parallelism (FLP)
– Bring as many requests as possible to flash controllers, allowing them to coalesce many memory requests into a single flash transaction
Consideration
– Careless over-commitment of memory requests can introduce more resource contention

21 FARO: FLP-Aware Request Over-commitment
Overlap depth
– The number of memory requests heading to different planes and dies, but the same chip
Connectivity
– The maximum number of memory requests that belong to the same I/O request
[Figure: two candidate request groups for chip C3, each with overlap depth 4 but connectivity 2 vs. connectivity 1]
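The two metrics translate directly into code. The selection policy below (prefer higher overlap depth, then higher connectivity) is a guess at FARO's intent for illustration, not its exact algorithm.

```python
# Sketch of the two FARO metrics defined on this slide; the combining
# policy in faro_pick is an assumption, not the paper's exact rule.
from collections import Counter

def overlap_depth(reqs):
    """Requests heading to the same chip but different (die, plane)."""
    return len({(r["die"], r["plane"]) for r in reqs})

def connectivity(reqs):
    """Largest number of requests that belong to the same host I/O."""
    return max(Counter(r["io_id"] for r in reqs).values())

def faro_pick(candidates):
    """Pick the candidate request group to over-commit as one transaction."""
    return max(candidates, key=lambda g: (overlap_depth(g), connectivity(g)))

a = [{"io_id": 7, "die": 0, "plane": 0}, {"io_id": 7, "die": 0, "plane": 1},
     {"io_id": 8, "die": 1, "plane": 0}, {"io_id": 8, "die": 1, "plane": 1}]
b = [{"io_id": 7, "die": 0, "plane": 0}, {"io_id": 8, "die": 0, "plane": 1},
     {"io_id": 9, "die": 1, "plane": 0}, {"io_id": 3, "die": 1, "plane": 1}]
print(overlap_depth(a), connectivity(a))  # 4 2
print(overlap_depth(b), connectivity(b))  # 4 1
print(faro_pick([a, b]) is a)             # True: same depth, better connectivity
```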

22 Sprinkler
[Figure: requests 1–5 pipelined across chips C0–C8 under Sprinkler]

23 Evaluations
Simulation
– NFS (NANDFlashSim), http://nfs.camelab.org
– 64 ~ 1024 flash chips, dual die, four plane (our SSD simulator simultaneously executes 1024 NFS instances)
– Intrinsic latency variation (write: fast page 200 us ~ slow page 2.2 ms; read: 20 us)
Workloads
– Mail file server (cfs), hardware monitor (hm), MSN file storage server (msnfs), project directory service (proj)
– High transactional-locality workloads: cfs2, msnfs2~3
Schedulers
– VAS: Virtual Address Scheduler, using FIFO
– PAS: Physical Address Scheduler, using extra queues
– SPK1: Sprinkler, using only FARO
– SPK2: Sprinkler, using only RIOS
– SPK3: Sprinkler, using both FARO and RIOS

24 Throughput
[Figures: Bandwidth and IOPS, with callouts of 300 MB/s and 4x improvement]
– Compared to VAS: 42 MB/s ~ 300 MB/s improvement
– Compared to PAS: 1.8x better performance

25 I/O and Queuing Latency
[Figures: Avg. Latency and Queue Stall Time, with a callout on large request sizes]
– SPK1 is worse than PAS; SPK2 is worse than SPK1
– SPK1 by itself cannot secure enough memory requests and still has a parallelism dependency
– SPK3 (Sprinkler) reduces device-level latency and queue pending time by at least 59% and 86%, respectively

26 Idleness Evaluation
[Figures: Inter-chip Idleness and Intra-chip Idleness]
– SPK1 shows worse inter-chip idleness reduction than PAS
– SPK1 shows better intra-chip idleness reduction than PAS
– Considering both intra- and inter-chip idleness, SPK3 outperforms all schedulers tested (around 46%)

27 Conclusion and Related Work
Conclusion:
– Sprinkler relaxes the parallelism dependency by sprinkling memory requests based on the underlying internal resources
– Sprinkler offers at least 56.6% shorter latency and 1.8x ~ 2.2x better bandwidth than a modern SSD controller
Related work:
– Balancing timing constraints, fairness, and different dimensions of physical parallelism in DRAM-based memory controllers [HPCA'10, MICRO'10 Y. Kim, MICRO'07, PACT'07]
– Physical address scheduling [ISCA'12, TC'11]


29 Parallelism Breakdown [VAS] [SPK1 FARO-only] [SPK2 RIOS-only] [SPK3 Sprinkler]

30 # of Transactions [64-chips] [1024-chips]

31 Time Series Analysis

32 GC

33 Sensitivity Test [64-chips] [1024-chips] [256-chips]

