1
An In-Depth Study of Next Generation Interface for Emerging Non-Volatile Memories
Wonil Choi, Jie Zhang, Shuwen Gao, Jaesoo Lee, Myoungsoo Jung, Mahmut Kandemir Yonsei University Hello, I am Jie Zhang from Yonsei University, Korea. I am honored to present our work to you. In this work, we study "NVMe" and observe its real characteristics. Let's get started!
2
Executive Summary Motivation Challenge Modeling Exploration
To fully utilize the potential performance of Non-Volatile Memories (NVM), Non-Volatile Memory Express (NVMe) was recently proposed. Challenge Its wide range of design parameters has not been fully explored in the literature. Hence, the system considerations and limitations that need to be taken into account are still veiled. Modeling Due to the absence of publicly-available tools, we developed an analytical simulation model based on the NVMe specifications. Our model can characterize NVMe in a wide variety of storage settings, including physical bus performance, NVM type, queue count/depth, etc. Exploration Using our model, we explored various NVMe design parameters which can affect I/O response time and system throughput. We also present key observations regarding communication overheads and queue configurations. Before starting this talk, let me summarize our work in a slide. (click) To fully utilize the potential performance of non-volatile memories such as NAND flash and phase-change RAM, a brand-new memory interface was proposed. It is Non-Volatile Memory Express, also referred to as "NVMe". (click) Unfortunately, its wide range of design parameters has not been fully explored in the literature. Hence, the system considerations and limitations that need to be taken into account are still veiled. (click) Due to ~ (click) Using our model ~
3
Background and Motivation: Advent of NVMe and its Well-known Characteristics
From now on, let's go over important background related to NVMe.
4
Need for High-Performance Interfaces
Host System Storage System (NVMs) Interface Performance Bottleneck! More Resources More Parallelism Higher Bandwidth SATA/SAS PCIe Storage interface as a bridge between host and storage Traditional SATA and SAS have been widely employed Storage-internal bandwidths keep increasing Thanks to increased resources and parallelism Traditional interfaces failed to deliver the very-high bandwidths From upgrading traditional interfaces to devising new high-performance interfaces (click) The storage interface plays an important role in connecting the host system and the storage system. (click) To deliver storage bandwidth to the host system, traditional SATA and SAS have been widely employed. (click) However, storage bandwidth keeps increasing quickly, thanks to more resources and more parallelism in the storage. (click) As a result, traditional interfaces started to fail to deliver the significantly increased storage bandwidths. Therefore, storage interfaces became a performance bottleneck. (click) To address this, there have been various efforts, from upgrading traditional SATA/SAS interfaces to devising new high-performance interfaces such as PCIe.
5
PCI Express & NVM Express
A high-speed physical interconnect (proposed by PCI-SIG) Widely adopted in computer system extension GPU connection & SSD connection A brand-new logical device interface (proposed by NVMHCI) Designed for exploiting the potential of high-performance NVMs and standardizing the PCIe-based memory interfaces Samsung NVMe 96X series, Intel SSD DC series, HGST Ultrastar SN series, Micron 9100 series, Huawei ES series, etc. Our focus in this work is to explore NVMe on top of PCIe (click) As a promising high-performance interface, "PCI Express", also referred to as "PCIe", is quite popular. PCIe is a physical interconnect, which was proposed by PCI-SIG. (click) PCIe is widely adopted in computer system extension such as GPU connection and SSD connection. (click) On the other hand, "NVM Express", also referred to as "NVMe", is recently getting attention. NVMe is a logical interface, which was proposed by NVMHCI. (click) NVMe is designed for exploiting the potential of high-performance ~ (click) Many vendors have released SSDs based on PCIe and NVMe. Here are some examples: Samsung NVMe 96X series ~. (click) We want to emphasize that PCIe is a physical connection, whereas NVMe is a communication protocol. (click) Our focus in this work is to explore NVMe on top of PCIe.
6
NVMe’s Streamlined Communication
“The Linux I/O Stack Diagram” Traditional Interface NVMe Interface Then, let’s look at the two main advantages of NVMe, which are widely advertised. The first strength of NVMe is its streamlined communication protocol. This is a part of the linux I/O stack diagram. (click) between block I/O layer and physical devices, traditional I/O stack includes multiple intermediate layers. (click) In contrast, NVMe strives to reduce I/O latencies by minimizing such layers.
7
NVMe’s Rich Queuing Mechanism
Traditional interface provides a single I/O queue with tens of entries Native Command Queuing (NCQ) with 32 entries NVMe strives to increase throughput by providing a scalable number of queues with scalable entries Up to 64K queues with up to 64K entries NVMe queue configurations in the host-side memory Pairs of Submission Queue (SQ) and Completion Queue (CQ) Per-core, Per-process, or Per-thread The second strength of NVMe is its rich queuing mechanism. (click) The traditional interface provides a single I/O queue with tens of entries. The representative Native Command Queuing (NCQ) has only 32 entries. (click) In contrast, NVMe strives to increase throughput by providing a scalable number of queues with a scalable number of entries, depending on system design requirements. It can provide up to 64K queues, and each queue can have up to 64K entries. (click) Here is an example of NVMe queue configurations. The queues are located in the host main memory. (click) A queue pair consists of a submission queue and a completion queue, referred to as SQ and CQ, respectively. (click) Each core can have this queue pair. (click) Each process can have the queue pair. (click) And each thread can have the queue pair.
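For illustration only, here is a minimal sketch of this queue layout in Python, with the specification's 64K-queue and 64K-entry limits as constants; the type and function names are ours, not from any NVMe driver or the authors' tool.

```python
from dataclasses import dataclass

MAX_QUEUES = 64 * 1024       # NVMe allows up to 64K I/O queues
MAX_QUEUE_DEPTH = 64 * 1024  # each queue may have up to 64K entries

@dataclass
class QueuePair:
    """One Submission Queue (SQ) / Completion Queue (CQ) pair kept in host memory."""
    sq_depth: int
    cq_depth: int

def per_core_queue_pairs(num_cores: int, depth: int) -> dict:
    """Give every core its own SQ/CQ pair, one possible NVMe queue layout."""
    assert num_cores <= MAX_QUEUES and depth <= MAX_QUEUE_DEPTH
    return {core: QueuePair(sq_depth=depth, cq_depth=depth) for core in range(num_cores)}
```

For example, per_core_queue_pairs(8, 1024) would describe eight SQ/CQ pairs of 1024 entries each, one per core; the same idea applies per process or per thread.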
8
We Want to Know Real Characteristics
Question (1) regarding the streamlined communication: How much overhead is brought up by the NVMe communication for different types of NVMs in processing an I/O request? Question (2) regarding the rich queuing mechanism: Is it possible to scalably extract performance improvements as the number of queues and queue entries increases? No publicly-available tool to characterize NVMe Design parameters have not been studied in the literature NVMe design considerations and limitations are still veiled Even though NVMe promotes these characteristics, we want to know the real characteristics. Specifically, we have the following questions. (click) First, regarding the streamlined communication, how much overhead is brought ~ ? (click) Second, regarding the rich queuing mechanism, is it possible to scalably ~ ? (click) Unfortunately, to the best of our knowledge, there is no available tool to characterize NVMe in the public domain. As a result, a lot of NVMe design parameters have not been studied in the literature. And NVMe design considerations and limitations are still veiled. (click) Therefore, in this work, we propose an analytical model to uncover its real characteristics. We propose an analytical NVMe model to uncover its real characteristics
9
Preliminaries: PCIe/NVMe Operations
Before introducing our NVMe model, let's take a look at detailed PCIe and NVMe operations. The following details form the basis for building our analytical model.
10
Memory Stack Architecture
I/O Threads (on cores) NVMe SUB/CPL Queues (in memory) NVMe Drivers Host-Side Storage-Side PCIe NVMe Info RD/WR Data (click) The NVMe I/O stack spans both the host and the storage. (click) Threads running on host cores generate I/O requests. (click) The storage can be based on block-addressable NVM such as NAND flash, or byte-addressable NVM such as PCM and STT-MRAM. (click) To control the underlying NVM, the storage also includes an NVM controller. (click) PCIe, as a physical connection, integrates the host and the storage. (click) As a main component of NVMe, the NVMe driver is placed on the host side. (click) The NVMe driver implements queues in the host-side memory. (click) To communicate with the host-side NVMe driver, another critical component of NVMe, the NVMe controller, is located on the storage side. In particular, the NVMe controller implements doorbell registers, which the NVMe driver uses to notify the NVMe controller of necessary information. (click) The NVMe driver and the NVMe controller communicate "NVMe information" and "RD/WR data" over the PCIe bus. (click) Please keep in mind that NVMe spans both sides, that is, the NVMe driver in the host and the NVMe controller in the storage. NVMe Controller (Doorbell Reg.) NVM Controller NVM (Block / Byte-Addressable)
11
Communication Protocol
[Diagram: I/O Write and I/O Read handshaking timelines between Host-Side and SSD-Side, showing DB-Write, IO-Req, IO-Fetch, WR-DMA/RD-DMA, SSD-PROC (SSD WR / SSD RD), CPL-Submit, and MSI along the time flow] Then, how do the NVMe driver on the host and the NVMe controller in the storage communicate? (click) Let's look at the communication process when processing an I/O write request. (click) The left side is the host while the right side is the SSD. (click) And time flows from top to bottom. (click) When a new I/O is inserted in the submission queue, the NVMe driver rings a doorbell register of the NVMe controller. We call it "DoorBell-Write". (click) When the SSD is ready to process the I/O request, the NVMe controller requests it from the NVMe driver. We call this "I/O request". (click) The NVMe driver sends the actual I/O request to the SSD, which is called "I/O fetch". (click) The data to be written is also transferred to the SSD. We call this "Write DMA". (click) Based on the I/O request and the data to be written, the SSD processes the I/O request. (click) Once SSD processing is done, the NVMe controller inserts an I/O completion message into the completion queue, which is referred to as "Completion submit". (click) Finally, the NVMe controller sends a Message Signaled Interrupt (MSI) to the NVMe driver to notify the host of the completion of the I/O request. (click) Processing an I/O read request follows a similar series of steps. (click) "DoorBell-Write" to notify that there is an I/O request submission. (click) The SSD requests the I/O from the host. (click) "I/O fetch" from the host. (click) Upon the read request, the SSD processes the read. (click) The read data is transferred to the host memory, which is called "Read DMA". (click) "Completion submit" and (click) "MSI" follow to finalize the I/O read request. This is the handshaking process of the NVMe protocol.
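Written as data, the ordered handshake is simply the following (a restatement of the steps above; the labels follow the slide, not any real driver API):

```python
# NVMe handshake steps in the order described above (labels follow the slide).
WRITE_STEPS = ["DB-Write", "IO-Req", "IO-Fetch", "WR-DMA", "SSD-PROC", "CPL-Submit", "MSI"]
READ_STEPS  = ["DB-Write", "IO-Req", "IO-Fetch", "SSD-PROC", "RD-DMA", "CPL-Submit", "MSI"]
```

The only difference between the two sequences is where the data movement (WR-DMA versus RD-DMA) sits relative to the device processing step.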
12
PCIe Bus & Packet Transfer
PCIe Lane (x1 ~ x32) … PCIe Bus (v1 ~ v4) Host System SSD Per-lane bandwidth by version: v1 x1 = 250MB/s, v2 x1 = 500MB/s, v3 x1 = 1GB/s, v4 x1 = 2GB/s Transaction Layer Packet (TLP) Data Link Layer Packet (DLLP) The NVMe information and DMA involved in the NVMe communication are moved over the PCIe bus. (click) Note that the host system and the SSD are connected by a PCIe bus. The PCIe bus has been upgraded over time, and the latest specification is version 4. (click) As you can see in the table, whenever the version is upgraded, its bandwidth doubles. (click) Furthermore, a PCIe bus consists of a single lane or multiple lanes, up to 32, which increases the PCIe bandwidth in a scalable fashion. For example, a v2 PCIe bus with 16 lanes has about 8GB/s of bandwidth. (click) Based on this high-performance bus, all information is transferred in the form of packets. There are two types of packets: the Transaction Layer Packet (TLP) and the Data Link Layer Packet (DLLP). TLPs are used for delivering user information, whereas DLLPs are system-level packets with no user intervention. The representative use case for a DLLP is the acknowledgment (ACK), which is sent back upon any packet reception. (click) All NVMe information and DMA exhibited in the handshaking process use TLPs. Their packet sizes are listed in this table: the TLPs for NVMe information (DB-Write, IO-Req, IO-Fetch, CPL-Submit, MSI) are about 20-24B each, the TLP for DMA is up to 4KB (the maximum packet size), and the ACK is an 8B DLLP. If the DMA size is bigger than 4KB, additional TLPs are employed.
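A small helper, assuming the per-lane baseline and packet sizes from the tables above, turns version, lane count, and packet size into bandwidth and transfer time. The exact size of each handshake message beyond the stated 20-24B range is an assumption here, and DLLP ACK traffic is ignored; this is an illustration, not the paper's simulator.

```python
V1_LANE_BANDWIDTH = 250e6  # bytes/s per lane for PCIe v1; doubles with every version

# TLP sizes used by the NVMe handshake (DB-Write is 24B; treating the remaining
# messages as 20B each is an assumption based on the 20-24B range above).
PACKET_BYTES = {"DB-Write": 24, "IO-Req": 20, "IO-Fetch": 20, "CPL-Submit": 20, "MSI": 20}
MAX_TLP_PAYLOAD = 4096     # DMA TLPs carry up to 4KB; larger DMAs need more TLPs
ACK_DLLP_BYTES = 8         # acknowledgment DLLP, ignored by the helpers below

def pcie_bandwidth(version: int, lanes: int) -> float:
    """Bus bandwidth in bytes/s: doubles per version, scales linearly with lanes."""
    return V1_LANE_BANDWIDTH * (2 ** (version - 1)) * lanes

def transfer_time(num_bytes: float, version: int, lanes: int) -> float:
    """Seconds to move num_bytes over the bus (only one packet is on the bus at a time)."""
    return num_bytes / pcie_bandwidth(version, lanes)
```

As a sanity check, pcie_bandwidth(2, 16) gives 8 GB/s, matching the v2 x16 figure quoted above.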
13
Our NVMe Model: Based on Four Components
Now, based on the knowledge of PCIe and NVMe, let us see how we construct our NVMe model. Our model consists of four different components.
14
Our Analytical Model (1) I/O Request (3) Host System (2) PCIe Bus
(4) NVM (1) I/O Request Model Based on the handshaking process of NVMe TReadI/O = TDB-Write + TIO-Req + TIO-Fetch + TCPL-Submit + TMSI + TStall + TRD-NVM + TRD-DMA TWriteI/O = TDB-Write + TIO-Req + TIO-Fetch + TCPL-Submit + TMSI + TStall + TWR-NVM + TWR-DMA TNVMeInfo = TDB-Write + TIO-Req + TIO-Fetch + TCPL-Submit + TMSI + TStall TNVMeInfo vs TNVM + TDMA ? Our analytical model considers all related components in processing an I/O request, that is, the lifespan of an I/O request, the PCIe bus, the host system, and the NVM-based storage. (click) First of all, let's take a look at the I/O lifetime model. (click) Our I/O lifetime model is based on the handshaking process of the NVMe specification. (click) Therefore, the latency of an I/O read request is determined by the time taken to transfer "DoorBell-Write", "I/O Request", "I/O Fetch", "Completion-Submit", and "Message Signaled Interrupt", the time taken to read data from the underlying NVM, and the time taken to transfer the read data. Since the PCIe bus allows only one packet to be transferred at a time, we add the additional time spent waiting for the bus service due to stalls. (click) In the same manner, the latency of an I/O write request is determined by the time taken to transfer "DoorBell-Write", "I/O Request", "I/O Fetch", "Completion-Submit", and "Message Signaled Interrupt", the time taken to write data to the underlying NVM, and the time taken to transfer the write data. And finally, the stall time. (click) Among these contributors to the I/O latency, our interest is the time taken for the NVMe handshaking process. (click) Hence, we call the time involved in the NVMe handshaking alone TNVMeInfo. (click) The time for NVMeInfo relative to the time for device processing and the time for DMA determines the NVMe communication overhead.
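Using the bandwidth and packet-size helpers sketched on the PCIe slide, the two equations can be rendered directly; TStall is left as an input here because it depends on how bus contention is modeled.

```python
def t_nvme_info(version: int, lanes: int, t_stall: float = 0.0) -> float:
    """TNVMeInfo = TDB-Write + TIO-Req + TIO-Fetch + TCPL-Submit + TMSI + TStall."""
    return sum(transfer_time(b, version, lanes) for b in PACKET_BYTES.values()) + t_stall

def t_read_io(size_bytes: int, t_rd_nvm: float, version: int, lanes: int,
              t_stall: float = 0.0) -> float:
    """TReadI/O = TNVMeInfo + TRD-NVM + TRD-DMA."""
    return t_nvme_info(version, lanes, t_stall) + t_rd_nvm + transfer_time(size_bytes, version, lanes)

def t_write_io(size_bytes: int, t_wr_nvm: float, version: int, lanes: int,
               t_stall: float = 0.0) -> float:
    """TWriteI/O = TNVMeInfo + TWR-NVM + TWR-DMA."""
    return t_nvme_info(version, lanes, t_stall) + t_wr_nvm + transfer_time(size_bytes, version, lanes)
```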
15
Our Analytical Model (1) I/O Request (3) Host System (4) NVM
(2) PCIe Bus (2) PCIe Bus Model Scalable performance (versions, lanes) One packet only on the bus Queues for packets (3) Host System Model Queue count, queue depth I/O submission rates (4) NVM Model Block-addressable NVM (flash), byte-addressable NVM (PCM, STT-MRAM) Read/write latencies, processing unit sizes (click) The second part of our model is the PCIe bus model. (click) The bus performance is highly scalable, based on its version and the number of lanes in it. (click) Since only one packet can use the PCIe bus at a time, both sides of it have queues for packets waiting to acquire the bus, from which the I/O stall time can be calculated. (click) The next module of our model is the host system model. (click) Recall that the host system is where the NVMe I/O queues are implemented. (click) In our model, the queue count and queue depth are configurable, depending on system requirements. (click) Furthermore, since the host system is where applications run, the I/O generation rates can vary. (click) The last part of our model is the NVM model. (click) The target storage can have any type of NVM. Hence, we categorize various NVMs into two groups: one is block-addressable NVM such as NAND flash memories, and the other is byte-addressable NVM such as phase-change RAM and STT-MRAM. (click) Depending on the NVM category, the processing unit size and the read/write latencies can vary. To sum up, our NVMe model, based on these four detailed modules, allows us to explore a wide variety of design parameters.
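These components map naturally onto configuration records. The sketch below shows one way the tunable parameters might be grouped; the field and type names are ours, not the paper's tool.

```python
from dataclasses import dataclass

@dataclass
class PCIeBusConfig:
    version: int           # 1..4; per-lane bandwidth doubles with each version
    lanes: int             # 1..32

@dataclass
class HostConfig:
    queue_count: int       # number of SQ/CQ pairs (per core, process, or thread)
    queue_depth: int       # entries per queue
    io_interval_s: float   # how often each thread submits an I/O request

@dataclass
class NVMConfig:
    unit_bytes: int        # processing unit: e.g. 4KB (block NVM) or 64B (byte-addressable NVM)
    read_latency_s: float
    write_latency_s: float
```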
16
Experiments & Observations
Now, let me describe our experience with the proposed NVMe model. In particular, we focus on (1) the overhead of the NVMe handshaking process and (2) the benefits of the NVMe queuing mechanism.
17
Experimental Setup Our model features a broad range of design parameters But, our goal is to uncover true NVMe characteristics (1) We configure two representative NVM SSDs (2) We configure two representative micro-benchmarks NVM Type: Block-based NVM (block-addressable) / DRAM-like NVM (byte-addressable); Base Unit Size: 4KB / 64B; R/W Latencies: 30us (read), 200us (write) / 50ns (read), 1us (write); Access Pattern: Block Access / Byte Access; Size Range: 4KB ~ 2MB / 8B ~ 1024B; I/O Interval: 10us / 100ns (click) Our model features a broad range of design parameters. (click) However, our goal is to uncover the real NVMe characteristics related to the NVMe communication overhead and the performance benefits of the NVMe queuing system. (click) Accordingly, we configured two representative NVM models. One is a block NVM whose processing unit size is 4KB; its read and write latencies are 30us and 200us, respectively. The other is a DRAM-like NVM whose processing unit size is 64B; its read and write latencies are 50ns and 1us, respectively. (click) We also configured two representative application models. To mimic block I/O applications, one micro-benchmark has request sizes varying from 4KB to 2MB, and it is assumed to generate an I/O request every 10us. Furthermore, we employ another micro-benchmark to reflect applications working on the main memory; its request sizes vary from 8B to 1KB and its I/O generation interval is 100ns. (click) We want to emphasize that these values can vary significantly depending on application/system designs. However, since our focus is to evaluate NVMe, we fix the representative values as you can see here. The above parameter values can vary significantly, but we fix these representative values to evaluate NVMe!
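With the configuration records sketched on the model slide, the two SSD models and two micro-benchmarks from the table look like this (the values are copied from the table; the record types themselves are our sketch):

```python
# The two SSD models from the table above.
BLOCK_NVM = NVMConfig(unit_bytes=4 * 1024, read_latency_s=30e-6, write_latency_s=200e-6)
DRAM_LIKE_NVM = NVMConfig(unit_bytes=64, read_latency_s=50e-9, write_latency_s=1e-6)

# The two micro-benchmark models: request-size range (bytes) and I/O submission interval.
BLOCK_WORKLOAD = {"size_range": (4 * 1024, 2 * 1024 * 1024), "io_interval_s": 10e-6}
BYTE_WORKLOAD  = {"size_range": (8, 1024), "io_interval_s": 100e-9}
```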
18
Communication Overhead Analysis: Block NVM + NVMe
First, let’s investigate NVMe communication overhead in the block NVM configuration. (click) our interest is the time contribution for NVMeInfo over the time for SSD processing and DMA in an I/O latency. (click) Y-axis represents the latency breakdown of single 4KB Read and single 4KB Write requests. (click) X-axis indicates the employed PCIe bus performance, which are specified by version number and lane counts. (click) You can see negligible proportion of NVMeInfo in the total I/O latency. (click) In conclusion, NVMe communication is not a burden in block NVMs. As advertised, it is quite streamlined! Our interest: TNVMeInfo VS TSSD+TDMA in block NVMs ? Y-axis: latency breakdown of a single 4KB-I/O R/W request X-axis: varying PCIe bus performance TNVMeInfo: 0.15% (read) & 0.03% (write) of total latency In block NVMs, NVMe communication is NOT a burden
19
Communication Overhead Analysis: DRAM-like NVM + NVMe
(click) Next, let’s do the same experiments in DRAM-like NVM configuration. (click) In this case, we break down the latencies of a 64B Read and Write requests. (click) Unlike the case of the block NVM, the time for NVMeInfo is a bit part of the total I/O latency. Specifically, for version \-2 PCIe buses, the NVMeInfo takes on average 44% and 4% of entire latency for read and write, respectively. (click) In DRAM-like NVM, the time fraction of NVMe Information is remarkable, which is contrary to the common expectation on NVMe. How about in DRAM-like NVMs? Y-axis: latency breakdown of a single 64B-I/O R/W request TNVMeInfo: 44% (read) & 4% (write) of total latency (V2x1~V2x16) In DRAM-like NVMs, fraction of NVMe information is REMARKABLE
20
Communication Overhead Analysis: DRAM-like NVM + NVMe
[64B Read on V2x1 PCIe Bus] [64B Write on V2x1 PCIe Bus] Latency breakdown of a 64B RD/WR request on a V2x1 PCIe bus Read: TNVMeInfo (58%) VS TDMA + TSSD (42%) Write: TNVMeInfo (20%) VS TDMA + TSSD (80%) As PCIe performance increases, this high overhead can be hidden The number of I/O requests also increases in DRAM-like NVMs Let's take a close look at the NVMe communication overhead in DRAM-like NVMs. (click) These two graphs present the latency breakdown of a 64B read and a 64B write request when employing version-2 PCIe with a single lane. (click) In the case of the read operation, 58% of the total I/O execution time is spent on the NVMe handshaking process. (click) In the case of the write operation, 20% of the total I/O latency is spent transferring NVMe information. This implies that NVMe is not that streamlined in DRAM-like NVM configurations. (click) One can note that, as the PCIe performance increases, this high overhead can be hidden. (click) However, also note that, unlike block NVMs, the number of I/O requests also increases significantly in DRAM-like NVMs, which can keep the portion of NVMe information very high.
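As a back-of-envelope check with the earlier helpers, and ignoring DLLP ACKs, other bus overheads, and stalls, the handshake bytes alone already account for roughly half of a 64B read on a v2 x1 bus, in the same ballpark as the 58% reported above; the assumed 20-24B message sizes mean this is only a rough approximation of the paper's breakdown.

```python
bw = pcie_bandwidth(version=2, lanes=1)     # 500 MB/s
t_info = sum(PACKET_BYTES.values()) / bw    # ~104B of handshake TLPs -> ~0.21us
t_dma  = 64 / bw                            # 64B read DMA            -> ~0.13us
t_nvm  = 50e-9                              # DRAM-like NVM read latency
frac   = t_info / (t_info + t_dma + t_nvm)  # roughly half of the total read latency
```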
21
Too frequent ISR invocations in DRAM-like NVMs
NVMe ISR Overhead Host-Side SSD-Side MSI overhead We want to investigate another possible NVMe overhead. (click) Recall that the storage sends a Message Signaled Interrupt (MSI) as the last step of the NVMe handshaking. Upon the MSI, the host triggers an interrupt service routine (ISR) to finalize the I/O request. (click) Unfortunately, an ISR execution imposes a long CPU intervention, which can be a big overhead on the host. Since this ISR is invoked whenever the service of an I/O request finishes, frequent ISR invocations can be a threat to system performance. (click) To monitor the frequency of ISR invocations, we varied the I/O generation intervals from 20 to 100ns for the DRAM-like NVM and from 2 to 10us for the block NVM. (click) We also increased the number of I/O threads. As we assume that each I/O thread has its own submission queue, we can say that the number of submission queues increases. (click) As you can see here, NVMe brings too frequent ISR invocations in DRAM-like NVM configurations. We want to mention that a few prior works have also tried to resolve the frequent-ISR overheads. In conclusion, system designers using NVMe should carefully consider this additional host-side burden. At the host, an Interrupt Service Routine (ISR) is triggered by the MSI An ISR execution imposes a long CPU-intervention (overhead) I/O generation intervals: DRAM-like (20~100ns) Block (2~10us) I/O thread (or SQ): 1 ~ 32 Too frequent ISR invocations in DRAM-like NVMs
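The ISR pressure can be estimated directly, assuming every submitted I/O completes and raises its own MSI (a simplification of the experiment above, not the paper's exact numbers):

```python
def isr_rate_per_second(num_threads: int, io_interval_s: float) -> float:
    """One MSI (and so one ISR) per completed request: rate ~= threads / submission interval."""
    return num_threads / io_interval_s

isr_rate_per_second(32, 100e-9)  # DRAM-like NVM, 32 threads: ~3.2e8 ISRs/s if every I/O completes
isr_rate_per_second(32, 10e-6)   # Block NVM, 32 threads:     ~3.2e6 ISRs/s
```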
22
Queue Depth Analysis: Throughput
[Block NVM with V2x16 PCIe Bus] [DRAM-like NVM with V2x16] From now on, we'll examine the NVMe queuing mechanism, which allows the queue depth and count to increase in a scalable fashion. (click) Specifically, if one continues to increase the queue depth, is it possible to extract additional throughput benefits? (click) To evaluate this, we increased the queue depth from 1 to 65536, which is the maximum value in the NVMe specifications. (click) We also varied the I/O request sizes, that is, 8KB to 1MB for the block NVM and 8B to 1KB for the DRAM-like NVM. As you can see, in every case, the I/O throughput saturates at a specific queue depth. For small-sized requests, thanks to the high-performance PCIe bus, I/O requests are quickly completed and thus do not accumulate in the queue. So, in this case, we do not need a very deep queue. For requests with large sizes, the maximum throughput is bounded by the PCIe bus bandwidth. Note that the maximum storage throughput is determined by its interface bandwidth. (click) Therefore, simply working with a very large queue does not necessarily improve the storage throughput in a scalable fashion. Our interest: scalable throughput as Q depth increases? Queue depth (entries): 1 ~ (Max value in NVMe specifications) Request sizes: Block NVM (8KB~1MB) & DRAM-like NVM (8B~1024B) Simply working with a very large queue does not necessarily improve the storage throughput in a scalable fashion
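The saturation behavior follows from a Little's-law style argument: once enough requests are outstanding to keep the PCIe bus busy, extra queue entries add no throughput. The bound below is our simplification, not the paper's model, which also accounts for the NVMeInfo packets and stall behavior:

```python
def saturating_queue_depth(req_bytes: int, device_latency_s: float,
                           version: int, lanes: int) -> float:
    """Outstanding requests needed to keep the PCIe bus busy (Little's-law style bound)."""
    bus_time = transfer_time(req_bytes, version, lanes)  # bus time consumed per request
    return (bus_time + device_latency_s) / bus_time      # beyond this, the bus caps throughput
```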
23
Queue Depth Analysis: Saturation Point
Our interest: Q depth limit? PCIe bus performance: V2x1 ~ V4x32 Saturation point: Block NVM (700) & DRAM-like NVM (10) In DRAM-like NVM, high-bandwidth PCIe bus and streamlined NVMe protocols make queue levels quite low We examine the saturation point in terms of queue depth. (click) We varied the PCIe bus performance. (click) In the case of the block NVM, the saturation queue depth is about 700, whereas about 10 entries are enough for the DRAM-like NVM. (click) The reason for the low queue depth in DRAM-like NVMs is that the high-bandwidth PCIe bus and the streamlined NVMe protocol keep queue levels quite low. Therefore, there is no need to increase the queue depth for DRAM-like NVMs.
24
Queue Depth Analysis: Latency
Then, how about the latency under varying queue depth? (click) Users can fill the queue with more and more I/O requests, regardless of the saturation point. (click) Hence, for both the block NVM and the DRAM-like NVM based on a version-2 PCIe bus with 16 lanes, (click) we monitored I/O latencies by increasing the queue depth from 1 to the maximum value. (click) Increasing the queue depth severely hurts I/O latencies in both block and DRAM-like NVMs. (click) This is because, after the PCIe bus bandwidth runs out, all NVMeInfo and DMA packets are stalled waiting for the bus service. Therefore, when designing NVMe-based storage systems, the queue depth should be carefully decided by considering both throughput and latency. [Block NVM with V2x16 PCIe Bus] [DRAM-like NVM with V2x16] User can fill queue with I/O requests regardless of the saturation This severely hurts I/O latencies in both Block & DRAM-like NVMs After PCIe bus bandwidth runs out, all NVMeInfo and DMA packets are stalled to get the bus service
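A crude way to see the latency penalty (again our simplification, not the paper's stall model): once the bus is the bottleneck, each request also waits for the bus time of every request queued ahead of it, so latency grows roughly linearly with queue depth.

```python
def queued_latency(queue_depth: int, req_bytes: int, device_latency_s: float,
                   version: int, lanes: int) -> float:
    """Approximate latency at a given depth when the bus is saturated."""
    bus_time = transfer_time(req_bytes, version, lanes)
    # Unloaded latency plus the bus time of the requests queued ahead of this one.
    return device_latency_s + bus_time + max(0, queue_depth - 1) * bus_time
```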
25
Queue Count Analysis: Throughput
[Block NVM with V2x16 PCIe Bus] In addition to the queue depth, we are also interested in the number of queues. (click) Specifically, if one continues to increase the queue count, is it possible to extract additional throughput benefits? (click) To evaluate this, for both the block NVM and the DRAM-like NVM with a version-2 16-lane PCIe bus, (click) we increased the queue count from 1 to 65536, which is the maximum value in the NVMe specifications. Since each thread has its own queue in our experiment, we use "threads" and "queues" interchangeably. (click) We also varied the I/O request sizes, that is, 4KB to 32KB for the block NVM and 8B to 1KB for the DRAM-like NVM. (click) In the case of the block NVM, to achieve the maximum throughput, the queue count can increase until the PCIe bandwidth runs out. Therefore, compared to the traditional single queue, NVMe's multiple queues are a good way to improve the throughput. (click) In the case of the DRAM-like NVM, the throughput saturates at a small queue count. This is because the high proportion of NVMeInfo and DMA in the DRAM-like NVM wastes PCIe bus bandwidth. Please note that the smaller the request size, the higher the proportion of NVMeInfo and DMA. [DRAM-like NVM with V2x16] Our interest: scalable throughput as Q count increases? Q (thread) count : 1~ (Max value in NVMe specifications) Request sizes: Block NVM (4~32KB) & DRAM-like NVM (8~1024B) Block NVM: Q count can increase until PCIe bandwidth runs out DRAM-like NVM: Throughput saturates at a small queue count
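The same bandwidth argument explains the queue-count trend: every request ships its payload plus roughly 104B of handshake TLPs (with the assumed sizes from earlier), so small DRAM-like requests leave only a minority of the bus bytes for data and saturate at a small queue count. A rough bus-bound ceiling, as an illustration:

```python
def max_payload_throughput(req_bytes: int, version: int, lanes: int) -> float:
    """Bus-bound ceiling on useful bytes/s once the PCIe bus is the bottleneck."""
    overhead = sum(PACKET_BYTES.values())      # ~104B of handshake TLPs per request (assumed sizes)
    share = req_bytes / (req_bytes + overhead) # 64B requests: under 40% of bus bytes are payload
    return pcie_bandwidth(version, lanes) * share
```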
26
Conclusions To better utilize NVMs, PCIe and NVMe are getting attention as the physical interface and logical interface, respectively Due to the lack of studies in the literature, NVMe characteristics and design considerations are still veiled To uncover the true NVMe characteristics, we proposed an analytical model based on the PCIe/NVMe specifications Key observations (1) NVMe communication overhead in DRAM-like NVMs is not light-weight (2) Frequent ISR invocations in DRAM-like NVMs generate a big burden (3) Performance does not scale with increasing Q depth/count (4) Latencies get hurt if Q depth/count goes beyond the saturation point Let me summarize this talk. To better utilize emerging NVMs, PCIe and NVMe are getting attention as the physical interface and the logical interface, respectively. However, the real NVMe characteristics and a lot of design considerations are still veiled. Hence, we proposed an NVMe model based on the PCIe and NVMe specifications. Using our analytical model, we explore the design space and obtain the following observations. ~
27
Thanks a lot for your interest
Any question or comment is very welcome.
28
Backup Slides
29
Queue Count Analysis: Saturation Point
Our interest: Q count limit? PCIe bus performance: V2x1 ~ V4x32 Saturation point: Block NVM (256) & DRAM-like NVM (16) Unlike the expectation that more threads would be allowed and beneficial for the DRAM-like NVM, at most 16 threads perform best in the majority of cases