Boost Write Performance for DBMS on Solid State Drive
Yu LI
Backgrounds (1)
An SSD is a complex storage device:
  flash chips (i.e., NAND)
  controller hardware
  proprietary software (i.e., firmware)
  block device interface via a standard interconnect (e.g., USB, IDE, SATA)
In general, sequential reads, sequential writes, and random reads are fast; random writes are slow.
Backgrounds (2)
Some DBMS applications tend to generate random write streams:
  Online Transaction Processing (OLTP)
  Small and frequent inserts/deletes/updates
  Concurrency
In-Page Logging Approach [Lee, SIGMOD 07]
Idea: turn random writes into log appends.
However, the in-page logging area needs hardware support, so the approach is not practical for SSDs.
Backgrounds (3)
Question: is there any solution that improves write performance without modifying the SSD firmware?
Systematic performance studies show that not all kinds of “random write” on SSD are slow. Write performance depends more on the write pattern. [uFlip, CIDR 2009]
uFlip results
Focused write: e.g., writes inside a <8MB file
Partitioned sequential write: e.g., 1,50,2,51,3,52,…
Ordered sequential write: e.g., 1,3,5,7,9,…
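To make the three patterns concrete, the following minimal sketch generates page-address sequences of each kind. The function names and parameters are hypothetical and chosen only for illustration; the partition starts (1 and 50) and the stride (2) are read off the slide's examples.

```python
import random
from typing import List


def focused_writes(base: int, region_pages: int, count: int, seed: int = 0) -> List[int]:
    """Random page writes confined to one small region (e.g., a file under 8MB)."""
    rng = random.Random(seed)
    return [base + rng.randrange(region_pages) for _ in range(count)]


def partitioned_sequential_writes(starts: List[int], count: int) -> List[int]:
    """Round-robin over several partitions, each of which is written sequentially."""
    n = len(starts)
    return [starts[i % n] + i // n for i in range(count)]


def ordered_sequential_writes(start: int, stride: int, count: int) -> List[int]:
    """Increasing addresses with a constant gap between consecutive writes."""
    return [start + i * stride for i in range(count)]


partitioned = partitioned_sequential_writes([1, 50], 6)  # the slide's 1,50,2,51,3,52
ordered = ordered_sequential_writes(1, 2, 5)             # the slide's 1,3,5,7,9
focused = focused_writes(base=0, region_pages=2047, count=8)
```

With 4KB pages, 2047 pages stay just under 8MB, which is why that region size is used for the focused example.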
Our Idea (1)
Write Stream Decomposition
If we can collect enough write requests:
  Isolate the write requests that form good write patterns
  Cluster write requests to form instances of focused write
Our Idea (2)
[Figure: StableBuffer, Decomposition, and SSD, with steps labeled 1, 2, 3]
Through StableBuffer:
  Two writes (1, 3) in a good write pattern (1x~4x each)
  One random read (2) (at most 1x)
  => Total at most 9x (2 × 4x + 1x)
Directly:
  => 17x~30x
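The cost comparison on this slide can be checked with a few lines of arithmetic. This is only a sketch in the slide's own relative units ("x"); the function name is hypothetical, and the per-operation costs are the ranges quoted on the slide, not new measurements.

```python
def stablebuffer_path_cost(write_cost: float, read_cost: float = 1.0) -> float:
    """Cost of going through the StableBuffer: two good-pattern writes plus one read."""
    return 2 * write_cost + read_cost


best_case = stablebuffer_path_cost(write_cost=1.0)   # each write costs 1x
worst_case = stablebuffer_path_cost(write_cost=4.0)  # each write costs 4x

# Range the slide quotes for writing directly with a random write.
direct_low, direct_high = 17.0, 30.0

# Even the worst StableBuffer case stays below the cheapest direct random write.
stablebuffer_always_cheaper = worst_case < direct_low
```

The worst case reproduces the slide's total of 9x, which is still below the 17x lower bound quoted for a direct random write.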
System Overview
[Figure: DBMS Transactions, DBMS Buffer Manager, StableBuffer Translation Table, and Write Stream Decompositors in Main Memory; StableBuffer on SSD; arrows labeled Write and Read]
Components of StableBuffer Manager
StableBuffer: a pre-allocated focused area on the SSD, e.g., a pre-allocated file < 8MB.
StableBuffer Translation Table: a table of entries, each mapping a destination address D to a slot offset O in the StableBuffer; supports fast lookup, insert, and delete.
Write Stream Decompositors: a group of programs running in concurrent threads that decompose the write stream into instances of good write patterns.
More on StableBuffer Translation Table
A reverse index for the StableBuffer Translation Table is embedded in the pages: each page stores its destination and a timestamp.
It is used for recovery in case of a system crash.
During recovery, for the page at offset O whose destination is D, compare its timestamp T to the latest update time T0 of the page at destination D.
  If T > T0, insert the entry into the table.
  Otherwise, the slot O is free.
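The recovery rule above can be sketched as a single scan over the StableBuffer slots. This is a minimal illustration rather than the prototype's code: `SlotHeader`, `rebuild_translation_table`, and the in-memory `last_update` map are hypothetical stand-ins, and the sketch assumes that a destination with no recorded update time has T0 = 0 and that at most one slot per destination is newer than T0.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SlotHeader:
    """Reverse-index fields embedded in a StableBuffer page."""
    destination: int  # D: where the page finally belongs on the SSD
    timestamp: int    # T: when the page was written into the StableBuffer


def rebuild_translation_table(
    slots: List[Optional[SlotHeader]],
    last_update: Dict[int, int],
) -> Tuple[Dict[int, int], List[int]]:
    """Scan every slot and return (translation table D -> O, list of free slots)."""
    table: Dict[int, int] = {}
    free_slots: List[int] = []
    for offset, header in enumerate(slots):
        if header is None:  # slot never used
            free_slots.append(offset)
            continue
        t0 = last_update.get(header.destination, 0)  # assumed default T0
        if header.timestamp > t0:
            table[header.destination] = offset  # buffered copy is newer: keep it
        else:
            free_slots.append(offset)           # destination is already up to date
    return table, free_slots


slots = [SlotHeader(10, 5), SlotHeader(20, 3), None]
table, free_slots = rebuild_translation_table(slots, last_update={10: 4, 20: 3})
```

In this example slot 0 is kept because its timestamp (5) is newer than the destination's last update (4), while slot 1 is freed because its timestamp is not newer.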
Query on StableBuffer
When a request arrives to retrieve the page at D, we check whether the StableBuffer Translation Table has an entry for D.
  If there is one, with slot offset O, return the page at the O-th slot in the StableBuffer.
  Otherwise, issue a read request to the SSD for the page at D.
So it is better to implement the StableBuffer Translation Table as a hash table on D.
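The lookup path can be sketched with a Python dict playing the role of the hash table on D. The names `read_page`, `stable_buffer`, and `ssd` are hypothetical, and the two storage areas are modeled as in-memory containers purely for illustration.

```python
from typing import Dict, List


def read_page(
    d: int,
    translation_table: Dict[int, int],
    stable_buffer: List[bytes],
    ssd: Dict[int, bytes],
) -> bytes:
    """Return the current content of the page whose destination address is d."""
    o = translation_table.get(d)  # expected O(1) lookup in a hash table keyed on D
    if o is not None:
        return stable_buffer[o]   # newest copy still sits in the StableBuffer
    return ssd[d]                 # no entry: read the page at its destination


translation_table = {7: 0}
stable_buffer = [b"new-7"]
ssd = {7: b"old-7", 8: b"page-8"}

from_buffer = read_page(7, translation_table, stable_buffer, ssd)
from_ssd = read_page(8, translation_table, stable_buffer, ssd)
```

Keying the table on D matters because every read must consult it, so the lookup sits on the critical path of all queries.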
Decompositors (1)
[Figure: the decompositors decompose the entries of the StableBuffer Translation Table into write streams. Sequential Write Decompositor: Sequential Write Stream. Partitioned Sequential Write Decompositor: Partitioned Sequential Write Stream. Ordered Sequential Write Decompositor: Ordered Sequential Write Stream. Focused Write Decompositor: Focused Write Stream. Two arrows are labeled "Share index".]
Decompositors (2)
Decompositors run in concurrent threads. Their results can share the same entries of the StableBuffer Translation Table.
Selecting among the results of the decompositors:
  Select the instance of the write pattern that performs better on SSD.
  Select the bigger instance.
E.g., 1,2,56,57,6,7,42,43,3,4,...
We select the results according to
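One possible reading of the two selection rules is sketched below. It assumes that the better-performing pattern takes priority and that instance size only breaks ties; the slide does not state this ordering. The function name `select_instance`, the caller-supplied `cost_per_page` table, and the cost values are hypothetical and are not measurements.

```python
from typing import Dict, List, Tuple

Instance = Tuple[str, List[int]]  # (pattern name, destination pages)


def select_instance(candidates: List[Instance], cost_per_page: Dict[str, float]) -> Instance:
    """Prefer the cheaper write pattern; among equal patterns prefer the bigger instance."""
    return min(candidates, key=lambda c: (cost_per_page[c[0]], -len(c[1])))


# Hypothetical relative costs, for illustration only.
costs = {"sequential": 1.0, "ordered": 2.0}

# Candidate instances drawn from the slide's example 1,2,56,57,6,7,42,43,3,4.
candidates = [
    ("sequential", [56, 57]),
    ("sequential", [1, 2, 3, 4]),
    ("ordered", [1, 2, 3, 4, 6, 7]),
]
chosen = select_instance(candidates, costs)
```

Under these assumed costs the four-page sequential instance wins: it uses the cheaper pattern and is the larger of the two sequential candidates.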
Decompositors (3)
Sequential Write Decompositor: maintains a search tree index on the destination addresses of the mapping entries.
Partitioned Write Decompositor: shares the search tree index of the Sequential Write Decompositor.
Ordered Write Decompositor: shares the search tree index of the Sequential Write Decompositor.
Focused Write Decompositor: maintains a hash index of the entries of the StableBuffer Translation Table. Each entry is hashed into a bucket.
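As a rough sketch of the Sequential Write Decompositor, the class below keeps destination addresses in sorted order and extracts runs of consecutive addresses. A sorted Python list maintained with `bisect` stands in for the search tree here, which is a simplification; the class and method names are hypothetical.

```python
import bisect
from typing import List


class SequentialWriteDecompositor:
    """Finds runs of consecutive destination addresses among buffered pages."""

    def __init__(self) -> None:
        self._sorted_destinations: List[int] = []  # stand-in for a search tree

    def add(self, destination: int) -> None:
        """Index a destination address, ignoring duplicates."""
        i = bisect.bisect_left(self._sorted_destinations, destination)
        already_present = (
            i < len(self._sorted_destinations)
            and self._sorted_destinations[i] == destination
        )
        if not already_present:
            self._sorted_destinations.insert(i, destination)

    def runs(self) -> List[List[int]]:
        """Return maximal runs of consecutive addresses, in ascending order."""
        result: List[List[int]] = []
        for d in self._sorted_destinations:
            if result and result[-1][-1] + 1 == d:
                result[-1].append(d)
            else:
                result.append([d])
        return result

    def longest_run(self) -> List[int]:
        """Return the biggest sequential instance, or an empty list if none."""
        return max(self.runs(), key=len, default=[])


decompositor = SequentialWriteDecompositor()
for page in [1, 2, 56, 57, 6, 7, 42, 43, 3, 4]:
    decompositor.add(page)
sequential_runs = decompositor.runs()
biggest = decompositor.longest_run()
```

Because the index is ordered by destination, partitioned and ordered decompositors could walk the same structure, which matches the slide's remark that they share the Sequential Write Decompositor's index.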
Preliminary Result of Evaluation
Prototype of the StableBuffer manager:
  Accepts a write trace file
  Runs on a Windows desktop PC with a 16GB MTron MSD-SATA-3525 SSD
  Page size is 4KB
  StableBuffer is 8MB = 2048 pages
Trace:
  Oracle 11g running the TPC-C benchmark
  TPC-C simulates an enterprise OLTP retailing system, which keeps inserting, deleting, and updating records
  Write requests from an 8GB database
Preliminary Result of Evaluation
[Figure: evaluation result, annotated 1.5x]
Q & A Thanks