Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split. Wook-Hee Kim, Beomseok Nam, Dongil Park, Youjip Won.

Presentation transcript:

Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split. Wook-Hee Kim and Beomseok Nam (Ulsan National Institute of Science and Technology, Korea), Dongil Park and Youjip Won (Hanyang University, Korea). LG Electronics Seminar, Mar 18, 2014. Paper presented at USENIX FAST '14, Santa Clara, CA.

Outline: Motivation: the Journaling of Journal anomaly. How to resolve the Journaling of Journal anomaly: Multi-Version B-Tree (MVBT). Optimizations of the Multi-Version B-Tree: Lazy Split, Reserved Buffer Space, Lazy Garbage Collection, Metadata Embedding, Disabling Sibling Redistribution. Evaluation. Demo.

Android is the most popular mobile platform.

Storage in the Android platform is the performance bottleneck and a major cause of devices breaking down; the cause is excessive I/O.

Android I/O stack: applications issue Insert/Update/Delete/Select statements to SQLite, SQLite issues read/write calls to EXT4, and EXT4 issues read/write requests to the block device driver; the interaction between SQLite and EXT4 is misaligned.

SQLite insert in Truncate journal mode: (1) open the rollback journal, (2) record the data to the journal, (3) put the commit mark in the journal, (4) insert the entry into the database, (5) truncate the journal. (The write/fsync sequence is sketched below.)
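
The sequence above can be pictured as plain POSIX calls. A minimal sketch, assuming a single-page journal, 4 KB pages, and illustrative file handling rather than SQLite's actual code; the placement of the fsync() calls is what makes each step durable, and each of those fsync() calls is in turn journaled by EXT4:

    #include <fcntl.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    /* Illustrative only: a single-page database and a single-page journal. */
    int truncate_mode_insert(const char *db_path, const char *journal_path,
                             const char *old_page, const char *new_page)
    {
        int jn = open(journal_path, O_CREAT | O_RDWR, 0644);  /* (1) open rollback journal */
        int db = open(db_path, O_RDWR);
        if (jn < 0 || db < 0)
            return -1;

        char commit_mark = 1;
        write(jn, old_page, PAGE_SIZE);   /* (2) record the original page in the journal */
        fsync(jn);                        /* durable before the database may be touched  */
        write(jn, &commit_mark, 1);       /* (3) commit mark in the journal              */
        fsync(jn);

        write(db, new_page, PAGE_SIZE);   /* (4) insert: overwrite the database page     */
        fsync(db);

        ftruncate(jn, 0);                 /* (5) truncate the journal                    */
        fsync(jn);

        close(jn);
        close(db);
        return 0;
    }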

EXT4 journaling in ordered mode: each write()/fsync() issued by SQLite causes EXT4 to write the data block, then write the journal metadata to the EXT4 journal, and finally write the journal commit block.

Journaling of Journal anomaly: SQLite journals its own changes, and every fsync() SQLite issues is in turn journaled by EXT4 (data block, journal metadata, journal commit). As a result, one insert of 100 bytes turns into 9 random writes of 4 KB, roughly 36 KB of traffic for 100 bytes of payload, a write amplification of more than 350x.

How can Journaling of Journal be resolved? Each SQLite insert hits EXT4 twice: once for database journaling and once for the database insertion itself.

How can Journaling of Journal be resolved? Reduce the number of fsync() calls by merging journaling and the database into one structure: journaling + database = version-based B-Tree (MVBT), so a committed insert needs only a single fsync().
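
A minimal sketch of the intended commit path, assuming a hypothetical mvbt_apply() helper that appends the new version entries inside the database file itself; since there is no separate journal file to synchronize, one insert costs one fsync():

    #include <stddef.h>
    #include <unistd.h>

    /* Hypothetical helper (not an SQLite function): appends the new version
     * entries for one insert inside the database file; no journal file exists. */
    int mvbt_apply(int db_fd, long key, const void *value, size_t len);

    /* With multi-version entries there is nothing to roll back from a separate
     * journal, so one committed insert costs a single fsync() of the DB file. */
    int mvbt_commit_insert(int db_fd, long key, const void *value, size_t len)
    {
        if (mvbt_apply(db_fd, key, value, len) != 0)
            return -1;
        return fsync(db_fd);              /* one fsync() per insert, not several */
    }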

Version-based B-Tree (MVBT): every entry carries a version range. Inserting key 10 in transaction T1 creates the entry 10 [T1, ∞); updating 10 to 20 in transaction T2 closes the old entry to 10 [T1, T2) and adds 20 [T2, ∞). Old version data is never overwritten, so no rollback journal is needed (see the sketch below).
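
A sketch of MVBT-style versioned entries following the T1/T2 example above; the fixed-size node, integer versions, and linear scan are simplifications for illustration, not the paper's data layout:

    #include <stdio.h>

    #define LIVE 0x7fffffff               /* stands in for an open end version (∞) */

    struct entry { int key, value, start, end; };
    struct node  { struct entry e[16]; int n; };

    /* An update never overwrites: it closes the live version and appends a new one. */
    void mvbt_update(struct node *nd, int key, int new_value, int version)
    {
        for (int i = 0; i < nd->n; i++) {
            if (nd->e[i].key == key && nd->e[i].end == LIVE) {
                nd->e[i].end = version;                       /* 10 [T1, T2) */
                break;
            }
        }
        nd->e[nd->n++] = (struct entry){ key, new_value, version, LIVE };  /* 20 [T2, ∞) */
    }

    int main(void)
    {
        struct node nd = { .n = 0 };
        nd.e[nd.n++] = (struct entry){ 10, 10, 1, LIVE };     /* insert key 10 at T1 = 1   */
        mvbt_update(&nd, 10, 20, 2);                          /* update it to 20 at T2 = 2 */
        for (int i = 0; i < nd.n; i++)
            printf("key %d = %d, versions [%d, %d)\n",
                   nd.e[i].key, nd.e[i].value, nd.e[i].start, nd.e[i].end);
        return 0;
    }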

Node split in a Multi-Version B-Tree. Running example: Page 1 holds 5 [3, ∞), 10 [2, ∞), 12 [4, ∞), 40 [1, ∞), and key 25 is inserted at version 5, overflowing the page. The legacy MVBT split proceeds as follows: (1) every live entry in Page 1 is closed at version 5 (5 [3, 5), 10 [2, 5), 12 [4, 5), 40 [1, 5)), turning the whole page into a dead node; (2) the live versions are copied into two fresh pages, Page 3 (5 [5, ∞), 10 [5, ∞)) and Page 2 (12 [5, ∞), 40 [5, ∞)); (3) the new key 25 [5, ∞) is placed in Page 2; (4) the parent Page 4 gains the routing entries ∞ [0, 5) -> Page 1, 10 [5, ∞) -> Page 3, and ∞ [5, ∞) -> Page 2. The version split therefore dirties one more page than an ordinary B-tree split.
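
A sketch of the legacy version split just described, reusing the illustrative struct entry/struct node types and the LIVE constant from the earlier sketch; page allocation and parent routing are reduced to placeholders, and only the dirty-page accounting matters:

    /* Sketch of the legacy MVBT version split. Keys are assumed to be kept in
     * sorted order; parent bookkeeping is reduced to a counter. */
    static int is_live(const struct entry *e) { return e->end == LIVE; }

    int mvbt_legacy_split(struct node *old, struct node *left, struct node *right,
                          struct node *parent, struct entry newe, int version)
    {
        int live = 0, moved = 0;

        /* (1) Close every live entry: the old page becomes a dead node. */
        for (int i = 0; i < old->n; i++)
            if (is_live(&old->e[i])) { old->e[i].end = version; live++; }

        /* (2) Re-create the live versions, split across two fresh pages. */
        left->n = right->n = 0;
        for (int i = 0; i < old->n; i++) {
            if (old->e[i].end != version) continue;   /* skip entries dead before the split */
            struct entry copy = old->e[i];
            copy.start = version;
            copy.end   = LIVE;
            if (moved < live / 2) left->e[left->n++]  = copy;
            else                  right->e[right->n++] = copy;
            moved++;
        }

        /* (3) In the slide example the new key (25 at version 5) lands in the
         * right page, and (4) the parent gains routing entries for all three. */
        right->e[right->n++] = newe;
        parent->n += 2;                               /* placeholder for new routing entries */

        return 4;                                     /* dead node, two new pages, parent    */
    }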

I/O traffic in MVBT: fsync() flushes every dirty page in the DB buffer cache, so the way to cut I/O traffic is to reduce the number of dirty pages.

I/O traffic in MVBT vs. LS-MVBT: for the same workload, LS-MVBT leaves fewer dirty pages in the DB buffer cache, so each fsync() writes less.

Optimizations target Android I/O characteristics: a write is much more expensive than a read (t_write >> t_read), and each write transaction typically performs a single insert rather than multiple inserts.

Optimizations in LS-MVBT: the Lazy Split Multi-Version B-Tree (LS-MVBT) combines Lazy Split, Reserved Buffer Space, Lazy Garbage Collection, Metadata Embedding, and Disabling Sibling Redistribution.

Optimization 1: Lazy Split. Recap of the legacy split in MVBT: the dead node Page 1 (5 [3, 5), 10 [2, 5), 12 [4, 5), 40 [1, 5)), the two new pages Page 3 (5 [5, ∞), 10 [5, ∞)) and Page 2 (12 [5, ∞), 40 [5, ∞), 25 [5, ∞)), and the updated parent Page 4 add up to 4 dirty pages.

Optimization 1: Lazy Split. For the same insert (key 25 at version 5), the lazy split does not kill Page 1. Only the entries that move out are closed, 12 [4, 5) and 40 [1, 5), while 5 [3, ∞) and 10 [2, ∞) stay live; Page 1 becomes a lazy node, half-dead and half-live. The moved entries are reborn in a single new page, Page 2 (12 [5, ∞), 40 [5, ∞)), which also receives the new key 25 [5, ∞). The parent Page 3 holds the routing entries ∞ [0, 5) -> Page 1, 10 [5, ∞) -> Page 1, and ∞ [5, ∞) -> Page 2. Only three pages are dirtied instead of four (a sketch follows below).
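
The lazy split, sketched with the same illustrative types as before; only the upper half of the live keys is closed in the old page and reborn in one new page, so one fewer page is dirtied:

    /* Sketch of the lazy split. The old page survives as a half-dead,
     * half-live lazy node instead of becoming a dead node. */
    int mvbt_lazy_split(struct node *old, struct node *right,
                        struct node *parent, struct entry newe, int version)
    {
        int live = 0, kept = 0;
        for (int i = 0; i < old->n; i++)
            if (old->e[i].end == LIVE) live++;

        right->n = 0;
        for (int i = 0; i < old->n; i++) {
            if (old->e[i].end != LIVE) continue;
            if (kept < live / 2) { kept++; continue; }  /* lower half stays live in the old page */
            struct entry copy = old->e[i];
            copy.start = version;
            copy.end   = LIVE;
            right->e[right->n++] = copy;                /* 12 [5, ∞), 40 [5, ∞) reborn            */
            old->e[i].end = version;                    /* 12 [4, 5), 40 [1, 5) now dead          */
        }

        right->e[right->n++] = newe;                    /* 25 [5, ∞)                              */
        parent->n += 1;                                 /* one new routing entry                  */

        return 3;                                       /* lazy node, one new page, parent        */
    }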

Optimization 1: Lazy Split, continued. Lazy node overflow: inserting key 8 at version 6 finds the lazy node Page 1 full, because the dead entries 12 [4, 5) and 40 [1, 5) still occupy slots. The dead entries are garbage collected in place, leaving Page 1 with 5 [3, ∞), 8 [6, ∞), 10 [2, ∞); Page 2 and the parent Page 3 are untouched. But what if the dead entries are still being accessed by other transactions?

Optimization 2: Reserved Buffer Space. When the dead entries of a full lazy node may still be visible to concurrent read transactions, there are three options. Option 1: wait for the read transactions to finish before reclaiming the dead entries, which stalls the writer. Option 2: split as in the legacy MVBT: Page 1 is closed entirely at version 6, its live keys are reborn together with the new key in a new Page 4 (5 [6, ∞), 8 [6, ∞), 10 [6, ∞)), and the parent Page 3 is updated with routing entries for the dead Page 1 (10 [5, 6)) and the new Page 4 (10 [6, ∞)), dirtying extra pages. Option 3 (the LS-MVBT approach): pad some buffer space in every tree node; when a lazy node is full but its dead entries cannot yet be reclaimed, the new entry (8 [6, ∞)) is stored in the reserved buffer space, and only if the buffer space is also full does LS-MVBT fall back to the legacy MVBT split. A decision sketch follows below.
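
A sketch of the resulting overflow-handling policy, again using the illustrative node type; reader_may_see(), gc_dead_entries(), and legacy_split_fallback() are hypothetical helpers, and the RESERVED padding size is an assumption:

    /* Hypothetical helpers (not SQLite or paper code); same node type as before. */
    int  reader_may_see(const struct node *nd, int oldest_reader_version);
    void gc_dead_entries(struct node *nd);
    int  legacy_split_fallback(struct node *nd, struct entry newe);

    #define NODE_CAP  16   /* matches the illustrative node size above               */
    #define RESERVED   2   /* slots padded as reserved buffer space (an assumption)  */

    /* Returns the number of pages dirtied by inserting into a lazy node. */
    int lazy_node_insert(struct node *nd, struct entry newe, int oldest_reader_version)
    {
        if (nd->n < NODE_CAP - RESERVED) {              /* common case: room left          */
            nd->e[nd->n++] = newe;
            return 1;
        }
        if (!reader_may_see(nd, oldest_reader_version)) {
            gc_dead_entries(nd);                        /* reclaim dead entries in place   */
            nd->e[nd->n++] = newe;
            return 1;
        }
        if (nd->n < NODE_CAP) {                         /* Option 3: dip into the reserve  */
            nd->e[nd->n++] = newe;
            return 1;
        }
        return legacy_split_fallback(nd, newe);         /* last resort: legacy MVBT split  */
    }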

Rollback in LS-MVBT is similar to rollback in MVBT. If transaction 5 crashes, the entries it created are discarded and the entries it closed are reopened: Page 2 (12 [5, ∞), 40 [5, ∞), 25 [5, ∞)) and the routing entries added at version 5 are dropped from the parent Page 3, and Page 1 reverts to 5 [3, ∞), 10 [2, ∞), 12 [4, ∞), 40 [1, ∞), with the parent holding only ∞ [0, ∞) -> Page 1. The number of dirty nodes touched by a rollback in LS-MVBT is also smaller than in MVBT.
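
A sketch of rolling back a crashed transaction in one node, using the same illustrative types; entries created by the crashed transaction are dropped and entries it closed are reopened (fixing the routing entries in parent pages is omitted):

    void mvbt_rollback_node(struct node *nd, int crashed_version)
    {
        int w = 0;
        for (int i = 0; i < nd->n; i++) {
            if (nd->e[i].start == crashed_version)
                continue;                               /* drop, e.g., 25 [5, ∞)   */
            if (nd->e[i].end == crashed_version)
                nd->e[i].end = LIVE;                    /* reopen, e.g., 12 [4, 5) */
            nd->e[w++] = nd->e[i];
        }
        nd->n = w;
    }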

Optimization 3: Lazy Garbage Collection. With periodic garbage collection, a separate GC pass walks the pages (P1, P2, P3, ...) in the buffer cache, stops the running transaction, and produces extra dirty pages that the next fsync() must flush. With lazy garbage collection, dead entries are not collected while no space is needed: reclamation happens only inside a page that an insert is about to dirty anyway, so garbage collection adds no extra dirty pages.
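
One possible body for the gc_dead_entries() helper assumed earlier, illustrating why lazy garbage collection adds no dirty pages: it compacts a single node in place and is called only from an insert that is about to dirty that node anyway, never from a background sweep:

    void gc_dead_entries(struct node *nd)
    {
        int w = 0;
        for (int i = 0; i < nd->n; i++)
            if (nd->e[i].end == LIVE)                   /* keep only the live versions */
                nd->e[w++] = nd->e[i];
        nd->n = w;
    }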

Optimization 4: Metadata Embedding. The version number is the "file change counter" in the SQLite DB header page: when a write transaction commits version 6, it must bump the counter (5 -> 6) so that read transactions see the new version, which dirties the header page in addition to the modified data page, i.e., 2 dirty pages per commit. LS-MVBT instead flushes the file change counter into the last modified page and keeps an extra copy on a RAMDISK so the current counter is cheap to find: only 1 page is dirtied per commit. The RAMDISK copy is volatile, so a crash loses only that copy, not the counter embedded in the database pages.
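
A sketch of metadata embedding at commit time; the field name, RAMDISK path, and page layout are illustrative assumptions, not SQLite's real header format:

    #include <stdio.h>

    /* Illustrative page layout; the real SQLite header and page formats differ. */
    struct leaf_page {
        int file_change_counter;      /* embedded copy of the DB-header field */
        /* ... keys, values, version ranges ... */
    };

    void commit_with_metadata_embedding(struct leaf_page *last_modified, int new_counter)
    {
        /* The counter rides along in the page the insert dirtied anyway, so the
         * header page stays clean: 1 dirty page per commit instead of 2. */
        last_modified->file_change_counter = new_counter;

        /* Volatile copy on a RAMDISK so the current counter is cheap to find;
         * it is lost on a crash, so correctness never depends on it. */
        FILE *ram = fopen("/mnt/ramdisk/file_change_counter", "w");
        if (ram) {
            fprintf(ram, "%d\n", new_counter);
            fclose(ram);
        }
    }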

Optimization 5: Disabling Sibling Redistribution. Sibling redistribution hurts insertion performance: in the slide example, inserting key 11 with redistribution enabled moves entries into a sibling page and updates the routing entries in the parent, dirtying 4 pages. With sibling redistribution disabled, the overflowing node is handled without ever touching its sibling, and only 3 pages are dirtied.
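
A sketch of the overflow path with and without sibling redistribution, using hypothetical helpers; only the dirty-page counts from the slide example matter here:

    /* Hypothetical helpers; the bodies are irrelevant to the accounting. */
    int  sibling_has_room(const struct node *nd);
    void redistribute_with_sibling(struct node *nd, struct entry newe);
    void split_without_sibling(struct node *nd, struct entry newe);

    /* Returns the number of pages dirtied by handling an overflow. */
    int handle_overflow(struct node *nd, struct entry newe, int allow_redistribution)
    {
        if (allow_redistribution && sibling_has_room(nd)) {
            redistribute_with_sibling(nd, newe);        /* node + sibling + parent updates    */
            return 4;                                   /* 4 dirty pages in the slide example */
        }
        split_without_sibling(nd, newe);                /* node + new node + parent           */
        return 3;                                       /* LS-MVBT always takes this path     */
    }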

Summary of the optimizations in LS-MVBT: Lazy Split avoids dirtying an extra page when a split occurs; Reserved Buffer Space reduces the probability of a node split; Lazy Garbage Collection deletes dead entries only on the next mutation of the page; Metadata Embedding avoids dirtying the header page; Disabling Sibling Redistribution never touches sibling pages, keeping searches fast.

Performance evaluation: LS-MVBT was implemented in SQLite. Testbed: Samsung Galaxy S4 (Android JellyBean 4.3), Exynos 5 Octa 5410 quad-core 1.6 GHz, 2 GB memory, 16 GB eMMC. Performance analysis tools: Mobibench and MOST.

Analysis of insert performance (SQLite insert response time, decomposed into B-tree computation and fsync(), for WAL, MVBT, and LS-MVBT across WAL checkpointing intervals): LS-MVBT is 30-78% faster than WAL mode.

Analysis of insert performance (number of dirty pages per SQLite insert): LS-MVBT flushes only one dirty page.

Analysis of block I/O behavior (number of accesses to the block device per SQLite insert workload): 10 insertions cost 20 block accesses with LS-MVBT versus 34 with WAL; 100 insertions cost 148 block accesses with LS-MVBT versus 218 with WAL.

I/O traffic at the block device level (1000 insertions, I/O traffic sampled every 10 ms): LS-MVBT saves 67% of the I/O traffic, 9.9 MB (LS-MVBT) vs. 31 MB (WAL), which translates into roughly 3x longer NAND flash lifetime.

What about search performance? LS-MVBT is optimized for write performance.

Search performance (transaction throughput with varying search/insert ratio): LS-MVBT wins unless more than 93% of the transactions are searches.

What about recovery latency? WAL mode exhibits longer recovery latency due to log replay. How does LS-MVBT compare?

Recovery (recovery time with a varying number of insert operations to roll back): LS-MVBT recovers 5-6x faster than WAL.

How much performance improvement does each optimization contribute?

Effect of disabling sibling redistribution (SQLite insert response time and number of dirty pages): disabling sibling redistribution yields 20% faster insertion.

Performance quantification (SQLite throughput with all optimizations combined): the optimizations contribute improvements of 40%, 51%, and 70% as they are added. Overall, LS-MVBT reaches 704 insertions/sec, a 1.7x improvement over WAL mode (417 insertions/sec) and a 13.2x improvement (1220%) over Truncate mode (53 insertions/sec).

Query response time: MVBT's response time is 80% of WAL mode's, and LS-MVBT's is 58% of WAL mode's.

Conclusion: LS-MVBT resolves the Journaling of Journal anomaly by (1) reducing the number of fsync() calls with a version-based B-tree, and (2) reducing the overhead of each fsync() through lazy split, disabling sibling redistribution, metadata embedding, reserved buffer space, and lazy garbage collection. LS-MVBT achieves a 70% performance gain over WAL mode (417 insertions/sec -> 704 insertions/sec) solely through software optimization.

Thank you.