LevelDB Introduction Fenggang Wu Oct. 17th, 2016.

Slides:



Advertisements
Similar presentations
The Architecture of Oracle
Advertisements

Tomcy Thankachan  Introduction  Data model  Building Blocks  Implementation  Refinements  Performance Evaluation  Real applications  Conclusion.
Chapter 15 Algorithms for Query Processing and Optimization Copyright © 2004 Pearson Education, Inc.
The Linux Kernel: Memory Management
1 - Oracle Server Architecture Overview
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
1.1 CAS CS 460/660 Introduction to Database Systems File Organization Slides from UC Berkeley.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Distributed storage for structured data
School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW Indexing Large Data COMP
Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E.
Chapter Oracle Server An Oracle Server consists of an Oracle database (stored data, control and log files.) The Server will support SQL to define.
Silberschatz, Galvin and Gagne ©2013 Operating System Concepts – 9 th Edition Chapter 8: Main Memory.
Some key-value stores using log-structure Zhichao Liang LevelDB Riak.
Bigtable: A Distributed Storage System for Structured Data Google’s NoSQL Solution 2013/4/1Title1 Chao Wang Fay Chang, Jeffrey Dean, Sanjay.
Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,
1 MSRBot Web Crawler Dennis Fetterly Microsoft Research Silicon Valley Lab © Microsoft Corporation.
Jeff's Filesystem Papers Review Part I. Review of "Design and Implementation of The Second Extended Filesystem"
Chapter 15 A External Methods. © 2004 Pearson Addison-Wesley. All rights reserved 15 A-2 A Look At External Storage External storage –Exists beyond the.
It consists of two parts: collection of files – stores related data directory structure – organizes & provides information Some file systems may have.
Bigtable: A Distributed Storage System for Structured Data
CHAPTER 9 File Storage Shared Preferences SQLite.
TerarkDB Introduction Peng Lei Terark Inc ( ). All rights reserved 1.
Bigtable A Distributed Storage System for Structured Data.
The very Essentials of Disk and Buffer Management.
CS222: Principles of Data Management Lecture #4 Catalogs, Buffer Manager, File Organizations Instructor: Chen Li.
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
Bigtable A Distributed Storage System for Structured Data
COMP261 Lecture 23 B Trees.
Module 11: File Structure
CS 540 Database Management Systems
Indexing Goals: Store large files Support multiple search keys
Lecture 16: Data Storage Wednesday, November 6, 2006.
Tree-Structured Indexes
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
CSE-291 Cloud Computing, Fall 2016 Kesden
CSE-291 (Cloud Computing) Fall 2016
CS 3305 System Calls Lecture 7.
Hashing Exercises.
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung Google Presented by Jiamin Huang EECS 582 – W16.
Chapter 15 QUERY EXECUTION.
Database Applications (15-415) DBMS Internals- Part V Lecture 17, March 20, 2018 Mohammad Hammoud.
HashKV: Enabling Efficient Updates in KV Storage via Hashing
Physical Database Design
Module 11: Data Storage Structure
KISS-Tree: Smart Latch-Free In-Memory Indexing on Modern Architectures
CS222/CS122C: Principles of Data Management Notes #07 B+ Trees
Tree-Structured Indexes
Introduction to Database Systems
CS179G, Project In Computer Science
B+Trees The slides for this text are organized into chapters. This lecture covers Chapter 9. Chapter 1: Introduction to Database Systems Chapter 2: The.
CS222/CS122C: Principles of Data Management Lecture #4 Catalogs, File Organizations Instructor: Chen Li.
Tree-Structured Indexes
Chapter 13: Data Storage Structures
2018, Spring Pusan National University Ki-Joune Li
A Small and Fast IP Forwarding Table Using Hashing
Copyright © Curt Hill Page Management In memory and on disk Copyright © Curt Hill.
File Organization.
Outline Introduction LSM-tree and LevelDB Architecture WiscKey.
CENG 351 Data Management and File Structures
Database Systems (資料庫系統)
Tree-Structured Indexes
Chapter 13: Data Storage Structures
Chapter 13: Data Storage Structures
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #03 Row/Column Stores, Heap Files, Buffer Manager, Catalogs Instructor: Chen Li.
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #06 B+ trees Instructor: Chen Li.
Lecture 20: Representing Data Elements
CSE 542: Operating Systems
CS222P: Principles of Data Management UCI, Fall Notes #06 B+ trees
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

LevelDB Introduction Fenggang Wu Oct. 17th, 2016

Projects Using LevelDB

LevelDB “LevelDB is an open source on-disk key-value store written by Google fellows Jeffrey Dean and Sanjay Ghemawat.” – Wikipedia “LevelDB is a light-weight, single-purpose library for persistence with bindings to many platforms.” – leveldb.org

API Get, Put, Delete, Iterator (Range Query).

Outline Design From LSM Tree to LevelDB Implementation

Key-Value Data Structures Hash table, Binary Tree, B+-Tree "when writes are slow, defer them and do them in batches” * *Dennis G. Severance and Guy M. Lohman. 1976.

Log-structured Merge (LSM) Tree O’Neil, P., Cheng, E., Gawlick, D., & O’Neil, E. (1996).

Two Component LSM-Tree

K+1 Components LSM-Tree

Rolling Merge

From LSM-Tree to LevelDB Lu, L., Pillai, T. S., Arpaci-Dusseau, A. C., & Arpaci-Dusseau, R. H. (2016).

LevelDB Data Structures Log file Memtable Immutable Memtable SSTable (file) Manifest file

Discussion Why using files instead of tree structure on disk?

Operations Put/delete Get Compaction

Put/Delete Insert into memtable Delete: Insert a delete marker to the memtable; annihilates with the value during compaction.

Get Sequence: memtable => immutable memtable => SSTs(level# low to high) Using manifest file (which records smallest and largest key), to locate all the overlapping SSTs.

LevelDB Structural Requiement L0 table size: ~4MB. Number: ideally 4; try to be less than 8; no greater than 12 L1 (and above) table size: <2MB. Total size < 10^i MB (i=level).

Compaction Maintain levelDB structure “in shape” Minor Compaction Major Compaction

Minor Compaction Dump immutable into SST file Push as further down as possible if No overlap with current level Overlapping with no more than 10 SSTs in the next level

Trigger Minor Compaction when all of the below satisfy: memtable exceeds 4MB when put/delete immutable = NULL Block write thread, move memtable to immutable, create new memtable. Background flushing for immutable

Major Compaction Pick one SST (level i) using an PickCompaction function. Merge it with all the overlapping SSTs (level i or i+1) and write them into level i+1 File boundary: 1) size <=2MB and 2) overlapping with no more than 10 SST in the next level

Evaluating Function size_compaction seek_compaction Find more imbalanced level Find the next SST to be compacted seek_compaction

Trigger Major Compaction when any of below satisfies Invoke CompactRange() manually allowed_seek used up when calling Get() L0 SST exceeds 8 Li (i>0) SST space exceeds 10^i MB

More Implementation Details

Source code structure

Important Directories doc/ documentation. Define log and sstable format include/leveldb/ header file for users, including basic interfaces, etc. db/ main key value store logic implementation. including interface implementation (db_impl/db_iter), internal struct definition (dbformat/memtable/skiplist/write_batch), db status and operations (version_set/version_edit), log formatting (log/log_reader/log_writer), file operations (filename), sstable related files (builder/table_cache). table/ sstable data structure and operations, including format, block operations (block/block_builder), iterator(two_level_iterator/ merger), optimizing wrapper (iterator_wrapper) port/ portable implantation for lock, signal, atomic operations, compression. (posix and android) util/ utility. memory management for memtable; LRU cache; comparator, general functions (coding/crc32c/hash/random/MutexLock/logging), etc.

Basic Concepts

Slice (include/leveldb/slice.h) An wrapper of data and the length for the ease of managing data.

Option (include/leveldb/option.h) struct options: Configuration when db is started. write buffer size (memtable size) log size max open file number Also there are ReadOption/WriteOption for put/get/delete. sync, etc.

Some other “knobs” max level number kL0_CompactionTrigger: L0 table compaction trigger kL0_SlowdownWritesTrigger kL0_StopWritesTrigger kTargetFileSize

ValueType (db/dbformat.h)

SequnceNumber (db/dbformat.h) every put/delete operation has a sequence number (64bits global variable) compaction, snapshot depends on the sequence number.

Keys userkey: passed from user, in Slice format InternalParsedKey: userkey + seqNum + valuetype InternalKey: string representation of InternalParsedKey

WriteBatch (db/write_batch.cc)

Memtable(db/memtable.ccdb/skiplist.h) write buffer of the keys (size: write_buffer_size) format: skip_list Once size read threshold, it turns into imutable memtable, and a new buffer is create for write. Background compaction will dump immutable table to sstable.

Skip list

Sstable(table/table.cc) Persistent data store (Key and value). index in the end of the file max file size: kTargetFileSize

FileMetaData (db/version_edit.h)

Sstable format index index

block block(table/block.cc) BlockHandle(table/format.h) compressed?

Compaction one compaction background process is in charge of dumping memtable into sstable, and balance sstable in all levels. memtable dumping has priority. L0 table has the size of the write buffer. L0 table does not merge keys According to design policy, selecting one table from level n and n+1 to merge (merge the same key, deletion), result in many sstables in level n+1. L1 and up, sstable is non-overlapped, better for read operation

Compaction in Detail leveldb has only one compaction background task. After main thread trigger compaction: 1)  if compaction is running or db is exiting, return 2)  examine current db status to determine the need of compaction. If so trigger background compaction task 3)  carry out compact logic (DBImpl::BackgroundCompaction()), trigger compaction again on completion.

When main thread trigger compaction? db starts and after recovery is done call compaction function directly memtable reach to size limit, turn into immutable memtable and trigger allowed_seeks is used up for a file.

Compaction procedure If immutable memtable exists, dump it to disk, and return after done. VersionSet::PickCompaction() DBImpl::DoCompactionWork() DBImpl::CleanupCompaction()

Minor Compaction Procedure 1. sstable = MemtableDumpToSSTable(immtable); 2. i = PickLevelForMemTableOutput(sstable); 3. place SSTable to level i; 4. edit= getVersionEdit(); 5. updateVersionSet();

VersionSet::PickCompaction() choose level (the one exceeds the file count limit most), choose the file with the start_key larger than the compaction pointer if all level balanced, choose the file that exceed the seek time limit.

Major Compaction Procedure c, i := PickCompaction(); // i is level if( up(i).size() == 1 AND down(i).size() == 0) { // down(i) is empty set. overlapBytes := calculateOverlapBytes(up(i), Level[i+2]); if( overlapBytes <= 20M ){ Just place up(i) to (i+1) level. return; } } DoCompactionWork; edit = updateEdit(); updateVersionSet(edit);

References http://leveldb.org https://en.wikipedia.org/wiki/LevelDB https://opensource.googleblog.com/2011/07/leveldb-fast-persistent-key-value-store.html http://highscalability.com/blog/2011/8/10/leveldb-fast-and-lightweight-keyvalue-database-from-the-auth.html http://openinx.github.io/ppt/leveldb-lsm-tree.pdf https://github.com/google/leveldb http://openinx.github.io/2014/08/17/leveldb-compaction/ http://openinx.github.io/2014/08/11/leveldb-notes/ https://www.quora.com/How-does-the-Log-Structured-Merge-Tree-work http://blog.csdn.net/sparkliang/article/details/8567602 http://www.cnblogs.com/siegfang/archive/2013/01/12/lsm-tree.html http://teampal.ustcsz.edu.cn/svn/nosql/test/tair/%E6%B7%98%E5%AE%9Dleveldb%E7%9A%84%E5%AE%9E%E7%8E%B0.PDF https://www.quora.com/How-does-the-Log-Structured-Merge-Tree-work