CSCI 104 Log Structured Merge Trees


CSCI 104 Log Structured Merge Trees Mark Redekopp

Series Summation Review
Let n = 1 + 2 + 4 + … + 2^k = Σ_{i=0..k} 2^i. What is n? n = 2^(k+1) - 1
What is log2(1) + log2(2) + log2(4) + log2(8) + … + log2(2^k)?
= 0 + 1 + 2 + 3 + … + k = Σ_{i=0..k} i = O(k^2)
So then what if k = log2(n), as in log2(1) + log2(2) + log2(4) + … + log2(2^log2(n))? That sum is O((log2 n)^2)
Arithmetic series: Σ_{i=1..n} i = n(n+1)/2 = Θ(n^2)
Geometric series: Σ_{i=0..n} c^i = (c^(n+1) - 1)/(c - 1) = Θ(c^n)
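The closed forms above can be sanity-checked numerically; this quick sketch is illustrative and not part of the original slides:

```python
# Numeric sanity checks of the closed forms on this slide.
k = 10
assert sum(2 ** i for i in range(k + 1)) == 2 ** (k + 1) - 1        # 1 + 2 + ... + 2^k
assert sum(range(k + 1)) == k * (k + 1) // 2                        # 0 + 1 + ... + k
n, c = 20, 3
assert sum(range(1, n + 1)) == n * (n + 1) // 2                     # arithmetic series
assert sum(c ** i for i in range(n + 1)) == (c ** (n + 1) - 1) // (c - 1)  # geometric series
```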

Merge Trees Overview
Consider a list of (pointers to) arrays with the following constraints:
- Each array is sorted, though no ordering constraints exist between arrays
- The array at list index k is either empty or of exactly size 2^k
[Diagram: list positions 0, 1, 2, 3, …; e.g., one position is NULL while another holds a size-8 sorted array (values such as 6, 9, 12, 14, 18, 20 appear); the next position's array would have size 16 if non-empty.]
Note: these are the keys for a set (or key,value pairs for a map)
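The structure above can be modeled as a plain Python list where slot k holds either None (empty) or a sorted list of exactly 2^k keys. This is a minimal illustrative sketch, not the course's implementation; all names are hypothetical:

```python
# levels[k] is either None (empty) or a sorted list of exactly 2**k keys.

def make_merge_tree():
    """Return an empty structure: a growable list of levels."""
    return []

def check_invariant(levels):
    """Verify each non-empty slot k holds a sorted array of size 2**k."""
    for k, arr in enumerate(levels):
        if arr is not None:
            assert len(arr) == 2 ** k, "slot k must hold exactly 2**k keys"
            assert all(arr[i] <= arr[i + 1] for i in range(len(arr) - 1)), \
                "each array must be sorted"

# Example matching the slide's diagram: slots 0, 1, 3 full, slot 2 empty.
levels = [[5], [2, 4], None, [2, 5, 6, 9, 12, 14, 18, 20]]
check_invariant(levels)
```

Note that no ordering holds between slots: slot 3's array contains keys smaller than slot 0's.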

Merge Trees Size
Define:
- n as the # of keys in the entire structure
- k as the size of the list (i.e., the number of positions in the list)
Given k, what is n? If every position is full, n = 1 + 2 + 4 + … + 2^(k-1) = Σ_{i=0..k-1} 2^i = 2^k - 1
[Diagram as before: the array at list position j can be of size 2^j or empty.]

Merge Trees Find Operation
To find an element (or check if it exists):
- Iterate through the arrays in order (i.e., start with the array at list position 0, then the array at list position 1, etc.)
- In each non-empty array, perform a binary search
- If you reach the end of the list of arrays without finding the value, it does not exist in the set/map
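Under the list-of-sorted-arrays model, find can be sketched as below (an illustrative sketch; Python's `bisect_left` performs the binary search):

```python
from bisect import bisect_left

def find(levels, key):
    """Scan levels in order; binary-search each non-empty sorted array."""
    for arr in levels:
        if arr is None:
            continue                      # empty slot: nothing to search
        i = bisect_left(arr, key)         # binary search, O(log len(arr))
        if i < len(arr) and arr[i] == key:
            return True
    return False                          # exhausted every array: not present

levels = [[5], [2, 4], None, [2, 6, 9, 12, 14, 18, 20, 51]]
assert find(levels, 12)
assert not find(levels, 3)
```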

Find Runtime
What is the worst-case runtime of find? The worst case is when the item is not present, which requires a binary search on every array:
T(n) = log2(1) + log2(2) + … + log2(2^k) = 0 + 1 + 2 + … + k = Σ_{i=0..k} i = O(k^2)
But let's put that in terms of the number of elements in the structure (i.e., n). Recall k = O(log2(n)), so find is O((log2 n)^2)

Improving Find's Runtime
While we might be okay with O((log n)^2), how might we improve the find runtime in the general case?
Hint: we would be willing to pay O(1) to know that a key is NOT in a particular array, without having to perform a binary search.
A Bloom filter could be maintained alongside each array, letting us skip the binary search in arrays that definitely do not contain the key.
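One way to realize the hint is a small Bloom filter per array. The toy sketch below uses two ad hoc hash probes (an assumption for illustration; real implementations use better hash functions): if any probed bit is 0 the key is definitely absent, so the binary search can be skipped; if all bits are 1 the key *might* be present, and false positives merely cost an unnecessary search.

```python
class BloomFilter:
    """Toy Bloom filter: an m-bit array probed by two hash functions."""

    def __init__(self, m=64):
        self.m = m
        self.bits = [False] * m

    def _probes(self, key):
        # Two illustrative probes; not a recommended hash scheme.
        yield hash(key) % self.m
        yield hash(str(key) + "salt") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means "definitely not present" (no false negatives).
        return all(self.bits[p] for p in self._probes(key))

bf = BloomFilter()
for k in (2, 4, 5, 19):
    bf.add(k)
assert bf.might_contain(5)   # an inserted key is always reported possible
```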

Insertion Algorithm
Recall that the array at list location k can be of size 2^k or empty. Let j be the smallest integer such that array j is empty (the first empty slot in the list of arrays). An insertion will cause:
- Location j's array to become filled
- Locations 0 through j-1 to become empty
Example: insert(19) with j = 2. Before insertion, position 0 holds [5] and position 1 holds [2, 4]; after insertion, positions 0 and 1 are NULL and position 2 holds [2, 4, 5, 19]. The size-8 array [6, 9, 12, 14, 18, 20, …] further down the list is unaffected.

Insertion Algorithm (continued)
Starting at array 0, iteratively merge the previously merged array with the next, stopping when an empty location is encountered.
Example: insert(19). List position 0 is full, so merge [19] with [5] to get [5, 19]. List position 1 is also full, so merge the two size-2 arrays [5, 19] and [2, 4] to get [2, 4, 5, 19], which is placed in the empty position 2.
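The cascading merge can be sketched as follows under the list-of-sorted-arrays model (an illustrative sketch, not the course solution; `heapq.merge` performs the linear-time merge of two sorted sequences):

```python
from heapq import merge  # linear-time merge of sorted iterables

def insert(levels, key):
    """Insert key by carrying a merged array upward to the first empty slot."""
    carry = [key]                      # the sorted array being carried upward
    k = 0
    while True:
        if k == len(levels):
            levels.append(None)        # grow the list of levels as needed
        if levels[k] is None:
            levels[k] = carry          # empty slot found: len(carry) == 2**k
            return
        carry = list(merge(levels[k], carry))  # merge two size-2**k arrays
        levels[k] = None               # position k becomes empty
        k += 1

levels = []
for key in (4, 2, 5, 19):
    insert(levels, key)
assert levels == [None, None, [2, 4, 5, 19]]   # matches the slide's example
```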

Insert Examples
insert(4): cost = 1, stop @ 0 → [4]
insert(2): cost = 2, stop @ 1 → NULL, [2, 4]
insert(5): cost = 1, stop @ 0 → [5], [2, 4]
insert(19): cost = 4, stop @ 2 → NULL, NULL, [2, 4, 5, 19]
insert(8): cost = 1, stop @ 0 → [8], NULL, [2, 4, 5, 19]
insert(7): cost = 2, stop @ 1 → NULL, [7, 8], [2, 4, 5, 19]
insert(12): cost = 1, stop @ 0 → [12], [7, 8], [2, 4, 5, 19]

Insertion Runtime: First Look
Best case? The first list position is empty, allowing direct insertion in O(1).
Worst case? All list entries (arrays) are full, so we have to merge at each location. In this case we end with an array of size n = 2^k in position k. Recall that merging two sorted arrays of size m is Θ(m), so the total cost of all the merges is 1 + 2 + 4 + 8 + … + n = 2n - 1 = Θ(n) = Θ(2^k).
But if the worst case occurs, how soon can it occur again? The costs vary from one insert to the next, so this is a good place to use amortized analysis.

Total Cost for n Insertions
Total cost of n = 16 insertions:
1 + 2 + 1 + 4 + 1 + 2 + 1 + 8 + 1 + 2 + 1 + 4 + 1 + 2 + 1 + 16
= 1*(n/2) + 2*(n/4) + 4*(n/8) + 8*(n/16) + n
= n/2 + n/2 + n/2 + n/2 + n
= (n/2)*log2(n) + n
Amortized cost = total cost / n operations = log2(n)/2 + 1 = O(log2(n))
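The cost pattern above can be checked by simulation: an insert that stops at the first empty slot j costs 2^j (the size of the array it writes) and empties slots 0 through j-1, exactly like a binary counter. A quick sketch (illustrative names):

```python
from math import log2

def insert_costs(n):
    """Simulate n insertions; return the per-insert merge costs."""
    full = []                 # full[j] is True if slot j currently holds an array
    costs = []
    for _ in range(n):
        j = 0
        while j < len(full) and full[j]:
            full[j] = False   # slot j is emptied by the cascading merge
            j += 1
        if j == len(full):
            full.append(False)
        full[j] = True        # the merged array lands in slot j
        costs.append(2 ** j)  # total merge work for this insert
    return costs

n = 16
costs = insert_costs(n)
assert costs == [1, 2, 1, 4, 1, 2, 1, 8, 1, 2, 1, 4, 1, 2, 1, 16]
assert sum(costs) == n // 2 * int(log2(n)) + n   # 48 = (n/2)*log2(n) + n
```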

Amortized Analysis of Insert
We have said that when an insert ends by placing an array in position k, it does O(2^(k+1)) work for all the merges. How often do we end in position k?
- The 0th position will be free with probability 1/2 (p = 0.5)
- We stop at the 1st position with probability 1/4 (p = 0.25)
- We stop at the 2nd position with probability 1/8 (p = 0.125)
- In general, we stop at the kth position with probability 1/2^(k+1) = 2^-(k+1)
So we pay 2^(k+1) with probability 2^-(k+1). Suppose we have n items in the structure (i.e., max k is log2(n)); then the expected cost of inserting a new element is
Σ_{k=0..log2(n)} 2^(k+1) * 2^-(k+1) = Σ_{k=0..log2(n)} 1 = O(log n)
[Examples as before: insert(4) cost 1, stop @ 0; insert(2) cost 2, stop @ 1; insert(5) cost 1, stop @ 0; insert(19) cost 4, stop @ 2.]

Summary
Variants of log-structured merge trees have found popular usage in industry:
- The starting array size might be fairly large (the size of the memory of a single server)
- Large arrays (from merging) are stored on disk
Pros:
- Ease of implementation
- Sequential access of arrays helps lower the constant factors
Operations:
- Find = O((log2 n)^2)
- Insert = amortized O(log n)
- Remove = often not considered/supported