1 Foundations of Software Design Fall 2002 Marti Hearst Lecture 19: B-Trees: Data Structures for Disk.

Presentation transcript:

1 Foundations of Software Design Fall 2002 Marti Hearst Lecture 19: B-Trees: Data Structures for Disk

2 Data Structures & Memory So far we’ve seen data structures stored in main memory. What happens when you have a very large set of data? –Too slow to load it into memory –Might not fit into memory If you use virtual memory, the paging behavior changes the running time expectations: a very large array of length N that is stored in virtual memory will take much longer to access if most of the data is on pages that are paged out.

3 Data Structures on Disk For very large sets of information, we often need to keep most of it on disk Examples: –Information retrieval systems –Database systems To handle this efficiently: –Keep an index in memory –Keep the data on disk –The index contains pointers to the data on the disk Two most common techniques: –Hash tables and B-trees

4 Disk vs. RAM Disk has much larger capacity than RAM Disk is much slower to access than RAM

5 Images copyright 2003 Pearson Education RAM: Memory cells arranged by address

6 Images copyright 2003 Pearson Education Disk: Memory cells arranged by address

7 Disk Structure and Operation Made up of platters –Like a phonograph record Divided into –tracks (rings) and –sectors (wedges) Sectors divided into fixed-sized blocks As the disk spins beneath it, the read/write arm reads the data from the block(s) of interest –Data is read into a memory buffer in the OS –This gets eventually transferred to RAM –The process has to wait to get this resource

8 Disk Access Time Seek Time –The time required to move the read/write heads over the disk surface to the required track. –Roughly proportional to the distance the heads must move. Rotational Latency –The time taken, after the completion of the seek, for the disk platter to spin until the first sector addressed passes under the read/write heads. –On average, this is half of a full rotation. Transfer Time –The time taken for the disk platter to spin until all the addressed sectors have passed under the heads. –Directly proportional to the number of sectors addressed. Image and text from
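
The three components above add up to one access-time estimate. The following is a minimal sketch of that arithmetic, not a model of any particular drive; the class name, method names, and the idea of passing the seek time in as a parameter are assumptions made for illustration.

```java
/** Rough disk access-time estimate: seek + rotational latency + transfer. */
final class DiskAccessEstimate {

    /** Average rotational latency in ms: half of one full rotation. */
    static double avgRotationalLatencyMs(double rpm) {
        double msPerRotation = 60_000.0 / rpm;
        return msPerRotation / 2.0;
    }

    /** Transfer time in ms: proportional to the number of sectors addressed. */
    static double transferMs(double rpm, int sectorsRead, int sectorsPerTrack) {
        double msPerRotation = 60_000.0 / rpm;
        return msPerRotation * sectorsRead / sectorsPerTrack;
    }

    /** Total estimate; seekMs is supplied by the caller (hypothetical value). */
    static double accessMs(double seekMs, double rpm, int sectorsRead, int sectorsPerTrack) {
        return seekMs + avgRotationalLatencyMs(rpm) + transferMs(rpm, sectorsRead, sectorsPerTrack);
    }
}
```

For a hypothetical 6000 RPM disk, one rotation takes 10 ms, so the average rotational latency is 5 ms regardless of how little data is read. This fixed per-access cost is why disk-based structures try to minimize the number of separate accesses.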

9 Hash Tables vs. B-Trees Hash tables great for selecting individual items –Fast search and insert –O(1) if the table size and hash function are well-chosen BUT –Hash tables inefficient for finding sets of information with similar keys or for doing range searches (e.g., All documents published in a date range) –We often need this in text and DBMS applications Search trees are better for this –Can place subtrees near each other on disk B-Trees are a popular kind of disk-based search tree

10 B-Trees Goal: efficient access to sorted information Balanced Structure Sorted Keys Each node has many children Each node contains many data items –These are stored in an array in sorted order B-Tree is defined in terms of rules: –Makes use of a notion of a constant MINIMUM –These rules can vary We’ll use the ones in the Main book

11 B-Tree Rules (from Main) Rule 1: –The root may have as few as 1 data item –Every other node has at least MINIMUM items Rule 2: –The maximum number of elements in a node is twice the value of MINIMUM Rule 3: –The elements of each B-tree node are stored in a partially filled array –Sorted from smallest (item 0) to largest

12 B-Tree Rules (from Main) Rule 4: –The number of subtrees (children) of a non-leaf node is always one more than the number of items stored in the node. Rule 5: –For any non-leaf node: (a) A key at index i is greater than all the keys in subtree number i for a given node. (b) A key at index i is less than all the keys in subtree number i+1 for a given node. Rule 6: –Every leaf in a B-tree has the same depth
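
Rules 1 through 4 can be checked by looking at a single node. The sketch below does that; it is not code from the Main book. The class and method names are hypothetical, MINIMUM = 2 is an illustrative value, and the strict ordering check assumes no duplicate keys. Rules 5 and 6 need the whole tree and are not checked here.

```java
/** Per-node check of B-tree Rules 1-4 (Rules 5 and 6 need the whole tree). */
final class BTreeNodeRules {
    static final int MINIMUM = 2;            // illustrative value, not from the slides
    static final int MAXIMUM = 2 * MINIMUM;  // Rule 2

    static boolean nodeIsValid(int[] data, int dataCount, int childCount, boolean isRoot) {
        // Rule 1: the root may hold as few as 1 item; other nodes need MINIMUM.
        int minItems = isRoot ? 1 : MINIMUM;
        // Rules 1 and 2: item count must lie between the minimum and MAXIMUM.
        if (dataCount < minItems || dataCount > MAXIMUM) {
            return false;
        }
        // Rule 3: the used part of the array is sorted, smallest first
        // (assumes no duplicate keys).
        for (int i = 1; i < dataCount; i++) {
            if (data[i - 1] >= data[i]) {
                return false;
            }
        }
        // Rule 4: a non-leaf has exactly one more child than it has items.
        return childCount == 0 || childCount == dataCount + 1;
    }
}
```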

13 Illustration of Rules 4 and 5 [Figure: a node holding 93 and 107 with three subtrees. Subtree 0: all keys less than 93. Subtree 1: keys between 93 and 107. Subtree 2: all keys greater than 107.] Note: we could use some other ordering here besides integers

14 Example B-tree (MINIMUM = 1) Does this meet all the rule conditions? [Figure: example B-tree diagram]

15 Implementing B-Trees Requires recursive thinking –Every child of the root node is also the root of a smaller B-tree

16 Example B-tree Each subtree recursively acts like a B-tree. [Figure: example B-tree diagram]

17 Implementing B-Trees Requires recursive thinking –Every child of the root node is also the root of a smaller B-tree Defining the class –In this definition, data is also used as the keys –Static Variables: private static final int MINIMUM = 200; private static final int MAXIMUM = MINIMUM*2; –Instance Variables: int dataCount; int[] data = new int[MAXIMUM + 1]; int childCount; IntBTree[] node = new IntBTree[MAXIMUM + 2]; (extra room here to help with the implementation of add node)

18 Searching for an Item boolean contains (int target, IntBTree node) set i equal to the first index in node such that data[i] >= target (if there is no such index, set i equal to dataCount) if (target found at data[i]) return true else if (node has no children) return false else return contains(target, subtree i of node)
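
The search pseudocode can be written as a short recursive method. This is a minimal sketch, not the class from the Main book: the name IntBTreeSketch, the constructor, and the use of exactly-sized arrays (instead of partially filled arrays with dataCount and childCount) are simplifications assumed here.

```java
/** Minimal B-tree node supporting only search. */
final class IntBTreeSketch {
    final int[] data;                 // sorted keys stored in this node
    final IntBTreeSketch[] children;  // empty for a leaf, else data.length + 1 subtrees

    IntBTreeSketch(int[] data, IntBTreeSketch... children) {
        this.data = data;
        this.children = children;
    }

    boolean contains(int target) {
        // Find the first index i with data[i] >= target;
        // if there is none, i ends up equal to data.length.
        int i = 0;
        while (i < data.length && data[i] < target) {
            i++;
        }
        if (i < data.length && data[i] == target) {
            return true;                       // found in this node
        }
        if (children.length == 0) {
            return false;                      // leaf: nowhere left to look
        }
        return children[i].contains(target);   // descend into subtree i
    }
}
```

Because of Rule 5, subtree i is the only place the target can be once it is not in the node itself, so each call descends exactly one level.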

19 Find 18? [Figure: example B-tree diagram] Trace contains(18) using the search pseudocode from the previous slide.

20 Adding a Node Tricky because of the need to maintain the B-Tree rules The strategy: –First place the new item wherever it belongs, according to the value of the key –Then if a node has too many items, recursively split the too-large node until the B-Tree condition is recovered.

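21 Add 19. First, place the item where it belongs numerically. (MAXIMUM=2) [Figure: the example B-tree after the insert; one leaf now holds 18, 19, 22, which is more than MAXIMUM items.]

22 Now propagate the extra item up a level to restore the B-Tree condition. Requires a node split. [Figure: the over-full node 18, 19, 22 is split and the extra item moves up to its parent.]

23 Propagate again, recursively. [Figure: the split is repeated one level higher in the tree.]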

24 In the middle of add node, need to split a too-large node, passing the extra up to the parent. (MINIMUM=2, MAXIMUM=4)
Root: 9, 28
Children of the root: 3, 6 | 13, 16, 19, 22, 25 (too large) | 33, 40
Leaves: 1, 2 | 4, 5 | 7, 8 | 11, 12 | 14, 15 | 17, 18 | 20, 21 | 23, 24 | 26, 27 | 31, 32 | 34, 35 | 50, 51

25 In the middle of add node, need to split a too-large node, passing the extra up to the parent. (MINIMUM=2, MAXIMUM=4)
Root: 9, 28
Children of the root: 3, 6 | 13, 16 | 22, 25 | 33, 40, with the extra item 19 on its way up
Leaves: 1, 2 | 4, 5 | 7, 8 | 11, 12 | 14, 15 | 17, 18 | 20, 21 | 23, 24 | 26, 27 | 31, 32 | 34, 35 | 50, 51

26 In the middle of add node, need to split a too-large node, passing the extra up to the parent. (MINIMUM=2, MAXIMUM=4)
Root: 9, 19, 28
Children of the root: 3, 6 | 13, 16 | 22, 25 | 33, 40
Leaves: 1, 2 | 4, 5 | 7, 8 | 11, 12 | 14, 15 | 17, 18 | 20, 21 | 23, 24 | 26, 27 | 31, 32 | 34, 35 | 50, 51
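
The split step on the last three slides can be sketched on the item array alone. This is an illustration under simplifying assumptions, not the add-node code from the Main book: the names NodeSplitSketch and Split are hypothetical, and redistributing the node's children between the two halves is left out.

```java
import java.util.Arrays;

/** Splits the items of an over-full node (2*MINIMUM + 1 items) around the middle. */
final class NodeSplitSketch {

    /** The two new nodes' items plus the extra item passed up to the parent. */
    static final class Split {
        final int[] left;
        final int middle;
        final int[] right;

        Split(int[] left, int middle, int[] right) {
            this.left = left;
            this.middle = middle;
            this.right = right;
        }
    }

    static Split split(int[] items, int minimum) {
        if (items.length != 2 * minimum + 1) {
            throw new IllegalArgumentException("node is not exactly one item over MAXIMUM");
        }
        int[] left = Arrays.copyOfRange(items, 0, minimum);                  // first MINIMUM items
        int middle = items[minimum];                                         // goes up to the parent
        int[] right = Arrays.copyOfRange(items, minimum + 1, items.length);  // last MINIMUM items
        return new Split(left, middle, right);
    }
}
```

With MINIMUM = 2, splitting 13, 16, 19, 22, 25 gives 13, 16 and 22, 25, and passes 19 up, as in the slides. Both halves hold exactly MINIMUM items, so Rule 1 still holds for them; the parent gains one item and may itself need splitting.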

27 B-Tree Running Time Analysis Worst case time for: –Searching for an item? O(d) –Adding an item? O(d) But this is in terms of d (the depth of the tree), not n (the number of items stored) What about n? –Depth of the B-tree is never more than O(log n) But what if the B-tree has very wide nodes? –There is a tradeoff; we’ll see this soon for B+trees
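
The O(log n) depth bound follows from the rules. A sketch of the argument, taking n to be the number of items stored, M to be MINIMUM, and d to be the depth of the leaves (root at depth 0): the root has at least 1 item and therefore at least 2 children (Rules 1 and 4); every other node has at least M items and, if it is not a leaf, at least M+1 children; and all leaves sit at depth d (Rule 6). So level k ≥ 1 holds at least 2(M+1)^(k-1) nodes, each with at least M items:

```latex
\begin{align*}
n &\ge 1 + M \sum_{k=1}^{d} 2\,(M+1)^{k-1}
   = 1 + 2\bigl((M+1)^{d} - 1\bigr)
   = 2\,(M+1)^{d} - 1, \\
d &\le \log_{M+1}\frac{n+1}{2} = O(\log n).
\end{align*}
```

A larger MINIMUM makes the base of the logarithm larger, which is why wide nodes give shallow trees.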

28 Slide adapted from cis.stvincent.edu B+Trees Differences from B-Tree –Assume the actual data is in a separate file on disk –Internal nodes store keys only Each node may contain many keys Designed to be “branchy” or “bushy” Designed to have shallow height Has a limit on the number of keys per node –This way only a small number of disk blocks need to be read to find the data of interest –Only leaves store data records The leaf nodes refer to memory locations on disk Each leaf is linked to an adjacent leaf

29 B+Tree and Disk Reads Goal: –Optimize the B+tree structure so that a minimum number of disk blocks need to be read If the number of keys is not too large, keep all of the B+tree in memory Otherwise, –Keep the root and first levels of nodes in memory –Organize the tree so that each node fits within a disk block in order to reduce the number of disk reads

30 Slide adapted from lecture by Hector Garcia-Molina B+tree rules (tree of order s) (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”

31 Slide adapted from lecture by Hector Garcia-Molina B+Tree Sizes Size of nodes: s keys, s+1 pointers Don’t want nodes to be too empty Use at least: Non-leaf: ⌈(s+1)/2⌉ pointers Leaf: ⌊(s+1)/2⌋ pointers to data
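
The minimum-occupancy figures are easy to get wrong by one when s is even. The sketch below computes them with integer arithmetic, assuming the usual textbook rounding (ceiling for non-leaf nodes, floor for leaves); the class and method names are hypothetical.

```java
/** Minimum pointer counts for a B+tree node of order s (s keys, s+1 pointers). */
final class BPlusOccupancy {

    /** ceil((s+1)/2), written with integer division. */
    static int minNonLeafPointers(int s) {
        return (s + 2) / 2;
    }

    /** floor((s+1)/2), which is what integer division gives directly. */
    static int minLeafDataPointers(int s) {
        return (s + 1) / 2;
    }
}
```

For s = 3 both minimums are 2; for s = 4 a non-leaf needs at least 3 pointers while a leaf needs at least 2 pointers to data.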

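32 Slide adapted from lecture by Hector Garcia-Molina B+Tree Example [Figure: example B+tree with its root labeled]

33 Slide adapted from lecture by Hector Garcia-Molina Sample non-leaf [Figure: a non-leaf node whose pointers lead to successive key ranges, including keys k < 81, keys 81 ≤ k < 95, and keys k ≥ 95]

34 Slide adapted from lecture by Hector Garcia-Molina Sample leaf node [Figure: a leaf node reached from a non-leaf node, with pointers to the records with keys 47, 50, and 51, and a sequence pointer to the next leaf]

35 Slide adapted from lecture by Hector Garcia-Molina Full node vs. min. node [Figure: a full and a minimally filled node, shown for both the non-leaf and the leaf case]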

36 Example Use of B+Trees Recall that an inverted index is composed of –A Dictionary file and –A Postings file Use a B+Tree for the dictionary –The keys are the words –The values stored on disk are the postings

37 Using B+Trees for Inverted Index Use it to store the dictionary More efficient for searches on words with the same prefix –count* matches count, counter, counts, countess –Can store the postings for these terms near one another –Then only one disk seek is needed to get to these
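
The prefix lookup relies only on the dictionary keys being kept in sorted order. The sketch below illustrates that idea with java.util.TreeMap as an in-memory stand-in; TreeMap is a red-black tree, not a B+tree, and nothing here touches disk. The class name, method name, and the int[] postings values are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Prefix search over a sorted dictionary (TreeMap standing in for a B+tree). */
final class PrefixLookupSketch {

    /** Returns all dictionary terms that start with the given prefix, in sorted order. */
    static List<String> termsWithPrefix(TreeMap<String, int[]> dictionary, String prefix) {
        // Every string starting with the prefix sorts at or after the prefix itself
        // and before the prefix followed by the largest possible char.
        String upperBound = prefix + Character.MAX_VALUE;
        SortedMap<String, int[]> range = dictionary.subMap(prefix, upperBound);
        return new ArrayList<>(range.keySet());
    }
}
```

All matches for count* form one contiguous run of keys, so in a B+tree they would sit in the same leaf or in neighboring leaves joined by the sequence pointers. A hash table offers no such locality: each candidate term would need its own lookup.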

38 Inverted Index Dictionary Postings

39 Slide adapted from lecture by Hector Garcia-Molina Insert into B+tree (a) simple case –space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root

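40 Slide adapted from lecture by Hector Garcia-Molina (a) Insert key = 32 [Figure: B+tree before and after the insert]

41 Slide adapted from lecture by Hector Garcia-Molina (b) Insert key = 7 [Figure: B+tree before and after the insert]

42 Slide adapted from lecture by Hector Garcia-Molina (c) Insert key = 160 [Figure: B+tree before and after the insert]

43 Slide adapted from lecture by Hector Garcia-Molina (d) New root, insert 45 [Figure: B+tree before and after the insert, with the new root labeled]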

44 Slide adapted from lecture by Hector Garcia-Molina Interesting problem: For B+tree, how large should s be? … s is the number of keys per node

45 What is the expected running time for finding an item in the B+tree? Assume B+tree with nodes of size s Assume all of the index is in memory –Use binary search to locate the appropriate key within a node This takes a + b log_2 s for constants a and b Remember that s is a constant Assume the B+tree is full –# nodes to examine is log_s n where n = # records If n dominates –(Meaning that the tree is deeper than the nodes are wide) –O(log_s n) If s dominates –(Meaning the nodes are wider than the tree is deep) –O(log_2 s)

46 Slide adapted from lecture by Hector Garcia-Molina Sample assumptions: (order s B+tree) (1) Time to read node from disk is ( s) msec. (2) Once block in memory, use binary search to locate key: (a + b log_2 s) msec. For some constants a, b; assume a << 70 (3) Assume B+tree is full, i.e., # nodes to examine is log_s n where n = # records

47 Slide adapted from lecture by Hector Garcia-Molina Can get: f(s) = time to find a record [Figure: plot of f(s) against s, with its minimum at s_opt] Thus if s is too big or too small, problems result

48 Slide adapted from lecture by Hector Garcia-Molina Find s_opt by f’(s) = 0 Answer is s_opt = “few hundred” What happens to s_opt as Disk gets faster? CPU gets faster?
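
Instead of solving f’(s) = 0 by hand, the minimum can be located numerically. The sketch below assumes the cost model from the sample assumptions: a per-node disk cost that is linear in s, plus a + b log_2 s for the in-memory binary search, multiplied by log_s n nodes examined. The 70 ms fixed cost echoes the "a << 70" remark; every other constant, and all names, are invented for illustration and are not taken from the slides.

```java
/** Numerically locates the node size s that minimizes the lookup-time model f(s). */
final class NodeSizeCostSketch {
    // Illustrative constants (hypothetical, chosen only to make the sketch concrete).
    static final double DISK_FIXED_MS = 70.0;    // fixed cost of reading one node
    static final double DISK_PER_KEY_MS = 0.05;  // extra read cost per key in the node
    static final double CPU_A_MS = 0.01;         // binary search: constant part
    static final double CPU_B_MS = 0.002;        // binary search: per-comparison part

    /** f(s): estimated time in ms to find one of n records with node size s. */
    static double lookupMs(int s, double n) {
        double log2s = Math.log(s) / Math.log(2);
        double perNodeMs = DISK_FIXED_MS + DISK_PER_KEY_MS * s + CPU_A_MS + CPU_B_MS * log2s;
        double nodesExamined = Math.log(n) / Math.log(s);  // log_s n for a full tree
        return nodesExamined * perNodeMs;
    }

    /** Brute-force search over s = 2..maxS for the smallest f(s). */
    static int bestNodeSize(double n, int maxS) {
        int best = 2;
        for (int s = 3; s <= maxS; s++) {
            if (lookupMs(s, n) < lookupMs(best, n)) {
                best = s;
            }
        }
        return best;
    }
}
```

With these made-up constants the minimum falls in the low hundreds, in line with the "few hundred" answer. Changing DISK_FIXED_MS or the CPU constants and rerunning bestNodeSize is one way to explore the two questions on the slide.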

49 Slide adapted from lecture by Hector Garcia-Molina Tradeoffs: –B-trees have faster lookup than B+trees –But B+Tree is faster lookup if using fixed-sized blocks –In B-tree, deletion more complicated –B+trees often preferred

50 Choosing Data Structures Name example applications best suited for –Hash Tables –B+Trees

51 Next Time Sorting Algorithms