B+ Trees: An IO-Aware Index Structure Lecture 13.

Slides:



Advertisements
Similar presentations
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
Advertisements

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Query Execution, Concluded Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 18, 2003 Some slide content may.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting “There it was, hidden in alphabetical order.” Rita Holt R&G Chapter 13.
External Sorting CS634 Lecture 10, Mar 5, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 11 External Sorting.
External Sorting R & G Chapter 13 One of the advantages of being
External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.
CS 4432lecture #31 CS4432: Database Systems II Lecture #3 Professor Elke A. Rundensteiner.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.
File Organizations and Indexing Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
Lecture 11: DMBS Internals
BTrees & Sorting 11/3. Announcements I hope you had a great Halloween. Regrade requests were due a few minutes ago…
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke 1 External Sorting Chapter 13.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapters 13: 13.1—13.5.
CPSC Why do we need Sorting? 2.Complexities of few sorting algorithms ? 3.2-Way Sort 1.2-way external merge sort 2.Cost associated with external.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)
External Sorting Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.
What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently and safely. Provide.
1 Lecture 16: Data Storage Wednesday, November 6, 2006.
COMP261 Lecture 23 B Trees.
Lecture 16: Data Storage Wednesday, November 6, 2006.
External Sorting Chapter 13
Evaluation of Relational Operations
Database Management Systems (CS 564)
Lecture 12 Lecture 12: Indexing.
Database Management Systems (CS 564)
Lecture#12: External Sorting (R&G, Ch13)
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
External Sorting Chapter 13
Selected Topics: External Sorting, Join Algorithms, …
CS222P: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Lecture 30: The IO Model 1 External Sorting
Lecture 2- Query Processing (continued)
Lecture 26: Index Professor Xiannong Meng Spring 2018
Lecture 31: The IO Model 2 Repacking
CS222: Principles of Data Management Lecture #10 External Sorting
External Sorting.
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
CS222P: Principles of Data Management Lecture #10 External Sorting
Database Systems (資料庫系統)
External Sorting Chapter 13
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Lecture 20: Representing Data Elements
Presentation transcript:

B+ Trees: An IO-Aware Index Structure Lecture 13

“If you don’t find it in the index, look very carefully through the entire catalog” - Sears, Roebuck and Co., Consumers Guide, Lecture 13

Today’s Lecture 1.[Moved from 12-3]: External Merge Sort & Sorting Optimizations 2.Indexes: Motivations & Basics 3.B+ Trees 3 Lecture 13

1. External Merge Sort 4 Lecture 13 > Section 1

What you will learn about in this section 1.External merge sort 2.External merge sort on larger files 3.Optimizations for sorting 5 Lecture 13 > Section 1

Recap: External Merge Algorithm Suppose we want to merge two sorted files both much larger than main memory (i.e. the buffer) We can use the external merge algorithm to merge files of arbitrary length in 2*(N+M) IO operations with only 3 buffer pages! Our first example of an “IO aware” algorithm / cost model Lecture 13 > Section 1 > External Merge Sort

External Merge Sort Lecture 13 > Section 1 > External Merge Sort

Why are Sort Algorithms Important? Data requested from DB in sorted order is extremely common e.g., find students in increasing GPA order Why not just use quicksort in main memory?? What about if we need to sort 1TB of data with 1GB of RAM… A classic problem in computer science! Lecture 13 > Section 1 > External Merge Sort

More reasons to sort… Sorting useful for eliminating duplicate copies in a collection of records (Why?) Sorting is first step in bulk loading B+ tree index. Sort-merge join algorithm involves sorting Coming up… Next lecture Lecture 13 > Section 1 > External Merge Sort

Do people care? Sort benchmark bears his name Lecture 13 > Section 1 > External Merge Sort

So how do we sort big files? 1.Split into chunks small enough to sort in memory (“runs”) 2.Merge pairs (or groups) of runs using the external merge algorithm 3.Keep merging the resulting runs (each time = a “pass”) until left with one sorted file! Lecture 13 > Section 1 > External Merge Sort

External Merge Sort Algorithm 27,243,1 Example: 3 Buffer pages 6-page file Dis k Main Memory Buffer 18,22 F1F1 F2F2 33,1255,3144,10 1.Split into chunks small enough to sort in memory Lecture 13 > Section 1 > External Merge Sort Orange file = unsorted

External Merge Sort Algorithm 27,243,1 Dis k Main Memory Buffer 18,22 F1F1 F2F2 33,1255,3144,10 1.Split into chunks small enough to sort in memory Example: 3 Buffer pages 6-page file Lecture 13 > Section 1 > External Merge Sort Orange file = unsorted

External Merge Sort Algorithm 27,243,1 Dis k Main Memory Buffer 18,22 F1F1 F2F2 33,12 55,31 44,10 1.Split into chunks small enough to sort in memory Example: 3 Buffer pages 6-page file Lecture 13 > Section 1 > External Merge Sort Orange file = unsorted

External Merge Sort Algorithm 27,243,1 Dis k Main Memory Buffer 18,22 F1F1 F2F2 31,33 44,55 10,12 Example: 3 Buffer pages 6-page file 1.Split into chunks small enough to sort in memory Lecture 13 > Section 1 > External Merge Sort Orange file = unsorted

External Merge Sort Algorithm Dis k Main Memory Buffer F1F1 F2F2 31,3344,5510,12 And similarly for F 2 27,243,118,22 24,27 1,3 1.Split into chunks small enough to sort in memory Example: 3 Buffer pages 6-page file Lecture 13 > Section 1 > External Merge Sort Each sorted file is a called a run

External Merge Sort Algorithm Dis k Main Memory Buffer F1F1 F2F2 2. Now just run the external merge algorithm & we’re done! 31,3344,5510,12 18,2224,271,3 Example: 3 Buffer pages 6-page file Lecture 13 > Section 1 > External Merge Sort

Calculating IO Cost For 3 buffer pages, 6 page file: 1.Split into two 3-page files and sort in memory 1.= 1 R + 1 W for each file = 2*(3 + 3) = 12 IO operations 2.Merge each pair of sorted chunks using the external merge algorithm 1.= 2*(3 + 3) = 12 IO operations 3.Total cost = 24 IO Lecture 13 > Section 1 > External Merge Sort

Running External Merge Sort on Larger Files Disk 31,3344,5510,12 18,4324,2745,38 Assume we still only have 3 buffer pages (Buffer not pictured) 31,3347,5510,12 18,2223,2041,3 31,3339,5542,46 18,2324,271,3 48,3344,4010,12 18,2224,2716,31 Lecture 13 > Section 1 > External Merge Sort: Larger files

Running External Merge Sort on Larger Files Disk 31,3344,5510,12 18,4324,2745,38 31,3347,5510,12 18,2223,2041,3 31,3339,5542,46 18,2324,271,3 48,3344,4010,12 18,2224,2716,31 1. Split into files small enough to sort in buffer… Assume we still only have 3 buffer pages (Buffer not pictured) Lecture 13 > Section 1 > External Merge Sort: Larger files

Running External Merge Sort on Larger Files Disk 31,3344,5510,12 27,3843,4518,24 31,3347,5510,12 20,2223,413,18 39,4246,5531,33 18,2324,271,3 33,4044,4810,12 22,2427,3116,18 1. Split into files small enough to sort in buffer… and sort Assume we still only have 3 buffer pages (Buffer not pictured) Lecture 13 > Section 1 > External Merge Sort: Larger files Call each of these sorted files a run

Running External Merge Sort on Larger Files Disk 31,3344,5510,12 27,3843,4518,24 31,3347,5510,12 20,2223,413,18 39,4246,5531,33 18,2324,271,3 33,4044,4810,12 22,2427,3116,18 2. Now merge pairs of (sorted) files… the resulting files will be sorted! Disk 18,2427,3110,12 43,4445,5533,38 12,1820,223,10 33,4147,5523,31 18,2324,271,3 39,4246,5531,33 16,1822,2410,12 33,4044,4827,31 Assume we still only have 3 buffer pages (Buffer not pictured) Lecture 13 > Section 1 > External Merge Sort: Larger files

Running External Merge Sort on Larger Files Disk 31,3344,5510,12 27,3843,4518,24 31,3347,5510,12 20,2223,413,18 39,4246,5531,33 18,2324,271,3 33,4044,4810,12 22,2427,3116,18 3. And repeat… Disk 18,2427,3110,12 43,4445,5533,38 12,1820,223,10 33,4147,5523,31 18,2324,271,3 39,4246,5531,33 16,1822,2410,12 33,4044,4827,31 Disk 10,1212,183,10 22,2324,2718,20 33,3338,4131,31 45,4755,5543,44 10,1216,181,3 23,2424,2718,22 31,3333,3927,31 44,4648,5540,42 Assume we still only have 3 buffer pages (Buffer not pictured) Lecture 13 > Section 1 > External Merge Sort: Larger files Call each of these steps a pass

Running External Merge Sort on Larger Files Disk 31,3344,5510,12 27,3843,4518,24 31,3347,5510,12 20,2223,413,18 39,4246,5531,33 18,2324,271,3 33,4044,4810,12 22,2427,3116,18 4. And repeat! Disk 18,2427,3110,12 43,4445,5533,38 12,1820,223,10 33,4147,5523,31 18,2324,271,3 39,4246,5531,33 16,1822,2410,12 33,4044,4827,31 Disk 10,1212,183,10 22,2324,2718,20 33,3338,4131,31 45,4755,5543,44 10,1216,181,3 23,2424,2718,22 31,3333,3927,31 44,4648,5540,42 Disk 3,1010,101,3 12,1618,1812,12 20,2222,2318,18 24,2427,2723,24 31,3131,3327,31 33,3839,4033,33 43,4444,4541,42 48,5555,5546,47 Lecture 13 > Section 1 > External Merge Sort: Larger files

Simplified 3-page Buffer Version Unsorted input file Split & sort Merge Sorted! Lecture 13 > Section 1 > External Merge Sort: Larger files

Using B+1 buffer pages to reduce # of passes Suppose we have B+1 buffer pages now; we can: 1.Increase length of initial runs. Sort B+1 at a time! At the beginning, we can split the N pages into runs of length B+1 and sort these in memory Lecture 13 > Section 1 > Optimizations for sorting IO Cost: Starting with runs of length 1 Starting with runs of length B+1

Using B+1 buffer pages to reduce # of passes Suppose we have B+1 buffer pages now; we can: 2. Perform a B-way merge. On each pass, we can merge groups of B runs at a time (vs. merging pairs of runs)! Lecture 13 > Section 1 > Optimizations for sorting IO Cost: Starting with runs of length 1 Starting with runs of length B+1 Performing B-way merges

Repacking Lecture 13 > Section 1 > Optimizations for sorting

Repacking for even longer initial runs Lecture 13 > Section 1 > Optimizations for sorting

Repacking Example: 3 page buffer Start with unsorted single input file, and load 2 pages 57,243,98 Dis k Main Memory Buffer 18,22 F1F1 10,33 44,55 31,12 Lecture 13 > Section 1 > Optimizations for sorting F2F2

Repacking Example: 3 page buffer Take the minimum two values, and put in output page 57,243,98 Dis k Main Memory Buffer 18,22 F1F1 10,33 44,55 31,12 Lecture 13 > Section 1 > Optimizations for sorting F2F ,12 m=12 Also keep track of max (last) value in current run…

Repacking Example: 3 page buffer Next, repack 57,243,98 Dis k Main Memory Buffer F1F1 33 Lecture 13 > Section 1 > Optimizations for sorting F2F2 3131,33 10,12 m=12 44,55 18,22

Repacking Example: 3 page buffer Next, repack, then load another page and continue! 57,243,98 Dis k Main Memory Buffer F1F1 Lecture 13 > Section 1 > Optimizations for sorting F2F2 31,33 10,12 m=12 44,55 m=33 18,22

Repacking Example: 3 page buffer Now, however, the smallest values are less than the largest (last) in the sorted run… 3,98 Dis k Main Memory Buffer F1F1 Lecture 13 > Section 1 > Optimizations for sorting F2F2 31,33 10,12 m=33 18,22 We call these values frozen because we can’t add them to this run… 44,55 57,24

Repacking Example: 3 page buffer Now, however, the smallest values are less than the largest (last) in the sorted run… Dis k Main Memory Buffer F1F1 Lecture 13 > Section 1 > Optimizations for sorting F2F2 31,33 10,12 m=55 44,55 57,24 18,22 3,98

Repacking Example: 3 page buffer Now, however, the smallest values are less than the largest (last) in the sorted run… Dis k Main Memory Buffer F1F1 Lecture 13 > Section 1 > Optimizations for sorting F2F2 31,33 10,12 m=55 44,55 57,24 18,22 3,98

Repacking Example: 3 page buffer Now, however, the smallest values are less than the largest (last) in the sorted run… Dis k Main Memory Buffer F1F1 Lecture 13 > Section 1 > Optimizations for sorting F2F2 31,33 10,12 m=55 44,55 3,24 18,22 57,98

Repacking Example: 3 page buffer Once all buffer pages have a frozen value, or input file is empty, start new run with the frozen values Dis k Main Memory Buffer F1F1 Lecture 13 > Section 1 > Optimizations for sorting F2F2 31,33 10,12 m=0 44,55 3,24 18,22 57,98 F3F3

Repacking Example: 3 page buffer Once all buffer pages have a frozen value, or input file is empty, start new run with the frozen values Dis k Main Memory Buffer F1F1 Lecture 13 > Section 1 > Optimizations for sorting F2F2 31,33 10,12 m=0 44,55 57,98 F3F3 3,18 22,24

Repacking Note that, for buffer with B+1 pages: If input file is sorted  nothing is frozen  we get a single run! If input file is reverse sorted (worst case)  everything is frozen  we get runs of length B+1 In general, with repacking we do no worse than without it! What if the file is already sorted? Engineer’s approximation: runs will have ~2(B+1) length Lecture 13 > Section 1 > Optimizations for sorting

Summary Basics of IO and buffer management. See notebook for more fun! (Learn about sequential flooding) We introduced the IO cost model using sorting. Saw how to do merges with few IOs, Works better than main-memory sort algorithms. Described a few optimizations for sorting Lecture 13 > Section 1 > SUMMARY

2. Indexes 42 Lecture 13 > Section 2

What you will learn about in this section 1.Indexes: Motivation 2.Indexes: Basics 3.ACTIVITY: Creating indexes 43 Lecture 13 > Section 2

Index Motivation Lecture 13 > Section 2 > Indexes: Motivation Person(name, age)

Index Motivation What about if we want to insert a new person, but keep the list sorted? We would have to potentially shift N records, requiring up to ~ 2*N/P IO operations (where P = # of records per page)! We could leave some “slack” in the pages… Lecture 13 > Section 2 > Indexes: Motivation 4,56,71,33,45,61,2 2 7, Could we get faster insertions?

Index Motivation What about if we want to be able to search quickly along multiple attributes (e.g. not just age)? We could keep multiple copies of the records, each sorted by one attribute set… this would take a lot of space Lecture 13 > Section 2 > Indexes: Motivation Can we get fast search over multiple attribute (sets) without taking too much space? We’ll create separate data structures called indexes to address all these points

Further Motivation for Indexes: NoSQL! NoSQL engines are (basically) just indexes! A lot more is left to the user in NoSQL… one of the primary remaining functions of the DBMS is still to provide index over the data records, for the reasons we just saw! Sometimes use B+ Trees (covered next), sometimes hash indexes (not covered here) Lecture 13 > Section 2 > Indexes: Motivation Indexes are critical across all DBMS types

Indexes: High-level An index on a file speeds up selections on the search key fields for the index. Search key properties Any subset of fields is not the same as key of a relation Example: On which attributes would you build indexes? Product(name, maker, price) Lecture 13 > Section 2 > Indexes: Basics

More precisely An index is a data structure mapping search keys to sets of rows in a database table Provides efficient lookup & retrieval by search key value- usually much faster than searching through all the rows of the database table An index can store the full rows it points to (primary index) or pointers to those rows (secondary index) We’ll mainly consider secondary indexes Lecture 13 > Section 2 > Indexes: Basics

Operations on an Index Search: Quickly find all records which meet some condition on the search key attributes More sophisticated variants as well. Why? Insert / Remove entries Bulk Load / Delete. Why? Indexing is one the most important features provided by a database for performance Lecture 13 > Section 2 > Indexes: Basics

Conceptual Example What if we want to return all books published after 1867? The above table might be very expensive to search over row-by- row… SELECT * FROM Russian_Novels WHERE Published > 1867 SELECT * FROM Russian_Novels WHERE Published > 1867 BIDTitleAuthorPublishedFull_text 001War and PeaceTolstoy1869… 002Crime and Punishment Dostoyevsky1866… 003Anna KareninaTolstoy1877… Russian_Novels Lecture 13 > Section 2 > Indexes: Basics

Conceptual Example BIDTitleAuthorPublishedFull_text 001War and PeaceTolstoy1869… 002Crime and Punishment Dostoyevsky1866… 003Anna KareninaTolstoy1877… PublishedBID Maintain an index for this, and search over that! Russian_Novels By_Yr_Index Why might just keeping the table sorted by year not be good enough? Lecture 13 > Section 2 > Indexes: Basics

Conceptual Example BIDTitleAuthorPublishedFull_text 001War and PeaceTolstoy1869… 002Crime and Punishment Dostoyevsky1866… 003Anna KareninaTolstoy1877… PublishedBID Indexes shown here as tables, but in reality we will use more efficient data structures… Russian_Novels By_Yr_Index AuthorTitleBID DostoyevskyCrime and Punishment 002 TolstoyAnna Karenina003 TolstoyWar and Peace 001 By_Author_Title_Index Can have multiple indexes to support multiple search keys Lecture 13 > Section 2 > Indexes: Basics

Covering Indexes PublishedBID By_Yr_Index Lecture 13 > Section 2 > Indexes: Basics We say that an index is covering for a specific query if the index contains all the needed attributes- meaning the query can be answered using the index alone! The “needed” attributes are the union of those in the SELECT and WHERE clauses… SELECT Published, BID FROM Russian_Novels WHERE Published > 1867 SELECT Published, BID FROM Russian_Novels WHERE Published > 1867 Example:

High-level Categories of Index Types B-Trees (covered next) Very good for range queries, sorted data Some old databases only implemented B-Trees We will look at a variant called B+ Trees Hash Tables (not covered) There are variants of this basic structure to deal with IO Called linear or extendible hashing- IO aware! The data structures we present here are “IO aware” Real difference between structures: costs of ops determines which index you pick and why Lecture 13 > Section 2 > Indexes: Basics

Activity-13.ipynb 56 Lecture 13 > Section 2 > ACTIVITY