
B-trees - Hashing

11.2 Database System Concepts
Review: B-trees and B+-trees
- Multilevel, disk-aware, balanced index methods
- Primary or secondary, dense or sparse
- Support selection and range queries
- B+-trees: the most common indexing structure in databases; all actual values are stored in the leaf nodes
- Optimality: space O(N/B), updates O(log_B(N/B)), queries O(log_B(N/B) + K/B), where B is the fan-out of a node and K is the size of the query result

11.3 B+Tree Example
[Figure: example B+tree with its root; the order value did not survive extraction]

11.4 [Figure: a full node and a minimum node, for non-leaf and leaf nodes; the value of n did not survive extraction]

11.5 B+tree rules (tree of order n)
(1) All leaves are at the same lowest level (balanced tree)
(2) Pointers in leaves point to records, except for the "sequence pointer"
(3) Number of pointers/keys for a B+tree:
- Non-leaf (non-root): max n pointers, n-1 keys; min ceil(n/2) pointers, ceil(n/2)-1 keys
- Leaf (non-root): max n pointers, n-1 keys; min ceil((n-1)/2) keys
- Root: max n pointers, n-1 keys; min 2 pointers, 1 key
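Rule (3) can be tabulated for a concrete order. The following is a minimal sketch, assuming the rounding brackets in the rule are ceilings (the convention of the Database System Concepts textbook); the function name `bplus_bounds` and the dictionary layout are illustrative only.

```python
import math


def bplus_bounds(n):
    """Return the pointer/key bounds of rule (3) for a B+tree of order n.

    Assumption: the minimums use ceilings, and the single surviving leaf
    minimum, (n-1)/2, is read as the minimum number of keys.
    """
    return {
        "non_leaf": {
            "max_ptrs": n,
            "max_keys": n - 1,
            "min_ptrs": math.ceil(n / 2),
            "min_keys": math.ceil(n / 2) - 1,
        },
        "leaf": {
            "max_ptrs": n,
            "max_keys": n - 1,
            "min_keys": math.ceil((n - 1) / 2),
        },
        "root": {"max_ptrs": n, "max_keys": n - 1, "min_ptrs": 2, "min_keys": 1},
    }


if __name__ == "__main__":
    for kind, bounds in bplus_bounds(5).items():
        print(kind, bounds)
```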

11.6 Insert into B+tree
(a) Simple case: space available in leaf
(b) Leaf overflow
(c) Non-leaf overflow
(d) New root

11.7 (a) Insert key = 32
[Figure: tree diagram; the value of n did not survive extraction]

11.8 (b) Insert key = 7
[Figure: tree diagram; the value of n did not survive extraction]

11.9 (c) Insert key = 160
[Figure: tree diagram; the value of n did not survive extraction]

11.10 (d) New root, insert 45
[Figure: tree diagram showing the new root; the value of n did not survive extraction]

11.11 Deletion from B+tree
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf

11.12 (b) Coalesce with sibling
[Figure: deletion example with n=5; the tree diagram and the deleted key did not survive extraction, only the key 40 remains]

11.13 (c) Redistribute keys
[Figure: deletion example with n=5; the tree diagram and the deleted key did not survive extraction, only the key 35 remains]

11.14 (d) Non-leaf coalesce
Delete 37.
[Figure: tree diagram showing the new root; the value of n did not survive extraction]

11.15 Selection Queries
- The B+-tree is perfect, but... to answer a selection query (ssn=10) it needs to traverse a full path.
- In practice, that is 3-4 block accesses (depending on the height of the tree and on buffering).
- Any better approach? Yes! Hashing:
  - static hashing
  - dynamic hashing

11.16 Hashing
- Hash-based indexes are best for equality selections. They cannot support range searches.
- Static and dynamic hashing techniques exist; the trade-offs are similar to ISAM vs. B+ trees.

11.17 Static Hashing
- The number of primary pages is fixed; they are allocated sequentially and never de-allocated. Overflow pages are added if needed.
- h(k) mod N = bucket to which the data entry with key k belongs (N = number of buckets).
[Figure: key -> h -> h(key) mod N selects one of the primary bucket pages 0 ... N-1, each followed by its overflow pages]
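A minimal in-memory sketch of static hashing with overflow pages follows. It is illustrative only: pages are Python lists, the page capacity and the class name `StaticHashFile` are assumptions, and the default hash function is simply the identity on integer keys.

```python
class StaticHashFile:
    """Static hashing: N fixed primary pages, overflow pages chained on demand."""

    def __init__(self, num_buckets, page_capacity, h=lambda key: key):
        self.n = num_buckets
        self.capacity = page_capacity
        self.h = h  # assumed hash function; identity on ints by default
        # Each bucket is a chain of pages: page 0 is the primary page,
        # any further pages are overflow pages.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _chain(self, key):
        return self.buckets[self.h(key) % self.n]

    def insert(self, key):
        chain = self._chain(key)
        if len(chain[-1]) == self.capacity:
            chain.append([])  # last page is full: allocate an overflow page
        chain[-1].append(key)

    def search(self, key):
        # Cost grows with the length of the overflow chain.
        return any(key in page for page in self._chain(key))

    def overflow_pages(self, key):
        return len(self._chain(key)) - 1


if __name__ == "__main__":
    f = StaticHashFile(num_buckets=4, page_capacity=2)
    for k in (0, 4, 8, 12, 1):
        f.insert(k)
    print(f.buckets[0])         # primary page plus one overflow page
    print(f.overflow_pages(0))  # length of the overflow chain for bucket 0
```

Because N never changes, a skewed or growing key set lengthens the chains, which is the weakness the dynamic schemes address.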

11.18 Static Hashing (Contd.)
- Buckets contain data entries.
- The hash function works on the search key field of record r. Its value mod N distributes values over the range 0 ... N-1.
  - h(key) = (a * key + b) usually works well.
  - a and b are constants; a lot is known about how to tune h.
- Long overflow chains can develop and degrade performance.
  - Extendable and Linear Hashing: dynamic techniques to fix this problem.

11.19 Extendable Hashing
- Situation: a bucket (primary page) becomes full. Why not re-organize the file by doubling the number of buckets?
  - Reading and writing all pages is expensive!
- Idea: use a directory of pointers to buckets; double the number of buckets by doubling the directory, splitting just the bucket that overflowed.
  - The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page!
  - The trick lies in how the hash function is adjusted.

11.20 Example
- The directory is an array of size 4.
- The bucket for record r is the directory entry whose index is the `global depth' least significant bits of h(r); we denote r by h(r).
  - If h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
  - If h(r) = 7 = binary 111, it is in the bucket pointed to by 11.
[Figure: directory with its global depth, pointing to data pages with their local depths. Bucket A: 4*, 12*, 32*, 16*; Bucket B: 1*, 5*, 7*, 13*; Bucket C: 10*]

11.21 Handling Inserts
- Find the bucket where the record belongs.
- If there is room, put it there.
- Else, if the bucket is full, split it:
  - increment the local depth of the original page
  - allocate a new page with the new local depth
  - re-distribute records from the original page
  - add an entry for the new page to the directory
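The insert steps above can be sketched as a small in-memory structure. This is an illustration under stated assumptions, not a disk-based implementation: keys stand for their own hash values, and the names `Bucket` and `ExtendableHash`, the page capacity of 4, and the choice to start with one bucket per directory slot are all hypothetical.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendableHash:
    """Directory indexed by the `global_depth` least significant bits of h(r)."""

    def __init__(self, capacity=4, global_depth=2):
        self.capacity = capacity
        self.global_depth = global_depth
        # Assumption: start with one distinct bucket per directory slot.
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _index(self, h):
        return h & ((1 << self.global_depth) - 1)

    def search(self, h):
        return h in self.directory[self._index(h)].entries

    def insert(self, h):
        bucket = self.directory[self._index(h)]
        if len(bucket.entries) < self.capacity:
            bucket.entries.append(h)
            return
        # Bucket is full: double the directory first if it has no spare bit.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split: raise the local depth and allocate the split image.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)  # the newly significant bit
        old = bucket.entries
        bucket.entries = [e for e in old if not e & bit]
        image.entries = [e for e in old if e & bit]
        # Re-point the directory slots that now belong to the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image
        # Retry; this never terminates if more than `capacity` entries
        # share one hash value, which is the problem case noted for duplicates.
        self.insert(h)


if __name__ == "__main__":
    eh = ExtendableHash()
    for h in (4, 12, 32, 16, 1, 5, 7, 13, 10, 21, 19, 15, 20):
        eh.insert(h)
    print(eh.global_depth, eh.directory[0].entries, eh.directory[4].entries)
```

Only the overflowing bucket is rewritten on a split; the rest of the file is untouched, and the directory doubles only when the split bucket has no unused directory bit left.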

11.22 Example: Insert 21, then 19, 15
- 21 = binary 10101
- 19 = binary 10011
- 15 = binary 01111
[Figure: directory (global depth 2) and data pages after the inserts. Bucket A: 4*, 12*, 32*, 16*; Bucket B: 1*, 5*, 21*, 13*; Bucket C: 10*; Bucket D: 15*, 7*, 19*]

11.23 Insert h(r)=20 (Causes Doubling)
[Figure: before the insert, Bucket A (local depth 2) holds 4*, 12*, 32*, 16*. After the insert, the global depth is 3. Bucket A (local depth 3): 32*, 16*; Bucket B (local depth 2): 1*, 5*, 21*, 13*; Bucket C (local depth 2): 10*; Bucket D (local depth 2): 15*, 7*, 19*; Bucket A2 (local depth 3, `split image' of Bucket A): 4*, 12*, 20*]

11.24 Points to Note
- 20 = binary 10100. The last 2 bits (00) tell us r belongs in either A or A2. The last 3 bits are needed to tell which.
  - Global depth of directory: max number of bits needed to tell which bucket an entry belongs to.
  - Local depth of a bucket: number of bits used to determine if an entry belongs to this bucket.
- When does a bucket split cause directory doubling?
  - Before the insert, local depth of the bucket = global depth. The insert causes local depth to become > global depth; the directory is doubled by copying it over and `fixing' the pointer to the split image page.

11.25 Directory Doubling
- Why use least significant bits in the directory? It allows for doubling via copying!
[Figure: directory doubling for the entry 6* (6 = binary 110), comparing least significant bits with most significant bits]
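The difference between the two bit choices can be shown on a directory held as a plain list. This is a sketch; the function names `double_lsb` and `double_msb` and the string bucket labels are illustrative.

```python
def double_lsb(directory):
    """Least significant bits: the new slot i + len(directory) must point to
    the same bucket as slot i, so doubling is a plain copy appended at the end."""
    return directory + directory


def double_msb(directory):
    """Most significant bits: old slot i becomes new slots 2i and 2i + 1,
    so every existing entry moves and the old entries are interleaved."""
    return [bucket for bucket in directory for _ in range(2)]


if __name__ == "__main__":
    d = ["A", "B", "C", "D"]
    print(double_lsb(d))
    print(double_msb(d))
```

With least significant bits the existing half of the directory stays in place, so only the slot of the split image needs fixing afterwards.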

11.26 Comments on Extendable Hashing
- If the directory fits in memory, an equality search is answered with one disk access; else two.
  - A 100MB file with 100 bytes/rec and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory.
  - The directory grows in spurts, and, if the distribution of hash values is skewed, the directory can grow large.
  - Multiple entries with the same hash value cause problems!
- Delete: if removal of a data entry makes a bucket empty, it can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.
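The figures in the sizing example can be reproduced with a few integer divisions. The sketch assumes 1 MB = 10^6 bytes, 4K = 4096 bytes, whole records per page, and one directory element per data page; the function name `directory_estimate` is illustrative.

```python
def directory_estimate(file_bytes, record_bytes, page_bytes):
    """Return (records, records per page, directory elements).

    Assumption: one directory element per data page (bucket).
    """
    records = file_bytes // record_bytes
    per_page = page_bytes // record_bytes  # whole records that fit on a page
    pages = -(-records // per_page)        # ceiling division
    return records, per_page, pages


if __name__ == "__main__":
    print(directory_estimate(100 * 10**6, 100, 4096))
```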

11.27 Extendable Hashing vs. Other Schemes
- Benefits of extendable hashing:
  - Hash performance does not degrade with growth of the file
  - Minimal space overhead
- Disadvantages of extendable hashing:
  - Extra level of indirection to find the desired record
  - The bucket address table may itself become very big (larger than memory)
    - Cannot allocate very large contiguous areas on disk either
    - Solution: a B+-tree structure to locate the desired record in the bucket address table
  - Changing the size of the bucket address table is an expensive operation
- Linear hashing is an alternative mechanism
  - Allows incremental growth of its directory (equivalent to the bucket address table)
  - At the cost of more bucket overflows

11.28 Comparison of Ordered Indexing and Hashing
- Cost of periodic re-organization
- Relative frequency of insertions and deletions
- Is it desirable to optimize average access time at the expense of worst-case access time?
- Expected type of queries:
  - Hashing is generally better at retrieving records having a specified value of the key.
  - If range queries are common, ordered indices are to be preferred.
- In practice:
  - PostgreSQL supports hash indices, but discourages their use due to poor performance
  - Oracle supports B+-trees and static hash organization, but not hash indices
  - SQL Server supports only B+-trees

11.29 Bitmap Indices
- Bitmap indices are a special type of index designed for efficient querying on multiple keys
- Very effective on attributes that take on a relatively small number of distinct values
  - E.g. gender, country, state, ...
  - E.g. income-level (income broken up into a small number of levels, the last of which is open-ended up to infinity)
- A bitmap is simply an array of bits
- For each gender, we associate a bitmap, where each bit represents whether or not the corresponding record has that gender.

11.30 Bitmap Indices (Cont.)
- In its simplest form, a bitmap index on an attribute has a bitmap for each value of the attribute
  - Each bitmap has as many bits as there are records
  - In the bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and 0 otherwise

11.31 Bitmap Indices (Cont.)
- Bitmap indices are useful for queries on multiple attributes
  - They are not particularly useful for single-attribute queries
- Queries are answered using bitmap operations
  - Intersection (and)
  - Union (or)
  - Complementation (not)
- Each binary operation takes two bitmaps of the same size and applies the operation on corresponding bits to get the result bitmap; complementation takes a single bitmap
  - E.g. 100110 AND 110011 = 100010, 100110 OR 110011 = 110111, NOT 100110 = 011001
  - Males with income level L1: AND of the Males bitmap with the Income Level L1 bitmap
    - Can then retrieve the required tuples.
    - Counting the number of matching tuples is even faster
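A minimal sketch of these operations, using Python integers as bitmaps (bit i stands for record i). The sample gender and income-level columns and the helper names `build_bitmaps` and `matching_rows` are hypothetical, chosen only for illustration.

```python
def build_bitmaps(column):
    """One bitmap per distinct value; bit i is set if record i has that value."""
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps


def matching_rows(bitmap, num_records):
    """Record numbers whose bit is set in the bitmap."""
    return [i for i in range(num_records) if (bitmap >> i) & 1]


if __name__ == "__main__":
    gender = ["m", "f", "f", "m", "f"]        # hypothetical sample data
    income = ["L1", "L2", "L1", "L4", "L3"]   # hypothetical sample data
    by_gender = build_bitmaps(gender)
    by_income = build_bitmaps(income)

    males_l1 = by_gender["m"] & by_income["L1"]     # intersection (and)
    print(matching_rows(males_l1, len(gender)))     # rows to fetch
    print(bin(males_l1).count("1"))                 # count without fetching rows
```

Counting needs only the result bitmap, which is why it is cheaper than retrieving the tuples.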

11.32 Bitmap Indices (Cont.)
- Bitmap indices are generally very small compared with the relation size
  - E.g. if a record is 100 bytes, the space for a single bitmap is 1/800 of the space used by the relation.
    - If the number of distinct attribute values is 8, the bitmap index is only 1% of the relation size
- Deletion needs to be handled properly
  - An existence bitmap notes whether there is a valid record at a record location
  - Needed for complementation
    - not(A=v): (NOT bitmap-A-v) AND ExistenceBitmap
- Should keep bitmaps for all values, even the null value
  - To correctly handle SQL null semantics for NOT(A=v):
    - intersect the above result with (NOT bitmap-A-Null)
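The complementation rule can be sketched with integer bitmaps as follows. The function name `not_equal` and the five-record example are assumptions made for illustration; bit i stands for record location i.

```python
def not_equal(bitmap_v, bitmap_null, existence, num_records):
    """Evaluate NOT(A = v) with SQL null semantics.

    (NOT bitmap-A-v) AND ExistenceBitmap AND (NOT bitmap-A-Null)
    """
    mask = (1 << num_records) - 1  # keep the complement within num_records bits
    result = (~bitmap_v & mask) & existence
    return result & (~bitmap_null & mask)


if __name__ == "__main__":
    # Hypothetical column A over 5 record locations:
    #   0: "x", 1: NULL, 2: "y", 3: "x", 4: (deleted)
    bitmap_x = 0b01001
    bitmap_null = 0b00010
    existence = 0b01111
    print(bin(not_equal(bitmap_x, bitmap_null, existence, 5)))
```

Without the existence bitmap, the deleted location 4 would be reported as a match; without the null bitmap, record 1 would be.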

11.33 Efficient Implementation of Bitmap Operations
- Bitmaps are packed into words; a single word AND (a basic CPU instruction) computes the AND of 32 or 64 bits at once
  - E.g. two 1-million-bit bitmaps can be ANDed with just 31,250 instructions (using 32-bit words)
- Counting the number of 1s can be done fast by a trick:
  - Use each byte to index into a precomputed array of 256 elements, each storing the count of 1s in the binary representation of its index
    - Can use pairs of bytes to speed up further at a higher memory cost
  - Add up the retrieved counts
- Bitmaps can be used instead of tuple-ID lists at the leaf levels of B+-trees, for values that have a large number of matching records
  - Worthwhile if > 1/64 of the records have that value, assuming a tuple-id is 64 bits
  - The above technique merges the benefits of bitmap and B+-tree indices
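The byte-lookup counting trick can be sketched in a few lines. The names `ONES_IN_BYTE` and `count_ones` are illustrative, and the bitmap is assumed to be stored as a Python `bytes` object.

```python
# Precomputed array of 256 elements: ONES_IN_BYTE[b] is the number of 1 bits in b.
ONES_IN_BYTE = [bin(b).count("1") for b in range(256)]


def count_ones(bitmap: bytes) -> int:
    """Count set bits by indexing the table with each byte and summing."""
    return sum(ONES_IN_BYTE[b] for b in bitmap)


def words_needed(num_bits: int, word_bits: int = 32) -> int:
    """Number of word-wide AND instructions needed for a bitmap of num_bits."""
    return -(-num_bits // word_bits)  # ceiling division


if __name__ == "__main__":
    print(count_ones(bytes([0xFF, 0x01, 0x00])))
    print(words_needed(1_000_000))
```

Using pairs of bytes would need a table of 65,536 entries and half as many lookups.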

11.34 Index Definition in SQL
- Create a B-tree index (default in most databases):
    create index <index-name> on <relation-name> (<attribute-list>)
    -- create index b-index on branch(branch_name)
    -- create index ba-index on branch(branch_name, account)   -- concatenated index
    -- create index fa-index on branch(func(balance, amount))  -- function index
- Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key.
- Hash indexes: not supported by every database (but used implicitly in joins, ...)
  - PostgreSQL has them but discourages their use due to performance
- Create a bitmap index:
    create bitmap index <index-name> on <relation-name> (<attribute-list>)
  - For attributes with few distinct values
  - Mainly for decision support (queries) and not OLTP (bitmap indices do not support updates efficiently)
- To drop any index:
    drop index <index-name>

End of Chapter

11.36 Partitioned Hashing
- Hash values are split into segments that depend on each attribute of the search key (A1, A2, ..., An), for an n-attribute search key.
- Example: n = 2, for customer, with the search key being (customer-street, customer-city)
  - Search-key values shown: (Main, Harrison), (Main, Brooklyn), (Park, Palo Alto), (Spring, Brooklyn), (Alma, Palo Alto); the hash-value column did not survive extraction
- To answer an equality query on a single attribute, multiple buckets need to be looked up. Similar in effect to grid files.
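A minimal sketch of partitioned hashing for n = 2, with 3 bits per attribute. The two per-attribute hash functions below are toy, deterministic stand-ins invented for this illustration; they are not the hash values from the slide.

```python
SEGMENT_BITS = 3  # assumed width of each attribute's segment


def street_hash(street):
    # Hypothetical toy hash: sum of character codes, reduced to 3 bits.
    return sum(map(ord, street)) % (1 << SEGMENT_BITS)


def city_hash(city):
    # Hypothetical toy hash: length of the name, reduced to 3 bits.
    return len(city) % (1 << SEGMENT_BITS)


def bucket(street, city):
    """Concatenate the two segments: street bits high, city bits low."""
    return (street_hash(street) << SEGMENT_BITS) | city_hash(city)


def buckets_for_city(city):
    """Equality query on city alone: the street segment is unknown,
    so every value of that segment has to be tried."""
    low = city_hash(city)
    return [(high << SEGMENT_BITS) | low for high in range(1 << SEGMENT_BITS)]


if __name__ == "__main__":
    print(bucket("Main", "Brooklyn"))
    print(buckets_for_city("Brooklyn"))
```

A query that fixes both attributes touches one bucket; fixing only one leaves 2^3 candidate buckets here.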

11.37 Grid Files
- A structure used to speed up the processing of general multiple search-key queries involving one or more comparison operators.
- The grid file has a single grid array and one linear scale for each search-key attribute. The grid array has a number of dimensions equal to the number of search-key attributes.
- Multiple cells of the grid array can point to the same bucket
- To find the bucket for a search-key value, locate the row and column of its cell using the linear scales and follow the pointer

11.38 Example Grid File for account
[Figure: example grid file for the account relation; the diagram did not survive extraction]

11.39 Queries on a Grid File
- A grid file on two attributes A and B can handle queries of all the following forms with reasonable efficiency:
  - (a1 ≤ A ≤ a2)
  - (b1 ≤ B ≤ b2)
  - (a1 ≤ A ≤ a2 ∧ b1 ≤ B ≤ b2)
- E.g., to answer (a1 ≤ A ≤ a2 ∧ b1 ≤ B ≤ b2), use the linear scales to find the corresponding candidate grid array cells, and look up all the buckets pointed to from those cells.
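A range query over a two-attribute grid file can be sketched as below. This is a simplification made for illustration: every cell owns its own bucket (a real grid file lets several cells share one), and the class name `GridFile` and the sample scales are hypothetical.

```python
from bisect import bisect_right


class GridFile:
    def __init__(self, scale_a, scale_b):
        # Linear scales: sorted boundary values that cut each axis into ranges.
        self.scale_a = scale_a
        self.scale_b = scale_b
        rows, cols = len(scale_a) + 1, len(scale_b) + 1
        # Simplification: one bucket (list) per cell of the grid array.
        self.cells = [[[] for _ in range(cols)] for _ in range(rows)]

    def _cell(self, a, b):
        return bisect_right(self.scale_a, a), bisect_right(self.scale_b, b)

    def insert(self, a, b, record):
        i, j = self._cell(a, b)
        self.cells[i][j].append((a, b, record))

    def range_query(self, a1, a2, b1, b2):
        """Answer (a1 <= A <= a2) AND (b1 <= B <= b2)."""
        i1, j1 = self._cell(a1, b1)
        i2, j2 = self._cell(a2, b2)
        result = []
        for i in range(i1, i2 + 1):          # candidate rows from scale A
            for j in range(j1, j2 + 1):      # candidate columns from scale B
                for a, b, record in self.cells[i][j]:
                    if a1 <= a <= a2 and b1 <= b <= b2:
                        result.append(record)
        return result


if __name__ == "__main__":
    g = GridFile(scale_a=[10, 20], scale_b=[100])
    g.insert(5, 50, "r1")
    g.insert(15, 150, "r2")
    g.insert(25, 50, "r3")
    g.insert(12, 90, "r4")
    print(g.range_query(10, 20, 0, 200))
```

The scales only narrow the search to candidate cells; records in boundary cells still have to be checked against the query range.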

11.40 Grid Files (Cont.)
- During insertion, if a bucket becomes full, a new bucket can be created if more than one cell points to it.
  - The idea is similar to extendable hashing, but on multiple dimensions
  - If only one cell points to it, either an overflow bucket must be created or the grid size must be increased
- Linear scales must be chosen to uniformly distribute records across cells.
  - Otherwise there will be too many overflow buckets.
- Periodic re-organization to increase the grid size will help.
  - But reorganization can be very expensive.
- The space overhead of the grid array can be high.
- R-trees (Chapter 23) are an alternative

11.41 Linear Hashing
- A dynamic hashing scheme that handles the problem of long overflow chains without using a directory.
- The directory is avoided in LH by using temporary overflow pages and choosing the bucket to split in a round-robin fashion.
- When any bucket overflows, split the bucket that is currently pointed to by the "Next" pointer and then increment that pointer to the next bucket.

11.42 Linear Hashing - The Main Idea
- Use a family of hash functions h_0, h_1, h_2, ...
- h_i(key) = h(key) mod (2^i N)
  - N = initial number of buckets
  - h is some hash function
- h_(i+1) doubles the range of h_i (similar to directory doubling)

11.43 Linear Hashing (Contd.)
- The algorithm proceeds in `rounds'. The current round number is "Level".
- There are N_Level (= N * 2^Level) buckets at the beginning of a round
- Buckets 0 to Next-1 have been split; buckets Next to N_Level - 1 have not been split yet this round.
- The round ends when all initial buckets have been split (i.e. Next = N_Level).
- To start the next round: Level++; Next = 0;

11.44 LH Search Algorithm
- To find the bucket for data entry r, find h_Level(r):
  - If h_Level(r) >= Next (i.e., h_Level(r) is a bucket that hasn't been involved in a split this round), then r belongs in that bucket for sure.
  - Else, r could belong to bucket h_Level(r) or bucket h_Level(r) + N_Level; we must apply h_(Level+1)(r) to find out.
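The search rule can be written as a short address function. This is a sketch in which `h` is the already-computed hash value of the entry, and the function and parameter names are illustrative.

```python
def lh_bucket(h, level, next_ptr, n):
    """Bucket number for hash value h in a linear hash file.

    n is the initial number of buckets, so N_Level = n * 2**level.
    """
    n_level = n * 2**level
    b = h % n_level                 # h_Level
    if b < next_ptr:
        # Bucket b was already split this round: h_(Level+1) decides
        # between b and b + N_Level.
        b = h % (2 * n_level)
    return b


if __name__ == "__main__":
    # Level=0, N=4: before any split (Next=0) and after one split (Next=1).
    print(lh_bucket(44, 0, 0, 4), lh_bucket(9, 0, 0, 4))
    print(lh_bucket(44, 0, 1, 4), lh_bucket(9, 0, 1, 4))
```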

11.45 Example: Search 44 (101100), 9 (01001)
Level=0, Next=0, N=4
[Figure: primary pages with their h_1 and h_0 bit labels (this info is for illustration only). 00: 32*, 44*, 36*; 01: 9*, 25*, 5*; 10: 14*, 18*, 10*, 30*; 11: 31*, 35*, 7*, 11*]

11.46 Example: Search 44 (101100), 9 (01001)
Level=0, Next=1, N=4
[Figure: primary and overflow pages with their h_1 and h_0 bit labels (this info is for illustration only). 000: 32*; 001: 9*, 25*, 5*; 010: 14*, 18*, 10*, 30*; 011: 31*, 35*, 7*, 11*, overflow page 43*; 100: 44*, 36*]

11.47 Linear Hashing - Insert
- Find the appropriate bucket
- If the bucket to insert into is full:
  - Add an overflow page and insert the data entry.
  - Split the Next bucket and increment Next.
    - Note: this is likely NOT the bucket being inserted into!
    - To split a bucket, create a new bucket and use h_(Level+1) to re-distribute entries.
- Since buckets are split round-robin, long overflow chains don't develop!
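The insert procedure can be sketched as a small in-memory class. It is illustrative only: keys stand for their own hash values, each bucket is one Python list in which entries beyond `capacity` play the role of overflow pages, and the class name `LinearHash` is hypothetical.

```python
class LinearHash:
    def __init__(self, n=4, capacity=4):
        self.n = n                # initial number of buckets
        self.capacity = capacity  # entries per primary page
        self.level = 0
        self.next = 0
        self.buckets = [[] for _ in range(n)]

    def _address(self, h):
        n_level = self.n << self.level
        b = h % n_level
        if b < self.next:         # already split this round
            b = h % (2 * n_level)
        return b

    def search(self, h):
        return h in self.buckets[self._address(h)]

    def insert(self, h):
        bucket = self.buckets[self._address(h)]
        was_full = len(bucket) >= self.capacity
        bucket.append(h)          # goes to an overflow page if was_full
        if was_full:
            self._split()

    def _split(self):
        """Split the bucket at Next (usually not the one that overflowed)."""
        n_level = self.n << self.level
        old = self.buckets[self.next]
        # The new bucket lands at index next + n_level; h_(Level+1) redistributes.
        self.buckets.append([e for e in old if e % (2 * n_level) != self.next])
        self.buckets[self.next] = [e for e in old if e % (2 * n_level) == self.next]
        self.next += 1
        if self.next == n_level:  # round ends: all initial buckets were split
            self.level += 1
            self.next = 0


if __name__ == "__main__":
    lh = LinearHash()
    for h in (32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7, 11):
        lh.insert(h)
    lh.insert(43)  # bucket 3 overflows, but bucket 0 is the one that splits
    print(lh.next, lh.buckets)
```

Inserting 43 into the full bucket 3 leaves 43 on an overflow page there, while bucket 0 is split into buckets 0 and 4.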

11.48 Example: Insert 43 (101011)
[Figure: primary and overflow pages with their h_1 and h_0 bit labels (this info is for illustration only). Before the insert (Level=0, N=4, Next=0): 00: 32*, 44*, 36*; 01: 9*, 25*, 5*; 10: 14*, 18*, 10*, 30*; 11: 31*, 35*, 7*, 11*. After the insert (Level=0, Next=1): 000: 32*; 001: 9*, 25*, 5*; 010: 14*, 18*, 10*, 30*; 011: 31*, 35*, 7*, 11*, overflow page 43*; 100: 44*, 36*]

11.49 Example: End of a Round
Insert 50 (110010).
[Figure: primary and overflow pages with their h_1 and h_0 bit labels. Before the insert (Level=0, Next=3): 000: 32*; 001: 9*, 25*; 010: 66*, 18*, 10*, 34*; 011: 31*, 35*, 7*, 11*, overflow page 43*; 100: 44*, 36*; 101: 5*, 37*, 29*; 110: 14*, 30*, 22*. After the insert (Level=1, Next=0): 000: 32*; 001: 9*, 25*; 010: 66*, 18*, 10*, 34*, overflow page 50*; 011: 43*, 35*, 11*; 100: 44*, 36*; 101: 5*, 37*, 29*; 110: 14*, 30*, 22*; 111: 31*, 7*]