CS143: Index

Presentation transcript:

1 CS143: Index

2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering index) –Tree-based vs. hash-based index Tree-based index –Indexed sequential file –B+-tree Hash-based index –Static hashing –Extendible hashing

3 Basic Problem SELECT * FROM Student WHERE sid = 40 How can we answer the query?
sid  name    GPA
20   Elaine  3.2
70   Peter   2.6
40   Susan   3.7

4 Random-Order File How do we find sid=40?
sid  name     GPA
20   Susan    3.5
60   James    1.7
70   Peter    2.6
40   Elaine   3.9
30   Christy  2.9

5 Sequential File Table sequenced by sid. Find sid=40?
sid  name     GPA
20   Susan    3.5
30   James    1.7
40   Peter    2.6
50   Elaine   3.9
60   Christy  2.9

6 Binary Search 100,000 records Q: How many blocks to read? Any better way? –In a library, how do we find a book?

7 Basic Idea Build an “index” on the table –An auxiliary structure to help us locate a record given a “key”

8 Dense, Primary Index Primary index (clustering index) –Index on the search key Dense index –(key, pointer) pair for every record Find the key from index and follow pointer –Maybe through binary search Q: Why dense index? –Isn’t binary search on the file the same? Dense Index Sequential File

9 Why Dense Index? Example –10,000,000 records (900-bytes/rec) –4-byte search key, 4-byte pointer –4096-byte block. Unspanned tuples Q: How many blocks for table (how big)? Q: How many blocks for index (how big)?
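The two questions on this slide can be worked through with a few lines of arithmetic. The sketch below uses only the numbers stated on the slide; the binary-search estimate at the end is an added illustration (not on the slide) of why searching the much smaller index is cheaper than searching the file itself.

```python
import math

RECORDS = 10_000_000
RECORD_BYTES = 900
KEY_BYTES = 4
POINTER_BYTES = 4
BLOCK_BYTES = 4096

# Unspanned tuples: a record may not straddle two blocks, so round down.
records_per_block = BLOCK_BYTES // RECORD_BYTES                  # 4
table_blocks = math.ceil(RECORDS / records_per_block)            # 2,500,000

# Dense index: one (key, pointer) entry per record.
entries_per_block = BLOCK_BYTES // (KEY_BYTES + POINTER_BYTES)   # 512
index_blocks = math.ceil(RECORDS / entries_per_block)            # 19,532

table_bytes = table_blocks * BLOCK_BYTES   # 10,240,000,000 bytes, about 10 GB
index_bytes = index_blocks * BLOCK_BYTES   # 80,003,072 bytes, about 80 MB

# Added illustration: approximate worst-case block reads for binary search.
table_probes = math.ceil(math.log2(table_blocks))   # 22
index_probes = math.ceil(math.log2(index_blocks))   # 15
```

The index is roughly 128 times smaller than the table, so it is far more likely to fit in memory, and a binary search over it touches fewer blocks.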

10 Sparse, Primary Index Sparse index –(key, pointer) pair per every “block” –(key, pointer) pair points to the first record in the block Q: How can we find 60? Sequential File Sparse Index
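A minimal sketch of a sparse-index lookup, assuming a small hypothetical file (the block contents below are invented for illustration, since the slide's figure is not part of the transcript). The idea is to find the last index entry whose key is less than or equal to the search key, then scan only that block.

```python
from bisect import bisect_right

# Hypothetical sequential file: each inner list is one block, sorted by key.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]

# Sparse index: one entry per block, holding the first key in that block.
sparse_index = [block[0] for block in blocks]

def find(key):
    """Return (block_number, position_in_block) for key, or None if absent."""
    # Position of the last index entry whose key is <= the search key.
    b = bisect_right(sparse_index, key) - 1
    if b < 0:
        return None          # key is smaller than every key in the file
    block = blocks[b]        # the only block that can contain the key
    if key in block:
        return b, block.index(key)
    return None
```

With this data, `find(60)` consults the index entry for 50 and then reads a single data block.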

11 Multi-level index (figure: sequential file, 1st-level index, sparse 2nd-level index) Q: Why multi-level index? Q: Does dense, 2nd level index make sense?

12 Secondary (non-clustering) Index Secondary (non-clustering) index –When tuples in the table are not ordered by the index search key Index on a non-search-key for sequential file Unordered file Q: What index? –Does sparse index make sense? Sequence field

13 Sparse and secondary index?

14 Secondary index First level is always dense. Sparse from the second level (sparse high level)

15 Important terms Dense index vs. sparse index Primary index vs. secondary index –Clustering index vs. non-clustering index Multi-level index Indexed sequential file –Sometimes called ISAM (indexed sequential access method) Search key (≠ primary key)

16 Insertion Insert 35 Q: Do we need to update higher-level index?

17 Insertion Insert 15 Q: Do we need to update higher-level index?

18 Insertion Insert 15 Q: Do we need to update higher-level index?

19 Potential performance problem After many insertions… overflow pages (not sequential) Main index

20 Traditional Index (ISAM) Advantage –Simple –Sequential blocks Disadvantage –Not suitable for updates –Becomes ugly (loses sequentiality and balance) over time

21 B+Tree Most popular index structure in RDBMS Advantage –Suitable for dynamic updates –Balanced –Minimum space usage guarantee Disadvantage –Non-sequential index blocks

22 B+Tree Example (n=3) (figure: root, non-leaf, and leaf nodes; leaf pointers lead to tuples such as those below) Balanced: All leaf nodes are at the same level
20  Susan  2.7
30  James  3.6
50  Peter  1.8
…

23 Sample Leaf Node (n=3) n: max # of pointers in a node. All pointers (except the last one) point to tuples. Last pointer: to the next leaf node. At least half of the pointers are used (more precisely, ⌈(n+1)/2⌉ pointers). (figure: a leaf reached from a non-leaf node, with pointers to the tuples 20 Susan 2.7, 30 James 3.6, 50 Peter 1.8)

24 Sample Non-leaf Node (n=3) Points to the nodes one level below - No direct pointers to tuples. At least half of the ptrs used (precisely, ⌈n/2⌉) - except root, where at least 2 ptrs used. Example with keys 23 and 56: first pointer to keys k < 23, second pointer to keys 23 ≤ k < 56, third pointer to keys 56 ≤ k

25 Find a greater key and follow the link on the left (Algorithm: Figure on textbook) Find 30, 60, 70? Search on B+tree
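The search rule on this slide ("find a greater key and follow the link on the left") can be sketched as follows. The `Node` class and the small two-level tree are hypothetical and built only for illustration; they are not the tree from the slide's figure, and the textbook algorithm may differ in detail.

```python
from bisect import bisect_right

class Node:
    """A B+tree node; `children` is None for a leaf."""
    def __init__(self, keys, children=None, records=None):
        self.keys = keys          # sorted search keys
        self.children = children  # non-leaf: len(keys) + 1 child nodes
        self.records = records    # leaf: one record per key

def search(root, key):
    """Return the record stored under `key`, or None if it is absent."""
    node = root
    while node.children is not None:
        # bisect_right gives the position of the first key greater than
        # `key`; the child at that position is the link on its left.
        # If no key is greater, this selects the last child.
        node = node.children[bisect_right(node.keys, key)]
    if key in node.keys:
        return node.records[node.keys.index(key)]
    return None

# Hypothetical tree with n = 3 (at most 3 pointers per node).
leaf1 = Node([20, 30], records=["Susan", "James"])
leaf2 = Node([50, 60], records=["Peter", "Elaine"])
leaf3 = Node([70, 80], records=["Christy", "Neil"])
root = Node([50, 70], children=[leaf1, leaf2, leaf3])
```

In this hypothetical tree, `search(root, 30)`, `search(root, 60)`, and `search(root, 70)` each read the root and exactly one leaf.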

26 Nodes are never too empty. Use at least: Non-leaf: ⌈n/2⌉ pointers. Leaf: ⌈(n+1)/2⌉ pointers. (figure: a full node and a minimum node, for both non-leaf and leaf)

27 Number of Ptrs/Keys for B+tree
                     Max ptrs  Max keys  Min ptrs   Min keys
Non-leaf (non-root)  n         n-1       ⌈n/2⌉      ⌈n/2⌉-1
Leaf (non-root)      n         n-1       ⌈(n+1)/2⌉  ⌈(n-1)/2⌉
Root                 n         n-1       2          1
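As a quick check of these bounds, the sketch below evaluates them for a given n. The function name `bplus_bounds` and the dictionary layout are illustrative choices, not part of the slides.

```python
import math

def bplus_bounds(n):
    """Pointer/key limits per node type, where n = max pointers per node."""
    return {
        "non_leaf": {
            "max_ptrs": n, "max_keys": n - 1,
            "min_ptrs": math.ceil(n / 2),
            "min_keys": math.ceil(n / 2) - 1,
        },
        "leaf": {
            "max_ptrs": n, "max_keys": n - 1,
            "min_ptrs": math.ceil((n + 1) / 2),   # includes the next-leaf pointer
            "min_keys": math.ceil((n - 1) / 2),
        },
        "root": {"max_ptrs": n, "max_keys": n - 1, "min_ptrs": 2, "min_keys": 1},
    }
```

For n = 4 this gives a minimum of 2 pointers for a non-leaf node and 3 for a leaf, which are the underflow thresholds used in the deletion examples.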

28 (a) simple case (no overflow) (b) leaf overflow (c) non-leaf overflow (d) new root B+Tree Insertion

29 (a) Simple case (no overflow)

30 Insert 60 Insertion (Simple Case)

31 Insert 60 Insertion (Simple Case)

32 (b) Leaf overflow

33 Insert 55 No space to store 55 Insertion (Leaf Overflow)

34 Insertion (Leaf Overflow) Insert 55 Overflow! Split the leaf into two. Put the keys half and half

35 Insert 55 Insertion (Leaf Overflow)

36 Insert 55 Copy the first key of the new node to parent Insertion (Leaf Overflow)

37 Insert 55 Insertion (Leaf Overflow) Q: After split, leaf nodes always half full? No overflow. Stop

38 (c) Non-leaf overflow

39 Insertion (Non-leaf Overflow) Insert Leaf overflow. Split and copy the first key of the new node 60 70

40 Insertion (Non-leaf Overflow) Insert

41 Insertion (Non-leaf Overflow) Insert

42 Insertion (Non-leaf Overflow) Insert Overflow!

43 Insertion (Non-leaf Overflow) Insert Split the node into two. Move up the key in the middle

44 Insertion (Non-leaf Overflow) Insert Middle key

45 Insertion (Non-leaf Overflow) Insert No overflow. Stop Q: After split, non-leaf at least half full?

46 (d) New root

47 Insertion (New Root Node) Insert

48 Insertion (New Root Node) Insert Overflow! 60 30

49 Insertion (New Root Node) Insert Split and move up the mid-key. Create new root 60 30

50 Insertion (New Root Node) Insert Q: At least 2 ptrs at root?

51 B+Tree Insertion Leaf node overflow –The first key of the new node is copied to the parent Non-leaf node overflow –The middle key is moved to the parent Detailed algorithm: Figure 12.13
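The difference between the two overflow cases (copy for a leaf, move for a non-leaf) can be sketched on plain key lists. The helper names and the sample keys are hypothetical, and giving the extra key to the left half on an odd split is an assumption; the detailed textbook algorithm may split differently.

```python
def split_leaf(keys):
    """Split an overflowing leaf; the first key of the new node is COPIED up."""
    half = (len(keys) + 1) // 2          # assumption: left half gets the extra key
    left, right = keys[:half], keys[half:]
    return left, right, right[0]         # right[0] stays in the leaf as well

def split_non_leaf(keys, ptrs):
    """Split an overflowing non-leaf node; the middle key is MOVED up."""
    mid = len(keys) // 2
    up = keys[mid]                       # this key leaves the node entirely
    left = (keys[:mid], ptrs[:mid + 1])
    right = (keys[mid + 1:], ptrs[mid + 1:])
    return left, right, up
```

With n = 3, a leaf that temporarily holds `[50, 55, 60]` splits into `[50, 55]` and `[60]` with 60 copied to the parent, while a non-leaf node holding keys `[30, 60, 70]` splits into `[30]` and `[70]` with 60 moved to the parent.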

52 B+Tree Deletion (a) Simple case (no underflow) (b) Leaf node, coalesce with neighbor (c) Leaf node, redistribute with neighbor (d) Non-leaf node, coalesce with neighbor (e) Non-leaf node, redistribute with neighbor In the examples, n = 4 –Underflow for non-leaf when fewer than ⌈n/2⌉ = 2 ptrs –Underflow for leaf when fewer than ⌈(n+1)/2⌉ = 3 ptrs –Nodes are labeled as a, b, c, d, …

53 (a) Simple case (no underflow)

54 (a) Simple case Delete 25

55 (a) Simple case Delete 25 –Underflow? Min 3 ptrs. Currently 3 ptrs a bcde Underflow? 40 50

56 (b) Leaf node, coalesce with neighbor

57 (b) Coalesce with sibling (leaf) Delete 50

58 (b) Coalesce with sibling (leaf) Delete 50 –Underflow? Min 3 ptrs, currently bcd a Underflow? e

59 (b) Coalesce with sibling (leaf) Delete 50 –Try to merge with a sibling bcd a underflow! Can be merged? e

60 (b) Coalesce with sibling (leaf) Delete 50 –Merge c and d. Move everything on the right to the left bcd a Merge e

61 (b) Coalesce with sibling (leaf) Delete 50 –Once everything is moved, delete d bcd a e

62 (b) Coalesce with sibling (leaf) Delete 50 –After leaf node merge, From its parent, delete the pointer and key to the deleted node bc d e a

63 (b) Coalesce with sibling (leaf) Delete 50 –Check underflow at a. Min 2 ptrs, currently bc a Underflow? e

64 (c) Leaf node, redistribute with neighbor

65 (c) Redistribute (leaf) Delete 50

66 (c) Redistribute (leaf) Delete 50 –Underflow? Min 3 ptrs, currently 2 –Check if d can be merged with its sibling c or e –If not, redistribute the keys in d with a sibling Say, with c bcde a Underflow? Can be merged?

67 (c) Redistribute (leaf) Delete 50 –Redistribute c and d, so that nodes c and d are roughly “half full” Move the key 30 and its tuple pointer to the d bcde a Redistribute

68 (c) Redistribute (leaf) Delete 50 –Update the key in the parent bcde a 30 40

69 (c) Redistribute (leaf) Delete 50 –No underflow at a. Done bcde a 30 Underflow?

70 (d) Non-leaf node, coalesce with neighbor

71 (d) Coalesce (non-leaf) Delete 20 –Underflow! Merge d with e. Move everything in the right to the left 70 a bc defg

72 (d) Coalesce (non-leaf) Delete 20 –From the parent node, delete pointer and key to the deleted node 70 a bc defg

73 (d) Coalesce (non-leaf) Delete 20 –Underflow at b ? Min 2 ptrs, currently 1. –Try to merge with its sibling. Nodes b and c : 3 ptrs in total. Max 4 ptrs. Merge b and c. 70 a bc dfg underflow! Can be merged?

74 (d) Coalesce (non-leaf) Delete 20 –Merge b and c Pull down the mid-key 50 in the parent node Move everything in the right node to the left. Very important: when we merge non-leaf nodes, we always pull down the mid-key in the parent and place it in the merged node. 70 a bc dfg merge

75 (d) Coalesce (non-leaf) Delete 20 –Merge b and c Pull down the mid-key 50 in the parent node Move everything in the right node to the left. Very important: when we merge non-leaf nodes, we always pull down the mid-key in the parent and place it in the merged node. 70 bc dfg a

76 (d) Coalesce (non-leaf) 70 a bc dfg Delete 20 –Delete pointer to the merged node

77 (d) Coalesce (non-leaf) 70 a b dfg Delete 20 –Underflow at a ? Min 2 ptrs. Currently 2. Done

78 (e) Non-leaf node, redistribute with neighbor

79 (e) Redistribute (non-leaf) Delete 20 –Underflow! Merge d with e a bc defg

80 (e) Redistribute (non-leaf) Delete 20 –After merge, remove the key and ptr to the deleted node from the parent a bc defg

81 (e) Redistribute (non-leaf) Delete 20 –Underflow at b ? Min 2 ptrs, currently 1. –Merge b with c ? Max 4 ptrs, 5 ptrs in total. –If cannot be merged, redistribute the keys with a sibling. Redistribute b and c a bc dfg underflow! Can be merged?

82 (e) Redistribute (non-leaf) Delete 20 Redistribution at a non-leaf node is done in two steps. Step 1: Temporarily, make the left node b “overflow” by pulling down the mid-key and moving everything to the left a bc dfg redistribute

83 (e) Redistribute (non-leaf) Delete 20 Step 2: Apply the “overflow handling algorithm” (the same algorithm used for B+tree insertion) to the overflowed node –Detailed algorithm in the next slide a bc dfg redistribute 97 temporary overflow

84 (e) Redistribute (non-leaf) Delete 20 Step 2: “overflow handling algorithm” –Pick the mid-key (say 90) in the node and move it to parent. –Move everything to the right of 90 to the empty node c a bc dfg redistribute

85 (e) Redistribute (non-leaf) Delete 20 –Underflow at a ? Min 2 ptrs, currently 3. Done 70 a bc dfg

86 Important Points Remember: –For leaf node merging, we delete the mid-key from the parent –For non-leaf node merging/redistribution, we pull down the mid-key from their parent. Exact algorithm: Figure In practice –Coalescing is often not implemented Too hard and not worth it

87 Where does n come from? n determined by –Size of a node –Size of search key –Size of an index pointer Q: 1024B node, 10B key, 8B ptr → n?
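One way to answer the question, assuming the node stores nothing but n pointers and n-1 keys (no header or other per-node overhead, which a real system would have): the largest n must satisfy n·ptr + (n-1)·key ≤ node size.

```python
def max_pointers(node_bytes, key_bytes, ptr_bytes):
    """Largest n such that n pointers and n-1 keys fit in one node.

    From n*ptr + (n-1)*key <= node:  n <= (node + key) / (key + ptr).
    Assumes no node header or other overhead.
    """
    return (node_bytes + key_bytes) // (key_bytes + ptr_bytes)

n = max_pointers(1024, 10, 8)   # 57
```

Check: 57 pointers and 56 keys take 57·8 + 56·10 = 1016 bytes, while n = 58 would need 1034 bytes and no longer fits.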

88 Question on B+tree SELECT * FROM Student WHERE sid > 60?
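A range query like this one relies on the last pointer of each leaf: after locating the leaf where 60 belongs, the scan follows the next-leaf pointers to the end of the file. The sketch below shows only that leaf-level scan; the `Leaf` class and the key values are hypothetical, and the descent from the root to the starting leaf is assumed to have happened already.

```python
class Leaf:
    """A B+tree leaf with sorted keys and a pointer to the next leaf."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

def scan_greater(start, bound):
    """Collect every key > bound, starting at leaf `start`."""
    result = []
    leaf = start
    while leaf is not None:
        result.extend(k for k in leaf.keys if k > bound)
        leaf = leaf.next          # follow the last pointer of the leaf
    return result

# Hypothetical leaf level of a B+tree on sid.
l1, l2, l3 = Leaf([20, 30]), Leaf([50, 60]), Leaf([70, 80])
l1.next, l2.next = l2, l3

# Assume the tree search for 60 ended at l2.
matches = scan_greater(l2, 60)
```

Because the leaves are chained in key order, no non-leaf node needs to be revisited during the scan.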

89 Summary on tree index Issues to consider –Sparse vs. dense –Primary (clustering) vs. secondary (non-clustering) Indexed sequential file (ISAM) –Simple algorithm. Sequential blocks –Not suitable for dynamic environment B+trees –Balanced, minimum space guarantee –Insertion, deletion algorithms

90 Index Creation in SQL CREATE INDEX <indexname> ON <table>(<attr>, <attr>, …) Example –CREATE INDEX stidx ON Student(sid) Creates a B+tree on the attributes Speeds up lookup on sid

91 Primary (Clustering) Index MySQL: –Primary key becomes the clustering index DB2: –CREATE INDEX idx ON Student(sid) CLUSTER –Tuples in the table are sequenced by sid Oracle: Index-Organized Table (IOT) –CREATE TABLE T (... ) ORGANIZATION INDEX –B+tree on primary key –Tuples are stored at the leaf nodes of B+tree Periodic reorganization may still be necessary to improve range scan performance

92 Next topic Hash index –Static hashing –Extendible hashing

93 What is a Hash Table? Hash Table –Hash function h(k): key → integer [0…n] e.g., h(‘Susan’) = 7 –Array for keys: T[0…n] –Given a key k, store it in T[h(k)] Example: h(Susan) = 4, h(James) = 3, h(Neil) = 1
slot  key
0
1     Neil
2
3     James
4     Susan
5

94 Hashing for DBMS (Static Hashing) search key → h(key) → disk blocks (buckets) holding (key, record) entries

95 Overflow and Chaining Insert: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0, h(e) = 1. Delete: h(b) = 2, h(c) = 1
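A minimal sketch of static hashing with overflow chaining, using the hash values listed on this slide. The bucket count (4) and block capacity (2 keys) are assumptions, since the figure is not part of the transcript, and the class names are illustrative. Deletion here simply removes the key; it does not compact the chain or free an emptied overflow block.

```python
class Block:
    """One disk block holding up to `capacity` keys, plus an overflow link."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []
        self.overflow = None

class StaticHash:
    def __init__(self, num_buckets, capacity, h):
        self.h = h
        self.buckets = [Block(capacity) for _ in range(num_buckets)]

    def insert(self, key):
        block = self.buckets[self.h(key)]
        # Walk the chain until a block with free space is found.
        while len(block.keys) >= block.capacity:
            if block.overflow is None:
                block.overflow = Block(block.capacity)
            block = block.overflow
        block.keys.append(key)

    def delete(self, key):
        block = self.buckets[self.h(key)]
        while block is not None:
            if key in block.keys:
                block.keys.remove(key)
                return True
            block = block.overflow
        return False

    def chain(self, bucket):
        """Keys of the primary block and each overflow block, in order."""
        out, block = [], self.buckets[bucket]
        while block is not None:
            out.append(list(block.keys))
            block = block.overflow
        return out

# Hash values from the slide; 4 buckets and 2 keys per block are assumed.
slide_hash = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}
table = StaticHash(num_buckets=4, capacity=2, h=slide_hash.__getitem__)
for key in "abcde":
    table.insert(key)
```

After the five insertions, bucket 1 holds a and c in its primary block and e in an overflow block, which is the situation the slide's title refers to.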

96 Major Problem of Static Hashing How to cope with growth? –Data tends to grow in size –Overflow blocks unavoidable overflow blocks hash buckets

97 Extendible Hashing (two ideas) (a) Use i of b bits output by hash function: h(K) produces b bits; use the first i bits, where i grows over time

98 (b) Use directory that maintains pointers to hash buckets (indirection) Extendible Hashing (two ideas) c e hash bucket directory h(c)

99 Example h(k) is 4 bits; 2 keys/bucket Insert i = 0111

100 Example Insert i = overflow! Increase i of the bucket. Split it.

101 Example Insert overflow! 2 Redistribute keys based on first i bits i =

102 Example Insert Update ptr in dir to new bkt i = ? If no space, double directory size (increase i)

103 Example Insert i = i = Copy pointers

104 Example Insert

105 Example Insert i = 0000 Overflow! Split bucket and increase i

106 Example Insert i = Overflow! Redistribute keys

107 Example Insert i = Update ptr in directory

108 Example Insert i =

109 Insert i = Overflow! Split bucket, increase i, redistribute keys

110 Insert 0011 Update ptr in dir. If no space, double directory

111 Insert

112 Insert
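The insertion steps walked through in the example slides (split the bucket and increase its i, redistribute keys on the first i bits, update the directory pointer, double the directory when there is no space) can be sketched as below. This is an illustrative sketch, not the course's reference algorithm: the class and method names are made up, keys are represented directly by their 4-bit hash values, and the inserted values are hypothetical because the slides' full key sequence is not recoverable from the transcript.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth   # the bucket's own i (number of leading bits it covers)
        self.keys = []

class ExtendibleHash:
    def __init__(self, bits=4, capacity=2):
        self.bits = bits             # b: bits produced by the hash function
        self.capacity = capacity     # keys per bucket
        self.i = 0                   # directory i: leading bits currently in use
        self.directory = [Bucket(0)]

    def _slot(self, h):
        return h >> (self.bits - self.i)   # first i bits of the hash value

    def contains(self, h):
        return h in self.directory[self._slot(h)].keys

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(h)
            return
        if bucket.depth == self.i:
            if self.i == self.bits:
                # All b bits are used; a real system would chain an overflow block.
                raise OverflowError("cannot split further")
            # Double the directory: every pointer is copied to two adjacent slots.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.i += 1
        # Split the bucket and increase its i.
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        pending = bucket.keys + [h]
        bucket.keys = []
        # Update directory pointers: slots whose newly used bit is 1 move over.
        for slot, b in enumerate(self.directory):
            if b is bucket and (slot >> (self.i - bucket.depth)) & 1:
                self.directory[slot] = new_bucket
        # Redistribute keys based on the first i bits.
        for k in pending:
            self.insert(k)

# Hypothetical insertion sequence of 4-bit hash values, 2 keys per bucket.
eh = ExtendibleHash(bits=4, capacity=2)
for value in (0b0111, 0b1010, 0b1100, 0b1001, 0b0000, 0b0011):
    eh.insert(value)
```

In this run the directory doubles twice (on inserting 1100 and 1001), while the final insert of 0011 splits a bucket without doubling because that bucket's i was still smaller than the directory's i.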

113 Extendible Hashing: Deletion Two options a)No merging of buckets b)Merge buckets and shrink directory if possible

114 Delete i = a b c 1010

115 Delete 1010 Can we merge a and b? b and c? i = a b c

116 Delete 1010 Decrease i and merge buckets i = a b c 1 Update ptr in directory Q: Can we shrink directory?

117 Delete i = a b i =

118 Bucket Merge Condition Bucket merge condition –Bucket i’s are the same –First (i-1) bits of the hash key are the same Directory shrink condition –All bucket i’s are smaller than the directory i

119 Questions on Extendible Hashing Can we provide minimum space guarantee?

120 Space Waste i = 3

121 Hash index summary Static hashing –Overflow and chaining Extendible hashing –Can handle growing files No periodic reorganizations –Indirection Up to 2 disk accesses to access a key –Directory doubles in size Not too bad if the data is not too large

122 Hashing vs. Tree Can an extendible-hash index support? SELECT * FROM R WHERE R.A > 5 Which one is better, B+tree or Extendible hashing? SELECT * FROM R WHERE R.A = 5