CS 245: Database System Principles
Notes 5: Hashing and More
Hector Garcia-Molina

Hashing
key -> h(key) -> buckets (typically 1 disk block each)

Two alternatives
(1) key -> h(key) -> bucket that holds the records themselves
(2) key -> h(key) -> index bucket whose entries (key, pointer) lead to the records
Alt (2) is for a “secondary” search key

Example hash function
- Key = ‘x1 x2 … xn’, an n-byte character string
- Have b buckets
- h: add x1 + x2 + … + xn, then compute the sum modulo b

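A minimal sketch of this hash function, assuming the key is encoded as UTF-8 bytes (the function name `h` follows the slides; the encoding is an assumption made for illustration):

```python
def h(key: str, b: int) -> int:
    """Add the byte values of the key and reduce the sum modulo b buckets."""
    return sum(key.encode("utf-8")) % b


bucket = h("abc", 4)       # (97 + 98 + 99) % 4
same = h("cba", 4)         # any permutation of the same bytes gives the same sum
```

Because addition ignores byte order, all anagrams of a key land in the same bucket, which is one reason this may not be the best function.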
This may not be the best function …
Read Knuth Vol. 3 if you really need to select a good function.
Good hash function: the expected number of keys/bucket is the same for all buckets

Within a bucket:
- Do we keep keys sorted?
- Yes, if CPU time is critical & inserts/deletes are not too frequent

Next: example to illustrate inserts, overflows, deletes

EXAMPLE: 2 records/bucket, buckets 0 to 3
INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0
- Bucket 0 holds d; bucket 1 holds a, c; bucket 2 holds b; bucket 3 is empty
- Then h(e) = 1: bucket 1 is full, so e goes into an overflow block chained to bucket 1

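The insert example can be sketched as follows. This is an illustration under stated assumptions, not the course's implementation: the class name `StaticHashFile` is hypothetical, each bucket is modelled as a list of blocks (the first is the primary block, the rest are overflow blocks), and the hash values are simply looked up from the table given on the slide.

```python
class StaticHashFile:
    """Fixed number of buckets; each bucket is a chain of blocks."""

    def __init__(self, num_buckets, capacity, h):
        self.capacity = capacity          # records per block
        self.h = h                        # hash function: key -> bucket number
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, key):
        chain = self.buckets[self.h(key)]
        for block in chain:
            if len(block) < self.capacity:
                block.append(key)         # room in an existing block
                return
        chain.append([key])               # all blocks full: add an overflow block

    def lookup(self, key):
        return any(key in block for block in self.buckets[self.h(key)])


# Hash values taken from the slide: h(a)=1, h(b)=2, h(c)=1, h(d)=0, h(e)=1.
slide_hash = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}
f = StaticHashFile(num_buckets=4, capacity=2, h=slide_hash.__getitem__)
for key in "abcde":
    f.insert(key)
```

After the five inserts, bucket 1 consists of a full primary block and one overflow block holding e.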
EXAMPLE: deletion
(Figure: buckets 0 to 3 holding keys a through g, with overflow blocks.)
- Delete: e, f
- After deleting f, maybe move “g” up

Rule of thumb:
- Try to keep space utilization between 50% and 80%
- Utilization = (# keys used) / (total # keys that fit)
- If < 50%, wasting space
- If > 80%, overflows significant; depends on how good the hash function is & on # keys/bucket

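As a small worked example of the rule of thumb, the sketch below computes utilization for a file with 4 buckets of 2 keys each holding 5 keys. The function names are hypothetical, and it assumes "keys that fit" counts primary blocks only.

```python
def utilization(keys_used, num_buckets, keys_per_bucket):
    """Utilization = (# keys used) / (total # keys that fit)."""
    return keys_used / (num_buckets * keys_per_bucket)


def assess(u):
    """Apply the 50% / 80% rule of thumb."""
    if u < 0.5:
        return "wasting space"
    if u > 0.8:
        return "overflows significant"
    return "ok"


u = utilization(keys_used=5, num_buckets=4, keys_per_bucket=2)
verdict = assess(u)
```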
How do we cope with growth?
- Overflows and reorganizations
- Dynamic hashing
  - Extensible
  - Linear

Extensible hashing: two ideas
(a) Use i of the b bits output by the hash function, e.g. h(K) = 00110101; i grows over time
(b) Use a directory: h(K)[i] selects a directory entry, which points to a bucket

Example: h(k) is 4 bits; 2 keys/bucket
- Start with i = 1: directory entry 0 points to a bucket holding 0001; entry 1 points to a bucket holding 1001, 1100. Both buckets use 1 bit.
- Insert 1010: the bucket for entry 1 is full, so it splits into two buckets that use 2 bits, one holding 1001, 1010 and one holding 1100
- New directory: i = 2, with entries 00, 01, 10, 11; entries 00 and 01 both point to the bucket holding 0001

Example continued
- Insert 0111, 0000
- 0111 joins 0001; 0000 then overflows that bucket, which splits into two buckets that use 2 bits, one holding 0000, 0001 and one holding 0111
- The directory stays at i = 2

Example continued
- Insert 1001
- The bucket holding 1001, 1010 is full and already uses 2 bits, so the directory doubles to i = 3 (entries 000 through 111)
- The bucket splits on 3 bits: one bucket holds 1001, 1001 and the other holds 1010

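The example can be replayed with a short sketch. It is a simplified illustration, not a definitive implementation: the class names `Bucket` and `ExtensibleHash` are hypothetical, keys are given directly as 4-bit hash values, the directory uses the i high-order bits as in the example, and the case that needs an overflow chain (a bucket that already uses all b bits) just raises an error.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth      # number of leading bits this bucket uses
        self.keys = []


class ExtensibleHash:
    def __init__(self, bits=4, capacity=2):
        self.bits = bits                # b: length of h(k) in bits
        self.capacity = capacity        # keys per bucket
        self.i = 1                      # bits currently used by the directory
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, hv):
        return hv >> (self.bits - self.i)    # the i high-order bits

    def insert(self, hv):
        bucket = self.directory[self._index(hv)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(hv)
            return
        if bucket.depth == self.bits:
            raise OverflowError("all bits used; an overflow chain is needed")
        if bucket.depth == self.i:
            # Double the directory: each old entry becomes two adjacent entries.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.i += 1
        # Split the full bucket on one more bit.
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        old_keys, bucket.keys = bucket.keys, []
        for idx, b in enumerate(self.directory):
            # Entries whose newly used bit is 1 now point to the sibling.
            if b is bucket and (idx >> (self.i - bucket.depth)) & 1:
                self.directory[idx] = sibling
        for k in old_keys:
            self.directory[self._index(k)].keys.append(k)
        self.insert(hv)                 # retry; may trigger another split


table = ExtensibleHash(bits=4, capacity=2)
for hv in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001):
    table.insert(hv)
```

After these inserts the directory has 8 entries (i = 3); entries 100 and 101 point to separate buckets, while the other buckets are each shared by two entries.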
Extensible hashing: deletion
- No merging of blocks, or
- Merge blocks and cut directory if possible (reverse insert procedure)

Deletion example: run through the insert example in reverse!

Note: still need overflow chains
Example: many records with duplicate keys
- A bucket holds 1101, 1100; insert 1100
- If we split: records with the same key can never be separated, no matter how many bits are used

Solution: overflow chains
- Instead of splitting, add an overflow block to the bucket holding 1101, 1100 and put the new 1100 there

Summary: extensible hashing
(+) Can handle growing files
    - with less wasted space
    - with no full reorganizations
(-) Indirection (not bad if directory in memory)
(-) Directory doubles in size (now it fits, now it does not)

Linear hashing
Another dynamic hashing scheme
Two ideas:
(a) Use the i low-order bits of the b-bit hash, e.g. 01110101; i grows over time
(b) File grows linearly

Example: b = 4 bits, i = 2, 2 keys/bucket
- Bucket 00 holds 0000, 1010
- Bucket 01 holds 0101, 1111
- Buckets 10 and 11 are future growth buckets
- m = 01 (max used block)

Rule: if h(k)[i] ≤ m, then look at bucket h(k)[i]; else, look at bucket h(k)[i] - 2^(i-1)

Insert 0101: bucket 01 is full, so 0101 goes into an overflow block. Can have overflow chains!

Note: in the textbook, n is used instead of m, with n = m + 1. Here m = 01, so n = 10.

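The lookup rule can be checked against the example with a few lines. This sketch assumes the comparison in the rule is "less than or equal to m" and that h(k)[i] means the i low-order bits; the function name `bucket_for` is hypothetical.

```python
def bucket_for(hv: int, i: int, m: int) -> int:
    """Return the bucket to look at for hash value hv, given i and m."""
    low = hv & ((1 << i) - 1)           # h(k)[i]: the i low-order bits
    if low <= m:
        return low                      # that bucket already exists
    return low - (1 << (i - 1))         # otherwise drop the leading 1 bit


# With i = 2 and m = 01, the four keys of the example map to buckets 00 and 01.
placement = {hv: bucket_for(hv, i=2, m=0b01) for hv in (0b0000, 0b1010, 0b0101, 0b1111)}
```

Keys 1010 and 1111 end in 10 and 11, which are not yet in use, so they are found in buckets 00 and 01 respectively.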
Example continued: growing the file (b = 4 bits, i = 2, 2 keys/bucket)
- Increase m to 10: bucket 10 comes into use, and 1010 moves there from bucket 00
- Insert 0101: it goes to bucket 01
- Increase m to 11: bucket 11 comes into use, and 1111 moves there from bucket 01, which now holds 0101, 0101

Example continued: how to grow beyond this?
- With i = 2 and m = 11 (max used block): bucket 00 holds 0000; bucket 01 holds 0101, 0101; bucket 10 holds 1010; bucket 11 holds 1111
- Set i = 3: buckets 00 through 11 become 000 through 011, and buckets 100, 101, 110, 111 are future growth buckets
- As m advances to 100 and then 101, keys are rehashed on 3 bits; at m = 101 the two 0101 keys move from bucket 001 to bucket 101

When do we expand the file?
- Keep track of: U = (# used slots) / (total # of slots)
- If U > threshold, then increase m (and maybe i)

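A rough sketch of growth driven by U follows. It is an illustration under assumptions: the class name `LinearHash` is hypothetical, the threshold of 0.8 is chosen arbitrarily, a bucket is a plain list that may grow past `capacity` (standing in for an overflow chain), and "total # of slots" counts primary slots only.

```python
class LinearHash:
    def __init__(self, capacity=2, threshold=0.8):
        self.capacity = capacity        # keys per primary block
        self.threshold = threshold      # expand when U exceeds this
        self.i = 1                      # low-order bits in use
        self.m = 1                      # max used block: buckets 0..m exist
        self.buckets = [[], []]
        self.count = 0

    def _addr(self, hv):
        low = hv & ((1 << self.i) - 1)
        return low if low <= self.m else low - (1 << (self.i - 1))

    def utilization(self):
        return self.count / (self.capacity * len(self.buckets))

    def insert(self, hv):
        self.buckets[self._addr(hv)].append(hv)
        self.count += 1
        while self.utilization() > self.threshold:
            self._grow()

    def _grow(self):
        if self.m + 1 == 1 << self.i:
            self.i += 1                 # all i-bit addresses used: use one more bit
        self.m += 1
        self.buckets.append([])
        # Only the bucket that shares its low-order bits with the new one is rehashed.
        source = self.m - (1 << (self.i - 1))
        old, self.buckets[source] = self.buckets[source], []
        for hv in old:
            self.buckets[self._addr(hv)].append(hv)


lh = LinearHash()
for hv in (0b0000, 0b1010, 0b0101, 0b1111, 0b0101):
    lh.insert(hv)
```

With these settings the five inserts end with i = 2 and m = 11, and the buckets hold 0000 | 0101, 0101 | 1010 | 1111, the same state the example reaches before asking how to grow further.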
Summary: linear hashing
(+) Can handle growing files
    - with less wasted space
    - with no full reorganizations
(+) No indirection like extensible hashing
(-) Can still have overflow chains

Example: BAD CASE
Some buckets are very full while others are very empty. To relieve the very full buckets we would need to move m to them, which would waste space...

Summary: hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear

Next:
- Indexing vs hashing
- Index definition in SQL
- Multiple key access

Indexing vs hashing
- Hashing is good for probes given a key, e.g., SELECT … FROM R WHERE R.A = 5
- INDEXING (including B-trees) is good for range searches, e.g., SELECT … FROM R WHERE R.A > 5

Index definition in SQL
- CREATE INDEX name ON rel (attr)
- CREATE UNIQUE INDEX name ON rel (attr), which defines a candidate key
- DROP INDEX name

Note: CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, hashing, …) OR PARAMETERS (e.g. load factor, size of hash, ...) ... at least in SQL ...

Note: an ATTRIBUTE LIST gives a MULTIKEY INDEX (next), e.g., CREATE INDEX foo ON R(A,B,C)

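The statements above can be tried out with Python's built-in `sqlite3` module. This is only a sketch: SQLite is used because it ships with Python, and the table R and the index names `a_idx` and `b_idx` are made up for illustration (`foo` is the name from the slide).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B INTEGER, C INTEGER)")

conn.execute("CREATE INDEX a_idx ON R (A)")            # plain index
conn.execute("CREATE UNIQUE INDEX b_idx ON R (B)")     # B becomes a candidate key
conn.execute("CREATE INDEX foo ON R (A, B, C)")        # attribute list: multi-key index

conn.execute("INSERT INTO R VALUES (5, 1, 0)")
try:
    conn.execute("INSERT INTO R VALUES (6, 1, 0)")     # repeats B = 1
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True                          # the unique index refuses it

conn.execute("DROP INDEX a_idx")
index_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)
```

None of the statements says which index structure to build or with which parameters, which is the point of the note above.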
Multi-key index
Motivation: find records where DEPT = “Toy” AND SAL > 50k

Strategy I:
- Use one index, say Dept
- Get all Dept = “Toy” records and check their salary

Strategy II:
- Use 2 indexes, one for Dept = “Toy” and one for Sal > 50k
- Manipulate pointers

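"Manipulate pointers" in Strategy II can be read as intersecting the pointer sets returned by the two indexes before fetching any record. The sketch below assumes that reading; the records, the record ids used as pointers, and the function name `toy_over` are all hypothetical.

```python
import bisect
from collections import defaultdict

# Hypothetical records: (name, dept, salary in thousands); the list position is the pointer.
records = [
    ("Ann", "Toy", 60),
    ("Bob", "Toy", 40),
    ("Cy", "Sales", 70),
    ("Di", "Toy", 55),
]

# Index 1: Dept -> set of pointers.
dept_index = defaultdict(set)
for rid, (_, dept, _) in enumerate(records):
    dept_index[dept].add(rid)

# Index 2: sorted (salary, pointer) pairs, searchable by range.
sal_index = sorted((sal, rid) for rid, (_, _, sal) in enumerate(records))


def toy_over(threshold):
    """Pointers to records with DEPT = 'Toy' AND SAL > threshold."""
    by_dept = dept_index["Toy"]
    start = bisect.bisect_right(sal_index, (threshold, float("inf")))
    by_sal = {rid for _, rid in sal_index[start:]}
    return sorted(by_dept & by_sal)     # intersect pointers, then fetch records


matches = toy_over(50)
```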
Strategy III:
- Multiple key index
- One idea: an index I1 on the first attribute whose entries point to indexes I2, I3, … on the second attribute

Example
- Record: Name = Joe, DEPT = Sales, SAL = 15k
- Dept index: Art, Sales, Toy
- Each Dept entry points to a salary index (values shown in the figure: 10k, 15k, 17k, 21k and 12k, 15k, 19k)

For which queries is this index good?
- Find RECs with Dept = “Sales” AND SAL = 20k
- Find RECs with Dept = “Sales” AND SAL > 20k
- Find RECs with Dept = “Sales”
- Find RECs with SAL = 20k

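One way to explore the four queries is a small sketch of a Dept index whose entries lead to per-department salary indexes. The data and function names are hypothetical; each salary index is modelled as a sorted list of (salary, pointer) pairs.

```python
import bisect


def build(records):
    """records: list of (dept, salary); the list position serves as the pointer."""
    index = {}
    for rid, (dept, sal) in enumerate(records):
        bisect.insort(index.setdefault(dept, []), (sal, rid))
    return index


def dept_and_sal_eq(index, dept, sal):
    entries = index.get(dept, [])
    lo = bisect.bisect_left(entries, (sal, -1))
    hi = bisect.bisect_right(entries, (sal, float("inf")))
    return [rid for _, rid in entries[lo:hi]]


def dept_and_sal_gt(index, dept, sal):
    entries = index.get(dept, [])
    lo = bisect.bisect_right(entries, (sal, float("inf")))
    return [rid for _, rid in entries[lo:]]


def dept_eq(index, dept):
    return [rid for _, rid in index.get(dept, [])]


def sal_eq(index, sal):
    # No shortcut here: every department's salary index has to be probed.
    return [rid for dept in index for rid in dept_and_sal_eq(index, dept, sal)]


idx = build([("Sales", 15), ("Sales", 20), ("Toy", 20), ("Sales", 25), ("Art", 10)])
```

The first three queries touch a single salary index, while the last one visits all of them, which suggests the index is a poor fit when only the second attribute is given.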
Interesting application: geographic data
DATA: records with x and y coordinates ...

Queries:
- What city is at (x, y)?
- What is within 5 miles from (x, y)?
- Which is the closest point to (x, y)?

Example
(Figure: points a through o in the x-y plane. The plane is split recursively at coordinate values, and the resulting tree has the points at its leaves, in the order h, i, a, b, c, d, e, f, g, n, o, m, l, j, k. Axis ticks and split values shown include 5, 10, 15, 20, 25, 30, 35, 40.)
- Search points near f
- Search points near b

Queries
- Find points with Yi > 20
- Find points with Xi < 5
- Find points “close” to i =
- Find points “close” to b =

Many types of geographic index structures have been suggested
- kd-trees (very similar to what we described here)
- Quad trees
- R-trees
- ...

Two more types of multi-key indexes
- Grid
- Partitioned hash

Grid index
- Key 1 values V1, V2, …, Vn along one axis; Key 2 values X1, X2, …, Xn along the other
- Each cell leads to the records with that pair of values, e.g. the cell for (V3, X2) leads to the records with key1 = V3, key2 = X2

CLAIM: can quickly find records with
- key1 = Vi AND key2 = Xj
- key1 = Vi
- key2 = Xj
And also ranges, e.g., key1 ≥ Vi AND key2 < Xj

How do we find entry i, j in a linear structure?
- Cells are stored at positions S+0, S+1, S+2, S+3, S+4, …, S+9
- pos(i, j) = S + iN + j, where N is the max number of i values (N = 4 in the figure)
- Issue: cells must be the same size, and N must be constant!
- Issue: some cells may overflow, some may be sparse...

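The position formula is easy to check numerically. In this sketch the start position S = 100 is an arbitrary assumption and N = 4 as in the figure; the slide describes N as the max number of i values, and the code only requires that j stay below N so that distinct cells get distinct positions.

```python
def pos(i: int, j: int, S: int, N: int) -> int:
    """Position of grid entry (i, j) in the linear structure: S + i*N + j."""
    if not 0 <= j < N:
        raise IndexError("j must be in range(N), or two cells would share a position")
    return S + i * N + j


S, N = 100, 4
first = pos(0, 0, S, N)     # S + 0
next_group = pos(1, 0, S, N)  # S + 4
later = pos(2, 1, S, N)     # S + 9
```

The formula works only because every cell has the same size and N never changes, which is exactly the issue raised above.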
Solution: use indirection
- Grid (rows V1 to V4, columns X1 to X3) contains only pointers to buckets
- Buckets hold the records

With indirection:
- Grid can be regular without wasting space
- We do have the price of indirection

Can also index grid on value ranges
- Linear scale for Dept: Toy = 1, Sales = 2, Personnel = 3
- Linear scale for Salary: 0-20K = 1, 20K-50K = 2, 50K-∞ = 3

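A linear scale can be sketched as a lookup that turns a value into a grid coordinate. The names below are hypothetical, and the sketch assumes a salary exactly on a boundary (20K or 50K) belongs to the higher range, which the slide does not specify.

```python
import bisect

DEPT_SCALE = {"Toy": 1, "Sales": 2, "Personnel": 3}
SALARY_BOUNDS = [20_000, 50_000]    # boundaries between salary ranges 1, 2 and 3


def grid_cell(dept: str, salary: int) -> tuple[int, int]:
    """Map (dept, salary) to grid coordinates via the two linear scales."""
    return DEPT_SCALE[dept], bisect.bisect_right(SALARY_BOUNDS, salary) + 1


cell = grid_cell("Sales", 35_000)   # Sales is 2; 35K falls in range 20K-50K, which is 2
```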
Grid files
(+) Good for multiple-key search
(-) Space, management overhead (nothing is free)
(-) Need partitioning ranges that evenly split keys

Partitioned hash function
Idea: hash Key1 with h1 and Key2 with h2, and put the two results side by side, e.g. h1 gives 010110 and h2 gives 1110010

EX: buckets 000 through 111; h1 gives the first bit of the bucket number, h2 the last two
- h1(toy) = 0, h1(sales) = 1, h1(art) = 1
- h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00
- Insert: hash Dept with h1 and Sal with h2, and store the record in the bucket named by the combined bits
- Find Emp. with Dept. = Sales AND Sal = 40k: look in bucket 100
- Find Emp. with Sal = 30k: look in buckets 001 and 101
- Find Emp. with Dept. = Sales: look in buckets 100, 101, 110, 111

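The three lookups can be worked out mechanically. This sketch assumes the hash values on the slide read h1(toy) = 0, h1(sales) = 1, h1(art) = 1 and h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, with h1 supplying the first bit of a 3-bit bucket number and h2 the last two; the function name is hypothetical.

```python
H1 = {"toy": "0", "sales": "1", "art": "1"}
H2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}


def buckets_to_search(dept=None, sal=None):
    """Bucket numbers (as bit strings) that can hold matching records."""
    first = [H1[dept]] if dept is not None else ["0", "1"]
    rest = [H2[sal]] if sal is not None else ["00", "01", "10", "11"]
    return [a + b for a in first for b in rest]


both = buckets_to_search(dept="sales", sal="40k")
only_sal = buckets_to_search(sal="30k")
only_dept = buckets_to_search(dept="sales")
```

When only one key is given, the unknown part of the bucket number ranges over all its values, so a salary-only query looks in 2 buckets and a department-only query in 4.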
Summary: post-hashing discussion
- Indexing vs. hashing
- SQL index definition
- Multiple key access
- Multi-key index variations: grid, geo data
- Partitioned hash

Reading Chapter 5
Skim the following sections:
- Sections 14.3.6, 14.3.7, 14.3.8 [Second Ed: 14.6.6, 14.6.7, 14.6.8]
- Sections 14.4.2, 14.4.3, 14.4.4 [Second Ed: 14.7.2, 14.7.3, 14.7.4]
Read the rest

The BIG picture…
- Chapters 11 & 12 [13]: Storage, records, blocks...
- Chapters 13 & 14 [14]: Access mechanisms: indexes, B-trees, hashing, multi-key
- Chapters 15 & 16 [15, 16]: Query processing (NEXT)