Download presentation
Presentation is loading. Please wait.
1
E.G.M. PetrakisHashing1 Hashing on the Disk Keys are stored in “disk pages” (“buckets”) several records fit within one page Retrieval: find address of page bring page into main memory searching within the page comes for free
2
E.G.M. PetrakisHashing2 Σ key space hash function........ data pages 0 1 2 m-1 b page size b: maximum number of records in page space utilization u: measure of the use of space
3
E.G.M. PetrakisHashing3 Collisions Keys that hash to the same address are stored within the same page If the page is full: i. page splits: allocate a new page and split page content between the old and the new page or ii. overflows: list of overflow pages x x x xx overflow
4
E.G.M. PetrakisHashing4 Access Time Goal: find key in one disk access Access time ~ number of accesses Large u: good space utilization but many overflows or splits => more disk accesses Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses
5
E.G.M. PetrakisHashing5 Categories of Methods Static: require file reorganization open addressing, separate chaining Dynamic: dynamic file growth, adapt to file size dynamic hashing, extendible hashing, linear hashing, spiral storage…
6
E.G.M. PetrakisHashing6 Dynamic Hashing Schemes File size adapts to data size without total reorganization Typically 1-3 disk accesses to access a key Access time and u are a typical trade- off u between 50-100% (typically 69%) Complicated implementation
7
E.G.M. PetrakisHashing7 Two disk accesses: one to access the index, one to access the data with index in main memory => one disk access Problem: the index may become too large Dynamic hashing (Larson 1978) Extendible hashing (Fagin et.al. 1979) indexdata pages Schemes With Index
8
E.G.M. PetrakisHashing8 Ideally, less space and less disk accesses (at least one) Linear Hashing (Litwin 1980) Linear Hashing with Partial Expansions (Larson 1980) Spiral Storage (Martin 1979) address space data space Schemes Without Index
9
E.G.M. PetrakisHashing9 Support for shrinking or growing file shrinking or growing address space, the hash function adapts to these changes hash functions using first (last) bits of key = b n-1 b n-2 ….b i b i-1 …b 2 b 1 b 0 h i (key) = b i-1 … b 2 b 1 b 0 supports 2 i addresses h i : one more bit than h i-1 to address larger files Hash Functions
10
E.G.M. PetrakisHashing10 Dynamic Hashing ( Larson 1978 ) Two level index primary h 1 (key): accesses a hash table secondary h 2 (key): accesses a binary tree Index: binary tree h 1 (k) 1 st level h 2 (k) 2 nd level data pages b 2 3 4 1
11
E.G.M. PetrakisHashing11 Index Fixed (static): h 1 (key) = key mod m Dynamic behavior on secondary index h 2 (key) uses i bits of key the bit sequence of h 2 =b i-1 …b 2 b 1 b 0 denotes which path on the binary tree index to follow in order to access the data page scan h 2 from right to left (bit 1: follow right path, bit 0: follow left path)
12
E.G.M. PetrakisHashing12 index h 1 (k) 1 st level h 2 (k) 2 nd level data pages b 2 3 4 1 012345012345 0 1 1 0 h 1 =1, h 2 =“0” h 1 =1, h 2 =“01” h 1 =1, h 2 =“11” h 1 =5, h 2 = any h 1 (key) = key mod 6 h 2 (key) = “01”<= depth of binary tree = 2 0
13
E.G.M. PetrakisHashing13 Initially fixed size primary index and no data insert record in new page under h 1 address if page is full, allocate one extra page split keys between old and new page use one extra bit in h 2 for addressing h 1 =1, h 2 =0 h 1 =1, h 2 =1 01230123 0 1 Insertions 01230123 01230123 b h 1 =1,h 2 =any
14
E.G.M. PetrakisHashing14
15
E.G.M. PetrakisHashing15 Deletions Find record to be deleted using h 1, h 2 Delete record Check “sibling” page: less than b records in both pages ? if yes merge the two pages delete one empty page shrink binary tree index by one level and reduce h 2 by one bit
16
E.G.M. PetrakisHashing16 merging 0 3 2 1 1 3 2 4 0 3 2 1 1 3 2 4 delete
17
E.G.M. PetrakisHashing17 Extendible Hashing ( Fagin et.al. 1979 ) Dynamic hashing without index Primary hashing is omitted Only secondary hashing with all binary trees at the same level The index shrinks and grows according to file size Data pages attached to the index
18
E.G.M. PetrakisHashing18 dynamic hashing with all binary trees at same level 1 0 0123401234 1 0 0 1 00 01 10 11 0123401234 2 2 1 dynamic hashing number of address bits
19
E.G.M. PetrakisHashing19 Initially 1 index and 1 data page 0 address bits insert records in data page index storage 0 b 0 Insertions global depth d: size of index 2 d local depth l : Number of address bits
20
E.G.M. PetrakisHashing20 1 1 0101 d: global depth = 1 l : local depth = 1 d 1 l index storage 0 b 0 dl Page “0” Overflows
21
E.G.M. PetrakisHashing21 Page “0” Overflows (cont.) 1 more key bit for addressing and 1 extra page => index doubles !! Split contents of previous page between 2 pages according to next bit of key Global depth d: number of index bits => 2 d index size Local depth l : number of bits for record addressing
22
E.G.M. PetrakisHashing22 Page “0” Overflows (again) 00 01 10 11 22 2 1 contains records with same 1 st bit of key contain records with same 2 bits of key d
23
E.G.M. PetrakisHashing23 Page “01” Overflows 3 000 001 010 011 100 101 110 111 1 2 3 3 d 1 more key bit for addressing 2 d-l : number of pointers to page
24
E.G.M. PetrakisHashing24 Page “100” Overflows no need to double index page 100 splits into two (1 new page) local depth l is increased by 1 000 001 010 011 100 101 110 111 2 3 2 3 3 2 +1
25
E.G.M. PetrakisHashing25 If l < d, split overflowed page (1 extra page) If l = d => index is doubled, page is split d is increased by 1=>1 more bit for addressing update pointers (either way): a) if d prefix bits are used for addressing d=d+1; for (i=2 d -1, i>=0,i--) index[i]=index[i/2]; b) if d suffix bits are used for (i=0; i <= 2 d -1; i++) index[i]=index[i]+2 d -1; d=d+1 Insertion Algorithm
26
E.G.M. PetrakisHashing26 Deletion Algorithm Find and delete record Check sibling page If less than b records in both pages merge pages and free empty page decrease local depth l by 1 (records in merged page have 1 less common bit) if l reduce index (half size) update pointers
27
E.G.M. PetrakisHashing27 000 001 010 011 100 101 110 111 2 3 2 3 3 2 delete with merging 000 001 010 011 100 101 110 111 2 3 2 2 2 l < d 00 01 10 11 2 2 2 2 2
28
E.G.M. PetrakisHashing28 A page splits and there are more than b keys with same next bit take one more bit for addressing (increase l) if d=l the index doubles again !! Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with same value) if distribution is known, transform it to uniform Dynamic hashing performs better for non- uniform distributions (affected locally) Observations
29
E.G.M. PetrakisHashing29 For n: records and page size b expected size of index (Flajolet) 1 disk access/retrieval when index in main memory 2 disk accesses when index is on disk overflows increase number of disk accesses Performance
30
E.G.M. PetrakisHashing30 Storage Utilization with Page Splitting In general 50% < u < 100% On the average u ~ ln2 ~ 69% ( no overflows ) b b before splitting after splitting After splitting
31
E.G.M. PetrakisHashing31 Storage Utilization with Overflows Achieves higher u and avoids page doubling (d=l) higher u is achieved for small overflow pages u=2b/3b~66% after splitting small overflow pages (e.g., b/2) => u = (b+b/2)/2b ~ 75% double index only if the overflow overflows!! bb
32
E.G.M. PetrakisHashing32 Linear Hashing ( Litwin 1980) Dynamic scheme without index Indices refer to page addresses Overflows are allowed The file grows one page at a time The page which splits is not always the one which overflowed The pages split in a predetermined order
33
E.G.M. PetrakisHashing33 Linear Hashing (cont.) Initially n empty pages p points to the page that splits Overflows are allowed b p b p
34
E.G.M. PetrakisHashing34 File Growing A page splits whenever the “splitting criterion” is satisfied a page is added at the end of the file pointer p points to the next page split contents of old page between old and new page based on key values p
35
E.G.M. PetrakisHashing35 b=b page =4, b overflow =1 initially n=5 pages hash function h 0 =k mod 5 splitting criterion u > A% alternatively split when overflow overflows, etc. 4 319 613 303 402 27 737 712 16 711 125 320 90 435 p 215522 438 new element 0 1 2 3 4
36
E.G.M. PetrakisHashing36 Page 5 is added at end of file The contents of page 0 are split between pages 0 and 5 based on hash function h 1 = key mod 10 p points to the next page p 4 319 613 303 438 402 27 737 712 16 711 320 90 522 125 435 215 0 1 2 3 4 5
37
E.G.M. PetrakisHashing37 Initially h 0 =key mod n As new pages are added at end of file, h 0 alone becomes insufficient The file will eventually double its size In that case use h 1 =key mod 2n In the meantime use h 0 for pages not yet split use h 1 for pages that have already split Split contents of page pointed to by p based on h 1 Hash Functions
38
E.G.M. PetrakisHashing38 When the file has doubled its size, h 0 is no longer needed set h 0 =h 1 and continue (e.g., h 0 =k mod 10) The file will eventually double its size again Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%) Hash Functions (cont.)
39
E.G.M. PetrakisHashing39 Initially n pages and 0 <= h 0 (k) <= n Series of hash functions Selection of hash function: if h i (k) >= p then use h i (k) else use h i+1 (k) Hash Functions
40
E.G.M. PetrakisHashing40 Linear Hashing with Partial Expansions (Larson 1980) Problem with Linear Hashing: pages to the right of p delay to split large chains of overflows on rightmost pages Solution: do not wait that much to split a page k partial expansions: take pages in groups of k all k pages of a group split together the file grows at lower rates
41
E.G.M. PetrakisHashing41 Two Partial Expansions Initially 2n pages, n groups, 2 pages/group groups: (0, n) (1, 1+n)…(i, i+n) … (n-1, 2n-1) Pages in same group spit together => some records go to a new page at end of file (position: 2n) 2 pointers to pages of the same group 0 1 n 2n
42
E.G.M. PetrakisHashing42 1 st Expansion After n splits, all pages are split the file has 3n pages (1.5 time larger) the file grows at lower rate after 1 st expansion take pages in groups of 3 pages: (j, j+n, j+2n), 0 <= j <= n 0 n 2n 3n
43
E.G.M. PetrakisHashing43 2 nd Expansion After n splits the file has size 4n repeat the same process having initially 4n pages in 2n groups 2 pointers to pages of the same group 0 1 2n 4n
44
E.G.M. PetrakisHashing44 1 1,1 1,2 1,3 1,4 1,5 1,6 11,21,41,61,82 relative file size disk access/retrieval Linear Hashing Linear Hashing 2 partial expansions 3.563.534.04 deletion 3.313.213.57 insertio n 1.091.121.17 retrieval Linear Hashing 3 part. Exp. Linear Hashing 2 part. Exp. Linear Hashing b = 5 b’ = 5 u = 0.85
45
E.G.M. PetrakisHashing45 Dynamic Hashing Schemes Very good performance on membership, insert, delete operations Suitable for both main memory and disk b=1-3 records for main memory b=1-4 Kbytes for disk Critical parameter: space utilization u large u => more overflows, bad performance small u => less overflows, better performance Suitable for direct access queries (random accesses) but not for range queries
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.