Carnegie Mellon Carnegie Mellon Univ. Dept. of Computer Science Database Applications C. Faloutsos Indexing and Hashing – part II
Carnegie Mellon C. Faloutsos2 General Overview - rel. model Relational model - SQL –Formal & commercial query languages Functional Dependencies Normalization Physical Design Indexing
Carnegie Mellon C. Faloutsos3 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing
Carnegie Mellon C. Faloutsos4 (Static) Hashing Problem: “find EMP record with ssn=123” What if disk space was free, and time was at premium?
Carnegie Mellon C. Faloutsos5 Hashing A: Brilliant idea: key-to-address transformation: #0 page #123 page #999,999, ; Smith; Main str
Carnegie Mellon C. Faloutsos6 Hashing Since space is NOT free: use M, instead of 999,999,999 slots hash function: h(key) = slot-id #0 page #123 page #999,999, ; Smith; Main str
Carnegie Mellon C. Faloutsos7 Hashing Typically: each hash bucket is a page, holding many records: #0 page #h(123) M 123; Smith; Main str
Carnegie Mellon C. Faloutsos8 Hashing Notice: could have clustering, or non-clustering versions: #0 page #h(123) M 123; Smith; Main str.
Carnegie Mellon C. Faloutsos Hashing Notice: could have clustering, or non-clustering versions: #0 page #h(123) M... EMP file 123; Smith; Main str ; Johnson; Forbes ave 345; Tompson; Fifth ave...
Carnegie Mellon C. Faloutsos10 Indexing- overview ISAM and B-trees hashing –hashing functions –size of hash table –collision resolution Hashing vs B-trees Indices in SQL Advanced topics:
Carnegie Mellon C. Faloutsos11 Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method
Carnegie Mellon C. Faloutsos12 Design decisions - functions Goal: uniform spread of keys over hash buckets Popular choices: –Division hashing –Multiplication hashing
Carnegie Mellon C. Faloutsos13 Division hashing h(x) = (a*x+b) mod M eg., h(ssn) = (ssn) mod 1,000 –gives the last three digits of ssn M: size of hash table - choose a prime number, defensively (why?)
Carnegie Mellon C. Faloutsos14 eg., M=2; hash on driver-license number (dln), where last digit is ‘gender’ (0/1 = M/F) in an army unit with predominantly male soldiers Thus: avoid cases where M and keys have common divisors - prime M guards against that! Division hashing
Carnegie Mellon C. Faloutsos15 Multiplication hashing h(x) = [ fractional-part-of ( x * φ ) ] * M φ: golden ratio ( = ( sqrt(5)-1)/2 ) in general, we need an irrational number advantage: M need not be a prime number but φ must be irrational
Carnegie Mellon C. Faloutsos16 Other hashing functions quadratic hashing (bad)... conclusion: use division hashing
Carnegie Mellon C. Faloutsos17 Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method
Carnegie Mellon C. Faloutsos18 Size of hash table eg., 50,000 employees, 10 employee- records / page Q: M=?? pages/buckets/slots
Carnegie Mellon C. Faloutsos19 Size of hash table eg., 50,000 employees, 10 employees/page Q: M=?? pages/buckets/slots A: utilization ~ 90% and –M: prime number Eg., in our case: M= closest prime to 50,000/10 / 0.9 = 5,555
Carnegie Mellon C. Faloutsos20 Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method
Carnegie Mellon C. Faloutsos21 Collision resolution Q: what is a ‘collision’? A: ??
Carnegie Mellon C. Faloutsos22 Collision resolution #0 page #h(123) M 123; Smith; Main str.
Carnegie Mellon C. Faloutsos23 Collision resolution Q: what is a ‘collision’? A: ?? Q: why worry about collisions/overflows? (recall that buckets are ~90% full) A: ‘birthday paradox’
Carnegie Mellon C. Faloutsos24 Collision resolution open addressing –linear probing (ie., put to next slot/bucket) –re-hashing separate chaining (ie., put links to overflow pages)
Carnegie Mellon C. Faloutsos25 Collision resolution #0 page #h(123) M 123; Smith; Main str. linear probing:
Carnegie Mellon C. Faloutsos26 Collision resolution #0 page #h(123) M 123; Smith; Main str. re-hashing h1() h2()
Carnegie Mellon C. Faloutsos27 Collision resolution 123; Smith; Main str. separate chaining
Carnegie Mellon C. Faloutsos28 Design decisions - conclusions function: division hashing –h(x) = ( a*x+b ) mod M size M: ~90% util.; prime number. collision resolution: separate chaining –easier to implement (deletions!); – no danger of becoming full
Carnegie Mellon C. Faloutsos29 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing
Carnegie Mellon C. Faloutsos30 Hashing vs B-trees: Hashing offers speed ! ( O(1) avg. search time)..but:
Carnegie Mellon C. Faloutsos31 Hashing vs B-trees:..but B-trees give: key ordering: –range queries –proximity queries –sequential scan O(log(N)) guarantees for search, ins./del. graceful growing/shrinking
Carnegie Mellon C. Faloutsos32 Hashing vs B-trees: thus: B-trees are implemented in most systems footnotes : hashing is not (why not?) ‘dbm’ and ‘ndbm’ of UNIX: offer one or both
Carnegie Mellon C. Faloutsos33 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing
Carnegie Mellon C. Faloutsos34 Indexing in SQL create index on ( ) create unique index on ( ) drop index
Carnegie Mellon C. Faloutsos35 Indexing in SQL eg., create index ssn-index on STUDENT (ssn) or (eg., on TAKES(ssn,cid, grade) ): create index sc-index on TAKES (ssn, c-id)
Carnegie Mellon C. Faloutsos36 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: (theoretical interest) –dynamic hashing –multi-attribute indexing
Carnegie Mellon C. Faloutsos37 Problem with static hashing problem: overflow? problem: underflow? (underutilization)
Carnegie Mellon C. Faloutsos38 Solution: Dynamic/extendible hashing idea: shrink / expand hash table on demand....dynamic hashing Details: how to grow gracefully, on overflow? Many solutions - One of them: ‘extendible hashing’
Carnegie Mellon C. Faloutsos39 Extendible hashing #0 page #h(123) M 123; Smith; Main str.
Carnegie Mellon C. Faloutsos40 Extendible hashing #0 page #h(123) M 123; Smith; Main str. solution: split the bucket in two
Carnegie Mellon C. Faloutsos41 Extendible hashing in detail: keep a directory, with ptrs to hash-buckets Q: how to divide contents of bucket in two? A: hash each key into a very long bit string; keep only as many bits as needed Eventually:
Carnegie Mellon C. Faloutsos42 Extendible hashing directory
Carnegie Mellon C. Faloutsos43 Extendible hashing directory
Carnegie Mellon C. Faloutsos44 Extendible hashing directory split on 3-rd bit
Carnegie Mellon C. Faloutsos45 Extendible hashing directory new page / bucket
Carnegie Mellon C. Faloutsos46 Extendible hashing directory (doubled) new page / bucket
Carnegie Mellon C. Faloutsos47 Extendible hashing BEFOREAFTER
Carnegie Mellon C. Faloutsos48 Extendible hashing Summary: directory doubles on demand or halves, on shrinking files needs ‘local’ and ‘global’ depth (see book) Mainly, of theoretical interest - same for –‘linear hashing’ of Litwin –‘order preserving’ –‘perfect hashing’ (no collisions!)
Carnegie Mellon C. Faloutsos49 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing
Carnegie Mellon C. Faloutsos50 multiple-key access how to support queries on multiple attributes, like –grade>=3 and course=‘415’ major motivation: Geographic Information systems (GIS)
Carnegie Mellon C. Faloutsos51 multiple-key access x y
Carnegie Mellon C. Faloutsos52 multiple-key access Typical query: Find cities within x miles from Pittsburgh thus, we want to store nearby cities on the same disk page:
Carnegie Mellon C. Faloutsos53 multiple-key access x y
Carnegie Mellon C. Faloutsos54 multiple-key access x y
Carnegie Mellon C. Faloutsos55 multiple-key access - R-trees x y
Carnegie Mellon C. Faloutsos56 multiple-key access - R-trees R-trees: very successful for GIS (along with ‘z-ordering’) more details: at ‘advanced topics’, later even more details: in
Carnegie Mellon C. Faloutsos57 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing industry workhorse