Carnegie Mellon Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications C. Faloutsos Indexing and Hashing – part II.

Carnegie Mellon 15-415 - C. Faloutsos2 General Overview - rel. model Relational model - SQL –Formal & commercial query languages Functional Dependencies Normalization Physical Design Indexing

Carnegie Mellon 15-415 - C. Faloutsos3 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing

Carnegie Mellon 15-415 - C. Faloutsos4 (Static) Hashing Problem: “find EMP record with ssn=123” What if disk space were free, and time were at a premium?

Carnegie Mellon 15-415 - C. Faloutsos5 Hashing A: Brilliant idea: key-to-address transformation: [Figure: pages #0 through #999,999,999; the record “123; Smith; Main str” is stored on page #123]

Carnegie Mellon 15-415 - C. Faloutsos6 Hashing Since space is NOT free: use M, instead of 999,999,999 slots; hash function: h(key) = slot-id [Figure: pages #0 through #999,999,999, with the record “123; Smith; Main str” on page #123]

Carnegie Mellon 15-415 - C. Faloutsos7 Hashing Typically: each hash bucket is a page, holding many records: [Figure: pages #0 through M; the record “123; Smith; Main str” is stored on page #h(123)]

Carnegie Mellon 15-415 - C. Faloutsos8 Hashing Notice: could have clustering, or non-clustering versions: [Figure: pages #0 through M; the record “123; Smith; Main str.” is stored on page #h(123)]

Carnegie Mellon 15-415 - C. Faloutsos9 Hashing Notice: could have clustering, or non-clustering versions: [Figure: pages #0 through M; page #h(123) holds the key 123 and refers to the EMP file, which stores the records “123; Smith; Main str.”, “234; Johnson; Forbes ave”, “345; Tompson; Fifth ave”, ...]

Carnegie Mellon 15-415 - C. Faloutsos10 Indexing- overview ISAM and B-trees hashing –hashing functions –size of hash table –collision resolution Hashing vs B-trees Indices in SQL Advanced topics:

Carnegie Mellon 15-415 - C. Faloutsos11 Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

Carnegie Mellon 15-415 - C. Faloutsos12 Design decisions - functions Goal: uniform spread of keys over hash buckets Popular choices: –Division hashing –Multiplication hashing

Carnegie Mellon 15-415 - C. Faloutsos13 Division hashing h(x) = (a*x+b) mod M e.g., h(ssn) = (ssn) mod 1,000 –gives the last three digits of ssn M: size of hash table - choose a prime number, defensively (why?)

Carnegie Mellon 15-415 - C. Faloutsos14 Division hashing e.g., M=2; hash on driver-license number (dln), where the last digit is ‘gender’ (0/1 = male/female), in an army unit with predominantly male soldiers: almost all records land in the same bucket. Thus: avoid cases where M and keys have common divisors - prime M guards against that!
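
A minimal sketch of the common-divisor problem, assuming a made-up key set in which every key is a multiple of 10 (the keys and the helper names `division_hash` and `bucket_counts` are illustrative, not from the slides). With M=10 all keys pile into one bucket; with the prime M=7 they spread almost evenly.

```python
from collections import Counter


def division_hash(x: int, m: int, a: int = 1, b: int = 0) -> int:
    """Division hashing: h(x) = (a*x + b) mod M."""
    return (a * x + b) % m


def bucket_counts(keys, m):
    """Number of keys that land in each of the M buckets."""
    counts = Counter(division_hash(k, m) for k in keys)
    return [counts.get(slot, 0) for slot in range(m)]


# Hypothetical keys: all multiples of 10, so they share a divisor with M=10.
keys = [10 * i for i in range(1, 1001)]

skewed = bucket_counts(keys, 10)   # M shares the divisor 10 with every key
spread = bucket_counts(keys, 7)    # prime M, no common divisor with the keys

print("M=10:", skewed)
print("M=7: ", spread)
```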

Carnegie Mellon 15-415 - C. Faloutsos15 Multiplication hashing h(x) = floor( M * fractional-part-of ( x * φ ) ) φ: golden ratio, fractional part ( 0.618... = ( sqrt(5)-1)/2 ) advantage: M need not be a prime number, but φ must be irrational
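
A short sketch of multiplication hashing, assuming floating-point arithmetic is accurate enough for the key range used (the function name and the test keys are illustrative). The table size M=10 is deliberately not prime, and the keys are the same multiples of 10 that would defeat division hashing with M=10.

```python
import math
from collections import Counter

PHI = (math.sqrt(5) - 1) / 2  # 0.618..., an irrational multiplier


def multiplication_hash(x: int, m: int) -> int:
    """h(x) = floor(M * frac(x * PHI)); the result is a slot in 0..M-1."""
    frac = (x * PHI) % 1.0
    # min() guards against a rounding edge case where frac * m rounds up to m.
    return min(int(frac * m), m - 1)


# Hypothetical keys: multiples of 10, hashed into a non-prime table size.
keys = [10 * i for i in range(1, 1001)]
counts = Counter(multiplication_hash(k, 10) for k in keys)
print(sorted(counts.items()))
```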

Carnegie Mellon 15-415 - C. Faloutsos16 Other hashing functions quadratic hashing (bad)... conclusion: use division hashing

Carnegie Mellon 15-415 - C. Faloutsos18 Size of hash table e.g., 50,000 employees, 10 employee-records / page Q: M=?? pages/buckets/slots

Carnegie Mellon 15-415 - C. Faloutsos19 Size of hash table e.g., 50,000 employees, 10 employee-records / page Q: M=?? pages/buckets/slots A: utilization ~ 90% and –M: prime number E.g., in our case: M = closest prime to 50,000 / 10 / 0.9 ≈ 5,555.6, i.e., M = 5,557
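
The sizing rule above can be sketched as follows. The helper names (`is_prime`, `closest_prime`, `table_size`) are illustrative, and trial division is only an assumption that is adequate for table sizes of this magnitude.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division; fine for the small table sizes considered here."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def closest_prime(target: float) -> int:
    """Prime nearest to target (ties go to the smaller prime)."""
    below = math.floor(target)
    while below >= 2 and not is_prime(below):
        below -= 1
    above = max(2, math.ceil(target))
    while not is_prime(above):
        above += 1
    if below < 2:
        return above
    return below if target - below <= above - target else above


def table_size(num_records: int, records_per_page: int,
               utilization: float = 0.9) -> int:
    """M = closest prime to (records / records-per-page / utilization)."""
    return closest_prime(num_records / records_per_page / utilization)


print(table_size(50_000, 10))
```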

Carnegie Mellon 15-415 - C. Faloutsos21 Collision resolution Q: what is a ‘collision’? A: ??

Carnegie Mellon 15-415 - C. Faloutsos22 Collision resolution [Figure: pages #0 through M; the record “123; Smith; Main str.” hashes to page #h(123)]

Carnegie Mellon 15-415 - C. Faloutsos23 Collision resolution Q: what is a ‘collision’? A: ?? Q: why worry about collisions/overflows? (recall that buckets are ~90% full) A: ‘birthday paradox’
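
A quick calculation behind the ‘birthday paradox’ answer, under the simplifying assumption that each key picks one of M slots uniformly and independently (the function name is illustrative). Even when the table is far from full, a collision somewhere is already likely.

```python
def collision_probability(n: int, m: int) -> float:
    """Probability that at least two of n keys hash to the same one of m slots."""
    if n > m:
        return 1.0  # pigeonhole: more keys than slots
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m
    return 1.0 - p_no_collision


# Classic case: 23 people, 365 possible birthdays.
print(collision_probability(23, 365))
# 100 keys in 5,557 slots: under 2% of the slots used.
print(collision_probability(100, 5557))
```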

Carnegie Mellon 15-415 - C. Faloutsos24 Collision resolution open addressing –linear probing (i.e., put to next slot/bucket) –re-hashing separate chaining (i.e., put links to overflow pages)

Carnegie Mellon 15-415 - C. Faloutsos25 Collision resolution linear probing: [Figure: pages #0 through M; the record “123; Smith; Main str.” hashes to page #h(123)]
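
A minimal sketch of linear probing over page-sized buckets, assuming h(key) = key mod M and no deletions (the class and method names are illustrative). A record that finds its home bucket full goes to the next bucket with room.

```python
class LinearProbingTable:
    def __init__(self, m: int, capacity: int):
        self.m = m                      # number of buckets (pages)
        self.capacity = capacity        # records per bucket
        self.buckets = [[] for _ in range(m)]

    def insert(self, key: int, record: str) -> int:
        """Store the record in the first bucket with room; return its slot."""
        start = key % self.m
        for step in range(self.m):
            slot = (start + step) % self.m
            if len(self.buckets[slot]) < self.capacity:
                self.buckets[slot].append((key, record))
                return slot
        raise RuntimeError("hash table is full")

    def search(self, key: int):
        """Probe from the home bucket; a non-full bucket ends the probe."""
        start = key % self.m
        for step in range(self.m):
            slot = (start + step) % self.m
            for k, record in self.buckets[slot]:
                if k == key:
                    return record
            if len(self.buckets[slot]) < self.capacity:
                return None  # valid only because this sketch never deletes
        return None


table = LinearProbingTable(m=5, capacity=2)
slots = [table.insert(k, f"rec-{k}") for k in (3, 8, 13)]  # all hash to slot 3
print(slots)
```

The early exit in `search` is what makes deletions awkward under open addressing: removing a record could make a full bucket look like the end of a probe sequence.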

Carnegie Mellon 15-415 - C. Faloutsos26 Collision resolution re-hashing: [Figure: pages #0 through M; the record “123; Smith; Main str.” is placed using hash functions h1() and h2()]
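
One common way to use two hash functions is double hashing, sketched below. This is an assumption about what ‘re-hashing’ looks like in code, not necessarily the exact scheme on the slide; the particular choices of `h1` and `h2` are illustrative.

```python
def h1(key: int, m: int) -> int:
    """Home slot."""
    return key % m


def h2(key: int, m: int) -> int:
    """Step size in 1..M-1; never zero, so the probe always moves."""
    return 1 + (key % (m - 1))


def probe_sequence(key: int, m: int):
    """Slots tried in order: (h1 + i * h2) mod M for i = 0, 1, ..., M-1."""
    return [(h1(key, m) + i * h2(key, m)) % m for i in range(m)]


# With a prime M, every step size is coprime to M, so all slots get visited.
print(probe_sequence(10, 7))
```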

Carnegie Mellon 15-415 - C. Faloutsos27 Collision resolution separate chaining: [Figure: the record “123; Smith; Main str.” is placed on an overflow page linked from its bucket]
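
A minimal sketch of separate chaining with overflow pages, assuming h(key) = key mod M (the class and method names are illustrative). Each bucket is a page; when it is full, a new overflow page is linked to the end of its chain, so the table never becomes ‘full’ and deletion is a simple removal.

```python
class Page:
    def __init__(self):
        self.records = {}     # key -> record
        self.overflow = None  # link to the next overflow page, if any


class ChainedHashTable:
    def __init__(self, m: int, capacity: int):
        self.m = m
        self.capacity = capacity
        self.buckets = [Page() for _ in range(m)]

    def _chain(self, key: int):
        """Yield the primary page for key, then its overflow pages."""
        page = self.buckets[key % self.m]
        while page is not None:
            yield page
            page = page.overflow

    def insert(self, key: int, record: str) -> None:
        self.delete(key)  # keep keys unique within a chain
        pages = list(self._chain(key))
        for page in pages:
            if len(page.records) < self.capacity:
                page.records[key] = record
                return
        pages[-1].overflow = Page()  # chain is full: add an overflow page
        pages[-1].overflow.records[key] = record

    def search(self, key: int):
        for page in self._chain(key):
            if key in page.records:
                return page.records[key]
        return None

    def delete(self, key: int) -> bool:
        for page in self._chain(key):
            if key in page.records:
                del page.records[key]
                return True
        return False

    def chain_length(self, key: int) -> int:
        return sum(1 for _ in self._chain(key))


table = ChainedHashTable(m=5, capacity=2)
for k in (3, 8, 13, 18, 23):          # all hash to bucket 3
    table.insert(k, f"rec-{k}")
print(table.chain_length(3))
```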

Carnegie Mellon 15-415 - C. Faloutsos28 Design decisions - conclusions function: division hashing –h(x) = ( a*x+b ) mod M size M: ~90% util.; prime number. collision resolution: separate chaining –easier to implement (deletions!); – no danger of becoming full

Carnegie Mellon 15-415 - C. Faloutsos30 Hashing vs B-trees: Hashing offers speed! (O(1) avg. search time) ...but:

Carnegie Mellon 15-415 - C. Faloutsos31 Hashing vs B-trees: ...but B-trees give: key ordering (–range queries –proximity queries –sequential scan); O(log(N)) guarantees for search, ins./del.; graceful growing/shrinking
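
To illustrate why key ordering matters, here is a small sketch that uses a sorted Python list as a stand-in for the ordered leaf level of a B-tree, and a dict as a stand-in for a hash table. Both stand-ins, and the sample keys, are assumptions for illustration only.

```python
from bisect import bisect_left, bisect_right

ssns = [567, 123, 345, 234, 456]          # hypothetical keys
by_hash = {ssn: f"rec-{ssn}" for ssn in ssns}
sorted_keys = sorted(ssns)


def range_query_sorted(lo: int, hi: int):
    """Ordered keys: two binary searches, then one contiguous slice."""
    start = bisect_left(sorted_keys, lo)
    end = bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]


def range_query_hash(lo: int, hi: int):
    """Hashed keys: no order to exploit, so every key must be examined."""
    return sorted(k for k in by_hash if lo <= k <= hi)


print(range_query_sorted(200, 460))
print(range_query_hash(200, 460))
```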

Carnegie Mellon 15-415 - C. Faloutsos32 Hashing vs B-trees: thus: B-trees are implemented in most systems. footnotes: hashing is not (why not?); ‘dbm’ and ‘ndbm’ of UNIX offer one or both

Carnegie Mellon 15-415 - C. Faloutsos34 Indexing in SQL create index <index-name> on <relation-name> (<attr-list>) create unique index <index-name> on <relation-name> (<attr-list>) drop index <index-name>

Carnegie Mellon 15-415 - C. Faloutsos35 Indexing in SQL e.g., create index ssn-index on STUDENT (ssn) or (e.g., on TAKES(ssn, c-id, grade) ): create index sc-index on TAKES (ssn, c-id)
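
The statements above can be tried out with Python's built-in `sqlite3` module, as in the sketch below. SQLite is an assumption here (the slides do not name a system); identifiers use underscores because hyphenated names would need quoting, the `name` column is invented for the example, and the ssn index is declared unique to show that variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ssn INTEGER, name TEXT)")
conn.execute("CREATE TABLE TAKES (ssn INTEGER, c_id TEXT, grade INTEGER)")

# Single-attribute unique index and a two-attribute index.
conn.execute("CREATE UNIQUE INDEX ssn_index ON STUDENT (ssn)")
conn.execute("CREATE INDEX sc_index ON TAKES (ssn, c_id)")

index_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)

# A unique index rejects a second row with the same ssn.
conn.execute("INSERT INTO STUDENT VALUES (123, 'Smith')")
try:
    conn.execute("INSERT INTO STUDENT VALUES (123, 'Jones')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX sc_index")
remaining = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(index_names, duplicate_rejected, remaining)
```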

Carnegie Mellon 15-415 - C. Faloutsos36 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: (theoretical interest) –dynamic hashing –multi-attribute indexing

Carnegie Mellon 15-415 - C. Faloutsos37 Problem with static hashing problem: overflow? problem: underflow? (underutilization)

Carnegie Mellon 15-415 - C. Faloutsos38 Solution: Dynamic/extendible hashing idea: shrink / expand hash table on demand... dynamic hashing Details: how to grow gracefully, on overflow? Many solutions - One of them: ‘extendible hashing’

Carnegie Mellon 15-415 - C. Faloutsos39 Extendible hashing [Figure: pages #0 through M; the record “123; Smith; Main str.” hashes to page #h(123)]

Carnegie Mellon 15-415 - C. Faloutsos40 Extendible hashing [Figure: pages #0 through M; the record “123; Smith; Main str.” hashes to page #h(123)] solution: split the bucket in two

Carnegie Mellon 15-415 - C. Faloutsos41 Extendible hashing in detail: keep a directory, with ptrs to hash-buckets Q: how to divide contents of bucket in two? A: hash each key into a very long bit string; keep only as many bits as needed Eventually:

Carnegie Mellon 15-415 - C. Faloutsos42 Extendible hashing [Figure: directory with entries 00..., 01..., 10..., 11... pointing to buckets that hold the hashed keys 10101..., 10110..., 1101..., 10011..., 0111..., 0001..., 101001...]

Carnegie Mellon 15-415 - C. Faloutsos44 Extendible hashing [Figure: directory with entries 00..., 01..., 10..., 11... and buckets holding the hashed keys 10101..., 10110..., 1101..., 10011..., 0111..., 0001..., 101001...; annotation: split on 3-rd bit]

Carnegie Mellon 15-415 - C. Faloutsos45 Extendible hashing [Figure: directory with entries 00..., 01..., 10..., 11...; hashed keys 1101..., 10011..., 0111..., 0001..., 101001..., 10101..., 10110...; one bucket is labeled ‘new page / bucket’]

Carnegie Mellon 15-415 - C. Faloutsos46 Extendible hashing [Figure: directory (doubled) with entries 000..., 001..., 010..., 011..., 100..., 101..., 110..., 111...; hashed keys 1101..., 10011..., 0111..., 0001..., 101001..., 10101..., 10110...; one bucket is labeled ‘new page / bucket’]

Carnegie Mellon 15-415 - C. Faloutsos47 Extendible hashing [Figure: BEFORE: directory 00..., 01..., 10..., 11... with hashed keys 10101..., 10110..., 1101..., 10011..., 0111..., 0001..., 101001... AFTER: directory 000..., 001..., 010..., 011..., 100..., 101..., 110..., 111... with the same keys, redistributed after the split]

Carnegie Mellon 15-415 - C. Faloutsos48 Extendible hashing Summary: directory doubles on demand, or halves on shrinking files; needs ‘local’ and ‘global’ depth (see book). Mainly of theoretical interest - same for –‘linear hashing’ of Litwin –‘order preserving’ –‘perfect hashing’ (no collisions!)
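
A compact sketch of extendible hashing with ‘local’ and ‘global’ depth, covering insertion and lookup only (no shrinking). The 16-bit hash function, the bucket capacity, and all names are assumptions for illustration; as on the slides, the directory is indexed by the leading bits of the hashed key.

```python
HASH_BITS = 16  # assumed width of the "very long bit string"


def hash_bits(key: int) -> int:
    """Assumed hash: a multiplicative hash truncated to HASH_BITS bits."""
    return (key * 40503) % (1 << HASH_BITS)


class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth  # number of leading bits this bucket uses
        self.keys = []


class ExtendibleHash:
    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.global_depth = 0           # number of leading bits the directory uses
        self.directory = [Bucket(0)]

    def _index(self, key: int) -> int:
        return hash_bits(key) >> (HASH_BITS - self.global_depth)

    def contains(self, key: int) -> bool:
        return key in self.directory[self._index(key)].keys

    def insert(self, key: int) -> None:
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.keys:
                return
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)         # overflow: split, then retry

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == HASH_BITS:
            raise RuntimeError("too many keys share one hash value")
        if bucket.local_depth == self.global_depth:
            # Double the directory: entry i becomes entries 2i and 2i+1.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        # Redistribute keys on the newly used bit of the hash.
        shift = HASH_BITS - bucket.local_depth
        old_keys = bucket.keys
        bucket.keys = [k for k in old_keys if ((hash_bits(k) >> shift) & 1) == 0]
        sibling.keys = [k for k in old_keys if ((hash_bits(k) >> shift) & 1) == 1]
        # Re-point the directory entries whose newly used bit is 1.
        dir_shift = self.global_depth - bucket.local_depth
        for i, b in enumerate(self.directory):
            if b is bucket and ((i >> dir_shift) & 1) == 1:
                self.directory[i] = sibling


table = ExtendibleHash(capacity=2)
for key in range(50):
    table.insert(key)
print(table.global_depth, len(table.directory))
```

Only the overflowing bucket is split; the directory doubles only when that bucket already uses as many bits as the directory does.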

Carnegie Mellon 15-415 - C. Faloutsos50 multiple-key access how to support queries on multiple attributes, like –grade>=3 and course=‘415’ major motivation: Geographic Information systems (GIS)

Carnegie Mellon 15-415 - C. Faloutsos51 multiple-key access [Figure: data plotted in the x-y plane]

Carnegie Mellon 15-415 - C. Faloutsos52 multiple-key access Typical query: Find cities within x miles from Pittsburgh thus, we want to store nearby cities on the same disk page:

Carnegie Mellon 15-415 - C. Faloutsos55 multiple-key access - R-trees [Figure: R-tree illustration in the x-y plane]

Carnegie Mellon 15-415 - C. Faloutsos56 multiple-key access - R-trees R-trees: very successful for GIS (along with ‘z-ordering’) more details: at ‘advanced topics’, later even more details: in 15-826
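
A minimal sketch of the ‘z-ordering’ idea mentioned above: interleave the bits of the two coordinates into a single key, so that points close in (x, y) tend to get nearby keys and can be stored in an ordinary one-dimensional index. The function name, the bit width, and the choice of x for the even bit positions are assumptions; the city coordinates are made up.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x in even positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Hypothetical grid coordinates for a few points.
points = {"A": (3, 5), "B": (2, 5), "C": (60, 1)}
for name, (x, y) in sorted(points.items(), key=lambda p: z_order(*p[1])):
    print(name, z_order(x, y))
```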

Carnegie Mellon 15-415 - C. Faloutsos57 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing industry workhorse
