1 Dept. of CIS, Temple Univ. CIS661 – Principles of Data Management V. Megalooikonomou Indexing and Hashing II (based on slides by C. Faloutsos at CMU)


2 General Overview - rel. model: Relational model - SQL; Formal & commercial query languages; Functional Dependencies; Normalization; Physical Design; Indexing

3 Indexing - overview: ISAM and B-trees; Hashing; Hashing vs B-trees; Indices in SQL; Advanced topics: dynamic hashing, multi-attribute indexing

4 (Static) Hashing: Problem: “find EMP record with ssn=123”. Q: What if disk space were free, and time were at a premium?

5 Hashing: A: Brilliant idea: key-to-address transformation. [figure: pages numbered #0 through #999,999,999; the record (123; Smith; Main str) sits on page #123]

6 Hashing: Since space is NOT free, use M slots instead of 999,999,999; hash function: h(key) = slot-id. [figure: pages numbered #0 through #999,999,999; the record (123; Smith; Main str) sits on page #123]

7 Hashing: Typically, each hash bucket is a page, holding many records. [figure: pages #0 through M; the record (123; Smith; Main str) sits on page #h(123)]

8 Hashing: Notice: could have clustering, or non-clustering versions. [figure: pages #0 through M; the record (123; Smith; Main str.) sits on page #h(123)]

9 Hashing: Notice: could have clustering, or non-clustering versions. [figure: pages #0 through M; page #h(123) holds the entry 123..., and a separate EMP file holds the records (123; Smith; Main str.), (234; Johnson; Forbes ave), (345; Tompson; Fifth ave), ...]

10 Indexing - overview: ISAM and B-trees; hashing (hashing functions, size of hash table, collision resolution); Hashing vs B-trees; Indices in SQL; Advanced topics

11 Design decisions: 1) formula h() for hashing function; 2) size of hash table M; 3) collision resolution method

12 Design decisions - functions: Goal: uniform spread of keys over hash buckets. Popular choices: division hashing; multiplication hashing

13 Division hashing: h(x) = (a*x+b) mod M; e.g., h(ssn) = (ssn) mod 1,000 gives the last three digits of ssn. M: size of hash table - choose a prime number, defensively (why?)

14 Division hashing: e.g., M=2; hash on driver-license number (dln), where the last digit is ‘gender’ (0/1 = M/F), in an army unit with predominantly male soldiers: almost every key then lands in the same bucket. Thus: avoid cases where M and keys have common divisors - prime M guards against that!
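
The effect described on slides 13 and 14 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `division_hash` and the sample keys (all multiples of 10) are made up here and do not come from the slides.

```python
from collections import Counter


def division_hash(key: int, m: int, a: int = 1, b: int = 0) -> int:
    """Division hashing: h(x) = (a*x + b) mod M."""
    return (a * key + b) % m


# Hypothetical keys whose last digit is always 0, so every key is a multiple of 10.
keys = [10 * i for i in range(1, 1001)]


def buckets_used(m: int) -> int:
    """Number of distinct buckets that receive at least one key."""
    return len(Counter(division_hash(k, m) for k in keys))


used_composite = buckets_used(100)  # M = 100 shares the factor 10 with every key
used_prime = buckets_used(101)      # M = 101 is prime

print(division_hash(123456789, 1000))  # last three digits of the key
print(used_composite, used_prime)
```

With M = 100 the keys can only reach the buckets whose number is a multiple of 10, while the prime M = 101 lets the same keys reach every bucket.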

15 Multiplication hashing: h(x) = floor( [ fractional-part-of ( x * φ ) ] * M ); φ: golden ratio ( 0.618... = ( sqrt(5)-1)/2 ). In general, we need an irrational number. Advantage: M need not be a prime number, but φ must be irrational
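
A minimal Python sketch of multiplication hashing, assuming integer keys and double-precision arithmetic. The name `multiplication_hash` is illustrative, and the final `min(...)` is a defensive guard added here against floating-point rounding, not something stated on the slide.

```python
import math

PHI = (math.sqrt(5) - 1) / 2  # 0.618..., the constant named on the slide


def multiplication_hash(key: int, m: int) -> int:
    """h(x) = floor(frac(x * PHI) * M), a slot-id in 0..M-1."""
    frac = (key * PHI) % 1.0  # fractional part of x * PHI
    # For very large keys a double loses fractional precision; this sketch
    # assumes keys small enough for that not to matter.
    return min(int(frac * m), m - 1)


M = 8  # deliberately not a prime number
slots = [multiplication_hash(k, M) for k in range(1, 801)]
print(sorted(set(slots)))
```

Even with a power-of-two table size, consecutive keys spread over all slots, which is the advantage the slide points out.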

16 Other hashing functions: quadratic hashing (bad)... Conclusion: use division hashing

17 Design decisions: 1) formula h() for hashing function; 2) size of hash table M; 3) collision resolution method

18 Size of hash table: e.g., 50,000 employees, 10 employee-records / page. Q: M=?? pages/buckets/slots

19 Size of hash table: e.g., 50,000 employees, 10 employees/page. Q: M=?? pages/buckets/slots. A: utilization ~ 90% and M: prime number. E.g., in our case: M = closest prime to (50,000/10) / 0.9 ≈ 5,555
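
The sizing rule above can be worked through in Python. The helper names (`is_prime`, `closest_prime`, `table_size`) are assumptions introduced for this sketch, and the tie-breaking choice (prefer the smaller prime when two are equally close) is arbitrary.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division; fine for table sizes of this magnitude."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def closest_prime(n: int) -> int:
    """Search outward from n (n >= 2) for the nearest prime."""
    for delta in range(n):
        for candidate in (n - delta, n + delta):
            if is_prime(candidate):
                return candidate
    raise ValueError("n must be at least 2")


def table_size(num_records: int, records_per_page: int, utilization: float = 0.9) -> int:
    """Pages needed at the target utilization, moved to the closest prime."""
    target = round(num_records / records_per_page / utilization)
    return closest_prime(target)


print(table_size(50_000, 10))
```

For the slide's numbers, 5,000 / 0.9 is about 5,555.6, and neither 5,555 nor 5,556 is prime, so the sketch settles on the next prime above.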

20 Design decisions: 1) formula h() for hashing function; 2) size of hash table M; 3) collision resolution method

21 Collision resolution: Q: what is a ‘collision’? A: ??

22 Collision resolution: [figure: pages #0 through M; the record (123; Smith; Main str.) hashes to page #h(123)]

23 Collision resolution: Q: what is a ‘collision’? A: ?? Q: why worry about collisions/overflows? (recall that buckets are ~90% full) A: e.g., bank account balances between $0 and $10,000 and between $90,000 and $100,000

24 Collision resolution: open addressing: linear probing (i.e., put to next slot/bucket), re-hashing; separate chaining (i.e., put links to overflow pages)

25 Collision resolution: linear probing. [figure: pages #0 through M; the record (123; Smith; Main str.) hashes to page #h(123)]
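
Linear probing can be sketched as follows. To keep the example short it assumes one record per slot rather than a page of records, and the class name `LinearProbingTable` is made up for illustration.

```python
class LinearProbingTable:
    """Open addressing with linear probing; one record per slot for brevity."""

    def __init__(self, m: int):
        self.m = m
        self.slots = [None] * m  # each slot is None or a (key, value) pair

    def _probe(self, key: int):
        """Yield the home slot h(key), then the following slots, wrapping around."""
        start = key % self.m
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, key: int, value) -> int:
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return idx  # slot where the record ended up
        raise RuntimeError("hash table is full")

    def get(self, key: int):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None:  # an empty slot ends the search
                return None
            if entry[0] == key:
                return entry[1]
        return None


table = LinearProbingTable(7)
home = table.insert(123, "Smith")       # 123 mod 7 = 4
shifted = table.insert(130, "Johnson")  # also hashes to 4, so it moves to slot 5
print(home, shifted, table.get(130))
```

The sketch has no delete operation on purpose: removing a record would need a tombstone marker so that later searches do not stop early, and the table can fill up completely.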

26 Collision resolution: re-hashing with h1(), h2(). [figure: pages #0 through M; the record (123; Smith; Main str.) hashes to page #h(123)]
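
The slide only names h1() and h2(). One common way to use a second hash function is double hashing, sketched below as an assumption about what is meant; the particular choice of h2 and the name `probe_sequence` are illustrative.

```python
def probe_sequence(key: int, m: int) -> list[int]:
    """Slots tried for a key: h1, h1 + h2, h1 + 2*h2, ... (all mod M)."""
    h1 = key % m
    h2 = 1 + (key % (m - 1))  # second hash; never 0, so the probe always moves
    return [(h1 + i * h2) % m for i in range(m)]


# 123 and 130 collide on their first probe (slot 4) but diverge on the second.
print(probe_sequence(123, 7))
print(probe_sequence(130, 7))
```

With a prime M, every step size is coprime to M, so each key's sequence visits all slots exactly once.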

27 Collision resolution: separate chaining. [figure: the page holding the record (123; Smith; Main str.), with a link to an overflow page]
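
Separate chaining with overflow pages might look like the sketch below. The class names `Page` and `ChainedHashFile`, the page capacity, and the use of `key % M` as the hash function are all assumptions made for illustration.

```python
class Page:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.records = []     # list of (key, value) pairs
        self.overflow = None  # link to the next overflow page, if any


class ChainedHashFile:
    def __init__(self, m: int, page_capacity: int):
        self.m = m
        self.page_capacity = page_capacity
        self.buckets = [Page(page_capacity) for _ in range(m)]

    def insert(self, key: int, value) -> None:
        page = self.buckets[key % self.m]
        while len(page.records) >= page.capacity:  # walk to a page with room
            if page.overflow is None:
                page.overflow = Page(self.page_capacity)
            page = page.overflow
        page.records.append((key, value))

    def get(self, key: int):
        page = self.buckets[key % self.m]
        while page is not None:
            for k, v in page.records:
                if k == key:
                    return v
            page = page.overflow
        return None

    def delete(self, key: int) -> bool:
        page = self.buckets[key % self.m]
        while page is not None:
            for i, (k, _) in enumerate(page.records):
                if k == key:
                    del page.records[i]  # no tombstones needed
                    return True
            page = page.overflow
        return False

    def chain_length(self, slot: int) -> int:
        n, page = 0, self.buckets[slot]
        while page is not None:
            n, page = n + 1, page.overflow
        return n


f = ChainedHashFile(m=5, page_capacity=2)
for key, name in [(3, "a"), (8, "b"), (13, "c")]:  # all three hash to slot 3
    f.insert(key, name)
print(f.chain_length(3), f.get(13), f.delete(8), f.get(8))
```

Deletion is a plain removal from a page, and a bucket never becomes full because another overflow page can always be linked in.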

28 Design decisions - conclusions: function: division hashing, h(x) = ( a*x+b ) mod M; size M: ~90% util., prime number; collision resolution: separate chaining, which is easier to implement (deletions!) and has no danger of becoming full

29 Indexing - overview: ISAM and B-trees; hashing; Hashing vs B-trees; Indices in SQL; Advanced topics: dynamic hashing, multi-attribute indexing

30 Hashing vs B-trees: Hashing offers speed! ( O(1) avg. search time) ...but B-trees offer:

31 Hashing vs B-trees: ...but B-trees offer: key ordering (range queries, proximity queries, sequential scan); O(log(N)) guarantees for search, ins./del.; graceful growing/shrinking
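
The key-ordering point can be illustrated with a sorted list standing in for the ordered leaf level of a B-tree. This is not a B-tree implementation, only a sketch of why ordering helps range queries; the fourth record (456; Lee) is hypothetical and added just to have more data.

```python
import bisect

records = {123: "Smith", 234: "Johnson", 345: "Tompson", 456: "Lee"}

hash_index = dict(records)     # hashing: fast exact-match lookups only
sorted_keys = sorted(records)  # stand-in for keys kept in order


def range_query_ordered(lo: int, hi: int) -> list[int]:
    """Two binary searches, then a sequential scan of the matching keys."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]


def range_query_hashed(lo: int, hi: int) -> list[int]:
    """A hash table has no key order, so every key must be examined."""
    return sorted(k for k in hash_index if lo <= k <= hi)


print(hash_index[123])
print(range_query_ordered(200, 400))
print(range_query_hashed(200, 400))
```

Both functions return the same keys, but the ordered version touches only the matching range while the hashed version scans everything.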

32 Hashing vs B-trees: thus, B-trees are implemented in most systems. Footnote: hashing is not (why not?)

33 Indexing - overview: ISAM and B-trees; hashing; Hashing vs B-trees; Indices in SQL; Advanced topics: dynamic hashing, multi-attribute indexing

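34 Indexing in SQL: create index <index-name> on <relation-name> (<attribute-list>); create unique index <index-name> on <relation-name> (<attribute-list>) (in the case that the search key is a candidate key); drop index <index-name>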

35 Indexing in SQL: e.g., create index ssn-index on STUDENT (ssn); or (e.g., on TAKES(ssn, c-id, grade)): create index sc-index on TAKES (ssn, c-id)
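
The statements on slides 34 and 35 can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the identifiers use underscores (`ssn_index`, `c_id`) because hyphenated names are not valid unquoted SQL identifiers, the column types are made up, and `ssn_index` is created as a unique index here to show the candidate-key case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (ssn INTEGER, name TEXT)")
conn.execute("CREATE TABLE takes (ssn INTEGER, c_id TEXT, grade INTEGER)")

# ssn is assumed to be a candidate key of student, so the index can be unique.
conn.execute("CREATE UNIQUE INDEX ssn_index ON student (ssn)")
# Composite index on two attributes of takes.
conn.execute("CREATE INDEX sc_index ON takes (ssn, c_id)")


def index_names() -> list[str]:
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name"
    )
    return [name for (name,) in rows]


names_before = index_names()

conn.execute("INSERT INTO student VALUES (123, 'Smith')")
try:
    conn.execute("INSERT INTO student VALUES (123, 'Johnson')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # the unique index refuses a second ssn = 123

conn.execute("DROP INDEX sc_index")
names_after = index_names()
print(names_before, duplicate_rejected, names_after)
```

Other database systems accept the same `CREATE INDEX` and `CREATE UNIQUE INDEX` forms, though the exact `DROP INDEX` syntax varies between systems.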

36 Indexing - overview: ISAM and B-trees; hashing; Hashing vs B-trees; Indices in SQL; Advanced topics (theoretical interest): dynamic hashing, multi-attribute indexing

37 Problem with static hashing: problem: overflow? problem: underflow? (under-utilization)

38 Solution: Dynamic/extendible hashing: Idea: shrink / expand hash table on demand... dynamic hashing. Details: how to grow gracefully, on overflow? Many solutions; one of them: ‘extendible hashing’

39 Extendible hashing: [figure: pages #0 through M; the record (123; Smith; Main str.) hashes to page #h(123)]

40 Extendible hashing: [figure: pages #0 through M; the record (123; Smith; Main str.) hashes to page #h(123)] Solution: split the bucket in two

41 Extendible hashing: in detail: keep a directory, with ptrs to hash-buckets. Q: how to divide contents of bucket in two? A: hash each key into a very long bit string; keep only as many bits as needed. Eventually:

42 Extendible hashing: [figure: directory with entries 00..., 01..., 10..., 11... pointing to buckets; hashed keys shown: 10101..., 10110..., 1101..., 10011..., 0111..., 0001..., 101001...]

43 Extendible hashing: [figure: same directory (00..., 01..., 10..., 11...) and the same hashed keys as on slide 42]

44 Extendible hashing: [figure: same directory and hashed keys, annotated "split on 3-rd bit"]

45 Extendible hashing: [figure: directory 00..., 01..., 10..., 11...; the keys 10101... and 10110... are shown next to the label "new page / bucket"; remaining keys: 1101..., 10011..., 0111..., 0001..., 101001...]

46 Extendible hashing: [figure: directory (doubled), with entries 000..., 001..., 010..., 011..., 100..., 101..., 110..., 111...; same hashed keys, including the new page / bucket]

47 Extendible hashing: [figure: BEFORE (directory 00..., 01..., 10..., 11...) and AFTER (directory 000... through 111...) shown side by side with the same hashed keys]

48 Extendible hashing: Summary: directory doubles on demand, or halves on shrinking files; needs ‘local’ and ‘global’ depth (see book). Mainly of theoretical interest - same for ‘linear hashing’ of Litwin, ‘order preserving’ hashing, ‘perfect hashing’ (no collisions!)
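
The directory doubling and bucket split walked through on slides 41 to 47 can be sketched in Python. Several things here are assumptions rather than content from the slides: keys are given directly as bit strings (standing in for already-hashed values), each bucket holds at most 3 keys, the directory starts with 4 entries, keys are distinct and long enough for the depth reached, and the class and method names are made up. Shrinking (halving the directory) is left out.

```python
class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth  # leading bits shared by all keys in this bucket
        self.keys = []


class ExtendibleHash:
    def __init__(self, bucket_capacity: int = 3, global_depth: int = 2):
        self.capacity = bucket_capacity
        self.global_depth = global_depth  # leading bits used to index the directory
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, bits: str) -> int:
        return int(bits[: self.global_depth], 2)

    def contains(self, bits: str) -> bool:
        return bits in self.directory[self._slot(bits)].keys

    def insert(self, bits: str) -> None:
        bucket = self.directory[self._slot(bits)]
        while len(bucket.keys) >= self.capacity:  # overflow: split, then look again
            self._split(bucket)
            bucket = self.directory[self._slot(bits)]
        bucket.keys.append(bits)

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.global_depth:
            # Double the directory: new entry i shares the bucket of old entry i >> 1.
            size = 2 ** (self.global_depth + 1)
            self.directory = [self.directory[i >> 1] for i in range(size)]
            self.global_depth += 1
        bucket.local_depth += 1
        d = bucket.local_depth
        new_bucket = Bucket(d)
        old_keys = bucket.keys
        # Redistribute on the d-th bit (counting from the left, starting at 1).
        bucket.keys = [k for k in old_keys if k[d - 1] == "0"]
        new_bucket.keys = [k for k in old_keys if k[d - 1] == "1"]
        # Repoint the directory entries whose d-th bit is 1 to the new bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (self.global_depth - d)) & 1:
                self.directory[i] = new_bucket


h = ExtendibleHash(bucket_capacity=3, global_depth=2)
for bits in ["0001", "0111", "1101", "10101", "10110", "10011", "101001"]:
    h.insert(bits)

print(h.global_depth, len(h.directory))
print(h.directory[0b100].keys, h.directory[0b101].keys)
```

With the bit strings from the slides, the seventh key overflows the bucket for prefix 10, the directory doubles from 4 to 8 entries, and that bucket is split on the third bit. Buckets that did not overflow are simply shared by two directory entries, which is why each bucket needs its own local depth alongside the global depth.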

49 Indexing - overview: ISAM and B-trees; Hashing; Hashing vs B-trees; Indices in SQL; Advanced topics: dynamic hashing, multi-attribute indexing

50 multiple-key access: How to support queries on multiple attributes, like grade>=3 and course=‘415’? Major motivation: Geographic Information Systems (GIS)

51 multiple-key access: [figure: plot with axes x and y]

52 Typical query: Find cities within x miles from Philadelphia; thus, we want to store nearby cities on the same disk page:

53 multiple-key access: [figure: plot with axes x and y]

54 [figure: plot with axes x and y]

55 multiple-key access - R-trees: [figure: plot with axes x and y]

56 R-trees: very successful for GIS (along with ‘z-ordering’); more details: at ‘advanced topics’, later
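
The slides only name z-ordering. As a rough sketch of the idea, assuming non-negative integer grid coordinates, the bits of x and y can be interleaved into a single key so that nearby points tend to get nearby keys. The function name `z_order` and the choice to put x in the even bit positions are conventions picked for this example.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one integer key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bit i of x goes to position 2i
        code |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y goes to position 2i + 1
    return code


# Sorting a small grid by z-value visits it quadrant by quadrant.
grid = [(x, y) for y in range(2) for x in range(2)]
print(sorted(grid, key=lambda p: z_order(*p)))
print(z_order(2, 2), z_order(3, 3))
```

The resulting one-dimensional key can be stored in an ordinary ordered index, which is one way to keep nearby cities on the same disk page.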

57 Indexing - overview: ISAM and B-trees; hashing; Hashing vs B-trees; Indices in SQL; Advanced topics: dynamic hashing, multi-attribute indexing [annotation on slide: industry workhorse]

