CMU SCS 15-826: Multimedia Databases and Data Mining. Lecture #3: Primary key indexing – hashing. C. Faloutsos.

CMU SCS Copyright: C. Faloutsos (2014). Reading Material: [Litwin] Litwin, W., (1980), Linear Hashing: A New Tool for File and Table Addressing, VLDB, Montreal, Canada, 1980; textbook, Chapter 3; Ramakrishnan+Gehrke, Chapter 11.

Outline. Goal: ‘Find similar / interesting things’. Intro to DB; Indexing - similarity search; Data Mining.

Indexing - Detailed outline: primary key indexing (B-trees and variants; (static) hashing; extendible hashing); secondary key indexing; spatial access methods; text; ...

(Static) Hashing. Problem: “find EMP record with ssn=123”. What if disk space were free, and time were at a premium?

Hashing. A: Brilliant idea: key-to-address transformation. [Figure: one page per possible key, #0 ... #999,999,999; the record ‘123; Smith; Main str.’ sits in page #123.]

Hashing. Since space is NOT free: use M, instead of 999,999,999 slots. Hash function: h(key) = slot-id. [Figure: same layout as on the previous slide.]

Hashing. Typically: each hash bucket is a page, holding many records. [Figure: a table of M pages starting at #0; the record ‘123; Smith; Main str.’ sits in page #h(123).]

Hashing - design decisions? eg., IRS, 200M tax returns, by SSN.

Indexing - overview: B-trees; (static) hashing (hashing functions; size of hash table; collision resolution; Hashing vs B-trees; Indices in SQL); Extendible hashing.

Design decisions: 1) formula h() for hashing function; 2) size of hash table M; 3) collision resolution method.

Design decisions, with the recommended choices: 1) formula h() for hashing function: division hashing; 2) size of hash table M: 90% utilization; 3) collision resolution method: separate chaining.

Design decisions - functions. Goal: uniform spread of keys over hash buckets. Popular choices: division hashing; multiplication hashing.

Division hashing: h(x) = (a*x+b) mod M. eg., h(ssn) = (ssn) mod 1,000 gives the last three digits of ssn. M: size of hash table - choose a prime number, defensively (why?)

Division hashing. eg., M=2; hash on driver-license number (dln), where the last digit is ‘gender’ (0/1 = M/F), in an army unit with predominantly male soldiers: almost every record lands in bucket 0. Thus: avoid cases where M and keys have common divisors - prime M guards against that!
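The common-divisor pitfall can be checked with a few lines of Python. This is only a sketch: the key set (all multiples of 10) and the helper name `buckets_used` are made up for illustration, and a=1, b=0 are assumed in h(x) = (a*x+b) mod M.

```python
def division_hash(key: int, m: int, a: int = 1, b: int = 0) -> int:
    """Division hashing: h(x) = (a*x + b) mod M."""
    return (a * key + b) % m


def buckets_used(keys, m: int) -> int:
    """Number of distinct buckets that the keys actually reach."""
    return len({division_hash(k, m) for k in keys})


# With M = 1,000 the hash is just the last three digits of the key.
last_three = division_hash(123456789, 1000)

# Hypothetical skewed key set: every key is a multiple of 10.
keys = [10 * i for i in range(1, 1001)]

shared_factor = buckets_used(keys, 1000)  # M = 1000 shares the divisor 10 with every key
prime_size = buckets_used(keys, 997)      # M = 997 is prime

print(last_three, shared_factor, prime_size)
```

With M = 1,000 the thousand keys reach only 100 of the 1,000 buckets; with the prime M = 997 they reach all 997.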

Size of hash table. eg., 50,000 employees, 10 employee-records / page. Q: M=?? pages/buckets/slots. A: utilization ~ 90%, and M: prime number. Eg., in our case: M = closest prime to 50,000/10 / 0.9 ≈ 5,555.
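The sizing rule can be turned into a small calculation. A minimal sketch, assuming trial division is good enough for primality at this scale; the function names `is_prime` and `table_size` are illustrative, not from the slides.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for table sizes of a few thousand pages."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def table_size(num_records: int, records_per_page: int, utilization: float = 0.9) -> int:
    """Prime M closest to num_records / records_per_page / utilization."""
    base = round(num_records / records_per_page / utilization)
    offset = 0
    while True:
        # Look outwards from the target; on a tie the smaller candidate wins.
        for candidate in (base - offset, base + offset):
            if is_prime(candidate):
                return candidate
        offset += 1


m = table_size(50_000, 10)
print(m, 50_000 / (10 * m))  # chosen M and the resulting utilization
```

For the slide's numbers this picks M = 5,557, which gives a utilization just under 90%.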

Collision resolution. Q: what is a ‘collision’? A: two different keys hash to the same bucket; once that bucket's page is full, the next such key overflows. [Figure: table of M pages starting at #0, with the record ‘123; Smith; Main str.’ hashing to page #h(123).] Q: why worry about collisions/overflows? (recall that buckets are ~90% full) A: ‘birthday paradox’.
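The ‘birthday paradox’ answer can be made concrete. The sketch below assumes the simplest model, one record per slot and uniformly random hashing, which is not the page-sized buckets of the slides but gives the same intuition: collisions appear long before the table is anywhere near full. The function name `collision_probability` is illustrative.

```python
def collision_probability(num_keys: int, num_slots: int) -> float:
    """Probability that at least two of num_keys uniformly hashed keys share a slot."""
    p_no_collision = 1.0
    for i in range(num_keys):
        # The i-th key must avoid the i slots that are already taken.
        p_no_collision *= (num_slots - i) / num_slots
    return 1.0 - p_no_collision


# Classic birthday setting: 23 keys, 365 slots (table about 6% full).
p_birthday = collision_probability(23, 365)
print(round(p_birthday, 3))
```

Even with the table only about 6% full, the chance of at least one collision already exceeds one half.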

Collision resolution: open addressing (linear probing, ie., put to next slot/bucket; re-hashing); separate chaining (ie., put links to overflow pages).

Collision resolution - linear probing. [Figure: table of M pages starting at #0; record ‘123; Smith; Main str.’ and page #h(123).]

Collision resolution - re-hashing. [Figure: same table, with two hash functions h1() and h2().]

Collision resolution - separate chaining. [Figure: record ‘123; Smith; Main str.’ in a chained overflow page.]

Design decisions - conclusions. Function: division hashing, h(x) = (a*x+b) mod M. Size M: ~90% util.; prime number. Collision resolution: separate chaining - easier to implement (deletions!); no danger of becoming full.
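The three recommended choices fit together as in the sketch below. It is an in-memory toy, not a disk-based implementation: Python lists stand in for pages, the class and method names are hypothetical, and a=1, b=0 are assumed for the hash function.

```python
class ChainedHashTable:
    """Static hash table: M primary pages, each followed by a chain of overflow pages."""

    def __init__(self, m: int, page_capacity: int):
        self.m = m
        self.page_capacity = page_capacity
        # chains[b] is a list of pages; chains[b][0] is the primary page of bucket b.
        self.chains = [[[]] for _ in range(m)]

    def _h(self, key: int) -> int:
        return key % self.m  # division hashing with a=1, b=0

    def insert(self, key: int, record: str) -> None:
        chain = self.chains[self._h(key)]
        if len(chain[-1]) >= self.page_capacity:
            chain.append([])  # last page is full: link a new overflow page
        chain[-1].append((key, record))

    def search(self, key: int):
        """Return (record, pages_read); record is None if the key is absent."""
        pages_read = 0
        for page in self.chains[self._h(key)]:
            pages_read += 1
            for k, rec in page:
                if k == key:
                    return rec, pages_read
        return None, pages_read


table = ChainedHashTable(m=5, page_capacity=2)
for ssn in (3, 8, 13):  # all three keys hash to bucket 3
    table.insert(ssn, f"rec{ssn}")

print(table.search(3))   # found in the primary page
print(table.search(13))  # found in the first overflow page
```

The table never becomes full, since a chain can always grow by one more page; the price is that a search may read more than one page.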

Hashing vs B-trees: Hashing offers speed! (O(1) avg. search time) ..but:

Hashing vs B-trees: ..but B-trees give: key ordering (range queries; proximity queries; sequential scan); O(log(N)) guarantees for search, ins./del.; graceful growing/shrinking.
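Why key ordering matters for range queries can be illustrated as follows. In this sketch a sorted Python list stands in for the ordered leaf level of a B-tree (an assumption for illustration, not a B-tree implementation), and the keys are made up.

```python
import bisect


def range_query_sorted(sorted_keys, lo, hi):
    """Keys in [lo, hi] from ordered data: two binary searches, then one contiguous slice."""
    left = bisect.bisect_left(sorted_keys, lo)
    right = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[left:right]


def range_query_hashed(buckets, lo, hi):
    """Same query on hash buckets: key order is lost, so every bucket has to be read."""
    return sorted(k for bucket in buckets for k in bucket if lo <= k <= hi)


keys = [88, 3, 42, 15, 61, 8, 57, 23]  # hypothetical keys
sorted_keys = sorted(keys)
buckets = [[k for k in keys if k % 5 == b] for b in range(5)]

print(range_query_sorted(sorted_keys, 10, 60))
print(range_query_hashed(buckets, 10, 60))
```

Both calls return the same keys, but the hashed version touches all buckets regardless of how narrow the range is.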

Hashing vs B-trees: thus: B-trees are implemented in most systems. Footnote: ‘dbm’ and ‘ndbm’ of UNIX offer one or both.

Indexing in SQL: create index <index-name> on <relation-name> (<attribute-list>); create unique index <index-name> on <relation-name> (<attribute-list>); drop index <index-name>.

Indexing in SQL. eg., create index ssn-index on STUDENT (ssn); or (eg., on TAKES(ssn, c-id, grade)): create index sc-index on TAKES (ssn, c-id).
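These statements can be tried out with Python's built-in sqlite3 module. Two assumptions in this sketch: the identifiers use underscores (ssn_index, c_id), because unquoted hyphens are not valid in SQL names, and sc_index is declared unique here only to show what ‘create unique index’ enforces. The column types and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ssn INTEGER, name TEXT)")
conn.execute("CREATE TABLE TAKES (ssn INTEGER, c_id TEXT, grade TEXT)")

# Single-attribute index and a composite (two-attribute) unique index.
conn.execute("CREATE INDEX ssn_index ON STUDENT (ssn)")
conn.execute("CREATE UNIQUE INDEX sc_index ON TAKES (ssn, c_id)")

conn.execute("INSERT INTO TAKES VALUES (123, '15-826', 'A')")
try:
    # Same (ssn, c_id) pair again: the unique index rejects it.
    conn.execute("INSERT INTO TAKES VALUES (123, '15-826', 'B')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# PRAGMA index_list returns one row per index; column 1 is the index name.
takes_indexes = sorted(row[1] for row in conn.execute("PRAGMA index_list('TAKES')"))

conn.execute("DROP INDEX ssn_index")
student_indexes = list(conn.execute("PRAGMA index_list('STUDENT')"))

print(duplicate_rejected, takes_indexes, student_indexes)
```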

Indexing - overview: B-trees; (static) hashing; extendible hashing (‘linear’ hashing [Litwin]).

Problem with static hashing. Problem: overflow? Problem: underflow? (underutilization)

Solution: Dynamic/extendible hashing. Idea: shrink / expand hash table on demand: dynamic hashing. Details: how to grow gracefully, on overflow? Many solutions - simplest: Linear hashing [Litwin].

Indexing - overview: B-trees; static hashing; extendible hashing (‘extensible’ hashing [Fagin, Pippenger +]; ‘linear’ hashing [Litwin]).

Linear hashing - Detailed overview: motivation; main idea; search algo; insertion/split algo; deletion; performance analysis; variations.

Linear hashing. Motivation: ext. hashing needs directory etc etc; which doubles (ouch!). Q: can we do something simpler, with smoother growth? A: split buckets from left to right, regardless of which one overflowed (‘crazy’, but it works well!) - Eg.:

Linear hashing. Initially: h(x) = x mod N (N=4 here). Assume capacity: 3 records / bucket. Insert key ‘17’: 17 mod 4 = 1, and bucket #1 overflows. Split #0, anyway!!! Q: But, how? [Figure: buckets #0 - #3, listed by bucket-id.]

Linear hashing. A: use two h.f.: h0(x) = x mod N and h1(x) = x mod (2*N). After the split: [Figure: buckets listed by bucket-id, now including the new bucket #4; the overflow page; and the split ptr.]

Linear hashing - searching? h0(x) = x mod N (for the un-split buckets); h1(x) = x mod (2*N) (for the split ones). [Figure: buckets by bucket-id, including #4; overflow page; split ptr.]

Linear hashing - searching? Q1: find key ‘6’? Q2: find key ‘4’? Q3: key ‘8’?

Linear hashing - searching? Algo to find key ‘k’: compute b = h0(k); if b < split-ptr, compute b = h1(k); search bucket b.
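The search rule is short enough to run directly on Q1 - Q3. The sketch assumes the state suggested by the preceding slides: N = 4 and only bucket #0 split so far, ie., split-ptr = 1. The function name `find_bucket` is illustrative.

```python
def find_bucket(key: int, n: int, split_ptr: int) -> int:
    """Bucket that may hold `key`, given N and the current split pointer."""
    b = key % n                # h0(k)
    if b < split_ptr:          # that bucket has already been split ...
        b = key % (2 * n)      # ... so h1(k) decides between it and its new sibling
    return b


# Assumed state: N = 4, bucket #0 already split, so split-ptr = 1.
answers = {key: find_bucket(key, n=4, split_ptr=1) for key in (6, 4, 8)}
print(answers)
```

Under that assumption key 6 is looked up in bucket #2 (h0 suffices), key 4 in the new bucket #4, and key 8 in bucket #0 (both via h1).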

Linear hashing - insertion? Algo: insert key ‘k’: compute appropriate bucket ‘b’; if the overflow criterion is true: split the bucket of ‘split-ptr’; split-ptr ++ (*). (*) what if we reach the right edge??

Linear hashing - insertion? Notice: overflow criterion is up to us!! Q: suggestions? A1: space utilization > u-max. A2: avg length of ovf chains > max-len. A3: ....

Linear hashing - split now? h0(x) = x mod N (for the un-split buckets); h1(x) = x mod (2*N) (for the split ones). [Figure sequence: the split ptr at successive positions; the final state is called ‘full expansion’.]

Linear hashing - observations. In general, at any point of time, we have at most two h.f. active, of the form: h_n(x) = x mod (N * 2^n) and h_(n+1)(x) = x mod (N * 2^(n+1)) (after a full expansion, we have only one h.f.).
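Putting search, insertion, splitting and the two active hash functions together gives roughly the following. This is a minimal in-memory sketch, not Litwin's original formulation: it assumes overflow criterion A1 (utilization > u-max), integer keys, and a bucket modelled as a Python list whose excess over `capacity` plays the role of the overflow chain. All names are hypothetical.

```python
class LinearHashTable:
    def __init__(self, n: int = 4, capacity: int = 3, u_max: float = 0.85):
        self.n = n                # N: number of buckets before any split
        self.capacity = capacity  # records per primary page
        self.u_max = u_max        # split when utilization exceeds this
        self.level = 0            # active h.f. are h_level and h_(level+1)
        self.split_ptr = 0        # next bucket to split
        self.buckets = [[] for _ in range(n)]
        self.count = 0

    def _bucket_of(self, key: int) -> int:
        b = key % (self.n * 2 ** self.level)            # h_level
        if b < self.split_ptr:                          # bucket already split
            b = key % (self.n * 2 ** (self.level + 1))  # h_(level+1)
        return b

    def search(self, key: int) -> bool:
        return key in self.buckets[self._bucket_of(key)]

    def utilization(self) -> float:
        return self.count / (len(self.buckets) * self.capacity)

    def insert(self, key: int) -> None:
        self.buckets[self._bucket_of(key)].append(key)
        self.count += 1
        if self.utilization() > self.u_max:  # overflow criterion A1
            self._split()

    def _split(self) -> None:
        # Split the bucket under split_ptr, regardless of which one overflowed.
        new_mod = self.n * 2 ** (self.level + 1)
        old = self.buckets[self.split_ptr]
        self.buckets[self.split_ptr] = [k for k in old if k % new_mod == self.split_ptr]
        self.buckets.append([k for k in old if k % new_mod != self.split_ptr])
        self.split_ptr += 1
        if self.split_ptr == self.n * 2 ** self.level:
            # Reached the right edge: full expansion, only one h.f. remains active.
            self.level += 1
            self.split_ptr = 0


table = LinearHashTable()
for key in range(100):
    table.insert(key)
print(len(table.buckets), table.level, table.split_ptr)
```

The table grows one bucket at a time, and the number of buckets always equals N * 2^level + split-ptr.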

Linear hashing - deletion? Reverse of insertion: if the underflow criterion is met - contract!

Linear hashing - how to contract? h0(x) = x mod N (for the un-split buckets); h1(x) = x mod (2*N) (for the split ones). [Figure sequence: the split ptr moving back.]
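One way to read ‘reverse of insertion’ is to undo the most recent split: move the split ptr one step back and merge the last bucket into the bucket it was split from. The sketch below follows that reading; it is an assumption about the details, which the slides leave to the figures, and the bucket contents are made up.

```python
def contract(buckets, n: int, level: int, split_ptr: int):
    """Undo the most recent split in place; return the new (level, split_ptr)."""
    if level == 0 and split_ptr == 0:
        return level, split_ptr        # already down to the initial N buckets
    if split_ptr == 0:                 # we are right after a full expansion
        level -= 1
        split_ptr = n * 2 ** level     # step back to the right edge of the previous level
    split_ptr -= 1
    # The last bucket was created by splitting bucket `split_ptr`: merge it back.
    buckets[split_ptr].extend(buckets.pop())
    return level, split_ptr


# Hypothetical state: N = 4, bucket #0 has been split into #0 and #4 (split-ptr = 1).
buckets = [[8], [1, 5], [2, 6], [3], [4, 12]]
level, split_ptr = contract(buckets, n=4, level=0, split_ptr=1)
print(buckets, level, split_ptr)
```

After the call the table is back to four buckets and every key is again found with h0 alone.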

Linear hashing - performance [Larson, TODS 1982]. Split: if u > u0 (say u0 = .85). [Plot: search-time (avg # of d.a., ie., disk accesses) versus # records, over the range R to 2R; annotated values: eg., 1.01 d.a. and eg., 1.3 d.a.]

Other hashing variations: ‘order preserving’; ‘perfect hashing’ (no collisions!) [Ed. Fox, et al].

Primary key indexing - conclusions: hashing is O(1) on the average for search; linear hashing: elegant way to grow a hash table; B-trees: industry work-horse for primary-key indexing (O(log(N)) w.c.!).

References for primary key indexing
[Fagin+] Ronald Fagin, Jürg Nievergelt, Nicholas Pippenger, H. Raymond Strong: Extendible Hashing - A Fast Access Method for Dynamic Files. TODS 4(3) (1979)
[Fox] Fox, E. A., L. S. Heath, Q.-F. Chen, and A. M. Daoud. "Practical Minimal Perfect Hash Functions for Large Databases." Communications of the ACM 35.1 (1992)
[Knuth] D.E. Knuth. The Art Of Computer Programming, Vol. 3, Sorting and Searching, Addison Wesley
[Larson] Per-Ake Larson: Performance Analysis of Linear Hashing with Partial Expansions. ACM TODS, 7,4, Dec. 1982
[Litwin] Litwin, W., (1980), Linear Hashing: A New Tool for File and Table Addressing, VLDB, Montreal, Canada, 1980