Carnegie Mellon Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications C. Faloutsos Indexing and Hashing – part II.

Slides:



Advertisements
Similar presentations
External Memory Hashing. Model of Computation Data stored on disk(s) Minimum transfer unit: a page = b bytes or B records (or block) N records -> N/B.
Advertisements

CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 – DB Applications Faloutsos & Pavlo Lecture#11 (R&G ch. 11) Hashing.
Review of Chapter 8 張啟中.
CMU SCS : Multimedia Databases and Data Mining Lecture#3: Primary key indexing – hashing C. Faloutsos.
CS4432: Database Systems II Hash Indexing 1. Hash-Based Indexes Adaptation of main memory hash tables Support equality searches No range searches 2.
Hash-Based Indexes Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY.
1 Hash-Based Indexes Module 4, Lecture 3. 2 Introduction As for any index, 3 alternatives for data entries k* : – Data record with key value k – –Choice.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals.
Hashing and Indexing John Ortiz.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 11 – Hash-based Indexing.
File Processing : Hash 2015, Spring Pusan National University Ki-Joune Li.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
Chapter 11 (3 rd Edition) Hash-Based Indexes Xuemin COMP9315: Database Systems Implementation.
Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?
What we learn with pleasure we never forget. Alfred Mercier Smitha N Pai.
B+-tree and Hashing.
1 Hash-Based Indexes Chapter Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Static and dynamic.
FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12.
1 Hash-Based Indexes Chapter Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
CS 277 – Spring 2002Notes 51 CS 277: Database System Implementation Arthur Keller Notes 5: Hashing and More.
Hashing General idea: Get a large array
E.G.M. PetrakisHashing1 Hashing on the Disk  Keys are stored in “disk pages” (“buckets”)  several records fit within one page  Retrieval:  find address.
CPSC-608 Database Systems Fall 2008 Instructor: Jianer Chen Office: HRBB 309B Phone: Notes #9.
1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering.
CMU SCS : Multimedia Databases and Data Mining Lecture#3: Primary key indexing – hashing C. Faloutsos.
1 Chapter 12: Indexing and Hashing Indexing Indexing Basic Concepts Basic Concepts Ordered Indices Ordered Indices B+-Tree Index Files B+-Tree Index Files.
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
Hashing and Hash-Based Index. Selection Queries Yes! Hashing  static hashing  dynamic hashing B+-tree is perfect, but.... to answer a selection query.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
March 23 & 28, Hashing. 2 What is Hashing? A Hash function is a function h(K) which transforms a key K into an address. Hashing is like indexing.
Chapter 5: Hashing Part I - Hash Tables. Hashing  What is Hashing?  Direct Access Tables  Hash Tables 2.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006.
1.1 CS220 Database Systems Indexing: Hashing Slides courtesy G. Kollios Boston University via UC Berkeley.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Indexed Sequential Access Method.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 10.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
Tirgul 11 Notes Hash tables –reminder –examples –some new material.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
Copyright © Curt Hill Hashing A quick lookup strategy.
Hashing by Rafael Jaffarove CS157b. Motivation  Fast data access  Search  Insertion  Deletion  Ideal seek time is O(1)
1 Dept. of CIS, Temple Univ. CIS661 – Principles of Data Management V. Megalooikonomou Indexing and Hashing II (based on slides by C. Faloutsos at CMU)
File Organizations and Indexing
1 Ullman et al. : Database System Principles Notes 5: Hashing and More.
TOPIC 5 ASSIGNMENT SORTING, HASH TABLES & LINKED LISTS Yerusha Nuh & Ivan Yu.
CPSC 8620Notes 61 CPSC 8620: Database Management System Design Notes 6: Hashing and More.
Data Structures Chapter 8: Hashing 8-1. Performance Comparison of Arrays and Trees Is it possible to perform these operations in O(1) ? ArrayTree Sorted.
DS.H.1 Hashing Chapter 5 Overview The General Idea Hash Functions Separate Chaining Open Addressing Rehashing Extendible Hashing Application Example: Geometric.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
COMP 430 Intro. to Database Systems
Subject Name: File Structures
Database Management Systems (CS 564)
C. Faloutsos Indexing and Hashing – part I
Advanced Associative Structures
External Memory Hashing
Introduction to Database Systems
Faloutsos & Pavlo Lecture#11 (R&G ch. 11) Hashing
External Memory Hashing
Hash-Based Indexes Chapter 10
Hashing Alexandra Stefan.
15-826: Multimedia Databases and Data Mining
Temple University – CIS Dept. CIS616– Principles of Data Management
Hash-Based Indexes Chapter 11
Chapter 11 Instructor: Xin Zhang
What we learn with pleasure we never forget. Alfred Mercier
Presentation transcript:

Carnegie Mellon Carnegie Mellon Univ. Dept. of Computer Science Database Applications C. Faloutsos Indexing and Hashing – part II

Carnegie Mellon C. Faloutsos2 General Overview - rel. model Relational model - SQL –Formal & commercial query languages Functional Dependencies Normalization Physical Design Indexing

Carnegie Mellon C. Faloutsos3 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing

Carnegie Mellon C. Faloutsos4 (Static) Hashing Problem: “find EMP record with ssn=123” What if disk space was free, and time was at premium?

Carnegie Mellon C. Faloutsos5 Hashing A: Brilliant idea: key-to-address transformation: #0 page #123 page #999,999, ; Smith; Main str

Carnegie Mellon C. Faloutsos6 Hashing Since space is NOT free: use M, instead of 999,999,999 slots hash function: h(key) = slot-id #0 page #123 page #999,999, ; Smith; Main str

Carnegie Mellon C. Faloutsos7 Hashing Typically: each hash bucket is a page, holding many records: #0 page #h(123) M 123; Smith; Main str

Carnegie Mellon C. Faloutsos8 Hashing Notice: could have clustering, or non-clustering versions: #0 page #h(123) M 123; Smith; Main str.

Carnegie Mellon C. Faloutsos Hashing Notice: could have clustering, or non-clustering versions: #0 page #h(123) M... EMP file 123; Smith; Main str ; Johnson; Forbes ave 345; Tompson; Fifth ave...

Carnegie Mellon C. Faloutsos10 Indexing- overview ISAM and B-trees hashing –hashing functions –size of hash table –collision resolution Hashing vs B-trees Indices in SQL Advanced topics:

Carnegie Mellon C. Faloutsos11 Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

Carnegie Mellon C. Faloutsos12 Design decisions - functions Goal: uniform spread of keys over hash buckets Popular choices: –Division hashing –Multiplication hashing

Carnegie Mellon C. Faloutsos13 Division hashing h(x) = (a*x+b) mod M eg., h(ssn) = (ssn) mod 1,000 –gives the last three digits of ssn M: size of hash table - choose a prime number, defensively (why?)

Carnegie Mellon C. Faloutsos14 eg., M=2; hash on driver-license number (dln), where last digit is ‘gender’ (0/1 = M/F) in an army unit with predominantly male soldiers Thus: avoid cases where M and keys have common divisors - prime M guards against that! Division hashing

Carnegie Mellon C. Faloutsos15 Multiplication hashing h(x) = [ fractional-part-of ( x * φ ) ] * M φ: golden ratio ( = ( sqrt(5)-1)/2 ) in general, we need an irrational number advantage: M need not be a prime number but φ must be irrational

Carnegie Mellon C. Faloutsos16 Other hashing functions quadratic hashing (bad)... conclusion: use division hashing

Carnegie Mellon C. Faloutsos17 Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

Carnegie Mellon C. Faloutsos18 Size of hash table eg., 50,000 employees, 10 employee- records / page Q: M=?? pages/buckets/slots

Carnegie Mellon C. Faloutsos19 Size of hash table eg., 50,000 employees, 10 employees/page Q: M=?? pages/buckets/slots A: utilization ~ 90% and –M: prime number Eg., in our case: M= closest prime to 50,000/10 / 0.9 = 5,555

Carnegie Mellon C. Faloutsos20 Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method

Carnegie Mellon C. Faloutsos21 Collision resolution Q: what is a ‘collision’? A: ??

Carnegie Mellon C. Faloutsos22 Collision resolution #0 page #h(123) M 123; Smith; Main str.

Carnegie Mellon C. Faloutsos23 Collision resolution Q: what is a ‘collision’? A: ?? Q: why worry about collisions/overflows? (recall that buckets are ~90% full) A: ‘birthday paradox’

Carnegie Mellon C. Faloutsos24 Collision resolution open addressing –linear probing (ie., put to next slot/bucket) –re-hashing separate chaining (ie., put links to overflow pages)

Carnegie Mellon C. Faloutsos25 Collision resolution #0 page #h(123) M 123; Smith; Main str. linear probing:

Carnegie Mellon C. Faloutsos26 Collision resolution #0 page #h(123) M 123; Smith; Main str. re-hashing h1() h2()

Carnegie Mellon C. Faloutsos27 Collision resolution 123; Smith; Main str. separate chaining

Carnegie Mellon C. Faloutsos28 Design decisions - conclusions function: division hashing –h(x) = ( a*x+b ) mod M size M: ~90% util.; prime number. collision resolution: separate chaining –easier to implement (deletions!); – no danger of becoming full

Carnegie Mellon C. Faloutsos29 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing

Carnegie Mellon C. Faloutsos30 Hashing vs B-trees: Hashing offers speed ! ( O(1) avg. search time)..but:

Carnegie Mellon C. Faloutsos31 Hashing vs B-trees:..but B-trees give: key ordering: –range queries –proximity queries –sequential scan O(log(N)) guarantees for search, ins./del. graceful growing/shrinking

Carnegie Mellon C. Faloutsos32 Hashing vs B-trees: thus: B-trees are implemented in most systems footnotes : hashing is not (why not?) ‘dbm’ and ‘ndbm’ of UNIX: offer one or both

Carnegie Mellon C. Faloutsos33 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing

Carnegie Mellon C. Faloutsos34 Indexing in SQL create index on ( ) create unique index on ( ) drop index

Carnegie Mellon C. Faloutsos35 Indexing in SQL eg., create index ssn-index on STUDENT (ssn) or (eg., on TAKES(ssn,cid, grade) ): create index sc-index on TAKES (ssn, c-id)

Carnegie Mellon C. Faloutsos36 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: (theoretical interest) –dynamic hashing –multi-attribute indexing

Carnegie Mellon C. Faloutsos37 Problem with static hashing problem: overflow? problem: underflow? (underutilization)

Carnegie Mellon C. Faloutsos38 Solution: Dynamic/extendible hashing idea: shrink / expand hash table on demand....dynamic hashing Details: how to grow gracefully, on overflow? Many solutions - One of them: ‘extendible hashing’

Carnegie Mellon C. Faloutsos39 Extendible hashing #0 page #h(123) M 123; Smith; Main str.

Carnegie Mellon C. Faloutsos40 Extendible hashing #0 page #h(123) M 123; Smith; Main str. solution: split the bucket in two

Carnegie Mellon C. Faloutsos41 Extendible hashing in detail: keep a directory, with ptrs to hash-buckets Q: how to divide contents of bucket in two? A: hash each key into a very long bit string; keep only as many bits as needed Eventually:

Carnegie Mellon C. Faloutsos42 Extendible hashing directory

Carnegie Mellon C. Faloutsos43 Extendible hashing directory

Carnegie Mellon C. Faloutsos44 Extendible hashing directory split on 3-rd bit

Carnegie Mellon C. Faloutsos45 Extendible hashing directory new page / bucket

Carnegie Mellon C. Faloutsos46 Extendible hashing directory (doubled) new page / bucket

Carnegie Mellon C. Faloutsos47 Extendible hashing BEFOREAFTER

Carnegie Mellon C. Faloutsos48 Extendible hashing Summary: directory doubles on demand or halves, on shrinking files needs ‘local’ and ‘global’ depth (see book) Mainly, of theoretical interest - same for –‘linear hashing’ of Litwin –‘order preserving’ –‘perfect hashing’ (no collisions!)

Carnegie Mellon C. Faloutsos49 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing

Carnegie Mellon C. Faloutsos50 multiple-key access how to support queries on multiple attributes, like –grade>=3 and course=‘415’ major motivation: Geographic Information systems (GIS)

Carnegie Mellon C. Faloutsos51 multiple-key access x y

Carnegie Mellon C. Faloutsos52 multiple-key access Typical query: Find cities within x miles from Pittsburgh thus, we want to store nearby cities on the same disk page:

Carnegie Mellon C. Faloutsos53 multiple-key access x y

Carnegie Mellon C. Faloutsos54 multiple-key access x y

Carnegie Mellon C. Faloutsos55 multiple-key access - R-trees x y

Carnegie Mellon C. Faloutsos56 multiple-key access - R-trees R-trees: very successful for GIS (along with ‘z-ordering’) more details: at ‘advanced topics’, later even more details: in

Carnegie Mellon C. Faloutsos57 Indexing- overview ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: –dynamic hashing –multi-attribute indexing industry workhorse