Wednesday, 5/8/2002 Hash table indexes, physical operators


CSE 544: Lecture 12

Hash Tables
Secondary storage hash tables are much like main memory ones. Recall the basics:
- There are n buckets.
- A hash function f(k) maps a key k to {0, 1, …, n-1}.
- Bucket f(k) stores a pointer to the record with key k.
In secondary storage, a bucket is a block; overflow blocks are used when needed.

Hash Table Example
Assume 1 bucket (block) stores 2 keys + pointers, with h(e)=0, h(b)=h(f)=1, h(g)=2, h(a)=h(c)=3.
  Bucket 0: e
  Bucket 1: b, f
  Bucket 2: g
  Bucket 3: a, c

Searching in a Hash Table
Search for a: compute h(a)=3, then read bucket 3. This takes 1 disk access.

Insertion in Hash Table
Place the key in the right bucket, if there is space. E.g. h(d)=2:
  Bucket 0: e
  Bucket 1: b, f
  Bucket 2: g, d
  Bucket 3: a, c

Insertion in Hash Table
Create an overflow block if there is no space. E.g. h(k)=1. More overflow blocks may be needed.
  Bucket 0: e
  Bucket 1: b, f -> overflow block: k
  Bucket 2: g, d
  Bucket 3: a, c
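
The search and insertion steps on these slides can be sketched in Python. This is only an illustration under stated assumptions: the hash values are hard-coded in a lookup table `H` copied from the slides, blocks are plain lists rather than disk pages, and the names `StaticHashTable`, `BLOCK_CAPACITY`, and `H` are hypothetical.

```python
BLOCK_CAPACITY = 2  # keys per block, as in the slide example

# Hash values taken from the slides; a real table would compute them.
H = {"e": 0, "b": 1, "f": 1, "g": 2, "a": 3, "c": 3, "d": 2, "k": 1}


class StaticHashTable:
    def __init__(self, n_buckets):
        # Each bucket is a chain of blocks: the primary block, then overflow blocks.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def insert(self, key):
        chain = self.buckets[H[key]]
        if len(chain[-1]) == BLOCK_CAPACITY:
            chain.append([])  # last block is full: add an overflow block
        chain[-1].append(key)

    def search(self, key):
        """Return (found, number of blocks read)."""
        reads = 0
        for block in self.buckets[H[key]]:
            reads += 1
            if key in block:
                return True, reads
        return False, reads


table = StaticHashTable(4)
for key in "ebfgac":
    table.insert(key)
table.insert("d")  # fits in bucket 2
table.insert("k")  # bucket 1 is full, so an overflow block is created
```

In this sketch, searching for a reads one block, while searching for k reads two, which is the kind of degradation overflow blocks cause.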

Hash Table Performance
Excellent, if there are no overflow blocks. Degrades considerably when the number of keys greatly exceeds what the buckets can hold (i.e. there are many overflow blocks).

Extensible Hash Table
Allows the hash table to grow, to avoid performance degradation.
Assume a hash function h that returns numbers in {0, …, 2^k - 1}.
Start with n = 2^i << 2^k buckets, and only look at the first i most significant bits.

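Extensible Hash Table
E.g. i=1, n=2, k=4. Note: we only look at the first bit (0 or 1).
Directory (i=1); the count after each block is the number of bits that determine membership in it:
  0 -> block {0(010)}, 1 bit used
  1 -> block {1(011)}, 1 bit used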

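Insertion in Extensible Hash Table
Key 1110 goes into the block for first bit 1, which still has space.
Directory (i=1):
  0 -> block {0(010)}, 1 bit used
  1 -> block {1(011), 1(110)}, 1 bit used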

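Insertion in Extensible Hash Table
Now insert 1010. The block for first bit 1 is full, so we need to extend the table and split blocks; i becomes 2.
Directory before the extension (i=1):
  0 -> block {0(010)}, 1 bit used
  1 -> block {1(011), 1(110)}, 1 bit used; 1(010) does not fit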

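Insertion in Extensible Hash Table
After the extension, 1010 is in place and 1110 sits in its own block.
Directory (i=2):
  00, 01 -> block {0(010)}, 1 bit used
  10 -> block {10(11), 10(10)}, 2 bits used
  11 -> block {11(10)}, 2 bits used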

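Insertion in Extensible Hash Table
Now insert 0000, then 0101. The block shared by 00 and 01 overflows, so we need to split it.
Directory (i=2):
  00, 01 -> block {0(010), 0(000)}, 1 bit used; 0(101) does not fit
  10 -> block {10(11), 10(10)}, 2 bits used
  11 -> block {11(10)}, 2 bits used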

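Insertion in Extensible Hash Table
After splitting the block, the directory (i=2) is:
  00 -> block {00(10), 00(00)}, 2 bits used
  01 -> block {01(01)}, 2 bits used
  10 -> block {10(11), 10(10)}, 2 bits used
  11 -> block {11(10)}, 2 bits used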
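
The insertion sequence on these slides can be replayed with a small Python sketch. It assumes 4-bit hash values given directly as integers, blocks holding two keys, and a directory indexed by the first i bits. The names `Block`, `ExtensibleHashTable`, and `depth` (the per-block bit count) are hypothetical, and the sketch does not guard against more than two keys with identical hash values.

```python
K = 4               # hash values have K bits
BLOCK_CAPACITY = 2  # keys per block


class Block:
    def __init__(self, depth):
        self.depth = depth  # number of leading bits shared by all keys in the block
        self.keys = []


class ExtensibleHashTable:
    def __init__(self):
        self.i = 1
        self.directory = [Block(1), Block(1)]

    def _slot(self, h):
        return h >> (K - self.i)  # first i most significant bits

    def insert(self, h):
        block = self.directory[self._slot(h)]
        if len(block.keys) < BLOCK_CAPACITY:
            block.keys.append(h)
            return
        if block.depth == self.i:
            # Extend the table: every directory entry is duplicated.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.i += 1
        # Split the full block on its next bit.
        block.depth += 1
        sibling = Block(block.depth)
        old_keys, block.keys = block.keys, []
        for slot, b in enumerate(self.directory):
            if b is block and (slot >> (self.i - block.depth)) & 1:
                self.directory[slot] = sibling
        for key in old_keys:
            self.directory[self._slot(key)].keys.append(key)
        self.insert(h)  # retry; splits again if the target block is still full


table = ExtensibleHashTable()
for h in (0b0010, 0b1011, 0b1110, 0b1010, 0b0000, 0b0101):
    table.insert(h)
for slot, block in enumerate(table.directory):
    print(format(slot, "02b"), [format(key, "04b") for key in block.keys])
```

Inserting 1010 triggers the directory doubling, while inserting 0101 only splits a block, because that block was still shared by two directory entries.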

Performance of Extensible Hash Table
No overflow blocks: access is always one read.
BUT:
- Extensions can be costly and disruptive.
- After an extension, the table may no longer fit in memory.

Linear Hash Table
Idea: extend only one entry at a time.
Problem: n is no longer a power of 2.
Let i be such that 2^(i-1) < n <= 2^i (e.g. i=2 for n=3), with buckets numbered 0, …, n-1.
After computing h(k), use the last i bits: if they represent a number >= n, change the msb from 1 to 0 (to get a number < n).
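
The bucket computation can be written as a short Python function. It is a sketch under the assumption that buckets are numbered 0 to n-1 and that i is the smallest integer with n <= 2^i, which matches the slide examples where n=3 uses i=2. The function name `bucket_index` is hypothetical.

```python
def bucket_index(h, n):
    """Map hash value h to one of n buckets numbered 0 .. n-1."""
    i = max(1, (n - 1).bit_length())  # smallest i with n <= 2**i
    m = h & ((1 << i) - 1)            # last i bits of h
    if m >= n:                        # bucket m does not exist yet
        m -= 1 << (i - 1)             # change the most significant bit from 1 to 0
    return m


for h in (0b0100, 0b1100, 0b0111, 0b1010):
    print(format(h, "04b"), "->", format(bucket_index(h, 3), "02b"))
```

With n=3, the key 0111 ends in 11, which names a bucket that does not exist, so the bit flip sends it to bucket 01.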

Linear Hash Table Example
Three buckets, i=2:
  00: (01)00, (11)00
  01: (01)11 (BIT FLIP: the last two bits 11 become 01)
  10: (10)10

Linear Hash Table Example
Insert 1000: overflow blocks…
  00: (01)00, (11)00 -> overflow block: (10)00
  01: (01)11
  10: (10)10

Linear Hash Tables
Extension is independent of overflow blocks.
Extend n := n+1 when the average number of records per block exceeds (say) 80% of the block capacity.

Linear Hash Table Extension
From n=3 to n=4. Only need to touch one block (which one?)
Before (n=3, i=2):
  00: (01)00, (11)00
  01: (01)11
  10: (10)10
After (n=4, i=2):
  00: (01)00, (11)00
  01: (empty)
  10: (10)10
  11: (01)11
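
One way to explore the question is to compare each key's bucket before and after the extension. The sketch below assumes the same addressing rule as the slides (last i bits, with the msb flipped when the number is not a valid bucket, and i the smallest integer with n <= 2^i); `bucket_index` and `moved_keys` are hypothetical names.

```python
def bucket_index(h, n):
    """Bucket of hash value h in a linear hash table with n buckets."""
    i = max(1, (n - 1).bit_length())  # smallest i with n <= 2**i
    m = h & ((1 << i) - 1)            # last i bits
    return m - (1 << (i - 1)) if m >= n else m


def moved_keys(keys, n):
    """Keys whose bucket changes when the table grows from n to n + 1 buckets."""
    return {h: (bucket_index(h, n), bucket_index(h, n + 1))
            for h in keys
            if bucket_index(h, n) != bucket_index(h, n + 1)}


keys = [0b0100, 0b1100, 0b0111, 0b1010]
print(moved_keys(keys, 3))
```

For the four keys of the example, only 0111 changes bucket (from 01 to 11) when n goes from 3 to 4.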

Linear Hash Table Extension
From n=3 to n=4: finished.
Extension from n=4 to n=5 (new bit): need to touch every single block (why?)
Current table (n=4, i=2):
  00: (01)00, (11)00
  01: (empty)
  10: (10)10
  11: (01)11

Discussion of Physical Operators
The following discussion is based mostly on: Goetz Graefe, "Query evaluation techniques for large databases".

Discussion of Physical Operators
General questions:
- What is the difference between physical algebra and logical algebra?
- What is the iterator model?
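
As a starting point for the second question, here is a minimal Python sketch of operators that follow the open/next/close iterator interface. The operators `Scan` and `Select`, the helper `run`, and the use of `None` as the end-of-stream marker are illustrative assumptions, not code from the paper.

```python
class Scan:
    """Leaf operator: iterates over an in-memory list of tuples."""

    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos == len(self.rows):
            return None  # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass


class Select:
    """Filter operator: pulls tuples from its child one at a time."""

    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def open(self):
        self.child.open()

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()


def run(plan):
    """Drive a plan to completion and collect its output."""
    plan.open()
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    plan.close()
    return out


evens = run(Select(Scan([1, 2, 3, 4]), lambda r: r % 2 == 0))
```

Because every operator exposes the same three calls, operators can be stacked freely and tuples flow through the plan one at a time without materializing intermediate results.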

Discussion of Physical Operators
Mergesort questions:
- What are the two methods for creating the level-0 runs? Describe their pros and cons.

Discussion of Physical Operators
Suppose we only allow two passes in merge-sort: (1) build level-0 runs, (2) merge them (XMLTK sorts this way).
Assume M = 128MB, page size = 4kB.
- What is the largest relation size we can sort?
- How does this change if we double M (to 256MB)?
- How does this change if we double the page size (to 8kB)?
- What conclusions do you draw from this slide?
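
One possible back-of-the-envelope calculation for these questions, sketched in Python. It assumes each level-0 run is built by filling memory once and sorting it (so runs have size M), and that the merge pass needs one input page per run plus one output page. The function name `max_two_pass_sort_bytes` is hypothetical, and other run-generation methods would change the numbers.

```python
def max_two_pass_sort_bytes(memory_bytes, page_bytes):
    run_bytes = memory_bytes                 # each level-0 run fills memory once
    fan_in = memory_bytes // page_bytes - 1  # one input page per run, one output page
    return run_bytes * fan_in                # at most fan_in runs can be merged in one pass


MB, KB = 2**20, 2**10
base = max_two_pass_sort_bytes(128 * MB, 4 * KB)
double_memory = max_two_pass_sort_bytes(256 * MB, 4 * KB)
double_page = max_two_pass_sort_bytes(128 * MB, 8 * KB)
print(base / 2**40, double_memory / base, double_page / base)
```

Under these assumptions the limit is just under 4 TB. Doubling M roughly quadruples it, while doubling the page size roughly halves it.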

Discussion of Physical Operators
Suppose we only allow two passes in merge-join: (1) build level-0 runs for R and S, (2) merge and join them.
Assume M = 128MB, page size = 4kB.
Write down the condition on the size of R and/or S that allows us to do merge-join in two passes.

Discussion of Physical Operators
Consider partitioned hash-join (described in the book, not the paper).
Assume M = 128MB, page size = 4kB.
Write down the condition on the size of R and/or S that allows us to do partitioned hash-join.
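
The usual textbook approximations for these two conditions can be checked numerically. The sketch below assumes B(R) + B(S) <= M^2 for two-pass merge-join and min(B(R), B(S)) <= M^2 for partitioned hash-join, where B counts pages and M is the memory size in pages. Both ignore the few pages reserved for output and assume the hash function partitions evenly. All function names are hypothetical.

```python
def pages(size_bytes, page_bytes):
    return -(-size_bytes // page_bytes)  # ceiling division


def two_pass_merge_join_ok(r_bytes, s_bytes, memory_bytes, page_bytes):
    m = memory_bytes // page_bytes
    # All runs of R and S are merged at once: B(R) + B(S) <= M^2.
    return pages(r_bytes, page_bytes) + pages(s_bytes, page_bytes) <= m * m


def partitioned_hash_join_ok(r_bytes, s_bytes, memory_bytes, page_bytes):
    m = memory_bytes // page_bytes
    # Only each partition of the smaller relation must fit: min(B(R), B(S)) <= M^2.
    smaller = min(pages(r_bytes, page_bytes), pages(s_bytes, page_bytes))
    return smaller <= m * m


MB, KB, TB = 2**20, 2**10, 2**40
print(two_pass_merge_join_ok(3 * TB, 3 * TB, 128 * MB, 4 * KB))
print(partitioned_hash_join_ok(3 * TB, 3 * TB, 128 * MB, 4 * KB))
```

With M = 128MB and 4kB pages, M^2 pages correspond to about 4 TB, so two 3 TB relations fail the merge-join condition but pass the hash-join one under these approximations.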

Discussion of Physical Operators
Block nested loop join vs. partitioned hash-join, assuming R and S have the same size (R=S). When is one better than the other?
[Figure: diagram labeled with S, R, and M]

Discussion of Physical Operators
Hybrid hash join: this is difficult. What does it really buy us?

Discussion of Physical Operators
More questions on joins:
- What is an antisemijoin?

Discussion of Physical Operators
Object-oriented databases and pointers. Comment on the following statement:
"Object-oriented databases have little need for joins. Foreign keys are usually replaced with pointers, as in Person(ssn, name, deptid), Department(name), where deptid is a pointer to a department. A join in the relational model is now replaced with a traversal of a physical pointer, which is far more efficient."

Discussion of Physical Operators
Parallel databases. Describe briefly the following three forms of parallelism:
- Interquery parallelism
- Interoperator parallelism
- Intraoperator parallelism
Assume you implement a huge, single-user database and buy a parallel machine with 128 nodes. Which form of parallelism has the best potential for speedup?
Describe speedup and scaleup.

Discussion of Physical Operators
More questions:
- What are NF² relations?
- Suppose R is on node 1, and S is on node 2. Describe the following distributed join computation methods: semijoins, Bloom filters.
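
To make the Bloom filter method concrete, here is a small Python sketch that simulates both nodes in one process. The relation contents, the filter size of 64 bits, the three salted SHA-256 hash functions, and all function names are assumptions chosen for illustration. A semijoin reduction has the same shape, except that node 2 ships the exact set of join keys instead of a bit vector.

```python
import hashlib


def bloom_positions(value, n_bits, n_hashes):
    """Derive n_hashes bit positions for a value from salted SHA-256 digests."""
    for salt in range(n_hashes):
        digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % n_bits


def build_filter(values, n_bits=64, n_hashes=3):
    bits = 0
    for v in values:
        for pos in bloom_positions(v, n_bits, n_hashes):
            bits |= 1 << pos
    return bits


def may_contain(bits, value, n_bits=64, n_hashes=3):
    return all((bits >> pos) & 1 for pos in bloom_positions(value, n_bits, n_hashes))


# Node 2 holds S and ships only the filter over its join keys to node 1.
S = [(10, "x"), (20, "y")]
bits = build_filter(key for key, _ in S)

# Node 1 holds R and ships only the tuples that may have a partner in S.
R = [(10, "a"), (15, "b"), (20, "c"), (25, "d")]
shipped = [r for r in R if may_contain(bits, r[0])]

# Node 2 finishes the join; any false positives in `shipped` are dropped here.
result = [(r, s) for r in shipped for s in S if r[0] == s[0]]
```

The filter can let through tuples of R that have no partner in S, but it never drops a tuple that does, so the final join result is exact.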