
Data Structures: Hash Tables (Dana Shapira)

Element Uniqueness Problem Let X = {x1, x2, …, xn} be a set of integers drawn from {0, …, m − 1}. Determine whether there exist i ≠ j such that xi = xj. Solution 1: sort and scan for equal adjacent elements. Solution 2 (bucket-sort style, using an auxiliary array T of size m):
    for (i = 0; i < m; i++) T[i] = NULL;
    for (i = 0; i < n; i++) {
        if (T[xi] == NULL) T[xi] = i;
        else { output(i, T[xi]); return; }
    }
What happens when m is large, or when we are dealing with real numbers?
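A minimal runnable C sketch of the bucket-array solution above; the function name, the sentinel -1 (standing in for NULL), and the fixed sizes M and N are illustrative assumptions, not from the slides:

    #include <stdio.h>

    #define M 100   /* size of the universe {0, ..., M-1}; assumed for illustration */

    /* Prints the first duplicate pair (i, j) and returns i, or returns -1.
       Runs in O(n + m) time and O(m) space. */
    int find_duplicate(const int x[], int n) {
        int T[M];
        for (int i = 0; i < M; i++) T[i] = -1;   /* -1 plays the role of NULL */
        for (int i = 0; i < n; i++) {
            if (T[x[i]] == -1) T[x[i]] = i;
            else { printf("duplicate: x[%d] == x[%d]\n", i, T[x[i]]); return i; }
        }
        return -1;
    }

    int main(void) {
        int x[] = {17, 62, 19, 81, 53, 62};
        find_duplicate(x, 6);   /* prints: duplicate: x[5] == x[1] */
        return 0;
    }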

Hash Tables [Figure: keys x1, x2, x3, x4 from the universe U are mapped by h to array slots h(x1), h(x2), h(x3), h(x4).] Notations: U is the universe of keys, of size |U|; K is the actual set of keys stored, of size n; T is a hash table of size m. Use a hash function h: U → {0, …, m − 1}, where h(x) = i computes the slot i in array T in which element x is to be stored, for all x in U. h(k) is computed in O(|k|) = O(1) time.

Example h: U → {0, …, m − 1}, h(x) = x mod 10 (what is m?). Input: 17, 62, 19, 81, 53. [Figure: table T with 81 in slot 1, 62 in slot 2, 53 in slot 3, 17 in slot 7, 19 in slot 9.] Collision: x ≠ y but h(x) = h(y). Since m « |U|, collisions are unavoidable. Solutions: chaining, open addressing.

Collision Resolution by Chaining [Figure: each slot of T holds a linked list, e.g., slot 1 → 81; slot 2 → 62 → 12; slot 3 → 53; slot 7 → 17 → 37 → 57; slot 9 → 19.] Insert(T, x): insert the new element x at the head of list T[h(x.key)]. Delete(T, x): delete element x from list T[h(x.key)]. Search(T, x): search the list T[h(x.key)].
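A compact C sketch of chaining with h(x) = x mod 10 as in the example; structure and function names are illustrative assumptions, and Delete is omitted for brevity:

    #include <stdio.h>
    #include <stdlib.h>

    #define M 10

    typedef struct Node { int key; struct Node *next; } Node;
    Node *T[M];   /* each slot is the head of a singly linked list */

    int h(int x) { return x % M; }

    void insert(int x) {              /* O(1): insert at the head of the chain */
        Node *p = malloc(sizeof *p);
        p->key = x;
        p->next = T[h(x)];
        T[h(x)] = p;
    }

    Node *search(int x) {             /* expected O(1 + alpha) under uniform hashing */
        for (Node *p = T[h(x)]; p != NULL; p = p->next)
            if (p->key == x) return p;
        return NULL;
    }

    int main(void) {
        int keys[] = {17, 62, 19, 81, 53, 12, 37, 57};
        for (int i = 0; i < 8; i++) insert(keys[i]);
        printf("%s\n", search(37) ? "found" : "not found");   /* found */
        return 0;
    }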

Analysis of Chaining Simple uniform hashing: any given element is equally likely to hash to any of the m slots, independently of where the other elements hash. Load factor: α = n/m (the number of elements stored in the hash table divided by the number of slots).

Analysis of Chaining Theorem: In a hash table with chaining, under the assumption of simple uniform hashing, both successful and unsuccessful searches take expected time Θ(1 + α), where α is the hash table load factor. Proof: Unsuccessful search: under simple uniform hashing, any key k is equally likely to hash to any slot. The expected time to search unsuccessfully for k is the expected time to scan to the end of list T[h(k)], which has expected length α; including the time to compute h(k), this gives expected time Θ(1 + α). Successful search: the number of elements examined is 1 more than the number of elements that appear before k in T[h(k)]; again the expected time is Θ(1 + α). Corollary: If m = Θ(n), then α = O(1), and Insert, Delete, and Search take expected constant time.
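A worked version of the successful-search count, sketched here following the standard CLRS calculation: since insertion is at the head of the list, the elements examined when searching for the i-th inserted key xi are xi itself plus the later-inserted keys that landed in the same list, each of which collides with probability 1/m. Averaging over all n keys,

\[
\frac{1}{n}\sum_{i=1}^{n}\Bigl(1 + \sum_{j=i+1}^{n}\frac{1}{m}\Bigr)
= 1 + \frac{1}{nm}\sum_{i=1}^{n}(n-i)
= 1 + \frac{n-1}{2m}
\le 1 + \frac{\alpha}{2}
= \Theta(1+\alpha).
\]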

Designing Good Hash Functions Example: if the input consists of reals drawn uniformly at random from [0, 1), the hash function h(x) = ⌊mx⌋ spreads them uniformly over the m slots. Often, the input distribution is unknown. Then we can use heuristics or universal hashing.

The Division Method Hash function: h(x) = x mod m. If m = 2^k, then h(x) is just the lowest k bits of x, which is bad when the low-order bits are not uniformly distributed. Heuristic: choose m to be a prime number not too close to a power of 2.
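A tiny C illustration of the pitfall (the example keys are assumed for illustration): with m = 2^k, keys that differ only in their high bits all collide.

    #include <stdio.h>

    int main(void) {
        unsigned m = 16;                       /* m = 2^4: h(x) = lowest 4 bits of x */
        unsigned keys[] = {0x10, 0x20, 0x30};  /* differ only in the high bits */
        for (int i = 0; i < 3; i++)
            printf("h(0x%X) = %u\n", keys[i], keys[i] % m);  /* all hash to 0 */
        return 0;
    }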

The Multiplication Method Hash function: h(x) = ⌊m (cx mod 1)⌋, for some 0 < c < 1. The optimal choice of c depends on the input distribution. Heuristic: Knuth suggests the inverse of the golden ratio, c = (√5 − 1)/2 ≈ 0.61803…, as a value that works well. Example: x = 123,456, m = 10,000: h(x) = ⌊10,000 · (123,456 · 0.61803… mod 1)⌋ = ⌊10,000 · (76,300.0041151… mod 1)⌋ = ⌊10,000 · 0.0041151…⌋ = ⌊41.151…⌋ = 41.

Efficient Implementation of the Multiplication Method h(x) = ⌊m (cx mod 1)⌋. Let w be the size of a machine word, and assume that a key x fits into a machine word. Assume that m = 2^p, and restrict ourselves to values of c of the form c = s / 2^w for some integer 0 < s < 2^w. Then cx = sx / 2^w, and sx is a number that fits into two machine words: the upper word is the integer part of cx, and the lower word is the fractional part. Multiplying the fractional part by m = 2^p selects its top p bits, so h(x) is the p most significant bits of the lower word of sx.

Example h(x) = ⌊m (cx mod 1)⌋ with x = 123,456, p = 14, m = 2^14 = 16,384, w = 32, and s = 2,654,435,769 (so c = s / 2^32 ≈ 0.61803). Then sx = (76,300 · 2^32) + 17,612,864, i.e., the lower word of sx is 17,612,864. The 14 most significant bits of the lower word give h(x) = 67:
    x    = 0000 0000 0000 0001 1110 0010 0100 0000
    s    = 1001 1110 0011 0111 0111 1001 1011 1001
    sx   = 0000 0000 0000 0001 0010 1010 0000 1100 | 0000 0001 0000 1100 1100 0000 0100 0000
    h(x) = 00 0000 0100 0011 = 67
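A runnable C sketch of this word-level computation; the constant s is the one from the example, while the function name is an illustrative assumption:

    #include <stdint.h>
    #include <stdio.h>

    /* h(x) = p most significant bits of the lower word of s*x, with m = 2^p, w = 32.
       s = 2654435769, the constant from the example (about 2^32 times 1/phi). */
    static uint32_t hash_mult(uint32_t x, unsigned p) {
        uint32_t lower = (uint32_t)((uint64_t)2654435769u * x);  /* low 32 bits of sx */
        return lower >> (32 - p);                                 /* top p bits of it */
    }

    int main(void) {
        printf("%u\n", hash_mult(123456u, 14));  /* prints 67, as in the slide */
        return 0;
    }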

Open Addressing All elements are stored directly in the hash table, so the load factor α cannot exceed 1. If slot T[h(x)] is already occupied for a key x, we probe alternative locations until we find an empty slot. Searching probes slots starting at T[h(x)] until x is found or until we are sure that x is not in T. Instead of computing a single value h(x), we compute a probe sequence h(x, i), where i is the probe number.

Linear Probing Hash function: h(k, i) = (h'(k) + i) mod m, where h' is an auxiliary hash function. Benefit: easy to implement. Problem: primary clustering - long runs of occupied slots build up as the table becomes fuller.
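A minimal C sketch of insertion and search under linear probing; the names, the table size, and the EMPTY sentinel are illustrative assumptions, and deletion (which needs tombstones) is omitted:

    #include <stdio.h>

    #define M 13
    #define EMPTY (-1)

    int T[M];

    int hp(int k) { return k % M; }                 /* auxiliary hash h'(k) */
    int h(int k, int i) { return (hp(k) + i) % M; } /* linear probe sequence */

    int insert(int k) {                 /* returns the slot used, or -1 if full */
        for (int i = 0; i < M; i++) {
            int j = h(k, i);
            if (T[j] == EMPTY) { T[j] = k; return j; }
        }
        return -1;
    }

    int search(int k) {                 /* returns the slot holding k, or -1 */
        for (int i = 0; i < M; i++) {
            int j = h(k, i);
            if (T[j] == EMPTY) return -1;   /* empty slot: k cannot be further along */
            if (T[j] == k) return j;
        }
        return -1;
    }

    int main(void) {
        for (int j = 0; j < M; j++) T[j] = EMPTY;
        insert(18); insert(31); insert(44);        /* all three have h'(k) = 5 */
        printf("44 is in slot %d\n", search(44));  /* probes 5, 6, finds 44 in 7 */
        return 0;
    }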

Quadratic Probing Hash function: h(k, i) = (h'(k) + c1·i + c2·i²) mod m, where h' is an auxiliary hash function and c1, c2 are constants. Benefit: no more primary clustering. Problem: secondary clustering - two keys x and y with h'(x) = h'(y) have the same probe sequence.

Double Hashing Hash function: h(k, i) = (h1(k) + i·h2(k)) mod m, where h1 and h2 are two auxiliary hash functions. h2(k) has to be relatively prime to m, that is, gcd(h2(k), m) = 1, so that the probe sequence visits all m slots. Two methods: choose m to be a power of 2 and guarantee that h2(k) is always odd, or choose m to be a prime number and guarantee that 0 < h2(k) < m. Benefit: no more clustering. Drawback: more complicated than linear and quadratic probing.
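A C sketch of the double-hashing probe sequence, using the power-of-2 variant with an always-odd h2; both auxiliary functions are illustrative assumptions:

    #include <stdio.h>

    #define M 16                         /* m is a power of 2 */

    int h1(int k) { return k % M; }
    int h2(int k) { return (k % (M - 1)) | 1; }   /* setting the low bit makes h2 odd */
    int h(int k, int i) { return (h1(k) + i * h2(k)) % M; }

    int main(void) {
        /* gcd(h2(k), M) = 1 since h2(k) is odd, so the probes visit all 16 slots */
        for (int i = 0; i < M; i++) printf("%d ", h(42, i));
        printf("\n");
        return 0;
    }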

Analysis of Open Addressing Uniform hashing: the probe sequence h(k, 0), …, h(k, m − 1) is equally likely to be any permutation of 0, …, m − 1. Theorem: In an open-address hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1 / (1 − α), assuming uniform hashing. Proof: Let X be the number of probes made in an unsuccessful search, and let Ai be the event "there is an i-th probe, and it accesses a non-empty slot".

Analysis of Open Addressing – Cont.
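A sketch of the calculation on this slide, following the standard argument: X ≥ i exactly when the first i − 1 probes all hit occupied slots, and each successive probe hits an occupied slot with probability at most n/m given the previous ones did, so

\[
\Pr[X \ge i] = \Pr[A_1 \cap \dots \cap A_{i-1}]
= \frac{n}{m}\cdot\frac{n-1}{m-1}\cdots\frac{n-i+2}{m-i+2}
\le \Bigl(\frac{n}{m}\Bigr)^{i-1} = \alpha^{i-1},
\]
\[
E[X] = \sum_{i=1}^{\infty}\Pr[X \ge i]
\le \sum_{i=1}^{\infty}\alpha^{i-1}
= \sum_{j=0}^{\infty}\alpha^{j}
= \frac{1}{1-\alpha}.
\]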

Analysis of Open Addressing – Cont. Corollary: The expected number of probes performed during an insertion into an open-address hash table with uniform hashing is at most 1 / (1 – α), since an insertion follows the same probe sequence as an unsuccessful search.

Analysis of Open Addressing – Cont. Theorem: Given an open-address hash table with load factor α < 1, the expected number of probes in a successful search is at most (1/α) ln (1 / (1 – α)), assuming uniform hashing and assuming that each key in the table is equally likely to be searched for. Proof: A successful search for an element x follows the same probe sequence as the insertion of x. Consider the (i + 1)-st element x that was inserted: at that moment the load factor was i/m, so by the corollary the expected number of probes performed when inserting x is at most 1 / (1 − i/m) = m / (m − i). Averaging over all n elements, the expected number of probes in a successful search is (1/n) Σ_{i=0}^{n−1} m/(m − i) = (1/α) Σ_{k=m−n+1}^{m} 1/k ≤ (1/α) ln (1 / (1 − α)).

Analysis of Open Addressing – Cont.
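A sketch of the final bounding step, the standard comparison of the harmonic sum with an integral:

\[
\frac{1}{\alpha}\sum_{k=m-n+1}^{m}\frac{1}{k}
\le \frac{1}{\alpha}\int_{m-n}^{m}\frac{dx}{x}
= \frac{1}{\alpha}\ln\frac{m}{m-n}
= \frac{1}{\alpha}\ln\frac{1}{1-\alpha}.
\]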

Universal Hashing A family H of hash functions is universal if, for each pair k, l of distinct keys, there are at most |H| / m functions h in H such that h(k) = h(l). This means: for any two keys k and l and a function h chosen uniformly at random from H, the probability of a collision is P = (|H| / m) / |H| = 1/m. This is the same bound as if we chose h(k) and h(l) independently and uniformly at random from {0, …, m – 1}.

Analysis of Universal Hashing Theorem: For a hash function h chosen uniformly at random from a universal family H, the expected length of the list T[h(x)] is α if x is not in the hash table, and 1 + α if x is in the hash table. Proof: For each key y ≠ x in T, define the indicator variable Z_xy = 1 if h(x) = h(y) and Z_xy = 0 otherwise; by universality, E[Z_xy] = Pr[h(x) = h(y)] ≤ 1/m. Let Y_x = Σ_{y ∈ T, y ≠ x} Z_xy be the number of keys other than x that hash to the same slot as x; by linearity of expectation, E[Y_x] ≤ |{y ∈ T : y ≠ x}| / m.

Analysis of Universal Hashing Cont. If x is not in T, then |{y ∈ T : y ≠ x}| = n; hence E[Y_x] ≤ n / m = α. If x is in T, then |{y ∈ T : y ≠ x}| = n – 1; hence E[Y_x] ≤ (n – 1) / m < α, and the length of list T[h(x)] is one more than Y_x (counting x itself), that is, at most 1 + α in expectation.

Universal Family of Hash Functions Choose a prime p so that m = p. Decompose each key x into r + 1 digits, x = ⟨x0, x1, …, xr⟩, with each 0 ≤ xi < m. For each a = ⟨a0, a1, …, ar⟩ with 0 ≤ ai < m, define the hash function h_a(x) = (Σ_{i=0}^{r} ai·xi) mod m, and let H = {h_a}; then |H| = m^{r+1}. Example: m = p = 253, a = ⟨248, 223, 101⟩, x = 1025 = ⟨0, 2, 1⟩; then h_a(x) = (248·0 + 223·2 + 101·1) mod 253 = 547 mod 253 = 41.

Universal Family of Hash Functions Theorem: The class H is universal. Proof: Let x = ⟨x0, …, xr⟩ and y = ⟨y0, …, yr⟩ be distinct keys; w.l.o.g. x0 ≠ y0. Since m is prime, the nonzero value z = x0 − y0 has a multiplicative inverse w with z·w = 1 (mod m). Hence, for every choice of a1, …, ar there exists exactly one a0 with h_a(x) = h_a(y), namely a0 = −w · Σ_{i=1}^{r} ai(xi − yi) (mod m). So x and y collide for exactly m^r of the m^{r+1} functions in H, i.e., the number of hash functions h in H for which h(x) = h(y) is a fraction m^r / m^{r+1} = 1/m of the family, as required.

Universal Family of Hash Functions Choose a prime p so that m < p and every key is less than p. For any 1 ≤ a < p and 0 ≤ b < p, we define a function ha,b(x) = ((ax + b) mod p) mod m. Let Hp,m be the family Hp,m = {ha,b : 1 ≤ a < p and 0 ≤ b < p}. Theorem: The class Hp,m is universal.
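A C sketch of drawing one function from Hp,m and evaluating it; the prime p = 101, the table size, and the function names are assumptions for illustration (rand() is good enough for a sketch, not for adversarial settings):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define P 101u   /* prime, chosen so that every key is < P */
    #define M 10u    /* table size, with M < P */

    unsigned a, b;   /* parameters identifying one function h_{a,b} in H_{P,M} */

    void pick_hash(void) {          /* choose h uniformly at random from the family */
        a = 1 + rand() % (P - 1);   /* 1 <= a < P */
        b = rand() % P;             /* 0 <= b < P */
    }

    unsigned h(unsigned x) { return ((a * x + b) % P) % M; }

    int main(void) {
        srand((unsigned)time(NULL));
        pick_hash();
        printf("h(17) = %u, h(62) = %u\n", h(17), h(62));
        return 0;
    }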

Summary Hash tables are the most efficient dictionary implementation if only the operations Insert, Delete, and Search have to be supported. If uniform hashing is used, the expected time of each of these operations is constant. Universal hashing is somewhat complicated, but performs well even on adversarial input distributions. If the input distribution is known, heuristics perform well and are much simpler than universal hashing. For collision resolution, chaining is the simplest method, but it requires more space than open addressing; open addressing is either more complicated or suffers from clustering effects.