Advanced Algorithms for Massive Datasets Basics of Hashing.

Advanced Algorithms for Massive Datasets Basics of Hashing

The Dictionary Problem. Definition. Let S be a dictionary of n keys drawn from a universe U. We wish to design a (dynamic) data structure that supports the following operations: membership(k), which checks whether k ∈ S; insert(k), which sets S = S ∪ {k}; delete(k), which sets S = S − {k}.

Brute-force: a large array indexed directly by the key. Drawback: U can be very large, e.g. 64-bit integers, URLs, MP3 files, MPEG videos, ...

Hashing: list + array + ... Problem: there may be collisions!

Collision resolution: chaining

Key issue: a good hash function. Basic assumption: uniform hashing. Avg #keys per slot = n · (1/m) = n/m = α (the load factor).
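The chaining scheme and its load factor can be sketched as follows (a minimal illustration; Python's built-in hash() stands in for the uniform hash function assumed above):

```python
class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m                           # number of slots
        self.slots = [[] for _ in range(m)]  # one chain per slot
        self.n = 0                           # number of stored keys

    def _slot(self, k):
        return hash(k) % self.m

    def insert(self, k):
        chain = self.slots[self._slot(k)]
        if k not in chain:                   # keys are unique
            chain.append(k)
            self.n += 1

    def member(self, k):
        return k in self.slots[self._slot(k)]

    def delete(self, k):
        chain = self.slots[self._slot(k)]
        if k in chain:
            chain.remove(k)
            self.n -= 1

    def load_factor(self):
        return self.n / self.m               # alpha = n / m
```

With m = Θ(n) the load factor stays constant, which is what makes the expected chain length, and hence search time, O(1).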

A useful r.v.

The proof

Search cost: under uniform hashing, a search costs O(1 + α) in expectation; choosing m = Θ(n) makes α = O(1), hence constant expected search time.

In summary... Hashing with chaining: O(1) search/update time in expectation, O(n) optimal space, simple to implement... but: space = m log2 n + n (log2 n + log2 |U|) bits; the bounds hold only in expectation; uniform hashing is difficult to guarantee. (Alternative to chaining: open addressing, which stores everything in the array itself.)

In practice we typically use simple hash functions, e.g. h(k) = k mod p, with p a prime.

Enforce "goodness". As in Quicksort's random selection of its pivot, select the h() at random. From which set should we draw h?

What is the cost of "uniformity"? h: U → {0, 1, ..., m−1}. To be "uniform" it must be able to spread every key among {0, 1, ..., m−1}. There are #h = m^|U| such functions, so specifying one h needs Θ(log2 #h) = Θ(|U| · log2 m) bits of storage, and Θ(|U| / log2 |U|) time to be computed.

Advanced Algorithms for Massive Datasets Universal Hash Functions

The notion of universality. This was the key property of uniform hashing + chaining: for any fixed pair of distinct keys x, y, Pr_h[ h(x) = h(y) ] ≤ 1/m.

Do universal hashes exist? Let U be the universe of keys and m (a prime) the table size. Write each key k in base m as digits k_0, k_1, ..., k_r, where r ≈ log2 |U| / log2 m and each digit takes ≈ log2 m bits. Select each coefficient a_i at random in [0, m), and define h_a(k) = (Σ_i a_i · k_i) mod m. Note: this is not necessarily of the form (... mod p) mod m.
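This family can be sketched as follows (illustrative only: the prime m = 101 and the number of digits are arbitrary choices for the sketch):

```python
import random

M = 101  # table size: a prime (arbitrary choice for this sketch)

def digits(k, m=M):
    """Write key k in base m: k = k_0 + k_1*m + k_2*m^2 + ..."""
    ds = []
    while True:
        ds.append(k % m)
        k //= m
        if k == 0:
            return ds

def random_member(num_digits, m=M):
    """Draw h_a from the family: each coefficient a_i uniform in [0, m)."""
    a = [random.randrange(m) for _ in range(num_digits)]
    def h_a(k):
        # h_a(k) = (sum_i a_i * k_i) mod m
        return sum(ai * ki for ai, ki in zip(a, digits(k, m))) % m
    return h_a
```

Once drawn, h_a is a fixed function; universality is a statement about the random draw, not about any single member.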

{h_a} is universal. Suppose that x and y are two distinct keys, for which we assume x_0 ≠ y_0 (other digits may differ too, of course). Question: how many h_a collide on the fixed pair x, y?

{h_a} is universal: since 0 ≤ a_i < m and m is a prime, for any fixed choice of a_1, ..., a_r exactly one value of a_0 makes x and y collide; hence the collision probability is 1/m.

Simple and efficient universal hash: h_a(x) = (a·x mod 2^r) div 2^(r−t), where 0 ≤ x < |U| = 2^r and a is odd. Few key issues: the output consists of t bits (table size m = 2^t), taken from the ≈ most significant digits of a·x; the probability of collision is ≤ 1/2^(t−1) (= 2/m).
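This multiplicative scheme maps directly onto machine arithmetic; a sketch, with r = 32 and t = 8 as example parameters:

```python
import random

r, t = 32, 8                 # keys in [0, 2^32), table of m = 2^t slots
m = 1 << t

def random_multiply_shift():
    a = random.randrange(1, 1 << r, 2)            # a random odd multiplier
    def h_a(x):
        # keep the low r bits of a*x, then take the top t of those bits
        return ((a * x) & ((1 << r) - 1)) >> (r - t)
    return h_a
```

On real hardware the "mod 2^r" is free (overflow of an r-bit register) and the "div 2^(r−t)" is a shift, which is why this family is so fast in practice.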

Advanced Algorithms for Massive Datasets Perfect Hashing

No prime needed... recall: a table of quadratic size (m = n²) avoids collisions in expectation; the two-level scheme below achieves linear size.

3 main issues: search time is O(1); space is O(n); construction time is O(n) in expectation. Recall that m = O(n) and m_i = O(n_i²), with the h_i-functions universal by construction.

A useful result. Fact. Given a hash table of size m, n keys, and a universal hash family H, if we pick h ∈ H at random the expected number of collisions is C(n,2)/m ≤ n²/(2m): with m = n (1st level) this is < n/2; with m = n² (2nd level) it is < 1/2, i.e. no collisions with probability > 1/2.

Ready to go! (construction) Pick h, and check the bucket sizes; then, for each slot i, randomly pick h_i : U → {0, 1, ..., n_i² − 1} and check that no collisions are induced by h_i (in T_i). O(1) re-draws per slot in expectation.

Ready to go! (space occupancy) Space = Σ_i m_i = Σ_i O(n_i²) = O(n) in expectation, since Σ_i C(n_i, 2) equals the expected number of collisions at the 1st level, which is O(n) when m = n.
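The two-level construction can be sketched as follows (hedged: the helper `_rand_hash`, using ((a·x + b) mod p) mod m, is a stand-in for the universal functions assumed above, and keys are assumed to be distinct integers):

```python
import random

def _rand_hash(m):
    # A stand-in universal-style hash: ((a*x + b) mod p) mod m, p a large prime.
    p = (1 << 61) - 1
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def build_fks(keys):
    """Two-level perfect hash: level 1 has m = n slots; slot i with n_i keys
    gets a private table of size n_i^2, re-drawn until collision-free
    (O(1) re-draws per slot in expectation)."""
    n = len(keys)
    h = _rand_hash(n)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(k)].append(k)
    tables = []
    for bucket in buckets:
        mi = len(bucket) ** 2
        while True:
            hi = _rand_hash(mi) if mi else None
            slots = [None] * mi
            ok = True
            for k in bucket:
                j = hi(k)
                if slots[j] is not None:     # collision: re-draw h_i
                    ok = False
                    break
                slots[j] = k
            if ok:
                break
        tables.append((hi, slots))
    return h, tables

def fks_member(struct, k):
    h, tables = struct
    hi, slots = tables[h(k)]
    return bool(slots) and slots[hi(k)] == k
```

After construction every lookup touches exactly two cells: one level-1 slot and one cell of the corresponding quadratic sub-table.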

Advanced Algorithms for Massive Datasets The power of d-choices

d-left hashing. Split the hash table into d equal sub-tables; each table entry is a bucket. To insert, choose one bucket uniformly in each sub-table, and place the item in the least loaded of the d buckets (ties to the left).

Properties of d-left hashing. The maximum bucket load is very small: O(log log n) vs. O(log n / log log n) for a single random choice. The case d = 2 led to the approach called "the power of two choices".
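A toy sketch of d-left insertion (assumption: Python's hash() over a (seed, key) pair approximates the d independent random choices):

```python
import random

class DLeftTable:
    def __init__(self, d=2, buckets_per_table=64):
        self.d = d
        self.b = buckets_per_table
        self.tables = [[[] for _ in range(self.b)] for _ in range(d)]
        # one random seed per sub-table, to get d "independent" hashes
        self.seeds = [random.randrange(1 << 30) for _ in range(d)]

    def _bucket(self, i, x):
        return hash((self.seeds[i], x)) % self.b

    def insert(self, x):
        # one candidate bucket per sub-table
        cands = [self.tables[i][self._bucket(i, x)] for i in range(self.d)]
        # least loaded wins; min() keeps the first (leftmost) on ties
        min(cands, key=len).append(x)

    def lookup(self, x):
        return any(x in self.tables[i][self._bucket(i, x)]
                   for i in range(self.d))
```

A lookup probes exactly d buckets, and the d probes can be issued in parallel, which is why the scheme is popular in hardware hash tables.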

What is left? d-left hashing yields tables with high memory utilization, fast look-ups (w.h.p.), and simplicity. Cuckoo hashing expands on this, combining multiple choices with the ability to move elements: worst-case constant lookup time (as with perfect hashing), dynamicity, and a simple implementation.

Advanced Algorithms for Massive Datasets Cuckoo Hashing

A running example (figure): 2 hash tables, and 2 random choices where each item can be stored. Items A–E are placed; inserting F finds its cell occupied, so it evicts the occupant to that item's alternate cell; inserting G triggers a longer chain of evictions. The resulting structure is a random (bipartite) graph: node = cell, edge = key.

Various representations (figure): buckets on one side, elements on the other.

Cuckoo Hashing Failures Bad case 1: inserted key has very long path before insertion completes.

Cuckoo Hashing Failures Bad case 2: inserted key runs into 2 cycles.

What should we do? Theoretical solution (bound each eviction path to O(log n) steps): re-hash everything if a failure occurs. Fact. With load less than 50% (i.e. m > 2n), n insertions give a failure rate of Θ(1/n), hence Θ(1) amortized time per insertion. The cost of a rehash is O(n); how frequent is it?

Some more details and proofs. The insertion procedure: if unsuccessful, rehash everything (this is where the amortized cost comes in); it does not check for emptiness but inserts directly, evicting any occupant; the eviction loop must be bounded (actually log n iterations, or cycle detection), since an unbounded chain is a possible danger!
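The insertion procedure just described (evict directly, bound the loop, signal failure so the caller can rehash) can be sketched as follows (seeded Python hash() stands in for the two random hash functions; MaxLoop is fixed at 32 for the sketch):

```python
import random

class CuckooTable:
    def __init__(self, r=64):
        self.r = r
        self.t1 = [None] * r
        self.t2 = [None] * r
        self.seeds = (random.randrange(1 << 30), random.randrange(1 << 30))

    def _h(self, i, x):
        return hash((self.seeds[i], x)) % self.r

    def lookup(self, x):
        # worst-case constant time: exactly two probes
        return self.t1[self._h(0, x)] == x or self.t2[self._h(1, x)] == x

    def insert(self, x, max_loop=32):       # MaxLoop ~ log n in the analysis
        if self.lookup(x):
            return True
        for _ in range(max_loop):
            i = self._h(0, x)
            self.t1[i], x = x, self.t1[i]   # place x, evict occupant into x
            if x is None:
                return True
            j = self._h(1, x)
            self.t2[j], x = x, self.t2[j]   # evicted item tries table 2
            if x is None:
                return True
        return False    # failure: caller should rehash everything
```

Note the swap idiom: the new item is written without checking for emptiness, exactly as on the slide; only afterwards do we test whether anything was actually evicted.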

Key Lemma. Pr[a shortest path of length l connects cells i and j] ≤ c^(−l) / r, where r ≥ 2cn is the table size. Note: the probability vanishes exponentially in l. Proof by induction on l. l = 1: edge (i,j) is created by at least one of the n elements with probability ≤ n · (2/r²) ≤ c^(−1)/r. l > 1: a path of l−1 edges i → u plus the edge (u,j) has probability ≤ Σ_u (c^(−(l−1))/r) · (c^(−1)/r) = c^(−l)/r.

Like chaining... Proof: x and y collide only if there exists a path of some length l between one of {h_1(x), h_2(x)} and one of {h_1(y), h_2(y)}. The positions i,j of the Lemma can be set in 4 ways, so the collision probability is ≤ 4 · Σ_l c^(−l)/r = O(1/r). Recall r = table size, so the average "chain" length is O(1).

What about rehashing? Take r ≥ 2c(1+ε)n, so as to manage εn inserts. Probability of a rehash ≤ probability of a cycle ≤ Σ_l c^(−l) = 1/(c−1), which is ½ for c = 3. Probability of k rehashes ≤ probability of k cycles ≤ 2^(−k); hence the average cost over εn inserts is O(n), i.e. O(1) per insert.

Natural extensions. (1) More than 2 hashes (choices) per key: the analysis is very different (hypergraphs instead of graphs); higher memory utilization (3 choices: 90+% in experiments; 4 choices: about 97%); but more insert time (and random access). (2) 2 hashes + bins of size B: balanced allocation and tightly O(1)-size bins; insertion sees a tree of possible evict+insert paths; more memory... but more local.

Generalization: use both! Memory utilization (Mitzenmacher, ESA 2009):
        B=1     B=2     B=4     B=8
4 hash  97%     99%     99.9%   100%*
3 hash  91%     97%     98%     99.9%
2 hash  49%     86%     93%     96%
1 hash  0.06%   0.6%    3%      12%

Minimal Ordered Perfect Hashing. m = 1.25 n; for n = 12, m = 15. The h_1 and h_2 are not perfect on their own.

h(t) = [ g(h_1(t)) + g(h_2(t)) ] mod n. The computed h is perfect, and no strings need to be stored; the space is negligible for h_1 and h_2, plus m log n bits for g.

How to construct it. Term = edge, whose vertices are given by h_1 and h_2. Start with all g(v) = 0; then assign g() by difference with the known (desired) h() values. If the graph is acyclic → OK; if not acyclic → regenerate the hashes.
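A sketch of this construction (hedged: for the 2-hash acyclic-graph variant shown here, acyclicity w.h.p. needs m roughly ≥ 2n, so the sketch uses m = 3n rather than the slide's tighter m = 1.25n, which belongs to more refined variants; seeded Python hash() is a stand-in for h_1 and h_2):

```python
import random

def build_mphf(keys, m=None):
    """Each key is an edge (h1(k), h2(k)) on m vertices; if the graph is
    acyclic, g() is filled in by a traversal so that
    h(k) = (g[h1(k)] + g[h2(k)]) mod n equals k's rank in `keys`."""
    n = len(keys)
    m = m or 3 * n + 1                  # NOT the slide's 1.25n; see lead-in
    while True:                         # regenerate hashes until they work
        s1, s2 = random.randrange(1 << 30), random.randrange(1 << 30)
        h1 = lambda k, s=s1: hash((s, k)) % m
        h2 = lambda k, s=s2: hash((s, k)) % m
        g = _try_assign(keys, h1, h2, n, m)
        if g is not None:
            return h1, h2, g, n

def _try_assign(keys, h1, h2, n, m):
    adj = [[] for _ in range(m)]        # vertex -> list of (neighbor, rank)
    for rank, k in enumerate(keys):
        u, v = h1(k), h2(k)
        if u == v:                      # self-loop: regenerate
            return None
        adj[u].append((v, rank))
        adj[v].append((u, rank))
    g = [0] * m
    seen = [False] * m
    for root in range(m):               # traverse each component
        if seen[root]:
            continue
        seen[root] = True               # g[root] stays 0
        stack = [root]
        while stack:
            u = stack.pop()
            for v, rank in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    g[v] = (rank - g[u]) % n   # assign by difference
                    stack.append(v)
                elif (g[u] + g[v]) % n != rank:
                    return None         # inconsistent edge: regenerate
    return g

def mph(key, f):
    h1, h2, g, n = f
    return (g[h1(key)] + g[h2(key)]) % n
```

Since h(k) returns the rank of k in the input order, the hash is minimal (range exactly [0, n)) and ordered, as the slide's name promises.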