Hashing Alexandra Stefan

Symbol Tables - Dictionaries

A symbol table is a data structure that allows us to maintain and use an organized set of items. Main operations:
- Insert a new item.
- Search for and return an item with a given key.
- Modify an item.
- Delete an item.
- Sort all items (visit them in order of their keys).
- Find the k-th smallest item (min: k = 1, max: k = N).

Symbol Tables vs. Priority Queues

In priority queues we care about:
- Insertions.
- Finding/deleting the max item efficiently.
In symbol tables we care about:
- Finding/deleting any item efficiently.

Keys and Items

What is the difference between a "key" and an "item"? An item contains a key, and possibly other pieces of information as well. The key is just the part of the item that we use for sorting/searching. For example, the item can be a student record and the key can be a student ID.

Search by Different Keys

In actual applications, we often want to search or sort by different keys (at different times). For example, we may want to search a customer database by:
- Customer ID.
- Last name.
- First name.
- Phone.
- Address.
- Shopping history.
- ...

Search by Different Keys

Accommodating the ability to search by different criteria (i.e., allowing multiple keys) is a standard topic in a databases course. The general idea is fairly simple:
- Define a primary key that is unique, e.g., a Social Security Number or an ID number.
- Build a main symbol table based on the primary key.
- For any other field (e.g., address, phone, last name) that we want to search by, build a separate symbol table that simply maps values of this field to primary keys. The key for each such separate symbol table is NOT the primary key.
In this course we will discuss searching by a single key.

Example

Overview

Standard ways to implement symbol tables:
- Arrays and lists (sorted or unsorted): simple implementations, but problematic performance or severe limitations.
- Trees: commonly used; relatively simple implementations (though more complicated than array/list-based ones) with good performance.
- Hash tables.

Array and List Implementations

Note: this array implementation is different from the key-indexed implementation we just talked about. Here we just throw items into an array.

Array and List Implementations

Each of these versions requires linear time for at least one of insertion, deletion, or search. We want methods that take at most logarithmic time for insertions, deletions, and searches.

                   Search         Insert   Delete
Unordered Array    O(N)           O(1)     O(N)
Unordered List     O(N)           O(1)     O(1)*
Ordered Array      O(lg(N))**     O(N)     O(N)
Ordered List       O(N)           O(N)     O(1)*

* Assuming we have the nodes required to update the list after deletion: the node to be deleted and/or the node before it.
** Binary search.

Ideal: Direct Access Tables (Key-Indexed Search)

Suppose that:
- The keys are distinct positive integers that are sufficiently small. ("Sufficiently small" will be clarified in a bit; "distinct" means that no two items share the same key.)
- We store our items in an array (so we have an array-based implementation).
How would you implement symbol tables in that case? How would you store the items? How would you support insertions, deletions, and search?

Ideal: Direct Access Tables (Key-Indexed Search)

Keys are indices into an array.
- Initialization: set all array entries to null; O(N) time.
- Insertions, deletions, search: constant time.
Limitations:
- Keys must be unique. This can be OK for primary keys, but not for keys such as last names, which are not expected to be unique.
- Keys must be small enough that the array fits in memory.
Summary: optimal performance, severe limitations.
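The key-indexed scheme above can be sketched in C as follows (a minimal illustration; the Item type and the bound M are made up for the example):

```c
#include <stdlib.h>

#define M 1000              /* keys must lie in [0, M-1] */

typedef struct {
    int key;                /* the key doubles as the array index */
    /* ... satellite data ... */
} Item;

static Item *table[M];      /* file scope: all entries start as NULL */

void insert(Item *x)  { table[x->key] = x; }     /* O(1) */
Item *search(int key) { return table[key]; }     /* O(1): NULL = absent */
void delete(int key)  { table[key] = NULL; }     /* O(1) */
```

All three operations are single array accesses, which is why this is the ideal case when the key range is small.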

Hash Tables

Hash Tables

- Direct-access table (or key-indexed table): key => index.
- Hash table: key => hash value => index.
Main aspects:
- Hash function.
- Collisions: different keys mapped to the same index.
- Dynamic hashing: reallocate the table as needed. If an Insert operation brings the load factor to 1/2, double the table; if a Delete operation brings the load factor to 1/8, halve the table.
Properties:
- Good time-space trade-off.
- Good for: search, insert, delete: O(1).
- Not good for: select, sort: not supported; must use a different method.
Reading: Chapter 11, CLRS (Chapter 14, Sedgewick, which has more complexity analysis).
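The resizing rule above can be sketched as follows (a simplified sketch; `rehash` is a hypothetical helper standing in for allocating the new table and reinserting all items, and the thresholds follow the slide):

```c
typedef struct {
    int n;     /* number of items currently stored */
    int m;     /* current table size */
    /* ... buckets ... */
} HashTable;

/* Hypothetical helper: allocate a table of size new_m and
   reinsert all n items (reinsertion not shown). */
void rehash(HashTable *t, int new_m) { t->m = new_m; }

void after_insert(HashTable *t) {
    t->n++;
    if (2 * t->n >= t->m)              /* load factor reached 1/2 */
        rehash(t, 2 * t->m);           /* double the table */
}

void after_delete(HashTable *t) {
    t->n--;
    if (8 * t->n <= t->m && t->m > 1)  /* load factor fell to 1/8 */
        rehash(t, t->m / 2);           /* halve the table */
}
```

Doubling/halving at these asymmetric thresholds keeps the load factor bounded while avoiding thrashing (repeated resizes) when insertions and deletions alternate near a boundary.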

Example

M = 10, h(k) = k % 10. Insert keys:
46 -> 6
15 -> 5
20 -> 0
37 -> 7
23 -> 3
25 -> 5 (collision with 15)
9 -> 9

index: 0    1   2   3    4   5    6    7    8   9
key:   20           23       15   46   37

Collision resolution methods:
- Separate chaining.
- Open addressing: linear probing, quadratic probing, double hashing.

Hash Functions

Let M be the size of the table, and let h be a hash function that maps a key to an index. We want random-like behavior: any key can be mapped to any index with equal probability. Typical functions:
- Multiplicative: h(k, M) = round(((k - s) / (t - s)) * M), where s <= k <= t. Simple; good if the keys are random, not so good otherwise.
- Modular (the division method): h(k, M) = k % M. Best when M is a prime number (avoid M with a power-of-2 factor, which generates more collisions). Choose M as the prime number closest to the desired table size.
- Multiplicative and modular combination (the multiplication method, in CLRS): h(k, M) = floor(M * (k*A mod 1)), with 0 < A < 1. A good choice is A = 0.618033... = (sqrt(5) - 1)/2, the reciprocal of the golden ratio. Useful when M is not prime (M can be a power of 2).
- Alternative (from Sedgewick): h(k, M) = (16161 * (unsigned)k) % M.

Hashing Strings

Given strings with 7-bit encoded characters, view the string as a number in base 128 where each letter is a 'digit', convert it to base 10, and apply mod M.

Example: "now": 'n' = 110, 'o' = 111, 'w' = 119:
h("now", M) = (110*128^2 + 111*128^1 + 119) % M

See Sedgewick page 576 (figure 14.3) for the importance of M for collisions; M = 31 is better than M = 64. When M = 64, M divides 128, so 110*128^2 + 111*128 ≡ 0 (mod 64) and the hash reduces to 119 % M: only the last letter is used. With lowercase English letters a, ..., z => 97, ..., 122, taking % 64 uses only 26 indices, [33, ..., 58], out of the [0, 63] available. Additional collisions will happen due to the last-letter distribution (how many words end in 'a'?).

Hashing Long Strings

For long strings, the previous method will result in number overflow. E.g., with 64-bit integers and an 11-character word, the leading term is c_10 * 128^10 = c_10 * (2^7)^10 = c_10 * 2^70.

Solution: partial calculations (Horner's rule). Replace:
c_10*128^10 + c_9*128^9 + ... + c_1*128^1 + c_0
with:
(...((c_10 * 128 + c_9) * 128 + c_8) * 128 + ... + c_1) * 128 + c_0
Compute it iteratively and apply mod M at each iteration.

Hashing Long Strings (Sedgewick)

Improvement: use 127 instead of 128.

int hash(char *v, int M) {
    int h = 0, a = 127;
    for (; *v != '\0'; v++)
        h = (a*h + *v) % M;
    return h;
}

Universal hashing: use different coefficients for each letter; better for avoiding collisions.

int hashU(char *v, int M) {
    int h, a = 31415, b = 27183;
    for (h = 0; *v != '\0'; v++, a = a*b % (M-1))
        h = (a*h + *v) % M;
    return h;
}

See figure 14.5 (page 578, Sedgewick).

Collision Resolution: Separate Chaining

Load factor: α = N/M (N: items in the table, M: table size).

Separate chaining: each table entry points to a list of all items whose keys were mapped to that index.
- Requires extra space for the links in the lists.
- Lists will be short: on average, size α.

Separate Chaining Example: insert 25

M = 10, h(k) = k % 10. Insert keys:
46 -> 6
15 -> 5
20 -> 0
37 -> 7
23 -> 3
25 -> 5 (collision; 25 is inserted at the head of the list at index 5)
9 -> 9

index: 0    1   2   3    4   5          6    7    8   9
lists: 20           23       25 -> 15   46   37

Inserting at the beginning of the list is advantageous when more recent data is more likely to be accessed again (e.g., new students will have to go see several offices at UTA, so they will be looked up more frequently than continuing students).
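The chaining scheme with insert-at-head can be sketched as follows (a minimal sketch; the node type and function names are made up for the illustration, and search returns the node or NULL):

```c
#include <stdlib.h>

#define M 10

typedef struct Node {
    int key;
    struct Node *next;
} Node;

static Node *table[M];              /* M initially empty chains */

int hash(int k) { return k % M; }

/* Insert at the head of the chain: O(1). */
void insert(int k) {
    Node *x = malloc(sizeof(Node));
    x->key  = k;
    x->next = table[hash(k)];
    table[hash(k)] = x;
}

/* Walk the chain at h(k): O(chain length), i.e. O(α) on average. */
Node *search(int k) {
    for (Node *x = table[hash(k)]; x != NULL; x = x->next)
        if (x->key == k)
            return x;
    return NULL;
}
```

Running the slide's insertions puts 25 at the head of chain 5, in front of 15.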

Collision Resolution: Open Addressing

Load factor: α = N/M (N: items in the table, M: table size). Open addressing uses empty cells in the table to store colliding elements, so M > N and α < 1 (α is the fraction of used cells).
- Linear probing: h(k, i, M) = (h1(k) + i) % M. If the slot where the key hashes is taken, use the next available slot. Tends to create long chains (clusters).
- Quadratic probing: h(k, i, M) = (h1(k) + c1*i + c2*i^2) % M. If two keys hash to the same value, they still follow the same sequence of probes.
- Double hashing: h(k, i, M) = (h1(k) + i*h2(k)) % M, where h2(k) must never evaluate to 0. Uses a second hash value as the jump size (as opposed to size 1 in linear probing). Better; see figure 14.10, page 596.
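The three probe sequences above can be written directly as functions of the probe number i (a sketch; h1 and h2 here are the example functions used in the next slides, h1(k) = k % M and h2(k) = 1 + (k % 7), and the quadratic constants c1 = 2, c2 = 1 match the worked example):

```c
int h1(int k, int M) { return k % M; }
int h2(int k)        { return 1 + (k % 7); }   /* never 0 */

/* i-th probe for key k in a table of size M */
int linear_probe(int k, int i, int M)    { return (h1(k, M) + i) % M; }
int quadratic_probe(int k, int i, int M) { return (h1(k, M) + 2*i + i*i) % M; }
int double_probe(int k, int i, int M)    { return (h1(k, M) + i * h2(k)) % M; }
```

For key 25 and M = 10, linear probing visits 5, 6, 7, ...; quadratic probing visits 5, 8, 3, ...; double hashing with h2(25) = 5 cycles through 5, 0, 5, 0, ... (which is exactly the bad case shown in the double-hashing example).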

Open Addressing Example: insert 25

M = 10, h1(k) = k % 10. The table already contains keys 46, 15, 20, 37, 23 (in slots 6, 5, 0, 7, 3). Next we want to insert 25: h1(25) = 5 (collision: 25 with 15).

Linear probing: h(k, i, M) = (h1(k) + i) % M. Probes slots 5, 6, 7, 8; 25 lands in slot 8.
Quadratic probing: h(k, i, M) = (h1(k) + 2i + i^2) % M. Probes slots 5, 8; 25 lands in slot 8.

i    h1(k) + 2i + i^2    % 10
0    5 + 0  = 5          5
1    5 + 3  = 8          8
2    5 + 8  = 13         3
3    5 + 15 = 20         0

Inserting 35 afterwards (not shown in the table) would try slots 5, 8, 3, 0.
Where will 9 be inserted now?

Open Addressing Example: insert 25 (double hashing)

M = 10, h1(k) = k % 10. The table already contains keys 46, 15, 20, 37, 23. Next we want to insert 25: h1(25) = 5 (collision: 25 with 15).

Double hashing: h(k, i, M) = (h1(k) + i*h2(k)) % M. The choice of h2 matters:
- h2(k) = 1 + (k % 7): h2(25) = 1 + 4 = 5, giving the probe sequence 5, 0, 5, 0, ... It cycles through only two slots, both taken, so 25 cannot be inserted.
- h2(k) = 1 + (k % 9): h2(25) = 1 + 7 = 8, giving the probe sequence 5, 3, 1. (The full cycle has length 5: 5 (from (5+0) % 10), 3 (from (5+8) % 10), 1 (from (5+16) % 10), 9 (from (5+24) % 10), 7 (from (5+32) % 10), and then it cycles back to 5 (from (5+40) % 10).) 25 lands in slot 1.
Where will 9 be inserted now?

Searching and Deletion in Open Addressing

Search: follow the probe sequence; report the key as not found when you land on an empty cell.
Deletion: mark the cell as 'DELETED' (a tombstone), not as an empty cell; otherwise you will not be able to find elements that come after it in the probe sequence. E.g., with linear probing and hash function h(k, i, 10) = (k % 10 + i) % 10: insert 15, 25, 5, 35, search for 5, then delete 25 and search for 5 or 35.
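The tombstone idea can be sketched as follows (a minimal sketch of linear probing; the EMPTY/DELETED sentinels are made-up values for the illustration and assume non-negative keys):

```c
#define M 10
enum { EMPTY = -1, DELETED = -2 };   /* sentinels; keys assumed >= 0 */

static int table[M] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
                        EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

/* Return the slot holding key k, or -1 if not found.
   Keep probing past DELETED cells; stop only at EMPTY. */
int search(int k) {
    for (int i = 0; i < M; i++) {
        int j = (k % M + i) % M;      /* linear probing */
        if (table[j] == EMPTY) return -1;
        if (table[j] == k) return j;
    }
    return -1;
}

/* Place k in the first EMPTY or DELETED slot along its probe sequence. */
void insert(int k) {
    for (int i = 0; i < M; i++) {
        int j = (k % M + i) % M;
        if (table[j] == EMPTY || table[j] == DELETED) {
            table[j] = k;
            return;
        }
    }
}

/* Tombstone the cell instead of emptying it, so later keys in the
   same probe sequence remain reachable. */
void delete(int k) {
    int j = search(k);
    if (j >= 0) table[j] = DELETED;
}
```

Running the slide's example: 15, 25, 5, 35 land in slots 5, 6, 7, 8; after deleting 25, searches for 5 and 35 still succeed because the probe walks past the tombstone in slot 6, whereas an EMPTY marker there would have cut the search short.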

Simple Uniform Hashing and Universal Hashing

Simple uniform hashing: each key is equally likely to hash to any of the M slots, independently of where the other keys hash.
Universal class of hash functions: a collection of hash functions H is universal if for each pair of distinct keys k ≠ l, the number of hash functions h from H for which h(k) = h(l) is at most |H|/M.

Perfect Hashing

Similar to separate chaining, but use another hash table instead of a linked list. Use universal hashing on both levels: pick the hash function at random from a universal class of hash functions. Can be done for static keys (once the keys are stored in the table, the set of keys never changes).

Theorem 11.9 (CLRS): Suppose that we store N keys in a hash table of size M = N^2 using a hash function h chosen at random from a universal class of hash functions. Then the probability that there are any collisions is less than 1/2. It follows that after trying a few different random hash functions we will find one that produces no collisions (the probability that all t functions we try have collisions is at most (1/2)^t).

Corollary 11.12 (CLRS): Suppose that we store N keys in a hash table using perfect hashing. Then the probability is less than 1/2 that the total storage used for the secondary hash tables equals or exceeds 4N.

Analysis of Chaining Hash Tables (expected performance)

Theorems 11.1 and 11.2 (CLRS): In a hash table with load factor α in which collisions are resolved by chaining, and assuming simple uniform hashing, the average-case search time is:
- Θ(1 + α) for an unsuccessful search, and
- Θ(1 + α) for a successful search.

Corollary 11.4 (CLRS): Using universal hashing and collision resolution by chaining on an initially empty table with M slots, it takes expected time Θ(N) to handle any sequence of N Insert, Search, and Delete operations containing O(M) Insert operations.

Analysis of Open-Address Hash Tables (expected performance)

Theorems 11.6 and 11.8 (CLRS): Given an open-address hash table with load factor α = N/M < 1, and assuming uniform hashing, the expected number of probes is at most:
- 1/(1 - α) in an unsuccessful search, and
- (1/α) ln(1/(1 - α)) in a successful search (assuming each key in the table is equally likely to be searched for).

Corollary 11.7 (CLRS): Inserting an element into an open-address hash table with load factor α, assuming uniform hashing, requires at most 1/(1 - α) probes on average.

Property 14.5 (Sedgewick): We can ensure that the average cost of all searches is less than t probes by keeping the load factor less than:
- 1 - 1/sqrt(t) for linear probing;
- 1 - 1/t for double hashing.