1 CSC 427: Data Structures and Algorithm Analysis Fall 2004 Associative containers  STL data structures and algorithms  sets: insert, find, erase, empty,

Slides:



Advertisements
Similar presentations
Hash Tables CS 310 – Professor Roch Weiss Chapter 20 All figures marked with a chapter and section number are copyrighted © 2006 by Pearson Addison-Wesley.
Advertisements

Lecture 11 oct 6 Goals: hashing hash functions chaining closed hashing application of hashing.
Hashing as a Dictionary Implementation
CSE Lecture 16 – Hashing & Tables
Hashing21 Hashing II: The leftovers. hashing22 Hash functions Choice of hash function can be important factor in reducing the likelihood of collisions.
Searching Kruse and Ryba Ch and 9.6. Problem: Search We are given a list of records. Each record has an associated key. Give efficient algorithm.
1 Foundations of Software Design Fall 2002 Marti Hearst Lecture 18: Hash Tables.
Sets and Maps Chapter 9. Chapter 9: Sets and Maps2 Chapter Objectives To understand the Java Map and Set interfaces and how to use them To learn about.
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
Lecture 11 oct 7 Goals: hashing hash functions chaining closed hashing application of hashing.
Hashing General idea: Get a large array
Introducing Hashing Chapter 21 Copyright ©2012 by Pearson Education, Inc. All rights reserved.
L. Grewe. Computing hash function for a string Horner’s rule: (( … (a 0 x + a 1 ) x + a 2 ) x + … + a n-2 )x + a n-1 ) int hash( const string & key )
Maps A map is an object that maps keys to values Each key can map to at most one value, and a map cannot contain duplicate keys KeyValue Map Examples Dictionaries:
Hashing 1. Def. Hash Table an array in which items are inserted according to a key value (i.e. the key value is used to determine the index of the item).
Hash Table March COP 3502, UCF.
1 Joe Meehean 1.  BST easy to implement average-case times O(LogN) worst-case times O(N)  AVL Trees harder to implement worst case times O(LogN)  Can.
HASHING Section 12.7 (P ). HASHING - have already seen binary and linear search and discussed when they might be useful (based on complexity)
1 Chapter 5 Hashing General ideas Methods of implementing the hash table Comparison among these methods Applications of hashing Compare hash tables with.
Data Structures and Algorithm Analysis Hashing Lecturer: Jing Liu Homepage:
CS 202, Spring 2003 Fundamental Structures of Computer Science II Bilkent University1 Hashing CS 202 – Fundamental Structures of Computer Science II Bilkent.
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture8.
1.  We’ll discuss the hash table ADT which supports only a subset of the operations allowed by binary search trees.  The implementation of hash tables.
DATA STRUCTURES AND ALGORITHMS Lecture Notes 7 Prepared by İnanç TAHRALI.
Hashing Table Professor Sin-Min Lee Department of Computer Science.
Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.
1 Hash table. 2 Objective To learn: Hash function Linear probing Quadratic probing Chained hash table.
CS121 Data Structures CS121 © JAS 2004 Tables An abstract table, T, contains table entries that are either empty, or pairs of the form (K, I) where K is.
Comp 335 File Structures Hashing.
STL multimap Container. STL multimaps multimaps are associative containers –Link a key to a value –AKA: Hashtables, Associative Arrays –A multimap allows.
Prof. Amr Goneid, AUC1 CSCI 210 Data Structures and Algorithms Prof. Amr Goneid AUC Part 5. Dictionaries(2): Hash Tables.
Can’t provide fast insertion/removal and fast lookup at the same time Vectors, Linked Lists, Stack, Queues, Deques 4 Data Structures - CSCI 102 Copyright.
Hashing as a Dictionary Implementation Chapter 19.
Hash Tables - Motivation
CSC 427: Data Structures and Algorithm Analysis
Searching Given distinct keys k 1, k 2, …, k n and a collection of n records of the form »(k 1,I 1 ), (k 2,I 2 ), …, (k n, I n ) Search Problem - For key.
CSC 427: Data Structures and Algorithm Analysis
Chapter 12 Hash Table. ● So far, the best worst-case time for searching is O(log n). ● Hash tables  average search time of O(1).  worst case search.
WEEK 1 Hashing CE222 Dr. Senem Kumova Metin
Hash Tables CSIT 402 Data Structures II. Hashing Goal Perform inserts, deletes, and finds in constant average time Topics Hash table, hash function, collisions.
Chapter 11 Hash Tables © John Urrutia 2014, All Rights Reserved1.
1 CSC 427: Data Structures and Algorithm Analysis Fall 2011 Space vs. time  space/time tradeoffs  hashing  hash table, hash function  linear probing.
Hash Table March COP 3502, UCF 1. Outline Hash Table: – Motivation – Direct Access Table – Hash Table Solutions for Collision Problem: – Open.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
Hashing 1 Hashing. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,
CPSC 252 Hashing Page 1 Hashing We have already seen that we can search for a key item in an array using either linear or binary search. It would be better.
Hash Tables © Rick Mercer.  Outline  Discuss what a hash method does  translates a string key into an integer  Discuss a few strategies for implementing.
1 CSCD 326 Data Structures I Hashing. 2 Hashing Background Goal: provide a constant time complexity method of searching for stored data The best traditional.
Prof. amr Goneid, AUC1 CSCE 110 PROGRAMMING FUNDAMENTALS WITH C++ Prof. Amr Goneid AUC Part 15. Dictionaries (1): A Key Table Class.
Hashing. Search Given: Distinct keys k 1, k 2, …, k n and collection T of n records of the form (k 1, I 1 ), (k 2, I 2 ), …, (k n, I n ) where I j is.
Hash Tables Ellen Walker CPSC 201 Data Structures Hiram College.
Sets and Maps Chapter 9. Chapter Objectives  To understand the Java Map and Set interfaces and how to use them  To learn about hash coding and its use.
CSCI  Sequence Containers – store sequences of values ◦ vector ◦ deque ◦ list  Associative Containers – use “keys” to access data rather than.
1 Designing Hash Tables Sections 5.3, 5.4, 5.5, 5.6.
Fundamental Structures of Computer Science II
1 CSC 321: Data Structures Fall 2012 Hash tables  HashSet & HashMap  hash table, hash function  collisions  linear probing, lazy deletion, primary.
Sets and Maps Chapter 9.
CSC 321: Data Structures Fall 2013 Hash tables HashSet & HashMap
CSC 321: Data Structures Fall 2015 Hash tables HashSet & HashMap
CSC 427: Data Structures and Algorithm Analysis
Hash Tables.
Advanced Associative Structures
Hash Table.
CS202 - Fundamental Structures of Computer Science II
Sets and Maps Chapter 9.
CSC 321: Data Structures Fall 2018 Hash tables HashSet & HashMap
Hashing.
Data Structures and Algorithm Analysis Hashing
CSE 373: Data Structures and Algorithms
Presentation transcript:

1 CSC 427: Data Structures and Algorithm Analysis Fall 2004 Associative containers  STL data structures and algorithms  sets: insert, find, erase, empty, size, begin/end  maps: operator[], find, erase, empty, size, begin/end  hash table versions ( hash_set, hash_map )  hash function, probing, chaining

2 Standard Template Library (STL) the C++ Standard Template Library is a standardized collection of data structure and algorithm implementations  the classes are templated for generality  we have seen,,,,  contains many useful routines, e.g., heap operations vector nums; for (int i = 1; i <= 10; i++) { nums.push_back(i); nums.push_back(2*i); } make_heap(nums.begin(), nums.end());// orders entries into heap Display(nums); pop_heap(nums.begin(), nums.end());// removes max (root) value, reheapifies nums.push_back(33);// adds new item at end (next leaf) push_heap(nums.begin(), nums.end());// swap up path to reheapify Display(nums); sort_heap(nums.begin(), nums.end());// converts the heap into an ordered list Display(nums);

3 Other STL classes other useful data structures are defined in STL classes, e.g.,  set : represents a set of objects (unordered collection, no duplicates)  hash_set : an efficient (but more complex) set that utilizes hash tables  map : represents a mapping of pairs (e.g., account numbers & balances)  hash_map : an efficient (but more complex) map that utilizes hash tables  multimap : represents a mapping of pairs, can have multiple entries per key  hash_multimap : an efficient (but complex) multimap that utilizes hash tables  slist : singly-linked list  deque : double-ended queue  priority_queue : priority queue  bit_vector : sequence of bits (with constant time access) all of these data structure include definitions of iterators, so can use algorithms from the library e.g., includes set_union, set_intersection, set_difference

4 Set class the set class stores a collection of items with no duplicates  repeated insertions of the same item still yield only one entry construct with no arguments, e.g., set nums; set words; member functions include: void insert(const TYPE & item); // adds item to set (no duplicates) void erase(const TYPE & item); // removes item from the set set ::iterator find(const TYPE & item); // returns iterator to item in set // (or to end() if not found) bool empty(); // returns true if set is empty int size(); // returns size of set set & operator=(const set & rhs); // assignment operator  to traverse the contents of a set, use an iterator and dereference for (set ::iterator iter = nums.begin(); iter != nums.end(); iter++) { cout << *iter << endl; }

5 Set example: counting unique words #include using namespace std; int main() { set uniqueWords;// SET OF UNIQUE WORDS string str; while (cin >> str) {// REPEATEDLY READ NEXT str uniqueWords.insert(str);// ADD WORD TO SET OF ALL WORDS } // TRAVERSE AND DISPLAY THE SET ENTRIES cout << "There were " << uniqueWords.size() << " unique words: " << endl; for (set ::iterator iter = uniqueWords.begin(); iter != uniqueWords.end(); iter++) { cout << *iter << endl; } return 0; } capitalization? punctuation?

6 Dictionary utilizing set Dictionary::Dictionary() { } void Dictionary::read(istream & istr) { string word; while (istr >> word) { words.insert(word); } void Dictionary::write(ostream & ostr) { for (set ::iterator iter = words.begin(); iter != words.end(); iter++) { ostr << *iter << endl; } void Dictionary::addWord(const string & str) { string normal = Normalize(str); if (normal != "") { words.insert(normal); } bool Dictionary::isStored(const string & word) const { return words.find(word) != words.end(); } class Dictionary { public: Dictionary(); void read(istream & istr = cin); void write(ostream & ostr = cout); void addWord(const string & str); bool isStored(const string & str) const; private: set words; string Normalize(const string & str) const; }; would this implementation work well in our speller program? better than vector? better than list? better than BST?

7 Map class the map class stores a collection of mappings ( pairs)  can only have one entry per key (subsequent adds replace old values) construct with no arguments, e.g., map wc; member functions include: ENTRY_TYPE & operator[](const KEY_TYPE & key); // accesses value at key void erase(const KEY_TYPE & key); // removes entry from the map map ::iterator find(const KEY_TYPE & key); // returns iterator to entry in map // (or to end() if not found) bool empty(); // returns true if map is empty int size(); // returns size of map map & operator=(const set & rhs); // assignment operator  to traverse the contents of a map, use an iterator and dereference and access fields for (map ::iterator iter = wc.begin(); iter != wc.end(); iter++) { cout first second << endl; }

8 Map example: keeping word counts #include using namespace std; int main() { map wordCount;// MAP OF WORDS & THEIR COUNTS string str; while (cin >> str) {// REPEATEDLY READ NEXT str if (wordCount.find(str) == wordCount.end()) { // IF ENTRY NOT FOUND FOR str, wordCount[str] = 0; // ADD NEW ENTRY WITH VALUE 0 } wordCount[str]++; // INCREMENT ENTRY } // TRAVERSE AND DISPLAY THE MAP ENTRIES for (map ::iterator iter = wordCount.begin(); iter != wordCount.end(); iter++) { cout first second << endl; } return 0; }

9 Another example: greedy gift givers consider the following simplification of a programming contest problem: For a group of gift-giving friends, determine how much more each person gives than they receive (or vice versa). assume each person spends a certain amount on a sublist of friends each friend gets the same amount (if 2 friends and $3 dollars, each gets $1) input format: 3 # of people Liz Steve Dave names of people Liz 30 1 Steve for each person: Steve 55 2 Liz Dave his/her name, initial funds, Dave 0 2 Steve Liz number of friends, their names output format: Liz -3 for each person: Steve -24 amount received – amount given Dave 27

10 Greedy Gift Givers use a map to store the amounts for each person  key is person's name, value is (amount received – amount given) so far  as each gift is given, add to recipient's entry & subtract from giver's entry map peopleMap; int numPeople; cin >> numPeople; for (int i = 0; i < numPeople; i++) { string nextPerson; cin >> nextPerson; peopleMap[nextPerson] = 0; } for (int i = 0; i < numPeople; i++) { string giver, givee; int amount, numGifts; cin >> giver >> amount >> numGifts; if (numGifts > 0) { int eachGift = amount/numGifts; for (int k = 0; k < numGifts; k++) { cin >> givee; peopleMap[givee] += eachGift; } peopleMap[giver] -= eachGift*numGifts; } for (map ::iterator iter = peopleMap.begin(); iter != peopleMap.end(); iter++) { cout first second << endl; }

11 Set and map implementations how are set and map implemented?  the documentation for set and map do not specify the data structure  they do specify that insert, erase, and find are O(log N), where N is the number of entries  also, they specify an ordering on the values traversed by an iterator for set, values accessed in increasing order (based on operator< for TYPE) for map, values accessed in increasing order (based on operator< for KEY_TYPE) what data structure(s) provide this behavior? there do exist more efficient implementations of these data structures  hash_set and hash_map use a hash table representation  as long as certain conditions hold, can provide O(1) insert, erase, find!!!!!!!!!

12 Hash tables a hash table is a data structure that supports constant time insertion, deletion, and search on average  degenerative performance is possible, but unlikely  it may waste some storage idea: data items are stored in a table, based on a key  the key is mapped to an index in the table, where the data is stored/accessed example: letter frequency  want to count the number of occurrences of each letter in a file  have a vector of 26 counters, map each letter to an index  to count a letter, map to its index and increment "A"  0 "B"  1 "C"  2 "Z"  25

13 Mapping examples extension: word frequency  must map entire words to indices, e.g., "A"  0"AA"  26"BA"  "B"  1"AB"  27"BB"  "Z"  25"AZ"  51"BZ"   PROBLEM? mapping each potential item to a unique index is generally not practical # of 1 letter words = 26 # of 2 letter words = 26 2 = 676 # of 3 letter words = 26 3 = 17,  even if you limit words to at most 8 characters, need a table of size 217,180,147,158  for any given file, the table will be mostly empty!

14 Table size < data range since the actual number of items stored is generally MUCH smaller than the number of potential values/keys:  can have a smaller, more manageable table e.g., vector size = 26 possible mapping: map word based on first letter "A * "  0"B * "  1... "Z * "  25 e.g., vector size = 1000 possible mapping: add ASCII values of letters, mod by 1000 "AB"  = 131 "BANANA"  = 417 "BANANABANANABANANA"  = 1251 % 1000 = 251  POTENTIAL PROBLEMS?

15 Collisions the mapping from a key to an index is called a hash function since |range(hash function)| < |domain(hash function)|, can have multiple items map to the same index (i.e., a collision ) "ACT"  = 216"CAT"  = 216 techniques exist for handling collisions, but they are costly (LATER) it's best to avoid collisions as much as possible – HOW?  want to be sure that the hash function distributes the key evenly  e.g., "sum of ASCII codes" hash function OK if array size is 1000 BADif array size is 10,000 most words are <= 8 letters, so max sum of ASCII codes = 1,016 so most entries are mapped to first 1/10th of table

16 Better hash function a good hash function should  produce an even spread, regardless of table size  take order of letters into account (to handle anagrams) unsigned int Hash(const string & key, int tableSize) // acceptable hash function for strings // Hash(key) = (key[0]* key[1]* … + key[n]*128 n ) % tableSize { unsigned int hashIndex = 0; for (int i = 0; i < key.length(); i++) { hashIndex = (hashIndex*128 + key[i]) % tableSize; } return hashIndex; } unsigned int Hash(const string & key, int tableSize) // better hash function for strings, often used in practice { unsigned int hashIndex = 0; for (int i = 0; i < key.length(); i++) { hashIndex = (hashIndex << 5) ^ key[i] ^ hashIndex; } return hashIndex % tableSize; }

17 Word frequency example returning to the word frequency problem  pick a hash function  pick a table size  store word & associated count in the table  as you read in words, map to an index using the hash function if an entry already exists, increment otherwise, create entry with count = 1 "FOO" 1... "BAR" WHAT ABOUT COLLISIONS?

18 Linear probing linear probing is a simple strategy for handling collisions  if a collision occurs, try next index & keep looking until an empty one is found (wrap around to the beginning if necessary) assume naïve "first letter" hash function  insert "BOO"  insert "C++"  insert "BOO"  insert "BAZ"  insert "ZOO"  insert "ZEBRA"

19 Linear probing (cont.) with linear probing, will eventually find the item if stored, or an empty space to add it (if the table is not full) what about deletions?  delete "BIZ" can the location be marked as empty? "AND" 3... "BIZ" "C++" 1 "BOO" 1 can't delete an item since it holds a place for the linear probing  search "C++" 3

20 Lazy deletion when removing an entry  mark the entry as being deleted (i.e., mark location)  subsequent searches must continue past deleted entries (probe until desired item or an empty location is found)  subsequent insertions can use deleted locations ADD "BOO" ADD "AND" ADD "BIZ" ADD "C++" DELETE "BIZ" SEARCH "C++" ADD "COW" SEARCH "C

21 Primary clustering in practice, probes are not independent  suppose table is half full maps to4-7 require1 check map to 3 requires2 checks map to 2 requires3 checks map to 1 requires4 checks map to 0 requires 5 checks average = 18/8 = 2.25 checks "AND" "BOO" "BIZ" "C++" using linear probing, clusters of occupied locations develop  known as primary clusters insertions into the clusters are expensive & increase the size of the cluster

22 Analysis of linear probing the load factor λ is the fraction of the table that is full empty tableλ = 0 half full table λ = 0.5full tableλ = 1 THEOREM: assuming a reasonably large table, the average number of locations examined per insertion (taking clustering into account) is roughly (1 + 1/(1-λ) 2 )/2 empty table(1 + 1/(1 - 0) 2 )/2 = 1 half full(1 + 1/(1 –.5) 2 )/2 = 2.5 3/4 full(1 + 1/(1 -.75) 2 )/2 = 8.5 9/10 full(1 + 1/(1 -.9) 2 )/2 = 50.5 as long as the hash function is fair and the table is less than half full, then inserting, deleting, and searching are all O(1) operations

23 Rehashing it is imperative to keep the load factor below 0.5 if the table becomes half full, then must resize  find prime number twice as big  just copy over table entries to same locations???  NO! when you resize, you have to rehash existing entries new table size  new hash function (+ different wraparound) LET Hash(word) = word.length() % tableSize ADD "UP" ADD "OUT" ADD "YELLOW" NOW RESIZE AND REHASH

24 Chaining there are variations on linear probing that eliminate primary clustering  e.g., quadratic probing increases index on each probe by square offset Hash(key)  Hash(key) + 1  Hash(key) + 4  Hash(key) + 9  Hash(key) + 16  … however, the most commonly used strategy for handling collisions is chaining  each entry in the hash table is a bucket (list)  when you add an entry, hash to correct index then add to bucket  when you search for an entry, hash to correct index then search sequentially "AND""APPLE" "CAT""C++""COWS" "DOG"

25 Analysis of chaining in practice, chaining is generally faster than probing  cost of insertion is O(1) – simply map to index and add to list  cost of search is proportional to number of items already mapped to same index e.g., using naïve "first letter" hash function, searching for "APPLE" might requires traversing a list of all words beginning with 'A' if hash function is fair, then will have roughly λ/tableSize items in each bucket  average cost of a successful search is roughly λ/2*tableSize chaining is sensitive to the load factor, but not as much as probing – WHY? hash_map and hash_set in the STL use chaining  a default hash function is defined for standard classes (e.g., string)  you can specify your own hash function as part of a hash_map/hash_set declaration