Data Structures Hashing 1.



Hashing: Other search techniques require multiple comparisons. The goal of hashing is to reduce the number of comparisons to one. That is, we wish to locate the key immediately. CIS265/506: Chapter 11 Hashing 2

Hashing: A hash search is a search in which the key, through an algorithmic function, determines the location of the data. Hashing is a key-to-address transformation in which the keys map to addresses in a list.

Figure: Key → Hash Function → Address, where F(key) = address.

Figure: Mapping a key value into a record location. The keys 102002, 107095 and 111060 pass through the hash function, which produces the addresses 5, 2 and 100. The list has addresses [001] to [100] and holds the records 001 Harry Lee, 002 Sarah Trapp, 005 Vu Nguyen, 007 Ray Black and 100 John Adams.

Synonyms: It is possible that two (or more) different key values produce the same location. We call the set of keys that hash to the same location in our list synonyms.

Collisions: If the actual data that we insert into our list contains two or more synonyms, then we will have a collision. A collision is the event that occurs when a hashing algorithm produces an address for insertion that is already occupied.

Collisions: The address produced by the hashing algorithm is called the home address. The memory that contains all the home addresses is called the prime area. When two keys collide at a home address, we must resolve the collision by placing one of the keys & its data at another location.

Searching: If we use one hashing algorithm to insert a key, we must use the same algorithm to find it! Each calculation of an address and test for success (is the key & data in the list?) is called a probe.

Looking for the ideal Hashing Function: simple to use; instantaneous computation; no waste of space; symmetrical (data placement & retrieval use the same mechanism). Some examples follow.

Direct Hashing: The key is the address without any algorithmic manipulation. The supporting data structure must contain an element for every possible key. Uses are somewhat limited. Never a risk of synonyms (Why?)

Direct Hashing, Situation #1: Let’s say we need to look at daily sales figures for one month. We could set up an array with 31 distinct elements. We could use the day as the key (& address) and the sales figures as the data.

Key   Data
01    27.61
02    32.45
03    34.21
…     …
29    81.33
30    65.99
31    00.00
Since we are dealing with “direct hashing”, the address & the key are the same: dailySales[current_day] = dailySales[current_day] + sale_amount
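
The update on this slide can be sketched in Python roughly as follows. This is only an illustration: the names `daily_sales` and `record_sale` and the unused element 0 (so that day 1 lands at address 1) are assumptions, not part of the slides.

```python
# Direct hashing sketch: the day of the month is both the key and the address.
DAYS_IN_MONTH = 31

# Element 0 is left unused (an assumption) so that day 1 maps to address 1.
daily_sales = [0.0] * (DAYS_IN_MONTH + 1)


def record_sale(current_day: int, sale_amount: float) -> None:
    """Add one sale to the running total stored at address current_day."""
    if not 1 <= current_day <= DAYS_IN_MONTH:
        raise ValueError("day must be between 1 and 31")
    daily_sales[current_day] += sale_amount


record_sale(1, 20.00)
record_sale(1, 7.61)
record_sale(2, 32.45)
```

No hash computation or probing is needed, but the array must hold an element for every possible key.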

Direct Hashing: That sounds easy! What are some drawbacks? Wasteful in terms of space (e.g. an SSN or CSU school ID produces a very large prime area). How do you find every possible case?

Subtraction Method: If our keys were sequential, but did not start from one (or zero), we could use the subtraction method. Example: Keys were from 1000 to 1100. We would subtract 1000 from the key to determine its address. Same problems & issues as the direct method.

Modulo-Division Method: Also known as the division-remainder method. Divides the key by the array size and uses the remainder plus one to produce the address: address = (key MODULUS listSize) + 1

Modulo-Division Method: Can work with any list size. If the list size is a prime number, there tend to be fewer collisions (Why?)

Modulo-Division Method - Example: If we had a need to store 300 pieces of information, we would choose an array size of 307 (the next prime number above 300). Now, assume we have key 121267. Q. In what array location (address) is this key stored? 121267 ÷ 307 = 395 with a remainder of 2. address = (key MODULUS listSize) + 1 = (121267 modulus 307) + 1 = 2 + 1 = 3. A. Address value is: 3
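
The arithmetic in this example can be checked with a few lines of Python. The function name `modulo_division_hash` is an illustrative choice, not something taken from the slides.

```python
def modulo_division_hash(key: int, list_size: int) -> int:
    """Return (key modulus list_size) + 1, giving an address in 1..list_size."""
    return key % list_size + 1


# 121267 divided by 307 is 395 with a remainder of 2.
quotient, remainder = divmod(121267, 307)
address = modulo_division_hash(121267, 307)
print(quotient, remainder, address)
```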

Digit Extraction Method: Selected digits are extracted from the key and used as the address. key → address: 379452 → 394; 121267 → 112; 378845 → 388
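
The three examples are consistent with taking the first, third and fourth digits of a six-digit key. The sketch below assumes that choice of positions; the slide does not state which digits are selected, and the function name is hypothetical.

```python
def digit_extraction_hash(key: int, positions=(1, 3, 4), width: int = 6) -> int:
    """Build the address from the digits at the given 1-based positions.

    The default positions (1, 3, 4) are inferred from the slide's examples.
    """
    digits = str(key).zfill(width)  # keep leading zeros of short keys
    return int("".join(digits[p - 1] for p in positions))


for key in (379452, 121267, 378845):
    print(key, "->", digit_extraction_hash(key))
```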

Mid-square Method: The key is squared and the address is selected from the middle of the squared number. The full squared number may be too large for the computer! If the key has 6 digits, the product is 12 digits, which is larger than the size of an integer in many computers.

Midsquare Method Example: Assume we have a 4 digit address (0000-9999). Q. What is the address of the key 9452? A. 9452 * 9452 = 89340304. Address = 3403
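
A minimal Python sketch of the midsquare calculation, assuming the address is the centred run of digits in the square (the slide only says "the middle"). Python integers do not overflow, so the size concern from the previous slide does not arise here.

```python
def midsquare_hash(key: int, address_digits: int = 4) -> int:
    """Square the key and return the middle address_digits digits."""
    squared = str(key * key)
    start = (len(squared) - address_digits) // 2
    return int(squared[start:start + address_digits])


print(9452 * 9452)           # the full square
print(midsquare_hash(9452))  # the middle four digits
```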

Folding Methods, Fold Shift Method: The key value is divided into parts, each the size of the address. The left & right parts are shifted and added to the middle part.

Fold Shift Method: Key 123456789, address size 3. 123 (left) + 456 (middle) + 789 (right) = 1368. The leading 1 is discarded, so the address for the key 123456789 is 368.

Folding Methods, Fold Boundary Method: The key value is divided into parts, each the size of the address. The left & right parts are folded (their digits are reversed) and added to the middle part.

Fold Boundary Method: Key 123456789, address size 3. 321 + 456 + 987 = 1764 (note that the beginning & ending numbers have been reversed!). The leading 1 is discarded, so the address for the key 123456789 is 764.

Note that the two folding hashing methods produce different addresses!
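
Both folding methods can be compared side by side in a short Python sketch. The helper names are hypothetical, and the sketch assumes the key is cut from the left into parts of address size.

```python
def split_key(key: int, size: int) -> list:
    """Cut the decimal digits of key into parts of `size` digits, left to right."""
    digits = str(key)
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def fold_shift_hash(key: int, address_size: int = 3) -> int:
    """Add the parts as they are and discard any carry beyond the address size."""
    total = sum(int(part) for part in split_key(key, address_size))
    return total % 10 ** address_size


def fold_boundary_hash(key: int, address_size: int = 3) -> int:
    """Reverse the first and last parts, then add and discard the carry."""
    parts = split_key(key, address_size)
    if len(parts) > 1:
        parts[0] = parts[0][::-1]
        parts[-1] = parts[-1][::-1]
    total = sum(int(part) for part in parts)
    return total % 10 ** address_size


print(fold_shift_hash(123456789))     # 123 + 456 + 789 = 1368
print(fold_boundary_hash(123456789))  # 321 + 456 + 987 = 1764
```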

Collision Resolution: Whenever we are not using a direct one-to-one mapping of keys to addresses, there is a potential for a collision.

Figure: The same list as before, with a fourth key, 202002, added to the keys 102002, 107095 and 111060. Uh-oh!! Two of the keys (green, purple) hash to the same address!

A little bit of theory: Because of the anticipatory nature of hashing algorithms, it is necessary to have some empty elements in the list. As a rule of thumb, a hashed list should never be more than 75% full.

Load Factor: the number of elements in the list divided by the number of physical elements allocated for the list, expressed as a percentage. Assigned the symbol alpha (α). k: the number of filled elements; n: the total number of elements. α = (k / n) × 100
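
As a quick illustration of the formula, the sketch below computes α for a hypothetical list of 307 elements with 230 filled and compares it with the 75% guideline mentioned earlier. The numbers 230 and 307 and the names used are assumptions chosen for the example.

```python
MAX_LOAD_PERCENT = 75.0  # the guideline from the theory slide


def load_factor(filled: int, allocated: int) -> float:
    """Return alpha = (k / n) * 100 for k filled out of n allocated elements."""
    if allocated <= 0:
        raise ValueError("allocated must be positive")
    return filled / allocated * 100


alpha = load_factor(230, 307)
print(round(alpha, 2), alpha <= MAX_LOAD_PERCENT)
```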

Clustering: As data are added to a list and collisions are resolved, some hashing algorithms tend to cause data to group within a list. The tendency of data to build up unevenly across a hashed list is called clustering. A high number of clusters decreases search efficiency.

Primary Clustering: Primary clustering occurs when data become clustered around a home address. Easy to identify.

Secondary Clustering: Secondary clustering occurs when data become grouped along a collision path throughout a list. Not easy to identify. Rapidly decreases search efficiency.

Secondary Clustering: The data are widely distributed across the list, so the list appears to be evenly distributed. If the data all lie along a well-traveled collision path, the time to locate a requested element of data can become large.

Secondary Clustering, Example: Assume you have a group of n people. We will use their birthdates as a key. What size group is needed before you find two (or more) people with the same birthday (not necessarily the same year, but same date)?

Secondary Clustering, Factoid: If there are 23 or more people in a group, there is a better than 50% chance that two people have the same birthday.

Secondary Clustering: If we extrapolate this statistical curiosity into our hashing methods, we could say: if we have a list of 365 empty addresses (one for each day in a non-leap year), we can expect to get a collision within the first 23 inserts about 50% of the time.
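
The 50% figure can be checked numerically. The sketch below assumes every address is equally likely for each insert, which is an idealization of a real hash function; the function name is illustrative.

```python
def collision_probability(inserts: int, addresses: int = 365) -> float:
    """Probability that at least two of `inserts` keys share an address,
    assuming each key is equally likely to hash to any address."""
    p_no_collision = 1.0
    for i in range(inserts):
        p_no_collision *= (addresses - i) / addresses
    return 1.0 - p_no_collision


for n in (22, 23):
    print(n, round(collision_probability(n), 4))
```

Under this assumption the probability first passes one half at the 23rd insert.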

Open Addressing Collision Resolution Methods: Four variations on the theme: linear probe, quadratic probe, double hashing, key offset. Collisions are resolved by placing the offending key at another address in the prime area.

Linear Probe: Simplest method. If there is a collision, we add one to the address and try to insert the data in that location. If that fails, add one & try again. Keep going until there is success.

Figure: The same list and keys (102002, 107095, 111060 and 202002) as in the collision figure, with the occupied address marked: Collision!

Linear Probe: Advantages: easy to implement; data tends to remain near the home address. Disadvantages: tends to produce primary clustering; makes search algorithms more complex, especially after data have been deleted.
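
A minimal sketch of insertion and search with a linear probe. It makes several assumptions that are not in the slides: a modulo-division home address, 0-based addresses that wrap around at the end of the list, and no support for deletion. The class name is hypothetical.

```python
class LinearProbeTable:
    """Open-addressing table that resolves collisions by adding one."""

    def __init__(self, list_size: int = 7) -> None:
        self.list_size = list_size
        self.slots = [None] * list_size  # each slot is None or (key, data)

    def _home(self, key: int) -> int:
        return key % self.list_size  # 0-based home address (assumption)

    def insert(self, key: int, data: str) -> int:
        """Store the record and return the address where it ended up."""
        address = self._home(key)
        for _ in range(self.list_size):  # at most one probe per slot
            entry = self.slots[address]
            if entry is None or entry[0] == key:
                self.slots[address] = (key, data)
                return address
            address = (address + 1) % self.list_size  # add one and try again
        raise OverflowError("the list is full")

    def search(self, key: int):
        """Follow the same probe path as insert; return the data or None."""
        address = self._home(key)
        for _ in range(self.list_size):
            entry = self.slots[address]
            if entry is None:
                return None  # an empty slot ends the collision path
            if entry[0] == key:
                return entry[1]
            address = (address + 1) % self.list_size
        return None


table = LinearProbeTable(7)
print(table.insert(5, "a"), table.insert(12, "b"), table.insert(19, "c"))
```

The keys 5, 12 and 19 all have home address 5 in a list of size 7, so the second and third records land in the next free slots, which is the primary clustering described above.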

Quadratic Probe: Similar to the linear probe. The increment is the collision probe number squared. Try (probe) number 1: add 1². Try (probe) number 2: add 2² to try #1. Try (probe) number 3: add 3² to try #2.

Quadratic Probe: Advantage: eliminates primary clustering. Disadvantages: secondary clustering remains; inefficient because of the time required to square the number.

Quadratic Probe, Disadvantages (cont.): Not possible to generate a new address for every element in the list. If the list size is a prime number, it is possible to reach at least half the elements in the list.
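
The probe sequence described on the quadratic probe slides (each try adds the square of the probe number to the previous try) can be generated as below. The 0-based addresses with wrap-around and the function name are assumptions made for the sketch.

```python
def quadratic_probe_sequence(home: int, list_size: int, max_probes: int) -> list:
    """Return the addresses tried after a collision at `home`.

    Probe number p adds p squared to the address of the previous try,
    wrapping around at the end of the list.
    """
    address = home
    tried = []
    for probe in range(1, max_probes + 1):
        address = (address + probe * probe) % list_size
        tried.append(address)
    return tried


print(quadratic_probe_sequence(0, 307, 3))
```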

Double Hashing: Rather than using an arithmetic probe function (e.g. linear probe), we re-hash the address. Prevents primary clustering.

“Pseudorandom collision resolution”: We use a pseudorandom number generator such as y = ((ax + c) modulo listSize) + 1, where x is the collision address and a and c are pre-defined constants.

“Pseudorandom collision resolution”: A relatively simple solution. Once a collision occurs, there is only one collision resolution path that is followed by all the keys. Can create significant secondary clustering.
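
A small sketch of the pseudorandom step, assuming x is the address at which the collision occurred and using a = 3 and c = 5 as purely illustrative constants (the slides give no values).

```python
A = 3  # illustrative coefficient, not taken from the slides
C = 5  # illustrative constant, not taken from the slides


def pseudorandom_probe(collision_address: int, list_size: int) -> int:
    """Return y = ((a * x + c) modulo list_size) + 1 for collision address x."""
    return (A * collision_address + C) % list_size + 1


print(pseudorandom_probe(2, 307))
```

Because the new address depends only on the collision address, every key that collides at address 2 is sent to the same next address, which is the single collision path described on this slide.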

Key Offset: Key offset is a double hashing method that produces different collision paths for different keys. Calculates the new address as a function of the old address and the key.

Key Offset: offset = [key / listSize] (integer part of the quotient); address = ((offset + oldAddress) modulo listSize) + 1. Example: oldAddress = (166702 modulo 307) + 1 = 2; offset = [166702 / 307] = 543; address = ((543 + 2) modulo 307) + 1 = 239
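
The key offset calculation translates directly into Python. The function names are hypothetical, and the second key, 572556, is an extra key chosen for illustration because it has the same home address as 166702.

```python
def home_address(key: int, list_size: int) -> int:
    """Modulo-division home address in 1..list_size."""
    return key % list_size + 1


def key_offset_probe(key: int, old_address: int, list_size: int) -> int:
    """Return ((offset + old_address) modulo list_size) + 1,
    where offset is the integer part of key / list_size."""
    offset = key // list_size
    return (offset + old_address) % list_size + 1


for key in (166702, 572556):
    old = home_address(key, 307)
    print(key, old, key_offset_probe(key, old, 307))
```

Both keys start at address 2 but are moved to different addresses, so their collision paths diverge.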

Linked List Resolution: A major disadvantage of open addressing is that each collision resolution increases the probability of future collisions. This is eliminated by using linked lists.

Figure: Linked list resolution. A list with addresses [001] to [307] holds the records 30451 Harry Lee, 00432 Sarah Trapp, 02305 Vu Nguyen and 23007 Ray Black; further records (47100 John Adams, 49742 Peter Smith, 86351 Harry Eagle, . . .) are attached in linked lists.
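
A minimal sketch of linked list resolution. Python lists stand in for the linked lists, the home address is computed by modulo division with 0-based addresses, and the class name is hypothetical; none of these choices come from the slides.

```python
class ChainedHashTable:
    """Each address holds a chain of (key, data) records that are synonyms."""

    def __init__(self, list_size: int = 307) -> None:
        self.list_size = list_size
        self.chains = [[] for _ in range(list_size)]

    def _home(self, key: int) -> int:
        return key % self.list_size  # 0-based home address (assumption)

    def insert(self, key: int, data: str) -> None:
        chain = self.chains[self._home(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, data)  # replace the record for an existing key
                return
        chain.append((key, data))  # a synonym simply extends the chain

    def search(self, key: int):
        for existing_key, data in self.chains[self._home(key)]:
            if existing_key == key:
                return data
        return None


table = ChainedHashTable()
table.insert(166702, "first record")
table.insert(572556, "second record")  # same home address as 166702
print(table.search(572556), len(table.chains[166702 % 307]))
```

A collision here never occupies another key's home address, so it does not raise the chance of later collisions elsewhere in the prime area.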

Bucket Hashing: Another way to avoid collisions is to hash to “buckets”, or nodes that are large enough to hold multiple keys. Collisions are postponed until the buckets are filled. Wasteful in terms of space.

Figure: Main buckets and overflow buckets. Each bucket entry is a key with a record pointer. Keys shown: 340, 460, 981, 182, 321, 761, 091, 022, 072, 522, 652, 399, 089.

File Organizations: Heap or unordered: places the records on disk in no particular order. Sorted or sequential: records are ordered by a particular field. Hashed: uses a hash function to determine record placement. B-Trees: use a tree structure to determine location.

Files of Unordered Records (Heap Files): Records are placed in the file in the chronological order in which they are inserted. Inserting is very efficient: the last disk block is copied to a buffer, the new record is added, and the block is rewritten to the disk. Searching requires a linear search, one record at a time, which is very expensive.

Files of Unordered Records (Heap Files), Deletion: Find the correct block. Copy the block into a buffer, delete the record, and rewrite the block back to the disk. Leaves wasted, empty space.

Files of Ordered Records (Sorted Files): We can order records based on the value of a field in the record, called the “ordering field”. If the ordering field is also the key field (guaranteed to be unique), we call the field the ordering key for the file.

Files of Ordered Records (Sorted Files): Reading (in key order) is very efficient. No sorting is required. Finding the next record usually requires no additional block accesses, because the next record is usually in the same block as the previous one. Access is faster when binary search techniques are used, since binary search is faster than linear search.

Hashed Files: Called a “hashed” or “direct” file. The search is such that a record is either found or not found in a single attempt.

Internal Hashing: Hashing is typically implemented through an array of records. We have seen this technique already.

External Hashing: Hashing for disk files is called external hashing. The target address space is made of “buckets”. The hash function maps a key to a relative bucket number. A table in the file header converts the bucket number into the corresponding disk block address.

External Hashing: Fastest possible access for retrieving an arbitrary record. Most hash functions do not maintain records in hash address order. Requires a pre-allocated amount of space. What happens when the file grows?

Dynamic Hashing: The number of buckets is not fixed, but rather grows & shrinks as needed. Once a bucket overflows, it is split and a new bucket is created. The records are redistributed.

Dynamic Hashing: Internal nodes guide the search (left pointer = 1, right pointer = 0). Leaf nodes hold a pointer to a bucket (a bucket address).

Figure: Data file buckets. The leaf nodes point to buckets for records whose hash values start with 000, 001, 01, 10, 110 and 111.
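
The lookup through the directory of internal and leaf nodes can be sketched as a walk over the bits of the hash value. The sketch uses the six prefixes from the figure, represents internal nodes as nested dictionaries keyed by bit rather than as left and right pointers, and does not show bucket splitting; these are simplifying assumptions and the function names are hypothetical.

```python
PREFIXES = ["000", "001", "01", "10", "110", "111"]  # from the figure


def build_directory(prefixes):
    """Build a binary tree: dicts are internal nodes, strings are leaf nodes."""
    root = {}
    for prefix in prefixes:
        node = root
        for bit in prefix[:-1]:
            node = node.setdefault(bit, {})  # create internal nodes as needed
        node[prefix[-1]] = "bucket " + prefix  # leaf holds the bucket address
    return root


def find_bucket(directory, hash_bits: str) -> str:
    """Follow the hash value bit by bit until a leaf node is reached."""
    node = directory
    for bit in hash_bits:
        node = node[bit]
        if isinstance(node, str):
            return node
    raise KeyError("hash value too short to reach a leaf node")


directory = build_directory(PREFIXES)
print(find_bucket(directory, "0110"), find_bucket(directory, "1101"))
```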