1
Data Structures Hashing 1
2
Hashing
Other search techniques require multiple comparisons.
The goal of hashing is to reduce the number of comparisons to one. That is, we wish to locate the key immediately.
CIS265/506: Chapter 11 Hashing 2
3
Hashing
A hash search is a search in which the key, through an algorithmic function, determines the location of the data.
Hashing is a key-to-address transformation in which the keys map to addresses in a list.
4
[Figure: Key → Hash Function → Address, where F(key) = address]
5
[Figure: Mapping a key value into a record location. Keys pass through the hash function and come out as addresses (5, 2, 100) in a list with addresses [001] to [100]; the occupied addresses hold 001 Harry Lee, 002 Sarah Trapp, 005 Vu Nguyen, 007 Ray Black, and 100 John Adams.]
6
Synonyms
It is possible that two (or more) different key values produce the same location.
We call the set of keys that hash to the same location in our list synonyms.
7
Collisions
If the actual data that we insert into our list contains two or more synonyms, then we will have a collision.
A collision is the event that occurs when a hashing algorithm produces an address for insertion that is already occupied.
8
Collisions
The address produced by the hashing algorithm is called the home address.
The memory that contains all the home addresses is called the prime area.
When two keys collide at a home address, we must resolve the collision by placing one of the keys & its data at another location.
9
Searching
If we use one hashing algorithm to insert a key, we must use the same algorithm to find it!
Each calculation of an address and test for success (is the key & data in the list?) is called a probe.
10
Looking for the ideal Hashing Function
Simple to use
Instantaneous computation
No waste of space
Symmetrical (data placement & retrieval use same mechanism)
Some examples follow
11
Direct Hashing
The key is the address without any algorithmic manipulation.
The supporting data structure must contain an element for every possible key.
Uses are somewhat limited.
Never a risk for synonyms (Why?)
12
Direct Hashing
Situation #1:
Let’s say we need to look at daily sales figures for one month. We could set up an array with 31 distinct elements. We could use the day as the key (& address) and the sales figures as the data.
13
[Table: columns Key and Data; rows … … …]
Since we are dealing with “direct hashing”, the address & the key are the same:
dailySales[current_day] = dailySales[current_day] + sale_amount
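The daily-sales situation can be sketched in Python as follows. This is only an illustration: the names `daily_sales` and `record_sale` and the sample amounts are assumptions, and index 0 is left unused so that the day number can serve directly as the address.

```python
# Direct hashing: the key (day of the month) is used directly as the address.
DAYS_IN_MONTH = 31  # one element per possible key, as in the 31-element array

# Index 0 is deliberately unused so that days 1..31 map to indices 1..31.
daily_sales = [0.0] * (DAYS_IN_MONTH + 1)


def record_sale(current_day, sale_amount):
    """Add one sale to the running total for its day."""
    if not 1 <= current_day <= DAYS_IN_MONTH:
        raise ValueError("day must be between 1 and 31")
    daily_sales[current_day] += sale_amount


# Illustrative sales (made-up amounts).
record_sale(5, 20.0)
record_sale(5, 5.0)
record_sale(31, 100.0)
```

No search is needed to update or read a day's total: the key itself is the position in the list.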
14
Direct Hashing
That sounds easy! What are some drawbacks?
Wasteful in terms of space (e.g. an SSN or CSU school ID produces a very large prime area)
How do you find every possible case?
15
Subtraction Method
If our keys were sequential, but did not start from one (or zero), we could use the subtraction method.
Example: Keys were from 1000 to 1100. We would subtract 1000 from the key to determine its address.
Same problems & issues as the direct method
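A minimal sketch of the subtraction method for the key range 1000 to 1100. The function name `subtraction_hash`, the 0-based addresses, and the stored string are assumptions made for illustration.

```python
FIRST_KEY = 1000  # lowest key in the example range
LAST_KEY = 1100   # highest key in the example range


def subtraction_hash(key):
    """Map a key in 1000..1100 to a 0-based address in 0..100."""
    if not FIRST_KEY <= key <= LAST_KEY:
        raise ValueError("key outside the expected range")
    return key - FIRST_KEY


# One element per possible key, exactly as with direct hashing.
table = [None] * (LAST_KEY - FIRST_KEY + 1)
table[subtraction_hash(1042)] = "record for key 1042"
```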
16
Modulo-Division Method
Also known as the division-remainder method
Divides the key by the array size and uses the remainder plus one to produce the address
address = (key MODULUS listSize) + 1
17
Modulo-Division Method
Can work with any list size
If the list size is a prime number, there will be fewer collisions (???)
18
Modulo-Division Method - Example
If we had a need to store 300 pieces of information, we would choose an array size of 307 (the next prime number above 300).
Q. Now, assume we have key 121267. In what array location (address) is this key stored?
121267 / 307 = 395 with a remainder of 2
address = (key MODULUS listSize) + 1 = (121267 modulus 307) + 1 = 2 + 1
A. Address value is: 3
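The calculation can be checked with a short Python sketch. The function name `modulo_division_hash` is an assumption, and the key 121267 is used because it is consistent with the quotient 395 and remainder 2 quoted in the example.

```python
def modulo_division_hash(key, list_size):
    """Remainder of key / list_size, plus one, giving an address in 1..list_size."""
    return (key % list_size) + 1


LIST_SIZE = 307  # prime list size chosen for about 300 records

quotient, remainder = divmod(121267, LIST_SIZE)       # 395 remainder 2
address = modulo_division_hash(121267, LIST_SIZE)     # 2 + 1 = 3
```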
19
Digit Extraction Method
Selected digits are extracted from the key and used as the address.
key → address
… → 394
… → 112
… → 388
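A sketch of digit extraction, under stated assumptions: the six-digit key 379452 and the choice of the first, third, and fourth digits are illustrative only, and the function name `digit_extraction_hash` is hypothetical.

```python
def digit_extraction_hash(key, positions, key_width=6):
    """Build an address from selected digit positions (1-based, left to right)."""
    digits = str(key).zfill(key_width)  # pad so positions are stable for short keys
    return int("".join(digits[p - 1] for p in positions))


# Assumed example: take digits 1, 3 and 4 of a six-digit key.
address = digit_extraction_hash(379452, (1, 3, 4))  # digits 3, 9, 4
```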
20
Mid-square Method
The key is squared and the address is selected from the middle of the squared number.
The full squared number may be too large for the computer! If the key has 6 digits, the product can have up to 12 digits, which is larger than the size of an integer in many computers.
21
Midsquare Method
Example: Assume we have a 4 digit address (0000-9999)
Q. What is the address of the key 9452?
A. 9452 * 9452 = 89340304
Address = 3403
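The midsquare example can be reproduced with the sketch below. The function name `midsquare_hash` and the zero-padding of the square to twice the key length are assumptions; Python integers do not overflow, so the size concern noted above does not arise here.

```python
def midsquare_hash(key, address_digits=4):
    """Square the key and take address_digits digits from the middle."""
    key_width = len(str(key))
    squared = str(key * key).zfill(2 * key_width)  # pad to a fixed width
    start = (len(squared) - address_digits) // 2
    return int(squared[start:start + address_digits])


square = 9452 * 9452            # 89340304
address = midsquare_hash(9452)  # middle four digits: 3403
```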
22
Folding Methods
Fold Shift Method
The key value is divided into parts whose size matches the address size.
The left & right parts are shifted and added to the middle part.
23
Fold Shift Method
Address size: 3
Key: 123456789
123 (left) + 456 (middle) + 789 (right) = 1368
The leading 1 is discarded; the address for the key is 368.
24
Folding Methods
Fold Boundary Method
The key value is divided into parts whose size matches the address size.
The left & right parts are folded (their digits reversed) and added to the middle part.
25
Fold Boundary Method
Address size: 3
Key: 123456789
321 (left, reversed) + 456 (middle) + 987 (right, reversed) = 1764
Note that the beginning & ending numbers have been reversed!
The leading 1 is discarded; the address for the key is 764.
26
Note that the two folding hashing methods produce different addresses!
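Both folding methods can be compared side by side in a short sketch. The function names are assumptions, and `fold_boundary` reverses only the first and last parts, which matches a three-part key such as 123456789 with an address size of 3.

```python
def split_key(key, part_size):
    """Split the key's digits into consecutive parts of part_size digits."""
    digits = str(key)
    return [digits[i:i + part_size] for i in range(0, len(digits), part_size)]


def fold_shift(key, address_size):
    """Add the parts as they are; keep only the low-order address_size digits."""
    total = sum(int(part) for part in split_key(key, address_size))
    return total % 10 ** address_size  # discards any carry digit


def fold_boundary(key, address_size):
    """Reverse the outer parts before adding; keep the low-order digits."""
    parts = split_key(key, address_size)
    parts[0] = parts[0][::-1]
    parts[-1] = parts[-1][::-1]
    total = sum(int(part) for part in parts)
    return total % 10 ** address_size


shift_address = fold_shift(123456789, 3)        # 123 + 456 + 789 = 1368
boundary_address = fold_boundary(123456789, 3)  # 321 + 456 + 987 = 1764
```

For this key, `shift_address` is 368 and `boundary_address` is 764, so the same key lands at different addresses under the two methods.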
27
Collision Resolution
Whenever we are not using a direct one-to-one mapping of keys to addresses, there is a potential for a collision.
28
[Figure: the list with addresses [001] to [100] holding 001 Harry Lee, 002 Sarah Trapp, 005 Vu Nguyen, 007 Ray Black, and 100 John Adams; the hash function produces address 5 for two different keys.]
Uh-oh!! These two keys (green, purple) hash to the same address!
29
A little bit of theory…..
Because of the anticipatory nature of hashing algorithms, it is necessary to have some empty elements in the list.
As a rule of thumb, a hashed list should never be more than 75% full.
30
LOAD FACTOR
The number of elements in the list divided by the number of physical elements allocated for the list, expressed as a percentage
Assigned the symbol alpha (α)
k : the number of filled elements
n : the total number of elements
α = (k / n) * 100
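The load factor formula and the 75% guideline translate directly into code. The names `load_factor`, `MAX_LOAD_PERCENT`, and `is_too_full`, and the sample counts, are assumptions for illustration.

```python
MAX_LOAD_PERCENT = 75.0  # the "never more than 75% full" guideline


def load_factor(filled, allocated):
    """Load factor alpha as a percentage: (k / n) * 100."""
    return filled / allocated * 100


def is_too_full(filled, allocated):
    """True when the list exceeds the 75% guideline."""
    return load_factor(filled, allocated) > MAX_LOAD_PERCENT


alpha = load_factor(230, 307)  # just under 75 for a 307-element list
```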
31
Clustering
As data are added to a list and collisions are resolved, some hashing algorithms tend to cause data to group within a list.
The tendency of data to build up unevenly across a hashed list is called clustering.
A high number of clusters decreases search efficiency.
32
Primary Clustering
Primary clustering occurs when data become clustered around a home address.
Easy to identify
33
Secondary Clustering
Secondary clustering occurs when data become grouped along a collision path throughout a list.
Not easy to identify
Rapidly decreases search efficiency
34
Secondary Clustering
The data are widely distributed across the list, so the list appears to be evenly distributed.
If the data all lie along a well-traveled collision path, the time to locate a requested element of data can become large.
35
Secondary Clustering
Example: Assume you have a group of n people.
We will use their birthdates as a key. What size group is needed before you find two (or more) people with the same birthday (not necessarily the same year, but same date)?
36
Secondary Clustering
Factoid:
If there are 23 or more people in a group, there is a better than 50% chance that two people have the same birthday.
37
Secondary Clustering
If we extrapolate this statistical curiosity into our hashing methods, we could say: if we have a list of 365 empty addresses (one for each day in a non-leap year), we can expect to get a collision within the first 23 inserts more than 50% of the time.
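The birthday figure can be verified numerically. The sketch assumes every key is equally likely to hash to any of the 365 addresses, which is an idealization; the function name `collision_probability` is hypothetical.

```python
def collision_probability(inserts, addresses=365):
    """Chance that at least two of `inserts` keys share an address,
    assuming each key hashes uniformly at random (an idealization)."""
    p_no_collision = 1.0
    for already_placed in range(inserts):
        # The next key must avoid every address that is already taken.
        p_no_collision *= (addresses - already_placed) / addresses
    return 1.0 - p_no_collision


p22 = collision_probability(22)  # still below one half
p23 = collision_probability(23)  # first value above one half
```

Under this assumption the probability first exceeds one half at 23 inserts, even though the list is then only about 6% full.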
38
Open Addressing Collision Resolution Methods
Four variations on the theme:
Linear Probe
Quadratic Probe
Double Hashing
Key Offset
Collisions are resolved by placing the colliding key at another address in the prime area.
39
Linear Probe
Simplest Method
If there is a collision, we add one to the address and try to insert the data in that location. If that fails, add one & try again. Keep going until there is success.
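A minimal linear-probe sketch. Several details are assumptions not stated on the slide: addresses are 0-based, the home address comes from a plain modulo, the probe wraps around at the end of the list, and deletion is not handled. The function names and the tiny 7-element table are illustrative.

```python
def linear_probe_insert(table, key, value):
    """Insert (key, value); return the address finally used."""
    size = len(table)
    home = key % size  # home address (0-based modulo hash, an assumption)
    for step in range(size):
        address = (home + step) % size  # add one per probe, wrapping around
        if table[address] is None or table[address][0] == key:
            table[address] = (key, value)
            return address
    raise OverflowError("no empty element left in the list")


def linear_probe_search(table, key):
    """Follow the same probe path used for insertion; None if not found."""
    size = len(table)
    home = key % size
    for step in range(size):
        entry = table[(home + step) % size]
        if entry is None:  # an empty element ends the collision path
            return None
        if entry[0] == key:
            return entry[1]
    return None


table = [None] * 7
first = linear_probe_insert(table, 5, "A")    # home address 5 is free
second = linear_probe_insert(table, 12, "B")  # collides at 5, placed at 6
third = linear_probe_insert(table, 19, "C")   # collides at 5 and 6, wraps to 0
```

The three keys are synonyms, and they end up in neighbouring elements around the home address.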
40
[Figure: the same list with addresses [001] to [100]; a key hashes to address 5, which is already occupied. Collision!]
41
Linear Probe
Advantages:
Easy to implement
Data tends to remain near the home address
Disadvantages:
Tends to produce primary clustering
Makes search algorithms more complex, especially after data have been deleted
42
Quadratic Probe
Similar to the linear probe
The increment is the collision probe number squared
Try (probe) number 1: add 1²
Try (probe) number 2: add 2² to try #1
Try (probe) number 3: add 3² to try #2
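The probe sequence described above can be generated as follows. The 0-based addresses, the wrap-around by modulo, and the function name `quadratic_probe_sequence` are assumptions for this sketch.

```python
def quadratic_probe_sequence(home, list_size, probes):
    """Addresses tried after a collision at `home`: each try adds the
    square of the probe number to the previous try."""
    address = home
    sequence = []
    for probe_number in range(1, probes + 1):
        address = (address + probe_number ** 2) % list_size  # wrap around
        sequence.append(address)
    return sequence


# Collision at address 5 in a 307-element list: 5+1, 6+4, 10+9.
tries = quadratic_probe_sequence(5, 307, 3)
```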
43
Quadratic Probe
Advantage:
Eliminates primary clustering
Disadvantages:
Secondary clustering remains
Inefficient because of the time required to square the probe number
44
Quadratic Probe
Disadvantages (cont.)
Not possible to generate a new address for every element in the list
If the list size is a prime number, it is possible to reach at least half the elements in the list
45
Double Hashing
Rather than using an arithmetic probe function (e.g. linear probe), we re-hash the address.
Prevents primary clustering
46
“Pseudo-random collision resolution”
We use a pseudo-random number generator such as:
y = ((ax + c) modulo listSize) + 1
where x is the collision address and a and c are some pre-defined numbers
47
“Pseudo-random collision resolution”
A relatively simple solution
Once a collision occurs, there is only one collision resolution path that is followed by all the keys.
Can create significant secondary clustering
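A sketch of the generator, with loudly stated assumptions: x is taken to be the address at which the collision occurred, the constants a = 3 and c = 5 are arbitrary illustrative values, and the function name is hypothetical.

```python
A = 3  # assumed multiplier (illustrative)
C = 5  # assumed increment (illustrative)


def pseudorandom_next_address(collision_address, list_size):
    """y = ((a*x + c) modulo listSize) + 1, giving an address in 1..listSize."""
    return ((A * collision_address + C) % list_size) + 1


# Every key that collides at address 2 follows the same path: 12, 42, 132, ...
path = []
address = 2
for _ in range(3):
    address = pseudorandom_next_address(address, 307)
    path.append(address)
```

Because the next address depends only on the current address and not on the key, all synonyms share one collision path, which is the source of the secondary clustering noted above.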
48
Key Offset
Key offset is a double hashing method that produces different collision paths for different keys.
Calculates the new address as a function of the old address and the key
49
Key Offset
offset = ⌊key / listSize⌋
address = ((offset + oldAddress) modulo listSize) + 1
Example (key = 166702, listSize = 307):
oldAddress = (166702 modulo 307) + 1 = 2
offset = ⌊166702 / 307⌋ = 543
address = ((543 + 2) modulo 307) + 1 = 239
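The key offset arithmetic for key 166702 and a list size of 307 can be checked with this sketch. The function names are assumptions; the formulas are the ones given above, with the offset taken as the integer quotient.

```python
def home_address(key, list_size):
    """Modulo-division home address in 1..list_size."""
    return (key % list_size) + 1


def key_offset_address(key, old_address, list_size):
    """New address from the old address and the key's integer quotient."""
    offset = key // list_size
    return ((offset + old_address) % list_size) + 1


old = home_address(166702, 307)             # (166702 mod 307) + 1
new = key_offset_address(166702, old, 307)  # ((543 + 2) mod 307) + 1
```

A different key that collides at the same address generally has a different quotient, so it follows a different collision path.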
50
Linked List Resolution
A major disadvantage to open addressing is that each collision resolution increases the probability of future collisions.
This is eliminated using linked lists.
51
[Figure: linked list resolution. A prime area with addresses [001] to [307] holds records such as 30451 Harry Lee, 00432 Sarah Trapp, 02305 Vu Nguyen, and 23007 Ray Black; the records for John Adams, Peter Smith, and Harry Eagle also appear in the figure.]
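A minimal sketch of linked list resolution. As a simplification, a Python list per address stands in for each linked list; the class name `ChainedHashTable`, the 0-based modulo hash, and the sample keys are assumptions.

```python
class ChainedHashTable:
    """Each prime-area address holds a chain of (key, data) pairs."""

    def __init__(self, list_size=307):
        # One (initially empty) chain per address; a Python list is used
        # here in place of a hand-built linked list.
        self.slots = [[] for _ in range(list_size)]

    def _home(self, key):
        return key % len(self.slots)  # 0-based home address (assumption)

    def insert(self, key, data):
        chain = self.slots[self._home(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, data)  # replace data for an existing key
                return
        chain.append((key, data))  # synonyms simply extend the chain

    def search(self, key):
        for existing_key, data in self.slots[self._home(key)]:
            if existing_key == key:
                return data
        return None


table = ChainedHashTable(list_size=7)
table.insert(5, "first synonym")
table.insert(12, "second synonym")  # same home address as key 5
table.insert(3, "unrelated key")
```

A collision here never occupies another key's home address, so resolving it does not raise the chance of later collisions elsewhere in the prime area.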
52
Bucket Hashing
Another way to avoid collisions is to hash to “buckets” or nodes that are large enough to hold multiple keys.
Collisions are postponed until the buckets are filled.
Wasteful in terms of space
53
[Figure: main buckets and overflow buckets. Bucket entries pair a key (340, 460, 981, 182, 652) with a record pointer.]
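A rough sketch of bucket hashing with overflow. The bucket capacity of 3, the per-bucket overflow list, the modulo hash, the class name, and the sample keys are all assumptions for illustration; the slides do not specify them.

```python
BUCKET_CAPACITY = 3  # assumed number of keys a main bucket can hold


class BucketHashTable:
    def __init__(self, bucket_count):
        self.main = [[] for _ in range(bucket_count)]      # main buckets
        self.overflow = [[] for _ in range(bucket_count)]  # overflow per bucket

    def insert(self, key):
        """Place the key in its main bucket, or in overflow once that is full."""
        bucket = key % len(self.main)
        if len(self.main[bucket]) < BUCKET_CAPACITY:
            self.main[bucket].append(key)
        else:
            self.overflow[bucket].append(key)
        return bucket

    def contains(self, key):
        bucket = key % len(self.main)
        return key in self.main[bucket] or key in self.overflow[bucket]


table = BucketHashTable(bucket_count=10)
for key in (340, 460, 980, 120):  # all four keys hash to bucket 0
    table.insert(key)
```

The first three synonyms share the main bucket without any collision handling; only the fourth has to go to overflow.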
54
File Organizations
Heap or unordered: places the records on disk in no particular order
Sorted or sequential: records are ordered by a particular field
Hashed: uses a hash function to determine record placement
B-Trees: uses a tree structure to determine location
55
Files of Unordered Records (Heap Files)
Records are placed in the file in the chronological order in which they are inserted.
Inserting is very efficient: the last disk block is copied to a buffer; the new record is added; the block is rewritten to the disk.
Searching requires a linear search, one record at a time, which is very expensive.
56
Files of Unordered Records (Heap Files)
Deletion
Find the correct block
Copy the block into a buffer; delete the record; rewrite the block back to the disk
Leaves wasted, empty space
57
Files of Ordered Records (Sorted Files)
We can order records based on the value of a field in the record, called the “ordering field”.
If the ordering field is also the key field (guaranteed to be unique), we call the field the ordering key for the file.
58
Files of Ordered Records (Sorted Files)
Reading (in key order) is very efficient
No sorting required
Finding the next record usually requires no additional block accesses
The next record is usually in the same block as the previous
Faster access when binary search techniques are used
Faster than linear search
59
Hashed Files
Called a “hashed” or “direct” file
The search condition is such that a record is found or not on a single attempt
60
Internal Hashing
Hashing is typically implemented through an array of records.
We have seen this technique already.
61
External Hashing
Hashing for disk files is called external hashing.
The target address space is made of “buckets”.
Maps a key to a relative bucket number
A table in the file header converts the bucket number into the corresponding disk block address
62
External Hashing
Fastest possible access for retrieving an arbitrary record
Most hash functions do not maintain records in hash address order
Requires a pre-allocated amount of space
What happens when the file grows?
63
Dynamic Hashing
The number of buckets is not fixed, but rather grows & shrinks as needed.
Once a bucket overflows, it is split and a new bucket is created.
The records are redistributed.
64
Dynamic Hashing
Internal Nodes: guide the search - left pointer = 1, right pointer = 0
Leaf Nodes: hold a pointer to a bucket - a bucket address
65
[Figure: DATA FILE BUCKETS. Separate buckets for records whose hash values start with 000, 001, 01, 10, 110, and 111.]