LEARNING OBJECTIVES O(1), O(N) and O(LogN) access times. Hashing:


LEARNING OBJECTIVES
O(1), O(N) and O(log N) access times.
Hashing: definition, how it works, examples, collisions, pros and cons of hashing.
CPSC 231 Hashing (D.H.)

O(1), O(N) and O(log_k N) accesses
O(1) access to files means that the access time to a file record is constant: it does not depend on the file's size.
O(N) access to files means that the access time is proportional to the file size, where N is the number of records in the file.
O(log_k N) access to files means that the access time is proportional to log_k N.
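To get a feel for how differently these three growth rates behave, the following is a minimal sketch that counts worst-case accesses for a file of N records. The function names and the choice k = 2 are assumptions made for illustration only.

```python
def accesses_linear(n):
    """O(N): in the worst case every one of the n records is examined."""
    return n

def accesses_logarithmic(n, k=2):
    """O(log_k N): count how many times the search range must be
    narrowed by a factor of k before a single record remains."""
    steps = 0
    reach = 1
    while reach < n:
        reach *= k
        steps += 1
    return steps

def accesses_constant(n):
    """O(1): one access, no matter how large the file is."""
    return 1

n = 1_000_000
print(accesses_linear(n), accesses_logarithmic(n), accesses_constant(n))
```

For a million records this prints 1000000, 20 and 1: the logarithmic count grows very slowly, and the constant count does not grow at all.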

Which of the file organization methods discussed so far has O(N) access time?
Which of the file organization methods discussed so far has O(log_k N) access time?

Hashing - O(1) access time
Hashing is a technique for generating a home address for a given key. Ideally every key gets its own address, although in practice two different keys may be given the same one.
Hashing is used when rapid access to a key (or to its corresponding record) is required.
Hashing can be used to:
- access records in a file
- access items in arrays in memory
- access directories of a file system

How does hashing work?
A hash function h is a function that is applied to a key K to generate the key's home address: home address = h(K).
If a record is stored at its home address, then the access time to it is O(1).

Example of hashing
Suppose you want to store 75 records in a file in which the key to each record is a person's name. Suppose that you set aside space for 1000 records (assuming that your file will grow).
Use the following hash function h(K) for each key: take the ASCII values of the first two characters of the last name and multiply them. Then take the last three digits of the result to get the home address.

Example, continued
For the name BALL:
h(K) = h(“BALL”) = last three digits of (66 * 65) = last three digits of 4290 = 290
So 290 is the home address of the key BALL. This means that if record 290 in the file is available, the record with the key BALL will be stored at that location.
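The hash function from this example can be sketched in a few lines. The name `simple_hash` is an assumption made here; taking the last three decimal digits is done with a remainder modulo 1000.

```python
def simple_hash(last_name):
    """Multiply the ASCII codes of the first two characters and keep
    the last three decimal digits of the product as the home address."""
    product = ord(last_name[0]) * ord(last_name[1])
    return product % 1000

print(simple_hash("BALL"))  # 66 * 65 = 4290, so this prints 290
```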

Hashing vs indexing - two important differences
- With hashing, the address generated appears to be random: there is no obvious connection between the key and the location of the corresponding record. For this reason hashing is sometimes referred to as randomizing.
- With hashing, two different keys may be transformed to the same address, so two records may be sent to the same place in the file. When this occurs, it is called a collision, and some means must be found to deal with it.

Collisions
A collision is a situation in which two or more keys produce the same home address.
Can you give an example of a key that collides with the key BALL under the hash function from the earlier example?
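One way to look for collisions is to group keys by their home address. The sketch below does this for the earlier example function, which uses only the first two characters of the name, so any two names that start with the same two letters must collide. The list of names and the name `simple_hash` are hypothetical.

```python
from collections import defaultdict

def simple_hash(last_name):
    """Product of the first two ASCII codes, last three digits."""
    return (ord(last_name[0]) * ord(last_name[1])) % 1000

names = ["BALL", "BAKER", "SMITH", "LOWELL"]  # hypothetical keys

# Map each home address to the list of keys that hash to it.
by_address = defaultdict(list)
for name in names:
    by_address[simple_hash(name)].append(name)

# Report only the addresses claimed by more than one key.
for address, keys in sorted(by_address.items()):
    if len(keys) > 1:
        print(address, keys)
```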

What to do about collisions?
Collisions cause problems: we cannot put two different records in the same place.
The ideal solution is a hashing algorithm that avoids collisions altogether. Such an algorithm is called a perfect hashing algorithm.
A perfect hashing algorithm is usually very difficult (or impossible) to find.

A practical solution to reduce the number of collisions
- Spread out the records: find a hashing algorithm that spreads the records randomly among the available addresses.
- Use extra memory: it is easier to avoid collisions if we have only a few records to distribute among many addresses than if we have about the same number of records as addresses. (Problem: fragmentation.)

A practical solution to reduce the number of collisions
- Put more than one record at a single address, e.g. make physical records big enough to hold 5 data records. Each home address can then hold up to 5 synonyms (two or more different keys that hash to the same address).
Addresses that can hold multiple records are called buckets.
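A minimal sketch of buckets, assuming an in-memory list stands in for the file and each address holds up to 5 records as in the example above. The names `BUCKET_SIZE`, `make_file` and `store` are assumptions made for illustration.

```python
BUCKET_SIZE = 5  # records per address, as in the example above

def make_file(num_addresses):
    """Create an empty file: one empty bucket per address."""
    return [[] for _ in range(num_addresses)]

def store(buckets, address, record):
    """Put the record in the bucket at its home address if there is room.
    Returns True on success, False if that bucket is already full."""
    bucket = buckets[address]
    if len(bucket) < BUCKET_SIZE:
        bucket.append(record)
        return True
    return False

buckets = make_file(1000)
store(buckets, 290, "BALL")
store(buckets, 290, "BAKER")   # a synonym of BALL shares the bucket
print(buckets[290])
```

A full bucket still needs some overflow strategy; the sketch only reports that the record did not fit.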

A Simple Hashing Algorithm
Let's look at an algorithm that randomizes home addresses much better than the hash function presented before. This algorithm has the following three steps:
1. Represent the key in numerical form.
2. Fold and add.
3. Divide by a prime number and use the remainder as the address.

Step one: Represent the Key in Numerical Form
If the key is a number, this step is already accomplished.
If the key is a string of characters, we may use ASCII codes to convert it to numerical form. E.g. the key BALL, padded with four blank spaces:
B  A  L  L  (blank) (blank) (blank) (blank)
66 65 76 76 32 32 32 32

Step two: Fold and Add
Folding and adding means chopping the number into pieces and adding the pieces together. E.g.
Folding: 6665 | 7676 | 3232 | 3232
Adding: 6665 + 7676 + 3232 + 3232 = 20805
To avoid overflow during the addition, one may reduce the sum with the mod function, using a large divisor (the text uses a prime; see text p. 469-470). With the divisor 19,397 used in this example:
20805 mod 19397 = 1408
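Steps one and two can be sketched together as follows. The function name, the fixed width of 8 characters, and applying the mod after every addition are assumptions made here; the sketch also assumes every character has a two-digit ASCII code (uppercase letters and blanks), so that pairing two codes such as 66 and 65 gives 6665.

```python
def fold_and_add(key, width=8, limit=19397):
    """Pad the key with blanks, pair up the ASCII codes, and add the
    pairs, reducing the running sum modulo `limit` to avoid overflow.
    `width` must be even so that the codes pair up exactly."""
    padded = key.ljust(width)[:width]         # step one: fixed-width key
    codes = [ord(ch) for ch in padded]        # step one: ASCII codes
    total = 0
    for i in range(0, width, 2):              # step two: fold into pairs
        piece = codes[i] * 100 + codes[i + 1] # e.g. 66, 65 -> 6665
        total = (total + piece) % limit       # add, keeping the sum small
    return total

print(fold_and_add("BALL"))  # prints 1408
```

Reducing after each addition gives the same remainder as reducing the final sum once, and keeps intermediate values below the limit plus one piece.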

Step three: Divide by a Prime Number Close to the Size of the Address Space
The purpose of this step is to ensure that the final result of the calculation falls within the range of record addresses in the file.
This is done by applying the mod function to the current result, using the size of the file (the number of addresses) as the divisor.

Step three, continued
If we decide that the file size is 100 records, then we compute a = s mod n, where s is the sum from step two and n is the number of addresses:
a = 1408 mod 100 = 8
(8 is the home address of the key BALL.)
What would be the home address of BALL if we allow 1000 records?

Choosing a Prime Number for n
The choice of the divisor n can have a major effect on how well the records are spread out.
A prime number is usually used for the divisor because primes tend to distribute remainders much more uniformly than non-primes. E.g. instead of using 100 in the previous example we could use 101.
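The effect of the divisor is easiest to see with keys that follow a pattern. The sketch below uses a hypothetical set of numeric keys that are all multiples of 20 and counts how many distinct addresses each divisor produces; the function name and the key set are assumptions chosen for illustration.

```python
def distinct_addresses(keys, n):
    """Count how many different home addresses the keys occupy
    when the address is computed as key mod n."""
    return len({key % n for key in keys})

keys = range(0, 2000, 20)  # 100 hypothetical keys: 0, 20, 40, ..., 1980

print(distinct_addresses(keys, 100))  # prints 5
print(distinct_addresses(keys, 101))  # prints 100

# The sum computed for BALL in step two, under each divisor:
print(1408 % 100, 1408 % 101)         # prints 8 95
```

With the divisor 100 the patterned keys pile up on only five addresses, while the prime 101 gives every key its own address.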

Progressive overflow
Progressive overflow is a technique for handling collisions by storing a record in the next available address after its home address.
Progressive overflow is not the most efficient overflow handling technique, but it is one of the simplest and is adequate for many applications. (See fig. 11.4, 11.5, p. 487-488.)
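A minimal sketch of progressive overflow, assuming an in-memory list stands in for the file, `None` marks an unused slot, and the search wraps around from the last address to the first. The function names and the wrap-around behaviour are assumptions made here.

```python
def insert(table, key, home):
    """Store key at its home address, or at the next free slot after it.
    Returns the slot actually used."""
    n = len(table)
    for step in range(n):
        slot = (home + step) % n      # wrap around at the end of the file
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("file is full")

def search(table, key, home):
    """Start at the home address and move forward until the key or an
    empty slot is found. Returns the slot, or None if the key is absent."""
    n = len(table)
    for step in range(n):
        slot = (home + step) % n
        if table[slot] is None:
            return None               # an empty slot ends the search
        if table[slot] == key:
            return slot
    return None

table = [None] * 10
print(insert(table, "BALL", 8))    # home address is free: slot 8
print(insert(table, "BAKER", 8))   # collision: next slot, 9
print(insert(table, "ABBOTT", 8))  # collision: wraps around to slot 0
print(search(table, "ABBOTT", 8))  # found at slot 0
```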

Record Deletion in Hashed Files
Deleting a record from a hashed file is complicated by the following two requirements:
- The slot freed by the deletion must not be allowed to hinder later searches.
- It should be possible to reuse the freed slot for later additions.
See the example in fig. 11.9 on page 499.

Tombstones for Handling Deletions
A tombstone is a special marker placed in the key field of a record to mark it as no longer valid.
Tombstones solve two problems associated with the deletion of records:
- the freed space does not break a sequential search for a record (WHY?)
- the freed space is easily recognized and can be reclaimed later (HOW?)
(See fig. 11.10, 11.11, p. 500.)
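The following sketch shows one way tombstones could be combined with progressive overflow. It assumes an in-memory list stands in for the file, that the marker value `"######"` never occurs as a real key, and that a key is not inserted twice; all names are hypothetical.

```python
EMPTY = None
TOMBSTONE = "######"  # hypothetical marker, assumed never to be a real key

def search(table, key, home):
    """Probe forward from the home address. Only a truly empty slot
    stops the search; a tombstone is simply stepped over."""
    n = len(table)
    for step in range(n):
        slot = (home + step) % n
        if table[slot] is EMPTY:
            return None
        if table[slot] == key:
            return slot
    return None

def delete(table, key, home):
    """Replace the record's key with a tombstone instead of emptying it."""
    slot = search(table, key, home)
    if slot is not None:
        table[slot] = TOMBSTONE
    return slot

def insert(table, key, home):
    """Use the first slot that is empty or holds a tombstone."""
    n = len(table)
    for step in range(n):
        slot = (home + step) % n
        if table[slot] is EMPTY or table[slot] == TOMBSTONE:
            table[slot] = key
            return slot
    raise RuntimeError("file is full")

table = [None] * 10
for name in ("BALL", "BAKER", "ABBOTT"):
    insert(table, name, 8)             # slots 8, 9 and 0
delete(table, "BAKER", 8)              # slot 9 now holds a tombstone
print(search(table, "ABBOTT", 8))      # still found at slot 0
print(insert(table, "BARNES", 8))      # the tombstone slot 9 is reused
```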

Hashing - Pros and Cons
Pros: hashing can provide faster access than most other file organizations, usually with very little storage overhead, and it is adaptable to most primary keys. Ideally, hashing makes it possible to find any record with only one disk access.
Cons: the primary disadvantage of hashing is that hashed files are generally not sorted by key.