Hashing Indirect Address Translation

Presentation transcript:

Hashing Indirect Address Translation Jim Skon

Key Based Searching Often we have records of data, where each record is identified by a "key". Example: An insurance company with 50,000 customers looks up a customer by: customer number; last name, first name; or phone number. Records are stored on a disk drive. What is the optimal search strategy based on matching a key?

Key Based Searching Example: A core network switch has a table to route packets flowing through it. Each packet has a destination address. The switch must maintain a routing table that tells which port (link) to send each packet out on so that it moves toward its destination. Routing table (Address, Out Port): 23:43:15:43:23:10 23; 43:24:75:34:62:52 12; 76:E2:E1:D0:1E:66 D9:10:9D:00:64:6F 43; . . . 47:BE:46:DF:71:8F 2. The switch must forward packets at a rate of 10,000 or more packets a second.

Key Based Searching What is the optimal search strategy based on matching a key? What is the impact of whether the data being searched is stored in memory or on a secondary storage device? Typical access times: - Memory access time: 50-100 ns - Flash access time: 1-3 ms - Disk access time: 9-15 ms With these figures, secondary storage is roughly 10,000 to 300,000 times slower than memory (1 ms / 100 ns = 10,000; 15 ms / 50 ns = 300,000). What does this mean?

Key Based Searching Claim: Generalized key based search is O(log n) per lookup (binary search or tree traversal). What if we constrain some of the factors? Key range and type. Ultimate size of the file.

Direct Address Translation Direct translation: when the primary key (PK) and the relative record position (RRP) are the same, we say there is a direct translation. Simple direct access lookup systems use this technique. Example with a 4-digit employee ID (looking up ID 5234):
Index  First   Last
1      Bill    Yeakus
2      -
3
4      Cindy   Smith
5      Randy   Nelson
. . .
5234   Joseph  Blithers
9999
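
A minimal sketch of direct translation, assuming 4-digit employee IDs as in the slide. The names `table`, `store` and `lookup` are illustrative and not from the original deck.

```python
# Direct address translation: the key itself is the record position.
# Assumes 4-digit employee IDs (0..9999), as in the slide's example.
TABLE_SIZE = 10_000
table = [None] * TABLE_SIZE          # one slot per possible key

def store(emp_id, record):
    table[emp_id] = record           # RRP == PK, so no search is needed

def lookup(emp_id):
    return table[emp_id]             # a single indexed access, O(1)

store(1, ("Bill", "Yeakus"))
store(4, ("Cindy", "Smith"))
store(5234, ("Joseph", "Blithers"))
print(lookup(5234))                  # the record stored under ID 5234
print(lookup(2))                     # None: no record assigned to this ID
```

Every possible key needs its own slot, whether or not a record exists for it, which is the cost the following slides address.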

Indirect Address Translation Analysis: direct translation is O(1). Direct translation - problems: The PKs may not be numeric, e.g. names or alphanumeric IDs.

Indirect Address Translation Direct translation - problems: Only a small percentage of the possible range of PKs may actually have records assigned to them. Consider an employee file whose key field is a 9-digit ID number (e.g. Social Security Number). The company has 200 employees. Since the IDs may take any of $10^9$ values, the file would have to be huge ($10^9$ records!). Thus the file will have a packing density of: $200 \text{ records used} / 10^9 \text{ total} = 2 \times 10^{-7} = 0.00002\%$

Indirect Address Translation Hashing A common technique of indirect translation is hashing: a solution in which the broad range of PK values is transformed into the smaller range of RRP values. Hashing uses a hashing function to translate the key values into the smaller range of RRP values.

Indirect Address Translation Hashing Algorithms Development of a hashing function requires careful attention. The algorithm should distribute the keys as evenly as possible across the range of addresses. Because there are more possible keys than addresses, some different keys MUST necessarily map to the same address.

Key Transformation Algorithms 3 general steps to convert a key to a RRP address: 1) If key is not numeric, convert it into a numeric form, without losing information. 2) Operate on the numeric key using an algorithm which converts the keys into a spread of numbers of the order of magnitude of the address numbers required. 3) The resulting numbers are multiplied by a constant which compresses the address into the precise range of addresses.

Key Transformation Algorithms Example: Key is a 9 Digit Number. Destination file has 7000 records Step 1 - Not needed (already a number) Step 2 - Divide Key by 10000 to get remainder between 0 - 9999 Step 3 - we multiply the value from 2 by .7 to put number within the range 0000 to 6999.
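
The three steps above can be sketched as follows. This is an illustration only: the function name `three_step_hash` and the sample key 123456789 are assumptions, and the multiplication by .7 is written in integer arithmetic (times 7000, divided by 10000) to avoid floating-point rounding.

```python
def three_step_hash(key, file_size=7000, spread=10_000):
    # Step 1: the key is already numeric, so nothing to convert.
    # Step 2: take the remainder after dividing by 10000 (range 0..9999).
    remainder = key % spread
    # Step 3: compress into 0..file_size-1; equivalent to INT(remainder * .7).
    return remainder * file_size // spread

print(three_step_hash(123456789))    # remainder 6789, compressed to 4752
print(three_step_hash(999999999))    # largest remainder 9999 maps to 6999
```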

Key Transformation Algorithms Example: What would happen if we simply skip step 2 , and simply compress the number from step 1? What about clustered insertions? (Keys with contiguous values.)

Key Transformation Algorithms - Division The key is divided by a number approximately equal to the number of available addresses, and the remainder is taken as the RRP. A prime number or number with no small factors is used.

Key Transformation Algorithms - Division Example: records have a 6-digit key, 5000 RRPs desired. Divide by 4997 and use the remainder. Consider key 142536: 142536 / 4997 = 28 remainder 2620. Use 2620 as RRP. How do you suppose this method would work with clustered insertions?
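
A short sketch of the division method using the slide's numbers. The function name `division_hash` is illustrative. The second call hashes a run of contiguous keys, which is one way to explore the clustered-insertion question.

```python
def division_hash(key, divisor=4997):
    # The remainder of key / divisor is used directly as the RRP.
    return key % divisor

print(division_hash(142536))                           # 2620, as on the slide
# Contiguous keys land on contiguous addresses rather than colliding:
print([division_hash(k) for k in range(142536, 142540)])
```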

Key Transformation Algorithms - Extraction Select digits from different parts of the key. Example: Records with 10-digit key, 5000 RRPs desired. Choose the 2nd, 3rd, 6th and 8th digits: for key = 3865324567 these are 8, 6, 2 and 5, giving 8625. Compress into RRP range: INT(8625 * .5) = 4312. Use 4312 as RRP.
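
A hedged sketch of extraction. In the key 3865324567 the digits 8, 6, 2 and 5 that make up the worked value 8625 sit at positions 2, 3, 6 and 8, so the sketch assumes those positions. The function name and parameters are illustrative.

```python
def extraction_hash(key, positions=(2, 3, 6, 8), addresses=5000):
    digits = str(key)
    # Pick the digits at the given 1-based positions and join them.
    selected = int("".join(digits[p - 1] for p in positions))
    # Compress a 4-digit value (0..9999) into 0..addresses-1,
    # i.e. INT(selected * .5) when 5000 addresses are wanted.
    return selected * addresses // 10 ** len(positions)

print(extraction_hash(3865324567))   # 8625 compressed to 4312
```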

Key Transformation Algorithms - Folding Digits in the key are folded inward like folding paper. Then the digits are added. Folding tends to be more appropriate for large keys.

Key Transformation Algorithms - Folding Example Let key be 142537. Fold left at 4th digit, right at 3rd digit: Results in 4137 and 735 Add the two resulting values: 4137 + 735 = 4872 Compress into RRP range: 4872 x .5 = 2436. Use 2436 as RRP.

Key Transformation Algorithms - Mid-square method Square the key, and use the central digits of the result. Example: Let records have a 6-digit key, and 5000 RRPs desired. Key value of 142536. $142536^2 = 020316511296$ (written as 12 digits); the central digits are 1651. Compress into RRP range: 1651 x .5 = 825.5, truncated to 825. Use 825 as RRP.
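
The mid-square example can be checked with a short sketch. The function name `mid_square_hash` and its parameters are illustrative; the square is padded with leading zeros to twice the key length so that "central digits" is well defined.

```python
def mid_square_hash(key, key_digits=6, take=4, addresses=5000):
    # Square the key and pad to 2 * key_digits digits (12 here).
    squared = str(key * key).zfill(2 * key_digits)
    # Take the central `take` digits.
    start = (len(squared) - take) // 2
    middle = int(squared[start:start + take])
    # Compress 0..9999 into 0..addresses-1 (multiply by .5, truncate).
    return middle * addresses // 10 ** take

print(mid_square_hash(142536))       # central digits 1651, compressed to 825
```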

Key Transformation Algorithms - Selection The best way to choose a transform is to take the key set for the file and simulate using different transforms. Choose the one which distributes the records most evenly. The division method seems to be the best general transform.

Important hashing considerations When designing a practical hashing scheme, several important issues must be addressed: record distribution A hashing function needs to be picked which will evenly distribute the records throughout the RRP range. Different key sets will have different distribution patterns. Thus the hashing function chosen may depend on the patterns of keys in the data set.

Important hashing considerations synonyms: two or more PKs which transform to the same RRP address. The goal is to devise a hashing function for a given set of keys which will minimize synonyms. It is, however, statistically beyond reason to totally avoid synonyms. Not only would all keys need to be known in advance, but only one algorithm in $10^{12000}$ will work!

Important hashing considerations collisions: A collision occurs when a new record hashes to an address already in use by another record. The new record and the existing record are called synonyms. The result is called an overflow. A scheme must be devised to handle overflows efficiently.

Important hashing considerations packing density: the ratio of records stored in a file to addresses available in the file. Typically the best packing density is 80-90%. For a given number of records, the larger the file, the lower the probability of an overflow. There is thus a trade-off between space and efficiency.

Techniques for handling collisions Strategies for collision resolution: 1. Create the file so that each address (physical record) can hold several logical records (usually synonyms). These are called composite records or buckets. This is particularly good for secondary storage devices … why? 2. Develop algorithms for relocating colliding records somewhere else.

Composite Records or buckets Reduce the number of RRPs, but increase the size of each to hold several records. Each RRP (called a bucket) now holds several logical records. [Figure: a file of 12 single-record addresses (1-12) next to a file of 4 buckets (1-4).]

Composite Records or buckets Buckets are arrays of logical records. Bucket size = number of records per bucket. There is now room for several synonyms in each bucket, so the probability of overflow is reduced: overflow only occurs when a bucket is full. Overall file size need not increase: if the bucket size is 5, reduce the number of physical records by a factor of 5.

Composite Records or buckets May be implemented by having each file record be an array of logical records. Example: Consider two half-full files. [Figure: a file of 12 single-record addresses (1-12) and a file of 4 buckets (1-4), with some positions holding records.] Probability of overflow?

Composite Records or buckets Trade-offs: As bucket size increases, the probability of an overflow is greatly reduced. As bucket size increases, the time to read in and scan a bucket increases. Typical bucket sizes range from 5 to 30. The ideal bucket size is often a multiple of the disk sector or track size. What is the extreme case of having the longest possible bucket?

Handling overflows Increasing bucket size will reduce, but not eliminate, overflows. They must be dealt with. Many algorithms exist for handling overflows, including: 1. Progressive overflow 2. Separate overflow area 3. Chained progressive overflow

Progressive overflow Adding a new record: If the home address is full, try the next record. If the next address is full, try the next, and so on. If at the end of the file, wrap around to record 0. If the search continues until the home address is reached again, the file is full.

Progressive overflow Finding a record If in home bucket, success! Else if home bucket not full, search fails. Else if home bucket full, go search next bucket. Keep searching successive buckets until either found, or a non-full bucket is searched.

Progressive overflow Finding a record Note that as file fills, search length will increase. What are some enhancements? Each bucket has flag indicating if bucket has really overflowed

Progressive overflow Delete record Can't simply remove, or find may not work correctly Must mark each record as used, unused, or deleted.
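
The add, find and delete rules for progressive overflow can be sketched as below. This is a simplified illustration, assuming a bucket size of 1, integer keys hashed by division, and keys that are inserted at most once. The class name and the `EMPTY` and `DELETED` markers are hypothetical names for the "unused" and "deleted" states mentioned above.

```python
EMPTY = object()     # slot has never held a record ("unused")
DELETED = object()   # slot held a record that was later removed

class ProgressiveOverflowFile:
    def __init__(self, size):
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key):
        # Start at the home address, then try each next address,
        # wrapping around to record 0 at the end of the file.
        home = key % self.size
        for step in range(self.size):
            yield (home + step) % self.size

    def insert(self, key, record):
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = (key, record)
                return i
        raise RuntimeError("file full")   # probed back to the home address

    def find(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return None               # never-used slot ends the search
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = DELETED   # mark, so later synonyms stay reachable
                return True
        return False

f = ProgressiveOverflowFile(7)
f.insert(10, "a")    # home address 3
f.insert(17, "b")    # home 3 is full, goes to 4
f.insert(24, "c")    # home 3 and 4 are full, goes to 5
f.delete(17)
print(f.find(24))    # still found, because slot 4 is marked deleted, not unused
```

If `delete` simply reset the slot to unused, the search for key 24 would stop at address 4 and fail, which is why the three-state marking is needed.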

Progressive overflow Evaluation: simple; robust; searches may get very long; clustering.

Progressive overflow Alternate version - skip x records each time, where x is relatively prime to the number of records in the file. This reduces the problem of record clustering.
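
A small sketch of why the skip distance should share no common factor with the file size: only then does the probe sequence reach every address. The function names and the file size of 10 are illustrative assumptions.

```python
from math import gcd

def probe_sequence(home, step, size):
    # Addresses visited when skipping `step` records each time.
    return [(home + i * step) % size for i in range(size)]

def visits_every_address(step, size):
    # True exactly when step and size have no common factor.
    return gcd(step, size) == 1

print(probe_sequence(0, 3, 10))   # all 10 addresses are visited
print(probe_sequence(0, 4, 10))   # only the even addresses, repeated
```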

Separate overflow area Buckets contain pointers which may point to a record in a special overflow area. Records (or buckets) are linked together in the overflow area as a linked list. What happens if there are a lot of synonyms for a few home addresses?
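
A minimal sketch of a separate overflow area, assuming one record per home address and integer keys hashed by division. The class name and the three-field entry layout `[key, record, next_index]` are illustrative, not taken from the slides.

```python
class OverflowFile:
    def __init__(self, size):
        self.size = size
        self.home = [None] * size   # home entries: [key, record, next_index]
        self.overflow = []          # overflow area, same entry layout

    def insert(self, key, record):
        h = key % self.size
        if self.home[h] is None:
            self.home[h] = [key, record, None]
            return
        # Home address is taken: store in the overflow area and link the
        # new entry at the front of this home address's chain.
        self.overflow.append([key, record, self.home[h][2]])
        self.home[h][2] = len(self.overflow) - 1

    def find(self, key):
        entry = self.home[key % self.size]
        while entry is not None:
            if entry[0] == key:
                return entry[1]
            # Follow the pointer into the overflow area, if there is one.
            entry = self.overflow[entry[2]] if entry[2] is not None else None
        return None

f = OverflowFile(5)
for k, r in [(3, "a"), (8, "b"), (13, "c")]:   # all three hash to address 3
    f.insert(k, r)
print(f.find(8), len(f.overflow))
```

With many synonyms for one home address, the chain becomes a long linked list and each lookup degrades toward a sequential scan of that list.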

Separate overflow area

Chained Progressive overflow similar to progressive, but pointers link synonyms together for quicker searches.

Perfect Hashing Static keyset. Need for O(1) worst case access. In book – proof that if n keys are stored in a table of size $m = n^2$, the probability is less than ½ that there are any collisions. But storing such a table takes a lot of space. Or does it?

Perfect Hashing Two Steps: Hash into a small primary index with a typical hash function: $h(k) = ((ak + b) \bmod p) \bmod m$. Each primary index entry has (as needed) a pointer to a secondary index of variable size. Each secondary index j stores its own size $m_j$ and the parameters $a_j$ and $b_j$ of the hash function for that sub-index.
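
The two-step scheme can be sketched as follows. This is a simplified illustration under stated assumptions: keys are distinct non-negative integers smaller than the prime `P`, each secondary table has size equal to the square of the number of keys that landed in its primary slot, and the secondary parameters are redrawn until no collision occurs. A full construction would also redraw the primary function if the secondary tables grew too large in total; that step is omitted here. All names are hypothetical.

```python
import random

P = 2_147_483_647   # a prime assumed to exceed every key

def make_hash(m, rng):
    # Random parameters (a, b) for h(k) = ((a*k + b) mod P) mod m.
    return rng.randrange(1, P), rng.randrange(0, P), m

def apply_hash(params, k):
    a, b, m = params
    return ((a * k + b) % P) % m

def build(keys, seed=0):
    rng = random.Random(seed)
    primary = make_hash(len(keys), rng)
    groups = [[] for _ in keys]
    for k in keys:
        groups[apply_hash(primary, k)].append(k)
    secondary = []
    for group in groups:
        m_j = len(group) ** 2              # quadratic size per secondary index
        while True:                        # retry until collision-free
            params = make_hash(m_j, rng) if m_j else None
            table = [None] * m_j
            for k in group:
                i = apply_hash(params, k)
                if table[i] is not None:
                    break                  # collision: draw new a_j, b_j
                table[i] = k
            else:
                break                      # every key placed without collision
        secondary.append((params, table))
    return primary, secondary

def contains(structure, k):
    primary, secondary = structure
    params, table = secondary[apply_hash(primary, k)]
    return bool(table) and table[apply_hash(params, k)] == k

keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
index = build(keys)
print(all(contains(index, k) for k in keys), contains(index, 11))
```

Each lookup evaluates exactly two hash functions and inspects one slot, which is what gives the O(1) worst case for a static key set.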

Perfect Hashing