File Processing - Indirect Address Translation MVNC1 Hashing Indirect Address Translation Chapter 11.

Presentation transcript:

File Processing - Indirect Address Translation MVNC1 Hashing Indirect Address Translation Chapter 11

Slide 2: Indirect Address Translation
• Direct translation
  » When the primary key (PK) and the relative record position (RRP) are the same, we say there is a direct translation.
  » Simple direct-access file systems use this technique.

Slide 3: Indirect Address Translation
• Direct translation - problems
  » The PKs may not be numeric.
    – Names
    – Alphanumeric IDs

Slide 4: Indirect Address Translation
• Direct translation - problems
  » Only a small percentage of the possible range of PKs may actually have records assigned to them:
    – Consider an employee file whose key field is a 9-digit ID number (e.g. a Social Security Number).
    – The company has 200 employees.
    – Since the IDs may take any of the 10^9 possible values, the file would have to be huge (10^9 records!). The file would thus have a packing density of:
      200 records used / 10^9 records allocated = 0.0000002 = 0.00002%
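The packing density for the employee file can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `packing_density` is an illustrative choice and does not come from the slides.

```python
def packing_density(records_stored: int, addresses_available: int) -> float:
    """Return the fraction of available addresses that actually hold records."""
    return records_stored / addresses_available


# 200 employees stored in a file addressed directly by a 9-digit ID number.
density = packing_density(200, 10**9)
print(f"packing density = {density:.7f} ({density * 100:.5f}%)")
```

Almost every address in such a file is wasted, which is the motivation for transforming the key into a much smaller address range.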

Slide 5: Indirect Address Translation
• Hashing
  » A common technique of indirect translation is hashing.
  » It is a solution in which the broad range of PK values is transformed into the smaller range of RRP values.
  » Hashing uses a hashing function to translate the key values into the smaller range of RRP values.

Slide 6: Indirect Address Translation
• Hashing algorithms
  » Development of a hashing function requires careful attention.
    – The algorithm should distribute the keys as evenly as possible across the range of addresses.
    – Because there are more possible keys than addresses, some different keys must necessarily map to the same address.

Slide 7: Key Transformation Algorithms
• 3 general steps to convert a key to an RRP address:
  1) If the key is not numeric, convert it into numeric form without losing information.
  2) Operate on the numeric key using an algorithm that converts the keys into a spread of numbers of the same order of magnitude as the address numbers required.
  3) Multiply the resulting numbers by a constant that compresses them into the precise range of addresses.

Slide 8: Key Transformation Algorithms
• Example:
  » The key is a 9-digit number.
  » The destination file has 7000 records.
  » Step 1 - Not needed (the key is already a number).
  » Step 2 - Divide the key by 10000 to get a remainder between 0000 and 9999.
  » Step 3 - Multiply the value from step 2 by .7 to put the number within the range 0000 to 6999.
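The three steps of this example can be sketched as a small function. It assumes the step 2 divisor is 10000, which is what the .7 multiplier and the target range 0000 to 6999 imply. The name `three_step_rrp` and the sample key are hypothetical. Integer arithmetic is used for the compression so the result is truncated exactly.

```python
def three_step_rrp(key: int, file_size: int = 7000, spread: int = 10_000) -> int:
    """Map a numeric key to a relative record position in 0..file_size-1."""
    # Step 1: nothing to do, the key is already numeric.
    # Step 2: take the remainder, giving a value in 0..spread-1 (assumed divisor).
    spread_value = key % spread
    # Step 3: compress by file_size / spread (0.7 here) and truncate.
    return spread_value * file_size // spread


# Hypothetical 9-digit key: remainder 6789, compressed to 4752.
print(three_step_rrp(123_456_789))
```

The largest possible remainder, 9999, compresses to 6999, so every result is a valid address in the 7000-record file.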

Slide 9: Key Transformation Algorithms
• Example:
  » What would happen if we skipped step 2 and simply compressed the number from step 1?
  » What about clustered insertions (keys with contiguous values)?

Slide 10: Key Transformation Algorithms - Division
• The key is divided by a number approximately equal to the number of available addresses, and the remainder is taken as the RRP.
• A prime number, or a number with no small factors, is used as the divisor.

Slide 11: Key Transformation Algorithms - Division
• Example:
  » Records have a 6-digit key; 5000 RRPs are desired.
  » Divide by 4997 and use the remainder.
  » Consider key 142536:
  » 142536 / 4997 = 28 remainder 2620
  » Use 2620 as RRP.
• How do you suppose this method would work with clustered insertions?
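A minimal sketch of the division method follows. The key 142536 is reconstructed from the quotient and remainder on the slide (28 × 4997 + 2620); the function name `division_hash` is an illustrative choice. The loop at the end looks at the clustered-insertion question by hashing a run of contiguous keys.

```python
def division_hash(key: int, divisor: int = 4997) -> int:
    """Division method: the remainder of key / divisor is the RRP."""
    return key % divisor


quotient, remainder = divmod(142536, 4997)
print(quotient, remainder)  # 28 2620

# Contiguous keys land on contiguous addresses, so a cluster of keys
# shorter than the divisor produces no synonyms among its members.
for key in range(142536, 142541):
    print(key, division_hash(key))
```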

Slide 12: Key Transformation Algorithms - Extraction
• Select digits from different parts of the key.
• Example:
  » Records have a 10-digit key; 5000 RRPs are desired.
  » Choose the 3rd, 5th, 8th and 9th digits.
  » Consider a key whose selected digits are 8625.
  » Compress into the RRP range: INT(8625 × .5) = 4312. Use 4312 as RRP.
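The extraction example can be sketched as follows. The 10-digit key used here is hypothetical, chosen only so that its 3rd, 5th, 8th and 9th digits are 8, 6, 2 and 5. The function name `extraction_hash` is also an assumption.

```python
def extraction_hash(key: str, positions=(3, 5, 8, 9), file_size: int = 5000) -> int:
    """Extraction method: pick digits at 1-based positions, then compress."""
    digits = "".join(key[p - 1] for p in positions)
    span = 10 ** len(positions)  # four digits span 0..9999
    # Integer form of INT(digits * file_size / span), i.e. INT(8625 * .5).
    return int(digits) * file_size // span


# Hypothetical key: digits 3, 5, 8, 9 are 8, 6, 2, 5.
print(extraction_hash("1283647259"))
```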

Slide 13: Key Transformation Algorithms - Folding
• Digits in the key are folded inward, like folding paper. Then the digits are added.
• Folding tends to be more appropriate for large keys.

Slide 14: Key Transformation Algorithms - Folding
• Example:
  » Fold the key left at the 4th digit and right at the 3rd digit.
  » This results in 4137 and 735.
  » Add the two resulting values: 4137 + 735 = 4872
  » Compress into the RRP range: 4872 × .5 = 2436. Use 2436 as RRP.
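One possible reading of folding "like folding paper" is boundary folding: the outer groups of digits are reversed as they fold inward, then added. The sketch below follows that reading, which is an assumption and not necessarily the slide author's exact procedure. The key `7314537` is hypothetical, picked so that the folded parts come out as 4137 and 735. Reducing the sum modulo 10000 before compressing is also an assumption, added so that the result always stays inside the address range.

```python
def fold_hash(key: str, left: int = 4, right: int = 3, file_size: int = 5000) -> int:
    """Boundary folding (assumed variant): reverse the outer digit groups and add."""
    left_part = int(key[:left][::-1])      # left group, reversed by the fold
    right_part = int(key[-right:][::-1])   # right group, reversed by the fold
    middle = key[left:len(key) - right]    # digits between the two folds, if any
    total = left_part + right_part + (int(middle) if middle else 0)
    span = 10 ** left                      # assume a 4-digit working range
    return (total % span) * file_size // span


# Hypothetical key: "7314" folds to 4137, "537" folds to 735; sum 4872, RRP 2436.
print(fold_hash("7314537"))
```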

Slide 15: Key Transformation Algorithms - Mid-square method
• Square the key, and use the central digits of the result.
• Example:
  » Let records have a 6-digit key, with 5000 RRPs desired.
  » Square the key value and take the central digits of the result, here 1651.
  » Compress into the RRP range: INT(1651 × .5) = 825. Use 825 as RRP.
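The mid-square method can be sketched as below. The key 123456 is a hypothetical value, not the one from the slide. Padding the square to twice the key length so that "central" is well defined is an assumption of this sketch.

```python
def mid_square(key: int, key_digits: int = 6, take: int = 4) -> int:
    """Square the key and return the central `take` digits of the square."""
    squared = str(key * key).zfill(2 * key_digits)  # pad to a fixed width
    start = (len(squared) - take) // 2
    return int(squared[start:start + take])


def mid_square_rrp(key: int, file_size: int = 5000) -> int:
    """Compress the four central digits (0..9999) into 0..file_size-1."""
    return mid_square(key) * file_size // 10_000


# 123456 squared is 15241383936; padded to 12 digits the central four are 4138.
print(mid_square(123456), mid_square_rrp(123456))
```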

Slide 16: Key Transformation Algorithms - Selection
• The best way to choose a transform is to take the key set for the file and simulate using different transforms.
• Choose the one which distributes the records most evenly.
• The division method seems to be the best general transform.
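Simulating candidate transforms against a key set, as suggested above, might look like the following sketch. The key set (2000 contiguous 6-digit keys) and the two transforms compared are assumptions chosen for illustration. `compress_only` models skipping the spreading step and scaling the key directly into the address range.

```python
from collections import Counter

FILE_SIZE = 5000  # RRPs desired, as in the earlier examples


def division(key: int) -> int:
    """Division method with the divisor 4997 used earlier."""
    return key % 4997


def compress_only(key: int) -> int:
    """Scale a 6-digit key (0..999999) straight into 0..4999, with no spreading."""
    return key * FILE_SIZE // 1_000_000


def count_synonyms(keys, transform) -> int:
    """Count keys that hash to an address already claimed by an earlier key."""
    addresses = Counter(transform(k) for k in keys)
    return sum(n - 1 for n in addresses.values())


clustered_keys = range(100_000, 102_000)  # hypothetical clustered key set
for name, transform in [("division", division), ("compress only", compress_only)]:
    print(name, count_synonyms(clustered_keys, transform))
```

On this clustered key set the division method produces no synonyms, while compressing alone squeezes all 2000 keys into ten addresses.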

Slide 17: Important hashing considerations
• When designing a practical hashing scheme, several important issues must be addressed.
• Record distribution
  » A hashing function must be picked that distributes the records evenly throughout the RRP range.
  » Different key sets will have different distribution patterns.
  » Thus the hashing function chosen will depend on the patterns of keys in the data set.

Slide 18: Important hashing considerations
• Synonyms
  » Two or more PKs which transform to the same RRP address.
  » The goal is to devise a hashing function for a given set of keys which will minimize synonyms.
  » It is, however, statistically beyond reason to totally avoid synonyms.
  » Not only would all keys need to be known in advance, but only one algorithm in … will work!

Slide 19: Important hashing considerations
• Collisions
  » A collision occurs when a new record hashes to an address already in use by another record.
  » The new record and the existing record are called synonyms.
  » The result is called an overflow.
  » A scheme must be devised to handle overflows efficiently.

Slide 20: Important hashing considerations
• Packing density
  » The ratio of records stored in a file to addresses available in the file.
  » Typically the best packing density is 80-90%.
  » For a given number of records, the larger the file, the lower the probability of an overflow.
  » There is thus a trade-off between space and efficiency.

Slide 21: Techniques for handling collisions
• Strategies for collision resolution:
  1. Create the file so that each address (physical record) can hold several logical records (usually synonyms). These are called composite records or buckets.
  2. Develop algorithms for relocating records which collide.

Slide 22: Composite Records or buckets
• Reduce the number of RRPs, but increase the size of each to hold several records.
• Each RRP (called a bucket) now holds several logical records.

Slide 23: Composite Records or buckets
• Buckets are arrays of logical records.
• Bucket size is the number of records per bucket.
• There is now room for several synonyms in each bucket.
• The probability of overflow is reduced.
• Overflow now occurs only when a bucket is full.
• Overall file size need not increase: if the bucket size is 5, reduce the number of physical records by a factor of 5.

Slide 24: Composite Records or buckets
• May be implemented by having each file record be an array of logical records.
• Example: Consider two half-full files. What is the probability of overflow in each?

Slide 25: Composite Records or buckets
• Trade-offs
  » As bucket size increases, the probability of an overflow is greatly reduced.
  » As bucket size increases, the time to read in and scan a bucket increases.
  » Typical bucket sizes range from 5 to 30.
  » The ideal bucket size is often a multiple of the disk sector or track size.
  » What is the extreme case of having the largest possible bucket?

Slide 26: Handling overflows
• Increasing bucket size will reduce, but not eliminate, overflows. They must be dealt with.
• Many algorithms exist for handling overflows, including:
  1. Progressive overflow
  2. Separate overflow area
  3. Chained progressive overflow

Slide 27: Progressive overflow
• Adding a new record
  » If the home address is full, try the next record.
  » If the next address is full, try the next, and so on.
  » If at the end of the file, wrap around to record 0.
  » If the search continues until the home address is reached again, the file is full.

Slide 28: Progressive overflow
• Finding a record
  » If it is in the home bucket, success!
  » Else if the home bucket is not full, the search fails.
  » Else if the home bucket is full, go search the next bucket.
  » Keep searching successive buckets until either the record is found or a non-full bucket is searched.

Slide 29: Progressive overflow
• Finding a record
  » Note that as the file fills, search length will increase.
  » What are some enhancements?
    – Each bucket has a flag indicating whether the bucket has really overflowed.

Slide 30: Progressive overflow
• Deleting a record
  » Can't simply remove the record, or find may not work correctly: an emptied slot would stop the search before it reaches synonyms stored further along.
  » Must mark each record as used, unused, or deleted.
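The add, find and delete rules for progressive overflow can be sketched together. This is a simplified in-memory model under stated assumptions: one record per address (bucket size 1), the home address taken as `key % size`, and class and method names that are illustrative only. The `EMPTY` and `DELETED` sentinels play the role of the unused and deleted marks.

```python
EMPTY = object()    # address never used: a search may stop here
DELETED = object()  # record removed: a search must continue past it


class ProgressiveOverflowFile:
    def __init__(self, size: int):
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key: int):
        """Yield the home address, then successive addresses, wrapping around."""
        home = key % self.size
        for i in range(self.size):
            yield (home + i) % self.size

    def add(self, key: int, value) -> int:
        reusable = None
        for addr in self._probe(key):
            slot = self.slots[addr]
            if slot is EMPTY or slot is DELETED:
                if reusable is None:
                    reusable = addr      # remember the first free address
                if slot is EMPTY:
                    break                # key cannot exist beyond an unused slot
            elif slot[0] == key:
                self.slots[addr] = (key, value)
                return addr
        if reusable is None:
            raise OverflowError("file full")
        self.slots[reusable] = (key, value)
        return reusable

    def find(self, key: int):
        for addr in self._probe(key):
            slot = self.slots[addr]
            if slot is EMPTY:
                return None              # unused slot ends the search
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key: int) -> bool:
        for addr in self._probe(key):
            slot = self.slots[addr]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot[0] == key:
                self.slots[addr] = DELETED   # mark, do not empty
                return True
        return False


f = ProgressiveOverflowFile(7)
for key, name in [(10, "a"), (17, "b"), (24, "c")]:  # all have home address 3
    print(key, "stored at", f.add(key, name))
f.delete(17)
print(f.find(24))  # still found, because address 4 is marked deleted, not unused
```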

Slide 31: Progressive overflow
• Evaluation
  » Simple
  » Robust
  » Searches may get very long
  » Clustering

Slide 32: Progressive overflow
• Alternate version: skip x records each time, where x is relatively prime to the number of records.
• This reduces the problem of record clustering.
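The reason the skip distance x must be relatively prime to the file size can be seen by listing the addresses a search visits. This is a small sketch; the function name `probe_sequence` and the file size of 10 are assumptions for illustration.

```python
from math import gcd


def probe_sequence(home: int, step: int, size: int) -> list[int]:
    """Addresses visited when skipping `step` records each time, with wraparound."""
    return [(home + i * step) % size for i in range(size)]


size = 10
for step in (3, 4):
    visited = set(probe_sequence(0, step, size))
    print(f"step {step}: gcd {gcd(step, size)}, reaches {len(visited)} of {size} addresses")
```

With a step of 3 every address is eventually tried, while a step of 4 shares a factor with 10 and can only ever reach half of the file.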

Slide 33: Separate overflow area
• Buckets contain pointers which may point to a record in a special overflow area.
• Records (or buckets) are linked together in the overflow area as a linked list.
• What happens if there are a lot of synonyms for a few home addresses?
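A separate overflow area can be modeled as a second list whose records are chained by index pointers. The sketch below assumes one record per home address and `key % size` as the hash; the class name and record layout are illustrative. `find` also reports how many records were read, which bears on the question about many synonyms for a few home addresses.

```python
class SeparateOverflowFile:
    def __init__(self, size: int):
        self.size = size
        self.home = [None] * size  # each record: [key, value, next_overflow_index]
        self.overflow = []         # overflow area, records use the same layout

    def add(self, key: int, value) -> None:
        addr = key % self.size
        if self.home[addr] is None:
            self.home[addr] = [key, value, None]
            return
        # Follow the chain to its last record, then link a new overflow record.
        record = self.home[addr]
        while record[2] is not None:
            record = self.overflow[record[2]]
        self.overflow.append([key, value, None])
        record[2] = len(self.overflow) - 1

    def find(self, key: int):
        """Return (value, records_read); value is None if the key is absent."""
        record = self.home[key % self.size]
        reads = 0
        while record is not None:
            reads += 1
            if record[0] == key:
                return record[1], reads
            record = self.overflow[record[2]] if record[2] is not None else None
        return None, reads


f = SeparateOverflowFile(7)
for key, name in [(10, "a"), (17, "b"), (24, "c"), (31, "d")]:  # all home address 3
    f.add(key, name)
print(f.find(10))  # found in the home address after one read
print(f.find(31))  # found only after walking the whole chain
```

When many synonyms share one home address, the chain for that address grows and a lookup degrades toward a sequential scan of the chain.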

Slide 34: Separate overflow area

Slide 35: Chained progressive overflow
• Similar to progressive overflow, but pointers link synonyms together for quicker searches.