Chapter 9 Tables and Information Retrieval. Tables Introduction In chapter 7 we showed that –By use of key comparisons alone, it is impossible to complete.

Chapter 9 Tables and Information Retrieval

Tables Introduction In chapter 7 we showed that –By use of key comparisons alone, it is impossible to complete a search of n items in fewer than lg n comparisons. Faster methods for searching ? –With an index n to locate the record of item n by ordinary table lookup. –The time required for retrieving an entry from tables is O(1).

Basic Concepts Table vs. Array –Table is an ADT, usually to implement tables in contiguous storage (array). Convention –The index defining an entry of a table is enclosed in parentheses, whereas the index of an entry of an array is enclosed in square brackets. –For example: T(1, 2, 3) denotes an entry of the table T. A[1][2][3] denotes an entry of the array A.

Rectangular Tables Rectangular Tables are so important that almost all high- level languages provide 2-dimensional array to store and access them. While, computer memory space is a contiguous sequence. The machine must do some work to convert the location with in a rectangle to a position along a line.

Storage of Rectangular Tables Store entries low by low, such as in C, C++, Pascal. Store entries column by column, such as in Fortran.

Indexing Rectangular Tables Problem –Giving an index (i, j), i = 0,1,…,m-1, j = 1,2,…,n-1, to calculate where in the array the corresponding entry of the table will be stored. Solutions –Index function –Access array

Solution 1: Index function In row-major ordering, entry (i, j) goes to position n*i + j. In column-major ordering ? Loc(i, j) = Loc(0, 0) + n * i + j Loc(i, j) = Loc(0, 0) + m * j + i

Solution 2: Access Array Access array is an auxiliary array to store some references and used to find data stored elsewhere. Index function: Loc(i, j) = Loc(0, 0) + Accessarray[i] + j n * i

Tables of Various Shapes Matrix: –A rectangular table of numbers, some of the positions within it will be 0.

Tables of Various Shapes (cont.)

Key point –We do not need to store all entries of the matrices in the rectangular array and leaving some position vacant. (for 0 entries) Implementation of Matrices with Various Shapes

Contiguous Implementation of a Triangular Table Index function: Loc(i, j) = loc(0,0) + i * (i + 1) /2 + j, or Loc(i, j) = loc(0,0) + Accessarray[i] + j

How about … ? the way to store the matrix and give the corresponding index function –Diagonal Matrix –Tri-diagonal Matrix –Upper Triangular Matrix –Symmetrically Matrix –Symmetrically Triangular Table Exercises 9.3

Jagged Tables –A rectangular array in which each row might have different length. Jagged Table with Access Array To set up the access array, we must construct the jagged table in its nature order, beginning with its first row.

Inverted Table Inverted table is a multiple access arrays, by which we can refer to a single table of records by several different keys at once. Unordered records for ordered access arrays

Formal Definition of Table

Function

In mathematics a function ( 函数 )is defined in terms of two sets and a correspondence from elements of the first set to elements of the second. If f is a function from a set A to a set B, then f assigns to each element of A a unique element of B. The set A is called the domain ( 定义域 ) of f, and the set B is called the codomain ( 值域 ) of f. The subset of B containing just those elements that occur as values of f is called the range ( 取值范围 ) of f.

ADT of Tables

Hashing

An Example of Hash Table

Hash Table Start with an array that holds the hash table. Use a hash function to take a key and map it to some index in the array. –Loc = h(key), that is, loc can be seen as an index. –Since the function might map several different keys to the same index. If the desired record is in the location given by the index, then we are finished; otherwise we must use some method to resolve the collision that may have occurred between two records wanting to go to the same location. To use hashing we must –(a) Find good hash functions. –(b) Determine how to resolve collisions.

Choosing a Hash Function A hash function should be easy and quick to compute. A hash function should achieve an even distribution of the keys that actually occur across the range of indices. A better spread of keys is often obtained by taking the size of the table (the index range) to be a prime number. If the hash function is poor, the performance of hashing can degenerate to that of sequential search.

Ways of Building Hash Function (1) Truncation –Ignore part of the key, and use the remaining part as the index. For example, keys are 8-digital integers, the hash table has 1000 locations 21296876  976 –Fast, but often fail to distribute the keys evenly through the table

Folding –Partition the key into several parts and combine the parts in a convenient way (+ or *). For example, keys are 8-digital integers, the hash table has 1000 locations, 21296876  212 + 968 +76 = 1256  256 –Achieves a better spread of indices than does truncation by itself. Ways of Building Hash Function (2)

Modular arithmetic –Convert the key to an integer, divide by the size of the index range, and take the remainder as the result. –The best choice for modulus is a prime number. For example, the hash table has 11 locations, Loc = Key % 11 For example, the hash table has 1000 locations, Loc = Key % 1009 –The best way, it can achieve a good spread of indices and it ensures that the results is in the proper range. Ways of Building Hash Function (3)

Hash function is H(key) = key mod 10, The record set is: No. Name Class … 5 Zhang c1 12 Liu c2 4 Wang c1 10 Li c3 … Then the hash table will be: 10 Li c3 12 Liu c2 4 Wang c1 5 Zhang c3 0 1 2 3 4 5 6 7 loc key Student No. example ：

Hash function is : C++ Example

Collision Resolutions 1)with Open Addressing ( 开地址法 ) 2)with Chaining ( 链接法 ) 3)with Overflow Table ( 溢出表 )

Collision Resolution with Open Addressing (1) Linear Probing ( 线性探测法 ) –Linear probing starts with the hash address and searches sequentially for the target key or an empty position. The array should be considered circular, so that when the last location is reached, the search proceeds to the first location of the array. –Clustering Problem Records start to appear in long strings of adjacent positions with gaps between the strings. This leads to lower hash performance and the distribution of keys become progressively more unbalanced.

The probability the b will be filled is 2/n The probability the d will be filled is 4/n The probability the e will be filled is 5/n Clustering Problem of Linear Probing

Collision Resolution with Open Addressing (2) Increment Functions (Rehashing) –Use a new hash function to obtain the next position to consider if the position is filled. –Quadratic Probing ( 二次探测法 ) h = h + i 2, i = 1, 2, … Quadratic probing can reduce clustering, but usually it does not probe all locations in the table. The maximum number of probes is: (hash_size+1)/2 –Random Probing ( 随机探测法 ) h = h + Random_number Random probing is e xcellent, but slow –Key-Dependent Increments h = h + one character in the key

For example ， h(key) = key mod 11, the size of hash table is 11. 3 records {17, 60, 29} have already put into the hash table like this: If the fourth record with key 38 will insert, The location should be 38 mod 11=5. Now collision occurs. Where to insert it ? 601729 0123 456789 10 60172938 0123 456789 10 For collision resolution with linear probing. 38601729 0123 456789 10 For collision resolution with random probing. Suppose the random number is 9. 38601729 0123 456789 10 For collision resolution with quadratic probing.

Collision Resolution by Chaining Take the hash table itself as an array of linked list. The linked lists from the hash table are called chains. A chained hash table

Giving a list {19,14,23,01,68,20,84,27,55,11,10,79}. The hash function is H(key) = key mod 13. Using chaining for collision resolution. Then the corresponding hash table will be: 0 1 2 3 4 5 6 7 8 9 10 11 12 01142779 5568 1984 1023 20 11 Count totally how many times of collision happened? Exercise:

Characteristics of Chained Hash Table Advantages –If the records are large, a chained hash table can save space. –Collision resolution with chaining is simple, clustering is no problem. –The hash table itself can be smaller than the number of records; overflow is no problem. –Deletion is quick and easy in a chained hash table. Disadvantages –If the records are very small and the table nearly full, chaining may take more space.

Idea –Simply put all entries that collide with occupied location into a overflow table. –Special search method (sequence search, binary search, etc.) could be used for overflow table. Collision Resolution with Overflow Table

ADT of Hash Table const int hash_size = 997; // a prime number of appropriate size class Hash_table { public: Hash_table( ); void clear( ); Error_code insert(const Record &new entry); Error_code retrieve(const Key &target, Record &found) const; private: Record table[hash_size]; };

E6. Exercise

Analysis of Hashing A probe( 探测 ) is one comparison of a key with the target. The load factor ( 装填因子 ) of the table isλ= n / t, where n positions are occupied out of a total of t positions in the table.

Analysis of Hashing

Conclusions: Comparison of Methods We have studied four principal methods of information retrieval –Sequential search –Binary search –Table lookup –Hash-table retrieval The first two for lists and the second two for tables. Often we can choose either lists or tables for our data structures.

Conclusions: Comparison of Methods Sequential search is O(n) –Sequential search is the most flexible method. The data may be stored in any order, with either contiguous or linked representation. Binary search is O(log n) –Binary search demands more, but is faster: The keys must be in order, and the data must be in contiguous storage. Table lookup is O(1) –Ordinary lookup in contiguous tables is best, both in speed and convenience, unless a list is preferred, or the set of keys is sparse, or insertions or deletions are frequent. Hash-table retrieval is O(1) –Hashing requires a peculiar ordering of the keys to retrieval from the hash table, but generally useless for any other purpose.

Chapter 9 Tables and Information Retrieval. Tables Introduction In chapter 7 we showed that –By use of key comparisons alone, it is impossible to complete.

Similar presentations

Presentation on theme: "Chapter 9 Tables and Information Retrieval. Tables Introduction In chapter 7 we showed that –By use of key comparisons alone, it is impossible to complete."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Chapter 9 Tables and Information Retrieval. Tables Introduction In chapter 7 we showed that –By use of key comparisons alone, it is impossible to complete.

Similar presentations

Presentation on theme: "Chapter 9 Tables and Information Retrieval. Tables Introduction In chapter 7 we showed that –By use of key comparisons alone, it is impossible to complete."— Presentation transcript:

Similar presentations

About project

Feedback