Published by Madeleine Taylor. Modified over 9 years ago.
1
11 Single level index Single-level index: file of entries –Each entry points to either the record in the data file or the block which contains the record –Entries are ordered by the value of the indexing field Searching a single-level index: –Carry out binary search in the index file, then? Then follow the pointer Why single-level? –Will see other types of indexes later –Including multi-level indexes
2
22 Types of Single-Level Indexes: Primary Index Defined on data file ordered on a key field –We will think of as Primary Key –Indexing field will also be ordered by same key One index entry for each block in data file –the index entry has the key field value for the first record in the block called the block anchor Dense or sparse ? Sparse : includes an entry for each disk block –Not for every record
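A sparse primary index lookup can be sketched in a few lines of Python. This is only an illustration of the idea on the slide: the tiny in-memory "data file", the names `data_blocks`, `index`, and `lookup`, and the sample records are all assumptions made for the example, not part of the original material.

```python
from bisect import bisect_right

# Hypothetical data file: a list of blocks, each holding records sorted by key.
data_blocks = [
    [(2, "Aaron"), (5, "Abbot"), (8, "Acosta")],
    [(12, "Adams"), (15, "Akers"), (19, "Alexander")],
    [(23, "Allen"), (27, "Anders"), (31, "Arnold")],
]

# Sparse primary index: one (block anchor, block number) entry per data block.
index = [(block[0][0], block_no) for block_no, block in enumerate(data_blocks)]
anchors = [anchor for anchor, _ in index]


def lookup(key):
    """Binary-search the index, then follow the pointer to a single data block."""
    pos = bisect_right(anchors, key) - 1  # last index entry whose anchor <= key
    if pos < 0:
        return None  # key is smaller than every block anchor
    _, block_no = index[pos]
    for k, value in data_blocks[block_no]:  # search inside the one fetched block
        if k == key:
            return value
    return None
```

For instance, `lookup(15)` consults the index, lands on the second block (anchor 12), and scans only that block.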
3
33 [EN] FIGURE 18.1 Primary index on the ordering key field of the file shown in Figure 13.7.
4
44 Types of Single-Level Indexes: Primary Index Advantage of having primary index if file already sorted by that field ? Index file smaller, binary search on that faster –Why is index file smaller ? Fewer records (why?), smaller records (why?) If index file is much smaller, could have another big advantage –May be possible to keep (all or most of) index file in RAM. Advantage ? Fewer disk accesses
5
55 [EN] Eg 1: Primary Index
Record size R = 100 bytes, block size B = 1024 bytes, r = 30000 records
For data file, blocking factor Bfr = # records in a block = ⌊B / R⌋ = ⌊1024 / 100⌋ = 10
Number of data file blocks b = ⌈r / Bfr⌉ = ⌈30000 / 10⌉ = 3000 blocks
If no index, how many block accesses for search by ordering field?
–Binary search needs ⌈log2 b⌉ + 1 = ⌈log2 3000⌉ + 1 = 13 block accesses
Indexing field 9 bytes, block pointer 6 bytes. If sparse primary index (on disk) like Figure 18.1, how many block accesses?
–Index entry size = 9 + 6 = 15 bytes
–For index file, Bfr = # entries in a block = ⌊1024 / 15⌋ = 68
–Total # index entries = # data blocks = 3000
–# index file blocks = ⌈3000 / 68⌉ = 45 blocks
–Binary search on the index: ⌈log2 45⌉ + 1 = 7 block accesses
–Plus need one more. Why? To get the data block
–Total # block accesses = 7 + 1 = 8
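The arithmetic in this example can be checked with a short Python sketch. It assumes that "log" on the slide means the base-2 logarithm rounded up, which is the reading that reproduces the slide's numbers; the helper names (`ceil_div`, `binary_search_accesses`) are made up for the illustration.

```python
from math import ceil, log2


def ceil_div(a, b):
    """Integer division rounded up (number of blocks needed)."""
    return -(-a // b)


def binary_search_accesses(n_blocks):
    """Block accesses under the slide's convention: ceil(log2 b) + 1."""
    return ceil(log2(n_blocks)) + 1


B, R, r = 1024, 100, 30000           # block size, record size, record count

data_bfr = B // R                    # records per data block
data_blocks = ceil_div(r, data_bfr)  # blocks in the data file
no_index = binary_search_accesses(data_blocks)

entry_size = 9 + 6                   # indexing field + block pointer
index_bfr = B // entry_size          # index entries per block
index_blocks = ceil_div(data_blocks, index_bfr)  # sparse: one entry per data block
with_index = binary_search_accesses(index_blocks) + 1  # +1 to fetch the data block

print(no_index, with_index)
```

Changing `B`, `R`, or the entry size shows how quickly the index shrinks relative to the data file.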
6
66 Types of Single-Level Indexes: Clustering Index Motivation: suppose we repeatedly wanted to ask some question about employees according to which department they work for. Eg: SELECT LNAME, FNAME FROM EMP WHERE DNUMBER = 3; How to do ? What would we like here : an index according to DNUMBER, even though non-key Also important if looking for range. Eg: (DNUMBER >= 2) AND (DNUMBER <= 7)
7
77 [EN] FIGURE 18.2 Clustering index on the DEPT NUMBER ordering nonkey field of EMP file.
8
88 Types of Single-Level Indexes: Clustering Index Data file ordered on non-key field called clustering field –Clustering field does not have unique values –Index built on same clustering field Includes one index entry for each distinct value of the field. Index entry points to the first data block that contains records with that field value. Terminology not standardized: clustering index can mean file sorted by clustering field –Could include primary index as special case
9
99 Types of Single-Level Indexes: Clustering Index Dense or sparse ? Sparse Insertion : similar problem as before. –Eg: if block full, has 7, 7, 8, 9 want to insert 7 How to deal with this ? Have an entire block for each value of clustering field –Insertion and Deletion now straightforward –Could have a lot of almost empty blocks
10
10 [EN] FIGURE 18.3 Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.
11
11 Types of Single-Level Indexes: Secondary Index Motivation: suppose we want to access employees by both ssn and by name Assume EMP file is sorted by ssn and we have a primary index with ssn. How to do efficient access with name ? Build another index by name Secondary index: file not sorted by this field –Also called non-clustering index..
12
12 Secondary Indices Example [SKS] One type of secondary index –Index record points to a bucket that contains pointers to all the actual records with that particular search-key value. Secondary index on balance field of account
13
13 Types of Single-Level Indexes: Secondary Index A secondary index provides a secondary means of accessing file for which some primary access already exists. Can have multiple secondary indexes Secondary index may be on a field which is a –Secondary key : has unique value in every record –Non-key with duplicate values.
14
14 Types of Single-Level Indexes: Secondary Index with Secondary Key The index is an ordered file with two fields. – The first field is of the same data type as some nonordering field of the data file that is an indexing field – The second field is either a block pointer or a record pointer. –If block pointer, have to search block Dense or sparse ? Dense
15
15 [EN] FIGURE 18.4 A dense secondary index (with record pointers) on a nonordering key field of a file.
16
16 [EN] Eg 2: Secondary Index
Record size R = 100 bytes, block size B = 1024 bytes, r = 30000 records
For data file, Bfr = # records in a block = ⌊B / R⌋ = ⌊1024 / 100⌋ = 10
Number of data file blocks b = ⌈r / Bfr⌉ = ⌈30000 / 10⌉ = 3000 blocks
If no index, how many block accesses for search by non-ordering field?
–Linear search needs on average b / 2 = 3000 / 2 = 1500 block accesses
Indexing field 9 bytes, block pointer 6 bytes. If dense secondary index (on disk) like Figure 18.4, # block accesses?
–Index entry size = 9 + 6 = 15 bytes
–For index file, Bfr = # entries in a block = ⌊1024 / 15⌋ = 68
–Total # index entries = # records in data file = 30000
–# index file blocks = ⌈30000 / 68⌉ = 442 blocks
–Binary search: ⌈log2 442⌉ + 1, plus 1 for getting the data block = 9 + 1 + 1 = 11 block accesses
Compare: gone from 1500 to 11
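As a cross-check of the dense secondary index numbers, here is a small Python sketch. It assumes the same reading of "log" as a base-2 logarithm rounded up, and the function name `dense_index_cost` is invented for the example.

```python
from math import ceil, log2


def dense_index_cost(n_records, block_size, entry_size):
    """Return (index blocks, block accesses) for a dense single-level index."""
    entries_per_block = block_size // entry_size
    index_blocks = -(-n_records // entries_per_block)  # one entry per record
    # Slide's convention for binary search, plus one access for the data block.
    accesses = ceil(log2(index_blocks)) + 1 + 1
    return index_blocks, accesses


data_blocks = -(-30000 // (1024 // 100))
linear_scan = data_blocks // 2                 # average cost without an index
index_blocks, indexed = dense_index_cost(30000, 1024, 9 + 6)
print(linear_scan, index_blocks, indexed)
```

The dense index has ten times as many entries as the sparse primary index in Eg 1, yet the search cost only grows by a few block accesses because the cost is logarithmic in the number of index blocks.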
17
17 [EN]FIGURE 18.5 A secondary index (with record pointers) on a nonkey field implemented using one level of indirection so that index entries are of fixed length and have unique field values.
18
18 Types of Single-Level Indexes: Secondary Index with Non-key Use extra level of indirection –Pointer points to block of record pointers Upside –efficiently retrieve all records with specific value –Index file is small Downside –May have to do another disk access to get block of record pointers
19
19 Query Optimization Two Egs of how optimizer might use indexes Eg 1: Get last names of employees who work on a project. –SQL query –2 approaches –Which index available Eg 2: Get last names of employees who make more than 60k and who are in department 5. –SQL query –3 approaches –Which index available
20
20 [EN] Table 18.1 Types of Indexes Based on Properties of Indexing Field
21
21 Hashing Internal Hashing: when the data is being kept in RAM External Hashing: when the data is being kept on disk –This is what we are interested in –But will first do a quick review of internal hashing Since internal hashing easier to understand
22
22 Mod review a mod b = c : shorthand for saying that when we divide a by b, the remainder is c –7 mod 5 = 2, 19 mod 4 = 3 Also written: a mod b ≡ c, or a = c mod b, or a ≡ c mod b –7 = 2 mod 5, 19 = 3 mod 4
23
23 Direct Address Tables Eg: We want to keep information about students. Suppose we have 10 students, and we want to look up their names and grades etc. Operations: –Insert a student –Search for a student
25
25 Direct Address Tables Suppose students have id number between 0 and 9. Direct Address Table: info stored in table (array) with 10 entries. Eg: student 6 goes to table[6], student 4 goes to table[4]. Search for student 6. Slow/Fast ? Fast: just an array index calculation What if: 9 digit ssn ?
26
26 Idea behind Hashing Can we use direct address tables now? No, still want fast searches: hash tables. Want a way of getting from ssn to index in table. “Random” mapping ? No – because we will need to search for this element after we have inserted it –So the way for carrying out this search has to be exactly the same as for inserting it. Hashing: way of transforming key into array index. Hash Function: maps key to an index. Eg: Hash (SSN) = SSN % 10 – 123-45-6789 goes to 9 – 122-45-6566 goes to 6 –Searching for 123-45-6789. Where will we look? Looks straightforward. Possible problem?
27
27 [CLR] Example : Collisions
28
28 Collisions 123-45-6789 goes to 9 111-44-9999 goes to 9 Collision: When two different keys yield the same index. Two issues with collisions: –Dealing with collisions –Minimizing collisions : good hash functions, won’t study
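The hash function from the previous slides, and the collision it produces, can be reproduced directly. This is a minimal sketch; the function name `hash_ssn` and the choice to strip the dashes before taking the remainder are assumptions for the illustration.

```python
def hash_ssn(ssn, table_size=10):
    """Map an SSN string such as '123-45-6789' to a table index via SSN % table_size."""
    return int(ssn.replace("-", "")) % table_size


slots = {ssn: hash_ssn(ssn) for ssn in ("123-45-6789", "122-45-6566", "111-44-9999")}
print(slots)

# Two different keys with the same index: a collision.
collision = hash_ssn("123-45-6789") == hash_ssn("111-44-9999")
```

Because the same function is used for insertion and search, looking up 123-45-6789 always goes to slot 9, which is also where 111-44-9999 lands.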
29
29 Collision Resolution Chaining: keep all the entries which map onto the same hash value in a linked list Open addressing: put the entry in another available slot
30
30 [CLR] Example : Chaining
31
31 Chaining Idea: T[i] pointer to linked list which contains all elts whose keys hash to i. Eg: m=7, T[0..6]. a,b,c,d,e,f arrive in order. h(a) = 5, h(b) = 5, h(c) = 1, h(d) = 6, h(e) = 5, h(f) = 4. Now search for e Now search for z, h(z) = 0.
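The chaining example can be traced with a short Python sketch. The hash values are the ones given on the slide; using Python lists as the chains and appending new keys at the tail are simplifying assumptions (a textbook linked list would often insert at the head).

```python
m = 7
# Hash values as given on the slide (h is a lookup table here, not a real hash function).
h = {"a": 5, "b": 5, "c": 1, "d": 6, "e": 5, "f": 4, "z": 0}

table = [[] for _ in range(m)]  # T[0..6], each slot holds a chain


def insert(key):
    table[h[key]].append(key)


def search(key):
    """Return (found, number of keys compared) by walking only the chain T[h(key)]."""
    comparisons = 0
    for stored in table[h[key]]:
        comparisons += 1
        if stored == key:
            return True, comparisons
    return False, comparisons


for key in "abcdef":
    insert(key)

print(table)
print(search("e"), search("z"))
```

Searching for e walks the chain in slot 5 past a and b; searching for z finds an empty chain in slot 0 and stops immediately.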
32
32 Open Addressing No linked lists, all elts stored directly in T. If collision: probe: look elsewhere in T. Wherever we look to insert, have to search in same way. There are a number of different ways of doing open addressing –we look at linear probing.
33
33 Linear Probing Idea: If current slot is full, look at next one. Eg: m=7, T[0..6]. a,b,c,d,e,f arrive in order. h(a) = 5, h(b) = 5, h(c) = 1, h(d) = 6, h(e) = 5, h(f) = 4. Now search for e Now search for z, h(z) = 0.
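The same sequence of insertions under linear probing can be traced as follows. This is a sketch under the assumption that probing wraps around from the last slot to slot 0; the hash values are those given on the slide.

```python
m = 7
# Hash values as given on the slide.
h = {"a": 5, "b": 5, "c": 1, "d": 6, "e": 5, "f": 4, "z": 0}

table = [None] * m  # T[0..6], elements stored directly in the table


def insert(key):
    """Place key in the first free slot at or after h(key), wrapping around."""
    for step in range(m):
        slot = (h[key] + step) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def search(key):
    """Probe the same sequence as insert; an empty slot means the key is absent."""
    for step in range(m):
        slot = (h[key] + step) % m
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
    return None


for key in "abcdef":
    insert(key)

print(table)
print(search("e"), search("z"))
```

Note how e, which hashes to 5, ends up in slot 2 after probing 5, 6, 0, and 1, and how the unsuccessful search for z has to pass d, c, and e before it reaches an empty slot.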
34
34 External Static Hashing External Hashing : Hashing for disk files static hashing or dynamic hashing static hashing : The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1,..., bucket M-1 –Typically, a bucket corresponds to one disk block. The record with hash key value K is stored in bucket i, where i=h(K) Hash function h is a function from set of all search-key values to set of all bucket addresses.
35
35 Static Hashing [EN] Figure 17.9
36
36 Static Hashing Eg [SKS] Hash file organization of account file, using branch_name as hashing field There are 10 buckets, The binary representation of the ith character is assumed to be the integer i. The hash function returns the sum of the binary representations of the characters modulo 10 –Eg h(Perryridge) = 5 h(Round Hill) = 3 h(Brighton) = 3
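The hash function described here is easy to check in Python. The sketch assumes that letters are numbered 1 to 26 regardless of case and that spaces are ignored, which is the reading that reproduces the three values on the slide.

```python
def h(branch_name, n_buckets=10):
    """Sum of letter positions (a=1, ..., z=26) modulo the number of buckets."""
    total = sum(ord(ch) - ord("a") + 1 for ch in branch_name.lower() if ch.isalpha())
    return total % n_buckets


for name in ("Perryridge", "Round Hill", "Brighton"):
    print(name, h(name))
```

Round Hill and Brighton land in the same bucket, so both sets of records share bucket 3.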
37
37 Static Hashing Eg [SKS] Hash file organization of account file, using branch_name as key (see previous slide for details).
38
38 Static Hashing Hash function is used to locate records for access, insertion as well as deletion. Records with different search-key values may be mapped to the same bucket –What does this imply when looking for a record? Entire bucket has to be searched to locate record –But done in RAM, so not a problem Search is very efficient on the hash key How to deal with collisions –What is a collision now ?
39
39 Bucket Overflows Collisions occur when a new record hashes to a bucket that is already full –If it is not full, not a problem When would the bucket overflow start happening on a large scale ? Insufficient buckets Skew in distribution of records. Why ? Lousy hash function (or unlucky !) Although the probability of bucket overflow can be reduced, it cannot be eliminated;
40
40 Handling of Bucket Overflows How to handle bucket overflow ? Two ways: Overflow file kept for storing such records –All overflow records kept in same block –Even if coming from different buckets –See [EN] Eg. Overflow chaining –The overflow blocks of a given bucket are chained together in a linked list. –See [SKS] Eg
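Overflow chaining can be sketched as a small Python class. Everything here is an assumption made for illustration: integer keys, `key % n_buckets` as the hash function, Python lists standing in for disk blocks, and the class name `StaticHashFile`.

```python
class StaticHashFile:
    """Static hashing with a fixed number of buckets and per-bucket overflow chains."""

    def __init__(self, n_buckets, bucket_capacity):
        self.n_buckets = n_buckets
        self.capacity = bucket_capacity
        # Each bucket is a chain of blocks; the first block is the primary block.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _chain(self, key):
        return self.buckets[key % self.n_buckets]

    def insert(self, key, record):
        chain = self._chain(key)
        if len(chain[-1]) == self.capacity:
            chain.append([])  # last block is full: link a new overflow block
        chain[-1].append((key, record))

    def search(self, key):
        """Return (record or None, number of blocks read)."""
        chain = self._chain(key)
        for blocks_read, block in enumerate(chain, start=1):
            for k, record in block:
                if k == key:
                    return record, blocks_read
        return None, len(chain)


f = StaticHashFile(n_buckets=4, bucket_capacity=2)
for key in (1, 5, 9):  # all three keys hash to bucket 1
    f.insert(key, f"rec{key}")
print(f.search(1), f.search(9), f.search(13))
```

The third record does not fit in the primary block of bucket 1, so finding it costs two block reads instead of one; this is the degradation that motivates keeping the file only partly full.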
41
41 Overflow File Eg [EN] Figure 17.10
42
42 Overflow Chaining Eg [SKS] Advantage of doing it this way? Faster search. Disadvantage ? Wasted space
43
43 Static Hashing To reduce overflow records, a hash file is typically kept 70-80% full. The hash function h should distribute the records uniformly among the buckets. Why ? Otherwise, search time will be increased because many overflow records will exist. Ordered access on hash key efficient ? No: inefficient (requires sorting the records) –This is true of any hashing scheme –What about range queries : efficient ? Range queries also inefficient
44
44 Deficiencies of Static Hashing Databases grow or shrink with time. In static hashing, fixed # buckets. If # buckets too small ? If # buckets too small, and file grows, performance will degrade due to too many overflows. If # buckets too large ? Significant amount of space will be wasted initially (and buckets will be underfull). –Similar problem if database shrinks, again space will be wasted. If too much overflow or underflow, solution ?
45
45 Deficiencies of Static Hashing One solution: periodic re-organization of the file with a new hash function. Problem ? Large overhead, disrupts normal operations Different solution: allow the number of buckets to be modified dynamically: dynamic hashing or extendible hashing –Allow the dynamic growth and shrinking of the number of file records. –If overflow, split –If underflow, merge –We won’t cover in detail, [EN] does
46
46 Multi-Level Indexes Suppose index too big to be in RAM, is on disk. Consequences ? Search expensive : log (#blocks). To improve ? Treat main index kept on disk as a sorted file –build a sparse index for the main index –first level (inner index )– the main (“primary”) index file –second level (outer index ) – sparse index of the primary index sorted file If even outer index too large to fit in RAM ? Build another index on outer index – … and so on, until all entries of top level fit in one block
47
47 Multi-Level Indexes [SKS]
48
48 Multi-Level Indexes - Eg
How does this help? Look at an example:
–Suppose we have 2 levels, with the first level being dense (eg: secondary index), with bfr = 20
–Suppose 400 data records
–Suppose 2nd level is in RAM
–How many disk accesses?
400 index records, bfr 20, so # blocks in 1st level = 400/20 = 20.
If only 1st level: ⌈log2 20⌉ + 1 = 6 for the index, 6 + 1 = 7 including the data block
With 2 levels (if top level in RAM)? 2
49
49 [EN] FIGURE 18.6 A two-level primary index resembling ISAM (Indexed Sequential Access Method) organization. ISAM: Originally developed by IBM. Now used in MySQL –MyISAM
50
50 [EN] Eg 3: Multi-level indexes
Record size R = 100 bytes, block size B = 1024 bytes, r = 30000 records
For data file, Bfr = # records in a block = ⌊B / R⌋ = ⌊1024 / 100⌋ = 10
Number of data file blocks b = ⌈r / Bfr⌉ = ⌈30000 / 10⌉ = 3000 blocks
We saw if dense secondary index (on disk), # block accesses = 11
Indexing field 9 bytes, block pointer 6 bytes, index entry size = 15 bytes
If multi-level index like Figure 18.6, # block accesses?
–For index file, Bfr = # entries in a block = ⌊1024 / 15⌋ = 68
–Total # first level index entries = # records in data file = 30000
–# first level index file blocks = ⌈30000 / 68⌉ = 442 blocks
–# second level index file blocks = ⌈442 / 68⌉ = 7 blocks
–# third level index file blocks = ⌈7 / 68⌉ = 1 block. Top level.
Total # block accesses assuming everything on disk = 1 + 1 + 1 + 1 (for data block) = 4
Compare: gone from 11 to 4
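The level-by-level calculation generalizes to a short loop: keep building an index on the previous level until a level fits in one block. The following Python sketch illustrates this; the function name `index_levels` is an assumption for the example.

```python
def index_levels(n_entries, fan_out):
    """Blocks at each index level, from the first level up to the single top block."""
    levels = []
    while True:
        blocks = -(-n_entries // fan_out)  # ceiling division
        levels.append(blocks)
        if blocks == 1:
            return levels
        n_entries = blocks  # next level has one entry per block of this level


fan_out = 1024 // (9 + 6)            # index entries per block
levels = index_levels(30000, fan_out)
accesses = len(levels) + 1           # one block per level, plus the data block
print(levels, accesses)
```

With 400 index entries and a blocking factor of 20, `index_levels(400, 20)` gives two levels (20 blocks, then 1); if the top level is held in RAM, a search reads one first-level block and one data block.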
51
51 Multi-Level Indexes Multi-level index can be for any type of first-level index: primary, secondary, clustering. Multi-level index is a form of search tree. When records inserted/deleted expensive – why ? Every level of index is a sorted file. –Sorted file has to be updated –And so does every index on the file Performance degrades as file grows – why ? Potentially many overflow blocks can be created. –Periodic reorganization of entire file is required. –But can be expensive
52
52 Disadvantages of indexed sorted files Sequential scan using primary index (file sorted by indexing field) efficient – why ? Sequential scan using secondary index - fast? –Eg: EMPLOYEE file sorted by ssn –Secondary index by last name –Want to write out in alphabetical order. Expensive –Each record access may fetch a new block from disk –Block fetch requires about 5 to 10 milliseconds, versus about 100 nanoseconds for memory access Solution: B-trees, B+ trees, hashing indexes
53
53 Indexes: B-Trees, B+ Trees Problems of indexed-sequential files –As file changes, expensive to maintain index B-tree, B+ tree indexes solve this problem –When changes are made, the tree automatically reorganizes itself with small, local changes B-tree, B+ tree indices are an alternative to indexed-sequential files We will briefly look at B-trees, then B+ trees –A kind of a multi-level index –Studied in more detail in CSCI 6632
54
54 [CLR] example of a B-tree
55
55 [EN] FIGURE 18.10 B-tree structure and example
56
56 Indexes: B-Trees Can keep entire records in trees –Entire file kept as a B-tree Alternative: Only keys with links (to rest of the record) in tree. –Full records kept elsewhere, maybe in unsorted file –Advantage of doing it like this? What has to be kept in B-tree is less –Advantage ? Fit in more per node, shallower depth
57
57 Indexes: B-Trees Advantage compared to binary search trees? Fewer disk accesses than search trees : why ? Related info in one block in B-Tree B-Trees: each node corresponds to disk block Insertion and deletion efficient ? Each node is kept between half-full and completely full –Because of this flexibility, relatively easy to do insertions and deletions Now look at B+ Trees
58
58 B+ Tree Indexes [RG] Leaf pages contain data entries, and are chained (prev & next) Non-leaf pages have index entries; only used to direct searches –Index entry layout in a non-leaf page: P0 K1 P1 K2 P2 ... Km Pm [Figure: non-leaf pages above the leaf pages; leaf pages are sorted by search key]
59
59 [EN] FIGURE 18.11 The nodes of a B+tree..
60
60 Example B+ Tree [RG] [Figure: example B+ tree with root key 17; entries <= 17 go to the left subtree, entries > 17 to the right. Note how data entries in leaf level are sorted] Find 7 ? 29 ? All > 15 and < 30 ? Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes. –And change sometimes bubbles up the tree
61
61 B-tree and B+tree Differences In both can do quickly : –Searches, insertions and deletions to indexes Also true of leaf nodes in B+tree B-tree: ptrs to data records at all levels of the tree B+tree: ptrs to data records only at leaf-level nodes –internal nodes only for navigation B+tree can have fewer levels than B-tree –B-tree index is dense –B+tree index is sparse, linked list is dense B+tree can also do fast sequential access : how? Linked list at bottom level is in sequential order B+ tree : greater complexity: maintaining leaf nodes
62
62 Multiple-Key Access/Indexes Use multiple indices for certain types of queries. [EN Eg:] : Emp who are 59 years old and are in dept 4 select ssn from Emp where dno = 4 and age = 59 Possible strategies for processing query using indices on single attributes ? –Depends on which indices are available What indices would be helpful ?
63
63 Multiple-Key Access/Indexes Suppose 2 indices: dno, age. How to do ? Method 1: Use index on dno to find Emp with dno 4 –then test age = 59 Method 2: Use index on age to find Emp with age 59 – then test dno = 4 Method 3: Use index on dno to find records of Emp with dno 4. Use index on age to find records of Emp with age 59. –Now what ? Take intersection of both sets of records.
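Method 3 amounts to a set intersection over record pointers. Here is a minimal Python sketch; the sample `employees` rows, the use of list positions as record pointers, and the helper `build_index` are all assumptions made for the illustration.

```python
# Hypothetical EMP records; list positions stand in for record pointers.
employees = [
    {"ssn": "111", "dno": 4, "age": 59},
    {"ssn": "222", "dno": 4, "age": 30},
    {"ssn": "333", "dno": 5, "age": 59},
    {"ssn": "444", "dno": 4, "age": 59},
]


def build_index(rows, field):
    """Single-attribute index: field value -> set of record pointers."""
    index = {}
    for pointer, row in enumerate(rows):
        index.setdefault(row[field], set()).add(pointer)
    return index


dno_index = build_index(employees, "dno")
age_index = build_index(employees, "age")

# Method 3: look up each index separately, then intersect the pointer sets.
pointers = dno_index.get(4, set()) & age_index.get(59, set())
result = sorted(employees[p]["ssn"] for p in pointers)
print(result)
```

Only the records whose pointers survive the intersection are fetched from the data file, whereas Methods 1 and 2 fetch every record matching one condition and test the other afterwards.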
64
64 Ordered Indices on Multiple Attributes Composite search keys are search keys containing more than one attribute –Eg: searching for combination of dno, age Lexicographic ordering: (a1, a2) < (b1, b2) if either a1 < b1, or a1 = b1 and a2 < b2 –Eg: (4, 40) < (5, 20) –Eg: (4, 40) < (4, 45) Can build a single index on multiple attributes
65
65 Ordered Indices on Multiple Attributes Suppose we have an index on combined search-key (dno, age). Consider the following: –where dno = 4 and age = 59 The index on (dno, age) can be used to fetch only records that satisfy both conditions. More efficient than using separate indices ? –Eg: use separate indices on dno and on age, and take intersection Using separate indices is less efficient –we may fetch many records that satisfy only one of the conditions.
66
66 Ordered Indices on Multiple Attributes Suppose we have an index on combined search-key (dno, age). Is the following efficiently handled ? –where dno = 4 and age < 59 Yes: because of lexicographic ordering Is the following efficiently handled ? –where dno < 6 and age = 59 Not quite so efficient –may fetch many records that satisfy the first but not the second condition
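The difference between the two queries shows up clearly when the composite index is modeled as a sorted list of tuples, since Python compares tuples lexicographically. This is only a sketch: the sample entries, the `ssn` placeholder values, and the function names are assumptions for the illustration.

```python
from bisect import bisect_left

# Composite index on (dno, age): entries (dno, age, ssn) kept in lexicographic order.
index = sorted([
    (4, 40, "s1"), (4, 59, "s2"), (4, 45, "s3"), (5, 20, "s4"),
    (5, 59, "s5"), (6, 59, "s6"), (3, 59, "s7"),
])


def dno_eq_age_lt(dno, age):
    """dno = X and age < Y: the matches form one contiguous run of the index."""
    lo = bisect_left(index, (dno,))      # first entry with this dno
    hi = bisect_left(index, (dno, age))  # first entry with this dno and age >= Y
    return index[lo:hi]


def dno_lt_age_eq(dno, age):
    """dno < X and age = Y: every entry with dno < X must be scanned and filtered."""
    scanned = index[:bisect_left(index, (dno,))]
    matches = [entry for entry in scanned if entry[1] == age]
    return matches, len(scanned)


print(dno_eq_age_lt(4, 59))
print(dno_lt_age_eq(6, 59))
```

The first query touches exactly the entries it returns; the second scans six entries to return three, which is the inefficiency the slide describes.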
67
67 Grid Files: [EN] Figure 18.14 Do well in terms of access time. Downside ? Space for grid array, maintenance when file changes Another alternative for composite search
68
68 Hash Indices [SKS] Can use hashing for indices: – A hash index organizes the search keys, with their associated record pointers, into a hash file structure. If the file itself is organized using hashing –a separate hash index on it using the same search-key is unnecessary. Why ? Sometimes, the term hash index is used to refer to both secondary index structures and hash organized files.
69
69 Example of Hash Index [SKS] Data file ordered by branch name Secondary index on Acct#
70
70 Ordered Indexing vs Hashing Which works better depends on particular situation. Relative frequency of insertions and deletions Average access time vs worst-case access time? Expected type of queries: which type of query will each be good at ? Hashing is generally better at retrieving records having a specified value of the key. Ordered indices better at range queries
71
71 Cost/Benefit of Indexes Indexes can have large benefits: –B-tree can search 1M rows of indexed data with < 20 lookups –Hashed index (on avg) about 1 lookup Why not have lots of indexes all the time ? Cost of maintaining the index on updates. –Introduction to Oracle 10g: Perry and Post :“According to Oracle performance tuning documentation, each index requires about 3 times the resources as the original DML.” –“So adding 3 indexes to a table will slow down an INSERT command by about 10 times.” Balance faster retrieval vs slower updates
72
72 Data Warehousing Systems Used for analysis, not transaction processing Since no transaction processing, consequence ? No updates. Impact of this ? Data is denormalized and stored together and materialized views are used –Advantage of denormalized ? Data in fewer tables –Fewer joins. Why do we normalize? –Does that logic apply here ? No updates so no modification anomalies Advantage of materialized views?
73
73 Data Warehousing Systems Don’t have to go back and recalculate views every time a view is referred to –What is the problem with materialized views? Have to change on updates –Does it apply here ? Can create lots of indexes (indexes on most columns) –What is the problem with having lots of indexes? Cost of maintaining lots of indexes. Does this apply here ? No updates
74
74 Index Definition in SQL Index statements part of early versions of SQL – but not part of SQL standard today. Why ? Physical access path, not data specification –Responsibility of DBMS –Not of person writing SQL queries –Commercial DBMS have index specifications End users may not be aware of indices –SQL queries remain the same –Indices can be created/destroyed without affecting correctness of query But efficiency is affected
75
75 Indexes supported in DBMS Theoretically, DBMS not even required to support indices In practice, every commercial DBMS supports some form of indexing. Why ? –Some ops inefficient without indices. Which ones? Joins Range Queries Checking uniqueness –For keys –When DISTINCT ( no duplicates) specified Referential Integrity
76
76 Index Definition in SQL Many DBMS automatically create index on primary key –And on other keys (specified via UNIQUE ) In addition, DBMS allow for the programmer to explicitly create and destroy indexes. Since no current SQL standard, we will look at typical syntax for creating indexes –Based on old SQL syntax –We then look at Eg from Oracle, SQL-Server Also a drop index command DROP INDEX indexname
77
77 Index Definition in SQL CREATE INDEX LNAME-INDEX ON EMPLOYEE (LNAME) What type of index is this ? Secondary index –File not sorted by LNAME –LNAME is not a key Can create index on multiple attributes CREATE INDEX FULLNAME-INDEX ON EMPLOYEE (LNAME, FNAME) –On both, with LNAME being more significant
78
78 Index Definition in SQL: on key Index corresponding to a key: CREATE UNIQUE INDEX SSN-INDEX ON EMPLOYEE (SSN) Will enforce uniqueness In early versions of SQL, only way of specifying uniqueness. Why? o/w too inefficient to check uniqueness When we specify attribute is a key, typically an index like this is created. File may not be sorted on indexing field
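The CREATE INDEX, CREATE UNIQUE INDEX, and DROP INDEX statements can be tried out with Python's built-in `sqlite3` module. This is a sketch in SQLite's dialect, not the generic syntax on the slides: the index names use underscores because SQLite does not accept unquoted hyphenated identifiers, and the table definition and sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (SSN TEXT, LNAME TEXT, FNAME TEXT, DNO INTEGER)")

# Secondary index on a non-key field, and a unique index on the key.
conn.execute("CREATE INDEX LNAME_INDEX ON EMPLOYEE (LNAME)")
conn.execute("CREATE UNIQUE INDEX SSN_INDEX ON EMPLOYEE (SSN)")

conn.execute("INSERT INTO EMPLOYEE VALUES ('123456789', 'Smith', 'John', 5)")
try:
    # Same SSN again: the unique index should reject this row.
    conn.execute("INSERT INTO EMPLOYEE VALUES ('123456789', 'Wong', 'Franklin', 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Dropping an index changes performance, not query results.
conn.execute("DROP INDEX LNAME_INDEX")
remaining = [row[0] for row in
             conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")]
rows = conn.execute("SELECT LNAME FROM EMPLOYEE WHERE LNAME = 'Smith'").fetchall()
print(duplicate_rejected, remaining, rows)
```

The query on LNAME still returns the same row after the index is dropped, which illustrates that indexes affect efficiency but not correctness.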
79
79 Index Definition in SQL: CLUSTER Can do a clustering index –File has to be sorted by the indexing field –Indexing field may not be a key, may be repeated CREATE INDEX DNO-INDEX ON EMPLOYEE (DNO) CLUSTER Without CLUSTER may not be sorted on that field
80
80 Index Definition in SQL: Primary, B-tree If we want to get a primary index, how to do ? Use both CLUSTER and UNIQUE CREATE UNIQUE INDEX SSN-INDEX ON EMPLOYEE (SSN) CLUSTER User can specify wants B tree index: CREATE INDEX MY-INDEX ON EMPLOYEE (SALARY) WITH STRUCTURE = BTREE