1
CM20145 File Structure and Indexing Dr Alwyn Barry Dr Joanna Bryson
2
Last Time Locking & Two-Phase Locking Review How Locks Work Other Possible Mechanisms Graph-Based Protocols Timestamp-Based Protocols Validation-Based Protocols Further Refinements Multiple Granularity Multi-version Schemes Deadlock Handling Prevention Cures Locking & Indexes Now: File Structure & Indexing
3
Overview Storage Access & Buffers File Organization Fixed Length Records Variable Length Organization of Records in Files Sequential & Clustering Data-Dictionary Storage Intro to Indexing Basic Concepts Ordered Indices Dense & Sparse Multilevel Primary & Secondary
4
Storage Hierarchy (Lecture 9) ©Silberschatz, Korth and Sudarshan Modifications & additions by S Bird, J Bryson
5
Storage Access
Blocks: a fixed-length unit; the unit of both storage allocation and data transfer. Database files are organized into blocks.
Buffer: the portion of main memory available to store copies of disk blocks.
Buffer manager: the subsystem responsible for allocating buffer space in main memory.
Block transfers: we want to minimize the number of block transfers between disk and memory, so keep as many blocks as possible in main memory.
6
Trans. Server Processes (Lec 9)
7
Buffer Manager
Called when a program needs a block from disk.
If the block is already in the buffer: the requesting program is given the address of the block in main memory.
Otherwise:
1. The buffer manager allocates space in the buffer for the block. It discards some other block if necessary to make space; the discarded block is written back to disk only if it was modified since it was last fetched.
2. The buffer manager reads the block from disk into the buffer and passes the block's new main-memory address to the requesting program.
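The request-handling steps on this slide can be sketched in a few lines of Python. This is an illustrative toy, not code from the lecture: the class name `BufferManager`, the dict standing in for the disk, and the choice of LRU as the discard rule are all assumptions made for the example.

```python
from collections import OrderedDict


class BufferManager:
    """Toy buffer pool: LRU replacement, write-back of modified blocks (assumed design)."""

    def __init__(self, disk, capacity):
        self.disk = disk              # dict: block number -> bytes (stands in for the disk)
        self.capacity = capacity      # maximum number of blocks held in memory
        self.frames = OrderedDict()   # block number -> bytearray, least recently used first
        self.dirty = set()            # blocks modified since they were fetched
        self.reads = 0                # disk reads performed
        self.writes = 0               # disk writes performed

    def get_block(self, block_no):
        if block_no in self.frames:             # already buffered: hand back the copy
            self.frames.move_to_end(block_no)
            return self.frames[block_no]
        if len(self.frames) >= self.capacity:   # step 1: discard a block to make space
            victim, data = self.frames.popitem(last=False)
            if victim in self.dirty:            # write back only if it was modified
                self.disk[victim] = bytes(data)
                self.dirty.discard(victim)
                self.writes += 1
        self.frames[block_no] = bytearray(self.disk[block_no])  # step 2: read from disk
        self.reads += 1
        return self.frames[block_no]

    def mark_dirty(self, block_no):
        self.dirty.add(block_no)


disk = {n: bytes([n]) * 4 for n in range(3)}
bm = BufferManager(disk, capacity=2)
bm.get_block(0)[0] = 99     # modify block 0 in the buffer
bm.mark_dirty(0)
bm.get_block(1)
bm.get_block(2)             # buffer is full: block 0 is written back, then discarded
```

In Python the "address" handed to the requesting program is simply a reference to the buffered `bytearray`; a real system would return a pointer into a shared buffer pool.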
8
Buffer-Replacement Policies
Most operating systems replace the block using the least recently used (LRU) strategy: past usage often predicts future usage.
But database queries have well-defined access patterns (e.g. sequential scans), and a database system can predict future references from information in a query.
LRU can be a bad strategy for access patterns that repeatedly scan tables, e.g. a join of r and s computed by nested loops:
    for each tuple tr of r do
        for each tuple ts of s do
            if the tuples tr and ts match …
Want a mixed replacement strategy with information provided by the query optimizer.
9
Buffer-Replacement Policies (2)
Pinned block (e.g. for the inner loop, s): a memory block that is not allowed to be written back to disk.
Toss-immediately strategy (e.g. for r): frees the space occupied by a block as soon as its final tuple has been processed.
Most recently used (MRU) strategy (for s): if we know a table is being iterated through, the most recently used block is the one that will go unused the longest, so it is the best candidate for replacement.
The buffer manager can use statistical information regarding the probability that a request will reference a particular relation. E.g. the data dictionary is frequently accessed. Heuristic: always keep data-dictionary blocks pinned in main memory.
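To see why MRU can beat LRU on repeated scans, the small simulation below counts buffer misses for both policies. It is a sketch under assumed sizes (a 4-block relation, a 3-block buffer, 3 passes); the function name `count_misses` is made up for the example.

```python
def count_misses(policy, num_blocks, capacity, passes):
    """Count buffer misses when a relation of num_blocks blocks is scanned repeatedly."""
    buffer = []      # block numbers, ordered from least to most recently used
    misses = 0
    for _ in range(passes):
        for block in range(num_blocks):
            if block in buffer:
                buffer.remove(block)      # hit: recency is refreshed by the append below
            else:
                misses += 1
                if len(buffer) >= capacity:
                    # LRU evicts the oldest entry, MRU evicts the newest one
                    buffer.pop(0 if policy == "LRU" else -1)
            buffer.append(block)
    return misses


lru = count_misses("LRU", num_blocks=4, capacity=3, passes=3)
mru = count_misses("MRU", num_blocks=4, capacity=3, passes=3)
```

With these assumed sizes every access under LRU is a miss, because each block is evicted just before the scan comes back to it, while MRU keeps most of the relation resident between passes.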
10
Overview Storage Access & Buffers File Organization Fixed Length Records Variable Length Organization of Records in Files Sequential & Clustering Data-Dictionary Storage Intro to Indexing Basic Concepts Ordered Indices Dense & Sparse Multilevel Primary & Secondary
11
File Organization The database is stored as a collection of files. Each file is a sequence of records. A record is a sequence of fields. One approach: assume the record size is fixed, each file has records of one particular type only, and different files are used for different relations. This case is the easiest to implement; variable-length records are considered later.
12
Fixed-Length Records Simple approach: store record i starting from byte n × (i – 1), where n is the size of each record. Record access is simple, but records may cross block boundaries. Modification: do not allow records to cross block boundaries. Deletion of record i, alternatives (here n denotes the last record in the file, not the record size): (a) move records i + 1, …, n to i, …, n – 1; (b) move record n to i; (c) do not move records, but link all free records on a free list.
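The address arithmetic can be made concrete with a short sketch. The record and block sizes are assumed values chosen for illustration, and the function names are hypothetical.

```python
def record_offset(i, n):
    """Byte offset of record i (1-based) when n-byte records are packed back to back."""
    return n * (i - 1)


def record_location(i, n, block_size):
    """(block number, offset in block) when records may not cross block boundaries."""
    per_block = block_size // n              # whole records that fit in one block
    block, slot = divmod(i - 1, per_block)
    return block, slot * n


# Assumed sizes: 40-byte records, 100-byte blocks.
packed = record_offset(3, 40)                # record 3 would span bytes 80..119
aligned = record_location(3, 40, 100)        # so it is placed at the start of block 1
```

In the packed layout record 3 straddles the boundary at byte 100; the block-aligned layout wastes the last 20 bytes of each block but lets every record be read with a single block transfer.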
13
Free Lists Store the address of the first deleted record in the file header. Then store the address of the second deleted record in the location of the first, and so on. The addresses act as pointers to free space. More space-efficient representation: reuse the attribute space of free records to hold these addresses, rather than reserving a pointer field in every record. (No pointers are stored in in-use records.)
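A free list threaded through deleted slots might look like the sketch below. The class `FixedLengthFile` is hypothetical, and as a simplification it pushes each newly freed slot onto the front of the chain rather than appending it at the end.

```python
class FixedLengthFile:
    """Toy file of fixed-length slots with a free list kept inside the deleted slots."""

    def __init__(self):
        self.slots = []          # slot i holds a record, or the index of the next free slot
        self.free_head = None    # "file header": index of the first free slot, or None

    def insert(self, record):
        if self.free_head is None:            # no free slot: extend the file
            self.slots.append(record)
            return len(self.slots) - 1
        slot = self.free_head                 # reuse the first free slot
        self.free_head = self.slots[slot]     # header now points at the next free slot
        self.slots[slot] = record
        return slot

    def delete(self, slot):
        self.slots[slot] = self.free_head     # the freed slot stores the old chain head
        self.free_head = slot


f = FixedLengthFile()
for rec in ("a", "b", "c"):
    f.insert(rec)
f.delete(1)
f.delete(0)
positions = [f.insert(rec) for rec in ("d", "e", "f")]
```

Because the pointer lives in the space a deleted record used to occupy, in-use records carry no per-record pointer overhead.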
14
Variable-Length Records Variable-length records arise in database systems in several ways: storage of multiple record types in one file; record types that allow variable lengths for one or more fields; record types that allow repeating fields (used in some older data models, not 1st Normal Form!). Approaches: a) byte strings, b) slotted pages, c) fixed-length representation: reserved space, d) fixed-length representation: pointers.
15
a) Byte-String Representation Attach an end-of-record control character to the end of each record. Difficulties with insertions and deletions.
16
b) Slotted Page Representation Slotted page header contains: number of record entries end of free space in the block location and size of each record Records can be moved around within a page to keep them contiguous with no empty space between them; entry in the header must be updated. Pointers should not point directly to record — instead they should point to the entry for the record in header.
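A minimal slotted page, under the assumption that the header is kept as a Python list rather than serialized into the block, could be sketched as follows. The class `SlottedPage` is hypothetical and omits overflow checks.

```python
class SlottedPage:
    """Toy slotted page: records grow from the end of the block, header holds the slots."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.free_end = size       # end of free space in the block
        self.slots = []            # header entries: (offset, length), or None if deleted

    def insert(self, record):
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1          # slot number: what outside pointers refer to

    def read(self, slot):
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot):
        """Drop a record, then move the others so they stay contiguous."""
        self.slots[slot] = None
        live = [(s, self.read(s)) for s, e in enumerate(self.slots) if e is not None]
        self.free_end = len(self.data)
        for s, rec in live:
            self.free_end -= len(rec)
            self.data[self.free_end:self.free_end + len(rec)] = rec
            self.slots[s] = (self.free_end, len(rec))   # header entry is updated


page = SlottedPage(32)
a = page.insert(b"alpha")
b = page.insert(b"be")
c = page.insert(b"gamma!")
page.delete(b)              # "gamma!" moves, but slot number c still finds it
```

Because callers hold the slot number rather than a byte offset, compaction can move records freely without invalidating any external pointer.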
17
c) Fixed Length Rep: Reserved Space Two fixed-length representations: reserved space and pointers. Reserved space: can use fixed-length records of a known maximum length; unused space in shorter records is filled with a null or end-of-record symbol.
18
d) Fixed Length Rep: Pointers Pointer method A variable-length record is represented by a list of fixed-length records, chained together via pointers. Can be used even if the maximum record length is not known. May waste less space.
19
d) Fixed Length Rep: Pointers (2) Space might still be wasted in all records except the first in a chain. Solution is to allow two kinds of block in file: Anchor block – contains the first records of chain Overflow block – contains records other than those that are the first records of chains
20
Overview Storage Access & Buffers File Organization Fixed Length Records Variable Length Organization of Records in Files Sequential & Clustering Data-Dictionary Storage Intro to Indexing Basic Concepts Ordered Indices Dense & Sparse Multilevel Primary & Secondary
21
Organization of Records in Files
Heap (won't consider this): a record can be placed anywhere in the file where there is space.
Sequential: store records in sequential order, based on the value of the search key of each record.
Clustering file organization: records of several different relations can be stored in the same file. Storing related records on the same block minimizes I/O.
Hashing (next lecture): a hash function is computed on some attribute of each record; the result of the function specifies which block the record should be placed in.
22
Sequential File Organization Suitable for applications that require sequential processing of the entire file. The records in the file are ordered by a search key.
23
Sequential File Organization (2) Deletion – use pointer chains. Insertion – locate the position where the record is to be inserted. If there is free space insert there. If no free space, insert the record in an overflow block. In either case, pointer chain must be updated. Need to reorganize the file occasionally to restore sequential order
24
Sequential File Organization (3) It is necessary to fill up the space once a record has been deleted. Options: shift everything up, or move the last record into the gap.
25
Clustering File Organization Simple file structure stores each relation in a separate file. Can instead store several relations in one file using a clustering file organization. E.g., clustering organization of customer and depositor:
26
Clustering File Organization (2) Good for queries involving the join of depositor and customer, and for queries involving one single customer and their accounts. Bad for queries involving only customer. Results in variable-size records.
27
Clustering File Organization (3) Add pointer chains.
28
Overview Storage Access & Buffers File Organization Fixed Length Records Variable Length Organization of Records in Files Sequential & Clustering Data-Dictionary Storage Intro to Indexing Basic Concepts Ordered Indices Dense & Sparse Multilevel Primary & Secondary
29
Data Dictionary Storage
The data dictionary (or system catalog) stores metadata (data about data), such as:
Information about relations: names of relations, names and types of attributes of each relation, names and definitions of views, integrity constraints.
User / accounting information, e.g. passwords.
Statistical and descriptive data: number of tuples in each relation, access frequency.
Physical file organization information: how each relation is stored (sequential/hash/…), and the physical location of the relation (operating system file name, or disk addresses of blocks containing relation records).
Information about indices.
30
Data Dictionary Storage (2) Catalog structure: use either specialized data structures designed for efficient access, or a set of relations, with existing system features used to ensure efficient access. The latter alternative is usually preferred. A possible catalog representation: Relation-metadata = (relation-name, number-of-attributes, storage-organization, location) Attribute-metadata = (attribute-name, relation-name, domain-type, position, length) User-metadata = (user-name, encrypted-password, group) Index-metadata = (index-name, relation-name, index-type, index-attributes) View-metadata = (view-name, definition)
31
Overview Storage Access & Buffers File Organization Fixed Length Records Variable Length Organization of Records in Files Sequential & Clustering Data-Dictionary Storage Intro to Indexing Basic Concepts Ordered Indices Dense & Sparse Multilevel Primary & Secondary
32
Indexing: Basic Concepts Indexing mechanisms are used to speed up access to desired data. E.g., author catalog in library. Search key: attribute or set of attributes used to look up records. An index file consists of records (called index entries) of the form (search-key, pointer). The index is much smaller than the original file. Two basic kinds of indices: Ordered indices: search keys are stored sorted. Hash indices: search keys are distributed uniformly across buckets using a hash function.
33
Index Evaluation Metrics Types: access types supported efficiently, e.g. records with a specified value in the attribute, or records with an attribute value falling in a specified range. Time: access time, insertion time, deletion time. Space: space overhead.
34
Ordered Indices Ordered index: index entries are stored sorted on the search-key value. E.g., author catalog in library. Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file. Also called a clustering index. The search key of a primary index is usually, but not necessarily, the primary key. Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called a non-clustering index. Index-sequential file: ordered sequential file with a primary index. Efficient for both random access and sequential search.
35
Dense Index Files: Example Dense index — Index record appears for every search-key value in the file.
36
Dense Index Files: Updates Deletion: if the deleted record was the only record with its search-key value, delete that search-key value from the index too. Insertion: perform a lookup using the search-key value appearing in the record to be inserted; if the search-key value does not appear in the index, insert it. Multilevel update algorithms are simple extensions of the single-level algorithms.
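The single-level update rules for a dense index can be sketched as below. The class `DenseIndex` and the use of integer record ids as "pointers" are assumptions made for the example.

```python
from bisect import bisect_left, insort


class DenseIndex:
    """Toy dense index: one entry per distinct search-key value in the file."""

    def __init__(self):
        self.keys = []     # sorted search-key values
        self.ptrs = {}     # search-key value -> record ids holding that value

    def insert(self, key, rid):
        if key not in self.ptrs:          # value not yet in the index: insert it
            insort(self.keys, key)
            self.ptrs[key] = []
        self.ptrs[key].append(rid)

    def delete(self, key, rid):
        self.ptrs[key].remove(rid)
        if not self.ptrs[key]:            # that was the only record with this value
            del self.ptrs[key]
            self.keys.pop(bisect_left(self.keys, key))


idx = DenseIndex()
idx.insert("Perryridge", 1)
idx.insert("Perryridge", 2)
idx.insert("Brighton", 3)
idx.delete("Perryridge", 1)
after_first = list(idx.keys)      # entry survives: record 2 still has this value
idx.delete("Perryridge", 2)
after_second = list(idx.keys)     # entry removed along with the last record
```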
37
Sparse Index Files Sparse index: contains index records for only some search-key values. Applicable only when records are sequentially ordered on the search key. To locate a record with search-key value K: find the index record with the largest search-key value ≤ K, then search the file sequentially starting at the record to which that index record points. Less space and less maintenance overhead for insertions and deletions than a dense index. Generally slower than a dense index for locating records. Good tradeoff: sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
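The lookup procedure for a sparse index with one entry per block can be sketched with a binary search over the index followed by a sequential scan. The account data and the function name `lookup` are invented for illustration, and search keys are assumed to be unique.

```python
from bisect import bisect_right

# Assumed data: a file sorted on the search key, split into blocks of records.
blocks = [
    [("A-101", 500), ("A-110", 600)],
    [("A-215", 700), ("A-217", 750)],
    [("A-305", 350), ("A-444", 900)],
]
# Sparse index: one entry per block, holding the least search-key value in that block.
sparse_index = [block[0][0] for block in blocks]


def lookup(key):
    pos = bisect_right(sparse_index, key) - 1   # largest index entry not exceeding key
    if pos < 0:
        return None                             # key is smaller than every index entry
    for k, value in blocks[pos]:                # sequential search from that point
        if k == key:
            return value
    return None
```

Here `lookup("A-217")` lands on the index entry "A-215" and then scans that block; a key equal to an index entry, such as "A-215", is found because the comparison is "less than or equal", not strictly "less than".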
38
Sparse Index Files: Example
39
Sparse Index Files: Deletion If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index too. If an index entry exists for that search key, it is deleted by replacing it with the next search-key value in the file, in search-key order. If the next search-key value already has an index entry, the entry is simply deleted without replacement.
40
Sparse Index Files: Insertion Single-level index insertion: Perform a lookup using the search-key value appearing in the record to be inserted. If index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. If new block created, first search-key value appearing in the new block is inserted into the index. Multilevel insertion algorithms are simple extensions.
41
Multilevel Index If primary index doesn’t fit in memory, access becomes expensive. To reduce number of disk accesses to index records, treat primary index kept on disk as a sequential file and construct a sparse index on it. outer index – a sparse index of primary index. inner index – the primary index file. If outer index is too large to fit in main memory, create another level. Must update indices at all levels on insertion or deletion!
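A two-level lookup can be sketched as follows: search the small outer index first, then read only one block of the inner index. The key values, the fanout of 4 entries per index block, and the function names are assumptions for the example.

```python
from bisect import bisect_right


def build_level(keys, fanout):
    """Sparse index over a sorted list: the first key of every group of `fanout` entries."""
    return keys[::fanout]


def lookup(key, inner, outer, fanout):
    """Position in `inner` of the largest entry not exceeding key, or None."""
    o = bisect_right(outer, key) - 1         # search the outer index (kept in memory)
    if o < 0:
        return None
    start = o * fanout                       # only this block of the inner index is read
    block = inner[start:start + fanout]
    return start + bisect_right(block, key) - 1


inner = list(range(0, 100, 5))               # assumed primary index keys: 0, 5, ..., 95
outer = build_level(inner, fanout=4)         # one outer entry per 4 inner entries
pos = lookup(47, inner, outer, fanout=4)     # inner[pos] is the entry to follow
```

If the outer index itself grew too large, the same `build_level` step could be applied to it again to add another level.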
42
Secondary Indices We often want to find records based on the values of fields that are not the search key of the primary index. Example: suppose the account database is stored sequentially by account number; we may still want to 1. find all accounts in a particular branch, or 2. find all accounts with balances in a specified range. A secondary index has an index record for each search-key value. Each index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
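The bucket structure can be sketched for the balance-range query. The account tuples are invented sample data, and list positions stand in for record pointers.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Assumed account file, stored sequentially by account number.
accounts = [
    ("A-101", "Downtown", 500),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
    ("A-217", "Brighton", 750),
    ("A-222", "Redwood", 700),
]

# Secondary index on balance: each search-key value maps to a bucket of record positions.
buckets = defaultdict(list)
for pos, (_, _, balance) in enumerate(accounts):
    buckets[balance].append(pos)
sorted_keys = sorted(buckets)        # dense: one entry per distinct balance


def accounts_with_balance_between(low, high):
    lo = bisect_left(sorted_keys, low)
    hi = bisect_right(sorted_keys, high)
    return [accounts[pos][0] for key in sorted_keys[lo:hi] for pos in buckets[key]]


result = accounts_with_balance_between(600, 700)
```

Note that the positions in one bucket (here the two accounts with balance 700) are scattered through the file, which is why scanning in secondary-index order may fetch a different block for every record.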
43
Example: Secondary Index Secondary Index on balance field of account
44
Primary and Secondary Indices Indices offer substantial benefits when searching for records. When a file is modified, every index on the file must be updated. Updating indices imposes overhead on database modification. Sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive: Each record access may fetch a new block from disk. Secondary indices have to be dense.
45
Summary Storage Access & Buffers File Organization Fixed Length Records Variable Length Organization of Records in Files Sequential & Clustering Data-Dictionary Storage Intro to Indexing Basic Concepts Ordered Indices Dense & Sparse Multilevel Primary & Secondary Next: Indexing & Hashing
46
Reading & Exercises Reading: Silberschatz Chapter 11.5 – 11.8 Connolly & Begg 2.4, 20.3.2, 17.2 (sort of), Appendix C (this also covers Lecture 17) Exercises: Silberschatz 11.7,14,17,20 C & B – no exercises in the appendix, borrow Silberschatz!
47
Changing Whole Relations If two-phase locking is used: A delete operation may be performed only if the transaction deleting the tuple has an exclusive lock on the tuple to be deleted. A transaction that inserts a new tuple into the database is given an X-lock on the tuple. Insertions and deletions can lead to the phantom phenomenon. A transaction that scans a relation (e.g., find all Perryridge accounts) and a transaction that inserts a tuple into the relation (e.g., insert a new Perryridge account) may conflict without accessing any tuple in common. Non-serializable schedules can result! (The find transaction may not see the new account, yet be serialized after the insert transaction.)
48
Insert and Delete Operations A transaction scanning a relation is reading information about what tuples the relation contains, while another transaction changes the same information. Something should be locked! One solution: associate a data item with the relation, to represent the information about what tuples the relation contains. Transactions scanning the relation acquire a shared lock on the data item. Transactions inserting or deleting a tuple acquire an exclusive lock on the data item. (Note: locks on the data item do not conflict with locks on individual tuples.)
49
Locking for Inserts & Deletes Previous protocol provides very low concurrency for insertions/deletions. Index locking protocols provide higher concurrency while preventing the phantom phenomenon, by requiring locks on certain index buckets.
50
Index Locking Protocol
Every relation must have at least one index. Access to a relation must be made only through one of the indices on the relation.
A transaction Ti that performs a lookup must lock all the index buckets that it accesses, in S-mode.
A transaction Ti may not insert a tuple ti into a relation r without updating all indices on r.
Ti must perform a lookup on every index to find all index buckets that could possibly have contained a pointer to tuple ti, had it existed already, and obtain locks in X-mode on all these index buckets. Ti must also obtain locks in X-mode on all index buckets that it modifies.
The rules of the two-phase locking protocol must be observed.
51
Concurrency in Index Structures Indices are unlike other database items in that their only job is to help in accessing data. Index-structures are typically accessed very often, much more than other database items. Treating index-structures like other database items leads to low concurrency. Two-phase locking on an index may result in transactions executing practically one-at-a-time. It is acceptable to have nonserializable concurrent access to an index as long as the accuracy of the index is maintained. There are index concurrency protocols where locks on internal nodes are released early, and not in a two-phase fashion.