CS522 Advanced Database Systems (9/12/2018)
5. Files and ISAM
Huiping Guo, Department of Computer Science, California State University, Los Angeles
Files of records
- A file is a collection of pages; a page holds records; a record is made up of fields.
- A page (or block) is the right unit for I/O, but higher levels of the DBMS operate on records and on files of records.
5. Files and ISAM CS522_S16
Implementing heap files
- As a file grows and shrinks, disk pages are allocated and de-allocated.
- Supported operations on a heap file:
  - Create/destroy files
  - Insert, delete, or update a record with a given rid
  - Scan all records in the file
- Open question: how do we find the page that contains the record with a given rid?
Unordered (Heap) Files
- To support scans, we must keep track of the pages in a file.
- To support insertions efficiently, we must keep track of free space on pages.
- To support record-level operations, we must keep track of the records on a page.
- There are many alternatives for keeping track of this information.
Linked list of pages
- Maintain a heap file as a doubly linked list of pages.
- Header page: the first page of the file.
  - The DBMS keeps a table of pairs <heap_file_name, header_page_addr>.
- How do we keep track of free space within a page?
  - To be discussed later.
- How do we keep track of pages that have some free space?
  - Maintain one doubly linked list of pages with free space and another doubly linked list of full pages.
Linked list of pages (cont.)
[Figure: the header page is the head of two doubly linked lists of data pages, one of full pages and one of pages with free space.]
- Each page contains 2 "pointers" plus data.
- Each pointer is actually a page id.
- Disadvantage: if records are of variable length, virtually all pages in the file will be on the free list.
Heap File Using a Page Directory
- The directory is itself a collection of pages.
- The DBMS keeps the location of the header page.
- Each directory entry identifies a page (or a sequence of pages) in the heap file.
- As the heap file grows or shrinks, the number of entries and the number of directory pages grow or shrink correspondingly.
- Free space can be managed by maintaining:
  - A bit per entry, indicating whether the corresponding page has any free space.
  - A count per entry, indicating the amount of free space on the page. This works well for variable-length records.
Heap File Using a Page Directory (cont.)
[Figure: a directory, starting at the header page, whose entries point to Data Page 1, Data Page 2, ..., Data Page N.]
- Since several entries fit on a directory page, we can efficiently search for a data page with enough space to hold a record to be inserted.
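
The count-per-entry variant of the page directory can be sketched as follows. This is a minimal in-memory illustration, not a DBMS implementation: the names `DirectoryEntry`, `HeapFileDirectory`, and `find_page_for`, the 4096-byte page size, and the first-fit policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size in bytes (not from the slides)


@dataclass
class DirectoryEntry:
    page_id: int
    free_bytes: int  # the per-entry count of free space on the data page


@dataclass
class HeapFileDirectory:
    entries: list = field(default_factory=list)

    def allocate_page(self):
        """Grow the heap file by one empty data page and one directory entry."""
        entry = DirectoryEntry(page_id=len(self.entries), free_bytes=PAGE_SIZE)
        self.entries.append(entry)
        return entry

    def find_page_for(self, record_size):
        """Return the id of a page with room for the record (first fit).

        Only directory entries are inspected; no data page is read.
        """
        if record_size > PAGE_SIZE:
            raise ValueError("record does not fit on any single page")
        for entry in self.entries:
            if entry.free_bytes >= record_size:
                break
        else:
            entry = self.allocate_page()
        entry.free_bytes -= record_size
        return entry.page_id
```

The search touches only directory entries, which is the point of the design: many entries fit on one directory page, so few page reads are needed to find a home for a new record.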
Page Formats
- The page abstraction is appropriate when dealing with I/O issues.
- Higher levels of the DBMS see data as a collection of records.
- How do we arrange a collection of records on a page?
  - A page is treated as a collection of slots.
  - Each slot contains a record.
  - A record is identified by the pair (page id, slot number); this pair is the record id (rid).
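Page Formats: Fixed Length Records
[Figure: two page layouts. PACKED: slots 1 to N are filled contiguously and followed by free space; the page stores N, the number of records. UNPACKED, BITMAP: the page has M slots, a bitmap with one bit per slot (M ... 3 2 1), and M, the number of slots.]
- rid = (page id, slot number)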
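
As a rough illustration of the unpacked layout, the sketch below keeps one occupancy bit per slot. The class name `BitmapPage` and its methods are hypothetical, and slots are numbered from 0 here rather than from 1 as in the figure.

```python
class BitmapPage:
    """Fixed-length record page: M slots plus one occupancy bit per slot."""

    def __init__(self, page_id, num_slots):
        self.page_id = page_id
        self.slots = [None] * num_slots       # record storage, one per slot
        self.occupied = [False] * num_slots   # the bitmap

    def insert(self, record):
        """Store the record in the first free slot and return its rid."""
        for slot, used in enumerate(self.occupied):
            if not used:
                self.slots[slot] = record
                self.occupied[slot] = True
                return (self.page_id, slot)
        raise ValueError("page is full")

    def delete(self, rid):
        """Clear the slot's bit; no other record moves, so other rids stay valid."""
        page_id, slot = rid
        if page_id != self.page_id or not self.occupied[slot]:
            raise KeyError(rid)
        self.occupied[slot] = False
        self.slots[slot] = None
```

Because a deletion only flips a bit, the rids of the remaining records are unaffected, and the freed slot is reused by a later insert.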
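Page Formats: Variable Length Records
[Figure: page i holding records with rids (i,1), (i,2), ..., (i,N) in its data area. The slot directory has one entry per slot, numbered N ... 2 1 (the figure shows the example values 20, 16, 24), and also stores the number of slots N and a pointer to the start of free space.]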
Page Formats: Variable Length Records (cont.)
- For each page, maintain a directory of slots.
  - Each slot holds a pair <record offset, record length>.
  - The record offset is the offset in bytes from the start of the data area on the page to the start of the record.
  - Deletion is accomplished quickly by setting the record offset to -1.
- Records can be moved around on the page.
  - The rid, which is the page number plus the slot number (the position in the directory), does not change when the record is moved.
  - Only the offset stored in the slot changes.
Page Formats: Variable Length Records (cont.)
- The page is NOT preformatted into slots.
- How do we maintain free space for new records?
  - Maintain a pointer that indicates the start of the free space area.
  - When a new record is too large to fit into the remaining free space, we have to move records on the page to reclaim the space freed by deleted records.
  - After reorganization, all records are stored contiguously, followed by the available free space.
Page Formats: Variable Length Records (cont.)
- The slot of a deleted record cannot always be removed from the slot directory.
  - Slot numbers are used to identify records.
  - If a slot were removed, the slot numbers of all subsequent slots would change, and so would the rids of the records those slots point to.
  - The only slot that can be removed is the last one, and only if the record it points to has been deleted.
- When a record is inserted, the slot directory should be scanned for a slot that currently does not point to any record.
  - That slot should be used for the new record.
  - A new slot is added only if all existing slots point to records.
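
The slot-directory rules above (offset -1 on delete, compaction when the free area is too small, reuse of empty slots, removal of trailing slots only) can be put together in a short sketch. It is a simplified model under stated assumptions: the class `SlottedPage` and its method names are hypothetical, slots are numbered from 0, and the space taken by the slot directory itself is not charged against the page.

```python
class SlottedPage:
    """Variable-length record page with a slot directory (simplified sketch)."""

    def __init__(self, page_id, data_size=64):
        self.page_id = page_id
        self.data = bytearray(data_size)  # the data area of the page
        self.free_start = 0               # pointer to the start of free space
        self.slots = []                   # [offset, length]; offset -1 = deleted

    def _live_bytes(self):
        return sum(length for offset, length in self.slots if offset != -1)

    def _compact(self):
        """Move live records so they are contiguous; rids are unchanged."""
        packed = bytearray(len(self.data))
        pos = 0
        for slot in self.slots:
            offset, length = slot
            if offset == -1:
                continue
            packed[pos:pos + length] = self.data[offset:offset + length]
            slot[0] = pos                 # only the offset in the slot changes
            pos += length
        self.data = packed
        self.free_start = pos

    def insert(self, record):
        if len(record) > len(self.data) - self._live_bytes():
            raise ValueError("record does not fit on this page")
        if len(record) > len(self.data) - self.free_start:
            self._compact()               # reclaim space of deleted records
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        for slot_no, slot in enumerate(self.slots):
            if slot[0] == -1:             # reuse a slot that points to nothing
                slot[0], slot[1] = offset, len(record)
                return (self.page_id, slot_no)
        self.slots.append([offset, len(record)])
        return (self.page_id, len(self.slots) - 1)

    def delete(self, rid):
        self.slots[rid[1]][0] = -1
        # Only trailing slots may be dropped; earlier ones keep their numbers.
        while self.slots and self.slots[-1][0] == -1:
            self.slots.pop()

    def read(self, rid):
        offset, length = self.slots[rid[1]]
        if offset == -1:
            raise KeyError(rid)
        return bytes(self.data[offset:offset + length])
```

Note that compaction rewrites offsets inside the slot directory but never renumbers slots, which is why rids held elsewhere in the system remain valid.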
Record format
- How do we organize fields within a record? Things to consider:
  - Are the fields of the record of fixed length or variable length?
  - The cost of various operations on the record: retrieving and modifying fields.
- System catalog
  - A description of the contents of a database, maintained by the DBMS.
  - Information common to all records of a given record type is stored in the system catalog.
Record Formats: Fixed Length
[Figure: a record starting at base address B, with fields F1, F2, F3, F4 of fixed lengths L1, L2, L3, L4 stored one after another.]
- The address of field F3 is B + L1 + L2.
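
The address arithmetic generalizes to any field: add the lengths of all preceding fields to the base address. A small sketch, where the function name `field_address` and the sample lengths are made up for illustration:

```python
def field_address(base, field_lengths, i):
    """Address of field F_i (1-based) in a fixed-length record.

    field_lengths holds L1, L2, ...; in a real system these would come
    from the system catalog rather than from the record itself.
    """
    if not 1 <= i <= len(field_lengths):
        raise IndexError("no such field")
    return base + sum(field_lengths[:i - 1])


# Example with assumed lengths L1=4, L2=20, L3=8, L4=2 and B=1000:
# F3 starts at B + L1 + L2.
print(field_address(1000, [4, 20, 8, 2], 3))
```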
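Record Formats: Variable Length
[Figure: two alternative layouts for a record with fields F1, F2, F3, F4. (1) Fields delimited by special symbols: a field count (4) followed by the fields, each terminated by "$". (2) Array of field offsets: an offset array at the start of the record, followed by the fields.]
- With delimited fields, locating a desired field requires a scan of the record.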
Record Formats: Variable Length (cont.)
- Reserve some space at the beginning of a record for an array of integer offsets.
- The ith integer in this array is the starting address of the ith field value, relative to the start of the record.
- This gives direct access to any field, at the cost of the space for the offset array.
Record Formats: Variable Length (cont.)
- Null values
  - A null value is a special value used to denote that the value for a field is unavailable or inapplicable.
- How do we deal with null values?
  - If a field contains a null value, the pointer to the end of the field is set to be the same as the pointer to the beginning of the field.
  - No space is used to represent the null value.
  - Comparing the pointers to the beginning and the end of the field determines whether the value in the field is null.
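
The offset-array layout and the null convention can be illustrated together. The encoding below is an assumption made for the example, not a format given in the slides: a record with n fields starts with n + 1 little-endian 2-byte offsets (the extra one marks the end of the last field), and the helper names `encode_record` and `get_field` are hypothetical.

```python
import struct


def encode_record(fields):
    """Encode a list of bytes values (None means null) as one record."""
    n = len(fields)
    pos = 2 * (n + 1)                  # field data starts after the offset array
    offsets, body = [], b""
    for value in fields:
        offsets.append(pos)            # start of this field, relative to record start
        if value is not None:
            body += value
            pos += len(value)
    offsets.append(pos)                # end of the last field
    return struct.pack(f"<{n + 1}H", *offsets) + body


def get_field(record, i):
    """Return field i (0-based) without scanning the record, or None if null."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    if start == end:                   # begin pointer equals end pointer: null
        return None
    return record[start:end]


rec = encode_record([b"Ann", None, b"Los Angeles", b"50000"])
print(get_field(rec, 2))
```

One consequence of this convention, visible in the sketch, is that a zero-length value is indistinguishable from null, since both have equal start and end offsets.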
Variable Length: some issues
- Modifying a field may cause it to grow.
  - All subsequent fields must then be shifted to make space for the modification.
- A modified record may no longer fit into the space remaining on its page.
- A record may grow so large that it no longer fits on any one page.
Review of data storage
- Disks, disk space manager, buffer manager
  - Structure
  - What can we learn?
- Files of records
  - File format
  - Page format
  - Record format
Keys to reduce I/O costs
- Reduce the number of page I/Os
  - Indexing
- Reduce the access time of retrieved pages
  - Store the pages sequentially
Index Issues
- Search key
- Index pages
- What is stored in an entry?
- How are data entries organized?
- Are the records sorted or hashed?
- Clustered or unclustered
- Primary or secondary
Data entries in an index file
- Alternative 1: the actual data record
- Alternative 2: <key, rid>
- Alternative 3: <key, list of rids>
Organization of data entries
- Tree-structured
  - ISAM, B+ tree, R-tree, quad-tree, ...
- Hash-based
  - Static, dynamic
- Other
  - Bitmap, signature, ...
Tree-structured indexing techniques
- Two structures
  - ISAM (Indexed Sequential Access Method): static structure
  - B+ tree: dynamic, adjusts gracefully under inserts and deletes
- Benefits of tree indexing
  - Reduces search time
  - Reduces the number of page I/Os
- Fan-out
  - The average number of children of a non-leaf node
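
To see why fan-out matters for page I/Os, the sketch below counts how many index levels are needed before the index can address every leaf page. The function name `index_levels` and the sample numbers are assumptions for illustration; a fan-out of 2 corresponds to halving the search range at each step, as binary search does.

```python
def index_levels(num_leaf_pages, fan_out):
    """Number of index levels needed to reach num_leaf_pages leaves.

    Each level multiplies the number of reachable pages by the fan-out,
    so the result is the ceiling of log base fan_out of num_leaf_pages.
    """
    levels, reachable = 0, 1
    while reachable < num_leaf_pages:
        reachable *= fan_out
        levels += 1
    return levels


# With one million leaf pages, compare a fan-out of 100 against 2.
print(index_levels(1_000_000, 100), index_levels(1_000_000, 2))
```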
Intuition for tree indexing
- Example query: "Find all employees with sal > 50k"
- Suppose all records are sorted by sal and stored on Page 1, Page 2, Page 3, ..., Page N.
- Binary search over the sorted file locates the first qualifying record.
  - Page I/Os: about log2 N, where N is the number of data pages.
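
A minimal sketch of the binary search described above, counting one page I/O per page inspected. The pages are modeled as in-memory sorted lists, and the function name `first_page_with_key_above` is hypothetical; a real system would fetch each page through the buffer manager.

```python
def first_page_with_key_above(pages, threshold):
    """Find the first page that contains a key greater than threshold.

    pages is a list of non-empty sorted lists, sorted across pages as well.
    Returns (page index, number of pages read); the index equals len(pages)
    when no key qualifies.
    """
    lo, hi, reads = 0, len(pages), 0
    while lo < hi:
        mid = (lo + hi) // 2
        reads += 1                      # fetching pages[mid] costs one page I/O
        if pages[mid][-1] > threshold:  # largest key on the page qualifies
            hi = mid
        else:
            lo = mid + 1
    return lo, reads


sal_pages = [[10, 20], [30, 40], [50, 60], [70, 80]]
print(first_page_with_key_above(sal_pages, 50))
```

Once the first qualifying page is found, the rest of the answer is obtained by scanning the file sequentially from that page onward.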
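Intuition for tree indexing (cont.)
[Figure: a data file of Page 1, Page 2, Page 3, ..., Page N. Above it, an index file holds keys k1, k2, ..., kN, one per data page. The index file is itself indexed by a smaller level of keys K1', ..., Km', and that level in turn by K1''.]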
Format of an index page
- An index entry is a pair <key, pointer>.
- An index page with m keys holds m + 1 pointers, laid out as:
  P0 | K1 | P1 | K2 | P2 | ... | Km | Pm
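
Given this layout, choosing which pointer to follow is a search among the keys on the page. The sketch below assumes the common convention that pointer Pi leads to keys k with Ki <= k < K(i+1), and P0 to keys below K1; the function name `child_for` is hypothetical.

```python
from bisect import bisect_right


def child_for(keys, pointers, search_key):
    """Pick the pointer to follow on an index page.

    keys is [K1, ..., Km] in sorted order and pointers is [P0, ..., Pm].
    """
    if len(pointers) != len(keys) + 1:
        raise ValueError("an index page with m keys needs m + 1 pointers")
    # bisect_right counts the keys that are <= search_key, which is
    # exactly the index of the pointer to follow.
    return pointers[bisect_right(keys, search_key)]


print(child_for([20, 33], ["P0", "P1", "P2"], 27))
```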
ISAM
[Figure: non-leaf pages at the top; below them the leaf pages, which are the primary pages; overflow pages hang off the primary pages.]
- Index entries: <search key value, page id>. They "direct" the search for data entries, which are in leaf pages.
- Leaf pages contain data entries; Alternative 1 is used.
- All leaf pages are allocated sequentially and sorted on the search key.
More on ISAM
- Operations on ISAM
  - File creation
  - Search
  - Insert
  - Delete
Example ISAM Tree
- Each node can hold 2 entries; there is no need for "next-leaf-page" pointers. (Why?)
Root:        [40]
Index pages: [20 33] [51 63]
Leaf pages:  [10* 15*] [20* 27*] [33* 37*] [40* 46*] [51* 55*] [63* 97*]
Inserting 23*, 48*, 41*, 42* ...
Root:               [40]
Index pages:        [20 33] [51 63]
Primary leaf pages: [10* 15*] [20* 27*] [33* 37*] [40* 46*] [51* 55*] [63* 97*]
Overflow pages:     [23*] chained to leaf [20* 27*]; [48* 41*] and then [42*] chained to leaf [40* 46*]
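Then deleting 42*, 51*, 97*, 55*
[Figure, as extracted: the tree from the previous slide, shown before the deletions.]
Root:               [40]
Index pages:        [20 33] [51 63]
Primary leaf pages: [10* 15*] [20* 27*] [33* 37*] [40* 46*] [51* 55*] [63* 97*]
Overflow pages:     [23*] chained to leaf [20* 27*]; [48* 41*] and then [42*] chained to leaf [40* 46*]
- Deletions remove data entries from leaf and overflow pages only; the index pages are not modified, so a key such as 51 still appears in the index after 51* is deleted.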
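
The insert and delete behavior in the example can be reproduced with a small model. This is a sketch under several assumptions, not a full ISAM implementation: the index levels are flattened into one static list of separator keys, pages are Python lists holding at most 2 entries as in the example tree, and the class and method names are hypothetical.

```python
from bisect import bisect_right

CAPACITY = 2  # entries per page, as in the example tree


class Leaf:
    def __init__(self, entries):
        self.primary = list(entries)  # the primary leaf page
        self.overflow = []            # chain of overflow pages


class ISAM:
    def __init__(self, sorted_keys):
        # File creation: fill primary pages sequentially, then build the
        # index once. The separators never change afterwards (static).
        self.leaves = [Leaf(sorted_keys[i:i + CAPACITY])
                       for i in range(0, len(sorted_keys), CAPACITY)]
        self.separators = [leaf.primary[0] for leaf in self.leaves[1:]]

    def _leaf_for(self, key):
        return self.leaves[bisect_right(self.separators, key)]

    def search(self, key):
        leaf = self._leaf_for(key)
        return key in leaf.primary or any(key in p for p in leaf.overflow)

    def insert(self, key):
        leaf = self._leaf_for(key)
        if len(leaf.primary) < CAPACITY:
            leaf.primary.append(key)
            return
        # Primary page is full: use the last overflow page, or chain a new one.
        if not leaf.overflow or len(leaf.overflow[-1]) == CAPACITY:
            leaf.overflow.append([])
        leaf.overflow[-1].append(key)

    def delete(self, key):
        leaf = self._leaf_for(key)
        for page in [leaf.primary] + leaf.overflow:
            if key in page:
                page.remove(key)
                break
        # Empty overflow pages are released; an empty primary page stays
        # in place, and the index is never touched.
        leaf.overflow = [page for page in leaf.overflow if page]


tree = ISAM([10, 15, 20, 27, 33, 37, 40, 46, 51, 55, 63, 97])
for k in (23, 48, 41, 42):
    tree.insert(k)
for k in (42, 51, 97, 55):
    tree.delete(k)
print(tree.separators, tree.leaves[3].overflow, tree.leaves[4].primary)
```

In this model the separator keys still include 51 after both 51* and 55* are deleted, and the now-empty primary page remains allocated, which mirrors the static nature of the structure.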
Pros and Cons of ISAM
- Pros
  - Good for static databases
  - Good for concurrency control: deletes and inserts affect only leaf pages
- Cons
  - Long chains of overflow pages may develop