1
CS522 Advanced Database Systems
2. Overview of data storage and indexing
Huiping Guo
Department of Computer Science
California State University, Los Angeles
5/26/2018
2
We are here …
[Figure: DBMS architecture. Components shown: query evaluation engine (Parser, Optimizer, Operator Evaluator, Plan Executor); File and Access Methods; Buffer Manager; Disk Space Manager; concurrency control (Transaction Manager, Lock Manager); Recovery Manager. The storage and indexing components are the ones covered here.]
2. Overview CS522_S16
3
Outline
Data on external storage
File organization
Indices
  Hash index
  B+ tree index
4
A file
[Figure: a “File” drawn as a collection of pages; each page holds records, and each record is made up of fields.]
5
External storage
Access unit: page
  The unit of information read from or written to disk
  The size of a page is a DBMS parameter
Cost unit: page I/O
  Read a page into memory
  Write a page to disk
Cost of database operations
  Dominated by page I/Os
Our purpose: minimize page I/Os
6
External storage
Disks
  Random access devices
  Can retrieve any page at a roughly fixed cost
  However, reading consecutive pages is much cheaper than reading them in random order
Tapes
  Sequential access devices
  Can only read pages in sequence
  Cheaper than disks, used for archival storage
7
Files and access methods layer
How is a relation stored?
  A relation is stored as a file of records
  Each record has a record id (rid), which is used to locate the record (it identifies the page that holds the record)
  Implemented by the files and access methods layer
Files and access methods layer
  Supports operations such as creation, insertion, deletion, scan, …
  Keeps track of pages allocated to each file
  Tracks available space within pages allocated to the file
  Supports fast access to desired subsets of records
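To make the rid idea concrete, here is a minimal in-memory sketch of a file of records. The slide only says that a rid locates a record via its page; the (page_no, slot_no) layout, the `HeapFile` class, and its method names are assumptions made for illustration, not the API of any particular DBMS.

```python
from typing import NamedTuple


class Rid(NamedTuple):
    """Hypothetical record id: which page, and which slot within that page."""
    page_no: int
    slot_no: int


class HeapFile:
    """Toy file of records: a list of fixed-capacity pages held in memory."""

    def __init__(self, page_capacity):
        self.page_capacity = page_capacity
        self.pages = [[]]  # pages allocated to this file

    def insert(self, record):
        # Allocate a new page when the last page has no free space left.
        if len(self.pages[-1]) == self.page_capacity:
            self.pages.append([])
        self.pages[-1].append(record)
        return Rid(len(self.pages) - 1, len(self.pages[-1]) - 1)

    def get(self, rid):
        # One page access: go straight to the page named by the rid.
        return self.pages[rid.page_no][rid.slot_no]

    def scan(self):
        # Full file scan: visit every page in turn.
        for page in self.pages:
            yield from page


f = HeapFile(page_capacity=2)
rids = [f.insert(name) for name in ["Bob", "Cal", "Sue"]]
print(rids[2], f.get(rids[2]))
```

With two records per page, the third record lands in slot 0 of page 1, and `get` reaches it without touching page 0.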
8
File organization
How are the records organized?
What is it?
  A method of arranging the records in a file when the file is stored on disk
  Sorted or unsorted
Target
  Minimize the cost of page I/O (input from disk to main memory and output from memory to disk)
9
Three file organizations
Many alternatives exist, each ideal for some situations and not so good in others:
Heap (random order) files
  Suitable when the typical access is a file scan that retrieves all records, or the retrieval of a particular record specified by its rid
Sorted files
  Best if records must be retrieved in some order, or a ‘range’ of records is needed
  Updates are expensive
Index
  Data structures that organize records via trees or hashing
  Speed up search for a subset of records
  Updates are much faster than in sorted files
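The difference between a heap file and a sorted file can be illustrated by counting page reads for an equality search. This is only a sketch under simple assumptions: pages are Python lists of four integer keys, the sorted file is sorted on the key being searched, and the function names are hypothetical.

```python
def heap_search(pages, key):
    """Scan pages in file order; return (found, pages_read)."""
    reads = 0
    for page in pages:
        reads += 1
        if key in page:
            return True, reads
    return False, reads


def sorted_search(pages, key):
    """Binary search over the pages of a file sorted on key; return (found, pages_read)."""
    reads = 0
    lo, hi = 0, len(pages) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        page = pages[mid]
        reads += 1
        if key < page[0]:
            hi = mid - 1
        elif key > page[-1]:
            lo = mid + 1
        else:
            return key in page, reads
    return False, reads


keys = list(range(64))
sorted_pages = [keys[i:i + 4] for i in range(0, 64, 4)]   # 16 pages, sorted
arbitrary = keys[::-1]                                    # an arbitrary (here reversed) order
heap_pages = [arbitrary[i:i + 4] for i in range(0, 64, 4)]

print(heap_search(heap_pages, 2))      # key 2 sits on the last heap page
print(sorted_search(sorted_pages, 2))
```

On these 16 pages the heap scan reads all 16 pages before finding key 2, while the binary search reads 4. The price of the sorted file, as the slide notes, is that inserts and deletes must keep the order intact.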
10
Indices
An index speeds up selections on the search key fields for the index
An index file contains a collection of data entries
  A data entry is a record stored in an index file
  A data entry with search key k is denoted as k*
  A data entry is used to obtain a data record (if they are different)
An index supports efficient retrieval of all data entries k* with a given key value k
11
Data entry
[Figure: Employees data records, sorted by name, with the data entries of an index on age and an index on sal pointing to them.]
name  age  sal
Bob   12   10
Cal   11   80
Joe   15   20
Sue   13   75
12
Alternatives for data entries k*
1. A data entry k* is an actual data record (with search key value k)
  It is a special file organization, called an indexed file organization
  At most one index on a given collection of data records can use Alternative 1; otherwise the data records would be stored more than once
  If data records are very large, the number of pages containing data entries is high
13
Alternatives for data entries k* (cont.)
2. A data entry is a <k, rid> pair
  rid is the record id of a data record with search key value k
3. A data entry is a <k, rid-list> pair
  rid-list is a list of record ids of data records with search key value k
Advantages of Alternatives 2 and 3
  Data entries point to data records instead of containing them
  They are independent of the file organization used for the indexed file (the file that holds the data records)
  Alternative 3 offers better space utilization than Alternative 2
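The two pointer-based alternatives can be sketched by building data entries over a tiny set of records. The records, the rid values, and the variable names below are made up for illustration; the index is on age.

```python
from collections import defaultdict

# Hypothetical data records keyed by rid = (page_no, slot_no); fields are (name, age, sal).
records = {
    (0, 0): ("Ann", 12, 10),
    (0, 1): ("Ben", 11, 80),
    (1, 0): ("Eve", 12, 20),
    (1, 1): ("Tom", 13, 75),
}

# Alternative 2: one <k, rid> pair per data record.
alt2 = sorted((age, rid) for rid, (_name, age, _sal) in records.items())

# Alternative 3: one <k, rid-list> entry per distinct key value.
rid_lists = defaultdict(list)
for age, rid in alt2:
    rid_lists[age].append(rid)
alt3 = sorted(rid_lists.items())

print(alt2)
print(alt3)
```

Age 12 occurs twice, so Alternative 2 stores the key 12 in two entries, while Alternative 3 stores it once with a two-element rid-list. This is the space saving the slide refers to; the cost is that entries become variable in length.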
14
Index data structures
How to organize data entries in an index?
  Hash index
  B+ tree index
15
Hash index
Hash data entries on the search key
The index is a collection of buckets
  A bucket consists of a primary page and overflow pages linked in a chain
How to determine which bucket a record belongs to?
  A hash function h is applied to the search key
Good for equality selections
How to handle updates and search?
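A minimal sketch of a static hash index follows, assuming Alternative 2 data entries, a page capacity of two entries, and a hash function that takes the two least significant bits of the key. The class names, the capacity, and the integer rids are hypothetical choices for illustration.

```python
PAGE_CAPACITY = 2  # assumed number of data entries per page


class Bucket:
    """A primary page followed by a chain of overflow pages."""

    def __init__(self):
        self.pages = [[]]  # pages[0] is the primary page; the rest are overflow pages

    def insert(self, entry):
        # Add an overflow page when the last page in the chain is full.
        if len(self.pages[-1]) == PAGE_CAPACITY:
            self.pages.append([])
        self.pages[-1].append(entry)

    def search(self, key):
        # An equality search reads every page in this one bucket only.
        return [rid for page in self.pages for k, rid in page if k == key]


class HashIndex:
    def __init__(self, n_bits=2):
        self.mask = (1 << n_bits) - 1
        self.buckets = [Bucket() for _ in range(1 << n_bits)]

    def h(self, key):
        return key & self.mask  # the n_bits least significant bits

    def insert(self, key, rid):
        self.buckets[self.h(key)].insert((key, rid))

    def search(self, key):
        return self.buckets[self.h(key)].search(key)


index = HashIndex()
salaries = [3000, 6003, 5004, 4000, 4003, 2007, 5004, 6003]
for rid, sal in enumerate(salaries):  # rids are just positions here
    index.insert(sal, rid)

print(index.search(5004))
print(len(index.buckets[0].pages))
```

The search touches a single bucket, which is why hashing suits equality selections. A range selection gets no help, because neighbouring key values hash to unrelated buckets.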
16
Example hash index
[Figure: two hashed files sharing the hash function h, which takes the 2 least significant bits of the search key.
 Employees file hashed on age (Alternative 1 used), with buckets labeled h(age)=00, h(age)=01, h(age)=10 and records (name, age, sal):
   Smith, 44, 3000; Jones, 40, 6003; Tracy, 44, 5004; Ashby, 25, 4000; Basu, 23, 4003; Bristow, 29, 2007; Cass, 50, 5004; Daniels, 22, 6003
 File of <sal, rid> pairs hashed on sal (Alternative 2 used), with buckets labeled h(sal)=00 and h(sal)=01 and sal values 3000, 4000, 5004, 4003, 2007, 6003.]
17
B+ Tree index
Data entries are arranged in sorted order by search key value
A hierarchical search data structure is maintained that directs searches to the correct page of data entries
Allows us to efficiently locate all data entries with search key values in a desired range
18
B+ Tree index
[Figure: B+ tree structure. Non-leaf pages hold index entries; leaf pages, sorted by search key, hold data entries.]
19
B+ Tree index
Root
  Topmost node, where all searches begin
Non-leaf pages
  Contain index entries that direct searches to the correct leaf page
  Contain node pointers separated by search key values
  The node pointer to the left of a key value k points to a subtree that contains only data entries less than k
  The node pointer to the right of a key value k points to a subtree that contains only data entries greater than or equal to k
Leaf pages
  Contain data entries and are chained
20
Example B+ Tree
[Figure: B+ tree on age.
 Root: 17 (left pointer: entries < 17; right pointer: entries >= 17)
 Non-leaf pages: 5 13 | 27 30
 Leaf pages: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38*]
Why are leaf pages chained in a doubly linked list?
Find 28*, 29*
Find all entries > 15* and < 30*
Insert/delete
Cost?
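The search questions on this slide can be traced with a small read-only sketch of the example tree. It assumes the root holds the single key 17, models the leaf chain with a forward pointer only, and omits insert and delete. The class and function names are hypothetical.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # forward link of the leaf chain


class Node:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # len(children) == len(keys) + 1


def find_leaf(node, k):
    """Follow index entries from the root; return (leaf, pages_read)."""
    pages = 1
    while isinstance(node, Node):
        # Keys equal to a separator go right, matching "entries >= 17".
        node = node.children[bisect_right(node.keys, k)]
        pages += 1
    return node, pages


def range_search(root, lo, hi):
    """All keys with lo < key < hi, found by walking the leaf chain."""
    leaf, _ = find_leaf(root, lo)
    result = []
    while leaf is not None:
        for k in leaf.keys:
            if k >= hi:
                return result
            if k > lo:
                result.append(k)
        leaf = leaf.next
    return result


leaves = [Leaf(k) for k in ([2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b
root = Node([17], [Node([5, 13], leaves[:3]), Node([27, 30], leaves[3:])])

leaf, pages = find_leaf(root, 28)
print(28 in leaf.keys, 29 in leaf.keys, pages)
print(range_search(root, 15, 30))
```

Both 28 and 29 lead to the leaf holding 27* and 29* after three page reads, so 29* is found and 28* is not. The range query descends once and then follows the chain, which is why chained leaves matter for range selections.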
21
More on B+ tree index
The number of page I/Os of a search is equal to the length of a path from the root to a leaf, plus the number of leaf pages with qualifying data entries
A B+ tree ensures that all paths from the root to a leaf in a given tree are of the same length
  The tree is always balanced in height
  The height of a balanced tree is the length of a path from root to leaf
Fan-out
  The average number of children of a non-leaf node
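The cost rule above can be turned into a rough estimate. The sketch assumes every non-leaf node has exactly `fan_out` children, which is an idealization since the slide defines fan-out as an average. It counts the height as the number of non-leaf levels. The function names are hypothetical.

```python
def btree_levels(leaf_pages, fan_out):
    """Smallest number of non-leaf levels whose pointers can reach every leaf page."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    levels = 0
    reachable = 1  # leaf pages reachable with the levels counted so far
    while reachable < leaf_pages:
        reachable *= fan_out
        levels += 1
    return levels


def search_cost(leaf_pages, fan_out, qualifying_leaf_pages=1):
    """Page I/Os: path from root to leaf plus leaf pages with qualifying entries."""
    return btree_levels(leaf_pages, fan_out) + qualifying_leaf_pages


print(btree_levels(1_000_000, 100))
print(search_cost(1_000_000, 100))
print(search_cost(1_000_000, 100, qualifying_leaf_pages=10))
```

With a fan-out of 100, one million leaf pages need only three non-leaf levels, so an equality search costs about four page I/Os. This is why a high fan-out keeps B+ trees shallow.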
22
Differentiate the three concepts
Data records Actual data Data entries Records in a index file Content of data entries Alternative 1: data entries are data records Alternative 2 & 3: data entries contain search keys and rids Index entries Specific to B+ tree index file Are on the non-leaf nodes(pages) Contain search keys and pointers 2. Overview CS522_S16
23
Index classification
Primary vs. secondary
  If the search key contains the primary key, the index is called a primary index
  Unique index: the search key contains a candidate key
Clustered vs. unclustered
  If the order of data records is the same as, or ‘close to’, the order of data entries, the index is called a clustered index
  Alternative 1 implies clustered
  A file can be clustered on at most one search key
  The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not
24
Clustered vs. unclustered index
Suppose that Alternative 2 is used for data entries and that the data records are stored in a heap file
  To build a clustered index, first sort the heap file
  Overflow pages may be needed for inserts
[Figure: CLUSTERED and UNCLUSTERED indexes side by side. In both, index entries direct the search for data entries; the data entries (index file) point to the data records (data file).]
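Why clustering changes the retrieval cost can be seen by counting the distinct data pages that hold the records matching a range selection. The sketch assumes 20 records with keys 0 to 19, four records per page, and a fixed scrambled order standing in for an unclustered heap file. All names and numbers are illustrative.

```python
PAGE_CAPACITY = 4  # assumed number of data records per page
N = 20             # number of data records, with keys 0..19


def pages_touched(file_order, lo, hi):
    """Distinct data pages holding records with lo <= key <= hi."""
    return {pos // PAGE_CAPACITY for pos, key in enumerate(file_order) if lo <= key <= hi}


# Clustered: the data file is sorted on the search key.
clustered_file = list(range(N))

# Unclustered: records sit in an unrelated order (a fixed permutation, for repeatability).
unclustered_file = [0] * N
for key in range(N):
    unclustered_file[(key * 7) % N] = key

print(pages_touched(clustered_file, 4, 7))
print(pages_touched(unclustered_file, 4, 7))
```

The four matching records share one page in the clustered file but are spread over three pages in the unclustered one. Without buffering, an unclustered index can cost up to one page I/O per matching record.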