1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.

Slides:

Advertisements

Similar presentations

Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 How index-learning turns no student pale Yet holds.

Advertisements

File Organizations and Indexing Chapter 8

Overview of Storage and Indexing

File Organizations and Indexing Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,

Introduction to Database Systems1 Records and Files Storage Technology: Topic 3.

Overview of Storage and Indexing

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.

1 Overview of Storage and Indexing Chapter 8 (part 1)

1 File Organizations and Indexing Module 4, Lecture 2 “How index-learning turns no student pale Yet holds the eel of science by the tail.” -- Alexander.

File Organizations and Indexing R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears, Roebuck,

Murali Mani Overview of Storage and Indexing (based on slides from Wisconsin)

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.

File Organizations and Indexing Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,

1 Overview of Storage and Indexing Chapter 8 1. Basics about file management 2. Introduction to indexing 3. First glimpse at indices and workloads.

DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.

Layers of a DBMS Query optimization Execution engine Files and access methods Buffer management Disk space management Query Processor Query execution plan.

Storage and Indexing February 26 th, 2003 Lecture 19.

Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8.

Overview of File Organizations and Indexing Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.

File Organizations and Indexing Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,

Database Systems An Overview of Storage and Indexing

1 IT420: Database Management and Organization Storage and Indexing 14 April 2006 Adina Crăiniceanu

Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 “How index-learning turns no student pale Yet holds.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Queries, Database Design, Constraint Enforcement Specify Schema + specify constraints.

1 Overview of Storage and Indexing Chapter 8 (part 1)

Storage and Indexing1 Overview of Storage and Indexing.

1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet holds the eel of science by the tail.” -- Alexander Pope ( )

DATABASE MANAGEMENT SYSTEMS TERM B. Tech II/IT II Semester UNIT-VIII PPT SLIDES Text Books: (1) DBMS by Raghu Ramakrishnan (2) DBMS by Sudarshan.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.

1 Overview of Storage and Indexing Chapter 8. 2 Data on External Storage  Disks: Can retrieve random page at fixed cost  But reading several consecutive.

Overview of Storage and Indexing Content based on Chapter 4 Database Management Systems, (Third Edition), by Raghu Ramakrishnan and Johannes Gehrke. McGraw.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.

1 Database Systems ( 資料庫系統 ) October 4, 2004 Lecture #4 By Hao-hua Chu ( 朱浩華 )

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “If you don’t find it in the index, look very.

1 Database Systems ( 資料庫系統 ) October 29/31, 2007 Lecture #6.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.

File Organizations and Indexing

Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.

Indexing. 421: Database Systems - Index Structures 2 Cost Model for Data Access q Data should be stored such that it can be accessed fast q Evaluation.

Data on External Storage – File Organization and Indexing – Cluster Indexes - Primary and Secondary Indexes – Index data Structures – Hash Based Indexing.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.

1 Clustered vs. Unclustered Index Index entries Data entries direct search for (Index File) (Data file) Data Records data entries Data entries Data Records.

Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 Jianping Fan Dept of Computer Science UNC-Charlotte.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.

1 Overview of Storage and Indexing Chapter 8. 2 Review: Architecture of a DBMS  A typical DBMS has a layered architecture.  The figure does not show.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “If you don’t find it in the index, look very.

CS222: Principles of Data Management Lecture #4 Catalogs, Buffer Manager, File Organizations Instructor: Chen Li.

CS522 Advanced database Systems

Record Storage, File Organization, and Indexes

Pertemuan <<6>> Tempat Penyimpanan Data dan Indeks

CS222P: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.

Overview of Storage and Indexing

File Organizations and Indexing

File Organizations and Indexing

Overview of Storage and Indexing

Overview of Storage and Indexing

CS222/CS122C: Principles of Data Management Lecture #4 Catalogs, File Organizations Instructor: Chen Li.

Overview of Storage and Indexing

CS222/CS122C: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.

Storage and Indexing.

CS222p: Principles of Data Management Lecture #4 Catalogs, File Organizations Instructor: Chen Li.

General External Merge Sort

Overview of Storage and Indexing

CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #05 Index Overview and ISAM Tree Index Instructor: Chen Li.

CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #04 Schema versioning and File organizations Instructor: Chen Li.

Overview of Storage and Indexing

CS222P: Principles of Data Management UCI, Fall 2018 Notes #04 Schema versioning and File organizations Instructor: Chen Li.

Presentation transcript:

1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke

2 DBMS Architecture Disk Space Manager DB Access Methods Buffer Manager Query Parser Query Rewriter Query Optimizer Query Executor Lock Manager Log Manager

3 Data on External Storage  Disks: Can retrieve random pages at (almost) fixed cost  But reading several consecutive pages is much faster than reading them in random order.  Tapes: Can only read pages in sequence  Slower but cheaper than disks; used for archival storage  Page: Unit of information read from or written to disk  Size of page: DBMS parameter, 4KB, 8KB  Disk space manager:  Abstraction: a collection of pages.  Allocate/de-allocate a page.  Read/write a page.  Page I/O:  Pages read from disk and pages written to disk  Dominant cost of database operations

4 Access Methods  Access methods: routines to manage various disk-based data structures.  Files of records  Various kinds of indexes  File of records:  Important abstraction of external storage in a DBMS!  Record id (rid) is sufficient to physically locate a record  Indexes:  Auxiliary data structures  Given a value in the index search key field(s), find the record ids of records with this value

5 Buffer Management  Architecture:  Data is read into memory for processing  Data is written to disk for persistent storage  Buffer manager stages pages between external storage and main memory buffer pool.  Access method layer makes calls to the buffer manager.

6 Access Methods Many alternatives exist, each ideal for some situations, and not so good for others:  Heap (unorder) files: Suitable when typical access is a file scan retrieving all records.  Sorted Files: Best if records are retrieved in order of the sorting attribute(s), or only a `range’ of records is needed.  Indexes: Data structures to organize records via trees or hashing. Like sorted files, they speed up searches for a subset of records, based on values in certain (“search key”) fields Updates are much faster than in sorted files.

7 Indexes  An index on a file speeds up selections on the search key field(s) for the index.  Any subset of the fields of a relation can be the search key for an index on the relation.  Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).  An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.  Data entry (in an index) versus data record (in a file).  Given a data entry k*, we can find record(s) with the key k in at most one disk I/O. (Details soon …)

8 B+ Tree Indexes  Leaf pages contain data entries : Leaf pages are chained using prev & next pointers Data entries are sorted by the search key value Non-leaf Pages (Sorted by search key) Leaf

9 B+ Tree Indexes  Non-leaf pages have index entries; only used to direct searches P 0 K 1 P 1 K 2 P 2 K m P m index entry Non-leaf Pages (Sorted by search key) Leaf

10 Example B+ Tree  Equality selection: find 28*? 29*?  Range selection: find all > 15* and < 30*  Insert/delete: Find the data entry in a leaf, then change it.  Need to adjust the parent sometimes.  And the change sometimes bubbles up the tree. 2*3* Root *16* 33*34* 38* 39* 135 7*5*8*22*24* 27 27*29* Entries <= 17Entries > 17 Data entries in leaf nodes are sorted.

11 Hash-Based Indexes  Good for equality selections.  A hash index is a collection of buckets.  Bucket = primary page plus zero or more overflow pages.  Buckets contain data entries. h … … … … … … N-1 Search key value k  Hashing function h : h ( k ) = bucket of data entries that may contain the search key value k.  No need for “index entries” in this scheme.

12 Alternatives for Data Entry k* in Index  In a data entry k* we can store:  Alt1: Data record with key value k  Alt2:  Alt3:  Choice of an alternative for data entries is orthogonal to an indexing technique used.  Indexing techniques: B+ tree, hashing  Every indexing technique can be combined with any alternative.

13 Alternative 1 for Data Entries  Alternative 1:  Index structure itself is a file organization for data records (instead of a Heap or sorted file).  At most one index on a given collection of data records can use Alternative 1. Otherwise, data records are duplicated, leading to redundant storage and potential inconsistency.  If data records are very large, # of pages containing data entries is high. Size of the index structure to direct searches (e.g., non-leaf nodes in B+ tree) can also be large.

14 Alternatives 2, 3 for Data Entries  Alternatives 2 and 3:  Data entries,, typically much smaller than data records. Index structure used to direct search is much smaller than with Alternative 1. So, better than Alternative 1 with large data records, especially if search keys are small.  Alternative 3 more compact than Alternative 2 In the presence of multiple K* entries!  Alternative 3 leads to variable sized data entries, even if search keys are of fixed length. Variable sizes records/data entries are hard to manage.

15 Index Classification  Primary index vs. secondary index:  If search key contains the primary key, then called primary index.  Other indexes are called secondary indexes.  Unique index: Search key contains a candidate key.  No data entries can have the same search key value.  Recall: Key? Primary key? Candidate key?

16 Index Classification (Contd.)  Clustered index: order of data records is the same as, or `close to’, order of (sorted) data entries.  Alternative 1 implies clustered  Alternatives 2 and 3 are clustered only if data records are sorted on the search key field.  A data file can be clustered on at most one search key.  Retrieving data records: fast, with one or a few I/Os. Why?  Unclustered index:  Multiple unclustered indexes on a data file.  Retrieving data records: can be slow, toughing many pages.

17 Clustered vs. Unclustered Indexes  Suppose that Alternative (2) is used for data entries, and data records are stored in a Heap file.  To build a clustered index, first sort the Heap file (with some free space on each page for future inserts).  Overflow pages may be needed for inserts. (Thus, order of data records is `close to’, but not identical to, the sort order.)

18 Index Classification (Contd.)  Dense index : contains (at least) one data entry for every search key value that appears in a record in the indexed file  Alternatives 1, 2, 3  Sparse index : contains one entry for each page of records in the indexed file.  Sparse index has to be clustered!  Alternative 1?  Alternative 2?  Alternative 3?

19 Cost Model for Our Analysis We ignore CPU costs, for simplicity:  B: The number of data pages  R: Number of records per page  D: (Average) time to read or write disk page  I/O cost: Measuring # of page I/O’s ignores pre-fetching of a sequence of pages; thus, approximate I/O cost.  Average-case analysis, based on simplistic assumptions. * Good enough to show the overall trends!

20 Comparing File Organizations  Heap files (random order; insert at eof)  Sorted files, sorted on  Clustered B+ tree file, Alternative (1), search key  Unclustered B + tree index on a heap file, search key  Unclustered hash index on a heap file, search key

21 Operations to Compare  Scan: Fetch all records from disk  Equality search: e.g. (age, sal) = (24, 50K)  Range selection: e.g. (age, sal)  [(22, 20K), (30, 200K)]  Insert a record  Delete a record

22 Assumptions in Our Analysis  Heap Files:  Equality selection on key; exactly one match.  Sorted Files:  Files compacted after deletions.  Indexes:  Alt (2), (3): size of data entry = 10% size of data record  Hash: No overflow chains. 80% page occupancy => File size = 1.25 data size  B+Tree: 67% occupancy (typical): implies file size = 1.5 data size Balanced with fanout F (133 typical) at each non-leaf level

23 Assumptions (contd.)  Scans:  Leaf nodes of a tree-index are chained.  Index data entries plus actual file scanned for unclustered indexes.  Range searches:  We use tree indexes to restrict the set of data records fetched, but ignore hash indexes.

24 Cost of Operations * Several assumptions underlie these (rough) estimates!

25 Cost of Operations * Several assumptions underlie these (rough) estimates! # of leaf pages: B*0.1/67%=0.15B; Cost of index I/O: 0.15BD # of data entries on a leaf page: R*10*0.67=6.7R; Cost of data I/O: 0.15B*6.7R*D=BDR Key: unclustered B+tree, one I/O per data entry! # of data entry pages: B*0.1/80%=0.125B; Cost of index I/O: 0.125BD # of data entries on a hash page: R*10*0.80=8R; Cost of data I/O: 0.125B*8R*D=BDR Key: unclustered hash index, one I/O per data entry!

26 Summary  Many alternative access methods exist, each appropriate in some situation.  Index is a collection of data entries plus a way to quickly find entries with given key values.  If selection queries are frequent, building an index is important.  Hash-based indexes only good for equality search.  Tree-based indexes best for range search; also good for equality search.

27 Summary (Contd.)  Data entries can be actual data records, pairs, or pairs.  Choice orthogonal to indexing technique used to locate data entries with a given key value.  Can have several indexes on a given file of data records, each with a different search key.  Indexes can be classified as clustered vs. unclustered, primary vs. secondary, etc. Differences have important consequences for utility/performance.

28 Questions