Indexing.


Indexing

Physical Disk Structure

Disk Example: Four platters provide eight surfaces, with 2^13 = 8192 tracks per surface, 2^8 = 256 sectors per track, and 2^9 = 512 bytes per sector. A sector is the physical unit of storage, while a block is the logical unit and depends on the DBMS. A block is typically one or more sectors in size.
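The geometry above determines the total capacity of the disk. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative and do not come from the slides.

```python
# Disk geometry taken from the example above.
SURFACES = 8                   # four platters, two surfaces each
TRACKS_PER_SURFACE = 2 ** 13   # 8192
SECTORS_PER_TRACK = 2 ** 8     # 256
BYTES_PER_SECTOR = 2 ** 9      # 512


def disk_capacity_bytes():
    """Multiply out the geometry to get the raw capacity in bytes."""
    return (SURFACES * TRACKS_PER_SURFACE
            * SECTORS_PER_TRACK * BYTES_PER_SECTOR)


capacity = disk_capacity_bytes()
print(capacity)             # 2^33 bytes
print(capacity // 2 ** 30)  # the same capacity in units of 2^30 bytes
```

Under these numbers the disk holds 2^3 * 2^13 * 2^8 * 2^9 = 2^33 bytes, that is, 8 units of 2^30 bytes.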

Disk Access Characteristics: To read a block, the head has to move to the track containing the block (seek time), the block has to rotate under the head (rotational latency), and the data has to be transferred (transfer time, negligible by comparison).

I/O model of computation: If a block needs to be moved between disk and main memory, the time taken to perform the read or write is much larger than the time likely to be spent manipulating the data in main memory. Thus the number of blocks accessed is a good approximation of the time needed by the algorithm and should be minimized. The goal is to minimize seek time plus rotational latency.

Sorting data in Secondary Storage: Suppose a relation consists of 10,000,000 records of 100 bytes each, approximately 1 gigabyte in total, and we want to sort it on a "key". Assume 50 MB of main memory is available and a disk block is 4096 bytes. With 40 records per block, the relation takes 250,000 blocks, and 12,800 blocks fit in memory.
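The block counts in this example follow from a few divisions. The sketch below reproduces them under two assumptions that the slide does not state: records do not span block boundaries, and 50 MB means 50 * 2^20 bytes.

```python
RECORDS = 10_000_000
RECORD_BYTES = 100
BLOCK_BYTES = 4096
MEMORY_BYTES = 50 * 2 ** 20   # assumption: 1 MB = 2^20 bytes

# Assumption: a record never spans two blocks.
records_per_block = BLOCK_BYTES // RECORD_BYTES
# Ceiling division: number of blocks needed to hold every record.
relation_blocks = -(-RECORDS // records_per_block)
memory_blocks = MEMORY_BYTES // BLOCK_BYTES

print(records_per_block)  # 40
print(relation_blocks)    # 250000
print(memory_blocks)      # 12800
```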

Sorting, continued: If the data fits in main memory, the fastest sorting algorithms are variants of Quicksort. When the data does not fit, the preferred method is the one that minimizes the number of times a block is brought into main memory.

Merge-Sort example

Step    List 1     List 2     Output
Start   1,3,4,9    2,5,7,8    none
1)      3,4,9      2,5,7,8    1
2)      3,4,9      5,7,8      1,2
3)      4,9        5,7,8      1,2,3
4)      9          5,7,8      1,2,3,4
5)      9          7,8        1,2,3,4,5
6)      9          8          1,2,3,4,5,7
7)      9          none       1,2,3,4,5,7,8
8)      none       none       1,2,3,4,5,7,8,9
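The trace above is the standard two-way merge of sorted lists. A minimal sketch in Python (the function name merge is chosen here for illustration):

```python
def merge(list1, list2):
    """Merge two sorted lists by repeatedly taking the smaller head element."""
    output = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            output.append(list1[i])
            i += 1
        else:
            output.append(list2[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    output.extend(list1[i:])
    output.extend(list2[j:])
    return output


print(merge([1, 3, 4, 9], [2, 5, 7, 8]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```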

Two-Phase Multiway Merge Sort: Phase 1: Sort main-memory-sized pieces of the data and store them as sorted lists. Fill all of memory with blocks from the original list, sort them, and write the sorted list back to disk. Time: each of the 250,000 blocks is read once and written once (500,000 block I/Os); at 15 milliseconds per block I/O this takes 7500 seconds, or 125 minutes. Phase 2: Merge all the sorted lists into a single sorted list.
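Phase 1 and its cost can be sketched as follows. The function make_sorted_runs is a hypothetical in-memory stand-in for reading a memory-sized chunk from disk, sorting it, and writing it back; the block counts come from the sorting example, and the count of sorted lists is derived from them rather than stated on the slides.

```python
import math


def make_sorted_runs(records, memory_capacity):
    """Phase 1: sort memory-sized chunks; each chunk becomes one sorted list."""
    return [sorted(records[i:i + memory_capacity])
            for i in range(0, len(records), memory_capacity)]


RELATION_BLOCKS = 250_000
MEMORY_BLOCKS = 12_800
MS_PER_BLOCK_IO = 15

# Each block is read once and written once during phase 1.
phase1_seconds = 2 * RELATION_BLOCKS * MS_PER_BLOCK_IO // 1000
phase1_minutes = phase1_seconds / 60
num_sorted_lists = math.ceil(RELATION_BLOCKS / MEMORY_BLOCKS)

print(phase1_seconds, phase1_minutes)  # 7500 125.0
print(num_sorted_lists)                # 20
```

With about 20 sorted lists and 12,800 memory blocks, one input buffer per list plus one output buffer fits in memory easily, which is what allows phase 2 to finish in a single pass.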

Phase 2: If we used the main-memory merge, we would have to read the data 2 log(n) times, where n is the number of sorted lists. The common strategy is instead: (1) Bring the first block of each sorted list into memory and reserve one output buffer. (2) Find the smallest key among the remaining keys in all lists. (3) Move the smallest element to the first available position in the output buffer. (4) If the output block is full, write the buffer to disk and reinitialize it to empty. (5) If the block from which the smallest element was taken is now exhausted, bring the next block from the same sorted list into its buffer. The cost of phase 2 is again 125 min, so the total cost is 250 min.
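The phase 2 strategy can be sketched in memory as below. This is an illustration under stated assumptions, not the slides' implementation: sorted lists are plain Python lists cut into blocks of block_size keys, "reading a block" is simulated by a generator, and a binary heap (heapq) is used to find the smallest remaining key, where the slides only say to find the smallest key.

```python
import heapq


def multiway_merge(runs, block_size):
    """Merge sorted lists block by block, yielding full output blocks."""
    def blocks(run):
        # Simulates reading one block at a time from a sorted list on disk.
        for start in range(0, len(run), block_size):
            yield run[start:start + block_size]

    readers = [blocks(run) for run in runs]
    heap = []
    for idx, reader in enumerate(readers):
        block = next(reader, None)          # first block of each list
        if block:
            heap.append((block[0], idx, 0, block))
    heapq.heapify(heap)

    output = []
    while heap:
        key, idx, pos, block = heapq.heappop(heap)   # smallest remaining key
        output.append(key)
        if len(output) == block_size:       # output buffer full: "write" it
            yield output
            output = []
        pos += 1
        if pos == len(block):               # input block exhausted: refill
            block = next(readers[idx], None)
            pos = 0
        if block:
            heapq.heappush(heap, (block[pos], idx, pos, block))
    if output:                              # flush the last partial block
        yield output


merged = list(multiway_merge([[1, 3, 4, 9], [2, 5, 7, 8], [0, 6]], 3))
print(merged)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Each input block is brought in exactly once and each output block is written exactly once, which is why phase 2 costs the same number of block I/Os as phase 1.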

Motivation: SQL is declarative. How should the following queries be processed?
SELECT * FROM R
SELECT * FROM R WHERE R.A = '10'

Index Function: [Figure: a value is passed to the index, which locates the blocks holding records and returns the matching records.]

Book Analogy: Just remember the index of a book. An index is a set of pages (a separate file) with pointers (page numbers) to the data pages that contain the value. Also note the difference between a "search key" and a "primary key": a search key is whatever attribute the index is built on, and it need not be unique.

Types of Indexes: Simple indexes on sorted files; secondary indexes on unsorted files; B-trees on any type of file.

Sequential File: A file sorted on the attribute(s) of the index. Very useful when the search attribute is the primary key. Build a dense index on the file. It is called dense because every key from the data file is represented in the index. Note that the index contains only the key and a pointer into the data file for each record, and thus is usually much smaller than the data file.

Example: Suppose a relation has 1,000,000 records. A block is 4096 bytes and 10 records fit in one block, so the data occupies 100,000 blocks, more than 400 MB. An index entry takes 12 bytes per record. Conservatively assume that only 100 index entries fit in a block (4096 / 12 would allow up to 341). The index then fits in 10,000 blocks (40 MB). Since log2(10,000) is about 13.3, a binary search on the index takes up to 14 I/Os per lookup.
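The lookup cost in this example can be checked by counting how many index blocks a binary search touches. The sketch below uses the example's figure of 100 index entries per block; the function probes_to_find is a hypothetical helper that counts block reads rather than performing any I/O.

```python
RECORDS = 1_000_000
RECORDS_PER_DATA_BLOCK = 10
ENTRIES_PER_INDEX_BLOCK = 100   # figure used in the example above
BLOCK_BYTES = 4096

data_blocks = RECORDS // RECORDS_PER_DATA_BLOCK     # 100000
index_blocks = RECORDS // ENTRIES_PER_INDEX_BLOCK   # 10000
index_bytes = index_blocks * BLOCK_BYTES            # 40960000, about 40 MB


def probes_to_find(num_blocks, target_block):
    """Count the blocks a binary search reads before it hits target_block."""
    lo, hi, probes = 0, num_blocks - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                 # one block read per probe
        if mid == target_block:
            return probes
        if mid < target_block:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes


worst_case = max(probes_to_find(index_blocks, t) for t in range(index_blocks))
print(worst_case)  # 14
```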

Sparse Index: Instead of one index record per data record, use one index record per block of data records. This is called a sparse index. Suppose the query is "Is there a record with key value K?" With a dense index, checking the index alone answers the query. With a sparse index, a data block has to be retrieved.
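The difference between the two lookups can be sketched as follows. The data blocks and keys are made up for illustration, and the standard-library bisect module stands in for the search within the index.

```python
from bisect import bisect_right

# Hypothetical sorted data file: four blocks of two keys each.
data_blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]

# Dense index: one entry per record (key -> block number).
dense_index = {key: b for b, block in enumerate(data_blocks) for key in block}

# Sparse index: one entry per block (the first key stored in that block).
sparse_index = [block[0] for block in data_blocks]


def dense_contains(key):
    """Answered from the index alone; no data block is read."""
    return key in dense_index


def sparse_contains(key):
    """Find the one block that could hold the key, then read that block."""
    pos = bisect_right(sparse_index, key) - 1
    if pos < 0:
        return False              # key is smaller than every key in the file
    return key in data_blocks[pos]   # this is the extra data-block read


print(dense_contains(40), sparse_contains(40))  # True True
print(dense_contains(45), sparse_contains(45))  # False False
```

The sparse index here has 4 entries against 8 for the dense index, at the price of one data-block read per existence check.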

Secondary Indexes: When you go to a library, the books are sorted by call number. The call-number ranges posted on the shelves are like a sparse index. Now what if you want to search by "last name" and not by call number? You build a secondary structure on the books. A secondary structure does not determine or influence the placement of the data records. You can have several secondary structures on one file.

Example:
SELECT books.title FROM books WHERE books.lastname = 'Codd'
Create the secondary index with the SQL statement
CREATE INDEX LNIndex ON Books (lastname)
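What such a secondary index holds can be sketched in memory. The records, call numbers, and titles below are invented for illustration, and build_secondary_index is a hypothetical helper, not a DBMS API.

```python
from collections import defaultdict

# Hypothetical file, stored in call-number order (all values are made up).
books = [
    {"call_number": "QA1", "lastname": "Codd", "title": "Title A"},
    {"call_number": "QA2", "lastname": "Date", "title": "Title B"},
    {"call_number": "QA3", "lastname": "Codd", "title": "Title C"},
]


def build_secondary_index(records, field):
    """Map each value of `field` to the positions of the records holding it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


ln_index = build_secondary_index(books, "lastname")
titles = [books[pos]["title"] for pos in ln_index["Codd"]]
print(titles)  # ['Title A', 'Title C']
```

The index leaves the records where they are: the file stays in call-number order, and the matching records for one last name may sit in different places in the file.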

Question? Can a secondary index be sparse?

B-Trees: B-Trees are the most commonly used indexing structure in commercial systems. Several variants are available, but the most popular is called the B+ tree. Roughly speaking: ordered array (search: O(log n), update: O(n)); linked list (search: O(n), update: O(1)); tree (search: O(log n), update: O(log n)).

Structure of B-Tree: A B-tree organizes its blocks into a tree. The tree is balanced: all paths from a leaf to the root have the same length. There are three types of nodes: root, interior, and leaf. Associated with each tree is a layout parameter n (each block holds n search keys and n+1 pointers).

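Example: [Figure: an interior node and a leaf node. The interior node holds keys 20, 31, 52; its four pointers lead to keys K < 20, to keys 20 <= K < 31, to keys 31 <= K < 52, and to keys K >= 52. The leaf node holds keys 57, 81, 95; its pointers lead to the record with key 57, the record with key 81, the record with key 95, and the next leaf in sequence.] Suppose the block size is 4096 bytes, each key is 4 bytes, and each pointer is 8 bytes. We want 4*n + 8*(n+1) <= 4096, which gives n = 340.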
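Both parts of this example can be sketched in a few lines: solving the inequality for n, and choosing which pointer of an interior node to follow for a search key K. The function names are illustrative; bisect_right from the standard library implements the "20 <= K < 31" style of routing shown in the figure.

```python
from bisect import bisect_right


def max_keys(block_bytes, key_bytes, pointer_bytes):
    """Largest n with key_bytes*n + pointer_bytes*(n+1) <= block_bytes."""
    return (block_bytes - pointer_bytes) // (key_bytes + pointer_bytes)


def child_index(keys, k):
    """Index of the pointer to follow in an interior node for search key k."""
    return bisect_right(keys, k)


print(max_keys(4096, 4, 8))  # 340

interior_keys = [20, 31, 52]
print(child_index(interior_keys, 19))  # 0: keys K < 20
print(child_index(interior_keys, 20))  # 1: keys 20 <= K < 31
print(child_index(interior_keys, 51))  # 2: keys 31 <= K < 52
print(child_index(interior_keys, 57))  # 3: keys K >= 52
```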

Example: [Figure: a B-tree. Keys as they appear on the slide, top level first: 13; 7; 23, 31, 43; 2, 3, 5, 11, 17, 19, 29, 37, 41, 47.]

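[Figure: a B-tree with duplicate keys. Keys as they appear on the slide, top level first: 17; 7, -; 37, 43; 2, 3, 5, 7, 13, 13, 17, 23, 23, 23, 23, 37, 41, 43, 47.]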

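Efficiency of B-Trees: Earlier we saw that a block holds up to 340 key-pointer pairs. Assume the average block has an occupancy midway between half full (about 170) and full (340), that is, about 255. Then there is 1 root block, 255 child nodes, and 255 * 255 = 65,025 leaf nodes. The leaves together point to 255^3, about 16.6 million, records. Thus at most 4 I/Os are needed to access any record: three for the index levels and one for the record itself.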
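The capacity estimate is a matter of raising the fan-out to the number of levels. A minimal sketch, assuming every node at every level has the same average fan-out of 255 (the names are illustrative):

```python
AVERAGE_POINTERS_PER_BLOCK = 255
LEVELS = 3   # root, interior, leaf


def reachable_records(pointers_per_block, levels):
    """Records reachable when every node has the same fan-out."""
    return pointers_per_block ** levels


leaf_nodes = AVERAGE_POINTERS_PER_BLOCK ** (LEVELS - 1)
records = reachable_records(AVERAGE_POINTERS_PER_BLOCK, LEVELS)
ios_per_lookup = LEVELS + 1   # one block per level, plus the data record

print(leaf_nodes)      # 65025
print(records)         # 16581375
print(ios_per_lookup)  # 4
```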

B-Tree Animation Go to http://slady.cz/java/bt/ to see B-Tree animation.