M.P. Johnson, DBMS, Stern/NYU, Sp2004. C20.0046: Database Management Systems, Lecture #25. Matthew P. Johnson, Stern School of Business, NYU, Spring 2004.


Agenda
- Previously: hardware and sorting
- Next:
  - Indices
  - Failover/recovery
  - Data warehousing and mining
- Web search
- Hw3 due Thursday → no extensions!
- 1-minute responses
- XML links up

Let's get physical
[Figure: DBMS architecture. The user/application sends queries and updates to the query compiler/optimizer, which hands a query execution plan to the execution engine. The engine sends record and index requests to the index/record manager, which sends page commands to the buffer manager, which reads and writes pages through the storage manager to storage. The transaction manager (concurrency control, logging/recovery) receives transaction commands.]

Hardware/memory review
- DBs won't fit in RAM
- Disk access is O(100,000) times slower than RAM
- RAM model of computation
  - Single ops cost about the same as a single memory access
- I/O model of computation
  - We read/write one block (4k) at a time
  - Measure time in number of disk accesses
  - Ignore processor operations: O(100,000) times faster
- Regular mergesort
  - Divide in half each time and recurse

Hardware/memory review
- Big problem: how to sort 1GB with 1MB of RAM?
  - Can use mergesort (MS), but must read/write all data 19+ times
- Soln: TPMMS (external mergesort)
  1. Sort data in 1MB chunks
  2. Merge 249 of the chunks into a 249MB chunk
  3. Merge 249 of the 249MB chunks…
- Each iteration: new chunk size = (RAM size/block size - 1) * last chunk size

External Merge-Sort
- Phase one: load 1MB into memory, sort
  - Result: SIZE/M lists of length M bytes (1MB)
[Figure: M bytes of main memory holding M/R records, loaded from and written back to disk.]

Phase Two
- Merge M/B - 1 lists into a new list
  - M/B - 1 = 1MB/4KB - 1 = 250 - 1 = 249
- Result: lists of size M * (M/B - 1) bytes
  - 249 * 1MB ~= 250MB
[Figure: M bytes of main memory holding input buffers Input 1, Input 2, ..., Input M/B and one output buffer, reading from and writing to disk.]
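The two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's implementation: the function name `external_sort` and its parameters `run_size` (how many values fit in memory at once) and `fan_in` (how many lists are merged at a time, playing the role of M/B - 1) are hypothetical, values are integers stored one per line in temporary files, and the "disk" is just a temp directory.

```python
import heapq
import os
import shutil
import tempfile


def external_sort(values, run_size, fan_in):
    """Sort integers with sorted runs on disk plus repeated multiway merges.

    run_size and fan_in are illustrative stand-ins for M and M/B - 1.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    tmpdir = tempfile.mkdtemp()
    counter = 0

    def write_run(items):
        # Write an already-sorted stream of integers to a new file.
        nonlocal counter
        path = os.path.join(tmpdir, f"run{counter}.txt")
        counter += 1
        with open(path, "w") as f:
            for v in items:
                f.write(f"{v}\n")
        return path

    def read_run(path):
        # Stream a run back one value at a time (never the whole file).
        with open(path) as f:
            for line in f:
                yield int(line)

    # Phase one: fill "memory", sort it, write a sorted run.
    runs, buf = [], []
    for v in values:
        buf.append(v)
        if len(buf) == run_size:
            runs.append(write_run(sorted(buf)))
            buf = []
    if buf:
        runs.append(write_run(sorted(buf)))

    # Phase two and later: merge fan_in runs at a time until one remains.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), fan_in):
            group = [read_run(p) for p in runs[i:i + fan_in]]
            merged.append(write_run(heapq.merge(*group)))
        runs = merged

    result = list(read_run(runs[0])) if runs else []
    shutil.rmtree(tmpdir)
    return result
```

For example, `external_sort(data, run_size=50, fan_in=4)` on 1,000 values writes 20 runs, then merges them 20 → 5 → 2 → 1. The final `list(...)` is there only so the result is easy to inspect; a real external sort would leave the output on disk.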

Phase Three
- Merge M/B - 1 lists into a new list
- Result: lists of size M * (M/B - 1)^2 bytes
  - 249 * 250MB ~= 62,500MB = 62.5GB
[Figure: M bytes of main memory holding input buffers Input 1, Input 2, ..., Input M/B and one output buffer, reading from and writing to disk.]
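The capacity arithmetic for the phases can be checked with a short calculation. The sketch below assumes decimal units (1MB = 1,000,000 bytes, 4KB = 4,000 bytes), which is what the slides' figure of 250 blocks implies; the function names are hypothetical.

```python
M = 1_000_000  # bytes of main memory (assumed decimal 1MB)
B = 4_000      # bytes per block (assumed decimal 4KB)


def max_sortable_bytes(mem, block, merge_phases):
    """Largest sorted list after phase one plus the given number of merge phases."""
    fan_in = mem // block - 1  # one buffer is reserved for output
    return mem * fan_in ** merge_phases


def merge_phases_needed(size, mem, block):
    """Number of merge phases (after phase one) needed to sort `size` bytes."""
    fan_in = mem // block - 1
    phases, capacity = 0, mem
    while capacity < size:
        capacity *= fan_in
        phases += 1
    return phases


fan_in = M // B - 1                      # 249
after_phase_two = max_sortable_bytes(M, B, 1)    # 249,000,000 bytes, about 250MB
after_phase_three = max_sortable_bytes(M, B, 2)  # 62,001,000,000 bytes, about 62GB
```

Under these assumptions a 1GB file needs two merge phases, since one merge phase tops out at about 249MB.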

Next topic: File organization 1
- Heap files: unordered list of rows
  - One damn row after another.
- All-row queries are easy:
  - SELECT * FROM T;
- Insert is easy: just add to end
- Unique/subset queries are hard:
  - Must test each row

File organization 2
- Sorted file: sort rows on some fields
- Since the datafile is likely to be large, must use an external sort like external MS
- Equality, range select now easier:
  - Do binary search to find first
  - Walk through rows until one fails test
- Insert, delete now hard
  - Must move avg of half the rows forward or back
- Possible solns:
  - Leave empty space
  - Use “overflow” pages
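The "binary search to find first, then walk" idea can be sketched with Python's standard `bisect` module. This is an in-memory illustration only: the function name `range_select` and the representation of rows as `(key, payload)` tuples are assumptions, and a real sorted file would binary-search over disk blocks rather than a list.

```python
from bisect import bisect_left


def range_select(sorted_rows, lo, hi):
    """Return rows with lo <= key <= hi from rows sorted by key.

    Rows are (key, payload) tuples; an equality select is lo == hi.
    """
    # (lo,) sorts before every (lo, payload), so this finds the first row
    # whose key is >= lo.
    i = bisect_left(sorted_rows, (lo,))
    matches = []
    # Walk forward until a row fails the test.
    while i < len(sorted_rows) and sorted_rows[i][0] <= hi:
        matches.append(sorted_rows[i])
        i += 1
    return matches
```

With rows `[(1, 'a'), (3, 'b'), (3, 'c'), (5, 'd'), (8, 'e')]`, `range_select(rows, 3, 5)` touches only the three matching rows plus the one that ends the walk.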

Modifications
- Insert:
  - File is unsorted → easy
  - File is sorted: is there space in the right block?
    - Then store it there
  - If anything else fails, create overflow block
- Delete:
  - Free space in block
  - May be able to eliminate an overflow block
  - If not, use a tombstone (null record)
- Update:
  - New rec is shorter than prev. → easy
  - If it's longer, need to shift records, create overflow blocks
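The insert case for a sorted file can be sketched as follows. This is a simplified illustration, not a full storage manager: the class name `SortedFile`, the constant `BLOCK_CAPACITY`, and the choice to start with half-full blocks are all assumptions, the file must start non-empty, and deletes, tombstones, and updates are left out.

```python
from bisect import bisect_right, insort

BLOCK_CAPACITY = 4  # hypothetical number of keys per block


class SortedFile:
    def __init__(self, sorted_keys):
        # Leave blocks half full so later inserts have room ("leave empty space").
        half = BLOCK_CAPACITY // 2
        self.blocks = [list(sorted_keys[i:i + half])
                       for i in range(0, len(sorted_keys), half)]
        self.overflow = {}  # block number -> keys in its overflow block

    def _block_for(self, key):
        # Last block whose first key is <= key (block 0 if key precedes all).
        firsts = [b[0] for b in self.blocks]
        return max(bisect_right(firsts, key) - 1, 0)

    def insert(self, key):
        n = self._block_for(key)
        if len(self.blocks[n]) < BLOCK_CAPACITY:
            insort(self.blocks[n], key)                   # space in the right block
        else:
            self.overflow.setdefault(n, []).append(key)   # create/use overflow block

    def contains(self, key):
        n = self._block_for(key)
        return key in self.blocks[n] or key in self.overflow.get(n, [])
```

Starting from `SortedFile([10, 20, 30, 40])`, inserting 15 and 12 fills block 0, and inserting 17 then lands in block 0's overflow block. Every lookup in that key range now has to check the overflow block too, which is why the file eventually needs reorganizing.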

Overflow Blocks
- After a while the file starts being dominated by overflow blocks: time to reorganize
[Figure: Block n-1, Block n, Block n+1, with an overflow block attached.]

File organization 3
- Datafile (un/sorted) + index
- Speeds searches based on its fields
  - Any subset/list of table's fields
  - These are called the search key
    - Not to be confused with the table's keys/superkeys
- Idea: trade disk space for disk time
  - Also may cost processor/RAM time
- Downsides:
  - Takes up more space
  - Must reflect changes in data

Classification of indices
- Primary v. secondary
- Clustered v. unclustered
- Dense v. sparse
- Index data structures:
  - B-trees
  - Hash tables
- More advanced types:
  - Function-based indices
  - R-trees
  - Bitmap indices

Dense indices
- Index has entry for each row
  - NB: index entries are smaller than rows
  - → more index entries per block than rows

Sparse indices
- Why make sparse? Fewer disk accesses
  - Bin search on shorter list: log(shorter N)
  - Analogy: “thumb” index in large dictionaries
  - Trade disk space for RAM space and comp. time
  - May fit in RAM
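The difference between dense and sparse indices can be sketched over a toy sorted data file. Everything here is illustrative: the data file is a list of blocks, each block a list of `(key, payload)` rows, and the function names are hypothetical. The sparse index keeps one entry per block (the block's first key); the dense index keeps one entry per row.

```python
from bisect import bisect_right


def build_dense_index(blocks):
    """One (key, block number) entry per row."""
    return [(key, block_no)
            for block_no, block in enumerate(blocks)
            for key, _ in block]


def build_sparse_index(blocks):
    """One (first key, block number) entry per block; the file must be sorted."""
    return [(block[0][0], block_no) for block_no, block in enumerate(blocks)]


def sparse_lookup(index, blocks, key):
    """Binary-search the short index, then read and scan a single block."""
    first_keys = [k for k, _ in index]
    pos = bisect_right(first_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than everything in the file
    for k, payload in blocks[index[pos][1]]:
        if k == key:
            return payload
    return None
```

For a file of three two-row blocks the dense index has six entries and the sparse one has three; the sparse lookup still finds any key with one block read, at the cost of a scan inside that block.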

Secondary/unclustered indices
- To index attributes other than the primary key
- Always dense (why?)

Clustered v. unclustered
- Clustered means: data and index sorted same way
  - Sorted on the fields the index is indexing
  - Each index entry stored “near” data entry
- Sparse indices must be clustered
  - Unclustered indices must be dense
- Clustered indices can reduce disk latency
  - Related data stored together: less far to go
  - Good for range queries

Primary v. secondary
- Primary indexes
  - Usually clustered
  - Only one per table
  - Use PRIMARY KEY
- Secondary indexes
  - Usually unclustered
  - Many allowed per table
  - Use UNIQUE or CREATE INDEX
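As a small illustration of the three SQL forms, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table `person` and its columns are made up for the example, and SQLite is only a stand-in: how a particular DBMS stores and clusters each index varies by product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# PRIMARY KEY gives the one primary index; UNIQUE creates a secondary index.
conn.execute(
    "CREATE TABLE person ("
    " ssn INTEGER PRIMARY KEY,"
    " email TEXT UNIQUE,"
    " lastname TEXT,"
    " firstname TEXT)"
)

# CREATE INDEX adds another secondary index, here on two fields.
conn.execute("CREATE INDEX person_name ON person (lastname, firstname)")

# List the secondary indexes SQLite now maintains for the table.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('person')")]
```

After this runs, `index_names` includes `person_name` along with an automatically named index backing the UNIQUE constraint.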

Partial key searches
- Situ: index on fields a_1, a_2, a_3; we search on fields a_i, a_j
- When will this work?
  1. i and j must be 1 and 2 (in either order)
     - Searched fields must be a prefix of the indexed fields
     - E.g.: lastname, firstname in phone book
  2. Index must be clustered
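Why only a prefix works can be seen from how a multi-field index is ordered. In the sketch below the index is simply a sorted list of `(lastname, firstname, phone)` tuples, and `prefix_search` is a hypothetical helper: entries sharing a prefix sit next to each other, so binary search finds the first one and a short walk finds the rest. Entries sharing only a firstname are scattered, so no such shortcut exists for them.

```python
from bisect import bisect_left


def prefix_search(index, prefix):
    """Return index entries whose leading fields equal `prefix`.

    `index` is a sorted list of tuples; `prefix` is a shorter tuple.
    """
    prefix = tuple(prefix)
    # A prefix tuple sorts just before every entry that extends it.
    i = bisect_left(index, prefix)
    matches = []
    while i < len(index) and index[i][:len(prefix)] == prefix:
        matches.append(index[i])
        i += 1
    return matches


phone_book = sorted([
    ("Jones", "Ann", "555-0101"),
    ("Smith", "Al", "555-0102"),
    ("Smith", "Bo", "555-0103"),
    ("Young", "Al", "555-0104"),
])
```

`prefix_search(phone_book, ("Smith",))` and `prefix_search(phone_book, ("Smith", "Bo"))` both work; finding everyone named "Al" would need a scan of the whole index.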

New topic: Hash Tables
- I/O model hash tables are much like main memory ones
- Hash basics:
  - There are n buckets
  - A hash function f(k) maps a key k to {0, 1, …, n-1}
  - Store in bucket f(k) a pointer to record with key k
- Difference for I/O model/DBMS:
  - Bucket size = 1 block
  - Use overflow blocks when needed

Example hash table
- Assume: 10 buckets, each storing 5 keys and pointers (only 2 shown)
- h(0)=0
- h(25)=h(5)=5
- h(83)=h(43)=3
- h(99)=h(9)=9

Hash table search
- Search for 82:
  - Compute h(82)=2
  - Read bucket 2
  - 1 disk access

Hash table insertion
- Place in corresponding bucket, if space
- Insert 42…

Hash table insertion
- Create overflow block, if no space
- Insert 91…
- More overflow blocks may be added as necessary
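Search and insertion with overflow blocks can be sketched together. The sketch assumes the example's setup (10 buckets, 5 keys and pointers per block, h(k) = k mod 10) with integer keys; the class name `DiskHashTable` is hypothetical, blocks are plain Python lists, and `search` also reports how many blocks it had to read so the cost of overflow chains is visible.

```python
BLOCK_CAPACITY = 5  # keys and pointers per block, as in the example


class DiskHashTable:
    def __init__(self, n_buckets=10):
        self.n = n_buckets
        # Each bucket is a chain of blocks: one primary block, then overflow blocks.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _h(self, key):
        return key % self.n

    def insert(self, key, pointer):
        chain = self.buckets[self._h(key)]
        if len(chain[-1]) == BLOCK_CAPACITY:
            chain.append([])  # no space: create an overflow block
        chain[-1].append((key, pointer))

    def search(self, key):
        """Return (pointer or None, number of blocks read)."""
        reads = 0
        for block in self.buckets[self._h(key)]:
            reads += 1  # each block in the chain costs one disk access
            for k, pointer in block:
                if k == key:
                    return pointer, reads
        return None, reads
```

With five keys in bucket 2, a search costs one read. A sixth key that hashes to 2 goes to an overflow block, and finding it (or confirming that a key such as 82 is absent) then costs two reads.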

Hash table performance
- Excellent if no overflow blocks
- For in-memory indices, hash tables usually preferred
- Performance degrades as ratio of keys/(n*blocksize) increases

Hash functions
- Lots of ideas for “good” functions, depending on situation
- One obvious idea: h(x) = x mod n
  - Every x mapped to one of 0, 1, …, n-1
  - Roughly 1/n-th of x's mapped to each bucket
  - Does this work for equality search?
  - Does this work for range search?
  - Does this work for partial-key search?
- Good functions for hashing passwords?
  - What was the point of hashing in that case?
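One way to explore the equality and range questions is to count which buckets a query has to read. The sketch below assumes integer keys and n = 10; the helper names are hypothetical.

```python
def h(x, n=10):
    """The 'obvious' hash function: x mod n."""
    return x % n


def buckets_for_equality(key, n=10):
    """An equality search only needs the one bucket the key hashes to."""
    return {h(key, n)}


def buckets_for_range(lo, hi, n=10):
    """A range search must visit the bucket of every key in [lo, hi]."""
    return {h(x, n) for x in range(lo, hi + 1)}
```

An equality search for 82 reads bucket 2 only. A range search for 40..49 touches all 10 buckets, because consecutive keys are deliberately scattered; the hash table preserves no ordering to exploit.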

Extensible hash table
- Number of buckets grows to prevent overflows
- Also used for crypto, hashing passwords, etc.
- And: Java's HashMap and object.hashCode()

New topic: B-trees
- Saw connected, rooted graphs before: XML graphs
- Trees are connected, acyclic graphs
- Saw rooted trees before:
  - XML docs
  - Directory structure on hard drive
  - Organizational/management charts
- B-trees are one kind of rooted tree

Twenty Questions
- What am I thinking of?
  - Large space of possible choices
  - Can ask only yes/no questions
  - Each gives <=1 bit
- Strategy:
  - Ask questions that divide search space in half
  - → gain full bit from each question
- log_2(1,000,000 ~= 2^20) = 20
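The halving strategy is ordinary binary search, and the question count can be checked directly. The function below is a hypothetical illustration in which the "choices" are the integers from `lo` to `hi` and every question has the form "is it at most mid?".

```python
def questions_needed(secret, lo, hi):
    """Count yes/no questions needed to pin down `secret` in [lo, hi]."""
    count = 0
    while lo < hi:
        mid = (lo + hi) // 2
        count += 1            # ask: "is it <= mid?"
        if secret <= mid:
            hi = mid          # yes: keep the lower half
        else:
            lo = mid + 1      # no: keep the upper half
    return count
```

Over a space of exactly 2^20 choices every secret takes exactly 20 questions; over 1,000,000 choices no secret takes more than 20.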

BSTs
- Very simple data structure in CS: BSTs
  - Binary Search Trees
  - Keep balanced
  - Each node ~ one item
- Each node has two children:
  - Left subtree: <
  - Right subtree: >=
- Can search, insert, delete in log time
  - log_2(1MB = 2^20) = 20

Search for DBMS
- Big improvement: log_2(1MB) = 20
  - Each op divides remaining range in half!
- But recall: all that matters is #disk accesses
- 20 is better than 2^20, but: can we do better?

BSTs → B-trees
- Like BSTs except each node ~ one block
- Branching factor is >> 2
  - Each access divides remaining range by, say, 300
  - B-trees = BSTs + blocks
- B+ trees are a variant of B-trees
  - Data stored only in leaves
  - Leaves form a (sorted) linked list
  - Better supports range queries
- Consequences:
  - Much shorter depth
  - Many fewer disk reads
  - Must find element within node
  - Trades CPU/RAM time for disk time
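A lookup in a B+ tree can be sketched as one binary search per node on the way down. This is a read-only illustration under assumptions: the `Node` class and `search` function are hypothetical, the tree is built by hand, insertion and node splitting are omitted, and `reads` counts nodes visited as a stand-in for disk accesses.

```python
from bisect import bisect_left, bisect_right


class Node:
    """Internal nodes have keys and children; leaves have keys and values."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children  # None for a leaf
        self.values = values      # None for an internal node


def search(root, key):
    """Return (value or None, number of nodes read)."""
    node, reads = root, 1
    while node.children is not None:
        # Find the element within the node: CPU work instead of disk work.
        # Keys equal to a separator go right (left subtree: <, right: >=).
        node = node.children[bisect_right(node.keys, key)]
        reads += 1
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i], reads
    return None, reads


# A tiny hand-built tree: one root over three leaves.
leaves = [
    Node([1, 4], values=["r1", "r4"]),
    Node([7, 10], values=["r7", "r10"]),
    Node([13, 16], values=["r13", "r16"]),
]
root = Node([7, 13], children=leaves)
```

Every lookup in this tree reads two nodes. With hundreds of keys per node instead of two, the same two reads would cover tens of thousands of rows.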

B-tree search efficiency
- With params:
  - block = 4k
  - integer = 4b
  - pointer = 8b
- The largest n satisfying 4n + 8(n+1) <= 4096 is n = 340
  - Each node has 170..340 keys
  - Assume on avg has (170+340)/2 = 255
- Then:
  - 255 rows → depth = 1
  - 255^2 = 64k rows → depth = 2
  - 255^3 = 16M rows → depth = 3
  - 255^4 = 4G rows → depth = 4
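The node-size arithmetic can be reproduced in a few lines. The sketch takes the slide's parameters (4096-byte block, 4-byte keys, 8-byte pointers) and assumes nodes are between half full and full; the function name is hypothetical. It follows the slide in counting 255 rows per level, a slight simplification since a node with 255 keys has 256 children.

```python
def max_keys_per_node(block_bytes=4096, key_bytes=4, pointer_bytes=8):
    """Largest n with key_bytes*n + pointer_bytes*(n+1) <= block_bytes."""
    return (block_bytes - pointer_bytes) // (key_bytes + pointer_bytes)


n = max_keys_per_node()        # 340 keys in a full node
avg_keys = (n // 2 + n) // 2   # nodes are half full to full: (170 + 340) / 2 = 255

# Rows reachable at each depth, assuming avg_keys per node.
rows_at_depth = {depth: avg_keys ** depth for depth in range(1, 5)}
```

The results are 255, 65,025 (about 64k), 16,581,375 (about 16M), and 4,228,250,625 (about 4G) rows at depths 1 through 4.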

Next time
- Next: Failover
- For next time: reading online
- Hw3 due next time → no extensions!
- Now: one-minute responses