Slide 1: C20.0046: Database Management Systems, Lecture #25
M.P. Johnson, Stern School of Business, NYU, Spring 2004

Slide 2: Agenda
- Previously: hardware & sorting
- Next: indices; failover/recovery; data warehousing & mining; web search
- HW3 due Thursday (no extensions!)
- One-minute responses
- XML links are up

Slide 3: Let's get physical
[Architecture diagram: the user/application sends queries and updates to the query compiler/optimizer, which passes a query execution plan to the execution engine; the execution engine makes record/index requests to the index/record manager, which issues page commands to the buffer manager, which reads/writes pages through the storage manager to storage; the transaction manager (concurrency control, logging/recovery) receives transaction commands.]

Slide 4: Hardware/memory review
- DBs won't fit in RAM
- Disk access is ~100,000 times slower than RAM access
- RAM model of computation: a single operation costs about the same as a single memory access
- I/O model of computation: we read/write one block (4 KB) at a time; measure time in number of disk accesses; ignore processor operations (~100,000 times faster)
- Regular mergesort: divide in half each time and recurse

Slide 5: Hardware/memory review
- Big problem: how to sort 1 GB with 1 MB of RAM?
- Could use ordinary mergesort, but would have to read/write all the data 19+ times
- Solution: TPMMS (two-phase multiway merge sort, i.e., external merge sort)
  1. Sort the data in 1 MB chunks
  2. Merge 249 of those chunks into a 249 MB chunk
  3. Merge 249 of the 249 MB chunks, and so on
- Each iteration multiplies the last chunk size by roughly RAM size / block size

Slide 6: External merge sort
- Phase one: load M bytes (1 MB) into memory, sort
- Result: SIZE/M sorted runs of length M bytes (1 MB) each
[Diagram: M bytes of main memory; disk holds the runs, each of M/R records]

Slide 7: Phase two
- Merge M/B - 1 lists into a new list
- M/B - 1 = 1 MB / 4 KB - 1 ≈ 250
- Result: lists of size M * (M/B - 1) bytes; 249 * 1 MB ≈ 250 MB
[Diagram: M bytes of main memory holding the input buffers and one output buffer]

Slide 8: Phase three
- Merge M/B - 1 lists into a new list
- Result: lists of size M * (M/B - 1)^2 bytes; 249 * 250 MB ≈ 62,500 MB ≈ 62.5 GB
[Diagram: M bytes of main memory holding the input buffers and one output buffer]
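The phases above can be sketched in Python. This is a toy illustration: tiny run sizes and fan-in stand in for the memory-sized runs and M/B - 1 input buffers of the real, block-at-a-time algorithm.

```python
# Sketch of two-phase multiway merge sort (TPMMS / external merge sort).
# run_size plays the role of available memory M; fan_in plays M/B - 1.
import heapq

def external_merge_sort(records, run_size, fan_in):
    # Phase one: sort memory-sized chunks into runs.
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    # Later phases: merge up to fan_in runs at a time, repeating
    # until a single sorted list remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []
```

With the lecture's numbers (M = 1 MB, B = 4 KB), one merge pass multiplies run length by about 250, which is why a gigabyte sorts in just two phases.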

Slide 9: File organization 1
- Heap files: an unordered list of rows; one damn row after another
- All-row queries are easy: SELECT * FROM T;
- Insert is easy: just append to the end
- Unique/subset queries are hard: must test each row

Slide 10: File organization 2
- Sorted file: sort the rows on some fields
- Since the datafile is likely to be large, must use an external sort such as external merge sort
- Equality and range selects are now easier: binary search to find the first match, then walk through the rows until one fails the test
- Insert and delete are now hard: must move an average of half the rows forward or back
- Possible solutions: leave empty space; use "overflow" pages
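The range-select strategy above can be sketched as follows; an in-memory list of keys stands in for the sorted datafile.

```python
# Range select on a sorted file: binary search to the first qualifying
# row, then scan forward until a row fails the test.
import bisect

def range_select(sorted_rows, lo, hi):
    results = []
    i = bisect.bisect_left(sorted_rows, lo)  # first row >= lo (binary search)
    while i < len(sorted_rows) and sorted_rows[i] <= hi:  # walk until test fails
        results.append(sorted_rows[i])
        i += 1
    return results
```

An equality select is just the special case lo == hi.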

Slide 11: Modifications
- Insert: if the file is unsorted, easy; if it is sorted, is there space in the right block? If so, store it there; if all else fails, create an overflow block
- Delete: free the space in the block; may be able to eliminate an overflow block; if not, use a tombstone (null record)
- Update: if the new record is shorter than the previous one, easy; if it is longer, need to shift records and possibly create overflow blocks

Slide 12: Overflow blocks
- After a while the file starts being dominated by overflow blocks: time to reorganize
[Diagram: blocks n-1, n, n+1, with an overflow block chained off block n]

Slide 13: File organization 3
- Datafile (sorted or unsorted) + index
- The index speeds searches on its fields: any subset/list of the table's fields, called the search key (not to be confused with the table's keys/superkeys)
- Idea: trade disk space for disk time (may also cost processor/RAM time)
- Downsides: takes up more space; must reflect changes in the data

Slide 14: Classification of indices
- Primary vs. secondary
- Clustered vs. unclustered
- Dense vs. sparse
- Index data structures: B-trees, hash tables
- More advanced types: function-based indices, R-trees, bitmap indices

Slide 15: Dense indices
- The index has an entry for each row
- NB: index entries are smaller than rows, so more index entries than rows fit per block
[Diagram: dense index entries 10, 20, ..., 80, one per data row 10, 20, ..., 80]

Slide 16: Sparse indices
- Why make an index sparse? Fewer disk accesses
- Binary search on a shorter list: log of a smaller N
- Analogy: the "thumb" index in large dictionaries
- Trade disk space for RAM space and computation time; the index may fit in RAM
[Diagram: sparse index entries 10, 30, 50, ..., 150, one per block of data rows 10, 20, ..., 80]
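A sparse-index lookup can be sketched like this; the in-memory lists are hypothetical stand-ins for the index file and the data blocks.

```python
# Sparse index: one entry per block, holding that block's first key.
# Binary search the index for the one block that could contain the key,
# then scan only that block.
import bisect

def sparse_index_lookup(index, blocks, key):
    pos = bisect.bisect_right(index, key) - 1  # last index entry <= key
    if pos < 0:
        return None                            # key precedes every block
    return key if key in blocks[pos] else None
```

Only one index probe and one data block are read, versus one index entry per row in the dense case.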

Slide 17: Secondary/unclustered indices
- Used to index attributes other than the primary key
- Always dense (why?)
[Diagram: index entries 10, 20, 30 pointing to matching rows scattered across the file]

Slide 18: Clustered vs. unclustered
- Clustered means data and index are sorted the same way: the data is sorted on the fields the index is indexing, and each index entry is stored "near" its data entry
- Sparse indices must be clustered; unclustered indices must be dense
- Clustered indices can reduce disk latency: related data is stored together, so there is less far to go; good for range queries

Slide 19: Primary vs. secondary
- Primary indexes: usually clustered; only one per table; use PRIMARY KEY
- Secondary indexes: usually unclustered; many allowed per table; use UNIQUE or CREATE INDEX

Slide 20: Partial-key searches
- Situation: index on fields a1, a2, a3; we search on fields ai, aj
- When will this work?
  1. i and j must be 1 and 2 (in either order): the searched fields must be a prefix of the indexed fields (e.g., lastname, firstname in a phone book)
  2. The index must be clustered
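Rule 1, the prefix requirement, can be written as a small predicate; the field names are illustrative.

```python
# A composite index supports a partial-key search only if the searched
# fields (in any order) form a prefix of the index's search key.
def index_supports(indexed_fields, searched_fields):
    prefix = indexed_fields[:len(searched_fields)]
    return set(searched_fields) == set(prefix)
```

So an index on (a1, a2, a3) supports searches on {a1} or {a1, a2}, but not on {a2} alone or {a1, a3}: the entries are sorted by a1 first, so those searches cannot be answered by one contiguous scan.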

Slide 21: New topic: hash tables
- I/O-model hash tables are much like main-memory ones
- Hash basics: there are n buckets; a hash function f(k) maps a key k to {0, 1, ..., n-1}; store in bucket f(k) a pointer to the record with key k
- Difference in the I/O model/DBMS: bucket size = 1 block; use overflow blocks when needed

Slide 22: Example hash table
- Assume 10 buckets, each storing 5 keys and pointers (only 2 of each shown)
- h(0)=0; h(25)=h(5)=5; h(83)=h(43)=3; h(99)=h(9)=9
[Diagram: buckets 0 through 3 holding keys 0, 10; 11, 41; 82; 23, 3]

Slide 23: Hash table search
- Search for 82: compute h(82)=2; read bucket 2; 1 disk access
[Diagram: the hash table from the previous slide]

Slide 24: Hash table insertion
- Insert 42: place it in the corresponding bucket, if there is space
[Diagram: the hash table from the previous slide]

Slide 25: Hash table insertion (cont.)
- Insert 91: create an overflow block if there is no space
- More overflow blocks may be added as necessary
[Diagram: the hash table with 42 in bucket 2 and an overflow block holding 91]
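The bucket-plus-overflow scheme of the last few slides can be sketched as follows. Capacities are toy values; a real bucket would hold a block's worth of key/pointer pairs.

```python
# Each bucket is a chain of fixed-capacity "blocks"; an insert that finds
# its last block full appends an overflow block to the chain.
class BlockHashTable:
    def __init__(self, n_buckets, block_capacity):
        self.n = n_buckets
        self.cap = block_capacity
        # each bucket starts as a chain of one empty block
        self.buckets = [[[]] for _ in range(n_buckets)]

    def insert(self, key):
        chain = self.buckets[key % self.n]
        if len(chain[-1]) == self.cap:   # last block full:
            chain.append([])             # create an overflow block
        chain[-1].append(key)

    def search(self, key):
        # cost in the I/O model = number of blocks read in this chain
        for block in self.buckets[key % self.n]:
            if key in block:
                return True
        return False
```

Replaying the slides' example (capacity 2 per block): inserting 11, 41, then 91 fills bucket 1 and forces an overflow block for 91, so searching bucket 1 now costs two block reads instead of one.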

Slide 26: Hash table performance
- Excellent if there are no overflow blocks
- For in-memory indices, hash tables are usually preferred
- Performance degrades as the ratio keys/(n * blocksize) increases

Slide 27: Hash functions
- Lots of ideas for "good" functions, depending on the situation
- One obvious idea: h(x) = x mod n
  - Every x is mapped to one of 0, 1, ..., n-1
  - Roughly 1/n-th of the x's are mapped to each bucket
  - Does this work for equality search? For range search? For partial-key search?
- Good functions for hashing passwords? What was the point of hashing in that case?
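One way to see the answers to the first two questions: equal keys always hash to the same bucket, but consecutive keys are scattered across all the buckets, so a range scan would have to read every bucket.

```python
# Buckets a range scan over [lo, hi] would touch under h(x) = x mod n.
def buckets_touched(lo, hi, n=10):
    return {x % n for x in range(lo, hi + 1)}
```

Equality search touches exactly one bucket; a range of just 10 consecutive keys already touches all 10 buckets, so the hash index gives no help at all for ranges.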

Slide 28: Extensible hash tables
- The number of buckets grows to prevent overflows
- Hashing is also used for crypto, hashing passwords, etc.
- And: Java's HashMap and Object.hashCode()

Slide 29: New topic: B-trees
- We saw connected, rooted graphs before: XML graphs
- Trees are connected, acyclic graphs
- We saw rooted trees before: XML docs; directory structure on a hard drive; organizational/management charts
- B-trees are one kind of rooted tree

Slide 30: Twenty Questions
- What am I thinking of? Large space of possible choices; can ask only yes/no questions; each gives <= 1 bit
- Strategy: ask questions that divide the search space in half, gaining a full bit from each question
- log2(1,000,000) ≈ log2(2^20) = 20

Slide 31: BSTs
- A very simple data structure in CS: BSTs (binary search trees)
- Keep them balanced; each node holds one item
- Each node has two children: left subtree <, right subtree >=
- Can search, insert, delete in log time: log2(1M ≈ 2^20) = 20

Slide 32: Search for DBMS
- Big improvement: log2(1 MB) = 20; each operation halves the remaining range!
- But recall: all that matters is the number of disk accesses
- 20 is better than 2^20, but can we do better?

Slide 33: BSTs → B-trees
- Like BSTs, except each node corresponds to one block
- Branching factor is >> 2: each access divides the remaining range by, say, 300
- B-trees = BSTs + blocks
- B+ trees are a variant of B-trees: data is stored only in the leaves; the leaves form a (sorted) linked list; better supports range queries
- Consequences: much shorter depth; many fewer disk reads; must find the element within a node; trades CPU/RAM time for disk time

Slide 34: B-tree search efficiency
- With parameters block = 4 KB, integer = 4 bytes, pointer = 8 bytes, the largest n satisfying 4n + 8(n+1) <= 4096 is n = 340
- Each node has 170..340 keys; assume on average (170+340)/2 = 255
- Then: 255 rows → depth 1; 255^2 ≈ 64K rows → depth 2; 255^3 ≈ 16M rows → depth 3; 255^4 ≈ 4G rows → depth 4
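The slide's arithmetic can be checked directly, using the block, key, and pointer sizes given above.

```python
# A node with n keys needs n keys and n+1 child pointers:
# key_size*n + ptr_size*(n+1) <= block.  Solving for n gives the fan-out.
def btree_fanout(block=4096, key_size=4, ptr_size=8):
    n = (block - ptr_size) // (key_size + ptr_size)  # largest feasible n
    avg = (n // 2 + n) // 2  # nodes hold between n/2 and n keys
    return n, avg
```

With fan-out ~255 per level, three levels already cover over 16 million rows, so a handful of disk reads suffices for any lookup.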

Slide 35: Next time
- Next: failover
- For next time: reading online
- HW3 due next time (no extensions!)
- Now: one-minute responses

