Download presentation
Presentation is loading. Please wait.
1
M.P. Johnson, DBMS, Stern/NYU, Sp20041 C20.0046: Database Management Systems Lecture #25 Matthew P. Johnson Stern School of Business, NYU Spring, 2004
2
M.P. Johnson, DBMS, Stern/NYU, Sp2004 2 Agenda Previously: Hardware & sorting Next: Indices Failover/recovery Data warehousing & mining Websearch Hw3 due Thursday no extensions! 1-minute responses XML links up
3
M.P. Johnson, DBMS, Stern/NYU, Sp2004 3 Query compiler/optimizer Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution plan Record, index requests Page commands Read/write pages Transaction manager: Concurrency control Logging/recovery Transaction commands Let’s get physical
4
M.P. Johnson, DBMS, Stern/NYU, Sp2004 4 Hardware/memory review DBs won’t fit in RAM Disk access is O(100,000) times slower than RAM RAM Model of Computation Single ops about same as single memory access I/O Model of Computation We read/write one block (4k) at a time Measure time in # disk accesses Ignore processor operations – O(100,000) times faster Regular Mergesort Divide in half each time and recurse
5
M.P. Johnson, DBMS, Stern/NYU, Sp2004 5 Hardware/memory review Big problem: how to sort 1GB with 1MB of RAM? Can use MS but must read/write all data 19+ times Soln: TPMMS (External MergeSort) 1. Sort data in 1MB chunks 2. Sort 249 of the chunks into a 249MB chunk 3. Sort 249 of the 249MB chunks… Each iteration: RAM size/blocksize * last-chunk-size
6
M.P. Johnson, DBMS, Stern/NYU, Sp2004 6 External Merge-Sort Phase one: load 1MB in memory, sort Result: SIZE/M lists of length M bytes (1MB) M bytes of main memory Disk... M/R records
7
M.P. Johnson, DBMS, Stern/NYU, Sp2004 7 Phase Two Merge M/B – 1 lists into a new list M/B-1 = 1MB / 4kb -1 = 250 Result: lists of size M *(M/B – 1) bytes 249 * 1MB ~= 250 MB M bytes of main memory Disk... Input M/B Input 1 Input 2.. Output
8
M.P. Johnson, DBMS, Stern/NYU, Sp2004 8 Phase Three Merge M/B – 1 lists into a new list Result: lists of size M*(M/B – 1) 2 bytes 249 * 250 MB ~= 62,500 MB = 625 GB M bytes of main memory Disk... Input M/B Input 1 Input 2.. Output
9
M.P. Johnson, DBMS, Stern/NYU, Sp2004 9 Next topic: File organization 1 Heap files: unordered list of rows One damn row after another. All row queries are easy: SELECT * FROM T; Insert is easy: just add to end Unique/subset queries are hard: Must test each row
10
M.P. Johnson, DBMS, Stern/NYU, Sp2004 10 File organization 2 Sorted file: sort rows on some fields Since datafile likely to be large, must use an external sort like external MS Equality, range select now easier: Do binary search to find first Walk through rows until one fails test Insert, delete now hard Must move avg of half rows forward or back Possible solns: Leave empty space Use “overflow” pages
11
M.P. Johnson, DBMS, Stern/NYU, Sp2004 11 Modifications Insert: File is unsorted easy File is sorted: Is there space in the right block? Then store it there If anything else fails, create overflow block Delete: Free space in block Maybe be able to eliminate an overflow block If not, use a tombstone (null record) Update: new rec is shorter than prev. easy If it’s longer, need to shift records, create overflow blocks
12
M.P. Johnson, DBMS, Stern/NYU, Sp2004 12 Overflow Blocks After a while the file starts being dominated by overflow blocks: time to reorganize Block n-1 Block n Block n+1 Overflow
13
M.P. Johnson, DBMS, Stern/NYU, Sp2004 13 File organization 3 Datafile (un/sorted) + index Speeds searches based on its fields Any subset/list of table’s fields these called search key not to be confused with table’s keys/superkeys Idea: trade disk space for disk time also may cost processor/RAM time Downsides: Takes up more space Must reflect changes in data
14
M.P. Johnson, DBMS, Stern/NYU, Sp2004 14 Classification of indices Primary v. secondary Clustered v. unclustered Dense v. sparse Index data structures: B-trees Hash tables More advanced types: Function-based indices R-trees Bitmap indices
15
M.P. Johnson, DBMS, Stern/NYU, Sp2004 15 Dense indices Index has entry for each row NB: index entries are smaller than rows more index entries per block than rows 10 20 30 40 50 60 70 80 10 20 30 40 50 60 70 80
16
M.P. Johnson, DBMS, Stern/NYU, Sp2004 16 Sparse indices Why make sparse? Fewer disk accesses Bin search on shorter list – log(shorter N) Analogy: “thumb” index in large dictionaries Trade disk space for RAM space and comp. Time May fit in RAM 10 30 50 70 90 110 130 150 10 20 30 40 50 60 70 80
17
M.P. Johnson, DBMS, Stern/NYU, Sp2004 17 Secondary/unclustered indices To index other attributes than primary key Always dense (why?) 10 20 30 20 30 20 10 20 10 30
18
M.P. Johnson, DBMS, Stern/NYU, Sp2004 18 Clustered v. unclustered Clustered means: data and index sorted same way Sorted on the fields the index is indexing Each index entry stored “near” data entry Sparse indices must be clustered Unclustered indices must be dense Clustered indices can reduce disk latency Related data stored together – less far to go Good for range queries
19
M.P. Johnson, DBMS, Stern/NYU, Sp2004 19 Primary v. secondary Primary indexes usually clustered Only one per table Use PRIMARY KEY Secondary indexes usually unclustered many allowed per table Use UNIQUE or CREATE INDEX
20
M.P. Johnson, DBMS, Stern/NYU, Sp2004 20 Partial key searches Situ: index on fields a 1,a 2,a 3 ; we search on fields a i, a j When will this work? 1. i and j must be 1 and 2 (in either order) Searched fields must be a prefix of the indexed fields E.g.: lastname,firstname in phone book 2. Index must be clustered
21
M.P. Johnson, DBMS, Stern/NYU, Sp2004 21 New topic: Hash Tables I/O model hash tables are much like main memory ones Hash basics: There are n buckets A hash function f(k) maps a key k to {0, 1, …, n-1} Store in bucket f(k) a pointer to record with key k Difference for I/O model/DBMS: bucket size = 1 block use overflow blocks when needed
22
M.P. Johnson, DBMS, Stern/NYU, Sp2004 22 Assume: 10 buckets, each storing 5 keys and pointers (only 2 shown) h(0)=0 h(25)=h(5)=5 h(83)=h(43)=3 h(99)=h(9)=9 Example hash table 0 10 11 41 82 23 3 0 1 2 3
23
M.P. Johnson, DBMS, Stern/NYU, Sp2004 23 Search for 82: Compute h(82)=2 Read bucket 2 1 disk access Hash table search 0 1 2 3 0 10 11 41 82 23 3
24
M.P. Johnson, DBMS, Stern/NYU, Sp2004 24 Place in corresponding bucket, if space Insert 42… Hash table insertion 0 10 11 41 82 23 3 0 1 2 3
25
M.P. Johnson, DBMS, Stern/NYU, Sp2004 25 Hash table insertion Create overflow block, if no space Insert 91… More over- flow blocks may be added as necessary 0 10 11 41 82 42 23 3 0 1 2 3 91
26
M.P. Johnson, DBMS, Stern/NYU, Sp2004 26 Hash table performance Excellent if no overflow blocks For in-memory indices, hash tables usually preferred Performance degrades as ratio of keys/(n*blocksize) increases
27
M.P. Johnson, DBMS, Stern/NYU, Sp2004 27 Hash functions Lots of ideas for “good” functions, depending on situation One obvious idea: h(x) = x mod n Every x mapped to one of 0, 1, …, n-1 Roughly 1/n th of x’s mapped to each bucket Does this work for equality search? Does this work for range search? Does this work for partial-key search? Good functions of hashing passwords? What was the point of hashing in that case?
28
M.P. Johnson, DBMS, Stern/NYU, Sp2004 28 Extensible hash table Number of buckets grows to prevent overflows Also used for crypto, hashing passwords, etc. And: Java’s HashMap and object.hashCode()HashMapobject.hashCode()
29
M.P. Johnson, DBMS, Stern/NYU, Sp2004 29 New topic: B-trees Saw connected, rooted graphs before: XML graphs Trees are connected, acyclic graphs Saw rooted trees before: XML docs directory structure on hard drive Organizational/management charts B-trees are one kind of rooted tree
30
M.P. Johnson, DBMS, Stern/NYU, Sp2004 30 Twenty Questions What am I thinking of? Large space of possible choices Can ask only yes/no questions Each gives <=1 bit Strategy: ask questions that divide searchspace in half gain full bit from each question log 2 (1,000,000 ~= 2 20 ) = 20
31
M.P. Johnson, DBMS, Stern/NYU, Sp2004 31 BSTs Very simple data structure in CS: BSTs Binary Search Trees Keep balanced Each node ~ one item Each node has two children: Left subtree: < Right subtree: >= Can search, insert, delete in log time log 2 (1MB = 2 20 ) = 20
32
M.P. Johnson, DBMS, Stern/NYU, Sp2004 32 Search for DBMS Big improvement: log 2 (1MB) = 20 Each op divides remaining range in half! But recall: all that matters is #disk accesses 20 is better than 2 20 but: Can we do better?
33
M.P. Johnson, DBMS, Stern/NYU, Sp2004 33 BSTs B-trees Like BSTs except each node ~ one block Branching factor is >> 2 Each access divides remaining range by, say, 300 B-trees = BSTs + blocks B+ trees are a variant of B-trees Data stored only in leaves Leaves form a (sorted) linked list Better supports range queries Consequences: Much shorter depth Many fewer disk reads Must find element within node Trades CPU/RAM time for disk time
34
M.P. Johnson, DBMS, Stern/NYU, Sp2004 34 B-tree search efficiency With params: block=4k integer = 4b, pointer = 8b the largest n satisfying 4n+8(n+1) <= 4096 is n=340 Each node has 170..340 keys assume on avg has (170+340)/2=255 Then: 255 rows depth = 1 255 2 = 64k rows depth = 2 255 3 = 16M rows depth = 3 255 4 = 4G rows depth = 4
35
M.P. Johnson, DBMS, Stern/NYU, Sp2004 35 Next time Next: Failover For next time: reading online Hw3 due next time no extensions! Now: one-minute responses
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.