Using Secondary Storage Effectively
In most studies of algorithms, one assumes the "RAM model":
–The data is in main memory.
–Access to any item of data takes as much time as any other.
When implementing a DBMS, one must assume that the data does not fit into main memory.
–In designing efficient algorithms, one must take into account the use of secondary, and perhaps even tertiary, storage.
Often, the best algorithms for processing very large amounts of data differ from the best main-memory algorithms for the same problem.
–There is a great advantage in choosing an algorithm that uses few disk accesses, even if it is not very efficient when viewed as a main-memory algorithm.

Assumptions (for now)
One processor.
One disk controller, and one disk.
The database itself is much too large to fit in main memory.
Many users, each issuing disk-I/O requests frequently.
–The disk controller serves requests on a first-come, first-served basis.
–We will change this assumption later.
Thus, each request for a given user will appear random,
–even if the relation that a user is reading is stored on a single cylinder of the disk.

I/O model of computation
A disk I/O (read or write of a block) is very expensive compared with what is likely to be done with the block once it arrives in main memory.
–Perhaps 1,000,000 machine instructions can execute in the time it takes to do one random disk I/O.
Random block accesses are the norm if several processes are accessing the disk and the disk controller does not schedule accesses carefully.
A reasonable model of computation for algorithms that require secondary storage: count only the disk I/O's.
In examples, we shall assume that the disk is a Megatron 747, with 16 KB blocks and the timing characteristics determined before. In particular, the average time to read or write a block is about 11 ms.

Good DBMS algorithms
a) Try to make sure that if we read a block, we use much of the data on the block.
b) Try to put blocks that are accessed together on the same cylinder.
c) Try to buffer commonly used blocks in main memory.

Sorting Example
Setup: 10,000,000 records of 160 bytes each = a 1.6 GB file.
–Stored on a Megatron 747 disk, with 16 KB blocks, each holding 100 records.
–The entire file takes 100,000 blocks.
100 MB of main memory available for buffering.
–The number of blocks that can fit in 100 MB of memory (which, recall, is really 100 × 2^20 bytes) is 100 × 2^20 / 2^14 = 6,400 blocks ≈ 1/16th of the file.
Sort by the primary key field.
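A quick sanity check of the slide's block arithmetic (a sketch in plain Python; note the slide rounds 16,384/160 ≈ 102 records down to 100 per block):

```python
RECORD_SIZE = 160          # bytes per record
BLOCK_SIZE = 16 * 1024     # 16 KB blocks on the Megatron 747
MEMORY = 100 * 2**20       # 100 MB of main memory
NUM_RECORDS = 10_000_000

records_per_block = BLOCK_SIZE // RECORD_SIZE   # 102; the slide uses 100
file_blocks = NUM_RECORDS // 100                # 100,000 blocks for the whole file
memory_blocks = MEMORY // BLOCK_SIZE            # 6,400 blocks fit in memory

print(file_blocks, memory_blocks)
```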

Merge Sort
Common main-memory sorting algorithms don't look so good when you take disk I/O's into account. Variants of Merge Sort do better.
Merge = take two sorted lists and repeatedly choose the smaller of the "heads" of the lists (head = first of the unchosen).
Example: merging 1,3,4,8 with 2,5,7,9 gives 1,2,3,4,5,7,8,9.
Merge Sort is based on the recursive algorithm: divide the records into two parts; recursively merge-sort the parts; merge the resulting lists.
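The merge step can be sketched in a few lines (plain Python; `merge` is an illustrative helper name, not from the slides):

```python
def merge(xs, ys):
    """Merge two sorted lists by repeatedly choosing the smaller head."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            out.append(xs[i]); i += 1
        else:
            out.append(ys[j]); j += 1
    out.extend(xs[i:])   # one list may still have unchosen records
    out.extend(ys[j:])
    return out

print(merge([1, 3, 4, 8], [2, 5, 7, 9]))  # the slide's example: [1, 2, 3, 4, 5, 7, 8, 9]
```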

Two-Phase, Multiway Merge Sort (2PMMS)
Plain Merge Sort is still not very good in the disk I/O model: it makes log2(n) passes, so each record is read from and written to disk log2(n) times.
Good secondary-memory algorithms operate in a small number of passes;
–in one pass, every record is read into main memory once and written out to disk once.
2PMMS: 2 reads + 2 writes per block.
Phase 1
1. Fill main memory with records.
2. Sort using your favorite main-memory sort.
3. Write the sorted sublist to disk.
4. Repeat until all records have been put into one of the sorted sublists.
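Phase 1 can be sketched as follows (a simulation only: the "disk" is just a Python list, and memory capacity is counted in records rather than bytes):

```python
def phase1_runs(records, memory_capacity):
    """Phase 1 of 2PMMS: fill memory, sort in memory, write out a sorted
    sublist; repeat until all records belong to some sublist."""
    runs = []
    for start in range(0, len(records), memory_capacity):
        chunk = sorted(records[start:start + memory_capacity])  # main-memory sort
        runs.append(chunk)                                      # "write sublist to disk"
    return runs
```

Each element of `runs` stands for one sorted sublist on disk, ready for the Phase 2 merge.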

Phase 2
Use one buffer for each of the sorted sublists and one buffer for an output block.
Initially load the input buffers with the first blocks of their respective sorted sublists.
Repeatedly run a competition among the first unchosen records of each of the buffered blocks. Move the record with the least key to the output block; it is now "chosen."
Manage the buffers as needed:
–If an input block is exhausted, get the next block from the same sublist.
–If the output block is full, write it to disk.
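The competition can be sketched with a priority queue (an assumption on my part: the slide describes a linear competition among the heads; a heap is a common way to implement it):

```python
import heapq

def phase2_merge(runs):
    """Phase 2 of 2PMMS: repeatedly move the least unchosen key among all
    sublists to the output. Blocks/buffers are abstracted away: each run is
    a sorted Python list."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, i, pos = heapq.heappop(heap)  # winner of the competition
        out.append(key)                    # move it to the output block
        if pos + 1 < len(runs[i]):         # refill from the same sublist
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out
```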

Analysis – Phase 1
6,400 of the 100,000 blocks will fill main memory. We thus fill memory ⌈100,000/6,400⌉ = 16 times, sort the records in main memory, and write the sorted sublists out to disk.
How long does this phase take? We read each of the 100,000 blocks once, and we write 100,000 new blocks. Thus, there are 200,000 disk I/O's, for 200,000 × 11 ms = 2,200 seconds, or about 37 minutes (11 ms being the average time to read or write a block).
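The Phase 1 cost works out as follows (plain arithmetic, figures from the slide):

```python
FILE_BLOCKS = 100_000
AVG_IO_MS = 11                         # average time to read or write one block

phase1_ios = 2 * FILE_BLOCKS           # read every block once, write it once
phase1_secs = phase1_ios * AVG_IO_MS / 1000

print(phase1_secs, phase1_secs / 60)   # 2200 s, about 37 minutes
```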

Analysis – Phase 2
In the second phase, unlike the first, blocks are read in an unpredictable order, since we cannot tell when an input block will become exhausted.
However, notice that every block holding records from one of the sorted sublists is read from disk exactly once. Thus, the total number of block reads is 100,000 in the second phase, just as in the first.
Likewise, each record is placed once in an output block, and each of these blocks is written to disk. Thus, the number of block writes in the second phase is also 100,000.
We conclude that the second phase takes another 37 minutes.
Total: Phase 1 + Phase 2 = 74 minutes.

How Big Should Blocks Be?
We have assumed a 16 KB block in our analysis of algorithms using the Megatron 747 disk. However, there are arguments that a larger block size would be advantageous.
–Recall that it takes about a quarter of a millisecond (0.25 ms) to transfer a 16 KB block, and about 10.75 ms for the average seek time and rotational latency.
If we doubled the size of blocks, we would halve the number of disk I/O's. On the other hand, the only change in the time to access a block would be that the transfer time increases to 0.25 × 2 = 0.50 ms. We would thus approximately halve the time the sort takes.
For a block size of 512 KB (i.e., an entire track of the Megatron 747) the transfer time is 0.25 × 32 = 8 ms. At that point, the average block access time would be about 20 ms, but we would need only 12,500 block accesses, for a speedup in sorting by a factor of 14.
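The tradeoff can be checked numerically (a sketch; the 10.75 ms seek+rotation figure is the slide's 11 ms average access time minus the 0.25 ms transfer time, and the exact speedup depends on how that access time is rounded):

```python
SEEK_ROT_MS = 10.75            # average seek time + rotational latency
TRANSFER_16K_MS = 0.25         # transfer time for one 16 KB block

ios_16k = 400_000              # 2PMMS: 4 I/O's per block x 100,000 blocks
ios_512k = ios_16k // 32       # a 512 KB block holds 32 x 16 KB, so 1/32 as many I/Os
access_512k_ms = SEEK_ROT_MS + TRANSFER_16K_MS * 32   # ~18.75 ms, "about 20 ms"

print(ios_512k, access_512k_ms)
```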

Reasons to limit the block size
First, we cannot use blocks that cover several tracks effectively.
Second, small relations would occupy only a fraction of a block, so large blocks would waste space on the disk.
Third, certain data structures for secondary-storage organization prefer to divide data among many blocks, and therefore work less well when the block size is too large.
–In fact, the larger the blocks are, the fewer records we can sort by 2PMMS.
Nevertheless, as machines get faster and disks more capacious, there is a tendency for block sizes to grow.

How many records can we sort?
Say:
1. The block size is B bytes.
2. The main memory available for buffering blocks is M bytes.
3. Records take R bytes each.
The number of main-memory buffers is M/B blocks. We need one output buffer, so we can actually use (M/B) − 1 input buffers.
How many sorted sublists does it make sense to produce? (M/B) − 1.
What is the total number of records we can sort? Each time we fill memory, we sort M/R records. Hence, we are able to sort (M/R) × [(M/B) − 1], or approximately M²/RB records.
If we use the parameters from the 2PMMS example: M = 100 MB = 100 × 2^20 ≈ 10^8 bytes, B = 16,384 bytes, R = 160 bytes.
So M²/RB ≈ 4.2 billion records, or 2/3 of a terabyte.
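The capacity formula, evaluated with the slide's parameters (note that M is really 100 × 2^20 = 104,857,600 bytes, which is what makes the answer come out to 4.2 billion rather than 3.8 billion):

```python
def max_records_2pmms(M, B, R):
    """2PMMS capacity: M/R records per memory-load times (M/B)-1 sublists
    (one of the M/B buffers is reserved for output)."""
    return (M // R) * (M // B - 1)

M = 100 * 2**20                          # 100 MB of main memory
cap = max_records_2pmms(M, 16_384, 160)
print(cap, cap * 160 / 1e12)             # ~4.2 billion records, ~2/3 of a terabyte
```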

Sorting larger relations
If our relation is bigger, we can use 2PMMS to create sorted sublists of M²/RB records each. Then, in a third pass, we can merge (M/B) − 1 of these sorted sublists.
The third phase lets us sort [(M/B) − 1] × [M²/RB] ≈ M³/RB² records.
For our example, the third phase lets us sort about 27 trillion records, occupying about 4.3 petabytes!
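The three-pass capacity, evaluated with the same parameters as the two-pass example:

```python
M, B, R = 100 * 2**20, 16_384, 160
two_pass = (M // R) * (M // B - 1)      # records 2PMMS can sort (~4.2 billion)
three_pass = two_pass * (M // B - 1)    # a third pass merges (M/B)-1 such lists

print(three_pass, three_pass * R / 1e15)   # ~27 trillion records, ~4.3 petabytes
```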

Improving the Running Time of 2PMMS
Here are some techniques that sometimes make secondary-memory algorithms more efficient:
–Group blocks by cylinder.
–One big disk → several smaller disks.
–Mirror disks = multiple copies of the same data.
–"Prefetching" or "double buffering."
–Disk scheduling; the "elevator" algorithm.

Cylindrification
If we are going to read or write blocks in a known order, place them by cylinder, so that once we reach a cylinder we can read block after block, with no seek time or rotational latency.
Application to Phase 1 of 2PMMS:
1. Initially, the records are on 196 consecutive cylinders.
2. Load main memory from 13 consecutive cylinders.
–The order in which blocks are read is unimportant, so the only time besides transfer time is one random seek and 12 one-cylinder seeks (negligible).
–Time to transfer 6,400 blocks at 0.25 ms/block = 1.60 sec.
3. Write each sorted sublist onto 13 consecutive cylinders, so the write time is also about 1.60 sec.
Total for Phase 1: about 1.60 × 2 × 16 ≈ 52 sec.
But in Phase 2 …?
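The Phase 1 timing under cylindrification, as plain arithmetic (seeks are neglected, as in the slide):

```python
TRANSFER_MS = 0.25          # per-block transfer time; no seek/rotation within a cylinder
BLOCKS_PER_FILL = 6_400     # one memory-load
FILLS = 16

fill_secs = BLOCKS_PER_FILL * TRANSFER_MS / 1000   # 1.6 s to read one memory-load
phase1_secs = fill_secs * 2 * FILLS                # read + write for all 16 fills

print(phase1_secs)   # 51.2 s, i.e. the slide's "about 52 sec"
```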

Cylindrification – Phase 2
Storage by cylinders does not help in the second phase:
–Blocks are read from the fronts of the sorted sublists in an order determined by which list next exhausts its current block.
–Output blocks are written one at a time, interspersed with block reads.
Thus, the second phase will still take about 37 min. We have cut the sorting time almost in half, but cannot do better by cylindrification alone.

Multiple Disks
Use several disks with independent heads.
Example: instead of one large 8 GB disk (Megatron 747), use 4 smaller disks of 2 GB each (Megatron 737).
We divide the given records among the four disks; the data will occupy 196 adjacent cylinders on each disk.
Phase 1 of 2PMMS: load main memory from all 4 disks in parallel.
–The time to fill 1/4 of memory from one disk is 1/4 × 1.60 sec = 0.4 sec. Since we read from all four disks in parallel, the other 3/4 of memory is filled from the other disks in that same time. Hence, one full fill of memory takes 0.4 sec.
–We do this 16 times: 0.4 × 16 = 6.4 sec for reading, plus 6.4 sec for writing, for a total of 12.8 sec in Phase 1.
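The four-disk Phase 1 timing, as plain arithmetic:

```python
DISKS = 4
SINGLE_DISK_FILL_SECS = 1.6      # one memory-load, from the cylindrification analysis
FILLS = 16

fill_secs = SINGLE_DISK_FILL_SECS / DISKS       # the 4 disks transfer in parallel
phase1_secs = FILLS * (fill_secs + fill_secs)   # each fill is read, then written

print(phase1_secs)   # 12.8 s
```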

Multiple Disks – Phase 2
Once a fill is sorted in main memory, write its blocks onto the 4 disks.
Phase 1 = 16 × (0.4 sec + 0.4 sec) = 12.8 sec;
–it was 52 sec with 1 disk + cylindrification.
Phase 2: use 4 output buffers, one per disk, cutting the writing time to about 1/4.
What about the reading part? If we are careful about timing, we can often manage to read from four different sorted sublists whose previous blocks were exhausted, cutting the reading time to about 1/2 or 1/3.
Total time for Phase 2 is about 37/2 ≈ 18 min. Total time for both phases ≈ 18 min.

Mirror Disks
A mirror disk is an identical copy of another disk.
Improves reliability (when one disk crashes) at the expense of extra disks.
With n copies, we can handle n reads in the time of 1 read.
A read can be done from the disk with the shortest seek time (i.e., the closest head).
Writing: no speedup, no slowdown compared to a single disk (why?)

Mirror Disks (Cont'd)
Writing: no speedup, no slowdown. That is because whenever we need to write a block, we write it on all disks that have a copy. Since the writes can take place in parallel, the elapsed time is about the same as for writing to a single disk.
Obvious minus: the cost of the extra disks.
In the "multiple disks" example, only if we were careful about timing could we manage to read from four different sorted sublists whose previous blocks were exhausted. With mirror disks, we are guaranteed to be able to do so.

Prefetching and large-scale buffering
If we have extra space for main-memory buffers, consider loading buffers in advance of need. Similarly, keep output blocks buffered in main memory until it is convenient to write them to disk.
Example: Phase 2 of 2PMMS.
With 128 MB of main memory, we can afford to buffer 2 cylinders for each of the 16 sublists and for the output (one cylinder = 4 MB).
Consume one cylinder for each sublist while the other is being loaded from disk. Similarly, write one output cylinder while the other is being constructed.
Thus, seek and rotational latency are almost eliminated, lowering the total read and write times to about 27 sec each.
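The double-buffering pattern can be sketched as follows (a pure simulation: in a real implementation the "load" of the next buffer would overlap consumption of the current one via asynchronous I/O; here only the alternation between the two buffers is illustrated):

```python
def double_buffered(chunks):
    """Alternate between two buffers: while buffer i%2 is being consumed,
    the next chunk is (conceptually) prefetched into the other buffer."""
    buffers = [None, None]
    out = []
    if chunks:
        buffers[0] = chunks[0]                  # prefetch the first chunk
    for i in range(len(chunks)):
        if i + 1 < len(chunks):
            buffers[(i + 1) % 2] = chunks[i + 1]  # "load" next while current is used
        out.extend(buffers[i % 2])                # consume the current buffer
    return out
```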