Query Processing Exercise Session 1.


How I/O is Done
An application program reads from and writes to its private memory. The system (OS or DBMS) transfers data between the disk, the buffer, and the program's private memory. When a program wants to read, the system brings the blocks from the disk if they are not already in the buffer. When a program writes, the system is responsible for transferring the blocks from the buffer to the disk. The system is also in charge of removing blocks from the buffer to make room for new ones.
[Figure: the disk, a buffer of blocks B1, B2, B3, …, Bn in RAM, and the program's private memory, with arrows showing the transfers performed by the system]

Application Programs Only Deal with Records
An application program works only with records and is not aware that there are blocks and buffers. When a program reads a record, the system brings the relevant block from the disk to the buffer and then copies the record to the program's private memory. Writing works similarly, but in the opposite direction.
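The record-level view described above can be sketched as a toy buffer manager. This is only an illustration: the class name `BufferPool`, the dictionary standing in for the disk, and the eviction choice (oldest loaded block) are all assumptions, not part of the slides.

```python
class BufferPool:
    """Toy buffer manager: programs ask for records, the pool moves whole blocks."""

    def __init__(self, disk, capacity):
        self.disk = disk            # hypothetical disk: block id -> list of records
        self.capacity = capacity    # number of blocks the buffer can hold
        self.buffer = {}            # block id -> records currently held in RAM
        self.disk_reads = 0         # number of block transfers from the disk

    def read_record(self, block_id, slot):
        if block_id not in self.buffer:
            if len(self.buffer) >= self.capacity:
                # Assumption: evict the block that was loaded first.
                # A real system applies a replacement policy here.
                victim = next(iter(self.buffer))
                del self.buffer[victim]
            self.buffer[block_id] = list(self.disk[block_id])
            self.disk_reads += 1
        # Hand the requested record to the caller (its "private memory").
        return self.buffer[block_id][slot]


disk = {1: ["r1", "r2"], 2: ["r3", "r4"], 3: ["r5", "r6"]}
pool = BufferPool(disk, capacity=2)
pool.read_record(1, 0)   # miss: block 1 is read from the disk
pool.read_record(1, 1)   # hit: block 1 is already in the buffer
pool.read_record(2, 0)   # miss: block 2 is read from the disk
print(pool.disk_reads)   # 2
```

The caller never names a buffer slot or issues a disk read; it only asks for a record, which is the point of the slide.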

Replacement Policies
When the buffer is full, which block should be removed? Ideally, the one that will be needed again only a long time from now. The OS usually implements an LRU (least recently used) policy. What if all the blocks in the buffer are still needed by the programs running now? The answer to this last question will be given during the last week of the semester.

Why LRU is not Good for DBMS
An example: the buffer holds n-1 blocks, and we need to read a sequential file of n blocks several times. Under LRU, each block is evicted just before it is needed again, so every access goes to the disk. In this case, MRU (most recently used) is the best policy for deciding which block to remove. The same holds when reading nodes of a B+tree.
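The example can be checked with a small simulation. The sketch below is an assumption-laden illustration, not a real buffer manager: the function name `count_misses` and the values n = 10 and 5 passes are made up for the demonstration.

```python
from collections import OrderedDict


def count_misses(num_blocks, buffer_size, passes, policy):
    """Count disk reads when scanning blocks 0..num_blocks-1 sequentially, `passes` times."""
    buffer = OrderedDict()   # keys ordered from least to most recently used
    misses = 0
    for _ in range(passes):
        for block in range(num_blocks):
            if block in buffer:
                buffer.move_to_end(block)   # mark as most recently used
                continue
            misses += 1
            if len(buffer) >= buffer_size:
                # LRU evicts the oldest entry, MRU evicts the newest one.
                buffer.popitem(last=(policy == "MRU"))
            buffer[block] = True
    return misses


n = 10   # illustrative file size; the buffer holds n-1 blocks
print(count_misses(n, n - 1, passes=5, policy="LRU"))   # 50
print(count_misses(n, n - 1, passes=5, policy="MRU"))   # 14
```

With LRU every one of the 50 accesses misses, whereas MRU pays n misses on the first pass and only one miss per pass afterwards.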

How to Use a Buffer Efficiently
Problem: we have a file that is a sequence of blocks B1, B2, ..., and a program that processes B1, then B2, then B3, and so on.

Single-Buffer Solution
(1) Read B1 → buffer
(2) Process data in memory
(3) Read B2 → buffer
(4) Process data in memory
...

Total Time (not just I/O)
Say P = time to process 1 block, R = time to read 1 block from disk, and n = # blocks. Then the single-buffer time is n(P+R).

Double Buffering
For simplicity, we assume that the processing is done in the buffer (rather than in the program's memory).
[Figure: a two-block buffer and a disk holding blocks A, B, C, D, E, F, G]

While the Program Processes Block A, the System Reads Block B
[Figure: the buffer holds A (being processed) and B (being read from the disk)]

Now the Program Processes Block B While the System Reads Block C
[Figure: A is done; the buffer holds B (being processed) and C (being read from the disk)]

Once Again
[Figure: the same step repeated for the following blocks]

Total Time Assuming P ≤ R
P = time to process 1 block, R = time to read 1 block, n = # blocks. What is the total time? Single buffering time = n(R+P). Double buffering time = nR+P. The CPU time hardly affects the total length of the computation, so it is correct to count just the I/O operations when analyzing running time.
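The answer nR+P can be checked by simulating the two-buffer pipeline step by step. The sketch below is an idealized model (no seek or latency variation, reads strictly sequential); the function names and the numeric values are assumptions chosen for illustration.

```python
def single_buffer_time(n, P, R):
    """Each block is read and then processed, with no overlap."""
    return n * (R + P)


def double_buffer_time(n, P, R):
    """Simulate two buffer slots: block i+1 is read while block i is processed."""
    read_end = [0.0] * (n + 1)   # read_end[i]: time when block i is fully in the buffer
    proc_end = [0.0] * (n + 1)   # proc_end[i]: time when block i has been processed
    for i in range(1, n + 1):
        # A slot becomes free once block i-2 has been processed.
        slot_free = proc_end[i - 2] if i >= 2 else 0.0
        read_end[i] = max(read_end[i - 1], slot_free) + R
        proc_end[i] = max(read_end[i], proc_end[i - 1]) + P
    return proc_end[n]


n, P, R = 100, 1.0, 10.0             # illustrative values with P <= R
print(single_buffer_time(n, P, R))   # 1100.0
print(double_buffer_time(n, P, R))   # 1001.0, i.e. n*R + P
```

In this model the disk is never idle, so the only processing time that is not hidden behind a read is that of the last block.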

The Actual Difference
The actual difference between single and double buffering is much larger than n(R+P) - (nR+P). Why? Because these are actually two different R's: double buffering enables reading the file sequentially, whereas single buffering is even worse than random reading, since the latency is almost a full revolution.

Questions
Is double buffering useful also when writing to the disk? How do you activate double buffering? Suppose your program is a CPU cruncher, that is, P ≥ R. Compute the total time for single and double buffering when P ≥ R. Does double buffering help?
Answers: Double buffering is useful also when writing to the disk. However, it is a different story if you have to read each block, modify it (based on some computation), and then write it back before reading the next block. In many cases, double buffering is activated automatically by the DBMS. To be sure, check the details of the system you are using.
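One way to work the CPU-cruncher exercise, where processing a block takes at least as long as reading it, is sketched below. The closed form used for double buffering is an assumption of the same idealized pipeline model as before (it is not stated in the slides): only the first read is exposed, and afterwards the slower activity sets the pace.

```python
def single_time(n, P, R):
    """No overlap: every block costs one read plus one processing step."""
    return n * (P + R)


def double_time(n, P, R):
    """Idealized pipeline: one exposed step of the faster activity,
    then n steps paced by the slower one."""
    return min(P, R) + n * max(P, R)


n, R = 1000, 10.0   # illustrative values
for P in (1.0, 10.0, 100.0):
    speedup = single_time(n, P, R) / double_time(n, P, R)
    print(f"P={P}: speedup {speedup:.3f}")
```

Under this model the total for the CPU-bound case is R + nP, so double buffering still helps, but the gain shrinks as P grows relative to R. The speedup is largest when P is close to R and never reaches 2.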

Using a Buffer of 2k Blocks
"Double buffering" is not limited to using a buffer of just two blocks. An application program processes k blocks in main memory while the system reads the next k blocks. Is it better to use 2k blocks rather than just 2? In an ideal situation, it is not better. In practice, however, double buffering does not work perfectly, because other programs occasionally "steal" the controller. After a steal, there are nonzero seek time and latency, and it is better if their cost is spread over k blocks.

Read-Ahead Buffering
When an application asks for one block, the system reads several more blocks sequentially, in anticipation that the application will need them. This is just one example of using double buffering.

Best Case of Joining 2 Relations
It is meaningless to specify the I/O cost without saying how much memory is needed to realize that cost. Relation R has B_R blocks, relation S has B_S blocks, and the size of the result is C blocks. The best possible I/O cost is B_R + B_S + C. How much memory is needed to achieve this cost?
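One standard way to reach this cost, offered here as an assumption rather than as the answer intended by the slides, is a one-pass join: keep the smaller relation entirely in memory, stream the other relation one block at a time, and use one more block for output. The sketch below counts block I/Os for that scheme; the function name, the record layout (join on the first field), and the sample data are hypothetical.

```python
from collections import defaultdict
from math import ceil


def one_pass_join(r_blocks, s_blocks, records_per_output_block):
    """Join R and S on the first field of each record, counting block I/Os.

    Assumes R is the smaller relation and fits in memory entirely.
    """
    reads = 0
    index = defaultdict(list)
    for block in r_blocks:               # read every block of R exactly once
        reads += 1
        for rec in block:
            index[rec[0]].append(rec)
    result = []
    for block in s_blocks:               # read every block of S exactly once
        reads += 1
        for rec in block:
            for match in index.get(rec[0], []):
                result.append(match + rec[1:])
    writes = ceil(len(result) / records_per_output_block)   # the C output blocks
    memory_blocks = len(r_blocks) + 2    # all of R, one block of S, one output block
    return result, reads + writes, memory_blocks


R_example = [[(1, "a"), (2, "b")], [(3, "c")]]
S_example = [[(1, "x"), (1, "y")], [(3, "z"), (4, "w")]]
rows, io_cost, memory = one_pass_join(R_example, S_example, records_per_output_block=2)
print(io_cost, memory)   # 6 4
```

Here B_R = 2, B_S = 2, and the three result records fill C = 2 blocks, so the I/O cost is 6 while 4 blocks of memory are used.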

Selection Using an Index
An index is a data structure that gives the addresses of records with a given value in some field(s). For now, we only consider the I/O cost of accessing the file itself and ignore the cost of using the index. ID is a unique key, so what is the cost of doing the selection ID=102 using an index? The I/O cost is 1 (because only one record satisfies the condition of the selection).

Selection Using an Index (cont.)
Name is not a unique key, there are 1,000 records with the name "levy", and a block can store 50 records. What is the cost of the selection Name="levy"? It depends on whether the file is clustered on Name, that is, whether all the records with the same name are physically close to each other on the disk. If the file is sorted on Name, then it is clustered. Note that a file cannot be clustered on two different fields (unless one is a unique key)!

So, What is the Answer?
If the file is clustered on Name, then there are at least 20 and at most 21 blocks holding records with Name="levy". When does the best case (i.e., 20) happen? So, the I/O cost (in the worst case) is 21. If the file is not clustered, then (in the worst case) the I/O cost is 1,000. Does the cost depend on the file size? Yes: if the file has only 700 blocks, then it is better to read all the blocks of the file one by one than to use the index. That is, we do not use an index when just scanning the file is better.
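The arithmetic behind these numbers can be sketched as follows. The function names are hypothetical, and the model counts only block reads of the data file, as the slides do.

```python
from math import ceil


def clustered_cost(matching_records, records_per_block):
    """Best and worst case number of blocks holding a contiguous run of matches."""
    best = ceil(matching_records / records_per_block)
    # Worst case: the run starts at the last slot of a block, so one record
    # sits in the first block and the rest spill into the following blocks.
    worst = ceil((matching_records - 1) / records_per_block) + 1
    return best, worst


def unclustered_cost(matching_records):
    """Worst case: every matching record lives in a different block."""
    return matching_records


def selection_cost(matching_records, records_per_block, file_blocks, clustered):
    """Worst-case I/O cost, falling back to a full scan when that is cheaper."""
    if clustered:
        index_cost = clustered_cost(matching_records, records_per_block)[1]
    else:
        index_cost = unclustered_cost(matching_records)
    return min(index_cost, file_blocks)


print(clustered_cost(1000, 50))                               # (20, 21)
print(selection_cost(1000, 50, file_blocks=700, clustered=False))   # 700
print(selection_cost(1000, 50, file_blocks=700, clustered=True))    # 21
```

The best case of 20 blocks occurs when the run of matching records starts exactly at a block boundary.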

Zone Bit Recording
All sectors have the same capacity (typically 512 bytes). All tracks used to have the same number of sectors, but not anymore. Why? The sustained transfer rate at the OD (outer diameter) is higher, and this rate goes down as the heads move toward the center. Use a software tool to measure the sustained transfer rate of your disks.

How It Used to Be
Tracks are concentric circles, divided into sectors. There are gaps between sectors and between tracks. All sectors have the same number of bytes (typically 512).

Zone Bit Recording

Physical Addresses are Just "Logical"
The physical address of a block consists of:
Device ID
Cylinder #
Surface # (i.e., track number)
Sector #
Due to zone bit recording (and other reasons), the physical addresses do not reflect the true geometry of the disk. For example, the addresses assume the same number of sectors in every track.

The Five-Minute Rule
The Five-Minute Rule for Trading Memory for Disc Accesses. Jim Gray & Franco Putzolu, 1987.
The Five-Minute Rule, Ten Years Later. Goetz Graefe & Jim Gray, 1997.
The Five-Minute Rule 20 Years Later (and How Flash Memory Changes the Rules). Goetz Graefe, 2009 (originally 2007).

IOPS
IOPS = I/O operations per second. Currently, the IOPS of a disk is in the range 100 – 200. Let D = price of a disk and I = # of IOPS. If a block has to be brought into memory every X seconds, then the (proportional) cost is D/(XI).

An Alternative
Keep the block in memory all the time. M = the cost of memory (RAM) for 1 block (varies with the size of the block). The break-even point is when equality holds, that is, M = D/(XI), and hence X = D/(IM).

The New Rule
The cost of 1 IOPS (that is, D/I) is about $1. The cost of 1MB of RAM is about $0.05. The # of 4KB blocks in 1MB is 256. Hence, X is about 90 minutes. It used to be about 5 minutes in 1987 and 1997. So, buy RAM for each block that you need at least once every 90 minutes.
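The break-even interval follows from the prices quoted on the slide. The sketch below simply evaluates the break-even formula with those numbers; the function and parameter names are assumptions.

```python
def break_even_seconds(price_per_iops, ram_price_per_mb, block_kb):
    """Break-even access interval X = (D/I) / M, in seconds."""
    blocks_per_mb = 1024 // block_kb
    m = ram_price_per_mb / blocks_per_mb   # M: RAM cost of keeping one block
    return price_per_iops / m              # D/I divided by M


x = break_even_seconds(price_per_iops=1.0, ram_price_per_mb=0.05, block_kb=4)
print(round(x / 60, 1))   # 85.3 minutes, which the slide rounds to about 90
```

With 256 blocks per megabyte, M is roughly $0.0002 per block, so a block that is touched more often than about once every 85 minutes is cheaper to keep in RAM than to re-read from disk.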

Not Only a Matter of Cost
The poor IOPS performance of hard disks is a bottleneck of I/O-intensive systems. The solution is solid-state drives (SSD).
http://www.theregister.co.uk/2009/09/23/insane_ssd_performance/

Disk Arrays
RAIDs (various flavors)
Block striping
Mirroring
[Figure: several physical disks that are logically one disk]

RAID Tutorial http://www.acnc.com/04_01_00.html

On-Disk Cache
[Figure: components labeled P, M, and C, and disks that each have their own cache]

Summary of Optimizations
Disk-scheduling algorithms, e.g., the elevator algorithm.
Larger blocks (8KB nowadays) and larger buffers. As the price of RAM drops, blocks and buffers get bigger.
Read-ahead buffering. This is useful if the system knows in advance the blocks that will be needed shortly, or if the system guesses correctly that the following N contiguous blocks are going to be needed.
RAID.
On-disk cache.

A Bit More on Bytes
What does burst rate mean?
Gibibytes vs. gigabytes: gibibytes = giga binary bytes. Memory is measured in gibibytes, whereas the capacity of disks is given in gigabytes. 1MB = K × K and 1GB = K × K × K, where K = 1024 for RAM but only 1000 for disks.
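The gap between the two units is easy to quantify. The sketch below converts a disk capacity quoted in gigabytes (K = 1000) to gibibytes (K = 1024); the 500 GB figure is just an illustrative value.

```python
GB = 1000 ** 3    # gigabyte, as used for disk capacity
GIB = 1024 ** 3   # gibibyte, as used for RAM


def gb_to_gib(gigabytes):
    """Convert a capacity in gigabytes to gibibytes."""
    return gigabytes * GB / GIB


print(round(gb_to_gib(500), 1))   # 465.7
```

So a disk sold as 500 GB holds about 7% fewer bytes than 500 GiB of RAM would.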

Relational Operations on Bags
What are the definitions of the five basic operators when they are applied under the bag semantics, that is, when relations may have duplicates? When can we push selection and projection through join?

Pushing Selections and Projections
(1) Repeatedly split each selection with ⋀ using the equivalence σ_{C1⋀C2}(E) ≡ σ_{C1}(σ_{C2}(E)).
(2) Repeatedly do the following: push selections through projections, and push selections into every operand of a natural join if possible (i.e., if the operand contains all the attributes of the selection).
(3) After each selection and each join, do a projection that leaves only the attributes that are needed either for later selections and joins, or for the final result.
Does it work also for bags?
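For the first step, splitting a conjunctive selection, the bag case can be tried out on a small example. This is a single illustrative instance, not a proof; representing a bag as a `Counter` of tuples and the helper name `select` are assumptions.

```python
from collections import Counter


def select(bag, pred):
    """Bag selection: keep every copy of each tuple that satisfies pred."""
    return Counter({t: m for t, m in bag.items() if pred(t)})


# A bag with duplicates: tuple -> multiplicity.
E = Counter({(1, "a"): 2, (2, "a"): 3, (2, "b"): 1})
c1 = lambda t: t[0] == 2
c2 = lambda t: t[1] == "a"

combined = select(E, lambda t: c1(t) and c2(t))   # selection with C1 and C2 at once
split = select(select(E, c2), c1)                 # selection with C2, then with C1
print(combined == split)   # True
```

Selection looks at each tuple independently and keeps its multiplicity, which is why the split does not change the result on this example.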

The Duplicate-Elimination Operator
δ is the operation of duplicate elimination. The result of δ(R) is obtained from R by removing duplicates. Through which operations can we push δ?
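A small experiment can suggest part of the answer. The sketch below, which models bags as `Counter` objects and uses hypothetical helper names, compares duplicate elimination applied before and after a selection, and before and after a bag projection. It shows one instance of each case and is not a general proof.

```python
from collections import Counter


def delta(bag):
    """Duplicate elimination: every tuple that occurs is kept exactly once."""
    return Counter({t: 1 for t in bag})


def select(bag, pred):
    """Bag selection: keep every copy of each tuple that satisfies pred."""
    return Counter({t: m for t, m in bag.items() if pred(t)})


def project(bag, cols):
    """Bag projection: keep the listed columns, without removing duplicates."""
    out = Counter()
    for t, m in bag.items():
        out[tuple(t[c] for c in cols)] += m
    return out


R = Counter({(1, 2): 1, (1, 3): 1})
pred = lambda t: t[1] > 2

print(delta(select(R, pred)) == select(delta(R), pred))   # True
print(delta(project(R, [0])) == project(delta(R), [0]))   # False
```

On this instance, duplicate elimination commutes with selection but not with bag projection: projecting first creates two copies of (1,), which elimination applied earlier cannot remove.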