Lecture 20: Representing Data Elements


Lecture 20: Representing Data Elements Wednesday, November 8, 2000

Outline
- External sorting (2.3.3-2.3.5)
- Representing data elements (3)

Sorting
Illustrates the difference in algorithm design when your data is not in main memory.
Problem: sort 1 GB of data with 1 MB of RAM.
Arises in many places in database systems:
- Data requested in sorted order (ORDER BY)
- Needed for grouping operations
- First step in the sort-merge join algorithm
- Duplicate removal
- Bulk loading of B+-tree indexes

2-Way Merge-Sort: Requires 3 Buffers
Pass 1: read a page, sort it, write it. Only one buffer page is used.
Pass 2, 3, …, etc.: three buffer pages are used.
[Figure: two input buffers (INPUT 1, INPUT 2) and one OUTPUT buffer in main memory, with pages read from disk and written back to disk.]

Two-Way External Merge Sort
Each pass we read and write each page in the file.
With N pages in the file, the number of passes is $\lceil \log_2 N \rceil + 1$.
So the total cost is $2N(\lceil \log_2 N \rceil + 1)$ page I/Os.
Improvement: start with larger runs. Sort 1 GB with 1 MB of memory in 10 passes.
[Figure: sorting a 7-page input file; "|" separates runs]
Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
Pass 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
Pass 1 (2-page runs):  2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
Pass 2 (4-page runs):  2,3 4,4 6,7 8,9 | 1,2 3,5 6
Pass 3 (8-page runs):  1,2 2,3 3,4 4,5 6,6 7,8 9
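The passes of two-way external merge sort can be sketched in a few lines of Python. This is only an in-memory simulation under simplifying assumptions: pages are modeled as Python lists, and the names `merge` and `two_way_external_sort` are made up for this illustration. A real system would read and write pages through three buffer frames.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list (the two-input merge step)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def two_way_external_sort(pages):
    """Return (sorted keys, number of passes) for a non-empty list of pages."""
    # Pass 0: sort every page on its own, giving 1-page runs.
    runs = [sorted(page) for page in pages]
    passes = 1
    # Each later pass merges neighbouring runs pairwise, doubling the run length.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            right = runs[i + 1] if i + 1 < len(runs) else []
            merged.append(merge(runs[i], right))
        runs = merged
        passes += 1
    return runs[0], passes


# The 7-page input file from the slide.
input_file = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
result, passes = two_way_external_sort(input_file)
total_io = 2 * len(input_file) * passes  # each pass reads and writes every page
```

For the 7-page file this takes 4 passes (pass 0 plus three merge passes), so the simulated cost is 2 * 7 * 4 = 56 page I/Os.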

Can We Do Better?
We have more main memory. We should use it to improve performance.

Cost Model for Our Analysis
B: Block size
M: Size of main memory
N: Number of records in the file
R: Size of one record

External Merge-Sort
Phase one: load M bytes in memory, sort.
Result: runs of length M/R records.
[Figure: the file is read from disk into M bytes of main memory, and sorted runs of M/R records each are written back to disk.]

Phase Two
Merge M/B – 1 runs into a new run.
Result: runs now have $(M/R)(M/B - 1)$ records.
[Figure: runs on disk are read through input buffers (Input 1, Input 2, …, Input M/B) and merged through an Output buffer back to disk, all within M bytes of main memory.]

Phase Three
Merge M/B – 1 runs into a new run.
Result: runs now have $(M/R)(M/B - 1)^2$ records.
[Figure: runs on disk are read through input buffers (Input 1, Input 2, …, Input M/B) and merged through an Output buffer back to disk, all within M bytes of main memory.]
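The three phases generalize to a loop: form sorted runs, then repeatedly merge a fixed number of runs at a time. The sketch below is an in-memory illustration only. The parameter `run_len` stands in for M/R and `fan_in` for M/B – 1, and the function name is hypothetical. It uses the standard library's `heapq.merge` for the multiway merge.

```python
import heapq


def external_merge_sort(records, run_len, fan_in):
    """Return (sorted records, number of passes).

    run_len plays the role of M/R (records per initial run) and
    fan_in the role of M/B - 1 (runs merged at once).
    """
    if run_len < 1 or fan_in < 2:
        raise ValueError("need run_len >= 1 and fan_in >= 2")
    # Phase one: cut the input into chunks of run_len records and sort each.
    runs = [sorted(records[i:i + run_len])
            for i in range(0, len(records), run_len)]
    passes = 1
    # Later phases: merge fan_in runs at a time until one run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes


data = list(range(100, 0, -1))
result, passes = external_merge_sort(data, run_len=5, fan_in=4)
```

With 100 records, runs of 5, and a fan-in of 4, the run count goes 20, 5, 2, 1, so the sort finishes in 4 passes.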

Cost of External Merge Sort
Number of passes: $1 + \lceil \log_{M/B - 1}(NR/M) \rceil$
Think differently. Given B = 4 KB, M = 64 MB, R = 0.1 KB:
Pass 1: runs of length M/R = 640,000. Have now sorted runs of 640,000 records.
Pass 2: runs increase by a factor of M/B – 1 ≈ 16,000. Have now sorted runs of 10,240,000,000 ≈ $10^{10}$ records.
Pass 3: runs increase by a factor of M/B – 1 ≈ 16,000. Have now sorted runs of about $10^{14}$ records.
Nobody has so much data! Can sort everything in 2 or 3 passes!
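The arithmetic on this slide can be checked with a short script. It assumes decimal units (1 KB = 1,000 bytes), which is what makes M/R come out to exactly 640,000. The function names are made up for this sketch.

```python
B = 4_000        # block size in bytes (4 KB, decimal units assumed)
M = 64_000_000   # main memory in bytes (64 MB)
R = 100          # record size in bytes (0.1 KB)


def run_length_after(pass_no, B=B, M=M, R=R):
    """Records per sorted run after the given pass (pass 1 creates the runs)."""
    return (M // R) * (M // B - 1) ** (pass_no - 1)


def passes_needed(N, B=B, M=M, R=R):
    """Smallest number of passes after which one run covers N records."""
    run, passes = M // R, 1
    while run < N:
        run *= M // B - 1
        passes += 1
    return passes


after_pass_1 = run_length_after(1)  # 640,000 records
after_pass_2 = run_length_after(2)  # 640,000 * 15,999, roughly 10**10
after_pass_3 = run_length_after(3)  # roughly 1.6 * 10**14
```

The exact merge factor is 15,999 rather than 16,000, so the slide's figures are slight round-ups. The conclusion is unchanged: a file of a billion records needs 2 passes, and a trillion records needs 3.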

Representing Data Elements
Relational database elements:
    CREATE TABLE Product (
        pid INT PRIMARY KEY,
        name CHAR(20),
        description VARCHAR(200),
        maker CHAR(10) REFERENCES Company(name))
A tuple is represented as a record.

Representing Data Elements
Representing objects:
    interface Company {
        attribute string name;
        relationship Set<Product> makes inverse Product::maker;
    }
An object is represented as a record plus an object identifier.
What to do with repeating fields (e.g. makes)?

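Record Formats: Fixed Length
[Figure: a record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4, starting at base address B; the third field is at address B+L1+L2.]
Information about field types is the same for all records in a file; it is stored in the system catalogs.
Finding the i'th field does not require a scan of the record: its address is the base address plus the lengths of the preceding fields.
Note the importance of schema information!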
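The address computation for fixed-length records is plain arithmetic, as the following sketch shows. The byte sizes (4 bytes for an INT, one byte per CHAR) and the three-field layout are assumptions chosen for illustration, and fields are numbered from 0.

```python
import struct

# Assumed layout: pid INT (4 bytes), name CHAR(20), maker CHAR(10).
# In a real system these lengths would come from the system catalog.
FIELD_LENGTHS = [4, 20, 10]


def field_offset(i, lengths=FIELD_LENGTHS):
    """Offset of field i (0-based) from the record's base address."""
    return sum(lengths[:i])


def read_field(block, base, i, lengths=FIELD_LENGTHS):
    """Return the raw bytes of field i of the record starting at `base`."""
    start = base + field_offset(i, lengths)
    return block[start:start + lengths[i]]


# A record stored 16 bytes into a block; "<" means no alignment padding.
record = struct.pack("<i20s10s", 7, b"widget", b"acme")
block = b"\x00" * 16 + record
maker = read_field(block, base=16, i=2).rstrip(b"\x00")
```

Here the third field sits at base + 4 + 20, matching the slide's B+L1+L2, and no other field has to be inspected to find it.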

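Record Header
[Figure: a record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4, preceded by a header that holds a pointer to the schema, the record length, and a timestamp.]
Need the header because:
- The schema may change; for a while new and old records may coexist
- Records from different relations may coexist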

Variable Length Records
[Figure: a record with a header (record length, other header information) followed by fields F1, F2, F3, F4 of lengths L1, L2, L3, L4.]
Place the fixed-length fields first: F1, F2.
Then the variable-length fields: F3, F4.
Null values take 2 bytes only.
Sometimes they take 0 bytes (when at the end).
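One possible byte layout along these lines is sketched below. It is an assumption-laden illustration, not the format of any particular DBMS: the header is just a 2-byte record length, the fixed fields are pid and name, the single variable field carries a 2-byte length prefix, and NULL is marked by the hypothetical prefix value 0xFFFF with no payload (so it costs 2 bytes).

```python
import struct

NULL_MARK = 0xFFFF     # assumed marker: a 2-byte prefix with no payload
FIXED_FMT = "<i20s"    # pid INT, name CHAR(20); "<" means no padding
FIXED_SIZE = struct.calcsize(FIXED_FMT)  # 24 bytes


def encode_record(pid, name, description):
    """Header (total length), then fixed fields, then the variable field."""
    fixed = struct.pack(FIXED_FMT, pid, name.encode())
    if description is None:
        var = struct.pack("<H", NULL_MARK)
    else:
        data = description.encode()
        var = struct.pack("<H", len(data)) + data
    body = fixed + var
    return struct.pack("<H", 2 + len(body)) + body


def decode_record(buf):
    """Inverse of encode_record; returns (pid, name, description)."""
    pid, name = struct.unpack_from(FIXED_FMT, buf, 2)
    (n,) = struct.unpack_from("<H", buf, 2 + FIXED_SIZE)
    start = 2 + FIXED_SIZE + 2
    description = None if n == NULL_MARK else buf[start:start + n].decode()
    return pid, name.rstrip(b"\x00").decode(), description


rec = encode_record(1, "widget", "small part")
null_rec = encode_record(2, "gadget", None)
```

Because the fixed fields come first, pid and name are always at the same offsets. Only the variable part needs its length prefix to be read.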

Records With Repeating Fields
[Figure: a record with a header (record length, other header information) followed by fields F1, F2, F3 of lengths L1, L2, L3.]

Storing Records in Blocks
Blocks have fixed size (typically 4 KB).
[Figure: a block holding records R1, R2, R3, R4.]

Spanning Records Across Blocks
When records are very large.
Or even medium size: saves space in blocks.
[Figure: two blocks, each with a block header, holding records R1, R2, R3, with R2 split across the two blocks.]

BLOB
Binary large objects.
Supported by modern database systems.
E.g. images, sounds, etc.
Storage: attempt to cluster blocks together.

Modifications: Insertion
File is unsorted: add the record to the end (easy).
File is sorted:
- Is there space in the right block? Yes: we are lucky, store it there.
- Is there space in a neighboring block? Look 1-2 blocks to the left/right, shift records.
- If everything else fails, create an overflow block.

Overflow Blocks
[Figure: blocks n-1, n, and n+1 of the file, with an overflow block attached to block n.]
After a while the file starts being dominated by overflow blocks: time to reorganize.
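The insertion policy for a sorted file can be sketched as follows. This is a simplified model under stated assumptions: blocks are non-empty sorted Python lists, only the immediate neighbors are checked (the slide allows looking 1-2 blocks away), and the class name `SortedFile` and its return strings are invented for this example.

```python
import bisect


class SortedFile:
    def __init__(self, blocks, capacity):
        self.blocks = blocks      # non-empty sorted lists, in key order
        self.capacity = capacity  # maximum records per block
        self.overflow = {}        # block index -> records in its overflow block

    def insert(self, key):
        # The "right" block is the last one whose first key is <= key.
        firsts = [b[0] for b in self.blocks]
        i = max(bisect.bisect_right(firsts, key) - 1, 0)
        block = self.blocks[i]
        if len(block) < self.capacity:
            bisect.insort(block, key)
            return "in place"
        nxt = i + 1
        if nxt < len(self.blocks) and len(self.blocks[nxt]) < self.capacity:
            bisect.insort(block, key)
            self.blocks[nxt].insert(0, block.pop())    # shift largest right
            return "shifted right"
        if i > 0 and len(self.blocks[i - 1]) < self.capacity:
            bisect.insort(block, key)
            self.blocks[i - 1].append(block.pop(0))    # shift smallest left
            return "shifted left"
        self.overflow.setdefault(i, []).append(key)    # last resort
        return "overflow"


f = SortedFile([[1, 3], [5, 7], [9]], capacity=2)
first = f.insert(6)    # block [5, 7] is full, the right neighbor has room
second = f.insert(4)   # block [1, 3] and its only neighbor are full
```

Counting the entries in `overflow` relative to the number of blocks gives a simple signal for when the file is due for reorganization.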

Modifications: Deletions
Free space in the block, shift records.
May be able to eliminate an overflow block.
Can never really eliminate the record, because others may point to it.
Place a tombstone instead (a NULL record).
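A tombstone can be modeled as a sentinel left in the deleted record's slot. The sketch below assumes records are addressed by slot number within a block. The `Block` class and the `TOMBSTONE` sentinel are hypothetical names for this illustration.

```python
TOMBSTONE = object()  # sentinel standing in for the NULL record


class Block:
    def __init__(self, records):
        self.slots = list(records)

    def delete(self, slot):
        """Mark the slot as deleted without moving any other record."""
        self.slots[slot] = TOMBSTONE

    def get(self, slot):
        """Follow a pointer to a slot; a dangling pointer is detected."""
        record = self.slots[slot]
        if record is TOMBSTONE:
            raise KeyError(f"record in slot {slot} was deleted")
        return record


block = Block(["r0", "r1", "r2"])
block.delete(1)
still_valid = block.get(2)  # pointers to other slots are unaffected
```

Because the slot is never reused for a different record, a stale pointer to slot 1 fails cleanly instead of silently reading someone else's data.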

Modifications: Updates
If the new record is shorter than the previous one, easy.
If it is longer, need to shift records or create overflow blocks.

Physical Addresses
Each block and each record has a physical address that consists of:
- The host
- The disk
- The cylinder number
- The track number
- The block within the track
- For records: an offset in the block (sometimes this is in the block's header)

Logical Addresses
Logical address: a string of bytes (10-16).
More flexible: can move blocks/records around.
But need a translation table:
Logical address | Physical address
L1              | P1
L2              | P2
L3              | P3
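The flexibility of logical addresses comes from the indirection: moving a record changes one table entry and leaves every stored pointer alone. The sketch below assumes a physical address with the components of a (host, disk, cylinder, track, block, offset) tuple. The names `PhysicalAddress` and `TranslationTable` are made up for this example.

```python
from typing import NamedTuple


class PhysicalAddress(NamedTuple):
    host: str
    disk: int
    cylinder: int
    track: int
    block: int
    offset: int


class TranslationTable:
    def __init__(self):
        self._map = {}  # logical address -> PhysicalAddress

    def register(self, logical, physical):
        self._map[logical] = physical

    def resolve(self, logical):
        return self._map[logical]

    def move(self, logical, new_physical):
        """Relocate a record; holders of the logical address are unaffected."""
        self._map[logical] = new_physical


table = TranslationTable()
p1 = PhysicalAddress("db1", 0, 12, 3, 7, 128)
table.register("L1", p1)
p1_moved = p1._replace(block=9, offset=0)
table.move("L1", p1_moved)
```

After the move, resolving "L1" yields the new location, while any record that stores "L1" as a pointer needs no update.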

Main Memory Address
When the block is read into main memory, it receives a main memory address.
Need another translation table:
Memory address | Logical address
M1             | L1
M2             | L2
M3             | L3

Optimization: Pointer Swizzling
Pointer swizzling = the process of replacing a physical/logical pointer with a main memory pointer.
Still need the translation table, but subsequent references are faster.

Pointer Swizzling
[Figure: Block 1 and Block 2 on disk; a block read into memory has swizzled pointers to other blocks in memory and unswizzled pointers to blocks still on disk.]

Pointer Swizzling
- Automatic: when a block is read into main memory, swizzle all pointers in the block
- On demand: swizzle only when the user requests
- No swizzling: always use the translation table

Pointer Swizzling
When blocks return to disk: pointers need to be unswizzled.
Danger: someone else may point to this block.
- Pinned blocks: we don't allow the block to return to disk
- Keep a list of references to this block
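On-demand swizzling and unswizzling can be sketched as follows. This is a toy model: a dictionary stands in for the disk plus translation table, Python object references play the role of main memory pointers, and the names `Pointer` and `BufferPool` are assumptions for this illustration. It does not model pinning.

```python
class Pointer:
    def __init__(self, logical):
        self.logical = logical  # address that stays valid on disk
        self.target = None      # in-memory object once swizzled

    @property
    def swizzled(self):
        return self.target is not None


class BufferPool:
    def __init__(self, disk):
        self.disk = disk        # logical address -> record
        self.memory = {}        # records currently in main memory
        self.lookups = 0        # translation-table lookups performed

    def _load(self, logical):
        self.lookups += 1
        if logical not in self.memory:
            self.memory[logical] = self.disk[logical]
        return self.memory[logical]

    def deref(self, ptr):
        """Swizzle on first use; later dereferences skip the table."""
        if not ptr.swizzled:
            ptr.target = self._load(ptr.logical)
        return ptr.target

    def unswizzle(self, ptr):
        """Restore the logical form before the holding block returns to disk."""
        ptr.target = None


pool = BufferPool({"L1": {"name": "acme"}})
p = Pointer("L1")
first = pool.deref(p)
second = pool.deref(p)  # served from the swizzled pointer
```

The second dereference does not touch the table, which is the speed-up swizzling buys. An automatic policy would call `deref` on every pointer in a block as soon as the block is read in.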