Indexing Large Data COMP # 22

Slides:



Advertisements
Similar presentations
COSC 2007 Data Structures II Chapter 14 External Methods.
Advertisements

Chapter 6 File Systems 6.1 Files 6.2 Directories
SE-292 High Performance Computing
Storing Data: Disks and Files
Disk Storage, Basic File Structures, and Hashing
File Management.
Databasteknik Databaser och bioinformatik Data structures and Indexing (II) Fang Wei-Kleiner.
C SINGH, JUNE 7-8, 2010IWW 2010, ISATANBUL, TURKEY Advanced Computers Architecture, UNIT 2 Advanced Computers Architecture UNIT 2 CACHE MEOMORY Lecture7.
More on File Management
Chapter 6 File Systems 6.1 Files 6.2 Directories
Chapter 7 Indexing Structures for Files Copyright © 2004 Ramez Elmasri and Shamkant Navathe.
B-Trees. Motivation When data is too large to fit in the main memory, then the number of disk accesses becomes important. A disk access is unbelievably.
SE-292 High Performance Computing Memory Hierarchy R. Govindarajan
Copyright © 2004 Pearson Education, Inc.. Chapter 13 Disk Storage, Basic File Structures, and Hashing.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Chapter 4 : File Systems What is a file system?
1. 1. Database address space 2. Virtual address space 3. Map table 4. Translation table 5. Swizzling and UnSwizzling 6. Pinned Blocks 2.
1 Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes November 14, 2007.
1 Lecture 8: Data structures for databases II Jose M. Peña
Copyright © 2004 Pearson Education, Inc.. Chapter 14 Indexing Structures for Files.
1 Storing Data: Disks and Files Yanlei Diao UMass Amherst Feb 15, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Data Indexing Herbert A. Evans. Purposes of Data Indexing What is Data Indexing? Why is it important?
2010/3/81 Lecture 8 on Physical Database DBMS has a view of the database as a collection of stored records, and that view is supported by the file manager.
Efficient Storage and Retrieval of Data
Ceng Operating Systems
1 Lecture 20: Indexes Friday, February 25, Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.
DISK STORAGE INDEX STRUCTURES FOR FILES Lecture 12.
School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW Indexing Large Data COMP
1 Lecture 7: Data structures for databases I Jose M. Peña
Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems David Goldschmidt, Ph.D.
B+ Trees COMP
Disk Storage Copyright © 2004 Pearson Education, Inc.
13.6 Representing Block and Record Addresses
Cis303a_chapt03-2a.ppt Range Overflow Fixed length of bits to hold numeric data Can hold a maximum positive number (unsigned) X X X X X X X X X X X X X.
1 Chapter 17 Disk Storage, Basic File Structures, and Hashing Chapter 18 Index Structures for Files.
Chapter 9 Disk Storage and Indexing Structures for Files Copyright © 2004 Pearson Education, Inc.
External data structures
Indexing.
IDA / ADIT Databasteknik Databaser och bioinformatik Data structures and Indexing (I) Fang Wei-Kleiner.
CS246 Data & File Structures Lecture 1 Introduction to File Systems Instructor: Li Ma Office: NBC 126 Phone: (713)
File Organization Lecture 1
Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright © 2004 Pearson Education, Inc.
School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW B Trees and B+ Trees COMP 261.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
3 Data. Software And Data Data Data element – a single, meaningful unit of data. Name Social Security Number Data structure – a set of related data elements.
CE Operating Systems Lecture 17 File systems – interface and implementation.
File Organization Record Storage and Primary File Organization
COMP261 Lecture 25 Indexing Large Data.
INLS 623– Database Systems II– File Structures, Indexing, and Hashing
Secondary Storage Data Retrieval.
Chapter 11: File System Implementation
Database Management Systems (CS 564)
Oracle SQL*Loader
9/12/2018.
Chapter 11: File System Implementation
Chapters 17 & 18 6e, 13 & 14 5e: Design/Storage/Index
Disk storage Index structures for files
Chapter 11: File System Implementation
Lecture 21: Indexes Monday, November 13, 2000.
Lecture 19: Data Storage and Indexes
Secondary Storage Management Brian Bershad
File Storage and Indexing
Introduction to Operating Systems
Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes May 16, 2008.
Secondary Storage Management Hank Levy
Chapter 11: File System Implementation
Lecture 20: Indexes Monday, February 27, 2006.
Department of Computer Science
Presentation transcript:

Indexing Large Data COMP 261 2013 # 22

Introduction Files, file structures, DB files, indexes B-trees and B+-trees Reference Book (in VUW Library): Fundamentals of Database Systems by Elmasri & Navathe (2006), (Chapters 13 and 14 only!)

The Memory Hierarchy Registers Memory caches Primary memory: in CPU. Memory caches Very fast, small Copy of bits of primary memory Not controllable. Primary memory: Fast, large Volatile Secondary Storage (hard disks) Slow, very large Persistent Data cannot be manipulated directly, data must be copied to main memory, modified written back to disk Need to know how files are organised on disk

Disk Organisation Data on a disk is organised into chunks: Sectors: physical organisation. hardware disk operations work on sector at a time traditionally: 512 bytes modern disks: 4096 bytes Can't address more than 4G sectors with 32 bit number ⇒ at most 2TB with 512 byte sectors Blocks: logical organisation operating system retrieves and writes a block at a time 512 bytes to 8192 bytes, ⇒ For all efficient file operations, minimise block accesses

Reading/Writing from files In memory Disk Program Disk Buffers Disk Controller Hold one block at a time

File Organisation Files typically much larger than blocks A file must be stored in a collection of blocks on disk Must organise the file data into blocks Must record the collection of blocks associated with the file Must enable access to the blocks (sequential or random access) Disk In memory Program Disk Buffers Disk Controller Hold one block at a time

Files of records File may be a sequence of records (especially for DB files) record = logical chunk of information eg a row from a DataBase table an entry in a file system description … May have several records per block "blocking factor" = # records per block can calculate block number from record number (& vice versa) (if records all the same size) # records = # blocks  bf

Using B+ trees for organising Data B+ tree is an efficient index for very large data sets The B+ tree must be stored on disk (ie, in a file) ⇒ costly actions are accessing data from disk Retrieval action in the B+ tree is accessing a node ⇒ want to make one node only require one block access ⇒ want each the nodes to be as big as possible ⇒ fill the block B+ tree in a file: one node (internal or leaf) per block. links to other nodes = index of block in file. need some header info. need to store keys an values as bytes.

Implementing B Tree in a File To store a B Tree in a block-structured file Need a block for each node of the tree Can refer to blocks by their index in the file. Need an initial block (first block in file) with meta data: index of the root block number of items information about the item sizes and types? Need a block for each internal node type child item child key…. key child link to next key-value key-value …. key-value …… ……… index of block containing child node ………

Cost of B tree If the block size is 1024 bytes, how big can the nodes be? Node requires some header information leaf node or internal node number of items in node, internal node: mN x item size mN+1 x pointer size leaf node mL x item size pointer to next leaf How big is an item? How big is a pointer? Must specify which node, ie which block of the file Leaf nodes could hold more values than internal nodes!

Cost of B tree If block has 1024 bytes If key is a string of up to 10 characters ⇒ 10 bytes if value is a string of up to 20 characters ⇒ 20 bytes If child pointer is an int ⇒ 4 bytes Internal node size = 5 + (10 + 4) mN + 4 ⇒ mN ≤ (1024 – 9 )/ 14 = 72.5 ⇒ 72 keys in internal nodes Leaf node size = 5 + 4 + (10 + 20) mL ⇒ mL ≤ (1024 – 9 )/ 30 = = 33.8 ⇒ 33 key-value pairs in leaves