COMP261 Lecture 25 Indexing Large Data.

Slides:



Advertisements
Similar presentations
Chapter 12: File System Implementation
Advertisements

Databasteknik Databaser och bioinformatik Data structures and Indexing (II) Fang Wei-Kleiner.
Indexing Large Data COMP # 22
 Definition of B+ tree  How to create B+ tree  How to search for record  How to delete and insert a data.
Chapter 4 : File Systems What is a file system?
1 Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes November 14, 2007.
1 Lecture 8: Data structures for databases II Jose M. Peña
Data Indexing Herbert A. Evans. Purposes of Data Indexing What is Data Indexing? Why is it important?
METU Department of Computer Eng Ceng 302 Introduction to DBMS Disk Storage, Basic File Structures, and Hashing by Pinar Senkul resources: mostly froom.
1 Lecture 20: Indexes Friday, February 25, Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
B + -Trees COMP171 Fall AVL Trees / Slide 2 Dictionary for Secondary storage * The AVL tree is an excellent dictionary structure when the entire.
DISK STORAGE INDEX STRUCTURES FOR FILES Lecture 12.
School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW Indexing Large Data COMP
1 Lecture 7: Data structures for databases I Jose M. Peña
ICS 220 – Data Structures and Algorithms Week 7 Dr. Ken Cosh.
B+ Trees COMP
CSE373: Data Structures & Algorithms Lecture 15: B-Trees Linda Shapiro Winter 2015.
1 Chapter 17 Disk Storage, Basic File Structures, and Hashing Chapter 18 Index Structures for Files.
B-Trees And B+-Trees Jay Yim CS 157B Dr. Lee.
COSC 2007 Data Structures II Chapter 15 External Methods.
Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright © 2004 Pearson Education, Inc.
School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW B Trees and B+ Trees COMP 261.
Lecture1 introductions and Tree Data Structures 11/12/20151.
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture17.
CE Operating Systems Lecture 17 File systems – interface and implementation.
1 CSCE 520 Test 2 Info Indexing Modified from slides of Hector Garcia-Molina and Jeff Ullman.
B+ Trees.
File Organization Record Storage and Primary File Organization
10/3/2017 Chapter 6 Index Structures.
File Organization and Processing
COMP261 Lecture 23 B Trees.
Data Indexing Herbert A. Evans.
Module 11: File Structure
INLS 623– Database Systems II– File Structures, Indexing, and Hashing
CS 540 Database Management Systems
Indexing Goals: Store large files Support multiple search keys
B-Trees B-Trees.
CS522 Advanced database Systems
COMP261 Lecture 24 B and B+ Trees
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Secondary Storage Data Retrieval.
Database Management Systems (CS 564)
Oracle SQL*Loader
B+-Trees.
Disk Storage, Basic File Structures, and Hashing
9/12/2018.
CSE373: Data Structures & Algorithms Lecture 15: B-Trees
CMSC 341 Lecture 10 B-Trees Based on slides from Dr. Katherine Gibson.
Chapter 11: File System Implementation
CPSC-310 Database Systems
Chapters 17 & 18 6e, 13 & 14 5e: Design/Storage/Index
Disk Storage, Basic File Structures, and Hashing
Disk storage Index structures for files
B-Trees.
B-Trees.
Lecture 21: Indexes Monday, November 13, 2000.
Lecture 19: Data Storage and Indexes
Random inserting into a B+ Tree
Files Management – The interfacing
2018, Spring Pusan National University Ki-Joune Li
Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes May 16, 2008.
Secondary Storage Management Hank Levy
File Organization.
Lecture 13: Computer Memory
Lecture 20: Indexes Monday, February 27, 2006.
Department of Computer Science
Introduction to Operating Systems
Presentation transcript:

COMP261 Lecture 25 Indexing Large Data

Introduction Files, file structures, DB files, indexes B-trees and B+-trees Reference Book (in VUW Library): Fundamentals of Database Systems by Elmasri & Navathe (2011), (Chapters 16 and 17 only!)

The Memory Hierarchy Memory in a computer system is arranged in a hierarchy:

The Memory Hierarchy Registers in CPU. Memory caches Very fast, small Copy of bits of primary memory Primary memory: Fast, large Volatile: may be lost in power outage Secondary Storage (hard disks) Slow, very large Persistent Data cannot be manipulated directly, data must be copied to main memory, modified written back to disk Need to know how files are organised on disk

Disk Storage Devices Disk controller: interfaces disk drive to computer Implements commands to read or write a sector

Disk Organisation Data on a disk is organised into chunks: Sectors: physical organisation. hardware disk operations work on sector at a time traditionally: 512 bytes modern disks: 4096 bytes Blocks: logical organisation operating system retrieves and writes a block at a time 512 bytes to 8192 bytes, ⇒ For all efficient file operations, minimise block accesses

Reading/Writing from files In memory Disk Program Disk Buffers Disk Controller Hold one block at a time

File Organisation Files typically much larger than blocks A file is stored as a collection of blocks on disk organise the file data into blocks record the collection of blocks associated with the file enable access to the blocks (sequential or random access) Disk In memory Program Disk Buffers Disk Controller Hold one block at a time

Files of records File may be a sequence of records (especially for DB files) record = logical chunk of information eg a row from a DataBase table an entry in a file system description … May have several records per block "blocking factor" = # records per block can calculate block number from record number (& vice versa) (if records all the same size) # records = # blocks  bf

Using B+ trees for organising Data B+ tree is an efficient index for very large data sets The B+ tree must be stored on disk (ie, in a file) ⇒ costly actions are accessing data from disk Retrieval action in the B+ tree is accessing a node ⇒ want to make one node only require one block access ⇒ want each node to be as big as possible ⇒ fill the block B+ tree in a file: one node (internal or leaf) per block. links to other nodes = index of block in file. need some header info. need to store keys and values as bytes.

Implementing B+ Tree in a File Use a block for each node of the tree Can refer to blocks by their index in the file. Need an initial block (first block in file) with meta data: index of the root block number of items information about the item sizes and types? Use a block for each internal node type child key child key…. key child Use a block for each leaf node link to next key-value key-value …. key-value …… ……… index of block containing child node ………

Cost of B+ tree If the block size is 1024 bytes, how big can the nodes be? Node requires some header information leaf node or internal node number of items in node, internal node: mN x key size mN+1 x pointer size leaf node mL x item size pointer to next leaf How big is an item? How big is a pointer? Must specify which node, ie which block of the file Leaf nodes may hold more values than internal nodes!

Cost of B+ tree: Example Suppose: a block has 1024 bytes each node has header ⇒ 5 bytes a key is a string of up to 10 characters ⇒ 10 bytes a value is a string of up to 20 characters ⇒ 20 bytes a child pointer is an int ⇒ 4 bytes Internal node (mN keys, mN+1 child pointer) size = 5 + (10 + 4) mN + 4 ⇒ mN ≤ (1024 – 9 )/ 14 = 72.5 ⇒ 72 keys in internal nodes Leaf node (with pointer to next) size = 5 + 4 + (10 + 20) mL ⇒ mL ≤ (1024 – 9 )/ 30 = = 33.8 ⇒ 33 key-value pairs in leaves

K0-V0, K1-V1, K2-V2, …, Kleaf.size-1-Vleaf.size-1 B+ Tree: Find To find value associated with a key: Find(key): if root is empty return null else return Find(key, root) Find(key, node): if node is a leaf for i from 0 to node.size-1 if key = node.keys[ i ] return node.values[ i ] return null if node is an internal node for i from 1 to node.size if key < node.keys[ i ] return Find(key, getNode(node.child[i-1])) return Find(key, getNode(child[node.size] )) C0, C1, C2, Cnode.size-1 Cnode.size K1 K2 … Knode.size Could use binary search Moved to lecture 25 slides, but covered in lecture 26 K0-V0, K1-V1, K2-V2, …, Kleaf.size-1-Vleaf.size-1

B+ Tree Add (1) Add(key, value): if root is empty If root was full: create new leaf, add key-value, root  leaf else (newKey, rightChild)  Add(key, value, root) if (newKey, rightChild) ≠ null node  create new internal node node.size  1 node.child[0]  root node.keys[1]  newKey node.child[1]  rightChild root  node If root was full: returns new key and new leaf node, Make a new root node

K0-V0, K1-V1, K2-V2, …, Ksize-1-Vsize-1 B+ Tree Add (2) Add(key, value, node): if node is a leaf if node.size < maxLeafKeys insert key and value into leaf in correct place return null else return SplitLeaf(key, value, node) if node is an internal node for i from 1 to node.size if key < node.keys[i] (k, rc)  Add(key, value, node.child[i-1]) if (k, rc)=null return null else return dealWithPromote(k,rc,node) (k, rc)  Add(key, value, node.child[node.size]) if (k, rc)= null return null else return dealWithPromote( k, rc, node) K0-V0, K1-V1, K2-V2, …, Ksize-1-Vsize-1 Returns new key and new leaf node, C0, C1, C2, Csize-1 Csize K1 K2 … Ksize Inserts new key and child into node, Inserts new key and child into node,

B+ Tree Add (3) SplitLeaf(key, value, node): Could make the array one larger than necessary to give room for this. SplitLeaf(key, value, node): insert key and value into leaf in correct place (spilling over end) sibling  create new leaf mid  (node.size+1)/2 move keys and values from mid … size out of node into sibling. sibling.next  node.next node.next  sibling return (sibling.keys[0] , sibling) = (maxL+2)/2 since size is now maxL+1 K0-V0, K1-V1, K2-V2, …, Ksize-1-Vsize-1 K0-V0, K1-V1, K2-V2, … key-value … Ksize-1-Vsize-1 K0-V0, …, Kmid-1-Vmid-1 Kmid-Vmid, …, Ksize-1-Vsize-1

No need to promote further B+ Tree Add (4) Nothing was promoted DealWithPromote( newKey, rightChild, node ): if (newKey, rightChild) = null return null if newKey > node.keys[node.size] insert newKey at node.keys[node.size+1] insert rightChild at node.child[node.size+1] else for i from 1 to node.size if newKey < node.keys[i] insert newKey at node.keys[i] insert rightChild at node.child[i] if size ≤ maxNodeKeys return null sibling  create new node mid  size/2 +1 move node.keys[mid+1… node.size] to sibling.node [1… node.size-mid] move node.child[mid … node.size] to sibling.child [0 … node.size-mid] promoteKey node.keys[mid], remove node.keys[mid] return (promoteKey, sibling) C0, C1, C2, Csize-1 Csize K1 K2 … Ksize K No need to promote further Node is overfull: Have to split and promote