School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW Indexing Large Data COMP 261 2015.

Slides:



Advertisements
Similar presentations
Chapter 12: File System Implementation
Advertisements

Databasteknik Databaser och bioinformatik Data structures and Indexing (II) Fang Wei-Kleiner.
Indexing Large Data COMP # 22
 Definition of B+ tree  How to create B+ tree  How to search for record  How to delete and insert a data.
Chapter 4 : File Systems What is a file system?
File Processing : Hash 2015, Spring Pusan National University Ki-Joune Li.
1 Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes November 14, 2007.
Dr. Kalpakis CMSC 661, Principles of Database Systems Index Structures [13]
1 Lecture 8: Data structures for databases II Jose M. Peña
Multidimensional Data. Many applications of databases are "geographic" = 2­dimensional data. Others involve large numbers of dimensions. Example: data.
Copyright © 2004 Pearson Education, Inc.. Chapter 14 Indexing Structures for Files.
Chapter 15 B External Methods – B-Trees. © 2004 Pearson Addison-Wesley. All rights reserved 15 B-2 B-Trees To organize the index file as an external search.
BTrees & Bitmap Indexes
Data Indexing Herbert A. Evans. Purposes of Data Indexing What is Data Indexing? Why is it important?
METU Department of Computer Eng Ceng 302 Introduction to DBMS Disk Storage, Basic File Structures, and Hashing by Pinar Senkul resources: mostly froom.
Efficient Storage and Retrieval of Data
1 Lecture 20: Indexes Friday, February 25, Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
CS 4432lecture #10 - indexing & hashing1 CS4432: Database Systems II Lecture #10 Professor Elke A. Rundensteiner.
B-Trees Chapter 9. Limitations of binary search Though faster than sequential search, binary search still requires an unacceptable number of accesses.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
B + -Trees COMP171 Fall AVL Trees / Slide 2 Dictionary for Secondary storage * The AVL tree is an excellent dictionary structure when the entire.
DISK STORAGE INDEX STRUCTURES FOR FILES Lecture 12.
1 Lecture 7: Data structures for databases I Jose M. Peña
Chapter 61 Chapter 6 Index Structures for Files. Chapter 62 Indexes Indexes are additional auxiliary access structures with typically provide either faster.
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
IntroductionIntroduction  Definition of B-trees  Properties  Specialization  Examples  2-3 trees  Insertion of B-tree  Remove items from B-tree.
B-Tree. B-Trees a specialized multi-way tree designed especially for use on disk In a B-tree each node may contain a large number of keys. The number.
ICS 220 – Data Structures and Algorithms Week 7 Dr. Ken Cosh.
Spring 2006 Copyright (c) All rights reserved Leonard Wesley0 B-Trees CMPE126 Data Structures.
B+ Trees COMP
ALGORITHMS FOR ISNE DR. KENNETH COSH WEEK 6.
Chapter Trees and External Storage © John Urrutia 2014, All Rights Reserved1.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 17 Disk Storage, Basic File Structures, and Hashing.
Physical Database Design File Organizations and Indexes ISYS 464.
CSE373: Data Structures & Algorithms Lecture 15: B-Trees Linda Shapiro Winter 2015.
1 Chapter 17 Disk Storage, Basic File Structures, and Hashing Chapter 18 Index Structures for Files.
B-Trees And B+-Trees Jay Yim CS 157B Dr. Lee.
Chapter 9 Disk Storage and Indexing Structures for Files Copyright © 2004 Pearson Education, Inc.
 … we have been assuming that the data collections we have been manipulating were entirely stored in memory.
Indexing.
COSC 2007 Data Structures II Chapter 15 External Methods.
External Storage Primary Storage : Main Memory (RAM). Secondary Storage: Peripheral Devices –Disk Drives –Tape Drives Secondary storage is CHEAP. Secondary.
Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright © 2004 Pearson Education, Inc.
School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW B Trees and B+ Trees COMP 261.
Lecture1 introductions and Tree Data Structures 11/12/20151.
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture17.
CompSci 100E 39.1 Memory Model  For this course: Assume Uniform Access Time  All elements in an array accessible with same time cost  Reality is somewhat.
Storage Structures. Memory Hierarchies Primary Storage –Registers –Cache memory –RAM Secondary Storage –Magnetic disks –Magnetic tape –CDROM (read-only.
B+ Trees  What if you have A LOT of data that needs to be stored and accessed quickly  Won’t all fit in memory.  Means we have to access your hard.
B-TREE. Motivation for B-Trees So far we have assumed that we can store an entire data structure in main memory What if we have so much data that it won’t.
Chapter 5 Record Storage and Primary File Organizations
1 CSCE 520 Test 2 Info Indexing Modified from slides of Hector Garcia-Molina and Jeff Ullman.
B+ Trees.
File Organization Record Storage and Primary File Organization
COMP261 Lecture 25 Indexing Large Data.
COMP261 Lecture 23 B Trees.
Data Indexing Herbert A. Evans.
INLS 623– Database Systems II– File Structures, Indexing, and Hashing
CS522 Advanced database Systems
9/12/2018.
Chapters 17 & 18 6e, 13 & 14 5e: Design/Storage/Index
Disk storage Index structures for files
B-Trees.
Lecture 19: Data Storage and Indexes
Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes May 16, 2008.
Lecture 20: Indexes Monday, February 27, 2006.
Department of Computer Science
Presentation transcript:

School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW Indexing Large Data COMP

COMP261: 2 Introduction Files, file structures, DB files, indexes B-trees and B+-trees Reference Book (in VUW Library): Fundamentals of Database Systems by Elmasri & Navathe (2011), (Chapters 16 and 17 only!)

COMP261: 3 The Memory Hierarchy Memory in a computer system is arranged in a hierarchy:

COMP261: 4 The Memory Hierarchy Registers in CPU. Memory caches Very fast, small Copy of bits of primary memory Primary memory: Fast, large Volatile: lost in power outage Secondary Storage (hard disks) Slow, very large Persistent Data cannot be manipulated directly, data must be copied to main memory, modified written back to disk Need to know how files are organised on disk

Disk Storage Devices Disk controller: interfaces disk drive to computer Implements commands to read or write a sector

COMP261: 6 Disk Organisation Data on a disk is organised into chunks: Sectors: physical organisation. hardware disk operations work on sector at a time traditionally: 512 bytes modern disks: 4096 bytes Blocks: logical organisation operating system retrieves and writes a block at a time 512 bytes to 8192 bytes, ⇒ For all efficient file operations, minimise block accesses

COMP261: 7 Reading/Writing from files Disk Program Buffers Hold one block at a time In memory Disk Controller

COMP261: 8 File Organisation Files typically much larger than blocks A file must be stored in a collection of blocks on disk Must organise the file data into blocks Must record the collection of blocks associated with the file Must enable access to the blocks (sequential or random access) Disk Program Buffers Hold one block at a time In memoryDisk Controller

COMP261: 9 Files of records File may be a sequence of records (especially for DB files) record = logical chunk of information eg a row from a DataBase table an entry in a file system description … May have several records per block "blocking factor" = # records per block can calculate block number from record number (& vice versa) (if records all the same size) # records = # blocks  bf

COMP261: 10 Using B+ trees for organising Data B+ tree is an efficient index for very large data sets The B+ tree must be stored on disk (ie, in a file) ⇒ costly actions are accessing data from disk Retrieval action in the B+ tree is accessing a node ⇒ want to make one node only require one block access ⇒ want each of the nodes to be as big as possible ⇒ fill the block B+ tree in a file: one node (internal or leaf) per block. links to other nodes = index of block in file. need some header info. need to store keys and values as bytes.

COMP261: 11 Implementing B+ Tree in a File To store a B Tree in a block-structured file Need a block for each node of the tree Can refer to blocks by their index in the file. Need an initial block (first block in file) with meta data: index of the root block number of items information about the item sizes and types? Need a block for each internal node type number of items child key child key…. key child Need a block for each leaf node type number of items link to next key-value key-value …. key-value ……… index of block containing child node ……

COMP261: 12 Cost of B+ tree If the block size is 1024 bytes, how big can the nodes be? Node requires some header information leaf node or internal node number of items in node, internal node: m N x key size m N +1 x pointer size leaf node m L x item size pointer to next leaf How big is an item? How big is a pointer? Leaf nodes could hold more values than internal nodes! Must specify which node, ie which block of the file

COMP261: 13 Cost of B+ tree If block has 1024 bytes each node has header⇒ 5 bytes If key is a string of up to 10 characters⇒ 10 bytes if value is a string of up to 20 characters⇒ 20 bytes If child pointer is an int ⇒ 4 bytes Internal node (m N keys, m N +1 child pointer) size = 5 + (10 + 4) m N + 4 ⇒ m N ≤ (1024 – 9 )/ 14 = 72.5 ⇒ 72 keys in internal nodes Leaf node (with pointer to next) size = ( ) m L ⇒ m L ≤ (1024 – 9 )/ 30 = = 33.8 ⇒ 33 key-value pairs in leaves

COMP261: 14 B+ Tree: Find To find value associated with a key: Find(key): if root is empty return null else return Find(key, root) Find(key, node): if node is a leaf for i from 0 to node.size-1 if key = node.keys[ i ] return node.values[ i ] return null if node is an internal node for i from 1 to node.size if key < node.keys[ i ] return Find(key, getNode(node.child[i-1])) return Find(key, getNode(child[node.size] )) Could use binary search C 0, C 1, C 2, C node.size-1 C node.size K 1 K 2 … K node.size K 0 -V 0, K 1 -V 1, K 2 -V 2, …, K leaf.size-1 -V leaf.size-1

COMP261: 15 B+ Tree Add (1) Add(key, value): if root is empty create new leaf, add key-value, root  leaf else (newKey, rightChild)  Add(key, value, root) if (newKey, rightChild) ≠ null node  create new internal node node.size  1 node.child[0]  root node.keys[1]  newKey node.child[1]  rightChild root  node Make a new root node If root was full: returns new key and new leaf node,

COMP261: 16 B+ Tree Add (2) Add(key, value, node): if node is a leaf if node.size < maxLeafKeys insert key and value into leaf in correct place return null else return SplitLeaf(key, value, node) if node is an internal node for i from 1 to node.size if key < node.keys[i] (k, rc)  Add(key, value, node.child[i-1]) if (k, rc)=null return null else return dealWithPromote(k,rc,node) (k, rc)  Add(key, value, node.child[node.size]) if (k, rc)= null return null else return dealWithPromote( k, rc, node) K 0 -V 0, K 1 -V 1, K 2 -V 2, …, K size-1 -V size-1 C 0, C 1, C 2, C size-1 C size K 1 K 2 … K size Returns new key and new leaf node, Inserts new key and child into node,

COMP261: 17 B+ Tree Add (3) SplitLeaf(key, value, node): insert key and value into leaf in correct place (spilling over end) sibling  create new leaf mid   (node.size+1)/2  move keys and values from mid … size out of node into sibling. sibling.next  node.next node.next  sibling return (sibling.keys[0], sibling) K 0 -V 0, K 1 -V 1, K 2 -V 2, …, K size-1 -V size-1 K 0 -V 0, K 1 -V 1, K 2 -V 2, … key-value … K size-1 -V size-1 K 0 -V 0, …, K mid-1 -V mid-1 K mid -V mid, …, K size-1 -V size-1 Could make the array one larger than necessary to give room for this. =  (max L +2)/2  since size is now max L +1

COMP261: 18 B+ Tree Add (4) DealWithPromote( newKey, rightChild, node ): if (newKey, rightChild) = null return null if newKey > node.keys[node.size] insert newKey at node.keys[node.size+1] insert rightChild at node.child[node.size+1] else for i from 1 to node.size if newKey < node.keys[i] insert newKey at node.keys[i] insert rightChild at node.child[i] if size ≤ maxNodeKeys return null sibling  create new node mid   size/2  +1 move node.keys[mid+1… node.size] to sibling.node [1… node.size-mid] move node.child[mid … node.size] to sibling.child [0 … node.size-mid] promoteKey  node.keys[mid], remove node.keys[mid] return (promoteKey, sibling) C 0, C 1, C 2, C size-1 C size K 1 K 2 … K size K Nothing was promoted No need to promote further Node is overfull: Have to split and promote