File Processing : Index and Hash
2018, Spring
Pusan National University
Ki-Joune Li
What is index ?
  Index in a book
    Index : Keyword → Pages
    Without index : exhaustive search, too expensive
  Index for a file or database
    A function or mechanism
    F_Index : S_Predicate → B (block numbers on hard disk)
    e.g. find student records where student.GPA > 4.0
Data Retrieval Time
  Data retrieval on disk : two phases
    1st phase : Search with a condition (predicate), which yields block numbers
    2nd phase : Data access
  Data access time depends on
    File structure
    Disk placement
    Clustering, etc.
  [Figure: Search Condition → Search (1st phase) → { Block# } → Data access (2nd phase) → Database on Disk]
Blocking Factor
  Blocking factor B_f : number of records in a block
  Blocking factor and number of disk accesses
    N_D = N_record / B_f
  By maximizing the blocking factor, we reduce the number of disk accesses
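As a rough illustration of N_D = N_record / B_f, the sketch below counts the block reads needed to scan a file. The record count, record sizes, and block size are made-up numbers, and rounding up is an assumption to cover a partly filled last block.

```python
import math

def disk_accesses(n_record: int, bf: int) -> int:
    """Blocks to read for a full scan: N_D = N_record / B_f, rounded up."""
    return math.ceil(n_record / bf)

# Hypothetical numbers: 4096-byte blocks, 400-byte vs. 100-byte records.
block_size = 4096
for record_size in (400, 100):
    bf = block_size // record_size      # records that fit in one block
    print(record_size, bf, disk_accesses(1_000_000, bf))
```

Smaller records give a larger blocking factor, so the same number of records is read with fewer disk accesses.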
How to Accelerate Phase 1 ?
  Phase 1 can be accelerated by index or by hash
  Index vs. Hash
    Index : a type of data structure
      Needs additional data structures
    Hash : a type of mechanism
      May not need any additional data structure (not exactly true)
A Simple Idea on Index
  Mapping table from keywords to block numbers : Inverted File
  Why is an inverted file better than nothing ?
  If the table is too large to fit in main memory
    It has to be stored on disk
    Index access then requires disk accesses too

  Keyword        Block#
  Juliet Romeo   B26
  Hamlet         B22
  …              …
  Carmen         B212
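A minimal sketch of the mapping-table idea, assuming the table fits in memory and is held in a Python dict. The name find_blocks and the use of a list of block numbers per keyword are illustrative assumptions; only the two table entries come from the slide.

```python
# Inverted file as a mapping table: keyword -> block numbers.
# Storing a list per keyword is an assumption (a keyword may occur in several blocks).
inverted_file = {
    "Hamlet": ["B22"],
    "Carmen": ["B212"],
}

def find_blocks(keyword):
    """Phase 1 only: return the block numbers for a keyword; no data block is read."""
    return inverted_file.get(keyword, [])

print(find_blocks("Hamlet"))   # ['B22']
print(find_blocks("Othello"))  # []
```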
Searching Algorithms and Index
  A good way to accelerate searching
    Tree : O(log n)
  Reorganize the inverted file into a tree
    Binary search tree : branching factor = 2
  Tree in memory space vs. in disk space
    Memory space : number of comparisons
    Disk space : number of block accesses
  [Figure: binary search tree whose nodes hold (key, block number) : (30, b27), (14, b17), (40, b26), (34, b17), (55, b26)]
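Paged Tree : m-way search tree
  One node : one disk page
  How to determine m ?
    e.g. when 1 disk page is 4 Kbytes
    4 + 4m + 8(m-1) = 4096, so m = 341
    (4 bytes for the number of delimiters, 4 bytes per child block number, 8 bytes per delimiter)
  Very fat tree
  [Figure: node layout with the number of delimiters, the delimiters as (key, block number) pairs such as (57, b27), (103, b28), …, (343, b14), and the child block numbers]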
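The arithmetic for m can be checked with a short sketch. The function name and the default field sizes (4-byte counter, 4-byte child block number, 8-byte delimiter) are assumptions read off the formula 4 + 4m + 8(m-1) = 4096.

```python
def max_branching_factor(page_bytes, count_bytes=4, child_bytes=4, delimiter_bytes=8):
    """Largest m with count + child*m + delimiter*(m-1) <= page_bytes."""
    return (page_bytes - count_bytes + delimiter_bytes) // (child_bytes + delimiter_bytes)

def node_bytes(m, count_bytes=4, child_bytes=4, delimiter_bytes=8):
    """Bytes used by one node with m children and m-1 delimiters."""
    return count_bytes + child_bytes * m + delimiter_bytes * (m - 1)

m = max_branching_factor(4096)
print(m, node_bytes(m), node_bytes(m + 1))  # 341 4088 4100
```

With m = 341 the node uses 4088 bytes and fits in the page; m = 342 would need 4100 bytes.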
Problem of m-Way search tree
  Search performance : determined by the height
  Not balanced
    Average : O(log n)
    Worst case : n / B_f, i.e. O(n)
  Height : determined by insertion order
    e.g. insertion in ascending order
  How to make it balanced ?
    Balanced m-way search tree : B-tree
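The effect of insertion order on height can be seen with the simplest case, m = 2 (a plain binary search tree). This is a sketch with hypothetical keys; the same degeneration happens for larger m.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Unbalanced insertion: walk down and attach the key as a new leaf."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        side = "left" if key < cur.key else "right"
        child = getattr(cur, side)
        if child is None:
            setattr(cur, side, Node(key))
            return root
        cur = child

def height(root):
    """Number of levels, counted level by level."""
    h = 0
    level = [root] if root is not None else []
    while level:
        h += 1
        level = [c for n in level for c in (n.left, n.right) if c is not None]
    return h

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

print(height(build([10, 20, 30, 40, 50, 60, 70])))  # 7 : ascending order degenerates to a list
print(height(build([40, 20, 60, 10, 30, 50, 70])))  # 3 : same keys, different order
```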
B-tree
  B-tree : balanced m-way search tree
    Root node : no child node, or at least two child nodes
    Internal node : m/2 ~ m child nodes (block numbers)
    External node : data block numbers instead of child nodes
  Balanced
    Upward split, instead of the downward split of a binary tree
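Upward Split
  Suppose m = 3 (at most 2 keys per node)
  Insert 10, 20 : [10 20]
  Insert 30 : [10 20 30] overflows
    Upward split : 20 moves up, giving root [20] with children [10] and [30]
  Insert 40 : root [20] with children [10] and [30 40]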
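Upward Split
  Insert 50 : [30 40 50] overflows
    40 moves up, giving root [20 40] with children [10], [30], [50]
  Insert 60 : root [20 40] with children [10], [30], [50 60]
  Insert 70 : [50 60 70] overflows
    60 moves up, and the root [20 40 60] overflows in turn
    40 moves up, giving root [40], internal nodes [20] and [60], external nodes [10], [30], [50], [70]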
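The insertion sequence 10, 20, ..., 70 with m = 3 can be replayed with a small sketch of insertion by upward split. The class and function names are hypothetical, and the sketch leaves out deletion and disk pages; it only shows how an overflowing node pushes its middle key to the parent.

```python
import bisect

class BNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list for an external node

def _insert(node, key, m):
    """Insert key under node. Return (promoted_key, new_right_node) if node split, else None."""
    i = bisect.bisect_left(node.keys, key)
    if not node.children:
        node.keys.insert(i, key)
    else:
        split = _insert(node.children[i], key, m)
        if split is not None:
            up_key, right = split
            node.keys.insert(i, up_key)
            node.children.insert(i + 1, right)
    if len(node.keys) < m:               # at most m-1 keys : no overflow
        return None
    mid = len(node.keys) // 2            # overflow : the middle key moves up
    up_key = node.keys[mid]
    right = BNode(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return up_key, right

def insert(root, key, m=3):
    """Insert key; if the root itself splits, the tree grows one level upward."""
    split = _insert(root, key, m)
    if split is None:
        return root
    up_key, right = split
    return BNode([up_key], [root, right])

def levels(root):
    """Keys of every node, level by level, for inspection."""
    out, level = [], [root]
    while level:
        out.append([n.keys for n in level])
        level = [c for n in level for c in n.children]
    return out

root = BNode()
for k in (10, 20, 30, 40, 50, 60, 70):
    root = insert(root, k)
print(levels(root))  # [[[40]], [[20], [60]], [[10], [30], [50], [70]]]
```

Even though the keys arrive in ascending order, every external node ends up at the same depth, because new levels are only ever added at the root.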
Meaning of Upward Split
  Always balanced
  Not so much influenced by the order of insertions
  Internal nodes : m/2 ~ m child nodes (block numbers)
  [Figure: root node [40]; internal nodes [20], [60]; external nodes [10], [30], [50], [70]]
Search by B-tree ?
  Search for 45
    Root node [40] : 45 > 40, go right
    Internal node [60] : 45 < 60, go left
    External node [50] : 45 is not there
    Not Found
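The search for 45 can be traced with a small sketch. The tree is hard-coded as nested (keys, children) tuples, which is a representation chosen here for brevity; the function returns whether the key was found and how many nodes were visited.

```python
def leaf(*keys):
    """External node: keys only, no children."""
    return (list(keys), [])

# The example tree: root [40], internal [20] and [60], external [10], [30], [50], [70].
tree = ([40], [([20], [leaf(10), leaf(30)]),
               ([60], [leaf(50), leaf(70)])])

def search(node, key):
    """Exact match search. Return (found, nodes_visited)."""
    visited = 0
    while True:
        keys, children = node
        visited += 1
        if key in keys:
            return True, visited
        if not children:
            return False, visited
        # Child index = number of keys in this node smaller than the search key.
        node = children[sum(k < key for k in keys)]

print(search(tree, 45))  # (False, 3)
print(search(tree, 60))  # (True, 2)
```

The number of nodes visited never exceeds the depth of the tree.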
Performance of B-tree
  Number of comparisons within a node : trivial
  Number of nodes to visit : depth
Problem of B-tree
  Types of search
    Exact match search
    Range search
      e.g. find students where 25 < student.GPA < 50
  B-tree
    Good for exact match search
    Bad for range search
  [Figure: B-tree with keys 10, 20, 30, 40, 50, 60, 70]
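To see why a range search is awkward on a B-tree, the sketch below runs the query 25 < key < 50 on the example tree and records every node it touches. The (keys, children) tuple representation and the function name are assumptions made for this illustration.

```python
def leaf(*keys):
    return (list(keys), [])

tree = ([40], [([20], [leaf(10), leaf(30)]),
               ([60], [leaf(50), leaf(70)])])

def range_search(node, low, high, visited):
    """Return keys with low < key < high, in order; append each visited node's keys to visited."""
    keys, children = node
    visited.append(keys)
    result = []
    bounds = [float("-inf")] + keys + [float("inf")]
    for i in range(len(keys) + 1):
        # Child i holds keys between bounds[i] and bounds[i + 1].
        if children and bounds[i] < high and bounds[i + 1] > low:
            result += range_search(children[i], low, high, visited)
        if i < len(keys) and low < keys[i] < high:
            result.append(keys[i])
    return result

visited = []
print(range_search(tree, 25, 50, visited))  # [30, 40]
print(visited)                              # [[40], [20], [30], [60], [50]]
```

Two qualifying keys cost five node visits here, because the traversal has to go up and down through internal nodes instead of moving sideways between neighbouring external nodes.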
B+-tree
  A variant of B-tree
    Duplicate all elements at leaf nodes (external nodes)
    Linked list of leaf nodes
  Performance
    Exact match search and insertion : a small sacrifice in performance
    Range search : much more powerful than B-tree
B+-tree : Example
  [Figure: successive snapshots of a B+-tree holding keys 10 to 60, labelled Duplication, overflow, and Linked List]
Range Search with B+-tree
  Find students where GPA > 3.5
  [Figure: the search key 35 descends the B+-tree (keys 10 to 60) to a leaf node, then moves along the leaf nodes]
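A minimal sketch of a range search over linked leaf nodes. The two-level layout (one root separator 40, two leaves) is a hypothetical arrangement of the keys 10 to 60, not necessarily the one drawn on the slide, and the names Leaf and range_search are illustrative.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None      # link to the next leaf node

# Hypothetical B+-tree: all keys live in the leaves, the root only holds a separator.
left, right = Leaf([10, 20, 30]), Leaf([40, 50, 60])
left.next = right
root = {"keys": [40], "children": [left, right]}

def range_search(root, low):
    """Return all keys > low: one descent to a leaf, then a walk along the linked list."""
    # Descent: child index = number of separators <= low.
    leaf = root["children"][sum(sep <= low for sep in root["keys"])]
    result = []
    while leaf is not None:
        result.extend(k for k in leaf.keys if k > low)
        leaf = leaf.next
    return result

print(range_search(root, 35))  # [40, 50, 60]
```

After the single descent, the search never returns to an internal node; it only follows the links between leaves.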
Performance of B+-tree
  Determined by the depth d
  Exact match search and insertion (without split) : d node (page) accesses
  Range search : the number of node accesses grows with n_q (n_q : number of records to retrieve)