CPSC 252 External Searching

Motivation: To this point in the course we have assumed that any data we search through can be stored in memory. In practice, this is not always a reasonable assumption: some databases are too large to be read into memory. We refer to such data as disk-bound.

Accessing data stored on a hard disk is extremely slow compared to accessing data in memory. A memory access typically takes on the order of nanoseconds ($10^{-9}$ s), while a disk access takes on the order of milliseconds ($10^{-3}$ s). Given this roughly million-to-one ratio of disk access time to memory access time, we want to minimize the number of times we access the disk.
We have already seen that an AVL tree lets us cut the amount of data to be searched approximately in half each time we compare the search key against the data stored in a node. Now suppose that we have a database with 30,000,000 records. This is not unreasonable: a database of Canadian citizens, for example, would be about this size. How many levels would there be in the AVL tree?

We know that a complete binary tree whose last level is filled has $2^L - 1$ nodes, where $L$ is the number of levels in the tree. Hence a tree with 30,000,000 nodes will have at least $\lceil \log_2(30{,}000{,}001) \rceil = 25$ levels. So if we are searching for a record that happens to be stored in a leaf node on the last level, we will have to perform at least 25 disk accesses, one for each node visited as we work down the tree. If we have to search the database frequently, this will be unacceptably slow.
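The level count can be checked with a few lines of Python. This is only a sketch: the function name `binary_levels` and the per-access costs (1 ns for memory, 1 ms for disk, taken from the orders of magnitude quoted in the motivation) are illustrative assumptions, not measured values.

```python
def binary_levels(n_nodes: int) -> int:
    """Smallest L such that a full binary tree with L levels
    (2**L - 1 nodes) can hold n_nodes nodes."""
    levels = 0
    while 2 ** levels - 1 < n_nodes:
        levels += 1
    return levels


# Assumed, order-of-magnitude access costs in nanoseconds (illustrative only).
MEMORY_ACCESS_NS = 1
DISK_ACCESS_NS = 1_000_000

levels = binary_levels(30_000_000)
print(levels)                     # 25
print(levels * MEMORY_ACCESS_NS)  # 25 (ns, if every node were in memory)
print(levels * DISK_ACCESS_NS)    # 25000000 (ns, i.e. 25 ms, if every node is on disk)
```

Under these assumed costs, a single search that reaches the last level costs about 25 ms on disk versus about 25 ns in memory.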
In order to minimize the number of disk accesses, we need to minimize the number of levels in our search tree.

Definition: a tree of order m is a tree in which each node has at most m children.

Definition: an m-way search tree T is a tree of order m such that:
- T is either empty, or
- each node has subtrees $T_0, T_1, \ldots, T_n$ and key values $K_1 < K_2 < \cdots < K_n$, where $1 \le n < m$
- for every key value $V$ in subtree $T_i$:
    $V < K_1$, if $i = 0$
    $K_i < V < K_{i+1}$, if $1 \le i \le n - 1$
    $V > K_n$, if $i = n$
- every subtree $T_i$ is also an m-way search tree.
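The definition translates fairly directly into a search routine. The following is a minimal in-memory sketch, not the course's implementation: the names `Node` and `search` and the sample keys are hypothetical, and each visited node stands in for one disk access.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    keys: List[int]  # K_1 < K_2 < ... < K_n
    # Subtrees T_0 ... T_n; None represents an empty subtree.
    children: List[Optional["Node"]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.children:
            self.children = [None] * (len(self.keys) + 1)


def search(root: Optional[Node], key: int) -> Tuple[bool, int]:
    """Return (found, nodes_visited) for key in an m-way search tree."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)  # number of keys smaller than key
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        node = node.children[i]  # key can only be in subtree T_i
    return False, visited


# Hypothetical 4-way search tree used only for illustration.
root = Node([20, 40, 60],
            [Node([5, 10]), Node([25, 30, 35]), None, Node([70])])

print(search(root, 30))  # (True, 2)
print(search(root, 50))  # (False, 1)
```

Using binary search within a node is a design choice for this sketch; the number of nodes visited, which is what matters for disk-bound data, is the same as with a linear scan of the keys.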
Example: The following is a 4-way search tree:

[Figure: a 4-way search tree]
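A small 4-way search tree can also be written down in code and checked against the definition. The tree below is a hypothetical example chosen for illustration (it is not taken from the course notes), and the `(keys, subtrees)` tuple representation and the name `is_m_way_search_tree` are assumptions of this sketch.

```python
def leaf(*keys):
    """Build a node whose subtrees are all empty."""
    return (keys, [None] * (len(keys) + 1))


def is_m_way_search_tree(tree, m, low=None, high=None):
    """Check the m-way search tree definition; keys must lie strictly
    between low and high (None means unbounded on that side)."""
    if tree is None:  # the empty tree is an m-way search tree
        return True
    keys, subtrees = tree
    n = len(keys)
    if not (1 <= n < m) or len(subtrees) != n + 1:
        return False
    bounds = [low, *keys, high]
    # low < K_1 < K_2 < ... < K_n < high
    for a, b in zip(bounds, bounds[1:]):
        if a is not None and b is not None and not a < b:
            return False
    # Subtree T_i may only hold values between bounds[i] and bounds[i + 1].
    return all(
        is_m_way_search_tree(subtrees[i], m, bounds[i], bounds[i + 1])
        for i in range(n + 1)
    )


# Hypothetical 4-way search tree: the root holds 3 keys and has 4 subtrees.
example = ((20, 40, 60), [leaf(5, 10), leaf(25, 30, 35), None, leaf(70)])
# Same shape, but 45 sits in T_1, where values must lie between 20 and 40.
broken = ((20, 40, 60), [leaf(5, 10), leaf(25, 45), None, leaf(70)])

print(is_m_way_search_tree(example, 4))  # True
print(is_m_way_search_tree(broken, 4))   # False
print(is_m_way_search_tree(example, 3))  # False (the root has 3 keys, so n < m fails)
```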
Now suppose that we have a database of 30,000,000 records and that each node in our 4-way tree is full (i.e., contains 3 records). How many disk accesses will be required to retrieve a record at the bottom of the tree?

If a 4-way tree has every node filled (i.e., each contains 3 records) and every level filled, then it contains $(4^L - 1)/3$ nodes and therefore $4^L - 1$ records, where $L$ is the number of levels. Hence a database containing 30,000,000 records will have $\lceil \log_4(30{,}000{,}001) \rceil = 13$ levels, a definite improvement over a binary search tree.

In practice, commercial databases use specialized versions of m-way search trees where m is on the order of 100. By increasing the value of m we decrease the number of levels in the tree for a fixed number of records. In the following sections we will examine some of these specialized m-way search trees.
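The effect of m on the number of levels can be tabulated with a short calculation. This sketch assumes the idealized case used above (every node and every level full, so a tree with L levels holds m^L - 1 records); the function name `levels_needed` is hypothetical.

```python
def levels_needed(n_records: int, m: int) -> int:
    """Smallest L such that a full m-way search tree with L levels
    (m**L - 1 records) can hold n_records records."""
    levels = 0
    while m ** levels - 1 < n_records:
        levels += 1
    return levels


for m in (2, 4, 100):
    print(m, levels_needed(30_000_000, m))
# 2 25
# 4 13
# 100 4
```

Under this idealized assumption, a tree of order 100 would need only 4 levels for 30,000,000 records, so a search would touch at most 4 nodes.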