Fast Trie Data Structures Seminar On Advanced Topics In Data Structures Jacob Katz December 1, 2001 Dan E. Willard, 1981, “New Trie Data Structures Which Support Very Fast Search Operations”
Agenda Problem statement Existing solutions and motivation for a new one P-Fast tries & their complexity Q-Fast tries & their complexity X-Fast tries & their complexity Y-Fast tries & their complexity JK 11/16/2018
Problem statement Let S be a set of N records with distinct integer keys in the range [0, M], with the following operations: MEMBER(K) – does the key K belong to the set? SUCCESSOR(K) – find the least element whose key is greater than K PREDECESSOR(K) – find the greatest element whose key is less than K SUBSET(K1, K2) – produce a list of the elements whose keys lie between K1 and K2 The problem: design an efficient data structure supporting this interface
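The interface above can be sketched with a naive baseline: a sorted list queried by binary search, giving O(log N) per operation (the bound the fast tries improve on for restricted keys). The class and method names here are illustrative, not from the paper:

```python
import bisect

class NaiveIntegerSet:
    """Reference implementation of MEMBER/SUCCESSOR/PREDECESSOR/SUBSET,
    backed by a sorted list; every query is a binary search, O(log N)."""

    def __init__(self, keys):
        self.keys = sorted(set(keys))

    def member(self, k):
        i = bisect.bisect_left(self.keys, k)
        return i < len(self.keys) and self.keys[i] == k

    def successor(self, k):
        # least stored key strictly greater than k, or None
        i = bisect.bisect_right(self.keys, k)
        return self.keys[i] if i < len(self.keys) else None

    def predecessor(self, k):
        # greatest stored key strictly less than k, or None
        i = bisect.bisect_left(self.keys, k)
        return self.keys[i - 1] if i > 0 else None

    def subset(self, k1, k2):
        # all stored keys in [k1, k2], in order
        lo = bisect.bisect_left(self.keys, k1)
        hi = bisect.bisect_right(self.keys, k2)
        return self.keys[lo:hi]
```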
Existing solutions AVL trees and 2-3 trees use O(N) space and O(log N) time in the worst case With no restriction on the keys, better performance is impossible Expected O(log log N) time is possible when the keys are uniformly distributed Stratified trees use O(M * log log M) space and O(log log M) worst-case time for integer keys in the range [0, M] Disadvantage: O(M * log log M) space is much larger than O(N) when M >> N
Motivation for another solution A more space-efficient data structure is wanted for restricted keys, one which still maintains the time efficiency…
The way to the solution We first define the P-Fast Trie: O(√(log M)) time; O(N * √(log M) * 2^√(log M)) space Then show the Q-Fast Trie, which improves the space requirement to O(N) Then show the X-Fast Trie: O(log log M) time; O(N * log M) space; no dynamic operations Then show the Y-Fast Trie: O(log log M) time; O(N) space; no dynamic operations
What’s a Trie A trie of size (h, b) is a tree of height h and branching factor b All keys can be regarded as integers in the range [0, b^h) Each key K can be represented as an h-digit number in base b: K1K2K3…Kh Keys are stored at the leaf level; the path from the root follows the decomposition of the key into digits [Figure: trie of size (2, 10) storing the keys 20, 22, 24, 31, 32, 42, 43; the root branches on the first digits 2, 3, 4]
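The digit decomposition a root-to-leaf path follows can be sketched as (the helper name is my own):

```python
def to_digits(key, h, b):
    """Decompose key into its h digits in base b, most significant first.
    The root-to-leaf path for key in a trie of size (h, b) follows
    exactly this digit sequence."""
    digits = []
    for _ in range(h):
        digits.append(key % b)   # peel off the least significant digit
        key //= b
    return digits[::-1]          # most significant digit first
```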
Trivial Trie In each node, store a vector of branches MEMBER(K) – O(h): visits O(h) nodes, spends O(1) time in each SUCCESSOR(K)/PREDECESSOR(K) – O(h*b): visits O(h) nodes, spends O(b) time in each node – this is too much time Observation: increasing b (the base of the key representation, i.e. the branching factor) decreases h (the number of digits required to represent a key, i.e. the height of the tree), and vice versa
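A minimal runnable sketch of the trivial trie, with each node as a length-b branch vector: MEMBER follows one branch per digit (O(h)), while SUCCESSOR may scan a whole vector at each level (O(h*b)). All function names are mine; keys are assumed to fit in h digits base b:

```python
def to_digits(key, h, b):
    ds = []
    for _ in range(h):
        ds.append(key % b)
        key //= b
    return ds[::-1]

def make_trie(keys, h, b):
    """Each internal node is a length-b vector of branches;
    a leaf is marked by True in its parent's vector at the last digit."""
    root = [None] * b
    for key in keys:
        node = root
        ds = to_digits(key, h, b)
        for d in ds[:-1]:
            if node[d] is None:
                node[d] = [None] * b
            node = node[d]
        node[ds[-1]] = True
    return root

def member(root, key, h, b):
    """O(h): follows exactly one branch per digit."""
    node = root
    for d in to_digits(key, h, b):
        if node is None or node[d] is None:
            return False
        node = node[d]
    return node is True

def smallest(node, levels, b):
    """Smallest key in a subtree; scans the branch vector at each level."""
    for d in range(b):
        if node[d] is not None:
            if levels == 1:
                return d
            return d * b ** (levels - 1) + smallest(node[d], levels - 1, b)

def successor(node, ds, b):
    """Least stored key strictly greater than the key with digits ds:
    O(h) nodes visited, up to O(b) branches scanned in each -> O(h*b)."""
    levels = len(ds)
    d = ds[0]
    if levels > 1 and node[d] is not None:
        sub = successor(node[d], ds[1:], b)   # try inside d's subtree first
        if sub is not None:
            return d * b ** (levels - 1) + sub
    for e in range(d + 1, b):                 # linear scan: the costly part
        if node[e] is not None:
            if levels == 1:
                return e
            return e * b ** (levels - 1) + smallest(node[e], levels - 1, b)
    return None
```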
Example for worst case complexity [Figure: the stored key has the digit b-1 at every level (value b^h - 1), so a successor search scans b branches in each of the h nodes on the path from the root]
P-Fast Trie Idea Improve SUCCESSOR(K)/PREDECESSOR(K) time by avoiding the linear search in every intermediate node
P-Fast Trie Each internal node v has additional fields: LOWKEY(v) – the leaf node containing the smallest key descending from v HIGHKEY(v) – the leaf node containing the largest key descending from v INNERTREE(v) – a binary tree of worst-case height O(log b) representing the set of digits directly descending from v Each leaf node points to its immediate neighbors on the left and on the right CLOSEMATCH(K) – a query returning the node with key K if it exists in the trie, and PREDECESSOR(K) or SUCCESSOR(K) otherwise
CLOSEMATCH(K) Algorithm Intuitively Starting from the root, look for K = K1K2…Kh If found, return it If not, let v be the node at depth j from which there is no way further down: Kj ∉ INNERTREE(v) Searching for Kj in INNERTREE(v), find D – an existing digit in INNERTREE(v) that is either: the least digit greater than Kj, or the greatest digit less than Kj If D > Kj, return LOWKEY(D's child of v); else if D < Kj, return HIGHKEY(D's child of v)
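A runnable sketch of this search, with INNERTREE(v) modeled as a sorted digit list queried via `bisect` (O(log b), a stand-in for the balanced tree of the paper) and LOWKEY/HIGHKEY kept per node. It assumes a nonempty trie; names follow the slides where possible:

```python
import bisect

class Node:
    def __init__(self):
        self.children = {}   # digit -> Node
        self.digits = []     # sorted digits present here (the INNERTREE)
        self.lowkey = None   # smallest key in this subtree
        self.highkey = None  # largest key in this subtree

def to_digits(key, h, b):
    ds = []
    for _ in range(h):
        ds.append(key % b)
        key //= b
    return ds[::-1]

def insert(root, key, h, b):
    node = root
    for d in to_digits(key, h, b):
        # maintain LOWKEY/HIGHKEY on the way down
        if node.lowkey is None or key < node.lowkey:
            node.lowkey = key
        if node.highkey is None or key > node.highkey:
            node.highkey = key
        if d not in node.children:
            node.children[d] = Node()
            bisect.insort(node.digits, d)
        node = node.children[d]
    node.lowkey = node.highkey = key   # leaf holds the key itself

def closematch(root, key, h, b):
    """Return key if stored; otherwise one of its neighbours
    (PREDECESSOR(key) or SUCCESSOR(key)). Time O(h + log b)."""
    node = root
    for d in to_digits(key, h, b):
        if d in node.children:
            node = node.children[d]
            continue
        # no way down: find the nearest existing digit D
        i = bisect.bisect_left(node.digits, d)
        if i < len(node.digits):                       # D > d exists
            return node.children[node.digits[i]].lowkey
        return node.children[node.digits[i - 1]].highkey  # D < d
    return node.lowkey   # exact match
```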
P-Fast Trie Complexities CLOSEMATCH(K) time complexity is O(h + log b) Other queries require only O(1) work in addition to CLOSEMATCH(K) Space complexity of such a trie is O(h*b*N) Representing the input keys in base 2^√(log M) requires √(log M) digits; therefore with h = √(log M) and b = 2^√(log M) the desired complexities are achieved
Q-Fast Trie Idea Improve space by splitting the set of keys into subsets How to split is the problem: the split must preserve the time complexity while decreasing the space complexity
Q-Fast Trie Let S' denote an ordered list of keys selected from S: K1 < K2 < K3 < … < KL Define: Si = {K ∈ S | Ki ≤ K < Ki+1} for i < L; SL = {K ∈ S | K ≥ KL} S' is a c-partition of S iff each Si has cardinality in the range [c, 2c-1] A Q-Fast Trie of size (h, b, c) is a two-level structure: Upper part: a P-fast trie T of size (h, b) representing the set S', which is a c-partition of S Lower part: a forest of 2-3 trees, where the ith tree represents Si The leaves of the 2-3 trees are linked to form an ordered list
Example of a Q-Fast Trie [Figure: upper part holds the partition keys 10, 35, 71; below them, 2-3 trees store the full key set 10, 17, 33, 35, 70, 71, 77, 81, 95, 99]
CLOSEMATCH(K) Algorithm Intuitively Look for D = PREDECESSOR(K) in the upper part – O(h + log b) Then search D's 2-3 tree for K – O(log c)
Q-Fast Trie Complexities CLOSEMATCH(K) time complexity is O(h + log b + log c) Other queries require only O(1) work in addition to CLOSEMATCH(K) Space complexity is O(N + N*h*b/c) By choosing h = √(log M), b = 2^√(log M), c = h*b, the desired complexities are achieved
P/Q-Fast Trie Insertion/Deletion P-fast trie: use AVL trees for the INNERTREEs – O(h + log b) for insertion/deletion Q-fast trie: O(h + log b + log c) for insertion/deletion Maintenance of the c-partition property through tree splitting/merging in O(log c) time
X-Fast Trie Idea The P/Q-Fast trie uses top-down search to reach the wanted level, performing a binary search in each node on the way; thus it relies on the balance between the height of the tree and the branching factor The X-Fast trie idea: use binary search for the wanted level itself This requires it to be possible to find the wanted node by knowing its level, without a top-down pass For worst-case complexity the branching factor no longer matters, since it only affects the base of the log
X-Fast Trie Part 1: a trie of height h and branching factor 2 (representing all keys in binary) Each node has an additional field DESCENDANT(v): If v has only a left branch, it points to the largest leaf descending from v (through the left branch) If v has only a right branch, it points to the smallest leaf descending from v (through the right branch) All leaves form a doubly-linked list A node v at height j may have descending leaves only in the range [(i-1)*2^j + 1, i*2^j] for some integer i; this i is called ID(v) A node v at height j is called an ancestor of key K if ⌈K/2^j⌉ = ID(v) BOTTOM(K) is the lowest ancestor of K
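In binary, the ancestor condition needs no tree walk at all: the height-j ancestor of K is determined by K's top h-j bits, so two keys share it iff their j-bit-shifted prefixes agree (equivalent to the ID formula above, up to a 1-based index shift). A sketch:

```python
def same_ancestor(k1, k2, j):
    """True iff keys k1 and k2 descend from the same height-j node:
    their binary representations agree on all bits above position j."""
    return (k1 >> j) == (k2 >> j)
```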
X-Fast Trie Part 2: h+1 Level Search Structures (LSS), each of which uses perfect hashing, as seen in the first lecture: linear space & constant lookup time
BOTTOM(K) Algorithm Intuitively Perform a binary search among the h+1 different LSSs Searching each LSS is O(1) h = log M, therefore the binary search over h+1 LSSs is O(log log M)
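A sketch of this binary search, with each LSS modeled as an ordinary Python set of height-j prefixes (a stand-in for the perfect hash tables; membership is O(1) either way). The search is valid because prefix presence is monotone in the height: if K's prefix exists at height j, it also exists at every height above j.

```python
def build_levels(keys, h):
    """LSS for height j = the set of (key >> j) prefixes of all stored keys."""
    return [{k >> j for k in keys} for j in range(h + 1)]

def bottom_height(key, levels):
    """Height of the lowest ancestor of key present in the trie
    (0 if key itself is stored). O(log h) = O(log log M) probes."""
    lo, hi = 0, len(levels) - 1   # the root level always matches
    while lo < hi:
        mid = (lo + hi) // 2
        if (key >> mid) in levels[mid]:
            hi = mid              # an ancestor exists this low; try lower
        else:
            lo = mid + 1          # no ancestor at this height; go higher
    return lo
```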
X-Fast Trie Complexities BOTTOM(K) is O(log log M) All queries require only O(1) work in addition to BOTTOM(K), with the assistance of the DESCENDANT field and the doubly-linked list: BOTTOM(K) is either the leaf for K itself, or its DESCENDANT is PREDECESSOR(K)/SUCCESSOR(K) Space is O(N * log M): no more than h*N nodes in the trie (h = log M), and log M LSSs each using O(N) space
Y-Fast Trie Idea Apply a partitioning technique similar to the one that turned the P-Fast trie into the Q-Fast trie: c-partition all the keys into L subsets, each containing [c, 2c-1] keys Upper part: an X-Fast trie representing S' Lower part: a forest of binary trees of height O(log c)
Y-Fast Trie Complexities The upper part can be searched in O(log log M) time and occupies no more than O((N/c) * log M) space Each binary tree can be searched in O(log c), and together they occupy O(N) space Choosing c = log M: O(N) space; O(log log M) time
X/Y-Fast Trie Insertion/Deletion LSSs have practically uncontrolled time complexity for dynamic operations (at least at the time the article was published) Therefore, X/Y-Fast tries inherit this limitation